{"id":2156,"date":"2019-05-03T12:18:00","date_gmt":"2019-05-03T12:18:00","guid":{"rendered":"https:\/\/azoora.com\/blog\/?p=2156"},"modified":"2019-07-22T03:27:28","modified_gmt":"2019-07-22T03:27:28","slug":"converting-from-speech-to-text-with-javascript","status":"publish","type":"post","link":"https:\/\/azoora.com\/blog\/code\/converting-from-speech-to-text-with-javascript\/","title":{"rendered":"Converting from Speech to Text with JavaScript"},"content":{"rendered":"\n<p>In this tutorial we are going to experiment with the&nbsp;<a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/API\/Web_Speech_API\" target=\"_blank\" rel=\"noreferrer noopener\">Web Speech API<\/a>. It&#8217;s a very powerful browser interface that allows you to record human speech and convert it into text. We will also use it to do the opposite &#8211; reading out strings in a human-like voice.<\/p>\n\n\n\n<p>Let&#8217;s jump right in!<\/p>\n\n\n\n<h2>The App<\/h2>\n\n\n\n<p>To showcase the ability of the API we are going to build a simple voice-powered note app. 
It does 3 things:<\/p>\n\n\n\n<ul><li>Takes notes by using voice-to-text or traditional keyboard input.<\/li><li>Saves notes to localStorage.<\/li><li>Shows all notes and gives the option to listen to them via Speech Synthesis.<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img data-attachment-id=\"2157\" data-permalink=\"https:\/\/azoora.com\/blog\/code\/converting-from-speech-to-text-with-javascript\/attachment\/demo2\/#main\" data-orig-file=\"https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2.png\" data-orig-size=\"1536,698\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"demo2\" data-image-description=\"\" data-medium-file=\"https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-300x136.png\" data-large-file=\"https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-1024x465.png\" loading=\"lazy\" width=\"1024\" height=\"465\" src=\"https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-1024x465.png\" alt=\"\" class=\"wp-image-2157\" srcset=\"https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-1024x465.png 1024w, https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-300x136.png 300w, https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-768x349.png 768w, https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-720x327.png 720w, https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-580x264.png 580w, https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2-320x145.png 320w, https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/demo2.png 1536w\" 
sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption> <br>App for Taking Notes Using Voice Input.<br><\/figcaption><\/figure>\n\n\n\n<p>We won&#8217;t be using any fancy dependencies, just good old&nbsp;<a href=\"https:\/\/jquery.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">jQuery<\/a>&nbsp;for easier DOM operations and&nbsp;<a href=\"https:\/\/shoelace.style\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">Shoelace<\/a>&nbsp;for CSS styles. We are going to include them directly via CDN, no need to get NPM involved for such a tiny project.<\/p>\n\n\n\n<p>The HTML and CSS are pretty standard, so we are going to skip them and go straight to the JavaScript. To view the full source code, go to the&nbsp;<strong>Download<\/strong>&nbsp;button near the top of the page.<\/p>\n\n\n\n<h2>Speech to Text<\/h2>\n\n\n\n<p>The Web Speech API is actually separated into two totally independent interfaces. We have&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/API\/SpeechRecognition\" target=\"_blank\">Speech Recognition<\/a> for understanding human voice and turning it into text (Speech -&gt; Text) and&nbsp;<a rel=\"noreferrer noopener\" href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/API\/SpeechSynthesis\" target=\"_blank\">Speech Synthesis<\/a>&nbsp;for reading strings out loud in a computer-generated voice (Text -&gt; Speech). We&#8217;ll start with the former.<\/p>\n\n\n\n<p>The Speech Recognition API is surprisingly accurate for a free browser feature. It correctly recognized almost everything I said and knew which words go together to form phrases that make sense. It also allows you to dictate special characters like full stops, question marks, and new lines.<\/p>\n\n\n\n<p>The first thing we need to do is check if the user has access to the API and show an appropriate error message. 
Unfortunately, the speech-to-text API is supported only in Chrome and Firefox (behind a flag), so a lot of people will probably see that message.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">try {<br>   var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;<br>   var recognition = new SpeechRecognition();<br> }<br> catch(e) {<br>   console.error(e);<br>   $('.no-browser-support').show();<br>   $('.app').hide();<br> }<\/pre>\n\n\n\n<p>The&nbsp;<code>recognition<\/code>&nbsp;variable will give us access to all the API&#8217;s methods and properties. There are various options available, but we will only set&nbsp;<code>recognition.continuous<\/code>&nbsp;to&nbsp;<code>true<\/code>. This will enable users to speak with longer pauses between words and phrases.<\/p>\n\n\n\n<p>Before we can use the voice recognition, we also have to set up a couple of event handlers. Most of them simply listen for changes in the recognition status:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">recognition.onstart = function() {<br>   instructions.text('Voice recognition activated. Try speaking into the microphone.');<br> }<br> recognition.onspeechend = function() {<br>   instructions.text('You were quiet for a while so voice recognition turned itself off.');<br> }<br> recognition.onerror = function(event) {<br>   if(event.error == 'no-speech') {<br>     instructions.text('No speech was detected. Try again.');<br>   }<br> }<\/pre>\n\n\n\n<p>There is, however, a special&nbsp;<code>onresult<\/code>&nbsp;event that is crucial. 
It is executed every time the user speaks a word or several words in quick succession, giving us access to a text transcription of what was said.<\/p>\n\n\n\n<p>When we capture something with the&nbsp;<code>onresult<\/code>&nbsp;handler, we save it in a global variable and display it in a textarea:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">recognition.onresult = function(event) {<br>   \/\/ event is a SpeechRecognitionEvent object.<br>   \/\/ It holds all the lines we have captured so far.<br>   \/\/ We only need the current one.<br>   var current = event.resultIndex;<br><br>   \/\/ Get a transcript of what was said.<br>   var transcript = event.results[current][0].transcript;<br><br>   \/\/ Add the current transcript to the contents of our Note.<br>   noteContent += transcript;<br>   noteTextarea.val(noteContent);<br>}<\/pre>\n\n\n\n<p>The above code is slightly simplified. There is a very weird bug on Android devices that causes everything to be repeated twice. There is no official solution yet, but we managed to solve the problem without any obvious side effects. With that bug in mind, the code looks like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">var mobileRepeatBug = (current == 1 &amp;&amp; transcript == event.results[0][0].transcript);<br><br>if(!mobileRepeatBug) {<br>   noteContent += transcript;<br>   noteTextarea.val(noteContent);<br>}<\/pre>\n\n\n\n<p>Once we have everything set up, we can start using the browser&#8217;s voice recognition feature. To start it, simply call the&nbsp;<code>start()<\/code>&nbsp;method:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$('#start-record-btn').on('click', function(e) {<br>   recognition.start();<br>});<\/pre>\n\n\n\n<p>This will prompt users to give permission. If permission is granted, the device&#8217;s microphone will be activated.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote\"><p><em>Most APIs that require user permission don&#8217;t work on non-secure hosts. 
Make sure you are serving your Web Speech apps over HTTPS.<\/em><\/p><\/blockquote>\n\n\n\n<p>The browser will listen for a while and every recognized phrase or word will be transcribed. The API will stop listening automatically after a couple of seconds of silence or when manually stopped.<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">$('#pause-record-btn').on('click', function(e) {<br>   recognition.stop();<br>});<\/pre>\n\n\n\n<p>With this, the speech-to-text portion of our app is complete! Now, let&#8217;s do the opposite!<\/p>\n\n\n\n<h2>Text to Speech<\/h2>\n\n\n\n<p>Speech Synthesis is actually very easy. The API is accessible through the&nbsp;<a href=\"https:\/\/developer.mozilla.org\/en-US\/docs\/Web\/API\/SpeechSynthesis\" target=\"_blank\" rel=\"noreferrer noopener\">speechSynthesis<\/a>&nbsp;object and there are a couple of methods for playing, pausing, and other audio-related tasks. It also has a couple of cool options that change the pitch, rate, and even the voice of the reader.<\/p>\n\n\n\n<p>All we will actually need for our demo is the&nbsp;<code>speak()<\/code>&nbsp;method. 
It expects one argument, an instance of the beautifully named&nbsp;<code>SpeechSynthesisUtterance<\/code>&nbsp;class.<\/p>\n\n\n\n<p>Here is the entire code needed to read out a string:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">function readOutLoud(message) {<br>   var speech = new SpeechSynthesisUtterance();<br><br>   \/\/ Set the text and voice attributes.<br>   speech.text = message;<br>   speech.volume = 1;<br>   speech.rate = 1;<br>   speech.pitch = 1;<br>   window.speechSynthesis.speak(speech);<br>}<\/pre>\n\n\n\n<p>When this function is called, a robot voice will read out the given string, doing its best human impression.<\/p>\n\n\n\n<h2>Conclusion<\/h2>\n\n\n\n<p>In an era where voice assistants are more popular than ever, an API like this gives you a quick shortcut to building bots that understand and speak human language.<\/p>\n\n\n\n<p>Adding voice control to your apps can also be a great form of accessibility enhancement. Users with visual impairments can benefit from both speech-to-text and text-to-speech user interfaces.<\/p>\n\n\n\n<p>The speech synthesis and speech recognition APIs work pretty well and handle different languages and accents with ease. Sadly, they have limited browser support for now, which narrows their usage in production. 
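<\/p>\n\n\n\n<p>One way to soften that limitation is to feature-detect both halves of the API up front and only enable the voice controls when they exist. The helper below is a small sketch of our own, not part of the original app; the <code>detectSpeechSupport<\/code> name and the idea of passing the global object in as a parameter (so the check stays testable) are assumptions:<\/p>

```javascript
// Sketch: report which halves of the Web Speech API are available.
// Passing the global object in as a parameter keeps the check testable.
function detectSpeechSupport(root) {
  return {
    // Chrome exposes speech recognition behind the webkit prefix.
    recognition: Boolean(root.SpeechRecognition || root.webkitSpeechRecognition),
    // Speech synthesis is unprefixed where it exists.
    synthesis: Boolean(root.speechSynthesis)
  };
}
```

<p>In the app itself you would call <code>detectSpeechSupport(window)<\/code> once and, for example, hide the <code>#start-record-btn<\/code> and <code>#pause-record-btn<\/code> controls when <code>recognition<\/code> comes back false. 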
If you need a more reliable form of speech recognition, take a look at these third-party APIs:<\/p>\n\n\n\n<ul><li><a href=\"https:\/\/cloud.google.com\/speech\/\" target=\"_blank\" rel=\"noreferrer noopener\">Google Cloud Speech API<\/a><\/li><li><a href=\"https:\/\/azure.microsoft.com\/en-us\/services\/cognitive-services\/speech\/\" target=\"_blank\" rel=\"noreferrer noopener\">Bing Speech API<\/a><\/li><li><a href=\"https:\/\/cmusphinx.github.io\/\" target=\"_blank\" rel=\"noreferrer noopener\">CMUSphinx<\/a>&nbsp;and its JavaScript version&nbsp;<a href=\"https:\/\/syl22-00.github.io\/pocketsphinx.js\/\" target=\"_blank\" rel=\"noreferrer noopener\">Pocketsphinx<\/a>&nbsp;(both open-source).<\/li><li><a href=\"https:\/\/api.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">API.AI<\/a>&nbsp;&#8211; a free Google API powered by machine learning.<\/li><\/ul>\n","protected":false},"excerpt":{"rendered":"<p>In this tutorial we are going to experiment with the&nbsp;Web Speech API. It&#8217;s a very powerful browser interface that allows you to record human speech and convert it into text. We will also use it to do the opposite &#8211; reading out strings in a human-like voice. Let&#8217;s jump right in! 
The App To showcase [&hellip;]<\/p>\n","protected":false},"author":2,"featured_media":2158,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"spay_email":"","jetpack_publicize_message":"","jetpack_is_tweetstorm":false},"categories":[4,12,62],"tags":[25,87,110],"jetpack_featured_media_url":"https:\/\/azoora.com\/blog\/wp-content\/uploads\/2019\/04\/web-speech-api.png","jetpack_publicize_connections":[],"jetpack_shortlink":"https:\/\/wp.me\/p7FQPL-yM","jetpack-related-posts":[],"jetpack_sharing_enabled":true,"jetpack_likes_enabled":true,"_links":{"self":[{"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/posts\/2156"}],"collection":[{"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/comments?post=2156"}],"version-history":[{"count":2,"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/posts\/2156\/revisions"}],"predecessor-version":[{"id":2527,"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/posts\/2156\/revisions\/2527"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/media\/2158"}],"wp:attachment":[{"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/media?parent=2156"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/categories?post=2156"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/azoora.com\/blog\/wp-json\/wp\/v2\/tags?post=2156"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}