Title: A WWW Browser Using Speech Recognition And Its Evaluation

Authors: Kazuhiro Kondo (Member) and Charles T. Hemphill (Non-member)

Affiliation: Texas Instruments Media Technologies Laboratory

8330 LBJ Freeway, MS8374, Dallas, Texas 75243, USA

Abstract:

We developed the Japanese Speech-Aware Multimedia (JAM) system, which controls a World Wide Web (WWW) browser using speech. This system allows the user to browse a linked page by reading the anchor text within a Web page. The user can also control the browser itself using speech. The system integrates new vocabulary each time a new Web page is loaded by extracting the anchor text, converting this text to a phonetic string notation, creating a new speech recognition grammar, and integrating this grammar into the system dynamically. During this anchor text-to-phone conversion process, we employ numerous exception handling techniques to accommodate counters, dates, and many other phenomena.

Preliminary tests show that the conversion results contain the correct phone sequences over 97% of the time. We also allowed limited English speech recognition since a large percentage of the Japanese Web pages include some English text in anchors. User tests showed that the prototype correctly understands the input speech 91.5 % of the time, or 94.1% if we exclude user errors caused by unfamiliarity with the system including erroneous readings or speech detection errors.

Keywords: WWW, speech recognition, user interface, text-to-phone conversion

  1. INTRODUCTION

Traffic on the World Wide Web (WWW) has grown continuously at an enormous rate since its introduction. There are many reasons for this boom. Multimedia documents integrating text, still images, video, and audio can be easily published for worldwide distribution. Browsing these hyperlinked pages is very intuitive. Since its highly publicized introduction in the media, the Web's user base has grown from a very small group of researchers and engineers to a large non-technical population, many of whom had never used a computer or the Internet before. Speech may provide a more user-friendly interface for these naive users as well as for the physically challenged. It may also provide a more convenient interface for other applications such as hands-free operation and multimedia presentations. Accordingly, we first developed the Speech-Aware Multimedia (SAM) system [1], which enables the user to control a WWW browser using English speech.

SAM uses speaker-independent continuous speech recognition to enable continuous English speech input. Its main features are as follows:

SAM was first implemented on UNIX platforms using NCSA Mosaic as the browser. It has since been ported to Microsoft Windows 95 with Netscape Navigator as the browser.

After lagging behind the U.S. boom, WWW traffic in Japan has also grown dramatically, and many Japanese language pages can now be found. Major efforts to localize browsers for Japanese began at an early stage. Mosaic was the first to be localized to many languages, including Japanese. However, Netscape Navigator, which appeared shortly after Mosaic, became the first stable browser localized to the major languages, including Japanese. It has since been the browser of choice for Web surfing in Japanese.

With the introduction of the WWW, PC penetration in Japan, which had been significantly lower than in the U.S., increased dramatically. Because of this, the majority of users are still novices. In one survey, over half of the PC users accessing the Internet were fairly inexperienced users with less than 2 years of experience [2]. Speech input can potentially become an attractive alternative interface for these users, who may still be uncomfortable with the mouse and the keyboard.

There have been a number of efforts to apply speech to WWW browsers, but these are still limited since the Web is relatively new. We first describe two systems for English speech on the Apple Macintosh. Apple has bundled its proprietary speech recognition software, PlainTalk Speech Recognition, with the Macintosh OS from early on, and systems which use this software to add speech capabilities to the browser have been developed. Both systems described below are add-on software which function as an interface between PlainTalk and the browser.

The first system, ListenUp [3], extracts keyword-URL (Uniform Resource Locator; the address of documents and resources including text, audio, image and video) pairs from an existing table file and feeds the keywords to PlainTalk. From the keywords, PlainTalk creates a speech grammar and incorporates this grammar dynamically. When PlainTalk recognizes a keyword according to the grammar, the result is fed back to ListenUp. ListenUp then sends a command to the browser to visit the Web page with the corresponding URL. A separate keyword-URL table file must be prepared for each Web page in order to use ListenUp. The system is implemented as a Netscape Navigator plug-in.

The second system, SurfTalk [4] from Digital Dreams, is also implemented as a plug-in for the Netscape Navigator browser and uses PlainTalk. SurfTalk extracts anchor text from Web pages, and outputs the text to PlainTalk to enable speech recognition of the text dynamically. However, at the time of this writing, the software was in its early beta stage and was quite unstable.

On IBM-compatible PCs, IBM uses its own speech recognizer, VoiceType, along with a modified version of Netscape Navigator in a system called VoiceType Connection [5]. Similar to SAM, this system extracts anchor text from a Web page, constructs a speech grammar, and allows the user to visit a linked page by speaking the corresponding anchor text. However, users can only speak words already in the vocabulary, or train the system for new words as they appear in a speaker-dependent fashion. It also allows speech input into forms with its 22,000-word dictation system. At the time of this writing, the system was in beta and downloadable from IBM's Web site.

There are fewer applications of Japanese speech recognition to Web browsers compared to English. The University of Tokyo uses NCSA Mosaic and commercially available Japanese speech recognition software to enable surfing the WWW with Japanese speech [6]. The system also uses a visual software agent to give the user visual feedback through facial animation. Only the words in the anchor text which are in the dictionary of the speech recognizer can be recognized; words not in the dictionary are not recognized. The system also compiles a table in which each link is assigned an index number, and the user can speak the index number to visit the corresponding linked page. However, unrestricted digit recognition is still somewhat unreliable since digits are short, often pronounced sloppily, and context cannot be used. It also seems unintuitive and inconvenient to look up the index number each time the user wants to visit a new Web page.

We developed the Japanese Speech-Aware Multimedia (JAM) system, which allows the user to read the anchor text in an arbitrary page to visit the corresponding link. In situations where simply reading the anchor text is too restrictive, the system also allows navigation with arbitrary speech according to a speech grammar referenced by an embedded pointer in the header of the page. The system also allows browser commands to be spoken, as well as bookmarks with arbitrary spoken keywords. The following chapter summarizes the issues in applying speech recognition to navigate Web pages. Chapter 3 describes the system architecture. Chapter 4 gives the results of a user evaluation test of the system. Finally, chapter 5 states the conclusions and suggestions for further work.

  2. Speech Recognition of Web Pages

This chapter summarizes issues when creating speech recognition grammars from WWW pages. WWW pages are written using the hypertext markup language (HTML). A simple example of a page in HTML is as follows:

<html>

<head><TITLE>HTML example</TITLE></head>

<body>

<A HREF="http://www.afirm.co.jp/link1.html">Link 1</A>

<A HREF="http://www.afirm.co.jp/link2.html">Link 2</A>

</body>

</html>

As shown in the example, the linked URL is given in double quotes following the tag A HREF. The anchor text is written between the two tags <A HREF=..> and </A>. Thus, the system merely needs to extract the text between these tags and convert it to a speech grammar. Neither the language nor the character code of the text is restricted, so Japanese text can also be used.
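As an illustration only (not the original JAM implementation), the extraction of anchor text-URL pairs described above could be sketched as follows; the regular expression and function names are our own, and HREF values are assumed to be double-quoted.

# Illustrative sketch of extracting anchor text / URL pairs; not the
# original JAM code. Assumes HREF values are enclosed in double quotes.
import re

ANCHOR_RE = re.compile(r'<a\s+href="(?P<url>[^"]+)"[^>]*>(?P<text>.*?)</a>',
                       re.IGNORECASE | re.DOTALL)

def extract_anchors(html):
    """Return (anchor_text, url) pairs found in an HTML page."""
    return [(m.group("text").strip(), m.group("url"))
            for m in ANCHOR_RE.finditer(html)]

page = '<A HREF="http://www.afirm.co.jp/link1.html">Link 1</A>'
print(extract_anchors(page))  # [('Link 1', 'http://www.afirm.co.jp/link1.html')]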

However, there are three major character encoding methods currently in use for Japanese: Extended Unix Code (EUC), Japanese Industrial Standard (JIS) code, and Shift JIS code [7]. In our preliminary tests, Japanese Web pages seemed to be roughly evenly divided between these codes, with 30% EUC, 28% JIS code (both new and old JIS), and 42% Shift JIS code. There were some pages with mixed encodings, probably because the author of the page created parts of it by cutting and pasting from multiple pages with different encodings. There is also a variety of character sets used in Japanese pages: Zenkaku characters (Kanji, Hiragana, Katakana, alphanumeric characters and symbols occupying double the ASCII character width), Hankaku (half-width, equivalent to one ASCII character width) Katakana characters, ASCII characters, and JIS-Roman characters. It is necessary to convert these various standards and formats into one uniform notation for internal processing.
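As a rough sketch of this normalization step (our own, simplified version, not the system's actual detection heuristics), the three encodings could be identified by trial decoding and converted to one internal encoding; the encoding names below are Python codec names.

# Simplified sketch of detecting one of the three Japanese encodings by
# trial decoding and normalizing to EUC-JP; not the original JAM code.
CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")  # JIS, EUC, Shift JIS

def to_euc(raw_bytes):
    for enc in CANDIDATES:
        try:
            return raw_bytes.decode(enc).encode("euc_jp")
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue
    return raw_bytes  # unknown or mixed encodings are left untouched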

After extracting anchor text from a page and converting this to a uniform notation, it is necessary to segment this text into appropriate units. This is required for the following:

In the next phase, the anchor texts need to be converted into phonetic strings. Here, we face the same classic problems encountered in text-to-speech systems, namely:

Non-text anchors also need to be processed. These are mostly images which are used as anchors instead of text. The images serve as clickable buttons for links, and are commonly called image maps. Figure 1 shows an example. In this example, the boxes with the strings company info, products/services, etc. serve as buttons. These do not explicitly use text as anchors and so cannot be used directly to create speakable links. However, HTML provides a tag to define substitute text for the images, primarily for users with low-speed modems. These users commonly disable auto-loading of images in their browsers. To hint at the content of the images which these users cannot see, the browser displays the text defined by the HTML attribute ALT in place of the images. The following is the HTML code for the page shown in Figure 1 which uses this attribute.

<A HREF="/corp/docs/companyinfo.html">

<IMG SRC="/corp/graphics/cont1w.gif" ALT="Company Info"></A>

<A HREF="/corp/docs/prodserv.html">

<IMG SRC="/corp/graphics/cont2w.gif" ALT="Products/Services"></A>

As shown in this example, substitution text can be defined with the attribute ALT="..." when images are used as anchors. This text can be used to create a speech grammar for the link, thereby enabling speech navigation to it. In this example, the substitution text describes the content of the image appropriately, so the user can use speech to visit the link. However, not all substitution text is as comprehensive, so speech input which the user expects to select a link may not actually work. Further study is needed to solve this problem.
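For illustration, the fallback to ALT text for image anchors could be sketched as follows (our own sketch; attribute quoting is assumed, and the function names are hypothetical).

# Sketch (ours) of falling back to ALT text when the anchor body is an
# image, so the link still gets speakable text.
import re

IMG_ALT_RE = re.compile(r'<img\b[^>]*\balt="(?P<alt>[^"]*)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def speakable_text(anchor_body):
    m = IMG_ALT_RE.search(anchor_body)
    if m:                               # image anchor: use its ALT text
        return m.group("alt").strip() or None
    text = TAG_RE.sub("", anchor_body).strip()
    return text or None                 # plain anchor, or None if unusable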

Another item in a page which is difficult to handle with speech is form input. Forms provide a mechanism for the user to send data to the server. For example, the user can fill in text in a form and click the submit button to send the data to the server for a database query. Arbitrary text can be input using forms, but in reality, the variety of data that will actually be input on a particular page is limited by the context. Therefore, we can prepare a speech recognition grammar which defines all sentences which the user may plausibly input, and reference this grammar from the Web page. When the browser fetches the page, the grammar is downloaded simultaneously, and the corresponding spoken sentences can then be accepted. We will call this mechanism smart pages. Smart pages can also enable handling of image maps with speech. For example, for an image map where the user clicks a map of Japan on the prefecture of their choice in order to access information about that prefecture, a speech grammar with all prefecture names can be prepared and used in a smart page. More details about smart pages can be found in the following chapter.

  3. System Architecture
    1. Speakable links, speakable bookmarks and speakable commands

      A prototype system of JAM was developed on a UNIX platform. We used Netscape Navigator as the browser. Figure 2 shows the configuration of the system.

      The browser downloads a Web page written in HTML from the network. This HTML code is used to render the page in the browser, and is also output to JAM for analysis. JAM first detects the character encoding of the page and converts the page to EUC if necessary. We decided to use EUC for all internal processing since Japanese characters in EUC are easily distinguishable from ASCII text. The converted code is analyzed, and anchor text-URL pairs are extracted. If an anchor points to an image file, the system looks for an ALT attribute and, if there is one, the text inside it is used as the anchor text for that link. Hankaku Katakana characters and ASCII alphanumerics and symbols are converted to the corresponding Zenkaku characters, and all alphabetic characters are converted to upper case.

      The anchor texts are then segmented into bunsetsus using a dictionary. We used a freely available system, Kakasi [9], for the dictionary look-up. Kakasi matches the input to the dictionary entries in a single pass. It is very simple but very fast, and is well suited for real-time applications. The original purpose of Kakasi was to look up readings of Kanji character strings, and for this purpose it comes with a dictionary of 121,824 entries. We added our own entries from a dictionary we had developed earlier [10] to bring the total to 357,203 entries. This dictionary was also developed for the same purpose as Kakasi; however, it has most of its entries in bunsetsu units, and it exhaustively lists most conjugations and variations. The system matches the longest entry in the combined dictionary as a bunsetsu candidate. In the worst case, when no appropriate bunsetsu is found in the dictionary, word entries or single-character entries will always be found, so the system falls back to a shorter, over-segmented output. It would be possible to use a full-scale morphological analyzer such as the JUMAN system [11] to segment the input into bunsetsus more accurately, but we concluded that this level of accuracy is not required for our purpose and is too computationally expensive.
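The longest-match look-up can be pictured with the following toy sketch (ours; the dictionary contents are invented, whereas the real system uses the combined Kakasi dictionary described above).

# Toy greedy longest-match segmenter in the spirit of the Kakasi-based
# look-up described above; the dictionary contents here are invented.
def segment(text, dictionary):
    """Left-to-right longest match, falling back to single characters."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest span first
            if text[i:j] in dictionary:
                out.append(text[i:j])
                i = j
                break
        else:                                  # no entry found: one character
            out.append(text[i])
            i += 1
    return out

dic = {"日本語", "リンク", "の", "例"}
print(segment("日本語リンクの例", dic))  # ['日本語', 'リンク', 'の', '例']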

      The segmented anchor text is then converted to a sentence-level speech recognition grammar. An example of this grammar is shown below.

      start(jam_link_command_).

      jam_link_command_ ---> 日本語 リンク __.

      start(日本語 リンク __).

      日本語 リンク __ ---> 日本語_, Z_1.

      日本語 リンク __ ---> non_speech_, 日本語 リンク __.

      Z_1 ---> リンク_, Z_2.

      Z_1 ---> non_speech, Z_1.

      Z_2 ---> _, Z_3.

      Z_2 ---> non_speech, Z_2.

      Z_3 ---> _, Z_4.

      Z_3 ---> “”.

      Z_3 ---> non_speech, Z_3.

      Z_4 ---> “”.

      Z_4 ---> non_speech, Z_4.

      In this example, the anchor text is segmented into four bunsetsus, and optional pause models (non_speech) are inserted between bunsetsus. After transition through three bunsetsus, the grammar allows a transition to the exit node, shown by an empty double quote pair (""). In other words, speech input is accepted by this grammar if at least the first three bunsetsus are spoken. Also, all symbols are made optional in the grammar.
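A sentence-level grammar of this shape could, for illustration, be generated from the segmented anchor roughly as follows (our own sketch; the four-bunsetsu anchor used in the example call is hypothetical, and the exact output format differs slightly from the listing above).

# Sketch (ours) of generating a sentence-level grammar with optional
# non_speech pauses and an early exit after min_spoken bunsetsus.
def link_grammar(bunsetsus, min_spoken=3):
    name = "_".join(bunsetsus) + "_"
    symbols = [name] + ["Z_%d" % i for i in range(1, len(bunsetsus) + 1)]
    lines = ["start(%s)." % name]
    for i, b in enumerate(bunsetsus):
        cur, nxt = symbols[i], symbols[i + 1]
        lines.append("%s ---> %s_, %s." % (cur, b, nxt))
        lines.append("%s ---> non_speech, %s." % (cur, cur))   # optional pause
        if i + 1 >= min_spoken:
            lines.append('%s ---> "".' % nxt)                  # early exit
    lines.append("%s ---> non_speech, %s." % (symbols[-1], symbols[-1]))
    return "\n".join(lines)

# Hypothetical four-bunsetsu anchor: 日本語 / リンク / の / 例
print(link_grammar(["日本語", "リンク", "の", "例"]))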

      Speakable bookmarks, speakable commands, and smart pages, which we will explain in detail later, are all written using similar grammars. For example, the grammar for the speech command which moves one page forward can be written as:

      start(jam_page_forward_command).

      jam_page_forward_command ---> ( 先に | 次に ) [ 進む ].

      Here, ( | ) denotes alternatives, and [ ] denotes optional elements. The user can assign arbitrary sentences or phrases to commands. However, commands are loaded once during start-up, and thus cannot be modified until the system is restarted. An example speakable bookmark grammar is as follows:

      start(テキサス インスツルメンツ).

      url(http://www.ti.com).

      テキサス インスツルメンツ ---> ( テキサス インスツルメンツ | TI ).

      As shown in this example, the corresponding URL is also coded in the speakable bookmark grammar. New bookmarks can be added while browsing; by default, the page title becomes the keyword for the page, but this keyword can be reassigned arbitrarily by the user. By reloading the speakable bookmarks, new bookmarks can be used immediately.
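A speakable bookmark file of this form can be read into a keyword-to-URL table, for example as follows (our sketch; the start()/url() line format follows the listing above).

# Sketch (ours) of reading the start()/url() lines of a speakable
# bookmark grammar into a (keyword, URL) pair.
import re

def load_bookmark(grammar_text):
    keyword = re.search(r"start\(([^)]+)\)", grammar_text).group(1)
    url = re.search(r"url\(([^)]+)\)", grammar_text).group(1)
    return keyword, url

grm = "start(テキサス インスツルメンツ).\nurl(http://www.ti.com)."
print(load_bookmark(grm))  # ('テキサス インスツルメンツ', 'http://www.ti.com')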

      The bunsetsu nodes are expanded into sub-level grammars which define the phonetic strings representing their pronunciations. Kakasi is again used in this process. Using the example cited before,

      start(日本語_).

      日本語_ ---> NIHONXGO_, Z_1_.

      日本語_ ---> NIPPONXGO_, Z_1_.

      日本語_ ---> non_speech_, 日本語_.

      Z_1_ ---> “”.

      Z_1_ ---> non_speech_, Z_1_.

      start(NIHONXGO_).

      NIHONXGO_ ---> n, i, h, o, N, (ng|g), o.

      This example allows two alternate pronunciations, nihongo and nippongo. The second grammar above defines the phonetic string for one of the pronunciations, nihongo.

      Exception handling for digits and counters, which we mentioned before, is also accommodated. For digit strings, grammars which allow both discrete digits and natural numbers are generated. For example,

      start(1996_).

      1996_ ---> (SENX_ KYUUHAKU_ KYUUJYUU_ ROKU_ | ICHI_ KYUU_ KYUU_ ROKU_).

      start(1_).

      1_ ---> (ICHINICHI_|TSUITACHI_).
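For illustration, the two alternative readings for a year such as 1996 could be generated roughly as follows (a simplified sketch of our own; the romanized labels are not the system's exact symbols, and irregular Japanese readings such as those for 300 or 8000 are ignored).

# Simplified sketch (ours) of generating a natural-number reading and a
# digit-by-digit reading; irregular Japanese readings are not handled.
DIGITS = ["ZERO", "ICHI", "NI", "SANX", "YONX", "GO", "ROKU", "NANA", "HACHI", "KYUU"]
UNITS = ["", "JYUU", "HYAKU", "SENX"]        # 1, 10, 100, 1000

def natural_reading(n):                      # valid for 1..9999
    parts = []
    for power in (3, 2, 1, 0):
        d = (n // 10 ** power) % 10
        if d == 0:
            continue
        if not (d == 1 and power > 0):       # "sen", not "ichi sen"
            parts.append(DIGITS[d])
        if UNITS[power]:
            parts.append(UNITS[power])
    return " ".join(parts)

def digit_reading(n):
    return " ".join(DIGITS[int(c)] for c in str(n))

print(natural_reading(1996))  # SENX KYUU HYAKU KYUU JYUU ROKU
print(digit_reading(1996))    # ICHI KYUU KYUU ROKU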

      English anchor text is expanded into Hiragana. English words not in the dictionary are expanded into their spelling pronunciations, which are further expanded into Hiragana. Symbols are also expanded into Hiragana when appropriate. For example,

      start(Apple_).

      Apple_ ---> APPURU_.

      start(A T & T_).

      A T & T_ ---> EI_ TII_ (ANXDO_|ANXPAASANXDO_) TII_.
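A spelling-pronunciation fallback of this kind could be sketched as follows (our own, with abridged letter and symbol tables; the real system outputs phone strings rather than these romanized labels).

# Abridged sketch (ours) of spelling out unknown English anchors
# letter by letter; only a few letter/symbol readings are listed.
LETTERS = {"A": "EI", "B": "BII", "C": "SHII", "T": "TII"}
SYMBOLS = {"&": ["ANXDO", "ANXPAASANXDO"]}

def spell_out(text):
    readings = []
    for ch in text.upper():
        if ch in LETTERS:
            readings.append(LETTERS[ch])
        elif ch in SYMBOLS:
            readings.append("(" + "|".join(SYMBOLS[ch]) + ")")
        # whitespace and unlisted characters are simply skipped
    return " ".join(readings)

print(spell_out("AT&T"))  # EI TII (ANXDO|ANXPAASANXDO) TII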

      We ran a test to estimate the accuracy of the text-to-phone conversion. We extracted 117 anchor texts from major Japanese newspapers on the Web, namely Asahi [12], Yomiuri [13], and Nikkei [14]. The extracted texts were converted to phonetic notation using the modified Kakasi system, and the conversion accuracy was measured. The vocabulary covered numerous domains, including politics, economy, science and technology, and sports. Since Kakasi outputs multiple candidates, we chose the candidate closest to the correct conversion. We measured the conversion accuracy according to the following formula:

      Phonetic conversion accuracy [%] = (number of correctly converted phones) / (total number of phones) × 100

      Since the dictionary packaged with the original Kakasi distribution has wide coverage, perhaps with the exception of proper nouns such as names, the conversion errors were mostly bunsetsu segmentation errors and exceptions including English terms, digits, and symbols. The phonetic conversion accuracy with the original dictionary was 92.4%. With the expanded dictionary and with exception handling, the accuracy increased to 97.2%. The improvement in the percentage of completely correct sentences was more dramatic: 57% and 82% for these two cases, respectively. Since the anchor text in this application is relatively long, a few errors in the phone conversion may not have a significant impact on recognition accuracy. Thus, we believe this level of accuracy is adequate for our application.

      Sentence-level grammars and the pronunciation grammars defining the phonetic strings for each bunsetsu are output to the speech recognizer. The recognition results are interpreted and converted to browser commands. When results corresponding to speakable commands are obtained, appropriate commands are issued to the browser. For example, when Sakini Susumu (go forward) is recognized, the command page forward is issued. When results corresponding to speakable bookmarks or speakable links are recognized, the goto URL command for the corresponding URL is issued.
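The interpretation step can be pictured as a simple dispatch, for example as in the sketch below (ours; the command names and table layout are assumptions for illustration, not Netscape's actual remote-control interface).

# Sketch (ours) of turning a recognition result into a browser action
# using the tables built while analyzing the page and the bookmarks.
def dispatch(result, command_table, bookmark_urls, link_urls):
    if result in command_table:            # speakable command, e.g. 先に進む
        return (command_table[result], None)
    if result in bookmark_urls:            # speakable bookmark
        return ("goto_url", bookmark_urls[result])
    if result in link_urls:                # speakable link on the current page
        return ("goto_url", link_urls[result])
    return ("ignore", None)

commands = {"先に進む": "page_forward"}
print(dispatch("先に進む", commands, {}, {}))  # ('page_forward', None)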

      Figure 3 shows the console of the system. The console has a simple level meter on the left, which gives speech level feedback to the user. There is also a start button which additionally functions as a status indicator, showing whether the system is accepting input or processing. Changes in the input status are also signaled with beeps.

    2. Smart Pages

      With speakable commands and speakable links, users can surf most of the Web using speech. However, there are components on a Web page which are difficult to control with speech, e.g., forms and image maps. Accordingly, we propose a mechanism which we call smart pages.

      In smart pages, speech recognition grammars are used to define allowed speech input in addition to conventional links. A pointer to the speech recognition grammar is embedded in the header portion of the HTML code as follows:

      <HEAD>

      <LINK REL="X-GRAMMAR" HREF="smart_page.grm">

      </HEAD>

      The <LINK> tag conforms to the HTML standard and is used to designate the relationship of a linked resource to the current page. The REL attribute defines that relationship; in the above example, it designates the linked resource as a grammar file. The format of the grammar is similar to the grammar which defines speakable commands. The following example shows a smart page grammar which retrieves weather information for major Japanese cities:

      start(weather).

      weather ---> City_ (現在の (天気|気温|湿度) 予報)

      [ (は|を) 見せて [下さい] ].

      City_ ---> (東京|大阪|京都|名古屋|福岡|神戸|横浜|札幌|.....).

      Speech recognition results using this grammar are passed on to a CGI (Common Gateway Interface) script as arguments to be used as keys for the weather database query. For example, for the recognition result "forecast for Yokohama" (横浜 の 予報), the following URL is accessed:

      http://www.tenki.co.jp/cgi-bin/tenki?横浜+の+予報

      The script returns a page with the forecast for Yokohama.

      As shown in this example, much more flexible speech input is possible using smart pages. Image maps can also be navigated using speech to some extent if an appropriate smart page grammar is used. The additions for smart pages do not affect conventional WWW browsers in any way.
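As an illustration of the query construction above, the recognized words could be joined into the CGI argument string as follows (our sketch; the example in the text shows the raw Japanese characters, while an actual request would carry them in some byte encoding, which is an assumption here).

# Sketch (ours) of building the weather query URL from a recognition
# result such as 横浜 の 予報; percent-encoding of the bytes is assumed.
from urllib.parse import quote

def weather_query(recognized_words):
    args = "+".join(quote(w) for w in recognized_words)
    return "http://www.tenki.co.jp/cgi-bin/tenki?" + args

print(weather_query(["横浜", "の", "予報"]))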

    3. Speech Recognition

    We use a continuous-density HMM speech recognizer with single multivariate Gaussian mixtures. Context-dependent phonetic models are used. The phonetic contexts, defined using phonological features, were clustered using a binary decision tree [15]. Speech segments are identified from the input speech using an energy-based speech detector.

    The grammar used in the system is defined as a set of regular grammars. Each start symbol of a regular grammar is connected to a terminal symbol in a higher-level regular grammar. The system allows a subset of the start symbols to start the search, and thus allows dynamic adjustment of the grammar. We can also add or delete regular grammars easily, which enables us to add or delete the sentences required for each page as needed.
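The per-page bookkeeping described here might look like the following (our own sketch; the data structures and method names are illustrative only).

# Sketch (ours) of a dynamic grammar set: page-specific link grammars
# are swapped in and out, and only active start symbols begin a search.
class GrammarSet:
    def __init__(self):
        self.grammars = {}   # start symbol -> grammar text
        self.active = set()  # start symbols allowed to begin the search

    def add(self, start_symbol, grammar_text):
        self.grammars[start_symbol] = grammar_text
        self.active.add(start_symbol)

    def remove(self, start_symbol):
        self.grammars.pop(start_symbol, None)
        self.active.discard(start_symbol)

    def load_page(self, page_grammars, previous_page_symbols):
        """Drop the previous page's link grammars, then add the new page's."""
        for sym in previous_page_symbols:
            self.remove(sym)
        for sym, text in page_grammars.items():
            self.add(sym, text)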

  4. Evaluation Experiments

    We conducted a simple evaluation test to measure the performance of the system and to identify problems. We conducted two types of tests. In the first, in order to measure the performance of speakable links and speakable commands, we asked the subjects to surf the WWW freely. In the second, we compared the efficiency of smart pages with that of regular pages. In both tests, there were 12 subjects, 7 female and 5 male. Six of the subjects reported that they use a computer regularly and surf the Web. Two use computers a few times a week, while the remaining four rarely or never use them.

    A Sun Sparc 20 workstation was used in all tests, with a unidirectional dynamic microphone. The users were asked to operate the on-off switch on the microphone for each utterance, but we also applied simple speech detection. The internal A/D converter of the Sparc 20 was used. All tests were conducted in a closed office. The room was relatively quiet, but the air conditioning and the fan of the Sparc 20 were quite audible.

    1. Speakable Links And Speakable Command Evaluations

In the first experiment, we evaluated the performance of speakable links and speakable commands. The subjects were asked to freely surf the Web within the allotted time. Task completion ratio was evaluated on all valid utterances. Speakable bookmarks were not used in the tests. Speakable commands were limited to the following six commands:

We prepared a home page with links to the following items, which we thought would be of general interest.

Starting from the above page, the users were asked to surf Japanese pages. Only pages which could be visited through speakable links were considered; the mouse and keyboard were not used. Users were given approximately 30 minutes to surf freely. All speech input, recognition results, and the browser status were recorded. After the evaluation sessions, the logs were used to measure the task completion ratio. There was a total of 1174 utterances across all sessions, for an average of about 98 utterances per subject. Table 1 summarizes the results.

 

Table 1. Speakable links and commands evaluation test results

User Classification |        Task Completion [%]
                    | Commands | Links | All Utterances
--------------------+----------+-------+---------------
Male                |   97.44  | 78.76 |     88.62
Female              |   99.01  | 86.84 |     93.61
All                 |   98.37  | 83.13 |     91.48

In this table, all utterances which resulted in the intended action were rated as correct, even if the recognition result included substitutions or deletions. The overall task completion ratio was 91.5%. The completion ratio for speakable commands, which make up 58% of all utterances, was 98.4%. Female subjects scored somewhat higher in all categories, especially for speakable links, probably because most female subjects spoke more carefully. There was no significant difference in the completion ratio with respect to daily computer usage. Speakable links showed significantly lower completion ratios than speakable commands. Table 2 shows the breakdown of the causes of these speakable link errors.

Table 2. Breakdown of speakable link recognition errors

Error Types        | Percentage of All Errors [%]
                   |  Male | Female | All Users
-------------------+-------+--------+----------
Speech Detection   | 18.07 |  10.84 |   28.92
Speech Recognition | 13.25 |  10.84 |   24.10
Insufficient Entry |  3.61 |  15.66 |   19.28
Out of Vocabulary  | 13.25 |   4.84 |   18.07
Others             |  9.64 |   0.0  |    9.64

The most frequent cause of errors was speech detection errors, followed by speech recognition errors, insufficient speech input, and out-of-vocabulary input. Insufficient speech input refers to errors caused by utterances in which the user did not speak the required three bunsetsus. Out-of-vocabulary input includes misread words, utterances read from the middle of the anchor (which is not allowed), and attempts to read from image maps without ALT text. The Others category includes pages which could not be parsed due to HTML coding errors. There were significantly more speech detection errors with male speakers, which we believe was caused by subjects not holding the microphone in a stable manner or at an appropriate distance. A close-talking head-mounted microphone may have helped here. There were also male subjects with extremely long inter-word pauses, which caused the speech detector to end speech input prematurely. To deal with these problems, we need to improve the detection accuracy through optimization or adaptation of the detection level and the introduction of other features. Insufficient entries and out-of-vocabulary utterances also constitute a large portion of the errors. However, we believe these errors will decrease to a negligible level as users become familiar with the system. The task completion ratio without these two error categories comes to 94.1%.

    2. Comparison of Smart Pages and Regular Pages

To evaluate the efficiency improvements brought by smart pages, we prepared two pages with essentially the same content, one with a smart page grammar and one without. The selected task was a database query of Sumo wrestler (Rikishi) profiles. We used the profiles published by the Sumo Organization [16], and provided links to these profiles based on the recognition results.

In the page with only speakable links, we provided the full name of each Rikishi as the anchor text for the corresponding link. On the other hand, to take full advantage of the flexibility allowed by smart page grammars, rank, name, or a combination of both were allowed in the page with a smart page grammar. The initial portion of the grammar is shown below.

start(rikishi).

rikishi --->

[ Higashi | Higashikata ] [ Yokozuna ] Takanohana [ Kouji ] |

( Higashi | Higashikata ) Yokozuna |

[ Nishi | Nishikata ] [ Yokozuna ] Akebono [ Taroo ] |

( Nishi | Nishikata ) Yokozuna |

[ Higashi | Higashikata ] [ Oozeki ] Wakanohana [ Masaru ] |

( Higashi | Higashikata ) Oozeki |

(Abridged)

Here, Higashi, Higashikata, Nishi, and Nishikata are the sides on which the Rikishis compete (east and west). Yokozuna and Oozeki are their ranks, and the rest are their names.

Since the smart page grammar remains active after navigating to the linked page, it is possible to directly search for the next profile. With regular pages, however, the user needs to move back to the original page to look up the next profile.

To compare the efficiency of these methods, the users were asked to look up the stables, favorite techniques, origins, weights, and real names of 10 random Rikishis. The Rikishis were specified using either rank, name, or a combination of name and rank. The number of trials necessary to look up the 10 Rikishis, as well as the task completion ratio, was measured. Table 3 shows the results.

Table 3. Efficiency comparison test result between speakable links and smart pages.

User Classification |           Smart Pages              |         Speakable Links
                    | Average Trials | Task Completion [%] | Average Trials | Task Completion [%]
--------------------+----------------+---------------------+----------------+--------------------
Male                |      12.80     |        89.06        |      32.40     |        93.21
Female              |      13.86     |        83.51        |      32.00     |        89.73
All                 |      13.42     |        85.71        |      32.16     |        91.19

Smart pages required an average of 13.4 trials to look up the 10 items, in other words 3.4 additional trials. With speakable links, 32.2 trials were required on average, i.e., 22.2 additional trials; even after accounting for the one extra trial per item needed to move back to the original page, more than one further trial per item was necessary. The task completion ratio was somewhat higher with speakable links. This is because the perplexity of the smart page grammar is higher due to its added flexibility, and because the grammar includes digits in the ranks (e.g., Maegashira 3 Maime), which are generally more difficult to recognize. More accurate models should compensate for this difference in the future.

Admittedly, this task clearly favors smart pages, which showed a significant advantage in the number of trials necessary to complete the task. However, we believe that the advantages of smart pages will hold even in practical tasks, although perhaps not as dramatically. Smart pages also allow speech control of components that are not possible with regular speakable links, e.g., form input and image maps.

  5. Conclusion

We have built a prototype system, the Japanese Speech-Aware Multimedia (JAM) system, which uses speaker-independent continuous speech recognition to allow Japanese speakers to surf Web pages. Its major features are as follows:


These features were basically a Japanese localization of features implemented in the English system, SAM. However, the following features were unique to the localized JAM:

In evaluation tests where the users were asked to surf the Web using speech, a task completion ratio of over 91% was observed. If user errors caused by unfamiliarity with the system are excluded, the completion ratio becomes 94%. In a comparison test between smart pages and regular speakable links, we confirmed the superior efficiency of smart pages in our benchmark task.

This system can handle most Web pages. However, since new Web technology is constantly being added and existing technology constantly modified, numerous issues remain to be solved.

Acknowledgment

The authors thank Dr. P. K. Rajasekaran, Dr. V. Viswanathan, and the members of the Speech Research group for their input. They also thank Japanese expatriates in Texas who participated in the tests.

References

  1. C. Hemphill, P. Thrift and J. Linn, Speech-Aware Multimedia, IEEE Multimedia, vol. 3, no. 1, pp. 74-78, Spring 1996.
  2. Y. Yamazaki, Men for Survey, Women for Communication: Usage Analysis of PC Communication, Nikkei Electronics, no. 665, pp. 129-138, July 1996 (in Japanese).
  3. http://snow.cit.cornell.edu/noon/ListenUp.html.
  4. http://www.surftalk.com.
  5. http://www.software.ibm.com/is/voicetype/vtconn/vtconn.html.
  6. H. Dohi, M. Ishizuka, A Visual Software Agent Connected to WWW/Mosaic, Trans. IEICE, vol. J79-D-II, no. 4, April 1996 (in Japanese).
  7. Ken Lunde, Understanding Japanese Information Processing, pp. 59-99, O'Reilly & Associates, Sebastopol, California, 1993.
  8. K. Hakoda, H. Sato, A Pause Insertion Rule for Connected Speech, Technical Report of the speech research group of the ASJ, S74-64, pp. 1-7, March 1975 (in Japanese).
  9. H. Takahashi, Kakasi: Kanji Kana Simple Inverter, version 2.2.5, June 1994. Available from ftp.uwtc.washington.edu.
  10. J. Picone, T. Staples, K.Kondo and N. Arai, Kanji to Hiragana Conversion Based on a Length-Constrained N-Gram Analysis, to be published in the IEEE Transactions on Speech and Audio Processing.
  11. Y. Matsumoto, S. Kurohashi, T. Utsuro, Y. Myoogi, M. Nagao, Japanese Morphological Analysis System JUMAN Manual, version 2.0, July 1994. Available from ftp://ftp.aist-nara.ac.jp/pub/nlp/tools/juman.
  12. http://www.asahi.com.
  13. http://www.mainichi.co.jp.
  14. http://www.nikkei.co.jp.
  15. Y. H. Kao, C. T. Hemphill, B. J. Wheatley and P. K. Rajasekaran, Toward Vocabulary Independent Telephone Speech Recognition, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-117 - I-120, April 1994.
  16. http://www.wnn.or.jp/wnn-t/database/rikishidata.

 

Biography

Kazuhiro Kondo (Member, IEICEJ) received the B. E., M. E. and the Ph. D. from Waseda University in 1982, 1984 and 1998, respectively. From 1984, he worked at Central Research Laboratory, Hitachi Ltd., Tokyo, Japan, where he was engaged in R & D on speech signal processing systems and video coding systems. In 1992, he joined Texas Instruments Tsukuba R & D Center, Tsukuba, Japan, and was transferred to Texas Instruments Inc. Media Technologies Laboratory, Dallas, TX in 1996, where he is currently a Member of Technical Staff. He is currently engaged in R & D in speech recognition systems.

Dr. Kondo is a member of the Acoustical Society of Japan, and the IEEE.

Charles Hemphill received the B. S. in mathematics from the University of Arizona in 1981, and M. S. in computer science from the Southern Methodist University in 1985. He is currently working towards his Ph. D. in CS at the University of Texas at Dallas. He joined Texas Instruments in 1982, where he is currently a Senior Member of Technical Staff. His research interests include grammar representation and spoken language understanding. He was the principal investigator for the definition and collection of the DARPA ATIS pilot corpus.

Mr. Hemphill is a member of the ACM and ACL.

 

List of Tables

Table 1. Speakable links and commands evaluation test results

Table 2. Breakdown of speakable link recognition errors

Table 3. Efficiency comparison test result between speakable links and smart pages.

 

List of Figures

Fig. 1 Example of an image map

Fig. 2 JAM system configuration

Fig. 3 JAM console

 

 

Fig. 1 Example of an image map

 

Fig. 2 JAM system configuration

 

 

Fig. 3 JAM console