Title: A WWW Browser Using Speech Recognition And Its Evaluation

Authors: Kazuhiro Kondo (Member) and Charles T. Hemphill (Non-member)

Affiliation: Texas Instruments Media Technologies Laboratory

8330 LBJ Freeway, MS8374, Dallas, Texas 75243, USA

Abstract:

We developed the Japanese Speech-Aware Multimedia (JAM) system, which controls a World Wide Web (WWW) browser using speech. This system allows the user to browse a linked page by reading the anchor text within a Web page. The user can also control the browser itself using speech. The system integrates new vocabulary each time a new Web page is loaded by extracting the anchor text, converting this text to a phonetic string notation, creating a new speech recognition grammar, and integrating this grammar into the system dynamically. During this anchor text-to-phone conversion process, we employ numerous exception handling techniques to accommodate counters, dates, and many other phenomena.

Preliminary tests show that the conversion results contain the correct phone sequences over 97% of the time. We also allowed limited English speech recognition since a large percentage of the Japanese Web pages include some English text in anchors. User tests showed that the prototype correctly understands the input speech 91.5 % of the time, or 94.1% if we exclude user errors caused by unfamiliarity with the system including erroneous readings or speech detection errors.

Keywords: WWW, speech recognition, user interface, text-to-phone conversion

  1. INTRODUCTION

Traffic on the World Wide Web (WWW) has grown continuously at an enormous rate since its introduction. There are many reasons for this boom. Multimedia documents integrating text, still images, video, and audio can be easily published for worldwide distribution. Browsing these hyperlinked pages is very intuitive. Since its highly publicized introduction in the media, the Web's user base has grown from a very small group of researchers and engineers to a large non-technical population, many of whom had never used a computer or the Internet before. Speech may provide a more user-friendly interface for these naive users as well as for the physically challenged. It may also provide a more convenient interface for other applications such as hands-free operation and multimedia presentations. Accordingly, we first developed the Speech-Aware Multimedia (SAM) system [1], which enables the user to control a WWW browser using English speech.

SAM uses speaker-independent continuous speech recognition to enable continuous English speech input. Its main features are as follows:

SAM was first implemented on UNIX platforms using NCSA Mosaic as the browser. It has since been ported to Microsoft Windows 95 with Netscape Navigator as the browser.

After lagging behind the U.S. boom, WWW traffic in Japan has also grown dramatically, and many Japanese language pages can now be found. Major efforts to localize browsers for Japanese began at an early stage. Mosaic was the first to be localized to many languages, including Japanese. However, Netscape Navigator, which appeared shortly after Mosaic, became the first stable browser localized to the major languages, including Japanese. It has since been the browser of choice for Web surfing in Japanese.

With the introduction of the WWW, PC penetration in Japan, which had been significantly lower than in the U.S., increased dramatically. Because of this, the majority of users are still novices. In one survey, over half of the PC users accessing the Internet were fairly inexperienced users with less than 2 years of experience [2]. Speech input can potentially become an attractive alternative interface for these users, who may still be uncomfortable with the mouse and the keyboard.

There have been a number of efforts to apply speech to WWW browsers, but these are still limited since the Web is relatively new. We first describe two systems for English speech on the Apple Macintosh. Apple has bundled its proprietary speech recognition software, PlainTalk Speech Recognition, with the Macintosh OS from early on, and systems which use this software to add speech capabilities to the browser have been developed. Both systems described below are add-on software which function as an interface between PlainTalk and the browser.

The first system, ListenUp [3], extracts keyword-URL (Uniform Resource Locator; the address of documents and resources including text, audio, image and video) pairs from an existing table file and feeds the keywords to PlainTalk. From the keywords, PlainTalk creates a speech grammar and incorporates this grammar dynamically. When PlainTalk recognizes a keyword according to the grammar, the result is fed back to ListenUp. ListenUp then sends a command to the browser to visit the Web page with the corresponding URL. A separate keyword-URL table file must be prepared for each Web page in order to use ListenUp. The system is implemented as a Netscape Navigator plug-in.

The second system, SurfTalk [4] from Digital Dreams, is also implemented as a plug-in for the Netscape Navigator browser and uses PlainTalk. SurfTalk extracts anchor text from Web pages, and outputs the text to PlainTalk to enable speech recognition of the text dynamically. However, at the time of this writing, the software was in its early beta stage and was quite unstable.

On IBM-compatible PCs, IBM uses its own speech recognizer, VoiceType, along with a modified version of Netscape Navigator in a system called VoiceType Connection [5]. Similar to SAM, this system extracts anchor text from a Web page, constructs a speech grammar, and allows the user to visit a linked page by speaking the corresponding anchor text. However, users can only speak words already in the vocabulary, or train the system for new words as they appear in a speaker-dependent fashion. It also allows speech input into forms with its 22,000-word dictation system. At the time of this writing, the system was in beta and downloadable from IBM's Web site.

There are fewer applications of Japanese speech recognition to Web browsers compared to English. The University of Tokyo uses NCSA Mosaic and commercially available Japanese speech recognition software to enable surfing the WWW with Japanese speech [6]. The system also uses a visual software agent to give the user visual feedback through facial animation. Only the words in the anchor text which are in the dictionary of the speech recognizer can be recognized; words not in the dictionary are not recognized. The system also compiles a table in which each link is assigned an index number, and the user can speak the index number to visit the corresponding linked page. However, unrestricted digit recognition is still somewhat unreliable since digits are short, often pronounced sloppily, and context cannot be used. It also seems unintuitive and inconvenient to look up the index number each time the user wants to visit a new Web page.

We developed the Japanese Speech-Aware Multimedia (JAM) system, which allows the user to read the anchor text in an arbitrary page to visit the corresponding link. In situations where simply reading the anchor text is too restrictive, the system also allows navigation with arbitrary speech according to a speech grammar referenced by an embedded pointer in the header of the page. The system also allows browser commands to be spoken, as well as bookmarks with arbitrary spoken keywords. The following chapter summarizes the issues in applying speech recognition to navigate Web pages. Chapter 3 describes the system architecture. Chapter 4 gives the results of a user evaluation test of the system. Finally, chapter 5 states the conclusions and suggestions for further work.

  2. Speech Recognition of Web Pages

This chapter summarizes issues when creating speech recognition grammars from WWW pages. WWW pages are written using the hypertext markup language (HTML). A simple example of a page in HTML is as follows:

<html>

<head><TITLE>HTML example</TITLE></head>

<body>

<A HREF="http://www.afirm.co.jp/link1.html">Link 1</A>

<A HREF="http://www.afirm.co.jp/link2.html">Link 2</A>

</body>

</html>

As shown in the example, the linked URL is given in double quotes following the tag A HREF. The anchor text is written between the two tags <A HREF=..> and </A>. Thus, the system merely needs to extract the text between these tags and convert it to a speech grammar. Neither the language nor the character code of the text is restricted, so Japanese text can also be used.
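As an illustration only (not the original JAM implementation), the extraction of anchor text-URL pairs described above could be sketched as follows; the regular expression and function names are our own, and HREF values are assumed to be double-quoted.

# Illustrative sketch of extracting anchor text / URL pairs; not the
# original JAM code. Assumes HREF values are enclosed in double quotes.
import re

ANCHOR_RE = re.compile(r'<a\s+href="(?P<url>[^"]+)"[^>]*>(?P<text>.*?)</a>',
                       re.IGNORECASE | re.DOTALL)

def extract_anchors(html):
    """Return (anchor_text, url) pairs found in an HTML page."""
    return [(m.group("text").strip(), m.group("url"))
            for m in ANCHOR_RE.finditer(html)]

page = '<A HREF="http://www.afirm.co.jp/link1.html">Link 1</A>'
print(extract_anchors(page))  # [('Link 1', 'http://www.afirm.co.jp/link1.html')]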

However, there are three major character encoding methods currently in use for Japanese: Extended Unix Code (EUC), Japanese Industrial Standard (JIS) code, and Shift JIS code [7]. In our preliminary tests, Japanese Web pages seemed to be roughly evenly divided between these codes, with 30% EUC, 28% JIS code (both new and old JIS), and 42% Shift JIS code. There were some pages with mixed encodings, probably because the author of the page created parts of it by cutting and pasting from multiple pages with different encodings. There is also a variety of character sets used in Japanese pages: Zenkaku characters (Kanji, Hiragana, Katakana, alphanumeric characters and symbols occupying double the ASCII character width), Hankaku (half-width, equivalent to one ASCII character width) Katakana characters, ASCII characters, and JIS-Roman characters. It is necessary to convert these various standards and formats into one uniform notation for internal processing.
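As a rough sketch of this normalization step (our own, simplified version, not the system's actual detection heuristics), the three encodings could be identified by trial decoding and converted to one internal encoding; the encoding names below are Python codec names.

# Simplified sketch of detecting one of the three Japanese encodings by
# trial decoding and normalizing to EUC-JP; not the original JAM code.
CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")  # JIS, EUC, Shift JIS

def to_euc(raw_bytes):
    for enc in CANDIDATES:
        try:
            return raw_bytes.decode(enc).encode("euc_jp")
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue
    return raw_bytes  # unknown or mixed encodings are left untouched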

After extracting anchor text from a page and converting this to a uniform notation, it is necessary to segment this text into appropriate units. This is required for the following:

In the next phase, the anchor texts need to be converted into phonetic strings. Here, we face the same classic problems encountered in text-to-speech systems, namely:

Non-text anchors also need to be processed. These are mostly images which are used as anchors instead of text. The images serve as clickable buttons for links, and are commonly called image maps. Figure 1 shows an example. In this example, the boxes with the strings company info, products/services, etc. serve as buttons. These do not explicitly use text as anchors and so cannot be used directly to create speakable links. However, HTML provides a tag to define substitute text for the images, primarily for users with low-speed modems. These users commonly disable auto-loading of images in their browsers. To hint at the content of the images which these users cannot see, the browser displays the text defined by the HTML attribute ALT in place of the images. The following is the HTML code for the page shown in Figure 1 which uses this attribute.

<A HREF="/corp/docs/companyinfo.html">

<IMG SRC="/corp/graphics/cont1w.gif" ALT="Company Info"></A>

<A HREF="/corp/docs/prodserv.html">

<IMG SRC="/corp/graphics/cont2w.gif" ALT="Products/Services"></A>

As shown in this example, substitution text can be defined with the attribute ALT="..." when images are used as anchors. This text can be used to create a speech grammar for the link, thereby enabling speech navigation to it. In this example, the substitution text describes the content of the image appropriately, so the user can use speech to visit the link. However, not all substitution text is as comprehensive, so speech input which the user expects to select a link may not actually work. Further study is needed to solve this problem.
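For illustration, the fallback to ALT text for image anchors could be sketched as follows (our own sketch; attribute quoting is assumed, and the function names are hypothetical).

# Sketch (ours) of falling back to ALT text when the anchor body is an
# image, so the link still gets speakable text.
import re

IMG_ALT_RE = re.compile(r'<img\b[^>]*\balt="(?P<alt>[^"]*)"', re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")

def speakable_text(anchor_body):
    m = IMG_ALT_RE.search(anchor_body)
    if m:                               # image anchor: use its ALT text
        return m.group("alt").strip() or None
    text = TAG_RE.sub("", anchor_body).strip()
    return text or None                 # plain anchor, or None if unusable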

Another item in a page which is difficult to handle with speech is form input. Forms provide a mechanism for the user to send data to the server. For example, the user can fill in text in a form and click the submit button to send the data to the server for a database query. Arbitrary text can be input using forms, but in reality, the variety of data that will actually be input on a particular page is limited by the context. Therefore, we can prepare a speech recognition grammar which defines all sentences which the user may plausibly input, and reference this grammar from the Web page. When the browser fetches the page, the grammar is downloaded simultaneously, and the corresponding spoken sentences can then be accepted. We will call this mechanism smart pages. Smart pages can also enable handling of image maps with speech. For example, for an image map where the user clicks a map of Japan on the prefecture of their choice in order to access information about that prefecture, a speech grammar with all prefecture names can be prepared and used in a smart page. More details about smart pages can be found in the following chapter.

  3. System Architecture
    1. Speakable links, speakable bookmarks and speakable commands

      A prototype system of JAM was developed on a UNIX platform. We used Netscape Navigator as the browser. Figure 2 shows the configuration of the system.

      The browser downloads a Web page written in HTML from the network. This HTML code is used to render the page in the browser, and is also output to JAM for analysis. JAM first detects the character encoding of the page and converts the page to EUC if necessary. We decided to use EUC for all internal processing since Japanese characters in EUC are easily distinguishable from ASCII text. The converted code is analyzed, and anchor text-URL pairs are extracted. If an anchor points to an image file, the system looks for an ALT attribute and, if there is one, the text inside it is used as the anchor text for that link. Hankaku Katakana characters and ASCII alphanumerics and symbols are converted to the corresponding Zenkaku characters, and all alphabetic characters are converted to upper case.

      The anchor texts are then segmented into bunsetsus using a dictionary. We used a freely available system, Kakasi [9], for the dictionary look-up. Kakasi matches the input to the dictionary entries in a single pass. It is very simple but very fast, and is well suited for real-time applications. The original purpose of Kakasi was to look up readings of Kanji character strings, and for this purpose it comes with a dictionary of 121,824 entries. We added our own entries from a dictionary we had developed earlier [10] to bring the total to 357,203 entries. This dictionary was also developed for the same purpose as Kakasi; however, it has most of its entries in bunsetsu units, and it exhaustively lists most conjugations and variations. The system matches the longest entry in the combined dictionary as a bunsetsu candidate. In the worst case, when no appropriate bunsetsu is found in the dictionary, word entries or single-character entries will always be found, so the system falls back to a shorter, over-segmented output. It would be possible to use a full-scale morphological analyzer such as the JUMAN system [11] to segment the input into bunsetsus more accurately, but we concluded that this level of accuracy is not required for our purpose and is too computationally expensive.
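The longest-match look-up can be pictured with the following toy sketch (ours; the dictionary contents are invented, whereas the real system uses the combined Kakasi dictionary described above).

# Toy greedy longest-match segmenter in the spirit of the Kakasi-based
# look-up described above; the dictionary contents here are invented.
def segment(text, dictionary):
    """Left-to-right longest match, falling back to single characters."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest span first
            if text[i:j] in dictionary:
                out.append(text[i:j])
                i = j
                break
        else:                                  # no entry found: one character
            out.append(text[i])
            i += 1
    return out

dic = {"日本語", "リンク", "の", "例"}
print(segment("日本語リンクの例", dic))  # ['日本語', 'リンク', 'の', '例']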

      The segmented anchor text is then converted to a sentence-level speech recognition grammar. An example of this grammar is shown below.

      start(jam_link_command_).

      jam_link_command_ ---> 日本語 リンク __.

      start(日本語 リンク __).

      日本語 リンク __ ---> 日本語_, Z_1.

      日本語 リンク __ ---> non_speech_, 日本語 リンク __.

      Z_1 ---> リンク_, Z_2.

      Z_1 ---> non_speech, Z_1.

      Z_2 ---> _, Z_3.

      Z_2 ---> non_speech, Z_2.

      Z_3 ---> _, Z_4.

      Z_3 ---> “”.

      Z_3 ---> non_speech, Z_3.

      Z_4 ---> “”.

      Z_4 ---> non_speech, Z_4.

      In this example, the anchor text is segmented into four bunsetsus, and optional pause models (non_speech) are inserted between bunsetsus. After transition through three bunsetsus, the grammar allows a transition to the exit node, shown by an empty double quote pair (""). In other words, speech input is accepted by this grammar if at least the first three bunsetsus are spoken. Also, all symbols are made optional in the grammar.
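A sentence-level grammar of this shape could, for illustration, be generated from the segmented anchor roughly as follows (our own sketch; the four-bunsetsu anchor used in the example call is hypothetical, and the exact output format differs slightly from the listing above).

# Sketch (ours) of generating a sentence-level grammar with optional
# non_speech pauses and an early exit after min_spoken bunsetsus.
def link_grammar(bunsetsus, min_spoken=3):
    name = "_".join(bunsetsus) + "_"
    symbols = [name] + ["Z_%d" % i for i in range(1, len(bunsetsus) + 1)]
    lines = ["start(%s)." % name]
    for i, b in enumerate(bunsetsus):
        cur, nxt = symbols[i], symbols[i + 1]
        lines.append("%s ---> %s_, %s." % (cur, b, nxt))
        lines.append("%s ---> non_speech, %s." % (cur, cur))   # optional pause
        if i + 1 >= min_spoken:
            lines.append('%s ---> "".' % nxt)                  # early exit
    lines.append("%s ---> non_speech, %s." % (symbols[-1], symbols[-1]))
    return "\n".join(lines)

# Hypothetical four-bunsetsu anchor: 日本語 / リンク / の / 例
print(link_grammar(["日本語", "リンク", "の", "例"]))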

      Speakable bookmarks, speakable commands, and smart pages, which we will explain in detail later, are all written using similar grammars. For example, the grammar for the speech command which moves one page forward can be written as:

      start(jam_page_forward_command).

      jam_page_forward_command ---> ( 先に | 次に ) [ 進む ].

      Here, ( | ) denotes alternatives, and [ ] denotes optional elements. The user can assign arbitrary sentences or phrases to commands. However, commands are loaded once during start-up, and thus cannot be modified until the system is restarted. An example speakable bookmark grammar is as follows:

      start(テキサス インスツルメンツ).

      url(http://www.ti.com).

      テキサス インスツルメンツ ---> ( テキサス インスツルメンツ | TI ).

      As shown in this example, the corresponding URL is also coded in the speakable bookmark grammar. New bookmarks can be added while browsing; by default, the page title becomes the keyword for the page, but this keyword can be reassigned arbitrarily by the user. By reloading the speakable bookmarks, new bookmarks can be used immediately.
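A speakable bookmark file of this form can be read into a keyword-to-URL table, for example as follows (our sketch; the start()/url() line format follows the listing above).

# Sketch (ours) of reading the start()/url() lines of a speakable
# bookmark grammar into a (keyword, URL) pair.
import re

def load_bookmark(grammar_text):
    keyword = re.search(r"start\(([^)]+)\)", grammar_text).group(1)
    url = re.search(r"url\(([^)]+)\)", grammar_text).group(1)
    return keyword, url

grm = "start(テキサス インスツルメンツ).\nurl(http://www.ti.com)."
print(load_bookmark(grm))  # ('テキサス インスツルメンツ', 'http://www.ti.com')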

      The bunsetsu nodes are expanded into sub-level grammars which define the phonetic strings representing their pronunciations. Kakasi is again used in this process. Using the example cited before,

      start(日本語_).

      日本語_ ---> NIHONXGO_, Z_1_.

      日本語_ ---> NIPPONXGO_, Z_1_.

      日本語_ ---> non_speech_, 日本語_.

      Z_1_ ---> “”.

      Z_1_ ---> non_speech_, Z_1_.

      start(NIHONXGO_).

      NIHONXGO_ ---> n, i, h, o, N, (ng|g), o.

      This example allows two alternate pronunciations, nihongo and nippongo. The second grammar above defines the phonetic string for one of the pronunciations, nihongo.

      Exception handling for digits and counters, which we mentioned before, is also accommodated. For digit strings, grammars which allow both discrete digits and natural numbers are generated. For example,

      start(1996_).

      1996_ ---> (SENX_ KYUUHAKU_ KYUUJYUU_ ROKU_ | ICHI_ KYUU_ KYUU_ ROKU_).

      start(1_).

      1_ ---> (ICHINICHI_|TSUITACHI_).
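For illustration, the two alternative readings for a year such as 1996 could be generated roughly as follows (a simplified sketch of our own; the romanized labels are not the system's exact symbols, and irregular Japanese readings such as those for 300 or 8000 are ignored).

# Simplified sketch (ours) of generating a natural-number reading and a
# digit-by-digit reading; irregular Japanese readings are not handled.
DIGITS = ["ZERO", "ICHI", "NI", "SANX", "YONX", "GO", "ROKU", "NANA", "HACHI", "KYUU"]
UNITS = ["", "JYUU", "HYAKU", "SENX"]        # 1, 10, 100, 1000

def natural_reading(n):                      # valid for 1..9999
    parts = []
    for power in (3, 2, 1, 0):
        d = (n // 10 ** power) % 10
        if d == 0:
            continue
        if not (d == 1 and power > 0):       # "sen", not "ichi sen"
            parts.append(DIGITS[d])
        if UNITS[power]:
            parts.append(UNITS[power])
    return " ".join(parts)

def digit_reading(n):
    return " ".join(DIGITS[int(c)] for c in str(n))

print(natural_reading(1996))  # SENX KYUU HYAKU KYUU JYUU ROKU
print(digit_reading(1996))    # ICHI KYUU KYUU ROKU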

      English anchor text is expanded into Hiragana. English words not in the dictionary are expanded into their spelling pronunciations, which are further expanded into Hiragana. Symbols are also expanded into Hiragana when appropriate. For example,

      start(Apple_).

      Apple_ ---> APPURU_.

      start(A T & T_).

      A T & T_ ---> EI_ TII_ (ANXDO_|ANXPAASANXDO_) TII_.
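A spelling-pronunciation fallback of this kind could be sketched as follows (our own, with abridged letter and symbol tables; the real system outputs phone strings rather than these romanized labels).

# Abridged sketch (ours) of spelling out unknown English anchors
# letter by letter; only a few letter/symbol readings are listed.
LETTERS = {"A": "EI", "B": "BII", "C": "SHII", "T": "TII"}
SYMBOLS = {"&": ["ANXDO", "ANXPAASANXDO"]}

def spell_out(text):
    readings = []
    for ch in text.upper():
        if ch in LETTERS:
            readings.append(LETTERS[ch])
        elif ch in SYMBOLS:
            readings.append("(" + "|".join(SYMBOLS[ch]) + ")")
        # whitespace and unlisted characters are simply skipped
    return " ".join(readings)

print(spell_out("AT&T"))  # EI TII (ANXDO|ANXPAASANXDO) TII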

      We ran a test to estimate the accuracy of the text-to-phone conversion. We extracted 117 anchor texts from major Japanese newspapers on the Web, namely Asahi [12], Yomiuri [13], and Nikkei [14]. The extracted texts were converted to phonetic notation using the modified Kakasi system, and the conversion accuracy was measured. The vocabulary covered numerous domains, including politics, economy, science and technology, and sports. Since Kakasi outputs multiple candidates, we chose the candidate closest to the correct conversion. We measured the conversion accuracy according to the following formula:

      Phonetic conversion accuracy [%] = (number of correctly converted phones) / (total number of phones) × 100

      Since the dictionary packaged with the original Kakasi distribution has wide coverage, perhaps with the exception of proper nouns such as names, the conversion errors were mostly bunsetsu segmentation errors and exceptions including English terms, digits, and symbols. The phonetic conversion accuracy with the original dictionary was 92.4%. With the expanded dictionary and with exception handling, the accuracy increased to 97.2%. The improvement in the percentage of completely correct sentences was more dramatic: 57% and 82% for these two cases, respectively. Since the anchor text in this application is relatively long, a few errors in the phone conversion may not have a significant impact on recognition accuracy. Thus, we believe this level of accuracy is adequate for our application.

      Sentence-level grammars and the pronunciation grammars defining the phonetic strings for each bunsetsu are output to the speech recognizer. The recognition results are interpreted and converted to browser commands. When results corresponding to speakable commands are obtained, appropriate commands are issued to the browser. For example, when Sakini Susumu (go forward) is recognized, the command page forward is issued. When results corresponding to speakable bookmarks or speakable links are recognized, the goto URL command for the corresponding URL is issued.
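The interpretation step can be pictured as a simple dispatch, for example as in the sketch below (ours; the command names and table layout are assumptions for illustration, not Netscape's actual remote-control interface).

# Sketch (ours) of turning a recognition result into a browser action
# using the tables built while analyzing the page and the bookmarks.
def dispatch(result, command_table, bookmark_urls, link_urls):
    if result in command_table:            # speakable command, e.g. 先に進む
        return (command_table[result], None)
    if result in bookmark_urls:            # speakable bookmark
        return ("goto_url", bookmark_urls[result])
    if result in link_urls:                # speakable link on the current page
        return ("goto_url", link_urls[result])
    return ("ignore", None)

commands = {"先に進む": "page_forward"}
print(dispatch("先に進む", commands, {}, {}))  # ('page_forward', None)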

      Figure 3 shows the console of the system. The console has a simple level meter on the left, which gives speech level feedback to the user. There is also a start button which additionally functions as a status indicator, showing whether the system is accepting input or processing. Changes in the input status are also signaled with beeps.

    2. Smart Pages

      With speakable commands and speakable links, users can surf most of the Web using speech. However, there are components on a Web page which are difficult to control with speech, e.g., forms and image maps. Accordingly, we propose a mechanism which we call smart pages.

      In smart pages, speech recognition grammars are used to define allowed speech input in addition to conventional links. A pointer to the speech recognition grammar is embedded in the header portion of the HTML code as follows:

      <HEAD>

      <LINK REL="X-GRAMMAR" HREF="smart_page.grm">

      </HEAD>

      The <LINK> tag conforms to the HTML standard and is used to designate the relationship of a linked resource to the current page. The REL attribute defines that relationship; in the above example, it designates the linked resource as a grammar file. The format of the grammar is similar to the grammar which defines speakable commands. The following example shows a smart page grammar which retrieves weather information for major Japanese cities:

      start(weather).

      weather ---> City_ (現在の (天気|気温|湿度) 予報)

      [ (は|を) 見せて [下さい] ].

      City_ ---> (東京|大阪|京都|名古屋|福岡|神戸|横浜|札幌|.....).

      Speech recognition results using this grammar are passed on to a CGI (Common Gateway Interface) script as arguments to be used as keys for the weather database query. For example, for the recognition result "forecast for Yokohama" (横浜 の 予報), the following URL is accessed:

      http://www.tenki.co.jp/cgi-bin/tenki?横浜+の+予報

      The script returns a page with the forecast for Yokohama.

      As shown in this example, much more flexible speech input is possible using smart pages. Image maps can also be navigated using speech to some extent if an appropriate smart page grammar is used. The additions for smart pages do not affect conventional WWW browsers in any way.
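As an illustration of the query construction above, the recognized words could be joined into the CGI argument string as follows (our sketch; the example in the text shows the raw Japanese characters, while an actual request would carry them in some byte encoding, which is an assumption here).

# Sketch (ours) of building the weather query URL from a recognition
# result such as 横浜 の 予報; percent-encoding of the bytes is assumed.
from urllib.parse import quote

def weather_query(recognized_words):
    args = "+".join(quote(w) for w in recognized_words)
    return "http://www.tenki.co.jp/cgi-bin/tenki?" + args

print(weather_query(["横浜", "の", "予報"]))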

    3. Speech Recognition

    We use a continuous-density HMM speech recognizer with single multivariate Gaussian mixtures. Context-dependent phonetic models are used. The phonetic contexts, defined using phonological features, were clustered using a binary decision tree [15]. Speech segments are identified from the input speech using an energy-based speech detector.

    The grammar used in the system is defined as a set of regular grammars. Each start symbol of a regular grammar is connected to a terminal symbol in a higher-level regular grammar. The system allows a subset of the start symbols to start the search, and thus allows dynamic adjustment of the grammar. We can also add or delete regular grammars easily, which enables us to add or delete the sentences required for each page as needed.
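The per-page bookkeeping described here might look like the following (our own sketch; the data structures and method names are illustrative only).

# Sketch (ours) of a dynamic grammar set: page-specific link grammars
# are swapped in and out, and only active start symbols begin a search.
class GrammarSet:
    def __init__(self):
        self.grammars = {}   # start symbol -> grammar text
        self.active = set()  # start symbols allowed to begin the search

    def add(self, start_symbol, grammar_text):
        self.grammars[start_symbol] = grammar_text
        self.active.add(start_symbol)

    def remove(self, start_symbol):
        self.grammars.pop(start_symbol, None)
        self.active.discard(start_symbol)

    def load_page(self, page_grammars, previous_page_symbols):
        """Drop the previous page's link grammars, then add the new page's."""
        for sym in previous_page_symbols:
            self.remove(sym)
        for sym, text in page_grammars.items():
            self.add(sym, text)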

  4. Evaluation Experiments

    We conducted a simple evaluation test to measure the performance of the system and to identify problems. We conducted two types of tests. In the first, in order to measure the performance of speakable links and speakable commands, we asked the subjects to surf the WWW freely. In the second, we compared the efficiency of smart pages with that of regular pages. In both tests, there were 12 subjects, 7 female and 5 male. Six of the subjects reported that they use a computer regularly and surf the Web. Two use computers a few times a week, while the remaining four rarely or never use them.

    A Sun Sparc 20 workstation was used in all tests, with a unidirectional dynamic microphone. The users were asked to operate the on-off switch on the microphone for each utterance, but we also applied simple speech detection. The internal A/D converter of the Sparc 20 was used. All tests were conducted in a closed office. The room was relatively quiet, but the air conditioning and the fan of the Sparc 20 were quite audible.

    1. Speakable Links And Speakable Command Evaluations

In the first experiment, we evaluated the performance of speakable links and speakable commands. The subjects were asked to freely surf the Web within the allotted time. Task completion ratio was evaluated on all valid utterances. Speakable bookmarks were not used in the tests. Speakable commands were limited to the following six commands:

We prepared a home page with links to the following items, which we thought would be of general interest.

Starting from the above page, the users were asked to surf Japanese pages. Only pages which could be visited through speakable links were considered; the mouse and keyboard were not used. Users were given approximately 30 minutes to surf freely. All speech input, recognition results, and the browser status were recorded. After the evaluation sessions, the logs were used to measure the task completion ratio. There was a total of 1174 utterances across all sessions, for an average of about 98 utterances per subject. Table 1 summarizes the results.

 

Table 1. Speakable links and commands evaluation test results

User Classification |        Task Completion [%]
                    | Commands | Links | All Utterances
--------------------+----------+-------+---------------
Male                |   97.44  | 78.76 |     88.62
Female              |   99.01  | 86.84 |     93.61
All                 |   98.37  | 83.13 |     91.48

In this table, all utterances which resulted in the intended action were rated as correct, even if the recognition result included substitutions or deletions. The overall task completion ratio was 91.5%. The completion ratio for speakable commands, which make up 58% of all utterances, was 98.4%. Female subjects scored somewhat higher in all categories, especially for speakable links, probably because most female subjects spoke more carefully. There was no significant difference in the completion ratio with respect to daily computer usage. Speakable links showed significantly lower completion ratios than speakable commands. Table 2 shows the breakdown of the causes of these speakable link errors.

Table 2. Breakdown of speakable link recognition errors

Error Types        | Percentage of All Errors [%]
                   |  Male | Female | All Users
-------------------+-------+--------+----------
Speech Detection   | 18.07 |  10.84 |   28.92
Speech Recognition | 13.25 |  10.84 |   24.10
Insufficient Entry |  3.61 |  15.66 |   19.28
Out of Vocabulary  | 13.25 |   4.84 |   18.07
Others             |  9.64 |   0.0  |    9.64

The most frequent cause of errors was speech detection errors, followed by speech recognition errors, insufficient speech input, and out-of-vocabulary input. Insufficient speech input refers to errors caused by utterances in which the user did not speak the required three bunsetsus. Out-of-vocabulary input includes misread words, utterances read from the middle of the anchor (which is not allowed), and attempts to read from image maps without ALT text. The Others category includes pages which could not be parsed due to HTML coding errors. There were significantly more speech detection errors with male speakers, which we believe was caused by subjects not holding the microphone in a stable manner or at an appropriate distance. A close-talking head-mounted microphone may have helped here. There were also male subjects with extremely long inter-word pauses, which caused the speech detector to end speech input prematurely. To deal with these problems, we need to improve the detection accuracy through optimization or adaptation of the detection level and the introduction of other features. Insufficient entries and out-of-vocabulary utterances also constitute a large portion of the errors. However, we believe these errors will decrease to a negligible level as users become familiar with the system. The task completion ratio without these two error categories comes to 94.1%.

    2. Comparison of Smart Pages and Regular Pages

To evaluate the efficiency improvements brought by smart pages, we prepared two pages with essentially the same content, one with a smart page grammar and one without. The selected task was a database query of Sumo wrestler (Rikishi) profiles. We used the profiles published by the Sumo Organization [16], and provided links to these profiles based on the recognition results.

In the page with only speakable links, we provided the full name of each Rikishi as the anchor text for the corresponding link. On the other hand, to take full advantage of the flexibility allowed by smart page grammars, rank, name, or a combination of both were allowed in the page with a smart page grammar. The initial portion of the grammar is shown below.

start(rikishi).

rikishi --->

[ Higashi | Higashikata ] [ Yokozuna ] Takanohana [ Kouji ] |

( Higashi | Higashikata ) Yokozuna |

[ Nishi | Nishikata ] [ Yokozuna ] Akebono [ Taroo ] |

( Nishi | Nishikata ) Yokozuna |

[ Higashi | Higashikata ] [ Oozeki ] Wakanohana [ Masaru ] |

( Higashi | Higashikata ) Oozeki |

(Abridged)

Here, Higashi, Higashikata, Nishi, and Nishikata are the sides on which the Rikishis compete (east and west). Yokozuna and Oozeki are their ranks, and the rest are their names.

Since the smart page grammar remains active after navigating to the linked page, it is possible to directly search for the next profile. With regular pages, however, the user needs to move back to the original page to look up the next profile.

To compare the efficiency of these methods, the users were asked to look up the stables, favorite techniques, origins, weights, and real names of 10 random Rikishis. The Rikishis were specified using either rank, name, or a combination of name and rank. The number of trials necessary to look up the 10 Rikishis, as well as the task completion ratio, was measured. Table 3 shows the results.

Table 3. Efficiency comparison test result between speakable links and smart pages.

User Classification |           Smart Pages              |         Speakable Links
                    | Average Trials | Task Completion [%] | Average Trials | Task Completion [%]
--------------------+----------------+---------------------+----------------+--------------------
Male                |      12.80     |        89.06        |      32.40     |        93.21
Female              |      13.86     |        83.51        |      32.00     |        89.73
All                 |      13.42     |        85.71        |      32.16     |        91.19

Smart pages required an average of 13.4 trials to look up the 10 items, in other words 3.4 additional trials. With speakable links, 32.2 trials were required on average, i.e., 22.2 additional trials; even after accounting for the one extra trial per item needed to move back to the original page, more than one further trial per item was necessary. The task completion ratio was somewhat higher with speakable links. This is because the perplexity of the smart page grammar is higher due to its added flexibility, and because the grammar includes digits in the ranks (e.g., Maegashira 3 Maime), which are generally more difficult to recognize. More accurate models should compensate for this difference in the future.

Admittedly, this task clearly favors smart pages, which showed a significant advantage in the number of trials necessary to complete the task. However, we believe that the advantages of smart pages will hold even in practical tasks, although perhaps not as dramatically. Smart pages also allow speech control of components that are not possible with regular speakable links, e.g., form input and image maps.

  5. Conclusion

We have built a prototype system, the Japanese Speech-Aware Multimedia (JAM) system, which uses speaker-independent continuous speech recognition to allow Japanese speakers to surf Web pages. Its major features are as follows:


These features were basically a Japanese localization of features implemented in the English system, SAM. However, the following features were unique to the localized JAM:

In evaluation tests where the users were asked to surf the Web using speech, a task completion ratio of over 91% was observed. If user errors caused by unfamiliarity with the system are excluded, the completion ratio becomes 94%. In a comparison test between smart pages and regular speakable links, we confirmed the superior efficiency of smart pages in our benchmark task.

This system can handle most Web pages. However, since new Web technology is constantly being added and existing technology constantly modified, numerous issues remain to be solved.

Acknowledgment

The authors thank Dr. P. K. Rajasekaran, Dr. V. Viswanathan, and the members of the Speech Research group for their input. They also thank Japanese expatriates in Texas who participated in the tests.

References

  1. C. Hemphill, P. Thrift and J. Linn, Speech-Aware Multimedia, IEEE Multimedia, vol. 3, no. 1, pp. 74-78, Spring 1996.
  2. Y. Yamazaki, Men for Survey, Women for Communication: Usage Analysis of PC Communication, Nikkei Electronics, no. 665, pp. 129-138, July 1996 (in Japanese).
  3. http://snow.cit.cornell.edu/noon/ListenUp.html.
  4. http://www.surftalk.com.
  5. http://www.software.ibm.com/is/voicetype/vtconn/vtconn.html.
  6. H. Dohi, M. Ishizuka, A Visual Software Agent Connected to WWW/Mosaic, Trans. IEICE, vol. J79-D-II, no. 4, April 1996 (in Japanese).
  7. Ken Lunde, Understanding Japanese Information Processing, pp. 59-99, O'Reilly & Associates, Sebastopol, California, 1993.
  8. K. Hakoda, H. Sato, A Pause Insertion Rule for Connected Speech, Technical Report of the speech research group of the ASJ, S74-64, pp. 1-7, March 1975 (in Japanese).
  9. H. Takahashi, Kakasi: Kanji Kana Simple Inverter, version 2.2.5, June 1994. Available from ftp.uwtc.washington.edu.
  10. J. Picone, T. Staples, K.Kondo and N. Arai, Kanji to Hiragana Conversion Based on a Length-Constrained N-Gram Analysis, to be published in the IEEE Transactions on Speech and Audio Processing.
  11. Y. Matsumoto, S. Kurohashi, T. Utsuro, Y. Myoogi, M. Nagao, Japanese Morphological Analysis System JUMAN Manual, version 2.0, July 1994. Available from ftp://ftp.aist-nara.ac.jp/pub/nlp/tools/juman.
  12. http://www.asahi.com.
  13. http://www.mainichi.co.jp.
  14. http://www.nikkei.co.jp.
  15. Y. H. Kao, C. T. Hemphill, B. J. Wheatley and P. K. Rajasekaran, Toward Vocabulary Independent Telephone Speech Recognition, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. I-117 - I-120, April 1994.
  16. http://www.wnn.or.jp/wnn-t/database/rikishidata.

 

Biography

Kazuhiro Kondo (Member, IEICEJ) received the B. E., M. E. and the Ph. D. from Waseda University in 1982, 1984 and 1998, respectively. From 1984, he worked at Central Research Laboratory, Hitachi Ltd., Tokyo, Japan, where he was engaged in R & D on speech signal processing systems and video coding systems. In 1992, he joined Texas Instruments Tsukuba R & D Center, Tsukuba, Japan, and was transferred to Texas Instruments Inc. Media Technologies Laboratory, Dallas, TX in 1996, where he is currently a Member of Technical Staff. He is currently engaged in R & D in speech recognition systems.

Dr. Kondo is a member of the Acoustical Society of Japan, and the IEEE.

Charles Hemphill received the B. S. in mathematics from the University of Arizona in 1981, and M. S. in computer science from the Southern Methodist University in 1985. He is currently working towards his Ph. D. in CS at the University of Texas at Dallas. He joined Texas Instruments in 1982, where he is currently a Senior Member of Technical Staff. His research interests include grammar representation and spoken language understanding. He was the principal investigator for the definition and collection of the DARPA ATIS pilot corpus.

Mr. Hemphill is a member of the ACM and ACL.

 

List of Tables

Table 1. Speakable links and commands evaluation test results

Table 2. Breakdown of speakable link recognition errors

Table 3. Efficiency comparison test result between speakable links and smart pages.

 

List of Figures

Fig. 1 Example of an image map

Fig. 2 JAM system configuration

Fig. 3 JAM console

 

 

Fig. 1 Example of an image map

 

Fig. 2 JAM system configuration

 

 

Fig. 3 JAM console