Design overview of the search mechanism
Search is implemented entirely on the client side; no server is involved. When a user enters a query, JavaScript running in the browser processes it and finds the matching results by comparing the query against a pre-generated index, which also resides in the browser. The search mechanism has two main parts.
- Indexing: First, the content in the docs/content folder is traversed and the words in it are indexed. This is done by nw-cms.jar. You can invoke it by running ant index from the root of the webhelp directory. You can recompile the indexer and rebuild the jar file by running ant build-indexer. The indexer has extended support for English, German, and French: text in these languages is stemmed first to obtain the root word, and the root word is then indexed. For CJK (Chinese, Japanese, Korean) languages, which do not put spaces between words, it uses bi-gram tokenizing to break up the text. Running ant index generates five output files:
  - htmlFileList.js: contains an array named fl, which stores details of all the files indexed by the indexer.
  - htmlFileInfoList.js: contains metadata about the indexed files in an array named fil: the file name, the HTML title, and a summary of the content. An entry looks like this:
    fil["4"]= "ch03.html@@@Developer Docs@@@This chapter provides an overview of how webhelp is implemented.";
  - index-*.js (three index files): these three files store the actual index of the content. The index is added to an array named w.
 
- Querying: Query processing happens entirely on the client side. The following JavaScript files handle it:
  - nwSearchFnt.js: handles the user query and returns the search results. It tokenizes the query into words, drops unnecessary punctuation and common words, and stems the words if the DocBook language supports stemming.
  - {$indexer-language-code}_stemmer.js: contains the stemming library. nwSearchFnt.js calls the stemmer method in this file to stem a word, for example: var stem = stemmer(foobar);
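
To make the htmlFileInfoList.js entry format concrete, here is a minimal sketch of how one fil entry could be split into its three fields. The helper name parseFileInfo is hypothetical and is not part of WebHelp; the only thing taken from the description above is the "@@@" separator and the field order (file name, title, summary).

```javascript
// Hypothetical helper, not part of WebHelp: splits one htmlFileInfoList.js
// entry of the form "file@@@title@@@summary" into named fields.
function parseFileInfo(entry) {
  var parts = entry.split("@@@");
  return {
    file: parts[0],
    title: parts[1] || "",
    // Re-join the remainder in case the summary itself contains "@@@".
    summary: parts.slice(2).join("@@@")
  };
}

// Sample data copied from the format example above.
var fil = [];
fil["4"] = "ch03.html@@@Developer Docs@@@This chapter provides an overview of how webhelp is implemented.";

var info = parseFileInfo(fil["4"]);
console.log(info.file + " | " + info.title);
```

A search result list could then show info.title as the link text and info.summary as the snippet, assuming the result carries the index of the matching file.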
 
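The query-side steps described above (tokenize, drop punctuation and common words, stem) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in nwSearchFnt.js: normalizeQuery, STOP_WORDS, and toyStemmer are hypothetical names, the stop-word list is made up, and toyStemmer is a toy stand-in for the stemmer function that a real {$indexer-language-code}_stemmer.js file provides.

```javascript
// Illustrative stop-word list (assumption); the real list is not shown here.
var STOP_WORDS = { "the": true, "a": true, "an": true, "of": true, "is": true, "and": true };

// Toy stand-in for stemmer(word). It only strips a trailing "ing" or "s";
// it is NOT the Snowball algorithm.
function toyStemmer(word) {
  return word.replace(/(ing|s)$/, "");
}

// Turns a raw query string into a list of normalized search terms.
function normalizeQuery(query, stem) {
  return query
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, " ")  // drop punctuation (ASCII-only for brevity)
    .split(/\s+/)                  // tokenize on whitespace
    .filter(function (w) {         // drop empty tokens and common words
      return w.length > 0 && !STOP_WORDS.hasOwnProperty(w);
    })
    .map(function (w) {            // stem only if a stemmer is available
      return stem ? stem(w) : w;
    });
}

console.log(normalizeQuery("Indexing the files, and searching!", toyStemmer));
```

Passing null instead of a stemmer models a language without stemming support: the terms are then matched against the index unstemmed.
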
Adding new Stemmers is very simple.
Currently, only English, French, and German stemmers are integrated into WebHelp. However, the code is extensible, so you can add a new stemmer in a few steps.
What you need:
- You need two versions of the stemmer: one written in JavaScript and one in Java. Fortunately, Snowball provides Java stemmers for a number of popular languages, and these are already included with the package. You can see the full list in Adding support for other (non-CJKV) languages. If your language is listed there, you only have to find a JavaScript version of the stemmer. New stemmers are generally added at the Snowball Stemmers in other languages location. If a JavaScript stemmer for your language is available, download it. Otherwise, you can write a new stemmer in JavaScript from the Snowball algorithm fairly easily. The algorithms are available at Snowball.
- Then, name the JavaScript stemmer exactly like this: {$language-code}_stemmer.js. For example, for Italian (it), name it it_stemmer.js. Then copy it to the docbook-webhelp/template/content/search/stemmers/ folder. (This assumes docbook-webhelp is the root folder for webhelp.)
  Note: Make sure you have changed the webhelp.indexer.language property in build.properties to your language.
- Now two easy changes are needed in the indexer.
  - Open docbook-webhelp/indexer/src/com/nexwave/nquindexer/IndexerTask.java in a text editor and add your language code to the supportedLanguages String array.
    Example 3.1. Add new language to supportedLanguages array
    Change the array from:
        private String[] supportedLanguages= {"en", "de", "fr", "cn", "ja", "ko"}; //currently extended support available for
                // English, German, French and CJK (Chinese, Japanese, Korean) languages only.
    to:
        private String[] supportedLanguages= {"en", "de", "fr", "cn", "ja", "ko", "it"}; //currently extended support available for
                // English, German, French, CJK (Chinese, Japanese, Korean), and Italian languages only.
  - Now open docbook-webhelp/indexer/src/com/nexwave/nquindexer/SaxHTMLIndex.java and find the place where it initializes the stemmer (search for SnowballStemmer stemmer;). There, add code to initialize the stemmer object for your language, following the pattern in the example. The class names are listed in docbook-webhelp/indexer/src/com/nexwave/stemmer/snowball/ext/.
    Example 3.2. Initialize the correct stemmer based on the webhelp.indexer.language specified
        SnowballStemmer stemmer;
        if(indexerLanguage.equalsIgnoreCase("en")){
            stemmer = new EnglishStemmer();
        } else if (indexerLanguage.equalsIgnoreCase("de")){
            stemmer= new GermanStemmer();
        } else if (indexerLanguage.equalsIgnoreCase("fr")){
            stemmer= new FrenchStemmer();
        } else if (indexerLanguage.equalsIgnoreCase("it")){ //If language code is "it" (Italian)
            stemmer= new italianStemmer(); //Initialize the stemmer to italianStemmer object.
        } else {
            stemmer = null;
        }
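
On the JavaScript side, the only contract stated in this document is that nwSearchFnt.js calls a function named stemmer with a word and uses the returned stem. The sketch below shows what a file such as it_stemmer.js might look like in shape only. The suffix list is a placeholder assumption and is not the Snowball Italian algorithm; a real stemmer file would replace the function body with an actual implementation.

```javascript
// it_stemmer.js (sketch). PLACEHOLDER_SUFFIXES is an illustrative assumption,
// not the Snowball Italian rules; swap in a real stemmer implementation.
var PLACEHOLDER_SUFFIXES = ["mente", "zione", "are", "ere", "ire"];

// Entry point with the name the search code is described as calling.
function stemmer(word) {
  var w = String(word).toLowerCase();
  for (var i = 0; i < PLACEHOLDER_SUFFIXES.length; i++) {
    var suffix = PLACEHOLDER_SUFFIXES[i];
    // Strip the first matching suffix, keeping at least three characters of stem.
    if (w.length - suffix.length >= 3 && w.slice(-suffix.length) === suffix) {
      return w.slice(0, w.length - suffix.length);
    }
  }
  return w; // no rule applied; return the lower-cased word unchanged
}

// Called the same way the search code is described as calling it.
var stem = stemmer("parlare");
console.log(stem);
```

Whatever implementation is used, the Java stemmer in the indexer and the JavaScript stemmer in the browser need to produce the same stems for the same words, otherwise stemmed query terms will not match the stemmed index entries.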
 
That's all. Now run ant build-indexer to compile and build the Java code. Then run ant webhelp to generate the output from your DocBook file. For any questions, contact us or email the DocBook mailing list <docbook-apps@lists.oasis-open.org>.
        


