Karthik Manimaran wrote:
Hi,
I used the following approach to make the raw files searchable using Lucene.

Thanks for this info. The problem I see with this solution is that you have external scripts etc. to handle the generation of the data. Perhaps having Forrest itself generate the necessary indexes would be better. How about something like this:

Forrest uses site.xml to pass the documents to the Lucene index transformer, but site.xml does not list the raw files. In my case I wanted the javadocs for a component library to be placed as raw HTML files and be searchable. Updating site.xml every time the raw HTML files change is out of the question, so I created a new file, site-lucene.xml, that contains both the site.xml entries and entries for all the raw HTML files. Steps are as follows:

1. Write a batch file (UpdateLuceneSearchList.bat) that gets the recursive list of all the HTML files and writes it to a file jupd.txt. Place it in the root of the folder containing the raw HTML files.
Contents of UpdateLuceneSearchList.bat >>
dir *.htm* /n /b /s >jupd.txt

Replace this with a sitemap entry that uses the DirectoryGenerator [1] to create an XML list of the raw files you want to index.
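
A minimal sketch of such a sitemap entry, assuming the standard Cocoon DirectoryGenerator is declared in your sitemap with the type "directory". The match pattern raw-files.xml and the source path are placeholders for illustration, not names that Forrest defines:

```xml
<!-- Hypothetical pipeline: "raw-files.xml" and the src path are placeholders. -->
<map:match pattern="raw-files.xml">
  <!-- Lists the directory as XML: nested dir:directory and dir:file elements. -->
  <map:generate type="directory" src="path/to/raw/html">
    <!-- depth sets how many directory levels are listed (the default is 1). -->
    <map:parameter name="depth" value="10"/>
  </map:generate>
  <map:serialize type="xml"/>
</map:match>
```

Narrowing the listing to the *.htm* files can be left to a later transformation step.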

2. Write a java program that takes site.xml and jupd.txt and produces a new xml file site-lucene.xml. Source attached.

Replace this with a pipeline that aggregates the above XML with site.xml.
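
Roughly, the aggregation could look like the sketch below. It assumes a separate pipeline (here called raw-files.xml, a made-up name) that serves the XML listing of raw files, and it uses a placeholder path for site.xml; adjust both to your project layout:

```xml
<!-- Hypothetical pipeline; both part sources are placeholders. -->
<map:match pattern="site-lucene.xml">
  <!-- Wraps all parts in a single root element. -->
  <map:aggregate element="site-lucene">
    <!-- The project's normal site.xml. -->
    <map:part src="path/to/site.xml"/>
    <!-- The XML listing of the raw files, served by another pipeline. -->
    <map:part src="cocoon:/raw-files.xml"/>
  </map:aggregate>
  <map:serialize type="xml"/>
</map:match>
```

The listing entries would still need to be transformed into site.xml-style entries before search.xmap can use the result.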

3. Update search.xmap to enable our new site-lucene.xml to be used to obtain the input

This step stays the same.

4. Add an entry for abs-linkmap-lucene to the pipeline in linkmap.xmap

This step stays the same.

5. Comment out the following lines in site2book.xsl (as we generate the tags in site-lucene.xml without labels)
<!--
      <xsl:when test="not(@label)">
      </xsl:when>
-->

This is a bad idea. Those entries are there for a reason: commenting them out will affect the "normal" use of site2book.xsl in some sites (i.e. ones that have site entries without labels).

Instead, you should give each entry in site-lucene.xml a label.
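
One way to get labelled entries, sketched as an XSLT step that turns a Cocoon directory listing into entries carrying both label and href. The element name "raw" and the choice of the file name as the label are assumptions for illustration; whether the entries need a particular namespace depends on your site.xml:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dir="http://apache.org/cocoon/directory/2.0"
    exclude-result-prefixes="dir">

  <!-- Copy everything else (e.g. the site.xml entries) unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Drop the directory wrappers but keep processing their children. -->
  <xsl:template match="dir:directory">
    <xsl:apply-templates select="dir:directory|dir:file"/>
  </xsl:template>

  <!-- One labelled entry per raw file; "raw" is a hypothetical element name. -->
  <xsl:template match="dir:file">
    <raw label="{@name}">
      <xsl:attribute name="href">
        <!-- Rebuild the relative path, skipping the top-level directory. -->
        <xsl:for-each select="ancestor::dir:directory[position() != last()]">
          <xsl:value-of select="@name"/>
          <xsl:text>/</xsl:text>
        </xsl:for-each>
        <xsl:value-of select="@name"/>
      </xsl:attribute>
    </raw>
  </xsl:template>

</xsl:stylesheet>
```

With labels present on every generated entry, site2book.xsl can be left untouched.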

6. Create a batch file that calls UpdateLuceneSearchList.bat and executes the java program to update the index.

...

This batch file can be scheduled to run every time the raw files are updated, to keep the index up to date. If this is of any help and the search-related info in the Forrest documentation could be updated, I will be glad to do so.

This step is no longer needed, as the site-lucene.xml file would now be generated dynamically when required.

If you decide to implement this, patches are welcome. If you need some more pointers, we'll do our best.

Ross