Karthik Manimaran wrote:
Hi,
I used the following approach to make the raw files searchable using
Lucene.
Thanks for this info. The problem I see with this solution is that you
have external scripts etc. to handle the generation of the data. Perhaps
having Forrest itself generate the necessary indexes would be better.
How about something like this:
Forrest uses site.xml to pass the documents to the Lucene index
transformer. site.xml will not have the list of all the raw files as
entries. In my case I wanted javadocs for a component library to be
placed as raw HTML files and be searchable. Updating site.xml
every time the raw HTML files change is out of the question, so I created
a new file, site-lucene.xml, that contains both site.xml and entries
corresponding to all the raw HTML files. The steps are as follows:
1. Write a batch file (UpdateLuceneSearchList.bat) that gets the
recursive list of all the HTML files and writes it to a file jupd.txt.
Place it in the root of the folder containing the raw HTML files.
Contents of UpdateLuceneSearchList.bat >>
dir *.htm* /n /b /s >jupd.txt
Replace this with a sitemap entry that uses the directoryGenerator [1]
to create an XML list of raw files you want to index.
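As a rough sketch of what such a sitemap entry could look like: the match
pattern raw-files.xml and the src path are placeholders, and this assumes
the Cocoon DirectoryGenerator is declared in the sitemap under the name
"directory".

```xml
<!-- Hypothetical pipeline that lists the raw files as XML -->
<map:match pattern="raw-files.xml">
  <!-- src is a placeholder for the folder holding the raw HTML files -->
  <map:generate type="directory" src="path/to/raw-html">
    <!-- depth limits how many directory levels are descended;
         10 is an arbitrary choice for this sketch -->
    <map:parameter name="depth" value="10"/>
  </map:generate>
  <map:serialize type="xml"/>
</map:match>
```

The generator emits nested dir:directory and dir:file elements, so a
stylesheet would still be needed to turn them into site.xml-style entries.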
2. Write a Java program that takes site.xml and jupd.txt and produces a
new XML file, site-lucene.xml. Source attached.
Replace with a pipeline that aggregates the above XML with site.xml.
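Something along these lines might work as a starting point. It is only a
sketch: the root element name, the raw-files.xml pipeline and the
stylesheet path are made up for illustration, and the
{project:content.xdocs} variable is assumed to resolve to the xdocs
directory in your Forrest version.

```xml
<!-- Hypothetical pipeline that builds site-lucene.xml on request -->
<map:match pattern="site-lucene.xml">
  <!-- element names the wrapper root; "site-lucene" is an arbitrary choice -->
  <map:aggregate element="site-lucene">
    <!-- the normal site.xml (path variable is an assumption) -->
    <map:part src="{project:content.xdocs}site.xml"/>
    <!-- assumed internal pipeline that returns the directory listing -->
    <map:part src="cocoon:/raw-files.xml"/>
  </map:aggregate>
  <!-- placeholder stylesheet that merges both parts into one site structure -->
  <map:transform src="path/to/merge-raw-files.xsl"/>
  <map:serialize type="xml"/>
</map:match>
```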
3. Update search.xmap so that our new site-lucene.xml is used to
obtain the input.
This step stays the same.
4. Add an entry for abs-linkmap-lucene to the pipeline in linkmap.xmap.
This step stays the same.
5. Comment out the following lines in site2book.xsl (as we generate the tags
in site-lucene.xml without labels):
<!--
<xsl:when test="not(@label)">
</xsl:when>
-->
This is a bad idea. Those entries are there for a reason, and commenting
them out will affect the "normal" use of site2book.xsl in some sites
(i.e. ones with site entries without labels).
Instead, give each entry in site-lucene.xml a label.
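For example, if the raw file list comes from the Cocoon directory
generator, a stylesheet could add the label when it creates each entry.
This is a minimal sketch: the rawfile element name is invented, and using
the bare file name for both label and href ignores subdirectories, which a
real stylesheet would have to handle.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:dir="http://apache.org/cocoon/directory/2.0">

  <!-- Turn each listed file into an entry that carries a label.
       "rawfile" is a hypothetical element name. -->
  <xsl:template match="dir:file">
    <rawfile label="{@name}" href="{@name}"/>
  </xsl:template>

</xsl:stylesheet>
```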
6. Create a batch file that calls UpdateLuceneSearchList.bat and
executes the java program to update the index.
...
This batch file can be scheduled to run whenever the raw files are
updated, to keep the index up to date. If this is of any help and the
search-related information in the Forrest documentation could be updated,
I will be glad to do so.
This step is no longer needed, as the site-lucene.xml file would now be
generated dynamically when required.
If you decide to implement this, patches are welcome. If you need some
more pointers, we'll do our best to help.
Ross