Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
I don't think the demo parser is meant to be a production system component. You can look at Tidy or NekoHTML. They clean up your HTML and are probably optimised. sv On Wed, 25 Aug 2004, Hetan Shah wrote: Hello all, Is there a way to reduce the indexing time taken when the indexer is

Re: Time to index documents

2004-08-25 Thread Hetan Shah
Do you have any pointers to sample code for them? I would highly appreciate it. Thanks. -H Stephane James Vaucher wrote: I don't think the demo parser is meant to be a production system component. You can look at Tidy or NekoHTML. They clean up your HTML and are probably optimised. sv On Wed,

Re: Time to index documents

2004-08-25 Thread Stephane James Vaucher
JGuru explanation: http://www.jguru.com/faq/view.jsp?EID=1074228 I have no sample code for NekoHTML, though I think Nutch uses it. For Tidy, you can look at ant in the sandbox:

RE: Time to index documents

2004-08-25 Thread Karthik N S
Hi Hetan, That's the major problem with non-standardized tags in the HTML documents you are indexing, resulting in lag time in the indexing process. If you can tweak the HTMLParser.jj file within lucene.zip '/demo/html' [you need some knowledge of JavaCC for this].

RE: Time to index documents

2004-08-25 Thread Stephane James Vaucher
Hetan, If you are using a corpus with multiple editors, I suggest that you use a cleaner like Tidy, as there might be weird stuff appearing in the HTML. sv On Thu, 26 Aug 2004, Karthik N S wrote: Hi Hetan, That's the major problem with non-standardized tags in the HTML documents you are
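
Since sample code was asked for in this thread, here is a minimal sketch of the general idea: run the HTML through a lenient parser and hand only the extracted text to the indexer. Note the assumptions: it uses the HTML parser bundled with the JDK (javax.swing.text.html) as a stand-in, not Tidy or NekoHTML, and the class and method names (HtmlTextExtractor, extractText) are made up for illustration. Whether it is faster than the demo parser on a given corpus would have to be measured.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: not part of Lucene, Tidy or NekoHTML.
public class HtmlTextExtractor {

    /** Collects the text chunks of an HTML document, tolerating sloppy markup. */
    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Separate consecutive text chunks with a single space.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };
        // 'true' tells the parser to ignore charset declarations in the page,
        // since the caller already supplies decoded characters via the Reader.
        new ParserDelegator().parse(html, callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Deliberately malformed input: unclosed <b> and <p>, no </html>.
        String messy = "<html><body><p>Hello <b>world<p>unclosed tags</body>";
        System.out.println(extractText(new StringReader(messy)));
    }
}
```

The returned string is what you would then put into the document's content field when indexing, in place of the output of the demo HTMLParser.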