I don't think that the demo parser is meant as a production
system component. You can look at Tidy or NekoHTML. They clean up your HTML
and are probably optimised.
sv
On Wed, 25 Aug 2004, Hetan Shah wrote:
Hello all,
Is there a way to reduce the indexing time taken when the indexer is
Do you have any pointers for sample code for them?
Would highly appreciate it.
Thanks.
-H
Stephane James Vaucher wrote:
JGuru explanation:
http://www.jguru.com/faq/view.jsp?EID=1074228
I have no sample code for NekoHTML, though I think Nutch uses it. For Tidy,
you can look at the Ant contribution in the sandbox:
Hi Hetan,
That's the major problem with non-standardized tags in the HTML documents
you are indexing: they add lag to the indexing process.
If you can, tweak the HTMLParser.jj file under '/demo/html' in lucene.zip
[you need some knowledge of JavaCC for this].
Hetan,
If you are using a corpus with multiple editors, I suggest that you
use a cleaner like Tidy, as there might be weird stuff appearing in the
HTML.
sv
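
Since sample code was asked for, here is a minimal sketch of extracting text from messy HTML before handing it to the indexer. It does not use Tidy or NekoHTML: as a stand-in it uses the lenient HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator), so it runs without extra jars. The class name HtmlTextExtractor and the sample markup are made up for illustration, and this is not claimed to be faster than the demo parser.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls the text content out of (possibly malformed) HTML
// using the JDK's built-in parser instead of Tidy or NekoHTML.
final class HtmlTextExtractor {

    /** Returns the text chunks of an HTML document, joined by single spaces. */
    static String extractText(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser reports each run of character data through handleText.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try (Reader reader = new StringReader(html)) {
            // Third argument true: ignore any charset declared inside the document.
            new ParserDelegator().parse(reader, callback, true);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Deliberately sloppy markup: unclosed <b> and <p>, no closing </html>.
        String messy = "<html><body><p>Unclosed <b>bold<p>Second paragraph</body>";
        System.out.println(extractText(messy));
    }
}
```

The returned string could then be added to a Lucene document as the contents field in place of the demo parser's output. Whether that helps indexing time depends on the corpus, so measure before switching.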
On Thu, 26 Aug 2004, Karthik N S wrote: