HTML saga continues...

2002-12-12 Thread Leo Galambos
So, I have tried this with Lucene:
1) original JavaCC LL(k) HTML parser
2) Swing's HTML parser

With (1) I could process about 300K HTML documents; with (2), more than
400K.

But I cannot process the complete collection (5M) and finish my hard stress
tests of Lucene.

Is there anyone who has an HTML parser that really works with Lucene? :) If
you think that you have one, please let me know. I wanted to try Neko, but
it looks complicated and I do not want to affect the results by using a
"robust" parser.

THX

-g-


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: HTML saga continues...

2002-12-12 Thread Erik Hatcher
Look in the Lucene sandbox in CVS.  I contributed an Ant task that 
indexes HTML documents.  It uses JTidy under the covers to parse HTML 
into title and body content, and it could be extended to pull other 
information such as meta keywords.
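
For readers who want to see the title/body split in code, here is a
minimal sketch. It is not the sandbox task's code and it does not use
JTidy: as an assumption made to keep it free of extra jars, it uses the
JDK's Swing HTML parser (option 2 earlier in this thread). The class
name TitleBodyExtractor and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: not part of Lucene or of the sandbox Ant task.
class TitleBodyExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder title = new StringBuilder();
    private final StringBuilder body = new StringBuilder();
    private boolean inTitle;
    private boolean inStyle;

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TITLE) inTitle = true;
        if (tag == HTML.Tag.STYLE) inStyle = true;
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if (tag == HTML.Tag.TITLE) inTitle = false;
        if (tag == HTML.Tag.STYLE) inStyle = false;
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Stylesheet text is not document content, so skip it.
        if (inStyle) return;
        StringBuilder target = inTitle ? title : body;
        // Separate text chunks with a space; normalize() collapses extras.
        target.append(data).append(' ');
    }

    String getTitle() { return normalize(title); }

    String getBody() { return normalize(body); }

    private static String normalize(StringBuilder sb) {
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    // Parses the HTML from the reader and returns the filled-in extractor.
    static TitleBodyExtractor parse(Reader in) throws IOException {
        TitleBodyExtractor callback = new TitleBodyExtractor();
        // true = ignore any charset declared in the document itself
        new ParserDelegator().parse(in, callback, true);
        return callback;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><head><title>Ant Manual</title></head>"
                + "<body><p>Copy task</p><p>Delete task</p></body></html>";
        TitleBodyExtractor doc = parse(new StringReader(html));
        System.out.println("title: " + doc.getTitle());
        System.out.println("body:  " + doc.getBody());
    }
}
```

Each of the two strings would then go into its own field of a Lucene
Document. Pulling meta keywords would presumably mean also overriding
handleSimpleTag and reading the tag's attributes, but that part is left
out here.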

	Erik



Re: HTML saga continues...

2002-12-12 Thread Erik Hatcher
On a related note, I've also released a project that I developed for my 
book and for presentations that I have been giving on Ant, XDoclet, and 
JUnit.  This project is a documentation search engine with a web 
(Struts) interface.  It uses Lucene and the Ant task I mentioned already 
to index a directory full of HTML and text files.  The sample data 
provided is Ant's documentation.

It's available as version 0.3 (currently, but always grab the latest 
that's there) at http://www.ehatchersolutions.com/downloads/

I have not documented it well yet, but that is my plan over the next 
couple of weeks.

To get it running you need:

- Ant 1.5.1 (1.5 is not sufficient)
- JUnit 3.8 or up (3.8.1 is the latest)
- j2ee.jar - I don't provide this in the download for size (and legal?) 
reasons.

Build it this way:

	ant -Dj2ee.jar=/path/to/my/j2ee.jar

Or if you run it without the -D switch it will tell you where to place 
j2ee.jar by default.  If you have J2EE_HOME set it will pick that up 
automatically and use it appropriately.

Deploy the WAR in a web container, or the EAR in JBoss.  Navigate to:

	http://localhost:8080/ant-sample/

and search for your favorite Ant tasks or Ant related information.

Let me know if you experience any issues with it, or have comments.

	Erik

Re: HTML saga continues...

2002-12-12 Thread Otis Gospodnetic
Yeah, Neko is not the most straightforward, but it works.
Sorry, the code is somewhere, but I can't look for it now.
But you could also look at LARM under Lucene Sandbox, it's got a nice
HTML parser, too.

Otis
