Re: HTML parser

2002-04-20 Thread [EMAIL PROTECTED]

Hi all,

I'm very interested in this thread. I also have to solve the problem 
of spidering web sites, creating an index (and here there is the 
BIG problem that Lucene can't be integrated easily with a DB), 
extracting links from each page, and repeating the whole process.

For extracting links from a page I'm thinking of using JTidy. With 
this library you can also parse a page that is not well formed (fetched 
from the web with URLConnection) by setting the property that cleans 
the page. The Tidy class returns an org.w3c.dom.Document that you can 
use to analyze the whole document: for example, you can call 
doc.getElementsByTagName("a") to get all the <a> elements. You can then 
process it as XML.
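
Roughly what I have in mind is the sketch below (untested; the Tidy
setters and the DOM walking would still need checking against the JTidy
docs, and the URL is just an example):

import java.io.InputStream;
import java.net.URL;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyLinkExtractor {

    public static void main(String[] args) throws Exception {
        // Fetch the (possibly badly formed) page over HTTP.
        InputStream in = new URL("http://www.example.com/").openStream();

        // Let JTidy clean the markup and hand back a DOM tree.
        Tidy tidy = new Tidy();
        tidy.setQuiet(true);
        tidy.setShowWarnings(false);
        Document doc = tidy.parseDOM(in, null);   // null: don't write the cleaned HTML anywhere
        in.close();

        // Walk every <a> element and print its href, if it has one.
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            String href = a.getAttribute("href");
            if (href.length() > 0) {
                System.out.println(href);
            }
        }
    }
}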

Has anyone solved the problem of spidering web pages recursively?

Laura




 
 While trying to research the same thing, I found the following... here's a
 good example of link extraction.
 
 Try http://www.quiotix.com/opensource/html-parser
 
 It's easy to write a Visitor which extracts the links; should take about ten
 lines of code.
 
 --
 Brian Goetz
 Quiotix Corporation
 [EMAIL PROTECTED]   Tel: 650-843-1300   Fax: 650-324-8032
 
 http://www.quiotix.com
 
 
 
 


RE: HTML parser

2002-04-19 Thread Mark Ayad

You can use the Swing HTML parser to do this, but it's only an HTML 3.2 DTD-based
parser.
I have written (attached) a total hack job for breaking up an HTML page into its
component parts; the code gives you an idea... If anyone wants to know how to use
the Swing-based parser, I can add some code.
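
For anyone curious about the Swing route in the meantime, it looks roughly
like the sketch below (a minimal example of the javax.swing.text.html.parser
callbacks, not the attached PageBreaker itself):

import java.io.FileReader;
import java.io.Reader;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingPageBreaker {

    public static void main(String[] args) throws Exception {
        Reader reader = new FileReader(args[0]);

        // The callback gets one event per start tag / end tag / run of text.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    System.out.println("link: " + attrs.getAttribute(HTML.Attribute.HREF));
                }
            }
            public void handleText(char[] data, int pos) {
                System.out.println("text: " + new String(data));
            }
        };

        // 'true' tells the parser to ignore any charset declaration and just parse.
        new ParserDelegator().parse(reader, callback, true);
    }
}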

Mark




-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 19 April 2002 07:29
To: [EMAIL PROTECTED]
Subject: HTML parser


Hello,

I need to select an HTML parser for the application that I'm writing
and I'm not sure what to choose.
The HTML parser included with Lucene looks flimsy, JTidy looks like a
hack and an overkill, using classes written for Swing
(javax.swing.text.html.parser) seems wrong, and I haven't tried David
McNicol's parser (included with Spindle).

Somebody on this list must have done some research on this subject.
Can anyone share some experiences?
Have you found a better HTML parser than any of those I listed above?
If your application deals with HTML, what do you use for parsing it?

Thanks,
Otis






[Attachment: PageBreaker.java]



RE: HTML parser

2002-04-19 Thread Ian Forsyth


Are there core classes in Lucene that allow one to feed Lucene links,
so that 'it' will capture the contents of those URLs into the index?

Or does one write a fetching class that retrieves each URL, stores the file in
a directory, and then indexes the local directory?

Ian


-Original Message-
From: Terence Parr [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 1:38 AM
To: Lucene Users List
Subject: Re: HTML parser



[snip]






RE: HTML parser

2002-04-19 Thread Otis Gospodnetic

Such classes are not included with Lucene.
This was _just_ mentioned on this list earlier today.
Look at the archives and search for crawler, URL, lucene sandbox, etc.
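
If you do roll your own, the version with no intermediate files is only a
handful of lines; roughly the sketch below. The IndexWriter/Field calls are
from memory of the current API, and note that it indexes the raw HTML, tags
and all, unless you strip them first with one of the parsers from this thread:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class UrlIndexer {

    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        for (int i = 0; i < args.length; i++) {
            // Pull the page straight off the wire; no local copy needed.
            StringBuffer page = new StringBuffer();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(args[i]).openStream()));
            for (String line; (line = in.readLine()) != null; ) {
                page.append(line).append('\n');
            }
            in.close();

            Document doc = new Document();
            doc.add(Field.Keyword("url", args[i]));            // stored, not tokenized
            doc.add(Field.Text("contents", page.toString()));  // tokenized, indexed, stored
            writer.addDocument(doc);
        }

        writer.optimize();
        writer.close();
    }
}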

Otis

--- Ian Forsyth [EMAIL PROTECTED] wrote:
 
 Are there core classes part of lucene that allow one to feed lucene
 links,
 and 'it' will capture the contents of those urls into the index..
 
 or does one write a file capture class to seek out the url store the
 file in
 a directory, then index the local directory..
 
 Ian
 
 
 [snip]
 






Re: HTML parser

2002-04-19 Thread David Black

While trying to research the same thing, I found the following...here's 
a good example of link extraction.

http://developer.java.sun.com/developer/TechTips/1999/tt0923.html

It seems like I could use this to also get the text out from between the 
tags but haven't been able to do it yet.  It seems like it should be 
simple but geez...my head hurts.
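
What I think it should boil down to is something like the sketch below (not
taken from the TechTip itself; just the JDK's HTMLEditorKit callbacks with a
flag that tracks whether we're inside an <a>, so each href comes out with its
text), but I haven't gotten it working yet:

import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class AnchorTextExtractor {

    public static void main(String[] args) throws Exception {
        Reader reader = new InputStreamReader(new URL(args[0]).openStream());

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private Object currentHref;    // href of the <a> we are currently inside, if any

            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    currentHref = attrs.getAttribute(HTML.Attribute.HREF);
                }
            }
            public void handleText(char[] data, int pos) {
                if (currentHref != null) {
                    // Text between <a href=...> and </a>: the link plus its label.
                    System.out.println(currentHref + " -> " + new String(data));
                }
            }
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A) {
                    currentHref = null;
                }
            }
        };

        new ParserDelegator().parse(reader, callback, true);
    }
}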






On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote:


 Are there core classes part of lucene that allow one to feed lucene 
 links,
 and 'it' will capture the contents of those urls into the index..

 or does one write a file capture class to seek out the url store the 
 file in
 a directory, then index the local directory..

 Ian


 [snip]







Re: HTML parser

2002-04-19 Thread Erik Hatcher

HttpUnit (which uses JTidy under the covers) makes child's play out of
pulling out links and navigating to them.

The only caveat (and this would be true for practically all tools, I
suspect) is that the HTML has to be relatively well-formed for it to work
well.  JTidy can be somewhat forgiving though.
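
From memory (so check the method names against the HttpUnit javadoc),
pulling the links out of a page looks about like this:

import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebResponse;

public class HttpUnitLinks {

    public static void main(String[] args) throws Exception {
        WebConversation conversation = new WebConversation();
        WebResponse response = conversation.getResponse("http://www.example.com/");

        // Every link HttpUnit (via JTidy) found in the page.
        WebLink[] links = response.getLinks();
        for (int i = 0; i < links.length; i++) {
            System.out.println(links[i].getURLString());
            // links[i].click() would follow the link and return the next WebResponse.
        }
    }
}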

Erik

- Original Message -
From: David Black [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, April 19, 2002 5:26 PM
Subject: Re: HTML parser


 While trying to research the same thing, I found the following...here's
 a good example of link extraction.

 http://developer.java.sun.com/developer/TechTips/1999/tt0923.html

 It seems like I could use this to also get the text out from between the
 tags but haven't been able to do it yet.  It seems like it should be
 simple but geez...my head hurts.






 [snip]








Re: HTML parser

2002-04-19 Thread Brian Goetz


While trying to research the same thing, I found the following...here's a 
good example of link extraction.

Try http://www.quiotix.com/opensource/html-parser

It's easy to write a Visitor which extracts the links; it should take about ten
lines of code.
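
Schematically it looks like the sketch below; the Tag and Visitor types there
are illustrative stand-ins rather than the actual classes in the package (see
the javadoc that ships with the parser for those):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/**
 * Generic illustration of the Visitor idea, not the real API: the parser
 * produces a stream of tag nodes, and a small visitor collects the href of
 * every anchor tag it is shown.
 */
public class LinkVisitorSketch {

    /** Stand-in for whatever tag node type the parser produces. */
    static class Tag {
        final String name;
        final String href;     // null when the tag has no href attribute
        Tag(String name, String href) { this.name = name; this.href = href; }
    }

    /** Stand-in for the parser's visitor interface. */
    interface Visitor {
        void visit(Tag tag);
    }

    /** The roughly ten lines that do the actual work. */
    static class LinkVisitor implements Visitor {
        final List links = new ArrayList();
        public void visit(Tag tag) {
            if ("a".equalsIgnoreCase(tag.name) && tag.href != null) {
                links.add(tag.href);
            }
        }
    }

    public static void main(String[] args) {
        // Pretend the parser handed us these tags for some page.
        Tag[] tags = {
            new Tag("html", null),
            new Tag("a", "http://jakarta.apache.org/lucene/"),
            new Tag("a", "http://www.quiotix.com/"),
        };
        LinkVisitor visitor = new LinkVisitor();
        for (int i = 0; i < tags.length; i++) {
            visitor.visit(tags[i]);
        }
        for (Iterator it = visitor.links.iterator(); it.hasNext(); ) {
            System.out.println(it.next());
        }
    }
}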



--
Brian Goetz
Quiotix Corporation
[EMAIL PROTECTED]   Tel: 650-843-1300   Fax: 650-324-8032

http://www.quiotix.com






Re: HTML parser

2002-04-18 Thread Terence Parr


On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:

 Hello,

 I need to select an HTML parser for the application that I'm writing
 and I'm not sure what to choose.
 The HTML parser included with Lucene looks flimsy, JTidy looks like a
 hack and an overkill, using classes written for Swing
 (javax.swing.text.html.parser) seems wrong, and I haven't tried David
 McNicol's parser (included with Spindle).

 Somebody on this list must have done some research on this subject.
 Can anyone share some experiences?
 Have you found a better HTML parser than any of those I listed above?
 If your application deals with HTML, what do you use for parsing it?

Hi Otis,

I have an HTML parser built for ANTLR, but it's pretty strict in what it 
accepts.  Not sure how useful it will be for you, but here it is:

http://www.antlr.org/grammars/HTML

I am not sure what your goal is, but I personally have to scarf all 
sorts of HTML from various websites to suck them into the jGuru search 
engine.  I use a simple stripHTML() method I wrote to handle it.  Works 
great.  Kills everything but the text.  Is that the kind of thing you 
are looking for, or do you really want to parse rather than just filter?
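
The basic shape of that kind of filter is a one-pass scan that throws away
anything between '<' and '>'.  The sketch below is schematic rather than the
actual jGuru code, and it ignores entities, comments, and script bodies:

public class HtmlStripper {

    /** Drop everything between '<' and '>' and return the remaining text. */
    public static String stripHTML(String html) {
        StringBuffer text = new StringBuffer(html.length());
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
                text.append(' ');        // keep words on either side separated
            } else if (!inTag) {
                text.append(c);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripHTML("<html><body><h1>Hi</h1>there!</body></html>"));
    }
}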

Terence
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org






Re: HTML parser

2002-04-18 Thread Otis Gospodnetic

Hello Terence,

Ah, you got me.
I guess I need a bit of both.
I need to just strip HTML and get the raw body text so that I can stick it
in Lucene's index.
I would also like something that can extract at least the
<title>...</title> stuff, so that I can stick that in a separate field
in the Lucene index.
While doing that I, like you, need to be able to handle poorly
formatted web pages.
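
Concretely, for each fetched page I'd like to end up building something like
this (Field usage from the current Lucene API; getTitle() and stripHTML() are
placeholders for whatever parser wins this thread):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class HtmlDocumentBuilder {

    /**
     * Build a Lucene Document with the page title and the stripped body text
     * in separate fields.
     */
    public static Document build(String rawHtml) {
        Document doc = new Document();
        doc.add(Field.Text("title", getTitle(rawHtml)));   // tokenized, indexed, stored
        doc.add(Field.Text("body", stripHTML(rawHtml)));   // tokenized, indexed, stored
        return doc;
    }

    private static String getTitle(String html) {
        return "";       // placeholder
    }

    private static String stripHTML(String html) {
        return html;     // placeholder
    }
}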

In the future I may need something that has the ability to extract HREFs,
but I'll stick to one of the XP principles and just look for something
that meets current needs :)

I looked for an ANTLR-based HTML parser a few days ago, but must have
missed the one you pointed out.  I'll take a look at it now.
Can you share or describe your stripHTML() method?  Simple Java that
looks for <'s and >'s, or something smarter?

Thanks,
Otis
P.S.
This type of thing makes me wish I could use Perl or Python :)


--- Terence Parr [EMAIL PROTECTED] wrote:
 
 On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
 
  [snip]
 
 Hi Otis,
 
 I have an HTML parser built for ANTLR, but it's pretty strict in what it
 accepts.  Not sure how useful it will be for you, but here it is:
 
 http://www.antlr.org/grammars/HTML
 
 I am not sure what your goal is, but I personally have to scarf all
 sorts of HTML from various websites to suck them into the jGuru search
 engine.  I use a simple stripHTML() method I wrote to handle it.  Works
 great.  Kills everything but the text.  Is that the kind of thing you
 are looking for, or do you really want to parse rather than just filter?
 
 Terence
 --
 Co-founder, http://www.jguru.com
 Creator, ANTLR Parser Generator: http://www.antlr.org
 
 
 

