Re: HTML parser
Hi all, I'm very interested in this thread. I also have to solve the problem of spidering web sites, creating an index (well, on that note there is the BIG problem that Lucene can't easily be integrated with a DB), extracting the links from each page, and then repeating the whole process. For extracting links from a page I'm thinking of using JTidy. With this library you can parse even a non-well-formed page (fetched from the web with URLConnection) by setting the property that cleans the page up. Tidy returns an org.w3c.dom.Document that you can use to analyze the whole document: for example, doc.getElementsByTagName("a") gives you all the a elements, and you can process the result as XML. Has anyone solved the problem of recursively spidering web pages? Laura -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
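Laura's approach — clean the page, then walk the resulting org.w3c.dom.Document — can be sketched with the standard DOM API. A minimal sketch, assuming the page has already been tidied into well-formed XHTML (e.g. by JTidy); a raw web page will usually not parse as XML without that cleanup step, and the class name and sample markup here are made up for illustration:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomLinkExtractor {

    // Collects the href of every <a> element in a page that has already
    // been cleaned into well-formed XHTML. Raw HTML from the web will
    // typically fail this parse, which is why the JTidy step matters.
    public static List<String> extractLinks(String xhtml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
            new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> links = new ArrayList<>();
        for (int i = 0; i < anchors.getLength(); i++) {
            links.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body>"
            + "<a href=\"http://a.example/\">one</a>"
            + "<p><a href=\"http://b.example/\">two</a></p>"
            + "</body></html>";
        System.out.println(extractLinks(xhtml));
        // prints [http://a.example/, http://b.example/]
    }
}
```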
RE: HTML parser
You can use the Swing HTML parser to do this, but it's only an HTML 3.2, DTD-based parser. I have written (attached) a total hack job for breaking up an HTML page into its component parts; the code gives you an idea. If anyone wants to know how to use the Swing-based parser, I can add some code. Mark -----Original Message----- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] Sent: 19 April 2002 07:29 To: [EMAIL PROTECTED] Subject: HTML parser [snip] PageBreaker.java Description: java/
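For anyone who does want to see the Swing-based parser in action, here is a minimal sketch of link extraction with javax.swing.text.html.parser — the class name and sample HTML are hypothetical, not Mark's attached code. Because it is a forgiving HTML 3.2 parser rather than a strict XML parser, it copes with unclosed tags and other sloppy markup:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingLinkExtractor {

    // Walks a page with the Swing HTML 3.2 parser and collects every
    // href found on an <a> tag.
    public static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Note the unclosed <p> and missing </body>: the parser doesn't mind.
        String html = "<html><body><p>See <a href=\"http://example.com/\">this</a>";
        System.out.println(extractLinks(html));
        // prints [http://example.com/]
    }
}
```

The same callback also receives handleText() calls, which is how one would pull out the body text for indexing.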
RE: HTML parser
Are there core classes in Lucene that allow one to feed it links so that it will capture the contents of those URLs into the index, or does one write a file-capture class that fetches each URL, stores the file in a directory, and then indexes the local directory? Ian -----Original Message----- From: Terence Parr [mailto:[EMAIL PROTECTED]] Sent: Friday, April 19, 2002 1:38 AM To: Lucene Users List Subject: Re: HTML parser [snip]
RE: HTML parser
Such classes are not included with Lucene. This was _just_ mentioned on this list earlier today; look in the archives and search for crawler, URL, Lucene Sandbox, etc. Otis --- Ian Forsyth [EMAIL PROTECTED] wrote: [snip]
Re: HTML parser
While trying to research the same thing, I found the following... here's a good example of link extraction: http://developer.java.sun.com/developer/TechTips/1999/tt0923.html It seems like I could also use this to get the text out from between the tags, but I haven't been able to do it yet. It seems like it should be simple, but geez... my head hurts. On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote: [snip]
Re: HTML parser
HttpUnit (which uses JTidy under the covers) makes child's play out of pulling out links and navigating to them. The only caveat (and this would be true for practically all tools, I suspect) is that the HTML has to be relatively well-formed for it to work well. JTidy can be somewhat forgiving, though. Erik ----- Original Message ----- From: David Black [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, April 19, 2002 5:26 PM Subject: Re: HTML parser [snip]
Re: HTML parser
While trying to research the same thing, I found the following... here's a good example of link extraction. Try http://www.quiotix.com/opensource/html-parser It's easy to write a Visitor which extracts the links; it should take about ten lines of code. -- Brian Goetz Quiotix Corporation [EMAIL PROTECTED] Tel: 650-843-1300 Fax: 650-324-8032 http://www.quiotix.com
Re: HTML parser
On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: Hello, I need to select an HTML parser for the application that I'm writing and I'm not sure what to choose. The HTML parser included with Lucene looks flimsy, JTidy looks like a hack and overkill, using classes written for Swing (javax.swing.text.html.parser) seems wrong, and I haven't tried David McNicol's parser (included with Spindle). Somebody on this list must have done some research on this subject. Can anyone share some experiences? Have you found a better HTML parser than any of those I listed above? If your application deals with HTML, what do you use for parsing it? Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to suck it into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text. Is that the kind of thing you are looking for, or do you really want to parse, not filter? Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org
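A stripHTML() like the one Terence describes can be approximated in a few lines of regex-based Java. This is a guess at the technique, not his actual code, and it deliberately does nothing clever: entities are left undecoded, and script/style bodies are dropped so their contents don't leak into the index:

```java
import java.util.regex.Pattern;

public class HtmlStripper {

    // Drop script/style elements including their bodies (case-insensitive,
    // dot matches newlines), then comments, then any remaining tags,
    // and finally collapse runs of whitespace.
    private static final Pattern SCRIPT_STYLE =
        Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    public static String stripHTML(String html) {
        String text = SCRIPT_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Hi</h1><p>there, <b>all</b></p></body></html>";
        System.out.println(stripHTML(html));
        // prints Hi there, all
    }
}
```

Kills everything but the text, which is usually exactly what an indexer wants; a real parser is only needed once you care about structure (titles, links, headings).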
Re: HTML parser
Hello Terence, Ah, you got me. I guess I need a bit of both. I need to strip HTML and get the raw body text so that I can stick it in Lucene's index. I would also like something that can extract at least the <title>...</title> stuff, so that I can stick that in a separate field in the Lucene index. While doing that I, like you, need to be able to handle poorly formatted web pages. In the future I may need something that can extract HREFs, but I'll stick to one of the XP principles and just look for something that meets current needs :) I looked for an ANTLR-based HTML parser a few days ago, but must have missed the one you pointed out. I'll take a look at it now. Can you share or describe your stripHTML method? Simple Java that looks for <'s and >'s, or something smarter? Thanks, Otis P.S. This type of thing makes me wish I could use Perl or Python :) --- Terence Parr [EMAIL PROTECTED] wrote: [snip]
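For the title-field requirement Otis mentions, a regex is often enough as a stopgap when full parsing is overkill. A hedged sketch (hypothetical helper, not from any of the parsers discussed in this thread) that grabs the first title element case-insensitively:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    // Grabs the contents of the first <title> element, if any.
    // Regex-based: fine for populating an index field, not a
    // substitute for a real parser on pathological markup.
    private static final Pattern TITLE =
        Pattern.compile("(?is)<title[^>]*>(.*?)</title>");

    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    public static void main(String[] args) {
        String html = "<HTML><HEAD><TITLE>Lucene FAQ</TITLE></HEAD><BODY>...</BODY></HTML>";
        System.out.println(extractTitle(html));
        // prints Lucene FAQ
    }
}
```

Paired with a stripHTML()-style body extractor, this covers both Lucene fields without committing to a heavyweight parser.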