Re: which HTML parser is better?
Oops. It's in the Google cache and also the Internet Archive Wayback Machine. I'll drop the original author a note to let him know that his links are stale:
http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

Ian

"Karl Koch" <[EMAIL PROTECTED]> writes:

> The link does not work.
>
> > One which we've been using can be found at:
> > http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
> >
> > We absolutely need to be able to recover gracefully from malformed
> > HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there
> > failed this criterion when we started our effort. The above one is
> > kind of SAX-y but doesn't fall over at the sight of a real web page ;-)
Re: which HTML parser is better?
The link does not work.

> One which we've been using can be found at:
> http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
>
> We absolutely need to be able to recover gracefully from malformed
> HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there
> failed this criterion when we started our effort. The above one is
> kind of SAX-y but doesn't fall over at the sight of a real web page ;-)
>
> Ian
Re: which HTML parser is better?
One which we've been using can be found at:
http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we started our effort. The above one is kind of SAX-y but doesn't fall over at the sight of a real web page ;-)

Ian
Re: which HTML parser is better?
For all of the parser suggestions I think there is one important attribute to consider. Some parsers return useful data only provided that the input HTML is sensible; others are designed to be as flexible and tolerant as possible. If the input is clean and controlled, the former class is sufficient -- even a regular expression may be enough (I think that is what the original poster wants). If you are building a web crawler, you need something really tolerant. I once prototyped a nice and fast parser; later I had to abandon it because it failed to parse about 15% of the documents (it had problems handling nested quotes like onclick="alert('hi')").

> No one has yet mentioned using ParserDelegator and ParserCallback, which are part of
> HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of
> an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various
> methods that are called when different tags are encountered.
>
> On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
>
> > Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser, JTidy) are
> > mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created
> > by MS Word's 'Save As HTML files' function?
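For readers who want to try the ParserDelegator/ParserCallback route, here is a minimal sketch. The Swing classes (ParserDelegator, HTMLEditorKit.ParserCallback, HTML.Tag) are the real API; the class name and the script/style handling below are illustrative choices, not code posted in this thread.

    import java.io.FileReader;
    import java.io.Reader;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    // Collects the visible text of an HTML document, skipping <script> and <style> content.
    public class TextExtractorCallback extends HTMLEditorKit.ParserCallback {

        private final StringBuffer text = new StringBuffer();
        private boolean skip = false; // true while inside <script> or <style>

        public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
            if (t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE) {
                skip = true;
            }
        }

        public void handleEndTag(HTML.Tag t, int pos) {
            if (t == HTML.Tag.SCRIPT || t == HTML.Tag.STYLE) {
                skip = false;
            }
        }

        public void handleText(char[] data, int pos) {
            if (!skip) {
                text.append(data).append(' ');
            }
        }

        public String getText() {
            return text.toString();
        }

        public static void main(String[] args) throws Exception {
            TextExtractorCallback callback = new TextExtractorCallback();
            Reader reader = new FileReader(args[0]);
            // The third argument tells the parser to ignore the charset declared in the file.
            new ParserDelegator().parse(reader, callback, true);
            reader.close();
            System.out.println(callback.getText());
        }
    }

Note that this depends on Swing, so it will not help with the Java 1.1 PDA constraint that comes up elsewhere in the thread.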
Re: which HTML parser is better? - Thread closed
Thank you, I will do that.

> Karl Koch wrote:
>
> > I apologise in advance if some of my writing here has been said before. The last three
> > answers to my question have been suggesting pattern matching solutions and Swing. Pattern
> > matching was introduced in Java 1.4, and Swing is something I cannot use since I work with
> > Java 1.1 on a PDA.
>
> I see. In this case you can read your HTML file line by line and then write something like
> this:
>
>     String line;
>     int startPos, endPos;
>     StringBuffer text = new StringBuffer();
>     while ((line = reader.readLine()) != null) {
>         startPos = line.indexOf(">");
>         endPos = line.indexOf("<", startPos);
>         if (startPos >= 0 && endPos > startPos)
>             text.append(line.substring(startPos + 1, endPos));
>     }
>
> This is just sample code that should work if you have just one tag per line in the HTML file.
> It can be a starting point for you.
>
> Hope it helps,
>
> Best,
>
> Sergiu
>
> > I am wondering if somebody knows a piece of simple source code with low requirements which
> > would run under this tight specification.
> >
> > Thank you all,
> > Karl
Re: which HTML parser is better?
Karl,

Two things; try to experiment with both:

1) I would try to write a lexical scanner that strips HTML tags, much like the regular expression does. Java lexical scanner generators produce nice pure Java classes that seldom use any advanced API, so they should work on Java 1.1. They are simple state machines with states encoded in integers -- this should work like a charm, and be fast and small.

2) Write a parser yourself. Given a regular expression, it isn't that difficult to do... :)

D.

Karl Koch wrote:

> I apologise in advance if some of my writing here has been said before. The last three answers
> to my question have been suggesting pattern matching solutions and Swing. Pattern matching was
> introduced in Java 1.4, and Swing is something I cannot use since I work with Java 1.1 on a
> PDA.
>
> I am wondering if somebody knows a piece of simple source code with low requirements which
> would run under this tight specification.
>
> Thank you all,
> Karl
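As a concrete illustration of the hand-rolled state machine suggested in point 1, here is a minimal sketch that uses only java.io and should therefore compile under Java 1.1. The class name is invented for the example, and it deliberately ignores complications such as '>' characters inside quoted attribute values and the contents of script or style blocks.

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.io.StringWriter;
    import java.io.Writer;

    // A tiny two-state machine that copies everything outside of '<'...'>' to the output.
    public class TagStripper {

        private static final int TEXT = 0;   // outside a tag
        private static final int IN_TAG = 1; // between '<' and '>'

        public static void strip(Reader in, Writer out) throws IOException {
            int state = TEXT;
            int c;
            while ((c = in.read()) != -1) {
                if (state == TEXT) {
                    if (c == '<') {
                        state = IN_TAG;
                    } else {
                        out.write(c);
                    }
                } else { // IN_TAG: swallow characters until the closing '>'
                    if (c == '>') {
                        state = TEXT;
                    }
                }
            }
        }

        public static void main(String[] args) throws IOException {
            StringWriter out = new StringWriter();
            strip(new StringReader("<p>Hello <b>world</b></p>"), out);
            System.out.println(out.toString()); // prints "Hello world"
        }
    }

Because it reads character by character from a stream, it also matches the "stream of HTML in, stream of text out" requirement mentioned later in the thread, and it compiles to a very small class file.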
Re: which HTML parser is better?
Karl Koch wrote:

> I apologise in advance if some of my writing here has been said before. The last three answers
> to my question have been suggesting pattern matching solutions and Swing. Pattern matching was
> introduced in Java 1.4, and Swing is something I cannot use since I work with Java 1.1 on a
> PDA.

I see. In this case you can read your HTML file line by line and then write something like this:

    String line;
    int startPos, endPos;
    StringBuffer text = new StringBuffer();
    while ((line = reader.readLine()) != null) {
        startPos = line.indexOf(">");
        endPos = line.indexOf("<", startPos);
        if (startPos >= 0 && endPos > startPos)
            text.append(line.substring(startPos + 1, endPos));
    }

This is just sample code that should work if you have just one tag per line in the HTML file. It can be a starting point for you.

Hope it helps,

Best,

Sergiu

> I am wondering if somebody knows a piece of simple source code with low requirements which
> would run under this tight specification.
>
> Thank you all,
> Karl
Re: which HTML parser is better?
I am using Java 1.1 with a Sharp Zaurus PDA, so I have very tight memory constraints. I do not think CPU performance is a big issue, though. But I have other parts in my application which use quite a lot of memory, and things run short. I am therefore not looking into solutions which build up tag trees and the like -- more a solution which reads a stream of HTML and transforms it into a stream of text.

I see your point about using an external program, but I am not entirely sure one is available. It would also be much simpler to have a 3-5 kB solution in Java, perhaps encapsulated in a class which does the job without the need for advanced libraries that take 100-200 kB of my internal storage.

I hope I could clarify my situation now.

Cheers,
Karl

> Karl Koch wrote:
>
> > Hello Sergiu,
> >
> > thank you for your help so far. I appreciate it.
> >
> > I am working with Java 1.1, which does not include regular expressions.
>
> Why are you using Java 1.1? Are you so limited in resources? What operating system do you use?
>
> I assume that you just need to index the HTML files, so you need an html2txt conversion. If an
> external converter is a solution for you, you can use Runtime.getRuntime().exec(...) to run a
> converter that will extract the information from your HTMLs and generate a .txt file. Then you
> can use a reader to index the txt.
>
> As I told you before, the best solution depends on your constraints (time, effort, hardware,
> performance) and requirements :)
>
> Best,
>
> Sergiu
Re: which HTML parser is better?
Karl Koch wrote:

> Unfortunately I am faithful ;-). Just for practical reasons I want to do that in a single
> class, or even a method, called by another part of my Java application. It should also run on
> Java 1.1, and it should be small and simple. As I said before, I am in control of the HTML and
> it will be well formatted, because I generate it from XML using XSLT.
>
> Karl
>
> > If you are not married to Java:
> > http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
> >
> > Otis

Why don't you get the data directly from the XML files? You can use a SAX parser, ... but I think it will require Java 1.3, or at least 1.2.2.

Best,

Sergiu
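A minimal sketch of the "go straight to the XML" idea: a SAX handler that collects all character data from the source document. It assumes a JAXP/SAX implementation is on the classpath (which, as noted above, rules out a bare Java 1.1 runtime), and the class name is illustrative.

    import java.io.File;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.helpers.DefaultHandler;

    // Collects all character data from an XML document -- the same text that would end up
    // between the HTML tags after the XSLT step.
    public class XmlTextHandler extends DefaultHandler {

        private final StringBuffer text = new StringBuffer();

        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length).append(' ');
        }

        public String getText() {
            return text.toString();
        }

        public static void main(String[] args) throws Exception {
            XmlTextHandler handler = new XmlTextHandler();
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new File(args[0]), handler);
            System.out.println(handler.getText());
        }
    }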
Re: which HTML parser is better?
I apologise in advance if some of my writing here has been said before. The last three answers to my question have been suggesting pattern matching solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing is something I cannot use since I work with Java 1.1 on a PDA.

I am wondering if somebody knows a piece of simple source code with low requirements which would run under this tight specification.

Thank you all,
Karl

> No one has yet mentioned using ParserDelegator and ParserCallback, which are part of
> HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of
> an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various
> methods that are called when different tags are encountered.
>
> --
> Bill Tschumy
> Otherwise -- Austin, TX
> http://www.otherwise.com
Re: which HTML parser is better?
Karl Koch wrote:

> Hello Sergiu,
>
> thank you for your help so far. I appreciate it.
>
> I am working with Java 1.1, which does not include regular expressions.
>
> Your turn ;-)
> Karl

Why are you using Java 1.1? Are you so limited in resources? What operating system do you use?

I assume that you just need to index the HTML files, so you need an html2txt conversion. If an external converter is a solution for you, you can use Runtime.getRuntime().exec(...) to run the converter that will extract the information from your HTMLs and generate a .txt file. Then you can use a reader to index the txt.

As I told you before, the best solution depends on your constraints (time, effort, hardware, performance) and requirements :)

Best,

Sergiu
Re: which HTML parser is better?
Unfortunately I am faithful ;-). Just for practical reasons I want to do that in a single class, or even a method, called by another part of my Java application. It should also run on Java 1.1, and it should be small and simple. As I said before, I am in control of the HTML and it will be well formatted, because I generate it from XML using XSLT.

Karl

> If you are not married to Java:
> http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
>
> Otis
Re: which HTML parser is better?
Hello Sergiu,

thank you for your help so far. I appreciate it.

I am working with Java 1.1, which does not include regular expressions.

Your turn ;-)
Karl

> Karl Koch wrote:
>
> > I am in control of the html, which means it is well formatted HTML. I use only HTML files
> > which I have transformed from XML. No external HTML (e.g. the web).
> >
> > Are there any very-short solutions for that?
>
> If you are using only correctly formatted HTML pages and you are in control of these pages,
> you can use a regular expression to remove the tags, something like
>
>     replaceAll("<*>", "");
>
> This is the idea behind the operation. If you search on Google you will find a more robust
> regular expression.
>
> Using a simple regular expression will be a very cheap solution that can cause you a lot of
> problems in the future. It's up to you to use it.
>
> Best,
>
> Sergiu
Re: which HTML parser is better?
Another very cheap but robust solution, in case you use Linux, is to have lynx parse your pages:

    lynx -dump page.html > page.txt

This strips out all HTML, including script, style and csimport tags, and you end up with a .txt file ready for indexing.

Best,

Sergiu

Kauler, Leto S wrote:

> We index the content from HTML files, and because we only want the "good" text and do not care
> about the structure, well-formedness, etc., we went with regular expressions similar to what
> Luke Shannon offered. The only real difference is that we firstly remove entire blocks of
> (script|style|csimport) and similar, since the contents of those are not useful for keyword
> searching, and afterward just remove every leftover HTML tag. I have been meaning to add an
> expression to extract things like alt attribute text, though.
>
> --Leto
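If the external-converter route is acceptable, the lynx approach above can be driven from Java roughly as follows. This is only a sketch: it assumes a lynx binary on the PATH, it ignores lynx's error stream, and the class name is invented for the example.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Runs "lynx -dump" on an HTML file and collects the rendered plain text.
    public class LynxDump {

        public static String dump(String htmlFile) throws Exception {
            Process p = Runtime.getRuntime().exec(new String[] { "lynx", "-dump", htmlFile });
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(p.getInputStream()));
            StringBuffer text = new StringBuffer();
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n');
            }
            reader.close();
            p.waitFor(); // wait for lynx to finish before returning
            return text.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(dump(args[0]));
        }
    }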
Re: which HTML parser is better?
No one has yet mentioned using ParserDelegator and ParserCallback, which are part of HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called when different tags are encountered.

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:

> Three HTML parsers (the Lucene web application demo, CyberNeko HTML Parser, JTidy) are
> mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created
> by MS Word's 'Save As HTML files' function?

--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com
RE: which HTML parser is better?
We index the content from HTML files, and because we only want the "good" text and do not care about the structure, well-formedness, etc., we went with regular expressions similar to what Luke Shannon offered. The only real difference is that we firstly remove entire blocks of (script|style|csimport) and similar, since the contents of those are not useful for keyword searching, and afterward just remove every leftover HTML tag. I have been meaning to add an expression to extract things like alt attribute text, though.

--Leto

> -----Original Message-----
> From: Karl Koch [mailto:[EMAIL PROTECTED]
>
> I have been following this thread and have another question.
>
> Is there a piece of source code (which is preferably very short and simple (KISS)) which
> allows me to remove all HTML tags from HTML content? HTML 3.2 would be enough... also no
> frames, CSS, etc.
>
> I do not need to have the HTML structure tree or any other structure, but I need a facility to
> clean up HTML into its normal underlying content before indexing that content as a whole.
>
> Karl
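Here is a sketch of the two-pass approach described above, for anyone on Java 1.4+ where java.util.regex is available. The class name and the exact patterns are my own illustration, not the expressions used in Leto's application, and they remain as fragile as any regex-based HTML handling.

    import java.util.regex.Pattern;

    // Strips HTML down to its text content in two passes: first drop whole script/style
    // blocks, then drop every leftover tag (plus a crude pass over character entities).
    public class HtmlTextFilter {

        // (?is) = case-insensitive and DOTALL, so a block pattern can cross line breaks.
        private static final Pattern BLOCKS =
                Pattern.compile("(?is)<(script|style)[^>]*>.*?</\\1>");
        private static final Pattern TAGS = Pattern.compile("<[^>]*>");
        private static final Pattern ENTITIES = Pattern.compile("&[a-zA-Z#0-9]+;");

        public static String strip(String html) {
            String text = BLOCKS.matcher(html).replaceAll(" ");
            text = TAGS.matcher(text).replaceAll(" ");
            text = ENTITIES.matcher(text).replaceAll(" ");
            return text.replaceAll("\\s+", " ").trim();
        }

        public static void main(String[] args) {
            String html = "<html><head><style>p { color: red; }</style></head>"
                    + "<body><p>Hello &amp; <b>welcome</b></p>"
                    + "<script>var x = '<p>';</script></body></html>";
            System.out.println(strip(html)); // prints "Hello welcome"
        }
    }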
Re: which HTML parser is better?
In our application I use regular expressions to strip all tags in one situation, and only specific ones in another. Here is sample code for both.

This strips all HTML 4.0 tags except the handful we want to keep:

    html_source = Pattern.compile("", Pattern.CASE_INSENSITIVE)
            .matcher(html_source).replaceAll("");

When I want to strip anything in a tag, I use the following pattern with the code above:

    String strPattern1 = "<\\s?(.|\n)*?\\s?>";

HTH

Luke

----- Original Message -----
From: "sergiu gordea" <[EMAIL PROTECTED]>
To: "Lucene Users List"
Sent: Wednesday, February 02, 2005 1:23 PM
Subject: Re: which HTML parser is better?

> Karl Koch wrote:
>
> > I am in control of the html, which means it is well formatted HTML. I use only HTML files
> > which I have transformed from XML. No external HTML (e.g. the web).
> >
> > Are there any very-short solutions for that?
>
> If you are using only correctly formatted HTML pages and you are in control of these pages,
> you can use a regular expression to remove the tags, something like replaceAll("<*>", "");
>
> This is the idea behind the operation. If you search on Google you will find a more robust
> regular expression.
>
> Using a simple regular expression will be a very cheap solution that can cause you a lot of
> problems in the future. It's up to you to use it.
>
> Best,
>
> Sergiu
Re: which HTML parser is better?
If you are not married to Java:
http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm

Otis

--- sergiu gordea <[EMAIL PROTECTED]> wrote:

> Karl Koch wrote:
>
> > I am in control of the html, which means it is well formatted HTML. I use only HTML files
> > which I have transformed from XML. No external HTML (e.g. the web).
> >
> > Are there any very-short solutions for that?
>
> If you are using only correctly formatted HTML pages and you are in control of these pages,
> you can use a regular expression to remove the tags, something like replaceAll("<*>", "");
>
> This is the idea behind the operation. If you search on Google you will find a more robust
> regular expression.
Re: which HTML parser is better?
Karl Koch wrote:

> I am in control of the html, which means it is well formatted HTML. I use only HTML files
> which I have transformed from XML. No external HTML (e.g. the web).
>
> Are there any very-short solutions for that?

If you are using only correctly formatted HTML pages and you are in control of these pages, you can use a regular expression to remove the tags, something like

    replaceAll("<*>", "");

This is the idea behind the operation. If you search on Google you will find a more robust regular expression.

Using a simple regular expression will be a very cheap solution that can cause you a lot of problems in the future. It's up to you to use it.

Best,

Sergiu
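To make the "more robust regular expression" hint concrete: a commonly used variant matches a '<', then any run of characters that are not '>', then the closing '>'. The snippet below is my own minimal sketch (and it needs Java 1.4+ for String.replaceAll), not code posted in this thread.

    public class SimpleTagRegex {
        public static void main(String[] args) {
            String html = "<p lang=\"en\">Hello <b>world</b></p>";
            // "<[^>]*>" matches a complete tag, unlike "<*>" (which only matches runs of '<'
            // followed by '>'); it still mishandles '>' inside attribute values and keeps the
            // contents of script/style blocks.
            String text = html.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
            System.out.println(text); // prints "Hello world"
        }
    }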
Re: which HTML parser is better?
I am in control of the html, which means it is well formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web).

Are there any very-short solutions for that?

Karl

> Karl Koch wrote:
>
> > Hi,
> >
> > yes, but the library you are using is quite big. I was thinking that 5 kB of code could
> > actually do that. That sourceforge project is doing much more than that, but I do not need
> > it.
>
> You need just the htmlparser.jar, 200k. ... you know ... the functionality is strongly
> correlated with the size.
>
> You can use 3 lines of code with a good regular expression to eliminate the html tags, but
> this won't give you any guarantee that the text from badly formatted html files will be
> correctly extracted...
>
> Best,
>
> Sergiu
Re: which HTML parser is better?
Karl Koch wrote:

> Hi,
>
> yes, but the library you are using is quite big. I was thinking that 5 kB of code could
> actually do that. That sourceforge project is doing much more than that, but I do not need it.

You need just the htmlparser.jar, 200k. ... you know ... the functionality is strongly correlated with the size.

You can use 3 lines of code with a good regular expression to eliminate the html tags, but this won't give you any guarantee that the text from badly formatted html files will be correctly extracted...

Best,

Sergiu
Re: which HTML parser is better?
Hi,

yes, but the library you are using is quite big. I was thinking that 5 kB of code could actually do that. That sourceforge project is doing much more than that, but I do not need it.

Karl

> Hi Karl,
>
> I already submitted a piece of code that removes the html tags. Search for my previous answer
> in this thread.
>
> Best,
>
> Sergiu
Re: which HTML parser is better?
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:

> Hello,
>
> I have been following this thread and have another question.
>
> Is there a piece of source code (which is preferably very short and simple (KISS)) which
> allows me to remove all HTML tags from HTML content? HTML 3.2 would be enough... also no
> frames, CSS, etc.
>
> I do not need to have the HTML structure tree or any other structure, but I need a facility to
> clean up HTML into its normal underlying content before indexing that content as a whole.

The code in the Lucene Sandbox for parsing HTML with JTidy (under contributions/ant) does what you ask.

Erik
Re: which HTML parser is better?
Hi Karl, I already submitted a piece of code that removes the HTML tags. Search for my previous answer in this thread. Best, Sergiu Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of source code (which is preferably very short and simple (KISS)) which allows you to remove all HTML tags from HTML content? HTML 3.2 would be enough...also no frames, CSS, etc. I do not need to have the HTML structure tree or any other structure but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole. Karl I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well -- based on its UI, it appears to be focused primarily on HTML validation and error detection/correction. I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far. Chuck > -Original Message- > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > Sent: Tuesday, February 01, 2005 1:15 AM > To: lucene-user@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can it filter tags that are > auto-created by MS-word 'Save As HTML files' function? > > _ > Do You Yahoo!? > 150万曲MP3疯狂搜，带您闯入音乐殿堂 > http://music.yisou.com/ > 美女明星应有尽有，搜遍美图、艳图和酷图 > http://image.yisou.com > 1G就是1000兆，雅虎电邮自助扩容！ > http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/ma > il_1g/ > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: which HTML parser is better?
Hello, I have been following this thread and have another question. Is there a piece of source code (which is preferably very short and simple (KISS)) which allows you to remove all HTML tags from HTML content? HTML 3.2 would be enough...also no frames, CSS, etc. I do not need to have the HTML structure tree or any other structure but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole. Karl > I think that depends on what you want to do. The Lucene demo parser does > simple mapping of HTML files into Lucene Documents; it does not give you a > parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the > same API; will likely become part of Xerces), and so maps an HTML document > into a full DOM that you can manipulate easily for a wide range of > purposes. I haven't used JTidy at an API level and so don't know it as well -- > based on its UI, it appears to be focused primarily on HTML validation and > error detection/correction. > > I use CyberNeko for a range of operations on HTML documents that go beyond > indexing them in Lucene, and really like it. It has been robust for me so > far. > > Chuck > > > -Original Message- > > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > > Sent: Tuesday, February 01, 2005 1:15 AM > > To: lucene-user@jakarta.apache.org > > Subject: which HTML parser is better? > > > > Three HTML parsers(Lucene web application > > demo,CyberNeko HTML Parser,JTidy) are mentioned in > > Lucene FAQ > > 1.3.27.Which is the best?Can it filter tags that are > > auto-created by MS-word 'Save As HTML files' function? > > > > _ > > Do You Yahoo!? > > 150万曲MP3疯狂搜，带您闯入音乐殿堂 > > http://music.yisou.com/ > > 美女明星应有尽有，搜遍美图、艳图和酷图 > > http://image.yisou.com > > 1G就是1000兆，雅虎电邮自助扩容！ > > > http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/ma > > il_1g/ > > > > - > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: [EMAIL PROTECTED] > > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > -- GMX im TV ... Die Gedanken sind frei ... Schon gesehen? Jetzt Spot online ansehen: http://www.gmx.net/de/go/tv-spot - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: which HTML parser is better?
I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well -- based on its UI, it appears to be focused primarily on HTML validation and error detection/correction. I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far. Chuck > -Original Message- > From: Jingkang Zhang [mailto:[EMAIL PROTECTED] > Sent: Tuesday, February 01, 2005 1:15 AM > To: lucene-user@jakarta.apache.org > Subject: which HTML parser is better? > > Three HTML parsers(Lucene web application > demo,CyberNeko HTML Parser,JTidy) are mentioned in > Lucene FAQ > 1.3.27.Which is the best?Can it filter tags that are > auto-created by MS-word 'Save As HTML files' function? > > _ > Do You Yahoo!? > 150万曲MP3疯狂搜,带您闯入音乐殿堂 > http://music.yisou.com/ > 美女明星应有尽有,搜遍美图、艳图和酷图 > http://image.yisou.com > 1G就是1000兆,雅虎电邮自助扩容! > http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/ma > il_1g/ > > - > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
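As a rough sketch of the CyberNeko route Chuck describes (assuming the NekoHTML and Xerces jars on the classpath; the walk helper is an invented utility, not part of NekoHTML):

import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NekoTextExample {

    public static void main(String[] args) throws Exception {
        DOMParser parser = new DOMParser();
        // NekoHTML balances and repairs the tags, then exposes the page
        // through the same DOM API as Xerces.
        parser.parse(new InputSource(args[0]));
        Document doc = parser.getDocument();
        StringBuffer text = new StringBuffer();
        walk(doc, text);
        System.out.println(text.toString());
    }

    // Illustrative helper: concatenate the content of all text nodes.
    private static void walk(Node node, StringBuffer out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            walk(child, out);
        }
    }
}

From the resulting DOM it is straightforward to collect the text into a Lucene Document field, or to drop the mso- styled spans and <o:p> elements that Word's 'Save As HTML' tends to emit.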
Re: which HTML parser is better?
When I tested parsers a year or so ago for intensive use in Furl, the best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page) parser by far was TagSoup ( http://www.tagsoup.info ). It is actively maintained and improved and I have never had any problems with it. -Mike Jingkang Zhang wrote: >Three HTML parsers(Lucene web application >demo,CyberNeko HTML Parser,JTidy) are mentioned in >Lucene FAQ >1.3.27.Which is the best?Can it filter tags that are >auto-created by MS-word 'Save As HTML files' function? > >_ >Do You Yahoo!? >150万曲MP3疯狂搜,带您闯入音乐殿堂 >http://music.yisou.com/ >美女明星应有尽有,搜遍美图、艳图和酷图 >http://image.yisou.com >1G就是1000兆,雅虎电邮自助扩容! >http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/ > >- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
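For anyone curious what driving TagSoup looks like, here is a minimal sketch (assuming the tagsoup jar on the classpath; the handler simply accumulates character data, so script and style content would need extra filtering):

import java.io.FileReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupTextExample {

    public static void main(String[] args) throws Exception {
        final StringBuffer text = new StringBuffer();
        XMLReader reader = new Parser();    // TagSoup implements XMLReader
        reader.setContentHandler(new DefaultHandler() {
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        // TagSoup streams SAX events for whatever it can repair instead
        // of failing on malformed markup.
        reader.parse(new InputSource(new FileReader(args[0])));
        System.out.println(text.toString());
    }
}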
Re: which HTML parser is better?
Jingkang Zhang wrote: >Three HTML parsers(Lucene web application >demo,CyberNeko HTML Parser,JTidy) are mentioned in >Lucene FAQ >1.3.27.Which is the best?Can it filter tags that are >auto-created by MS-word 'Save As HTML files' function? > > Maybe you can try this library... http://htmlparser.sourceforge.net/ I use the following code to get the text from HTML files; it has not been intensively tested, but it works. import org.htmlparser.Node; import org.htmlparser.Parser; import org.htmlparser.util.NodeIterator; import org.htmlparser.util.Translate; Parser parser = new Parser(source.getAbsolutePath()); NodeIterator iter = parser.elements(); while (iter.hasMoreNodes()) { Node element = (Node) iter.nextNode(); //System.out.println("1:" + element.getText()); String text = Translate.decode(element.toPlainTextString()); if (Utils.notEmptyString(text)) writer.write(text); } Sergiu >_ >Do You Yahoo!? >150万曲MP3疯狂搜，带您闯入音乐殿堂 >http://music.yisou.com/ >美女明星应有尽有，搜遍美图、艳图和酷图 >http://image.yisou.com >1G就是1000兆，雅虎电邮自助扩容！ >http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/ > >- >To unsubscribe, e-mail: [EMAIL PROTECTED] >For additional commands, e-mail: [EMAIL PROTECTED] > > > - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
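Sergiu's snippet relies on a few pieces from his own project (source looks like a java.io.File, writer like a java.io.Writer, and Utils.notEmptyString is a local helper), so the following is a self-contained variant of the same idea. It is only a sketch of the htmlparser.sourceforge.net usage shown above, with those project-specific pieces replaced by plain JDK calls.

import java.io.FileWriter;
import java.io.Writer;
import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.Translate;

public class HtmlParserTextExample {

    public static void main(String[] args) throws Exception {
        // args[0]: path to the HTML file, args[1]: path for the plain text
        Parser parser = new Parser(args[0]);
        Writer writer = new FileWriter(args[1]);
        NodeIterator iter = parser.elements();
        while (iter.hasMoreNodes()) {
            Node element = iter.nextNode();
            // decode() turns entities like &amp; back into characters
            String text = Translate.decode(element.toPlainTextString());
            if (text.trim().length() > 0) {    // stands in for Utils.notEmptyString
                writer.write(text);
            }
        }
        writer.close();
    }
}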