The problem of using the CyberNeko HTML Parser to parse HTML files
When I was using the CyberNeko HTML Parser to parse HTML files (created by Microsoft Word), if the file contains HTML built-in entity references (for example: &nbsp;), the node value may contain unknown characters. Like this:

source html: <DIV><P class=MsoNormal style="MARGIN: 0cm 0cm 0pt 18pt"><SPAN lang=EN-US style="mso-bidi-font-size: 10.5pt"><FONT face="Times New Roman"><FONT size=3>-rw-r--r--<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp; </SPAN>1 root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp; </SPAN>root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN>50 Jan 21 16:12 _1e.f6<o:p></o:p></FONT></FONT></SPAN></P></DIV>

after parsing html: -rw-r--r--??1 root?? root? 50 Jan 21 16:12 _1e.f6 (each ? is an unexpected character)

How can I avoid it?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: The problem of using the CyberNeko HTML Parser to parse HTML files
This is not an unknown character; it is a non-breaking space (Unicode value 0x00A0).

- Original Message - From: Jingkang Zhang [EMAIL PROTECTED] To: lucene-user@jakarta.apache.org Sent: Friday, February 18, 2005 5:12 PM Subject: The problem of using the CyberNeko HTML Parser to parse HTML files
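A minimal way to deal with this after parsing is to normalize the non-breaking spaces back to ordinary spaces before indexing. This is a sketch using only the standard library, not part of NekoHTML itself; the class name is made up for illustration:

```java
public class NbspNormalizer {
    // &nbsp; entities come out of the parser as real non-breaking
    // space characters (U+00A0). Code that expects ASCII spaces can
    // normalize them like this.
    public static String normalize(String text) {
        return text.replace('\u00A0', ' ');
    }

    public static void main(String[] args) {
        // The "unknown characters" from the original post are U+00A0.
        System.out.println(normalize("-rw-r--r--\u00A0\u00A01 root"));
    }
}
```

Whether you want this depends on the application: for indexing, collapsing U+00A0 to a plain space is usually what the analyzer expects.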
Re: Re: The problem of using the CyberNeko HTML Parser to parse HTML files
Thank you. But how can I view the correct output? If my HTML files use different encodings (like UTF-8, ISO-8859-1, GBK, JIS, etc.), how can I handle them? --- Jason Polites wrote: This is not an unknown character; it is a non-breaking space (Unicode value 0x00A0).
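For the encoding question: decode the bytes with the file's declared charset before handing characters to the parser. A sketch using only the standard library (the helper name is invented; NekoHTML itself can also be fed a character stream through the usual Xerces/SAX input-source mechanism):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class EncodingAwareReader {
    // Decode an HTML stream with an explicit charset (UTF-8,
    // ISO-8859-1, GBK, Shift_JIS, ...) so that multi-byte text
    // survives intact instead of turning into mojibake.
    public static String read(InputStream in, String charset) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(in, charset));
        StringBuffer sb = new StringBuffer();
        int c;
        while ((c = r.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Bytes encoded as UTF-8 must be decoded as UTF-8.
        byte[] utf8 = "caf\u00E9".getBytes("UTF-8");
        System.out.println(read(new ByteArrayInputStream(utf8), "UTF-8"));
    }
}
```

The charset usually comes from the HTTP Content-Type header or the document's own meta tag; decoding with the wrong one is what produces the ? characters described above.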
Re: which HTML parser is better?
The link does not work.

Ian wrote: One which we've been using can be found at: http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we started our effort. The above one is kind of SAX-y but doesn't fall over at the sight of a real web page ;-) Ian
Re: which HTML parser is better?
Oops. It's in the Google cache and also the Internet Archive Wayback Machine. I'll drop the original author a note to let him know that his links are stale. http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ Ian

Karl Koch [EMAIL PROTECTED] writes: The link does not work. One which we've been using can be found at: http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we started our effort. The above one is kind of SAX-y but doesn't fall over at the sight of a real web page ;-)
Re: which HTML parser is better?
Hello Sergiu, thank you for your help so far. I appreciate it. I am working with Java 1.1, which does not include regular expressions. Your turn ;-) Karl

Karl Koch wrote: I am in control of the HTML, which means it is well-formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web). Are there any very short solutions for that?

if you are using only correctly formatted HTML pages and you are in control of these pages, you can use a regular expression to remove the tags, something like replaceAll("<[^>]*>", ""). This is the idea behind the operation. If you search on Google you will find a more robust regular expression. Using a simple regular expression will be a very cheap solution that can cause you a lot of problems in the future. It's up to you to use it. Best, Sergiu

Karl Koch wrote: Hi, yes, but the library you are using is quite big. I was thinking that 5 kB of code could actually do that. That SourceForge project is doing much more than that, but I do not need it.

you need just the htmlparser.jar (200k) ... you know ... the functionality is strongly correlated with the size. You can use 3 lines of code with a good regular expression to eliminate the HTML tags, but this won't give you any guarantee that the text from badly formatted HTML files will be correctly extracted... Best, Sergiu

Hi Karl, I already submitted a piece of code that removes the HTML tags. Search for my previous answer in this thread. Best, Sergiu

Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of source code (which is preferably very short and simple (KISS)) which allows removing all HTML tags from HTML content? HTML 3.2 would be enough... also no frames, CSS, etc. I do not need the HTML structure tree or any other structure, but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole. Karl

I think that depends on what you want to do.
The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well -- based on its UI, it appears to be focused primarily on HTML validation and error detection/correction. I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far. Chuck

-Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function?
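The replaceAll call suggested in this thread lost its angle brackets to the mail archive; a plausible reconstruction of the intended pattern (a guess, and it requires Java 1.4+, which is exactly why it was later ruled out for Java 1.1) looks like this:

```java
public class RegexTagStripper {
    // Strip anything that looks like a tag: a '<', a run of non-'>'
    // characters, then '>'. Fine for well-formed HTML, but it will
    // mangle stray '<' or '>' characters in text or attribute values.
    public static String strip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("<P class=MsoNormal>hello <B>world</B></P>"));
    }
}
```

This is the "very cheap solution" described above: three lines that work only because the input is controlled and well formed.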
Re: which HTML parser is better?
Unfortunately I am faithful ;-). Just for practical reasons I want to do that in a single class, or even a method, called by another part of my Java application. It should also run on Java 1.1 and it should be small and simple. As I said before, I am in control of the HTML and it will be well formatted, because I generate it from XML using XSLT. Karl

If you are not married to Java: http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm Otis
Re: which HTML parser is better?
Karl Koch wrote: Hello Sergiu, thank you for your help so far. I appreciate it. I am working with Java 1.1, which does not include regular expressions.

Why are you using Java 1.1? Are you so limited in resources? What operating system do you use? I assume that you just need to index the HTML files, so you need an HTML-to-text conversion. If an external converter is a solution for you, you can use Runtime.exec(...) to run the converter that will extract the information from your HTMLs and generate a .txt file. Then you can use a reader to index the txt. As I told you before, the best solution depends on your constraints (time, effort, hardware, performance) and requirements :) Best, Sergiu
Re: which HTML parser is better?
I apologise in advance if some of my writing here has been said before. The last three answers to my question have been suggesting pattern-matching solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing is something I cannot use since I work with Java 1.1 on a PDA. I am wondering if somebody knows a piece of simple source code with low requirements which runs under this tight specification. Thank you all, Karl

No one has yet mentioned using ParserDelegator and ParserCallback that are part of HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called when different tags are encountered. On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote: Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function? -- Bill Tschumy Otherwise -- Austin, TX http://www.otherwise.com
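Bill's ParserDelegator/ParserCallback suggestion looks roughly like this; a sketch, and since these classes arrived with Swing (Java 1.2), it does not meet Karl's Java 1.1 constraint:

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingTextExtractor {
    // Collect only the text content of an HTML document by overriding
    // handleText() in HTMLEditorKit.ParserCallback; tags are handled
    // (and here ignored) by the other callback methods.
    public static String extract(String html) throws IOException {
        final StringBuffer out = new StringBuffer();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                out.append(data);
            }
        };
        // 'true' tells the parser to ignore any charset declaration,
        // since we are already handing it decoded characters.
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

It needs no GUI; ParserDelegator runs fine in a plain JVM, which is why it works for batch indexing.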
Re: which HTML parser is better?
Karl Koch wrote: Unfortunately I am faithful ;-). Just for practical reasons I want to do that in a single class, or even a method, called by another part of my Java application. It should also run on Java 1.1 and it should be small and simple. As I said before, I am in control of the HTML and it will be well formatted, because I generate it from XML using XSLT.

Why don't you get the data directly from the XML files? You can use a SAX parser, ... but I think it will require Java 1.3, or at least 1.2.2. Best, Sergiu
Re: which HTML parser is better?
Karl Koch wrote: I apologise in advance if some of my writing here has been said before. The last three answers to my question have been suggesting pattern-matching solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing is something I cannot use since I work with Java 1.1 on a PDA.

I see. In this case you can read your HTML file line by line and then write something like this:

String line;
int startPos, endPos;
StringBuffer text = new StringBuffer();
while ((line = reader.readLine()) != null) {
    startPos = line.indexOf(">");
    endPos = line.indexOf("<", startPos);
    if (startPos >= 0 && endPos > startPos)
        text.append(line.substring(startPos + 1, endPos));
}

This is just sample code that should work if you have just one tag per line in the HTML file. This can be a starting point for you. Hope it helps. Best, Sergiu
Re: which HTML parser is better?
Karl, Two things; try to experiment with both: 1) I would try to write a lexical scanner that strips HTML tags, much like the regular expression does. Java lexical-scanner packages produce nice pure-Java classes that seldom use any advanced API, so they should work on Java 1.1. They are simple state machines with states encoded in integers -- this should work like a charm, and be fast and small. 2) Write a parser yourself. Having a regular expression, it isn't that difficult to do... :) D.
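Suggestion (1) can be sketched as a hand-written scanner with states encoded in integers, using only APIs that already existed in Java 1.1 (no regex, no Swing); the class name is made up for illustration:

```java
public class TagScanner {
    private static final int TEXT = 0; // outside any tag
    private static final int TAG = 1;  // between '<' and '>'

    // Strip tags with a two-state machine: copy characters while in
    // TEXT, discard them while in TAG.
    public static String strip(String html) {
        StringBuffer out = new StringBuffer();
        int state = TEXT;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (state == TEXT) {
                if (c == '<') state = TAG;
                else out.append(c);
            } else if (c == '>') {
                state = TEXT;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(strip("a<b>c</b>d"));
    }
}
```

A production version would add states for comments and script blocks, but for well-formed XSLT-generated HTML this is close to sufficient.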
Re: which HTML parser is better? - Thread closed
Thank you, I will do that. Karl
Re: which HTML parser is better?
For all parser suggestions I think there is one important attribute. Some parsers return data provided that the input HTML is sensible. Other parsers are designed to be as flexible and tolerant as they can be. If the input is clean and controlled, the former class is sufficient; even some regular expression may be sufficient (I think that's what the original poster wants). If you are building a web crawler, you need something really tolerant. Once I prototyped a nice and fast parser. Later I had to abandon it because it failed to parse about 15% of documents (problems handling nested quotes like onclick=alert('hi')).
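The nested-quote failure described above is exactly what a naive scanner gets wrong: a '>' inside a quoted attribute value ends the "tag" too early. Tracking one extra state for quoted values fixes that particular case; this is an illustrative sketch, not the abandoned parser from the post:

```java
public class QuoteAwareStripper {
    private static final int TEXT = 0;   // outside any tag
    private static final int TAG = 1;    // inside <...>, outside quotes
    private static final int QUOTED = 2; // inside a quoted attribute value

    public static String strip(String html) {
        StringBuffer out = new StringBuffer();
        int state = TEXT;
        char quote = 0; // which quote character opened the value
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (state == TEXT) {
                if (c == '<') state = TAG;
                else out.append(c);
            } else if (state == TAG) {
                if (c == '"' || c == '\'') { state = QUOTED; quote = c; }
                else if (c == '>') state = TEXT;
            } else if (c == quote) {
                // Only the matching quote character ends the value, so
                // a '>' (or the other quote kind) inside it is ignored.
                state = TAG;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The '>' inside onclick does not terminate the tag here.
        System.out.println(strip("<a onclick=\"alert('x>y')\">link</a>"));
    }
}
```

Real web pages also contain unbalanced quotes, which is why even this breaks in the wild and truly tolerant parsers need recovery heuristics.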
Re: which HTML parser is better?
One which we've been using can be found at: http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/ We absolutely need to be able to recover gracefully from malformed HTML and/or SGML. Most of the nicer SAX/DOM/TLA parsers out there failed this criterion when we started our effort. The above one is kind of SAX-y but doesn't fall over at the sight of a real web page ;-) Ian
RE: which HTML parser is better?
Hello, I have been following this thread and have another question. Is there a piece of source code (which is preferably very short and simple (KISS)) which allows removing all HTML tags from HTML content? HTML 3.2 would be enough... also no frames, CSS, etc. I do not need the HTML structure tree or any other structure, but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole. Karl

I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well -- based on its UI, it appears to be focused primarily on HTML validation and error detection/correction. I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far. Chuck

-Original Message- From: Jingkang Zhang [mailto:[EMAIL PROTECTED] Sent: Tuesday, February 01, 2005 1:15 AM To: lucene-user@jakarta.apache.org Subject: which HTML parser is better? Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML' function?
Re: which HTML parser is better?
Hi Karl, I already submitted a piece of code that removes the HTML tags. Search for my previous answer in this thread. Best, Sergiu

Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of source code (preferably very short and simple, KISS) which removes all HTML tags from HTML content? [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: which HTML parser is better?
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote: Hello, I have been following this thread and have another question. Is there a piece of source code (preferably very short and simple, KISS) which removes all HTML tags from HTML content? HTML 3.2 would be enough... also no frames, CSS, etc. I do not need the HTML structure tree or any other structure, but need a facility to clean up HTML into its normal underlying content before indexing that content as a whole.

The code in the Lucene Sandbox for parsing HTML with JTidy (under contributions/ant) for the index task does what you ask. Erik

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: which HTML parser is better?
Hi, yes, but the library you are using is quite big. I was thinking that 5 kB of code could actually do that. That SourceForge project is doing much more than that, but I do not need it. Karl

sergiu gordea wrote: Hi Karl, I already submitted a piece of code that removes the HTML tags. Search for my previous answer in this thread. Best, Sergiu [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: which HTML parser is better?
Karl Koch wrote: Hi, yes, but the library you are using is quite big. I was thinking that 5 kB of code could actually do that. That SourceForge project is doing much more than that, but I do not need it.

You need just the htmlparser.jar, ~200k. ... you know ... the functionality is strongly correlated with the size. You can use 3 lines of code with a good regular expression to eliminate the HTML tags, but this won't give you any guarantee that the text from badly formatted HTML files will be correctly extracted... Best, Sergiu [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: which HTML parser is better?
I am in control of the HTML, which means it is well-formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web). Are there any very short solutions for that? Karl

sergiu gordea wrote: You need just the htmlparser.jar, ~200k. ... you know ... the functionality is strongly correlated with the size. You can use 3 lines of code with a good regular expression to eliminate the HTML tags, but this won't give you any guarantee that the text from badly formatted HTML files will be correctly extracted... [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: which HTML parser is better?
Karl Koch wrote: I am in control of the HTML, which means it is well-formatted HTML. I use only HTML files which I have transformed from XML. No external HTML (e.g. the web). Are there any very short solutions for that?

If you are using only correctly formatted HTML pages and you are in control of these pages, you can use a regular expression to remove the tags. Something like replaceAll("<[^>]*>", ""). This is the idea behind the operation. If you search on Google you will find a more robust regular expression. Using a simple regular expression is a very cheap solution that can cause you a lot of problems in the future. It's up to you to use it. Best, Sergiu

Karl [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
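Sergiu's one-regex idea can be sketched in a few lines of Java. This is only a hedged sketch for well-formed HTML that you generate yourself, as discussed above; the class name TagStripper is illustrative, not from the thread, and entity references such as &amp; are left untouched.

```java
import java.util.regex.Pattern;

// Minimal tag stripper along the lines Sergiu describes: match '<',
// any run of characters that are not '>', then '>'. Safe only for
// well-formed HTML you control yourself.
public class TagStripper {
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<p>Hello <b>world</b></p>"));
        // prints: Hello world
    }
}
```

On badly nested or unterminated tags this pattern will happily eat real text, which is exactly the risk Sergiu warns about.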
Re: which HTML parser is better?
If you are not married to Java: http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm Otis

--- sergiu gordea [EMAIL PROTECTED] wrote: If you are using only correctly formatted HTML pages and you are in control of these pages, you can use a regular expression to remove the tags. [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: which HTML parser is better?
In our application I use regular expressions to strip all tags in one situation and specific ones in another situation. Here is sample code for both. This strips all HTML 4.0 tags except p, ul, br, li, strong, em, u:

html_source = Pattern.compile("</?\\s?(A|ABBR|ACRONYM|ADDRESS|APPLET|AREA|B|BASE|BASEFONT|BDO|BIG|BLOCKQUOTE|BODY|BUTTON|CAPTION|CENTER|CITE|CODE|COL|COLGROUP|DD|DEL|DFN|DIR|DIV|DL|DT|FIELDSET|FONT|FORM|FRAME|FRAMESET|H1|H2|H3|H4|H5|H6|HEAD|HR|HTML|I|IFRAME|IMG|INPUT|INS|ISINDEX|KBD|LABEL|LEGEND|LINK|MAP|MENU|META|NOFRAMES|NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|Q|S|SAMP|SCRIPT|SELECT|SMALL|SPAN|STRIKE|STYLE|SUB|SUP|TABLE|TBODY|TD|TEXTAREA|TFOOT|TH|THEAD|TITLE|TR|TT|VAR)(.|\n)*?\\s?>", Pattern.CASE_INSENSITIVE).matcher(html_source).replaceAll("");

When I want to strip anything in a tag I use the following pattern with the code above:

String strPattern1 = "<\\s?(.|\n)*?\\s?>";

HTH Luke

----- Original Message ----- From: sergiu gordea [EMAIL PROTECTED] To: Lucene Users List lucene-user@jakarta.apache.org Sent: Wednesday, February 02, 2005 1:23 PM Subject: Re: which HTML parser is better?

If you are using only correctly formatted HTML pages and you are in control of these pages, you can use a regular expression to remove the tags. [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: which HTML parser is better?
We index the content from HTML files, and because we only want the good text and do not care about the structure, well-formedness, etc., we went with regular expressions similar to what Luke Shannon offered. The only real difference is that we first remove entire blocks of (script|style|csimport) and similar, since the contents of those are not useful for keyword searching, and afterward just remove every leftover HTML tag. I have been meaning to add an expression to extract things like alt attribute text from img tags though. --Leto

-----Original Message-----
From: Karl Koch [mailto:[EMAIL PROTECTED]

I have been following this thread and have another question. Is there a piece of source code (preferably very short and simple, KISS) which removes all HTML tags from HTML content? [...]

CONFIDENTIALITY NOTICE AND DISCLAIMER Information in this transmission is intended only for the person(s) to whom it is addressed and may contain privileged and/or confidential information. If you are not the intended recipient, any disclosure, copying or dissemination of the information is unauthorised and you should delete/destroy all copies and notify the sender. No liability is accepted for any unauthorised use of the information contained in this transmission. This disclaimer has been automatically added.

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
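Leto's two-pass approach (drop whole script/style blocks first, then strip the leftover tags) might look like this in Java. The class name HtmlCleaner and the exact patterns are my own sketch, not the production expressions described above.

```java
import java.util.regex.Pattern;

// Two-pass cleanup: first remove entire <script>/<style> blocks,
// contents included, then strip every remaining tag.
public class HtmlCleaner {
    private static final Pattern BLOCKS = Pattern.compile(
            "<(script|style)[^>]*>.*?</\\1\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    public static String clean(String html) {
        String noBlocks = BLOCKS.matcher(html).replaceAll("");
        return TAGS.matcher(noBlocks).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(clean(
            "<p>Keep me</p><script>var x = 1;</script><style>p{}</style>"));
        // prints: Keep me
    }
}
```

The order matters: stripping tags first would leave the script body behind as if it were ordinary text.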
Re: which HTML parser is better?
No one has yet mentioned using ParserDelegator and ParserCallback, which are part of HTMLEditorKit in Swing. I have been successfully using these classes to parse out the text of an HTML file. You just need to extend HTMLEditorKit.ParserCallback and override the various methods that are called when different tags are encountered.

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote: Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML files' function?

-- Bill Tschumy Otherwise -- Austin, TX http://www.otherwise.com

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
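A minimal sketch of the ParserDelegator/ParserCallback approach Bill describes, assuming you only want the character data. The javax.swing.text.html classes are standard JDK; the class name SwingTextExtractor is mine.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Collects the text content of an HTML document using the Swing HTML
// parser. handleText is invoked for every run of character data.
public class SwingTextExtractor {

    public static String extract(String html) throws IOException {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        // 'true' tells the parser to ignore any charset declaration.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extract("<html><body><p>Hello world</p></body></html>"));
        // prints: Hello world
    }
}
```

Overriding handleStartTag/handleEndTag as well would let you react to specific tags, which is the extension point Bill mentions.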
Re: which HTML parser is better?
Kauler, Leto S wrote: We index the content from HTML files, and because we only want the good text and do not care about the structure, well-formedness, etc., we went with regular expressions similar to what Luke Shannon offered. [...]

Another very cheap but robust solution, in case you use Linux, is to have lynx parse your pages: lynx -dump page.html > page.txt. This will strip out all HTML, including script, style and csimport tags, and you will have a .txt file ready for indexing. Best, Sergiu

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
which HTML parser is better?
Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML files' function?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: which HTML parser is better?
Jingkang Zhang wrote: Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? [...]

Maybe you can try this library... http://htmlparser.sourceforge.net/ I use the following code to get the text from HTML files; it was not intensively tested, but it works.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.Translate;

Parser parser = new Parser(source.getAbsolutePath());
NodeIterator iter = parser.elements();
while (iter.hasMoreNodes()) {
    Node element = (Node) iter.nextNode();
    //System.out.println("1: " + element.getText());
    String text = Translate.decode(element.toPlainTextString());
    if (Utils.notEmptyString(text))
        writer.write(text);
}

Sergiu

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: which HTML parser is better?
When I tested parsers a year or so ago for intensive use in Furl, the best (most tolerant of bad HTML) and fastest (tested on a 1.5 MB HTML page) parser by far was TagSoup (http://www.tagsoup.info). It is actively maintained and improved, and I have never had any problems with it. -Mike

Jingkang Zhang wrote: Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? [...]

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
RE: which HTML parser is better?
I think that depends on what you want to do. The Lucene demo parser does simple mapping of HTML files into Lucene Documents; it does not give you a parse tree for the HTML doc. CyberNeko is an extension of Xerces (uses the same API; will likely become part of Xerces), and so maps an HTML document into a full DOM that you can manipulate easily for a wide range of purposes. I haven't used JTidy at an API level and so don't know it as well -- based on its UI, it appears to be focused primarily on HTML validation and error detection/correction. I use CyberNeko for a range of operations on HTML documents that go beyond indexing them in Lucene, and really like it. It has been robust for me so far. Chuck

-----Original Message-----
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?

Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser, JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it filter tags that are auto-created by MS Word's 'Save As HTML files' function?

- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: demo HTML parser question
Hi Fred, We were originally attempting to use the demo HTML parser (Lucene 1.2), but as you know, it's just a demo. I think it's threaded to optimize on time, to allow the calling thread to grab the title or top message even though it's not done parsing the entire HTML document. That's just a guess; I would love to hear from others about this. Anyway, since it is a separate thread, a token error could kill it, and there is no way for the calling thread to know about it. We had to create our own HTML parser, since we only cared about grabbing the entire text from the HTML document, and we also wanted to avoid the extra thread. We also do a lot of SKIPping for minimal EOF errors (HTML documents in email almost never follow standards). For your HTML needs, you might want to check out other JavaCC HTML parsers from the JavaCC web site. Roy. On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote Hi, I've been working with the HTML parser demo that comes with Lucene and I'm trying to understand why it's multi-threaded, and, more importantly, how to exit gracefully on errors. I've discovered that if I throw an exception in the front-end static code (main(), etc.), the JVM hangs instead of exiting. Presumably this is because there are threads hanging around doing something, but I'm not sure what! Any pointers? I just want to exit gracefully on an error such as a required meta tag being missing. Thanks, Fred - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: demo HTML parser question
[EMAIL PROTECTED] wrote: We were originally attempting to use the demo html parser (Lucene 1.2), but as you know, its for a demo. I think its threaded to optimize on time, to allow the calling thread to grab the title or top message even though its not done parsing the entire html document. That's almost right. I originally wrote it that way to avoid having to ever buffer the entire text of the document. The document is indexed while it is parsed. But, as observed, this has lots of problems and was probably a bad idea. Could someone provide a patch that removes the multi-threading? We'd simply use a StringBuffer in HTMLParser.jj to collect the text. Calls to pipeOut.write() would be replaced with text.append(). Then have the HTMLParser's constructor parse the page before returning, rather than spawn a thread, and getReader() would return a StringReader. The public API of HTMLParser need not change at all and lots of complex threading code would be thrown away. Anyone interested in coding this? Doug - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
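The single-threaded design Doug describes can be illustrated in a few lines. This is only a sketch of the pattern, not the actual patch (the real change would live in HTMLParser.jj and the JavaCC-generated code); a trivial tag stripper stands in for the generated parser, and all names here are illustrative:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Sketch of the no-thread design: the constructor parses the whole page
// before returning, collecting text into a buffer, and getReader() hands
// back a StringReader over that buffer. A trivial tag stripper stands in
// for the JavaCC grammar here.
public class SimpleHTMLParser {
    private final StringBuffer text = new StringBuffer();

    // Parse synchronously in the constructor -- no spawned thread.
    public SimpleHTMLParser(Reader in) throws IOException {
        boolean inTag = false;
        int c;
        while ((c = in.read()) != -1) {
            if (c == '<') inTag = true;
            else if (c == '>') inTag = false;
            else if (!inTag) text.append((char) c); // pipeOut.write() becomes text.append()
        }
    }

    // The public API is unchanged; getReader() just wraps the buffered text.
    public Reader getReader() {
        return new StringReader(text.toString());
    }
}
```

The trade-off Doug notes still applies: the whole document text is buffered in memory, in exchange for throwing away the complex threading code.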
demo HTML parser question
Hi, I've been working with the HTML parser demo that comes with Lucene and I'm trying to understand why it's multi-threaded, and, more importantly, how to exit gracefully on errors. I've discovered if I throw an exception in the front-end static code (main(), etc.), the JVM hangs instead of exiting. Presumably this is because there are threads hanging around doing something. But I'm not sure what! Any pointers? I just want to exit gracefully on an error such as a required meta tag is missing or similar. Thanks, Fred - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Best HTML Parser !!
I've had fairly good experience with JTidy! But HTMLParser http://htmlparser.sourceforge.net/ seems to have the lighter-looking API. It is event-based, and I might need to parse some large HTML sometime soon, where DOM might be a problem. Does anyone have practical experience with HTMLParser? Thanks Frank -Original Message- From: petite_abeille [mailto:[EMAIL PROTECTED] Sent: Tuesday, 25 February 2003 19:49 To: Lucene Users List Subject: Re: Best HTML Parser !! On Monday, Feb 24, 2003, at 20:28 Europe/Zurich, Lukas Zapletal wrote: I have had some good experiences with JTidy. It works like a DOM XML parser and cleans up the HTML along the way. I use JTidy also, both for parsing and clean-up. Works pretty nicely. This is VERY useful, because EVERY HTML document has at least ONE error. This rule should be tattooed on every parser's head: out of the laboratory, nothing is compliant. Which renders the race to more compliance among the different parsers somewhat ridiculous. Cheers, PA. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Best HTML Parser !!
Pierre Lacchini wrote: Hello, I'm trying to index HTML files with Lucene. Do you know what's the best HTML parser in Java? The most powerful? I need to extract meta tags and many other different text fields... Thanks for your help ;) I have had some good experiences with JTidy. It works like a DOM XML parser and cleans up the HTML along the way. This is VERY useful, because EVERY HTML document has at least ONE error. Documents that were unparsable with Neko, JTidy parsed without problems. Creating the indexing program took about two hours of work. -- Lukas Zapletal [EMAIL PROTECTED] http://www.tanecni-olomouc.cz/lzap - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Re: Best HTML Parser !!
On Monday, Feb 24, 2003, at 20:28 Europe/Zurich, Lukas Zapletal wrote: I have had some good experiences with JTidy. It works like a DOM XML parser and cleans up the HTML along the way. I use JTidy also, both for parsing and clean-up. Works pretty nicely. This is VERY useful, because EVERY HTML document has at least ONE error. This rule should be tattooed on every parser's head: out of the laboratory, nothing is compliant. Which renders the race to more compliance among the different parsers somewhat ridiculous. Cheers, PA. - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Best HTML Parser !!
Hello, I'm trying to index HTML files with Lucene. Do you know what's the best HTML parser in Java? The most powerful? I need to extract meta tags and many other different text fields... Thanks for your help ;)
Re: Best HTML Parser !!
It's not possible to generalize like that. I like NekoHTML. Otis --- Pierre Lacchini [EMAIL PROTECTED] wrote: Hello, I'm trying to index HTML files with Lucene. Do you know what's the best HTML parser in Java? The most powerful? I need to extract meta tags and many other different text fields... Thanks for your help ;) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
AW: Best HTML Parser !!
I prefer JTidy http://lempinen.net/sami/jtidy/. Michael -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] Sent: Monday, 24 February 2003 15:03 To: Lucene Users List; [EMAIL PROTECTED] Subject: Re: Best HTML Parser !! It's not possible to generalize like that. I like NekoHTML. Otis --- Pierre Lacchini [EMAIL PROTECTED] wrote: Hello, I'm trying to index HTML files with Lucene. Do you know what's the best HTML parser in Java? The most powerful? I need to extract meta tags and many other different text fields... Thanks for your help ;) - To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
Demo provided HTML parser bug (was RE: Newbie quizzes further...)
List Fellows: Lacking any knowledge of JavaCC, I solicited help in hacking the HTMLParser.jj included in the demo. I retreat from this solicitation, for two reasons: 1) I'm using other ideas gleaned from the list archives; 2) I'm not prepared to dive into the world of compiler compilers. The mere sound of it is intimidating. So, the bug. (If the bug is not worth fixing in the provided HTMLParser, drop another one in, like Quiotix's; I did.) Summary: The current HTMLParser fails to correctly handle HTML decimal entities. Given source like: <title>MyWebsite&#8212;Home Page</title> <p>My website&#8217;s address is...</p> the following is produced after indexing the HTML and performing a query: MyWebsite?Home Page My website?s address is... Another problem is manifest in the following oddity. Given the following *source*; **note the use of the ampersand entity**: <title>MyWebsite&amp;#8212;Home Page</title> <p>My website&amp;#8217;s address is...</p> This produces the output (where two dashes represent an em dash): MyWebsite--Home Page My website's address is... And the source of the *results* appears correctly, even though the source document that was indexed is incorrect! Some kind of entity replacement is occurring here: <title>MyWebsite&#8212;Home Page</title> <p>My website&#8217;s address is...</p> (I ran across the latter oddity courtesy of Adobe GoLive's annoying syntax rewriter.) Now, some might be asking, and rightly so, why hasn't this been seen before? I know a search in the archives didn't turn anything up. It's likely because the use of decimal entities is misunderstood by the HTML community at large. For instance, some, quite possibly a whole lot, use &#151; for em dash -- this is incorrect, as the whole range &#127; to &#159; is invalid. Second, many may use named encoding. Named encoding, i.e. &mdash;, is fine, but decimal encoding provides more consistent behavior cross-platform.
For more on this, read The Trouble with EM 'n EN and Other Shady Characters at A List Apart (www.alistapart.com/stories/emen/). Yours in Lucene, Tim -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
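Tim's observation suggests a workaround at indexing time: decode decimal character references yourself, and remap the invalid range the way browsers do, via Windows-1252. The sketch below is not part of Lucene or its demo parser; the class name is invented and the mapping table is deliberately partial (a complete Windows-1252 remap covers 27 code points in the 128-159 range):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Decodes &#NNN; references. Values in the invalid C1 range are remapped
// through (a subset of) Windows-1252, so a misused &#151; still comes out
// as a real em dash (U+2014) instead of a '?'.
public class EntityFixer {
    private static final Pattern DEC = Pattern.compile("&#(\\d+);");

    static char remap(int n) {
        switch (n) {                   // Windows-1252 code point -> Unicode
            case 145: return '\u2018'; // left single quotation mark
            case 146: return '\u2019'; // right single quotation mark
            case 147: return '\u201C'; // left double quotation mark
            case 148: return '\u201D'; // right double quotation mark
            case 150: return '\u2013'; // en dash
            case 151: return '\u2014'; // em dash
            default:  return (char) n; // valid references pass through
        }
    }

    public static String decode(String html) {
        Matcher m = DEC.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int n = Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(remap(n))));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Running text through something like this before indexing sidesteps the '?' substitution, at the cost of doing your own entity handling.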
Re: problems with HTML Parser
Keith, I haven't noticed the problem with the Parser...but you trigger me by saying that you have a PDFParser!!! Are you able to contribute this PDFParser?? Maurits. - Original Message - From: Keith Gunn [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, August 14, 2002 9:46 AM Subject: problems with HTML Parser Has anyone noticed that the HTML Parser that comes with Lucene joins terms together when parsing a file. I used to think it was my PDFParser but after fixing that I found out it was the HMTLParser. I managed to find a replacement parser that doesn't join terms. Just wondered if anyone had come across this problem?? -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED] -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: problems with HTML Parser
Maurits, You can get a PDF parser from http://www.pdfbox.org -Ben On Wed, 14 Aug 2002, Maurits van Wijland wrote: Keith, I haven't noticed the problem with the Parser...but you trigger me by saying that you have a PDFParser!!! Are you able to contribute this PDFParser?? Maurits. -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: problems with HTML Parser
If you're parsing HTML files, have a check in Lucene to see the terms that are indexed and see if you can spot any joined terms. The PDF parser, as you can see from the other mail, is from www.pdfbox.org, and I highly recommend it (thanks again Ben!) On Wed, 14 Aug 2002, Maurits van Wijland wrote: Keith, I haven't noticed the problem with the parser... but you trigger me by saying that you have a PDFParser!!! Are you able to contribute this PDFParser?? Maurits. - Original Message - From: Keith Gunn [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Wednesday, August 14, 2002 9:46 AM Subject: problems with HTML Parser Has anyone noticed that the HTML parser that comes with Lucene joins terms together when parsing a file? I used to think it was my PDFParser, but after fixing that I found out it was the HTMLParser. I managed to find a replacement parser that doesn't join terms. Just wondered if anyone had come across this problem?? -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: HTML parser
Hi all, I'm very interested in this thread. I also have to solve the problem of spidering web sites, creating an index (well, about this there is the BIG problem that Lucene can't be integrated easily with a DB), extracting links from the page, and repeating the whole process. For extracting links from a page I'm thinking of using JTidy. I think that with this library you can also parse a non-well-formed page (that you can fetch from the web with URLConnection) by setting the property to clean the page. The Tidy class returns an org.w3c.dom.Document that you can use for analyzing the whole document: for example, you can use doc.getElementsByTagName("a") to get all the a elements. You can parse it as XML. Did someone solve the problem of spidering web pages recursively? Laura While trying to research the same thing, I found the following... here's a good example of link extraction. Try http://www.quiotix.com/opensource/html-parser It's easy to write a Visitor which extracts the links; should take about ten lines of code. -- Brian Goetz Quiotix Corporation [EMAIL PROTECTED] Tel: 650-843-1300 Fax: 650-324-8032 http://www.quiotix.com -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
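The DOM traversal Laura describes is the same whatever produced the Document. A hedged sketch: here the JDK's own DocumentBuilder parses markup that is already well-formed, but the extractLinks() half would work unchanged on the Document that JTidy hands back for messy real-world HTML (class and method names below are invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class LinkExtractor {
    // Works on any org.w3c.dom.Document, whether it came from JTidy's
    // cleanup pass or from an ordinary XML parse of clean markup.
    public static List<String> extractLinks(Document doc) {
        List<String> links = new ArrayList<String>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) links.add(a.getAttribute("href"));
        }
        return links;
    }

    // Convenience parse of already well-formed markup; with JTidy you
    // would obtain the Document from its DOM-producing parse instead.
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
    }
}
```

For recursive spidering, the extracted hrefs just go back onto the fetch queue, with a visited-set to avoid loops.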
RE: HTML parser
You can use the Swing HTML parser to do this, but it's only a 3.2-DTD-based parser. I have written (attached) a total hack job for breaking up an HTML page into its component parts; the code gives you an idea... If anyone wants to know how to use the Swing-based parser, I can add some code? Mark -Original Message- From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]] Sent: 19 April 2002 07:29 To: [EMAIL PROTECTED] Subject: HTML parser Hello, I need to select an HTML parser for the application that I'm writing and I'm not sure what to choose. The HTML parser included with Lucene looks flimsy, JTidy looks like a hack and an overkill, using classes written for Swing (javax.swing.text.html.parser) seems wrong, and I haven't tried David McNicol's parser (included with Spindle). Somebody on this list must have done some research on this subject. Can anyone share some experiences? Have you found a better HTML parser than any of those I listed above? If your application deals with HTML, what do you use for parsing it? Thanks, Otis -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED] PageBreaker.java Description: java/
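For anyone curious about the Swing route Mark mentions: PageBreaker.java itself was an attachment and is not reproduced here, but the basic callback idiom looks roughly like this (names invented). The JDK's HTML 3.2 parser drives a ParserCallback, which collects text as it goes:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class SwingTextExtractor {
    // Feed HTML through the JDK's built-in parser; handleText() fires
    // for each run of character data between tags.
    public static String extractText(String html) throws IOException {
        final StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback cb = new HTMLEditorKit.ParserCallback() {
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };
        Reader r = new StringReader(html);
        new ParserDelegator().parse(r, cb, true); // true: ignore charset hints
        return out.toString().trim();
    }
}
```

The same callback also has handleStartTag()/handleEndTag() hooks, which is where a page-breaking hack would distinguish titles, paragraphs, and so on. Being 3.2-DTD-based, it can mangle newer markup, which is Mark's caveat.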
RE: HTML parser
Are there core classes as part of Lucene that allow one to feed Lucene links, and 'it' will capture the contents of those URLs into the index... or does one write a file-capture class to seek out the URL, store the file in a directory, then index the local directory? Ian -Original Message- From: Terence Parr [mailto:[EMAIL PROTECTED]] Sent: Friday, April 19, 2002 1:38 AM To: Lucene Users List Subject: Re: HTML parser On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: :snip Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to suck them into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text. Is that the kind of thing you are looking for, or do you really want to parse, not filter? Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
RE: HTML parser
Such classes are not included with Lucene. This was _just_ mentioned on this list earlier today. Look at the archives and search for crawler, URL, lucene sandbox, etc. Otis --- Ian Forsyth [EMAIL PROTECTED] wrote: Are there core classes part of lucene that allow one to feed lucene links, and 'it' will capture the contents of those urls into the index.. or does one write a file capture class to seek out the url store the file in a directory, then index the local directory.. Ian -Original Message- From: Terence Parr [mailto:[EMAIL PROTECTED]] Sent: Friday, April 19, 2002 1:38 AM To: Lucene Users List Subject: Re: HTML parser On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: :snip Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to such them into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text. is that the kind of thing you are looking for or do you really want to parse not filter? Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED] -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED] __ Do You Yahoo!? Yahoo! Tax Center - online filing with TurboTax http://taxes.yahoo.com/ -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: HTML parser
While trying to research the same thing, I found the following... here's a good example of link extraction. http://developer.java.sun.com/developer/TechTips/1999/tt0923.html It seems like I could use this to also get the text out from between the tags, but I haven't been able to do it yet. It seems like it should be simple, but geez... my head hurts. On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote: Are there core classes as part of Lucene that allow one to feed Lucene links, and 'it' will capture the contents of those URLs into the index... or does one write a file-capture class to seek out the URL, store the file in a directory, then index the local directory? Ian -Original Message- From: Terence Parr [mailto:[EMAIL PROTECTED]] Sent: Friday, April 19, 2002 1:38 AM To: Lucene Users List Subject: Re: HTML parser On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: :snip Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to suck them into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text. Is that the kind of thing you are looking for, or do you really want to parse, not filter? Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: HTML parser
HttpUnit (which uses JTidy under the covers) makes childs play out of pulling out links and navigating to them. The only caveat (and this would be true for practically all tools, I suspect) is that the HTML has to be relatively well-formed for it to work well. JTidy can be somewhat forgiving though. Erik - Original Message - From: David Black [EMAIL PROTECTED] To: Lucene Users List [EMAIL PROTECTED] Sent: Friday, April 19, 2002 5:26 PM Subject: Re: HTML parser While trying to research the same thing, I found the following...here's a good example of link extraction. http://developer.java.sun.com/developer/TechTips/1999/tt0923.html It seems like I could use this to also get the text out from between the tags but haven't been able to do it yet. It seems like it should be simple but geez...my head hurts. On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote: Are there core classes part of lucene that allow one to feed lucene links, and 'it' will capture the contents of those urls into the index.. or does one write a file capture class to seek out the url store the file in a directory, then index the local directory.. Ian -Original Message- From: Terence Parr [mailto:[EMAIL PROTECTED]] Sent: Friday, April 19, 2002 1:38 AM To: Lucene Users List Subject: Re: HTML parser On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: :snip Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to such them into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text. is that the kind of thing you are looking for or do you really want to parse not filter? 
Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED] -- To unsubscribe, e-mail: mailto:lucene-user- [EMAIL PROTECTED] For additional commands, e-mail: mailto:lucene-user- [EMAIL PROTECTED] -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED] -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: HTML parser
While trying to research the same thing, I found the following... here's a good example of link extraction. Try http://www.quiotix.com/opensource/html-parser It's easy to write a Visitor which extracts the links; should take about ten lines of code. -- Brian Goetz Quiotix Corporation [EMAIL PROTECTED] Tel: 650-843-1300 Fax: 650-324-8032 http://www.quiotix.com -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
HTML parser
Hello, I need to select an HTML parser for the application that I'm writing and I'm not sure what to choose. The HTML parser included with Lucene looks flimsy, JTidy looks like a hack and an overkill, using classes written for Swing (javax.swing.text.html.parser) seems wrong, and I haven't tried David McNicol's parser (included with Spindle). Somebody on this list must have done some research on this subject. Can anyone share some experiences? Have you found a better HTML parser than any of those I listed above? If your application deals with HTML, what do you use for parsing it? Thanks, Otis __ Do You Yahoo!? Yahoo! Tax Center - online filing with TurboTax http://taxes.yahoo.com/ -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: HTML parser
On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: Hello, I need to select an HTML parser for the application that I'm writing and I'm not sure what to choose. The HTML parser included with Lucene looks flimsy, JTidy looks like a hack and an overkill, using classes written for Swing (javax.swing.text.html.parser) seems wrong, and I haven't tried David McNicol's parser (included with Spindle). Somebody on this list must have done some research on this subject. Can anyone share some experiences? Have you found a better HTML parser than any of those I listed above? If your application deals with HTML, what do you use for parsing it? Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to suck them into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text. Is that the kind of thing you are looking for, or do you really want to parse, not filter? Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
Re: HTML parser
Hello Terence, Ah, you got me. I guess I need a bit of both. I need to just strip HTML and get raw body text so that I can stick it in Lucene's index. I would also like something that can extract at least the <title>...</title> stuff, so that I can stick that in a separate field in the Lucene index. While doing that I, like you, need to be able to handle poorly formatted web pages. In the future I may need something with the ability to extract HREFs, but I'll stick to one of the XP principles and just look for something that meets current needs :) I looked for an ANTLR-based HTML parser a few days ago, but must have missed the one you pointed out. I'll take a look at it now. Can you share or describe your stripHTML method? Simple Java that looks for <'s and >'s, or something smarter? Thanks, Otis P.S. This type of thing makes me wish I could use Perl or Python :) --- Terence Parr [EMAIL PROTECTED] wrote: On Thursday, April 18, 2002, at 10:28 PM, Otis Gospodnetic wrote: Hello, I need to select an HTML parser for the application that I'm writing and I'm not sure what to choose. The HTML parser included with Lucene looks flimsy, JTidy looks like a hack and an overkill, using classes written for Swing (javax.swing.text.html.parser) seems wrong, and I haven't tried David McNicol's parser (included with Spindle). Somebody on this list must have done some research on this subject. Can anyone share some experiences? Have you found a better HTML parser than any of those I listed above? If your application deals with HTML, what do you use for parsing it? Hi Otis, I have an HTML parser built for ANTLR, but it's pretty strict in what it accepts. Not sure how useful it will be for you, but here it is: http://www.antlr.org/grammars/HTML I am not sure what your goal is, but I personally have to scarf all sorts of HTML from various websites to suck them into the jGuru search engine. I use a simple stripHTML() method I wrote to handle it. Works great. Kills everything but the text.
is that the kind of thing you are looking for or do you really want to parse not filter? Terence -- Co-founder, http://www.jguru.com Creator, ANTLR Parser Generator: http://www.antlr.org -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED] __ Do You Yahoo!? Yahoo! Tax Center - online filing with TurboTax http://taxes.yahoo.com/ -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]
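For what it's worth, a stripHTML() along the lines Terence describes can be approximated with regular expressions. This is a guess at the shape of such a method, not Terence's actual code; the class and method names are invented:

```java
// Rough tag-filtering sketch: drop <script>/<style> blocks wholesale,
// then everything between < and >, then collapse runs of whitespace.
// Good enough for indexing raw body text; not a real parser.
public class HtmlStripper {
    public static String stripHTML(String html) {
        String s = html.replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ");
        s = s.replaceAll("(?s)<[^>]*>", " ");      // kill remaining tags
        return s.replaceAll("\\s+", " ").trim();   // collapse whitespace
    }
}
```

This is exactly the "looks for <'s and >'s" approach Otis guesses at; it survives unbalanced tags, but it does nothing about entities like &amp;, and a stray bare '<' in the text will eat characters until the next '>'.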
HTML Parser
Hi, I am working with the lucene demo and would like to compile the demo so that I may eventually modify it for my own use. I am using the source from lucene-demos-1.2-rc4.jar.zip. However, the HTMLParser class had the filename HTMLParser.jj and won't compile. I changed the name to HTMLParser.java, still the same problem. Any help would be greatly appreciated. Thanks, Neal Neal Weinstein Manager Software Development blue*spark [EMAIL PROTECTED] T (416) 971-6612 x205 F (416) 971-6549 489 King Street West, Suite 200 Toronto, Ontario M5V 1K4 Canada www.bluespark.com
HTML Parser
Hi, How should I integrate the HTML parser (which is in the demo directory) into a new project? In particular, with the HTMLParser.jj file: do I need to compile it before trying to use it in my code? Any help would be appreciated! Thanks. - Christophe -- To unsubscribe, e-mail: mailto:[EMAIL PROTECTED] For additional commands, e-mail: mailto:[EMAIL PROTECTED]