The problem of using the CyberNeko HTML Parser to parse HTML files

2005-02-17 Thread Jingkang Zhang
When I was using the CyberNeko HTML Parser to parse HTML
files (created by Microsoft Word), if the file
contains HTML built-in entity references (for example:
&nbsp;), the node value may contain unknown characters.

Like this:
source html:
<DIV>
<P class=MsoNormal style="MARGIN: 0cm 0cm 0pt 18pt"><SPAN lang=EN-US
style="mso-bidi-font-size: 10.5pt"><FONT face="Times New Roman"><FONT
size=3>-rw-r--r--<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;
</SPAN>1 root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;
</SPAN>root<SPAN style="mso-spacerun: yes">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
</SPAN>50 Jan 21 16:12
_1e.f6<o:p></o:p></FONT></FONT></SPAN></P>
</DIV>

after parsing html:
-rw-r--r--??1 root?? root? 50 Jan 21 16:12
_1e.f6

How can I avoid it?

_
Do You Yahoo!?
Search 1.5 million MP3 songs:
http://music.yisou.com/
Image search:
http://image.yisou.com
1 GB mailbox (1G = 1000 MB), Yahoo Mail self-service expansion:
http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: The problem of using the CyberNeko HTML Parser to parse HTML files

2005-02-17 Thread Jason Polites
This is not an unknown character; it is a non-breaking space (Unicode value
U+00A0).
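One way to handle this after parsing is to normalize the character rather than try to avoid it. A minimal sketch in plain Java (no external libraries; the class name is made up for illustration):

```java
// Replace non-breaking spaces (U+00A0), which &nbsp; parses to, with
// ordinary ASCII spaces in text extracted from an HTML document.
public class NbspNormalizer {
    public static String normalize(String text) {
        StringBuffer sb = new StringBuffer(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(c == '\u00A0' ? ' ' : c);
        }
        return sb.toString();
    }
}
```

Run this over each node value after parsing; the '?' characters in the output above are just U+00A0 rendered by a console that cannot display it.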

- Original Message - 
From: Jingkang Zhang [EMAIL PROTECTED]
To: lucene-user@jakarta.apache.org
Sent: Friday, February 18, 2005 5:12 PM
Subject: The problem of using the CyberNeko HTML Parser to parse HTML files






Re: The problem of using the CyberNeko HTML Parser to parse HTML files

2005-02-17 Thread Jingkang Zhang
Thank you. But how can I get correct output? If my
html files use different encodings (e.g.
UTF-8, ISO-8859-1, GBK, JIS, etc.), how should I handle
them?
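The usual approach is to decode the bytes into characters yourself, using the encoding the page declares (in the HTTP Content-Type header or a meta tag), before handing the text to the parser. A minimal sketch under that assumption; the class name is invented:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

// Decode an HTML byte stream with an explicitly named charset (e.g. "UTF-8",
// "ISO-8859-1", "GBK") so multi-byte characters survive intact.
public class EncodedHtmlReader {
    public static String readAll(InputStream in, String charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        StringBuffer sb = new StringBuffer();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }
}
```

The key point is that the charset name must come from the page itself; guessing a fixed one will garble files in other encodings.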



 --- Jason Polites [EMAIL PROTECTED] wrote:

  This is not an unknown character; it is a non-breaking space (Unicode
  value U+00A0).





Re: which HTML parser is better?

2005-02-04 Thread Karl Koch
The link does not work.

 
 One which we've been using can be found at:
 http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
 
 We absolutely need to be able to recover gracefully from malformed
 HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
 failed this criterion when we started our effort.  The above one is
 kind of SAX-y but doesn't fall over at the sight of a real web page
 ;-)
 
 Ian
 
 
 

-- 
DSL Komplett from GMX +++ Sign up super-cheap and stress-free!
PROMOTION: pay no setup fee: http://www.gmx.net/de/go/dsl




Re: which HTML parser is better?

2005-02-04 Thread Ian Soboroff

Oops.  It's in the Google cache and also the Internet Archive Wayback
machine.  I'll drop the original author a note to let him know that
his links are stale.

http://web.archive.org/web/20040208014740/http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

Ian

Karl Koch [EMAIL PROTECTED] writes:

 The link does not work.

 
 One which we've been using can be found at:
 http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/
 
 We absolutely need to be able to recover gracefully from malformed
 HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
 failed this criterion when we started our effort.  The above one is
 kind of SAX-y but doesn't fall over at the sight of a real web page
 ;-)






Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
Hello Sergiu,

thank you for your help so far. I appreciate it.

I am working with Java 1.1 which does not include regular expressions.

Your turn ;-)
Karl 

 Karl Koch wrote:
 
  I am in control of the html, which means it is well-formatted HTML. I use
  only HTML files which I have transformed from XML. No external HTML (e.g.
  the web).
 
  Are there any very short solutions for that?
 
 if you are using only correctly formatted HTML pages and you are in control
 of these pages, you can use a regular expression to remove the tags,
 something like
 
 replaceAll("<[^>]*>", "");
 
 This is the idea behind the operation. If you search on Google you will
 find a more robust regular expression.
 
 Using a simple regular expression is a very cheap solution that can cause
 you a lot of problems in the future. It's up to you to use it.
 
  Best,
  Sergiu
 
 Karl Koch wrote:
 
  Hi,
 
  yes, but the library you are using is quite big. I was thinking that 5kB
  of code could actually do that. That sourceforge project is doing much
  more than that, but I do not need it.
 
 sergiu gordea wrote:
 
  you need just the htmlparser.jar (~200k).
  ... you know ... the functionality is strongly correlated with the size.
 
  You can use 3 lines of code with a good regular expression to eliminate
  the html tags, but this won't give you any guarantee that the text from
  badly formatted html files will be correctly extracted...
 
  Hi Karl,
 
  I already submitted a piece of code that removes the html tags.
  Search for my previous answer in this thread.
 
 Karl Koch wrote:
 
  Hello,
 
  I have been following this thread and have another question.
 
  Is there a piece of sourcecode (which is preferably very short and simple
  (KISS)) which allows one to remove all HTML tags from HTML content? HTML
  3.2 would be enough... also no frames, CSS, etc.
 
  I do not need to have the HTML structure tree or any other structure, but
  need a facility to clean up HTML into its normal underlying content
  before indexing that content as a whole.
 
  I think that depends on what you want to do. The Lucene demo parser does
  a simple mapping of HTML files into Lucene Documents; it does not give
  you a parse tree for the HTML doc. CyberNeko is an extension of Xerces
  (uses the same API; will likely become part of Xerces), and so maps an
  HTML document into a full DOM that you can manipulate easily for a wide
  range of purposes. I haven't used JTidy at an API level and so don't know
  it as well -- based on its UI, it appears to be focused primarily on HTML
  validation and error detection/correction.
 
  I use CyberNeko for a range of operations on HTML documents that go
  beyond indexing them in Lucene, and really like it. It has been robust
  for me so far.
 
  Chuck
 
  -Original Message-
  From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, February 01, 2005 1:15 AM
  To: lucene-user@jakarta.apache.org
  Subject: which HTML parser is better?
 
  Three HTML parsers (Lucene web application demo, CyberNeko HTML Parser,
  JTidy) are mentioned in Lucene FAQ 1.3.27. Which is the best? Can it
  filter tags that are auto-created by MS Word's 'Save As HTML' function?

Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
Unfortunately I am faithful ;-). Just for practical reasons I want to do that
in a single class, or even a method, called by another part of my Java
application. It should also run on Java 1.1, and it should be small and
simple. As I said before, I am in control of the HTML and it will be well
formatted, because I generate it from XML using XSLT.

Karl

 If you are not married to Java:
 http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm
 
 Otis
 
 --- sergiu gordea [EMAIL PROTECTED] wrote:
 
  if you are using only correctly formatted HTML pages and you are in
  control of these pages, you can use a regular expression to remove the
  tags, something like
 
  replaceAll("<[^>]*>", "");
 
  This is the idea behind the operation. If you search on Google you will
  find a more robust regular expression.
 
  Using a simple regular expression is a very cheap solution that can
  cause you a lot of problems in the future. It's up to you to use it.
 
   Best,
   Sergiu
Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote:
Hello Sergiu,
thank you for your help so far. I appreciate it.
I am working with Java 1.1 which does not include regular expressions.
 

Why are you using Java 1.1? Are you so limited in resources?
What operating system do you use?
I assume that you just need to index the html files, and you need an
html2txt conversion.
If an external converter is a solution for you, you can use
Runtime.getRuntime().exec(...) to run the converter that will extract the
information from your HTMLs and generate a .txt file. Then you can use a
reader to index the txt.

As I told you before, the best solution depends on your constraints
(time, effort, hardware, performance) and requirements :)

 Best,
 Sergiu
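The external-converter idea can be sketched as below. Note that "html2text" further down is only a hypothetical command name; substitute whatever converter is actually installed:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Run an external command and capture its standard output as a String.
public class ExternalConverter {
    public static String run(String[] command) throws IOException, InterruptedException {
        Process process = Runtime.getRuntime().exec(command);
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(process.getInputStream()));
        StringBuffer out = new StringBuffer();
        String line;
        while ((line = reader.readLine()) != null) {
            out.append(line).append('\n');
        }
        process.waitFor();  // wait for the converter to finish
        return out.toString();
    }
}
```

For example, ExternalConverter.run(new String[] {"html2text", "page.html"}) -- again, the converter name is an assumption, not a tool this thread names.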
Your turn ;-)
Karl 

 


Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
I apologise in advance if some of my writing here has been said before.
The last three answers to my question have suggested pattern-matching
solutions and Swing. Pattern matching was introduced in Java 1.4, and Swing
is something I cannot use since I work with Java 1.1 on a PDA.

I am wondering if somebody knows a piece of simple sourcecode with low
requirements which runs under this tight specification.

Thank you all,
Karl

 No one has yet mentioned using ParserDelegator and ParserCallback that 
 are part of HTMLEditorKit in Swing.  I have been successfully using 
 these classes to parse out the text of an HTML file.  You just need to 
 extend HTMLEditorKit.ParserCallback and override the various methods 
 that are called when different tags are encountered.
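The approach described above can be sketched like this (it needs Swing's HTMLEditorKit, so it is not an option on Java 1.1; the class name is made up):

```java
import java.io.IOException;
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Collect only the text content of an HTML document by overriding
// HTMLEditorKit.ParserCallback.handleText.
public class TextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuffer text = new StringBuffer();

    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    public static String extract(Reader html) throws IOException {
        TextExtractor callback = new TextExtractor();
        new ParserDelegator().parse(html, callback, true);  // true: ignore charset
        return callback.text.toString();
    }
}
```

Overriding handleStartTag/handleEndTag in the same way lets you react to specific tags if you need more than plain text.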
 
 
 On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
 
  Three HTML parsers(Lucene web application
  demo,CyberNeko HTML Parser,JTidy) are mentioned in
  Lucene FAQ
  1.3.27.Which is the best?Can it filter tags that are
  auto-created by MS-word 'Save As HTML files' function?
 -- 
 Bill Tschumy
 Otherwise -- Austin, TX
 http://www.otherwise.com
 
 
 

-- 
Saving starts with GMX DSL: http://www.gmx.net/de/go/dsl




Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote:
Unfortunaltiy I am faithful ;-). Just for practical reason I want to do that
in a single class or even method called by another part in my Java
application. It should also run on Java 1.1 and it should be small and
simple. As I said before, I am in control of the HTML and it will be well
formated, because I generate it from XML using XSLT.
 

Why don't you get the data directly from the XML files?
You can use a SAX parser, ... but I think it will require Java 1.3, or at
least 1.2.2.

Best,
 Sergiu
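A sketch of that SAX route, using the JAXP API (available in later Java class libraries, not Java 1.1; the class name is invented):

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Pull only character data out of an XML document with a SAX handler,
// skipping the XML -> HTML -> text round trip entirely.
public class XmlTextExtractor extends DefaultHandler {
    private final StringBuffer text = new StringBuffer();

    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    public static String extract(String xml) throws Exception {
        XmlTextExtractor handler = new XmlTextExtractor();
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }
}
```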
Karl
 


Re: which HTML parser is better?

2005-02-03 Thread Karl Koch
-- 
GMX on TV ... Thoughts are free ... Seen it yet?
Watch the ad online now: http://www.gmx.net/de/go/tv-spot




Re: which HTML parser is better?

2005-02-03 Thread sergiu gordea
Karl Koch wrote:
I appologise in advance, if some of my writing here has been said before.
The last three answers to my question have been suggesting pattern matching
solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
is something I cannot use since I work with Java 1.1 on a PDA.
 

I see.
In this case you can read your HTML file line by line and then write
something like this:

String line;
int startPos, endPos;
StringBuffer text = new StringBuffer();
while ((line = reader.readLine()) != null) {
    startPos = line.indexOf('>');
    endPos = line.indexOf('<', startPos + 1);
    if (startPos >= 0 && endPos > startPos)
        text.append(line.substring(startPos + 1, endPos));
}

This is just sample code that should work if you have just one tag pair per
line in the HTML file.
This can be a starting point for you.

 Hope it helps,
Best,
Sergiu
I am wondering if somebody knows a piece of simple sourcecode with low
requirement which is running under this tense specification.
Thank you all,
Karl
 

No one has yet mentioned using ParserDelegator and ParserCallback that 
are part of HTMLEditorKit in Swing.  I have been successfully using 
these classes to parse out the text of an HTML file.  You just need to 
extend HTMLEditorKit.ParserCallback and override the various methods 
that are called when different tags are encountered.

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
   

Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?
 

--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com
   

 




Re: which HTML parser is better?

2005-02-03 Thread Dawid Weiss
Karl,
Two things, try to experiment with both:
1) I would try to write a lexical scanner that strips HTML tags, much 
like the regular expression does. Java lexical scanner packages produce 
nice pure Java classes that seldom use any advanced API, so they should 
work on Java 1.1. They are simple state machines with states encoded in 
integers -- this should work like a charm, be fast and small.

2) Write a parser yourself. Having a regular expression it isn't that 
difficult to do... :)

D.
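A minimal version of suggestion 1 -- a two-state scanner instead of a regular expression -- might look like the sketch below. It deliberately ignores '>' inside quoted attribute values, which a real scanner would handle with an extra state:

```java
// Strip HTML tags with a hand-rolled state machine: no regex, no external
// libraries, so it also runs on very old JVMs.
public class TagStripper {
    public static String strip(String html) {
        StringBuffer out = new StringBuffer();
        boolean inTag = false;  // state: are we between '<' and '>'?
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```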
Karl Koch wrote:
I appologise in advance, if some of my writing here has been said before.
The last three answers to my question have been suggesting pattern matching
solutions and Swing. Pattern matching was introduced in Java 1.4 and Swing
is something I cannot use since I work with Java 1.1 on a PDA.
I am wondering if somebody knows a piece of simple sourcecode with low
requirement which is running under this tense specification.
Thank you all,
Karl

No one has yet mentioned using ParserDelegator and ParserCallback that 
are part of HTMLEditorKit in Swing.  I have been successfully using 
these classes to parse out the text of an HTML file.  You just need to 
extend HTMLEditorKit.ParserCallback and override the various methods 
that are called when different tags are encountered.

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:

Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?
--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com



Re: which HTML parser is better? - Thread closed

2005-02-03 Thread Karl Koch
Thank you, I will do that.

 Karl Koch wrote:
 
 I appologise in advance, if some of my writing here has been said before.
 The last three answers to my question have been suggesting pattern
 matching
 solutions and Swing. Pattern matching was introduced in Java 1.4 and
 Swing
 is something I cannot use since I work with Java 1.1 on a PDA.
   
 
 I see,
 
 In this case you can read line by line your HTML file and then write 
 something like this:
 
 String line;
 int startPos, endPos;
 StringBuffer text = new StringBuffer();
 while ((line = reader.readLine()) != null) {
     startPos = line.indexOf('>');
     endPos = line.indexOf('<', startPos + 1);
     if (startPos >= 0 && endPos > startPos)
         text.append(line.substring(startPos + 1, endPos));
 }
 
 This is just a sample code that should work if you have just one tag per 
 line in the HTML file.
 This can be a start point for you.
 
   Hope it helps,
 
  Best,
 
  Sergiu
 
 I am wondering if somebody knows a piece of simple sourcecode with low
 requirement which is running under this tense specification.
 
 Thank you all,
 Karl
 
   
 
 No one has yet mentioned using ParserDelegator and ParserCallback that 
 are part of HTMLEditorKit in Swing.  I have been successfully using 
 these classes to parse out the text of an HTML file.  You just need to 
 extend HTMLEditorKit.ParserCallback and override the various methods 
 that are called when different tags are encountered.
 
 
 On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
 
 
 
 Three HTML parsers(Lucene web application
 demo,CyberNeko HTML Parser,JTidy) are mentioned in
 Lucene FAQ
 1.3.27.Which is the best?Can it filter tags that are
 auto-created by MS-word 'Save As HTML files' function?
   
 
 -- 
 Bill Tschumy
 Otherwise -- Austin, TX
 http://www.otherwise.com
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
   
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
10 GB Mailbox, 100 FreeSMS http://www.gmx.net/de/go/topmail
+++ GMX - the first address for Mail, Message, More +++

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-03 Thread aurora
For all the parser suggestions I think there is one important attribute. Some  
parsers return good data only provided that the input HTML is sensible. Other  
parsers are designed to be as flexible and tolerant as they can be. If the input is  
clean and controlled, the former class is sufficient. Even some regular  
expressions may be sufficient. (I think that's what the original poster wants.) If you  
are building a web crawler you need something really tolerant.

Once I prototyped a nice and fast parser. Later I had to abandon it  
because it failed to parse about 15% of documents (problems handling nested  
quotes like onclick="alert('hi')").

No one has yet mentioned using ParserDelegator and ParserCallback that  
are part of HTMLEditorKit in Swing.  I have been successfully using  
these classes to parse out the text of an HTML file.  You just need to  
extend HTMLEditorKit.ParserCallback and override the various methods  
that are called when different tags are encountered.

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: which HTML parser is better?

2005-02-03 Thread Ian Soboroff

One which we've been using can be found at:
http://www.ltg.ed.ac.uk/~richard/ftp-area/html-parser/

We absolutely need to be able to recover gracefully from malformed
HTML and/or SGML.  Most of the nicer SAX/DOM/TLA parsers out there
failed this criterion when we started our effort.  The above one is
kind of SAX-y but doesn't fall over at the sight of a real web page
;-)

Ian


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: which HTML parser is better?

2005-02-02 Thread Karl Koch
Hello,

I have been following this thread and have another question. 

Is there a piece of sourcecode (which is preferably very short and simple
(KISS)) which allows removing all HTML tags from HTML content? HTML 3.2
would be enough... also no frames, CSS, etc. 

I do not need to have the HTML structure tree or any other structure, but I
need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.

Karl


 I think that depends on what you want to do.  The Lucene demo parser does
 simple mapping of HTML files into Lucene Documents; it does not give you a
 parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses
the
 same API; will likely become part of Xerces), and so maps an HTML document
 into a full DOM that you can manipulate easily for a wide range of
 purposes.  I haven't used JTidy at an API level and so don't know it as
well --
 based on its UI, it appears to be focused primarily on HTML validation and
 error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go beyond
 indexing them in Lucene, and really like it.  It has been robust for me so
 far.
 
 Chuck
 
-Original Message-
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?

Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?

_
Do You Yahoo!?
Search 1.5 million MP3 songs and step into the music palace
http://music.yisou.com/
Beauty and celebrity photos galore -- search every image
http://image.yisou.com
1G is 1000 MB -- Yahoo! Mail self-service upgrade
http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-- 
GMX on TV ... Thoughts are free ... Seen it yet?
Watch the spot online now: http://www.gmx.net/de/go/tv-spot

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
 Hi Karl,
I already submitted a piece of code that removes the html tags.
Search for my previous answer in this thread.
 Best,
  Sergiu
Karl Koch wrote:
Hello,
I have  been following this thread and have another question. 

Is there a piece of sourcecode (which is preferably very short and simple
(KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
would be enough...also no frames, CSS, etc. 

I do not need to have the HTML strucutre tree or any other structure but
need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.
Karl
 

I think that depends on what you want to do.  The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses
   

the
 

same API; will likely become part of Xerces), and so maps an HTML document
into a full DOM that you can manipulate easily for a wide range of
purposes.  I haven't used JTidy at an API level and so don't know it as
   

well --
 

based on its UI, it appears to be focused primarily on HTML validation and
error detection/correction.
I use CyberNeko for a range of operations on HTML documents that go beyond
indexing them in Lucene, and really like it.  It has been robust for me so
far.
Chuck
  -Original Message-
  From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, February 01, 2005 1:15 AM
  To: lucene-user@jakarta.apache.org
  Subject: which HTML parser is better?
  
  Three HTML parsers(Lucene web application
  demo,CyberNeko HTML Parser,JTidy) are mentioned in
  Lucene FAQ
  1.3.27.Which is the best?Can it filter tags that are
  auto-created by MS-word 'Save As HTML files' function?
  
  
  -
  To unsubscribe, e-mail: [EMAIL PROTECTED]
  For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   

 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: which HTML parser is better?

2005-02-02 Thread Erik Hatcher
On Feb 2, 2005, at 6:17 AM, Karl Koch wrote:
Hello,
I have  been following this thread and have another question.
Is there a piece of sourcecode (which is preferably very short and 
simple
(KISS)) which allows to remove all HTML tags from HTML content? HTML 
3.2
would be enough...also no frames, CSS, etc.

I do not need to have the HTML strucutre tree or any other structure 
but
need a facility to clean up HTML into its normal underlying content 
before
indexing that content as a whole.

The code in the Lucene Sandbox for parsing HTML with JTidy (under 
contributions/ant) for the index task does what you ask.

Erik
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
Hi,

yes, but the library you are using is quite big. I was thinking that 5kB
of code could actually do that. That sourceforge project is doing much more
than that, but I do not need it.

Karl

   Hi Karl,
 
  I already submitted a peace of code that removes the html tags.
  Search for my previous answer in this thread.
 
   Best,
 
Sergiu
 
 Karl Koch wrote:
 
 Hello,
 
 I have  been following this thread and have another question. 
 
 Is there a piece of sourcecode (which is preferably very short and simple
 (KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
 would be enough...also no frames, CSS, etc. 
 
 I do not need to have the HTML strucutre tree or any other structure but
 need a facility to clean up HTML into its normal underlying content
 before
 indexing that content as a whole.
 
 Karl
 
 
   
 
 I think that depends on what you want to do.  The Lucene demo parser does
 simple mapping of HTML files into Lucene Documents; it does not give you a
 parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
 same API; will likely become part of Xerces), and so maps an HTML document
 into a full DOM that you can manipulate easily for a wide range of
 purposes.  I haven't used JTidy at an API level and so don't know it as
 well -- based on its UI, it appears to be focused primarily on HTML
 validation and error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go beyond
 indexing them in Lucene, and really like it.  It has been robust for me so
 far.
 
 Chuck
 
-Original Message-
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?

Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?


   
 -
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
   
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Karl Koch wrote:
Hi,
yes, but the library your are using is quite big. I was thinking that a 5kB
code could actually do that. That sourceforge project is doing much more
than that but I do not need it.
 

you need just the htmlparser.jar, 200k.
... you know ... the functionality is strongly correlated with the size.
 You can use 3 lines of code with a good regular expression to eliminate 
the html tags,
but this won't give you any guarantee that the text from badly 
formatted html files will be
correctly extracted...

 Best,
 Sergiu
Karl
 

 Hi Karl,
I already submitted a peace of code that removes the html tags.
Search for my previous answer in this thread.
 Best,
  Sergiu
Karl Koch wrote:
   

Hello,
I have  been following this thread and have another question. 

Is there a piece of sourcecode (which is preferably very short and simple
(KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
would be enough...also no frames, CSS, etc. 

I do not need to have the HTML structure tree or any other structure but
need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.
Karl

 

I think that depends on what you want to do.  The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML document
into a full DOM that you can manipulate easily for a wide range of
purposes.  I haven't used JTidy at an API level and so don't know it as
well -- based on its UI, it appears to be focused primarily on HTML
validation and error detection/correction.
I use CyberNeko for a range of operations on HTML documents that go beyond
indexing them in Lucene, and really like it.  It has been robust for me so
far.
Chuck
 -Original Message-
 From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 01, 2005 1:15 AM
 To: lucene-user@jakarta.apache.org
 Subject: which HTML parser is better?
 
 Three HTML parsers(Lucene web application
 demo,CyberNeko HTML Parser,JTidy) are mentioned in
 Lucene FAQ
 1.3.27.Which is the best?Can it filter tags that are
 auto-created by MS-word 'Save As HTML files' function?
 
 

   

-
   

 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
  

   


 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   

 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: which HTML parser is better?

2005-02-02 Thread Karl Koch
I am in control of the html, which means it is well-formatted HTML. I use
only HTML files which I have transformed from XML. No external HTML (e.g.
the web).

Are there any very-short solutions for that?

Karl

 Karl Koch wrote:
 
 Hi,
 
 yes, but the library your are using is quite big. I was thinking that a
 5kB
 code could actually do that. That sourceforge project is doing much more
 than that but I do not need it.
   
 
 you need just the htmlparser.jar 200k.
 ... you know ... the functionality is strongly correclated with the size.
 
   You can use 3 lines of code with a good regular expresion to eliminate 
 the html tags,
 but this won't give you any guarantie that the text from the bad 
 fromated html files will be
 correctly extracted...
 
   Best,
 
   Sergiu
 
 Karl
 
   
 
   Hi Karl,
 
  I already submitted a peace of code that removes the html tags.
  Search for my previous answer in this thread.
 
   Best,
 
Sergiu
 
 Karl Koch wrote:
 
 
 
 Hello,
 
 I have  been following this thread and have another question. 
 
 Is there a piece of sourcecode (which is preferably very short and simple
 (KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
 would be enough...also no frames, CSS, etc. 
 
 I do not need to have the HTML structure tree or any other structure but
 need a facility to clean up HTML into its normal underlying content before
 indexing that content as a whole.
 
 Karl
 
 
  
 
   
 
 I think that depends on what you want to do.  The Lucene demo parser does
 simple mapping of HTML files into Lucene Documents; it does not give you a
 parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
 same API; will likely become part of Xerces), and so maps an HTML document
 into a full DOM that you can manipulate easily for a wide range of
 purposes.  I haven't used JTidy at an API level and so don't know it as
 well -- based on its UI, it appears to be focused primarily on HTML
 validation and error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go beyond
 indexing them in Lucene, and really like it.  It has been robust for me so
 far.
 
 Chuck
 
   -Original Message-
   From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, February 01, 2005 1:15 AM
   To: lucene-user@jakarta.apache.org
   Subject: which HTML parser is better?
   
   Three HTML parsers(Lucene web application
   demo,CyberNeko HTML Parser,JTidy) are mentioned in
   Lucene FAQ
   1.3.27.Which is the best?Can it filter tags that are
   auto-created by MS-word 'Save As HTML files' function?
   
   
  
 
 
 -
 
 
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail:
 [EMAIL PROTECTED]
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

 
 
 
  
 
   
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 
 
 
 
   
 
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Karl Koch wrote:
I am in control of the html, which means it is well formated HTML. I use
only HTML files which I have transformed from XML. No external HTML (e.g.
the web).
Are there any very-short solutions for that?
 

if you are using only correctly formatted HTML pages and you are in control 
of these pages,
you can use a regular expression to remove the tags.

something like
replaceAll("<[^>]*>", "");

This is the idea behind the operation. If you search on google you 
will find a more robust
regular expression.

Using a simple regular expression is a very cheap solution that 
can cause you a lot of problems in the future.
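A minimal, self-contained sketch of that idea (the class name is hypothetical, and the exact pattern is an assumption since the archive stripped the original; note that java.util.regex requires Java 1.4, which Karl said he cannot use):

```java
import java.util.regex.Pattern;

public class RegexStrip {
    // "<[^>]*>" matches a '<', any run of characters that are not '>',
    // then the closing '>'. Because [^>] also matches newlines, tags
    // split across lines are handled without DOTALL.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("<div><p>1 root</p></div>")); // 1 root
    }
}
```

As Sergiu says, this breaks down on badly formed HTML (e.g. a bare '<' inside text or an unquoted '>' in an attribute), which is exactly the tolerance trade-off discussed earlier in the thread.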

It's up to you to use it 
Best,
Sergiu
Karl
 

Karl Koch wrote:
   

Hi,
yes, but the library your are using is quite big. I was thinking that a
 

5kB
   

code could actually do that. That sourceforge project is doing much more
than that but I do not need it.
 

you need just the htmlparser.jar 200k.
... you know ... the functionality is strongly correclated with the size.
 You can use 3 lines of code with a good regular expresion to eliminate 
the html tags,
but this won't give you any guarantie that the text from the bad 
fromated html files will be
correctly extracted...

 Best,
 Sergiu
   

Karl

 

Hi Karl,
I already submitted a peace of code that removes the html tags.
Search for my previous answer in this thread.
Best,
 Sergiu
Karl Koch wrote:
  

   

Hello,
I have  been following this thread and have another question. 

Is there a piece of sourcecode (which is preferably very short and simple
(KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
would be enough...also no frames, CSS, etc. 

I do not need to have the HTML structure tree or any other structure but
need a facility to clean up HTML into its normal underlying content before
indexing that content as a whole.
Karl



 

I think that depends on what you want to do.  The Lucene demo parser does
simple mapping of HTML files into Lucene Documents; it does not give you a
parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
same API; will likely become part of Xerces), and so maps an HTML document
into a full DOM that you can manipulate easily for a wide range of
purposes.  I haven't used JTidy at an API level and so don't know it as
well -- based on its UI, it appears to be focused primarily on HTML
validation and error detection/correction.
I use CyberNeko for a range of operations on HTML documents that go beyond
indexing them in Lucene, and really like it.  It has been robust for me so
far.
Chuck
   

-Original Message-
From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 01, 2005 1:15 AM
To: lucene-user@jakarta.apache.org
Subject: which HTML parser is better?
Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?
 

  

   

-
  

   

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail:
 

[EMAIL PROTECTED]
   

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 

  

   



 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
  

   


 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
   

 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: which HTML parser is better?

2005-02-02 Thread Otis Gospodnetic
If you are not married to Java:
http://search.cpan.org/~kilinrax/HTML-Strip-1.04/Strip.pm

Otis

--- sergiu gordea [EMAIL PROTECTED] wrote:

 Karl Koch wrote:
 
 I am in control of the html, which means it is well formated HTML. I
 use
 only HTML files which I have transformed from XML. No external HTML
 (e.g.
 the web).
 
 Are there any very-short solutions for that?
   
 
 if you are using only correct formated HTML pages and you are in
 control 
 of these pages.
 you can use a regular exprestion to remove the tags.
 
 something like
 replaceAll("<[^>]*>", "");
 
 This is the ideea behind the operation. If you will search on google
 you 
 will find a more robust
 regular expression.
 
 Using a simple regular expression will be a very cheap solution, that
 
 can cause you a lot of problems in the future.
  
  It's up to you to use it 
 
  Best,
  
  Sergiu
 
 Karl
 
   
 
 Karl Koch wrote:
 
 
 
 Hi,
 
 yes, but the library your are using is quite big. I was thinking
 that a
   
 
 5kB
 
 
 code could actually do that. That sourceforge project is doing
 much more
 than that but I do not need it.
  
 
   
 
 you need just the htmlparser.jar 200k.
 ... you know ... the functionality is strongly correclated with the
 size.
 
   You can use 3 lines of code with a good regular expresion to
 eliminate 
 the html tags,
 but this won't give you any guarantie that the text from the bad 
 fromated html files will be
 correctly extracted...
 
   Best,
 
   Sergiu
 
 
 
 Karl
 
  
 
   
 
  Hi Karl,
 
 I already submitted a peace of code that removes the html tags.
 Search for my previous answer in this thread.
 
  Best,
 
   Sergiu
 
 Karl Koch wrote:
 

 
 
 
 Hello,
 
 I have  been following this thread and have another question. 
 
 Is there a piece of sourcecode (which is preferably very short and simple
 (KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
 would be enough...also no frames, CSS, etc. 
 
 I do not need to have the HTML structure tree or any other structure but
 need a facility to clean up HTML into its normal underlying content before
 indexing that content as a whole.
 
 Karl
 
 
 
 
  
 
   
 
 I think that depends on what you want to do.  The Lucene demo parser does
 simple mapping of HTML files into Lucene Documents; it does not give you a
 parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
 same API; will likely become part of Xerces), and so maps an HTML document
 into a full DOM that you can manipulate easily for a wide range of
 purposes.  I haven't used JTidy at an API level and so don't know it as
 well -- based on its UI, it appears to be focused primarily on HTML
 validation and error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go beyond
 indexing them in Lucene, and really like it.  It has been robust for me so
 far.
 
 Chuck
 
 
 
 -Original Message-
 From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 01, 2005 1:15 AM
 To: lucene-user@jakarta.apache.org
 Subject: which HTML parser is better?
 
 Three HTML parsers(Lucene web application
 demo,CyberNeko HTML Parser,JTidy) are mentioned in
 Lucene FAQ
 1.3.27.Which is the best?Can it filter tags that are
 auto-created by MS-word 'Save As HTML files' function?
 
 
 
   
 

 
 
 

-

 
 
 
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
   
 
 [EMAIL PROTECTED]
 
 

-
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED]
 
   
 

 
 
 
  
 
   
 

-
 To unsubscribe, e-mail:
 [EMAIL PROTECTED]
 For additional commands, e-mail:
 [EMAIL PROTECTED

Re: which HTML parser is better?

2005-02-02 Thread Luke Shannon
In our application I use regular expressions to strip all tags in one
situation and specific ones in another situation. Here is sample code for
both:

This strips all html 4.0 tags except p, ul, br, li, strong, em,
u:

html_source =
Pattern.compile("</?\\s?(A|ABBR|ACRONYM|ADDRESS|APPLET|AREA|B|BASE|BASEFONT|
BDO|BIG|BLOCKQUOTE|BODY|BUTTON|CAPTION|CENTER|CITE|CODE|COL|COLGROUP|DD|DEL|
DFN|DIR|DIV|DL|DT|FIELDSET|FONT|FORM|FRAME|FRAMESET|H1|H2|H3|H4|H5|H6|HEAD|H
R|HTML|I|IFRAME|IMG|INPUT|INS|ISINDEX|KBD|LABEL|LEGEND|LINK|MAP|MENU|META|NO
FRAMES|NOSCRIPT|OBJECT|OL|OPTGROUP|OPTION|PARAM|PRE|Q|S|SAMP|SCRIPT|SELECT|S
MALL|SPAN|STRIKE|STYLE|SUB|SUP|TABLE|TBODY|TD|TEXTAREA|TFOOT|TH|THEAD|TITLE|
TR|TT|VAR)(.|\n)*?\\s?>",
Pattern.CASE_INSENSITIVE).matcher(html_source).replaceAll("");

When I want to strip anything in a tag I use the following pattern with the
code above:

String strPattern1 = "<\\s?(.|\n)*?\\s?>";

HTH

Luke
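The same whitelist idea can be shown compactly with a much shorter tag list (the class name and the particular tags chosen here are illustrative only, not Luke's actual list):

```java
import java.util.regex.Pattern;

public class SelectiveStrip {
    // Strips only the listed tags (open or close, any case), leaving
    // everything else -- e.g. <p>, <strong> -- untouched. Note there is
    // no word boundary after the tag name, so e.g. <DL...> would also be
    // caught by DIV's neighbours; Luke's full HTML 4.0 list avoids that
    // by enumerating every tag.
    private static final Pattern BLOCK_TAGS = Pattern.compile(
        "</?\\s?(DIV|SPAN|FONT|TABLE|TD|TR|BODY|HTML|HEAD)(.|\n)*?\\s?>",
        Pattern.CASE_INSENSITIVE);

    public static String stripBlocks(String html) {
        return BLOCK_TAGS.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(
            stripBlocks("<div><p>hi</p><span>there</span></div>"));
        // <p>hi</p>there
    }
}
```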



- Original Message - 
From: sergiu gordea [EMAIL PROTECTED]
To: Lucene Users List lucene-user@jakarta.apache.org
Sent: Wednesday, February 02, 2005 1:23 PM
Subject: Re: which HTML parser is better?


 Karl Koch wrote:

 I am in control of the html, which means it is well formated HTML. I use
 only HTML files which I have transformed from XML. No external HTML (e.g.
 the web).
 
 Are there any very-short solutions for that?
 
 
 if you are using only correct formated HTML pages and you are in control
 of these pages.
 you can use a regular exprestion to remove the tags.

 something like
 replaceAll("<[^>]*>", "");

 This is the ideea behind the operation. If you will search on google you
 will find a more robust
 regular expression.

 Using a simple regular expression will be a very cheap solution, that
 can cause you a lot of problems in the future.

  It's up to you to use it 

  Best,

  Sergiu

 Karl
 
 
 
 Karl Koch wrote:
 
 
 
 Hi,
 
 yes, but the library your are using is quite big. I was thinking that a
 
 
 5kB
 
 
 code could actually do that. That sourceforge project is doing much
more
 than that but I do not need it.
 
 
 
 
 you need just the htmlparser.jar 200k.
 ... you know ... the functionality is strongly correclated with the
size.
 
   You can use 3 lines of code with a good regular expresion to eliminate
 the html tags,
 but this won't give you any guarantie that the text from the bad
 fromated html files will be
 correctly extracted...
 
   Best,
 
   Sergiu
 
 
 
 Karl
 
 
 
 
 
  Hi Karl,
 
 I already submitted a peace of code that removes the html tags.
 Search for my previous answer in this thread.
 
  Best,
 
   Sergiu
 
 Karl Koch wrote:
 
 
 
 
 
 Hello,
 
 I have  been following this thread and have another question.
 
 Is there a piece of sourcecode (which is preferably very short and simple
 (KISS)) which allows to remove all HTML tags from HTML content? HTML 3.2
 would be enough...also no frames, CSS, etc.
 
 I do not need to have the HTML structure tree or any other structure but
 need a facility to clean up HTML into its normal underlying content before
 indexing that content as a whole.
 
 Karl
 
 
 
 
 
 
 
 
 I think that depends on what you want to do.  The Lucene demo parser does
 simple mapping of HTML files into Lucene Documents; it does not give you a
 parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the
 same API; will likely become part of Xerces), and so maps an HTML document
 into a full DOM that you can manipulate easily for a wide range of
 purposes.  I haven't used JTidy at an API level and so don't know it as
 well -- based on its UI, it appears to be focused primarily on HTML
 validation and error detection/correction.
 
 I use CyberNeko for a range of operations on HTML documents that go beyond
 indexing them in Lucene, and really like it.  It has been robust for me so
 far.
 
 Chuck
 
 
 
 -Original Message-
 From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 01, 2005 1:15 AM
 To: lucene-user@jakarta.apache.org
 Subject: which HTML parser is better?
 
 Three HTML parsers(Lucene web application
 demo,CyberNeko HTML Parser,JTidy) are mentioned in
 Lucene FAQ
 1.3.27.Which is the best?Can it filter tags that are
 auto-created by MS-word 'Save As HTML files' function?
 
 
 
 
 
 
 
 
 
 -
 
 
 
 
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail:
 
 
 [EMAIL

RE: which HTML parser is better?

2005-02-02 Thread Kauler, Leto S
We index the content from HTML files and, because we only want the good
text and do not care about the structure, well-formedness, etc., we went
with regular expressions similar to what Luke Shannon offered.

Only real difference being that we firstly remove entire blocks of
(script|style|csimport) and similar, since the contents of those are not
useful for keyword searching, and afterward just remove every leftover
HTML tag.  I have been meaning to add an expression to extract things
like alt attribute text from img though.

--Leto
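A sketch of the two-stage approach Leto describes (hypothetical class name; it assumes the script/style blocks are well formed, so the lazy match can find the matching closing tag):

```java
import java.util.regex.Pattern;

public class HtmlToText {
    // Stage 1: drop whole script/style blocks, contents included,
    // since their text is useless for keyword search. The \1
    // backreference makes </script> close <script> and </style>
    // close <style>.
    private static final Pattern BLOCKS = Pattern.compile(
        "<(script|style)[^>]*>.*?</\\1\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Stage 2: drop every remaining tag.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    public static String toText(String html) {
        String noBlocks = BLOCKS.matcher(html).replaceAll(" ");
        return TAGS.matcher(noBlocks).replaceAll(" ");
    }

    public static void main(String[] args) {
        String html = "<html><style>p{color:red}</style><p>one</p>"
                    + "<script>var x=1;</script><p>two</p></html>";
        System.out.println(toText(html).trim().replaceAll("\\s+", " "));
        // one two
    }
}
```

Replacing tags with a space rather than the empty string keeps words from running together when tags separate them.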



 -Original Message-
 From: Karl Koch [mailto:[EMAIL PROTECTED] 
 
 I have  been following this thread and have another question. 
 
 Is there a piece of sourcecode (which is preferably very 
 short and simple
 (KISS)) which allows to remove all HTML tags from HTML 
 content? HTML 3.2 would be enough...also no frames, CSS, etc. 
 
 I do not need to have the HTML structure tree or any other 
 structure but need a facility to clean up HTML into its 
 normal underlying content before indexing that content as a whole.
 
 Karl
 
  
 -Original Message-
 From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, February 01, 2005 1:15 AM
 To: lucene-user@jakarta.apache.org
 Subject: which HTML parser is better?
 
 Three HTML parsers(Lucene web application
 demo,CyberNeko HTML Parser,JTidy) are mentioned in
 Lucene FAQ
 1.3.27.Which is the best?Can it filter tags that are
 auto-created by MS-word 'Save As HTML files' function?
 

CONFIDENTIALITY NOTICE AND DISCLAIMER

Information in this transmission is intended only for the person(s) to whom it 
is addressed and may contain privileged and/or confidential information. If you 
are not the intended recipient, any disclosure, copying or dissemination of the 
information is unauthorised and you should delete/destroy all copies and notify 
the sender. No liability is accepted for any unauthorised use of the 
information contained in this transmission.

This disclaimer has been automatically added.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-02 Thread Bill Tschumy
No one has yet mentioned using ParserDelegator and ParserCallback that 
are part of HTMLEditorKit in Swing.  I have been successfully using 
these classes to parse out the text of an HTML file.  You just need to 
extend HTMLEditorKit.ParserCallback and override the various methods 
that are called when different tags are encountered.
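As a rough illustration of that approach (not Bill's actual code), a minimal callback that collects the text content; note that ParserDelegator only understands the HTML 3.2 DTD:

```java
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.StringReader;

public class TextExtractor extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    // Called for each run of character data between tags.
    public void handleText(char[] data, int pos) {
        text.append(data).append(' ');
    }

    public static String extract(String html) throws Exception {
        TextExtractor cb = new TextExtractor();
        // 'true' tells the parser to ignore charset-change directives,
        // which helps with malformed documents.
        new ParserDelegator().parse(new StringReader(html), cb, true);
        return cb.text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Overriding handleStartTag/handleSimpleTag in the same way would let you react to specific tags, such as meta.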

On Feb 1, 2005, at 3:14 AM, Jingkang Zhang wrote:
Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?
--
Bill Tschumy
Otherwise -- Austin, TX
http://www.otherwise.com
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: which HTML parser is better?

2005-02-02 Thread sergiu gordea
Kauler, Leto S wrote:
Another very cheap but robust solution, if you are on Linux, is to have 
lynx parse your pages:

lynx -dump page.html > page.txt

This will strip out all HTML, including script, style and csimport tags, and you 
will have a .txt file ready for indexing.

 Best,
 Sergiu
We index the content from HTML files and because we only want the good
text and do not care about the structure, well-formedness, etc we went
with regular expressions similar to what Luke Shannon offered.
Only real difference being that we firstly remove entire blocks of
(script|style|csimport) and similar since the contents of those are not
useful for keyword searching, and afterward just remove every leftover
HTML tags.  I have been meaning to add an expression to extract things
like alt attribute text from img though.
--Leto

 

-Original Message-
From: Karl Koch [mailto:[EMAIL PROTECTED] 

I have  been following this thread and have another question. 

Is there a piece of sourcecode (which is preferably very 
short and simple
(KISS)) which allows to remove all HTML tags from HTML 
content? HTML 3.2 would be enough...also no frames, CSS, etc. 

I do not need to have the HTML structure tree or any other 
structure but need a facility to clean up HTML into its 
normal underlying content before indexing that content as a whole.

Karl
   

  -Original Message-
  From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
  Sent: Tuesday, February 01, 2005 1:15 AM
  To: lucene-user@jakarta.apache.org
  Subject: which HTML parser is better?
  
  Three HTML parsers(Lucene web application
  demo,CyberNeko HTML Parser,JTidy) are mentioned in
  Lucene FAQ
  1.3.27.Which is the best?Can it filter tags that are
  auto-created by MS-word 'Save As HTML files' function?
  
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
 


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


which HTML parser is better?

2005-02-01 Thread Jingkang Zhang
Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?

_
Do You Yahoo!?
150MP3
http://music.yisou.com/

http://image.yisou.com
1G1000
http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-01 Thread sergiu gordea
Jingkang Zhang wrote:

Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?
  


maybe you can try this library...

http://htmlparser.sourceforge.net/

I use the following code to get the text from HTML files,
it was not intensively tested, but it works.

import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.Translate;

Parser parser = new Parser(source.getAbsolutePath());
NodeIterator iter = parser.elements();
while (iter.hasMoreNodes()) {
    Node element = (Node) iter.nextNode();
    // System.out.println("1: " + element.getText());
    String text = Translate.decode(element.toPlainTextString());
    if (Utils.notEmptyString(text))  // Utils.notEmptyString: the poster's own non-blank check
        writer.write(text);
}

Sergiu

_
Do You Yahoo!?
150MP3
http://music.yisou.com/

http://image.yisou.com
1G1000
http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: which HTML parser is better?

2005-02-01 Thread Michael Giles
When I tested parsers a year or so ago for intensive use in Furl, the
best (tolerant of bad HTML) and fastest (tested on a 1.5M HTML page)
parser by far was TagSoup ( http://www.tagsoup.info ). It is actively
maintained and improved and I have never had any problems with it.

-Mike

Jingkang Zhang wrote:

Three HTML parsers(Lucene web application
demo,CyberNeko HTML Parser,JTidy) are mentioned in
Lucene FAQ
1.3.27.Which is the best?Can it filter tags that are
auto-created by MS-word 'Save As HTML files' function?

_
Do You Yahoo!?
150MP3
http://music.yisou.com/

http://image.yisou.com
1G1000
http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/mail_1g/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

  


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: which HTML parser is better?

2005-02-01 Thread Chuck Williams
I think that depends on what you want to do.  The Lucene demo parser does 
simple mapping of HTML files into Lucene Documents; it does not give you a 
parse tree for the HTML doc.  CyberNeko is an extension of Xerces (uses the 
same API; will likely become part of Xerces), and so maps an HTML document into 
a full DOM that you can manipulate easily for a wide range of purposes.  I 
haven't used JTidy at an API level and so don't know it as well -- based on its 
UI, it appears to be focused primarily on HTML validation and error 
detection/correction.

I use CyberNeko for a range of operations on HTML documents that go beyond 
indexing them in Lucene, and really like it.  It has been robust for me so far.

Chuck

   -Original Message-
   From: Jingkang Zhang [mailto:[EMAIL PROTECTED]
   Sent: Tuesday, February 01, 2005 1:15 AM
   To: lucene-user@jakarta.apache.org
   Subject: which HTML parser is better?
   
   Three HTML parsers(Lucene web application
   demo,CyberNeko HTML Parser,JTidy) are mentioned in
   Lucene FAQ
   1.3.27.Which is the best?Can it filter tags that are
   auto-created by MS-word 'Save As HTML files' function?
   
   _
   Do You Yahoo!?
   150MP3
   http://music.yisou.com/
   
   http://image.yisou.com
   1G1000
   http://cn.rd.yahoo.com/mail_cn/tag/1g/*http://cn.mail.yahoo.com/event/ma
   il_1g/
   
   -
   To unsubscribe, e-mail: [EMAIL PROTECTED]
   For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: demo HTML parser question

2004-09-23 Thread roy-lucene-user
Hi Fred,

We were originally attempting to use the demo html parser (Lucene 1.2), but as
you know, it's for a demo.  I think it's threaded to optimize on time, to allow
the calling thread to grab the title or top message even though it's not done
parsing the entire html document.  That's just a guess, I would love to hear
from others about this.  Anyway, since it is a separate thread, a token error
could kill it and there is no way for the calling thread to know about it.

We had to create our own html parser since we only cared about grabbing the
entire text from the html document and also we wanted to avoid the extra
thread.  We also do a lot of SKIPping for minimal EOF errors (html documents
in email almost never follow standards).  For your html needs, you might want
to check out other JavaCC HTML parsers from the JavaCC web site.

Roy.

On Wed, 22 Sep 2004 22:42:55 -0400, Fred Toth wrote
 Hi,
 
 I've been working with the HTML parser demo that comes with
 Lucene and I'm trying to understand why it's multi-threaded,
 and, more importantly, how to exit gracefully on errors.
 
 I've discovered if I throw an exception in the front-end static
 code (main(), etc.), the JVM hangs instead of exiting. Presumably
 this is because there are threads hanging around doing something.
 But I'm not sure what!
 
 Any pointers? I just want to exit gracefully on an error such as
 a required meta tag is missing or similar.
 
 Thanks,
 
 Fred
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: demo HTML parser question

2004-09-23 Thread Doug Cutting
[EMAIL PROTECTED] wrote:
We were originally attempting to use the demo html parser (Lucene 1.2), but as
you know, its for a demo.  I think its threaded to optimize on time, to allow
the calling thread to grab the title or top message even though its not done
parsing the entire html document.
That's almost right.  I originally wrote it that way to avoid having to 
ever buffer the entire text of the document.  The document is indexed 
while it is parsed.  But, as observed, this has lots of problems and was 
probably a bad idea.

Could someone provide a patch that removes the multi-threading?  We'd 
simply use a StringBuffer in HTMLParser.jj to collect the text.  Calls 
to pipeOut.write() would be replaced with text.append().  Then have the 
HTMLParser's constructor parse the page before returning, rather than 
spawn a thread, and getReader() would return a StringReader.  The public 
API of HTMLParser need not change at all and lots of complex threading 
code would be thrown away.  Anyone interested in coding this?
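The shape of that refactor might look like the following sketch; the tag-stripping line is only a stand-in for the JavaCC-generated parse loop, and the class name is illustrative:

```java
import java.io.Reader;
import java.io.StringReader;

// Sketch of the single-threaded HTMLParser shape Doug suggests: the
// constructor does all the parsing up front, appending to a buffer,
// and getReader() hands back the buffered text.
public class SimpleHtmlParser {
    private final StringBuffer text = new StringBuffer();

    public SimpleHtmlParser(String html) {
        // Stand-in for the generated token loop: calls that used to do
        // pipeOut.write(...) become text.append(...).
        text.append(html.replaceAll("(?s)<[^>]+>", " "));
    }

    public Reader getReader() {
        // No thread, no pipe: the page was fully parsed in the constructor.
        return new StringReader(text.toString());
    }
}
```

The public API stays the same as the demo parser's, so callers would not need to change.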

Doug
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


demo HTML parser question

2004-09-22 Thread Fred Toth
Hi,
I've been working with the HTML parser demo that comes with
Lucene and I'm trying to understand why it's multi-threaded,
and, more importantly, how to exit gracefully on errors.
I've discovered if I throw an exception in the front-end static
code (main(), etc.), the JVM hangs instead of exiting. Presumably
this is because there are threads hanging around doing something.
But I'm not sure what!
Any pointers? I just want to exit gracefully on an error such as
a required meta tag is missing or similar.
Thanks,
Fred
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Best HTML Parser !!

2003-02-26 Thread Nestel, Frank IZ/HZA-IC4
I've had fairly good experience with Jtidy!

But HTMLParser http://htmlparser.sourceforge.net/
seems to have the lighter looking API. It is Event
based and I might need to parse some large HTML sometime
soon, where DOM might be the problem. Does anyone
have practical experience with HTMLParser?

Thanks
Frank

 -Ursprüngliche Nachricht-
 Von: petite_abeille [mailto:[EMAIL PROTECTED] 
 Gesendet: Dienstag, 25. Februar 2003 19:49
 An: Lucene Users List
 Betreff: Re: Best HTML Parser !!
 
 
 
 On Monday, Feb 24, 2003, at 20:28 Europe/Zurich, Lukas Zapletal wrote:
 
  I have some good experiences with JTidy. It works like 
 DOM-XML parser
  and cleans HTML it by the way.
 
 I use jtidy also. Both for parsing and clean-up. Works pretty nicely.
 
  This is VERY useful, because EVERY HTML have at least ONE error.
 
 This rule should be tattooed on every parsers head: out of the 
 laboratory, nothing is compliant. Which render the race to more 
 compliance among the different parsers somewhat ridiculous.
 
 Cheers,
 
 PA.
 
 
 -
 To unsubscribe, e-mail: [EMAIL PROTECTED]
 For additional commands, e-mail: [EMAIL PROTECTED]
 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Best HTML Parser !!

2003-02-25 Thread Lukas Zapletal
Pierre Lacchini wrote:

Hello,

i'm trying to index html file with Lucene.
Do u know what's the best HTML Parser in Java ? 
The most Powerful ?
I need to extract meta-tag, and many other differents text fields...

Thx for ur help ;)

 

I have some good experiences with JTidy. It works like a DOM-XML parser 
and cleans up the HTML along the way.
This is VERY useful, because EVERY HTML page has at least ONE error.

Documents that were unparsable with Neko were parsed by JTidy without problems.

Creating the indexing program was two hours' work.

--
Lukas Zapletal  [EMAIL PROTECTED]
http://www.tanecni-olomouc.cz/lzap


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Re: Best HTML Parser !!

2003-02-25 Thread petite_abeille
On Monday, Feb 24, 2003, at 20:28 Europe/Zurich, Lukas Zapletal wrote:

I have some good experiences with JTidy. It works like DOM-XML parser 
and cleans HTML it by the way.
I use jtidy also. Both for parsing and clean-up. Works pretty nicely.

This is VERY useful, because EVERY HTML have at least ONE error.
This rule should be tattooed on every parser's head: out of the 
laboratory, nothing is compliant. Which renders the race to more 
compliance among the different parsers somewhat ridiculous.

Cheers,

PA.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


Best HTML Parser !!

2003-02-24 Thread Pierre Lacchini
Hello,
 
i'm trying to index html file with Lucene.
Do u know what's the best HTML Parser in Java ? 
The most Powerful ?
I need to extract meta-tag, and many other differents text fields...
 
Thx for ur help ;)


Re: Best HTML Parser !!

2003-02-24 Thread Otis Gospodnetic
It's not possible to generalize like that.
I like NekoHTML.

Otis

--- Pierre Lacchini [EMAIL PROTECTED] wrote:
 Hello,
  
 i'm trying to index html file with Lucene.
 Do u know what's the best HTML Parser in Java ? 
 The most Powerful ?
 I need to extract meta-tag, and many other differents text fields...
  
 Thx for ur help ;)
 


__
Do you Yahoo!?
Yahoo! Tax Center - forms, calculators, tips, more
http://taxes.yahoo.com/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



AW: Best HTML Parser !!

2003-02-24 Thread Borkenhagen, Michael (ofd-ko zdfin)
I prefer JTidy http://lempinen.net/sami/jtidy/.

Michael
-Ursprüngliche Nachricht-
Von: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Gesendet: Montag, 24. Februar 2003 15:03
An: Lucene Users List; [EMAIL PROTECTED]
Betreff: Re: Best HTML Parser !!


It's not possible to generalize like that.
I like NekoHTML.

Otis

--- Pierre Lacchini [EMAIL PROTECTED] wrote:
 Hello,
  
 i'm trying to index html file with Lucene.
 Do u know what's the best HTML Parser in Java ? 
 The most Powerful ?
 I need to extract meta-tag, and many other differents text fields...
  
 Thx for ur help ;)
 


__
Do you Yahoo!?
Yahoo! Tax Center - forms, calculators, tips, more
http://taxes.yahoo.com/

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Demo provided HTML parser bug (was RE: Newbie quizzes further...)

2002-09-06 Thread Stone, Timothy

List Fellows:

Lacking any knowledge of JavaCC, I solicted help in hacking the
HTMLParser.jj included in the demo. I retreat from this solication, for two
reasons: 1) I'm using other ideas gleaned from the list archives, 2) I'm not
prepared to dive into the world of complier compliers. The mere sound of it
is intimidating. 

So the bug. (If the bug is not worth fixing in the provided HTMLParser, drop
another one in, like Quiotix's; I did.)

Summary:
The current HTMLParser fails to correctly handle HTML decimal entities.

<title>MyWebsite&#8212;Home Page</title>
<p>My website&#8217;s address is...</p>

The following is produced after indexing the HTML and performing a query:

MyWebsite?Home Page
My website?s address is...

Another problem is manifest in the following oddity:

Given the following *source*; **note the use of the ampersand entity**

<title>MyWebsite&amp;#8212;Home Page</title> 
<p>My website&amp;#8217;s address is...</p>

This produces the output (where two dashes represent an em dash)

MyWebsite--Home Page
My website's address is...

And the source of the *results* appears correctly, even if the source
document that was indexed is incorrect! Some kind of entity replacement is
occurring here.

<title>MyWebsite&#8212;Home Page</title>
<p>My website&#8217;s address is...</p>

(I ran across the latter oddity courtesy of Adobe GoLive's annoying syntax
rewriter.)

Now, some might be asking, and rightly so, why hasn't this been seen before?
I know a search in the archives didn't turn anything up. It's likely because
the use of decimal entities is misunderstood by the HTML community at large.
For instance, some, quite possibly a whole lot, use &#151; for em
dash--this is incorrect as the whole range &#127; to &#159; is invalid.
Second, many may use named encoding. Named encoding, i.e. &mdash;, is fine,
but decimal encoding provides a more consistent behavior cross-platform. 

For more on this, read The Trouble with EM 'n EN and Other Shady
Characters at A List Apart (www.alistapart.com/stories/emen/) 

Yours in Lucene.
Tim
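For illustration, a hedged sketch of decoding decimal character references such as &#8212; ahead of indexing (class and method names here are hypothetical, not part of the demo parser):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NumericEntityDecoder {
    private static final Pattern DECIMAL = Pattern.compile("&#(\\d+);");

    // Replaces decimal character references like &#8212; with the
    // corresponding Unicode character, so an em dash survives indexing
    // instead of degrading to '?'.
    public static String decode(String html) {
        Matcher m = DECIMAL.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            m.appendReplacement(out,
                Matcher.quoteReplacement(new String(Character.toChars(codePoint))));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

A production version would also handle hexadecimal (&#x2014;) and named references, and reject the invalid &#127;-&#159; range.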



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: problems with HTML Parser

2002-08-14 Thread Maurits van Wijland

Keith,

I haven't noticed the problem with the Parser...but you trigger me
by saying that you have a PDFParser!!!

Are you able to contribute this PDFParser??

Maurits.
- Original Message -
From: Keith Gunn [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Wednesday, August 14, 2002 9:46 AM
Subject: problems with HTML Parser


 Has anyone noticed that the HTML Parser that comes with
 Lucene joins terms together when parsing a file.
 I used to think it was my PDFParser but after fixing that
 I found out it was the HTMLParser.

 I managed to find a replacement parser that doesn't join terms.

 Just wondered if anyone had come across this problem??




 --
 To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: problems with HTML Parser

2002-08-14 Thread Ben Litchfield

Maurits,

You can get a PDF parser from http://www.pdfbox.org

-Ben


On Wed, 14 Aug 2002, Maurits van Wijland wrote:

 Keith,

 I haven't noticed the problem with the Parser...but you trigger me
 by saying that you have a PDFParser!!!

 Are you able to contribute this PDFParser??

 Maurits.


--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: problems with HTML Parser

2002-08-14 Thread Keith Gunn

If you're parsing HTML files, have a check in Lucene
to see the terms that are indexed and see if you can
spot any joined terms.

The PDF parser as you can see from the other mail is from
www.pdfbox.org and i highly recommend it (thanks again Ben!)




On Wed, 14 Aug 2002, Maurits van Wijland wrote:

 Keith,

 I haven't noticed the problem with the Parser...but you trigger me
 by saying that you have a PDFParser!!!

 Are you able to contribute this PDFParser??

 Maurits.
 - Original Message -
 From: Keith Gunn [EMAIL PROTECTED]
 To: Lucene Users List [EMAIL PROTECTED]
 Sent: Wednesday, August 14, 2002 9:46 AM
 Subject: problems with HTML Parser


  Has anyone noticed that the HTML Parser that comes with
  Lucene joins terms together when parsing a file.
  I used to think it was my PDFParser but after fixing that
  I found out it was the HTMLParser.
 
  I managed to find a replacement parser that doesn't join terms.
 
  Just wondered if anyone had come across this problem??
 
 
 
 
  --
  To unsubscribe, e-mail:
 mailto:[EMAIL PROTECTED]
  For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]
 


 --
 To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
 For additional commands, e-mail: mailto:[EMAIL PROTECTED]




--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: HTML parser

2002-04-20 Thread [EMAIL PROTECTED]

Hi all,

I'm very interested in this thread. I also have to solve the problem 
of spidering web sites, creating an index (well, about this there is the 
BIG problem that lucene can't be integrated easily with a DB), 
extracting links from the page, and repeating the whole process.

For extracting links from a page I'm thinking of using JTidy. I think 
that with this library you can also parse a non well formed page (that 
you can take from the web with URLConnection) by setting the property to 
clean the page. The Tidy class returns an org.w3c.dom.Document that 
you can use for analyzing the whole document: for example you can use 
doc.getElementsByTagName("a") for taking all the "a" elements. You can 
parse it as xml.
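The DOM approach described here can be illustrated with the JDK's own XML parser on a well-formed page; JTidy's parseDOM returns the same org.w3c.dom.Document interface for messy HTML, but JTidy itself is assumed and not shown:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.List;

public class LinkExtractor {
    // Extracts the href attribute of every <a> element in a well-formed
    // document. With JTidy the Document would come from Tidy instead.
    public static List<String> links(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes("UTF-8")));
        NodeList anchors = doc.getElementsByTagName("a");
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < anchors.getLength(); i++) {
            out.add(((Element) anchors.item(i)).getAttribute("href"));
        }
        return out;
    }
}
```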

Has anyone solved the problem of spidering web pages recursively?

Laura




 
 While trying to research the same thing, I found the following...here's a
 good example of link extraction.
 
 Try http://www.quiotix.com/opensource/html-parser
 
 It's easy to write a Visitor which extracts the links; should take about ten
 lines of code.
 
 --
 Brian Goetz
 Quiotix Corporation
 [EMAIL PROTECTED]   Tel: 650-843-1300    Fax: 650-324-8032
 
 http://www.quiotix.com
 
 --
 To unsubscribe, e-mail: mailto:lucene-user-[EMAIL PROTECTED]
 For additional commands, e-mail: mailto:lucene-user-[EMAIL PROTECTED]
 
 


RE: HTML parser

2002-04-19 Thread Mark Ayad

You can use the Swing HTML parser to do this, but it's only a 3.2 DTD based
parser. I have written (attached) a total hack job for breaking up an HTML
page into its component parts; the code gives you an idea... If anyone wants
to know how to use the Swing based parser, I can add some code?

Mark




-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: 19 April 2002 07:29
To: [EMAIL PROTECTED]
Subject: HTML parser


Hello,

I need to select an HTML parser for the application that I'm writing
and I'm not sure what to choose.
The HTML parser included with Lucene looks flimsy, JTidy looks like a
hack and an overkill, using classes written for Swing
(javax.swing.text.html.parser) seems wrong, and I haven't tried David
McNicol's parser (included with Spindle).

Somebody on this list must have done some research on this subject.
Can anyone share some experiences?
Have you found a better HTML parser than any of those I listed above?
If your application deals with HTML, what do you use for parsing it?

Thanks,
Otis


__
Do You Yahoo!?
Yahoo! Tax Center - online filing with TurboTax
http://taxes.yahoo.com/

--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]




PageBreaker.java
Description: java/

--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]


RE: HTML parser

2002-04-19 Thread Ian Forsyth


Are there core classes part of lucene that allow one to feed lucene links,
and 'it' will capture the contents of those urls into the index..

or does one write a file capture class to seek out the url store the file in
a directory, then index the local directory..

Ian


-Original Message-
From: Terence Parr [mailto:[EMAIL PROTECTED]]
Sent: Friday, April 19, 2002 1:38 AM
To: Lucene Users List
Subject: Re: HTML parser



On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:

:snip

Hi Otis,

I have an HTML parser built for ANTLR, but it's pretty strict in what it
accepts.  Not sure how useful it will be for you, but here it is:

http://www.antlr.org/grammars/HTML

I am not sure what your goal is, but I personally have to scarf all
sorts of HTML from various websites to suck them into the jGuru search
engine.  I use a simple stripHTML() method I wrote to handle it.  Works
great.  Kills everything but the text.  Is that the kind of thing you
are looking for or do you really want to parse, not filter?

Terence
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org


--
To unsubscribe, e-mail:
mailto:[EMAIL PROTECTED]
For additional commands, e-mail:
mailto:[EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




RE: HTML parser

2002-04-19 Thread Otis Gospodnetic

Such classes are not included with Lucene.
This was _just_ mentioned on this list earlier today.
Look at the archives and search for crawler, URL, lucene sandbox, etc.

Otis

--- Ian Forsyth [EMAIL PROTECTED] wrote:
 
 Are there core classes part of lucene that allow one to feed lucene
 links,
 and 'it' will capture the contents of those urls into the index..
 
 or does one write a file capture class to seek out the url store the
 file in
 a directory, then index the local directory..
 
 Ian
 
 
 -Original Message-
 From: Terence Parr [mailto:[EMAIL PROTECTED]]
 Sent: Friday, April 19, 2002 1:38 AM
 To: Lucene Users List
 Subject: Re: HTML parser
 
 
 
 On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
 
 :snip
 
 Hi Otis,
 
 I have an HTML parser built for ANTLR, but it's pretty strict in what
 it
 accepts.  Not sure how useful it will be for you, but here it is:
 
 http://www.antlr.org/grammars/HTML
 
 I am not sure what your goal is, but I personally have to scarf all
 sorts of HTML from various websites to such them into the jGuru
 search
 engine.  I use a simple stripHTML() method I wrote to handle it. 
 Works
 great.  Kills everything but the text.  is that the kind of thing you
 are looking for or do you really want to parse not filter?
 
 Terence
 --
 Co-founder, http://www.jguru.com
 Creator, ANTLR Parser Generator: http://www.antlr.org
 
 
 --
 To unsubscribe, e-mail:
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]
 
 
 
 --
 To unsubscribe, e-mail:  
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]
 


__
Do You Yahoo!?
Yahoo! Tax Center - online filing with TurboTax
http://taxes.yahoo.com/

--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: HTML parser

2002-04-19 Thread David Black

While trying to research the same thing, I found the following...here's 
a good example of link extraction.

http://developer.java.sun.com/developer/TechTips/1999/tt0923.html

It seems like I could use this to also get the text out from between the 
tags but haven't been able to do it yet.  It seems like it should be 
simple but geez...my head hurts.






On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote:


 Are there core classes part of lucene that allow one to feed lucene 
 links,
 and 'it' will capture the contents of those urls into the index..

 or does one write a file capture class to seek out the url store the 
 file in
 a directory, then index the local directory..

 Ian


 -Original Message-
 From: Terence Parr [mailto:[EMAIL PROTECTED]]
 Sent: Friday, April 19, 2002 1:38 AM
 To: Lucene Users List
 Subject: Re: HTML parser



 On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:

 :snip

 Hi Otis,

 I have an HTML parser built for ANTLR, but it's pretty strict in what it
 accepts.  Not sure how useful it will be for you, but here it is:

 http://www.antlr.org/grammars/HTML

 I am not sure what your goal is, but I personally have to scarf all
 sorts of HTML from various websites to such them into the jGuru search
 engine.  I use a simple stripHTML() method I wrote to handle it.  Works
 great.  Kills everything but the text.  is that the kind of thing you
 are looking for or do you really want to parse not filter?

 Terence
 --
 Co-founder, http://www.jguru.com
 Creator, ANTLR Parser Generator: http://www.antlr.org


 --
 To unsubscribe, e-mail:
 mailto:[EMAIL PROTECTED]
 For additional commands, e-mail:
 mailto:[EMAIL PROTECTED]



 --
 To unsubscribe, e-mail:   mailto:lucene-user-
 [EMAIL PROTECTED]
 For additional commands, e-mail: mailto:lucene-user-
 [EMAIL PROTECTED]



--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]




Re: HTML parser

2002-04-19 Thread Erik Hatcher

HttpUnit (which uses JTidy under the covers) makes childs play out of
pulling out links and navigating to them.

The only caveat (and this would be true for practically all tools, I
suspect) is that the HTML has to be relatively well-formed for it to work
well.  JTidy can be somewhat forgiving though.

Erik

- Original Message -
From: David Black [EMAIL PROTECTED]
To: Lucene Users List [EMAIL PROTECTED]
Sent: Friday, April 19, 2002 5:26 PM
Subject: Re: HTML parser


 While trying to research the same thing, I found the following...here's
 a good example of link extraction.

 http://developer.java.sun.com/developer/TechTips/1999/tt0923.html

 It seems like I could use this to also get the text out from between the
 tags but haven't been able to do it yet.  It seems like it should be
 simple but geez...my head hurts.






 On Friday, April 19, 2002, at 01:40 PM, Ian Forsyth wrote:

 
  Are there core classes part of lucene that allow one to feed lucene
  links,
  and 'it' will capture the contents of those urls into the index..
 
  or does one write a file capture class to seek out the url store the
  file in
  a directory, then index the local directory..
 
  Ian
 
 
  -Original Message-
  From: Terence Parr [mailto:[EMAIL PROTECTED]]
  Sent: Friday, April 19, 2002 1:38 AM
  To: Lucene Users List
  Subject: Re: HTML parser
 
 
 
  On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
 
  :snip
 
  Hi Otis,
 
  I have an HTML parser built for ANTLR, but it's pretty strict in what it
  accepts.  Not sure how useful it will be for you, but here it is:
 
  http://www.antlr.org/grammars/HTML
 
  I am not sure what your goal is, but I personally have to scarf all
  sorts of HTML from various websites to suck them into the jGuru search
  engine.  I use a simple stripHTML() method I wrote to handle it.  Works
  great.  Kills everything but the text.  Is that the kind of thing you
  are looking for or do you really want to parse not filter?
 
  Terence
  --
  Co-founder, http://www.jguru.com
  Creator, ANTLR Parser Generator: http://www.antlr.org
 
 
 
 
 
 










Re: HTML parser

2002-04-19 Thread Brian Goetz


While trying to research the same thing, I found the following...here's a 
good example of link extraction.

Try http://www.quiotix.com/opensource/html-parser

It's easy to write a Visitor which extracts the links; it should take about ten
lines of code.
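The Quiotix visitor API isn't shown anywhere in this thread, so here is a rough illustration of the same link-extraction idea using only the JDK's built-in Swing HTML parser (mentioned elsewhere in the thread); the class and method names are invented for the sketch, not Brian's code:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class LinkExtractor {

    /** Collects the HREF of every anchor tag in the given HTML string. */
    public static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // true = ignore any charset declaration and parse leniently
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractLinks(
            "<html><body><a href=\"http://example.com/\">x</a></body></html>"));
    }
}
```

The Swing parser is quite lenient about malformed markup, which matters for the kind of real-world pages discussed in this thread.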



--
Brian Goetz
Quiotix Corporation
[EMAIL PROTECTED]   Tel: 650-843-1300Fax: 650-324-8032

http://www.quiotix.com






HTML parser

2002-04-18 Thread Otis Gospodnetic

Hello,

I need to select an HTML parser for the application that I'm writing
and I'm not sure what to choose.
The HTML parser included with Lucene looks flimsy, JTidy looks like a
hack and overkill, using classes written for Swing
(javax.swing.text.html.parser) seems wrong, and I haven't tried David
McNicol's parser (included with Spindle).

Somebody on this list must have done some research on this subject.
Can anyone share some experiences?
Have you found a better HTML parser than any of those I listed above?
If your application deals with HTML, what do you use for parsing it?

Thanks,
Otis


__
Do You Yahoo!?
Yahoo! Tax Center - online filing with TurboTax
http://taxes.yahoo.com/





Re: HTML parser

2002-04-18 Thread Terence Parr


On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:

 Hello,

 I need to select an HTML parser for the application that I'm writing
 and I'm not sure what to choose.
 The HTML parser included with Lucene looks flimsy, JTidy looks like a
 hack and overkill, using classes written for Swing
 (javax.swing.text.html.parser) seems wrong, and I haven't tried David
 McNicol's parser (included with Spindle).

 Somebody on this list must have done some research on this subject.
 Can anyone share some experiences?
 Have you found a better HTML parser than any of those I listed above?
 If your application deals with HTML, what do you use for parsing it?

Hi Otis,

I have an HTML parser built for ANTLR, but it's pretty strict in what it 
accepts.  Not sure how useful it will be for you, but here it is:

http://www.antlr.org/grammars/HTML

I am not sure what your goal is, but I personally have to scarf all
sorts of HTML from various websites to suck them into the jGuru search
engine.  I use a simple stripHTML() method I wrote to handle it.  Works
great.  Kills everything but the text.  Is that the kind of thing you
are looking for or do you really want to parse not filter?

Terence
--
Co-founder, http://www.jguru.com
Creator, ANTLR Parser Generator: http://www.antlr.org
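Terence's actual stripHTML() is never posted in the thread; a minimal sketch of the kill-everything-but-the-text idea might look like the following (the class name is invented, and a regex like this is deliberately naive about comments, scripts, and CDATA):

```java
import java.util.regex.Pattern;

public class HtmlStripper {

    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    /** Removes tags, decodes a few common entities, and collapses whitespace. */
    public static String stripHTML(String html) {
        String text = TAGS.matcher(html).replaceAll(" ");
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        return text.replaceAll("\\s+", " ").trim();
    }
}
```

For feeding an index this is often good enough, since the analyzer will tokenize the result anyway; it just must not be used on untrusted markup where correctness matters.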






Re: HTML parser

2002-04-18 Thread Otis Gospodnetic

Hello Terrence,

Ah, you got me.
I guess I need a bit of both.
I need to just strip HTML and get raw body text so that I can stick it
in Lucene's index.
I would also like something that can extract at least the
<title>...</title> stuff, so that I can stick that in a separate field
in the Lucene index.
While doing that I, like you, need to be able to handle poorly
formatted web pages.
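For pulling out just the title, even from sloppy pages, a tolerant case-insensitive regex is often enough; a rough sketch (the helper name is invented, not from this thread):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleExtractor {

    private static final Pattern TITLE =
        Pattern.compile("<title[^>]*>(.*?)</title>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the contents of the first title element, or null if absent. */
    public static String extractTitle(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }
}
```

The returned string can then go into its own field in the Lucene document, alongside the stripped body text.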

In the future I may need something that can extract HREFs,
but I'll stick to one of the XP principles and just look for something
that meets current needs :)

I looked for ANTLR-based HTML parser a few days ago, but must have
missed the one you pointed out.  I'll take a look at it now.
Can you share or describe your stripHTML method?  Simple Java that
looks for <s and >s, or something smarter?

Thanks,
Otis
P.S.
This type of thing makes me wish I could use Perl or Python :)


--- Terence Parr [EMAIL PROTECTED] wrote:
 
 On Thursday, April 18, 2002, at 10:28  PM, Otis Gospodnetic wrote:
 
  Hello,
 
  I need to select an HTML parser for the application that I'm
 writing
  and I'm not sure what to choose.
  The HTML parser included with Lucene looks flimsy, JTidy looks like a
  hack and overkill, using classes written for Swing
  (javax.swing.text.html.parser) seems wrong, and I haven't tried
 David
  McNicol's parser (included with Spindle).
 
  Somebody on this list must have done some research on this subject.
  Can anyone share some experiences?
  Have you found a better HTML parser than any of those I listed
 above?
  If your application deals with HTML, what do you use for parsing
 it?
 
 Hi Otis,
 
 I have an HTML parser built for ANTLR, but it's pretty strict in what
 it 
 accepts.  Not sure how useful it will be for you, but here it is:
 
 http://www.antlr.org/grammars/HTML
 
 I am not sure what your goal is, but I personally have to scarf all
 sorts of HTML from various websites to suck them into the jGuru search
 engine.  I use a simple stripHTML() method I wrote to handle it.  Works
 great.  Kills everything but the text.  Is that the kind of thing you
 are looking for or do you really want to parse not filter?
 
 Terence
 --
 Co-founder, http://www.jguru.com
 Creator, ANTLR Parser Generator: http://www.antlr.org
 
 
 







HTML Parser

2002-04-09 Thread Neal Weinstein

Hi,

I am working with the lucene demo and would like to compile the demo so that
I may eventually modify it for my own use. I am using the source from
lucene-demos-1.2-rc4.jar.zip.

However, the HTMLParser class has the filename HTMLParser.jj and won't
compile.
I changed the name to HTMLParser.java; still the same problem.
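For what it's worth, HTMLParser.jj is a JavaCC grammar rather than Java source, so renaming it won't help: it first has to be run through JavaCC, which generates HTMLParser.java plus its support classes, and only then can everything be compiled. A sketch of the steps, assuming javacc is installed and on the PATH and that you run it in the directory containing the grammar:

```shell
# Generate HTMLParser.java (plus Token.java, ParseException.java, etc.)
# from the JavaCC grammar, then compile the generated sources.
javacc HTMLParser.jj
javac *.java
```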

Any help would be greatly appreciated.

Thanks,
Neal
 

Neal Weinstein
Manager Software Development
blue*spark
[EMAIL PROTECTED]
T (416) 971-6612 x205
F (416) 971-6549
489 King Street West, Suite 200
Toronto, Ontario M5V 1K4 Canada
www.bluespark.com



HTML Parser

2001-12-17 Thread Christophe GOGUYER DESSAGNES

Hi,

How should I integrate the HTML Parser (which is in the demo directory) into a
new project?
In particular, what do I do with the HTMLParser.jj file?
Do I need to compile it before trying to use it in my code?
Any help would be appreciated!
Thanks.

-
Christophe

