SV: SV: Indexing HTML

2002-12-09 Thread Ronnie Kolehmainen
Hi,

These are the classes I use. I only use them to extract the text, so
they don't have methods for getting the document title and such. However,
text extraction has worked fine for me.

The HtmlParser main method takes a file path as its argument and writes the
contents to a file named html.txt, which is useful when testing.
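
The attached classes are not reproduced in this archive. As a rough
illustration of the workflow described above (parse a file, write its text
to html.txt), here is a minimal sketch that uses the HTML parser bundled with
the JDK's Swing package instead of the attached HtmlParser. The class name
TextExtractor and its methods are hypothetical and are not taken from the
attachment.

```java
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical stand-in for the attached classes; it only extracts text.
public class TextExtractor {

    /** Collects the text nodes of an HTML document, separated by spaces. */
    public static String extractText(Reader html) throws IOException {
        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append(' ');
            }
        };
        // 'true' makes the parser ignore charset declarations inside the document.
        new ParserDelegator().parse(html, callback, true);
        return text.toString().trim();
    }

    /** Usage: java TextExtractor path/to/file.html (output goes to html.txt). */
    public static void main(String[] args) throws IOException {
        // FileReader and FileWriter use the platform default encoding.
        try (Reader in = new FileReader(args[0]);
             Writer out = new FileWriter("html.txt")) {
            out.write(extractText(in));
        }
    }
}
```

This Swing parser is lenient with malformed markup, which is the main reason
to pick it for a quick test; it says nothing about how the attached classes
are implemented.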

/Ronnie


 -----Original Message-----
 From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
 Sent: 7 December 2002 17:12
 To: Lucene Users List
 Subject: Re: SV: Indexing HTML


 I have had good experiences with the NekoHTML parser.

 Otis





HtmlDocument.java
Description: Binary data


HtmlParser.java
Description: Binary data
--
To unsubscribe, e-mail:   mailto:[EMAIL PROTECTED]
For additional commands, e-mail: mailto:[EMAIL PROTECTED]


Re: SV: Indexing HTML

2002-12-07 Thread Leo Galambos
 I'm not sure this is a solution to your problem. However, it seems that the
 HTMLParser used by the IndexHTML class has problems parsing the document
 (there is a test class included in the jar):
 
 
 java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
 org.apache.lucene.demo.html.Test f01529.txt
 Title: Webcz.cz - Power of search
 Parse Aborted: Encountered \' at line 106, column 27.
 Was expecting one of:
 ArgName ...
 TagEnd ...
 /Ronnie

Hi Ronnie!

I know about it, and the exception is handled well (see the log file below). I
have found a better example than 1529; try this one:
http://com-os2.ms.mff.cuni.cz/bugs/f00034.txt This file cannot get through
the Lucene HTML parser (I have tried Lucene 1.2 and IBM JDK 1.3.1r3). The file
is unusual, i.e. it has two titles, two base tags, etc.

I have no debugger here, so I cannot find the line where the bug is. If
you try your magic, please let me know about the patch. :) Thanks

-g-



adding save/d00320/f01516.html
Parse Aborted: Lexical error at line 68, column 11.  Encountered: \u0178 (376), after :
:
adding save/d00320/f01527.html
Parse Aborted: Encountered = at line 83, column 48.
Was expecting one of:
ArgName ...
TagEnd ...

adding save/d00320/f01528.html







SV: Indexing HTML

2002-12-04 Thread Ronnie Kolehmainen
Dear Leo,

I'm not sure this is a solution to your problem. However, it seems that the
HTMLParser used by the IndexHTML class has problems parsing the document
(there is a test class included in the jar):


java -cp C:\projects\lucene\jakarta-lucene\bin\lucene-demos.jar
org.apache.lucene.demo.html.Test f01529.txt
Title: Webcz.cz - Power of search
Parse Aborted: Encountered \' at line 106, column 27.
Was expecting one of:
ArgName ...
TagEnd ...


If you look at the source of that document, you can see a JavaScript block
containing this problematic line:


document.write('s' + 'cript
src=http://ad.webcz.cz/adwebcz/adscript.asp?a=10t=0b=0x=468y=60nocache
=' + nIndex + '');
^


It looks to me as if the HTMLParser does _not_ handle script tags
correctly, i.e. it does not ignore everything up to the closing /script tag.
If you check stdout, there should be error messages from the ParserThread
class like the one above.

I tried parsing the same document with another HTML parser class without any
problems. Maybe try replacing the HTMLParser class used by HTMLDocument with
your own? Or edit the HTMLParser.jj file if you know JavaCC.
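
If neither option is practical, a cruder workaround is to drop script blocks
from the markup before it reaches the parser. The following is only a sketch
of that pre-processing idea, not a fix for HTMLParser.jj; the class name
ScriptStripper and the method stripScripts are hypothetical, and the regular
expression assumes every script block has an explicit closing tag.

```java
import java.util.regex.Pattern;

// Hypothetical pre-processing helper; not part of the Lucene demo.
public class ScriptStripper {

    // (?i) ignores case, (?s) lets '.' match newlines.
    // The reluctant '.*?' stops at the first closing script tag.
    private static final Pattern SCRIPT_BLOCK =
            Pattern.compile("(?is)<script\\b[^>]*>.*?</script\\s*>");

    /** Replaces each script block, including its tags, with a single space. */
    public static String stripScripts(String html) {
        return SCRIPT_BLOCK.matcher(html).replaceAll(" ");
    }
}
```

The cleaned string could then be wrapped in a java.io.StringReader and handed
to whichever parser is in use. A script block that is never closed is left
untouched by this pattern, so the parser may still fail on such a file.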


/Ronnie



 -----Original Message-----
 From: Leo Galambos [mailto:[EMAIL PROTECTED]]
 Sent: 3 December 2002 20:32
 To: [EMAIL PROTECTED]
 Subject: Indexing HTML


 I tried to use IndexHTML (demo) and Lucene 1.2 for indexing *.CZ, but
 Lucene often falls into a never-ending loop. I've analyzed my data, so I know
 which file(s) bring Lucene down. I don't see anything special in the
 file(s), so I think they get through the parser to the main Lucene
 routines (in which case the problem could be in the Merger).

 Could you help me, please?

 One of the problematic files:
 http://com-os2.ms.mff.cuni.cz/bugs/f01529.txt
 My program (based on Lucene demo):
 http://com-os2.ms.mff.cuni.cz/bugs/IndexHTML.java

 Thank you very much.

 -g-





