Re: which HTML parser is better?

2005-02-01 Thread Michael Giles
When I tested parsers a year or so ago for intensive use in Furl, the
best (most tolerant of bad HTML) and fastest (tested on a 1.5 MB HTML page)
parser by far was TagSoup (http://www.tagsoup.info). It is actively
maintained and improved, and I have never had any problems with it.

-Mike

Jingkang Zhang wrote:

>Three HTML parsers (the Lucene web application
>demo, CyberNeko HTML Parser, JTidy) are mentioned in
>Lucene FAQ 1.3.27. Which is the best? Can it filter the
>tags auto-created by MS Word's 'Save As HTML' function?




Re: Title of PDF

2004-06-28 Thread Michael Giles
Don,
I think you misunderstood Otis.  You will need to use some sort of parser
(e.g. PDFBox, xpdf) to get the title text from the PDF (I assume you are
indexing the documents, so you have this already).  Then you create a
"title" field in your index and store the text of the title in there (so
that you can return it with the results).  That is independent of what you
do with the rest of the PDF document.
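
A minimal sketch of that flow, assuming PDFBox for the extraction (the
PDFBox calls are from memory and should be checked against your version;
the field calls are Lucene 1.x API, and "writer" is an open IndexWriter):

  PDDocument pdf = PDDocument.load(new File("paper.pdf"));
  String title = pdf.getDocumentInformation().getTitle();   // may be null
  String body = new PDFTextStripper().getText(pdf);
  pdf.close();

  Document doc = new Document();
  doc.add(Field.Text("title", title == null ? "" : title)); // stored + indexed, returned with hits
  doc.add(Field.UnStored("contents", body));                // indexed only, not stored
  writer.addDocument(doc);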

-Mike
At 11:17 AM 6/28/2004, Don Vaillancourt wrote:
It can't be that simple because the field will contain the whole PDF and
not just the title.  And for PDFs that are 3 or 4 megs, it is really not
reasonable to store the whole PDF in the collection just to get the title.




Performance profile of optimization...

2004-05-24 Thread Michael Giles
What is the performance profile of optimizing an index?  By that I mean,
what are the primary variables that negatively impact its speed (e.g. index
size in bytes or docs, number of adds/deletes since the last optimization,
etc.)?  For example, if I add a single document to a small (under 10K docs)
index and still have that index open (but would otherwise close it until
the next update, a few minutes later), what kind of performance hit would
optimizing the index incur?  Does that cost change as the index gets bigger,
or is it tied to the number of changes that need to be rolled in?
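
For reference, the call I mean is just this (Lucene 1.x API):

  writer.addDocument(doc);
  writer.optimize();   // merges all segments of the index into one
  writer.close();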

-Mike



Re: Internal full content store within Lucene

2004-05-18 Thread Michael Giles
Certainly any advancement in this area seems like a good idea.
I'll throw a use case on the pile as well.  For my own interest, the 
biggest need is in highlighting (i.e. highlighting relevant segments within 
the full text of documents).  I need to provide highlighted abstracts in 
the search results, so the solution would need to be performant enough to 
provide that service.
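
For concreteness, a rough sketch of the kind of thing I mean using the
sandbox Highlighter (class and method names from the sandbox highlighter
package; treat the exact signatures as assumptions):

  import java.io.StringReader;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.search.highlight.Highlighter;
  import org.apache.lucene.search.highlight.QueryScorer;

  // Score fragments of the stored text against the query and keep the best one.
  Highlighter highlighter = new Highlighter(new QueryScorer(query));
  TokenStream tokens = analyzer.tokenStream("contents", new StringReader(fullText));
  String excerpt = highlighter.getBestFragment(tokens, fullText);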

-Mike
At 02:43 PM 5/18/2004, you wrote:
Per the discussion the other day about storing content external to Lucene 
I think we have an opportunity to improve the lucene core and bring a lot 
of functionality to future developers.





Re: Storing numbers

2004-03-09 Thread Michael Giles
Tim,

Looks like you can only access it with a subscription.  :(  Sounds good, 
though.

-Mike

At 02:39 PM 3/9/2004, you wrote:
[EMAIL PROTECTED] wrote:

Hi!
I want to store numbers (ids) in my index:

  long id = 1069421083284L;
  doc.add(Field.UnStored("id", String.valueOf(id)));

But searching for "id:1069421083284" doesn't return any hits.
Did I misunderstand something? UnStored means the value is indexed
(analyzed) but not stored, doesn't it? Anyway, Field.Text doesn't work either.
TIA
Timo
Craig Walls wrote an excellent article in JDJ at the end of 2002 regarding
Lucene (not listed in any of the resources, BTW). He documents using Lucene
alongside a database, and provides two classes (among other unrelated ones)
that extend the StopAnalyzer to also accept numbers and/or alphanumerics.

Check out the article at: http://www.sys-con.com/story/print.cfm?storyid=37296

HTH,
Tim
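
For what it's worth, the usual fix for exact id lookups is to index the id
untokenized and query it without going through an analyzer at all; a
minimal sketch (Lucene 1.x API, untested here):

  // Index the id as a Keyword field: stored and indexed, but not analyzed.
  long id = 1069421083284L;
  doc.add(Field.Keyword("id", String.valueOf(id)));

  // Search with a TermQuery so no analyzer can mangle the number.
  Query q = new TermQuery(new Term("id", "1069421083284"));
  Hits hits = searcher.search(q);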



Filtering out duplicate documents...

2004-03-08 Thread Michael Giles
I'm looking for a way to filter out duplicate documents in an index
(either while indexing, or after the fact).  It seems like there should be
an approach based on comparing the terms of two documents, but I'm wondering
if other folks (e.g. nutch) have come up with a solution to this problem.

Obviously you can compute the Levenshtein distance on the text, but that is
way too computationally intensive to scale.  So the goal is to find
something that would be workable in a production system.  For example, a
given NYT article and its printer-friendly version should be deemed to be
the same.
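
One cheap, term-based approach in the spirit of the above (an untested
illustration, not something nutch is known to do): fingerprint each
document by hashing its analyzed token stream, and skip any document whose
fingerprint has already been seen.  Note this only catches near-verbatim
duplicates; the printer-friendly case would still need the surrounding page
chrome stripped first.

  import java.io.StringReader;
  import java.math.BigInteger;
  import java.security.MessageDigest;
  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenStream;

  // Hash the analyzed tokens of a document's text (Lucene 1.x token API).
  String fingerprint(Analyzer analyzer, String text) throws Exception {
      TokenStream tokens = analyzer.tokenStream("contents", new StringReader(text));
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      for (Token t = tokens.next(); t != null; t = tokens.next()) {
          md5.update(t.termText().getBytes());
      }
      return new BigInteger(1, md5.digest()).toString(16);
  }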

-Mike





Re: Concurrency

2004-02-20 Thread Michael Giles
It would be great if we could come up with a way to integrate the Lucene
locking information with something more incremental like rsync.  At Furl
(http://www.furl.net) we have this problem in spades because we have
thousands (and thousands) of indexes that need to be backed up.  Currently,
we run rsync frequently (hourly) with "safe" (stopped-server) daily
snapshots (that roll over, etc.).  Essentially we are playing the odds on
the frequent backups, since indexes are only open for the moment that a
new item is saved (they are reopened/closed each time) and we also have all
of the data to recreate them if need be.  But this is definitely a topic we
have discussed, and it would be nice to have a solution to it.

Any other ideas out there (e.g. a way to check the locks and then retry a
bit later)?

-Mike

At 11:43 AM 2/20/2004, you wrote:
Alan Smith wrote:
1. What happens if i make a backup (copy) of an index while documents are 
being added? Can it cause problems, and if so is there a way to safely do this?
This is not in general safe.  A copy may not be a usable index.  The
segments file points to the current set of files.  An IndexWriter
periodically rewrites the segments file, and then may delete files which
are no longer used.  If you copy the segments file and then, before you
finish copying all the other files, the segments file is rewritten and some
files are deleted, your copied index will be incoherent.  Or, vice versa,
you might copy the segments file last, and it may refer to newly created
files which you failed to copy.

The safest way to do this is to use an IndexReader to make your backup, 
with something like:

  IndexReader reader = IndexReader.open("index");
  IndexWriter writer = new IndexWriter("backup", analyzer, true);
  writer.addIndexes(new IndexReader[] { reader });
  writer.close();
  reader.close();
This will use Lucene's locking code to make sure all is safe.

2. When I create a new IndexSearcher, what method does Lucene use to take 
a 'snapshot' of the index (because if i add documents after the search 
object is created they dont appear in the search results)?
It keeps the existing set of files open.  As described above, Lucene 
modifies an index by adding new files and removing old ones.  But an open 
index never changes its set of files.

Doug



See Lucene in action at Furl...

2004-01-25 Thread Michael Giles
Furl - http://www.furl.net

I've been meaning to write to the list about Furl for a while as it is a 
pretty cool use of Lucene (Otis finally connected the dots and tracked me 
down last week).  Furl (http://www.furl.net) is basically an Internet 
filing cabinet for useful web pages.  Or to put it in Lucene-geek terms, it 
is a browser-based app that gives you your own Lucene index of all the 
interesting web pages/articles you find online.

I read all of my news online and there just isn't a good way to keep track 
of it and then recall it at a later date (i.e. "What the heck was that 
article I read about...?").  Bookmarks are really not built for that 
task.  And blogs aren't either (because you can only search on your 
comments, which I never remember).  Thus, Furl was born last spring with 
some traits of bookmarks (i.e. single click in your browser), some traits 
of blogs (i.e. bookmarklets, links, comments, RSS feeds), some traits of 
Google (i.e. full-text index of web content) and a bunch of its own mojo.

I just launched it publicly last week and folks are signing up fast and 
really enjoying it.  I think it's damn useful (i.e. one of the few things I 
use every day) so check it out.  And if you want to play around with search 
without signing up, you can use the demo account or search my public 
entries (http://www.furl.net/members/mgiles) for things like "lucene", 
"genetics", "whales", whatever.

It certainly wouldn't have been possible without Lucene.  Enjoy!

-Mike





RE: Ordering documents

2004-01-16 Thread Michael Giles
William,

The order of the results is going to be based on how well they match the
query (i.e. weighted by relevancy).  So although all of those values
contain the term "Palm", I would assume you would get the shorter entries
(1 & 3) before the longer one (2), as they have a higher percentage of
"palmness".  The same goes for the second query (it is a better match for
1 than for 2).  If you want the documents to come back in their insertion
order, you would need to sort the results yourself.
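
A minimal sketch of that re-sort, assuming no documents have been deleted
so internal doc numbers still reflect insertion order (Lucene 1.x API):

  // Collect the internal doc numbers of the hits, then read them back
  // in ascending (i.e. insertion) order.
  Hits hits = searcher.search(query);
  int[] ids = new int[hits.length()];
  for (int i = 0; i < hits.length(); i++) {
      ids[i] = hits.id(i);
  }
  java.util.Arrays.sort(ids);
  for (int i = 0; i < ids.length; i++) {
      Document doc = searcher.doc(ids[i]);
      // ...render in insertion order...
  }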

-Mike

At 02:33 PM 1/16/2004, you wrote:

Hi Folks,

Regarding the order of the results: is the order in which the information
is stored in the index really the ONLY thing that matters?

Thanks,
William.


From: "William W" <[EMAIL PROTECTED]>
Reply-To: "Lucene Users List" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Ordering documents
Date: Fri, 16 Jan 2004 19:14:06 +
Hi folks,

I have some documents:

  doc 1 ==> name="Palm Zire"
  doc 2 ==> name="Palm Zilion Zire"
  doc 3 ==> name="Palm Test"

I will insert these docs into my index in the order doc 1, doc 2, doc 3.

If I execute the query ==> name:Palm
Which order will the documents come back in?
And if I execute the query ==> name:(Palm Zire)?

I thought that the documents would ALWAYS come back in the order in which
I added them to the index.

How will I know the order of the result?

Thanks,
William.






Re: Multiple Creation of Writers

2004-01-14 Thread Michael Giles
Couldn't you solve this by creating your own synchronized getWriter
method?  I'm thinking something like this (untested sketch):

protected void myProgram() throws IOException {
    ...
    File dir = new File("c:/import/test");
    IndexWriter wrt = getWriter(dir, new StandardAnalyzer());
    ...
}

// True only if no index exists in the directory yet.
protected boolean create(File dir) {
    return !IndexReader.indexExists(dir);
}

// Synchronized so two threads can't both decide to create the same index.
protected synchronized IndexWriter getWriter(File dir, Analyzer analyzer)
        throws IOException {
    return new IndexWriter(dir, analyzer, create(dir));
}
You would actually want to get fancier on the synchronization and grab a 
mutex that is specific to the directory name, but you get the idea.

-Mike

At 12:48 PM 1/14/2004, David Townsend wrote:
In my system indices are created and updated by multiple threads.  I need 
to check if an index exists to decide whether to pass true or false to the 
IndexWriter constructor.

new IndexWriter(FSDirectory, Analyzer, boolean);

The problem arises when two threads attempt to create the same index after
simultaneously finding that the index does not exist.  The problem can be
reproduced in a single thread by:

writerA = new IndexWriter(new File("c:/import/test"), new StandardAnalyzer(), true);
writerB = new IndexWriter(new File("c:/import/test"), new StandardAnalyzer(), true);
add1000Docs(writerA);
add1000Docs(writerB);

This will throw an IOException:

C:\import\test\_a.fnm (The system cannot find the file specified)

The only solution I can think of is to create a database/file lock to get 
around this, or change the Lucene code to obtain a lock before creating an 
index.  Any ideas?

David









RE: Returning one result

2003-12-05 Thread Michael Giles
Tracy,

I believe what Dror was referring to was the call to 
MultiFieldQueryParser.parse(). The second argument to that call is a 
String[] of field names on which to execute the query.  If the field that 
contains "AR345" isn't listed in that array, you will not get any results.

-Mike

At 03:14 PM 12/5/2003, you wrote:
What do you mean by 'add' in MultiFieldQueryParser?  I am using all the
fields.  When I index, it does:

  add(Field.Keyword(.., ..))

But I don't want the user to have to type "ID:".  It would be
nice to just type the ID number.  On your site, if you just put 11183 in
the search box, there are no results.
Well, right now I'll just index it as text and query that field for the id
# to display the document.  It can't hurt, right? :)  Unless Keyword
is a better way.




Re: Collaborative Filtering API

2003-11-25 Thread Michael Giles
Yes, he was the lead Ph.D. student on the GroupLens project at Minnesota.

-Mike

At 12:18 PM 11/25/2003, you wrote:
Hello Mike,

I had a quick look over the javadoc and it looks promising, as you said. Did
Jon Herlocker work on GroupLens? I know GroupLens was quite a pioneering
effort in the early days of collaborative systems...
Cheers
Ralph
> You should check out the work of Jon Herlocker at Oregon State
> (http://eecs.oregonstate.edu/iis/).  They have written a CF engine that
> has
> been on my to-do list to check out for a few months (sounds good on
> "paper").  If you get the chance to play with it, I'd be curious to hear
> your feedback.  Having a CF engine in the open source domain would be a
> great thing.
>
>




Re: Collaborative Filtering API

2003-11-25 Thread Michael Giles
You should check out the work of Jon Herlocker at Oregon State 
(http://eecs.oregonstate.edu/iis/).  They have written a CF engine that has 
been on my to-do list to check out for a few months (sounds good on 
"paper").  If you get the chance to play with it, I'd be curious to hear 
your feedback.  Having a CF engine in the open source domain would be a 
great thing.

-Mike

At 10:49 AM 11/25/2003, you wrote:
Hello all,

I am asking this group because I think people here might know about this,
since it is a similar approach.
Is there a Java-based API which assists developers of collaborative
filtering in their programs?  By this I mean software which uses user
ratings of items and provides ways (algorithms, methods) to find users with
similar interests for prediction generation.  Finding an API like Lucene
would be a dream for me, but any pointer to other APIs (also in other
programming languages) to see and learn from would be appreciated.

Kind Regards




Re: MultiFieldQueryParser default operator

2003-10-30 Thread Michael Giles
This would be great to get fixed (I think I emailed a similar question a 
month or so ago).  If MultiFieldQueryParser is being mucked with, the 
constructor should be updated to take an array of fields instead of the 
single field it takes currently.  The code snippet below is actually 
passing the query ("britney spears") into the constructor instead of a 
default field.  If you made that single change to the constructor, no other 
changes would be necessary, as you could simply call the parse(String) 
method from QueryParser to do your work.

-Mike

At 10:03 AM 10/30/2003, you wrote:
It was posted on lucene-dev, not lucene-user.  I've pasted it below.

I will be fixing this at some point in the near future based on this fix 
and other related ones needed.

Erik

-

From: Bernhard Messer <[EMAIL PROTECTED]>
Date: Wed Oct 29, 2003  11:27:02  AM US/Eastern
To: Lucene Developers List <[EMAIL PROTECTED]>
Subject: MultiFieldQueryParser, can't change default search operator
Reply-To: "Lucene Developers List" <[EMAIL PROTECTED]>
hi all,

just played around with the MultiFieldQueryParser and didn't find a 
working way to change the "operator" value.

The problem is that MultiFieldQueryParser implements only two public
static "parse" methods.  Calling one of those invokes the static "parse"
implementation in the superclass, and because the QueryParser class creates
a new instance for each call through the static method, the "operator" flag
is simply ignored.

There is a simple fix within the MultiFieldQueryParser class that doesn't
touch the rest of the system.  One has to add a new non-static "parse"
method, which could look like this:

/***/
public Query parse(String query, String[] fields) throws ParseException {
    BooleanQuery bQuery = new BooleanQuery();
    for (int i = 0; i < fields.length; i++) {
        QueryParser parser = new QueryParser(fields[i], analyzer);
        parser.setOperator(getOperator());
        Query q = parser.parse(query);
        bQuery.add(q, false, false);
    }
    return bQuery;
}
/***/
To test the new implementation, the following code fragment can be used:

/***/
Directory directory = FSDirectory.getDirectory("/tmp/idx-test", false);
Analyzer analyzer = new SimpleAnalyzer();
Searcher searcher = new IndexSearcher(directory);
String[] fields = { "contents", "title" };
MultiFieldQueryParser parser = new MultiFieldQueryParser("britney spears", analyzer);
parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
Query query = parser.parse("britney spears", fields);
System.out.println("Query: " + query.toString());
Hits hits = searcher.search(query);
System.out.println("Results: " + hits.length());
searcher.close();
/***/
best regards

Bernhard




Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-20 Thread Michael Giles
Erik,

I agree with that assessment.  I hadn't taken the time to look at the 
patch, but I am in agreement that the fix should be "stop QueryParser from 
interpreting characters as operators when there is no whitespace 
surrounding them".  As long as the QP doesn't do anything in this case, the 
Analyzer will be able to handle it the same way it did when indexing (which 
is what we want).

-Mike

At 12:57 PM 10/20/2003, you wrote:
On Wednesday, October 15, 2003, at 10:24  AM, Michael Giles wrote:

I looked at the patch here:

http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23838

I'm not entirely satisfied with it.  I'm of the opinion that we should 
only change QueryParser to fix the behavior of operators nestled within 
text with no surrounding whitespace.  The provided patch only works with 
the "-" character, but what about "Wal+Mart"?  Shouldn't we keep that 
together also and hand it to the analyzer?

I'm not convinced at all that we should change the StandardTokenizer to 
not split on dash.  If only QueryParser was fixed and handed "Wal-Mart" to 
the StandardAnalyzer, it would be split the same way as during indexing 
and searches would return the expected hits.

Thoughts?  I'd like to see this fixed, but in a way that makes the most 
general sense.




Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-15 Thread Michael Giles
So how do we move this issue forward?  I can't think of a single case where
a "-" with no whitespace on either side (e.g. t-shirt, Wal-Mart) should be
interpreted as a NOT operator.  Is there a feeling that changing the
interpretation of such cases is a break in compatibility?  I agree that it
will change behavior, but I think that it will change it for the better
(i.e. fix it).  The current behavior is really broken (and very frustrating
for a user trying to search).

-Mike

At 10:08 AM 10/15/2003, you wrote:

--- Ulrich Mayring <[EMAIL PROTECTED]> wrote:
> Victor Hadianto wrote:
> >
> > If the QueryParser implemented the solution that I suggested then
> "t-shirt"
> > will get you the correct hits :)
>
> Well, what's the problem? I saw a couple of +1s, so why is your patch
>
> not added?
1. +1s were from non-developers
2. The change looked like it would not be backwards compatible. (see
the original email from Victor)
It is also better if patches are added to Bugzilla.

Otis





Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-14 Thread Michael Giles
So what do we need to do to resolve this?  Has the discussion stopped 
because this is the "user" list and not "dev" or did it move over to the 
dev list?

-Mike

At 03:49 AM 10/13/2003, you wrote:
Michael Giles wrote:
He is probably using the StandardAnalyzer.  I was about to write the 
exact same email (but using Wal-Mart as an example on this page - 
http://www.benchmark.com/cgi-bin/suid/~bcmlp/newsletter.cgi?mode=show&year=2003&date=2003-10-07). 
I index and search with the same analyzer (Standard), but when I search 
for Wal-Mart, I don't find a match.  I DO find a match if I search for 
"Wal-Mart" or Wal Mart (no hyphen).  This seems like a bug.
I'm not sure whether it has to do with the Analyzer; the same thing happens
with the Snowball analyzers as well.

Ulrich





Re: Dash Confusion in QueryParser - Bug? Feature?

2003-10-11 Thread Michael Giles
He is probably using the StandardAnalyzer.  I was about to write the exact 
same email (but using Wal-Mart as an example on this page - 
http://www.benchmark.com/cgi-bin/suid/~bcmlp/newsletter.cgi?mode=show&year=2003&date=2003-10-07). 
I index and search with the same analyzer (Standard), but when I search for 
Wal-Mart, I don't find a match.  I DO find a match if I search for 
"Wal-Mart" or Wal Mart (no hyphen).  This seems like a bug.

-Mike

At 11:59 PM 10/10/2003, you wrote:
On Friday, October 10, 2003, at 04:30  AM, Ulrich Mayring wrote:
when I search for "MS-Word" I get all the documents that contain exactly 
that word, which is good. If, however, I search for MS-Word (without the 
quotes), then the MultiFieldQueryParser restructures the query to "MS 
-Word" and I consequently get all documents that contain "MS" and not "Word".
What Analyzer are you using?



Default AND for multi-field queries...

2003-10-07 Thread Michael Giles
As with many people, I want the default query behavior to be AND (instead 
of OR).  However, I'm also (always) creating multi-field queries.  I don't 
see a way to accomplish this cleanly in the API.  It would be great if 
MultiFieldQueryParser had a constructor that took an array of fields (i.e. 
String[]).  That would solve the problem (and seems like a mistaken 
omission).  It would also be nice if QueryParser allowed you to parse a 
query once, and then alter the field setting and generate different 
queries.  As it stands (from my read of the API), I have to loop through my 
list of fields (6 of them) and create a new QueryParser each time.  Parsing 
the same query 6 times is pretty ugly.  Perhaps there is a better way that 
I'm missing.
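
For the record, the loop I'm describing looks roughly like this (untested
sketch; setOperator as in Lucene 1.x QueryParser):

  // Parse the same query once per field with AND as the default operator,
  // then OR the per-field queries together.
  String[] fields = { "title", "contents" /* ...six in my case... */ };
  BooleanQuery combined = new BooleanQuery();
  for (int i = 0; i < fields.length; i++) {
      QueryParser parser = new QueryParser(fields[i], analyzer);
      parser.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
      combined.add(parser.parse(queryString), false, false); // optional clause
  }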

-Mike





Re: HTML Parsing problems...

2003-09-22 Thread Michael Giles
Yeah, I was using HTMLParser for a few days until I tried to parse a 400K 
document and it spun at 100% CPU for a very long time.  It is tolerant of 
bad HTML, but does not appear to scale.  TagSoup processed the same 
document in a second or less at <25% CPU.

-Mike

At 02:42 PM 9/22/2003 +0200, you wrote:

TagSoup is great - however, it is neither maintained nor developed (the
same could be said about JTidy, though TagSoup's history is much
shorter...). I'm using HTMLParser (http://htmlparser.sourceforge.net) for
my application, and it also works very well, even on ill-formed input.
It's also very actively developed.

--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-




Re: HTML Parsing problems...

2003-09-20 Thread Michael Giles
Erik,

Probably a good idea to swap something else in, although Neko introduces a
dependency on Xerces.  I didn't play with Neko because I am currently using
a different XML parser and didn't want to deal with the conflicts (and I
also find dependencies on specific parsers annoying).  However, yesterday I
downloaded TagSoup (http://mercury.ccil.org/~cowan/XML/tagsoup/) and it is
great!  It is small and fast, and so far it has parsed every page I've
thrown at it.  I wrote a SAX ContentHandler that only grabs the text and
does a few other little things (like inserting spaces, removing tabs/line
feeds, grabbing the title), and it seems to be a perfect fit for the job.
It requires the SAX framework, but is parser independent.  The only tweak I
made to the TagSoup code was to add an "else" to deal with a bug where it
was consuming the ";" after entities it did not handle.

If Neko is potentially headed into the Apache fold, that probably makes 
sense.  But if you are interested in my TagSoup ContentHandler for testing 
it out, just let me know.

-Mike

At 08:08 PM 9/19/2003 -0400, you wrote:
I'm going to swap in the neko HTML parser for the demo refactorings I'm
doing.  I would be all for replacing the demo HTML parser with this.
If you look at the Ant  task in the sandbox, you'll see that I
used JTidy for it and it works well, but I've heard that neko is faster
and better so I'll give it a try.
Erik





Re: HTML Parsing problems...

2003-09-19 Thread Michael Giles
Tatu,

Thanks for the reply.  See below for comments.

> just ignore everything inside of script blocks

HTML Parsing problems...

2003-09-18 Thread Michael Giles
I know, I know, the HTML parser in the demo is just that (i.e. a demo), but
I also know that it is updated from time to time and performs much better
than the other ones that I have tested.  Frustratingly, the very first page
I tried to parse failed
(http://www.theregister.co.uk/content/54/32593.html).
It seems to be choking on tags that are being written inside of JavaScript
code (i.e. document.write('');).  Obviously, the simple solution (that I am
using with another parser) is to just ignore everything inside of script
blocks.