Re: Indexing Urls pointing to same content

2006-01-20 Thread Otis Gospodnetic
Mario, Lucene != web indexer, so Lucene doesn't know anything about files or URLs, etc. It just indexes what it's told. You should check how Nutch does it, and I believe it does it by comparing "fingerprints" of web pages. Fingerprints are MD5 checksums, but I believe the recent changes ther
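A minimal sketch of the fingerprint idea mentioned above, assuming duplicates are byte-identical page bodies. The class `ContentFingerprints` and its methods are hypothetical names for illustration. They are not part of Lucene or Nutch, and this is not how Nutch implements its check. It only shows that two URLs can be recognised as duplicates by comparing MD5 digests of their content before anything is handed to the indexer.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: remembers the first URL seen for each content digest.
class ContentFingerprints {
    private final Map<String, String> firstUrlByDigest = new HashMap<>();

    // Hex-encoded MD5 digest of the page content (UTF-8 bytes).
    static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Returns null if this content is new; otherwise returns the URL that
    // was first registered with identical content, so the caller can skip
    // indexing the duplicate.
    String register(String url, String content) {
        return firstUrlByDigest.putIfAbsent(md5Hex(content), url);
    }
}
```

A caller would invoke `register(url, content)` for each fetched page and index the page only when the result is `null`. An exact digest catches only identical content; pages that differ by a timestamp or a banner would need a fuzzier signature.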

[jira] Updated: (LUCENE-488) adding docs with large (binary) fields of 5mb causes OOM regardless of heap size

2006-01-20 Thread Hoss Man (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-488?page=all ] Hoss Man updated LUCENE-488: Attachment: TestBigBinary.java two things i forgot to mention before... 1) It seems i can add as many 4mb documents as my heart desires, but once i go up to 5 all hell br

[jira] Created: (LUCENE-488) adding docs with large (binary) fields of 5mb causes OOM regardless of heap size

2006-01-20 Thread Hoss Man (JIRA)
adding docs with large (binary) fields of 5mb causes OOM regardless of heap size Key: LUCENE-488 URL: http://issues.apache.org/jira/browse/LUCENE-488 Project: Lucene - Java Type: Bug

Urgent issue with custom Directory/IndexInput/IndexOutput

2006-01-20 Thread Dmitry Goldenberg
Hello, I'm having trouble getting a custom Directory to work, keep getting exceptions in org.apache.lucene.store.BufferedIndexInput.refill (stack attached below). Could someone review the code below and tell me what I'm doing wrong? Any pointers would be greatly appreciated. - Dmitry ===

[jira] Updated: (LUCENE-140) docs out of order

2006-01-20 Thread Jarrod Cuzens (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-140?page=all ] Jarrod Cuzens updated LUCENE-140: - Attachment: corrupted.part2.rar Second part. :) > docs out of order > - > > Key: LUCENE-140 > URL: http://issues.apache.org

[jira] Updated: (LUCENE-140) docs out of order

2006-01-20 Thread Jarrod Cuzens (JIRA)
[ http://issues.apache.org/jira/browse/LUCENE-140?page=all ] Jarrod Cuzens updated LUCENE-140: - Attachment: corrupted.part1.rar I am posting our corrupted index (I have to do it in two parts because it is 14.5M). I looked at it in Luke but Luke doesn't

Indexing Urls pointing to same content

2006-01-20 Thread Mario Alejandro M.
I found that the data I'm searching contains a lot of duplicated content. The only difference is that the URL changes, i.e., one says http://localhost/sample.html and the other http://localhost/sample2.html. However, sample1 and sample2 are different files; that is, there is no redirection or link

Re: getting rid of 'empty' segments

2006-01-20 Thread Volodymyr Bychkoviak
We should not only delete the segment but also remove it from the index structure (i.e. SegmentInfos). Volodymyr Bychkoviak wrote: followup: in SegmentReader.doCommit() we can add a check: if (deletedDocs.count()==deletedDocs.size()) { //delete this segment } deleting segment can be done by code that is
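The check proposed above can be illustrated outside Lucene with a small stand-in. This is only a sketch under stated assumptions: `SegmentDeletions` is a hypothetical class, and `java.util.BitSet` stands in for Lucene's internal deleted-docs bit vector. It does not touch `SegmentReader` or `SegmentInfos`. It only shows the condition "number of deleted docs equals number of docs in the segment".

```java
import java.util.BitSet;

// Hypothetical stand-in for a segment's deleted-docs bookkeeping.
class SegmentDeletions {
    private final BitSet deleted;
    private final int docCount;

    SegmentDeletions(int docCount) {
        this.docCount = docCount;
        this.deleted = new BitSet(docCount);
    }

    // Marks one document of the segment as deleted.
    void delete(int docId) {
        if (docId < 0 || docId >= docCount) {
            throw new IndexOutOfBoundsException("docId " + docId);
        }
        deleted.set(docId);
    }

    // Mirrors the proposed count()==size() test: true once every
    // document in the segment has been marked deleted.
    boolean allDeleted() {
        return deleted.cardinality() == docCount;
    }
}
```

As the reply notes, a real fix would also have to drop the segment from the index structure, not just delete its files; that part is not modelled here.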