Mario,
Lucene != web indexer, so Lucene doesn't know anything about files or URLs,
etc. It just indexes what it's told. You should check how Nutch does it, and
I believe it does it by comparing "fingerprints" of web pages. Fingerprints
are MD5 checksums, but I believe the recent changes ther
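
For illustration, here is a minimal sketch of the fingerprint idea described above: hash the page content (not the URL) with MD5 and skip any page whose hash has already been seen. This is not Nutch's actual code; the class name `PageFingerprint`, the helper `fingerprint`, and the sample pages are hypothetical, and only the JDK's `MessageDigest` is used.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not Nutch code: detect duplicate pages by content hash.
class PageFingerprint {

    // Returns the MD5 digest of the page body as a 32-character hex string.
    // The URL is deliberately left out, so identical bodies at different
    // URLs produce the same fingerprint.
    static String fingerprint(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample data: two URLs serving the same body, one serving another.
        String[][] pages = {
            {"http://localhost/sample.html", "<html>same body</html>"},
            {"http://localhost/sample2.html", "<html>same body</html>"},
            {"http://localhost/other.html", "<html>other body</html>"},
        };
        Set<String> seen = new HashSet<>();
        for (String[] page : pages) {
            // Set.add returns false when the fingerprint was already present.
            boolean isNew = seen.add(fingerprint(page[1]));
            System.out.println(page[0] + (isNew ? " -> index" : " -> skip (duplicate content)"));
        }
    }
}
```

An exact checksum only catches byte-identical content; pages that differ by even a single character get different fingerprints.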
[ http://issues.apache.org/jira/browse/LUCENE-488?page=all ]
Hoss Man updated LUCENE-488:
Attachment: TestBigBinary.java
two things i forgot to mention before...
1) It seems i can add as many 4mb documents as my heart desires, but once i go up
to 5 all hell br
adding docs with large (binary) fields of 5mb causes OOM regardless of heap size
Key: LUCENE-488
URL: http://issues.apache.org/jira/browse/LUCENE-488
Project: Lucene - Java
Type: Bug
Hello,
I'm having trouble getting a custom Directory to work, keep getting exceptions
in org.apache.lucene.store.BufferedIndexInput.refill (stack attached below).
Could someone review the code below and tell me what I'm doing wrong? Any
pointers would be greatly appreciated.
- Dmitry
===
[ http://issues.apache.org/jira/browse/LUCENE-140?page=all ]
Jarrod Cuzens updated LUCENE-140:
-
Attachment: corrupted.part2.rar
Second part. :)
> docs out of order
> -
>
> Key: LUCENE-140
> URL: http://issues.apache.org
[ http://issues.apache.org/jira/browse/LUCENE-140?page=all ]
Jarrod Cuzens updated LUCENE-140:
-
Attachment: corrupted.part1.rar
I am posting our corrupted index (I have to do it in two parts because it is
14.5M). I looked at it in Luke but Luke doesn't
I found that the data I'm searching has a lot of duplicated content.
The only difference is that the URL changes, i.e., one is
http://localhost/sample.html and the other http://localhost/sample2.html.
However, sample1 and sample2 are different files; that is, there is no
redirection or link
We should not only delete the segment but also remove it from the index
structure (i.e. SegmentInfos).
Volodymyr Bychkoviak wrote:
Follow-up:
in SegmentReader.doCommit() we can add a check:
    if (deletedDocs.count() == deletedDocs.size()) {
        // every document in this segment is deleted, so delete the segment
    }
deleting segment can be done by code that is
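
As a rough, self-contained sketch of the check proposed above (not Lucene's actual code): a segment whose deletion bits cover every document can be dropped from the list of live segments. Here `java.util.BitSet` stands in for the segment's deleted-docs vector, and the class `FullyDeletedCheck`, the nested `Segment` type, and both helper methods are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical sketch: drop segments in which every document is deleted.
class FullyDeletedCheck {

    // Minimal stand-in for a segment: a name, a document count, and deletion bits.
    static final class Segment {
        final String name;
        final int maxDoc;
        final BitSet deletedDocs; // bit i set means document i is deleted

        Segment(String name, int maxDoc) {
            this.name = name;
            this.maxDoc = maxDoc;
            this.deletedDocs = new BitSet(maxDoc);
        }
    }

    // Mirrors the count() == size() comparison from the message. BitSet.size()
    // reports allocated bits rather than the logical length, so maxDoc is
    // compared against cardinality() instead.
    static boolean allDocsDeleted(Segment s) {
        return s.maxDoc > 0 && s.deletedDocs.cardinality() == s.maxDoc;
    }

    // Returns only the segments that still hold at least one live document,
    // i.e. the fully deleted ones are removed from the index structure too.
    static List<Segment> pruneFullyDeleted(List<Segment> segments) {
        List<Segment> live = new ArrayList<>();
        for (Segment s : segments) {
            if (!allDocsDeleted(s)) {
                live.add(s);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        Segment a = new Segment("_a", 3);
        a.deletedDocs.set(0, 3); // all three documents deleted
        Segment b = new Segment("_b", 3);
        b.deletedDocs.set(1);    // only one document deleted

        List<Segment> segments = new ArrayList<>();
        segments.add(a);
        segments.add(b);

        for (Segment s : pruneFullyDeleted(segments)) {
            System.out.println("keeping segment " + s.name);
        }
    }
}
```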