Which OS/filesystem?
Mike McCandless
http://blog.mikemccandless.com
On Thu, Dec 8, 2011 at 9:46 PM, Jamir Shaikh wrote:
> I am using Lucene 3.5. I want to create around 30 million documents.
> While doing Indexing I am getting the following Exception:
>
> Caused by: java.lang.RuntimeException:
http://blog.mikemccandless.com
On Fri, Dec 9, 2011 at 2:22 PM, Jamir Shaikh wrote:
> OS : RHEL 5.5 64 bit.
> Filesystem: NFS
>
> Thanks for the reply.
>
> Thanks,
> Jamir
>
> On Fri, Dec 9, 2011 at 10:22 AM, Michael McCandless <
> luc...@mikemccandless.com> wr
You could also use FieldCache.getDocsWithField; it returns a bit set
where the bit is set if that document had that field.
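A minimal sketch of that lookup, assuming the FieldCache API as of Lucene 3.5 (the field name "price" is illustrative):

```java
// Sketch: check per-document presence of a field via FieldCache.
// Assumes Lucene 3.5+; "price" is an illustrative field name.
Bits docsWithField = FieldCache.DEFAULT.getDocsWithField(reader, "price");
for (int docID = 0; docID < reader.maxDoc(); docID++) {
  if (docsWithField.get(docID)) {
    // this document had a value for "price"
  }
}
```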
Mike McCandless
http://blog.mikemccandless.com
On Mon, Dec 19, 2011 at 7:32 AM, Trejkaz wrote:
> On Mon, Dec 19, 2011 at 9:05 PM, Paul Taylor wrote:
>> I was looking for a
Is the index accessed over NFS?
Mike McCandless
http://blog.mikemccandless.com
On Fri, Jan 6, 2012 at 9:28 PM, superruiye wrote:
> Hi,
> I use lucene 3.4.0 in a search project,but encounter a problem and i
> don't know how to resolve.
> I index and it run well,but one week or two(it appear tw
These blog posts may also help describe SearcherManager and NRTManager:
http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html
http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html
Mike McCandless
http://blog.mikemccandless.com
On Sat,
OK then that's the problem.
Unlike local file systems, NFS makes no effort to protect
still-open-for-read files from being deleted (which Lucene
relies on by default).
The solution is easy: create your own IndexDeletionPolicy to "protect"
old index commit points from being deleted unti
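One sketch of such a policy, assuming the Lucene 3.x IndexDeletionPolicy interface; keeping the newest N commits is just one illustrative way to "protect" commit points that NFS readers may still have open:

```java
import java.util.List;
import org.apache.lucene.index.IndexCommit;
import org.apache.lucene.index.IndexDeletionPolicy;

// Sketch: keep the newest N commits so still-open readers on NFS can
// keep reading older commit points. N must be tuned to how long your
// readers stay open.
public class KeepLastNDeletionPolicy implements IndexDeletionPolicy {
  private final int numToKeep;

  public KeepLastNDeletionPolicy(int numToKeep) {
    this.numToKeep = numToKeep;
  }

  public void onInit(List<? extends IndexCommit> commits) {
    onCommit(commits);
  }

  public void onCommit(List<? extends IndexCommit> commits) {
    // commits are ordered oldest first; delete all but the last N
    for (int i = 0; i < commits.size() - numToKeep; i++) {
      commits.get(i).delete();
    }
  }
}
```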
I think it's hard to compare the results here?
In test 1 (single IW shared across threads) you end up with one index.
In test 2 (private IW per thread) you end up with N indexes, which to
be "fair" need to be merged down into one index (eg with .addIndexes)?
Or seen another way, test 1 should ha
You shouldn't have to write first to intermediate RAMDirectorys
anymore; just share a single IndexWriter instance across all of
your threads.
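A sketch of the shared-writer pattern, assuming the Lucene 3.x API (paths, analyzer, and thread count are illustrative; IndexWriter itself is thread-safe):

```java
// Sketch: one IndexWriter shared by all indexing threads.
// Assumes Lucene 3.x imports (FSDirectory, IndexWriterConfig, etc.).
final IndexWriter writer = new IndexWriter(
    FSDirectory.open(new File("/path/to/index")),
    new IndexWriterConfig(Version.LUCENE_35,
        new StandardAnalyzer(Version.LUCENE_35)));
ExecutorService pool = Executors.newFixedThreadPool(4);
for (int i = 0; i < 4; i++) {
  pool.submit(new Runnable() {
    public void run() {
      try {
        Document doc = new Document();
        // ... populate doc ...
        writer.addDocument(doc);  // safe to call concurrently
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }
  });
}
pool.shutdown();
```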
Mike McCandless
http://blog.mikemccandless.com
On Wed, Jan 11, 2012 at 12:19 PM, Cheng wrote:
> I have read a lot about IndexWriter and multi-threadin
On Wed, Jan 11, 2012 at 1:32 PM, dyzc2010 wrote:
> Mike, do you mean if I create a FSDirectory based writer in first place, then
> the writer should be used in every thread rather than create a new
> RAMDirectory based writer in that thread?
Right.
> What about I do want to use RAMDirectory t
ate a RAMDirectory based writer and have it work cross all
> threads? In the sense, I would like to use RAMDirectory every where and
> have the RAMDirectory written to FSDirectory in the end.
>
> I suppose that should work, right?
>
>
> On Wed, Jan 11, 2012 at 2:31 PM, Mi
and a same searcher and pass them through
> every thread too?
>
>
>
> On Wed, Jan 11, 2012 at 3:21 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Yes that would work fine but you should see a net perf loss by
>> doing so (once you include
Indeed the 3.5.0 link is missing! I just committed a fix but is
this site automagically pushed somehow...? (I forget!).
Mike McCandless
http://blog.mikemccandless.com
On Mon, Jan 9, 2012 at 5:54 AM, Ian Lea wrote:
> Hi
>
>
> The "Documentation" link on
> http://lucene.apache.org/java/docs
The ES is just passed through to the IndexSearchers that NRTManager
opens, so see IndexSearcher's javadocs.
But it's not clear how much passing an ES to IS really helps; you
should test yourself (and report back!).
Also, I wrote this blog post:
http://blog.mikemccandless.com/2011/11/near-re
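For reference, passing the ExecutorService through might look like this (a sketch against the 3.x IndexSearcher constructor; pool size is illustrative, and as noted above the benefit should be measured):

```java
// Sketch (Lucene 3.x): give IndexSearcher an ExecutorService so it can
// search segments concurrently. Whether this helps depends on your
// index shape and hardware -- benchmark it.
ExecutorService executor = Executors.newFixedThreadPool(4);
IndexSearcher searcher = new IndexSearcher(reader, executor);
```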
OK this is now fixed I think!
And for the record: nothing seems to auto-push this site ;)
Mike McCandless
http://blog.mikemccandless.com
On Fri, Jan 13, 2012 at 1:57 PM, Michael McCandless
wrote:
> Indeed the 3.5.0 link is missing! I just committed a fix but is
> this site automag
Do you have a full traceback of the exception?
Mike McCandless
http://blog.mikemccandless.com
On Sun, Jan 15, 2012 at 7:21 PM, Duke DAI wrote:
> Hi friends,
> Any one meet ArrayIndexOutOfBoundsException: -65536 described in
> https://issues.apache.org/jira/browse/LUCENE-1995 after it declared b
> Best regards,
> Duke
> If not now, when? If not me, who?
> M 13818420095
>
>
>
> On Mon, Jan 16, 2012 at 9:09 AM, Michael McCandless
> wrote:
>>
>> Do you have a full traceback of the exception?
>>
>> Mike McCandless
>>
>> http:/
se AIOOBE? Is there any possible?
>
>
> Best regards,
> Duke
> If not now, when? If not me, who?
>
>
>
> On Wed, Jan 18, 2012 at 9:47 PM, Michael McCandless
> wrote:
>>
>> Hmm, are you certain your RAM buffer is 3 MB?
>>
>> Is it possible you ar
Have a look at lucene/MIGRATE.txt? It [tries to] describe this
change... and if something is missing please report back!
Mike McCandless
http://blog.mikemccandless.com
On Tue, Jan 24, 2012 at 4:10 PM, Stephen Howe wrote:
> Hi all,
>
> Looking at some older Lucene examples, I noticed for older
You should be able to use the Scorer.visitSubScorers API? You'd do
this up front, to recursively gather all "interesting" scorers in the
Query, and then in a custom collector, in the collect method, you can
go and ask each subScorer whether it matched the current document
(call its .freq() and see
I don't think there is one yet... it's [still] one of the limitations
I listed here:
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
But... if there were one, I don't think it would be user controllable.
I think it's more of an up-front schema thing, eg you'd tell
Right, you have to ensure (by using the "right" IndexDeletionPolicy)
that no commit is ever removed until all readers open against that
commit have been closed.
"Normally" the filesystem ensures this for us (protects still-open
files from being deleted), but NFS (unfortunately!) lacks such
semanti
On Tue, Jan 31, 2012 at 9:42 PM, Trejkaz wrote:
> So when we close() our own TextIndex wrapper class, it would call
> decRef() - but if another thread is still using the index, this call
> to decRef() wouldn't actually close the reader. IMO, this wouldn't
> really satisfy the meaning of "close" f
Instead of .getVersion() you should use .getTimestamp()... version is
not "really" a timestamp. (Though, really, you should store your own
timestamp inside the commit userData, and retrieve that, instead...
the getTimestamp API will be deprecated in 3.6.0).
Also, you may need to implement onInit,
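Storing your own timestamp in the commit userData might look like this (a sketch against the 3.x API; the key name is illustrative):

```java
// Sketch: record our own timestamp at commit time (key name illustrative).
Map<String, String> userData = new HashMap<String, String>();
userData.put("commitTimeMillis", Long.toString(System.currentTimeMillis()));
writer.commit(userData);

// Later, e.g. inside a deletion policy, read it back from the commit point:
long commitTime =
    Long.parseLong(commit.getUserData().get("commitTimeMillis"));
```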
Are you using near-real-time readers?
(IndexReader.open(IndexWriter))
Mike McCandless
http://blog.mikemccandless.com
On Sun, Feb 5, 2012 at 9:03 AM, Cheng wrote:
> Hi Uwe,
>
> My challenge is that I need to update/modify the indexes frequently while
> providing the search capability. I was try
nager and SearcherManager things should be
>> > >> >> easy and blazingly fast rather than unbearably slow. The latter
>> > >> >> phrase is not one often associated with lucene.
>> > >> >>
>> > >> >>
Feb 6, 2012 at 11:46 AM, Cheng wrote:
> Good point. I should remove the commits.
>
> Any difference between NRTCashingDirectory and RAMDirectory? how to define
> the "small"?
>
> On Tue, Feb 7, 2012 at 12:42 AM, Michael McCandless <
> luc...@mikemccandless.com>
Hmm, there's a problem with the logic here (sorry: this is my fault --
my prior suggestion is flat out wrong!).
The problem is... say you commit once, creating commit point 1. Two
hours later, you commit again, creating commit point 2. The bug is,
at this point, immediately on committing commit
I'm glad the timed deletion policy is working on NFS!
Thanks for bringing closure,
Mike McCandless
http://blog.mikemccandless.com
On Fri, Feb 10, 2012 at 9:58 PM, superruiye wrote:
> Thanks for your advice and patient.
> I modify "present",and use stress testing two day(loop search and index),
IndexWriter doesn't require refreshing... just keep it open forever.
It'll run its own merges when needed (see the MergePolicy/Scheduler).
Just call .commit() when you want changes to be durable (survive
OS/JVM crash, power loss, etc.).
Mike McCandless
http://blog.mikemccandless.com
On Mon, Fe
Is your deletion policy actually deleting commits?
Mike McCandless
http://blog.mikemccandless.com
On Wed, Feb 15, 2012 at 5:21 AM, superruiye wrote:
> http://lucene.472066.n3.nabble.com/file/n3746464/index.jpg
>
> The index files are same size,and the index increase to 7.5G in one day,but
> it
Wait: I see your DP above calling .delete() -- can you verify that
code is in fact invoked? EG print on each onCommit how many commits
are deleted and how many are not?
Mike McCandless
http://blog.mikemccandless.com
On Wed, Feb 15, 2012 at 9:21 PM, superruiye wrote:
> My IndexWriter only creat
OK, thanks for bringing closure!
Mike McCandless
http://blog.mikemccandless.com
On Thu, Feb 16, 2012 at 10:08 PM, superruiye wrote:
> Oh,I made a mistake.Our testing server's time is faster hours than it should
> be.I reminded workmate to modify it,and index maintain in a range size.
> Thank y
Hmm, though, one question: if you are using a single IndexWriter,
always on the same machine, then it should not matter that the
computer's clock is way off.
Because, the DeletionPolicy is comparing timestamps pulled only from a
single clock. Ie the shift won't matter; only relative comparisons
m
This is from ConcurrentMergeScheduler (the default MergeScheduler).
But, are you sure the threads are sleeping, not exiting? (They should
be exiting).
This merge scheduler starts a new thread when a merge is needed,
allows that thread to do another merge (if one is immediately
available), else t
Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
the insert/delete/transposition changes...
Is the number of edits smallish? Ie you're not concerned about
combinatoric explosion of step 1?
For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use
BasicAutomata.mak
On Tue, Feb 28, 2012 at 8:42 AM, Alan Woodward
wrote:
>
> On 28 Feb 2012, at 13:31, Michael McCandless wrote:
>
>> Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for
>> the insert/delete/transposition changes...
>>
>> Is the number of ed
You shouldn't use doc.removeField -- it's costly (the fields are a
list internally so we walk that list looking for which field(s) to
remove).
To reuse you can just use Field.setValue, and leave the Field instance
on the Document.
But: you should only do this if you really have a meaningful
perfo
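A sketch of the reuse pattern, assuming the Lucene 3.x Field API (the field name and the `rows` collection are illustrative):

```java
// Sketch: reuse the same Document/Field instances across addDocument
// calls instead of removeField + new Field each time.
Document doc = new Document();
Field title = new Field("title", "", Field.Store.YES, Field.Index.ANALYZED);
doc.add(title);
for (Row row : rows) {        // Row/rows are hypothetical source records
  title.setValue(row.title);  // swap in the new value, keep the instance
  writer.addDocument(doc);
}
```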
Phew, tricky.
The problem is purging is potentially costly... it iterates all
entries in the map (threads that have called get) looking for dead
threads.
Can you open an issue...? We can iterate there. Thanks for raising this,
Mike McCandless
http://blog.mikemccandless.com
On Wed, Feb 29, 20
It's because the delete is buffered and only later applied in batch...
so we can't easily know the count.
Mike McCandless
http://blog.mikemccandless.com
On Sun, Mar 4, 2012 at 4:42 PM, Benson Margulies wrote:
> Is there a reason why this doesn't return a count? Would a JIRA
> requesting same be
I think MIGRATE.txt talks about this?
Mike McCandless
http://blog.mikemccandless.com
On Tue, Mar 6, 2012 at 8:50 AM, Benson Margulies wrote:
> Under "LUCENE-1458, LUCENE-2111: Flexible Indexing", CHANGES.txt
> appears to be missing one critical hint. If you have existing code
> that called Inde
Hmm something is up here... I'll dig. Seems like we are somehow
analyzing StringField when we shouldn't...
Mike McCandless
http://blog.mikemccandless.com
On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir wrote:
> On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies
> wrote:
>> On Tue, Mar 6, 2012 at 9
On Tue, Mar 6, 2012 at 10:06 AM, Benson Margulies wrote:
> On Tue, Mar 6, 2012 at 10:04 AM, Robert Muir wrote:
>> Thanks Benson: look like the problem revolves around indexing
>> Document/Fields you get back from IR.document... this has always been
>> 'lossy', but I think this is a real API trap.
Maybe try TieredMergePolicy to see if it'd do any merges here...?
More responses below:
On Tue, Mar 6, 2012 at 8:00 PM, Paul Hill wrote:
> I have an index with 421163 documents (including body text)
> after running a test index for a couple of months with 3.4 code with the
> default LogByteSiz
Hmm... that doesn't sound good.
Is the issue repeatable once it happens? And, when it happens, can
you verify that the index is correct (eg, the missing doc is
retrievable by non-grouped searches)? This way we can isolate the
issue to the search-side.
Can you boil it down to a small test case?
On Thu, Mar 8, 2012 at 7:22 AM, Grzegorz Tańczyk
wrote:
> Hello,
>
> Thanks for reply, I can find first document from group using non grouping
> search.
OK, so the index seems ok.
> To be sure about this I deleted index and indexed only first 100 groups
> which gives around 2300 documents and I
Phew, thanks for bringing closure!
Mike McCandless
http://blog.mikemccandless.com
On Fri, Mar 9, 2012 at 8:52 AM, Grzegorz Tańczyk wrote:
> Hello,
>
> I found the problem and it was my misunderstanding. I didn't get first
> documents in every group, because some of head documents didn't match g
Hi,
Actually, this is a hard requirement for BlockJoinQuery: the parent
document must always be last in the doc block; the package.html
describes this I think?
Mike McCandless
http://blog.mikemccandless.com
On Mon, Mar 12, 2012 at 12:57 PM, Jean-Marc MORAS
wrote:
> Dear
>
> Bravo for your work
On Wed, Mar 14, 2012 at 5:17 AM, Jean-Marc MORAS
wrote:
> -> Ok now I have seen the mention of that on ToParentBlockJoinQuery class
> java doc
>
> -> This java doc specify : "At search time you provide a Filter
> * identifying the parents, however this Filter must provide
>
> * an {@link Fix
Hmm, that's odd...
Can you set IndexWriter's infoStream and then capture the output while
doing the small writes every few seconds and post back?
If you run CheckIndex on the index does it also report ~3000 segments?
Mike McCandless
http://blog.mikemccandless.com
On Thu, Mar 15, 2012 at 7:14 A
hu Mar 15 15:25:38 MET 2012; pool-2-thread-1]: commit: start
> IW 53 [Thu Mar 15 15:25:38 MET 2012; pool-2-thread-1]: commit: enter lock
> IW 53 [Thu Mar 15 15:25:38 MET 2012; pool-2-thread-1]: commit: already
> prepared
> IW 53 [Thu Mar 15 15:25:38 MET 2012; pool-2-thread-1]: commi
is Solaris. Directory is a NAS.
> Directory implementation is SimpleFSDirectory.
> I sent you the full log.
>
> Thanks,
> Tim
>
> On Thu, Mar 15, 2012 at 4:04 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hmm, which OS/filesystem is the index
You're welcome!
Happy searching,
Mike McCandless
http://blog.mikemccandless.com
On Thu, Mar 15, 2012 at 11:40 AM, Jean-Marc MORAS
wrote:
> Thanks for your two responses.
>
> Best regards,
>
> Jean-Marc
>
> --
>
> -> Ok now I have seen the mention of that on ToParentBlockJoinQuery clas
On Thu, Mar 15, 2012 at 12:02 PM, Tim Bogaert wrote:
> while removing the prepareCommit we noticed we didn't actually called the
> IW.commit() method before the IW.close().
> Altough the documentation says the close() method commits all the changes
> we tried to add the commit() method before the
On Thu, Mar 15, 2012 at 12:33 PM, Uwe Schindler wrote:
> Close calls and always did call commit in 3.x?
Right, it does.
But in the case when prepareCommit was called... it then only commits
the changes as of that prepareCommit and *not* any changes done after
that and before close. That's the
Hmm, I agree we could be more RAM efficient if the field is DOCS_ONLY.
We shouldn't have to allocate/use docFreqs, lastDocCodes,
lastPositions arrays (3 of the 7); the others are still needed, I
think.
But, that said, you shouldn't hit OOME, as long as your max heap sizes
is large enough (and, yo
You have to replace all documents in the block (1 parent, 4 children
in your example) to update any of the documents... only updating the
child (or child + parent) will break the join...
There's also query-time joining coming in 3.6.0.
Mike McCandless
http://blog.mikemccandless.com
On Thu, Mar
I think you're hitting the exception because you passed
trackScores=true to ToParentBlockJoinCollector. If you do that, the
ScoreMode cannot be None... I'll update the javadocs to make this
clear, and I'll also fix the exception message.
I think you're hitting the infinite loop because your paren
Hi,
The equivalent in trunk is IndexReader.getTermVectors.
It returns a Fields instance, just like "normal" postings
(IndexReader.fields()), except it's postings for just a single
document.
So, you can pull a specific field, iterate the terms, get the
positions/offsets, etc.
I'll update MIGRATE
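The iteration might look roughly like this; a sketch against the trunk (4.0) API, which was still in flux at the time, so method signatures may differ in your checkout (the field name "body" is illustrative):

```java
// Rough sketch of the trunk (4.0) per-document term-vector API.
Fields vectors = reader.getTermVectors(docID);
Terms terms = vectors.terms("body");       // illustrative field name
TermsEnum termsEnum = terms.iterator(null);
BytesRef term;
while ((term = termsEnum.next()) != null) {
  DocsAndPositionsEnum positions = termsEnum.docsAndPositions(null, null);
  // iterate positions/offsets for this term, if they were indexed
}
```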
In general how Lucene assigns docIDs is a volatile implementation
detail: it's free to change from release to release.
Eg, the default merge policy (TieredMergePolicy) merges out-of-order
segments. Another eg: at one point, IndexSearcher re-ordered the
segments on init. Another: because Concurre
The code seems OK on quick glance...
Are you closing the writer?
Are you hitting any exceptions?
Mike McCandless
http://blog.mikemccandless.com
On Tue, Mar 27, 2012 at 12:19 PM, Luis Paiva wrote:
> Hey all,
>
> i'm in my first steps in Lucene.
> I was trying to index some txt files, and my pr
Alas, no, not yet. This is an oft-requested feature, but challenging to build.
That said, there is a possible start towards making something possible in 4.0:
https://issues.apache.org/jira/browse/LUCENE-3837
Mike McCandless
http://blog.mikemccandless.com
On Wed, Mar 28, 2012 at 8:16 AM, Anu
On Wed, Mar 28, 2012 at 3:37 AM, Christoph Kaser
wrote:
> Thank you for your answer!
>
> That's too bad. I thought of using my own ID-field, but I wanted to save the
> additional indirection (from docId to my ID to my value).
> Do document IDs remain constant for one IndexReader as long as it isn'
It'll be in both 3.6 and 4.0.
Mike McCandless
http://blog.mikemccandless.com
On Thu, Mar 29, 2012 at 7:55 AM, kiwi clive wrote:
> Hi Guys,
> Will this be available in Lucene 3.6 or is it only going into version 4.0 ?
>
> Clive
---
On Wed, Mar 28, 2012 at 2:30 PM, Tim Eck wrote:
> Excuse my ignorance of lucene internals, but is the problem any easier if
> the requirement is just to allow the addition/removal of stored only fields
> (as opposed to indexed)?
It would substantially simplify the problem... but even this
simplif
") ||
> filename.endsWith(".xml") || filename.endsWith(".txt")) {
> queue.add(file);
> } else {
> System.out.println("Skipped " + filename);
> }
> }
> }
>
> /**
> * Close the index.
> *
Hmm that's odd.
If the scores were identical I'd expect different sort order, since we
tie-break by internal docID.
But if the scores are different... the insertion order shouldn't
matter. And, the score should not change as a function of insertion
order...
Do you have a small test case?
Mike
On Wed, Apr 4, 2012 at 6:15 PM, Alan Bawden wrote:
> So I sat down to try to make a small test case that exhibited this
> behavior, and while I was working on that I thought of a possible
> explanation for what we are seeing. If you agree that my explanation is
> what's going on here, then Benson
I'm assuming this is a "build once and never change" index...? Else,
it sounds like you should never run forceMerge...
To preserve insertion order you just need to use one of the
Log*MergePolicy (which you are already doing). Merge factor doesn't
affect this...
For the fastest way to get to a s
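Configuring such a policy is a one-liner; a sketch against the 3.x API:

```java
// Sketch (Lucene 3.x): Log*MergePolicy merges only adjacent segments,
// so document insertion order is preserved across merges.
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_35, analyzer);
iwc.setMergePolicy(new LogByteSizeMergePolicy());
IndexWriter writer = new IndexWriter(dir, iwc);
```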
On Thu, Apr 5, 2012 at 3:31 PM, Ivan Brusic wrote:
> On Thu, Apr 5, 2012 at 11:36 AM, Michael McCandless
> wrote:
>> I'm assuming this is a "build once and never change" index...? Else,
>> it sounds like you should never run forceMerge...
>
> Correct. Th
RAM can be used in IndexWriter for other reasons: merge is running,
near-real-time reader was opened.
The RAMBufferSizeMB only applies to buffered postings (indexed documents).
If you turn on IndexWriter's infoStream, do you see output saying it's
flushing a new segment because RAM is > 5.0 MB?
M
ts have been dramatic. Our indexing time has returned to 2.3
> levels.
>
> Thanks again,
>
> Ivan
>
> On Fri, Apr 6, 2012 at 11:36 AM, Michael McCandless
> wrote:
>> On Thu, Apr 5, 2012 at 3:31 PM, Ivan Brusic wrote:
>>
>>> On Thu, Apr 5, 2012 at 11:36 AM,
Hmm, not good. Are you sure the index was previously locked?
Can you describe your environment? Which OS / Directory class are you using?
Maybe boil down to a small code fragment showing the issue?
Mike McCandless
http://blog.mikemccandless.com
On Sun, May 6, 2012 at 8:29 AM, S Eslamian wro
es not contain write.lock file
> and code goes to the if loop while it shouldn't passes the if clause!
>
> S Eslamian
>
> On Sun, May 6, 2012 at 5:56 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> Hmm, not good. Are you sure the index was pr
On Mon, May 7, 2012 at 7:19 AM, S Eslamian wrote:
> hmm... , What is a leftover lock file?
>
> You know I debug my code, befor index folder has lock file, till line 7.
> Then I close the program, like in a real run an interrupt has happened.
How do you close it? Just kill the process? That is w
On Tue, May 8, 2012 at 12:31 AM, S Eslamian wrote:
> So if my program interrupts, the lock files in the indexes will be released
> in the next run. hoom?
If you use NativeFSLockFactory (which is the default for
NIOFSDirectory) then, yes, the lock is always released by the OS when
the process exi
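Setting this up explicitly might look like (a sketch, 3.x API; the path is illustrative):

```java
// Sketch (Lucene 3.x): NativeFSLockFactory uses OS-level locks, which
// the OS releases automatically if the JVM process is killed.
Directory dir = new NIOFSDirectory(new File("/path/to/index"),
                                   new NativeFSLockFactory());
```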
This is actually due to a bug:
https://issues.apache.org/jira/browse/LUCENE-3065
which was fixed in 3.2. The bug is that, prior to Lucene 3.2, if you
stored a NumericField, when you later load that document, the field is
converted to an ordinary Field (no longer numeric), so when you then
ind
On Sat, May 12, 2012 at 9:12 AM, Valeriy Felberg
wrote:
>> the Document IDs in Lucene are per segment. ie. they are always
>> segment based.
>
> @Simon I'm just wondering: If the document IDs are per segment how
> does it work if I call Searcher.search(Query, int) and get TopDocs
> referencing Sco
I think you need to pay attention to what td.next() returned; I
suspect in your case it returned false which means you cannot use any
of its APIs (.doc(), .freq(), etc.) after that.
Mike McCandless
http://blog.mikemccandless.com
On Thu, May 17, 2012 at 5:52 PM, Edward W. Rouse wrote:
> Lucene 3
a next() method.
>
>> -Original Message-
>> From: Michael McCandless [mailto:luc...@mikemccandless.com]
>> Sent: Thursday, May 17, 2012 6:20 PM
>> To: java-user@lucene.apache.org
>> Subject: Re: NullPointerException using IndexReader.termDocs when there
>
Good question!
One way to run it is temporarily comment out the code in the validate
method in
lucene/test-framework/src/java/org/apache/lucene/util/TestRuleAssertionsRequired.java
Maybe we should give this tool a static main instead of running it as
a test case..
Mike McCandless
http://blog.m
You do have to call getTopGroups for each grandchild query, and the
order should match the TopGroups you got for the children.
However, looking at the code, I suspect there's a bug... by the
time the collector collects the parent hit, some of the grandchildren
will have been discarded. I susp
On Thu, May 24, 2012 at 11:48 AM, Christoph Kaser
wrote:
> thank you for your response. Unfortunately, I won't be able to try this
> today, but I should be able to try it in the next few days. If I find the
> bug you described, I will open an issue.
Thanks!
> On a somewhat related note, is ther
The simplest way is to stop all index writing (close the IndexWriter),
do the copy, then start your IndexWriter again.
If that's a problem (usually it is!) then use SnapshotDeletionPolicy
to protect the commit point (ie prevent any of the files it uses from
being deleted) while you are making the
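A sketch of the hot-backup pattern; note the exact snapshot()/release() signatures vary across 3.x releases (the id-based variant is assumed here), so check your version's javadocs:

```java
// Sketch: snapshot a commit point so its files survive while being copied.
SnapshotDeletionPolicy sdp =
    new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());
IndexWriter writer = new IndexWriter(dir,
    new IndexWriterConfig(Version.LUCENE_35, analyzer)
        .setIndexDeletionPolicy(sdp));
// ... index normally; when it's time to back up:
IndexCommit commit = sdp.snapshot("backup");  // id-based variant assumed
try {
  for (String fileName : commit.getFileNames()) {
    // copy fileName out of the index directory
  }
} finally {
  sdp.release("backup");  // allow these files to be deleted again
}
```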
Multiple threads are free to interact with Directory.
But it will be only one thread at a time interacting with a single
instance of IndexInput and IndexOutput.
Mike McCandless
http://blog.mikemccandless.com
On Tue, May 29, 2012 at 6:39 PM, Dhruv wrote:
> I am trying to implement an in-memory
64% greater index size when you merge at the end is odd.
Can you post the ls -l output of the final index in both cases?
Are you only adding (not deleting) docs?
This is perfectly valid to do... but I'm surprised you see the two
approaches taking about the same time. I would expect letting Luce
On Fri, Jun 1, 2012 at 8:09 PM, Vitaly Funstein wrote:
> Yes, I am only calling IndexWriter.addDocument()
OK.
> Interestingly, relative performance of either approach seems to greatly
> depend on the number of documents per index. In both types of runs, I used
> 10 writer threads, each writing d
It could be your index has an unusual number of unique terms.
If you can upgrade to the latest 3.x, the RAM used by the terms index
has been very substantially reduced...
If not, try setting the termInfosIndexDivisor to eg 2 or 3 ... this
will load 1/2 or 1/3 of the indexed terms into RAM, but ma
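Setting the divisor at open time might look like (a sketch; this overload existed in the 3.x IndexReader, with null for the default deletion policy and true for read-only):

```java
// Sketch (Lucene 3.x): load only every 2nd indexed term into RAM
// (roughly halves terms-index RAM, at some cost in seek speed).
IndexReader reader = IndexReader.open(dir, null, true, 2);
```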
ndexWriter.maybeMerge();
> IndexWriter.waitForMerges();
>
> to simply calling IndexWriter.close(true) the disk size and run time are
> now very close to the case of parallel segment merges.
>
> On Sat, Jun 2, 2012 at 6:43 AM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>>
I think this use case makes sense; such logic (for a distributed / ref
counted deletion policy) would make a nice contribution ... it's the
"proper" way to delete commits when multiple nodes are in use (vs eg
using a timeout deletion policy).
You can actually do it today: call IndexWriter.deleteUn
I think that javadoc is stale; my guess is it was written back when
the collect method took a score, but we changed that so the collector
calls .score() if it really needs the score... so I can't think of why
that search method is inherently inefficient.
I'll fix the javadocs (remove that warning)
A 9-fold improvement using RAMDir over MMapDir is much more than I've
seen (~30-40% maybe) in the past.
Can you explain how you are using Lucene?
You may also want to try the CachingRAMDirectory patch on
https://issues.apache.org/jira/browse/LUCENE-4123
Mike McCandless
http://blog.mikemccandless.
This shouldn't normally happen, even on crash, kill -9, power loss, etc.
It can only mean either there is a bug in Lucene, or there's something
wrong with your hardware/IO system, or the fsync operation doesn't
actually work on the IO system.
You can run CheckIndex to see what's broken (then, add
Likely the bottleneck is pulling content from the database? Maybe
test just that and see how long it takes?
24 hours is way too long to index all of Wikipedia. For example, we
index Wikipedia every night for our trunk/4.0 performance tests, here:
http://people.apache.org/~mikemccand/luceneb
Hmm which Lucene version are you using? For 3.x before 3.4, there was
a bug (https://issues.apache.org/jira/browse/LUCENE-3418) where we
failed to actually fsync...
More below:
On Tue, Jun 19, 2012 at 4:54 PM, Chris Gioran
wrote:
> On Tue, Jun 19, 2012 at 6:18 PM, Michael McCandless
>
ld it be possible to index Wikipedia in a 2 core machine with 3 GB in
> RAM? I have had the same problem trying to index it.
>
> I've tried with a dump from april 2011.
>
> Thanks
> Reyna
> CIC-IPN
> Mexico
>
> 2012/6/19 Michael McCandless
>
>> Likely the bot
I have the index locally ... but it's really impractical to send it
especially if you already have the source text locally.
Maybe index directly from the source text instead of via a database?
Lucene's benchmark contrib/module has code to decode the XML into
documents...
Mike McCandless
http://b
Right, the field must have a single token for FieldCache.
But if you are on 4.x you can use DocTermOrds
(FieldCache.getDocTermOrds) which allows for multiple tokens per
field.
Mike McCandless
http://blog.mikemccandless.com
On Wed, Jun 20, 2012 at 9:47 AM, Li Li wrote:
> but as I can remember,
decaperated?
>
> Thanks
>
> On Mon, Jun 18, 2012 at 7:32 PM, Michael McCandless <
> luc...@mikemccandless.com> wrote:
>
>> 9 fold improvement using RAMDir over MMapDir is much more than I've
>> seen (~30-40% maybe) in the past.
>>
>>
There are blanks at the important places (your code, and which
JavaDoc) in your email!
Mike McCandless
http://blog.mikemccandless.com
On Wed, Jul 11, 2012 at 6:18 AM, Konstantyn Smirnov wrote:
> Hi all
>
> in my app (Lucene 3.5.0 powered) I index the documents (not too many, say up
> to 100k) u
What I meant was your original email says "My code looks like",
followed by blank lines, and then "Doesn't it conflict with the
JavaDoc saying:", followed by blank lines. Ie we can't see your code.
However, when I look at your email here at
http://lucene.472066.n3.nabble.com/RAMDirectory-and-expun