Re: Getting RuntimeException: after flush: fdx size mismatch while Indexing

2011-12-09 Thread Michael McCandless
Which OS/filesystem? Mike McCandless http://blog.mikemccandless.com On Thu, Dec 8, 2011 at 9:46 PM, Jamir Shaikh wrote: > I am using Lucene 3.5. I want to create around 30 million documents. > While doing Indexing I am getting the following Exception: > > Caused by: java.lang.RuntimeException:

Re: Getting RuntimeException: after flush: fdx size mismatch while Indexing

2011-12-09 Thread Michael McCandless
http://blog.mikemccandless.com On Fri, Dec 9, 2011 at 2:22 PM, Jamir Shaikh wrote: > OS : RHEL 5.5 64 bit. > Filesystem: NFS > > Thanks for the reply. > > Thanks, > Jamir > > On Fri, Dec 9, 2011 at 10:22 AM, Michael McCandless < > luc...@mikemccandless.com> wr

Re: Query that returns all docs that contain a field

2011-12-19 Thread Michael McCandless
You could also use FieldCache.getDocsWithField; it returns a bit set where the bit is set if that document had that field. Mike McCandless http://blog.mikemccandless.com On Mon, Dec 19, 2011 at 7:32 AM, Trejkaz wrote: > On Mon, Dec 19, 2011 at 9:05 PM, Paul Taylor wrote: >> I was looking for a

Re: Why read past EOF

2012-01-07 Thread Michael McCandless
Is the index accessed over NFS? Mike McCandless http://blog.mikemccandless.com On Fri, Jan 6, 2012 at 9:28 PM, superruiye wrote: > Hi, >   I use lucene 3.4.0 in a search project,but encounter a problem and i > don't know how to resolve. > I index and it run well,but one week or two(it appear tw

Re: question about SearcherManager in version 3.5.0

2012-01-07 Thread Michael McCandless
These blog posts may also help describe SearcherManager and NRTManager: http://blog.mikemccandless.com/2011/09/lucenes-searchermanager-simplifies.html http://blog.mikemccandless.com/2011/11/near-real-time-readers-with-lucenes.html Mike McCandless http://blog.mikemccandless.com On Sat,

Re: Why read past EOF

2012-01-09 Thread Michael McCandless
OK then that's the problem. Unlike local file systems, NFS makes no effort to protect still-open-for-read files from being deleted (which Lucene relies on by default). The solution is easy: create your own IndexDeletionPolicy to "protect" old index commit points from being deleted unti

Re: shared instance of IndexWriter doesn't improve proformance

2012-01-11 Thread Michael McCandless
I think it's hard to compare the results here? In test 1 (single IW shared across threads) you end up with one index. In test 2 (private IW per thread) you end up with N indexes, which to be "fair" need to be merged down into one index (eg with .addIndexes)? Or seen another way, test 1 should ha

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Michael McCandless
You shouldn't have to write first to intermediate RAMDirectorys anymore; just share a single IndexWriter instance across all of your threads. Mike McCandless http://blog.mikemccandless.com On Wed, Jan 11, 2012 at 12:19 PM, Cheng wrote: > I have read a lot about IndexWriter and multi-threadin

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Michael McCandless
On Wed, Jan 11, 2012 at 1:32 PM, dyzc2010 wrote: > Mike, do you mean if I create a FSDirectory based writer in first place, then > the writer should be used in every thread rather than create a new > RAMDirectory based writer in that thread? Right. > What about I do want to use RAMDirectory t

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Michael McCandless
ate a RAMDirectory based writer and have it work cross all > threads? In the sense, I would like to use RAMDirectory every where and > have the RAMDirectory written to FSDirectory in the end. > > I suppose that should work, right? > > > On Wed, Jan 11, 2012 at 2:31 PM, Mi

Re: Seem contradictive -- indexwriter in handling multiple threads

2012-01-11 Thread Michael McCandless
and a same searcher and pass them through > every thread too? > > > > On Wed, Jan 11, 2012 at 3:21 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> Yes that would work fine but you should see a net perf loss by >> doing so (once you include

Re: 3.5.0 javadocs link missing?

2012-01-13 Thread Michael McCandless
Indeed the 3.5.0 link is missing! I just committed a fix but is this site automagically pushed somehow...? (I forget!). Mike McCandless http://blog.mikemccandless.com On Mon, Jan 9, 2012 at 5:54 AM, Ian Lea wrote: > Hi > > > The "Documentation" link on > http://lucene.apache.org/java/docs

Re: How NRTManagerReopenThread works with Java Executor framework?

2012-01-15 Thread Michael McCandless
The ES is just passed through to the IndexSearchers that NRTManager opens, so see IndexSearcher's javadocs. But it's not clear how much passing an ES to IS really helps; you should test yourself (and report back!). Also, I wrote this blog post: http://blog.mikemccandless.com/2011/11/near-re

Re: 3.5.0 javadocs link missing?

2012-01-15 Thread Michael McCandless
OK this is now fixed I think! And for the record: nothing seems to auto-push this site ;) Mike McCandless http://blog.mikemccandless.com On Fri, Jan 13, 2012 at 1:57 PM, Michael McCandless wrote: > Indeed the 3.5.0 link is missing!  I just committed a fix but is > this site automag

Re: ArrayIndexOutOfBoundsException: -65536

2012-01-15 Thread Michael McCandless
Do you have a full traceback of the exception? Mike McCandless http://blog.mikemccandless.com On Sun, Jan 15, 2012 at 7:21 PM, Duke DAI wrote: > Hi friends, > Any one meet ArrayIndexOutOfBoundsException: -65536 described in > https://issues.apache.org/jira/browse/LUCENE-1995 after it declared b

Re: ArrayIndexOutOfBoundsException: -65536

2012-01-18 Thread Michael McCandless
> Best regards, > Duke > If not now, when? If not me, who? > M 13818420095 > > > > On Mon, Jan 16, 2012 at 9:09 AM, Michael McCandless > wrote: >> >> Do you have a full traceback of the exception? >> >> Mike McCandless >> >> http:/

Re: ArrayIndexOutOfBoundsException: -65536

2012-01-19 Thread Michael McCandless
se AIOOBE? Is there any possible? > > > Best regards, > Duke > If not now, when? If not me, who? > > > > On Wed, Jan 18, 2012 at 9:47 PM, Michael McCandless > wrote: >> >> Hmm, are you certain your RAM buffer is 3 MB? >> >> Is it possible you ar

Re: Lucene 4.0 Get All Index Terms

2012-01-24 Thread Michael McCandless
Have a look at lucene/MIGRATE.txt? It [tries to] describe this change... and if something is missing please report back! Mike McCandless http://blog.mikemccandless.com On Tue, Jan 24, 2012 at 4:10 PM, Stephen Howe wrote: > Hi all, > > Looking at some older Lucene examples, I noticed for older

Re: Query term counting, again...

2012-01-26 Thread Michael McCandless
You should be able to use the Scorer.visitSubScorers API? You'd do this up front, to recursively gather all "interesting" scorers in the Query, and then in a custom collector, in the collect method, you can go and ask each subScorer whether it matched the current document (call its .freq() and see

Re: BlockJoinQuery in text queries

2012-01-26 Thread Michael McCandless
I don't think there is one yet... it's [still] one of the limitations I listed here: http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html But... if there were one, I don't think it would be user controllable. I think it's more of an up-front schema thing, eg you'd tell

Re: Why read past EOF

2012-02-01 Thread Michael McCandless
Right, you have to ensure (by using the "right" IndexDeletionPolicy) that no commit is ever removed until all readers open against that commit have been closed. "Normally" the filesystem ensures this for us (protects still-open files from being deleted), but NFS (unfortunately!) lacks such semanti
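
The rule stated above (never remove a commit until every reader opened against it has been closed) can be sketched in plain Java. This is only a model of the bookkeeping, not Lucene's IndexDeletionPolicy API: the class name RefCountedCommits and its methods are hypothetical, and it assumes all readers live in the same process as the writer, which is usually not the case on NFS.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java model (not the Lucene API): a commit is kept until no reader holds it. */
final class RefCountedCommits {
    // commit generation -> number of readers currently open against it
    private final Map<Long, Integer> openReaders = new HashMap<>();
    // generations still present in the index, oldest first
    private final List<Long> liveCommits = new ArrayList<>();

    /** Record that a reader was opened against the given commit generation. */
    synchronized void readerOpened(long generation) {
        openReaders.merge(generation, 1, Integer::sum);
    }

    /** Record that a reader on the given commit generation was closed. */
    synchronized void readerClosed(long generation) {
        // Drop the entry entirely once the last reader is gone.
        openReaders.computeIfPresent(generation, (g, n) -> n > 1 ? n - 1 : null);
    }

    /** Register a new commit and return the older generations that are safe to delete. */
    synchronized List<Long> onCommit(long newGeneration) {
        liveCommits.add(newGeneration);
        List<Long> deletable = new ArrayList<>();
        // The newest commit is always kept; older ones go only when unreferenced.
        List<Long> older = new ArrayList<>(liveCommits.subList(0, liveCommits.size() - 1));
        for (Long g : older) {
            if (!openReaders.containsKey(g)) {
                deletable.add(g);
                liveCommits.remove(g);
            }
        }
        return deletable;
    }
}
```

When readers run on other machines the writer cannot see these counts directly, which is presumably why the thread settles on a time-based policy instead.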

Re: Lucene appears to use memory maps after unmapping them

2012-02-01 Thread Michael McCandless
On Tue, Jan 31, 2012 at 9:42 PM, Trejkaz wrote: > So when we close() our own TextIndex wrapper class, it would call > decRef() - but if another thread is still using the index, this call > to decRef() wouldn't actually close the reader. IMO, this wouldn't > really satisfy the meaning of "close" f

Re: Why read past EOF

2012-02-03 Thread Michael McCandless
Instead of .getVersion() you should use .getTimestamp()... version is not "really" a timestamp. (Though, really, you should store your own timestamp inside the commit userData, and retrieve that, instead... the getTimestamp API will be deprecated in 3.6.0). Also, you may need to implement onInit,

Re: Configure writer to write to FSDirectory?

2012-02-05 Thread Michael McCandless
Are you using near-real-time readers? (IndexReader.open(IndexWriter)) Mike McCandless http://blog.mikemccandless.com On Sun, Feb 5, 2012 at 9:03 AM, Cheng wrote: > Hi Uwe, > > My challenge is that I need to update/modify the indexes frequently while > providing the search capability. I was try

Re: Configure writer to write to FSDirectory?

2012-02-06 Thread Michael McCandless
nager and SearcherManager things should be >> > >> >> easy and blazingly fast rather than unbearably slow.  The latter >> > >> >> phrase is not one often associated with lucene. >> > >> >> >> > >> >>

Re: Configure writer to write to FSDirectory?

2012-02-06 Thread Michael McCandless
Feb 6, 2012 at 11:46 AM, Cheng wrote: > Good point. I should remove the commits. > > Any difference between NRTCashingDirectory and RAMDirectory? how to define > the "small"? > > On Tue, Feb 7, 2012 at 12:42 AM, Michael McCandless < > luc...@mikemccandless.com>

Re: Why read past EOF

2012-02-08 Thread Michael McCandless
Hmm, there's a problem with the logic here (sorry: this is my fault -- my prior suggestion is flat out wrong!). The problem is... say you commit once, creating commit point 1. Two hours later, you commit again, creating commit point 2. The bug is, at this point, immediately on committing commit

Re: Why read past EOF

2012-02-11 Thread Michael McCandless
I'm glad the timed deletion policy is working on NFS! Thanks for bringing closure, Mike McCandless http://blog.mikemccandless.com On Fri, Feb 10, 2012 at 9:58 PM, superruiye wrote: > Thanks for your advice and patient. > I modify "present",and use stress testing two day(loop search and index),

Re: When to refresh writer?

2012-02-13 Thread Michael McCandless
IndexWriter doesn't require refreshing... just keep it open forever. It'll run its own merges when needed (see the MergePolicy/Scheduler). Just call .commit() when you want changes to be durable (survive OS/JVM crash, power loss, etc.). Mike McCandless http://blog.mikemccandless.com On Mon, Fe

Re: Why read past EOF

2012-02-15 Thread Michael McCandless
Is your deletion policy actually deleting commits? Mike McCandless http://blog.mikemccandless.com On Wed, Feb 15, 2012 at 5:21 AM, superruiye wrote: > http://lucene.472066.n3.nabble.com/file/n3746464/index.jpg > > The index files are same size,and the index increase to 7.5G in one day,but > it

Re: Why read past EOF

2012-02-16 Thread Michael McCandless
Wait: I see your DP above calling .delete() -- can you verify that code is in fact invoked? EG print on each onCommit how many commits are deleted and how many are not? Mike McCandless http://blog.mikemccandless.com On Wed, Feb 15, 2012 at 9:21 PM, superruiye wrote: > My IndexWriter only creat

Re: Why read past EOF

2012-02-17 Thread Michael McCandless
OK, thanks for bringing closure! Mike McCandless http://blog.mikemccandless.com On Thu, Feb 16, 2012 at 10:08 PM, superruiye wrote: > Oh,I made a mistake.Our testing server's time is faster hours than it should > be.I reminded workmate to modify it,and index maintain in a range size. > Thank y

Re: Why read past EOF

2012-02-17 Thread Michael McCandless
Hmm, though, one question: if you are using a single IndexWriter, always on the same machine, then it should not matter that the computer's clock is way off. Because, the DeletionPolicy is comparing timestamps pulled only from a single clock. Ie the shift won't matter; only relative comparisons m

Re: Here a merge thread, there a merge thread ...

2012-02-24 Thread Michael McCandless
This is from ConcurrentMergeScheduler (the default MergeScheduler). But, are you sure the threads are sleeping, not exiting? (They should be exiting). This merge scheduler starts a new thread when a merge is needed, allows that thread to do another merge (if one is immediately available), else t

Re: Building FST-like automaton queries

2012-02-28 Thread Michael McCandless
Neat :) It's like a FuzzyQuery w/ a custom (binary?) cost matrix for the insert/delete/transposition changes... Is the number of edits smallish? Ie you're not concerned about combinatoric explosion of step 1? For steps 2 and 3 you shouldn't use FST at all. Instead, for 2) use BasicAutomata.mak

Re: Building FST-like automaton queries

2012-02-28 Thread Michael McCandless
On Tue, Feb 28, 2012 at 8:42 AM, Alan Woodward wrote: > > On 28 Feb 2012, at 13:31, Michael McCandless wrote: > >> Neat :)  It's like a FuzzyQuery w/ a custom (binary?) cost matrix for >> the insert/delete/transposition changes... >> >> Is the number of ed

Re: How to add DocValues Field to a document in an optimal manner.

2012-03-01 Thread Michael McCandless
You shouldn't use doc.removeField -- it's costly (the fields are a list internally so we walk that list looking for which field(s) to remove). To reuse you can just use Field.setValue, and leave the Field instance on the Document. But: you should only do this if you really have a meaningful perfo

Re: CloseableThreadLocal problem

2012-03-01 Thread Michael McCandless
Phew, tricky. The problem is purging is potentially costly... it iterates all entries in the map (threads that have called get) looking for dead threads. Can you open an issue...? We can iterate there. Thanks for raising this, Mike McCandless http://blog.mikemccandless.com On Wed, Feb 29, 20

Re: Return value (or lack thereof) from IndexWriter.deleteDocuments

2012-03-04 Thread Michael McCandless
It's because the delete is buffered and only later applied in batch... so we can't easily know the count. Mike McCandless http://blog.mikemccandless.com On Sun, Mar 4, 2012 at 4:42 PM, Benson Margulies wrote: > Is there a reason why this doesn't return a count? Would a JIRA > requesting same be

Re: A little more CHANGES.txt help on terms(), please

2012-03-06 Thread Michael McCandless
I think MIGRATE.txt talks about this? Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 8:50 AM, Benson Margulies wrote: > Under "LUCENE-1458, LUCENE-2111: Flexible Indexing", CHANGES.txt > appears to be missing one critical hint. If you have existing code > that called Inde

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Michael McCandless
Hmm something is up here... I'll dig. Seems like we are somehow analyzing StringField when we shouldn't... Mike McCandless http://blog.mikemccandless.com On Tue, Mar 6, 2012 at 9:33 AM, Robert Muir wrote: > On Tue, Mar 6, 2012 at 9:23 AM, Benson Margulies > wrote: >> On Tue, Mar 6, 2012 at 9

Re: Problem with updating a document or TermQuery with current trunk

2012-03-06 Thread Michael McCandless
On Tue, Mar 6, 2012 at 10:06 AM, Benson Margulies wrote: > On Tue, Mar 6, 2012 at 10:04 AM, Robert Muir wrote: >> Thanks Benson: look like the problem revolves around indexing >> Document/Fields you get back from IR.document... this has always been >> 'lossy', but I think this is a real API trap.

Re: More About NOT Optimizing

2012-03-07 Thread Michael McCandless
Maybe try TieredMergePolicy to see if it'd do any merges here...? More responses below: On Tue, Mar 6, 2012 at 8:00 PM, Paul Hill wrote: > I have an index with 421163 documents (including body text) > after running a test index for a couple of months with 3.4 code with the > default LogByteSiz

Re: BlockGroupingCollector, not always getting first document

2012-03-08 Thread Michael McCandless
Hmm... that doesn't sound good. Is the issue repeatable once it happens? And, when it happens, can you verify that the index is correct (eg, the missing doc is retrievable by non-grouped searches)? This way we can isolate the issue to the search-side. Can you boil it down to a small test case?

Re: Re: BlockGroupingCollector, not always getting first document

2012-03-09 Thread Michael McCandless
On Thu, Mar 8, 2012 at 7:22 AM, Grzegorz Tańczyk wrote: > Hello, > > Thanks for reply, I can find first document from group using non grouping > search. OK, so the index seems ok. > To be sure about this I deleted index and indexed only first 100 groups > which gives around 2300 documents and I

Re: Re: Re: BlockGroupingCollector, not always getting first document

2012-03-09 Thread Michael McCandless
Phew, thanks for bringing closure! Mike McCandless http://blog.mikemccandless.com On Fri, Mar 9, 2012 at 8:52 AM, Grzegorz Tańczyk wrote: > Hello, > > I found the problem and it was my misunderstanding. I didn't get first > documents in every group, because some of head documents didn't match g

Re: TO Mike McCandless : ToParentBlockJoinQuery inconsistent return

2012-03-12 Thread Michael McCandless
Hi, Actually, this is a hard requirement for BlockJoinQuery: the parent document must always be last in the doc block; the package.html describes this I think? Mike McCandless http://blog.mikemccandless.com On Mon, Mar 12, 2012 at 12:57 PM, Jean-Marc MORAS wrote: > Dear > > Bravo for your work
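
The ordering requirement above (children first, parent last within each block) can be illustrated with a small plain-Java sketch. It does not use Lucene: Doc is a hypothetical stand-in for a document, and the "parent" type tag is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of the block layout; Doc is a stand-in, not a Lucene Document. */
final class DocBlocks {
    static final class Doc {
        final String type; // "parent" or "child" (assumed tags for this example)
        final String id;
        Doc(String type, String id) { this.type = type; this.id = id; }
    }

    /** Builds one block: all children first, the parent as the final entry. */
    static List<Doc> buildBlock(Doc parent, List<Doc> children) {
        List<Doc> block = new ArrayList<>(children);
        block.add(parent);
        return block;
    }

    /** True if the block holds exactly one parent and it is the last document. */
    static boolean isValidBlock(List<Doc> block) {
        if (block.isEmpty()) return false;
        for (int i = 0; i < block.size(); i++) {
            boolean isParent = "parent".equals(block.get(i).type);
            boolean isLast = i == block.size() - 1;
            if (isParent != isLast) return false;
        }
        return true;
    }
}
```

A check like isValidBlock could run just before a block is handed to the writer, so that a parent-first block is caught at index time rather than showing up as inconsistent join results.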

Re: TO Mike McCandless : ToParentBlockJoinQuery inconsistent return

2012-03-14 Thread Michael McCandless
On Wed, Mar 14, 2012 at 5:17 AM, Jean-Marc MORAS wrote: > ->  Ok now I have seen the mention of that on  ToParentBlockJoinQuery class > java doc > > ->  This java doc specify : "At search time you provide a Filter > * identifying the parents, however this Filter must provide > > * an {@link Fix

Re: lots of .cfs (compound files) in the index directory

2012-03-15 Thread Michael McCandless
Hmm, that's odd... Can you set IndexWriter's infoStream and then capture the output while doing the small writes every few seconds and post back? If you run CheckIndex on the index does it also report ~3000 segments? Mike McCandless http://blog.mikemccandless.com On Thu, Mar 15, 2012 at 7:14 A

Re: lots of .cfs (compound files) in the index directory

2012-03-15 Thread Michael McCandless
hu Mar 15 15:25:38 MET 2012; pool-2-thread-1]: commit: start > IW 53 [Thu Mar 15 15:25:38 MET 2012; pool-2-thread-1]: commit: enter lock > IW 53 [Thu Mar 15 15:25:38 MET 2012; pool-2-thread-1]: commit: already > prepared > IW 53 [Thu Mar 15 15:25:38 MET 2012; pool-2-thread-1]: commi

Re: lots of .cfs (compound files) in the index directory

2012-03-15 Thread Michael McCandless
is Solaris. Directory is a NAS. > Directory implementation is SimpleFSDirectory. > I sent you the full log. > > Thanks, > Tim > > On Thu, Mar 15, 2012 at 4:04 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> Hmm, which OS/filesystem is the index

Re: TO Mike McCandless : ToParentBlockJoinQuery inconsistent return

2012-03-15 Thread Michael McCandless
You're welcome! Happy searching, Mike McCandless http://blog.mikemccandless.com On Thu, Mar 15, 2012 at 11:40 AM, Jean-Marc MORAS wrote: > Thanks for your two responses. > > Best regards, > > Jean-Marc > > -- > > -> Ok now I have seen the mention of that on  ToParentBlockJoinQuery clas

Re: lots of .cfs (compound files) in the index directory

2012-03-15 Thread Michael McCandless
On Thu, Mar 15, 2012 at 12:02 PM, Tim Bogaert wrote: > while removing the prepareCommit we noticed we didn't actually called the > IW.commit() method before the IW.close(). > Altough the documentation says the close() method commits all the changes > we tried to add the commit() method before the

Re: lots of .cfs (compound files) in the index directory

2012-03-15 Thread Michael McCandless
On Thu, Mar 15, 2012 at 12:33 PM, Uwe Schindler wrote: > Close calls and always did call commit in 3.x? Right, it does. But in the case when prepareCommit was called... it then only commits the changes as of that prepareCommit and *not* any changes done after that and before close. That's the

Re: suppressing FreqProxPostingsArray

2012-03-19 Thread Michael McCandless
Hmm, I agree we could be more RAM efficient if the field is DOCS_ONLY. We shouldn't have to allocate/use docFreqs, lastDocCodes, lastPositions arrays (3 of the 7); the others are still needed, I think. But, that said, you shouldn't hit OOME, as long as your max heap size is large enough (and, yo

Re: BlockJoinQuery Clarification

2012-03-22 Thread Michael McCandless
You have to replace all documents in the block (1 parent, 4 children in your example) to update any of the documents... only updating the child (or child + parent) will break the join... There's also query-time joining coming in 3.6.0. Mike McCandless http://blog.mikemccandless.com On Thu, Mar

Re: ToParentBlockJoinQuery query loop finitely

2012-03-23 Thread Michael McCandless
I think you're hitting the exception because you passed trackScores=true to ToParentBlockJoinCollector. If you do that, the ScoreMode cannot be None... I'll update the javadocs to make this clear, and I'll also fix the exception message. I think you're hitting the infinite loop because your paren

Re: Lucene 4.0 getTermFreqVector and TermVectorMapper

2012-03-23 Thread Michael McCandless
Hi, The equivalent in trunk is IndexReader.getTermVectors. It returns a Fields instance, just like "normal" postings (IndexReader.fields()), except it's postings for just a single document. So, you can pull a specific field, iterate the terms, get the positions/offsets, etc. I'll update MIGRATE

Re: Document-Ids and Merges

2012-03-27 Thread Michael McCandless
In general how Lucene assigns docIDs is a volatile implementation detail: it's free to change from release to release. Eg, the default merge policy (TieredMergePolicy) merges out-of-order segments. Another eg: at one point, IndexSearcher re-ordered the segments on init. Another: because Concurre

Re: TVD, TVX and TVF files

2012-03-27 Thread Michael McCandless
The code seems OK on quick glance... Are you closing the writer? Are you hitting any exceptions? Mike McCandless http://blog.mikemccandless.com On Tue, Mar 27, 2012 at 12:19 PM, Luis Paiva wrote: > Hey all, > > i'm in my first steps in Lucene. > I was trying to index some txt files, and my pr

Re: Can I add new field values to a existing lucene index ?

2012-03-28 Thread Michael McCandless
Alas, no, not yet. This is an oft-requested feature, but challenging to build. That said, there is a possible start towards making something possible in 4.0: https://issues.apache.org/jira/browse/LUCENE-3837 Mike McCandless http://blog.mikemccandless.com On Wed, Mar 28, 2012 at 8:16 AM, Anu

Re: Document-Ids and Merges

2012-03-28 Thread Michael McCandless
On Wed, Mar 28, 2012 at 3:37 AM, Christoph Kaser wrote: > Thank you for your answer! > > That's too bad. I thought of using my own ID-field, but I wanted to save the > additional indirection (from docId to my ID to my value). > Do document IDs remain constant for one IndexReader as long as it isn'

Re: JoinUtil.createJoinQuery in 3.6 ?

2012-03-29 Thread Michael McCandless
It'll be in both 3.6 and 4.0. Mike McCandless http://blog.mikemccandless.com On Thu, Mar 29, 2012 at 7:55 AM, kiwi clive wrote: > Hi Guys, > Will this be available in Lucene 3.6 or is it only going into version 4.0 ? > > Clive ---

Re: Can I add new field values to a existing lucene index ?

2012-03-29 Thread Michael McCandless
On Wed, Mar 28, 2012 at 2:30 PM, Tim Eck wrote: > Excuse my ignorance of lucene internals, but is the problem any easier if > the requirement is just to allow the addition/removal of stored only fields > (as opposed to indexed)? It would substantially simplify the problem... but even this simplif

Re: TVD, TVX and TVF files

2012-04-02 Thread Michael McCandless
") || >              filename.endsWith(".xml") || filename.endsWith(".txt")) { >        queue.add(file); >      } else { >        System.out.println("Skipped " + filename); >      } >    } >  } > >  /** >   * Close the index. >   *

Re: Repeatability of results

2012-04-02 Thread Michael McCandless
Hmm that's odd. If the scores were identical I'd expect different sort order, since we tie-break by internal docID. But if the scores are different... the insertion order shouldn't matter. And, the score should not change as a function of insertion order... Do you have a small test case? Mike

Re: Repeatability of results

2012-04-04 Thread Michael McCandless
On Wed, Apr 4, 2012 at 6:15 PM, Alan Bawden wrote: > So I sat down to try to make a small test case that exhibited this > behavior, and while I was working on that I thought of a possible > explanation for what we are seeing.  If you agree that my explanation is > what's going on here, then Benson

Re: Slow merging after upgrading to 3.5

2012-04-05 Thread Michael McCandless
I'm assuming this is a "build once and never change" index...? Else, it sounds like you should never run forceMerge... To preserve insertion order you just need to use one of the Log*MergePolicy (which you are already doing). Merge factor doesn't affect this... For the fastest way to get to a s

Re: Slow merging after upgrading to 3.5

2012-04-06 Thread Michael McCandless
On Thu, Apr 5, 2012 at 3:31 PM, Ivan Brusic wrote: > On Thu, Apr 5, 2012 at 11:36 AM, Michael McCandless > wrote: >> I'm assuming this is a "build once and never change" index...?  Else, >> it sounds like you should never run forceMerge... > > Correct. Th

Re: IndexWriteConfig ignored?

2012-04-16 Thread Michael McCandless
RAM can be used in IndexWriter for other reasons: merge is running, near-real-time reader was opened. The RAMBufferSizeMB only applies to buffered postings (indexed documents). If you turn on IndexWriter's infoStream, do you see output saying it's flushing a new segment because RAM is > 5.0 MB? M

Re: Slow merging after upgrading to 3.5

2012-04-18 Thread Michael McCandless
ts have been dramatic. Our indexing time has returned to 2.3 > levels. > > Thanks again, > > Ivan > > On Fri, Apr 6, 2012 at 11:36 AM, Michael McCandless > wrote: >> On Thu, Apr 5, 2012 at 3:31 PM, Ivan Brusic wrote: >> >>> On Thu, Apr 5, 2012 at 11:36 AM,

Re: IndexWriter.isLock()

2012-05-06 Thread Michael McCandless
Hmm, not good. Are you sure the index was previously locked? Can you describe your environment? Which OS / Directory class are you using? Maybe boil down to a small code fragment showing the issue? Mike McCandless http://blog.mikemccandless.com On Sun, May 6, 2012 at 8:29 AM, S Eslamian wro

Re: IndexWriter.isLock()

2012-05-07 Thread Michael McCandless
es not contain write.lock file > and code goes to the if loop while it shouldn't passes the if clause! > > S Eslamian > > On Sun, May 6, 2012 at 5:56 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> Hmm, not good.  Are you sure the index was pr

Re: IndexWriter.isLock()

2012-05-07 Thread Michael McCandless
On Mon, May 7, 2012 at 7:19 AM, S Eslamian wrote: > hmm... , What is a leftover lock file? > > You know I debug my code, befor index folder has lock file, till line 7. > Then I close the program, like in a real run an interrupt has happened. How do you close it? Just kill the process? That is w

Re: IndexWriter.isLock()

2012-05-09 Thread Michael McCandless
On Tue, May 8, 2012 at 12:31 AM, S Eslamian wrote: > So if my program interrupts, the lock files in the indexes will be released > in the next run. hoom? If you use NativeFSLockFactory (which is the default for NIOFSDirectory) then, yes, the lock is always released by the OS when the process exi

Re: update/re-add an existing document with numeric fields

2012-05-10 Thread Michael McCandless
This is actually due to a bug: https://issues.apache.org/jira/browse/LUCENE-3065 which was fixed in 3.2. The bug is that, prior to Lucene 3.2, if you stored a NumericField, when you later load that document, the field is converted to an ordinary Field (no longer numeric), so when you then ind

Re: Lucene's internal doc ID space

2012-05-12 Thread Michael McCandless
On Sat, May 12, 2012 at 9:12 AM, Valeriy Felberg wrote: >> the Document IDs in Lucene are per segment. ie. they are always >> segment based. > > @Simon I'm just wondering: If the document IDs are per segment how > does it work if I call Searcher.search(Query, int) and get TopDocs > referencing Sco
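
The reply is cut off in this listing, so what follows is general background rather than a reconstruction of it. A top-level docID can be thought of as a segment's starting offset (its docBase) plus the segment-local docID. The sketch below is plain Java with hypothetical names (DocIdResolver, segmentMaxDocs) and assumes every segment holds at least one document.

```java
import java.util.Arrays;

/** Plain-Java sketch: maps a top-level docID to (segment, segment-local docID). */
final class DocIdResolver {
    private final int[] docBases; // first top-level docID of each segment
    private final int maxDoc;     // total number of documents across segments

    /** segmentMaxDocs[i] is the document count of segment i (assumed > 0). */
    DocIdResolver(int[] segmentMaxDocs) {
        docBases = new int[segmentMaxDocs.length];
        int base = 0;
        for (int i = 0; i < segmentMaxDocs.length; i++) {
            docBases[i] = base;
            base += segmentMaxDocs[i];
        }
        maxDoc = base;
    }

    /** Index of the segment containing the given top-level docID. */
    int segmentOf(int globalDoc) {
        if (globalDoc < 0 || globalDoc >= maxDoc) {
            throw new IllegalArgumentException("docID out of range: " + globalDoc);
        }
        int idx = Arrays.binarySearch(docBases, globalDoc);
        // A miss returns -(insertionPoint) - 1; the owning segment is insertionPoint - 1.
        return idx >= 0 ? idx : -idx - 2;
    }

    /** The docID relative to the start of its segment. */
    int localDoc(int globalDoc) {
        return globalDoc - docBases[segmentOf(globalDoc)];
    }
}
```

With segments of 10, 5 and 20 documents, top-level docID 12 falls in the second segment (index 1) at local position 2.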

Re: NullPointerException using IndexReader.termDocs when there are no matches

2012-05-17 Thread Michael McCandless
I think you need to pay attention to what td.next() returned; I suspect in your case it returned false which means you cannot use any of its APIs (.doc(), .freq(), etc.) after that. Mike McCandless http://blog.mikemccandless.com On Thu, May 17, 2012 at 5:52 PM, Edward W. Rouse wrote: > Lucene 3
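
The advice above amounts to: only call doc() or freq() after next() has returned true. A minimal plain-Java sketch of that contract follows. PostingsCursor is a hypothetical stand-in for an enumerator like TermDocs, not the Lucene class, and it throws IllegalStateException on misuse purely as a choice for this model.

```java
/** Plain-Java model of a next()/doc()/freq() cursor; not the Lucene TermDocs class. */
final class PostingsCursor {
    private final int[] docs;
    private final int[] freqs;
    private int pos = -1; // -1 means "before the first entry"

    PostingsCursor(int[] docs, int[] freqs) {
        if (docs.length != freqs.length) {
            throw new IllegalArgumentException("docs and freqs must have the same length");
        }
        this.docs = docs;
        this.freqs = freqs;
    }

    /** Advances to the next entry; returns false once the entries are exhausted. */
    boolean next() {
        if (pos < docs.length) pos++;
        return pos < docs.length;
    }

    /** Valid only while the last call to next() returned true. */
    int doc() { requirePositioned(); return docs[pos]; }

    /** Valid only while the last call to next() returned true. */
    int freq() { requirePositioned(); return freqs[pos]; }

    private void requirePositioned() {
        if (pos < 0 || pos >= docs.length) {
            throw new IllegalStateException("cursor is not positioned on an entry");
        }
    }

    /** Safe consumption pattern: the loop condition guards every accessor call. */
    static int totalFreq(PostingsCursor cursor) {
        int sum = 0;
        while (cursor.next()) {
            sum += cursor.freq();
        }
        return sum;
    }
}
```

For a term with no matches the loop body simply never runs, so no accessor is touched on an unpositioned cursor.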

Re: NullPointerException using IndexReader.termDocs when there are no matches

2012-05-18 Thread Michael McCandless
a next() method. > >> -Original Message- >> From: Michael McCandless [mailto:luc...@mikemccandless.com] >> Sent: Thursday, May 17, 2012 6:20 PM >> To: java-user@lucene.apache.org >> Subject: Re: NullPointerException using IndexReader.termDocs when there >

Re: Unable to run LookupBenchmarkTest

2012-05-19 Thread Michael McCandless
Good question! One way to run it is temporarily comment out the code in the validate method in lucene/test-framework/src/java/org/apache/lucene/util/TestRuleAssertionsRequired.java Maybe we should give this tool a static main instead of running it as a test case.. Mike McCandless http://blog.m

Re: ToParentBlockJoinQuery and grand-children

2012-05-23 Thread Michael McCandless
You do have to call getTopGroups for each grandchild query, and the order should match the TopGroups you got for the children. However looking at the code, I suspect there's a bug... by the time the collector collects the parent hit, some of the grand children will have been discarded. I susp

Re: ToParentBlockJoinQuery and grand-children

2012-05-24 Thread Michael McCandless
On Thu, May 24, 2012 at 11:48 AM, Christoph Kaser wrote: > thank you for your response. Unfortunately, I won't be able to try this > today, but I should be able to try it in the next few days. If I find the > bug you described, I will open an issue. Thanks! > On a somewhat related note, is ther

Re: Taking backup of Lucene DB

2012-05-25 Thread Michael McCandless
The simplest way is to stop all index writing (close the IndexWriter), do the copy, then start your IndexWriter again. If that's a problem (usually it is!) then use SnapshotDeletionPolicy to protect the commit point (ie prevent any of the files it uses from being deleted) while you are making the

Re: Directory, IndexInput and IndexOutput concurrency

2012-05-29 Thread Michael McCandless
Multiple threads are free to interact with Directory. But it will be only one thread at a time interacting with a single instance of IndexInput and IndexOutput. Mike McCandless http://blog.mikemccandless.com On Tue, May 29, 2012 at 6:39 PM, Dhruv wrote: > I am trying to implement an in-memory

Re: Deferring merging of index segments

2012-06-01 Thread Michael McCandless
64% greater index size when you merge at the end is odd. Can you post the ls -l output of the final index in both cases? Are you only adding (not deleting) docs? This is perfectly valid to do... but I'm surprised you see the two approaches taking about the same time. I would expect letting Luce

Re: Deferring merging of index segments

2012-06-02 Thread Michael McCandless
On Fri, Jun 1, 2012 at 8:09 PM, Vitaly Funstein wrote: > Yes, I am only calling IndexWriter.addDocument() OK. > Interestingly, relative performance of either approach seems to greatly > depend on the number of documents per index. In both types of runs, I used > 10 writer threads, each writing d

Re: OOM during IndexReader open

2012-06-02 Thread Michael McCandless
It could be your index has an unusual number of unique terms. If you can upgrade to the latest 3.x, the RAM used by the terms index has been very substantially reduced... If not, try setting the termInfosIndexDivisor to eg 2 or 3 ... this will load 1/2 or 1/3 of the indexed terms into RAM, but ma
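A rough back-of-the-envelope sketch of the divisor's effect follows. It assumes the 3.x terms index keeps every termIndexInterval-th term and that the default interval is 128; the class and method names are made up, and this only counts entries, not actual bytes of RAM:

```java
public class TermIndexEstimate {
    /**
     * Estimates how many terms-index entries are loaded into RAM.
     * Assumption: one index entry per termIndexInterval terms, and a divisor of N
     * keeps every Nth of those entries.
     */
    public static long loadedIndexTerms(long uniqueTerms, int termIndexInterval, int divisor) {
        // Ceiling division: entries written to the terms index at indexing time.
        long indexed = (uniqueTerms + termIndexInterval - 1) / termIndexInterval;
        // Ceiling division again: entries actually loaded when the reader is opened.
        return (indexed + divisor - 1) / divisor;
    }
}
```

Under those assumptions, 1 billion unique terms with interval 128 gives 7,812,500 entries at divisor 1 and 3,906,250 at divisor 2.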

Re: Deferring merging of index segments

2012-06-04 Thread Michael McCandless
ndexWriter.maybeMerge(); > IndexWriter.waitForMerges(); > > to simply calling IndexWriter.close(true) the disk size and run time are > now very close to the case of parallel segment merges. > > On Sat, Jun 2, 2012 at 6:43 AM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >>

Re: IndexCommit.delete() outside of IndexDeletionPolicy

2012-06-06 Thread Michael McCandless
I think this use case makes sense; such logic (for a distributed / ref counted deletion policy) would make a nice contribution ... it's the "proper" way to delete commits when multiple nodes are in use (vs eg using a timeout deletion policy). You can actually do it today: call IndexWriter.deleteUn

Re: IndexSearcher.search(query, filter, collector) considered less efficient

2012-06-08 Thread Michael McCandless
I think that javadoc is stale; my guess is it was written back when the collect method took a score, but we changed that so the collector calls .score() if it really needs the score... so I can't think of why that search method is inherently inefficient. I'll fix the javadocs (remove that warning)

Re: RAMDirectory unexpectedly slows

2012-06-18 Thread Michael McCandless
9 fold improvement using RAMDir over MMapDir is much more than I've seen (~30-40% maybe) in the past. Can you explain how you are using Lucene? You may also want to try the CachingRAMDirectory patch on https://issues.apache.org/jira/browse/LUCENE-4123 Mike McCandless http://blog.mikemccandless.

Re: zero sized cfs files in index lead to IOException: read past EOF

2012-06-19 Thread Michael McCandless
This shouldn't normally happen, even on crash, kill -9, power loss, etc. It can only mean either there is a bug in Lucene, or there's something wrong with your hardware/IO system, or the fsync operation doesn't actually work on the IO system. You can run CheckIndex to see what's broken (then, add

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
Likely the bottleneck is pulling content from the database? Maybe test just that and see how long it takes? 24 hours is way too long to index all of Wikipedia. For example, we index Wikipedia every night for our trunk/4.0 performance tests, here: http://people.apache.org/~mikemccand/luceneb

Re: zero sized cfs files in index lead to IOException: read past EOF

2012-06-19 Thread Michael McCandless
Hmm which Lucene version are you using? For 3.x before 3.4, there was a bug (https://issues.apache.org/jira/browse/LUCENE-3418) where we failed to actually fsync... More below: On Tue, Jun 19, 2012 at 4:54 PM, Chris Gioran wrote: > On Tue, Jun 19, 2012 at 6:18 PM, Michael McCandless >

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
ld it be possible to index Wikipedia in a 2 core machine with 3 GB in > RAM? I have had the same problem trying to index it. > > I've tried with a dump from april 2011. > > Thanks > Reyna > CIC-IPN > Mexico > > 2012/6/19 Michael McCandless > >> Likely the bot

Re: Wikipedia Index

2012-06-19 Thread Michael McCandless
I have the index locally ... but it's really impractical to send it especially if you already have the source text locally. Maybe index directly from the source text instead of via a database? Lucene's benchmark contrib/module has code to decode the XML into documents... Mike McCandless http://b

Re: any good idea for loading fields into memory?

2012-06-20 Thread Michael McCandless
Right, the field must have a single token for FieldCache. But if you are on 4.x you can use DocTermOrds (FieldCache.getDocTermOrds) which allows for multiple tokens per field. Mike McCandless http://blog.mikemccandless.com On Wed, Jun 20, 2012 at 9:47 AM, Li Li wrote: > but as l can remember,

Re: RAMDirectory unexpectedly slows

2012-06-30 Thread Michael McCandless
decaperated? > > Thanks > > On Mon, Jun 18, 2012 at 7:32 PM, Michael McCandless < > luc...@mikemccandless.com> wrote: > >> 9 fold improvement using RAMDir over MMapDir is much more than I've >> seen (~30-40% maybe) in the past. >> >>

Re: RAMDirectory and expungeDeletes()/optimize()

2012-07-11 Thread Michael McCandless
There are blanks at the important places (your code, and which JavaDoc) in your email! Mike McCandless http://blog.mikemccandless.com On Wed, Jul 11, 2012 at 6:18 AM, Konstantyn Smirnov wrote: > Hi all > > in my app (Lucene 3.5.0 powered) I index the documents (not too many, say up > to 100k) u

Re: RAMDirectory and expungeDeletes()/optimize()

2012-07-11 Thread Michael McCandless
What I meant was your original email says "My code looks like", followed by blank lines, and then "Doesn't it conflict with the JavaDoc saying:", followed by blank lines. Ie we can't see your code. However, when I look at your email here at http://lucene.472066.n3.nabble.com/RAMDirectory-and-expun
