Re: Incremental Field Updates
On 29 Mar 2010, at 07:45, Earwin Burrfoot wrote:

>>>> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.
>>> Not sure I see that need. Can you explain your reasoning a bit more?
>> If you want to update a document you need a way of expressing *which* document you are updating.
> This already works somehow for 'deleting' documents?

Yes, the convention being user-supplied keys. The question posed is: if we add another use case where keys are required, do we want to turn this existing informal convention into more formalized support the way databases do, e.g. duplicate-key checks on insert, auto-increment primary key generators?
Re: Incremental Field Updates
>>>> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.
>>> Not sure I see that need. Can you explain your reasoning a bit more?
>>> If you want to update a document you need a way of expressing *which* document you are updating.
>> This already works somehow for 'deleting' documents?
> Yes, the convention being user-supplied keys.

I can delete by Lucene-generated docId. It's too volatile to be a database-style PK, but nonetheless.

> The question posed is: if we add another use case where keys are required, do we want to turn this existing informal convention into more formalized support the way databases do, e.g. duplicate-key checks on insert, auto-increment primary key generators?

If someone needs this, it can be built over Lucene, without introducing it as a core feature and needlessly complicating things.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Incremental Field Updates
> I can delete by Lucene-generated docId.

Which users used to have to find by first coding a primary-key-term search. Delete-by-term removed this step to make life easier.

> If someone needs this, it can be built over Lucene, without introducing it as a core feature and needlessly complicating things.

I think with any partial-update feature the *absence* of primary key support would "needlessly complicate things": if Lucene is not capable of performing duplicate detection on insert (because it has no notion of a primary key field), we need to be prepared for the situation where we have duplicate-key docs in the index.

What then happens when Grant wants to do a "partial update", as opposed to the existing full-update semantics which first delete all documents containing the supplied term (always a form of primary key)? Which document instance gets "partially updated"? We either:

a) throw a "duplicate" error (which ideally should have happened back at dup-insert time)
b) choose one of the documents to "partially update" and keep the duplicate(s)
c) choose one of the documents to "partially update" and delete the duplicate(s)
d) "partially update" all of the duplicate(s)

All less than ideal. I know we are schema-averse with Lucene (and I value that), but surely any partial-update feature has to start with a strongly maintained notion of document identity as a foundation? Rather than "needless complexity" I'd argue this is "needed rigour", and it actually simplifies the user's job if Lucene can do the duplicate-key-on-insert check automatically, rather than relying on ropy application code and dealing with any failures in that.

Of course primary keys are not mandatory. You only use them when you need this behaviour - just like in SQL.

Cheers
Mark
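As a rough illustration of what "built over Lucene" application code looks like today for variant a), here is a minimal sketch against the Lucene 3.0-era API. The UniqueKeyWriter class and the "id" key field are hypothetical, and note the remaining hole for docs buffered in the writer but not yet committed - which is exactly the rigour being argued for:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Hypothetical app-side helper: enforce unique keys on insert, outside Lucene's core.
    public class UniqueKeyWriter {
        private final IndexWriter writer;
        private final String keyField; // e.g. "id" -- an app convention, not a Lucene concept

        public UniqueKeyWriter(IndexWriter writer, String keyField) {
            this.writer = writer;
            this.keyField = keyField;
        }

        /** Variant a): throw a "duplicate" error at insert time. */
        public synchronized void insert(Document doc, String key) throws IOException {
            IndexReader reader = IndexReader.open(writer.getDirectory(), true);
            try {
                // Only sees committed docs -- docs still buffered in the writer slip through.
                if (reader.docFreq(new Term(keyField, key)) > 0) {
                    throw new IllegalStateException("duplicate key: " + key);
                }
            } finally {
                reader.close();
            }
            writer.addDocument(doc);
        }
    }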
[jira] Commented: (LUCENE-2351) optimize automatonquery
[ https://issues.apache.org/jira/browse/LUCENE-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850844#action_12850844 ]

Michael McCandless commented on LUCENE-2351:
--------------------------------------------

OOOH I like this approach!! It makes the linear decision "local", and bounds (by linearUpperBound) the region, so that we don't have to revisit the decision on every term. And it enables efficiently using the suffix :)

And it's FAST! With this fix, the hard query (un*t) on flex is 105 QPS (best of 5, on a 5M-doc Wikipedia index), vs 62 QPS on trunk. Yay :)

> optimize automatonquery
> ------------------------
>
>                 Key: LUCENE-2351
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2351
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2351.patch, LUCENE-2351.patch, LUCENE-2351_infinite.patch, LUCENE-2351_infinite.patch
>
>
> Mike found a few cases in flex where we have some bad behavior with automatonquery.
> The problem is similar to a database query planner, where sometimes simply doing a full table scan is faster than using an index.
> We can optimize automatonquery a little bit, and get better performance for fuzzy, wildcard, regex queries.
> Here is a list of ideas:
> * create commonSuffixRef for infinite automata, not just really-bad linear scan cases
> * do a null check rather than populating an empty commonSuffixRef
> * localize the 'linear' case to not seek, but instead scan, when ping-ponging against loops in the state machine
> * add a mechanism to enable/disable the terms dict cache, e.g. we can disable it for infinite cases, and maybe fuzzy N>1 also.
> * change the use of BitSet to OpenBitSet or long[] gen for path-tracking
> * optimize the backtracking code where it says /* String is good to go as-is */, this need not be a full run(), I think...
Re: Incremental Field Updates
>> If someone needs this, it can be built over Lucene, without introducing it as a core feature and needlessly complicating things.
>
> I think with any partial-update feature the *absence* of primary key support would "needlessly complicate things": if Lucene is not capable of performing duplicate detection on insert (because it has no notion of a primary key field), we need to be prepared for the situation where we have duplicate-key docs in the index.
>
> What then happens when Grant wants to do a "partial update", as opposed to the existing full-update semantics which first delete all documents containing the supplied term (always a form of primary key)? Which document instance gets "partially updated"? We either:
>
> a) throw a "duplicate" error (which ideally should have happened back at dup-insert time)
> b) choose one of the documents to "partially update" and keep the duplicate(s)
> c) choose one of the documents to "partially update" and delete the duplicate(s)
> d) "partially update" all of the duplicate(s)
>
> All less than ideal.

Variant d) sounds most logical? And enables all sorts of fun stuff.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850857#action_12850857 ]

Michael McCandless commented on LUCENE-2324:
--------------------------------------------

Yeah I think we're gonna need the global sequenceID in some form -- my Options 1 or 2 can't work, because the interleaving issue (as seen/required by the app) is a global thing.

> Per thread DocumentsWriters that write their own private segments
> ------------------------------------------------------------------
>
>                 Key: LUCENE-2324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2324
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process, and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and "normal" segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
Re: Incremental Field Updates
>> Variant d) sounds most logical? And enables all sorts of fun stuff.

So the duplicate-key docs can have different values for initial-insert fields, but partial updates will cause sharing of a common field value? And subsequent same-key doc inserts do or don't share these previous "partial-update" values?

Sounds like a complex model for users to understand, let alone code support for. Everyone gets primary keys, though.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850864#action_12850864 ]

Michael McCandless commented on LUCENE-2324:
--------------------------------------------

bq. Mike, can you explain what the advantages of this kind of thread affinity are? I was always wondering why the DocumentsWriter code currently makes efforts to assign a ThreadState always to the same Thread? Is that being done for performance reasons?

It's for performance. I expect there are apps where a given thread/pool indexes certain kinds of docs, ie, the app threads themselves have "affinity" for docs with similar term distributions. In which case, it's best (most RAM efficient) if those docs w/ presumably similar term stats are sent back to the same DW. If you mix different term stats into one buffer you get worse RAM efficiency.

Also, for better RAM efficiency you want *fewer* DWs... because we get more RAM efficiency the higher the freq of the terms... but of course you want more DWs for better CPU efficiency whenever that many threads are running at once. Net/net CPU efficiency should trump RAM efficiency, I think, so if there is a conflict we should favor CPU efficiency.

Though, thread affinity doesn't seem that CPU-costly to implement? Look up the DW your thread first used... if it's free, seize it. If it's not, fall back to any DW that's free.

> Per thread DocumentsWriters that write their own private segments
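A toy sketch of the affinity policy described above -- purely illustrative, with hypothetical class names (WriterPool, DocsWriter); the real per-thread DocumentsWriter design was still open on this issue:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical: prefer the DocumentsWriter this thread used last; if busy, take any free one.
    class WriterPool {
        private final DocsWriter[] writers;
        private final ConcurrentHashMap<Thread, DocsWriter> affinity =
                new ConcurrentHashMap<Thread, DocsWriter>();

        WriterPool(int n) {
            writers = new DocsWriter[n];
            for (int i = 0; i < n; i++) writers[i] = new DocsWriter();
        }

        DocsWriter acquire() {
            DocsWriter preferred = affinity.get(Thread.currentThread());
            if (preferred != null && preferred.lock.tryLock()) {
                return preferred; // affinity hit: similar term stats land in the same buffer
            }
            // Fall back to any free writer: CPU efficiency trumps RAM efficiency.
            // (Real code would block/park instead of spinning.)
            while (true) {
                for (DocsWriter w : writers) {
                    if (w.lock.tryLock()) {
                        affinity.put(Thread.currentThread(), w);
                        return w;
                    }
                }
            }
        }

        void release(DocsWriter w) { w.lock.unlock(); }

        static class DocsWriter {
            final ReentrantLock lock = new ReentrantLock();
            // buffered private-segment state would live here
        }
    }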
Re: Incremental Field Updates
>> Variant d) sounds most logical? And enables all sorts of fun stuff.
>
> So the duplicate-key docs can have different values for initial-insert fields, but partial updates will cause sharing of a common field value? And subsequent same-key doc inserts do or don't share these previous "partial-update" values?
>
> Sounds like a complex model for users to understand, let alone code support for. Everyone gets primary keys, though.

What you say IS complex. Sharing? Bleargh.

But everyone digs "update qweqwe set field=value where some_condition". Who ever said that some_condition should point to a unique document? It could, if you wish it so. Or you can do bulk updates if that's what you need. Very flexible, and no need to introduce any new concepts.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Incremental Field Updates
I agree this is a long overdue feature... we need to get it into Lucene somehow.

I like the Layers analogy... I think that will work well with Lucene's transactional semantics, ie a prior commit point would continue to see the index before the updates, but new commit points would see the updates.

I think we would somehow want the new postings "layer" written to cleanly be merged under Docs/PositionsEnum? So that searching is unaffected -- ie the scorers just see a normal postings enum. FieldCache would also just populate normally. But somehow these partial docs would have to not "count" as real docIDs... and the normal merging of segments would coalesce these updates...

Also: how would we handle stored fields & term vectors?

Mike

On Sat, Mar 27, 2010 at 7:25 AM, Grant Ingersoll wrote:
> First off, this is something I've had in my head for a long time, but don't have any code.
>
> As many of you know, one of the main things that vexes any search engine based on an inverted index is how to do fast updates of just one field w/o having to delete and re-add the whole document like we do today. When I think about the whole update problem, I keep coming back to the notion of Photoshop (or any other real photo editing solution) Layers. In a photo editing solution, when you want to hide/change a piece of a photo, it is considered best practice to add a layer over that part of the photo to be changed. This way, the original photo is maintained and you don't have to worry about accidentally damaging the area you aren't interested in. Thus, a layer is essentially a mask on the original photo. The analogy isn't quite the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now. Then, when an update comes in on a particular field, we somehow mark in the index that the document in question is modified, and then we add the new change onto the end of the index (just like we currently do when adding new docs, but this time it's just a doc w/ a single field). Then, when searching, we would, when scoring the affected documents, go to a secondary process that knew where to look up the incremental changes. As background merging takes place, these "disjoint" documents would be merged back together. We'd maybe even consider a "high update" merge scheduler that could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes. That is, is it something that goes in the posting list, or is it a side structure? I think in the posting list would be too slow. Also, perhaps it is worthwhile for people to indicate that a particular field is expected to be updated while others maintain their current format, so as not to incur the penalty on each.
>
> In a sense, the old field for that document is masked by the new field. I think, given proper index structure, that we maybe could make that marking of the old field fast (maybe it's a pointer to the new field, maybe it's just a bit indicating to go look in the "update" segment).
>
> On the search side, I think performance would still be maintained b/c even in high-update envs. you aren't usually talking about more than a few thousand changes in a minute or two, and the background merger would be responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it will spawn some questions/ideas that will help us work it out in a better way.
>
> At any rate, I think adding incr. field update capability would be a huge win for Lucene.
>
> -Grant
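To make the layering idea concrete, a toy sketch of how a per-field "mask" might resolve reads, with in-memory maps standing in for index structures -- none of this exists in Lucene; it only illustrates the newest-layer-wins lookup that a background merge would later fold away:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical "layers" resolution: a field update masks the original value
    // until merging folds the layers back into one document.
    class LayeredFields {
        // base storage: docId -> (field -> value)
        private final Map<Integer, Map<String, String>> base =
                new HashMap<Integer, Map<String, String>>();
        // update layer, appended like a tiny one-field document
        private final Map<Integer, Map<String, String>> updates =
                new HashMap<Integer, Map<String, String>>();

        void update(int docId, String field, String value) {
            Map<String, String> layer = updates.get(docId);
            if (layer == null) {
                layer = new HashMap<String, String>();
                updates.put(docId, layer);
            }
            layer.put(field, value); // mask the old value; the base doc is untouched
        }

        String get(int docId, String field) {
            Map<String, String> layer = updates.get(docId);
            if (layer != null && layer.containsKey(field)) {
                return layer.get(field); // masked: read from the update layer
            }
            Map<String, String> doc = base.get(docId);
            return doc == null ? null : doc.get(field);
        }
    }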
Re: Baby steps towards making Lucene's scoring more flexible...
On Thu, Mar 25, 2010 at 1:20 PM, Marvin Humphrey wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>>>> Also, will Lucy store the original stats?
>>>
>>> These?
>>>
>>> * Total number of tokens in the field.
>>> * Number of unique terms in the field.
>>> * Doc boost.
>>> * Field boost.
>>
>> Also sum(tf). Robert can generate more :)
>
> Hmm, aren't "Total number of tokens in the field" and sum(tf) normally equivalent? I guess there might be analyzers for which that isn't true, e.g. those which perform synonym-injection?
>
> In any case, "sum(tf)" is probably a better definition, because it makes no ancillary claims...

Sorry, yes they are.

>>> Incidentally, what are you planning to do about field boost if it's not always 1.0? Are you going to store full 32-bit floats?
>>
>> For starters, yes.
>
> OK, how are those going to be encoded? IEEE 754? Big-endian?
>
> http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness

For starters, I think so. Lucene's ints are big-endian today.

>> We may (later) want to make a new attr that sets the #bits (levels/precision) you want... then uses packed ints to encode.
>
> I'm concerned that the bit-wise entropy of floats may make them a poor match for compression via packed ints. We'll probably get a compressed representation which is larger than the original.
>
> Are there any standard algorithms out there for compressing IEEE 754 floats? RLE works, but only with certain data patterns.
>
> ... [ time passes ] ...
>
> Hmm, maybe not:
>
> http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

Sorry, I was proposing a fixed-point boost, where you specify how many levels (in bits, powers of 2) you want.

>> I was specifically asking if Lucy will allow the user to force true average to be recomputed, ie, at commit time from the writer.
>
> That's theoretically possible. We'd have to implement the reader the same way we have DeletionsReader -- the most recent segment may contain data which applies to older segments.

OK.

> Here's the DeletionsReader code, which searches backwards through the segments looking for a particular file:
>
>     /* Start with deletions files in the most recently added segments and work
>      * backwards. The first one we find which addresses our segment is the
>      * one we need. */
>     for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
>         Segment *other_seg = (Segment*)VA_Fetch(segments, i);
>         Hash *metadata
>             = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
>         if (metadata) {
>             Hash *files = (Hash*)CERTIFY(
>                 Hash_Fetch_Str(metadata, "files", 5), HASH);
>             Hash *seg_files_data
>                 = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
>             if (seg_files_data) {
>                 Obj *count = (Obj*)CERTIFY(
>                     Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
>                 del_count = (i32_t)Obj_To_I64(count);
>                 del_file = (CharBuf*)CERTIFY(
>                     Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
>                 break;
>             }
>         }
>     }

Hmm -- similar to tombstones? But different in that the most recently written file has *all* deletions for that old segment? Ie you don't have to OR together N generations of written deletions... only 1 file has all current deletions for the segment?

This is somewhat wasteful of disk space though? Hmm, unless your deletion policy can reclaim the now-stale deletions files from past flushed segments?

> What we'd do is write the regenerated boost bytes for *all* segments to the most recent segment. It would be roughly analogous to building up an NRT reader.

Right, except Lucy must go through the filesystem.

>>> What's trickier is that Schemas are not normally mutable, and that they are part of the index. You don't have to supply an Analyzer, or a Similarity, or anything else when opening a Searcher -- you just provide the location of the index, and the Schema gets deserialized from the latest schema_NNN.json file. That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much a thing of the past for us.
>>
>> That's nice... though... is it too rigid? Do users even want to pick a different analyzer at search time?
>
> It's not common.
>
> To my mind, the way a field is tokenized is part of its field definition, thus the Analyzer is part of the field definition, thus the analyzer is part of the schema and needs to be stored with the index.

OK.

> Still, we support different Analyzers at search time by way of QueryParser. QueryParser's constructor requires a Schema, but also accepts an optional Analyzer which, if supplied, will be used instead of the Analyzers from the Schema.

Ahh OK there's an out.

> Maybe aggressive automatic data-reduction makes more sense in the context of "flexible matching", which is more expansive than "flexible scoring"?

I think so. Maybe it shouldn't be called a Similarity (which to me (though, carrying a heavy curse of knowledge burden...) means "scoring")? Matcher?
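As an aside, a small sketch of the fixed-point boost idea mentioned above -- quantize a float boost onto 2^bits levels so it packs well with packed ints. Illustrative only; the [min, max] range and bit count are made up, not anything Lucene or Lucy defines:

    // Illustrative fixed-point boost quantization: map a boost in [min, max]
    // onto 2^bits levels so the codes compress well with packed ints.
    public class FixedPointBoost {
        public static int encode(float boost, float min, float max, int bits) {
            int levels = 1 << bits;
            float clamped = Math.max(min, Math.min(max, boost));
            return Math.round((clamped - min) / (max - min) * (levels - 1));
        }

        public static float decode(int code, float min, float max, int bits) {
            int levels = 1 << bits;
            return min + code * (max - min) / (levels - 1);
        }

        public static void main(String[] args) {
            int code = encode(1.3f, 0f, 4f, 5);          // 5 bits -> 32 levels -> code 10
            System.out.println(code);
            System.out.println(decode(code, 0f, 4f, 5)); // ~1.29, close to 1.3
        }
    }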
Re: Baby steps towards making Lucene's scoring more flexible...
I think that's a good idea for Lucy.

Mike

On Fri, Mar 26, 2010 at 10:58 AM, Marvin Humphrey wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>>>> Maybe aggressive automatic data-reduction makes more sense in the context of "flexible matching", which is more expansive than "flexible scoring"?
>>>
>>> I think so. Maybe it shouldn't be called a Similarity (which to me (though, carrying a heavy curse of knowledge burden...) means "scoring")? Matcher?
>
> I think we can express the difference between your proposed approach for Lucene Similarity (no effect on index) and my proposed approach for Lucy Similarity (aggressive index-time data reduction) by putting Lucy's Similarity under Lucy::Index instead of Lucy::Search.
>
> Marvin Humphrey
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850884#action_12850884 ]

Michael McCandless commented on LUCENE-2329:
--------------------------------------------

I think we need to fix how RAM is managed for this... right now, if you turn on IW's infoStream you'll see a zillion prints where IW tries to balance RAM (it "runs hot"), but nothing can be freed. We do this per-doc, after the parallel arrays resize themselves to, net/net, exceed our allowed RAM buffer. A few ideas on how we can fix this:

* I think we have to change when we flush. It's now based on RAM used (not alloc'd), but I think we should switch it to use RAM alloc'd after we've freed all we can. Ie if we free things up and we've still alloc'd over the limit, we flush. This'll fix the running hot we now see...

* TermsHash.freeRAM is now a no-op, right? We have to fix this to actually free something when it can, because you can imagine indexing docs that are postings-heavy but then switching to docs that are byte[]-block-heavy. On that switch you have to balance the allocations (ie, shrink your postings). I think we should walk the threads/fields and use ArrayUtil.shrink to shrink down, but don't shrink by much at a time (to avoid running hot) -- IW will invoke this method again if more shrinkage is needed.

* Also, shouldn't we use ArrayUtil.grow to increase, instead of always a 50% growth? Because with such a large growth you can easily have horrible RAM efficiency... ie that 50% growth can suddenly put you over the limit and then you flush, having effectively used only half of the allowed RAM buffer in the worst case.

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in TermsHashPerField, we want to switch to parallel arrays. The termsHash will simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This will avoid the need for object pooling, and will remove the overhead of object initialization and garbage collection. Especially garbage collection should benefit significantly when the JVM runs low on memory, because in such a situation the gc mark times can get very long if there is a big number of long-living objects in memory.
> Another benefit could be to build more efficient TermVectors. We could avoid the need of having to store the term string per document in the TermVector. Instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. Though this improvement we can make with a separate jira issue.
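To illustrate the worst case described in the last bullet, a toy calculation -- not Lucene code; the real ArrayUtil growth policy differs and the numbers here are made up -- showing how a large growth factor can strand RAM well below the flush threshold:

    // Illustrative only: how much of a RAM budget does a growth policy actually
    // use before the next growth step would trip the flush threshold?
    public class GrowthDemo {
        static long fill(long budgetBytes, double growthFactor) {
            long allocated = 16 * 4; // start: a 16-entry int[] (4 bytes per entry)
            while (true) {
                long next = (long) (allocated * growthFactor);
                if (next > budgetBytes) return allocated; // would overshoot -> flush now
                allocated = next;
            }
        }

        public static void main(String[] args) {
            long budget = 64 * 1024 * 1024; // a 64 MB buffer
            System.out.println("1.5x   growth uses " + fill(budget, 1.5) + " bytes");
            System.out.println("1.125x growth uses " + fill(budget, 1.125) + " bytes");
            // The gentler step lands much closer to the 64 MB limit, ie better
            // worst-case RAM efficiency, at the cost of more array copies.
        }
    }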
[jira] Reopened: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-2329:
----------------------------------------

Reopening to fix the RAM balancing problems...

> Use parallel arrays instead of PostingList objects
[jira] Created: (LUCENE-2356) Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
----------------------------------------------------------------------------------------------

                 Key: LUCENE-2356
                 URL: https://issues.apache.org/jira/browse/LUCENE-2356
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Michael McCandless
             Fix For: 3.1


Opening a placeholder issue... if all the refactoring being discussed doesn't make this possible, then we should add a setting to IWC to do so.

Apps with very large numbers of unique terms must set the terms index divisor to control RAM usage. (NOTE: flex's RAM terms dict index RAM usage is more efficient, so this will help such apps.)

But when IW resolves deletes internally, it always uses the default terms index divisor of 1, and the app cannot change that. Though one workaround is to call getReader(termInfosIndexDivisor), which will pool the reader with the right divisor.
[jira] Created: (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
-----------------------------------------------------------------------------------------

                 Key: LUCENE-2357
                 URL: https://issues.apache.org/jira/browse/LUCENE-2357
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael McCandless
            Priority: Minor
             Fix For: 3.1


We allocate this int[] to remap docIDs due to compaction of deleted ones. This uses a lot of RAM for large segment merges, and can fail to allocate due to fragmentation on 32-bit JREs.

Now that we have packed ints, a simple fix would be to use a packed int array... and maybe instead of storing the abs docID in the mapping, we could store the number of del docs seen so far (so the remap would do a lookup then a subtract). This may add some CPU cost to merging, but should bring down transient RAM usage quite a bit.
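A small sketch of the proposed scheme -- illustrative only; a plain int[] stands in for the packed-ints structure, which would store these small, slowly growing counts in far fewer bits:

    import java.util.BitSet;

    // Illustrative docID remapping: instead of storing the new absolute docID
    // for every old docID, store how many deleted docs precede it; the remap
    // is then a lookup plus a subtract.
    public class DocIdRemap {
        private final int[] delCounts; // delCounts[d] = number of deletions in [0, d)
        private final BitSet deleted;

        public DocIdRemap(BitSet deleted, int maxDoc) {
            this.deleted = deleted;
            delCounts = new int[maxDoc];
            int seen = 0;
            for (int d = 0; d < maxDoc; d++) {
                delCounts[d] = seen;
                if (deleted.get(d)) seen++;
            }
        }

        /** New docID after compaction, or -1 if the doc itself was deleted. */
        public int remap(int oldDocId) {
            if (deleted.get(oldDocId)) return -1;
            return oldDocId - delCounts[oldDocId];
        }
    }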
[jira] Commented: (LUCENE-2356) Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
[ https://issues.apache.org/jira/browse/LUCENE-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850887#action_12850887 ]

Michael McCandless commented on LUCENE-2356:
--------------------------------------------

I won't have any time to take this any time soon :) So if anyone has the itch, jump!

> Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
Re: Incremental Field Updates
>> Who ever said that some_condition should point to a unique document?

My assumption was, for now, we were still talking about the simpler case of updating a single document. If we extend the discussion to support set-based updates, it's worth considering the common requirements for updating sets:

a) update values can be non-constants, such as "reduce price of all products in ski-wear dept by 10%".
b) the criteria to define the set can be most usefully expressed as a query rather than mandating a single term, e.g. "set published:false on all docs in last week's date range".

That feels like too much functionality to consider adding right now, but I can see a much more basic solution is possible which supports single and simple set-based updates.
Re: Incremental Field Updates
>> Who ever said that some_condition should point to a unique document?
>
> My assumption was, for now, we were still talking about the simpler case of updating a single document. If we extend the discussion to support set-based updates, it's worth considering the common requirements for updating sets: ...

I must be missing something :)

a) We're not a freaking database, why the constant attempts to compare ourselves to one / mimic some of its functionality?
b) The criteria to define the set of deleted documents can already be expressed as a query - IndexWriter.deleteDocuments(query).

So what I am offering is to preserve the way to point at the docs we want to see deleted, and allow partial modifications on them. Thus we add new and exciting functionality, while introducing zero new concepts. Profit?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
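A sketch of what that would look like next to today's API: deleteDocuments(Query) exists in Lucene; the updateDocuments call is hypothetical, the shape of the proposal rather than anything implemented. The index setup is only there to make the example self-contained:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class UpdateByQuerySketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(new RAMDirectory(),
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // Existing API: delete every doc matching a query, single or bulk.
            Query lastWeek = new TermQuery(new Term("week", "2010-12"));
            writer.deleteDocuments(lastWeek);

            // Hypothetical API in the spirit of this thread (NOT in Lucene):
            // "update ... set published=false where week:2010-12"
            // writer.updateDocuments(lastWeek, changedFields);

            writer.close();
        }
    }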
Re: Incremental Field Updates
On Mar 29, 2010, at 2:26 AM, Mark Harwood wrote:

>>> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.
>>
>> Not sure I see that need. Can you explain your reasoning a bit more?
>
> If you want to update a document you need a way of expressing *which* document you are updating.

Of course, but what about the Lucene doc id doesn't provide that?
Re: Incremental Field Updates
On 2010-03-29 12:26, Michael McCandless wrote:
> I agree this is a long overdue feature... we need to get it into Lucene somehow.
>
> I like the Layers analogy... I think that will work well with Lucene's transactional semantics, ie a prior commit point would continue to see the index before the updates, but new commit points would see the updates.

I'm coming late to this discussion ... are you guys familiar with this paper? It seems to describe the same model of incremental field-level updates, and the algo operates on internal Lucene ids:

http://portal.acm.org/citation.cfm?id=1458171

-- 
Best regards,
Andrzej Bialecki
http://www.sigram.com  Contact: info at sigram dot com
AW: Incremental Field Updates
They filed this as a patent, too: http://www.freepatentsonline.com/y2009/0228528.html

Regards
Uwe Goetzke

-----Original Message-----
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Monday, 29 March 2010 14:50
To: java-dev@lucene.apache.org
Subject: Re: Incremental Field Updates

> I'm coming late to this discussion ... are you guys familiar with this paper? It seems to describe the same model of incremental field-level updates, and the algo operates on internal Lucene ids:
>
> http://portal.acm.org/citation.cfm?id=1458171
Re: AW: Incremental Field Updates
On 2010-03-29 15:11, Uwe Goetzke wrote:
> They filed this as a patent, too: http://www.freepatentsonline.com/y2009/0228528.html

.. which is not granted yet, right? It's a patent application. Besides, I live in the EU ;)

-- 
Best regards,
Andrzej Bialecki
http://www.sigram.com  Contact: info at sigram dot com
[jira] Commented: (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850914#action_12850914 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

I won't have any time to take this any time soon :) So if anyone has the itch, jump!

> Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
[jira] Commented: (LUCENE-2356) Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
[ https://issues.apache.org/jira/browse/LUCENE-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850916#action_12850916 ]

Michael McCandless commented on LUCENE-2356:
--------------------------------------------

The above comment was on the wrong issue :)

We should only do this issue if the ongoing ideas about refactoring IW/IR don't make controlling the terms index divisor possible for readers opened by IW.

> Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
Re: Incremental Field Updates
> Of course, but what about the Lucene doc id doesn't provide that?

The question being how you determine the correct doc id to use in the first place (especially when doc ids are known to be volatile) - the current answer is to use a stable identifier term which your app holds in the index, AKA a primary key.

To support single-doc updates, app developers currently have to:
a) allocate keys uniquely
b) ensure they do not store >1 document with the same key.

My suggestion was that, these being fundamental requirements for supporting updates, Lucene could, as a convenience, provide some support for this in its API - in the same way a database typically does.

Earwin has perhaps extended your (and my) original thinking to incorporate set-based updates (a single set of values applied to many documents which match a query). His proposal (correct me if I'm wrong, Earwin) is that single and set-based changes could both be supported by a single IndexWriter.updateDocuments(query, changedFields) type method. The benefit of this scheme is that we are providing a simple method, re-using established concepts (Queries for document selection), but this does not change the fact that many users will still need to use primary keys for single-doc updates, and they have to assume responsibility for a) and b) above.

On reflection, I guess these responsibilities are not too tough. a) is catered for by the fact that Lucene is not typically the master data store (yet!), and the filesystem/webserver/database datasources where document content is sourced usually have the responsibility to allocate some form of unique identifier in the form of URLs, database keys or filenames which can be used. Also, b) is not too hard to handle in app code if you always use the IndexWriter.updateDocument(term, doc) method for inserts.

Cheers,
Mark
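For concreteness, the insert idiom Mark describes, using the real IndexWriter.updateDocument(Term, Document) API; the "id" field name and the RAMDirectory setup are just for the example:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class UpsertExample {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(new RAMDirectory(),
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            Document doc = new Document();
            // App-supplied primary key, e.g. a URL, filename, or database key:
            doc.add(new Field("id", "doc-42", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", "some content", Field.Store.NO, Field.Index.ANALYZED));

            // Used for *all* inserts, this keeps at most one live doc per key:
            // any existing doc(s) with id:doc-42 are deleted, then this one is added.
            writer.updateDocument(new Term("id", "doc-42"), doc);

            writer.close();
        }
    }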
[jira] Commented: (LUCENE-2356) Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
[ https://issues.apache.org/jira/browse/LUCENE-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850928#action_12850928 ]

Earwin Burrfoot commented on LUCENE-2356:
-----------------------------------------

That's likely orthogonal.

If you want all IW readers to have the same divisor - shove it into IWC and it's all done. If you want to use different divisors when returning an SR as part of an NRT reader and when using it inside (say, for deletions) - okay, you'll have the ability to do that at the cost of a partial SR reload - shove two settings into IWC and it's done.

> Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850989#action_12850989 ]

Michael Busch commented on LUCENE-2329:
---------------------------------------

Good catch! Thanks for the thorough explanation and suggestions. I think it all makes sense. Will work on a patch.

> Use parallel arrays instead of PostingList objects
[jira] Commented: (LUCENE-2354) Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[ https://issues.apache.org/jira/browse/LUCENE-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851008#action_12851008 ]

Michael McCandless commented on LUCENE-2354:
--------------------------------------------

bq. NumericUtils still contains lots of unused String-based methods, I think we should remove them

+1

Patch looks good! NumericFields are the first thing to index terms directly as byte[] (ie not first going through char[]) in flex. But the encoding is unchanged, right? (Ie only using 7 bits per byte, same as trunk.)

And you cut over to the BytesRef TermsEnum API too -- great. Presumably search perf would improve, but only a tiny bit, since NRQ visits so few terms?

> Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2354
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2354
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Flex Branch
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2354.patch
>
>
> After LUCENE-2302, we should use TermToBytesRefAttribute to index using NumericTokenStream. This also should convert the whole NumericUtils to use BytesRef when converting numerics.
[jira] Commented: (LUCENE-2354) Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[ https://issues.apache.org/jira/browse/LUCENE-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851010#action_12851010 ]

Uwe Schindler commented on LUCENE-2354:
---------------------------------------

bq. But the encoding is unchanged right? (Ie only using 7 bits per byte, same as trunk).

Yes. And I think we should keep it at 7 bits for now. Problems start when the sort order of terms is needed (which is the case for NRQ). As the default in flex is the UTF-8 term comparator, it would not sort correctly for numeric fields with full 8 bits?

bq. And you cutover to BytesRef TermsEnum API too - great. Presumably search perf would improve but only a tiny bit since NRQ visits so few terms?

I don't think you will notice a difference. A standard int range contains maybe 10 to 20 sub-ranges (at maximum), so converting between string and TermRef should not count. But the new implementation is cleaner.

In principle we could remove the whole char[]/String-based API in NumericUtils - I only have to rewrite the tests and remove the NumericUtils test in backwards (as it no longer applies then, too).

> Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[jira] Issue Comment Edited: (LUCENE-2354) Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[ https://issues.apache.org/jira/browse/LUCENE-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851010#action_12851010 ]

Uwe Schindler edited comment on LUCENE-2354 at 3/29/10 5:23 PM:
----------------------------------------------------------------

bq. But the encoding is unchanged right? (Ie only using 7 bits per byte, same as trunk).

Yes. And I think we should keep it at 7 bits for now. Problems start when the sort order of terms is needed (which is the case for NRQ). As the default in flex is the UTF-8 term comparator, it would not sort correctly for numeric fields with full 8 bits?

By the way, the recently added backwards test checks that an old index with NumericField behaves as before! This is why I added a new zip file to TestBackwardCompatibility.

bq. And you cutover to BytesRef TermsEnum API too - great. Presumably search perf would improve but only a tiny bit since NRQ visits so few terms?

I don't think you will notice a difference. A standard int range contains maybe 10 to 20 sub-ranges (at maximum), so converting between string and TermRef should not count. But the new implementation is cleaner.

In principle we could remove the whole char[]/String-based API in NumericUtils - I only have to rewrite the tests and remove the NumericUtils test in backwards (as it no longer applies then, too).

(The earlier revision of this comment differed only in lacking the "By the way" paragraph.)

> Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[jira] Commented: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index
[ https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851013#action_12851013 ] Grant Ingersoll commented on LUCENE-2184: - Note: this bug exists for the "min" case too, i.e. when the distance is too large. > CartesianPolyFilterBuilder doesn't properly account for which tiers actually > exist in the index > > > Key: LUCENE-2184 > URL: https://issues.apache.org/jira/browse/LUCENE-2184 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/spatial >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Grant Ingersoll > > In the CartesianShapeFilterBuilder, there is logic that determines the "best > fit" tier to create the Filter against. However, it does not account for > which fields actually exist in the index when doing so. For instance, if you > index tiers 1 through 10, but then choose a very small radius to restrict the > space to, it will likely choose a tier like 15 or 16, which of course does > not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
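To illustrate the shape of the fix under discussion - clamping the computed tier into the range of tiers that were actually indexed, which covers both the too-deep ("max") and too-shallow ("min") cases - here is a hypothetical sketch; the class and method names are illustrative and this is not the committed patch:

{code}
// Hypothetical sketch of bounding the "best fit" tier; not the actual patch.
public class TierBounds {
  private final int minIndexedTier; // shallowest tier in the index, e.g. 1
  private final int maxIndexedTier; // deepest tier in the index, e.g. 10

  public TierBounds(int minIndexedTier, int maxIndexedTier) {
    this.minIndexedTier = minIndexedTier;
    this.maxIndexedTier = maxIndexedTier;
  }

  /** Clamps the tier computed from the search radius into the indexed range:
   *  a tiny radius may compute tier 15 when only 1..10 exist (the "max" case),
   *  and a huge radius may compute a tier shallower than 1 (the "min" case). */
  public int bestFit(int computedTier) {
    return Math.max(minIndexedTier, Math.min(maxIndexedTier, computedTier));
  }
}
{code}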
[jira] Assigned: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index
[ https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned LUCENE-2184: --- Assignee: Grant Ingersoll > CartesianPolyFilterBuilder doesn't properly account for which tiers actually > exist in the index > > > Key: LUCENE-2184 > URL: https://issues.apache.org/jira/browse/LUCENE-2184 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/spatial >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll > > In the CartesianShapeFilterBuilder, there is logic that determines the "best > fit" tier to create the Filter against. However, it does not account for > which fields actually exist in the index when doing so. For instance, if you > index tiers 1 through 10, but then choose a very small radius to restrict the > space to, it will likely choose a tier like 15 or 16, which of course does > not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851017#action_12851017 ] Jason Rutherglen commented on LUCENE-2324: -- Michael B.: What you're talking about here: https://issues.apache.org/jira/browse/LUCENE-2324?focusedCommentId=12850792&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12850792 is a transaction log? I'm not sure we need that level of complexity just yet? How would we make the transaction log memory-efficient? Are there other uses you foresee? Maybe there's a simpler solution to the BufferedDeletes.Num-per-DW problem that could make use of global sequence ids? I'd prefer to continue to use the per term/query max doc id. There aren't performance issues with concurrently accessing and updating maps, so a global sync lock while the DW map values are updated should be OK? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2324.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index
[ https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-2184: Attachment: LUCENE-2184.patch Here's a patch. All tests still pass. > CartesianPolyFilterBuilder doesn't properly account for which tiers actually > exist in the index > > > Key: LUCENE-2184 > URL: https://issues.apache.org/jira/browse/LUCENE-2184 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/spatial >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll > Attachments: LUCENE-2184.patch > > > In the CartesianShapeFilterBuilder, there is logic that determines the "best > fit" tier to create the Filter against. However, it does not account for > which fields actually exist in the index when doing so. For instance, if you > index tiers 1 through 10, but then choose a very small radius to restrict the > space to, it will likely choose a tier like 15 or 16, which of course does > not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index
[ https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851043#action_12851043 ] Grant Ingersoll commented on LUCENE-2184: - Committed revision 928860 w/ the patch above plus some more javadocs. I'll leave open for a day or so in case anyone has quibbles about the names of things. > CartesianPolyFilterBuilder doesn't properly account for which tiers actually > exist in the index > > > Key: LUCENE-2184 > URL: https://issues.apache.org/jira/browse/LUCENE-2184 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/spatial >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll > Attachments: LUCENE-2184.patch > > > In the CartesianShapeFilterBuilder, there is logic that determines the "best > fit" tier to create the Filter against. However, it does not account for > which fields actually exist in the index when doing so. For instance, if you > index tiers 1 through 10, but then choose a very small radius to restrict the > space to, it will likely choose a tier like 15 or 16, which of course does > not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851078#action_12851078 ] Michael Busch commented on LUCENE-2324: --- {quote} I'm not sure we need that level of complexity just yet? How would we make the transaction log memory-efficient? {quote} Is that really so complex? You only need one additional int per doc in the DWPTs, and the global map for the delete terms. You don't need to buffer the actual terms per DWPT. I thought that was quite efficient? But I'm totally open to other ideas. I can try tonight to code a prototype of this - I don't think it would be very complex actually. But of course there might be complications I haven't thought of. bq. Are there other uses you foresee? Not really for the "transaction log" as you called it. I'd remove that log once we switch to deletes in the FG (when the RAM buffer is searchable). But a nice thing would be for add/update/delete to return the seqID, and also if the RAMReader in the future had an API to check up to which seqID it's able to "see". Then it's very clear to a user of the API where a given reader is at. For this to work we have to assign the seqID at the *end* of a call. E.g. when adding a large document, which takes a long time to process, it should get the seqID assigned after the "work" is done and right before the addDocument() call returns. > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2324.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
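As a rough sketch of the bookkeeping described above - one sequence id recorded per buffered doc in each DWPT, plus a global map from delete term to the sequence id at which the delete was issued - the following is purely illustrative: all names are hypothetical, it is not the LUCENE-2324 patch, and it uses a long where the discussion mentions an int, simply to sidestep wraparound in the sketch.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of global sequence ids; not the actual patch.
class SequenceIds {
  // Global, monotonically increasing id, assigned at the *end* of each
  // add/update/delete call as suggested above.
  private final AtomicLong nextSeqId = new AtomicLong();

  // Global map: delete term -> seq id at which the delete was issued.
  private final Map<String, Long> deletedTerms =
      new ConcurrentHashMap<String, Long>();

  interface Dwpt {
    void recordDocSeqId(long seqId); // the "one additional int per doc"
  }

  long onDocumentAdded(Dwpt dwpt) {
    long seqId = nextSeqId.incrementAndGet();
    dwpt.recordDocSeqId(seqId);
    return seqId;
  }

  long onDeleteTerm(String term) {
    long seqId = nextSeqId.incrementAndGet();
    deletedTerms.put(term, seqId);
    return seqId;
  }

  // At flush time, a buffered doc is deleted by a term iff the delete
  // was issued after the doc was added.
  boolean isDeleted(long docSeqId, String term) {
    Long delSeqId = deletedTerms.get(term);
    return delSeqId != null && delSeqId.longValue() > docSeqId;
  }
}
{code}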
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851099#action_12851099 ] Jason Rutherglen commented on LUCENE-2324: -- {quote}You only need one additional int per doc in the DWPTs, and the global map for the delete terms.{quote} Ok, let's give it a try; it'll be clearer with the prototype. To clarify: the apply-deletes doc-id cutoff will be the flushed doc count saved per term/query per DW, though it won't be stored directly - it'll be derived from the sequence-id int array, where the action has been encoded into the seq id int? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2324.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851142#action_12851142 ] Michael Busch commented on LUCENE-2324: --- {quote} The clarify, the apply deletes doc id up to will be the flushed doc count saved per term/query per DW, though it won't be saved, it'll be derived from the sequence id int array where the action has been encoded into the seq id int? {quote} Yeah, that's the idea. Let's see if it works :) > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2324.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2353: --- Attachment: LUCENE-2353.patch

Updated to also match 'c:/temp'-like paths, which are also accepted on Windows.

> Config incorrectly handles Windows absolute pathnames
> -----------------------------------------------------
>
>                 Key: LUCENE-2353
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2353
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 3.1
>
>         Attachments: LUCENE-2353.patch, LUCENE-2353.patch
>
>
> I have no idea how no one has run into this before, but I tried to execute an
> .alg file which used ReutersContentSource and referenced both docs.dir and
> work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the
> run reported an error of missing content under benchmark\work\something.
> I've traced the problem back to Config, where get(String, String) includes
> the following code:
> {code}
> if (sval.indexOf(":") < 0) {
>   return sval;
> }
> // first time this prop is extracted by round
> int k = sval.indexOf(":");
> String colName = sval.substring(0, k);
> sval = sval.substring(k + 1);
> ...
> {code}
> It detects ":" in the value and so thinks it's a per-round property, thus
> stripping "d:" from the value ... the fix is very simple:
> {code}
> if (sval.indexOf(":") < 0) {
>   return sval;
> } else if (sval.indexOf(":\\") >= 0) {
>   // this previously messed up absolute path names on Windows. Assuming
>   // there is no real value that starts with \\
>   return sval;
> }
> // first time this prop is extracted by round
> int k = sval.indexOf(":");
> String colName = sval.substring(0, k);
> sval = sval.substring(k + 1);
> {code}
> I'll post a patch w/ the above fix + test shortly.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
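As a sanity check on the behaviour the patch is after, a JUnit sketch along these lines could exercise both path styles. This is hypothetical, not the committed test: it assumes a Config(Properties) constructor alongside the get(String, String) accessor quoted above, and the imported Config is assumed to be the benchmark contrib's class.

{code}
import java.util.Properties;
import org.apache.lucene.benchmark.byTask.utils.Config;
import org.junit.Assert;
import org.junit.Test;

// Hypothetical test sketch for the Windows-path fix; not the committed test.
public class TestConfigWindowsPaths {
  @Test
  public void windowsAbsolutePathsSurviveConfigGet() throws Exception {
    Properties props = new Properties();
    props.setProperty("work.dir", "d:\\benchmark\\work"); // backslash form
    props.setProperty("docs.dir", "c:/temp/reuters");     // forward-slash form
    Config config = new Config(props);                    // assumed constructor

    // Before the fix, the drive-letter colon was mistaken for the
    // per-round property separator and the prefix was stripped.
    Assert.assertEquals("d:\\benchmark\\work", config.get("work.dir", null));
    Assert.assertEquals("c:/temp/reuters", config.get("docs.dir", null));
  }
}
{code}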