Re: Incremental Field Updates
On 29 Mar 2010, at 07:45, Earwin Burrfoot wrote:

>>>> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.
>>> Not sure I see that need. Can you explain your reasoning a bit more?
>> If you want to update a document you need a way of expressing *which* document you are updating.
> This already works somehow for 'deleting' documents?

Yes, the convention being user-supplied keys. The question posed is: if we add another use case where keys are required, do we want to turn this existing informal convention into more formalized support the way databases do, e.g. duplicate-key checks on insert, auto-increment primary key generators?
Re: Incremental Field Updates
>>>> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.
>>> Not sure I see that need. Can you explain your reasoning a bit more?
>>> If you want to update a document you need a way of expressing *which* document you are updating.
>> This already works somehow for 'deleting' documents?
> Yes, the convention being user-supplied keys.

I can delete by Lucene-generated docId. It's too volatile to be a database-style PK, but nonetheless.

> The question posed is: if we add another use case where keys are required, do we want to turn this existing informal convention into more formalized support the way databases do, e.g. duplicate-key checks on insert, auto-increment primary key generators?

If someone needs this, it can be built over Lucene, without introducing it as a core feature and needlessly complicating things.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Incremental Field Updates
> I can delete by Lucene-generated docId.

Which users used to have to find by first coding a primary-key-term search. Delete-by-term removed this step to make life easier.

> If someone needs this, it can be built over Lucene, without introducing it as a core feature and needlessly complicating things.

I think with any partial-update feature the *absence* of primary key support would "needlessly complicate things": if Lucene is not capable of performing duplicate detection on insert (because it has no notion of a primary key field), we need to be prepared for the situation where we have duplicate-key docs in the index.

What then happens when Grant wants to do a "partial update", as opposed to the existing full-update semantics which first delete all documents containing the supplied term (always a form of primary key)? Which document instance gets "partially updated"? We either:

a) throw a "duplicate" error (which ideally should have happened back at dup-insert time)
b) choose one of the documents to "partially update" and keep the duplicate(s)
c) choose one of the documents to "partially update" and delete the duplicate(s)
d) "partially update" all of the duplicate(s)

All less than ideal. I know we are schema-averse with Lucene (and I value that), but surely any partial-update feature has to start with a strongly maintained notion of document identity as a foundation? Rather than "needless complexity" I'd argue this is "needed rigour", and it actually simplifies the user's job if Lucene can do the duplicate-key-on-insert check automatically, rather than relying on ropy application code and dealing with any failures in that.

Of course primary keys are not mandatory. You only use them when you need this behaviour - just like in SQL.

Cheers
Mark
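As a rough illustration of what "built over Lucene" application code looks like today for variant a), here is a minimal sketch against the Lucene 3.0-era API. The UniqueKeyWriter class and the "id" key field are hypothetical, and note the remaining hole for docs buffered in the writer but not yet committed - which is exactly the rigour being argued for:

    import java.io.IOException;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    // Hypothetical app-side helper: enforce unique keys on insert, outside Lucene's core.
    public class UniqueKeyWriter {
        private final IndexWriter writer;
        private final String keyField; // e.g. "id" -- an app convention, not a Lucene concept

        public UniqueKeyWriter(IndexWriter writer, String keyField) {
            this.writer = writer;
            this.keyField = keyField;
        }

        /** Variant a): throw a "duplicate" error at insert time. */
        public synchronized void insert(Document doc, String key) throws IOException {
            IndexReader reader = IndexReader.open(writer.getDirectory(), true);
            try {
                // Only sees committed docs -- docs still buffered in the writer slip through.
                if (reader.docFreq(new Term(keyField, key)) > 0) {
                    throw new IllegalStateException("duplicate key: " + key);
                }
            } finally {
                reader.close();
            }
            writer.addDocument(doc);
        }
    }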
[jira] Commented: (LUCENE-2351) optimize automatonquery
[ https://issues.apache.org/jira/browse/LUCENE-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850844#action_12850844 ]

Michael McCandless commented on LUCENE-2351:
--------------------------------------------

OOOH I like this approach!! It makes the linear decision "local", and bounds (by linearUpperBound) the region, so that we don't have to revisit the decision on every term. And it enables efficiently using the suffix :)

And it's FAST! With this fix, the hard query (un*t) on flex is 105 QPS (best of 5, on a 5M-doc Wikipedia index), vs 62 QPS on trunk. Yay :)

> optimize automatonquery
> ------------------------
>
>                 Key: LUCENE-2351
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2351
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: Flex Branch
>            Reporter: Robert Muir
>            Priority: Minor
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2351.patch, LUCENE-2351.patch, LUCENE-2351_infinite.patch, LUCENE-2351_infinite.patch
>
>
> Mike found a few cases in flex where we have some bad behavior with automatonquery.
> The problem is similar to a database query planner, where sometimes simply doing a full table scan is faster than using an index.
> We can optimize automatonquery a little bit, and get better performance for fuzzy, wildcard, regex queries.
> Here is a list of ideas:
> * create commonSuffixRef for infinite automata, not just really-bad linear scan cases
> * do a null check rather than populating an empty commonSuffixRef
> * localize the 'linear' case to not seek, but instead scan, when ping-ponging against loops in the state machine
> * add a mechanism to enable/disable the terms dict cache, e.g. we can disable it for infinite cases, and maybe fuzzy N>1 also.
> * change the use of BitSet to OpenBitSet or long[] gen for path-tracking
> * optimize the backtracking code where it says /* String is good to go as-is */, this need not be a full run(), I think...
Re: Incremental Field Updates
>> If someone needs this, it can be built over Lucene, without introducing it as a core feature and needlessly complicating things.
>
> I think with any partial-update feature the *absence* of primary key support would "needlessly complicate things": if Lucene is not capable of performing duplicate detection on insert (because it has no notion of a primary key field), we need to be prepared for the situation where we have duplicate-key docs in the index.
>
> What then happens when Grant wants to do a "partial update", as opposed to the existing full-update semantics which first delete all documents containing the supplied term (always a form of primary key)? Which document instance gets "partially updated"? We either:
>
> a) throw a "duplicate" error (which ideally should have happened back at dup-insert time)
> b) choose one of the documents to "partially update" and keep the duplicate(s)
> c) choose one of the documents to "partially update" and delete the duplicate(s)
> d) "partially update" all of the duplicate(s)
>
> All less than ideal.

Variant d) sounds most logical? And enables all sorts of fun stuff.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850857#action_12850857 ]

Michael McCandless commented on LUCENE-2324:
--------------------------------------------

Yeah I think we're gonna need the global sequenceID in some form -- my Options 1 or 2 can't work, because the interleaving issue (as seen/required by the app) is a global thing.

> Per thread DocumentsWriters that write their own private segments
> ------------------------------------------------------------------
>
>                 Key: LUCENE-2324
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2324
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2324.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated approach, whereby IW has N fully independent RAM segments in-process, and when a doc needs to be indexed it's added to one of them. Each segment would also write its own doc stores, and "normal" segment merging (not the inefficient merge we now do on flush) would merge them. This should be a good simplification in the chain (eg maybe we can remove the *PerThread classes). The segments can flush independently, letting us make much better concurrent use of IO & CPU.
Re: Incremental Field Updates
>> Variant d) sounds most logical? And enables all sorts of fun stuff.

So the duplicate-key docs can have different values for initial-insert fields, but partial updates will cause sharing of a common field value? And subsequent same-key doc inserts do or don't share these previous "partial-update" values?

Sounds like a complex model for users to understand, let alone code support for. Everyone gets primary keys, though.
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850864#action_12850864 ]

Michael McCandless commented on LUCENE-2324:
--------------------------------------------

bq. Mike, can you explain what the advantages of this kind of thread affinity are? I was always wondering why the DocumentsWriter code currently makes efforts to assign a ThreadState always to the same Thread? Is that being done for performance reasons?

It's for performance. I expect there are apps where a given thread/pool indexes certain kinds of docs, ie, the app threads themselves have "affinity" for docs with similar term distributions. In which case, it's best (most RAM efficient) if those docs w/ presumably similar term stats are sent back to the same DW. If you mix different term stats into one buffer you get worse RAM efficiency.

Also, for better RAM efficiency you want *fewer* DWs... because we get more RAM efficiency the higher the freq of the terms... but of course you want more DWs for better CPU efficiency whenever that many threads are running at once. Net/net CPU efficiency should trump RAM efficiency, I think, so if there is a conflict we should favor CPU efficiency.

Though, thread affinity doesn't seem that CPU-costly to implement? Look up the DW your thread first used... if it's free, seize it. If it's not, fall back to any DW that's free.

> Per thread DocumentsWriters that write their own private segments
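A toy sketch of the affinity policy described above -- purely illustrative, with hypothetical class names (WriterPool, DocsWriter); the real per-thread DocumentsWriter design was still open on this issue:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.locks.ReentrantLock;

    // Hypothetical: prefer the DocumentsWriter this thread used last; if busy, take any free one.
    class WriterPool {
        private final DocsWriter[] writers;
        private final ConcurrentHashMap<Thread, DocsWriter> affinity =
                new ConcurrentHashMap<Thread, DocsWriter>();

        WriterPool(int n) {
            writers = new DocsWriter[n];
            for (int i = 0; i < n; i++) writers[i] = new DocsWriter();
        }

        DocsWriter acquire() {
            DocsWriter preferred = affinity.get(Thread.currentThread());
            if (preferred != null && preferred.lock.tryLock()) {
                return preferred; // affinity hit: similar term stats land in the same buffer
            }
            // Fall back to any free writer: CPU efficiency trumps RAM efficiency.
            // (Real code would block/park instead of spinning.)
            while (true) {
                for (DocsWriter w : writers) {
                    if (w.lock.tryLock()) {
                        affinity.put(Thread.currentThread(), w);
                        return w;
                    }
                }
            }
        }

        void release(DocsWriter w) { w.lock.unlock(); }

        static class DocsWriter {
            final ReentrantLock lock = new ReentrantLock();
            // buffered private-segment state would live here
        }
    }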
Re: Incremental Field Updates
>> Variant d) sounds most logical? And enables all sorts of fun stuff.
>
> So the duplicate-key docs can have different values for initial-insert fields, but partial updates will cause sharing of a common field value? And subsequent same-key doc inserts do or don't share these previous "partial-update" values?
>
> Sounds like a complex model for users to understand, let alone code support for. Everyone gets primary keys, though.

What you say IS complex. Sharing? Bleargh.

But everyone digs "update qweqwe set field=value where some_condition". Who ever said that some_condition should point to a unique document? It could, if you wish it so. Or you can do bulk updates if that's what you need. Very flexible, and no need to introduce any new concepts.

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
Re: Incremental Field Updates
I agree this is a long overdue feature... we need to get it into Lucene somehow.

I like the Layers analogy... I think that will work well with Lucene's transactional semantics, ie a prior commit point would continue to see the index before the updates, but new commit points would see the updates.

I think we would somehow want the new postings "layer" written to cleanly be merged under Docs/PositionsEnum? So that searching is unaffected -- ie the scorers just see a normal postings enum. FieldCache would also just populate normally. But somehow these partial docs would have to not "count" as real docIDs... and the normal merging of segments would coalesce these updates...

Also: how would we handle stored fields & term vectors?

Mike

On Sat, Mar 27, 2010 at 7:25 AM, Grant Ingersoll wrote:
> First off, this is something I've had in my head for a long time, but don't have any code.
>
> As many of you know, one of the main things that vexes any search engine based on an inverted index is how to do fast updates of just one field w/o having to delete and re-add the whole document like we do today. When I think about the whole update problem, I keep coming back to the notion of Photoshop (or any other real photo editing solution) Layers. In a photo editing solution, when you want to hide/change a piece of a photo, it is considered best practice to add a layer over that part of the photo to be changed. This way, the original photo is maintained and you don't have to worry about accidentally damaging the area you aren't interested in. Thus, a layer is essentially a mask on the original photo. The analogy isn't quite the same here, but nevertheless...
>
> So, thinking out loud here and I'm not sure on the best wording of this:
>
> When a document first comes in, it is all in one place, just as it is now. Then, when an update comes in on a particular field, we somehow mark in the index that the document in question is modified, and then we add the new change onto the end of the index (just like we currently do when adding new docs, but this time it's just a doc w/ a single field). Then, when searching, we would, when scoring the affected documents, go to a secondary process that knew where to look up the incremental changes. As background merging takes place, these "disjoint" documents would be merged back together. We'd maybe even consider a "high update" merge scheduler that could more frequently handle these incremental merges.
>
> I'm not sure where we would maintain the list of changes. That is, is it something that goes in the posting list, or is it a side structure? I think in the posting list would be too slow. Also, perhaps it is worthwhile for people to indicate that a particular field is expected to be updated while others maintain their current format, so as not to incur the penalty on each.
>
> In a sense, the old field for that document is masked by the new field. I think, given proper index structure, that we maybe could make that marking of the old field fast (maybe it's a pointer to the new field, maybe it's just a bit indicating to go look in the "update" segment).
>
> On the search side, I think performance would still be maintained b/c even in high-update envs. you aren't usually talking about more than a few thousand changes in a minute or two, and the background merger would be responsible for keeping the total number of disjoint documents low.
>
> I realize there isn't a whole lot to go on here just yet, but perhaps it will spawn some questions/ideas that will help us work it out in a better way.
>
> At any rate, I think adding incr. field update capability would be a huge win for Lucene.
>
> -Grant
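To make the layering idea concrete, a toy sketch of how a per-field "mask" might resolve reads, with in-memory maps standing in for index structures -- none of this exists in Lucene; it only illustrates the newest-layer-wins lookup that a background merge would later fold away:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical "layers" resolution: a field update masks the original value
    // until merging folds the layers back into one document.
    class LayeredFields {
        // base storage: docId -> (field -> value)
        private final Map<Integer, Map<String, String>> base =
                new HashMap<Integer, Map<String, String>>();
        // update layer, appended like a tiny one-field document
        private final Map<Integer, Map<String, String>> updates =
                new HashMap<Integer, Map<String, String>>();

        void update(int docId, String field, String value) {
            Map<String, String> layer = updates.get(docId);
            if (layer == null) {
                layer = new HashMap<String, String>();
                updates.put(docId, layer);
            }
            layer.put(field, value); // mask the old value; the base doc is untouched
        }

        String get(int docId, String field) {
            Map<String, String> layer = updates.get(docId);
            if (layer != null && layer.containsKey(field)) {
                return layer.get(field); // masked: read from the update layer
            }
            Map<String, String> doc = base.get(docId);
            return doc == null ? null : doc.get(field);
        }
    }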
Re: Baby steps towards making Lucene's scoring more flexible...
On Thu, Mar 25, 2010 at 1:20 PM, Marvin Humphrey wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>>>> Also, will Lucy store the original stats?
>>>
>>> These?
>>>
>>> * Total number of tokens in the field.
>>> * Number of unique terms in the field.
>>> * Doc boost.
>>> * Field boost.
>>
>> Also sum(tf). Robert can generate more :)
>
> Hmm, aren't "Total number of tokens in the field" and sum(tf) normally equivalent? I guess there might be analyzers for which that isn't true, e.g. those which perform synonym-injection?
>
> In any case, "sum(tf)" is probably a better definition, because it makes no ancillary claims...

Sorry, yes they are.

>>> Incidentally, what are you planning to do about field boost if it's not always 1.0? Are you going to store full 32-bit floats?
>>
>> For starters, yes.
>
> OK, how are those going to be encoded? IEEE 754? Big-endian?
>
> http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness

For starters, I think so. Lucene's ints are big-endian today.

>> We may (later) want to make a new attr that sets the #bits (levels/precision) you want... then uses packed ints to encode.
>
> I'm concerned that the bit-wise entropy of floats may make them a poor match for compression via packed ints. We'll probably get a compressed representation which is larger than the original.
>
> Are there any standard algorithms out there for compressing IEEE 754 floats? RLE works, but only with certain data patterns.
>
> ... [ time passes ] ...
>
> Hmm, maybe not:
>
> http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

Sorry, I was proposing a fixed-point boost, where you specify how many levels (in bits, powers of 2) you want.

>> I was specifically asking if Lucy will allow the user to force true average to be recomputed, ie, at commit time from the writer.
>
> That's theoretically possible. We'd have to implement the reader the same way we have DeletionsReader -- the most recent segment may contain data which applies to older segments.

OK.

> Here's the DeletionsReader code, which searches backwards through the segments looking for a particular file:
>
>     /* Start with deletions files in the most recently added segments and work
>      * backwards. The first one we find which addresses our segment is the
>      * one we need. */
>     for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
>         Segment *other_seg = (Segment*)VA_Fetch(segments, i);
>         Hash *metadata
>             = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
>         if (metadata) {
>             Hash *files = (Hash*)CERTIFY(
>                 Hash_Fetch_Str(metadata, "files", 5), HASH);
>             Hash *seg_files_data
>                 = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
>             if (seg_files_data) {
>                 Obj *count = (Obj*)CERTIFY(
>                     Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
>                 del_count = (i32_t)Obj_To_I64(count);
>                 del_file = (CharBuf*)CERTIFY(
>                     Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
>                 break;
>             }
>         }
>     }

Hmm -- similar to tombstones? But different in that the most recently written file has *all* deletions for that old segment? Ie you don't have to OR together N generations of written deletions... only 1 file has all current deletions for the segment?

This is somewhat wasteful of disk space though? Hmm, unless your deletion policy can reclaim the now-stale deletions files from past flushed segments?

> What we'd do is write the regenerated boost bytes for *all* segments to the most recent segment. It would be roughly analogous to building up an NRT reader.

Right, except Lucy must go through the filesystem.

>>> What's trickier is that Schemas are not normally mutable, and that they are part of the index. You don't have to supply an Analyzer, or a Similarity, or anything else when opening a Searcher -- you just provide the location of the index, and the Schema gets deserialized from the latest schema_NNN.json file. That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much a thing of the past for us.
>>
>> That's nice... though... is it too rigid? Do users even want to pick a different analyzer at search time?
>
> It's not common.
>
> To my mind, the way a field is tokenized is part of its field definition, thus the Analyzer is part of the field definition, thus the analyzer is part of the schema and needs to be stored with the index.

OK.

> Still, we support different Analyzers at search time by way of QueryParser. QueryParser's constructor requires a Schema, but also accepts an optional Analyzer which, if supplied, will be used instead of the Analyzers from the Schema.

Ahh OK there's an out.

> Maybe aggressive automatic data-reduction makes more sense in the context of "flexible matching", which is more expansive than "flexible scoring"?

I think so. Maybe it shouldn't be called a Similarity (which to me (though, carrying a heavy curse of knowledge burden...) means "scoring")? Matcher?
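As an aside, a small sketch of the fixed-point boost idea mentioned above -- quantize a float boost onto 2^bits levels so it packs well with packed ints. Illustrative only; the [min, max] range and bit count are made up, not anything Lucene or Lucy defines:

    // Illustrative fixed-point boost quantization: map a boost in [min, max]
    // onto 2^bits levels so the codes compress well with packed ints.
    public class FixedPointBoost {
        public static int encode(float boost, float min, float max, int bits) {
            int levels = 1 << bits;
            float clamped = Math.max(min, Math.min(max, boost));
            return Math.round((clamped - min) / (max - min) * (levels - 1));
        }

        public static float decode(int code, float min, float max, int bits) {
            int levels = 1 << bits;
            return min + code * (max - min) / (levels - 1);
        }

        public static void main(String[] args) {
            int code = encode(1.3f, 0f, 4f, 5);          // 5 bits -> 32 levels -> code 10
            System.out.println(code);
            System.out.println(decode(code, 0f, 4f, 5)); // ~1.29, close to 1.3
        }
    }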
Re: Baby steps towards making Lucene's scoring more flexible...
I think that's a good idea for Lucy.

Mike

On Fri, Mar 26, 2010 at 10:58 AM, Marvin Humphrey wrote:
> On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
>>>> Maybe aggressive automatic data-reduction makes more sense in the context of "flexible matching", which is more expansive than "flexible scoring"?
>>>
>>> I think so. Maybe it shouldn't be called a Similarity (which to me (though, carrying a heavy curse of knowledge burden...) means "scoring")? Matcher?
>
> I think we can express the difference between your proposed approach for Lucene Similarity (no effect on index) and my proposed approach for Lucy Similarity (aggressive index-time data reduction) by putting Lucy's Similarity under Lucy::Index instead of Lucy::Search.
>
> Marvin Humphrey
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850884#action_12850884 ]

Michael McCandless commented on LUCENE-2329:
--------------------------------------------

I think we need to fix how RAM is managed for this... right now, if you turn on IW's infoStream you'll see a zillion prints where IW tries to balance RAM (it "runs hot"), but nothing can be freed. We do this per-doc, after the parallel arrays resize themselves to, net/net, exceed our allowed RAM buffer. A few ideas on how we can fix this:

* I think we have to change when we flush. It's now based on RAM used (not alloc'd), but I think we should switch it to use RAM alloc'd after we've freed all we can. Ie if we free things up and we've still alloc'd over the limit, we flush. This'll fix the running hot we now see...

* TermsHash.freeRAM is now a no-op, right? We have to fix this to actually free something when it can, because you can imagine indexing docs that are postings-heavy but then switching to docs that are byte[]-block-heavy. On that switch you have to balance the allocations (ie, shrink your postings). I think we should walk the threads/fields and use ArrayUtil.shrink to shrink down, but don't shrink by much at a time (to avoid running hot) -- IW will invoke this method again if more shrinkage is needed.

* Also, shouldn't we use ArrayUtil.grow to increase, instead of always a 50% growth? Because with such a large growth you can easily have horrible RAM efficiency... ie that 50% growth can suddenly put you over the limit and then you flush, having effectively used only half of the allowed RAM buffer in the worst case.

> Use parallel arrays instead of PostingList objects
> --------------------------------------------------
>
>                 Key: LUCENE-2329
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2329
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>            Reporter: Michael Busch
>            Assignee: Michael Busch
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: lucene-2329.patch, lucene-2329.patch, lucene-2329.patch
>
>
> This is Mike's idea that was discussed in LUCENE-2293 and LUCENE-2324.
> In order to avoid having very many long-living PostingList objects in TermsHashPerField, we want to switch to parallel arrays. The termsHash will simply be an int[] which maps each term to dense termIDs.
> All data that the PostingList classes currently hold will then be placed in parallel arrays, where the termID is the index into the arrays. This will avoid the need for object pooling, and will remove the overhead of object initialization and garbage collection. Especially garbage collection should benefit significantly when the JVM runs low on memory, because in such a situation the gc mark times can get very long if there is a big number of long-living objects in memory.
> Another benefit could be to build more efficient TermVectors. We could avoid the need of having to store the term string per document in the TermVector. Instead we could just store the segment-wide termIDs. This would reduce the size and also make it easier to implement efficient algorithms that use TermVectors, because no term mapping across documents in a segment would be necessary. Though this improvement we can make with a separate jira issue.
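To illustrate the worst case described in the last bullet, a toy calculation -- not Lucene code; the real ArrayUtil growth policy differs and the numbers here are made up -- showing how a large growth factor can strand RAM well below the flush threshold:

    // Illustrative only: how much of a RAM budget does a growth policy actually
    // use before the next growth step would trip the flush threshold?
    public class GrowthDemo {
        static long fill(long budgetBytes, double growthFactor) {
            long allocated = 16 * 4; // start: a 16-entry int[] (4 bytes per entry)
            while (true) {
                long next = (long) (allocated * growthFactor);
                if (next > budgetBytes) return allocated; // would overshoot -> flush now
                allocated = next;
            }
        }

        public static void main(String[] args) {
            long budget = 64 * 1024 * 1024; // a 64 MB buffer
            System.out.println("1.5x   growth uses " + fill(budget, 1.5) + " bytes");
            System.out.println("1.125x growth uses " + fill(budget, 1.125) + " bytes");
            // The gentler step lands much closer to the 64 MB limit, ie better
            // worst-case RAM efficiency, at the cost of more array copies.
        }
    }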
[jira] Reopened: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Michael McCandless reopened LUCENE-2329:
----------------------------------------

Reopening to fix the RAM balancing problems...

> Use parallel arrays instead of PostingList objects
[jira] Created: (LUCENE-2356) Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
----------------------------------------------------------------------------------------------

                 Key: LUCENE-2356
                 URL: https://issues.apache.org/jira/browse/LUCENE-2356
             Project: Lucene - Java
          Issue Type: Bug
            Reporter: Michael McCandless
             Fix For: 3.1


Opening a placeholder issue... if all the refactoring being discussed doesn't make this possible, then we should add a setting to IWC to do so.

Apps with very large numbers of unique terms must set the terms index divisor to control RAM usage. (NOTE: flex's RAM terms dict index RAM usage is more efficient, so this will help such apps.)

But when IW resolves deletes internally, it always uses the default terms index divisor of 1, and the app cannot change that. Though one workaround is to call getReader(termInfosIndexDivisor), which will pool the reader with the right divisor.
[jira] Created: (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
-----------------------------------------------------------------------------------------

                 Key: LUCENE-2357
                 URL: https://issues.apache.org/jira/browse/LUCENE-2357
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Index
            Reporter: Michael McCandless
            Priority: Minor
             Fix For: 3.1


We allocate this int[] to remap docIDs due to compaction of deleted ones. This uses a lot of RAM for large segment merges, and can fail to allocate due to fragmentation on 32-bit JREs.

Now that we have packed ints, a simple fix would be to use a packed int array... and maybe instead of storing the abs docID in the mapping, we could store the number of del docs seen so far (so the remap would do a lookup then a subtract). This may add some CPU cost to merging, but should bring down transient RAM usage quite a bit.
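A small sketch of the proposed scheme -- illustrative only; a plain int[] stands in for the packed-ints structure, which would store these small, slowly growing counts in far fewer bits:

    import java.util.BitSet;

    // Illustrative docID remapping: instead of storing the new absolute docID
    // for every old docID, store how many deleted docs precede it; the remap
    // is then a lookup plus a subtract.
    public class DocIdRemap {
        private final int[] delCounts; // delCounts[d] = number of deletions in [0, d)
        private final BitSet deleted;

        public DocIdRemap(BitSet deleted, int maxDoc) {
            this.deleted = deleted;
            delCounts = new int[maxDoc];
            int seen = 0;
            for (int d = 0; d < maxDoc; d++) {
                delCounts[d] = seen;
                if (deleted.get(d)) seen++;
            }
        }

        /** New docID after compaction, or -1 if the doc itself was deleted. */
        public int remap(int oldDocId) {
            if (deleted.get(oldDocId)) return -1;
            return oldDocId - delCounts[oldDocId];
        }
    }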
[jira] Commented: (LUCENE-2356) Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
[ https://issues.apache.org/jira/browse/LUCENE-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850887#action_12850887 ]

Michael McCandless commented on LUCENE-2356:
--------------------------------------------

I won't have any time to take this any time soon :) So if anyone has the itch, jump!

> Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
Re: Incremental Field Updates
>> Who ever said that some_condition should point to a unique document?

My assumption was, for now, we were still talking about the simpler case of updating a single document. If we extend the discussion to support set-based updates, it's worth considering the common requirements for updating sets:

a) update values can be non-constants, such as "reduce price of all products in ski-wear dept by 10%".
b) the criteria to define the set can be most usefully expressed as a query rather than mandating a single term, e.g. "set published:false on all docs in last week's date range".

That feels like too much functionality to consider adding right now, but I can see a much more basic solution is possible which supports single and simple set-based updates.
Re: Incremental Field Updates
>> Who ever said that some_condition should point to a unique document?
>
> My assumption was, for now, we were still talking about the simpler case of updating a single document. If we extend the discussion to support set-based updates, it's worth considering the common requirements for updating sets: ...

I must be missing something :)

a) We're not a freaking database, why the constant attempts to compare ourselves to one / mimic some of its functionality?
b) The criteria to define the set of deleted documents can already be expressed as a query - IndexWriter.deleteDocuments(query).

So what I am offering is to preserve the way to point at the docs we want to see deleted, and allow partial modifications on them. Thus we add new and exciting functionality, while introducing zero new concepts. Profit?

-- 
Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com)
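A sketch of what that would look like next to today's API: deleteDocuments(Query) exists in Lucene; the updateDocuments call is hypothetical, the shape of the proposal rather than anything implemented. The index setup is only there to make the example self-contained:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class UpdateByQuerySketch {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(new RAMDirectory(),
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            // Existing API: delete every doc matching a query, single or bulk.
            Query lastWeek = new TermQuery(new Term("week", "2010-12"));
            writer.deleteDocuments(lastWeek);

            // Hypothetical API in the spirit of this thread (NOT in Lucene):
            // "update ... set published=false where week:2010-12"
            // writer.updateDocuments(lastWeek, changedFields);

            writer.close();
        }
    }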
Re: Incremental Field Updates
On Mar 29, 2010, at 2:26 AM, Mark Harwood wrote:

>>> Of course introducing the idea of updates also introduces the notion of a primary key and there's probably an entirely separate discussion to be had around user-supplied vs Lucene-generated keys.
>>
>> Not sure I see that need. Can you explain your reasoning a bit more?
>
> If you want to update a document you need a way of expressing *which* document you are updating.

Of course, but what about the Lucene doc id doesn't provide that?
Re: Incremental Field Updates
On 2010-03-29 12:26, Michael McCandless wrote:
> I agree this is a long overdue feature... we need to get it into Lucene somehow.
>
> I like the Layers analogy... I think that will work well with Lucene's transactional semantics, ie a prior commit point would continue to see the index before the updates, but new commit points would see the updates.

I'm coming late to this discussion ... are you guys familiar with this paper? It seems to describe the same model of incremental field-level updates, and the algo operates on internal Lucene ids:

http://portal.acm.org/citation.cfm?id=1458171

-- 
Best regards,
Andrzej Bialecki
http://www.sigram.com  Contact: info at sigram dot com
AW: Incremental Field Updates
They filed this as a patent, too: http://www.freepatentsonline.com/y2009/0228528.html

Regards
Uwe Goetzke

-----Original Message-----
From: Andrzej Bialecki [mailto:a...@getopt.org]
Sent: Monday, 29 March 2010 14:50
To: java-dev@lucene.apache.org
Subject: Re: Incremental Field Updates

> I'm coming late to this discussion ... are you guys familiar with this paper? It seems to describe the same model of incremental field-level updates, and the algo operates on internal Lucene ids:
>
> http://portal.acm.org/citation.cfm?id=1458171
Re: AW: Incremental Field Updates
On 2010-03-29 15:11, Uwe Goetzke wrote:
> They filed this as a patent, too: http://www.freepatentsonline.com/y2009/0228528.html

.. which is not granted yet, right? It's a patent application. Besides, I live in the EU ;)

-- 
Best regards,
Andrzej Bialecki
http://www.sigram.com  Contact: info at sigram dot com
[jira] Commented: (LUCENE-2357) Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
[ https://issues.apache.org/jira/browse/LUCENE-2357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850914#action_12850914 ]

Michael McCandless commented on LUCENE-2357:
--------------------------------------------

I won't have any time to take this any time soon :) So if anyone has the itch, jump!

> Reduce transient RAM usage while merging by using packed ints array for docID re-mapping
[jira] Commented: (LUCENE-2356) Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
[ https://issues.apache.org/jira/browse/LUCENE-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850916#action_12850916 ]

Michael McCandless commented on LUCENE-2356:
--------------------------------------------

The above comment was on the wrong issue :)

We should only do this issue if the ongoing ideas about refactoring IW/IR don't make controlling the terms index divisor possible for readers opened by IW.

> Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
Re: Incremental Field Updates
> Of course, but what about the Lucene doc id doesn't provide that?

The question being how you determine the correct doc id to use in the first place (especially when doc ids are known to be volatile) - the current answer is to use a stable identifier term which your app holds in the index, AKA a primary key.

To support single-doc updates, app developers currently have to:
a) allocate keys uniquely
b) ensure they do not store >1 document with the same key.

My suggestion was that, these being fundamental requirements for supporting updates, Lucene could, as a convenience, provide some support for this in its API - in the same way a database typically does.

Earwin has perhaps extended your (and my) original thinking to incorporate set-based updates (a single set of values applied to many documents which match a query). His proposal (correct me if I'm wrong, Earwin) is that single and set-based changes could both be supported by a single IndexWriter.updateDocuments(query, changedFields) type method. The benefit of this scheme is that we are providing a simple method, re-using established concepts (Queries for document selection), but this does not change the fact that many users will still need to use primary keys for single-doc updates, and they have to assume responsibility for a) and b) above.

On reflection, I guess these responsibilities are not too tough. a) is catered for by the fact that Lucene is not typically the master data store (yet!), and the filesystem/webserver/database datasources where document content is sourced usually have the responsibility to allocate some form of unique identifier in the form of URLs, database keys or filenames which can be used. Also, b) is not too hard to handle in app code if you always use the IndexWriter.updateDocument(term, doc) method for inserts.

Cheers,
Mark
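For concreteness, the insert idiom Mark describes, using the real IndexWriter.updateDocument(Term, Document) API; the "id" field name and the RAMDirectory setup are just for the example:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.store.RAMDirectory;
    import org.apache.lucene.util.Version;

    public class UpsertExample {
        public static void main(String[] args) throws Exception {
            IndexWriter writer = new IndexWriter(new RAMDirectory(),
                    new StandardAnalyzer(Version.LUCENE_30),
                    IndexWriter.MaxFieldLength.UNLIMITED);

            Document doc = new Document();
            // App-supplied primary key, e.g. a URL, filename, or database key:
            doc.add(new Field("id", "doc-42", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("body", "some content", Field.Store.NO, Field.Index.ANALYZED));

            // Used for *all* inserts, this keeps at most one live doc per key:
            // any existing doc(s) with id:doc-42 are deleted, then this one is added.
            writer.updateDocument(new Term("id", "doc-42"), doc);

            writer.close();
        }
    }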
[jira] Commented: (LUCENE-2356) Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
[ https://issues.apache.org/jira/browse/LUCENE-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850928#action_12850928 ]

Earwin Burrfoot commented on LUCENE-2356:
-----------------------------------------

That's likely orthogonal.

If you want all IW readers to have the same divisor - shove it into IWC and it's all done. If you want to use different divisors when returning an SR as part of an NRT reader and when using it inside (say, for deletions) - okay, you'll have the ability to do that at the cost of a partial SR reload - shove two settings into IWC and it's done.

> Enable setting the terms index divisor used by IndexWriter whenever it opens internal readers
[jira] Commented: (LUCENE-2329) Use parallel arrays instead of PostingList objects
[ https://issues.apache.org/jira/browse/LUCENE-2329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850989#action_12850989 ]

Michael Busch commented on LUCENE-2329:
---------------------------------------

Good catch! Thanks for the thorough explanation and suggestions. I think it all makes sense. Will work on a patch.

> Use parallel arrays instead of PostingList objects
[jira] Commented: (LUCENE-2354) Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[ https://issues.apache.org/jira/browse/LUCENE-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851008#action_12851008 ]

Michael McCandless commented on LUCENE-2354:
--------------------------------------------

bq. NumericUtils still contains lots of unused String-based methods, I think we should remove them

+1

Patch looks good! NumericFields are the first thing to index terms directly as byte[] (ie not first going through char[]) in flex. But the encoding is unchanged, right? (Ie only using 7 bits per byte, same as trunk.)

And you cut over to the BytesRef TermsEnum API too -- great. Presumably search perf would improve, but only a tiny bit, since NRQ visits so few terms?

> Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
> --------------------------------------------------------------------------------------
>
>                 Key: LUCENE-2354
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2354
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: Flex Branch
>            Reporter: Uwe Schindler
>            Assignee: Uwe Schindler
>             Fix For: Flex Branch
>
>         Attachments: LUCENE-2354.patch
>
>
> After LUCENE-2302, we should use TermToBytesRefAttribute to index using NumericTokenStream. This also should convert the whole NumericUtils to use BytesRef when converting numerics.
[jira] Commented: (LUCENE-2354) Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[ https://issues.apache.org/jira/browse/LUCENE-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851010#action_12851010 ]

Uwe Schindler commented on LUCENE-2354:
---------------------------------------

bq. But the encoding is unchanged right? (Ie only using 7 bits per byte, same as trunk).

Yes. And I think we should keep it at 7 bits for now. Problems start when the sort order of terms is needed (which is the case for NRQ). As the default in flex is the UTF-8 term comparator, it would not sort correctly for numeric fields with full 8 bits?

bq. And you cutover to BytesRef TermsEnum API too - great. Presumably search perf would improve but only a tiny bit since NRQ visits so few terms?

I don't think you will notice a difference. A standard int range contains maybe 10 to 20 sub-ranges (at maximum), so converting between string and TermRef should not count. But the new implementation is cleaner.

In principle we could remove the whole char[]/String-based API in NumericUtils - I only have to rewrite the tests and remove the NumericUtils test in backwards (as it no longer applies then, too).

> Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[jira] Issue Comment Edited: (LUCENE-2354) Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[ https://issues.apache.org/jira/browse/LUCENE-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851010#action_12851010 ]

Uwe Schindler edited comment on LUCENE-2354 at 3/29/10 5:23 PM:
----------------------------------------------------------------

bq. But the encoding is unchanged right? (Ie only using 7 bits per byte, same as trunk).

Yes. And I think we should keep it at 7 bits for now. Problems start when the sort order of terms is needed (which is the case for NRQ). As the default in flex is the UTF-8 term comparator, it would not sort correctly for numeric fields with full 8 bits?

By the way, the recently added backwards test checks that an old index with NumericField behaves as before! This is why I added a new zip file to TestBackwardCompatibility.

bq. And you cutover to BytesRef TermsEnum API too - great. Presumably search perf would improve but only a tiny bit since NRQ visits so few terms?

I don't think you will notice a difference. A standard int range contains maybe 10 to 20 sub-ranges (at maximum), so converting between string and TermRef should not count. But the new implementation is cleaner.

In principle we could remove the whole char[]/String-based API in NumericUtils - I only have to rewrite the tests and remove the NumericUtils test in backwards (as it no longer applies then, too).

(The earlier revision of this comment differed only in lacking the "By the way" paragraph.)

> Convert NumericUtils and NumericTokenStream to use BytesRef instead of Strings/char[]
[jira] Commented: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index
[ https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851013#action_12851013 ] Grant Ingersoll commented on LUCENE-2184: - Note: this bug exists for the "min" case too, i.e. when the distance is too large. > CartesianPolyFilterBuilder doesn't properly account for which tiers actually > exist in the index > > > Key: LUCENE-2184 > URL: https://issues.apache.org/jira/browse/LUCENE-2184 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/spatial >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Grant Ingersoll > > In the CartesianShapeFilterBuilder, there is logic that determines the "best > fit" tier to create the Filter against. However, it does not account for > which fields actually exist in the index when doing so. For instance, if you > index tiers 1 through 10, but then choose a very small radius to restrict the > space to, it will likely choose a tier like 15 or 16, which of course does > not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
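To illustrate the shape of the fix under discussion - clamping the computed tier into the range of tiers that were actually indexed, which covers both the too-deep ("max") and too-shallow ("min") cases - here is a hypothetical sketch; the class and method names are illustrative and this is not the committed patch:

{code}
// Hypothetical sketch of bounding the "best fit" tier; not the actual patch.
public class TierBounds {
  private final int minIndexedTier; // shallowest tier in the index, e.g. 1
  private final int maxIndexedTier; // deepest tier in the index, e.g. 10

  public TierBounds(int minIndexedTier, int maxIndexedTier) {
    this.minIndexedTier = minIndexedTier;
    this.maxIndexedTier = maxIndexedTier;
  }

  /** Clamps the tier computed from the search radius into the indexed range:
   *  a tiny radius may compute tier 15 when only 1..10 exist (the "max" case),
   *  and a huge radius may compute a tier shallower than 1 (the "min" case). */
  public int bestFit(int computedTier) {
    return Math.max(minIndexedTier, Math.min(maxIndexedTier, computedTier));
  }
}
{code}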
[jira] Assigned: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index
[ https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll reassigned LUCENE-2184: --- Assignee: Grant Ingersoll > CartesianPolyFilterBuilder doesn't properly account for which tiers actually > exist in the index > > > Key: LUCENE-2184 > URL: https://issues.apache.org/jira/browse/LUCENE-2184 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/spatial >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll > > In the CartesianShapeFilterBuilder, there is logic that determines the "best > fit" tier to create the Filter against. However, it does not account for > which fields actually exist in the index when doing so. For instance, if you > index tiers 1 through 10, but then choose a very small radius to restrict the > space to, it will likely choose a tier like 15 or 16, which of course does > not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851017#action_12851017 ] Jason Rutherglen commented on LUCENE-2324: -- Michael B.: What you're talking about here: https://issues.apache.org/jira/browse/LUCENE-2324?focusedCommentId=12850792&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12850792 is a transaction log? I'm not sure we need that level of complexity just yet? How would we make the transaction log memory-efficient? Are there other uses you foresee? Maybe there's a simpler solution to the BufferedDeletes.Num-per-DW problem that could make use of global sequence ids? I'd prefer to continue to use the per term/query max doc id. There aren't performance issues with concurrently accessing and updating maps, so a global sync lock while the DW map values are updated should be OK? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2324.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index
[ https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grant Ingersoll updated LUCENE-2184: Attachment: LUCENE-2184.patch Here's a patch. All tests still pass. > CartesianPolyFilterBuilder doesn't properly account for which tiers actually > exist in the index > > > Key: LUCENE-2184 > URL: https://issues.apache.org/jira/browse/LUCENE-2184 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/spatial >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll > Attachments: LUCENE-2184.patch > > > In the CartesianShapeFilterBuilder, there is logic that determines the "best > fit" tier to create the Filter against. However, it does not account for > which fields actually exist in the index when doing so. For instance, if you > index tiers 1 through 10, but then choose a very small radius to restrict the > space to, it will likely choose a tier like 15 or 16, which of course does > not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2184) CartesianPolyFilterBuilder doesn't properly account for which tiers actually exist in the index
[ https://issues.apache.org/jira/browse/LUCENE-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851043#action_12851043 ] Grant Ingersoll commented on LUCENE-2184: - Committed revision 928860 w/ the patch above plus some more javadocs. I'll leave open for a day or so in case anyone has quibbles about the names of things. > CartesianPolyFilterBuilder doesn't properly account for which tiers actually > exist in the index > > > Key: LUCENE-2184 > URL: https://issues.apache.org/jira/browse/LUCENE-2184 > Project: Lucene - Java > Issue Type: Bug > Components: contrib/spatial >Affects Versions: 2.9, 2.9.1, 3.0 >Reporter: Grant Ingersoll >Assignee: Grant Ingersoll > Attachments: LUCENE-2184.patch > > > In the CartesianShapeFilterBuilder, there is logic that determines the "best > fit" tier to create the Filter against. However, it does not account for > which fields actually exist in the index when doing so. For instance, if you > index tiers 1 through 10, but then choose a very small radius to restrict the > space to, it will likely choose a tier like 15 or 16, which of course does > not exist. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851078#action_12851078 ] Michael Busch commented on LUCENE-2324: --- {quote} I'm not sure we need that level of complexity just yet? How would we make the transaction log memory-efficient? {quote} Is that really so complex? You only need one additional int per doc in the DWPTs, and the global map for the delete terms. You don't need to buffer the actual terms per DWPT. I thought that was quite efficient? But I'm totally open to other ideas. I can try tonight to code a prototype of this - I don't think it would be very complex actually. But of course there might be complications I haven't thought of. bq. Are there other uses you foresee? Not really for the "transaction log" as you called it. I'd remove that log once we switch to deletes in the FG (when the RAM buffer is searchable). But a nice thing would be for add/update/delete to return the seqID, and also if the RAMReader in the future had an API to check up to which seqID it's able to "see". Then it's very clear to a user of the API where a given reader is at. For this to work we have to assign the seqID at the *end* of a call. E.g. when adding a large document, which takes a long time to process, it should get the seqID assigned after the "work" is done and right before the addDocument() call returns. > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2324.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
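As a rough sketch of the bookkeeping described above - one sequence id recorded per buffered doc in each DWPT, plus a global map from delete term to the sequence id at which the delete was issued - the following is purely illustrative: all names are hypothetical, it is not the LUCENE-2324 patch, and it uses a long where the discussion mentions an int, simply to sidestep wraparound in the sketch.

{code}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of global sequence ids; not the actual patch.
class SequenceIds {
  // Global, monotonically increasing id, assigned at the *end* of each
  // add/update/delete call as suggested above.
  private final AtomicLong nextSeqId = new AtomicLong();

  // Global map: delete term -> seq id at which the delete was issued.
  private final Map<String, Long> deletedTerms =
      new ConcurrentHashMap<String, Long>();

  interface Dwpt {
    void recordDocSeqId(long seqId); // the "one additional int per doc"
  }

  long onDocumentAdded(Dwpt dwpt) {
    long seqId = nextSeqId.incrementAndGet();
    dwpt.recordDocSeqId(seqId);
    return seqId;
  }

  long onDeleteTerm(String term) {
    long seqId = nextSeqId.incrementAndGet();
    deletedTerms.put(term, seqId);
    return seqId;
  }

  // At flush time, a buffered doc is deleted by a term iff the delete
  // was issued after the doc was added.
  boolean isDeleted(long docSeqId, String term) {
    Long delSeqId = deletedTerms.get(term);
    return delSeqId != null && delSeqId.longValue() > docSeqId;
  }
}
{code}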
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851099#action_12851099 ] Jason Rutherglen commented on LUCENE-2324: -- {quote}You only need one additional int per doc in the DWPTs, and the global map for the delete terms.{quote} Ok, let's give it a try; it'll be clearer with the prototype. To clarify: the apply-deletes doc-id cutoff will be the flushed doc count saved per term/query per DW, though it won't be stored directly - it'll be derived from the sequence-id int array, where the action has been encoded into the seq id int? > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2324.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments
[ https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851142#action_12851142 ] Michael Busch commented on LUCENE-2324: --- {quote} The clarify, the apply deletes doc id up to will be the flushed doc count saved per term/query per DW, though it won't be saved, it'll be derived from the sequence id int array where the action has been encoded into the seq id int? {quote} Yeah, that's the idea. Let's see if it works :) > Per thread DocumentsWriters that write their own private segments > - > > Key: LUCENE-2324 > URL: https://issues.apache.org/jira/browse/LUCENE-2324 > Project: Lucene - Java > Issue Type: Improvement > Components: Index >Reporter: Michael Busch >Assignee: Michael Busch >Priority: Minor > Fix For: 3.1 > > Attachments: LUCENE-2324.patch > > > See LUCENE-2293 for motivation and more details. > I'm copying here Mike's summary he posted on 2293: > Change the approach for how we buffer in RAM to a more isolated > approach, whereby IW has N fully independent RAM segments > in-process and when a doc needs to be indexed it's added to one of > them. Each segment would also write its own doc stores and > "normal" segment merging (not the inefficient merge we now do on > flush) would merge them. This should be a good simplification in > the chain (eg maybe we can remove the *PerThread classes). The > segments can flush independently, letting us make much better > concurrent use of IO & CPU. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-2353) Config incorrectly handles Windows absolute pathnames
[ https://issues.apache.org/jira/browse/LUCENE-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-2353: --- Attachment: LUCENE-2353.patch

Updated to also match 'c:/temp'-like paths, which are also accepted on Windows.

> Config incorrectly handles Windows absolute pathnames
> -----------------------------------------------------
>
>                 Key: LUCENE-2353
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2353
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/benchmark
>            Reporter: Shai Erera
>             Fix For: 3.1
>
>         Attachments: LUCENE-2353.patch, LUCENE-2353.patch
>
>
> I have no idea how no one has run into this before, but I tried to execute an
> .alg file which used ReutersContentSource and referenced both docs.dir and
> work.dir as Windows absolute pathnames (e.g. d:\something). Surprisingly, the
> run reported an error of missing content under benchmark\work\something.
> I've traced the problem back to Config, where get(String, String) includes
> the following code:
> {code}
> if (sval.indexOf(":") < 0) {
>   return sval;
> }
> // first time this prop is extracted by round
> int k = sval.indexOf(":");
> String colName = sval.substring(0, k);
> sval = sval.substring(k + 1);
> ...
> {code}
> It detects ":" in the value and so thinks it's a per-round property, thus
> stripping "d:" from the value ... the fix is very simple:
> {code}
> if (sval.indexOf(":") < 0) {
>   return sval;
> } else if (sval.indexOf(":\\") >= 0) {
>   // this previously messed up absolute path names on Windows. Assuming
>   // there is no real value that starts with \\
>   return sval;
> }
> // first time this prop is extracted by round
> int k = sval.indexOf(":");
> String colName = sval.substring(0, k);
> sval = sval.substring(k + 1);
> {code}
> I'll post a patch w/ the above fix + test shortly.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
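As a sanity check on the behaviour the patch is after, a JUnit sketch along these lines could exercise both path styles. This is hypothetical, not the committed test: it assumes a Config(Properties) constructor alongside the get(String, String) accessor quoted above, and the imported Config is assumed to be the benchmark contrib's class.

{code}
import java.util.Properties;
import org.apache.lucene.benchmark.byTask.utils.Config;
import org.junit.Assert;
import org.junit.Test;

// Hypothetical test sketch for the Windows-path fix; not the committed test.
public class TestConfigWindowsPaths {
  @Test
  public void windowsAbsolutePathsSurviveConfigGet() throws Exception {
    Properties props = new Properties();
    props.setProperty("work.dir", "d:\\benchmark\\work"); // backslash form
    props.setProperty("docs.dir", "c:/temp/reuters");     // forward-slash form
    Config config = new Config(props);                    // assumed constructor

    // Before the fix, the drive-letter colon was mistaken for the
    // per-round property separator and the prefix was stripped.
    Assert.assertEquals("d:\\benchmark\\work", config.get("work.dir", null));
    Assert.assertEquals("c:/temp/reuters", config.get("docs.dir", null));
  }
}
{code}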