ported lucandra: lucene index on HBase

2010-03-25 Thread Thomas Koch
Hi,

Lucandra stores a lucene index on cassandra:
http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend

As the author of lucandra writes: "I’m sure something similar could be built 
on hbase."

So here it is:
http://github.com/thkoch2001/lucehbase

This is only a first prototype which has not been tested on anything real yet. 
But if you're interested, please join me to get it production ready!

I propose to keep this thread on hbase-user and java-dev only.
Would it make sense for this project to aim at becoming an HBase contrib? Or a
Lucene contrib?

Best regards,

Thomas Koch, http://www.koch.ro

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849639#action_12849639
 ] 

Michael McCandless commented on LUCENE-2215:


This is a neat collector!

I like the idea of chaining/filtering... couldn't we put this in core
(under TFC/TSDC.create), but instead of doubling the 12 specialized
(anonymous) impls we now have, just delegate?

Ie, we'd make a FilteredCollector, taking another collector when it's
created, and then on every collect call, only if the hit is "weak"
enough (ie is worse than what the app provided as prev low score/doc)
would it forward it to the delegate?  I guess we should test perf w/
(the new additions to benchmark -- yay!) to see if specializing the
code (even anonymously) is warranted.
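
A minimal sketch of the delegating idea described above. It uses a simplified stand-in interface (`HitCollector`) rather than Lucene's actual Collector API, and the names `PagingFilterCollector`, `prevScore`, and `prevDoc` are hypothetical. A hit is forwarded only if it would sort after the last hit of the previous page, assuming the usual ordering of score descending, then doc id ascending.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for a collector; NOT Lucene's Collector API. */
interface HitCollector {
  void collect(int doc, float score);
}

/** Forwards a hit to the delegate only if it is "weaker" than the
 *  last hit of the previous page. */
final class PagingFilterCollector implements HitCollector {
  private final HitCollector delegate;
  private final float prevScore; // lowest score on the previous page
  private final int prevDoc;     // doc id of that lowest-scoring hit

  PagingFilterCollector(HitCollector delegate, float prevScore, int prevDoc) {
    this.delegate = delegate;
    this.prevScore = prevScore;
    this.prevDoc = prevDoc;
  }

  @Override
  public void collect(int doc, float score) {
    // Ties on score are broken by doc id (assumed ordering).
    if (score < prevScore || (score == prevScore && doc > prevDoc)) {
      delegate.collect(doc, score);
    }
  }
}

/** Trivial delegate that records the doc ids it receives. */
final class RecordingCollector implements HitCollector {
  final List<Integer> docs = new ArrayList<>();

  @Override
  public void collect(int doc, float score) {
    docs.add(doc);
  }
}
```

The filter costs one extra virtual call per forwarded hit, which is the overhead the benchmark would need to measure against specialized implementations.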

The indent whitespace needs to be fixed to 2 spaces...


> paging collector
> 
>
> Key: LUCENE-2215
> URL: https://issues.apache.org/jira/browse/LUCENE-2215
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Search
> Affects Versions: 2.4, 3.0
> Reporter: Adam Heinz
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: IterablePaging.java, LUCENE-2215.patch,
>     PagingCollector.java, TestingPagingCollector.java
>
>
> http://issues.apache.org/jira/browse/LUCENE-2127?focusedCommentId=12796898&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12796898
> Somebody assign this to Aaron McCurry and we'll see if we can get enough 
> votes on this issue to convince him to upload his patch.  :)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.





Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-25 Thread Michael McCandless
On Mon, Mar 22, 2010 at 12:45 PM, Marvin Humphrey wrote:
> On Thu, Mar 18, 2010 at 05:16:23AM -0500, Michael McCandless wrote:
>> Also, will Lucy store the original stats?
>
> These?
>
>   * Total number of tokens in the field.
>   * Number of unique terms in the field.
>   * Doc boost.
>   * Field boost.

Also sum(tf).  Robert can generate more :)

> That would depend on which Similarity the user specs for that field.  In
> other words, it's just another data-reduction decision: if the Sim needs it,
> keep it, and if it doesn't, throw it away.

OK.

> Incidentally, what are you planning to do about field boost if it's not always
> 1.0?  Are you going to store full 32-bit floats?

For starters, yes.  We may (later) want to make a new attr that sets
the #bits (levels/precision) you want... then uses packed ints to
encode.
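
A rough sketch of what a "#bits" setting could mean in practice: quantize each boost into one of 2^bits integer levels, which a packed-int encoder could then store. The linear mapping over a fixed [min, max] range and the class name `BoostQuantizer` are assumptions for illustration; the thread does not specify the scheme.

```java
/** Linear quantization of a float into an n-bit integer level (illustrative only). */
final class BoostQuantizer {
  private final float min;
  private final float max;
  private final int levels; // 2^bits

  BoostQuantizer(float min, float max, int bits) {
    if (bits < 1 || bits > 30 || !(max > min)) {
      throw new IllegalArgumentException("bad range or bit count");
    }
    this.min = min;
    this.max = max;
    this.levels = 1 << bits;
  }

  /** Maps a boost to an integer in [0, 2^bits - 1], clamping out-of-range input. */
  int encode(float boost) {
    float clamped = Math.max(min, Math.min(max, boost));
    float fraction = (clamped - min) / (max - min);
    return Math.round(fraction * (levels - 1));
  }

  /** Maps a level back to the approximate boost value. */
  float decode(int level) {
    return min + (max - min) * level / (levels - 1);
  }
}
```

Fewer bits mean coarser levels, so the precision lost is a direct function of the chosen bit count and range.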

>> Ie so the chosen Sim can properly recompute all boost bytes (if it uses
>> those), for scoring models that "pivot" based on avg's of these stats?
>
> Yes, we could support that.
>
> It's not high on my todo-list for core Lucy, though: poor payoff for all the
> complexity it would introduce, particularly file format complexity with its
> heavy backwards compatibility burden.  Right now, we only have the boost
> bytes, and the fact that they are used for length normalization, field boost,
> and doc boost is incidental.  If we add all the raw stats, that's a bunch of
> stuff we have to support for a long time, yet which doesn't yield practical
> advantages for us yet.
>
> I'd be much more interested in finding a way to support such a feature as an
> extension.

I was specifically asking if Lucy will allow the user to force true
average to be recomputed, ie, at commit time from the writer.  It's
more costly and often not needed (ie, once your index is large enough,
new docs "typically" won't shift the average much).  But I imagine
some users will want "true average".
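
To make the "won't shift the average much" point concrete, here is a small worked sketch of the incremental mean. The class and method names are hypothetical and the numbers in the note below are made up.

```java
/** Illustrates how little one new document moves an average in a large index. */
final class AverageShift {
  /** Average field length after adding one document of length newLen. */
  static double updatedAverage(double oldAvg, long docCount, long newLen) {
    return (oldAvg * docCount + newLen) / (docCount + 1);
  }
}
```

With 9 docs averaging 100 tokens, adding a 200-token doc moves the average to 110. With 999,999 docs at the same average, the same doc moves it to roughly 100.0001.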

>> > In any case, the proposal to start delaying Sim choice to search-time --
>> > while a nice feature for Lucene -- is a non-starter for Lucy.  We can't do
>> > that because it would kill the cheap-Searcher model to generate boost bytes
>> > at Searcher construction time and cache them within the object.  We need
>> > those boost bytes written to disk so we can mmap them and share them amongst
>> > many cheap Searchers.
>>
>> It'd seem like Lucy could re-gen the boost bytes if a different Sim
>> were selected, or, the current Sim hadn't yet computed & cached its
>> bytes?  But then logically this means a "reader" needs write
>> permission to the index dir, which is not good...
>
> Whatever's reading the boost bytes can't tell the difference between process
> RAM and mmap'd RAM, so write-permission on the index dir isn't required.

Hmm if you could somehow soften this... so that a custom Sim could
regen its boost bytes (if it needed to), write them into the index,
and then "whoever's reading" can mmap... that'd buy you some
flexibility back.

> What's trickier is that Schemas are not normally mutable, and that they are
> part of the index.  You don't have to supply an Analyzer, or a Similarity, or
> anything else when opening a Searcher -- you just provide the location of the
> index, and the Schema gets deserialized from the latest schema_NNN.json file.
> That has many advantages, e.g. inadvertent Analyzer conflicts are pretty much
> a thing of the past for us.

That's nice... though... is it too rigid?  Do users even want to pick
a different analyzer at search time?

> But it makes your feature request of runtime settability for
> Similarity awkward to implement: by the time you have a Schema
> object to work with, the Searcher is already open.
>
>  Searcher searcher = new Searcher("/path/to/index");
>  Schema schema = searcher.getSchema();
>  schema.setSim("content", altSim); // Too late, and not implemented anyway.

I see...

>> > To my mind, these are all related data reduction tasks:
>> >
>> >  * Omit doc-boost and field-boost, replacing them with a single float
>> >    docXfield multiplier -- because you never need doc-boost on its own.
>> >  * Omit length-in-tokens, term-cardinality, doc-boost, and field-boost,
>> >    replacing them all with a single boost byte -- because for the kind of
>> >    scoring you want to do, you don't need all those raw stats.
>> >  * Omit the boost byte, because you don't need to do scoring at all.
>> >  * Omit positions because you don't need PhraseQueries, etc. to match.
>>
>> I wouldn't group this one with the others -- I mean technically it is
>> "data reduction" -- but omitting positions means certain queries
>> (PhraseQuery) won't work even in "match only" searching.  Whereas the
>> rest of these examples affect how scoring is done (or whether it's
>> done).
>
> Couldn't disagree more.  Omitting positions is *exactly* the kind of data
> reduction task which we know is safe to perform when a user specifically tells
> us they don't need PhraseQueries by specifying a MinimalSimi

[jira] Updated: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-25 Thread Tim Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Smith updated LUCENE-2345:
--

Attachment: LUCENE-2345_3.0.patch

Here's a patch against 3.0 that provides the SegmentReaderFactory ability
(not tested yet, but I'll be doing that shortly as I integrate this
functionality).

It adds a SegmentReaderFactory.

The IndexWriter now has a getter and setter for setting this

SegmentReader has a new protected method init(), which is called after the
segment reader has been initialized (to allow subclasses to hook this action
and do additional initialization, etc.).

It also adds 2 new IndexReader.open() calls that allow specifying the
SegmentReaderFactory.



> Make it possible to subclass SegmentReader
> --
>
> Key: LUCENE-2345
> URL: https://issues.apache.org/jira/browse/LUCENE-2345
> Project: Lucene - Java
> Issue Type: Wish
> Components: Index
> Reporter: Tim Smith
> Fix For: 3.1
>
> Attachments: LUCENE-2345_3.0.patch
>
>
> I would like the ability to subclass SegmentReader for numerous reasons:
> * to capture initialization/close events
> * attach custom objects to an instance of a segment reader (caches, 
> statistics, so on and so forth)
> * override methods on segment reader as needed
> currently this isn't really possible
> I propose adding a SegmentReaderFactory that would allow creating custom 
> subclasses of SegmentReader
> default implementation would be something like:
> {code}
> public class SegmentReaderFactory {
>   public SegmentReader get(boolean readOnly) {
>     return readOnly ? new ReadOnlySegmentReader() : new SegmentReader();
>   }
>   public SegmentReader reopen(SegmentReader reader, boolean readOnly) {
>     return get(readOnly);
>   }
> }
> {code}
> It would then be made possible to pass a SegmentReaderFactory to IndexWriter 
> (for pooled readers) as well as to SegmentReader.get() (DirectoryReader.open, 
> etc)
> I could prepare a patch if others think this has merit
> Obviously, this API would be "experimental/advanced/will change in future"
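
A sketch of how an application might use such a factory to attach custom state to readers. All classes here are stand-ins (`StubSegmentReader`, `StubSegmentReaderFactory`, and so on are hypothetical names), because the real SegmentReader constructors and the final factory signature are not settled in this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for Lucene's SegmentReader; the real class has a different shape. */
class StubSegmentReader {
  final boolean readOnly;

  StubSegmentReader(boolean readOnly) {
    this.readOnly = readOnly;
  }
}

/** Default factory, mirroring the shape of the proposal. */
class StubSegmentReaderFactory {
  StubSegmentReader get(boolean readOnly) {
    return new StubSegmentReader(readOnly);
  }

  StubSegmentReader reopen(StubSegmentReader reader, boolean readOnly) {
    return get(readOnly);
  }
}

/** Custom reader carrying an application-level cache. */
final class CachingSegmentReader extends StubSegmentReader {
  final Map<String, Object> cache = new HashMap<>();

  CachingSegmentReader(boolean readOnly) {
    super(readOnly);
  }
}

/** Custom factory that creates the subclass and counts creations. */
final class CachingSegmentReaderFactory extends StubSegmentReaderFactory {
  int created = 0;

  @Override
  StubSegmentReader get(boolean readOnly) {
    created++; // captures the "initialization event"
    return new CachingSegmentReader(readOnly);
  }
}
```

Because the default `reopen` delegates to `get`, overriding `get` alone is enough for reopened readers to be of the custom type as well.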




[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-25 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849728#action_12849728
 ] 

Shai Erera commented on LUCENE-2345:


bq. The IndexWriter now has a getter and setter for setting this

If this is not expected to change during the lifetime of IW, I think it should 
be added to IWC when you upgrade the patch to 3.1.





[jira] Commented: (LUCENE-2345) Make it possible to subclass SegmentReader

2010-03-25 Thread Tim Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849731#action_12849731
 ] 

Tim Smith commented on LUCENE-2345:
---

that was my plan





Re: Baby steps towards making Lucene's scoring more flexible...

2010-03-25 Thread Marvin Humphrey
On Thu, Mar 25, 2010 at 06:24:34AM -0400, Michael McCandless wrote:
> >> Also, will Lucy store the original stats?
> >
> > These?
> >
> >   * Total number of tokens in the field.
> >   * Number of unique terms in the field.
> >   * Doc boost.
> >   * Field boost.
> 
> Also sum(tf).  Robert can generate more :)

Hmm, aren't "Total number of tokens in the field" and sum(tf) normally
equivalent?  I guess there might be analyzers for which that isn't true, e.g.
those which perform synonym-injection?

In any case, "sum(tf)" is probably a better definition, because it makes no
ancillary claims...

> > Incidentally, what are you planning to do about field boost if it's not
> > always 1.0?  Are you going to store full 32-bit floats?
> 
> For starters, yes.  

OK, how are those going to be encoded?  IEEE 754?  Big-endian?

http://en.wikipedia.org/wiki/Endianness#Floating-point_and_endianness
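
For reference, one possible encoding sketched in Java: IEEE 754 single precision written big-endian via `java.nio.ByteBuffer`. Whether Lucene would actually use this layout is not stated in the thread; `FloatCodec` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Round-trips a float through a fixed 4-byte big-endian IEEE 754 form. */
final class FloatCodec {
  /** Encodes a float as 4 bytes: IEEE 754 single precision, big-endian. */
  static byte[] encode(float value) {
    return ByteBuffer.allocate(4).order(ByteOrder.BIG_ENDIAN).putFloat(value).array();
  }

  /** Decodes 4 big-endian bytes back into a float. */
  static float decode(byte[] bytes) {
    return ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getFloat();
  }
}
```

Fixing the byte order explicitly keeps the on-disk form independent of the host CPU, which matters if another implementation (such as a C reader) needs to parse the same file.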

> We may (later) want to make a new attr that sets
> the #bits (levels/precision) you want... then uses packed ints to
> encode.

I'm concerned that the bit-wise entropy of floats may make them a poor match
for compression via packed ints.  We'll probably get a compressed
representation which is larger than the original.

Are there any standard algorithms out there for compressing IEEE 754 floats?
RLE works, but only with certain data patterns.

... [ time passes ] ...

Hmm, maybe not:


http://stackoverflow.com/questions/2238754/compression-algorithm-for-ieee-754-data

> I was specifically asking if Lucy will allow the user to force true
> average to be recomputed, ie, at commit time from the writer. 

That's theoretically possible.  We'd have to implement the reader the same way
we have DeletionsReader -- the most recent segment may contain data which
applies to older segments.  

Here's the DeletionsReader code, which searches backwards through the segments
looking for a particular file:

/* Start with deletions files in the most recently added segments and work
 * backwards.  The first one we find which addresses our segment is the
 * one we need. */
for (i = VA_Get_Size(segments) - 1; i >= 0; i--) {
    Segment *other_seg = (Segment*)VA_Fetch(segments, i);
    Hash *metadata
        = (Hash*)Seg_Fetch_Metadata_Str(other_seg, "deletions", 9);
    if (metadata) {
        Hash *files = (Hash*)CERTIFY(
            Hash_Fetch_Str(metadata, "files", 5), HASH);
        Hash *seg_files_data
            = (Hash*)Hash_Fetch(files, (Obj*)my_seg_name);
        if (seg_files_data) {
            Obj *count = (Obj*)CERTIFY(
                Hash_Fetch_Str(seg_files_data, "count", 5), OBJ);
            del_count = (i32_t)Obj_To_I64(count);
            del_file  = (CharBuf*)CERTIFY(
                Hash_Fetch_Str(seg_files_data, "filename", 8), CHARBUF);
            break;
        }
    }
}

What we'd do is write the regenerated boost bytes for *all* segments to the
most recent segment.  It would be roughly analogous to building up an NRT
reader.
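
The same newest-first lookup, sketched in Java for readers more at home on the Lucene side. `SegmentInfo` and `LatestFileFinder` are hypothetical stand-ins, and the file names in the usage note are invented; this is not Lucy's API.

```java
import java.util.List;
import java.util.Map;

/** Stand-in for a segment: maps a target segment name to the file
 *  in this segment that addresses it. */
final class SegmentInfo {
  final String name;
  final Map<String, String> filesBySegment;

  SegmentInfo(String name, Map<String, String> filesBySegment) {
    this.name = name;
    this.filesBySegment = filesBySegment;
  }
}

final class LatestFileFinder {
  /** Scans newest-to-oldest and returns the first file that addresses
   *  the target segment, or null if no segment addresses it. */
  static String find(List<SegmentInfo> segments, String target) {
    for (int i = segments.size() - 1; i >= 0; i--) {
      String file = segments.get(i).filesBySegment.get(target);
      if (file != null) {
        return file;
      }
    }
    return null;
  }
}
```

If segments 2 and 3 both hold data for segment 1, the lookup returns the entry from segment 3, since the most recent write wins.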

> > What's trickier is that Schemas are not normally mutable, and that they are
> > part of the index.  You don't have to supply an Analyzer, or a Similarity,
> > or anything else when opening a Searcher -- you just provide the location of
> > the index, and the Schema gets deserialized from the latest schema_NNN.json
> > file.  That has many advantages, e.g. inadvertent Analyzer conflicts are
> > pretty much a thing of the past for us.
> 
> That's nice... though... is it too rigid?  Do users even want to pick
> a different analyzer at search time?

It's not common.  

To my mind, the way a field is tokenized is part of its field definition, thus
the Analyzer is part of the field definition, thus the analyzer is part of the
schema and needs to be stored with the index.

Still, we support different Analyzers at search time by way of QueryParser.
QueryParser's constructor requires a Schema, but also accepts an optional
Analyzer which if supplied will be used instead of the Analyzers from the
Schema.
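
The override pattern described above, sketched in Java with stand-in types. `TextAnalyzer`, `FieldSchema`, and `SimpleQueryParser` are hypothetical names and do not reflect Lucy's actual classes or signatures.

```java
import java.util.Map;

/** Stand-in analyzer: turns text into tokens. */
interface TextAnalyzer {
  String[] tokenize(String text);
}

/** Stand-in schema: stores one analyzer per field, as part of the index. */
final class FieldSchema {
  private final Map<String, TextAnalyzer> analyzers;

  FieldSchema(Map<String, TextAnalyzer> analyzers) {
    this.analyzers = analyzers;
  }

  TextAnalyzer analyzerFor(String field) {
    return analyzers.get(field);
  }
}

/** Parser that requires a schema and optionally accepts an overriding analyzer. */
final class SimpleQueryParser {
  private final FieldSchema schema;
  private final TextAnalyzer override; // null means "use the schema's analyzer"

  SimpleQueryParser(FieldSchema schema) {
    this(schema, null);
  }

  SimpleQueryParser(FieldSchema schema, TextAnalyzer override) {
    this.schema = schema;
    this.override = override;
  }

  String[] terms(String field, String text) {
    TextAnalyzer analyzer = (override != null) ? override : schema.analyzerFor(field);
    if (analyzer == null) {
      throw new IllegalArgumentException("no analyzer for field: " + field);
    }
    return analyzer.tokenize(text);
  }
}
```

The default path needs no analyzer argument at all, which is what removes the chance of an index-time/search-time mismatch; the override stays an explicit opt-in.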

> > Maybe aggressive automatic data-reduction makes more sense in the context of
> > "flexible matching", which is more expansive than "flexible scoring"?
> 
> I think so.  Maybe it shouldn't be called a Similarity (which to me
> (though, carrying a heavy curse of knowledge burden...) means
> "scoring")?  Matcher?

Heh.  "Matcher" is taken.  It's a crucial class, too, roughly combining the
roles of Lucene's Scorer and DocIDSetIterator.

The first alternative that comes to mind is "Relevance", because not only can
one thing's relevance to another be continuously variable (i.e. score), it can
also be binary: relevant/not-relevant (i.e. match).

But I don't see why "Relevance", "Matcher", or anything else would be so much
better than "Similarity".  I think this is your hang up.  ;) 

> > I'm +0 (FWIW) on search-time Sim settability for Lucene.  It's a nice
> > feature, but I don't

[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849806#action_12849806
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Michael, I'm guessing this patch needs to be updated as per LUCENE-2329?  

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Index
> Reporter: Michael Busch
> Assignee: Michael Busch
> Priority: Minor
> Fix For: 3.1
>
> Attachments: lucene-2324-no-pooling.patch
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.
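
A toy sketch of the "N fully independent RAM segments" idea from the summary above. `RamSegment` and `RamSegmentPool` are hypothetical names, and routing by thread id modulo N is an assumption made for illustration, not what IndexWriter does.

```java
import java.util.ArrayList;
import java.util.List;

/** One independent in-RAM segment; illustrative stand-in only. */
final class RamSegment {
  private final List<String> docs = new ArrayList<>();

  synchronized void add(String doc) {
    docs.add(doc);
  }

  /** Returns the buffered docs and clears the buffer, without touching
   *  any other segment. */
  synchronized List<String> flush() {
    List<String> flushed = new ArrayList<>(docs);
    docs.clear();
    return flushed;
  }
}

/** Routes each indexing thread to one of N independent segments. */
final class RamSegmentPool {
  private final RamSegment[] segments;

  RamSegmentPool(int n) {
    segments = new RamSegment[n];
    for (int i = 0; i < n; i++) {
      segments[i] = new RamSegment();
    }
  }

  RamSegment forThread(long threadId) {
    return segments[(int) Math.floorMod(threadId, (long) segments.length)];
  }
}
```

Each segment has its own lock, so a flush of one segment never blocks threads that are adding documents to another.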




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849808#action_12849808
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Actually, I just browsed the patch again, I don't think it implements private 
doc writers as of yet?  

I think you're right, we can get this issue completed.  LUCENE-2312's path 
looks clear at this point.  Shall I take a whack at it?





[jira] Updated: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch updated LUCENE-2324:
--

Attachment: (was: lucene-2324-no-pooling.patch)





[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849819#action_12849819
 ] 

Michael Busch commented on LUCENE-2324:
---

Hey Jason,

Disregard my patch here.  I just experimented with removal of pooling, but then 
did LUCENE-2329 instead.  TermsHash and TermsHashPerThread are now much 
simpler, because all the pooling code is gone after 2329 was committed.  Should 
make it a little easier to get this patch done.

Sure, it'd be awesome if you could provide a patch here.  I can help you; we
should just post patches here frequently so that we don't both work on the same
areas.







[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849843#action_12849843
 ] 

Grant Ingersoll commented on LUCENE-2215:
-

Mike, don't you think, though, that through a fairly simple update of some of
the clauses to appropriately short-circuit things, we can just hook this into
the existing collectors w/o any need for delegation or changes?  Let me try
a patch.  Now that the benchmark stuff is in, we should be able to test.






[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849844#action_12849844
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

Michael, I'm working on a patch and will post one (hopefully) shortly.





[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Uwe Schindler (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849851#action_12849851
 ] 

Uwe Schindler commented on LUCENE-2215:
---

Hey, and I want to fix the NaN thing in TSDC: LUCENE-2271

Maybe when we delegate, we can also use my cool code that switches the delegate
to remove one comparison after the queue is full.





[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849863#action_12849863
 ] 

Michael McCandless commented on LUCENE-2215:


bq. ...through a fairly simple update of some of the clauses to appropriately
short-circuit things, we can just hook this into the existing collectors
w/o any need for delegation or changes? Let me try a patch. Now that the
benchmark stuff is in, we should be able to test.

This'd make me nervous...

Ie I don't think we should insert bytecodes for the 99.9% of searches that 
wouldn't make use of this, even if we can't uncover a slowdown with 
benchmarking.

We should still benchmark it though (I'm curious)... we should also benchmark 
the delegate solution.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849899#action_12849899
 ] 

Michael Busch commented on LUCENE-2324:
---

Awesome!

> Per thread DocumentsWriters that write their own private segments
> -
>
> Key: LUCENE-2324
> URL: https://issues.apache.org/jira/browse/LUCENE-2324
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Index
>Reporter: Michael Busch
>Assignee: Michael Busch
>Priority: Minor
> Fix For: 3.1
>
>
> See LUCENE-2293 for motivation and more details.
> I'm copying here Mike's summary he posted on 2293:
> Change the approach for how we buffer in RAM to a more isolated
> approach, whereby IW has N fully independent RAM segments
> in-process and when a doc needs to be indexed it's added to one of
> them. Each segment would also write its own doc stores and
> "normal" segment merging (not the inefficient merge we now do on
> flush) would merge them. This should be a good simplification in
> the chain (eg maybe we can remove the *PerThread classes). The
> segments can flush independently, letting us make much better
> concurrent use of IO & CPU.




[jira] Created: (LUCENE-2346) Explore other in-memory postinglist formats for realtime search

2010-03-25 Thread Michael Busch (JIRA)

Explore other in-memory postinglist formats for realtime search
---

 Key: LUCENE-2346
 URL: https://issues.apache.org/jira/browse/LUCENE-2346
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Index
Reporter: Michael Busch
Assignee: Michael Busch
Priority: Minor
 Fix For: 3.1


The current in-memory posting list format might not be optimal for searching. 
VInt decoding performance and the lack of skip lists would arguably be the 
biggest bottlenecks.

For LUCENE-2312 we should investigate other formats.

Some ideas:
- PFOR or packed ints for posting slices?
- Maybe even int[] slices instead of byte slices? This would be great for 
search performance, but the additional memory overhead might not be acceptable.
- For realtime search it's usually desirable to evaluate the most recent 
documents first.  So using backward pointers instead of forward pointers and 
having the postinglist pointer point to the most recent docID in a list is 
something to consider.
- Skipping: if we use fixed-length postings ([packed] ints) we can do binary 
search within a slice.  We can also locate a pointer then without scanning and 
thus skip entire slices quickly.  Is that sufficient or would we need more 
skipping layers, so that it's possible to skip directly to particular slices?


It would be awesome to find a format that doesn't slow down "normal" indexing, 
but is very efficient for in-memory searches.  If we can't find such a fits-all 
format, we should have a separate indexing chain for real-time indexing.
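
The skipping idea above (binary search within a slice of fixed-length postings) can be sketched roughly as follows. This is only an illustration under assumptions: the slice is modeled as a plain int[] of ascending docIDs, and the class and method names are made up for this sketch, not taken from any Lucene code or patch.

```java
// Hypothetical sketch: a posting slice stored as fixed-width ints (ascending docIDs).
final class FixedWidthSliceSketch {

  // Returns the index of the first docID >= target within slice[0, length),
  // or length if every docID in the slice is smaller than target.
  static int advance(int[] slice, int length, int target) {
    int lo = 0;
    int hi = length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1; // unsigned shift avoids overflow on large indexes
      if (slice[mid] < target) {
        lo = mid + 1; // target lies to the right of mid
      } else {
        hi = mid; // mid is a candidate, keep searching to the left
      }
    }
    return lo;
  }
}
```

With VInt-encoded byte slices this kind of random access is not possible, because every posting has to be decoded in order. That is the trade-off against the extra memory mentioned above.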




[jira] Created: (LUCENE-2347) Dump WordNet to SOLR Synonym format

2010-03-25 Thread Bill Bell (JIRA)

Dump WordNet to SOLR Synonym format
---

 Key: LUCENE-2347
 URL: https://issues.apache.org/jira/browse/LUCENE-2347
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Affects Versions: 3.0.1
Reporter: Bill Bell


This enhancement allows you to dump WordNet v2 to the SOLR synonym format, so 
that all your synonyms can be loaded easily.

1. You can load all synonyms from WordNet V2 (http://wordnetcode.princeton.edu/2.0/) 
to SOLR by first using the Syns2Index program:
http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/wordnet/Syns2Index.html

Get WNprolog from http://wordnetcode.princeton.edu/2.0/

2. We modified this program to work with SOLR (see attached). On 
amidev.kaango.com it lives in /vol/src/lucene/contrib/wordnet:
vi /vol/src/lucene/contrib/wordnet/src/java/org/apache/lucene/wordnet/Syns2Solr.java

3. Run ant

4. java -classpath 
/vol/src/lucene/build/contrib/wordnet/lucene-wordnet-3.1-dev.jar 
org.apache.lucene.wordnet.Syns2Solr prolog/wn_s.pl solr > index_synonyms.txt
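
For readers without the attachment, the conversion can be pictured with the rough sketch below. It is not the attached Syns2Solr.java. It assumes that each line of wn_s.pl has the shape s(synset_id,w_num,'word',ss_type,sense_number,tag_count). and that the target format is one comma-separated line of equivalent terms per synset. The class and method names are hypothetical, and escaping of commas inside words is not handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: group WordNet prolog entries by synset and emit synonym lines.
final class PrologSynonymSketch {

  // Emits one comma-separated line per synset that contains at least two words.
  static List<String> toSynonymLines(List<String> prologLines) {
    Map<String, List<String>> bySynset = new LinkedHashMap<>();
    for (String line : prologLines) {
      if (!line.startsWith("s(")) {
        continue; // not a synonym fact
      }
      int comma = line.indexOf(',');       // end of the synset ID
      int open = line.indexOf('\'');       // opening quote of the word
      int close = line.lastIndexOf('\'');  // closing quote of the word
      if (comma < 0 || open < 0 || close <= open) {
        continue; // malformed line, skip it
      }
      String synset = line.substring(2, comma);
      // Prolog escapes an apostrophe inside a quoted atom by doubling it.
      String word = line.substring(open + 1, close).replace("''", "'");
      bySynset.computeIfAbsent(synset, k -> new ArrayList<>()).add(word);
    }
    List<String> out = new ArrayList<>();
    for (List<String> words : bySynset.values()) {
      if (words.size() > 1) {
        out.add(String.join(",", words)); // single-word synsets carry no synonyms
      }
    }
    return out;
  }
}
```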




[jira] Updated: (LUCENE-2347) Dump WordNet to SOLR Synonym format

2010-03-25 Thread Bill Bell (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bill Bell updated LUCENE-2347:
--

Attachment: Syns2Solr.java




[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849961#action_12849961
 ] 

Grant Ingersoll commented on LUCENE-2215:
-

Yeah, but one could make the argument, Mike, that the existing "optimizations" 
are useless for the most common case, since I think it's safe to say most 
applications implement paging.  Of course, that being said, most users don't 
page all that deeply.  Also, for something like Solr that prefetches the top 50 
it might not be good, either.  Still, in my mind it is one additional boolean 
check, as in:
{code}
if ( (current stuff) || (pagingInfoPresent == true && paging check) )
...
{code}

pagingInfoPresent can be determined at construction time and that whole clause 
would be short circuited very quickly.

That being said, delegation could be done at construction time, too and more 
cleanly separates things.  I'll try to put up my version tomorrow.
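
For comparison, the delegation variant discussed in this thread might look roughly like the sketch below. It is not the patch: HitSink is a hypothetical stand-in for Lucene's Collector that models only the per-hit callback, and the tie-break (equal score, higher doc ID ranks later) is an assumption about how the previous page was ordered.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a collector; only the per-hit callback is modeled.
interface HitSink {
  void collect(int doc, float score);
}

// Forwards a hit only if it ranks strictly after the last hit of the previous page.
final class PagingFilterSketch implements HitSink {
  private final HitSink delegate;
  private final float prevLowScore; // score of the last hit on the previous page
  private final int prevLowDoc;     // doc ID of the last hit on the previous page

  PagingFilterSketch(HitSink delegate, float prevLowScore, int prevLowDoc) {
    this.delegate = delegate;
    this.prevLowScore = prevLowScore;
    this.prevLowDoc = prevLowDoc;
  }

  @Override
  public void collect(int doc, float score) {
    // Assumed ordering: higher score first, ties broken by lower doc ID first.
    boolean weaker = score < prevLowScore || (score == prevLowScore && doc > prevLowDoc);
    if (weaker) {
      delegate.collect(doc, score);
    }
  }
}

// Test helper: records the doc IDs it receives, in arrival order.
final class RecordingSink implements HitSink {
  final List<Integer> docs = new ArrayList<>();

  @Override
  public void collect(int doc, float score) {
    docs.add(doc);
  }
}
```

Searches that do not page never construct the wrapper, so the existing collectors stay untouched. The cost is one extra virtual call per hit for searches that do.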




[jira] Commented: (LUCENE-2324) Per thread DocumentsWriters that write their own private segments

2010-03-25 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849965#action_12849965
 ] 

Jason Rutherglen commented on LUCENE-2324:
--

I'm a little confused by the flushedDocCount and remap-deletes conversion 
portions of DocWriter.  flushedDocCount is used as a global counter; however, 
when we move to per thread doc writers, it won't be global anymore.  Is there a 
different (easier) way to perform remap deletes?




[jira] Commented: (LUCENE-2215) paging collector

2010-03-25 Thread Shai Erera (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850002#action_12850002
 ] 

Shai Erera commented on LUCENE-2215:


bq. since I think it's safe to say most applications implement paging

Let's be careful about the semantics here, Grant. Most if not all applications 
implement paging indeed, but I believe only FEW actually store user contexts 
between searches. PagingCollector relies on the application to store the lowest 
ranking doc that was returned previously, which means storing context between 
user's searches.

I agree w/ Mike's statement that 99.9% of the searches would never run that 
code, which is why I've proposed a delegation/wrapper approach from the 
beginning. I also think that we should make some allowances here and there, for 
the non-common case, and introduce better software design than specialized 
code. A Collector filter approach for some rare (or even less common) cases 
seems very reasonable to me.

Also, I think that if we add to TSDC a create method which takes into account 
the previously scored lowest doc, it will confuse people. Now they will need to 
think "where do I get this low score from?" - but perhaps after I see the code, 
it wouldn't be such a bad thing. I just have a feeling TSDC and TFC should be 
left on their own, and extreme paging stuff should either be its own 
specialized collector, or a wrapper.




[jira] Created: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers

2010-03-25 Thread Trejkaz (JIRA)

DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment 
readers
-

 Key: LUCENE-2348
 URL: https://issues.apache.org/jira/browse/LUCENE-2348
 Project: Lucene - Java
  Issue Type: Bug
  Components: Search
Affects Versions: 2.9.2
Reporter: Trejkaz


DuplicateFilter currently works by building a single doc ID set, without taking 
into account that getDocIdSet() will be called once per segment and only with 
each segment's local reader.
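
To illustrate the mismatch between the two ID spaces, here is a small sketch. It is not a fix for DuplicateFilter and uses no Lucene classes. It assumes a hypothetical setup where the top-level set of kept documents and each segment's docBase and maxDoc are known, and shows that a per-segment result has to be expressed in that segment's local doc IDs.

```java
import java.util.BitSet;

// Hypothetical sketch: slice a top-level doc ID set into one segment's local ID space.
final class SegmentLocalSketch {

  // globalKept uses top-level doc IDs; the result uses IDs in [0, maxDoc) for the
  // segment whose first document has the top-level ID docBase.
  static BitSet localView(BitSet globalKept, int docBase, int maxDoc) {
    BitSet local = new BitSet(maxDoc);
    int end = docBase + maxDoc;
    for (int g = globalKept.nextSetBit(docBase); g >= 0 && g < end;
        g = globalKept.nextSetBit(g + 1)) {
      local.set(g - docBase); // translate top-level ID to segment-local ID
    }
    return local;
  }
}
```

A set built once from a single reader and returned unchanged for every segment skips this translation, which is the kind of problem the report describes.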





[jira] Updated: (LUCENE-2348) DuplicateFilter incorrectly handles multiple calls to getDocIdSet for segment readers

2010-03-25 Thread Trejkaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trejkaz updated LUCENE-2348:


Component/s: (was: Search)
 contrib/*

Changing to contrib, only just realised it was in that location...





[jira] Commented: (LUCENE-2323) reorganize contrib modules

2010-03-25 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850012#action_12850012
 ] 

Robert Muir commented on LUCENE-2323:
-

Committed 927696 (and 927697 for the solr piece).

Will keep the issue open and work on a patch for the next part.

> reorganize contrib modules
> --
>
> Key: LUCENE-2323
> URL: https://issues.apache.org/jira/browse/LUCENE-2323
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: contrib/*
>Reporter: Robert Muir
>Assignee: Robert Muir
> Attachments: LUCENE-2323.patch
>
>
> It would be nice to reorganize contrib modules, so that they are bundled 
> together by functionality.
> For example:
> * the wikipedia contrib is a tokenizer, which I think really belongs in 
> contrib/analyzers
> * there are two highlighters, which I think could be one highlighters package
> * there are many queryparsers and queries in different places in contrib
