Re: Lucene TermsFilter lookup slow

2015-08-09 Thread jamie

Mike

Thank you kindly for the reply. I am using Lucene v4.10.4. Are the
optimizations you refer to available in this version?


We haven't yet upgraded to Lucene 5 as there appear to be many API changes.

Jamie

On 2015/08/08 5:13 PM, Michael McCandless wrote:

Which version of Lucene are you using?  Newer versions have optimized
the "primary key" use case somewhat...

Mike McCandless

http://blog.mikemccandless.com


On Sat, Aug 8, 2015 at 8:32 AM, jamie  wrote:

Greetings

Our app primarily uses Lucene for its intended purpose i.e. to search across
large amounts of unstructured text. However, recently our requirement
expanded to perform look-ups on specific documents in the index based on
associated custom defined unique keys. For our purposes, a unique key is the
string representation of a 128 bit murmur hash, stored in a Lucene field
named uid.  We are currently using the TermsFilter to lookup Documents in
the Lucene index as follows:

List<Term> terms = new LinkedList<>();
for (String id : ids) {
    terms.add(new Term("uid", id));
}
TermsFilter idFilter = new TermsFilter(terms);
... search logic...

At any time we may need to look up, say, a couple of thousand documents. Our
problem is one of performance. On very large indexes with 30 million records
or more, the lookup can be excruciatingly slow. At this stage, it's not
practical for us to move the data over to a fit-for-purpose database, nor to
change the uid field to a numeric type. I fully appreciate the fact that
Lucene is not designed to be a database; however, is there anything we can
do to improve the performance of these look-ups?

Much appreciated

Jamie




-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Mapping doc values back to doc ID (in decent time)

2015-08-09 Thread Trejkaz
On Fri, Aug 7, 2015 at 5:34 PM, Adrien Grand  wrote:
> Does your application actually iterate in order over dense ids, or is
> it just for benchmarking purposes? Because if it does, you probably
> don't actually need seeking, you could just see what the current ID in
> the terms enum is.

Both dense ID fetches and individual ID fetches exist in the
application. I put them in a benchmark deliberately doing it as
individual fetches to get an idea of average timing for a single
operation.
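
The in-order case Adrien describes can be sketched without Lucene: when both the requested IDs and the term dictionary are sorted, one forward pass finds every hit and no per-ID seek is needed. This is only a model under stated assumptions. The terms enum is represented by a plain sorted iterator over a TreeMap, and `SortedIdWalk` and `resolve` are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SortedIdWalk {
    /**
     * Resolves sorted IDs against a sorted term dictionary (term -> doc ID)
     * in a single forward pass. IDs missing from the dictionary are skipped.
     */
    static List<Integer> resolve(List<String> sortedIds, TreeMap<String, Integer> dictionary) {
        List<Integer> docIds = new ArrayList<>();
        Iterator<Map.Entry<String, Integer>> terms = dictionary.entrySet().iterator();
        Map.Entry<String, Integer> current = terms.hasNext() ? terms.next() : null;
        for (String id : sortedIds) {
            // Advance until the current term is >= the wanted ID (never backwards).
            while (current != null && current.getKey().compareTo(id) < 0) {
                current = terms.hasNext() ? terms.next() : null;
            }
            if (current == null) {
                break; // dictionary exhausted, so no later ID can match
            }
            if (current.getKey().equals(id)) {
                docIds.add(current.getValue());
            }
        }
        return docIds;
    }

    public static void main(String[] args) {
        // Illustrative data only.
        TreeMap<String, Integer> dictionary = new TreeMap<>();
        dictionary.put("a1", 0);
        dictionary.put("b2", 1);
        dictionary.put("c3", 2);
        dictionary.put("d4", 3);
        System.out.println(resolve(List.of("b2", "c0", "d4"), dictionary));
    }
}
```

The individual-fetch case discussed in this message does not benefit from this, since a single ID still needs one seek.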

There are so many use cases of doing the individual fetches that it's
tough to enumerate. The first one I found was "fetch the term vector
for ID + field" but I'm sure there will be tons of them.

For mapping a dense set of IDs to doc IDs (e.g. for filtering), I
would probably use something like DocValuesTermsQuery for that to get
them all in one shot. I also wondered whether writing our filters as
queries would help, but I think it would turn out to be about as fast
as DocValuesTermsQuery even if I did that.

I'm sure the only way to really improve the speed of these filters is
to start storing these things in the text index and use query-time
joins, but I can't do that until I solve the issue of relying on
stable doc IDs and it seems like trying to solve two large problems in
a single commit would be biting off more than I can chew.

> If you actually need seeking, then you should try
> to avoid MultiFields, it will call seekExact on each segment, while
> given what I see you could just stop after you found one segment with
> the value.

Ah, I did wonder whether MultiFields had any behaviour like that, so
that definitely means that I will avoid using it. Then I can try other
tricks, like trying the seeks in order of segment size (the largest
segment is most likely to contain the hit.)
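
A minimal sketch of that idea, assuming IDs are unique across the index so the search can stop at the first segment that contains the value. Segments are modelled here as plain sorted maps; `SegmentLookup`, `Segment`, `Hit`, `largestFirst` and `find` are hypothetical names for illustration and are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SegmentLookup {
    /** Hypothetical stand-in for one index segment: term -> doc ID local to the segment. */
    static final class Segment {
        final String name;
        final Map<String, Integer> terms = new TreeMap<>();

        Segment(String name) {
            this.name = name;
        }
    }

    /** Which segment matched, and the doc ID local to that segment. */
    static final class Hit {
        final String segment;
        final int localDocId;

        Hit(String segment, int localDocId) {
            this.segment = segment;
            this.localDocId = localDocId;
        }
    }

    /** Orders segments largest first; meant to be done once per reader, not per lookup. */
    static List<Segment> largestFirst(List<Segment> segments) {
        List<Segment> ordered = new ArrayList<>(segments);
        ordered.sort(Comparator.comparingInt((Segment s) -> s.terms.size()).reversed());
        return ordered;
    }

    /** Probes segments in the given order and stops at the first one holding the ID. */
    static Hit find(List<Segment> orderedSegments, String id) {
        for (Segment segment : orderedSegments) {
            Integer doc = segment.terms.get(id);
            if (doc != null) {
                return new Hit(segment.name, doc); // early exit: remaining segments are skipped
            }
        }
        return null; // not present in any segment
    }

    public static void main(String[] args) {
        // Illustrative data only.
        Segment small = new Segment("small");
        small.terms.put("u1", 0);
        Segment large = new Segment("large");
        large.terms.put("u2", 0);
        large.terms.put("u3", 1);
        Hit hit = find(largestFirst(List.of(small, large)), "u3");
        System.out.println(hit.segment + ":" + hit.localDocId);
    }
}
```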

TX




Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Robert Muir
That makes no sense at all, it would make it slow as shit.

I am tired of repeating this:
Don't use BINARY docvalues
Don't use BINARY docvalues
Don't use BINARY docvalues

Use types like SORTED/SORTED_SET which will compress the term
dictionary and make use of ordinals in your application instead.
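
To illustrate what "ordinals" means here, the following toy model stores each distinct value once in a sorted dictionary and keeps only a small integer per document. It is a sketch of the concept, not Lucene's on-disk format; `OrdinalEncoding` and its members are hypothetical names, and the sample values are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

final class OrdinalEncoding {
    /** Sorted distinct values; each is stored exactly once. */
    final String[] dictionary;
    /** ordinals[docId] is the index of that document's value in the dictionary. */
    final int[] ordinals;

    OrdinalEncoding(List<String> valuePerDoc) {
        dictionary = new TreeSet<>(valuePerDoc).toArray(new String[0]);
        ordinals = new int[valuePerDoc.size()];
        for (int doc = 0; doc < ordinals.length; doc++) {
            // The dictionary is sorted, so binary search yields the ordinal.
            ordinals[doc] = Arrays.binarySearch(dictionary, valuePerDoc.get(doc));
        }
    }

    /** Looks the value up again through its ordinal. */
    String value(int docId) {
        return dictionary[ordinals[docId]];
    }

    public static void main(String[] args) {
        OrdinalEncoding enc =
                new OrdinalEncoding(List.of("Reading", "Expression", "Reading", "Reading"));
        System.out.println(Arrays.toString(enc.dictionary));
        System.out.println(Arrays.toString(enc.ordinals));
    }
}
```

Comparing or grouping documents can then work on the integers alone, which is where the application-side benefit comes from. When every document has its own distinct value, the dictionary is as large as the data and this sharing gains nothing.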



On Sat, Aug 8, 2015 at 10:19 AM, Olivier Binda  wrote:
> Greetings
>
> are there any plans to implement compression of the variable-length bytes[]
> binary doc values,
> say in blocks of 16k like for stored values?
>
> my .cfs file goes from 2MB to like 400k when I zip it
>
> Best regards,
> Olivier
>
>
>
> On 08/08/2015 02:32 PM, jamie wrote:
>>
>> Greetings
>>
>> Our app primarily uses Lucene for its intended purpose i.e. to search
>> across large amounts of unstructured text. However, recently our requirement
>> expanded to perform look-ups on specific documents in the index based on
>> associated custom defined unique keys. For our purposes, a unique key is the
>> string representation of a 128 bit murmur hash, stored in a Lucene field
>> named uid.  We are currently using the TermsFilter to lookup Documents in
>> the Lucene index as follows:
>>
>> List<Term> terms = new LinkedList<>();
>> for (String id : ids) {
>> terms.add(new Term("uid", id));
>> }
>> TermsFilter idFilter = new TermsFilter(terms);
>> ... search logic...
>>
>> At any time we may need to look up, say, a couple of thousand documents. Our
>> problem is one of performance. On very large indexes with 30 million records
>> or more, the lookup can be excruciatingly slow. At this stage, it's not
>> practical for us to move the data over to a fit-for-purpose database, nor to
>> change the uid field to a numeric type. I fully appreciate the fact that
>> Lucene is not designed to be a database; however, is there anything we can
>> do to improve the performance of these look-ups?
>>
>> Much appreciated
>>
>> Jamie
>>
>>
>
>




Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Toke Eskildsen
Robert Muir  wrote:
> I am tired of repeating this:
> Don't use BINARY docvalues
> Don't use BINARY docvalues
> Don't use BINARY docvalues

> Use types like SORTED/SORTED_SET which will compress the term
> dictionary and make use of ordinals in your application instead.

This seems contrary to
http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html

Maybe you could update the JavaDoc for that field to warn against using it?

- Toke Eskildsen




Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Arjen van der Meijden


On 9-8-2015 16:22, Toke Eskildsen wrote:
> Robert Muir  wrote:
>> I am tired of repeating this:
>> Don't use BINARY docvalues
>> Don't use BINARY docvalues
>> Don't use BINARY docvalues
>> Use types like SORTED/SORTED_SET which will compress the term
>> dictionary and make use of ordinals in your application instead.
> This seems contrary to
> http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html
>
> Maybe you could update the JavaDoc for that field to warn against using it?
It (probably) depends on the contents of the values. If the number of
distinct values is roughly equal to the number of documents, the javadoc
suggests the binary docvalues are a valid choice.

That's this part:
"The values are stored directly with no sharing, which is a good fit
when the fields don't share (many) values, such as a title field."

If there are (far) fewer distinct values than documents, Robert's reply
and the documentation suggest the same:
" If values may be shared and sorted it's better to use
SortedDocValuesField."

So as soon as compression of smallish values starts making sense due to
repetition amongst documents, it may be time to move away from the
BinaryDocValuesField towards another variant.

If only parts of the values are repeated (for instance something like 
e-mail addresses where many will end with 'gmail.com' and 'outlook.com')
it becomes more complicated.

Best regards,

Arjen




Re: Mapping doc values back to doc ID (in decent time)

2015-08-09 Thread András Péteri
If I understand it correctly, the Zoie library [1][2] implements the
"sledgehammer" approach by collecting docValues for all documents when a
segment reader is opened. If you have some RAM to throw at the problem,
this could indeed bring you an acceptable level of performance.

[1] http://senseidb.github.io/zoie/
[2]
https://github.com/senseidb/zoie/blob/master/zoie-core/src/main/java/proj/zoie/api/impl/DocIDMapperImpl.java
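
A rough sketch of the "sledgehammer" idea, assuming the per-document uid values have already been read into an array when the reader is opened. This is not Zoie's code; `InMemoryDocIdMapper` and `docId` are hypothetical names, and the cost is one map entry per document held in RAM.

```java
import java.util.HashMap;
import java.util.Map;

final class InMemoryDocIdMapper {
    private final Map<String, Integer> uidToDoc;

    /** uidPerDoc[docId] is the uid of that document; the map is built once per reader. */
    InMemoryDocIdMapper(String[] uidPerDoc) {
        uidToDoc = new HashMap<>(uidPerDoc.length * 2);
        for (int docId = 0; docId < uidPerDoc.length; docId++) {
            uidToDoc.put(uidPerDoc[docId], docId);
        }
    }

    /** Returns the doc ID for the uid, or -1 if the uid is unknown. */
    int docId(String uid) {
        return uidToDoc.getOrDefault(uid, -1);
    }

    public static void main(String[] args) {
        // Illustrative data only.
        InMemoryDocIdMapper mapper = new InMemoryDocIdMapper(new String[] {"k7", "k2", "k9"});
        System.out.println(mapper.docId("k2"));
        System.out.println(mapper.docId("nope"));
    }
}
```

Because the map is keyed to doc IDs, it has to be rebuilt whenever a new reader is opened.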

On Sun, Aug 9, 2015 at 9:41 AM, Trejkaz  wrote:

> On Fri, Aug 7, 2015 at 5:34 PM, Adrien Grand  wrote:
> > Does your application actually iterate in order over dense ids, or is
> > it just for benchmarking purposes? Because if it does, you probably
> > don't actually need seeking, you could just see what the current ID in
> > the terms enum is.
>
> Both dense ID fetches and individual ID fetches exist in the
> application. I put them in a benchmark deliberately doing it as
> individual fetches to get an idea of average timing for a single
> operation.
>
> There are so many use cases of doing the individual fetches that it's
> tough to enumerate. The first one I found was "fetch the term vector
> for ID + field" but I'm sure there will be tons of them.
>
> For mapping a dense set of IDs to doc IDs (e.g. for filtering), I
> would probably use something like DocValuesTermsQuery for that to get
> them all in one shot. I also wondered whether writing our filters as
> queries would help, but I think it would turn out to be about as fast
> as DocValuesTermsQuery even if I did that.
>
> I'm sure the only way to really improve the speed of these filters is
> to start storing these things in the text index and use query-time
> joins, but I can't do that until I solve the issue of relying on
> stable doc IDs and it seems like trying to solve two large problems in
> a single commit would be biting off more than I can chew.
>
> > If you actually need seeking, then you should try
> > to avoid MultiFields, it will call seekExact on each segment, while
> > given what I see you could just stop after you found one segment with
> > the value.
>
> Ah, I did wonder whether MultiFields had any behaviour like that, so
> that definitely means that I will avoid using it. Then I can try other
> tricks, like trying the seeks in order of segment size (the largest
> segment is most likely to contain the hit.)
>
> TX
>
-- 
András


Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Olivier Binda

On 08/09/2015 04:55 PM, Arjen van der Meijden wrote:
> On 9-8-2015 16:22, Toke Eskildsen wrote:
>> Robert Muir  wrote:
>>> I am tired of repeating this:
>>> Don't use BINARY docvalues
>>> Don't use BINARY docvalues
>>> Don't use BINARY docvalues
>>> Use types like SORTED/SORTED_SET which will compress the term
>>> dictionary and make use of ordinals in your application instead.
>> This seems contrary to
>> http://lucene.apache.org/core/5_2_0/core/org/apache/lucene/document/BinaryDocValuesField.html
>>
>> Maybe you could update the JavaDoc for that field to warn against using it?
> It (probably) depends on the contents of the values. If the number of
> distinct values is roughly equal to the number of documents, the javadoc
> suggests the binary docvalues are a valid choice.

My values are unique and equal to the number of documents,
They have varying sizes, say at least 10 bytes and may be a lot bigger
(say 4kbytes)

I don't share, index or sort them.
I don't do grouping/faceting either

I only want to store, retrieve and traverse those values


> That's this part:
> "The values are stored directly with no sharing, which is a good fit
> when the fields don't share (many) values, such as a title field."
>
> If there are (far) fewer distinct values than documents, Robert's reply
> and the documentation suggest the same:
> " If values may be shared and sorted it's better to use
> SortedDocValuesField."
>
> So as soon as compression of smallish values starts making sense due to
> repetition amongst documents, it may be time to move away from the
> BinaryDocValuesField towards another variant.
>
> If only parts of the values are repeated (for instance something like
> e-mail addresses where many will end with 'gmail.com' and 'outlook.com')
> it becomes more complicated.


At the moment, there are some repeated parts inside a value, but a lot of
repeated parts across docIds, like "Expression" and "Reading".
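
Repetition across values is the case where compressing several values together pays off and compressing each value alone does not. The sketch below compares the two with the JDK's Deflater, which is only a stand-in and not the codec Lucene uses for stored fields; `BlockCompressionDemo` and its methods are hypothetical names, and the sample values are made up around the "Expression" and "Reading" strings mentioned above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

final class BlockCompressionDemo {
    /** Returns the number of bytes produced by deflating the input as one stream. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buffer = new byte[4096];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buffer);
            }
            return total;
        } finally {
            deflater.end(); // release native resources
        }
    }

    /** Total size when every value is compressed on its own. */
    static int perValueSize(List<byte[]> values) {
        int total = 0;
        for (byte[] value : values) {
            total += deflatedSize(value);
        }
        return total;
    }

    /** Total size when values are concatenated into blocks of about blockBytes first. */
    static int blockedSize(List<byte[]> values, int blockBytes) {
        int total = 0;
        ByteArrayOutputStream block = new ByteArrayOutputStream();
        for (byte[] value : values) {
            block.write(value, 0, value.length);
            if (block.size() >= blockBytes) {
                total += deflatedSize(block.toByteArray());
                block.reset();
            }
        }
        if (block.size() > 0) {
            total += deflatedSize(block.toByteArray()); // last, partially filled block
        }
        return total;
    }

    public static void main(String[] args) {
        List<byte[]> values = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            values.add(("Expression:" + i + ";Reading:" + (i * 7))
                    .getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("per value: " + perValueSize(values));
        System.out.println("16k blocks: " + blockedSize(values, 16 * 1024));
    }
}
```

The price of the smaller size is that reading a single value means decompressing the whole block it lives in, and a real format would also need per-block offsets to locate a value.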


Also, I'm stuck with using Lucene 4.7.0 (or 4.7.2) because starting with
version 4.8, Lucene uses try-with-resources and this isn't supported on
Android before Android 4.4



   SortedDocValuesField stores a per-document BytesRef value,
   indexed for sorting.

   If you also need to store the value, you should add a
   separate StoredField instance.



I actually went with the binaryDocValues because I thought that
DocValues were way more efficient than the pre-4.0 fields for storing
stuff (like only using 1 seek/read ... with mmap ...), especially for
traversal.

In my app, I traverse all binaryDocValues in increasing docId order,
deserialize my docValues (lightning fast with FlatBuffers, no object
creation -> complex objects) and do some stats.

Would I be able to do that as efficiently with a StoredField?

Apparently, only StoredFields are compressed:

   CompressingStoredFieldsFormat

Maybe I should use that (and ditch the useless docValue or make it
store a BytesRef) to get compression?

Many thanks for all the insights, :)
Olivier


> Best regards,
>
> Arjen







RE: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Uwe Schindler
Hi,

> My values are unique and equal to the number of documents, They have
> varying sizes, say at least 10 bytes and may be a lot bigger (say  4kbytes)
> 
> I don't share, index or sort them.
> I don't do grouping/faceting either
> 
> 
> I only want to store, retrieve and traverse those values

Then use stored fields.

Uwe





Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Toke Eskildsen
Arjen van der Meijden  wrote:
> On 9-8-2015 16:22, Toke Eskildsen wrote:
> > Maybe you could update the JavaDoc for that field to warn against using it?
> It (probably) depends on the contents of the values.

That was my impression too, but we both seem to be second-guessing Robert's 
very non-nuanced and clearly oft-repeated recommendation. I hope Robert can 
shed some light on this and tell us if he finds the JavaDocs to be in order or 
if binary DocValues should not be used at all.

- Toke Eskildsen




Re: Compressing docValues with variable length bytes[] by block of 16k ?

2015-08-09 Thread Olivier Binda

On 08/09/2015 06:29 PM, Uwe Schindler wrote:
> Hi,
>
>> My values are unique and equal to the number of documents, They have
>> varying sizes, say at least 10 bytes and may be a lot bigger (say  4kbytes)
>>
>> I don't share, index or sort them.
>> I don't do grouping/faceting either
>>
>> I only want to store, retrieve and traverse those values
>
> Then use stored fields.

ok, I'll try.

Thanks !

> Uwe





GROUP BY in Lucene

2015-08-09 Thread Gimantha Bandara
Hi all,

Is there a way to achieve $subject? For example, consider the following SQL
query.

SELECT A, B, C, SUM(D) AS E FROM `table` WHERE time BETWEEN fromDate AND
toDate GROUP BY X, Y, Z

In the above query we can group the records by X, Y, Z. Is there a way to
achieve the same in Lucene? (I guess faceting would help, but is it
possible to get all the categoryPaths along with the matching records?) Is
there any way other than using facets?

-- 
Gimantha Bandara
Software Engineer
WSO2. Inc : http://wso2.com
Mobile : +94714961919
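
For reference, the aggregation the SQL query asks for can be modelled in plain Java as a map keyed by the (X, Y, Z) tuple, summing D for every matching record. This is only a sketch of the grouping step under stated assumptions: `GroupBySketch`, `Row` and `sumByGroup` are hypothetical names, the rows are made-up data, and how the fields would be read for each matching Lucene document is left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GroupBySketch {
    /** Hypothetical record shape: three grouping keys and one value to sum. */
    static final class Row {
        final String x;
        final String y;
        final String z;
        final long d;

        Row(String x, String y, String z, long d) {
            this.x = x;
            this.y = y;
            this.z = z;
            this.d = d;
        }
    }

    /** Equivalent of SUM(D) ... GROUP BY X, Y, Z over rows that already passed the WHERE clause. */
    static Map<List<String>, Long> sumByGroup(Iterable<Row> matchingRows) {
        Map<List<String>, Long> sums = new LinkedHashMap<>();
        for (Row row : matchingRows) {
            // Lists compare by content, so equal (x, y, z) tuples share one map entry.
            sums.merge(Arrays.asList(row.x, row.y, row.z), row.d, Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Row> rows = List.of(
                new Row("a", "b", "c", 2),
                new Row("a", "b", "c", 3),
                new Row("a", "b", "q", 10));
        sumByGroup(rows).forEach((group, sum) -> System.out.println(group + " -> " + sum));
    }
}
```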