Re: [EXTERNAL] - Re: Is docvalue sorted by value?

2018-03-07 Thread Adrien Grand
Some guidance on the Lucene APIs, since the original question was about
Lucene. Use `IndexWriterConfig.setIndexSort` to configure the index sort
at index time. At search time, `TopFieldCollector` takes a
`boolean trackTotalHits`; set it to `false` and Lucene will automatically
terminate collection early when the search sort order matches the index
sort order. This option has only been available since Lucene 7.2, so on
an older release you will need to use `EarlyTerminatingSortingCollector`
instead.
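
To make the trade-off behind `trackTotalHits` concrete, here is a minimal plain-Java model. It is not Lucene code and does not call any Lucene API: the class, the `Result` type and the `collect` method are made up for this sketch. It assumes an index-sorted segment, so hits arrive in doc ID order and that order already matches the requested sort.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model only, not Lucene code. With an index sort, doc IDs within
// a segment follow the sort order, so hits arrive already sorted.
class EarlyTerminationModel {

    // Result of one collection pass (hypothetical type for this sketch).
    static final class Result {
        final List<Integer> topDocs = new ArrayList<>();
        int visited;   // how many hits the collector looked at
        int totalHits; // exact only when trackTotalHits is true
    }

    // matchingDocIds must be in increasing doc ID order, as a segment scan yields them.
    static Result collect(int[] matchingDocIds, int k, boolean trackTotalHits) {
        Result r = new Result();
        for (int docId : matchingDocIds) {
            if (r.topDocs.size() >= k && !trackTotalHits) {
                break; // the first k hits are the top k; later hits cannot compete
            }
            r.visited++;
            r.totalHits++;
            if (r.topDocs.size() < k) {
                r.topDocs.add(docId);
            }
        }
        return r;
    }

    public static void main(String[] args) {
        int[] hits = {3, 8, 15, 16, 23, 42};
        Result fast = collect(hits, 2, false);
        Result exact = collect(hits, 2, true);
        System.out.println(fast.topDocs + " visited=" + fast.visited);
        System.out.println(exact.topDocs + " visited=" + exact.visited
                + " total=" + exact.totalHits);
    }
}
```

In this model both calls return docs 3 and 8, but the first visits only 2 hits while the second visits all 6 in order to report an exact hit count. That is the price of asking for the total.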

Re: [EXTERNAL] - Re: Is docvalue sorted by value?

2018-03-06 Thread Tony Ma
Thanks Erick!!
Index sorting and early termination are what I am looking for.

Re: [EXTERNAL] - Re: Is docvalue sorted by value?

2018-03-06 Thread Erick Erickson
OK, you're asking a different question I think.

See SOLR-5730 and SOLR-8621, particularly SOLR-5730. This will work
only for a single field, which you decide at index time. You can still
sort by any field at the same expense as now, but since your docs are
ordered by one field, the early termination part won't be applicable
to other fields.

Best,
Erick

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: 

Re: [EXTERNAL] - Re: Is docvalue sorted by value?

2018-03-05 Thread Tony Ma
Hi Erick,

I raised this question because of the sorting scenario you mentioned in #2.

Suppose there are about 100 hit docs and my query just wants the top 2. If the
values are not sorted, it has to iterate over all 100 docs and find the top 2
with a priority queue. If the values are already sorted, it only needs to
iterate over the first 2. If the query is unselective, the number of hit docs
might be huge, so whether the values are pre-sorted makes a big difference.

I understand your point that if the doc values are not persisted in doc id
order, a field value cannot be retrieved by doc id.

Actually, I am just wondering how Lucene handles the sorting scenario: is
iterating over all values of all docs unavoidable?
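
For reference, the unsorted case described above can be sketched in plain Java as follows. This is a conceptual sketch, not Lucene's collector: the class and method names are hypothetical, and it assumes an ascending sort on a single `long` value per hit.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

// Conceptual sketch, not Lucene's collector: top-k smallest values when the
// hits arrive in an order unrelated to the sort field.
class TopKByQueue {

    // values[i] is the sort value of the i-th hit; returns the k smallest, ascending.
    static long[] topK(long[] values, int k) {
        if (k <= 0) {
            return new long[0];
        }
        // Max-heap of at most k entries: the head is the worst value currently kept.
        PriorityQueue<Long> heap = new PriorityQueue<>((a, b) -> Long.compare(b, a));
        for (long v : values) { // every hit has to be visited
            if (heap.size() < k) {
                heap.add(v);
            } else if (v < heap.peek()) {
                heap.poll();
                heap.add(v);
            }
        }
        long[] out = new long[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll(); // largest kept value comes out first
        }
        return out;
    }

    public static void main(String[] args) {
        long[] values = {42, 7, 19, 3, 88, 5};
        System.out.println(Arrays.toString(topK(values, 2))); // prints [3, 5]
    }
}
```

The queue keeps the work per hit small (roughly log k), but the loop still touches every hit, which is exactly the cost that grows with an unselective query.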


Re: Is docvalue sorted by value?

2018-03-05 Thread Erick Erickson
I think there are two issues here that are being conflated
1> _within_ a document, i.e. for a multi-valued field the values are
stored as Dominik says as a SORTED_SET. Not only will they be returned
(if you return from docValues rather than stored) in lexical order,
but identical values will be collapsed

2> across multiple documents, the question about  "...persisted with
order of values, not document id..." really makes no sense. The point
of DocValues is to answer the question "for document X what is the
value of field Y". X here is the _internal_ document ID. Now consider
a search. There are two documents that are hits, doc 35 and doc 198
(internal lucene doc ID). To sort them by field Y you have to know
what the value in that field is for those two docs. How would
"pre-ordering" the values help here? If I have the _values_ in order,
I have no clue what docs are associated with them. That question is
what the "inverted index" is there to answer.
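
Point 1 above (values within one document come back sorted, with identical values collapsed) can be modelled with a `TreeSet`. This is only an analogy in plain Java, not how Lucene encodes SORTED_SET doc values; the class name is made up, and Java's natural `String` ordering stands in for Lucene's term ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Analogy only: models the per-document behaviour of a sorted set of values.
class SortedSetModel {

    // Returns one document's values the way a sorted set would expose them:
    // in sorted order, with duplicates collapsed.
    static List<String> asSortedSet(String... fieldValues) {
        return new ArrayList<>(new TreeSet<>(Arrays.asList(fieldValues)));
    }

    public static void main(String[] args) {
        // Insertion order and the duplicate "pear" are both lost.
        System.out.println(asSortedSet("pear", "apple", "pear", "fig")); // prints [apple, fig, pear]
    }
}
```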

So I have doc 35 and 198. Think of DocValues as a large array indexed
by internal doc id. To know how these two docs sort all I have to do
is index into the array. It's slightly more complicated than that, but
conceptually that's what happens.
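
The array picture can be written down directly. The following is a plain-Java sketch of that mental model, not Lucene's actual doc values storage; the class name, the array size and the field values are invented for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

// Mental model only: doc values as an array indexed by internal doc ID.
class DocValuesArrayModel {

    // Sorts the hit doc IDs by their field value, looked up by doc ID.
    static Integer[] sortHits(long[] valueByDocId, Integer... hitDocIds) {
        Integer[] sorted = hitDocIds.clone();
        Arrays.sort(sorted, Comparator.comparingLong((Integer docId) -> valueByDocId[docId]));
        return sorted;
    }

    public static void main(String[] args) {
        long[] fieldY = new long[200]; // one slot per internal doc ID (made-up data)
        fieldY[35] = 900L;
        fieldY[198] = 120L;
        // Doc 198 has the smaller value, so it sorts first.
        System.out.println(Arrays.toString(sortHits(fieldY, 35, 198))); // prints [198, 35]
    }
}
```

Each comparison is just two array lookups, which is why the values do not need to be stored in value order for sorting to work.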

Best,
Erick

Re: Is docvalue sorted by value?

2018-03-05 Thread Dominik Safaric
> So, can doc values be persisted with order of values, not document id? This 
> should be fast in sort scenario that the values are pre-ordered instead of 
> scan/sort at runtime.


No, unfortunately doc values cannot be persisted in order. Lucene stores these
values internally as a DocValuesType.SORTED_SET, where the values are kept in
sorted order using, for example, Long.compareTo().

If you'd like to retrieve the values in insertion order, use stored fields
instead of doc values. Then you can access the values in order using the
LeafReader's document function. However, beware that this may cause
performance issues because it requires loading the document from disk.

If you need to store and retrieve multiple numeric values per document in
order, you might consider using PointValues. PointValues are internally indexed
with KD-trees. But beware that PointValues have limited dimensionality: for
example, you can store values in at most 8 dimensions, each of at most 16
bytes.

Is docvalue sorted by value?

2018-03-05 Thread Tony Ma
As I understand it, doc values (binary doc values / numeric doc values) are
stored in document id order. Sorted numeric doc values just means that if a
document has multiple values, the values are sorted within that document, but
across different documents the values are still ordered by document id. Is that
true?
So, can doc values be persisted in order of value rather than document id?
Sorting should then be fast, because the values are pre-ordered instead of
being scanned and sorted at runtime.