[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-28 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702126#comment-16702126
 ] 

Toke Eskildsen commented on SOLR-13013:
---

Ah! I understand now, thanks. Guess I got a bit too focused on index data IO.

> Change export to extract DocValues in docID order
> -
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Export Writer
>Affects Versions: 7.5, master (8.0)
>Reporter: Toke Eskildsen
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: SOLR-13013_proof_of_concept.patch, 
> SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for 
> paging through the result set in a given sort order. Each time a window has 
> been calculated, the values for the export fields are retrieved from the 
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support 
> random access. The current export implementation bypasses this by creating a 
> new DocValues-iterator for each individual value to retrieve. This slows down 
> export as the iterator has to seek to the given docID from start for each 
> value. The slowdown scales with shard size (see LUCENE-8374 for details). An 
> alternative is to extract the DocValues in docID-order, with re-use of 
> DocValues-iterators. The idea is as follows:
>  # Change the FieldWriters for export to re-use the DocValues-iterators if 
> subsequent requests are for docIDs higher than the previous ones
>  # Calculate the sliding window of SortDocs as usual
>  # Take a note of the order of the SortDocs in the sliding window
>  # Re-sort the SortDocs in docID-order
>  # Extract the DocValues to a temporary on-heap structure
>  # Re-sort the extracted values to the original sliding window order
> # Deliver the values
> One big difference from the current export code is of course the need to hold 
> the whole sliding-window-sized partial result set in memory. This might well be a 
> showstopper, as there is no real limit to how large this partial result set 
> can be. Maybe such an optimization could be requested explicitly if the user 
> knows that there is enough memory?
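
To make the numbered steps above concrete, here is a minimal sketch of the intended flow. It is illustrative only: the class and method names ({{WindowExtractor}}, {{DocValueReader}}) are invented for the example and do not match the actual patch.
{code}
// Sketch only: re-sort one sliding window by docID, extract DocValues with
// forward-only readers, then restore the original (sort order) sequence.
final class WindowExtractor {

  /** Illustrative stand-in for a per-field DocValues accessor that only moves forward. */
  interface DocValueReader {
    Object valueForDoc(int docId) throws java.io.IOException;
  }

  static final class Slot {
    final int docId;
    final int originalPosition; // position in the sliding window's sort order
    Object[] values;            // one extracted value per export field
    Slot(int docId, int originalPosition) {
      this.docId = docId;
      this.originalPosition = originalPosition;
    }
  }

  /** Extracts the values for one sliding window whose docIDs arrive in sort order. */
  static Slot[] extract(int[] docIdsInSortOrder, DocValueReader[] fieldReaders)
      throws java.io.IOException {
    // Steps 2+3: take note of the original order of the window.
    Slot[] slots = new Slot[docIdsInSortOrder.length];
    for (int i = 0; i < slots.length; i++) {
      slots[i] = new Slot(docIdsInSortOrder[i], i);
    }
    // Step 4: re-sort the window by docID so the readers only move forward.
    java.util.Arrays.sort(slots, java.util.Comparator.comparingInt(s -> s.docId));
    // Step 5: extract the DocValues in docID order, re-using one reader per field (step 1).
    for (Slot slot : slots) {
      slot.values = new Object[fieldReaders.length];
      for (int f = 0; f < fieldReaders.length; f++) {
        slot.values[f] = fieldReaders[f].valueForDoc(slot.docId);
      }
    }
    // Step 6: restore the original sliding-window order before delivery (step 7).
    java.util.Arrays.sort(slots, java.util.Comparator.comparingInt(s -> s.originalPosition));
    return slots;
  }
}
{code}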






[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-28 Thread Joel Bernstein (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702034#comment-16702034
 ] 

Joel Bernstein commented on SOLR-13013:
---

 

You're exactly right that improving performance of export helps the MapReduce 
use cases as well. It's just that in a sharded, replicated environment with a 
tier of worker nodes performing a reduce operation, you can get massive 
throughput already just because you can have dozens of servers pushing out an 
export and reducing in parallel.

But you could easily argue that your use case is the more common one and we 
should really try to make it as fast as possible.

I wouldn't worry too much about testing this in sharded scenarios. We can 
extrapolate the single-shard findings to multiple shards, realizing that the 
aggregator node will quickly become the bottleneck and the /export will spend 
much of its time blocked while writing data. Having a tier of worker nodes 
unlocks this bottleneck in the case where worker nodes are performing some form 
of reduce operation.

 




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-28 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701946#comment-16701946
 ] 

Toke Eskildsen commented on SOLR-13013:
---

[~joel.bernstein] Unfortunately I don't have proper hardware at hand to test 
with our large shards in a multi-shard setup. I _could_ put them on a spinning 
drive, now that I think about it, but I am also afraid that my test-box does 
not have adequate memory to fully cache the DocValues structures when using 
multiple shards, so that would complicate testing somewhat. I'll see what else 
we have lying around and if nothing else, I could just delete 3/4th of the data 
in 4 of the shards and run with those instead (takes some days to do though).

Up until now we have used export exclusively to do simple query-based 
data dumps, so that was my go-to case. It is probably due to my limited 
understanding of Streaming Expressions that I do not understand the 
methodological problem in my test:

I get that multi-sharding, replicas and hashing (I'm a bit unsure about the hashing 
part) can distribute and parallelize the load to make processing faster, but 
only the "more and smaller shards" option of those three would reduce the total 
amount of work, as I understand it? So with regard to that, any optimization to the 
export should work equally well for a single-shard simple export and a more 
complex distributed setup, measured as total work to be done?

I am on (even) shakier ground with the local reduce operation. Isn't that 
a step after the export part and therefore extremely dependent on raw export 
speed? Or is there some shortcut mechanism I haven't understood?




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-28 Thread Joel Bernstein (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701898#comment-16701898
 ] 

Joel Bernstein commented on SOLR-13013:
---

Interesting findings. I can work on getting this patch committed, possibly for 
the 8.0 release.

A couple of thoughts about the design of the /export handler.

The /export handler was very much designed to support MapReduce operations 
(distributed grouping, rollups, relational algebra) in Streaming Expressions. 
Scaling these MapReduce operations took the following path:

1) Sharding: The /export handler benefits tremendously from sharding. The 
benefits go well beyond linear. This is because 2 shards both double the 
computing power and more than halve the amount of work that needs to be done by 
each shard. 

3) Hash partitioning and worker collections: Sharding very quickly causes 
bottlenecks on a single aggregator node. The Streaming Expressions parallel 
function, combined with the hash partitioner, allows the /export to be 
partitioned into X slices and brings into play not just the shards but also 
the replicas. When a reduce operation happens on the worker nodes 
(rollups, innerJoins) that limits the number of records emitted in the final 
stream, this is an extremely powerful scaling tool.

So, from a pure /export standpoint with no reduce operation, all from a single 
shard, you are working somewhat against the design goals of the system. That 
being said, the faster we make the pure export from a single shard, the more 
use cases the /export handler serves.
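
As a rough illustration of that scaling path, the sketch below submits a parallel rollup over /export through SolrJ's streaming client. The collection names ({{logs}}, {{workers}}), field names, worker count and base URL are assumptions made up for the example, not something taken from this issue.
{code}
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.client.solrj.io.stream.TupleStream;
import org.apache.solr.common.params.ModifiableSolrParams;

public class ParallelExportExample {
  public static void main(String[] args) throws Exception {
    // Each worker pulls a hash-partitioned slice of the /export stream and reduces it
    // locally, so the aggregator only ever sees the (much smaller) rolled-up result.
    String expr =
        "parallel(workers,"
      + "  rollup("
      + "    search(logs, q=\"*:*\", qt=\"/export\", fl=\"host_s,bytes_l\","
      + "           sort=\"host_s asc\", partitionKeys=\"host_s\"),"
      + "    over=\"host_s\", sum(bytes_l)),"
      + "  workers=\"4\", sort=\"host_s asc\")";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("qt", "/stream");
    params.set("expr", expr);

    TupleStream stream = new SolrStream("http://localhost:8983/solr/workers", params);
    try {
      stream.open();
      for (Tuple tuple = stream.read(); !tuple.EOF; tuple = stream.read()) {
        System.out.println(tuple.getString("host_s") + " -> " + tuple.get("sum(bytes_l)"));
      }
    } finally {
      stream.close();
    }
  }
}
{code}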




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-28 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701727#comment-16701727
 ] 

Toke Eskildsen commented on SOLR-13013:
---

I cherry-picked some DENSE fields from our netarchive index and tried exporting 
them from a single shard, to demonstrate the problem with large indexes in 
Lucene/Solr 7+ and to performance test the current patch.

I made sure everything was warmed (practically zero IO on the index-SSD 
according to iostat) and tested with combinations of SOLR-13013 and LUCENE-8374 
turned on and off:
{code}
> curl -s "http://localhost:9090/solr/ns80/select?q=*:*; | jq .response.numFound
307171504

> curl -s "http://localhost:9090/solr/ns80/select?q=text:hestevogn; | jq 
> .response.numFound'
52654

> curl -s -w "%\{time_total} seconds"$\'\n\' 
> "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=true=true;
>  -o t_export_true_true
0.433661 seconds

> curl -s -w "%\{time_total} seconds"$\'\n\' 
> "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=true=false;
>  -o t_export_true_false
0.555844 seconds

> curl -s -w "%\{time_total} seconds"$\'\n\' 
> "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=false=true;
>  -o t_export_false_true
1.037004 seconds

> curl -s -w "%\{time_total} seconds"$\'\n\' 
> "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=false=false;
>  -o t_export_false_false
843.477925 seconds

> diff -s t_export_true_true t_export_true_false ; diff -s t_export_true_true 
> t_export_false_true ; diff -s t_export_true_true t_export_false_false
Files t_export_true_true and t_export_true_false are identical
Files t_export_true_true and t_export_false_true are identical
Files t_export_true_true and t_export_false_false are identical
{code}
Observations from this ad-hoc test (which of course should be independently 
verified):
 * Exporting from a large index with vanilla Solr master is not ideal. It does 
not make much sense to talk about which performance factors the patches provide, 
as they are mostly about changing time complexity: our factor-1500 speed-up with 
SOLR-13013 on this shard with this request will be something else entirely for 
other setups.
 * The explicit sort in SOLR-13013 seems the superior solution and the addition 
of the O(n) → O(1) lookup-improvement in LUCENE-8374 only makes it slightly 
faster.
 * On the other hand, LUCENE-8374 works quite well for export and does not 
require any changes to the export code. This might influence whether energy 
should be spent on a "best as possible" fallback in case of memory problems or 
whether a simpler "full fallback to sliding window sort order" is preferable.
 * On the gripping hand, testing with a smaller index is likely to result in 
SOLR-13013 being (relative to LUCENE-8374) even faster, as SOLR-13013 avoids 
re-opening DV-readers all the time. More testing needed (no surprise there).


[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-26 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699385#comment-16699385
 ] 

Toke Eskildsen commented on SOLR-13013:
---

[~joel.bernstein] I am glad that it looks useful. I expect that it needs at 
least a full re-implementation of the {{MapWriter}}-parts. I am unfamiliar with 
that part of the code, and it would be great if you took over. I won't take any 
offense if you rewrite everything.

I'd be happy to try and review or at least do some testing on our 300M 
docs/segment shards as they are very affected by the DV API change.




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-26 Thread Joel Bernstein (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699241#comment-16699241
 ] 

Joel Bernstein commented on SOLR-13013:
---

I've read through the patch and it looks like a big win. We can probably trade 
off window size if memory is an issue.

The 30,000 magic number was chosen after a lot of testing to determine the best 
window size: large enough to limit the number of passes over the results, without 
bogging down the calls to an overly large priority queue. Up to about 30,000 I 
was seeing performance improvements. As the window size drops and more passes 
need to be made over the data, performance drops. 
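
As a back-of-the-envelope illustration of that trade-off (the numbers below are made up, not measurements): fewer, larger windows mean fewer passes over the matching documents, but each insert into the top-N priority queue gets a little more expensive as the queue grows.
{code}
// Illustrative arithmetic only: passes over the result set vs. approximate
// per-document priority-queue work for a few window sizes.
public class WindowSizeTradeoff {
  public static void main(String[] args) {
    long numFound = 5_000_000L; // assumed size of the matching result set
    int[] windowSizes = {1_000, 10_000, 30_000, 100_000};
    for (int window : windowSizes) {
      long passes = (numFound + window - 1) / window;            // full sweeps over the matches
      double comparisonsPerDoc = Math.log(window) / Math.log(2); // ~cost of one heap insert
      System.out.printf("window=%,d  passes=%,d  ~comparisons/doc=%.1f%n",
          window, passes, comparisonsPerDoc);
    }
  }
}
{code}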

[~toke], how would you like to proceed? Do you want to commit this yourself or 
would you like me to work on the patch further and get it committed? I can 
spend time testing in either case.




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-26 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16699102#comment-16699102
 ] 

Yonik Seeley commented on SOLR-13013:
-

bq. Are you thinking about making something generic? Maybe a bulk request 
wrapper for doc values, that temporarily re-sorts internally?

Yep.  Something that collects out-of-order docids along with other value 
sources that should be internally retrieved mostly in-order.
 It shouldn't slow up this issue though. I just bring it up to get it on other 
people's radar (it's been on my TODO list for years...) and because it's 
related to this issue.
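
One possible shape for such a wrapper, sketched as a guess at an API rather than anything that exists today; the interface name and methods are invented purely for illustration.
{code}
// Hypothetical API sketch: callers register docIDs in any order, the wrapper
// resolves the requested value sources internally in (mostly) increasing docID
// order, and hands the values back keyed by the original docID.
interface BulkValueFetcher {

  /** Register a docID, possibly out of order, whose values should be fetched. */
  void collect(int docId);

  /**
   * Resolve all collected docIDs: sort them internally, walk each value source
   * (DocValues iterator, function query, docvalues-backed stored field) forward
   * once, and return one value array per docID.
   */
  java.util.Map<Integer, Object[]> fetch() throws java.io.IOException;
}
{code}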




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-25 Thread Yonik Seeley (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698258#comment-16698258
 ] 

Yonik Seeley commented on SOLR-13013:
-

Great results!

Retrieving results in order in batches has also been a TODO for augmenters 
(specifically, the ability to retrieve function query results alongside field 
results) ever since they were added to Solr, because some function queries need 
to be accessed in order to be efficient. With the change to iterators for 
docvalues, and the ability to retrieve stored fields from docvalues, this 
becomes even more important.





[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-25 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698207#comment-16698207
 ] 

Toke Eskildsen commented on SOLR-13013:
---

[~janhoy] Nice idea! The values are collected as Objects, so it would involve 
some instanceof-checks to estimate memory overhead. I am unsure how that would 
affect performance, but it would be great if we could both avoid the risk of 
OOM and make it simpler for the user.

Technically a fallback would be easy to do. It would even be possible to do a 
partial fallback: if the memory limit for the buffered values is reached before 
all 30K SortDocs have been processed, switch back to standard "sort order 
DV-resolving with immediate delivery", but use the already collected values 
whenever possible.

Adjusting the window size for subsequent windows is tricky, as it requires 
weighing query + sort cost vs. DV-retrieval cost. It would be possible to 
collect a runtime profile of the two parts and use that for qualified 
guessing, but then it starts to sound like quite a large project.

If we determine that the base idea of this JIRA issue has merit, the first 
version could just use a simple fallback to sort-order DV-resolving, without 
any re-use of already collected values, and stay in that mode for the rest of 
the current export. Re-use and/or window shrinking could be later enhancements.
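
A minimal sketch of the instanceof-based estimate and the fallback flag described above; the size constants and the class name are assumptions made up for the example, not part of the patch.
{code}
// Sketch only: rough per-value size accounting plus a flag that tells the export
// to fall back to "sort order DV-resolving with immediate delivery" once a
// configured memory limit is exceeded.
final class BufferedValueBudget {
  private final long maxBytes;
  private long usedBytes;
  private boolean fallbackTriggered;

  BufferedValueBudget(long maxBytes) {
    this.maxBytes = maxBytes;
  }

  /** Returns false once the budget is exhausted; the caller should stop buffering. */
  boolean tryAdd(Object value) {
    if (fallbackTriggered) {
      return false;
    }
    usedBytes += estimateBytes(value);
    if (usedBytes > maxBytes) {
      fallbackTriggered = true; // remaining docs in this window take the old path
      return false;
    }
    return true;
  }

  boolean fallbackTriggered() {
    return fallbackTriggered;
  }

  // Very rough JVM object-size guesses; real code would need something more careful.
  private static long estimateBytes(Object value) {
    if (value == null) return 0;
    if (value instanceof Long || value instanceof Double) return 24;
    if (value instanceof Integer || value instanceof Float) return 16;
    if (value instanceof CharSequence) return 40 + 2L * ((CharSequence) value).length();
    return 64; // unknown type: pessimistic default
  }
}
{code}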




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-25 Thread Jan Høydahl (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698182#comment-16698182
 ] 

Jan Høydahl commented on SOLR-13013:


Cool. Should there be a setting for max memory usage and, if violated, adjust the 
window size or fall back to the old logic?




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-24 Thread Joel Bernstein (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16698037#comment-16698037
 ] 

Joel Bernstein commented on SOLR-13013:
---

I'll spend some time this week testing out the patch. The approach sounds 
really promising. 




[jira] [Commented] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-24 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697846#comment-16697846
 ] 

Toke Eskildsen commented on SOLR-13013:
---

I have uploaded a proof of concept for the idea in the issue description. The 
structure that collects and holds the temporary values is made by mashing the 
keyboard until it worked, and the performance test is Frankensteined from 
existing unit-test code in TestExportWriter. Nevertheless, the unit tests in 
TestExportWriter pass, and a performance test can be executed with

{{TES_SIZES="1000,1,10,20,30" ant -Dtests.heapsize=5g 
-Dtests.codec=Lucene80 -Dtestmethod=testExportSpeed -Dtestcase=TestExportWriter 
test | grep "TES:"}}

It takes 10+ minutes and writes a summary at the end. For a quicker test, use 
TES_SIZES="1000,1" or something like that. On my desktop the result was:

{code}
[junit4] 1> TES: Concatenated output:
[junit4] 1> TES: Test 1/5: 1000 documents, trie: 11098 / 7525 docs/sec ( 147%), points: 7639 / 11552 docs/sec ( 66%)
[junit4] 1> TES: Test 2/5: 1000 documents, trie: 15135 / 9269 docs/sec ( 163%), points: 27769 / 15986 docs/sec ( 174%)
[junit4] 1> TES: Test 3/5: 1000 documents, trie: 11505 / 9593 docs/sec ( 120%), points: 37643 / 13584 docs/sec ( 277%)
[junit4] 1> TES: Test 4/5: 1000 documents, trie: 17495 / 9730 docs/sec ( 180%), points: 39103 / 18222 docs/sec ( 215%)
[junit4] 1> TES: Test 5/5: 1000 documents, trie: 17657 / 10331 docs/sec ( 171%), points: 37633 / 19104 docs/sec ( 197%)
[junit4] 1> TES: --
[junit4] 1> TES: Test 1/5: 1 documents, trie: 17018 / 7901 docs/sec ( 215%), points: 38606 / 12381 docs/sec ( 312%)
[junit4] 1> TES: Test 2/5: 1 documents, trie: 17191 / 7879 docs/sec ( 218%), points: 39920 / 12404 docs/sec ( 322%)
[junit4] 1> TES: Test 3/5: 1 documents, trie: 17218 / 7881 docs/sec ( 218%), points: 41696 / 12410 docs/sec ( 336%)
[junit4] 1> TES: Test 4/5: 1 documents, trie: 17451 / 7884 docs/sec ( 221%), points: 41719 / 12360 docs/sec ( 338%)
[junit4] 1> TES: Test 5/5: 1 documents, trie: 17227 / 7855 docs/sec ( 219%), points: 41879 / 12436 docs/sec ( 337%)
[junit4] 1> TES: --
[junit4] 1> TES: Test 1/5: 10 documents, trie: 15849 / 3718 docs/sec ( 426%), points: 36037 / 4841 docs/sec ( 744%)
[junit4] 1> TES: Test 2/5: 10 documents, trie: 16348 / 3717 docs/sec ( 440%), points: 37994 / 4858 docs/sec ( 782%)
[junit4] 1> TES: Test 3/5: 10 documents, trie: 15378 / 3718 docs/sec ( 414%), points: 38831 / 4872 docs/sec ( 797%)
[junit4] 1> TES: Test 4/5: 10 documents, trie: 16042 / 3710 docs/sec ( 432%), points: 39084 / 4876 docs/sec ( 802%)
[junit4] 1> TES: Test 5/5: 10 documents, trie: 16009 / 3713 docs/sec ( 431%), points: 39503 / 4865 docs/sec ( 812%)
[junit4] 1> TES: --
[junit4] 1> TES: Test 1/5: 20 documents, trie: 15403 / 3031 docs/sec ( 508%), points: 37349 / 3531 docs/sec (1058%)
[junit4] 1> TES: Test 2/5: 20 documents, trie: 15853 / 3018 docs/sec ( 525%), points: 37509 / 3544 docs/sec (1058%)
[junit4] 1> TES: Test 3/5: 20 documents, trie: 14993 / 3018 docs/sec ( 497%), points: 38468 / 3547 docs/sec (1084%)
[junit4] 1> TES: Test 4/5: 20 documents, trie: 15191 / 3023 docs/sec ( 502%), points: 38684 / 3538 docs/sec (1093%)
[junit4] 1> TES: Test 5/5: 20 documents, trie: 15678 / 3035 docs/sec ( 517%), points: 38729 / 3542 docs/sec (1093%)
[junit4] 1> TES: --
[junit4] 1> TES: Test 1/5: 30 documents, trie: 15529 / 2834 docs/sec ( 548%), points: 36911 / 3652 docs/sec (1011%)
[junit4] 1> TES: Test 2/5: 30 documents, trie: 15455 / 2846 docs/sec ( 543%), points: 37705 / 3630 docs/sec (1039%)
[junit4] 1> TES: Test 3/5: 30 documents, trie: 15805 / 2866 docs/sec ( 551%), points: 37583 / 3660 docs/sec (1027%)
[junit4] 1> TES: Test 4/5: 30 documents, trie: 15653 / 2883 docs/sec ( 543%), points: 39365 / 3591 docs/sec (1096%)
[junit4] 1> TES: Test 5/5: 30 documents, trie: 15736 / 2895 docs/sec ( 543%), points: 38606 / 3667 docs/sec (1053%)
{code}

The two numbers for trie and points are sorted followed by non_sorted. The 
numbers in the parentheses are sorted/non_sorted. As can be seen, non_sorted 
export performance degrades as index size (measured in number of documents) 
goes up. Also, as can be seen from the percentages, reusing the 
DocValues-iterators and ensuring docID order improved the speed significantly.

The patch is not at all production-ready. See it as an answer to "is this idea 
worth exploring?". Ping to [~joel.bernstein], as I expect he will be interested 
in this.
