[jira] [Comment Edited] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-28 Thread Joel Bernstein (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702034#comment-16702034
 ] 

Joel Bernstein edited comment on SOLR-13013 at 11/28/18 3:39 PM:
-

You're exactly right that improving performance of export helps the MapReduce 
use cases as well. It's just that in a sharded, replicated environment with a 
tier of worker nodes performing a reduce operation, you can get massive 
throughput already just because you can have dozens of servers pushing out an 
export and reducing in parallel.

But you could easily argue that your use case is the more common one and that we 
should really try to make it as fast as possible.

I wouldn't worry too much about testing this in sharded scenarios. We can 
extrapolate the single-shard findings to multiple shards, realizing that the 
aggregator node will quickly become the bottleneck and the /export will spend 
much of its time blocked while writing data. Having a tier of worker nodes 
removes this bottleneck in the case where the worker nodes are performing some 
form of reduce operation.

 



> Change export to extract DocValues in docID order
> -
>
> Key: SOLR-13013
> URL: https://issues.apache.org/jira/browse/SOLR-13013
> Project: Solr
>  Issue Type: Improvement
>  Security Level: Public(Default Security Level. Issues are Public) 
>  Components: Export Writer
>Affects Versions: 7.5, master (8.0)
>Reporter: Toke Eskildsen
>Priority: Major
> Fix For: master (8.0)
>
> Attachments: SOLR-13013_proof_of_concept.patch, 
> SOLR-13013_proof_of_concept.patch
>
>
> The streaming export writer uses a sliding window of 30,000 documents for 
> paging through the result set in a given sort order. Each time a window has 
> been calculated, the values for the export fields are retrieved from the 
> underlying DocValues structures in document sort order and delivered.
> The iterative DocValues API introduced in Lucene/Solr 7 does not support 
> random access. The current export implementation bypasses this by creating a 
> new DocValues-iterator for each individual value to retrieve. This slows down 
> export as the iterator has to seek to the given docID from start for each 
> value. The slowdown scales with shard size (see LUCENE-8374 for details). An 
> alternative is to extract the DocValues in docID-order, with re-use of 
> DocValues-iterators. The idea is as follows:
>  # Change the FieldWriters for export to re-use the DocValues-iterators if 
> subsequent requests are for docIDs higher than the previous ones
>  # Calculate the sliding window of SortDocs as usual
>  # Take a note of the order of the SortDocs in the sliding window
>  # Re-sort the SortDocs in docID-order
>  # Extract the DocValues to a temporary on-heap structure
>  # Re-sort the extracted values to the original sliding window order
>  # Deliver the values
> One big difference from the current export code is of course the need to hold 
> the whole sliding-window-scaled result set in memory. This might well be a 
> showstopper, as there is no real limit to how large this partial result set 
> can be. Maybe such an optimization could be requested explicitly if the user 
> knows that there is enough memory?
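
To make the docID-order idea above concrete, here is a minimal, hypothetical Java sketch (not taken from the attached patch; the class, field and variable names are invented) of extracting one numeric DocValues field for a single export window while re-using a single iterator:

{code:java}
import java.io.IOException;
import java.util.Arrays;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.NumericDocValues;

/**
 * Sketch only: extract a numeric DocValues field for one export window.
 * The window's docIDs are visited in increasing order so one
 * NumericDocValues iterator can be re-used instead of being re-created
 * (and re-advanced from docID 0) for every single value.
 */
public class WindowedDVExtractor {

  static long[] extractWindow(LeafReader reader, String field, int[] windowDocs)
      throws IOException {
    // Remember the original (sort) order so values can be delivered in it later.
    Integer[] order = new Integer[windowDocs.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    // Re-sort the window positions by docID (step 4 above).
    Arrays.sort(order, (a, b) -> Integer.compare(windowDocs[a], windowDocs[b]));

    // One iterator for the whole window; advanceExact only ever moves forward.
    NumericDocValues dv = reader.getNumericDocValues(field);
    long[] values = new long[windowDocs.length];
    for (int pos : order) {
      int docID = windowDocs[pos];
      // Store the value back at its original window position (step 6 above).
      values[pos] = (dv != null && dv.advanceExact(docID)) ? dv.longValue() : 0L;
    }
    return values; // in the original sliding-window (sort) order
  }
}
{code}

The same forward-only pattern applies to the other DocValues types, as long as the docIDs handed to a given iterator never decrease.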






[jira] [Comment Edited] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-28 Thread Joel Bernstein (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701898#comment-16701898
 ] 

Joel Bernstein edited comment on SOLR-13013 at 11/28/18 1:48 PM:
-

Interesting findings. I can work on getting this patch committed, possibly for 
the 8.0 release.

A couple of thoughts about the design of the /export handler.

The /export handler was very much designed to support MapReduce operations 
(distributed grouping, rollups, relational algebra) in Streaming Expressions. 
Scaling these MapReduce operations took the following path:

1) Sharding: The /export handler benefits tremendously from sharding. The 
benefits go well beyond linear, because two shards both double the computing 
power and more than halve the amount of work that needs to be done by each 
shard.

3) Hash partitioning and worker collections: Sharding very quickly causes 
bottlenecks on a single aggregator node. The Streaming Expressions parallel 
function, combined with the hash partitioner, allows the /export streams to be 
partitioned into X slices and brings into play not just the shards but also the 
replicas. When a reduce operation that limits the number of records emitted in 
the final stream happens on the worker nodes (rollups, innerJoins), this is an 
extremely powerful scaling tool.

So, from a pure /export standpoint with no reduce operation, all from a single 
shard, you are working somewhat against the design goals of the system. That 
being said, the faster we make the pure export from a single shard, the more 
use cases the /export handler serves.
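
To illustrate that pattern, here is a rough, hypothetical SolrJ sketch (the collection, field and host names are made up, and this is not code from Solr or from this patch) of a rollup that is hash-partitioned across worker nodes, so each worker consumes one slice of the /export stream and the client only sees the already-reduced tuples:

{code:java}
import org.apache.solr.client.solrj.io.Tuple;
import org.apache.solr.client.solrj.io.stream.SolrStream;
import org.apache.solr.common.params.ModifiableSolrParams;

/**
 * Sketch only: push the reduce (a rollup) onto worker nodes so each worker
 * reads a hash-partitioned slice of the /export stream and the aggregator
 * only merges the reduced tuples.
 */
public class ParallelExportRollup {

  public static void main(String[] args) throws Exception {
    // Hypothetical collections/fields; partitionKeys engages the hash partitioner.
    String expr =
        "parallel(workers,"
      + "  rollup("
      + "    search(logs, q=\"*:*\", qt=\"/export\","
      + "           fl=\"user_id,bytes\", sort=\"user_id asc\","
      + "           partitionKeys=\"user_id\"),"
      + "    over=\"user_id\", sum(bytes)),"
      + "  workers=4, zkHost=\"localhost:9983\", sort=\"user_id asc\")";

    ModifiableSolrParams params = new ModifiableSolrParams();
    params.set("expr", expr);
    params.set("qt", "/stream");

    // Send the expression to the /stream handler of the worker collection.
    SolrStream stream = new SolrStream("http://localhost:8983/solr/workers", params);
    try {
      stream.open();
      for (Tuple t = stream.read(); !t.EOF; t = stream.read()) {
        System.out.println(t.getString("user_id") + " -> " + t.get("sum(bytes)"));
      }
    } finally {
      stream.close();
    }
  }
}
{code}

In this sketch, partitionKeys is what drives the hash partitioner and workers controls how many slices the /export stream is split into; both values above are illustrative.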




[jira] [Comment Edited] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-28 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16701727#comment-16701727
 ] 

Toke Eskildsen edited comment on SOLR-13013 at 11/28/18 11:15 AM:
--

I cherry-picked some DENSE fields from our netarchive index and tried exporting 
them from a single shard, to demonstrate the problem with large indexes in 
Lucene/Solr 7+ and to performance test the current patch.

I made sure everything was warmed (practically zero IO on the index-SSD 
according to iostat) and tested with combinations of SOLR-13013 and LUCENE-8374 
turned on and off:
{code:bash}
> curl -s "http://localhost:9090/solr/ns80/select?q=*:*; | jq .response.numFound
307171504

> curl -s "http://localhost:9090/solr/ns80/select?q=text:hestevogn; | jq .response.numFound'
52654

> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=true=true; -o t_export_true_true
0.433661 seconds

> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=true=false; -o t_export_true_false
0.555844 seconds

> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=false=true; -o t_export_false_true
1.037004 seconds

> curl -s -w "%{time_total} seconds"$'\n' "http://localhost:9090/solr/ns80/export?q=text:hestevogn=id+asc=content_type_ext,content_type_served,crawl_date,content_length=false=false; -o t_export_false_false
843.477925 seconds

> diff -s t_export_true_true t_export_true_false ; diff -s t_export_true_true t_export_false_true ; diff -s t_export_true_true t_export_false_false
Files t_export_true_true and t_export_true_false are identical
Files t_export_true_true and t_export_false_true are identical
Files t_export_true_true and t_export_false_false are identical
{code}
Observations from this ad-hoc test (which of course should be independently 
verified): Exporting from a large index with vanilla Solr master is not ideal. 
It does not make much sense to talk about what performance factors the patches 
provide, as they are mostly about changing time complexity: our factor-1500 
speed-up with SOLR-13013 on this shard with this request will be something else 
entirely for other setups.
 * The explicit sort in SOLR-13013 seems the superior solution and the addition 
of the O(n) → O(1) lookup-improvement in LUCENE-8374 only makes it slightly 
faster.
 * On the other hand, LUCENE-8374 works quite well for export and does not 
require any changes to the export code. This might influence whether energy 
should be spent on a "best as possible" fallback in case of memory problems, or 
whether a simpler "full fallback to sliding window sort order" is preferable.
 * On the gripping hand, testing with a smaller index is likely to result in 
SOLR-13013 being (relative to LUCENE-8374) even faster, as SOLR-13013 avoids 
re-opening DV-readers all the time. More testing needed (no surprise there).



[jira] [Comment Edited] (SOLR-13013) Change export to extract DocValues in docID order

2018-11-24 Thread Toke Eskildsen (JIRA)


[ 
https://issues.apache.org/jira/browse/SOLR-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16697846#comment-16697846
 ] 

Toke Eskildsen edited comment on SOLR-13013 at 11/24/18 3:12 PM:
-

I have uploaded a proof of concept for the idea in the issue description. The 
structure that collects and holds the temporary values was made by mashing the 
keyboard until it worked, and the performance test is Frankensteined from 
existing unit-test code in {{TestExportWriter}}. Nevertheless, the unit tests in 
{{TestExportWriter}} pass and a performance test can be executed with

{code}
TES_SIZES="1000,1,10,20,30" ant -Dtests.heapsize=5g 
-Dtests.codec=Lucene80 -Dtestmethod=testExportSpeed -Dtestcase=TestExportWriter 
test | grep "TES:"
{code}

It takes 10+ minutes and writes a summary at the end. For a quicker test, use 
{{TES_SIZES="1000,1"}} or something like that. On my desktop the result was:

{code}
Test 1/5:   1000 docs, trie: 11098 /  7525 docs/sec ( 147%), points:  7639 / 11552 docs/sec (  66%)
Test 2/5:   1000 docs, trie: 15135 /  9269 docs/sec ( 163%), points: 27769 / 15986 docs/sec ( 174%)
Test 3/5:   1000 docs, trie: 11505 /  9593 docs/sec ( 120%), points: 37643 / 13584 docs/sec ( 277%)
Test 4/5:   1000 docs, trie: 17495 /  9730 docs/sec ( 180%), points: 39103 / 18222 docs/sec ( 215%)
Test 5/5:   1000 docs, trie: 17657 / 10331 docs/sec ( 171%), points: 37633 / 19104 docs/sec ( 197%)
--
Test 1/5:  1 docs, trie: 17018 /  7901 docs/sec ( 215%), points: 38606 / 12381 docs/sec ( 312%)
Test 2/5:  1 docs, trie: 17191 /  7879 docs/sec ( 218%), points: 39920 / 12404 docs/sec ( 322%)
Test 3/5:  1 docs, trie: 17218 /  7881 docs/sec ( 218%), points: 41696 / 12410 docs/sec ( 336%)
Test 4/5:  1 docs, trie: 17451 /  7884 docs/sec ( 221%), points: 41719 / 12360 docs/sec ( 338%)
Test 5/5:  1 docs, trie: 17227 /  7855 docs/sec ( 219%), points: 41879 / 12436 docs/sec ( 337%)
--
Test 1/5: 10 docs, trie: 15849 /  3718 docs/sec ( 426%), points: 36037 /  4841 docs/sec ( 744%)
Test 2/5: 10 docs, trie: 16348 /  3717 docs/sec ( 440%), points: 37994 /  4858 docs/sec ( 782%)
Test 3/5: 10 docs, trie: 15378 /  3718 docs/sec ( 414%), points: 38831 /  4872 docs/sec ( 797%)
Test 4/5: 10 docs, trie: 16042 /  3710 docs/sec ( 432%), points: 39084 /  4876 docs/sec ( 802%)
Test 5/5: 10 docs, trie: 16009 /  3713 docs/sec ( 431%), points: 39503 /  4865 docs/sec ( 812%)
--
Test 1/5: 20 docs, trie: 15403 /  3031 docs/sec ( 508%), points: 37349 /  3531 docs/sec (1058%)
Test 2/5: 20 docs, trie: 15853 /  3018 docs/sec ( 525%), points: 37509 /  3544 docs/sec (1058%)
Test 3/5: 20 docs, trie: 14993 /  3018 docs/sec ( 497%), points: 38468 /  3547 docs/sec (1084%)
Test 4/5: 20 docs, trie: 15191 /  3023 docs/sec ( 502%), points: 38684 /  3538 docs/sec (1093%)
Test 5/5: 20 docs, trie: 15678 /  3035 docs/sec ( 517%), points: 38729 /  3542 docs/sec (1093%)
--
Test 1/5: 30 docs, trie: 15529 /  2834 docs/sec ( 548%), points: 36911 /  3652 docs/sec (1011%)
Test 2/5: 30 docs, trie: 15455 /  2846 docs/sec ( 543%), points: 37705 /  3630 docs/sec (1039%)
Test 3/5: 30 docs, trie: 15805 /  2866 docs/sec ( 551%), points: 37583 /  3660 docs/sec (1027%)
Test 4/5: 30 docs, trie: 15653 /  2883 docs/sec ( 543%), points: 39365 /  3591 docs/sec (1096%)
Test 5/5: 30 docs, trie: 15736 /  2895 docs/sec ( 543%), points: 38606 /  3667 docs/sec (1053%)
{code}

For both trie and points, the first number is the sorted (docID-order) 
throughput and the second is the non_sorted throughput; the percentage in 
parentheses is sorted/non_sorted (e.g. 11098 / 7525 ≈ 147% for the first 
1000-document run). As can be seen, non_sorted export performance degrades as 
index size (measured in number of documents) goes up. Also, as the percentages 
show, re-using the DocValues-iterators and ensuring docID order improves the 
speed significantly.

The patch is not at all production-ready. See it as an "is this idea worth 
exploring?" proposal. Ping to [~joel.bernstein], as I expect he will be 
interested in this.

