[jira] [Created] (SOLR-13013) Change export to extract DocValues in docID order

Toke Eskildsen (JIRA) Sat, 24 Nov 2018 07:03:17 -0800

Toke Eskildsen created SOLR-13013:
-------------------------------------

             Summary: Change export to extract DocValues in docID order
                 Key: SOLR-13013
                 URL: https://issues.apache.org/jira/browse/SOLR-13013
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Export Writer
    Affects Versions: 7.5, master (8.0)
            Reporter: Toke Eskildsen
             Fix For: master (8.0)



The streaming export writer uses a sliding window of 30,000 documents for 
paging through the result set in a given sort order. Each time a window has 
been calculated, the values for the export fields are retrieved from the 
underlying DocValues structures in document sort order and delivered.

The iterative DocValues API introduced in Lucene/Solr 7 does not support random 
access. The current export implementation works around this by creating a new 
DocValues-iterator for each individual value to retrieve. This slows down 
export, as the iterator has to seek to the given docID from the start for each 
value. The slowdown scales with shard size (see LUCENE-8374 for details). An 
alternative is to extract the DocValues in docID-order, with re-use of 
DocValues-iterators. The idea is as follows:
 # Change the FieldWriters for export to re-use the DocValues-iterators if 
subsequent requests are for docIDs higher than the previous ones
 # Calculate the sliding window of SortDocs as usual
 # Take a note of the order of the SortDocs in the sliding window
 # Re-sort the SortDocs in docID-order
 # Extract the DocValues to a temporary on-heap structure
 # Re-sort the extracted values to the original sliding window order
 # Deliver the values
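
As a rough illustration of the steps above, here is a minimal, self-contained 
Java sketch. It does not use the Lucene or Solr APIs: ForwardOnlyValues is a 
hypothetical stand-in for a forward-only DocValues-iterator, and the window is 
simplified to a plain array of docIDs in sort order with one long value per 
document. The names DocIdOrderExport and extractInWindowOrder are assumptions 
made for this example only.

```java
import java.util.Arrays;
import java.util.Comparator;

final class DocIdOrderExport {

    /** Hypothetical stand-in for a forward-only DocValues-iterator (not a Lucene class). */
    static final class ForwardOnlyValues {
        private final long[] valuePerDoc; // index = docID
        private int current = -1;

        ForwardOnlyValues(long[] valuePerDoc) {
            this.valuePerDoc = valuePerDoc;
        }

        /** Returns the value for docId; refuses to move backwards, like an iterator. */
        long valueFor(int docId) {
            if (docId < current) {
                throw new IllegalStateException("Cannot go back from " + current + " to " + docId);
            }
            current = docId;
            return valuePerDoc[docId];
        }
    }

    /**
     * Reads the values for the window in docID order through a single iterator,
     * and returns them in the original window (sort) order.
     */
    static long[] extractInWindowOrder(int[] windowDocIds, ForwardOnlyValues values) {
        // Take a note of the original window positions.
        Integer[] positions = new Integer[windowDocIds.length];
        for (int i = 0; i < positions.length; i++) {
            positions[i] = i;
        }
        // Re-sort the positions so that they visit the docIDs in ascending order.
        Arrays.sort(positions, Comparator.comparingInt(p -> windowDocIds[p]));

        // Extract to a temporary on-heap array. Writing each value to its
        // original position restores the window order without a second sort.
        long[] extracted = new long[windowDocIds.length];
        for (int p : positions) {
            extracted[p] = values.valueFor(windowDocIds[p]);
        }
        return extracted;
    }

    public static void main(String[] args) {
        long[] valuePerDoc = {100, 101, 102, 103, 104, 105};
        int[] window = {4, 1, 5, 0}; // docIDs in sort order
        long[] result = extractInWindowOrder(window, new ForwardOnlyValues(valuePerDoc));
        System.out.println(Arrays.toString(result)); // prints [104, 101, 105, 100]
    }
}
```

In this sketch the temporary array is the on-heap structure from the list 
above: it holds one extracted value per document in the window, which is where 
the extra memory use comes from.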

One big difference from the current export code is of course the need to hold 
the extracted values for the whole sliding window in memory. This might well 
be a showstopper, as there is no real limit to how large this partial result 
set can be. Maybe such an optimization could be requested explicitly if the 
user knows that there is enough memory?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

