[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-8447:
----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> ---------------------------------------------------------------------
>
>                 Key: ARROW-8447
>                 URL: https://issues.apache.org/jira/browse/ARROW-8447
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset, pull-request-available
>             Fix For: 1.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> This can be refactored with a little effort in Scanner::ToTable:
> # Change `batches` to `std::vector`
> # When pushing the closure to the TaskGroup, also track an incrementing
> integer, e.g. scan_task_id
> # In the closure, store the RecordBatches for this ScanTask in a local
> vector; when all batches are consumed, move the local vector into `batches`
> at the right index, resizing and emplacing under a mutex
> # After waiting for the task group to complete, either
> * Flatten into a single vector and call `Table::FromRecordBatch` or
> * Write a RecordBatchReader that supports vector and add
> method `Table::FromRecordBatchReader`
> The latter involves more work but is the cleaner way; the other FromRecordBatch
> method can then be implemented on top of it and support "streaming".

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
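The first three steps of the description, plus the "flatten" option, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Arrow implementation: `Batch`, `BatchVector`, `ScanTask` and `CollectInTaskOrder` are hypothetical stand-ins (a string plays the role of a RecordBatch), and plain `std::thread` workers stand in for the TaskGroup.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the Arrow types named in the issue.
using Batch = std::string;               // plays the role of a RecordBatch
using BatchVector = std::vector<Batch>;  // all batches produced by one ScanTask
using ScanTask = BatchVector;            // a "task" that simply yields its batches

// Runs each task on its own thread and returns the batches in task order,
// whatever order the tasks happen to finish in.
std::vector<Batch> CollectInTaskOrder(const std::vector<ScanTask>& tasks) {
  std::vector<BatchVector> batches;  // slot i holds the output of task i
  std::mutex mutex;                  // guards `batches`
  std::vector<std::thread> workers;

  for (std::size_t scan_task_id = 0; scan_task_id < tasks.size(); ++scan_task_id) {
    workers.emplace_back([&batches, &mutex, &tasks, scan_task_id] {
      // Consume the whole task into a local vector without holding the lock.
      BatchVector local;
      for (const Batch& batch : tasks[scan_task_id]) local.push_back(batch);

      // Publish at the task's own index, growing the outer vector if needed.
      std::lock_guard<std::mutex> lock(mutex);
      if (batches.size() <= scan_task_id) batches.resize(scan_task_id + 1);
      batches[scan_task_id] = std::move(local);
    });
  }
  // Equivalent of waiting for the task group to complete.
  for (std::thread& worker : workers) worker.join();
  assert(batches.size() == tasks.size());

  // Flatten the per-task vectors into one vector, preserving task order.
  std::vector<Batch> flattened;
  for (BatchVector& per_task : batches) {
    for (Batch& batch : per_task) flattened.push_back(std::move(batch));
  }
  return flattened;
}
```

Because each task writes only to the slot given by its own scan_task_id, the flattened result follows the order in which the tasks were submitted rather than the order in which they completed. The lock is held only for the resize and the move, not while the task is being consumed.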
[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-8447:
-----------------------------------------
    Fix Version/s: 1.0.0

> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> ---------------------------------------------------------------------
>
>                 Key: ARROW-8447
>                 URL: https://issues.apache.org/jira/browse/ARROW-8447
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset
>             Fix For: 1.0.0
[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques updated ARROW-8447:
------------------------------------------
    Description:
This can be refactored with a little effort in Scanner::ToTable:
# Change `batches` to `std::vector`
# When pushing the closure to the TaskGroup, also track an incrementing integer, e.g. scan_task_id
# In the closure, store the RecordBatches for this ScanTask in a local vector; when all batches are consumed, move the local vector into `batches` at the right index, resizing and emplacing under a mutex
# After waiting for the task group to complete, either
* Flatten into a single vector and call `Table::FromRecordBatch` or
* Write a RecordBatchReader that supports vector and add method `Table::FromRecordBatchReader`

The latter involves more work but is the cleaner way; the other FromRecordBatch method can then be implemented on top of it and support "streaming".

  was:
This can be refactored with a little effort in Scanner::ToTable:
# Change `batches` to `std::vector`
# When pushing the closure to the TaskGroup, also track an incrementing integer, e.g. scan_task_id
# In the closure, store the RecordBatches for this ScanTask in a local vector; when all batches are consumed, move the local vector into `batches` at the right index, resizing and emplacing under a mutex
# After waiting for the task group to complete, either
* Concatenate into a single vector and call `Table::FromRecordBatch` or
* Write a RecordBatchReader that supports vector and add method `Table::FromRecordBatchReader`

The latter involves more work but is the cleaner way; the other FromRecordBatch method can then be implemented on top of it and support "streaming".

> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> ---------------------------------------------------------------------
>
>                 Key: ARROW-8447
>                 URL: https://issues.apache.org/jira/browse/ARROW-8447
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset
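The second option in the description (a reader over the per-task vectors, with the vector-based method built on top of it) might look something like the sketch below. All names here are hypothetical and chosen for illustration: `NestedBatchReader`, `FromReader` and `FromBatches` are not Arrow APIs, a string again plays the role of a RecordBatch, and a plain vector plays the role of the resulting Table.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the Arrow types named in the issue.
using Batch = std::string;               // plays the role of a RecordBatch
using BatchVector = std::vector<Batch>;  // all batches produced by one ScanTask

// Walks a vector of per-task batch vectors one batch at a time, in order,
// without first flattening them into a single vector.
class NestedBatchReader {
 public:
  explicit NestedBatchReader(std::vector<BatchVector> batches)
      : batches_(std::move(batches)) {}

  // Returns a pointer to the next batch, or nullptr once exhausted.
  const Batch* Next() {
    while (outer_ < batches_.size()) {
      if (inner_ < batches_[outer_].size()) return &batches_[outer_][inner_++];
      ++outer_;  // current task is drained (or was empty); move to the next
      inner_ = 0;
    }
    return nullptr;
  }

 private:
  std::vector<BatchVector> batches_;
  std::size_t outer_ = 0;  // index of the current task
  std::size_t inner_ = 0;  // index of the next batch within that task
};

// Stand-in for a reader-based builder: drains the reader into the result.
std::vector<Batch> FromReader(NestedBatchReader& reader) {
  std::vector<Batch> out;
  while (const Batch* batch = reader.Next()) out.push_back(*batch);
  assert(reader.Next() == nullptr);  // an exhausted reader stays exhausted
  return out;
}

// The vector-based entry point implemented on top of the reader-based one.
std::vector<Batch> FromBatches(std::vector<Batch> batches) {
  std::vector<BatchVector> nested;
  nested.push_back(std::move(batches));
  NestedBatchReader reader(std::move(nested));
  return FromReader(reader);
}
```

In this sketch the consumer pulls batches one by one, so nothing forces the whole result to be flattened before use, which is one way to read the "streaming" remark in the description.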
[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
[ https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques updated ARROW-8447:
------------------------------------------
    Summary: [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks  (was: [C++][Dataset] Ensure Scanner::ToTable preserve ordering)

> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> ---------------------------------------------------------------------
>
>                 Key: ARROW-8447
>                 URL: https://issues.apache.org/jira/browse/ARROW-8447
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset