[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks

2020-04-30 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-8447:
--
Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> -
>
> Key: ARROW-8447
> URL: https://issues.apache.org/jira/browse/ARROW-8447
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Francois Saint-Jacques
> Priority: Major
> Labels: dataset, pull-request-available
> Fix For: 1.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This can be refactored with a little effort in Scanner::ToTable:
> # Change `batches` to `std::vector`
> # When pushing the closure to the TaskGroup, also track an incrementing 
> integer, e.g. scan_task_id
> # In the closure, store the RecordBatches for this ScanTask in a local
> vector; when all batches are consumed, move the local vector into `batches`
> at the right index, resizing and emplacing under a mutex
> # After waiting for the task group to complete, either
> * Flatten into a single vector and call `Table::FromRecordBatch` or
> * Write a RecordBatchReader that supports vector and add a
> method `Table::FromRecordBatchReader`
> The latter involves more work but is the clean way; the other FromRecordBatch
> method can be implemented on top of it and support "streaming".
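
The steps in the description can be sketched roughly as follows. This is a
minimal, standard-library-only illustration and not the Arrow implementation:
`Batch`, `ScanTask`, and `CollectInOrder` are hypothetical names, a
`std::string` stands in for a RecordBatch, and plain `std::thread` workers
stand in for the TaskGroup. It covers only the first option of the last step
(flattening into a single vector).

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-ins so the sketch compiles without Arrow.
using Batch = std::string;            // plays the role of a RecordBatch
using BatchVector = std::vector<Batch>;
using ScanTask = std::vector<Batch>;  // a task simply "yields" these batches

// Runs every scan task on its own thread and returns all batches in
// scan-task order, regardless of which task finishes first.
BatchVector CollectInOrder(const std::vector<ScanTask>& scan_tasks) {
  std::vector<BatchVector> batches;  // step 1: one slot per scan task
  std::mutex mutex;                  // guards `batches`
  std::vector<std::thread> workers;  // assumption: replaces the TaskGroup

  // Step 2: each closure carries its own incrementing scan_task_id.
  for (std::size_t scan_task_id = 0; scan_task_id < scan_tasks.size();
       ++scan_task_id) {
    const ScanTask* task = &scan_tasks[scan_task_id];
    workers.emplace_back([&batches, &mutex, task, scan_task_id] {
      // Step 3: accumulate this task's batches locally, without the lock.
      BatchVector local;
      for (const Batch& batch : *task) local.push_back(batch);

      // Then publish them at the task's index, resizing under the mutex.
      std::lock_guard<std::mutex> lock(mutex);
      if (batches.size() <= scan_task_id) batches.resize(scan_task_id + 1);
      batches[scan_task_id] = std::move(local);
    });
  }

  // Step 4: wait for every task to complete.
  for (std::thread& worker : workers) worker.join();

  // Step 4, first option: flatten while keeping scan-task order.
  BatchVector flattened;
  for (BatchVector& task_batches : batches) {
    for (Batch& batch : task_batches) flattened.push_back(std::move(batch));
  }
  return flattened;
}
```

Because each task writes only to its own index, the output order depends on
the scan_task_id alone and not on thread scheduling; the mutex is needed only
because resizing the outer vector may move the other slots.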



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks

2020-04-14 Thread Joris Van den Bossche (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-8447:
-
Fix Version/s: 1.0.0

> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> -
>
> Key: ARROW-8447
> URL: https://issues.apache.org/jira/browse/ARROW-8447
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Francois Saint-Jacques
> Priority: Major
> Labels: dataset
> Fix For: 1.0.0
>
>
> This can be refactored with a little effort in Scanner::ToTable:
> # Change `batches` to `std::vector`
> # When pushing the closure to the TaskGroup, also track an incrementing 
> integer, e.g. scan_task_id
> # In the closure, store the RecordBatches for this ScanTask in a local
> vector; when all batches are consumed, move the local vector into `batches`
> at the right index, resizing and emplacing under a mutex
> # After waiting for the task group to complete, either
> * Flatten into a single vector and call `Table::FromRecordBatch` or
> * Write a RecordBatchReader that supports vector and add a
> method `Table::FromRecordBatchReader`
> The latter involves more work but is the clean way; the other FromRecordBatch
> method can be implemented on top of it and support "streaming".





[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks

2020-04-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8447:
--
Description: 
This can be refactored with a little effort in Scanner::ToTable:

# Change `batches` to `std::vector`
# When pushing the closure to the TaskGroup, also track an incrementing 
integer, e.g. scan_task_id
# In the closure, store the RecordBatches for this ScanTask in a local vector;
when all batches are consumed, move the local vector into `batches` at the
right index, resizing and emplacing under a mutex
# After waiting for the task group to complete, either
* Flatten into a single vector and call `Table::FromRecordBatch` or
* Write a RecordBatchReader that supports vector and add a
method `Table::FromRecordBatchReader`

The latter involves more work but is the clean way; the other FromRecordBatch
method can be implemented on top of it and support "streaming".

  was:
This can be refactored with a little effort in Scanner::ToTable:

# Change `batches` to `std::vector`
# When pushing the closure to the TaskGroup, also track an incrementing 
integer, e.g. scan_task_id
# In the closure, store the RecordBatches for this ScanTask in a local vector;
when all batches are consumed, move the local vector into `batches` at the
right index, resizing and emplacing under a mutex
# After waiting for the task group to complete, either
* Concatenate into a single vector and call `Table::FromRecordBatch` or
* Write a RecordBatchReader that supports vector and add a
method `Table::FromRecordBatchReader`

The latter involves more work but is the clean way; the other FromRecordBatch
method can be implemented on top of it and support "streaming".


> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> -
>
> Key: ARROW-8447
> URL: https://issues.apache.org/jira/browse/ARROW-8447
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Francois Saint-Jacques
> Priority: Major
> Labels: dataset
>
> This can be refactored with a little effort in Scanner::ToTable:
> # Change `batches` to `std::vector`
> # When pushing the closure to the TaskGroup, also track an incrementing 
> integer, e.g. scan_task_id
> # In the closure, store the RecordBatches for this ScanTask in a local
> vector; when all batches are consumed, move the local vector into `batches`
> at the right index, resizing and emplacing under a mutex
> # After waiting for the task group to complete, either
> * Flatten into a single vector and call `Table::FromRecordBatch` or
> * Write a RecordBatchReader that supports vector and add a
> method `Table::FromRecordBatchReader`
> The latter involves more work but is the clean way; the other FromRecordBatch
> method can be implemented on top of it and support "streaming".





[jira] [Updated] (ARROW-8447) [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks

2020-04-14 Thread Francois Saint-Jacques (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-8447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Francois Saint-Jacques updated ARROW-8447:
--
Summary: [C++][Dataset] Ensure Scanner::ToTable preserve ordering of 
ScanTasks  (was: [C++][Dataset] Ensure Scanner::ToTable preserve ordering)

> [C++][Dataset] Ensure Scanner::ToTable preserve ordering of ScanTasks
> -
>
> Key: ARROW-8447
> URL: https://issues.apache.org/jira/browse/ARROW-8447
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Francois Saint-Jacques
> Priority: Major
> Labels: dataset
>
> This can be refactored with a little effort in Scanner::ToTable:
> # Change `batches` to `std::vector`
> # When pushing the closure to the TaskGroup, also track an incrementing 
> integer, e.g. scan_task_id
> # In the closure, store the RecordBatches for this ScanTask in a local
> vector; when all batches are consumed, move the local vector into `batches`
> at the right index, resizing and emplacing under a mutex
> # After waiting for the task group to complete, either
> * Concatenate into a single vector and call `Table::FromRecordBatch` or
> * Write a RecordBatchReader that supports vector and add a
> method `Table::FromRecordBatchReader`
> The latter involves more work but is the clean way; the other FromRecordBatch
> method can be implemented on top of it and support "streaming".


