[jira] [Updated] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated ARROW-8065:
----------------------------------
    Labels: dataset pull-request-available  (was: dataset)

> [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
> ---------------------------------------------------------
>
>                 Key: ARROW-8065
>                 URL: https://issues.apache.org/jira/browse/ARROW-8065
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Francois Saint-Jacques
>            Assignee: Francois Saint-Jacques
>            Priority: Major
>              Labels: dataset, pull-request-available
>
> Currently: a fragment is a product of a scan; it is a lazy collection of scan
> tasks corresponding to a data source which is logically singular (like a
> single file, a single row group, ...). It would be more useful if instead a
> fragment were the direct object of a scan; one scans a fragment (or a
> collection of fragments):
> # Remove {{ScanOptions}} from Fragment's properties and move it into
> {{Fragment::Scan}} parameters.
> # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an
> overload to support predicate pushdown in FileSystemDataset and UnionDataset
> {{Dataset::GetFragments(std::shared_ptr predicate)}}.
> # Expose lazy accessor to Fragment::physical_schema()
> # Consolidate ScanOptions and ScanContext
> This will lessen the cognitive dissonance between fragments and files since
> fragments will no longer include references to scan properties.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
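The first two items of the description above could look roughly like the following. This is only a sketch to illustrate the proposed shape, not the actual Arrow API: `ScanOptions`, `ScanTask`, `Fragment`, `Dataset`, `Predicate` and `ScanAll` are simplified, hypothetical stand-ins, and the predicate is modelled as a plain `std::function` rather than whatever expression type the real overload would take.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins; none of these are the real Arrow classes.
struct ScanOptions {
  std::vector<std::string> columns;  // projection requested by the caller
};

struct ScanTask {
  std::string source;                // where the data will be read from
  std::vector<std::string> columns;  // what will be read
};

// A fragment only describes a logically singular data source (here, a path).
// It holds no scan state: the options arrive as a parameter of Scan().
class Fragment {
 public:
  explicit Fragment(std::string path) : path_(std::move(path)) {}

  const std::string& path() const { return path_; }

  std::vector<ScanTask> Scan(const ScanOptions& options) const {
    return {ScanTask{path_, options.columns}};
  }

 private:
  std::string path_;
};

// Assumption: a predicate is just a callable deciding whether to keep a fragment.
using Predicate = std::function<bool(const Fragment&)>;
using FragmentVector = std::vector<std::shared_ptr<Fragment>>;

class Dataset {
 public:
  explicit Dataset(FragmentVector fragments) : fragments_(std::move(fragments)) {}

  // Listing fragments requires no ScanOptions at all.
  FragmentVector GetFragments() const { return fragments_; }

  // Overload that prunes fragments up front (predicate pushdown).
  FragmentVector GetFragments(const Predicate& predicate) const {
    FragmentVector kept;
    for (const auto& fragment : fragments_) {
      if (predicate(*fragment)) kept.push_back(fragment);
    }
    return kept;
  }

 private:
  FragmentVector fragments_;
};

// "One scans a fragment (or a collection of fragments)".
std::vector<ScanTask> ScanAll(const FragmentVector& fragments,
                              const ScanOptions& options) {
  std::vector<ScanTask> tasks;
  for (const auto& fragment : fragments) {
    assert(fragment != nullptr);
    for (auto& task : fragment->Scan(options)) tasks.push_back(std::move(task));
  }
  return tasks;
}
```

With this shape, the same fragment can be scanned several times with different options, and code that only needs to enumerate or write fragments never has to build a `ScanOptions`.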
[jira] [Updated] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques updated ARROW-8065:
------------------------------------------
    Description:
Currently: a fragment is a product of a scan; it is a lazy collection of scan tasks corresponding to a data source which is logically singular (like a single file, a single row group, ...). It would be more useful if instead a fragment were the direct object of a scan; one scans a fragment (or a collection of fragments):
# Remove {{ScanOptions}} from Fragment's properties and move it into {{Fragment::Scan}} parameters.
# Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an overload to support predicate pushdown in FileSystemDataset and UnionDataset {{Dataset::GetFragments(std::shared_ptr predicate)}}.
# Expose lazy accessor to Fragment::physical_schema()
# Consolidate ScanOptions and ScanContext

This will lessen the cognitive dissonance between fragments and files since fragments will no longer include references to scan properties.

  was:
Currently: a fragment is a product of a scan; it is a lazy collection of scan tasks corresponding to a data source which is logically singular (like a single file, a single row group, ...). It would be more useful if instead a fragment were the direct object of a scan; one scans a fragment (or a collection of fragments):
# Remove {{ScanOptions}} from Fragment's properties and move it into {{Fragment::Scan}} parameters.
# Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an overload to support predicate pushdown in FileSystemDataset and UnionDataset {{Dataset::GetFragments(std::shared_ptr predicate)}}.
# Expose lazy accessor to Fragment::physical_schema()

This will lessen the cognitive dissonance between fragments and files since fragments will no longer include references to scan properties.

> [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
> ---------------------------------------------------------
>
>                 Key: ARROW-8065
>                 URL: https://issues.apache.org/jira/browse/ARROW-8065
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>
> Currently: a fragment is a product of a scan; it is a lazy collection of scan
> tasks corresponding to a data source which is logically singular (like a
> single file, a single row group, ...). It would be more useful if instead a
> fragment were the direct object of a scan; one scans a fragment (or a
> collection of fragments):
> # Remove {{ScanOptions}} from Fragment's properties and move it into
> {{Fragment::Scan}} parameters.
> # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an
> overload to support predicate pushdown in FileSystemDataset and UnionDataset
> {{Dataset::GetFragments(std::shared_ptr predicate)}}.
> # Expose lazy accessor to Fragment::physical_schema()
> # Consolidate ScanOptions and ScanContext
> This will lessen the cognitive dissonance between fragments and files since
> fragments will no longer include references to scan properties.
[jira] [Updated] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques updated ARROW-8065:
------------------------------------------
    Description:
Currently: a fragment is a product of a scan; it is a lazy collection of scan tasks corresponding to a data source which is logically singular (like a single file, a single row group, ...). It would be more useful if instead a fragment were the direct object of a scan; one scans a fragment (or a collection of fragments):
# Remove {{ScanOptions}} from Fragment's properties and move it into {{Fragment::Scan}} parameters.
# Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an overload to support predicate pushdown in FileSystemDataset and UnionDataset {{Dataset::GetFragments(std::shared_ptr predicate)}}.
# Expose lazy accessor to Fragment::physical_schema()

This will lessen the cognitive dissonance between fragments and files since fragments will no longer include references to scan properties.

  was:
Currently: a fragment is a product of a scan; it is a lazy collection of scan tasks corresponding to a data source which is logically singular (like a single file, a single row group, ...). It would be more useful if instead a fragment were the direct object of a scan; one scans a fragment (or a collection of fragments):
# Remove {{ScanOptions}} from Fragment's properties and move it into {{Fragment::Scan}} parameters.
# Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an overload to support predicate pushdown in FileSystemDataset and UnionDataset {{Dataset::GetFragments(std::shared_ptr predicate)}}.
# {{Fragment::schema}} property should be set at construction time, usually extracted from the fragment's Dataset.

This will lessen the cognitive dissonance between fragments and files since fragments will no longer include references to scan properties.

> [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
> ---------------------------------------------------------
>
>                 Key: ARROW-8065
>                 URL: https://issues.apache.org/jira/browse/ARROW-8065
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>
> Currently: a fragment is a product of a scan; it is a lazy collection of scan
> tasks corresponding to a data source which is logically singular (like a
> single file, a single row group, ...). It would be more useful if instead a
> fragment were the direct object of a scan; one scans a fragment (or a
> collection of fragments):
> # Remove {{ScanOptions}} from Fragment's properties and move it into
> {{Fragment::Scan}} parameters.
> # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an
> overload to support predicate pushdown in FileSystemDataset and UnionDataset
> {{Dataset::GetFragments(std::shared_ptr predicate)}}.
> # Expose lazy accessor to Fragment::physical_schema()
> This will lessen the cognitive dissonance between fragments and files since
> fragments will no longer include references to scan properties.
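For the "lazy accessor" item above, one possible reading is that the physical schema is only read from the underlying source on first access and cached afterwards. The sketch below illustrates that idea under stated assumptions: `Schema`, `SchemaReader` and this `Fragment` are hypothetical simplifications (not the Arrow classes), and the reader callback stands in for whatever actually inspects the file.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a schema: just a list of field names.
struct Schema {
  std::vector<std::string> field_names;
};

class Fragment {
 public:
  // Assumption: the fragment is given a callback that knows how to read
  // the schema from the underlying source (e.g. a file footer).
  using SchemaReader = std::function<std::shared_ptr<Schema>()>;

  explicit Fragment(SchemaReader reader) : reader_(std::move(reader)) {
    assert(reader_ != nullptr);
  }

  // Lazy accessor: the reader runs on the first call only; later calls
  // return the cached schema. The mutex keeps concurrent callers from
  // triggering the read twice.
  std::shared_ptr<Schema> physical_schema() const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (schema_ == nullptr) schema_ = reader_();
    return schema_;
  }

 private:
  SchemaReader reader_;
  mutable std::mutex mutex_;
  mutable std::shared_ptr<Schema> schema_;
};
```

Constructing such a fragment stays cheap, since no I/O happens until a caller actually asks for the schema.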
[jira] [Updated] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ben Kietzman updated ARROW-8065:
--------------------------------
    Description:
Currently: a fragment is a product of a scan; it is a lazy collection of scan tasks corresponding to a data source which is logically singular (like a single file, a single row group, ...). It would be more useful if instead a fragment were the direct object of a scan; one scans a fragment (or a collection of fragments):
# Remove {{ScanOptions}} from Fragment's properties and move it into {{Fragment::Scan}} parameters.
# Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an overload to support predicate pushdown in FileSystemDataset and UnionDataset {{Dataset::GetFragments(std::shared_ptr predicate)}}.
# {{Fragment::schema}} property should be set at construction time, usually extracted from the fragment's Dataset.

This will lessen the cognitive dissonance between fragments and files since fragments will no longer include references to scan properties.

  was:
We should be able to list fragments without going through the Scanner/ScanOptions hoops. This exposes a flaw in the current API, which requires a ScanOptions to create a Fragment. This is also a problem for ARROW-7824: why do we need a ScanOptions (read manifest) to write record batches to a given path?
# Remove {{ScanOptions}} from Fragment's properties and move it into {{Fragment::Scan}} parameters.
# Remove {{ScanOptions}} from {{Dataset::GetFragments}}. If required, we can still provide an alternate signature, e.g. {{Dataset::GetFragments(std::shared_ptr predicate)}} for sub-tree pruning in FileSystemDataset.
# Fragment constructor should take a schema (and store it as a property), usually extracted from the Dataset schema. Update the schema() method accordingly.

> [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
> ---------------------------------------------------------
>
>                 Key: ARROW-8065
>                 URL: https://issues.apache.org/jira/browse/ARROW-8065
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>
> Currently: a fragment is a product of a scan; it is a lazy collection of scan
> tasks corresponding to a data source which is logically singular (like a
> single file, a single row group, ...). It would be more useful if instead a
> fragment were the direct object of a scan; one scans a fragment (or a
> collection of fragments):
> # Remove {{ScanOptions}} from Fragment's properties and move it into
> {{Fragment::Scan}} parameters.
> # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. We can provide an
> overload to support predicate pushdown in FileSystemDataset and UnionDataset
> {{Dataset::GetFragments(std::shared_ptr predicate)}}.
> # {{Fragment::schema}} property should be set at construction time, usually
> extracted from the fragment's Dataset.
> This will lessen the cognitive dissonance between fragments and files since
> fragments will no longer include references to scan properties.
[jira] [Updated] (ARROW-8065) [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
[ https://issues.apache.org/jira/browse/ARROW-8065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francois Saint-Jacques updated ARROW-8065:
------------------------------------------
    Component/s: C++ - Dataset

> [C++][Dataset] Untangle Dataset, Fragment and ScanOptions
> ---------------------------------------------------------
>
>                 Key: ARROW-8065
>                 URL: https://issues.apache.org/jira/browse/ARROW-8065
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++ - Dataset
>            Reporter: Francois Saint-Jacques
>            Priority: Major
>
> We should be able to list fragments without going through the
> Scanner/ScanOptions hoops. This exposes a flaw in the current API, which
> requires a ScanOptions to create a Fragment. This is also a problem for
> ARROW-7824: why do we need a ScanOptions (read manifest) to write record
> batches to a given path?
> # Remove {{ScanOptions}} from Fragment's properties and move it into
> {{Fragment::Scan}} parameters.
> # Remove {{ScanOptions}} from {{Dataset::GetFragments}}. If required, we can
> still provide an alternate signature, e.g.
> {{Dataset::GetFragments(std::shared_ptr predicate)}} for sub-tree
> pruning in FileSystemDataset.
> # Fragment constructor should take a schema (and store it as a property),
> usually extracted from the Dataset schema. Update the schema() method
> accordingly.