[jira] [Commented] (ARROW-6837) [C++/Python] access File Footer custom_metadata
[ https://issues.apache.org/jira/browse/ARROW-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16951353#comment-16951353 ]

John Muehlhausen commented on ARROW-6837:
-----------------------------------------

Initially proposed API (template parameters restored; the original notification stripped them):

{noformat}
static Status RecordBatchFileWriter::Open(
    io::OutputStream* sink, const std::shared_ptr<Schema>& schema,
    std::shared_ptr<RecordBatchWriter>* out,
    const std::shared_ptr<const KeyValueMetadata>& metadata = NULLPTR);

static Result<std::shared_ptr<RecordBatchWriter>> RecordBatchFileWriter::Open(
    io::OutputStream* sink, const std::shared_ptr<Schema>& schema,
    const std::shared_ptr<const KeyValueMetadata>& metadata = NULLPTR);

std::shared_ptr<const KeyValueMetadata> RecordBatchFileReader::metadata() const;
{noformat}

> [C++/Python] access File Footer custom_metadata
> -----------------------------------------------
>
>                 Key: ARROW-6837
>                 URL: https://issues.apache.org/jira/browse/ARROW-6837
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: John Muehlhausen
>            Priority: Minor
>
> Access custom_metadata from ARROW-6836



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow
[ https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16947997#comment-16947997 ]

John Muehlhausen commented on ARROW-6830:
-----------------------------------------

Not sure how the R integration works, but if the 30 gigs are memory-mapped and you only access certain columns, the other columns won't actually consume any memory.

> [R] Select Subset of Columns in read_arrow
> ------------------------------------------
>
>                 Key: ARROW-6830
>                 URL: https://issues.apache.org/jira/browse/ARROW-6830
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: R
>            Reporter: Anthony Abate
>            Priority: Minor
>
> *Note:* Not sure if this is a limitation of the R library or the underlying
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record
> batches of varying row sizes.
> 1. Is it possible to use *read_arrow* to filter out columns? (similar to
> how *read_feather* has a (col_select = ...))
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>
> The only thing I seem to be able to do (please confirm if this is my only
> option) is loop over all record batches, select a single column at a time,
> and construct the data I need to pull out manually, i.e. like the following:
> {code:java}
> for(i in 0:data_rbfr$num_record_batches) {
>   rbn <- data_rbfr$get_batch(i)
>
>   if (i == 0)
>   {
>     merged <- as.data.frame(rbn$column(5)$as_vector())
>   }
>   else
>   {
>     dfn <- as.data.frame(rbn$column(5)$as_vector())
>     merged <- rbind(merged, dfn)
>   }
>
>   print(paste(i, nrow(merged)))
> } {code}
[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths
[ https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Muehlhausen updated ARROW-5916:
------------------------------------
    Priority: Minor  (was: Blocker)

> [C++] Allow RecordBatch.length to be less than array lengths
> ------------------------------------------------------------
>
>                 Key: ARROW-5916
>                 URL: https://issues.apache.org/jira/browse/ARROW-5916
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: John Muehlhausen
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: test.arrow_ipc
>
>
> 0.13 ignored RecordBatch.length. 0.14 requires that RecordBatch.length and
> array length be equal. As per
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E],
> we discussed changing this so that RecordBatch.length can be [0, array
> length].
> If RecordBatch.length is less than array length, the reader should ignore
> the portion of the array(s) beyond RecordBatch.length. This will allow
> partially populated batches to be read in scenarios identified in the above
> discussion.
> {code:c++}
> Status GetFieldMetadata(int field_index, ArrayData* out) {
>   auto nodes = metadata_->nodes();
>   // pop off a field
>   if (field_index >= static_cast<int>(nodes->size())) {
>     return Status::Invalid("Ran out of field metadata, likely malformed");
>   }
>   const flatbuf::FieldNode* node = nodes->Get(field_index);
>   // out->length = node->length();     // previous behavior
>   out->length = metadata_->length();   // proposed change
>   out->null_count = node->null_count();
>   out->offset = 0;
>   return Status::OK();
> }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.
[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths
[ https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Muehlhausen updated ARROW-5916:
------------------------------------
    Priority: Blocker  (was: Minor)

> [C++] Allow RecordBatch.length to be less than array lengths
> ------------------------------------------------------------
>
>                 Key: ARROW-5916
>                 URL: https://issues.apache.org/jira/browse/ARROW-5916
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: John Muehlhausen
>            Priority: Blocker
>             Fix For: 1.0.0
>
>         Attachments: test.arrow_ipc
>
>
> 0.13 ignored RecordBatch.length. 0.14 requires that RecordBatch.length and
> array length be equal. As per
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E],
> we discussed changing this so that RecordBatch.length can be [0, array
> length].
> If RecordBatch.length is less than array length, the reader should ignore
> the portion of the array(s) beyond RecordBatch.length. This will allow
> partially populated batches to be read in scenarios identified in the above
> discussion.
> {code:c++}
> Status GetFieldMetadata(int field_index, ArrayData* out) {
>   auto nodes = metadata_->nodes();
>   // pop off a field
>   if (field_index >= static_cast<int>(nodes->size())) {
>     return Status::Invalid("Ran out of field metadata, likely malformed");
>   }
>   const flatbuf::FieldNode* node = nodes->Get(field_index);
>   // out->length = node->length();     // previous behavior
>   out->length = metadata_->length();   // proposed change
>   out->null_count = node->null_count();
>   out->offset = 0;
>   return Status::OK();
> }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.
[jira] [Created] (ARROW-6840) [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd
John Muehlhausen created ARROW-6840:
---------------------------------------

             Summary: [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd
                 Key: ARROW-6840
                 URL: https://issues.apache.org/jira/browse/ARROW-6840
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++
            Reporter: John Muehlhausen


We want to retrieve the file descriptor of a memory mapped file for the purpose of transferring it across process boundaries. On the receiving end, we want to be able to map a file based on the file descriptor rather than the path. This helps with race conditions when the path may have been unlinked.

cf [https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E]
[jira] [Created] (ARROW-6839) [Java] access File Footer custom_metadata
John Muehlhausen created ARROW-6839:
---------------------------------------

             Summary: [Java] access File Footer custom_metadata
                 Key: ARROW-6839
                 URL: https://issues.apache.org/jira/browse/ARROW-6839
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Java
            Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836
[jira] [Created] (ARROW-6838) [JS] access File Footer custom_metadata
John Muehlhausen created ARROW-6838:
---------------------------------------

             Summary: [JS] access File Footer custom_metadata
                 Key: ARROW-6838
                 URL: https://issues.apache.org/jira/browse/ARROW-6838
             Project: Apache Arrow
          Issue Type: New Feature
          Components: JavaScript
            Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836
[jira] [Updated] (ARROW-6837) [C++/Python] access File Footer custom_metadata
[ https://issues.apache.org/jira/browse/ARROW-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Muehlhausen updated ARROW-6837:
------------------------------------
    Priority: Minor  (was: Major)

> [C++/Python] access File Footer custom_metadata
> -----------------------------------------------
>
>                 Key: ARROW-6837
>                 URL: https://issues.apache.org/jira/browse/ARROW-6837
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++, Python
>            Reporter: John Muehlhausen
>            Priority: Minor
>
> Access custom_metadata from ARROW-6836
[jira] [Created] (ARROW-6837) [C++/Python] access File Footer custom_metadata
John Muehlhausen created ARROW-6837:
---------------------------------------

             Summary: [C++/Python] access File Footer custom_metadata
                 Key: ARROW-6837
                 URL: https://issues.apache.org/jira/browse/ARROW-6837
             Project: Apache Arrow
          Issue Type: New Feature
          Components: C++, Python
            Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836
[jira] [Created] (ARROW-6836) [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs
John Muehlhausen created ARROW-6836:
---------------------------------------

             Summary: [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs
                 Key: ARROW-6836
                 URL: https://issues.apache.org/jira/browse/ARROW-6836
             Project: Apache Arrow
          Issue Type: New Feature
          Components: Format
            Reporter: John Muehlhausen


Add a custom_metadata:[KeyValue] field to the Footer table in File.fbs.

Use case: if a file is expanded with additional record batches and the custom_metadata changes, Schema is no longer an appropriate place to record that change, since the two copies of Schema (at the beginning and end of the file) would then be ambiguous.

cf https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E
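A rough sketch of what the proposed change to the Footer table in File.fbs might look like (surrounding fields shown for context; the exact placement and doc comments are assumptions, not the merged definition):

```
table Footer {
  version: org.apache.arrow.flatbuf.MetadataVersion;
  schema: org.apache.arrow.flatbuf.Schema;
  dictionaries: [ Block ];
  recordBatches: [ Block ];

  /// Proposed: user-defined metadata that lives only in the Footer, so it
  /// can change when the file is appended to without making the two
  /// redundant copies of Schema disagree.
  custom_metadata: [ KeyValue ];
}
```

Because flatbuffers fields are appended at the end of a table, existing readers would simply ignore the new field, keeping old files and old readers compatible.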
[jira] [Created] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths
John Muehlhausen created ARROW-5916:
---------------------------------------

             Summary: [C++] Allow RecordBatch.length to be less than array lengths
                 Key: ARROW-5916
                 URL: https://issues.apache.org/jira/browse/ARROW-5916
             Project: Apache Arrow
          Issue Type: New Feature
            Reporter: John Muehlhausen
         Attachments: test.arrow_ipc

0.13 ignored RecordBatch.length. 0.14 requires that RecordBatch.length and array length be equal. As per [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E], we discussed changing this so that RecordBatch.length can be [0, array length].

If RecordBatch.length is less than array length, the reader should ignore the portion of the array(s) beyond RecordBatch.length. This will allow partially populated batches to be read in scenarios identified in the above discussion.

{code:c++}
Status GetFieldMetadata(int field_index, ArrayData* out) {
  auto nodes = metadata_->nodes();
  // pop off a field
  if (field_index >= static_cast<int>(nodes->size())) {
    return Status::Invalid("Ran out of field metadata, likely malformed");
  }
  const flatbuf::FieldNode* node = nodes->Get(field_index);
  // out->length = node->length();     // previous behavior
  out->length = metadata_->length();   // proposed change
  out->null_count = node->null_count();
  out->offset = 0;
  return Status::OK();
}
{code}

Attached is a test IPC File containing a batch with length 1, array length 3.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
[jira] [Issue Comment Deleted] (ARROW-5438) [JS] Utilize stream EOS in File format
[ https://issues.apache.org/jira/browse/ARROW-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Muehlhausen updated ARROW-5438:
------------------------------------
    Comment: was deleted

(was: Will add test case when I can)

> [JS] Utilize stream EOS in File format
> --------------------------------------
>
>                 Key: ARROW-5438
>                 URL: https://issues.apache.org/jira/browse/ARROW-5438
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: JavaScript
>            Reporter: John Muehlhausen
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File
> format. As a result, the file cannot be parsed sequentially. This change
> prepares for other implementations or future reference features that parse a
> File sequentially... i.e. without access to seek().
[jira] [Issue Comment Deleted] (ARROW-5439) [Java] Utilize stream EOS in File format
[ https://issues.apache.org/jira/browse/ARROW-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Muehlhausen updated ARROW-5439:
------------------------------------
    Comment: was deleted

(was: Will add test case when I can)

> [Java] Utilize stream EOS in File format
> ----------------------------------------
>
>                 Key: ARROW-5439
>                 URL: https://issues.apache.org/jira/browse/ARROW-5439
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: John Muehlhausen
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File
> format. As a result, the file cannot be parsed sequentially. This change
> prepares for other implementations or future reference features that parse a
> File sequentially... i.e. without access to seek().
[jira] [Commented] (ARROW-5439) [Java] Utilize stream EOS in File format
[ https://issues.apache.org/jira/browse/ARROW-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855189#comment-16855189 ]

John Muehlhausen commented on ARROW-5439:
-----------------------------------------

Will add test case when I can

> [Java] Utilize stream EOS in File format
> ----------------------------------------
>
>                 Key: ARROW-5439
>                 URL: https://issues.apache.org/jira/browse/ARROW-5439
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Java
>            Reporter: John Muehlhausen
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File
> format. As a result, the file cannot be parsed sequentially. This change
> prepares for other implementations or future reference features that parse a
> File sequentially... i.e. without access to seek().
[jira] [Commented] (ARROW-5438) [JS] Utilize stream EOS in File format
[ https://issues.apache.org/jira/browse/ARROW-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16855188#comment-16855188 ]

John Muehlhausen commented on ARROW-5438:
-----------------------------------------

Will add test case when I can

> [JS] Utilize stream EOS in File format
> --------------------------------------
>
>                 Key: ARROW-5438
>                 URL: https://issues.apache.org/jira/browse/ARROW-5438
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: JavaScript
>            Reporter: John Muehlhausen
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File
> format. As a result, the file cannot be parsed sequentially. This change
> prepares for other implementations or future reference features that parse a
> File sequentially... i.e. without access to seek().
[jira] [Created] (ARROW-5438) [JS] Utilize stream EOS in File format
John Muehlhausen created ARROW-5438:
---------------------------------------

             Summary: [JS] Utilize stream EOS in File format
                 Key: ARROW-5438
                 URL: https://issues.apache.org/jira/browse/ARROW-5438
             Project: Apache Arrow
          Issue Type: Improvement
            Reporter: John Muehlhausen


We currently do not write EOS at the end of a Message stream inside the File format. As a result, the file cannot be parsed sequentially. This change prepares for other implementations or future reference features that parse a File sequentially... i.e. without access to seek().
[jira] [Commented] (ARROW-5395) Utilize stream EOS in File format
[ https://issues.apache.org/jira/browse/ARROW-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16846074#comment-16846074 ]

John Muehlhausen commented on ARROW-5395:
-----------------------------------------

https://github.com/apache/arrow/pull/4372

> Utilize stream EOS in File format
> ---------------------------------
>
>                 Key: ARROW-5395
>                 URL: https://issues.apache.org/jira/browse/ARROW-5395
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, Documentation
>            Reporter: John Muehlhausen
>            Priority: Minor
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> We currently do not write EOS at the end of a Message stream inside the File
> format. As a result, the file cannot be parsed sequentially. This change
> prepares for other implementations or future reference features that parse a
> File sequentially... i.e. without access to seek().
[jira] [Created] (ARROW-5395) Utilize stream EOS in File format
John Muehlhausen created ARROW-5395:
---------------------------------------

             Summary: Utilize stream EOS in File format
                 Key: ARROW-5395
                 URL: https://issues.apache.org/jira/browse/ARROW-5395
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Documentation
            Reporter: John Muehlhausen


We currently do not write EOS at the end of a Message stream inside the File format. As a result, the file cannot be parsed sequentially. This change prepares for other implementations or future reference features that parse a File sequentially... i.e. without access to seek().