[jira] [Commented] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-14 Thread John Muehlhausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16951353#comment-16951353
 ] 

John Muehlhausen commented on ARROW-6837:
-

Initially proposed API:
{noformat}
static Status RecordBatchFileWriter::Open(io::OutputStream* sink,
    const std::shared_ptr<Schema>& schema,
    std::shared_ptr<RecordBatchWriter>* out,
    const std::shared_ptr<const KeyValueMetadata>& metadata = NULLPTR);

static Result<std::shared_ptr<RecordBatchWriter>> RecordBatchFileWriter::Open(
    io::OutputStream* sink, const std::shared_ptr<Schema>& schema,
    const std::shared_ptr<const KeyValueMetadata>& metadata = NULLPTR);

std::shared_ptr<const KeyValueMetadata> RecordBatchFileReader::metadata() const;
{noformat}

> [C++/Python] access File Footer custom_metadata
> ---
>
> Key: ARROW-6837
> URL: https://issues.apache.org/jira/browse/ARROW-6837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: John Muehlhausen
>Priority: Minor
>
> Access custom_metadata from ARROW-6836



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (ARROW-6830) [R] Select Subset of Columns in read_arrow

2019-10-09 Thread John Muehlhausen (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-6830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16947997#comment-16947997
 ] 

John Muehlhausen commented on ARROW-6830:
-

Not sure how the R integration works, but if the 30 GB file is memory-mapped and 
you only access certain columns, the other columns won't actually consume any 
physical memory.
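
To illustrate the point about memory mapping, here is a minimal sketch that
uses only Python's standard mmap module, not the Arrow R or C++ API. The file
layout (two fixed-size int64 "columns" stored back to back), the row count and
the helper names are assumptions made up for the example. Slicing the mapping
decodes only the requested byte range; pages that are never touched are
generally not brought into physical memory by the operating system.

```python
import mmap
import os
import struct
import tempfile

ROWS = 1000  # hypothetical row count for the toy file


def write_two_columns(path):
    """Write two int64 'columns' back to back: col0 = 0..ROWS-1, col1 = squares."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{ROWS}q", *range(ROWS)))
        f.write(struct.pack(f"<{ROWS}q", *(i * i for i in range(ROWS))))


def read_column(path, index):
    """Map the file and decode only the byte range belonging to one column."""
    width = ROWS * 8  # bytes per column
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            start = index * width
            return list(struct.unpack(f"<{ROWS}q", mapped[start:start + width]))


def demo():
    """Create the toy file, read only the second column, then clean up."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        write_two_columns(path)
        return read_column(path, 1)
    finally:
        os.remove(path)
```

Calling demo() returns the second column only; the bytes of the first column
are mapped but never read.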

> [R] Select Subset of Columns in read_arrow
> --
>
> Key: ARROW-6830
> URL: https://issues.apache.org/jira/browse/ARROW-6830
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: R
>Reporter: Anthony Abate
>Priority: Minor
>
> *Note:*  Not sure if this is a limitation of the R library or the underlying 
> C++ code:
> I have a ~30 gig arrow file with almost 1000 columns - it has 12,000 record 
> batches of varying row sizes
> 1. Is it possible to use *read_arrow* to filter out columns?  (similar to 
> how *read_feather* has col_select =... )
> 2. Or is it possible using *RecordBatchFileReader* to filter columns?
>  
> The only thing I seem to be able to do (please confirm if this is my only 
> option) is loop over all record batches, select a single column at a time, 
> and construct the data I need to pull out manually, i.e. like the following:
> {code:java}
> # assuming get_batch() is 0-based (as the i == 0 case implies), the last
> # valid index is num_record_batches - 1
> for (i in seq_len(data_rbfr$num_record_batches) - 1) {
>   rbn <- data_rbfr$get_batch(i)
>
>   if (i == 0) {
>     merged <- as.data.frame(rbn$column(5)$as_vector())
>   } else {
>     dfn <- as.data.frame(rbn$column(5)$as_vector())
>     merged <- rbind(merged, dfn)
>   }
>
>   print(paste(i, nrow(merged)))
> }
> {code}
>  
>  





[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-10-09 Thread John Muehlhausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muehlhausen updated ARROW-5916:

Priority: Minor  (was: Blocker)

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Minor
> Fix For: 1.0.0
>
> Attachments: test.arrow_ipc
>
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
>     auto nodes = metadata_->nodes();
>     // pop off a field
>     if (field_index >= static_cast<int>(nodes->size())) {
>       return Status::Invalid("Ran out of field metadata, likely malformed");
>     }
>     const flatbuf::FieldNode* node = nodes->Get(field_index);
>     // out->length = node->length();    // previous behaviour
>     out->length = metadata_->length();  // proposed: use RecordBatch.length
>     out->null_count = node->null_count();
>     out->offset = 0;
>     return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.





[jira] [Updated] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-10-09 Thread John Muehlhausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-5916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muehlhausen updated ARROW-5916:

Priority: Blocker  (was: Minor)

> [C++] Allow RecordBatch.length to be less than array lengths
> 
>
> Key: ARROW-5916
> URL: https://issues.apache.org/jira/browse/ARROW-5916
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: John Muehlhausen
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: test.arrow_ipc
>
>
> 0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
> array length be equal.  As per 
> [https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
>  , we discussed changing this so that RecordBatch.length can be [0,array 
> length].
>  If RecordBatch.length is less than array length, the reader should ignore 
> the portion of the array(s) beyond RecordBatch.length.  This will allow 
> partially populated batches to be read in scenarios identified in the above 
> discussion.
> {code:c++}
>   Status GetFieldMetadata(int field_index, ArrayData* out) {
>     auto nodes = metadata_->nodes();
>     // pop off a field
>     if (field_index >= static_cast<int>(nodes->size())) {
>       return Status::Invalid("Ran out of field metadata, likely malformed");
>     }
>     const flatbuf::FieldNode* node = nodes->Get(field_index);
>     // out->length = node->length();    // previous behaviour
>     out->length = metadata_->length();  // proposed: use RecordBatch.length
>     out->null_count = node->null_count();
>     out->offset = 0;
>     return Status::OK();
>   }
> {code}
> Attached is a test IPC File containing a batch with length 1, array length 3.





[jira] [Created] (ARROW-6840) [C++/Python] retrieve fd of open memory mapped file and Open() memory mapped file by fd

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6840:
---

 Summary: [C++/Python] retrieve fd of open memory mapped file and 
Open() memory mapped file by fd
 Key: ARROW-6840
 URL: https://issues.apache.org/jira/browse/ARROW-6840
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: John Muehlhausen


We want to retrieve the file descriptor of a memory mapped file for the purpose 
of transferring it across process boundaries.  On the receiving end, we want to 
be able to map a file based on the file descriptor rather than the path.

This helps with race conditions when the path may have been unlinked.
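
The idea can be sketched with Python's standard os and mmap modules on a POSIX
system. This is not an Arrow API: map_by_fd and demo are hypothetical helper
names, and the actual transfer of the descriptor to another process (for
example over a Unix domain socket) is left out. The sketch only shows that a
mapping created from a descriptor keeps working after the path has been
unlinked.

```python
import mmap
import os
import tempfile


def map_by_fd(fd):
    """Map a whole file read-only given only its descriptor, never its path."""
    size = os.fstat(fd).st_size
    return mmap.mmap(fd, size, access=mmap.ACCESS_READ)


def demo():
    """Write a file, unlink its path, then map it through the open descriptor."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"ARROW1 payload")
        os.unlink(path)  # the path is gone; the descriptor is still valid (POSIX)
        assert not os.path.exists(path)
        with map_by_fd(fd) as mapped:
            return bytes(mapped[:])
    finally:
        os.close(fd)
```

A receiver that holds only the descriptor never has to look the path up again,
which is what avoids the race with unlink.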


cf 
[https://lists.apache.org/thread.html/83373ab00f552ee8afd2bac2b2721468b3f28fe283490e379998453a@%3Cdev.arrow.apache.org%3E]





[jira] [Created] (ARROW-6839) [Java] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6839:
---

 Summary: [Java] access File Footer custom_metadata
 Key: ARROW-6839
 URL: https://issues.apache.org/jira/browse/ARROW-6839
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836





[jira] [Created] (ARROW-6838) [JS] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6838:
---

 Summary: [JS] access File Footer custom_metadata
 Key: ARROW-6838
 URL: https://issues.apache.org/jira/browse/ARROW-6838
 Project: Apache Arrow
  Issue Type: New Feature
  Components: JavaScript
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836





[jira] [Updated] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)


 [ 
https://issues.apache.org/jira/browse/ARROW-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muehlhausen updated ARROW-6837:

Priority: Minor  (was: Major)

> [C++/Python] access File Footer custom_metadata
> ---
>
> Key: ARROW-6837
> URL: https://issues.apache.org/jira/browse/ARROW-6837
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++, Python
>Reporter: John Muehlhausen
>Priority: Minor
>
> Access custom_metadata from ARROW-6836





[jira] [Created] (ARROW-6837) [C++/Python] access File Footer custom_metadata

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6837:
---

 Summary: [C++/Python] access File Footer custom_metadata
 Key: ARROW-6837
 URL: https://issues.apache.org/jira/browse/ARROW-6837
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++, Python
Reporter: John Muehlhausen


Access custom_metadata from ARROW-6836





[jira] [Created] (ARROW-6836) [Format] add a custom_metadata:[KeyValue] field to the Footer table in File.fbs

2019-10-09 Thread John Muehlhausen (Jira)
John Muehlhausen created ARROW-6836:
---

 Summary: [Format] add a custom_metadata:[KeyValue] field to the 
Footer table in File.fbs
 Key: ARROW-6836
 URL: https://issues.apache.org/jira/browse/ARROW-6836
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Format
Reporter: John Muehlhausen


add a custom_metadata:[KeyValue] field to the Footer table in File.fbs

Use case:

If a file is expanded with additional record batches and the custom_metadata 
changes, Schema is no longer an appropriate place to make this change, since the 
two copies of Schema (at the beginning and end of the file) would then disagree 
and the result would be ambiguous.
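
A toy model of the use case, written with plain Python dataclasses. It is not
the flatbuffers layout from File.fbs: the class names, the "batches" key and
the append helper are all made up for illustration. It only shows that
appending a batch rewrites the footer while the schema at the start of the
file stays as written, so metadata that changes over time fits in the footer
without making the two schema copies disagree.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    fields: tuple
    custom_metadata: dict = field(default_factory=dict)


@dataclass
class Footer:
    schema: Schema  # second copy of the schema, at the end of the file
    custom_metadata: dict = field(default_factory=dict)  # the proposed field


@dataclass
class ToyFile:
    leading_schema: Schema  # written once at the start of the file
    batches: list
    footer: Footer

    def append(self, batch, footer_metadata):
        """Add a batch and rewrite only the footer; the leading schema is untouched."""
        self.batches.append(batch)
        self.footer = Footer(self.leading_schema, dict(footer_metadata))


def demo():
    schema = Schema(fields=("price", "qty"))
    toy = ToyFile(schema, [], Footer(schema, {"batches": "0"}))
    toy.append({"price": [1.0], "qty": [5]}, {"batches": "1"})
    return toy
```

After demo(), the footer metadata reflects the new batch count while both
schema copies are still equal.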

cf 
https://lists.apache.org/thread.html/c3b3d1456b7062a435f6795c0308ccb7c8fe55c818cfed2cf55f76c5@%3Cdev.arrow.apache.org%3E





[jira] [Created] (ARROW-5916) [C++] Allow RecordBatch.length to be less than array lengths

2019-07-11 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5916:
---

 Summary: [C++] Allow RecordBatch.length to be less than array 
lengths
 Key: ARROW-5916
 URL: https://issues.apache.org/jira/browse/ARROW-5916
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: John Muehlhausen
 Attachments: test.arrow_ipc

0.13 ignored RecordBatch.length.  0.14 requires that RecordBatch.length and 
array length be equal.  As per 
[https://lists.apache.org/thread.html/2692dd8fe09c92aa313bded2f4c2d4240b9ef75a8604ec214eb02571@%3Cdev.arrow.apache.org%3E]
 , we discussed changing this so that RecordBatch.length can be [0,array 
length].

 If RecordBatch.length is less than array length, the reader should ignore the 
portion of the array(s) beyond RecordBatch.length.  This will allow partially 
populated batches to be read in scenarios identified in the above discussion.

{code:c++}
  Status GetFieldMetadata(int field_index, ArrayData* out) {
    auto nodes = metadata_->nodes();
    // pop off a field
    if (field_index >= static_cast<int>(nodes->size())) {
      return Status::Invalid("Ran out of field metadata, likely malformed");
    }
    const flatbuf::FieldNode* node = nodes->Get(field_index);

    // out->length = node->length();    // previous behaviour
    out->length = metadata_->length();  // proposed: use RecordBatch.length
    out->null_count = node->null_count();
    out->offset = 0;
    return Status::OK();
  }
{code}

Attached is a test IPC File containing a batch with length 1, array length 3.
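
The intended reader behaviour can be sketched in plain Python, with lists
standing in for Arrow arrays. The function name and the dict-of-lists
representation are assumptions for illustration, not part of any Arrow API.
The example mirrors the attached file: batch length 1, array length 3.

```python
def truncate_to_batch_length(columns, batch_length):
    """Return the columns limited to the first batch_length rows.

    Raises ValueError if batch_length is outside [0, array length]
    for any column.
    """
    result = {}
    for name, values in columns.items():
        if not 0 <= batch_length <= len(values):
            raise ValueError(
                f"batch length {batch_length} out of range for column "
                f"{name!r} of length {len(values)}"
            )
        # Ignore the portion of the array beyond the batch length.
        result[name] = values[:batch_length]
    return result
```

For example, truncate_to_batch_length({"a": [1, 2, 3]}, 1) keeps only the
first row, and a batch length of 4 is rejected.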



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Issue Comment Deleted] (ARROW-5438) [JS] Utilize stream EOS in File format

2019-06-03 Thread John Muehlhausen (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muehlhausen updated ARROW-5438:

Comment: was deleted

(was: Will add test case when I can)

> [JS] Utilize stream EOS in File format
> --
>
> Key: ARROW-5438
> URL: https://issues.apache.org/jira/browse/ARROW-5438
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (ARROW-5439) [Java] Utilize stream EOS in File format

2019-06-03 Thread John Muehlhausen (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Muehlhausen updated ARROW-5439:

Comment: was deleted

(was: Will add test case when I can)

> [Java] Utilize stream EOS in File format
> 
>
> Key: ARROW-5439
> URL: https://issues.apache.org/jira/browse/ARROW-5439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().





[jira] [Commented] (ARROW-5439) [Java] Utilize stream EOS in File format

2019-06-03 Thread John Muehlhausen (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855189#comment-16855189
 ] 

John Muehlhausen commented on ARROW-5439:
-

Will add test case when I can

> [Java] Utilize stream EOS in File format
> 
>
> Key: ARROW-5439
> URL: https://issues.apache.org/jira/browse/ARROW-5439
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Java
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().





[jira] [Commented] (ARROW-5438) [JS] Utilize stream EOS in File format

2019-06-03 Thread John Muehlhausen (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855188#comment-16855188
 ] 

John Muehlhausen commented on ARROW-5438:
-

Will add test case when I can

> [JS] Utilize stream EOS in File format
> --
>
> Key: ARROW-5438
> URL: https://issues.apache.org/jira/browse/ARROW-5438
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: JavaScript
>Reporter: John Muehlhausen
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().





[jira] [Created] (ARROW-5438) [JS] Utilize stream EOS in File format

2019-05-29 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5438:
---

 Summary: [JS] Utilize stream EOS in File format
 Key: ARROW-5438
 URL: https://issues.apache.org/jira/browse/ARROW-5438
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: John Muehlhausen


We currently do not write EOS at the end of a Message stream inside the File 
format.  As a result, the file cannot be parsed sequentially.  This change 
prepares for other implementations or future reference features that parse a 
File sequentially... i.e. without access to seek().





[jira] [Commented] (ARROW-5395) Utilize stream EOS in File format

2019-05-22 Thread John Muehlhausen (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16846074#comment-16846074
 ] 

John Muehlhausen commented on ARROW-5395:
-

https://github.com/apache/arrow/pull/4372

> Utilize stream EOS in File format
> -
>
> Key: ARROW-5395
> URL: https://issues.apache.org/jira/browse/ARROW-5395
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++, Documentation
>Reporter: John Muehlhausen
>Priority: Minor
>   Original Estimate: 0.25h
>  Remaining Estimate: 0.25h
>
> We currently do not write EOS at the end of a Message stream inside the File 
> format.  As a result, the file cannot be parsed sequentially.  This change 
> prepares for other implementations or future reference features that parse a 
> File sequentially... i.e. without access to seek().
>  





[jira] [Created] (ARROW-5395) Utilize stream EOS in File format

2019-05-22 Thread John Muehlhausen (JIRA)
John Muehlhausen created ARROW-5395:
---

 Summary: Utilize stream EOS in File format
 Key: ARROW-5395
 URL: https://issues.apache.org/jira/browse/ARROW-5395
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++, Documentation
Reporter: John Muehlhausen


We currently do not write EOS at the end of a Message stream inside the File 
format.  As a result, the file cannot be parsed sequentially.  This change 
prepares for other implementations or future reference features that parse a 
File sequentially... i.e. without access to seek().
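
Why an EOS marker matters for sequential parsing can be shown with a toy
framing in Python. This is a simplification and not the exact Arrow IPC
encoding: each message is assumed to be a 4-byte little-endian length followed
by its payload, a zero length is assumed to mark end of stream, and the
"FOOTER" bytes are a stand-in for whatever follows the stream in the file.

```python
import io
import struct

EOS = struct.pack("<i", 0)  # assumed end-of-stream marker: a zero length prefix


def write_stream(messages):
    """Frame each message as <int32 length><payload>, then append the EOS marker."""
    out = io.BytesIO()
    for payload in messages:
        out.write(struct.pack("<i", len(payload)))
        out.write(payload)
    out.write(EOS)
    out.write(b"FOOTER")  # stand-in for the data that follows the stream
    return out.getvalue()


def read_sequentially(stream):
    """Read messages front to back using only read(); stop at the EOS marker."""
    messages = []
    while True:
        (length,) = struct.unpack("<i", stream.read(4))
        if length == 0:
            return messages
        messages.append(stream.read(length))
```

The reader never calls seek(). Without the marker it would have no way to tell
where the messages end and would run straight into the trailing bytes.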

 


