[jira] [Commented] (ARROW-2296) [C++] Add num_rows to file footer

2019-08-24 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914990#comment-16914990
 ] 

Wes McKinney commented on ARROW-2296:
-

I took a brief look at this. It's more complicated than I expected because all 
the record batch metadata needs to be loaded. Currently the code for loading a 
block loads the body unconditionally, so we would need to have a function that 
loads _only_ the metadata. This is more work than I'm willing to volunteer -- 
feel free to contribute a PR =) 
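
As an illustration of the metadata-only loading described above, here is a minimal, self-contained sketch. It is a toy model, not Arrow's reader code: `Block` mirrors the three fields of the Block struct in format/File.fbs, but `CountingSource`, `ReadBatchLength`, `CountRows`, and `AppendToyBatch` are hypothetical names, and the "metadata" is assumed to be a bare 8-byte row count rather than a Flatbuffers message.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Mirrors the fields of the Block struct in format/File.fbs.
struct Block {
  int64_t offset;           // where the message starts in the file
  int32_t metadata_length;  // size of the metadata portion
  int64_t body_length;      // size of the body that follows the metadata
};

// Hypothetical stand-in for a random-access file; counts bytes fetched.
class CountingSource {
 public:
  explicit CountingSource(std::vector<uint8_t> data) : data_(std::move(data)) {}

  // Copies `nbytes` starting at `offset` into `out`.
  void ReadAt(int64_t offset, int64_t nbytes, uint8_t* out) {
    assert(offset >= 0 && nbytes >= 0 &&
           static_cast<uint64_t>(offset + nbytes) <= data_.size());
    std::memcpy(out, data_.data() + offset, static_cast<size_t>(nbytes));
    bytes_read_ += nbytes;
  }

  int64_t bytes_read() const { return bytes_read_; }

 private:
  std::vector<uint8_t> data_;
  int64_t bytes_read_ = 0;
};

// Hypothetical metadata-only loader: fetches metadata_length bytes and
// never touches the body. Assumes the toy 8-byte metadata layout.
int64_t ReadBatchLength(CountingSource* src, const Block& block) {
  assert(block.metadata_length == static_cast<int32_t>(sizeof(int64_t)));
  uint8_t buf[sizeof(int64_t)];
  src->ReadAt(block.offset, block.metadata_length, buf);
  int64_t length = 0;
  std::memcpy(&length, buf, sizeof(length));
  return length;
}

// Sums the per-batch lengths using only the metadata of each block.
int64_t CountRows(CountingSource* src, const std::vector<Block>& blocks) {
  int64_t total = 0;
  for (const Block& b : blocks) total += ReadBatchLength(src, b);
  return total;
}

// Helper for building a toy file: 8-byte row count, then a zero-filled body.
void AppendToyBatch(std::vector<uint8_t>* file, std::vector<Block>* blocks,
                    int64_t num_rows, int64_t body_length) {
  Block b{static_cast<int64_t>(file->size()),
          static_cast<int32_t>(sizeof(int64_t)), body_length};
  uint8_t meta[sizeof(int64_t)];
  std::memcpy(meta, &num_rows, sizeof(num_rows));
  file->insert(file->end(), meta, meta + sizeof(meta));
  file->insert(file->end(), static_cast<size_t>(body_length), uint8_t{0});
  blocks->push_back(b);
}
```

In this model the cost of counting rows scales with the number of batches rather than with the size of their bodies, which is the property a real metadata-only function would need to preserve.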

> [C++] Add num_rows to file footer
> -
>
> Key: ARROW-2296
> URL: https://issues.apache.org/jira/browse/ARROW-2296
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Format
> Reporter: Lawrence Chan
> Priority: Minor
> Fix For: 0.15.0
>
>
> Maybe I'm overlooking something, but I don't see anything on the API surface 
> to get the number of rows in an Arrow file without reading all the record 
> batches. This is useful when we want to read into contiguous buffers, because 
> it allows us to allocate the right sizes up front.
> I'd like to propose that we add `num_rows` as a field in the file footer so 
> it's easy to query without reading the whole file.
> Meanwhile, before we get that added to the official format fbs, it would be 
> nice to have a method that iterates over the record batch headers and sums up 
> the lengths without reading the actual record batch body.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (ARROW-2296) [C++] Add num_rows to file footer

2019-08-21 Thread Wes McKinney (Jira)


[ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16912507#comment-16912507
 ] 

Wes McKinney commented on ARROW-2296:
-

At minimum, having a method in C++ that provides this information (so that 
users don't have to compute it themselves) seems useful. We don't need to 
change the file format.



[jira] [Commented] (ARROW-2296) [C++] Add num_rows to file footer

2018-03-12 Thread Lawrence Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395564#comment-16395564
 ] 

Lawrence Chan commented on ARROW-2296:
--

Yeah, I was thinking somewhere in the Footer struct, so we don't need to walk 
all the batches to sum them up.

Also they are indeed in the existing RecordBatch metadata, but the current 
implementation is inside a .cc file and I'd have to either copy+paste or modify 
my build to expose more of the existing code. Maybe we could expose something 
like this on the RecordBatchFileReader?
{code:cpp}
Status ReadRecordBatchMessage(int i, const flatbuf::RecordBatch** metadata) const;
{code}
Then it'd be possible to read the length fields without copying some of the 
other stuff. Not sure if this is a good idea though, since it seems that we 
don't usually expose the flatbuffers through the public API. Maybe just a 
{code:cpp}
int64_t num_rows() const;
{code}
is all I really want, and that can read the new Footer field once it's in 
there, and walk the batches in the current format?



[jira] [Commented] (ARROW-2296) [C++] Add num_rows to file footer

2018-03-10 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394295#comment-16394295
 ] 

Wes McKinney commented on ARROW-2296:
-

We could pretty easily add a "total length" field to the file footer, though, 
which would be more convenient 
https://github.com/apache/arrow/blob/master/format/File.fbs#L33



[jira] [Commented] (ARROW-2296) [C++] Add num_rows to file footer

2018-03-10 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-2296?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394287#comment-16394287
 ] 

Wes McKinney commented on ARROW-2296:
-

The row count is already contained in the RecordBatch metadata, and getting it 
does not require reading the whole file:

https://github.com/apache/arrow/blob/master/format/Message.fbs#L50

Does this not satisfy the use case?
