[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392324#comment-16392324
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-371698673
 
 
   > I can't speak for my company. I will have to check with my manager and 
technical leader. However, a big company always wants something in return: 
reputation, business benefits, etc.
   
   Sorry for the delay (busy setting up Spark app profiling on the cluster). I checked 
with my manager, and the general response is:
   > With a limited amount of dev resources, we will contribute back internal 
features when suitable, but should not actively work on community issues.
   
   However, I will try to figure out if there are issues I can work on in my 
spare time.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Priority: Major
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is to allow selective reading of RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_indices are 
> sorted and unique, e.g. [1, 3, 5]).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be 
> processed by multiple tasks (possibly on different hosts, like Map tasks in 
> MapReduce).
> [~wesmckinn][~xhochy] What do you think?
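
To make the proposed API concrete, here is a minimal caller-side sketch of how a 
task assigned row groups 1, 3 and 5 might consume them. This is only an 
illustration, not code from the PR: the ReadableFile/OpenFile calls are assumed to 
follow the status-out-parameter style Arrow and parquet-cpp used at the time, and 
exact signatures may differ between versions.

{code:java}
// Hypothetical usage sketch for the proposed API; names outside the proposal
// itself (e.g. the file path) are placeholders.
#include <memory>
#include <string>
#include <vector>

#include "arrow/io/file.h"
#include "arrow/memory_pool.h"
#include "arrow/record_batch.h"
#include "arrow/status.h"
#include "parquet/arrow/reader.h"

::arrow::Status ScanAssignedRowGroups(const std::string& path) {
  // Open the file with Arrow's file API (status-returning style assumed).
  std::shared_ptr<::arrow::io::ReadableFile> infile;
  ::arrow::Status st = ::arrow::io::ReadableFile::Open(path, &infile);
  if (!st.ok()) return st;

  std::unique_ptr<parquet::arrow::FileReader> file_reader;
  st = parquet::arrow::OpenFile(infile, ::arrow::default_memory_pool(), &file_reader);
  if (!st.ok()) return st;

  // The split assigned to this task: sorted, unique row group indices.
  std::vector<int> row_groups = {1, 3, 5};
  std::shared_ptr<::arrow::RecordBatchReader> batches;
  st = file_reader->GetRecordBatchReader(row_groups, &batches);
  if (!st.ok()) return st;

  std::shared_ptr<::arrow::RecordBatch> batch;
  while (true) {
    st = batches->ReadNext(&batch);
    if (!st.ok()) return st;
    if (batch == nullptr) break;  // all selected row groups are consumed
    // ... process `batch` ...
  }
  return ::arrow::Status::OK();
}
{code}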



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392304#comment-16392304
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173353428
 
 

 ##
 File path: src/parquet/arrow/reader.cc
 ##
 @@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator {
   bool done_;
 };
 
+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+  explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,
+                                     const std::vector<int>& column_indices,
+                                     FileReader* reader)
+      : row_group_indices_(row_group_indices),
+        column_indices_(column_indices),
+        file_reader_(reader),
+        next_row_group_(0) {
+    file_reader_->GetSchema(column_indices_, &schema_);
+  }
+
+  ~RowGroupRecordBatchReader() {}
+
+  std::shared_ptr<::arrow::Schema> schema() const override {
+    return schema_;
+  }
+
+  Status ReadNext(std::shared_ptr<::arrow::RecordBatch>* out) override {
+    if (table_ != nullptr) {  // one row group has been loaded
+      std::shared_ptr<::arrow::RecordBatch> tmp;
+      table_batch_reader_->ReadNext(&tmp);
+      if (tmp != nullptr) {  // some column chunks are left in the table
+        *out = tmp;
+        return Status::OK();
+      } else {  // the entire table is consumed
+        table_batch_reader_.reset();
+        table_.reset();
+      }
+    }
+
+    // all row groups have been consumed
+    if (next_row_group_ == row_group_indices_.size()) {
+      *out = nullptr;
+      return Status::OK();
+    }
+
+    RETURN_NOT_OK(file_reader_->ReadRowGroup(row_group_indices_[next_row_group_],
 
 Review comment:
   I am most concerned about this one. We have to read one entire row group, but 
the caller may consume only the first N RecordBatches.
   
   I suspect this is not optimal.
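
To make this concern concrete, here is a hedged caller-side sketch (illustrative 
only, not from the PR): the first ReadNext() call pays for materializing the whole 
row group as a Table, even if the caller stops after a few batches.

{code:java}
// Illustrative sketch of the review concern: the reader materializes an entire
// row group into an ::arrow::Table on the first ReadNext() call, even though
// the caller below only wants the first n batches.
#include <memory>

#include "arrow/record_batch.h"
#include "arrow/status.h"

::arrow::Status TakeFirstBatches(::arrow::RecordBatchReader* batches, int n) {
  std::shared_ptr<::arrow::RecordBatch> batch;
  for (int i = 0; i < n; ++i) {
    // On the first iteration this triggers ReadRowGroup(), decoding every
    // column chunk of the row group up front; later batches may never be used.
    ::arrow::Status st = batches->ReadNext(&batch);
    if (!st.ok()) return st;
    if (batch == nullptr) break;  // fewer than n batches were available
    // ... use only this batch ...
  }
  return ::arrow::Status::OK();
}
{code}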


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Priority: Major
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is to allow selective reading of RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_indices are 
> sorted and unique, e.g. [1, 3, 5]).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be 
> processed by multiple tasks (possibly on different hosts, like Map tasks in 
> MapReduce).
> [~wesmckinn][~xhochy] What do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392306#comment-16392306
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173352829
 
 

 ##
 File path: src/parquet/arrow/reader.h
 ##
 @@ -149,6 +152,13 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);
 
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
 
 Review comment:
   Of course, will do.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Priority: Major
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is to allow selective reading of RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_indices are 
> sorted and unique, e.g. [1, 3, 5]).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be 
> processed by multiple tasks (possibly on different hosts, like Map tasks in 
> MapReduce).
> [~wesmckinn][~xhochy] What do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392305#comment-16392305
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173352853
 
 

 ##
 File path: src/parquet/arrow/reader.cc
 ##
 @@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator {
   bool done_;
 };
 
+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+  explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,
 
 Review comment:
   will do


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Priority: Major
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is to allow selective reading of RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_indices are 
> sorted and unique, e.g. [1, 3, 5]).
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
>
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be 
> processed by multiple tasks (possibly on different hosts, like Map tasks in 
> MapReduce).
> [~wesmckinn][~xhochy] What do you think?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Date for next Parquet sync

2018-03-08 Thread Julien Le Dem
Actually, because of daylight saving time we will have one less hour next week.
https://www.timeanddate.com/worldclock/meetingdetails.html?year=2018&month=3&day=13&hour=17&min=0&sec=0&p1=224&p2=50&p3=195

Location | Local Time | UTC Offset
San Francisco (USA - California) | Tuesday, March 13, 2018 at 10:00:00 am PDT | UTC-7 hours
Budapest (Hungary) | Tuesday, March 13, 2018 at 6:00:00 pm CET | UTC+1 hour
Paris (France - Île-de-France) | Tuesday, March 13, 2018 at 6:00:00 pm CET | UTC+1 hour
Corresponding UTC (GMT) | Tuesday, March 13, 2018 at 17:00:00


On Thu, Mar 8, 2018 at 4:12 PM, Julien Le Dem 
wrote:

> or 10am PST but it's a little late for the team in Budapest.
>
> On Thu, Mar 8, 2018 at 4:11 PM, Julien Le Dem 
> wrote:
>
>> I'm sorry, it turns out I now have a conflict at this particular time.
>> Maybe Wednesday?
>>
>> On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker  wrote:
>>
>>> Hi All,
>>>
>>> It has been almost 3 weeks since the last sync and there are a bunch of
>>> ongoing discussions on the mailing list. Let's find a date for the next
>>> Parquet community sync. Last time we met on a Wednesday, so this time it
>>> should be Tuesday.
>>>
>>> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That
>>> allows us to get back to the biweekly cadence without overlapping with
>>> the
>>> Arrow sync, which happens this week.
>>>
>>> Please speak up if that time does not work for you.
>>>
>>> Cheers, Lars
>>>
>>
>>
>


Re: Date for next Parquet sync

2018-03-08 Thread Julien Le Dem
or 10am PST but it's a little late for the team in Budapest.

On Thu, Mar 8, 2018 at 4:11 PM, Julien Le Dem 
wrote:

> I'm sorry, it turns out I now have a conflict at this particular time.
> Maybe Wednesday?
>
> On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker  wrote:
>
>> Hi All,
>>
>> It has been almost 3 weeks since the last sync and there are a bunch of
>> ongoing discussions on the mailing list. Let's find a date for the next
>> Parquet community sync. Last time we met on a Wednesday, so this time it
>> should be Tuesday.
>>
>> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That
>> allows us to get back to the biweekly cadence without overlapping with the
>> Arrow sync, which happens this week.
>>
>> Please speak up if that time does not work for you.
>>
>> Cheers, Lars
>>
>
>


Re: Date for next Parquet sync

2018-03-08 Thread Julien Le Dem
I'm sorry, it turns out I now have a conflict at this particular time.
Maybe Wednesday?

On Mon, Mar 5, 2018 at 10:55 AM, Lars Volker  wrote:

> Hi All,
>
> It has been almost 3 weeks since the last sync and there are a bunch of
> ongoing discussions on the mailing list. Let's find a date for the next
> Parquet community sync. Last time we met on a Wednesday, so this time it
> should be Tuesday.
>
> I propose to meet next Tuesday, March 13th, at 6pm CET / 9am PST. That
> allows us to get back to the biweekly cadence without overlapping with the
> Arrow sync, which happens this week.
>
> Please speak up if that time does not work for you.
>
> Cheers, Lars
>


Re: Parquet repositories moved to Apache GitBox service

2018-03-08 Thread Wes McKinney
Thanks, Uwe!

On Thu, Mar 8, 2018 at 2:05 PM, Uwe L. Korn  wrote:
> The parquet-mr and parquet-format repositories have now been moved to GitBox. 
> Thus the remotes of these repos have changed to:
>
> https://gitbox.apache.org/repos/asf?p=parquet-format.git
> https://gitbox.apache.org/repos/asf?p=parquet-mr.git
>
> You will now also be able to push to the GitHub remote (e.g. to use the "Let 
> maintainers push to this PR" feature); for this, you need to activate the 
> linking of your ASF and GitHub accounts. I hope to find time tomorrow to 
> update the merge and release scripts.
>
> Uwe


[jira] [Created] (PARQUET-1241) Use LZ4 frame format

2018-03-08 Thread Lawrence Chan (JIRA)
Lawrence Chan created PARQUET-1241:
--

 Summary: Use LZ4 frame format
 Key: PARQUET-1241
 URL: https://issues.apache.org/jira/browse/PARQUET-1241
 Project: Parquet
  Issue Type: Improvement
  Components: parquet-cpp, parquet-format
Reporter: Lawrence Chan


The parquet-format spec doesn't currently specify whether lz4-compressed data 
should be framed or not. We should choose one and make it explicit in the spec, 
as the two are not interoperable. After some discussions with others [1], we 
think it would be beneficial to use the framed format, which adds a small 
header in exchange for more self-contained decompression as well as a richer 
feature set (checksums, parallel decompression, etc.).

The current arrow implementation compresses using the lz4 block format, and 
this would need to be updated when we add the spec clarification.

If backwards compatibility is a concern, I would suggest adding an additional 
LZ4_FRAMED compression type, but that may be more noise than anything.

[1] https://github.com/dask/fastparquet/issues/314
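
For context, here is a small hedged sketch contrasting the two layouts with 
liblz4's own C API. This is illustrative only, not Parquet code; it assumes a 
liblz4 build that provides both lz4.h and lz4frame.h.

{code:java}
// Block format vs. frame format with liblz4. The block format emits raw
// compressed bytes whose compressed and original lengths must be tracked
// out-of-band; the frame format wraps the data in a self-describing container
// (magic number, optional content size and checksums), so a decoder needs
// nothing but the bytes themselves.
#include <string>
#include <vector>

#include <lz4.h>       // block format API
#include <lz4frame.h>  // frame format API

// Block format: roughly what the Arrow/parquet-cpp LZ4 codec emits today.
std::vector<char> CompressBlock(const std::string& input) {
  std::vector<char> out(LZ4_compressBound(static_cast<int>(input.size())));
  const int n = LZ4_compress_default(input.data(), out.data(),
                                     static_cast<int>(input.size()),
                                     static_cast<int>(out.size()));
  out.resize(n > 0 ? n : 0);
  return out;  // decompression requires the original size from elsewhere
}

// Frame format: the self-contained variant proposed for the spec.
std::vector<char> CompressFrame(const std::string& input) {
  const size_t bound = LZ4F_compressFrameBound(input.size(), /*preferencesPtr=*/nullptr);
  std::vector<char> out(bound);
  const size_t n = LZ4F_compressFrame(out.data(), out.size(),
                                      input.data(), input.size(),
                                      /*preferencesPtr=*/nullptr);
  out.resize(LZ4F_isError(n) ? 0 : n);
  return out;  // these bytes alone are enough to reproduce the input
}
{code}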



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Parquet repositories moved to Apache GitBox service

2018-03-08 Thread Uwe L. Korn
The parquet-mr and parquet-format repositories have now been moved to GitBox. 
Thus the remotes of these repos have changed to:

https://gitbox.apache.org/repos/asf?p=parquet-format.git
https://gitbox.apache.org/repos/asf?p=parquet-mr.git

You will now also be able to push to the GitHub remote (e.g. to use the "Let 
maintainers push to this PR" feature); for this, you need to activate the 
linking of your ASF and GitHub accounts. I hope to find time tomorrow to update 
the merge and release scripts.

Uwe