[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16412375#comment-16412375
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on a change in pull request #445: PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r176897588
 
 

 ##
 File path: src/parquet/arrow/reader.h
 ##
 @@ -149,6 +154,21 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);
 
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
+  ///    ordering in row_group_indices matters.
+  /// \returns error Status if row_group_indices contains invalid index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices,
+  ///     whose columns are selected by column_indices. The ordering in row_group_indices
+  ///     and column_indices matter.
+  /// \returns error Status if either row_group_indices or column_indices contains invalid
+  ///    index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
 
 Review comment:
   >  My main critique of these APIs is that we will want to provide for 
setting the number of rows to be read for each call to `ReadNext`
   
   Ah, I did consider this while implementing it. However, the 
[`RecordBatch`](https://github.com/apache/arrow/blob/master/cpp/src/arrow/record_batch.h#L166)
 interface in Arrow doesn't expose that, and I'd like to hide the implementation 
details in `parquet/arrow/reader`. To enable this, I'd like to propose a new 
method on `RecordBatch`.
   What do you think? @wesm 
   
   > Could you please open a JIRA about improving this code in this regard?
   
   Will do.
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
>  Issue Type: Improvement
>  Components: parquet-cpp
>Reporter: Xianjin YE
>Assignee: Xianjin YE
>Priority: Major
> Fix For: cpp-1.5.0
>
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is to allow selectively reading RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_idxes are sorted 
> and unique, [1, 3, 5] for example). 
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> 
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be processed 
> by multiple tasks (possibly on different hosts, like the Map tasks in MapReduce).
> [~wesmckinn][~xhochy] What do you think?
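
Editorial note: the split described in the issue could be planned roughly as follows. This is a minimal sketch with no Arrow or Parquet dependency; `SplitRowGroups` is a hypothetical helper, not part of parquet-cpp, and it assumes each task should receive a contiguous, sorted, unique range of row-group indices that it would then pass as `row_group_indices`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not a parquet-cpp API): divide the row groups
// [0, num_row_groups) into at most num_tasks contiguous splits.
// Each split is sorted and unique, as the issue description requires.
std::vector<std::vector<int>> SplitRowGroups(int num_row_groups, int num_tasks) {
  std::vector<std::vector<int>> splits;
  if (num_row_groups <= 0 || num_tasks <= 0) return splits;

  // Never create more splits than there are row groups.
  const int tasks = std::min(num_tasks, num_row_groups);
  const int base = num_row_groups / tasks;
  const int extra = num_row_groups % tasks;  // the first `extra` splits get one more

  int next = 0;
  for (int t = 0; t < tasks; ++t) {
    const int count = base + (t < extra ? 1 : 0);
    std::vector<int> indices;
    for (int i = 0; i < count; ++i) indices.push_back(next++);
    splits.push_back(indices);
  }
  return splits;
}
```

Under these assumptions, each inner vector would be handed to one task, which opens the file on its own and requests a reader for just those row groups.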



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411973#comment-16411973
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

wesm closed pull request #445: PARQUET-1166: Add GetRecordBatchReader in 
parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/parquet/arrow/arrow-reader-writer-test.cc 
b/src/parquet/arrow/arrow-reader-writer-test.cc
index 72e65d47..865f39f3 100644
--- a/src/parquet/arrow/arrow-reader-writer-test.cc
+++ b/src/parquet/arrow/arrow-reader-writer-test.cc
@@ -1504,6 +1504,38 @@ TEST(TestArrowReadWrite, ReadSingleRowGroup) {
   ASSERT_TRUE(table->Equals(*concatenated));
 }
 
+TEST(TestArrowReadWrite, GetRecordBatchReader) {
+  const int num_columns = 20;
+  const int num_rows = 1000;
+
+  std::shared_ptr<Table> table;
+  MakeDoubleTable(num_columns, num_rows, 1, &table);
+
+  std::shared_ptr<Buffer> buffer;
+  WriteTableToBuffer(table, 1, num_rows / 2, default_arrow_writer_properties(), &buffer);
+
+  std::unique_ptr<FileReader> reader;
+  ASSERT_OK_NO_THROW(OpenFile(std::make_shared<BufferReader>(buffer),
+                              ::arrow::default_memory_pool(),
+                              ::parquet::default_reader_properties(), nullptr, &reader));
+
+  std::shared_ptr<::arrow::RecordBatchReader> rb_reader;
+  ASSERT_OK_NO_THROW(reader->GetRecordBatchReader({0, 1}, &rb_reader));
+
+  std::shared_ptr<::arrow::RecordBatch> batch;
+
+  ASSERT_OK(rb_reader->ReadNext(&batch));
+  ASSERT_EQ(500, batch->num_rows());
+  ASSERT_EQ(20, batch->num_columns());
+
+  ASSERT_OK(rb_reader->ReadNext(&batch));
+  ASSERT_EQ(500, batch->num_rows());
+  ASSERT_EQ(20, batch->num_columns());
+
+  ASSERT_OK(rb_reader->ReadNext(&batch));
+  ASSERT_EQ(nullptr, batch);
+}
+
 TEST(TestArrowReadWrite, ScanContents) {
   const int num_columns = 20;
   const int num_rows = 1000;
diff --git a/src/parquet/arrow/reader.cc b/src/parquet/arrow/reader.cc
index bd68ec32..de1dea6b 100644
--- a/src/parquet/arrow/reader.cc
+++ b/src/parquet/arrow/reader.cc
@@ -57,6 +57,7 @@ using parquet::schema::Node;
 // Help reduce verbosity
 using ParquetReader = parquet::ParquetFileReader;
 using arrow::ParallelFor;
+using arrow::RecordBatchReader;
 
 using parquet::internal::RecordReader;
 
@@ -152,6 +153,59 @@ class SingleRowGroupIterator : public FileColumnIterator {
   bool done_;
 };
 
+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+  explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,
+                                     const std::vector<int>& column_indices,
+                                     std::shared_ptr<::arrow::Schema> schema,
+                                     FileReader* reader)
+      : row_group_indices_(row_group_indices),
+        column_indices_(column_indices),
+        schema_(schema),
+        file_reader_(reader),
+        next_row_group_(0) {}
+
+  ~RowGroupRecordBatchReader() {}
+
+  std::shared_ptr<::arrow::Schema> schema() const override { return schema_; }
+
+  Status ReadNext(std::shared_ptr<::arrow::RecordBatch>* out) override {
+    if (table_ != nullptr) {  // one row group has been loaded
+      std::shared_ptr<::arrow::RecordBatch> tmp;
+      RETURN_NOT_OK(table_batch_reader_->ReadNext(&tmp));
+      if (tmp != nullptr) {  // some column chunks are left in table
+        *out = tmp;
+        return Status::OK();
+      } else {  // the entire table is consumed
+        table_batch_reader_.reset();
+        table_.reset();
+      }
+    }
+
+    // all row groups has been consumed
+    if (next_row_group_ == row_group_indices_.size()) {
+      *out = nullptr;
+      return Status::OK();
+    }
+
+    RETURN_NOT_OK(file_reader_->ReadRowGroup(row_group_indices_[next_row_group_],
+                                             column_indices_, &table_));
+
+    next_row_group_++;
+    table_batch_reader_.reset(new ::arrow::TableBatchReader(*table_.get()));
+    return table_batch_reader_->ReadNext(out);
+  }
+
+ private:
+  std::vector<int> row_group_indices_;
+  std::vector<int> column_indices_;
+  std::shared_ptr<::arrow::Schema> schema_;
+  FileReader* file_reader_;
+  size_t next_row_group_;
+  std::shared_ptr<::arrow::Table> table_;
+  std::unique_ptr<::arrow::TableBatchReader> table_batch_reader_;
+};
+
 // --
 // File reader implementation
 
@@ -188,6 +242,8 @@ class FileReader::Impl {
 
   int num_row_groups() const { return reader_->metadata()->num_row_groups(); }
 
+  int num_columns() const { return reader_->metadata()->num_columns(); }
+
   void set_num_threads(int num_threads) { num_threads_ = num_threads; }
 
   ParquetFileReader* 

[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411969#comment-16411969
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

wesm commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r176849315
 
 

 ##
 File path: src/parquet/arrow/reader.h
 ##
 @@ -149,6 +154,21 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);
 
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
+  ///    ordering in row_group_indices matters.
+  /// \returns error Status if row_group_indices contains invalid index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices,
+  ///     whose columns are selected by column_indices. The ordering in row_group_indices
+  ///     and column_indices matter.
+  /// \returns error Status if either row_group_indices or column_indices contains invalid
+  ///    index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
 
 Review comment:
   Could you please open a JIRA about improving this code in this regard?




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411968#comment-16411968
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

wesm commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r176849252
 
 

 ##
 File path: src/parquet/arrow/reader.h
 ##
 @@ -149,6 +154,21 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);
 
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices, the
+  ///    ordering in row_group_indices matters.
+  /// \returns error Status if row_group_indices contains invalid index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  /// \brief Return a RecordBatchReader of row groups selected from row_group_indices,
+  ///     whose columns are selected by column_indices. The ordering in row_group_indices
+  ///     and column_indices matter.
+  /// \returns error Status if either row_group_indices or column_indices contains invalid
+  ///    index
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
 
 Review comment:
   My main critique of these APIs is that we will want to provide for setting 
the number of rows to be read for each call to `ReadNext`, for example 
1,000,000 rows at a time. Right now this is returning a whole row group at a 
time.
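
Editorial note: one way to picture the requested behaviour is to plan fixed-size slices over the selected row groups, so that each call to `ReadNext` would return at most a chosen number of rows. The sketch below is a standalone model and does not use Arrow; `BatchSlice` and `PlanBatches` are hypothetical names, and it assumes a batch never spans two row groups.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical description of one batch: a row range inside one row group.
struct BatchSlice {
  int row_group;   // position in the list of selected row groups
  int64_t offset;  // first row of the slice within that row group
  int64_t length;  // number of rows in the slice, at most batch_size
};

// Hypothetical planner (not an Arrow or parquet-cpp API): cut every row
// group into consecutive slices of at most batch_size rows.
std::vector<BatchSlice> PlanBatches(const std::vector<int64_t>& row_group_sizes,
                                    int64_t batch_size) {
  std::vector<BatchSlice> plan;
  if (batch_size <= 0) return plan;  // nothing sensible to plan
  for (int rg = 0; rg < static_cast<int>(row_group_sizes.size()); ++rg) {
    for (int64_t offset = 0; offset < row_group_sizes[rg]; offset += batch_size) {
      const int64_t length = std::min(batch_size, row_group_sizes[rg] - offset);
      plan.push_back({rg, offset, length});
    }
  }
  return plan;
}
```

With two row groups of 500 rows and a batch size of 200, this plan yields three slices per row group, the last of each holding the remaining 100 rows.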




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16411391#comment-16411391
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

xhochy commented on issue #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in 
parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-375667014
 
 
   Looks good from my side, if @wesm does not object, I'll merge tomorrow.




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404402#comment-16404402
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-374107728
 
 
   ping @wesm @xhochy 




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-13 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16396950#comment-16396950
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-372669874
 
 
   @wesm @xhochy do you have any other comments, in particular on:
   1. The documentation wording, since I am not a native speaker
   2. The approach of generating a table from a row group first, then constructing 
a TableBatchReader
   3. Whether more test cases should be added
   
   If not, I will update the PR title, and maybe rebase the commits to get a cleaner 
commit history; then it should be ready for merging.




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392324#comment-16392324
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-371698673
 
 
   >I can't speak for my company. I will have to check with my manager and 
technical leader. However, a big company always wants something in return: 
reputation, business benefits, etc.
   
   Sorry for the delay (busy setting up cluster Spark app profiling). I checked 
with my manager, and the general response is: 
   > with a limited amount of dev resources, we will contribute internal 
features back if suitable, but should not actively work on community issues.
   
   However, I will try to figure out whether there are issues I can work on in my 
spare time.
   
   




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392304#comment-16392304
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173353428
 
 

 ##
 File path: src/parquet/arrow/reader.cc
 ##
 @@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator {
   bool done_;
 };
 
+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+    explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       FileReader* reader)
+      : row_group_indices_(row_group_indices),
+        column_indices_(column_indices),
+        file_reader_(reader),
+        next_row_group_(0) {
+      file_reader_->GetSchema(column_indices_, &schema_);
+    }
+
+    ~RowGroupRecordBatchReader() {}
+
+    std::shared_ptr<::arrow::Schema> schema() const override {
+      return schema_;
+    }
+
+    Status ReadNext(std::shared_ptr<::arrow::RecordBatch> *out) override {
+      if (table_ != nullptr) { // one row group has been loaded
+        std::shared_ptr<::arrow::RecordBatch> tmp;
+        table_batch_reader_->ReadNext(&tmp);
+        if (tmp != nullptr) { // some column chunks are left in table
+          *out = tmp;
+          return Status::OK();
+        } else { // the entire table is consumed
+          table_batch_reader_.reset();
+          table_.reset();
+        }
+      }
+
+      // all row groups has been consumed
+      if (next_row_group_ == row_group_indices_.size()) {
+        *out = nullptr;
+        return Status::OK();
+      }
+
+      RETURN_NOT_OK(file_reader_->ReadRowGroup(row_group_indices_[next_row_group_],
 
 Review comment:
   I am most concerned about this one. We have to read an entire row group, even 
though the caller may consume only the first N RecordBatches.
   
   I suspect this is not optimal.
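
Editorial note: the cost raised here can be made visible with a small standalone model of the control flow in `ReadNext`. This is only a sketch under stated assumptions: `ToyRowGroupReader` is a hypothetical class, row groups are represented by their row counts alone, and "loading" a row group just adds its size to a counter.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical model (no Arrow dependency) of a reader that materializes a
// whole row group on demand and then serves it in smaller batches.
class ToyRowGroupReader {
 public:
  ToyRowGroupReader(std::vector<int> row_group_sizes, int batch_size)
      : sizes_(std::move(row_group_sizes)), batch_size_(batch_size) {}

  // Returns the number of rows in the next batch, or 0 when exhausted.
  int ReadNext() {
    while (remaining_ == 0) {
      if (next_ == sizes_.size()) return 0;  // all row groups consumed
      remaining_ = sizes_[next_++];
      rows_loaded_ += remaining_;  // the entire row group is read here
    }
    const int n = std::min(batch_size_, remaining_);
    remaining_ -= n;
    return n;
  }

  // Total rows read from storage so far, whether or not they were consumed.
  int rows_loaded() const { return rows_loaded_; }

 private:
  std::vector<int> sizes_;
  int batch_size_;
  std::size_t next_ = 0;
  int remaining_ = 0;
  int rows_loaded_ = 0;
};
```

In this model, a caller that stops after the first 100-row batch of a 1000-row row group has still paid for all 1000 rows, which is the inefficiency the comment describes. Row groups that are never reached are not loaded at all.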




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392306#comment-16392306
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173352829
 
 

 ##
 File path: src/parquet/arrow/reader.h
 ##
 @@ -149,6 +152,13 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);
 
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
 
 Review comment:
   Of course, will do.




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-08 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16392305#comment-16392305
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r173352853
 
 

 ##
 File path: src/parquet/arrow/reader.cc
 ##
 @@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator {
   bool done_;
 };
 
+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+    explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,
 
 Review comment:
   will do


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h
> -
>
> Key: PARQUET-1166
> URL: https://issues.apache.org/jira/browse/PARQUET-1166
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-cpp
> Reporter: Xianjin YE
> Priority: Major
>
> Hi, I'd like to propose a new API to better support splittable reading of 
> Parquet files.
> The intent of this API is to allow selectively reading RowGroups (normally 
> contiguous, but they can be arbitrary as long as the row_group_idxes are sorted 
> and unique, [1, 3, 5] for example). 
> The proposed API would be something like this:
> {code:java}
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> 
> ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
>                                      const std::vector<int>& column_indices,
>                                      std::shared_ptr<::arrow::RecordBatchReader>* out);
> {code}
> With the new API, we can split a Parquet file into RowGroups that can be processed 
> by multiple tasks (possibly on different hosts, like the Map tasks in MapReduce).
> [~wesmckinn][~xhochy] What do you think?
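
The splitting idea in the description can be illustrated with a minimal sketch. It assumes the row groups of a file are divided into contiguous, near-equal splits, one per task; `SplitRowGroups` is a hypothetical helper, not part of parquet-cpp or of the pull request. Each split it returns is sorted and unique, so it would be a suitable `row_group_indices` argument for the proposed `GetRecordBatchReader`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of parquet-cpp): assigns row groups
// 0 .. num_row_groups-1 to num_tasks contiguous splits of near-equal size.
// Every inner vector is sorted and unique.
std::vector<std::vector<int>> SplitRowGroups(int num_row_groups, int num_tasks) {
  assert(num_row_groups >= 0 && num_tasks > 0);
  std::vector<std::vector<int>> splits(num_tasks);
  const int base = num_row_groups / num_tasks;   // minimum row groups per task
  const int extra = num_row_groups % num_tasks;  // the first `extra` tasks get one more
  int next = 0;                                  // next unassigned row group index
  for (int task = 0; task < num_tasks; ++task) {
    const int count = base + (task < extra ? 1 : 0);
    for (int i = 0; i < count; ++i) {
      splits[task].push_back(next++);
    }
  }
  assert(next == num_row_groups);  // every row group was assigned exactly once
  return splits;
}
```

With 7 row groups and 3 tasks, the first task would receive row groups 0 to 2 and the other two tasks two row groups each. Contiguous splits are only one possible policy; a scheduler could just as well balance by row group byte size.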





[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390077#comment-16390077
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

wesm commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r172741476
 
 

 ##
 File path: src/parquet/arrow/reader.cc
 ##
 @@ -152,6 +153,64 @@ class SingleRowGroupIterator : public FileColumnIterator {
   bool done_;
 };
 
+class RowGroupRecordBatchReader : public ::arrow::RecordBatchReader {
+ public:
+  explicit RowGroupRecordBatchReader(const std::vector<int>& row_group_indices,
 
 Review comment:
   There are some code formatting issues; can you run `make format` (requires 
clang-format-5.0)?




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-07 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16390078#comment-16390078
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

wesm commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r172741404
 
 

 ##
 File path: src/parquet/arrow/reader.h
 ##
 @@ -149,6 +152,13 @@ class PARQUET_EXPORT FileReader {
   ::arrow::Status ReadSchemaField(int i, const std::vector<int>& indices,
                                   std::shared_ptr<::arrow::Array>* out);
 
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
+
+  ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices,
+                                       const std::vector<int>& column_indices,
+                                       std::shared_ptr<::arrow::RecordBatchReader>* out);
 
 Review comment:
   Can you add brief doxygen comments to these new methods?
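
   One thing such comments would need to state is which index lists the methods accept. As a rough sketch, assuming the constraint from the issue description (sorted, unique indices) plus a range check against the number of row groups, the contract might be checked as follows. `CheckRowGroupIndices` is a hypothetical name, and the string return value stands in for `::arrow::Status` so the example has no Arrow dependency; none of this is taken from the pull request.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical check (not from the pull request): returns an empty string
// when `indices` is strictly increasing (i.e. sorted and unique) and every
// entry lies in [0, num_row_groups); otherwise returns an error message.
std::string CheckRowGroupIndices(const std::vector<int>& indices,
                                 int num_row_groups) {
  for (std::size_t i = 0; i < indices.size(); ++i) {
    if (indices[i] < 0 || indices[i] >= num_row_groups) {
      return "row group index out of range: " + std::to_string(indices[i]);
    }
    if (i > 0 && indices[i] <= indices[i - 1]) {
      return "row group indices must be sorted and unique";
    }
  }
  return "";
}
```

   For a file with 6 row groups, `{1, 3, 5}` passes, while `{3, 1}` and `{1, 1}` are rejected as unsorted or duplicated. If the final API allows arbitrary ordering, only the range check would apply.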




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16387198#comment-16387198
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-370647551
 
 
   > Is there a possibility that individuals from Baidu may become more 
involved in Arrow development as well?
   
   I can't speak for my company; I will have to check with my manager and 
technical leader. However, a big company always wants something in return: 
reputation, business benefits, etc. 
   
   That said, as an individual I can be more involved in Arrow development in my 
spare time if some help is needed.




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386906#comment-16386906
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

wesm commented on issue #445: [WIP] PARQUET-1166: Add GetRecordBatchReader in 
parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-370597784
 
 
   Will review soon, have been underwater with the Arrow 0.9.0 backlog. Is 
there a possibility that individuals from Baidu may become more involved in 
Arrow development as well? 




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386004#comment-16386004
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on issue #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#issuecomment-370403357
 
 
   Ping @wesm @xhochy, do you have any comments?




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380268#comment-16380268
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy commented on a change in pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445#discussion_r171237865
 
 

 ##
 File path: src/parquet/arrow/writer.h
 ##
 @@ -31,7 +31,6 @@ namespace arrow {
 class Array;
 class MemoryPool;
 class PrimitiveArray;
-class RowBatch;
 
 Review comment:
   RowBatch is no longer used; it has been renamed to RecordBatch.




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2018-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16380267#comment-16380267
 ] 

ASF GitHub Bot commented on PARQUET-1166:
-

advancedxy opened a new pull request #445: [WIP] PARQUET-1166: Add 
GetRecordBatchReader in parquet/arrow/reader
URL: https://github.com/apache/parquet-cpp/pull/445
 
 
   Ping @xhochy @wesm.
   
   Sorry for the delay; I finally got some time to work on this feature.
   
   This is still a work in progress, but I'd like to get feedback before going any further.




[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2017-11-27 Thread Xianjin YE (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266954#comment-16266954
 ] 

Xianjin YE commented on PARQUET-1166:
-

All right then, I will send a PR soon and will try to reuse Arrow's code wherever 
possible.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PARQUET-1166) [API Proposal] Add GetRecordBatchReader in parquet/arrow/reader.h

2017-11-27 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/PARQUET-1166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16266926#comment-16266926
 ] 

Wes McKinney commented on PARQUET-1166:
---

Sounds good to me. This is basically already the intent of ARROW-1012.
