[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R

2020-09-23 Thread GitBox


romainfrancois commented on pull request #8122:
URL: https://github.com/apache/arrow/pull/8122#issuecomment-697232130


   This returns too early, I believe this needs to be `RETURN_NOT_OK` instead 
of `return`: 
   
https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L185
   
   ```cpp
 Status BoundsCheck(const std::vector& row_groups,
const std::vector& column_indices) {
   for (int i : row_groups) {
 return BoundsCheckRowGroup(i);
   }
   for (int i : column_indices) {
 return BoundsCheckColumn(i);
   }
   return Status::OK();
 }
   ```
   
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R

2020-09-23 Thread GitBox


romainfrancois commented on pull request #8122:
URL: https://github.com/apache/arrow/pull/8122#issuecomment-697227723


   I added some tests for `ReadRowGroup(s)()` and clarified about 0-based. 
   
   I'm getting a weird error in some cases with `ReadRowGroups()`: 
   
   ``` r
   library(arrow, warn.conflicts = FALSE)
   
   tab <- Table$create(x = 1:100)
   tf <- tempfile()
   write_parquet(tab, tf, chunk_size = 10)
   
   reader <- ParquetFileReader$create(tf)
   # sensible error
   reader$ReadRowGroups(-2)
   #> Error in parquet___arrow___FileReader__ReadRowGroups1(self, row_groups): 
Invalid: Some index in row_group_indices is -2, which is either < 0 or >= 
num_row_groups(10)
   #> In 
/Users/romainfrancois/git/apache/arrow/cpp/src/parquet/arrow/reader.cc, line 
842, code: BoundsCheck(row_groups, column_indices)
   
   # weird error
   reader$ReadRowGroups(c(0, -2))
   #> Error in parquet___arrow___FileReader__ReadRowGroups1(self, row_groups): 
IOError: The file only has 0 columns, requested metadata for column: 0
   #> In 
/Users/romainfrancois/git/apache/arrow/cpp/src/parquet/arrow/reader.cc, line 
873, code: final_status
   ```
   
   Created on 2020-09-23 by the [reprex 
package](https://reprex.tidyverse.org) (v0.3.0.9001)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R

2020-09-22 Thread GitBox


romainfrancois commented on pull request #8122:
URL: https://github.com/apache/arrow/pull/8122#issuecomment-696593899


   The methods of `ParquetFileReader` no longer use tidyselect, i.e. you can 
use `$ReadTable()` or `$ReadTable(column_indices)` with an 0-based integer 
vector so this does not have to pay for retrieving the schema (which appears to 
be expensive) and invoke tidy selection. 
   
   `read_parquet(col_select=)` is still supported though, 
and does call `$GetSchema()`. 
   
   In addition, methods `ReadRowGroup()` and `$ReadRowGroups()` have been 
added. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R

2020-09-22 Thread GitBox


romainfrancois commented on pull request #8122:
URL: https://github.com/apache/arrow/pull/8122#issuecomment-696593899


   The methods of `ParquetFileReader` no longer use tidyselect, i.e. you can 
use `$ReadTable()` or `$ReadTable(column_indices)` with an 0-based integer 
vector so this does not have to pay for retrieving the schema (which appears to 
be expensive) and invoke tidy selection. 
   
   `read_parquet(col_select=)` is still supported though, 
and does call `$GetSchema()`. 
   
   In addition, methods `ReadRowGroup()` and `$ReadRowGroups()` have been 
added. 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R

2020-09-07 Thread GitBox


romainfrancois commented on pull request #8122:
URL: https://github.com/apache/arrow/pull/8122#issuecomment-688284689


   also, I guess `$ReadTable()` could either be simplified to only use column 
indices (as in the C++ function)
   
   ```cpp
   virtual ::arrow::Status ReadTable(const std::vector& column_indices,
   std::shared_ptr<::arrow::Table>* out) = 
0;
   ```
   
   or maybe have some arguments to turn off potentially expensive tidyselect 
logic, that has to retrieve the schema ...



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org