[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R
romainfrancois commented on pull request #8122: URL: https://github.com/apache/arrow/pull/8122#issuecomment-697232130 This returns too early, I believe this needs to be `RETURN_NOT_OK` instead of `return`: https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/reader.cc#L185 ```cpp Status BoundsCheck(const std::vector& row_groups, const std::vector& column_indices) { for (int i : row_groups) { return BoundsCheckRowGroup(i); } for (int i : column_indices) { return BoundsCheckColumn(i); } return Status::OK(); } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R
romainfrancois commented on pull request #8122: URL: https://github.com/apache/arrow/pull/8122#issuecomment-697227723 I added some tests for `ReadRowGroup(s)()` and clarified about 0-based. I'm getting a weird error in some cases with `ReadRowGroups()`: ``` r library(arrow, warn.conflicts = FALSE) tab <- Table$create(x = 1:100) tf <- tempfile() write_parquet(tab, tf, chunk_size = 10) reader <- ParquetFileReader$create(tf) # sensible error reader$ReadRowGroups(-2) #> Error in parquet___arrow___FileReader__ReadRowGroups1(self, row_groups): Invalid: Some index in row_group_indices is -2, which is either < 0 or >= num_row_groups(10) #> In /Users/romainfrancois/git/apache/arrow/cpp/src/parquet/arrow/reader.cc, line 842, code: BoundsCheck(row_groups, column_indices) # weird error reader$ReadRowGroups(c(0, -2)) #> Error in parquet___arrow___FileReader__ReadRowGroups1(self, row_groups): IOError: The file only has 0 columns, requested metadata for column: 0 #> In /Users/romainfrancois/git/apache/arrow/cpp/src/parquet/arrow/reader.cc, line 873, code: final_status ``` Created on 2020-09-23 by the [reprex package](https://reprex.tidyverse.org) (v0.3.0.9001) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R
romainfrancois commented on pull request #8122: URL: https://github.com/apache/arrow/pull/8122#issuecomment-696593899 The methods of `ParquetFileReader` no longer use tidyselect, i.e. you can use `$ReadTable()` or `$ReadTable(column_indices)` with an 0-based integer vector so this does not have to pay for retrieving the schema (which appears to be expensive) and invoke tidy selection. `read_parquet(col_select=)` is still supported though, and does call `$GetSchema()`. In addition, methods `ReadRowGroup()` and `$ReadRowGroups()` have been added. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R
romainfrancois commented on pull request #8122: URL: https://github.com/apache/arrow/pull/8122#issuecomment-696593899 The methods of `ParquetFileReader` no longer use tidyselect, i.e. you can use `$ReadTable()` or `$ReadTable(column_indices)` with an 0-based integer vector so this does not have to pay for retrieving the schema (which appears to be expensive) and invoke tidy selection. `read_parquet(col_select=)` is still supported though, and does call `$GetSchema()`. In addition, methods `ReadRowGroup()` and `$ReadRowGroups()` have been added. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [arrow] romainfrancois commented on pull request #8122: ARROW-9557: [R] Iterating over parquet columns is slow in R
romainfrancois commented on pull request #8122: URL: https://github.com/apache/arrow/pull/8122#issuecomment-688284689 also, I guess `$ReadTable()` could either be simplified to only use column indices (as in the C++ function) ```cpp virtual ::arrow::Status ReadTable(const std::vector& column_indices, std::shared_ptr<::arrow::Table>* out) = 0; ``` or maybe have some arguments to turn off potentially expensive tidyselect logic, that has to retrieve the schema ... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org