[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-10-07 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-705052002 @emkornfield Thanks for catching that. I've fixed the formatting issues. Looks like it's not just the Windows R checks that are failing.

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-10-02 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-703043604 Got the actions to rerun, but some are still failing. As far as I can tell, these failures aren't due to the changes in this PR.

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-09-19 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-695343480 I'm back on this for the weekend and will be back as needed the week after next. @jorisvandenbossche I can confirm that once I merge in the latest changes from apache

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-07-16 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-659710775 I have the unit tests passing locally, but they seem to be failing in CI. I probably need to rebase again and test. Will do so when I get some time this weekend.

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-07-15 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-659157816 On second thought, I think if users want consistent batch sizes they can probably add that functionality in a wrapping generator. I have adjusted my tests to the new
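
A wrapping generator of the kind mentioned in this comment might look roughly like the sketch below. It is not code from the PR: `rebatch` is a hypothetical helper name, and plain Python lists of rows stand in for record batches so the example runs without pyarrow. With real record batches, the buffering step would need Arrow's own slicing and concatenation instead of list operations.

```python
from typing import Iterable, Iterator, List, Sequence


def rebatch(batches: Iterable[Sequence], batch_size: int) -> Iterator[List]:
    """Yield lists of exactly `batch_size` rows; only the last may be shorter.

    Hypothetical helper: lists of rows stand in for record batches here.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    buffer: List = []
    for batch in batches:
        buffer.extend(batch)
        # Emit as many full batches as the buffer currently holds.
        while len(buffer) >= batch_size:
            yield buffer[:batch_size]
            buffer = buffer[batch_size:]
    # Flush whatever is left over as a final, possibly shorter, batch.
    if buffer:
        yield buffer


# Unevenly sized input batches, as a reader might produce them.
uneven = [[1, 2, 3], [4], [5, 6, 7, 8, 9]]
print([len(b) for b in rebatch(uneven, batch_size=4)])  # prints [4, 4, 1]
```

The wrapper only ever holds one partial batch in memory, so it keeps the streaming behaviour of the underlying iterator.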

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-07-15 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-659143356 So it appears there were changes to the underlying implementation of RecordBatchReader. Prior to these changes, it would yield record batches with the exact batch size (if

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-07-03 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-653687403 Looking at the code, I no longer think this `batch_size` parameter would actually affect those other read methods. There are a few different "batch_size" parameters floating

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-29 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-651140872 Actually @jorisvandenbossche, I agree we should probably just add in the batch_size argument (with a sensible default) to those other methods. Took me a while to understand what

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-26 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-650474759 RE: @jorisvandenbossche > Same question as in the other PR: does setting the batch size also influence existing methods like `read` or `read_row_group` ? Should we add that

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-06-26 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-650472498 Apologies, I've been away for a bit. I thought I had invited @sonthonaxrk as a collaborator on my fork, but perhaps that didn't go through. Addressed the minor feedback.

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-25 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-619463693 I found the cause of the test failure: If the `batch_size` isn't aligned with the `chunk_size`, categorical columns will fail with the error: ```

[GitHub] [arrow] wjones1 commented on pull request #6979: ARROW-7800 [Python] implement iter_batches() method for ParquetFile and ParquetReader

2020-04-25 Thread GitBox
wjones1 commented on pull request #6979: URL: https://github.com/apache/arrow/pull/6979#issuecomment-619437378 Two failing checks right now. For the linting one, it seems to be alarmed by some Rust code that I didn't touch. Am I missing something in that output? For the