[GitHub] [arrow] wesm commented on pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-05-01 Thread GitBox
wesm commented on pull request #6744: URL: https://github.com/apache/arrow/pull/6744#issuecomment-622529037 Thanks @lidavidm! I'm confident we'll be able to devise some solutions to the resource allocation problem. This is

[GitHub] [arrow] wesm commented on pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-29 Thread GitBox
wesm commented on pull request #6744: URL: https://github.com/apache/arrow/pull/6744#issuecomment-621308415 I wrote up a ticket for round-robin task scheduling, which might help with this: https://issues.apache.org/jira/browse/ARROW-8626

[GitHub] [arrow] wesm commented on pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-29 Thread GitBox
wesm commented on pull request #6744: URL: https://github.com/apache/arrow/pull/6744#issuecomment-621305031 Yeah, I think one definite thing that needs to happen at minimum is externalizing the thread pool used for asynchronous IO calls so that the user is able to set whatever concurrency

[GitHub] [arrow] wesm commented on pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-29 Thread GitBox
wesm commented on pull request #6744: URL: https://github.com/apache/arrow/pull/6744#issuecomment-621299892 @pitrou I think the problem is the global IO thread pool: https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/interfaces.cc#L310 So if you read multiple files

[GitHub] [arrow] wesm commented on pull request #6744: PARQUET-1820: [C++] pre-buffer specified columns of row group

2020-04-29 Thread GitBox
wesm commented on pull request #6744: URL: https://github.com/apache/arrow/pull/6744#issuecomment-621266083 Yes, we should discuss on the mailing list. For the record, IO-related tasks should almost certainly not be using the default global thread pool, which is intended for