[GitHub] [spark] steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2020-02-13 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-585908704
LGTM. There's one thing to be aware of, which is that

2019-11-27 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-559112759
Quick follow up to the "how many connections" discussion. It turns out

2019-11-11 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-552456614
This code LGTM: skips needless probes on the globbed paths; parallel checks on

2019-11-11 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-552454930
I'd say 40 sounds good; people can tune it

2019-11-04 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-549465360
Nice experiment! I guess in-EC2, you're limited by the number of cores

2019-10-01 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536993443
> Update: I tried increasing fs.s3a.connection.maximum and it did improve

2019-09-28 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536223420
> fs.s3a.connection.maximum it's 30 AFAIK. I should revisit that you

2019-09-27 Thread GitBox
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-535903963
> it seems like the sweet spot is somewhere between 20-30 threads (for my