steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-585908704
LGTM,
There's one thing to be aware of, which is that
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-559112759
Quick follow-up to the "how many connections" discussion.
It turns out
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-552456614
This code LGTM: skips needless probes on the globbed paths; parallel checks
on
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-552454930
I'd say 40 sounds good; people can tune it
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-549465360
Nice experiment!
I guess in-EC2, you're limited by the number of course
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536993443
> Update: I tried increasing fs.s3a.connection.maximum and it did improve
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536223420
> fs.s3a.connection.maximum
It's 30 AFAIK. I should revisit that you
steveloughran commented on issue #25899: [SPARK-29089][SQL] Parallelize
blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-535903963
> it seems like the sweet spot is somewhere between 20-30 threads (for my