[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-587303126 Thank you everybody! This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-585674253 Hi all, How can I get this PR accepted? Anything I can do to help with the process?
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-555718135 Any next steps for me? Or does this just need :eyes: from committers?
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-550084928 @steveloughran How should we proceed? Does 40 threads sound OK?
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-546554524
With Parquet mode, i.e.
```
spark.hadoop.fs.s3a.readahead.range 256K
spark.hadoop.fs.s3a.input.fadvise random
```
I have to say I don't see much difference?
```
| Type             | fs.s3a.connection.maximum | Num Threads | Runtime (seconds) |
|==================|===========================|=============|===================|
| Flat Paths       | 40                        | 10          | 21.36             |
| Flat Paths       | 40                        | 20          | 10.51             |
| Flat Paths       | 40                        | 40          | 5.76              |
| Flat Paths       | 40                        | 60          | 5.26              |
| Flat Paths       | 40                        | 80          | 5.86              |
| Flat Paths       | 40                        | 100         | 8.54              |
| Flat Paths       | 40                        | 150         | 6.01              |
| Flat Paths       | 40                        | 200         | 6.95              |
| Flat Paths       | 100                       | 10          | 18.81             |
| Flat Paths       | 100                       | 20          | 9.41              |
| Flat Paths       | 100                       | 40          | 5.21              |
| Flat Paths       | 100                       | 60          | 7.94              |
| Flat Paths       | 100                       | 80          | 5.5               |
| Flat Paths       | 100                       | 100         | 5.41              |
| Flat Paths       | 100                       | 150         | 6.49              |
| Flat Paths       | 100                       | 200         | 6.27              |
| Flat Paths       | 300                       | 10          | 17.22             |
| Flat Paths       | 300                       | 20          | 11.55             |
| Flat Paths       | 300                       | 40          | 5.55              |
| Flat Paths       | 300                       | 60          | 5.18              |
| Flat Paths       | 300                       | 80          | 9.57              |
| Flat Paths       | 300                       | 100         | 6.46              |
| Flat Paths       | 300                       | 150         | 4.71              |
| Flat Paths       | 300                       | 200         | 5.22              |
| Glob Paths       | 40                        | 10          | 25.3              |
| Glob Paths       | 40                        | 20          | 3.56              |
| Glob Paths       | 40                        | 40          | 6.73              |
| Glob Paths       | 40                        | 60          | 2.23              |
| Glob Paths       | 40                        | 80          | 2.96              |
| Glob Paths       | 40                        | 100         | 1.93              |
| Glob Paths       | 40                        | 150         | 2.35              |
| Glob Paths       | 40                        | 200         | 2.97              |
| Glob Paths       | 100                       | 10          | 4.45              |
| Glob Paths       | 100                       | 20          | 2.79              |
| Glob Paths       | 100                       | 40          | 2.54              |
| Glob Paths       | 100                       | 60          | 1.63              |
| Glob Paths       | 100                       | 80          | 6.98              |
| Glob Paths       | 100                       | 100         | 2.69              |
| Glob Paths       | 100                       | 150         | 2.37              |
| Glob Paths       | 100                       | 200         | 2.4               |
| Glob Paths       | 300                       | 10          | 4.7               |
| Glob Paths       | 300                       | 20          | 2.98              |
| Glob Paths       | 300                       | 40          | 2.14              |
| Glob Paths       | 300                       | 60          | 1.6               |
| Glob Paths       | 300                       | 80          | 3.19              |
| Glob Paths       | 300                       | 100         | 2.1               |
| Glob Paths       | 300                       | 150         | 3.72              |
| Glob Paths       | 300                       | 200         | 1.86              |
| Single glob path | 40                        | 10          | 36.12             |
| Single glob path | 40                        | 20          | 37.45             |
| Single glob path | 40                        | 40          | 35.51             |
| Single glob path | 40                        | 60          | 35.7              |
| Single glob path | 40                        | 80          | 37.58             |
| Single glob path | 40                        | 100         |
```
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-546554461
@steveloughran Sorry for the delay; I've been busy the past couple of weeks. Here are the results for the test with various thread counts and S3A connection pool sizes (`fs.s3a.connection.maximum`):
This is on an Amazon Linux EC2 instance in us-west-2 (`c3.xlarge`) with 4 vCPUs. `fs.s3a.threads.keepalivetime` was set to `300`.
Types:
- Flat Paths: 1206 S3 paths (recursive list of `s3a://landsat-pds/L8/001/003/`)
- Glob Paths: 30 glob paths (e.g. `s3a://landsat-pds/L8/001/003/LC80010032016262LGN00/*`) resulting in 1206 S3 paths
- Single Glob Path: `s3a://landsat-pds/L8/001/003/*/*`
```
| Type             | fs.s3a.connection.maximum | Num Threads | Runtime (seconds) |
|==================|===========================|=============|===================|
| Flat Paths       | 40                        | 10          | 19.08             |
| Flat Paths       | 40                        | 20          | 9.38              |
| Flat Paths       | 40                        | 40          | 5.83              |
| Flat Paths       | 40                        | 60          | 5.36              |
| Flat Paths       | 40                        | 80          | 5.16              |
| Flat Paths       | 40                        | 100         | 5.08              |
| Flat Paths       | 40                        | 150         | 4.99              |
| Flat Paths       | 40                        | 200         | 8.33              |
| Flat Paths       | 100                       | 10          | 17.27             |
| Flat Paths       | 100                       | 20          | 8.75              |
| Flat Paths       | 100                       | 40          | 7.43              |
| Flat Paths       | 100                       | 60          | 5.19              |
| Flat Paths       | 100                       | 80          | 4.35              |
| Flat Paths       | 100                       | 100         | 4.8               |
| Flat Paths       | 100                       | 150         | 5.31              |
| Flat Paths       | 100                       | 200         | 4.87              |
| Flat Paths       | 300                       | 10          | 17.1              |
| Flat Paths       | 300                       | 20          | 8.72              |
| Flat Paths       | 300                       | 40          | 4.88              |
| Flat Paths       | 300                       | 60          | 5.28              |
| Flat Paths       | 300                       | 80          | 4.81              |
| Flat Paths       | 300                       | 100         | 6.13              |
| Flat Paths       | 300                       | 150         | 5.23              |
| Flat Paths       | 300                       | 200         | 5.98              |
| Glob Paths       | 40                        | 10          | 24.81             |
| Glob Paths       | 40                        | 20          | 3.07              |
| Glob Paths       | 40                        | 40          | 2.64              |
| Glob Paths       | 40                        | 60          | 2.1               |
| Glob Paths       | 40                        | 80          | 1.96              |
| Glob Paths       | 40                        | 100         | 1.52              |
| Glob Paths       | 40                        | 150         | 16.51             |
| Glob Paths       | 40                        | 200         | 2.16              |
| Glob Paths       | 100                       | 10          | 4.85              |
| Glob Paths       | 100                       | 20          | 4.36              |
| Glob Paths       | 100                       | 40          | 2.33              |
| Glob Paths       | 100                       | 60          | 2.58              |
| Glob Paths       | 100                       | 80          | 1.61              |
| Glob Paths       | 100                       | 100         | 2.01              |
| Glob Paths       | 100                       | 150         | 1.7               |
| Glob Paths       | 100                       | 200         | 2.28              |
| Glob Paths       | 300                       | 10          | 4.23              |
| Glob Paths       | 300                       | 20          | 2.87              |
| Glob Paths       | 300                       | 40          | 2.23              |
| Glob Paths       | 300                       | 60          | 2.05              |
| Glob Paths       | 300                       | 80          | 2.02              |
| Glob Paths       | 300                       | 100         | 1.75              |
| Glob Paths       | 300                       | 150         | 2.75              |
| Glob Paths       | 300                       | 200         | 2.18              |
| Single glob path | 40                        | 10          | 32.66
```
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-546552694 retest this please
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-546550712
@steveloughran Sorry for the delay; I've been busy the past couple of weeks. Here are the results for the test with various thread counts and S3A connection pool sizes:
```
| Type             | s3a conn max | Num Threads | Runtime (seconds) |
|------------------|--------------|-------------|-------------------|
| Flat Paths       | 40           | 10          | 24.08             |
| Flat Paths       | 40           | 20          | 12.07             |
| Flat Paths       | 40           | 40          | 6.63              |
| Flat Paths       | 40           | 60          | 6.94              |
| Flat Paths       | 40           | 80          | 6.58              |
| Flat Paths       | 40           | 100         | 8.24              |
| Flat Paths       | 40           | 150         | 7.19              |
| Flat Paths       | 40           | 200         | 6.24              |
| Flat Paths       | 300          | 10          | 19.39             |
| Flat Paths       | 300          | 20          | 10.16             |
| Flat Paths       | 300          | 40          | 6.78              |
| Flat Paths       | 300          | 60          | 6.34              |
| Flat Paths       | 300          | 80          | 6.94              |
| Flat Paths       | 300          | 100         | 5.35              |
| Flat Paths       | 300          | 150         | 5.96              |
| Flat Paths       | 300          | 200         | 6.78              |
| Glob Paths       | 40           | 10          | 37.28             |
| Glob Paths       | 40           | 20          | 4.74              |
| Glob Paths       | 40           | 40          | 3.81              |
| Glob Paths       | 40           | 60          | 4.17              |
| Glob Paths       | 40           | 80          | 3.41              |
| Glob Paths       | 40           | 100         | 3.01              |
| Glob Paths       | 40           | 150         | 3.08              |
| Glob Paths       | 40           | 200         | 2.63              |
| Glob Paths       | 300          | 10          | 4.59              |
| Glob Paths       | 300          | 20          | 3.26              |
| Glob Paths       | 300          | 40          | 3.46              |
| Glob Paths       | 300          | 60          | 2.62              |
| Glob Paths       | 300          | 80          | 2.32              |
| Glob Paths       | 300          | 100         | 2.45              |
| Glob Paths       | 300          | 150         | 4.61              |
| Glob Paths       | 300          | 200         | 2.5               |
| Single glob path | 40           | 10          | 44.02             |
| Single glob path | 40           | 20          | 38.54             |
| Single glob path | 40           | 40          | 33.25             |
| Single glob path | 40           | 60          | 34.83             |
| Single glob path | 40           | 80          | 36.2              |
| Single glob path | 40           | 100         | 34.94             |
| Single glob path | 40           | 150         | 46.32             |
| Single glob path | 40           | 200         | 35.36             |
| Single glob path | 300          | 10          | 31.33             |
| Single glob path | 300          | 20          | 35.35             |
| Single glob path | 300          | 40          | 36.4              |
| Single glob path | 300          | 60          | 34.7              |
| Single glob path | 300          | 80          | 35.1              |
| Single glob path | 300          | 100         | 33.87             |
| Single glob path | 300          | 150         | 35.61             |
| Single glob path | 300          | 200         | 37.25             |
FileSystem org.apache.hadoop.fs.s3a.S3AFileSystem: 0 bytes read, 0 bytes written, 21232 read ops, 0 large read ops, 0 write ops
```
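To make the shape of these numbers easier to read, here is a small Python sketch that converts one slice of the table above (Flat Paths with `s3a conn max` of 40) into speedups relative to the 10-thread run. The runtimes are copied from that table; the `speedups` helper and the variable names are only illustrative and are not part of the PR.

```python
# Runtimes in seconds, keyed by thread count, copied from the
# "Flat Paths / s3a conn max = 40" rows of the table in this comment.
flat_paths_conn40 = {
    10: 24.08, 20: 12.07, 40: 6.63, 60: 6.94,
    80: 6.58, 100: 8.24, 150: 7.19, 200: 6.24,
}


def speedups(runtimes, baseline_threads=10):
    """Return runtime(baseline) / runtime(n) for each thread count n."""
    base = runtimes[baseline_threads]
    return {n: round(base / t, 2) for n, t in runtimes.items()}


if __name__ == "__main__":
    for threads, factor in speedups(flat_paths_conn40).items():
        print(f"{threads:>3} threads: {factor:.2f}x faster than 10 threads")
```

Read this way, the runs scale close to linearly from 10 to 40 threads and then flatten out, at least for this one configuration and this single set of measurements.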
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-536741130 Update: I tried increasing `fs.s3a.connection.maximum` and it did improve the performance of the filesystem calls. I still need to set up a benchmark that runs on EC2 instead of on my remote dev laptop; I will update in a couple of days.
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-536046144
> "fs.s3a.connection.maximum" should be > than "fs.s3a.max.total.tasks"

> "fs.s3a.threads.keepalivetime" from 60 to 300 to keep those connections around for longer (avoids that https overhead)

Ah right, I hadn't considered that we might be bottlenecked by the S3A connection pool. I will update my measurements based on this.
@steveloughran Let's say, hypothetically, that performance keeps improving the more threads we add (say, 500) and that it doesn't cause S3 throttling, etc. There's probably an upper limit to how many threads are acceptable to spawn on the driver, right? For example, what if a user puts their driver on a `t2.nano` EC2 instance?
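To illustrate the connection-pool concern, here is a minimal Python sketch in which a semaphore stands in for a bounded connection pool. It is a toy model only: `run_with_pool_limit`, the simulated latency, and the semaphore are assumptions made for illustration and do not reflect how S3A actually manages connections.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_with_pool_limit(num_requests, num_threads, max_connections, latency=0.01):
    """Time simulated requests when at most `max_connections` can run at once."""
    # Hypothetical stand-in for a bounded HTTP connection pool.
    connections = threading.BoundedSemaphore(max_connections)

    def request(_):
        with connections:        # block until a "connection" is free
            time.sleep(latency)  # simulated network round trip

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Drain the iterator so every request finishes before we stop the clock.
        list(pool.map(request, range(num_requests)))
    return time.perf_counter() - start


if __name__ == "__main__":
    for conns in (4, 40):
        elapsed = run_with_pool_limit(40, num_threads=40, max_connections=conns)
        print(f"40 threads, {conns:>2} connections: {elapsed:.2f}s")
```

In this model, threads beyond the connection limit simply wait, so adding threads past that point buys nothing. Whether real S3A calls behave this way is something only measurements can show.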
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-534380034
I ran additional measurements testing different thread counts on the S3 Landsat data, and it seems like the sweet spot is somewhere between 20 and 30 threads (for my environment, anyway).
- **30 glob paths**: 30 glob paths with a final result of 1206 files
- **single glob path**: 1 single glob path with a final result of 1206 files
- **raw paths**: 1206 raw paths without any globs; see here: https://github.com/apache/spark/pull/25899#issuecomment-534069194

**original code**
- 30 glob paths: _15.6 seconds_
- single glob path: _11.3 seconds_
- raw paths: _59 seconds_

**8 threads**
- 30 glob paths: _1.48 seconds_
- single glob path: _11 seconds_
- raw paths: _7.73 seconds_

**20 threads**
- 30 glob paths: _1.47 seconds_
- single glob path: _15.45 seconds_
- raw paths: _4.16 seconds_

**30 threads**
- 30 glob paths: _0.92 seconds_
- single glob path: _11.74 seconds_
- raw paths: _4.12 seconds_

**40 threads**
- 30 glob paths: _0.93 seconds_
- single glob path: _13.48 seconds_
- raw paths: _4.08 seconds_
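As a rough picture of what is being measured, the following Python sketch runs one simulated blocking call per path on a fixed-size thread pool and times it. It is not the PR's Scala code: `fake_blocking_call`, `check_paths`, `timed`, the bucket name, and the 10 ms latency are all hypothetical stand-ins for a real FileSystem round trip.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_blocking_call(path, latency=0.01):
    """Stand-in for a blocking FileSystem call such as an existence check."""
    time.sleep(latency)  # simulated network round trip
    return path


def check_paths(paths, num_threads):
    """Run the blocking call for every path on a fixed-size thread pool."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() yields results in input order, so they line up with `paths`.
        return list(pool.map(fake_blocking_call, paths))


def timed(paths, num_threads):
    """Return (results, elapsed seconds) for one run."""
    start = time.perf_counter()
    results = check_paths(paths, num_threads)
    return results, time.perf_counter() - start


if __name__ == "__main__":
    # Hypothetical paths; nothing here touches S3.
    paths = [f"s3a://example-bucket/file-{i}" for i in range(40)]
    for n in (1, 8, 20):
        _, elapsed = timed(paths, n)
        print(f"{n:>2} threads: {elapsed:.2f}s")
```

In this model a single glob path is one task, so the pool has nothing to spread across threads. That would be consistent with the single glob path timings above staying roughly flat, although the real cause may differ.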
[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary URL: https://github.com/apache/spark/pull/25899#issuecomment-534353737 @dongjoon-hyun OK, done. How does that sound? Should I update the JIRA as well?