[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2020-02-17 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-587303126
 
 
   Thank you everybody!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2020-02-13 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-585674253
 
 
   Hi all,
   
   How can I get this PR accepted? Anything I can do to help with the process?
   





[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-11-19 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-555718135
 
 
   Any next steps for me? Or just need :eyes: from committers?





[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-11-05 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-550084928
 
 
   @steveloughran How should we proceed? Does 40 threads sound OK? 





[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-10-25 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-546554524
 
 
   With Parquet mode i.e.
   
   ```
   spark.hadoop.fs.s3a.readahead.range 256K
   spark.hadoop.fs.s3a.input.fadvise random
   ```
   
   I have to say I don't see much difference?
   
   ```
   
   | Type             | fs.s3a.connection.maximum | Num Threads | Runtime (seconds) |
   |==================|===========================|=============|===================|
   | Flat Paths       | 40                        | 10          | 21.36             |
   | Flat Paths       | 40                        | 20          | 10.51             |
   | Flat Paths       | 40                        | 40          | 5.76              |
   | Flat Paths       | 40                        | 60          | 5.26              |
   | Flat Paths       | 40                        | 80          | 5.86              |
   | Flat Paths       | 40                        | 100         | 8.54              |
   | Flat Paths       | 40                        | 150         | 6.01              |
   | Flat Paths       | 40                        | 200         | 6.95              |
   | Flat Paths       | 100                       | 10          | 18.81             |
   | Flat Paths       | 100                       | 20          | 9.41              |
   | Flat Paths       | 100                       | 40          | 5.21              |
   | Flat Paths       | 100                       | 60          | 7.94              |
   | Flat Paths       | 100                       | 80          | 5.5               |
   | Flat Paths       | 100                       | 100         | 5.41              |
   | Flat Paths       | 100                       | 150         | 6.49              |
   | Flat Paths       | 100                       | 200         | 6.27              |
   | Flat Paths       | 300                       | 10          | 17.22             |
   | Flat Paths       | 300                       | 20          | 11.55             |
   | Flat Paths       | 300                       | 40          | 5.55              |
   | Flat Paths       | 300                       | 60          | 5.18              |
   | Flat Paths       | 300                       | 80          | 9.57              |
   | Flat Paths       | 300                       | 100         | 6.46              |
   | Flat Paths       | 300                       | 150         | 4.71              |
   | Flat Paths       | 300                       | 200         | 5.22              |
   | Glob Paths       | 40                        | 10          | 25.3              |
   | Glob Paths       | 40                        | 20          | 3.56              |
   | Glob Paths       | 40                        | 40          | 6.73              |
   | Glob Paths       | 40                        | 60          | 2.23              |
   | Glob Paths       | 40                        | 80          | 2.96              |
   | Glob Paths       | 40                        | 100         | 1.93              |
   | Glob Paths       | 40                        | 150         | 2.35              |
   | Glob Paths       | 40                        | 200         | 2.97              |
   | Glob Paths       | 100                       | 10          | 4.45              |
   | Glob Paths       | 100                       | 20          | 2.79              |
   | Glob Paths       | 100                       | 40          | 2.54              |
   | Glob Paths       | 100                       | 60          | 1.63              |
   | Glob Paths       | 100                       | 80          | 6.98              |
   | Glob Paths       | 100                       | 100         | 2.69              |
   | Glob Paths       | 100                       | 150         | 2.37              |
   | Glob Paths       | 100                       | 200         | 2.4               |
   | Glob Paths       | 300                       | 10          | 4.7               |
   | Glob Paths       | 300                       | 20          | 2.98              |
   | Glob Paths       | 300                       | 40          | 2.14              |
   | Glob Paths       | 300                       | 60          | 1.6               |
   | Glob Paths       | 300                       | 80          | 3.19              |
   | Glob Paths       | 300                       | 100         | 2.1               |
   | Glob Paths       | 300                       | 150         | 3.72              |
   | Glob Paths       | 300                       | 200         | 1.86              |
   | Single glob path | 40                        | 10          | 36.12             |
   | Single glob path | 40                        | 20          | 37.45             |
   | Single glob path | 40                        | 40          | 35.51             |
   | Single glob path | 40                        | 60          | 35.7              |
   | Single glob path | 40                        | 80          | 37.58             |
   | Single glob path| 40   | 100| 

[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-10-25 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-546554461
 
 
   @steveloughran Sorry for the delay. Been busy the past couple of weeks.
   
   Here are the results for the test with various numbers of threads and 
threadpool sizes:
   
   This is on an Amazon Linux EC2 Instance in us-west-2 (`c3.xlarge`) with 4 
vCPUs.
   
   `fs.s3a.threads.keepalivetime` was set to `300`.
   
   Types:
   - Flat Paths: 1206 S3 Paths (recursive list of 
`s3a://landsat-pds/L8/001/003/`)
   - Glob Paths: 30 Glob Paths (ex. 
`s3a://landsat-pds/L8/001/003/LC80010032016262LGN00/*`) resulting in 1206 S3 
Paths
   - Single Glob Path: `s3a://landsat-pds/L8/001/003/*/*`
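
For context, the kind of parallelism being measured can be sketched roughly as follows. This is only an illustration under stated assumptions, not the PR's code: `ParallelListingSketch`, `parallelMap` and `timed` are hypothetical names, and the blocking call is left as a parameter so the sketch runs without Hadoop on the classpath.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

// Hypothetical helper, not taken from the PR.
object ParallelListingSketch {
  // Runs a blocking function over every item on a fixed pool of numThreads threads.
  // Results come back in the same order as the input items.
  def parallelMap[A, B](items: Seq[A], numThreads: Int)(blockingCall: A => B): Seq[B] = {
    val pool = Executors.newFixedThreadPool(numThreads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = items.map(item => Future(blockingCall(item)))
      Await.result(Future.sequence(futures), Duration.Inf)
    } finally {
      pool.shutdown() // always release the threads, even if a call failed
    }
  }

  // Evaluates a block once and returns its result with the elapsed wall-clock seconds.
  def timed[T](body: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }
}
```

With Hadoop available, the blocking call would be something like `path => fs.globStatus(path)` on an S3A `FileSystem`, and `numThreads` would correspond to the "Num Threads" column in the results.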
   
   ```
   
   | Type             | fs.s3a.connection.maximum | Num Threads | Runtime (seconds) |
   |==================|===========================|=============|===================|
   | Flat Paths       | 40                        | 10          | 19.08             |
   | Flat Paths       | 40                        | 20          | 9.38              |
   | Flat Paths       | 40                        | 40          | 5.83              |
   | Flat Paths       | 40                        | 60          | 5.36              |
   | Flat Paths       | 40                        | 80          | 5.16              |
   | Flat Paths       | 40                        | 100         | 5.08              |
   | Flat Paths       | 40                        | 150         | 4.99              |
   | Flat Paths       | 40                        | 200         | 8.33              |
   | Flat Paths       | 100                       | 10          | 17.27             |
   | Flat Paths       | 100                       | 20          | 8.75              |
   | Flat Paths       | 100                       | 40          | 7.43              |
   | Flat Paths       | 100                       | 60          | 5.19              |
   | Flat Paths       | 100                       | 80          | 4.35              |
   | Flat Paths       | 100                       | 100         | 4.8               |
   | Flat Paths       | 100                       | 150         | 5.31              |
   | Flat Paths       | 100                       | 200         | 4.87              |
   | Flat Paths       | 300                       | 10          | 17.1              |
   | Flat Paths       | 300                       | 20          | 8.72              |
   | Flat Paths       | 300                       | 40          | 4.88              |
   | Flat Paths       | 300                       | 60          | 5.28              |
   | Flat Paths       | 300                       | 80          | 4.81              |
   | Flat Paths       | 300                       | 100         | 6.13              |
   | Flat Paths       | 300                       | 150         | 5.23              |
   | Flat Paths       | 300                       | 200         | 5.98              |
   | Glob Paths       | 40                        | 10          | 24.81             |
   | Glob Paths       | 40                        | 20          | 3.07              |
   | Glob Paths       | 40                        | 40          | 2.64              |
   | Glob Paths       | 40                        | 60          | 2.1               |
   | Glob Paths       | 40                        | 80          | 1.96              |
   | Glob Paths       | 40                        | 100         | 1.52              |
   | Glob Paths       | 40                        | 150         | 16.51             |
   | Glob Paths       | 40                        | 200         | 2.16              |
   | Glob Paths       | 100                       | 10          | 4.85              |
   | Glob Paths       | 100                       | 20          | 4.36              |
   | Glob Paths       | 100                       | 40          | 2.33              |
   | Glob Paths       | 100                       | 60          | 2.58              |
   | Glob Paths       | 100                       | 80          | 1.61              |
   | Glob Paths       | 100                       | 100         | 2.01              |
   | Glob Paths       | 100                       | 150         | 1.7               |
   | Glob Paths       | 100                       | 200         | 2.28              |
   | Glob Paths       | 300                       | 10          | 4.23              |
   | Glob Paths       | 300                       | 20          | 2.87              |
   | Glob Paths       | 300                       | 40          | 2.23              |
   | Glob Paths       | 300                       | 60          | 2.05              |
   | Glob Paths       | 300                       | 80          | 2.02              |
   | Glob Paths       | 300                       | 100         | 1.75              |
   | Glob Paths       | 300                       | 150         | 2.75              |
   | Glob Paths       | 300                       | 200         | 2.18              |
   | Single glob path| 40   | 10 | 32.66

[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-10-25 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-546552694
 
 
   retest this please





[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-10-25 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-546550712
 
 
   @steveloughran Sorry for the delay. Been busy the past couple of weeks.
   
   Here are the results for the test with various numbers of threads and 
threadpool sizes:
   
   ```
   | Type             | s3a conn max | Num Threads | Runtime (seconds) |
   |==================|==============|=============|===================|
   | Flat Paths       | 40           | 10          | 24.08             |
   | Flat Paths       | 40           | 20          | 12.07             |
   | Flat Paths       | 40           | 40          | 6.63              |
   | Flat Paths       | 40           | 60          | 6.94              |
   | Flat Paths       | 40           | 80          | 6.58              |
   | Flat Paths       | 40           | 100         | 8.24              |
   | Flat Paths       | 40           | 150         | 7.19              |
   | Flat Paths       | 40           | 200         | 6.24              |
   | Flat Paths       | 300          | 10          | 19.39             |
   | Flat Paths       | 300          | 20          | 10.16             |
   | Flat Paths       | 300          | 40          | 6.78              |
   | Flat Paths       | 300          | 60          | 6.34              |
   | Flat Paths       | 300          | 80          | 6.94              |
   | Flat Paths       | 300          | 100         | 5.35              |
   | Flat Paths       | 300          | 150         | 5.96              |
   | Flat Paths       | 300          | 200         | 6.78              |
   | Glob Paths       | 40           | 10          | 37.28             |
   | Glob Paths       | 40           | 20          | 4.74              |
   | Glob Paths       | 40           | 40          | 3.81              |
   | Glob Paths       | 40           | 60          | 4.17              |
   | Glob Paths       | 40           | 80          | 3.41              |
   | Glob Paths       | 40           | 100         | 3.01              |
   | Glob Paths       | 40           | 150         | 3.08              |
   | Glob Paths       | 40           | 200         | 2.63              |
   | Glob Paths       | 300          | 10          | 4.59              |
   | Glob Paths       | 300          | 20          | 3.26              |
   | Glob Paths       | 300          | 40          | 3.46              |
   | Glob Paths       | 300          | 60          | 2.62              |
   | Glob Paths       | 300          | 80          | 2.32              |
   | Glob Paths       | 300          | 100         | 2.45              |
   | Glob Paths       | 300          | 150         | 4.61              |
   | Glob Paths       | 300          | 200         | 2.5               |
   | Single glob path | 40           | 10          | 44.02             |
   | Single glob path | 40           | 20          | 38.54             |
   | Single glob path | 40           | 40          | 33.25             |
   | Single glob path | 40           | 60          | 34.83             |
   | Single glob path | 40           | 80          | 36.2              |
   | Single glob path | 40           | 100         | 34.94             |
   | Single glob path | 40           | 150         | 46.32             |
   | Single glob path | 40           | 200         | 35.36             |
   | Single glob path | 300          | 10          | 31.33             |
   | Single glob path | 300          | 20          | 35.35             |
   | Single glob path | 300          | 40          | 36.4              |
   | Single glob path | 300          | 60          | 34.7              |
   | Single glob path | 300          | 80          | 35.1              |
   | Single glob path | 300          | 100         | 33.87             |
   | Single glob path | 300          | 150         | 35.61             |
   | Single glob path | 300          | 200         | 37.25             |
 FileSystem org.apache.hadoop.fs.s3a.S3AFileSystem: 0 bytes read, 0 bytes 
written, 21232 read ops, 0 large read ops, 0 write ops
   ```
   
   





[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-09-30 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536741130
 
 
   Update: I tried increasing `fs.s3a.connection.maximum` and it did improve 
performance of the filesystem calls.
   
   I still need to set up a benchmark that runs on EC2 instead of remote dev 
laptop, will update in a couple days.
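
For reference, one way a setting like this might be handed to a Spark job is as `spark.hadoop.`-prefixed `--conf` flags on `spark-submit`. The sketch below only renders those flags; `SparkConfFlags` and `toSubmitArgs` are hypothetical names, and the values in the usage note are examples rather than recommendations.

```scala
// Hypothetical helper for illustration; it builds strings and does not launch Spark.
object SparkConfFlags {
  // Spark copies properties that start with this prefix into the Hadoop configuration.
  private val HadoopPrefix = "spark.hadoop."

  // Turns Hadoop properties into spark-submit arguments:
  // Seq("--conf", "spark.hadoop.<key>=<value>", ...), preserving input order.
  def toSubmitArgs(hadoopProps: Seq[(String, String)]): Seq[String] =
    hadoopProps.flatMap { case (key, value) =>
      Seq("--conf", s"$HadoopPrefix$key=$value")
    }
}
```

For example, `SparkConfFlags.toSubmitArgs(Seq("fs.s3a.connection.maximum" -> "100"))` yields the two arguments `--conf` and `spark.hadoop.fs.s3a.connection.maximum=100`.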





[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-09-27 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-536046144
 
 
   > "fs.s3a.connection.maximum" should be > than "fs.s3a.max.total.tasks"
   > "fs.s3a.threads.keepalivetime" from 60 to 300 to keep those connections 
around for longer (avoids that https overhead)
   
   Ah right, I hadn't considered that we might be bottlenecked by the S3A 
connection pool. I will update my measurements based on this.
   
   @steveloughran Let's say hypothetically that performance keeps improving the 
more threads we add (say, 500), and it doesn't cause S3 throttling, etc. 
There's probably an upper limit to how many threads are acceptable to spawn on 
the driver right? For example what if a user puts their driver on a `t2.nano` 
EC2 instance.
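
To make the question concrete, here is a minimal sketch of what such a limit could look like. Everything in it is an assumption for illustration: `DriverThreadCap`, `effectiveThreads`, `poolSizedAsAdvised` and the idea of clamping to both the connection pool size and a fixed ceiling are hypothetical, not something the PR or S3A does.

```scala
// Hypothetical policy sketch; none of these names exist in Spark or Hadoop.
object DriverThreadCap {
  // Clamps the requested thread count so it never exceeds the S3A connection
  // pool size or an operator-chosen ceiling for the driver.
  def effectiveThreads(requested: Int, maxConnections: Int, hardCap: Int): Int = {
    require(requested > 0 && maxConnections > 0 && hardCap > 0, "all limits must be positive")
    math.min(requested, math.min(maxConnections, hardCap))
  }

  // Checks the quoted advice that fs.s3a.connection.maximum should be greater
  // than fs.s3a.max.total.tasks. Returns None when either setting is absent
  // from the map, since there is then nothing to compare.
  def poolSizedAsAdvised(conf: Map[String, String]): Option[Boolean] =
    for {
      connections <- conf.get("fs.s3a.connection.maximum").map(_.toInt)
      tasks <- conf.get("fs.s3a.max.total.tasks").map(_.toInt)
    } yield connections > tasks
}
```

Under this policy a request for 500 threads against a pool of 40 connections and a ceiling of 64 would be reduced to 40.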





[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-09-23 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-534380034
 
 
   I ran additional measurements testing out different thread numbers on the S3 
Landsat data, and it seems like the sweet spot is somewhere between 20-30 
threads (for my environment anyways)
   
   * *30 glob paths* - 30 glob paths with the final result of 1206 files
   * *single glob path* - 1 single glob path with the final result of 1206 files
   * *raw paths* - 1206 raw paths without any globs
   
   see here: https://github.com/apache/spark/pull/25899#issuecomment-534069194
   
   **original code**
   30 glob paths _15.6 seconds_
   single glob path _11.3 seconds_
   raw paths _59 seconds_
   
   **8 threads**
   30 glob paths _1.48 seconds_
   single glob path _11 seconds_
   raw paths _7.73 seconds_
   
   **20 threads**
   30 glob paths _1.47 seconds_
   single glob path _15.45 seconds_
   raw paths _4.16 seconds_
   
   **30 threads**
   30 glob paths _0.92 seconds_
   single glob path _11.74 seconds_
   raw paths _4.12 seconds_
   
   **40 threads**
   30 glob paths _0.93 seconds_
   single glob path _13.48 seconds_
   raw paths _4.08 seconds_
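
To put the raw-paths numbers in perspective, the speedup over the original code can be worked out with a few lines. This is just arithmetic on the runtimes listed above; `SpeedupSketch` and its members are hypothetical names.

```scala
// Hypothetical helper; the runtimes are copied from the "raw paths" lines above.
object SpeedupSketch {
  val rawPathsSeconds: Seq[(String, Double)] = Seq(
    "original code" -> 59.0,
    "8 threads" -> 7.73,
    "20 threads" -> 4.16,
    "30 threads" -> 4.12,
    "40 threads" -> 4.08
  )

  // Divides the first (baseline) runtime by each runtime, so the baseline maps to 1.0.
  def speedups(runs: Seq[(String, Double)]): Seq[(String, Double)] = {
    require(runs.nonEmpty, "need at least a baseline run")
    val baseline = runs.head._2
    runs.map { case (label, seconds) => label -> baseline / seconds }
  }
}
```

On these numbers the raw-paths case is roughly 7.6x faster with 8 threads and roughly 14x faster from 20 threads onward, which is where the curve flattens out.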
   





[GitHub] [spark] cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking FileSystem calls in DataSource#checkAndGlobPathIfNecessary

2019-09-23 Thread GitBox
cozos commented on issue #25899: [SPARK-29089][SQL] Parallelize blocking 
FileSystem calls in DataSource#checkAndGlobPathIfNecessary
URL: https://github.com/apache/spark/pull/25899#issuecomment-534353737
 
 
   @dongjoon-hyun Ok done, how does that sound? Should I update the JIRA as well?

