[jira] [Commented] (SPARK-17593) list files on s3 very slow

2017-11-10 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247886#comment-16247886
 ] 

Steve Loughran commented on SPARK-17593:


Hey Nick,

yes, you need to move to FileSystem.listFiles(path, recursive=true) and then
iterate through the results. That actually scales better to many thousands of
files, but you'd know that. I haven't done a patch for that myself.
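A hedged sketch of that listing pattern (the bucket path is illustrative; this assumes the Hadoop FileSystem API with an S3A connector on the classpath, and on Hadoop 2.8+ s3a can serve the recursive case as a flat paged listing of the prefix rather than a treewalk):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{LocatedFileStatus, Path, RemoteIterator}

val path = new Path("s3a://bucket_name/events_v3")
val fs = path.getFileSystem(new Configuration())

// One recursive listFiles call instead of one list call per directory
val it: RemoteIterator[LocatedFileStatus] = fs.listFiles(path, true)
val files = scala.collection.mutable.ArrayBuffer[LocatedFileStatus]()
while (it.hasNext) {
  files += it.next()
}
{code}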

Moving to that won't be any worse for Spark on older Hadoop versions, but it
will get the read-time speedup on Hadoop 2.8+. Write performance is a separate
issue, which is really "commit algorithms for blob storage".

> list files on s3 very slow
> --
>
> Key: SPARK-17593
> URL: https://issues.apache.org/jira/browse/SPARK-17593
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: spark 2.0.0, hadoop 2.7.2 ( hadoop 2.7.3)
>Reporter: Gaurav Shah
>Priority: Minor
>
> let's say we have the following partitioned data:
> {code}
> events_v3
> -- event_date=2015-01-01
>  event_hour=0
> -- verb=follow
> part1.parquet.gz 
>  event_hour=1
> -- verb=click
> part1.parquet.gz 
> -- event_date=2015-01-02
>  event_hour=5
> -- verb=follow
> part1.parquet.gz 
>  event_hour=10
> -- verb=click
> part1.parquet.gz 
> {code}
> To read (or write) parquet partitioned data, Spark makes a call to 
> `ListingFileCatalog.listLeafFiles`, which recursively tries to list all 
> files and folders.
> In this case, if we had 300 dates, we would have created 300 jobs, each trying 
> to get the file list from its date directory. This process takes about 10 
> minutes to finish (with 2 executors), whereas a ruby script that lists all 
> files recursively in the same folder takes about 1 minute, on the same 
> machine with just 1 thread.
> I am confused as to why listing files takes so much extra time.
> spark code:
> {code:scala}
> val sparkSession = org.apache.spark.sql.SparkSession.builder
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .config("spark.sql.parquet.filterPushdown", true)
>   .config("spark.sql.hive.verifyPartitionPath", false)
>   .config("spark.sql.hive.convertMetastoreParquet.mergeSchema", false)
>   .config("parquet.enable.summary-metadata", false)
>   .config("spark.sql.sources.partitionDiscovery.enabled", false)
>   .getOrCreate()
> val df = sparkSession.read.option("mergeSchema", "false").format("parquet").load("s3n://bucket_name/events_v3")
> df.createOrReplaceTempView("temp_events")
> sparkSession.sql(
>   """
>     |select verb, count(*) from temp_events where event_date = "2016-08-05"
>     |group by verb
>   """.stripMargin).show()
> {code}
> ruby code:
> {code:ruby}
> gem 'aws-sdk', '~> 2'
> require 'aws-sdk'
> client = Aws::S3::Client.new(:region=>'us-west-1')
> next_continuation_token = nil
> total = 0
> loop do
>   a = client.list_objects_v2({
>     bucket: "bucket", # required
>     max_keys: 1000,
>     prefix: "events_v3/",
>     continuation_token: next_continuation_token,
>     fetch_owner: false,
>   })
>   puts a.contents.last.key
>   total += a.contents.size
>   next_continuation_token = a.next_continuation_token
>   break unless a.is_truncated
> end
> puts "total"
> puts total
> {code}
> I tried looking into the following bug:
> https://issues.apache.org/jira/browse/HADOOP-12810
> but hadoop 2.7.3 doesn't solve that problem.
> stackoverflow reference:
> http://stackoverflow.com/questions/39525288/spark-parquet-write-gets-slow-as-partitions-grow



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17593) list files on s3 very slow

2017-11-08 Thread Nick Dimiduk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244887#comment-16244887
 ] 

Nick Dimiduk commented on SPARK-17593:
--

So the fix in Hadoop 2.8 is for any variant of the s3* FileSystem? Or is it 
only for s3a?

bq. as it really needs Spark to move to listFiles(recursive)

Do we still need this change to be shipped in Spark? Thanks.




[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-12-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15742480#comment-15742480
 ] 

Steve Loughran commented on SPARK-17593:


Marking as a dependency of HADOOP-13208, which fixes it for all code that uses 
this API. Anything which implements its own treewalk will suffer. In tests 
over long-haul links, a single {{getFileStatus()}} call can take 
[~1500millis|https://steveloughran.blogspot.co.uk/2016/12/how-long-does-filesystemexists-take.html].

Note also that HADOOP-13345 delivers faster listing performance for all API 
calls by caching the metadata in DynamoDB; this will also give you the 
consistency needed to use S3 as a direct destination of work.




[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-10-10 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15562171#comment-15562171
 ] 

Gaurav Shah commented on SPARK-17593:
-

Added a detailed explanation and a solution here:
http://stackoverflow.com/questions/39513505/spark-lists-all-leaf-node-even-in-partitioned-data/39946236#39946236




[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-09-19 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503675#comment-15503675
 ] 

Gaurav Shah commented on SPARK-17593:
-

I definitely agree that flattening out will help (though I'm not sure how I 
could, because of the way we have partitions), but even if it is not flattened, 
10 minutes sounds high given that the listing itself is not actually slow.

I agree on compression; as of spark 2.0.0 we have moved to snappy (about a 
month ago).




[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-09-19 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503668#comment-15503668
 ] 

Gaurav Shah commented on SPARK-17593:
-

Thanks [~ste...@apache.org]. S3 is definitely slower than hdfs, I would agree. 
But if listing via that ruby script can happen in 1 minute while spark takes 10 
minutes, then something definitely sounds wrong.

Hadoop 2.8/2.9 with https://issues.apache.org/jira/browse/HADOOP-13208 will 
definitely help.

I have updated the directory structure and have sent you an email.




[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-09-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503663#comment-15503663
 ] 

Steve Loughran commented on SPARK-17593:


Looking at the dir tree, anything you could do to flatten things by putting 
them into the names would make a difference.

Incidentally, .gz files aren't great for parallel data analysis as they aren't 
splittable. Have you considered snappy?
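
A minimal sketch of switching to snappy (assuming Spark's standard `spark.sql.parquet.compression.codec` setting; the bucket path and partition columns below are illustrative):

{code:scala}
// Session-wide: make snappy the Parquet compression codec
sparkSession.conf.set("spark.sql.parquet.compression.codec", "snappy")

// Or per write, via the DataFrameWriter option
df.write
  .option("compression", "snappy")
  .partitionBy("event_date", "event_hour", "verb")
  .parquet("s3a://bucket_name/events_v3_snappy")
{code}

Unlike a whole-file .gz wrapper, snappy-compressed Parquet stays splittable, so row groups can be read in parallel.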




[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-09-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503583#comment-15503583
 ] 

Steve Loughran commented on SPARK-17593:


Sean is right: this is primarily S3, or more specifically, how S3 is made to 
look like a filesystem but isn't really one. What you are seeing is the cost of 
doing recursive tree walks (many, many list operations).

For a start, use s3a URLs rather than s3n; that's where all the optimisation 
work is going.

This isn't going to help you immediately, as it really needs Spark to move to 
listFiles(recursive), along with the move to Hadoop 2.8, so as to pick up 
HADOOP-13208. I'll look at the codepath here to see if it's easy to do.

Otherwise, try to partition the dates more hierarchically and then select under 
that (e.g. have separate dirs for year, month, etc). Alternatively, go for a 
flat structure: all events in one single directory. List time then drops to 
O(entries/5000).


One thing that would be good would be if you could stick up on the JIRA, or 
email me directly, what your full directory tree looks like. That won't fix the 
problem, but it will give me another example data structure to use when testing 
performance speedups. We use the TPC-DS layout; it's good to have more 
examples. The output of your ruby command is enough.




[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-09-19 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503207#comment-15503207
 ] 

Gaurav Shah commented on SPARK-17593:
-

Thanks [~srowen], I tried that after your comment, but it didn't help.
{code}
val df = sparkSession.read.option("mergeSchema", "false").format("parquet").load("s3n://bucket/events_v3/")
df.createOrReplaceTempView("temp_events")
sparkSession.sql(
  """
    |select verb, count(*) from temp_events where event_date = "2016-08-05"
    |group by verb
  """.stripMargin).show()
{code}

> list files on s3 very slow
> --
>
> Key: SPARK-17593
> URL: https://issues.apache.org/jira/browse/SPARK-17593
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
> Environment: spark 2.0.0, hadoop 2.7.2 ( hadoop 2.7.3)
>Reporter: Gaurav Shah
>
> lets say we have following partitioned data:
> {code}
> events_v3
>   -- event_date=2015-01-01
> -- event_hour=2015-01-1
>   -- part1.parquet.gz
>   -- event_date=2015-01-02
> -- event_hour=5
>   -- part1.parquet.gz
> {code}
> To read (or write ) parquet partitioned data via spark it makes call to 
> `ListingFileCatalog.listLeafFiles` .  Which recursively tries to list all 
> files and folders.
> In this case if we had 300 dates, we would have created 300 jobs each trying 
> to get filelist from date_directory. This process takes about 10 minutes to 
> finish ( with 2 executors). vs if I use a ruby script to get list of all 
> files recursively in the same folder it takes about 1 minute, on the same 
> machine with just 1 thread. 
> I am confused as to why it would take so much time extra for listing files.
> spark code:
> {code:scala}
> val sparkSession = org.apache.spark.sql.SparkSession.builder
> .config("spark.sql.hive.metastorePartitionPruning",true)
> .config("spark.sql.parquet.filterPushdown", true)
> .config("spark.sql.hive.verifyPartitionPath", false)
> .config("spark.sql.hive.convertMetastoreParquet.mergeSchema",false)
> .config("parquet.enable.summary-metadata",false)
> .config("spark.sql.sources.partitionDiscovery.enabled",false)
> .getOrCreate()
> val df = 
> sparkSession.read.option("mergeSchema","false").format("parquet").load("s3n://bucket_name/events_v3")
> df.createOrReplaceTempView("temp_events")
> sparkSession.sql(
>   """
> |select verb,count(*) from temp_events where event_date = 
> "2016-08-05" group by verb
>   """.stripMargin).show()
> {code}
> ruby code:
> {code:ruby}
> gem 'aws-sdk', '~> 2'
> require 'aws-sdk'
> client = Aws::S3::Client.new(region: 'us-west-1')
> next_continuation_token = nil
> total = 0
> loop do
>   a = client.list_objects_v2({
>     bucket: "bucket", # required
>     max_keys: 1000,
>     prefix: "events_v3/",
>     continuation_token: next_continuation_token,
>     fetch_owner: false,
>   })
>   puts a.contents.last.key
>   total += a.contents.size
>   next_continuation_token = a.next_continuation_token
>   break unless a.is_truncated
> end
> puts "total"
> puts total
> {code}
> I tried looking into the following bug:
> https://issues.apache.org/jira/browse/HADOOP-12810
> but Hadoop 2.7.3 doesn't solve that problem.
> stackoverflow reference:
> http://stackoverflow.com/questions/39525288/spark-parquet-write-gets-slow-as-partitions-grow
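The cost difference described above can be sketched with a stub client in pure Ruby (no AWS SDK; the key layout, class, and call counts here are hypothetical, chosen to mirror the ~300 date partitions): a recursive per-directory walk, roughly what `listLeafFiles` does against S3 on Hadoop 2.7, issues one LIST round trip per "directory", while a flat prefix listing covers the whole tree in a single call (plus pagination).

```ruby
# Hypothetical key layout: 300 date partitions, one hour dir, one file each.
KEYS = (1..300).map do |d|
  date = format('event_date=2015-%02d-%02d', d / 31 + 1, d % 31 + 1)
  "events_v3/#{date}/event_hour=0/part1.parquet.gz"
end

# Stub S3 client that just counts LIST round trips.
class StubClient
  attr_reader :list_calls

  def initialize(keys)
    @keys = keys
    @list_calls = 0
  end

  # One "LIST" request: keys under prefix, optionally grouped by delimiter
  # into common prefixes -- the way S3 emulates directories.
  def list(prefix, delimiter: nil)
    @list_calls += 1
    hits = @keys.select { |k| k.start_with?(prefix) }
    return hits unless delimiter
    hits.map do |k|
      i = k.index(delimiter, prefix.size)
      i ? k[0..i] : k # common prefix ("directory") or leaf key
    end.uniq
  end
end

# Recursive walk: one LIST per "directory" level.
def walk(client, prefix)
  client.list(prefix, delimiter: '/').flat_map do |entry|
    entry.end_with?('/') ? walk(client, entry) : [entry]
  end
end

slow = StubClient.new(KEYS)
files = walk(slow, 'events_v3/')

fast = StubClient.new(KEYS)
fast.list('events_v3/') # one flat LIST (ignoring the 1000-key page limit)

puts "files found:          #{files.size}"      # 300
puts "recursive LIST calls: #{slow.list_calls}" # 601 (1 root + 300 + 300)
puts "flat LIST calls:      #{fast.list_calls}" # 1
```

With one round trip per directory, latency to S3 multiplies by the number of partitions, which is consistent with the 10-minute listing reported above.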



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-09-19 Thread Gaurav Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503164#comment-15503164
 ] 

Gaurav Shah commented on SPARK-17593:
-

Thanks [~srowen], my Spark code does use `s3n`.




[jira] [Commented] (SPARK-17593) list files on s3 very slow

2016-09-19 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503128#comment-15503128
 ] 

Sean Owen commented on SPARK-17593:
---

I'm not sure this is a Spark problem; it seems S3-specific. Try using 
{{s3n://bucket_name/events_v3/}}, as I seem to recall that in some cases it 
matters whether you end with a slash or a glob pattern. CC 
[~ste...@apache.org]
