[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16247886#comment-16247886 ]

Steve Loughran commented on SPARK-17593:

Hey Nick, yes: Spark needs to move to {{FileSystem.listFiles(path, recursive=true)}} and then iterate through the results. That actually scales better to many thousands of files, as you'd know. I haven't done a patch for that myself. Moving to it won't be any worse for Spark on older Hadoop versions, but it will get the read-time speedup on Hadoop 2.8+. Write performance is a separate issue, which is really "commit algorithms for blob storage".

> list files on s3 very slow
> --------------------------
>
>                 Key: SPARK-17593
>                 URL: https://issues.apache.org/jira/browse/SPARK-17593
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 2.0.0
>         Environment: spark 2.0.0, hadoop 2.7.2 (hadoop 2.7.3)
>            Reporter: Gaurav Shah
>            Priority: Minor
>
> Let's say we have the following partitioned data:
> {code}
> events_v3
>   -- event_date=2015-01-01
>      -- event_hour=0
>         -- verb=follow
>            part1.parquet.gz
>      -- event_hour=1
>         -- verb=click
>            part1.parquet.gz
>   -- event_date=2015-01-02
>      -- event_hour=5
>         -- verb=follow
>            part1.parquet.gz
>      -- event_hour=10
>         -- verb=click
>            part1.parquet.gz
> {code}
> To read (or write) parquet partitioned data, Spark calls
> `ListingFileCatalog.listLeafFiles`, which recursively tries to list all
> files and folders.
> In this case, if we had 300 dates, we would have created 300 jobs, each
> trying to get the file list from its date directory. This process takes
> about 10 minutes to finish (with 2 executors), whereas a ruby script that
> lists all files recursively under the same folder takes about 1 minute,
> on the same machine with just 1 thread.
> I am confused as to why listing files takes so much extra time in Spark.
> spark code:
> {code:scala}
> val sparkSession = org.apache.spark.sql.SparkSession.builder
>   .config("spark.sql.hive.metastorePartitionPruning", true)
>   .config("spark.sql.parquet.filterPushdown", true)
>   .config("spark.sql.hive.verifyPartitionPath", false)
>   .config("spark.sql.hive.convertMetastoreParquet.mergeSchema", false)
>   .config("parquet.enable.summary-metadata", false)
>   .config("spark.sql.sources.partitionDiscovery.enabled", false)
>   .getOrCreate()
> val df = sparkSession.read.option("mergeSchema", "false").format("parquet").load("s3n://bucket_name/events_v3")
> df.createOrReplaceTempView("temp_events")
> sparkSession.sql(
>   """
>     |select verb, count(*) from temp_events where event_date = "2016-08-05" group by verb
>   """.stripMargin).show()
> {code}
> ruby code:
> {code:ruby}
> gem 'aws-sdk', '~> 2'
> require 'aws-sdk'
> client = Aws::S3::Client.new(region: 'us-west-1')
> next_continuation_token = nil
> total = 0
> loop do
>   a = client.list_objects_v2({
>     bucket: "bucket",    # required
>     max_keys: 1000,
>     prefix: "events_v3/",
>     continuation_token: next_continuation_token,
>     fetch_owner: false,
>   })
>   puts a.contents.last.key
>   total += a.contents.size
>   next_continuation_token = a.next_continuation_token
>   break unless a.is_truncated
> end
> puts "total"
> puts total
> {code}
> Tried looking into the following bug:
> https://issues.apache.org/jira/browse/HADOOP-12810
> but hadoop 2.7.3 doesn't solve that problem.
> stackoverflow reference:
> http://stackoverflow.com/questions/39525288/spark-parquet-write-gets-slow-as-partitions-grow

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
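To put rough numbers on the description above: a recursive treewalk issues one LIST request per directory visited, while a flat {{list_objects_v2}} scan pages through every key under the prefix in 1000-key pages. A back-of-the-envelope sketch in Ruby; the partition counts (300 dates, 24 hours, 2 verbs, one file per leaf) are illustrative assumptions, not figures from the reporter's bucket:

```ruby
# Rough model of why the per-directory treewalk is slow.
# Assumed layout: 300 dates x 24 hours x 2 verbs, one file per leaf dir.
dates, hours, verbs = 300, 24, 2
files = dates * hours * verbs            # leaf objects: 14_400

# Treewalk: one LIST request per directory visited (root plus every
# date, hour and verb directory), as a recursive listLeafFiles does.
dirs = 1 + dates + dates * hours + dates * hours * verbs
treewalk_requests = dirs                 # 21_901 LIST calls

# Flat listing: one paginated list_objects_v2 scan over the prefix,
# 1000 keys per page, regardless of directory depth.
flat_requests = (files / 1000.0).ceil    # 15 LIST calls

puts "treewalk: #{treewalk_requests} requests"
puts "flat:     #{flat_requests} requests"
```

Even before per-request latency is considered, the treewalk is three orders of magnitude more chatty, which matches the 10-minute vs 1-minute gap reported.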
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16244887#comment-16244887 ]

Nick Dimiduk commented on SPARK-17593:

So is the fix in Hadoop 2.8 for any variant of the s3* FileSystem, or only for s3a?

bq. as it really needs Spark to move to listFiles(recursive)

Does this change still need to be shipped in Spark? Thanks.
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15742480#comment-15742480 ]

Steve Loughran commented on SPARK-17593:

Marking as a dependency of HADOOP-13208, which fixes this for all code that uses the listFiles API; anything which implements its own treewalk will still suffer. In tests over long-haul links, a single {{getFileStatus()}} call can take [~1500 millis|https://steveloughran.blogspot.co.uk/2016/12/how-long-does-filesystemexists-take.html].

Note also that HADOOP-13345 delivers faster listing performance for all API calls by caching the metadata in DynamoDB; this will also give you the consistency needed to use S3 as a direct destination of work.
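The treewalk penalty described above exists because S3 has no real directories: every descendant of a prefix can come back from a single paginated listing, which is what a recursive listFiles exploits. A toy Ruby model of that semantics over a hypothetical flat key space (illustrative keys only, not the Hadoop API itself):

```ruby
# S3's namespace is flat: "directories" are just key prefixes, so every
# "leaf file" under a prefix is returned by one (paginated) listing --
# there is no tree to walk. Hypothetical keys for illustration:
KEYS = [
  "events_v3/event_date=2015-01-01/event_hour=0/verb=follow/part1.parquet.gz",
  "events_v3/event_date=2015-01-01/event_hour=1/verb=click/part1.parquet.gz",
  "events_v3/event_date=2015-01-02/event_hour=5/verb=follow/part1.parquet.gz",
]

# One scan over the flat key space replaces the whole recursive walk.
def list_files_recursive(keys, prefix)
  keys.select { |k| k.start_with?(prefix) }
end

p list_files_recursive(KEYS, "events_v3/event_date=2015-01-01/").size  # 2
```

A per-directory treewalk over the same keys would instead pay one round trip for every prefix level, which is where the {{getFileStatus()}} latencies above start to multiply.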
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15562171#comment-15562171 ]

Gaurav Shah commented on SPARK-17593:

Added a detailed explanation and solution here:
http://stackoverflow.com/questions/39513505/spark-lists-all-leaf-node-even-in-partitioned-data/39946236#39946236
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503675#comment-15503675 ]

Gaurav Shah commented on SPARK-17593:

I definitely agree that flattening out would help (though I'm not sure how, given the way we have partitioned), but even without flattening, 10 minutes sounds high given that the listing itself is not actually slow. I agree on the compression point too; as of Spark 2.0.0 we have moved to snappy (for about a month now).
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503668#comment-15503668 ]

Gaurav Shah commented on SPARK-17593:

Thanks [~ste...@apache.org]. S3 is definitely slower than HDFS, I would agree. But if listing via that ruby script can happen in 1 minute while Spark takes 10 minutes, then something definitely sounds wrong. Hadoop 2.8/2.9 with https://issues.apache.org/jira/browse/HADOOP-13208 will definitely help. I've updated the directory structure and have sent you an email.
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503663#comment-15503663 ]

Steve Loughran commented on SPARK-17593:

Looking at the dir tree, anything you could do to flatten things by putting the partition values in the file names would make a difference.

Incidentally, .gz files aren't great for parallel data analysis as they aren't splittable. Have you considered snappy?
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503583#comment-15503583 ]

Steve Loughran commented on SPARK-17593:

Sean is right: this is primarily S3, or more specifically, how S3 is made to look like a filesystem but isn't really one. What you are seeing is the cost of doing recursive tree walks: many, many list operations.

For a start, use s3a URLs rather than s3n; that's where all the optimisation work is going. This isn't going to help you immediately, as it really needs Spark to move to listFiles(recursive), along with a move to Hadoop 2.8, so as to pick up HADOOP-13208. I'll look at the codepath here to see if it's easy to do.

Otherwise, try to partition dates more hierarchically and then select under that (e.g. have separate dirs for year, month, etc.). Alternatively, go for a flat structure with all events in one single directory; list time drops to O(entries/5000).

One thing that would be good: could you stick your full directory tree up on the JIRA, or email it to me directly? That won't fix the problem, but it will give me another example data structure to use when testing performance speedups. We use the TPC-DS layout; it's good to have more examples. The output of your ruby command is enough.
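Worth noting on the flat-structure suggestion above: even when everything sits in one directory listing, the Hive-style key=value path segments still carry the partition values, so a predicate like event_date = "2016-08-05" can be evaluated against key names from a single listing. A minimal sketch of such a parser in Ruby; it is illustrative only, not Spark's actual partition-discovery code:

```ruby
# Extract Hive-style partition values (key=value path segments) from an
# S3 object key. Segments without '=' (the table root, the file name)
# are skipped.
def partition_values(key)
  key.split("/").each_with_object({}) do |seg, h|
    k, v = seg.split("=", 2)
    h[k] = v if v
  end
end

key = "events_v3/event_date=2016-08-05/event_hour=5/verb=click/part1.parquet.gz"
p partition_values(key)
# {"event_date"=>"2016-08-05", "event_hour"=>"5", "verb"=>"click"}
```

Filtering keys this way is what makes one flat paginated scan sufficient: the partition predicate prunes objects without any per-directory round trips.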
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503207#comment-15503207 ]

Gaurav Shah commented on SPARK-17593:

Thanks [~srowen]. I tried that after your comment, but it didn't help:
{code}
val df = sparkSession.read.option("mergeSchema", "false").format("parquet").load("s3n://bucket/events_v3/")
df.createOrReplaceTempView("temp_events")
sparkSession.sql(
  """
    |select verb, count(*) from temp_events where event_date = "2016-08-05" group by verb
  """.stripMargin).show()
{code}
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503164#comment-15503164 ] Gaurav Shah commented on SPARK-17593: - Thanks [~srowen], my Spark code does use `s3n`.
[jira] [Commented] (SPARK-17593) list files on s3 very slow
[ https://issues.apache.org/jira/browse/SPARK-17593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15503128#comment-15503128 ] Sean Owen commented on SPARK-17593: --- I'm not sure this is a Spark problem; it seems S3-specific. Try using {{s3n://bucket_name/events_v3/}}, as I seem to recall that in some cases it matters whether the path ends with a slash or a glob pattern. CC [~ste...@apache.org]
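The suggestion above can be sketched as follows. This is a minimal, hypothetical example ({{bucket_name}} is a placeholder, and the s3n paths assume the Hadoop s3n connector is on the classpath); whether either variant changes listing behavior depends on the connector version:
{code:scala}
// Variant 1: explicit trailing slash on the table root.
val withSlash = sparkSession.read
  .format("parquet")
  .load("s3n://bucket_name/events_v3/")

// Variant 2: a glob over the top-level partition directories,
// which can narrow what the file catalog has to list.
val withGlob = sparkSession.read
  .format("parquet")
  .load("s3n://bucket_name/events_v3/event_date=*")
{code}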