Re: saveAsTextFile extremely slow near finish
Is your data skewed? Could it be that a few keys have a huge number of records? If so, you might consider outputting (recordA, count), (recordB, count) pairs instead of repeating recordA recordA recordA ... You could do this with something like:

    input = sc.textFile(...)
    pairsCounts = input.map { x => (x, 1) }.reduceByKey(_ + _)
    sorted = pairsCounts.sortByKey()
    sorted.saveAsTextFile(...)

On Mon, Mar 9, 2015 at 12:31 PM, mingweili0x wrote:

> I'm basically running a sort using Spark. The Spark program will read
> from HDFS, sort on composite keys, and then save the partitioned result
> back to HDFS. Pseudocode is like this:
>
> input = sc.textFile
> pairs = input.mapToPair
> sorted = pairs.sortByKey
> values = sorted.values
> values.saveAsTextFile
>
> Input size is ~160 GB, and I specified 1000 partitions in
> JavaSparkContext.textFile and JavaPairRDD.sortByKey. From the WebUI, the
> job is split into two stages: saveAsTextFile and mapToPair. mapToPair
> finished in 8 mins, while saveAsTextFile took ~15 mins to reach
> (2366/2373) progress and the last few tasks just take forever and never
> finish.
>
> Cluster setup:
> 8 nodes
> on each node: 15 GB memory, 8 cores
>
> running parameters:
> --executor-memory 12G
> --conf "spark.cores.max=60"
>
> Thank you for any help.
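To see why the (record, count) representation helps with skew, here is the same counting idea sketched locally in plain Python; `Counter` plays the role of `reduceByKey`, and the skewed data set is made up for illustration:

```python
from collections import Counter

# Made-up skewed input: one "hot" record accounts for almost all lines.
records = ["hot"] * 1_000_000 + ["a", "b", "c"]

# (record, count) pairs, analogous to map{x => (x, 1)}.reduceByKey(_ + _)
counts = Counter(records)

# A million identical lines collapse into a single pair, so the task
# writing the hot key's partition no longer has to emit a million lines.
pairs = sorted(counts.items())
print(pairs)  # [('a', 1), ('b', 1), ('c', 1), ('hot', 1000000)]
```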
Re: saveAsTextFile extremely slow near finish
This is more of an aside, but why repartition this data instead of letting the input define the partitioning naturally? You would end up with a similar number of partitions anyway.

On Mar 9, 2015 5:32 PM, "mingweili0x" wrote:

> [quoted message trimmed]
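For a rough sense of the "natural" partition count: assuming a 128 MB HDFS block size (a common default, though the thread doesn't state it), a ~160 GB input would already arrive as about this many splits:

```python
# Estimate of input splits for ~160 GB, assuming 128 MB HDFS blocks
# (the actual block size is not given in the thread).
input_gb = 160
block_mb = 128
splits = input_gb * 1024 // block_mb
print(splits)  # 1280
```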
Re: saveAsTextFile extremely slow near finish
Don't you think 1000 partitions is too few for 160 GB of data? You could also try using the KryoSerializer and enabling RDD compression.

Thanks
Best Regards

On Mon, Mar 9, 2015 at 11:01 PM, mingweili0x wrote:

> [quoted message trimmed]
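A sketch of those two suggested settings as they might look in spark-defaults.conf (each can equivalently be passed as a --conf flag to spark-submit); registering the key/value classes with Kryo is a further optional step:

```
# spark-defaults.conf (or pass each as --conf key=value on spark-submit)
spark.serializer    org.apache.spark.serializer.KryoSerializer
spark.rdd.compress  true
```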
saveAsTextFile extremely slow near finish
I'm basically running a sort using Spark. The Spark program will read from HDFS, sort on composite keys, and then save the partitioned result back to HDFS. Pseudocode is like this:

    input = sc.textFile
    pairs = input.mapToPair
    sorted = pairs.sortByKey
    values = sorted.values
    values.saveAsTextFile

Input size is ~160 GB, and I specified 1000 partitions in JavaSparkContext.textFile and JavaPairRDD.sortByKey. From the WebUI, the job is split into two stages: saveAsTextFile and mapToPair. mapToPair finished in 8 mins, while saveAsTextFile took ~15 mins to reach (2366/2373) progress and the last few tasks just take forever and never finish.

Cluster setup:
8 nodes
on each node: 15 GB memory, 8 cores

Running parameters:
--executor-memory 12G
--conf "spark.cores.max=60"

Thank you for any help.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
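For reference, the pipeline in the pseudocode above can be sketched locally in plain Python; the tab-separated "key<TAB>value" line format is an assumption for illustration, since the actual composite-key layout isn't shown in the thread:

```python
# Local, single-machine stand-in for the Spark pipeline; the
# "key<TAB>value" line format is assumed for illustration.
lines = ["b\t2", "a\t1", "c\t3"]                    # input = sc.textFile
pairs = [tuple(l.split("\t", 1)) for l in lines]    # pairs = input.mapToPair
pairs.sort(key=lambda kv: kv[0])                    # sorted = pairs.sortByKey
values = [v for _, v in pairs]                      # values = sorted.values
print(values)  # ['1', '2', '3']                    # values.saveAsTextFile
```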