Re: saveAsTextFile extremely slow near finish

2015-03-11 Thread Imran Rashid
Is your data skewed? Could it be that there are a few keys with a huge
number of records? You might consider outputting
(recordA, count)
(recordB, count)

instead of

recordA
recordA
recordA
...


You could do this with:

input = sc.textFile
pairCounts = input.map { x => (x, 1) }.reduceByKey(_ + _)
sorted = pairCounts.sortByKey
sorted.saveAsTextFile
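Spark aside, the space saving from emitting (record, count) pairs instead of one line per duplicate record can be sketched in plain Python (a toy illustration of the idea, not Spark code; the record names are made up):

```python
from collections import Counter

# Toy stand-in for a skewed dataset: a few keys repeated many times.
records = ["recordA"] * 5 + ["recordB"] * 2 + ["recordC"]

# Equivalent of map { x => (x, 1) }.reduceByKey(_ + _) followed by sortByKey:
# one (record, count) pair per distinct key, in key order.
pair_counts = sorted(Counter(records).items())

for record, count in pair_counts:
    print(record, count)
```

With heavy skew this writes one output line per distinct key rather than one per input record, which is exactly what shrinks the straggler partitions at the end of the save.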




Re: saveAsTextFile extremely slow near finish

2015-03-10 Thread Sean Owen
This is more of an aside, but why repartition this data instead of letting
it define the partitions naturally? You will end up with a similar number
of partitions.


Re: saveAsTextFile extremely slow near finish

2015-03-09 Thread Akhil Das
Don't you think 1000 partitions is too few for 160 GB of data? You could
also try using the KryoSerializer and enabling RDD compression.
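Concretely, both suggestions can be switched on from the submit command. These are standard Spark configuration keys; the values are the ones being suggested here, not settings from the original post:

```
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
--conf "spark.rdd.compress=true"
```

Kryo mainly cuts the size of shuffled data during the sort; `spark.rdd.compress` trades some CPU for smaller serialized partitions.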

Thanks
Best Regards



saveAsTextFile extremely slow near finish

2015-03-09 Thread mingweili0x
I'm basically running a sort using Spark. The Spark program reads from
HDFS, sorts on composite keys, and then saves the partitioned result back to
HDFS.
Pseudo code is like this:

input = sc.textFile
pairs = input.mapToPair
sorted = pairs.sortByKey
values = sorted.values
values.saveAsTextFile

Input size is ~160 GB, and I specified 1000 partitions in both
JavaSparkContext.textFile and JavaPairRDD.sortByKey. From the WebUI, the job is
split into two stages: mapToPair and saveAsTextFile. mapToPair finished
in 8 minutes, while saveAsTextFile took ~15 minutes to reach (2366/2373)
progress, and the last few tasks just took forever and never finished.

Cluster setup:
8 nodes
on each node: 15 GB memory, 8 cores

running parameters:
--executor-memory 12G
--conf "spark.cores.max=60"
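For reference, these parameters correspond to a launch command along these lines. The driver class and jar names are placeholders, not details from the original post:

```
# Hypothetical launch command; class and jar names are made up.
spark-submit \
  --class com.example.SortJob \
  --executor-memory 12G \
  --conf "spark.cores.max=60" \
  sort-job.jar
```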

Thank you for any help.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-extremely-slow-near-finish-tp21978.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org