Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-29 Thread Earthson
Too many GCs. The task runs much faster with more memory (heap space). The CPU load is still too high, and the network load is only about 20+ MB/s (not high enough). So what is the correct way to solve this GC problem? Are there other ways besides adding more memory?
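
If more heap helps, the usual first steps in Spark 1.0.x are raising executor memory and turning on GC logging to see what is actually being collected. A minimal sketch, assuming a SparkConf-based setup; these property names exist in 1.0.x, but the values are guesses you would need to tune:

    // Minimal sketch (values are guesses, tune for your cluster):
    import org.apache.spark.SparkConf
    val conf = new SparkConf()
      .set("spark.executor.memory", "8g") // bigger heap per executor
      .set("spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+UseConcMarkSweepGC") // log GC activity, try CMS
      .set("spark.storage.memoryFraction", "0.4") // leave more heap for shuffle aggregation

Lowering spark.storage.memoryFraction only helps if the job is shuffle-heavy rather than cache-heavy.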

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-29 Thread Earthson
It's really strange that the CPU load is so high while both disk and network I/O load are so low. CLUSTER BY is just something similar to groupBy, so why does it need so much CPU?
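
One plausible explanation (not verified against your query): in HiveQL, CLUSTER BY x is shorthand for DISTRIBUTE BY x SORT BY x, so unlike a plain groupBy every reduce task also sorts its entire partition, and sorting plus shuffle (de)serialization is CPU-bound work even while disk and network sit idle. A small illustration, with a made-up column name:

    // CLUSTER BY is equivalent to the more explicit form below
    // (mzkey is a hypothetical column, used only for illustration):
    hql("select * from tmp_mzlog distribute by mzkey sort by mzkey")
    // same result as:
    hql("select * from tmp_mzlog cluster by mzkey")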

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-28 Thread Earthson
"spark.MapOutputTrackerMasterActor: Asked to send map output locations for shuffle 0 to" takes too much time, what should I do? What is the correct configuration? blockManager timeout if I using a small number of reduce partition.

Re: [Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-28 Thread Zongheng Yang
The optimal config depends on lots of things, but did you try a smaller numPartitions value? Just guessing -- 160 or 320 may be reasonable.

On Mon, Jul 28, 2014 at 1:52 AM, Earthson wrote:
> I'm using SparkSQL with Hive 0.13, here is the SQL for inserting a partition
> with 2048 buckets.
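
Concretely, that is the same knob you set in your snippet, just with a smaller value:

    // same setting as in the original post, smaller value to try:
    sqlsc.set("spark.sql.shuffle.partitions", "320")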

[Spark 1.0.1][SparkSQL] reduce stage of shuffle is slow.

2014-07-28 Thread Earthson
I'm using SparkSQL with Hive 0.13. Here is the SQL for inserting a partition with 2048 buckets:

    sqlsc.set("spark.sql.shuffle.partitions", "2048")
    hql("""|insert %s table mz_log
           |PARTITION (date='%s')
           |select * from tmp_mzlog
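
The quoted statement is truncated above; purely for shape, a hypothetical completed version might look like this (the overwrite keyword, the date literal, and the mzkey column are illustrative guesses, not the original query):

    // Hypothetical completion, for shape only (mzkey and the literals are made up):
    hql("""|insert overwrite table mz_log
           |PARTITION (date='2014-07-28')
           |select * from tmp_mzlog
           |cluster by mzkey""".stripMargin)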