[jira] [Created] (SPARK-5951) Remove unreachable driver memory properties in yarn client mode (YarnClientSchedulerBackend)

2015-02-23 Thread Shekhar Bansal (JIRA)
Shekhar Bansal created SPARK-5951:
-

 Summary: Remove unreachable driver memory properties in yarn 
client mode (YarnClientSchedulerBackend)
 Key: SPARK-5951
 URL: https://issues.apache.org/jira/browse/SPARK-5951
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
 Environment: yarn
Reporter: Shekhar Bansal
Priority: Trivial
 Fix For: 1.3.0


In SPARK-4730 a deprecation warning was added,
and in SPARK-1953 the driver memory configs were removed in yarn-client mode.

During that integration, spark.master.memory and SPARK_MASTER_MEMORY were not removed.
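
For context, a rough sketch of the kind of option mapping YarnClientSchedulerBackend builds its ClientArguments from (the tuples below are illustrative, not a verbatim copy of the source). Since SPARK-1953 removed the driver-memory handling for yarn-client mode, the spark.master.memory / SPARK_MASTER_MEMORY row can never take effect and can simply be dropped:

{code}
// Illustrative sketch only, not the exact Spark source: YarnClientSchedulerBackend
// maps (CLI option, environment variable, Spark property) tuples onto AM arguments.
val optionTuples = List(
  ("--driver-memory", "SPARK_MASTER_MEMORY", "spark.master.memory"),        // leftover; unreachable since SPARK-1953
  ("--executor-memory", "SPARK_WORKER_MEMORY", "spark.executor.memory"),
  ("--num-executors", "SPARK_WORKER_INSTANCES", "spark.executor.instances")
)
// SPARK-5951 proposes deleting the unreachable spark.master.memory /
// SPARK_MASTER_MEMORY entry.
{code}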






[jira] [Updated] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shekhar Bansal updated SPARK-5861:
--
Description: 
I am using
 {code}spark.driver.memory=6g{code}
which creates an application master of 7g 
(yarn.scheduler.minimum-allocation-mb=1024), which is a waste of resources.

  was:I am using {code}spark.driver.memory=6g{code}, which creates an application 
master of 7g (yarn.scheduler.minimum-allocation-mb=1024), which is a waste of 
resources.
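
To make the 6g-to-7g jump concrete, here is a back-of-the-envelope calculation. The overhead factor (7%) and minimum overhead (384 MB) are assumed from the YARN client code of that era, so treat the exact constants as assumptions rather than a quote from the source:

{code}
// Rough sketch of how a 6g driver-memory request turns into a 7g AM container.
val amMemoryMb     = 6 * 1024                                  // spark.driver.memory = 6g
val overheadMb     = math.max((0.07 * amMemoryMb).toInt, 384)  // = 430 (assumed constants)
val requestedMb    = amMemoryMb + overheadMb                   // = 6574
val yarnMinAllocMb = 1024                                      // yarn.scheduler.minimum-allocation-mb
val grantedMb      = math.ceil(requestedMb.toDouble / yarnMinAllocMb).toInt * yarnMinAllocMb
println(s"YARN grants $grantedMb MB")                          // 7168 MB = 7g
{code}

In yarn-client mode the driver itself runs in the submitting JVM, so almost all of that 7g container sits idle.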


 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024), which is a waste of resources.






[jira] [Updated] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shekhar Bansal updated SPARK-5861:
--
Description: 
I am using
 {code}spark.driver.memory=6g{code}
which creates an application master of 7g 
(yarn.scheduler.minimum-allocation-mb=1024),
which is a waste of resources.

  was:
I am using
 {code}spark.driver.memory=6g{code}
which creates an application master of 7g 
(yarn.scheduler.minimum-allocation-mb=1024), which is a waste of resources.


 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024),
 which is a waste of resources.






[jira] [Comment Edited] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324056#comment-14324056
 ] 

Shekhar Bansal edited comment on SPARK-5861 at 2/17/15 11:14 AM:
-

Thanks for the quick reply.
I know all this.

I meant yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.


was (Author: sb58):
Thanks for the quick reply.
I know all this.

I mean yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.
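
The point of the comment above is that ClientArguments sizes the AM from the driver memory regardless of deploy mode. A minimal sketch of the missing distinction, with hypothetical names (later Spark versions addressed this with a dedicated spark.yarn.am.memory setting for client mode):

{code}
// Illustrative sketch, not the actual ClientArguments code: the deploy-mode
// check the comment says is missing. In cluster mode the AM hosts the driver,
// so driver memory is the right size; in client mode the driver runs in the
// submitting JVM and a small AM is enough.
def amMemoryMb(isClusterMode: Boolean, driverMemoryMb: Int): Int =
  if (isClusterMode) driverMemoryMb   // yarn-cluster: AM hosts the driver
  else 512                            // yarn-client: hypothetical small default
{code}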

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal

 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024).
 The application master doesn't need 7g in yarn-client mode.






[jira] [Commented] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324069#comment-14324069
 ] 

Shekhar Bansal commented on SPARK-5861:
---

I am submitting my job using spark-submit.

Reproducible with:
spark-submit --master yarn-client --driver-memory 6g --class 
org.apache.spark.examples.SparkPi spark-examples-1.2.1-hadoop2.4.0.jar

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal

 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024).
 The application master doesn't need 7g in yarn-client mode.






[jira] [Issue Comment Deleted] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shekhar Bansal updated SPARK-5861:
--
Comment: was deleted

(was: I am submitting my job using spark-submit.

Reproducible with:
spark-submit --master yarn-client --driver-memory 6g --class 
org.apache.spark.examples.SparkPi spark-examples-1.2.1-hadoop2.4.0.jar)

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal

 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024).
 The application master doesn't need 7g in yarn-client mode.






[jira] [Issue Comment Deleted] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shekhar Bansal updated SPARK-5861:
--
Comment: was deleted

(was: Thanks for the quick reply.
I know all this.

I mean yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.)

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024),
 which is a waste of resources.






[jira] [Commented] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324055#comment-14324055
 ] 

Shekhar Bansal commented on SPARK-5861:
---

Thanks for the quick reply.
I know all this.

I mean yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024),
 which is a waste of resources.






[jira] [Commented] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324053#comment-14324053
 ] 

Shekhar Bansal commented on SPARK-5861:
---

Thanks for the quick reply.
I know all this.

I mean yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024),
 which is a waste of resources.






[jira] [Issue Comment Deleted] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shekhar Bansal updated SPARK-5861:
--
Comment: was deleted

(was: Thanks for the quick reply.
I know all this.

I mean yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.)

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024),
 which is a waste of resources.






[jira] [Commented] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324056#comment-14324056
 ] 

Shekhar Bansal commented on SPARK-5861:
---

Thanks for the quick reply.
I know all this.

I mean yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024),
 which is a waste of resources.






[jira] [Issue Comment Deleted] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shekhar Bansal updated SPARK-5861:
--
Comment: was deleted

(was: Thanks for the quick reply.
I know all this.

I mean yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.)

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024),
 which is a waste of resources.






[jira] [Commented] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324054#comment-14324054
 ] 

Shekhar Bansal commented on SPARK-5861:
---

Thanks for the quick reply.
I know all this.

I mean yarn-client mode only.

In org.apache.spark.deploy.yarn.ClientArguments:
amMemory = driver memory (from --driver-memory)
amMemoryOverhead = sparkConf.getInt("spark.yarn.driver.memoryOverhead",
  math.max((MEMORY_OVERHEAD_FACTOR * amMemory).toInt, MEMORY_OVERHEAD_MIN))

There is no check on spark.master (deploy mode).

In the above case, I think we are wasting ~5g of memory.

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024),
 which is a waste of resources.






[jira] [Created] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)
Shekhar Bansal created SPARK-5861:
-

 Summary: [yarn-client mode] Application master should not use 
memory = spark.driver.memory
 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


I am using {code}spark.driver.memory=6g{code}, which creates an application master 
of 7g (yarn.scheduler.minimum-allocation-mb=1024), which is a waste of resources.






[jira] [Commented] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14324070#comment-14324070
 ] 

Shekhar Bansal commented on SPARK-5861:
---

I am submitting my job using spark-submit.

Reproducible with:
spark-submit --master yarn-client --driver-memory 6g --class 
org.apache.spark.examples.SparkPi spark-examples-1.2.1-hadoop2.4.0.jar

 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal

 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024).
 The application master doesn't need 7g in yarn-client mode.






[jira] [Updated] (SPARK-5861) [yarn-client mode] Application master should not use memory = spark.driver.memory

2015-02-17 Thread Shekhar Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shekhar Bansal updated SPARK-5861:
--
Description: 
I am using
 {code}spark.driver.memory=6g{code}
which creates an application master of 7g 
(yarn.scheduler.minimum-allocation-mb=1024).

The application master doesn't need 7g in yarn-client mode.

  was:
I am using
 {code}spark.driver.memory=6g{code}
which creates an application master of 7g 
(yarn.scheduler.minimum-allocation-mb=1024),
which is a waste of resources.


 [yarn-client mode] Application master should not use memory = 
 spark.driver.memory
 -

 Key: SPARK-5861
 URL: https://issues.apache.org/jira/browse/SPARK-5861
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.1
Reporter: Shekhar Bansal
 Fix For: 1.3.0, 1.2.2


 I am using
  {code}spark.driver.memory=6g{code}
 which creates an application master of 7g 
 (yarn.scheduler.minimum-allocation-mb=1024).
 The application master doesn't need 7g in yarn-client mode.






[jira] [Commented] (SPARK-5081) Shuffle write increases

2015-02-05 Thread Shekhar Bansal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14308596#comment-14308596
 ] 

Shekhar Bansal commented on SPARK-5081:
---

I faced the same problem; moving to lz4 compression did the trick for me.
Try spark.io.compression.codec=lz4
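
For anyone who wants to try this, the codec can be set on the SparkConf or at submit time; a minimal sketch (the app name is arbitrary):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Switch the block/shuffle compression codec to lz4.
val conf = new SparkConf()
  .setAppName("lz4-codec-example")
  .set("spark.io.compression.codec", "lz4")
val sc = new SparkContext(conf)
{code}

or equivalently at submit time: spark-submit --conf spark.io.compression.codec=lz4 ...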

 Shuffle write increases
 ---

 Key: SPARK-5081
 URL: https://issues.apache.org/jira/browse/SPARK-5081
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.2.0
Reporter: Kevin Jung

 The size of shuffle write shown in the Spark web UI is much different when I 
 execute the same Spark job with the same input data in Spark 1.1 and Spark 1.2. 
 At the sortBy stage, the size of shuffle write is 98.1MB in Spark 1.1 but 146.9MB 
 in Spark 1.2. 
 I set the spark.shuffle.manager option to hash because its default value 
 changed, but Spark 1.2 still writes more shuffle output than Spark 1.1.
 This can sharply increase disk I/O overhead as the input file gets bigger 
 and causes the jobs to take more time to complete. 
 In the case of about 100GB of input, for example, the size of shuffle write is 
 39.7GB in Spark 1.1 but 91.0GB in Spark 1.2.
 spark 1.1
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |9|saveAsTextFile| |1169.4KB| |
 |12|combineByKey| |1265.4KB|1275.0KB|
 |6|sortByKey| |1276.5KB| |
 |8|mapPartitions| |91.0MB|1383.1KB|
 |4|apply| |89.4MB| |
 |5|sortBy|155.6MB| |98.1MB|
 |3|sortBy|155.6MB| | |
 |1|collect| |2.1MB| |
 |2|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |
 spark 1.2
 ||Stage Id||Description||Input||Shuffle Read||Shuffle Write||
 |12|saveAsTextFile| |1170.2KB| |
 |11|combineByKey| |1264.5KB|1275.0KB|
 |8|sortByKey| |1273.6KB| |
 |7|mapPartitions| |134.5MB|1383.1KB|
 |5|zipWithIndex| |132.5MB| |
 |4|sortBy|155.6MB| |146.9MB|
 |3|sortBy|155.6MB| | |
 |2|collect| |2.0MB| |
 |1|mapValues|155.6MB| |2.2MB|
 |0|first|184.4KB| | |






[jira] [Created] (SPARK-4968) [SparkSQL] java.lang.UnsupportedOperationException when hive partition doesn't exist and order by and limit are used

2014-12-25 Thread Shekhar Bansal (JIRA)
Shekhar Bansal created SPARK-4968:
-

 Summary: [SparkSQL] java.lang.UnsupportedOperationException when 
hive partition doesn't exist and order by and limit are used
 Key: SPARK-4968
 URL: https://issues.apache.org/jira/browse/SPARK-4968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1
 Environment: Spark 1.1.1
hive metastore db - pgsql
OS- Linux
Reporter: Shekhar Bansal
 Fix For: 1.1.2, 1.2.1, 1.1.1


Create a table with partitions, then run a query that targets a partition which 
doesn't exist and uses ORDER BY and LIMIT.

I am running the queries in HiveContext.

1. Create hive table
create table if not exists testTable (ID1 BIGINT, ID2 BIGINT, Start_Time STRING, 
End_Time STRING) PARTITIONED BY (Region STRING, Market STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;


2. Create data
1,2,2014-11-01,2014-11-02
2,3,2014-11-01,2014-11-02
3,4,2014-11-01,2014-11-02

3. Load data in hive
LOAD DATA LOCAL INPATH '/tmp/input.txt' OVERWRITE INTO TABLE testTable 
PARTITION (Region='North', market='market1');

4. run query
SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 
100;


Error trace
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:863)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136)
at 
org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:171)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)
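
A compact way to hit the same failure from a Spark shell, assuming the table above already exists (sketch only; the HiveContext construction is the usual Spark 1.1-era pattern):

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)  // sc is the shell's SparkContext

// A partition with no data, combined with ORDER BY + LIMIT, ends up in
// RDD.takeOrdered on an empty RDD, which calls reduce on an empty collection.
hiveContext.sql(
  "SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 100"
).collect()
// -> java.lang.UnsupportedOperationException: empty collection
{code}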






[jira] [Updated] (SPARK-4968) [SparkSQL] java.lang.UnsupportedOperationException when hive partition doesn't exist and order by and limit are used

2014-12-25 Thread Shekhar Bansal (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shekhar Bansal updated SPARK-4968:
--
Environment: 
Spark 1.1.1
scala - 2.10.2
hive metastore db - pgsql
OS- Linux

  was:
Spark 1.1.1
hive metastore db - pgsql
OS- Linux


 [SparkSQL] java.lang.UnsupportedOperationException when hive partition 
 doesn't exist and order by and limit are used
 

 Key: SPARK-4968
 URL: https://issues.apache.org/jira/browse/SPARK-4968
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.1
 Environment: Spark 1.1.1
 scala - 2.10.2
 hive metastore db - pgsql
 OS- Linux
Reporter: Shekhar Bansal
 Fix For: 1.1.1, 1.1.2, 1.2.1


 Create a table with partitions, then run a query that targets a partition which 
 doesn't exist and uses ORDER BY and LIMIT.
 I am running the queries in HiveContext.
 1. Create hive table
 create table if not exists testTable (ID1 BIGINT, ID2 BIGINT, Start_Time 
 STRING, End_Time STRING) PARTITIONED BY (Region STRING, Market STRING)
 ROW FORMAT DELIMITED
 FIELDS TERMINATED BY ','
 LINES TERMINATED BY '\n'
 STORED AS TEXTFILE;
 2. Create data
 1,2,2014-11-01,2014-11-02
 2,3,2014-11-01,2014-11-02
 3,4,2014-11-01,2014-11-02
 3. Load data in hive
 LOAD DATA LOCAL INPATH '/tmp/input.txt' OVERWRITE INTO TABLE testTable 
 PARTITION (Region='North', market='market1');
 4. run query
 SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 
 100;
 Error trace
 java.lang.UnsupportedOperationException: empty collection
   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
   at scala.Option.getOrElse(Option.scala:120)
   at org.apache.spark.rdd.RDD.reduce(RDD.scala:863)
   at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136)
   at 
 org.apache.spark.sql.execution.TakeOrdered.executeCollect(basicOperators.scala:171)
   at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:438)


