[jira] [Updated] (SPARK-3862) MultiWayBroadcastInnerHashJoin

2015-07-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3862:
---
Target Version/s: 1.6.0  (was: 1.5.0)

 MultiWayBroadcastInnerHashJoin
 --

 Key: SPARK-3862
 URL: https://issues.apache.org/jira/browse/SPARK-3862
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 It is common to have a single fact table inner join many small dimension 
 tables.  We can exploit this fact and create a MultiWayBroadcastInnerHashJoin 
 (or maybe just MultiwayDimensionJoin) operator that optimizes for this 
 pattern.






[jira] [Created] (SPARK-9437) SizeEstimator overflows for primitive arrays

2015-07-29 Thread Imran Rashid (JIRA)
Imran Rashid created SPARK-9437:
---

 Summary: SizeEstimator overflows for primitive arrays
 Key: SPARK-9437
 URL: https://issues.apache.org/jira/browse/SPARK-9437
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Minor


{{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if you 
have an {{Array[Double]}} of size 1 << 28.  This means that when you try to 
broadcast a large primitive array, you get:

{noformat}
java.lang.IllegalArgumentException: requirement failed: sizeInBytes was 
negative: -2147483608
   at scala.Predef$.require(Predef.scala:233)
   at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
   at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815)
   at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
...
{noformat}
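
A minimal sketch of the arithmetic behind this bug (illustration only, not Spark's actual SizeEstimator code; the header-size constant is an assumption): sizing an Array[Double] of 1 << 28 elements in Int arithmetic wraps to a negative number, while widening to Long gives the intended value.

{code}
object SizeOverflowDemo {
  val DoubleSize = 8   // bytes per Array[Double] element
  val HeaderSize = 16  // assumed JVM array header size, for illustration

  // Buggy variant: the multiplication happens in Int and overflows.
  def sizeAsInt(length: Int): Int = HeaderSize + length * DoubleSize

  // Safe variant: widen to Long before multiplying.
  def sizeAsLong(length: Int): Long = HeaderSize + length.toLong * DoubleSize

  def main(args: Array[String]): Unit = {
    val n = 1 << 28
    println(sizeAsInt(n))   // negative (the Int wrapped), e.g. -2147483632
    println(sizeAsLong(n))  // 2147483664, the intended size in bytes
  }
}
{code}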






[jira] [Assigned] (SPARK-9437) SizeEstimator overflows for primitive arrays

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9437:
---

Assignee: Imran Rashid  (was: Apache Spark)

 SizeEstimator overflows for primitive arrays
 

 Key: SPARK-9437
 URL: https://issues.apache.org/jira/browse/SPARK-9437
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Minor

 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if 
 you have an {{Array[Double]}} of size 1 << 28.  This means that when you try 
 to broadcast a large primitive array, you get:
 {noformat}
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was 
 negative: -2147483608
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815)
at 
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
 ...
 {noformat}






[jira] [Assigned] (SPARK-9437) SizeEstimator overflows for primitive arrays

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9437:
---

Assignee: Apache Spark  (was: Imran Rashid)

 SizeEstimator overflows for primitive arrays
 

 Key: SPARK-9437
 URL: https://issues.apache.org/jira/browse/SPARK-9437
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Imran Rashid
Assignee: Apache Spark
Priority: Minor

 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if 
 you have an {{Array[Double]}} of size 1 << 28.  This means that when you try 
 to broadcast a large primitive array, you get:
 {noformat}
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was 
 negative: -2147483608
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815)
at 
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
 ...
 {noformat}






[jira] [Commented] (SPARK-9437) SizeEstimator overflows for primitive arrays

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646342#comment-14646342
 ] 

Apache Spark commented on SPARK-9437:
-

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/7750

 SizeEstimator overflows for primitive arrays
 

 Key: SPARK-9437
 URL: https://issues.apache.org/jira/browse/SPARK-9437
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Minor

 {{SizeEstimator}} can overflow when dealing with large primitive arrays, e.g. if 
 you have an {{Array[Double]}} of size 1 << 28.  This means that when you try 
 to broadcast a large primitive array, you get:
 {noformat}
 java.lang.IllegalArgumentException: requirement failed: sizeInBytes was 
 negative: -2147483608
at scala.Predef$.require(Predef.scala:233)
at org.apache.spark.storage.BlockInfo.markReady(BlockInfo.scala:55)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:815)
at 
 org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:638)
 ...
 {noformat}






[jira] [Commented] (SPARK-3862) MultiWayBroadcastInnerHashJoin

2015-07-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646376#comment-14646376
 ] 

Reynold Xin commented on SPARK-3862:


I don't think that's extreme at all -- very plausible candidate for 1.6!


 MultiWayBroadcastInnerHashJoin
 --

 Key: SPARK-3862
 URL: https://issues.apache.org/jira/browse/SPARK-3862
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 It is common to have a single fact table inner join many small dimension 
 tables.  We can exploit this fact and create a MultiWayBroadcastInnerHashJoin 
 (or maybe just MultiwayDimensionJoin) operator that optimizes for this 
 pattern.






[jira] [Assigned] (SPARK-9404) UnsafeArrayData

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9404:
---

Assignee: Wenchen Fan  (was: Apache Spark)

 UnsafeArrayData
 ---

 Key: SPARK-9404
 URL: https://issues.apache.org/jira/browse/SPARK-9404
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan

 An Unsafe-based ArrayData implementation. To begin with, we can encode data 
 this way:
 the first 4 bytes are the number of elements;
 then each 4-byte slot is the start offset of an element, unless it is negative, 
 in which case the element is null;
 followed by the elements themselves.
 For example, [10, 11, 12, 13, null, 14] should internally be represented as 
 (each slot 4 bytes):
 5, 28, 32, 36, 40, -44, 48, 10, 11, 12, 13, 0, 14
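
A small Scala sketch of the layout described above (a simplified illustration only, not Spark's actual UnsafeArrayData; the Option-based input and the ByteBuffer encoding are assumptions): one count word, one offset word per element with a negative offset marking null, then the element values.

{code}
import java.nio.ByteBuffer

// Sketch of the layout described above, specialized to Int elements.
def encode(values: Array[Option[Int]]): Array[Byte] = {
  val n = values.length
  val headerWords = 1 + n                       // count word + one offset word per element
  val buf = ByteBuffer.allocate(4 * (headerWords + n))
  buf.putInt(n)                                 // number of elements
  values.zipWithIndex.foreach { case (v, i) =>
    val offset = 4 * (headerWords + i)          // byte offset where element i's value lives
    buf.putInt(if (v.isDefined) offset else -offset)  // negative offset marks null
  }
  values.foreach(v => buf.putInt(v.getOrElse(0)))     // null slots hold a placeholder 0
  buf.array()
}

// encode(Array(Some(10), Some(11), Some(12), Some(13), None, Some(14)))
// yields offsets 28, 32, 36, 40, -44, 48, matching the example above.
{code}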






[jira] [Commented] (SPARK-9404) UnsafeArrayData

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646444#comment-14646444
 ] 

Apache Spark commented on SPARK-9404:
-

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/7752

 UnsafeArrayData
 ---

 Key: SPARK-9404
 URL: https://issues.apache.org/jira/browse/SPARK-9404
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Wenchen Fan

 An Unsafe-based ArrayData implementation. To begin with, we can encode data 
 this way:
 the first 4 bytes are the number of elements;
 then each 4-byte slot is the start offset of an element, unless it is negative, 
 in which case the element is null;
 followed by the elements themselves.
 For example, [10, 11, 12, 13, null, 14] should internally be represented as 
 (each slot 4 bytes):
 5, 28, 32, 36, 40, -44, 48, 10, 11, 12, 13, 0, 14






[jira] [Commented] (SPARK-9312) The OneVsRest model does not provide confidence factor(not probability) along with the prediction

2015-07-29 Thread Badari Madhav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646277#comment-14646277
 ] 

Badari Madhav commented on SPARK-9312:
--

Updated JIRA and PR, please review.

 The OneVsRest model does not provide confidence factor(not probability) along 
 with the prediction
 -

 Key: SPARK-9312
 URL: https://issues.apache.org/jira/browse/SPARK-9312
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0, 1.4.1
Reporter: Badari Madhav
  Labels: features
   Original Estimate: 72h
  Remaining Estimate: 72h








[jira] [Commented] (SPARK-9416) Yarn logs say that Spark Python job has succeeded even though job has failed in Yarn cluster mode

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646365#comment-14646365
 ] 

Apache Spark commented on SPARK-9416:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7751

 Yarn logs say that Spark Python job has succeeded even though job has failed 
 in Yarn cluster mode
 -

 Key: SPARK-9416
 URL: https://issues.apache.org/jira/browse/SPARK-9416
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.4.1
 Environment: 3.13.0-53-generic #89-Ubuntu SMP Wed May 20 10:34:39 UTC 
 2015 x86_64 x86_64 x86_64 GNU/Linux
Reporter: Elkhan Dadashov

 While running the Spark word count Python example with an intentional mistake in 
 Yarn cluster mode, the Spark terminal logs (Yarn logs) state the final status as 
 SUCCEEDED, but the log files for the Spark application state the correct result, 
 indicating that the job failed.
 The terminal log output and the application log output contradict each other.
 If I run the same job in local mode, the terminal logs and application logs 
 match: both state that the job has failed due to the expected error in the Python 
 script.
 More details (scenario):
 While running the Spark word count Python example in Yarn cluster mode, I make an 
 intentional error in wordcount.py by changing this line (I'm using Spark 
 1.4.1, but this problem exists in Spark 1.4.0 and 1.3.0 as well, which I 
 tested):
 lines = sc.textFile(sys.argv[1], 1)
 into this line:
 lines = sc.textFile(nonExistentVariable,1)
 where the nonExistentVariable variable was never created or initialized.
 Then I run that example with this command (I put README.md into HDFS before 
 running it): 
 ./bin/spark-submit --master yarn-cluster wordcount.py /README.md
 The job runs and finishes successfully according to the log printed in the 
 terminal:
 Terminal logs:
 ...
 15/07/23 16:19:17 INFO yarn.Client: Application report for 
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:18 INFO yarn.Client: Application report for 
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:19 INFO yarn.Client: Application report for 
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:20 INFO yarn.Client: Application report for 
 application_1437612288327_0013 (state: RUNNING)
 15/07/23 16:19:21 INFO yarn.Client: Application report for 
 application_1437612288327_0013 (state: FINISHED)
 15/07/23 16:19:21 INFO yarn.Client: 
client token: N/A
diagnostics: Shutdown hook called before final status was reported.
ApplicationMaster host: 10.0.53.59
ApplicationMaster RPC port: 0
queue: default
start time: 1437693551439
final status: SUCCEEDED
tracking URL: 
 http://localhost:8088/proxy/application_1437612288327_0013/history/application_1437612288327_0013/1
user: edadashov
 15/07/23 16:19:21 INFO util.Utils: Shutdown hook called
 15/07/23 16:19:21 INFO util.Utils: Deleting directory 
 /tmp/spark-eba0a1b5-a216-4afa-9c54-a3cb67b16444
 But if you look at the log files generated for this application in HDFS, they 
 indicate failure of the job with the correct reason:
 Application log files:
 ...
 stdout: Traceback (most recent call last):
   File "wordcount.py", line 32, in <module>
     lines = sc.textFile(nonExistentVariable,1)
 NameError: name 'nonExistentVariable' is not defined
 The terminal output (Yarn logs), final status: SUCCEEDED, does not match the 
 application log results, which show failure of the job (NameError: name 
 'nonExistentVariable' is not defined). 






[jira] [Commented] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646419#comment-14646419
 ] 

Dmitry Goldenberg commented on SPARK-9434:
--

Thanks Cody and Sean. I'll take a look at the process of submitting changes.

 Need how-to for resuming direct Kafka streaming consumers where they had left 
 off before getting terminated, OR actual support for that mode in the 
 Streaming API
 -

 Key: SPARK-9434
 URL: https://issues.apache.org/jira/browse/SPARK-9434
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Examples, Streaming
Affects Versions: 1.4.1
Reporter: Dmitry Goldenberg

 We've been getting some mixed information regarding how to cause our direct 
 streaming consumers to resume processing from where they left off in terms of 
 the Kafka offsets.
 On the one hand, we're hearing "If you are restarting the streaming app 
 with Direct kafka from the checkpoint information (that is, restarting), then 
 the last read offsets are automatically recovered, and the data will start 
 processing from that offset. All the N records added in T will stay buffered 
 in Kafka." (where T is the interval of time during which the consumer was 
 down).
 On the other hand, there are tickets such as SPARK-6249 and SPARK-8833 which 
 are marked as "won't fix" which seem to ask for the functionality we need, 
 with comments like "I don't want to add more config options with confusing 
 semantics around what is being used for the system of record for offsets, I'd 
 rather make it easy for people to explicitly do what they need."
 The use case is actually very clear and doesn't ask for confusing semantics. 
 An API option to resume reading where you left off, in addition to the 
 "smallest" or "greatest" auto.offset.reset values, should be *very* useful, probably for 
 quite a few folks.
 We're asking for this as an enhancement request. SPARK-8833 states "I am 
 waiting for getting enough usecase to float in before I take a final call." 
 We're adding to that.
 In the meantime, can you clarify the confusion?  Does direct streaming 
 persist the progress information into DStream checkpoints or does it not?  
 If it does, why is it that we're not seeing that happen? Our consumers start 
 with auto.offset.reset=greatest and that causes them to read from the first 
 offset of data that is written to Kafka *after* the consumer has been 
 restarted, meaning we're missing data that had come in while the consumer was 
 down.
 If the progress is stored in DStream checkpoints, we want to know a) how to 
 cause that to work for us and b) where the said checkpointing data is stored 
 physically.
 Conversely, if this is not accurate, then is our only choice to manually 
 persist the offsets into Zookeeper? If that is the case then a) we'd like a 
 clear, more complete code sample to be published, since the one in the Kafka 
 streaming guide is incomplete (it lacks the actual lines of code persisting 
 the offsets) and b) we'd like to request that SPARK-8833 be revisited as a 
 feature worth implementing in the API.
 Thanks.
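
For reference, here is a minimal sketch of the checkpoint-based recovery being discussed, assuming the Spark 1.4-era direct Kafka API (StreamingContext.getOrCreate plus KafkaUtils.createDirectStream); the checkpoint directory, broker list, topic name and batch interval are placeholders, not values from this ticket:

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafkaWithCheckpoint {
  val checkpointDir = "hdfs:///tmp/stream-checkpoint"  // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("direct-kafka-checkpoint-demo")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)  // the DStream graph, including Kafka offsets, is saved here
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))
    stream.map(_._2).count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate reloads the checkpoint (and its offsets) instead of
    // building a fresh context, so processing resumes where it left off.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
{code}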






[jira] [Resolved] (SPARK-9434) Need how-to for resuming direct Kafka streaming consumers where they had left off before getting terminated, OR actual support for that mode in the Streaming API

2015-07-29 Thread Dmitry Goldenberg (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Goldenberg resolved SPARK-9434.
--
Resolution: Not A Problem

Documentation on checkpointing is available in the Spark doc set.

 Need how-to for resuming direct Kafka streaming consumers where they had left 
 off before getting terminated, OR actual support for that mode in the 
 Streaming API
 -

 Key: SPARK-9434
 URL: https://issues.apache.org/jira/browse/SPARK-9434
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Examples, Streaming
Affects Versions: 1.4.1
Reporter: Dmitry Goldenberg

 We've been getting some mixed information regarding how to cause our direct 
 streaming consumers to resume processing from where they left off in terms of 
 the Kafka offsets.
 On the one hand, we're hearing "If you are restarting the streaming app 
 with Direct kafka from the checkpoint information (that is, restarting), then 
 the last read offsets are automatically recovered, and the data will start 
 processing from that offset. All the N records added in T will stay buffered 
 in Kafka." (where T is the interval of time during which the consumer was 
 down).
 On the other hand, there are tickets such as SPARK-6249 and SPARK-8833 which 
 are marked as "won't fix" which seem to ask for the functionality we need, 
 with comments like "I don't want to add more config options with confusing 
 semantics around what is being used for the system of record for offsets, I'd 
 rather make it easy for people to explicitly do what they need."
 The use case is actually very clear and doesn't ask for confusing semantics. 
 An API option to resume reading where you left off, in addition to the 
 "smallest" or "greatest" auto.offset.reset values, should be *very* useful, probably for 
 quite a few folks.
 We're asking for this as an enhancement request. SPARK-8833 states "I am 
 waiting for getting enough usecase to float in before I take a final call." 
 We're adding to that.
 In the meantime, can you clarify the confusion?  Does direct streaming 
 persist the progress information into DStream checkpoints or does it not?  
 If it does, why is it that we're not seeing that happen? Our consumers start 
 with auto.offset.reset=greatest and that causes them to read from the first 
 offset of data that is written to Kafka *after* the consumer has been 
 restarted, meaning we're missing data that had come in while the consumer was 
 down.
 If the progress is stored in DStream checkpoints, we want to know a) how to 
 cause that to work for us and b) where the said checkpointing data is stored 
 physically.
 Conversely, if this is not accurate, then is our only choice to manually 
 persist the offsets into Zookeeper? If that is the case then a) we'd like a 
 clear, more complete code sample to be published, since the one in the Kafka 
 streaming guide is incomplete (it lacks the actual lines of code persisting 
 the offsets) and b) we'd like to request that SPARK-8833 be revisited as a 
 feature worth implementing in the API.
 Thanks.






[jira] [Assigned] (SPARK-9404) UnsafeArrayData

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9404:
---

Assignee: Apache Spark  (was: Wenchen Fan)

 UnsafeArrayData
 ---

 Key: SPARK-9404
 URL: https://issues.apache.org/jira/browse/SPARK-9404
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 An Unsafe-based ArrayData implementation. To begin with, we can encode data 
 this way:
 the first 4 bytes are the number of elements;
 then each 4-byte slot is the start offset of an element, unless it is negative, 
 in which case the element is null;
 followed by the elements themselves.
 For example, [10, 11, 12, 13, null, 14] should internally be represented as 
 (each slot 4 bytes):
 5, 28, 32, 36, 40, -44, 48, 10, 11, 12, 13, 0, 14






[jira] [Created] (SPARK-9438) restarting leader zookeeper causes spark master to die when the spark master election is assigned to zookeeper

2015-07-29 Thread Amir Rad (JIRA)
Amir Rad created SPARK-9438:
---

 Summary: restarting leader zookeeper causes spark master to die 
when the spark master election is assigned to zookeeper
 Key: SPARK-9438
 URL: https://issues.apache.org/jira/browse/SPARK-9438
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
 Environment: Spark 1.2.0 and Zookeeper version: 3.4.6-1569965
Reporter: Amir Rad


When Spark master election is assigned to ZooKeeper, restarting the leader 
ZooKeeper causes the Spark master to die. 

Steps to reproduce:
create a cluster of 3 spark nodes. 
set Spark-env to:
SPARK_LOCAL_DIRS=/home/sparkcde/data_spark/data
SPARK_MASTER_OPTS="-Dspark.deploy.spreadOut=false"
SPARK_WORKER_DIR=/home/sparkcde/data_spark/worker
SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true"
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=s1:2181,s2:2181,s3:2181"

Identify the Spark master.
Identify the ZooKeeper leader.
Stop the ZooKeeper leader.
Check the Spark master: it is dead.
Start the ZooKeeper leader.
Check the Spark master: still dead.

If you continue the same pattern of stopping and starting the ZooKeeper leader, 
you will eventually lose the whole Spark cluster.







[jira] [Updated] (SPARK-5259) Fix endless retry stage by add task equal() and hashcode() to avoid stage.pendingTasks not empty while stage map output is available

2015-07-29 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5259:

Target Version/s: 1.5.0
Priority: Blocker  (was: Major)

 Fix endless retry stage by add task equal() and hashcode() to avoid 
 stage.pendingTasks not empty while stage map output is available 
 -

 Key: SPARK-5259
 URL: https://issues.apache.org/jira/browse/SPARK-5259
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.1, 1.2.0
Reporter: SuYan
Priority: Blocker

 1. While a shuffle stage is retried, there may be 2 task sets running. 
 Call the 2 task sets taskSet0.0 and taskSet0.1; as we know, taskSet0.1 will 
 re-run taskSet0.0's incomplete tasks.
 If taskSet0.0 has run all the tasks that taskSet0.1 has not completed yet, but 
 those tasks cover all the partitions,
 then the stage's isAvailable is true:
 {code}
   def isAvailable: Boolean = {
     if (!isShuffleMap) {
       true
     } else {
       numAvailableOutputs == numPartitions
     }
   }
 {code}
 but stage.pendingTasks is not empty, which is meant to protect the registration 
 of mapStatus in mapOutputTracker.
 This happens because when a task completes successfully, pendingTasks removes the 
 Task by reference, since Task does not override hashCode() and equals():
 pendingTasks -= task
 but numAvailableOutputs is counted by partition ID.
 Here is a test case to prove it:
 {code}
   test("Make sure mapStage.pendingtasks is set() " +
     "while MapStage.isAvailable is true while stage was retry") {
     val firstRDD = new MyRDD(sc, 6, Nil)
     val firstShuffleDep = new ShuffleDependency(firstRDD, null)
     val firstShuffleId = firstShuffleDep.shuffleId
     val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
     val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
     val shuffleId = shuffleDep.shuffleId
     val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
     submit(reduceRdd, Array(0, 1))
     complete(taskSets(0), Seq(
       (Success, makeMapStatus("hostB", 1)),
       (Success, makeMapStatus("hostB", 2)),
       (Success, makeMapStatus("hostC", 3)),
       (Success, makeMapStatus("hostB", 4)),
       (Success, makeMapStatus("hostB", 5)),
       (Success, makeMapStatus("hostC", 6))
     ))
     complete(taskSets(1), Seq(
       (Success, makeMapStatus("hostA", 1)),
       (Success, makeMapStatus("hostB", 2)),
       (Success, makeMapStatus("hostA", 1)),
       (Success, makeMapStatus("hostB", 2)),
       (Success, makeMapStatus("hostA", 1))
     ))
     runEvent(ExecutorLost("exec-hostA"))
     runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
     runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
     runEvent(CompletionEvent(taskSets(1).tasks(0),
       FetchFailed(null, firstShuffleId, -1, 0, "Fetch Mata data failed"),
       null, null, null, null))
     scheduler.resubmitFailedStages()
     runEvent(CompletionEvent(taskSets(1).tasks(0), Success,
       makeMapStatus("hostC", 1), null, null, null))
     runEvent(CompletionEvent(taskSets(1).tasks(2), Success,
       makeMapStatus("hostC", 1), null, null, null))
     runEvent(CompletionEvent(taskSets(1).tasks(4), Success,
       makeMapStatus("hostC", 1), null, null, null))
     runEvent(CompletionEvent(taskSets(1).tasks(5), Success,
       makeMapStatus("hostB", 2), null, null, null))
     val stage = scheduler.stageIdToStage(taskSets(1).stageId)
     assert(stage.attemptId == 2)
     assert(stage.isAvailable)
     assert(stage.pendingTasks.size == 0)
   }
 {code}
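
To illustrate the reference-equality problem described above, here is a minimal sketch (a simplified stand-in class, not Spark's Task): without equals()/hashCode() overrides, a logically identical object cannot remove the original from a pending-task set.

{code}
import scala.collection.mutable

class FakeTask(val stageId: Int, val partitionId: Int)  // no equals/hashCode overrides

val pendingTasks = mutable.HashSet(new FakeTask(0, 3))
pendingTasks -= new FakeTask(0, 3)  // different reference, so nothing is removed
println(pendingTasks.size)          // 1: the "completed" task is still pending

// With structural equality (e.g. a case class), the removal works as intended.
case class FakeTask2(stageId: Int, partitionId: Int)
val pending2 = mutable.HashSet(FakeTask2(0, 3))
pending2 -= FakeTask2(0, 3)
println(pending2.size)              // 0
{code}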






[jira] [Updated] (SPARK-746) Automatically Use Avro Serialization for Avro Objects

2015-07-29 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-746:
---
Assignee: Joseph Batchik

 Automatically Use Avro Serialization for Avro Objects
 -

 Key: SPARK-746
 URL: https://issues.apache.org/jira/browse/SPARK-746
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Cogan
Assignee: Joseph Batchik

 All generated objects extend org.apache.avro.specific.SpecificRecordBase (or 
 there may be a higher up class as well).
 Since Avro records aren't Java-serializable by default, people currently have 
 to wrap their records. It would be good if we could use an implicit 
 conversion to do this for them.






[jira] [Commented] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests

2015-07-29 Thread Hunter Morgan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646301#comment-14646301
 ] 

Hunter Morgan commented on SPARK-5079:
--

Would it be feasible to restrict test dependencies so that only a single slf4j 
implementation is present, make that implementation 
http://projects.lidalia.org.uk/slf4j-test/, and look for log messages to 
determine job failure?

 Detect failed jobs / batches in Spark Streaming unit tests
 --

 Key: SPARK-5079
 URL: https://issues.apache.org/jira/browse/SPARK-5079
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Josh Rosen
Assignee: Ilya Ganelin

 Currently, it is possible to write Spark Streaming unit tests where Spark 
 jobs fail but the streaming tests succeed because we rely on wall-clock time 
 plus output comparison in order to check whether a test has passed, and 
 hence may miss cases where errors occurred if they didn't affect these 
 results.  We should strengthen the tests to check that no job failures 
 occurred while processing batches.
 See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for 
 additional context.
 The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might 
 also fix this.






[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-07-29 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5945:

Target Version/s: 1.5.0

 Spark should not retry a stage infinitely on a FetchFailedException
 ---

 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Ilya Ganelin
Priority: Blocker

 While investigating SPARK-5928, I noticed some very strange behavior in the 
 way spark retries stages after a FetchFailedException.  It seems that on a 
 FetchFailedException, instead of simply killing the task and retrying, Spark 
 aborts the stage and retries.  If it just retried the task, the task might 
 fail 4 times and then trigger the usual job killing mechanism.  But by 
 killing the stage instead, the max retry logic is skipped (it looks to me 
 like there is no limit for retries on a stage).
 After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
 fetch fails, we assume that the block manager we are fetching from has 
 failed, and that it will succeed if we retry the stage w/out that block 
 manager.  In that case, it wouldn't make any sense to retry the task, since 
 it's doomed to fail every time, so we might as well kill the whole stage.  But 
 this raises two questions:
 1) Is it really safe to assume that a FetchFailedException means that the 
 BlockManager has failed, and it will work if we just try another one?  
 SPARK-5928 shows that there are at least some cases where that assumption is 
 wrong.  Even if we fix that case, this logic seems brittle to the next case 
 we find.  I guess the idea is that this behavior is what gives us the "R" in 
 "RDD" ... but it seems like it's not really that robust and maybe should be 
 reconsidered.
 2) Should stages only be retried a limited number of times?  It would be 
 pretty easy to put in a limited number of retries per stage.  Though again, 
 we encounter issues with keeping things resilient.  Theoretically one stage 
 could have many retries caused by failures in different stages further 
 downstream, so we might need to track the cause of each retry as well to 
 still have the desired behavior.
 In general it just seems there is some flakiness in the retry logic.  This is 
 the only reproducible example I have at the moment, but I vaguely recall 
 hitting other cases of strange behavior w/ retries when trying to run long 
 pipelines.  Eg., if one executor is stuck in a GC during a fetch, the fetch 
 fails, but the executor eventually comes back and the stage gets retried 
 again, but the same GC issues happen the second time around, etc.
 Copied from SPARK-5928, here's the example program that can regularly produce 
 a loop of stage failures.  Note that it will only fail from a remote fetch, 
 so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
 --num-executors 2 --executor-memory 4000m}}
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   // need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}
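
As a rough sketch of the "limited number of retries per stage" idea in point (2) above (the names and the limit of 4 are illustrative assumptions, not Spark's implementation):

{code}
import scala.collection.mutable

val maxStageRetries = 4  // illustrative limit
val stageAttempts = mutable.Map.empty[Int, Int].withDefaultValue(0)

// Called when a FetchFailedException causes a stage to be resubmitted.
def onStageResubmit(stageId: Int)(resubmit: Int => Unit, abort: Int => Unit): Unit = {
  stageAttempts(stageId) += 1
  if (stageAttempts(stageId) > maxStageRetries) abort(stageId)  // give up, like a 4x-failed task
  else resubmit(stageId)
}
{code}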






[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646312#comment-14646312
 ] 

Meihua Wu commented on SPARK-9246:
--

Cool, I see. I will keep updating about the progress.

I have a question: is topDocumentsPerTopic exact or approximate (like 
describeTopics which, according to the ScalaDoc, "may not return exactly the 
top-weighted terms for each topic; to get a more precise set of top terms, 
increase maxTermsPerTopic")?

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.
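
A hedged usage sketch for the proposed method (assuming `ldaModel` is an already-trained DistributedLDAModel; this only shows how the return value above would be consumed):

{code}
// ldaModel: org.apache.spark.mllib.clustering.DistributedLDAModel, trained elsewhere
val topDocs: Array[(Array[Long], Array[Double])] = ldaModel.topDocumentsPerTopic(maxDocuments = 5)
topDocs.zipWithIndex.foreach { case ((docIds, weights), topic) =>
  val formatted = docIds.zip(weights).map { case (id, w) => s"$id ($w)" }.mkString(", ")
  println(s"topic $topic: $formatted")
}
{code}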






[jira] [Updated] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2015-07-29 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3181:
-
Shepherd: Xiangrui Meng

 Add Robust Regression Algorithm with Huber Estimator
 

 Key: SPARK-3181
 URL: https://issues.apache.org/jira/browse/SPARK-3181
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Fan Jiang
Assignee: Fan Jiang
  Labels: features
   Original Estimate: 0h
  Remaining Estimate: 0h

 Linear least squares estimates assume the errors have a normal distribution and 
 can behave badly when the errors are heavy-tailed. In practice we get 
 various types of data. We need to include robust regression to employ a 
 fitting criterion that is not as vulnerable as least squares.
 In 1973, Huber introduced M-estimation for regression, which stands for 
 "maximum likelihood type". The method is resistant to outliers in the 
 response variable and has been widely used.
 The new feature for MLlib will contain 3 new files
 /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
 /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
 /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
 and one new class HuberRobustGradient in 
 /main/scala/org/apache/spark/mllib/optimization/Gradient.scala
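
For reference, the standard Huber loss behind the estimator described above, as a small Scala sketch (the 1.345 tuning constant is the commonly used default, an assumption here rather than something specified in this ticket):

{code}
// Quadratic for small residuals, linear beyond delta, so large outliers
// contribute far less than they would under squared loss.
def huberLoss(residual: Double, delta: Double = 1.345): Double = {
  val r = math.abs(residual)
  if (r <= delta) 0.5 * r * r
  else delta * (r - 0.5 * delta)
}
{code}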






[jira] [Resolved] (SPARK-9127) Rand/Randn codegen fails with long seed

2015-07-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9127.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Rand/Randn codegen fails with long seed
 ---

 Key: SPARK-9127
 URL: https://issues.apache.org/jira/browse/SPARK-9127
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Reynold Xin
Priority: Blocker
 Fix For: 1.5.0


 {code}
 ERROR GenerateMutableProjection: failed to compile:
   public Object 
 generate(org.apache.spark.sql.catalyst.expressions.Expression[] expr) {
 return new SpecificProjection(expr);
   }
   class SpecificProjection extends 
 org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
 private org.apache.spark.sql.catalyst.expressions.Expression[] 
 expressions = null;
 private org.apache.spark.sql.catalyst.expressions.MutableRow 
 mutableRow = null;
 private org.apache.spark.util.random.XORShiftRandom rng4 = new 
 org.apache.spark.util.random.XORShiftRandom(-5419823303878592871 + 
 org.apache.spark.TaskContext.getPartitionId());
 public 
 SpecificProjection(org.apache.spark.sql.catalyst.expressions.Expression[] 
 expr) {
   expressions = expr;
   mutableRow = new 
 org.apache.spark.sql.catalyst.expressions.GenericMutableRow(2);
 }
 public 
 org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection 
 target(org.apache.spark.sql.catalyst.expressions.MutableRow row) {
   mutableRow = row;
   return this;
 }
 /* Provide immutable access to the last projected row. */
 public InternalRow currentValue() {
   return (InternalRow) mutableRow;
 }
 public Object apply(Object _i) {
   InternalRow i = (InternalRow) _i;
 boolean isNull0 = i.isNullAt(0);
 long primitive1 = isNull0 ?
 -1L : (i.getLong(0));
   if(isNull0)
 mutableRow.setNullAt(0);
   else
 mutableRow.setLong(0, primitive1);
   final double primitive3 = rng4.nextDouble();
   if(false)
 mutableRow.setNullAt(1);
   else
 mutableRow.setDouble(1, primitive3);
   return mutableRow;
 }
   }
 org.codehaus.commons.compiler.CompileException: Line 10, Column 117: Value of 
 decimal integer literal '5419823303878592871' is out of range
   at 
 org.codehaus.janino.UnitCompiler.compileException(UnitCompiler.java:10473)
   at 
 org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:4696)
   at org.codehaus.janino.UnitCompiler.access$9200(UnitCompiler.java:185)
   at 
 org.codehaus.janino.UnitCompiler$11.visitIntegerLiteral(UnitCompiler.java:4402)
   at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
   at 
 org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4427)
   at 
 org.codehaus.janino.UnitCompiler.getNegatedConstantValue2(UnitCompiler.java:4856)
   at 
 org.codehaus.janino.UnitCompiler.getNegatedConstantValue2(UnitCompiler.java:4890)
   at org.codehaus.janino.UnitCompiler.access$10400(UnitCompiler.java:185)
   at 
 org.codehaus.janino.UnitCompiler$12.visitIntegerLiteral(UnitCompiler.java:4823)
   at org.codehaus.janino.Java$IntegerLiteral.accept(Java.java:4321)
   at 
 org.codehaus.janino.UnitCompiler.getNegatedConstantValue(UnitCompiler.java:4848)
   at 
 org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:4451)
   at org.codehaus.janino.UnitCompiler.access$8800(UnitCompiler.java:185)
   at 
 org.codehaus.janino.UnitCompiler$11.visitUnaryOperation(UnitCompiler.java:4393)
   at org.codehaus.janino.Java$UnaryOperation.accept(Java.java:3647)
   at 
 org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4427)
   at 
 org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:4498)
   at org.codehaus.janino.UnitCompiler.access$8900(UnitCompiler.java:185)
   at 
 org.codehaus.janino.UnitCompiler$11.visitBinaryOperation(UnitCompiler.java:4394)
   at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:3768)
   at 
 org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:4427)
   at 
 org.codehaus.janino.UnitCompiler.compileGetValue(UnitCompiler.java:4360)
   at 
 org.codehaus.janino.UnitCompiler.invokeConstructor(UnitCompiler.java:6681)
   at org.codehaus.janino.UnitCompiler.compileGet2(UnitCompiler.java:4126)
   at org.codehaus.janino.UnitCompiler.access$7600(UnitCompiler.java:185)
   at 
 org.codehaus.janino.UnitCompiler$10.visitNewClassInstance(UnitCompiler.java:3275)
   at 

[jira] [Updated] (SPARK-5945) Spark should not retry a stage infinitely on a FetchFailedException

2015-07-29 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-5945:

Priority: Blocker  (was: Major)

 Spark should not retry a stage infinitely on a FetchFailedException
 ---

 Key: SPARK-5945
 URL: https://issues.apache.org/jira/browse/SPARK-5945
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Imran Rashid
Assignee: Ilya Ganelin
Priority: Blocker

 While investigating SPARK-5928, I noticed some very strange behavior in the 
 way spark retries stages after a FetchFailedException.  It seems that on a 
 FetchFailedException, instead of simply killing the task and retrying, Spark 
 aborts the stage and retries.  If it just retried the task, the task might 
 fail 4 times and then trigger the usual job killing mechanism.  But by 
 killing the stage instead, the max retry logic is skipped (it looks to me 
 like there is no limit for retries on a stage).
 After a bit of discussion with Kay Ousterhout, it seems the idea is that if a 
 fetch fails, we assume that the block manager we are fetching from has 
 failed, and that it will succeed if we retry the stage w/out that block 
 manager.  In that case, it wouldn't make any sense to retry the task, since 
 it's doomed to fail every time, so we might as well kill the whole stage.  But 
 this raises two questions:
 1) Is it really safe to assume that a FetchFailedException means that the 
 BlockManager has failed, and it will work if we just try another one?  
 SPARK-5928 shows that there are at least some cases where that assumption is 
 wrong.  Even if we fix that case, this logic seems brittle to the next case 
 we find.  I guess the idea is that this behavior is what gives us the "R" in 
 "RDD" ... but it seems like it's not really that robust and maybe should be 
 reconsidered.
 2) Should stages only be retried a limited number of times?  It would be 
 pretty easy to put in a limited number of retries per stage.  Though again, 
 we encounter issues with keeping things resilient.  Theoretically one stage 
 could have many retries caused by failures in different stages further 
 downstream, so we might need to track the cause of each retry as well to 
 still have the desired behavior.
 In general it just seems there is some flakiness in the retry logic.  This is 
 the only reproducible example I have at the moment, but I vaguely recall 
 hitting other cases of strange behavior w/ retries when trying to run long 
 pipelines.  Eg., if one executor is stuck in a GC during a fetch, the fetch 
 fails, but the executor eventually comes back and the stage gets retried 
 again, but the same GC issues happen the second time around, etc.
 Copied from SPARK-5928, here's the example program that can regularly produce 
 a loop of stage failures.  Note that it will only fail from a remote fetch, 
 so it can't be run locally -- I ran with {{MASTER=yarn-client spark-shell 
 --num-executors 2 --executor-memory 4000m}}
 {code}
 val rdd = sc.parallelize(1 to 1e6.toInt, 1).map { ignore =>
   val n = 3e3.toInt
   val arr = new Array[Byte](n)
   // need to make sure the array doesn't compress to something small
   scala.util.Random.nextBytes(arr)
   arr
 }
 rdd.map { x => (1, x) }.groupByKey().count()
 {code}






[jira] [Updated] (SPARK-8029) ShuffleMapTasks must be robust to concurrent attempts on the same executor

2015-07-29 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-8029:

Target Version/s: 1.5.0
Priority: Blocker  (was: Major)

 ShuffleMapTasks must be robust to concurrent attempts on the same executor
 --

 Key: SPARK-8029
 URL: https://issues.apache.org/jira/browse/SPARK-8029
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Imran Rashid
Assignee: Imran Rashid
Priority: Blocker
 Attachments: 
 AlternativesforMakingShuffleMapTasksRobusttoMultipleAttempts.pdf


 When stages get retried, a task may have more than one attempt running at the 
 same time, on the same executor.  Currently this causes problems for 
 ShuffleMapTasks, since all attempts try to write to the same output files.






[jira] [Commented] (SPARK-5079) Detect failed jobs / batches in Spark Streaming unit tests

2015-07-29 Thread Hunter Morgan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646441#comment-14646441
 ] 

Hunter Morgan commented on SPARK-5079:
--

I'm successfully using this pattern to test a stream element parsing helper.
{code:title=Sample test code|borderStyle=solid}
assumeThat("slf4j is bound to slf4j-test",
    StaticLoggerBinder.getSingleton().getLoggerFactoryClassStr(),
    is("uk.org.lidalia.slf4jtest.TestLoggerFactory"));

TestLogger logger = TestLoggerFactory.getTestLogger(
    "org.apache.spark.scheduler.TaskSetManager");

logger.setEnabledLevelsForAllThreads(ImmutableSet.of(Level.ERROR));

// ...

List<LoggingEvent> failures = logger.getAllLoggingEvents().stream()
    .filter(loggingEvent -> loggingEvent.getMessage().contains("failed"))
    .collect(Collectors.toList());

assertEquals("jobs didn't fail", 0, failures.size());
{code}

 Detect failed jobs / batches in Spark Streaming unit tests
 --

 Key: SPARK-5079
 URL: https://issues.apache.org/jira/browse/SPARK-5079
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Reporter: Josh Rosen
Assignee: Ilya Ganelin

 Currently, it is possible to write Spark Streaming unit tests where Spark 
 jobs fail but the streaming tests succeed because we rely on wall-clock time 
 plus output comparison in order to check whether a test has passed, and 
 hence may miss cases where errors occurred if they didn't affect these 
 results.  We should strengthen the tests to check that no job failures 
 occurred while processing batches.
 See https://github.com/apache/spark/pull/3832#issuecomment-68580794 for 
 additional context.
 The StreamingTestWaiter in https://github.com/apache/spark/pull/3801 might 
 also fix this.






[jira] [Commented] (SPARK-9352) Add tests for standalone scheduling code

2015-07-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646486#comment-14646486
 ] 

Andrew Or commented on SPARK-9352:
--

This is actually already fixed. It was reverted at one point but it has been merged 
back into 1.4.

 Add tests for standalone scheduling code
 

 Key: SPARK-9352
 URL: https://issues.apache.org/jira/browse/SPARK-9352
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Tests
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical
 Fix For: 1.4.2, 1.5.0


 There are no tests for the standalone Master scheduling code! This has caused 
 issues like SPARK-8881 and SPARK-9260 in the past. It is crucial that we have 
 some level of confidence that this code actually works...






[jira] [Updated] (SPARK-9441) NoSuchMethodError: Com.typesafe.config.Config.getDuration

2015-07-29 Thread nirav patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-9441:
---
Description: 
I recently migrated my spark based rest service from 1.0.2 to 1.3.1 


15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1

15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(npatel); users 
with modify permissions: Set(npatel)

Exception in thread "main" java.lang.NoSuchMethodError: 
com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J

at 
akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125)

at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120)

at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:171)

at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)

at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)

at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837)

at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828)

at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57)

at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223)

at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)

at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)

at org.apache.spark.SparkContext.<init>(SparkContext.scala:272)


I read blogs where people suggest modifying the classpath to put the right version 
first, putting the Scala libs first on the classpath, and similar suggestions, which is 
all ridiculous. I think the typesafe config package included with the spark-core lib is 
incorrect. I did the following with my Maven build and now it works. But I think 
someone needs to fix the spark-core package. 

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <exclusions>
    <exclusion>
      <artifactId>config</artifactId>
      <groupId>com.typesafe</groupId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>com.typesafe</groupId>
  <artifactId>config</artifactId>
  <version>1.2.1</version>
</dependency>

  was:
I recently migrated my spark based rest service from 1.0.2 to 1.3.1 


15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1

15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(npatel); users 
with modify permissions: Set(npatel)

Exception in thread main java.lang.NoSuchMethodError: 
com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J

at 
akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125)

at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120)

at akka.actor.ActorSystem$Settings.init(ActorSystem.scala:171)

at akka.actor.ActorSystemImpl.init(ActorSystem.scala:504)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)

at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)

at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837)

at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828)

at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57)

at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223)

at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)

at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)

at 

[jira] [Updated] (SPARK-9441) NoSuchMethodError: Com.typesafe.config.Config.getDuration

2015-07-29 Thread nirav patel (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

nirav patel updated SPARK-9441:
---
Description: 
I recently migrated my spark based rest service from 1.0.2 to 1.3.1 


15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1

15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(npatel); users 
with modify permissions: Set(npatel)

Exception in thread "main" java.lang.NoSuchMethodError: 
com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J

at 
akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125)

at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120)

at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:171)

at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)

at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)

at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837)

at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828)

at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57)

at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223)

at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)

at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)

at org.apache.spark.SparkContext.<init>(SparkContext.scala:272)


I read blogs where people suggest modifying the classpath to put the right version 
first, putting the Scala libs first on the classpath, and similar suggestions, which is 
all ridiculous. I think the typesafe config package included with the spark-core lib is 
incorrect. I did the following with my Maven build and now it works. But I think 
someone needs to fix the spark-core package. 

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <exclusions>
    <exclusion>
      <artifactId>config</artifactId>
      <groupId>com.typesafe</groupId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>com.typesafe</groupId>
  <artifactId>config</artifactId>
  <version>1.2.1</version>
</dependency>
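
For context, a minimal sketch of the kind of call Akka makes at ActorSystem 
startup (the key name below is illustrative, not the one Akka actually reads): 
Config.getDuration(String, TimeUnit) was only added in com.typesafe:config 1.2.0, 
so an older config jar earlier on the classpath produces exactly the 
NoSuchMethodError above.

{code}
import java.util.concurrent.TimeUnit
import com.typesafe.config.{Config, ConfigFactory}

// Minimal sketch with an illustrative key; Akka reads its own configuration keys.
// getDuration(String, TimeUnit) exists only in com.typesafe:config 1.2.0+, so an
// older config jar on the classpath fails here with NoSuchMethodError.
val config: Config = ConfigFactory.parseString("my.timeout = 10s")
val millis: Long = config.getDuration("my.timeout", TimeUnit.MILLISECONDS)
{code}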

  was:
I recently migrated my Spark-based REST service from 1.0.2 to 1.3.1.


15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1

15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(npatel); users 
with modify permissions: Set(npatel)

Exception in thread "main" java.lang.NoSuchMethodError: 
com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J

at 
akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125)

at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120)

at akka.actor.ActorSystem$Settings.init(ActorSystem.scala:171)

at akka.actor.ActorSystemImpl.init(ActorSystem.scala:504)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)

at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)

at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837)

at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828)

at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57)

at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223)

at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)

at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)

at 

[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646535#comment-14646535
 ] 

Joseph K. Bradley commented on SPARK-5133:
--

That's my hope...  I'll try to send a PR later today.

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature I could implement it given a mentor 
 (API decisions, etc). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All necessary information to create relative importance scores should be 
 available in the tree representation (class Node; split, impurity gain, 
 (weighted) nr of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9441) NoSuchMethodError: Com.typesafe.config.Config.getDuration

2015-07-29 Thread nirav patel (JIRA)
nirav patel created SPARK-9441:
--

 Summary: NoSuchMethodError: Com.typesafe.config.Config.getDuration
 Key: SPARK-9441
 URL: https://issues.apache.org/jira/browse/SPARK-9441
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.3.1
Reporter: nirav patel


I recently migrated my Spark-based REST service from 1.0.2 to 1.3.1.


15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1

15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel

15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(npatel); users 
with modify permissions: Set(npatel)

Exception in thread "main" java.lang.NoSuchMethodError: 
com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J

at 
akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125)

at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120)

at akka.actor.ActorSystem$Settings.init(ActorSystem.scala:171)

at akka.actor.ActorSystemImpl.init(ActorSystem.scala:504)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)

at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)

at 
org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55)

at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)

at 
org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837)

at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828)

at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57)

at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223)

at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)

at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)

at org.apache.spark.SparkContext.init(SparkContext.scala:272)


I read blog posts where people suggest modifying the classpath and similar 
workarounds, all of which seems ridiculous to me. I think the typesafe config 
package included with the spark-core lib is incorrect. I did the following in my 
Maven build and now it works, but I think someone needs to fix the spark-core 
package. 

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <exclusions>
    <exclusion>
      <artifactId>config</artifactId>
      <groupId>com.typesafe</groupId>
    </exclusion>
  </exclusions>
</dependency>

<dependency>
  <groupId>com.typesafe</groupId>
  <artifactId>config</artifactId>
  <version>1.2.1</version>
</dependency>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8198) date/time function: months_between

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646584#comment-14646584
 ] 

Apache Spark commented on SPARK-8198:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7754

 date/time function: months_between
 --

 Key: SPARK-8198
 URL: https://issues.apache.org/jira/browse/SPARK-8198
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 months_between(date1, date2): double
 Returns number of months between dates date1 and date2 (as of Hive 1.2.0). If 
 date1 is later than date2, then the result is positive. If date1 is earlier 
 than date2, then the result is negative. If date1 and date2 are either the 
 same days of the month or both last days of months, then the result is always 
 an integer. Otherwise the UDF calculates the fractional portion of the result 
 based on a 31-day month and considers the difference in the time components of 
 date1 and date2. date1 and date2 can be of type date, timestamp, or string in 
 the format 'yyyy-MM-dd' or 'yyyy-MM-dd HH:mm:ss'. The result is rounded to 8 
 decimal places. Example: months_between('1997-02-28 10:30:00', '1996-10-30') 
 = 3.94959677
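
To make the rule above concrete, here is a small Scala sketch of the described 
calculation (whole-month difference plus a fractional part over a fixed 31-day 
month, rounded to 8 decimal places). It is an illustration of the rule only, not 
Spark's or Hive's implementation; the helper name is made up, and the 
same-day/last-day integer special case is omitted.

{code}
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Sketch of the months_between rule described above (illustrative helper name).
def monthsBetweenSketch(d1: LocalDateTime, d2: LocalDateTime): Double = {
  val wholeMonths = (d1.getYear - d2.getYear) * 12 + (d1.getMonthValue - d2.getMonthValue)
  def secondsIntoDay(dt: LocalDateTime): Int =
    dt.getHour * 3600 + dt.getMinute * 60 + dt.getSecond
  // Fractional days, measured against a fixed 31-day month as described.
  val dayDiff = (d1.getDayOfMonth - d2.getDayOfMonth) +
    (secondsIntoDay(d1) - secondsIntoDay(d2)) / 86400.0
  BigDecimal(wholeMonths + dayDiff / 31.0)
    .setScale(8, BigDecimal.RoundingMode.HALF_UP).toDouble
}

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
// Reproduces the example from the description: 3.94959677
monthsBetweenSketch(
  LocalDateTime.parse("1997-02-28 10:30:00", fmt),
  LocalDateTime.parse("1996-10-30 00:00:00", fmt))
{code}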



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9133) Add and Subtract should support date/timestamp and interval type

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9133:
---

Assignee: Apache Spark  (was: Davies Liu)

 Add and Subtract should support date/timestamp and interval type
 

 Key: SPARK-9133
 URL: https://issues.apache.org/jira/browse/SPARK-9133
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark

 Should support
 date + interval
 interval + date
 timestamp + interval
 interval + timestamp
 The best way to support this is probably to resolve this to a date 
 add/subtract expression, rather than making add/subtract support these types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8187) date/time function: date_sub

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646582#comment-14646582
 ] 

Apache Spark commented on SPARK-8187:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7754

 date/time function: date_sub
 

 Key: SPARK-8187
 URL: https://issues.apache.org/jira/browse/SPARK-8187
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang

 date_sub(timestamp startdate, int days): timestamp
 date_sub(timestamp startdate, interval i): timestamp
 date_sub(date date, int days): date
 date_sub(date date, interval i): date



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9444) RemoveEvaluationFromSort reorders sort order

2015-07-29 Thread Michael Armbrust (JIRA)
Michael Armbrust created SPARK-9444:
---

 Summary: RemoveEvaluationFromSort reorders sort order
 Key: SPARK-9444
 URL: https://issues.apache.org/jira/browse/SPARK-9444
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Wenchen Fan
Priority: Blocker


{code}
Seq((1,2,3,4)).toDF("a", "b", "c", "d").registerTempTable("test")

scala> sqlContext.sql("SELECT * FROM test ORDER BY a + 1, b").queryExecution
== Parsed Logical Plan ==
'Sort [('a + 1) ASC,'b ASC], true
 'Project [unresolvedalias(*)]
  'UnresolvedRelation [test], None

== Optimized Logical Plan ==
Project [a#4,b#5,c#6,d#7]
 Sort [b#5 ASC,_sortCondition#9 ASC], true
  LocalRelation [a#4,b#5,c#6,d#7,_sortCondition#9], [[1,2,3,4,2]]
{code}

Notice how we are now sorting by b before a + 1.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9446) Clear Active SparkContext in stop() method

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9446:
---

Assignee: Apache Spark

 Clear Active SparkContext in stop() method
 --

 Key: SPARK-9446
 URL: https://issues.apache.org/jira/browse/SPARK-9446
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu
Assignee: Apache Spark

 In the thread 'stopped SparkContext remaining active' on the mailing list, Andres 
 observed the following in the driver log:
 {code}
 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
 ApplicationMaster has disassociated: address removed
 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
 Exception in thread Yarn application state monitor 
 org.apache.spark.SparkException: Error asking standalone scheduler to shut 
 down executors
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
 at 
 org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
 Caused by: java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 
 INFO YarnClientSchedulerBackend: Asking each executor to shut down
 at 
 org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
 at 
 org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
 ... 6 more
 {code}
 Effect of the above exception is that a stopped SparkContext is returned to 
 user since SparkContext.clearActiveContext() is not called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9446) Clear Active SparkContext in stop() method

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9446:
---

Assignee: (was: Apache Spark)

 Clear Active SparkContext in stop() method
 --

 Key: SPARK-9446
 URL: https://issues.apache.org/jira/browse/SPARK-9446
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu

 In the thread 'stopped SparkContext remaining active' on the mailing list, Andres 
 observed the following in the driver log:
 {code}
 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
 ApplicationMaster has disassociated: address removed
 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
 Exception in thread Yarn application state monitor 
 org.apache.spark.SparkException: Error asking standalone scheduler to shut 
 down executors
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
 at 
 org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
 Caused by: java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 
 INFO YarnClientSchedulerBackend: Asking each executor to shut down
 at 
 org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
 at 
 org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
 ... 6 more
 {code}
 Effect of the above exception is that a stopped SparkContext is returned to 
 user since SparkContext.clearActiveContext() is not called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9446) Clear Active SparkContext in stop() method

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646728#comment-14646728
 ] 

Apache Spark commented on SPARK-9446:
-

User 'tedyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/7756

 Clear Active SparkContext in stop() method
 --

 Key: SPARK-9446
 URL: https://issues.apache.org/jira/browse/SPARK-9446
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu

 In the thread 'stopped SparkContext remaining active' on the mailing list, Andres 
 observed the following in the driver log:
 {code}
 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
 ApplicationMaster has disassociated: address removed
 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
 Exception in thread Yarn application state monitor 
 org.apache.spark.SparkException: Error asking standalone scheduler to shut 
 down executors
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
 at 
 org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
 Caused by: java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 
 INFO YarnClientSchedulerBackend: Asking each executor to shut down
 at 
 org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
 at 
 org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
 ... 6 more
 {code}
 Effect of the above exception is that a stopped SparkContext is returned to 
 user since SparkContext.clearActiveContext() is not called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9446) Clear Active SparkContext in stop() method

2015-07-29 Thread Ted Yu (JIRA)
Ted Yu created SPARK-9446:
-

 Summary: Clear Active SparkContext in stop() method
 Key: SPARK-9446
 URL: https://issues.apache.org/jira/browse/SPARK-9446
 Project: Spark
  Issue Type: Bug
Reporter: Ted Yu


In the thread 'stopped SparkContext remaining active' on the mailing list, Andres 
observed the following in the driver log:
{code}
15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
ApplicationMaster has disassociated: address removed
15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
Exception in thread Yarn application state monitor 
org.apache.spark.SparkException: Error asking standalone scheduler to shut down 
executors
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
at 
scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at 
scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at 
scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 
INFO YarnClientSchedulerBackend: Asking each executor to shut down

at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at 
org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
... 6 more
{code}
Effect of the above exception is that a stopped SparkContext is returned to 
user since SparkContext.clearActiveContext() is not called.
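
A sketch of the idea only (illustrative names, not Spark's internals; the actual 
fix may differ): clear the active-context bookkeeping in a finally block so it is 
reset even when the shutdown steps throw, as in the trace above.

{code}
import java.util.concurrent.atomic.AtomicReference

// Illustration of the pattern: the shutdown steps may throw (see the
// InterruptedException above), but the active-context marker is always cleared.
object ActiveContextSketch {
  private val active = new AtomicReference[AnyRef]()

  def stop(ctx: AnyRef)(shutdownSteps: => Unit): Unit = {
    try {
      shutdownSteps                    // e.g. stopping schedulers and executors
    } finally {
      active.compareAndSet(ctx, null)  // analogous to calling clearActiveContext()
    }
  }
}
{code}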



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9436) Simplify Pregel by merging joins

2015-07-29 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave reassigned SPARK-9436:
-

Assignee: Ankur Dave

 Simplify Pregel by merging joins
 

 Key: SPARK-9436
 URL: https://issues.apache.org/jira/browse/SPARK-9436
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Assignee: Ankur Dave
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 Pregel code contains two consecutive joins: 
 ```
 g.vertices.innerJoin(messages)(vprog)
 ...
 g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => 
 newOpt.getOrElse(old) }
 ```
 They can be replaced by one join. Ankur Dave proposed a patch based on our 
 discussion in mailing list: 
 https://www.mail-archive.com/dev@spark.apache.org/msg10316.html
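
A minimal sketch of the merged join, assuming Pregel's usual names (the message 
VertexRDD and vprog); not necessarily the exact code in the patch linked above: 
apply vprog only where a message arrived and keep the old attribute otherwise, in 
a single outerJoinVertices pass.

{code}
import scala.reflect.ClassTag
import org.apache.spark.graphx.{Graph, VertexId, VertexRDD}

// Sketch only: one outerJoinVertices replaces the innerJoin + outerJoinVertices pair.
def applyMessages[VD: ClassTag, ED, A: ClassTag](
    g: Graph[VD, ED],
    messages: VertexRDD[A],
    vprog: (VertexId, VD, A) => VD): Graph[VD, ED] =
  g.outerJoinVertices(messages) { (vid, old, msgOpt) =>
    msgOpt.map(msg => vprog(vid, old, msg)).getOrElse(old)
  }
{code}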



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9425) Support DecimalType in UnsafeRow

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9425:
---

Assignee: Davies Liu  (was: Apache Spark)

 Support DecimalType in UnsafeRow
 

 Key: SPARK-9425
 URL: https://issues.apache.org/jira/browse/SPARK-9425
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 The DecimalType has a precision of up to 38 digits, so it's possible to 
 serialize a Decimal as a bounded byte array.
 We could have a fast path for DecimalType with precision under 18 digits, 
 which could fit in a single long.
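
As a rough illustration of the fast path mentioned above (plain 
java.math.BigDecimal here, not Spark's Decimal type or the UnsafeRow layout): a 
decimal with precision at most 18 has an unscaled value below 10^18, which fits 
in a signed 64-bit long and could therefore be stored inline.

{code}
import java.math.BigDecimal

// Illustration only: precision <= 18 implies |unscaled value| < 10^18, which is
// within Long range, so the value can be carried as (unscaled: Long, scale: Int).
def toUnscaledLong(d: BigDecimal): Option[Long] =
  if (d.precision <= 18) Some(d.unscaledValue.longValueExact) else None

def fromUnscaledLong(unscaled: Long, scale: Int): BigDecimal =
  BigDecimal.valueOf(unscaled, scale)
{code}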



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov resolved SPARK-9380.
-
Resolution: Fixed

 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 Pregel operator to express single source
 shortest path does not work due to incorrect type of the graph: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9446) Clear Active SparkContext in stop() method

2015-07-29 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-9446:
-
Affects Version/s: 1.4.1
 Priority: Minor  (was: Major)
  Component/s: Spark Core

[~tedyu] you need to fill out the JIRA fields when creating one

 Clear Active SparkContext in stop() method
 --

 Key: SPARK-9446
 URL: https://issues.apache.org/jira/browse/SPARK-9446
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: Ted Yu
Priority: Minor

 In the thread 'stopped SparkContext remaining active' on the mailing list, Andres 
 observed the following in the driver log:
 {code}
 15/07/29 15:17:09 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: 
 ApplicationMaster has disassociated: address removed
 15/07/29 15:17:09 INFO YarnClientSchedulerBackend: Shutting down all executors
 Exception in thread Yarn application state monitor 
 org.apache.spark.SparkException: Error asking standalone scheduler to shut 
 down executors
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
 at 
 org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
 at 
 org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
 at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
 at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
 Caused by: java.lang.InterruptedException
 at 
 java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1325)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
 at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
 at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
 at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
 at scala.concurrent.Await$.result(package.scala:190)15/07/29 15:17:09 
 INFO YarnClientSchedulerBackend: Asking each executor to shut down
 at 
 org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
 at 
 org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
 at 
 org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
 ... 6 more
 {code}
 Effect of the above exception is that a stopped SparkContext is returned to 
 user since SparkContext.clearActiveContext() is not called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9436) Simplify Pregel by merging joins

2015-07-29 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave updated SPARK-9436:
--
Assignee: Alexander Ulanov  (was: Ankur Dave)

 Simplify Pregel by merging joins
 

 Key: SPARK-9436
 URL: https://issues.apache.org/jira/browse/SPARK-9436
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Assignee: Alexander Ulanov
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 Pregel code contains two consecutive joins: 
 ```
 g.vertices.innerJoin(messages)(vprog)
 ...
 g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => 
 newOpt.getOrElse(old) }
 ```
 They can be replaced by one join. Ankur Dave proposed a patch based on our 
 discussion in mailing list: 
 https://www.mail-archive.com/dev@spark.apache.org/msg10316.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9436) Simplify Pregel by merging joins

2015-07-29 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-9436.
---
   Resolution: Fixed
Fix Version/s: (was: 1.4.0)
   1.5.0

Issue resolved by pull request 7749
[https://github.com/apache/spark/pull/7749]

 Simplify Pregel by merging joins
 

 Key: SPARK-9436
 URL: https://issues.apache.org/jira/browse/SPARK-9436
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Priority: Minor
 Fix For: 1.5.0

   Original Estimate: 1h
  Remaining Estimate: 1h

 Pregel code contains two consecutive joins: 
 ```
 g.vertices.innerJoin(messages)(vprog)
 ...
 g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => 
 newOpt.getOrElse(old) }
 ```
 They can be replaced by one join. Ankur Dave proposed a patch based on our 
 discussion in mailing list: 
 https://www.mail-archive.com/dev@spark.apache.org/msg10316.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap

2015-07-29 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646812#comment-14646812
 ] 

Feynman Liang commented on SPARK-9440:
--

PR at https://github.com/apache/spark/pull/7757

 LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
 -

 Key: SPARK-9440
 URL: https://issues.apache.org/jira/browse/SPARK-9440
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Feynman Liang
Priority: Critical

 LocalLDAModel needs to save these parameters in order for {{logPerplexity}} 
 and {{bound}} (see SPARK-6793) to work correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9425) Support DecimalType in UnsafeRow

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646827#comment-14646827
 ] 

Apache Spark commented on SPARK-9425:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7758

 Support DecimalType in UnsafeRow
 

 Key: SPARK-9425
 URL: https://issues.apache.org/jira/browse/SPARK-9425
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 The DecimalType has a precision of up to 38 digits, so it's possible to 
 serialize a Decimal as a bounded byte array.
 We could have a fast path for DecimalType with precision under 18 digits, 
 which could fit in a single long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9448) GenerateUnsafeProjection should not share expressions across instances

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9448:
---

Assignee: Reynold Xin  (was: Apache Spark)

 GenerateUnsafeProjection should not share expressions across instances
 --

 Key: SPARK-9448
 URL: https://issues.apache.org/jira/browse/SPARK-9448
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker

 We accidentally moved the list of expressions from the generated code 
 instance to the class wrapper, and as a result, different threads are sharing 
 the same set of expressions, which cause problems when the expressions have 
 mutable state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Neelesh Srinivas Salian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646841#comment-14646841
 ] 

Neelesh Srinivas Salian commented on SPARK-9380:


Seems to be corrected.
https://github.com/apache/spark/blob/master/docs/graphx-programming-guide.md

 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 Pregel operator to express single source
 shortest path does not work due to incorrect type of the graph: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9448) GenerateUnsafeProjection should not share expressions across instances

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646839#comment-14646839
 ] 

Apache Spark commented on SPARK-9448:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7759

 GenerateUnsafeProjection should not share expressions across instances
 --

 Key: SPARK-9448
 URL: https://issues.apache.org/jira/browse/SPARK-9448
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker

 We accidentally moved the list of expressions from the generated code 
 instance to the class wrapper, and as a result, different threads are sharing 
 the same set of expressions, which cause problems when the expressions have 
 mutable state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5754) Spark AM not launching on Windows

2015-07-29 Thread Carsten Blank (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646733#comment-14646733
 ] 

Carsten Blank commented on SPARK-5754:
--

And in 1.5.0. Actually, I have quickly hacked a couple of lines to fix this 
issue. However, I also suggest that a command helper be written, so that we 
populate a List of CommandEntry objects (oho, creative naming!!) that each have 
a key, a value, and a keyValueSeparator. We could then assemble the correct 
command (either as a String or a List of Strings) in a platform-dependent way. 
That way we could make sure that commands don't break the container, regardless 
of the platform. 

One more thought regarding the above-mentioned persistent problem with 

-XX:OnOutOfMemoryError='kill %p'

I have checked, and it seems to come from the fact that cmd expects an 
environment variable at %p. Consequently everything breaks. One way to deal with 
this is to use '%%'. Again, this kind of stuff should be in a helper class. I 
could contribute if that is okay; however, I have never done so before, so a bit 
of help would be appreciated!
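
A rough sketch of that helper (CommandEntry is the name suggested above; the 
rendering rules below are illustrative, not a tested Windows escaping 
implementation):

{code}
// Sketch: keep command fragments as (key, value, separator) triples and render
// them per platform, so quoting/escaping rules live in one place.
case class CommandEntry(key: String, value: String, sep: String = "=")

def render(entries: Seq[CommandEntry], isWindows: Boolean): Seq[String] =
  entries.map { e =>
    // Illustrative rules only: double '%' for cmd, single-quote on Unix shells.
    val v = if (isWindows) e.value.replace("%", "%%") else "'" + e.value + "'"
    e.key + e.sep + v
  }
{code}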


 Spark AM not launching on Windows
 -

 Key: SPARK-5754
 URL: https://issues.apache.org/jira/browse/SPARK-5754
 Project: Spark
  Issue Type: Bug
  Components: Windows, YARN
Affects Versions: 1.1.1, 1.2.0
 Environment: Windows Server 2012, Hadoop 2.4.1.
Reporter: Inigo

 I'm trying to run Spark Pi on a YARN cluster running on Windows and the AM 
 container fails to start. The problem seems to be in the generation of the 
 YARN command which adds single quotes (') surrounding some of the java 
 options. In particular, the part of the code that is adding those is the 
 escapeForShell function in YarnSparkHadoopUtil. Apparently, Windows does not 
 like the quotes for these options. Here is an example of the command that the 
 container tries to execute:
 @call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp 
 '-Dspark.yarn.secondary.jars=' 
 '-Dspark.app.name=org.apache.spark.examples.SparkPi' 
 '-Dspark.master=yarn-cluster' org.apache.spark.deploy.yarn.ApplicationMaster 
 --class 'org.apache.spark.examples.SparkPi' --jar  
 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar'
   --executor-memory 1024 --executor-cores 1 --num-executors 2
 Once I transform it into:
 @call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp 
 -Dspark.yarn.secondary.jars= 
 -Dspark.app.name=org.apache.spark.examples.SparkPi 
 -Dspark.master=yarn-cluster org.apache.spark.deploy.yarn.ApplicationMaster 
 --class 'org.apache.spark.examples.SparkPi' --jar  
 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar'
   --executor-memory 1024 --executor-cores 1 --num-executors 2
 Everything seems to start.
 How should I deal with this? Should I create a separate function like 
 escapeForShell for Windows and call it whenever I detect Windows? Or should I 
 add some sanity check on the YARN side?
 I checked a little and there seem to be people who are able to run Spark on 
 YARN on Windows, so it might be something else. I didn't find anything 
 related on Jira either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9448) GenerateUnsafeProjection should not share expressions across instances

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9448:
---

Assignee: Apache Spark  (was: Reynold Xin)

 GenerateUnsafeProjection should not share expressions across instances
 --

 Key: SPARK-9448
 URL: https://issues.apache.org/jira/browse/SPARK-9448
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Apache Spark
Priority: Blocker

 We accidentally moved the list of expressions from the generated code 
 instance to the class wrapper, and as a result, different threads are sharing 
 the same set of expressions, which cause problems when the expressions have 
 mutable state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9380:
---

Assignee: (was: Apache Spark)

 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 Pregel operator to express single source
 shortest path does not work due to incorrect type of the graph: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646845#comment-14646845
 ] 

Apache Spark commented on SPARK-9380:
-

User 'avulanov' has created a pull request for this issue:
https://github.com/apache/spark/pull/7695

 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 Pregel operator to express single source
 shortest path does not work due to incorrect type of the graph: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9380:
---

Assignee: Apache Spark

 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
Assignee: Apache Spark
 Fix For: 1.4.0


 Pregel operator to express single source
 shortest path does not work due to incorrect type of the graph: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9430) Rename IntervalType to CalendarInterval

2015-07-29 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-9430.

   Resolution: Fixed
Fix Version/s: 1.5.0

 Rename IntervalType to CalendarInterval
 ---

 Key: SPARK-9430
 URL: https://issues.apache.org/jira/browse/SPARK-9430
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
 Fix For: 1.5.0


 Based on offline discussion with [~marmbrus].
 In 1.6, I think we should just create a TimeInterval type, which stores only 
 the interval in terms of number of microseconds. TimeInterval can then be 
 comparable.
 In 1.5, we should rename the existing IntervalType to CalendarInterval, so we 
 won't have name clashes in 1.6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9448) GenerateUnsafeProjection should not share expressions across instances

2015-07-29 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-9448:
--

 Summary: GenerateUnsafeProjection should not share expressions 
across instances
 Key: SPARK-9448
 URL: https://issues.apache.org/jira/browse/SPARK-9448
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Blocker


We accidentally moved the list of expressions from the generated code instance 
to the class wrapper, and as a result, different threads are sharing the same 
set of expressions, which cause problems when the expressions have mutable 
state.
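
To illustrate the failure mode in the abstract (made-up names, not the actual 
generated classes): if mutable evaluation state is hoisted onto an object shared 
by every projection instance, concurrent threads interfere; keeping the 
expressions on each generated instance avoids that.

{code}
// Illustration only (made-up names, not Spark's codegen classes).
class MutableExpr { var scratch: Array[Byte] = Array.emptyByteArray }

// Problematic: one expression list shared by every instance and thread.
object SharedExprs { val exprs: Seq[MutableExpr] = Seq(new MutableExpr) }

// Safe: each projection instance builds and owns its own expressions.
class ProjectionInstance(mkExprs: () => Seq[MutableExpr]) {
  private val exprs: Seq[MutableExpr] = mkExprs()
}
{code}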





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-9380:

Comment: was deleted

(was: It seems that I did not name the PR correctly. I renamed it and resolved 
this issue. Sorry for inconvenience.
)

 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 Pregel operator to express single source
 shortest path does not work due to incorrect type of the graph: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646854#comment-14646854
 ] 

Alexander Ulanov commented on SPARK-9380:
-

It seems that I did not name the PR correctly. I renamed it and resolved this 
issue. Sorry for inconvenience.


 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 Pregel operator to express single source
 shortest path does not work due to incorrect type of the graph: Graph[Int, 
 Double] should be Graph[Long, Double]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9447) Update python API to include RandomForest as classifier changes.

2015-07-29 Thread holdenk (JIRA)
holdenk created SPARK-9447:
--

 Summary: Update python API to include RandomForest as classifier 
changes.
 Key: SPARK-9447
 URL: https://issues.apache.org/jira/browse/SPARK-9447
 Project: Spark
  Issue Type: Improvement
Reporter: holdenk
Priority: Minor


The API should still work after 
SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets 
merged in, but we might want to extend & provide predictRaw and similar in the 
Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9447) Update python API to include RandomForest as classifier changes.

2015-07-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646759#comment-14646759
 ] 

Joseph K. Bradley commented on SPARK-9447:
--

Upgrading to Major priority since users should be able to specify new output 
column names from Python.

 Update python API to include RandomForest as classifier changes.
 

 Key: SPARK-9447
 URL: https://issues.apache.org/jira/browse/SPARK-9447
 Project: Spark
  Issue Type: Improvement
Reporter: holdenk

 The API should still work after 
 SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets 
 merged in, but we might want to extend & provide predictRaw and similar in 
 the Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9447) Update python API to include RandomForest as classifier changes.

2015-07-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9447:
-
Shepherd: Joseph K. Bradley

 Update python API to include RandomForest as classifier changes.
 

 Key: SPARK-9447
 URL: https://issues.apache.org/jira/browse/SPARK-9447
 Project: Spark
  Issue Type: Improvement
Reporter: holdenk

 The API should still work after 
 SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets 
 merged in, but we might want to extend & provide predictRaw and similar in 
 the Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9447) Update python API to include RandomForest as classifier changes.

2015-07-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9447:
-
Priority: Major  (was: Minor)

 Update python API to include RandomForest as classifier changes.
 

 Key: SPARK-9447
 URL: https://issues.apache.org/jira/browse/SPARK-9447
 Project: Spark
  Issue Type: Improvement
Reporter: holdenk

 The API should still work after 
 SPARK-9016-make-random-forest-classifiers-implement-classification-trait gets 
 merged in, but we might want to extend & provide predictRaw and similar in 
 the Python API.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5754) Spark AM not launching on Windows

2015-07-29 Thread Inigo Goiri (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646768#comment-14646768
 ] 

Inigo Goiri commented on SPARK-5754:


Thank you [~cbvoxel] for looking into this.
I can send any patches; just let me know what you need.

 Spark AM not launching on Windows
 -

 Key: SPARK-5754
 URL: https://issues.apache.org/jira/browse/SPARK-5754
 Project: Spark
  Issue Type: Bug
  Components: Windows, YARN
Affects Versions: 1.1.1, 1.2.0
 Environment: Windows Server 2012, Hadoop 2.4.1.
Reporter: Inigo

 I'm trying to run Spark Pi on a YARN cluster running on Windows and the AM 
 container fails to start. The problem seems to be in the generation of the 
 YARN command which adds single quotes (') surrounding some of the java 
 options. In particular, the part of the code that is adding those is the 
 escapeForShell function in YarnSparkHadoopUtil. Apparently, Windows does not 
 like the quotes for these options. Here is an example of the command that the 
 container tries to execute:
 @call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp 
 '-Dspark.yarn.secondary.jars=' 
 '-Dspark.app.name=org.apache.spark.examples.SparkPi' 
 '-Dspark.master=yarn-cluster' org.apache.spark.deploy.yarn.ApplicationMaster 
 --class 'org.apache.spark.examples.SparkPi' --jar  
 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar'
   --executor-memory 1024 --executor-cores 1 --num-executors 2
 Once I transform it into:
 @call %JAVA_HOME%/bin/java -server -Xmx512m -Djava.io.tmpdir=%PWD%/tmp 
 -Dspark.yarn.secondary.jars= 
 -Dspark.app.name=org.apache.spark.examples.SparkPi 
 -Dspark.master=yarn-cluster org.apache.spark.deploy.yarn.ApplicationMaster 
 --class 'org.apache.spark.examples.SparkPi' --jar  
 'file:/D:/data/spark-1.1.1-bin-hadoop2.4/bin/../lib/spark-examples-1.1.1-hadoop2.4.0.jar'
   --executor-memory 1024 --executor-cores 1 --num-executors 2
 Everything seems to start.
 How should I deal with this? Should I create a separate function like 
 escapeForShell for Windows and call it whenever I detect Windows? Or should I 
 add some sanity check on the YARN side?
 I checked a little and there seem to be people who are able to run Spark on 
 YARN on Windows, so it might be something else. I didn't find anything 
 related on Jira either.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9425) Support DecimalType in UnsafeRow

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9425:
---

Assignee: Apache Spark  (was: Davies Liu)

 Support DecimalType in UnsafeRow
 

 Key: SPARK-9425
 URL: https://issues.apache.org/jira/browse/SPARK-9425
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Apache Spark
Priority: Blocker

 The DecimalType has a precision of up to 38 digits, so it's possible to 
 serialize a Decimal as a bounded byte array.
 We could have a fast path for DecimalType with precision under 18 digits, 
 which could fit in a single long.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9380) Pregel example fix in graphx-programming-guide

2015-07-29 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646853#comment-14646853
 ] 

Alexander Ulanov commented on SPARK-9380:
-

It seems that I did not name the PR correctly. I renamed it and resolved this 
issue. Sorry for inconvenience.


 Pregel example fix in graphx-programming-guide
 --

 Key: SPARK-9380
 URL: https://issues.apache.org/jira/browse/SPARK-9380
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.4.0
Reporter: Alexander Ulanov
 Fix For: 1.4.0


 The Pregel operator example for single-source shortest paths does not work 
 due to an incorrect graph type: Graph[Int, Double] should be Graph[Long, 
 Double].
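 For reference, a sketch of the corrected example with the right type (assuming an 
 existing SparkContext {{sc}}; this roughly mirrors the guide's shortest-path snippet):
 {code}
 import org.apache.spark.graphx.{Graph, VertexId}
 import org.apache.spark.graphx.util.GraphGenerators

 // logNormalGraph produces Long vertex attributes, so after mapEdges the
 // graph is Graph[Long, Double], not Graph[Int, Double].
 val graph: Graph[Long, Double] =
   GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
 val sourceId: VertexId = 42
 val initialGraph = graph.mapVertices((id, _) =>
   if (id == sourceId) 0.0 else Double.PositiveInfinity)
 val sssp = initialGraph.pregel(Double.PositiveInfinity)(
   (id, dist, newDist) => math.min(dist, newDist),               // vertex program
   triplet =>
     if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
       Iterator((triplet.dstId, triplet.srcAttr + triplet.attr)) // send message
     else Iterator.empty,
   (a, b) => math.min(a, b))                                     // merge messages
 println(sssp.vertices.collect.mkString("\n"))
 {code}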



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8414) Ensure ContextCleaner actually triggers clean ups

2015-07-29 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8414?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646487#comment-14646487
 ] 

Andrew Or commented on SPARK-8414:
--

Oops, you're right. I always mix them up...

 Ensure ContextCleaner actually triggers clean ups
 -

 Key: SPARK-8414
 URL: https://issues.apache.org/jira/browse/SPARK-8414
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical

 Right now it cleans up old references only through natural GCs, which may not 
 occur if the driver has infinite RAM. We should do a periodic GC to make sure 
 that we actually do clean things up. Something like once per 30 minutes seems 
 relatively inexpensive.
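 A rough sketch of the idea (the names are illustrative, not the actual ContextCleaner 
 change): schedule a full GC on a fixed period so the weak references the cleaner 
 relies on actually get enqueued even when the heap never fills up.
 {code}
 import java.util.concurrent.{Executors, TimeUnit}

 object PeriodicGC {
   private val scheduler = Executors.newSingleThreadScheduledExecutor()

   // Trigger System.gc() every `periodMinutes` minutes (default 30).
   def start(periodMinutes: Long = 30): Unit =
     scheduler.scheduleAtFixedRate(
       new Runnable { def run(): Unit = System.gc() },
       periodMinutes, periodMinutes, TimeUnit.MINUTES)

   def stop(): Unit = scheduler.shutdown()
 }
 {code}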



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)
DB Tsai created SPARK-9442:
--

 Summary: java.lang.ArithmeticException: / by zero when reading 
Parquet
 Key: SPARK-9442
 URL: https://issues.apache.org/jira/browse/SPARK-9442
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: DB Tsai






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7157) Add approximate stratified sampling to DataFrame

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646610#comment-14646610
 ] 

Apache Spark commented on SPARK-7157:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/7755

 Add approximate stratified sampling to DataFrame
 

 Key: SPARK-7157
 URL: https://issues.apache.org/jira/browse/SPARK-7157
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Joseph K. Bradley
Assignee: Xiangrui Meng
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646708#comment-14646708
 ] 

DB Tsai commented on SPARK-9442:


I will try to turn on the logging at the info level. Thanks.

 java.lang.ArithmeticException: / by zero when reading Parquet
 -

 Key: SPARK-9442
 URL: https://issues.apache.org/jira/browse/SPARK-9442
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: DB Tsai

 I am counting how many records in my nested parquet file with this schema,
 {code}
 scala> u1aTesting.printSchema
 root
  |-- profileId: long (nullable = true)
  |-- country: string (nullable = true)
  |-- data: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- videoId: long (nullable = true)
  |||-- date: long (nullable = true)
  |||-- label: double (nullable = true)
  |||-- weight: double (nullable = true)
  |||-- features: vector (nullable = true)
 {code}
 and the number of records in the nested data array is around 10k, and 
 each of the parquet files is around 600MB. The total size is around 120GB. 
 I am doing a simple count
 {code}
 scala> u1aTesting.count
 parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
 file 
 hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArithmeticException: / by zero
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
   ... 21 more
 {code}
 BTW, not all the tasks fail; some of them are successful. 
 Another note: by explicitly looping through the data to count, it works.
 {code}
 sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
 {code}
 I think maybe some metadata in the parquet files is corrupted. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap

2015-07-29 Thread Feynman Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646491#comment-14646491
 ] 

Feynman Liang commented on SPARK-9440:
--

Working on this

 LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
 -

 Key: SPARK-9440
 URL: https://issues.apache.org/jira/browse/SPARK-9440
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Feynman Liang
Priority: Blocker

 LocalLDAModel needs to save these parameters in order for {{logPerplexity}} 
 and {{bound}} (see SPARK-6793) to work correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap

2015-07-29 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-9440:


 Summary: LocalLDAModel should save docConcentration, 
topicConcentration, and gammaShap
 Key: SPARK-9440
 URL: https://issues.apache.org/jira/browse/SPARK-9440
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Feynman Liang
Priority: Blocker


LocalLDAModel needs to save these parameters in order for {{logPerplexity}} and 
{{bound}} (see SPARK-6793) to work correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646570#comment-14646570
 ] 

Meihua Wu commented on SPARK-9246:
--

Got it. Thanks!

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.
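 A hedged usage sketch of the proposed API (only the method signature comes from 
 the synopsis above; the model value and output formatting are illustrative):
 {code}
 // ldaModel: assumed to be an already trained DistributedLDAModel
 val topDocs: Array[(Array[Long], Array[Double])] =
   ldaModel.topDocumentsPerTopic(maxDocuments = 10)

 topDocs.zipWithIndex.foreach { case ((docIds, weights), topic) =>
   val pairs = docIds.zip(weights).map { case (id, w) => s"$id -> $w" }
   println(s"Topic $topic: " + pairs.mkString(", "))
 }
 {code}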



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Meihua Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646571#comment-14646571
 ] 

Meihua Wu commented on SPARK-9246:
--

Got it. Thanks!

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9445) Adding custom SerDe when creating Hive tables

2015-07-29 Thread Yashwanth Rao Dannamaneni (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yashwanth Rao Dannamaneni closed SPARK-9445.

Resolution: Duplicate

 Adding custom SerDe when creating Hive tables
 -

 Key: SPARK-9445
 URL: https://issues.apache.org/jira/browse/SPARK-9445
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Yashwanth Rao Dannamaneni





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread François Garillot (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646692#comment-14646692
 ] 

François Garillot commented on SPARK-9442:
--

This may not be a Spark issue, but rather a Parquet Hadoop Reader issue. It 
looks like the total time spent reading may be reported wrong here:
https://github.com/Parquet/parquet-mr/blob/master/parquet-hadoop/src/main/java/parquet/hadoop/InternalParquetRecordReader.java#L110

Note this function outputs log messages that look like:
Assembled and processed x records from y columns in t ms: n rec/ms, p cell/ms

Do you see any of those when turning logging on at the INFO level?
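For context, a paraphrase of the suspect pattern (illustrative Scala, not the parquet-mr 
source): the rate in that log line is the record count divided by the elapsed 
milliseconds, so a batch assembled in under one millisecond divides by zero.
{code}
def logReadRate(recordCount: Long, startNanos: Long): Unit = {
  val elapsedMs = (System.nanoTime() - startNanos) / 1000000L  // can be 0 for a fast batch
  val recPerMs  = recordCount / elapsedMs                      // ArithmeticException when elapsedMs == 0
  println(s"Assembled and processed $recordCount records in $elapsedMs ms: $recPerMs rec/ms")
}
// Guarding with something like math.max(elapsedMs, 1) would avoid the crash.
{code}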

 java.lang.ArithmeticException: / by zero when reading Parquet
 -

 Key: SPARK-9442
 URL: https://issues.apache.org/jira/browse/SPARK-9442
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: DB Tsai

 I am counting how many records in my nested parquet file with this schema,
 {code}
 scala u1aTesting.printSchema
 root
  |-- profileId: long (nullable = true)
  |-- country: string (nullable = true)
  |-- data: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- videoId: long (nullable = true)
  |||-- date: long (nullable = true)
  |||-- label: double (nullable = true)
  |||-- weight: double (nullable = true)
  |||-- features: vector (nullable = true)
 {code}
 and the number of the records in the nested data array is around 10k, and 
 each of the parquet file is around 600MB. The total size is around 120GB. 
 I am doing a simple count
 {code}
 scala u1aTesting.count
 parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
 file 
 hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArithmeticException: / by zero
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
   ... 21 more
 {code}
 BTW, not all the tasks fail; some of them are successful. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646705#comment-14646705
 ] 

DB Tsai edited comment on SPARK-9442 at 7/29/15 8:28 PM:
-

Another note: by explicitly looping through the data to count, it works.
{code}
sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
{code}

I think maybe some metadata in the parquet files is corrupted. 


was (Author: dbtsai):
Another note: by explicitly looping through the data to count, it works.
[code]
sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
[code]

I think maybe some metadata in the parquet files is corrupted. 

 java.lang.ArithmeticException: / by zero when reading Parquet
 -

 Key: SPARK-9442
 URL: https://issues.apache.org/jira/browse/SPARK-9442
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: DB Tsai

 I am counting how many records in my nested parquet file with this schema,
 {code}
 scala u1aTesting.printSchema
 root
  |-- profileId: long (nullable = true)
  |-- country: string (nullable = true)
  |-- data: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- videoId: long (nullable = true)
  |||-- date: long (nullable = true)
  |||-- label: double (nullable = true)
  |||-- weight: double (nullable = true)
  |||-- features: vector (nullable = true)
 {code}
 and the number of the records in the nested data array is around 10k, and 
 each of the parquet file is around 600MB. The total size is around 120GB. 
 I am doing a simple count
 {code}
 scala u1aTesting.count
 parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
 file 
 hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArithmeticException: / by zero
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
   ... 21 more
 {code}
 BTW, not all the tasks fail; some of them are successful. 
 Another note: by explicitly looping through the data to count, it works.
 {code}
 sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
 {code}
 I think maybe some metadata in the parquet files is corrupted. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities

2015-07-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646546#comment-14646546
 ] 

Joseph K. Bradley commented on SPARK-6884:
--

That's a shame!  I hope you're able to figure something out for the next patch 
you work on.

 Random forest: predict class probabilities
 --

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 the votes from individual trees for class 1 and dividing by the total number 
 of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.
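 A minimal sketch of the vote-counting idea described above (the helper name is made 
 up and not part of MLlib; it only assumes the existing RandomForestModel.trees and 
 DecisionTreeModel.predict APIs):
 {code}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.tree.model.RandomForestModel

 // Probability of class 1 = share of trees whose prediction is 1.0.
 def predictProbabilityOfOne(model: RandomForestModel, features: Vector): Double = {
   val votes = model.trees.map(_.predict(features))  // each entry is 0.0 or 1.0
   votes.count(_ == 1.0).toDouble / votes.length
 }
 {code}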



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9441) NoSuchMethodError: Com.typesafe.config.Config.getDuration

2015-07-29 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646578#comment-14646578
 ] 

Sean Owen commented on SPARK-9441:
--

Spark uses Akka which uses typesafe config 1.2.1, which contains this method, 
and matches what you're using. Before modifying your project, which you don't 
need to do, use {{mvn dependency:tree}} to verify what version of typesafe 
config you're getting, and from where. My guess is you're getting an older 
version from another dependency, and that's conflicting.

There's a longer story here about Akka, and Typesafe Config, and shading, but 
if I'm right about that, then you just need to manage the version in your app 
in {{dependencyManagement}} and not write exclusions. I think this is less 
than ideal but is still 'by design' while Akka is around here, and should be a 
reliable workaround if that's your app's situation.

 NoSuchMethodError: Com.typesafe.config.Config.getDuration
 -

 Key: SPARK-9441
 URL: https://issues.apache.org/jira/browse/SPARK-9441
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.3.1
Reporter: nirav patel

 I recently migrated my Spark-based REST service from 1.0.2 to 1.3.1.
 15/07/29 10:31:12 INFO spark.SparkContext: Running Spark version 1.3.1
 15/07/29 10:31:12 INFO spark.SecurityManager: Changing view acls to: npatel
 15/07/29 10:31:12 INFO spark.SecurityManager: Changing modify acls to: npatel
 15/07/29 10:31:12 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(npatel); users 
 with modify permissions: Set(npatel)
 Exception in thread "main" java.lang.NoSuchMethodError: 
 com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/TimeUnit;)J
 at 
 akka.util.Helpers$ConfigOps$.akka$util$Helpers$ConfigOps$$getDuration$extension(Helpers.scala:125)
 at akka.util.Helpers$ConfigOps$.getMillisDuration$extension(Helpers.scala:120)
 at akka.actor.ActorSystem$Settings.<init>(ActorSystem.scala:171)
 at akka.actor.ActorSystemImpl.<init>(ActorSystem.scala:504)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:141)
 at akka.actor.ActorSystem$.apply(ActorSystem.scala:118)
 at 
 org.apache.spark.util.AkkaUtils$.org$apache$spark$util$AkkaUtils$$doCreateActorSystem(AkkaUtils.scala:122)
 at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:55)
 at org.apache.spark.util.AkkaUtils$$anonfun$1.apply(AkkaUtils.scala:54)
 at 
 org.apache.spark.util.Utils$$anonfun$startServiceOnPort$1.apply$mcVI$sp(Utils.scala:1837)
 at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
 at org.apache.spark.util.Utils$.startServiceOnPort(Utils.scala:1828)
 at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:57)
 at org.apache.spark.SparkEnv$.create(SparkEnv.scala:223)
 at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:163)
 at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:269)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:272)
 I read blogs where people suggest modifying the classpath to put the right 
 version first, putting the Scala libs first on the classpath, and similar 
 suggestions, which is all ridiculous. I think the typesafe config package 
 included with the spark-core lib is incorrect. I did the following with my 
 Maven build and now it works. But I think someone needs to fix the spark-core 
 package. 
 <dependency>
   <groupId>org.apache.spark</groupId>
   <artifactId>spark-core_2.10</artifactId>
   <exclusions>
     <exclusion>
       <artifactId>config</artifactId>
       <groupId>com.typesafe</groupId>
     </exclusion>
   </exclusions>
 </dependency>
 <dependency>
   <groupId>com.typesafe</groupId>
   <artifactId>config</artifactId>
   <version>1.2.1</version>
 </dependency>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7016) Refactor dev/run-tests(-jenkins) from Bash to Python

2015-07-29 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-7016:
--
Target Version/s: 1.6.0  (was: 1.5.0)

 Refactor dev/run-tests(-jenkins) from Bash to Python
 

 Key: SPARK-7016
 URL: https://issues.apache.org/jira/browse/SPARK-7016
 Project: Spark
  Issue Type: Improvement
  Components: Build, Project Infra
Reporter: Brennon York
Priority: Critical

 Currently the {{dev/run-tests}} and {{dev/run-tests-jenkins}} scripts are 
 written in Bash and becoming quite unwieldy to manage, both in their current 
 state and for future contributions.
 This proposal is to refactor both scripts into Python to allow for better 
 maintainability by the community, make it easier to add features, and 
 provide a simpler approach to calling / running the various test suites.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9442:
---
Description: 
I am counting how many records in my nested parquet file with this schema,

{code}
scala> u1aTesting.printSchema
root
 |-- profileId: long (nullable = true)
 |-- country: string (nullable = true)
 |-- data: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- videoId: long (nullable = true)
 |||-- date: long (nullable = true)
 |||-- label: double (nullable = true)
 |||-- weight: double (nullable = true)
 |||-- features: vector (nullable = true)
{code}

and the number of records in the nested data array is around 10k, and each 
of the parquet files is around 600MB. The total size is around 120GB. 

I am doing a simple count

{code}
scala> u1aTesting.count

parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
file 
hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
... 21 more
{code}

BTW, not all the tasks fail; some of them are successful. 

Another note: by explicitly looping through the data to count, it works.
[code]
sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
[code]

I think maybe some metadata in the parquet files is corrupted. 

  was:
I am counting how many records in my nested parquet file with this schema,

{code}
scala u1aTesting.printSchema
root
 |-- profileId: long (nullable = true)
 |-- country: string (nullable = true)
 |-- data: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- videoId: long (nullable = true)
 |||-- date: long (nullable = true)
 |||-- label: double (nullable = true)
 |||-- weight: double (nullable = true)
 |||-- features: vector (nullable = true)
{code}

and the number of the records in the nested data array is around 10k, and each 
of the parquet file is around 600MB. The total size is around 120GB. 

I am doing a simple count

{code}
scala u1aTesting.count

parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
file 
hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 

[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9442:
---
Description: 
I am counting how many records in my nested parquet file with this schema,

{code}
scala> u1aTesting.printSchema
root
 |-- profileId: long (nullable = true)
 |-- country: string (nullable = true)
 |-- data: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- videoId: long (nullable = true)
 |||-- date: long (nullable = true)
 |||-- label: double (nullable = true)
 |||-- weight: double (nullable = true)
 |||-- features: vector (nullable = true)
{code}

and the number of records in the nested data array is around 10k, and each 
of the parquet files is around 600MB. The total size is around 120GB. 

I am doing a simple count

{code}
scala> u1aTesting.count

parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
file 
hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
... 21 more
{code}

BTW, not all the tasks fail; some of them are successful. 

Another note: by explicitly looping through the data to count, it works.
{code}
sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
{code}

I think maybe some metadata in the parquet files is corrupted. 

  was:
I am counting how many records in my nested parquet file with this schema,

{code}
scala u1aTesting.printSchema
root
 |-- profileId: long (nullable = true)
 |-- country: string (nullable = true)
 |-- data: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- videoId: long (nullable = true)
 |||-- date: long (nullable = true)
 |||-- label: double (nullable = true)
 |||-- weight: double (nullable = true)
 |||-- features: vector (nullable = true)
{code}

and the number of the records in the nested data array is around 10k, and each 
of the parquet file is around 600MB. The total size is around 120GB. 

I am doing a simple count

{code}
scala u1aTesting.count

parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
file 
hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at 

[jira] [Commented] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646705#comment-14646705
 ] 

DB Tsai commented on SPARK-9442:


Another note: by explicitly looping through the data to count, it works.
[code]
sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
[code]

I think maybe some metadata in the parquet files is corrupted. 

 java.lang.ArithmeticException: / by zero when reading Parquet
 -

 Key: SPARK-9442
 URL: https://issues.apache.org/jira/browse/SPARK-9442
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: DB Tsai

 I am counting how many records in my nested parquet file with this schema,
 {code}
 scala u1aTesting.printSchema
 root
  |-- profileId: long (nullable = true)
  |-- country: string (nullable = true)
  |-- data: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- videoId: long (nullable = true)
  |||-- date: long (nullable = true)
  |||-- label: double (nullable = true)
  |||-- weight: double (nullable = true)
  |||-- features: vector (nullable = true)
 {code}
 and the number of the records in the nested data array is around 10k, and 
 each of the parquet file is around 600MB. The total size is around 120GB. 
 I am doing a simple count
 {code}
 scala u1aTesting.count
 parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
 file 
 hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:70)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
   at java.lang.Thread.run(Thread.java:745)
 Caused by: java.lang.ArithmeticException: / by zero
   at 
 parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
   ... 21 more
 {code}
 BTW, not all the tasks fail; some of them are successful. 
 Another note: by explicitly looping through the data to count, it works.
 [code]
 sqlContext.read.load(hdfsPath + s"/testing/u1snappy/${date}/").map(x => 1L).reduce((x, y) => x + y)
 [code]
 I think maybe some metadata in the parquet files is corrupted. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap

2015-07-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646565#comment-14646565
 ] 

Joseph K. Bradley commented on SPARK-9440:
--

This JIRA is marked as a blocker since it is *required* by [SPARK-6793].  If 
this cannot get into 1.5, we need to revert [SPARK-6793], which introduces new 
fields in LDA models which are not saved by save/load.

 LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
 -

 Key: SPARK-9440
 URL: https://issues.apache.org/jira/browse/SPARK-9440
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Feynman Liang
Priority: Blocker

 LocalLDAModel needs to save these parameters in order for {{logPerplexity}} 
 and {{bound}} (see SPARK-6793) to work correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap

2015-07-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9440:
-
Priority: Critical  (was: Blocker)

 LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
 -

 Key: SPARK-9440
 URL: https://issues.apache.org/jira/browse/SPARK-9440
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Feynman Liang
Priority: Critical

 LocalLDAModel needs to save these parameters in order for {{logPerplexity}} 
 and {{bound}} (see SPARK-6793) to work correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap

2015-07-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646565#comment-14646565
 ] 

Joseph K. Bradley edited comment on SPARK-9440 at 7/29/15 6:28 PM:
---

This JIRA is marked as critical since it is *required* by [SPARK-6793].  If 
this cannot get into 1.5, we need to revert [SPARK-6793], which introduces new 
fields in LDA models which are not saved by save/load.


was (Author: josephkb):
This JIRA is marked as a blocker since it is *required* by [SPARK-6793].  If 
this cannot get into 1.5, we need to revert [SPARK-6793], which introduces new 
fields in LDA models which are not saved by save/load.

 LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
 -

 Key: SPARK-9440
 URL: https://issues.apache.org/jira/browse/SPARK-9440
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Feynman Liang
Priority: Critical

 LocalLDAModel needs to save these parameters in order for {{logPerplexity}} 
 and {{bound}} (see SPARK-6793) to work correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9440) LocalLDAModel should save docConcentration, topicConcentration, and gammaShap

2015-07-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9440:
-
Shepherd: Joseph K. Bradley

 LocalLDAModel should save docConcentration, topicConcentration, and gammaShap
 -

 Key: SPARK-9440
 URL: https://issues.apache.org/jira/browse/SPARK-9440
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Feynman Liang
Priority: Blocker

 LocalLDAModel needs to save these parameters in order for {{logPerplexity}} 
 and {{bound}} (see SPARK-6793) to work correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6152) Spark does not support Java 8 compiled Scala classes

2015-07-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646699#comment-14646699
 ] 

Steve Loughran commented on SPARK-6152:
---

Chill and Kryo need to be in sync; there's also the need to be compatible with 
the version Hive uses (which has historically been addressed with custom 
versions of Hive).

If Spark could jump to Kryo 3.x, the classpath conflict with Hive would go away, 
provided the wire formats of serialized classes were compatible: Hive's 
spark-client JAR uses Kryo 2.2.x to talk to Spark.

 Spark does not support Java 8 compiled Scala classes
 

 Key: SPARK-6152
 URL: https://issues.apache.org/jira/browse/SPARK-6152
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.1
 Environment: Java 8+
 Scala 2.11
Reporter: Ronald Chen
Priority: Minor

 Spark uses reflectasm to check Scala closures, which fails if the *user 
 defined Scala closures* are compiled to the Java 8 class version.
 The cause is that reflectasm does not support Java 8:
 https://github.com/EsotericSoftware/reflectasm/issues/35
 Workaround:
 Don't compile Scala classes to Java 8; Scala 2.11 does not support nor 
 require any Java 8 features.
 Stack trace:
 {code}
 java.lang.IllegalArgumentException
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassReader.init(Unknown
  Source)
   at 
 org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$getClassReader(ClosureCleaner.scala:41)
   at 
 org.apache.spark.util.ClosureCleaner$.getInnerClasses(ClosureCleaner.scala:84)
   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:107)
   at org.apache.spark.SparkContext.clean(SparkContext.scala:1478)
   at org.apache.spark.rdd.RDD.map(RDD.scala:288)
   at ...my Scala 2.11 compiled to Java 8 code calling into spark
 {code}
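 A minimal build-level sketch of the workaround above, assuming an sbt build (the 
 flags are standard scalac/javac options; the exact Scala version is illustrative): 
 emit Java 7 class files so reflectasm can parse the user-defined closures.
 {code}
 // build.sbt
 scalaVersion := "2.11.6"
 scalacOptions += "-target:jvm-1.7"                        // don't emit Java 8 (52.0) class files
 javacOptions ++= Seq("-source", "1.7", "-target", "1.7")  // keep any Java sources on 1.7 too
 {code}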



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9352) Add tests for standalone scheduling code

2015-07-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-9352:
-
Labels:   (was: backport-needed)

 Add tests for standalone scheduling code
 

 Key: SPARK-9352
 URL: https://issues.apache.org/jira/browse/SPARK-9352
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Tests
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical
 Fix For: 1.4.2, 1.5.0


 There are no tests for the standalone Master scheduling code! This has caused 
 issues like SPARK-8881 and SPARK-9260 in the past. It is crucial that we have 
 some level of confidence that this code actually works...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-9352) Add tests for standalone scheduling code

2015-07-29 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-9352.

   Resolution: Fixed
Fix Version/s: 1.4.2

 Add tests for standalone scheduling code
 

 Key: SPARK-9352
 URL: https://issues.apache.org/jira/browse/SPARK-9352
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Tests
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Andrew Or
Priority: Critical
 Fix For: 1.4.2, 1.5.0


 There are no tests for the standalone Master scheduling code! This has caused 
 issues like SPARK-8881 and SPARK-9260 in the past. It is crucial that we have 
 some level of confidence that this code actually works...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9442:
---
Description: 
I am counting how many records in my nested parquet file with this schema,

{code}
scala> u1aTesting.printSchema
root
 |-- profileId: long (nullable = true)
 |-- country: string (nullable = true)
 |-- data: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- videoId: long (nullable = true)
 |||-- date: long (nullable = true)
 |||-- label: double (nullable = true)
 |||-- weight: double (nullable = true)
 |||-- features: vector (nullable = true)
{code}

and the number of records in the nested data array is around 10k, and each 
of the parquet files is around 600MB. The total size is around 120GB. 

I am doing a simple count

{code}
scala> u1aTesting.count

parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
file 
hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
... 21 more
{code}


 java.lang.ArithmeticException: / by zero when reading Parquet
 -

 Key: SPARK-9442
 URL: https://issues.apache.org/jira/browse/SPARK-9442
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.1
Reporter: DB Tsai

 I am counting how many records in my nested parquet file with this schema,
 {code}
 scala u1aTesting.printSchema
 root
  |-- profileId: long (nullable = true)
  |-- country: string (nullable = true)
  |-- data: array (nullable = true)
  ||-- element: struct (containsNull = true)
  |||-- videoId: long (nullable = true)
  |||-- date: long (nullable = true)
  |||-- label: double (nullable = true)
  |||-- weight: double (nullable = true)
  |||-- features: vector (nullable = true)
 {code}
 and the number of the records in the nested data array is around 10k, and 
 each of the parquet file is around 600MB. The total size is around 120GB. 
 I am doing a simple count
 {code}
 scala u1aTesting.count
 parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
 file 
 hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
   at 
 parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
   at 
 parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
   at 
 org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
   at 
 

[jira] [Commented] (SPARK-9246) DistributedLDAModel predict top docs per topic

2015-07-29 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14646554#comment-14646554
 ] 

Joseph K. Bradley commented on SPARK-9246:
--

If it's much easier or faster computationally, it's OK if it's approximate.  
(It probably should be analogous to describeTopics, so I guess it will be 
approximate.)  I think we should make both exact at some point, with a little 
more work, but either is OK for now.

 DistributedLDAModel predict top docs per topic
 --

 Key: SPARK-9246
 URL: https://issues.apache.org/jira/browse/SPARK-9246
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Joseph K. Bradley
   Original Estimate: 72h
  Remaining Estimate: 72h

 For each topic, return top documents based on topicDistributions.
 Synopsis:
 {code}
 /**
  * @param maxDocuments  Max docs to return for each topic
  * @return Array over topics of (sorted top docs, corresponding doc-topic 
 weights)
  */
 def topDocumentsPerTopic(maxDocuments: Int): Array[(Array[Long], 
 Array[Double])]
 {code}
 Note: We will need to make sure that the above return value format is 
 Java-friendly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9442) java.lang.ArithmeticException: / by zero when reading Parquet

2015-07-29 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9442?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai updated SPARK-9442:
---
Description: 
I am counting how many records in my nested parquet file with this schema,

{code}
scala> u1aTesting.printSchema
root
 |-- profileId: long (nullable = true)
 |-- country: string (nullable = true)
 |-- data: array (nullable = true)
 ||-- element: struct (containsNull = true)
 |||-- videoId: long (nullable = true)
 |||-- date: long (nullable = true)
 |||-- label: double (nullable = true)
 |||-- weight: double (nullable = true)
 |||-- features: vector (nullable = true)
{code}

The number of records in the nested data array is around 10k, and each parquet 
file is around 600MB; the total size is around 120GB.

I am doing a simple count:

{code}
scala> u1aTesting.count

parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
file 
hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:126)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ArithmeticException: / by zero
at 
parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:109)
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:193)
... 21 more
{code}

BTW, not all the tasks fail; some of them are successful. 

  was:
I am counting how many records are in my nested parquet file, which has this schema:

{code}
scala> u1aTesting.printSchema
root
 |-- profileId: long (nullable = true)
 |-- country: string (nullable = true)
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- videoId: long (nullable = true)
 |    |    |-- date: long (nullable = true)
 |    |    |-- label: double (nullable = true)
 |    |    |-- weight: double (nullable = true)
 |    |    |-- features: vector (nullable = true)
{code}

The number of records in the nested data array is around 10k, and each parquet 
file is around 600MB; the total size is around 120GB.

I am doing a simple count:

{code}
scala> u1aTesting.count

parquet.io.ParquetDecodingException: Can not read value at 100 in block 0 in 
file 
hdfs://compute-1.amazonaws.com:9000/users/dbtsai/testing/u1old/20150721/part-r-00115-d70c946b-b0f0-45fe-9965-b9f062b9ec6d.gz.parquet
at 
parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:213)
at 
parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:204)
at 
org.apache.spark.sql.sources.SqlNewHadoopRDD$$anon$1.hasNext(SqlNewHadoopRDD.scala:163)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$doExecute$1$$anonfun$6.apply(Aggregate.scala:129)
at 

[jira] [Created] (SPARK-9445) Adding custom SerDe when creating Hive tables

2015-07-29 Thread Yashwanth Rao Dannamaneni (JIRA)
Yashwanth Rao Dannamaneni created SPARK-9445:


 Summary: Adding custom SerDe when creating Hive tables
 Key: SPARK-9445
 URL: https://issues.apache.org/jira/browse/SPARK-9445
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: Yashwanth Rao Dannamaneni






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5133) Feature Importance for Decision Tree (Ensembles)

2015-07-29 Thread Parv Oberoi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646493#comment-14646493
 ] 

Parv Oberoi commented on SPARK-5133:


[~josephkb]: Is the plan still to include this in Spark 1.5, considering that 
the code cutoff is this week?

 Feature Importance for Decision Tree (Ensembles)
 

 Key: SPARK-5133
 URL: https://issues.apache.org/jira/browse/SPARK-5133
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Peter Prettenhofer
   Original Estimate: 168h
  Remaining Estimate: 168h

 Add feature importance to decision tree model and tree ensemble models.
 If people are interested in this feature, I could implement it given a mentor 
 (API decisions, etc.). Please find a description of the feature below:
 Decision trees intrinsically perform feature selection by selecting 
 appropriate split points. This information can be used to assess the relative 
 importance of a feature. 
 Relative feature importance gives valuable insight into a decision tree or 
 tree ensemble and can even be used for feature selection.
 More information on feature importance (via decrease in impurity) can be 
 found in ESLII (10.13.1) or here [1].
 R's randomForest package uses a different technique for assessing variable 
 importance that is based on permutation tests.
 All the information needed to compute relative importance scores should be 
 available in the tree representation (class Node: split, impurity gain, 
 (weighted) number of samples?).
 [1] 
 http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
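For reference, a minimal sketch of the impurity-gain formulation described above, written against a simplified node type rather than MLlib's actual Node class (whose fields differ): the importance of a feature is the sum, over internal nodes that split on it, of weighted sample count times impurity gain, normalized to sum to one.

{code}
// Simplified stand-in for the tree node; NOT MLlib's Node class.
case class SimpleNode(
    featureIndex: Int,        // -1 marks a leaf
    gain: Double,             // impurity decrease achieved by this split
    weightedSamples: Double,  // (weighted) number of training rows at the node
    left: Option[SimpleNode],
    right: Option[SimpleNode])

def featureImportances(root: SimpleNode, numFeatures: Int): Array[Double] = {
  val importances = Array.fill(numFeatures)(0.0)
  def visit(node: SimpleNode): Unit = {
    if (node.featureIndex >= 0) {   // internal node
      importances(node.featureIndex) += node.weightedSamples * node.gain
      node.left.foreach(visit)
      node.right.foreach(visit)
    }
  }
  visit(root)
  val total = importances.sum
  if (total > 0.0) importances.map(_ / total) else importances
}

// For an ensemble, average the normalized per-tree vectors (optionally
// weighting each tree by its ensemble weight).
{code}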



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9104) expose network layer memory usage in shuffle part

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9104:
---

Assignee: Apache Spark

 expose network layer memory usage in shuffle part
 -

 Key: SPARK-9104
 URL: https://issues.apache.org/jira/browse/SPARK-9104
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Zhang, Liye
Assignee: Apache Spark

 The default network transport is Netty, and when transferring blocks for 
 shuffle, the network layer will consume a decent amount of memory. We should 
 collect the memory usage of this part and expose it. 
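As a rough illustration of what could be collected, the sketch below reads the pooled-allocator counters that Netty 4.1+ exposes via PooledByteBufAllocator.metric(); the function name is made up here, and the real change would hook the transport layer's shared allocator into the executor metrics system.

{code}
import io.netty.buffer.PooledByteBufAllocator

// Sketch: summarize pooled buffer memory for one allocator (Netty 4.1+ API).
// The allocator argument is a stand-in for the transport layer's shared allocator.
def nettyMemoryUsage(alloc: PooledByteBufAllocator): Map[String, Long] = {
  val m = alloc.metric()
  Map(
    "usedHeapMemory"   -> m.usedHeapMemory(),
    "usedDirectMemory" -> m.usedDirectMemory(),
    "numHeapArenas"    -> m.numHeapArenas().toLong,
    "numDirectArenas"  -> m.numDirectArenas().toLong)
}

// e.g. nettyMemoryUsage(PooledByteBufAllocator.DEFAULT) could be reported as
// gauges through the executor's MetricsSystem.
{code}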



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9104) expose network layer memory usage in shuffle part

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9104:
---

Assignee: (was: Apache Spark)

 expose network layer memory usage in shuffle part
 -

 Key: SPARK-9104
 URL: https://issues.apache.org/jira/browse/SPARK-9104
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Zhang, Liye

 The default network transport is Netty, and when transferring blocks for 
 shuffle, the network layer will consume a decent amount of memory. We should 
 collect the memory usage of this part and expose it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9104) expose network layer memory usage in shuffle part

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646496#comment-14646496
 ] 

Apache Spark commented on SPARK-9104:
-

User 'liyezhang556520' has created a pull request for this issue:
https://github.com/apache/spark/pull/7753

 expose network layer memory usage in shuffle part
 -

 Key: SPARK-9104
 URL: https://issues.apache.org/jira/browse/SPARK-9104
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Reporter: Zhang, Liye

 The default network transport is Netty, and when transferring blocks for 
 shuffle, the network layer will consume a decent amount of memory. We should 
 collect the memory usage of this part and expose it. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8194) date/time function: add_months

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646583#comment-14646583
 ] 

Apache Spark commented on SPARK-8194:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7754

 date/time function: add_months
 --

 Key: SPARK-8194
 URL: https://issues.apache.org/jira/browse/SPARK-8194
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin

 add_months(string start_date, int num_months): string
 add_months(date start_date, int num_months): date
 Returns the date that is num_months after start_date. The time part of 
 start_date is ignored. If start_date is the last day of the month or if the 
 resulting month has fewer days than the day component of start_date, then the 
 result is the last day of the resulting month. Otherwise, the result has the 
 same day component as start_date.
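A small sketch of these semantics using java.time, purely for illustration (Spark SQL's actual implementation works on its internal date representation, not LocalDate):

{code}
import java.time.LocalDate

// Hive-style add_months: snap to the end of the month when start_date is the
// last day of its month; otherwise plusMonths already clamps short months.
def addMonths(startDate: LocalDate, numMonths: Int): LocalDate = {
  val shifted = startDate.plusMonths(numMonths)
  if (startDate.getDayOfMonth == startDate.lengthOfMonth()) {
    shifted.withDayOfMonth(shifted.lengthOfMonth())
  } else {
    shifted
  }
}

// addMonths(LocalDate.of(2015, 2, 28), 1) == 2015-03-31  (last day stays last day)
// addMonths(LocalDate.of(2015, 1, 30), 1) == 2015-02-28  (clamped to shorter month)
{code}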



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9133) Add and Subtract should support date/timestamp and interval type

2015-07-29 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9133:
---

Assignee: Davies Liu  (was: Apache Spark)

 Add and Subtract should support date/timestamp and interval type
 

 Key: SPARK-9133
 URL: https://issues.apache.org/jira/browse/SPARK-9133
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 Should support
 date + interval
 interval + date
 timestamp + interval
 interval + timestamp
 The best way to support this is probably to resolve this to a date 
 add/subtract expression, rather than making add/subtract support these types.
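A minimal sketch of the arithmetic such a dedicated expression would perform, assuming an interval modelled as (months, microseconds) like Spark's CalendarInterval; the real change would be a Catalyst resolution rule producing this expression, not a java.time helper:

{code}
import java.time.LocalDateTime

// Stand-in for Spark's CalendarInterval: calendar months plus a fixed duration.
case class Interval(months: Int, microseconds: Long)

// timestamp + interval, interval + timestamp, and timestamp - interval all
// resolve to this one operation (with negate = true for subtraction).
def timestampAddInterval(ts: LocalDateTime, i: Interval, negate: Boolean = false): LocalDateTime = {
  val sign = if (negate) -1L else 1L
  ts.plusMonths(sign * i.months).plusNanos(sign * i.microseconds * 1000L)
}
{code}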



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9133) Add and Subtract should support date/timestamp and interval type

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646585#comment-14646585
 ] 

Apache Spark commented on SPARK-9133:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7754

 Add and Subtract should support date/timestamp and interval type
 

 Key: SPARK-9133
 URL: https://issues.apache.org/jira/browse/SPARK-9133
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Davies Liu

 Should support
 date + interval
 interval + date
 timestamp + interval
 interval + timestamp
 The best way to support this is probably to resolve this to a date 
 add/subtract expression, rather than making add/subtract support these types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8186) date/time function: date_add

2015-07-29 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646581#comment-14646581
 ] 

Apache Spark commented on SPARK-8186:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/7754

 date/time function: date_add
 

 Key: SPARK-8186
 URL: https://issues.apache.org/jira/browse/SPARK-8186
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin
Assignee: Adrian Wang

 date_add(timestamp startdate, int days): timestamp
 date_add(timestamp startdate, interval i): timestamp
 date_add(date date, int days): date
 date_add(date date, interval i): date
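For the plain (date, int days) overloads, the semantics reduce to simple day arithmetic; a one-line sketch using java.time (illustration only, not the Spark SQL implementation), with the interval overloads deferring to the interval-aware addition sketched under SPARK-9133 above:

{code}
import java.time.LocalDate

def dateAdd(start: LocalDate, days: Int): LocalDate = start.plusDays(days)

// dateAdd(LocalDate.of(2015, 7, 29), 3) == LocalDate.of(2015, 8, 1)
{code}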



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


