[jira] [Updated] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Shaik Idris Ali (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shaik Idris Ali updated SPARK-7706:
---
Labels: oozie yarn  (was: )

 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command-line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like Oozie or Falcon.
 The reason is that the Oozie launcher app starts in one of the containers 
 assigned by the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV on 
 every node in the cluster. Passing an argument like -yarnconfdir with the conf 
 dir (e.g. /etc/hadoop/conf) would avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.






[jira] [Assigned] (SPARK-7704) Updating Programming Guides per SPARK-4397

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7704:
---

Assignee: (was: Apache Spark)

 Updating Programming Guides per SPARK-4397
 --

 Key: SPARK-7704
 URL: https://issues.apache.org/jira/browse/SPARK-7704
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Daisuke Kobayashi

 The change in SPARK-4397 lets the compiler find the implicit objects in 
 SparkContext automatically, so we no longer need to import SparkContext._ 
 explicitly and can remove the note about implicit conversions from the latest 
 Programming Guides (1.3.0 and higher):
 http://spark.apache.org/docs/latest/programming-guide.html
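 For illustration, a minimal snippet (assuming a SparkContext {{sc}} on 1.3.0 or 
 later; the input path is hypothetical) showing a pair-RDD operation compiling 
 without the explicit import:
 {code}
 // No `import org.apache.spark.SparkContext._` needed here: per SPARK-4397 the
 // implicit conversions are picked up automatically.
 val counts = sc.textFile("input.txt")   // hypothetical input path
   .flatMap(_.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _)                   // pair-RDD method resolves without the import
 counts.take(5).foreach(println)
 {code}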






[jira] [Created] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Shaik Idris Ali (JIRA)
Shaik Idris Ali created SPARK-7706:
--

 Summary: Allow setting YARN_CONF_DIR from spark argument
 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali


Currently in SparkSubmitArguments.scala when master is set to yarn 
(yarn-cluster mode)
https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.

However, we should additionally allow passing YARN_CONF_DIR as a command-line 
argument; this is particularly handy when Spark is being launched from 
schedulers like Oozie or Falcon.

The reason is that the Oozie launcher app starts in one of the containers 
assigned by the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV on 
every node in the cluster. Passing an argument like -yarnconfdir with the conf 
dir (e.g. /etc/hadoop/conf) would avoid setting the ENV variable.

This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.
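
A minimal sketch of the proposed precedence (the option name and helper are 
illustrative only, not actual SparkSubmitArguments fields): prefer an explicit 
command-line value, then fall back to the existing environment variables.

{code}
// Illustrative sketch only; `cliYarnConfDir` stands for a hypothetical
// -yarnconfdir argument and is not an actual SparkSubmitArguments field.
def resolveYarnConfDir(cliYarnConfDir: Option[String],
                       env: Map[String, String] = sys.env): Option[String] = {
  cliYarnConfDir                       // e.g. -yarnconfdir /etc/hadoop/conf
    .orElse(env.get("YARN_CONF_DIR"))  // current behavior: check the environment
    .orElse(env.get("HADOOP_CONF_DIR"))
}
{code}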






[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548106#comment-14548106
 ] 

Sean Owen commented on SPARK-7706:
--

Can't you just use {{YARN_CONF_DIR=... command ...}}?

 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command-line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like Oozie or Falcon.
 The reason is that the Oozie launcher app starts in one of the containers 
 assigned by the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV on 
 every node in the cluster. Passing an argument like -yarnconfdir with the conf 
 dir (e.g. /etc/hadoop/conf) would avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.






[jira] [Resolved] (SPARK-6657) Fix Python doc build warnings

2015-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-6657.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6221
[https://github.com/apache/spark/pull/6221]

 Fix Python doc build warnings
 -

 Key: SPARK-6657
 URL: https://issues.apache.org/jira/browse/SPARK-6657
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib, PySpark, SQL, Streaming
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Xiangrui Meng
Priority: Trivial
 Fix For: 1.4.0


 Reported by [~rxin]
 {code}
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainClassifier:15: ERROR: Unexpected 
 indentation.
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainClassifier:16: WARNING: Block quote ends 
 without a blank line; unexpected unindent.
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainClassifier:18: ERROR: Unexpected 
 indentation.
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainClassifier:22: WARNING: Definition list 
 ends without a blank line; unexpected unindent.
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainClassifier:28: WARNING: Definition list 
 ends without a blank line; unexpected unindent.
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainRegressor:13: ERROR: Unexpected 
 indentation.
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainRegressor:14: WARNING: Block quote ends 
 without a blank line; unexpected unindent.
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainRegressor:16: ERROR: Unexpected 
 indentation.
 /scratch/rxin/spark/python/pyspark/mllib/tree.py:docstring of 
 pyspark.mllib.tree.RandomForest.trainRegressor:18: ERROR: Unexpected 
 indentation.
 /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
 pyspark.sql.DataFrame.collect:1: WARNING: Inline interpreted text or phrase 
 reference start-string without end-string.
 /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
 pyspark.sql.DataFrame.orderBy:3: WARNING: Inline interpreted text or phrase 
 reference start-string without end-string.
 /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
 pyspark.sql.DataFrame.sort:3: WARNING: Inline interpreted text or phrase 
 reference start-string without end-string.
 /scratch/rxin/spark/python/pyspark/sql/__init__.py:docstring of 
 pyspark.sql.DataFrame.take:1: WARNING: Inline interpreted text or phrase 
 reference start-string without end-string.
 /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title 
 underline too short.
 pyspark.streaming.kafka module
 
 /scratch/rxin/spark/python/docs/pyspark.streaming.rst:13: WARNING: Title 
 underline too short.
 pyspark.streaming.kafka module
 
 {code}






[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect

2015-05-18 Thread meiyoula (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547956#comment-14547956
 ] 

meiyoula commented on SPARK-7699:
-

If so, I think spark.dynamicAllocation.initialExecutors is no longer needed.

 Config spark.dynamicAllocation.initialExecutors has no effect 
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configuration; the initial executor 
 number is 2.






[jira] [Commented] (SPARK-7705) Cleanup of .sparkStaging directory fails if application is killed

2015-05-18 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547954#comment-14547954
 ] 

Wilfred Spiegelenburg commented on SPARK-7705:
--

I think the limitation that we currently set in ApplicationMaster.scala on line 
[#120|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L120]
 is far too limiting.
The {{cleanupStagingDir(fs)}} call should be moved out of the {{if (!unregistered)}} block.

I have not tested this yet, but it seems far more logical. Since we're in the 
shutdown hook, it should also catch our case:
{code}
if (finalStatus == FinalApplicationStatus.SUCCEEDED || isLastAttempt) {
  // We only want to unregister if we don't want the RM to retry
  if (!unregistered) {
    unregister(finalStatus, finalMsg)
  }
  // Since we're done we should clean up the staging directory
  cleanupStagingDir(fs)
}
{code}

I'm not sure how to create a PR to check and discuss this change.

 Cleanup of .sparkStaging directory fails if application is killed
 -

 Key: SPARK-7705
 URL: https://issues.apache.org/jira/browse/SPARK-7705
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Wilfred Spiegelenburg
Priority: Minor

 When a streaming application is killed while running on YARN, the 
 .sparkStaging directory is not cleaned up. Setting 
 spark.yarn.preserve.staging.files=false does not help and still leaves the 
 files around.
 The changes in SPARK-7503 do not catch this case since there is no exception 
 in the shutdown. When the application gets killed, the AM is told to shut 
 down and the shutdown hook runs, but the cleanup is not triggered.






[jira] [Commented] (SPARK-7565) Broken maps in jsonRDD

2015-05-18 Thread Paul Colomiets (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547957#comment-14547957
 ] 

Paul Colomiets commented on SPARK-7565:
---

The pull request is pretty trivial. I've tested it and it solves the problem 
for me.

 Broken maps in jsonRDD
 --

 Key: SPARK-7565
 URL: https://issues.apache.org/jira/browse/SPARK-7565
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Paul Colomiets

 When I use the following JSON:
 {code}
 {"obj": {"a": "hello"}}
 {code}
 And load it with the following python code:
 {code}
 from pyspark.sql.types import StructType, StructField, MapType, StringType
 tf = sc.textFile('test.json')
 v = sqlContext.jsonRDD(tf, StructType([StructField("obj", 
 MapType(StringType(), StringType()), True)]))
 v.save('test.parquet', mode='overwrite')
 {code}
 I get the following error in spark master branch:
 {code}
 Py4JJavaError: An error occurred while calling o78.save.
 : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 
 in stage 5.0 failed 1 times, most recent failure: Lost task 1.0 in stage 5.0 
 (TID 11, localhost): java.lang.ClassCastException: java.lang.String cannot be 
 cast to org.apache.spark.sql.types.UTF8String
 at 
 org.apache.spark.sql.parquet.RowWriteSupport.writePrimitive(ParquetTableSupport.scala:201)
 at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:192)
 at 
 org.apache.spark.sql.parquet.RowWriteSupport$$anonfun$writeMap$2.apply(ParquetTableSupport.scala:284)
 at 
 org.apache.spark.sql.parquet.RowWriteSupport$$anonfun$writeMap$2.apply(ParquetTableSupport.scala:281)
 at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 at scala.collection.immutable.Map$Map1.foreach(Map.scala:109)
 at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeMap(ParquetTableSupport.scala:281)
 at 
 org.apache.spark.sql.parquet.RowWriteSupport.writeValue(ParquetTableSupport.scala:186)
 at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:171)
 at 
 org.apache.spark.sql.parquet.RowWriteSupport.write(ParquetTableSupport.scala:134)
 at 
 parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:120)
 at 
 parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:81)
 at 
 parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:37)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:699)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:717)
 at 
 org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:717)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
 at org.apache.spark.scheduler.Task.run(Task.scala:70)
 at 
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
 at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 {code}
 This worked well in Spark 1.3.






[jira] [Updated] (SPARK-7705) Cleanup of .sparkStaging directory fails if application is killed

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7705:
-
  Priority: Minor  (was: Major)
Issue Type: Improvement  (was: Bug)

Minor, since this just amounts to leaving directories around.

 Cleanup of .sparkStaging directory fails if application is killed
 -

 Key: SPARK-7705
 URL: https://issues.apache.org/jira/browse/SPARK-7705
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.3.0
Reporter: Wilfred Spiegelenburg
Priority: Minor

 When a streaming application is killed while running on YARN, the 
 .sparkStaging directory is not cleaned up. Setting 
 spark.yarn.preserve.staging.files=false does not help and still leaves the 
 files around.
 The changes in SPARK-7503 do not catch this case since there is no exception 
 in the shutdown. When the application gets killed, the AM is told to shut 
 down and the shutdown hook runs, but the cleanup is not triggered.






[jira] [Reopened] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-5265:
--

 Submitting applications on Standalone cluster controlled by Zookeeper forces 
 to know active master
 --

 Key: SPARK-5265
 URL: https://issues.apache.org/jira/browse/SPARK-5265
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Roque Vassal'lo
  Labels: cluster, spark-submit, standalone, zookeeper

 Hi, this is my first JIRA here, so I hope it is clear enough.
 I'm using Spark 1.2.0 and trying to submit an application on a Spark 
 Standalone cluster in cluster deploy mode with supervise.
 Standalone cluster is running in high availability mode, using Zookeeper to 
 provide leader election between three available Masters (named master1, 
 master2 and master3).
 As described in Spark's documentation, to register a Worker with the Standalone 
 cluster, I provide the complete cluster info as the spark URL,
 i.e. spark://master1:7077,master2:7077,master3:7077,
 and that URL is parsed and three attempts are launched, the first to 
 master1:7077, the second to master2:7077 and the third to master3:7077.
 This works great!
 But if I try to do the same while submitting applications, it fails.
 That is, if I provide the complete cluster info as the --master option to the 
 spark-submit script, it throws an exception because it tries to connect as if 
 it were a single node.
 Example:
 spark-submit --class org.apache.spark.examples.SparkPi --master 
 spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster 
 --supervise examples.jar 100
 This is the output I got:
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mytest); users 
 with modify permissions: Set(mytest)
 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started
 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on 
 port 53930.
 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
 akka.actor.ActorInitializationException: exception during creation
   at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
   at akka.actor.ActorCell.create(ActorCell.scala:596)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
   at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
   at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
   at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
   at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
   at akka.actor.ActorCell.create(ActorCell.scala:580)
   ... 9 more
 Shouldn't it be parsed the same way as on Worker registration?
 That would not force the client to know which Master of the 
 Standalone cluster is currently active.
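 A hedged sketch of the parsing being asked for (illustrative only, not the 
 actual Master/Worker code): split the comma-separated host list into one URL 
 per master and try each in turn, as Worker registration does.
 {code}
 // Illustrative only: expand "spark://host1:port1,host2:port2,..." into one URL per master.
 def expandMasterUrls(master: String): Seq[String] =
   master.stripPrefix("spark://").split(",").map(hostPort => s"spark://$hostPort").toSeq

 // expandMasterUrls("spark://master1:7077,master2:7077,master3:7077")
 //   == Seq("spark://master1:7077", "spark://master2:7077", "spark://master3:7077")
 {code}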






[jira] [Assigned] (SPARK-7704) Updating Programming Guides per SPARK-4397

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7704:
---

Assignee: Apache Spark

 Updating Programming Guides per SPARK-4397
 --

 Key: SPARK-7704
 URL: https://issues.apache.org/jira/browse/SPARK-7704
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.3.0
Reporter: Daisuke Kobayashi
Assignee: Apache Spark

 The change in SPARK-4397 lets the compiler find the implicit objects in 
 SparkContext automatically, so we no longer need to import SparkContext._ 
 explicitly and can remove the note about implicit conversions from the latest 
 Programming Guides (1.3.0 and higher):
 http://spark.apache.org/docs/latest/programming-guide.html






[jira] [Created] (SPARK-7705) Cleanup of .sparkStaging directory fails if application is killed

2015-05-18 Thread Wilfred Spiegelenburg (JIRA)
Wilfred Spiegelenburg created SPARK-7705:


 Summary: Cleanup of .sparkStaging directory fails if application 
is killed
 Key: SPARK-7705
 URL: https://issues.apache.org/jira/browse/SPARK-7705
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.3.0
Reporter: Wilfred Spiegelenburg


When a streaming application is killed while running on YARN, the .sparkStaging 
directory is not cleaned up. Setting spark.yarn.preserve.staging.files=false 
does not help and still leaves the files around.

The changes in SPARK-7503 do not catch this case since there is no exception in 
the shutdown. When the application gets killed, the AM is told to shut down and 
the shutdown hook runs, but the cleanup is not triggered.






[jira] [Commented] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master

2015-05-18 Thread Roque Vassal'lo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547963#comment-14547963
 ] 

Roque Vassal'lo commented on SPARK-5265:


Yes, SPARK-6443 and this jira are the same.

 Submitting applications on Standalone cluster controlled by Zookeeper forces 
 to know active master
 --

 Key: SPARK-5265
 URL: https://issues.apache.org/jira/browse/SPARK-5265
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Roque Vassal'lo
  Labels: cluster, spark-submit, standalone, zookeeper

 Hi, this is my first JIRA here, so I hope it is clear enough.
 I'm using Spark 1.2.0 and trying to submit an application on a Spark 
 Standalone cluster in cluster deploy mode with supervise.
 Standalone cluster is running in high availability mode, using Zookeeper to 
 provide leader election between three available Masters (named master1, 
 master2 and master3).
 As described in Spark's documentation, to register a Worker with the Standalone 
 cluster, I provide the complete cluster info as the spark URL,
 i.e. spark://master1:7077,master2:7077,master3:7077,
 and that URL is parsed and three attempts are launched, the first to 
 master1:7077, the second to master2:7077 and the third to master3:7077.
 This works great!
 But if I try to do the same while submitting applications, it fails.
 That is, if I provide the complete cluster info as the --master option to the 
 spark-submit script, it throws an exception because it tries to connect as if 
 it were a single node.
 Example:
 spark-submit --class org.apache.spark.examples.SparkPi --master 
 spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster 
 --supervise examples.jar 100
 This is the output I got:
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mytest); users 
 with modify permissions: Set(mytest)
 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started
 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on 
 port 53930.
 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
 akka.actor.ActorInitializationException: exception during creation
   at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
   at akka.actor.ActorCell.create(ActorCell.scala:596)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
   at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
   at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
   at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
   at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
   at akka.actor.ActorCell.create(ActorCell.scala:580)
   ... 9 more
 Shouldn't it be parsed the same way as on Worker registration?
 That would not force the client to know which Master of the 
 Standalone cluster is currently active.






[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect

2015-05-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547974#comment-14547974
 ] 

Sean Owen commented on SPARK-7699:
--

It would make a difference if the program immediately executed an operation 
that needed more than the minimum number of executors, but a spark-shell idling 
doesn't do that.
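
A hedged illustration of that point (the configuration values come from the 
report; the job itself is hypothetical): only work submitted right at startup 
that needs more concurrent tasks than minExecutors can run would keep the extra 
initial executors busy.

{code}
// Illustrative only. With dynamic allocation enabled and the reported settings,
// an idle spark-shell quickly settles at minExecutors (2); a job with immediate
// demand across many partitions is what would make initialExecutors (3) observable.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("initial-executors-demo")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true") // dynamic allocation needs the external shuffle service
  .set("spark.dynamicAllocation.minExecutors", "2")
  .set("spark.dynamicAllocation.initialExecutors", "3")
  .set("spark.dynamicAllocation.maxExecutors", "4")
val sc = new SparkContext(conf)

// Immediate work across many partitions creates the pending-task backlog
// that dynamic allocation reacts to.
sc.parallelize(1 to 1000000, 16).map(_ * 2).count()
{code}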

 Config spark.dynamicAllocation.initialExecutors has no effect 
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configuration; the initial executor 
 number is 2.






[jira] [Resolved] (SPARK-5265) Submitting applications on Standalone cluster controlled by Zookeeper forces to know active master

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-5265.
--
Resolution: Duplicate

 Submitting applications on Standalone cluster controlled by Zookeeper forces 
 to know active master
 --

 Key: SPARK-5265
 URL: https://issues.apache.org/jira/browse/SPARK-5265
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Reporter: Roque Vassal'lo
  Labels: cluster, spark-submit, standalone, zookeeper

 Hi, this is my first JIRA here, so I hope it is clear enough.
 I'm using Spark 1.2.0 and trying to submit an application on a Spark 
 Standalone cluster in cluster deploy mode with supervise.
 Standalone cluster is running in high availability mode, using Zookeeper to 
 provide leader election between three available Masters (named master1, 
 master2 and master3).
 As described in Spark's documentation, to register a Worker with the Standalone 
 cluster, I provide the complete cluster info as the spark URL,
 i.e. spark://master1:7077,master2:7077,master3:7077,
 and that URL is parsed and three attempts are launched, the first to 
 master1:7077, the second to master2:7077 and the third to master3:7077.
 This works great!
 But if I try to do the same while submitting applications, it fails.
 That is, if I provide the complete cluster info as the --master option to the 
 spark-submit script, it throws an exception because it tries to connect as if 
 it were a single node.
 Example:
 spark-submit --class org.apache.spark.examples.SparkPi --master 
 spark://master1:7077,master2:7077,master3:7077 --deploy-mode cluster 
 --supervise examples.jar 100
 This is the output I got:
 Using Spark's default log4j profile: 
 org/apache/spark/log4j-defaults.properties
 15/01/14 17:02:11 INFO SecurityManager: Changing view acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: Changing modify acls to: mytest
 15/01/14 17:02:11 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mytest); users 
 with modify permissions: Set(mytest)
 15/01/14 17:02:11 INFO Slf4jLogger: Slf4jLogger started
 15/01/14 17:02:11 INFO Utils: Successfully started service 'driverClient' on 
 port 53930.
 15/01/14 17:02:11 ERROR OneForOneStrategy: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
 akka.actor.ActorInitializationException: exception during creation
   at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
   at akka.actor.ActorCell.create(ActorCell.scala:596)
   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
   at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 Caused by: org.apache.spark.SparkException: Invalid master URL: 
 spark://master1:7077,master2:7077,master3:7077
   at org.apache.spark.deploy.master.Master$.toAkkaUrl(Master.scala:830)
   at org.apache.spark.deploy.ClientActor.preStart(Client.scala:42)
   at akka.actor.Actor$class.aroundPreStart(Actor.scala:470)
   at org.apache.spark.deploy.ClientActor.aroundPreStart(Client.scala:35)
   at akka.actor.ActorCell.create(ActorCell.scala:580)
   ... 9 more
 Shouldn't it be parsed the same way as on Worker registration?
 That would not force the client to know which Master of the 
 Standalone cluster is currently active.






[jira] [Commented] (SPARK-7700) Spark 1.3.0 on YARN: Application failed 2 times due to AM Container

2015-05-18 Thread Kaveen Raajan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547641#comment-14547641
 ] 

Kaveen Raajan commented on SPARK-7700:
--

Hi [~srowen]

I'm sure there is no space anywhere in my Hadoop and Spark paths.

Since I am running in a Windows environment, I replaced spark-env.sh with 
spark-env.cmd, which contains 
_SET SPARK_JAR=hdfs://master:9000/user/spark/jar_
*Note:* the Spark jar files were moved to the specified HDFS location. The Spark 
classpath was also added to hadoop-config.cmd, and HADOOP_CONF_DIR was set as an 
environment variable.

I also didn't put any JVM arg where a class belongs. Looking at the 
launch-container.cmd file, these lines are executed:
{panel}
@C:\Hadoop\bin\winutils.exe symlink __spark__.jar 
\tmp\hadoop-HDFS\nm-local-dir\filecache\10\jar
@C:\Hadoop\bin\winutils.exe symlink __app__.jar 
\tmp\hadoop-HDFS\nm-local-dir\usercache\HDFS\filecache\10\spark-examples-1.3.1-hadoop2.5.2.jar
@call %JAVA_HOME%/bin/java -server -Xmx4096m -Djava.io.tmpdir=%PWD%/tmp 
'-Dspark.executor.memory=2g' 
'-Dspark.app.name=org.apache.spark.examples.SparkPi' 
'-Dspark.master=yarn-cluster' 
-Dspark.yarn.app.container.log.dir=C:/Hadoop/logs/userlogs/application_1431924261044_0002/container_1431924261044_0002_01_01
 org.apache.spark.deploy.yarn.ApplicationMaster --class 
'org.apache.spark.examples.SparkPi' --jar 
file:/C:/Spark/lib/spark-examples-1.3.1-hadoop2.5.2.jar --arg '10' 
--executor-memory 2048m --executor-cores 1 --num-executors 3 
1> C:/Hadoop/logs/userlogs/application_1431924261044_0002/container_1431924261044_0002_01_01/stdout 
2> C:/Hadoop/logs/userlogs/application_1431924261044_0002/container_1431924261044_0002_01_01/stderr
{panel}

 Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
 ---

 Key: SPARK-7700
 URL: https://issues.apache.org/jira/browse/SPARK-7700
 Project: Spark
  Issue Type: Story
  Components: Build
Affects Versions: 1.3.1
 Environment: windows 8 Single language
 Hadoop-2.5.2
 Protocol Buffer-2.5.0
 Scala-2.11
Reporter: Kaveen Raajan

 I built Spark in YARN mode with the following command. The build succeeded.
 {panel}
 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 
 -Phive-thriftserver -DskipTests clean package
 {panel}
 I set the following property in the spark-env.cmd file:
 {panel}
 SET SPARK_JAR=hdfs://master:9000/user/spark/jar
 {panel}
 *Note:* the Spark jar files were moved to the specified HDFS location. The 
 Spark classpath was also added to hadoop-config.cmd, and HADOOP_CONF_DIR was 
 set as an environment variable.
 I tried to execute the following SparkPi example in yarn-cluster mode.
 {panel}
 spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster 
 --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 
 --queue default 
 S:\Hadoop\Spark\spark-1.3.1\examples\target\spark-examples_2.10-1.3.1.jar 10
 {panel}
 My job was submitted to the Hadoop cluster, but it stayed in the ACCEPTED 
 state and then failed with the following error:
 {panel}
 15/05/14 13:00:51 INFO client.RMProxy: Connecting to ResourceManager at 
 /0.0.0.0:8032
 15/05/14 13:00:51 INFO yarn.Client: Requesting a new application from cluster 
 with 1 NodeManagers
 15/05/14 13:00:51 INFO yarn.Client: Verifying our application has not 
 requested more than the maximum memory capability of the cluster (8048 MB per 
 container)
 15/05/14 13:00:51 INFO yarn.Client: Will allocate AM container, with 4480 MB 
 memory including 384 MB overhead
 15/05/14 13:00:51 INFO yarn.Client: Setting up container launch context for 
 our AM
 15/05/14 13:00:51 INFO yarn.Client: Preparing resources for our AM container
 15/05/14 13:00:52 INFO yarn.Client: Source and destination file systems are 
 the same. Not copying hdfs://master:9000/user/spark/jar
 15/05/14 13:00:52 INFO yarn.Client: Uploading resource 
 file:/S:/Hadoop/Spark/spark-1.3.1/examples/target/spark-examples_2.10-1.3.1.jar
  - 
 hdfs://master:9000/user/HDFS/.sparkStaging/application_1431587916618_0003/spark-examples_2.10-1.3.1.jar
 15/05/14 13:00:52 INFO yarn.Client: Setting up the launch environment for our 
 AM container
 15/05/14 13:00:52 INFO spark.SecurityManager: Changing view acls to: HDFS
 15/05/14 13:00:52 INFO spark.SecurityManager: Changing modify acls to: HDFS
 15/05/14 13:00:52 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(HDFS); users 
 with modify permissions: Set(HDFS)
 15/05/14 13:00:52 INFO yarn.Client: Submitting application 3 to 
 ResourceManager
 15/05/14 13:00:52 INFO impl.YarnClientImpl: Submitted application 
 application_1431587916618_0003
 15/05/14 13:00:53 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:53 INFO 

[jira] [Updated] (SPARK-7661) Support for dynamic allocation of resources in Kinesis Spark Streaming

2015-05-18 Thread Murtaza Kanchwala (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Murtaza Kanchwala updated SPARK-7661:
-
Summary: Support for dynamic allocation of resources in Kinesis Spark 
Streaming  (was: Support for dynamic allocation of executors in Kinesis Spark 
Streaming)

 Support for dynamic allocation of resources in Kinesis Spark Streaming
 --

 Key: SPARK-7661
 URL: https://issues.apache.org/jira/browse/SPARK-7661
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.3.1
 Environment: AWS-EMR
Reporter: Murtaza Kanchwala

 Currently the number of cores required is (N + 1), where N is the number of 
 shards in a Kinesis stream.
 My requirement is that if I use this resharding util for Amazon Kinesis,
 Amazon Kinesis Resharding: 
 https://github.com/awslabs/amazon-kinesis-scaling-utils
 then there should be some way to allocate executors on the basis of the number 
 of shards directly (for Spark Streaming only).
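 A hedged sketch of the sizing rule stated above (helper names are illustrative 
 only): with N shards needing N + 1 cores, the executor request could be derived 
 from the shard count reported by the resharding util.
 {code}
 // Illustrative only: derive resource requests from the current shard count,
 // using the N + 1 cores rule described in this issue.
 def requiredCores(numShards: Int): Int = numShards + 1

 def requiredExecutors(numShards: Int, coresPerExecutor: Int): Int =
   math.ceil(requiredCores(numShards).toDouble / coresPerExecutor).toInt

 // requiredExecutors(numShards = 8, coresPerExecutor = 2) == 5
 {code}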






[jira] [Resolved] (SPARK-7700) Spark 1.3.0 on YARN: Application failed 2 times due to AM Container

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7700.
--
Resolution: Duplicate

Given you're on Windows, this sounds almost exactly like SPARK-5754

 Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
 ---

 Key: SPARK-7700
 URL: https://issues.apache.org/jira/browse/SPARK-7700
 Project: Spark
  Issue Type: Story
  Components: Build
Affects Versions: 1.3.1
 Environment: windows 8 Single language
 Hadoop-2.5.2
 Protocol Buffer-2.5.0
 Scala-2.11
Reporter: Kaveen Raajan

 I built Spark in YARN mode with the following command. The build succeeded.
 {panel}
 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 
 -Phive-thriftserver -DskipTests clean package
 {panel}
 I set the following property in the spark-env.cmd file:
 {panel}
 SET SPARK_JAR=hdfs://master:9000/user/spark/jar
 {panel}
 *Note:* the Spark jar files were moved to the specified HDFS location. The 
 Spark classpath was also added to hadoop-config.cmd, and HADOOP_CONF_DIR was 
 set as an environment variable.
 I tried to execute the following SparkPi example in yarn-cluster mode.
 {panel}
 spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster 
 --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 
 --queue default 
 S:\Hadoop\Spark\spark-1.3.1\examples\target\spark-examples_2.10-1.3.1.jar 10
 {panel}
 My job was submitted to the Hadoop cluster, but it stayed in the ACCEPTED 
 state and then failed with the following error:
 {panel}
 15/05/14 13:00:51 INFO client.RMProxy: Connecting to ResourceManager at 
 /0.0.0.0:8032
 15/05/14 13:00:51 INFO yarn.Client: Requesting a new application from cluster 
 with 1 NodeManagers
 15/05/14 13:00:51 INFO yarn.Client: Verifying our application has not 
 requested more than the maximum memory capability of the cluster (8048 MB per 
 container)
 15/05/14 13:00:51 INFO yarn.Client: Will allocate AM container, with 4480 MB 
 memory including 384 MB overhead
 15/05/14 13:00:51 INFO yarn.Client: Setting up container launch context for 
 our AM
 15/05/14 13:00:51 INFO yarn.Client: Preparing resources for our AM container
 15/05/14 13:00:52 INFO yarn.Client: Source and destination file systems are 
 the same. Not copying hdfs://master:9000/user/spark/jar
 15/05/14 13:00:52 INFO yarn.Client: Uploading resource 
 file:/S:/Hadoop/Spark/spark-1.3.1/examples/target/spark-examples_2.10-1.3.1.jar
  - 
 hdfs://master:9000/user/HDFS/.sparkStaging/application_1431587916618_0003/spark-examples_2.10-1.3.1.jar
 15/05/14 13:00:52 INFO yarn.Client: Setting up the launch environment for our 
 AM container
 15/05/14 13:00:52 INFO spark.SecurityManager: Changing view acls to: HDFS
 15/05/14 13:00:52 INFO spark.SecurityManager: Changing modify acls to: HDFS
 15/05/14 13:00:52 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(HDFS); users 
 with modify permissions: Set(HDFS)
 15/05/14 13:00:52 INFO yarn.Client: Submitting application 3 to 
 ResourceManager
 15/05/14 13:00:52 INFO impl.YarnClientImpl: Submitted application 
 application_1431587916618_0003
 15/05/14 13:00:53 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:53 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 1431588652790
  final status: UNDEFINED
  tracking URL: 
 http://master:8088/proxy/application_1431587916618_0003/
  user: HDFS
 15/05/14 13:00:54 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:55 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:56 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:57 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:58 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:59 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: FAILED)
 15/05/14 13:00:59 INFO yarn.Client:
  client token: N/A
  diagnostics: Application application_1431587916618_0003 failed 2 
 times
 due to AM Container for appattempt_1431587916618_0003_02 exited with  
 exitCode: 1
 For more detailed output, check application tracking 
 page:http://master:8088/proxy/application_1431587916618_0003/Then, click on 
 links to logs of each attempt.
 Diagnostics: Exception from container-launch.
 Container id: 

[jira] [Resolved] (SPARK-7299) saving Oracle-source DataFrame to Hive changes scale

2015-05-18 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-7299.

   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Liang-Chi Hsieh

 saving Oracle-source DataFrame to Hive changes scale
 

 Key: SPARK-7299
 URL: https://issues.apache.org/jira/browse/SPARK-7299
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Ken Geis
Assignee: Liang-Chi Hsieh
 Fix For: 1.4.0


 When I load data from Oracle, save it to a table, then select from it, the 
 scale is changed.
 For example, I have a column defined as NUMBER(12, 2). I insert 1 into 
 the column. When I write that to a table and select from it, the result is 
 199.99.
 Some databases (e.g. H2) will return this as 1.00, but Oracle returns it 
 as 1. I believe that when the file is written out to parquet, the scale 
 information is taken from the schema, not the value. In an Oracle (at least) 
 JDBC DataFrame, the scale may be different from row to row.






[jira] [Commented] (SPARK-7150) SQLContext.range()

2015-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547710#comment-14547710
 ] 

Apache Spark commented on SPARK-7150:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/6230

 SQLContext.range()
 --

 Key: SPARK-7150
 URL: https://issues.apache.org/jira/browse/SPARK-7150
 Project: Spark
  Issue Type: Sub-task
  Components: ML, SQL
Reporter: Joseph K. Bradley
Assignee: Adrian Wang
Priority: Minor
  Labels: starter

 It would be handy to have easy ways to construct random columns for 
 DataFrames.  Proposed API:
 {code}
 class SQLContext {
   // Return a DataFrame with a single column named "id" that has
   // consecutive values from 0 to n.
   def range(n: Long): DataFrame
   def range(n: Long, numPartitions: Int): DataFrame
 }
 {code}
 Usage:
 {code}
 // uniform distribution
 ctx.range(1000).select(rand())
 // normal distribution
 ctx.range(1000).select(randn())
 {code}
 We should add a RangeIterator that supports long start/stop positions, and 
 then use it to create an RDD as the basis for this DataFrame.
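 A hedged sketch of the RangeIterator mentioned above (illustrative only, not 
 the eventual implementation): an iterator over consecutive Long values in 
 [start, stop) that never materialises the whole range.
 {code}
 class RangeIterator(start: Long, stop: Long) extends Iterator[Long] {
   private[this] var current = start
   override def hasNext: Boolean = current < stop
   override def next(): Long = {
     val value = current
     current += 1
     value
   }
 }

 // new RangeIterator(0L, 5L).toSeq == Seq(0L, 1L, 2L, 3L, 4L)
 {code}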






[jira] [Commented] (SPARK-6416) RDD.fold() requires the operator to be commutative

2015-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547718#comment-14547718
 ] 

Apache Spark commented on SPARK-6416:
-

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/6231

 RDD.fold() requires the operator to be commutative
 --

 Key: SPARK-6416
 URL: https://issues.apache.org/jira/browse/SPARK-6416
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Reporter: Josh Rosen
Priority: Critical

 Spark's {{RDD.fold}} operation has some confusing behaviors when a 
 non-commutative reduce function is used.
 Here's an example, which was originally reported on StackOverflow 
 (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
 {code}
 >>> sc.parallelize([1,25,8,4,2]).fold(0, lambda a, b: a+1)
 8
 {code}
 To understand what's going on here, let's look at the definition of Spark's 
 `fold` operation.  
 I'm going to show the Python version of the code, but the Scala version 
 exhibits the exact same behavior (you can also [browse the source on 
 GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]):
 {code}
 def fold(self, zeroValue, op):
     """
     Aggregate the elements of each partition, and then the results for all
     the partitions, using a given associative function and a neutral "zero
     value."

     The function C{op(t1, t2)} is allowed to modify C{t1} and return it
     as its result value to avoid object allocation; however, it should not
     modify C{t2}.

     >>> from operator import add
     >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
     15
     """
     def func(iterator):
         acc = zeroValue
         for obj in iterator:
             acc = op(obj, acc)
         yield acc
     vals = self.mapPartitions(func).collect()
     return reduce(op, vals, zeroValue)
 {code}
 (For comparison, see the [Scala implementation of 
 `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
 Spark's `fold` operates by first folding each partition and then folding the 
 results.  The problem is that an empty partition gets folded down to the zero 
 element, so the final driver-side fold ends up folding one value for _every_ 
 partition rather than one value for each _non-empty_ partition.  This means 
 that the result of `fold` is sensitive to the number of partitions:
 {code}
 >>> sc.parallelize([1,25,8,4,2], 100).fold(0, lambda a, b: a+1)
 100
 >>> sc.parallelize([1,25,8,4,2], 50).fold(0, lambda a, b: a+1)
 50
 >>> sc.parallelize([1,25,8,4,2], 1).fold(0, lambda a, b: a+1)
 1
 {code}
 In this last case, what's happening is that the single partition is being 
 folded down to the correct value, then that value is folded with the 
 zero-value at the driver to yield 1.
 I think the underlying problem here is that our fold() operation implicitly 
 requires the operator to be commutative in addition to associative, but this 
 isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
 Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
 Therefore, I think we should update the documentation and examples to clarify 
 this requirement and explain that our fold acts more like a reduce with a 
 default value than the type of ordering-sensitive fold() that users may 
 expect in functional languages.
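 A hedged Scala illustration of the same requirement (assumes a SparkContext 
 {{sc}}): string concatenation is associative but not commutative, so the merged 
 result can depend on the order in which partition results reach the driver.
 {code}
 // Illustrative only. Each partition folds to a local string; the driver then
 // merges the per-partition results as they arrive, so with a non-commutative
 // op the final value is not guaranteed to be "abcd".
 val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)
 val folded = rdd.fold("")(_ + _)
 println(folded)  // may print "abcd" or, e.g., "cdab" depending on merge order
 {code}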






[jira] [Assigned] (SPARK-6416) RDD.fold() requires the operator to be commutative

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6416:
---

Assignee: (was: Apache Spark)

 RDD.fold() requires the operator to be commutative
 --

 Key: SPARK-6416
 URL: https://issues.apache.org/jira/browse/SPARK-6416
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Reporter: Josh Rosen
Priority: Critical

 Spark's {{RDD.fold}} operation has some confusing behaviors when a 
 non-commutative reduce function is used.
 Here's an example, which was originally reported on StackOverflow 
 (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
 {code}
 >>> sc.parallelize([1,25,8,4,2]).fold(0, lambda a, b: a+1)
 8
 {code}
 To understand what's going on here, let's look at the definition of Spark's 
 `fold` operation.  
 I'm going to show the Python version of the code, but the Scala version 
 exhibits the exact same behavior (you can also [browse the source on 
 GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]):
 {code}
 def fold(self, zeroValue, op):
     """
     Aggregate the elements of each partition, and then the results for all
     the partitions, using a given associative function and a neutral "zero
     value."

     The function C{op(t1, t2)} is allowed to modify C{t1} and return it
     as its result value to avoid object allocation; however, it should not
     modify C{t2}.

     >>> from operator import add
     >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
     15
     """
     def func(iterator):
         acc = zeroValue
         for obj in iterator:
             acc = op(obj, acc)
         yield acc
     vals = self.mapPartitions(func).collect()
     return reduce(op, vals, zeroValue)
 {code}
 (For comparison, see the [Scala implementation of 
 `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
 Spark's `fold` operates by first folding each partition and then folding the 
 results.  The problem is that an empty partition gets folded down to the zero 
 element, so the final driver-side fold ends up folding one value for _every_ 
 partition rather than one value for each _non-empty_ partition.  This means 
 that the result of `fold` is sensitive to the number of partitions:
 {code}
 >>> sc.parallelize([1,25,8,4,2], 100).fold(0, lambda a, b: a+1)
 100
 >>> sc.parallelize([1,25,8,4,2], 50).fold(0, lambda a, b: a+1)
 50
 >>> sc.parallelize([1,25,8,4,2], 1).fold(0, lambda a, b: a+1)
 1
 {code}
 In this last case, what's happening is that the single partition is being 
 folded down to the correct value, then that value is folded with the 
 zero-value at the driver to yield 1.
 I think the underlying problem here is that our fold() operation implicitly 
 requires the operator to be commutative in addition to associative, but this 
 isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
 Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
 Therefore, I think we should update the documentation and examples to clarify 
 this requirement and explain that our fold acts more like a reduce with a 
 default value than the type of ordering-sensitive fold() that users may 
 expect in functional languages.






[jira] [Created] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect

2015-05-18 Thread meiyoula (JIRA)
meiyoula created SPARK-7699:
---

 Summary: Config spark.dynamicAllocation.initialExecutors has no 
effect 
 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula


spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.initialExecutors  3
spark.dynamicAllocation.maxExecutors 4

Just run spark-shell with the above configuration; the initial executor number 
is 2.






[jira] [Resolved] (SPARK-7700) Spark 1.3.0 on YARN: Application failed 2 times due to AM Container

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-7700.
--
Resolution: Not A Problem

This sounds like a problem with your configuration, like you've put a JVM arg 
somewhere where a class belongs. What is spark-env.sh?

Another random question -- do your paths have a space in them anywhere?

 Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
 ---

 Key: SPARK-7700
 URL: https://issues.apache.org/jira/browse/SPARK-7700
 Project: Spark
  Issue Type: Story
  Components: Build
Affects Versions: 1.3.1
 Environment: windows 8 Single language
 Hadoop-2.5.2
 Protocol Buffer-2.5.0
 Scala-2.11
Reporter: Kaveen Raajan

 I built Spark in YARN mode with the following command. The build succeeded.
 {panel}
 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 
 -Phive-thriftserver -DskipTests clean package
 {panel}
 I set the following property in the spark-env.cmd file:
 {panel}
 SET SPARK_JAR=hdfs://master:9000/user/spark/jar
 {panel}
 *Note:* the Spark jar files were moved to the specified HDFS location. The 
 Spark classpath was also added to hadoop-config.cmd, and HADOOP_CONF_DIR was 
 set as an environment variable.
 I tried to execute the following SparkPi example in yarn-cluster mode.
 {panel}
 spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster 
 --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 
 --queue default 
 S:\Hadoop\Spark\spark-1.3.1\examples\target\spark-examples_2.10-1.3.1.jar 10
 {panel}
 My job was submitted to the Hadoop cluster, but it stayed in the ACCEPTED 
 state and then failed with the following error:
 {panel}
 15/05/14 13:00:51 INFO client.RMProxy: Connecting to ResourceManager at 
 /0.0.0.0:8032
 15/05/14 13:00:51 INFO yarn.Client: Requesting a new application from cluster 
 with 1 NodeManagers
 15/05/14 13:00:51 INFO yarn.Client: Verifying our application has not 
 requested more than the maximum memory capability of the cluster (8048 MB per 
 container)
 15/05/14 13:00:51 INFO yarn.Client: Will allocate AM container, with 4480 MB 
 memory including 384 MB overhead
 15/05/14 13:00:51 INFO yarn.Client: Setting up container launch context for 
 our AM
 15/05/14 13:00:51 INFO yarn.Client: Preparing resources for our AM container
 15/05/14 13:00:52 INFO yarn.Client: Source and destination file systems are 
 the same. Not copying hdfs://master:9000/user/spark/jar
 15/05/14 13:00:52 INFO yarn.Client: Uploading resource 
 file:/S:/Hadoop/Spark/spark-1.3.1/examples/target/spark-examples_2.10-1.3.1.jar
  - 
 hdfs://master:9000/user/HDFS/.sparkStaging/application_1431587916618_0003/spark-examples_2.10-1.3.1.jar
 15/05/14 13:00:52 INFO yarn.Client: Setting up the launch environment for our 
 AM container
 15/05/14 13:00:52 INFO spark.SecurityManager: Changing view acls to: HDFS
 15/05/14 13:00:52 INFO spark.SecurityManager: Changing modify acls to: HDFS
 15/05/14 13:00:52 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(HDFS); users 
 with modify permissions: Set(HDFS)
 15/05/14 13:00:52 INFO yarn.Client: Submitting application 3 to 
 ResourceManager
 15/05/14 13:00:52 INFO impl.YarnClientImpl: Submitted application 
 application_1431587916618_0003
 15/05/14 13:00:53 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:53 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 1431588652790
  final status: UNDEFINED
  tracking URL: 
 http://master:8088/proxy/application_1431587916618_0003/
  user: HDFS
 15/05/14 13:00:54 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:55 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:56 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:57 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:58 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:59 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: FAILED)
 15/05/14 13:00:59 INFO yarn.Client:
  client token: N/A
  diagnostics: Application application_1431587916618_0003 failed 2 
 times
 due to AM Container for appattempt_1431587916618_0003_02 exited with  
 exitCode: 1
 For more detailed output, check application tracking 
 

[jira] [Commented] (SPARK-7661) Support for dynamic allocation of executors in Kinesis Spark Streaming

2015-05-18 Thread Murtaza Kanchwala (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547642#comment-14547642
 ] 

Murtaza Kanchwala commented on SPARK-7661:
--

Yes, it works. I took 4 + 4 = 8 cores, where 4 is my total no. of cores and 4 is 
my total no. of shards.

But there is still another thing. When the Spark consumer starts up, it takes 1 
executor and 4 network receivers. When I scale up my Kinesis stream by 4 more 
shards, i.e. 8 shards, it still works, but the receiver count stays at 4. So is 
there any way to scale up the receivers or not?
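
Receivers are allocated when the StreamingContext starts, so the usual pattern is one 
input DStream per shard, unioned into a single stream; picking up new shards after 
resharding means restarting the context with the new count. A hedged sketch, assuming 
the Spark 1.3.x Kinesis ASL createStream signature and placeholder stream, endpoint, 
and shard-count values:
{code}
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-shards-demo"), Seconds(2))

// One receiver per shard; after resharding, the StreamingContext has to be
// restarted with the new shard count for the receiver count to change.
val numShards = 8                                        // placeholder: current no. of shards
val kinesisStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(
    ssc,
    "myKinesisStream",                                   // placeholder stream name
    "https://kinesis.us-east-1.amazonaws.com",           // placeholder endpoint
    Seconds(10),                                         // checkpoint interval
    InitialPositionInStream.LATEST,
    StorageLevel.MEMORY_AND_DISK_2)
}
val allRecords = ssc.union(kinesisStreams)               // single DStream over all shards
{code}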

 Support for dynamic allocation of executors in Kinesis Spark Streaming
 --

 Key: SPARK-7661
 URL: https://issues.apache.org/jira/browse/SPARK-7661
 Project: Spark
  Issue Type: New Feature
  Components: Streaming
Affects Versions: 1.3.1
 Environment: AWS-EMR
Reporter: Murtaza Kanchwala

 Currently the no. of cores is (N + 1), where N is the no. of shards in a Kinesis 
 stream.
 My requirement is that if I use this resharding util for Amazon Kinesis:
 Amazon Kinesis Resharding: 
 https://github.com/awslabs/amazon-kinesis-scaling-utils
 then there should be some way to allocate executors on the basis of the no. of 
 shards directly (for Spark Streaming only).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7663:
---

Assignee: Apache Spark

 [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size 
 is zero
 -

 Key: SPARK-7663
 URL: https://issues.apache.org/jira/browse/SPARK-7663
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
Assignee: Apache Spark
Priority: Minor

 mllib.feature.Word2Vec at line 442: 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442
  uses `.head` to get the vector size, but it would throw an empty-iterator 
 error if `minCount` is large enough to remove all words from the dataset.
 Since this is not a common scenario, maybe we can ignore it. If so, 
 we can close the issue directly. If not, I can add some code to print a more 
 helpful error message.
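 A minimal sketch of that more helpful error, assuming nothing about the actual 
 Word2Vec internals (the helper name is illustrative):
 {code}
 // Illustrative only: replace an unchecked .head with an explicit check so an
 // empty vocabulary produces a clear message instead of an empty-iterator error.
 def firstVectorSize(vectors: Iterable[Array[Float]], minCount: Int): Int =
   vectors.headOption match {
     case Some(v) => v.length
     case None    => throw new IllegalArgumentException(
       s"The vocabulary is empty; minCount=$minCount filtered out every word in the dataset.")
   }
 {code}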



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7663) [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size is zero

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7663:
---

Assignee: (was: Apache Spark)

 [MLLIB] feature.Word2Vec throws empty iterator error when the vocabulary size 
 is zero
 -

 Key: SPARK-7663
 URL: https://issues.apache.org/jira/browse/SPARK-7663
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xusen Yin
Priority: Minor

 mllib.feature.Word2Vec at line 442: 
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala#L442
  uses `.head` to get the vector size, but it would throw an empty-iterator 
 error if `minCount` is large enough to remove all words from the dataset.
 Since this is not a common scenario, maybe we can ignore it. If so, 
 we can close the issue directly. If not, I can add some code to print a more 
 helpful error message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7699) Config spark.dynamicAllocation.initialExecutors has no effect

2015-05-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547665#comment-14547665
 ] 

Sean Owen commented on SPARK-7699:
--

I think that's maybe intentional? The logic detects that you don't need 3 
executors and ramps the count down to the minimum. 

 Config spark.dynamicAllocation.initialExecutors has no effect 
 

 Key: SPARK-7699
 URL: https://issues.apache.org/jira/browse/SPARK-7699
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula

 spark.dynamicAllocation.minExecutors 2
 spark.dynamicAllocation.initialExecutors  3
 spark.dynamicAllocation.maxExecutors 4
 Just run spark-shell with the above configuration; the initial executor 
 number is 2, not 3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-7700) Spark 1.3.0 on YARN: Application failed 2 times due to AM Container

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reopened SPARK-7700:
--

 Spark 1.3.0 on YARN: Application failed 2 times due to AM Container
 ---

 Key: SPARK-7700
 URL: https://issues.apache.org/jira/browse/SPARK-7700
 Project: Spark
  Issue Type: Story
  Components: Build
Affects Versions: 1.3.1
 Environment: windows 8 Single language
 Hadoop-2.5.2
 Protocol Buffer-2.5.0
 Scala-2.11
Reporter: Kaveen Raajan

 I built Spark for YARN by running the following command. The build succeeded.
 {panel}
 mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-0.12.0 
 -Phive-thriftserver -DskipTests clean package
 {panel}
 I set the following property in the spark-env.cmd file:
 {panel}
 SET SPARK_JAR=hdfs://master:9000/user/spark/jar
 {panel}
 *Note:* the Spark jar files were moved to the specified HDFS location. The Spark 
 classpath was also added to hadoop-config.cmd, and HADOOP_CONF_DIR is set as an 
 environment variable.
 I tried to execute the following SparkPi example in yarn-cluster mode.
 {panel}
 spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster 
 --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 1 
 --queue default 
 S:\Hadoop\Spark\spark-1.3.1\examples\target\spark-examples_2.10-1.3.1.jar 10
 {panel}
 My job is submitted to the Hadoop cluster, but it always stays in ACCEPTED state and 
 then fails with the following error:
 {panel}
 15/05/14 13:00:51 INFO client.RMProxy: Connecting to ResourceManager at 
 /0.0.0.0:8032
 15/05/14 13:00:51 INFO yarn.Client: Requesting a new application from cluster 
 with 1 NodeManagers
 15/05/14 13:00:51 INFO yarn.Client: Verifying our application has not 
 requested more than the maximum memory capability of the cluster (8048 MB per 
 container)
 15/05/14 13:00:51 INFO yarn.Client: Will allocate AM container, with 4480 MB 
 memory including 384 MB overhead
 15/05/14 13:00:51 INFO yarn.Client: Setting up container launch context for 
 our AM
 15/05/14 13:00:51 INFO yarn.Client: Preparing resources for our AM container
 15/05/14 13:00:52 INFO yarn.Client: Source and destination file systems are 
 the same. Not copying hdfs://master:9000/user/spark/jar
 15/05/14 13:00:52 INFO yarn.Client: Uploading resource 
 file:/S:/Hadoop/Spark/spark-1.3.1/examples/target/spark-examples_2.10-1.3.1.jar
  - 
 hdfs://master:9000/user/HDFS/.sparkStaging/application_1431587916618_0003/spark-examples_2.10-1.3.1.jar
 15/05/14 13:00:52 INFO yarn.Client: Setting up the launch environment for our 
 AM container
 15/05/14 13:00:52 INFO spark.SecurityManager: Changing view acls to: HDFS
 15/05/14 13:00:52 INFO spark.SecurityManager: Changing modify acls to: HDFS
 15/05/14 13:00:52 INFO spark.SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(HDFS); users 
 with modify permissions: Set(HDFS)
 15/05/14 13:00:52 INFO yarn.Client: Submitting application 3 to 
 ResourceManager
 15/05/14 13:00:52 INFO impl.YarnClientImpl: Submitted application 
 application_1431587916618_0003
 15/05/14 13:00:53 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:53 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 1431588652790
  final status: UNDEFINED
  tracking URL: 
 http://master:8088/proxy/application_1431587916618_0003/
  user: HDFS
 15/05/14 13:00:54 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:55 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:56 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:57 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:58 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: ACCEPTED)
 15/05/14 13:00:59 INFO yarn.Client: Application report for 
 application_1431587916618_0003 (state: FAILED)
 15/05/14 13:00:59 INFO yarn.Client:
  client token: N/A
  diagnostics: Application application_1431587916618_0003 failed 2 
 times
 due to AM Container for appattempt_1431587916618_0003_02 exited with  
 exitCode: 1
 For more detailed output, check application tracking 
 page:http://master:8088/proxy/application_1431587916618_0003/Then, click on 
 links to logs of each attempt.
 Diagnostics: Exception from container-launch.
 Container id: container_1431587916618_0003_02_01
 Exit code: 1
 Stack trace: ExitCodeException exitCode=1:
 at 

[jira] [Assigned] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7697:
---

Assignee: Apache Spark

 Column with an unsigned int should be treated as long in JDBCRDD
 

 Key: SPARK-7697
 URL: https://issues.apache.org/jira/browse/SPARK-7697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: DAITO Teppei
Assignee: Apache Spark

 Columns with an unsigned numeric type in JDBC should be treated as the next 
 'larger' Java type
 in JDBCRDD#getCatalystType .
 https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49
 {code:title=q.sql}
 create table t1 (id int unsigned);
 insert into t1 values (4234567890);
 {code}
 {code:title=T1.scala}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext

 object T1 {
   def main(args: Array[String]) {
     val sc = new SparkContext(new SparkConf())
     val s = new SQLContext(sc)
     val url = "jdbc:mysql://localhost/test"
     val t1 = s.jdbc(url, "t1")
     t1.printSchema()
     t1.collect().foreach(println)
   }
 }
 {code}
 This code caused error like below.
 {noformat}
 15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
 xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in 
 column '1' is outside valid range for the datatype INTEGER.
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
 at com.mysql.jdbc.Util.getInstance(Util.java:360)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
 at 
 com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090)
 at 
 com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364)
 at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484)
 at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344)
 at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399)
 ...
 {noformat}
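 A hedged sketch of the proposed fix (not the actual JDBCRDD code): when JDBC reports 
 the column as unsigned, map it to the next larger Catalyst type so the value fits:
 {code}
 import java.sql.Types
 import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StringType}

 // Illustrative mapping only; JDBCRDD#getCatalystType handles many more types.
 def catalystTypeFor(sqlType: Int, signed: Boolean): DataType = sqlType match {
   case Types.INTEGER if signed => IntegerType
   case Types.INTEGER           => LongType    // unsigned INT can exceed Int.MaxValue
   case _                       => StringType  // fallback for this sketch only
 }
 {code}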



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4852) Hive query plan deserialization failure caused by shaded hive-exec jar file when generating golden answers

2015-05-18 Thread Manku Timma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547708#comment-14547708
 ] 

Manku Timma commented on SPARK-4852:


The following diff could solve the problem. File is 
sql/hive/v0.13.1/src/main/scala/org/apache/spark/sql/hive/Shim13.scala.
{code}
   import java.io.{OutputStream, InputStream}
-  import com.esotericsoftware.kryo.Kryo
+  import org.apache.hive.com.esotericsoftware.kryo.Kryo
   import org.apache.spark.util.Utils._
   import org.apache.hadoop.hive.ql.exec.Utilities
   import org.apache.hadoop.hive.ql.exec.UDF
{code}

 Hive query plan deserialization failure caused by shaded hive-exec jar file 
 when generating golden answers
 --

 Key: SPARK-4852
 URL: https://issues.apache.org/jira/browse/SPARK-4852
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: Cheng Lian
Priority: Minor

 When adding Hive 0.13.1 support for Spark SQL Thrift server in PR 
 [2685|https://github.com/apache/spark/pull/2685], Kryo 2.22 used by original 
 hive-exec-0.13.1.jar was shaded by Kryo 2.21 used by Spark SQL because of 
 dependency hell. Unfortunately, Kryo 2.21 has a known bug that may cause Hive 
 query plan deserialization failure. This bug was fixed in Kryo 2.22.
 Normally, this issue doesn't affect Spark SQL because we don't even generate 
 a Hive query plan. But when running Hive test suites like 
 {{HiveCompatibilitySuite}}, golden answer files must be generated by Hive, 
 which triggers this issue. A workaround is to replace 
 {{hive-exec-0.13.1.jar}} under {{$HIVE_HOME/lib}} with Spark's 
 {{hive-exec-0.13.1a.jar}} and {{kryo-2.21.jar}} under 
 {{$SPARK_DEV_HOME/lib_managed/jars}}. Then add {{$HIVE_HOME/lib}} to 
 {{$HADOOP_CLASSPATH}}.
 Upgrading to some newer version of Kryo which is binary compatible with Kryo 
 2.22 (if there is one) may fix this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD

2015-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14547709#comment-14547709
 ] 

Apache Spark commented on SPARK-7697:
-

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/6229

 Column with an unsigned int should be treated as long in JDBCRDD
 

 Key: SPARK-7697
 URL: https://issues.apache.org/jira/browse/SPARK-7697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: DAITO Teppei

 Columns with an unsigned numeric type in JDBC should be treated as the next 
 'larger' Java type
 in JDBCRDD#getCatalystType .
 https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49
 {code:title=q.sql}
 create table t1 (id int unsigned);
 insert into t1 values (4234567890);
 {code}
 {code:title=T1.scala}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext

 object T1 {
   def main(args: Array[String]) {
     val sc = new SparkContext(new SparkConf())
     val s = new SQLContext(sc)
     val url = "jdbc:mysql://localhost/test"
     val t1 = s.jdbc(url, "t1")
     t1.printSchema()
     t1.collect().foreach(println)
   }
 }
 {code}
 This code caused error like below.
 {noformat}
 15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
 xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in 
 column '1' is outside valid range for the datatype INTEGER.
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
 at com.mysql.jdbc.Util.getInstance(Util.java:360)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
 at 
 com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090)
 at 
 com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364)
 at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484)
 at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344)
 at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399)
 ...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7697) Column with an unsigned int should be treated as long in JDBCRDD

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7697:
---

Assignee: (was: Apache Spark)

 Column with an unsigned int should be treated as long in JDBCRDD
 

 Key: SPARK-7697
 URL: https://issues.apache.org/jira/browse/SPARK-7697
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: DAITO Teppei

 Columns with an unsigned numeric type in JDBC should be treated as the next 
 'larger' Java type
 in JDBCRDD#getCatalystType .
 https://github.com/apache/spark/blob/517eb37a85e0a28820bcfd5d98c50d02df6521c6/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JDBCRDD.scala#L49
 {code:title=q.sql}
 create table t1 (id int unsigned);
 insert into t1 values (4234567890);
 {code}
 {code:title=T1.scala}
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext

 object T1 {
   def main(args: Array[String]) {
     val sc = new SparkContext(new SparkConf())
     val s = new SQLContext(sc)
     val url = "jdbc:mysql://localhost/test"
     val t1 = s.jdbc(url, "t1")
     t1.printSchema()
     t1.collect().foreach(println)
   }
 }
 {code}
 This code caused error like below.
 {noformat}
 15/05/18 11:39:51 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 
 xxx): com.mysql.jdbc.exceptions.jdbc4.MySQLDataException: '4.23456789E9' in 
 column '1' is outside valid range for the datatype INTEGER.
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
 Method)
 at 
 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
 at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
 at com.mysql.jdbc.Util.handleNewInstance(Util.java:377)
 at com.mysql.jdbc.Util.getInstance(Util.java:360)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:963)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:935)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:924)
 at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:870)
 at 
 com.mysql.jdbc.ResultSetImpl.throwRangeException(ResultSetImpl.java:7090)
 at 
 com.mysql.jdbc.ResultSetImpl.parseIntAsDouble(ResultSetImpl.java:6364)
 at com.mysql.jdbc.ResultSetImpl.getInt(ResultSetImpl.java:2484)
 at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.getNext(JDBCRDD.scala:344)
 at 
 org.apache.spark.sql.jdbc.JDBCRDD$$anon$1.hasNext(JDBCRDD.scala:399)
 ...
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6416) RDD.fold() requires the operator to be commutative

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6416:
---

Assignee: Apache Spark

 RDD.fold() requires the operator to be commutative
 --

 Key: SPARK-6416
 URL: https://issues.apache.org/jira/browse/SPARK-6416
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Spark Core
Reporter: Josh Rosen
Assignee: Apache Spark
Priority: Critical

 Spark's {{RDD.fold}} operation has some confusing behaviors when a 
 non-commutative reduce function is used.
 Here's an example, which was originally reported on StackOverflow 
 (https://stackoverflow.com/questions/29150202/pyspark-fold-method-output):
 {code}
  >>> sc.parallelize([1,25,8,4,2]).fold(0,lambda a,b:a+1 )
  8
 {code}
 To understand what's going on here, let's look at the definition of Spark's 
 `fold` operation.  
 I'm going to show the Python version of the code, but the Scala version 
 exhibits the exact same behavior (you can also [browse the source on 
 GitHub|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/python/pyspark/rdd.py#L780]):
 {code}
 def fold(self, zeroValue, op):
     """
     Aggregate the elements of each partition, and then the results for all
     the partitions, using a given associative function and a neutral "zero
     value."

     The function C{op(t1, t2)} is allowed to modify C{t1} and return it
     as its result value to avoid object allocation; however, it should not
     modify C{t2}.

     >>> from operator import add
     >>> sc.parallelize([1, 2, 3, 4, 5]).fold(0, add)
     15
     """
     def func(iterator):
         acc = zeroValue
         for obj in iterator:
             acc = op(obj, acc)
         yield acc
     vals = self.mapPartitions(func).collect()
     return reduce(op, vals, zeroValue)
 {code}
 (For comparison, see the [Scala implementation of 
 `RDD.fold`|https://github.com/apache/spark/blob/8cb23a1f9a3ed08e57865bcb6cc1cc7902881073/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L943]).
 Spark's `fold` operates by first folding each partition and then folding the 
 results.  The problem is that an empty partition gets folded down to the zero 
 element, so the final driver-side fold ends up folding one value for _every_ 
 partition rather than one value for each _non-empty_ partition.  This means 
 that the result of `fold` is sensitive to the number of partitions:
 {code}
  >>> sc.parallelize([1,25,8,4,2], 100).fold(0,lambda a,b:a+1 )
  100
  >>> sc.parallelize([1,25,8,4,2], 50).fold(0,lambda a,b:a+1 )
  50
  >>> sc.parallelize([1,25,8,4,2], 1).fold(0,lambda a,b:a+1 )
  1
 {code}
 In this last case, what's happening is that the single partition is being 
 folded down to the correct value, then that value is folded with the 
 zero-value at the driver to yield 1.
 I think the underlying problem here is that our fold() operation implicitly 
 requires the operator to be commutative in addition to associative, but this 
 isn't documented anywhere.  Due to ordering non-determinism elsewhere in 
 Spark, such as SPARK-5750, I don't think there's an easy way to fix this.  
 Therefore, I think we should update the documentation and examples to clarify 
 this requirement and explain that our fold acts more like a reduce with a 
 default value than the type of ordering-sensitive fold() that users may 
 expect in functional languages.
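 To illustrate the documentation point in Scala (assuming a local SparkContext), an 
 operator that is both commutative and associative gives a result that is insensitive 
 to the partition count:
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("fold-demo"))

 // Sum is commutative and associative, so fold no longer depends on partitioning:
 sc.parallelize(Seq(1, 25, 8, 4, 2), 100).fold(0)(_ + _)  // 40
 sc.parallelize(Seq(1, 25, 8, 4, 2), 1).fold(0)(_ + _)    // 40

 sc.stop()
 {code}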



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7443) MLlib 1.4 QA plan

2015-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7443:
-
Description: 
TODO: create JIRAs for each task and assign them accordingly.

h2. API

* Check API compliance using java-compliance-checker (SPARK-7458)

* Audit new public APIs (from the generated html doc)
** Scala (do not forget to check the object doc) (SPARK-7537)
** Java compatibility (SPARK-7529)
** Python API coverage (SPARK-7536)

* audit Pipeline APIs (SPARK-7535)

* graduate spark.ml from alpha
** remove AlphaComponent annotations
** remove mima excludes for spark.ml
** mark concrete classes final wherever reasonable

h2. Algorithms and performance

*Performance*
* _List any other missing performance tests from spark-perf here_
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* Bernoulli naive Bayes (SPARK-7453)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)

*Correctness*
* PMML
** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
* model save/load (SPARK-7541)

h2. Documentation and example code

* Create JIRAs for the user guide to each new algorithm and assign them to the 
corresponding author.  Link here as requires
** Now that we have algorithms in spark.ml which are not in spark.mllib, we 
should start making subsections for the spark.ml API as needed.  We can follow 
the structure of the spark.mllib user guide.
*** The spark.ml user guide can provide: (a) code examples and (b) info on 
algorithms which do not exist in spark.mllib.
*** We should not duplicate info in the spark.ml guides.  Since spark.mllib is 
still the primary API, we should provide links to the corresponding algorithms 
in the spark.mllib user guide for more info.

* Create example code for major components.  Link here as requires
** cross validation in python
** pipeline with complex feature transformations (scala/java/python)
** elastic-net (possibly with cross validation)
** kernel density

  was:
TODO: create JIRAs for each task and assign them accordingly.

h2. API

* Check API compliance using java-compliance-checker (SPARK-7458)

* Audit new public APIs (from the generated html doc)
** Scala (do not forget to check the object doc) (SPARK-7537)
** Java compatibility (SPARK-7529)
** Python API coverage (SPARK-7536)

* audit Pipeline APIs (SPARK-7535)

* graduate spark.ml from alpha
** remove AlphaComponent annotations
** remove mima excludes for spark.ml
** mark concrete classes final wherever reasonable

h2. Algorithms and performance

*Performance*
* _List any other missing performance tests from spark-perf here_
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* Bernoulli naive Bayes (SPARK-7453)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)

*Correctness*
* PMML
** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
* model save/load (SPARK-7541)

h2. Documentation and example code

* Create JIRAs for the user guide to each new algorithm and assign them to the 
corresponding author.  Link here as requires
** Now that we have algorithms in spark.ml which are not in spark.mllib, we 
should start making subsections for the spark.ml API as needed.  We can follow 
the structure of the spark.mllib user guide.
*** The spark.ml user guide can provide: (a) code examples and (b) info on 
algorithms which do not exist in spark.mllib.
*** We should not duplicate info in the spark.ml guides.  Since spark.mllib is 
still the primary API, we should provide links to the corresponding algorithms 
in the spark.mllib user guide for more info.

* Create example code for major components.  Link here as requires
** cross validation in python
** pipeline with complex feature transformations (scala/java/python)
** elastic-net (possibly with cross validation)


 MLlib 1.4 QA plan
 -

 Key: SPARK-7443
 URL: https://issues.apache.org/jira/browse/SPARK-7443
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
Priority: Critical

 TODO: create JIRAs for each task and assign them accordingly.
 h2. API
 * Check API compliance using java-compliance-checker (SPARK-7458)
 * Audit new public APIs (from the generated html doc)
 ** Scala (do not forget to check the object doc) (SPARK-7537)
 ** Java compatibility (SPARK-7529)
 ** Python API coverage (SPARK-7536)
 * audit Pipeline APIs (SPARK-7535)
 * graduate spark.ml from alpha
 ** remove AlphaComponent annotations
 ** remove mima excludes for spark.ml
 ** mark concrete classes final wherever reasonable
 h2. Algorithms and performance
 *Performance*
 * _List any other missing performance tests from 

[jira] [Created] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-18 Thread Akshat Aranya (JIRA)
Akshat Aranya created SPARK-7708:


 Summary: Incorrect task serialization with Kryo closure serializer
 Key: SPARK-7708
 URL: https://issues.apache.org/jira/browse/SPARK-7708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.2
Reporter: Akshat Aranya


I've been investigating the use of Kryo for closure serialization with Spark 
1.2, and it seems like I've hit upon a bug:

When a task is serialized before scheduling, the following log message is 
generated:

[info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
host, PROCESS_LOCAL, 302 bytes)

This message comes from TaskSetManager which serializes the task using the 
closure serializer.  Before the message is sent out, the TaskDescription (which 
includes the original task as a byte array) is serialized again into a byte 
array with the closure serializer.  I added a log message for this in 
CoarseGrainedSchedulerBackend, which produces the following output:

[info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132

The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
than the serialized task that it contains (302 bytes). This implies that 
TaskDescription.buffer is not getting serialized correctly.

On the executor side, the deserialization produces a null value for 
TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3334) Spark causes mesos-master memory leak

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-3334.
--
Resolution: Not A Problem

 Spark causes mesos-master memory leak
 -

 Key: SPARK-3334
 URL: https://issues.apache.org/jira/browse/SPARK-3334
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.2
 Environment: Mesos 0.16.0/0.19.0
 CentOS 6.4
Reporter: Iven Hsu

 The {{akkaFrameSize}} is set to {{Long.MaxValue}} in MesosBackend to 
 work around SPARK-1112; this causes every serialized task result to be sent using a 
 Mesos TaskStatus.
 mesos-master stores TaskStatus in memory, and when running Spark its memory 
 grows very fast, and it will eventually be OOM-killed.
 See MESOS-1746 for more.
 I've tried setting {{akkaFrameSize}} to 0, and mesos-master won't be killed; 
 however, the driver will block after success unless I use {{sc.stop()}} to 
 quit it manually. Not sure if that's related to SPARK-1112.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548242#comment-14548242
 ] 

Sean Owen commented on SPARK-7706:
--

I am not sure Spark is designed to be invoked this way. You may need to invoke 
it in a shell. But why isn't the YARN_CONF_DIR in the environment on a Hadoop 
cluster correct and sufficient? That's the idea.

 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like OOZIE or FALCON.
 The reason is that the Oozie launcher app starts in one of the containers assigned by 
 the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the 
 cluster. Just passing an argument like -yarnconfdir with the conf dir (ex: 
 /etc/hadoop/conf) would avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548243#comment-14548243
 ] 

Sean Owen commented on SPARK-7706:
--

I am not sure Spark is designed to be invoked this way. You may need to invoke 
it in a shell. But why isn't the YARN_CONF_DIR in the environment on a Hadoop 
cluster correct and sufficient? That's the idea.

 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like OOZIE or FALCON.
 The reason is that the Oozie launcher app starts in one of the containers assigned by 
 the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the 
 cluster. Just passing an argument like -yarnconfdir with the conf dir (ex: 
 /etc/hadoop/conf) would avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7706:
-
Comment: was deleted

(was: I am not sure Spark is designed to be invoked this way. You may need to 
invoke it in a shell. But why isn't the YARN_CONF_DIR in the environment on a 
Hadoop cluster correct and sufficient? That's the idea.)

 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like OOZIE or FALCON.
 The reason is that the Oozie launcher app starts in one of the containers assigned by 
 the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the 
 cluster. Just passing an argument like -yarnconfdir with the conf dir (ex: 
 /etc/hadoop/conf) would avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548322#comment-14548322
 ] 

Sean Owen commented on SPARK-7706:
--

YARN_CONF_DIR is a YARN env variable, right? It is not Spark-specific or app-specific. 
It should point to the cluster's YARN configuration, which is not app-specific 
either. I think any gateway machine will have this set. Right, or is this not 
valid for Oozie somehow?

 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like OOZIE or FALCON.
 The reason is that the Oozie launcher app starts in one of the containers assigned by 
 the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the 
 cluster. Just passing an argument like -yarnconfdir with the conf dir (ex: 
 /etc/hadoop/conf) would avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Shaik Idris Ali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548418#comment-14548418
 ] 

Shaik Idris Ali commented on SPARK-7706:


I think the cleaner way in SparkSubmitArguments is to support both class 
arguments and env variables; in fact, that is how it is done for most of the 
variables, and the Spark java command can appropriately set these on the classpath. 
Maybe it uses ENV because these are generic variables.
{code}
executorMemory = Option(executorMemory)
  .orElse(sparkProperties.get("spark.executor.memory"))
  .orElse(env.get("SPARK_EXECUTOR_MEMORY"))
  .orNull
{code}
Meanwhile I will check if we have a simpler solution without fixing this in 
Spark. Thanks.
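
A hypothetical sketch of how a YARN conf dir could follow the same precedence chain; 
note that the {{spark.yarn.conf.dir}} property and the standalone helper below do not 
exist in Spark, both are made up purely for illustration:
{code}
// Hypothetical only: resolve a YARN conf dir from argument, then property, then ENV.
def resolveYarnConfDir(argValue: String,
                       sparkProperties: Map[String, String],
                       env: Map[String, String]): String =
  Option(argValue)
    .orElse(sparkProperties.get("spark.yarn.conf.dir"))   // made-up property name
    .orElse(env.get("YARN_CONF_DIR"))
    .orNull
{code}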




 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like OOZIE or FALCON.
 The reason is that the Oozie launcher app starts in one of the containers assigned by 
 the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the 
 cluster. Just passing an argument like -yarnconfdir with the conf dir (ex: 
 /etc/hadoop/conf) would avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7540) PMML correctness check

2015-05-18 Thread Shuo Xiang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548326#comment-14548326
 ] 

Shuo Xiang commented on SPARK-7540:
---

Schema and required-field verification was done using both [~vfed]'s script and 
a manual check. The output of 
[spark-pmml-exporter-validator](https://github.com/selvinsource/spark-pmml-exporter-validator) 
is checked against the required fields for each of the following models:
 - Linear Regression and Lasso
 - LR
 - Clustering
 - linear SVM. This is implemented as a RegressionModel 
(functionName="classification"), instead of the SupportVectorMachineModel. The 
evaluation results on built-in numerical examples look fine.
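
For reference, a minimal sketch of exercising the export path validated above, 
assuming an existing SparkContext {{sc}} and the {{toPMML()}} export that MLlib 
models gain in 1.4:
{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Train a tiny clustering model and export it as PMML XML, which an external
// evaluator such as spark-pmml-exporter-validator can then load and score.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
val model = KMeans.train(data, k = 2, maxIterations = 10)
println(model.toPMML())
{code}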

 PMML correctness check
 --

 Key: SPARK-7540
 URL: https://issues.apache.org/jira/browse/SPARK-7540
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Joseph K. Bradley
Assignee: Shuo Xiang

 Check correctness of PMML export for MLlib models by using PMML evaluator to 
 load and run the models.  This unfortunately needs to be done externally (not 
 in spark-perf) because of licensing.  A record of tests run and the results 
 can be posted in this JIRA, as well as a link to the repo hosting the testing 
 code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7693) Remove import scala.concurrent.ExecutionContext.Implicits.global

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7693:
-
Assignee: Shixiong Zhu

 Remove import scala.concurrent.ExecutionContext.Implicits.global
 --

 Key: SPARK-7693
 URL: https://issues.apache.org/jira/browse/SPARK-7693
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu
 Fix For: 1.4.0


 We learnt a lesson from SPARK-7655: Spark should avoid using 
 scala.concurrent.ExecutionContext.Implicits.global because the user may 
 submit blocking actions to 
 `scala.concurrent.ExecutionContext.Implicits.global` and exhaust all threads 
 in it. This could crash Spark.
 So Spark should always use its own thread pools for safety.
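 A minimal sketch of the recommended pattern (pool size and names are illustrative):
 {code}
 import java.util.concurrent.Executors
 import scala.concurrent.{ExecutionContext, Future}

 // A dedicated, bounded pool instead of ExecutionContext.Implicits.global, so a
 // blocking user action cannot starve threads shared with the rest of the system.
 val pool = Executors.newFixedThreadPool(8)          // illustrative size
 implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

 val work = Future { Thread.sleep(100); 42 }         // runs on the dedicated pool

 // Shut the pool down when the owning component stops:
 // pool.shutdown()
 {code}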



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7272) User guide update for PMML model export

2015-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7272.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6219
[https://github.com/apache/spark/pull/6219]

 User guide update for PMML model export
 ---

 Key: SPARK-7272
 URL: https://issues.apache.org/jira/browse/SPARK-7272
 Project: Spark
  Issue Type: Documentation
  Components: MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Vincenzo Selvaggio
 Fix For: 1.4.0


 Add user guide for PMML model export with some example code.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7458) Check 1.3- 1.4 MLlib API compliance using java-compliance-checker

2015-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-7458:


Assignee: Xiangrui Meng

 Check 1.3- 1.4 MLlib API compliance using java-compliance-checker
 --

 Key: SPARK-7458
 URL: https://issues.apache.org/jira/browse/SPARK-7458
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

 We should do this after 1.4-rc1 is cut.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7707) User guide and example code for Statistics.kernelDensity

2015-05-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7707:


 Summary: User guide and example code for Statistics.kernelDensity
 Key: SPARK-7707
 URL: https://issues.apache.org/jira/browse/SPARK-7707
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Sandy Ryza






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Shaik Idris Ali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548306#comment-14548306
 ] 

Shaik Idris Ali commented on SPARK-7706:


We might have multiple types of applications running on the YARN cluster, and it 
might not be a good idea to set up application-specific env variables on all the 
nodes of a cluster.

I understand that for submitting ad-hoc jobs from the command line it makes sense to 
have YARN_CONF_DIR set on one of the machines, which acts as the Spark client. We 
can support both a Spark argument and the env variable for such configurations.

 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like OOZIE or FALCON.
 The reason is that the Oozie launcher app starts in one of the containers assigned by 
 the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for all the nodes in the 
 cluster. Just passing an argument like -yarnconfdir with the conf dir (ex: 
 /etc/hadoop/conf) would avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7709) spark-submit option to quit after submitting in cluster mode

2015-05-18 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-7709:


 Summary: spark-submit option to quit after submitting in cluster 
mode
 Key: SPARK-7709
 URL: https://issues.apache.org/jira/browse/SPARK-7709
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 1.3.1
Reporter: Shay Rojansky
Priority: Minor


When deploying in cluster mode, spark-submit continues polling the application 
every second. While this is a useful feature, there should be an option to have 
spark-submit exit immediately after submission completes. This would allow 
scripts to figure out that a job was successfully (or unsuccessfully) submitted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7708) Incorrect task serialization with Kryo closure serializer

2015-05-18 Thread Akshat Aranya (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548282#comment-14548282
 ] 

Akshat Aranya commented on SPARK-7708:
--

This happens because the TaskDescription contains a SerializableBuffer, which 
has no fields from the point of view of Kryo.  It only has a transient var and 
a def.  When this is serialized by Kryo, it finds no member fields to 
serialize, so it doesn't write out the underlying buffer.
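
One way to sketch a workaround under that assumption is to register an explicit Kryo 
serializer for such a class. The class and serializer below are illustrative, not 
Spark's actual SerializableBuffer:
{code}
import java.nio.ByteBuffer
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

// Illustrative holder: Kryo's field-based serializer sees no fields to write here.
class BufferHolder(@transient var buffer: ByteBuffer)

// Explicit serializer that writes the buffer contents itself.
class BufferHolderSerializer extends Serializer[BufferHolder] {
  override def write(kryo: Kryo, out: Output, obj: BufferHolder): Unit = {
    val bytes = new Array[Byte](obj.buffer.remaining())
    obj.buffer.duplicate().get(bytes)
    out.writeInt(bytes.length)
    out.writeBytes(bytes)
  }
  override def read(kryo: Kryo, in: Input, cls: Class[BufferHolder]): BufferHolder = {
    val bytes = in.readBytes(in.readInt())
    new BufferHolder(ByteBuffer.wrap(bytes))
  }
}

// Registered on the Kryo instance used for closure/task serialization:
// kryo.register(classOf[BufferHolder], new BufferHolderSerializer)
{code}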

 Incorrect task serialization with Kryo closure serializer
 -

 Key: SPARK-7708
 URL: https://issues.apache.org/jira/browse/SPARK-7708
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.2
Reporter: Akshat Aranya

 I've been investigating the use of Kryo for closure serialization with Spark 
 1.2, and it seems like I've hit upon a bug:
 When a task is serialized before scheduling, the following log message is 
 generated:
 [info] o.a.s.s.TaskSetManager - Starting task 124.1 in stage 0.0 (TID 342, 
 host, PROCESS_LOCAL, 302 bytes)
 This message comes from TaskSetManager which serializes the task using the 
 closure serializer.  Before the message is sent out, the TaskDescription 
 (which includes the original task as a byte array) is serialized again into 
 a byte array with the closure serializer.  I added a log message for this in 
 CoarseGrainedSchedulerBackend, which produces the following output:
 [info] o.a.s.s.c.CoarseGrainedSchedulerBackend - 124.1 size=132
 The serialized size of TaskDescription (132 bytes) turns out to be _smaller_ 
 than the serialized task that it contains (302 bytes). This implies that 
 TaskDescription.buffer is not getting serialized correctly.
 On the executor side, the deserialization produces a null value for 
 TaskDescription.buffer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7443) MLlib 1.4 QA plan

2015-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-7443:
-
Description: 
TODO: create JIRAs for each task and assign them accordingly.

h2. API

* Check API compliance using java-compliance-checker (SPARK-7458)

* Audit new public APIs (from the generated html doc)
** Scala (do not forget to check the object doc) (SPARK-7537)
** Java compatibility (SPARK-7529)
** Python API coverage (SPARK-7536)

* audit Pipeline APIs (SPARK-7535)

* graduate spark.ml from alpha
** remove AlphaComponent annotations
** remove mima excludes for spark.ml
** mark concrete classes final wherever reasonable

h2. Algorithms and performance

*Performance*
* _List any other missing performance tests from spark-perf here_
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* Bernoulli naive Bayes (SPARK-7453)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)

*Correctness*
* PMML
** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
* model save/load (SPARK-7541)

h2. Documentation and example code

* Create JIRAs for the user guide to each new algorithm and assign them to the 
corresponding author.  Link here as requires
** Now that we have algorithms in spark.ml which are not in spark.mllib, we 
should start making subsections for the spark.ml API as needed.  We can follow 
the structure of the spark.mllib user guide.
*** The spark.ml user guide can provide: (a) code examples and (b) info on 
algorithms which do not exist in spark.mllib.
*** We should not duplicate info in the spark.ml guides.  Since spark.mllib is 
still the primary API, we should provide links to the corresponding algorithms 
in the spark.mllib user guide for more info.

* Create example code for major components.  Link here as required.
** cross validation in python (SPARK-7387)
** pipeline with complex feature transformations (scala/java/python) 
(SPARK-7546)
** elastic-net (possibly with cross validation) (SPARK-7547)
** kernel density (SPARK-7707)

  was:
TODO: create JIRAs for each task and assign them accordingly.

h2. API

* Check API compliance using java-compliance-checker (SPARK-7458)

* Audit new public APIs (from the generated html doc)
** Scala (do not forget to check the object doc) (SPARK-7537)
** Java compatibility (SPARK-7529)
** Python API coverage (SPARK-7536)

* audit Pipeline APIs (SPARK-7535)

* graduate spark.ml from alpha
** remove AlphaComponent annotations
** remove mima excludes for spark.ml
** mark concrete classes final wherever reasonable

h2. Algorithms and performance

*Performance*
* _List any other missing performance tests from spark-perf here_
* LDA online/EM (SPARK-7455)
* ElasticNet for linear regression and logistic regression (SPARK-7456)
* Bernoulli naive Bayes (SPARK-7453)
* PIC (SPARK-7454)
* ALS.recommendAll (SPARK-7457)
* perf-tests in Python (SPARK-7539)

*Correctness*
* PMML
** scoring using PMML evaluator vs. MLlib models (SPARK-7540)
* model save/load (SPARK-7541)

h2. Documentation and example code

* Create JIRAs for the user guide for each new algorithm and assign them to the 
corresponding author.  Link here as required.
** Now that we have algorithms in spark.ml which are not in spark.mllib, we 
should start making subsections for the spark.ml API as needed.  We can follow 
the structure of the spark.mllib user guide.
*** The spark.ml user guide can provide: (a) code examples and (b) info on 
algorithms which do not exist in spark.mllib.
*** We should not duplicate info in the spark.ml guides.  Since spark.mllib is 
still the primary API, we should provide links to the corresponding algorithms 
in the spark.mllib user guide for more info.

* Create example code for major components.  Link here as required.
** cross validation in python
** pipeline with complex feature transformations (scala/java/python)
** elastic-net (possibly with cross validation)
** kernel density


 MLlib 1.4 QA plan
 -

 Key: SPARK-7443
 URL: https://issues.apache.org/jira/browse/SPARK-7443
 Project: Spark
  Issue Type: Umbrella
  Components: ML, MLlib
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Joseph K. Bradley
Priority: Critical

 TODO: create JIRAs for each task and assign them accordingly.
 h2. API
 * Check API compliance using java-compliance-checker (SPARK-7458)
 * Audit new public APIs (from the generated html doc)
 ** Scala (do not forget to check the object doc) (SPARK-7537)
 ** Java compatibility (SPARK-7529)
 ** Python API coverage (SPARK-7536)
 * audit Pipeline APIs (SPARK-7535)
 * graduate spark.ml from alpha
 ** remove AlphaComponent annotations
 ** remove mima excludes for spark.ml
 ** mark concrete classes final wherever reasonable
 h2. Algorithms and performance
 

[jira] [Commented] (SPARK-7706) Allow setting YARN_CONF_DIR from spark argument

2015-05-18 Thread Shaik Idris Ali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7706?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548203#comment-14548203
 ] 

Shaik Idris Ali commented on SPARK-7706:


Hi, [~srowen], 

Thanks for the quick response; sorry, I did not explain this well. Basically, the 
way actions are launched from Oozie or any other scheduler is via a Java program, 
which takes the main class and a bunch of arguments for that class.

Ex: org.apache.spark.deploy.SparkSubmit.main(args); we do not need to set 
anything in system ENV variables.
Link to Oozie code:
https://github.com/apache/oozie/blob/master/sharelib/spark/src/main/java/org.apache.oozie.action.hadoop/SparkMain.java#L104
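
To make the launch path concrete, here is a hedged sketch of a scheduler-side 
launcher that invokes Spark programmatically without touching the process 
environment. The application class and jar path are placeholders, and 
-yarnconfdir is only the argument proposed by this ticket, not an existing 
SparkSubmit option.
{code}
// Hedged sketch of a scheduler-side launcher. The application class and jar
// are placeholders; "--yarnconfdir" is the PROPOSED argument from this ticket
// and is not an existing SparkSubmit option.
object LauncherSketch {
  def main(unused: Array[String]): Unit = {
    val args = Array(
      "--master", "yarn-cluster",
      "--class", "com.example.MyApp",      // placeholder application class
      "--yarnconfdir", "/etc/hadoop/conf", // proposed flag from this ticket
      "/path/to/app.jar"                   // placeholder application jar
    )
    // No YARN_CONF_DIR / HADOOP_CONF_DIR needs to be set in the ENV here.
    org.apache.spark.deploy.SparkSubmit.main(args)
  }
}
{code}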

 Allow setting YARN_CONF_DIR from spark argument
 ---

 Key: SPARK-7706
 URL: https://issues.apache.org/jira/browse/SPARK-7706
 Project: Spark
  Issue Type: Improvement
  Components: Spark Submit
Affects Versions: 1.3.1
Reporter: Shaik Idris Ali
  Labels: oozie, yarn

 Currently in SparkSubmitArguments.scala, when master is set to yarn 
 (yarn-cluster mode)
 https://github.com/apache/spark/blob/b1f4ca82d170935d15f1fe6beb9af0743b4d81cd/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L236
 Spark checks whether YARN_CONF_DIR or HADOOP_CONF_DIR is set in the ENV.
 However, we should additionally allow passing YARN_CONF_DIR as a command-line 
 argument; this is particularly handy when Spark is being launched from 
 schedulers like OOZIE or FALCON.
 The reason is that the Oozie launcher app starts in one of the containers 
 assigned by the YARN RM, and we do not want to set YARN_CONF_DIR in the ENV for 
 all the nodes in the cluster. Just passing an argument like -yarnconfdir with 
 the conf dir (ex: /etc/hadoop/conf) should avoid setting the ENV variable.
 This is blocking us from onboarding Spark from Oozie or Falcon. Thanks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3334) Spark causes mesos-master memory leak

2015-05-18 Thread Iven Hsu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548220#comment-14548220
 ] 

Iven Hsu commented on SPARK-3334:
-

I'm not using Spark currently, you can close it as you wish.

 Spark causes mesos-master memory leak
 -

 Key: SPARK-3334
 URL: https://issues.apache.org/jira/browse/SPARK-3334
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.0.2
 Environment: Mesos 0.16.0/0.19.0
 CentOS 6.4
Reporter: Iven Hsu

 The {{akkaFrameSize}} is set to {{Long.MaxValue}} in MesosBackend to work 
 around SPARK-1112; this causes all serialized task results to be sent via 
 Mesos TaskStatus.
 mesos-master stores TaskStatus in memory, and when running Spark its memory 
 grows very quickly until it is OOM-killed.
 See MESOS-1746 for more.
 I've tried setting {{akkaFrameSize}} to 0; then mesos-master isn't killed, 
 but the driver blocks after the job succeeds unless I quit it manually with 
 {{sc.stop()}}. Not sure if that is related to SPARK-1112.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7110) when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be issued only with kerberos or web authentication

2015-05-18 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548248#comment-14548248
 ] 

Thomas Graves commented on SPARK-7110:
--

Are you using Spark 1.1.0 as reported in the JIRA?

If so, then this is probably 
https://issues.apache.org/jira/browse/SPARK-3778, which was fixed in Spark 1.3.  
Can you try the newer version?  Otherwise you could try patching 1.1.

It's calling into org.apache.spark.rdd.NewHadoopRDD.getPartitions, which ends 
up calling into org.apache.hadoop.fs.FileSystem.addDelegationTokens only if the 
tokens aren't already present.  Since that is a NewHadoopRDD instance, it should 
have already populated them at that point.  That is why I'm thinking SPARK-3778 
might be the issue.

Do you have a snippet of the code where you are creating the NewHadoopRDD?  Are 
you using newAPIHadoopFile or newAPIHadoopRDD, for instance?
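
For reference, a minimal sketch of the kind of snippet being asked about 
(paths, key/value types and the output format are placeholders, not the 
reporter's actual job):
{code}
// Minimal sketch of reading with newAPIHadoopFile and writing with
// saveAsNewAPIHadoopFile; paths and types are placeholders.
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD implicits, needed on 1.1/1.2

object NewHadoopApiSketch {
  def main(args: Array[String]): Unit = {
    // master/queue are expected to come from spark-submit in yarn-client mode
    val sc = new SparkContext(new SparkConf().setAppName("new-hadoop-api-sketch"))
    // Read via the new Hadoop API (a NewHadoopRDD under the hood).
    val lines = sc.newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///tmp/input")
    // Write via the new Hadoop API (where the delegation-token error shows up).
    lines.map { case (_, text) => (text.toString, text.toString.length.toString) }
         .saveAsNewAPIHadoopFile[TextOutputFormat[String, String]]("hdfs:///tmp/output")
    sc.stop()
  }
}
{code}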

 when use saveAsNewAPIHadoopFile, sometimes it throws Delegation Token can be 
 issued only with kerberos or web authentication
 --

 Key: SPARK-7110
 URL: https://issues.apache.org/jira/browse/SPARK-7110
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.1.0
Reporter: gu-chi
Assignee: Sean Owen

 Under yarn-client mode, this issue occurs randomly. The authentication method 
 is set to Kerberos, and saveAsNewAPIHadoopFile in PairRDDFunctions is used to 
 save data to HDFS; the exception then comes as:
 org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation Token 
 can be issued only with kerberos or web authentication



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3251) Clarify learning interfaces

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3251:
---

Assignee: (was: Apache Spark)

  Clarify learning interfaces
 

 Key: SPARK-3251
 URL: https://issues.apache.org/jira/browse/SPARK-3251
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0, 1.1.1
Reporter: Christoph Sawade

 *Make threshold mandatory*
 Currently, the output of predict for an example is either the score
 or the class. This side effect is caused by clearThreshold. To
 clarify that behaviour, three different types of predict (predictScore,
 predictClass, predictProbability) were introduced; the threshold is no
 longer optional.
 *Clarify classification interfaces*
 Currently, some functionality is spread over multiple models.
 In order to clarify the structure and simplify the implementation of
 more complex models (like multinomial logistic regression), two new
 classes are introduced:
 - BinaryClassificationModel: for all models that derive a binary 
 classification from a single weight vector. Comprises the thresholding 
 functionality to derive a prediction from a score. It basically captures 
 SVMModel and LogisticRegressionModel.
 - ProbabilisticClassificationModel: This trait defines the interface for 
 models that return a calibrated confidence score (aka probability); see the 
 sketch below.
 *Misc*
 - some renaming
 - add a test for probabilistic output
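
 A rough sketch of the interfaces proposed above (only the names come from this 
 description; the signatures and the use of MLlib's Vector are assumptions, not 
 the author's actual patch):
 {code}
 // Rough sketch of the proposed interfaces; method signatures and the Vector
 // type are assumptions, only the names come from the description above.
 import org.apache.spark.mllib.linalg.Vector

 trait ClassificationScoring {
   def predictScore(features: Vector): Double   // raw score / margin
   def predictClass(features: Vector): Double   // thresholded class label
 }

 trait BinaryClassificationModel extends ClassificationScoring {
   def weights: Vector
   def threshold: Double                        // mandatory, no clearThreshold
   override def predictClass(features: Vector): Double =
     if (predictScore(features) >= threshold) 1.0 else 0.0
 }

 trait ProbabilisticClassificationModel extends ClassificationScoring {
   def predictProbability(features: Vector): Double  // calibrated confidence
 }
 {code}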



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3251) Clarify learning interfaces

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3251:
---

Assignee: Apache Spark

  Clarify learning interfaces
 

 Key: SPARK-3251
 URL: https://issues.apache.org/jira/browse/SPARK-3251
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0, 1.1.1
Reporter: Christoph Sawade
Assignee: Apache Spark

 *Make threshold mandatory*
 Currently, the output of predict for an example is either the score
 or the class. This side effect is caused by clearThreshold. To
 clarify that behaviour, three different types of predict (predictScore,
 predictClass, predictProbability) were introduced; the threshold is no
 longer optional.
 *Clarify classification interfaces*
 Currently, some functionality is spread over multiple models.
 In order to clarify the structure and simplify the implementation of
 more complex models (like multinomial logistic regression), two new
 classes are introduced:
 - BinaryClassificationModel: for all models that derive a binary 
 classification from a single weight vector. Comprises the thresholding 
 functionality to derive a prediction from a score. It basically captures 
 SVMModel and LogisticRegressionModel.
 - ProbabilisticClassificationModel: This trait defines the interface for 
 models that return a calibrated confidence score (aka probability).
 *Misc*
 - some renaming
 - add a test for probabilistic output



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4962) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4962.
--
Resolution: Won't Fix

 Put TaskScheduler.start back in SparkContext to shorten cluster resources 
 occupation period
 ---

 Key: SPARK-4962
 URL: https://issues.apache.org/jira/browse/SPARK-4962
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.0.0
Reporter: YanTang Zhai
Priority: Minor

 When the SparkContext object is instantiated, the TaskScheduler is started and 
 some resources are allocated from the cluster. However, these resources may not 
 be used yet; for example, DAGScheduler.JobSubmitted may still be processing. 
 These resources are wasted during this period. Thus, we want to move 
 TaskScheduler.start later to shorten the cluster-resource occupation period, 
 especially on a busy cluster.
 TaskScheduler could be started just before running the stages.
 We can analyse and compare the resource occupation period before and after the 
 optimization.
 TaskScheduler.start execution time: [time1__]
 DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
 TaskScheduler.start) execution time: [time2_]
 HadoopRDD.getPartitions execution time: [time3___]
 Stages execution time: [time4_]
 The cluster-resource occupation period before optimization is 
 [time2_][time3___][time4_].
 The cluster-resource occupation period after optimization is 
 [time3___][time4_].
 In summary, the cluster-resource occupation period after optimization is 
 shorter than before.
 If HadoopRDD.getPartitions could also be moved forward (SPARK-4961), the period 
 may be shortened further, to just [time4_].
 Saving these resources is important for a busy cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4962) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4962:
---

Assignee: (was: Apache Spark)

 Put TaskScheduler.start back in SparkContext to shorten cluster resources 
 occupation period
 ---

 Key: SPARK-4962
 URL: https://issues.apache.org/jira/browse/SPARK-4962
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.0.0
Reporter: YanTang Zhai
Priority: Minor

 When the SparkContext object is instantiated, the TaskScheduler is started and 
 some resources are allocated from the cluster. However, these resources may not 
 be used yet; for example, DAGScheduler.JobSubmitted may still be processing. 
 These resources are wasted during this period. Thus, we want to move 
 TaskScheduler.start later to shorten the cluster-resource occupation period, 
 especially on a busy cluster.
 TaskScheduler could be started just before running the stages.
 We can analyse and compare the resource occupation period before and after the 
 optimization.
 TaskScheduler.start execution time: [time1__]
 DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
 TaskScheduler.start) execution time: [time2_]
 HadoopRDD.getPartitions execution time: [time3___]
 Stages execution time: [time4_]
 The cluster-resource occupation period before optimization is 
 [time2_][time3___][time4_].
 The cluster-resource occupation period after optimization is 
 [time3___][time4_].
 In summary, the cluster-resource occupation period after optimization is 
 shorter than before.
 If HadoopRDD.getPartitions could also be moved forward (SPARK-4961), the period 
 may be shortened further, to just [time4_].
 Saving these resources is important for a busy cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4962) Put TaskScheduler.start back in SparkContext to shorten cluster resources occupation period

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4962:
---

Assignee: Apache Spark

 Put TaskScheduler.start back in SparkContext to shorten cluster resources 
 occupation period
 ---

 Key: SPARK-4962
 URL: https://issues.apache.org/jira/browse/SPARK-4962
 Project: Spark
  Issue Type: Improvement
  Components: Scheduler
Affects Versions: 1.0.0
Reporter: YanTang Zhai
Assignee: Apache Spark
Priority: Minor

 When the SparkContext object is instantiated, the TaskScheduler is started and 
 some resources are allocated from the cluster. However, these resources may not 
 be used yet; for example, DAGScheduler.JobSubmitted may still be processing. 
 These resources are wasted during this period. Thus, we want to move 
 TaskScheduler.start later to shorten the cluster-resource occupation period, 
 especially on a busy cluster.
 TaskScheduler could be started just before running the stages.
 We can analyse and compare the resource occupation period before and after the 
 optimization.
 TaskScheduler.start execution time: [time1__]
 DAGScheduler.JobSubmitted (excluding HadoopRDD.getPartitions or 
 TaskScheduler.start) execution time: [time2_]
 HadoopRDD.getPartitions execution time: [time3___]
 Stages execution time: [time4_]
 The cluster-resource occupation period before optimization is 
 [time2_][time3___][time4_].
 The cluster-resource occupation period after optimization is 
 [time3___][time4_].
 In summary, the cluster-resource occupation period after optimization is 
 shorter than before.
 If HadoopRDD.getPartitions could also be moved forward (SPARK-4961), the period 
 may be shortened further, to just [time4_].
 Saving these resources is important for a busy cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7710) User guide and example code for math/stat functions in DataFrames

2015-05-18 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-7710:


 Summary: User guide and example code for math/stat functions in 
DataFrames
 Key: SPARK-7710
 URL: https://issues.apache.org/jira/browse/SPARK-7710
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Burak Yavuz
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5316) DAGScheduler may make shuffleToMapStage leak if getParentStages fails

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5316:
---

Assignee: (was: Apache Spark)

 DAGScheduler may make shuffleToMapStage leak if getParentStages fails
 --

 Key: SPARK-5316
 URL: https://issues.apache.org/jira/browse/SPARK-5316
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.0.0
Reporter: YanTang Zhai

 DAGScheduler may leak entries in shuffleToMapStage if getParentStages fails.
 If getParentStages throws an exception, for example because an input path does 
 not exist, DAGScheduler fails to handle the job submission, but some records 
 may already have been put into shuffleToMapStage during getParentStages. These 
 records are never cleaned up.
 A simple job is as follows:
 {code:java}
 val inputFile1 = ... // Input path does not exist when this job submits
 val inputFile2 = ...
 val outputFile = ...
 val conf = new SparkConf()
 val sc = new SparkContext(conf)
 val rdd1 = sc.textFile(inputFile1)
   .flatMap(line => line.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _, 1)
 val rdd2 = sc.textFile(inputFile2)
   .flatMap(line => line.split(","))
   .map(word => (word, 1))
   .reduceByKey(_ + _, 1)
 try {
   val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
   rdd3.saveAsTextFile(outputFile)
 } catch {
   case e: Exception =>
     logError(e.toString)
 }
 // print the information of DAGScheduler's shuffleToMapStage to check
 // whether it still has uncleaned records.
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5316) DAGScheduler may make shuffleToMapStage leak if getParentStages fails

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5316:
---

Assignee: Apache Spark

 DAGScheduler may make shuffleToMapStage leak if getParentStages fails
 --

 Key: SPARK-5316
 URL: https://issues.apache.org/jira/browse/SPARK-5316
 Project: Spark
  Issue Type: Bug
  Components: Scheduler
Affects Versions: 1.0.0
Reporter: YanTang Zhai
Assignee: Apache Spark

 DAGScheduler may leak entries in shuffleToMapStage if getParentStages fails.
 If getParentStages throws an exception, for example because an input path does 
 not exist, DAGScheduler fails to handle the job submission, but some records 
 may already have been put into shuffleToMapStage during getParentStages. These 
 records are never cleaned up.
 A simple job is as follows:
 {code:java}
 val inputFile1 = ... // Input path does not exist when this job submits
 val inputFile2 = ...
 val outputFile = ...
 val conf = new SparkConf()
 val sc = new SparkContext(conf)
 val rdd1 = sc.textFile(inputFile1)
   .flatMap(line => line.split(" "))
   .map(word => (word, 1))
   .reduceByKey(_ + _, 1)
 val rdd2 = sc.textFile(inputFile2)
   .flatMap(line => line.split(","))
   .map(word => (word, 1))
   .reduceByKey(_ + _, 1)
 try {
   val rdd3 = new PairRDDFunctions(rdd1).join(rdd2, 1)
   rdd3.saveAsTextFile(outputFile)
 } catch {
   case e: Exception =>
     logError(e.toString)
 }
 // print the information of DAGScheduler's shuffleToMapStage to check
 // whether it still has uncleaned records.
 ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6888) Make DriverQuirks editable

2015-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6888.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 
[https://github.com/apache/spark/pull/]

 Make DriverQuirks editable
 --

 Key: SPARK-6888
 URL: https://issues.apache.org/jira/browse/SPARK-6888
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rene Treffer
Priority: Minor
 Fix For: 1.4.0


 JDBC type conversion is currently handled by Spark with the help of 
 DriverQuirks (org.apache.spark.sql.jdbc.DriverQuirks).
 However, some cases can't be resolved, e.g. MySQL BIGINT UNSIGNED (other 
 UNSIGNED conversions won't work either, but could be resolved automatically by 
 using the next larger type).
 An invalid type conversion (e.g. loading an unsigned bigint with the highest 
 bit set as a long value) causes the JDBC driver to throw an exception.
 The target type is determined automatically and bound to the resulting 
 DataFrame, where it is immutable.
 Alternative solutions:
 - Subqueries, which produce extra load on the server
 - SQLContext / jdbc methods with schema support
 - Making it possible to change the schema of DataFrames



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4105) FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based shuffle

2015-05-18 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548433#comment-14548433
 ] 

Josh Rosen commented on SPARK-4105:
---

I just noticed something interesting, although perhaps it's a red herring: in 
most of the stacktraces posted here, it looks like shuffled data is being 
processed with CoGroupedRDD. In most (all?) of these cases, it looks like the 
error is occurring when inserting an iterator of values into an external 
append-only map inside of CoGroupedRDD.  
https://github.com/apache/spark/pull/1607 was one of the last PRs to touch this 
file in 1.1, so I wonder whether there could be some sort of odd corner-case 
that we're hitting there; might be a lead worth exploring further.

 FAILED_TO_UNCOMPRESS(5) errors when fetching shuffle data with sort-based 
 shuffle
 -

 Key: SPARK-4105
 URL: https://issues.apache.org/jira/browse/SPARK-4105
 Project: Spark
  Issue Type: Bug
  Components: Shuffle, Spark Core
Affects Versions: 1.2.0, 1.2.1, 1.3.0
Reporter: Josh Rosen
Assignee: Josh Rosen
Priority: Blocker
 Attachments: JavaObjectToSerialize.java, 
 SparkFailedToUncompressGenerator.scala


 We have seen non-deterministic {{FAILED_TO_UNCOMPRESS(5)}} errors during 
 shuffle read.  Here's a sample stacktrace from an executor:
 {code}
 14/10/23 18:34:11 ERROR Executor: Exception in task 1747.3 in stage 11.0 (TID 
 33053)
 java.io.IOException: FAILED_TO_UNCOMPRESS(5)
   at org.xerial.snappy.SnappyNative.throw_error(SnappyNative.java:78)
   at org.xerial.snappy.SnappyNative.rawUncompress(Native Method)
   at org.xerial.snappy.Snappy.rawUncompress(Snappy.java:391)
   at org.xerial.snappy.Snappy.uncompress(Snappy.java:427)
   at 
 org.xerial.snappy.SnappyInputStream.readFully(SnappyInputStream.java:127)
   at 
 org.xerial.snappy.SnappyInputStream.readHeader(SnappyInputStream.java:88)
   at org.xerial.snappy.SnappyInputStream.<init>(SnappyInputStream.java:58)
   at 
 org.apache.spark.io.SnappyCompressionCodec.compressedInputStream(CompressionCodec.scala:128)
   at 
 org.apache.spark.storage.BlockManager.wrapForCompression(BlockManager.scala:1090)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:116)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator$$anon$1$$anonfun$onBlockFetchSuccess$1.apply(ShuffleBlockFetcherIterator.scala:115)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:243)
   at 
 org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:52)
   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
   at 
 org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
   at 
 org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:129)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
   at 
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
   at 
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 

[jira] [Assigned] (SPARK-4991) Worker should reconnect to Master when Master actor restart

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4991:
---

Assignee: Apache Spark

 Worker should reconnect to Master when Master actor restart
 ---

 Key: SPARK-4991
 URL: https://issues.apache.org/jira/browse/SPARK-4991
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Spark Core
Affects Versions: 1.0.0, 1.1.0, 1.2.0
Reporter: Zhang, Liye
Assignee: Apache Spark

 This is a follow-up JIRA to 
 [SPARK-4989|https://issues.apache.org/jira/browse/SPARK-4989]. When the Master 
 akka actor encounters an exception, the Master restarts (an akka actor 
 restart, not a JVM restart) and all old information on the Master is cleared 
 (including workers, applications, etc.). However, the workers are not aware of 
 this at all. The resulting cluster state is that the master is up and all 
 workers are up, but the master does not know the workers exist and ignores 
 their heartbeats because none of them are registered, so the whole cluster is 
 unavailable.
 For more information about this area, please refer to 
 [SPARK-3736|https://issues.apache.org/jira/browse/SPARK-3736] and 
 [SPARK-4592|https://issues.apache.org/jira/browse/SPARK-4592]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4991) Worker should reconnect to Master when Master actor restart

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4991:
---

Assignee: (was: Apache Spark)

 Worker should reconnect to Master when Master actor restart
 ---

 Key: SPARK-4991
 URL: https://issues.apache.org/jira/browse/SPARK-4991
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Spark Core
Affects Versions: 1.0.0, 1.1.0, 1.2.0
Reporter: Zhang, Liye

 This is a follow-up JIRA to 
 [SPARK-4989|https://issues.apache.org/jira/browse/SPARK-4989]. When the Master 
 akka actor encounters an exception, the Master restarts (an akka actor 
 restart, not a JVM restart) and all old information on the Master is cleared 
 (including workers, applications, etc.). However, the workers are not aware of 
 this at all. The resulting cluster state is that the master is up and all 
 workers are up, but the master does not know the workers exist and ignores 
 their heartbeats because none of them are registered, so the whole cluster is 
 unavailable.
 For more information about this area, please refer to 
 [SPARK-3736|https://issues.apache.org/jira/browse/SPARK-3736] and 
 [SPARK-4592|https://issues.apache.org/jira/browse/SPARK-4592]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3219:
---

Assignee: Apache Spark  (was: Derrick Burns)

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Apache Spark
  Labels: clustering

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.
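
 For illustration, a rough sketch of what a pluggable Bregman divergence could 
 look like (the trait and object names are assumptions, not the author's actual 
 changes):
 {code}
 // Rough sketch of a pluggable Bregman divergence for k-means; the trait and
 // object names are illustrative assumptions, not the author's actual code.
 trait BregmanDivergence extends Serializable {
   def divergence(point: Array[Double], center: Array[Double]): Double
 }

 // Squared Euclidean distance is the Bregman divergence generated by the
 // squared norm, and recovers the standard k-means objective.
 object SquaredEuclidean extends BregmanDivergence {
   def divergence(p: Array[Double], c: Array[Double]): Double = {
     var i = 0; var sum = 0.0
     while (i < p.length) { val d = p(i) - c(i); sum += d * d; i += 1 }
     sum
   }
 }

 // Generalized KL divergence (defined for strictly positive coordinates) is
 // the Bregman divergence generated by negative entropy.
 object GeneralizedKL extends BregmanDivergence {
   def divergence(p: Array[Double], c: Array[Double]): Double = {
     var i = 0; var sum = 0.0
     while (i < p.length) { sum += p(i) * math.log(p(i) / c(i)) - p(i) + c(i); i += 1 }
     sum
   }
 }
 {code}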



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3261:
---

Assignee: Apache Spark  (was: Derrick Burns)

 KMeans clusterer can return duplicate cluster centers
 -

 Key: SPARK-3261
 URL: https://issues.apache.org/jira/browse/SPARK-3261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns
Assignee: Apache Spark
  Labels: clustering

 This is a bad design choice.  I think that it is preferable to produce no 
 duplicate cluster centers. So instead of forcing the number of clusters to be 
 K, return at most K clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3218) K-Means clusterer can fail on degenerate data

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3218:
---

Assignee: Derrick Burns  (was: Apache Spark)

 K-Means clusterer can fail on degenerate data
 -

 Key: SPARK-3218
 URL: https://issues.apache.org/jira/browse/SPARK-3218
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns
Assignee: Derrick Burns
Priority: Minor
  Labels: clustering

 The KMeans parallel implementation selects points to be cluster centers with 
 probability weighted by their distance to cluster centers.  However, if there 
 are fewer than k DISTINCT points in the data set, this approach will fail.  
 Further, the recent checkin to work around this problem results in selection 
 of the same point repeatedly as a cluster center. 
 The fix is to allow fewer than k cluster centers to be selected.  This 
 requires several changes to the code, as the number of cluster centers is 
 woven into the implementation.
 I have a version of the code that addresses this problem, AND generalizes the 
 distance metric.  However, I see that there are literally hundreds of 
 outstanding pull requests.  If someone will commit to working with me to 
 sponsor the pull request, I will create it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3261:
---

Assignee: Derrick Burns  (was: Apache Spark)

 KMeans clusterer can return duplicate cluster centers
 -

 Key: SPARK-3261
 URL: https://issues.apache.org/jira/browse/SPARK-3261
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns
Assignee: Derrick Burns
  Labels: clustering

 This is a bad design choice.  I think that it is preferable to produce no 
 duplicate cluster centers. So instead of forcing the number of clusters to be 
 K, return at most K clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3219) K-Means clusterer should support Bregman distance functions

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3219:
---

Assignee: Derrick Burns  (was: Apache Spark)

 K-Means clusterer should support Bregman distance functions
 ---

 Key: SPARK-3219
 URL: https://issues.apache.org/jira/browse/SPARK-3219
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Derrick Burns
Assignee: Derrick Burns
  Labels: clustering

 The K-Means clusterer supports the Euclidean distance metric.  However, it is 
 rather straightforward to support Bregman 
 (http://machinelearning.wustl.edu/mlpapers/paper_files/BanerjeeMDG05.pdf) 
 distance functions which would increase the utility of the clusterer 
 tremendously.
 I have modified the clusterer to support pluggable distance functions.  
 However, I notice that there are hundreds of outstanding pull requests.  If 
 someone is willing to work with me to sponsor the work through the process, I 
 will create a pull request.  Otherwise, I will just keep my own fork.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3218) K-Means clusterer can fail on degenerate data

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-3218:
---

Assignee: Apache Spark  (was: Derrick Burns)

 K-Means clusterer can fail on degenerate data
 -

 Key: SPARK-3218
 URL: https://issues.apache.org/jira/browse/SPARK-3218
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns
Assignee: Apache Spark
Priority: Minor
  Labels: clustering

 The KMeans parallel implementation selects points to be cluster centers with 
 probability weighted by their distance to cluster centers.  However, if there 
 are fewer than k DISTINCT points in the data set, this approach will fail.  
 Further, the recent checkin to work around this problem results in selection 
 of the same point repeatedly as a cluster center. 
 The fix is to allow fewer than k cluster centers to be selected.  This 
 requires several changes to the code, as the number of cluster centers is 
 woven into the implementation.
 I have a version of the code that addresses this problem, AND generalizes the 
 distance metric.  However, I see that there are literally hundreds of 
 outstanding pull requests.  If someone will commit to working with me to 
 sponsor the pull request, I will create it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4094) checkpoint should still be available after rdd actions

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-4094.
--
Resolution: Won't Fix

 checkpoint should still be available after rdd actions
 --

 Key: SPARK-4094
 URL: https://issues.apache.org/jira/browse/SPARK-4094
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Zhang, Liye
Assignee: Zhang, Liye

 rdd.checkpoint() must be called before any action on the RDD; if any other 
 action runs first, the checkpoint never succeeds. Take the following code as 
 an example:
 *rdd = sc.makeRDD(...)*
 *rdd.collect()*
 *rdd.checkpoint()*
 *rdd.count()*
 This RDD will never be checkpointed. This is a problem for algorithms with 
 many iterations, such as graph algorithms, where the many iterations make the 
 RDD lineage very long, so the RDD may need to be checkpointed after a certain 
 number of iterations. And if there is also an action inside the iteration 
 loop, checkpoint() will never work for any iteration after the one that calls 
 the action.
 This does not happen with RDD caching: cache() always takes effect before 
 subsequent actions, no matter whether any action ran before cache() was 
 called.
 So rdd.checkpoint() should behave the same way as rdd.cache().
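
 A runnable version of the sequence above (the checkpoint directory and local 
 master are placeholders for a quick repro):
 {code}
 // Runnable version of the sequence above; the checkpoint directory is a
 // placeholder and the local master is only for a quick repro.
 import org.apache.spark.{SparkConf, SparkContext}

 object CheckpointOrderingSketch {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf().setAppName("checkpoint-ordering").setMaster("local[2]")
     val sc = new SparkContext(conf)
     sc.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder directory
     val rdd = sc.makeRDD(1 to 1000)
     rdd.collect()       // an action runs *before* checkpoint()
     rdd.checkpoint()
     rdd.count()
     // Per the report above, this stays false when an action precedes checkpoint().
     println(s"isCheckpointed = ${rdd.isCheckpointed}")
     sc.stop()
   }
 }
 {code}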



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4630) Dynamically determine optimal number of partitions

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4630:
---

Assignee: Apache Spark  (was: Kostas Sakellis)

 Dynamically determine optimal number of partitions
 --

 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Apache Spark

 Partition sizes play a big part in how fast stages execute during a Spark 
 job. There is a direct relationship between the size of partitions and the 
 number of tasks - larger partitions, fewer tasks. For good performance, 
 there is a sweet spot for how large the partitions executed by a task should 
 be. If partitions are too small, then the user pays a disproportionate 
 cost in scheduling overhead. If the partitions are too large, then task 
 execution slows down due to GC pressure and spilling to disk.
 To increase job performance, users often hand-optimize the number (size) 
 of partitions that the next stage gets. Factors that come into play are:
 - incoming partition sizes from the previous stage
 - number of available executors
 - available memory per executor (taking into account 
 spark.shuffle.memoryFraction)
 Spark has access to this data and so should be able to do the partition 
 sizing for the user automatically. This feature can be turned on/off with a 
 configuration option. 
 To make this happen, we propose modifying the DAGScheduler to take partition 
 sizes into account upon stage completion. Before scheduling the next 
 stage, the scheduler can examine the sizes of the partitions and determine 
 the appropriate number of tasks to create. Since this change requires 
 non-trivial modifications to the DAGScheduler, a detailed design doc will be 
 attached before proceeding with the work.
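
 For context, a hedged sketch of the kind of hand tuning described above (all 
 paths and numbers are illustrative only):
 {code}
 // Hedged sketch of today's hand tuning of partition counts; paths and numbers
 // are illustrative only.
 import org.apache.spark.{SparkConf, SparkContext}

 object HandTunedPartitions {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf()
       .setAppName("hand-tuned-partitions")
       .set("spark.default.parallelism", "200")  // hand-picked cluster-wide default
     val sc = new SparkContext(conf)
     sc.textFile("hdfs:///tmp/input", 400)       // explicit minPartitions for the scan
       .flatMap(_.split(" "))
       .map((_, 1))
       .reduceByKey(_ + _, 100)                  // explicit reduce-side partition count
       .saveAsTextFile("hdfs:///tmp/output")
     sc.stop()
   }
 }
 {code}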



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4630) Dynamically determine optimal number of partitions

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-4630:
---

Assignee: Kostas Sakellis  (was: Apache Spark)

 Dynamically determine optimal number of partitions
 --

 Key: SPARK-4630
 URL: https://issues.apache.org/jira/browse/SPARK-4630
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Kostas Sakellis
Assignee: Kostas Sakellis

 Partition sizes play a big part in how fast stages execute during a Spark 
 job. There is a direct relationship between the size of partitions and the 
 number of tasks - larger partitions, fewer tasks. For good performance, 
 there is a sweet spot for how large the partitions executed by a task should 
 be. If partitions are too small, then the user pays a disproportionate 
 cost in scheduling overhead. If the partitions are too large, then task 
 execution slows down due to GC pressure and spilling to disk.
 To increase job performance, users often hand-optimize the number (size) 
 of partitions that the next stage gets. Factors that come into play are:
 - incoming partition sizes from the previous stage
 - number of available executors
 - available memory per executor (taking into account 
 spark.shuffle.memoryFraction)
 Spark has access to this data and so should be able to do the partition 
 sizing for the user automatically. This feature can be turned on/off with a 
 configuration option. 
 To make this happen, we propose modifying the DAGScheduler to take partition 
 sizes into account upon stage completion. Before scheduling the next 
 stage, the scheduler can examine the sizes of the partitions and determine 
 the appropriate number of tasks to create. Since this change requires 
 non-trivial modifications to the DAGScheduler, a detailed design doc will be 
 attached before proceeding with the work.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7711) startTime() is missing

2015-05-18 Thread Sam Steingold (JIRA)
Sam Steingold created SPARK-7711:


 Summary: startTime() is missing
 Key: SPARK-7711
 URL: https://issues.apache.org/jira/browse/SPARK-7711
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.1
Reporter: Sam Steingold


In PySpark 1.3, there appears to be no 
[startTime|https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#startTime%28%29]:
{code}
>>> sc.startTime
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'SparkContext' object has no attribute 'startTime'
{code}
Scala has the method:
{code}
scala> sc.startTime
res1: Long = 1431974499272
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7712) Native Spark Window Functions Performance Improvements

2015-05-18 Thread Herman van Hovell tot Westerflier (JIRA)
Herman van Hovell tot Westerflier created SPARK-7712:


 Summary: Native Spark Window Functions & Performance Improvements
 Key: SPARK-7712
 URL: https://issues.apache.org/jira/browse/SPARK-7712
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.4.0
Reporter: Herman van Hovell tot Westerflier
 Fix For: 1.5.0


Hi All,

After playing with the current Spark window implementation, I tried to take 
this to the next level. My main goal is/was to address the following issues:
- Native Spark SQL: the current implementation relies only on Hive UDAFs. The 
improved implementation uses Spark SQL Aggregates. Hive UDAFs are still 
supported, though.
- Much better performance (10x) in running cases (e.g. BETWEEN UNBOUNDED 
PRECEDING AND CURRENT ROW) and in UNBOUNDED FOLLOWING cases.
- Increased optimization opportunities: AggregateEvaluation-style optimization 
should be possible for in-frame processing. Tungsten might also provide 
interesting optimization opportunities.

The current work is available at the following location: 
https://github.com/hvanhovell/spark-window

I will try to turn this into a PR in the next couple of days. Meanwhile, 
comments, feedback and other discussion are much appreciated.
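
For reference, a small sketch of the kind of running-case query discussed above 
(table and column names are made up; in Spark 1.4 window functions go through a 
HiveContext):
{code}
// Small sketch of a running-sum window query of the kind discussed above.
// Table and column names are made up; window functions need a HiveContext in 1.4.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object RunningSumSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("running-sum-sketch"))
    val sqlContext = new HiveContext(sc)
    import sqlContext.implicits._

    sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("key", "value")
      .registerTempTable("events")

    sqlContext.sql("""
      SELECT key, value,
             SUM(value) OVER (PARTITION BY key ORDER BY value
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum
      FROM events
    """).show()

    sc.stop()
  }
}
{code}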



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2883) Spark Support for ORCFile format

2015-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2883.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6194
[https://github.com/apache/spark/pull/6194]

 Spark Support for ORCFile format
 

 Key: SPARK-2883
 URL: https://issues.apache.org/jira/browse/SPARK-2883
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, SQL
Reporter: Zhan Zhang
Priority: Critical
 Fix For: 1.4.0

 Attachments: 2014-09-12 07.05.24 pm Spark UI.png, 2014-09-12 07.07.19 
 pm jobtracker.png, orc.diff


 Verify the support for OrcInputFormat in Spark, fix any issues that exist, and 
 add documentation of its usage.
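
 A minimal usage sketch of the ORC data source added by this change (paths are 
 placeholders, and the exact calls are an assumption based on the 1.4 data 
 source API rather than a quote from the documentation):
 {code}
 // Minimal usage sketch; paths are placeholders and the calls are an assumption
 // based on the 1.4 DataFrameReader/Writer API.
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.hive.HiveContext

 object OrcSketch {
   def main(args: Array[String]): Unit = {
     val sc = new SparkContext(new SparkConf().setAppName("orc-sketch"))
     val sqlContext = new HiveContext(sc)  // ORC support lives in the Hive module

     val df = sqlContext.read.format("orc").load("hdfs:///tmp/people.orc")
     df.printSchema()
     df.write.format("orc").save("hdfs:///tmp/people_copy.orc")

     sc.stop()
   }
 }
 {code}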



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-7327) DataFrame show() method doesn't like empty dataframes

2015-05-18 Thread Olivier Girardot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Girardot closed SPARK-7327.
---
Resolution: Cannot Reproduce

I can't seem to reproduce the issue, and I did not attach any repro steps... 
Sorry, I'm closing this as Cannot Reproduce until I find my failing example 
again.

Thanks for your time.

 DataFrame show() method doesn't like empty dataframes
 -

 Key: SPARK-7327
 URL: https://issues.apache.org/jira/browse/SPARK-7327
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Olivier Girardot
Priority: Minor

 For an empty DataFrame (for example after a filter), any call to show() ends 
 up with: 
 {code}
 java.util.MissingFormatWidthException: -0s
   at java.util.Formatter$FormatSpecifier.checkGeneral(Formatter.java:2906)
   at java.util.Formatter$FormatSpecifier.<init>(Formatter.java:2680)
   at java.util.Formatter.parse(Formatter.java:2528)
   at java.util.Formatter.format(Formatter.java:2469)
   at java.util.Formatter.format(Formatter.java:2423)
   at java.lang.String.format(String.java:2790)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:200)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2$$anonfun$apply$4.apply(DataFrame.scala:199)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:199)
   at 
 org.apache.spark.sql.DataFrame$$anonfun$showString$2.apply(DataFrame.scala:198)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
   at scala.collection.mutable.ArraySeq.foreach(ArraySeq.scala:73)
   at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
   at scala.collection.AbstractTraversable.map(Traversable.scala:105)
   at org.apache.spark.sql.DataFrame.showString(DataFrame.scala:198)
   at org.apache.spark.sql.DataFrame.show(DataFrame.scala:314)
   at org.apache.spark.sql.DataFrame.show(DataFrame.scala:320)
 {code}
 If no one takes it by next Friday, I'll fix it; the problem seems to come 
 from the colWidths method:
 {code}
 // Compute the width of each column
 val colWidths = Array.fill(numCols)(0)
 for (row <- rows) {
   for ((cell, i) <- row.zipWithIndex) {
     colWidths(i) = math.max(colWidths(i), cell.length)
   }
 }
 {code}
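
 One possible shape of a fix (not necessarily the patch that ends up being 
 merged): never let a column width drop to 0, since a {{%-0s}} format string is 
 invalid.
 {code}
 // One possible shape of a fix (not necessarily the patch that gets merged):
 // never let a column width drop to 0, since "%-0s" is an invalid format.
 def colWidths(rows: Seq[Seq[String]], numCols: Int): Array[Int] = {
   val widths = Array.fill(numCols)(1)            // minimum width of 1
   for (row <- rows; (cell, i) <- row.zipWithIndex) {
     widths(i) = math.max(widths(i), cell.length)
   }
   widths
 }
 {code}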



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7711) startTime() is missing

2015-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-7711:
-
Priority: Minor  (was: Major)

 startTime() is missing
 --

 Key: SPARK-7711
 URL: https://issues.apache.org/jira/browse/SPARK-7711
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.1
Reporter: Sam Steingold
Priority: Minor

 In PySpark 1.3, there appears to be no 
 [startTime|https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#startTime%28%29]:
 {code}
 >>> sc.startTime
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 AttributeError: 'SparkContext' object has no attribute 'startTime'
 {code}
 Scala has the method:
 {code}
 scala> sc.startTime
 res1: Long = 1431974499272
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7690) MulticlassClassificationEvaluator for tuning Multiclass Classifiers

2015-05-18 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548577#comment-14548577
 ] 

Joseph K. Bradley commented on SPARK-7690:
--

+1
We should also check for implementations in other packages to find out what 
arguments and argument names (including micro and macro) are most common.  
Perhaps R, Weka, etc.

 MulticlassClassificationEvaluator for tuning Multiclass Classifiers
 ---

 Key: SPARK-7690
 URL: https://issues.apache.org/jira/browse/SPARK-7690
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Ram Sriharsha
Assignee: Ram Sriharsha

 Provide a MulticlassClassificationEvaluator with a weighted F1-score to tune 
 multiclass classifiers using the Pipeline API.
 MLlib already provides MulticlassMetrics, which a 
 MulticlassClassificationEvaluator can wrap to expose the weighted F1-score as 
 a metric.
 The functionality could be similar to 
 scikit-learn (http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html)
 in that we can support micro, macro and weighted versions of the F1-score 
 (with weighted being the default).
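
 For illustration, a hedged sketch of computing a weighted F1-score from 
 MulticlassMetrics (the (prediction, label) data and the frequency weighting 
 are illustrative only):
 {code}
 // Hedged sketch of a weighted F1-score computed from MulticlassMetrics; the
 // (prediction, label) pairs and the frequency weighting are illustrative only.
 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.mllib.evaluation.MulticlassMetrics

 object WeightedF1Sketch {
   def main(args: Array[String]): Unit = {
     val conf = new SparkConf().setAppName("weighted-f1-sketch").setMaster("local[2]")
     val sc = new SparkContext(conf)
     // (prediction, label) pairs; in practice these come from a fitted model.
     val predictionAndLabels = sc.parallelize(Seq(
       (0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (2.0, 2.0), (2.0, 1.0), (0.0, 0.0)))
     val metrics = new MulticlassMetrics(predictionAndLabels)

     val labelCounts = predictionAndLabels.map(_._2).countByValue()
     val total = labelCounts.values.sum.toDouble
     // Weighted F1: per-label F1 weighted by the label's frequency.
     val weightedF1 = metrics.labels.map { label =>
       (labelCounts(label) / total) * metrics.fMeasure(label)
     }.sum
     println(s"weighted F1 = $weightedF1")
     sc.stop()
   }
 }
 {code}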



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6785:
---

Assignee: (was: Apache Spark)

 DateUtils can not handle date before 1970/01/01 correctly
 -

 Key: SPARK-6785
 URL: https://issues.apache.org/jira/browse/SPARK-6785
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu

 {code}
 scala> val d = new Date(100)
 d: java.sql.Date = 1969-12-31
 scala> DateUtils.toJavaDate(DateUtils.fromJavaDate(d))
 res1: java.sql.Date = 1970-01-01
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly

2015-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548584#comment-14548584
 ] 

Apache Spark commented on SPARK-6785:
-

User 'ckadner' has created a pull request for this issue:
https://github.com/apache/spark/pull/6236

 DateUtils can not handle date before 1970/01/01 correctly
 -

 Key: SPARK-6785
 URL: https://issues.apache.org/jira/browse/SPARK-6785
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu

 {code}
 scala> val d = new Date(100)
 d: java.sql.Date = 1969-12-31
 scala> DateUtils.toJavaDate(DateUtils.fromJavaDate(d))
 res1: java.sql.Date = 1970-01-01
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7696:
---

Assignee: Apache Spark

 Aggregate function's result should be nullable only if the input expression 
 is nullable
 ---

 Key: SPARK-7696
 URL: https://issues.apache.org/jira/browse/SPARK-7696
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Haopu Wang
Assignee: Apache Spark
Priority: Minor

 In SparkSQL, the aggregate function's result is currently always nullable.
 It would make sense to change the behavior so that if the input expression is 
 nullable, the result is nullable; otherwise, the result is non-nullable.
 Please see the following discussion:
 
 From: Olivier Girardot [mailto:ssab...@gmail.com] 
 Sent: Tuesday, May 12, 2015 5:12 AM
 To: Reynold Xin
 Cc: Haopu Wang; user
 Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
  
 I'll look into it - not sure yet what I can get out of exprs :p 
  
 On Mon, May 11, 2015 at 22:35, Reynold Xin r...@databricks.com wrote:
 Thanks for catching this. I didn't read carefully enough.
  
 It'd make sense to have the udaf result be non-nullable, if the exprs are 
 indeed non-nullable.
  
 On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot ssab...@gmail.com wrote:
 Hi Haopu, 
 actually here `key` is nullable because this is your input's schema : 
 scala> result.printSchema
 root 
 |-- key: string (nullable = true) 
 |-- SUM(value): long (nullable = true) 
 scala> df.printSchema
 root 
 |-- key: string (nullable = true) 
 |-- value: long (nullable = false)
  
 I tried it with a schema where the key is not flagged as nullable, and the 
 schema is actually respected. What you can argue however is that SUM(value) 
 should also be not nullable since value is not nullable.
  
 @rxin do you think it would be reasonable to flag the Sum aggregation 
 function as nullable (or not) depending on the input expression's schema ?
  
 Regards, 
  
 Olivier.
 On Mon, May 11, 2015 at 22:07, Reynold Xin r...@databricks.com wrote:
 Not by design. Would you be interested in submitting a pull request?
  
 On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote:
 I am trying to get the result schema of aggregate functions using the
 DataFrame API.
 However, I find the result fields of groupBy columns are always nullable
 even when the source field is not nullable.
 I want to know if this is by design, thank you! Below is the simple code
 to show the issue.
 ==
   import sqlContext.implicits._
   import org.apache.spark.sql.functions._
   case class Test(key: String, value: Long)
   val df = sc.makeRDD(Seq(Test("k1", 2), Test("k1", 1))).toDF
   val result = df.groupBy("key").agg($"key", sum("value"))
   // From the output, you can see the key column is nullable, why??
   result.printSchema
 //root
 // |-- key: string (nullable = true)
 // |-- SUM(value): long (nullable = true)
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
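 A hedged, self-contained illustration of the proposed rule (a simplified model, not Catalyst's real class hierarchy): the aggregate's nullability mirrors its input expression instead of being hard-coded to true.
 {code}
 // Simplified model for illustration only; names do not match Catalyst internals.
 trait Expr { def nullable: Boolean }

 case class Attr(name: String, nullable: Boolean) extends Expr

 case class SumExpr(child: Expr) extends Expr {
   // Proposed behavior: nullable only when the input expression is nullable.
   override def nullable: Boolean = child.nullable
 }

 // SumExpr(Attr("value", nullable = false)).nullable == false
 // SumExpr(Attr("key",   nullable = true)).nullable  == true
 {code}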
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable

2015-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548617#comment-14548617
 ] 

Apache Spark commented on SPARK-7696:
-

User 'ogirardot' has created a pull request for this issue:
https://github.com/apache/spark/pull/6237

 Aggregate function's result should be nullable only if the input expression 
 is nullable
 ---

 Key: SPARK-7696
 URL: https://issues.apache.org/jira/browse/SPARK-7696
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Haopu Wang
Priority: Minor

 In SparkSQL, the aggregate function's result is currently always nullable.
 It would make sense to change the behavior so that if the input expression is 
 nullable, the result is nullable; otherwise, the result is non-nullable.
 Please see the following discussion:
 
 From: Olivier Girardot [mailto:ssab...@gmail.com] 
 Sent: Tuesday, May 12, 2015 5:12 AM
 To: Reynold Xin
 Cc: Haopu Wang; user
 Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
  
 I'll look into it - not sure yet what I can get out of exprs :p 
  
 On Mon, May 11, 2015 at 22:35, Reynold Xin r...@databricks.com wrote:
 Thanks for catching this. I didn't read carefully enough.
  
 It'd make sense to have the udaf result be non-nullable, if the exprs are 
 indeed non-nullable.
  
 On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot ssab...@gmail.com wrote:
 Hi Haopu, 
 actually here `key` is nullable because this is your input's schema : 
 scala> result.printSchema
 root 
 |-- key: string (nullable = true) 
 |-- SUM(value): long (nullable = true) 
 scala> df.printSchema
 root 
 |-- key: string (nullable = true) 
 |-- value: long (nullable = false)
  
 I tried it with a schema where the key is not flagged as nullable, and the 
 schema is actually respected. What you can argue however is that SUM(value) 
 should also be not nullable since value is not nullable.
  
 @rxin do you think it would be reasonable to flag the Sum aggregation 
 function as nullable (or not) depending on the input expression's schema ?
  
 Regards, 
  
 Olivier.
 On Mon, May 11, 2015 at 22:07, Reynold Xin r...@databricks.com wrote:
 Not by design. Would you be interested in submitting a pull request?
  
 On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote:
 I am trying to get the result schema of aggregate functions using the
 DataFrame API.
 However, I find the result fields of groupBy columns are always nullable
 even when the source field is not nullable.
 I want to know if this is by design, thank you! Below is the simple code
 to show the issue.
 ==
   import sqlContext.implicits._
   import org.apache.spark.sql.functions._
   case class Test(key: String, value: Long)
   val df = sc.makeRDD(Seq(Test("k1", 2), Test("k1", 1))).toDF
   val result = df.groupBy("key").agg($"key", sum("value"))
   // From the output, you can see the key column is nullable, why??
   result.printSchema
 //root
 // |-- key: string (nullable = true)
 // |-- SUM(value): long (nullable = true)
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7696) Aggregate function's result should be nullable only if the input expression is nullable

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7696:
---

Assignee: (was: Apache Spark)

 Aggregate function's result should be nullable only if the input expression 
 is nullable
 ---

 Key: SPARK-7696
 URL: https://issues.apache.org/jira/browse/SPARK-7696
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0, 1.3.1
Reporter: Haopu Wang
Priority: Minor

 In SparkSQL, the aggregate function's result is currently always nullable.
 It would make sense to change the behavior so that if the input expression is 
 nullable, the result is nullable; otherwise, the result is non-nullable.
 Please see the following discussion:
 
 From: Olivier Girardot [mailto:ssab...@gmail.com] 
 Sent: Tuesday, May 12, 2015 5:12 AM
 To: Reynold Xin
 Cc: Haopu Wang; user
 Subject: Re: [SparkSQL 1.4.0] groupBy columns are always nullable?
  
 I'll look into it - not sure yet what I can get out of exprs :p 
  
 On Mon, May 11, 2015 at 22:35, Reynold Xin r...@databricks.com wrote:
 Thanks for catching this. I didn't read carefully enough.
  
 It'd make sense to have the udaf result be non-nullable, if the exprs are 
 indeed non-nullable.
  
 On Mon, May 11, 2015 at 1:32 PM, Olivier Girardot ssab...@gmail.com wrote:
 Hi Haopu, 
 actually here `key` is nullable because this is your input's schema : 
 scala> result.printSchema
 root 
 |-- key: string (nullable = true) 
 |-- SUM(value): long (nullable = true) 
 scala> df.printSchema
 root 
 |-- key: string (nullable = true) 
 |-- value: long (nullable = false)
  
 I tried it with a schema where the key is not flagged as nullable, and the 
 schema is actually respected. What you can argue however is that SUM(value) 
 should also be not nullable since value is not nullable.
  
 @rxin do you think it would be reasonable to flag the Sum aggregation 
 function as nullable (or not) depending on the input expression's schema ?
  
 Regards, 
  
 Olivier.
 On Mon, May 11, 2015 at 22:07, Reynold Xin r...@databricks.com wrote:
 Not by design. Would you be interested in submitting a pull request?
  
 On Mon, May 11, 2015 at 1:48 AM, Haopu Wang hw...@qilinsoft.com wrote:
 I am trying to get the result schema of aggregate functions using the
 DataFrame API.
 However, I find the result fields of groupBy columns are always nullable
 even when the source field is not nullable.
 I want to know if this is by design, thank you! Below is the simple code
 to show the issue.
 ==
   import sqlContext.implicits._
   import org.apache.spark.sql.functions._
   case class Test(key: String, value: Long)
   val df = sc.makeRDD(Seq(Test("k1", 2), Test("k1", 1))).toDF
   val result = df.groupBy("key").agg($"key", sum("value"))
   // From the output, you can see the key column is nullable, why??
   result.printSchema
 //root
 // |-- key: string (nullable = true)
 // |-- SUM(value): long (nullable = true)
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7570) Ignore _temporary folders during partition discovery

2015-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7570.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6091
[https://github.com/apache/spark/pull/6091]

 Ignore _temporary folders during partition discovery
 

 Key: SPARK-7570
 URL: https://issues.apache.org/jira/browse/SPARK-7570
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1, 1.4.0
Reporter: Cheng Lian
Assignee: Cheng Lian
Priority: Critical
 Fix For: 1.4.0


 When speculation is turned on, directories named {{_temporary}} may be left 
 in data directories after saving a DataFrame. These directories should be 
 ignored; currently they simply cause partition discovery to fail.
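 A minimal sketch of the filtering idea, assuming partition discovery walks directories with a Hadoop FileSystem (the names here are illustrative, not Spark's actual internals):
 {code}
 import org.apache.hadoop.fs.{FileStatus, FileSystem, Path}

 // Skip metadata entries such as _temporary directories and _SUCCESS markers
 // so they never reach the partition-column parser.
 def listPartitionCandidates(fs: FileSystem, root: Path): Seq[FileStatus] =
   fs.listStatus(root).toSeq.filter { status =>
     val name = status.getPath.getName
     !name.startsWith("_") && !name.startsWith(".")
   }
 {code}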



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7380) Python: Transformer/Estimator should be copyable

2015-05-18 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7380.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6088
[https://github.com/apache/spark/pull/6088]

 Python: Transformer/Estimator should be copyable
 

 Key: SPARK-7380
 URL: https://issues.apache.org/jira/browse/SPARK-7380
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
 Fix For: 1.4.0


 Same as [SPARK-5956]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7631) treenode argString should not print children

2015-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7631.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6144
[https://github.com/apache/spark/pull/6144]

 treenode argString should not print children
 

 Key: SPARK-7631
 URL: https://issues.apache.org/jira/browse/SPARK-7631
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.1
Reporter: Fei Wang
Priority: Minor
 Fix For: 1.4.0


 spark-sql> explain extended
   select * from (
     select key from src union all
     select key from src) t;
 the Spark plan will print children in argString:

 == Physical Plan ==
 Union[ HiveTableScan [key#1], (MetastoreRelation default, src, None), None,
  HiveTableScan [key#3], (MetastoreRelation default, src, None), None]
  HiveTableScan [key#1], (MetastoreRelation default, src, None), None
  HiveTableScan [key#3], (MetastoreRelation default, src, None), None
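 A hedged sketch of the idea behind a fix (a simplified stand-in, not Catalyst's actual TreeNode): when building argString from a node's constructor arguments, drop any argument that is already one of its children, so the tree formatter prints each child only once.
 {code}
 // Simplified stand-in for Catalyst's TreeNode, for illustration only.
 abstract class SimpleNode extends Product {
   def children: Seq[SimpleNode]

   // Keep only the non-child constructor arguments in the one-line description.
   def argString: String = productIterator.collect {
     case arg if !children.contains(arg) => arg.toString
   }.mkString(", ")
 }
 {code}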



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7269) Incorrect aggregation analysis

2015-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-7269.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6173
[https://github.com/apache/spark/pull/6173]

 Incorrect aggregation analysis
 --

 Key: SPARK-7269
 URL: https://issues.apache.org/jira/browse/SPARK-7269
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
 Fix For: 1.4.0


 In a case-insensitive analyzer (HiveContext), differences in attribute-name 
 capitalization will fail the analysis check for aggregation.
 {code}
 test("check analysis failed in case in-sensitive") {
   Seq(1, 2, 3).map(i => (i, i.toString)).toDF("key", "value")
     .registerTempTable("df_analysis")
   sql("SELECT kEy from df_analysis group by key")
 }
 {code}
 {noformat}
 expression 'kEy' is neither present in the group by, nor is it an aggregate 
 function. Add to group by or wrap in first() if you don't care which value 
 you get.;
 org.apache.spark.sql.AnalysisException: expression 'kEy' is neither present 
 in the group by, nor is it an aggregate function. Add to group by or wrap in 
 first() if you don't care which value you get.;
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:39)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:85)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$4.apply(CheckAnalysis.scala:101)
   at 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:101)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:50)
   at 
 org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:89)
   at 
 org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:50)
   at 
 org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:39)
   at 
 org.apache.spark.sql.SQLContext$QueryExecution.assertAnalyzed(SQLContext.scala:1121)
   at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
   at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
   at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:97)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply$mcV$sp(SQLQuerySuite.scala:408)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406)
   at 
 org.apache.spark.sql.hive.execution.SQLQuerySuite$$anonfun$15.apply(SQLQuerySuite.scala:406)
   at 
 org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   at org.scalatest.Transformer.apply(Transformer.scala:20)
   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
   at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
   at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
   at 
 org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
 {noformat}
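 A hedged sketch of the intended behavior (illustrative only, not the actual Catalyst check, which works on resolved expressions rather than strings): the grouping-coverage test should compare attribute names with the analyzer's resolver, which is case-insensitive in HiveContext, instead of by exact equality.
 {code}
 // Illustrative only; names and signatures are assumptions for this sketch.
 type Resolver = (String, String) => Boolean

 val caseInsensitiveResolution: Resolver = _ equalsIgnoreCase _
 val caseSensitiveResolution: Resolver = _ == _

 def coveredByGroupBy(attr: String, groupingAttrs: Seq[String], resolver: Resolver): Boolean =
   groupingAttrs.exists(g => resolver(g, attr))

 // coveredByGroupBy("kEy", Seq("key"), caseInsensitiveResolution) == true
 {code}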



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6785) DateUtils can not handle date before 1970/01/01 correctly

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6785:
---

Assignee: Apache Spark

 DateUtils can not handle date before 1970/01/01 correctly
 -

 Key: SPARK-6785
 URL: https://issues.apache.org/jira/browse/SPARK-6785
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Davies Liu
Assignee: Apache Spark

 {code}
 scala val d = new Date(100)
 d: java.sql.Date = 1969-12-31
 scala DateUtils.toJavaDate(DateUtils.fromJavaDate(d))
 res1: java.sql.Date = 1970-01-01
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7497) test_count_by_value_and_window is flaky

2015-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14548630#comment-14548630
 ] 

Apache Spark commented on SPARK-7497:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/6239

 test_count_by_value_and_window is flaky
 ---

 Key: SPARK-7497
 URL: https://issues.apache.org/jira/browse/SPARK-7497
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Davies Liu
Priority: Critical
  Labels: flaky-test

 Saw this test failure in 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console
 {code}
 ==
 FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests)
 --
 Traceback (most recent call last):
   File pyspark/streaming/tests.py, line 418, in 
 test_count_by_value_and_window
 self._test_func(input, func, expected)
   File pyspark/streaming/tests.py, line 133, in _test_func
 self.assertEqual(expected, result)
 AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], 
 [6]] != [[1], [2], [3], [4], [5], [6], [6], [6]]
 First list contains 2 additional elements.
 First extra element 8:
 [6]
 - [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]]
 ? --
 + [[1], [2], [3], [4], [5], [6], [6], [6]]
 --
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7497) test_count_by_value_and_window is flaky

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7497:
---

Assignee: Davies Liu  (was: Apache Spark)

 test_count_by_value_and_window is flaky
 ---

 Key: SPARK-7497
 URL: https://issues.apache.org/jira/browse/SPARK-7497
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Davies Liu
Priority: Critical
  Labels: flaky-test

 Saw this test failure in 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console
 {code}
 ==
 FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests)
 --
 Traceback (most recent call last):
   File pyspark/streaming/tests.py, line 418, in 
 test_count_by_value_and_window
 self._test_func(input, func, expected)
   File pyspark/streaming/tests.py, line 133, in _test_func
 self.assertEqual(expected, result)
 AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], 
 [6]] != [[1], [2], [3], [4], [5], [6], [6], [6]]
 First list contains 2 additional elements.
 First extra element 8:
 [6]
 - [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]]
 ? --
 + [[1], [2], [3], [4], [5], [6], [6], [6]]
 --
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7497) test_count_by_value_and_window is flaky

2015-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7497:
---

Assignee: Apache Spark  (was: Davies Liu)

 test_count_by_value_and_window is flaky
 ---

 Key: SPARK-7497
 URL: https://issues.apache.org/jira/browse/SPARK-7497
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Streaming
Affects Versions: 1.4.0
Reporter: Xiangrui Meng
Assignee: Apache Spark
Priority: Critical
  Labels: flaky-test

 Saw this test failure in 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32268/console
 {code}
 ==
 FAIL: test_count_by_value_and_window (__main__.WindowFunctionTests)
 --
 Traceback (most recent call last):
   File pyspark/streaming/tests.py, line 418, in 
 test_count_by_value_and_window
 self._test_func(input, func, expected)
   File pyspark/streaming/tests.py, line 133, in _test_func
 self.assertEqual(expected, result)
 AssertionError: Lists differ: [[1], [2], [3], [4], [5], [6], [6], [6], [6], 
 [6]] != [[1], [2], [3], [4], [5], [6], [6], [6]]
 First list contains 2 additional elements.
 First extra element 8:
 [6]
 - [[1], [2], [3], [4], [5], [6], [6], [6], [6], [6]]
 ? --
 + [[1], [2], [3], [4], [5], [6], [6], [6]]
 --
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7673) DataSourceStrategy's buildPartitionedTableScan always lists file status for all data files

2015-05-18 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-7673.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue has been addressed by 
https://github.com/apache/spark/commit/9dadf019b93038e1e18336ccd06c5eecb4bae32f.

 DataSourceStrategy's buildPartitionedTableScan always lists file status 
 for all data files 
 ---

 Key: SPARK-7673
 URL: https://issues.apache.org/jira/browse/SPARK-7673
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Cheng Lian
Priority: Blocker
 Fix For: 1.4.0


 See 
 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala#L134-141



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7713) Use shared broadcast hadoop conf for partitioned table scan.

2015-05-18 Thread Yin Huai (JIRA)
Yin Huai created SPARK-7713:
---

 Summary: Use shared broadcast hadoop conf for partitioned table 
scan.
 Key: SPARK-7713
 URL: https://issues.apache.org/jira/browse/SPARK-7713
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Blocker


While debugging SPARK-7673, we also found that we are broadcasting a Hadoop 
conf for every partition (backed by a Hadoop RDD). This also causes a 
performance regression when compiling a query that involves a large number of 
partitions.
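A minimal sketch of the intent (not the merged change): broadcast the Hadoop configuration once per table scan and let every partition's RDD reuse that single broadcast, instead of creating a new broadcast per Hadoop RDD.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkContext}
import org.apache.spark.broadcast.Broadcast

// Create one broadcast for the whole partitioned scan and pass it to each
// per-partition RDD, rather than broadcasting the conf again for every partition.
def sharedHadoopConf(
    sc: SparkContext,
    hadoopConf: Configuration): Broadcast[SerializableWritable[Configuration]] =
  sc.broadcast(new SerializableWritable(hadoopConf))
{code}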



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6216) Check Python version in worker before run PySpark job

2015-05-18 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6216.
---
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 6203
[https://github.com/apache/spark/pull/6203]

 Check Python version in worker before run PySpark job
 -

 Key: SPARK-6216
 URL: https://issues.apache.org/jira/browse/SPARK-6216
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Davies Liu
Assignee: Davies Liu
 Fix For: 1.4.0


 PySpark can only run when the driver and worker use the same major Python 
 version (both 2.6 or both 2.7); it will cause random errors if the driver has 
 2.7 and the worker has 2.6 (or vice versa).
 For example:
 {code}
 davies@localhost:~/work/spark$ PYSPARK_PYTHON=python2.6 
 PYSPARK_DRIVER_PYTHON=python2.7 bin/pyspark
 Using Python version 2.7.7 (default, Jun  2 2014 12:48:16)
 SparkContext available as sc, SQLContext available as sqlCtx.
 >>> sc.textFile('LICENSE').map(lambda l: l.split()).count()
 org.apache.spark.api.python.PythonException: Traceback (most recent call last):
   File "/Users/davies/work/spark/python/pyspark/worker.py", line 101, in main
     process()
   File "/Users/davies/work/spark/python/pyspark/worker.py", line 96, in process
     serializer.dump_stream(func(split_index, iterator), outfile)
   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in pipeline_func
     return func(split, prev_func(split, iterator))
   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in pipeline_func
     return func(split, prev_func(split, iterator))
   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 2251, in pipeline_func
     return func(split, prev_func(split, iterator))
   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 281, in func
     return f(iterator)
   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 931, in <lambda>
     return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File "/Users/davies/work/spark/python/pyspark/rdd.py", line 931, in <genexpr>
     return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   File "<stdin>", line 1, in <lambda>
 TypeError: 'bool' object is not callable
   at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:136)
   at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:177)
   at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:95)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization

2015-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3267.
-
Resolution: Cannot Reproduce

 Deadlock between ScalaReflectionLock and Data type initialization
 -

 Key: SPARK-3267
 URL: https://issues.apache.org/jira/browse/SPARK-3267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Aaron Davidson
Priority: Critical

 Deadlock here:
 {code}
 "Executor task launch worker-0" daemon prio=10 tid=0x7fab50036000 nid=0x27a in Object.wait() [0x7fab60c2e000]
    java.lang.Thread.State: RUNNABLE
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:202)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scala:175)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:304)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:314)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:313)
 at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
 at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
 ...
 {code}
 and
 {code}
 "Executor task launch worker-2" daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000]
    java.lang.Thread.State: RUNNABLE
 at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
 - locked 0x00064e5d9a48 (a org.apache.spark.sql.catalyst.expressions.Cast)
 at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 at scala.collection.AbstractTraversable.map(Traversable.scala:105)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
 at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126)
 at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
 at 

[jira] [Resolved] (SPARK-4523) Improve handling of serialized schema information

2015-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-4523.
-
Resolution: Won't Fix

We haven't changed this for a few releases now, and it seems unlikely that we 
will, so I'm going to close this issue.

 Improve handling of serialized schema information
 -

 Key: SPARK-4523
 URL: https://issues.apache.org/jira/browse/SPARK-4523
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Priority: Critical

 There are several issues with our current handling of metadata serialization, 
 which is especially troublesome since this is the only place that we persist 
 information directly using Spark SQL.  Moving forward we should do the 
 following:
  - Relax the parsing so that it does not fail when optional fields are 
 missing (i.e. containsNull or metadata)
  - Include a regression suite that attempts to read old parquet files written 
 by previous versions of Spark SQL.
  - Provide better warning messages when various forms of parsing fail (I 
 think that it is silent right now which makes tracking down bugs more 
 difficult than it needs to be).
  - Deprecate (display a warning) when reading data with the old case class 
 schema representation and eventually remove it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6241) hiveql ANALYZE TABLE doesn't work for external tables

2015-05-18 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6241.
-
Resolution: Won't Fix

Datasource tables have their own mechanism for reporting statistics that does 
not rely on ANALYZE.  Please reopen if that is not working for you.

 hiveql ANALYZE TABLE doesn't work for external tables
 -

 Key: SPARK-6241
 URL: https://issues.apache.org/jira/browse/SPARK-6241
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.3.0
Reporter: Kai Zeng
Priority: Critical

 ANALYZE TABLE does not collect statistics for external tables, but works 
 well for tables created by CREATE AS SELECT. 
 I also tried refresh table to refresh the metadata cache, but got a 
 NullPointerException:
 java.util.concurrent.ExecutionException: java.lang.NullPointerException
 at 
 com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299)
 at 
 com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286)
 at 
 com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116)
 at 
 com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:135)
 at 
 com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2344)
 at 
 com.google.common.cache.LocalCache$Segment$1.run(LocalCache.java:2327)
 at 
 com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
 at 
 com.google.common.util.concurrent.ExecutionList.executeListener(ExecutionList.java:156)
 at 
 com.google.common.util.concurrent.ExecutionList.add(ExecutionList.java:101)
 at 
 com.google.common.util.concurrent.AbstractFuture.addListener(AbstractFuture.java:170)
 at 
 com.google.common.cache.LocalCache$Segment.loadAsync(LocalCache.java:2322)
 at 
 com.google.common.cache.LocalCache$Segment.refresh(LocalCache.java:2385)
 at com.google.common.cache.LocalCache.refresh(LocalCache.java:4085)
 at 
 com.google.common.cache.LocalCache$LocalLoadingCache.refresh(LocalCache.java:4825)
 at 
 org.apache.spark.sql.hive.HiveMetastoreCatalog.refreshTable(HiveMetastoreCatalog.scala:108)
 at org.apache.spark.sql.sources.RefreshTable.run(ddl.scala:404)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:55)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:55)
 at 
 org.apache.spark.sql.execution.ExecutedCommand.execute(commands.scala:65)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:1092)
 at 
 org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:1092)
 at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:134)
 at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:117)
 at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:51)
 at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:92)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


