[jira] [Created] (SPARK-2714) DAGScheduler logs jobid when runJob finishes
YanTang Zhai created SPARK-2714: --- Summary: DAGScheduler logs jobid when runJob finishes Key: SPARK-2714 URL: https://issues.apache.org/jira/browse/SPARK-2714 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: YanTang Zhai Priority: Minor DAGScheduler logs jobid when runJob finishes -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2613) CLONE - word2vec: Distributed Representation of Words
[ https://issues.apache.org/jira/browse/SPARK-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-2613. Assignee: Xiangrui Meng (was: Liquan Pei) CLONE - word2vec: Distributed Representation of Words - Key: SPARK-2613 URL: https://issues.apache.org/jira/browse/SPARK-2613 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Yifan Yang Assignee: Xiangrui Meng Original Estimate: 672h Remaining Estimate: 672h We would like to add parallel implementation of word2vec to MLlib. word2vec finds distributed representation of words through training of large data sets. The Spark programming model fits nicely with word2vec as the training algorithm of word2vec is embarrassingly parallel. We will focus on skip-gram model and negative sampling in our initial implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2510) word2vec: Distributed Representation of Words
[ https://issues.apache.org/jira/browse/SPARK-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075951#comment-14075951 ] Xiangrui Meng commented on SPARK-2510: -- Had an offline discussion with [~liquanpei] and checked the C implementation of word2vec. It is not embarrassingly parallel because it frequently updates the global vectors, which is okay for multithreading but bad for a distributed setting. We are thinking about making stochastic updates within each partition and then merging the vectors. Averaging works for SGD, but I doubt whether it would work here. More to investigate. word2vec: Distributed Representation of Words - Key: SPARK-2510 URL: https://issues.apache.org/jira/browse/SPARK-2510 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Liquan Pei Assignee: Liquan Pei Original Estimate: 672h Remaining Estimate: 672h We would like to add a parallel implementation of word2vec to MLlib. word2vec finds distributed representations of words through training on large data sets. The Spark programming model fits nicely with word2vec, as the training algorithm of word2vec is embarrassingly parallel. We will focus on the skip-gram model and negative sampling in our initial implementation. -- This message was sent by Atlassian JIRA (v6.2#6252)
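A minimal sketch of the partition-local update idea from the comment above; trainPartition, averageVectors, corpus, and initialVectors are hypothetical stand-ins for illustration, not MLlib API:
{code}
// Sketch only: each partition runs stochastic updates locally, then the
// per-partition vectors are merged. Averaging (as below) is exactly the
// step the comment flags as unproven for word2vec.
def trainPartition(sentences: Iterator[Seq[String]], init: Array[Float]): Array[Float] = ???

def averageVectors(a: Array[Float], b: Array[Float]): Array[Float] =
  a.zip(b).map { case (x, y) => (x + y) / 2.0f }

val merged = corpus                                   // corpus: RDD[Seq[String]]
  .mapPartitions(it => Iterator(trainPartition(it, initialVectors)))
  .reduce(averageVectors)
{code}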
[jira] [Updated] (SPARK-2692) Decision Tree API update
[ https://issues.apache.org/jira/browse/SPARK-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2692: - Assignee: Joseph K. Bradley Decision Tree API update Key: SPARK-2692 URL: https://issues.apache.org/jira/browse/SPARK-2692 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Split Decision Tree API into separate Classifier and Regressor classes. Details: (a) Split classes: E.g.: DecisionTree -- DecisionTreeClassifier and DecisionTreeRegressor (b) Included print() function for human-readable model descriptions (c) Renamed Strategy to *Params. Changed to take strings instead of special types. (d) Made configuration classes (Impurity, QuantileStrategy) private to mllib. (e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart. (f) Removed static train() functions in favor of using Params classes. (g) Introduced DatasetInfo class for metadata. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2692) Decision Tree API update
[ https://issues.apache.org/jira/browse/SPARK-2692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2692: - Target Version/s: 1.1.0 Affects Version/s: 1.0.0 Decision Tree API update Key: SPARK-2692 URL: https://issues.apache.org/jira/browse/SPARK-2692 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley Split Decision Tree API into separate Classifier and Regressor classes. Details: (a) Split classes: E.g.: DecisionTree -- DecisionTreeClassifier and DecisionTreeRegressor (b) Included print() function for human-readable model descriptions (c) Renamed Strategy to *Params. Changed to take strings instead of special types. (d) Made configuration classes (Impurity, QuantileStrategy) private to mllib. (e) Changed meaning of maxDepth by 1 to match scikit-learn and rpart. (f) Removed static train() functions in favor of using Params classes. (g) Introduced DatasetInfo class for metadata. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2715) ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling
YanTang Zhai created SPARK-2715: --- Summary: ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling Key: SPARK-2715 URL: https://issues.apache.org/jira/browse/SPARK-2715 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor ExternalAppendOnlyMap adds a max limit on the number of spills and on the disk bytes written when spilling. This way, a task with data skew can fail fast instead of running for a long time. -- This message was sent by Atlassian JIRA (v6.2#6252)
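A rough sketch of what such a guard could look like in the spill path; the configuration keys and class shape here are assumptions for illustration, not anything Spark actually defines:
{code}
import org.apache.spark.{SparkConf, SparkException}

// Sketch only: the config keys below are assumptions, not real Spark keys.
class SpillGuard(conf: SparkConf) {
  private var spillCount = 0
  private var diskBytesSpilled = 0L
  private val maxSpillTimes = conf.getInt("spark.shuffle.spill.maxTimes", 1000)
  private val maxSpillBytes = conf.getLong("spark.shuffle.spill.maxDiskBytes", 100L << 30)

  /** Called once per spill; fails the task fast when it spills too much. */
  def onSpill(bytesThisSpill: Long): Unit = {
    spillCount += 1
    diskBytesSpilled += bytesThisSpill
    if (spillCount > maxSpillTimes || diskBytesSpilled > maxSpillBytes) {
      throw new SparkException(s"Task spilled $spillCount times " +
        s"($diskBytesSpilled bytes to disk); the input partition is likely skewed")
    }
  }
}
{code}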
[jira] [Updated] (SPARK-2702) Upgrade Tachyon dependency to 0.5.0
[ https://issues.apache.org/jira/browse/SPARK-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyuan Li updated SPARK-2702: -- Assignee: Rong Gu Upgrade Tachyon dependency to 0.5.0 --- Key: SPARK-2702 URL: https://issues.apache.org/jira/browse/SPARK-2702 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Haoyuan Li Assignee: Rong Gu Fix For: 1.1.0 Upgrade Tachyon dependency to 0.5.0: a. Code dependency. b. Start Tachyon script. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2703) Make Tachyon related unit tests execute without deploying a Tachyon system locally.
[ https://issues.apache.org/jira/browse/SPARK-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Haoyuan Li updated SPARK-2703: -- Assignee: Rong Gu Make Tachyon related unit tests execute without deploying a Tachyon system locally. --- Key: SPARK-2703 URL: https://issues.apache.org/jira/browse/SPARK-2703 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Haoyuan Li Assignee: Rong Gu Fix For: 1.1.0 Use the LocalTachyonCluster class in tachyon-test.jar in 0.5.0 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2614) Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml)
[ https://issues.apache.org/jira/browse/SPARK-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14075998#comment-14075998 ] Christian Tzolov commented on SPARK-2614: - The #1611 pull request addresses some of the concerns expressed above. It doesn't put everything into a single package. Instead, when -Pdeb is enabled, two Debian packages are built: 1. spark_XXX_all.deb - the current Spark Debian package, without modifications. 2. spark_XXX_examples.deb - an additional deb package that bundles only spark_examples.jar Add the spark-examples-xxx-.jar to the Debian packages created with mvn ... -Pdeb (using assembly/pom.xml) -- Key: SPARK-2614 URL: https://issues.apache.org/jira/browse/SPARK-2614 Project: Spark Issue Type: Improvement Components: Build, Deploy Reporter: Christian Tzolov The tar.gz distribution already includes the spark-examples.jar in the bundle. It is common practice for installers to run SparkPi as a smoke test to verify that the installation is OK: /usr/share/spark/bin/spark-submit \ --num-executors 10 --master yarn-cluster \ --class org.apache.spark.examples.SparkPi \ /usr/share/spark/jars/spark-examples-1.0.1-hadoop2.2.0.jar 10 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076033#comment-14076033 ] Guoqiang Li commented on SPARK-2677: [~pwendell], [~sarutak] How about the following solution? https://github.com/witgo/spark/compare/SPARK-2677 BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
{code}
override def next(): (BlockId, Option[Iterator[Any]]) = {
  resultsGotten += 1
  val startFetchWait = System.currentTimeMillis()
  val result = results.take()
  val stopFetchWait = System.currentTimeMillis()
  _fetchWaitTime += (stopFetchWait - startFetchWait)
  if (!result.failed) bytesInFlight -= result.size
  while (!fetchRequests.isEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
  (result.blockId, if (result.failed) None else Some(result.deserialize()))
}
{code}
But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2511) Add TF-IDF featurizer
[ https://issues.apache.org/jira/browse/SPARK-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076035#comment-14076035 ] duanfa edited comment on SPARK-2511 at 7/28/14 9:05 AM: i need it also was (Author: duanfa): i need it alse Add TF-IDF featurizer - Key: SPARK-2511 URL: https://issues.apache.org/jira/browse/SPARK-2511 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Port the TF-IDF implementation that was used in the Databricks Cloud demo to MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2511) Add TF-IDF featurizer
[ https://issues.apache.org/jira/browse/SPARK-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076035#comment-14076035 ] duanfa commented on SPARK-2511: --- i need it alse Add TF-IDF featurizer - Key: SPARK-2511 URL: https://issues.apache.org/jira/browse/SPARK-2511 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Port the TF-IDF implementation that was used in the Databricks Cloud demo to MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2511) Add TF-IDF featurizer
[ https://issues.apache.org/jira/browse/SPARK-2511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076035#comment-14076035 ] duanfa edited comment on SPARK-2511 at 7/28/14 9:12 AM: i need it also,i code tonight , was (Author: duanfa): i need it also Add TF-IDF featurizer - Key: SPARK-2511 URL: https://issues.apache.org/jira/browse/SPARK-2511 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng Assignee: Xiangrui Meng Port the TF-IDF implementation that was used in the Databricks Cloud demo to MLlib. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file
[ https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076086#comment-14076086 ] Teng Qiu commented on SPARK-2576: - I get the same problem on 1.0.1, standalone cluster. slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file -- Key: SPARK-2576 URL: https://issues.apache.org/jira/browse/SPARK-2576 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.0.1 Environment: One Mesos 0.19 master without zookeeper and 4 mesos slaves. JDK 1.7.51 and Scala 2.10.4 on all nodes. HDFS from CDH5.0.3 Spark version: I tried both the pre-built CDH5 spark package available from http://spark.apache.org/downloads.html and packaging spark with sbt 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here http://mesosphere.io/learn/run-spark-on-mesos/ All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have Reporter: Svend Vanderveken Assignee: Yin Huai Priority: Blocker Fix For: 1.0.2 Execution of a SQL query against HDFS systematically throws a class not found exception on slave nodes. (this was originally reported on the user list: http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html) Sample code (run from spark-shell):
{code}
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD

case class Car(timestamp: Long, objectid: String, isGreen: Boolean)

// I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
cars.registerAsTable("mcars")

val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
allgreens.collect.take(10).foreach(println)
{code}
Stack trace on the slave nodes:
{code}
I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as executor ID 20140714-142853-485682442-5050-25487-2
14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: mesos,mnubohadoop
14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mesos, mnubohadoop)
14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
14/07/16 13:01:17 INFO Remoting: Starting remoting
14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@vm23:38230]
14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@vm23:38230]
14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: akka.tcp://spark@vm28:41632/user/MapOutputTracker
14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: akka.tcp://spark@vm28:41632/user/BlockManagerMaster
14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140716130117-8ea0
14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973
14/07/16 13:01:18 INFO Executor: Running task ID 2
14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with curMem=0, maxMem=309225062
14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to memory (estimated size 122.6 KB, free 294.8 MB)
14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 0.294602722 s
14/07/16 13:01:19 INFO HadoopRDD: Input split: hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
java.lang.NoClassDefFoundError: $line11/$read$
	at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:19)
	at
{code}
[jira] [Commented] (SPARK-2417) Decision tree tests are failing
[ https://issues.apache.org/jira/browse/SPARK-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076101#comment-14076101 ] Patrick Morton commented on SPARK-2417: --- Hallucinogenic stroke of important metabolites during black father may affect the awkward nudity and stress of the midline, resulting in includesubtypes in the clinical fingers that control belief and execution. adderall depression http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851520/7787433-29851520-stopadd9.html For language, it would be inconclusive to choose routinely proprietary cultures to be imprinted with a far fatal history. Decision tree tests are failing --- Key: SPARK-2417 URL: https://issues.apache.org/jira/browse/SPARK-2417 Project: Spark Issue Type: Bug Components: MLlib Reporter: Patrick Wendell Assignee: Jon Sondag Fix For: 1.0.1, 1.1.0 After SPARK-2152 was merged, these tests started failing in Jenkins: {code} - classification stump with all categorical variables *** FAILED *** org.scalatest.exceptions.TestFailedException was thrown. (DecisionTreeSuite.scala:257) - regression stump with all categorical variables *** FAILED *** org.scalatest.exceptions.TestFailedException was thrown. (DecisionTreeSuite.scala:284) {code} https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/97/hadoop.version=1.0.4,label=centos/console -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2415) RowWriteSupport should handle empty ArrayType correctly.
[ https://issues.apache.org/jira/browse/SPARK-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076104#comment-14076104 ] Patrick Morton commented on SPARK-2415: --- In the ethical, three endings of core symptoms have been not linked with these benzodiazepines. http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851519/7787449-29851519-stopadd32.html Before, this firing of gym will suffer from sampling trouble because rights with thought conditions will be more other to be referred to fever structures if they are experiencing pleasurable times. RowWriteSupport should handle empty ArrayType correctly. Key: SPARK-2415 URL: https://issues.apache.org/jira/browse/SPARK-2415 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 {{RowWriteSupport}} doesn't write empty {{ArrayType}} value, so the read value becomes {{null}}. It should write empty {{ArrayType}} value as it is. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2714) DAGScheduler logs jobid when runJob finishes
[ https://issues.apache.org/jira/browse/SPARK-2714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076232#comment-14076232 ] Apache Spark commented on SPARK-2714: - User 'YanTangZhai' has created a pull request for this issue: https://github.com/apache/spark/pull/1617 DAGScheduler logs jobid when runJob finishes Key: SPARK-2714 URL: https://issues.apache.org/jira/browse/SPARK-2714 Project: Spark Issue Type: Documentation Components: Spark Core Reporter: YanTang Zhai Priority: Minor DAGScheduler logs jobid when runJob finishes -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2415) RowWriteSupport should handle empty ArrayType correctly.
[ https://issues.apache.org/jira/browse/SPARK-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Farrell updated SPARK-2415: Comment: was deleted (was: In the ethical, three endings of core symptoms have been not linked with these benzodiazepines. http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851519/7787449-29851519-stopadd32.html Before, this firing of gym will suffer from sampling trouble because rights with thought conditions will be more other to be referred to fever structures if they are experiencing pleasurable times.) RowWriteSupport should handle empty ArrayType correctly. Key: SPARK-2415 URL: https://issues.apache.org/jira/browse/SPARK-2415 Project: Spark Issue Type: Bug Components: SQL Reporter: Takuya Ueshin Assignee: Takuya Ueshin Fix For: 1.1.0, 1.0.2 {{RowWriteSupport}} doesn't write empty {{ArrayType}} value, so the read value becomes {{null}}. It should write empty {{ArrayType}} value as it is. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2417) Decision tree tests are failing
[ https://issues.apache.org/jira/browse/SPARK-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jake Farrell updated SPARK-2417: Comment: was deleted (was: Hallucinogenic stroke of important metabolites during black father may affect the awkward nudity and stress of the midline, resulting in includesubtypes in the clinical fingers that control belief and execution. adderall depression http://www.surveyanalytics.com//userimages/sub-2/2007589/3153260/29851520/7787433-29851520-stopadd9.html For language, it would be inconclusive to choose routinely proprietary cultures to be imprinted with a far fatal history.) Decision tree tests are failing --- Key: SPARK-2417 URL: https://issues.apache.org/jira/browse/SPARK-2417 Project: Spark Issue Type: Bug Components: MLlib Reporter: Patrick Wendell Assignee: Jon Sondag Fix For: 1.0.1, 1.1.0 After SPARK-2152 was merged, these tests started failing in Jenkins: {code} - classification stump with all categorical variables *** FAILED *** org.scalatest.exceptions.TestFailedException was thrown. (DecisionTreeSuite.scala:257) - regression stump with all categorical variables *** FAILED *** org.scalatest.exceptions.TestFailedException was thrown. (DecisionTreeSuite.scala:284) {code} https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/97/hadoop.version=1.0.4,label=centos/console -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2715) ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling
[ https://issues.apache.org/jira/browse/SPARK-2715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076257#comment-14076257 ] Apache Spark commented on SPARK-2715: - User 'YanTangZhai' has created a pull request for this issue: https://github.com/apache/spark/pull/1618 ExternalAppendOnlyMap adds max limit of times and max limit of disk bytes written for spilling -- Key: SPARK-2715 URL: https://issues.apache.org/jira/browse/SPARK-2715 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: YanTang Zhai Priority: Minor ExternalAppendOnlyMap adds a max limit on the number of spills and on the disk bytes written when spilling. This way, a task with data skew can fail fast instead of running for a long time. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2141) Add sc.getPersistentRDDs() to PySpark
[ https://issues.apache.org/jira/browse/SPARK-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076262#comment-14076262 ] Kan Zhang commented on SPARK-2141: -- Hi [~nchammas], we are debating potential use cases for this feature. Would be great if you could provide your input (use above link). Thx. Add sc.getPersistentRDDs() to PySpark - Key: SPARK-2141 URL: https://issues.apache.org/jira/browse/SPARK-2141 Project: Spark Issue Type: New Feature Components: PySpark Affects Versions: 1.0.0 Reporter: Nicholas Chammas Assignee: Kan Zhang PySpark does not appear to have {{sc.getPersistentRDDs()}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076278#comment-14076278 ] Apache Spark commented on SPARK-2677: - User 'witgo' has created a pull request for this issue: https://github.com/apache/spark/pull/1619 BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
{code}
override def next(): (BlockId, Option[Iterator[Any]]) = {
  resultsGotten += 1
  val startFetchWait = System.currentTimeMillis()
  val result = results.take()
  val stopFetchWait = System.currentTimeMillis()
  _fetchWaitTime += (stopFetchWait - startFetchWait)
  if (!result.failed) bytesInFlight -= result.size
  while (!fetchRequests.isEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
  (result.blockId, if (result.failed) None else Some(result.deserialize()))
}
{code}
But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-2677) BasicBlockFetchIterator#next can wait forever
[ https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076033#comment-14076033 ] Guoqiang Li edited comment on SPARK-2677 at 7/28/14 3:00 PM: - [~pwendell], [~sarutak] How about the following solution? https://github.com/apache/spark/pull/1619 was (Author: gq): [~pwendell], [~sarutak] How about the following solution? https://github.com/witgo/spark/compare/SPARK-2677 BasicBlockFetchIterator#next can wait forever - Key: SPARK-2677 URL: https://issues.apache.org/jira/browse/SPARK-2677 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.2, 1.0.0, 1.0.1 Reporter: Kousuke Saruta Priority: Blocker In BasicBlockFetchIterator#next, it waits for the fetch result on results.take().
{code}
override def next(): (BlockId, Option[Iterator[Any]]) = {
  resultsGotten += 1
  val startFetchWait = System.currentTimeMillis()
  val result = results.take()
  val stopFetchWait = System.currentTimeMillis()
  _fetchWaitTime += (stopFetchWait - startFetchWait)
  if (!result.failed) bytesInFlight -= result.size
  while (!fetchRequests.isEmpty &&
      (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
  (result.blockId, if (result.failed) None else Some(result.deserialize()))
}
{code}
But results is implemented as a LinkedBlockingQueue, so if the remote executor hangs up, the fetching executor waits forever. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076403#comment-14076403 ] Aaron Davidson commented on SPARK-1860: --- There's not an easy way to tell if an application is still running. However, the Worker has state about which executors are still running. This is really what I intended originally -- we must not clean up an Executor's own state from underneath it. I will change the title to reflect this intention. Standalone Worker cleanup should not clean up running applications -- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.1.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-1860: -- Description: The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executor's log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. was: The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. Standalone Worker cleanup should not clean up running executors --- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.1.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executor's log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
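A hedged sketch of the guard the updated description calls for; the work-dir layout and helper names are simplified assumptions, not the Worker's actual code:
{code}
import java.io.File

// Sketch only: deleteRecursively stands in for Spark's Utils.deleteRecursively.
def cleanupWorkDir(workDir: File, activeAppIds: Set[String], retentionMs: Long): Unit = {
  for (dir <- workDir.listFiles() if dir.isDirectory) {
    val expired = System.currentTimeMillis() - dir.lastModified() > retentionMs
    // Never delete a directory that still backs a running executor.
    if (expired && !activeAppIds.contains(dir.getName)) {
      deleteRecursively(dir)
    }
  }
}

def deleteRecursively(f: File): Unit = {
  if (f.isDirectory) f.listFiles().foreach(deleteRecursively)
  f.delete()
}
{code}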
[jira] [Created] (SPARK-2716) Having clause with no references fails to resolve
Michael Armbrust created SPARK-2716: --- Summary: Having clause with no references fails to resolve Key: SPARK-2716 URL: https://issues.apache.org/jira/browse/SPARK-2716 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical For example: {code} SELECT a FROM b GROUP BY a HAVING COUNT(*) > 1 {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2563) Re-open sockets to handle connect timeouts
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman updated SPARK-2563: - Description: In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. We should check if the Socket was closed due to a timeout and open a new socket and try to connect. FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries) [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573 was:In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. We should make the number of retries before failing configurable to handle these cases. Summary: Re-open sockets to handle connect timeouts (was: Make number of connection retries configurable) Re-open sockets to handle connect timeouts -- Key: SPARK-2563 URL: https://issues.apache.org/jira/browse/SPARK-2563 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Priority: Minor In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. We should check if the Socket was closed due to a timeout and open a new socket and try to connect. FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries) [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573 -- This message was sent by Atlassian JIRA (v6.2#6252)
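A sketch of the retry idea under the stated assumptions (connectWithRetries is illustrative, not Spark API): if the connect attempt fails with a timeout and the channel is closed, open a fresh SocketChannel and try again.
{code}
import java.net.{InetSocketAddress, SocketTimeoutException}
import java.nio.channels.{ClosedChannelException, SocketChannel}

// Sketch only: re-open the socket on connect timeout instead of giving up.
def connectWithRetries(address: InetSocketAddress, maxRetries: Int): SocketChannel = {
  var attempt = 0
  while (true) {
    val channel = SocketChannel.open()
    try {
      channel.socket().connect(address, 60000) // 60s connect timeout
      return channel
    } catch {
      case e @ (_: SocketTimeoutException | _: ClosedChannelException) =>
        channel.close()
        attempt += 1
        if (attempt > maxRetries) throw e // out of retries, propagate
    }
  }
  throw new IllegalStateException("unreachable")
}
{code}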
[jira] [Comment Edited] (SPARK-2563) Re-open sockets to handle connect timeouts
[ https://issues.apache.org/jira/browse/SPARK-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14065735#comment-14065735 ] Shivaram Venkataraman edited comment on SPARK-2563 at 7/28/14 5:43 PM: --- More details about the bug are in -https://github.com/apache/spark/pull/1471- was (Author: shivaram): https://github.com/apache/spark/pull/1471 Re-open sockets to handle connect timeouts -- Key: SPARK-2563 URL: https://issues.apache.org/jira/browse/SPARK-2563 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Shivaram Venkataraman Priority: Minor In a large EC2 cluster, I often see the first shuffle stage in a job fail due to connection timeout exceptions. If the connection attempt times out, the socket gets closed and from [1] we get a ClosedChannelException. We should check if the Socket was closed due to a timeout and open a new socket and try to connect. FWIW, I was able to work around my problems by increasing the number of SYN retries in Linux. (I ran echo 8 > /proc/sys/net/ipv4/tcp_syn_retries) [1] http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/sun/nio/ch/SocketChannelImpl.java?av=h#573 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2410) Thrift/JDBC Server
[ https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076453#comment-14076453 ] Apache Spark commented on SPARK-2410: - User 'liancheng' has created a pull request for this issue: https://github.com/apache/spark/pull/1620 Thrift/JDBC Server -- Key: SPARK-2410 URL: https://issues.apache.org/jira/browse/SPARK-2410 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.1.0 We have this, but need to make sure that it gets merged into master before the 1.1 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076468#comment-14076468 ] Sean Owen commented on SPARK-2420: -- I'm sure shading just means moving the packages, and references in the byte code, with maven-shade-plugin. assembly takes very little of the total build time. Nothing else I can see except Hadoop has a Guava dependency. But yeah, there is gonna have to be a teensy fork of a Guava class maintained then. It can go in the source tree, so doesn't necessarily need more assembly surgery. Does it change your calculus? I remain slightly grossed out by all options. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076456#comment-14076456 ] Marcelo Vanzin commented on SPARK-2420: --- So let me see if I'm following things so far. The current proposals are 1. downgrade or 2. shade (which if I understand Patrick correctly means forking Guava and changing the sources to a different package, not using the maven shade plugin?). Both options avoid overriding libraries used by Hadoop; the first by using the same one, the second by avoiding the namespace conflict. Option 1 provides fewer backwards-compatibility issues. Shading just removes Guava from the user's classpath, so it leaves users to manage it; they'll either inherit it from Hadoop, or get into a situation where they override the classpath's Guava with their own, and potentially might break Hadoop. For both cases, I think the best recommendation is to tell the user to shade Guava in their application if they really need a newer version - that way they won't be overriding the library used by Hadoop classes. Option 1 is also less work; you don't need to maintain the shaded Guava (if I understand correctly what was meant here by shading). Using maven's shade instead means builds would get slower. Also, does anyone have an idea about whether any of the libraries Spark depends on depend on Guava and need a version later than 11? I haven't checked that. As for Guava leaking through Spark's API, that's very, very unfortunate. Option 2 here will definitely break compatibility for anyone who uses those APIs. Option 1, on the other hand, has only a couple of implications: according to Guava's javadoc, only one method doesn't exist in 11 ({{transform}}) and one has a changed signature ({{presentInstances}}, and only generic arguments were changed, so maybe still binary compatible). So, pending my dependency question above, I still think that downgrading is the option that creates fewer headaches. Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2420) Change Spark build to minimize library conflicts
[ https://issues.apache.org/jira/browse/SPARK-2420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076488#comment-14076488 ] Marcelo Vanzin commented on SPARK-2420: --- Forking {{Optional}} would make Option 2 more palatable. But shading + fork that class still feels more like a sledgehammer, and it will have pretty much the same effect on user code as downgrading, from what I can see (since now, without explicit dependencies, they'll be getting Guava 11 from Hadoop instead of Guava 14 from Spark). Change Spark build to minimize library conflicts Key: SPARK-2420 URL: https://issues.apache.org/jira/browse/SPARK-2420 Project: Spark Issue Type: Wish Components: Build Affects Versions: 1.0.0 Reporter: Xuefu Zhang Attachments: spark_1.0.0.patch During the prototyping of HIVE-7292, many library conflicts showed up because Spark build contains versions of libraries that's vastly different from current major Hadoop version. It would be nice if we can choose versions that's in line with Hadoop or shading them in the assembly. Here are the wish list: 1. Upgrade protobuf version to 2.5.0 from current 2.4.1 2. Shading Spark's jetty and servlet dependency in the assembly. 3. guava version difference. Spark is using a higher version. I'm not sure what's the best solution for this. The list may grow as HIVE-7292 proceeds. For information only, the attached is a patch that we applied on Spark in order to make Spark work with Hive. It gives an idea of the scope of changes. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2523) For partitioned Hive tables, partition-specific ObjectInspectors should be used.
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2523. - Resolution: Fixed Fix Version/s: 1.1.0 For partitioned Hive tables, partition-specific ObjectInspectors should be used. Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.1.0 In HiveTableScan.scala, ObjectInspector was created for all of the partition based records, which probably causes ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2479) Comparing floating-point numbers using relative error in UnitTests
[ https://issues.apache.org/jira/browse/SPARK-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2479. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1425 [https://github.com/apache/spark/pull/1425] Comparing floating-point numbers using relative error in UnitTests -- Key: SPARK-2479 URL: https://issues.apache.org/jira/browse/SPARK-2479 Project: Spark Issue Type: Improvement Reporter: DB Tsai Assignee: DB Tsai Fix For: 1.1.0 Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors. Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result. That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored. See the following famous article for detail. http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/ For example:
{code}
float a = 0.15 + 0.15
float b = 0.1 + 0.2
if (a == b) // can be false!
if (a >= b) // can also be false!
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
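For reference, a minimal relative-error comparison of the kind the resolved change adopts (the helper name is an assumption):
{code}
// Sketch: compare doubles by relative error rather than exact equality.
def approxEqual(a: Double, b: Double, eps: Double = 1e-8): Boolean = {
  if (a == b) true // handles exact matches, including 0.0
  else math.abs(a - b) <= eps * math.max(math.abs(a), math.abs(b))
}
{code}
With a = 0.15 + 0.15 and b = 0.1 + 0.2 as doubles, a == b is false, but approxEqual(a, b) holds because the relative difference is tiny.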
[jira] [Updated] (SPARK-2544) Improve ALS algorithm resource usage
[ https://issues.apache.org/jira/browse/SPARK-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2544: - Target Version/s: 1.1.0 Improve ALS algorithm resource usage Key: SPARK-2544 URL: https://issues.apache.org/jira/browse/SPARK-2544 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li ALS has the following problems: 1. The dependency chains of the products and users RDDs are too long. 2. The shuffle files are too large. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2544) Improve ALS algorithm resource usage
[ https://issues.apache.org/jira/browse/SPARK-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2544: - Assignee: Guoqiang Li Improve ALS algorithm resource usage Key: SPARK-2544 URL: https://issues.apache.org/jira/browse/SPARK-2544 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Assignee: Guoqiang Li ALS has the following problems: 1. The dependency chains of the products and users RDDs are too long. 2. The shuffle files are too large. -- This message was sent by Atlassian JIRA (v6.2#6252)
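One standard remedy for problem 1 is to truncate the growing lineage periodically; a sketch under assumed names (the update functions, interval, and checkpoint directory are illustrative, not ALS's actual implementation):
{code}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch only: checkpointing every few iterations cuts the users/products
// RDD dependency chains so lineage cannot grow without bound.
def runAls[T](sc: SparkContext, initUsers: RDD[T], initProducts: RDD[T],
              updateUsers: RDD[T] => RDD[T], updateProducts: RDD[T] => RDD[T],
              numIterations: Int): (RDD[T], RDD[T]) = {
  sc.setCheckpointDir("hdfs:///tmp/als-checkpoints") // illustrative path
  var users = initUsers
  var products = initProducts
  for (iter <- 1 to numIterations) {
    users = updateUsers(products)
    products = updateProducts(users)
    if (iter % 5 == 0) {
      users.checkpoint()
      products.checkpoint()
    }
  }
  (users, products)
}
{code}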
[jira] [Resolved] (SPARK-2410) Thrift/JDBC Server
[ https://issues.apache.org/jira/browse/SPARK-2410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2410. - Resolution: Fixed Thrift/JDBC Server -- Key: SPARK-2410 URL: https://issues.apache.org/jira/browse/SPARK-2410 Project: Spark Issue Type: New Feature Components: SQL Reporter: Michael Armbrust Assignee: Cheng Lian Priority: Blocker Fix For: 1.1.0 We have this, but need to make sure that it gets merged into master before the 1.1 release. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076778#comment-14076778 ] Mark Hamstra commented on SPARK-1860: - I don't think that there is much in the way of conflict, but something to be aware of is that the proposed fix to SPARK-2425 does modify Executor state transitions and cleanup: https://github.com/apache/spark/pull/1360 Standalone Worker cleanup should not clean up running executors --- Key: SPARK-1860 URL: https://issues.apache.org/jira/browse/SPARK-1860 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Reporter: Aaron Davidson Priority: Critical Fix For: 1.1.0 The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executor's log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2305) pyspark - depend on py4j > 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2305: -- Target Version/s: 1.1.0 Assignee: Josh Rosen Py4J 0.8.2.1 was just released; I'll look into upgrading. pyspark - depend on py4j > 0.8.1 Key: SPARK-2305 URL: https://issues.apache.org/jira/browse/SPARK-2305 Project: Spark Issue Type: Dependency upgrade Components: PySpark Affects Versions: 1.0.0 Reporter: Matthew Farrellee Assignee: Josh Rosen Priority: Minor py4j 0.8.1 has a bug in java_import that results in extraneous warnings. pyspark should depend on a py4j version > 0.8.1 (none exists at time of filing) that includes https://github.com/bartdag/py4j/commit/64cd657e75dbe769c5e3bf757fcf83b5c0f8f4f0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2411) Standalone Master - direct users to turn on event logs
[ https://issues.apache.org/jira/browse/SPARK-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or resolved SPARK-2411. -- Resolution: Fixed Standalone Master - direct users to turn on event logs -- Key: SPARK-2411 URL: https://issues.apache.org/jira/browse/SPARK-2411 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Andrew Or Fix For: 1.1.0 Attachments: Application history load error.png, Application history not found.png, Event logging not enabled.png Right now, if the user attempts to click on a finished application's UI, the page simply refreshes. This is because the event logs are not there, in which case we set the href="". We could provide more information by pointing the user to configure spark.eventLog.enabled if they click on the empty link. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076739#comment-14076739 ] Yin Huai edited comment on SPARK-1649 at 7/28/14 8:42 PM: -- Hive seems to support null values in a Map; to be consistent with Hive, we will also support that. I will introduce a boolean valueContainsNull to MapType. For null map keys, Hive has inconsistent behaviors. Here are examples (using sbt/sbt hive/console).
{code}
runSqlHive("select map(null, 1, null, 2, null, 3, 4, null, 5, null) from src limit 1")
res6: Seq[String] = Buffer({"4":null,"5":null})
runSqlHive("select map_keys(map(null, 1, null, 2, null, 3, 4, null, 5, null)) from src limit 1")
res7: Seq[String] = Buffer([null,4,5])
runSqlHive("select map_values(map(null, 1, null, 2, null, 3, 4, null, 5, null)) from src limit 1")
res8: Seq[String] = Buffer([3,null,null])
{code}
Also, different implementations handle null keys in different ways (e.g. HashMap supports an entry with a null key, but TreeMap will throw an NPE when a user wants to insert an entry with a null key). So, I think we will not allow null keys in a map. was (Author: yhuai): Hive seems to support null values in a Map; to be consistent with Hive, we will also support that. I will introduce a boolean valuesContainNull to MapType. For null map keys, Hive has inconsistent behaviors. Here are examples (using sbt/sbt hive/console).
{code}
runSqlHive("select map(null, 1, null, 2, null, 3, 4, null, 5, null) from src limit 1")
res6: Seq[String] = Buffer({"4":null,"5":null})
runSqlHive("select map_keys(map(null, 1, null, 2, null, 3, 4, null, 5, null)) from src limit 1")
res7: Seq[String] = Buffer([null,4,5])
runSqlHive("select map_values(map(null, 1, null, 2, null, 3, 4, null, 5, null)) from src limit 1")
res8: Seq[String] = Buffer([3,null,null])
{code}
Also, different implementations handle null keys in different ways (e.g. HashMap supports an entry with a null key, but TreeMap will throw an NPE when a user wants to insert an entry with a null key). So, I think we will not allow null keys in a map. Figure out Nullability semantics for Array elements and Map values -- Key: SPARK-1649 URL: https://issues.apache.org/jira/browse/SPARK-1649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Andre Schumacher Priority: Critical For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this: abstract class DataType(nullable: Boolean = true) Concrete subclasses could then override the nullable val. Mostly this could be left as the default, but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
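The proposed shape of the change, sketched from the comment (constructor details are assumptions):
{code}
// Sketch of the proposed types: arrays record whether elements may be null,
// maps record whether values may be null, and null map keys are disallowed.
abstract class DataType

case class ArrayType(elementType: DataType, containsNull: Boolean) extends DataType

case class MapType(
    keyType: DataType,
    valueType: DataType,
    valueContainsNull: Boolean) extends DataType
{code}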
[jira] [Commented] (SPARK-1687) Support NamedTuples in RDDs
[ https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076849#comment-14076849 ] Davies Liu commented on SPARK-1687: --- Dill is implemented in pure Python, so it will have performance similar to pickle, but much slower than cPickle, which we use as the default serializer. So we could not switch the default serializer to Dill. We could provide a customized namedtuple (which can be serialized by cPickle) and also replace the one in collections with it. I will send a PR, if it makes sense. Support NamedTuples in RDDs --- Key: SPARK-1687 URL: https://issues.apache.org/jira/browse/SPARK-1687 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Environment: Spark version 1.0.0-SNAPSHOT Python 2.7.5 Reporter: Pat McDonough Assignee: Kan Zhang Add Support for NamedTuples in RDDs. Some sample code is below, followed by the current error that comes from it. Based on a quick conversation with [~ahirreddy], [Dill|https://github.com/uqfoundation/dill] might be a good solution here.
{code}
In [26]: from collections import namedtuple
...
In [33]: Person = namedtuple('Person', 'id firstName lastName')

In [34]: jon = Person(1, "Jon", "Doe")

In [35]: jane = Person(2, "Jane", "Doe")

In [36]: theDoes = sc.parallelize((jon, jane))

In [37]: theDoes.collect()
Out[37]: [Person(id=1, firstName='Jon', lastName='Doe'), Person(id=2, firstName='Jane', lastName='Doe')]

In [38]: theDoes.count()
PySpark worker failed with exception:
PySpark worker failed with exception:
Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, in load_stream
    yield self._read_with_length(stream)
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, in _read_with_length
    return self.loads(obj)
AttributeError: 'module' object has no attribute 'Person'
Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <genexpr>
    return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 129, in load_stream
    yield self._read_with_length(stream)
  File "/Users/pat/Projects/spark/python/pyspark/serializers.py", line 146, in _read_with_length
    return self.loads(obj)
AttributeError: 'module' object has no attribute 'Person'
14/04/30 14:43:53 ERROR Executor: Exception in task ID 23
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/pat/Projects/spark/python/pyspark/worker.py", line 77, in main
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 1373, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 283, in func
    def func(s, iterator): return f(iterator)
  File "/Users/pat/Projects/spark/python/pyspark/rdd.py", line 708, in <lambda>
    return self.mapPartitions(lambda i: [sum(1 for _ in
{code}
[jira] [Commented] (SPARK-2655) Change the default logging level to WARN
[ https://issues.apache.org/jira/browse/SPARK-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14076863#comment-14076863 ] Davies Liu commented on SPARK-2655: --- [~pwendell] [~matei], what do you think about this? Change the default logging level to WARN Key: SPARK-2655 URL: https://issues.apache.org/jira/browse/SPARK-2655 Project: Spark Issue Type: Improvement Reporter: Davies Liu The current logging level INFO is pretty noisy; reducing this unnecessary logging will provide a better experience for users. Spark is much more stable and mature than before, so users will not need that much logging in normal cases. But some high-level information will be helpful, such as messages about job and task progress. We could change this important logging to WARN level as a hack; otherwise we will need to change all other logging to DEBUG level. PS: it would be better to have one-line progress logging in the terminal (also in the title). -- This message was sent by Atlassian JIRA (v6.2#6252)
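Until any default change lands, users can already cut the noise themselves through log4j; a minimal example using the log4j 1.x API that Spark ships with:
{code}
import org.apache.log4j.{Level, Logger}

// Raise the root logger to WARN from application code. The equivalent
// conf/log4j.properties line is: log4j.rootCategory=WARN, console
Logger.getRootLogger.setLevel(Level.WARN)
{code}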
[jira] [Updated] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck
[ https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2717: --- Component/s: Spark Core BasicBlockFetchIterator#next should log when it gets stuck -- Key: SPARK-2717 URL: https://issues.apache.org/jira/browse/SPARK-2717 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Priority: Blocker If this is stuck for a long time waiting for blocks, we should log what nodes it is waiting for to help debugging. One way to do this is to call take() with a timeout (e.g. 60 seconds) and when the timeout expires log a message for the blocks it is still waiting for. This could all happen in a loop so that the wait just restarts after the message is logged. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck
Patrick Wendell created SPARK-2717: -- Summary: BasicBlockFetchIterator#next should log when it gets stuck Key: SPARK-2717 URL: https://issues.apache.org/jira/browse/SPARK-2717 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Priority: Blocker If this is stuck for a long time waiting for blocks, we should log what nodes it is waiting for to help debugging. One way to do this is to call take() with a timeout (e.g. 60 seconds) and when the timeout expires log a message for the blocks it is still waiting for. This could all happen in a loop so that the wait just restarts after the message is logged. -- This message was sent by Atlassian JIRA (v6.2#6252)
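The loop being proposed is simple; here is a rough Python sketch of the pattern, with a plain queue standing in for the iterator's internal results queue (names illustrative, not the actual BasicBlockFetchIterator internals):
{code:python}
import logging
import Queue  # 'queue' on Python 3

def next_result(results, outstanding_blocks, timeout_secs=60):
    # Poll with a timeout instead of blocking forever; on each timeout,
    # log which blocks are still outstanding, then resume waiting.
    while True:
        try:
            return results.get(timeout=timeout_secs)
        except Queue.Empty:
            logging.warning("Still waiting for blocks %s after %d seconds",
                            sorted(outstanding_blocks), timeout_secs)
{code}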
[jira] [Updated] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2718: - Affects Version/s: (was: 1.0.1) 1.0.2 YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
Andrew Or created SPARK-2718: Summary: YARN does not handle spark configs with quotes or backslashes Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2718: - Description: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} was: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext. YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. 
As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2718: - Description: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} was: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. 
As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at
[jira] [Commented] (SPARK-1343) PySpark OOMs without caching
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077011#comment-14077011 ] Davies Liu commented on SPARK-1343: --- Maybe it's related to partitionBy() with a small number of partitions: the data in one partition is sent to the JVM as several huge bytearrays, which consume huge amounts of memory before being written to disk, because the default spark.serializer.objectStreamReset is too large. Hopefully, PR-1568 and PR-1460 will fix these issues. Closing this now; will re-open it if it happens again. PySpark OOMs without caching Key: SPARK-1343 URL: https://issues.apache.org/jira/browse/SPARK-1343 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0 Reporter: Matei Zaharia There have been several reports on the list of PySpark 0.9 OOMing even if it does simple maps and counts, whereas 0.9 didn't. This may be due to either the batching added to serialization, or due to invalid serialized data which makes the Java side allocate an overly large array. Needs investigating for 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
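For anyone hitting this before those PRs land, the reset interval is configurable per application; a minimal sketch (the value 100 is just an aggressive choice for illustration, not an official recommendation):
{code:python}
from pyspark import SparkConf, SparkContext

# Reset the JVM-side serialization stream more frequently so it does not
# keep references to every object written since the last reset.
conf = SparkConf().set("spark.serializer.objectStreamReset", "100")
sc = SparkContext(conf=conf)
{code}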
[jira] [Updated] (SPARK-2718) YARN does not handle spark configs with quotes or backslashes
[ https://issues.apache.org/jira/browse/SPARK-2718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2718: - Description: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) ... {code} was: Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300) at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) {code} YARN does not handle spark configs with quotes or backslashes - Key: SPARK-2718 URL: https://issues.apache.org/jira/browse/SPARK-2718 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.2 Reporter: Andrew Or Fix For: 1.1.0 Say we have the following config: {code} spark.app.name spark shell with spaces and quotes and \ backslashes \ {code} This works in standalone mode but not in YARN mode. This is because standalone mode uses Java's ProcessBuilder, which handles these cases nicely, but YARN mode uses org.apache.hadoop.yarn.api.records.ContainerLaunchContext, which does not. As a result, submitting an application to YARN with the given config leads to the following exception: {code} line 0: unexpected EOF while looking for matching `' syntax error: unexpected end of file at org.apache.hadoop.util.Shell.runCommand(Shell.java:505) at org.apache.hadoop.util.Shell.run(Shell.java:418) at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:650) ... {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
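The generic remedy for this class of bug is to shell-escape a value before splicing it into the container launch command, since the YARN launcher hands the command to a shell verbatim. A Python sketch of the idea (not Spark's actual fix; the main class is a placeholder):
{code:python}
import pipes  # shlex.quote on Python 3

value = 'spark shell with "quotes" and \\ backslashes \\'
# pipes.quote wraps the value so the shell treats it as one literal token.
launch_cmd = "java -Dspark.app.name=%s some.MainClass" % pipes.quote(value)
print(launch_cmd)
{code}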
[jira] [Resolved] (SPARK-1343) PySpark OOMs without caching
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-1343. --- Resolution: Fixed Fix Version/s: 0.9.0 1.0.0 Target Version/s: 1.1.0 PySpark OOMs without caching Key: SPARK-1343 URL: https://issues.apache.org/jira/browse/SPARK-1343 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0 Reporter: Matei Zaharia Assignee: Davies Liu Fix For: 1.0.0, 0.9.0 There have been several reports on the list of PySpark 0.9 OOMing even if it does simple maps and counts, whereas 0.9 didn't. This may be due to either the batching added to serialization, or due to invalid serialized data which makes the Java side allocate an overly large array. Needs investigating for 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1343) PySpark OOMs without caching
[ https://issues.apache.org/jira/browse/SPARK-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077012#comment-14077012 ] Davies Liu commented on SPARK-1343: --- https://github.com/apache/spark/pull/1460 https://github.com/apache/spark/pull/1568 PySpark OOMs without caching Key: SPARK-1343 URL: https://issues.apache.org/jira/browse/SPARK-1343 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0 Reporter: Matei Zaharia Fix For: 0.9.0, 1.0.0 There have been several reports on the list of PySpark 0.9 OOMing even if it does simple maps and counts, whereas 0.9 didn't. This may be due to either the batching added to serialization, or due to invalid serialized data which makes the Java side allocate an overly large array. Needs investigating for 1.0. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1687) Support NamedTuples in RDDs
[ https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077019#comment-14077019 ] Kan Zhang commented on SPARK-1687: -- Sure, please go ahead and feel free to take over this JIRA. Support NamedTuples in RDDs --- Key: SPARK-1687 URL: https://issues.apache.org/jira/browse/SPARK-1687 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Environment: Spark version 1.0.0-SNAPSHOT Python 2.7.5 Reporter: Pat McDonough Assignee: Kan Zhang Add Support for NamedTuples in RDDs. Some sample code is below, followed by the current error that comes from it. Based on a quick conversation with [~ahirreddy], [Dill|https://github.com/uqfoundation/dill] might be a good solution here. {code} In [26]: from collections import namedtuple ... In [33]: Person = namedtuple('Person', 'id firstName lastName') In [34]: jon = Person(1, 'Jon', 'Doe') In [35]: jane = Person(2, 'Jane', 'Doe') In [36]: theDoes = sc.parallelize((jon, jane)) In [37]: theDoes.collect() Out[37]: [Person(id=1, firstName='Jon', lastName='Doe'), Person(id=2, firstName='Jane', lastName='Doe')] In [38]: theDoes.count() PySpark worker failed with exception: PySpark worker failed with exception: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' 14/04/30 14:43:53 ERROR Executor: Exception in task ID 23
org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File
[jira] [Commented] (SPARK-2023) PySpark reduce does a map side reduce and then sends the results to the driver for final reduce, instead do this more like Scala Spark.
[ https://issues.apache.org/jira/browse/SPARK-2023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077075#comment-14077075 ] Davies Liu commented on SPARK-2023: --- In most cases, the result of reduce will be small, so collecting these small results from each partition and then reducing them will not be a bottleneck. PySpark reduce does a map side reduce and then sends the results to the driver for final reduce, instead do this more like Scala Spark. --- Key: SPARK-2023 URL: https://issues.apache.org/jira/browse/SPARK-2023 Project: Spark Issue Type: Improvement Components: PySpark Reporter: holdenk PySpark reduce does a map side reduce and then sends the results to the driver for final reduce, instead do this more like Scala Spark. The current implementation could be a bottleneck. -- This message was sent by Atlassian JIRA (v6.2#6252)
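For context, the scheme under discussion looks roughly like this (a simplified sketch of the two-phase reduce, not PySpark's exact code):
{code:python}
import functools

def rdd_reduce(rdd, f):
    # Phase 1, map side: fold each partition down to at most one value.
    def reduce_partition(iterator):
        acc, seen = None, False
        for x in iterator:
            acc = x if not seen else f(acc, x)
            seen = True
        return [acc] if seen else []
    # Phase 2, driver side: collect the small per-partition results
    # (at most one value per partition) and finish the fold locally.
    return functools.reduce(f, rdd.mapPartitions(reduce_partition).collect())
{code}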
[jira] [Created] (SPARK-2719) Add Mima binary checks to Flume-Sink
Tathagata Das created SPARK-2719: Summary: Add Mima binary checks to Flume-Sink Key: SPARK-2719 URL: https://issues.apache.org/jira/browse/SPARK-2719 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.1.0 Reporter: Tathagata Das Priority: Minor Mima binary check has been disabled for flume-sink in 1.1, as no previous version of flume-sink exists. This should be enabled for 1.2. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077109#comment-14077109 ] Timothy Chen commented on SPARK-2022: - Github PR: https://github.com/apache/spark/pull/1622 Spark 1.0.0 is failing if mesos.coarse set to true -- Key: SPARK-2022 URL: https://issues.apache.org/jira/browse/SPARK-2022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Marek Wiewiorka Assignee: Tim Chen Priority: Critical more stderr --- WARNING: Logging before InitGoogleLogging() is written to STDERR I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 201405220917-134217738-5050-27119-0 Exception in thread main java.lang.NumberFormatException: For input string: sparkseq003.cloudapp.net at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) more stdout --- Registered executor on sparkseq003.cloudapp.net Starting task 5 Forked command at 61202 sh -c '/home/mesos/spark-1.0.0/bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Dspark.mesos.coarse=true akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 4' Command exited with status 1 (pid: 61202) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2022) Spark 1.0.0 is failing if mesos.coarse set to true
[ https://issues.apache.org/jira/browse/SPARK-2022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077110#comment-14077110 ] Apache Spark commented on SPARK-2022: - User 'tnachen' has created a pull request for this issue: https://github.com/apache/spark/pull/1622 Spark 1.0.0 is failing if mesos.coarse set to true -- Key: SPARK-2022 URL: https://issues.apache.org/jira/browse/SPARK-2022 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.0.0 Reporter: Marek Wiewiorka Assignee: Tim Chen Priority: Critical more stderr --- WARNING: Logging before InitGoogleLogging() is written to STDERR I0603 16:07:53.721132 61192 exec.cpp:131] Version: 0.18.2 I0603 16:07:53.725230 61200 exec.cpp:205] Executor registered on slave 201405220917-134217738-5050-27119-0 Exception in thread main java.lang.NumberFormatException: For input string: sparkseq003.cloudapp.net at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65) at java.lang.Integer.parseInt(Integer.java:492) at java.lang.Integer.parseInt(Integer.java:527) at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229) at scala.collection.immutable.StringOps.toInt(StringOps.scala:31) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:135) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) more stdout --- Registered executor on sparkseq003.cloudapp.net Starting task 5 Forked command at 61202 sh -c '/home/mesos/spark-1.0.0/bin/spark-class org.apache.spark.executor.CoarseGrainedExecutorBackend -Dspark.mesos.coarse=true akka.tcp://sp...@sparkseq001.cloudapp.net:40312/user/CoarseG rainedScheduler 201405220917-134217738-5050-27119-0 sparkseq003.cloudapp.net 4' Command exited with status 1 (pid: 61202) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077115#comment-14077115 ] Robbie Russo commented on SPARK-1649: - Thrift also supports null values in a map and this makes any thrift generated parquet files that contain a map unreadable by spark sql due to the following code in parquet-thrift for generating the schema for maps: {code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid} @Override public void visit(ThriftType.MapType mapType) { final ThriftField mapKeyField = mapType.getKey(); final ThriftField mapValueField = mapType.getValue(); //save env for map String mapName = currentName; Type.Repetition mapRepetition = currentRepetition; //=handle key currentFieldPath.push(mapKeyField); currentName = "key"; currentRepetition = REQUIRED; mapKeyField.getType().accept(this); Type keyType = currentType;//currentType is the already converted type currentFieldPath.pop(); //=handle value currentFieldPath.push(mapValueField); currentName = "value"; currentRepetition = OPTIONAL; mapValueField.getType().accept(this); Type valueType = currentType; currentFieldPath.pop(); if (keyType == null && valueType == null) { currentType = null; return; } if (keyType == null && valueType != null) throw new ThriftProjectionException("key of map is not specified in projection: " + currentFieldPath); //restore Env currentName = mapName; currentRepetition = mapRepetition; currentType = ConversionPatterns.mapType(currentRepetition, currentName, keyType, valueType); } {code} Which causes an error on the spark side when we reach this step in the toDataType function that asserts that both the key and value are of repetition level REQUIRED: {code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid} case ParquetOriginalType.MAP => { assert( !groupType.getFields.apply(0).isPrimitive, "Parquet Map type malformatted: expected nested group for map!") val keyValueGroup = groupType.getFields.apply(0).asGroupType() assert( keyValueGroup.getFieldCount == 2, "Parquet Map type malformatted: nested group should have 2 (key, value) fields!") val keyType = toDataType(keyValueGroup.getFields.apply(0)) println("here") assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED) val valueType = toDataType(keyValueGroup.getFields.apply(1)) assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED) new MapType(keyType, valueType) } {code} Currently I have modified parquet-thrift to use repetition REQUIRED just to make spark sql able to work on the parquet files since we don't actually use null values in our maps. However it would be preferred to use parquet-thrift and spark sql out of the box and have them work nicely together with our existing thrift data types without having to modify dependencies. Figure out Nullability semantics for Array elements and Map values -- Key: SPARK-1649 URL: https://issues.apache.org/jira/browse/SPARK-1649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Andre Schumacher Priority: Critical For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this: abstract class DataType(nullable: Boolean = true) Concrete subclasses could then override the nullable val.
Mostly this could be left as the default but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077127#comment-14077127 ] Yin Huai commented on SPARK-1649: - [~rrusso2007] Can you open a JIRA for the issue of reading Parquet datasets? Figure out Nullability semantics for Array elements and Map values -- Key: SPARK-1649 URL: https://issues.apache.org/jira/browse/SPARK-1649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Andre Schumacher Priority: Critical For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this: abstract class DataType(nullable: Boolean = true) Concrete subclasses could then override the nullable val. Mostly this could be left as the default but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1687) Support NamedTuples in RDDs
[ https://issues.apache.org/jira/browse/SPARK-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077136#comment-14077136 ] Apache Spark commented on SPARK-1687: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1623 Support NamedTuples in RDDs --- Key: SPARK-1687 URL: https://issues.apache.org/jira/browse/SPARK-1687 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Environment: Spark version 1.0.0-SNAPSHOT Python 2.7.5 Reporter: Pat McDonough Assignee: Davies Liu Add Support for NamedTuples in RDDs. Some sample code is below, followed by the current error that comes from it. Based on a quick conversation with [~ahirreddy], [Dill|https://github.com/uqfoundation/dill] might be a good solution here. {code} In [26]: from collections import namedtuple ... In [33]: Person = namedtuple('Person', 'id firstName lastName') In [34]: jon = Person(1, 'Jon', 'Doe') In [35]: jane = Person(2, 'Jane', 'Doe') In [36]: theDoes = sc.parallelize((jon, jane)) In [37]: theDoes.collect() Out[37]: [Person(id=1, firstName='Jon', lastName='Doe'), Person(id=2, firstName='Jane', lastName='Doe')] In [38]: theDoes.count() PySpark worker failed with exception: PySpark worker failed with exception: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield self._read_with_length(stream) File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 146, in _read_with_length return self.loads(obj) AttributeError: 'module' object has no attribute 'Person' 14/04/30 14:43:53 ERROR
Executor: Exception in task ID 23 org.apache.spark.api.python.PythonException: Traceback (most recent call last): File /Users/pat/Projects/spark/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 1373, in pipeline_func return func(split, prev_func(split, iterator)) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 283, in func def func(s, iterator): return f(iterator) File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in lambda return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/rdd.py, line 708, in genexpr return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() File /Users/pat/Projects/spark/python/pyspark/serializers.py, line 129, in load_stream yield
[jira] [Created] (SPARK-2720) spark-examples should depend on HBase modules for HBase 0.96+
Ted Yu created SPARK-2720: - Summary: spark-examples should depend on HBase modules for HBase 0.96+ Key: SPARK-2720 URL: https://issues.apache.org/jira/browse/SPARK-2720 Project: Spark Issue Type: Task Reporter: Ted Yu Priority: Minor With this change: {code} diff --git a/pom.xml b/pom.xml index 93ef3b9..092430a 100644 --- a/pom.xml +++ b/pom.xml @@ -122,7 +122,7 @@ <hadoop.version>1.0.4</hadoop.version> <protobuf.version>2.4.1</protobuf.version> <yarn.version>${hadoop.version}</yarn.version> -<hbase.version>0.94.6</hbase.version> +<hbase.version>0.98.4</hbase.version> <zookeeper.version>3.4.5</zookeeper.version> <hive.version>0.12.0</hive.version> <parquet.version>1.4.3</parquet.version> {code} I got: {code} [ERROR] Failed to execute goal on project spark-examples_2.10: Could not resolve dependencies for project org.apache.spark:spark-examples_2.10:jar:1.1.0-SNAPSHOT: Could not find artifact org.apache.hbase:hbase:jar:0.98.4 in maven-repo (http://repo.maven.apache.org/maven2) - [Help 1] {code} To build against HBase 0.96+, spark-examples needs to specify HBase modules (hbase-client, etc) in dependencies - possibly using a new profile. -- This message was sent by Atlassian JIRA (v6.2#6252)
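A sketch of what the per-module dependencies might look like under such a profile, since HBase 0.96+ split the monolithic hbase artifact into modules (the exact artifact list the examples module needs would require verification):
{code:xml}
<!-- examples/pom.xml, under a hypothetical profile for HBase 0.96+ -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-common</artifactId>
  <version>${hbase.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-client</artifactId>
  <version>${hbase.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-server</artifactId>
  <version>${hbase.version}</version>
</dependency>
{code}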
[jira] [Commented] (SPARK-1649) Figure out Nullability semantics for Array elements and Map values
[ https://issues.apache.org/jira/browse/SPARK-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077150#comment-14077150 ] Robbie Russo commented on SPARK-1649: - Just opened https://issues.apache.org/jira/browse/SPARK-2721 Figure out Nullability semantics for Array elements and Map values -- Key: SPARK-1649 URL: https://issues.apache.org/jira/browse/SPARK-1649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.1.0 Reporter: Andre Schumacher Priority: Critical For the underlying storage layer it would simplify things such as schema conversions, predicate filter determination and such to record in the data type itself whether a column can be nullable. So the DataType type could look like this: abstract class DataType(nullable: Boolean = true) Concrete subclasses could then override the nullable val. Mostly this could be left as the default but when types can be contained in nested types one could optimize for, e.g., arrays with elements that are nullable and those that are not. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2721) Fix MapType compatibility issues with reading Parquet datasets
Robbie Russo created SPARK-2721: --- Summary: Fix MapType compatibility issues with reading Parquet datasets Key: SPARK-2721 URL: https://issues.apache.org/jira/browse/SPARK-2721 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.1 Reporter: Robbie Russo Parquet-thrift (along with most likely other implementations of parquet) supports null values in a map and this makes any thrift generated parquet files that contain a map unreadable by spark sql due to the following code in parquet-thrift for generating the schema for maps: {code:title=parquet.thrift.ThriftSchemaConverter.java|borderStyle=solid} @Override public void visit(ThriftType.MapType mapType) { final ThriftField mapKeyField = mapType.getKey(); final ThriftField mapValueField = mapType.getValue(); //save env for map String mapName = currentName; Type.Repetition mapRepetition = currentRepetition; //=handle key currentFieldPath.push(mapKeyField); currentName = "key"; currentRepetition = REQUIRED; mapKeyField.getType().accept(this); Type keyType = currentType;//currentType is the already converted type currentFieldPath.pop(); //=handle value currentFieldPath.push(mapValueField); currentName = "value"; currentRepetition = OPTIONAL; mapValueField.getType().accept(this); Type valueType = currentType; currentFieldPath.pop(); if (keyType == null && valueType == null) { currentType = null; return; } if (keyType == null && valueType != null) throw new ThriftProjectionException("key of map is not specified in projection: " + currentFieldPath); //restore Env currentName = mapName; currentRepetition = mapRepetition; currentType = ConversionPatterns.mapType(currentRepetition, currentName, keyType, valueType); } {code} Which causes an error on the spark side when we reach this step in the toDataType function that asserts that both the key and value are of repetition level REQUIRED: {code:title=org.apache.spark.sql.parquet.ParquetTypes.scala|borderStyle=solid} case ParquetOriginalType.MAP => { assert( !groupType.getFields.apply(0).isPrimitive, "Parquet Map type malformatted: expected nested group for map!") val keyValueGroup = groupType.getFields.apply(0).asGroupType() assert( keyValueGroup.getFieldCount == 2, "Parquet Map type malformatted: nested group should have 2 (key, value) fields!") val keyType = toDataType(keyValueGroup.getFields.apply(0)) println("here") assert(keyValueGroup.getFields.apply(0).getRepetition == Repetition.REQUIRED) val valueType = toDataType(keyValueGroup.getFields.apply(1)) assert(keyValueGroup.getFields.apply(1).getRepetition == Repetition.REQUIRED) new MapType(keyType, valueType) } {code} Currently I have modified parquet-thrift to use repetition REQUIRED just to make spark sql able to work on the parquet files since we don't actually use null values in our maps. However it would be preferred to use parquet-thrift and spark sql out of the box and have them work nicely together with our existing thrift data types without having to modify dependencies. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
[ https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077155#comment-14077155 ] Michael Yannakopoulos commented on SPARK-2550: -- Please ignore the previous pull request since it did not include the commits that should appear related to the aforementioned issue. The new correct pull request is the following: [https://github.com/apache/spark/pull/1624] Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Michael Yannakopoulos Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2550) Support regularization and intercept in pyspark's linear methods
[ https://issues.apache.org/jira/browse/SPARK-2550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077158#comment-14077158 ] Apache Spark commented on SPARK-2550: - User 'miccagiann' has created a pull request for this issue: https://github.com/apache/spark/pull/1624 Support regularization and intercept in pyspark's linear methods Key: SPARK-2550 URL: https://issues.apache.org/jira/browse/SPARK-2550 Project: Spark Issue Type: Improvement Components: MLlib, PySpark Affects Versions: 1.0.0 Reporter: Xiangrui Meng Assignee: Michael Yannakopoulos Python API doesn't provide options to set regularization parameter and intercept in linear methods, which should be fixed in v1.1. -- This message was sent by Atlassian JIRA (v6.2#6252)
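Once merged, calls like the following should work from Python; the parameter names are assumed from the pull request discussion and may change before release:
{code:python}
from pyspark.mllib.classification import LogisticRegressionWithSGD

# 'points' is an RDD of LabeledPoint. regType/regParam/intercept are the
# new knobs being added (names assumed from the PR, subject to change).
model = LogisticRegressionWithSGD.train(
    points, iterations=100, step=1.0,
    regParam=0.01, regType="l2", intercept=True)
{code}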
[jira] [Commented] (SPARK-2382) build error:
[ https://issues.apache.org/jira/browse/SPARK-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077161#comment-14077161 ] Mukul Jain commented on SPARK-2382: --- How do I open a PR? I am planning to close this issue with the comment: building some of the projects, such as Project External MQTT, requires explicit access over HTTPS, so make sure your build machine is configured properly to download dependencies over HTTPS (check HTTPS proxy configuration and such). build error: - Key: SPARK-2382 URL: https://issues.apache.org/jira/browse/SPARK-2382 Project: Spark Issue Type: Question Components: Build Affects Versions: 1.0.0 Environment: Ubuntu 12.04 precise. spark@ubuntu-cdh5-spark:~/spark-1.0.0$ mvn -version Apache Maven 3.0.4 Maven home: /usr/share/maven Java version: 1.6.0_31, vendor: Sun Microsystems Inc. Java home: /usr/lib/jvm/j2sdk1.6-oracle/jre Default locale: en_US, platform encoding: UTF-8 OS name: linux, version: 3.11.0-15-generic, arch: amd64, family: unix Reporter: Mukul Jain Labels: newbie Unable to build: Maven can't download a dependency. I checked my http_proxy and https_proxy settings and they are working fine; other HTTP and HTTPS dependencies were downloaded fine. The build process always gets stuck at this repo. Manually downloading also fails with an exception. [INFO] [INFO] Building Spark Project External MQTT 1.0.0 [INFO] Downloading: https://repository.apache.org/content/repositories/releases/org/eclipse/paho/mqtt-client/0.4.0/mqtt-client-0.4.0.pom Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: I/O exception (java.net.ConnectException) caught when processing request: Connection timed out Jul 6, 2014 4:53:26 PM org.apache.commons.httpclient.HttpMethodDirector executeWithRetry INFO: Retrying request -- This message was sent by Atlassian JIRA (v6.2#6252)
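Since Maven does not read the http_proxy/https_proxy environment variables, the usual fix is to configure the proxy in Maven's own settings (host and port below are placeholders):
{code:xml}
<!-- ~/.m2/settings.xml -->
<settings>
  <proxies>
    <proxy>
      <id>https-proxy</id>
      <active>true</active>
      <protocol>https</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>
{code}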
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077179#comment-14077179 ] Ted Malaska commented on SPARK-2447: Making good progress. Just FYI, it may take a little longer because the version of HBase in Spark is 0.94.1, which has a couple of different APIs. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Tathagata Das Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
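The connection-per-partition, batched-put pattern being designed looks roughly like this, sketched in Python with a hypothetical HBase client (the actual work targets the JVM APIs and HBase 0.94.1):
{code:python}
def upsert_partition(rows):
    # One connection per partition; auto-flush off so puts are buffered
    # for throughput, with a single flush when the partition is done.
    table = hbase_connect("my_table", auto_flush=False)  # hypothetical client
    try:
        for row_key, mutations in rows:
            table.put(row_key, mutations)
        table.flush()
    finally:
        table.close()

rdd.foreachPartition(upsert_partition)
{code}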
[jira] [Commented] (SPARK-2580) broken pipe collecting schemardd results
[ https://issues.apache.org/jira/browse/SPARK-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077193#comment-14077193 ] Apache Spark commented on SPARK-2580: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/1625 broken pipe collecting schemardd results Key: SPARK-2580 URL: https://issues.apache.org/jira/browse/SPARK-2580 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.0.0 Environment: fedora 21 local and rhel 7 clustered (standalone) Reporter: Matthew Farrellee Assignee: Davies Liu Labels: py4j, pyspark {code} from pyspark.sql import SQLContext sqlCtx = SQLContext(sc) # size of cluster impacts where this breaks (i.e. 2**15 vs 2**2) data = sc.parallelize([{'name': 'index', 'value': 0}] * 2**20) sdata = sqlCtx.inferSchema(data) sdata.first() {code} result: note - result returned as well as error {code} sdata.first() 14/07/18 12:10:25 INFO SparkContext: Starting job: runJob at PythonRDD.scala:290 14/07/18 12:10:25 INFO DAGScheduler: Got job 43 (runJob at PythonRDD.scala:290) with 1 output partitions (allowLocal=true) 14/07/18 12:10:25 INFO DAGScheduler: Final stage: Stage 52(runJob at PythonRDD.scala:290) 14/07/18 12:10:25 INFO DAGScheduler: Parents of final stage: List() 14/07/18 12:10:25 INFO DAGScheduler: Missing parents: List() 14/07/18 12:10:25 INFO DAGScheduler: Computing the requested partition locally 14/07/18 12:10:25 INFO PythonRDD: Times: total = 45, boot = 3, init = 40, finish = 2 14/07/18 12:10:25 INFO SparkContext: Job finished: runJob at PythonRDD.scala:290, took 0.048348426 s {u'name': u'index', u'value': 0} PySpark worker failed with exception: Traceback (most recent call last): File /home/matt/Documents/Repositories/spark/dist/python/pyspark/worker.py, line 77, in main serializer.dump_stream(func(split_index, iterator), outfile) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 191, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 124, in dump_stream self._write_with_length(obj, stream) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/serializers.py, line 139, in _write_with_length stream.write(serialized) IOError: [Errno 32] Broken pipe Traceback (most recent call last): File /home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py, line 130, in launch_worker worker(listen_sock) File /home/matt/Documents/Repositories/spark/dist/python/pyspark/daemon.py, line 119, in worker outfile.flush() IOError: [Errno 32] Broken pipe {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2305) pyspark - depend on py4j 0.8.1
[ https://issues.apache.org/jira/browse/SPARK-2305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077202#comment-14077202 ] Apache Spark commented on SPARK-2305: - User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/1626 pyspark - depend on py4j 0.8.1 Key: SPARK-2305 URL: https://issues.apache.org/jira/browse/SPARK-2305 Project: Spark Issue Type: Dependency upgrade Components: PySpark Affects Versions: 1.0.0 Reporter: Matthew Farrellee Assignee: Josh Rosen Priority: Minor py4j 0.8.1 has a bug in java_import that results in extraneous warnings pyspark should depend on a py4j version 0.8.1 (non exists at time of filing) that includes https://github.com/bartdag/py4j/commit/64cd657e75dbe769c5e3bf757fcf83b5c0f8f4f0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2722) Mechanism for escaping spark configs is not consistent
Andrew Or created SPARK-2722: Summary: Mechanism for escaping spark configs is not consistent Key: SPARK-2722 URL: https://issues.apache.org/jira/browse/SPARK-2722 Project: Spark Issue Type: Bug Affects Versions: 1.0.1 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 Currently, you can specify a spark config in spark-defaults.conf as follows: {code} spark.magic "Mr. Johnson" {code} and this will preserve the double quotes as part of the string. Naturally, if you want to do the equivalent in spark.*.extraJavaOptions, you would use the following {code} spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\" {code} However, this fails because the backslashes go away and it tries to interpret Johnson as the main class argument. Instead, you have to do the following {code} spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\" {code} which is not super intuitive. Note that this only applies to standalone mode. In YARN it's not even possible to use quoted strings in config values (SPARK-2718). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2722) Mechanism for escaping spark configs is not consistent
[ https://issues.apache.org/jira/browse/SPARK-2722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2722: - Description: Currently, you can specify a spark config in spark-defaults.conf as follows: {code} spark.magic "Mr. Johnson" {code} and this will preserve the double quotes as part of the string. Naturally, if you want to do the equivalent in spark.*.extraJavaOptions, you would use the following: {code} spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\" {code} However, this fails because the backslashes go away and it tries to interpret Johnson as the main class argument. Instead, you have to do the following: {code} spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\" {code} which is not super intuitive. Note that this only applies to standalone mode. In YARN it's not even possible to use quoted strings in config values (SPARK-2718). was: Currently, you can specify a spark config in spark-defaults.conf as follows: {code} spark.magic "Mr. Johnson" {code} and this will preserve the double quotes as part of the string. Naturally, if you want to do the equivalent in spark.*.extraJavaOptions, you would use the following {code} spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\" {code} However, this fails because the backslashes go away and it tries to interpret Johnson as the main class argument. Instead, you have to do the following {code} spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\" {code} which is not super intuitive. Note that this only applies to standalone mode. In YARN it's not even possible to use quoted strings in config values (SPARK-2718). Mechanism for escaping spark configs is not consistent -- Key: SPARK-2722 URL: https://issues.apache.org/jira/browse/SPARK-2722 Project: Spark Issue Type: Bug Affects Versions: 1.0.1 Reporter: Andrew Or Priority: Minor Fix For: 1.1.0 Currently, you can specify a spark config in spark-defaults.conf as follows: {code} spark.magic "Mr. Johnson" {code} and this will preserve the double quotes as part of the string. Naturally, if you want to do the equivalent in spark.*.extraJavaOptions, you would use the following: {code} spark.executor.extraJavaOptions -Dmagic=\"Mr. Johnson\" {code} However, this fails because the backslashes go away and it tries to interpret Johnson as the main class argument. Instead, you have to do the following: {code} spark.executor.extraJavaOptions -Dmagic=\\\"Mr. Johnson\\\" {code} which is not super intuitive. Note that this only applies to standalone mode. In YARN it's not even possible to use quoted strings in config values (SPARK-2718). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-791) [pyspark] operator.getattr not serialized
[ https://issues.apache.org/jira/browse/SPARK-791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077239#comment-14077239 ] Davies Liu commented on SPARK-791: -- This will be fixed by PR-1627[1] [1] https://github.com/apache/spark/pull/1627 [pyspark] operator.getattr not serialized - Key: SPARK-791 URL: https://issues.apache.org/jira/browse/SPARK-791 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.7.2, 0.9.0 Reporter: Jim Blomo Priority: Minor Using operator.itemgetter as a function in map seems to confuse the serialization process in pyspark. I'm using itemgetter to return tuples, which fails with a TypeError (details below). Using an equivalent lambda function returns the correct result. Use a test file: {code:sh} echo 1,1 > test.txt {code} Then try mapping it to a tuple: {code:python} import csv sc.textFile("test.txt").mapPartitions(csv.reader).map(lambda l: (l[0],l[1])).first() Out[7]: ('1', '1') {code} But this does not work when using operator.itemgetter: {code:python} import operator sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0,1)).first() # TypeError: list indices must be integers, not tuple {code} This is running with git master, commit 6d60fe571a405eb9306a2be1817901316a46f892 IPython 0.13.2 java version "1.7.0_25" Scala code runner version 2.9.1 Ubuntu 12.04 Full debug output: {code:python} In [9]: sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0,1)).first() 13/07/04 16:19:49 INFO storage.MemoryStore: ensureFreeSpace(33632) called with curMem=201792, maxMem=339585269 13/07/04 16:19:49 INFO storage.MemoryStore: Block broadcast_6 stored as values to memory (estimated size 32.8 KB, free 323.6 MB) 13/07/04 16:19:49 INFO mapred.FileInputFormat: Total input paths to process : 1 13/07/04 16:19:49 INFO spark.SparkContext: Starting job: takePartition at NativeMethodAccessorImpl.java:-2 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Got job 4 (takePartition at NativeMethodAccessorImpl.java:-2) with 1 output partitions (allowLocal=true) 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Final stage: Stage 4 (PythonRDD at NativeConstructorAccessorImpl.java:-2) 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Parents of final stage: List() 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Missing parents: List() 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Computing the requested partition locally 13/07/04 16:19:49 INFO scheduler.DAGScheduler: Failed to run takePartition at NativeMethodAccessorImpl.java:-2 --- Py4JJavaError Traceback (most recent call last) ipython-input-9-1fdb3e7a8ac7 in module() 1 sc.textFile("test.txt").mapPartitions(csv.reader).map(operator.itemgetter(0,1)).first() /home/jim/src/spark/python/pyspark/rdd.pyc in first(self) 389 2 390 -- 391 return self.take(1)[0] 392 393 def saveAsTextFile(self, path): /home/jim/src/spark/python/pyspark/rdd.pyc in take(self, num) 372 items = [] 373 for partition in range(self._jrdd.splits().size()): -- 374 iterator = self.ctx._takePartition(self._jrdd.rdd(), partition) 375 # Each item in the iterator is a string, Python object, batch of 376 # Python objects.
Regardless, it is sufficient to take `num` /home/jim/src/spark/python/lib/py4j0.7.egg/py4j/java_gateway.pyc in __call__(self, *args) 498 answer = self.gateway_client.send_command(command) 499 return_value = get_return_value(answer, self.gateway_client, -- 500 self.target_id, self.name) 501 502 for temp_arg in temp_args: /home/jim/src/spark/python/lib/py4j0.7.egg/py4j/protocol.pyc in get_return_value(answer, gateway_client, target_id, name) 298 raise Py4JJavaError( 299 'An error occurred while calling {0}{1}{2}.\n'. -- 300 format(target_id, '.', name), value) 301 else: 302 raise Py4JError( Py4JJavaError: An error occurred while calling z:spark.api.python.PythonRDD.takePartition. : spark.api.python.PythonException: Traceback (most recent call last): File /home/jim/src/spark/python/pyspark/worker.py, line 53, in main for obj in func(split_index, iterator): File /home/jim/src/spark/python/pyspark/serializers.py, line 24, in batched for item in iterator: TypeError: list indices must be integers, not tuple at spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:117) at
[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS
[ https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077270#comment-14077270 ] Russell Jurney commented on SPARK-1138: --- I built spark master with 'sbt/sbt assembly publish-local' and had issues with my hadoop version, which is CDH 4.4. Then I built against CDH 4.4 via 'SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.4.0 sbt/sbt assembly publish-local'. Note: I did not clean, and I saw this issue. This is with Spark trunk at a time when the released version is 1.0.1. Then I cleaned and rebuilt, and the issue persists. What should I do? Spark 0.9.0 does not work with Hadoop / HDFS Key: SPARK-1138 URL: https://issues.apache.org/jira/browse/SPARK-1138 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne UPDATE: This problem is certainly related to trying to use Spark 0.9.0 and the latest Cloudera Hadoop / HDFS in the same jar. It seems no matter how I fiddle with the deps, they do not play nice together. I'm getting a java.util.concurrent.TimeoutException when trying to create a spark context with 0.9. I cannot, whatever I do, change the timeout. I've tried using System.setProperty, the SparkConf mechanism of creating a SparkContext, and the -D flags when executing my jar. I seem to be able to run simple jobs from the spark-shell OK, but my more complicated jobs require external libraries, so I need to build jars and execute them. Some code that causes this: println("Creating config") val conf = new SparkConf() .setMaster(clusterMaster) .setAppName("MyApp") .setSparkHome(sparkHome) .set("spark.akka.askTimeout", parsed.getOrElse("timeouts", "100")) .set("spark.akka.timeout", parsed.getOrElse("timeouts", "100")) println("Creating sc") implicit val sc = new SparkContext(conf) The output: Creating config Creating sc log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. [ERROR] [02/26/2014 11:05:25.491] [main] [Remoting] Remoting error: [Startup timed out] [ akka.remote.RemoteTransportException: Startup timed out at akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129) at akka.remote.Remoting.start(Remoting.scala:191) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96) at org.apache.spark.SparkEnv$.create(SparkEnv.scala:126) at org.apache.spark.SparkContext.<init>(SparkContext.scala:139) at com.adbrain.accuracy.EvaluateAdtruthIDs$.main(EvaluateAdtruthIDs.scala:40) at com.adbrain.accuracy.EvaluateAdtruthIDs.main(EvaluateAdtruthIDs.scala) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) ... 
11 more ] Exception in thread "main" java.util.concurrent.TimeoutException: Futures timed out after [1 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:107) at akka.remote.Remoting.start(Remoting.scala:173) at akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184) at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579) at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577) at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588) at akka.actor.ActorSystem$.apply(ActorSystem.scala:111) at akka.actor.ActorSystem$.apply(ActorSystem.scala:104) at org.apache.spark.util.AkkaUtils$.createActorSystem(AkkaUtils.scala:96)
[jira] [Commented] (SPARK-1138) Spark 0.9.0 does not work with Hadoop / HDFS
[ https://issues.apache.org/jira/browse/SPARK-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077271#comment-14077271 ] Russell Jurney commented on SPARK-1138: --- See https://github.com/apache/spark/pull/455 Spark 0.9.0 does not work with Hadoop / HDFS Key: SPARK-1138 URL: https://issues.apache.org/jira/browse/SPARK-1138 Project: Spark Issue Type: Bug Reporter: Sam Abeyratne -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2723) Block Manager should catch exceptions in putValues
Shivaram Venkataraman created SPARK-2723: Summary: Block Manager should catch exceptions in putValues Key: SPARK-2723 URL: https://issues.apache.org/jira/browse/SPARK-2723 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Shivaram Venkataraman The BlockManager should catch exceptions encountered while writing out files to disk. Right now these exceptions get counted as user-level task failures and the job is aborted after failing 4 times. We should either fail the executor or handle this better to prevent the job from dying. I ran into an issue where one disk on a large EC2 cluster failed and this resulted in a long-running job terminating. Longer term, we should also look at black-listing local directories when one of them becomes unusable. Exception pasted below: 14/07/29 00:55:39 WARN scheduler.TaskSetManager: Loss was due to java.io.FileNotFoundException java.io.FileNotFoundException: /mnt2/spark/spark-local-20140728175256-e7cb/28/broadcast_264_piece20 (Input/output error) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.<init>(FileOutputStream.java:221) at java.io.FileOutputStream.<init>(FileOutputStream.java:171) at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:79) at org.apache.spark.storage.DiskStore.putValues(DiskStore.scala:66) at org.apache.spark.storage.BlockManager.dropFromMemory(BlockManager.scala:847) at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:267) at org.apache.spark.storage.MemoryStore$$anonfun$ensureFreeSpace$4.apply(MemoryStore.scala:256) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.storage.MemoryStore.ensureFreeSpace(MemoryStore.scala:256) at org.apache.spark.storage.MemoryStore.tryToPut(MemoryStore.scala:179) at org.apache.spark.storage.MemoryStore.putValues(MemoryStore.scala:76) at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:663) at org.apache.spark.storage.BlockManager.put(BlockManager.scala:574) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108) -- This message was sent by Atlassian JIRA (v6.2#6252)
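As a rough sketch of the direction proposed here (the names below are illustrative, not the actual BlockManager code), the disk write could be wrapped so that an I/O failure surfaces as a storage-layer problem rather than being counted as a user-level task failure:
{code:scala}
import java.io.{FileOutputStream, IOException}

// Illustrative only: wrap the write so an I/O error on a bad disk is
// reported to the storage layer instead of failing the running task.
def writeBlock(path: String, bytes: Array[Byte]): Boolean = {
  var out: FileOutputStream = null
  try {
    out = new FileOutputStream(path)
    out.write(bytes)
    true
  } catch {
    case e: IOException =>
      // A fuller fix might also black-list the local directory that
      // contains `path` so later blocks avoid the failed disk.
      System.err.println(s"Disk write failed for $path: ${e.getMessage}")
      false
  } finally {
    if (out != null) out.close()
  }
}
{code}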
[jira] [Reopened] (SPARK-2512) Stratified sampling
[ https://issues.apache.org/jira/browse/SPARK-2512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doris Xin reopened SPARK-2512: -- Stratified sampling --- Key: SPARK-2512 URL: https://issues.apache.org/jira/browse/SPARK-2512 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Doris Xin PR: https://github.com/apache/spark/pull/1025 -- This message was sent by Atlassian JIRA (v6.2#6252)
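For context, the stratified sampling work surfaces on pair RDDs as sampleByKey, where each key is a stratum with its own sampling rate. A minimal usage sketch, assuming a running spark-shell with an existing SparkContext sc (API names as they shipped in Spark 1.1):
{code:scala}
// Each key ("stratum") is sampled at the rate given in `fractions`.
val data = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3), ("b", 4)))
val fractions = Map("a" -> 0.5, "b" -> 1.0)

// withReplacement = false, per-key fractions, fixed seed for repeatability.
val sampled = data.sampleByKey(false, fractions, 42L)
sampled.collect().foreach(println)
{code}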
[jira] [Created] (SPARK-2724) Python version of Random RDD without support for arbitrary distribution
Doris Xin created SPARK-2724: Summary: Python version of Random RDD without support for arbitrary distribution Key: SPARK-2724 URL: https://issues.apache.org/jira/browse/SPARK-2724 Project: Spark Issue Type: Sub-task Reporter: Doris Xin A Python version of [SPARK-2514], but without support for randomRDD and randomVectorRDD, which accept arbitrary DistributionGenerator objects. -- This message was sent by Atlassian JIRA (v6.2#6252)
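For reference, this ports the Scala random RDD generators to Python while leaving out the variants that take a user-supplied generator. The Scala side looks roughly like this (names as they shipped in org.apache.spark.mllib.random; assumes a spark-shell with an existing SparkContext sc):
{code:scala}
import org.apache.spark.mllib.random.RandomRDDs

// One million draws from N(0, 1), spread over 10 partitions.
val normals = RandomRDDs.normalRDD(sc, 1000000L, 10, 1L)

// Sanity check: the mean should be near 0 and the stdev near 1.
println((normals.mean(), normals.stdev()))
{code}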
[jira] [Commented] (SPARK-2724) Python version of Random RDD without support for arbitrary distribution
[ https://issues.apache.org/jira/browse/SPARK-2724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14077286#comment-14077286 ] Apache Spark commented on SPARK-2724: - User 'dorx' has created a pull request for this issue: https://github.com/apache/spark/pull/1628 Python version of Random RDD without support for arbitrary distribution --- Key: SPARK-2724 URL: https://issues.apache.org/jira/browse/SPARK-2724 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Doris Xin Assignee: Doris Xin A Python version of [SPARK-2514], but without support for randomRDD and randomVectorRDD, which accept arbitrary DistributionGenerator objects. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2134) Report metrics before application finishes
[ https://issues.apache.org/jira/browse/SPARK-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2134: - Assignee: Rahul Singhal Report metrics before application finishes -- Key: SPARK-2134 URL: https://issues.apache.org/jira/browse/SPARK-2134 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Rahul Singhal Assignee: Rahul Singhal Priority: Minor Metric values could have been updated after they were last reported. These last-updated values may be useful, but they will never be reported if the application itself finishes first. A simple solution is to update/report all the sinks before stopping the MetricsSystem. The problem is that the metrics system may depend on some other component that has already been stopped. -- This message was sent by Atlassian JIRA (v6.2#6252)
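A minimal sketch of the proposed fix, with illustrative names rather than Spark's exact metrics API: flush every sink one last time before tearing the metrics system down, so the last-updated values still get out.
{code:scala}
// Illustrative sink abstraction with an explicit flush step.
trait Sink {
  def report(): Unit // push the current metric values to the backend
  def stop(): Unit
}

// Report once more before stopping, so values updated after the last
// scheduled report are not silently dropped when the app finishes.
def stopMetricsSystem(sinks: Seq[Sink]): Unit = {
  sinks.foreach(_.report())
  sinks.foreach(_.stop())
}
{code}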
[jira] [Created] (SPARK-2726) Remove SortOrder in ShuffleDependency and HashShuffleReader
Reynold Xin created SPARK-2726: -- Summary: Remove SortOrder in ShuffleDependency and HashShuffleReader Key: SPARK-2726 URL: https://issues.apache.org/jira/browse/SPARK-2726 Project: Spark Issue Type: Bug Reporter: Reynold Xin Assignee: Reynold Xin SPARK-2125 introduced a SortOrder in ShuffleDependency and HashShuffleReader. However, the key ordering already includes the SortOrder information, since an Ordering can be reversed easily. This is similar to Java's Comparator interface: rarely does an API accept both a Comparator and a SortOrder. We should remove the SortOrder. -- This message was sent by Atlassian JIRA (v6.2#6252)
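The reversibility point in concrete terms: a Scala Ordering already encodes sort direction, because reversing it is a single method call, so carrying a separate SortOrder alongside it is redundant:
{code:scala}
// An Ordering carries direction by itself; no extra SortOrder needed.
val asc: Ordering[Int] = Ordering.Int
val desc: Ordering[Int] = Ordering.Int.reverse

println(Seq(3, 1, 2).sorted(asc))  // List(1, 2, 3)
println(Seq(3, 1, 2).sorted(desc)) // List(3, 2, 1)
{code}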
[jira] [Created] (SPARK-2727) HashShuffleReader should do in-place sort
Reynold Xin created SPARK-2727: -- Summary: HashShuffleReader should do in-place sort Key: SPARK-2727 URL: https://issues.apache.org/jira/browse/SPARK-2727 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0, 1.0.3 Reporter: Reynold Xin Assignee: Reynold Xin HashShuffleReader uses sortWith to sort an array, which creates a copy of the array. We can use an in-place sort algorithm to reduce the memory overhead. -- This message was sent by Atlassian JIRA (v6.2#6252)
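To make the allocation difference concrete: sortWith on an Array builds and returns a sorted copy, while scala.util.Sorting.quickSort reorders the existing array in place:
{code:scala}
val arr = Array(3, 1, 2)

// sortWith allocates and returns a new sorted array; `arr` is untouched.
val copied = arr.sortWith(_ < _)

// quickSort sorts `arr` in place, with no second array allocated.
scala.util.Sorting.quickSort(arr)

println(copied.mkString(",")) // 1,2,3
println(arr.mkString(","))    // 1,2,3
{code}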