[jira] [Created] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-04 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3399:
-

 Summary: Test for PySpark should ignore HADOOP_CONF_DIR and 
YARN_CONF_DIR
 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta


Some PySpark tests create temporary files under /tmp on the local file system,
but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in
spark-env.sh, the tests expect the temporary files to be on the FileSystem
configured in core-site.xml even though the actual files are on the local file
system.

I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
If some tests need those variables, they should set them explicitly themselves.






[jira] [Commented] (SPARK-3399) Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121016#comment-14121016
 ] 

Apache Spark commented on SPARK-3399:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2270

 Test for PySpark should ignore HADOOP_CONF_DIR and YARN_CONF_DIR
 

 Key: SPARK-3399
 URL: https://issues.apache.org/jira/browse/SPARK-3399
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: Kousuke Saruta

 Some PySpark tests create temporary files under /tmp on the local file system,
 but if the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR is set in
 spark-env.sh, the tests expect the temporary files to be on the FileSystem
 configured in core-site.xml even though the actual files are on the local file
 system.
 I think we should ignore HADOOP_CONF_DIR and YARN_CONF_DIR.
 If some tests need those variables, they should set them explicitly themselves.






[jira] [Created] (SPARK-3400) GraphX unit tests fail nondeterministically

2014-09-04 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-3400:
-

 Summary: GraphX unit tests fail nondeterministically
 Key: SPARK-3400
 URL: https://issues.apache.org/jira/browse/SPARK-3400
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Blocker


GraphX unit tests have been failing since the fix to SPARK-2823 was merged: 
https://github.com/apache/spark/commit/9b225ac3072de522b40b46aba6df1f1c231f13ef.
 Failures have appeared as Snappy parsing errors and shuffle 
FileNotFoundExceptions. A local test showed that these failures occurred in 
about 3/10 test runs.

Reverting the mentioned commit seems to solve the problem. Since this is 
blocking everyone else, I'm submitting a hotfix to do that, and we can diagnose 
the problem in more detail afterwards.






[jira] [Created] (SPARK-3401) Wrong usage of tee command in python/run-tests

2014-09-04 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-3401:
-

 Summary: Wrong usage of tee command in python/run-tests
 Key: SPARK-3401
 URL: https://issues.apache.org/jira/browse/SPARK-3401
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Kousuke Saruta


In python/run-tests, the tee command is used with the -a option to append to 
unit-tests.log for logging, but the usage is wrong.
In the current implementation, the output of the tee command is redirected to 
unit-tests.log, as in tee -a > unit-tests.log. There is no need to redirect 
tee's output.

This issue causes unit-tests.log to be truncated incorrectly.






[jira] [Commented] (SPARK-3401) Wrong usage of tee command in python/run-tests

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121033#comment-14121033
 ] 

Apache Spark commented on SPARK-3401:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2272

 Wrong usage of tee command in python/run-tests
 --

 Key: SPARK-3401
 URL: https://issues.apache.org/jira/browse/SPARK-3401
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Kousuke Saruta

 In python/run-tests, the tee command is used with the -a option to append to 
 unit-tests.log for logging, but the usage is wrong.
 In the current implementation, the output of the tee command is redirected to 
 unit-tests.log, as in tee -a > unit-tests.log. There is no need to redirect 
 tee's output.
 This issue causes unit-tests.log to be truncated incorrectly.






[jira] [Commented] (SPARK-3400) GraphX unit tests fail nondeterministically

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121032#comment-14121032
 ] 

Apache Spark commented on SPARK-3400:
-

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/2271

 GraphX unit tests fail nondeterministically
 ---

 Key: SPARK-3400
 URL: https://issues.apache.org/jira/browse/SPARK-3400
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Blocker
 Fix For: 1.0.1, 1.1.0, 1.3.0


 GraphX unit tests have been failing since the fix to SPARK-2823 was merged: 
 https://github.com/apache/spark/commit/9b225ac3072de522b40b46aba6df1f1c231f13ef.
  Failures have appeared as Snappy parsing errors and shuffle 
 FileNotFoundExceptions. A local test showed that these failures occurred in 
 about 3/10 test runs.
 Reverting the mentioned commit seems to solve the problem. Since this is 
 blocking everyone else, I'm submitting a hotfix to do that, and we can 
 diagnose the problem in more detail afterwards.






[jira] [Resolved] (SPARK-3400) GraphX unit tests fail nondeterministically

2014-09-04 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave resolved SPARK-3400.
---
   Resolution: Fixed
Fix Version/s: 1.1.0
   1.3.0
   1.0.1

Issue resolved by pull request 2271
[https://github.com/apache/spark/pull/2271]

 GraphX unit tests fail nondeterministically
 ---

 Key: SPARK-3400
 URL: https://issues.apache.org/jira/browse/SPARK-3400
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Ankur Dave
Assignee: Ankur Dave
Priority: Blocker
 Fix For: 1.0.1, 1.3.0, 1.1.0


 GraphX unit tests have been failing since the fix to SPARK-2823 was merged: 
 https://github.com/apache/spark/commit/9b225ac3072de522b40b46aba6df1f1c231f13ef.
  Failures have appeared as Snappy parsing errors and shuffle 
 FileNotFoundExceptions. A local test showed that these failures occurred in 
 about 3/10 test runs.
 Reverting the mentioned commit seems to solve the problem. Since this is 
 blocking everyone else, I'm submitting a hotfix to do that, and we can 
 diagnose the problem in more detail afterwards.






[jira] [Reopened] (SPARK-2823) GraphX jobs throw IllegalArgumentException

2014-09-04 Thread Ankur Dave (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankur Dave reopened SPARK-2823:
---

I had to revert this because of SPARK-3400.

 GraphX jobs throw IllegalArgumentException
 --

 Key: SPARK-2823
 URL: https://issues.apache.org/jira/browse/SPARK-2823
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Reporter: Lu Lu
 Fix For: 1.1.1, 1.2.0, 1.0.3


 If the users set “spark.default.parallelism” and the value is different from 
 the EdgeRDD partition number, GraphX jobs will throw IllegalArgumentException:
 14/07/26 21:06:51 WARN DAGScheduler: Creating new stage failed due to 
 exception - job: 1
 java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of 
 partitions
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:60)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:54)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getShuffleMapStage(DAGScheduler.scala:1
 97)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:272)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:274)
 at 
 org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$visit$1$1.apply(DAGScheduler.s
 cala:269)
 at scala.collection.immutable.List.foreach(List.scala:318)
 at 
 org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$visit$1(DAGScheduler.scala:269)
 at 
 org.apache.spark.scheduler.DAGScheduler.getParentStages(DAGScheduler.scala:279)
 at 
 org.apache.spark.scheduler.DAGScheduler.newStage(DAGScheduler.scala:219)
 at 
 org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:672)
 at 
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1184)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at 
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at 
 scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at 
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at 
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at 
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)




[jira] [Commented] (SPARK-3353) Stage id monotonicity (parent stage should have lower stage id)

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121075#comment-14121075
 ] 

Apache Spark commented on SPARK-3353:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2273

 Stage id monotonicity (parent stage should have lower stage id)
 ---

 Key: SPARK-3353
 URL: https://issues.apache.org/jira/browse/SPARK-3353
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Reynold Xin

 The way stage IDs are generated, parent stages actually get higher stage IDs. 
 This is very confusing because parent stages get scheduled and executed first.
 We should reverse that order so the scheduling timeline of stages (absent 
 failures) is monotonic, i.e. stages that are executed first have lower stage 
 IDs.






[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API

2014-09-04 Thread Chengxiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121085#comment-14121085
 ] 

Chengxiang Li commented on SPARK-2321:
--

I've collected some Hive-side requirements here, which should be helpful for 
the design of the Spark job status and statistics API.

Hive should be able to get the following job status information through the 
Spark job status API:
1. job identifier
2. current job execution state, which should include RUNNING/SUCCEEDED/FAILED/KILLED.
3. running/failed/killed/total task numbers at the job level.
4. stage identifier
5. stage state, which should include RUNNING/SUCCEEDED/FAILED/KILLED
6. running/failed/killed/total task numbers at the stage level.

MR/Tez use Counters to collect statistics. Similar to MR/Tez Counters, it 
would be better if the Spark job statistics API organized statistics with:
1. grouping of the same kind of statistics by groupName.
2. a displayName for both groups and individual statistics, to give a uniform 
display string for front ends (Web UI/Hive CLI/...).
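
To make the shape of such an API concrete, here is a minimal Scala sketch of what these requirements could map to. Every name in it (ExecutionState, TaskCounts, SparkJobStatus, StatisticGroup) is an illustrative assumption, not an existing Spark API:

{code}
// Illustrative sketch only -- names and signatures are assumptions, not Spark API.
object ExecutionState extends Enumeration {
  val RUNNING, SUCCEEDED, FAILED, KILLED = Value
}

case class TaskCounts(running: Int, failed: Int, killed: Int, total: Int)

trait SparkStageStatus {
  def stageId: Int                      // 4. stage identifier
  def state: ExecutionState.Value       // 5. stage state
  def taskCounts: TaskCounts            // 6. task numbers at the stage level
}

trait SparkJobStatus {
  def jobId: Int                        // 1. job identifier
  def state: ExecutionState.Value       // 2. job execution state
  def taskCounts: TaskCounts            // 3. task numbers at the job level
  def stages: Seq[SparkStageStatus]
}

// Counter-style statistics grouped by groupName, with display names for front ends.
case class StatisticGroup(groupName: String, displayName: String,
                          counters: Map[String, Long])
{code}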


 Design a proper progress reporting & event listener API
 ---

 Key: SPARK-2321
 URL: https://issues.apache.org/jira/browse/SPARK-2321
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 This is a ticket to track progress on redesigning the SparkListener and 
 JobProgressListener API.
 There are multiple problems with the current design, including:
 0. I'm not sure if the API is usable in Java (there are at least some enums 
 we used in Scala and a bunch of case classes that might complicate things).
 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
 attention to it yet. Something as important as progress reporting deserves a 
 more stable API.
 2. There is no easy way to connect jobs with stages. Similarly, there is no 
 easy way to connect job groups with jobs / stages.
 3. JobProgressListener itself has no encapsulation at all. States can be 
 arbitrarily mutated by external programs. Variable names are sort of randomly 
 decided and inconsistent. 
 We should just revisit these and propose a new, concrete design. 






[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API

2014-09-04 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121094#comment-14121094
 ] 

Reynold Xin commented on SPARK-2321:


What about pull vs push? I.e. should this be a listener-like API, or some 
service with state that the caller can poll?

 Design a proper progress reporting & event listener API
 ---

 Key: SPARK-2321
 URL: https://issues.apache.org/jira/browse/SPARK-2321
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 This is a ticket to track progress on redesigning the SparkListener and 
 JobProgressListener API.
 There are multiple problems with the current design, including:
 0. I'm not sure if the API is usable in Java (there are at least some enums 
 we used in Scala and a bunch of case classes that might complicate things).
 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
 attention to it yet. Something as important as progress reporting deserves a 
 more stable API.
 2. There is no easy way to connect jobs with stages. Similarly, there is no 
 easy way to connect job groups with jobs / stages.
 3. JobProgressListener itself has no encapsulation at all. States can be 
 arbitrarily mutated by external programs. Variable names are sort of randomly 
 decided and inconsistent. 
 We should just revisit these and propose a new, concrete design. 






[jira] [Commented] (SPARK-2978) Provide an MR-style shuffle transformation

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121115#comment-14121115
 ] 

Apache Spark commented on SPARK-2978:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/2274

 Provide an MR-style shuffle transformation
 --

 Key: SPARK-2978
 URL: https://issues.apache.org/jira/browse/SPARK-2978
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Sandy Ryza

 For Hive on Spark joins in particular, and for running legacy MR code in 
 general, I think it would be useful to provide a transformation with the 
 semantics of the Hadoop MR shuffle, i.e. one that
 * groups by key: provides (Key, Iterator[Value])
 * within each partition, provides keys in sorted order
 A couple ways that could make sense to expose this:
 * Add a new operator.  groupAndSortByKey, 
 groupByKeyAndSortWithinPartition, hadoopStyleShuffle, maybe?
 * Allow groupByKey to take an ordering param for keys within a partition
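 One way to sketch those semantics on top of the existing RDD API (a hedged illustration only; the operator name and the in-memory per-partition grouping are assumptions, not the proposed implementation):
{code}
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Sketch of the proposed semantics: hash-partition by key, then within each
// partition emit (key, values) groups with keys in sorted order. Unlike a real
// MR shuffle this materializes each partition in memory, so it is only a model
// of the desired behavior, not an efficient implementation.
def groupByKeyAndSortWithinPartition[K: Ordering : ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], numPartitions: Int): RDD[(K, Iterable[V])] = {
  rdd
    .partitionBy(new HashPartitioner(numPartitions))
    .mapPartitions({ iter =>
      iter.toSeq
        .groupBy(_._1)            // group values per key
        .toSeq.sortBy(_._1)       // keys in sorted order within the partition
        .map { case (k, kvs) => (k, kvs.map(_._2): Iterable[V]) }
        .iterator
    }, preservesPartitioning = true)
}
{code}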






[jira] [Updated] (SPARK-3377) Don't mix metrics from different applications

2014-09-04 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3377:
--
Summary: Don't mix metrics from different applications  (was: codahale base 
Metrics data between applications can jumble up together)

 Don't mix metrics from different applications
 -

 Key: SPARK-3377
 URL: https://issues.apache.org/jira/browse/SPARK-3377
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Critical

 I'm using the Codahale-based MetricsSystem of Spark with JMX or Graphite, and 
 I saw the following 2 problems.
 (1) When applications that have the same spark.app.name run on a cluster at 
 the same time, some metric names jumble together, e.g. 
 SparkPi.DAGScheduler.stage.failedStages.
 (2) When 2+ executors run on the same machine, the JVM metrics of the 
 executors jumble, e.g. the current implementation cannot distinguish which 
 executor a metric such as jvm.memory belongs to.
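 A minimal sketch of one possible direction, assuming nothing beyond plain string prefixes (the helper name below is illustrative, not the actual MetricsSystem code): qualify every metric name with the application ID and executor ID so concurrent applications, and multiple executors on one host, no longer collide.
{code}
// Illustrative only: prefix metric names with application and executor IDs.
def qualifiedMetricName(appId: String, executorId: String, name: String): String =
  s"$appId.$executorId.$name"

// qualifiedMetricName("app-20140904-0001", "2", "jvm.memory")
//   => "app-20140904-0001.2.jvm.memory"
{code}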






[jira] [Resolved] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)

2014-09-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-640.
-
Resolution: Fixed

This looks stale, right? The Hadoop 1 version has been at 1.2.1 for some time.

 Update Hadoop 1 version to 1.1.0 (especially on AMIs)
 -

 Key: SPARK-640
 URL: https://issues.apache.org/jira/browse/SPARK-640
 Project: Spark
  Issue Type: New Feature
Reporter: Matei Zaharia

 Hadoop 1.1.0 has a fix for the notorious trailing-slash issue for directory 
 objects in S3 (https://issues.apache.org/jira/browse/HADOOP-5836), so it would 
 be good to support it on the AMIs.






[jira] [Resolved] (SPARK-529) Have a single file that controls the environmental variables and spark config options

2014-09-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-529.
-
Resolution: Won't Fix

This looks obsolete and/or fixed, as variables like SPARK_MEM are deprecated, 
and I suppose there is spark-env.sh too.

 Have a single file that controls the environmental variables and spark config 
 options
 -

 Key: SPARK-529
 URL: https://issues.apache.org/jira/browse/SPARK-529
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin

 E.g. multiple places in the code base use SPARK_MEM, each with its own default 
 set to 512. We need a central place to enforce default values as well as to 
 document the variables.






[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121160#comment-14121160
 ] 

Saisai Shao commented on SPARK-3129:


Hi [~hshreedharan], one more question:

Is your design goal also to fix the data loss caused by receiver node failure? 
It seems data could potentially be lost when it is only stored in the 
BlockGenerator and not yet in the BlockManager when the node fails. Your design 
doc mainly focuses on driver failure, so what are your thoughts?

 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down and 
 the sending system cannot re-send the data (or the data has already expired on 
 the sender side). The attached document has more details. 






[jira] [Commented] (SPARK-2321) Design a proper progress reporting event listener API

2014-09-04 Thread Chengxiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121173#comment-14121173
 ] 

Chengxiang Li commented on SPARK-2321:
--

I'm not sure whether I understand you right; here is my thought about the API 
design:
# The JobStatus/JobStatistic API contains only getter methods.
# JobProgressListener holds the JobStatusImpl/JobStatisticImpl instances.
# DAGScheduler posts events to JobProgressListener through the listener bus.
# The caller gets the JobStatusImpl/JobStatisticImpl, with updated state, from 
JobProgressListener.

So I think it should be a pull-style API.
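
A minimal Scala sketch of that pull-style flow, just to make it concrete; the class and method names below are illustrative assumptions, not an existing Spark API:

{code}
import scala.collection.concurrent.TrieMap

// Immutable snapshot handed out to callers (Hive, Web UI, ...).
case class JobStatusSnapshot(jobId: Int, state: String,
                             runningTasks: Int, failedTasks: Int, totalTasks: Int)

class PollableJobProgressListener {
  private val jobs = TrieMap.empty[Int, JobStatusSnapshot]

  // Push side: invoked from the listener bus as DAGScheduler events arrive.
  def onJobStart(jobId: Int, totalTasks: Int): Unit = {
    jobs(jobId) = JobStatusSnapshot(jobId, "RUNNING", 0, 0, totalTasks)
  }

  def onTaskEnd(jobId: Int, failed: Boolean): Unit = {
    jobs.get(jobId).foreach { s =>
      if (failed) jobs(jobId) = s.copy(failedTasks = s.failedTasks + 1)
    }
  }

  // Pull side: callers poll for the current state whenever they need it.
  def jobStatus(jobId: Int): Option[JobStatusSnapshot] = jobs.get(jobId)
}
{code}

The callers only ever see immutable snapshots, which would also address the encapsulation concern in the description.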

 Design a proper progress reporting & event listener API
 ---

 Key: SPARK-2321
 URL: https://issues.apache.org/jira/browse/SPARK-2321
 Project: Spark
  Issue Type: Improvement
  Components: Java API, Spark Core
Affects Versions: 1.0.0
Reporter: Reynold Xin
Assignee: Reynold Xin
Priority: Critical

 This is a ticket to track progress on redesigning the SparkListener and 
 JobProgressListener API.
 There are multiple problems with the current design, including:
 0. I'm not sure if the API is usable in Java (there are at least some enums 
 we used in Scala and a bunch of case classes that might complicate things).
 1. The whole API is marked as DeveloperApi, because we haven't paid a lot of 
 attention to it yet. Something as important as progress reporting deserves a 
 more stable API.
 2. There is no easy way to connect jobs with stages. Similarly, there is no 
 easy way to connect job groups with jobs / stages.
 3. JobProgressListener itself has no encapsulation at all. States can be 
 arbitrarily mutated by external programs. Variable names are sort of randomly 
 decided and inconsistent. 
 We should just revisit these and propose a new, concrete design. 






[jira] [Created] (SPARK-3402) Library for Natural Language Processing over Spark.

2014-09-04 Thread Nagamallikarjuna (JIRA)
Nagamallikarjuna created SPARK-3402:
---

 Summary: Library for Natural Language Processing over Spark.
 Key: SPARK-3402
 URL: https://issues.apache.org/jira/browse/SPARK-3402
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Nagamallikarjuna
Priority: Minor









[jira] [Commented] (SPARK-3402) Library for Natural Language Processing over Spark.

2014-09-04 Thread Nagamallikarjuna (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121190#comment-14121190
 ] 

Nagamallikarjuna commented on SPARK-3402:
-

We have gone through Spark and its family of projects and didn't find any 
natural language processing library on top of Spark. We (Impetus) are working 
to implement some natural language features over Spark. We have already 
developed a working algorithms library using the OpenNLP toolkit, and will 
extend it to other NLP toolkits like Stanford, cTAKES, NLTK, etc. We are 
planning to contribute our work to the existing MLlib or as a new subproject.


Thanks
Naga

 Library for Natural Language Processing over Spark.
 ---

 Key: SPARK-3402
 URL: https://issues.apache.org/jira/browse/SPARK-3402
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Nagamallikarjuna
Priority: Minor








[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-09-04 Thread David (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121276#comment-14121276
 ] 

David commented on SPARK-1473:
--

Hi all,

I am Dr. David Martinez and this is my first comment on this project. We 
implemented all the feature selection methods included in
•Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
 likelihood maximisation: a unifying framework for information theoretic
 feature selection. The Journal of Machine Learning Research, 13, 27-66

added more optimizations, and left the framework open to include more 
criteria. We opened a pull request in the past but did not finish it. You can 
have a look at our GitHub:
https://github.com/LIDIAgroup/SparkFeatureSelection
We would like to finish our pull request.

 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Assignee: Alexander Ulanov
Priority: Minor
  Labels: features

 For classification tasks involving large feature spaces in the order of tens 
 of thousands or higher (e.g., text classification with n-grams, where n > 1), 
 it is often useful to rank and filter out features that are irrelevant, thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A feature evaluation interface which is flexible needs to be designed and at 
 least two methods should be implemented with Information Gain being a 
 priority as it has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below) which are more practical for lower 
 dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
 likelihood maximisation: a unifying framework for information theoretic
 feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, George. An extensive empirical study of feature selection metrics 
 for text classification. The Journal of machine learning research 3 (2003): 
 1289-1305.
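 A hedged Scala sketch of the Information Gain ranking described above, for discrete features; the function names and the brute-force per-feature pass are illustrative assumptions, not a proposed MLlib interface:
{code}
import org.apache.spark.SparkContext._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Information gain = mutual information between one discrete feature and the label.
def informationGain(data: RDD[LabeledPoint], featureIndex: Int): Double = {
  val n = data.count().toDouble
  // joint distribution of (feature value, label)
  val pxy = data.map(p => ((p.features(featureIndex), p.label), 1L))
    .reduceByKey(_ + _).collectAsMap().mapValues(_ / n)
  val px = pxy.groupBy(_._1._1).mapValues(_.values.sum)   // marginal over feature values
  val py = pxy.groupBy(_._1._2).mapValues(_.values.sum)   // marginal over labels
  pxy.map { case ((x, y), p) => p * math.log(p / (px(x) * py(y))) / math.log(2) }.sum
}

// Rank features by information gain and keep the top k.
// Note: one pass over the data per feature, so this is a sketch, not an efficient filter.
def selectTopFeatures(data: RDD[LabeledPoint], numFeatures: Int, k: Int): Seq[Int] =
  (0 until numFeatures).sortBy(i => -informationGain(data, i)).take(k)
{code}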






[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped

2014-09-04 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121324#comment-14121324
 ] 

Helena Edelson commented on SPARK-2892:
---

I see the same with 1.0.2 streaming:

ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver
 WARN 08:26:21,211 Stopped executor without error
 WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> 
ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,))


 Socket Receiver does not stop when streaming context is stopped
 ---

 Key: SPARK-2892
 URL: https://issues.apache.org/jira/browse/SPARK-2892
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 Running NetworkWordCount with
 {quote}  
 ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); 
 Thread.sleep(6)
 {quote}
 gives the following error
 {quote}
 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 10047 ms on localhost (1/1)
 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at 
 ReceiverTracker.scala:275) finished in 10.056 s
 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool
 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at 
 ReceiverTracker.scala:275, took 10.179263 s
 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been 
 terminated
 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not 
 deregistered, Map(0 -> 
 ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,))
 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped
 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately
 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after 
 time 1407375433000
 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator
 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler
 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully
 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving
 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost:
 {quote}






[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped

2014-09-04 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121324#comment-14121324
 ] 

Helena Edelson edited comment on SPARK-2892 at 9/4/14 1:12 PM:
---

I see the same with 1.0.2 streaming, with or without stopGracefully = true

ssc.stop(stopSparkContext = false, stopGracefully = true)

ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver
 WARN 08:26:21,211 Stopped executor without error
 WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> 
ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,))



was (Author: helena_e):
I see the same with 1.0.2 streaming:

ERROR 08:26:21,139 Deregistered receiver for stream 0: Stopped by driver
 WARN 08:26:21,211 Stopped executor without error
 WARN 08:26:21,213 All of the receivers have not deregistered, Map(0 -> 
ReceiverInfo(0,ActorReceiver-0,null,false,host,Stopped by driver,))


 Socket Receiver does not stop when streaming context is stopped
 ---

 Key: SPARK-2892
 URL: https://issues.apache.org/jira/browse/SPARK-2892
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 Running NetworkWordCount with
 {quote}  
 ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); 
 Thread.sleep(6)
 {quote}
 gives the following error
 {quote}
 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 10047 ms on localhost (1/1)
 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at 
 ReceiverTracker.scala:275) finished in 10.056 s
 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool
 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at 
 ReceiverTracker.scala:275, took 10.179263 s
 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been 
 terminated
 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not 
 deregistered, Map(0 -> 
 ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,))
 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped
 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately
 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after 
 time 1407375433000
 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator
 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler
 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully
 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving
 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost:
 {quote}






[jira] [Assigned] (SPARK-3375) spark on yarn container allocation issues

2014-09-04 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-3375:


Assignee: Thomas Graves

 spark on yarn container allocation issues
 -

 Key: SPARK-3375
 URL: https://issues.apache.org/jira/browse/SPARK-3375
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Blocker

 It looks like if YARN doesn't grant the containers immediately, Spark stops 
 asking for them, and the YARN application hangs, never getting any executors.  
 This was introduced by https://github.com/apache/spark/pull/2169






[jira] [Commented] (SPARK-3375) spark on yarn container allocation issues

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121451#comment-14121451
 ] 

Apache Spark commented on SPARK-3375:
-

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/2275

 spark on yarn container allocation issues
 -

 Key: SPARK-3375
 URL: https://issues.apache.org/jira/browse/SPARK-3375
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.2.0
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Blocker

 It looks like if YARN doesn't grant the containers immediately, Spark stops 
 asking for them, and the YARN application hangs, never getting any executors.  
 This was introduced by https://github.com/apache/spark/pull/2169






[jira] [Created] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)
Alexander Ulanov created SPARK-3403:
---

 Summary: NaiveBayes crashes with blas/lapack native libraries for 
breeze (netlib-java)
 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.1.0


Code:
val model = NaiveBayes.train(train)
val predictionAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}
predictionAndLabels.foreach(println)

Result: 
program crashes with: Process finished with exit code -1073741819 
(0xC0000005) after displaying the first prediction






[jira] [Updated] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Ulanov updated SPARK-3403:

Attachment: NativeNN.scala

The file contains example that produces the same issue

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
 MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121516#comment-14121516
 ] 

Xiangrui Meng commented on SPARK-3403:
--

Did you test the setup of netlib-java with OpenBLAS? I hit a JNI issue (a year 
ago, maybe fixed) with netlib-java and multithreading OpenBLAS. Could you try 
compiling OpenBLAS with `USE_THREAD=0`? If it still doesn't work, please attach 
the driver/executor logs. Thanks!

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
 MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Created] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1

2014-09-04 Thread Sean Owen (JIRA)
Sean Owen created SPARK-3404:


 Summary: SparkSubmitSuite fails in Maven (only) - spark-submit 
exits with code 1
 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen


Maven-based Jenkins builds have been failing for over a month. For example:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/

It's SparkSubmitSuite that fails. For example:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull

{code}
SparkSubmitSuite
...
- launch simple application with spark-submit *** FAILED ***
  org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
  at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
  at 
org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
  at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
  at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
  at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- spark submit includes jars passed in through --jar *** FAILED ***
  org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
local-cluster[2,1,512], --jars, 
file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
 file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
  at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
  at 
org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
  at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
  at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
  at 
org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
  at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
{code}

SBT builds don't fail, so it is likely due to some difference in how the 
tests are run rather than a problem with the test or the core project.

This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
cause identified in that JIRA is, at least, not the only cause. (Although it 
wouldn't hurt to be doubly sure this is not an issue by changing the Jenkins 
config to invoke {{mvn clean && mvn ... package}} instead of 
{{mvn ... clean package}}.)

This JIRA tracks investigation into a different cause. Right now I have some 
further information but not a PR yet.

Part of the issue is that there is no clue in the log about why 
{{spark-submit}} exited with status 1. See 
https://github.com/apache/spark/pull/2108/files and 
https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
least print stdout to the log too.

The SparkSubmit program exits with 1 when the main class it is supposed to run 
is not found 
(https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322).
One example is SimpleApplicationTest 
(https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339).

The test actually submits an empty JAR not containing this class. It relies on 
{{spark-submit}} finding the class within the compiled test-classes of the 
Spark project. However it does seem to be compiled and present even with Maven.

If modified to print stdout and stderr, and dump the actual command, I see an 
empty stdout, and only the command to stderr:

{code}
Spark Command: 
/Library/Java/JavaVirtualMachines/jdk1.8.0_20.jdk/Contents/Home/bin/java -cp 

[jira] [Commented] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped

2014-09-04 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121561#comment-14121561
 ] 

Helena Edelson commented on SPARK-2892:
---

I wonder if the ERROR should be a WARN or INFO since it occurs as a result of 
ReceiverSupervisorImpl receiving a StopReceiver, and "Deregistered receiver 
for stream" seems like the expected behavior.

 Socket Receiver does not stop when streaming context is stopped
 ---

 Key: SPARK-2892
 URL: https://issues.apache.org/jira/browse/SPARK-2892
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 Running NetworkWordCount with
 {quote}  
 ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); 
 Thread.sleep(6)
 {quote}
 gives the following error
 {quote}
 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 10047 ms on localhost (1/1)
 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at 
 ReceiverTracker.scala:275) finished in 10.056 s
 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool
 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at 
 ReceiverTracker.scala:275, took 10.179263 s
 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been 
 terminated
 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not 
 deregistered, Map(0 -> 
 ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,))
 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped
 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately
 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after 
 time 1407375433000
 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator
 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler
 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully
 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving
 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost:
 {quote}






[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Alexander Ulanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121563#comment-14121563
 ] 

Alexander Ulanov commented on SPARK-3403:
-

Yes, I tried using netlib-java separately with the same OpenBLAS setup and it 
worked properly, even with several threads. However, I didn't mimic the same 
multi-threading setup that MLlib has because it is complicated. Do you want me 
to send you all the DLLs that I used? I had trouble compiling OpenBLAS for 
Windows, so I used precompiled x64 versions from the OpenBLAS and MinGW64 
websites.


 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBlas x64 and 
 MinGW64 precompiled dlls.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction






[jira] [Comment Edited] (SPARK-2892) Socket Receiver does not stop when streaming context is stopped

2014-09-04 Thread Helena Edelson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121561#comment-14121561
 ] 

Helena Edelson edited comment on SPARK-2892 at 9/4/14 5:01 PM:
---

I wonder if the ERROR should be a WARN or INFO since it occurs as a result of 
ReceiverSupervisorImpl receiving a StopReceiver, and "Deregistered receiver 
for stream" seems like the expected behavior.


DEBUG 13:00:22,418 Stopping JobScheduler
 INFO 13:00:22,441 Received stop signal
 INFO 13:00:22,441 Sent stop signal to all 1 receivers
 INFO 13:00:22,442 Stopping receiver with message: Stopped by driver: 
 INFO 13:00:22,442 Called receiver onStop
 INFO 13:00:22,443 Deregistering receiver 0
ERROR 13:00:22,445 Deregistered receiver for stream 0: Stopped by driver
 INFO 13:00:22,445 Stopped receiver 0


was (Author: helena_e):
I wonder if the ERROR should be a WARN or INFO since it occurs as a result of 
ReceiverSupervisorImpl receiving a StopReceiver, and "Deregistered receiver 
for stream" seems like the expected behavior.

 Socket Receiver does not stop when streaming context is stopped
 ---

 Key: SPARK-2892
 URL: https://issues.apache.org/jira/browse/SPARK-2892
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.2
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical

 Running NetworkWordCount with
 {quote}  
 ssc.start(); Thread.sleep(1); ssc.stop(stopSparkContext = false); 
 Thread.sleep(6)
 {quote}
 gives the following error
 {quote}
 14/08/06 18:37:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) 
 in 10047 ms on localhost (1/1)
 14/08/06 18:37:13 INFO DAGScheduler: Stage 0 (runJob at 
 ReceiverTracker.scala:275) finished in 10.056 s
 14/08/06 18:37:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks 
 have all completed, from pool
 14/08/06 18:37:13 INFO SparkContext: Job finished: runJob at 
 ReceiverTracker.scala:275, took 10.179263 s
 14/08/06 18:37:13 INFO ReceiverTracker: All of the receivers have been 
 terminated
 14/08/06 18:37:13 WARN ReceiverTracker: All of the receivers have not 
 deregistered, Map(0 -> 
 ReceiverInfo(0,SocketReceiver-0,null,false,localhost,Stopped by driver,))
 14/08/06 18:37:13 INFO ReceiverTracker: ReceiverTracker stopped
 14/08/06 18:37:13 INFO JobGenerator: Stopping JobGenerator immediately
 14/08/06 18:37:13 INFO RecurringTimer: Stopped timer for JobGenerator after 
 time 1407375433000
 14/08/06 18:37:13 INFO JobGenerator: Stopped JobGenerator
 14/08/06 18:37:13 INFO JobScheduler: Stopped JobScheduler
 14/08/06 18:37:13 INFO StreamingContext: StreamingContext stopped successfully
 14/08/06 18:37:43 INFO SocketReceiver: Stopped receiving
 14/08/06 18:37:43 INFO SocketReceiver: Closed socket to localhost:
 {quote}






[jira] [Commented] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121608#comment-14121608
 ] 

Apache Spark commented on SPARK-3286:
-

User 'benoyantony' has created a pull request for this issue:
https://github.com/apache/spark/pull/2276

 Cannot view ApplicationMaster UI when Yarn’s url scheme is https
 

 Key: SPARK-3286
 URL: https://issues.apache.org/jira/browse/SPARK-3286
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch


 The spark Application Master starts its web UI at http://host-name:port.
 When Spark ApplicationMaster registers its URL with Resource Manager , the 
 URL does not contain URI scheme.
 If the URL scheme is absent, Resource Manager’s web app proxy will use the 
 HTTP Policy of the Resource Manager.(YARN-1553)
 If the HTTP Policy of the Resource Manager is https, then web app proxy  will 
 try to access https://host-name:port.
 This will result in error.
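 A minimal sketch of the kind of fix this implies, assuming the AM can read YARN's HTTP policy from its configuration (the helper name below is an illustrative assumption): register a tracking URL that already carries its scheme instead of a bare host:port.
{code}
// Illustrative only: choose the scheme from YARN's HTTP policy before
// registering the AM tracking URL with the ResourceManager.
def amTrackingUrl(yarnHttpPolicy: String, host: String, port: Int): String = {
  val scheme = if (yarnHttpPolicy.equalsIgnoreCase("HTTPS_ONLY")) "https" else "http"
  s"$scheme://$host:$port"
}

// amTrackingUrl("HTTPS_ONLY", "am-host.example.com", 4040)
//   => "https://am-host.example.com:4040"
{code}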






[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1

2014-09-04 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121646#comment-14121646
 ] 

Andrew Or commented on SPARK-3404:
--

Thanks for looking into this Sean. Does this happen all the time or only once 
in a while? We have observed the same tests failing on our Jenkins, which runs 
the test through sbt. The behavior is consistent with running it through maven. 
If we run it through 'sbt test-only SparkSubmitSuite' then it always passes, 
but if we run 'sbt test' then sometimes it fails.

This has also been failing for a while for sbt. Very roughly I remember we 
began seeing it after https://github.com/apache/spark/pull/1777 went in. Though 
I have gone down that path to debug any possibilities of port collision to no 
avail. A related test failure is in DriverSuite, which also calls 
`Utils.executeAndGetOutput`. Have you seen that failing in maven?

I will keep investigating it in parallel for sbt, though I suspect the root 
cause is the same. Let me know if you find anything.

 SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1
 ---

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2
Reporter: Sean Owen

 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely to be due to some difference in how 
 the tests are run rather than a problem with test or core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
 cause identified in that JIRA is, at least, not the only cause. (Although, it 
 wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} instead of {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 

[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1

2014-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3404:
-
Priority: Critical  (was: Major)

 SparkSubmitSuite fails with spark-submit exits with code 1
 

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
Reporter: Sean Owen
Priority: Critical

 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely to be due to some difference in how 
 the tests are run rather than a problem with test or core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
 cause identified in that JIRA is, at least, not the only cause. (Although, it 
 wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} instead of {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
 least print stdout to the log too.
 The SparkSubmit program exits with 1 when the main class it is supposed to 
 run is not found 
 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322)
  This is for example SimpleApplicationTest 
 (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339)
 The test actually submits an empty JAR not containing this class. It relies 
 on {{spark-submit}} finding the class within the compiled test-classes of the 
 Spark project. However it does seem to be compiled and present even with 
 Maven.
 If modified to print stdout and stderr, and dump the actual command, I see an 
 empty stdout, and only the command to stderr:
 {code}
 Spark Command: 
 

[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1

2014-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3404:
-
Summary: SparkSubmitSuite fails with spark-submit exits with code 1  
(was: SparkSubmitSuite fails in Maven (only) - spark-submit exits with code 1)

 SparkSubmitSuite fails with spark-submit exits with code 1
 

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
Reporter: Sean Owen

 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely to be due to some difference in how 
 the tests are run rather than a problem with test or core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
 cause identified in that JIRA is, at least, not the only cause. (Although, it 
 wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} instead of {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
 least print stdout to the log too.
 The SparkSubmit program exits with 1 when the main class it is supposed to 
 run is not found 
 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322)
  This is for example SimpleApplicationTest 
 (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339)
 The test actually submits an empty JAR not containing this class. It relies 
 on {{spark-submit}} finding the class within the compiled test-classes of the 
 Spark project. However it does seem to be compiled and present even with 
 Maven.
 If modified to print stdout and stderr, and dump the actual command, I see an 
 empty stdout, and 

[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1

2014-09-04 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121650#comment-14121650
 ] 

Andrew Or commented on SPARK-3404:
--

I have updated the title to reflect this.

 SparkSubmitSuite fails with spark-submit exits with code 1
 

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
Reporter: Sean Owen
Priority: Critical

 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely to be due to some difference in how 
 the tests are run rather than a problem with test or core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
 cause identified in that JIRA is, at least, not the only cause. (Although, it 
 wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} instead of {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
 least print stdout to the log too.
 The SparkSubmit program exits with 1 when the main class it is supposed to 
 run is not found 
 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322)
  This is for example SimpleApplicationTest 
 (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339)
 The test actually submits an empty JAR not containing this class. It relies 
 on {{spark-submit}} finding the class within the compiled test-classes of the 
 Spark project. However it does seem to be compiled and present even with 
 Maven.
 If modified to print stdout and stderr, and dump the actual command, I see an 
 empty stdout, and only the command to 

[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1

2014-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3404:
-
Target Version/s: 1.1.1

 SparkSubmitSuite fails with spark-submit exits with code 1
 

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
Reporter: Sean Owen
Priority: Critical

 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely to be due to some difference in how 
 the tests are run rather than a problem with test or core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
 cause identified in that JIRA is, at least, not the only cause. (Although, it 
 wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} instead of {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
 least print stdout to the log too.
 The SparkSubmit program exits with 1 when the main class it is supposed to 
 run is not found 
 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322)
  This is for example SimpleApplicationTest 
 (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339)
 The test actually submits an empty JAR not containing this class. It relies 
 on {{spark-submit}} finding the class within the compiled test-classes of the 
 Spark project. However it does seem to be compiled and present even with 
 Maven.
 If modified to print stdout and stderr, and dump the actual command, I see an 
 empty stdout, and only the command to stderr:
 {code}
 Spark Command: 
 

[jira] [Updated] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1

2014-09-04 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3404:
-
Affects Version/s: 1.1.0

 SparkSubmitSuite fails with spark-submit exits with code 1
 

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
Reporter: Sean Owen
Priority: Critical

 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely to be due to some difference in how 
 the tests are run rather than a problem with test or core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
 cause identified in that JIRA is, at least, not the only cause. (Although, it 
 wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} instead of {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
 least print stdout to the log too.
 The SparkSubmit program exits with 1 when the main class it is supposed to 
 run is not found 
 (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L322)
  This is for example SimpleApplicationTest 
 (https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala#L339)
 The test actually submits an empty JAR not containing this class. It relies 
 on {{spark-submit}} finding the class within the compiled test-classes of the 
 Spark project. However it does seem to be compiled and present even with 
 Maven.
 If modified to print stdout and stderr, and dump the actual command, I see an 
 empty stdout, and only the command to stderr:
 {code}
 Spark Command: 
 

[jira] [Commented] (SPARK-3404) SparkSubmitSuite fails with spark-submit exits with code 1

2014-09-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121653#comment-14121653
 ] 

Sean Owen commented on SPARK-3404:
--

It's 100% repeatable in Maven for me locally, which seems to be Jenkins' 
experience too. I don't see the same problem with SBT (/dev/run-tests) locally, 
although I can't say I run that regularly.

I could rewrite the SparkSubmitSuite to submit a JAR file that actually 
contains the class it's trying to invoke. Maybe that's smarter? The problem 
here seems to be the vagaries of what the run-time classpath is during an SBT 
vs Maven test. Would anyone second that?

Separately it would probably not hurt to get in that change that logs stdout / 
stderr from the Utils method.
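
For illustration, a minimal, hypothetical sketch of the rewrite suggested above: package an 
already-compiled test class into the jar that gets submitted, so the test no longer depends on 
spark-submit finding the class on the runtime classpath. The class-file path and helper name are 
assumptions for illustration, not the actual suite code.
{code}
import java.io.{File, FileInputStream, FileOutputStream}
import java.util.jar.{JarEntry, JarOutputStream}

// Copy one compiled .class file into a fresh jar. 'classFile' is assumed to
// point at something like target/test-classes/.../SimpleApplicationTest.class.
def createJarWithClass(classFile: File, className: String, outJar: File): File = {
  val jar = new JarOutputStream(new FileOutputStream(outJar))
  try {
    // Entry names inside a jar use '/' separators and end in .class
    jar.putNextEntry(new JarEntry(className.replace('.', '/') + ".class"))
    val in = new FileInputStream(classFile)
    try {
      val buf = new Array[Byte](4096)
      Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => jar.write(buf, 0, n))
    } finally in.close()
    jar.closeEntry()
  } finally jar.close()
  outJar
}
{code}
The suite could then pass the resulting jar to spark-submit with --class set to the same class 
name, which would sidestep the SBT-vs-Maven classpath differences entirely.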

 SparkSubmitSuite fails with spark-submit exits with code 1
 

 Key: SPARK-3404
 URL: https://issues.apache.org/jira/browse/SPARK-3404
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
Reporter: Sean Owen
Priority: Critical

 Maven-based Jenkins builds have been failing for over a month. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/
 It's SparkSubmitSuite that fails. For example:
 https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/541/hadoop.version=2.0.0-mr1-cdh4.1.2,label=centos/consoleFull
 {code}
 SparkSubmitSuite
 ...
 - launch simple application with spark-submit *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.SimpleApplicationTest, --name, testApp, --master, 
 local, file:/tmp/1409815981504-0/testJar-1409815981505.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply$mcV$sp(SparkSubmitSuite.scala:291)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$14.apply(SparkSubmitSuite.scala:284)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 - spark submit includes jars passed in through --jar *** FAILED ***
   org.apache.spark.SparkException: Process List(./bin/spark-submit, --class, 
 org.apache.spark.deploy.JarCreationTest, --name, testApp, --master, 
 local-cluster[2,1,512], --jars, 
 file:/tmp/1409815984960-0/testJar-1409815985029.jar,file:/tmp/1409815985030-0/testJar-1409815985087.jar,
  file:/tmp/1409815984959-0/testJar-1409815984959.jar) exited with code 1
   at org.apache.spark.util.Utils$.executeAndGetOutput(Utils.scala:837)
   at 
 org.apache.spark.deploy.SparkSubmitSuite.runSparkSubmit(SparkSubmitSuite.scala:311)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply$mcV$sp(SparkSubmitSuite.scala:305)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at 
 org.apache.spark.deploy.SparkSubmitSuite$$anonfun$15.apply(SparkSubmitSuite.scala:294)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.Transformer$$anonfun$apply$1.apply(Transformer.scala:22)
   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
   at org.scalatest.Transformer.apply(Transformer.scala:22)
   ...
 {code}
 SBT builds don't fail, so it is likely to be due to some difference in how 
 the tests are run rather than a problem with test or core project.
 This is related to http://issues.apache.org/jira/browse/SPARK-3330 but the 
 cause identified in that JIRA is, at least, not the only cause. (Although, it 
 wouldn't hurt to be doubly-sure this is not an issue by changing the Jenkins 
 config to invoke {{mvn clean && mvn ... package}} instead of {{mvn ... clean package}}.)
 This JIRA tracks investigation into a different cause. Right now I have some 
 further information but not a PR yet.
 Part of the issue is that there is no clue in the log about why 
 {{spark-submit}} exited with status 1. See 
 https://github.com/apache/spark/pull/2108/files and 
 https://issues.apache.org/jira/browse/SPARK-3193 for a change that would at 
 least print stdout to the log too.
 The SparkSubmit program exits with 1 when the main class it is supposed to 
 run is not found 
 

[jira] [Resolved] (SPARK-1078) Replace lift-json with json4s-jackson

2014-09-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1078.
---
   Resolution: Fixed
Fix Version/s: 1.0.0

It looks like this was fixed in SPARK-1132 / Spark 1.0.0, where we migrated to 
json4s.jackson. 

 Replace lift-json with json4s-jackson
 -

 Key: SPARK-1078
 URL: https://issues.apache.org/jira/browse/SPARK-1078
 Project: Spark
  Issue Type: Task
  Components: Deploy, Web UI
Affects Versions: 0.9.0
Reporter: William Benton
Priority: Minor
 Fix For: 1.0.0


 json4s-jackson is a Jackson-backed implementation of the Json4s common JSON 
 API for Scala JSON libraries.  (Evan Chan has a nice comparison of Scala JSON 
 libraries here:  
 http://engineering.ooyala.com/blog/comparing-scala-json-libraries)  It is 
 Apache-licensed, mostly API-compatible with lift-json, and easier for 
 downstream operating system distributions to consume than lift-json.
 In terms of performance, json4s-jackson is slightly slower but comparable to 
 lift-json on my machine when parsing very small JSON files (< 2kb and < ~30 
 objects), around 40% faster than lift-json on medium-sized files (~50kb), and 
 significantly (~10x) faster on multi-megabyte files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-09-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3286:
--
Component/s: Web UI

 Cannot view ApplicationMaster UI when Yarn’s url scheme is https
 

 Key: SPARK-3286
 URL: https://issues.apache.org/jira/browse/SPARK-3286
 Project: Spark
  Issue Type: Bug
  Components: Web UI, YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony
 Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch


 The spark Application Master starts its web UI at http://host-name:port.
 When Spark ApplicationMaster registers its URL with Resource Manager , the 
 URL does not contain URI scheme.
 If the URL scheme is absent, Resource Manager’s web app proxy will use the 
 HTTP Policy of the Resource Manager (YARN-1553).
 If the HTTP Policy of the Resource Manager is https, then the web app proxy will 
 try to access https://host-name:port.
 This will result in an error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3284) saveAsParquetFile not working on windows

2014-09-04 Thread Pravesh Jain (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pravesh Jain updated SPARK-3284:

Description: 
{code}
object parquet {

  case class Person(name: String, age: Int)

  def main(args: Array[String]) {

val sparkConf = new 
SparkConf().setMaster("local").setAppName("HdfsWordCount")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

val people = 
sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p
 => Person(p(0), p(1).trim.toInt))

people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")

val parquetFile = 
sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}
{code}

gives the error



Exception in thread "main" java.lang.NullPointerException at 
org.apache.spark.parquet$.main(parquet.scala:16)

which is the line calling saveAsParquetFile.

This works fine on Linux, but running it from Eclipse on Windows gives the error.

  was:
object parquet {

  case class Person(name: String, age: Int)

  def main(args: Array[String]) {

val sparkConf = new 
SparkConf().setMaster("local").setAppName("HdfsWordCount")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
import sqlContext.createSchemaRDD

val people = 
sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p
 => Person(p(0), p(1).trim.toInt))

people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")

val parquetFile = 
sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
  }
}

gives the error



Exception in thread "main" java.lang.NullPointerException at 
org.apache.spark.parquet$.main(parquet.scala:16)

which is the line calling saveAsParquetFile.

This works fine on Linux, but running it from Eclipse on Windows gives the error.


 saveAsParquetFile not working on windows
 

 Key: SPARK-3284
 URL: https://issues.apache.org/jira/browse/SPARK-3284
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.2
 Environment: Windows
Reporter: Pravesh Jain
Priority: Minor

 {code}
 object parquet {
   case class Person(name: String, age: Int)
   def main(args: Array[String]) {
 val sparkConf = new 
 SparkConf().setMaster("local").setAppName("HdfsWordCount")
 val sc = new SparkContext(sparkConf)
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 // createSchemaRDD is used to implicitly convert an RDD to a SchemaRDD.
 import sqlContext.createSchemaRDD
 val people = 
 sc.textFile("C:/Users/pravesh.jain/Desktop/people/people.txt").map(_.split(",")).map(p
  => Person(p(0), p(1).trim.toInt))
 
 people.saveAsParquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
 val parquetFile = 
 sqlContext.parquetFile("C:/Users/pravesh.jain/Desktop/people/people.parquet")
   }
 }
 {code}
 gives the error
 Exception in thread "main" java.lang.NullPointerException at 
 org.apache.spark.parquet$.main(parquet.scala:16)
 which is the line calling saveAsParquetFile.
 This works fine on Linux, but running it from Eclipse on Windows gives the error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2015) Spark UI issues at scale

2014-09-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2015:
--
Component/s: Web UI

 Spark UI issues at scale
 

 Key: SPARK-2015
 URL: https://issues.apache.org/jira/browse/SPARK-2015
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Reynold Xin

 This is an umbrella ticket for issues related to Spark's web ui when we run 
 Spark at scale (large datasets, large number of machines, or large number of 
 tasks).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3061) Maven build fails in Windows OS

2014-09-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-3061:
-

Assignee: Andrew Or  (was: Josh Rosen)

Re-assigning to Andrew, who's going to backport it.

 Maven build fails in Windows OS
 ---

 Key: SPARK-3061
 URL: https://issues.apache.org/jira/browse/SPARK-3061
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 1.0.2, 1.1.0
 Environment: Windows
Reporter: Masayoshi TSUZUKI
Assignee: Andrew Or
Priority: Minor
 Fix For: 1.2.0


 Maven build fails in Windows OS with this error message.
 {noformat}
 [ERROR] Failed to execute goal org.codehaus.mojo:exec-maven-plugin:1.2.1:exec 
 (default) on project spark-core_2.10: Command execution failed. Cannot run 
 program "unzip" (in directory "C:\path\to\gitofspark\python"): CreateProcess 
 error=2, Žw’肳‚ꂽƒtƒ@ƒ - [Help 1]
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2334) Attribute Error calling PipelinedRDD.id() in pyspark

2014-09-04 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2334:
--
Affects Version/s: 1.1.0

 Attribute Error calling PipelinedRDD.id() in pyspark
 

 Key: SPARK-2334
 URL: https://issues.apache.org/jira/browse/SPARK-2334
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0, 1.1.0
Reporter: Diana Carroll

 calling the id() function of a PipelinedRDD causes an error in PySpark.  
 (Works fine in Scala.)
 The second id() call here fails, the first works:
 {code}
 r1 = sc.parallelize([1,2,3])
 r1.id()
 r2=r1.map(lambda i: i+1)
 r2.id()
 {code}
 Error:
 {code}
 ---
 AttributeError  Traceback (most recent call last)
 <ipython-input-31-a0cf66fcf645> in <module>()
 ----> 1 r2.id()
 /usr/lib/spark/python/pyspark/rdd.py in id(self)
 180 A unique ID for this RDD (within its SparkContext).
 181 """
 --> 182 return self._id
 183 
 184 def __repr__(self):
 AttributeError: 'PipelinedRDD' object has no attribute '_id'
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-640) Update Hadoop 1 version to 1.1.0 (especially on AMIs)

2014-09-04 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14121883#comment-14121883
 ] 

Matei Zaharia commented on SPARK-640:
-

[~pwendell] what is our Hadoop 1 version on AMIs now?

 Update Hadoop 1 version to 1.1.0 (especially on AMIs)
 -

 Key: SPARK-640
 URL: https://issues.apache.org/jira/browse/SPARK-640
 Project: Spark
  Issue Type: New Feature
Reporter: Matei Zaharia

 Hadoop 1.1.0 has a fix to the notorious trailing slash for directory objects 
 in S3 issue: https://issues.apache.org/jira/browse/HADOOP-5836, so would be 
 good to support on the AMIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3405) EC2 cluster creation on VPC

2014-09-04 Thread Dawson Reid (JIRA)
Dawson Reid created SPARK-3405:
--

 Summary: EC2 cluster creation on VPC
 Key: SPARK-3405
 URL: https://issues.apache.org/jira/browse/SPARK-3405
 Project: Spark
  Issue Type: New Feature
  Components: EC2, PySpark
Affects Versions: 1.0.2
 Environment: Ubuntu 12.04
Reporter: Dawson Reid
Priority: Minor


It would be very useful to be able to specify the EC2 VPC in which the Spark 
cluster should be created. 

When creating a Spark cluster on AWS via the spark-ec2 script there is no way 
to specify the ID of the VPC in which you would like the cluster to be created. The 
script always creates the cluster in the default VPC. 

In my case I have deleted the default VPC and the spark-ec2 script errors out 
with the following : 

Setting up security groups...
Creating security group test-master
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>VPCIdNotSpecified</Code><Message>No default VPC 
for this 
user</Message></Error></Errors><RequestID>312a2281-81a1-4d3c-ba10-0593a886779d</RequestID></Response>
Traceback (most recent call last):
  File "./spark_ec2.py", line 860, in <module>
main()
  File "./spark_ec2.py", line 852, in main
real_main()
  File "./spark_ec2.py", line 735, in real_main
conn, opts, cluster_name)
  File "./spark_ec2.py", line 247, in launch_cluster
master_group = get_or_make_group(conn, cluster_name + "-master")
  File "./spark_ec2.py", line 143, in get_or_make_group
return conn.create_security_group(name, "Spark EC2 group")
  File 
"/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py",
 line 2011, in create_security_group
  File 
"/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py",
 line 925, in get_object
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>VPCIdNotSpecified</Code><Message>No default VPC 
for this 
user</Message></Error></Errors><RequestID>312a2281-81a1-4d3c-ba10-0593a886779d</RequestID></Response>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122006#comment-14122006
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Yes, so my initial goal is to be able to recover all the blocks that have not 
been made into an RDD yet (at which point it would be safe). There is also data 
which may not have become a block yet (data added using the += operator); for 
now, I am going to call it fair game to say that we will add 
storeReliably(ArrayBuffer/Iterable) methods, which will be the only ones that 
store data such that it is guaranteed to be recovered.

At a later stage, we could use something like a WAL on HDFS to recover even the 
+= data, though that would affect performance.
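
As a purely hypothetical sketch of what such an API could look like (the trait, names and 
signatures below are illustrative only, not the actual Spark Streaming receiver API):
{code}
import scala.collection.mutable.ArrayBuffer

// Only data handed over through storeReliably(...) would be persisted such
// that it can be recovered if the driver goes down; the single-record "+="
// style path keeps its current, weaker semantics.
trait ReliableReceiver[T] {
  def storeReliably(buffer: ArrayBuffer[T]): Unit   // store a full buffer with a recovery guarantee
  def storeReliably(records: Iterable[T]): Unit     // store any batch with a recovery guarantee
  def store(record: T): Unit                        // existing single-record path, no guarantee yet
}
{code}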



 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down - and the 
 sending system cannot re-send the data (or the data has already expired on 
 the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122017#comment-14122017
 ] 

Hari Shreedharan commented on SPARK-3129:
-

[~tgraves] - Am I correct in assuming that using Akka automatically gives the 
shared secret authentication if spark.authenticate is set to true - if the AM 
is restarted by YARN itself (since it is the same application, it theoretically 
has access to the same shared secret and thus should be able to communicate via 
Akka)? 

 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down - and the 
 sending system cannot re-send the data (or the data has already expired on 
 the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122064#comment-14122064
 ] 

Thomas Graves commented on SPARK-3129:
--

On YARN, it generates the secret automatically. In cluster mode, it does it in 
the ApplicationMaster. Since it generates it in the ApplicationMaster, the secret goes 
away when the ApplicationMaster dies. If the secret was generated on the 
client side and populated into the credentials in the UGI, similar to how we do 
tokens, then a restart of the AM in cluster mode should be able to pick it back 
up.

This won't work for client mode though since the client/spark driver wouldn't 
have a way to get ahold of the UGI again.  

 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down - and the 
 sending system cannot re-send the data (or the data has already expired on 
 the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3378) Replace the word SparkSQL with right word Spark SQL

2014-09-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3378.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

 Replace the word SparkSQL with right word Spark SQL
 ---

 Key: SPARK-3378
 URL: https://issues.apache.org/jira/browse/SPARK-3378
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Trivial
 Fix For: 1.2.0


 In programming-guide.md, there are 2 "SparkSQL". We should use "Spark SQL" 
 instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122103#comment-14122103
 ] 

Hari Shreedharan commented on SPARK-3129:
-

I am less worried about client mode, since most streaming applications would 
run in cluster mode. We can make this available only in the cluster mode.

 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down - and the 
 sending system cannot re-send the data (or the data has already expired on 
 the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Hari Shreedharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122114#comment-14122114
 ] 

Hari Shreedharan commented on SPARK-3129:
-

Looks like simply moving the code that generates the secret and sets it in the UGI 
to the Client class should take care of that. 
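
A minimal sketch of that idea, assuming Hadoop 2's UserGroupInformation/Credentials API; the 
credential alias used below is hypothetical, not an existing Spark key:
{code}
import java.security.SecureRandom
import org.apache.hadoop.io.Text
import org.apache.hadoop.security.{Credentials, UserGroupInformation}

// On the client: generate the shared secret once and stash it in the
// submitter's UGI credentials, alongside the delegation tokens.
val secret = new Array[Byte](32)
new SecureRandom().nextBytes(secret)

val creds = new Credentials()
creds.addSecretKey(new Text("spark.authenticate.secret"), secret)  // alias is an assumption
UserGroupInformation.getCurrentUser().addCredentials(creds)

// In the (possibly restarted) AM, the same application can read it back:
val recovered = UserGroupInformation.getCurrentUser()
  .getCredentials().getSecretKey(new Text("spark.authenticate.secret"))
{code}
Since a restarted AM attempt runs under the same application's credentials, it would see the same 
secret without regenerating it.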

 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down - and the 
 sending system cannot re-send the data (or the data has already expired on 
 the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3405) EC2 cluster creation on VPC

2014-09-04 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3405:
---
Component/s: (was: PySpark)

 EC2 cluster creation on VPC
 ---

 Key: SPARK-3405
 URL: https://issues.apache.org/jira/browse/SPARK-3405
 Project: Spark
  Issue Type: New Feature
  Components: EC2
Affects Versions: 1.0.2
 Environment: Ubuntu 12.04
Reporter: Dawson Reid
Priority: Minor

 It would be very useful to be able to specify the EC2 VPC in which the Spark 
 cluster should be created. 
 When creating a Spark cluster on AWS via the spark-ec2 script there is no way 
 to specify the ID of the VPC in which you would like the cluster to be created. 
 The script always creates the cluster in the default VPC. 
 In my case I have deleted the default VPC and the spark-ec2 script errors out 
 with the following : 
 Setting up security groups...
 Creating security group test-master
 ERROR:boto:400 Bad Request
 ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>VPCIdNotSpecified</Code><Message>No default 
 VPC for this 
 user</Message></Error></Errors><RequestID>312a2281-81a1-4d3c-ba10-0593a886779d</RequestID></Response>
 Traceback (most recent call last):
   File "./spark_ec2.py", line 860, in <module>
 main()
   File "./spark_ec2.py", line 852, in main
 real_main()
   File "./spark_ec2.py", line 735, in real_main
 conn, opts, cluster_name)
   File "./spark_ec2.py", line 247, in launch_cluster
 master_group = get_or_make_group(conn, cluster_name + "-master")
   File "./spark_ec2.py", line 143, in get_or_make_group
 return conn.create_security_group(name, "Spark EC2 group")
   File 
 "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/ec2/connection.py",
  line 2011, in create_security_group
   File 
 "/home/dawson/Develop/spark-1.0.2/ec2/third_party/boto-2.4.1.zip/boto-2.4.1/boto/connection.py",
  line 925, in get_object
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>VPCIdNotSpecified</Code><Message>No default 
 VPC for this 
 user</Message></Error></Errors><RequestID>312a2281-81a1-4d3c-ba10-0593a886779d</RequestID></Response>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3403) NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)

2014-09-04 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122219#comment-14122219
 ] 

Xiangrui Meng commented on SPARK-3403:
--

I don't have a Windows system to test on. There should be a runtime flag you can 
set to control the number of threads OpenBLAS uses. Could you try that? I will 
test the attached code on OS X and report back.
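
For reference, a hedged note on what such a runtime flag usually looks like: the 
thread count is normally controlled by the OPENBLAS_NUM_THREADS environment 
variable (an OpenBLAS convention, not a Spark setting), and it has to be set 
before the JVM starts.

{code}
// Untested suggestion: launch with the variable set, e.g.
//   OPENBLAS_NUM_THREADS=1 spark-submit ...
// From inside the application you can only verify that it was picked up:
println(sys.env.getOrElse("OPENBLAS_NUM_THREADS", "<not set>"))
{code}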

 NaiveBayes crashes with blas/lapack native libraries for breeze (netlib-java)
 -

 Key: SPARK-3403
 URL: https://issues.apache.org/jira/browse/SPARK-3403
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.2
 Environment: Setup: Windows 7, x64 libraries for netlib-java (as 
 described on https://github.com/fommil/netlib-java). I used OpenBLAS x64 and 
 MinGW64 precompiled DLLs.
Reporter: Alexander Ulanov
 Fix For: 1.1.0

 Attachments: NativeNN.scala


 Code:
 val model = NaiveBayes.train(train)
 val predictionAndLabels = test.map { point =>
   val score = model.predict(point.features)
   (score, point.label)
 }
 predictionAndLabels.foreach(println)
 Result: 
 program crashes with: Process finished with exit code -1073741819 
 (0xC0000005) after displaying the first prediction



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3406) Python persist API does not have a default storage level

2014-09-04 Thread holdenk (JIRA)
holdenk created SPARK-3406:
--

 Summary: Python persist API does not have a default storage level
 Key: SPARK-3406
 URL: https://issues.apache.org/jira/browse/SPARK-3406
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Reporter: holdenk
Priority: Minor


PySpark's persist method on RDDs does not have a default storage level. This 
is different from the Scala API, which defaults to in-memory caching. This is 
minor.
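
For comparison, a minimal Scala sketch of the behaviour the Python API lacks 
(the RDD names here are made up for the example):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext("local", "persist-default-example")
// In Scala, persist() with no argument defaults to in-memory caching.
val cached = sc.parallelize(1 to 100).persist()                        // MEMORY_ONLY
// An explicit storage level can still be supplied.
val onDisk = sc.parallelize(1 to 100).persist(StorageLevel.DISK_ONLY)
{code}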



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3390) sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object nesting

2014-09-04 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-3390:

Summary: sqlContext.jsonRDD fails on a complex structure of JSON array and 
JSON object nesting  (was: sqlContext.jsonRDD fails on a complex structure of 
array and hashmap nesting)

 sqlContext.jsonRDD fails on a complex structure of JSON array and JSON object 
 nesting
 -

 Key: SPARK-3390
 URL: https://issues.apache.org/jira/browse/SPARK-3390
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Vida Ha
Assignee: Yin Huai
Priority: Critical

 I found a valid JSON string which Spark SQL fails to parse correctly.
 Try running these lines in a spark-shell to reproduce:
 {code:borderStyle=solid}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val badJson = "{\"foo\": [[{\"bar\": 0}]]}"
 val rdd = sc.parallelize(badJson :: Nil)
 sqlContext.jsonRDD(rdd).count()
 {code}
 I've tried running these lines on the 1.0.2 release as well as the latest 
 Spark 1.1 release candidate, and I get this stack trace:
 {panel}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:3 
 failed 1 times, most recent failure: Exception failure in TID 7 on host 
 localhost: scala.MatchError: StructType(List()) (of class 
 org.apache.spark.sql.catalyst.types.StructType)
 
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:333)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:365)
 scala.Option.map(Option.scala:145)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:364)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:349)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 
 org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:349)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3390) sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting

2014-09-04 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122235#comment-14122235
 ] 

Yin Huai commented on SPARK-3390:
-

Oh, I see the problem. I am out of town this week. Will fix it next week.

 sqlContext.jsonRDD fails on a complex structure of array and hashmap nesting
 

 Key: SPARK-3390
 URL: https://issues.apache.org/jira/browse/SPARK-3390
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Vida Ha
Assignee: Yin Huai
Priority: Critical

 I found a valid JSON string which Spark SQL fails to parse correctly.
 Try running these lines in a spark-shell to reproduce:
 {code:borderStyle=solid}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 val badJson = "{\"foo\": [[{\"bar\": 0}]]}"
 val rdd = sc.parallelize(badJson :: Nil)
 sqlContext.jsonRDD(rdd).count()
 {code}
 I've tried running these lines on the 1.0.2 release as well as the latest 
 Spark 1.1 release candidate, and I get this stack trace:
 {panel}
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 2.0:3 
 failed 1 times, most recent failure: Exception failure in TID 7 on host 
 localhost: scala.MatchError: StructType(List()) (of class 
 org.apache.spark.sql.catalyst.types.StructType)
 
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:333)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$enforceCorrectType$1.apply(JsonRDD.scala:335)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 scala.collection.AbstractTraversable.map(Traversable.scala:105)
 
 org.apache.spark.sql.json.JsonRDD$.enforceCorrectType(JsonRDD.scala:335)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1$$anonfun$apply$12.apply(JsonRDD.scala:365)
 scala.Option.map(Option.scala:145)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:364)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$org$apache$spark$sql$json$JsonRDD$$asRow$1.apply(JsonRDD.scala:349)
 
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 
 org.apache.spark.sql.json.JsonRDD$.org$apache$spark$sql$json$JsonRDD$$asRow(JsonRDD.scala:349)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51)
 
 org.apache.spark.sql.json.JsonRDD$$anonfun$createLogicalPlan$1.apply(JsonRDD.scala:51)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 
 {panel}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework

2014-09-04 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122242#comment-14122242
 ] 

Yu Ishikawa commented on SPARK-2430:


Hi [~rnowling],

I am very interested in this issue.
If possible, I am willing to work with you.

I think MLlib's high-level API should be consistent, in the style of scikit-learn.
In scikit-learn, almost all algorithms can be used through the same `fit` and 
`predict` functions.
A consistent API would be helpful for Spark users too.
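
To make the idea concrete, a purely hypothetical sketch of such a contract (the 
trait and method names are illustrative, not an existing MLlib interface):

{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

trait ClusteringModel extends Serializable {
  def predict(point: Vector): Int                              // cluster index for one point
  def predict(points: RDD[Vector]): RDD[Int] = points.map(p => predict(p))
}

trait ClusteringEstimator[M <: ClusteringModel] {
  def fit(data: RDD[Vector]): M                                // train and return a model
}
{code}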

 Standarized Clustering Algorithm API and Framework
 --

 Key: SPARK-2430
 URL: https://issues.apache.org/jira/browse/SPARK-2430
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor

 Recently, there has been a chorus of voices on the mailing lists about adding 
 new clustering algorithms to MLlib.  To support these additions, we should 
 develop a common framework and API to reduce code duplication and keep the 
 APIs consistent.
 At the same time, we can also expand the current API to incorporate requested 
 features such as arbitrary distance metrics or pre-computed distance matrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-09-04 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122249#comment-14122249
 ] 

Yu Ishikawa commented on SPARK-2966:


I'm sorry for not checking the community discussion and the JIRA issue. Thank you 
for letting me know.

We should be able to implement an approximation algorithm for hierarchical 
clustering with LSH. I think the approach of this issue is different from that 
of [SPARK-2429]. Should we merge this issue into [SPARK-2429]?

 Add an approximation algorithm for hierarchical clustering to MLlib
 ---

 Key: SPARK-2966
 URL: https://issues.apache.org/jira/browse/SPARK-2966
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor

 A hierarchical clustering algorithm is a useful unsupervised learning method.
 Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
 (1).
 I would like to implement this method.
 I suggest adding an approximate hierarchical clustering algorithm to MLlib.
 I'd like this to be assigned to me.
 h3. Reference
 # Fast agglomerative hierarchical clustering algorithm using 
 Locality-Sensitive Hashing
 http://dl.acm.org/citation.cfm?id=1266811



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3310) Directly use currentTable without unnecessary implicit conversion

2014-09-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3310.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

 Directly use currentTable without unnecessary implicit conversion
 -

 Key: SPARK-3310
 URL: https://issues.apache.org/jira/browse/SPARK-3310
 Project: Spark
  Issue Type: Improvement
Reporter: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.2.0


 We can directly use currentTable in function cacheTable without unnecessary 
 implicit conversion.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2219) AddJar doesn't work

2014-09-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2219.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

 AddJar doesn't work
 ---

 Key: SPARK-2219
 URL: https://issues.apache.org/jira/browse/SPARK-2219
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Michael Armbrust
Assignee: Cheng Lian
 Fix For: 1.2.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2966) Add an approximation algorithm for hierarchical clustering to MLlib

2014-09-04 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122266#comment-14122266
 ] 

RJ Nowling commented on SPARK-2966:
---

No worries.

Based on my reading of the Spark contribution guidelines ( 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark ), I 
think that the Spark community would prefer to have one good implementation of 
an algorithm instead of multiple similar algorithms.

Since the community has stated a clear preference for divisive hierarchical 
clustering, I think that is a better aim.  You seem very motivated and have 
made some good contributions -- would you like to take the lead on the 
hierarchical clustering?  I can review your code to help you improve it.

That said, I suggest you look at the comment I added to SPARK-2429 and see what 
you think of that approach.  If you like the example code and papers, why don't 
you work on implementing it efficiently in Spark?

 Add an approximation algorithm for hierarchical clustering to MLlib
 ---

 Key: SPARK-2966
 URL: https://issues.apache.org/jira/browse/SPARK-2966
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Yu Ishikawa
Priority: Minor

 A hierarchical clustering algorithm is a useful unsupervised learning method.
 Koga et al. proposed a highly scalable hierarchical clustering algorithm in 
 (1).
 I would like to implement this method.
 I suggest adding an approximate hierarchical clustering algorithm to MLlib.
 I'd like this to be assigned to me.
 h3. Reference
 # Fast agglomerative hierarchical clustering algorithm using 
 Locality-Sensitive Hashing
 http://dl.acm.org/citation.cfm?id=1266811



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2430) Standarized Clustering Algorithm API and Framework

2014-09-04 Thread RJ Nowling (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122273#comment-14122273
 ] 

RJ Nowling commented on SPARK-2430:
---

Hi Yu,

The community had suggested looking into scikit-learn's API so that is a good 
idea.

I am hesitant to make backwards-incompatible API changes, however, until we 
know the new API will be stable for a long time.  I think it would be best to 
implement a few more clustering algorithms to get a clear idea of what is 
similar vs different before making a new API.  May I suggest you work on 
SPARK-2966 / SPARK-2429 first?

RJ

 Standarized Clustering Algorithm API and Framework
 --

 Key: SPARK-2430
 URL: https://issues.apache.org/jira/browse/SPARK-2430
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: RJ Nowling
Priority: Minor

 Recently, there has been a chorus of voices on the mailing lists about adding 
 new clustering algorithms to MLlib.  To support these additions, we should 
 develop a common framework and API to reduce code duplication and keep the 
 APIs consistent.
 At the same time, we can also expand the current API to incorporate requested 
 features such as arbitrary distance metrics or pre-computed distance matrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3129) Prevent data loss in Spark Streaming

2014-09-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122274#comment-14122274
 ] 

Saisai Shao commented on SPARK-3129:


Hi [~hshreedharan], thanks for your reply. Is this PR 
(https://github.com/apache/spark/pull/1195) the one you mentioned regarding 
storeReliably()?

As I understand it, this API aims to store a batch of messages into the 
BlockManager (BM) directly to make it reliable. But for some receivers, such as 
Kafka and socket receivers, data is injected one message at a time; we can't call 
storeReliably() for every message because of efficiency and throughput concerns, 
so we need to buffer the data locally up to some amount and then flush it to the 
BM using storeReliably(). Data can therefore still be lost while it is buffered 
locally. These days I have been thinking about a WAL; IMHO a WAL would be a 
better solution than the blocking store API.

 Prevent data loss in Spark Streaming
 

 Key: SPARK-3129
 URL: https://issues.apache.org/jira/browse/SPARK-3129
 Project: Spark
  Issue Type: New Feature
Reporter: Hari Shreedharan
Assignee: Hari Shreedharan
 Attachments: StreamingPreventDataLoss.pdf


 Spark Streaming can lose small amounts of data when the driver goes down - and 
 the sending system cannot re-send the data (or the data has already expired on 
 the sender side). The document attached has more details. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3407) Add Date type support

2014-09-04 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-3407:


 Summary: Add Date type support
 Key: SPARK-3407
 URL: https://issues.apache.org/jira/browse/SPARK-3407
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Cheng Hao






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-09-04 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122292#comment-14122292
 ] 

Saisai Shao commented on SPARK-2926:


Hi Matei, sorry for the late response. I will test more scenarios following your 
notes, and also factor things out to see whether some code can be shared with 
ExternalSorter. Thanks a lot.

 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
 --

 Key: SPARK-2926
 URL: https://issues.apache.org/jira/browse/SPARK-2926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao
 Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report.pdf


 Spark has already integrated sort-based shuffle write, which greatly improves 
 IO performance and reduces memory consumption when the number of reducers is 
 very large. On the reducer side, however, it still uses the hash-based shuffle 
 reader, which ignores the ordering of map output data in some situations.
 Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
 shuffle to further improve its performance.
 Work-in-progress code and a performance test report will be posted later, once 
 some unit test bugs are fixed.
 Any comments would be greatly appreciated. 
 Thanks a lot.
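
As a rough illustration of the sort-merge step only (generic Scala, not the 
proposed Spark code; it assumes every map-output stream is already sorted by key):

{code}
import scala.collection.mutable

def mergeSorted[K, V](streams: Seq[Iterator[(K, V)]])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // Min-heap of buffered iterators, ordered by their current head key.
  val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
  streams.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))
  new Iterator[(K, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (K, V) = {
      val it = heap.dequeue()
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it)   // re-insert with its new head key
      kv
    }
  }
}
{code}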



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3392) Set command always get undefined for key mapred.reduce.tasks

2014-09-04 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-3392.
-
   Resolution: Fixed
Fix Version/s: 1.2.0

 Set command always get undefined for key mapred.reduce.tasks
 

 Key: SPARK-3392
 URL: https://issues.apache.org/jira/browse/SPARK-3392
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
Assignee: Cheng Hao
Priority: Trivial
 Fix For: 1.2.0


 This is a tiny fix for getting the value of mapred.reduce.tasks, which makes 
 more sense for Hive users.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3408) Limit operator doesn't work with sort based shuffle

2014-09-04 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3408:
--

 Summary: Limit operator doesn't work with sort based shuffle
 Key: SPARK-3408
 URL: https://issues.apache.org/jira/browse/SPARK-3408
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures

2014-09-04 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-3409:
--

 Summary: Avoid pulling in Exchange operator itself in Exchange's 
closures
 Key: SPARK-3409
 URL: https://issues.apache.org/jira/browse/SPARK-3409
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin


{code}
val rdd = child.execute().mapPartitions { iter =>
  if (sortBasedShuffleOn) {
    iter.map(r => (null, r.copy()))
  } else {
    val mutablePair = new MutablePair[Null, Row]()
    iter.map(r => mutablePair.update(null, r))
  }
}
{code}

The above snippet from Exchange references sortBasedShuffleOn within a closure, 
which requires pulling in the entire Exchange object in the closure. 

This is a tiny teeny optimization.
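
A minimal sketch of the usual fix, mirroring the snippet above (not a verified 
diff): copy the flag into a local val so the closure captures only a boolean 
instead of the enclosing Exchange instance.

{code}
val sortBased = sortBasedShuffleOn   // local copy; closure no longer captures `this`
val rdd = child.execute().mapPartitions { iter =>
  if (sortBased) {
    iter.map(r => (null, r.copy()))
  } else {
    val mutablePair = new MutablePair[Null, Row]()
    iter.map(r => mutablePair.update(null, r))
  }
}
{code}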



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-09-04 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122386#comment-14122386
 ] 

sam commented on SPARK-1473:


Good paper; the theory is very solid. My only concern is that the paper does 
not explicitly tackle the problem of probability estimation in high 
dimensions, which for sparse data will be even worse. It only touches on 
the problem, saying:

This in turn causes increasingly poor judgements for the inclusion/exclusion 
of features. For precisely this reason, the research community have developed 
various low-dimensional approximations to (9). In the following sections, we 
will investigate the implicit statistical assumptions and empirical effects of 
these approximations

The sections mentioned do not go into theoretical detail, and therefore I 
disagree that the paper provides a single unified information-theoretic 
framework for feature selection, as it basically leaves the problem of 
probability estimation to the reader's choice and merely suggests the reader 
assume some level of independence between features in order to implement an 
algorithm.

[~dmborque] Do you know of any literature that does approach the problem of 
probability estimation in an information-theoretic and philosophically 
justified way?

Anyway, despite my concerns, this paper is still by far the best treatment of 
feature selection I have seen.

 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Assignee: Alexander Ulanov
Priority: Minor
  Labels: features

 For classification tasks involving large feature spaces in the order of tens 
 of thousands or higher (e.g., text classification with n-grams, where n > 1), 
 it is often useful to rank and filter features that are irrelevant thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A feature evaluation interface which is flexible needs to be designed and at 
 least two methods should be implemented with Information Gain being a 
 priority as it has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below) which are more practical for lower 
 dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
 likelihood maximisation: a unifying framework for information theoretic
 feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, George. An extensive empirical study of feature selection metrics 
 for text classification. The Journal of machine learning research 3 (2003): 
 1289-1305.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1473) Feature selection for high dimensional datasets

2014-09-04 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122390#comment-14122390
 ] 

sam commented on SPARK-1473:


[~dmm...@gmail.com] Mentioning you as well (I can't work out which David is the 
one that posted above).

 Feature selection for high dimensional datasets
 ---

 Key: SPARK-1473
 URL: https://issues.apache.org/jira/browse/SPARK-1473
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Ignacio Zendejas
Assignee: Alexander Ulanov
Priority: Minor
  Labels: features

 For classification tasks involving large feature spaces in the order of tens 
 of thousands or higher (e.g., text classification with n-grams, where n > 1), 
 it is often useful to rank and filter features that are irrelevant thereby 
 reducing the feature space by at least one or two orders of magnitude without 
 impacting performance on key evaluation metrics (accuracy/precision/recall).
 A feature evaluation interface which is flexible needs to be designed and at 
 least two methods should be implemented with Information Gain being a 
 priority as it has been shown to be amongst the most reliable.
 Special consideration should be taken in the design to account for wrapper 
 methods (see research papers below) which are more practical for lower 
 dimensional data.
 Relevant research:
 * Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional
 likelihood maximisation: a unifying framework for information theoretic
 feature selection. *The Journal of Machine Learning Research*, *13*, 27-66.
 * Forman, George. An extensive empirical study of feature selection metrics 
 for text classification. The Journal of machine learning research 3 (2003): 
 1289-1305.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3408) Limit operator doesn't work with sort based shuffle

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122407#comment-14122407
 ] 

Apache Spark commented on SPARK-3408:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2281

 Limit operator doesn't work with sort based shuffle
 ---

 Key: SPARK-3408
 URL: https://issues.apache.org/jira/browse/SPARK-3408
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Reynold Xin
Assignee: Reynold Xin





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3409) Avoid pulling in Exchange operator itself in Exchange's closures

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3409?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122408#comment-14122408
 ] 

Apache Spark commented on SPARK-3409:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2282

 Avoid pulling in Exchange operator itself in Exchange's closures
 

 Key: SPARK-3409
 URL: https://issues.apache.org/jira/browse/SPARK-3409
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin
Assignee: Reynold Xin

 {code}
 val rdd = child.execute().mapPartitions { iter =>
   if (sortBasedShuffleOn) {
     iter.map(r => (null, r.copy()))
   } else {
     val mutablePair = new MutablePair[Null, Row]()
     iter.map(r => mutablePair.update(null, r))
   }
 }
 {code}
 The above snippet from Exchange references sortBasedShuffleOn within a 
 closure, which requires pulling in the entire Exchange object in the closure. 
 This is a tiny teeny optimization.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.

2014-09-04 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-3410:
--
Issue Type: Improvement  (was: Bug)

 The priority of shutdownhook for ApplicationMaster should not be integer 
 literal, rather than refer constant.
 -

 Key: SPARK-3410
 URL: https://issues.apache.org/jira/browse/SPARK-3410
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Minor

 In ApplicationMaster, the priority of the shutdown hook is set to the literal 
 30, which is expected to be higher than the priority of o.a.h.FileSystem's hook.
 In FileSystem, the priority of the shutdown hook is exposed as a public constant 
 named SHUTDOWN_HOOK_PRIORITY, so I think it's better to derive the priority of 
 ApplicationMaster's shutdown hook from this constant.
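
A hedged sketch of what that could look like (not the actual ApplicationMaster 
code; the cleanup body is a placeholder):

{code}
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.util.ShutdownHookManager

// Higher priority means the hook runs earlier, so adding an offset keeps the
// AM hook ahead of FileSystem's own hook even if Hadoop changes the constant.
val priority = FileSystem.SHUTDOWN_HOOK_PRIORITY + 20

ShutdownHookManager.get().addShutdownHook(new Runnable {
  override def run(): Unit = {
    // cleanup work would go here, e.g. deleting the staging directory
  }
}, priority)
{code}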



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3410) The priority of shutdownhook for ApplicationMaster should not be integer literal, rather than refer constant.

2014-09-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3410?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14122465#comment-14122465
 ] 

Apache Spark commented on SPARK-3410:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/2283

 The priority of shutdownhook for ApplicationMaster should not be integer 
 literal, rather than refer constant.
 -

 Key: SPARK-3410
 URL: https://issues.apache.org/jira/browse/SPARK-3410
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.1.0
Reporter: Kousuke Saruta
Priority: Minor

 In ApplicationMaster, the priority of the shutdown hook is set to the literal 
 30, which is expected to be higher than the priority of o.a.h.FileSystem's hook.
 In FileSystem, the priority of the shutdown hook is exposed as a public constant 
 named SHUTDOWN_HOOK_PRIORITY, so I think it's better to derive the priority of 
 ApplicationMaster's shutdown hook from this constant.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3411) Optimize the schedule procedure in Master

2014-09-04 Thread WangTaoTheTonic (JIRA)
WangTaoTheTonic created SPARK-3411:
--

 Summary: Optimize the schedule procedure in Master
 Key: SPARK-3411
 URL: https://issues.apache.org/jira/browse/SPARK-3411
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: WangTaoTheTonic
Priority: Minor


If the waiting driver array is too big, the drivers in it will all be dispatched 
to the first worker we get (if it has enough resources), with or without 
randomization.

We should randomize every time we dispatch a driver, in order to balance drivers 
across workers better.
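
An illustrative sketch of the suggested change (identifiers such as 
waitingDrivers, workers and launchDriver are assumptions based on the 
description, not a verified patch):

{code}
import scala.util.Random

for (driver <- waitingDrivers.toList) {        // iterate over a snapshot
  // Re-shuffle the alive workers for every driver instead of once per schedule pass.
  val candidates = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  candidates.find(w => w.memoryFree >= driver.desc.mem && w.coresFree >= driver.desc.cores)
    .foreach { worker =>
      launchDriver(worker, driver)
      waitingDrivers -= driver
    }
}
{code}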



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org