[jira] [Created] (SPARK-10580) Remove Bagel

2015-09-13 Thread Sean Owen (JIRA)
Sean Owen created SPARK-10580:
-

 Summary: Remove Bagel
 Key: SPARK-10580
 URL: https://issues.apache.org/jira/browse/SPARK-10580
 Project: Spark
  Issue Type: Task
  Components: GraphX
Reporter: Sean Owen


Follow-up to SPARK-10222: remove Bagel in Spark 2.x






[jira] [Resolved] (SPARK-10222) More thoroughly deprecate Bagel in favor of GraphX

2015-09-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10222.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 8731
[https://github.com/apache/spark/pull/8731]

> More thoroughly deprecate Bagel in favor of GraphX
> --
>
> Key: SPARK-10222
> URL: https://issues.apache.org/jira/browse/SPARK-10222
> Project: Spark
>  Issue Type: Task
>  Components: GraphX
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 1.6.0
>
>
> It seems like Bagel has had little or no activity since before even Spark 1.0 
> (?) and is supposed to be superseded by GraphX. 
> Would it be reasonable to deprecate it for 1.6? and remove it in Spark 2.x? I 
> think it's reasonable enough that I'll assert this as a JIRA, but obviously 
> open to discussion.






[jira] [Commented] (SPARK-10576) Move .java files out of src/main/scala

2015-09-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742375#comment-14742375
 ] 

Sean Owen commented on SPARK-10576:
---

Yes, a lot of these seem to be duplicating the {{package.scala}} file next to 
them in order to get a package summary to appear for these packages in the 
_Javadoc_ as well as the Scaladoc. That seems legitimate, and it covers most of 
them.

What about the others, though? The actual class files seem like they can go in 
{{src/main/java}}. Let me try moving them to see if anything breaks, as a first 
pass.
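(For illustration only, not taken from the Spark tree: a minimal sketch of the 
pattern described above, using a made-up package name. The Scaladoc comment on 
the package object in {{package.scala}} supplies the package summary for 
Scaladoc, and the sibling {{package-info.java}} exists only so an equivalent 
summary shows up in Javadoc.)

{code}
package org.apache

/**
 * Example package summary. Scaladoc renders this comment as the summary for
 * the org.apache.example package; a package-info.java placed next to this
 * file would carry the equivalent text for Javadoc.
 */
package object example {
}
{code}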

> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.






[jira] [Updated] (SPARK-6350) Make mesosExecutorCores configurable in mesos "fine-grained" mode

2015-09-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6350:
-
Labels:   (was: backport-needed)

> Make mesosExecutorCores configurable in mesos "fine-grained" mode
> -
>
> Key: SPARK-6350
> URL: https://issues.apache.org/jira/browse/SPARK-6350
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0, 1.5.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
> Fix For: 1.4.0, 1.6.0, 1.5.1
>
>
> When Spark runs in Mesos fine-grained mode, the Mesos slave launches an 
> executor with a number of CPUs and an amount of memory. However, the number of 
> executor cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set 
> that value to 5 for running an intensive task, the Mesos executor always 
> consumes 5 cores even when no task is running. This wastes resources. We 
> should make the executor core count a configuration variable.
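A hedged sketch of what the proposed setting looks like in use; 
{{spark.mesos.mesosExecutorCores}} is the property this issue introduces, and 
the values below are illustrative only:

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Fine-grained Mesos mode: each running task reserves spark.task.cpus cores,
// while the executor itself only holds spark.mesos.mesosExecutorCores when idle.
val conf = new SparkConf()
  .setAppName("fine-grained-example")
  .set("spark.mesos.coarse", "false")           // fine-grained mode
  .set("spark.task.cpus", "5")                  // cores reserved per running task
  .set("spark.mesos.mesosExecutorCores", "1")   // cores held by the idle executor
val sc = new SparkContext(conf)
{code}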






[jira] [Resolved] (SPARK-6350) Make mesosExecutorCores configurable in mesos "fine-grained" mode

2015-09-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6350.
--
   Resolution: Fixed
Fix Version/s: 1.5.1

> Make mesosExecutorCores configurable in mesos "fine-grained" mode
> -
>
> Key: SPARK-6350
> URL: https://issues.apache.org/jira/browse/SPARK-6350
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 1.0.0, 1.5.0
>Reporter: Jongyoul Lee
>Assignee: Jongyoul Lee
>Priority: Minor
> Fix For: 1.6.0, 1.5.1, 1.4.0
>
>
> When Spark runs in Mesos fine-grained mode, the Mesos slave launches an 
> executor with a number of CPUs and an amount of memory. However, the number of 
> executor cores is always CPU_PER_TASKS, the same as spark.task.cpus. If I set 
> that value to 5 for running an intensive task, the Mesos executor always 
> consumes 5 cores even when no task is running. This wastes resources. We 
> should make the executor core count a configuration variable.






[jira] [Updated] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column

2015-09-13 Thread Jacek Laskowski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacek Laskowski updated SPARK-10581:

Description: 
The Scala API documentation (scaladoc) for 
[org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
 does not resolve groups, and they appear unresolved like {{df_ops}}, 
{{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
operators._, et al.  

BTW, 
[DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
 and other classes in the 
[org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package]
 package seem fine.

  was:
The Scala API documentation (scaladoc) for 
[org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
 does not resolve groups, and they appear unresolved like {{df_ops}}, 
{{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
operators._, et al.  

BTW, 
[DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
 and other classes in the org.apache.spark.sql package seem fine.


> Groups are not resolved in scaladoc for org.apache.spark.sql.Column
> ---
>
> Key: SPARK-10581
> URL: https://issues.apache.org/jira/browse/SPARK-10581
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Jacek Laskowski
>Priority: Minor
>
> The Scala API documentation (scaladoc) for 
> [org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
>  does not resolve groups, and they appear unresolved like {{df_ops}}, 
> {{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
> operators._, et al.  
> BTW, 
> [DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
>  and other classes in the 
> [org.apache.spark.sql|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.package]
>  package seem fine.
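For reference, a minimal sketch (an illustrative class, not Spark's actual 
Column source) of how such groups are declared with Scaladoc's 
{{@groupname}}/{{@group}} tags; when the group definitions are picked up 
(scaladoc is typically invoked with the -groups flag), {{df_ops}} and 
{{expr_ops}} render with their display names instead of the raw identifiers:

{code}
/**
 * @groupname expr_ops Expression operators.
 * @groupname df_ops DataFrame functions.
 */
class ExampleColumn(val name: String) {

  /** Equality test. @group expr_ops */
  def ===(other: Any): Boolean = this.name == other

  /** Gives the column an alias. @group df_ops */
  def as(alias: String): ExampleColumn = new ExampleColumn(alias)
}
{code}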






[jira] [Created] (SPARK-10581) Groups are not resolved in scaladoc for org.apache.spark.sql.Column

2015-09-13 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-10581:
---

 Summary: Groups are not resolved in scaladoc for 
org.apache.spark.sql.Column
 Key: SPARK-10581
 URL: https://issues.apache.org/jira/browse/SPARK-10581
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Jacek Laskowski
Priority: Minor


The Scala API documentation (scaladoc) for 
[org.apache.spark.sql.Column|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.Column]
 does not resolve groups, and they appear unresolved like {{df_ops}}, 
{{expr_ops}}, et al. instead of _DataFrame functions._, _Expression 
operators._, et al.  

BTW, 
[DataFrame|http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame]
 and other classes in the org.apache.spark.sql package seem fine.






[jira] [Assigned] (SPARK-10524) Decision tree binary classification with ordered categorical features: incorrect centroid

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10524:


Assignee: (was: Apache Spark)

> Decision tree binary classification with ordered categorical features: 
> incorrect centroid
> -
>
> Key: SPARK-10524
> URL: https://issues.apache.org/jira/browse/SPARK-10524
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>
> In DecisionTree and RandomForest binary classification with ordered 
> categorical features, we order categories' bins based on the hard prediction, 
> but we should use the soft prediction.
> Here are the 2 places in mllib and ml:
> * 
> [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L887]
> * 
> [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L779]
> The PR which fixes this should include a unit test which isolates this issue, 
> ideally by directly calling binsToBestSplit.
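A conceptual sketch of the distinction (names and numbers are illustrative, not 
Spark's internals): ordering ordered-categorical bins by the soft prediction 
(the fraction of positives, i.e. the centroid) rather than the hard 0/1 
prediction keeps categories with different class proportions distinct:

{code}
// Per-category label statistics for a binary classification problem.
case class CategoryStats(category: Int, posCount: Double, totalCount: Double) {
  def softPrediction: Double = posCount / totalCount                    // centroid in [0, 1]
  def hardPrediction: Double = if (posCount * 2 >= totalCount) 1.0 else 0.0
}

val stats = Seq(
  CategoryStats(0, 9, 10),   // 90% positive
  CategoryStats(1, 6, 10),   // 60% positive
  CategoryStats(2, 5, 10)    // 50% positive
)

// The hard prediction collapses all three categories to 1.0, so their bin
// ordering becomes arbitrary; the soft prediction keeps them distinct.
val orderedBySoft = stats.sortBy(_.softPrediction).map(_.category)     // Seq(2, 1, 0)
{code}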






[jira] [Commented] (SPARK-10524) Decision tree binary classification with ordered categorical features: incorrect centroid

2015-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742515#comment-14742515
 ] 

Apache Spark commented on SPARK-10524:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8734

> Decision tree binary classification with ordered categorical features: 
> incorrect centroid
> -
>
> Key: SPARK-10524
> URL: https://issues.apache.org/jira/browse/SPARK-10524
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>
> In DecisionTree and RandomForest binary classification with ordered 
> categorical features, we order categories' bins based on the hard prediction, 
> but we should use the soft prediction.
> Here are the 2 places in mllib and ml:
> * 
> [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L887]
> * 
> [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L779]
> The PR which fixes this should include a unit test which isolates this issue, 
> ideally by directly calling binsToBestSplit.






[jira] [Assigned] (SPARK-10524) Decision tree binary classification with ordered categorical features: incorrect centroid

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10524:


Assignee: Apache Spark

> Decision tree binary classification with ordered categorical features: 
> incorrect centroid
> -
>
> Key: SPARK-10524
> URL: https://issues.apache.org/jira/browse/SPARK-10524
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> In DecisionTree and RandomForest binary classification with ordered 
> categorical features, we order categories' bins based on the hard prediction, 
> but we should use the soft prediction.
> Here are the 2 places in mllib and ml:
> * 
> [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala#L887]
> * 
> [https://github.com/apache/spark/blob/45de518742446ddfbd4816c9d0f8501139f9bc2d/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L779]
> The PR which fixes this should include a unit test which isolates this issue, 
> ideally by directly calling binsToBestSplit.






[jira] [Commented] (SPARK-4879) Missing output partitions after job completes with speculative execution

2015-09-13 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742453#comment-14742453
 ] 

Steve Loughran commented on SPARK-4879:
---

What about splitting the issue in two: HDFS commits (interestingly, it's mostly 
EC2 reports), which is probably fixed, and eventually consistent object stores 
(S3 on the Apache Hadoop releases, but not Amazon's own)? The latter needs a 
different committer (no rename), better checks for file presence (a direct 
stat() is more reliable than a directory listing), and a dedicated test suite 
that could be targeted straight at S3, yet still be runnable remotely by 
someone (not Jenkins) from their own desktop and build servers. That's 
essentially what we do in core Hadoop to qualify the object stores' base API 
compatibility.

> Missing output partitions after job completes with speculative execution
> 
>
> Key: SPARK-4879
> URL: https://issues.apache.org/jira/browse/SPARK-4879
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, Spark Core
>Affects Versions: 1.0.2, 1.1.1, 1.2.0, 1.3.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.3.0
>
> Attachments: speculation.txt, speculation2.txt
>
>
> When speculative execution is enabled ({{spark.speculation=true}}), jobs that 
> save output files may report that they have completed successfully even 
> though some output partitions written by speculative tasks may be missing.
> h3. Reproduction
> This symptom was reported to me by a Spark user and I've been doing my own 
> investigation to try to come up with an in-house reproduction.
> I'm still working on a reliable local reproduction for this issue, which is a 
> little tricky because Spark won't schedule speculated tasks on the same host 
> as the original task, so you need an actual (or containerized) multi-host 
> cluster to test speculation.  Here's a simple reproduction of some of the 
> symptoms on EC2, which can be run in {{spark-shell}} with {{--conf 
> spark.speculation=true}}:
> {code}
> // Rig a job such that all but one of the tasks complete instantly
> // and one task runs for 20 seconds on its first attempt and instantly
> // on its second attempt:
> val numTasks = 100
> sc.parallelize(1 to numTasks, numTasks).repartition(2).mapPartitionsWithContext { case (ctx, iter) =>
>   if (ctx.partitionId == 0) {    // If this is the one task that should run really slow
>     if (ctx.attemptId == 0) {    // If this is the first attempt, run slow
>       Thread.sleep(20 * 1000)
>     }
>   }
>   iter
> }.map(x => (x, x)).saveAsTextFile("/test4")
> {code}
> When I run this, I end up with a job that completes quickly (due to 
> speculation) but reports failures from the speculated task:
> {code}
> [...]
> 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Finished task 37.1 in stage 
> 3.0 (TID 411) in 131 ms on ip-172-31-8-164.us-west-2.compute.internal 
> (100/100)
> 14/12/11 01:41:13 INFO scheduler.DAGScheduler: Stage 3 (saveAsTextFile at 
> <console>:22) finished in 0.856 s
> 14/12/11 01:41:13 INFO spark.SparkContext: Job finished: saveAsTextFile at 
> <console>:22, took 0.885438374 s
> 14/12/11 01:41:13 INFO scheduler.TaskSetManager: Ignoring task-finished event 
> for 70.1 in stage 3.0 because task 70 has already completed successfully
> scala> 14/12/11 01:41:13 WARN scheduler.TaskSetManager: Lost task 49.1 in 
> stage 3.0 (TID 413, ip-172-31-8-164.us-west-2.compute.internal): 
> java.io.IOException: Failed to save output of task: 
> attempt_201412110141_0003_m_49_413
> 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:160)
> 
> org.apache.hadoop.mapred.FileOutputCommitter.moveTaskOutputs(FileOutputCommitter.java:172)
> 
> org.apache.hadoop.mapred.FileOutputCommitter.commitTask(FileOutputCommitter.java:132)
> org.apache.spark.SparkHadoopWriter.commit(SparkHadoopWriter.scala:109)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:991)
> 
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$13.apply(PairRDDFunctions.scala:974)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> org.apache.spark.scheduler.Task.run(Task.scala:54)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> {code}
> One interesting thing to note about this stack trace: if we look at 
> {{FileOutputCommitter.java:160}} 
> 

[jira] [Commented] (SPARK-5113) Audit and document use of hostnames and IP addresses in Spark

2015-09-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742672#comment-14742672
 ] 

Sean Owen commented on SPARK-5113:
--

Not sure if this one will ever happen, but there's a related problem at 
SPARK-10149.

> Audit and document use of hostnames and IP addresses in Spark
> -
>
> Key: SPARK-5113
> URL: https://issues.apache.org/jira/browse/SPARK-5113
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Priority: Critical
>
> Spark has multiple network components that start servers and advertise their 
> network addresses to other processes.
> We should go through each of these components and make sure they have 
> consistent and/or documented behavior wrt (a) what interface(s) they bind to 
> and (b) what hostname they use to advertise themselves to other processes. We 
> should document this clearly and explain to people what to do in different 
> cases (e.g. EC2, dockerized containers, etc).
> When Spark initializes, it will search for a network interface until it finds 
> one that is not a loopback address. Then it will do a reverse DNS lookup for 
> a hostname associated with that interface. Then the network components will 
> use that hostname to advertise the component to other processes. That 
> hostname is also the one used for the akka system identifier (akka supports 
> only supplying a single name which it uses both as the bind interface and as 
> the actor identifier). In some cases, that hostname is used as the bind 
> hostname also (e.g. I think this happens in the connection manager and 
> possibly akka) - which will likely internally result in a re-resolution of 
> this to an IP address. In other cases (the web UI and netty shuffle) we seem 
> to bind to all interfaces.
> The best outcome would be to have three configs that can be set on each 
> machine:
> {code}
> SPARK_LOCAL_IP # IP address we bind to for all services
> SPARK_INTERNAL_HOSTNAME # Hostname we advertise to remote processes within 
> the cluster
> SPARK_EXTERNAL_HOSTNAME # Hostname we advertise to processes outside the 
> cluster (e.g. the UI)
> {code}
> It's not clear how easily we can support that scheme while providing 
> backwards compatibility. The last one (SPARK_EXTERNAL_HOSTNAME) is easy - 
> it's just an alias for what is now SPARK_PUBLIC_DNS.
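As a rough illustration of the interface-selection behavior described above (a 
sketch, not Spark's actual code):

{code}
import java.net.{InetAddress, NetworkInterface}
import scala.collection.JavaConverters._

// Pick the first non-loopback interface address, then reverse-resolve it to
// the hostname that would be advertised to other processes.
def advertisedHostname(): String = {
  val candidate = NetworkInterface.getNetworkInterfaces.asScala
    .flatMap(_.getInetAddresses.asScala)
    .find(addr => !addr.isLoopbackAddress)
    .getOrElse(InetAddress.getLocalHost)
  candidate.getCanonicalHostName   // reverse DNS lookup for the advertised name
}
{code}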






[jira] [Resolved] (SPARK-10006) Locality broken in spark 1.4.x for NewHadoopRDD

2015-09-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-10006.
---
Resolution: Duplicate

This is the same issue as SPARK-10149, which has something of a PR in motion, 
so let's fold it into that one.

> Locality broken in spark 1.4.x for NewHadoopRDD
> ---
>
> Key: SPARK-10006
> URL: https://issues.apache.org/jira/browse/SPARK-10006
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Rob Russo
>Priority: Critical
>  Labels: performance
>
> After upgrading to Spark 1.4.x, locality seems to be entirely broken for 
> NewHadoopRDD with a Spark cluster that is co-located with an HDFS cluster. 
> Whereas an identical job run in Spark 1.2.x or 1.3.x would, for us, run all 
> partitions with locality level NODE_LOCAL, after upgrading to 1.4.x the 
> locality level switched to ANY for all partitions. 
> Furthermore, it appears to be launching the tasks in order of their locations, 
> or something to that effect, because there are hotspots of one node at a time 
> with completely maxed resources during the read. To test this theory I wrote a 
> job that scans for all the files in the driver, parallelizes the list, and 
> then loads the files back through the Hadoop API in a mapPartitions function 
> (which, correct me if I'm wrong, should be identical to using ANY locality?), 
> and the result was that my hack was 4x faster than letting Spark parse the 
> files itself!
> As for the performance effect, this has caused a 12x slowdown for us from 
> 1.3.1 to 1.4.1. Needless to say, we have downgraded for now and everything 
> appears to work normally again.
> We were able to reproduce this behavior on multiple clusters and also on both 
> Hadoop 2.4 and Hadoop 2.6 (I saw that there were two different code paths, 
> depending on the presence of Hadoop 2.6, for figuring out preferred 
> locations). The only thing that has fixed the problem for us is to downgrade 
> back to 1.3.1.
> Not sure how helpful it will be, but through reflection I checked the result 
> of calling the getPreferredLocations method on the RDD, and it returned an 
> empty List on both 1.3.1, where it works, and 1.4.1, where it doesn't. I also 
> tried calling getPreferredLocs on the SparkContext with the RDD, and that 
> properly gave me back the 3 locations of the partition I passed it, in both 
> 1.3.1 and 1.4.1. So as far as I can tell the logic for getPreferredLocs and 
> getPreferredLocations seems to match across versions, and it appears that the 
> use of this information in the scheduler is what must have changed. However, I 
> could not find many references to either of these two functions, so I was not 
> able to debug much further.
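A hedged sketch of the workaround described in the report (the path and the 
per-file handling are placeholders): list the files on the driver, parallelize 
the list, and read each file through the Hadoop API inside mapPartitions, which 
is effectively ANY locality:

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

val inputDir = new Path("hdfs:///data/input")   // placeholder path
val files = FileSystem.get(sc.hadoopConfiguration)
  .listStatus(inputDir)
  .map(_.getPath.toString)

val lines = sc.parallelize(files, files.length).mapPartitions { paths =>
  val fs = FileSystem.get(new Configuration())  // created on the executor
  paths.flatMap { p =>
    Source.fromInputStream(fs.open(new Path(p))).getLines()
  }
}
{code}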






[jira] [Assigned] (SPARK-10571) Spark REST / JSON API mixes up application names and application ids

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10571:


Assignee: Apache Spark

> Spark REST / JSON API mixes up application names and application ids
> 
>
> Key: SPARK-10571
> URL: https://issues.apache.org/jira/browse/SPARK-10571
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> The REST API documentation 
> (https://spark.apache.org/docs/1.5.0/monitoring.html#rest-api) claims that 
> the URLs incorporate application ids, but it appears that the actual 
> implementation uses application _names_.
> For instance, when I browse to http://localhost:4041/api/v1/applications/ 
> locally with a 1.5.0 spark-shell:
> {code}
> [ {
>   "id" : "Spark shell",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2015-09-11T19:58:26.673GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "sparkUser" : "",
> "completed" : false
>   } ]
> } ]
> {code}
> However, in spark-shell:
> {code}
> scala> sc.applicationId
> res2: String = local-1442001507533
> {code}






[jira] [Assigned] (SPARK-10571) Spark REST / JSON API mixes up application names and application ids

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10571:


Assignee: (was: Apache Spark)

> Spark REST / JSON API mixes up application names and application ids
> 
>
> Key: SPARK-10571
> URL: https://issues.apache.org/jira/browse/SPARK-10571
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>
> The REST API documentation 
> (https://spark.apache.org/docs/1.5.0/monitoring.html#rest-api) claims that 
> the URLs incorporate application ids, but it appears that the actual 
> implementation uses application _names_.
> For instance, when I browse to http://localhost:4041/api/v1/applications/ 
> locally with a 1.5.0 spark-shell:
> {code}
> [ {
>   "id" : "Spark shell",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2015-09-11T19:58:26.673GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "sparkUser" : "",
> "completed" : false
>   } ]
> } ]
> {code}
> However, in spark-shell:
> {code}
> scala> sc.applicationId
> res2: String = local-1442001507533
> {code}






[jira] [Commented] (SPARK-10571) Spark REST / JSON API mixes up application names and application ids

2015-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742558#comment-14742558
 ] 

Apache Spark commented on SPARK-10571:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/8735

> Spark REST / JSON API mixes up application names and application ids
> 
>
> Key: SPARK-10571
> URL: https://issues.apache.org/jira/browse/SPARK-10571
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.5.0
>Reporter: Josh Rosen
>
> The REST API documentation 
> (https://spark.apache.org/docs/1.5.0/monitoring.html#rest-api) claims that 
> the URLs incorporate application ids, but it appears that the actual 
> implementation uses application _names_.
> For instance, when I browse to http://localhost:4041/api/v1/applications/ 
> locally with a 1.5.0 spark-shell:
> {code}
> [ {
>   "id" : "Spark shell",
>   "name" : "Spark shell",
>   "attempts" : [ {
> "startTime" : "2015-09-11T19:58:26.673GMT",
> "endTime" : "1969-12-31T23:59:59.999GMT",
> "sparkUser" : "",
> "completed" : false
>   } ]
> } ]
> {code}
> However, in spark-shell:
> {code}
> scala> sc.applicationId
> res2: String = local-1442001507533
> {code}






[jira] [Commented] (SPARK-10006) Locality broken in spark 1.4.x for NewHadoopRDD

2015-09-13 Thread Daniel Gómez Ferro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742666#comment-14742666
 ] 

Daniel Gómez Ferro commented on SPARK-10006:


I had a similar issue caused by the worker nodes not resolving the hostname 
correctly, using the local IP instead. Could that be your problem? 

I solved it thanks to SPARK-5078: I added 

{code}
SPARK_LOCAL_HOSTNAME=`hostname`
{code}

to my spark-env.sh. 

In SPARK-5113 there seems to be an effort to reach consistency with regard to 
hostnames/IPs.

> Locality broken in spark 1.4.x for NewHadoopRDD
> ---
>
> Key: SPARK-10006
> URL: https://issues.apache.org/jira/browse/SPARK-10006
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Rob Russo
>Priority: Critical
>  Labels: performance
>
> After upgrading to Spark 1.4.x, locality seems to be entirely broken for 
> NewHadoopRDD with a Spark cluster that is co-located with an HDFS cluster. 
> Whereas an identical job run in Spark 1.2.x or 1.3.x would, for us, run all 
> partitions with locality level NODE_LOCAL, after upgrading to 1.4.x the 
> locality level switched to ANY for all partitions. 
> Furthermore, it appears to be launching the tasks in order of their locations, 
> or something to that effect, because there are hotspots of one node at a time 
> with completely maxed resources during the read. To test this theory I wrote a 
> job that scans for all the files in the driver, parallelizes the list, and 
> then loads the files back through the Hadoop API in a mapPartitions function 
> (which, correct me if I'm wrong, should be identical to using ANY locality?), 
> and the result was that my hack was 4x faster than letting Spark parse the 
> files itself!
> As for the performance effect, this has caused a 12x slowdown for us from 
> 1.3.1 to 1.4.1. Needless to say, we have downgraded for now and everything 
> appears to work normally again.
> We were able to reproduce this behavior on multiple clusters and also on both 
> Hadoop 2.4 and Hadoop 2.6 (I saw that there were two different code paths, 
> depending on the presence of Hadoop 2.6, for figuring out preferred 
> locations). The only thing that has fixed the problem for us is to downgrade 
> back to 1.3.1.
> Not sure how helpful it will be, but through reflection I checked the result 
> of calling the getPreferredLocations method on the RDD, and it returned an 
> empty List on both 1.3.1, where it works, and 1.4.1, where it doesn't. I also 
> tried calling getPreferredLocs on the SparkContext with the RDD, and that 
> properly gave me back the 3 locations of the partition I passed it, in both 
> 1.3.1 and 1.4.1. So as far as I can tell the logic for getPreferredLocs and 
> getPreferredLocations seems to match across versions, and it appears that the 
> use of this information in the scheduler is what must have changed. However, I 
> could not find many references to either of these two functions, so I was not 
> able to debug much further.






[jira] [Assigned] (SPARK-10576) Move .java files out of src/main/scala

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10576:


Assignee: (was: Apache Spark)

> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.






[jira] [Commented] (SPARK-10576) Move .java files out of src/main/scala

2015-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742653#comment-14742653
 ] 

Apache Spark commented on SPARK-10576:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/8736

> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.






[jira] [Assigned] (SPARK-10576) Move .java files out of src/main/scala

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10576:


Assignee: Apache Spark

> Move .java files out of src/main/scala
> --
>
> Key: SPARK-10576
> URL: https://issues.apache.org/jira/browse/SPARK-10576
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.5.0
>Reporter: Sean Owen
>Assignee: Apache Spark
>Priority: Minor
>
> (I suppose I'm really asking for an opinion on this, rather than asserting it 
> must be done, but seems worthwhile. CC [~rxin] and [~pwendell])
> As pointed out on the mailing list, there are some Java files in the Scala 
> source tree:
> {code}
> ./bagel/src/main/scala/org/apache/spark/bagel/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/AlphaComponent.java
> ./core/src/main/scala/org/apache/spark/annotation/DeveloperApi.java
> ./core/src/main/scala/org/apache/spark/annotation/Experimental.java
> ./core/src/main/scala/org/apache/spark/annotation/package-info.java
> ./core/src/main/scala/org/apache/spark/annotation/Private.java
> ./core/src/main/scala/org/apache/spark/api/java/package-info.java
> ./core/src/main/scala/org/apache/spark/broadcast/package-info.java
> ./core/src/main/scala/org/apache/spark/executor/package-info.java
> ./core/src/main/scala/org/apache/spark/io/package-info.java
> ./core/src/main/scala/org/apache/spark/rdd/package-info.java
> ./core/src/main/scala/org/apache/spark/scheduler/package-info.java
> ./core/src/main/scala/org/apache/spark/serializer/package-info.java
> ./core/src/main/scala/org/apache/spark/util/package-info.java
> ./core/src/main/scala/org/apache/spark/util/random/package-info.java
> ./external/flume/src/main/scala/org/apache/spark/streaming/flume/package-info.java
> ./external/kafka/src/main/scala/org/apache/spark/streaming/kafka/package-info.java
> ./external/mqtt/src/main/scala/org/apache/spark/streaming/mqtt/package-info.java
> ./external/twitter/src/main/scala/org/apache/spark/streaming/twitter/package-info.java
> ./external/zeromq/src/main/scala/org/apache/spark/streaming/zeromq/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/impl/EdgeActiveness.java
> ./graphx/src/main/scala/org/apache/spark/graphx/lib/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/package-info.java
> ./graphx/src/main/scala/org/apache/spark/graphx/TripletFields.java
> ./graphx/src/main/scala/org/apache/spark/graphx/util/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/attribute/package-info.java
> ./mllib/src/main/scala/org/apache/spark/ml/package-info.java
> ./mllib/src/main/scala/org/apache/spark/mllib/package-info.java
> ./sql/catalyst/src/main/scala/org/apache/spark/sql/types/SQLUserDefinedType.java
> ./sql/hive/src/main/scala/org/apache/spark/sql/hive/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/api/java/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/dstream/package-info.java
> ./streaming/src/main/scala/org/apache/spark/streaming/StreamingContextState.java
> {code}
> It happens to work since the Scala compiler plugin is handling both.
> On its face, they should be in the Java source tree. I'm trying to figure out 
> if there are good reasons they have to be in this less intuitive location.
> I might try moving them just to see.






[jira] [Created] (SPARK-10582) Using dynamic executor allocation, if the AM fails, the new AM will be started, but the new AM does not allocate executors to the driver

2015-09-13 Thread KaiXinXIaoLei (JIRA)
KaiXinXIaoLei created SPARK-10582:
-

 Summary: Using dynamic executor allocation, if the AM fails, the new 
AM will be started, but the new AM does not allocate executors to the driver
 Key: SPARK-10582
 URL: https://issues.apache.org/jira/browse/SPARK-10582
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.1
Reporter: KaiXinXIaoLei
 Fix For: 1.5.1









[jira] [Assigned] (SPARK-10582) Using dynamic executor allocation, if the AM fails, the new AM will be started, but the new AM does not allocate executors to the driver

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10582:


Assignee: Apache Spark

> Using dynamic executor allocation, if the AM fails, the new AM will be 
> started, but the new AM does not allocate executors to the driver
> -
>
> Key: SPARK-10582
> URL: https://issues.apache.org/jira/browse/SPARK-10582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
>Assignee: Apache Spark
> Fix For: 1.5.1
>
>
> Using Spark dynamic executor allocation, if the AM fails while tasks are 
> running, a new AM will be started. But the new AM does not allocate executors 
> for the driver.
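For context, an illustrative configuration (values are examples, not from the 
report) for the scenario described: dynamic executor allocation on YARN, where 
a restarted AM should still register with the driver and grant executors:

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")        // required for dynamic allocation
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "10")
  .set("spark.yarn.maxAppAttempts", "2")               // allows the AM to be restarted
{code}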






[jira] [Assigned] (SPARK-10582) Using dynamic executor allocation, if the AM fails, the new AM will be started, but the new AM does not allocate executors to the driver

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10582:


Assignee: (was: Apache Spark)

> Using dynamic executor allocation, if the AM fails, the new AM will be 
> started, but the new AM does not allocate executors to the driver
> -
>
> Key: SPARK-10582
> URL: https://issues.apache.org/jira/browse/SPARK-10582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.5.1
>
>
> Using Spark dynamic executor allocation, if the AM fails while tasks are 
> running, a new AM will be started. But the new AM does not allocate executors 
> for the driver.






[jira] [Commented] (SPARK-10582) Using dynamic executor allocation, if the AM fails, the new AM will be started, but the new AM does not allocate executors to the driver

2015-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742828#comment-14742828
 ] 

Apache Spark commented on SPARK-10582:
--

User 'KaiXinXiaoLei' has created a pull request for this issue:
https://github.com/apache/spark/pull/8737

> Using dynamic executor allocation, if the AM fails, the new AM will be 
> started, but the new AM does not allocate executors to the driver
> -
>
> Key: SPARK-10582
> URL: https://issues.apache.org/jira/browse/SPARK-10582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.5.1
>
>
> Using Spark dynamic executor allocation, if the AM fails while tasks are 
> running, a new AM will be started. But the new AM does not allocate executors 
> for the driver.






[jira] [Updated] (SPARK-10582) Using dynamic executor allocation, if the AM fails, the new AM will be started, but the new AM does not allocate executors to the driver

2015-09-13 Thread KaiXinXIaoLei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

KaiXinXIaoLei updated SPARK-10582:
--
Description: Using Spark dynamic executor allocation, if the AM fails while 
tasks are running, a new AM will be started. But the new AM does not allocate 
executors for the driver.

> Using dynamic executor allocation, if the AM fails, the new AM will be 
> started, but the new AM does not allocate executors to the driver
> -
>
> Key: SPARK-10582
> URL: https://issues.apache.org/jira/browse/SPARK-10582
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
>Reporter: KaiXinXIaoLei
> Fix For: 1.5.1
>
>
> Using Spark dynamic executor allocation, if the AM fails while tasks are 
> running, a new AM will be started. But the new AM does not allocate executors 
> for the driver.






[jira] [Created] (SPARK-10585) Only copy data once when generating unsafe projection

2015-09-13 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-10585:
---

 Summary: Only copy data once when generating unsafe projection
 Key: SPARK-10585
 URL: https://issues.apache.org/jira/browse/SPARK-10585
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan









[jira] [Commented] (SPARK-9325) Support `collect` on DataFrame columns

2015-09-13 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742860#comment-14742860
 ] 

Felix Cheung commented on SPARK-9325:
-

This turns out not to be straightforward.
Since a Column captures the selection and not the data, there is no obvious way 
to "get the data only for this column". Ideally, this would be implemented as 

{code}
  ages <- collect(select(df, df$Age))
{code}

However, df$Age returns a Column that does not reference the DataFrame, either 
privately or on the JVM side, so it isn't clear how to turn `df$Age` (which 
returns a Column) into `select(df, df$Age)` (which returns a DataFrame).

Any suggestion on how to proceed?


> Support `collect` on DataFrame columns
> --
>
> Key: SPARK-9325
> URL: https://issues.apache.org/jira/browse/SPARK-9325
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is to support code of the form 
> ```
> ages <- collect(df$Age)
> ```
> Right now `df$Age` returns a Column, for which no functions are supported.
> Similarly we might consider supporting `head(df$Age)` etc.






[jira] [Created] (SPARK-10583) Correctness test for Multilayer Perceptron using Weka Reference

2015-09-13 Thread Feynman Liang (JIRA)
Feynman Liang created SPARK-10583:
-

 Summary: Correctness test for Multilayer Perceptron using Weka 
Reference
 Key: SPARK-10583
 URL: https://issues.apache.org/jira/browse/SPARK-10583
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Feynman Liang


SPARK-9471 adds MLP and a [TODO 
item|https://github.com/apache/spark/blob/6add4eddb39e7748a87da3e921ea3c7881d30a82/mllib/src/test/scala/org/apache/spark/ml/ann/ANNSuite.scala#L28]
 to create a test checking the implementation's learned weights against Weka's 
MLP implementation.

We need to add this as a unit test. The work should include, as a comment, the 
reference Weka code that was run.






[jira] [Updated] (SPARK-10585) Only copy data once when generating unsafe projection

2015-09-13 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-10585:

Description: When we have a nested struct, array, or map, we create a byte 
buffer for each of them, copy the data into that buffer first, and then copy it 
into the final row buffer. We can skip the first copy and write the data 
directly into the final row buffer.
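A conceptual illustration of the change (hypothetical buffers, not Spark's 
actual UnsafeProjection code generation): today a nested value is copied into 
its own temporary buffer and then again into the row buffer; the proposal is to 
write it into the row buffer directly:

{code}
import java.nio.ByteBuffer

// Current path: two copies per nested struct/array/map value.
def writeNestedTwice(nested: Array[Byte], row: ByteBuffer): Unit = {
  val tmp = ByteBuffer.allocate(nested.length)  // per-nested-value buffer
  tmp.put(nested)                               // first copy
  tmp.flip()
  row.put(tmp)                                  // second copy, into the row buffer
}

// Proposed path: a single copy straight into the row buffer.
def writeNestedOnce(nested: Array[Byte], row: ByteBuffer): Unit = {
  row.put(nested)
}
{code}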

> Only copy data once when generating unsafe projection
> ---
>
> Key: SPARK-10585
> URL: https://issues.apache.org/jira/browse/SPARK-10585
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>
> When we have a nested struct, array, or map, we create a byte buffer for each 
> of them, copy the data into that buffer first, and then copy it into the final 
> row buffer. We can skip the first copy and write the data directly into the 
> final row buffer.






[jira] [Commented] (SPARK-10447) Upgrade pyspark to use py4j 0.9

2015-09-13 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742866#comment-14742866
 ] 

holdenk commented on SPARK-10447:
-

I've got a PR for this, and I can start the streaming context in the shell and 
do some stuff, but I'm not super sure why the streaming tests aren't working 
with it (I need to do some more digging).

> Upgrade pyspark to use py4j 0.9
> ---
>
> Key: SPARK-10447
> URL: https://issues.apache.org/jira/browse/SPARK-10447
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 1.4.1
>Reporter: Justin Uang
>
> This was recently released, and it has many improvements, especially the 
> following:
> {quote}
> Python side: IDEs and interactive interpreters such as IPython can now get 
> help text/autocompletion for Java classes, objects, and members. This makes 
> Py4J an ideal tool to explore complex Java APIs (e.g., the Eclipse API). 
> Thanks to @jonahkichwacoders
> {quote}
> Normally we wrap all the APIs in spark, but for the ones that aren't, this 
> would make it easier to offroad by using the java proxy objects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8571) spark streaming hanging processes upon build exit

2015-09-13 Thread shane knapp (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742939#comment-14742939
 ] 

shane knapp commented on SPARK-8571:


Yes, it is.  I'll post an update this coming week.



> spark streaming hanging processes upon build exit
> -
>
> Key: SPARK-8571
> URL: https://issues.apache.org/jira/browse/SPARK-8571
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Streaming
> Environment: centos 6.6 amplab build system
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Minor
>  Labels: build, test
>
> Over the past 3 months I've been noticing occasional hanging processes on our 
> build system workers after various Spark builds have finished.  These are all 
> Spark Streaming processes.
> Today I noticed a 3+ hour Spark build that was timed out after 200 minutes 
> (https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/2994/),
>  and the matrix build hadoop.version=2.0.0-mr1-cdh4.1.2 ran on 
> amp-jenkins-worker-02.  After the timeout, it left the following process (and 
> all of its children) hanging.
> The process's CLI command was:
> {quote}
> [root@amp-jenkins-worker-02 ~]# ps auxwww|grep 1714
> jenkins1714  733  2.7 21342148 3642740 ?Sl   07:52 1713:41 java 
> -Dderby.system.durability=test -Djava.awt.headless=true 
> -Djava.io.tmpdir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/tmp
>  -Dspark.driver.allowMultipleContexts=true 
> -Dspark.test.home=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos
>  -Dspark.testing=1 -Dspark.ui.enabled=false 
> -Dspark.ui.showConsoleProgress=false 
> -Dbasedir=/home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming
>  -ea -Xmx3g -XX:MaxPermSize=512m -XX:ReservedCodeCacheSize=512m 
> org.scalatest.tools.Runner -R 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/classes
>  
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/scala-2.10/test-classes
>  -o -f 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/SparkTestSuite.txt
>  -u 
> /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/2.0.0-mr1-cdh4.1.2/label/centos/streaming/target/surefire-reports/.
> {quote}
> Stracing that process doesn't give us much:
> {quote}
> [root@amp-jenkins-worker-02 ~]# strace -p 1714
> Process 1714 attached - interrupt to quit
> futex(0x7ff3cdd269d0, FUTEX_WAIT, 1715, NULL
> {quote}
> Stracing its children gives us a *little* bit more...  some of them loop like this:
> {quote}
> 
> futex(0x7ff3c8012d28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8012f54, FUTEX_WAIT_PRIVATE, 28969, NULL) = 0
> futex(0x7ff3c8012f28, FUTEX_WAKE_PRIVATE, 1) = 0
> futex(0x7ff3c8f17954, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x7ff3c8f17950, 
> {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
> futex(0x7ff3c8f17928, FUTEX_WAKE_PRIVATE, 1) = 1
> futex(0x7ff3c8012d54, FUTEX_WAIT_BITSET_PRIVATE, 1, {2263862, 865233273}, 
> ) = -1 ETIMEDOUT (Connection timed out)
> {quote}
> Others loop on ptrace_attach (no such process) or restart_syscall (resuming an 
> interrupted call).
> Even though this behavior has been solidly pinned to jobs timing out (which 
> ends with an aborted, not failed, build), I've seen it happen for failed builds 
> as well.  If I see any hanging processes from failed (not aborted) builds, I 
> will investigate them and update this bug as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-10584) Documentation about spark.sql.hive.metastore.version is wrong.

2015-09-13 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-10584:
--

 Summary: Documentation about spark.sql.hive.metastore.version is 
wrong.
 Key: SPARK-10584
 URL: https://issues.apache.org/jira/browse/SPARK-10584
 Project: Spark
  Issue Type: Bug
  Components: Documentation, SQL
Affects Versions: 1.5.0
Reporter: Kousuke Saruta
Priority: Minor


The default Hive metastore version is 1.2.1, but the documentation says the 
default of `spark.sql.hive.metastore.version` is 0.13.1.

Also, we cannot get the default value via 
`sqlContext.getConf("spark.sql.hive.metastore.version")`.
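
A quick way to see the second point from a 1.5.0 spark-shell (sketch; `sc` is 
the shell's SparkContext, and the exact error may vary by build):

{code}
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// Reportedly does not return the built-in default ("1.2.1" as of 1.5.0),
// because no default is registered for the key:
// hiveContext.getConf("spark.sql.hive.metastore.version")

// Passing an explicit fallback does work:
hiveContext.getConf("spark.sql.hive.metastore.version", "1.2.1")
{code}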



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10584) Documentation about spark.sql.hive.metastore.version is wrong.

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10584:


Assignee: (was: Apache Spark)

> Documentation about spark.sql.hive.metastore.version is wrong.
> --
>
> Key: SPARK-10584
> URL: https://issues.apache.org/jira/browse/SPARK-10584
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> The default Hive metastore version is 1.2.1, but the documentation says the 
> default of `spark.sql.hive.metastore.version` is 0.13.1.
> Also, we cannot get the default value via 
> `sqlContext.getConf("spark.sql.hive.metastore.version")`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10584) Documentation about spark.sql.hive.metastore.version is wrong.

2015-09-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10584:


Assignee: Apache Spark

> Documentation about spark.sql.hive.metastore.version is wrong.
> --
>
> Key: SPARK-10584
> URL: https://issues.apache.org/jira/browse/SPARK-10584
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> The default Hive metastore version is 1.2.1, but the documentation says the 
> default of `spark.sql.hive.metastore.version` is 0.13.1.
> Also, we cannot get the default value via 
> `sqlContext.getConf("spark.sql.hive.metastore.version")`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10584) Documentation about spark.sql.hive.metastore.version is wrong.

2015-09-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742962#comment-14742962
 ] 

Apache Spark commented on SPARK-10584:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/8739

> Documentation about spark.sql.hive.metastore.version is wrong.
> --
>
> Key: SPARK-10584
> URL: https://issues.apache.org/jira/browse/SPARK-10584
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Affects Versions: 1.5.0
>Reporter: Kousuke Saruta
>Priority: Minor
>
> The default Hive metastore version is 1.2.1, but the documentation says the 
> default of `spark.sql.hive.metastore.version` is 0.13.1.
> Also, we cannot get the default value via 
> `sqlContext.getConf("spark.sql.hive.metastore.version")`.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10577) [PySpark, SQL] DataFrame hint for broadcast join

2015-09-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14742973#comment-14742973
 ] 

Reynold Xin commented on SPARK-10577:
-

Sorry [~maver1ck], that's only available in the DataFrame world, not in SQL yet. 
The SQL query parser we inherit from Hive simply doesn't support this type of 
hint.
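
For reference, the DataFrame-side hint that does exist as of 1.5 (Scala shown; 
`largeDF` and `smallDF` are illustrative names):

{code}
import org.apache.spark.sql.functions.broadcast

// Marks smallDF as a broadcast candidate so the planner prefers a broadcast join.
val joined = largeDF.join(broadcast(smallDF), "id")
{code}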


> [PySpark, SQL] DataFrame hint for broadcast join
> 
>
> Key: SPARK-10577
> URL: https://issues.apache.org/jira/browse/SPARK-10577
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Maciej Bryński
>
> As in https://issues.apache.org/jira/browse/SPARK-8300, it should be possible 
> to add a broadcast join hint in:
> - Spark SQL
> - PySpark



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org