[jira] [Created] (SPARK-2745) Add Java friendly methods to Duration class
Tathagata Das created SPARK-2745: Summary: Add Java friendly methods to Duration class Key: SPARK-2745 URL: https://issues.apache.org/jira/browse/SPARK-2745 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Minor -- This message was sent by Atlassian JIRA (v6.2#6252)
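As a sketch of what "Java friendly" could mean here (the object name Durations and the factory-method names below are assumptions, not confirmed by this ticket): plain factory methods are callable from Java without Scala-specific constructor conventions.

{code}
import org.apache.spark.streaming.Duration

// Hypothetical factory object; a Java caller could write Durations.seconds(1)
// instead of constructing new Duration(1000) directly.
object Durations {
  def milliseconds(count: Long): Duration = new Duration(count)
  def seconds(count: Long): Duration = new Duration(count * 1000)
  def minutes(count: Long): Duration = new Duration(count * 60 * 1000)
}
{code}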
[jira] [Resolved] (SPARK-2260) Spark submit standalone-cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-2260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2260. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1538 [https://github.com/apache/spark/pull/1538] Spark submit standalone-cluster mode is broken -- Key: SPARK-2260 URL: https://issues.apache.org/jira/browse/SPARK-2260 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Assignee: Andrew Or Priority: Blocker Fix For: 1.1.0 Well, it is technically not officially supported... but we should still fix it. In particular, important configs such as spark.master and the application jar are not propagated to the worker nodes properly, due to obvious missing pieces in the code. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2560) Create Spark SQL syntax reference
[ https://issues.apache.org/jira/browse/SPARK-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2560: Priority: Critical (was: Major) Target Version/s: 1.1.0 Create Spark SQL syntax reference - Key: SPARK-2560 URL: https://issues.apache.org/jira/browse/SPARK-2560 Project: Spark Issue Type: Documentation Components: SQL Reporter: Nicholas Chammas Priority: Critical Does Spark SQL support {{LEN()}}? How about {{LIMIT}}? And what about {{MY FAVOURITE SYNTAX}}? Right now there is no reference page to document this. [Hive has one.|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select] Spark SQL should have one, too. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2179) Public API for DataTypes and Schema
[ https://issues.apache.org/jira/browse/SPARK-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2179. - Resolution: Fixed Fix Version/s: 1.1.0 Public API for DataTypes and Schema --- Key: SPARK-2179 URL: https://issues.apache.org/jira/browse/SPARK-2179 Project: Spark Issue Type: Improvement Components: SQL Reporter: Michael Armbrust Assignee: Yin Huai Priority: Critical Fix For: 1.1.0 We want something like the following: * Expose DataType in the SQL package and lock down all the internal details (TypeTags, etc) * Programmatic API for viewing the schema of an RDD as a StructType * Method that creates a schema RDD given (RDD[A], StructType, A => Row) -- This message was sent by Atlassian JIRA (v6.2#6252)
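To make the third bullet concrete, here is a rough sketch of how such an API could look from user code. The method name applySchema and the exact import locations are assumptions based on the ticket's description, not a confirmed signature (sc is a SparkContext in scope):

{code}
import org.apache.spark.sql._

val sqlContext = new SQLContext(sc)
// An explicit schema built from the public DataType API.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age", IntegerType, nullable = true)))
// The A => Row function from the third bullet, applied to raw text lines.
val rowRDD = sc.textFile("people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
// Pair the RDD[Row] with the StructType to get a schema RDD.
val people = sqlContext.applySchema(rowRDD, schema)
{code}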
[jira] [Updated] (SPARK-2543) Allow user to set maximum Kryo buffer size
[ https://issues.apache.org/jira/browse/SPARK-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2543: --- Summary: Allow user to set maximum Kryo buffer size (was: Resizable serialization buffers for kryo) Allow user to set maximum Kryo buffer size -- Key: SPARK-2543 URL: https://issues.apache.org/jira/browse/SPARK-2543 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Kryo supports resizing serialization output buffers with the maxBufferSize parameter of KryoOutput. I suggest we expose this through the config spark.kryoserializer.buffer.max.mb For pull request see: https://github.com/apache/spark/pull/735 -- This message was sent by Atlassian JIRA (v6.2#6252)
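For illustration, setting the key named in this ticket could look like the following; the 512 MB value is an arbitrary example, not a recommendation:

{code}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Allow Kryo to grow its output buffer up to 512 MB before failing a serialization.
  .set("spark.kryoserializer.buffer.max.mb", "512")
val sc = new SparkContext(conf)
{code}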
[jira] [Updated] (SPARK-2178) createSchemaRDD is not thread safe
[ https://issues.apache.org/jira/browse/SPARK-2178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2178: Target Version/s: 1.2.0 (was: 1.1.0) createSchemaRDD is not thread safe -- Key: SPARK-2178 URL: https://issues.apache.org/jira/browse/SPARK-2178 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust This is because implicit type tags are not thread safe. We could fix this with compile time macros (which could also make the conversion a lot faster). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2543) Allow user to set maximum Kryo buffer size
[ https://issues.apache.org/jira/browse/SPARK-2543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2543. Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Fixed via this pull request: https://github.com/apache/spark/pull/735/files Allow user to set maximum Kryo buffer size -- Key: SPARK-2543 URL: https://issues.apache.org/jira/browse/SPARK-2543 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: koert kuipers Assignee: Koert Kuipers Priority: Minor Fix For: 1.1.0 Kryo supports resizing serialization output buffers with the maxBufferSize parameter of KryoOutput. I suggest we expose this through the config spark.kryoserializer.buffer.max.mb For pull request see: https://github.com/apache/spark/pull/735 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2746) Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user
Reynold Xin created SPARK-2746: -- Summary: Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user Key: SPARK-2746 URL: https://issues.apache.org/jira/browse/SPARK-2746 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical dev/run-tests always sets SBT_MAVEN_PROFILES, which is desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2746) Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user
[ https://issues.apache.org/jira/browse/SPARK-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079039#comment-14079039 ] Apache Spark commented on SPARK-2746: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1655 Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user - Key: SPARK-2746 URL: https://issues.apache.org/jira/browse/SPARK-2746 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical dev/run-tests always sets SBT_MAVEN_PROFILES, which is desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2747) git diff --dirstat can miss sql changes and not run Hive tests
Reynold Xin created SPARK-2747: -- Summary: git diff --dirstat can miss sql changes and not run Hive tests Key: SPARK-2747 URL: https://issues.apache.org/jira/browse/SPARK-2747 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical dev/run-tests uses git diff --dirstat master to check whether sql is changed. However, --dirstat won't show sql if sql's change is negligible (e.g. 1k loc change in core, and only 1 loc change in hive). We should use git diff --name-only master instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2747) git diff --dirstat can miss sql changes and not run Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079073#comment-14079073 ] Apache Spark commented on SPARK-2747: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1656 git diff --dirstat can miss sql changes and not run Hive tests -- Key: SPARK-2747 URL: https://issues.apache.org/jira/browse/SPARK-2747 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical dev/run-tests uses git diff --dirstat master to check whether sql is changed. However, --dirstat won't show sql if sql's change is negligible (e.g. 1k loc change in core, and only 1 loc change in hive). We should use git diff --name-only master instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2641) Spark submit doesn't pick up executor instances from properties file
[ https://issues.apache.org/jira/browse/SPARK-2641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079114#comment-14079114 ] Apache Spark commented on SPARK-2641: - User 'kjsingh' has created a pull request for this issue: https://github.com/apache/spark/pull/1657 Spark submit doesn't pick up executor instances from properties file Key: SPARK-2641 URL: https://issues.apache.org/jira/browse/SPARK-2641 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Kanwaljit Singh When running spark-submit in Yarn cluster mode, we provide a properties file using the --properties-file option. spark.executor.instances=5 spark.executor.memory=2120m spark.executor.cores=3 The submitted job picks up the cores and memory, but not the correct number of instances. I think the issue is here in org.apache.spark.deploy.SparkSubmitArguments: // Use properties file as fallback for values which have a direct analog to // arguments in this script. master = Option(master).getOrElse(defaultProperties.get("spark.master").orNull) executorMemory = Option(executorMemory) .getOrElse(defaultProperties.get("spark.executor.memory").orNull) executorCores = Option(executorCores) .getOrElse(defaultProperties.get("spark.executor.cores").orNull) totalExecutorCores = Option(totalExecutorCores) .getOrElse(defaultProperties.get("spark.cores.max").orNull) name = Option(name).getOrElse(defaultProperties.get("spark.app.name").orNull) jars = Option(jars).getOrElse(defaultProperties.get("spark.jars").orNull) Along with these defaults, we should also set a default for instances: numExecutors = Option(numExecutors).getOrElse(defaultProperties.get("spark.executor.instances").orNull) PS: spark.executor.instances is also not mentioned on http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
Sean Owen created SPARK-2748: Summary: Loss of precision for small arguments to Math.exp, Math.log Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
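The effect is easy to reproduce in a Scala REPL:

{code}
val p = 1e-20
math.log(1.0 + p)  // 0.0, because 1.0 + p rounds to exactly 1.0 in double precision
math.log1p(p)      // ~1.0e-20, correct to within machine precision
math.exp(p) - 1.0  // 0.0, for the same reason
math.expm1(p)      // ~1.0e-20
{code}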
[jira] [Commented] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079207#comment-14079207 ] Apache Spark commented on SPARK-2748: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1659 Loss of precision for small arguments to Math.exp, Math.log --- Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079208#comment-14079208 ] Sean Owen commented on SPARK-2748: -- PR: https://github.com/apache/spark/pull/1659 See also: https://github.com/apache/spark/pull/1652 Loss of precision for small arguments to Math.exp, Math.log --- Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
Sean Owen created SPARK-2749: Summary: Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command against master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However, this module doesn't depend on com.novocode:junit-interface. Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2749) Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
[ https://issues.apache.org/jira/browse/SPARK-2749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079231#comment-14079231 ] Apache Spark commented on SPARK-2749: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1660 Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep --- Key: SPARK-2749 URL: https://issues.apache.org/jira/browse/SPARK-2749 Project: Spark Issue Type: Bug Components: Build Affects Versions: 1.0.1 Reporter: Sean Owen Priority: Minor The Maven-based builds in the build matrix have been failing for a few days: https://amplab.cs.berkeley.edu/jenkins/view/Spark/ On inspection, it looks like the Spark SQL Java tests don't compile: https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull I confirmed it by repeating the command against master: mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but com.novocode:junit-interface (the SBT-JUnit bridge) pulls it in, in most places. However, this module doesn't depend on com.novocode:junit-interface. Adding the junit:junit dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via com.novocode:junit-interface, since that is a bit SBT/Scala-specific (and I am not even sure it's needed). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2308) Add KMeans MiniBatch clustering algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-2308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079260#comment-14079260 ] RJ Nowling commented on SPARK-2308: --- Thanks for the clarification. :) I'll run the additional tests to try to answer those questions. I'll also work on trying to implement MiniBatch KMeans as a flag for the current KMeans implementation -- that would be a nicer API. Add KMeans MiniBatch clustering algorithm to MLlib -- Key: SPARK-2308 URL: https://issues.apache.org/jira/browse/SPARK-2308 Project: Spark Issue Type: New Feature Components: MLlib Reporter: RJ Nowling Assignee: RJ Nowling Priority: Minor Attachments: many_small_centers.pdf, uneven_centers.pdf Mini-batch is a version of KMeans that uses a randomly-sampled subset of the data points in each iteration instead of the full set of data points, improving performance (and in some cases, accuracy). The mini-batch version is compatible with the KMeans|| initialization algorithm currently implemented in MLlib. I suggest adding KMeans Mini-batch as an alternative. I'd like this to be assigned to me. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2750) Add Https support for Web UI
WangTaoTheTonic created SPARK-2750: -- Summary: Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: WangTaoTheTonic Now I try to add https support for the web UI using Jetty SSL integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. Users can choose to use HTTP and/or HTTPS to access them. We add some configuration items here; for example, SPARK_MASTER_WEBUI_PORT in system envs claims the https port for the master UI. Different items would be added to control the access way of different processes in system envs, JVM properties and launch args. 2. Users choose the access way according to their configuration. If an http port is assigned, then we start the http service for the web UI. If an https port is assigned, we start https. If both are assigned, we start both. If neither is assigned, we start the http service at the default port, the same as now. 3. We should add some configuration items to state some args for SSL authentication, like keystore location and keystore password in 1-way SSL, and truststore location in 2-way. Users can also choose to switch between 1-way and 2-way. Now I have nearly implemented the functions mentioned above. Here are some questions: We know there are some hyperlinks between Master and Worker, and some from Master to the Spark UI (Driver UI). Now the links are their http addresses. So if we add https to them, what should we do when users click the links? Situations: 1. Master http port to a Worker which opens an https port only. 2. Master https port to a Worker which opens an http port only. Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2315) drop, dropRight and dropWhile which take RDD input and return RDD
[ https://issues.apache.org/jira/browse/SPARK-2315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079300#comment-14079300 ] Erik Erlandson commented on SPARK-2315: --- Updated the PR with a proper lazy-transform implementation: http://erikerlandson.github.io/blog/2014/07/29/deferring-spark-actions-to-lazy-transforms-with-the-promise-rdd/ drop, dropRight and dropWhile which take RDD input and return RDD - Key: SPARK-2315 URL: https://issues.apache.org/jira/browse/SPARK-2315 Project: Spark Issue Type: New Feature Components: Spark Core Reporter: Erik Erlandson Labels: features Last time I loaded in a text file, I found myself wanting to just skip the first element as it was a header. I wrote candidate methods drop, dropRight and dropWhile to satisfy this kind of need: val txt = sc.textFile("text_with_header.txt") val data = txt.drop(1) -- This message was sent by Atlassian JIRA (v6.2#6252)
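For comparison, the usual workaround today, without the proposed drop(), is to skip the first element of the first partition, which for a text file is where the header lives:

{code}
val txt = sc.textFile("text_with_header.txt")
val data = txt.mapPartitionsWithIndex { (i, iter) =>
  if (i == 0) iter.drop(1) else iter  // drop the header line in partition 0 only
}
{code}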
[jira] [Updated] (SPARK-2750) Add Https support for Web UI
[ https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangTaoTheTonic updated SPARK-2750: --- Description: Now I try to add https support for the web UI using Jetty SSL integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. Users can choose to use HTTP and/or HTTPS to access them. We add some configuration items here; for example, SPARK_MASTER_WEBUI_HTTPS_PORT in system envs claims the https port for the master UI. Different items would be added to control the access way of different processes in system envs, JVM properties and launch args. 2. Users choose the access way according to their configuration. If an http port is assigned, then we start the http service for the web UI. If an https port is assigned, we start https. If both are assigned, we start both. If neither is assigned, we start the http service at the default port, the same as now. 3. We should add some configuration items to state some args for SSL authentication, like keystore location and keystore password in 1-way SSL, and truststore location in 2-way. Users can also choose to switch between 1-way and 2-way. Now I have nearly implemented the functions mentioned above. Here are some questions: We know there are some hyperlinks between Master and Worker, and some from Master to the Spark UI (Driver UI). Now the links are their http addresses. So if we add https to them, what should we do when users click the links? Situations: 1. Master http port to a Worker which opens an https port only. 2. Master https port to a Worker which opens an http port only. Any feedback is welcome!
was: Now I try to add https support for the web UI using Jetty SSL integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. Users can choose to use HTTP and/or HTTPS to access them. We add some configuration items here; for example, SPARK_MASTER_WEBUI_PORT in system envs claims the https port for the master UI. Different items would be added to control the access way of different processes in system envs, JVM properties and launch args. 2. Users choose the access way according to their configuration. If an http port is assigned, then we start the http service for the web UI. If an https port is assigned, we start https. If both are assigned, we start both. If neither is assigned, we start the http service at the default port, the same as now. 3. We should add some configuration items to state some args for SSL authentication, like keystore location and keystore password in 1-way SSL, and truststore location in 2-way. Users can also choose to switch between 1-way and 2-way. Now I have nearly implemented the functions mentioned above. Here are some questions: We know there are some hyperlinks between Master and Worker, and some from Master to the Spark UI (Driver UI). Now the links are their http addresses. So if we add https to them, what should we do when users click the links? Situations: 1. Master http port to a Worker which opens an https port only. 2. Master https port to a Worker which opens an http port only. Any feedback is welcome!
Add Https support for Web UI Key: SPARK-2750 URL: https://issues.apache.org/jira/browse/SPARK-2750 Project: Spark Issue Type: New Feature Components: Web UI Reporter: WangTaoTheTonic Labels: https, ssl, webui Original Estimate: 96h Remaining Estimate: 96h Now I try to add https support for the web UI using Jetty SSL integration. Below is the plan: 1. Web UI includes Master UI, Worker UI, HistoryServer UI and Spark UI. Users can choose to use HTTP and/or HTTPS to access them. We add some configuration items here; for example, SPARK_MASTER_WEBUI_HTTPS_PORT in system envs claims the https port for the master UI. Different items would be added to control the access way of different processes in system envs, JVM properties and launch args. 2. Users choose the access way according to their configuration. If an http port is assigned, then we start the http service for the web UI. If an https port is assigned, we start https. If both are assigned, we start both. If neither is assigned, we start the http service at the default port, the same as now. 3. We should add some configuration items to state some args for SSL authentication, like keystore location and keystore password in 1-way SSL, and truststore location in 2-way. Users can also choose to switch between 1-way and 2-way. Now I have nearly implemented the functions mentioned above. Here are some questions: We know there are some hyperlinks between Master and Worker, and some from Master to the Spark UI (Driver UI). Now the links are their http addresses. So if we add https to them, what should we do when users click the links? Situations: 1. Master http port to a Worker which opens an https port only. 2. Master https port to a Worker which opens an http port only. Any feedback is welcome! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2752) spark sql cli should not exit when getting an exception
wangfei created SPARK-2752: -- Summary: spark sql cli should not exit when getting an exception Key: SPARK-2752 URL: https://issues.apache.org/jira/browse/SPARK-2752 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
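The ticket carries no description, but the title suggests a change along these lines: catch per-statement failures in the CLI loop and report them instead of letting them terminate the process. A hedged sketch, where execute is a hypothetical stand-in for the real statement runner:

{code}
def repl(execute: String => Unit): Unit = {
  var line = Console.readLine("spark-sql> ")
  while (line != null) {
    try execute(line)
    catch { case e: Exception => Console.err.println("Error: " + e.getMessage) } // report, don't exit
    line = Console.readLine("spark-sql> ")
  }
}
{code}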
[jira] [Commented] (SPARK-2752) spark sql cli should not exit when getting an exception
[ https://issues.apache.org/jira/browse/SPARK-2752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079370#comment-14079370 ] Apache Spark commented on SPARK-2752: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/1661 spark sql cli should not exit when getting an exception -- Key: SPARK-2752 URL: https://issues.apache.org/jira/browse/SPARK-2752 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: wangfei Fix For: 1.1.0 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2753) Is the --archives option in yarn-cluster mode supposed to uncompress files?
[ https://issues.apache.org/jira/browse/SPARK-2753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] José Manuel Abuín Mosquera updated SPARK-2753: -- Description: Hi all, this is the first issue I've submitted. I googled and searched into the Spark code and arrived here. When passing a tar.gz or a .zip file as the argument to --archives, Spark uploads it to the distributed cache, but it is not uncompressing it. According to the documentation, it is supposed to uncompress it; is this a bug? The launching command is: /opt/spark-1.0.1/bin/spark-submit --class ProlnatSpark --master yarn-cluster --num-executors 32 --driver-library-path /opt/hadoop/hadoop-2.2.0/lib/native/ --driver-memory 390m --executor-memory 890m --executor-cores 1 --archives=Diccionarios.tar.gz --verbose ProlnatSpark.jar Wikipedias/WikipediaPlain.txt saidaWikipediaSpark Neither /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala nor /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala seems to uncompress the files. I hope this helps, thank you very much :) was: Hi all, this is the first issue I've submitted. I googled and searched into the Spark code and arrived here. When passing a tar.gz or a .zip file as the argument to --archives, Spark uploads it to the distributed cache, but it is not uncompressing it. According to the documentation, it is supposed to uncompress it; is this a bug? The launching command is: /opt/spark-1.0.1/bin/spark-submit --class ProlnatSpark --master yarn-cluster --num-executors 32 --driver-library-path /opt/hadoop/hadoop-2.2.0/lib/native/ --driver-memory 390m --executor-memory 890m --executor-cores 1 --archives=Diccionarios.tar.gz --verbose ProlnatSpark.jar Wikipedias/WikipediaPlain.txt saidaWikipediaSpark Neither /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala nor /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala seems to uncompress the files. I hope this helps, than you very much :) Is the --archives option in yarn-cluster mode supposed to uncompress files? - Key: SPARK-2753 URL: https://issues.apache.org/jira/browse/SPARK-2753 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.0.0 Environment: CentOS release 6.5 (64 bits) and Hadoop 2.2.0 Reporter: José Manuel Abuín Mosquera Labels: archives, cache, distributed, yarn Hi all, this is the first issue I've submitted. I googled and searched into the Spark code and arrived here. When passing a tar.gz or a .zip file as the argument to --archives, Spark uploads it to the distributed cache, but it is not uncompressing it. According to the documentation, it is supposed to uncompress it; is this a bug? The launching command is: /opt/spark-1.0.1/bin/spark-submit --class ProlnatSpark --master yarn-cluster --num-executors 32 --driver-library-path /opt/hadoop/hadoop-2.2.0/lib/native/ --driver-memory 390m --executor-memory 890m --executor-cores 1 --archives=Diccionarios.tar.gz --verbose ProlnatSpark.jar Wikipedias/WikipediaPlain.txt saidaWikipediaSpark Neither /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala nor /yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ExecutorRunnableUtil.scala seems to uncompress the files. I hope this helps, thank you very much :) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2748: - Target Version/s: 1.1.0 Loss of precision for small arguments to Math.exp, Math.log --- Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2748) Loss of precision for small arguments to Math.exp, Math.log
[ https://issues.apache.org/jira/browse/SPARK-2748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2748: - Assignee: Sean Owen Loss of precision for small arguments to Math.exp, Math.log --- Key: SPARK-2748 URL: https://issues.apache.org/jira/browse/SPARK-2748 Project: Spark Issue Type: Bug Components: GraphX, MLlib Affects Versions: 1.0.1 Reporter: Sean Owen Assignee: Sean Owen Priority: Minor In a few places in MLlib, an expression of the form log(1.0 + p) is evaluated. When p is so small that 1.0 + p == 1.0, the result is 0.0. However the correct answer is very near p. This is why Math.log1p exists. Similarly for one instance of exp(m) - 1 in GraphX; there's a special Math.expm1 method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also, while we're here, naftaliharris discovered a case in Python where 1 - 1 / (1 + exp(margin)) is less accurate than exp(margin) / (1 + exp(margin)). I don't think there's a JIRA on that one, so maybe this can serve as an umbrella for all of these related issues. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2521) Broadcast RDD object once per TaskSet (instead of sending it for every task)
[ https://issues.apache.org/jira/browse/SPARK-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2521. Resolution: Fixed Fix Version/s: 1.1.0 Broadcast RDD object once per TaskSet (instead of sending it for every task) Key: SPARK-2521 URL: https://issues.apache.org/jira/browse/SPARK-2521 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Reynold Xin Assignee: Reynold Xin Fix For: 1.1.0 Currently (as of Spark 1.0.1), Spark sends the RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send the RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables. The patch uses broadcast to send RDD objects and the closures to executors, and uses Akka to only send a reference to the broadcast RDD/closure along with the partition-specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large. The user-facing impact of the change includes: Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput. In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently). A simple way to test this: {code} val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a); sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count {code} Numbers on 3 r3.8xlarge instances on EC2. master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s; with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2747) git diff --dirstat can miss sql changes and not run Hive tests
[ https://issues.apache.org/jira/browse/SPARK-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2747. Resolution: Fixed Fix Version/s: 1.1.0 git diff --dirstat can miss sql changes and not run Hive tests -- Key: SPARK-2747 URL: https://issues.apache.org/jira/browse/SPARK-2747 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0 dev/run-tests uses git diff --dirstat master to check whether sql is changed. However, --dirstat won't show sql if sql's change is negligible (e.g. 1k loc change in core, and only 1 loc change in hive). We should use git diff --name-only master instead. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079487#comment-14079487 ] Anand Avati commented on SPARK-2707: [~helena_e] - it turned out to be more than just a timeout issue. As described in SPARK-1812 and https://groups.google.com/forum/#!topic/akka-user/cI4CEKEJvfs, this is because of protobuf version mismatch. The combination of https://github.com/avati/spark/commit/f8b5e96fca20c13308cb2a9a6c18049bcdd0a7ba and https://github.com/avati/spark/commit/722aee26399b9bf4b725d17f5cfcfad99464af35 is making akka-2.3 work for me. Upgrade to Akka 2.3 --- Key: SPARK-2707 URL: https://issues.apache.org/jira/browse/SPARK-2707 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.0.0 Reporter: Yardena Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2754) Document standalone-cluster mode now that it's working
Andrew Or created SPARK-2754: Summary: Document standalone-cluster mode now that it's working Key: SPARK-2754 URL: https://issues.apache.org/jira/browse/SPARK-2754 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.1.0 This was previously broken before SPARK-2260, so we (attempted to) remove all documentation related to this mode. We should add it back now that we have fixed it. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2744) The configuration spark.history.retainedApplications is invalid
[ https://issues.apache.org/jira/browse/SPARK-2744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079497#comment-14079497 ] Marcelo Vanzin commented on SPARK-2744: --- Are you sure that option means what you think it means? The History Server will list all applications. It will just retain a max number of them *in memory*. That option does not control how many applications are shown; it controls how much memory the HS will need. The configuration spark.history.retainedApplications is invalid - Key: SPARK-2744 URL: https://issues.apache.org/jira/browse/SPARK-2744 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Labels: historyserver When I set it in spark-env.sh like this: export SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.ui.port=5678 -Dspark.history.retainedApplications=1", the history server web UI shows more than one application -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2755) TorrentBroadcast cannot broadcast very large objects
Xiangrui Meng created SPARK-2755: Summary: TorrentBroadcast cannot broadcast very large objects Key: SPARK-2755 URL: https://issues.apache.org/jira/browse/SPARK-2755 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Xiangrui Meng TorrentBroadcast uses `Utils.serialize` to serialize an object into Array[Byte]. So it cannot handle data of size greater than Int.MaxValue bytes. Instead of serializing the object into Array[Byte] directly, we can use the stream version implemented in HttpBroadcast. -- This message was sent by Atlassian JIRA (v6.2#6252)
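A minimal sketch of the stream-and-chunk idea, assuming nothing about Spark's actual implementation: serialize through an OutputStream that accumulates fixed-size chunks, so no single Array[Byte] is ever bounded by Int.MaxValue.

{code}
import java.io.{ObjectOutputStream, OutputStream}
import scala.collection.mutable.ArrayBuffer

// Accumulates fixed-size chunks; no single byte array needs to exceed blockSize.
class ChunkedOutputStream(blockSize: Int) extends OutputStream {
  private val chunks = ArrayBuffer[Array[Byte]]()
  private var current = new Array[Byte](blockSize)
  private var pos = 0
  override def write(b: Int): Unit = {
    if (pos == blockSize) { chunks += current; current = new Array[Byte](blockSize); pos = 0 }
    current(pos) = b.toByte
    pos += 1
  }
  def toChunks: Array[Array[Byte]] = (chunks :+ current.take(pos)).toArray
}

// Serialize directly into chunks instead of one contiguous Array[Byte].
def blockify(obj: AnyRef, blockSize: Int = 4 * 1024 * 1024): Array[Array[Byte]] = {
  val out = new ChunkedOutputStream(blockSize)
  val oos = new ObjectOutputStream(out)
  oos.writeObject(obj)
  oos.close()
  out.toChunks
}
{code}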
[jira] [Resolved] (SPARK-1630) PythonRDDs don't handle nulls gracefully
[ https://issues.apache.org/jira/browse/SPARK-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-1630. --- Resolution: Won't Fix Based on some discussion in https://github.com/apache/spark/pull/1551, we've decided to hold off on fixing this: this issue only affects users that are calling private APIs and the fix adds complexity and could mask bugs in other parts of the code. PythonRDDs don't handle nulls gracefully Key: SPARK-1630 URL: https://issues.apache.org/jira/browse/SPARK-1630 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 0.9.0, 0.9.1 Reporter: Kalpit Shah Assignee: Davies Liu Original Estimate: 2h Remaining Estimate: 2h If PythonRDDs receive a null element in iterators, they currently throw a NullPointerException. It would be better to log a DEBUG message and skip writing NULL elements. Here are the 2 stack traces: 14/04/22 03:44:19 ERROR executor.Executor: Uncaught exception in thread Thread[stdin writer for python,5,main] java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:267) at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:88) - Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.writeToFile. : java.lang.NullPointerException at org.apache.spark.api.python.PythonRDD$.writeUTF(PythonRDD.scala:273) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:247) at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$2.apply(PythonRDD.scala:246) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:246) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:285) at org.apache.spark.api.python.PythonRDD$.writeToFile(PythonRDD.scala:280) at org.apache.spark.api.python.PythonRDD.writeToFile(PythonRDD.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379) at py4j.Gateway.invoke(Gateway.java:259) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:207) at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2316) StorageStatusListener should avoid O(blocks) operations
[ https://issues.apache.org/jira/browse/SPARK-2316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-2316: - Fix Version/s: 1.1.0 StorageStatusListener should avoid O(blocks) operations --- Key: SPARK-2316 URL: https://issues.apache.org/jira/browse/SPARK-2316 Project: Spark Issue Type: Bug Components: Spark Core, Web UI Affects Versions: 1.0.0 Reporter: Patrick Wendell Assignee: Andrew Or Priority: Critical Fix For: 1.1.0 In the case where jobs are frequently causing dropped blocks the storage status listener can bottleneck. This is slow for a few reasons, one being that we use Scala collection operations, the other being that we operations that are O(number of blocks). I think using a few indices here could make this much faster. {code} at java.lang.Integer.valueOf(Integer.java:642) at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:70) at org.apache.spark.storage.StorageUtils$$anonfun$9.apply(StorageUtils.scala:82) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328) at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327) at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:224) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:403) at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327) at scala.collection.AbstractTraversable.groupBy(Traversable.scala:105) at org.apache.spark.storage.StorageUtils$.rddInfoFromStorageStatus(StorageUtils.scala:82) at org.apache.spark.ui.storage.StorageListener.updateRDDInfo(StorageTab.scala:56) at org.apache.spark.ui.storage.StorageListener.onTaskEnd(StorageTab.scala:67) - locked 0xa27ebe30 (a org.apache.spark.ui.storage.StorageListener) {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
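One way to read the "few indices" suggestion: keep blocks indexed by RDD id so each event touches only that RDD's blocks instead of re-grouping every block. A simplified sketch with stand-in types (block id as String, size as Long), not the actual listener code:

{code}
import scala.collection.mutable

class IndexedStorageStatus {
  private val blocksByRdd = mutable.HashMap[Int, mutable.HashMap[String, Long]]()

  // O(1) per block event, instead of O(all blocks) re-grouping.
  def updateBlock(rddId: Int, blockId: String, memSize: Long): Unit = {
    val blocks = blocksByRdd.getOrElseUpdate(rddId, mutable.HashMap())
    if (memSize > 0) blocks(blockId) = memSize else blocks.remove(blockId)
  }

  // Touches only this RDD's blocks.
  def rddMemUsage(rddId: Int): Long =
    blocksByRdd.get(rddId).map(_.values.sum).getOrElse(0L)
}
{code}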
[jira] [Updated] (SPARK-2736) Create Pyspark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-2736: -- Assignee: Kan Zhang (was: Josh Rosen) Create Pyspark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Assigned] (SPARK-2736) Create Pyspark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-2736: - Assignee: Josh Rosen Create Pyspark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Josh Rosen Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2544) Improve ALS algorithm resource usage
[ https://issues.apache.org/jira/browse/SPARK-2544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2544. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 929 [https://github.com/apache/spark/pull/929] Improve ALS algorithm resource usage Key: SPARK-2544 URL: https://issues.apache.org/jira/browse/SPARK-2544 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Guoqiang Li Assignee: Guoqiang Li Fix For: 1.1.0 The following problems exist in ALS: 1. The dependency chains of the products and users RDDs are too long. 2. The shuffle files are too large. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079704#comment-14079704 ] Apache Spark commented on SPARK-2341: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/1663 loadLibSVMFile doesn't handle regression datasets - Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Assignee: Sean Owen Priority: Minor Labels: easyfix Many datasets exist in LibSVM format for regression tasks [1] but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded but in multiclass mode: each target value is interpreted as a class name! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2736) Create Pyspark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079761#comment-14079761 ] Michael Armbrust commented on SPARK-2736: - Another thing to consider is that Avro would be an ideal fit for SchemaRDDs and then we could reuse the java/python bridge code that is already there. Create Pyspark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2746) Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user
[ https://issues.apache.org/jira/browse/SPARK-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-2746: --- Description: dev/run-tests always sets SBT_MAVEN_PROFILES, which is not desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} was: dev/run-tests always sets SBT_MAVEN_PROFILES, which is desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user - Key: SPARK-2746 URL: https://issues.apache.org/jira/browse/SPARK-2746 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0 dev/run-tests always sets SBT_MAVEN_PROFILES, which is not desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2746) Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user
[ https://issues.apache.org/jira/browse/SPARK-2746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-2746. Resolution: Fixed Fix Version/s: 1.1.0 Target Version/s: 1.1.0 Set SBT_MAVEN_PROFILES only when it is not set explicitly by the user - Key: SPARK-2746 URL: https://issues.apache.org/jira/browse/SPARK-2746 Project: Spark Issue Type: Bug Components: Build Reporter: Reynold Xin Assignee: Reynold Xin Priority: Critical Fix For: 1.1.0 dev/run-tests always sets SBT_MAVEN_PROFILES, which is not desired. As a matter of fact, Jenkins is failing for older Hadoop versions because the YARN profile is always on. {code} export SBT_MAVEN_PROFILES="-Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0" {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags
[ https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079775#comment-14079775 ] Apache Spark commented on SPARK-2664: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/1665 Deal with `--conf` options in spark-submit that relate to flags --- Key: SPARK-2664 URL: https://issues.apache.org/jira/browse/SPARK-2664 Project: Spark Issue Type: Bug Reporter: Patrick Wendell Assignee: Sandy Ryza Priority: Blocker If someone sets a spark conf that relates to an existing flag `--master`, we should set it correctly like we do with the defaults file. Otherwise it can have confusing semantics. I noticed this after merging it, otherwise I would have mentioned it in the review. I think it's as simple as modifying loadDefaults to check the user-supplied options also. We might change it to loadUserProperties since it's no longer just the defaults file. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2735) Remove deprecation in jekyll for pygment in _config.yml
[ https://issues.apache.org/jira/browse/SPARK-2735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079778#comment-14079778 ] Apache Spark commented on SPARK-2735: - User 'RAbraham' has created a pull request for this issue: https://github.com/apache/spark/pull/1666 Remove deprecation in jekyll for pygment in _config.yml --- Key: SPARK-2735 URL: https://issues.apache.org/jira/browse/SPARK-2735 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Rajiv Abraham Priority: Trivial Original Estimate: 1h Remaining Estimate: 1h NOTE: Creating this issue for the patch I am submitting soon. This will be my first pull request. So please let me know if I have missed something Change: Remove following deprecation warning in 'jekyll build' for pygments. Deprecation: The 'pygments' configuration option has been renamed to 'highlighter'. Please update your config file accordingly. The allowed values are 'rouge', 'pygments' or null. Reference: https://github.com/mmistakes/hpstr-jekyll-theme/issues/25. Rajiv -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079795#comment-14079795 ] Brock Noland commented on SPARK-2741: - https://github.com/apache/spark/pull/1667 Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Patrick Wendell Attachments: SPARK-2741.patch The current spark assembly contains Hive. This conflicts with Hive + Spark, which is attempting to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079804#comment-14079804 ] Apache Spark commented on SPARK-2741: - User 'brockn' has created a pull request for this issue: https://github.com/apache/spark/pull/1667 Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Patrick Wendell Attachments: SPARK-2741.patch The current spark assembly contains Hive. This conflicts with Hive + Spark, which is attempting to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2711) Create a ShuffleMemoryManager that allocates across spilling collections in the same task
[ https://issues.apache.org/jira/browse/SPARK-2711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2711: - Priority: Critical (was: Major) Create a ShuffleMemoryManager that allocates across spilling collections in the same task - Key: SPARK-2711 URL: https://issues.apache.org/jira/browse/SPARK-2711 Project: Spark Issue Type: Improvement Reporter: Matei Zaharia Assignee: Matei Zaharia Priority: Critical Right now if there are two ExternalAppendOnlyMaps, they don't compete correctly for memory. This can happen e.g. in a task that is both reducing data from its parent RDD and writing it out to files for a future shuffle, for instance if you do rdd.groupByKey(...).map(...).groupByKey(...) (another key). -- This message was sent by Atlassian JIRA (v6.2#6252)
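A toy version of the allocation policy such a manager could use, purely illustrative and not the eventual implementation: each active consumer may claim up to an equal share of a fixed pool, so two spilling collections cannot starve each other.

{code}
import scala.collection.mutable

class ShuffleMemoryPool(maxBytes: Long) {
  private val claimed = mutable.HashMap[Long, Long]() // consumer id -> bytes held

  // Grant at most the consumer's remaining fair share: maxBytes / active consumers.
  def tryToAcquire(id: Long, numBytes: Long): Long = synchronized {
    claimed.getOrElseUpdate(id, 0L)
    val fairShare = maxBytes / claimed.size
    val granted = math.min(numBytes, math.max(0L, fairShare - claimed(id)))
    claimed(id) += granted
    granted
  }

  def releaseAll(id: Long): Unit = synchronized { claimed.remove(id) }
}
{code}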
[jira] [Commented] (SPARK-2523) For partitioned Hive tables, partition-specific ObjectInspectors should be used.
[ https://issues.apache.org/jira/browse/SPARK-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079881#comment-14079881 ] Apache Spark commented on SPARK-2523: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/1669 For partitioned Hive tables, partition-specific ObjectInspectors should be used. Key: SPARK-2523 URL: https://issues.apache.org/jira/browse/SPARK-2523 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Assignee: Cheng Hao Fix For: 1.1.0 In HiveTableScan.scala, a single ObjectInspector was created for all of the partition-based records, which probably causes a ClassCastException if the object inspector is not identical among table partitions. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2024) Add saveAsSequenceFile to PySpark
[ https://issues.apache.org/jira/browse/SPARK-2024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2024. --- Resolution: Fixed Fix Version/s: 1.1.0 Add saveAsSequenceFile to PySpark - Key: SPARK-2024 URL: https://issues.apache.org/jira/browse/SPARK-2024 Project: Spark Issue Type: New Feature Components: PySpark Reporter: Matei Zaharia Assignee: Kan Zhang Fix For: 1.1.0 After SPARK-1416 we will be able to read SequenceFiles from Python, but writing them remains to be implemented. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2103) Java + Kafka + Spark Streaming NoSuchMethodError in java.lang.Object.<init>
[ https://issues.apache.org/jira/browse/SPARK-2103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2103: - Target Version/s: 1.1.0 Java + Kafka + Spark Streaming NoSuchMethodError in java.lang.Object.<init> --- Key: SPARK-2103 URL: https://issues.apache.org/jira/browse/SPARK-2103 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.0.0 Reporter: Sean Owen This has come up a few times, from user venki-kratos: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-in-KafkaReciever-td2209.html and I ran into it a few weeks ago: http://mail-archives.apache.org/mod_mbox/spark-dev/201405.mbox/%3ccamassdlzs6ihctxepusphryxxa-wp26zgbxx83sm6niro0q...@mail.gmail.com%3E and yesterday user mpieck: {quote} When I use the createStream method from the example class like this: KafkaUtils.createStream(jssc, "zookeeper:port", "test", topicMap); everything is working fine, but when I explicitly specify the message decoder classes used in this method with another overloaded createStream method: KafkaUtils.createStream(jssc, String.class, String.class, StringDecoder.class, StringDecoder.class, props, topicMap, StorageLevels.MEMORY_AND_DISK_2); the application stops with an error: 14/06/10 22:28:06 ERROR kafka.KafkaReceiver: Error receiving data java.lang.NoSuchMethodException: java.lang.Object.<init>(kafka.utils.VerifiableProperties) at java.lang.Class.getConstructor0(Unknown Source) at java.lang.Class.getConstructor(Unknown Source) at org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:108) at org.apache.spark.streaming.dstream.NetworkReceiver.start(NetworkInputDStream.scala:126) {quote} Something is making it try to instantiate java.lang.Object as if it's a Decoder class. I suspect that the problem is to do with https://github.com/apache/spark/blob/master/external/kafka/src/main/scala/org/apache/spark/streaming/kafka/KafkaUtils.scala#L148 {code} implicit val keyCmd: Manifest[U] = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[U]] implicit val valueCmd: Manifest[T] = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[T]] {code} ... where U and T are key/value Decoder types. I don't know enough Scala to fully understand this, but is it possible this causes the reflective call later to lose the type and try to instantiate Object? The AnyRef made me wonder. I am sorry to say I don't have a PR to suggest at this point. -- This message was sent by Atlassian JIRA (v6.2#6252)
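The suspicion above is easy to reproduce outside Spark. A standalone Scala sketch (the FakeDecoder class is a hypothetical stand-in for a Kafka decoder):
{code}
// Casting Manifest[AnyRef] down to Manifest[T] does not change the
// runtime class it carries, so reflection later resolves java.lang.Object
// instead of the intended decoder class.
object ManifestErasureDemo extends App {
  class FakeDecoder(props: String) // stand-in for a Kafka decoder class

  val real = implicitly[Manifest[FakeDecoder]]
  val cast = implicitly[Manifest[AnyRef]].asInstanceOf[Manifest[FakeDecoder]]

  println(real.runtimeClass) // class ManifestErasureDemo$FakeDecoder
  println(cast.runtimeClass) // class java.lang.Object

  // The reflective constructor lookup then fails the same way as the
  // reported error: NoSuchMethodException: java.lang.Object.<init>(...)
  cast.runtimeClass.getConstructor(classOf[String])
}
{code}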
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Target Version/s: 1.1.0 Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. -- This message was sent by Atlassian JIRA (v6.2#6252)
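On the socketStream point, the function the guide needs to explain has this shape; a minimal Scala sketch, where the line-reading converter and the commented-out call are illustrative:
{code}
// socketStream lets the user supply the InputStream => Iterator function
// that turns raw socket bytes into records.
import java.io.{BufferedReader, InputStream, InputStreamReader}
import org.apache.spark.storage.StorageLevel

def lineIterator(in: InputStream): Iterator[String] = {
  val reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))
  Iterator.continually(reader.readLine()).takeWhile(_ != null)
}

// assuming an existing StreamingContext named ssc:
// val stream = ssc.socketStream("localhost", 9999, lineIterator, StorageLevel.MEMORY_AND_DISK_SER_2)
{code}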
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Description: This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. - New Flume and Kinesis stuff. was: This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. - New Flume and Kinesis stuff. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2736) Create Pyspark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kan Zhang updated SPARK-2736: - Summary: Create Pyspark RDD from Apache Avro File (was: Ceeate Pyspark RDD from Apache Avro File) Create Pyspark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
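For background on the mechanism the PR builds on: PySpark's Hadoop input methods accept a pluggable org.apache.spark.api.python.Converter that turns JVM-side records into plain Java types that can be shipped to Python. A rough Scala sketch of such a converter, with field handling deliberately simplified (this is not the PR's code):
{code}
// Simplified sketch of an Avro-to-Java converter for PySpark. Nested and
// complex Avro types would need real handling; this flattens top-level
// fields only.
import scala.collection.JavaConverters._
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroWrapper
import org.apache.spark.api.python.Converter

class AvroRecordToMapConverter extends Converter[Any, java.util.Map[String, Any]] {
  override def convert(obj: Any): java.util.Map[String, Any] = {
    val record = obj.asInstanceOf[AvroWrapper[GenericRecord]].datum()
    val out = new java.util.HashMap[String, Any]()
    for (field <- record.getSchema.getFields.asScala) {
      out.put(field.name(), record.get(field.name()))
    }
    out
  }
}
{code}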
[jira] [Commented] (SPARK-2381) streaming receiver crashed, but seems nothing happened
[ https://issues.apache.org/jira/browse/SPARK-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079978#comment-14079978 ] Tathagata Das commented on SPARK-2381: -- Any updates on this? If not, then I am inclined to close this JIRA. streaming receiver crashed, but seems nothing happened - Key: SPARK-2381 URL: https://issues.apache.org/jira/browse/SPARK-2381 Project: Spark Issue Type: Bug Components: Streaming Reporter: sunsc When we submit a streaming job, if the receivers don't start normally the application should stop itself. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2736) Create PySpark RDD from Apache Avro File
[ https://issues.apache.org/jira/browse/SPARK-2736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kan Zhang updated SPARK-2736: - Summary: Create PySpark RDD from Apache Avro File (was: Create Pyspark RDD from Apache Avro File) Create PySpark RDD from Apache Avro File Key: SPARK-2736 URL: https://issues.apache.org/jira/browse/SPARK-2736 Project: Spark Issue Type: Improvement Components: PySpark Reporter: Eric Garcia Assignee: Kan Zhang Priority: Minor Original Estimate: 4h Remaining Estimate: 4h There is a partially working example Avro Converter at this pull request: https://github.com/apache/spark/pull/1536 It does not fully implement all types in the Avro format and could be cleaned up a little bit. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2012) PySpark StatCounter with numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14079984#comment-14079984 ] Jeremy Freeman commented on SPARK-2012: --- [~davies] cool, that definitely makes sense to me. Shall I put together a PR done that way? PySpark StatCounter with numpy arrays - Key: SPARK-2012 URL: https://issues.apache.org/jira/browse/SPARK-2012 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Jeremy Freeman Priority: Minor In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy arrays just as with an RDD of scalars, which was very useful (e.g. for computing stats on a set of vectors in ML analyses). In 1.0.0 this broke because the added functionality for computing the minimum and maximum, as implemented, doesn't work on arrays. I have a PR ready that re-enables this functionality by having StatCounter use the numpy element-wise functions maximum and minimum, which work on both numpy arrays and scalars (and I've added new tests for this capability). However, I realize this adds a dependency on NumPy outside of MLLib. If that's not ok, maybe it'd be worth adding this functionality as a util within PySpark MLLib? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2012) PySpark StatCounter with numpy arrays
[ https://issues.apache.org/jira/browse/SPARK-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080001#comment-14080001 ] Davies Liu commented on SPARK-2012: --- Yes, plz! PySpark StatCounter with numpy arrays - Key: SPARK-2012 URL: https://issues.apache.org/jira/browse/SPARK-2012 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 1.0.0 Reporter: Jeremy Freeman Priority: Minor In Spark 0.9, the PySpark version of StatCounter worked with an RDD of numpy arrays just as with an RDD of scalars, which was very useful (e.g. for computing stats on a set of vectors in ML analyses). In 1.0.0 this broke because the added functionality for computing the minimum and maximum, as implemented, doesn't work on arrays. I have a PR ready that re-enables this functionality by having StatCounter use the numpy element-wise functions maximum and minimum, which work on both numpy arrays and scalars (and I've added new tests for this capability). However, I realize this adds a dependency on NumPy outside of MLLib. If that's not ok, maybe it'd be worth adding this functionality as a util within PySpark MLLib? -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-546) Support full outer join and multiple join in a single shuffle
[ https://issues.apache.org/jira/browse/SPARK-546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-546: Component/s: Streaming Spark Core Support full outer join and multiple join in a single shuffle - Key: SPARK-546 URL: https://issues.apache.org/jira/browse/SPARK-546 Project: Spark Issue Type: Improvement Components: Spark Core, Streaming Reporter: Reynold Xin RDD[(K,V)] now supports left/right outer join but not full outer join. Also it'd be nice to provide a way for users to join multiple RDDs on the same key in a single shuffle. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1730) Make receiver store data reliably to avoid data-loss on executor failures
[ https://issues.apache.org/jira/browse/SPARK-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1730: - Assignee: Hari Shreedharan (was: Tathagata Das) Make receiver store data reliably to avoid data-loss on executor failures - Key: SPARK-1730 URL: https://issues.apache.org/jira/browse/SPARK-1730 Project: Spark Issue Type: Sub-task Components: Streaming Affects Versions: 1.0.0 Reporter: Tathagata Das Assignee: Hari Shreedharan -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Priority: Major (was: Critical) Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. - New Flume and Kinesis stuff. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2419) Misc updates to streaming programming guide
[ https://issues.apache.org/jira/browse/SPARK-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2419: - Priority: Critical (was: Major) Misc updates to streaming programming guide --- Key: SPARK-2419 URL: https://issues.apache.org/jira/browse/SPARK-2419 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Tathagata Das Assignee: Tathagata Das Priority: Critical This JIRA collects together a number of small issues that should be added to the streaming programming guide - Receivers consume an executor slot and highlight the fact that # cores > # receivers is necessary - Classes of spark-streaming-XYZ cannot be accessed from the Spark Shell - Deploying and using spark-streaming-XYZ requires spark-streaming-XYZ.jar and its dependencies to be packaged with the application JAR - Ordering and parallelism of the output operations - Receivers should be serializable - Add more information on how socketStream works: input stream => iterator function. - New Flume and Kinesis stuff. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2463) Creating multiple StreamingContexts from shell generates duplicate Streaming tabs in UI
[ https://issues.apache.org/jira/browse/SPARK-2463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2463: - Target Version/s: 1.2.0 Creating multiple StreamingContexts from shell generates duplicate Streaming tabs in UI --- Key: SPARK-2463 URL: https://issues.apache.org/jira/browse/SPARK-2463 Project: Spark Issue Type: Bug Components: Streaming, Web UI Affects Versions: 1.0.1 Reporter: Nicholas Chammas Start a {{StreamingContext}} from the interactive shell and then stop it. Go to {{http://master_url:4040/streaming/}} and you will see a tab in the UI for Streaming. Now from the same shell, create and start a new {{StreamingContext}}. There will now be a duplicate tab for Streaming in the UI. Repeating this process generates additional Streaming tabs. They all link to the same information. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1312) Batch should read based on the batch interval provided in the StreamingContext
[ https://issues.apache.org/jira/browse/SPARK-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1312: - Target Version/s: 1.2.0 Assignee: Tathagata Das Batch should read based on the batch interval provided in the StreamingContext -- Key: SPARK-1312 URL: https://issues.apache.org/jira/browse/SPARK-1312 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0 Reporter: Sanjay Awatramani Assignee: Tathagata Das Priority: Minor Labels: sliding, streaming, window This problem primarily affects sliding window operations in spark streaming. Consider the following scenario: - a DStream is created from any source. (I've checked with file and socket) - No actions are applied to this DStream - Sliding Window operation is applied to this DStream and an action is applied to the sliding window. In this case, Spark will not even read the input stream in the batch in which the sliding interval isn't a multiple of the batch interval. Put another way, it won't read the input when it doesn't have to apply the window function. This is happening because all transformations in Spark are lazy. How to fix this or work around it (see line 3):
{code}
JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 * 60 * 1000));
JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
inputStream.print(); // This is the workaround
JavaDStream<String> objWindow = inputStream.window(new Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
objWindow.dstream().saveAsTextFiles("/Output", "");
{code}
The Window operations example in the streaming guide implies that Spark will read the stream in every batch, which is not happening because of the lazy transformations. Wherever sliding window would be used, in most of the cases, no actions will be taken on the pre-window batch, hence my gut feeling was that Streaming would read every batch if any actions are being taken in the windowed stream. As per Tathagata, "Ideally every batch should read based on the batch interval provided in the StreamingContext." Refer to the original thread on http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Window-operations-do-not-work-as-documented-tp2999.html for more details, including Tathagata's conclusion. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080035#comment-14080035 ] Ted Malaska commented on SPARK-2447: The build is fixed and the pull request is updated. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080055#comment-14080055 ] Ted Malaska commented on SPARK-2447: OK, had a status meeting with TD.
1. 2447 will be pushed past 1.1
2. Focus on these tasks:
2.1. Java
2.2. More unit testing
2.3. Partitioned Put
2.4. Partitioned Sorted Get
2.5. BulkCheckPut
2.6. BulkLoad
Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
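To ground the task list, here is a rough Scala sketch of the bulkPut pattern under discussion, written against the HBase 0.96-era client API; the method shape and column names are illustrative assumptions, not the proposed interface:
{code}
// One HBase connection per partition, client-side buffering enabled, and
// an explicit flush before closing -- the throughput-oriented shape the
// issue describes.
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.rdd.RDD

def bulkPut(rdd: RDD[(String, String)], tableName: String): Unit = {
  rdd.foreachPartition { rows =>
    val table = new HTable(HBaseConfiguration.create(), tableName)
    table.setAutoFlush(false)                 // buffer puts client-side
    rows.foreach { case (rowKey, value) =>
      val put = new Put(Bytes.toBytes(rowKey))
      put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(value))
      table.put(put)
    }
    table.flushCommits()                      // push the buffered puts
    table.close()
  }
}
{code}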
[jira] [Updated] (SPARK-1642) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083
[ https://issues.apache.org/jira/browse/SPARK-1642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1642: - Target Version/s: 1.2.0 (was: 1.1.0) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-2083 --- Key: SPARK-1642 URL: https://issues.apache.org/jira/browse/SPARK-1642 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor This will add support for SSL encryption between Flume AvroSink and Spark Streaming. It is based on FLUME-2083 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia updated SPARK-2447: - Target Version/s: 1.2.0 (was: 1.1.0) Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-944) Give example of writing to HBase from Spark Streaming
[ https://issues.apache.org/jira/browse/SPARK-944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-944: Target Version/s: 1.2.0 (was: 1.1.0) Give example of writing to HBase from Spark Streaming - Key: SPARK-944 URL: https://issues.apache.org/jira/browse/SPARK-944 Project: Spark Issue Type: New Feature Components: Streaming Reporter: Patrick Wendell Assignee: Tathagata Das Attachments: MetricAggregatorHBase.scala -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2492) KafkaReceiver minor changes to align with Kafka 0.8
[ https://issues.apache.org/jira/browse/SPARK-2492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2492: - Target Version/s: 1.1.0 Fix Version/s: (was: 1.1.0) KafkaReceiver minor changes to align with Kafka 0.8 Key: SPARK-2492 URL: https://issues.apache.org/jira/browse/SPARK-2492 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 1.0.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Minor Update to delete Zookeeper metadata when Kafka's parameter auto.offset.reset is set to "smallest", which is aligned with Kafka 0.8's ConsoleConsumer. Also use the API Kafka offers rather than using zkClient directly. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2507) Compile error of streaming project with 2.0.0-cdh4.6.0
[ https://issues.apache.org/jira/browse/SPARK-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080111#comment-14080111 ] Tathagata Das commented on SPARK-2507: -- This was solved in PR https://github.com/apache/spark/pull/153 Compile error of streaming project with 2.0.0-cdh4.6.0 -- Key: SPARK-2507 URL: https://issues.apache.org/jira/browse/SPARK-2507 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0 Environment: RedHat 5.3 2.0.0-cdh4.6.0 enable yarn java version 1.6.0_45 Reporter: James Z.M. Gao Priority: Minor Hi, When compiling with {quote} ./make-distribution.sh --hadoop 2.0.0-cdh4.6.0 --with-yarn --tgz {quote} I get the following errors in the streaming Java API: {quote} Version is 0.9.0-incubating Making spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz Hadoop version set to 2.0.0-cdh4.6.0 YARN enabled [info] Loading project definition from /root/spark-source/project/project [info] Loading project definition from /root/spark-source/project [info] Set current project to root (in build file:/root/spark-source/) [info] Compiling 1 Scala source to /root/spark-source/streaming/target/scala-2.10/classes... [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:57: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:60: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def cache(): JavaPairDStream[K, V] = dstream.cache() [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:63: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def persist(): JavaPairDStream[K, V] = dstream.persist() [error] ^ .. [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, (com.google.common.base.Optional[V], W))] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,(com.google.common.base.Optional[V], W)] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)} [error] ^ [error] 44 errors found [error] (streaming/compile:compile) Compilation failed {quote} Here is a simple PR to fix this problem: https://github.com/apache/spark/pull/153 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2756) Decision Tree bugs
Joseph K. Bradley created SPARK-2756: Summary: Decision Tree bugs Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
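To illustrate Bug 1, here is a toy Scala sketch of the two flat-index layouts; the array shapes are hypothetical, and only the coordinate-order mismatch matters:
{code}
// Writer and reader agree on the arithmetic but not on what the last two
// coordinates mean, so aggregate cells get scrambled.
object IndexMismatchSketch {
  val numFeatures = 3; val numBins = 4; val numClasses = 5

  // updateBinForUnorderedFeature wrote at (node, feature, featureValue, binIndex):
  def writeIndex(node: Int, feature: Int, featureValue: Int, bin: Int): Int =
    ((node * numFeatures + feature) * numBins + featureValue) * numClasses + bin

  // the rest of the code read at (node, feature, binIndex, label):
  def readIndex(node: Int, feature: Int, bin: Int, label: Int): Int =
    ((node * numFeatures + feature) * numBins + bin) * numClasses + label

  // A count stored for (featureValue = v, bin = b) is read back as if it
  // were (bin = v, label = b) -- the wrong cell unless the two happen to coincide.
}
{code}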
[jira] [Updated] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2756: - Assignee: Joseph K. Bradley Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080134#comment-14080134 ] Apache Spark commented on SPARK-2756: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/1673 Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080136#comment-14080136 ] Joseph K. Bradley commented on SPARK-2756: -- Submitted [https://github.com/apache/spark/pull/1673] with bug fixes. Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Issue Comment Deleted] (SPARK-2756) Decision Tree bugs
[ https://issues.apache.org/jira/browse/SPARK-2756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-2756: - Comment: was deleted (was: Submitted [https://github.com/apache/spark/pull/1673] with bug fixes.) Decision Tree bugs -- Key: SPARK-2756 URL: https://issues.apache.org/jira/browse/SPARK-2756 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Joseph K. Bradley Assignee: Joseph K. Bradley 2 bugs: Bug 1: Indexing is inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true). * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where ** featureValue was from arr (so it was a feature value) ** binIndex was in [0,…, 2^(maxFeatureValue-1)-1) * The rest of the code indexed agg as (node, feature, binIndex, label). Bug 2: calculateGainForSplit (for classification): * It returns dummy prediction values when either the right or left child had 0 weight. These are incorrect for multiclass classification. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2507) Compile error of streaming project with 2.0.0-cdh4.6.0
[ https://issues.apache.org/jira/browse/SPARK-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2507: - Affects Version/s: 1.0.2 1.0.1 Compile error of streaming project with 2.0.0-cdh4.6.0 -- Key: SPARK-2507 URL: https://issues.apache.org/jira/browse/SPARK-2507 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1, 1.0.2 Environment: RedHat 5.3 2.0.0-cdh4.6.0 enable yarn java version 1.6.0_45 Reporter: James Z.M. Gao Priority: Minor Hi, When compiling with {quote} ./make-distribution.sh --hadoop 2.0.0-cdh4.6.0 --with-yarn --tgz {quote} I get the following errors in the streaming Java API: {quote} Version is 0.9.0-incubating Making spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz Hadoop version set to 2.0.0-cdh4.6.0 YARN enabled [info] Loading project definition from /root/spark-source/project/project [info] Loading project definition from /root/spark-source/project [info] Set current project to root (in build file:/root/spark-source/) [info] Compiling 1 Scala source to /root/spark-source/streaming/target/scala-2.10/classes... [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:57: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:60: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def cache(): JavaPairDStream[K, V] = dstream.cache() [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:63: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def persist(): JavaPairDStream[K, V] = dstream.persist() [error] ^ .. [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, (com.google.common.base.Optional[V], W))] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,(com.google.common.base.Optional[V], W)] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)} [error] ^ [error] 44 errors found [error] (streaming/compile:compile) Compilation failed {quote} Here is a simple PR to fix this problem: https://github.com/apache/spark/pull/153 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2507) Compile error of streaming project with 2.0.0-cdh4.6.0
[ https://issues.apache.org/jira/browse/SPARK-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-2507. -- Resolution: Fixed Compile error of streaming project with 2.0.0-cdh4.6.0 -- Key: SPARK-2507 URL: https://issues.apache.org/jira/browse/SPARK-2507 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1, 1.0.2 Environment: RedHat 5.3 2.0.0-cdh4.6.0 enable yarn java version 1.6.0_45 Reporter: James Z.M. Gao Priority: Minor Hi, When compiling with {quote} ./make-distribution.sh --hadoop 2.0.0-cdh4.6.0 --with-yarn --tgz {quote} I get the following errors in the streaming Java API: {quote} Version is 0.9.0-incubating Making spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz Hadoop version set to 2.0.0-cdh4.6.0 YARN enabled [info] Loading project definition from /root/spark-source/project/project [info] Loading project definition from /root/spark-source/project [info] Set current project to root (in build file:/root/spark-source/) [info] Compiling 1 Scala source to /root/spark-source/streaming/target/scala-2.10/classes... [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:57: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:60: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def cache(): JavaPairDStream[K, V] = dstream.cache() [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:63: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def persist(): JavaPairDStream[K, V] = dstream.persist() [error] ^ .. [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, (com.google.common.base.Optional[V], W))] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,(com.google.common.base.Optional[V], W)] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)} [error] ^ [error] 44 errors found [error] (streaming/compile:compile) Compilation failed {quote} Here is a simple PR to fix this problem: https://github.com/apache/spark/pull/153 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2507) Compile error of streaming project with 2.0.0-cdh4.6.0
[ https://issues.apache.org/jira/browse/SPARK-2507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-2507: - Target Version/s: 1.1.0 (was: 0.9.0, 0.9.1, 1.0.0) Compile error of streaming project with 2.0.0-cdh4.6.0 -- Key: SPARK-2507 URL: https://issues.apache.org/jira/browse/SPARK-2507 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0, 1.0.1, 1.0.2 Environment: RedHat 5.3 2.0.0-cdh4.6.0 enable yarn java version 1.6.0_45 Reporter: James Z.M. Gao Priority: Minor Hi, When compiling with {quote} ./make-distribution.sh --hadoop 2.0.0-cdh4.6.0 --with-yarn --tgz {quote} I get the following errors in the streaming Java API: {quote} Version is 0.9.0-incubating Making spark-0.9.0-incubating-hadoop_2.0.0-cdh4.6.0-bin.tar.gz Hadoop version set to 2.0.0-cdh4.6.0 YARN enabled [info] Loading project definition from /root/spark-source/project/project [info] Loading project definition from /root/spark-source/project [info] Set current project to root (in build file:/root/spark-source/) [info] Compiling 1 Scala source to /root/spark-source/streaming/target/scala-2.10/classes... [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:57: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] dstream.filter((x => f(x).booleanValue())) [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:60: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def cache(): JavaPairDStream[K, V] = dstream.cache() [error] ^ [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:63: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, V)] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,V] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] def persist(): JavaPairDStream[K, V] = dstream.persist() [error] ^ .. [error] /root/spark-source/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaPairDStream.scala:669: type mismatch; [error] found : org.apache.spark.streaming.dstream.DStream[(K, (com.google.common.base.Optional[V], W))] [error] required: org.apache.spark.streaming.api.java.JavaPairDStream[K,(com.google.common.base.Optional[V], W)] [error] Note: implicit method fromPairDStream is not applicable here because it comes after the application point and it lacks an explicit result type [error] joinResult.mapValues{case (v, w) => (JavaUtils.optionToOptional(v), w)} [error] ^ [error] 44 errors found [error] (streaming/compile:compile) Compilation failed {quote} Here is a simple PR to fix this problem: https://github.com/apache/spark/pull/153 -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2757) Add Mima test for Spark Sink after 1.1.0 is released
Hari Shreedharan created SPARK-2757: --- Summary: Add Mima test for Spark Sink after 1.1.0 is released Key: SPARK-2757 URL: https://issues.apache.org/jira/browse/SPARK-2757 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.1.0 Reporter: Hari Shreedharan Fix For: 1.2.0 We are adding it in 1.1.0, so it is excluded from Mima right now. Once we release 1.1.0, we should add it to Mima so we do binary compatibility checks. -- This message was sent by Atlassian JIRA (v6.2#6252)
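For context, Spark tracks known binary-compatibility exceptions in project/MimaExcludes.scala with entries like the following; this is a hypothetical example of the format only, since the sink module may simply be left out of the set of projects Mima checks until 1.1.0 ships:
{code}
// Hypothetical example of a MimaExcludes entry; the class name here is
// illustrative, not the actual exclusion used for the sink module.
import com.typesafe.tools.mima.core.{MissingClassProblem, ProblemFilters}

object SinkExcludesSketch {
  val excludes = Seq(
    ProblemFilters.exclude[MissingClassProblem]("org.apache.spark.streaming.flume.sink.SparkSink")
  )
}
{code}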
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080166#comment-14080166 ] Ted Malaska commented on SPARK-2447: Hey Matei, Let's do a WebEx or something in the near future. I would love to get more of your input. Here are my answers to your questions above:
1. Yes I can do Python
2. Yes I can do that. So to be clear, the bulkGet and scan will return a fixed (Array[Byte], Array[(Array[Byte], Array[Byte], Array[Byte], Long)]) for (rowKey, Array[(columnFamily, column, value, timestamp)])
2.1 As for the bulkPut/Increment/Delete/CheckPut I think we need to give the user freedom to interact with the raw API. I have no problem building a simpler interface for the 80% use case but I don't want to fail the 20%.
3. The lowest version is 0.96. The reason is there was a major API change from 0.94 to 0.96+. So if we need to support 0.94 and below we need to make a different code base.
Let me know if this answers your questions and let me know if there is anything else I can do. I have learned so much from TD and I have grown so much from this process. Ted Malaska Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2758) UnionRDD's UnionPartition should not reference parent RDDs
Reynold Xin created SPARK-2758: -- Summary: UnionRDD's UnionPartition should not reference parent RDDs Key: SPARK-2758 URL: https://issues.apache.org/jira/browse/SPARK-2758 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0, 1.0.1, 1.0.2 Reporter: Reynold Xin Assignee: Reynold Xin UnionPartition has a non-transient field referencing the parent RDD, to be used in compute (iterator). That causes some trouble with task size because partition objects are supposed to be small. -- This message was sent by Atlassian JIRA (v6.2#6252)
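A minimal Scala sketch of the usual fix pattern (simplified, not the actual patch): mark the RDD reference transient and resolve the parent's Partition eagerly on the driver, so serializing the task ships only the small partition object.
{code}
// Simplified illustration: the parent RDD is not serialized with the
// task; only the parent's Partition (small) travels with it.
import org.apache.spark.Partition
import org.apache.spark.rdd.RDD

class UnionPartitionSketch[T](
    idx: Int,
    @transient private val rdd: RDD[T],  // dropped during serialization
    val parentPartitionIndex: Int)
  extends Partition {

  // resolved eagerly on the driver, before the task is shipped
  val parentPartition: Partition = rdd.partitions(parentPartitionIndex)

  override def index: Int = idx
}
{code}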
[jira] [Commented] (SPARK-2447) Add common solution for sending upsert actions to HBase (put, deletes, and increment)
[ https://issues.apache.org/jira/browse/SPARK-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080179#comment-14080179 ] Patrick Wendell commented on SPARK-2447: This is not entirely a duplicate, but it's similar to SPARK-1127. Add common solution for sending upsert actions to HBase (put, deletes, and increment) - Key: SPARK-2447 URL: https://issues.apache.org/jira/browse/SPARK-2447 Project: Spark Issue Type: New Feature Components: Spark Core, Streaming Reporter: Ted Malaska Assignee: Ted Malaska Going to review the design with Tdas today. But the first thought is to have an extension of VoidFunction that handles the connection to HBase and allows for options such as turning auto-flush off for higher throughput. Need to answer the following questions first. - Can it be written in Java or should it be written in Scala? - What is the best way to add the HBase dependency? (will review how Flume does this as the first option) - What is the best way to do testing? (will review how Flume does this as the first option) - How to support Python? (Python may be a different JIRA; it is unknown at this time) Goals: - Simple to use - Stable - Supports high load - Documented (may be in a separate JIRA; need to ask Tdas) - Supports Java, Scala, and hopefully Python - Supports Streaming and normal Spark -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2758) UnionRDD's UnionPartition should not reference parent RDDs
[ https://issues.apache.org/jira/browse/SPARK-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080184#comment-14080184 ] Apache Spark commented on SPARK-2758: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/1675 UnionRDD's UnionPartition should not reference parent RDDs -- Key: SPARK-2758 URL: https://issues.apache.org/jira/browse/SPARK-2758 Project: Spark Issue Type: Improvement Affects Versions: 1.0.0, 1.0.1, 1.0.2 Reporter: Reynold Xin Assignee: Reynold Xin UnionPartition has a non-transient field referencing the parent RDD, to be used in compute (iterator). That causes some trouble with task size because partition objects are supposed to be small. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-2706: -- Attachment: spark-hive.err This file shows the error I got after applying the tentative patch. Enable Spark to support Hive 0.13 - Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao Attachments: spark-hive.err It seems Spark cannot work well with Hive 0.13. When I compiled Spark with Hive 0.13.1, I got some error messages, as attached below. So, when can Spark be enabled to support Hive 0.13? Compile errors: {quote} [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveContext.scala:180: type mismatch; found : String required: Array[String] [ERROR] val proc: CommandProcessor = CommandProcessorFactory.get(tokens(0), hiveconf) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:264: overloaded method constructor TableDesc with alternatives: (x$1: Class[_ <: org.apache.hadoop.mapred.InputFormat[_, _]],x$2: Class[_],x$3: java.util.Properties)org.apache.hadoop.hive.ql.plan.TableDesc and ()org.apache.hadoop.hive.ql.plan.TableDesc cannot be applied to (Class[org.apache.hadoop.hive.serde2.Deserializer], Class[(some other)?0(in value tableDesc)(in value tableDesc)], Class[?0(in value tableDesc)(in value tableDesc)], java.util.Properties) [ERROR] val tableDesc = new TableDesc( [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala:140: value getPartitionPath is not a member of org.apache.hadoop.hive.ql.metadata.Partition [ERROR] val partPath = partition.getPartitionPath [ERROR]^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScan.scala:132: value appendReadColumnNames is not a member of object org.apache.hadoop.hive.serde2.ColumnProjectionUtils [ERROR] ColumnProjectionUtils.appendReadColumnNames(hiveConf, attributes.map(_.name)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:79: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:132: type mismatch; found : org.apache.hadoop.fs.Path required: String [ERROR] SparkHiveHadoopWriter.createPathFromString(fileSinkConf.getDirName, conf)) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala:179: value getExternalTmpFileURI is not a member of org.apache.hadoop.hive.ql.Context [ERROR] val tmpLocation = hiveContext.getExternalTmpFileURI(tableLocation) [ERROR] ^ [ERROR] /ws/spark/sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUdfs.scala:209: org.apache.hadoop.hive.common.type.HiveDecimal does not have a constructor [ERROR] case bd: BigDecimal => new HiveDecimal(bd.underlying()) [ERROR] ^ [ERROR] 8 errors found [DEBUG] Compilation failed (CompilerInterface) [INFO] [INFO] Reactor Summary: [INFO] [INFO] Spark Project Parent POM .. SUCCESS [2.579s] [INFO] Spark Project Core SUCCESS [2:39.805s] [INFO] Spark Project Bagel ... SUCCESS [21.148s] [INFO] Spark Project GraphX .. SUCCESS [59.950s] [INFO] Spark Project ML Library .. SUCCESS [1:08.771s] [INFO] Spark Project Streaming ... SUCCESS [1:17.759s] [INFO] Spark Project Tools ... 
SUCCESS [15.405s] [INFO] Spark Project Catalyst SUCCESS [1:17.405s] [INFO] Spark Project SQL . SUCCESS [1:11.094s] [INFO] Spark Project Hive FAILURE [11.121s] [INFO] Spark Project REPL SKIPPED [INFO] Spark Project YARN Parent POM . SKIPPED [INFO] Spark Project YARN Stable API . SKIPPED [INFO] Spark Project Assembly SKIPPED [INFO] Spark Project External Twitter
[jira] [Updated] (SPARK-2706) Enable Spark to support Hive 0.13
[ https://issues.apache.org/jira/browse/SPARK-2706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-2706: -- Attachment: spark-2706-v1.txt Tentative patch. I copied the Hive 0.13.1 artifacts to the local maven repo manually. Enable Spark to support Hive 0.13 - Key: SPARK-2706 URL: https://issues.apache.org/jira/browse/SPARK-2706 Project: Spark Issue Type: Dependency upgrade Components: SQL Affects Versions: 1.0.1 Reporter: Chunjun Xiao Attachments: spark-2706-v1.txt, spark-hive.err It seems Spark cannot work well with Hive 0.13. When I compiled Spark against Hive 0.13.1, I got the error messages attached; the full compile log is quoted in the update above. So, when can Spark be enabled to support Hive 0.13?
[jira] [Commented] (SPARK-1021) sortByKey() launches a cluster job when it shouldn't
[ https://issues.apache.org/jira/browse/SPARK-1021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080250#comment-14080250 ] Erik Erlandson commented on SPARK-1021: --- I deferred the computation of the partition bounds this way, and it seems to work properly in my testing and the unit tests: https://github.com/erikerlandson/spark/compare/erikerlandson:rdd_drop_master...spark-1021 sortByKey() launches a cluster job when it shouldn't Key: SPARK-1021 URL: https://issues.apache.org/jira/browse/SPARK-1021 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.8.0, 0.9.0 Reporter: Andrew Ash Assignee: Mark Hamstra Labels: starter The sortByKey() method is listed as a transformation, not an action, in the documentation, but it launches a cluster job regardless. http://spark.incubator.apache.org/docs/latest/scala-programming-guide.html Some discussion on the mailing list suggested that this is a problem with the rdd.count() call inside Partitioner.scala's rangeBounds method. https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L102 Josh Rosen suggests that rangeBounds should be made into a lazy variable: {quote} I wonder whether making RangePartitioner.rangeBounds into a lazy val would fix this (https://github.com/apache/incubator-spark/blob/6169fe14a140146602fb07cfcd13eee6efad98f9/core/src/main/scala/org/apache/spark/Partitioner.scala#L95). We'd need to make sure that rangeBounds() is never called before an action is performed. This could be tricky because it's called in the RangePartitioner.equals() method. Maybe it's sufficient to just compare the number of partitions, the ids of the RDDs used to create the RangePartitioner, and the sort ordering. This still supports the case where I range-partition one RDD and pass the same partitioner to a different RDD. It breaks support for the case where two range partitioners created on different RDDs happened to have the same rangeBounds(), but it seems unlikely that this would really harm performance, since range partitioners are rarely equal by chance. {quote} Can we please make this happen? I'll send a PR on GitHub to start the discussion and testing. -- This message was sent by Atlassian JIRA (v6.2#6252)
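To make the suggestion concrete, here is a minimal, self-contained sketch of the lazy-bounds idea (my own toy illustration, not Spark's actual RangePartitioner): the expensive bounds computation sits behind a lazy val, so constructing the partitioner is cheap, and equals() compares only cheap identifying fields so it can never force the computation.
{code}
// Toy stand-in for RangePartitioner: `data` models the RDD sample Spark
// would collect; evaluating it is the "cluster job" we want to defer.
class LazyRangePartitioner(val numPartitions: Int, data: => Seq[Int]) {

  // Deferred until first use: nothing is computed at construction time,
  // which is what keeps sortByKey() from launching a job eagerly.
  private lazy val rangeBounds: Array[Int] = {
    val sorted = data.sorted // forces `data`; assumes it is non-empty
    (1 until numPartitions).map(i => sorted(i * sorted.size / numPartitions)).toArray
  }

  // First call forces rangeBounds; in Spark this would happen during the
  // shuffle write of an actual action, not at transformation time.
  def getPartition(key: Int): Int = {
    var p = 0
    while (p < rangeBounds.length && key > rangeBounds(p)) p += 1
    p
  }

  // Per Josh Rosen's suggestion: compare cheap fields only, so equality
  // checks cannot trigger the deferred computation.
  override def equals(other: Any): Boolean = other match {
    case o: LazyRangePartitioner => o.numPartitions == numPartitions
    case _ => false
  }
  override def hashCode: Int = numPartitions
}
{code}
Constructing the partitioner then returns immediately; the first getPartition call is what forces the (possibly expensive) evaluation of `data`.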
[jira] [Resolved] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2741. Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1667 [https://github.com/apache/spark/pull/1667] Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Patrick Wendell Fix For: 1.1.0 Attachments: SPARK-2741.patch The current Spark assembly contains Hive. This conflicts with Hive-on-Spark, which attempts to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2741: --- Assignee: Brock Noland (was: Patrick Wendell) Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Brock Noland Fix For: 1.1.0 Attachments: SPARK-2741.patch The current Spark assembly contains Hive. This conflicts with Hive-on-Spark, which attempts to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2741) Publish version of spark assembly which does not contain Hive
[ https://issues.apache.org/jira/browse/SPARK-2741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080264#comment-14080264 ] Brock Noland commented on SPARK-2741: - Thanks guys!! Publish version of spark assembly which does not contain Hive - Key: SPARK-2741 URL: https://issues.apache.org/jira/browse/SPARK-2741 Project: Spark Issue Type: Task Reporter: Brock Noland Assignee: Brock Noland Fix For: 1.1.0 Attachments: SPARK-2741.patch The current Spark assembly contains Hive. This conflicts with Hive-on-Spark, which attempts to use its own version of Hive. We'll need to publish a version of the assembly which does not contain the Hive jars. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1647) Prevent data loss when Streaming driver goes down
[ https://issues.apache.org/jira/browse/SPARK-1647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1647: - Fix Version/s: (was: 1.1.0) Prevent data loss when Streaming driver goes down - Key: SPARK-1647 URL: https://issues.apache.org/jira/browse/SPARK-1647 Project: Spark Issue Type: Bug Components: Streaming Reporter: Hari Shreedharan Currently, when the driver goes down, any uncheckpointed data is lost from within Spark. If the system from which messages are pulled can replay messages, the data may be available, but for some systems, like Flume, this is not the case. Also, all windowing information is lost for windowing functions. We must persist the raw data somehow and be able to replay it if required. We must also persist the windowing information with the data itself. This will likely require quite a bit of work and will probably have to be split into several sub-JIRAs. -- This message was sent by Atlassian JIRA (v6.2#6252)
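As a rough illustration of the "persist raw data and replay it" requirement, here is a toy write-ahead-log sketch. It is an assumption about one possible shape, not the design this issue eventually produced: each received block is flushed to durable storage before the source is acknowledged, so a restarted driver can replay unprocessed blocks.
{code}
// Toy block log: length-prefixed records appended to a local file. A real
// implementation would write to HDFS/S3 and track which blocks were already
// processed, so only the tail needs replaying after a driver restart.
import java.io.{DataInputStream, DataOutputStream, EOFException, FileInputStream, FileOutputStream}

class BlockLog(path: String) {
  private val out = new DataOutputStream(new FileOutputStream(path, true)) // append mode

  def append(block: Array[Byte]): Unit = synchronized {
    out.writeInt(block.length)
    out.write(block)
    out.flush() // only ack the source (e.g. Flume) after this returns
  }

  def close(): Unit = out.close()
}

object BlockLog {
  // Replay every block in the log, e.g. on driver restart.
  def replay(path: String)(handle: Array[Byte] => Unit): Unit = {
    val in = new DataInputStream(new FileInputStream(path))
    try {
      while (true) {
        val len = in.readInt() // throws EOFException at the end of the log
        val buf = new Array[Byte](len)
        in.readFully(buf)
        handle(buf)
      }
    } catch { case _: EOFException => () } finally in.close()
  }
}
{code}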
[jira] [Updated] (SPARK-1478) Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915
[ https://issues.apache.org/jira/browse/SPARK-1478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1478: - Target Version/s: 1.2.0 Upgrade FlumeInputDStream's FlumeReceiver to support FLUME-1915 --- Key: SPARK-1478 URL: https://issues.apache.org/jira/browse/SPARK-1478 Project: Spark Issue Type: Improvement Components: Streaming Reporter: Ted Malaska Assignee: Ted Malaska Priority: Minor FLUME-1915 added support for compression over the wire from the Avro sink to the Avro source. I would like to add this functionality to the FlumeReceiver. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1600) flaky test case in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1600: - Fix Version/s: (was: 1.1.0) flaky test case in streaming.CheckpointSuite Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Nan Zhu The test case "recovery with file input stream" sometimes fails when Jenkins is very busy, even when the change under test is unrelated. I have hit it 3 times and have also seen it in other places; the latest example is https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/, where the modification touches only YARN-related files. I once reported it on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1600) flaky test case in streaming.CheckpointSuite
[ https://issues.apache.org/jira/browse/SPARK-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1600: - Target Version/s: 1.2.0 flaky test case in streaming.CheckpointSuite Key: SPARK-1600 URL: https://issues.apache.org/jira/browse/SPARK-1600 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Nan Zhu The test case "recovery with file input stream" sometimes fails when Jenkins is very busy, even when the change under test is unrelated. I have hit it 3 times and have also seen it in other places; the latest example is https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/14397/, where the modification touches only YARN-related files. I once reported it on the dev mailing list: http://apache-spark-developers-list.1001551.n3.nabble.com/a-weird-test-case-in-Streaming-td6116.html -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1409) Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite
[ https://issues.apache.org/jira/browse/SPARK-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1409: - Target Version/s: 1.2.0 Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite - Key: SPARK-1409 URL: https://issues.apache.org/jira/browse/SPARK-1409 Project: Spark Issue Type: Bug Components: Streaming Reporter: Michael Armbrust Assignee: Tathagata Das Here are just a few cases: https://travis-ci.org/apache/spark/jobs/22151827 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1409) Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite
[ https://issues.apache.org/jira/browse/SPARK-1409?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-1409: - Fix Version/s: (was: 1.1.0) Flaky Test: actor input stream test in org.apache.spark.streaming.InputStreamsSuite - Key: SPARK-1409 URL: https://issues.apache.org/jira/browse/SPARK-1409 Project: Spark Issue Type: Bug Components: Streaming Reporter: Michael Armbrust Assignee: Tathagata Das Here are just a few cases: https://travis-ci.org/apache/spark/jobs/22151827 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13709/ -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2759) The ability to read binary files into Spark
Kevin Mader created SPARK-2759: -- Summary: The ability to read binary files into Spark Key: SPARK-2759 URL: https://issues.apache.org/jira/browse/SPARK-2759 Project: Spark Issue Type: Improvement Components: Input/Output, Java API, Spark Core Reporter: Kevin Mader For reading images, compressed files, or other custom formats, it would be useful to have methods that could read files in as a byte array or DataInputStream so that other functions could then process the data. -- This message was sent by Atlassian JIRA (v6.2#6252)
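As a sketch of the requested API shape (the helper name is hypothetical, not a proposed implementation), the toy below produces (path, bytes) pairs. It reads whole files on the driver and then parallelizes them, which illustrates the interface but would not scale; a real version would read on the executors via a Hadoop InputFormat.
{code}
// Hypothetical helper illustrating the desired API: an RDD of
// (file path, file contents as bytes). Driver-side reads only; a scalable
// version would use a whole-file Hadoop InputFormat so executors do the I/O.
import java.nio.file.{Files, Paths}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object BinaryFilesSketch {
  def binaryFiles(sc: SparkContext, paths: Seq[String]): RDD[(String, Array[Byte])] = {
    // Read each file into memory on the driver, then distribute the pairs.
    val contents = paths.map(p => (p, Files.readAllBytes(Paths.get(p))))
    sc.parallelize(contents)
  }
}
{code}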
[jira] [Created] (SPARK-2760) Caching tables from multiple databases does not work
Michael Armbrust created SPARK-2760: --- Summary: Caching tables from multiple databases does not work Key: SPARK-2760 URL: https://issues.apache.org/jira/browse/SPARK-2760 Project: Spark Issue Type: Bug Affects Versions: 1.1.0 Reporter: Michael Armbrust Priority: Critical -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2734) DROP TABLE should also uncache table
[ https://issues.apache.org/jira/browse/SPARK-2734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-2734. - Resolution: Fixed Fix Version/s: 1.1.0 DROP TABLE should also uncache table Key: SPARK-2734 URL: https://issues.apache.org/jira/browse/SPARK-2734 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.2 Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Critical Fix For: 1.1.0 Steps to reproduce:
{code}
hql("CREATE TABLE test(a INT)")
hql("CACHE TABLE test")
hql("DROP TABLE test")
hql("SELECT * FROM test")
{code}
-- This message was sent by Atlassian JIRA (v6.2#6252)
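The intended semantics, sketched with stand-in names (not Spark SQL's actual internals): the drop path must also evict any cached copy of the table, so the final SELECT fails cleanly instead of reading stale in-memory data.
{code}
// Stand-in names only; the point is that dropping a table always evicts
// its cached copy as well as its catalog entry.
import scala.collection.mutable

class CatalogSketch {
  type Row = Seq[Any]
  private val metastore = mutable.Map.empty[String, Seq[Row]]
  private val cachedTables = mutable.Map.empty[String, Seq[Row]]

  def createTable(name: String, rows: Seq[Row]): Unit = metastore(name) = rows

  def cacheTable(name: String): Unit =
    cachedTables(name) = metastore(name) // snapshot into the "in-memory" cache

  def select(name: String): Seq[Row] =
    cachedTables.getOrElse(name, metastore(name)) // cache wins if present

  def dropTable(name: String): Unit = {
    cachedTables.remove(name) // without this, select() after a drop would
    metastore.remove(name)    // still return stale cached rows: the bug above
  }
}
{code}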
[jira] [Updated] (SPARK-1580) [MLlib] ALS: Estimate communication and computation costs given a partitioner
[ https://issues.apache.org/jira/browse/SPARK-1580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tor Myklebust updated SPARK-1580: - Summary: [MLlib] ALS: Estimate communication and computation costs given a partitioner (was: ALS: Estimate communication and computation costs given a partitioner) [MLlib] ALS: Estimate communication and computation costs given a partitioner - Key: SPARK-1580 URL: https://issues.apache.org/jira/browse/SPARK-1580 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Tor Myklebust Priority: Minor It would be nice to be able to estimate the amount of work needed to solve an ALS problem. The chief components of this work are computation time (time spent forming and solving the least squares problems) and communication cost (the number of bytes sent across the network). Communication cost depends heavily on how the users and products are partitioned. We currently do not try to cluster users or products so that fewer feature vectors need to be communicated. This is intended as a first step toward that end: we ought to be able to tell whether one partitioning is better than another. -- This message was sent by Atlassian JIRA (v6.2#6252)
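A back-of-the-envelope version of the communication side of such an estimate, as a hedged sketch (my own toy model, not this issue's eventual design): count, for each user, how many distinct product partitions need that user's feature vector, since each such pair costs one vector transfer per ALS iteration.
{code}
// Toy ALS communication model: `ratings` is the observed (user, product)
// pairs, `prodPart` assigns each product to a partition, and `rank` is the
// latent factor dimension (8 bytes per Double factor).
object AlsCommCost {
  def commCostBytes(ratings: Seq[(Int, Int)], prodPart: Int => Int, rank: Int): Long = {
    // Each distinct (user, destination partition) pair is one vector send.
    val sends = ratings.map { case (u, p) => (u, prodPart(p)) }.distinct.size
    sends.toLong * 8L * rank
  }
}
{code}
Comparing this number under two candidate partitioners is exactly the "is one partitioning better than another" test the description asks for.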
[jira] [Resolved] (SPARK-2341) loadLibSVMFile doesn't handle regression datasets
[ https://issues.apache.org/jira/browse/SPARK-2341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2341. -- Resolution: Fixed Fix Version/s: 1.1.0 Issue resolved by pull request 1663 [https://github.com/apache/spark/pull/1663] loadLibSVMFile doesn't handle regression datasets - Key: SPARK-2341 URL: https://issues.apache.org/jira/browse/SPARK-2341 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.0.0 Reporter: Eustache Assignee: Sean Owen Priority: Minor Labels: easyfix Fix For: 1.1.0 Many datasets exist in LibSVM format for regression tasks [1], but currently the loadLibSVMFile primitive doesn't handle regression datasets. More precisely, the LabelParser is either a MulticlassLabelParser or a BinaryLabelParser. What happens then is that the file is loaded, but in multiclass mode: each target value is interpreted as a class name! The fix would be to write a RegressionLabelParser which converts target values to Double and plug it into the loadLibSVMFile routine. [1] http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/regression.html -- This message was sent by Atlassian JIRA (v6.2#6252)
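A sketch of the proposed parser, following the description's naming (the actual MLlib LabelParser trait may differ slightly): where the binary parser maps labels to 0/1, a regression parser simply keeps the raw target.
{code}
// Assumed shape of the parser interface; MLlib's real trait may differ.
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

// Proposed fix: pass the LibSVM target through as a Double, e.g. "3.7" => 3.7,
// instead of coercing it to a class label.
object RegressionLabelParser extends LabelParser {
  override def parse(labelString: String): Double = labelString.toDouble
}
{code}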
[jira] [Commented] (SPARK-2654) Leveled logging in PySpark
[ https://issues.apache.org/jira/browse/SPARK-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14080311#comment-14080311 ] Michael Yannakopoulos commented on SPARK-2654: -- Hi Davies, I can work on this issue. Please fill me in about the logging system that PySpark uses and where I can start contributing. Thanks! Leveled logging in PySpark -- Key: SPARK-2654 URL: https://issues.apache.org/jira/browse/SPARK-2654 Project: Spark Issue Type: Improvement Reporter: Davies Liu Add more leveled logging in PySpark; the logging level should be easily controlled via configuration and command-line arguments. -- This message was sent by Atlassian JIRA (v6.2#6252)