[jira] [Assigned] (SPARK-12710) Create local CoGroup operator

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12710:


Assignee: (was: Apache Spark)

> Create local CoGroup operator
> -
>
> Key: SPARK-12710
> URL: https://issues.apache.org/jira/browse/SPARK-12710
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Mao, Wei
>







[jira] [Commented] (SPARK-12710) Create local CoGroup operator

2016-01-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088857#comment-15088857
 ] 

Apache Spark commented on SPARK-12710:
--

User 'mwws' has created a pull request for this issue:
https://github.com/apache/spark/pull/10662

> Create local CoGroup operator
> -
>
> Key: SPARK-12710
> URL: https://issues.apache.org/jira/browse/SPARK-12710
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Mao, Wei
>







[jira] [Assigned] (SPARK-12710) Create local CoGroup operator

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12710:


Assignee: Apache Spark

> Create local CoGroup operator
> -
>
> Key: SPARK-12710
> URL: https://issues.apache.org/jira/browse/SPARK-12710
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Mao, Wei
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-1270) An optimized gradient descent implementation

2016-01-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088869#comment-15088869
 ] 

Apache Spark commented on SPARK-1270:
-

User 'yoshidakuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/10663

> An optimized gradient descent implementation
> 
>
> Key: SPARK-1270
> URL: https://issues.apache.org/jira/browse/SPARK-1270
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Xusen Yin
>  Labels: GradientDescent, MLLib,
>
> The current implementation of GradientDescent is inefficient in some respects, 
> especially on high-latency networks. I propose a new implementation of 
> GradientDescent, which follows a parallelism model called 
> GradientDescentWithLocalUpdate, inspired by Jeff Dean's DistBelief and Eric 
> Xing's SSP. With a few modifications of runMiniBatchSGD, the 
> GradientDescentWithLocalUpdate can outperform the original sequential version 
> by about 4x without sacrificing accuracy, and can be easily adopted by most 
> classification and regression algorithms in MLlib.
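A rough sketch of the local-update idea described above (illustrative only: the gradient, the naive partition averaging, and all parameter names below are assumptions, not the code from the linked pull request):

{code}
import org.apache.spark.rdd.RDD

// Each partition takes several local SGD steps on its own copy of the weights
// before the driver averages the partition models (DistBelief/SSP flavoured).
def localUpdateEpoch(
    data: RDD[(Double, Array[Double])],  // (label, features)
    weights: Array[Double],
    stepSize: Double,
    localIters: Int): Array[Double] = {
  val n = weights.length
  val models = data.mapPartitions { iter =>
    val w = weights.clone()              // local copy per partition
    val points = iter.toArray
    for (_ <- 0 until localIters; (label, x) <- points) {
      // simple least-squares gradient as a stand-in for MLlib's Gradient classes
      val err = (0 until n).map(i => w(i) * x(i)).sum - label
      for (i <- 0 until n) w(i) -= stepSize * err * x(i)
    }
    Iterator(w)
  }.collect()
  // naive averaging; the actual proposal synchronizes more carefully (SSP)
  val avg = new Array[Double](n)
  for (w <- models; i <- 0 until n) avg(i) += w(i) / models.length
  avg
}
{code}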






[jira] [Commented] (SPARK-12708) Sorting task error in Stages Page when yarn mode

2016-01-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088898#comment-15088898
 ] 

Apache Spark commented on SPARK-12708:
--

User 'yoshidakuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/10663

> Sorting task error in Stages Page when yarn mode
> 
>
> Key: SPARK-12708
> URL: https://issues.apache.org/jira/browse/SPARK-12708
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Koyo Yoshida
>Priority: Minor
>







[jira] [Assigned] (SPARK-12708) Sorting task error in Stages Page when yarn mode

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12708:


Assignee: Apache Spark

> Sorting task error in Stages Page when yarn mode
> 
>
> Key: SPARK-12708
> URL: https://issues.apache.org/jira/browse/SPARK-12708
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Koyo Yoshida
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Assigned] (SPARK-12708) Sorting task error in Stages Page when yarn mode

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12708:


Assignee: (was: Apache Spark)

> Sorting task error in Stages Page when yarn mode
> 
>
> Key: SPARK-12708
> URL: https://issues.apache.org/jira/browse/SPARK-12708
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
>Reporter: Koyo Yoshida
>Priority: Minor
>







[jira] [Resolved] (SPARK-12692) Scala style: check no white space before comma and colon

2016-01-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-12692.
-
   Resolution: Fixed
 Assignee: Kousuke Saruta
Fix Version/s: 2.0.0

> Scala style: check no white space before comma and colon
> 
>
> Key: SPARK-12692
> URL: https://issues.apache.org/jira/browse/SPARK-12692
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 2.0.0
>
>
> We should not put whitespace before `,` and `:`, so let's check for it.
> Because there are lots of style violations, I'd first like to add a checker, 
> enable it, and set the level to `warn`.
> Then I'd like to fix the style step by step.
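For illustration, the kind of code the proposed checker would flag versus accept (example snippets, not taken from the patch):

{code}
// would be flagged: whitespace before ':' and ','
def distanceBad(x : Double , y: Double): Double = math.abs(x - y)

// compliant with the proposed rule
def distanceGood(x: Double, y: Double): Double = math.abs(x - y)
{code}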






[jira] [Reopened] (SPARK-12692) Scala style: check no white space before comma and colon

2016-01-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin reopened SPARK-12692:
-

Reopening this since we still need to fix the violations.


> Scala style: check no white space before comma and colon
> 
>
> Key: SPARK-12692
> URL: https://issues.apache.org/jira/browse/SPARK-12692
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 2.0.0
>
>
> We should not put whitespace before `,` and `:`, so let's check for it.
> Because there are lots of style violations, I'd first like to add a checker, 
> enable it, and set the level to `warn`.
> Then I'd like to fix the style step by step.






[jira] [Assigned] (SPARK-12692) Scala style: check no white space before comma and colon

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12692:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Scala style: check no white space before comma and colon
> 
>
> Key: SPARK-12692
> URL: https://issues.apache.org/jira/browse/SPARK-12692
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
> Fix For: 2.0.0
>
>
> We should not put whitespace before `,` and `:`, so let's check for it.
> Because there are lots of style violations, I'd first like to add a checker, 
> enable it, and set the level to `warn`.
> Then I'd like to fix the style step by step.






[jira] [Assigned] (SPARK-12692) Scala style: check no white space before comma and colon

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12692:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Scala style: check no white space before comma and colon
> 
>
> Key: SPARK-12692
> URL: https://issues.apache.org/jira/browse/SPARK-12692
> Project: Spark
>  Issue Type: Sub-task
>  Components: Project Infra
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
> Fix For: 2.0.0
>
>
> We should not put whitespace before `,` and `:`, so let's check for it.
> Because there are lots of style violations, I'd first like to add a checker, 
> enable it, and set the level to `warn`.
> Then I'd like to fix the style step by step.






[jira] [Closed] (SPARK-11214) Join with Unicode-String results wrong empty

2016-01-08 Thread Hans Fischer (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Fischer closed SPARK-11214.

Resolution: Fixed

Update: The problem only occurs when you require a lot of RAM; 30 GB or less 
is fine. That was my learning. 

> Join with Unicode-String results wrong empty
> 
>
> Key: SPARK-11214
> URL: https://issues.apache.org/jira/browse/SPARK-11214
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Hans Fischer
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.5.1
>
>
> I created a join that should clearly result in a single row, but it returns 
> an empty result. Could someone validate this bug?
> hiveContext.sql('SELECT * FROM (SELECT "c" AS a) AS a JOIN (SELECT "c" AS b) 
> AS b ON a.a = b.b').take(10)
> result: []
> kind regards
> Hans






[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it

2016-01-08 Thread Ajesh Kumar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088933#comment-15088933
 ] 

Ajesh Kumar commented on SPARK-12622:
-

Can you provide the steps to recreate the issue?

> spark-submit fails on executors when jar has a space in it
> --
>
> Key: SPARK-12622
> URL: https://issues.apache.org/jira/browse/SPARK-12622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
> Environment: Linux, Mesos 
>Reporter: Adrian Bridgett
>Priority: Minor
>
> spark-submit --class foo "Foo.jar"  works
> but when using "f oo.jar" it starts to run and then breaks on the executors 
> as they cannot find the various functions.
> Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this 
> fails immediately.
> {noformat}
> spark-submit --class Foo /tmp/f\ oo.jar
> ...
> spark.jars=file:/tmp/f%20oo.jar
> 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at 
> http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769
> 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 
> (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: 
> Foo$$anonfun$46
> {noformat}
> SPARK-6568 is related but maybe specific to the Windows environment






[jira] [Commented] (SPARK-12675) Executor dies because of ClassCastException and causes timeout

2016-01-08 Thread Alexandru Rosianu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088949#comment-15088949
 ] 

Alexandru Rosianu commented on SPARK-12675:
---

I'd like to add that this doesn't happen in cluster-mode due to [this 
check|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L433].

> Executor dies because of ClassCastException and causes timeout
> --
>
> Key: SPARK-12675
> URL: https://issues.apache.org/jira/browse/SPARK-12675
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 2.0.0
> Environment: 64-bit Linux Ubuntu 15.10, 16GB RAM, 8 cores 3ghz
>Reporter: Alexandru Rosianu
>Priority: Minor
>
> I'm trying to fit a Spark ML pipeline but my executor dies. Here's the script 
> which doesn't work (a bit simplified):
> {code:title=Script.scala}
> // Prepare data sets
> logInfo("Getting datasets")
> val emoTrainingData = 
> sqlc.read.parquet("/tw/sentiment/emo/parsed/data.parquet")
> val trainingData = emoTrainingData
> // Configure the pipeline
> val pipeline = new Pipeline().setStages(Array(
>   new 
> FeatureReducer().setInputCol("raw_text").setOutputCol("reduced_text"),
>   new StringSanitizer().setInputCol("reduced_text").setOutputCol("text"),
>   new Tokenizer().setInputCol("text").setOutputCol("raw_words"),
>   new StopWordsRemover().setInputCol("raw_words").setOutputCol("words"),
>   new HashingTF().setInputCol("words").setOutputCol("features"),
>   new NaiveBayes().setSmoothing(0.5).setFeaturesCol("features"),
>   new ColumnDropper().setDropColumns("raw_text", "reduced_text", "text", 
> "raw_words", "words", "features")
> ))
> // Fit the pipeline
> logInfo(s"Training model on ${trainingData.count()} rows")
> val model = pipeline.fit(trainingData)
> {code}
> It executes up to the last line. It prints "Training model on xx rows", then 
> it starts fitting, the executor dies, the driver doesn't receive heartbeats 
> from the executor and it times out, then the script exits. It doesn't get 
> past that line.
> This is the exception that kills the executor:
> {code}
> java.io.IOException: java.lang.ClassCastException: cannot assign instance 
> of scala.collection.immutable.HashMap$SerializationProxy to field 
> org.apache.spark.executor.TaskMetrics._accumulatorUpdates of type 
> scala.collection.immutable.Map in instance of 
> org.apache.spark.executor.TaskMetrics
>   at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1207)
>   at 
> org.apache.spark.executor.TaskMetrics.readObject(TaskMetrics.scala:219)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:497)
>   at 
> java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1058)
>   at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
>   at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
>   at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
>   at java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
>   at org.apache.spark.util.Utils$.deserialize(Utils.scala:92)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:436)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1$$anonfun$apply$6.apply(Executor.scala:426)
>   at scala.Option.foreach(Option.scala:257)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:426)
>   at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$reportHeartBeat$1.apply(Executor.scala:424)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:742)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:424)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:468)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>   at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:468)
>   at 

[jira] [Assigned] (SPARK-12709) Create local Except operator

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12709:


Assignee: (was: Apache Spark)

> Create local Except operator
> 
>
> Key: SPARK-12709
> URL: https://issues.apache.org/jira/browse/SPARK-12709
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Mao, Wei
>







[jira] [Commented] (SPARK-12709) Create local Except operator

2016-01-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088853#comment-15088853
 ] 

Apache Spark commented on SPARK-12709:
--

User 'mwws' has created a pull request for this issue:
https://github.com/apache/spark/pull/10661

> Create local Except operator
> 
>
> Key: SPARK-12709
> URL: https://issues.apache.org/jira/browse/SPARK-12709
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Mao, Wei
>







[jira] [Commented] (SPARK-12621) ArrayIndexOutOfBoundsException when running sqlContext.sql(...)

2016-01-08 Thread Lei Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088863#comment-15088863
 ] 

Lei Wu commented on SPARK-12621:


Could you elaborate on the context in which you make this SQL query call? 
Did you run it from within the spark-sql CLI or from your application? More 
details will help in reproducing and debugging the problem.

> ArrayIndexOutOfBoundsException when running sqlContext.sql(...)
> ---
>
> Key: SPARK-12621
> URL: https://issues.apache.org/jira/browse/SPARK-12621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sasi
>
> Sometimes I'm getting this exception while trying to do "select * from 
> table". 
> I'm using Spark 1.5.0, with Spark SQL 2.10
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:199)
> at org.apache.spark.sql.Row$class.getAs(Row.scala:316)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
> at org.apache.spark.sql.Row$class.getString(Row.scala:249)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:191)
> at 
> com.cxtrm.mgmt.subscriber.spark.SparkDataAccessorBean.doQuery(SparkDataAccessorBean.java:138)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)






[jira] [Assigned] (SPARK-1270) An optimized gradient descent implementation

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1270:
---

Assignee: Apache Spark

> An optimized gradient descent implementation
> 
>
> Key: SPARK-1270
> URL: https://issues.apache.org/jira/browse/SPARK-1270
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Xusen Yin
>Assignee: Apache Spark
>  Labels: GradientDescent, MLLib,
>
> The current implementation of GradientDescent is inefficient in some respects, 
> especially on high-latency networks. I propose a new implementation of 
> GradientDescent, which follows a parallelism model called 
> GradientDescentWithLocalUpdate, inspired by Jeff Dean's DistBelief and Eric 
> Xing's SSP. With a few modifications of runMiniBatchSGD, the 
> GradientDescentWithLocalUpdate can outperform the original sequential version 
> by about 4x without sacrificing accuracy, and can be easily adopted by most 
> classification and regression algorithms in MLlib.






[jira] [Assigned] (SPARK-1270) An optimized gradient descent implementation

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-1270:
---

Assignee: (was: Apache Spark)

> An optimized gradient descent implementation
> 
>
> Key: SPARK-1270
> URL: https://issues.apache.org/jira/browse/SPARK-1270
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.0
>Reporter: Xusen Yin
>  Labels: GradientDescent, MLLib,
>
> The current implementation of GradientDescent is inefficient in some respects, 
> especially on high-latency networks. I propose a new implementation of 
> GradientDescent, which follows a parallelism model called 
> GradientDescentWithLocalUpdate, inspired by Jeff Dean's DistBelief and Eric 
> Xing's SSP. With a few modifications of runMiniBatchSGD, the 
> GradientDescentWithLocalUpdate can outperform the original sequential version 
> by about 4x without sacrificing accuracy, and can be easily adopted by most 
> classification and regression algorithms in MLlib.






[jira] [Commented] (SPARK-3893) declare mutableMap/mutableSet explicitly

2016-01-08 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1507#comment-1507
 ] 

Reynold Xin commented on SPARK-3893:


I'm going to close this as won't fix for now since I don't think it is that big 
of a deal, and this would require massive changes to the codebase to fix it.


> declare  mutableMap/mutableSet explicitly
> -
>
> Key: SPARK-3893
> URL: https://issues.apache.org/jira/browse/SPARK-3893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: shijinkui
>
> {code:java}
>   // current
>   val workers = new HashSet[WorkerInfo]
>   // suggested
>   val workers = new mutable.HashSet[WorkerInfo]
> {code}
> The other benefit is that it reminds us whether we could use an immutable 
> collection instead.
> Most of the maps we use are mutable.






[jira] [Closed] (SPARK-3893) declare mutableMap/mutableSet explicitly

2016-01-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-3893.
--
Resolution: Won't Fix

> declare  mutableMap/mutableSet explicitly
> -
>
> Key: SPARK-3893
> URL: https://issues.apache.org/jira/browse/SPARK-3893
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: shijinkui
>
> {code:java}
>   // current
>   val workers = new HashSet[WorkerInfo]
>   // suggested
>   val workers = new mutable.HashSet[WorkerInfo]
> {code}
> The other benefit is that it reminds us whether we could use an immutable 
> collection instead.
> Most of the maps we use are mutable.






[jira] [Commented] (SPARK-10386) Model import/export for PrefixSpan

2016-01-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088912#comment-15088912
 ] 

Apache Spark commented on SPARK-10386:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/10664

> Model import/export for PrefixSpan
> --
>
> Key: SPARK-10386
> URL: https://issues.apache.org/jira/browse/SPARK-10386
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> Support save/load for PrefixSpanModel. Should be similar to save/load for 
> FPGrowth.
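For context, a sketch of what the requested persistence could look like, following the save/load convention used elsewhere in MLlib; the PrefixSpanModel save/load calls are the proposed API (shown commented out), not something available when this was filed:

{code}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.fpm.PrefixSpan

def persistenceSketch(sc: SparkContext): Unit = {
  val sequences = sc.parallelize(Seq(
    Array(Array(1, 2), Array(3)),
    Array(Array(1), Array(3, 2), Array(1, 2))))
  val model = new PrefixSpan()
    .setMinSupport(0.5)
    .setMaxPatternLength(5)
    .run(sequences)

  // proposed, mirroring the usual model.save / Model.load convention:
  // model.save(sc, "/tmp/prefixspan-model")
  // val loaded = PrefixSpanModel.load(sc, "/tmp/prefixspan-model")
}
{code}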






[jira] [Assigned] (SPARK-10386) Model import/export for PrefixSpan

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10386:


Assignee: Apache Spark

> Model import/export for PrefixSpan
> --
>
> Key: SPARK-10386
> URL: https://issues.apache.org/jira/browse/SPARK-10386
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>
> Support save/load for PrefixSpanModel. Should be similar to save/load for 
> FPGrowth.






[jira] [Assigned] (SPARK-10386) Model import/export for PrefixSpan

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-10386:


Assignee: (was: Apache Spark)

> Model import/export for PrefixSpan
> --
>
> Key: SPARK-10386
> URL: https://issues.apache.org/jira/browse/SPARK-10386
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Xiangrui Meng
>
> Support save/load for PrefixSpanModel. Should be similar to save/load for 
> FPGrowth.






[jira] [Commented] (SPARK-12621) ArrayIndexOutOfBoundsException when running sqlContext.sql(...)

2016-01-08 Thread Sasi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15088917#comment-15088917
 ] 

Sasi commented on SPARK-12621:
--

I run this from my application, which acts as the driver inside a JBoss web 
application. This issue occurs randomly, so I'll add some logs and post them. 

> ArrayIndexOutOfBoundsException when running sqlContext.sql(...)
> ---
>
> Key: SPARK-12621
> URL: https://issues.apache.org/jira/browse/SPARK-12621
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Sasi
>
> Sometimes I'm getting this exception while trying to do "select * from 
> table". 
> I'm using Spark 1.5.0, with Spark SQL 2.10
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:199)
> at org.apache.spark.sql.Row$class.getAs(Row.scala:316)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
> at org.apache.spark.sql.Row$class.getString(Row.scala:249)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:191)
> at 
> com.cxtrm.mgmt.subscriber.spark.SparkDataAccessorBean.doQuery(SparkDataAccessorBean.java:138)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)






[jira] [Assigned] (SPARK-12709) Create local Except operator

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12709:


Assignee: Apache Spark

> Create local Except operator
> 
>
> Key: SPARK-12709
> URL: https://issues.apache.org/jira/browse/SPARK-12709
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Mao, Wei
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-12650) No means to specify Xmx settings for SparkSubmit in yarn-cluster mode

2016-01-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089684#comment-15089684
 ] 

Marcelo Vanzin commented on SPARK-12650:


{{SparkLauncher}} has a constructor that takes a map of environment variables 
to set.
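A minimal sketch of that approach (the jar path, main class, and the use of SPARK_SUBMIT_OPTS to cap the SparkSubmit heap are illustrative assumptions, not something confirmed in this thread):

{code}
import org.apache.spark.launcher.SparkLauncher
import scala.collection.JavaConverters._

// Environment variables passed to the child spark-submit process.
val env = Map("SPARK_SUBMIT_OPTS" -> "-Xmx256m").asJava

val process = new SparkLauncher(env)
  .setAppResource("/path/to/app.jar")
  .setMainClass("com.example.MyJob")
  .setMaster("yarn-cluster")
  .launch()

process.waitFor()
{code}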

> No means to specify Xmx settings for SparkSubmit in yarn-cluster mode
> -
>
> Key: SPARK-12650
> URL: https://issues.apache.org/jira/browse/SPARK-12650
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.2
> Environment: Hadoop 2.6.0
>Reporter: John Vines
>
> Background-
> I have an app master designed to do some work and then launch a spark job.
> Issue-
> If I use yarn-cluster, then SparkSubmit does not set an Xmx for itself at all, 
> so the JVM takes a relatively large default heap. This causes a large amount 
> of vmem to be used, so the process is killed by YARN. This can be worked 
> around by disabling YARN's vmem check, but that is a hack.
> If I run it in yarn-client mode, it's fine as long as my container has enough 
> space for the driver, which is manageable. But I feel that the utter lack of 
> an Xmx setting for what I believe is a very small JVM is a problem.
> I believe this was introduced with the fix for SPARK-3884.






[jira] [Commented] (SPARK-12177) Update KafkaDStreams to new Kafka 0.9 Consumer API

2016-01-08 Thread Mark Grover (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089750#comment-15089750
 ] 

Mark Grover commented on SPARK-12177:
-

Thanks for working on this, Nikita. I'd like to help out. Here are a few pieces 
of feedback:
1. I tried rebasing what you have in your current branch to upstream master 
(yours still seems to be based off of pre-1.6.0 code) but mostly because of 
some commits related to import ordering that happened on spark trunk relatively 
recently, I found it easier to migrate/copy the code for kafka-v09 and make the 
minor changes to examples and root pom instead of doing a 'git rebase'.
2. I also noticed that the v09DirectKafkaWordCount example is pulling at least 
the ConsumerConfig class from Kafka 0.8.2.1. This is because the examples pom 
contains both kafka 0.8.2.1 and 0.9.0 dependencies and somewhat arbitrarily 
puts the 0.8.2.1 ahead. Since the ConsumerConfig class is available in both 
under the same namespace, we end up pulling 0.8.2.1. We should fix that.


In general, I may have a few more changes/fixes that I'd like to contribute to 
your pull request. Would it be possible for us to collaborate? What's the best 
way to do so? Reopening the pull request and me adding to it? Or, just me 
issuing pull requests to [your 
branch|https://github.com/nikit-os/spark/tree/kafka-09-consumer-api]? Thanks!

> Update KafkaDStreams to new Kafka 0.9 Consumer API
> --
>
> Key: SPARK-12177
> URL: https://issues.apache.org/jira/browse/SPARK-12177
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Nikita Tarasenko
>  Labels: consumer, kafka
>
> Kafka 0.9 has already been released and it introduces a new consumer API that 
> is not compatible with the old one. So I added the new consumer API as 
> separate classes in the package org.apache.spark.streaming.kafka.v09, with the 
> changed API. I didn't remove the old classes, for backward compatibility; 
> users will not need to change their old Spark applications when they upgrade 
> to a new Spark version.
> Please review my changes.






[jira] [Resolved] (SPARK-12701) Logging FileAppender should use join to ensure thread is finished

2016-01-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12701.
--
   Resolution: Fixed
 Assignee: Bryan Cutler
Fix Version/s: 2.0.0

> Logging FileAppender should use join to ensure thread is finished
> -
>
> Key: SPARK-12701
> URL: https://issues.apache.org/jira/browse/SPARK-12701
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.0.0
>
>
> Currently, FileAppender for logging uses wait/notifyAll to signal that the 
> writing thread has finished.  While I was trying to write a regression test 
> for a fix of SPARK-9844, the writing thread was not able to fully complete 
> before the process was shut down, despite calling 
> {{FileAppender.awaitTermination}}.  Using join ensures the thread completes 
> and would simplify things a little more.
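A generic illustration of the change being proposed (not the actual FileAppender code): instead of waiting on a flag guarded by wait/notifyAll, the caller simply joins the writer thread, which only returns once run() has completed.

{code}
// sketch: wait for a writer thread to finish by joining it
val writer = new Thread(new Runnable {
  override def run(): Unit = {
    // ... drain the input stream and append it to the log file ...
  }
})
writer.start()

// join() blocks until run() has returned, unlike a wait/notifyAll handshake
// where the waiting side can resume slightly before the thread has fully exited.
writer.join()
{code}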






[jira] [Commented] (SPARK-12713) UI Executor page should keep links around to executors that died

2016-01-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089734#comment-15089734
 ] 

Josh Rosen commented on SPARK-12713:


+1; UI improvements for dead executors would be great. An issue that I've seen 
with the current UI is that it's sometimes hard to notice that executors have 
died if they're churning heavily and being replaced by fresh ones; today I 
think that users have to pay close attention to executor numbers in order to 
spot this and, as you've pointed out, the log viewing experience isn't great.

> UI Executor page should keep links around to executors that died
> 
>
> Key: SPARK-12713
> URL: https://issues.apache.org/jira/browse/SPARK-12713
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.2
>Reporter: Thomas Graves
>
> When an executor dies, the web UI no longer shows it on the executors page, 
> which makes getting to the logs to see what happened very difficult. I'm 
> running on YARN, so I'm not sure whether the behavior is different in standalone mode.
> We should figure out a way to keep links around to the ones that died so we 
> can show stats and log links.






[jira] [Commented] (SPARK-12694) The detailed rest API documentation for each field is missing

2016-01-08 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089737#comment-15089737
 ] 

Josh Rosen commented on SPARK-12694:


+1. Do you want to put together a PR for this?

> The detailed rest API documentation for each field is missing
> -
>
> Key: SPARK-12694
> URL: https://issues.apache.org/jira/browse/SPARK-12694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Zhuo Liu
>
> In 1.4, a REST API was added. However, we currently only have a high-level description 
> here (http://spark.apache.org/docs/latest/monitoring.html). We still lack 
> detailed documentation for the name, type and description of each field in 
> the REST API. For example, "startTime" in an application's attempt.
> Adding this will help users a lot.






[jira] [Reopened] (SPARK-12697) Allow adding new streams without stopping Spark streaming context

2016-01-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reopened SPARK-12697:
--
  Assignee: (was: Shixiong Zhu)

> Allow adding new streams without stopping Spark streaming context
> -
>
> Key: SPARK-12697
> URL: https://issues.apache.org/jira/browse/SPARK-12697
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu
>Reporter: Johny Mathew
>  Labels: features
>
> I am analyzing streaming input from Kafka. For example, I calculate the 
> average use of a resource over a period of time. It works great.
> The problem is that if I need to make a change to the analysis, for example 
> change from average to max or add another item to the list of items to 
> analyze, then I have to stop the Spark application and restart it (at least I 
> have to stop the streaming context and restart it).
> I would think it would be a great addition if Spark could add new streams on 
> the fly or make modifications to an existing stream.






[jira] [Updated] (SPARK-12697) Allow adding new streams without stopping Spark streaming context

2016-01-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12697:
-
Fix Version/s: (was: 2.0.0)

> Allow adding new streams without stopping Spark streaming context
> -
>
> Key: SPARK-12697
> URL: https://issues.apache.org/jira/browse/SPARK-12697
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu
>Reporter: Johny Mathew
>  Labels: features
>
> I am analyzing streaming input from Kafka. For example, I calculate the 
> average use of a resource over a period of time. It works great.
> The problem is that if I need to make a change to the analysis, for example 
> change from average to max or add another item to the list of items to 
> analyze, then I have to stop the Spark application and restart it (at least I 
> have to stop the streaming context and restart it).
> I would think it would be a great addition if Spark could add new streams on 
> the fly or make modifications to an existing stream.






[jira] [Resolved] (SPARK-12687) Support from clause surrounded by `()`

2016-01-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12687.

   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10660
[https://github.com/apache/spark/pull/10660]

> Support from clause surrounded by `()`
> --
>
> Key: SPARK-12687
> URL: https://issues.apache.org/jira/browse/SPARK-12687
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
> Fix For: 2.0.0
>
>
> This query can't be parsed under Hive parser:
> {code}
> (select * from t1) union (select * from t2)
> {code}
> also this one:
> {code}
> select * from ((select * from t1) union (select * from t2)) t
> {code}






[jira] [Commented] (SPARK-12694) The detailed rest API documentation for each field is missing

2016-01-08 Thread Zhuo Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12694?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089753#comment-15089753
 ] 

Zhuo Liu commented on SPARK-12694:
--

Yes, Josh. I plan to fix this soon, after SPARK-10873.

> The detailed rest API documentation for each field is missing
> -
>
> Key: SPARK-12694
> URL: https://issues.apache.org/jira/browse/SPARK-12694
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Reporter: Zhuo Liu
>
> In 1.4, a REST API was added. However, we currently only have a high-level description 
> here (http://spark.apache.org/docs/latest/monitoring.html). We still lack 
> detailed documentation for the name, type and description of each field in 
> the REST API. For example, "startTime" in an application's attempt.
> Adding this will help users a lot.






[jira] [Commented] (SPARK-12648) UDF with Option[Double] throws ClassCastException

2016-01-08 Thread kevin yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089758#comment-15089758
 ] 

kevin yu commented on SPARK-12648:
--

Hi Mikael: I see. I am looking into whether it is doable or not. 

> UDF with Option[Double] throws ClassCastException
> -
>
> Key: SPARK-12648
> URL: https://issues.apache.org/jira/browse/SPARK-12648
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Mikael Valot
>
> I can write a UDF that returns an Option[Double], and the DataFrame's 
> schema is correctly inferred to be a nullable double. 
> However, I cannot seem to write a UDF that takes an Option as an 
> argument:
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.{SparkContext, SparkConf}
> val conf = new SparkConf().setMaster("local[4]").setAppName("test")
> val sc = new SparkContext(conf)
> val sqlc = new SQLContext(sc)
> import sqlc.implicits._
> val df = sc.parallelize(List(("a", Some(4D)), ("b", None))).toDF("name", 
> "weight")
> import org.apache.spark.sql.functions._
> val addTwo = udf((d: Option[Double]) => d.map(_+2)) 
> df.withColumn("plusTwo", addTwo(df("weight"))).show()
> =>
> 2016-01-05T14:41:52 Executor task launch worker-0 ERROR 
> org.apache.spark.executor.Executor Exception in task 0.0 in stage 1.0 (TID 1)
> java.lang.ClassCastException: java.lang.Double cannot be cast to scala.Option
>   at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(:18) 
> ~[na:na]
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) ~[na:na]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at 
> org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
>  ~[spark-sql_2.10-1.6.0.jar:1.6.0]
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
> ~[scala-library-2.10.5.jar:na]






[jira] [Resolved] (SPARK-12618) Clean up build warnings: 2.0.0 edition

2016-01-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12618.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10570
[https://github.com/apache/spark/pull/10570]

> Clean up build warnings: 2.0.0 edition
> --
>
> Key: SPARK-12618
> URL: https://issues.apache.org/jira/browse/SPARK-12618
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL, Streaming
>Affects Versions: 1.6.0
>Reporter: Sean Owen
>Assignee: Sean Owen
>Priority: Minor
> Fix For: 2.0.0
>
>
> Time again to comb through all of the build warnings and try to eliminate 
> them. PR coming which removes about 75% of them, mostly due to deprecated API 
> usages. I'll annotate the changes. This may be a helpful precursor to 
> removing deprecated APIs, as this removes a lot of the calls to said APIs in 
> the code.






[jira] [Created] (SPARK-12715) Improve test coverage

2016-01-08 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12715:
--

 Summary: Improve test coverage
 Key: SPARK-12715
 URL: https://issues.apache.org/jira/browse/SPARK-12715
 Project: Spark
  Issue Type: Sub-task
Reporter: Davies Liu


We could bring all of Hive's parser test cases into Spark, to make sure we will 
not break compatibility with Hive (we could do more, and skip some of them that 
do not make sense).






[jira] [Closed] (SPARK-12693) OffsetOutOfRangeException caused by retention

2016-01-08 Thread Rado Buransky (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rado Buransky closed SPARK-12693.
-

Thanks

> OffsetOutOfRangeException caused by retention
> -
>
> Key: SPARK-12693
> URL: https://issues.apache.org/jira/browse/SPARK-12693
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu 64bit, Intel i7
>Reporter: Rado Buransky
>Priority: Minor
>  Labels: kafka
> Attachments: kafka-log.txt, log.txt
>
>
> I am running a Kafka server locally with an extremely low retention of 3 
> seconds and with 1-second segmentation. I create a direct Kafka stream with 
> auto.offset.reset = smallest. 
> In case of bad luck (which actually happens quite often in my case), the 
> smallest offset retrieved during stream initialization no longer exists when 
> streaming actually starts.
> Complete source code of the Spark Streaming application is here:
> https://github.com/pygmalios/spark-checkpoint-experience/blob/cb27ab83b7a29e619386b56e68a755d7bd73fc46/src/main/scala/com/pygmalios/sparkCheckpointExperience/spark/SparkApp.scala
> The application ends in an endless loop trying to get that non-existing 
> offset and has to be killed. Check attached logs from Spark and also from 
> Kafka server.
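A minimal sketch of the setup described above (broker address, topic name, and batch interval are illustrative assumptions; the complete application is at the GitHub link):

{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("retention-repro"), Seconds(1))
val kafkaParams = Map(
  "metadata.broker.list" -> "localhost:9092",
  "auto.offset.reset" -> "smallest")  // start from the smallest offset the broker reports
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("test-topic"))
stream.foreachRDD(rdd => println(s"batch size: ${rdd.count()}"))
ssc.start()
ssc.awaitTermination()
{code}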






[jira] [Resolved] (SPARK-12510) Refactor ActorReceiver to support Java

2016-01-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12510.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu  (was: Apache Spark)
Fix Version/s: 2.0.0

> Refactor ActorReceiver to support Java
> --
>
> Key: SPARK-12510
> URL: https://issues.apache.org/jira/browse/SPARK-12510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
> Fix For: 2.0.0
>
>
> Right now Java users cannot use ActorHelper because it uses Scala-specific 
> syntax.
> This patch just refactors the code to provide a Java API and adds an example.






[jira] [Resolved] (SPARK-12697) Allow adding new streams without stopping Spark streaming context

2016-01-08 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-12697.
--
   Resolution: Fixed
 Assignee: Shixiong Zhu
Fix Version/s: 2.0.0

> Allow adding new streams without stopping Spark streaming context
> -
>
> Key: SPARK-12697
> URL: https://issues.apache.org/jira/browse/SPARK-12697
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu
>Reporter: Johny Mathew
>Assignee: Shixiong Zhu
>  Labels: features
> Fix For: 2.0.0
>
>
> I am analyzing streaming input from Kafka. For example, I calculate the 
> average use of a resource over a period of time. It works great.
> The problem is that if I need to make a change to the analysis, for example 
> change from average to max or add another item to the list of items to 
> analyze, then I have to stop the Spark application and restart it (at least I 
> have to stop the streaming context and restart it).
> I would think it would be a great addition if Spark could add new streams on 
> the fly or make modifications to an existing stream.






[jira] [Commented] (SPARK-12697) Allow adding new streams without stopping Spark streaming context

2016-01-08 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089712#comment-15089712
 ] 

Shixiong Zhu commented on SPARK-12697:
--

Sorry, I closed the wrong JIRA.

> Allow adding new streams without stopping Spark streaming context
> -
>
> Key: SPARK-12697
> URL: https://issues.apache.org/jira/browse/SPARK-12697
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.6.0
> Environment: Ubuntu
>Reporter: Johny Mathew
>  Labels: features
>
> I am analyzing streaming input from Kafka. For example, I calculate the 
> average use of a resource over a period of time. It works great.
> The problem is that if I need to make a change to the analysis, for example 
> change from average to max or add another item to the list of items to 
> analyze, then I have to stop the Spark application and restart it (at least I 
> have to stop the streaming context and restart it).
> I would think it would be a great addition if Spark could add new streams on 
> the fly or make modifications to an existing stream.






[jira] [Commented] (SPARK-12685) word2vec trainWordsCount gets overflow

2016-01-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089814#comment-15089814
 ] 

Joseph K. Bradley commented on SPARK-12685:
---

This affects earlier versions too.  Would you mind checking to see which ones, 
and also how hard it'd be to backport to them?  Thanks!

> word2vec trainWordsCount gets overflow
> --
>
> Key: SPARK-12685
> URL: https://issues.apache.org/jira/browse/SPARK-12685
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>
> The log of word2vec reports 
> trainWordsCount = -785727483
> during computation over a large dataset.
> I'll also add vocabSize to the log.
> Updating the priority, as it affects the computation process.
> alpha =
> learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 
> 1))
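A minimal illustration of the overflow (the counts below are made up) and the obvious direction for a fix, accumulating in a Long; this is a sketch, not the actual patch:

{code}
val countPerWord = 700000          // illustrative numbers
val vocabSize = 10000

val asInt: Int = countPerWord * vocabSize             // wraps around to a negative value
val asLong: Long = countPerWord.toLong * vocabSize    // 7,000,000,000, stays correct

// With a negative trainWordsCount, the alpha formula quoted above produces a
// bogus (even negative) learning rate, which is why this affects the computation.
{code}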






[jira] [Updated] (SPARK-12685) word2vec trainWordsCount gets overflow

2016-01-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12685:
--
Assignee: yuhao yang

> word2vec trainWordsCount gets overflow
> --
>
> Key: SPARK-12685
> URL: https://issues.apache.org/jira/browse/SPARK-12685
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>
> The log of word2vec reports 
> trainWordsCount = -785727483
> during computation over a large dataset.
> I'll also add vocabSize to the log.
> Updating the priority, as it affects the computation process.
> alpha =
> learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 
> 1))






[jira] [Commented] (SPARK-12703) Spark KMeans Documentation Python Api

2016-01-08 Thread Anton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089853#comment-15089853
 ] 

Anton commented on SPARK-12703:
---

I'm new to this whole system... what's a PR?


> Spark KMeans Documentation Python Api
> -
>
> Key: SPARK-12703
> URL: https://issues.apache.org/jira/browse/SPARK-12703
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Reporter: Anton
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> In the documentation of Spark's Kmeans - python api:
> http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
> the cost of the final result is calculated using the 'error()' function where 
> it's returning:
> {quote}
> return sqrt(sum([x**2 for x in (point - center)]))
> {quote}
> As I understand, it's wrong to use sqrt() and it should be omitted:
> {quote} return sum([x**2 for x in (point - center)]).{quote}
> Please refer to :
> https://en.wikipedia.org/wiki/K-means_clustering#Description
> Where you can see that the power is canceling the square.
> What do you think? It's minor but wasted me a few min to understand why the 
> result isn't what I'm expecting.
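
For illustration, a small Scala sketch (the guide's snippet is Python; this is only 
the same arithmetic restated, not the documentation code) of the k-means cost as a 
plain sum of squared differences, with no square root:

{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Squared Euclidean distance between a point and its assigned center; the
// clustering cost (WSSSE) is the sum of this value over all points.
def squaredError(point: Vector, center: Vector): Double =
  point.toArray.zip(center.toArray).map { case (x, c) => (x - c) * (x - c) }.sum

val cost = squaredError(Vectors.dense(1.0, 2.0), Vectors.dense(0.0, 0.0))  // 5.0, not sqrt(5)
{code}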



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12703) Spark KMeans Documentation Python Api

2016-01-08 Thread Anton (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton updated SPARK-12703:
--
Description: 
In the documentation of Spark's KMeans - Python api:
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means

the cost of the final result is calculated using the 'error()' function, which 
returns:
{quote}
return sqrt(sum([x**2 for x in (point - center)]))
{quote}

As I understand, it's wrong to use sqrt() and it should be omitted:
{quote} return sum([x**2 for x in (point - center)]).{quote}

Please refer to:
https://en.wikipedia.org/wiki/K-means_clustering#Description
where you can see that the cost is defined as a sum of squared distances, so 
the square root should not be applied.

What do you think? It's minor, but it took me a few minutes to understand why 
the result wasn't what I expected.

  was:
In the documentation of Spark's Kmeans - python api:
http://spark.apache.org/docs/latest/mllib-clustering.html#k-means

the cost of the final result is calculated using the 'error()' function where 
its returning:
{quote}
return sqrt(sum([x**2 for x in (point - center)]))
{quote}

As I understand, it's wrong to use sqrt() and it should be omitted:
{quote} return sum([x**2 for x in (point - center)]).{quote}

Please refer to :
https://en.wikipedia.org/wiki/K-means_clustering#Description
Where you can see that the power is canceling the square.

What do you think? It's minor but wasted me a few min to understand why the 
result isn't what I'm expecting.


> Spark KMeans Documentation Python Api
> -
>
> Key: SPARK-12703
> URL: https://issues.apache.org/jira/browse/SPARK-12703
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Reporter: Anton
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> In the documentation of Spark's KMeans - Python api:
> http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
> the cost of the final result is calculated using the 'error()' function, which 
> returns:
> {quote}
> return sqrt(sum([x**2 for x in (point - center)]))
> {quote}
> As I understand, it's wrong to use sqrt() and it should be omitted:
> {quote} return sum([x**2 for x in (point - center)]).{quote}
> Please refer to:
> https://en.wikipedia.org/wiki/K-means_clustering#Description
> where you can see that the cost is defined as a sum of squared distances, so 
> the square root should not be applied.
> What do you think? It's minor, but it took me a few minutes to understand why 
> the result wasn't what I expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop

2016-01-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-12654.
---
Resolution: Fixed
  Assignee: Thomas Graves  (was: Apache Spark)

> sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
> -
>
> Key: SPARK-12654
> URL: https://issues.apache.org/jira/browse/SPARK-12654
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 1.6.1, 2.0.0
>
>
> On a secure hadoop cluster using pyspark or spark-shell in yarn client mode 
> with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute.  
> Then try to use:
> val files =  sc.wholeTextFiles("dir") 
> files.collect()
> and it fails with:
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation 
> Token can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)
>  
> at org.apache.hadoop.ipc.Client.call(Client.java:1451)
> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434)
> at 
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529)
> at 
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242)
> at 
> org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Assigned] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12654:


Assignee: Apache Spark  (was: Thomas Graves)

> sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
> -
>
> Key: SPARK-12654
> URL: https://issues.apache.org/jira/browse/SPARK-12654
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Apache Spark
> Fix For: 1.6.1, 2.0.0
>
>
> On a secure hadoop cluster using pyspark or spark-shell in yarn client mode 
> with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute.  
> Then try to use:
> val files =  sc.wholeTextFiles("dir") 
> files.collect()
> and it fails with:
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation 
> Token can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)
>  
> at org.apache.hadoop.ipc.Client.call(Client.java:1451)
> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434)
> at 
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529)
> at 
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242)
> at 
> org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, 

[jira] [Closed] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop

2016-01-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves closed SPARK-12654.
-
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.1

> sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
> -
>
> Key: SPARK-12654
> URL: https://issues.apache.org/jira/browse/SPARK-12654
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 1.6.1, 2.0.0
>
>
> On a secure hadoop cluster using pyspark or spark-shell in yarn client mode 
> with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute.  
> Then try to use:
> val files =  sc.wholeTextFiles("dir") 
> files.collect()
> and it fails with:
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation 
> Token can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)
>  
> at org.apache.hadoop.ipc.Client.call(Client.java:1451)
> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434)
> at 
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529)
> at 
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242)
> at 
> org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For 

[jira] [Reopened] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop

2016-01-08 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reopened SPARK-12654:
---

> sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
> -
>
> Key: SPARK-12654
> URL: https://issues.apache.org/jira/browse/SPARK-12654
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
> Fix For: 1.6.1, 2.0.0
>
>
> On a secure hadoop cluster using pyspark or spark-shell in yarn client mode 
> with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute.  
> Then try to use:
> val files =  sc.wholeTextFiles("dir") 
> files.collect()
> and it fails with:
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation 
> Token can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)
>  
> at org.apache.hadoop.ipc.Client.call(Client.java:1451)
> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434)
> at 
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529)
> at 
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242)
> at 
> org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12706) support grouping/grouping_id function together group set

2016-01-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12706:
---
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-12540

> support grouping/grouping_id function together group set
> 
>
> Key: SPARK-12706
> URL: https://issues.apache.org/jira/browse/SPARK-12706
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup#EnhancedAggregation,Cube,GroupingandRollup-Grouping__IDfunction
> http://etutorials.org/SQL/Mastering+Oracle+SQL/Chapter+13.+Advanced+Group+Operations/13.3+The+GROUPING_ID+and+GROUP_ID+Functions/
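
As a rough illustration of the requested feature (a sketch only: the table `sales`, 
its columns, and the exact argument forms of grouping/grouping_id are assumptions 
based on the linked Hive and Oracle references, not syntax Spark supports at the 
time of writing):

{code}
// Run from a spark-shell where sqlContext is available and `sales` is registered.
val df = sqlContext.sql(
  """SELECT year, product, grouping(product), grouping_id(), sum(amount)
    |FROM sales
    |GROUP BY year, product GROUPING SETS ((year, product), (year), ())
  """.stripMargin)
{code}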



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1832) Executor UI improvement suggestions

2016-01-08 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089909#comment-15089909
 ] 

Alex Bozarth commented on SPARK-1832:
-

Updated my totals-row solution to be independent of the colors sub-task; I will 
submit a PR after re-testing.

> Executor UI improvement suggestions
> ---
>
> Key: SPARK-1832
> URL: https://issues.apache.org/jira/browse/SPARK-1832
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
>  Fill some of the cells with color in order to make it easier to absorb 
> the info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed))
> - if dark blue then write the value in white (same for the RED and GREEN above)
> Maybe mark the MASTER task somehow
>  
> Report the TOTALS in each column (do this at the TOP so no need to scroll 
> to the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12703) Spark KMeans Documentation Python Api

2016-01-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089844#comment-15089844
 ] 

Joseph K. Bradley commented on SPARK-12703:
---

You're right that it shouldn't be computing sqrt.  Would you mind sending a 
little PR to fix it?  Thanks!

> Spark KMeans Documentation Python Api
> -
>
> Key: SPARK-12703
> URL: https://issues.apache.org/jira/browse/SPARK-12703
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Reporter: Anton
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> In the documentation of Spark's KMeans - python api:
> http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
> the cost of the final result is calculated using the 'error()' function, which 
> returns:
> {quote}
> return sqrt(sum([x**2 for x in (point - center)]))
> {quote}
> As I understand, it's wrong to use sqrt() and it should be omitted:
> {quote} return sum([x**2 for x in (point - center)]).{quote}
> Please refer to:
> https://en.wikipedia.org/wiki/K-means_clustering#Description
> where you can see that the cost is defined as a sum of squared distances, so 
> the square root should not be applied.
> What do you think? It's minor, but it took me a few minutes to understand why 
> the result wasn't what I expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12713) UI Executor page should keep links around to executors that died

2016-01-08 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089884#comment-15089884
 ] 

Nan Zhu commented on SPARK-12713:
-

I attached a PR and two duplicate JIRAs that address the same issue.

> UI Executor page should keep links around to executors that died
> 
>
> Key: SPARK-12713
> URL: https://issues.apache.org/jira/browse/SPARK-12713
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.2
>Reporter: Thomas Graves
>
> When an executor dies the web ui no longer shows it in the executors page 
> which makes getting to the logs to see what happened very difficult.  I'm 
> running on yarn so not sure if behavior is different in standalone mode.
> We should figure out a way to keep links around to the ones that died so we 
> can show stats and log links.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6332) compute calibration curve for binary classifiers

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6332:
---

Assignee: Apache Spark

> compute calibration curve for binary classifiers
> 
>
> Key: SPARK-6332
> URL: https://issues.apache.org/jira/browse/SPARK-6332
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Robert Dodier
>Assignee: Apache Spark
>Priority: Minor
>  Labels: classification
>
> For binary classifiers, calibration measures how classifier scores compare to 
> the proportion of positive examples. If the classifier is well-calibrated, 
> the classifier score is approximately equal to the proportion of positive 
> examples. This is important if the scores are used as probabilities for 
> making decisions via expected cost. Otherwise, the calibration curve may 
> still be interesting; the proportion of positive examples should at least be 
> a monotonic function of the score.
> I propose that a new method for calibration be added to the class 
> BinaryClassificationMetrics, since calibration seems to fit in with the ROC 
> curve and other classifier assessments. 
> For more about calibration, see: 
> http://en.wikipedia.org/wiki/Calibration_%28statistics%29#In_classification
> References:
> Mahdi Pakdaman Naeini, Gregory F. Cooper, Milos Hauskrecht. "Binary 
> Classifier Calibration: Non-parametric approach." 
> http://arxiv.org/abs/1401.3390
> Alexandru Niculescu-Mizil, Rich Caruana. "Predicting Good Probabilities With 
> Supervised Learning." Appearing in Proceedings of the 22nd International 
> Conference on Machine Learning, Bonn, Germany, 2005. 
> http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
> "Properties and benefits of calibrated classifiers." Ira Cohen, Moises 
> Goldszmidt. http://www.hpl.hp.com/techreports/2004/HPL-2004-22R1.pdf
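
To make the proposal concrete, here is a rough Scala sketch (an assumption about one 
possible shape of the computation, not the proposed BinaryClassificationMetrics API) 
that bins (score, label) pairs by score and reports, per bin, the mean score and the 
observed fraction of positives:

{code}
import org.apache.spark.rdd.RDD

// Assumes scores lie in [0, 1] and labels are 0.0 or 1.0.
def calibrationCurve(scoreAndLabel: RDD[(Double, Double)],
                     numBins: Int = 10): Array[(Double, Double)] = {
  scoreAndLabel
    .map { case (score, label) =>
      val bin = math.min((score * numBins).toInt, numBins - 1)
      (bin, (score, label, 1L))                   // (score sum, positive count, example count)
    }
    .reduceByKey((a, b) => (a._1 + b._1, a._2 + b._2, a._3 + b._3))
    .map { case (_, (scoreSum, posCount, n)) => (scoreSum / n, posCount / n) }
    .collect()
    .sortBy(_._1)                                 // (mean score, observed positive fraction) per bin
}
{code}

For a well-calibrated classifier the two coordinates track each other closely; a 
monotonic but unequal relationship is the weaker property mentioned in the description.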



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6332) compute calibration curve for binary classifiers

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6332:
---

Assignee: (was: Apache Spark)

> compute calibration curve for binary classifiers
> 
>
> Key: SPARK-6332
> URL: https://issues.apache.org/jira/browse/SPARK-6332
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Robert Dodier
>Priority: Minor
>  Labels: classification
>
> For binary classifiers, calibration measures how classifier scores compare to 
> the proportion of positive examples. If the classifier is well-calibrated, 
> the classifier score is approximately equal to the proportion of positive 
> examples. This is important if the scores are used as probabilities for 
> making decisions via expected cost. Otherwise, the calibration curve may 
> still be interesting; the proportion of positive examples should at least be 
> a monotonic function of the score.
> I propose that a new method for calibration be added to the class 
> BinaryClassificationMetrics, since calibration seems to fit in with the ROC 
> curve and other classifier assessments. 
> For more about calibration, see: 
> http://en.wikipedia.org/wiki/Calibration_%28statistics%29#In_classification
> References:
> Mahdi Pakdaman Naeini, Gregory F. Cooper, Milos Hauskrecht. "Binary 
> Classifier Calibration: Non-parametric approach." 
> http://arxiv.org/abs/1401.3390
> Alexandru Niculescu-Mizil, Rich Caruana. "Predicting Good Probabilities With 
> Supervised Learning." Appearing in Proceedings of the 22nd International 
> Conference on Machine Learning, Bonn, Germany, 2005. 
> http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
> "Properties and benefits of calibrated classifiers." Ira Cohen, Moises 
> Goldszmidt. http://www.hpl.hp.com/techreports/2004/HPL-2004-22R1.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12703) Spark KMeans Documentation Python Api

2016-01-08 Thread Anton (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089853#comment-15089853
 ] 

Anton edited comment on SPARK-12703 at 1/8/16 8:07 PM:
---

I'm new in all of this system.. what's a PR?



was (Author: anton96):
I'm new with all of this system.. what's a PR?


> Spark KMeans Documentation Python Api
> -
>
> Key: SPARK-12703
> URL: https://issues.apache.org/jira/browse/SPARK-12703
> Project: Spark
>  Issue Type: Documentation
>  Components: MLlib
>Reporter: Anton
>Priority: Minor
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> In the documentation of Spark's KMeans - python api:
> http://spark.apache.org/docs/latest/mllib-clustering.html#k-means
> the cost of the final result is calculated using the 'error()' function, which 
> returns:
> {quote}
> return sqrt(sum([x**2 for x in (point - center)]))
> {quote}
> As I understand, it's wrong to use sqrt() and it should be omitted:
> {quote} return sum([x**2 for x in (point - center)]).{quote}
> Please refer to:
> https://en.wikipedia.org/wiki/K-means_clustering#Description
> where you can see that the cost is defined as a sum of squared distances, so 
> the square root should not be applied.
> What do you think? It's minor, but it took me a few minutes to understand why 
> the result wasn't what I expected.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12711) ML StopWordsRemover does not protect itself from column name duplication

2016-01-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089854#comment-15089854
 ] 

Joseph K. Bradley commented on SPARK-12711:
---

You're correct that it should prevent column name duplication.  I'd recommend 
using SchemaUtils.appendColumn: 
[https://github.com/apache/spark/blob/00d9261724feb48d358679efbae6889833e893e0/mllib/src/main/scala/org/apache/spark/ml/util/SchemaUtils.scala#L54]

It would be great if you could send a PR to fix this.  Thanks!

By the way, please don't set the target version; committers or component 
maintainers will set that.
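
To make the suggestion concrete, here is a standalone Scala sketch of the guard being 
discussed (illustrative only: the helper name and wiring below are made up for the 
example; the real fix would go through StopWordsRemover.transformSchema, ideally via 
the SchemaUtils.appendColumn helper linked above, which performs the same duplicate check):

{code}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Refuse to append an output column whose name already exists in the schema.
def appendOutputColumn(schema: StructType, outputCol: String, dataType: ArrayType): StructType = {
  require(!schema.fieldNames.contains(outputCol),
    s"Output column $outputCol already exists.")
  StructType(schema.fields :+ StructField(outputCol, dataType, nullable = true))
}

val schema = StructType(Seq(StructField("words", ArrayType(StringType), nullable = true)))
appendOutputColumn(schema, "filtered", ArrayType(StringType))  // appends the new column
// appendOutputColumn(schema, "words", ArrayType(StringType))  // would throw: column already exists
{code}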

> ML StopWordsRemover does not protect itself from column name duplication
> 
>
> Key: SPARK-12711
> URL: https://issues.apache.org/jira/browse/SPARK-12711
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>  Labels: ml, mllib, newbie, suggestion
>
> At work we were 'taking a closer look' at ML transformers and I spotted this 
> anomaly.
> At first glance, the resolution looks simple:
> Add to StopWordsRemover.transformSchema a line (as is done in, e.g., 
> PCA.transformSchema, StandardScaler.transformSchema, and 
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)), s"Output column 
> ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If yes, I am willing to prepare an 
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in 
> StopWordsRemover (and possibly in all other places)?
> Links to the files on GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12711) ML StopWordsRemover does not protect itself from column name duplication

2016-01-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12711?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12711:
--
Target Version/s: 1.6.1, 2.0.0  (was: 1.6.1)

> ML StopWordsRemover does not protect itself from column name duplication
> 
>
> Key: SPARK-12711
> URL: https://issues.apache.org/jira/browse/SPARK-12711
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Trivial
>  Labels: ml, mllib, newbie, suggestion
>
> At work we were 'taking a closer look' at ML transformers and I spotted this 
> anomaly.
> At first glance, the resolution looks simple:
> Add to StopWordsRemover.transformSchema a line (as is done in, e.g., 
> PCA.transformSchema, StandardScaler.transformSchema, and 
> OneHotEncoder.transformSchema):
> {code}
> require(!schema.fieldNames.contains($(outputCol)), s"Output column 
> ${$(outputCol)} already exists.")
> {code}
> Am I correct? Is that a bug? If yes, I am willing to prepare an 
> appropriate pull request.
> Maybe a better idea is to make use of super.transformSchema in 
> StopWordsRemover (and possibly in all other places)?
> Links to the files on GitHub mentioned above:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StopWordsRemover.scala#L147
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Transformer.scala#L109-L111
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L101-L102
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/PCA.scala#L138-L139
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala#L75-L76



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6332) compute calibration curve for binary classifiers

2016-01-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089923#comment-15089923
 ] 

Joseph K. Bradley commented on SPARK-6332:
--

[~robert_dodier] I'm sorry we have not been able to review your PR.  We've had 
much less bandwidth than we had hoped.  This does sound like a useful item.  
Would you be able to submit a Spark package to make this available to users?  
Thanks for your understanding.

> compute calibration curve for binary classifiers
> 
>
> Key: SPARK-6332
> URL: https://issues.apache.org/jira/browse/SPARK-6332
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Robert Dodier
>Priority: Minor
>  Labels: classification
>
> For binary classifiers, calibration measures how classifier scores compare to 
> the proportion of positive examples. If the classifier is well-calibrated, 
> the classifier score is approximately equal to the proportion of positive 
> examples. This is important if the scores are used as probabilities for 
> making decisions via expected cost. Otherwise, the calibration curve may 
> still be interesting; the proportion of positive examples should at least be 
> a monotonic function of the score.
> I propose that a new method for calibration be added to the class 
> BinaryClassificationMetrics, since calibration seems to fit in with the ROC 
> curve and other classifier assessments. 
> For more about calibration, see: 
> http://en.wikipedia.org/wiki/Calibration_%28statistics%29#In_classification
> References:
> Mahdi Pakdaman Naeini, Gregory F. Cooper, Milos Hauskrecht. "Binary 
> Classifier Calibration: Non-parametric approach." 
> http://arxiv.org/abs/1401.3390
> Alexandru Niculescu-Mizil, Rich Caruana. "Predicting Good Probabilities With 
> Supervised Learning." Appearing in Proceedings of the 22nd International 
> Conference on Machine Learning, Bonn, Germany, 2005. 
> http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
> "Properties and benefits of calibrated classifiers." Ira Cohen, Moises 
> Goldszmidt. http://www.hpl.hp.com/techreports/2004/HPL-2004-22R1.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1693) Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0

2016-01-08 Thread Owen O'Malley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089936#comment-15089936
 ] 

Owen O'Malley commented on SPARK-1693:
--

Can you explain what the problem is and how to fix it? We are hitting the same 
problem in the Hive-on-Spark work.

> Dependent on multiple versions of servlet-api jars lead to throw an 
> SecurityException when Spark built for hadoop 2.3.0 , 2.4.0 
> 
>
> Key: SPARK-1693
> URL: https://issues.apache.org/jira/browse/SPARK-1693
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: log.txt
>
>
> {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 > 
> log.txt{code}
> The log: 
> {code}
> UnpersistSuite:
> - unpersist RDD *** FAILED ***
>   java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s 
> signer information does not match signer information of other classes in the 
> same package
>   at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
>   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1693) Dependent on multiple versions of servlet-api jars lead to throw an SecurityException when Spark built for hadoop 2.3.0 , 2.4.0

2016-01-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089945#comment-15089945
 ] 

Sean Owen commented on SPARK-1693:
--

As I recall this happens because the official JavaEE servlet API jar includes 
MANIFEST.MF entries with a signature. If these classes are repackaged in an 
assembly jar with these same MANIFEST.MF entries, you get this failure. There's 
a lot of discussion above, but one quick fix is to ensure the signatures are 
not included from anywhere. 

> Dependent on multiple versions of servlet-api jars lead to throw an 
> SecurityException when Spark built for hadoop 2.3.0 , 2.4.0 
> 
>
> Key: SPARK-1693
> URL: https://issues.apache.org/jira/browse/SPARK-1693
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>Priority: Blocker
> Fix For: 1.0.0
>
> Attachments: log.txt
>
>
> {code}mvn test -Pyarn -Dhadoop.version=2.4.0 -Dyarn.version=2.4.0 > 
> log.txt{code}
> The log: 
> {code}
> UnpersistSuite:
> - unpersist RDD *** FAILED ***
>   java.lang.SecurityException: class "javax.servlet.FilterRegistration"'s 
> signer information does not match signer information of other classes in the 
> same package
>   at java.lang.ClassLoader.checkCerts(ClassLoader.java:952)
>   at java.lang.ClassLoader.preDefineClass(ClassLoader.java:666)
>   at java.lang.ClassLoader.defineClass(ClassLoader.java:794)
>   at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>   at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
>   at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-6332) compute calibration curve for binary classifiers

2016-01-08 Thread Robert Dodier (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Dodier reopened SPARK-6332:
--

I'm reopening this issue, as I have made a new [PR 
#10666](https://github.com/apache/spark/pull/10666) to address the comments 
that were made on the previous [PR 
#5025](https://github.com/apache/spark/pull/5025). 

> compute calibration curve for binary classifiers
> 
>
> Key: SPARK-6332
> URL: https://issues.apache.org/jira/browse/SPARK-6332
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Robert Dodier
>Priority: Minor
>  Labels: classification
>
> For binary classifiers, calibration measures how classifier scores compare to 
> the proportion of positive examples. If the classifier is well-calibrated, 
> the classifier score is approximately equal to the proportion of positive 
> examples. This is important if the scores are used as probabilities for 
> making decisions via expected cost. Otherwise, the calibration curve may 
> still be interesting; the proportion of positive examples should at least be 
> a monotonic function of the score.
> I propose that a new method for calibration be added to the class 
> BinaryClassificationMetrics, since calibration seems to fit in with the ROC 
> curve and other classifier assessments. 
> For more about calibration, see: 
> http://en.wikipedia.org/wiki/Calibration_%28statistics%29#In_classification
> References:
> Mahdi Pakdaman Naeini, Gregory F. Cooper, Milos Hauskrecht. "Binary 
> Classifier Calibration: Non-parametric approach." 
> http://arxiv.org/abs/1401.3390
> Alexandru Niculescu-Mizil, Rich Caruana. "Predicting Good Probabilities With 
> Supervised Learning." Appearing in Proceedings of the 22nd International 
> Conference on Machine Learning, Bonn, Germany, 2005. 
> http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
> "Properties and benefits of calibrated classifiers." Ira Cohen, Moises 
> Goldszmidt. http://www.hpl.hp.com/techreports/2004/HPL-2004-22R1.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6332) compute calibration curve for binary classifiers

2016-01-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089929#comment-15089929
 ] 

Apache Spark commented on SPARK-6332:
-

User 'robert-dodier' has created a pull request for this issue:
https://github.com/apache/spark/pull/10666

> compute calibration curve for binary classifiers
> 
>
> Key: SPARK-6332
> URL: https://issues.apache.org/jira/browse/SPARK-6332
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Robert Dodier
>Priority: Minor
>  Labels: classification
>
> For binary classifiers, calibration measures how classifier scores compare to 
> the proportion of positive examples. If the classifier is well-calibrated, 
> the classifier score is approximately equal to the proportion of positive 
> examples. This is important if the scores are used as probabilities for 
> making decisions via expected cost. Otherwise, the calibration curve may 
> still be interesting; the proportion of positive examples should at least be 
> a monotonic function of the score.
> I propose that a new method for calibration be added to the class 
> BinaryClassificationMetrics, since calibration seems to fit in with the ROC 
> curve and other classifier assessments. 
> For more about calibration, see: 
> http://en.wikipedia.org/wiki/Calibration_%28statistics%29#In_classification
> References:
> Mahdi Pakdaman Naeini, Gregory F. Cooper, Milos Hauskrecht. "Binary 
> Classifier Calibration: Non-parametric approach." 
> http://arxiv.org/abs/1401.3390
> Alexandru Niculescu-Mizil, Rich Caruana. "Predicting Good Probabilities With 
> Supervised Learning." Appearing in Proceedings of the 22nd International 
> Conference on Machine Learning, Bonn, Germany, 2005. 
> http://www.cs.cornell.edu/~alexn/papers/calibration.icml05.crc.rev3.pdf
> "Properties and benefits of calibrated classifiers." Ira Cohen, Moises 
> Goldszmidt. http://www.hpl.hp.com/techreports/2004/HPL-2004-22R1.pdf



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12716) Executor UI improvement suggestions - Totals

2016-01-08 Thread Alex Bozarth (JIRA)
Alex Bozarth created SPARK-12716:


 Summary: Executor UI improvement suggestions - Totals
 Key: SPARK-12716
 URL: https://issues.apache.org/jira/browse/SPARK-12716
 Project: Spark
  Issue Type: Sub-task
  Components: Web UI
Reporter: Alex Bozarth


Splitting off the Totals portion of the parent UI improvements task, 
description copied below:

I received some suggestions from a user for the /executors UI page to make it 
more helpful. This gets more important when you have a really large number of 
executors.

...

Report the TOTALS in each column (do this at the TOP so no need to scroll to 
the bottom, or print both at top and bottom).




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1832) Executor UI improvement suggestions

2016-01-08 Thread Alex Bozarth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089967#comment-15089967
 ] 

Alex Bozarth commented on SPARK-1832:
-

Opened a sub-task for the totals half of the UI work. Unless there is 
clarification on the "MASTER task" line in the description, this should be 
resolved once the sub-tasks are resolved.

> Executor UI improvement suggestions
> ---
>
> Key: SPARK-1832
> URL: https://issues.apache.org/jira/browse/SPARK-1832
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Thomas Graves
>
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
>  Fill some of the cells with color in order to make it easier to absorb 
> the info, e.g.
> RED if Failed Tasks greater than 0 (maybe the more failed, the more intense 
> the red)
> GREEN if Active Tasks greater than 0 (maybe more intense the larger the 
> number)
> Possibly color code COMPLETE TASKS using various shades of blue (e.g., based 
> on the log(# completed))
> - if dark blue then write the value in white (same for the RED and GREEN above)
> Maybe mark the MASTER task somehow
>  
> Report the TOTALS in each column (do this at the TOP so no need to scroll 
> to the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4819) Remove Guava's "Optional" from public API

2016-01-08 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-4819.

   Resolution: Fixed
Fix Version/s: 2.0.0

> Remove Guava's "Optional" from public API
> -
>
> Key: SPARK-4819
> URL: https://issues.apache.org/jira/browse/SPARK-4819
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Marcelo Vanzin
>Assignee: Sean Owen
> Fix For: 2.0.0
>
> Attachments: SPARK_4819_null_do_not_merge.patch
>
>
> Filing this mostly so this isn't forgotten. Spark currently exposes Guava 
> types in its public API (the {{Optional}} class is used in the Java 
> bindings). This makes it hard to properly hide Guava from user applications, 
> and makes mixing different Guava versions with Spark a little sketchy (even 
> if things should work, since those classes are pretty simple in general).
> Since this changes the public API, it has to be done in a release that allows 
> such breakages. But it would be nice to at least have a transition plan for 
> deprecating the affected APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12705:
---
Description: 
The following query can't be resolved:

```
scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
order by b").explain()
org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input columns: 
[_c0]; line 1 pos 63
at 
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
at 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
``` 

  was:
The following query can't be resolved:

```
select a from t sort by b
``` 
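
For reference, a hypothetical workaround sketch for the simple form above (this is an 
assumption, not part of the fix; it presumes a spark-shell sqlContext and a registered 
temp table t with columns a and b): keep the sort column in the projection, order on 
it, then drop it afterwards.

{code}
// Hypothetical workaround: select the sort column too, sort, then drop it.
val result = sqlContext.sql("select a, b from t order by b").drop("b")
{code}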


> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> The following query can't be resolved:
> ```
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at 

[jira] [Updated] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-08 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12705:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-12540

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Davies Liu
>
> The following query can't be resolved:
> ```
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12593) Convert resolved logical plans back to SQL query strings

2016-01-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12593.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10541
[https://github.com/apache/spark/pull/10541]

> Convert resolved logical plans back to SQL query strings
> 
>
> Key: SPARK-12593
> URL: https://issues.apache.org/jira/browse/SPARK-12593
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12717) pyspark broadcast fails when using multiple threads

2016-01-08 Thread Edward Walker (JIRA)
Edward Walker created SPARK-12717:
-

 Summary: pyspark broadcast fails when using multiple threads
 Key: SPARK-12717
 URL: https://issues.apache.org/jira/browse/SPARK-12717
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.6.0
 Environment: Linux, python 2.6 or python 2.7.
Reporter: Edward Walker


The following multi-threaded program that uses broadcast variables consistently 
throws exceptions like:  Exception("Broadcast variable '18' not loaded!",) --- 
even when run with "--master local[10]".

try:
    import pyspark
except:
    pass
from optparse import OptionParser

def my_option_parser():
    op = OptionParser()
    op.add_option("--parallelism", dest="parallelism", type="int", default=2)
    return op

def do_process(x, w):
    return x * w.value

def func(name, rdd, conf):
    new_rdd = rdd.map(lambda x: do_process(x, conf))
    total = new_rdd.reduce(lambda x, y: x + y)
    count = rdd.count()
    print name, 1.0 * total / count

if __name__ == "__main__":
    import threading
    op = my_option_parser()
    options, args = op.parse_args()
    sc = pyspark.SparkContext(appName="Buggy")
    data_rdd = sc.parallelize(range(0, 1000), 1)
    confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
    threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]])
               for i in xrange(options.parallelism)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Abridged run output:

% spark-submit --master local[10] bug_spark.py --parallelism 20
[snip]
16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
 line 98, in main
command = pickleSer._read_with_length(infile)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 164, in _read_with_length
return self.loads(obj)
  File 

[jira] [Updated] (SPARK-12717) pyspark broadcast fails when using multiple threads

2016-01-08 Thread Edward Walker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Walker updated SPARK-12717:
--
Description: 
The following multi-threaded program that uses broadcast variables consistently 
throws exceptions like:  Exception("Broadcast variable '18' not 
loaded!",) --- even when run with "--master local[10]".

try:
    import pyspark
except:
    pass
from optparse import OptionParser

def my_option_parser():
    op = OptionParser()
    op.add_option("--parallelism", dest="parallelism", type="int", default=2)
    return op

def do_process(x, w):
    return x * w.value

def func(name, rdd, conf):
    new_rdd = rdd.map(lambda x: do_process(x, conf))
    total = new_rdd.reduce(lambda x, y: x + y)
    count = rdd.count()
    print name, 1.0 * total / count

if __name__ == "__main__":
    import threading
    op = my_option_parser()
    options, args = op.parse_args()
    sc = pyspark.SparkContext(appName="Buggy")
    data_rdd = sc.parallelize(range(0, 1000), 1)
    confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
    threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]])
               for i in xrange(options.parallelism)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Abridged run output:

% spark-submit --master local[10] bug_spark.py --parallelism 20
[snip]
16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
 line 98, in main
command = pickleSer._read_with_length(infile)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 164, in _read_with_length
return self.loads(obj)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 422, in loads
return pickle.loads(obj)
  File 

[jira] [Updated] (SPARK-12717) pyspark broadcast fails when using multiple threads

2016-01-08 Thread Edward Walker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Walker updated SPARK-12717:
--
Description: 
The following multi-threaded program that uses broadcast variables consistently 
throws exceptions like:  *Exception("Broadcast variable '18' not loaded!",)* 
--- even when run with "--master local[10]".

{code:title=bug_spark.py|borderStyle=solid}
try:
    import pyspark
except:
    pass
from optparse import OptionParser

def my_option_parser():
    op = OptionParser()
    op.add_option("--parallelism", dest="parallelism", type="int", default=2)
    return op

def do_process(x, w):
    return x * w.value

def func(name, rdd, conf):
    new_rdd = rdd.map(lambda x: do_process(x, conf))
    total = new_rdd.reduce(lambda x, y: x + y)
    count = rdd.count()
    print name, 1.0 * total / count

if __name__ == "__main__":
    import threading
    op = my_option_parser()
    options, args = op.parse_args()
    sc = pyspark.SparkContext(appName="Buggy")
    data_rdd = sc.parallelize(range(0, 1000), 1)
    confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
    threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]])
               for i in xrange(options.parallelism)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
{code}

Abridged run output:

{code:title=abridge_run.txt|borderStyle=solid}
% spark-submit --master local[10] bug_spark.py --parallelism 20
[snip]
16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
 line 98, in main
command = pickleSer._read_with_length(infile)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 164, in _read_with_length
return self.loads(obj)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",

[jira] [Created] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12725:
--

 Summary: SQL generation suffers from name conflicts introduced by 
some analysis rules
 Key: SPARK-12725
 URL: https://issues.apache.org/jira/browse/SPARK-12725
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian


Some analysis rules generate auxiliary attribute references with the same name 
but different expression IDs. For example, {{ResolveAggregateFunctions}} 
introduces {{havingCondition}} and {{aggOrder}}, and 
{{DistinctAggregationRewriter}} introduces {{gid}}.

This is OK for normal query execution, since these attribute references are still 
distinguished by their expression IDs. However, it's troublesome when converting 
resolved query plans back to SQL query strings, because expression IDs are erased there.
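
A minimal illustration of the conflict, poking at catalyst's attribute API directly (a sketch for this description only, not part of any proposed fix):

{code}
// Two auxiliary attributes like the ones ResolveAggregateFunctions or
// DistinctAggregationRewriter introduce: same name, different expression IDs.
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.IntegerType

val gid1 = AttributeReference("gid", IntegerType)()   // fresh exprId
val gid2 = AttributeReference("gid", IntegerType)()   // another fresh exprId

assert(gid1.name == gid2.name)      // indistinguishable once printed as SQL text
assert(gid1.exprId != gid2.exprId)  // only the erased exprIds tell them apart
{code}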



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12723) Comprehensive SQL generation support for expressions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12723:
---
Description: 
Ensure that every built-in expression can be mapped to its SQL representation if 
it has one (e.g. ScalaUDF doesn't have a SQL representation).

A (possibly incomplete) list of unsupported expressions is provided in PR 
description of [PR #10541|https://github.com/apache/spark/pull/10541]:

- Math expressions
- String expressions
- Null expressions
- Calendar interval literal
- Some of the date/time expressions
- Complex type creators
- Special NOT expressions, e.g. NOT LIKE and NOT IN

  was:
Ensure that all built-in expressions can be mapped to its SQL representation if 
there is one (e.g. ScalaUDF doesn't have a SQL representation).

A (possibly incomplete) list of unsupported expressions is provided in PR 
description of [PR #10541|https://github.com/apache/spark/pull/10541]:


> Comprehensive SQL generation support for expressions
> 
>
> Key: SPARK-12723
> URL: https://issues.apache.org/jira/browse/SPARK-12723
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>
> Ensure that every built-in expression can be mapped to its SQL representation 
> if it has one (e.g. ScalaUDF doesn't have a SQL representation).
> A (possibly incomplete) list of unsupported expressions is provided in PR 
> description of [PR #10541|https://github.com/apache/spark/pull/10541]:
> - Math expressions
> - String expressions
> - Null expressions
> - Calendar interval literal
> - Some of the date/time expressions
> - Complex type creators
> - Special NOT expressions, e.g. NOT LIKE and NOT IN
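
As a toy sketch of what "mapped to its SQL representation" means here (illustrative only, not Spark's actual SQL builder from PR #10541):

{code}
import org.apache.spark.sql.catalyst.expressions._

// A few hand-picked expression shapes rendered as SQL text; the task above is to
// make this kind of coverage exhaustive for the built-in expressions listed.
def toSqlText(e: Expression): String = e match {
  case Literal(value, _)      => String.valueOf(value)
  case a: AttributeReference  => s"`${a.name}`"
  case Add(left, right)       => s"(${toSqlText(left)} + ${toSqlText(right)})"
  case Not(Like(left, right)) => s"(${toSqlText(left)} NOT LIKE ${toSqlText(right)})"
  case other                  => sys.error(s"no SQL representation for $other")
}
{code}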



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12725) SQL generation suffers from name conflicts introduced by some analysis rules

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12725:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation suffers from name conflicts introduced by some analysis rules
> ---
>
> Key: SPARK-12725
> URL: https://issues.apache.org/jira/browse/SPARK-12725
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>
> Some analysis rules generate auxiliary attribute references with the same 
> name but different expression IDs. For example, {{ResolveAggregateFunctions}} 
> introduces {{havingCondition}} and {{aggOrder}}, and 
> {{DistinctAggregationRewriter}} introduces {{gid}}.
> This is OK for normal query execution, since these attribute references are still 
> distinguished by their expression IDs. However, it's troublesome when converting 
> resolved query plans back to SQL query strings, because expression IDs are erased there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12728) Integrate SQL generation feature with native view

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12728:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> Integrate SQL generation feature with native view
> -
>
> Key: SPARK-12728
> URL: https://issues.apache.org/jira/browse/SPARK-12728
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12727) SQL generation support for distinct aggregation patterns that fit DistinctAggregationRewriter analysis rule

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12727:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for distinct aggregation patterns that fit 
> DistinctAggregationRewriter analysis rule
> ---
>
> Key: SPARK-12727
> URL: https://issues.apache.org/jira/browse/SPARK-12727
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12731) PySpark docstring cleanup

2016-01-08 Thread holdenk (JIRA)
holdenk created SPARK-12731:
---

 Summary: PySpark docstring cleanup
 Key: SPARK-12731
 URL: https://issues.apache.org/jira/browse/SPARK-12731
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, PySpark
Reporter: holdenk
Priority: Trivial


We don't currently have any automated checks that our PySpark docstring lines 
are within the pep8/275/276 length limits (the pep8 checker doesn't handle 
this). As such there are ~400 non-conformant docstring lines. This JIRA is to 
fix those docstring lines and add a check to lint-python that fails on long 
lines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12716) Executor UI improvement suggestions - Totals

2016-01-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12716?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090245#comment-15090245
 ] 

Apache Spark commented on SPARK-12716:
--

User 'ajbozarth' has created a pull request for this issue:
https://github.com/apache/spark/pull/10668

> Executor UI improvement suggestions - Totals
> 
>
> Key: SPARK-12716
> URL: https://issues.apache.org/jira/browse/SPARK-12716
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>
> Splitting off the Totals portion of the parent UI improvements task, 
> description copied below:
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
> ...
> Report the TOTALS in each column (do this at the TOP so no need to scroll to 
> the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12716) Executor UI improvement suggestions - Totals

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12716:


Assignee: (was: Apache Spark)

> Executor UI improvement suggestions - Totals
> 
>
> Key: SPARK-12716
> URL: https://issues.apache.org/jira/browse/SPARK-12716
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>
> Splitting off the Totals portion of the parent UI improvements task, 
> description copied below:
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
> ...
> Report the TOTALS in each column (do this at the TOP so no need to scroll to 
> the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12716) Executor UI improvement suggestions - Totals

2016-01-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12716:


Assignee: Apache Spark

> Executor UI improvement suggestions - Totals
> 
>
> Key: SPARK-12716
> URL: https://issues.apache.org/jira/browse/SPARK-12716
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Apache Spark
>
> Splitting off the Totals portion of the parent UI improvements task, 
> description copied below:
> I received some suggestions from a user for the /executors UI page to make it 
> more helpful. This gets more important when you have a really large number of 
> executors.
> ...
> Report the TOTALS in each column (do this at the TOP so no need to scroll to 
> the bottom, or print both at top and bottom).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12731) PySpark docstring cleanup

2016-01-08 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090280#comment-15090280
 ] 

holdenk commented on SPARK-12731:
-

I've got a shell script started to do this. I'll just take some free time when 
I'm tired and clean up our current docstrings.

> PySpark docstring cleanup
> -
>
> Key: SPARK-12731
> URL: https://issues.apache.org/jira/browse/SPARK-12731
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark
>Reporter: holdenk
>Priority: Trivial
>
> We don't currently have any automated checks that our PySpark docstring lines 
> are within the pep8/275/276 length limits (the pep8 checker doesn't handle 
> this). As such there are ~400 non-conformant docstring lines. This JIRA is to 
> fix those docstring lines and add a check to lint-python that fails on long 
> lines.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib

2016-01-08 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090292#comment-15090292
 ] 

Bryan Cutler commented on SPARK-11219:
--

That's my fault [~josephkb]. I was instructing contributors to go up to 100 
characters based on this line from the style guide wiki: "For Python code, Apache 
Spark follows PEP 8 with one exception: lines can be up to 100 characters in 
length, not 79". But I guess it is different for docstrings. Apologies if that 
created extra work for you guys.

> Make Parameter Description Format Consistent in PySpark.MLlib
> -
>
> Key: SPARK-11219
> URL: https://issues.apache.org/jira/browse/SPARK-11219
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> There are several different formats for describing params in PySpark.MLlib, 
> making it unclear what the preferred way to document is, i.e. vertical 
> alignment vs single line.
> This is to agree on a format and make it consistent across PySpark.MLlib.
> Following the discussion in SPARK-10560, using 2 lines with an indentation is 
> both readable and doesn't lead to changing many lines when adding/removing 
> parameters.  If the parameter uses a default value, put it in parentheses on 
> a new line under the description.
> Example:
> {noformat}
> :param stepSize:
>   Step size for each iteration of gradient descent.
>   (default: 0.1)
> :param numIterations:
>   Number of iterations run for each batch of data.
>   (default: 50)
> {noformat}
> h2. Current State of Parameter Description Formatting
> h4. Classification
>   * LogisticRegressionModel - single line descriptions, fix indentations
>   * LogisticRegressionWithSGD - vertical alignment, sporadic default values
>   * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values
>   * SVMModel - single line
>   * SVMWithSGD - vertical alignment, sporadic default values
>   * NaiveBayesModel - single line
>   * NaiveBayes - single line
> h4. Clustering
>   * KMeansModel - missing param description
>   * KMeans - missing param description and defaults
>   * GaussianMixture - vertical align, incorrect default formatting
>   * PowerIterationClustering - single line with wrapped indentation, missing 
> defaults
>   * StreamingKMeansModel - single line wrapped
>   * StreamingKMeans - single line wrapped, missing defaults
>   * LDAModel - single line
>   * LDA - vertical align, missing some defaults
> h4. FPM  
>   * FPGrowth - single line
>   * PrefixSpan - single line, defaults values in backticks
> h4. Recommendation
>   * ALS - does not have param descriptions
> h4. Regression
>   * LabeledPoint - single line
>   * LinearModel - single line
>   * LinearRegressionWithSGD - vertical alignment
>   * RidgeRegressionWithSGD - vertical align
>   * IsotonicRegressionModel - single line
>   * IsotonicRegression - single line, missing default
> h4. Tree
>   * DecisionTree - single line with vertical indentation, missing defaults
>   * RandomForest - single line with wrapped indent, missing some defaults
>   * GradientBoostedTrees - single line with wrapped indent
> NOTE
> This issue will just focus on model/algorithm descriptions, which are the 
> largest source of inconsistent formatting
> evaluation.py, feature.py, random.py, utils.py - these supporting classes 
> have param descriptions as single line, but are consistent so don't need to 
> be changed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12720:
--

 Summary: SQL generation support for cube, rollup, and grouping set
 Key: SPARK-12720
 URL: https://issues.apache.org/jira/browse/SPARK-12720
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12721) SQL generation support for script transformation

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12721:
--

 Summary: SQL generation support for script transformation
 Key: SPARK-12721
 URL: https://issues.apache.org/jira/browse/SPARK-12721
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12717) pyspark broadcast fails when using multiple threads

2016-01-08 Thread Edward Walker (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Edward Walker updated SPARK-12717:
--
Description: 
The following multi-threaded program that uses broadcast variables consistently 
throws exceptions like:  Exception("Broadcast variable '18' not 
loaded!",) --- even when run with "--master local[10]".

```
try:
    import pyspark
except:
    pass
from optparse import OptionParser

def my_option_parser():
    op = OptionParser()
    op.add_option("--parallelism", dest="parallelism", type="int", default=2)
    return op

def do_process(x, w):
    return x * w.value

def func(name, rdd, conf):
    new_rdd = rdd.map(lambda x: do_process(x, conf))
    total = new_rdd.reduce(lambda x, y: x + y)
    count = rdd.count()
    print name, 1.0 * total / count

if __name__ == "__main__":
    import threading
    op = my_option_parser()
    options, args = op.parse_args()
    sc = pyspark.SparkContext(appName="Buggy")
    data_rdd = sc.parallelize(range(0, 1000), 1)
    confs = [sc.broadcast(i) for i in xrange(options.parallelism)]
    threads = [threading.Thread(target=func, args=["thread_" + str(i), data_rdd, confs[i]])
               for i in xrange(options.parallelism)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```
Abridged run output:
```
% spark-submit --master local[10] bug_spark.py --parallelism 20
[snip]
16/01/08 17:10:20 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 9)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/worker.py",
 line 98, in main
command = pickleSer._read_with_length(infile)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 164, in _read_with_length
return self.loads(obj)
  File 
"/Network/Servers/mother.adverplex.com/Volumes/homeland/Users/walker/.spark/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py",
 line 422, in loads
return pickle.loads(obj)
  File 

[jira] [Commented] (SPARK-11012) Canonicalize view definitions

2016-01-08 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090110#comment-15090110
 ] 

Cheng Lian commented on SPARK-11012:


Done.

> Canonicalize view definitions
> -
>
> Key: SPARK-11012
> URL: https://issues.apache.org/jira/browse/SPARK-11012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> In SPARK-10337, we added the first step of supporting view natively. Building 
> on top of that work, we need to canonicalize the view definition. So, for a 
> SQL string SELECT a, b FROM table, we will save this text to Hive metastore 
> as SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11012) Canonicalize view definitions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-11012:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> Canonicalize view definitions
> -
>
> Key: SPARK-11012
> URL: https://issues.apache.org/jira/browse/SPARK-11012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Yin Huai
>
> In SPARK-10337, we added the first step of supporting view natively. Building 
> on top of that work, we need to canonicalize the view definition. So, for a 
> SQL string SELECT a, b FROM table, we will save this text to Hive metastore 
> as SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12728) Integrate SQL generation feature with native view

2016-01-08 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-12728:
--

 Summary: Integrate SQL generation feature with native view
 Key: SPARK-12728
 URL: https://issues.apache.org/jira/browse/SPARK-12728
 Project: Spark
  Issue Type: Sub-task
Reporter: Cheng Lian






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12655) GraphX does not unpersist RDDs

2016-01-08 Thread Jason C Lee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090165#comment-15090165
 ] 

Jason C Lee commented on SPARK-12655:
-

The VertexRDD and EdgeRDD you see are created during the intermediate step of 
g.connectedComponents(). They are not properly unpersisted at the moment. I 
will look into this. 
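
A blunt interim workaround sketch (assumes nothing else in the application needs to stay cached; not the eventual fix):

{code}
// Drop every RDD still registered as persistent with the SparkContext,
// including the intermediate VertexRDD/EdgeRDD left behind by connectedComponents().
sc.getPersistentRDDs.values.foreach(_.unpersist(blocking = false))
{code}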

> GraphX does not unpersist RDDs
> --
>
> Key: SPARK-12655
> URL: https://issues.apache.org/jira/browse/SPARK-12655
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Alexander Pivovarov
>
> Looks like Graph does not clean all RDDs from the cache on unpersist
> {code}
> // open spark-shell 1.5.2 or 1.6.0
> // run
> import org.apache.spark.graphx._
> val vert = sc.parallelize(List((1L, 1), (2L, 2), (3L, 3)), 1)
> val edges = sc.parallelize(List(Edge[Long](1L, 2L), Edge[Long](1L, 3L)), 1)
> val g0 = Graph(vert, edges)
> val g = g0.partitionBy(PartitionStrategy.EdgePartition2D, 2)
> val cc = g.connectedComponents()
> cc.unpersist()
> g.unpersist()
> g0.unpersist()
> vert.unpersist()
> edges.unpersist()
> {code}
> open http://localhost:4040/storage/
> Spark UI 4040 Storage page still shows 2 items
> {code}
> VertexRDD  Memory Deserialized 1x Replicated   1  100%1688.0 B   0.0 
> B  0.0 B
> EdgeRDDMemory Deserialized 1x Replicated   2  100%  4.7 KB   0.0 
> B  0.0 B
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12718) SQL generation support for window functions

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12718:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for window functions
> ---
>
> Key: SPARK-12718
> URL: https://issues.apache.org/jira/browse/SPARK-12718
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12720) SQL generation support for cube, rollup, and grouping set

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12720:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for cube, rollup, and grouping set
> -
>
> Key: SPARK-12720
> URL: https://issues.apache.org/jira/browse/SPARK-12720
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12719) SQL generation support for generators (including UDTF)

2016-01-08 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-12719:
---
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> SQL generation support for generators (including UDTF)
> --
>
> Key: SPARK-12719
> URL: https://issues.apache.org/jira/browse/SPARK-12719
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11944) Python API for mllib.clustering.BisectingKMeans

2016-01-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11944:
--
Assignee: holdenk
Target Version/s: 2.0.0

> Python API for mllib.clustering.BisectingKMeans
> ---
>
> Key: SPARK-11944
> URL: https://issues.apache.org/jira/browse/SPARK-11944
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: holdenk
>Priority: Minor
>
> Add Python API for mllib.clustering.BisectingKMeans.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11826) Subtract BlockMatrix

2016-01-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11826?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11826:
--
Target Version/s:   (was: 2.0.0)

> Subtract BlockMatrix
> 
>
> Key: SPARK-11826
> URL: https://issues.apache.org/jira/browse/SPARK-11826
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.6.0
>Reporter: Ehsan Mohyedin Kermani
>Assignee: Ehsan Mohyedin Kermani
>Priority: Minor
>
> It'd be more convenient to have subtract method for BlockMatrices.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11214) Join with Unicode-String results wrong empty

2016-01-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090222#comment-15090222
 ] 

Davies Liu commented on SPARK-11214:


[~xDSticker] Could you try this in master or 1.6? I couldn't reproduce it in 
master (standalone mode, with a 40G executor and a 40G driver).
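
For anyone retrying it, the repro from the report as a single spark-shell statement (assumes a HiveContext bound to `hiveContext`, as in the original report; the expected result is one row rather than an empty list):

{code}
hiveContext.sql(
  """SELECT * FROM (SELECT "c" AS a) AS a JOIN (SELECT "c" AS b) AS b ON a.a = b.b"""
).take(10)
{code}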

> Join with Unicode-String results wrong empty
> 
>
> Key: SPARK-11214
> URL: https://issues.apache.org/jira/browse/SPARK-11214
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.0
>Reporter: Hans Fischer
>Assignee: Josh Rosen
>Priority: Critical
> Fix For: 1.5.1
>
>
> I created a join that should clearly result in a single row but return: 
> empty. Could someone validate this bug?
> hiveContext.sql('SELECT * FROM (SELECT "c" AS a) AS a JOIN (SELECT "c" AS b) 
> AS b ON a.a = b.b').take(10)
> result: []
> kind regards
> Hans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12705) Sorting column can't be resolved if it's not in projection

2016-01-08 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15089994#comment-15089994
 ] 

Davies Liu commented on SPARK-12705:


[~smilegator] I have updated the JIRA to include a reproducible test case.

> Sorting column can't be resolved if it's not in projection
> --
>
> Key: SPARK-12705
> URL: https://issues.apache.org/jira/browse/SPARK-12705
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Davies Liu
>
> The following query can't be resolved:
> ```
> scala> sqlContext.sql("select sum(a) over ()  from (select 1 as a, 2 as b) t 
> order by b").explain()
> org.apache.spark.sql.AnalysisException: cannot resolve 'b' given input 
> columns: [_c0]; line 1 pos 63
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:336)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:335)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:282)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>   at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>   at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
>   at scala.collection.AbstractIterator.to(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>   at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
>   at 
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>   at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:109)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:119)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2$1.apply(QueryPlan.scala:123)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> ``` 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11012) Canonicalize view definitions

2016-01-08 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090033#comment-15090033
 ] 

Yin Huai commented on SPARK-11012:
--

[~lian cheng] Let's create jiras for all sub-tasks.

> Canonicalize view definitions
> -
>
> Key: SPARK-11012
> URL: https://issues.apache.org/jira/browse/SPARK-11012
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>
> In SPARK-10337, we added the first step of supporting view natively. Building 
> on top of that work, we need to canonicalize the view definition. So, for a 
> SQL string SELECT a, b FROM table, we will save this text to Hive metastore 
> as SELECT `table`.`a`, `table`.`b` FROM `currentDB`.`table`. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12593) Convert resolved logical plans back to SQL query strings

2016-01-08 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12593:
-
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-11012

> Convert resolved logical plans back to SQL query strings
> 
>
> Key: SPARK-12593
> URL: https://issues.apache.org/jira/browse/SPARK-12593
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Cheng Lian
>Assignee: Cheng Lian
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


