[jira] [Commented] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation

2016-01-05 Thread somil deshmukh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085130#comment-15085130
 ] 

somil deshmukh commented on SPARK-12632:


I am working on this and will send a pull request today.

> Make Parameter Descriptions Consistent for PySpark MLlib FPM and 
> Recommendation
> ---
>
> Key: SPARK-12632
> URL: https://issues.apache.org/jira/browse/SPARK-12632
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up fpm.py 
> and recommendation.py
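
For reference, a rough sketch of the kind of parameter description format the parent task standardizes on, applied to a recommendation.py-style method. The docstring text and signature below are illustrative only, not copied from the source:

{code}
# Illustrative sketch only: the ":param name:" block style with indented
# descriptions and explicit defaults. Names and wording are examples, not the
# actual fpm.py/recommendation.py text.
def trainImplicit(ratings, rank, iterations=5, lambda_=0.01):
    """
    Train a matrix factorization model from implicit feedback.

    :param ratings:
      RDD of `Rating` or (userID, productID, rating) tuples.
    :param rank:
      Number of latent factors in the model.
    :param iterations:
      Number of iterations of ALS.
      (default: 5)
    :param lambda_:
      Regularization parameter.
      (default: 0.01)
    """
{code}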



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10383) Sync example code between API doc and user guide

2016-01-05 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085167#comment-15085167
 ] 

Xusen Yin commented on SPARK-10383:
---

I think we have only (almost) finished step 1 - moving example code to 
spark/examples. There is also a step 2 - connecting spark/examples with the API 
doc, according to the introduction of this JIRA. However, I don't quite 
understand [~mengxr]'s idea.

> Sync example code between API doc and user guide
> 
>
> Key: SPARK-10383
> URL: https://issues.apache.org/jira/browse/SPARK-10383
> Project: Spark
>  Issue Type: Brainstorming
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> It would be nice to provide example code in both user guide and API docs. 
> However, it would become hard to keep the content in-sync. This JIRA is to 
> collect approaches/processes to make it feasible.
> This is related to SPARK-10382, where we discuss how to move example code 
> from user guide markdown to `spark/examples/`. After that, we can look for 
> solutions that can pick up example code from `spark/examples` and make them 
> available in the API doc. Though I don't know any feasible solution right 
> now, those are some relevant projects:
> * https://github.com/tkawachi/sbt-doctest
> * http://www.doctester.org/
> It would be nice to hear more ideas.
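
One possible direction, sketched below under the assumption that snippets in spark/examples are wrapped in begin/end comment markers (the marker syntax is hypothetical): a small extraction step could pull a labelled snippet out of an example file, so the same code could be embedded in both the user guide and the API doc.

{code}
import re

def extract_example(path, label):
    """Return the code between hypothetical '$example on:<label>$' and
    '$example off:<label>$' markers in a file under spark/examples."""
    with open(path) as f:
        source = f.read()
    pattern = re.compile(
        r"\$example on:%s\$(.*?)\$example off:%s\$"
        % (re.escape(label), re.escape(label)),
        re.DOTALL)
    match = pattern.search(source)
    if match is None:
        raise ValueError("snippet %r not found in %s" % (label, path))
    # Drop leading/trailing blank lines so the snippet embeds cleanly in docs.
    return match.group(1).strip("\n")
{code}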



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12393) Add read.text and write.text for SparkR

2016-01-05 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12393:
--
Assignee: Yanbo Liang

> Add read.text and write.text for SparkR
> ---
>
> Key: SPARK-12393
> URL: https://issues.apache.org/jira/browse/SPARK-12393
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 1.6.1, 2.0.0
>
>
> Add read.text and write.text for SparkR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3872) Rewrite the test for ActorInputStream.

2016-01-05 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15084983#comment-15084983
 ] 

Prashant Sharma commented on SPARK-3872:


The reason this exists is that we had a version of Typesafe Config which did not 
allow us to turn that JVM option on. We chose to sacrifice the test in favor of 
keeping that option. However, once we get rid of the Akka dependency (or upgrade 
it, which is not going to happen), this can be restored. It is certainly a 
"Won't Fix" if we intend to get rid of all of Akka in subsequent releases.

> Rewrite the test for ActorInputStream. 
> ---
>
> Key: SPARK-3872
> URL: https://issues.apache.org/jira/browse/SPARK-3872
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12665) Remove deprecated and unused classes

2016-01-05 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-12665:
--

 Summary: Remove deprecated and unused classes
 Key: SPARK-12665
 URL: https://issues.apache.org/jira/browse/SPARK-12665
 Project: Spark
  Issue Type: Sub-task
Reporter: Kousuke Saruta


Vector.scala and GraphKryoRegistrator are no longer used anywhere, so it's time 
to remove them in Spark 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12662) Add document to randomSplit to explain the sampling depends on the ordering of the rows in a partition

2016-01-05 Thread Brian Pasley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085005#comment-15085005
 ] 

Brian Pasley commented on SPARK-12662:
--

Users of randomSplit probably don't realize that the disjointness of the 
returned sets depends on the row ordering within each partition. randomSplit is 
used in the ML pipeline to split training/validation/test sets, which is a 
common operation that doesn't assume sorted data in general, e.g.:
http://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/TrainValidationSplit.scala

If users miss the documentation, they may end up with overlapping train/test 
sets without realizing it. Can we add a local sort operator, or warn users that 
there may be overlap? 

> Add document to randomSplit to explain the sampling depends on the ordering 
> of the rows in a partition
> --
>
> Key: SPARK-12662
> URL: https://issues.apache.org/jira/browse/SPARK-12662
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, SQL
>Reporter: Yin Huai
>Assignee: Sameer Agarwal
>
> With {{./bin/spark-shell --master=local-cluster[2,1,2014]}}, the following 
> code will produce overlapping rows in the two DFs returned by randomSplit. 
> {code}
> sqlContext.sql("drop table if exists test")
> val x = sc.parallelize(1 to 210)
> case class R(ID : Int)
> sqlContext.createDataFrame(x.map 
> {R(_)}).write.format("json").saveAsTable("bugsc1597")
> var df = sql("select distinct ID from test")
> var Array(a, b) = df.randomSplit(Array(0.333, 0.667), 1234L)
> a.registerTempTable("a")
> b.registerTempTable("b")
> val intersectDF = a.intersect(b)
> intersectDF.show
> {code}
> The reason is that {{sql("select distinct ID from test")}} does not guarantee 
> the ordering of rows within a partition. It would be good to add more 
> documentation to the API doc to explain this. For intersectDF to contain 0 
> rows, the df needs to have a fixed row ordering within each partition.
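
A minimal PySpark sketch of that mitigation, assuming the shell-provided {{sqlContext}} and using a stand-in DataFrame instead of the table above: fix the per-partition ordering before splitting.

{code}
# Sketch only: impose a deterministic per-partition ordering before randomSplit
# so the two splits are disjoint. The DataFrame below is a stand-in for the
# table used in the repro.
df = sqlContext.range(0, 210).withColumnRenamed("id", "ID")
df_fixed = df.sortWithinPartitions("ID")
a, b = df_fixed.randomSplit([0.333, 0.667], seed=1234)
assert a.intersect(b).count() == 0  # no overlap once the ordering is fixed
{code}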



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT

2016-01-05 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-12666:
--

 Summary: spark-shell --packages cannot load artifacts which are 
publishLocal'd by SBT
 Key: SPARK-12666
 URL: https://issues.apache.org/jira/browse/SPARK-12666
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 1.6.0, 1.5.1
Reporter: Josh Rosen


Symptom:

I cloned the latest master of {{spark-redshift}}, then used {{sbt 
publishLocal}} to publish it to my Ivy cache. When I tried running 
{{./bin/spark-shell --packages 
com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency 
into {{spark-shell}}, I received the following cryptic error:

{code}
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: 
com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration not found in 
com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It was required 
from org.apache.spark#spark-submit-parent;1.0 default]
at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

I think the problem here is that Spark is declaring a dependency on the 
spark-redshift artifact using the {{default}} Ivy configuration. The default 
configuration will be the only configuration defined in an Ivy artifact if that 
artifact defines no other configurations. Thus, for Maven artifacts I think the 
default configuration will end up mapping to Maven's regular JAR dependency but 
for Ivy artifacts I think we can run into trouble when loading artifacts which 
explicitly define their own configurations, since those artifacts might not 
have a configuration named {{default}}.

I spent a bit of time playing around with the SparkSubmit code to see if I 
could fix this but wasn't able to completely resolve the issue.

/cc [~brkyvz] (ping me offline and I can walk you through the repo in person, 
if you'd like)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12666) spark-shell --packages cannot load artifacts which are publishLocal'd by SBT

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-12666:
---
Description: 
Symptom:

I cloned the latest master of {{spark-redshift}}, then used {{sbt 
publishLocal}} to publish it to my Ivy cache. When I tried running 
{{./bin/spark-shell --packages 
com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency 
into {{spark-shell}}, I received the following cryptic error:

{code}
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: 
com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration not found in 
com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It was required 
from org.apache.spark#spark-submit-parent;1.0 default]
at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

I think the problem here is that Spark is declaring a dependency on the 
spark-redshift artifact using the {{default}} Ivy configuration. Based on my 
admittedly limited understanding of Ivy, the default configuration will be the 
only configuration defined in an Ivy artifact if that artifact defines no other 
configurations. Thus, for Maven artifacts I think the default configuration 
will end up mapping to Maven's regular JAR dependency but for Ivy artifacts I 
think we can run into trouble when loading artifacts which explicitly define 
their own configurations, since those artifacts might not have a configuration 
named {{default}}.

I spent a bit of time playing around with the SparkSubmit code to see if I 
could fix this but wasn't able to completely resolve the issue.

/cc [~brkyvz] (ping me offline and I can walk you through the repo in person, 
if you'd like)

  was:
Symptom:

I cloned the latest master of {{spark-redshift}}, then used {{sbt 
publishLocal}} to publish it to my Ivy cache. When I tried running 
{{./bin/spark-shell --packages 
com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency 
into {{spark-shell}}, I received the following cryptic error:

{code}
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: 
com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration not found in 
com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: 'default'. It was required 
from org.apache.spark#spark-submit-parent;1.0 default]
at 
org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1009)
at 
org.apache.spark.deploy.SparkSubmit$.prepareSubmitEnvironment(SparkSubmit.scala:286)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:153)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

I think the problem here is that Spark is declaring a dependency on the 
spark-redshift artifact using the {{default}} Ivy configuration. The default 
configuration will be the only configuration defined in an Ivy artifact if that 
artifact defines no other configurations. Thus, for Maven artifacts I think the 
default configuration will end up mapping to Maven's regular JAR dependency but 
for Ivy artifacts I think we can run into trouble when loading artifacts which 
explicitly define their own configurations, since those artifacts might not 
have a configuration named {{default}}.

I spent a bit of time playing around with the SparkSubmit code to see if I 
could fix this but wasn't able to completely resolve the issue.

/cc [~brkyvz] (ping me offline and I can walk you through the repo in person, 
if you'd like)


> spark-shell --packages cannot load artifacts which are publishLocal'd by SBT
> 
>
> Key: SPARK-12666
> URL: https://issues.apache.org/jira/browse/SPARK-12666
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.1, 1.6.0
>Reporter: Josh Rosen
>
> Symptom:
> I cloned the latest master of {{spark-redshift}}, then used {{sbt 
> publishLocal}} to publish it to my Ivy cache. When I tried running 
> {{./bin/spark-shell --packages 
> com.databricks:spark-redshift_2.10:0.5.3-SNAPSHOT}} to load this dependency 
> into {{spark-shell}}, I received the following cryptic error:
> {code}
> Exception in thread "main" java.lang.RuntimeException: [unresolved 
> dependency: com.databricks#spark-redshift_2.10;0.5.3-SNAPSHOT: configuration 
> not found in 

[jira] [Created] (SPARK-12659) NPE when spill in CartisianProduct

2016-01-05 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12659:
--

 Summary: NPE when spill in CartisianProduct
 Key: SPARK-12659
 URL: https://issues.apache.org/jira/browse/SPARK-12659
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Davies Liu
Assignee: Davies Liu


{code}
java.lang.NullPointerException
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:54)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:37)
at 
org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:231)
at 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12658) Revert SPARK-12511

2016-01-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083946#comment-15083946
 ] 

Sean Owen commented on SPARK-12658:
---

This should just be part of updating Py4J, right? Why is this a separate item? 

> Revert SPARK-12511 
> ---
>
> Key: SPARK-12658
> URL: https://issues.apache.org/jira/browse/SPARK-12658
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Shixiong Zhu
>
>  SPARK-12511 is just a workaround, since Py4J is going to fix it. We should 
> revert it in master when we upgrade Py4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12658) Revert SPARK-12511

2016-01-05 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083950#comment-15083950
 ] 

Shixiong Zhu edited comment on SPARK-12658 at 1/5/16 10:24 PM:
---

Just in case I forget to revert them.


was (Author: zsxwing):
Just in case I forgot to revert them.

> Revert SPARK-12511 
> ---
>
> Key: SPARK-12658
> URL: https://issues.apache.org/jira/browse/SPARK-12658
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Shixiong Zhu
>
>  SPARK-12511 is just a workaround, since Py4J is going to fix it. We should 
> revert it in master when we upgrade Py4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12658) Revert SPARK-12511

2016-01-05 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083950#comment-15083950
 ] 

Shixiong Zhu commented on SPARK-12658:
--

Just in case I forgot to revert them.

> Revert SPARK-12511 
> ---
>
> Key: SPARK-12658
> URL: https://issues.apache.org/jira/browse/SPARK-12658
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Shixiong Zhu
>
>  SPARK-12511 is just a workaround, since Py4J is going to fix it. We should 
> revert it in master when we upgrade Py4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12570) DecisionTreeRegressor: provide variance of prediction: user guide update

2016-01-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12570.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10594
[https://github.com/apache/spark/pull/10594]

> DecisionTreeRegressor: provide variance of prediction: user guide update
> 
>
> Key: SPARK-12570
> URL: https://issues.apache.org/jira/browse/SPARK-12570
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0
>
>
> See linked JIRA for details.  This should update the table of output columns 
> and text.  Examples are probably not needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12379) Copy GBT implementation to spark.ml

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12379:


Assignee: (was: Apache Spark)

> Copy GBT implementation to spark.ml
> ---
>
> Key: SPARK-12379
> URL: https://issues.apache.org/jira/browse/SPARK-12379
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is 
> preventing several improvements to GBTs in spark.ml, so we need to move the 
> implementation to ml and use spark.ml decision trees in the implementation. 
> At first, we should make minimal changes to the implementation.
> Performance testing should be done to ensure there were no regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12379) Copy GBT implementation to spark.ml

2016-01-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083979#comment-15083979
 ] 

Apache Spark commented on SPARK-12379:
--

User 'sethah' has created a pull request for this issue:
https://github.com/apache/spark/pull/10607

> Copy GBT implementation to spark.ml
> ---
>
> Key: SPARK-12379
> URL: https://issues.apache.org/jira/browse/SPARK-12379
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>
> Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is 
> preventing several improvements to GBTs in spark.ml, so we need to move the 
> implementation to ml and use spark.ml decision trees in the implementation. 
> At first, we should make minimal changes to the implementation.
> Performance testing should be done to ensure there were no regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12368) Better doc for the binary classification evaluator' metricName

2016-01-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12368:
--
Shepherd: Joseph K. Bradley
Assignee: Benjamin Fradet
Target Version/s: 2.0.0

> Better doc for the binary classification evaluator' metricName
> --
>
> Key: SPARK-12368
> URL: https://issues.apache.org/jira/browse/SPARK-12368
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Benjamin Fradet
>Assignee: Benjamin Fradet
>Priority: Minor
>
> For the BinaryClassificationEvaluator, the scaladoc doesn't mention that 
> "areaUnderPR" is supported, only that the default is "areaUnderROC".
> Also, in the documentation, it is said that:
> "The default metric used to choose the best ParamMap can be overriden by the 
> setMetric method in each of these evaluators."
> However, the method is called setMetricName.
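
For illustration, the same evaluator exists in PySpark and both metrics can be selected with setMetricName. A small sketch; the surrounding pipeline/cross-validation setup is omitted:

{code}
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# "areaUnderROC" is the default; "areaUnderPR" is also supported.
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction",
                                          labelCol="label")
evaluator.setMetricName("areaUnderPR")  # note: setMetricName, not setMetric
{code}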



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7675) PySpark spark.ml Params type conversions

2016-01-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7675:
-
Shepherd: Joseph K. Bradley
Assignee: holdenk
Target Version/s: 2.0.0

> PySpark spark.ml Params type conversions
> 
>
> Key: SPARK-7675
> URL: https://issues.apache.org/jira/browse/SPARK-7675
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>Assignee: holdenk
>Priority: Minor
>
> Currently, PySpark wrappers for spark.ml Scala classes are brittle when 
> accepting Param types.  E.g., Normalizer's "p" param cannot be set to "2" (an 
> integer); it must be set to "2.0" (a float).  Fixing this is not trivial 
> since there does not appear to be a natural place to insert the conversion 
> before Python wrappers call Java's Params setter method.
> A possible fix would be to add a method "_checkType" to PySpark's Param 
> class which checks the type, prints an error if needed, and converts types 
> when relevant (e.g., int to float, or scipy matrix to array). The Java 
> wrapper method which copies params to Scala can call this method when 
> available.
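
A rough sketch of the kind of conversion hook described above; the function name and placement are hypothetical, not the eventual implementation:

{code}
# Hypothetical sketch of a type-conversion hook for PySpark Params: coerce a
# user-supplied value to the expected Python type before it is copied to the
# Java side (e.g. allow Normalizer(p=2) as well as p=2.0 for a float param).
def _convert_param_value(value, expected_type):
    if isinstance(value, expected_type):
        return value
    if expected_type is float and isinstance(value, int):
        return float(value)
    if expected_type is list and hasattr(value, "tolist"):
        return value.tolist()  # e.g. numpy arrays / scipy dense vectors
    raise TypeError("Cannot convert %r to %s" % (value, expected_type.__name__))
{code}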



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10789) spark-submit in cluster mode can't use third-party libraries

2016-01-05 Thread Jonathan Kelly (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Kelly updated SPARK-10789:
---
Summary: spark-submit in cluster mode can't use third-party libraries  
(was: Cluster mode SparkSubmit classpath only includes Spark assembly)

> spark-submit in cluster mode can't use third-party libraries
> 
>
> Key: SPARK-10789
> URL: https://issues.apache.org/jira/browse/SPARK-10789
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.5.0, 1.6.0
>Reporter: Jonathan Kelly
> Attachments: SPARK-10789.diff, SPARK-10789.v1.6.0.diff
>
>
> When using cluster deploy mode, the classpath of the SparkSubmit process that 
> gets launched only includes the Spark assembly and not 
> spark.driver.extraClassPath. This is of course by design, since the driver 
> actually runs on the cluster and not inside the SparkSubmit process.
> However, if the SparkSubmit process, minimal as it may be, needs any extra 
> libraries that are not part of the Spark assembly, there is no good way to 
> include them. (I say "no good way" because including them in the 
> SPARK_CLASSPATH environment variable does cause the SparkSubmit process to 
> include them, but this is not acceptable because this environment variable 
> has long been deprecated, and it prevents the use of 
> spark.driver.extraClassPath.)
> An example of when this matters is on Amazon EMR when using an S3 path for 
> the application JAR and running in yarn-cluster mode. The SparkSubmit process 
> needs the EmrFileSystem implementation and its dependencies in the classpath 
> in order to download the application JAR from S3, so it fails with a 
> ClassNotFoundException. (EMR currently gets around this by setting 
> SPARK_CLASSPATH, but as mentioned above this is less than ideal.)
> I have tried modifying SparkSubmitCommandBuilder to include the driver extra 
> classpath whether it's client mode or cluster mode, and this seems to work, 
> but I don't know if there is any downside to this.
> Example that fails on emr-4.0.0 (if you switch to setting 
> spark.(driver,executor).extraClassPath instead of SPARK_CLASSPATH): 
> spark-submit --deploy-mode cluster --class 
> org.apache.spark.examples.JavaWordCount s3://my-bucket/spark-examples.jar 
> s3://my-bucket/word-count-input.txt
> Resulting Exception:
> Exception in thread "main" java.lang.RuntimeException: 
> java.lang.ClassNotFoundException: Class 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2074)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2626)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2639)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:90)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2678)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2660)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:374)
>   at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
>   at 
> org.apache.spark.deploy.yarn.Client.copyFileToRemote(Client.scala:233)
>   at 
> org.apache.spark.deploy.yarn.Client.org$apache$spark$deploy$yarn$Client$$distribute$1(Client.scala:327)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:366)
>   at 
> org.apache.spark.deploy.yarn.Client$$anonfun$prepareLocalResources$5.apply(Client.scala:364)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at 
> org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:364)
>   at 
> org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:629)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:119)
>   at org.apache.spark.deploy.yarn.Client.run(Client.scala:907)
>   at org.apache.spark.deploy.yarn.Client$.main(Client.scala:966)
>   at org.apache.spark.deploy.yarn.Client.main(Client.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
>   at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>   at 

[jira] [Commented] (SPARK-12657) Revert SPARK-12617

2016-01-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083948#comment-15083948
 ] 

Sean Owen commented on SPARK-12657:
---

Same question as SPARK-12658 -- seems like part of updating py4j only.

> Revert SPARK-12617
> --
>
> Key: SPARK-12657
> URL: https://issues.apache.org/jira/browse/SPARK-12657
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Shixiong Zhu
>
> SPARK-12617 is just a workaround, since Py4J is going to fix it. We should 
> revert it in master when we upgrade Py4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10906) More efficient SparseMatrix.equals

2016-01-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-10906.
-
  Resolution: Won't Fix
Target Version/s:   (was: )

We'll fix this in Breeze, rather than in MLlib.

> More efficient SparseMatrix.equals
> --
>
> Key: SPARK-10906
> URL: https://issues.apache.org/jira/browse/SPARK-10906
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> SparseMatrix.equals currently uses toBreeze and then calls Breeze's equals 
> method.  However, it looks like Breeze's equals is inefficient: 
> [https://github.com/scalanlp/breeze/blob/1130e0de31948d19225179d8500a8d2d1cc337d0/math/src/main/scala/breeze/linalg/Matrix.scala#L132]
> Breeze iterates over all values, including implicit zeros.  We could make 
> this more efficient.
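
To illustrate the idea in SciPy terms (an analogy only; the actual fix would live in Breeze/Scala): compare shape plus canonicalized active entries instead of scanning every position.

{code}
import numpy as np
from scipy.sparse import csc_matrix

def sparse_equals(a, b):
    """Equality for two CSC matrices that never touches implicit zeros."""
    if a.shape != b.shape:
        return False
    a, b = a.copy(), b.copy()
    for m in (a, b):  # canonicalize so equal matrices have identical arrays
        m.sum_duplicates()
        m.eliminate_zeros()
        m.sort_indices()
    return (np.array_equal(a.indptr, b.indptr) and
            np.array_equal(a.indices, b.indices) and
            np.array_equal(a.data, b.data))

assert sparse_equals(csc_matrix([[1.0, 0.0], [0.0, 2.0]]),
                     csc_matrix([[1.0, 0.0], [0.0, 2.0]]))
{code}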



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12660) Rewrite except using anti-join

2016-01-05 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12660:
---

 Summary: Rewrite except using anti-join
 Key: SPARK-12660
 URL: https://issues.apache.org/jira/browse/SPARK-12660
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Reynold Xin


Similar to SPARK-12656, we can rewrite except at the logical level using an 
anti-join. This way, we can take advantage of all the benefits of join 
implementations (e.g. managed memory, code generation, broadcast joins).
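
For intuition, the rewrite can be sketched at the DataFrame level in PySpark. This is illustrative only: the real change is an optimizer rule, the join below uses a single key column for simplicity, NULL-handling subtleties of EXCEPT are ignored, and {{sqlContext}} is assumed to be the shell-provided one.

{code}
# `a EXCEPT b` keeps the distinct rows of `a` with no match in `b`; an
# anti-join expresses the same thing, emulated here with a left outer join
# plus a null filter.
a = sqlContext.range(0, 10)     # ids 0..9
b = sqlContext.range(5, 15)     # ids 5..14
expected = a.subtract(b)        # EXCEPT as currently exposed on DataFrames
rewritten = (a.join(b, a["id"] == b["id"], "left_outer")
              .where(b["id"].isNull())
              .select(a["id"])
              .distinct())
assert sorted(r.id for r in rewritten.collect()) == \
       sorted(r.id for r in expected.collect())
{code}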




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12659) NPE when spill in CartisianProduct

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12659:


Assignee: Davies Liu  (was: Apache Spark)

> NPE when spill in CartisianProduct
> --
>
> Key: SPARK-12659
> URL: https://issues.apache.org/jira/browse/SPARK-12659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:54)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:37)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:231)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12659) NPE when spill in CartisianProduct

2016-01-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083975#comment-15083975
 ] 

Apache Spark commented on SPARK-12659:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/10606

> NPE when spill in CartisianProduct
> --
>
> Key: SPARK-12659
> URL: https://issues.apache.org/jira/browse/SPARK-12659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:54)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:37)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:231)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12659) NPE when spill in CartisianProduct

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12659:


Assignee: Apache Spark  (was: Davies Liu)

> NPE when spill in CartisianProduct
> --
>
> Key: SPARK-12659
> URL: https://issues.apache.org/jira/browse/SPARK-12659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Davies Liu
>Assignee: Apache Spark
>
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:54)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:37)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:231)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:187)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12379) Copy GBT implementation to spark.ml

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12379:


Assignee: Apache Spark

> Copy GBT implementation to spark.ml
> ---
>
> Key: SPARK-12379
> URL: https://issues.apache.org/jira/browse/SPARK-12379
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Assignee: Apache Spark
>
> Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is 
> preventing several improvements to GBTs in spark.ml, so we need to move the 
> implementation to ml and use spark.ml decision trees in the implementation. 
> At first, we should make minimal changes to the implementation.
> Performance testing should be done to ensure there were no regressions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11798) Datanucleus jars is missing under lib_managed/jars

2016-01-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083640#comment-15083640
 ] 

Josh Rosen commented on SPARK-11798:


Datanucleus is only added as a dependency when the Hive build profile is 
enabled. Are you sure that you enabled that flag?

> Datanucleus jars is missing under lib_managed/jars
> --
>
> Key: SPARK-11798
> URL: https://issues.apache.org/jira/browse/SPARK-11798
> Project: Spark
>  Issue Type: Bug
>  Components: Build, SQL
>Reporter: Jeff Zhang
>
> I notice the comments in https://github.com/apache/spark/pull/9575 say that 
> the Datanucleus-related jars will still be copied to lib_managed/jars, but I 
> don't see any jars under lib_managed/jars. The weird thing is that I see the 
> jars on another machine, but cannot see them on my laptop even after I delete 
> the whole Spark project and start from scratch. Is this related to the 
> environment? I tried adding the following code to SparkBuild.scala to track 
> down the issue, and it shows that the jars list is empty. 
> {code}
> deployDatanucleusJars := {
>   val jars: Seq[File] = (fullClasspath in assembly).value.map(_.data)
> .filter(_.getPath.contains("org.datanucleus"))
>   // this is what I added
>   println("*")
>   println("fullClasspath:"+fullClasspath)
>   println("assembly:"+assembly)
>   println("jars:"+jars.map(_.getAbsolutePath()).mkString(","))
>   //
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4343) Mima considers protected API methods for exclusion from binary checks.

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4343.
---
Resolution: Not A Problem

> Mima considers protected API methods for exclusion from binary checks. 
> ---
>
> Key: SPARK-4343
> URL: https://issues.apache.org/jira/browse/SPARK-4343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Prashant Sharma
>Priority: Minor
>
> Related SPARK-4335



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-01-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083646#comment-15083646
 ] 

Josh Rosen commented on SPARK-7481:
---

Hey, is this task done? I see that we have a {{hadoop2.6}} profile now.

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark 
> package can talk to all of the stores.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1070) Add check for JIRA ticket in the Github pull request title/summary with CI

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1070.
---
Resolution: Won't Fix

Resolving as "Won't Fix." Most contributors now submit PRs with JIRAs, so the 
cost of infrequently reminding new folks to file them isn't high enough to 
justify automation here.

> Add check for JIRA ticket in the Github pull request title/summary with CI
> --
>
> Key: SPARK-1070
> URL: https://issues.apache.org/jira/browse/SPARK-1070
> Project: Spark
>  Issue Type: Task
>  Components: Build
>Reporter: Henry Saputra
>Assignee: Mark Hamstra
>Priority: Minor
>
> As part of the discussion on the dev@ list about adding an audit trail from 
> Spark's Github pull requests (PRs) to JIRA, we need to add a check, perhaps 
> in the Jenkins CI, to verify that PRs contain a JIRA ticket number in the 
> title/summary.
>  
> Some PRs may not need a ticket, so we should probably add support for a 
> "magic" keyword to bypass the check, but this should only happen in rare 
> cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4343) Mima considers protected API methods for exclusion from binary checks.

2016-01-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083651#comment-15083651
 ] 

Josh Rosen commented on SPARK-4343:
---

I'm going to resolve as "not an issue." Protected methods _should_ be part of 
API compatibility checks. If they're `protected[spark]` that's a slightly 
different story, but let's deal with that narrower issue once it crops up.

> Mima considers protected API methods for exclusion from binary checks. 
> ---
>
> Key: SPARK-4343
> URL: https://issues.apache.org/jira/browse/SPARK-4343
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Reporter: Prashant Sharma
>Priority: Minor
>
> Related SPARK-4335



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12654) sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop

2016-01-05 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083663#comment-15083663
 ] 

Thomas Graves commented on SPARK-12654:
---

It looks like the version of getConf in HadoopRDD already creates it as a 
JobConf versus a hadoop Configuration.  Not sure why NewHadoopRDD didn't do the 
same.  

[~joshrosen]  Do you know the history on that?

> sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop
> -
>
> Key: SPARK-12654
> URL: https://issues.apache.org/jira/browse/SPARK-12654
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> On a secure hadoop cluster using pyspark or spark-shell in yarn client mode 
> with spark.hadoop.cloneConf=true, start it up and wait for over 1 minute.  
> Then try to use:
> val files =  sc.wholeTextFiles("dir") 
> files.collect()
> and it fails with:
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.hadoop.ipc.RemoteException(java.io.IOException): Delegation 
> Token can be issued only with kerberos or web authentication
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken(FSNamesystem.java:7365)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getDelegationToken(NameNodeRpcServer.java:528)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getDelegationToken(ClientNamenodeProtocolServerSideTranslatorPB.java:963)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2096)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2092)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1694)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2090)
>  
> at org.apache.hadoop.ipc.Client.call(Client.java:1451)
> at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
> at com.sun.proxy.$Proxy12.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getDelegationToken(ClientNamenodeProtocolTranslatorPB.java:909)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> at com.sun.proxy.$Proxy13.getDelegationToken(Unknown Source)
> at 
> org.apache.hadoop.hdfs.DFSClient.getDelegationToken(DFSClient.java:1029)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getDelegationToken(DistributedFileSystem.java:1434)
> at 
> org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:529)
> at 
> org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:507)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.addDelegationTokens(DistributedFileSystem.java:2120)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:121)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:100)
> at 
> org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:80)
> at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:242)
> at 
> org.apache.spark.input.WholeTextFileInputFormat.setMinPartitions(WholeTextFileInputFormat.scala:55)
> at 
> org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:304)
> at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Assigned] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12630:


Assignee: Apache Spark

> Make Parameter Descriptions Consistent for PySpark MLlib Classification
> ---
>
> Key: SPARK-12630
> URL: https://issues.apache.org/jira/browse/SPARK-12630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> classification.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12630:


Assignee: (was: Apache Spark)

> Make Parameter Descriptions Consistent for PySpark MLlib Classification
> ---
>
> Key: SPARK-12630
> URL: https://issues.apache.org/jira/browse/SPARK-12630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> classification.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it

2016-01-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082829#comment-15082829
 ] 

Sean Owen commented on SPARK-12622:
---

I don't see details of the actual problem here. Everything so far looks 
correct. {{file:/tmp/f%20oo.jar}} is a valid URI for the file, so that can't be 
rejected. What breaks?

> spark-submit fails on executors when jar has a space in it
> --
>
> Key: SPARK-12622
> URL: https://issues.apache.org/jira/browse/SPARK-12622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
> Environment: Linux, Mesos 
>Reporter: Adrian Bridgett
>Priority: Minor
>
> spark-submit --class foo "Foo.jar" works, but when using "f oo.jar" it starts 
> to run and then breaks on the executors, as they cannot find the various 
> functions.
> Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this 
> fails immediately.
> {noformat}
> spark-submit --class Foo /tmp/f\ oo.jar
> ...
> spark.jars=file:/tmp/f%20oo.jar
> 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at 
> http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769
> 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 
> (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: 
> Foo$$anonfun$46
> {noformat}
> SPARK-6568 is related but maybe specific to the Windows environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082846#comment-15082846
 ] 

Sean Owen commented on SPARK-12647:
---

[~robbinspg] Rather than making a new JIRA, you should reopen your existing one 
and provide another PR; the additional change logically belongs with your 
original one.

> 1.6 branch test failure 
> o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
> reducers: aggregate operator
> ---
>
> Key: SPARK-12647
> URL: https://issues.apache.org/jira/browse/SPARK-12647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Pete Robbins
>Priority: Minor
>
> All 1.6 branch builds are failing, e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/
> with the error "3 did not equal 2".
> The PR for SPARK-12470 causes a change in partition size, so the test needs 
> updating.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082847#comment-15082847
 ] 

Apache Spark commented on SPARK-12647:
--

User 'robbinspg' has created a pull request for this issue:
https://github.com/apache/spark/pull/10599

> 1.6 branch test failure 
> o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
> reducers: aggregate operator
> ---
>
> Key: SPARK-12647
> URL: https://issues.apache.org/jira/browse/SPARK-12647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Pete Robbins
>Priority: Minor
>
> All 1.6 branch builds are failing, e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/
> with the error "3 did not equal 2".
> The PR for SPARK-12470 causes a change in partition size, so the test needs 
> updating.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2016-01-05 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082859#comment-15082859
 ] 

Kazuaki Ishizaki commented on SPARK-3785:
-

Using CUDA is an intermediate approach to evaluate idea A. A future version 
will drive GPU code from a Spark program without writing CUDA code by hand; 
that version may generate a GPU binary through CUDA or OpenCL by using it as a 
backend in a compiler.

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to add support for off-loading computations to the 
> GPU, e.g. via an OpenCL binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it

2016-01-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082862#comment-15082862
 ] 

Sean Owen commented on SPARK-12622:
---

Oh I see it. Ultimately I assume it's because the JAR isn't found locally, 
though the question is why. This looks suspicious:
{{Added JAR file:/tmpf%20oo.jar at http://10.1.201.77:43888/jars/f%oo.jar}}

The second URL (the http one) can't be right. I don't have any more ideas, but 
that looks like somewhere to start looking.
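
For what it's worth, no single consistent encode/decode round trip of the name 
seems able to produce {{f%oo.jar}}; a minimal sketch to illustrate (plain Python 3 
standard library, nothing Spark-specific):
{code}
# Minimal sketch, plain Python 3 standard library: what a consistent
# encode/decode round trip of the jar name looks like.
from urllib.parse import quote, unquote

name = "f oo.jar"
encoded = quote(name)                 # 'f%20oo.jar' -- matches the spark.jars value
assert encoded == "f%20oo.jar"
assert unquote(encoded) == name       # decoding restores the original name

# Re-encoding the already-encoded name gives 'f%2520oo.jar'; decoding it gives
# 'f oo.jar'.  Neither yields 'f%oo.jar', so that URL does look like the place
# to start digging.
print(quote(encoded))                 # 'f%2520oo.jar'
{code}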

> spark-submit fails on executors when jar has a space in it
> --
>
> Key: SPARK-12622
> URL: https://issues.apache.org/jira/browse/SPARK-12622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
> Environment: Linux, Mesos 
>Reporter: Adrian Bridgett
>Priority: Minor
>
> spark-submit --class foo "Foo.jar"  works
> but when using "f oo.jar" it starts to run and then breaks on the executors 
> as they cannot find the various functions.
> Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this 
> fails immediately.
> {noformat}
> spark-submit --class Foo /tmp/f\ oo.jar
> ...
> spark.jars=file:/tmp/f%20oo.jar
> 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at 
> http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769
> 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 
> (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: 
> Foo$$anonfun$46
> {noformat}
> SPARK-6568 is related but maybe specific to the Windows environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree

2016-01-05 Thread Vijay Kiran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082872#comment-15082872
 ] 

Vijay Kiran commented on SPARK-12634:
-

I'm editing tree.py.

> Make Parameter Descriptions Consistent for PySpark MLlib Tree
> -
>
> Key: SPARK-12634
> URL: https://issues.apache.org/jira/browse/SPARK-12634
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up tree.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it

2016-01-05 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082875#comment-15082875
 ] 

Adrian Bridgett commented on SPARK-12622:
-

Damn - sorry, that's my obfuscation error, sorry about that :-(  It should read:
{noformat}
Added JAR file:/tmp/f%20oo.jar at http://10.1.201.77:35016/jars/f%20oo.jar with 
timestamp 1451917055779
{noformat}

Let me also post the full stack trace:
{noformat}
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO -   16/01/04 14:23:00 WARN 
scheduler.TaskSetManager: Lost task 19.0 in stage 0.0 (TID 20, 
ip-10-1-200-159.ec2.internal): java.lang.ClassNotFoundException: 
ProcessFoo$$anonfun$46
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.lang.Class.forName0(Native Method)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.lang.Class.forName(Class.java:348)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1613)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1518)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1774)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
[2016-01-04 14:23:00,053] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readObject(ObjectInputStream.java:371)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
scala.collection.immutable.$colon$colon.readObject(List.scala:362)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[2016-01-04 14:23:00,054] {daily_tmo.py:153} INFO - at 
java.lang.reflect.Method.invoke(Method.java:497)
[2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at 
java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
[2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1900)
[2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1801)
[2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1351)
[2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2000)
[2016-01-04 14:23:00,055] {daily_tmo.py:153} INFO - at 
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1924)
[2016-01-04 14:23:00,055] 

[jira] [Commented] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree

2016-01-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082890#comment-15082890
 ] 

Apache Spark commented on SPARK-12634:
--

User 'vijaykiran' has created a pull request for this issue:
https://github.com/apache/spark/pull/10601

> Make Parameter Descriptions Consistent for PySpark MLlib Tree
> -
>
> Key: SPARK-12634
> URL: https://issues.apache.org/jira/browse/SPARK-12634
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up tree.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12634:


Assignee: (was: Apache Spark)

> Make Parameter Descriptions Consistent for PySpark MLlib Tree
> -
>
> Key: SPARK-12634
> URL: https://issues.apache.org/jira/browse/SPARK-12634
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up tree.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12634:


Assignee: Apache Spark

> Make Parameter Descriptions Consistent for PySpark MLlib Tree
> -
>
> Key: SPARK-12634
> URL: https://issues.apache.org/jira/browse/SPARK-12634
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up tree.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12331) R^2 for regression through the origin

2016-01-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12331.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10384
[https://github.com/apache/spark/pull/10384]

> R^2 for regression through the origin
> -
>
> Key: SPARK-12331
> URL: https://issues.apache.org/jira/browse/SPARK-12331
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Imran Younus
>Priority: Minor
> Fix For: 2.0.0
>
>
> The value of R^2 (coefficient of determination) obtained from 
> LinearRegressionModel is not consistent with R and statsmodels when the 
> fitIntercept is false, i.e., regression through the origin. In this case, both 
> R and statsmodels use the definition of R^2 given by eq(4') in the following 
> review paper:
> https://online.stat.psu.edu/~ajw13/stat501/SpecialTopics/Reg_thru_origin.pdf
> Here is the definition from this paper:
> R^2 = \sum_i \hat{y}_i^2 / \sum_i y_i^2
> The paper also describes why this should be the case. I've double checked 
> that the value of R^2 from statsmodels and R are consistent with this 
> definition. On the other hand, scikit-learn doesn't use the above definition. 
> I would recommend using the above definition in Spark.
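
For concreteness, the two definitions can be put side by side; a small 
illustrative sketch (numpy assumed; {{y}} holds observed values and {{y_hat}} 
stands in for the fitted values of a no-intercept model):
{code}
import numpy as np

# Illustrative values only: y are observations, y_hat stands in for the fitted
# values of a regression-through-the-origin model.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = 0.9 * y

# eq. (4') from the review paper: R^2 for regression through the origin
r2_origin = np.sum(y_hat ** 2) / np.sum(y ** 2)                             # 0.81

# the usual centered definition, which the description says scikit-learn uses
r2_centered = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # 0.94

print(r2_origin, r2_centered)
{code}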



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1061) allow Hadoop RDDs to be read w/ a partitioner

2016-01-05 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-1061.
--
Resolution: Won't Fix

> allow Hadoop RDDs to be read w/ a partitioner
> -
>
> Key: SPARK-1061
> URL: https://issues.apache.org/jira/browse/SPARK-1061
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Imran Rashid
>Assignee: Imran Rashid
>
> Using partitioners to get narrow dependencies can save tons of time on a 
> shuffle.  However, after saving an RDD to hdfs, and then reloading it, all 
> partitioner information is lost.  This means that you can never get a narrow 
> dependency when loading data from hadoop.
> I think we could get around this by:
> 1) having a modified version of hadoop rdd that kept track of original part 
> file (or maybe just prevent splits altogether ...)
> 2) add a "assumePartition(partitioner:Partitioner, verify: Boolean)" function 
> to RDD.  It would create a new RDD, which had the exact same data but just 
> pretended that the RDD had the given partitioner applied to it.  And if 
> verify=true, it could add a mapPartitionsWithIndex to check that each record 
> was in the right partition.
> http://apache-spark-user-list.1001560.n3.nabble.com/setting-partitioners-with-hadoop-rdds-td976.html
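
A rough PySpark-flavoured sketch of the "verify" idea in point 2 (the 
{{verify_partitioning}} helper and {{partition_func}} argument are made-up names 
for illustration, not an existing Spark API):
{code}
# Hypothetical sketch of the "verify" idea: after asserting a partitioning,
# check that every key really hashes to the partition it sits in.
def verify_partitioning(pair_rdd, num_partitions, partition_func):
    def check(index, items):
        for key, value in items:
            expected = partition_func(key) % num_partitions
            if expected != index:
                raise ValueError(
                    "key %r is in partition %d but hashes to %d"
                    % (key, index, expected))
            yield key, value
    return pair_rdd.mapPartitionsWithIndex(check, preservesPartitioning=True)
{code}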



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification

2016-01-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082821#comment-15082821
 ] 

Apache Spark commented on SPARK-12630:
--

User 'vijaykiran' has created a pull request for this issue:
https://github.com/apache/spark/pull/10598

> Make Parameter Descriptions Consistent for PySpark MLlib Classification
> ---
>
> Key: SPARK-12630
> URL: https://issues.apache.org/jira/browse/SPARK-12630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> classification.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-01-05 Thread Vijay Kiran (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vijay Kiran updated SPARK-12633:

Comment: was deleted

(was: Opened a PR  https://github.com/apache/spark/pull/10600)

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12633:


Assignee: Apache Spark

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-01-05 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082853#comment-15082853
 ] 

Apache Spark commented on SPARK-12633:
--

User 'vijaykiran' has created a pull request for this issue:
https://github.com/apache/spark/pull/10600

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-01-05 Thread Vijay Kiran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082854#comment-15082854
 ] 

Vijay Kiran commented on SPARK-12633:
-

Opened a PR  https://github.com/apache/spark/pull/10600

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12633:


Assignee: (was: Apache Spark)

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082879#comment-15082879
 ] 

Pete Robbins commented on SPARK-12647:
--

@sowen should I close this and move the PR?


> 1.6 branch test failure 
> o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
> reducers: aggregate operator
> ---
>
> Key: SPARK-12647
> URL: https://issues.apache.org/jira/browse/SPARK-12647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Pete Robbins
>Priority: Minor
>
> All 1.6 branch builds are failing, e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/
> 3 did not equal 2
> The PR for SPARK-12470 causes a change in partition size, so the test needs updating



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12095) Window function rowsBetween throws exception

2016-01-05 Thread Tristan Reid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082921#comment-15082921
 ] 

Tristan Reid commented on SPARK-12095:
--

The SQL syntax doesn't appear to work at all. 
  `select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl`

Is that the case?
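
As far as I recall, in 1.5/1.6 the SQL {{OVER (...)}} syntax is only parsed when 
going through a HiveContext, which would explain the parser failure quoted below; 
the Py4J error, in turn, is because rowsBetween expects numeric offsets rather 
than strings. A rough, untested sketch of the DataFrame-API form (assuming a 
DataFrame {{df}} with columns A, B and a numeric column C):
{code}
import sys
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# rowsBetween takes numeric offsets, not strings -- hence the
# "Method rowsBetween([class java.lang.String, ...]) does not exist" error.
running = (Window.partitionBy('A', 'B')
                 .orderBy('C')
                 .rowsBetween(-sys.maxsize, 0))   # unbounded preceding .. current row
running_sum = df.select('A', 'B', 'C',
                        F.sum('C').over(running).alias('running_sum'))

# a plain ranking needs only partitioning and ordering, no explicit frame
ranked = df.select('A', 'B',
                   F.rank().over(Window.partitionBy('A').orderBy('B')).alias('rnk'))
{code}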

> Window function rowsBetween throws exception
> 
>
> Key: SPARK-12095
> URL: https://issues.apache.org/jira/browse/SPARK-12095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Irakli Machabeli
>
> From pyspark :
>  windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', 
> 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT')
> Py4JError: An error occurred while calling o1107.rowsBetween. Trace:
> py4j.Py4JException: Method rowsBetween([class java.lang.String, class 
> java.lang.Long]) does not exist
> from SQL query parser fails immediately:
> Py4JJavaError: An error occurred while calling o18.sql.
> : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found
> select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl
>^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Pete Robbins (JIRA)
Pete Robbins created SPARK-12647:


 Summary: 1.6 branch test failure 
o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
reducers: aggregate operator
 Key: SPARK-12647
 URL: https://issues.apache.org/jira/browse/SPARK-12647
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Pete Robbins
Priority: Minor


All 1.6 branch builds are failing, e.g. 
https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/

3 did not equal 2

The PR for SPARK-12470 causes a change in partition size, so the test needs updating



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12647:


Assignee: Apache Spark

> 1.6 branch test failure 
> o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
> reducers: aggregate operator
> ---
>
> Key: SPARK-12647
> URL: https://issues.apache.org/jira/browse/SPARK-12647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Pete Robbins
>Assignee: Apache Spark
>Priority: Minor
>
> All 1.6 branch builds are failing, e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/
> 3 did not equal 2
> The PR for SPARK-12470 causes a change in partition size, so the test needs updating



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12622) spark-submit fails on executors when jar has a space in it

2016-01-05 Thread Adrian Bridgett (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082855#comment-15082855
 ] 

Adrian Bridgett commented on SPARK-12622:
-

The job fails with the ClassNotFoundException; if I rename the jar file (removing 
the space) and resubmit, it all works.

> spark-submit fails on executors when jar has a space in it
> --
>
> Key: SPARK-12622
> URL: https://issues.apache.org/jira/browse/SPARK-12622
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 1.6.0
> Environment: Linux, Mesos 
>Reporter: Adrian Bridgett
>Priority: Minor
>
> spark-submit --class foo "Foo.jar"  works
> but when using "f oo.jar" it starts to run and then breaks on the executors 
> as they cannot find the various functions.
> Out of interest (as HDFS CLI uses this format) I tried f%20oo.jar - this 
> fails immediately.
> {noformat}
> spark-submit --class Foo /tmp/f\ oo.jar
> ...
> spark.jars=file:/tmp/f%20oo.jar
> 6/01/04 14:56:47 INFO spark.SparkContext: Added JAR file:/tmpf%20oo.jar at 
> http://10.1.201.77:43888/jars/f%oo.jar with timestamp 1451919407769
> 16/01/04 14:57:48 WARN scheduler.TaskSetManager: Lost task 4.0 in stage 0.0 
> (TID 2, ip-10-1-200-232.ec2.internal): java.lang.ClassNotFoundException: 
> Foo$$anonfun$46
> {noformat}
> SPARK-6568 is related but maybe specific to the Windows environment



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Pete Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082879#comment-15082879
 ] 

Pete Robbins edited comment on SPARK-12647 at 1/5/16 11:30 AM:
---

[~sowen] should I close this and move the PR?



was (Author: robbinspg):
@sowen should I close this and move the PR?


> 1.6 branch test failure 
> o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
> reducers: aggregate operator
> ---
>
> Key: SPARK-12647
> URL: https://issues.apache.org/jira/browse/SPARK-12647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Pete Robbins
>Priority: Minor
>
> All 1.6 branch builds are failing, e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/
> 3 did not equal 2
> The PR for SPARK-12470 causes a change in partition size, so the test needs updating



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3785) Support off-loading computations to a GPU

2016-01-05 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082881#comment-15082881
 ] 

Kazuaki Ishizaki commented on SPARK-3785:
-

# You can specify cpu-cores by using conventional Spark options like 
"--executor-cores".
# Do you want to execute an operation on a matrix represented by an RDD? The 
current version has two possible GPU memory limitations:
#* Since it copies all of the data in an RDD partition between CPU and GPU, the 
GPU kernel for a task cannot exceed the capacity of the GPU memory.
#* Since tasks are executed concurrently, the sum of the GPU memory required by 
the tasks running at any one time cannot exceed the capacity of the GPU memory.

Comment 2 is a very good question. To exploit GPUs in Spark, it is necessary to 
devise better approaches.

> Support off-loading computations to a GPU
> -
>
> Key: SPARK-3785
> URL: https://issues.apache.org/jira/browse/SPARK-3785
> Project: Spark
>  Issue Type: Brainstorming
>  Components: MLlib
>Reporter: Thomas Darimont
>Priority: Minor
>
> Are there any plans to adding support for off-loading computations to the 
> GPU, e.g. via an open-cl binding? 
> http://www.jocl.org/
> https://code.google.com/p/javacl/
> http://lwjgl.org/wiki/index.php?title=OpenCL_in_LWJGL



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15082889#comment-15082889
 ] 

Sean Owen commented on SPARK-12647:
---

*shrug* at this point probably doesn't matter; mostly for next time here. The 
concern is just that someone finds your fix to the first JIRA but not the fix 
to the fix. I linked them here at least.

> 1.6 branch test failure 
> o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
> reducers: aggregate operator
> ---
>
> Key: SPARK-12647
> URL: https://issues.apache.org/jira/browse/SPARK-12647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Pete Robbins
>Priority: Minor
>
> All 1.6 branch builds are failing, e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/
> 3 did not equal 2
> The PR for SPARK-12470 causes a change in partition size, so the test needs updating



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12621) ArrayIndexOutOfBoundsException when running sqlContext.sql(...)

2016-01-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083835#comment-15083835
 ] 

Josh Rosen commented on SPARK-12621:


Can you please check whether this issue occurs in Spark 1.5.2?

> ArrayIndexOutOfBoundsException when running sqlContext.sql(...)
> ---
>
> Key: SPARK-12621
> URL: https://issues.apache.org/jira/browse/SPARK-12621
> Project: Spark
>  Issue Type: Bug
>Reporter: Sasi
>
> Sometimes I'm getting this exception while trying to do "select * from 
> table". 
> I'm using Spark 1.5.0, with Spark SQL 2.10
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 0
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:199)
> at org.apache.spark.sql.Row$class.getAs(Row.scala:316)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getAs(rows.scala:191)
> at org.apache.spark.sql.Row$class.getString(Row.scala:249)
> at 
> org.apache.spark.sql.catalyst.expressions.GenericRow.getString(rows.scala:191)
> at 
> com.cxtrm.mgmt.subscriber.spark.SparkDataAccessorBean.doQuery(SparkDataAccessorBean.java:138)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3992) Spark 1.1.0 python binding cannot use any collections but list as Accumulators

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3992.
---
Resolution: Won't Fix

Resolving as "Won't Fix" since this issue is really old and is targeted at 
Spark 1.1.0. Please re-open if this is still an issue.

> Spark 1.1.0 python binding cannot use any collections but list as Accumulators
> --
>
> Key: SPARK-3992
> URL: https://issues.apache.org/jira/browse/SPARK-3992
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: 3.13.0-36-generic #63-Ubuntu SMP Wed Sep 3 21:30:07 UTC 
> 2014 x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Walter Bogorad
>
> A dictionary accumulator defined as a global variable is not visible inside a 
> function called by "foreach()".
> Here is the minimal code snippet:
> {noformat}
> from collections import defaultdict
> from pyspark import SparkContext
> from pyspark.accumulators import AccumulatorParam
> class DictAccumParam(AccumulatorParam):
> def zero(self, value):
> value.clear()
> def addInPlace(self, val1, val2):
> return val1
> sc = SparkContext("local", "Dict Accumulator Bug")
> va = sc.accumulator(defaultdict(int), DictAccumParam())
> def foo(x):
> global va
> print "va is:", va
> rdd = sc.parallelize([1,2,3]).foreach(foo)
> {noformat}
> When run, the code snippet produced the following results:
> ...
> va is: None
> va is: None
> va is: None
> ...
> I have verified that the global variables are visible inside foo() called by 
> foreach only if they are scalars or lists, as in the API doc at 
> http://spark.apache.org/docs/latest/api/python/
> The problem exists with standard dictionaries and other collections.
> I also verified that if foo() is called directly, i.e. outside foreach, then 
> the global variables are visible OK.
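
For reference, a hedged sketch (not the reporter's code) of how a dict-valued 
accumulator is usually written, on the assumption that accumulators are 
write-only inside tasks and the merged value is read back on the driver via 
{{.value}} rather than by printing the global inside foreach():
{code}
# Hedged sketch, not the reporter's code: merge worker-side updates on the
# driver and read the result there.
from collections import defaultdict
from pyspark import SparkContext
from pyspark.accumulators import AccumulatorParam

class DictAccumParam(AccumulatorParam):
    def zero(self, value):
        return defaultdict(int)            # start each copy from an empty dict
    def addInPlace(self, d1, d2):
        for k, v in d2.items():            # merge d2 into d1 instead of dropping it
            d1[k] += v
        return d1

sc = SparkContext("local", "Dict Accumulator Sketch")
va = sc.accumulator(defaultdict(int), DictAccumParam())

def count_parity(x):
    va.add({"even" if x % 2 == 0 else "odd": 1})   # write-only inside the task

sc.parallelize([1, 2, 3]).foreach(count_parity)
print(dict(va.value))                              # read on the driver: {'odd': 2, 'even': 1}
{code}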



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6883) Fork pyspark's cloudpickle as a separate dependency

2016-01-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083861#comment-15083861
 ] 

Josh Rosen edited comment on SPARK-6883 at 1/5/16 9:39 PM:
---

Closing as "Later" for now. Let's file a separate issue later down the line in 
case we want to explore having Spark depend on the cloudpipe/cloudpickle fork.


was (Author: joshrosen):
Closing as "Later" for now. Let's file a separate issue later down the line in 
case we want to explore having Spark depend on the cloudpickle/cloudpickle fork.

> Fork pyspark's cloudpickle as a separate dependency
> ---
>
> Key: SPARK-6883
> URL: https://issues.apache.org/jira/browse/SPARK-6883
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kyle Kelley
>  Labels: fork
>
> IPython, pyspark, picloud/multyvac/cloudpipe all rely on cloudpickle from 
> various sources (cloud, pyspark, and multyvac respectively). It would be 
> great to have this as a separately maintained project that can:
> * Work with Python3
> * Add tests!
> * Use higher order pickling (when on Python3)
> * Be installed with pip
> We're starting this off at the PyCon sprints under 
> https://github.com/cloudpipe/cloudpickle. We'd like to coordinate with 
> PySpark to make it work across all the above mentioned projects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6883) Fork pyspark's cloudpickle as a separate dependency

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6883.
---
Resolution: Later

Closing as "Later" for now. Let's file a separate issue later down the line in 
case we want to explore having Spark depend on the cloudpickle/cloudpickle fork.

> Fork pyspark's cloudpickle as a separate dependency
> ---
>
> Key: SPARK-6883
> URL: https://issues.apache.org/jira/browse/SPARK-6883
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Kyle Kelley
>  Labels: fork
>
> IPython, pyspark, picloud/multyvac/cloudpipe all rely on cloudpickle from 
> various sources (cloud, pyspark, and multyvac respectively). It would be 
> great to have this as a separately maintained project that can:
> * Work with Python3
> * Add tests!
> * Use higher order pickling (when on Python3)
> * Be installed with pip
> We're starting this off at the PyCon sprints under 
> https://github.com/cloudpipe/cloudpickle. We'd like to coordinate with 
> PySpark to make it work across all the above mentioned projects.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5616) Add examples for PySpark API

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5616.
---
Resolution: Won't Fix

Resolving as "Won't fix" per discussion on PR.

> Add examples for PySpark API
> 
>
> Key: SPARK-5616
> URL: https://issues.apache.org/jira/browse/SPARK-5616
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: dongxu
>Priority: Minor
>  Labels: examples, pyspark, python
>
> PySpark has fewer API examples than the Spark Scala API. For example:
> 1. Broadcast: how to use the broadcast operation API.
> 2. Module: how to import another Python file from a zip file.
> Add more examples for newcomers who want to use PySpark.
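
Both requested examples are small; a minimal sketch for reference (the file and 
function names in part 2 are made up for illustration):
{code}
from pyspark import SparkContext

sc = SparkContext("local", "pyspark-examples-sketch")

# 1. Broadcast: ship a read-only lookup table to every executor once.
lookup = sc.broadcast({"a": 1, "b": 2})
rdd = sc.parallelize(["a", "b", "a"])
print(rdd.map(lambda k: lookup.value[k]).sum())    # 4

# 2. Module: make extra Python code importable inside tasks.
sc.addPyFile("helpers.zip")        # hypothetical zip containing helpers.py
def use_helper(k):
    import helpers                 # resolvable on the executors after addPyFile
    return helpers.transform(k)    # `transform` is a made-up function
mapped = rdd.map(use_helper)       # lazy; runs when an action is called
{code}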



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10499) Improve error message when constructing a hive context in PySpark with non-hive assembly

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10499.

Resolution: Won't Fix

> Improve error message when constructing a hive context in PySpark with 
> non-hive assembly
> 
>
> Key: SPARK-10499
> URL: https://issues.apache.org/jira/browse/SPARK-10499
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: holdenk
>Priority: Minor
>
> A few times when I've been running local tests I've forgotten to build with 
> the Hive assembly, and the error message isn't super clear (just a generic 
> py4j error message). Let's wrap that error message so it's clearer about the 
> probable root cause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10499) Improve error message when constructing a hive context in PySpark with non-hive assembly

2016-01-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083871#comment-15083871
 ] 

Josh Rosen commented on SPARK-10499:


This will no longer be necessary once we remove the need to set the {{-Phive}} 
build flag (SPARK-8108), so I'm going to mark this as "Won't Fix." Feel free to 
submit a PR targeted at 1.6 if you really want this, but it seems like a low 
priority at this point.

> Improve error message when constructing a hive context in PySpark with 
> non-hive assembly
> 
>
> Key: SPARK-10499
> URL: https://issues.apache.org/jira/browse/SPARK-10499
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Reporter: holdenk
>Priority: Minor
>
> A few times when I've been running local tests I've forgotten to build with 
> the Hive assembly, and the error message isn't super clear (just a generic 
> py4j error message). Let's wrap that error message so it's clearer about the 
> probable root cause.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12511) streaming driver with checkpointing unable to finalize leading to OOM

2016-01-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12511.

   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10514
[https://github.com/apache/spark/pull/10514]

> streaming driver with checkpointing unable to finalize leading to OOM
> -
>
> Key: SPARK-12511
> URL: https://issues.apache.org/jira/browse/SPARK-12511
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.2, 1.6.0
> Environment: pyspark 1.5.2
> yarn 2.6.0
> python 2.6
> centos 6.5
> openjdk 1.8.0
>Reporter: Antony Mayi
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.0, 1.6.1
>
> Attachments: bug.py, finalizer-classes.png, finalizer-pending.png, 
> finalizer-spark_assembly.png
>
>
> A Spark streaming application configured with checkpointing fills the 
> driver's heap with ZipFileInputStream instances as a result of 
> spark-assembly.jar (and potentially others, for example snappy-java.jar) 
> being repeatedly referenced (loaded?). The Java Finalizer can't finalize 
> these ZipFileInputStream instances, and they eventually consume the whole 
> heap, leading the driver to an OOM crash.
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Leave it running and monitor the driver java process heap
> ** with heap dump you will primarily see growing instances of byte array data 
> (here cumulated zip payload of the jar refs):
> {noformat}
>  num #instances #bytes  class name
> --
>1: 32653   32735296  [B
>2: 480005135816  [C
>3:411344144  [Lscala.concurrent.forkjoin.ForkJoinTask;
>4: 113621261816  java.lang.Class
>5: 470541129296  java.lang.String
>6: 254601018400  java.lang.ref.Finalizer
>7:  9802 789400  [Ljava.lang.Object;
> {noformat}
> ** with visualvm you can see:
> *** increasing number of objects pending for finalization
> !finalizer-pending.png!
> *** increasing number of ZipFileInputStreams instances related to the 
> spark-assembly.jar referenced by Finalizer
> !finalizer-spark_assembly.png!
> * Depending on the heap size and running time this will lead to driver OOM 
> crash
> h2. Comments
> * The attached [^bug.py] is a lightweight proof of the problem. In production I 
> am experiencing this as quite a rapid effect - in a few hours it eats gigs of 
> heap and kills the app.
> * If the same [^bug.py] is run without checkpointing there is no issue 
> whatsoever.
> * Not sure if it is just pyspark related.
> * In [^bug.py] I am using the socketTextStream input, but the problem seems to 
> be independent of the input type (in production I am having the same problem 
> with the Kafka direct stream, and have seen it even with textFileStream).
> * It is happening even if the input stream doesn't produce any data.
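
One way to watch the symptom from inside the driver, without attaching visualvm, 
might be to poll the JVM's MemoryMXBean through the Py4J gateway; a hedged sketch 
(assumes a live pyspark SparkContext {{sc}} and uses the internal {{sc._jvm}} 
handle):
{code}
# Hedged monitoring sketch: poll the driver JVM's count of objects pending
# finalization, the same number visualvm plots in the attached screenshots.
import time

def pending_finalization(sc):
    mbean = sc._jvm.java.lang.management.ManagementFactory.getMemoryMXBean()
    return mbean.getObjectPendingFinalizationCount()

for _ in range(10):                    # sample roughly every 30 seconds
    print(pending_finalization(sc))
    time.sleep(30)
{code}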



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9313) Enable a "docker run" invocation in place of PYSPARK_PYTHON

2016-01-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083879#comment-15083879
 ] 

Josh Rosen commented on SPARK-9313:
---

Just curious: why can't {{PYSPARK_PYTHON}} just point to a bash script which 
invokes Docker? Trying to figure out if there's still a small actionable Spark 
change here.
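
For illustration, the kind of wrapper the question has in mind might look like 
the sketch below (the image name, env var and docker flags are made up; the 
quoted description below explains why a plain {{docker run}} is not quite 
sufficient in practice -- setpgid, the daemon port being printed from inside the 
container, and stdin/stdout forwarding):
{code}
#!/usr/bin/env python
# Hedged sketch of a wrapper that PYSPARK_PYTHON could point at; image name,
# env var and docker flags are illustrative only.
import os
import sys

image = os.environ.get("PYSPARK_DOCKER_IMAGE", "example/pyspark-worker:latest")
cmd = ["docker", "run", "--rm", "-i", "--net=host", image, "python"] + sys.argv[1:]
os.execvp("docker", cmd)   # replace this process, inheriting stdin/stdout/stderr
{code}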

> Enable a "docker run" invocation in place of PYSPARK_PYTHON
> ---
>
> Key: SPARK-9313
> URL: https://issues.apache.org/jira/browse/SPARK-9313
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
> Environment: Linux
>Reporter: thom neale
>Priority: Minor
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> There's a potentially high-yield improvement that might be possible by 
> enabling people to set PYSPARK_PYTHON (or possibly a new env var) to a docker 
> run of a specific docker image. I'm interested in taking a shot at this, but 
> could use some pointers on overall pyspark architecture in order to avoid 
> hurting myself or trying something stupid that won't work. 
> History of this idea: I handle most of the spark infrastructure for 
> MassMutual's data science team, and we currently push code updates out to 
> spark workers with a combination of git post-recieve hooks and ansible 
> playbooks, all glued together with jenkins. It works well, but every time 
> someone wants a specific PYSPARK_PYTHON environment with precise branch 
> checkouts, for example, it has to be exquisitely configured in advance. What 
> would be amazing is if we could run a docker image in place of 
> PYSPARK_PYTHON, so people could build an image with whatever they want on it, 
> push it to a docker registry, then as long as the spark worker nodes had a 
> docker daemon running, they wouldn't need the images in advance--they would 
> just pull the built images from the registry on the fly once someone 
> submitted their job and specified the appropriate docker fu in place of 
> PYSPARK_PYTHON. This would basically make the distribution of code to the 
> workers self-service as long as users were savvy with docker. A lesser 
> benefit is that the layered filesystem feature of docker would solve the 
> (it's not really a problem) minor issue of a profusion of python virtualenvs, 
> each loaded with a huge ML stack plus other deps, from gobbling up gigs of 
> space on smaller code partitions on our workers. Each new combination of 
> branch checkouts for our application code could use the same huge ML base 
> image, and things would just be faster and simpler. 
> What I Speculate This Would Require 
> --- 
> Based on a reading of pyspark/daemon.py, I think this would require: 
> - somehow making the os.setpgid call inside manager() optional. The 
> pyspark.daemon process isn't allowed to call setpgid, I think because it has 
> pid 1 in the container. In my hacked branch I'm going this by checking if a 
> new environment variable is set. 
> - instead of binding to a random port, if the worker is dockerized, bind to a 
> predetermined port 
> - When the dockerized worker is invoked, query docker for the exposed port on 
> the host, and print that instead - Possibly do the same with ports opened by 
> forked workers? 
> - Forward stdin/out to/from the container where appropriate 
> My initial tinkering has done the first three points on 1.3.1, and I get the 
> InvalidArgumentException with an out-of-range port number, probably 
> indicating that something is hitting an error and printing something else 
> instead of the actual port. 
> Any pointers people can supply would be most welcome; I'm really interested in 
> at least succeeding in a demonstration of this hack, if not getting it merged 
> any time soon.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2371) Show locally-running tasks (e.g. from take()) in web UI

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2371.
---
Resolution: Won't Fix

Resolving as "Won't Fix", now that we've fully removed local execution.

> Show locally-running tasks (e.g. from take()) in web UI
> ---
>
> Key: SPARK-2371
> URL: https://issues.apache.org/jira/browse/SPARK-2371
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Reporter: Matei Zaharia
>
> It's somewhat confusing that these don't show up, so you wonder whether your 
> job is frozen. We probably need to give them a stage ID and somehow mark them 
> specially in the UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-10874) add Search box to History Page

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-10874.

Resolution: Duplicate

Resolving as a duplicate of SPARK-10775

> add Search box to History Page
> --
>
> Key: SPARK-10874
> URL: https://issues.apache.org/jira/browse/SPARK-10874
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>
> It's hard to navigate the history server. It would be really nice to have a 
> search box to look for just the applications you are interested in. It should 
> search all columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12657) Revert SPARK-12617

2016-01-05 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-12657:


 Summary: Revert SPARK-12617
 Key: SPARK-12657
 URL: https://issues.apache.org/jira/browse/SPARK-12657
 Project: Spark
  Issue Type: Sub-task
Reporter: Shixiong Zhu


SPARK-12617 is just a workaround; since Py4J is going to fix the underlying 
issue, we should revert the workaround in master when we upgrade Py4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2579) Reading from S3 returns an inconsistent number of items with Spark 0.9.1

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2579.
---
Resolution: Cannot Reproduce

Resolving as "cannot reproduce" pending more information. Please comment here / 
keep reporting if this is still an issue.

> Reading from S3 returns an inconsistent number of items with Spark 0.9.1
> 
>
> Key: SPARK-2579
> URL: https://issues.apache.org/jira/browse/SPARK-2579
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 0.9.1
>Reporter: Eemil Lagerspetz
>Priority: Critical
>  Labels: hdfs, read, s3, skipping
>
> I have created a random matrix of 1M rows with 10K items on each row, 
> semicolon-separated. While reading it with Spark 0.9.1 and doing a count, I 
> consistently get less than 1M rows, and a different number every time at that 
> ( !! ). Example below:
> head -n 1 tool-generate-random-matrix*log
> ==> tool-generate-random-matrix-999158.log <==
> Row item counts: 999158
> ==> tool-generate-random-matrix.log <==
> Row item counts: 997163
> The data is split into 1000 partitions. When I download it using s3cmd sync, 
> and run the following AWK on it, I get the correct number of rows in each 
> partition (1000x1000 = 1M). What is up?
> {code:title=checkrows.sh|borderStyle=solid}
> for k in part-0*
> do
>   echo $k
>   awk -F ";" '
> NF != 1 {
>   print "Wrong number of items:",NF
> }
> END {
>   if (NR != 1000) {
> print "Wrong number of rows:",NR
>   }
> }' "$k"
> done
> {code}
> The matrix generation and counting code is below:
> {code:title=Matrix.scala|borderStyle=solid}
> package fi.helsinki.cs.nodes.matrix
> import java.util.Random
> import org.apache.spark._
> import org.apache.spark.SparkContext._
> import scala.collection.mutable.ListBuffer
> import org.apache.spark.rdd.RDD
> import org.apache.spark.storage.StorageLevel._
> object GenerateRandomMatrix {
>   def NewGeMatrix(rSeed: Int, rdd: RDD[Int], features: Int) = {
> rdd.mapPartitions(part => part.map(xarr => {
> val rdm = new Random(rSeed + xarr)
> val arr = new Array[Double](features)
> for (i <- 0 until features)
>   arr(i) = rdm.nextDouble()
> new Row(xarr, arr)
>   }))
>   }
>   case class Row(id: Int, elements: Array[Double]) {}
>   def rowFromText(line: String) = {
> val idarr = line.split(" ")
> val arr = idarr(1).split(";")
> // -1 to fix saved matrix indexing error
> new Row(idarr(0).toInt-1, arr.map(_.toDouble))
>   }
>   def main(args: Array[String]) {
> val master = args(0)
> val tasks = args(1).toInt
> val savePath = args(2)
> val read = args.contains("read")
> 
> val datapoints = 100
> val features = 1
> val sc = new SparkContext(master, "RandomMatrix")
> if (read) {
>   val randomMatrix: RDD[Row] = sc.textFile(savePath, 
> tasks).map(rowFromText).persist(MEMORY_AND_DISK)
>   println("Row item counts: "+ randomMatrix.count)
> } else {
>   val rdd = sc.parallelize(0 until datapoints, tasks)
>   val bcSeed = sc.broadcast(128)
>   /* Generating a matrix of random Doubles */
>   val randomMatrix = NewGeMatrix(bcSeed.value, rdd, 
> features).persist(MEMORY_AND_DISK)
>   randomMatrix.map(row => row.id + " " + 
> row.elements.mkString(";")).saveAsTextFile(savePath)
> }
> 
> sc.stop
>   }
> }
> {code}
> I run this with:
> appassembler/bin/tool-generate-random-matrix master 1000 
> s3n://keys@path/to/data 1>matrix.log 2>matrix.err
> Reading from HDFS gives the right count and right number of items on each 
> row. However, I had to run with the full path including the server name; just 
> /matrix does not work (it thinks I want file://):
> p="hdfs://ec2-54-188-6-77.us-west-2.compute.amazonaws.com:9000/matrix"
> appassembler/bin/tool-generate-random-matrix $( cat 
> /root/spark-ec2/cluster-url ) 1000 "$p" read 1>readmatrix.log 2>readmatrix.err



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12617) socket descriptor leak killing streaming app

2016-01-05 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-12617.

   Resolution: Fixed
Fix Version/s: 1.6.1
   1.5.3
   2.0.0

Issue resolved by pull request 10579
[https://github.com/apache/spark/pull/10579]

> socket descriptor leak killing streaming app
> 
>
> Key: SPARK-12617
> URL: https://issues.apache.org/jira/browse/SPARK-12617
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Streaming
>Affects Versions: 1.5.2
> Environment: pyspark (python 2.6)
>Reporter: Antony Mayi
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.0, 1.5.3, 1.6.1
>
> Attachments: bug.py
>
>
> There is a socket descriptor leak in a pyspark streaming app when configured 
> with a batch interval of more than 30 seconds. This is due to the default 
> timeout in the py4j JavaGateway, which (half-)closes the CallbackConnection 
> after 30 seconds of inactivity and creates a new one the next time. Those 
> connections don't get closed on the python CallbackServer side and keep 
> piling up until they eventually block new connections.
> h2. Steps to reproduce:
> * Submit attached [^bug.py] to spark
> * Watch {{/tmp/bug.log}} to see the increasing total number of py4j callback 
> connections, of which none are ever closed
> {code}
> [BUG] py4j callback server port: 51282
> [BUG] py4j CB 0/0 closed
> ...
> [BUG] py4j CB 0/123 closed
> {code}
> * You can confirm this by using lsof on the pyspark driver process:
> {code}
> $ sudo lsof -p 39770 | grep CLOSE_WAIT | grep :51282
> python2.6 39770  das   94u  IPv4 138824906  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:60419 (CLOSE_WAIT)
> python2.6 39770  das   95u  IPv4 138867747  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:60745 (CLOSE_WAIT)
> python2.6 39770  das   96u  IPv4 138831829  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:32849 (CLOSE_WAIT)
> python2.6 39770  das   97u  IPv4 138890524  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33184 (CLOSE_WAIT)
> python2.6 39770  das   98u  IPv4 138860190  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33512 (CLOSE_WAIT)
> python2.6 39770  das   99u  IPv4 138860439  0t0   TCP 
> localhost.localdomain:51282->localhost.localdomain:33854 (CLOSE_WAIT)
> ...
> {code}
> * If you leave it running long enough, the CallbackServer eventually becomes 
> unable to accept new connections from the gateway and the app crashes:
> {code}
> 16/01/02 05:12:07 ERROR scheduler.JobScheduler: Error generating jobs for 
> time 145171140 ms
> py4j.Py4JException: Error while obtaining a new communication channel
> ...
> Caused by: java.net.ConnectException: Connection timed out
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at 
> java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at 
> java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at 
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> at java.net.Socket.connect(Socket.java:589)
> at java.net.Socket.connect(Socket.java:538)
> at java.net.Socket.<init>(Socket.java:434)
> at java.net.Socket.<init>(Socket.java:244)
> at py4j.CallbackConnection.start(CallbackConnection.java:104)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12647.
--
   Resolution: Fixed
Fix Version/s: 1.6.1

Issue resolved by pull request 10599
[https://github.com/apache/spark/pull/10599]

> 1.6 branch test failure 
> o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
> reducers: aggregate operator
> ---
>
> Key: SPARK-12647
> URL: https://issues.apache.org/jira/browse/SPARK-12647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Pete Robbins
>Priority: Minor
> Fix For: 1.6.1
>
>
> All 1.6 branch builds are failing, e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/
> 3 did not equal 2
> The PR for SPARK-12470 causes a change in partition size, so the test needs updating



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12647) 1.6 branch test failure o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of reducers: aggregate operator

2016-01-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12647:
-
Assignee: Pete Robbins

> 1.6 branch test failure 
> o.a.s.sql.execution.ExchangeCoordinatorSuite.determining the number of 
> reducers: aggregate operator
> ---
>
> Key: SPARK-12647
> URL: https://issues.apache.org/jira/browse/SPARK-12647
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>Priority: Minor
> Fix For: 1.6.1
>
>
> All 1.6 branch builds are failing, e.g. 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-branch-1.6-test-maven-pre-yarn-2.0.0-mr1-cdh4.1.2/lastCompletedBuild/testReport/org.apache.spark.sql.execution/ExchangeCoordinatorSuite/determining_the_number_of_reducers__aggregate_operator/
> 3 did not equal 2
> The PR for SPARK-12470 causes a change in partition size, so the test needs updating



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12470) Incorrect calculation of row size in o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner

2016-01-05 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-12470:
-
Assignee: Pete Robbins

> Incorrect calculation of row size in 
> o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
> ---
>
> Key: SPARK-12470
> URL: https://issues.apache.org/jira/browse/SPARK-12470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Pete Robbins
>Assignee: Pete Robbins
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> While looking into https://issues.apache.org/jira/browse/SPARK-12319 I 
> noticed that the row size is incorrectly calculated.
> The "sizeReduction" value is calculated in words:
> // The number of words we can reduce when we concat two rows together.
> // The only reduction comes from merging the bitset portion of the two rows, saving 1 word.
> val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords
> but then it is subtracted from the size of the row in bytes:
> |out.pointTo(buf, ${schema1.size + schema2.size}, sizeInBytes - $sizeReduction);
>  
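
For context (an illustration only, not the actual patch): the mismatch is one 
of units. The reduction is computed in 8-byte words while {{sizeInBytes}} is in 
bytes, so the reduction has to be scaled before it is subtracted. A minimal 
sketch with hypothetical word counts:

{code:title=WordsVsBytes.scala|borderStyle=solid}
// Hypothetical values, for illustration only.
val bitset1Words = 2        // words used by row 1's null-tracking bitset
val bitset2Words = 1        // words used by row 2's null-tracking bitset
val outputBitsetWords = 2   // words used by the concatenated row's bitset
val sizeInBytes = 256L      // combined size of the two rows, in bytes

// The saving from merging the bitsets, measured in 8-byte words ...
val sizeReduction = bitset1Words + bitset2Words - outputBitsetWords
// ... must be converted to bytes before subtracting it from sizeInBytes.
val correctedOutputSize = sizeInBytes - sizeReduction * 8L
println(correctedOutputSize) // 248, not 255
{code}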



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-584) Pass slave ip address when starting a cluster

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-584.
--
Resolution: Incomplete

Resolving as "incomplete." Please submit a PR if this is still an issue.

> Pass slave ip address when starting a cluster 
> --
>
> Key: SPARK-584
> URL: https://issues.apache.org/jira/browse/SPARK-584
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 0.6.0
>Priority: Minor
> Attachments: 0001-fix-for-SPARK-584.patch
>
>
> Pass slave ip address from conf while starting a cluster:
> bin/start-slaves.sh is used to start all the slaves in the cluster. While the 
> slave class takes a --ip argument, we don't pass the ip address from the 
> conf/slaves. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-4212) Actor not found

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-4212.
---
Resolution: Cannot Reproduce

Resolving as "cannot reproduce" since this is really old.

> Actor not found
> ---
>
> Key: SPARK-4212
> URL: https://issues.apache.org/jira/browse/SPARK-4212
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Davies Liu
>
> I tried to run a PySpark test, but it hung:
> NOTE: SPARK_PREPEND_CLASSES is set, placing locally compiled Spark classes 
> ahead of assembly.
> 14/11/03 12:32:58 WARN Remoting: Tried to associate with unreachable remote 
> address [akka.tcp://sparkDriver@dm:7077]. Address is now gated for 5000 ms, 
> all messages to this address will be delivered to dead letters. Reason: 
> Connection refused: dm/192.168.1.11:7077
> 14/11/03 12:32:58 ERROR OneForOneStrategy: Actor not found for: 
> ActorSelection[Anchor(akka.tcp://sparkDriver@dm:7077/), 
> Path(/user/HeartbeatReceiver)]
> akka.actor.ActorInitializationException: exception during creation
>   at akka.actor.ActorInitializationException$.apply(Actor.scala:164)
>   at akka.actor.ActorCell.create(ActorCell.scala:596)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:456)
>   at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
>   at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:263)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Caused by: akka.actor.ActorNotFound: Actor not found for: 
> ActorSelection[Anchor(akka.tcp://sparkDriver@dm:7077/), 
> Path(/user/HeartbeatReceiver)]
>   at 
> akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
>   at 
> akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
>   at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>   at 
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
>   at 
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
>   at 
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
>   at 
> akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
>   at 
> scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
>   at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
>   at 
> akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
>   at 
> akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
>   at 
> akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
>   at 
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
>   at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
>   at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
>   at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
>   at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
>   at 
> akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
>   at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
>   at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
>   at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
>   at 
> akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
>   at 
> akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
>   at akka.actor.ActorCell.terminate(ActorCell.scala:369)
>   at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
>   ... 8 more
> ^CTraceback (most recent call last):
>   File "python/pyspark/tests.py", line 1627, in 
> unittest.main()
>   File "//anaconda/lib/python2.7/unittest/main.py", line 95, in __init__
> self.runTests()
>   File "//anaconda/lib/python2.7/unittest/main.py", line 232, in runTests
> self.result = testRunner.run(self.test)
>   File "//anaconda/lib/python2.7/unittest/runner.py", line 151, in run
> test(result)
>   File 

[jira] [Resolved] (SPARK-1835) sbt gen-idea includes both mesos and mesos with shaded-protobuf into dependencies

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1835?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1835.
---
Resolution: Won't Fix

Resolving as "Won't Fix." Use the IntelliJ Maven / SBT import instead of 
gen-idea.

> sbt gen-idea includes both mesos and mesos with shaded-protobuf into 
> dependencies
> -
>
> Key: SPARK-1835
> URL: https://issues.apache.org/jira/browse/SPARK-1835
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> gen-idea includes both mesos-0.18.1 and mesos-0.18.1-shaded-protobuf in the 
> dependencies. This causes a compile error because mesos-0.18.1 comes first 
> and there is no protobuf jar in the dependencies.
> A workaround is to delete mesos-0.18.1.jar manually from IntelliJ IDEA. 
> Another solution is to publish the shaded jar as a separate version instead 
> of using a classifier.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12095) Window function rowsBetween throws exception

2016-01-05 Thread Irakli Machabeli (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Irakli Machabeli closed SPARK-12095.


Ignore; initially I was testing on Windows without Hive, so HiveContext was not 
available.

> Window function rowsBetween throws exception
> 
>
> Key: SPARK-12095
> URL: https://issues.apache.org/jira/browse/SPARK-12095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Irakli Machabeli
>
> From pyspark :
>  windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', 
> 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT')
> Py4JError: An error occurred while calling o1107.rowsBetween. Trace:
> py4j.Py4JException: Method rowsBetween([class java.lang.String, class 
> java.lang.Long]) does not exist
> from SQL query parser fails immediately:
> Py4JJavaError: An error occurred while calling o18.sql.
> : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found
> select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl
>^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12095) Window function rowsBetween throws exception

2016-01-05 Thread Tristan Reid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083823#comment-15083823
 ] 

Tristan Reid commented on SPARK-12095:
--

Ah thanks, I see - I was using SqlContext, not HiveContext.  I just saw in a SO 
post that only HiveContext works.  Is that reflected in the docs somewhere?  

> Window function rowsBetween throws exception
> 
>
> Key: SPARK-12095
> URL: https://issues.apache.org/jira/browse/SPARK-12095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Irakli Machabeli
>
> From pyspark :
>  windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', 
> 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT')
> Py4JError: An error occurred while calling o1107.rowsBetween. Trace:
> py4j.Py4JException: Method rowsBetween([class java.lang.String, class 
> java.lang.Long]) does not exist
> from SQL query parser fails immediately:
> Py4JJavaError: An error occurred while calling o18.sql.
> : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found
> select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl
>^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12095) Window function rowsBetween throws exception

2016-01-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083847#comment-15083847
 ] 

Yin Huai commented on SPARK-12095:
--

Does the error message that you got when using the DataFrame API mention 
anything? If not, feel free to create a JIRA (a PR is welcome) to add a doc to 
the 1.6 branch (relevant parts are 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/expressions/Window.scala,
 
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L557-L768,
 
https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/functions.py#L151-L183
 and 
https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/functions.py#L600-L649).
 https://issues.apache.org/jira/browse/SPARK-8641 added the native window 
function support. So, starting from 2.0, you can use window functions with 
SQLContext.
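
To make the two points above concrete, here is a minimal Scala sketch (not from 
this thread) of the pre-2.0 usage: it assumes a build with Hive support, uses 
HiveContext rather than SQLContext, and passes numeric bounds to rowsBetween 
(Long.MinValue for UNBOUNDED PRECEDING, 0 for CURRENT ROW) instead of strings:

{code:title=WindowExample.scala|borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext

object WindowExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WindowExample"))
    // Before 2.0, window functions require HiveContext; a plain SQLContext
    // fails with the errors quoted in this issue.
    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    val df = Seq(("a", 1.0), ("a", 2.0), ("b", 3.0)).toDF("k", "v")

    // rowsBetween takes Long bounds, not strings:
    // Long.MinValue = UNBOUNDED PRECEDING, 0 = CURRENT ROW.
    val w = Window.partitionBy("k").orderBy("v").rowsBetween(Long.MinValue, 0)
    df.select($"k", $"v", sum($"v").over(w).as("running_sum")).show()

    sc.stop()
  }
}
{code}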

> Window function rowsBetween throws exception
> 
>
> Key: SPARK-12095
> URL: https://issues.apache.org/jira/browse/SPARK-12095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Irakli Machabeli
>
> From pyspark :
>  windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', 
> 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT')
> Py4JError: An error occurred while calling o1107.rowsBetween. Trace:
> py4j.Py4JException: Method rowsBetween([class java.lang.String, class 
> java.lang.Long]) does not exist
> from SQL query parser fails immediately:
> Py4JJavaError: An error occurred while calling o18.sql.
> : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found
> select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl
>^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12095) Window function rowsBetween throws exception

2016-01-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083851#comment-15083851
 ] 

Yin Huai commented on SPARK-12095:
--

[~imachabeli] where did you find it?

> Window function rowsBetween throws exception
> 
>
> Key: SPARK-12095
> URL: https://issues.apache.org/jira/browse/SPARK-12095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Irakli Machabeli
>
> From pyspark :
>  windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', 
> 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT')
> Py4JError: An error occurred while calling o1107.rowsBetween. Trace:
> py4j.Py4JException: Method rowsBetween([class java.lang.String, class 
> java.lang.Long]) does not exist
> from SQL query parser fails immediately:
> Py4JJavaError: An error occurred while calling o18.sql.
> : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found
> select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl
>^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12095) Window function rowsBetween throws exception

2016-01-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083853#comment-15083853
 ] 

Yin Huai commented on SPARK-12095:
--

Ah, I found it. It is inside column.py 
(https://github.com/apache/spark/blob/branch-1.6/python/pyspark/sql/column.py#L438).
 I guess we can just add a note at 
https://github.com/apache/spark/blob/branch-1.6/sql/core/src/main/scala/org/apache/spark/sql/Column.scala#L1049.
 

> Window function rowsBetween throws exception
> 
>
> Key: SPARK-12095
> URL: https://issues.apache.org/jira/browse/SPARK-12095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Irakli Machabeli
>
> From pyspark :
>  windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', 
> 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT')
> Py4JError: An error occurred while calling o1107.rowsBetween. Trace:
> py4j.Py4JException: Method rowsBetween([class java.lang.String, class 
> java.lang.Long]) does not exist
> from SQL query parser fails immediately:
> Py4JJavaError: An error occurred while calling o18.sql.
> : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found
> select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl
>^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6641) Add config or control of accumulator on python

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6641.
---
Resolution: Won't Fix

Resolving as "Won't Fix" until there is more detail justifying this feature.

> Add config or control of accumulator on python
> --
>
> Key: SPARK-6641
> URL: https://issues.apache.org/jira/browse/SPARK-6641
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Weizhong
>Priority: Minor
>
> Currently, when we initialize a SparkContext in Python, it creates a single 
> Accumulator in Java and starts a TCP server in a daemon thread.
> We could add a config option to not start this TCP server when the 
> SparkContext starts, and only start it when an accumulator is used.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-9385) Python style check fails at jenkins

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-9385.
---
Resolution: Won't Fix

Resolving as "Won't Fix" for now. We'll revisit a different form of Python 
style checking down the line.

> Python style check fails at jenkins 
> 
>
> Key: SPARK-9385
> URL: https://issues.apache.org/jira/browse/SPARK-9385
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.0
>Reporter: Yin Huai
>Priority: Blocker
>
> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3088/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/console
> Seems there is something wrong with installing pylint. I tried it locally. 
> When I first ran dev/lint-python, I got the error shown in the Jenkins log. 
> The second time I ran dev/lint-python, it was fine.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11815) PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed

2016-01-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11815:
--
Assignee: Yanbo Liang
Target Version/s: 2.0.0

> PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed
> -
>
> Key: SPARK-11815
> URL: https://issues.apache.org/jira/browse/SPARK-11815
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> PySpark DecisionTreeClassifier & DecisionTreeRegressor should extend HasSeed, 
> just like we do on the Scala side.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12041) Add columnSimilarities to IndexedRowMatrix for PySpark

2016-01-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12041:
--
Target Version/s: 2.0.0

> Add columnSimilarities to IndexedRowMatrix for PySpark
> --
>
> Key: SPARK-12041
> URL: https://issues.apache.org/jira/browse/SPARK-12041
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>  Labels: starter
>
> Add columnSimilarities to IndexedRowMatrix for PySpark; please refer to the 
> peer Scala issue, SPARK-10654.
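
For reference, here is a minimal Scala sketch of the behaviour the Python 
wrapper would expose (assumptions: Spark with MLlib on the classpath, and going 
through {{toRowMatrix()}} to reach the existing 
{{RowMatrix.columnSimilarities()}}):

{code:title=ColumnSimilaritiesExample.scala|borderStyle=solid}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

object ColumnSimilaritiesExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ColumnSimilarities"))

    val rows = sc.parallelize(Seq(
      IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0)),
      IndexedRow(1L, Vectors.dense(4.0, 5.0, 6.0)),
      IndexedRow(2L, Vectors.dense(7.0, 8.0, 9.0))))
    val mat = new IndexedRowMatrix(rows)

    // Upper-triangular cosine similarities between columns, returned as a
    // CoordinateMatrix of (i, j, similarity) entries.
    val sims = mat.toRowMatrix().columnSimilarities()
    sims.entries.collect().foreach(println)

    sc.stop()
  }
}
{code}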



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5179) Spark UI history job duration is wrong

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5179.
---
Resolution: Fixed

I believe that a duplicate of this issue was fixed for Spark 1.2.1 or 1.3 or 
something like that. Please re-open if this is still an issue.

> Spark UI history job duration is wrong
> --
>
> Key: SPARK-5179
> URL: https://issues.apache.org/jira/browse/SPARK-5179
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.2.0
>Reporter: Olivier Toupin
>Priority: Minor
>
> In the Web UI, the job duration times are wrong when reviewing a job through 
> the history server. The stage duration times are OK.
> Jobs are shown with durations of milliseconds, which is wrong. However, it's 
> only a history issue; while the job is running, it works.
> More details in that discussion on the mailing list:
> http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-UI-history-job-duration-is-wrong-tc10010.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5076) Don't show "Cores" or "Memory Per Node" columns for completed applications

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-5076.
---
Resolution: Won't Fix

Resolving as "won't fix" for now.

> Don't show "Cores" or "Memory Per Node" columns for completed applications
> --
>
> Key: SPARK-5076
> URL: https://issues.apache.org/jira/browse/SPARK-5076
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Josh Rosen
>Priority: Minor
>
> In the Master web UI, I don't think that it makes sense to show "Cores" and 
> "Memory per Node" for completed applications; the current behavior may be 
> confusing to users: 
> https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201412.mbox/%3c2ad05705-f7b6-4cf2-b315-6d5483326...@qq.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12657) Revert SPARK-12617

2016-01-05 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-12657:
-
Target Version/s: 2.0.0

> Revert SPARK-12617
> --
>
> Key: SPARK-12657
> URL: https://issues.apache.org/jira/browse/SPARK-12657
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Shixiong Zhu
>
> SPARK-12617 is just a workaround. Since Py4J is going to fix it, we should 
> revert it in master when we upgrade Py4J.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10498) Add requirements file for create dev python tools

2016-01-05 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083778#comment-15083778
 ] 

Josh Rosen commented on SPARK-10498:


Agreed. Want to try running the tests in a sandbox to figure out the complete 
set of deps, then submit a PR?

> Add requirements file for create dev python tools
> -
>
> Key: SPARK-10498
> URL: https://issues.apache.org/jira/browse/SPARK-10498
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: holdenk
>Priority: Minor
>
> Minor, since so few people use them, but it would probably be good to have a 
> requirements file for our Python release tools for easier setup (and version 
> pinning).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6689) MiniYarnCLuster still test failed with hadoop-2.2

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-6689.
---
Resolution: Cannot Reproduce

Resolving as "cannot reproduce" for now.

> MiniYarnCLuster still test failed with hadoop-2.2
> -
>
> Key: SPARK-6689
> URL: https://issues.apache.org/jira/browse/SPARK-6689
> Project: Spark
>  Issue Type: Test
>  Components: Tests, YARN
>Affects Versions: 1.3.0
>Reporter: Zhang, Liye
>Priority: Minor
>
> When running the unit test *YarnClusterSuite* with *hadoop-2.2*, an exception 
> is thrown because of *Timed out waiting for RM to come up*. Some previous 
> related discussion can be traced in 
> [spark-3710|https://issues.apache.org/jira/browse/SPARK-3710] 
> ([PR2682|https://github.com/apache/spark/pull/2682]) and 
> [spark-2778|https://issues.apache.org/jira/browse/SPARK-2778] 
> ([PR2605|https://github.com/apache/spark/pull/2605]). 
> With the command *build/sbt -Pyarn -Phadoop-2.2 "test-only 
> org.apache.spark.deploy.yarn.YarnClusterSuite"*, you will get the following 
> exceptions: 
> {noformat}
> [info] Exception encountered when attempting to run a suite with class name: 
> org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** (15 seconds, 
> 799 milliseconds)
> [info]   java.lang.IllegalStateException: Timed out waiting for RM to come up.
> [info]   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:114)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
> [info]   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite.beforeAll(YarnClusterSuite.scala:44)
> [info]   at 
> org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
> [info]   at 
> org.apache.spark.deploy.yarn.YarnClusterSuite.run(YarnClusterSuite.scala:44)
> [info]   at 
> org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
> [info]   at 
> org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:294)
> [info]   at sbt.ForkMain$Run$2.call(ForkMain.java:284)
> [info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> [info]   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> [info]   at java.lang.Thread.run(Thread.java:745)
> {noformat}
> And without *-Phadoop-2.2*, or with it replaced by "*-Dhadoop.version*" (e.g. 
> build/sbt -Pyarn "test-only org.apache.spark.deploy.yarn.YarnClusterSuite"), 
> more info comes out:
> {noformat}
> Exception in thread "Thread-7" java.lang.NoClassDefFoundError: 
> org/mortbay/jetty/servlet/Context
>   at org.apache.hadoop.yarn.webapp.WebApps.$for(WebApps.java:309)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:602)
>   at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:655)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper$2.run(MiniYARNCluster.java:219)
> Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.servlet.Context
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>   at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
>   at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> [info] Resolving org.apache.hadoop#hadoop-yarn-server-common;2.2.0 ...
> Exception in thread "Thread-18" java.lang.NoClassDefFoundError: 
> org/mortbay/jetty/servlet/Context
>   at org.apache.hadoop.yarn.webapp.WebApps.$for(WebApps.java:309)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer.serviceStart(WebServer.java:62)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:199)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.server.MiniYARNCluster$NodeManagerWrapper$1.run(MiniYARNCluster.java:337)
> Caused by: java.lang.ClassNotFoundException: org.mortbay.jetty.servlet.Context
>   at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
>   at 

[jira] [Updated] (SPARK-12570) DecisionTreeRegressor: provide variance of prediction: user guide update

2016-01-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12570:
--
Shepherd: Joseph K. Bradley
Assignee: Yanbo Liang
Target Version/s: 2.0.0

> DecisionTreeRegressor: provide variance of prediction: user guide update
> 
>
> Key: SPARK-12570
> URL: https://issues.apache.org/jira/browse/SPARK-12570
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Minor
>
> See the linked JIRA for details. This should update the table of output 
> columns and the text. Examples are probably not needed.
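
For context here (not proposed for the user guide itself), a minimal Scala 
sketch of the output column in question, assuming Spark 2.0 and the 
{{varianceCol}} Param added by the linked JIRA; the data and names are made up 
for illustration:

{code:title=VarianceColExample.scala|borderStyle=solid}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.DecisionTreeRegressor
import org.apache.spark.sql.SparkSession

object VarianceColExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("VarianceColExample").getOrCreate()
    import spark.implicits._

    // Toy training data, for illustration only.
    val training = Seq(
      (1.0, Vectors.dense(0.0, 1.0)),
      (2.0, Vectors.dense(1.0, 0.0)),
      (3.0, Vectors.dense(1.0, 1.0))
    ).toDF("label", "features")

    val dt = new DecisionTreeRegressor()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setVarianceCol("variance") // optional column holding the per-prediction variance

    val model = dt.fit(training)
    model.transform(training).select("prediction", "variance").show()

    spark.stop()
  }
}
{code}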



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12095) Window function rowsBetween throws exception

2016-01-05 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083813#comment-15083813
 ] 

Yin Huai commented on SPARK-12095:
--

Are you using {{sqlContext.sql(yourQuery)}}? Are you using HiveContext?

> Window function rowsBetween throws exception
> 
>
> Key: SPARK-12095
> URL: https://issues.apache.org/jira/browse/SPARK-12095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Irakli Machabeli
>
> From pyspark :
>  windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', 
> 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT')
> Py4JError: An error occurred while calling o1107.rowsBetween. Trace:
> py4j.Py4JException: Method rowsBetween([class java.lang.String, class 
> java.lang.Long]) does not exist
> from SQL query parser fails immediately:
> Py4JJavaError: An error occurred while calling o18.sql.
> : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found
> select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl
>^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1155) Clean up and document use of SparkEnv

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1155.
---
Resolution: Invalid

This issue is no longer relevant now that SparkEnv is no longer a thread-local.

> Clean up and document use of SparkEnv
> -
>
> Key: SPARK-1155
> URL: https://issues.apache.org/jira/browse/SPARK-1155
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> We should provide better documentation explaining what SparkEnv is and why it 
> needs to be thread local (basically, to allow it to be accessed inside of 
> closures on executors). Also, in cases where SparkEnv is being accessed on 
> the driver we should access it through the associated SparkContext rather 
> than through the thread local. Finally, we should see if it's possible to 
> just remove this as a thread local and instead make it a static singleton 
> that the executor sets once. This last thing might not be possible if, under 
> certain code paths, this is used on the driver.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5330) Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core :jackson-core:2.3.1 causes compatibility issues

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-5330:
--
Component/s: (was: Spark Core)
 Build

> Core | Scala 2.11 | Transitive dependency on com.fasterxml.jackson.core 
> :jackson-core:2.3.1 causes compatibility issues
> ---
>
> Key: SPARK-5330
> URL: https://issues.apache.org/jira/browse/SPARK-5330
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.2.0
>Reporter: Aniket Bhatnagar
>Priority: Minor
>
> Spark transitively depends on com.fasterxml.jackson.core:jackson-core:2.3.1. 
> Users of jackson-module-scala have to depend on the same version to avoid 
> any class compatibility issues. However, since Scala 2.11, 
> jackson-module-scala is no longer published for version 2.3.1. Since 
> version 2.3.1 is quite old, perhaps we should investigate upgrading to the 
> latest jackson-core. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2168) History Server rendered page not suitable for load balancing

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2168.
---
   Resolution: Fixed
Fix Version/s: 1.4.2
   1.3.2

> History Server rendered page not suitable for load balancing
> ---
>
> Key: SPARK-2168
> URL: https://issues.apache.org/jira/browse/SPARK-2168
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Lukasz Jastrzebski
>Priority: Minor
> Fix For: 1.3.2, 1.4.2
>
>
> Small issue but still.
> I run the history server through Marathon and balance it through haproxy. The 
> problem is that links generated by HistoryPage (links to completed 
> applications) are absolute, e.g. 
> <a href="http://some-server:port/history/...">completedApplicationName</a>, but 
> instead they should be relative, e.g. 
> <a href="/history/...">completedApplicationName</a>, so they can be load 
> balanced. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2168) History Server rendered page not suitable for load balancing

2016-01-05 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2168:
--
Assignee: Lukasz Jastrzebski

> History Server rendered page not suitable for load balancing
> ---
>
> Key: SPARK-2168
> URL: https://issues.apache.org/jira/browse/SPARK-2168
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Lukasz Jastrzebski
>Assignee: Lukasz Jastrzebski
>Priority: Minor
> Fix For: 1.3.2, 1.4.2
>
>
> Small issue but still.
> I run the history server through Marathon and balance it through haproxy. The 
> problem is that links generated by HistoryPage (links to completed 
> applications) are absolute, e.g. 
> <a href="http://some-server:port/history/...">completedApplicationName</a>, but 
> instead they should be relative, e.g. 
> <a href="/history/...">completedApplicationName</a>, so they can be load 
> balanced. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12095) Window function rowsBetween throws exception

2016-01-05 Thread Irakli Machabeli (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083833#comment-15083833
 ] 

Irakli Machabeli commented on SPARK-12095:
--

It is mentioned briefly in the API docs: 
"Note Window functions is only supported with HiveContext in 1.4"

> Window function rowsBetween throws exception
> 
>
> Key: SPARK-12095
> URL: https://issues.apache.org/jira/browse/SPARK-12095
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1
>Reporter: Irakli Machabeli
>
> From pyspark :
>  windowSpec=Window.partitionBy('A', 'B').orderBy('A','B', 
> 'C').rowsBetween('UNBOUNDED PRECEDING','CURRENT')
> Py4JError: An error occurred while calling o1107.rowsBetween. Trace:
> py4j.Py4JException: Method rowsBetween([class java.lang.String, class 
> java.lang.Long]) does not exist
> from SQL query parser fails immediately:
> Py4JJavaError: An error occurred while calling o18.sql.
> : java.lang.RuntimeException: [1.20] failure: ``union'' expected but `(' found
> select rank() OVER (PARTITION BY c1 ORDER BY c2 ) as rank from tbl
>^
> at scala.sys.package$.error(package.scala:27)
> at 
> org.apache.spark.sql.catalyst.AbstractSparkSQLParser.parse(AbstractSparkSQLParser.scala:36)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


