[jira] [Updated] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11176: --- Summary: Umbrella ticket for wholeTextFiles bugs (was: Umbrella ticket for wholeTextFiles + S3 bugs) > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files from S3. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10994) Clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Yang updated SPARK-10994: -- Description: The Clustering Coefficient (CC) is a fundamental measure in social (or other types of) network analysis assessing the degree to which nodes tend to cluster together [1][2]. The clustering coefficient, along with density, node degree, path length, diameter, connectedness, and node centrality, is one of the seven most important properties used to characterise a network [3]. We found that GraphX has already implemented connectedness, node centrality, and path length, but does not have a component for computing the clustering coefficient. This was our original motivation for implementing an algorithm to compute the clustering coefficient for each vertex of a given graph. The clustering coefficient is very helpful in many real applications, such as user behaviour prediction and structure prediction (like link prediction). We have used it in several of our own papers (e.g., [4-5]), and have also found many other published papers using this metric in their work [6-8]. We are very confident that this feature will benefit GraphX and attract a large number of users. References [1] https://en.wikipedia.org/wiki/Clustering_coefficient [2] Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of ‘small-world’ networks." Nature 393.6684 (1998): 440-442. (with 27266 citations). [3] https://en.wikipedia.org/wiki/Network_science [4] Jing Zhang, Zhanpeng Fang, Wei Chen, and Jie Tang. Diffusion of "Following" Links in Microblogging Networks. IEEE Transactions on Knowledge and Data Engineering (TKDE), Volume 27, Issue 8, 2015, Pages 2093-2106. [5] Yang Yang, Jie Tang, Jacklyne Keomany, Yanting Zhao, Ying Ding, Juanzi Li, and Liangwei Wang. Mining Competitive Relationships by Learning across Heterogeneous Networks. In Proceedings of the Twenty-First Conference on Information and Knowledge Management (CIKM'12). pp. 1432-1441. 
[6] Clauset, Aaron, Cristopher Moore, and Mark EJ Newman. Hierarchical structure and the prediction of missing links in networks. Nature 453.7191 (2008): 98-101. (with 973 citations) [7] Adamic, Lada A., and Eytan Adar. Friends and neighbors on the web. Social networks 25.3 (2003): 211-230. (1238 citations) [8] Lichtenwalter, Ryan N., Jake T. Lussier, and Nitesh V. Chawla. New perspectives and methods in link prediction. In KDD'10. was: The Clustering Coefficient (CC) is a fundamental measure in social (or other types of) network analysis assessing the degree to which nodes tend to cluster together. We propose to implement an algorithm to compute the clustering coefficient for each vertex of a given graph in GraphX. Specifically, the clustering coefficient of a vertex (node) in a graph quantifies how close its neighbours are to being a clique (complete graph). More formally, the clustering coefficient C_i for a vertex v_i is given by the proportion of links between the vertices within its neighbourhood divided by the number of links that could possibly exist between them. The clustering coefficient is well known and has wide applications. Duncan J. Watts and Steven Strogatz introduced the measure in 1998 to determine whether a graph is a small-world network (1). Their paper has attracted 27266 citations to date. Similar features are included in NetworkX (2), SNAP (3), etc. (1) Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of ‘small-world’ networks." Nature 393.6684 (1998): 440-442. 
(2) http://networkx.github.io/ (3) http://snap.stanford.edu/ > Clustering coefficient computation in GraphX > > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > The Clustering Coefficient (CC) is a fundamental measure in social (or other > types of) network analysis assessing the degree to which nodes tend to cluster > together [1][2]. The clustering coefficient, along with density, node degree, > path length, diameter, connectedness, and node centrality, is one of the seven most > important properties used to characterise a network [3]. > We found that GraphX has already implemented connectedness, node centrality, > and path length, but does not have a component for computing the clustering > coefficient. This was our original motivation for implementing an > algorithm to compute the clustering coefficient for each vertex of a given graph. > The clustering coefficient is very helpful in many real applications, such as > user behaviour prediction and structure prediction (like link prediction). We > have used it in several of our own papers (e.g., [4-5]), and have also found many other > published papers using this metric in their
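As background for the proposal above, the local clustering coefficient C_i can be sketched independently of GraphX. The following plain-Scala fragment is a minimal illustration of the definition given in the ticket (an undirected adjacency map stands in for the graph; this is not the proposed GraphX implementation):

```scala
// Local clustering coefficient of vertex v:
//   C_v = (edges among neighbours of v) / (k * (k - 1) / 2), where k = degree(v).
object ClusteringCoefficient {
  // adj maps each vertex to its neighbour set (undirected, no self-loops).
  def localCC(adj: Map[Int, Set[Int]], v: Int): Double = {
    val nbrs = adj.getOrElse(v, Set.empty[Int])
    val k = nbrs.size
    if (k < 2) 0.0
    else {
      // Each undirected edge between two neighbours is seen twice, hence / 2.
      val links = nbrs.toSeq.map(n => (adj.getOrElse(n, Set.empty[Int]) & nbrs).size).sum / 2
      links.toDouble / (k * (k - 1) / 2)
    }
  }
}
```

For a triangle every vertex has C = 1.0; a vertex whose neighbours share no edges has C = 0.0.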
[jira] [Commented] (SPARK-10994) Clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962929#comment-14962929 ] Yang Yang commented on SPARK-10994: --- Updated the description to explain our motivation in more detail. > Clustering coefficient computation in GraphX > > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > The Clustering Coefficient (CC) is a fundamental measure in social (or other > types of) network analysis assessing the degree to which nodes tend to cluster > together [1][2]. The clustering coefficient, along with density, node degree, > path length, diameter, connectedness, and node centrality, is one of the seven most > important properties used to characterise a network [3]. > We found that GraphX has already implemented connectedness, node centrality, > and path length, but does not have a component for computing the clustering > coefficient. This was our original motivation for implementing an > algorithm to compute the clustering coefficient for each vertex of a given graph. > The clustering coefficient is very helpful in many real applications, such as > user behaviour prediction and structure prediction (like link prediction). We > have used it in several of our own papers (e.g., [4-5]), and have also found many other > published papers using this metric in their work [6-8]. We are very > confident that this feature will benefit GraphX and attract a large number of > users. > References > [1] https://en.wikipedia.org/wiki/Clustering_coefficient > [2] Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of > ‘small-world’ networks." Nature 393.6684 (1998): 440-442. (with 27266 > citations). > [3] https://en.wikipedia.org/wiki/Network_science > [4] Jing Zhang, Zhanpeng Fang, Wei Chen, and Jie Tang. Diffusion of > "Following" Links in Microblogging Networks. 
IEEE Transactions on Knowledge > and Data Engineering (TKDE), Volume 27, Issue 8, 2015, Pages 2093-2106. > [5] Yang Yang, Jie Tang, Jacklyne Keomany, Yanting Zhao, Ying Ding, Juanzi > Li, and Liangwei Wang. Mining Competitive Relationships by Learning across > Heterogeneous Networks. In Proceedings of the Twenty-First Conference on > Information and Knowledge Management (CIKM'12). pp. 1432-1441. > [6] Clauset, Aaron, Cristopher Moore, and Mark EJ Newman. Hierarchical > structure and the prediction of missing links in networks. Nature 453.7191 > (2008): 98-101. (with 973 citations) > [7] Adamic, Lada A., and Eytan Adar. Friends and neighbors on the web. Social > networks 25.3 (2003): 211-230. (1238 citations) > [8] Lichtenwalter, Ryan N., Jake T. Lussier, and Nitesh V. Chawla. New > perspectives and methods in link prediction. In KDD'10. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
prakhar jauhari created SPARK-11181: --- Summary: Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled. Key: SPARK-11181 URL: https://issues.apache.org/jira/browse/SPARK-11181 Project: Spark Issue Type: Bug Components: Scheduler, Spark Core, YARN Affects Versions: 1.3.1 Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. All servers in cluster running Linux version 2.6.32. Job in yarn-client mode. Reporter: prakhar jauhari Fix For: 1.3.2, 1.5.2 Spark driver reduces the total executor count even when Dynamic Allocation is not enabled. To reproduce this: 1. A 2 node yarn setup: each DN has ~ 20GB mem and 4 cores. 2. When the application launches and gets its required executors, one of the DNs loses connectivity and is timed out. 3. Spark issues a killExecutor for the executor on the DN which was timed out. 4. Even with dynamic allocation off, Spark's scheduler reduces "targetNumExecutors". 5. Thus the job runs with a reduced executor count. Note: the severity of the issue increases if some of the DNs that were running my job's executors lose connectivity intermittently: the Spark scheduler reduces "targetNumExecutors" and thus does not ask for new executors on any other nodes, causing the job to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
Nitin Goyal created SPARK-11179: --- Summary: Push filters through aggregate if filters are subset of 'group by' expressions Key: SPARK-11179 URL: https://issues.apache.org/jira/browse/SPARK-11179 Project: Spark Issue Type: Improvement Components: SQL Reporter: Nitin Goyal Priority: Minor Fix For: 1.6.0 Push filters through aggregate if filters are subset of 'group by' expressions. This optimisation can be added to Spark SQL's Optimizer class. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
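The proposed optimisation rests on an algebraic equivalence: when a predicate refers only to grouping expressions, filtering before the aggregate yields the same result as filtering after it, because each group either survives intact or disappears. A minimal plain-Scala model of that equivalence (illustrative only, not the actual Catalyst rule):

```scala
object FilterThroughAggregate {
  // Aggregate: group rows by key and sum the values per group.
  def aggregate(rows: Seq[(String, Int)]): Map[String, Int] =
    rows.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // A predicate that depends only on the grouping key.
  val keep: String => Boolean = _ != "b"

  // Filter applied after the aggregate...
  def filterAfter(rows: Seq[(String, Int)]): Map[String, Int] =
    aggregate(rows).filter { case (k, _) => keep(k) }

  // ...and pushed below it; both must agree for key-only predicates.
  def filterBefore(rows: Seq[(String, Int)]): Map[String, Int] =
    aggregate(rows.filter { case (k, _) => keep(k) })
}
```

Pushing the filter down is profitable because the aggregate then processes fewer rows.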
[jira] [Commented] (SPARK-11144) Add SparkLauncher for Spark Streaming, Spark SQL, etc
[ https://issues.apache.org/jira/browse/SPARK-11144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962903#comment-14962903 ] Jean-Baptiste Onofré commented on SPARK-11144: -- Hi Yuhang, just to confirm: a utility like spark-submit, but programmatic (like SparkLauncher), right? > Add SparkLauncher for Spark Streaming, Spark SQL, etc > - > > Key: SPARK-11144 > URL: https://issues.apache.org/jira/browse/SPARK-11144 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL, Streaming >Affects Versions: 1.5.1 > Environment: Linux x64 >Reporter: Yuhang Chen >Priority: Minor > Labels: launcher > > Now we have org.apache.spark.launcher.SparkLauncher to launch Spark as a child > process. However, it does not support other libs, such as Spark Streaming and > Spark SQL. > What I'm looking for is a utility like spark-submit, with which you can > submit any Spark lib's jobs to all supported resource managers (Standalone, YARN, > Mesos, etc.) in Java/Scala code. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
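For reference, the existing launcher API already covers the spark-submit-as-a-library case for core jobs; the request here is to make the same programmatic submission work smoothly for Streaming/SQL jobs. A minimal usage sketch (assumes Spark 1.4+ on the classpath; the jar path and main class below are hypothetical):

```scala
import org.apache.spark.launcher.SparkLauncher

object LaunchExample {
  def main(args: Array[String]): Unit = {
    // Builds a spark-submit invocation programmatically and launches it
    // as a child process, as spark-submit itself would.
    val process = new SparkLauncher()
      .setAppResource("/path/to/my-job.jar")       // hypothetical application jar
      .setMainClass("com.example.MyStreamingJob")  // hypothetical main class
      .setMaster("yarn-client")
      .setConf(SparkLauncher.DRIVER_MEMORY, "2g")
      .launch()                                    // returns a java.lang.Process
    process.waitFor()
  }
}
```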
[jira] [Commented] (SPARK-11157) Allow Spark to be built without assemblies
[ https://issues.apache.org/jira/browse/SPARK-11157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962897#comment-14962897 ] Jean-Baptiste Onofré commented on SPARK-11157: -- Agree with Marcelo. It's something I had planned as well: creating more fine-grained jar files instead of one big Spark jar. > Allow Spark to be built without assemblies > -- > > Key: SPARK-11157 > URL: https://issues.apache.org/jira/browse/SPARK-11157 > Project: Spark > Issue Type: Umbrella > Components: Build, Spark Core, YARN >Reporter: Marcelo Vanzin > Attachments: no-assemblies.pdf > > > For reasoning, discussion of pros and cons, and other more detailed > information, please see attached doc. > The idea is to be able to build a Spark distribution that has just a > directory full of jars instead of the huge assembly files we currently have. > Getting there requires changes in a bunch of places, I'll try to list the > ones I identified in the document, in the order that I think would be needed > to not break things: > * make streaming backends not be assemblies > Since people may depend on the current assembly artifacts in their > deployments, we can't really remove them; but we can make them be dummy jars > and rely on dependency resolution to download all the jars. > PySpark tests would also need some tweaking here. > * make examples jar not be an assembly > Probably requires tweaks to the {{run-example}} script. The location of the > examples jar would have to change (it won't be able to live in the same place > as the main Spark jars anymore). > * update YARN backend to handle a directory full of jars when launching apps > Currently YARN localizes the Spark assembly (depending on the user > configuration); it needs to be modified so that it can localize all needed > libraries instead of a single jar. 
> * Modify launcher library to handle the jars directory > This should be trivial > * Modify {{assembly/pom.xml}} to generate assembly or a {{libs}} directory > depending on which profile is enabled. > We should keep the option to build with the assembly on by default, for > backwards compatibility, to give people time to prepare. > Filing this bug as an umbrella; please file sub-tasks if you plan to work on > a specific part of the issue. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11176: --- Description: This umbrella ticket gathers together several distinct bug reports related to problems using the wholeTextFiles method to read files. Most of these bugs deal with reading files from S3, but it's not clear whether S3 is necessary to hit these bugs. These issues may have a common underlying cause and should be investigated together. was: This umbrella ticket gathers together several distinct bug reports related to problems using the wholeTextFiles method to read files from S3. These issues may have a common underlying cause and should be investigated together. > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files. Most of these bugs > deal with reading files from S3, but it's not clear whether S3 is necessary > to hit these bugs. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11180: Assignee: Apache Spark > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Assignee: Apache Spark >Priority: Minor > > Currently DataFrame.na.fill does not support the Boolean primitive type. We > have use cases where, during data massaging/preparation, we want to fill boolean > columns with a false/true value. > Ex: > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > We want to set "isOfficer" to false whenever it is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws the exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean to the na.fill function? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962923#comment-14962923 ] Apache Spark commented on SPARK-11180: -- User 'rishabhbhardwaj' has created a pull request for this issue: https://github.com/apache/spark/pull/9166 > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support the Boolean primitive type. We > have use cases where, during data massaging/preparation, we want to fill boolean > columns with a false/true value. > Ex: > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > We want to set "isOfficer" to false whenever it is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws the exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean to the na.fill function? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11180: Assignee: (was: Apache Spark) > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support the Boolean primitive type. We > have use cases where, during data massaging/preparation, we want to fill boolean > columns with a false/true value. > Ex: > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > We want to set "isOfficer" to false whenever it is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws the exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean to the na.fill function? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962931#comment-14962931 ] Apache Spark commented on SPARK-11179: -- User 'nitin2goyal' has created a pull request for this issue: https://github.com/apache/spark/pull/9167 > Push filters through aggregate if filters are subset of 'group by' expressions > -- > > Key: SPARK-11179 > URL: https://issues.apache.org/jira/browse/SPARK-11179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nitin Goyal >Priority: Minor > Fix For: 1.6.0 > > > Push filters through aggregate if filters are subset of 'group by' > expressions. This optimisation can be added in Spark SQL's Optimizer class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11179: Assignee: Apache Spark > Push filters through aggregate if filters are subset of 'group by' expressions > -- > > Key: SPARK-11179 > URL: https://issues.apache.org/jira/browse/SPARK-11179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nitin Goyal >Assignee: Apache Spark >Priority: Minor > Fix For: 1.6.0 > > > Push filters through aggregate if filters are subset of 'group by' > expressions. This optimisation can be added in Spark SQL's Optimizer class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11132) Mean Shift algorithm integration
[ https://issues.apache.org/jira/browse/SPARK-11132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962960#comment-14962960 ] Beck Gaël commented on SPARK-11132: --- Thank you. It's not yet the case for Mean Shift, but I hope it will be. I've published the algorithm at http://spark-packages.org/package/Kybe67/Mean-Shift-LSH. I will prepare it as a Spark package as soon as I can, because I have some sbt issues with spark-package. If something is missing, it will be a pleasure to remedy it. Thank you again for your support. > Mean Shift algorithm integration > > > Key: SPARK-11132 > URL: https://issues.apache.org/jira/browse/SPARK-11132 > Project: Spark > Issue Type: Brainstorming > Components: MLlib >Reporter: Beck Gaël >Priority: Minor > > I made a version of the clustering algorithm Mean Shift in Scala/Spark and > would like to contribute it if you think that it is a good idea. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
[ https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962961#comment-14962961 ] prakhar jauhari commented on SPARK-11181: - On analysing the code (Spark 1.3.1): when a DN goes unreachable, Spark core's HeartbeatReceiver invokes _expireDeadHosts()_, which checks whether Dynamic Allocation is supported and then invokes _"sc.killExecutor()"_ {quote} if (sc.supportDynamicAllocation) \{ sc.killExecutor(executorId) } {quote} Surprisingly, _supportDynamicAllocation_ in _sparkContext.scala_ is defined to return true if the _dynamicAllocationTesting_ flag is enabled or Spark is running on _yarn_ {quote} private\[spark\] def supportDynamicAllocation = master.contains("yarn") || dynamicAllocationTesting {quote} _"sc.killExecutor()"_ dispatches to the configured _"schedulerBackend"_ (CoarseGrainedSchedulerBackend in this case) and invokes _"killExecutors(executorIds)"_. CoarseGrainedSchedulerBackend calculates a _"newTotal"_ for the total number of executors required and sends an update to the application master by invoking _"doRequestTotalExecutors(newTotal)"_. CoarseGrainedSchedulerBackend then invokes _"doKillExecutors(filteredExecutorIds)"_ for the lost executors, thus reducing the total number of executors when a host is intermittently unreachable. > Spark Yarn : Spark reducing total executors count even when Dynamic > Allocation is disabled. > --- > > Key: SPARK-11181 > URL: https://issues.apache.org/jira/browse/SPARK-11181 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, YARN >Affects Versions: 1.3.1 > Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. > All servers in cluster running Linux version 2.6.32. > Job in yarn-client mode. >Reporter: prakhar jauhari > Fix For: 1.3.2 > > > Spark driver reduces the total executor count even when Dynamic Allocation is > not enabled. > To reproduce this: > 1. A 2 node yarn setup: each DN has ~ 20GB mem and 4 cores. > 2. 
When the application launches and gets its required executors, one of the > DNs loses connectivity and is timed out. > 3. Spark issues a killExecutor for the executor on the DN which was timed > out. > 4. Even with dynamic allocation off, Spark's scheduler reduces > "targetNumExecutors". > 5. Thus the job runs with a reduced executor count. > Note: the severity of the issue increases if some of the DNs that were > running my job's executors lose connectivity intermittently: the Spark scheduler > reduces "targetNumExecutors" and thus does not ask for new executors on any other > nodes, causing the job to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
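The analysis above comes down to a guard that conflates "running on YARN" with "dynamic allocation enabled". A minimal plain-Scala model of the two conditions (illustrative names only, not Spark's actual code):

```scala
// A stripped-down stand-in for the relevant configuration.
case class Conf(master: String, dynamicAllocationEnabled: Boolean)

object ExpireDeadHosts {
  // Spark 1.3.1 behaviour as analysed above: on YARN the target executor
  // count is shrunk on host timeout even when dynamic allocation is off.
  def buggyShouldShrinkTarget(c: Conf): Boolean =
    c.master.contains("yarn") // (|| dynamicAllocationTesting in the real code)

  // Intended behaviour: only shrink the target when the user actually
  // enabled dynamic allocation.
  def fixedShouldShrinkTarget(c: Conf): Boolean =
    c.dynamicAllocationEnabled
}
```

With dynamic allocation off on a YARN master, the first predicate still shrinks the target, which is exactly the reported hang.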
[jira] [Updated] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
[ https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] prakhar jauhari updated SPARK-11181: Fix Version/s: (was: 1.5.2) > Spark Yarn : Spark reducing total executors count even when Dynamic > Allocation is disabled. > --- > > Key: SPARK-11181 > URL: https://issues.apache.org/jira/browse/SPARK-11181 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, YARN >Affects Versions: 1.3.1 > Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. > All servers in cluster running Linux version 2.6.32. > Job in yarn-client mode. >Reporter: prakhar jauhari > Fix For: 1.3.2 > > > Spark driver reduces the total executor count even when Dynamic Allocation is > not enabled. > To reproduce this: > 1. A 2 node yarn setup: each DN has ~ 20GB mem and 4 cores. > 2. When the application launches and gets its required executors, one of the > DNs loses connectivity and is timed out. > 3. Spark issues a killExecutor for the executor on the DN which was timed > out. > 4. Even with dynamic allocation off, Spark's scheduler reduces > "targetNumExecutors". > 5. Thus the job runs with a reduced executor count. > Note: the severity of the issue increases if some of the DNs that were > running my job's executors lose connectivity intermittently: the Spark scheduler > reduces "targetNumExecutors" and thus does not ask for new executors on any other > nodes, causing the job to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11180) DataFrameNaFunctions fills does not support Boolean Type:
Satya Narayan created SPARK-11180: - Summary: DataFrameNaFunctions fills does not support Boolean Type: Key: SPARK-11180 URL: https://issues.apache.org/jira/browse/SPARK-11180 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.5.1, 1.5.0 Reporter: Satya Narayan Priority: Minor Currently DataFrame.na.fill does not support the Boolean primitive type. We have use cases where, during data massaging/preparation, we want to fill boolean columns with a false/true value. Ex:
val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)]((1,null,null),(2,"SVP",true),(3,"Dir",false))).toDF("EmpId","Designation","isOfficer")
empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean]
scala> empDf.show
+-----+-----------+---------+
|EmpId|Designation|isOfficer|
+-----+-----------+---------+
|    1|       null|     null|
|    2|        SVP|     true|
|    3|        Dir|    false|
+-----+-----------+---------+
We want to set "isOfficer" to false whenever it is null.
scala> empDf.na.fill(Map("isOfficer"->false))
throws the exception
java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ...
Can you add support for Boolean to the na.fill function? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
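Until na.fill supports booleans, one possible workaround (a sketch assuming the Spark 1.5 DataFrame API and the empDf frame from the example above) is to coalesce the nullable column with a literal default:

```scala
// Replaces nulls in the boolean column with false, without using na.fill.
import org.apache.spark.sql.functions.{coalesce, col, lit}

val filled = empDf.withColumn("isOfficer", coalesce(col("isOfficer"), lit(false)))
```

coalesce returns the first non-null argument, so non-null values in "isOfficer" are left untouched.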
[jira] [Commented] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
[ https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962913#comment-14962913 ] Josh Rosen commented on SPARK-11177: It looks like this is caused by MAPREDUCE-4470, which is not patched in Apache Hadoop 1.x releases. If Spark users cannot upgrade to Hadoop 2.x and absolutely need a fix for this, then one somewhat hacky solution is to use a modified copy of CombineFileInputFormat which lives in the Spark source tree and includes the three-line fix for MAPREDUCE-4470. While this works (I have tests!), it's not an approach which is suitable for inclusion in a Spark release: it's going to be borderline impossible to maintain source- and binary-compatibility with all of our supported Hadoop versions while using this approach. > sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero > bytes > --- > > Key: SPARK-11177 > URL: https://issues.apache.org/jira/browse/SPARK-11177 > Project: Spark > Issue Type: Sub-task >Reporter: Josh Rosen >Assignee: Josh Rosen > > From a user report: > {quote} > When I upload a series of text files to an S3 directory and one of the files > is empty (0 bytes). The sc.wholeTextFiles method stack traces. 
> java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245) > at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > {quote} > It looks like this has been a longstanding issue: > * > http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html > * > https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark > * > https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
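The failure mode can be sketched in miniature: the split computation builds a per-file block array and then unconditionally indexes its first element, so a zero-byte file (zero blocks) triggers the out-of-bounds access. The following is an illustrative Python sketch, not the actual Hadoop code; all function names are made up, and the MAPREDUCE-4470-style fix is reduced to a simple emptiness check.

```python
# Illustrative Python sketch (not the Hadoop source; names are made up) of why
# a zero-byte file breaks split computation, and the MAPREDUCE-4470-style guard.

def blocks_for_file(length, block_size=64):
    """Split a file of `length` bytes into (offset, size) block pairs."""
    if length == 0:
        return []  # a zero-byte file contributes no blocks
    return [(off, min(block_size, length - off))
            for off in range(0, length, block_size)]

def first_block_unguarded(length):
    # Mirrors the buggy pattern: unconditionally index blocks[0], which
    # raises IndexError (cf. ArrayIndexOutOfBoundsException: 0) for an
    # empty file.
    return blocks_for_file(length)[0]

def first_block_guarded(length):
    # The guard: treat an empty file as contributing no splits at all.
    blocks = blocks_for_file(length)
    return blocks[0] if blocks else None
```

Pending a patched Hadoop version, a pragmatic user-side workaround is to filter zero-byte objects out of the input path before calling wholeTextFiles.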
[jira] [Updated] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
[ https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] prakhar jauhari updated SPARK-11181: Target Version/s: 1.3.2 (was: 1.3.2, 1.5.2) > Spark Yarn : Spark reducing total executors count even when Dynamic > Allocation is disabled. > --- > > Key: SPARK-11181 > URL: https://issues.apache.org/jira/browse/SPARK-11181 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, YARN >Affects Versions: 1.3.1 > Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. > All servers in cluster running Linux version 2.6.32. > Job in yarn-client mode. >Reporter: prakhar jauhari > Fix For: 1.3.2 > > > The Spark driver reduces the total executor count even when Dynamic Allocation is > not enabled. > To reproduce this: > 1. A 2-node YARN setup: each DN has ~20GB mem and 4 cores. > 2. When the application launches and gets its required executors, one of the > DNs loses connectivity and is timed out. > 3. Spark issues a killExecutor for the executor on the DN which was timed > out. > 4. Even with dynamic allocation off, Spark's scheduler reduces the > "targetNumExecutors". > 5. Thus the job runs with a reduced executor count. > Note: The severity of the issue increases: if some of the DNs that were > running my job's executors lose connectivity intermittently, the Spark scheduler > reduces "targetNumExecutors", thus not asking for new executors on any other > nodes, causing the job to hang. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
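The reported behavior can be modeled with a toy allocator. This is an assumption-laden sketch, not Spark's actual scheduler code: the point is only that if the target count shrinks unconditionally on executor loss, no replacement is ever requested when dynamic allocation is off.

```python
# Toy model (not Spark's scheduler code) of the reported behavior: on
# executor loss the target executor count shrinks even though dynamic
# allocation is off, so a replacement is never requested.

class ToyAllocator:
    def __init__(self, target, dynamic_allocation):
        self.target = target                          # desired executor count
        self.dynamic_allocation = dynamic_allocation  # feature flag

    def on_executor_lost_buggy(self):
        # Reported behavior: target drops unconditionally.
        self.target -= 1

    def on_executor_lost_fixed(self):
        # Expected behavior: with dynamic allocation off, the target is a
        # fixed request, so losing an executor should not lower it and a
        # replacement should be re-requested up to the original target.
        if self.dynamic_allocation:
            self.target -= 1
```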
[jira] [Assigned] (SPARK-6541) Executor table on Stage page should sort by Executor ID numerically, not lexically
[ https://issues.apache.org/jira/browse/SPARK-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6541: --- Assignee: (was: Apache Spark) > Executor table on Stage page should sort by Executor ID numerically, not > lexically > -- > > Key: SPARK-6541 > URL: https://issues.apache.org/jira/browse/SPARK-6541 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Ryan Williams >Priority: Minor > > Page loads with a table like this: > !http://f.cl.ly/items/0M273s053F2T2K1o441L/Screen%20Shot%202015-03-25%20at%207.07.08%20PM.png! > After clicking "Executor ID" to sort by that column, it sorts numerically: > !http://f.cl.ly/items/01161p3s2H070h1K1a0c/Screen%20Shot%202015-03-25%20at%207.08.26%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
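The underlying bug is plain string sorting: executor IDs are rendered as strings, so "10" sorts before "2". A minimal sketch follows; the `executor_key` helper that also handles a non-numeric "driver" row is a hypothetical fix, not the actual Web UI code.

```python
# Executor IDs are strings in the UI table, so the default sort is
# lexicographic: "10" < "2". Sorting numerically fixes the order; the
# executor_key helper (hypothetical, not the actual Web UI code) also
# handles non-numeric IDs such as "driver" by sorting them after the numbers.
ids = ["1", "10", "2", "21", "3"]

lexical = sorted(ids)           # lexicographic order
numeric = sorted(ids, key=int)  # numeric order

def executor_key(s):
    # Numeric IDs first (by value), then non-numeric IDs alphabetically.
    return (0, int(s)) if s.isdigit() else (1, s)

mixed = sorted(["driver", "1", "10", "2"], key=executor_key)
```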
[jira] [Updated] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Satya Narayan updated SPARK-11180: -- Summary: DataFrame.na.fill does not support Boolean Type: (was: DataFrameNaFunctions fills does not support Boolean Type:) > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > val empDf = > sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)]((1,null,null),(2,"SVP",true),(3,"Dir",false))).toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > +-+---+-+ > |EmpId|Designation|isOfficer| > +-+---+-+ > |1| null| null| > |2|SVP| true| > |3|Dir|false| > +-+---+-+ > We want to set "isOfficer" false whenever there is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Satya Narayan updated SPARK-11180: -- Description: Currently DataFrame.na.fill does not support Boolean primitive type. We have use cases where while data massaging/preparation we want to fill boolean columns with false/true value. Ex: val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] ((1,null,null),(2,"SVP",true),(3,"Dir",false))) .toDF("EmpId","Designation","isOfficer") empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean] scala> empDf.show |EmpId|Designation|isOfficer| |1| null| null| |2|SVP| true| |3|Dir|false| We want to set "isOfficer" false whenever there is null. scala> empDf.na.fill(Map("isOfficer"->false)) throws exception java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ... Can you add support for Boolean into na.fill function. was: Currently DataFrame.na.fill does not support Boolean primitive type. We have use cases where while data massaging/preparation we want to fill boolean columns with false/true value. Ex: val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)]((1,null,null),(2,"SVP",true),(3,"Dir",false))).toDF("EmpId","Designation","isOfficer") empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean] scala> empDf.show +-+---+-+ |EmpId|Designation|isOfficer| +-+---+-+ |1| null| null| |2|SVP| true| |3|Dir|false| +-+---+-+ We want to set "isOfficer" false whenever there is null. scala> empDf.na.fill(Map("isOfficer"->false)) throws exception java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ... 
Can you add support for Boolean into na.fill function. > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > We want to set "isOfficer" false whenever there is null. > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
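The requested fill semantics are simple to state. Here is a sketch over plain Python rows (this is not the Spark API; the column names are taken from the ticket's example):

```python
# Sketch of the requested na.fill(Map("isOfficer" -> false)) semantics over
# plain Python rows (not the Spark API; column names come from the example).
rows = [
    {"EmpId": 1, "Designation": None,  "isOfficer": None},
    {"EmpId": 2, "Designation": "SVP", "isOfficer": True},
    {"EmpId": 3, "Designation": "Dir", "isOfficer": False},
]

def na_fill(rows, replacements):
    """Replace None with the per-column default; leave other columns alone."""
    return [
        {col: (replacements[col] if val is None and col in replacements else val)
         for col, val in row.items()}
        for row in rows
    ]

filled = na_fill(rows, {"isOfficer": False})
```

Note that columns without a replacement (here, "Designation") keep their nulls, matching how fill with a column map is expected to behave.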
[jira] [Commented] (SPARK-6541) Executor table on Stage page should sort by Executor ID numerically, not lexically
[ https://issues.apache.org/jira/browse/SPARK-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962895#comment-14962895 ] Apache Spark commented on SPARK-6541: - User 'jbonofre' has created a pull request for this issue: https://github.com/apache/spark/pull/9165 > Executor table on Stage page should sort by Executor ID numerically, not > lexically > -- > > Key: SPARK-6541 > URL: https://issues.apache.org/jira/browse/SPARK-6541 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Ryan Williams >Priority: Minor > > Page loads with a table like this: > !http://f.cl.ly/items/0M273s053F2T2K1o441L/Screen%20Shot%202015-03-25%20at%207.07.08%20PM.png! > After clicking "Executor ID" to sort by that column, it sorts numerically: > !http://f.cl.ly/items/01161p3s2H070h1K1a0c/Screen%20Shot%202015-03-25%20at%207.08.26%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6541) Executor table on Stage page should sort by Executor ID numerically, not lexically
[ https://issues.apache.org/jira/browse/SPARK-6541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6541: --- Assignee: Apache Spark > Executor table on Stage page should sort by Executor ID numerically, not > lexically > -- > > Key: SPARK-6541 > URL: https://issues.apache.org/jira/browse/SPARK-6541 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Ryan Williams >Assignee: Apache Spark >Priority: Minor > > Page loads with a table like this: > !http://f.cl.ly/items/0M273s053F2T2K1o441L/Screen%20Shot%202015-03-25%20at%207.07.08%20PM.png! > After clicking "Executor ID" to sort by that column, it sorts numerically: > !http://f.cl.ly/items/01161p3s2H070h1K1a0c/Screen%20Shot%202015-03-25%20at%207.08.26%20PM.png! -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11167) Incorrect type resolution on heterogeneous data structures
[ https://issues.apache.org/jira/browse/SPARK-11167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14962900#comment-14962900 ] Sun Rui commented on SPARK-11167: - For a DataFrame, each column is a collection of values of same type. No heterogeneous values are expected for a specific column. We can enhance the robustness of inferring type by adding check for such case and report error. > Incorrect type resolution on heterogeneous data structures > -- > > Key: SPARK-11167 > URL: https://issues.apache.org/jira/browse/SPARK-11167 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.6.0 >Reporter: Maciej Szymkiewicz > > If structure contains heterogeneous incorrectly assigns type of the > encountered element as type of a whole structure. This problem affects both > lists: > {code} > SparkR:::infer_type(list(a=1, b="a") > ## [1] "array" > SparkR:::infer_type(list(a="a", b=1)) > ## [1] "array" > {code} > and environments: > {code} > SparkR:::infer_type(as.environment(list(a=1, b="a"))) > ## [1] "map" > SparkR:::infer_type(as.environment(list(a="a", b=1))) > ## [1] "map " > {code} > This results in errors during data collection and other operations on > DataFrames: > {code} > ldf <- data.frame(row.names=1:2) > ldf$foo <- list(list("1", 2), list(3, 4)) > sdf <- createDataFrame(sqlContext, ldf) > collect(sdf) > ## 15/10/17 17:58:57 ERROR Executor: Exception in task 0.0 in stage 9.0 (TID > 9) > ## scala.MatchError: 2.0 (of class java.lang.Double) > ## ... > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
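Sun Rui's suggestion (check for heterogeneity instead of trusting the first element) can be sketched as follows. This is illustrative Python, not SparkR's infer_type:

```python
# Illustrative Python (not SparkR's infer_type): inferring a column type from
# only the first element silently mis-types heterogeneous data; checking all
# elements lets us report an error instead, as suggested in the comment.

def infer_type_first(values):
    # Current behavior in miniature: trust the first element.
    return type(values[0]).__name__

def infer_type_checked(values):
    # Suggested hardening: verify the column is homogeneous before inferring.
    types = {type(v).__name__ for v in values}
    if len(types) > 1:
        raise TypeError("heterogeneous column: " + ", ".join(sorted(types)))
    return types.pop()
```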
[jira] [Assigned] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11179: Assignee: (was: Apache Spark) > Push filters through aggregate if filters are subset of 'group by' expressions > -- > > Key: SPARK-11179 > URL: https://issues.apache.org/jira/browse/SPARK-11179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nitin Goyal >Priority: Minor > Fix For: 1.6.0 > > > Push filters through aggregate if filters are subset of 'group by' > expressions. This optimisation can be added in Spark SQL's Optimizer class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
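The equivalence behind this optimisation: when a filter predicate refers only to grouping expressions, a group either passes the filter entirely or not at all, so the filter can run before the aggregate. A plain-Python sketch (not Catalyst) demonstrating the equivalence:

```python
# Plain-Python sketch (not Catalyst) of the equivalence that justifies the
# optimisation: a predicate over a grouping key selects whole groups, so
# filtering before the aggregate gives the same result as filtering after.
from collections import defaultdict

rows = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5)]

def group_sum(rows):
    """GROUP BY key, SUM(value)."""
    acc = defaultdict(int)
    for key, value in rows:
        acc[key] += value
    return dict(acc)

# Filter applied after the aggregate (the unoptimised plan)...
filtered_after = {k: v for k, v in group_sum(rows).items() if k != "b"}
# ...equals the aggregate over pre-filtered rows (filter pushed down).
filtered_before = group_sum([(k, v) for k, v in rows if k != "b"])
```

Pushing the filter down is profitable because fewer rows reach the (typically more expensive) aggregation step.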
[jira] [Resolved] (SPARK-11128) strange NPE when writing in non-existing S3 bucket
[ https://issues.apache.org/jira/browse/SPARK-11128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11128. --- Resolution: Not A Problem Not a problem with Spark, that is. > strange NPE when writing in non-existing S3 bucket > -- > > Key: SPARK-11128 > URL: https://issues.apache.org/jira/browse/SPARK-11128 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.5.1 >Reporter: mathieu despriee >Priority: Minor > > For the record, as it's relatively minor, and related to s3n (not tested with > s3a). > By mistake, we tried writing a parquet dataframe to a non-existing s3 bucket, > with a simple df.write.parquet(s3path). > We got a NPE (see stack trace below), which is very misleading. > java.lang.NullPointerException > at > org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1398) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:73) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933) > at > org.apache.spark.sql.SQLContext$QueryExecution.toRdd(SQLContext.scala:933) > at > org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:197) > at > 
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:146) > at > org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:137) > at > org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:304) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10352) Replace SQLTestData internal usages of String with UTF8String
[ https://issues.apache.org/jira/browse/SPARK-10352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963194#comment-14963194 ] Harsh Rathi commented on SPARK-10352: - Why is this not a problem? I am writing a custom explode function. If I try to use CatalystTypeConverters for type conversions, it gives an error in StructConverter since InternalRow is not added as a case there. If I don't use CatalystTypeConverters, it gives a casting error saying java.lang.String cannot be cast to UTF8String. > Replace SQLTestData internal usages of String with UTF8String > - > > Key: SPARK-10352 > URL: https://issues.apache.org/jira/browse/SPARK-10352 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Feynman Liang > > Running the code: > {code} > val inputString = "abc" > val row = InternalRow.apply(inputString) > val unsafeRow = > UnsafeProjection.create(Array[DataType](StringType)).apply(row) > {code} > generates the error: > {code} > [info] java.lang.ClassCastException: java.lang.String cannot be cast to > org.apache.spark.unsafe.types.UTF8String > [info] at > org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:46) > ***snip*** > {code} > Although {{StringType}} should in theory only have internal type > {{UTF8String}}, we [are inconsistent with this > constraint|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L131] > and being more strict would [break existing > code|https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestData.scala#L41] > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11184) Declare most of .mllib code not-Experimental
Sean Owen created SPARK-11184: - Summary: Declare most of .mllib code not-Experimental Key: SPARK-11184 URL: https://issues.apache.org/jira/browse/SPARK-11184 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.5.1 Reporter: Sean Owen Priority: Minor Comments please [~mengxr] and [~josephkb]: my proposal is to remove most {{@Experimental}} annotations from the {{.mllib}} code, on the theory that it's not intended to change much more. I can easily take a shot at this, but wanted to collect thoughts before I started. Does the theory sound reasonable? Part of it is a desire to keep this annotation meaningful, and also encourage people to at least view MLlib as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
Liangliang Gu created SPARK-11182: - Summary: HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode Key: SPARK-11182 URL: https://issues.apache.org/jira/browse/SPARK-11182 Project: Spark Issue Type: Bug Components: YARN Reporter: Liangliang Gu In HA mode, DFSClient will generate an HDFS Delegation Token for each NameNode automatically; these tokens will not be updated when Spark updates credentials for the current user. Spark should update these tokens in order to avoid token-expired errors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
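A toy model of the report (the aliases and values below are invented; this is not the Hadoop Credentials API): a refresh that merges tokens by alias never touches the per-NameNode entries the HA DFSClient created automatically, so those keep their old, soon-to-expire values.

```python
# Toy model of the stale-token problem (aliases and values invented; this is
# not the Hadoop Credentials API). A refresh merges tokens by alias, so the
# per-NameNode entries created automatically by the HA DFSClient are never
# overwritten and keep their old, soon-to-expire values.
current = {
    "ha-hdfs:cluster": "token-old",       # logical HA alias, refreshed below
    "nn1.example.com:8020": "token-old",  # per-NameNode aliases, never refreshed
    "nn2.example.com:8020": "token-old",
}
fresh = {"ha-hdfs:cluster": "token-new"}

current.update(fresh)  # merge-by-alias, leaving the per-NameNode entries stale
```

The ticket asks Spark to refresh the per-NameNode tokens as well, so the stale entries above would also be replaced.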
[jira] [Updated] (SPARK-11181) Spark Yarn : Spark reducing total executors count even when Dynamic Allocation is disabled.
[ https://issues.apache.org/jira/browse/SPARK-11181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11181: -- Flags: (was: Patch,Important) Target Version/s: (was: 1.3.2) Fix Version/s: (was: 1.3.2) [~prakhar088] Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a JIRA, as there are a number of issues here: it can't have a Fix/Target verison; the flags aren't valid. Please try reproducing vs master as 1.3.1 is relatively old, and many things have been fixed since. I suspect this is a duplicate. > Spark Yarn : Spark reducing total executors count even when Dynamic > Allocation is disabled. > --- > > Key: SPARK-11181 > URL: https://issues.apache.org/jira/browse/SPARK-11181 > Project: Spark > Issue Type: Bug > Components: Scheduler, Spark Core, YARN >Affects Versions: 1.3.1 > Environment: Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. > All servers in cluster running Linux version 2.6.32. > Job in yarn-client mode. >Reporter: prakhar jauhari > > Spark driver reduces total executors count even when Dynamic Allocation is > not enabled. > To reproduce this: > 1. A 2 node yarn setup : each DN has ~ 20GB mem and 4 cores. > 2. When the application launches and gets it required executors, One of the > DN's losses connectivity and is timed out. > 3. Spark issues a killExecutor for the executor on the DN which was timed > out. > 4. Even with dynamic allocation off, spark's scheduler reduces the > "targetNumExecutors". > 5. Thus the job runs with reduced executor count. > Note : The severity of the issue increases : If some of the DN that were > running my job's executors lose connectivity intermittently, spark scheduler > reduces "targetNumExecutors", thus not asking for new executors on any other > nodes, causing the job to hang. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData
[ https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-10921: -- Assignee: Jacek Laskowski > Completely remove the use of SparkContext.preferredNodeLocationData > --- > > Key: SPARK-10921 > URL: https://issues.apache.org/jira/browse/SPARK-10921 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.5.1 >Reporter: Jacek Laskowski >Assignee: Jacek Laskowski >Priority: Minor > Fix For: 1.6.0 > > > SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} > yet the code makes it less obvious as it says (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96): > {code} > // This is used only by YARN for now, but should be relevant to other > cluster types (Mesos, > // etc) too. This is typically generated from > InputFormatInfo.computePreferredLocations. It > // contains a map from hostname to a list of input format splits on the > host. > private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = > Map() > {code} > It turns out that there are places where the initialization does take place > that only adds up to the confusion. > When you search for the use of {{SparkContext.preferredNodeLocationData}}, > you'll find 3 places - one constructor marked {{@deprecated}}, the other with > {{logWarning}} telling us that _"Passing in preferred locations has no > effect at all, see SPARK-8949"_, and in > {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method. > There is no consistent approach to deal with it given it's no longer used in > theory. 
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265] > method > caught my eye and I found that it does the following in > client.register: > {code} > if (sc != null) sc.preferredNodeLocationData else Map() > {code} > However, {{client.register}} [ignores the input parameter > completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78], > but the scaladoc says (note {{preferredNodeLocations}} param): > {code} > /** >* Registers the application master with the RM. >* >* @param conf The Yarn configuration. >* @param sparkConf The Spark configuration. >* @param preferredNodeLocations Map with hints about where to allocate > containers. >* @param uiAddress Address of the SparkUI. >* @param uiHistoryAddress Address of the application on the History Server. >*/ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10633) Persisting Spark stream to MySQL - Spark tries to create the table for every stream even if it exist already.
[ https://issues.apache.org/jira/browse/SPARK-10633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10633. --- Resolution: Not A Problem > Persisting Spark stream to MySQL - Spark tries to create the table for every > stream even if it exist already. > - > > Key: SPARK-10633 > URL: https://issues.apache.org/jira/browse/SPARK-10633 > Project: Spark > Issue Type: Bug > Components: SQL, Streaming >Affects Versions: 1.4.0, 1.5.0 > Environment: Ubuntu 14.04 > IntelliJ IDEA 14.1.4 > sbt > mysql-connector-java 5.1.35 (Tested and working with Spark 1.3.1) >Reporter: Lunen > > Persisting Spark Kafka stream to MySQL > Spark 1.4 + tries to create a table automatically every time the stream gets > sent to a specified table. > Please note, Spark 1.3.1 works. > Code sample: > val url = "jdbc:mysql://host:port/db?user=user=password > val crp = RowSetProvider.newFactory() > val crsSql: CachedRowSet = crp.createCachedRowSet() > val crsTrg: CachedRowSet = crp.createCachedRowSet() > crsSql.beforeFirst() > crsTrg.beforeFirst() > //Read Stream from Kafka > //Produce SQL INSERT STRING > > streamT.foreachRDD { rdd => > if (rdd.toLocalIterator.nonEmpty) { > sqlContext.read.json(rdd).registerTempTable(serverEvents + "_events") > while (crsSql.next) { > sqlContext.sql("SQL INSERT STRING").write.jdbc(url, "SCHEMA_NAME", > new Properties) > println("Persisted Data: " + 'SQL INSERT STRING') > } > crsSql.beforeFirst() > } > stmt.close() > conn.close() > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
[ https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11182: Assignee: (was: Apache Spark) > HDFS Delegation Token will be expired when calling > "UserGroupInformation.getCurrentUser.addCredentials" in HA mode > -- > > Key: SPARK-11182 > URL: https://issues.apache.org/jira/browse/SPARK-11182 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Liangliang Gu > > In HA mode, DFSClient will generate HDFS Delegation Token for each Name Node > automatically, which will not be updated when Spark update Credentials for > the current user. > Spark should update these tokens in order to avoid Token Expired Error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
[ https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963092#comment-14963092 ] Apache Spark commented on SPARK-11182: -- User 'marsishandsome' has created a pull request for this issue: https://github.com/apache/spark/pull/9168 > HDFS Delegation Token will be expired when calling > "UserGroupInformation.getCurrentUser.addCredentials" in HA mode > -- > > Key: SPARK-11182 > URL: https://issues.apache.org/jira/browse/SPARK-11182 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Liangliang Gu > > In HA mode, DFSClient will generate HDFS Delegation Token for each Name Node > automatically, which will not be updated when Spark update Credentials for > the current user. > Spark should update these tokens in order to avoid Token Expired Error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
[ https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963091#comment-14963091 ] Liangliang Gu commented on SPARK-11182: --- https://github.com/apache/spark/pull/9168 > HDFS Delegation Token will be expired when calling > "UserGroupInformation.getCurrentUser.addCredentials" in HA mode > -- > > Key: SPARK-11182 > URL: https://issues.apache.org/jira/browse/SPARK-11182 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Liangliang Gu > > In HA mode, DFSClient will generate HDFS Delegation Token for each Name Node > automatically, which will not be updated when Spark update Credentials for > the current user. > Spark should update these tokens in order to avoid Token Expired Error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11182) HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
[ https://issues.apache.org/jira/browse/SPARK-11182?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11182: Assignee: Apache Spark > HDFS Delegation Token will be expired when calling > "UserGroupInformation.getCurrentUser.addCredentials" in HA mode > -- > > Key: SPARK-11182 > URL: https://issues.apache.org/jira/browse/SPARK-11182 > Project: Spark > Issue Type: Bug > Components: YARN >Reporter: Liangliang Gu >Assignee: Apache Spark > > In HA mode, DFSClient will generate HDFS Delegation Token for each Name Node > automatically, which will not be updated when Spark update Credentials for > the current user. > Spark should update these tokens in order to avoid Token Expired Error. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11183) enable support for mesos 0.24+
Ioannis Polyzos created SPARK-11183: --- Summary: enable support for mesos 0.24+ Key: SPARK-11183 URL: https://issues.apache.org/jira/browse/SPARK-11183 Project: Spark Issue Type: Bug Components: Deploy, Mesos Reporter: Ioannis Polyzos In Mesos 0.24, the Mesos leader info in ZK changed to JSON; this results in Spark failing to run on 0.24+. References: https://issues.apache.org/jira/browse/MESOS-2340 http://mail-archives.apache.org/mod_mbox/mesos-commits/201506.mbox/%3ced4698dc56444bcdac3bdf19134db...@git.apache.org%3E https://github.com/mesos/elasticsearch/issues/338 https://github.com/spark-jobserver/spark-jobserver/issues/267 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5250) EOFException when reading gzipped files from S3 with wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963110#comment-14963110 ] Mojmir Vinkler commented on SPARK-5250: --- Yes, it's caused by reading a corrupt file (we only experienced this for compressed (gzipped) files). I think the file got corrupted when it was saved to S3, but we used boto for that, not Spark. What's weird is that I'm able to read the file with pandas without any problems. > EOFException in when reading gzipped files from S3 with wholeTextFiles > -- > > Key: SPARK-5250 > URL: https://issues.apache.org/jira/browse/SPARK-5250 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Mojmir Vinkler >Priority: Critical > > I get an `EOFException` error when reading *some* gzipped files using > `sc.wholeTextFiles`. It happens to just a few files, I thought that the file > is corrupted, but I was able to read it without problems using `sc.textFile` > (and pandas). 
> Traceback for command > `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()` > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect() > /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self) > 674 """ > 675 with SCCallSiteSync(self.context) as css: > --> 676 bytesInJava = self._jrdd.collect().iterator() > 677 return list(self._collect_iterator_through_file(bytesInJava)) > 678 > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o1576.collect. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: > Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77) > at java.io.InputStream.read(InputStream.java:101) > at com.google.common.io.ByteStreams.copy(ByteStreams.java:207) > at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252) > at > org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at 
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) >
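The stack trace above bottoms out in {{java.io.EOFException: Unexpected end of input stream}}, which is the characteristic failure for a gzip stream that ends before its end-of-stream marker. As an illustrative sketch (plain Python standard library, no Spark or S3 involved), truncating an otherwise valid gzip payload reproduces the same failure mode:

```python
import gzip
import io

# Build a small but valid gzip payload in memory.
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode="wb") as f:
    f.write(b"a" * 10000)
data = buf.getvalue()

# Simulate a corrupt upload by cutting the stream off mid-way.
truncated = data[: len(data) // 2]

try:
    gzip.decompress(truncated)
    hit_eof = False
except EOFError:
    # "Compressed file ended before the end-of-stream marker was reached"
    hit_eof = True

print(hit_eof)  # True
```

Why {{sc.textFile}} tolerated the same file is not clear from the thread; the sketch only shows that truncation alone is enough to produce this class of exception.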
[jira] [Resolved] (SPARK-10921) Completely remove the use of SparkContext.preferredNodeLocationData
[ https://issues.apache.org/jira/browse/SPARK-10921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-10921. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8976 [https://github.com/apache/spark/pull/8976] > Completely remove the use of SparkContext.preferredNodeLocationData > --- > > Key: SPARK-10921 > URL: https://issues.apache.org/jira/browse/SPARK-10921 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 1.5.1 >Reporter: Jacek Laskowski >Priority: Minor > Fix For: 1.6.0 > > > SPARK-8949 obsoleted the use of {{SparkContext.preferredNodeLocationData}} > yet the code makes it less obvious as it says (see > https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L93-L96): > {code} > // This is used only by YARN for now, but should be relevant to other > cluster types (Mesos, > // etc) too. This is typically generated from > InputFormatInfo.computePreferredLocations. It > // contains a map from hostname to a list of input format splits on the > host. > private[spark] var preferredNodeLocationData: Map[String, Set[SplitInfo]] = > Map() > {code} > It turns out that there are places where the initialization does take place > that only adds up to the confusion. > When you search for the use of {{SparkContext.preferredNodeLocationData}}, > you'll find 3 places - one constructor marked {{@deprecated}}, the other with > {{logWarning}} telling us that _"Passing in preferred locations has no > effect at all, see SPARK-8949"_, and in > {{org.apache.spark.deploy.yarn.ApplicationMaster.registerAM}} method. > There is no consistent approach to deal with it given it's no longer used in > theory. 
> [org.apache.spark.deploy.yarn.ApplicationMaster.registerAM|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L234-L265] > method > caught my eye and I found that it does the following in > client.register: > {code} > if (sc != null) sc.preferredNodeLocationData else Map() > {code} > However, {{client.register}} [ignores the input parameter > completely|https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnRMClient.scala#L47-L78], > but the scaladoc says (note {{preferredNodeLocations}} param): > {code} > /** >* Registers the application master with the RM. >* >* @param conf The Yarn configuration. >* @param sparkConf The Spark configuration. >* @param preferredNodeLocations Map with hints about where to allocate > containers. >* @param uiAddress Address of the SparkUI. >* @param uiHistoryAddress Address of the application on the History Server. >*/ > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10861) Univariate Statistics: Adding range support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963029#comment-14963029 ] Jeff Zhang commented on SPARK-10861: [~JihongMA] what's your progress on this ? > Univariate Statistics: Adding range support as UDAF > --- > > Key: SPARK-10861 > URL: https://issues.apache.org/jira/browse/SPARK-10861 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Range support for continuous -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6645) StructField/StructType and related classes are not in the Scaladoc
[ https://issues.apache.org/jira/browse/SPARK-6645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963345#comment-14963345 ] Rishabh Bhardwaj commented on SPARK-6645: - I can see StructField/StructType classes in ScalaDoc in org.apache.sql.types package https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructField https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.types.StructType Can you please elaborate? Correct me If I have misunderstood something here. > StructField/StructType and related classes are not in the Scaladoc > -- > > Key: SPARK-6645 > URL: https://issues.apache.org/jira/browse/SPARK-6645 > Project: Spark > Issue Type: Documentation > Components: Documentation >Affects Versions: 1.3.0 >Reporter: Aaron Defazio >Priority: Minor > > The current programming guide uses StructField in the Scala examples, yet it > doesn't appear to exist in the Scaladoc. This is related to SPARK-6592, in > that several classes that a user might use do not appear in the Scaladoc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11185) Add more task metrics to the "all Stages Page"
Thomas Graves created SPARK-11185: - Summary: Add more task metrics to the "all Stages Page" Key: SPARK-11185 URL: https://issues.apache.org/jira/browse/SPARK-11185 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.5.1 Reporter: Thomas Graves The "All Stages Page" on the History page could have more information about the stage to allow users to quickly see which stage potentially has long tasks, indicators of skewed data or bad nodes, etc. Currently, to get this information you have to click on every stage. If you have hundreds of stages this can be very cumbersome. For instance, pulling out the max task time and the median to the all stages page would allow me to see the difference, and if the max task time is much greater than the median this stage may have had tasks with problems. We already had some discussion about this under https://github.com/apache/spark/pull/9051 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
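The max-vs-median comparison suggested in the ticket can be sketched outside Spark. This is a hypothetical indicator, not Spark's API; the function name and the 2x threshold are illustrative assumptions:

```python
import statistics

def looks_skewed(task_times_ms, ratio=2.0):
    # Flag a stage whose slowest task took more than `ratio` times the
    # median task time -- a cheap hint of skewed data or a bad node.
    return max(task_times_ms) > ratio * statistics.median(task_times_ms)

print(looks_skewed([100, 110, 120, 900]))  # True: one straggler task
print(looks_skewed([100, 110, 120, 130]))  # False: evenly sized tasks
```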
[jira] [Updated] (SPARK-11186) Caseness inconsistency between SQLContext and HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Santiago M. Mola updated SPARK-11186: - Description: Default catalog behaviour for caseness is different in {{SQLContext}} and {{HiveContext}}. {code} test("Catalog caseness (SQL)") { val sqlc = new SQLContext(sc) val relationName = "MyTable" sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { override def sqlContext: SQLContext = sqlc override def schema: StructType = StructType(Nil) })) val tables = sqlc.tableNames() assert(tables.contains(relationName)) } test("Catalog caseness (Hive)") { val sqlc = new HiveContext(sc) val relationName = "MyTable" sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { override def sqlContext: SQLContext = sqlc override def schema: StructType = StructType(Nil) })) val tables = sqlc.tableNames() assert(tables.contains(relationName)) } {code} Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. But the reason that this is needed seems undocumented (both in the manual or in the source code comments). was: Default catalog behaviour for caseness is different in {{SQLContext}} and {{HiveContext}}. 
{code} test("Catalog caseness (SQL)") { val sqlc = new SQLContext(sc) val relationName = "MyTable" sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { override def sqlContext: SQLContext = sqlc override def schema: StructType = StructType(Nil) })) val tables = sqlc.tableNames() assert(tables.contains(relationName)) } test("Catalog caseness (Hive)") { val sqlc = new HiveContext(sc) val relationName = "MyTable" sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { override def sqlContext: SQLContext = sqlc override def schema: StructType = StructType(Nil) })) val tables = sqlc.tableNames() assert(tables.contains(relationName)) } {/code} Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. But the reason that this is needed seems undocumented (both in the manual or in the source code comments). > Caseness inconsistency between SQLContext and HiveContext > - > > Key: SPARK-11186 > URL: https://issues.apache.org/jira/browse/SPARK-11186 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Santiago M. Mola >Priority: Minor > > Default catalog behaviour for caseness is different in {{SQLContext}} and > {{HiveContext}}. 
> {code} > test("Catalog caseness (SQL)") { > val sqlc = new SQLContext(sc) > val relationName = "MyTable" > sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new > BaseRelation { > override def sqlContext: SQLContext = sqlc > override def schema: StructType = StructType(Nil) > })) > val tables = sqlc.tableNames() > assert(tables.contains(relationName)) > } > test("Catalog caseness (Hive)") { > val sqlc = new HiveContext(sc) > val relationName = "MyTable" > sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new > BaseRelation { > override def sqlContext: SQLContext = sqlc > override def schema: StructType = StructType(Nil) > })) > val tables = sqlc.tableNames() > assert(tables.contains(relationName)) > } > {code} > Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. > But the reason that this is needed seems undocumented (both in the manual or > in the source code comments). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
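For readers unfamiliar with the behaviour being contrasted: Hive-style catalogs lowercase identifiers on registration, so the exact-case {{contains(relationName)}} assertion above can pass against one context and fail against the other. A toy sketch of that effect (illustrative Python, not Spark's catalog API):

```python
class CaseInsensitiveCatalog:
    """Toy model of a Hive-style catalog that lowercases identifiers."""

    def __init__(self):
        self._tables = {}

    def register_table(self, name):
        # Normalize on registration, as Hive-style catalogs do.
        self._tables[name.lower()] = name

    def table_names(self):
        return list(self._tables)

cat = CaseInsensitiveCatalog()
cat.register_table("MyTable")

# The exact-case lookup that passes against a case-sensitive catalog fails here:
print("MyTable" in cat.table_names())  # False
print("mytable" in cat.table_names())  # True
```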
[jira] [Comment Edited] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963509#comment-14963509 ] Ryan Williams edited comment on SPARK-11162 at 10/19/15 4:01 PM: - In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-submit}}. I suppose this issue is confusingly named since technically all of this can be accomplished "from the command line", so I'll rename it to reflect that I'd like a config flag to {{spark-submit}} to enable different logging levels. Also, even modifying {{log4j.properties}} in various places and passing it to the {{--files}} flag, I am unable to get DEBUG logging on the client in {{yarn-client}} mode, i.e. {{--files log4j.properties}} makes all of my YARN containers have debug logging, but I still only get INFO logging in e.g. my {{spark-shell}} session that is running locally. was (Author: rdub): In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-submit}}. 
> Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963509#comment-14963509 ] Ryan Williams commented on SPARK-11162: --- In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-{submit,shell}}}. > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963509#comment-14963509 ] Ryan Williams edited comment on SPARK-11162 at 10/19/15 3:59 PM: - In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-submit}}. was (Author: rdub): In the second message (of 2 total, afaict) on the thread, "eric wong" lays out two steps for enabling DEBUG logging; the first step involves changing a local copy of {{log4j.properties}}, whereas the second involves passing certain parameters to {{spark-submit}}. I was hoping for a way to not have to modify a local {{log4j.properties}} file, but to get debug logging by only passing parameters to {{spark-{submit,shell}}}. > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11162) Allow enabling debug logging from the command line
[ https://issues.apache.org/jira/browse/SPARK-11162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963511#comment-14963511 ] Sean Owen commented on SPARK-11162: --- Related: https://issues.apache.org/jira/browse/SPARK-11105 In general configuring log4j does mean configuring a log4j.properties. You should be able to achieve something similar with -D flags but I find it ugly. > Allow enabling debug logging from the command line > -- > > Key: SPARK-11162 > URL: https://issues.apache.org/jira/browse/SPARK-11162 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > Per [~vanzin] on [the user > list|http://apache-spark-user-list.1001560.n3.nabble.com/how-to-set-log-level-of-spark-executor-on-YARN-using-yarn-cluster-mode-tp16528p16529.html], > it would be nice if debug-logging could be enabled from the command line. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
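To make the {{log4j.properties}} option in this thread concrete: below is a minimal properties file that turns on DEBUG for the root logger, patterned on Spark's default console template. How it gets wired up varies by Spark version and deploy mode, so treat the flags in the note afterwards as the commonly documented approach rather than a guaranteed recipe.

```
# log4j-debug.properties -- root logger at DEBUG, printed to stderr
log4j.rootCategory=DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

It can be shipped to executors with {{--files log4j-debug.properties}} and pointed at on the driver with something like {{--driver-java-options "-Dlog4j.configuration=file:log4j-debug.properties"}}, which is consistent with the observation above that {{--files}} alone only affects the YARN containers, not the local {{spark-shell}}.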
[jira] [Commented] (SPARK-11161) Viewing the web UI for the first time unpersists a cached RDD
[ https://issues.apache.org/jira/browse/SPARK-11161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963522#comment-14963522 ] Sean Owen commented on SPARK-11161: --- Why would it be useful to continue to cache an RDD that can't be used any more? there is no more reference to it in the controlling driver program in this case, and it's the only thing that can use it. There isn't an RDD registry, but if there were, then it would prevent this situation from occurring, which seems like what you'd expect at least. I expect RDDs to behave like JVM objects in this regard. I would not expect something to be hanging on to references to all my objects since I have the references I need, and indeed, doing so prevents the GC that I want. You can't unpersist RDDs from the web UI, though that would make sense as a feature. That's something different. > Viewing the web UI for the first time unpersists a cached RDD > - > > Key: SPARK-11161 > URL: https://issues.apache.org/jira/browse/SPARK-11161 > Project: Spark > Issue Type: Bug > Components: Spark Core, Web UI >Affects Versions: 1.5.1 >Reporter: Ryan Williams >Priority: Minor > > This one is a real head-scratcher. [Here's a > screencast|http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif]: > !http://f.cl.ly/items/0P0N413t1V3j2B0A3V1a/Screen%20Recording%202015-10-16%20at%2005.43%20PM.gif! > The three windows, left-to-right, are: > * a {{spark-shell}} on YARN with dynamic allocation enabled, at rest with one > executor. [Here's an example app's > environment|https://gist.github.com/ryan-williams/6dd3502d5d0de2f030ac]. > * [Spree|https://github.com/hammerlab/spree], opened to the above app's > "Storage" tab. > * my YARN resource manager, showing a link to the web UI running on the > driver. > At the start, nothing has been run in the shell, and I've not visited the web > UI. 
> I run a simple job in the shell and cache a small RDD that it computes: > {code} > sc.parallelize(1 to 1, 100).map(_ % 100 -> 1).reduceByKey(_+_, > 100).setName("foo").cache.count > {code} > As the second stage runs, you can see the partitions show up as cached in > Spree. > After the job finishes, a few requested executors continue to fill in, which > you can see in the console at left or the nav bar of Spree in the middle. > Once that has finished, everything is at rest with the RDD "foo" 100% cached. > Then, I click the YARN RM's "ApplicationMaster" link which loads the web UI > on the driver for the first time. > Immediately, the console prints some activity, including that RDD 2 has been > removed: > {code} > 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 > on 172.29.46.15:33156 in memory (size: 1517.0 B, free: 7.2 GB) > 15/10/16 21:43:12 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 > on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1517.0 B, > free: 12.2 GB) > 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 2 > 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 > on 172.29.46.15:33156 in memory (size: 1666.0 B, free: 7.2 GB) > 15/10/16 21:43:13 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 > on demeter-csmaz10-17.demeter.hpc.mssm.edu:56997 in memory (size: 1666.0 B, > free: 12.2 GB) > 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned accumulator 1 > 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned shuffle 0 > 15/10/16 21:43:13 INFO storage.BlockManager: Removing RDD 2 > 15/10/16 21:43:13 INFO spark.ContextCleaner: Cleaned RDD 2 > {code} > Accordingly, Spree shows that the RDD has been unpersisted, and I can see in > the event log (not pictured in the screencast) that an Unpersist event has > made its way through the various SparkListeners: > {code} > {"Event":"SparkListenerUnpersistRDD","RDD ID":2} > {code} > Simply loading the web UI 
causes an RDD unpersist event to fire! > I can't nail down exactly what's causing this, and I've seen evidence that > there are other sequences of events that can also cause it: > * I've repro'd the above steps ~20 times. The RDD always gets unpersisted > when I've not visited the web UI until the RDD is cached, and when the app is > dynamically allocating executors. > * One time, I observed the unpersist to fire without my even visiting the web > UI at all. Other times I wait a long time before visiting the web UI, so that > it is clear that the loading of the web UI is causal, and it always is, but > apparently there's another way for the unpersist to happen, seemingly rarely, > without visiting the web UI. > * I tried a couple of times without dynamic allocation and could not > reproduce it. > * I've tried a couple of times with dynamic
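The GC analogy in the comment above can be sketched in miniature: the ContextCleaner tracks RDDs through weak references, so once the driver drops its last strong reference, the cleaner is free to unpersist the cached blocks. A stand-in using Python's {{weakref}} (illustrative only, not Spark code):

```python
import gc
import weakref

class CachedRDD:
    """Stand-in for a cached RDD; only its reachability matters here."""
    pass

rdd = CachedRDD()
tracker = weakref.ref(rdd)  # like the cleaner's weak reference

print(tracker() is not None)  # True: still strongly referenced by the "driver"

del rdd       # driver lets go of its last reference
gc.collect()  # cleaner can now observe the collection and unpersist

print(tracker() is None)  # True: nothing left to keep cached
```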
[jira] [Assigned] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen reassigned SPARK-11176: -- Assignee: Josh Rosen > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen >Assignee: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files. Most of these bugs > deal with reading files from S3, but it's not clear whether S3 is necessary > to hit these bugs. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10780) Set initialModel in KMeans in Pipelines API
[ https://issues.apache.org/jira/browse/SPARK-10780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963827#comment-14963827 ] Jayant Shekhar commented on SPARK-10780: Sounds good [~xusen] and [~josephkb] In the process of updating the PR. Thanks! > Set initialModel in KMeans in Pipelines API > --- > > Key: SPARK-10780 > URL: https://issues.apache.org/jira/browse/SPARK-10780 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley > > This is for the Scala version. After this is merged, create a JIRA for > Python version. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963825#comment-14963825 ] Josh Rosen commented on SPARK-11176: Going to close this for now, since all child tickets have been resolved as either "Won't Fix" or "Cannot Reproduce." Will re-open if new issues are discovered. > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files. Most of these bugs > deal with reading files from S3, but it's not clear whether S3 is necessary > to hit these bugs. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11176) Umbrella ticket for wholeTextFiles bugs
[ https://issues.apache.org/jira/browse/SPARK-11176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11176. Resolution: Incomplete > Umbrella ticket for wholeTextFiles bugs > --- > > Key: SPARK-11176 > URL: https://issues.apache.org/jira/browse/SPARK-11176 > Project: Spark > Issue Type: Umbrella > Components: Input/Output, Spark Core >Reporter: Josh Rosen > > This umbrella ticket gathers together several distinct bug reports related to > problems using the wholeTextFiles method to read files. Most of these bugs > deal with reading files from S3, but it's not clear whether S3 is necessary > to hit these bugs. > These issues may have a common underlying cause and should be investigated > together. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11027) Better group distinct columns in query compilation
[ https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-11027. -- Resolution: Won't Fix > Better group distinct columns in query compilation > -- > > Key: SPARK-11027 > URL: https://issues.apache.org/jira/browse/SPARK-11027 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > In AggregationQuerySuite, we have a test > {code} > checkAnswer( > sqlContext.sql( > """ > |SELECT sum(distinct value1), kEY - 100, count(distinct value1) > |FROM agg2 > |GROUP BY Key - 100 > """.stripMargin), > Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, > 3) :: Nil) > {code} > We will treat it as having two distinct columns because sum causes a cast on > value1. Maybe we can ignore the cast when we group distinct columns. So, it > will not be treated as having two distinct columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11193) Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver
Phil Kallos created SPARK-11193: --- Summary: Spark 1.5+ Kinesis Streaming - ClassCastException when starting KinesisReceiver Key: SPARK-11193 URL: https://issues.apache.org/jira/browse/SPARK-11193 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.5.1, 1.5.0 Reporter: Phil Kallos After upgrading from Spark 1.4.x -> 1.5.x, I am now unable to start a Kinesis Spark Streaming application, and am being consistently greeted with this exception: java.lang.ClassCastException: scala.collection.mutable.HashMap cannot be cast to scala.collection.mutable.SynchronizedMap at org.apache.spark.streaming.kinesis.KinesisReceiver.onStart(KinesisReceiver.scala:175) at org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:148) at org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:130) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:542) at org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$9.apply(ReceiverTracker.scala:532) at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) at org.apache.spark.SparkContext$$anonfun$38.apply(SparkContext.scala:1982) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:745) Worth noting that I am able to reproduce this issue locally, and also on Amazon EMR (using the latest emr-release 4.1.0 which packages Spark 1.5.0). Also, I am not able to run the included kinesis-asl example. 
Built locally using:
git checkout v1.5.1
mvn -Pyarn -Pkinesis-asl -Phadoop-2.6 -DskipTests clean package
Example run command:
bin/run-example streaming.KinesisWordCountASL phibit-test kinesis-connector https://kinesis.us-east-1.amazonaws.com
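The failure mode here is a downcast to a type the runtime object does not actually carry: KinesisReceiver.onStart casts the map it holds to scala.collection.mutable.SynchronizedMap, but on this classpath the object is a plain HashMap. A language-neutral sketch of the same failing pattern (plain Python; SynchronizedDict is a made-up stand-in for the Scala mixin, not anything in Spark):

```python
class SynchronizedDict(dict):
    """Hypothetical stand-in for Scala's SynchronizedMap mixin."""
    pass

def on_start(metrics_map):
    # Mirrors the failing cast in KinesisReceiver.onStart: the code
    # assumes the map carries the synchronized mixin at runtime.
    if not isinstance(metrics_map, SynchronizedDict):
        raise TypeError("plain dict cannot be treated as SynchronizedDict")
    return "receiver started"

plain = {}  # what actually arrives on the affected classpath
try:
    on_start(plain)
except TypeError as e:
    print("ClassCastException analogue:", e)
```

The fix direction is the usual one for this pattern: construct an object that is statically known to have the needed behavior (e.g. an explicitly synchronized map) instead of casting and hoping.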
[jira] [Created] (SPARK-11192) When graphite metric sink is enabled, spark sql leaks org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time
Blake Livingston created SPARK-11192: Summary: When graphite metric sink is enabled, spark sql leaks org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time Key: SPARK-11192 URL: https://issues.apache.org/jira/browse/SPARK-11192 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1 Environment: java version "1.8.0_60" Java(TM) SE Runtime Environment (build 1.8.0_60-b27) Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode) org.apache.spark/spark-sql_2.10 "1.5.1" Embedded, in-process spark. Have not tested on standalone or yarn clusters. Reporter: Blake Livingston Priority: Minor Noticed that slowly, over the course of a day or two, heap memory usage on a long running spark process increased monotonically. After doing a heap dump and examining in jvisualvm, saw there were over 15M org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking over 500MB. Accumulation does not occur when I removed metrics.properties. metrics.properties content: # Enable Graphite *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=x *.sink.graphite.port=2003 *.sink.graphite.period=10 # Enable jvm source for instance master, worker, driver and executor master.source.jvm.class=org.apache.spark.metrics.source.JvmSource worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
[ https://issues.apache.org/jira/browse/SPARK-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11194: Assignee: Yin Huai (was: Apache Spark) > Use a single URLClassLoader for jars added through SQL's "ADD JAR" command > -- > > Key: SPARK-11194 > URL: https://issues.apache.org/jira/browse/SPARK-11194 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, we stack a new URLClassLoader when a user add a jar through SQL's > add jar command. This approach can introduce issues caused by the ordering of > added jars when a class of a jar depends on another class of another jar. > For example, > {code} > ClassLoader1 for Jar1.jar (A.class) >| >|- ClassLoader2 for Jar2.jar (B.class depending on A.class) > {code} > In this case, when we lookup class B, we will not be able to find class A > because Jar2 is the parent of Jar1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
[ https://issues.apache.org/jira/browse/SPARK-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11194: Assignee: Apache Spark (was: Yin Huai) > Use a single URLClassLoader for jars added through SQL's "ADD JAR" command > -- > > Key: SPARK-11194 > URL: https://issues.apache.org/jira/browse/SPARK-11194 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Apache Spark > > Right now, we stack a new URLClassLoader when a user add a jar through SQL's > add jar command. This approach can introduce issues caused by the ordering of > added jars when a class of a jar depends on another class of another jar. > For example, > {code} > ClassLoader1 for Jar1.jar (A.class) >| >|- ClassLoader2 for Jar2.jar (B.class depending on A.class) > {code} > In this case, when we lookup class B, we will not be able to find class A > because Jar2 is the parent of Jar1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
[ https://issues.apache.org/jira/browse/SPARK-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-11194: - Description: Right now, we stack a new URLClassLoader when a user add a jar through SQL's add jar command. This approach can introduce issues caused by the ordering of added jars when a class of a jar depends on another class of another jar. For example, {code} ClassLoader1 for Jar1.jar (A.class) | |- ClassLoader2 for Jar2.jar (B.class depending on A.class) {code} In this case, when we lookup class B, we will not be able to find class A because Jar2 is the parent of Jar1. > Use a single URLClassLoader for jars added through SQL's "ADD JAR" command > -- > > Key: SPARK-11194 > URL: https://issues.apache.org/jira/browse/SPARK-11194 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, we stack a new URLClassLoader when a user add a jar through SQL's > add jar command. This approach can introduce issues caused by the ordering of > added jars when a class of a jar depends on another class of another jar. > For example, > {code} > ClassLoader1 for Jar1.jar (A.class) >| >|- ClassLoader2 for Jar2.jar (B.class depending on A.class) > {code} > In this case, when we lookup class B, we will not be able to find class A > because Jar2 is the parent of Jar1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
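The ordering hazard can be made concrete with a toy model of parent-first delegation (plain Python, not Spark's actual classloader code; the names are illustrative). A loader resolves through its parent chain and never consults its children, so when the jar defining B ends up above the jar defining A in the stack, B's defining loader cannot see A:

```python
class StackedLoader:
    """Toy URLClassLoader: parent-first delegation; a loader sees its
    own classes plus its ancestors' classes -- never its descendants'."""
    def __init__(self, name, classes, parent=None):
        self.name, self.classes, self.parent = name, set(classes), parent

    def load(self, cls):
        if self.parent is not None:
            try:
                return self.parent.load(cls)
            except KeyError:
                pass  # not in any ancestor; fall through to our own jar
        if cls in self.classes:
            return cls + " via " + self.name
        raise KeyError(cls)

# Jar2 (defines B, which needs A) was added first, so its loader sits at
# the top of the stack; Jar1 (defines A) was stacked on afterwards.
jar2 = StackedLoader("loader(Jar2)", {"B"})
jar1 = StackedLoader("loader(Jar1)", {"A"}, parent=jar2)

assert jar1.load("B") == "B via loader(Jar2)"   # looking B up succeeds...
try:
    jar2.load("A")  # ...but B's own defining loader cannot resolve A
except KeyError:
    print("NoClassDefFoundError analogue: A")
```

With the single flat loader this ticket proposes (one loader holding every added jar's URLs), resolution no longer depends on the order in which ADD JAR was issued.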
[jira] [Created] (SPARK-11190) SparkR support for cassandra collection types.
Bilind Hajer created SPARK-11190: Summary: SparkR support for cassandra collection types. Key: SPARK-11190 URL: https://issues.apache.org/jira/browse/SPARK-11190 Project: Spark Issue Type: Bug Affects Versions: 1.5.1 Environment: SparkR Version: 1.5.1 Cassandra Version: 2.1.6 R Version: 3.2.2 Cassandra Connector version: 1.5.0-M2 Reporter: Bilind Hajer Fix For: 1.5.2 I want to create a data frame from a Cassandra keyspace and column family in sparkR. I am able to create data frames from tables which do not include any Cassandra collection datatypes, such as Map, Set and List. But, many of the schemas that I need data from, do include these collection data types. Here is my local environment. SparkR Version: 1.5.1 Cassandra Version: 2.1.6 R Version: 3.2.2 Cassandra Connector version: 1.5.0-M2 To test this issue, I did the following iterative process. sudo ./sparkR --packages com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf spark.cassandra.connection.host=127.0.0.1 Running this command, with sparkR gives me access to the spark cassandra connector package I need, and connects me to my local cqlsh server ( which is up and running while running this code in sparkR shell ). CREATE TABLE test_table ( column_1 int, column_2 text, column_3 float, column_4 uuid, column_5 timestamp, column_6 boolean, column_7 timeuuid, column_8 bigint, column_9 blob, column_10 ascii, column_11 decimal, column_12 double, column_13 inet, column_14 varchar, column_15 varint, PRIMARY KEY( ( column_1, column_2 ) ) ); All of the above data types are supported. I insert dummy data after creating this test schema. For example, now in my sparkR shell, I run the following code. 
df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", keyspace = "datahub", table = "test_table") assigns with no errors, then, > schema(df.test) StructType |-name = "column_1", type = "IntegerType", nullable = TRUE |-name = "column_2", type = "StringType", nullable = TRUE |-name = "column_10", type = "StringType", nullable = TRUE |-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE |-name = "column_12", type = "DoubleType", nullable = TRUE |-name = "column_13", type = "InetAddressType", nullable = TRUE |-name = "column_14", type = "StringType", nullable = TRUE |-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE |-name = "column_3", type = "FloatType", nullable = TRUE |-name = "column_4", type = "UUIDType", nullable = TRUE |-name = "column_5", type = "TimestampType", nullable = TRUE |-name = "column_6", type = "BooleanType", nullable = TRUE |-name = "column_7", type = "UUIDType", nullable = TRUE |-name = "column_8", type = "LongType", nullable = TRUE |-name = "column_9", type = "BinaryType", nullable = TRUE Schema is correct. > class(df.test) [1] "DataFrame" attr(,"package") [1] "SparkR" df.test is clearly defined to be a DataFrame object.
> head(df.test)
  column_1 column_2 column_10 column_11 column_12 column_13 column_14 column_15
1        1    hello        NA        NA        NA        NA        NA        NA
  column_3 column_4 column_5 column_6 column_7 column_8 column_9
1      3.4       NA       NA       NA       NA       NA       NA
sparkR is reading from the column_family correctly, but now let's add a collection data type to the schema. 
Now I will drop that test_table, and recreate the table with an extra column of data type map:
CREATE TABLE test_table ( column_1 int, column_2 text, column_3 float, column_4 uuid, column_5 timestamp, column_6 boolean, column_7 timeuuid, column_8 bigint, column_9 blob, column_10 ascii, column_11 decimal, column_12 double, column_13 inet, column_14 varchar, column_15 varint, column_16 map , PRIMARY KEY( ( column_1, column_2 ) ) );
After inserting dummy data into the new test schema, > df.test <- read.df(sqlContext, source =
[jira] [Updated] (SPARK-10955) Warn if dynamic allocation is enabled for Streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10955: -- Summary: Warn if dynamic allocation is enabled for Streaming jobs (was: Disable dynamic allocation for Streaming jobs) > Warn if dynamic allocation is enabled for Streaming jobs > > > Key: SPARK-10955 > URL: https://issues.apache.org/jira/browse/SPARK-10955 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Fix For: 1.5.2, 1.6.0 > > > Spark streaming can be tricky with dynamic allocation and can lose data. We > should disable dynamic allocation or at least log that it is dangerous.
[jira] [Commented] (SPARK-11027) Better group distinct columns in query compilation
[ https://issues.apache.org/jira/browse/SPARK-11027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963990#comment-14963990 ] Yin Huai commented on SPARK-11027: -- As pointed out by [~joshrosen] (see https://github.com/apache/spark/pull/9115), it is not always safe to evaluate cast after we do distinct because cast operation can affect the result of distinct. So, I am closing this JIRA for now. > Better group distinct columns in query compilation > -- > > Key: SPARK-11027 > URL: https://issues.apache.org/jira/browse/SPARK-11027 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai > > In AggregationQuerySuite, we have a test > {code} > checkAnswer( > sqlContext.sql( > """ > |SELECT sum(distinct value1), kEY - 100, count(distinct value1) > |FROM agg2 > |GROUP BY Key - 100 > """.stripMargin), > Row(40, -99, 2) :: Row(0, -98, 2) :: Row(null, -97, 0) :: Row(30, null, > 3) :: Nil) > {code} > We will treat it as having two distinct columns because sum causes a cast on > value1. Maybe we can ignore the cast when we group distinct columns. So, it > will not be treated as having two distinct columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
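Why ignoring the cast is unsafe can be shown with two lines of arithmetic: distinct-then-cast and cast-then-distinct disagree whenever the cast is lossy (values chosen purely for illustration):

```python
values = [1.2, 1.7, 1.2]

# count(distinct value1): the raw column has two distinct values.
assert len(set(values)) == 2

# count(distinct cast(value1 as int)): the lossy cast collapses them
# into a single distinct value, so the two aggregates must be grouped
# as different distinct columns.
assert len({int(v) for v in values}) == 1
```

So treating sum(distinct cast(value1)) and count(distinct value1) as sharing one distinct column would silently change results, which is why this JIRA was closed.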
[jira] [Created] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
David Ross created SPARK-11191: -- Summary: [1.5] Can't create UDF's using hive thrift service Key: SPARK-11191 URL: https://issues.apache.org/jira/browse/SPARK-11191 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.5.1, 1.5.0 Reporter: David Ross Since upgrading to spark 1.5 we've been unable to create and use UDF's when we run in thrift server mode. Our setup: We start the thrift-server running against yarn in client mode, (we've also built our own spark from github branch-1.5 with the following args: {{-Pyarn -Phive -Phive-thrifeserver}} If i run the following after connecting via JDBC (in this case via beeline): {{add jar 'hdfs://path/to/jar"}} (this command succeeds with no errors) {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} (this command succeeds with no errors) {{select testUDF(col1) from table1;}} I get the following error in the logs: {code} org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 pos 8 at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) at org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) at scala.util.Try.getOrElse(Try.scala:77) at org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) at org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) {code} (cutting the bulk for ease of report, more than happy to send the full output) {code} 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive query: org.apache.hive.service.cli.HiveSQLException: org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 pos 100 at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} When I ran the same against 1.4 it worked. I've also changed {{spark.sql.hive.metastore.version}} to 0.13 (similar to what it was in 1.4) and 0.14, but I still get the same errors. Also, in 1.5, when you run it against the {{spark-sql}} shell, it works.
[jira] [Commented] (SPARK-11191) [1.5] Can't create UDF's using hive thrift service
[ https://issues.apache.org/jira/browse/SPARK-11191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964048#comment-14964048 ] David Ross commented on SPARK-11191: I will add that the exact same thing happens when you don't use {{TEMPORARY}} i.e.: {code} CREATE FUNCTION testUDF AS 'com.foo.class.UDF'; {code} > [1.5] Can't create UDF's using hive thrift service > -- > > Key: SPARK-11191 > URL: https://issues.apache.org/jira/browse/SPARK-11191 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: David Ross > > Since upgrading to spark 1.5 we've been unable to create and use UDF's when > we run in thrift server mode. > Our setup: > We start the thrift-server running against yarn in client mode, (we've also > built our own spark from github branch-1.5 with the following args: {{-Pyarn > -Phive -Phive-thrifeserver}} > If i run the following after connecting via JDBC (in this case via beeline): > {{add jar 'hdfs://path/to/jar"}} > (this command succeeds with no errors) > {{CREATE TEMPORARY FUNCTION testUDF AS 'com.foo.class.UDF';}} > (this command succeeds with no errors) > {{select testUDF(col1) from table1;}} > I get the following error in the logs: > {code} > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 8 > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2$$anonfun$1.apply(hiveUDFs.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:57) > at > org.apache.spark.sql.hive.HiveFunctionRegistry$$anonfun$lookupFunction$2.apply(hiveUDFs.scala:53) > at scala.util.Try.getOrElse(Try.scala:77) > at > org.apache.spark.sql.hive.HiveFunctionRegistry.lookupFunction(hiveUDFs.scala:53) > at > 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5$$anonfun$applyOrElse$24.apply(Analyzer.scala:506) > at > org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:48) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:505) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$10$$anonfun$applyOrElse$5.applyOrElse(Analyzer.scala:502) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:227) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:51) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:226) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:232) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:249) > {code} > (cutting the bulk for ease of report, more than happy to send the full output) > {code} > 15/10/12 14:34:37 ERROR SparkExecuteStatementOperation: Error running hive > query: > org.apache.hive.service.cli.HiveSQLException: > org.apache.spark.sql.AnalysisException: undefined function testUDF; line 1 > pos 100 > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.runInternal(SparkExecuteStatementOperation.scala:259) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:171) > at java.security.AccessController.doPrivileged(Native Method) > 
at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:182) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at
[jira] [Updated] (SPARK-11180) Support BooleanType in DataFrame.na.fill
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11180: Summary: Support BooleanType in DataFrame.na.fill (was: DataFrame.na.fill does not support Boolean Type:) > Support BooleanType in DataFrame.na.fill > > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > Fix For: 1.6.0 > > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > {code} > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > {code} > We want to set "isOfficer" false whenever there is null. > {code} > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > {code} > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
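The requested semantics are simple to state: a fill on a boolean column replaces only nulls and leaves real true/false values alone. A plain-Python sketch of that behavior (rows modeled as dicts; this is not Spark's implementation, which gained the support per the Fix For field above):

```python
def na_fill_bool(rows, column, default):
    """Sketch of df.na.fill(Map(column -> default)) for booleans:
    nulls (None) become the default; existing values are untouched."""
    return [dict(row, **{column: default if row[column] is None else row[column]})
            for row in rows]

emp = [
    {"EmpId": 1, "Designation": None,  "isOfficer": None},
    {"EmpId": 2, "Designation": "SVP", "isOfficer": True},
    {"EmpId": 3, "Designation": "Dir", "isOfficer": False},
]
filled = na_fill_bool(emp, "isOfficer", False)
assert [r["isOfficer"] for r in filled] == [False, True, False]
```

Note that only the named column is filled: the null Designation in row 1 stays null, just as na.fill with a per-column Map would leave it.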
[jira] [Commented] (SPARK-10754) table and column name are case sensitive when json Dataframe was registered as tempTable using JavaSparkContext.
[ https://issues.apache.org/jira/browse/SPARK-10754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964041#comment-14964041 ] Yin Huai commented on SPARK-10754: -- Can you use {{HiveContext}}, which set {{spark.sql.caseSensitive}} to false by default. > table and column name are case sensitive when json Dataframe was registered > as tempTable using JavaSparkContext. > - > > Key: SPARK-10754 > URL: https://issues.apache.org/jira/browse/SPARK-10754 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.3.1, 1.4.1 > Environment: Linux ,Hadoop Version 1.3 >Reporter: Babulal > > Create a dataframe using json data source > SparkConf conf=new > SparkConf().setMaster("spark://xyz:7077")).setAppName("Spark Tabble"); > JavaSparkContext javacontext=new JavaSparkContext(conf); > SQLContext sqlContext=new SQLContext(javacontext); > > DataFrame df = > sqlContext.jsonFile("/user/root/examples/src/main/resources/people.json"); > > df.registerTempTable("sparktable"); > > Run the Query > > sqlContext.sql("select * from sparktable").show()// this will PASs > > > sqlContext.sql("select * from sparkTable").show()/// This will FAIL > > java.lang.RuntimeException: Table Not Found: sparkTable > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog$$anonfun$1.apply(Catalog.scala:115) > at scala.collection.MapLike$class.getOrElse(MapLike.scala:128) > at scala.collection.AbstractMap.getOrElse(Map.scala:58) > at > org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:115) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:233) > > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
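The behavior difference comes down to whether the catalog normalizes identifiers before lookup. A toy sketch of what the spark.sql.caseSensitive switch controls (plain Python; not Spark's Catalog code, and the class name is made up):

```python
class ToyCatalog:
    """Case-insensitive mode lower-cases names on both registration
    and lookup, like spark.sql.caseSensitive=false."""
    def __init__(self, case_sensitive):
        self.case_sensitive = case_sensitive
        self.tables = {}

    def _key(self, name):
        return name if self.case_sensitive else name.lower()

    def register_temp_table(self, name, df):
        self.tables[self._key(name)] = df

    def lookup(self, name):
        try:
            return self.tables[self._key(name)]
        except KeyError:
            raise RuntimeError("Table Not Found: " + name)

sensitive = ToyCatalog(case_sensitive=True)  # behaves like the report above
sensitive.register_temp_table("sparktable", "df")
try:
    sensitive.lookup("sparkTable")
except RuntimeError as e:
    print(e)  # Table Not Found: sparkTable

insensitive = ToyCatalog(case_sensitive=False)  # the suggested workaround
insensitive.register_temp_table("sparktable", "df")
assert insensitive.lookup("sparkTable") == "df"
```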
[jira] [Updated] (SPARK-11192) When graphite metric sink is enabled, spark sql leaks org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time
[ https://issues.apache.org/jira/browse/SPARK-11192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Blake Livingston updated SPARK-11192: - Description: Noticed that slowly, over the course of a day or two, heap memory usage on a long running spark process increased monotonically. After doing a heap dump and examining in jvisualvm, saw there were over 15M org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking over 500MB. Accumulation does not occur when I removed metrics.properties. metrics.properties content: *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=x *.sink.graphite.port=2003 *.sink.graphite.period=10 master.source.jvm.class=org.apache.spark.metrics.source.JvmSource worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource was: Noticed that slowly, over the course of a day or two, heap memory usage on a long running spark process increased monotonically. After doing a heap dump and examining in jvisualvm, saw there were over 15M org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking over 500MB. Accumulation does not occur when I removed metrics.properties. 
metrics.properties content: # Enable Graphite *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink *.sink.graphite.host=x *.sink.graphite.port=2003 *.sink.graphite.period=10 # Enable jvm source for instance master, worker, driver and executor master.source.jvm.class=org.apache.spark.metrics.source.JvmSource worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource > When graphite metric sink is enabled, spark sql leaks > org.apache.spark.sql.execution.ui.SQLTaskMetrics objects over time > > > Key: SPARK-11192 > URL: https://issues.apache.org/jira/browse/SPARK-11192 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 > Environment: java version "1.8.0_60" > Java(TM) SE Runtime Environment (build 1.8.0_60-b27) > Java HotSpot(TM) 64-Bit Server VM (build 25.60-b23, mixed mode) > org.apache.spark/spark-sql_2.10 "1.5.1" > Embedded, in-process spark. Have not tested on standalone or yarn clusters. >Reporter: Blake Livingston >Priority: Minor > > Noticed that slowly, over the course of a day or two, heap memory usage on a > long running spark process increased monotonically. > After doing a heap dump and examining in jvisualvm, saw there were over 15M > org.apache.spark.sql.execution.ui.SQLTaskMetrics objects allocated, taking > over 500MB. > Accumulation does not occur when I removed metrics.properties. 
> metrics.properties content: > *.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink > *.sink.graphite.host=x > *.sink.graphite.port=2003 > *.sink.graphite.period=10 > master.source.jvm.class=org.apache.spark.metrics.source.JvmSource > worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource > driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource > executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11184) Declare most of .mllib code not-Experimental
[ https://issues.apache.org/jira/browse/SPARK-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964077#comment-14964077 ] Apache Spark commented on SPARK-11184: -- User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/9169 > Declare most of .mllib code not-Experimental > > > Key: SPARK-11184 > URL: https://issues.apache.org/jira/browse/SPARK-11184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.1 >Reporter: Sean Owen >Priority: Minor > > Comments please [~mengxr] and [~josephkb]: my proposal is to remove most > {{@Experimental}} annotations from the {{.mllib}} code, on the theory that > it's not intended to change much more. > I can easily take a shot at this, but wanted to collect thoughts before I > started. Does the theory sound reasonable? Part of it is a desire to keep > this annotation meaningful, and also encourage people to at least view MLlib > as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11184) Declare most of .mllib code not-Experimental
[ https://issues.apache.org/jira/browse/SPARK-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11184: Assignee: Apache Spark > Declare most of .mllib code not-Experimental > > > Key: SPARK-11184 > URL: https://issues.apache.org/jira/browse/SPARK-11184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.1 >Reporter: Sean Owen >Assignee: Apache Spark >Priority: Minor > > Comments please [~mengxr] and [~josephkb]: my proposal is to remove most > {{@Experimental}} annotations from the {{.mllib}} code, on the theory that > it's not intended to change much more. > I can easily take a shot at this, but wanted to collect thoughts before I > started. Does the theory sound reasonable? Part of it is a desire to keep > this annotation meaningful, and also encourage people to at least view MLlib > as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11184) Declare most of .mllib code not-Experimental
[ https://issues.apache.org/jira/browse/SPARK-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11184: Assignee: (was: Apache Spark) > Declare most of .mllib code not-Experimental > > > Key: SPARK-11184 > URL: https://issues.apache.org/jira/browse/SPARK-11184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.1 >Reporter: Sean Owen >Priority: Minor > > Comments please [~mengxr] and [~josephkb]: my proposal is to remove most > {{@Experimental}} annotations from the {{.mllib}} code, on the theory that > it's not intended to change much more. > I can easily take a shot at this, but wanted to collect thoughts before I > started. Does the theory sound reasonable? Part of it is a desire to keep > this annotation meaningful, and also encourage people to at least view MLlib > as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
[ https://issues.apache.org/jira/browse/SPARK-11194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964113#comment-14964113 ] Apache Spark commented on SPARK-11194: -- User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/9170 > Use a single URLClassLoader for jars added through SQL's "ADD JAR" command > -- > > Key: SPARK-11194 > URL: https://issues.apache.org/jira/browse/SPARK-11194 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > > Right now, we stack a new URLClassLoader when a user add a jar through SQL's > add jar command. This approach can introduce issues caused by the ordering of > added jars when a class of a jar depends on another class of another jar. > For example, > {code} > ClassLoader1 for Jar1.jar (A.class) >| >|- ClassLoader2 for Jar2.jar (B.class depending on A.class) > {code} > In this case, when we lookup class B, we will not be able to find class A > because Jar2 is the parent of Jar1. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
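The ordering problem in this ticket follows from parent-first delegation: a child loader can see its parent's classes, but anything resolved by a loader lower in the chain cannot see classes held above it. A hedged Python analogue of that lookup rule (toy Loader class, not Spark's or the JVM's actual classloaders):

```python
# Toy model of parent-first classloader delegation. Each "ADD JAR"
# stacks a new child loader on top of the previous one.
class Loader:
    def __init__(self, name, classes, parent=None):
        self.name = name
        self.classes = set(classes)
        self.parent = parent

    def load(self, cls):
        # Delegate to the parent first, as java.lang.ClassLoader does.
        if self.parent is not None:
            try:
                return self.parent.load(cls)
            except KeyError:
                pass
        if cls in self.classes:
            return (self.name, cls)
        raise KeyError(cls)

jar1 = Loader("Jar1", {"A"})                  # ADD JAR Jar1.jar
jar2 = Loader("Jar2", {"B"}, parent=jar1)     # ADD JAR Jar2.jar

# The top of the stack resolves both classes...
assert jar2.load("A") == ("Jar1", "A")
assert jar2.load("B") == ("Jar2", "B")

# ...but a lookup that starts at jar1 (e.g. from a class inside Jar1.jar
# that references B) cannot see Jar2's classes: the order-sensitive bug.
try:
    jar1.load("B")
except KeyError:
    pass

# A single flat loader holding every added jar removes the asymmetry,
# which is the shape this ticket proposes.
flat = Loader("all-jars", {"A", "B"})
assert flat.load("A") and flat.load("B")
```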
[jira] [Updated] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-11180: Description: Currently DataFrame.na.fill does not support Boolean primitive type. We have use cases where while data massaging/preparation we want to fill boolean columns with false/true value. Ex: {code} val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] ((1,null,null),(2,"SVP",true),(3,"Dir",false))) .toDF("EmpId","Designation","isOfficer") empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean] scala> empDf.show |EmpId|Designation|isOfficer| |1| null| null| |2|SVP| true| |3|Dir|false| {code} We want to set "isOfficer" false whenever there is null. {code} scala> empDf.na.fill(Map("isOfficer"->false)) throws exception java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ... {code} Can you add support for Boolean into na.fill function. was: Currently DataFrame.na.fill does not support Boolean primitive type. We have use cases where while data massaging/preparation we want to fill boolean columns with false/true value. Ex: val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] ((1,null,null),(2,"SVP",true),(3,"Dir",false))) .toDF("EmpId","Designation","isOfficer") empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, isOfficer: boolean] scala> empDf.show |EmpId|Designation|isOfficer| |1| null| null| |2|SVP| true| |3|Dir|false| We want to set "isOfficer" false whenever there is null. scala> empDf.na.fill(Map("isOfficer"->false)) throws exception java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean (false). at org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) ... 
Can you add support for Boolean into na.fill function. > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > {code} > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > {code} > We want to set "isOfficer" false whenever there is null. > {code} > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > {code} > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
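On versions where na.fill rejects Booleans, the requested behaviour is easy to state precisely; a minimal pure-Python sketch of the semantics (plain dicts, not the Spark API) of filling nulls in one boolean column:

```python
# Sketch of the semantics requested above: replace nulls (None) in one
# column with a default value, leaving the input rows untouched.
def fill_bool(rows, column, default):
    filled = []
    for row in rows:
        row = dict(row)  # copy, so the original data is not mutated
        if row.get(column) is None:
            row[column] = default
        filled.append(row)
    return filled

emp = [
    {"EmpId": 1, "Designation": None, "isOfficer": None},
    {"EmpId": 2, "Designation": "SVP", "isOfficer": True},
    {"EmpId": 3, "Designation": "Dir", "isOfficer": False},
]
emp_filled = fill_bool(emp, "isOfficer", False)
```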
[jira] [Commented] (SPARK-10645) Bivariate Statistics: Spearman's Correlation support as UDAF
[ https://issues.apache.org/jira/browse/SPARK-10645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964068#comment-14964068 ] Arvind Surve commented on SPARK-10645: -- Spearman's correlation coefficient (SpCoeff) does not fit into the UDAF model, as the rank needs to be calculated for every column independently. I have created a stand-alone method that takes a holistic approach to evaluating SpCoeff, outlined below. This method takes two arrays -- representing two columns -- (this can be converted to take two RDDs as input parameters) and returns SpCoeff. It can be added in org.apache.spark.sql.execution.stat.StatFunctions.scala, with the corr() method invoking it for the "spearman" method. Please provide feedback on this approach and then we can go from there.

// This function calculates Spearman's rank correlation coefficient
// Reference: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
def computeSpearmanCorrCoeff(sc: SparkContext, data1: Array[Int], data2: Array[Int]): Double = {
  val rddData1 = sc.parallelize(data1)
  val rddData2 = sc.parallelize(data2)

  // Calculate the rank for the first vector (tied values receive their average rank).
  val rddData1Rank = rddData1
    .zipWithIndex()
    .sortByKey()
    .zipWithIndex()
    .map { case ((a, b), c) => (a, (c + 1.0, 1.0)) }
    .reduceByKey { case (a, b) => ((a._1 * a._2 + b._1 * b._2) / (a._2 + b._2), a._2 + b._2) }
    .map { case (a, (b, c)) => (a, b) }

  // Calculate the rank for the second vector.
  val rddData2Rank = rddData2
    .zipWithIndex()
    .sortByKey()
    .zipWithIndex()
    .map { case ((a, b), c) => (a, (c + 1.0, 1.0)) }
    .reduceByKey { case (a, b) => ((a._1 * a._2 + b._1 * b._2) / (a._2 + b._2), a._2 + b._2) }
    .map { case (a, (b, c)) => (a, b) }

  // Sum of squared differences of ranks between the two vectors' corresponding
  // elements, in original order.
  val sumSqRankDiff = rddData1.zip(rddData2)
    .join(rddData1Rank).map { case (a, (b, c)) => (b, (a, c)) }
    .join(rddData2Rank).map { case (a, ((b, c), d)) => (d - c) * (d - c) }
    .sum()

  // Length of vector.
  val dataLen = rddData1Rank.count()

  // Return Spearman's rank correlation coefficient.
  1 - (6 * sumSqRankDiff) / (dataLen * (dataLen * dataLen - 1))
}

-Arvind Surve > Bivariate Statistics: Spearman's Correlation support as UDAF > > > Key: SPARK-10645 > URL: https://issues.apache.org/jira/browse/SPARK-10645 > Project: Spark > Issue Type: Sub-task > Components: ML, SQL >Reporter: Jihong MA > > Spearman's rank correlation coefficient : > https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
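The RDD pipeline above can be sanity-checked against a small local implementation of the same steps -- average ranks per value, then 1 - 6*sum(d^2)/(n(n^2-1)). A hedged single-machine Python version (note the closed-form expression is exact only when there are no ties):

```python
def spearman(data1, data2):
    """Spearman's rank correlation via the sum-of-squared-rank-differences formula."""
    def ranks(data):
        # 1-based positions in sorted order; tied values share their average rank.
        order = sorted(range(len(data)), key=lambda i: data[i])
        positions = {}
        for rank, i in enumerate(order, start=1):
            positions.setdefault(data[i], []).append(rank)
        avg = {v: sum(p) / len(p) for v, p in positions.items()}
        return [avg[v] for v in data]

    r1, r2 = ranks(data1), ranks(data2)
    n = len(data1)
    d2 = sum((a - b) ** 2 for a, b in zip(r1, r2))
    # Same final step as the RDD version; exact only without ties.
    return 1 - (6 * d2) / (n * (n * n - 1))
```

For example, identical vectors give a coefficient of 1.0 and exactly reversed vectors give -1.0, which is a quick way to validate the distributed version against this one.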
[jira] [Resolved] (SPARK-11180) DataFrame.na.fill does not support Boolean Type:
[ https://issues.apache.org/jira/browse/SPARK-11180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-11180. - Resolution: Fixed Fix Version/s: 1.6.0 > DataFrame.na.fill does not support Boolean Type: > - > > Key: SPARK-11180 > URL: https://issues.apache.org/jira/browse/SPARK-11180 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.5.0, 1.5.1 >Reporter: Satya Narayan >Priority: Minor > Fix For: 1.6.0 > > > Currently DataFrame.na.fill does not support Boolean primitive type. We > have use cases where while data massaging/preparation we want to fill boolean > columns with false/true value. > Ex: > {code} > val empDf = sqlContext.createDataFrame(Seq[(Integer,String,java.lang.Boolean)] > ((1,null,null),(2,"SVP",true),(3,"Dir",false))) > .toDF("EmpId","Designation","isOfficer") > empDf: org.apache.spark.sql.DataFrame = [EmpId: int, Designation: string, > isOfficer: boolean] > scala> empDf.show > |EmpId|Designation|isOfficer| > |1| null| null| > |2|SVP| true| > |3|Dir|false| > {code} > We want to set "isOfficer" false whenever there is null. > {code} > scala> empDf.na.fill(Map("isOfficer"->false)) > throws exception > java.lang.IllegalArgumentException: Unsupported value type java.lang.Boolean > (false). > at > org.apache.spark.sql.DataFrameNaFunctions$$anonfun$fill0$1.apply(DataFrameNaFunctions.scala:370) > ... > {code} > Can you add support for Boolean into na.fill function. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-10955) Warn if dynamic allocation is enabled for Streaming jobs
[ https://issues.apache.org/jira/browse/SPARK-10955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das updated SPARK-10955: -- Description: Spark streaming can be tricky with dynamic allocation and can lose data if not used properly (with WAL, or with WAL-free solutions like Direct Kafka and Kinesis since 1.5). If dynamic allocation is enabled, we should issue a log4j warning. (was: Spark streaming can be tricky with dynamic allocation and can lose data. We should disable dynamic allocation or at least log that it is dangerous.) > Warn if dynamic allocation is enabled for Streaming jobs > > > Key: SPARK-10955 > URL: https://issues.apache.org/jira/browse/SPARK-10955 > Project: Spark > Issue Type: Bug > Components: Streaming >Reporter: Hari Shreedharan >Assignee: Hari Shreedharan > Fix For: 1.5.2, 1.6.0 > > > Spark streaming can be tricky with dynamic allocation and can lose data if > not used properly (with WAL, or with WAL-free solutions like Direct Kafka and > Kinesis since 1.5). If dynamic allocation is enabled, we should issue a log4j > warning. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
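The change amounts to a configuration check when a streaming job starts; a hedged sketch of the shape of such a check (hypothetical function name, Python logging standing in for log4j):

```python
import logging

def warn_if_dynamic_allocation(conf):
    # Sketch of the proposed behaviour: warn (do not fail) when a
    # streaming job runs with dynamic allocation enabled.
    if conf.get("spark.dynamicAllocation.enabled", "false") == "true":
        logging.warning(
            "Dynamic allocation is enabled for a streaming job; "
            "executors holding received data may be removed. "
            "Use a WAL, or a WAL-free source such as Direct Kafka or Kinesis.")
        return True
    return False

warned = warn_if_dynamic_allocation({"spark.dynamicAllocation.enabled": "true"})
```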
[jira] [Commented] (SPARK-11190) SparkR support for cassandra collection types.
[ https://issues.apache.org/jira/browse/SPARK-11190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964032#comment-14964032 ] Shivaram Venkataraman commented on SPARK-11190: --- cc [~sunrui] Could you try this on the master branch ? We recently added support for Lists, Maps etc. in the master branch > SparkR support for cassandra collection types. > --- > > Key: SPARK-11190 > URL: https://issues.apache.org/jira/browse/SPARK-11190 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 > Environment: SparkR Version: 1.5.1 > Cassandra Version: 2.1.6 > R Version: 3.2.2 > Cassandra Connector version: 1.5.0-M2 >Reporter: Bilind Hajer > Labels: cassandra, dataframe, sparkR > Fix For: 1.5.2 > > > I want to create a data frame from a Cassandra keyspace and column family in > sparkR. > I am able to create data frames from tables which do not include any > Cassandra collection datatypes, > such as Map, Set and List. But, many of the schemas that I need data from, > do include these collection data types. > Here is my local environment. > SparkR Version: 1.5.1 > Cassandra Version: 2.1.6 > R Version: 3.2.2 > Cassandra Connector version: 1.5.0-M2 > To test this issue, I did the following iterative process. > sudo ./sparkR --packages > com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M2 --conf > spark.cassandra.connection.host=127.0.0.1 > Running this command, with sparkR gives me access to the spark cassandra > connector package I need, > and connects me to my local cqlsh server ( which is up and running while > running this code in sparkR shell ). 
> CREATE TABLE test_table ( > column_1 int, > column_2 text, > column_3 float, > column_4 uuid, > column_5 timestamp, > column_6 boolean, > column_7 timeuuid, > column_8 bigint, > column_9 blob, > column_10 ascii, > column_11 decimal, > column_12 double, > column_13 inet, > column_14 varchar, > column_15 varint, > PRIMARY KEY( ( column_1, column_2 ) ) > ); > All of the above data types are supported. I insert dummy data after creating > this test schema. > For example, now in my sparkR shell, I run the following code. > df.test <- read.df(sqlContext, source = "org.apache.spark.sql.cassandra", > keyspace = "datahub", table = "test_table") > assigns with no errors, then, > > schema(df.test) > StructType > |-name = "column_1", type = "IntegerType", nullable = TRUE > |-name = "column_2", type = "StringType", nullable = TRUE > |-name = "column_10", type = "StringType", nullable = TRUE > |-name = "column_11", type = "DecimalType(38,18)", nullable = TRUE > |-name = "column_12", type = "DoubleType", nullable = TRUE > |-name = "column_13", type = "InetAddressType", nullable = TRUE > |-name = "column_14", type = "StringType", nullable = TRUE > |-name = "column_15", type = "DecimalType(38,0)", nullable = TRUE > |-name = "column_3", type = "FloatType", nullable = TRUE > |-name = "column_4", type = "UUIDType", nullable = TRUE > |-name = "column_5", type = "TimestampType", nullable = TRUE > |-name = "column_6", type = "BooleanType", nullable = TRUE > |-name = "column_7", type = "UUIDType", nullable = TRUE > |-name = "column_8", type = "LongType", nullable = TRUE > |-name = "column_9", type = "BinaryType", nullable = TRUE > Schema is correct. > > class(df.test) > [1] "DataFrame" > attr(,"package") > [1] "SparkR" > df.test is clearly defined to be a DataFrame Object. 
> > head(df.test) > column_1 column_2 column_10 column_11 column_12 column_13 column_14 > column_15 > 11helloNANANANANA > NA > column_3 column_4 column_5 column_6 column_7 column_8 column_9 > 1 3.4 NA NA NA NA NA NA > sparkR is reading from the column_family correctly, but now let's add a > collection data type to the schema. > Now I will drop that test_table, and recreate the table, with an extra > column of data type map> CREATE TABLE test_table ( > column_1 int, > column_2 text, > column_3 float, > column_4 uuid, > column_5 timestamp, > column_6 boolean, > column_7 timeuuid, > column_8
[jira] [Commented] (SPARK-11016) Spark fails when running with a task that requires a more recent version of RoaringBitmaps
[ https://issues.apache.org/jira/browse/SPARK-11016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14964043#comment-14964043 ] Charles Allen commented on SPARK-11016: --- [~srowen] I confirmed locally that https://github.com/metamx/spark/pull/1 prevents this error, but as per your prior comment a "more correct" implementation would probably provide a Kryo Externalizable bridge of some kind. > Spark fails when running with a task that requires a more recent version of > RoaringBitmaps > -- > > Key: SPARK-11016 > URL: https://issues.apache.org/jira/browse/SPARK-11016 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.0 >Reporter: Charles Allen > > The following error appears during Kryo init whenever a more recent version > (>0.5.0) of Roaring bitmaps is required by a job. > org/roaringbitmap/RoaringArray$Element was removed in 0.5.0 > {code} > A needed class was not found. This could be due to an error in your runpath. > Missing class: org/roaringbitmap/RoaringArray$Element > java.lang.NoClassDefFoundError: org/roaringbitmap/RoaringArray$Element > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala:338) > at > org.apache.spark.serializer.KryoSerializer$.(KryoSerializer.scala) > at > org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:93) > at > org.apache.spark.serializer.KryoSerializerInstance.borrowKryo(KryoSerializer.scala:237) > at > org.apache.spark.serializer.KryoSerializerInstance.(KryoSerializer.scala:222) > at > org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:138) > at > org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:201) > at > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:102) > at > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) > at > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > at > 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1318) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1006) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1.apply(SparkContext.scala:1003) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.hadoopFile(SparkContext.scala:1003) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:818) > at > org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:816) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.SparkContext.withScope(SparkContext.scala:700) > at org.apache.spark.SparkContext.textFile(SparkContext.scala:816) > {code} > See https://issues.apache.org/jira/browse/SPARK-5949 for related info -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11194) Use a single URLClassLoader for jars added through SQL's "ADD JAR" command
Yin Huai created SPARK-11194: Summary: Use a single URLClassLoader for jars added through SQL's "ADD JAR" command Key: SPARK-11194 URL: https://issues.apache.org/jira/browse/SPARK-11194 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5929) Pyspark: Register a pip requirements file with spark_context
[ https://issues.apache.org/jira/browse/SPARK-5929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963655#comment-14963655 ] buckhx commented on SPARK-5929: --- I also included an add module that will bundle and ship a module that has already been imported by the driver > Pyspark: Register a pip requirements file with spark_context > > > Key: SPARK-5929 > URL: https://issues.apache.org/jira/browse/SPARK-5929 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: buckhx >Priority: Minor > > I've been doing a lot of dependency work with shipping dependencies to > workers as it is non-trivial for me to have my workers include the proper > dependencies in their own environments. > To circumvent this, I added a addRequirementsFile() method that takes a pip > requirements file, downloads the packages, repackages them to be registered > with addPyFiles and ship them to workers. > Here is a comparison of what I've done on the Palantir fork > https://github.com/buckheroux/spark/compare/palantir:master...master -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
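The first step of an addRequirementsFile() helper like the one proposed is simply parsing the pip specifiers before downloading and repackaging them; a hedged sketch of just that step (illustrative helper, not the fork's actual code, and ignoring pip's less common line forms such as -r includes):

```python
def parse_requirements(text):
    # Extract package specifiers from requirements-file text,
    # dropping comments and blank lines.
    reqs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            reqs.append(line)
    return reqs

reqs = parse_requirements("requests==2.4\n# pinned for CI\n\nnumpy\n")
```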
[jira] [Updated] (SPARK-11179) Push filters through aggregate if filters are subset of 'group by' expressions
[ https://issues.apache.org/jira/browse/SPARK-11179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11179: -- Fix Version/s: (was: 1.6.0) [~nitin2goyal] this can't have a Fix version. > Push filters through aggregate if filters are subset of 'group by' expressions > -- > > Key: SPARK-11179 > URL: https://issues.apache.org/jira/browse/SPARK-11179 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Nitin Goyal >Priority: Minor > > Push filters through aggregate if filters are subset of 'group by' > expressions. This optimisation can be added in Spark SQL's Optimizer class -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
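The optimisation proposed in this ticket is sound because a filter on grouping keys commutes with the aggregation; a hedged Python sketch of the equivalence the rewrite relies on (toy group-sum, not Catalyst):

```python
from collections import defaultdict

def group_sum(rows, key, value):
    # Toy GROUP BY key, SUM(value).
    out = defaultdict(int)
    for row in rows:
        out[row[key]] += row[value]
    return dict(out)

rows = [
    {"dept": "a", "x": 1},
    {"dept": "b", "x": 2},
    {"dept": "a", "x": 3},
]
pred = lambda dept: dept == "a"

# Filter applied after the aggregate (the unoptimised plan) ...
after = {k: v for k, v in group_sum(rows, "dept", "x").items() if pred(k)}
# ... equals the filter pushed below the aggregate (the proposed rewrite),
# because the predicate only references the grouping expression.
before = group_sum([r for r in rows if pred(r["dept"])], "dept", "x")
assert after == before
```

Pushing the filter down is profitable because fewer rows reach the (typically expensive) aggregation step.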
[jira] [Resolved] (SPARK-11119) cleanup unsafe array and map
[ https://issues.apache.org/jira/browse/SPARK-11119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-11119. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9131 [https://github.com/apache/spark/pull/9131] > cleanup unsafe array and map > > > Key: SPARK-11119 > URL: https://issues.apache.org/jira/browse/SPARK-11119 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5250) EOFException in when reading gzipped files from S3 with wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5250. --- Resolution: Cannot Reproduce > EOFException in when reading gzipped files from S3 with wholeTextFiles > -- > > Key: SPARK-5250 > URL: https://issues.apache.org/jira/browse/SPARK-5250 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Mojmir Vinkler >Priority: Critical > > I get an `EOFException` error when reading *some* gzipped files using > `sc.wholeTextFiles`. It happens to just a few files, I thought that the file > is corrupted, but I was able to read it without problems using `sc.textFile` > (and pandas). > Traceback for command > `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()` > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect() > /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self) > 674 """ > 675 with SCCallSiteSync(self.context) as css: > --> 676 bytesInJava = self._jrdd.collect().iterator() > 677 return list(self._collect_iterator_through_file(bytesInJava)) > 678 > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. > --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o1576.collect. 
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: > Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77) > at java.io.InputStream.read(InputStream.java:101) > at com.google.common.io.ByteStreams.copy(ByteStreams.java:207) > at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252) > at > org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at > org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at 
org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61) > at org.apache.spark.scheduler.Task.run(Task.scala:56) > at
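The "Unexpected end of input stream" in the trace above is what a decompressor raises when a gzip member is cut short, e.g. by a short read from object storage. The failure mode is easy to reproduce locally; a hedged Python illustration (standalone, nothing S3-specific):

```python
import gzip

payload = b"col1,col2\n1,2\n" * 100
blob = gzip.compress(payload)

# An intact stream decompresses fully...
assert gzip.decompress(blob) == payload

# ...but chopping bytes off the end mimics a truncated download, and
# decompression fails with EOFError ("Compressed file ended before the
# end-of-stream marker was reached") -- analogous to the Hadoop
# DecompressorStream's EOFException above.
truncated = blob[:-10]
try:
    gzip.decompress(truncated)
    raise AssertionError("expected EOFError")
except EOFError:
    pass
```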
[jira] [Created] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions
Michael Armbrust created SPARK-11188: Summary: Elide stacktraces in bin/spark-sql for AnalysisExceptions Key: SPARK-11188 URL: https://issues.apache.org/jira/browse/SPARK-11188 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust For analysis exceptions in the sql-shell, we should only print the error message to the screen. The stacktrace will never have useful information since this error is used to signify an error with the query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
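The behaviour proposed here -- print only the message for expected query-level errors, keep the full trace for everything else -- can be sketched generically (hypothetical AnalysisError class and report helper, not Spark's actual code):

```python
import traceback

class AnalysisError(Exception):
    """Stand-in for an expected, user-facing query error (cf. AnalysisException)."""

def report(fn):
    # Run a query function and format its failure for a shell user.
    try:
        fn()
        return "ok"
    except AnalysisError as e:
        # Expected error: the message alone tells the user what to fix.
        return "Error in query: %s" % e
    except Exception:
        # Unexpected error: keep the stack trace, which helps debugging.
        return traceback.format_exc()

def bad_query():
    raise AnalysisError("Table not found: t")

elided = report(bad_query)            # message only, no stack trace
unexpected = report(lambda: 1 // 0)   # full traceback retained
```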
[jira] [Commented] (SPARK-11184) Declare most of .mllib code not-Experimental
[ https://issues.apache.org/jira/browse/SPARK-11184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963692#comment-14963692 ] Joseph K. Bradley commented on SPARK-11184: --- I agree we need to remove more of those tags; thanks for working on this! I'll be happy to help review. > Declare most of .mllib code not-Experimental > > > Key: SPARK-11184 > URL: https://issues.apache.org/jira/browse/SPARK-11184 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.5.1 >Reporter: Sean Owen >Priority: Minor > > Comments please [~mengxr] and [~josephkb]: my proposal is to remove most > {{@Experimental}} annotations from the {{.mllib}} code, on the theory that > it's not intended to change much more. > I can easily take a shot at this, but wanted to collect thoughts before I > started. Does the theory sound reasonable? Part of it is a desire to keep > this annotation meaningful, and also encourage people to at least view MLlib > as stable, because it is. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
[ https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11177. Resolution: Won't Fix I'm going to resolve this as "Won't Fix", since I think that the difficulty / risk of fixing this in Spark is too high right now. Affected users should upgrade to Hadoop 2.x. > sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero > bytes > --- > > Key: SPARK-11177 > URL: https://issues.apache.org/jira/browse/SPARK-11177 > Project: Spark > Issue Type: Sub-task > Components: Input/Output >Reporter: Josh Rosen >Assignee: Josh Rosen > > From a user report: > {quote} > When I upload a series of text files to an S3 directory and one of the files > is empty (0 bytes). The sc.wholeTextFiles method stack traces. > java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245) > at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > {quote} > It looks like this has been a longstanding 
issue: > * > http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html > * > https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark > * > https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
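Until an affected deployment can upgrade, a common workaround (an editorial sketch, not from the ticket) is to filter out zero-byte inputs before handing the paths to wholeTextFiles, since the crash originates in CombineFileInputFormat when it sees an empty file. The sketch below illustrates only the filtering step, using local files and made-up names; against S3 the same size check would be applied to the object listing:

```python
import os
import tempfile

def non_empty_paths(directory):
    """Return paths of files in `directory` containing at least one byte.

    Dropping zero-byte files up front sidesteps inputs that
    CombineFileInputFormat cannot split."""
    return [
        os.path.join(directory, name)
        for name in sorted(os.listdir(directory))
        if os.path.getsize(os.path.join(directory, name)) > 0
    ]

# Demo: one empty and one non-empty file in a scratch directory.
d = tempfile.mkdtemp()
open(os.path.join(d, "empty.txt"), "w").close()
with open(os.path.join(d, "data.txt"), "w") as f:
    f.write("hello")

print(non_empty_paths(d))  # only data.txt survives the filter
```

The surviving paths could then be joined with commas and passed to sc.wholeTextFiles, which accepts a comma-separated list of paths.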
[jira] [Updated] (SPARK-11187) Add Newton-Raphson Step per Tree to GBDT Implementation
[ https://issues.apache.org/jira/browse/SPARK-11187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai updated SPARK-11187: Shepherd: DB Tsai Affects Version/s: (was: 1.5.1) 1.6.0 Component/s: ML > Add Newton-Raphson Step per Tree to GBDT Implementation > --- > > Key: SPARK-11187 > URL: https://issues.apache.org/jira/browse/SPARK-11187 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 1.6.0 >Reporter: Joseph Babcock > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5250) EOFException when reading gzipped files from S3 with wholeTextFiles
[ https://issues.apache.org/jira/browse/SPARK-5250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963817#comment-14963817 ] Josh Rosen commented on SPARK-5250: --- Ah, gotcha. I'm going to resolve this as "Cannot Reproduce" for the time being, since I don't really have any means to debug this right now. > EOFException when reading gzipped files from S3 with wholeTextFiles > -- > > Key: SPARK-5250 > URL: https://issues.apache.org/jira/browse/SPARK-5250 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Mojmir Vinkler >Priority: Critical > > I get an `EOFException` error when reading *some* gzipped files using > `sc.wholeTextFiles`. It happens to just a few files; I thought the file > was corrupted, but I was able to read it without problems using `sc.textFile` > (and pandas). > Traceback for command > `sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect()` > {code} > --- > Py4JJavaError Traceback (most recent call last) > in () > > 1 sc.wholeTextFiles('s3n://s3bucket/2525322021051.csv.gz').collect() > /home/ubuntu/databricks/spark/python/pyspark/rdd.py in collect(self) > 674 """ > 675 with SCCallSiteSync(self.context) as css: > --> 676 bytesInJava = self._jrdd.collect().iterator() > 677 return list(self._collect_iterator_through_file(bytesInJava)) > 678 > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py > in __call__(self, *args) > 536 answer = self.gateway_client.send_command(command) > 537 return_value = get_return_value(answer, self.gateway_client, > --> 538 self.target_id, self.name) > 539 > 540 for temp_arg in temp_args: > /home/ubuntu/databricks/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py > in get_return_value(answer, gateway_client, target_id, name) > 298 raise Py4JJavaError( > 299 'An error occurred while calling {0}{1}{2}.\n'. 
> --> 300 format(target_id, '.', name), value) > 301 else: > 302 raise Py4JError( > Py4JJavaError: An error occurred while calling o1576.collect. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 41.0 failed 4 times, most recent failure: Lost task 0.3 in stage > 41.0 (TID 4720, ip-10-0-241-126.ec2.internal): java.io.EOFException: > Unexpected end of input stream > at > org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:137) > at > org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:77) > at java.io.InputStream.read(InputStream.java:101) > at com.google.common.io.ByteStreams.copy(ByteStreams.java:207) > at com.google.common.io.ByteStreams.toByteArray(ByteStreams.java:252) > at > org.apache.spark.input.WholeTextFileRecordReader.nextKeyValue(WholeTextFileRecordReader.scala:73) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.nextKeyValue(CombineFileRecordReader.java:69) > at > org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:145) > at > org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39) > at scala.collection.Iterator$class.foreach(Iterator.scala:727) > at > org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28) > at > scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103) > at > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47) > at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273) > at > org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265) > at > org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28) > at > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252) > at > 
org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at org.apache.spark.rdd.RDD$$anonfun$16.apply(RDD.scala:780) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at > org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314) > at
[jira] [Commented] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963849#comment-14963849 ] Ruslan Dautkhanov commented on SPARK-11150: --- Will partition-wise joins also be handled by this JIRA? https://blogs.oracle.com/datawarehousing/entry/partition_wise_joins E.g. in a two-table join by a common key, if both tables are hash-partitioned the same way, there is no need for shuffling. > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: Younes > > Partitions are not pruned when joined on the partition columns. > This is the same issue as HIVE-9152. > Ex: > Select * from tab where partcol=1 will prune on value 1 > Select * from tab join dim on (dim.partcol=tab.partcol) where > dim.partcol=1 will scan all partitions. > The tables are stored as Parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
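The pruning the ticket asks for can be pictured outside of Spark entirely. In this hedged plain-Python sketch (table contents and names are made up for illustration, not taken from the ticket), the filter on the dimension side is evaluated first, and only the fact-table partitions whose keys survive that filter are ever scanned:

```python
# Hypothetical in-memory model of a fact table partitioned by partcol:
# partition key -> list of rows stored in that partition.
fact_partitions = {
    1: [("a", 1)],
    2: [("b", 2)],
    3: [("c", 3)],
}
# Dimension table rows: (partcol, label).
dim = [(1, "dim-1"), (2, "dim-2")]

def join_with_pruning(fact_partitions, dim, dim_filter):
    # Step 1: evaluate the dimension-side predicate first.
    dim_rows = [r for r in dim if dim_filter(r)]
    # Step 2: the surviving join keys determine which fact partitions to scan.
    wanted = {partcol for (partcol, _) in dim_rows}
    scanned = [p for p in fact_partitions if p in wanted]
    # Step 3: join only the pruned partitions.
    out = []
    for p in scanned:
        for row in fact_partitions[p]:
            for (partcol, label) in dim_rows:
                if partcol == p:
                    out.append(row + (label,))
    return scanned, out

# With dim.partcol = 1, partitions 2 and 3 are never scanned.
scanned, rows = join_with_pruning(fact_partitions, dim, lambda r: r[0] == 1)
print(scanned)  # [1]
```

Without step 2 the join must scan every fact partition, which is exactly the behavior the ticket reports for `dim.partcol=1` pushed through a join.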
[jira] [Created] (SPARK-11189) History server is not able to parse some application report
Jean-Baptiste Onofré created SPARK-11189: Summary: History server is not able to parse some application report Key: SPARK-11189 URL: https://issues.apache.org/jira/browse/SPARK-11189 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.6.0 Reporter: Jean-Baptiste Onofré In some cases, the history server is not able to parse an application report. For instance, with the JavaTC example: {code} com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was expecting closing '"' for name at [Source: {"Event":"SparkListenerTaskEnd","Stage ID":245,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Rea; line: 1, column: 241] at com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508) at com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:445) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName2(ReaderBasedJsonParser.java:1284) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName(ReaderBasedJsonParser.java:1268) at com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:618) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:34) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42) at org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35) at com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3066) at com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161) at org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19) at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44) at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) at org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:950) at 
org.apache.spark.deploy.master.Master.removeApplication(Master.scala:812) at org.apache.spark.deploy.master.Master.org$apache$spark$deploy$master$Master$$finishApplication(Master.scala:790) at org.apache.spark.deploy.master.Master$$anonfun$receive$1$$anonfun$applyOrElse$21.apply(Master.scala:382) at org.apache.spark.deploy.master.Master$$anonfun$receive$1$$anonfun$applyOrElse$21.apply(Master.scala:382) at scala.Option.foreach(Option.scala:236) at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:382) at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:206) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:99) at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:224) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11189) History server is not able to parse some application report
[ https://issues.apache.org/jira/browse/SPARK-11189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963936#comment-14963936 ] Sean Owen commented on SPARK-11189: --- It looks like you have a truncated input file. Are there any other problems leading up to this? > History server is not able to parse some application report > --- > > Key: SPARK-11189 > URL: https://issues.apache.org/jira/browse/SPARK-11189 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 >Reporter: Jean-Baptiste Onofré > > In some case, history server is not able to parse an application report. > For instance, with JavaTC example: > {code} > com.fasterxml.jackson.core.JsonParseException: Unexpected end-of-input: was > expecting closing '"' for name > at [Source: {"Event":"SparkListenerTaskEnd","Stage ID":245,"Stage Attempt > ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Rea; line: 1, column: > 241] > at > com.fasterxml.jackson.core.JsonParser._constructError(JsonParser.java:1419) > at > com.fasterxml.jackson.core.base.ParserMinimalBase._reportError(ParserMinimalBase.java:508) > at > com.fasterxml.jackson.core.base.ParserMinimalBase._reportInvalidEOF(ParserMinimalBase.java:445) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName2(ReaderBasedJsonParser.java:1284) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser._parseName(ReaderBasedJsonParser.java:1268) > at > com.fasterxml.jackson.core.json.ReaderBasedJsonParser.nextToken(ReaderBasedJsonParser.java:618) > at > org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:34) > at > org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:42) > at > org.json4s.jackson.JValueDeserializer.deserialize(JValueDeserializer.scala:35) > at > com.fasterxml.jackson.databind.ObjectMapper._readMapAndClose(ObjectMapper.java:3066) > at > com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:2161) > at 
org.json4s.jackson.JsonMethods$class.parse(JsonMethods.scala:19) > at org.json4s.jackson.JsonMethods$.parse(JsonMethods.scala:44) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.master.Master.rebuildSparkUI(Master.scala:950) > at > org.apache.spark.deploy.master.Master.removeApplication(Master.scala:812) > at > org.apache.spark.deploy.master.Master.org$apache$spark$deploy$master$Master$$finishApplication(Master.scala:790) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1$$anonfun$applyOrElse$21.apply(Master.scala:382) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1$$anonfun$applyOrElse$21.apply(Master.scala:382) > at scala.Option.foreach(Option.scala:236) > at > org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:382) > at > org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:105) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:206) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:99) > at > org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:224) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4240) Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963648#comment-14963648 ] Joseph K. Bradley commented on SPARK-4240: -- This conversation slipped under my radar somehow; my apologies! I think it'd be fine to copy the implementation of GBTs to spark.ml, especially if we want to restructure it to support TreeBoost. As far as updating or replacing the spark.mllib implementation, I'd say: Ideally it would eventually be a wrapper for the spark.ml implementation, but we should focus on the spark.ml API and implementation for now, even if it means temporarily having a copy of the code. I think it'd be hard to combine this work with generic boosting because TreeBoost relies on the fact that trees are a space-partitioning algorithm, but we could discuss feasibility if there is a way to leverage the same implementation. [~dbtsai] expressed interest in this work, so I'll ping him here. > Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy. > > > Key: SPARK-4240 > URL: https://issues.apache.org/jira/browse/SPARK-4240 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sung Chung > > The gradient boosting as currently implemented estimates the loss-gradient in > each iteration using regression trees. At every iteration, the regression > trees are trained/split to minimize predicted gradient variance. > Additionally, the terminal node predictions are computed to minimize the > prediction variance. > However, such predictions won't be optimal for loss functions other than the > mean-squared error. The TreeBoosting refinement can help mitigate this issue > by modifying terminal node prediction values so that those predictions would > directly minimize the actual loss function. 
Although this still doesn't > change the fact that the tree splits were done through variance reduction, it > should still lead to improvement in gradient estimations, and thus better > performance. > The details of this can be found in the R vignette. This paper also shows how > to refine the terminal node predictions. > http://www.saedsayad.com/docs/gbm2.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
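The refinement described above boils down to replacing each terminal node's variance-minimizing constant with the value that minimizes the actual loss over the examples falling in that node, typically via one Newton-Raphson step. A minimal sketch of that single step, assuming log-loss and hypothetical leaf contents (this is not Spark's tree API, just the arithmetic):

```python
import math

def newton_leaf_value(ys, preds, grad, hess):
    """One Newton-Raphson step for a tree leaf: the constant delta that
    approximately minimizes sum_i loss(y_i, pred_i + delta).
    delta = -sum(g_i) / sum(h_i), the standard second-order update."""
    g = sum(grad(y, f) for y, f in zip(ys, preds))
    h = sum(hess(y, f) for y, f in zip(ys, preds))
    return -g / h

# Log-loss on raw scores f with labels y in {0, 1}:
#   gradient = sigmoid(f) - y,  hessian = sigmoid(f) * (1 - sigmoid(f))
sig = lambda f: 1.0 / (1.0 + math.exp(-f))
grad = lambda y, f: sig(f) - y
hess = lambda y, f: sig(f) * (1.0 - sig(f))

# A leaf holding three positive examples currently scored at f = 0:
delta = newton_leaf_value([1, 1, 1], [0.0, 0.0, 0.0], grad, hess)
print(delta)  # 2.0: each gradient is -0.5, each hessian 0.25
```

For squared error this step reproduces the leaf mean, which is why the refinement only matters for other losses, as the ticket notes.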
[jira] [Updated] (SPARK-11177) sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero bytes
[ https://issues.apache.org/jira/browse/SPARK-11177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11177: -- Component/s: Input/Output > sc.wholeTextFiles throws ArrayIndexOutOfBoundsException when S3 file has zero > bytes > --- > > Key: SPARK-11177 > URL: https://issues.apache.org/jira/browse/SPARK-11177 > Project: Spark > Issue Type: Sub-task > Components: Input/Output >Reporter: Josh Rosen >Assignee: Josh Rosen > > From a user report: > {quote} > When I upload a series of text files to an S3 directory and one of the files > is empty (0 bytes), the sc.wholeTextFiles method fails with the stack trace below. > java.lang.ArrayIndexOutOfBoundsException: 0 > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:506) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getMoreSplits(CombineFileInputFormat.java:285) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:245) > at org.apache.spark.rdd.WholeTextFileRDD.getPartitions(NewHadoopRDD.scala:303) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239) > at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237) > at scala.Option.getOrElse(Option.scala:120) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:237) > at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) > at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:905) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:306) > at org.apache.spark.rdd.RDD.collect(RDD.scala:904) > {quote} > It looks like this has been a longstanding issue: > * > http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-wholeTextFiles-error-td8872.html > * > 
https://stackoverflow.com/questions/31051107/read-multiple-files-from-a-directory-using-spark > * > https://forums.databricks.com/questions/1799/arrayindexoutofboundsexception-with-wholetextfiles.html -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10668) Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small
[ https://issues.apache.org/jira/browse/SPARK-10668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-10668. - Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 8884 [https://github.com/apache/spark/pull/8884] > Use WeightedLeastSquares in LinearRegression with L2 regularization if the > number of features is small > -- > > Key: SPARK-10668 > URL: https://issues.apache.org/jira/browse/SPARK-10668 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xiangrui Meng >Assignee: Kai Sasaki >Priority: Critical > Fix For: 1.6.0 > > > If the number of features is small (<=4096) and the regularization is L2, we > should use WeightedLeastSquares to solve the problem rather than L-BFGS. The > former requires only one pass over the data. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
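The one-pass advantage is easiest to see in the single-feature case, where the weighted normal equations collapse to a closed form computed from a handful of weighted sums. A plain-Python sketch (made-up data, and not Spark's actual WeightedLeastSquares implementation, which solves the multi-feature normal equations):

```python
def weighted_least_squares(xs, ys, ws):
    """Closed-form weighted OLS for y ~ a + b*x. All the sufficient
    statistics are weighted sums, hence a single pass over the data --
    the reason WLS beats iterative L-BFGS when the feature count is small."""
    sw = sum(ws)
    xb = sum(w * x for w, x in zip(ws, xs)) / sw  # weighted mean of x
    yb = sum(w * y for w, y in zip(ws, ys)) / sw  # weighted mean of y
    b = (sum(w * (x - xb) * (y - yb) for w, x, y in zip(ws, xs, ys))
         / sum(w * (x - xb) ** 2 for w, x in zip(ws, xs)))
    a = yb - b * xb
    return a, b

# Points lying exactly on y = 1 + 2x are recovered regardless of the weights:
a, b = weighted_least_squares([0, 1, 2], [1, 3, 5], [1.0, 2.0, 0.5])
print(round(a, 6), round(b, 6))
```

An iterative solver such as L-BFGS would instead touch the data once per iteration, which is what makes the normal-equation path attractive below the 4096-feature threshold mentioned in the issue.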
[jira] [Resolved] (SPARK-4414) SparkContext.wholeTextFiles Doesn't work with S3 Buckets
[ https://issues.apache.org/jira/browse/SPARK-4414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4414. --- Resolution: Won't Fix I'm going to resolve this as "Won't Fix", since I think that the difficulty / risk of fixing this in Spark is too high right now. While in principle we could fix this by inlining the affected Hadoop classes in Spark, it's going to be extremely difficult to do this in a way that is source- and binary-compatible with all of the Hadoop versions that we need to support. Affected users should upgrade to Hadoop 1.2.1 or higher, which do not seem to be affected by this bug. > SparkContext.wholeTextFiles Doesn't work with S3 Buckets > > > Key: SPARK-4414 > URL: https://issues.apache.org/jira/browse/SPARK-4414 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: Pedro Rodriguez >Assignee: Josh Rosen >Priority: Critical > > SparkContext.wholeTextFiles does not read files which SparkContext.textFile > can read. Below are general steps to reproduce; my specific case, on a git repo, > follows them. > Steps to reproduce. > 1. Create Amazon S3 bucket, make public with multiple files > 2. Attempt to read bucket with > sc.wholeTextFiles("s3n://mybucket/myfile.txt") > 3. Spark returns the following error, even if the file exists. > Exception in thread "main" java.io.FileNotFoundException: File does not > exist: /myfile.txt > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:517) > at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat$OneFileInfo.(CombineFileInputFormat.java:489) > 4. Change the call to > sc.textFile("s3n://mybucket/myfile.txt") > and there is no error message; the application runs fine. 
> There is a question on StackOverflow as well on this: > http://stackoverflow.com/questions/26258458/sparkcontext-wholetextfiles-java-io-filenotfoundexception-file-does-not-exist > This is link to repo/lines of code. The uncommented call doesn't work, the > commented call works as expected: > https://github.com/EntilZha/nips-lda-spark/blob/45f5ad1e2646609ef9d295a0954fbefe84111d8a/src/main/scala/NipsLda.scala#L13-L19 > It would be easy to use textFile with a multifile argument, but this should > work correctly for s3 bucket files as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11187) Add Newton-Raphson Step per Tree to GBDT Implementation
Joseph Babcock created SPARK-11187: -- Summary: Add Newton-Raphson Step per Tree to GBDT Implementation Key: SPARK-11187 URL: https://issues.apache.org/jira/browse/SPARK-11187 Project: Spark Issue Type: Sub-task Components: MLlib Affects Versions: 1.5.1 Reporter: Joseph Babcock -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-9643) Error serializing datetimes with timezones using Dataframes and Parquet
[ https://issues.apache.org/jira/browse/SPARK-9643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-9643: - Assignee: Alex Angelini > Error serializing datetimes with timezones using Dataframes and Parquet > --- > > Key: SPARK-9643 > URL: https://issues.apache.org/jira/browse/SPARK-9643 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.4.1 >Reporter: Alex Angelini >Assignee: Alex Angelini > Labels: upgrade > Fix For: 1.6.0 > > > Trying to serialize a DataFrame with a datetime column that includes a > timezone fails with the following error. > {code} > net.razorvine.pickle.PickleException: invalid pickle data for datetime; > expected 1 or 7 args, got 2 > at > net.razorvine.pickle.objects.DateTimeConstructor.createDateTime(DateTimeConstructor.java:69) > at > net.razorvine.pickle.objects.DateTimeConstructor.construct(DateTimeConstructor.java:32) > at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:701) > at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:171) > at net.razorvine.pickle.Unpickler.load(Unpickler.java:85) > at net.razorvine.pickle.Unpickler.loads(Unpickler.java:98) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:151) > at > org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:150) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.org$apache$spark$sql$execution$datasources$InsertIntoHadoopFsRelation$$writeRows$1(commands.scala:185) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:163) > at > 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$insert$1.apply(commands.scala:163) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:64) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > According to [~davies] timezone serialization is done directly in Spark and > not dependent on Pyrolite, but I was not able to prove that. > Upgrading to Pyrolite 4.9 fixed this issue > https://github.com/apache/spark/pull/7950 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10994) Clustering coefficient computation in GraphX
[ https://issues.apache.org/jira/browse/SPARK-10994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963806#comment-14963806 ] Reynold Xin commented on SPARK-10994: - [~sherlockbourne] I am sure this is a pretty good algorithm, but lately we have been pushing to have implementations like this maintained outside of Spark, as packages on http://spark-packages.org/ In many ways it is better for this to be maintained outside: 1. You can iterate on it really quickly without the overhead of the Apache Software Foundation processes. 2. You can promote this more easily, since with so many changes in each Spark release, it is getting harder and harder for users to discover new features. If this is a 3rd-party package, you can write dedicated blog posts and have good entry-point READMEs on GitHub. 3. It is just as easy to use this. As soon as you publish the package to Maven, users can use the package directly in the repl by adding a command line flag. > Clustering coefficient computation in GraphX > > > Key: SPARK-10994 > URL: https://issues.apache.org/jira/browse/SPARK-10994 > Project: Spark > Issue Type: New Feature > Components: GraphX >Reporter: Yang Yang > Original Estimate: 336h > Remaining Estimate: 336h > > The Clustering Coefficient (CC) is a fundamental measure in social (or other > types of) network analysis assessing the degree to which nodes tend to cluster > together [1][2]. Clustering coefficient, along with density, node degree, > path length, diameter, connectedness, and node centrality, are the seven most > important properties to characterise a network [3]. > We found that GraphX has already implemented connectedness, node centrality, > and path length, but does not have a component for computing the clustering > coefficient. This was in fact our original motivation for implementing an > algorithm to compute the clustering coefficient for each vertex of a given graph. 
> Clustering coefficient is very helpful in many real-world applications, such as > user behaviour prediction and structure prediction (like link prediction). We > have used it in a number of papers (e.g., [4-5]), and have also found many other > publications using this metric in their work [6-8]. We are very > confident that this feature will benefit GraphX and attract a large number of > users. > References > [1] https://en.wikipedia.org/wiki/Clustering_coefficient > [2] Watts, Duncan J., and Steven H. Strogatz. "Collective dynamics of > ‘small-world’ networks." Nature 393.6684 (1998): 440-442. (with 27266 > citations). > [3] https://en.wikipedia.org/wiki/Network_science > [4] Jing Zhang, Zhanpeng Fang, Wei Chen, and Jie Tang. Diffusion of > "Following" Links in Microblogging Networks. IEEE Transactions on Knowledge > and Data Engineering (TKDE), Volume 27, Issue 8, 2015, Pages 2093-2106. > [5] Yang Yang, Jie Tang, Jacklyne Keomany, Yanting Zhao, Ying Ding, Juanzi > Li, and Liangwei Wang. Mining Competitive Relationships by Learning across > Heterogeneous Networks. In Proceedings of the Twenty-First Conference on > Information and Knowledge Management (CIKM'12). pp. 1432-1441. > [6] Clauset, Aaron, Cristopher Moore, and Mark EJ Newman. Hierarchical > structure and the prediction of missing links in networks. Nature 453.7191 > (2008): 98-101. (with 973 citations) > [7] Adamic, Lada A., and Eytan Adar. Friends and neighbors on the web. Social > Networks 25.3 (2003): 211-230. (1238 citations) > [8] Lichtenwalter, Ryan N., Jake T. Lussier, and Nitesh V. Chawla. New > perspectives and methods in link prediction. In KDD'10. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
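For readers unfamiliar with the metric, the local clustering coefficient of a vertex is the fraction of its neighbor pairs that are themselves connected. A minimal pure-Python sketch on a hypothetical four-vertex graph (a distributed GraphX version would instead build on triangle counting, e.g. the existing TriangleCount routine, to get the numerator):

```python
def local_clustering(adj):
    """Local clustering coefficient for each vertex of an undirected graph,
    given as an adjacency dict: vertex -> set of neighbors."""
    cc = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc[v] = 0.0  # fewer than two neighbors: coefficient undefined, use 0
            continue
        # Count edges among v's neighbors, each unordered pair once.
        links = sum(1 for u in nbrs for w in nbrs if u < w and w in adj[u])
        cc[v] = 2.0 * links / (k * (k - 1))
    return cc

# Triangle 0-1-2 plus a pendant vertex 3 attached to vertex 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(local_clustering(adj))
# Vertices 0 and 1 get 1.0, vertex 2 gets 1/3, the pendant vertex 3 gets 0.0.
```

The degree, triangle count, and neighbor sets used here map naturally onto vertex-centric graph computation, which is presumably why the reporters see it as a good fit for GraphX.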
[jira] [Commented] (SPARK-11186) Caseness inconsistency between SQLContext and HiveContext
[ https://issues.apache.org/jira/browse/SPARK-11186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14963835#comment-14963835 ] kevin yu commented on SPARK-11186: -- Hello Santiago: How did you run the above code? Did you get any stack trace? I tried it in the spark-shell and got the error below; it seems that SQLContext.catalog is a protected lazy value and can't be accessed from outside the class. scala> sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { | override def sqlContext: SQLContext = sqlc | override def schema: StructType = StructType(Nil) | })) :26: error: lazy value catalog in class SQLContext cannot be accessed in org.apache.spark.sql.SQLContext Access to protected value catalog not permitted because enclosing class $iwC is not a subclass of class SQLContext in package sql where target is defined sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new BaseRelation { > Caseness inconsistency between SQLContext and HiveContext > - > > Key: SPARK-11186 > URL: https://issues.apache.org/jira/browse/SPARK-11186 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1 >Reporter: Santiago M. Mola >Priority: Minor > > Default catalog behaviour for caseness is different in {{SQLContext}} and > {{HiveContext}}. 
> {code} > test("Catalog caseness (SQL)") { > val sqlc = new SQLContext(sc) > val relationName = "MyTable" > sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new > BaseRelation { > override def sqlContext: SQLContext = sqlc > override def schema: StructType = StructType(Nil) > })) > val tables = sqlc.tableNames() > assert(tables.contains(relationName)) > } > test("Catalog caseness (Hive)") { > val sqlc = new HiveContext(sc) > val relationName = "MyTable" > sqlc.catalog.registerTable(relationName :: Nil, LogicalRelation(new > BaseRelation { > override def sqlContext: SQLContext = sqlc > override def schema: StructType = StructType(Nil) > })) > val tables = sqlc.tableNames() > assert(tables.contains(relationName)) > } > {code} > Looking at {{HiveContext#SQLSession}}, I see this is the intended behaviour. > But the reason that this is needed seems undocumented (both in the manual or > in the source code comments). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11188) Elide stacktraces in bin/spark-sql for AnalysisExceptions
[ https://issues.apache.org/jira/browse/SPARK-11188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-11188: - Target Version/s: 1.4.2, 1.5.2, 1.6.0 (was: 1.6.0) > Elide stacktraces in bin/spark-sql for AnalysisExceptions > - > > Key: SPARK-11188 > URL: https://issues.apache.org/jira/browse/SPARK-11188 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust > > For analysis exceptions in the sql-shell, we should only print the error > message to the screen. The stacktrace will never have useful information > since this error is used to signify an error with the query. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11196) Support for equality and pushdown of filters on some UDTs
Michael Armbrust created SPARK-11196: Summary: Support for equality and pushdown of filters on some UDTs Key: SPARK-11196 URL: https://issues.apache.org/jira/browse/SPARK-11196 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Today, if you try to do any comparisons with UDTs, it fails due to bad casting. However, in some cases the UDT is just a thin wrapper around a SQL type (StringType, for example). In these cases we could just convert the UDT to its SQL type. Rough prototype: https://github.com/apache/spark/compare/apache:master...marmbrus:uuid-udt -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org