[jira] [Assigned] (SPARK-12993) Remove usage of ADD_FILES in pyspark

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12993:


Assignee: (was: Apache Spark)

> Remove usage of ADD_FILES in pyspark
> 
>
> Key: SPARK-12993
> URL: https://issues.apache.org/jira/browse/SPARK-12993
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>







[jira] [Assigned] (SPARK-12993) Remove usage of ADD_FILES in pyspark

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12993:


Assignee: Apache Spark

> Remove usage of ADD_FILES in pyspark
> 
>
> Key: SPARK-12993
> URL: https://issues.apache.org/jira/browse/SPARK-12993
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-12993) Remove usage of ADD_FILES in pyspark

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116647#comment-15116647
 ] 

Apache Spark commented on SPARK-12993:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10913

> Remove usage of ADD_FILES in pyspark
> 
>
> Key: SPARK-12993
> URL: https://issues.apache.org/jira/browse/SPARK-12993
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>







[jira] [Commented] (SPARK-11780) Provide type aliases in org.apache.spark.sql.types for backwards compatibility

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116724#comment-15116724
 ] 

Apache Spark commented on SPARK-11780:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10915

> Provide type aliases in org.apache.spark.sql.types for backwards compatibility
> --
>
> Key: SPARK-11780
> URL: https://issues.apache.org/jira/browse/SPARK-11780
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Santiago M. Mola
>Assignee: Santiago M. Mola
>
> With SPARK-11273, ArrayData, MapData and others were moved from  
> org.apache.spark.sql.types to org.apache.spark.sql.catalyst.util.
> Since this is a backward incompatible change, it would be good to provide 
> type aliases from the old package (deprecated) to the new one.
> For example:
> {code}
> package object types {
>@deprecated
>type ArrayData = org.apache.spark.sql.catalyst.util.ArrayData
> }
> {code}






[jira] [Updated] (SPARK-12993) Remove usage of ADD_FILES in pyspark

2016-01-25 Thread Jeff Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Zhang updated SPARK-12993:
---
Description: The environment variable ADD_FILES was created for adding Python 
files to the Spark context (SPARK-865); it is deprecated now. Users are encouraged 
to use --py-files to add Python files to executors.
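For illustration, a minimal PySpark sketch of the replacement route the description 
points to; the archive and script names below are hypothetical:

{code}
# Shell invocation (hypothetical files):
#   spark-submit --py-files deps.zip my_job.py
# Or, equivalently, from the driver program:
from pyspark import SparkContext

sc = SparkContext(appName="py-files-example")
sc.addPyFile("deps.zip")  # ships the archive so executors can import its modules
{code}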

> Remove usage of ADD_FILES in pyspark
> 
>
> Key: SPARK-12993
> URL: https://issues.apache.org/jira/browse/SPARK-12993
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Reporter: Jeff Zhang
>Priority: Minor
>
> The environment variable ADD_FILES was created for adding Python files to the 
> Spark context (SPARK-865); it is deprecated now. Users are encouraged to use 
> --py-files to add Python files to executors.






[jira] [Assigned] (SPARK-12937) Bloom filter serialization

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12937:


Assignee: Apache Spark  (was: Wenchen Fan)

> Bloom filter serialization
> --
>
> Key: SPARK-12937
> URL: https://issues.apache.org/jira/browse/SPARK-12937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-12937) Bloom filter serialization

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116855#comment-15116855
 ] 

Apache Spark commented on SPARK-12937:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10920

> Bloom filter serialization
> --
>
> Key: SPARK-12937
> URL: https://issues.apache.org/jira/browse/SPARK-12937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>







[jira] [Assigned] (SPARK-12937) Bloom filter serialization

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12937:


Assignee: Wenchen Fan  (was: Apache Spark)

> Bloom filter serialization
> --
>
> Key: SPARK-12937
> URL: https://issues.apache.org/jira/browse/SPARK-12937
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Cheng Lian
>Assignee: Wenchen Fan
>







[jira] [Commented] (SPARK-12984) Not able to read CSV file using Spark 1.4.0

2016-01-25 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116645#comment-15116645
 ] 

Felix Cheung commented on SPARK-12984:
--

You should specify 'source'; otherwise it defaults to parquet, and it seems it 
fails trying to read the file as parquet.

> Not able to read CSV file using Spark 1.4.0
> ---
>
> Key: SPARK-12984
> URL: https://issues.apache.org/jira/browse/SPARK-12984
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.4.0
> Environment: Unix
> Hadoop 2.7.1.2.3.0.0-2557
> R 3.1.1
> Don't have Internet on the server
>Reporter: Jai Murugesh Rajasekaran
>
> Hi,
> We are trying to read a CSV file
> Downloaded the following CSV-related packages (jar files) and configured them 
> using Maven:
> 1. spark-csv_2.10-1.2.0.jar
> 2. spark-csv_2.10-1.2.0-sources.jar
> 3. spark-csv_2.10-1.2.0-javadoc.jar
> Trying to execute the following script:
> > library(SparkR)
> > sc <- sparkR.init(appName="SparkR-DataFrame")
> Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or 
> restart R to create a new Spark Context
> > sqlContext <- sparkRSQL.init(sc)
> > setwd("/home/s/")
> > getwd()
> [1] "/home/s"
> > path <- file.path("Sample.csv")
> > Test <- read.df(sqlContext, path)
> Note: I am able to read the CSV file using regular R functions, but when I tried 
> the SparkR functions I ended up with an error.
> Initiated SparkR
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> Error Messages/Log
> $ sh -x sparkR -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> +++ dirname sparkR
> ++ cd ./..
> ++ pwd
> + export SPARK_HOME=/opt/spark-1.4.0
> + SPARK_HOME=/opt/spark-1.4.0
> + source /opt/spark-1.4.0/bin/load-spark-env.sh
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ FWDIR=/opt/spark-1.4.0
> ++ '[' -z '' ']'
> ++ export SPARK_ENV_LOADED=1
> ++ SPARK_ENV_LOADED=1
>  dirname sparkR
> +++ cd ./..
> +++ pwd
> ++ parent_dir=/opt/spark-1.4.0
> ++ user_conf_dir=/opt/spark-1.4.0/conf
> ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
> ++ set -a
> ++ . /opt/spark-1.4.0/conf/spark-env.sh
> +++ export SPARK_HOME=/opt/spark-1.4.0
> +++ SPARK_HOME=/opt/spark-1.4.0
> +++ export YARN_CONF_DIR=/etc/hadoop/conf
> +++ YARN_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ export HADOOP_CONF_DIR=/etc/hadoop/conf
> +++ HADOOP_CONF_DIR=/etc/hadoop/conf
> ++ set +a
> ++ '[' -z '' ']'
> ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
> ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
> ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
> ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
> ++ export SPARK_SCALA_VERSION=2.10
> ++ SPARK_SCALA_VERSION=2.10
> + export -f usage
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *--help ]]
> + [[ -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
>  = *-h ]]
> + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories 
> /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
> R version 3.1.1 (2014-07-10) -- "Sock it to Me"
> Copyright (C) 2014 The R Foundation for Statistical Computing
> Platform: x86_64-unknown-linux-gnu (64-bit)
> R is free software and comes with ABSOLUTELY NO WARRANTY.
> You are welcome to redistribute it under certain conditions.
> Type 'license()' or 'licence()' for distribution details.
>   Natural language support but running in an English locale
> R is a collaborative 

[jira] [Resolved] (SPARK-11922) Python API for ml.feature.QuantileDiscretizer

2016-01-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11922.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10085
[https://github.com/apache/spark/pull/10085]

> Python API for ml.feature.QuantileDiscretizer
> --
>
> Key: SPARK-11922
> URL: https://issues.apache.org/jira/browse/SPARK-11922
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: holdenk
>Priority: Minor
> Fix For: 2.0.0
>
>
> Add Python API for ml.feature.QuantileDiscretizer.
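For reference, a minimal PySpark usage sketch of the API this ticket adds (column 
names and data are made up; the snippet assumes the 2.0 API this fix targets):

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.ml.feature import QuantileDiscretizer

sc = SparkContext(appName="quantile-discretizer-example")
sqlContext = SQLContext(sc)

df = sqlContext.createDataFrame(
    [(0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2)], ["id", "hour"])

# Bin the continuous "hour" column into 3 quantile-based buckets.
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="hour", outputCol="bucket")
discretizer.fit(df).transform(df).show()
{code}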






[jira] [Created] (SPARK-12997) Use cast expression to perform type cast in csv

2016-01-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12997:
---

 Summary: Use cast expression to perform type cast in csv
 Key: SPARK-12997
 URL: https://issues.apache.org/jira/browse/SPARK-12997
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin


CSVTypeCast.castTo should probably be removed, and its usage replaced with a 
projection that uses a sequence of Cast expressions.







[jira] [Updated] (SPARK-12977) Factoring out StreamingListener and UI to support history UI

2016-01-25 Thread Saisai Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Saisai Shao updated SPARK-12977:

Attachment: screenshot-1.png

> Factoring out StreamingListener and UI to support history UI
> 
>
> Key: SPARK-12977
> URL: https://issues.apache.org/jira/browse/SPARK-12977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
> Attachments: screenshot-1.png
>
>







[jira] [Created] (SPARK-12994) It is not necessary to create ExecutorAllocationManager in local mode

2016-01-25 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12994:
--

 Summary: It is not necessary to create ExecutorAllocationManager 
in local mode
 Key: SPARK-12994
 URL: https://issues.apache.org/jira/browse/SPARK-12994
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Jeff Zhang
Priority: Minor









[jira] [Assigned] (SPARK-12995) Remove deprecate APIs from Pregel

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12995:


Assignee: (was: Apache Spark)

> Remove deprecate APIs from Pregel
> -
>
> Key: SPARK-12995
> URL: https://issues.apache.org/jira/browse/SPARK-12995
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>







[jira] [Commented] (SPARK-12995) Remove deprecate APIs from Pregel

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116773#comment-15116773
 ] 

Apache Spark commented on SPARK-12995:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10918

> Remove deprecate APIs from Pregel
> -
>
> Key: SPARK-12995
> URL: https://issues.apache.org/jira/browse/SPARK-12995
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>







[jira] [Assigned] (SPARK-12995) Remove deprecate APIs from Pregel

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12995:


Assignee: Apache Spark

> Remove deprecate APIs from Pregel
> -
>
> Key: SPARK-12995
> URL: https://issues.apache.org/jira/browse/SPARK-12995
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>







[jira] [Commented] (SPARK-12977) Factoring out StreamingListener and UI to support history UI

2016-01-25 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116844#comment-15116844
 ] 

Saisai Shao commented on SPARK-12977:
-

Attaching the current work in progress; some problems still need to be fixed 
before the patch can be delivered.

> Factoring out StreamingListener and UI to support history UI
> 
>
> Key: SPARK-12977
> URL: https://issues.apache.org/jira/browse/SPARK-12977
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Affects Versions: 1.6.0
>Reporter: Saisai Shao
> Attachments: screenshot-1.png
>
>







[jira] [Created] (SPARK-12996) CSVRelation should be based on HadoopFsRelation

2016-01-25 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-12996:
---

 Summary: CSVRelation should be based on HadoopFsRelation
 Key: SPARK-12996
 URL: https://issues.apache.org/jira/browse/SPARK-12996
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Reynold Xin









[jira] [Commented] (SPARK-12996) CSVRelation should be based on HadoopFsRelation

2016-01-25 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116774#comment-15116774
 ] 

Reynold Xin commented on SPARK-12996:
-

cc [~hyukjin.kwon] would you be interested in fixing this?


> CSVRelation should be based on HadoopFsRelation
> ---
>
> Key: SPARK-12996
> URL: https://issues.apache.org/jira/browse/SPARK-12996
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>







[jira] [Closed] (SPARK-12702) Populate statistics for DataFrame when reading CSV

2016-01-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12702.
---
Resolution: Duplicate

Closing this because it is just part of SPARK-12996.


> Populate statistics for DataFrame when reading CSV
> --
>
> Key: SPARK-12702
> URL: https://issues.apache.org/jira/browse/SPARK-12702
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>







[jira] [Closed] (SPARK-12670) Use spark internal utilities wherever possible

2016-01-25 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-12670.
---
Resolution: Won't Fix

Going to close this one since it is a little bit too broad.


> Use spark internal utilities wherever possible
> --
>
> Key: SPARK-12670
> URL: https://issues.apache.org/jira/browse/SPARK-12670
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Hossein Falaki
>
> The initial code from spark-csv does not rely on Spark's internal utilities 
> to maintain backward compatibility across multiple versions of Spark. 
> * Type casting utilities
> * Schema inference utilities
> * Unit test utilities






[jira] [Assigned] (SPARK-12968) Implement command to set current database

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12968:


Assignee: (was: Apache Spark)

> Implement command to set current database
> -
>
> Key: SPARK-12968
> URL: https://issues.apache.org/jira/browse/SPARK-12968
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> We currently delegate the "use database" command to Hive. We should implement 
> this in Spark.
> The reason this is important: as soon as we can track the current database, we 
> can remove the dependency on session state for the catalog API. Right now the 
> implementation of Catalog actually needs to handle session information itself.
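For context, a minimal PySpark sketch of the statement in question (the database 
name is hypothetical); today the statement is handed off to Hive rather than 
tracked by Spark's own catalog:

{code}
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="use-database-example")
sqlContext = HiveContext(sc)

sqlContext.sql("USE my_database")     # currently delegated to Hive (this ticket)
sqlContext.sql("SHOW TABLES").show()  # later statements resolve against my_database
{code}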






[jira] [Assigned] (SPARK-12968) Implement command to set current database

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12968:


Assignee: Apache Spark

> Implement command to set current database
> -
>
> Key: SPARK-12968
> URL: https://issues.apache.org/jira/browse/SPARK-12968
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Apache Spark
>Priority: Critical
>
> We currently delegate the "use database" command to Hive. We should implement 
> this in Spark.
> The reason this is important: as soon as we can track the current database, we 
> can remove the dependency on session state for the catalog API. Right now the 
> implementation of Catalog actually needs to handle session information itself.






[jira] [Updated] (SPARK-12995) Remove deprecate APIs from Pregel

2016-01-25 Thread Takeshi Yamamuro (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-12995:
-
Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-11806

> Remove deprecate APIs from Pregel
> -
>
> Key: SPARK-12995
> URL: https://issues.apache.org/jira/browse/SPARK-12995
> Project: Spark
>  Issue Type: Sub-task
>  Components: GraphX
>Affects Versions: 1.6.0
>Reporter: Takeshi Yamamuro
>







[jira] [Created] (SPARK-12993) Remove usage of ADD_FILES in pyspark

2016-01-25 Thread Jeff Zhang (JIRA)
Jeff Zhang created SPARK-12993:
--

 Summary: Remove usage of ADD_FILES in pyspark
 Key: SPARK-12993
 URL: https://issues.apache.org/jira/browse/SPARK-12993
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Reporter: Jeff Zhang
Priority: Minor









[jira] [Assigned] (SPARK-12994) It is not necessary to create ExecutorAllocationManager in local mode

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12994:


Assignee: (was: Apache Spark)

> It is not necessary to create ExecutorAllocationManager in local mode
> -
>
> Key: SPARK-12994
> URL: https://issues.apache.org/jira/browse/SPARK-12994
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jeff Zhang
>Priority: Minor
>







[jira] [Assigned] (SPARK-12994) It is not necessary to create ExecutorAllocationManager in local mode

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12994:


Assignee: Apache Spark

> It is not necessary to create ExecutorAllocationManager in local mode
> -
>
> Key: SPARK-12994
> URL: https://issues.apache.org/jira/browse/SPARK-12994
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jeff Zhang
>Assignee: Apache Spark
>Priority: Minor
>







[jira] [Commented] (SPARK-12994) It is not necessary to create ExecutorAllocationManager in local mode

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116693#comment-15116693
 ] 

Apache Spark commented on SPARK-12994:
--

User 'zjffdu' has created a pull request for this issue:
https://github.com/apache/spark/pull/10914

> It is not necessary to create ExecutorAllocationManager in local mode
> -
>
> Key: SPARK-12994
> URL: https://issues.apache.org/jira/browse/SPARK-12994
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jeff Zhang
>Priority: Minor
>







[jira] [Commented] (SPARK-12968) Implement command to set current database

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116735#comment-15116735
 ] 

Apache Spark commented on SPARK-12968:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/10916

> Implement command to set current database
> -
>
> Key: SPARK-12968
> URL: https://issues.apache.org/jira/browse/SPARK-12968
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> We currently delegate the "use database" command to Hive. We should implement 
> this in Spark.
> The reason this is important: as soon as we can track the current database, we 
> can remove the dependency on session state for the catalog API. Right now the 
> implementation of Catalog actually needs to handle session information itself.






[jira] [Created] (SPARK-12995) Remove deprecate APIs from Pregel

2016-01-25 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-12995:


 Summary: Remove deprecate APIs from Pregel
 Key: SPARK-12995
 URL: https://issues.apache.org/jira/browse/SPARK-12995
 Project: Spark
  Issue Type: Improvement
  Components: GraphX
Affects Versions: 1.6.0
Reporter: Takeshi Yamamuro









[jira] [Commented] (SPARK-12888) benchmark the new hash expression

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116747#comment-15116747
 ] 

Apache Spark commented on SPARK-12888:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/10917

> benchmark the new hash expression
> -
>
> Key: SPARK-12888
> URL: https://issues.apache.org/jira/browse/SPARK-12888
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.0.0
>
>







[jira] [Updated] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList

2016-01-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12834:
--
Assignee: Xusen Yin

> Use type conversion instead of Ser/De of Pickle to transform JavaArray and 
> JavaList
> ---
>
> Key: SPARK-12834
> URL: https://issues.apache.org/jira/browse/SPARK-12834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Xusen Yin
>Assignee: Xusen Yin
> Fix For: 2.0.0
>
>
> According to the Ser/De code on the Python side:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
>   def _java2py(sc, r, encoding="bytes"):
> if isinstance(r, JavaObject):
> clsName = r.getClass().getSimpleName()
> # convert RDD into JavaRDD
> if clsName != 'JavaRDD' and clsName.endswith("RDD"):
> r = r.toJavaRDD()
> clsName = 'JavaRDD'
> if clsName == 'JavaRDD':
> jrdd = sc._jvm.SerDe.javaToPython(r)
> return RDD(jrdd, sc)
> if clsName == 'DataFrame':
> return DataFrame(r, SQLContext.getOrCreate(sc))
> if clsName in _picklable_classes:
> r = sc._jvm.SerDe.dumps(r)
> elif isinstance(r, (JavaArray, JavaList)):
> try:
> r = sc._jvm.SerDe.dumps(r)
> except Py4JJavaError:
> pass  # not pickable
> if isinstance(r, (bytearray, bytes)):
> r = PickleSerializer().loads(bytes(r), encoding=encoding)
> return r
> {code}
> We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI and 
> then deserialize them with PickleSerializer on the Python side. However, there 
> is no need to transform them in such an inefficient way. Instead, we can use a 
> plain type conversion, e.g. list(JavaArray) or list(JavaList).
> What's more, there is an issue with Ser/De of Scala Array, as I noted in 
> https://issues.apache.org/jira/browse/SPARK-12780
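A minimal sketch of the type-conversion idea described above; the helper below is 
illustrative only, not the actual patch:

{code}
from py4j.java_collections import JavaArray, JavaList

def java_collection_to_list(r):
    # Instead of a SerDe.dumps(...) / PickleSerializer round-trip, convert
    # py4j arrays and lists with a plain Python type conversion.
    if isinstance(r, (JavaArray, JavaList)):
        return list(r)
    return r
{code}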






[jira] [Resolved] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList

2016-01-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12834.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10772
[https://github.com/apache/spark/pull/10772]

> Use type conversion instead of Ser/De of Pickle to transform JavaArray and 
> JavaList
> ---
>
> Key: SPARK-12834
> URL: https://issues.apache.org/jira/browse/SPARK-12834
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Xusen Yin
>Assignee: Xusen Yin
> Fix For: 2.0.0
>
>
> According to the Ser/De code on the Python side:
> {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false}
>   def _java2py(sc, r, encoding="bytes"):
> if isinstance(r, JavaObject):
> clsName = r.getClass().getSimpleName()
> # convert RDD into JavaRDD
> if clsName != 'JavaRDD' and clsName.endswith("RDD"):
> r = r.toJavaRDD()
> clsName = 'JavaRDD'
> if clsName == 'JavaRDD':
> jrdd = sc._jvm.SerDe.javaToPython(r)
> return RDD(jrdd, sc)
> if clsName == 'DataFrame':
> return DataFrame(r, SQLContext.getOrCreate(sc))
> if clsName in _picklable_classes:
> r = sc._jvm.SerDe.dumps(r)
> elif isinstance(r, (JavaArray, JavaList)):
> try:
> r = sc._jvm.SerDe.dumps(r)
> except Py4JJavaError:
> pass  # not pickable
> if isinstance(r, (bytearray, bytes)):
> r = PickleSerializer().loads(bytes(r), encoding=encoding)
> return r
> {code}
> We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI and 
> then deserialize them with PickleSerializer on the Python side. However, there 
> is no need to transform them in such an inefficient way. Instead, we can use a 
> plain type conversion, e.g. list(JavaArray) or list(JavaList).
> What's more, there is an issue with Ser/De of Scala Array, as I noted in 
> https://issues.apache.org/jira/browse/SPARK-12780






[jira] [Resolved] (SPARK-12973) Support to set priority when submit spark application to YARN

2016-01-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12973.
---
Resolution: Duplicate

> Support to set priority when submit spark application to YARN
> -
>
> Key: SPARK-12973
> URL: https://issues.apache.org/jira/browse/SPARK-12973
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.6.1
>Reporter: Chaozhong Yang
>







[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115204#comment-15115204
 ] 

Hyukjin Kwon commented on SPARK-12890:
--

Actually, I still don't understand what the issue is here. This might not be 
merging schemas, as that is disabled by default, and no filter is being pushed 
down here.

I mean, the referenced column would be {{date}} and the given filters would be 
empty, so it tries to read all the files regardless of file format, as long as 
the format supports partitioned files.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Created] (SPARK-12979) Paths are resolved relative to the local file system

2016-01-25 Thread Iulian Dragos (JIRA)
Iulian Dragos created SPARK-12979:
-

 Summary: Paths are resolved relative to the local file system
 Key: SPARK-12979
 URL: https://issues.apache.org/jira/browse/SPARK-12979
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 1.6.0
Reporter: Iulian Dragos


Spark properties that refer to paths on the cluster (for example, 
`spark.mesos.executor.home`) should be un-interpreted strings. Currently, such 
a path is resolved relative to the local (client) file system, and symlinks are 
resolved, etc. (by calling `getCanonicalPath`).






[jira] [Commented] (SPARK-12968) Implement command to set current database

2016-01-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115288#comment-15115288
 ] 

Herman van Hovell commented on SPARK-12968:
---

I don't mind if you go ahead and work on this. The only thing is that we need 
to be a bit careful around SET commands. They currently won't work properly 
because the SparkSQLParser interprets them as properties being set. I am 
working on the latter.

> Implement command to set current database
> -
>
> Key: SPARK-12968
> URL: https://issues.apache.org/jira/browse/SPARK-12968
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Priority: Critical
>
> We currently delegate the "use database" command to Hive. We should implement 
> this in Spark.
> The reason this is important: as soon as we can track the current database, we 
> can remove the dependency on session state for the catalog API. Right now the 
> implementation of Catalog actually needs to handle session information itself.






[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115148#comment-15115148
 ] 

Liang-Chi Hsieh edited comment on SPARK-12890 at 1/25/16 12:46 PM:
---

As {{DataFrame.parquet}} accepts paths as parameters, you are already specifying 
which partitions to scan.


was (Author: viirya):
As {{DataFrame.parquet}} accepts paths as parameters, your partition information 
can already be embedded in the paths?

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115148#comment-15115148
 ] 

Liang-Chi Hsieh commented on SPARK-12890:
-

As {{DataFrame.parquet}} accepts paths as parameters, your partition information 
can already be embedded in the paths?
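For illustration, a PySpark sketch of that idea (paths and layout are 
hypothetical): point the reader at the partition directories you need instead of 
the table root.

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

sc = SparkContext(appName="partition-path-example")
sqlContext = SQLContext(sc)

# Hypothetical layout: /data/events/date=2016-01-25/part-*.parquet
full = sqlContext.read.parquet("/data/events")  # as reported, this scans every partition
full.agg(F.max("date")).show()

one_day = sqlContext.read.parquet("/data/events/date=2016-01-25")  # only that directory
{code}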

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115257#comment-15115257
 ] 

Takeshi Yamamuro commented on SPARK-12890:
--

Ah, I see.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Created] (SPARK-12980) pyspark crash for large dataset - clone

2016-01-25 Thread Christopher Bourez (JIRA)
Christopher Bourez created SPARK-12980:
--

 Summary: pyspark crash for large dataset - clone
 Key: SPARK-12980
 URL: https://issues.apache.org/jira/browse/SPARK-12980
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.2
 Environment: windows
Reporter: Christopher Bourez


I tried to import a local text file (over 100 MB) via textFile in PySpark. When I 
ran data.take(), it failed and gave error messages including:
15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
aborting job
Traceback (most recent call last):
  File "E:/spark_python/test3.py", line 9, in 
lines.take(5)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
in take
res = self.context.runJob(self, takeUpToNumLeft, p)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
916, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
  File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
__call__
answer, self.gateway_client, self.target_id, self.name)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
36, in deco
return f(*a, **kw)
  File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.net.SocketException: Connection reset by peer: socket write 
error

Then I ran the same code on a small text file; this time .take() worked fine.
How can I solve this problem?






[jira] [Updated] (SPARK-12980) pyspark crash for large dataset - clone

2016-01-25 Thread Christopher Bourez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Bourez updated SPARK-12980:
---
Description: 
I installed Spark 1.6 on many different computers.

On Windows, the PySpark textFile method, followed by take(1), does not work on a 
file of 13 MB.
If I set numPartitions to 2000 or use a smaller file, the method works well.
PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g in 
local mode.

On Mac OS, I'm able to run the exact same program with PySpark with 16 GB of RAM 
on a much bigger file of 5 GB; memory is correctly allocated and freed, etc.

On Ubuntu, no trouble; I can also launch a cluster: 
http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html

What could be the reason for the Windows Spark textFile method to fail?

  was:
I tried to import a local text file (over 100 MB) via textFile in PySpark. When I 
ran data.take(), it failed and gave error messages including:
15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
aborting job
Traceback (most recent call last):
  File "E:/spark_python/test3.py", line 9, in 
lines.take(5)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
in take
res = self.context.runJob(self, takeUpToNumLeft, p)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
916, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
partitions)
  File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
__call__
answer, self.gateway_client, self.target_id, self.name)
  File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
36, in deco
return f(*a, **kw)
  File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 
0, localhost): java.net.SocketException: Connection reset by peer: socket write 
error

Then I ran the same code on a small text file; this time .take() worked fine.
How can I solve this problem?


> pyspark crash for large dataset - clone
> ---
>
> Key: SPARK-12980
> URL: https://issues.apache.org/jira/browse/SPARK-12980
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: Christopher Bourez
>
> I installed Spark 1.6 on many different computers.
> On Windows, the PySpark textFile method, followed by take(1), does not work on 
> a file of 13 MB.
> If I set numPartitions to 2000 or use a smaller file, the method works well.
> PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g in 
> local mode.
> On Mac OS, I'm able to run the exact same program with PySpark with 16 GB of 
> RAM on a much bigger file of 5 GB; memory is correctly allocated and freed, etc.
> On Ubuntu, no trouble; I can also launch a cluster: 
> http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html
> What could be the reason for the Windows Spark textFile method to fail?






[jira] [Updated] (SPARK-12928) Oracle FLOAT datatype is not properly handled when reading via JDBC

2016-01-25 Thread Greg Michalopoulos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Michalopoulos updated SPARK-12928:
---
Description: 
When trying to read in a table from Oracle and saveAsParquet, an 
IllegalArgumentException is thrown when a column of FLOAT datatype is 
encountered.

Below is the code being run:
{code}val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> jdbcConnectionString,
  "dbtable" -> "(select someFloat from someTable)",
  "fetchSize" -> fetchSize)).load()

  jdbcDF.saveAsParquetFile(destinationDirectory + table)
{code}

Here is the exception:
{code}java.lang.IllegalArgumentException: Unsupported dataType: 
{"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]},
 [1.1] failure: `TimestampType' expected but `{' found
{code}

From the exception it was clear that the FLOAT datatype was presenting itself 
as scale -127, which appears to be the problem. 


  was:
When trying to read in a table from Oracle and saveAsParquet, an 
IllegalArgumentException is thrown when a column of FLOAT datatype is 
encountered.

Below is the code being run:
{code}val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> jdbcConnectionString,
  "dbtable" -> "(select someFloat from someTable"),
  "fetchSize" -> fetchSize)).load()

  jdbcDF.saveAsParquetFile(destinationDirectory + table)
{code}

Here is the exception:
{code}java.lang.IllegalArgumentException: Unsupported dataType: 
{"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]},
 [1.1] failure: `TimestampType' expected but `{' found
{code}

From the exception it was clear that the FLOAT datatype was presenting itself 
as scale -127, which appears to be the problem. 



> Oracle FLOAT datatype is not properly handled when reading via JDBC
> ---
>
> Key: SPARK-12928
> URL: https://issues.apache.org/jira/browse/SPARK-12928
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Oracle Database 11g Enterprise Edition   11.2.0.3.0  
> 64bit Production
>Reporter: Greg Michalopoulos
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When trying to read in a table from Oracle and saveAsParquet, an 
> IllegalArgumentException is thrown when a column of FLOAT datatype is 
> encountered.
> Below is the code being run:
> {code}val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> jdbcConnectionString,
>   "dbtable" -> "(select someFloat from someTable)",
>   "fetchSize" -> fetchSize)).load()
>   jdbcDF.saveAsParquetFile(destinationDirectory + table)
> {code}
> Here is the exception:
> {code}java.lang.IllegalArgumentException: Unsupported dataType: 
> {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]},
>  [1.1] failure: `TimestampType' expected but `{' found
> {code}
> From the exception it was clear that the FLOAT datatype was presenting itself 
> as scale -127 which appears to be the problem. 
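One possible workaround, offered here only as an assumption (it is not from this 
ticket): cast the Oracle FLOAT to a NUMBER with an explicit scale inside the 
pushed-down subquery, so the JDBC reader never sees scale -127. A PySpark sketch 
with hypothetical connection details (the Oracle JDBC driver must be on the 
classpath):

{code}
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="oracle-float-workaround")
sqlContext = SQLContext(sc)

# Hypothetical URL, table and column names; the CAST pins the decimal scale.
jdbc_df = sqlContext.read.format("jdbc").options(
    url="jdbc:oracle:thin:@//dbhost:1521/SERVICE",
    dbtable="(select cast(someFloat as number(38,10)) as someFloat from someTable)",
    fetchSize="1000").load()

jdbc_df.write.parquet("/tmp/someTable")
{code}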






[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115076#comment-15115076
 ] 

Takeshi Yamamuro edited comment on SPARK-12890 at 1/25/16 1:32 PM:
---

I looked over the related code; the partition pruning optimization itself has been 
implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrameReader#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).


was (Author: maropu):
I looked over the related code; the partition pruning optimization itself has been 
implemented in 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74.
However, there is no interface in DataFrame#parquet to pass partition 
information 
(https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321).

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115204#comment-15115204
 ] 

Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 1:44 PM:
---

Actually, I still don't understand what the issue is here. This might not be 
related to merging schemas, as that is disabled by default, and no filter is 
being pushed down here. It does not automatically create a filter and push it 
down, as far as I know.

I mean, the referenced column would be {{date}} and the given filters would be 
empty, so it tries to read all the files regardless of file format, as long as 
the format supports partitioned files.


was (Author: hyukjin.kwon):
Actually, I still don't understand what the issue is here. This might not be 
merging schemas, as that is disabled by default, and no filter is being pushed 
down here.

I mean, the referenced column would be {{date}} and the given filters would be 
empty, so it tries to read all the files regardless of file format, as long as 
the format supports partitioned files.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.






[jira] [Updated] (SPARK-12980) pyspark crash for large dataset - clone

2016-01-25 Thread Christopher Bourez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Bourez updated SPARK-12980:
---
Description: 
I installed Spark 1.6 on many different computers.

On Windows, the PySpark textFile method, followed by take(1), does not work on a 
file of 13 MB.
If I set numPartitions to 2000 or use a smaller file, the method works well.
PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g in 
local mode.

On Mac OS, I'm able to run the exact same program with PySpark with 16 GB of RAM 
on a much bigger file of 5 GB; memory is correctly allocated and freed, etc.

On Ubuntu, no trouble; I can also launch a cluster: 
http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html

The error message on Windows is: java.net.SocketException: Connection reset by 
peer: socket write error
What could be the reason for the Windows Spark textFile method to fail?

  was:
I installed Spark 1.6 on many different computers.

On Windows, the PySpark textFile method, followed by take(1), does not work on a 
file of 13 MB.
If I set numPartitions to 2000 or use a smaller file, the method works well.
PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g in 
local mode.

On Mac OS, I'm able to run the exact same program with PySpark with 16 GB of RAM 
on a much bigger file of 5 GB; memory is correctly allocated and freed, etc.

On Ubuntu, no trouble; I can also launch a cluster: 
http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html

What could be the reason for the Windows Spark textFile method to fail?


> pyspark crash for large dataset - clone
> ---
>
> Key: SPARK-12980
> URL: https://issues.apache.org/jira/browse/SPARK-12980
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: Christopher Bourez
>
> I installed Spark 1.6 on many different computers.
> On Windows, the PySpark textFile method, followed by take(1), does not work on 
> a 13 MB file.
> If I set numPartitions to 2000 or use a smaller file, the method works well.
> PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g 
> in local mode.
> On Mac OS, with 16 GB of RAM, I'm able to run the exact same program on a much 
> bigger file of 5 GB. Memory is correctly allocated, released, etc.
> On Ubuntu, no trouble; I can also launch a cluster: 
> http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html
> The error message on Windows is: java.net.SocketException: Connection reset 
> by peer: socket write error
> What could be the reason for the Windows Spark textFile method to fail?
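
A Scala sketch of the repro described above (the file path is a placeholder); 
the PySpark calls in the report behave the same way.

{code}
val rdd = sc.textFile("C:/data/sample-13mb.txt")
rdd.take(1)                                       // reported to fail on Windows

val partitioned = sc.textFile("C:/data/sample-13mb.txt", 2000)
partitioned.take(1)                               // reported to work when split into many partitions
{code}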



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10911) Executors should System.exit on clean shutdown

2016-01-25 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115225#comment-15115225
 ] 

Thomas Graves commented on SPARK-10911:
---

see the pull request for comments and discussion 
https://github.com/apache/spark/pull/9946

> Executors should System.exit on clean shutdown
> --
>
> Key: SPARK-10911
> URL: https://issues.apache.org/jira/browse/SPARK-10911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.5.1
>Reporter: Thomas Graves
>Assignee: Zhuo Liu
>Priority: Minor
>
> Executors should call System.exit on clean shutdown to make sure all user 
> threads exit and jvm shuts down.
> We ran into a case where an Executor was left around for days trying to 
> shutdown because the user code was using a non-daemon thread pool and one of 
> those threads wasn't exiting.  We should force the jvm to go away with 
> System.exit.
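
A small Scala illustration of the failure mode described above, as a sketch 
only: a non-daemon user thread keeps the executor JVM alive, which a 
System.exit call on clean shutdown would cut short.

{code}
// A non-daemon thread like this blocks normal JVM exit.
val worker = new Thread(new Runnable {
  override def run(): Unit = while (true) Thread.sleep(60000)
})
worker.setDaemon(false)
worker.start()
// Calling System.exit(0) in the executor's shutdown path forces the JVM down
// regardless of such threads, which is what this issue proposes.
{code}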



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3611) Show number of cores for each executor in application web UI

2016-01-25 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115309#comment-15115309
 ] 

Thomas Graves commented on SPARK-3611:
--

I know the pull request was closed because this information could not be 
reliably obtained; it looks like it is now available through the ExecutorInfo 
structure.
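
A minimal sketch (assuming the 1.x listener API) of reading per-executor core 
counts from ExecutorInfo:

{code}
import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

class CoreCountListener extends SparkListener {
  override def onExecutorAdded(event: SparkListenerExecutorAdded): Unit = {
    // totalCores comes from the ExecutorInfo attached to the event
    println(s"executor ${event.executorId}: ${event.executorInfo.totalCores} cores")
  }
}

// sc.addSparkListener(new CoreCountListener())
{code}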

> Show number of cores for each executor in application web UI
> 
>
> Key: SPARK-3611
> URL: https://issues.apache.org/jira/browse/SPARK-3611
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Matei Zaharia
>Priority: Minor
>  Labels: starter
>
> This number is not always fully known, because e.g. in Mesos your executors 
> can scale up and down in # of CPUs, but it would be nice to show at least the 
> number of cores the machine has in that case, or the # of cores the executor 
> has been configured with if known.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12928) Oracle FLOAT datatype is not properly handled when reading via JDBC

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12928:


Assignee: (was: Apache Spark)

> Oracle FLOAT datatype is not properly handled when reading via JDBC
> ---
>
> Key: SPARK-12928
> URL: https://issues.apache.org/jira/browse/SPARK-12928
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Oracle Database 11g Enterprise Edition   11.2.0.3.0  
> 64bit Production
>Reporter: Greg Michalopoulos
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When trying to read in a table from Oracle and saveAsParquet, an 
> IllegalArgumentException is thrown when a column of FLOAT datatype is 
> encountered.
> Below is the code being run:
> {code}val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> jdbcConnectionString,
>   "dbtable" -> "(select someFloat from someTable)",
>   "fetchSize" -> fetchSize)).load()
>   jdbcDF.saveAsParquetFile(destinationDirectory + table)
> {code}
> Here is the exception:
> {code}java.lang.IllegalArgumentException: Unsupported dataType: 
> {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]},
>  [1.1] failure: `TimestampType' expected but `{' found
> {code}
> From the exception it was clear that the FLOAT datatype was presenting itself 
> as scale -127 which appears to be the problem. 
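
A possible workaround, sketched here and not part of the report: cast the 
Oracle FLOAT to a NUMBER with an explicit precision and scale inside the 
pushed-down query, so Spark never sees the (38,-127) decimal type. 
{{jdbcConnectionString}} and {{fetchSize}} are the values from the report.

{code}
val jdbcDF = sqlContext.read.format("jdbc").options(
  Map("url" -> jdbcConnectionString,
    "dbtable" -> "(select CAST(someFloat AS NUMBER(38,10)) AS someFloat from someTable)",
    "fetchSize" -> fetchSize)).load()
{code}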



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12928) Oracle FLOAT datatype is not properly handled when reading via JDBC

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115332#comment-15115332
 ] 

Apache Spark commented on SPARK-12928:
--

User 'poolis' has created a pull request for this issue:
https://github.com/apache/spark/pull/10899

> Oracle FLOAT datatype is not properly handled when reading via JDBC
> ---
>
> Key: SPARK-12928
> URL: https://issues.apache.org/jira/browse/SPARK-12928
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Oracle Database 11g Enterprise Edition   11.2.0.3.0  
> 64bit Production
>Reporter: Greg Michalopoulos
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When trying to read in a table from Oracle and saveAsParquet, an 
> IllegalArgumentException is thrown when a column of FLOAT datatype is 
> encountered.
> Below is the code being run:
> {code}val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> jdbcConnectionString,
>   "dbtable" -> "(select someFloat from someTable)",
>   "fetchSize" -> fetchSize)).load()
>   jdbcDF.saveAsParquetFile(destinationDirectory + table)
> {code}
> Here is the exception:
> {code}java.lang.IllegalArgumentException: Unsupported dataType: 
> {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]},
>  [1.1] failure: `TimestampType' expected but `{' found
> {code}
> From the exception it was clear that the FLOAT datatype was presenting itself 
> as scale -127 which appears to be the problem. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12928) Oracle FLOAT datatype is not properly handled when reading via JDBC

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12928:


Assignee: Apache Spark

> Oracle FLOAT datatype is not properly handled when reading via JDBC
> ---
>
> Key: SPARK-12928
> URL: https://issues.apache.org/jira/browse/SPARK-12928
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: Oracle Database 11g Enterprise Edition   11.2.0.3.0  
> 64bit Production
>Reporter: Greg Michalopoulos
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When trying to read in a table from Oracle and saveAsParquet, an 
> IllegalArgumentException is thrown when a column of FLOAT datatype is 
> encountered.
> Below is the code being run:
> {code}val jdbcDF = sqlContext.read.format("jdbc").options(
>   Map("url" -> jdbcConnectionString,
>   "dbtable" -> "(select someFloat from someTable)",
>   "fetchSize" -> fetchSize)).load()
>   jdbcDF.saveAsParquetFile(destinationDirectory + table)
> {code}
> Here is the exception:
> {code}java.lang.IllegalArgumentException: Unsupported dataType: 
> {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]},
>  [1.1] failure: `TimestampType' expected but `{' found
> {code}
> From the exception it was clear that the FLOAT datatype was presenting itself 
> as scale -127 which appears to be the problem. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12360) Support using 64-bit long type in SparkR

2016-01-25 Thread Dmitriy Selivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115376#comment-15115376
 ] 

Dmitriy Selivanov commented on SPARK-12360:
---

+1 for bit64

> Support using 64-bit long type in SparkR
> 
>
> Key: SPARK-12360
> URL: https://issues.apache.org/jira/browse/SPARK-12360
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Sun Rui
>
> R has no support for 64-bit integers, while in the Scala/Java API some methods 
> have one or more arguments of long type. Currently we only support passing an 
> integer, cast from a numeric on the R side, for the long-typed parameters of 
> such methods. This may be a problem for large data sets.
> Storing a 64-bit integer in a double obviously does not work, as some 64-bit 
> integers cannot be exactly represented in double format, so x and x+1 can't 
> be distinguished.
> There is a bit64 package 
> (https://cran.r-project.org/web/packages/bit64/index.html) in CRAN which 
> supports vectors of 64-bit integers. We can investigate whether it can be used 
> for this purpose.
> Two questions are:
> 1. Is the license acceptable?
> 2. This will make SparkR depend on a non-base third-party package, which 
> may complicate the deployment.
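
A quick illustration (written in Scala, since R itself lacks a 64-bit integer 
type) of why a double cannot stand in for a long, as noted above: past 2^53, 
consecutive longs collapse to the same double.

{code}
val x = 1L << 53                   // 9007199254740992
x.toDouble == (x + 1).toDouble     // true: x and x + 1 become indistinguishable
{code}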



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115144#comment-15115144
 ] 

Liang-Chi Hsieh commented on SPARK-12890:
-

For the original issue, I think it might be because you enabled schema merging. 
In order to get the correct schema, it will scan the footers of all Parquet 
files to merge their schemas. Try disabling schema merging if you don't need it, 
and see if that solves your problem.
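
A short sketch of disabling Parquet schema merging as suggested above (the 
table path is hypothetical; the option has been available since Spark 1.5):

{code}
// Disable schema merging globally...
sqlContext.setConf("spark.sql.parquet.mergeSchema", "false")
// ...or just for one read.
val df = sqlContext.read.option("mergeSchema", "false").parquet("/data/events")
{code}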

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115204#comment-15115204
 ] 

Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 1:46 PM:
---

Actually, I still don't understand what the issue is here. This might not be 
related to merging schemas, since that is disabled by default, and no filter is 
being pushed down here. As far as I know, it does not automatically create a 
filter for a function and push it down.

I mean, the referenced column would be {{date}} and the given filters would be 
empty, so it tries to read all the files regardless of file format, as long as 
the format supports partitioned files.


was (Author: hyukjin.kwon):
Actually, I still don't understand what the issue is here. This might not be 
related to merging schemas, since that is disabled by default, and no filter is 
being pushed down here. As far as I know, it does not automatically create a 
filter and push it down.

I mean, the referenced column would be {{date}} and the given filters would be 
empty, so it tries to read all the files regardless of file format, as long as 
the format supports partitioned files.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12492) SQL page of Spark-sql is always blank

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12492:


Assignee: (was: Apache Spark)

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> When I run a sql query in spark-sql, the Execution page of SQL tab is always 
> blank. But the JDBCServer is not blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12492) SQL page of Spark-sql is always blank

2016-01-25 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12492:


Assignee: Apache Spark

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Reporter: meiyoula
>Assignee: Apache Spark
> Attachments: screenshot-1.png
>
>
> When I run a sql query in spark-sql, the Execution page of SQL tab is always 
> blank. But the JDBCServer is not blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12492) SQL page of Spark-sql is always blank

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115334#comment-15115334
 ] 

Apache Spark commented on SPARK-12492:
--

User 'KaiXinXiaoLei' has created a pull request for this issue:
https://github.com/apache/spark/pull/10900

> SQL page of Spark-sql is always blank 
> --
>
> Key: SPARK-12492
> URL: https://issues.apache.org/jira/browse/SPARK-12492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Reporter: meiyoula
> Attachments: screenshot-1.png
>
>
> When I run a sql query in spark-sql, the Execution page of SQL tab is always 
> blank. But the JDBCServer is not blank.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12975) Throwing Exception when Bucketing Columns are part of Partitioning Columns

2016-01-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12975:

Description: 
When users are using partitionBy and bucketBy at the same time, some bucketing 
columns might be part of partitioning columns. For example, 
{code}
df.write
  .format(source)
  .partitionBy("i")
  .bucketBy(8, "i", "k")
  .sortBy("k")
  .saveAsTable("bucketed_table")
{code}

However, in the above case, adding column `i` to `bucketBy` is useless; it just 
wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we 
can throw an exception and let users make the change.

  was:
When users are using partitionBy and bucketBy at the same time, some bucketing 
columns might be part of partitioning columns. For example, 
{code}
df.write
  .format(source)
  .partitionBy("i")
  .bucketBy(8, "i", "k")
  .sortBy("k")
  .saveAsTable("bucketed_table")
{code}

However, in the above case, adding column `i` is useless; it just wastes extra 
CPU when reading or writing bucketed tables. Thus, we can automatically remove 
these overlapping columns from the bucketing columns.


> Throwing Exception when Bucketing Columns are part of Partitioning Columns
> --
>
> Key: SPARK-12975
> URL: https://issues.apache.org/jira/browse/SPARK-12975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When users are using partitionBy and bucketBy at the same time, some 
> bucketing columns might be part of partitioning columns. For example, 
> {code}
> df.write
>   .format(source)
>   .partitionBy("i")
>   .bucketBy(8, "i", "k")
>   .sortBy("k")
>   .saveAsTable("bucketed_table")
> {code}
> However, in the above case, adding column `i` to `bucketBy` is useless; it 
> just wastes extra CPU when reading or writing bucketed tables. Thus, like 
> Hive, we can throw an exception and let users make the change.
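
For reference, a sketch of the non-overlapping form of the writer call above, 
with `i` removed from `bucketBy` since it is already a partition column:

{code}
df.write
  .format(source)
  .partitionBy("i")
  .bucketBy(8, "k")
  .sortBy("k")
  .saveAsTable("bucketed_table")
{code}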



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12975) Throwing Exception when Bucketing Columns are part of Partitioning Columns

2016-01-25 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-12975:

Summary: Throwing Exception when Bucketing Columns are part of Partitioning 
Columns  (was: Eliminate Bucketing Columns that are part of Partitioning 
Columns)

> Throwing Exception when Bucketing Columns are part of Partitioning Columns
> --
>
> Key: SPARK-12975
> URL: https://issues.apache.org/jira/browse/SPARK-12975
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> When users are using partitionBy and bucketBy at the same time, some 
> bucketing columns might be part of partitioning columns. For example, 
> {code}
> df.write
>   .format(source)
>   .partitionBy("i")
>   .bucketBy(8, "i", "k")
>   .sortBy("k")
>   .saveAsTable("bucketed_table")
> {code}
> However, in the above case, adding column `i` is useless; it just wastes 
> extra CPU when reading or writing bucketed tables. Thus, we can automatically 
> remove these overlapping columns from the bucketing columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12984) Not able to read CSV file using Spark 1.4.0

2016-01-25 Thread Jai Murugesh Rajasekaran (JIRA)
Jai Murugesh Rajasekaran created SPARK-12984:


 Summary: Not able to read CSV file using Spark 1.4.0
 Key: SPARK-12984
 URL: https://issues.apache.org/jira/browse/SPARK-12984
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 1.4.0
 Environment: Unix
Hadoop 2.7.1.2.3.0.0-2557
R 3.1.1
Don't have Internet on the server
Reporter: Jai Murugesh Rajasekaran


Hi,

We are trying to read a CSV file. We downloaded the following CSV-related 
packages (jar files) and configured them using Maven:
1. spark-csv_2.10-1.2.0.jar
2. spark-csv_2.10-1.2.0-sources.jar
3. spark-csv_2.10-1.2.0-javadoc.jar

We are trying to execute the following script:
> library(SparkR)
> sc <- sparkR.init(appName="SparkR-DataFrame")
Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or 
restart R to create a new Spark Context
> sqlContext <- sparkRSQL.init(sc)
> setwd("/home/s/")
> getwd()
[1] "/home/s"
> path <- file.path("Sample.csv")
> Test <- read.df(sqlContext, path)

Note: I am able to read the CSV file using regular R functions, but when I 
tried using SparkR functions it ended up with an error.
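
For comparison, a Scala sketch of the equivalent read: if no source is named, 
the reader falls back to the default data source (Parquet), so the spark-csv 
format has to be given explicitly in Spark 1.4. The header option below is an 
assumption about the file.

{code}
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")   // assumption: the file has a header row
  .load("Sample.csv")
{code}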

Initiated SparkR
$ sh -x sparkR -v --repositories 
/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar

Error Messages/Log
$ sh -x sparkR -v --repositories 
/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
+++ dirname sparkR
++ cd ./..
++ pwd
+ export SPARK_HOME=/opt/spark-1.4.0
+ SPARK_HOME=/opt/spark-1.4.0
+ source /opt/spark-1.4.0/bin/load-spark-env.sh
 dirname sparkR
+++ cd ./..
+++ pwd
++ FWDIR=/opt/spark-1.4.0
++ '[' -z '' ']'
++ export SPARK_ENV_LOADED=1
++ SPARK_ENV_LOADED=1
 dirname sparkR
+++ cd ./..
+++ pwd
++ parent_dir=/opt/spark-1.4.0
++ user_conf_dir=/opt/spark-1.4.0/conf
++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']'
++ set -a
++ . /opt/spark-1.4.0/conf/spark-env.sh
+++ export SPARK_HOME=/opt/spark-1.4.0
+++ SPARK_HOME=/opt/spark-1.4.0
+++ export YARN_CONF_DIR=/etc/hadoop/conf
+++ YARN_CONF_DIR=/etc/hadoop/conf
+++ export HADOOP_CONF_DIR=/etc/hadoop/conf
+++ HADOOP_CONF_DIR=/etc/hadoop/conf
+++ export HADOOP_CONF_DIR=/etc/hadoop/conf
+++ HADOOP_CONF_DIR=/etc/hadoop/conf
++ set +a
++ '[' -z '' ']'
++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11
++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10
++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]]
++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']'
++ export SPARK_SCALA_VERSION=2.10
++ SPARK_SCALA_VERSION=2.10
+ export -f usage
+ [[ -v --repositories 
/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
 = *--help ]]
+ [[ -v --repositories 
/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar
 = *-h ]]
+ exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories 
/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar

R version 3.1.1 (2014-07-10) -- "Sock it to Me"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-unknown-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

Revolution R Enterprise version 7.3: an enhanced distribution of R
Revolution Analytics packages Copyright (C) 2014 Revolution Analytics, Inc.

Type 'revo()' to visit 

[jira] [Created] (SPARK-12985) Spark Hive thrift server big decimal data issue

2016-01-25 Thread Alex Liu (JIRA)
Alex Liu created SPARK-12985:


 Summary: Spark Hive thrift server big decimal data issue
 Key: SPARK-12985
 URL: https://issues.apache.org/jira/browse/SPARK-12985
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.6.0
Reporter: Alex Liu
Priority: Minor


I tested the trial version of the Simba JDBC driver; it works for simple 
queries, but there is an issue with data mapping, e.g.
{code}
java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching data 
rows: java.math.BigDecimal cannot be cast to 
org.apache.hadoop.hive.common.type.HiveDecimal;
at 
com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown 
Source)
at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source)
at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source)
at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown 
Source)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
Caused by: com.simba.spark.support.exceptions.GeneralException: 
[Simba][SparkJDBCDriver](500312) Error in fetching data rows: 
java.math.BigDecimal cannot be cast to 
org.apache.hadoop.hive.common.type.HiveDecimal;
... 8 more

{code}



To fix it, apply
{code}
   case DecimalType() =>
 -to += from.getDecimal(ordinal)
 +to += HiveDecimal.create(from.getDecimal(ordinal))
{code}
to 
https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87
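
A minimal sketch of the conversion the diff above introduces: HiveDecimal.create 
wraps a java.math.BigDecimal in the type the thrift server row set expects.

{code}
import org.apache.hadoop.hive.common.type.HiveDecimal

val bd = new java.math.BigDecimal("12345.6789")
val hd: HiveDecimal = HiveDecimal.create(bd)   // explicit conversion instead of a cast
{code}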



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12633:
--
Assignee: Vijay Kiran

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Vijay Kiran
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12631) Make Parameter Descriptions Consistent for PySpark MLlib Clustering

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12631:
--
Assignee: Bryan Cutler

> Make Parameter Descriptions Consistent for PySpark MLlib Clustering
> ---
>
> Key: SPARK-12631
> URL: https://issues.apache.org/jira/browse/SPARK-12631
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> clustering.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12986) Fix pydoc warnings in mllib/regression.py

2016-01-25 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-12986:
-

 Summary: Fix pydoc warnings in mllib/regression.py
 Key: SPARK-12986
 URL: https://issues.apache.org/jira/browse/SPARK-12986
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Yu Ishikawa
Priority: Minor


Got those warnings by running "make html" under "python/docs/":

{code}
/Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
pyspark.mllib.regression.LinearRegressionWithSGD:3: ERROR: Unexpected 
indentation.
/Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
pyspark.mllib.regression.LinearRegressionWithSGD:4: WARNING: Block quote ends 
without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
pyspark.mllib.regression.RidgeRegressionWithSGD:3: ERROR: Unexpected 
indentation.
/Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
pyspark.mllib.regression.RidgeRegressionWithSGD:4: WARNING: Block quote ends 
without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
pyspark.mllib.regression.LassoWithSGD:3: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
pyspark.mllib.regression.LassoWithSGD:4: WARNING: Block quote ends without a 
blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
pyspark.mllib.regression.IsotonicRegression:7: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of 
pyspark.mllib.regression.IsotonicRegression:12: ERROR: Unexpected indentation.
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12631) Make Parameter Descriptions Consistent for PySpark MLlib Clustering

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12631:
--
Shepherd: Xiangrui Meng

> Make Parameter Descriptions Consistent for PySpark MLlib Clustering
> ---
>
> Key: SPARK-12631
> URL: https://issues.apache.org/jira/browse/SPARK-12631
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> clustering.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12630:
--
Assignee: Vijay Kiran

> Make Parameter Descriptions Consistent for PySpark MLlib Classification
> ---
>
> Key: SPARK-12630
> URL: https://issues.apache.org/jira/browse/SPARK-12630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Vijay Kiran
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> classification.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12632:
--
Assignee: somil deshmukh

> Make Parameter Descriptions Consistent for PySpark MLlib FPM and 
> Recommendation
> ---
>
> Key: SPARK-12632
> URL: https://issues.apache.org/jira/browse/SPARK-12632
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: somil deshmukh
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up fpm.py 
> and recommendation.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12634:
--
Assignee: Vijay Kiran

> Make Parameter Descriptions Consistent for PySpark MLlib Tree
> -
>
> Key: SPARK-12634
> URL: https://issues.apache.org/jira/browse/SPARK-12634
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Vijay Kiran
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up tree.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12633:
--
Shepherd: Bryan Cutler

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115904#comment-15115904
 ] 

Emlyn Corrin commented on SPARK-9740:
-

Thanks for the help. I've tried with {{callUDF}} and that gives me the same 
error as when I use {{expr}}. For now I've managed to work around it by calling 
{{registerTempTable("tempTable")}} on the DataFrame, and then 
{{SQLContext.sql("SELECT LAST(colName,true) OVER(...) FROM tempTable")}}, which 
works, but feels a bit hacky.
I'll try to put together a minimal example that demonstrates this, as it is 
currently in the middle of a fairly big Clojure application that calls Spark 
through Java interop.
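
A Scala sketch of the SQL workaround described above; the table, column names, 
and window specification are placeholders, and window functions assume a 
HiveContext in 1.6.

{code}
df.registerTempTable("tempTable")
val lastNonNull = sqlContext.sql(
  """SELECT id,
    |       LAST(colName, true) OVER (PARTITION BY id ORDER BY ts
    |         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS lastVal
    |FROM tempTable""".stripMargin)
{code}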

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12634:
--
Shepherd: Bryan Cutler
Target Version/s: 2.0.0

> Make Parameter Descriptions Consistent for PySpark MLlib Tree
> -
>
> Key: SPARK-12634
> URL: https://issues.apache.org/jira/browse/SPARK-12634
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Vijay Kiran
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up tree.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12980) pyspark crash for large dataset - clone

2016-01-25 Thread Christopher Bourez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Bourez updated SPARK-12980:
---
Description: 
I installed Spark 1.6 on many different computers.

On Windows, the PySpark textFile method, followed by take(1), does not work on a 
13 MB file.
If I set numPartitions to 2000 or use a smaller file, the method works well.
PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g in 
local mode.

On Mac OS, with 16 GB of RAM, I'm able to run the exact same program on a much 
bigger file of 5 GB. Memory is correctly allocated, released, etc.

On Ubuntu, no trouble; I can also launch a cluster: 
http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html

The error message on Windows is: java.net.SocketException: Connection reset by 
peer: socket write error
The configuration is: Java 8 64-bit, Python 2.7.11, on Windows 7 Enterprise SP1 
v2.42.01.
What could be the reason for the Windows Spark textFile method to fail?

  was:
I installed Spark 1.6 on many different computers.

On Windows, the PySpark textFile method, followed by take(1), does not work on a 
13 MB file.
If I set numPartitions to 2000 or use a smaller file, the method works well.
PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g in 
local mode.

On Mac OS, with 16 GB of RAM, I'm able to run the exact same program on a much 
bigger file of 5 GB. Memory is correctly allocated, released, etc.

On Ubuntu, no trouble; I can also launch a cluster: 
http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html

The error message on Windows is: java.net.SocketException: Connection reset by 
peer: socket write error
What could be the reason for the Windows Spark textFile method to fail?


> pyspark crash for large dataset - clone
> ---
>
> Key: SPARK-12980
> URL: https://issues.apache.org/jira/browse/SPARK-12980
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: Christopher Bourez
>
> I installed Spark 1.6 on many different computers.
> On Windows, the PySpark textFile method, followed by take(1), does not work on 
> a 13 MB file.
> If I set numPartitions to 2000 or use a smaller file, the method works well.
> PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g 
> in local mode.
> On Mac OS, with 16 GB of RAM, I'm able to run the exact same program on a much 
> bigger file of 5 GB. Memory is correctly allocated, released, etc.
> On Ubuntu, no trouble; I can also launch a cluster: 
> http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html
> The error message on Windows is: java.net.SocketException: Connection reset 
> by peer: socket write error
> The configuration is: Java 8 64-bit, Python 2.7.11, on Windows 7 Enterprise 
> SP1 v2.42.01.
> What could be the reason for the Windows Spark textFile method to fail?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier

2016-01-25 Thread Grzegorz Chilkiewicz (JIRA)
Grzegorz Chilkiewicz created SPARK-12982:


 Summary: SQLContext: temporary table registration does not accept 
valid identifier
 Key: SPARK-12982
 URL: https://issues.apache.org/jira/browse/SPARK-12982
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Grzegorz Chilkiewicz
Priority: Minor


We have encountered very strange behavior in SparkSQL temporary table 
registration.
Which identifiers should be valid for a temporary table?
Alphanumeric characters plus '_', with at least one non-digit?

Valid identifiers:
df
674123a
674123_
a0e97c59_4445_479d_a7ef_d770e3874123
1ae97c59_4445_479d_a7ef_d770e3874123

Invalid identifier:
10e97c59_4445_479d_a7ef_d770e3874123


Stack trace:
[error] java.lang.RuntimeException: [1.1] failure: identifier expected
[error] 
[error] 10e97c59_4445_479d_a7ef_d770e3874123
[error] ^
[error] at scala.sys.package$.error(package.scala:27)
[error] at 
org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
[error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
[error] at 
org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)

Code to reproduce bug:
https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier
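
A short sketch of the behaviour in the report (assumes an existing DataFrame 
{{df}}): registration accepts the name, but parsing the identifier later, e.g. 
in dropTempTable, fails, possibly because "10e97..." starts like a numeric 
literal.

{code}
df.registerTempTable("a0e97c59_4445_479d_a7ef_d770e3874123")    // accepted and usable
df.registerTempTable("10e97c59_4445_479d_a7ef_d770e3874123")
sqlContext.dropTempTable("10e97c59_4445_479d_a7ef_d770e3874123")
// => java.lang.RuntimeException: [1.1] failure: identifier expected
{code}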




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12632:
--
Target Version/s: 2.0.0

> Make Parameter Descriptions Consistent for PySpark MLlib FPM and 
> Recommendation
> ---
>
> Key: SPARK-12632
> URL: https://issues.apache.org/jira/browse/SPARK-12632
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: somil deshmukh
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up fpm.py 
> and recommendation.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12631) Make Parameter Descriptions Consistent for PySpark MLlib Clustering

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12631:
--
Target Version/s: 2.0.0

> Make Parameter Descriptions Consistent for PySpark MLlib Clustering
> ---
>
> Key: SPARK-12631
> URL: https://issues.apache.org/jira/browse/SPARK-12631
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> clustering.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12630:
--
Target Version/s: 2.0.0

> Make Parameter Descriptions Consistent for PySpark MLlib Classification
> ---
>
> Key: SPARK-12630
> URL: https://issues.apache.org/jira/browse/SPARK-12630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Vijay Kiran
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> classification.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-12633:
--
Target Version/s: 2.0.0

> Make Parameter Descriptions Consistent for PySpark MLlib Regression
> ---
>
> Key: SPARK-12633
> URL: https://issues.apache.org/jira/browse/SPARK-12633
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 1.6.0
>Reporter: Bryan Cutler
>Assignee: Vijay Kiran
>Priority: Trivial
>  Labels: doc, starter
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Follow example parameter description format from parent task to fix up 
> regression.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12945) ERROR LiveListenerBus: Listener JobProgressListener threw an exception

2016-01-25 Thread Ben Huntley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115986#comment-15115986
 ] 

Ben Huntley commented on SPARK-12945:
-

Also seeing this issue in 1.6.0; it's not limited to the Web UI, as I hit it from pyspark.  
Adding my own repro:

bin/pyspark --master yarn-client --conf "spark.sql.shuffle.partitions=3" 
--executor-memory 10g --driver-memory 154g --num-executors 50 --executor-cores 
1 --conf "spark.driver.maxResultSize=25g" --conf "spark.e xecutor.cores=1" 
--conf "spark.sql.autoBroadcastJoinThreshold=129400" --conf 
"spark.akka.frameSize=300" --conf "spark.akka.timeout=3600"

>>> foo = sqlContext.read.parquet('/projects/xxx/month7')
>>> foo.count()
[Stage 1:==>   (16687 + 11) / 
37231]16/01/25 11:13:32 ERROR LiveListenerBus: Listener JobProgressListener 
threw an exception
 java.lang.NullPointerException
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361)
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at 
org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
[Stage 1:=>(36749 + 22) / 
37231]16/01/25 11:15:20 ERROR LiveListenerBus: Listener JobProgressListener 
threw an exception
 java.lang.NullPointerException
16227372864

> ERROR LiveListenerBus: Listener JobProgressListener threw an exception
> --
>
> Key: SPARK-12945
> URL: https://issues.apache.org/jira/browse/SPARK-12945
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.6.0
> Environment: Linux, yarn-client
>Reporter: Tristan
>Priority: Minor
>
> Seeing this a lot; not sure if it is a problem or spurious error (I recall 
> this was an ignorable issue in previous version). The UI seems to be working 
> fine:
> ERROR LiveListenerBus: Listener JobProgressListener threw an exception
> java.lang.NullPointerException
> at 
> org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361)
> at 
> org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360)
> at scala.collection.immutable.List.foreach(List.scala:318)
> at 
> scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
> at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
> at 
> org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360)
> at 
> org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
> at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
> at 
> org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
> at 
> org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
> at 
> org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
> at 
> 

[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6

2016-01-25 Thread Stephen DiCocco (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115459#comment-15115459
 ] 

Stephen DiCocco commented on SPARK-12911:
-

So we have determined that one way to work around the issue is to add the array 
you want to search for as a literal column on the dataframe and then cache the 
frame. This causes the underlying types of both sides of the comparison to be 
UnsafeArrayData.

{code}
test("test array comparison") {

val vectors: Vector[Row] =  Vector(
  Row.fromTuple("id_1" -> Array(0L, 2L)),
  Row.fromTuple("id_2" -> Array(0L, 5L)),
  Row.fromTuple("id_3" -> Array(0L, 9L)),
  Row.fromTuple("id_4" -> Array(1L, 0L)),
  Row.fromTuple("id_5" -> Array(1L, 8L)),
  Row.fromTuple("id_6" -> Array(2L, 4L)),
  Row.fromTuple("id_7" -> Array(5L, 6L)),
  Row.fromTuple("id_8" -> Array(6L, 2L)),
  Row.fromTuple("id_9" -> Array(7L, 0L))
)
val data: RDD[Row] = sc.parallelize(vectors, 3)

val schema = StructType(
  StructField("id", StringType, false) ::
StructField("point", DataTypes.createArrayType(LongType, false), false) 
::
Nil
)

val sqlContext = new SQLContext(sc)
    var dataframe = sqlContext.createDataFrame(data, schema)  // var: the target column is appended below

val targetPoint:Array[Long] = Array(0L,9L)

//Adding the target column to the frame allows you to do the comparison 
successfully but there is definite overhead to doing this
dataframe = dataframe.withColumn("target", array(targetPoint.map(value => 
lit(value)): _*))
dataframe.cache()

//This is the line where it fails
//java.util.NoSuchElementException: next on empty iterator
//However we know that there is a valid match
val targetRow = dataframe.where(dataframe("point") === 
  dataframe("target")).first()

assert(targetRow != null)
  }
{code}

> Cacheing a dataframe causes array comparisons to fail (in filter / where) 
> after 1.6
> ---
>
> Key: SPARK-12911
> URL: https://issues.apache.org/jira/browse/SPARK-12911
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
> Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0
>Reporter: Jesse English
>
> When doing a *where* operation on a dataframe and testing for equality on an 
> array type, after 1.6 no valid comparisons are made if the dataframe has been 
> cached.  If it has not been cached, the results are as expected.
> This appears to be related to the underlying unsafe array data types.
> {code:title=test.scala|borderStyle=solid}
> test("test array comparison") {
> val vectors: Vector[Row] =  Vector(
>   Row.fromTuple("id_1" -> Array(0L, 2L)),
>   Row.fromTuple("id_2" -> Array(0L, 5L)),
>   Row.fromTuple("id_3" -> Array(0L, 9L)),
>   Row.fromTuple("id_4" -> Array(1L, 0L)),
>   Row.fromTuple("id_5" -> Array(1L, 8L)),
>   Row.fromTuple("id_6" -> Array(2L, 4L)),
>   Row.fromTuple("id_7" -> Array(5L, 6L)),
>   Row.fromTuple("id_8" -> Array(6L, 2L)),
>   Row.fromTuple("id_9" -> Array(7L, 0L))
> )
> val data: RDD[Row] = sc.parallelize(vectors, 3)
> val schema = StructType(
>   StructField("id", StringType, false) ::
> StructField("point", DataTypes.createArrayType(LongType, false), 
> false) ::
> Nil
> )
> val sqlContext = new SQLContext(sc)
> val dataframe = sqlContext.createDataFrame(data, schema)
> val targetPoint:Array[Long] = Array(0L,9L)
> //Cacheing is the trigger to cause the error (no cacheing causes no error)
> dataframe.cache()
> //This is the line where it fails
> //java.util.NoSuchElementException: next on empty iterator
> //However we know that there is a valid match
> val targetRow = dataframe.where(dataframe("point") === 
> array(targetPoint.map(value => lit(value)): _*)).first()
> assert(targetRow != null)
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12970) Error in documentation on creating rows with schemas defined by structs

2016-01-25 Thread Haidar Hadi (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114539#comment-15114539
 ] 

Haidar Hadi edited comment on SPARK-12970 at 1/25/16 7:11 PM:
--

sure [~joshrosen] I understand. 


was (Author: hhadi):
sure [~jrose] I understand. 

> Error in documentation on creating rows with schemas defined by structs
> ---
>
> Key: SPARK-12970
> URL: https://issues.apache.org/jira/browse/SPARK-12970
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 1.6.0
>Reporter: Haidar Hadi
>Priority: Minor
>  Labels: documentation
>
> The provided example in this doc 
> https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/types/StructType.html
>  for creating Row from Struct is wrong
>  // Create a Row with the schema defined by struct
>  val row = Row(Row(1, 2, true))
>  // row: Row = {@link 1,2,true}
>  
> the above example does not create a Row object with schema.
> this error is in the scala docs too. 
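
Editor's note: a minimal sketch (not the documentation fix itself) of one way to end up with 
rows that actually carry the schema described by the struct: attach the StructType via 
createDataFrame and take rows back out of the DataFrame. The SparkContext {{sc}} and the 
field names are assumptions for illustration.

{code}
// Editor's sketch: a bare Row(...) carries no schema; going through a DataFrame
// built with an explicit StructType yields rows whose schema is populated.
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types._

val struct = StructType(
  StructField("a", IntegerType, nullable = false) ::
  StructField("b", IntegerType, nullable = false) ::
  StructField("c", BooleanType, nullable = false) :: Nil)

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
val df = sqlContext.createDataFrame(sc.parallelize(Seq(Row(1, 2, true))), struct)
df.first().schema  // returns the StructType defined above
{code}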



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11965) Update user guide for RFormula feature interactions

2016-01-25 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-11965.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 10222
[https://github.com/apache/spark/pull/10222]

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
> Fix For: 2.0.0
>
>
> Update the user guide for RFormula to cover feature interactions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115907#comment-15115907
 ] 

Yin Huai commented on SPARK-9740:
-

Can you provide the full stack trace?

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-12945) ERROR LiveListenerBus: Listener JobProgressListener threw an exception

2016-01-25 Thread Ben Huntley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115986#comment-15115986
 ] 

Ben Huntley edited comment on SPARK-12945 at 1/25/16 8:41 PM:
--

Also seeing this issue in 1.6.0; it's not limited to the Web UI, as it also occurs in 
pyspark. Adding my own repro:

{quote}
bin/pyspark --master yarn-client --conf "spark.sql.shuffle.partitions=3" --executor-memory 10g --driver-memory 154g --num-executors 50 --executor-cores 1 --conf "spark.driver.maxResultSize=25g" --conf "spark.executor.cores=1" --conf "spark.sql.autoBroadcastJoinThreshold=129400" --conf "spark.akka.frameSize=300" --conf "spark.akka.timeout=3600"

>>> foo = sqlContext.read.parquet('/projects/xxx/month7')
>>> foo.count()
[Stage 1:==>   (16687 + 11) / 
37231]16/01/25 11:13:32 ERROR LiveListenerBus: Listener JobProgressListener 
threw an exception
 java.lang.NullPointerException
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361)
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at 
org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180)
at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)
[Stage 1:=>(36749 + 22) / 
37231]16/01/25 11:15:20 ERROR LiveListenerBus: Listener JobProgressListener 
threw an exception
 java.lang.NullPointerException
16227372864
{quote}


was (Author: bhuntley):
Also seeing this issue in 1.6.0, not limited to Web UI, as it's in pyspark.  
Adding my own repro:

bin/pyspark --master yarn-client --conf "spark.sql.shuffle.partitions=3" --executor-memory 10g --driver-memory 154g --num-executors 50 --executor-cores 1 --conf "spark.driver.maxResultSize=25g" --conf "spark.executor.cores=1" --conf "spark.sql.autoBroadcastJoinThreshold=129400" --conf "spark.akka.frameSize=300" --conf "spark.akka.timeout=3600"

>>> foo = sqlContext.read.parquet('/projects/xxx/month7')
>>> foo.count()
[Stage 1:==>   (16687 + 11) / 
37231]16/01/25 11:13:32 ERROR LiveListenerBus: Listener JobProgressListener 
threw an exception
 java.lang.NullPointerException
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361)
at 
org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
at 
org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360)
at 
org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
at 
org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
at 
org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
at 

[jira] [Updated] (SPARK-12981) Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error

2016-01-25 Thread Tom Arnfeld (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Arnfeld updated SPARK-12981:

Description: 
We noticed a regression when testing out an upgrade of Spark 1.6 for our 
systems, where pyspark throws a casting exception when using `filter(udf)` 
after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5.

Here's a little notebook that demonstrates the exception clearly... 
https://gist.github.com/tarnfeld/ab9b298ae67f697894cd

Though for the sake of here... the following code will throw an exception...

{code}
data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
{code}

{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
org.apache.spark.sql.catalyst.plans.logical.Aggregate
{code}

Whereas not using a UDF does not...

{code}
data.select(col("a")).distinct().filter("a = 1").count()
{code}

  was:
We noticed a regression when testing out an upgrade of Spark 1.6 for our 
systems, where pyspark throws a casting exception when using `filter(udf)` 
after a `distinct` operation on a DataFrame.

Here's a little notebook that demonstrates the exception clearly... 
https://gist.github.com/tarnfeld/ab9b298ae67f697894cd

Though for the sake of here... the following code will throw an exception...

{code}
data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
{code}

{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
org.apache.spark.sql.catalyst.plans.logical.Aggregate
{code}

Whereas not using a UDF does not...

{code}
data.select(col("a")).distinct().filter("a = 1").count()
{code}


> Dataframe distinct() followed by a filter(udf) in pyspark throws a casting 
> error
> 
>
> Key: SPARK-12981
> URL: https://issues.apache.org/jira/browse/SPARK-12981
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8)
>Reporter: Tom Arnfeld
>Priority: Critical
>
> We noticed a regression when testing out an upgrade of Spark 1.6 for our 
> systems, where pyspark throws a casting exception when using `filter(udf)` 
> after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5.
> Here's a little notebook that demonstrates the exception clearly... 
> https://gist.github.com/tarnfeld/ab9b298ae67f697894cd
> Though for the sake of here... the following code will throw an exception...
> {code}
> data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
> {code}
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
> org.apache.spark.sql.catalyst.plans.logical.Aggregate
> {code}
> Whereas not using a UDF does not...
> {code}
> data.select(col("a")).distinct().filter("a = 1").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier

2016-01-25 Thread Grzegorz Chilkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grzegorz Chilkiewicz updated SPARK-12982:
-
Description: 
We have encountered very strange behavior of SparkSQL temporary table 
registration.
Which identifiers should be valid for a temporary table?
Alphanumeric + '_' with at least one non-digit?

Valid identifiers:
df
674123a
674123_
a0e97c59_4445_479d_a7ef_d770e3874123
1ae97c59_4445_479d_a7ef_d770e3874123
Invalid identifier:
10e97c59_4445_479d_a7ef_d770e3874123

Stack trace:
{code:xml}
java.lang.RuntimeException: [1.1] failure: identifier expected

10e97c59_4445_479d_a7ef_d770e3874123
^
at scala.sys.package$.error(package.scala:27)
at 
org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
at 
SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9)
at 
SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:42)
at 
SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at sbt.Run.invokeMain(Run.scala:67)
at sbt.Run.run0(Run.scala:61)
at sbt.Run.sbt$Run$$execute$1(Run.scala:51)
at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Run$$anonfun$run$1.apply(Run.scala:55)
at sbt.Logger$$anon$4.apply(Logger.scala:85)
at sbt.TrapExit$App.run(TrapExit.scala:248)
at java.lang.Thread.run(Thread.java:745)
{code}

Code to reproduce this bug:
https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier


  was:
We have encountered very strange behavior of SparkSQL temporary table 
registration.
What identifiers for temporary table should be valid?
Alphanumerical + '_' with at least one non-digit?

Valid identifiers:
df
674123a
674123_
a0e97c59_4445_479d_a7ef_d770e3874123
1ae97c59_4445_479d_a7ef_d770e3874123
Invalid identifier:
10e97c59_4445_479d_a7ef_d770e3874123

Stack trace:
{code:xml}
[error] java.lang.RuntimeException: [1.1] failure: identifier expected
[error] 
[error] 10e97c59_4445_479d_a7ef_d770e3874123
[error] ^
[error] at scala.sys.package$.error(package.scala:27)
[error] at 
org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
[error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
[error] at 
org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)
{code}

Code to reproduce this bug:
https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier



> SQLContext: temporary table registration does not accept valid identifier
> -
>
> Key: SPARK-12982
> URL: https://issues.apache.org/jira/browse/SPARK-12982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Minor
>  Labels: sql
>
> We have encountered very strange behavior of SparkSQL temporary table 
> registration.
> What identifiers for temporary table should be valid?
> Alphanumerical + '_' with at least one non-digit?
> Valid identifiers:
> df
> 674123a
> 674123_
> a0e97c59_4445_479d_a7ef_d770e3874123
> 1ae97c59_4445_479d_a7ef_d770e3874123
> Invalid identifier:
> 10e97c59_4445_479d_a7ef_d770e3874123
> Stack trace:
> {code:xml}
> java.lang.RuntimeException: [1.1] failure: identifier expected
> 10e97c59_4445_479d_a7ef_d770e3874123
> ^
>   at scala.sys.package$.error(package.scala:27)
>   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
>   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
>   at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
>   at 
> SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9)
>   at 
> 

[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.

2016-01-25 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115558#comment-15115558
 ] 

Simeon Simeonov commented on SPARK-12890:
-

[~viirya] If schema merging is the cause of the problem, then this is clearly a 
bug. The resulting schema for a query using only partition columns is 
completely independent of the schema in the data files. There is no merging to 
do at all.

> Spark SQL query related to only partition fields should not scan the whole 
> data.
> 
>
> Key: SPARK-12890
> URL: https://issues.apache.org/jira/browse/SPARK-12890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Prakash Chockalingam
>
> I have a SQL query which has only partition fields. The query ends up 
> scanning all the data which is unnecessary.
> Example: select max(date) from table, where the table is partitioned by date.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12981) Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error

2016-01-25 Thread Tom Arnfeld (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tom Arnfeld updated SPARK-12981:

Description: 
We noticed a regression when testing out an upgrade of Spark 1.6 for our 
systems, where pyspark throws a casting exception when using `filter(udf)` 
after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5.

Here's a little notebook that demonstrates the exception clearly... 
https://gist.github.com/tarnfeld/ab9b298ae67f697894cd

Though for the sake of here... the following code will throw an exception...

{code}
data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
{code}

{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
org.apache.spark.sql.catalyst.plans.logical.Aggregate
{code}

Whereas not using a UDF does not throw any errors...
{code}
data.select(col("a")).distinct().filter("a = 1").count()
{code}

  was:
We noticed a regression when testing out an upgrade of Spark 1.6 for our 
systems, where pyspark throws a casting exception when using `filter(udf)` 
after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5.

Here's a little notebook that demonstrates the exception clearly... 
https://gist.github.com/tarnfeld/ab9b298ae67f697894cd

Though for the sake of here... the following code will throw an exception...

{code}
data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
{code}

{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
org.apache.spark.sql.catalyst.plans.logical.Aggregate
{code}

Whereas not using a UDF does not...

{code}
data.select(col("a")).distinct().filter("a = 1").count()
{code}


> Dataframe distinct() followed by a filter(udf) in pyspark throws a casting 
> error
> 
>
> Key: SPARK-12981
> URL: https://issues.apache.org/jira/browse/SPARK-12981
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.6.0
> Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8)
>Reporter: Tom Arnfeld
>Priority: Critical
>
> We noticed a regression when testing out an upgrade of Spark 1.6 for our 
> systems, where pyspark throws a casting exception when using `filter(udf)` 
> after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5.
> Here's a little notebook that demonstrates the exception clearly... 
> https://gist.github.com/tarnfeld/ab9b298ae67f697894cd
> Though for the sake of here... the following code will throw an exception...
> {code}
> data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
> {code}
> {code}
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
> org.apache.spark.sql.catalyst.plans.logical.Aggregate
> {code}
> Whereas not using a UDF does not throw any errors...
> {code}
> data.select(col("a")).distinct().filter("a = 1").count()
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Yin Huai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115480#comment-15115480
 ] 

Yin Huai commented on SPARK-9740:
-

Can you attach your code? Also, can you try to use {{functions.callUDF("last", 
col, functions.lit(true))}}?
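
Editor's note: for anyone trying the suggestion above, a minimal sketch of the call 
(the DataFrame {{df}} and column name are assumptions; whether the extra {{lit(true)}} 
argument is honoured depends on how the function resolves, see the later comments on 
this ticket).

{code}
// Editor's sketch, assuming a DataFrame `df` with a nullable column "x".
// callUDF resolves the function by name at analysis time; the lit(true)
// argument is the skip-nulls flag suggested above.
import org.apache.spark.sql.functions.{callUDF, col, lit}

val lastNonNull = df.groupBy().agg(callUDF("last", col("x"), lit(true)))
lastNonNull.show()
{code}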

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12980) pyspark crash for large dataset - clone

2016-01-25 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-12980.
---
Resolution: Invalid

Why is this a clone of another issue? I don't think you've specified clearly 
what the problem is -- you say it doesn't work. Questions should go to 
u...@spark.apache.org

> pyspark crash for large dataset - clone
> ---
>
> Key: SPARK-12980
> URL: https://issues.apache.org/jira/browse/SPARK-12980
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: Christopher Bourez
>
> I installed spark 1.6 on many different computers. 
> On Windows, PySpark textfile method, followed by take(1), does not work on a 
> file of 13M.
> If I set numpartitions to 2000 or take a smaller file, the method works well.
> The Pyspark is set with all RAM memory of the computer thanks to the command 
> --conf spark.driver.memory=5g in local mode.
> On Mac OS, I'm able to launch the exact same program with Pyspark with 16G 
> RAM for a file of much bigger in comparison, of 5G. Memory is correctly 
> allocated, removed etc
> On Ubuntu, no trouble, I can also launch a cluster 
> http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html
> The error message on Windows is : java.net.SocketException: Connection reset 
> by peer: socket write error
> Configuration is : Java 8 64 bit, Python 2.7.11, on Windows 7 entreprise SP1 
> v2.42.01
> What could be the reason to have the windows spark textfile method fail ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12980) pyspark crash for large dataset - clone

2016-01-25 Thread Christopher Bourez (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christopher Bourez closed SPARK-12980.
--

> pyspark crash for large dataset - clone
> ---
>
> Key: SPARK-12980
> URL: https://issues.apache.org/jira/browse/SPARK-12980
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: Christopher Bourez
>
> I installed spark 1.6 on many different computers. 
> On Windows, PySpark textfile method, followed by take(1), does not work on a 
> file of 13M.
> If I set numpartitions to 2000 or take a smaller file, the method works well.
> The Pyspark is set with all RAM memory of the computer thanks to the command 
> --conf spark.driver.memory=5g in local mode.
> On Mac OS, I'm able to launch the exact same program with Pyspark with 16G 
> RAM for a file of much bigger in comparison, of 5G. Memory is correctly 
> allocated, removed etc
> On Ubuntu, no trouble, I can also launch a cluster 
> http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html
> The error message on Windows is : java.net.SocketException: Connection reset 
> by peer: socket write error
> Configuration is : Java 8 64 bit, Python 2.7.11, on Windows 7 entreprise SP1 
> v2.42.01
> What could be the reason to have the windows spark textfile method fail ?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12261) pyspark crash for large dataset

2016-01-25 Thread Christopher Bourez (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115433#comment-15115433
 ] 

Christopher Bourez commented on SPARK-12261:


I think the issue is not resolved

I installed spark 1.6 on many different computers. 

On Windows, the PySpark textFile method, followed by take(1), does not work on a 
13 MB file. If I set numPartitions to 2000 or take a smaller file, the method works 
well. PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g 
in local mode. 

On Mac OS, I'm able to launch the exact same program with PySpark with 16 GB of RAM 
on a much bigger file of 5 GB. Memory is correctly allocated, freed, etc. 

On Ubuntu, no trouble, I can also launch a cluster 
http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html
 

The error message on Windows is: java.net.SocketException: Connection reset by 
peer: socket write error. 
What could be the reason for the Windows Spark textFile method to fail?

> pyspark crash for large dataset
> ---
>
> Key: SPARK-12261
> URL: https://issues.apache.org/jira/browse/SPARK-12261
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2
> Environment: windows
>Reporter: zihao
>
> I tried to import a local text(over 100mb) file via textFile in pyspark, when 
> i ran data.take(), it failed and gave error messages including:
> 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; 
> aborting job
> Traceback (most recent call last):
>   File "E:/spark_python/test3.py", line 9, in 
> lines.take(5)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, 
> in take
> res = self.context.runJob(self, takeUpToNumLeft, p)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 
> 916, in runJob
> port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, 
> partitions)
>   File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in 
> __call__
> answer, self.gateway_client, self.target_id, self.name)
>   File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 
> 36, in deco
> return f(*a, **kw)
>   File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in 
> get_return_value
> format(target_id, ".", name), value)
> py4j.protocol.Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 
> (TID 0, localhost): java.net.SocketException: Connection reset by peer: 
> socket write error
> Then i ran the same code for a small text file, this time .take() worked fine.
> How can i solve this problem?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier

2016-01-25 Thread Grzegorz Chilkiewicz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grzegorz Chilkiewicz updated SPARK-12982:
-
Description: 
We have encountered very strange behavior of SparkSQL temporary table 
registration.
Which identifiers should be valid for a temporary table?
Alphanumeric + '_' with at least one non-digit?

Valid identifiers:
df
674123a
674123_
a0e97c59_4445_479d_a7ef_d770e3874123
1ae97c59_4445_479d_a7ef_d770e3874123
Invalid identifier:
10e97c59_4445_479d_a7ef_d770e3874123

Stack trace:
{code:xml}
[error] java.lang.RuntimeException: [1.1] failure: identifier expected
[error] 
[error] 10e97c59_4445_479d_a7ef_d770e3874123
[error] ^
[error] at scala.sys.package$.error(package.scala:27)
[error] at 
org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
[error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
[error] at 
org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)
{code}

Code to reproduce this bug:
https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier


  was:
We have encountered very strange behavior of SparkSQL temporary table 
registration.
What identifiers for temporary table should be valid?
Alphanumerical + '_' with at least one non-digit?

Valid identifiers:
df
674123a
674123_
a0e97c59_4445_479d_a7ef_d770e3874123
1ae97c59_4445_479d_a7ef_d770e3874123

Invalid identifier:
10e97c59_4445_479d_a7ef_d770e3874123


Stack trace:
[error] java.lang.RuntimeException: [1.1] failure: identifier expected
[error] 
[error] 10e97c59_4445_479d_a7ef_d770e3874123
[error] ^
[error] at scala.sys.package$.error(package.scala:27)
[error] at 
org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
[error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
[error] at 
org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58)
[error] at 
io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)

Code to reproduce bug:
https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier



> SQLContext: temporary table registration does not accept valid identifier
> -
>
> Key: SPARK-12982
> URL: https://issues.apache.org/jira/browse/SPARK-12982
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Grzegorz Chilkiewicz
>Priority: Minor
>  Labels: sql
>
> We have encountered very strange behavior of SparkSQL temporary table 
> registration.
> What identifiers for temporary table should be valid?
> Alphanumerical + '_' with at least one non-digit?
> Valid identifiers:
> df
> 674123a
> 674123_
> a0e97c59_4445_479d_a7ef_d770e3874123
> 1ae97c59_4445_479d_a7ef_d770e3874123
> Invalid identifier:
> 10e97c59_4445_479d_a7ef_d770e3874123
> Stack trace:
> {code:xml}
> [error] java.lang.RuntimeException: [1.1] failure: identifier expected
> [error] 
> [error] 10e97c59_4445_479d_a7ef_d770e3874123
> [error] ^
> [error]   at scala.sys.package$.error(package.scala:27)
> [error]   at 
> org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58)
> [error]   at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827)
> [error]   at 
> org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763)
> [error]   at 
> io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27)
> [error]   at 
> io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58)
> [error]   at 
> io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala)
> {code}
> Code to reproduce this bug:
> https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier
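
Editor's note: for readers without the linked project handy, a minimal standalone sketch of 
the mismatch described above (Spark 1.6 API; the table name is taken from the report). 
Registration appears to go through, while dropTempTable parses the name and rejects it, per 
the stack trace, likely because the leading "10e97" lexes as a numeric literal.

{code}
// Editor's sketch, assuming an existing SparkContext `sc`.
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.range(0, 10)
val name = "10e97c59_4445_479d_a7ef_d770e3874123"

df.registerTempTable(name)      // no identifier parsing happens here
sqlContext.dropTempTable(name)  // java.lang.RuntimeException: [1.1] failure: identifier expected
{code}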



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12981) Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error

2016-01-25 Thread Tom Arnfeld (JIRA)
Tom Arnfeld created SPARK-12981:
---

 Summary: Dataframe distinct() followed by a filter(udf) in pyspark 
throws a casting error
 Key: SPARK-12981
 URL: https://issues.apache.org/jira/browse/SPARK-12981
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.6.0
 Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8)
Reporter: Tom Arnfeld
Priority: Critical


We noticed a regression when testing out an upgrade of Spark 1.6 for our 
systems, where pyspark throws a casting exception when using `filter(udf)` 
after a `distinct` operation on a DataFrame.

Here's a little notebook that demonstrates the exception clearly... 
https://gist.github.com/tarnfeld/ab9b298ae67f697894cd

Though for the sake of here... the following code will throw an exception...

{code}
data.select(col("a")).distinct().filter(my_filter(col("a"))).count()
{code}

{code}
java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to 
org.apache.spark.sql.catalyst.plans.logical.Aggregate
{code}

Whereas not using a UDF does not...

{code}
data.select(col("a")).distinct().filter("a = 1").count()
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115650#comment-15115650
 ] 

Herman van Hovell commented on SPARK-9740:
--

We are probably resolving the Hive function by accident. The First/Last 
functions probably don't have an expressions-only constructor.

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype

2016-01-25 Thread Thomas Sebastian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115716#comment-15115716
 ] 

Thomas Sebastian commented on SPARK-12941:
--

Added a pull request https://github.com/thomastechs/spark/pull/1


> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR 
> datatype
> --
>
> Key: SPARK-12941
> URL: https://issues.apache.org/jira/browse/SPARK-12941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: Apache Spark 1.4.2.2
>Reporter: Jose Martinez Poblete
>
> When exporting data from Spark to Oracle, string datatypes are translated to 
> TEXT for Oracle, this is leading to the following error
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
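
Editor's note: until a fix lands, a commonly used workaround is to register a custom 
JdbcDialect that maps StringType to VARCHAR2 instead of TEXT. A minimal sketch follows 
(not the linked PR; the VARCHAR2 length is an arbitrary choice for illustration).

{code}
// Editor's sketch of a custom Oracle dialect (JdbcDialects API, Spark 1.4+).
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{DataType, StringType}

object OracleVarcharDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:oracle")

  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType => Some(JdbcType("VARCHAR2(255)", Types.VARCHAR))  // length is illustrative
    case _          => None  // fall back to the default mappings
  }
}

JdbcDialects.registerDialect(OracleVarcharDialect)
{code}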



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib

2016-01-25 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115756#comment-15115756
 ] 

Bryan Cutler commented on SPARK-11219:
--

Regarding overall style in PySpark, I generally see single-line param 
descriptions, and that doesn't look bad since there are usually just a few 
params at most, with short descriptions. So it might not be worth it to update 
this in other areas, but it would be nice to provide the format here in the 
wiki or somewhere for future additions.

> Make Parameter Description Format Consistent in PySpark.MLlib
> -
>
> Key: SPARK-11219
> URL: https://issues.apache.org/jira/browse/SPARK-11219
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, MLlib, PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> There are several different formats for describing params in PySpark.MLlib, 
> making it unclear what the preferred way to document is, i.e. vertical 
> alignment vs single line.
> This is to agree on a format and make it consistent across PySpark.MLlib.
> Following the discussion in SPARK-10560, using 2 lines with an indentation is 
> both readable and doesn't lead to changing many lines when adding/removing 
> parameters.  If the parameter uses a default value, put this in parenthesis 
> in a new line under the description.
> Example:
> {noformat}
> :param stepSize:
>   Step size for each iteration of gradient descent.
>   (default: 0.1)
> :param numIterations:
>   Number of iterations run for each batch of data.
>   (default: 50)
> {noformat}
> h2. Current State of Parameter Description Formating
> h4. Classification
>   * LogisticRegressionModel - single line descriptions, fix indentations
>   * LogisticRegressionWithSGD - vertical alignment, sporatic default values
>   * LogisticRegressionWithLBFGS - vertical alignment, sporatic default values
>   * SVMModel - single line
>   * SVMWithSGD - vertical alignment, sporatic default values
>   * NaiveBayesModel - single line
>   * NaiveBayes - single line
> h4. Clustering
>   * KMeansModel - missing param description
>   * KMeans - missing param description and defaults
>   * GaussianMixture - vertical align, incorrect default formatting
>   * PowerIterationClustering - single line with wrapped indentation, missing 
> defaults
>   * StreamingKMeansModel - single line wrapped
>   * StreamingKMeans - single line wrapped, missing defaults
>   * LDAModel - single line
>   * LDA - vertical align, mising some defaults
> h4. FPM  
>   * FPGrowth - single line
>   * PrefixSpan - single line, defaults values in backticks
> h4. Recommendation
>   * ALS - does not have param descriptions
> h4. Regression
>   * LabeledPoint - single line
>   * LinearModel - single line
>   * LinearRegressionWithSGD - vertical alignment
>   * RidgeRegressionWithSGD - vertical align
>   * IsotonicRegressionModel - single line
>   * IsotonicRegression - single line, missing default
> h4. Tree
>   * DecisionTree - single line with vertical indentation, missing defaults
>   * RandomForest - single line with wrapped indent, missing some defaults
>   * GradientBoostedTrees - single line with wrapped indent
> NOTE
> This issue will just focus on model/algorithm descriptions, which are the 
> largest source of inconsistent formatting
> evaluation.py, feature.py, random.py, utils.py - these supporting classes 
> have param descriptions as single line, but are consistent so don't need to 
> be changed



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior

2016-01-25 Thread Herman van Hovell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115659#comment-15115659
 ] 

Herman van Hovell commented on SPARK-9740:
--

Hmmm... It does have a suitable constructor. Please attach an example.

> first/last aggregate NULL behavior
> --
>
> Key: SPARK-9740
> URL: https://issues.apache.org/jira/browse/SPARK-9740
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Herman van Hovell
>Assignee: Yin Huai
>  Labels: releasenotes
> Fix For: 1.6.0
>
>
> The FIRST/LAST aggregates implemented as part of the new UDAF interface, 
> return the first or last non-null value (if any) found. This is a departure 
> from the behavior of the old FIRST/LAST aggregates and from the 
> FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, 
> if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' 
> this behavior for the old UDAF interface.
> Hive makes this behavior configurable, by adding a skipNulls flag. I would 
> suggest to do the same, and make the default behavior compatible with Hive.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype

2016-01-25 Thread Jayadevan M (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115682#comment-15115682
 ] 

Jayadevan M commented on SPARK-12941:
-

Working on JdbcDialect.scala

> Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR 
> datatype
> --
>
> Key: SPARK-12941
> URL: https://issues.apache.org/jira/browse/SPARK-12941
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.4.1
> Environment: Apache Spark 1.4.2.2
>Reporter: Jose Martinez Poblete
>
> When exporting data from Spark to Oracle, string datatypes are translated to 
> TEXT for Oracle, this is leading to the following error
> {noformat}
> java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype
> {noformat}
> As per the following code:
> https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144
> See also:
> http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12983) Correct metrics.properties.template

2016-01-25 Thread Benjamin Fradet (JIRA)
Benjamin Fradet created SPARK-12983:
---

 Summary: Correct metrics.properties.template
 Key: SPARK-12983
 URL: https://issues.apache.org/jira/browse/SPARK-12983
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, Spark Core
Reporter: Benjamin Fradet
Priority: Minor


There are some typos or plain unintelligible sentences in the metrics template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12983) Correct metrics.properties.template

2016-01-25 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115661#comment-15115661
 ] 

Apache Spark commented on SPARK-12983:
--

User 'BenFradet' has created a pull request for this issue:
https://github.com/apache/spark/pull/10902

> Correct metrics.properties.template
> ---
>
> Key: SPARK-12983
> URL: https://issues.apache.org/jira/browse/SPARK-12983
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, Spark Core
>Reporter: Benjamin Fradet
>Priority: Minor
>
> There are some typos or plain unintelligible sentences in the metrics 
> template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


