[jira] [Assigned] (SPARK-12993) Remove usage of ADD_FILES in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12993: Assignee: (was: Apache Spark) > Remove usage of ADD_FILES in pyspark > > > Key: SPARK-12993 > URL: https://issues.apache.org/jira/browse/SPARK-12993 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12993) Remove usage of ADD_FILES in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12993: Assignee: Apache Spark > Remove usage of ADD_FILES in pyspark > > > Key: SPARK-12993 > URL: https://issues.apache.org/jira/browse/SPARK-12993 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-12993) Remove usage of ADD_FILES in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116647#comment-15116647 ] Apache Spark commented on SPARK-12993: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/10913 > Remove usage of ADD_FILES in pyspark > > > Key: SPARK-12993 > URL: https://issues.apache.org/jira/browse/SPARK-12993 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor >
[jira] [Commented] (SPARK-11780) Provide type aliases in org.apache.spark.sql.types for backwards compatibility
[ https://issues.apache.org/jira/browse/SPARK-11780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116724#comment-15116724 ] Apache Spark commented on SPARK-11780: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/10915 > Provide type aliases in org.apache.spark.sql.types for backwards compatibility > -- > > Key: SPARK-11780 > URL: https://issues.apache.org/jira/browse/SPARK-11780 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Santiago M. Mola >Assignee: Santiago M. Mola > > With SPARK-11273, ArrayData, MapData and others were moved from > org.apache.spark.sql.types to org.apache.spark.sql.catalyst.util. > Since this is a backward incompatible change, it would be good to provide > type aliases from the old package (deprecated) to the new one. > For example: > {code} > package object types { >@deprecated >type ArrayData = org.apache.spark.sql.catalyst.util.ArrayData > } > {code}
[jira] [Updated] (SPARK-12993) Remove usage of ADD_FILES in pyspark
[ https://issues.apache.org/jira/browse/SPARK-12993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Zhang updated SPARK-12993: --- Description: The environment variable ADD_FILES was created for adding Python files to the Spark context (SPARK-865) and is now deprecated. Users are encouraged to use --py-files to add Python files to executors. > Remove usage of ADD_FILES in pyspark > > > Key: SPARK-12993 > URL: https://issues.apache.org/jira/browse/SPARK-12993 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Reporter: Jeff Zhang >Priority: Minor > > The environment variable ADD_FILES was created for adding Python files to the > Spark context (SPARK-865) and is now deprecated. Users are encouraged to use > --py-files to add Python files to executors.
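For context, a minimal sketch of the legacy behavior being removed (the function name and parsing details below are assumptions for illustration, not Spark's actual code): the PySpark shell read a comma-separated ADD_FILES environment variable and shipped each listed file to the context.

```python
import os

def files_from_add_files(environ=None):
    """Approximate the deprecated ADD_FILES handling: read a
    comma-separated list of Python file paths from the environment."""
    environ = os.environ if environ is None else environ
    raw = environ.get("ADD_FILES")
    return raw.split(",") if raw else []

# The recommended replacement is explicit submission-time configuration:
#   spark-submit --py-files deps.zip,helper.py app.py
print(files_from_add_files({"ADD_FILES": "deps.zip,helper.py"}))  # ['deps.zip', 'helper.py']
```

Moving this to --py-files makes the dependency list part of the submission command rather than hidden shell state.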
[jira] [Assigned] (SPARK-12937) Bloom filter serialization
[ https://issues.apache.org/jira/browse/SPARK-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12937: Assignee: Apache Spark (was: Wenchen Fan) > Bloom filter serialization > -- > > Key: SPARK-12937 > URL: https://issues.apache.org/jira/browse/SPARK-12937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Apache Spark >
[jira] [Commented] (SPARK-12937) Bloom filter serialization
[ https://issues.apache.org/jira/browse/SPARK-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116855#comment-15116855 ] Apache Spark commented on SPARK-12937: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/10920 > Bloom filter serialization > -- > > Key: SPARK-12937 > URL: https://issues.apache.org/jira/browse/SPARK-12937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >
[jira] [Assigned] (SPARK-12937) Bloom filter serialization
[ https://issues.apache.org/jira/browse/SPARK-12937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12937: Assignee: Wenchen Fan (was: Apache Spark) > Bloom filter serialization > -- > > Key: SPARK-12937 > URL: https://issues.apache.org/jira/browse/SPARK-12937 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Wenchen Fan >
[jira] [Commented] (SPARK-12984) Not able to read CSV file using Spark 1.4.0
[ https://issues.apache.org/jira/browse/SPARK-12984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116645#comment-15116645 ] Felix Cheung commented on SPARK-12984: -- You should specify 'source' - otherwise it defaults to parquet, and it seems to fail trying to read the file as parquet. > Not able to read CSV file using Spark 1.4.0 > --- > > Key: SPARK-12984 > URL: https://issues.apache.org/jira/browse/SPARK-12984 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.4.0 > Environment: Unix > Hadoop 2.7.1.2.3.0.0-2557 > R 3.1.1 > Don't have Internet on the server >Reporter: Jai Murugesh Rajasekaran > > Hi, > We are trying to read a CSV file > Downloaded following CSV related package (jar files) and configured using > Maven > 1. spark-csv_2.10-1.2.0.jar > 2. spark-csv_2.10-1.2.0-sources.jar > 3. spark-csv_2.10-1.2.0-javadoc.jar > Trying to execute following script > > library(SparkR) > > sc <- sparkR.init(appName="SparkR-DataFrame") > Re-using existing Spark Context.
Please stop SparkR with sparkR.stop() or > restart R to create a new Spark Context > > sqlContext <- sparkRSQL.init(sc) > > setwd("/home/s/") > > getwd() > [1] "/home/s" > > path <- file.path("Sample.csv") > > Test <- read.df(sqlContext, path) > Note: I am able to read CSV file using regular R function but when tried > using SparkR functions...ended up with error > Initiated SparkR > $ sh -x sparkR -v --repositories > /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar > Error Messages/Log > $ sh -x sparkR -v --repositories > /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar > +++ dirname sparkR > ++ cd ./.. > ++ pwd > + export SPARK_HOME=/opt/spark-1.4.0 > + SPARK_HOME=/opt/spark-1.4.0 > + source /opt/spark-1.4.0/bin/load-spark-env.sh > dirname sparkR > +++ cd ./.. > +++ pwd > ++ FWDIR=/opt/spark-1.4.0 > ++ '[' -z '' ']' > ++ export SPARK_ENV_LOADED=1 > ++ SPARK_ENV_LOADED=1 > dirname sparkR > +++ cd ./.. > +++ pwd > ++ parent_dir=/opt/spark-1.4.0 > ++ user_conf_dir=/opt/spark-1.4.0/conf > ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']' > ++ set -a > ++ . 
/opt/spark-1.4.0/conf/spark-env.sh > +++ export SPARK_HOME=/opt/spark-1.4.0 > +++ SPARK_HOME=/opt/spark-1.4.0 > +++ export YARN_CONF_DIR=/etc/hadoop/conf > +++ YARN_CONF_DIR=/etc/hadoop/conf > +++ export HADOOP_CONF_DIR=/etc/hadoop/conf > +++ HADOOP_CONF_DIR=/etc/hadoop/conf > +++ export HADOOP_CONF_DIR=/etc/hadoop/conf > +++ HADOOP_CONF_DIR=/etc/hadoop/conf > ++ set +a > ++ '[' -z '' ']' > ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11 > ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10 > ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]] > ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']' > ++ export SPARK_SCALA_VERSION=2.10 > ++ SPARK_SCALA_VERSION=2.10 > + export -f usage > + [[ -v --repositories > /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar > = *--help ]] > + [[ -v --repositories > /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar > = *-h ]] > + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories > /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar > R version 3.1.1 (2014-07-10) -- "Sock it to Me" > Copyright (C) 2014 The R Foundation for Statistical Computing > Platform: x86_64-unknown-linux-gnu (64-bit) > R is free software and comes with ABSOLUTELY NO WARRANTY. 
> You are welcome to redistribute it under certain conditions. > Type 'license()' or 'licence()' for distribution details. > Natural language support but running in an English locale > R is a collaborative
[jira] [Resolved] (SPARK-11922) Python API for ml.feature.QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-11922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11922. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10085 [https://github.com/apache/spark/pull/10085] > Python API for ml.feature.QuantileDiscretizer > -- > > Key: SPARK-11922 > URL: https://issues.apache.org/jira/browse/SPARK-11922 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: holdenk >Priority: Minor > Fix For: 2.0.0 > > > Add Python API for ml.feature.QuantileDiscretizer.
[jira] [Created] (SPARK-12997) Use cast expression to perform type cast in csv
Reynold Xin created SPARK-12997: --- Summary: Use cast expression to perform type cast in csv Key: SPARK-12997 URL: https://issues.apache.org/jira/browse/SPARK-12997 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin CSVTypeCast.castTo should probably be removed, and its usage replaced with a projection that applies a sequence of Cast expressions.
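The proposed shape can be illustrated with a hedged Python analogue (Spark's actual Cast is a Catalyst expression in Scala; the names and the CASTS table below are illustrative assumptions): instead of a bespoke per-type castTo helper, build one projection that applies a cast per column.

```python
# Hypothetical stand-ins for Cast expressions, keyed by type name.
CASTS = {"int": int, "double": float, "string": str}

def make_projection(schema):
    """schema: list of (column, type) pairs. Returns a function that
    applies one cast per column to a row of raw CSV strings."""
    casts = [CASTS[type_name] for _, type_name in schema]
    return lambda row: [cast(value) for cast, value in zip(casts, row)]

proj = make_projection([("age", "int"), ("score", "double"), ("name", "string")])
print(proj(["42", "3.5", "bob"]))  # [42, 3.5, 'bob']
```

Centralizing casting in one projection reuses whatever the engine already knows about casts, instead of duplicating that logic in the CSV reader.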
[jira] [Updated] (SPARK-12977) Factoring out StreamingListener and UI to support history UI
[ https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saisai Shao updated SPARK-12977: Attachment: screenshot-1.png > Factoring out StreamingListener and UI to support history UI > > > Key: SPARK-12977 > URL: https://issues.apache.org/jira/browse/SPARK-12977 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Saisai Shao > Attachments: screenshot-1.png > >
[jira] [Created] (SPARK-12994) It is not necessary to create ExecutorAllocationManager in local mode
Jeff Zhang created SPARK-12994: -- Summary: It is not necessary to create ExecutorAllocationManager in local mode Key: SPARK-12994 URL: https://issues.apache.org/jira/browse/SPARK-12994 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Jeff Zhang Priority: Minor
[jira] [Assigned] (SPARK-12995) Remove deprecated APIs from Pregel
[ https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12995: Assignee: (was: Apache Spark) > Remove deprecated APIs from Pregel > - > > Key: SPARK-12995 > URL: https://issues.apache.org/jira/browse/SPARK-12995 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro >
[jira] [Commented] (SPARK-12995) Remove deprecated APIs from Pregel
[ https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116773#comment-15116773 ] Apache Spark commented on SPARK-12995: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/10918 > Remove deprecated APIs from Pregel > - > > Key: SPARK-12995 > URL: https://issues.apache.org/jira/browse/SPARK-12995 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro >
[jira] [Assigned] (SPARK-12995) Remove deprecated APIs from Pregel
[ https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12995: Assignee: Apache Spark > Remove deprecated APIs from Pregel > - > > Key: SPARK-12995 > URL: https://issues.apache.org/jira/browse/SPARK-12995 > Project: Spark > Issue Type: Improvement > Components: GraphX >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >
[jira] [Commented] (SPARK-12977) Factoring out StreamingListener and UI to support history UI
[ https://issues.apache.org/jira/browse/SPARK-12977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116844#comment-15116844 ] Saisai Shao commented on SPARK-12977: - Attaching the current work in progress; some problems still need to be fixed before the patch is ready. > Factoring out StreamingListener and UI to support history UI > > > Key: SPARK-12977 > URL: https://issues.apache.org/jira/browse/SPARK-12977 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Affects Versions: 1.6.0 >Reporter: Saisai Shao > Attachments: screenshot-1.png > >
[jira] [Created] (SPARK-12996) CSVRelation should be based on HadoopFsRelation
Reynold Xin created SPARK-12996: --- Summary: CSVRelation should be based on HadoopFsRelation Key: SPARK-12996 URL: https://issues.apache.org/jira/browse/SPARK-12996 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin
[jira] [Commented] (SPARK-12996) CSVRelation should be based on HadoopFsRelation
[ https://issues.apache.org/jira/browse/SPARK-12996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116774#comment-15116774 ] Reynold Xin commented on SPARK-12996: - cc [~hyukjin.kwon] would you be interested in fixing this? > CSVRelation should be based on HadoopFsRelation > --- > > Key: SPARK-12996 > URL: https://issues.apache.org/jira/browse/SPARK-12996 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >
[jira] [Closed] (SPARK-12702) Populate statistics for DataFrame when reading CSV
[ https://issues.apache.org/jira/browse/SPARK-12702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-12702. --- Resolution: Duplicate Closing this because it is just part of SPARK-12996. > Populate statistics for DataFrame when reading CSV > -- > > Key: SPARK-12702 > URL: https://issues.apache.org/jira/browse/SPARK-12702 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki >
[jira] [Closed] (SPARK-12670) Use spark internal utilities wherever possible
[ https://issues.apache.org/jira/browse/SPARK-12670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-12670. --- Resolution: Won't Fix Going to close this one since it is a little bit too broad. > Use spark internal utilities wherever possible > -- > > Key: SPARK-12670 > URL: https://issues.apache.org/jira/browse/SPARK-12670 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hossein Falaki > > The initial code from spark-csv does not rely on Spark's internal utilities > to maintain backward compatibility across multiple versions of Spark. > * Type casting utilities > * Schema inference utilities > * Unit test utilities
[jira] [Assigned] (SPARK-12968) Implement command to set current database
[ https://issues.apache.org/jira/browse/SPARK-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12968: Assignee: (was: Apache Spark) > Implement command to set current database > - > > Key: SPARK-12968 > URL: https://issues.apache.org/jira/browse/SPARK-12968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > We currently delegate to Hive for the "use database" command. We should implement > this in Spark. > This is important because, as soon as we can track the database, we can > remove the dependency on session state for the catalog API. Right now the > implementation of Catalog actually needs to handle session information itself.
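The motivation can be sketched in a hypothetical Python analogue (this is not Spark's Catalog API; the class and method names are assumptions): once the catalog tracks the current database itself, a "USE db" command no longer needs Hive's session state.

```python
class Catalog:
    """Toy catalog that tracks the current database directly,
    instead of delegating to an external session."""

    def __init__(self, databases, default="default"):
        self._databases = set(databases) | {default}
        self.current_database = default

    def set_current_database(self, name):
        # Equivalent of the SQL command: USE name
        if name not in self._databases:
            raise ValueError("database not found: " + name)
        self.current_database = name

catalog = Catalog({"sales"})
catalog.set_current_database("sales")
print(catalog.current_database)  # sales
```

With the current database held in the catalog, lookups of unqualified table names can be resolved without consulting external session state.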
[jira] [Assigned] (SPARK-12968) Implement command to set current database
[ https://issues.apache.org/jira/browse/SPARK-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12968: Assignee: Apache Spark > Implement command to set current database > - > > Key: SPARK-12968 > URL: https://issues.apache.org/jira/browse/SPARK-12968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark >Priority: Critical > > We currently delegate to Hive for the "use database" command. We should implement > this in Spark. > This is important because, as soon as we can track the database, we can > remove the dependency on session state for the catalog API. Right now the > implementation of Catalog actually needs to handle session information itself.
[jira] [Updated] (SPARK-12995) Remove deprecated APIs from Pregel
[ https://issues.apache.org/jira/browse/SPARK-12995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-12995: - Issue Type: Sub-task (was: Improvement) Parent: SPARK-11806 > Remove deprecated APIs from Pregel > - > > Key: SPARK-12995 > URL: https://issues.apache.org/jira/browse/SPARK-12995 > Project: Spark > Issue Type: Sub-task > Components: GraphX >Affects Versions: 1.6.0 >Reporter: Takeshi Yamamuro >
[jira] [Created] (SPARK-12993) Remove usage of ADD_FILES in pyspark
Jeff Zhang created SPARK-12993: -- Summary: Remove usage of ADD_FILES in pyspark Key: SPARK-12993 URL: https://issues.apache.org/jira/browse/SPARK-12993 Project: Spark Issue Type: Sub-task Components: PySpark Reporter: Jeff Zhang Priority: Minor
[jira] [Assigned] (SPARK-12994) It is not necessary to create ExecutorAllocationManager in local mode
[ https://issues.apache.org/jira/browse/SPARK-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12994: Assignee: (was: Apache Spark) > It is not necessary to create ExecutorAllocationManager in local mode > - > > Key: SPARK-12994 > URL: https://issues.apache.org/jira/browse/SPARK-12994 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Jeff Zhang >Priority: Minor >
[jira] [Assigned] (SPARK-12994) It is not necessary to create ExecutorAllocationManager in local mode
[ https://issues.apache.org/jira/browse/SPARK-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12994: Assignee: Apache Spark > It is not necessary to create ExecutorAllocationManager in local mode > - > > Key: SPARK-12994 > URL: https://issues.apache.org/jira/browse/SPARK-12994 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Jeff Zhang >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-12994) It is not necessary to create ExecutorAllocationManager in local mode
[ https://issues.apache.org/jira/browse/SPARK-12994?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116693#comment-15116693 ] Apache Spark commented on SPARK-12994: -- User 'zjffdu' has created a pull request for this issue: https://github.com/apache/spark/pull/10914 > It is not necessary to create ExecutorAllocationManager in local mode > - > > Key: SPARK-12994 > URL: https://issues.apache.org/jira/browse/SPARK-12994 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Jeff Zhang >Priority: Minor >
[jira] [Commented] (SPARK-12968) Implement command to set current database
[ https://issues.apache.org/jira/browse/SPARK-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116735#comment-15116735 ] Apache Spark commented on SPARK-12968: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/10916 > Implement command to set current database > - > > Key: SPARK-12968 > URL: https://issues.apache.org/jira/browse/SPARK-12968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > We currently delegate to Hive for the "use database" command. We should implement > this in Spark. > This is important because, as soon as we can track the database, we can > remove the dependency on session state for the catalog API. Right now the > implementation of Catalog actually needs to handle session information itself.
[jira] [Created] (SPARK-12995) Remove deprecated APIs from Pregel
Takeshi Yamamuro created SPARK-12995: Summary: Remove deprecated APIs from Pregel Key: SPARK-12995 URL: https://issues.apache.org/jira/browse/SPARK-12995 Project: Spark Issue Type: Improvement Components: GraphX Affects Versions: 1.6.0 Reporter: Takeshi Yamamuro
[jira] [Commented] (SPARK-12888) benchmark the new hash expression
[ https://issues.apache.org/jira/browse/SPARK-12888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15116747#comment-15116747 ] Apache Spark commented on SPARK-12888: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/10917 > benchmark the new hash expression > - > > Key: SPARK-12888 > URL: https://issues.apache.org/jira/browse/SPARK-12888 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > >
[jira] [Updated] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList
[ https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-12834: -- Assignee: Xusen Yin > Use type conversion instead of Ser/De of Pickle to transform JavaArray and > JavaList > --- > > Key: SPARK-12834 > URL: https://issues.apache.org/jira/browse/SPARK-12834 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Xusen Yin >Assignee: Xusen Yin > Fix For: 2.0.0 > > > According to the Ser/De code on the Python side: > {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false} > def _java2py(sc, r, encoding="bytes"): > if isinstance(r, JavaObject): > clsName = r.getClass().getSimpleName() > # convert RDD into JavaRDD > if clsName != 'JavaRDD' and clsName.endswith("RDD"): > r = r.toJavaRDD() > clsName = 'JavaRDD' > if clsName == 'JavaRDD': > jrdd = sc._jvm.SerDe.javaToPython(r) > return RDD(jrdd, sc) > if clsName == 'DataFrame': > return DataFrame(r, SQLContext.getOrCreate(sc)) > if clsName in _picklable_classes: > r = sc._jvm.SerDe.dumps(r) > elif isinstance(r, (JavaArray, JavaList)): > try: > r = sc._jvm.SerDe.dumps(r) > except Py4JJavaError: > pass # not pickable > if isinstance(r, (bytearray, bytes)): > r = PickleSerializer().loads(bytes(r), encoding=encoding) > return r > {code} > We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, > then deserialize them with PickleSerializer on the Python side. However, there is > no need to transform them in such an inefficient way. Instead, we can > use type conversion to convert them, e.g. list(JavaArray) or list(JavaList). > What's more, there is an issue with Ser/De of Scala Array, as noted in > https://issues.apache.org/jira/browse/SPARK-12780
[jira] [Resolved] (SPARK-12834) Use type conversion instead of Ser/De of Pickle to transform JavaArray and JavaList
[ https://issues.apache.org/jira/browse/SPARK-12834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-12834. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10772 [https://github.com/apache/spark/pull/10772] > Use type conversion instead of Ser/De of Pickle to transform JavaArray and > JavaList > --- > > Key: SPARK-12834 > URL: https://issues.apache.org/jira/browse/SPARK-12834 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Xusen Yin >Assignee: Xusen Yin > Fix For: 2.0.0 > > > According to the Ser/De code in Python side: > {code:title=StringIndexerModel|theme=FadeToGrey|linenumbers=true|language=python|firstline=0001|collapse=false} > def _java2py(sc, r, encoding="bytes"): > if isinstance(r, JavaObject): > clsName = r.getClass().getSimpleName() > # convert RDD into JavaRDD > if clsName != 'JavaRDD' and clsName.endswith("RDD"): > r = r.toJavaRDD() > clsName = 'JavaRDD' > if clsName == 'JavaRDD': > jrdd = sc._jvm.SerDe.javaToPython(r) > return RDD(jrdd, sc) > if clsName == 'DataFrame': > return DataFrame(r, SQLContext.getOrCreate(sc)) > if clsName in _picklable_classes: > r = sc._jvm.SerDe.dumps(r) > elif isinstance(r, (JavaArray, JavaList)): > try: > r = sc._jvm.SerDe.dumps(r) > except Py4JJavaError: > pass # not pickable > if isinstance(r, (bytearray, bytes)): > r = PickleSerializer().loads(bytes(r), encoding=encoding) > return r > {code} > We use SerDe.dumps to serialize JavaArray and JavaList in PythonMLLibAPI, > then deserialize them with PickleSerializer in Python side. However, there is > no need to transform them in such an inefficient way. Instead of it, we can > use type conversion to convert them, e.g. list(JavaArray) or list(JavaList). 
> What's more, there is an issue with Ser/De of Scala Array, as I said in > https://issues.apache.org/jira/browse/SPARK-12780
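The proposal above is to replace the pickle round trip with a plain type conversion such as list(JavaArray). As a rough, pure-Python sketch of why the two are equivalent in result (a real py4j JavaArray proxy is not available here, so an ordinary iterable stands in for it):

```python
import pickle

# Hypothetical stand-in for a py4j JavaArray/JavaList proxy: any iterable of
# already-convertible elements. The names here are illustrative, not Spark's.
java_like_seq = range(5)

# Current path (sketched): serialize on the JVM side, deserialize in Python.
via_pickle = pickle.loads(pickle.dumps(list(java_like_seq)))

# Proposed path (sketched): plain type conversion, no Ser/De round trip.
via_conversion = list(java_like_seq)

# Both paths produce the same Python list; only the cost differs.
assert via_pickle == via_conversion == [0, 1, 2, 3, 4]
```

The conversion path skips two serialization passes and an extra buffer copy, which is the inefficiency the ticket targets.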
[jira] [Resolved] (SPARK-12973) Support to set priority when submit spark application to YARN
[ https://issues.apache.org/jira/browse/SPARK-12973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12973. --- Resolution: Duplicate > Support to set priority when submit spark application to YARN > - > > Key: SPARK-12973 > URL: https://issues.apache.org/jira/browse/SPARK-12973 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.6.1 >Reporter: Chaozhong Yang > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115204#comment-15115204 ] Hyukjin Kwon commented on SPARK-12890: -- Actually, I still don't understand what the issue is here. This is probably not about schema merging, as that is disabled by default, and no filter is being pushed down here. I mean, the referenced column would be {{date}} and the given filters would be empty. So it tries to read all the files regardless of file format, as long as the format supports partitioned files. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Created] (SPARK-12979) Paths are resolved relative to the local file system
Iulian Dragos created SPARK-12979: - Summary: Paths are resolved relative to the local file system Key: SPARK-12979 URL: https://issues.apache.org/jira/browse/SPARK-12979 Project: Spark Issue Type: Bug Components: Mesos Affects Versions: 1.6.0 Reporter: Iulian Dragos Spark properties that refer to paths on the cluster (for example, `spark.mesos.executor.home`) should be un-interpreted strings. Currently, such a path is resolved relative to the local (client) file system, and symlinks are resolved, etc. (by calling `getCanonicalPath`).
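The surprising behavior reported here is that canonicalization rewrites a configured path against the client machine's file system. As an illustration only (plain Python, not Spark code): `os.path.realpath` behaves much like Java's `getCanonicalPath`, and the paths below are made up; the sketch assumes a platform with symlink support.

```python
import os
import tempfile

# Build a throwaway directory with a symlink in it, then canonicalize the
# symlink the way getCanonicalPath would on the client side.
with tempfile.TemporaryDirectory() as d:
    real_dir = os.path.join(d, "real")
    os.mkdir(real_dir)
    link = os.path.join(d, "link")
    os.symlink(real_dir, link)  # assumes symlinks are supported here

    configured = link                  # the path a user "configured"
    resolved = os.path.realpath(link)  # what local canonicalization yields
    target = os.path.realpath(real_dir)

# The symlink was resolved against the *local* file system, so the
# configured string did not survive untouched.
assert resolved == target
assert resolved != configured
```

A cluster-side path like `spark.mesos.executor.home` may not even exist on the client, which is why the ticket asks for it to be passed through uninterpreted.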
[jira] [Commented] (SPARK-12968) Implement command to set current database
[ https://issues.apache.org/jira/browse/SPARK-12968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115288#comment-15115288 ] Herman van Hovell commented on SPARK-12968: --- I don't mind if you go ahead and work on this. The only thing is that we need to be a bit careful around SET commands. They currently won't work properly because the SparkSQLParser interprets these as properties being set. I am working on the latter. > Implement command to set current database > - > > Key: SPARK-12968 > URL: https://issues.apache.org/jira/browse/SPARK-12968 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Priority: Critical > > We currently delegate to Hive for "use database" command. We should implement > this in Spark. > The reason this is important is: as soon as we can track the database, we can > remove the dependency on session state for the catalog API. Right now the > implementation of Catalog actually needs to handle session information itself.
[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115148#comment-15115148 ] Liang-Chi Hsieh edited comment on SPARK-12890 at 1/25/16 12:46 PM: --- As {{DataFrame.parquet}} accepts paths as a parameter, you are already specifying which partitions to scan. was (Author: viirya): As {{DataFrame.parquet}} accepts paths as parameter, your partition information can be already embedded in the paths? > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115148#comment-15115148 ] Liang-Chi Hsieh commented on SPARK-12890: - As {{DataFrame.parquet}} accepts paths as parameter, your partition information can be already embedded in the paths? > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115257#comment-15115257 ] Takeshi Yamamuro commented on SPARK-12890: -- Ah, I see. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12980) pyspark crash for large dataset - clone
Christopher Bourez created SPARK-12980: -- Summary: pyspark crash for large dataset - clone Key: SPARK-12980 URL: https://issues.apache.org/jira/browse/SPARK-12980 Project: Spark Issue Type: Bug Affects Versions: 1.5.2 Environment: windows Reporter: Christopher Bourez I tried to import a local text (over 100mb) file via textFile in pyspark; when I ran data.take(), it failed and gave error messages including: 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job Traceback (most recent call last): File "E:/spark_python/test3.py", line 9, in lines.take(5) File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take res = self.context.runJob(self, takeUpToNumLeft, p) File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in __call__ answer, self.gateway_client, self.target_id, self.name) File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco return f(*a, **kw) File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error Then I ran the same code for a small text file, and this time .take() worked fine. How can I solve this problem?
[jira] [Updated] (SPARK-12980) pyspark crash for large dataset - clone
[ https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Bourez updated SPARK-12980: --- Description: I installed spark 1.6 on many different computers. On Windows, PySpark textfile method, followed by take(1), does not work on a file of 13M. If I set numpartitions to 2000 or take a smaller file, the method works well. The Pyspark is set with all RAM memory of the computer thanks to the command --conf spark.driver.memory=5g in local mode. On Mac OS, I'm able to launch the exact same program with Pyspark with 16G RAM for a file of much bigger in comparison, of 5G. Memory is correctly allocated, removed etc On Ubuntu, no trouble, I can also launch a cluster http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html What could be the reason to have the windows spark textfile method fail ? was: I tried to import a local text(over 100mb) file via textFile in pyspark, when i ran data.take(), it failed and gave error messages including: 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job Traceback (most recent call last): File "E:/spark_python/test3.py", line 9, in lines.take(5) File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take res = self.context.runJob(self, takeUpToNumLeft, p) File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions) File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in __call__ answer, self.gateway_client, self.target_id, self.name) File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco return f(*a, **kw) File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in get_return_value format(target_id, ".", name), value) py4j.protocol.Py4JJavaError: An 
error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.net.SocketException: Connection reset by peer: socket write error Then i ran the same code for a small text file, this time .take() worked fine. How can i solve this problem? > pyspark crash for large dataset - clone > --- > > Key: SPARK-12980 > URL: https://issues.apache.org/jira/browse/SPARK-12980 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Environment: windows >Reporter: Christopher Bourez > > I installed spark 1.6 on many different computers. > On Windows, PySpark textfile method, followed by take(1), does not work on a > file of 13M. > If I set numpartitions to 2000 or take a smaller file, the method works well. > The Pyspark is set with all RAM memory of the computer thanks to the command > --conf spark.driver.memory=5g in local mode. > On Mac OS, I'm able to launch the exact same program with Pyspark with 16G > RAM for a file of much bigger in comparison, of 5G. Memory is correctly > allocated, removed etc > On Ubuntu, no trouble, I can also launch a cluster > http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html > What could be the reason to have the windows spark textfile method fail ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12928) Oracle FLOAT datatype is not properly handled when reading via JDBC
[ https://issues.apache.org/jira/browse/SPARK-12928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Michalopoulos updated SPARK-12928: --- Description: When trying to read in a table from Oracle and saveAsParquet, an IllegalArgumentException is thrown when a column of FLOAT datatype is encountered. Below is the code being run: {code}val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> jdbcConnectionString, "dbtable" -> "(select someFloat from someTable)", "fetchSize" -> fetchSize)).load() jdbcDF.saveAsParquetFile(destinationDirectory + table) {code} Here is the exception: {code}java.lang.IllegalArgumentException: Unsupported dataType: {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]}, [1.1] failure: `TimestampType' expected but `{' found {code} >From the exception it was clear that the FLOAT datatype was presenting itself >as scale -127 which appears to be the problem. was: When trying to read in a table from Oracle and saveAsParquet, an IllegalArgumentException is thrown when a column of FLOAT datatype is encountered. Below is the code being run: {code}val jdbcDF = sqlContext.read.format("jdbc").options( Map("url" -> jdbcConnectionString, "dbtable" -> "(select someFloat from someTable"), "fetchSize" -> fetchSize)).load() jdbcDF.saveAsParquetFile(destinationDirectory + table) {code} Here is the exception: {code}java.lang.IllegalArgumentException: Unsupported dataType: {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]}, [1.1] failure: `TimestampType' expected but `{' found {code} >From the exception it was clear that the FLOAT datatype was presenting itself >as scale -127 which appears to be the problem. 
> Oracle FLOAT datatype is not properly handled when reading via JDBC > --- > > Key: SPARK-12928 > URL: https://issues.apache.org/jira/browse/SPARK-12928 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Oracle Database 11g Enterprise Edition 11.2.0.3.0 > 64bit Production >Reporter: Greg Michalopoulos >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When trying to read in a table from Oracle and saveAsParquet, an > IllegalArgumentException is thrown when a column of FLOAT datatype is > encountered. > Below is the code being run: > {code}val jdbcDF = sqlContext.read.format("jdbc").options( > Map("url" -> jdbcConnectionString, > "dbtable" -> "(select someFloat from someTable)", > "fetchSize" -> fetchSize)).load() > jdbcDF.saveAsParquetFile(destinationDirectory + table) > {code} > Here is the exception: > {code}java.lang.IllegalArgumentException: Unsupported dataType: > {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]}, > [1.1] failure: `TimestampType' expected but `{' found > {code} > From the exception it was clear that the FLOAT datatype was presenting itself > as scale -127 which appears to be the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
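The exception above turns on the `decimal(38,-127)` type string: Oracle reports FLOAT (and NUMBER with unspecified scale) as scale -127, and a negative scale is what Spark's type parser of this era rejected. A small, hypothetical helper (not Spark code; the function name is made up for illustration) shows how the offending scale can be extracted and flagged:

```python
import re

def parse_decimal(type_str):
    # Hypothetical helper: pull precision/scale out of a type string like
    # the "decimal(38,-127)" seen in the exception message.
    m = re.fullmatch(r"decimal\((-?\d+),(-?\d+)\)", type_str)
    if not m:
        raise ValueError("not a decimal type: %r" % type_str)
    return int(m.group(1)), int(m.group(2))

precision, scale = parse_decimal("decimal(38,-127)")
assert (precision, scale) == (38, -127)
# The negative scale is the red flag: it is Oracle's sentinel for
# "no scale specified", not a real fixed-point scale.
assert scale < 0
```

A plausible fix direction, as the linked pull request pursues, is to map such Oracle types to a floating-point Spark type instead of a decimal with a nonsensical scale.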
[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115076#comment-15115076 ] Takeshi Yamamuro edited comment on SPARK-12890 at 1/25/16 1:32 PM: --- I looked over the related codes; partition pruning optimization itself has been implemented in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74. However, there is no interface in DataFrameReader#parquet to pass partition information (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321). was (Author: maropu): I looked over the related codes; partition pruning optimization itself has been implemented in https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L74. However, there is no interface in DataFrame#parquet to pass partition information (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L321). > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115204#comment-15115204 ] Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 1:44 PM: --- Actually, I still don't understand what the issue is here. This is probably not related to schema merging, as that is disabled by default, and no filter is being pushed down here. It does not automatically create a filter and push it down, as far as I know. I mean, the referenced column would be {{date}} and the given filters would be empty. So it tries to read all the files regardless of file format, as long as the format supports partitioned files. was (Author: hyukjin.kwon): Actually I don't still understand what is an issue here. This might not be merging schemas as it is disabled by default and any filter is not being pushed down here. I mean, the referenced column would be {{date}} and given filters would be empty. So it tries to read all the files regardless of file format as long as it supports to partitioned files. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Updated] (SPARK-12980) pyspark crash for large dataset - clone
[ https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Bourez updated SPARK-12980: --- Description: I installed spark 1.6 on many different computers. On Windows, PySpark textfile method, followed by take(1), does not work on a file of 13M. If I set numpartitions to 2000 or take a smaller file, the method works well. The Pyspark is set with all RAM memory of the computer thanks to the command --conf spark.driver.memory=5g in local mode. On Mac OS, I'm able to launch the exact same program with Pyspark with 16G RAM for a file of much bigger in comparison, of 5G. Memory is correctly allocated, removed etc On Ubuntu, no trouble, I can also launch a cluster http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html The error message on Windows is : java.net.SocketException: Connection reset by peer: socket write error What could be the reason to have the windows spark textfile method fail ? was: I installed spark 1.6 on many different computers. On Windows, PySpark textfile method, followed by take(1), does not work on a file of 13M. If I set numpartitions to 2000 or take a smaller file, the method works well. The Pyspark is set with all RAM memory of the computer thanks to the command --conf spark.driver.memory=5g in local mode. On Mac OS, I'm able to launch the exact same program with Pyspark with 16G RAM for a file of much bigger in comparison, of 5G. Memory is correctly allocated, removed etc On Ubuntu, no trouble, I can also launch a cluster http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html What could be the reason to have the windows spark textfile method fail ? 
> pyspark crash for large dataset - clone > --- > > Key: SPARK-12980 > URL: https://issues.apache.org/jira/browse/SPARK-12980 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Environment: windows >Reporter: Christopher Bourez > > I installed spark 1.6 on many different computers. > On Windows, PySpark textfile method, followed by take(1), does not work on a > file of 13M. > If I set numpartitions to 2000 or take a smaller file, the method works well. > The Pyspark is set with all RAM memory of the computer thanks to the command > --conf spark.driver.memory=5g in local mode. > On Mac OS, I'm able to launch the exact same program with Pyspark with 16G > RAM for a file of much bigger in comparison, of 5G. Memory is correctly > allocated, removed etc > On Ubuntu, no trouble, I can also launch a cluster > http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html > The error message on Windows is : java.net.SocketException: Connection reset > by peer: socket write error > What could be the reason to have the windows spark textfile method fail ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10911) Executors should System.exit on clean shutdown
[ https://issues.apache.org/jira/browse/SPARK-10911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115225#comment-15115225 ] Thomas Graves commented on SPARK-10911: --- see the pull request for comments and discussion https://github.com/apache/spark/pull/9946 > Executors should System.exit on clean shutdown > -- > > Key: SPARK-10911 > URL: https://issues.apache.org/jira/browse/SPARK-10911 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.5.1 >Reporter: Thomas Graves >Assignee: Zhuo Liu >Priority: Minor > > Executors should call System.exit on clean shutdown to make sure all user > threads exit and jvm shuts down. > We ran into a case where an Executor was left around for days trying to > shutdown because the user code was using a non-daemon thread pool and one of > those threads wasn't exiting. We should force the jvm to go away with > System.exit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
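The failure mode described in SPARK-10911 (a non-daemon user thread keeping the JVM alive for days) has a direct analogue in Python, which makes for a compact illustration. This is a sketch of the idea only, not Spark code: `os._exit` plays the role of Java's `System.exit`, forcing the process down despite a lingering non-daemon thread.

```python
import subprocess
import sys
import textwrap

# Run a tiny script in a child process: it starts a non-daemon thread that
# would keep the process alive for an hour, then forces a clean exit anyway.
script = textwrap.dedent("""
    import os, threading, time
    threading.Thread(target=time.sleep, args=(3600,)).start()  # non-daemon
    os._exit(0)  # analogous to System.exit: do not wait for user threads
""")
proc = subprocess.run([sys.executable, "-c", script], timeout=10)

# Without the forced exit the child would hang on the sleeping thread;
# with it, it returns promptly and cleanly.
assert proc.returncode == 0
```

The same reasoning motivates the executor change: shutdown should not be hostage to whatever thread pools user code happened to create.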
[jira] [Commented] (SPARK-3611) Show number of cores for each executor in application web UI
[ https://issues.apache.org/jira/browse/SPARK-3611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115309#comment-15115309 ] Thomas Graves commented on SPARK-3611: -- I know the pull request was closed because this information could not be reliably obtained; it looks like it's now available through the ExecutorInfo structure. > Show number of cores for each executor in application web UI > > > Key: SPARK-3611 > URL: https://issues.apache.org/jira/browse/SPARK-3611 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Matei Zaharia >Priority: Minor > Labels: starter > > This number is not always fully known, because e.g. in Mesos your executors > can scale up and down in # of CPUs, but it would be nice to show at least the > number of cores the machine has in that case, or the # of cores the executor > has been configured with if known.
[jira] [Assigned] (SPARK-12928) Oracle FLOAT datatype is not properly handled when reading via JDBC
[ https://issues.apache.org/jira/browse/SPARK-12928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12928: Assignee: (was: Apache Spark) > Oracle FLOAT datatype is not properly handled when reading via JDBC > --- > > Key: SPARK-12928 > URL: https://issues.apache.org/jira/browse/SPARK-12928 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Oracle Database 11g Enterprise Edition 11.2.0.3.0 > 64bit Production >Reporter: Greg Michalopoulos >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When trying to read in a table from Oracle and saveAsParquet, an > IllegalArgumentException is thrown when a column of FLOAT datatype is > encountered. > Below is the code being run: > {code}val jdbcDF = sqlContext.read.format("jdbc").options( > Map("url" -> jdbcConnectionString, > "dbtable" -> "(select someFloat from someTable"), > "fetchSize" -> fetchSize)).load() > jdbcDF.saveAsParquetFile(destinationDirectory + table) > {code} > Here is the exception: > {code}java.lang.IllegalArgumentException: Unsupported dataType: > {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]}, > [1.1] failure: `TimestampType' expected but `{' found > {code} > From the exception it was clear that the FLOAT datatype was presenting itself > as scale -127 which appears to be the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12928) Oracle FLOAT datatype is not properly handled when reading via JDBC
[ https://issues.apache.org/jira/browse/SPARK-12928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115332#comment-15115332 ] Apache Spark commented on SPARK-12928: -- User 'poolis' has created a pull request for this issue: https://github.com/apache/spark/pull/10899 > Oracle FLOAT datatype is not properly handled when reading via JDBC > --- > > Key: SPARK-12928 > URL: https://issues.apache.org/jira/browse/SPARK-12928 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Oracle Database 11g Enterprise Edition 11.2.0.3.0 > 64bit Production >Reporter: Greg Michalopoulos >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When trying to read in a table from Oracle and saveAsParquet, an > IllegalArgumentException is thrown when a column of FLOAT datatype is > encountered. > Below is the code being run: > {code}val jdbcDF = sqlContext.read.format("jdbc").options( > Map("url" -> jdbcConnectionString, > "dbtable" -> "(select someFloat from someTable"), > "fetchSize" -> fetchSize)).load() > jdbcDF.saveAsParquetFile(destinationDirectory + table) > {code} > Here is the exception: > {code}java.lang.IllegalArgumentException: Unsupported dataType: > {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]}, > [1.1] failure: `TimestampType' expected but `{' found > {code} > From the exception it was clear that the FLOAT datatype was presenting itself > as scale -127 which appears to be the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12928) Oracle FLOAT datatype is not properly handled when reading via JDBC
[ https://issues.apache.org/jira/browse/SPARK-12928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12928: Assignee: Apache Spark > Oracle FLOAT datatype is not properly handled when reading via JDBC > --- > > Key: SPARK-12928 > URL: https://issues.apache.org/jira/browse/SPARK-12928 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: Oracle Database 11g Enterprise Edition 11.2.0.3.0 > 64bit Production >Reporter: Greg Michalopoulos >Assignee: Apache Spark >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > When trying to read in a table from Oracle and saveAsParquet, an > IllegalArgumentException is thrown when a column of FLOAT datatype is > encountered. > Below is the code being run: > {code}val jdbcDF = sqlContext.read.format("jdbc").options( > Map("url" -> jdbcConnectionString, > "dbtable" -> "(select someFloat from someTable"), > "fetchSize" -> fetchSize)).load() > jdbcDF.saveAsParquetFile(destinationDirectory + table) > {code} > Here is the exception: > {code}java.lang.IllegalArgumentException: Unsupported dataType: > {"type":"struct","fields":[{"name":"someFloat","type":"decimal(38,-127)","nullable":true,"metadata":{"name":"someFloat"}}]}, > [1.1] failure: `TimestampType' expected but `{' found > {code} > From the exception it was clear that the FLOAT datatype was presenting itself > as scale -127 which appears to be the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12360) Support using 64-bit long type in SparkR
[ https://issues.apache.org/jira/browse/SPARK-12360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115376#comment-15115376 ] Dmitriy Selivanov commented on SPARK-12360: --- +1 for bit64 > Support using 64-bit long type in SparkR > > > Key: SPARK-12360 > URL: https://issues.apache.org/jira/browse/SPARK-12360 > Project: Spark > Issue Type: New Feature > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Sun Rui > > R has no support for 64-bit integers. While in Scala/Java API, some methods > have one or more arguments of long type. Currently we support only passing an > integer cast from a numeric to Scala/Java side for parameters of long type of > such methods. This may have problem covering large data sets. > Storing a 64-bit integer in a double obviously does not work as some 64-bit > integers can not be exactly represented in double format, so x and x+1 can't > be distinguished. > There is a bit64 package > (https://cran.r-project.org/web/packages/bit64/index.html) in CRAN which > supports vectors of 64-bit integers. We can investigate if it can be used for > this purpose. > two questions are: > 1. Is the license acceptable? > 2. This will have SparkR depends on a non-base third-party package, which > may complicate the deployment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
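The ticket's premise (that a double cannot distinguish x from x + 1 for large 64-bit integers, so storing longs in R's numeric loses information) is easy to verify. Plain Python floats stand in here for R's numeric; both are IEEE-754 doubles with a 53-bit significand:

```python
# Integers above 2**53 are not all exactly representable as doubles,
# so x and x + 1 can collapse to the same value.
x = 2 ** 53

assert float(x) + 1.0 == float(x)      # 2**53 + 1 rounds back to 2**53
assert float(x) + 2.0 == float(x + 2)  # the next representable integer
assert int(float(x + 1)) != x + 1      # round-tripping loses the +1
```

This is exactly why a dedicated 64-bit integer vector type such as bit64's integer64 is needed rather than a numeric workaround.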
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115144#comment-15115144 ] Liang-Chi Hsieh commented on SPARK-12890: - For the original issue, I think it might be because you have schema merging enabled. In order to get the correct schema, Spark will scan the footers of all Parquet files to merge their schemas. Try disabling schema merging if you don't need it, and see if that solves your problem. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data, which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
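The improvement requested in SPARK-12890 is that a query touching only partition columns could be answered from partition directory names alone, without scanning file contents. A sketch of that idea outside Spark (paths and the helper are illustrative, not Spark internals):

```python
import re

def max_partition_value(paths, column="date"):
    # Extract the partition value from Hive-style paths such as
    # table/date=2016-01-25/part-00000, reading no file data at all.
    pattern = re.compile(r"/%s=([^/]+)/" % re.escape(column))
    values = [m.group(1) for p in paths for m in [pattern.search(p)] if m]
    return max(values) if values else None

paths = [
    "table/date=2016-01-24/part-00000",
    "table/date=2016-01-25/part-00000",
    "table/date=2016-01-23/part-00001",
]
assert max_partition_value(paths) == "2016-01-25"
```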
[jira] [Comment Edited] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115204#comment-15115204 ] Hyukjin Kwon edited comment on SPARK-12890 at 1/25/16 1:46 PM: --- Actually, I still don't understand what the issue is here. This might not be related to schema merging, as that is disabled by default, and no filter is being pushed down here. As far as I know, Spark does not automatically create a filter for a function and push it down. I mean, the referenced column would be {{date}} and the given filters would be empty, so it tries to read all the files regardless of file format, as long as the format supports partitioned files. was (Author: hyukjin.kwon): Actually I don't still understand what is an issue here. This might not be related with merging schemas as it is disabled by default and any filter is not being pushed down here. It does not automatically create a filter and pushes down it as far as I know. I mean, the referenced column would be {{date}} and given filters would be empty. So it tries to read all the files regardless of file format as long as it supports to partitioned files. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date.
[jira] [Assigned] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12492: Assignee: (was: Apache Spark) > SQL page of Spark-sql is always blank > -- > > Key: SPARK-12492 > URL: https://issues.apache.org/jira/browse/SPARK-12492 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Reporter: meiyoula > Attachments: screenshot-1.png > > > When I run a sql query in spark-sql, the Execution page of SQL tab is always > blank. But the JDBCServer is not blank. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-12492: Assignee: Apache Spark > SQL page of Spark-sql is always blank > -- > > Key: SPARK-12492 > URL: https://issues.apache.org/jira/browse/SPARK-12492 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Reporter: meiyoula >Assignee: Apache Spark > Attachments: screenshot-1.png > > > When I run a sql query in spark-sql, the Execution page of SQL tab is always > blank. But the JDBCServer is not blank. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12492) SQL page of Spark-sql is always blank
[ https://issues.apache.org/jira/browse/SPARK-12492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115334#comment-15115334 ] Apache Spark commented on SPARK-12492: -- User 'KaiXinXiaoLei' has created a pull request for this issue: https://github.com/apache/spark/pull/10900 > SQL page of Spark-sql is always blank > -- > > Key: SPARK-12492 > URL: https://issues.apache.org/jira/browse/SPARK-12492 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Reporter: meiyoula > Attachments: screenshot-1.png > > > When I run a sql query in spark-sql, the Execution page of SQL tab is always > blank. But the JDBCServer is not blank. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12975) Throwing Exception when Bucketing Columns are part of Partitioning Columns
[ https://issues.apache.org/jira/browse/SPARK-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12975: Description: When users are using partitionBy and bucketBy at the same time, some bucketing columns might be part of partitioning columns. For example, {code} df.write .format(source) .partitionBy("i") .bucketBy(8, "i", "k") .sortBy("k") .saveAsTable("bucketed_table") {code} However, in the above case, adding column `i` into `bucketBy` is useless. It is just wasting extra CPU when reading or writing bucket tables. Thus, like Hive, we can issue an exception and let users do the change. was: When users are using partitionBy and bucketBy at the same time, some bucketing columns might be part of partitioning columns. For example, {code} df.write .format(source) .partitionBy("i") .bucketBy(8, "i", "k") .sortBy("k") .saveAsTable("bucketed_table") {code} However, in the above case, adding column `i` is useless. It is just wasting extra CPU when reading or writing bucket tables. Thus, we can automatically remove these overlapping columns from the bucketing columns. > Throwing Exception when Bucketing Columns are part of Partitioning Columns > -- > > Key: SPARK-12975 > URL: https://issues.apache.org/jira/browse/SPARK-12975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > When users are using partitionBy and bucketBy at the same time, some > bucketing columns might be part of partitioning columns. For example, > {code} > df.write > .format(source) > .partitionBy("i") > .bucketBy(8, "i", "k") > .sortBy("k") > .saveAsTable("bucketed_table") > {code} > However, in the above case, adding column `i` into `bucketBy` is useless. It > is just wasting extra CPU when reading or writing bucket tables. Thus, like > Hive, we can issue an exception and let users do the change. 
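The validation proposed in SPARK-12975 can be sketched as a simple overlap check (a hypothetical helper modelling the semantics, not Spark's actual implementation):

```python
def check_bucket_columns(partition_cols, bucket_cols):
    # Raise, as Hive does, when a bucketing column is also a
    # partitioning column -- bucketing by it adds no information
    # and only wastes CPU when reading or writing the table.
    overlap = set(partition_cols) & set(bucket_cols)
    if overlap:
        raise ValueError(
            "bucketing columns %s overlap with partitioning columns"
            % sorted(overlap))
    return True

assert check_bucket_columns(["i"], ["j", "k"])
try:
    check_bucket_columns(["i"], ["i", "k"])  # mirrors the example above
except ValueError as e:
    assert "i" in str(e)
```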
[jira] [Updated] (SPARK-12975) Throwing Exception when Bucketing Columns are part of Partitioning Columns
[ https://issues.apache.org/jira/browse/SPARK-12975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-12975: Summary: Throwing Exception when Bucketing Columns are part of Partitioning Columns (was: Eliminate Bucketing Columns that are part of Partitioning Columns) > Throwing Exception when Bucketing Columns are part of Partitioning Columns > -- > > Key: SPARK-12975 > URL: https://issues.apache.org/jira/browse/SPARK-12975 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > When users are using partitionBy and bucketBy at the same time, some > bucketing columns might be part of partitioning columns. For example, > {code} > df.write > .format(source) > .partitionBy("i") > .bucketBy(8, "i", "k") > .sortBy("k") > .saveAsTable("bucketed_table") > {code} > However, in the above case, adding column `i` is useless. It is just wasting > extra CPU when reading or writing bucket tables. Thus, we can automatically > remove these overlapping columns from the bucketing columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12984) Not able to read CSV file using Spark 1.4.0
Jai Murugesh Rajasekaran created SPARK-12984: Summary: Not able to read CSV file using Spark 1.4.0 Key: SPARK-12984 URL: https://issues.apache.org/jira/browse/SPARK-12984 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 1.4.0 Environment: Unix Hadoop 2.7.1.2.3.0.0-2557 R 3.1.1 Don't have Internet on the server Reporter: Jai Murugesh Rajasekaran Hi, We are trying to read a CSV file Downloaded following CSV related package (jar files) and configured using Maven 1. spark-csv_2.10-1.2.0.jar 2. spark-csv_2.10-1.2.0-sources.jar 3. spark-csv_2.10-1.2.0-javadoc.jar Trying to execute following script > library(SparkR) > sc <- sparkR.init(appName="SparkR-DataFrame") Re-using existing Spark Context. Please stop SparkR with sparkR.stop() or restart R to create a new Spark Context > sqlContext <- sparkRSQL.init(sc) > setwd("/home/s/") > getwd() [1] "/home/s" > path <- file.path("Sample.csv") > Test <- read.df(sqlContext, path) Note: I am able to read CSV file using regular R function but when tried using SparkR functions...ended up with error Initiated SparkR $ sh -x sparkR -v --repositories /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar Error Messages/Log $ sh -x sparkR -v --repositories /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar +++ dirname sparkR ++ cd ./.. ++ pwd + export SPARK_HOME=/opt/spark-1.4.0 + SPARK_HOME=/opt/spark-1.4.0 + source /opt/spark-1.4.0/bin/load-spark-env.sh dirname sparkR +++ cd ./.. 
+++ pwd ++ FWDIR=/opt/spark-1.4.0 ++ '[' -z '' ']' ++ export SPARK_ENV_LOADED=1 ++ SPARK_ENV_LOADED=1 dirname sparkR +++ cd ./.. +++ pwd ++ parent_dir=/opt/spark-1.4.0 ++ user_conf_dir=/opt/spark-1.4.0/conf ++ '[' -f /opt/spark-1.4.0/conf/spark-env.sh ']' ++ set -a ++ . /opt/spark-1.4.0/conf/spark-env.sh +++ export SPARK_HOME=/opt/spark-1.4.0 +++ SPARK_HOME=/opt/spark-1.4.0 +++ export YARN_CONF_DIR=/etc/hadoop/conf +++ YARN_CONF_DIR=/etc/hadoop/conf +++ export HADOOP_CONF_DIR=/etc/hadoop/conf +++ HADOOP_CONF_DIR=/etc/hadoop/conf +++ export HADOOP_CONF_DIR=/etc/hadoop/conf +++ HADOOP_CONF_DIR=/etc/hadoop/conf ++ set +a ++ '[' -z '' ']' ++ ASSEMBLY_DIR2=/opt/spark-1.4.0/assembly/target/scala-2.11 ++ ASSEMBLY_DIR1=/opt/spark-1.4.0/assembly/target/scala-2.10 ++ [[ -d /opt/spark-1.4.0/assembly/target/scala-2.11 ]] ++ '[' -d /opt/spark-1.4.0/assembly/target/scala-2.11 ']' ++ export SPARK_SCALA_VERSION=2.10 ++ SPARK_SCALA_VERSION=2.10 + export -f usage + [[ -v --repositories /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar = *--help ]] + [[ -v --repositories /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar = *-h ]] + exec /opt/spark-1.4.0/bin/spark-submit sparkr-shell-main -v --repositories /home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0/spark-csv_2.10-1.2.0.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-javadoc/spark-csv_2.10-1.2.0-javadoc.jar,/home/s/.m2/repository/com/databricks/spark-csv_2.10/1.2.0-sources/spark-csv_2.10-1.2.0-sources.jar R version 3.1.1 (2014-07-10) -- "Sock it 
to Me" Copyright (C) 2014 The R Foundation for Statistical Computing Platform: x86_64-unknown-linux-gnu (64-bit) R is free software and comes with ABSOLUTELY NO WARRANTY. You are welcome to redistribute it under certain conditions. Type 'license()' or 'licence()' for distribution details. Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R. Revolution R Enterprise version 7.3: an enhanced distribution of R Revolution Analytics packages Copyright (C) 2014 Revolution Analytics, Inc. Type 'revo()' to visit
[jira] [Created] (SPARK-12985) Spark Hive thrift server big decimal data issue
Alex Liu created SPARK-12985: Summary: Spark Hive thrift server big decimal data issue Key: SPARK-12985 URL: https://issues.apache.org/jira/browse/SPARK-12985 Project: Spark Issue Type: Bug Affects Versions: 1.6.0 Reporter: Alex Liu Priority: Minor I tested the trial version JDBC driver from Simba, it works for simple query. But there is some issue with data mapping. e.g. {code} java.sql.SQLException: [Simba][SparkJDBCDriver](500312) Error in fetching data rows: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal; at com.simba.spark.hivecommon.api.HS2Client.buildExceptionFromTStatus(Unknown Source) at com.simba.spark.hivecommon.api.HS2Client.fetchNRows(Unknown Source) at com.simba.spark.hivecommon.api.HS2Client.fetchRows(Unknown Source) at com.simba.spark.hivecommon.dataengine.BackgroundFetcher.run(Unknown Source) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask.run(FutureTask.java:262) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) Caused by: com.simba.spark.support.exceptions.GeneralException: [Simba][SparkJDBCDriver](500312) Error in fetching data rows: java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal; ... 8 more {code} To fix it {code} case DecimalType() => -to += from.getDecimal(ordinal) +to += HiveDecimal.create(from.getDecimal(ordinal)) {code} to https://github.com/apache/spark/blob/master/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala#L87 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression
[ https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12633: -- Assignee: Vijay Kiran > Make Parameter Descriptions Consistent for PySpark MLlib Regression > --- > > Key: SPARK-12633 > URL: https://issues.apache.org/jira/browse/SPARK-12633 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Vijay Kiran >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12631) Make Parameter Descriptions Consistent for PySpark MLlib Clustering
[ https://issues.apache.org/jira/browse/SPARK-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12631: -- Assignee: Bryan Cutler > Make Parameter Descriptions Consistent for PySpark MLlib Clustering > --- > > Key: SPARK-12631 > URL: https://issues.apache.org/jira/browse/SPARK-12631 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > clustering.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12986) Fix pydoc warnings in mllib/regression.py
Xiangrui Meng created SPARK-12986: - Summary: Fix pydoc warnings in mllib/regression.py Key: SPARK-12986 URL: https://issues.apache.org/jira/browse/SPARK-12986 Project: Spark Issue Type: Bug Components: MLlib, PySpark Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Yu Ishikawa Priority: Minor Got those warnings by running "make html" under "python/docs/": {code} /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of pyspark.mllib.regression.LinearRegressionWithSGD:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of pyspark.mllib.regression.LinearRegressionWithSGD:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of pyspark.mllib.regression.RidgeRegressionWithSGD:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of pyspark.mllib.regression.RidgeRegressionWithSGD:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of pyspark.mllib.regression.LassoWithSGD:3: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of pyspark.mllib.regression.LassoWithSGD:4: WARNING: Block quote ends without a blank line; unexpected unindent. /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of pyspark.mllib.regression.IsotonicRegression:7: ERROR: Unexpected indentation. /Users/meng/src/spark/python/pyspark/mllib/regression.py:docstring of pyspark.mllib.regression.IsotonicRegression:12: ERROR: Unexpected indentation. {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
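The Sphinx errors listed in SPARK-12986 are typical of reStructuredText block quotes not separated by blank lines. A before/after sketch (the docstring text is illustrative, not the actual regression.py content):

```python
# Bad: the indented block follows the summary line directly, which makes
# Sphinx report "Unexpected indentation" and "Block quote ends without a
# blank line; unexpected unindent".
bad = """Train a linear regression model.
    some indented continuation
back to normal text."""

# Good: a blank line before and after the indented block keeps reST happy.
good = """Train a linear regression model.

    some indented continuation

back to normal text."""

assert "\n\n" not in bad
assert good.count("\n\n") == 2
```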
[jira] [Updated] (SPARK-12631) Make Parameter Descriptions Consistent for PySpark MLlib Clustering
[ https://issues.apache.org/jira/browse/SPARK-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12631: -- Shepherd: Xiangrui Meng > Make Parameter Descriptions Consistent for PySpark MLlib Clustering > --- > > Key: SPARK-12631 > URL: https://issues.apache.org/jira/browse/SPARK-12631 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > clustering.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification
[ https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12630: -- Assignee: Vijay Kiran > Make Parameter Descriptions Consistent for PySpark MLlib Classification > --- > > Key: SPARK-12630 > URL: https://issues.apache.org/jira/browse/SPARK-12630 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Vijay Kiran >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > classification.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12632: -- Assignee: somil deshmukh > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: somil deshmukh >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree
[ https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12634: -- Assignee: Vijay Kiran > Make Parameter Descriptions Consistent for PySpark MLlib Tree > - > > Key: SPARK-12634 > URL: https://issues.apache.org/jira/browse/SPARK-12634 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Vijay Kiran >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up tree.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression
[ https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12633: -- Shepherd: Bryan Cutler > Make Parameter Descriptions Consistent for PySpark MLlib Regression > --- > > Key: SPARK-12633 > URL: https://issues.apache.org/jira/browse/SPARK-12633 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115904#comment-15115904 ] Emlyn Corrin commented on SPARK-9740: - Thanks for the help. I've tried with {{callUDF}} and that gives me the same error as when I use {{expr}}. For now I've managed to work around it by calling {{registerTempTable("tempTable")}} on the DataFrame, and then {{SQLContext.sql("SELECT LAST(colName,true) OVER(...) FROM tempTable")}}, which works, but feels a bit hacky. I'll try to put together a minimal example that demonstrates this, as it is currently in the middle of a fairly big Clojure application that calls Spark through Java interop. > first/last aggregate NULL behavior > -- > > Key: SPARK-9740 > URL: https://issues.apache.org/jira/browse/SPARK-9740 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Yin Huai > Labels: releasenotes > Fix For: 1.6.0 > > > The FIRST/LAST aggregates implemented as part of the new UDAF interface, > return the first or last non-null value (if any) found. This is a departure > from the behavior of the old FIRST/LAST aggregates and from the > FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, > if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' > this behavior for the old UDAF interface. > Hive makes this behavior configurable, by adding a skipNulls flag. I would > suggest to do the same, and make the default behavior compatible with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
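The behavioral difference discussed in SPARK-9740 can be modelled in plain Python (a sketch of the semantics, not Spark's implementation):

```python
def last(values, ignore_nulls=False):
    # Hive-compatible default: return the literal last value, even if it
    # happens to be null. With ignore_nulls=True (Hive's skipNulls flag),
    # return the last non-null value instead.
    if ignore_nulls:
        non_null = [v for v in values if v is not None]
        return non_null[-1] if non_null else None
    return values[-1] if values else None

vals = [1, 2, None]
assert last(vals) is None                  # Hive-style LAST_VALUE
assert last(vals, ignore_nulls=True) == 2  # skip-nulls variant
```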
[jira] [Updated] (SPARK-12634) Make Parameter Descriptions Consistent for PySpark MLlib Tree
[ https://issues.apache.org/jira/browse/SPARK-12634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12634: -- Shepherd: Bryan Cutler Target Version/s: 2.0.0 > Make Parameter Descriptions Consistent for PySpark MLlib Tree > - > > Key: SPARK-12634 > URL: https://issues.apache.org/jira/browse/SPARK-12634 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Vijay Kiran >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up tree.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12980) pyspark crash for large dataset - clone
[ https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Bourez updated SPARK-12980: --- Description: I installed spark 1.6 on many different computers. On Windows, PySpark textfile method, followed by take(1), does not work on a file of 13M. If I set numpartitions to 2000 or take a smaller file, the method works well. The Pyspark is set with all RAM memory of the computer thanks to the command --conf spark.driver.memory=5g in local mode. On Mac OS, I'm able to launch the exact same program with Pyspark with 16G RAM for a file of much bigger in comparison, of 5G. Memory is correctly allocated, removed etc On Ubuntu, no trouble, I can also launch a cluster http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html The error message on Windows is : java.net.SocketException: Connection reset by peer: socket write error Configuration is : Java 8 64 bit, Python 2.7.11, on Windows 7 entreprise SP1 v2.42.01 What could be the reason to have the windows spark textfile method fail ? was: I installed spark 1.6 on many different computers. On Windows, PySpark textfile method, followed by take(1), does not work on a file of 13M. If I set numpartitions to 2000 or take a smaller file, the method works well. The Pyspark is set with all RAM memory of the computer thanks to the command --conf spark.driver.memory=5g in local mode. On Mac OS, I'm able to launch the exact same program with Pyspark with 16G RAM for a file of much bigger in comparison, of 5G. 
Memory is correctly allocated, removed etc On Ubuntu, no trouble, I can also launch a cluster http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html The error message on Windows is : java.net.SocketException: Connection reset by peer: socket write error What could be the reason to have the windows spark textfile method fail ? > pyspark crash for large dataset - clone > --- > > Key: SPARK-12980 > URL: https://issues.apache.org/jira/browse/SPARK-12980 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Environment: windows >Reporter: Christopher Bourez > > I installed spark 1.6 on many different computers. > On Windows, PySpark textfile method, followed by take(1), does not work on a > file of 13M. > If I set numpartitions to 2000 or take a smaller file, the method works well. > The Pyspark is set with all RAM memory of the computer thanks to the command > --conf spark.driver.memory=5g in local mode. > On Mac OS, I'm able to launch the exact same program with Pyspark with 16G > RAM for a file of much bigger in comparison, of 5G. Memory is correctly > allocated, removed etc > On Ubuntu, no trouble, I can also launch a cluster > http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html > The error message on Windows is : java.net.SocketException: Connection reset > by peer: socket write error > Configuration is : Java 8 64 bit, Python 2.7.11, on Windows 7 entreprise SP1 > v2.42.01 > What could be the reason to have the windows spark textfile method fail ? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier
Grzegorz Chilkiewicz created SPARK-12982: Summary: SQLContext: temporary table registration does not accept valid identifier Key: SPARK-12982 URL: https://issues.apache.org/jira/browse/SPARK-12982 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.6.0 Reporter: Grzegorz Chilkiewicz Priority: Minor We have encountered very strange behavior of SparkSQL temporary table registration. What identifiers for temporary table should be valid? Alphanumerical + '_' with at least one non-digit? Valid identifiers: df 674123a 674123_ a0e97c59_4445_479d_a7ef_d770e3874123 1ae97c59_4445_479d_a7ef_d770e3874123 Invalid identifier: 10e97c59_4445_479d_a7ef_d770e3874123 Stack trace: [error] java.lang.RuntimeException: [1.1] failure: identifier expected [error] [error] 10e97c59_4445_479d_a7ef_d770e3874123 [error] ^ [error] at scala.sys.package$.error(package.scala:27) [error] at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58) [error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) [error] at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763) [error] at io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27) [error] at io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58) [error] at io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala) Code to reproduce bug: https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
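One plausible explanation for the SPARK-12982 pattern (an inference from the examples given, not confirmed against the SqlParser source) is that the rejected name begins with text that lexes as a numeric literal: {{10e97...}} starts like scientific notation, while {{674123a}} and {{1ae97...}} do not:

```python
import re

# A float literal in scientific notation: digits, 'e', digits.
SCI_LITERAL = re.compile(r"^\d+e\d+")

names = {
    "674123a": True,                                  # accepted
    "a0e97c59_4445_479d_a7ef_d770e3874123": True,     # accepted
    "1ae97c59_4445_479d_a7ef_d770e3874123": True,     # accepted
    "10e97c59_4445_479d_a7ef_d770e3874123": False,    # rejected by SqlParser
}
for name, expected_valid in names.items():
    starts_like_number = bool(SCI_LITERAL.match(name))
    # Every rejected name, and only the rejected name, begins like a number.
    assert starts_like_number != expected_valid
```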
[jira] [Updated] (SPARK-12632) Make Parameter Descriptions Consistent for PySpark MLlib FPM and Recommendation
[ https://issues.apache.org/jira/browse/SPARK-12632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12632: -- Target Version/s: 2.0.0 > Make Parameter Descriptions Consistent for PySpark MLlib FPM and > Recommendation > --- > > Key: SPARK-12632 > URL: https://issues.apache.org/jira/browse/SPARK-12632 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: somil deshmukh >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up fpm.py > and recommendation.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12631) Make Parameter Descriptions Consistent for PySpark MLlib Clustering
[ https://issues.apache.org/jira/browse/SPARK-12631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12631: -- Target Version/s: 2.0.0 > Make Parameter Descriptions Consistent for PySpark MLlib Clustering > --- > > Key: SPARK-12631 > URL: https://issues.apache.org/jira/browse/SPARK-12631 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > clustering.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12630) Make Parameter Descriptions Consistent for PySpark MLlib Classification
[ https://issues.apache.org/jira/browse/SPARK-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12630: -- Target Version/s: 2.0.0 > Make Parameter Descriptions Consistent for PySpark MLlib Classification > --- > > Key: SPARK-12630 > URL: https://issues.apache.org/jira/browse/SPARK-12630 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Vijay Kiran >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > classification.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12633) Make Parameter Descriptions Consistent for PySpark MLlib Regression
[ https://issues.apache.org/jira/browse/SPARK-12633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-12633: -- Target Version/s: 2.0.0 > Make Parameter Descriptions Consistent for PySpark MLlib Regression > --- > > Key: SPARK-12633 > URL: https://issues.apache.org/jira/browse/SPARK-12633 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 1.6.0 >Reporter: Bryan Cutler >Assignee: Vijay Kiran >Priority: Trivial > Labels: doc, starter > Original Estimate: 1h > Remaining Estimate: 1h > > Follow example parameter description format from parent task to fix up > regression.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12945) ERROR LiveListenerBus: Listener JobProgressListener threw an exception
[ https://issues.apache.org/jira/browse/SPARK-12945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115986#comment-15115986 ] Ben Huntley commented on SPARK-12945: - Also seeing this issue in 1.6.0, not limited to Web UI, as it's in pyspark. Adding my own repro: bin/pyspark --master yarn-client --conf "spark.sql.shuffle.partitions=3" --executor-memory 10g --driver-memory 154g --num-executors 50 --executor-cores 1 --conf "spark.driver.maxResultSize=25g" --conf "spark.e xecutor.cores=1" --conf "spark.sql.autoBroadcastJoinThreshold=129400" --conf "spark.akka.frameSize=300" --conf "spark.akka.timeout=3600" >>> foo = sqlContext.read.parquet('/projects/xxx/month7') >>> foo.count() [Stage 1:==> (16687 + 11) / 37231]16/01/25 11:13:32 ERROR LiveListenerBus: Listener JobProgressListener threw an exception java.lang.NullPointerException at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361) at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) [Stage 1:=>(36749 + 22) / 37231]16/01/25 11:15:20 ERROR LiveListenerBus: Listener JobProgressListener threw an exception java.lang.NullPointerException 16227372864 > ERROR LiveListenerBus: Listener JobProgressListener threw an exception > -- > > Key: SPARK-12945 > URL: https://issues.apache.org/jira/browse/SPARK-12945 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.6.0 > Environment: Linux, yarn-client >Reporter: Tristan >Priority: Minor > > Seeing this a lot; not sure if it is a problem or spurious error (I recall > this was an ignorable issue in previous version). 
The UI seems to be working > fine: > ERROR LiveListenerBus: Listener JobProgressListener threw an exception > java.lang.NullPointerException > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361) > at > org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360) > at scala.collection.immutable.List.foreach(List.scala:318) > at > scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) > at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) > at > org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360) > at > org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) > at > org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) > at > org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) > at >
[jira] [Commented] (SPARK-12911) Cacheing a dataframe causes array comparisons to fail (in filter / where) after 1.6
[ https://issues.apache.org/jira/browse/SPARK-12911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115459#comment-15115459 ] Stephen DiCocco commented on SPARK-12911: - So we have determined one way to work around the issue is to add the array you want to search for as a literal column on the dataframe and then cache the frame. This causes the underlying types of both to be UnsafeArrayData.
{code}
test("test array comparison") {
  val vectors: Vector[Row] = Vector(
    Row.fromTuple("id_1" -> Array(0L, 2L)),
    Row.fromTuple("id_2" -> Array(0L, 5L)),
    Row.fromTuple("id_3" -> Array(0L, 9L)),
    Row.fromTuple("id_4" -> Array(1L, 0L)),
    Row.fromTuple("id_5" -> Array(1L, 8L)),
    Row.fromTuple("id_6" -> Array(2L, 4L)),
    Row.fromTuple("id_7" -> Array(5L, 6L)),
    Row.fromTuple("id_8" -> Array(6L, 2L)),
    Row.fromTuple("id_9" -> Array(7L, 0L))
  )
  val data: RDD[Row] = sc.parallelize(vectors, 3)
  val schema = StructType(
    StructField("id", StringType, false) ::
    StructField("point", DataTypes.createArrayType(LongType, false), false) ::
    Nil
  )
  val sqlContext = new SQLContext(sc)
  var dataframe = sqlContext.createDataFrame(data, schema)
  val targetPoint: Array[Long] = Array(0L, 9L)
  // Adding the target column to the frame allows you to do the comparison
  // successfully, but there is definite overhead to doing this
  dataframe = dataframe.withColumn("target", array(targetPoint.map(value => lit(value)): _*))
  dataframe.cache()
  // With both columns backed by UnsafeArrayData, the comparison now succeeds
  // (without the workaround this line fails with
  // java.util.NoSuchElementException: next on empty iterator)
  val targetRow = dataframe.where(dataframe("point") === dataframe("target")).first()
  assert(targetRow != null)
}
{code}
> Cacheing a dataframe causes array comparisons to fail (in filter / where) > after 1.6 > --- > > Key: SPARK-12911 > URL: https://issues.apache.org/jira/browse/SPARK-12911 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 > Environment: OSX 10.11.1, Scala 2.11.7, Spark 1.6.0 >Reporter: 
Jesse English > > When doing a *where* operation on a dataframe and testing for equality on an > array type, after 1.6 no valid comparisons are made if the dataframe has been > cached. If it has not been cached, the results are as expected. > This appears to be related to the underlying unsafe array data types. > {code:title=test.scala|borderStyle=solid} > test("test array comparison") { > val vectors: Vector[Row] = Vector( > Row.fromTuple("id_1" -> Array(0L, 2L)), > Row.fromTuple("id_2" -> Array(0L, 5L)), > Row.fromTuple("id_3" -> Array(0L, 9L)), > Row.fromTuple("id_4" -> Array(1L, 0L)), > Row.fromTuple("id_5" -> Array(1L, 8L)), > Row.fromTuple("id_6" -> Array(2L, 4L)), > Row.fromTuple("id_7" -> Array(5L, 6L)), > Row.fromTuple("id_8" -> Array(6L, 2L)), > Row.fromTuple("id_9" -> Array(7L, 0L)) > ) > val data: RDD[Row] = sc.parallelize(vectors, 3) > val schema = StructType( > StructField("id", StringType, false) :: > StructField("point", DataTypes.createArrayType(LongType, false), > false) :: > Nil > ) > val sqlContext = new SQLContext(sc) > val dataframe = sqlContext.createDataFrame(data, schema) > val targetPoint:Array[Long] = Array(0L,9L) > //Cacheing is the trigger to cause the error (no cacheing causes no error) > dataframe.cache() > //This is the line where it fails > //java.util.NoSuchElementException: next on empty iterator > //However we know that there is a valid match > val targetRow = dataframe.where(dataframe("point") === > array(targetPoint.map(value => lit(value)): _*)).first() > assert(targetRow != null) > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
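The workaround above hinges on making both sides of the comparison use the same underlying representation (UnsafeArrayData). As a loose Python analogy (not Spark code), equality between containers can fail purely because the two sides have different representations, even when the elements are identical:

```python
# Two containers holding identical elements in different representations.
plain = [0, 9]   # a list
packed = (0, 9)  # a tuple of the same values

print(plain == packed)  # False: == here is representation-sensitive

# Normalizing both sides to one representation, as the workaround does by
# casting both columns to UnsafeArrayData, makes the comparison succeed:
print(list(plain) == list(packed))  # True
```

The analogy is only illustrative; in Spark the fix belongs in the array equality check itself, which should compare element-by-element regardless of the backing representation.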
[jira] [Comment Edited] (SPARK-12970) Error in documentation on creating rows with schemas defined by structs
[ https://issues.apache.org/jira/browse/SPARK-12970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15114539#comment-15114539 ] Haidar Hadi edited comment on SPARK-12970 at 1/25/16 7:11 PM: -- sure [~joshrosen] I understand. was (Author: hhadi): sure [~jrose] I understand. > Error in documentation on creating rows with schemas defined by structs > --- > > Key: SPARK-12970 > URL: https://issues.apache.org/jira/browse/SPARK-12970 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 1.6.0 >Reporter: Haidar Hadi >Priority: Minor > Labels: documentation > > The provided example in this doc > https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/sql/types/StructType.html > for creating Row from Struct is wrong > // Create a Row with the schema defined by struct > val row = Row(Row(1, 2, true)) > // row: Row = {@link 1,2,true} > > the above example does not create a Row object with schema. > this error is in the scala docs too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11965) Update user guide for RFormula feature interactions
[ https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-11965. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 10222 [https://github.com/apache/spark/pull/10222] > Update user guide for RFormula feature interactions > --- > > Key: SPARK-11965 > URL: https://issues.apache.org/jira/browse/SPARK-11965 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > Fix For: 2.0.0 > > > Update the user guide for RFormula to cover feature interactions -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115907#comment-15115907 ] Yin Huai commented on SPARK-9740: - Can you provide the full stack trace? > first/last aggregate NULL behavior > -- > > Key: SPARK-9740 > URL: https://issues.apache.org/jira/browse/SPARK-9740 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Yin Huai > Labels: releasenotes > Fix For: 1.6.0 > > > The FIRST/LAST aggregates implemented as part of the new UDAF interface, > return the first or last non-null value (if any) found. This is a departure > from the behavior of the old FIRST/LAST aggregates and from the > FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, > if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' > this behavior for the old UDAF interface. > Hive makes this behavior configurable, by adding a skipNulls flag. I would > suggest to do the same, and make the default behavior compatible with Hive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-12945) ERROR LiveListenerBus: Listener JobProgressListener threw an exception
[ https://issues.apache.org/jira/browse/SPARK-12945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115986#comment-15115986 ] Ben Huntley edited comment on SPARK-12945 at 1/25/16 8:41 PM: -- Also seeing this issue in 1.6.0, not limited to Web UI, as it's in pyspark. Adding my own repro: {quote} bin/pyspark --master yarn-client --conf "spark.sql.shuffle.partitions=3" --executor-memory 10g --driver-memory 154g --num-executors 50 --executor-cores 1 --conf "spark.driver.maxResultSize=25g" --conf "spark.e xecutor.cores=1" --conf "spark.sql.autoBroadcastJoinThreshold=129400" --conf "spark.akka.frameSize=300" --conf "spark.akka.timeout=3600" >>> foo = sqlContext.read.parquet('/projects/xxx/month7') >>> foo.count() [Stage 1:==> (16687 + 11) / 37231]16/01/25 11:13:32 ERROR LiveListenerBus: Listener JobProgressListener threw an exception java.lang.NullPointerException at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361) at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80) at 
org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65) at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64) at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1180) at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63) [Stage 1:=>(36749 + 22) / 37231]16/01/25 11:15:20 ERROR LiveListenerBus: Listener JobProgressListener threw an exception java.lang.NullPointerException 16227372864 {quote} was (Author: bhuntley): Also seeing this issue in 1.6.0, not limited to Web UI, as it's in pyspark. Adding my own repro: bin/pyspark --master yarn-client --conf "spark.sql.shuffle.partitions=3" --executor-memory 10g --driver-memory 154g --num-executors 50 --executor-cores 1 --conf "spark.driver.maxResultSize=25g" --conf "spark.e xecutor.cores=1" --conf "spark.sql.autoBroadcastJoinThreshold=129400" --conf "spark.akka.frameSize=300" --conf "spark.akka.timeout=3600" >>> foo = sqlContext.read.parquet('/projects/xxx/month7') >>> foo.count() [Stage 1:==> (16687 + 11) / 37231]16/01/25 11:13:32 ERROR LiveListenerBus: Listener JobProgressListener threw an exception java.lang.NullPointerException at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361) at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32) at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45) at 
org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360) at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31) at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55) at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37) at
[jira] [Updated] (SPARK-12981) Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error
[ https://issues.apache.org/jira/browse/SPARK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Arnfeld updated SPARK-12981: Description: We noticed a regression when testing out an upgrade of Spark 1.6 for our systems, where pyspark throws a casting exception when using `filter(udf)` after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5. Here's a little notebook that demonstrates the exception clearly... https://gist.github.com/tarnfeld/ab9b298ae67f697894cd Though for the sake of here... the following code will throw an exception... {code} data.select(col("a")).distinct().filter(my_filter(col("a"))).count() {code} {code} java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate {code} Whereas not using a UDF does not... {code} data.select(col("a")).distinct().filter("a = 1").count() {code} was: We noticed a regression when testing out an upgrade of Spark 1.6 for our systems, where pyspark throws a casting exception when using `filter(udf)` after a `distinct` operation on a DataFrame. Here's a little notebook that demonstrates the exception clearly... https://gist.github.com/tarnfeld/ab9b298ae67f697894cd Though for the sake of here... the following code will throw an exception... {code} data.select(col("a")).distinct().filter(my_filter(col("a"))).count() {code} {code} java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate {code} Whereas not using a UDF does not... 
{code} data.select(col("a")).distinct().filter("a = 1").count() {code} > Dataframe distinct() followed by a filter(udf) in pyspark throws a casting > error > > > Key: SPARK-12981 > URL: https://issues.apache.org/jira/browse/SPARK-12981 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.6.0 > Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8) >Reporter: Tom Arnfeld >Priority: Critical > > We noticed a regression when testing out an upgrade of Spark 1.6 for our > systems, where pyspark throws a casting exception when using `filter(udf)` > after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5. > Here's a little notebook that demonstrates the exception clearly... > https://gist.github.com/tarnfeld/ab9b298ae67f697894cd > Though for the sake of here... the following code will throw an exception... > {code} > data.select(col("a")).distinct().filter(my_filter(col("a"))).count() > {code} > {code} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to > org.apache.spark.sql.catalyst.plans.logical.Aggregate > {code} > Whereas not using a UDF does not... > {code} > data.select(col("a")).distinct().filter("a = 1").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier
[ https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grzegorz Chilkiewicz updated SPARK-12982: - Description: We have encountered very strange behavior of SparkSQL temporary table registration. What identifiers for temporary table should be valid? Alphanumerical + '_' with at least one non-digit? Valid identifiers: df 674123a 674123_ a0e97c59_4445_479d_a7ef_d770e3874123 1ae97c59_4445_479d_a7ef_d770e3874123 Invalid identifier: 10e97c59_4445_479d_a7ef_d770e3874123 Stack trace: {code:xml} java.lang.RuntimeException: [1.1] failure: identifier expected 10e97c59_4445_479d_a7ef_d770e3874123 ^ at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58) at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763) at SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9) at SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:42) at SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at sbt.Run.invokeMain(Run.scala:67) at sbt.Run.run0(Run.scala:61) at sbt.Run.sbt$Run$$execute$1(Run.scala:51) at sbt.Run$$anonfun$run$1.apply$mcV$sp(Run.scala:55) at sbt.Run$$anonfun$run$1.apply(Run.scala:55) at sbt.Run$$anonfun$run$1.apply(Run.scala:55) at sbt.Logger$$anon$4.apply(Logger.scala:85) at sbt.TrapExit$App.run(TrapExit.scala:248) at java.lang.Thread.run(Thread.java:745) {code} Code to reproduce this bug: https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier was: We have 
encountered very strange behavior of SparkSQL temporary table registration. What identifiers for temporary table should be valid? Alphanumerical + '_' with at least one non-digit? Valid identifiers: df 674123a 674123_ a0e97c59_4445_479d_a7ef_d770e3874123 1ae97c59_4445_479d_a7ef_d770e3874123 Invalid identifier: 10e97c59_4445_479d_a7ef_d770e3874123 Stack trace: {code:xml} [error] java.lang.RuntimeException: [1.1] failure: identifier expected [error] [error] 10e97c59_4445_479d_a7ef_d770e3874123 [error] ^ [error] at scala.sys.package$.error(package.scala:27) [error] at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58) [error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) [error] at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763) [error] at io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27) [error] at io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58) [error] at io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala) {code} Code to reproduce this bug: https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier > SQLContext: temporary table registration does not accept valid identifier > - > > Key: SPARK-12982 > URL: https://issues.apache.org/jira/browse/SPARK-12982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Grzegorz Chilkiewicz >Priority: Minor > Labels: sql > > We have encountered very strange behavior of SparkSQL temporary table > registration. > What identifiers for temporary table should be valid? > Alphanumerical + '_' with at least one non-digit? 
> Valid identifiers: > df > 674123a > 674123_ > a0e97c59_4445_479d_a7ef_d770e3874123 > 1ae97c59_4445_479d_a7ef_d770e3874123 > Invalid identifier: > 10e97c59_4445_479d_a7ef_d770e3874123 > Stack trace: > {code:xml} > java.lang.RuntimeException: [1.1] failure: identifier expected > 10e97c59_4445_479d_a7ef_d770e3874123 > ^ > at scala.sys.package$.error(package.scala:27) > at > org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58) > at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763) > at > SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:9) > at >
[jira] [Commented] (SPARK-12890) Spark SQL query related to only partition fields should not scan the whole data.
[ https://issues.apache.org/jira/browse/SPARK-12890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115558#comment-15115558 ] Simeon Simeonov commented on SPARK-12890: - [~viirya] If schema merging is the cause of the problem then this is clearly a bug. The resulting schema for a query using only partition columns is completely independent of the schema in the data files. There is no merging to do at all. > Spark SQL query related to only partition fields should not scan the whole > data. > > > Key: SPARK-12890 > URL: https://issues.apache.org/jira/browse/SPARK-12890 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Prakash Chockalingam > > I have a SQL query which has only partition fields. The query ends up > scanning all the data which is unnecessary. > Example: select max(date) from table, where the table is partitioned by date. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
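Until the planner answers partition-only queries from metadata, one workaround is to read the answer off the Hive-style partition directory names instead of scanning the files. A sketch, assuming a `date=YYYY-MM-DD` path layout (the paths below are hypothetical; in practice they would come from listing the table's directory on HDFS or S3):

```python
import re

def max_partition_value(paths, column="date"):
    """Extract the maximum value of a partition column from Hive-style
    directory names (e.g. .../date=2016-01-25/...), avoiding a full scan.
    Returns None if the column appears in no path."""
    pattern = re.compile(r"(?:^|/)%s=([^/]+)" % re.escape(column))
    values = [m.group(1) for p in paths for m in [pattern.search(p)] if m]
    return max(values) if values else None

# Hypothetical warehouse layout:
paths = [
    "/warehouse/events/date=2016-01-23/part-00000.parquet",
    "/warehouse/events/date=2016-01-25/part-00000.parquet",
    "/warehouse/events/date=2016-01-24/part-00000.parquet",
]
print(max_partition_value(paths))  # 2016-01-25
```

Note this relies on the partition values being lexicographically ordered (true for zero-padded ISO dates); for numeric partitions the extracted strings would need a cast before taking the max.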
[jira] [Updated] (SPARK-12981) Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error
[ https://issues.apache.org/jira/browse/SPARK-12981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Arnfeld updated SPARK-12981: Description: We noticed a regression when testing out an upgrade of Spark 1.6 for our systems, where pyspark throws a casting exception when using `filter(udf)` after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5. Here's a little notebook that demonstrates the exception clearly... https://gist.github.com/tarnfeld/ab9b298ae67f697894cd Though for the sake of here... the following code will throw an exception... {code} data.select(col("a")).distinct().filter(my_filter(col("a"))).count() {code} {code} java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate {code} Whereas not using a UDF does not throw any errors... {code} data.select(col("a")).distinct().filter("a = 1").count() {code} was: We noticed a regression when testing out an upgrade of Spark 1.6 for our systems, where pyspark throws a casting exception when using `filter(udf)` after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5. Here's a little notebook that demonstrates the exception clearly... https://gist.github.com/tarnfeld/ab9b298ae67f697894cd Though for the sake of here... the following code will throw an exception... {code} data.select(col("a")).distinct().filter(my_filter(col("a"))).count() {code} {code} java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate {code} Whereas not using a UDF does not... 
{code} data.select(col("a")).distinct().filter("a = 1").count() {code} > Dataframe distinct() followed by a filter(udf) in pyspark throws a casting > error > > > Key: SPARK-12981 > URL: https://issues.apache.org/jira/browse/SPARK-12981 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.6.0 > Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8) >Reporter: Tom Arnfeld >Priority: Critical > > We noticed a regression when testing out an upgrade of Spark 1.6 for our > systems, where pyspark throws a casting exception when using `filter(udf)` > after a `distinct` operation on a DataFrame. This does not occur on Spark 1.5. > Here's a little notebook that demonstrates the exception clearly... > https://gist.github.com/tarnfeld/ab9b298ae67f697894cd > Though for the sake of here... the following code will throw an exception... > {code} > data.select(col("a")).distinct().filter(my_filter(col("a"))).count() > {code} > {code} > java.lang.ClassCastException: > org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to > org.apache.spark.sql.catalyst.plans.logical.Aggregate > {code} > Whereas not using a UDF does not throw any errors... > {code} > data.select(col("a")).distinct().filter("a = 1").count() > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115480#comment-15115480 ] Yin Huai commented on SPARK-9740: - Can you attach your code? Also, can you try to use {{functions.callUDF("last", col, functions.lit(true))}}? > first/last aggregate NULL behavior > -- > > Key: SPARK-9740 > URL: https://issues.apache.org/jira/browse/SPARK-9740 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Yin Huai > Labels: releasenotes > Fix For: 1.6.0 > > > The FIRST/LAST aggregates implemented as part of the new UDAF interface, > return the first or last non-null value (if any) found. This is a departure > from the behavior of the old FIRST/LAST aggregates and from the > FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, > if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' > this behavior for the old UDAF interface. > Hive makes this behavior configurable, by adding a skipNulls flag. I would > suggest to do the same, and make the default behavior compatible with Hive.
[jira] [Resolved] (SPARK-12980) pyspark crash for large dataset - clone
[ https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-12980. --- Resolution: Invalid Why is this a clone of another issue? I don't think you've specified clearly what the problem is -- you say it doesn't work. Questions should go to u...@spark.apache.org > pyspark crash for large dataset - clone > --- > > Key: SPARK-12980 > URL: https://issues.apache.org/jira/browse/SPARK-12980 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Environment: windows >Reporter: Christopher Bourez > > I installed Spark 1.6 on many different computers. > On Windows, the PySpark textFile method, followed by take(1), does not work on a > file of 13M. > If I set numPartitions to 2000 or take a smaller file, the method works well. > PySpark is given all the computer's RAM via the option > --conf spark.driver.memory=5g in local mode. > On Mac OS, I'm able to launch the exact same program with PySpark with 16G > RAM for a much bigger file, of 5G. Memory is correctly > allocated, freed, etc. > On Ubuntu, no trouble, I can also launch a cluster > http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html > The error message on Windows is: java.net.SocketException: Connection reset > by peer: socket write error > Configuration is: Java 8 64-bit, Python 2.7.11, on Windows 7 Enterprise SP1 > v2.42.01 > What could be the reason for the Windows Spark textFile method to fail?
[jira] [Closed] (SPARK-12980) pyspark crash for large dataset - clone
[ https://issues.apache.org/jira/browse/SPARK-12980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Christopher Bourez closed SPARK-12980. -- > pyspark crash for large dataset - clone > --- > > Key: SPARK-12980 > URL: https://issues.apache.org/jira/browse/SPARK-12980 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Environment: windows >Reporter: Christopher Bourez > > I installed Spark 1.6 on many different computers. > On Windows, the PySpark textFile method, followed by take(1), does not work on a > file of 13M. > If I set numPartitions to 2000 or take a smaller file, the method works well. > PySpark is given all the computer's RAM via the option > --conf spark.driver.memory=5g in local mode. > On Mac OS, I'm able to launch the exact same program with PySpark with 16G > RAM for a much bigger file, of 5G. Memory is correctly > allocated, freed, etc. > On Ubuntu, no trouble, I can also launch a cluster > http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html > The error message on Windows is: java.net.SocketException: Connection reset > by peer: socket write error > Configuration is: Java 8 64-bit, Python 2.7.11, on Windows 7 Enterprise SP1 > v2.42.01 > What could be the reason for the Windows Spark textFile method to fail?
[jira] [Commented] (SPARK-12261) pyspark crash for large dataset
[ https://issues.apache.org/jira/browse/SPARK-12261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115433#comment-15115433 ] Christopher Bourez commented on SPARK-12261: I think the issue is not resolved. I installed Spark 1.6 on many different computers. On Windows, the PySpark textFile method, followed by take(1), does not work on a file of 13M. If I set numPartitions to 2000 or take a smaller file, the method works well. PySpark is given all the computer's RAM via the option --conf spark.driver.memory=5g in local mode. On Mac OS, I'm able to launch the exact same program with PySpark with 16G RAM for a much bigger file, of 5G. Memory is correctly allocated, freed, etc. On Ubuntu, no trouble, I can also launch a cluster http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html The error message on Windows is: java.net.SocketException: Connection reset by peer: socket write error What could be the reason for the Windows Spark textFile method to fail?
> pyspark crash for large dataset > --- > > Key: SPARK-12261 > URL: https://issues.apache.org/jira/browse/SPARK-12261 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2 > Environment: windows >Reporter: zihao > > I tried to import a local text(over 100mb) file via textFile in pyspark, when > i ran data.take(), it failed and gave error messages including: > 15/12/10 17:17:43 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; > aborting job > Traceback (most recent call last): > File "E:/spark_python/test3.py", line 9, in > lines.take(5) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, > in take > res = self.context.runJob(self, takeUpToNumLeft, p) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line > 916, in runJob > port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, > partitions) > File "C:\Anaconda2\lib\site-packages\py4j\java_gateway.py", line 813, in > __call__ > answer, self.gateway_client, self.target_id, self.name) > File "D:\spark\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line > 36, in deco > return f(*a, **kw) > File "C:\Anaconda2\lib\site-packages\py4j\protocol.py", line 308, in > get_return_value > format(target_id, ".", name), value) > py4j.protocol.Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.runJob. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 > in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 > (TID 0, localhost): java.net.SocketException: Connection reset by peer: > socket write error > Then i ran the same code for a small text file, this time .take() worked fine. > How can i solve this problem?
[jira] [Updated] (SPARK-12982) SQLContext: temporary table registration does not accept valid identifier
[ https://issues.apache.org/jira/browse/SPARK-12982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Grzegorz Chilkiewicz updated SPARK-12982: - Description: We have encountered very strange behavior of SparkSQL temporary table registration. What identifiers for temporary table should be valid? Alphanumerical + '_' with at least one non-digit? Valid identifiers: df 674123a 674123_ a0e97c59_4445_479d_a7ef_d770e3874123 1ae97c59_4445_479d_a7ef_d770e3874123 Invalid identifier: 10e97c59_4445_479d_a7ef_d770e3874123 Stack trace: {code:xml} [error] java.lang.RuntimeException: [1.1] failure: identifier expected [error] [error] 10e97c59_4445_479d_a7ef_d770e3874123 [error] ^ [error] at scala.sys.package$.error(package.scala:27) [error] at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58) [error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) [error] at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763) [error] at io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27) [error] at io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58) [error] at io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala) {code} Code to reproduce this bug: https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier was: We have encountered very strange behavior of SparkSQL temporary table registration. What identifiers for temporary table should be valid? Alphanumerical + '_' with at least one non-digit? 
Valid identifiers: df 674123a 674123_ a0e97c59_4445_479d_a7ef_d770e3874123 1ae97c59_4445_479d_a7ef_d770e3874123 Invalid identifier: 10e97c59_4445_479d_a7ef_d770e3874123 Stack trace: [error] java.lang.RuntimeException: [1.1] failure: identifier expected [error] [error] 10e97c59_4445_479d_a7ef_d770e3874123 [error] ^ [error] at scala.sys.package$.error(package.scala:27) [error] at org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58) [error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) [error] at org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763) [error] at io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27) [error] at io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58) [error] at io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala) Code to reproduce bug: https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier > SQLContext: temporary table registration does not accept valid identifier > - > > Key: SPARK-12982 > URL: https://issues.apache.org/jira/browse/SPARK-12982 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Grzegorz Chilkiewicz >Priority: Minor > Labels: sql > > We have encountered very strange behavior of SparkSQL temporary table > registration. > What identifiers for temporary table should be valid? > Alphanumerical + '_' with at least one non-digit? 
> Valid identifiers: > df > 674123a > 674123_ > a0e97c59_4445_479d_a7ef_d770e3874123 > 1ae97c59_4445_479d_a7ef_d770e3874123 > Invalid identifier: > 10e97c59_4445_479d_a7ef_d770e3874123 > Stack trace: > {code:xml} > [error] java.lang.RuntimeException: [1.1] failure: identifier expected > [error] > [error] 10e97c59_4445_479d_a7ef_d770e3874123 > [error] ^ > [error] at scala.sys.package$.error(package.scala:27) > [error] at > org.apache.spark.sql.catalyst.SqlParser$.parseTableIdentifier(SqlParser.scala:58) > [error] at org.apache.spark.sql.SQLContext.table(SQLContext.scala:827) > [error] at > org.apache.spark.sql.SQLContext.dropTempTable(SQLContext.scala:763) > [error] at > io.deepsense.SparkSqlContextTempTableIdentifier$.identifierCheck(SparkSqlContextTempTableIdentifier.scala:27) > [error] at > io.deepsense.SparkSqlContextTempTableIdentifier$.main(SparkSqlContextTempTableIdentifier.scala:58) > [error] at > io.deepsense.SparkSqlContextTempTableIdentifier.main(SparkSqlContextTempTableIdentifier.scala) > {code} > Code to reproduce this bug: > https://github.com/grzegorz-chilkiewicz/SparkSqlContextTempTableIdentifier
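The pattern in the lists above suggests one plausible explanation: the only rejected identifier is the one whose prefix ("10e97") lexes as a scientific-notation number literal, so the parser stops expecting an identifier. The following is a minimal pure-Python sketch of that hypothesis, not Spark's actual SqlParser:

```python
import re

# Hypothetical rule (not Spark's real lexer): an identifier fails when its
# prefix parses as a float literal in scientific notation, e.g. "10e97".
SCI_PREFIX = re.compile(r"^\d+e\d")

def starts_like_number_literal(name: str) -> bool:
    """Return True if the identifier's prefix lexes as a scientific-notation literal."""
    return bool(SCI_PREFIX.match(name))

# The valid identifiers from the report all pass; the invalid one trips the rule.
for ok in ["df", "674123a", "674123_",
           "a0e97c59_4445_479d_a7ef_d770e3874123",
           "1ae97c59_4445_479d_a7ef_d770e3874123"]:
    assert not starts_like_number_literal(ok)
assert starts_like_number_literal("10e97c59_4445_479d_a7ef_d770e3874123")
```

Note that "1ae97c59..." passes because "1a" cannot be read as the start of a number, which matches the reported behavior.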
[jira] [Created] (SPARK-12981) Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error
Tom Arnfeld created SPARK-12981: --- Summary: Dataframe distinct() followed by a filter(udf) in pyspark throws a casting error Key: SPARK-12981 URL: https://issues.apache.org/jira/browse/SPARK-12981 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 1.6.0 Environment: Running on Mac OSX (El Capitan) with Spark 1.6 (Java 1.8) Reporter: Tom Arnfeld Priority: Critical We noticed a regression when testing out an upgrade of Spark 1.6 for our systems, where pyspark throws a casting exception when using `filter(udf)` after a `distinct` operation on a DataFrame. Here's a little notebook that demonstrates the exception clearly... https://gist.github.com/tarnfeld/ab9b298ae67f697894cd Though for the sake of here... the following code will throw an exception... {code} data.select(col("a")).distinct().filter(my_filter(col("a"))).count() {code} {code} java.lang.ClassCastException: org.apache.spark.sql.catalyst.plans.logical.Project cannot be cast to org.apache.spark.sql.catalyst.plans.logical.Aggregate {code} Whereas not using a UDF does not... {code} data.select(col("a")).distinct().filter("a = 1").count() {code}
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115650#comment-15115650 ] Herman van Hovell commented on SPARK-9740: -- We are probably resolving the Hive function by accident. The First/Last functions probably don't have an expressions-only constructor. > first/last aggregate NULL behavior > -- > > Key: SPARK-9740 > URL: https://issues.apache.org/jira/browse/SPARK-9740 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Yin Huai > Labels: releasenotes > Fix For: 1.6.0 > > > The FIRST/LAST aggregates implemented as part of the new UDAF interface, > return the first or last non-null value (if any) found. This is a departure > from the behavior of the old FIRST/LAST aggregates and from the > FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, > if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' > this behavior for the old UDAF interface. > Hive makes this behavior configurable, by adding a skipNulls flag. I would > suggest to do the same, and make the default behavior compatible with Hive.
[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
[ https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115716#comment-15115716 ] Thomas Sebastian commented on SPARK-12941: -- Added a pull request https://github.com/thomastechs/spark/pull/1 > Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR > datatype > -- > > Key: SPARK-12941 > URL: https://issues.apache.org/jira/browse/SPARK-12941 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 > Environment: Apache Spark 1.4.2.2 >Reporter: Jose Martinez Poblete > > When exporting data from Spark to Oracle, string datatypes are translated to > TEXT for Oracle, this is leading to the following error > {noformat} > java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype > {noformat} > As per the following code: > https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144 > See also: > http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
[jira] [Commented] (SPARK-11219) Make Parameter Description Format Consistent in PySpark.MLlib
[ https://issues.apache.org/jira/browse/SPARK-11219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115756#comment-15115756 ] Bryan Cutler commented on SPARK-11219: -- Regarding overall style in PySpark, I generally see single-line param descriptions, and that doesn't look bad since there are usually just a few params at most and short descriptions. So it might not be worth it to update this in other areas, but it would be nice to provide the format here in the wiki or somewhere for future additions. > Make Parameter Description Format Consistent in PySpark.MLlib > - > > Key: SPARK-11219 > URL: https://issues.apache.org/jira/browse/SPARK-11219 > Project: Spark > Issue Type: Documentation > Components: Documentation, MLlib, PySpark >Reporter: Bryan Cutler >Priority: Trivial > > There are several different formats for describing params in PySpark.MLlib, > making it unclear what the preferred way to document is, i.e. vertical > alignment vs single line. > This is to agree on a format and make it consistent across PySpark.MLlib. > Following the discussion in SPARK-10560, using 2 lines with an indentation is > both readable and doesn't lead to changing many lines when adding/removing > parameters. If the parameter uses a default value, put this in parentheses > on a new line under the description. > Example: > {noformat} > :param stepSize: > Step size for each iteration of gradient descent. > (default: 0.1) > :param numIterations: > Number of iterations run for each batch of data. > (default: 50) > {noformat} > h2. Current State of Parameter Description Formatting > h4. 
Classification > * LogisticRegressionModel - single line descriptions, fix indentations > * LogisticRegressionWithSGD - vertical alignment, sporadic default values > * LogisticRegressionWithLBFGS - vertical alignment, sporadic default values > * SVMModel - single line > * SVMWithSGD - vertical alignment, sporadic default values > * NaiveBayesModel - single line > * NaiveBayes - single line > h4. Clustering > * KMeansModel - missing param description > * KMeans - missing param description and defaults > * GaussianMixture - vertical align, incorrect default formatting > * PowerIterationClustering - single line with wrapped indentation, missing > defaults > * StreamingKMeansModel - single line wrapped > * StreamingKMeans - single line wrapped, missing defaults > * LDAModel - single line > * LDA - vertical align, missing some defaults > h4. FPM > * FPGrowth - single line > * PrefixSpan - single line, default values in backticks > h4. Recommendation > * ALS - does not have param descriptions > h4. Regression > * LabeledPoint - single line > * LinearModel - single line > * LinearRegressionWithSGD - vertical alignment > * RidgeRegressionWithSGD - vertical align > * IsotonicRegressionModel - single line > * IsotonicRegression - single line, missing default > h4. Tree > * DecisionTree - single line with vertical indentation, missing defaults > * RandomForest - single line with wrapped indent, missing some defaults > * GradientBoostedTrees - single line with wrapped indent > NOTE > This issue will just focus on model/algorithm descriptions, which are the > largest source of inconsistent formatting > evaluation.py, feature.py, random.py, utils.py - these supporting classes > have param descriptions as single line, but are consistent so don't need to > be changed
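The two-line indented format proposed in the issue can be demonstrated with a short Python function; the function name, parameters, and body below are purely illustrative, not an actual MLlib API:

```python
def train_logistic_regression(data, step_size=0.1, num_iterations=50):
    """Hypothetical example following the docstring format from SPARK-11219.

    :param step_size:
      Step size for each iteration of gradient descent.
      (default: 0.1)
    :param num_iterations:
      Number of iterations run for each batch of data.
      (default: 50)
    """
    # Placeholder body: a real trainer would run gradient descent over `data`.
    return step_size, num_iterations

# The format keeps each param on its own line, with the default noted below it.
assert ":param step_size:" in train_logistic_regression.__doc__
```

The point of the two-line layout is that adding or removing a parameter touches only its own lines, unlike vertical alignment, which forces re-indenting every description.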
[jira] [Commented] (SPARK-9740) first/last aggregate NULL behavior
[ https://issues.apache.org/jira/browse/SPARK-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115659#comment-15115659 ] Herman van Hovell commented on SPARK-9740: -- Hmmm... It does have a suitable constructor. Please attach an example. > first/last aggregate NULL behavior > -- > > Key: SPARK-9740 > URL: https://issues.apache.org/jira/browse/SPARK-9740 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Herman van Hovell >Assignee: Yin Huai > Labels: releasenotes > Fix For: 1.6.0 > > > The FIRST/LAST aggregates implemented as part of the new UDAF interface, > return the first or last non-null value (if any) found. This is a departure > from the behavior of the old FIRST/LAST aggregates and from the > FIRST_VALUE/LAST_VALUE aggregates in Hive. These would return a null value, > if that happened to be the first/last value seen. SPARK-9592 tries to 'fix' > this behavior for the old UDAF interface. > Hive makes this behavior configurable, by adding a skipNulls flag. I would > suggest to do the same, and make the default behavior compatible with Hive.
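The two FIRST/LAST semantics discussed in SPARK-9740 can be sketched in plain Python; this is a minimal illustration of the proposed skipNulls flag (using None for SQL NULL), not Spark's or Hive's actual implementation:

```python
def first_value(values, skip_nulls=True):
    """FIRST: first non-null value, or the literal first value when skip_nulls is off."""
    for v in values:
        if v is not None or not skip_nulls:
            return v
    return None

def last_value(values, skip_nulls=True):
    """LAST: last non-null value, or the literal last value when skip_nulls is off."""
    result = None
    for v in values:
        if v is not None or not skip_nulls:
            result = v
    return result

col = [None, 1, 2, None]
assert first_value(col) == 1        # new-UDAF behavior: skip leading nulls
assert first_value(col, skip_nulls=False) is None  # Hive FIRST_VALUE default
assert last_value(col) == 2
assert last_value(col, skip_nulls=False) is None
```

With skip_nulls=False the functions reproduce the old FIRST/LAST and Hive FIRST_VALUE/LAST_VALUE behavior, which can legitimately return NULL when a null happens to be the first/last value seen.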
[jira] [Commented] (SPARK-12941) Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR datatype
[ https://issues.apache.org/jira/browse/SPARK-12941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115682#comment-15115682 ] Jayadevan M commented on SPARK-12941: - Working on JdbcDialect.scala > Spark-SQL JDBC Oracle dialect fails to map string datatypes to Oracle VARCHAR > datatype > -- > > Key: SPARK-12941 > URL: https://issues.apache.org/jira/browse/SPARK-12941 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.4.1 > Environment: Apache Spark 1.4.2.2 >Reporter: Jose Martinez Poblete > > When exporting data from Spark to Oracle, string datatypes are translated to > TEXT for Oracle, this is leading to the following error > {noformat} > java.sql.SQLSyntaxErrorException: ORA-00902: invalid datatype > {noformat} > As per the following code: > https://github.com/apache/spark/blob/branch-1.4/sql/core/src/main/scala/org/apache/spark/sql/jdbc/jdbc.scala#L144 > See also: > http://stackoverflow.com/questions/31287182/writing-to-oracle-database-using-apache-spark-1-4-0
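The dialect fix being discussed lives in Spark's Scala JdbcDialects API, but the underlying idea is just a per-database type-mapping override. A hedged Python sketch of that idea (the mapping tables and the VARCHAR2 size below are illustrative assumptions, not the actual patch):

```python
# Generic JDBC type names Spark 1.4 emitted for DDL; "TEXT" is what broke
# Oracle, which has no TEXT type and raises ORA-00902 when it sees one.
DEFAULT_JDBC_TYPES = {"StringType": "TEXT", "IntegerType": "INTEGER"}

# Oracle-specific overrides, consulted before the generic mapping.
ORACLE_OVERRIDES = {
    "StringType": "VARCHAR2(255)",  # assumed length; Oracle rejects TEXT
    "BooleanType": "NUMBER(1)",     # Oracle has no native BOOLEAN column type
}

def oracle_column_type(spark_type: str) -> str:
    """Resolve a Spark SQL type name to an Oracle DDL column type."""
    if spark_type in ORACLE_OVERRIDES:
        return ORACLE_OVERRIDES[spark_type]
    return DEFAULT_JDBC_TYPES.get(spark_type, spark_type)

assert oracle_column_type("StringType") == "VARCHAR2(255)"
assert oracle_column_type("IntegerType") == "INTEGER"
```

The real Spark mechanism works the same way: a dialect gets first refusal on each Catalyst type, and only unmatched types fall through to the generic JDBC mapping.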
[jira] [Created] (SPARK-12983) Correct metrics.properties.template
Benjamin Fradet created SPARK-12983: --- Summary: Correct metrics.properties.template Key: SPARK-12983 URL: https://issues.apache.org/jira/browse/SPARK-12983 Project: Spark Issue Type: Documentation Components: Documentation, Spark Core Reporter: Benjamin Fradet Priority: Minor There are some typos or plain unintelligible sentences in the metrics template.
[jira] [Commented] (SPARK-12983) Correct metrics.properties.template
[ https://issues.apache.org/jira/browse/SPARK-12983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15115661#comment-15115661 ] Apache Spark commented on SPARK-12983: -- User 'BenFradet' has created a pull request for this issue: https://github.com/apache/spark/pull/10902 > Correct metrics.properties.template > --- > > Key: SPARK-12983 > URL: https://issues.apache.org/jira/browse/SPARK-12983 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Core >Reporter: Benjamin Fradet >Priority: Minor > > There are some typos or plain unintelligible sentences in the metrics > template.
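For context on what the template documents: entries in metrics.properties follow an instance.sink/source.name.property syntax. A minimal configuration fragment is sketched below; ConsoleSink and JvmSource ship with Spark, but treat the exact property values here as illustrative rather than recommended settings:

```properties
# "*" applies the sink to every instance (master, worker, executor, driver).
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds

# Attach the JVM metrics source to the master instance only.
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```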