[jira] [Comment Edited] (SPARK-10525) Add Python example for VectorSlicer to user guide
[ https://issues.apache.org/jira/browse/SPARK-10525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263489#comment-15263489 ]

Amit Shinde edited comment on SPARK-10525 at 5/1/16 3:46 AM:
-------------------------------------------------------------

Hi, I was looking at this JIRA and found a similar one logged and fixed here: [SPARK-14514|https://issues.apache.org/jira/browse/SPARK-14514]. The pull request is here: https://github.com/apache/spark/pull/12282 Does this resolve this JIRA as well? [~josephkb]

-- Amit

was (Author: ashinde1):
Hi, I was looking at this JIRA and found a similar one logged and fixed here: [SPARK-14514|https://issues.apache.org/jira/browse/SPARK-14514]. The pull request is here: https://github.com/apache/spark/pull/12282 Does this resolve this JIRA as well?

-- Amit

> Add Python example for VectorSlicer to user guide
> -------------------------------------------------
>
>                 Key: SPARK-10525
>                 URL: https://issues.apache.org/jira/browse/SPARK-10525
>             Project: Spark
>          Issue Type: Documentation
>          Components: Documentation, ML, PySpark
>            Reporter: Joseph K. Bradley
>            Priority: Minor
>

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13425) Documentation for CSV datasource options
[ https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-13425: Assignee: (was: Apache Spark) > Documentation for CSV datasource options > > > Key: SPARK-13425 > URL: https://issues.apache.org/jira/browse/SPARK-13425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.0.0 >Reporter: Hyukjin Kwon > > As said https://github.com/apache/spark/pull/11262#discussion_r53508815, > CSV datasource is added for Spark 2.0.0 and therefore the options might have > to be added in documentation. > The options can be found > [here|https://issues.apache.org/jira/secure/attachment/12779313/Built-in%20CSV%20datasource%20in%20Spark.pdf] > in Parsing Options section. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13425) Documentation for CSV datasource options
[ https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265594#comment-15265594 ]

Apache Spark commented on SPARK-13425:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/12817
[jira] [Assigned] (SPARK-13425) Documentation for CSV datasource options
[ https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-13425:

    Assignee: Apache Spark
[jira] [Resolved] (SPARK-15033) fix a flaky test in CachedTableSuite
[ https://issues.apache.org/jira/browse/SPARK-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15033. - Resolution: Fixed Fix Version/s: 2.0.0 > fix a flaky test in CachedTableSuite > > > Key: SPARK-15033 > URL: https://issues.apache.org/jira/browse/SPARK-15033 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions
[ https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265591#comment-15265591 ]

Xin Wu commented on SPARK-14927:
--------------------------------

Since Spark 2.0.0 has moved a lot of code around, including splitting HiveMetaStoreCatalog into two files for resolving and creating tables, respectively, I tried this on Spark 2.0.0:

{code}
scala> spark.sql("create database if not exists tmp")
16/04/30 19:59:12 WARN ObjectStore: Failed to get database tmp, returning NoSuchObjectException
res23: org.apache.spark.sql.DataFrame = []

scala> df.write.partitionBy("year").mode(SaveMode.Append).saveAsTable("tmp.tmp1")
16/04/30 19:59:50 WARN CreateDataSourceTableUtils: Persisting partitioned data source relation `tmp`.`tmp1` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. Input path(s): file:/home/xwu0226/spark/spark-warehouse/tmp.db/tmp1

scala> spark.sql("select * from tmp.tmp1").show
+---+----+
|val|year|
+---+----+
|  a|2012|
+---+----+
{code}

For a datasource table created as above, Spark SQL creates the table as a Hive internal table that is not compatible with Hive. Spark SQL puts the partition column information (along with other things, such as the column schema and bucket/sort columns) into serdeInfo.parameters. When the table is queried, Spark SQL resolves it and parses that information back out of serdeInfo.parameters. Spark 2.0.0 no longer passes this command to Hive (most DDL commands now run natively in Spark SQL), so "SHOW PARTITIONS ..." does not support showing partitions for a datasource table.
{code}
scala> spark.sql("show partitions tmp.tmp1").show
org.apache.spark.sql.AnalysisException: SHOW PARTITIONS is not allowed on a datasource table: tmp.tmp1;
  at org.apache.spark.sql.execution.command.ShowPartitionsCommand.run(commands.scala:196)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:62)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:60)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:113)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:132)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:129)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:112)
  at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:85)
  at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:85)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:186)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:167)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:62)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:529)
  ... 48 elided
{code}

Hope this helps.

> DataFrame.saveAsTable creates RDD partitions but not Hive partitions
> ---------------------------------------------------------------------
>
>                 Key: SPARK-14927
>                 URL: https://issues.apache.org/jira/browse/SPARK-14927
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.2, 1.6.1
>         Environment: Mac OS X 10.11.4 local
>            Reporter: Sasha Ovsankin
>
> This is a followup to http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive . I tried to use the suggestions in the answers but couldn't make them work in Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is the relevant code (adapted from a Spark test):
>
>     hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
>     //hc.setConf("hive.exec.dynamic.partition", "true")
>     //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
>     hc.sql("create database if not exists tmp")
>     hc.sql("drop table if exists tmp.partitiontest1")
>     Seq(2012 -> "a").toDF("year", "val")
>       .write
>       .partitionBy("year")
>       .mode(SaveMode.Append)
>       .saveAsTable("tmp.partitiontest1")
>     hc.sql("show partitions tmp.partitiontest1").show
>
> Full file is here: https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
>
>     ======================
>     HIVE FAILURE OUTPUT
>     ======================
>     SET hive.support.sql11.reserved.keywords=false
>     SET hive.metastore.warehouse.dir=tmp/tests
>     OK
>     OK
>     FAILED: Execution Error, return code 1 from
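The comment above describes how Spark SQL persists partition-column metadata as flat string key/value pairs in the table's serdeInfo.parameters and parses it back when resolving the table. A minimal sketch of that round-trip idea, in Python rather than Spark's actual Scala code, with property names that are illustrative approximations (not guaranteed to match Spark's exact keys):

```python
# Illustrative sketch only: flatten a partition-column list into a string
# properties map, then reconstruct it, mimicking how metadata can survive a
# metastore that only stores string key/value pairs.

def write_partition_metadata(props, partition_cols):
    """Flatten the partition column list into string-valued properties."""
    props["sources.schema.numPartCols"] = str(len(partition_cols))
    for i, col in enumerate(partition_cols):
        props[f"sources.schema.partCol.{i}"] = col

def read_partition_metadata(props):
    """Reconstruct the partition column list from the properties map."""
    n = int(props.get("sources.schema.numPartCols", "0"))
    return [props[f"sources.schema.partCol.{i}"] for i in range(n)]

serde_props = {}
write_partition_metadata(serde_props, ["year"])
print(read_partition_metadata(serde_props))  # ['year']
```

Because the partition information lives only in these Spark-specific properties, Hive itself never sees the table as partitioned, which is consistent with the SHOW PARTITIONS failure shown above.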
[jira] [Commented] (SPARK-14422) Improve handling of optional configs in SQLConf
[ https://issues.apache.org/jira/browse/SPARK-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265589#comment-15265589 ] Marcelo Vanzin commented on SPARK-14422: Hi [~techaddict] you're free to take any bug that is not assigned to anyone. > Improve handling of optional configs in SQLConf > --- > > Key: SPARK-14422 > URL: https://issues.apache.org/jira/browse/SPARK-14422 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin >Priority: Minor > > As Michael showed here: > https://github.com/apache/spark/pull/12119/files/69aa1a005cc7003ab62d6dfcdef42181b053eaed#r58634150 > Handling of optional configs in SQLConf is a little sub-optimal right now. We > should clean that up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14422) Improve handling of optional configs in SQLConf
[ https://issues.apache.org/jira/browse/SPARK-14422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265573#comment-15265573 ]

Sandeep Singh commented on SPARK-14422:
---------------------------------------

Hi Marcelo, Do you mind if I take this up?
[jira] [Commented] (SPARK-13425) Documentation for CSV datasource options
[ https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265572#comment-15265572 ]

Hyukjin Kwon commented on SPARK-13425:
--------------------------------------

[~rxin] I will. Thanks! (I believe the R one is not there yet, https://issues.apache.org/jira/browse/SPARK-13174)
[jira] [Commented] (SPARK-14684) Verification of partition specs in SessionCatalog
[ https://issues.apache.org/jira/browse/SPARK-14684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265570#comment-15265570 ] Apache Spark commented on SPARK-14684: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/12801 > Verification of partition specs in SessionCatalog > - > > Key: SPARK-14684 > URL: https://issues.apache.org/jira/browse/SPARK-14684 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiao Li > > When users inputting invalid partition spec, we might not be able to catch > and issue the error messages. Sometimes, it could cause a disaster result. > For example, previously, when we alter a table and drop a partition with > invalid spec, it could drop all the partitions due to a bug/defect in Hive > Metastore API. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13425) Documentation for CSV datasource options
[ https://issues.apache.org/jira/browse/SPARK-13425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265564#comment-15265564 ]

Reynold Xin commented on SPARK-13425:
-------------------------------------

[~hyukjin.kwon] want to submit a pr now for this documentation? Remember we have scala, python, and maybe R (not sure if CSV data source exists in R yet).
[jira] [Resolved] (SPARK-14143) Options for parsing NaNs, Infinity and nulls for numeric types
[ https://issues.apache.org/jira/browse/SPARK-14143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-14143. - Resolution: Fixed Assignee: Hossein Falaki Fix Version/s: 2.0.0 > Options for parsing NaNs, Infinity and nulls for numeric types > -- > > Key: SPARK-14143 > URL: https://issues.apache.org/jira/browse/SPARK-14143 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Hossein Falaki >Assignee: Hossein Falaki > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
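SPARK-14143 above adds options for recognizing NaN, Infinity, and null tokens when parsing numeric CSV fields. A hedged sketch of the underlying idea, in plain Python; the parameter names (nan_value, null_value, and so on) are illustrative and do not claim to match Spark's exact option names:

```python
# Sketch of configurable token handling when coercing CSV strings to doubles.
# Tokens matching the configured sentinels map to None/NaN/±Infinity;
# everything else goes through normal float parsing.

def parse_double(token, nan_value="NaN", null_value="",
                 positive_inf="Inf", negative_inf="-Inf"):
    if token == null_value:
        return None
    if token == nan_value:
        return float("nan")
    if token == positive_inf:
        return float("inf")
    if token == negative_inf:
        return float("-inf")
    return float(token)

row = ["1.5", "NaN", "", "Inf", "-Inf"]
print([parse_double(t) for t in row])
```

Making the sentinel strings configurable matters because different data producers spell these values differently (for example "nan", "NULL", or "Infinity").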
[jira] [Resolved] (SPARK-15036) When creating a database, we need to qualify its path
[ https://issues.apache.org/jira/browse/SPARK-15036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15036. - Resolution: Fixed Fix Version/s: 2.0.0 > When creating a database, we need to qualify its path > - > > Key: SPARK-15036 > URL: https://issues.apache.org/jira/browse/SPARK-15036 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir
[ https://issues.apache.org/jira/browse/SPARK-15034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15034. - Resolution: Fixed Fix Version/s: 2.0.0 > Use the value of spark.sql.warehouse.dir as the warehouse location instead of > using hive.metastore.warehouse.dir > > > Key: SPARK-15034 > URL: https://issues.apache.org/jira/browse/SPARK-15034 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Labels: release_notes, releasenotes > Fix For: 2.0.0 > > > Starting from Spark 2.0, spark.sql.warehouse.dir will be the conf to set > warehouse location. We will not use hive.metastore.warehouse.dir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15035) SessionCatalog needs to set the location for default DB
[ https://issues.apache.org/jira/browse/SPARK-15035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-15035. - Resolution: Fixed Fix Version/s: 2.0.0 > SessionCatalog needs to set the location for default DB > --- > > Key: SPARK-15035 > URL: https://issues.apache.org/jira/browse/SPARK-15035 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Yin Huai >Assignee: Yin Huai > Fix For: 2.0.0 > > > Right now, in SessionCatalog, the default location of the database is an > empty string. It will break create table command when we use SparkSession > without hive support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14931) Mismatched default Param values between pipelines in Spark and PySpark
[ https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-14931: -- Summary: Mismatched default Param values between pipelines in Spark and PySpark (was: Mismatched default values between pipelines in Spark and PySpark) > Mismatched default Param values between pipelines in Spark and PySpark > -- > > Key: SPARK-14931 > URL: https://issues.apache.org/jira/browse/SPARK-14931 > Project: Spark > Issue Type: Bug >Reporter: Xusen Yin >Assignee: Xusen Yin > Labels: ML, PySpark > > Mismatched default values between pipelines in Spark and PySpark lead to > different pipelines in PySpark after saving and loading. > Find generic ways to check JavaParams then fix them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13448) Document MLlib behavior changes in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-13448:
--------------------------------------

    Description:

This JIRA keeps a list of MLlib behavior changes in Spark 2.0, so we can remember to add them to the migration guide / release notes.

* SPARK-13429: change convergenceTol in LogisticRegressionWithLBFGS from 1e-4 to 1e-6.
* SPARK-7780: The intercept will not be regularized when users train a binary classification model with an L1/L2 Updater via LogisticRegressionWithLBFGS, because it calls the ML LogisticRegression implementation. Also, if users train without regularization, training with or without feature scaling will return the same solution at the same convergence rate (because they run the same code path); this behavior differs from the old API.
* SPARK-12363: Bug fix for PowerIterationClustering which will likely change results.
* SPARK-13048: LDA using the EM optimizer will keep the last checkpoint by default, if checkpointing is being used.
* SPARK-12153: Word2Vec now respects sentence boundaries. Previously, it did not handle them correctly.
* SPARK-10574: HashingTF uses MurmurHash3 by default in both spark.ml and spark.mllib.
* SPARK-14768: Remove expectedType arg for PySpark Param.
* SPARK-14931: Mismatched default Param values between pipelines in Spark and PySpark.

    was: the same list, without the final SPARK-14931 entry.

> Document MLlib behavior changes in Spark 2.0
> --------------------------------------------
>
>                 Key: SPARK-13448
>                 URL: https://issues.apache.org/jira/browse/SPARK-13448
>             Project: Spark
>          Issue Type: Documentation
>          Components: ML, MLlib
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
[jira] [Updated] (SPARK-15043) Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr
[ https://issues.apache.org/jira/browse/SPARK-15043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-15043: --- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 Priority: Blocker (was: Major) Component/s: MLlib Summary: Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr (was: Flaky test: mllib.stat.JavaStatisticsSuite.testCorr) > Fix and re-enable flaky test: mllib.stat.JavaStatisticsSuite.testCorr > - > > Key: SPARK-15043 > URL: https://issues.apache.org/jira/browse/SPARK-15043 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Priority: Blocker > > It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become > flaky: > https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr > The first observed failure was in > https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816 > {code} > java.lang.AssertionError: expected:<0.9986422261219262> but > was:<0.9986422261219272> > at > org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75) > {code} > I'm going to ignore this test now, but we need to come back and fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15043) Flaky test: mllib.stat.JavaStatisticsSuite.testCorr
Josh Rosen created SPARK-15043: -- Summary: Flaky test: mllib.stat.JavaStatisticsSuite.testCorr Key: SPARK-15043 URL: https://issues.apache.org/jira/browse/SPARK-15043 Project: Spark Issue Type: Bug Reporter: Josh Rosen It looks like the {{mllib.stat.JavaStatisticsSuite.testCorr}} test has become flaky: https://spark-tests.appspot.com/tests/org.apache.spark.mllib.stat.JavaStatisticsSuite/testCorr The first observed failure was in https://spark-tests.appspot.com/builds/spark-master-test-maven-hadoop-2.6/816 {code} java.lang.AssertionError: expected:<0.9986422261219262> but was:<0.9986422261219272> at org.apache.spark.mllib.stat.JavaStatisticsSuite.testCorr(JavaStatisticsSuite.java:75) {code} I'm going to ignore this test now, but we need to come back and fix it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
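The failure above compares two doubles for exact equality (expected 0.9986422261219262 vs. actual 0.9986422261219272); a correlation computed over a distributed aggregation can legitimately differ in the last few ULPs from run to run. A tolerance-based assertion avoids that kind of flakiness. This is a sketch of the general technique, not a claim about the fix that was actually merged:

```python
import math

# The two values from the JIRA report differ only at the 16th significant
# digit, i.e. a handful of ULPs near 1.0.
expected = 0.9986422261219262
actual = 0.9986422261219272

# Exact equality is flaky for results of floating-point aggregation:
print(expected == actual)  # False

# A relative/absolute tolerance comparison tolerates last-digit drift:
print(math.isclose(expected, actual, rel_tol=1e-12, abs_tol=1e-12))  # True
```

JUnit's `assertEquals(double, double, delta)` overload expresses the same idea on the Java side.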
[jira] [Updated] (SPARK-14931) Mismatched default values between pipelines in Spark and PySpark
[ https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-14931:
--------------------------------------

    Assignee: Xusen Yin
[jira] [Updated] (SPARK-14931) Mismatched default values between pipelines in Spark and PySpark
[ https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-14931:
--------------------------------------

    Shepherd: Joseph K. Bradley
[jira] [Commented] (SPARK-14931) Mismatched default values between pipelines in Spark and PySpark
[ https://issues.apache.org/jira/browse/SPARK-14931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265518#comment-15265518 ]

Apache Spark commented on SPARK-14931:
--------------------------------------

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/12816
[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python/Java example.
[ https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-15031:
----------------------------------

    Description:

This PR aims to update the Scala/Python/Java examples by replacing `SQLContext` with the newly added `SparkSession`. For this, two new `SparkSession` constructors are added, and the following examples are also fixed.

**sql.py**
{code}
-people = sqlContext.jsonFile(path)
+people = sqlContext.read.json(path)
-people.registerAsTable("people")
+people.registerTempTable("people")
{code}

**dataframe_example.py**
{code}
-    features = df.select("features").map(lambda r: r.features)
+    features = df.select("features").rdd.map(lambda r: r.features)
{code}

Note that the following examples are untouched in this PR since they fail with some unknown issue.
- `simple_params_example.py`
- `aft_survival_regression.py`

    was:
This PR aims to update Scala/Python/Java examples by replacing SQLContext with newly added SparkSession. [...]

> Use SparkSession in Scala/Python/Java example.
> ----------------------------------------------
>
>                 Key: SPARK-15031
>                 URL: https://issues.apache.org/jira/browse/SPARK-15031
>             Project: Spark
>          Issue Type: Improvement
>          Components: Examples
>            Reporter: Dongjoon Hyun
>
[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python/Java example.
[ https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15031: -- Priority: Major (was: Trivial) Description: This PR aims to update Scala/Python/Java examples by replacing SQLContext with the newly added SparkSession. For this, two new `SparkSession` constructors are added, and the following examples are also fixed. **sql.py** {code} -people = sqlContext.jsonFile(path) +people = sqlContext.read.json(path) -people.registerAsTable("people") +people.registerTempTable("people") {code} **dataframe_example.py** {code} - features = df.select("features").map(lambda r: r.features) + features = df.select("features").rdd.map(lambda r: r.features) {code} Note that the following examples are untouched in this PR since they fail with an unknown issue. - `simple_params_example.py` - `aft_survival_regression.py` was: This PR aims to update Scala/Python examples by replacing SQLContext with the newly added SparkSession. Also, this fixes the following examples. **sql.py** {code} -people = sqlContext.jsonFile(path) +people = sqlContext.read.json(path) -people.registerAsTable("people") +people.registerTempTable("people") {code} **dataframe_example.py** {code} - features = df.select("features").map(lambda r: r.features) + features = df.select("features").rdd.map(lambda r: r.features) {code} Note that the following examples are untouched in this PR since they fail with an unknown issue. - `simple_params_example.py` - `aft_survival_regression.py` Summary: Use SparkSession in Scala/Python/Java example. (was: Use SparkSession in Scala/Python example.) > Use SparkSession in Scala/Python/Java example. > -- > > Key: SPARK-15031 > URL: https://issues.apache.org/jira/browse/SPARK-15031 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: Dongjoon Hyun > > This PR aims to update Scala/Python/Java examples by replacing SQLContext > with the newly added SparkSession. 
For this, two new `SparkSession` constructors are > added, and the following examples are also fixed. > **sql.py** > {code} > -people = sqlContext.jsonFile(path) > +people = sqlContext.read.json(path) > -people.registerAsTable("people") > +people.registerTempTable("people") > {code} > **dataframe_example.py** > {code} > - features = df.select("features").map(lambda r: r.features) > + features = df.select("features").rdd.map(lambda r: r.features) > {code} > Note that the following examples are untouched in this PR since they fail with an > unknown issue. > - `simple_params_example.py` > - `aft_survival_regression.py`
[jira] [Commented] (SPARK-15037) Use SparkSession instead of SQLContext in test suites
[ https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265472#comment-15265472 ] Dongjoon Hyun commented on SPARK-15037: --- I'll add the following constructor to `SparkSession` and proceed with SPARK-15031 first. {code} def this(sparkContext: JavaSparkContext) = this(sparkContext.sc) {code} > Use SparkSession instead of SQLContext in test suites > - > > Key: SPARK-15037 > URL: https://issues.apache.org/jira/browse/SPARK-15037 > Project: Spark > Issue Type: Test >Reporter: Dongjoon Hyun > > This issue aims to update the existing test suites to use `SparkSession` > instead of `SQLContext` since `SQLContext` exists just for backward > compatibility.
[jira] [Commented] (SPARK-15037) Use SparkSession instead of SQLContext in test suites
[ https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265470#comment-15265470 ] Dongjoon Hyun commented on SPARK-15037: --- It's because `JavaSparkContext` cannot be converted to a `SparkContext` in the following code. {code} JavaSparkContext ctx = new JavaSparkContext(sparkConf); SparkSession spark = new SparkSession(ctx); {code} > Use SparkSession instead of SQLContext in test suites > - > > Key: SPARK-15037 > URL: https://issues.apache.org/jira/browse/SPARK-15037 > Project: Spark > Issue Type: Test >Reporter: Dongjoon Hyun > > This issue aims to update the existing test suites to use `SparkSession` > instead of `SQLContext` since `SQLContext` exists just for backward > compatibility.
[jira] [Updated] (SPARK-15042) ConnectedComponents fails to compute graph with 200 vertices (but long paths)
[ https://issues.apache.org/jira/browse/SPARK-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Philipp Claßen updated SPARK-15042: --- Description: ConnectedComponents takes forever and eventually fails with OutOfMemory when computing this graph: {code}{ (i, i+1) | i <- { 1..200 } }{code} If you generate the example graph, e.g., with this bash command {code} for i in {1..200} ; do echo "$i $(($i+1))" ; done > input.graph {code} ... then you should be able to reproduce it in the spark-shell by running: {code} import org.apache.spark.graphx._ import org.apache.spark.graphx.lib._ val graph = GraphLoader.edgeListFile(sc, "input.graph").cache() ConnectedComponents.run(graph) {code} It seems to take forever, and spawns these warnings from time to time: {code} 16/04/30 20:06:24 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@7af98fbd,BlockManagerId(driver, localhost, 43440))] in 1 attempts {code} For additional information, here is a link to my related question on Stack Overflow: http://stackoverflow.com/q/36892272/783510 One comment so far was that the number of skipped tasks grows exponentially. --- Here is the complete output of a spark-shell session: {noformat} phil@terra-arch:~/tmp/spark-graph$ spark-shell log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). log4j:WARN Please initialize the log4j system properly. log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info. Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties To adjust logging level use sc.setLogLevel("INFO") Spark context available as sc. SQL context available as sqlContext. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 1.6.1 /_/ Using Scala version 2.11.7 (OpenJDK 64-Bit Server VM, Java 1.8.0_92) Type in expressions to have them evaluated. Type :help for more information. 
scala> import org.apache.spark.graphx._ import org.apache.spark.graphx._ scala> import org.apache.spark.graphx.lib._ import org.apache.spark.graphx.lib._ scala> scala> val graph = GraphLoader.edgeListFile(sc, "input.graph").cache() graph: org.apache.spark.graphx.Graph[Int,Int] = org.apache.spark.graphx.impl.GraphImpl@1fa9692b scala> ConnectedComponents.run(graph) 16/04/30 20:05:29 WARN NettyRpcEndpointRef: Error sending message [message = Heartbeat(driver,[Lscala.Tuple2;@50432fd2,BlockManagerId(driver, localhost, 43440))] in 1 attempts org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63) at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36) at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76) at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:101) at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:449) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:470) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470) at org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:470) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1765) at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:470) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at
[jira] [Commented] (SPARK-15037) Use SparkSession instead of SQLContext in test suites
[ https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265417#comment-15265417 ] Reynold Xin commented on SPARK-15037: - Why do we need JavaSparkSession? SparkSession itself should be Java-friendly. > Use SparkSession instead of SQLContext in test suites > - > > Key: SPARK-15037 > URL: https://issues.apache.org/jira/browse/SPARK-15037 > Project: Spark > Issue Type: Test >Reporter: Dongjoon Hyun > > This issue aims to update the existing test suites to use `SparkSession` > instead of `SQLContext` since `SQLContext` exists just for backward > compatibility.
[jira] [Commented] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features
[ https://issues.apache.org/jira/browse/SPARK-15041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265411#comment-15265411 ] Gayathri Murali commented on SPARK-15041: - I can work on this. > adding mode strategy for ml.feature.Imputer for categorical features > > > Key: SPARK-15041 > URL: https://issues.apache.org/jira/browse/SPARK-15041 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: yuhao yang >Priority: Minor > > Adding a mode strategy to ml.feature.Imputer for categorical features. This > needs to wait until the PR for SPARK-13568 gets merged. > https://github.com/apache/spark/pull/11601 > From comments of jkbradley and Nick Pentreath in the PR: > {quote} > Investigate efficiency of approaches using DataFrame/Dataset and/or approx > approaches such as frequentItems or Count-Min Sketch (will require an update > to CMS to return "heavy-hitters"). > investigate if we can use metadata to only allow mode for categorical > features (or perhaps as an easier alternative, allow mode for only Int/Long > columns) > {quote}
[jira] [Created] (SPARK-15042) ConnectedComponents fails to compute graph with 200 vertices (but long paths)
Philipp Claßen created SPARK-15042: -- Summary: ConnectedComponents fails to compute graph with 200 vertices (but long paths) Key: SPARK-15042 URL: https://issues.apache.org/jira/browse/SPARK-15042 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.6.1 Environment: Local cluster (1 instance) running on Arch Linux Scala 2.11.7, Java 1.8.0_92 Reporter: Philipp Claßen ConnectedComponents takes forever and eventually fails with OutOfMemory when computing this graph: {code}{ (i, i+1) | i <- { 1..200 } }{code} If you generate the example graph, e.g., with this bash command {code} for i in {1..200} ; do echo "$i $(($i+1))" ; done > input.graph {code} ... then you should be able to reproduce it in the spark-shell by running: {code} import org.apache.spark.graphx._ import org.apache.spark.graphx.lib._ val graph = GraphLoader.edgeListFile(sc, "input.graph").cache() ConnectedComponents.run(graph) {code} For additional information, here is a link to my related question on Stack Overflow: http://stackoverflow.com/q/36892272/783510 One comment so far was that the number of skipped tasks grows exponentially.
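For reference, the expected answer for this graph is easy to state: the edges { (i, i+1) | i <- 1..200 } form one long path, so every vertex belongs to a single connected component. A pure-Python union-find sketch (not GraphX code, just an illustration of what ConnectedComponents should compute) confirms this:

```python
# Pure-Python sketch (not GraphX): the expected result of connected
# components on the path graph { (i, i+1) | i <- 1..200 }.

def connected_components(edges):
    """Union-find over the vertices appearing in `edges`; returns a
    dict mapping each vertex to its component representative."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    for a, b in edges:
        union(a, b)
    return {v: find(v) for v in parent}

edges = [(i, i + 1) for i in range(1, 201)]  # same graph as input.graph
components = connected_components(edges)
assert len(components) == 201                 # vertices 1..201
assert len(set(components.values())) == 1     # one chain, one component
```

A plain union-find handles this instantly; the issue is thus about how GraphX's Pregel-based iteration behaves on graphs with very long paths (diameter ~200), not about the size of the data.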
[jira] [Commented] (SPARK-14993) Inconsistent behavior of partitioning discovery
[ https://issues.apache.org/jira/browse/SPARK-14993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265397#comment-15265397 ] Xiao Li commented on SPARK-14993: - Ok, if nobody starts it, I will work on this. Thanks! > Inconsistent behavior of partitioning discovery > --- > > Key: SPARK-14993 > URL: https://issues.apache.org/jira/browse/SPARK-14993 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > When we load a dataset, if we set the path to {{/path/a=1}}, we will not take > a as the partitioning column. However, if we set the path to > {{/path/a=1/file.parquet}}, we take a as the partitioning column and it shows > up in the schema. We should make the behaviors of these two cases consistent > by not putting a into the schema for the second case. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15041) adding mode strategy for ml.feature.Imputer for categorical features
yuhao yang created SPARK-15041: -- Summary: adding mode strategy for ml.feature.Imputer for categorical features Key: SPARK-15041 URL: https://issues.apache.org/jira/browse/SPARK-15041 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Priority: Minor Adding a mode strategy to ml.feature.Imputer for categorical features. This needs to wait until the PR for SPARK-13568 gets merged. https://github.com/apache/spark/pull/11601 From comments of jkbradley and Nick Pentreath in the PR: {quote} Investigate efficiency of approaches using DataFrame/Dataset and/or approx approaches such as frequentItems or Count-Min Sketch (will require an update to CMS to return "heavy-hitters"). investigate if we can use metadata to only allow mode for categorical features (or perhaps as an easier alternative, allow mode for only Int/Long columns) {quote}
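To illustrate the proposed strategy in isolation (plain Python, not the eventual ml.feature.Imputer API): mode imputation replaces each missing categorical value with the most frequent observed value in the column.

```python
# Minimal sketch of mode imputation; the function name and the
# `missing` sentinel are illustrative, not part of any Spark API.
from collections import Counter

def impute_mode(values, missing=None):
    """Replace `missing` entries with the most frequent non-missing value."""
    observed = [v for v in values if v != missing]
    mode, _count = Counter(observed).most_common(1)[0]
    return [mode if v == missing else v for v in values]

col = ["red", "blue", None, "red", None, "green", "red"]
print(impute_mode(col))  # ['red', 'blue', 'red', 'red', 'red', 'green', 'red']
```

The Counter-based exact count is the naive baseline; the approximate approaches mentioned in the quoted PR comments (frequentItems, Count-Min Sketch) would trade exactness for bounded memory on high-cardinality columns.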
[jira] [Created] (SPARK-15040) PySpark impl for ml.feature.Imputer
yuhao yang created SPARK-15040: -- Summary: PySpark impl for ml.feature.Imputer Key: SPARK-15040 URL: https://issues.apache.org/jira/browse/SPARK-15040 Project: Spark Issue Type: New Feature Components: ML Reporter: yuhao yang Priority: Minor PySpark implementation of ml.feature.Imputer. This needs to wait until the PR for SPARK-13568 gets merged. https://github.com/apache/spark/pull/11601
[jira] [Updated] (SPARK-15039) Kinesis receiver does not work in YARN
[ https://issues.apache.org/jira/browse/SPARK-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsai Li Ming updated SPARK-15039: - Description: Hi, Using the pyspark kinesis example, it does not receive any messages from Kinesis when submitting to a YARN cluster, though it is working fine when using local mode. {code} spark-submit \ --executor-cores 4 \ --num-executors 4 \ --packages com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1 {code} I had to downgrade the package to 1.5.1. 1.6.1 does not work either. Not sure whether this is related to SPARK-12453 was: Hi, Using the pyspark kinesis example, it does not receive any messages from Kinesis when submitting to a YARN cluster, though it is working fine when using local mode. {code} spark-submit \ --executor-cores 4 \ --num-executors 4 \ --packages com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1 {code} I had to downgrade the package to 1.5.1. 1.6.1 does not work either. > Kinesis receiver does not work in YARN > -- > > Key: SPARK-15039 > URL: https://issues.apache.org/jira/browse/SPARK-15039 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 > Environment: YARN > HDP 2.4.0 >Reporter: Tsai Li Ming > > Hi, > Using the pyspark kinesis example, it does not receive any messages from > Kinesis when submitting to a YARN cluster, though it is working fine when > using local mode. > {code} > spark-submit \ > --executor-cores 4 \ > --num-executors 4 \ > --packages > com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1 > > {code} > I had to downgrade the package to 1.5.1. 1.6.1 does not work either. 
> Not sure whether this is related to SPARK-12453
[jira] [Commented] (SPARK-14785) Support correlated scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265357#comment-15265357 ] Apache Spark commented on SPARK-14785: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/12815 > Support correlated scalar subquery > -- > > Key: SPARK-14785 > URL: https://issues.apache.org/jira/browse/SPARK-14785 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu > > For example: > {code} > SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id) > {code} > it could be rewritten as > {code} > SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON > t3.id = t.id where b > avg_c > {code} > TPCDS Q92, Q81, Q6 required this > Update: TPCDS Q1 and Q30 also require correlated scalar subquery support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14785) Support correlated scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14785: Assignee: (was: Apache Spark) > Support correlated scalar subquery > -- > > Key: SPARK-14785 > URL: https://issues.apache.org/jira/browse/SPARK-14785 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu > > For example: > {code} > SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id) > {code} > it could be rewritten as > {code} > SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON > t3.id = t.id where b > avg_c > {code} > TPCDS Q92, Q81, Q6 required this > Update: TPCDS Q1 and Q30 also require correlated scalar subquery support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14785) Support correlated scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14785: Assignee: Apache Spark > Support correlated scalar subquery > -- > > Key: SPARK-14785 > URL: https://issues.apache.org/jira/browse/SPARK-14785 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu >Assignee: Apache Spark > > For example: > {code} > SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id) > {code} > it could be rewritten as > {code} > SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON > t3.id = t.id where b > avg_c > {code} > TPCDS Q92, Q81, Q6 required this > Update: TPCDS Q1 and Q30 also require correlated scalar subquery support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15039) Kinesis receiver does not work in YARN
[ https://issues.apache.org/jira/browse/SPARK-15039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tsai Li Ming updated SPARK-15039: - Description: Hi, Using the pyspark kinesis example, it does not receive any messages from Kinesis when submitting to a YARN cluster, though it is working fine when using local mode. {code} spark-submit \ --executor-cores 4 \ --num-executors 4 \ --packages com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1 {code} I had to downgrade the package to 1.5.1. 1.6.1 does not work either. was: Hi, Using the pyspark kinesis example, it does not receive any messages from Kinesis when submitting to a YARN cluster, though it is working when using local mode. ``` spark-submit \ --executor-cores 4 \ --num-executors 4 \ --packages com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1 ``` I had to downgrade the package to 1.5.1 before it would work. > Kinesis receiver does not work in YARN > -- > > Key: SPARK-15039 > URL: https://issues.apache.org/jira/browse/SPARK-15039 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.6.0 > Environment: YARN > HDP 2.4.0 >Reporter: Tsai Li Ming > > Hi, > Using the pyspark kinesis example, it does not receive any messages from > Kinesis when submitting to a YARN cluster, though it is working fine when > using local mode. > {code} > spark-submit \ > --executor-cores 4 \ > --num-executors 4 \ > --packages > com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1 > > {code} > I had to downgrade the package to 1.5.1. 1.6.1 does not work either.
[jira] [Commented] (SPARK-14952) Remove methods that were deprecated in 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265356#comment-15265356 ] Apache Spark commented on SPARK-14952: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/12815 > Remove methods that were deprecated in 1.6.0 > > > Key: SPARK-14952 > URL: https://issues.apache.org/jira/browse/SPARK-14952 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > Fix For: 2.0.0 > > > Running {{grep -inr "@deprecated"}} I found a few methods that were > deprecated in SPARK-1.6: > {noformat} > ./core/src/main/scala/org/apache/spark/input/PortableDataStream.scala:193: > @deprecated("Closing the PortableDataStream is not needed anymore.", "1.6.0") > ./mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala:392: > @deprecated("Use coefficients instead.", "1.6.0") > ./mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala:483: > @deprecated("Use coefficients instead.", "1.6.0") > {noformat} > Lets remove those as part of 2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15039) Kinesis receiver does not work in YARN
Tsai Li Ming created SPARK-15039: Summary: Kinesis receiver does not work in YARN Key: SPARK-15039 URL: https://issues.apache.org/jira/browse/SPARK-15039 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.6.0 Environment: YARN HDP 2.4.0 Reporter: Tsai Li Ming Hi, Using the pyspark kinesis example, it does not receive any messages from Kinesis when submitting to a YARN cluster, though it is working when using local mode. ``` spark-submit \ --executor-cores 4 \ --num-executors 4 \ --packages com.databricks:spark-redshift_2.10:0.6.0,com.databricks:spark-csv_2.10:1.4.0,org.apache.spark:spark-streaming-kinesis-asl_2.10:1.5.1 ``` I had to downgrade the package to 1.5.1 before it would work.
[jira] [Resolved] (SPARK-15030) Support formula in spark.kmeans in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15030. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12813 [https://github.com/apache/spark/pull/12813] > Support formula in spark.kmeans in SparkR > - > > Key: SPARK-15030 > URL: https://issues.apache.org/jira/browse/SPARK-15030 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > Fix For: 2.0.0 > > > In SparkR, spark.kmeans take a DataFrame with double columns. This is > different from other ML methods we implemented, which support R model > formula. We should add support for that as well. > {code:none} > spark.kmeans(data = df, formula = ~ lat + lon, ...) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15038) Add ability to do broadcasts in SQL at execution time
Patrick Woody created SPARK-15038: - Summary: Add ability to do broadcasts in SQL at execution time Key: SPARK-15038 URL: https://issues.apache.org/jira/browse/SPARK-15038 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.6.1 Reporter: Patrick Woody Currently, the auto-broadcasting done in Spark SQL is asynchronous and happens at query-planning time. If you have a large query with many broadcasts, this can create a large amount of memory pressure, and possibly OOMs, all at once when it isn't actually necessary. The current workaround for these types of queries is to disable broadcast joins, which can be prohibitively expensive performance-wise. The proposal for this ticket is to add a configuration option that toggles between performing these broadcasts eagerly/asynchronously and performing them lazily at execution time.
[jira] [Commented] (SPARK-14785) Support correlated scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265338#comment-15265338 ] Xiao Li commented on SPARK-14785: - Update: TPCDS Q1 and Q30 also require correlated scalar subquery support. Thanks! > Support correlated scalar subquery > -- > > Key: SPARK-14785 > URL: https://issues.apache.org/jira/browse/SPARK-14785 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu > > For example: > {code} > SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id) > {code} > it could be rewritten as > {code} > SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON > t3.id = t.id where b > avg_c > {code} > TPCDS Q92, Q81, Q6 required this > Update: TPCDS Q1 and Q30 also require correlated scalar subquery support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14785) Support correlated scalar subquery
[ https://issues.apache.org/jira/browse/SPARK-14785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-14785: Description: For example: {code} SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id) {code} it could be rewritten as {code} SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON t3.id = t.id where b > avg_c {code} TPCDS Q92, Q81, Q6 required this Update: TPCDS Q1 and Q30 also require correlated scalar subquery support. was: For example: {code} SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id) {code} it could be rewritten as {code} SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON t3.id = t.id where b > avg_c {code} TPCDS Q92, Q81, Q6 required this > Support correlated scalar subquery > -- > > Key: SPARK-14785 > URL: https://issues.apache.org/jira/browse/SPARK-14785 > Project: Spark > Issue Type: New Feature >Reporter: Davies Liu > > For example: > {code} > SELECT a from t where b > (select avg(c) from t2 where t.id = t2.id) > {code} > it could be rewritten as > {code} > SELECT a FROM t JOIN (SELECT id, AVG(c) as avg_c FROM t2 GROUP by id) t3 ON > t3.id = t.id where b > avg_c > {code} > TPCDS Q92, Q81, Q6 required this > Update: TPCDS Q1 and Q30 also require correlated scalar subquery support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265335#comment-15265335 ] Xiao Li commented on SPARK-14968: - [~hvanhovell] Yeah, you are right. After trying to reproduce it, I got the following error message. "org.apache.spark.sql.AnalysisException: Correlated scalar subqueries are not supported" Glad to know you are working on the support of correlated scalar subquery. Thanks! Xiao > TPC-DS query 1 resolved attribute(s) missing > > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > > This is a regression from a week ago. Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in build from 0421. > The error is: > {noformat} > 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from > processCmd at CliDriver.java:376 > 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes. 
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from > ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = > ctr_store_sk#2); > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/static/sql,null} > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/SQL/execution/json,null} > {noformat} > The query is: > {noformat} > with customer_total_return as > (select sr_customer_sk as ctr_customer_sk > ,sr_store_sk as ctr_store_sk > ,sum(SR_RETURN_AMT) as ctr_total_return > from store_returns > ,date_dim > where sr_returned_date_sk = d_date_sk > and d_year =2000 > group by sr_customer_sk > ,sr_store_sk) > select c_customer_id > from customer_total_return ctr1 > ,store > ,customer > where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 > from customer_total_return ctr2 > where ctr1.ctr_store_sk = ctr2.ctr_store_sk) > and s_store_sk = ctr1.ctr_store_sk > and s_state = 'TN' > and ctr1.ctr_customer_sk = c_customer_sk > order by c_customer_id > limit 100 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14952) Remove methods that were deprecated in 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14952: -- Assignee: Herman van Hovell > Remove methods that were deprecated in 1.6.0 > > > Key: SPARK-14952 > URL: https://issues.apache.org/jira/browse/SPARK-14952 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Reporter: Herman van Hovell >Assignee: Herman van Hovell >Priority: Minor > Fix For: 2.0.0 > > > Running {{grep -inr "@deprecated"}} I found a few methods that were > deprecated in SPARK-1.6: > {noformat} > ./core/src/main/scala/org/apache/spark/input/PortableDataStream.scala:193: > @deprecated("Closing the PortableDataStream is not needed anymore.", "1.6.0") > ./mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala:392: > @deprecated("Use coefficients instead.", "1.6.0") > ./mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala:483: > @deprecated("Use coefficients instead.", "1.6.0") > {noformat} > Lets remove those as part of 2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14952) Remove methods that were deprecated in 1.6.0
[ https://issues.apache.org/jira/browse/SPARK-14952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14952. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12732 [https://github.com/apache/spark/pull/12732] > Remove methods that were deprecated in 1.6.0 > > > Key: SPARK-14952 > URL: https://issues.apache.org/jira/browse/SPARK-14952 > Project: Spark > Issue Type: Improvement > Components: ML, Spark Core >Reporter: Herman van Hovell >Priority: Minor > Fix For: 2.0.0 > > > Running {{grep -inr "@deprecated"}} I found a few methods that were > deprecated in SPARK-1.6: > {noformat} > ./core/src/main/scala/org/apache/spark/input/PortableDataStream.scala:193: > @deprecated("Closing the PortableDataStream is not needed anymore.", "1.6.0") > ./mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala:392: > @deprecated("Use coefficients instead.", "1.6.0") > ./mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala:483: > @deprecated("Use coefficients instead.", "1.6.0") > {noformat} > Lets remove those as part of 2.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
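For readers unfamiliar with the lifecycle being completed here: the Scala `@deprecated("Use coefficients instead.", "1.6.0")` annotations warned callers for one release cycle before the methods were deleted in 2.0. As an illustration only (Python rather than Scala, and not Spark's code), the same deprecate-then-remove pattern looks like this at runtime:

```python
import warnings

def coefficients():
    # The replacement API that callers should migrate to.
    return [0.5, 1.5]

def weights():
    # Analog of @deprecated("Use coefficients instead.", "1.6.0"):
    # keep the old name working for one cycle, but warn on every call.
    warnings.warn("Use coefficients instead.", DeprecationWarning, stacklevel=2)
    return coefficients()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = weights()

print(result, caught[0].message)
```

Once a release has shipped with the warning, deleting `weights` entirely (as this PR does for the Scala methods) is a clean break rather than a silent behavior change.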
[jira] [Commented] (SPARK-14850) VectorUDT/MatrixUDT should take primitive arrays without boxing
[ https://issues.apache.org/jira/browse/SPARK-14850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265330#comment-15265330 ] Apache Spark commented on SPARK-14850: -- User 'tedyu' has created a pull request for this issue: https://github.com/apache/spark/pull/12814 > VectorUDT/MatrixUDT should take primitive arrays without boxing > --- > > Key: SPARK-14850 > URL: https://issues.apache.org/jira/browse/SPARK-14850 > Project: Spark > Issue Type: Improvement > Components: ML, SQL >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.0.0 > > > In SPARK-9390, we switched to use GenericArrayData to store indices and > values in vector/matrix UDTs. However, GenericArrayData is not specialized > for primitive types. This might hurt MLlib performance badly. We should > consider either specialize GenericArrayData or use a different container. > cc: [~cloud_fan] [~yhuai] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
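The boxing cost the issue describes can be sketched outside the JVM with Python's stdlib `array` module: a list holds a pointer per element to a separately allocated (boxed) float object, while `array('d')` stores raw 8-byte doubles contiguously, roughly what a primitive-specialized container does. This is an illustrative analogy, not Spark or GenericArrayData code:

```python
import sys
from array import array

values = [float(i) for i in range(1000)]

# Boxed layout: the list's pointer slots plus one float object per element.
boxed_size = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Unboxed layout: one contiguous buffer of 8-byte doubles, no per-element objects.
unboxed = array("d", values)
unboxed_size = sys.getsizeof(unboxed)

print(boxed_size, unboxed_size)  # the unboxed buffer is several times smaller
```

Beyond memory, the boxed layout also costs an allocation and a pointer dereference per element, which is the kind of overhead that hurts tight numeric loops in MLlib.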
[jira] [Resolved] (SPARK-14653) Remove NumericParser and jackson dependency from mllib-local
[ https://issues.apache.org/jira/browse/SPARK-14653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-14653. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12802 [https://github.com/apache/spark/pull/12802] > Remove NumericParser and jackson dependency from mllib-local > > > Key: SPARK-14653 > URL: https://issues.apache.org/jira/browse/SPARK-14653 > Project: Spark > Issue Type: Sub-task > Components: Build, ML >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > After SPARK-14549, we should remove NumericParser and jackson from > mllib-local, which were introduced very earlier and now replaced by UDTs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14162) java.lang.IllegalStateException: Did not find registered driver with class oracle.jdbc.OracleDriver
[ https://issues.apache.org/jira/browse/SPARK-14162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265310#comment-15265310 ] Martin Hall commented on SPARK-14162: - I got the same error when I had forgotten to copy the oracle jdbc jar file (ojdbc6.jar) to one of the spark worker nodes > java.lang.IllegalStateException: Did not find registered driver with class > oracle.jdbc.OracleDriver > --- > > Key: SPARK-14162 > URL: https://issues.apache.org/jira/browse/SPARK-14162 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Zoltan Fedor > > This is an interesting one. > We are using JupyterHub with Python to connect to a Hadoop cluster to run > Spark jobs and as the new Spark versions come out I compile them and add as > new kernels to JupyterHub to be used. > There are also some libraries we are using, like ojdbc to connect to an > Oracle database. > Now the interesting thing, that ojdbc worked fine in Spark 1.6.0 but suddenly > "it cannot be found" in 1.6.1. > Everything, all settings are the same when starting pyspark 1.6.1 and 1.6.0, > so there is no reason for it not to work in 1.6.1 if it works in 1.6.0. 
> This is the pyspark code I am running in both 1.6.1 and 1.6.0: > {quote} > df = > sqlContext.read.format('jdbc').options(url='jdbc:oracle:thin:'+connection_script+'', > dbtable='bi.contact').load() > print(df.count()){quote} > And it throws this error in 1.6.1 only: > {quote} > java.lang.IllegalStateException: Did not find registered driver with class > oracle.jdbc.OracleDriver > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2$$anonfun$3.apply(JdbcUtils.scala:58) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2$$anonfun$3.apply(JdbcUtils.scala:58) > at scala.Option.getOrElse(Option.scala:120) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:57) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$createConnectionFactory$2.apply(JdbcUtils.scala:52) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$$anon$1.<init>(JDBCRDD.scala:347) > at > org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD.compute(JDBCRDD.scala:339) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745){quote} > I know that this usually means that the ojdbc driver is not available on the > executor, but it is. Spark is being started the exact same way in 1.6.1 as in > 1.6.0 and it does find it on 1.6.0. > I can steadily reproduce this, so the only conclusion is that something must > have changed between 1.6.0 and 1.6.1 causing this, but I have seen no > "deprecation" notice of anything that could cause this. > Environment variables set when starting pyspark 1.6.1: > {quote} > "SPARK_HOME": "/usr/lib/spark-1.6.1-hive", > "SCALA_HOME": "/usr/lib/scala", > "HADOOP_CONF_DIR": "/etc/hadoop/venus-hadoop-conf", > "HADOOP_HOME": "/usr/bin/hadoop", > "HIVE_HOME": "/usr/bin/hive", > "LD_LIBRARY_PATH": "/usr/local/hadoop/lib/native/:$LD_LIBRARY_PATH", > "YARN_HOME": "", > "SPARK_DIST_CLASSPATH": > "/etc/hadoop/conf:/usr/lib/hadoop/lib/*:/usr/lib/hadoop/.//*:/usr/lib/hadoop-hdfs/./:/usr/lib/hadoop-hdfs/lib/*:/usr/lib/hadoop-hdfs/.//*:/usr/lib/hadoop-yarn/lib/*:/usr/lib/hadoop-yarn/.//*:/usr/lib/hadoop-mapreduce/lib/*:/usr/lib/hadoop-mapreduce/.//*", > "SPARK_LIBRARY_PATH": "/usr/lib/hadoop/lib", > "PATH": >
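For context on the stack trace above: the exception is thrown by an executor-side lookup in which `JdbcUtils.createConnectionFactory` searches the JVM's registered JDBC drivers for one whose class name matches, and the `Option.getOrElse` fallback raises when no match exists on that node (for example, when the ojdbc jar never reached one worker). A hedged Python sketch of that lookup shape, with purely illustrative names:

```python
# Hypothetical registry on one executor where ojdbc6.jar was never loaded,
# so Oracle's driver class was never registered there.
registered_drivers = ["org.postgresql.Driver"]

def find_driver(class_name):
    # Scan registered drivers; raise if none matches (the getOrElse fallback).
    for driver in registered_drivers:
        if driver == class_name:
            return driver
    raise RuntimeError(
        "Did not find registered driver with class %s" % class_name)

try:
    find_driver("oracle.jdbc.OracleDriver")
except RuntimeError as e:
    print(e)
```

This is why the same job can work on one cluster node and fail on another: the lookup is per-JVM, so every executor needs the driver jar on its classpath.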
[jira] [Commented] (SPARK-14968) TPC-DS query 1 resolved attribute(s) missing
[ https://issues.apache.org/jira/browse/SPARK-14968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265296#comment-15265296 ] Herman van Hovell commented on SPARK-14968: --- [~jfc...@us.ibm.com] This is a correlated scalar subquery, and this does not work yet. I am currently working on this. > TPC-DS query 1 resolved attribute(s) missing > > > Key: SPARK-14968 > URL: https://issues.apache.org/jira/browse/SPARK-14968 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: JESSE CHEN >Priority: Critical > > This is a regression from a week ago. Failed to generate plan for query 1 in > TPCDS using 0427 build from > people.apache.org/~pwendell/spark-nightly/spark-master-bin/latest/. > Was working in build from 0421. > The error is: > {noformat} > 16/04/27 07:00:59 INFO spark.SparkContext: Created broadcast 3 from > processCmd at CliDriver.java:376 > 16/04/27 07:00:59 INFO datasources.FileSourceStrategy: Planning scan with bin > packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 > bytes.
> Error in query: resolved attribute(s) ctr_store_sk#2#535 missing from > ctr_store_sk#2,ctr_total_return#3 in operator !Filter (ctr_store_sk#2#535 = > ctr_store_sk#2); > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/static/sql,null} > 16/04/27 07:00:59 INFO handler.ContextHandler: stopped > o.s.j.s.ServletContextHandler{/SQL/execution/json,null} > {noformat} > The query is: > {noformat} > with customer_total_return as > (select sr_customer_sk as ctr_customer_sk > ,sr_store_sk as ctr_store_sk > ,sum(SR_RETURN_AMT) as ctr_total_return > from store_returns > ,date_dim > where sr_returned_date_sk = d_date_sk > and d_year =2000 > group by sr_customer_sk > ,sr_store_sk) > select c_customer_id > from customer_total_return ctr1 > ,store > ,customer > where ctr1.ctr_total_return > (select avg(ctr_total_return)*1.2 > from customer_total_return ctr2 > where ctr1.ctr_store_sk = ctr2.ctr_store_sk) > and s_store_sk = ctr1.ctr_store_sk > and s_state = 'TN' > and ctr1.ctr_customer_sk = c_customer_sk > order by c_customer_id > limit 100 > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
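The blocking construct in query 1 is the correlated scalar subquery: the inner `select avg(ctr_total_return)*1.2` refers to `ctr1.ctr_store_sk` from the outer query. The shape of that predicate can be reproduced on toy data with stdlib `sqlite3`, which does support the construct:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ctr (customer INTEGER, store INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO ctr VALUES (?, ?, ?)",
    [(1, 10, 100.0), (2, 10, 500.0), (3, 20, 50.0), (4, 20, 55.0)],
)

# Keep customers whose return total exceeds 1.2x their own store's average;
# the inner query is re-evaluated per outer row via ctr1.store.
rows = conn.execute("""
    SELECT customer FROM ctr AS ctr1
    WHERE ctr1.total > (SELECT AVG(ctr2.total) * 1.2
                        FROM ctr AS ctr2
                        WHERE ctr2.store = ctr1.store)
    ORDER BY customer
""").fetchall()
print(rows)  # [(2,)] -- only customer 2 beats 1.2x their store's average
```

The per-outer-row dependency is exactly what made this hard for Spark's subquery-to-join rewrite at the time.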
[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python example.
[ https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15031: -- Description: This PR aims to update Scala/Python examples by replacing SQLContext with newly added SparkSession. Also, this fixes the following examples. **sql.py** {code} -people = sqlContext.jsonFile(path) +people = sqlContext.read.json(path) -people.registerAsTable("people") +people.registerTempTable("people") {code} **dataframe_example.py** {code} - features = df.select("features").map(lambda r: r.features) + features = df.select("features").rdd.map(lambda r: r.features) {code} Note that the following examples are untouched in this PR since it fails some unknown issue. - `simple_params_example.py` - `aft_survival_regression.py` was: Currently, Python SQL example, `sql.py`, fails. {code} bin/spark-submit examples/src/main/python/sql.py Traceback (most recent call last): File "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", line 60, in people = sqlContext.jsonFile(path) AttributeError: 'SQLContext' object has no attribute 'jsonFile' {code} {code} Traceback (most recent call last): File "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", line 72, in people.registerAsTable("people") File "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 795, in __getattr__ AttributeError: 'DataFrame' object has no attribute 'registerAsTable' {code} This issue fixes them by the following fix. {code} -people = sqlContext.jsonFile(path) +people = sqlContext.read.json(path) ... -people.registerAsTable("people") +people.registerTempTable("people") {code} > Use SparkSession in Scala/Python example. 
> - > > Key: SPARK-15031 > URL: https://issues.apache.org/jira/browse/SPARK-15031 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: Dongjoon Hyun >Priority: Trivial > > This PR aims to update Scala/Python examples by replacing SQLContext with > newly added SparkSession. Also, this fixes the following examples. > **sql.py** > {code} > -people = sqlContext.jsonFile(path) > +people = sqlContext.read.json(path) > -people.registerAsTable("people") > +people.registerTempTable("people") > {code} > **dataframe_example.py** > {code} > - features = df.select("features").map(lambda r: r.features) > + features = df.select("features").rdd.map(lambda r: r.features) > {code} > Note that the following examples are untouched in this PR since it fails some > unknown issue. > - `simple_params_example.py` > - `aft_survival_regression.py` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
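One detail worth noting about the replacement API in the diff above: `sqlContext.read.json` (like the `jsonFile` it replaces) expects newline-delimited JSON, one complete object per line rather than a single JSON array. A stdlib sketch of that format (the sample records are illustrative):

```python
import json

# Newline-delimited JSON: one self-contained object per line.
raw = '{"name": "Michael"}\n{"name": "Andy", "age": 30}\n'

people = [json.loads(line) for line in raw.splitlines()]
print(people[1]["age"])  # 30
```

Feeding a pretty-printed JSON array to `read.json` instead yields corrupt-record rows, which is a common follow-on confusion after migrating these examples.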
[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python example.
[ https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15031: -- Issue Type: Improvement (was: Bug) > Use SparkSession in Scala/Python example. > - > > Key: SPARK-15031 > URL: https://issues.apache.org/jira/browse/SPARK-15031 > Project: Spark > Issue Type: Improvement > Components: Examples >Reporter: Dongjoon Hyun >Priority: Trivial > > Currently, Python SQL example, `sql.py`, fails. > {code} > bin/spark-submit examples/src/main/python/sql.py > Traceback (most recent call last): > File > "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", > line 60, in > people = sqlContext.jsonFile(path) > AttributeError: 'SQLContext' object has no attribute 'jsonFile' > {code} > {code} > Traceback (most recent call last): > File > "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", > line 72, in > people.registerAsTable("people") > File > "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line > 795, in __getattr__ > AttributeError: 'DataFrame' object has no attribute 'registerAsTable' > {code} > This issue fixes them by the following fix. > {code} > -people = sqlContext.jsonFile(path) > +people = sqlContext.read.json(path) > ... > -people.registerAsTable("people") > +people.registerTempTable("people") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15031) Use SparkSession in Scala/Python example.
[ https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15031: -- Summary: Use SparkSession in Scala/Python example. (was: Fix SQL python example) > Use SparkSession in Scala/Python example. > - > > Key: SPARK-15031 > URL: https://issues.apache.org/jira/browse/SPARK-15031 > Project: Spark > Issue Type: Bug > Components: Examples >Reporter: Dongjoon Hyun >Priority: Trivial > > Currently, Python SQL example, `sql.py`, fails. > {code} > bin/spark-submit examples/src/main/python/sql.py > Traceback (most recent call last): > File > "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", > line 60, in > people = sqlContext.jsonFile(path) > AttributeError: 'SQLContext' object has no attribute 'jsonFile' > {code} > {code} > Traceback (most recent call last): > File > "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", > line 72, in > people.registerAsTable("people") > File > "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line > 795, in __getattr__ > AttributeError: 'DataFrame' object has no attribute 'registerAsTable' > {code} > This issue fixes them by the following fix. > {code} > -people = sqlContext.jsonFile(path) > +people = sqlContext.read.json(path) > ... > -people.registerAsTable("people") > +people.registerTempTable("people") > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14906) Move VectorUDT and MatrixUDT in PySpark to new ML package
[ https://issues.apache.org/jira/browse/SPARK-14906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265288#comment-15265288 ] Liang-Chi Hsieh commented on SPARK-14906: - Yes. > Move VectorUDT and MatrixUDT in PySpark to new ML package > - > > Key: SPARK-14906 > URL: https://issues.apache.org/jira/browse/SPARK-14906 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Reporter: Liang-Chi Hsieh > > As we move VectorUDT and MatrixUDT in Scala to new ml package, the PySpark > codes should be moved too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15030) Support formula in spark.kmeans in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15030: Assignee: Yanbo Liang (was: Apache Spark) > Support formula in spark.kmeans in SparkR > - > > Key: SPARK-15030 > URL: https://issues.apache.org/jira/browse/SPARK-15030 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > In SparkR, spark.kmeans take a DataFrame with double columns. This is > different from other ML methods we implemented, which support R model > formula. We should add support for that as well. > {code:none} > spark.kmeans(data = df, formula = ~ lat + lon, ...) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15030) Support formula in spark.kmeans in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15030: Assignee: Apache Spark (was: Yanbo Liang) > Support formula in spark.kmeans in SparkR > - > > Key: SPARK-15030 > URL: https://issues.apache.org/jira/browse/SPARK-15030 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark > > In SparkR, spark.kmeans take a DataFrame with double columns. This is > different from other ML methods we implemented, which support R model > formula. We should add support for that as well. > {code:none} > spark.kmeans(data = df, formula = ~ lat + lon, ...) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15030) Support formula in spark.kmeans in SparkR
[ https://issues.apache.org/jira/browse/SPARK-15030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265282#comment-15265282 ] Apache Spark commented on SPARK-15030: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/12813 > Support formula in spark.kmeans in SparkR > - > > Key: SPARK-15030 > URL: https://issues.apache.org/jira/browse/SPARK-15030 > Project: Spark > Issue Type: New Feature > Components: ML, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yanbo Liang > > In SparkR, spark.kmeans take a DataFrame with double columns. This is > different from other ML methods we implemented, which support R model > formula. We should add support for that as well. > {code:none} > spark.kmeans(data = df, formula = ~ lat + lon, ...) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
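Conceptually, `formula = ~ lat + lon` just selects the `lat` and `lon` columns as the feature vector before clustering. A dependency-free Lloyd's-iteration sketch over such two-column rows, for illustration only and not SparkR's or MLlib's implementation:

```python
def kmeans(points, centers, iters=10):
    # Plain Lloyd's algorithm: assign each point to its nearest center,
    # then recompute each center as the mean of its assigned points.
    for _ in range(iters):
        groups = {i: [] for i in range(len(centers))}
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            groups[nearest].append(p)
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in groups.items()
        ]
    return centers

rows = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]  # (lat, lon) rows
centers = kmeans(rows, [(0.0, 0.0), (5.0, 5.0)])
print(centers)
```

The formula support being added here would do the column selection (and any factor encoding) automatically, so users pass a DataFrame plus `~ lat + lon` instead of pre-assembling feature columns.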
[jira] [Updated] (SPARK-14858) Push predicates with subquery
[ https://issues.apache.org/jira/browse/SPARK-14858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14858: -- Assignee: Herman van Hovell > Push predicates with subquery > -- > > Key: SPARK-14858 > URL: https://issues.apache.org/jira/browse/SPARK-14858 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Davies Liu >Assignee: Herman van Hovell > Fix For: 2.0.0 > > > Currently we rewrite the subquery as Join in the beginning of Optimizer, we > should defer that to enable predicates push down (because Join can't be > easily pushed down). > cc [~hvanhovell] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14981) CatalogTable should contain sorting directions of sorting columns
[ https://issues.apache.org/jira/browse/SPARK-14981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14981: -- Assignee: Cheng Lian > CatalogTable should contain sorting directions of sorting columns > - > > Key: SPARK-14981 > URL: https://issues.apache.org/jira/browse/SPARK-14981 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Fix For: 2.0.0 > > > For a bucketed table with sorting columns, {{CatalogTable}} only records > sorting column names, while sorting directions (ASC/DESC) are missing. > Our SQL parser supports the syntax, but sorting directions are silently > dropped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5
[ https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-13289: -- Assignee: Nick Pentreath > Word2Vec generate infinite distances when numIterations>5 > - > > Key: SPARK-13289 > URL: https://issues.apache.org/jira/browse/SPARK-13289 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.0 > Environment: Linux, Scala >Reporter: Qi Dai >Assignee: Nick Pentreath > Labels: features > Fix For: 2.0.0 > > > I recently ran some word2vec experiments on a cluster with 50 executors on > some large text dataset but find out that when number of iterations is larger > than 5 the distance between words will be all infinite. My code looks like > this: > val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" > ").toSeq) > import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel} > val word2vec = new > Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5) > val model = word2vec.fit(text) > val synonyms = model.findSynonyms("who", 40) > for((synonym, cosineSimilarity) <- synonyms) { > println(s"$synonym $cosineSimilarity") > } > The results are: > to Infinity > and Infinity > that Infinity > with Infinity > said Infinity > it Infinity > by Infinity > be Infinity > have Infinity > he Infinity > has Infinity > his Infinity > an Infinity > ) Infinity > not Infinity > who Infinity > I Infinity > had Infinity > their Infinity > were Infinity > they Infinity > but Infinity > been Infinity > I tried many different datasets and different words for finding synonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
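The all-`Infinity` similarities reported above are characteristic of floating-point overflow during training: once any intermediate value exceeds the IEEE double range it becomes `inf`, and `inf` then contaminates every score derived from it. A minimal illustration of the failure mode (not the Spark code itself):

```python
big = 1e200
overflowed = big * big        # exceeds the ~1.8e308 double limit -> inf
print(overflowed)             # inf

# inf propagates through later arithmetic, so every derived value is ruined:
print(overflowed + 1.0)       # still inf
print(overflowed - overflowed)  # nan, not 0.0
```

With more iterations there are more chances for unscaled accumulations to cross that limit, which matches the observation that the problem appears only when numIterations exceeds 5.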
[jira] [Resolved] (SPARK-14737) Kafka Brokers are down - spark stream should retry
[ https://issues.apache.org/jira/browse/SPARK-14737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14737. --- Resolution: Not A Problem Given the problem statement here, I think this is not a Spark problem. > Kafka Brokers are down - spark stream should retry > -- > > Key: SPARK-14737 > URL: https://issues.apache.org/jira/browse/SPARK-14737 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 1.3.0 > Environment: Suse Linux, Cloudera Enterprise 5.4.8 (#7 built by > jenkins on 20151023-1205 git: d7dbdf29ac1d57ae9fb19958502d50dcf4e4fffd), > kafka_2.10-0.8.2.2 >Reporter: Faisal > > I have a spark streaming application that uses direct streaming - listening to > KAFKA topic. > {code} > HashMap<String, String> kafkaParams = new HashMap<String, String>(); > kafkaParams.put("metadata.broker.list", "broker1,broker2,broker3"); > kafkaParams.put("auto.offset.reset", "largest"); > HashSet<String> topicsSet = new HashSet<String>(); > topicsSet.add("Topic1"); > JavaPairInputDStream<String, String> messages = > KafkaUtils.createDirectStream( > jssc, > String.class, > String.class, > StringDecoder.class, > StringDecoder.class, > kafkaParams, > topicsSet > ); > {code} > I notice when I stop/shutdown kafka brokers, my spark application also > shuts down. > Here is the spark execution script > {code} > spark-submit \ > --master yarn-cluster \ > --files /home/siddiquf/spark/log4j-spark.xml > --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \ > --conf > "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j-spark.xml" \ > --class com.example.MyDataStreamProcessor \ > myapp.jar > {code} > Spark job submitted successfully and I can track the application driver and > worker/executor nodes. > Everything works fine, but my only concern is: if kafka brokers are offline or > restarted, my application controlled by yarn should not shut down, but it does. > If this is expected behavior, then how do we handle such a situation with the least > maintenance? 
Keeping in mind that the Kafka cluster is not in the hadoop cluster and is > managed by a different team, which is why our application needs to be > resilient enough. > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
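Since this was resolved as not a Spark problem, the resilience the reporter asks for is typically a retry-with-backoff loop wrapped around the consumer setup in application code. A generic, self-contained sketch (names are illustrative; this is not the Spark streaming API):

```python
import time

def with_retries(connect, attempts=5, base_delay=0.01):
    # Retry a flaky connection attempt with exponential backoff,
    # re-raising only after the final attempt fails.
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

calls = {"n": 0}

def flaky_connect():
    # Simulated broker: down for the first two attempts, then available.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("brokers down")
    return "connected"

result = with_retries(flaky_connect)
print(result)  # connected
```

In a YARN deployment the same idea can also live outside the process, e.g. relying on YARN's application-attempt restarts instead of an in-process loop.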
[jira] [Resolved] (SPARK-14750) Make historyServer refer application log in hdfs
[ https://issues.apache.org/jira/browse/SPARK-14750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14750. --- Resolution: Won't Fix > Make historyServer refer application log in hdfs > > > Key: SPARK-14750 > URL: https://issues.apache.org/jira/browse/SPARK-14750 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.6.1 >Reporter: SuYan > > Make history server refer application log, just like MR history server -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-13289) Word2Vec generate infinite distances when numIterations>5
[ https://issues.apache.org/jira/browse/SPARK-13289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-13289. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 11812 [https://github.com/apache/spark/pull/11812] > Word2Vec generate infinite distances when numIterations>5 > - > > Key: SPARK-13289 > URL: https://issues.apache.org/jira/browse/SPARK-13289 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.6.0 > Environment: Linux, Scala >Reporter: Qi Dai > Labels: features > Fix For: 2.0.0 > > > I recently ran some word2vec experiments on a cluster with 50 executors on > some large text dataset but find out that when number of iterations is larger > than 5 the distance between words will be all infinite. My code looks like > this: > val text = sc.textFile("/project/NLP/1_biliion_words/train").map(_.split(" > ").toSeq) > import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel} > val word2vec = new > Word2Vec().setMinCount(25).setVectorSize(96).setNumPartitions(99).setNumIterations(10).setWindowSize(5) > val model = word2vec.fit(text) > val synonyms = model.findSynonyms("who", 40) > for((synonym, cosineSimilarity) <- synonyms) { > println(s"$synonym $cosineSimilarity") > } > The results are: > to Infinity > and Infinity > that Infinity > with Infinity > said Infinity > it Infinity > by Infinity > be Infinity > have Infinity > he Infinity > has Infinity > his Infinity > an Infinity > ) Infinity > not Infinity > who Infinity > I Infinity > had Infinity > their Infinity > were Infinity > they Infinity > but Infinity > been Infinity > I tried many different datasets and different words for finding synonyms. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14985) Update LinearRegression, LogisticRegression summary internals to handle model copy
[ https://issues.apache.org/jira/browse/SPARK-14985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265261#comment-15265261 ] Benjamin Fradet commented on SPARK-14985: - I'll take this one if you guys don't mind. > Update LinearRegression, LogisticRegression summary internals to handle model > copy > -- > > Key: SPARK-14985 > URL: https://issues.apache.org/jira/browse/SPARK-14985 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > See parent JIRA + the PR for [SPARK-14852] for details. The summaries should > handle creating an internal copy of the model. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14989) Upgrade to Jackson 2.7.3
[ https://issues.apache.org/jira/browse/SPARK-14989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14989: -- Priority: Blocker (was: Major) This is one of a handful that I think actually have to be resolved before 2.0.0 one way or the other given it's a dependency change. > Upgrade to Jackson 2.7.3 > > > Key: SPARK-14989 > URL: https://issues.apache.org/jira/browse/SPARK-14989 > Project: Spark > Issue Type: Improvement > Components: Build >Reporter: Josh Rosen >Assignee: Josh Rosen >Priority: Blocker > > For Spark 2.0, we should upgrade to a newer version of Jackson (2.7.3). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12154) Upgrade to Jersey 2
[ https://issues.apache.org/jira/browse/SPARK-12154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-12154: -- Priority: Blocker (was: Major) This is one of a handful that I think actually have to be resolved before 2.0.0 one way or the other given it's a dependency change. > Upgrade to Jersey 2 > --- > > Key: SPARK-12154 > URL: https://issues.apache.org/jira/browse/SPARK-12154 > Project: Spark > Issue Type: Sub-task > Components: Build, Spark Core >Affects Versions: 1.5.2 >Reporter: Matt Cheah >Priority: Blocker > > Fairly self-explanatory, Jersey 1 is a bit old and could use an upgrade. > Library conflicts for Jersey are difficult to workaround - see discussion on > SPARK-11081. It's easier to upgrade Jersey entirely, but we should target > Spark 2.0 since this may be a break for users who were using Jersey 1 in > their Spark jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15014) Spark Shell does not work with Ammonite Shell
[ https://issues.apache.org/jira/browse/SPARK-15014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265257#comment-15265257 ] Sean Owen commented on SPARK-15014: --- Why is this a Spark problem per se? Spark has its own shell (derived of course from the Scala shell), but it isn't pluggable. > Spark Shell does not work with Ammonite Shell > - > > Key: SPARK-15014 > URL: https://issues.apache.org/jira/browse/SPARK-15014 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 1.6.1 > Environment: All >Reporter: John-Michael Reed >Priority: Minor > Labels: shell, shell-script > > Lihaoyi has an enhanced Scala Shell called Ammonite. > https://github.com/lihaoyi/Ammonite > Users of Ammonite shell have tried to use it with Apache Spark. > https://github.com/lihaoyi/Ammonite/issues/382 > Spark Shell does not work with Ammonite Shell, but I want it to because the > Ammonite REPL offers enhanced auto-complete, pretty printing, and other > features. See http://www.lihaoyi.com/Ammonite/#Ammonite-REPL -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites
[ https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265256#comment-15265256 ] Dongjoon Hyun commented on SPARK-15037: --- Hi, [~rxin]. So far, there seem to be two issues. - `object SQLContext` still has its own unique functions. We cannot replace `SQLContext` completely because `SharedSQLContext` uses it. Also, `MLlibTestSparkContext` does. - Also, a constructor `SparkSession(JavaSparkSession)` or a `JavaSparkSession` class is needed for the Java test suite. We had better handle them as separate issues before this kind of refactoring issue. What do you think? > Use SparkSession instead of SQLContext in testsuites > - > > Key: SPARK-15037 > URL: https://issues.apache.org/jira/browse/SPARK-15037 > Project: Spark > Issue Type: Test >Reporter: Dongjoon Hyun > > This issue aims to update the existing testsuites to use `SparkSession` > instead of `SQLContext` since `SQLContext` exists just for backward > compatibility.
[jira] [Comment Edited] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites
[ https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265256#comment-15265256 ] Dongjoon Hyun edited comment on SPARK-15037 at 4/30/16 9:08 AM: Hi, [~rxin]. So far, there seem to be two issues. - `object SQLContext` still has its own unique functions. We cannot replace `SQLContext` completely because `SharedSQLContext` uses it. Also, `MLlibTestSparkContext` does. - A constructor `SparkSession(JavaSparkSession)` or a `JavaSparkSession` class is needed for the Java test suite. We had better handle them as separate issues before this kind of refactoring issue. What do you think? was (Author: dongjoon): Hi, [~rxin]. So far, there seem to be two issues. - `object SQLContext` still has its own unique functions. We cannot replace `SQLContext` completely because `SharedSQLContext` uses it. Also, `MLlibTestSparkContext` does. - Also, a constructor `SparkSession(JavaSparkSession)` or a `JavaSparkSession` class is needed for the Java test suite. We had better handle them as separate issues before this kind of refactoring issue. What do you think? > Use SparkSession instead of SQLContext in testsuites > - > > Key: SPARK-15037 > URL: https://issues.apache.org/jira/browse/SPARK-15037 > Project: Spark > Issue Type: Test >Reporter: Dongjoon Hyun > > This issue aims to update the existing testsuites to use `SparkSession` > instead of `SQLContext` since `SQLContext` exists just for backward > compatibility.
[jira] [Commented] (SPARK-15015) Log statements lack file name/number
[ https://issues.apache.org/jira/browse/SPARK-15015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265255#comment-15265255 ] Sean Owen commented on SPARK-15015: --- Hm, is it actually possible to know the line number at runtime? It's present in the bytecode, but I'm not sure how a logging API would reach it. Here, it's your IDE providing this info. > Log statements lack file name/number > > > Key: SPARK-15015 > URL: https://issues.apache.org/jira/browse/SPARK-15015 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 1.6.1 > Environment: All >Reporter: John-Michael Reed >Priority: Trivial > Labels: debug, log > > I would like it if the Apache Spark project had file names and line numbers > in its log statements like this: > http://i.imgur.com/4hvGQ0t.png > The example uses my library, http://johnreedlol.github.io/scala-trace-debug/, > but https://github.com/lihaoyi/sourcecode is also useful for this purpose. > The real benefit is that the user of an IDE can jump to the > location of a log statement without having to set breakpoints. > http://s29.postimg.org/ud0knou1j/debug_Screenshot_Crop.png > Note that the arrow will go to the next log statement if each log statement > is hyperlinked.
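On the question above: the line number is recoverable at runtime, but only by walking the call stack, which is why loggers treat it as an expensive opt-in (on the JVM, log4j's `%L` PatternLayout conversion does exactly this by capturing a stack trace). A small Python sketch of the same stack-walk idea; the helper name `where_am_i` is made up for illustration:

```python
import inspect
import logging

# Python's logging module recovers the caller's file and line the same way:
# it walks stack frames, so "%(filename)s:%(lineno)d" works with no IDE involved.
logging.basicConfig(format="%(filename)s:%(lineno)d %(message)s")

def where_am_i():
    """Return the caller's (file name, line number) via stack inspection."""
    caller = inspect.stack()[1]  # frame 0 is this function, frame 1 is the caller
    return caller.filename, caller.lineno

fname, lineno = where_am_i()
logging.warning("called from %s:%d", fname, lineno)
```

The cost Sean alludes to is real: materializing stack frames on every log call is far slower than formatting a plain message, which is why most logging backends document location info as a performance trade-off.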
[jira] [Created] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites
Dongjoon Hyun created SPARK-15037: - Summary: Use SparkSession instead of SQLContext in testsuites Key: SPARK-15037 URL: https://issues.apache.org/jira/browse/SPARK-15037 Project: Spark Issue Type: Bug Reporter: Dongjoon Hyun This issue aims to update the existing testsuites to use `SparkSession` instead of `SQLContext` since `SQLContext` exists just for backward compatibility.
[jira] [Updated] (SPARK-15037) Use SparkSession instead of SQLContext in testsuites
[ https://issues.apache.org/jira/browse/SPARK-15037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-15037: -- Issue Type: Test (was: Bug) > Use SparkSession instead of SQLContext in testsuites > - > > Key: SPARK-15037 > URL: https://issues.apache.org/jira/browse/SPARK-15037 > Project: Spark > Issue Type: Test >Reporter: Dongjoon Hyun > > This issue aims to update the existing testsuites to use `SparkSession` > instead of `SQLContext` since `SQLContext` exists just for backward > compatibility.
[jira] [Resolved] (SPARK-15028) Remove Hive config override
[ https://issues.apache.org/jira/browse/SPARK-15028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai resolved SPARK-15028. -- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 12806 [https://github.com/apache/spark/pull/12806] > Remove Hive config override > --- > > Key: SPARK-15028 > URL: https://issues.apache.org/jira/browse/SPARK-15028 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-14113) Consider marking JobConf closure-cleaning in HadoopRDD as optional
[ https://issues.apache.org/jira/browse/SPARK-14113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-14113. --- Resolution: Won't Fix See PR discussion > Consider marking JobConf closure-cleaning in HadoopRDD as optional > -- > > Key: SPARK-14113 > URL: https://issues.apache.org/jira/browse/SPARK-14113 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Rajesh Balamohan >Priority: Minor > > In HadoopRDD, the following code was introduced as a part of SPARK-6943. > {noformat} > if (initLocalJobConfFuncOpt.isDefined) { > sparkContext.clean(initLocalJobConfFuncOpt.get) > } > {noformat} > When working on one of the changes in OrcRelation, I tried passing > initLocalJobConfFuncOpt to HadoopRDD and that incurred a significant performance > penalty (due to closure cleaning) with large RDDs. This would be invoked for > every HadoopRDD initialization, causing the bottleneck. > An example thread stack is given below. > {noformat} > at org.apache.xbean.asm5.ClassReader.a(Unknown Source) > at org.apache.xbean.asm5.ClassReader.readUTF8(Unknown Source) > at org.apache.xbean.asm5.ClassReader.a(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at > org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:402) > at > org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:390) > at > scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772) > at > scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102) > at > scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:102) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:102) > at > scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771) > at > org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:390) > at org.apache.xbean.asm5.ClassReader.a(Unknown Source) > at org.apache.xbean.asm5.ClassReader.b(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at org.apache.xbean.asm5.ClassReader.accept(Unknown Source) > at > org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:224) > at > org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$15.apply(ClosureCleaner.scala:223) > at scala.collection.immutable.List.foreach(List.scala:318) > at > org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:223) > at > org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2079) > at > org.apache.spark.rdd.HadoopRDD.(HadoopRDD.scala:112){noformat} > Creating this JIRA to explore the possibility of removing it or marking it > optional.
[jira] [Created] (SPARK-15036) When creating a database, we need to qualify its path
Yin Huai created SPARK-15036: Summary: When creating a database, we need to qualify its path Key: SPARK-15036 URL: https://issues.apache.org/jira/browse/SPARK-15036 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
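"Qualifying" a path in the sense of the issue above means resolving it to an absolute URI with an explicit filesystem scheme, in the spirit of Hadoop's `Path.makeQualified`; otherwise a relative or scheme-less database location is interpreted differently depending on the default filesystem. A rough plain-Python analogue (the function name and the default warehouse directory are illustrative assumptions, not Spark's actual values):

```python
import os
from urllib.parse import urlparse

def qualify_path(path, warehouse_dir="/user/hive/warehouse"):
    """Resolve a possibly relative, scheme-less database location to an
    absolute URI -- a rough analogue of Hadoop's Path.makeQualified."""
    if urlparse(path).scheme:       # already qualified, e.g. hdfs://nn/db.db
        return path
    if not os.path.isabs(path):     # relative paths resolve under the warehouse
        path = os.path.join(warehouse_dir, path)
    return "file://" + path

print(qualify_path("mydb.db"))          # → file:///user/hive/warehouse/mydb.db
print(qualify_path("hdfs://nn/db.db"))  # → hdfs://nn/db.db
```

Without this normalization, two sessions with different working directories or default filesystems could record two different physical locations for the same database name.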
[jira] [Created] (SPARK-15035) SessionCatalog needs to set the location for default DB
Yin Huai created SPARK-15035: Summary: SessionCatalog needs to set the location for default DB Key: SPARK-15035 URL: https://issues.apache.org/jira/browse/SPARK-15035 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Right now, in SessionCatalog, the default location of the database is an empty string. It will break create table command when we use SparkSession without hive support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15034) Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir
Yin Huai created SPARK-15034: Summary: Use the value of spark.sql.warehouse.dir as the warehouse location instead of using hive.metastore.warehouse.dir Key: SPARK-15034 URL: https://issues.apache.org/jira/browse/SPARK-15034 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Yin Huai Assignee: Yin Huai Starting from Spark 2.0, spark.sql.warehouse.dir will be the conf to set warehouse location. We will not use hive.metastore.warehouse.dir. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15033) fix a flaky test in CachedTableSuite
[ https://issues.apache.org/jira/browse/SPARK-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265240#comment-15265240 ] Apache Spark commented on SPARK-15033: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/12811 > fix a flaky test in CachedTableSuite > > > Key: SPARK-15033 > URL: https://issues.apache.org/jira/browse/SPARK-15033 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15033) fix a flaky test in CachedTableSuite
[ https://issues.apache.org/jira/browse/SPARK-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15033: Assignee: Apache Spark (was: Wenchen Fan) > fix a flaky test in CachedTableSuite > > > Key: SPARK-15033 > URL: https://issues.apache.org/jira/browse/SPARK-15033 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15033) fix a flaky test in CachedTableSuite
[ https://issues.apache.org/jira/browse/SPARK-15033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15033: Assignee: Wenchen Fan (was: Apache Spark) > fix a flaky test in CachedTableSuite > > > Key: SPARK-15033 > URL: https://issues.apache.org/jira/browse/SPARK-15033 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Wenchen Fan >Assignee: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15033) fix a flaky test in CachedTableSuite
Wenchen Fan created SPARK-15033: --- Summary: fix a flaky test in CachedTableSuite Key: SPARK-15033 URL: https://issues.apache.org/jira/browse/SPARK-15033 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15027: -- Target Version/s: (was: 2.0.0) > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265238#comment-15265238 ] Xiangrui Meng commented on SPARK-15027: --- It might be tricky to use Dataset due to encoders and generic ID types. But if we use DataFrame as input and output, it seems feasible. It would be great if you can take a look. > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15027: -- Assignee: (was: Xiangrui Meng) > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265232#comment-15265232 ] Nick Pentreath commented on SPARK-15027: Ok - it would make sense to have it in 2.0 if possible even though it is DeveloperApi. I can do it. > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265229#comment-15265229 ] Xiangrui Meng edited comment on SPARK-15027 at 4/30/16 7:50 AM: Just API change. I guess there are still gaps to use DataFrame for the implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer API. was (Author: mengxr): No, just API change. I guess there are still gaps to use DataFrame for the implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer API. > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265229#comment-15265229 ] Xiangrui Meng commented on SPARK-15027: --- No, just API change. I guess there are still gaps to use DataFrame for the implementation. Maybe this is not urgent for 2.0 since ALS.train is a developer API. > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265228#comment-15265228 ] Nick Pentreath commented on SPARK-15027: [~mengxr] are you intending this to be a more "superficial" change (as in, change the signature of train to take a Dataset, but still operate on RDDs inside the method), or try to have the entire algorithm operate on Dataset? > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This continue the work from SPARK-14412 to update > `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and > `finalRDDStorageLevel` to `finalStoargeLevel`. We should also update > `ALS.train` to use `Dataset` instead of `RDD`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15027: -- Description: We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` to be consistent with other APIs under spark.ml and it also leaves space for Tungsten-based optimization. (was: This continue the work from SPARK-14412 to update `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and `finalRDDStorageLevel` to `finalStoargeLevel`. We should also update `ALS.train` to use `Dataset` instead of `RDD`.) > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > We should also update `ALS.train` to use `Dataset/DataFrame` instead of `RDD` > to be consistent with other APIs under spark.ml and it also leaves space for > Tungsten-based optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15027) ALS.train should use DataFrame instead of RDD
[ https://issues.apache.org/jira/browse/SPARK-15027?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15027: -- Summary: ALS.train should use DataFrame instead of RDD (was: ml.ALS params and ALS.train should not depend on RDD) > ALS.train should use DataFrame instead of RDD > - > > Key: SPARK-15027 > URL: https://issues.apache.org/jira/browse/SPARK-15027 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This continue the work from SPARK-14412 to update > `intermediateRDDStorageLevel` to `intermediateStorageLevel`, and > `finalRDDStorageLevel` to `finalStoargeLevel`. We should also update > `ALS.train` to use `Dataset` instead of `RDD`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-15032) When we create a new JDBC session, we may need to create a new session of executionHive
Yin Huai created SPARK-15032: Summary: When we create a new JDBC session, we may need to create a new session of executionHive Key: SPARK-15032 URL: https://issues.apache.org/jira/browse/SPARK-15032 Project: Spark Issue Type: Bug Components: SQL Reporter: Yin Huai Right now, we only use executionHive in thriftserver. When we create a new jdbc session, we probably need to create a new session of executionHive. I am not sure what will break if we leave the code as is. But, I feel it will be safer to create a new session of executionHive. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15029) Bad error message for two generators in the project clause
[ https://issues.apache.org/jira/browse/SPARK-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-15029: Assignee: Apache Spark > Bad error message for two generators in the project clause > -- > > Key: SPARK-15029 > URL: https://issues.apache.org/jira/browse/SPARK-15029 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin >Assignee: Apache Spark > > {code} > scala> spark.range(1000).map(i => (Array[Long](i), > Array[Long](i))).selectExpr("explode(_1)", "explode(_2)").explain(true) > org.apache.spark.sql.AnalysisException: Only one generator allowed per select > but Generate and and Explode found.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1275) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1272) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > {code} > It's confusing to call one "Generator" and the other "Explode". There is also > two "and"s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15029) Bad error message for two generators in the project clause
[ https://issues.apache.org/jira/browse/SPARK-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265226#comment-15265226 ] Apache Spark commented on SPARK-15029: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/12810 > Bad error message for two generators in the project clause > -- > > Key: SPARK-15029 > URL: https://issues.apache.org/jira/browse/SPARK-15029 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Reynold Xin > > {code} > scala> spark.range(1000).map(i => (Array[Long](i), > Array[Long](i))).selectExpr("explode(_1)", "explode(_2)").explain(true) > org.apache.spark.sql.AnalysisException: Only one generator allowed per select > but Generate and and Explode found.; > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:54) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1275) > at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$21$$anonfun$53.apply(Analyzer.scala:1272) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) > at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) > at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104) > {code} > It's confusing to call one "Generator" and the other "Explode". There is also > two "and"s. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15029) Bad error message for two generators in the project clause
[ https://issues.apache.org/jira/browse/SPARK-15029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15029:

Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-14975) Predicted Probability per training instance for Gradient Boosted Trees in mllib.
[ https://issues.apache.org/jira/browse/SPARK-14975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265219#comment-15265219 ]

Partha Talukder commented on SPARK-14975:
--

Thanks, Joseph. I will keep that in mind.

> Predicted Probability per training instance for Gradient Boosted Trees in mllib.
> -
>
> Key: SPARK-14975
> URL: https://issues.apache.org/jira/browse/SPARK-14975
> Project: Spark
> Issue Type: New Feature
> Components: ML
> Reporter: Partha Talukder
> Priority: Minor
> Labels: mllib
>
> This function is available for Logistic Regression, SVM, etc. (model.setThreshold()) but not for GBT. Compare the "gbm" package in R, where we can specify the distribution and get predicted probabilities or classes. I understand that this algorithm works with the "Classification" and "Regression" algos. Is there any way to get predicted probabilities from GBT, or to provide thresholds to the model?
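A common workaround for this request: a log-loss GBT's raw ensemble output is a margin (a weighted sum of tree outputs), and a probability can be recovered by a logistic transform of that margin. The sketch below is plain Python with illustrative names; the 2*margin form matches Friedman's binomial deviance as used by R's gbm, and it is not necessarily the exact transform Spark would use internally:

```python
import math

def gbt_probability(tree_predictions, tree_weights):
    # The ensemble margin is the weighted sum of per-tree raw outputs.
    margin = sum(w * p for w, p in zip(tree_weights, tree_predictions))
    # Logistic transform of the margin; the factor of 2 follows
    # Friedman's binomial-deviance parameterization (as in R's gbm).
    return 1.0 / (1.0 + math.exp(-2.0 * margin))

def predict_with_threshold(prob, threshold=0.5):
    # Apply a decision threshold to the probability, analogous to
    # what setThreshold() does for Logistic Regression / SVM models.
    return 1 if prob >= threshold else 0
```

With this, `predict_with_threshold(gbt_probability(preds, weights), 0.3)` gives thresholded class predictions per instance, which is the behavior the issue asks for.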
[jira] [Resolved] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin resolved SPARK-13485.
-

Resolution: Fixed
Fix Version/s: 2.0.0

> (Dataset-oriented) API evolution in Spark 2.0
> -
>
> Key: SPARK-13485
> URL: https://issues.apache.org/jira/browse/SPARK-13485
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
> Fix For: 2.0.0
>
> Attachments: API Evolution in Spark 2.0.pdf
>
> As part of Spark 2.0, we want to create a stable API foundation for Dataset to become the main user-facing API in Spark. This ticket tracks various tasks related to that.
> The main high-level changes are:
> 1. Merge Dataset/DataFrame
> 2. Create a more natural entry point for Dataset (SQLContext/HiveContext are not ideal because of the names "SQL"/"Hive", and "SparkContext" is not ideal because of its heavy dependency on RDDs)
> 3. First-class support for sessions
> 4. First-class support for a system catalog
> See the design doc for more details.
[jira] [Updated] (SPARK-13485) (Dataset-oriented) API evolution in Spark 2.0
[ https://issues.apache.org/jira/browse/SPARK-13485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-13485:
-

Priority: Blocker (was: Major)
[jira] [Assigned] (SPARK-15031) Fix SQL python example
[ https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-15031:

Assignee: (was: Apache Spark)

> Fix SQL python example
> --
>
> Key: SPARK-15031
> URL: https://issues.apache.org/jira/browse/SPARK-15031
> Project: Spark
> Issue Type: Bug
> Components: Examples
> Reporter: Dongjoon Hyun
> Priority: Trivial
>
> Currently, the Python SQL example `sql.py` fails:
> {code}
> bin/spark-submit examples/src/main/python/sql.py
> Traceback (most recent call last):
>   File "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", line 60, in
>     people = sqlContext.jsonFile(path)
> AttributeError: 'SQLContext' object has no attribute 'jsonFile'
> {code}
> {code}
> Traceback (most recent call last):
>   File "/Users/dongjoon/spark-release/spark-2.0/examples/src/main/python/sql.py", line 72, in
>     people.registerAsTable("people")
>   File "/Users/dongjoon/spark/python/lib/pyspark.zip/pyspark/sql/dataframe.py", line 795, in __getattr__
> AttributeError: 'DataFrame' object has no attribute 'registerAsTable'
> {code}
> This issue fixes both failures with the following changes:
> {code}
> -people = sqlContext.jsonFile(path)
> +people = sqlContext.read.json(path)
> ...
> -people.registerAsTable("people")
> +people.registerTempTable("people")
> {code}
[jira] [Commented] (SPARK-15031) Fix SQL python example
[ https://issues.apache.org/jira/browse/SPARK-15031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15265218#comment-15265218 ]

Apache Spark commented on SPARK-15031:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/12809
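For scripts that must run on both pre-2.0 and 2.0 Spark during the transition, the renames above (jsonFile → read.json, registerAsTable → registerTempTable) can be bridged with getattr-based fallbacks. This is a hypothetical compatibility shim of my own, not part of Spark; it is tested here against stub objects rather than a live SparkContext:

```python
def read_json(sql_context, path):
    # Prefer the DataFrameReader API (sqlContext.read.json), which is
    # what Spark 2.0 requires; fall back to the removed jsonFile method
    # for older versions.
    reader = getattr(sql_context, "read", None)
    if reader is not None:
        return reader.json(path)
    return sql_context.jsonFile(path)  # pre-2.0 API, removed in 2.0

def register_temp_table(df, name):
    # registerAsTable was renamed to registerTempTable; try the newer
    # name first so the script works on both sides of the rename.
    for attr in ("registerTempTable", "registerAsTable"):
        method = getattr(df, attr, None)
        if method is not None:
            return method(name)
    raise AttributeError("no temp-table registration method found")
```

Usage mirrors the fixed sql.py: `people = read_json(sqlContext, path)` followed by `register_temp_table(people, "people")`.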