[jira] [Assigned] (SPARK-17166) CTAS lost table properties after conversion to data source tables.
[ https://issues.apache.org/jira/browse/SPARK-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17166:

Assignee: (was: Apache Spark)

> CTAS lost table properties after conversion to data source tables.
> ------------------------------------------------------------------
>
> Key: SPARK-17166
> URL: https://issues.apache.org/jira/browse/SPARK-17166
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Xiao Li
>
> CTAS loses table properties after conversion to data source tables. For example:
> {noformat}
> CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 as a, 1 as b
> {noformat}
> The output of `DESC FORMATTED t` does not include the declared properties:
> {noformat}
> Table Parameters:
>   rawDataSize                 -1
>   numFiles                    1
>   transient_lastDdlTime       1471670983
>   totalSize                   496
>   spark.sql.sources.provider  parquet
>   EXTERNAL                    FALSE
>   COLUMN_STATS_ACCURATE       false
>   numRows                     -1
>
> # Storage Information
>   SerDe Library:            org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
>   InputFormat:              org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
>   OutputFormat:             org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
>   Compressed:               No
>   Storage Desc Parameters:
>     serialization.format    1
>     path                    file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-f3aa2927-6464-4a35-a715-1300dde6c614/t
> {noformat}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17166) CTAS lost table properties after conversion to data source tables.
[ https://issues.apache.org/jira/browse/SPARK-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429244#comment-15429244 ] Apache Spark commented on SPARK-17166:

User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/14727
[jira] [Assigned] (SPARK-17166) CTAS lost table properties after conversion to data source tables.
[ https://issues.apache.org/jira/browse/SPARK-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17166:

Assignee: Apache Spark
[jira] [Created] (SPARK-17166) CTAS lost table properties after conversion to data source tables.
Xiao Li created SPARK-17166:

Summary: CTAS lost table properties after conversion to data source tables.
Key: SPARK-17166
URL: https://issues.apache.org/jira/browse/SPARK-17166
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li
[jira] [Commented] (SPARK-16757) Set up caller context to HDFS
[ https://issues.apache.org/jira/browse/SPARK-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429237#comment-15429237 ] Weiqing Yang commented on SPARK-16757:

Thanks, [~srowen]. When Spark applications run on HDFS, reading data from or writing data into HDFS, a corresponding operation record with the Spark caller context is written into hdfs-audit.log. The Spark caller context consists of JobID_stageID_stageAttemptId_taskID_attemptNumber plus the application's name. This helps users diagnose and understand how specific applications impact parts of the Hadoop system, and what problems they may be creating (e.g. overloading the NameNode). As noted in HDFS-9184, for a given HDFS operation it is very helpful to track which upper-level job issued it.

> Set up caller context to HDFS
> -----------------------------
>
> Key: SPARK-16757
> URL: https://issues.apache.org/jira/browse/SPARK-16757
> Project: Spark
> Issue Type: Sub-task
> Reporter: Weiqing Yang
>
> In this jira, Spark will invoke the hadoop caller context api to set up its caller context to HDFS.
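The caller-context layout described above (job, stage, stage attempt, task, task attempt, plus the application's name) can be sketched as a simple formatter. This is an illustrative sketch only; the function name, the `SPARK_` prefix, and the exact field separators are assumptions, not Spark's actual implementation:

```python
def build_caller_context(app_name, job_id, stage_id, stage_attempt_id,
                         task_id, attempt_number):
    """Compose an HDFS caller-context string from Spark task identifiers.

    Mirrors the JobID_stageID_stageAttemptId_taskID_attemptNumber layout
    described in the comment; all names here are hypothetical.
    """
    ids = "_".join(str(x) for x in
                   (job_id, stage_id, stage_attempt_id, task_id, attempt_number))
    # HDFS truncates overly long caller contexts, so keep the string compact.
    return f"SPARK_{app_name}_{ids}"

print(build_caller_context("myApp", 3, 1, 0, 42, 0))  # SPARK_myApp_3_1_0_42_0
```

A string like this, attached per HDFS operation, is what would show up in hdfs-audit.log alongside the usual user/command fields.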
[jira] [Created] (SPARK-17165) FileStreamSource should not track the list of seen files indefinitely
Reynold Xin created SPARK-17165:

Summary: FileStreamSource should not track the list of seen files indefinitely
Key: SPARK-17165
URL: https://issues.apache.org/jira/browse/SPARK-17165
Project: Spark
Issue Type: Bug
Components: SQL, Streaming
Reporter: Reynold Xin

FileStreamSource currently tracks all the files seen indefinitely, which means it can run out of memory or overflow.
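One way to bound the memory used for the seen-files list is age-based eviction: only files newer than some threshold need to be remembered, because anything older can be rejected by timestamp alone. The sketch below illustrates that policy; the class name, method names, and the eviction rule are assumptions for illustration, not the fix Spark eventually adopted:

```python
class SeenFilesMap:
    """Track seen file paths, evicting entries older than max_age_ms.

    Illustrative sketch of an age-bounded seen-files map; not Spark code.
    """

    def __init__(self, max_age_ms):
        self.max_age_ms = max_age_ms
        self._map = {}      # path -> last-modified timestamp (ms)
        self._latest = 0    # newest timestamp ever observed

    def add(self, path, timestamp_ms):
        self._map[path] = timestamp_ms
        self._latest = max(self._latest, timestamp_ms)

    def is_new_file(self, path, timestamp_ms):
        # A file counts as new only if it is recent enough to still be
        # tracked and has not been seen before.
        return timestamp_ms > self._latest - self.max_age_ms and path not in self._map

    def purge(self):
        # Drop entries old enough that the timestamp check alone rejects them.
        cutoff = self._latest - self.max_age_ms
        self._map = {p: t for p, t in self._map.items() if t > cutoff}

m = SeenFilesMap(max_age_ms=1000)
m.add("a.txt", 100)
m.add("b.txt", 1500)
m.purge()  # "a.txt" (ts 100) falls below the cutoff 1500 - 1000 = 500
print(len(m._map))  # 1
```

The key invariant is that `purge` never changes observable behavior: a purged file's timestamp is already below the cutoff, so `is_new_file` rejects it either way.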
[jira] [Updated] (SPARK-17150) Support SQL generation for inline tables
[ https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-17150:

Assignee: Peter Lee

> Support SQL generation for inline tables
> ----------------------------------------
>
> Key: SPARK-17150
> URL: https://issues.apache.org/jira/browse/SPARK-17150
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Peter Lee
> Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
> Inline tables currently do not support SQL generation; as a result, a view that depends on inline tables would fail.
[jira] [Resolved] (SPARK-17150) Support SQL generation for inline tables
[ https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17150.

Resolution: Fixed
Fix Version/s: 2.1.0, 2.0.1

Issue resolved by pull request 14709 [https://github.com/apache/spark/pull/14709]
[jira] [Commented] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`
[ https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429226#comment-15429226 ] Apache Spark commented on SPARK-16862:

User 'tejasapatil' has created a pull request for this issue: https://github.com/apache/spark/pull/14726

> Configurable buffer size in `UnsafeSorterSpillReader`
> -----------------------------------------------------
>
> Key: SPARK-16862
> URL: https://issues.apache.org/jira/browse/SPARK-16862
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Tejas Patil
> Priority: Minor
>
> `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This could be made configurable to improve on-disk reads.
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53
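The knob being proposed can be illustrated in plain Python: `io.BufferedReader` takes an explicit `buffer_size`, analogous to the buffer size `BufferedInputStream` would receive from a configuration property. The config name below is hypothetical, and this is a stand-in sketch rather than the spill reader itself:

```python
import io

# Hypothetical config value, analogous to a spark.* property that would
# replace the hard-coded 8 KB default in UnsafeSorterSpillReader.
read_buffer_size = 1024 * 1024  # 1 MB

raw = io.BytesIO(b"x" * 4096)   # stand-in for a spill file on disk
reader = io.BufferedReader(raw, buffer_size=read_buffer_size)
data = reader.read()
print(len(data))  # 4096
```

Larger buffers amortize the per-read syscall cost when scanning big sequential spill files, at the price of more memory per open reader.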
[jira] [Closed] (SPARK-16264) Allow the user to use operators on the received DataFrame
[ https://issues.apache.org/jira/browse/SPARK-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-16264.

Resolution: Won't Fix

> Allow the user to use operators on the received DataFrame
> ---------------------------------------------------------
>
> Key: SPARK-16264
> URL: https://issues.apache.org/jira/browse/SPARK-16264
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Shixiong Zhu
>
> Currently a Sink cannot apply any operators on the given DataFrame, because a new DataFrame created by an operator will use QueryExecution rather than IncrementalExecution.
> There are two options to fix this:
> 1. Merge IncrementalExecution into QueryExecution so that QueryExecution can also deal with streaming operators.
> 2. Make Dataset operators inherit the QueryExecution (IncrementalExecution is just a subclass of QueryExecution) from their parent.
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429196#comment-15429196 ] Yanbo Liang commented on SPARK-17134:

[~qhuang] Please feel free to take this task and do the performance investigation. Thanks!

> Use level 2 BLAS operations in LogisticAggregator
> -------------------------------------------------
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses the LogisticAggregator class for gradient updates. We should look into refactoring MLOR to use level 2 BLAS operations for the updates. Performance testing should be done to show improvements.
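The shape of the proposed refactoring can be sketched in pure Python: instead of one dot product per class, the margins for all classes of a sample come from a single matrix-vector product (a level-2 BLAS `gemv`). A real implementation would call native BLAS, not a Python loop; this only illustrates the access pattern:

```python
def gemv(matrix, vector):
    """Level-2 BLAS-style matrix-vector product (pure-Python sketch)."""
    return [sum(row[j] * vector[j] for j in range(len(vector)))
            for row in matrix]

# Margins for all classes of one sample in multinomial logistic regression:
# coefficients is (num_classes x num_features); one gemv replaces a loop of
# per-class dot products, letting an optimized BLAS kernel do the work.
coefficients = [[1.0, 0.0],
                [0.5, 0.5]]
features = [2.0, 4.0]
print(gemv(coefficients, features))  # [2.0, 3.0]
```

The benchmarking asked for in the issue would compare this batched formulation against the per-class loop on realistically sized coefficient matrices.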
[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429195#comment-15429195 ] Reynold Xin commented on SPARK-17164:

I tried in Postgres:
{code}
rxin=# create table a:b (id int);
ERROR: syntax error at or near ":"
LINE 1: create table a:b (id int);
{code}
Also it seems like Presto does not support it. You can always quote this though.

> Query with colon in the table name fails to parse in 2.0
> --------------------------------------------------------
>
> Key: SPARK-17164
> URL: https://issues.apache.org/jira/browse/SPARK-17164
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Sital Kedia
>
> Running a simple query with a colon in the table name fails to parse in 2.0:
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
> at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
> at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
> at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
> ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6, as the query runs fine in 1.6.
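The "you can always quote this" workaround refers to backtick-quoting the identifier in Spark SQL. A small helper that produces the quoted form (doubling embedded backticks, the usual escape convention for quoted identifiers) might look like this; the function name is illustrative, not a Spark API:

```python
def quote_identifier(name):
    """Backtick-quote a SQL identifier, escaping embedded backticks by doubling."""
    return "`" + name.replace("`", "``") + "`"

# The failing query from the issue, rewritten with a quoted identifier:
print(f"SELECT * FROM {quote_identifier('a:b')}")  # SELECT * FROM `a:b`
```

With quoting, the colon is part of the identifier rather than a token the parser must accept bare, which is why Postgres and Presto impose the same restriction.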
[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429194#comment-15429194 ] Reynold Xin commented on SPARK-17164:

This is actually valid?
[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
[ https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429193#comment-15429193 ] Sital Kedia commented on SPARK-17164:

cc - [~hvanhovell], [~rxin]
[jira] [Created] (SPARK-17164) Query with colon in the table name fails to parse in 2.0
Sital Kedia created SPARK-17164:

Summary: Query with colon in the table name fails to parse in 2.0
Key: SPARK-17164
URL: https://issues.apache.org/jira/browse/SPARK-17164
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 2.0.0
Reporter: Sital Kedia
[jira] [Resolved] (SPARK-17158) Improve error message for numeric literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17158.

Resolution: Fixed
Assignee: Srinath
Fix Version/s: 2.1.0, 2.0.1

> Improve error message for numeric literal parsing
> -------------------------------------------------
>
> Key: SPARK-17158
> URL: https://issues.apache.org/jira/browse/SPARK-17158
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Srinath
> Assignee: Srinath
> Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
> Spark currently gives confusing and inconsistent error messages for numeric literals. For example:
> {code}
> scala> sql("select 123456Y")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
>
> scala> sql("select 123456S")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
>
> scala> sql("select 12345623434523434564565L")
> org.apache.spark.sql.catalyst.parser.ParseException:
> For input string: "12345623434523434564565"(line 1, pos 7)
> {code}
> The problem is that we are relying on the JDK's implementations for parsing, and those functions throw different error messages. This code can be found in the AstBuilder.numericLiteral function.
> The proposal: instead of using `_.toByte` to turn a string into a byte, we always turn the numeric literal string into a BigDecimal, and then validate the range before turning it into a numeric value. This way, we have more control over the data.
> If BigDecimal fails to parse the number, we should throw a better exception than "For input string ...".
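The proposal above (parse once with BigDecimal, then validate the range before narrowing) can be sketched in Python with `decimal.Decimal` standing in for Java's BigDecimal. The function name, the range table, and the error-message wording are assumptions for illustration:

```python
from decimal import Decimal, InvalidOperation

# Ranges for the tinyint (Y), smallint (S), and bigint (L) literal suffixes.
RANGES = {"Y": (-128, 127), "S": (-32768, 32767), "L": (-2**63, 2**63 - 1)}

def parse_numeric_literal(text, suffix):
    """Parse a suffixed numeric literal with one uniform, informative error path."""
    try:
        value = Decimal(text)
    except InvalidOperation:
        # One consistent message instead of the JDK's "For input string: ..."
        raise ValueError(f"invalid numeric literal '{text}'")
    lo, hi = RANGES[suffix]
    if not (lo <= value <= hi):
        # One consistent message instead of "Value out of range. ... Radix:10"
        raise ValueError(f"numeric literal {text} does not fit in range "
                         f"[{lo}, {hi}] for type suffix '{suffix}'")
    return int(value)

print(parse_numeric_literal("123", "Y"))  # 123
```

Because every suffix goes through the same decimal-parse-then-range-check path, all three failing examples in the issue would produce the same style of message.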
[jira] [Resolved] (SPARK-17149) array.sql for testing array related functions
[ https://issues.apache.org/jira/browse/SPARK-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-17149.

Resolution: Fixed
Assignee: Peter Lee
Fix Version/s: 2.1.0, 2.0.1

> array.sql for testing array related functions
> ---------------------------------------------
>
> Key: SPARK-17149
> URL: https://issues.apache.org/jira/browse/SPARK-17149
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Peter Lee
> Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
[jira] [Created] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces
Seth Hendrickson created SPARK-17163:

Summary: Decide on unified multinomial and binary logistic regression interfaces
Key: SPARK-17163
URL: https://issues.apache.org/jira/browse/SPARK-17163
Project: Spark
Issue Type: Sub-task
Reporter: Seth Hendrickson

Before the 2.1 release, we should finalize the API for logistic regression. After SPARK-7159, we have both LogisticRegression and MultinomialLogisticRegression models. This may be confusing to users and is a bit superfluous, since MLOR can do basically all of what BLOR does. We should decide whether the API needs to be changed and implement those changes before 2.1.
[jira] [Commented] (SPARK-17151) Decide how to handle inferring number of classes in Multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429069#comment-15429069 ] DB Tsai commented on SPARK-17151:

[~sethah] I think it sort of makes sense to allow users to specify the number of classes if they want, instead of inferring it from the data.

> Decide how to handle inferring number of classes in Multinomial logistic regression
> -----------------------------------------------------------------------------------
>
> Key: SPARK-17151
> URL: https://issues.apache.org/jira/browse/SPARK-17151
> Project: Spark
> Issue Type: Sub-task
> Components: ML, MLlib
> Reporter: Seth Hendrickson
> Priority: Minor
>
> This JIRA is to discuss how the number of label classes should be inferred in multinomial logistic regression. Currently, MLOR checks the dataframe metadata, and if the number of classes is not specified it uses the maximum value seen in the label column. If the labels are not properly indexed, this can cause a large number of zero coefficients and potentially produce instabilities in model training.
[jira] [Commented] (SPARK-17151) Decide how to handle inferring number of classes in Multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429066#comment-15429066 ] DB Tsai commented on SPARK-17151:

BTW, it is not only the zero-coefficients issue: the intercepts will also be negative infinity for classes not seen at training time. This causes instabilities during optimization, so we should not train on those unseen classes. As a result, we need to keep track of which classes are seen at training time and only optimize the coefficients for them. Since we know all the possible classes (which users should be able to specify as part of the API), at prediction time we just give the unseen classes probability zero.
[jira] [Comment Edited] (SPARK-17151) Decide how to handle inferring number of classes in Multinomial logistic regression
[ https://issues.apache.org/jira/browse/SPARK-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429066#comment-15429066 ] DB Tsai edited comment on SPARK-17151 at 8/19/16 11:49 PM:

Not only the zero-coefficients issue: the intercepts will also be negative infinity for classes not seen at training time. This causes instabilities during optimization, so we should not train on those unseen classes. As a result, we need to keep track of which classes are seen at training time and only optimize the coefficients for them. Since we know all the possible classes (which users should be able to specify as part of the API), at prediction time we just give the unseen classes probability zero.
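The scheme [~dbtsai] describes, optimizing only over classes actually seen in training and assigning zero probability to the rest at prediction time, can be sketched as follows. All names are illustrative, and the probability expansion stands in for what a real model would do with its softmax output:

```python
def seen_classes(labels):
    """Return the sorted list of class indices present in the training data."""
    return sorted(set(labels))

def expand_probabilities(seen, seen_probs, num_classes):
    """Map probabilities over the seen classes back onto the full class space.

    Classes never observed in training get probability exactly zero, instead
    of coefficients/intercepts being driven toward negative infinity.
    """
    full = [0.0] * num_classes
    for cls, p in zip(seen, seen_probs):
        full[cls] = p
    return full

seen = seen_classes([0, 2, 2, 0])            # classes 1 and 3 never appear
print(seen)                                   # [0, 2]
print(expand_probabilities(seen, [0.7, 0.3], num_classes=4))
# [0.7, 0.0, 0.3, 0.0]
```

Training then only ever fits coefficients for `seen`, keeping the optimization well-conditioned, while the public API still reports probabilities over all `num_classes` classes.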
[jira] [Assigned] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
[ https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17161:

Assignee: (was: Apache Spark)

> Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
> -------------------------------------------------------------------------
>
> Key: SPARK-17161
> URL: https://issues.apache.org/jira/browse/SPARK-17161
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Reporter: Bryan Cutler
> Priority: Minor
>
> Often in Spark ML, there are classes that take a Scala `Array` in their constructor. In order to add the same API to Python, a Java-friendly alternate constructor needs to exist to be compatible with py4j when converting from a list. This is because the current conversion in PySpark's _py2java creates a java.util.ArrayList, as shown in this error message:
> {noformat}
> Py4JError: An error occurred while calling None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
> py4j.Py4JException: Constructor org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) does not exist
> at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
> at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
> at py4j.Gateway.invoke(Gateway.java:235)
> {noformat}
> Creating an alternate constructor can be avoided by creating a py4j JavaArray using {{new_array}}. This type is compatible with the Scala `Array` currently used in classes like {{CountVectorizerModel}} and {{StringIndexerModel}}.
> Most of the boiler-plate Python code to do this can be put in a convenience function inside ml.JavaWrapper to give a clean way of constructing ML objects without adding special constructors.
[jira] [Commented] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
[ https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429043#comment-15429043 ] Apache Spark commented on SPARK-17161:

User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/14725
[jira] [Assigned] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
[ https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17161: Assignee: Apache Spark > Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays > - > > Key: SPARK-17161 > URL: https://issues.apache.org/jira/browse/SPARK-17161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Apache Spark >Priority: Minor > > Often in Spark ML, there are classes that use a Scala `Array` to construct. > In order to add the same API to Python, a Java-friendly alternate constructor > needs to exist to be compatible with py4j when converting from a list. This > is because the current conversion in PySpark _py2java creates a > java.util.ArrayList, as shown in this error msg > {noformat} > Py4JError: An error occurred while calling > None.org.apache.spark.ml.feature.CountVectorizerModel. Trace: > py4j.Py4JException: Constructor > org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) > does not exist > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179) > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196) > at py4j.Gateway.invoke(Gateway.java:235) > {noformat} > Creating an alternate constructor can be avoided by creating a py4j JavaArray > using {{new_array}}. This type is compatible with the Scala `Array` > currently used in classes like {{CountVectorizerModel}} and > {{StringIndexerModel}}. > Most of the boiler-plate Python code to do this can be put in a convenience > function inside of ml.JavaWrapper to give a clean way of constructing ML > objects without adding special constructors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms
[ https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429039#comment-15429039 ] DB Tsai commented on SPARK-17136: - Typically, a first-order optimizer will take a function that returns the value of the objective function and its first derivative. A second-order one will also take the Hessian matrix. Since second-order optimizers don't scale with the number of features, we can focus on first-order optimizers first. Also, we need an interface to handle non-differentiable losses such as L1 outside of the smooth objective, since they are specific to the design of the algorithms and so cannot be part of the loss function itself. We may take a look at how other packages in R, MATLAB, or Python define their interfaces, and come up with a generic one. The default implementation can wrap the Breeze one. Users can supply their own implementation to change the default optimizer. > Design optimizer interface for ML algorithms > > > Key: SPARK-17136 > URL: https://issues.apache.org/jira/browse/SPARK-17136 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > We should consider designing an interface that allows users to use their own > optimizers in some of the ML algorithms, similar to MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
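The ticket does not fix an API, but one possible shape for the first-order interface described in the comment can be sketched in Python (names `FirstOrderOptimizer`, `minimize`, and `GradientDescent` are illustrative): the caller supplies a function returning the objective value together with its gradient, and the default implementation — which in Spark would wrap Breeze — is here a plain gradient descent.

```python
from typing import Callable, List, Tuple

# Hypothetical first-order optimizer interface. The loss callback returns
# (objective value, gradient) at a point; a non-differentiable term such as
# L1 would be handled by the optimizer itself, outside this callback.
LossFn = Callable[[List[float]], Tuple[float, List[float]]]


class FirstOrderOptimizer:
    def minimize(self, loss: LossFn, init: List[float]) -> List[float]:
        raise NotImplementedError


class GradientDescent(FirstOrderOptimizer):
    """Stand-in for a default implementation (Spark's would wrap Breeze)."""

    def __init__(self, step: float = 0.1, iters: int = 200):
        self.step, self.iters = step, iters

    def minimize(self, loss, init):
        x = list(init)
        for _ in range(self.iters):
            _, grad = loss(x)
            x = [xi - self.step * gi for xi, gi in zip(x, grad)]
        return x


# Minimize f(x) = (x - 3)^2, whose minimum is at x = 3.
quadratic = lambda x: ((x[0] - 3.0) ** 2, [2.0 * (x[0] - 3.0)])
print(GradientDescent().minimize(quadratic, [0.0]))
```

A user-supplied optimizer would subclass `FirstOrderOptimizer` and be passed to the estimator in place of the default, which is the swap the ticket asks for.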
[jira] [Updated] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
[ https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-17161: - Summary: Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays (was: Add PySpark-ML JavaWrapper convienience function to create py4j JavaArrays) > Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays > - > > Key: SPARK-17161 > URL: https://issues.apache.org/jira/browse/SPARK-17161 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > > Often in Spark ML, there are classes that use a Scala `Array` to construct. > In order to add the same API to Python, a Java-friendly alternate constructor > needs to exist to be compatible with py4j when converting from a list. This > is because the current conversion in PySpark _py2java creates a > java.util.ArrayList, as shown in this error msg > {noformat} > Py4JError: An error occurred while calling > None.org.apache.spark.ml.feature.CountVectorizerModel. Trace: > py4j.Py4JException: Constructor > org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) > does not exist > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179) > at > py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196) > at py4j.Gateway.invoke(Gateway.java:235) > {noformat} > Creating an alternate constructor can be avoided by creating a py4j JavaArray > using {{new_array}}. This type is compatible with the Scala `Array` > currently used in classes like {{CountVectorizerModel}} and > {{StringIndexerModel}}. > Most of the boiler-plate Python code to do this can be put in a convenience > function inside of ml.JavaWrapper to give a clean way of constructing ML > objects without adding special constructors. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429025#comment-15429025 ] DB Tsai edited comment on SPARK-17137 at 8/19/16 11:16 PM: --- Currently, for LiR or BLOR, we always do `Vector.compressed` when creating the models, which is optimized for space but not for computation. We need to investigate the trade-off. was (Author: dbtsai): Currently, for LiR or BLOR, we always do `Vector.compressed`, which is optimized for space but not for computation. We need to investigate the trade-off. > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as when using high L1 regularization, it may > be more efficient to store coefficients in compressed format. We can add this > option to MLOR and perhaps do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients
[ https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429025#comment-15429025 ] DB Tsai commented on SPARK-17137: - Currently, for LiR or BLOR, we always do `Vector.compressed`, which is optimized for space but not for computation. We need to investigate the trade-off. > Add compressed support for multinomial logistic regression coefficients > --- > > Key: SPARK-17137 > URL: https://issues.apache.org/jira/browse/SPARK-17137 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > For sparse coefficients in MLOR, such as when using high L1 regularization, it may > be more efficient to store coefficients in compressed format. We can add this > option to MLOR and perhaps do some performance tests to verify > improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
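The space-versus-computation trade-off behind `Vector.compressed` can be sketched as below. The byte costs are rough per-element models of Spark's dense (one double per slot) and sparse (int index plus double per nonzero) layouts; the exact threshold in the Scala implementation is a detail that may differ. The point is that the choice is made purely by storage size — a sparse vector still pays a binary-search cost on element lookup, which is the computation cost the comment refers to.

```python
def compressed_is_sparse(size: int, nnz: int) -> bool:
    """Rough model of the dense-vs-sparse choice made by vector compression.

    Dense storage: ~8 bytes per double slot.
    Sparse storage: ~12 bytes per nonzero (4-byte int index + 8-byte double).
    The real Scala heuristic is similar in spirit; its exact constant is an
    implementation detail.
    """
    dense_bytes = 8 * size
    sparse_bytes = 12 * nnz
    return sparse_bytes < dense_bytes


# A vector with few nonzeros compresses to sparse...
print(compressed_is_sparse(size=1000, nnz=10))   # True
# ...but a mostly-dense one stays dense.
print(compressed_is_sparse(size=10, nnz=9))      # False
```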
[jira] [Assigned] (SPARK-17162) Range does not support SQL generation
[ https://issues.apache.org/jira/browse/SPARK-17162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17162: Assignee: (was: Apache Spark) > Range does not support SQL generation > - > > Key: SPARK-17162 > URL: https://issues.apache.org/jira/browse/SPARK-17162 > Project: Spark > Issue Type: Bug >Reporter: Eric Liang >Priority: Minor > > {code} > scala> sql("create view a as select * from range(100)") > 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as > select * from range(100) > java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, > splits=8) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165) > at > org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165) > at > org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127) > at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > ``` > {code} -- This message was sent 
by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17162) Range does not support SQL generation
[ https://issues.apache.org/jira/browse/SPARK-17162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429018#comment-15429018 ] Apache Spark commented on SPARK-17162: -- User 'ericl' has created a pull request for this issue: https://github.com/apache/spark/pull/14724 > Range does not support SQL generation > - > > Key: SPARK-17162 > URL: https://issues.apache.org/jira/browse/SPARK-17162 > Project: Spark > Issue Type: Bug >Reporter: Eric Liang >Priority: Minor > > {code} > scala> sql("create view a as select * from range(100)") > 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as > select * from range(100) > java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, > splits=8) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165) > at > org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165) > at > org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127) > at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > ``` > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17162) Range does not support SQL generation
[ https://issues.apache.org/jira/browse/SPARK-17162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17162: Assignee: Apache Spark > Range does not support SQL generation > - > > Key: SPARK-17162 > URL: https://issues.apache.org/jira/browse/SPARK-17162 > Project: Spark > Issue Type: Bug >Reporter: Eric Liang >Assignee: Apache Spark >Priority: Minor > > {code} > scala> sql("create view a as select * from range(100)") > 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as > select * from range(100) > java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, > splits=8) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165) > at > org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165) > at > org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229) > at > org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127) > at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97) > at > org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > ``` > {code} -- This 
message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17162) Range does not support SQL generation
Eric Liang created SPARK-17162: -- Summary: Range does not support SQL generation Key: SPARK-17162 URL: https://issues.apache.org/jira/browse/SPARK-17162 Project: Spark Issue Type: Bug Reporter: Eric Liang Priority: Minor {code} scala> sql("create view a as select * from range(100)") 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as select * from range(100) java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, splits=8) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165) at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165) at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229) at org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127) at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97) at org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) ``` {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17161) Add PySpark-ML JavaWrapper convienience function to create py4j JavaArrays
Bryan Cutler created SPARK-17161: Summary: Add PySpark-ML JavaWrapper convienience function to create py4j JavaArrays Key: SPARK-17161 URL: https://issues.apache.org/jira/browse/SPARK-17161 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Bryan Cutler Priority: Minor Often in Spark ML, there are classes that are constructed with a Scala `Array`. In order to add the same API to Python, a Java-friendly alternate constructor needs to exist to be compatible with py4j when converting from a list. This is because the current conversion in PySpark _py2java creates a java.util.ArrayList, as shown in this error message {noformat} Py4JError: An error occurred while calling None.org.apache.spark.ml.feature.CountVectorizerModel. Trace: py4j.Py4JException: Constructor org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) does not exist at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179) at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196) at py4j.Gateway.invoke(Gateway.java:235) {noformat} Creating an alternate constructor can be avoided by creating a py4j JavaArray using {{new_array}}. This type is compatible with the Scala `Array` currently used in classes like {{CountVectorizerModel}} and {{StringIndexerModel}}. Most of the boiler-plate Python code to do this can be put in a convenience function inside of ml.JavaWrapper to give a clean way of constructing ML objects without adding special constructors. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17140) Add initial model to MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429003#comment-15429003 ] DB Tsai commented on SPARK-17140: - Since we're doing smoothing, the intercepts computed from the priors with smoothing will not be the ones that the optimization actually converges to under large L1. Just keep this in mind when writing the tests. > Add initial model to MultinomialLogisticRegression > -- > > Key: SPARK-17140 > URL: https://issues.apache.org/jira/browse/SPARK-17140 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > We should add initial model support to Multinomial logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
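The point about smoothing can be made concrete with a small sketch. Initializing the intercepts from smoothed class priors (additive add-one smoothing is used here purely as an illustration; the exact smoothing inside Spark's initialization is an implementation detail) yields log-prior intercepts that a heavily L1-regularized fit, whose intercepts come out of the penalized optimization, need not reproduce — so tests should not compare the two with tight tolerances.

```python
import math

def smoothed_log_priors(counts, smoothing=1.0):
    """Intercepts from class priors with additive smoothing (illustrative).

    counts[k] is the number of training examples with label k. Smoothing
    keeps the log finite even for classes with zero count.
    """
    total = sum(counts) + smoothing * len(counts)
    return [math.log((c + smoothing) / total) for c in counts]


# Log-priors for a skewed 3-class problem; they sum to 1 after exponentiation.
print(smoothed_log_priors([90, 9, 1]))
# A zero-count class still gets a finite intercept thanks to smoothing.
print(smoothed_log_priors([10, 0]))
```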
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428988#comment-15428988 ] Nicholas Chammas commented on SPARK-17025: -- {quote} We'd need to figure out a good design for this, especially since it will require a Python process to be started up for what might otherwise be pure Scala applications. {quote} Ah, I guess since a Transformer using some Python code may be persisted and then loaded into a Scala application, right? Sounds hairy. Anyway, thanks for chiming in Joseph. I'll watch the linked issue. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
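The traceback above boils down to a simple structural gap, which can be reproduced in miniature without Spark: pipeline persistence converts each Python stage to its Java counterpart by calling `_to_java`, and a custom pure-Python stage has no such method. The classes below are stand-ins (not PySpark classes) that mimic that interaction.

```python
class JavaBackedStage:
    """Stand-in for a PySpark stage that wraps a JVM object."""

    def _to_java(self):
        return "<py4j JavaObject>"


class PeoplePairFeaturizer:
    """Stand-in for a custom pure-Python Transformer: it has no _to_java."""

    def transform(self, df):
        return df


def pipeline_to_java(stages):
    # Mirrors the shape of PipelineModel._to_java: every stage is asked to
    # hand over its JVM counterpart (sketch, not the real implementation).
    return [stage._to_java() for stage in stages]


try:
    pipeline_to_java([JavaBackedStage(), PeoplePairFeaturizer()])
except AttributeError as e:
    # Same failure mode as the traceback in the report above.
    print(e)
```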
[jira] [Resolved] (SPARK-17128) Schema is not Created for nested Json Array objects
[ https://issues.apache.org/jira/browse/SPARK-17128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-17128. --- Resolution: Invalid Target Version/s: (was: 2.0.0) This is not a reasonable description of an issue. Please read https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first > Schema is not Created for nested Json Array objects > --- > > Key: SPARK-17128 > URL: https://issues.apache.org/jira/browse/SPARK-17128 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.0.0 > Environment: Java and Scalab both >Reporter: vidit Singh >Priority: Critical > > When i am trying to generate Schema of Nested Json Array elements, > Shows Error : [_corrupt_record: string]. > Please fix this issue ASAP. > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428906#comment-15428906 ] Joseph K. Bradley commented on SPARK-17025: --- I'd call this a new API, not a bug. This kind of support would be great to add; we just have not had a chance to work on it. We'd need to figure out a good design for this, especially since it will require a Python process to be started up for what might otherwise be pure Scala applications. Linking a related task which should come before this. > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. 
> {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. [like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer
[ https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-17025: -- Issue Type: New Feature (was: Bug) > Cannot persist PySpark ML Pipeline model that includes custom Transformer > - > > Key: SPARK-17025 > URL: https://issues.apache.org/jira/browse/SPARK-17025 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > Following the example in [this Databricks blog > post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html] > under "Python tuning", I'm trying to save an ML Pipeline model. > This pipeline, however, includes a custom transformer. When I try to save the > model, the operation fails because the custom transformer doesn't have a > {{_to_java}} attribute. > {code} > Traceback (most recent call last): > File ".../file.py", line 56, in > model.bestModel.save('model') > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 222, in save > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 217, in write > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py", > line 93, in __init__ > File > "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py", > line 254, in _to_java > AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java' > {code} > Looking at the source code for > [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py], > I see that not even the base Transformer class has such an attribute. > I'm assuming this is missing functionality that is intended to be patched up > (i.e. 
[like > this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]). > I'm not sure if there is an existing JIRA for this (my searches didn't turn > up clear results). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikael Valot updated SPARK-17155: - Description: The following code throws an exception in the DSE (Datastax enterprise) spark shell: {code} dse spark --master=local[2] {code} {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} It looks like a different ClassLoader is involved and cannot load my case class. However it works fine with a Tuple, and it works fine with the standalone version of Spark. {code:java} val fut = Future{ Seq((1, 2)).toDS() } Await.result(fut, Duration.Inf).show() +---+---+ | _1| _2| +---+---+ | 1| 2| +---+---+ {code} was: The following code throws an exception in the DSE (Datastax enterprise) spark shell: {code} dse spark --master=local[2] {code} {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at
[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikael Valot updated SPARK-17155: - Description: The following code throws an exception in the DSE (Datastax enterprise) spark shell: {code:bash} dse spark --master=local[2] {code} {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} It looks like a different ClassLoader is involved and cannot load my case class. However it works fine with a Tuple: {code:java} val fut = Future{ Seq((1, 2)).toDS() } Await.result(fut, Duration.Inf).show() +---+---+ | _1| _2| +---+---+ | 1| 2| +---+---+ {code} was: The following code throws an exception in the spark shell: {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at
[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikael Valot updated SPARK-17155: - Description: The following code throws an exception in the DSE (Datastax enterprise) spark shell: {code} dse spark --master=local[2] {code} {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} It looks like a different ClassLoader is involved and cannot load my case class. However it works fine with a Tuple: {code:java} val fut = Future{ Seq((1, 2)).toDS() } Await.result(fut, Duration.Inf).show() +---+---+ | _1| _2| +---+---+ | 1| 2| +---+---+ {code} was: The following code throws an exception in the DSE (Datastax enterprise) spark shell: {code:bash} dse spark --master=local[2] {code} {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at
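The stack traces above suggest the encoder's reflection runs on a scala.concurrent ForkJoin worker thread whose context ClassLoader differs from the REPL thread that defined the case class. A minimal JVM-only sketch of the usual mitigation pattern, capturing the caller's context ClassLoader and installing it inside the pooled task. This is illustrative only, not a fix confirmed in this ticket:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ClassLoaderDemo {
    // Run a task on a pool thread with the caller's context ClassLoader
    // installed, restoring the worker's original loader afterwards.
    static ClassLoader loaderSeenByTask(ExecutorService pool) throws Exception {
        ClassLoader callerLoader = Thread.currentThread().getContextClassLoader();
        Future<ClassLoader> fut = pool.submit(() -> {
            Thread current = Thread.currentThread();
            ClassLoader previous = current.getContextClassLoader();
            current.setContextClassLoader(callerLoader);
            try {
                // Reflection-based work (e.g. encoder derivation) done here
                // now resolves classes through the caller's loader.
                return current.getContextClassLoader();
            } finally {
                current.setContextClassLoader(previous); // always restore
            }
        });
        return fut.get();
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        ClassLoader seen = loaderSeenByTask(pool);
        System.out.println(seen == Thread.currentThread().getContextClassLoader());
        pool.shutdown();
    }
}
```

In the Spark shell the analogous move may be to build the Dataset on the thread that owns the interpreter's loader, or to propagate that loader into the Future's execution context.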
[jira] [Resolved] (SPARK-16443) ALS wrapper in SparkR
[ https://issues.apache.org/jira/browse/SPARK-16443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16443. --- Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 14384 [https://github.com/apache/spark/pull/14384] > ALS wrapper in SparkR > - > > Key: SPARK-16443 > URL: https://issues.apache.org/jira/browse/SPARK-16443 > Project: Spark > Issue Type: Sub-task > Components: MLlib, SparkR >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Fix For: 2.1.0 > > > Wrap MLlib's ALS in SparkR. We should discuss whether we want to support R > formula or not for ALS. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428848#comment-15428848 ] DB Tsai edited comment on SPARK-17134 at 8/19/16 9:21 PM:
--
It may also be worth trying the following. I see some performance improvement when the number of classes is high, and we can avoid doing the normalization again and again. However, the access pattern of the coefficients array will no longer be sequential, which changes CPU cache locality, so I think the performance will vary case by case. Maybe we can store the coefficients in a transposed matrix, which may help locality? We need more investigation to understand the problem.
{code:borderStyle=solid}
val margins = Array.ofDim[Double](numClasses)
features.foreachActive { (index, value) =>
  if (featuresStd(index) != 0.0 && value != 0.0) {
    var i = 0
    val temp = value / featuresStd(index)
    while (i < numClasses) {
      margins(i) += coefficients(i * numFeaturesPlusIntercept + index) * temp
      i += 1
    }
  }
}
if (fitIntercept) {
  var i = 0
  val length = features.size
  while (i < numClasses) {
    margins(i) += coefficients(i * numFeaturesPlusIntercept + length)
    i += 1
  }
}
val maxMargin = margins.max
val marginOfLabel = margins(label.toInt)
{code}
was (Author: dbtsai):
{code:borderStyle=solid}
val margins = Array.ofDim[Double](numClasses)
features.foreachActive { (index, value) =>
  if (featuresStd(index) != 0.0 && value != 0.0) {
    var i = 0
    val temp = value / featuresStd(index)
    while (i < numClasses) {
      margins(i) += coefficients(i * numFeaturesPlusIntercept + index) * temp
      i += 1
    }
  }
}
if (fitIntercept) {
  var i = 0
  val length = features.size
  while (i < numClasses) {
    margins(i) += coefficients(i * numFeaturesPlusIntercept + length)
    i += 1
  }
}
val maxMargin = margins.max
val marginOfLabel = margins(label.toInt)
{code}
> Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue 
Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
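The hand-rolled loop in the comment above computes margins = B·x for a row-major coefficients array B of shape numClasses × (numFeatures + 1), which is exactly the shape of a level-2 BLAS gemv call. A self-contained sketch of the equivalent dense computation (plain Java, no Spark; the names mirror the snippet, and the tiny hard-coded example is made up for illustration):

```java
public class MarginsDemo {
    // Dense equivalent of the loop in the comment:
    // margins(i) = sum_j coefficients(i, j) * features(j), plus the intercept.
    // `coefficients` is laid out row-major: row i holds the weights for
    // class i, followed by that class's intercept in the last slot.
    static double[] margins(double[] coefficients, double[] features,
                            int numClasses, boolean fitIntercept) {
        int numFeatures = features.length;
        int stride = numFeatures + (fitIntercept ? 1 : 0);
        double[] margins = new double[numClasses];
        for (int i = 0; i < numClasses; i++) {
            double sum = 0.0;
            for (int j = 0; j < numFeatures; j++) {
                sum += coefficients[i * stride + j] * features[j];
            }
            if (fitIntercept) {
                sum += coefficients[i * stride + numFeatures]; // intercept term
            }
            margins[i] = sum;
        }
        return margins;
    }

    public static void main(String[] args) {
        // 2 classes, 2 features + intercept: rows are {1, 2, 10} and {3, 4, 20}
        double[] coef = {1, 2, 10, 3, 4, 20};
        double[] m = margins(coef, new double[]{1.0, 1.0}, 2, true);
        System.out.println(m[0] + " " + m[1]); // prints: 13.0 27.0
    }
}
```

A real implementation would delegate the inner loops to a native gemv; the point here is only that the update is a single matrix–vector product, which is what makes the BLAS-2 refactoring proposed in this ticket possible.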
[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator
[ https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428848#comment-15428848 ] DB Tsai commented on SPARK-17134:
-
{code:borderStyle=solid}
val margins = Array.ofDim[Double](numClasses)
features.foreachActive { (index, value) =>
  if (featuresStd(index) != 0.0 && value != 0.0) {
    var i = 0
    val temp = value / featuresStd(index)
    while (i < numClasses) {
      margins(i) += coefficients(i * numFeaturesPlusIntercept + index) * temp
      i += 1
    }
  }
}
if (fitIntercept) {
  var i = 0
  val length = features.size
  while (i < numClasses) {
    margins(i) += coefficients(i * numFeaturesPlusIntercept + length)
    i += 1
  }
}
val maxMargin = margins.max
val marginOfLabel = margins(label.toInt)
{code}
> Use level 2 BLAS operations in LogisticAggregator > - > > Key: SPARK-17134 > URL: https://issues.apache.org/jira/browse/SPARK-17134 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Multinomial logistic regression uses LogisticAggregator class for gradient > updates. We should look into refactoring MLOR to use level 2 BLAS operations > for the updates. Performance testing should be done to show improvements. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-16569) Use Cython to speed up Pyspark internals
[ https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu closed SPARK-16569. -- Resolution: Won't Fix > Use Cython to speed up Pyspark internals > > > Key: SPARK-16569 > URL: https://issues.apache.org/jira/browse/SPARK-16569 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Maciej Bryński >Priority: Minor > > CC: [~davies] > Many operations I do are like: > {code} > dataframe.rdd.map(some_function) > {code} > In PySpark this means creating a Row object for every record, and this is slow. > IDEA: > Use Cython to speed up Pyspark internals > What do you think? > Sample profile: > {code} > > Profile of
[jira] [Commented] (SPARK-16569) Use Cython to speed up Pyspark internals
[ https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428843#comment-15428843 ] Davies Liu commented on SPARK-16569: Agreed with [~robert3005]. Another option could be to just use PyPy, which we already support. > Use Cython to speed up Pyspark internals > > > Key: SPARK-16569 > URL: https://issues.apache.org/jira/browse/SPARK-16569 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Maciej Bryński >Priority: Minor > > CC: [~davies] > Many operations I do are like: > {code} > dataframe.rdd.map(some_function) > {code} > In PySpark this means creating a Row object for every record, and this is slow. > IDEA: > Use Cython to speed up Pyspark internals > What do you think? > Sample profile: > {code} > > Profile of
[jira] [Commented] (SPARK-13286) JDBC driver doesn't report full exception
[ https://issues.apache.org/jira/browse/SPARK-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428762#comment-15428762 ] Apache Spark commented on SPARK-13286: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/14722 > JDBC driver doesn't report full exception > - > > Key: SPARK-13286 > URL: https://issues.apache.org/jira/browse/SPARK-13286 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Adrian Bridgett >Assignee: Davies Liu >Priority: Minor > > Testing some failure scenarios (inserting data into PostgreSQL where there is > a schema mismatch), an exception is thrown (fine so far); however, it > doesn't report the actual SQL error. It refers to a getNextException call, > but this is beyond my non-existent Java skills to deal with correctly. > Supporting this would help users see the SQL error quickly and resolve the > underlying problem. > {noformat} > Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO core > VALUES('5fdf5...',) was aborted. Call getNextException to see the cause. 
> at > org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746) > at > org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:185) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:248) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
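For context on the getNextException remark above: the PostgreSQL driver chains the real server-side error behind the BatchUpdateException via SQLException's linked-exception list, so surfacing it is a short loop. A self-contained sketch (the simulated messages below are made up; the real change would live where the batch failure is caught, e.g. in JdbcUtils):

```java
import java.sql.BatchUpdateException;
import java.sql.SQLException;

public class NextExceptionDemo {
    // Walk the getNextException chain and collect every message, so the
    // underlying SQL error is surfaced instead of only the batch summary.
    static String fullMessage(SQLException top) {
        StringBuilder sb = new StringBuilder(top.getMessage());
        for (SQLException e = top.getNextException(); e != null; e = e.getNextException()) {
            sb.append("; caused by: ").append(e.getMessage());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Simulated failure: the driver chains the real error behind the batch exception.
        BatchUpdateException batch =
            new BatchUpdateException("Batch entry 0 was aborted.", new int[0]);
        batch.setNextException(
            new SQLException("column \"total\" is of type bigint but expression is of type text"));
        System.out.println(fullMessage(batch));
    }
}
```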
[jira] [Commented] (SPARK-13342) Cannot run INSERT statements in Spark
[ https://issues.apache.org/jira/browse/SPARK-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428753#comment-15428753 ] Dongjoon Hyun commented on SPARK-13342: --- Hi, All. Just to bring this issue up to date, the following is the result on Spark 2.0.0.
{code}
scala> sql("create table x(a int)")

scala> sql("select * from x").show
+---+
| a|
+---+
+---+

scala> sql("insert into x values 1")

scala> sql("select * from x").show
+---+
| a|
+---+
| 1|
+---+
{code}
> Cannot run INSERT statements in Spark > - > > Key: SPARK-13342 > URL: https://issues.apache.org/jira/browse/SPARK-13342 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.1, 1.6.0 >Reporter: neo > > I cannot run an INSERT statement using spark-sql. I tried with both versions > 1.5.1 and 1.6.0 without any luck. But it runs OK in Hive. > These are the steps I took. > 1) Launch Hive and create the table / insert a record. > create database test > use test > CREATE TABLE stgTable > ( > sno string, > total bigint > ); > INSERT INTO TABLE stgTable VALUES ('12',12) > 2) Launch spark-sql (1.5.1 or 1.6.0) > 3) Try inserting a record from the shell > INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1 > I got this error message > "Invalid method name: 'alter_table_with_cascade'" > I tried changing the Hive version inside the spark-sql shell using the SET > command. > I changed the Hive version > from > SET spark.sql.hive.version=1.2.1 (this is the default setting for my Spark > installation) > to > SET spark.sql.hive.version=0.14.0 > but that did not help either -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper
[ https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428751#comment-15428751 ] Xin Ren commented on SPARK-17157: - I guess a lot more ML algorithms are still missing R wrappers? > Add multiclass logistic regression SparkR Wrapper > - > > Key: SPARK-17157 > URL: https://issues.apache.org/jira/browse/SPARK-17157 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Miao Wang > > [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to master. I open this JIRA for discussion of adding a SparkR wrapper > for multiclass logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428745#comment-15428745 ] Iaroslav Zeigerman commented on SPARK-17024: The issue occurs only when reading the dataset from the CSV format. > Weird behaviour of the DataFrame when a column name contains dots. > -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segments in the name is the same > as another column's name, Spark treats this column as a nested structure, > although the actual type of the column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are strings, and the schema > resolution seems to be correct: > {noformat} > root > |-- user: string (nullable = true) > |-- user.task: string (nullable = true) > {noformat} > But when I'm trying to query this DataFrame, e.g.: > {code} > df.select(df("user"), df("user.task")) > {code} > Spark throws the exception "Can't extract value from user#2;" > It happens during the resolution of the LogicalPlan while processing the > "user.task" column. 
> Here is the full stacktrace: > {noformat} > Can't extract value from user#2; > org.apache.spark.sql.AnalysisException: Can't extract value from user#2; > at > org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) > {noformat} > Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
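The resolveQuoted frame in the stack trace is the relevant hook: Spark splits an unquoted attribute name on dots into nested-field accesses, while a backtick-quoted name is kept atomic. A toy, self-contained illustration of that parsing rule (this mirrors the idea, not Spark's actual implementation):

```java
import java.util.Arrays;
import java.util.List;

public class ResolveDemo {
    // Toy illustration of the ambiguity: an unquoted name is split on dots
    // into nested-field accesses, while a backtick-quoted name is atomic.
    static List<String> parseAttributeName(String name) {
        if (name.length() >= 2 && name.startsWith("`") && name.endsWith("`")) {
            // Quoted: the whole thing is a single column name.
            return Arrays.asList(name.substring(1, name.length() - 1));
        }
        // Unquoted: each dot introduces another level of nesting.
        return Arrays.asList(name.split("\\."));
    }

    public static void main(String[] args) {
        System.out.println(parseAttributeName("user.task"));   // [user, task] -> nested lookup
        System.out.println(parseAttributeName("`user.task`")); // [user.task] -> single column
    }
}
```

This is why quoting the full name in backticks, e.g. df.select(df("`user.task`")), can make Spark resolve it as a single column in cases where the unquoted form is misread as a nested access on "user".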
[jira] [Updated] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iaroslav Zeigerman updated SPARK-17024: --- Affects Version/s: (was: 1.6.0) 2.0.0 > Weird behaviour of the DataFrame when a column name contains dots. > -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segment in a name is the same > as other column's name, Spark treats this column as a nested structure, > although the actual type of column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are string, and the schema > resolution seems to be correct: > {noformat} > root > |-- user: string (nullable = true) > |-- user.task: string (nullable = true) > {noformat} > But when I'm trying to query this DataFrame like i.e.: > {code} > df.select(df("user"), df("user.task")) > {code} > Spark throws an exception "Can't extract value from user#2;" > It happens during the resolution of the LogicalPlan while processing the > "user.task" column. 
> Here is the full stacktrace: > {noformat} > Can't extract value from user#2; > org.apache.spark.sql.AnalysisException: Can't extract value from user#2; > at > org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) > {noformat} > Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.
[ https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Iaroslav Zeigerman reopened SPARK-17024: The issue occurs in Spark 2.0.0. Now it's even worse: I can't even get an RDD from a DataFrame, and backquotes don't help any more. > Weird behaviour of the DataFrame when a column name contains dots. > -- > > Key: SPARK-17024 > URL: https://issues.apache.org/jira/browse/SPARK-17024 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Iaroslav Zeigerman > > When a column name contains dots and one of the segments in the name is the same > as another column's name, Spark treats this column as a nested structure, > although the actual type of the column is String/Int/etc. Example: > {code} > val df = sqlContext.createDataFrame(Seq( > ("user1", "task1"), > ("user2", "task2") > )).toDF("user", "user.task") > {code} > Two columns "user" and "user.task". Both of them are strings, and the schema > resolution seems to be correct: > {noformat} > root > |-- user: string (nullable = true) > |-- user.task: string (nullable = true) > {noformat} > But when I'm trying to query this DataFrame, e.g.: > {code} > df.select(df("user"), df("user.task")) > {code} > Spark throws the exception "Can't extract value from user#2;" > It happens during the resolution of the LogicalPlan while processing the > "user.task" column. 
> Here is the full stacktrace: > {noformat} > Can't extract value from user#2; > org.apache.spark.sql.AnalysisException: Can't extract value from user#2; > at > org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275) > at > scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111) > at scala.collection.immutable.List.foldLeft(List.scala:84) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191) > at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151) > at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708) > at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696) > {noformat} > Is this actually an expected behaviour? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17160) GetExternalRowField does not properly escape field names, causing generated code not to compile
Josh Rosen created SPARK-17160: -- Summary: GetExternalRowField does not properly escape field names, causing generated code not to compile Key: SPARK-17160 URL: https://issues.apache.org/jira/browse/SPARK-17160 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Josh Rosen Priority: Critical The following end-to-end test uncovered a bug in {{GetExternalRowField}}: {code} import org.apache.spark.sql.functions._ import org.apache.spark.sql.catalyst.encoders._ spark.sql("set spark.sql.codegen.fallback=false") val df = Seq(("100-200", "1", "300")).toDF("a", "b", "c") val df2 = df.select(regexp_replace($"a", "(\\d+)", "num")) df2.mapPartitions(x => x)(RowEncoder(df2.schema)).collect() {code} This causes {code} java.lang.Exception: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, Column 64: Invalid escape sequence {code} The generated code is {code} /* 001 */ public Object generate(Object[] references) { /* 002 */ return new GeneratedIterator(references); /* 003 */ } /* 004 */ /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator { /* 006 */ private Object[] references; /* 007 */ private scala.collection.Iterator inputadapter_input; /* 008 */ private java.lang.String serializefromobject_errMsg; /* 009 */ private java.lang.String serializefromobject_errMsg1; /* 010 */ private UnsafeRow serializefromobject_result; /* 011 */ private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder; /* 012 */ private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter; /* 013 */ /* 014 */ public GeneratedIterator(Object[] references) { /* 015 */ this.references = references; /* 016 */ } /* 017 */ /* 018 */ public void init(int index, scala.collection.Iterator inputs[]) { /* 019 */ partitionIndex = index; /* 020 */ inputadapter_input = inputs[0]; /* 021 */ 
this.serializefromobject_errMsg = (java.lang.String) references[0]; /* 022 */ this.serializefromobject_errMsg1 = (java.lang.String) references[1]; /* 023 */ serializefromobject_result = new UnsafeRow(1); /* 024 */ this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 32); /* 025 */ this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1); /* 026 */ } /* 027 */ /* 028 */ protected void processNext() throws java.io.IOException { /* 029 */ while (inputadapter_input.hasNext()) { /* 030 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 031 */ org.apache.spark.sql.Row inputadapter_value = (org.apache.spark.sql.Row)inputadapter_row.get(0, null); /* 032 */ /* 033 */ if (false) { /* 034 */ throw new RuntimeException(serializefromobject_errMsg); /* 035 */ } /* 036 */ /* 037 */ boolean serializefromobject_isNull1 = false || false; /* 038 */ final boolean serializefromobject_value1 = serializefromobject_isNull1 ? 
false : inputadapter_value.isNullAt(0); /* 039 */ boolean serializefromobject_isNull = false; /* 040 */ UTF8String serializefromobject_value = null; /* 041 */ if (!serializefromobject_isNull1 && serializefromobject_value1) { /* 042 */ final UTF8String serializefromobject_value5 = null; /* 043 */ serializefromobject_isNull = true; /* 044 */ serializefromobject_value = serializefromobject_value5; /* 045 */ } else { /* 046 */ if (false) { /* 047 */ throw new RuntimeException(serializefromobject_errMsg1); /* 048 */ } /* 049 */ /* 050 */ if (false) { /* 051 */ throw new RuntimeException("The input external row cannot be null."); /* 052 */ } /* 053 */ /* 054 */ if (inputadapter_value.isNullAt(0)) { /* 055 */ throw new RuntimeException("The 0th field 'regexp_replace(a, (\d+), num)' of input row " + /* 056 */ "cannot be null."); /* 057 */ } /* 058 */ /* 059 */ final Object serializefromobject_value8 = inputadapter_value.get(0); /* 060 */ java.lang.String serializefromobject_value7 = null; /* 061 */ if (!false) { /* 062 */ if (serializefromobject_value8 instanceof java.lang.String) { /* 063 */ serializefromobject_value7 = (java.lang.String) serializefromobject_value8; /* 064 */ } else { /* 065 */ throw new RuntimeException(serializefromobject_value8.getClass().getName() + " is not a valid " + /* 066 */ "external type for schema of string"); /* 067 */ } /* 068 */ } /* 069 */ boolean
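The compile failure above comes from splicing the unescaped field name `regexp_replace(a, (\d+), num)` into a Java string literal at line 55 of the generated code: the `\d` becomes an invalid Java escape sequence. A minimal sketch of the kind of escaping needed (hypothetical helper, not Spark's actual {{GetExternalRowField}} fix):

```java
// Escape a field name before embedding it in generated Java source as a
// string literal. Without this, a name containing a backslash (such as
// a regex in a derived column name) yields "Invalid escape sequence"
// when Janino compiles the generated file.
public class FieldNameEscaping {
    public static String escapeJavaString(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '\\') sb.append("\\\\");      // backslash -> double backslash
            else if (c == '"') sb.append("\\\"");  // quote -> escaped quote
            else sb.append(c);
        }
        return sb.toString();
    }
}
```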
[jira] [Commented] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
[ https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428677#comment-15428677 ] Steve Loughran commented on SPARK-17159: # The most minimal change is to get rid of that directoryFilter and apply it as a filter on the returned list of generated FileStatus entries. Against s3a, that will eliminate 1-4 HTTP calls per path that matches the pattern. It will mean a larger array of FileStatus entries is returned, but otherwise has no extra cost. # It *may* be possible to go further and have the glob also pick up all files underneath. That may or may not provide a speedup. # The listStatus() operation can be sped up by executing its file filter after the listing, when a FileStatus entry already exists: there is no need to go back to the FS to get the modification time. > Improve FileInputDStream.findNewFiles list performance > -- > > Key: SPARK-17159 > URL: https://issues.apache.org/jira/browse/SPARK-17159 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 > Environment: spark against object stores >Reporter: Steve Loughran >Priority: Minor > > {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that > calls getFileStatus() on every file, takes the output and does listStatus() > on the output. > This is going to suffer on object stores, as dir listings and getFileStatus calls > are so expensive. It's clear this is a problem, as the method has code to > detect timeouts in the window and warn of problems. > It should be possible to make this faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
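The first suggestion in the comment above, filtering the already-returned FileStatus entries instead of filtering inside the glob, can be sketched as follows. The names are illustrative stand-ins (Hadoop's real class is org.apache.hadoop.fs.FileStatus), not Spark's actual FileInputDStream code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// List once, then filter in memory: the FileStatus entries already carry
// the modification time, so no extra getFileStatus() round trips (each
// 1-4 HTTP calls on s3a) are issued per candidate path.
public class FindNewFiles {
    public static class FileStatus {
        public final String path;
        public final long modificationTime;
        public FileStatus(String path, long modificationTime) {
            this.path = path;
            this.modificationTime = modificationTime;
        }
    }

    public static List<FileStatus> newFiles(List<FileStatus> listing,
                                            Predicate<String> pattern,
                                            long modTimeThreshold) {
        List<FileStatus> result = new ArrayList<>();
        for (FileStatus fs : listing) {
            // Both checks run against in-memory data; the filesystem is
            // never contacted again per entry.
            if (pattern.test(fs.path) && fs.modificationTime >= modTimeThreshold) {
                result.add(fs);
            }
        }
        return result;
    }
}
```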
[jira] [Created] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance
Steve Loughran created SPARK-17159: -- Summary: Improve FileInputDStream.findNewFiles list performance Key: SPARK-17159 URL: https://issues.apache.org/jira/browse/SPARK-17159 Project: Spark Issue Type: Improvement Components: Streaming Affects Versions: 2.0.0 Environment: spark against object stores Reporter: Steve Loughran Priority: Minor {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that calls getFileStatus() on every file, takes the output and does listStatus() on the output. This is going to suffer on object stores, as dir listings and getFileStatus calls are so expensive. It's clear this is a problem, as the method has code to detect timeouts in the window and warn of problems. It should be possible to make this faster. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10746) count ( distinct columnref) over () returns wrong result set
[ https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428617#comment-15428617 ] Dongjoon Hyun commented on SPARK-10746: --- Just as an update, Spark 2.0 now raises an exception for this case with an explicit error message "Distinct window functions are not supported". > count ( distinct columnref) over () returns wrong result set > > > Key: SPARK-10746 > URL: https://issues.apache.org/jira/browse/SPARK-10746 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell > > Same issue as reported against Hive (HIVE-9534). > The result set was expected to contain 5 rows instead of 1 row, as other vendors > (Oracle, Netezza, etc.) would return. > select count( distinct column) over () from t1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17113) Job failure due to Executor OOM in offheap mode
[ https://issues.apache.org/jira/browse/SPARK-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu updated SPARK-17113: --- Assignee: Sital Kedia > Job failure due to Executor OOM in offheap mode > --- > > Key: SPARK-17113 > URL: https://issues.apache.org/jira/browse/SPARK-17113 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 2.0.1, 2.1.0 > > > We have been seeing many job failure due to executor OOM with following stack > trace - > {code} > java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:271) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:271) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:271) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:271) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Digging into the code, we found out that this is an issue with cooperative > memory management for off heap memory allocation. > In the code > https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L463, > when the UnsafeExternalSorter is checking if memory page is being used by > upstream, the base object in case of off heap memory is always null so the > UnsafeExternalSorter does not spill the memory pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17113) Job failure due to Executor OOM in offheap mode
[ https://issues.apache.org/jira/browse/SPARK-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17113. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > Job failure due to Executor OOM in offheap mode > --- > > Key: SPARK-17113 > URL: https://issues.apache.org/jira/browse/SPARK-17113 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0 >Reporter: Sital Kedia >Assignee: Sital Kedia > Fix For: 2.0.1, 2.1.0 > > > We have been seeing many job failure due to executor OOM with following stack > trace - > {code} > java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0 > at > org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90) > at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:271) > at > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:271) > at > 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:271) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:271) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Digging into the code, we found out that this is an issue with cooperative > memory management for off heap memory allocation. > In the code > https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L463, > when the UnsafeExternalSorter is checking if memory page is being used by > upstream, the base object in case of off heap memory is always null so the > UnsafeExternalSorter does not spill the memory pages. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
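The page-ownership check described in the last paragraph of the issue above can be modeled in miniature (hypothetical names; the real logic lives in UnsafeExternalSorter): with off-heap allocation every page's base object is null, so a base-object comparison treats distinct pages as identical and spilling is skipped.

```java
// Simplified model of why the off-heap check misbehaves: all off-heap
// pages carry a null base object, so comparing base objects yields a
// false "same page" for any pair. Comparing page identity instead does
// not have that problem.
public class PageOwnership {
    public static class MemoryPage {
        public final int pageNumber;
        public final Object baseObject; // always null for off-heap pages
        public MemoryPage(int pageNumber, Object baseObject) {
            this.pageNumber = pageNumber;
            this.baseObject = baseObject;
        }
    }

    // Buggy off-heap comparison: null == null for every pair of pages.
    public static boolean sameByBaseObject(MemoryPage a, MemoryPage b) {
        return a.baseObject == b.baseObject;
    }

    // Safer comparison by page identity.
    public static boolean sameByPageNumber(MemoryPage a, MemoryPage b) {
        return a.pageNumber == b.pageNumber;
    }
}
```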
[jira] [Assigned] (SPARK-17158) Improve error message for numeric literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17158: Assignee: Apache Spark > Improve error message for numeric literal parsing > - > > Key: SPARK-17158 > URL: https://issues.apache.org/jira/browse/SPARK-17158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Srinath >Assignee: Apache Spark >Priority: Minor > > Spark currently gives confusing and inconsistent error messages for numeric > literals. For example: > scala> sql("select 123456Y") > org.apache.spark.sql.catalyst.parser.ParseException: > Value out of range. Value:"123456" Radix:10(line 1, pos 7) > == SQL == > select 123456Y > ---^^^ > scala> sql("select 123456S") > org.apache.spark.sql.catalyst.parser.ParseException: > Value out of range. Value:"123456" Radix:10(line 1, pos 7) > == SQL == > select 123456S > ---^^^ > scala> sql("select 12345623434523434564565L") > org.apache.spark.sql.catalyst.parser.ParseException: > For input string: "12345623434523434564565"(line 1, pos 7) > == SQL == > select 12345623434523434564565L > ---^^^ > The problem is that we are relying on JDK's implementations for parsing, and > those functions throw different error messages. This code can be found in > AstBuilder.numericLiteral function. > The proposal is that instead of using `_.toByte` to turn a string into a > byte, we always turn the numeric literal string into a BigDecimal, and then > we validate the range before turning it into a numeric value. This way, we > have more control over the data. > If BigDecimal fails to parse the number, we should throw a better exception > than "For input string ...". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17158) Improve error message for numeric literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17158: Assignee: (was: Apache Spark) > Improve error message for numeric literal parsing > - > > Key: SPARK-17158 > URL: https://issues.apache.org/jira/browse/SPARK-17158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Srinath >Priority: Minor > > Spark currently gives confusing and inconsistent error messages for numeric > literals. For example: > scala> sql("select 123456Y") > org.apache.spark.sql.catalyst.parser.ParseException: > Value out of range. Value:"123456" Radix:10(line 1, pos 7) > == SQL == > select 123456Y > ---^^^ > scala> sql("select 123456S") > org.apache.spark.sql.catalyst.parser.ParseException: > Value out of range. Value:"123456" Radix:10(line 1, pos 7) > == SQL == > select 123456S > ---^^^ > scala> sql("select 12345623434523434564565L") > org.apache.spark.sql.catalyst.parser.ParseException: > For input string: "12345623434523434564565"(line 1, pos 7) > == SQL == > select 12345623434523434564565L > ---^^^ > The problem is that we are relying on JDK's implementations for parsing, and > those functions throw different error messages. This code can be found in > AstBuilder.numericLiteral function. > The proposal is that instead of using `_.toByte` to turn a string into a > byte, we always turn the numeric literal string into a BigDecimal, and then > we validate the range before turning it into a numeric value. This way, we > have more control over the data. > If BigDecimal fails to parse the number, we should throw a better exception > than "For input string ...". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17158) Improve error message for numeric literal parsing
[ https://issues.apache.org/jira/browse/SPARK-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428587#comment-15428587 ] Apache Spark commented on SPARK-17158: -- User 'srinathshankar' has created a pull request for this issue: https://github.com/apache/spark/pull/14721 > Improve error message for numeric literal parsing > - > > Key: SPARK-17158 > URL: https://issues.apache.org/jira/browse/SPARK-17158 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Srinath >Priority: Minor > > Spark currently gives confusing and inconsistent error messages for numeric > literals. For example: > scala> sql("select 123456Y") > org.apache.spark.sql.catalyst.parser.ParseException: > Value out of range. Value:"123456" Radix:10(line 1, pos 7) > == SQL == > select 123456Y > ---^^^ > scala> sql("select 123456S") > org.apache.spark.sql.catalyst.parser.ParseException: > Value out of range. Value:"123456" Radix:10(line 1, pos 7) > == SQL == > select 123456S > ---^^^ > scala> sql("select 12345623434523434564565L") > org.apache.spark.sql.catalyst.parser.ParseException: > For input string: "12345623434523434564565"(line 1, pos 7) > == SQL == > select 12345623434523434564565L > ---^^^ > The problem is that we are relying on JDK's implementations for parsing, and > those functions throw different error messages. This code can be found in > AstBuilder.numericLiteral function. > The proposal is that instead of using `_.toByte` to turn a string into a > byte, we always turn the numeric literal string into a BigDecimal, and then > we validate the range before turning it into a numeric value. This way, we > have more control over the data. > If BigDecimal fails to parse the number, we should throw a better exception > than "For input string ...". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-13286) JDBC driver doesn't report full exception
[ https://issues.apache.org/jira/browse/SPARK-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-13286: -- Assignee: Davies Liu > JDBC driver doesn't report full exception > - > > Key: SPARK-13286 > URL: https://issues.apache.org/jira/browse/SPARK-13286 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.0 >Reporter: Adrian Bridgett >Assignee: Davies Liu >Priority: Minor > > When testing some failure scenarios (inserting data into PostgreSQL where there is > a schema mismatch), an exception is thrown (fine so far); however, it > doesn't report the actual SQL error. It refers to a getNextException call, > but this is beyond my non-existent Java skills to deal with correctly. > Supporting this would help users see the SQL error quickly and resolve the > underlying problem. > {noformat} > Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO core > VALUES('5fdf5...',) was aborted. Call getNextException to see the cause. 
> at > org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746) > at > org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457) > at > org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887) > at > org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405) > at > org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:185) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:248) > at > org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:247) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at > org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:89) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
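The improvement this issue asks for amounts to walking the getNextException() chain before surfacing the error. A minimal sketch (hypothetical helper, not Spark's actual JdbcUtils code):

```java
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Collect every message in a SQLException chain. BatchUpdateException
// (a SQLException subclass) hides the real driver error behind
// getNextException(), which is why the report above only shows
// "Call getNextException to see the cause".
public class SqlErrors {
    public static List<String> chainedMessages(SQLException e) {
        List<String> msgs = new ArrayList<>();
        for (SQLException cur = e; cur != null; cur = cur.getNextException()) {
            msgs.add(cur.getMessage());
        }
        return msgs;
    }
}
```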
[jira] [Created] (SPARK-17158) Improve error message for numeric literal parsing
Srinath created SPARK-17158: --- Summary: Improve error message for numeric literal parsing Key: SPARK-17158 URL: https://issues.apache.org/jira/browse/SPARK-17158 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Srinath Priority: Minor Spark currently gives confusing and inconsistent error messages for numeric literals. For example: scala> sql("select 123456Y") org.apache.spark.sql.catalyst.parser.ParseException: Value out of range. Value:"123456" Radix:10(line 1, pos 7) == SQL == select 123456Y ---^^^ scala> sql("select 123456S") org.apache.spark.sql.catalyst.parser.ParseException: Value out of range. Value:"123456" Radix:10(line 1, pos 7) == SQL == select 123456S ---^^^ scala> sql("select 12345623434523434564565L") org.apache.spark.sql.catalyst.parser.ParseException: For input string: "12345623434523434564565"(line 1, pos 7) == SQL == select 12345623434523434564565L ---^^^ The problem is that we are relying on JDK's implementations for parsing, and those functions throw different error messages. This code can be found in AstBuilder.numericLiteral function. The proposal is that instead of using `_.toByte` to turn a string into a byte, we always turn the numeric literal string into a BigDecimal, and then we validate the range before turning it into a numeric value. This way, we have more control over the data. If BigDecimal fails to parse the number, we should throw a better exception than "For input string ...". -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
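The proposal in the description can be sketched as follows (names are illustrative, not Spark's actual AstBuilder code): parse through BigDecimal first, then range-check before narrowing, so an out-of-range TINYINT literal such as 123456Y produces one uniform, readable message instead of the JDK's "Value out of range ... Radix:10".

```java
import java.math.BigDecimal;

// Parse a numeric literal via BigDecimal and validate the target range
// before narrowing to the concrete numeric type.
public class NumericLiteral {
    public static byte toByteLiteral(String raw) {
        BigDecimal v = new BigDecimal(raw);
        if (v.compareTo(BigDecimal.valueOf(Byte.MIN_VALUE)) < 0
                || v.compareTo(BigDecimal.valueOf(Byte.MAX_VALUE)) > 0) {
            // One controlled message, instead of whatever the JDK's
            // Byte.parseByte or Long.parseLong would have thrown.
            throw new IllegalArgumentException(
                "Numeric literal " + raw + " does not fit in range ["
                + Byte.MIN_VALUE + ", " + Byte.MAX_VALUE + "] for type TINYINT");
        }
        return v.byteValueExact();
    }
}
```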
[jira] [Updated] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled
[ https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-15382: Fix Version/s: 2.1.0 2.0.1 > monotonicallyIncreasingId doesn't work when data is upsampled > - > > Key: SPARK-15382 > URL: https://issues.apache.org/jira/browse/SPARK-15382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Mateusz Buśkiewicz > Fix For: 2.0.1, 2.1.0 > > > Assigned ids are not unique > {code} > from pyspark.sql import Row > from pyspark.sql.functions import monotonicallyIncreasingId > hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, > 10.0).withColumn('id', monotonicallyIncreasingId()).collect() > {code} > Output: > {code} > [Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
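For context on the duplicated ids above: per the Spark documentation, monotonically_increasing_id places the partition index in the upper 31 bits and a per-partition record counter in the lower 33 bits. The sketch below reproduces the reported ids with a counter stuck at 0, which is consistent with the counter not advancing across upsampled copies of a row (a reading of the symptom, not a confirmed diagnosis):

```java
// Compose an id the way monotonically_increasing_id's documented scheme
// does: partition index in the upper 31 bits, per-partition record
// counter in the lower 33 bits.
public class MonotonicId {
    public static long id(int partitionIndex, long rowInPartition) {
        return ((long) partitionIndex << 33) | rowInPartition;
    }
}
```

With this scheme, 429496729600 decodes to partition 50, row 0, and 867583393792 to partition 101, row 0, matching the repeated values in the bug report.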
[jira] [Updated] (SPARK-16686) Dataset.sample with seed: result seems to depend on downstream usage
[ https://issues.apache.org/jira/browse/SPARK-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-16686: Fix Version/s: 2.0.1 > Dataset.sample with seed: result seems to depend on downstream usage > > > Key: SPARK-16686 > URL: https://issues.apache.org/jira/browse/SPARK-16686 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 > Environment: Spark 1.6.2 and Spark 2.0 - RC4 > Standalone > Single-worker cluster >Reporter: Joseph K. Bradley >Assignee: Liang-Chi Hsieh > Fix For: 2.0.1, 2.1.0 > > Attachments: DataFrame.sample bug - 2.0.html > > > Summary to reproduce bug: > * Create a DataFrame DF, and sample it with a fixed seed. > * Collect that DataFrame -> result1 > * Call a particular UDF on that DataFrame -> result2 > You would expect results 1 and 2 to use the same rows from DF, but they > appear not to. > Note: result1 and result2 are both deterministic. > See the attached notebook for details. Cells in the notebook were executed > in order. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers
[ https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428557#comment-15428557 ] Xusen Yin commented on SPARK-14381: --- I believe we can resolve this. > Review spark.ml parity for feature transformers > --- > > Key: SPARK-14381 > URL: https://issues.apache.org/jira/browse/SPARK-14381 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Joseph K. Bradley > > Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all > functionality. List all missing items. > This only covers Scala since we can compare Scala vs. Python in spark.ml > itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-10401) spark-submit --unsupervise
[ https://issues.apache.org/jira/browse/SPARK-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428572#comment-15428572 ] Michael Gummelt commented on SPARK-10401: - This should probably be a separate JIRA, but I'm just adding a note here that {{--kill}} doesn't seem to kill the job immediately. It invokes Mesos' {{killTask}} function, which runs a {{docker stop}} for docker images. This sends a SIGTERM, which seems to be ignored, then sends a SIGKILL after 10s, which ultimately kills the job. I'd like to find out why the SIGTERM is ignored. > spark-submit --unsupervise > --- > > Key: SPARK-10401 > URL: https://issues.apache.org/jira/browse/SPARK-10401 > Project: Spark > Issue Type: New Feature > Components: Deploy, Mesos >Affects Versions: 1.5.0 >Reporter: Alberto Miorin > > When I submit a streaming job with the option --supervise to the new mesos > spark dispatcher, I cannot decommission the job. > I tried spark-submit --kill, but dispatcher always restarts it. > Driver and Executors are both Docker containers. > I think there should be a subcommand spark-submit --unsupervise -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper
[ https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Miao Wang updated SPARK-17157: -- Component/s: SparkR > Add multiclass logistic regression SparkR Wrapper > - > > Key: SPARK-17157 > URL: https://issues.apache.org/jira/browse/SPARK-17157 > Project: Spark > Issue Type: New Feature > Components: SparkR >Reporter: Miao Wang > > [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to Master. I open this JIRA for discussion of adding SparkR wrapper > for multiclass logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
[ https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428519#comment-15428519 ] Apache Spark commented on SPARK-12868: -- User 'Parth-Brahmbhatt' has created a pull request for this issue: https://github.com/apache/spark/pull/14720 > ADD JAR via sparkSQL JDBC will fail when using a HDFS URL > - > > Key: SPARK-12868 > URL: https://issues.apache.org/jira/browse/SPARK-12868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0 >Reporter: Trystan Leftwich > > When trying to add a jar with a HDFS URI, i.E > {code:sql} > ADD JAR hdfs:///tmp/foo.jar > {code} > Via the spark sql JDBC interface it will fail with: > {code:sql} > java.net.MalformedURLException: unknown protocol: hdfs > at java.net.URL.(URL.java:593) > at java.net.URL.(URL.java:483) > at java.net.URL.(URL.java:432) > at java.net.URI.toURL(URI.java:1089) > at > org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578) > at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652) > at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58) > at > org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56) > at > org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55) > at 
org.apache.spark.sql.DataFrame.(DataFrame.scala:145) > at org.apache.spark.sql.DataFrame.(DataFrame.scala:130) > at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52) > at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628) > at > org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
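The failure comes from {{java.net.URL}}, which resolves a protocol handler at construction time; the stock JVM registers none for {{hdfs}}, whereas {{java.net.URI}} merely parses the string. (Hadoop ships {{org.apache.hadoop.fs.FsUrlStreamHandlerFactory}} to register such handlers, though whether the linked PR takes that route is not stated here.) The same parse-vs-open distinction can be illustrated with Python's standard library — an analogy only, not Spark code:

```python
from urllib.parse import urlparse
from urllib.request import urlopen
from urllib.error import URLError

# Parsing only inspects the string, so any scheme is accepted --
# analogous to java.net.URI, which needs no protocol handler.
parsed = urlparse("hdfs:///tmp/foo.jar")
print(parsed.scheme, parsed.path)  # hdfs /tmp/foo.jar

# Opening the URL requires a handler registered for the scheme, and
# none exists for hdfs -- analogous to java.net.URL, whose constructor
# throws MalformedURLException: unknown protocol: hdfs.
try:
    urlopen("hdfs:///tmp/foo.jar")
except URLError as e:
    print(e.reason)  # no handler registered for the hdfs scheme
```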
[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper
[ https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428518#comment-15428518 ] Miao Wang commented on SPARK-17157: --- [~felixcheung] Shall we add it to SparkR? I opened this JIRA for discussion. Thanks! > Add multiclass logistic regression SparkR Wrapper > - > > Key: SPARK-17157 > URL: https://issues.apache.org/jira/browse/SPARK-17157 > Project: Spark > Issue Type: New Feature >Reporter: Miao Wang > > [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to master. I opened this JIRA for discussion of adding a SparkR > wrapper for multiclass logistic regression.
[jira] [Created] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper
Miao Wang created SPARK-17157: - Summary: Add multiclass logistic regression SparkR Wrapper Key: SPARK-17157 URL: https://issues.apache.org/jira/browse/SPARK-17157 Project: Spark Issue Type: New Feature Reporter: Miao Wang [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been merged to master. I opened this JIRA for discussion of adding a SparkR wrapper for multiclass logistic regression.
[jira] [Commented] (SPARK-17156) Add multiclass logistic regression Scala Example
[ https://issues.apache.org/jira/browse/SPARK-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428509#comment-15428509 ] Miao Wang commented on SPARK-17156: --- I will submit a PR soon. > Add multiclass logistic regression Scala Example > > > Key: SPARK-17156 > URL: https://issues.apache.org/jira/browse/SPARK-17156 > Project: Spark > Issue Type: Task > Components: ML >Reporter: Miao Wang > > As [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been > merged to master, we should add a Scala example of using multiclass logistic > regression.
[jira] [Created] (SPARK-17156) Add multiclass logistic regression Scala Example
Miao Wang created SPARK-17156: - Summary: Add multiclass logistic regression Scala Example Key: SPARK-17156 URL: https://issues.apache.org/jira/browse/SPARK-17156 Project: Spark Issue Type: Task Components: ML Reporter: Miao Wang As [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been merged to master, we should add a Scala example of using multiclass logistic regression.
[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikael Valot updated SPARK-17155: - Description: The following code throws an exception in the spark shell: {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} It looks like a different ClassLoader is involved and cannot load my case class. However it works fine with a Tuple: {code:java} val fut = Future{ Seq((1, 2)).toDS() } Await.result(fut, Duration.Inf).show() +---+---+ | _1| _2| +---+---+ | 1| 2| +---+---+ {code} was: The following code throws an exception in the spark shell: {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at
[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikael Valot updated SPARK-17155: - Description: The following code throws an exception in the spark shell: {code:scala} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} was: The following code throws an exception in the spark shell: {{ case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() }} {{ scala.reflect.internal.MissingRequirementError: object $line8.$read not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) at
[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError
[ https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikael Valot updated SPARK-17155: - Description: The following code throws an exception in the spark shell: {code:java} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} was: The following code throws an exception in the spark shell: {code:scala} case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() {code} {code} scala.reflect.internal.MissingRequirementError: object $line8.$read not found. 
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) at
[jira] [Created] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError
Mikael Valot created SPARK-17155: Summary: usage of a Dataset inside a Future throws MissingRequirementError Key: SPARK-17155 URL: https://issues.apache.org/jira/browse/SPARK-17155 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 1.6.1 Reporter: Mikael Valot The following code throws an exception in the spark shell: {{ case class A(i1: Int, i2: Int) import scala.concurrent.ExecutionContext.Implicits.global import scala.concurrent.Future import scala.concurrent.Await import scala.concurrent.duration.Duration import sqlContext.implicits._ import org.apache.spark.sql.functions._ val fut = Future{ Seq(A(1, 2)).toDS() } Await.result(fut, Duration.Inf).show() }} {{ scala.reflect.internal.MissingRequirementError: object $line8.$read not found. at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16) at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17) at scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161) at scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231) at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231) at org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654) at org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30) at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52) at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53) at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41) 
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70) at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24) at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24) at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) }} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
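The stack trace shows Scala reflection failing to locate {{$line8.$read}}, the synthetic wrapper object the spark-shell generates for each input line: the encoder for {{A}} is derived lazily inside the Future's worker thread, where the reflection mirror's classloader cannot resolve the REPL's wrappers (a plain Tuple, needing no REPL-defined class, avoids the lookup). Commonly suggested workarounds for REPL classloader issues of this kind are to define the case class in compiled code on the classpath, or to build the Dataset on the main thread before handing it to the Future. A loose Python analogy — a sketch only, not Spark code — is pickle failing to look up a locally defined class by its qualified name:

```python
import pickle

def repl_like_scope():
    # A class defined in a local scope, loosely analogous to a case
    # class wrapped inside the Scala REPL's synthetic $read object.
    class A:
        def __init__(self, i1, i2):
            self.i1, self.i2 = i1, i2
    return A

A = repl_like_scope()

try:
    # Serializing the instance requires resolving the class by the
    # qualified name 'repl_like_scope.<locals>.A', which fails --
    # analogous to the mirror not finding $line8.$read.
    pickle.dumps(A(1, 2))
except (AttributeError, pickle.PicklingError) as e:
    print(type(e).__name__)
```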
[jira] [Closed] (SPARK-16152) `In` predicate does not work with null values
[ https://issues.apache.org/jira/browse/SPARK-16152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-16152. - Resolution: Invalid Hi, [~fushar]. This seems to be a SQL question. [~kevinyu98] is right. Spark/PostgreSQL/MySQL are consistent with this. `NULL IN (NULL)` is NULL. Please run the following query. The result is also `TRUE` for the above SQL engines. {code} SELECT (NULL IN (NULL)) IS NULL {code} > `In` predicate does not work with null values > - > > Key: SPARK-16152 > URL: https://issues.apache.org/jira/browse/SPARK-16152 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Ashar Fuadi > > According to > https://github.com/apache/spark/blob/v1.6.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L134..L136: > {code} > override def eval(input: InternalRow): Any = { > val evaluatedValue = value.eval(input) > if (evaluatedValue == null) { > null > } else { > ... > {code} > we always return {{null}} when the current value is null, ignoring the > elements of {{list}}. Therefore, we cannot have a predicate which tests > whether a column contains values in e.g. {{[1, 2, 3, null]}} > Is this a bug, or is this actually the expected behavior? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
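The three-valued semantics described in the closing comment can be reproduced with any SQL engine that follows the standard; the sketch below uses Python's built-in sqlite3 purely for illustration (the table {{t}} and its values are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NULL IN (NULL) evaluates to NULL (unknown), surfaced as None by the
# Python driver -- it is neither TRUE nor FALSE.
(result,) = conn.execute("SELECT NULL IN (NULL)").fetchone()
print(result)  # None

# Wrapping it in IS NULL yields TRUE, matching the query in the comment.
(is_null,) = conn.execute("SELECT (NULL IN (NULL)) IS NULL").fetchone()
print(is_null)  # 1

# The practical consequence for the reporter: a NULL in the IN-list can
# never make the predicate TRUE, so rows where a is NULL (or unmatched)
# are filtered out of a WHERE clause.
conn.execute("CREATE TABLE t (a INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (4,), (None,)])
rows = conn.execute("SELECT a FROM t WHERE a IN (1, 2, 3, NULL)").fetchall()
print(rows)  # [(1,)]
```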
[jira] [Closed] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled
[ https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin closed SPARK-15382. --- Resolution: Fixed > monotonicallyIncreasingId doesn't work when data is upsampled > - > > Key: SPARK-15382 > URL: https://issues.apache.org/jira/browse/SPARK-15382 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.1 >Reporter: Mateusz Buśkiewicz > > Assigned ids are not unique > {code} > from pyspark.sql import Row > from pyspark.sql.functions import monotonicallyIncreasingId > hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, > 10.0).withColumn('id', monotonicallyIncreasingId()).collect() > {code} > Output: > {code} > [Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=1, id=429496729600), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792), > Row(a=2, id=867583393792)] > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
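The duplicated ids are consistent with the documented layout of {{monotonicallyIncreasingId}}: the partition ID goes in the upper 31 bits and the record number within the partition in the lower 33 bits. Decoding the reported values (a sketch based on that documented layout):

```python
# monotonically_increasing_id() documents its output as
#   (partition_id << 33) | record_number_within_partition
RECORD_BITS = 33

def decode(mid):
    """Split a monotonically-increasing id into (partition, record)."""
    return divmod(mid, 1 << RECORD_BITS)

print(decode(429496729600))   # (50, 0)
print(decode(867583393792))   # (101, 0)
```

Both groups decode to record 0 of partitions 50 and 101, which suggests the id expression was evaluated once per source row and its result reused for every upsampled copy.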
[jira] [Resolved] (SPARK-16197) Cleanup PySpark status api and example
[ https://issues.apache.org/jira/browse/SPARK-16197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-16197. -- Resolution: Won't Fix This minor change would be better addressed during a QA audit. > Cleanup PySpark status api and example > -- > > Key: SPARK-16197 > URL: https://issues.apache.org/jira/browse/SPARK-16197 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Bryan Cutler >Priority: Trivial > > Cleanup of the Status API example to use SparkSession and be more consistent > with other examples. > Changing this JIRA to just clean up the example; the other changes are just > code style, which is not really followed in other areas of PySpark. > -also noticed that Status defines two empty classes without using 'pass' and > two methods that do not return 'None' explicitly if requested info cannot be > fetched. These issues do not cause any errors, but it is good practice to > use 'pass' on empty class definitions and return 'None' for a function if > the caller is expecting a return value.-
[jira] [Updated] (SPARK-15018) PySpark ML Pipeline raises unclear error when no stages set
[ https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-15018: - Description: When fitting a PySpark Pipeline with no stages, it should work as an identity transformer. Instead, the following error is raised: {noformat} Traceback (most recent call last): File "./spark/python/pyspark/ml/base.py", line 64, in fit return self._fit(dataset) File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit for stage in stages: TypeError: 'NoneType' object is not iterable {noformat} The param {{stages}} needs to be an empty list and {{getStages}} should call {{getOrDefault}}. Also, since the default value of {{None}} is then changed to an empty list {{[]}}, this never changes the value if passed in as a keyword argument. Instead, the {{kwargs}} value should be changed directly if {{stages is None}}. For example {noformat} if stages is None: stages = [] {noformat} should be this {noformat} if stages is None: kwargs['stages'] = [] {noformat} However, since there is no default value in the Scala implementation, assigning a default here is not needed and should be cleaned up. The pydocs should better indicate that stages is required to be a list. was: When fitting a PySpark Pipeline with no stages, it should work as an identity transformer. Instead, the following error is raised: {noformat} Traceback (most recent call last): File "./spark/python/pyspark/ml/base.py", line 64, in fit return self._fit(dataset) File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit for stage in stages: TypeError: 'NoneType' object is not iterable {noformat} The param {{stages}} needs to be an empty list and {{getStages}} should call {{getOrDefault}}. Also, since the default value of {{None}} is then changed to an empty list {{[]}}, this never changes the value if passed in as a keyword argument. Instead, the {{kwargs}} value should be changed directly if {{stages is None}}. 
For example {noformat} if stages is None: stages = [] {noformat} should be this {noformat} if stages is None: kwargs['stages'] = [] {noformat} However, since there is no default value in the Scala implementation, assigning a default here is not needed and should be cleaned up. > PySpark ML Pipeline raises unclear error when no stages set > --- > > Key: SPARK-15018 > URL: https://issues.apache.org/jira/browse/SPARK-15018 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > When fitting a PySpark Pipeline with no stages, it should work as an identity > transformer. Instead, the following error is raised: > {noformat} > Traceback (most recent call last): > File "./spark/python/pyspark/ml/base.py", line 64, in fit > return self._fit(dataset) > File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit > for stage in stages: > TypeError: 'NoneType' object is not iterable > {noformat} > The param {{stages}} needs to be an empty list and {{getStages}} should call > {{getOrDefault}}. > Also, since the default value of {{None}} is then changed to an empty list > {{[]}}, this never changes the value if passed in as a keyword argument. > Instead, the {{kwargs}} value should be changed directly if {{stages is > None}}. > For example > {noformat} > if stages is None: > stages = [] > {noformat} > should be this > {noformat} > if stages is None: > kwargs['stages'] = [] > {noformat} > However, since there is no default value in the Scala implementation, > assigning a default here is not needed and should be cleaned up. The pydocs > should better indicate that stages is required to be a list.
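The kwargs point in the description can be isolated from Spark entirely: PySpark's param setters capture the caller's keyword arguments (via {{self._input_kwargs}}) before the method body runs, so rebinding the local name afterwards never reaches the captured dict. A minimal sketch — the function names here are illustrative, not PySpark's actual API:

```python
def capture_kwargs(stages=None):
    # Stand-in for PySpark's keyword-argument capture: the dict is
    # snapshotted before any fix-up of the locals happens.
    kwargs = {"stages": stages}

    # Buggy pattern from the description: rebinds the local name only,
    # leaving the captured dict untouched.
    if stages is None:
        stages = []
    return kwargs  # still {'stages': None}

def capture_kwargs_fixed(stages=None):
    kwargs = {"stages": stages}
    # Fixed pattern: mutate the captured dict itself.
    if stages is None:
        kwargs["stages"] = []
    return kwargs

print(capture_kwargs())        # {'stages': None}
print(capture_kwargs_fixed())  # {'stages': []}
```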
[jira] [Updated] (SPARK-15018) PySpark ML Pipeline raises unclear error when no stages set
[ https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-15018: - Description: When fitting a PySpark Pipeline with no stages, it should work as an identity transformer. Instead the following error is raised: {noformat} Traceback (most recent call last): File "./spark/python/pyspark/ml/base.py", line 64, in fit return self._fit(dataset) File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit for stage in stages: TypeError: 'NoneType' object is not iterable {noformat} The param {{stages}} needs to be an empty list and {{getStages}} should call {{getOrDefault}}. Also, since the default value is {{None}} is then changed to and empty list {{[]}}, this never changes the value if passed in as a keyword argument. Instead, the {{kwargs}} value should be changed directly if {{stages is None}}. For example {noformat} if stages is None: stages = [] {noformat} should be this {noformat} if stages is None: kwargs['stages'] = [] {noformat} However, since there is no default value in the Scala implementation, assigning a default here is not needed and should be cleaned up. was: When fitting a PySpark Pipeline with no stages, it should work as an identity transformer. Instead the following error is raised: {noformat} Traceback (most recent call last): File "./spark/python/pyspark/ml/base.py", line 64, in fit return self._fit(dataset) File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit for stage in stages: TypeError: 'NoneType' object is not iterable {noformat} The param {{stages}} should be added to the default param list and {{getStages}} should call {{getOrDefault}}. Also, since the default value is {{None}} is then changed to and empty list {{[]}}, this never changes the value if passed in as a keyword argument. Instead, the {{kwargs}} value should be changed directly if {{stages is None}}. 
For example {noformat} if stages is None: stages = [] {noformat} should be this {noformat} if stages is None: kwargs['stages'] = [] {noformat} > PySpark ML Pipeline raises unclear error when no stages set > --- > > Key: SPARK-15018 > URL: https://issues.apache.org/jira/browse/SPARK-15018 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > When fitting a PySpark Pipeline with no stages, it should work as an identity > transformer. Instead the following error is raised: > {noformat} > Traceback (most recent call last): > File "./spark/python/pyspark/ml/base.py", line 64, in fit > return self._fit(dataset) > File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit > for stage in stages: > TypeError: 'NoneType' object is not iterable > {noformat} > The param {{stages}} needs a default value of an empty list, and {{getStages}} > should call {{getOrDefault}}. > Also, since the default value {{None}} is only reassigned to an empty list > {{[]}} inside the method, the reassignment never updates the value that was passed in as a keyword argument. > Instead, the {{kwargs}} value should be changed directly when {{stages is > None}}. > For example > {noformat} > if stages is None: > stages = [] > {noformat} > should be this > {noformat} > if stages is None: > kwargs['stages'] = [] > {noformat} > However, since there is no default value in the Scala implementation, > assigning a default here is not needed and should be cleaned up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
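The difference between the two snippets above can be shown without Spark at all. The sketch below is a hypothetical stand-in for the pattern in pyspark.ml, where the keyword arguments are captured into a dict before any fix-up runs; `broken_init` and `fixed_init` are invented names for illustration:

```python
# Minimal pure-Python sketch (no Spark needed) of why rebinding a local
# parameter does not fix keyword arguments that were captured earlier.

def broken_init(stages=None):
    kwargs = {'stages': stages}   # captured before the fix-up, like _input_kwargs
    if stages is None:
        stages = []               # rebinds the local name only
    return kwargs                 # still {'stages': None}

def fixed_init(stages=None):
    kwargs = {'stages': stages}
    if stages is None:
        kwargs['stages'] = []     # mutates the captured dict directly
    return kwargs                 # {'stages': []}

print(broken_init())  # {'stages': None} -> iterating it later raises TypeError
print(fixed_init())   # {'stages': []}   -> iterating an empty list is a no-op
```

With the broken variant the captured value stays {{None}}, which is exactly what {{_fit}} later tries to iterate.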
[jira] [Updated] (SPARK-15018) PySpark ML Pipeline raises unclear error when no stages set
[ https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-15018: - Summary: PySpark ML Pipeline raises unclear error when no stages set (was: PySpark ML Pipeline fails when no stages set) > PySpark ML Pipeline raises unclear error when no stages set > --- > > Key: SPARK-15018 > URL: https://issues.apache.org/jira/browse/SPARK-15018 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > When fitting a PySpark Pipeline with no stages, it should work as an identity > transformer. Instead the following error is raised: > {noformat} > Traceback (most recent call last): > File "./spark/python/pyspark/ml/base.py", line 64, in fit > return self._fit(dataset) > File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit > for stage in stages: > TypeError: 'NoneType' object is not iterable > {noformat} > The param {{stages}} should be added to the default param list and > {{getStages}} should call {{getOrDefault}}. > Also, since the default value {{None}} is only reassigned to an empty list > {{[]}} inside the method, the reassignment never updates the value that was passed in as a keyword argument. > Instead, the {{kwargs}} value should be changed directly when {{stages is > None}}. > For example > {noformat} > if stages is None: > stages = [] > {noformat} > should be this > {noformat} > if stages is None: > kwargs['stages'] = [] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15018) PySpark ML Pipeline fails when no stages set
[ https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-15018: - Issue Type: Improvement (was: Bug) > PySpark ML Pipeline fails when no stages set > > > Key: SPARK-15018 > URL: https://issues.apache.org/jira/browse/SPARK-15018 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler > > When fitting a PySpark Pipeline with no stages, it should work as an identity > transformer. Instead the following error is raised: > {noformat} > Traceback (most recent call last): > File "./spark/python/pyspark/ml/base.py", line 64, in fit > return self._fit(dataset) > File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit > for stage in stages: > TypeError: 'NoneType' object is not iterable > {noformat} > The param {{stages}} should be added to the default param list and > {{getStages}} should call {{getOrDefault}}. > Also, since the default value {{None}} is only reassigned to an empty list > {{[]}} inside the method, the reassignment never updates the value that was passed in as a keyword argument. > Instead, the {{kwargs}} value should be changed directly when {{stages is > None}}. > For example > {noformat} > if stages is None: > stages = [] > {noformat} > should be this > {noformat} > if stages is None: > kwargs['stages'] = [] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15018) PySpark ML Pipeline fails when no stages set
[ https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler updated SPARK-15018: - Priority: Minor (was: Major) > PySpark ML Pipeline fails when no stages set > > > Key: SPARK-15018 > URL: https://issues.apache.org/jira/browse/SPARK-15018 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > > When fitting a PySpark Pipeline with no stages, it should work as an identity > transformer. Instead the following error is raised: > {noformat} > Traceback (most recent call last): > File "./spark/python/pyspark/ml/base.py", line 64, in fit > return self._fit(dataset) > File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit > for stage in stages: > TypeError: 'NoneType' object is not iterable > {noformat} > The param {{stages}} should be added to the default param list and > {{getStages}} should call {{getOrDefault}}. > Also, since the default value {{None}} is only reassigned to an empty list > {{[]}} inside the method, the reassignment never updates the value that was passed in as a keyword argument. > Instead, the {{kwargs}} value should be changed directly when {{stages is > None}}. > For example > {noformat} > if stages is None: > stages = [] > {noformat} > should be this > {noformat} > if stages is None: > kwargs['stages'] = [] > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations
[ https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17154: Assignee: Apache Spark > Wrong result can be returned or AnalysisException can be thrown after > self-join or similar operations > - > > Key: SPARK-17154 > URL: https://issues.apache.org/jira/browse/SPARK-17154 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Kousuke Saruta >Assignee: Apache Spark > > When we join two DataFrames that originate from the same DataFrame, > operations on the joined DataFrame can fail. > One reproducible example is as follows. > {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val joined = filtered.join(df, filtered("col1") === df("col1"), "inner") > val selected1 = joined.select(df("col3")) > {code} > In this case, an AnalysisException is thrown. > Another example is as follows. > {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), > "right") > val selected2 = rightOuterJoined.select(df("col1")) > selected2.show > {code} > In this case, we expect the following answer. > {code} > 1 > 2 > 3 > 4 > 5 > {code} > But the actual result is as follows. > {code} > 1 > 2 > null > 4 > 5 > {code} > The cause of the problems in these examples is that the logical plan for the > right-side DataFrame and the expressions of its output are re-created in > the analyzer (in the ResolveReferences rule) when a DataFrame has expressions > that share the same exprId. > The re-created expressions are equal to the original ones except for the exprId. 
> This happens when we do a self-join or a similar pattern of operations. > In the first example, df("col3") returns a Column that includes an > expression, and that expression has an exprId (say id1 here). > After the join, the expression held by the right-side DataFrame (df) is > re-created; the old and new expressions are equal but the exprId is renewed > (say id2 for the new exprId here). > Because of the mismatch of those exprIds, an AnalysisException is thrown. > In the second example, df("col1") returns a Column whose expression > is assigned an exprId (say id3). > On the other hand, the Column returned by filtered("col1") has an expression > with the same exprId (id3). > After the join, the expressions in the right-side DataFrame are re-created and > the expression assigned id3 is no longer present on the right side but > is still present on the left side. > So, when referring to df("col1") on the joined DataFrame, we get col1 of the left side, > which includes null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations
[ https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17154: Assignee: (was: Apache Spark) > Wrong result can be returned or AnalysisException can be thrown after > self-join or similar operations > - > > Key: SPARK-17154 > URL: https://issues.apache.org/jira/browse/SPARK-17154 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Kousuke Saruta > > When we join two DataFrames that originate from the same DataFrame, > operations on the joined DataFrame can fail. > One reproducible example is as follows. > {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val joined = filtered.join(df, filtered("col1") === df("col1"), "inner") > val selected1 = joined.select(df("col3")) > {code} > In this case, an AnalysisException is thrown. > Another example is as follows. > {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), > "right") > val selected2 = rightOuterJoined.select(df("col1")) > selected2.show > {code} > In this case, we expect the following answer. > {code} > 1 > 2 > 3 > 4 > 5 > {code} > But the actual result is as follows. > {code} > 1 > 2 > null > 4 > 5 > {code} > The cause of the problems in these examples is that the logical plan for the > right-side DataFrame and the expressions of its output are re-created in > the analyzer (in the ResolveReferences rule) when a DataFrame has expressions > that share the same exprId. > The re-created expressions are equal to the original ones except for the exprId. 
> This happens when we do a self-join or a similar pattern of operations. > In the first example, df("col3") returns a Column that includes an > expression, and that expression has an exprId (say id1 here). > After the join, the expression held by the right-side DataFrame (df) is > re-created; the old and new expressions are equal but the exprId is renewed > (say id2 for the new exprId here). > Because of the mismatch of those exprIds, an AnalysisException is thrown. > In the second example, df("col1") returns a Column whose expression > is assigned an exprId (say id3). > On the other hand, the Column returned by filtered("col1") has an expression > with the same exprId (id3). > After the join, the expressions in the right-side DataFrame are re-created and > the expression assigned id3 is no longer present on the right side but > is still present on the left side. > So, when referring to df("col1") on the joined DataFrame, we get col1 of the left side, > which includes null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations
[ https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428438#comment-15428438 ] Apache Spark commented on SPARK-17154: -- User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/14719 > Wrong result can be returned or AnalysisException can be thrown after > self-join or similar operations > - > > Key: SPARK-17154 > URL: https://issues.apache.org/jira/browse/SPARK-17154 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Kousuke Saruta > > When we join two DataFrames that originate from the same DataFrame, > operations on the joined DataFrame can fail. > One reproducible example is as follows. > {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val joined = filtered.join(df, filtered("col1") === df("col1"), "inner") > val selected1 = joined.select(df("col3")) > {code} > In this case, an AnalysisException is thrown. > Another example is as follows. > {code} > val df = Seq( > (1, "a", "A"), > (2, "b", "B"), > (3, "c", "C"), > (4, "d", "D"), > (5, "e", "E")).toDF("col1", "col2", "col3") > val filtered = df.filter("col1 != 3").select("col1", "col2") > val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), > "right") > val selected2 = rightOuterJoined.select(df("col1")) > selected2.show > {code} > In this case, we expect the following answer. > {code} > 1 > 2 > 3 > 4 > 5 > {code} > But the actual result is as follows. > {code} > 1 > 2 > null > 4 > 5 > {code} > The cause of the problems in these examples is that the logical plan for the > right-side DataFrame and the expressions of its output are re-created in > the analyzer (in the ResolveReferences rule) when a DataFrame has expressions > that share the same exprId. 
> The re-created expressions are equal to the original ones except for the exprId. > This happens when we do a self-join or a similar pattern of operations. > In the first example, df("col3") returns a Column that includes an > expression, and that expression has an exprId (say id1 here). > After the join, the expression held by the right-side DataFrame (df) is > re-created; the old and new expressions are equal but the exprId is renewed > (say id2 for the new exprId here). > Because of the mismatch of those exprIds, an AnalysisException is thrown. > In the second example, df("col1") returns a Column whose expression > is assigned an exprId (say id3). > On the other hand, the Column returned by filtered("col1") has an expression > with the same exprId (id3). > After the join, the expressions in the right-side DataFrame are re-created and > the expression assigned id3 is no longer present on the right side but > is still present on the left side. > So, when referring to df("col1") on the joined DataFrame, we get col1 of the left side, > which includes null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17135) Consolidate code in linear/logistic regression where possible
[ https://issues.apache.org/jira/browse/SPARK-17135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428401#comment-15428401 ] Gayathri Murali commented on SPARK-17135: - I can work on this > Consolidate code in linear/logistic regression where possible > - > > Key: SPARK-17135 > URL: https://issues.apache.org/jira/browse/SPARK-17135 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Priority: Minor > > There is shared code between MultinomialLogisticRegression, > LogisticRegression, and LinearRegression. We should consolidate where > possible. Also, we should move some code out of LogisticRegression.scala into > a separate util file or similar. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-13331) Spark network encryption optimization
[ https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reopened SPARK-13331: > Spark network encryption optimization > - > > Key: SPARK-13331 > URL: https://issues.apache.org/jira/browse/SPARK-13331 > Project: Spark > Issue Type: Improvement > Components: Deploy >Reporter: Dong Chen >Priority: Minor > > In network/common, SASL with DIGEST-MD5 authentication is used for > negotiating a secure communication channel. When the SASL operation mode is > "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 > mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The negotiation > procedure will select one of them to encrypt / decrypt the data on the > channel. > However, 3DES and RC4 are relatively slow. We could extend the > negotiation to support AES for better security and performance. > The proposed solution is: > When "auth-conf" is enabled, at the end of the original negotiation, the > authentication succeeds and a secure channel is built. We could add one more > negotiation step: the client and server negotiate whether they both support AES. > If so, the key and IV used by AES will be generated by the server and sent to the > client through the already-secure channel. Then the encryption / > decryption handlers are updated to AES on both the client and server side. Subsequent data > transfer will use AES instead of the original encryption algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
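The extra negotiation step proposed above might look roughly like the sketch below. This is a hypothetical outline only: Python's standard library has no AES, so `server_negotiate`, the cipher name, and the "send over the secure channel" step are placeholders for what a real implementation would do with a crypto library on the JVM side.

```python
# Sketch of the proposed post-auth negotiation step, assuming the SASL
# auth-conf channel is already established and encrypted.

import secrets

def server_negotiate(client_supports_aes: bool):
    """After SASL auth succeeds, the server offers AES if both sides support it."""
    if not client_supports_aes:
        return None  # keep the original DIGEST-MD5 cipher (3DES/DES/RC4)
    key = secrets.token_bytes(16)  # 128-bit AES key, generated by the server
    iv = secrets.token_bytes(16)   # 128-bit IV
    # key/iv would be sent to the client through the already-encrypted
    # channel; both sides then swap their encryption/decryption handlers.
    return {"cipher": "AES/CTR/NoPadding", "key": key, "iv": iv}

params = server_negotiate(True)
print(params["cipher"])                        # AES/CTR/NoPadding
print(len(params["key"]), len(params["iv"]))   # 16 16
```

The key point of the design is that the AES key material never crosses the wire in the clear: it rides inside the channel that DIGEST-MD5 auth-conf has already secured.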
[jira] [Created] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations
Kousuke Saruta created SPARK-17154: -- Summary: Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations Key: SPARK-17154 URL: https://issues.apache.org/jira/browse/SPARK-17154 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 1.6.2 Reporter: Kousuke Saruta When we join two DataFrames that originate from the same DataFrame, operations on the joined DataFrame can fail. One reproducible example is as follows. {code} val df = Seq( (1, "a", "A"), (2, "b", "B"), (3, "c", "C"), (4, "d", "D"), (5, "e", "E")).toDF("col1", "col2", "col3") val filtered = df.filter("col1 != 3").select("col1", "col2") val joined = filtered.join(df, filtered("col1") === df("col1"), "inner") val selected1 = joined.select(df("col3")) {code} In this case, an AnalysisException is thrown. Another example is as follows. {code} val df = Seq( (1, "a", "A"), (2, "b", "B"), (3, "c", "C"), (4, "d", "D"), (5, "e", "E")).toDF("col1", "col2", "col3") val filtered = df.filter("col1 != 3").select("col1", "col2") val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), "right") val selected2 = rightOuterJoined.select(df("col1")) selected2.show {code} In this case, we expect the following answer. {code} 1 2 3 4 5 {code} But the actual result is as follows. {code} 1 2 null 4 5 {code} The cause of the problems in these examples is that the logical plan for the right-side DataFrame and the expressions of its output are re-created in the analyzer (in the ResolveReferences rule) when a DataFrame has expressions that share the same exprId. The re-created expressions are equal to the original ones except for the exprId. This happens when we do a self-join or a similar pattern of operations. In the first example, df("col3") returns a Column that includes an expression, and that expression has an exprId (say id1 here). 
After the join, the expression held by the right-side DataFrame (df) is re-created; the old and new expressions are equal but the exprId is renewed (say id2 for the new exprId here). Because of the mismatch of those exprIds, an AnalysisException is thrown. In the second example, df("col1") returns a Column whose expression is assigned an exprId (say id3). On the other hand, the Column returned by filtered("col1") has an expression with the same exprId (id3). After the join, the expressions in the right-side DataFrame are re-created, so the expression assigned id3 is no longer present on the right side but is still present on the left side. So, when referring to df("col1") on the joined DataFrame, we get col1 of the left side, which includes null. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
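The exprId mismatch described above can be modeled with a toy resolver. This is hypothetical illustration code, not Spark's actual analyzer; `attr` and `resolve` are invented names, and an "attribute" is reduced to an (exprId, name) pair:

```python
# Toy model of resolving a column strictly by exprId, showing both failure
# modes: a dangling id (AnalysisException) and an id that silently matches
# the wrong (left) side after the right side is re-created with fresh ids.

import itertools

_ids = itertools.count(1)

def attr(name):
    """An attribute here is just an (exprId, name) pair."""
    return (next(_ids), name)

df_cols = [attr("col1"), attr("col2"), attr("col3")]  # exprIds 1, 2, 3
filtered_cols = df_cols[:2]        # `filtered` shares exprIds 1, 2 with `df`

df_col1 = df_cols[0]               # the user's handle df("col1"), exprId 1
df_col3 = df_cols[2]               # the user's handle df("col3"), exprId 3

# The analyzer re-creates the right side (df) with fresh exprIds 4, 5, 6:
joined_output = filtered_cols + [attr(name) for _, name in df_cols]

def resolve(column, output):
    """Resolve strictly by exprId, ignoring the column name."""
    for a in output:
        if a[0] == column[0]:
            return a
    raise ValueError("AnalysisException: cannot resolve " + column[1])

# First example: df("col3")'s exprId 3 no longer exists anywhere -> exception.
try:
    resolve(df_col3, joined_output)
except ValueError as e:
    print(e)

# Second example: df("col1")'s exprId 1 now matches only the *left* side's
# col1, so a right outer join silently yields the nullable left-side column.
print(resolve(df_col1, joined_output))  # (1, 'col1') from `filtered`
```

The toy makes the asymmetry visible: only the right side's ids are refreshed, so a pre-join handle either dangles (col3) or aliases the left side (col1).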
[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428339#comment-15428339 ] Seth Hendrickson commented on SPARK-17139: -- SPARK-7159 has been merged, as an FYI. I can review this when you submit the PR. It would be nice to get this in soon. > Add model summary for MultinomialLogisticRegression > --- > > Key: SPARK-17139 > URL: https://issues.apache.org/jira/browse/SPARK-17139 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson > > Add model summary to multinomial logistic regression using same interface as > in other ML models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17140) Add initial model to MultinomialLogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428338#comment-15428338 ] Seth Hendrickson commented on SPARK-17140: -- Going to hold off for a little bit to see what happens with [SPARK-10780|https://issues.apache.org/jira/browse/SPARK-10780] > Add initial model to MultinomialLogisticRegression > -- > > Key: SPARK-17140 > URL: https://issues.apache.org/jira/browse/SPARK-17140 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > > We should add initial model support to Multinomial logistic regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
[ https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-11227: -- Assignee: Kousuke Saruta > Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1 > > > Key: SPARK-11227 > URL: https://issues.apache.org/jira/browse/SPARK-11227 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1, 1.6.1, 2.0.0 > Environment: OS: CentOS 6.6 > Memory: 28G > CPU: 8 > Mesos: 0.22.0 > HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager) >Reporter: Yuri Saito >Assignee: Kousuke Saruta > Fix For: 2.0.1, 2.1.0 > > > When running jar including Spark Job at HDFS HA Cluster, Mesos and > Spark1.5.1, the job throw Exception as "java.net.UnknownHostException: > nameservice1" and fail. > I do below in Terminal. > {code} > /opt/spark/bin/spark-submit \ > --class com.example.Job /jobs/job-assembly-1.0.0.jar > {code} > So, job throw below message. > {code} > 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 > (TID 0, spark003.example.com): java.lang.IllegalArgumentException: > java.net.UnknownHostException: nameservice1 > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) > at 
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169) > at > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at scala.Option.map(Option.scala:145) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.UnknownHostException: nameservice1 > ... 41 more > {code} > But, I changed from Spark Cluster 1.5.1 to Spark Cluster 1.4.0, then run the > job, job complete with Success. > In Addition, I disable High Availability on HDFS,
[jira] [Resolved] (SPARK-16673) New Executor Page displays columns that used to be conditionally hidden
[ https://issues.apache.org/jira/browse/SPARK-16673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-16673. --- Resolution: Fixed Fix Version/s: 2.1.0 > New Executor Page displays columns that used to be conditionally hidden > --- > > Key: SPARK-16673 > URL: https://issues.apache.org/jira/browse/SPARK-16673 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Alex Bozarth >Assignee: Alex Bozarth > Fix For: 2.1.0 > > > SPARK-15951 switched the Executors page to use JQuery DataTables, but it also > removed the functionality of conditionally hiding the Logs and Thread Dump > columns. In the case of the Logs column this is not a big issue, but in the > case of Thread Dump, previously it was never shown on the History Server > since it isn't available. > We should reintroduce the functionality to hide these columns according to > the same conditions as before. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
[ https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-11227. --- Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 > Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1 > > > Key: SPARK-11227 > URL: https://issues.apache.org/jira/browse/SPARK-11227 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.5.0, 1.5.1, 1.6.1, 2.0.0 > Environment: OS: CentOS 6.6 > Memory: 28G > CPU: 8 > Mesos: 0.22.0 > HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager) >Reporter: Yuri Saito >Assignee: Kousuke Saruta > Fix For: 2.0.1, 2.1.0 > > > When running jar including Spark Job at HDFS HA Cluster, Mesos and > Spark1.5.1, the job throw Exception as "java.net.UnknownHostException: > nameservice1" and fail. > I do below in Terminal. > {code} > /opt/spark/bin/spark-submit \ > --class com.example.Job /jobs/job-assembly-1.0.0.jar > {code} > So, job throw below message. > {code} > 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 > (TID 0, spark003.example.com): java.lang.IllegalArgumentException: > java.net.UnknownHostException: nameservice1 > at > org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374) > at > org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312) > at > org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665) > at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601) > at > org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612) > at 
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169) > at > org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436) > at > org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016) > at > org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at > org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176) > at scala.Option.map(Option.scala:145) > at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176) > at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216) > at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:264) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) > at org.apache.spark.scheduler.Task.run(Task.scala:88) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.net.UnknownHostException: nameservice1 > ... 41 more > {code} > However, when I switched from Spark 1.5.1 back to Spark 1.4.0 and reran the same job, it completed successfully. > In
[jira] [Created] (SPARK-17153) [Structured streams] readStream ignores partition columns
Dmitri Carpov created SPARK-17153: - Summary: [Structured streams] readStream ignores partition columns Key: SPARK-17153 URL: https://issues.apache.org/jira/browse/SPARK-17153 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 2.0.0 Reporter: Dmitri Carpov When parquet files are persisted using partitions, Spark's `readStream` returns data with all `null`s for the partitioned columns. For example: ``` case class A(id: Int, value: Int) val data = spark.createDataset(Seq( A(1, 1), A(2, 2), A(2, 3)) ) val url = "/mnt/databricks/test" data.write.partitionBy("id").parquet(url) ``` When the data is read back as a stream: ``` spark.readStream.schema(spark.read.load(url).schema).parquet(url) ``` it returns: ``` id, value null, 1 null, 2 null, 3 ``` A possible reason is that `readStream` reads the parquet files directly, but the columns the data is partitioned by are excluded from the files themselves when they are written. In the given example the parquet files contain only the `value` column, since `id` is a partition column. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16673) New Executor Page displays columns that used to be conditionally hidden
[ https://issues.apache.org/jira/browse/SPARK-16673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-16673: -- Assignee: Alex Bozarth > New Executor Page displays columns that used to be conditionally hidden > --- > > Key: SPARK-16673 > URL: https://issues.apache.org/jira/browse/SPARK-16673 > Project: Spark > Issue Type: Bug > Components: Web UI >Reporter: Alex Bozarth >Assignee: Alex Bozarth > Fix For: 2.1.0 > > > SPARK-15951 switched the Executors page to use JQuery DataTables, but it also > removed the functionality of conditionally hiding the Logs and Thread Dump > columns. In the case of the Logs column this is not a big issue, but in the > case of Thread Dump, previously it was never shown on the History Server > since it isn't available. > We should reintroduce the functionality to hide these columns according to > the same conditions as before.
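The "same conditions as before" can be made concrete. A minimal Python sketch of the old visibility rules, with illustrative field names (not Spark's actual UI code): show the Logs column only when at least one executor reports log URLs, and never show Thread Dump on the History Server, where dumps are unavailable.

```python
def visible_columns(executors, is_history_server):
    """Decide which optional Executors-page columns to display.

    Hedged sketch with illustrative names, not Spark's actual UI code.
    `executors` is a list of dicts, each optionally carrying a
    non-empty "logUrls" mapping.
    """
    return {
        # Show Logs only if at least one executor actually has log URLs.
        "logs": any(e.get("logUrls") for e in executors),
        # Thread dumps are never available on the History Server.
        "threadDump": not is_history_server,
    }
```

In a DataTables-based page, these two booleans would then drive the per-column visibility flags instead of the columns being rendered unconditionally.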
[jira] [Updated] (SPARK-17153) [Structured streams] readStream ignores partition columns
[ https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dmitri Carpov updated SPARK-17153: -- Description: When parquet files are persisted using partitions, spark's `readStream` returns data with all `null`s for the partitioned columns. For example: {noformat} case class A(id: Int, value: Int) val data = spark.createDataset(Seq( A(1, 1), A(2, 2), A(2, 3)) ) val url = "/mnt/databricks/test" data.write.partitionBy("id").parquet(url) {noformat} when data is read as stream: {noformat} spark.readStream.schema(spark.read.load(url).schema).parquet(url) {noformat} it reads: {noformat} id, value null, 1 null, 2 null, 3 {noformat} A possible reason is `readStream` reads parquet files directly but when those are stored the columns they are partitioned by are excluded from the file itself. In the given example the parquet files contain `value` information only since `id` is partition. was: When parquet files are persisted using partitions, spark's `readStream` returns data with all `null`s for the partitioned columns. For example: ``` case class A(id: Int, value: Int) val data = spark.createDataset(Seq( A(1, 1), A(2, 2), A(2, 3)) ) val url = "/mnt/databricks/test" data.write.partitionBy("id").parquet(url) ``` when data is read as stream: ``` spark.readStream.schema(spark.read.load(url).schema).parquet(url) ``` it reads: ``` id, value null, 1 null, 2 null, 3 ``` A possible reason is `readStream` reads parquet files directly but when those are stored the columns they are partitioned by are excluded from the file itself. In the given example the parquet files contain `value` information only since `id` is partition. 
> [Structured streams] readStream ignores partition columns > - > > Key: SPARK-17153 > URL: https://issues.apache.org/jira/browse/SPARK-17153 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Dmitri Carpov > > When parquet files are persisted using partitions, Spark's `readStream` > returns data with all `null`s for the partitioned columns. > For example: > {noformat} > case class A(id: Int, value: Int) > val data = spark.createDataset(Seq( > A(1, 1), > A(2, 2), > A(2, 3)) > ) > val url = "/mnt/databricks/test" > data.write.partitionBy("id").parquet(url) > {noformat} > When the data is read back as a stream: > {noformat} > spark.readStream.schema(spark.read.load(url).schema).parquet(url) > {noformat} > it returns: > {noformat} > id, value > null, 1 > null, 2 > null, 3 > {noformat} > A possible reason is that `readStream` reads the parquet files directly, but > the columns the data is partitioned by are excluded from the files themselves > when they are written. In the given example the parquet files contain only > the `value` column, since `id` is a partition column.
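The suggested cause can be illustrated: with Hive-style partitioning, the partition values live only in directory names (e.g. `id=1/part-00000.parquet`), not inside the parquet files, so a reader that opens the files directly never sees them. A stdlib-only Python sketch of the path-based recovery step a batch reader performs (hypothetical helper, not Spark's actual partition-discovery code):

```python
import re
from pathlib import PurePosixPath

# Matches one Hive-style partition directory segment, e.g. "id=1".
_PARTITION_SEGMENT = re.compile(r"^([^=/]+)=([^=/]+)$")

def partition_values(file_path):
    """Recover partition column values from a file's directory path.

    Hypothetical helper, not Spark's implementation: this is the step
    a plain file-level reader skips, which is why partition columns
    (here `id`) come back as null.
    """
    values = {}
    for segment in PurePosixPath(file_path).parts:
        m = _PARTITION_SEGMENT.match(segment)
        if m:
            values[m.group(1)] = m.group(2)
    return values
```

For instance, `partition_values("/mnt/databricks/test/id=2/part-00000.parquet")` recovers `{"id": "2"}`, while the parquet file itself only carries `value`.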
[jira] [Commented] (SPARK-17148) NodeManager exit because of exception “Executor is not registered”
[ https://issues.apache.org/jira/browse/SPARK-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428289#comment-15428289 ] Thomas Graves commented on SPARK-17148: --- If this is causing the NodeManager to die, this is bad and we should fix it. It should just fail the request and not kill the NM. I assume the error was caused because someone cleaned up the data for this already? > NodeManager exit because of exception “Executor is not registered” > -- > > Key: SPARK-17148 > URL: https://issues.apache.org/jira/browse/SPARK-17148 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.6.2 > Environment: hadoop 2.7.2 spark 1.6.2 >Reporter: cen yuhai > > java.lang.RuntimeException: Executor is not registered > (appId=application_1467288504738_1341061, execId=423) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85) > at > org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72) > at > org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149) > at > org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104) > at > org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51) > at > io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86) > at > io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308) > at > io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131) > at > io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) > at java.lang.Thread.run(Thread.java:745)
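The fix suggested in the comment above, failing the request rather than the NodeManager, can be sketched. A hedged Python illustration of the pattern (illustrative names, not Spark's actual ExternalShuffleBlockResolver/handler code): a lookup miss is turned into an error reply to the requesting client instead of an uncaught exception that propagates up the handler chain.

```python
class RpcFailure(Exception):
    """Failure that is reported back to the requesting client only."""

def get_block_data(registry, app_id, exec_id):
    """Look up an executor's shuffle state, failing just this request.

    Hedged sketch with illustrative names: a missing registration
    raises a per-request RpcFailure instead of an unchecked
    RuntimeException.
    """
    executor = registry.get((app_id, exec_id))
    if executor is None:
        raise RpcFailure(
            f"Executor is not registered (appId={app_id}, execId={exec_id})")
    return executor

def handle_request(registry, app_id, exec_id, respond):
    """Dispatch one request; RpcFailure becomes an error response, so
    the service itself (and its NodeManager host) keeps running."""
    try:
        respond(ok=get_block_data(registry, app_id, exec_id))
    except RpcFailure as e:
        respond(error=str(e))
```

With this shape, an executor whose shuffle data was already cleaned up produces a single failed fetch on the client side rather than bringing down the shuffle service for every application on the node.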