[jira] [Assigned] (SPARK-17166) CTAS lost table properties after conversion to data source tables.

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17166:


Assignee: (was: Apache Spark)

> CTAS lost table properties after conversion to data source tables.
> --
>
> Key: SPARK-17166
> URL: https://issues.apache.org/jira/browse/SPARK-17166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> CTAS lost table properties after conversion to data source tables. For 
> example, 
> {noformat}
> CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 as a, 
> 1 as b
> {noformat}
> The output of `DESC FORMATTED t` does not have the related properties. 
> {noformat}
> |Table Parameters:            |                                                               |
> |  rawDataSize                |-1                                                             |
> |  numFiles                   |1                                                              |
> |  transient_lastDdlTime      |1471670983                                                     |
> |  totalSize                  |496                                                            |
> |  spark.sql.sources.provider |parquet                                                        |
> |  EXTERNAL                   |FALSE                                                          |
> |  COLUMN_STATS_ACCURATE      |false                                                          |
> |  numRows                    |-1                                                             |
> |                             |                                                               |
> |# Storage Information        |                                                               |
> |SerDe Library:               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    |
> |InputFormat:                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  |
> |OutputFormat:                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
> |Compressed:                  |No                                                             |
> |Storage Desc Parameters:     |                                                               |
> |  serialization.format       |1                                                              |
> |  path                       |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-f3aa2927-6464-4a35-a715-1300dde6c614/t|
> {noformat}






[jira] [Commented] (SPARK-17166) CTAS lost table properties after conversion to data source tables.

2016-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429244#comment-15429244
 ] 

Apache Spark commented on SPARK-17166:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/14727

> CTAS lost table properties after conversion to data source tables.
> --
>
> Key: SPARK-17166
> URL: https://issues.apache.org/jira/browse/SPARK-17166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>
> CTAS lost table properties after conversion to data source tables. For 
> example, 
> {noformat}
> CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 as a, 
> 1 as b
> {noformat}
> The output of `DESC FORMATTED t` does not have the related properties. 
> {noformat}
> |Table Parameters:            |                                                               |
> |  rawDataSize                |-1                                                             |
> |  numFiles                   |1                                                              |
> |  transient_lastDdlTime      |1471670983                                                     |
> |  totalSize                  |496                                                            |
> |  spark.sql.sources.provider |parquet                                                        |
> |  EXTERNAL                   |FALSE                                                          |
> |  COLUMN_STATS_ACCURATE      |false                                                          |
> |  numRows                    |-1                                                             |
> |                             |                                                               |
> |# Storage Information        |                                                               |
> |SerDe Library:               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    |
> |InputFormat:                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  |
> |OutputFormat:                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
> |Compressed:                  |No                                                             |
> |Storage Desc Parameters:     |                                                               |
> |  serialization.format       |1                                                              |
> |  path                       |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-f3aa2927-6464-4a35-a715-1300dde6c614/t|
> {noformat}






[jira] [Assigned] (SPARK-17166) CTAS lost table properties after conversion to data source tables.

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17166:


Assignee: Apache Spark

> CTAS lost table properties after conversion to data source tables.
> --
>
> Key: SPARK-17166
> URL: https://issues.apache.org/jira/browse/SPARK-17166
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>
> CTAS lost table properties after conversion to data source tables. For 
> example, 
> {noformat}
> CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 as a, 
> 1 as b
> {noformat}
> The output of `DESC FORMATTED t` does not have the related properties. 
> {noformat}
> |Table Parameters:            |                                                               |
> |  rawDataSize                |-1                                                             |
> |  numFiles                   |1                                                              |
> |  transient_lastDdlTime      |1471670983                                                     |
> |  totalSize                  |496                                                            |
> |  spark.sql.sources.provider |parquet                                                        |
> |  EXTERNAL                   |FALSE                                                          |
> |  COLUMN_STATS_ACCURATE      |false                                                          |
> |  numRows                    |-1                                                             |
> |                             |                                                               |
> |# Storage Information        |                                                               |
> |SerDe Library:               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    |
> |InputFormat:                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  |
> |OutputFormat:                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
> |Compressed:                  |No                                                             |
> |Storage Desc Parameters:     |                                                               |
> |  serialization.format       |1                                                              |
> |  path                       |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-f3aa2927-6464-4a35-a715-1300dde6c614/t|
> {noformat}






[jira] [Created] (SPARK-17166) CTAS lost table properties after conversion to data source tables.

2016-08-19 Thread Xiao Li (JIRA)
Xiao Li created SPARK-17166:
---

 Summary: CTAS lost table properties after conversion to data 
source tables.
 Key: SPARK-17166
 URL: https://issues.apache.org/jira/browse/SPARK-17166
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiao Li


CTAS lost table properties after conversion to data source tables. For example, 
{noformat}
CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 as a, 1 
as b
{noformat}
The output of `DESC FORMATTED t` does not have the related properties. 
{noformat}
|Table Parameters:            |                                                               |
|  rawDataSize                |-1                                                             |
|  numFiles                   |1                                                              |
|  transient_lastDdlTime      |1471670983                                                     |
|  totalSize                  |496                                                            |
|  spark.sql.sources.provider |parquet                                                        |
|  EXTERNAL                   |FALSE                                                          |
|  COLUMN_STATS_ACCURATE      |false                                                          |
|  numRows                    |-1                                                             |
|                             |                                                               |
|# Storage Information        |                                                               |
|SerDe Library:               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe    |
|InputFormat:                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat  |
|OutputFormat:                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat |
|Compressed:                  |No                                                             |
|Storage Desc Parameters:     |                                                               |
|  serialization.format       |1                                                              |
|  path                       |file:/private/var/folders/4b/sgmfldk15js406vk7lw5llzwgn/T/warehouse-f3aa2927-6464-4a35-a715-1300dde6c614/t|
{noformat}
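For illustration, a minimal spark-shell sketch of how the expected behavior can be checked (assuming a Spark 2.0 build with Hive support and the SHOW TBLPROPERTIES command available); 'prop1' and 'prop2' should still be listed after the CTAS result is converted to a data source table:

{code}
// Reproduce and verify: the user-supplied properties should survive the conversion.
spark.sql("CREATE TABLE t TBLPROPERTIES('prop1' = 'c', 'prop2' = 'd') AS SELECT 1 AS a, 1 AS b")

// Expected to include prop1 -> c and prop2 -> d alongside the internal properties.
spark.sql("SHOW TBLPROPERTIES t").show(truncate = false)
spark.sql("DESC FORMATTED t").collect().foreach(println)
{code}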







[jira] [Commented] (SPARK-16757) Set up caller context to HDFS

2016-08-19 Thread Weiqing Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429237#comment-15429237
 ] 

Weiqing Yang commented on SPARK-16757:
--

Thanks, [~srowen]. When Spark applications run on HDFS, if Spark reads data 
from HDFS or writes data into HDFS, a corresponding operation record with the Spark 
caller context will be written into hdfs-audit.log. The Spark caller context consists 
of JobID_stageID_stageAttemptId_taskID_attemptNumber plus the application's name. 
That can help users better diagnose and understand how specific applications 
are impacting parts of the Hadoop system and what potential problems they may be 
creating (e.g. overloading the NameNode). As noted in HDFS-9184, for a given 
HDFS operation, it's very helpful to track which upper-level job issued it.
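For illustration, a minimal sketch of how such a context could be registered through the Hadoop API introduced by HDFS-9184 (assuming Hadoop 2.8+ on the classpath; the field layout below simply mirrors the description above, and the actual Spark patch may differ, e.g. by guarding against older Hadoop versions):

{code}
import org.apache.hadoop.ipc.CallerContext

// Register a caller context before issuing HDFS calls so it is recorded in hdfs-audit.log.
def setSparkCallerContext(appName: String, jobId: Int, stageId: Int, stageAttemptId: Int,
                          taskId: Long, attemptNumber: Int): Unit = {
  val context = s"SPARK_${appName}_JobId_${jobId}_StageId_${stageId}" +
    s"_StageAttemptId_${stageAttemptId}_TaskId_${taskId}_AttemptNumber_${attemptNumber}"
  CallerContext.setCurrent(new CallerContext.Builder(context).build())
}
{code}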

> Set up caller context to HDFS
> -
>
> Key: SPARK-16757
> URL: https://issues.apache.org/jira/browse/SPARK-16757
> Project: Spark
>  Issue Type: Sub-task
>Reporter: Weiqing Yang
>
> In this jira, Spark will invoke hadoop caller context api to set up its 
> caller context to HDFS.






[jira] [Created] (SPARK-17165) FileStreamSource should not track the list of seen files indefinitely

2016-08-19 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-17165:
---

 Summary: FileStreamSource should not track the list of seen files 
indefinitely
 Key: SPARK-17165
 URL: https://issues.apache.org/jira/browse/SPARK-17165
 Project: Spark
  Issue Type: Bug
  Components: SQL, Streaming
Reporter: Reynold Xin


FileStreamSource currently tracks all the files seen indefinitely, which means 
it can run out of memory or overflow.
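One possible direction, sketched here only to make the problem concrete (not necessarily the eventual fix): remember files only for a bounded window of time, assuming anything older than the window will never be offered by the source again.

{code}
import scala.collection.mutable

// Bounded "seen files" tracking keyed by path, with entries expired by modification time.
class SeenFilesMap(maxAgeMs: Long) {
  private val map = new mutable.HashMap[String, Long]() // path -> file timestamp (ms)
  private var latestTimestamp = 0L

  def add(path: String, timestampMs: Long): Unit = {
    map.put(path, timestampMs)
    latestTimestamp = math.max(latestTimestamp, timestampMs)
  }

  /** A file is new if it is recent enough to still matter and has not been seen before. */
  def isNewFile(path: String, timestampMs: Long): Boolean =
    timestampMs >= latestTimestamp - maxAgeMs && !map.contains(path)

  /** Drop entries that are too old to ever be reported again, bounding memory use. */
  def purge(): Unit = map.retain((_, ts) => ts >= latestTimestamp - maxAgeMs)
}
{code}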







[jira] [Updated] (SPARK-17150) Support SQL generation for inline tables

2016-08-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17150:

Assignee: Peter Lee

> Support SQL generation for inline tables
> 
>
> Key: SPARK-17150
> URL: https://issues.apache.org/jira/browse/SPARK-17150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> Inline tables currently do not support SQL generation, and as a result a view 
> that depends on inline tables would fail.






[jira] [Resolved] (SPARK-17150) Support SQL generation for inline tables

2016-08-19 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17150.
-
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 14709
[https://github.com/apache/spark/pull/14709]

> Support SQL generation for inline tables
> 
>
> Key: SPARK-17150
> URL: https://issues.apache.org/jira/browse/SPARK-17150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>
> Inline tables currently do not support SQL generation, and as a result a view 
> that depends on inline tables would fail.






[jira] [Commented] (SPARK-16862) Configurable buffer size in `UnsafeSorterSpillReader`

2016-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429226#comment-15429226
 ] 

Apache Spark commented on SPARK-16862:
--

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/14726

> Configurable buffer size in `UnsafeSorterSpillReader`
> -
>
> Key: SPARK-16862
> URL: https://issues.apache.org/jira/browse/SPARK-16862
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Tejas Patil
>Priority: Minor
>
> `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k 
> buffer to read data off disk. This could be made configurable to improve on 
> disk reads.
> https://github.com/apache/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java#L53
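A sketch of the idea only (the config key below is illustrative, not an existing Spark setting): read the buffer size from SparkConf with the current 8 KB as the default, and pass it to the BufferedInputStream constructor.

{code}
import java.io.{BufferedInputStream, FileInputStream}
import org.apache.spark.SparkConf

// Hypothetical config key; the real patch would wire this through UnsafeSorterSpillReader.
val conf = new SparkConf()
val bufferSizeBytes =
  conf.getSizeAsBytes("spark.unsafe.sorter.spill.reader.buffer.size", "8k").toInt
val in = new BufferedInputStream(new FileInputStream("/tmp/spill-file"), bufferSizeBytes)
{code}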






[jira] [Closed] (SPARK-16264) Allow the user to use operators on the received DataFrame

2016-08-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-16264.
---
Resolution: Won't Fix

> Allow the user to use operators on the received DataFrame
> -
>
> Key: SPARK-16264
> URL: https://issues.apache.org/jira/browse/SPARK-16264
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Shixiong Zhu
>
> Currently Sink cannot apply any operators on the given DataFrame because a new 
> DataFrame created by an operator will use QueryExecution rather than 
> IncrementalExecution.
> There are two options to fix this:
> 1. Merge IncrementalExecution into QueryExecution so that QueryExecution can 
> also deal with streaming operators.
> 2. Make Dataset operators inherit the QueryExecution (IncrementalExecution is 
> just a subclass of QueryExecution) from its parent.






[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-08-19 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429196#comment-15429196
 ] 

Yanbo Liang commented on SPARK-17134:
-

[~qhuang] Please feel free to take this task and do the performance 
investigation. Thanks! 

> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.
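To make the proposal concrete, here is a toy comparison of the two access patterns using Breeze (already a Spark dependency); this is not the actual LogisticAggregator code:

{code}
import breeze.linalg.{DenseMatrix, DenseVector}

val numClasses = 3
val numFeatures = 4
val coefficients = DenseMatrix.rand(numClasses, numFeatures) // K x D coefficient matrix
val features = DenseVector.rand(numFeatures)

// Level-1 style: one dot product per class.
val marginsLevel1 = (0 until numClasses).map(k => coefficients(k, ::) * features)

// Level-2 style: a single matrix-vector multiply (gemv) computes all K margins at once.
val marginsLevel2 = coefficients * features
{code}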






[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0

2016-08-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429195#comment-15429195
 ] 

Reynold Xin commented on SPARK-17164:
-

I tried in Postgres:

{code}
rxin=# create table a:b (id int);
ERROR:  syntax error at or near ":"
LINE 1: create table a:b (id int);
{code}

Also it seems like Presto does not support it. You can always quote this though.
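For reference, the quoting workaround in Spark SQL would look like this (assuming a table literally named {{a:b}} actually exists in the catalog):

{code}
// Backticks make the colon part of the identifier instead of a parse error.
sql("SELECT * FROM `a:b`")
{code}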


> Query with colon in the table name fails to parse in 2.0
> 
>
> Key: SPARK-17164
> URL: https://issues.apache.org/jira/browse/SPARK-17164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Running a simple query with colon in table name fails to parse in 2.0
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine 
> in 1.6.






[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0

2016-08-19 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429194#comment-15429194
 ] 

Reynold Xin commented on SPARK-17164:
-

This is actually valid?

> Query with colon in the table name fails to parse in 2.0
> 
>
> Key: SPARK-17164
> URL: https://issues.apache.org/jira/browse/SPARK-17164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Running a simple query with colon in table name fails to parse in 2.0
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine 
> in 1.6.






[jira] [Commented] (SPARK-17164) Query with colon in the table name fails to parse in 2.0

2016-08-19 Thread Sital Kedia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429193#comment-15429193
 ] 

Sital Kedia commented on SPARK-17164:
-

cc - [~hvanhovell], [~rxin]

> Query with colon in the table name fails to parse in 2.0
> 
>
> Key: SPARK-17164
> URL: https://issues.apache.org/jira/browse/SPARK-17164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
>
> Running a simple query with colon in table name fails to parse in 2.0
> {code}
> == SQL ==
> SELECT * FROM a:b
> ---^^^
>   at 
> org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
>   at 
> org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
>   at 
> org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>   at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
>   ... 48 elided
> {code}
> Please note that this is a regression from Spark 1.6 as the query runs fine 
> in 1.6.






[jira] [Created] (SPARK-17164) Query with colon in the table name fails to parse in 2.0

2016-08-19 Thread Sital Kedia (JIRA)
Sital Kedia created SPARK-17164:
---

 Summary: Query with colon in the table name fails to parse in 2.0
 Key: SPARK-17164
 URL: https://issues.apache.org/jira/browse/SPARK-17164
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Sital Kedia


Running a simple query with colon in table name fails to parse in 2.0

{code}
== SQL ==
SELECT * FROM a:b
---^^^

  at 
org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:197)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:99)
  at 
org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
  at 
org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
  at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:682)
  ... 48 elided

{code}

Please note that this is a regression from Spark 1.6 as the query runs fine in 
1.6.






[jira] [Resolved] (SPARK-17158) Improve error message for numeric literal parsing

2016-08-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17158.
-
   Resolution: Fixed
 Assignee: Srinath
Fix Version/s: 2.1.0
   2.0.1

> Improve error message for numeric literal parsing
> -
>
> Key: SPARK-17158
> URL: https://issues.apache.org/jira/browse/SPARK-17158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Srinath
>Assignee: Srinath
>Priority: Minor
> Fix For: 2.0.1, 2.1.0
>
>
> Spark currently gives confusing and inconsistent error messages for numeric 
> literals. For example:
> scala> sql("select 123456Y")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
> == SQL ==
> select 123456Y
> ---^^^
> scala> sql("select 123456S")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
> == SQL ==
> select 123456S
> ---^^^
> scala> sql("select 12345623434523434564565L")
> org.apache.spark.sql.catalyst.parser.ParseException:
> For input string: "12345623434523434564565"(line 1, pos 7)
> == SQL ==
> select 12345623434523434564565L
> ---^^^
> The problem is that we are relying on JDK's implementations for parsing, and 
> those functions throw different error messages. This code can be found in 
> AstBuilder.numericLiteral function.
> The proposal is that instead of using `_.toByte` to turn a string into a 
> byte, we always turn the numeric literal string into a BigDecimal, and then 
> we validate the range before turning it into a numeric value. This way, we 
> have more control over the data.
> If BigDecimal fails to parse the number, we should throw a better exception 
> than "For input string ...".






[jira] [Resolved] (SPARK-17149) array.sql for testing array related functions

2016-08-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-17149.
-
   Resolution: Fixed
 Assignee: Peter Lee
Fix Version/s: 2.1.0
   2.0.1

> array.sql for testing array related functions
> -
>
> Key: SPARK-17149
> URL: https://issues.apache.org/jira/browse/SPARK-17149
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Peter Lee
>Assignee: Peter Lee
> Fix For: 2.0.1, 2.1.0
>
>







[jira] [Created] (SPARK-17163) Decide on unified multinomial and binary logistic regression interfaces

2016-08-19 Thread Seth Hendrickson (JIRA)
Seth Hendrickson created SPARK-17163:


 Summary: Decide on unified multinomial and binary logistic 
regression interfaces
 Key: SPARK-17163
 URL: https://issues.apache.org/jira/browse/SPARK-17163
 Project: Spark
  Issue Type: Sub-task
Reporter: Seth Hendrickson


Before the 2.1 release, we should finalize the API for logistic regression. 
After SPARK-7159, we have both LogisticRegression and 
MultinomialLogisticRegression models. This may be confusing to users and is a 
bit superfluous, since MLOR can do basically all of what BLOR does. We should 
decide whether it needs to be changed and implement those changes before 2.1.






[jira] [Commented] (SPARK-17151) Decide how to handle inferring number of classes in Multinomial logistic regression

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429069#comment-15429069
 ] 

DB Tsai commented on SPARK-17151:
-

[~sethah] I think it sort of makes sense that we allow users to specify the 
number of classes if they want instead of inferring from the data.

> Decide how to handle inferring number of classes in Multinomial logistic 
> regression
> ---
>
> Key: SPARK-17151
> URL: https://issues.apache.org/jira/browse/SPARK-17151
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> This JIRA is to discuss how the number of label classes should be inferred in 
> multinomial logistic regression. Currently, MLOR checks the dataframe 
> metadata and if the number of classes is not specified then it uses the 
> maximum value seen in the label column. If the labels are not properly 
> indexed, then this can cause a large number of zero coefficients and 
> potentially produce instabilities in model training.






[jira] [Commented] (SPARK-17151) Decide how to handle inferring number of classes in Multinomial logistic regression

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429066#comment-15429066
 ] 

DB Tsai commented on SPARK-17151:
-

BTW, not only is there the zero-coefficients issue, but the intercepts will also be 
negative infinity for classes that are not seen at training time. This will 
cause some instabilities during the optimization, and we should not 
train on those unseen classes. As a result, we need to keep track of which classes 
are seen at training time, and only optimize the coefficients for 
them. Since we know all the possible classes, which should be able to be 
specified by users as part of the API, at prediction time we just give the unseen 
classes probability zero. 
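To make the options concrete, here is a hypothetical helper (none of this is existing Spark code) that prefers a user-supplied class count, then the label column's ML metadata, and only falls back to max(label) + 1, assuming a DoubleType label column:

{code}
import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
import org.apache.spark.sql.DataFrame

def resolveNumClasses(df: DataFrame, labelCol: String, userNumClasses: Option[Int]): Int = {
  // Class count declared in the column's ML metadata, if any.
  val fromMetadata = Attribute.fromStructField(df.schema(labelCol)) match {
    case nominal: NominalAttribute => nominal.getNumValues
    case _ => None
  }
  userNumClasses.orElse(fromMetadata).getOrElse {
    // Last resort: infer from the data, assuming labels are properly 0-based indexed.
    df.select(labelCol).rdd.map(_.getDouble(0)).max().toInt + 1
  }
}
{code}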

> Decide how to handle inferring number of classes in Multinomial logistic 
> regression
> ---
>
> Key: SPARK-17151
> URL: https://issues.apache.org/jira/browse/SPARK-17151
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> This JIRA is to discuss how the number of label classes should be inferred in 
> multinomial logistic regression. Currently, MLOR checks the dataframe 
> metadata and if the number of classes is not specified then it uses the 
> maximum value seen in the label column. If the labels are not properly 
> indexed, then this can cause a large number of zero coefficients and 
> potentially produce instabilities in model training.






[jira] [Comment Edited] (SPARK-17151) Decide how to handle inferring number of classes in Multinomial logistic regression

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429066#comment-15429066
 ] 

DB Tsai edited comment on SPARK-17151 at 8/19/16 11:49 PM:
---

Not only is there the zero-coefficients issue, but the intercepts will also be negative 
infinity for classes that are not seen at training time. This will 
cause some instabilities during the optimization, and we should not train on 
those unseen classes. As a result, we need to keep track of which classes are seen 
at training time, and only optimize the coefficients for them. 
Since we know all the possible classes, which should be able to be specified by 
users as part of the API, at prediction time we just give the unseen classes 
probability zero. 


was (Author: dbtsai):
BTW, not only the zero coefficients issues but also the intercepts will be 
negative infinity for those classes which are not seen in the training time. 
This will cause some instabilities during the optimization, and we should not 
train on those unseen classes. As a result, we need to keep track on what are 
the seen classes in the training time, and only optimize the coefficients for 
them. Since we know all the possible classes which should be able to be 
specified by users as part of the API, in prediction time, we just make them 
probability zero. 

> Decide how to handle inferring number of classes in Multinomial logistic 
> regression
> ---
>
> Key: SPARK-17151
> URL: https://issues.apache.org/jira/browse/SPARK-17151
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Seth Hendrickson
>Priority: Minor
>
> This JIRA is to discuss how the number of label classes should be inferred in 
> multinomial logistic regression. Currently, MLOR checks the dataframe 
> metadata and if the number of classes is not specified then it uses the 
> maximum value seen in the label column. If the labels are not properly 
> indexed, then this can cause a large number of zero coefficients and 
> potentially produce instabilities in model training.






[jira] [Assigned] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17161:


Assignee: (was: Apache Spark)

> Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
> -
>
> Key: SPARK-17161
> URL: https://issues.apache.org/jira/browse/SPARK-17161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Often in Spark ML, there are classes that use a Scala `Array` to construct.  
> In order to add the same API to Python, a Java-friendly alternate constructor 
> needs to exist to be compatible with py4j when converting from a list.  This 
> is because the current conversion in PySpark _py2java creates a 
> java.util.ArrayList, as shown in this error msg
> {noformat}
> Py4JError: An error occurred while calling 
> None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
> py4j.Py4JException: Constructor 
> org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) 
> does not exist
>   at 
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
>   at 
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
>   at py4j.Gateway.invoke(Gateway.java:235)
> {noformat}
> Creating an alternate constructor can be avoided by creating a py4j JavaArray 
> using {{new_array}}.  This type is compatible with the Scala `Array` 
> currently used in classes like {{CountVectorizerModel}} and 
> {{StringIndexerModel}}.
> Most of the boiler-plate Python code to do this can be put in a convenience 
> function inside of  ml.JavaWrapper to give a clean way of constructing ML 
> objects without adding special constructors.






[jira] [Commented] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays

2016-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429043#comment-15429043
 ] 

Apache Spark commented on SPARK-17161:
--

User 'BryanCutler' has created a pull request for this issue:
https://github.com/apache/spark/pull/14725

> Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
> -
>
> Key: SPARK-17161
> URL: https://issues.apache.org/jira/browse/SPARK-17161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Often in Spark ML, there are classes that use a Scala `Array` to construct.  
> In order to add the same API to Python, a Java-friendly alternate constructor 
> needs to exist to be compatible with py4j when converting from a list.  This 
> is because the current conversion in PySpark _py2java creates a 
> java.util.ArrayList, as shown in this error msg
> {noformat}
> Py4JError: An error occurred while calling 
> None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
> py4j.Py4JException: Constructor 
> org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) 
> does not exist
>   at 
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
>   at 
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
>   at py4j.Gateway.invoke(Gateway.java:235)
> {noformat}
> Creating an alternate constructor can be avoided by creating a py4j JavaArray 
> using {{new_array}}.  This type is compatible with the Scala `Array` 
> currently used in classes like {{CountVectorizerModel}} and 
> {{StringIndexerModel}}.
> Most of the boiler-plate Python code to do this can be put in a convenience 
> function inside of  ml.JavaWrapper to give a clean way of constructing ML 
> objects without adding special constructors.






[jira] [Assigned] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17161:


Assignee: Apache Spark

> Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
> -
>
> Key: SPARK-17161
> URL: https://issues.apache.org/jira/browse/SPARK-17161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Apache Spark
>Priority: Minor
>
> Often in Spark ML, there are classes that use a Scala `Array` to construct.  
> In order to add the same API to Python, a Java-friendly alternate constructor 
> needs to exist to be compatible with py4j when converting from a list.  This 
> is because the current conversion in PySpark _py2java creates a 
> java.util.ArrayList, as shown in this error msg
> {noformat}
> Py4JError: An error occurred while calling 
> None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
> py4j.Py4JException: Constructor 
> org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) 
> does not exist
>   at 
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
>   at 
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
>   at py4j.Gateway.invoke(Gateway.java:235)
> {noformat}
> Creating an alternate constructor can be avoided by creating a py4j JavaArray 
> using {{new_array}}.  This type is compatible with the Scala `Array` 
> currently used in classes like {{CountVectorizerModel}} and 
> {{StringIndexerModel}}.
> Most of the boiler-plate Python code to do this can be put in a convenience 
> function inside of  ml.JavaWrapper to give a clean way of constructing ML 
> objects without adding special constructors.






[jira] [Commented] (SPARK-17136) Design optimizer interface for ML algorithms

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17136?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429039#comment-15429039
 ] 

DB Tsai commented on SPARK-17136:
-

Typically, a first-order optimizer will take a function which returns the 
first derivative and the value of the objective function; a second-order one will 
also take the Hessian matrix. Since second-order methods don't scale in the number of 
features, we can focus on first-order optimizers first. Also, we need an 
interface to handle the non-differentiable part of the loss, such as L1, outside the 
returned loss, since it's specific to the design of the algorithm and so cannot be part 
of the loss. We may take a look at how other packages in R, Matlab, or Python define 
such interfaces, and come up with a generic one. The default implementation can wrap 
the Breeze one. Users can provide their own implementation to change the default 
optimizer.
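A strawman of what such an interface could look like, just to anchor the discussion (these names do not exist in Spark today):

{code}
import org.apache.spark.ml.linalg.Vector

// A first-order objective returns (loss, gradient) at the given coefficients.
trait DifferentiableFunction {
  def calculate(coefficients: Vector): (Double, Vector)
}

// The pluggable piece: users could supply their own minimizer; a default implementation
// could wrap breeze.optimize.LBFGS (or OWLQN to handle the L1 part outside the loss).
trait FirstOrderMinimizer {
  def minimize(objective: DifferentiableFunction, initial: Vector): Vector
}
{code}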

> Design optimizer interface for ML algorithms
> 
>
> Key: SPARK-17136
> URL: https://issues.apache.org/jira/browse/SPARK-17136
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> We should consider designing an interface that allows users to use their own 
> optimizers in some of the ML algorithms, similar to MLlib. 






[jira] [Updated] (SPARK-17161) Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays

2016-08-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-17161:
-
Summary: Add PySpark-ML JavaWrapper convenience function to create py4j 
JavaArrays  (was: Add PySpark-ML JavaWrapper convienience function to create 
py4j JavaArrays)

> Add PySpark-ML JavaWrapper convenience function to create py4j JavaArrays
> -
>
> Key: SPARK-17161
> URL: https://issues.apache.org/jira/browse/SPARK-17161
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
>
> Often in Spark ML, there are classes that use a Scala `Array` to construct.  
> In order to add the same API to Python, a Java-friendly alternate constructor 
> needs to exist to be compatible with py4j when converting from a list.  This 
> is because the current conversion in PySpark _py2java creates a 
> java.util.ArrayList, as shown in this error msg
> {noformat}
> Py4JError: An error occurred while calling 
> None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
> py4j.Py4JException: Constructor 
> org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) 
> does not exist
>   at 
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
>   at 
> py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
>   at py4j.Gateway.invoke(Gateway.java:235)
> {noformat}
> Creating an alternate constructor can be avoided by creating a py4j JavaArray 
> using {{new_array}}.  This type is compatible with the Scala `Array` 
> currently used in classes like {{CountVectorizerModel}} and 
> {{StringIndexerModel}}.
> Most of the boiler-plate Python code to do this can be put in a convenience 
> function inside of  ml.JavaWrapper to give a clean way of constructing ML 
> objects without adding special constructors.






[jira] [Comment Edited] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429025#comment-15429025
 ] 

DB Tsai edited comment on SPARK-17137 at 8/19/16 11:16 PM:
---

Currently, for LiR or BLOR, we always do `Vector.compressed` when creating the 
models, which is optimized for space but not for computation. We need to investigate 
the trade-off. 


was (Author: dbtsai):
Currently, for LiR or BLOR, we always do `Vector.compressed` which is optimized 
for space, but computation. We need to investigate the trade-off. 

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization, it may 
> be more efficient to store coefficients in compressed format. We can add this 
> option to MLOR and perhaps to do some performance tests to verify 
> improvements.






[jira] [Commented] (SPARK-17137) Add compressed support for multinomial logistic regression coefficients

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429025#comment-15429025
 ] 

DB Tsai commented on SPARK-17137:
-

Currently, for LiR or BLOR, we always do `Vector.compressed`, which is optimized 
for space but not for computation. We need to investigate the trade-off. 
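For reference, a small example of what {{compressed}} does and where the trade-off comes from:

{code}
import org.apache.spark.ml.linalg.Vectors

// `compressed` returns whichever of the dense/sparse representations needs less storage.
// With strong L1 the MLOR coefficients are mostly zeros, so this saves memory, but dense
// dot products are often faster than sparse ones -- hence the trade-off to measure.
val coefficients = Vectors.dense(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 0.0)
val stored = coefficients.compressed // stored as a SparseVector here (1 of 8 entries non-zero)
{code}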

> Add compressed support for multinomial logistic regression coefficients
> ---
>
> Key: SPARK-17137
> URL: https://issues.apache.org/jira/browse/SPARK-17137
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> For sparse coefficients in MLOR, such as when high L1 regularization, it may 
> be more efficient to store coefficients in compressed format. We can add this 
> option to MLOR and perhaps to do some performance tests to verify 
> improvements.






[jira] [Assigned] (SPARK-17162) Range does not support SQL generation

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17162:


Assignee: (was: Apache Spark)

> Range does not support SQL generation
> -
>
> Key: SPARK-17162
> URL: https://issues.apache.org/jira/browse/SPARK-17162
> Project: Spark
>  Issue Type: Bug
>Reporter: Eric Liang
>Priority: Minor
>
> {code}
> scala> sql("create view a as select * from range(100)")
> 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as 
> select * from range(100)
> java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, 
> splits=8)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> {code}






[jira] [Commented] (SPARK-17162) Range does not support SQL generation

2016-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429018#comment-15429018
 ] 

Apache Spark commented on SPARK-17162:
--

User 'ericl' has created a pull request for this issue:
https://github.com/apache/spark/pull/14724

> Range does not support SQL generation
> -
>
> Key: SPARK-17162
> URL: https://issues.apache.org/jira/browse/SPARK-17162
> Project: Spark
>  Issue Type: Bug
>Reporter: Eric Liang
>Priority: Minor
>
> {code}
> scala> sql("create view a as select * from range(100)")
> 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as 
> select * from range(100)
> java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, 
> splits=8)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> {code}






[jira] [Assigned] (SPARK-17162) Range does not support SQL generation

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17162:


Assignee: Apache Spark

> Range does not support SQL generation
> -
>
> Key: SPARK-17162
> URL: https://issues.apache.org/jira/browse/SPARK-17162
> Project: Spark
>  Issue Type: Bug
>Reporter: Eric Liang
>Assignee: Apache Spark
>Priority: Minor
>
> {code}
> scala> sql("create view a as select * from range(100)")
> 16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as 
> select * from range(100)
> java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, 
> splits=8)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
>   at 
> org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
>   at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174)
>   at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
> {code}






[jira] [Created] (SPARK-17162) Range does not support SQL generation

2016-08-19 Thread Eric Liang (JIRA)
Eric Liang created SPARK-17162:
--

 Summary: Range does not support SQL generation
 Key: SPARK-17162
 URL: https://issues.apache.org/jira/browse/SPARK-17162
 Project: Spark
  Issue Type: Bug
Reporter: Eric Liang
Priority: Minor


{code}
scala> sql("create view a as select * from range(100)")
16/08/19 21:10:29 INFO SparkSqlParser: Parsing command: create view a as select 
* from range(100)
java.lang.UnsupportedOperationException: unsupported plan Range (0, 100, 
splits=8)

  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:212)
  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
  at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:165)
  at org.apache.spark.sql.catalyst.SQLBuilder.projectToSQL(SQLBuilder.scala:229)
  at 
org.apache.spark.sql.catalyst.SQLBuilder.org$apache$spark$sql$catalyst$SQLBuilder$$toSQL(SQLBuilder.scala:127)
  at org.apache.spark.sql.catalyst.SQLBuilder.toSQL(SQLBuilder.scala:97)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.prepareTable(views.scala:174)
  at 
org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:138)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58)
  at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)

{code}






[jira] [Created] (SPARK-17161) Add PySpark-ML JavaWrapper convienience function to create py4j JavaArrays

2016-08-19 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-17161:


 Summary: Add PySpark-ML JavaWrapper convienience function to 
create py4j JavaArrays
 Key: SPARK-17161
 URL: https://issues.apache.org/jira/browse/SPARK-17161
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Bryan Cutler
Priority: Minor


Often in Spark ML, there are classes that are constructed from a Scala `Array`.  In 
order to add the same API to Python, a Java-friendly alternate constructor 
needs to exist to be compatible with py4j when converting from a list.  This is 
because the current conversion in PySpark's _py2java creates a 
java.util.ArrayList, as shown in this error msg:

{noformat}
Py4JError: An error occurred while calling 
None.org.apache.spark.ml.feature.CountVectorizerModel. Trace:
py4j.Py4JException: Constructor 
org.apache.spark.ml.feature.CountVectorizerModel([class java.util.ArrayList]) 
does not exist
at 
py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at 
py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
{noformat}

Creating an alternate constructor can be avoided by creating a py4j JavaArray 
using {{new_array}}.  This type is compatible with the Scala `Array` currently 
used in classes like {{CountVectorizerModel}} and {{StringIndexerModel}}.

Most of the boiler-plate Python code to do this can be put in a convenience 
function inside of  ml.JavaWrapper to give a clean way of constructing ML 
objects without adding special constructors.
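For context, the JVM-side constructor that py4j has to resolve takes a Scala Array, which is why the java.util.ArrayList produced by _py2java cannot match it; a py4j JavaArray built with {{new_array}} maps onto exactly this kind of parameter:

{code}
import org.apache.spark.ml.feature.CountVectorizerModel

// Scala-side constructor: expects Array[String], not java.util.List.
val model = new CountVectorizerModel(Array("a", "b", "c"))
{code}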






[jira] [Commented] (SPARK-17140) Add initial model to MultinomialLogisticRegression

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15429003#comment-15429003
 ] 

DB Tsai commented on SPARK-17140:
-

Since we're doing smoothing, the intercepts computed from the priors with smoothing 
will not be the ones the model actually converges to with large L1 regularization. 
Just keep that in mind when writing the tests.
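For illustration only (the exact initialization may differ), "intercepts computed from priors with smoothing" means something like add-one-smoothed log class frequencies, which an L1-regularized fit will generally not converge back to exactly:

{code}
// b_k = log((count_k + 1) / (n + numClasses)) for class counts count_k over n examples.
val classCounts = Array(3.0, 96.0, 1.0)
val n = classCounts.sum
val numClasses = classCounts.length
val smoothedInterceptPriors = classCounts.map(c => math.log((c + 1.0) / (n + numClasses)))
{code}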

> Add initial model to MultinomialLogisticRegression
> --
>
> Key: SPARK-17140
> URL: https://issues.apache.org/jira/browse/SPARK-17140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> We should add initial model support to Multinomial logistic regression.






[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428988#comment-15428988
 ] 

Nicholas Chammas commented on SPARK-17025:
--

{quote}
We'd need to figure out a good design for this, especially since it will 
require a Python process to be started up for what might otherwise be pure 
Scala applications.
{quote}

Ah, I guess since a Transformer using some Python code may be persisted and 
then loaded into a Scala application, right? Sounds hairy. 

Anyway, thanks for chiming in Joseph. I'll watch the linked issue.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17128) Schema is not Created for nested Json Array objects

2016-08-19 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-17128.
---
  Resolution: Invalid
Target Version/s:   (was: 2.0.0)

This is not a reasonable description of an issue. Please read 
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark first

> Schema is not Created for nested Json Array objects
> ---
>
> Key: SPARK-17128
> URL: https://issues.apache.org/jira/browse/SPARK-17128
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.0.0
> Environment: Java and Scalab both
>Reporter: vidit Singh
>Priority: Critical
>
> When I am trying to generate the schema of nested JSON array elements, it 
> shows the error: [_corrupt_record: string].
> Please fix this issue ASAP.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-19 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428906#comment-15428906
 ] 

Joseph K. Bradley commented on SPARK-17025:
---

I'd call this a new API, not a bug.  This kind of support would be great to 
add; we just have not had a chance to work on it.  We'd need to figure out a 
good design for this, especially since it will require a Python process to be 
started up for what might otherwise be pure Scala applications.  Linking a 
related task which should come before this.

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2016-08-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17025:
--
Issue Type: New Feature  (was: Bug)

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError

2016-08-19 Thread Mikael Valot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikael Valot updated SPARK-17155:
-
Description: 
The following code throws an exception in the DSE (Datastax enterprise) spark 
shell:

{code}
dse spark --master=local[2]
{code}
{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

It looks like a different ClassLoader is involved that cannot load my case class.
However, it works fine with a Tuple, and it also works fine with the standalone 
version of Spark.
{code:java}
val fut = Future{ Seq((1, 2)).toDS() }
Await.result(fut, Duration.Inf).show()
+---+---+   
| _1| _2|
+---+---+
|  1|  2|
+---+---+

{code}

  was:
The following code throws an exception in the DSE (Datastax enterprise) spark 
shell:

{code}
dse spark --master=local[2]
{code}
{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at 

[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError

2016-08-19 Thread Mikael Valot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikael Valot updated SPARK-17155:
-
Description: 
The following code throws an exception in the DSE (Datastax enterprise) spark 
shell:

{code:bash}
dse spark --master=local[2]
{code}
{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

It looks like a different ClassLoader is involved that cannot load my case class.
However, it works fine with a Tuple:
{code:java}
val fut = Future{ Seq((1, 2)).toDS() }
Await.result(fut, Duration.Inf).show()
+---+---+   
| _1| _2|
+---+---+
|  1|  2|
+---+---+

{code}

  was:
The following code throws an exception in the spark shell:

{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 

[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError

2016-08-19 Thread Mikael Valot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikael Valot updated SPARK-17155:
-
Description: 
The following code throws an exception in the DSE (Datastax enterprise) spark 
shell:

{code}
dse spark --master=local[2]
{code}
{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

It looks like a different ClassLoader is involved that cannot load my case class.
However, it works fine with a Tuple:
{code:java}
val fut = Future{ Seq((1, 2)).toDS() }
Await.result(fut, Duration.Inf).show()
+---+---+   
| _1| _2|
+---+---+
|  1|  2|
+---+---+

{code}

  was:
The following code throws an exception in the DSE (Datastax enterprise) spark 
shell:

{code:bash}
dse spark --master=local[2]
{code}
{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 

[jira] [Resolved] (SPARK-16443) ALS wrapper in SparkR

2016-08-19 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16443.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 14384
[https://github.com/apache/spark/pull/14384]

> ALS wrapper in SparkR
> -
>
> Key: SPARK-16443
> URL: https://issues.apache.org/jira/browse/SPARK-16443
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, SparkR
>Reporter: Xiangrui Meng
>Assignee: Junyang Qian
> Fix For: 2.1.0
>
>
> Wrap MLlib's ALS in SparkR. We should discuss whether we want to support R 
> formula or not for ALS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428848#comment-15428848
 ] 

DB Tsai edited comment on SPARK-17134 at 8/19/16 9:21 PM:
--

It may also be worth trying the following. I see some performance improvement when 
the number of classes is high, and we can avoid doing the normalization again and 
again. On the other hand, the access pattern over the coefficients array will no 
longer be sequential, which changes the CPU cache locality, so I think the 
performance will be case by case. Maybe we can store the coefficients in a 
transposed matrix, which may help the locality?

We need more investigation to understand the problem.

{code:borderStyle=solid}
val margins = Array.ofDim[Double](numClasses)
features.foreachActive { (index, value) =>
  if (featuresStd(index) != 0.0 && value != 0.0) {
    var i = 0
    val temp = value / featuresStd(index)
    while (i < numClasses) {
      margins(i) += coefficients(i * numFeaturesPlusIntercept + index) * temp
      i += 1
    }
  }
}

if (fitIntercept) {
  var i = 0
  val length = features.size
  while (i < numClasses) {
    margins(i) += coefficients(i * numFeaturesPlusIntercept + length)
    i += 1
  }
}

val maxMargin = margins.max
val marginOfLabel = margins(label.toInt)
{code}



was (Author: dbtsai):
{code:borderStyle=solid}
val margins = Array.ofDim[Double](numClasses)
features.foreachActive { (index, value) =>
  if (featuresStd(index) != 0.0 && value != 0.0) {
var i = 0
val temp = value / featuresStd(index)
while ( i < numClasses) {
  margins(i) += coefficients(i * numFeaturesPlusIntercept + index) * temp
  i += 1
   }
  }
}

if (fitIntercept) {
  var i = 0
  val length = features.size
  while ( i < numClasses) {
margins(i) += coefficients(i * numFeaturesPlusIntercept + length)
i += 1
  }
}

val maxMargin = margins.max
val marginOfLabel = margins(label.toInt)
{code}


> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17134) Use level 2 BLAS operations in LogisticAggregator

2016-08-19 Thread DB Tsai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428848#comment-15428848
 ] 

DB Tsai commented on SPARK-17134:
-

{code:borderStyle=solid}
val margins = Array.ofDim[Double](numClasses)
features.foreachActive { (index, value) =>
  if (featuresStd(index) != 0.0 && value != 0.0) {
var i = 0
val temp = value / featuresStd(index)
while ( i < numClasses) {
  margins(i) += coefficients(i * numFeaturesPlusIntercept + index) * temp
  i += 1
   }
  }
}

if (fitIntercept) {
  var i = 0
  val length = features.size
  while ( i < numClasses) {
margins(i) += coefficients(i * numFeaturesPlusIntercept + length)
i += 1
  }
}

val maxMargin = margins.max
val marginOfLabel = margins(label.toInt)
{code}


> Use level 2 BLAS operations in LogisticAggregator
> -
>
> Key: SPARK-17134
> URL: https://issues.apache.org/jira/browse/SPARK-17134
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Multinomial logistic regression uses LogisticAggregator class for gradient 
> updates. We should look into refactoring MLOR to use level 2 BLAS operations 
> for the updates. Performance testing should be done to show improvements.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16569) Use Cython to speed up Pyspark internals

2016-08-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu closed SPARK-16569.
--
Resolution: Won't Fix

> Use Cython to speed up Pyspark internals
> 
>
> Key: SPARK-16569
> URL: https://issues.apache.org/jira/browse/SPARK-16569
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> CC: [~davies]
> Many operations I do are like:
> {code}
> dataframe.rdd.map(some_function)
> {code}
> In PySpark this means creating a Row object for every record, and this is slow.
> IDEA:
> Use Cython to speed up Pyspark internals
> What do you think ?
> Sample profile:
> {code}
> 
> Profile of 

[jira] [Commented] (SPARK-16569) Use Cython to speed up Pyspark internals

2016-08-19 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428843#comment-15428843
 ] 

Davies Liu commented on SPARK-16569:


Agreed with [~robert3005]. Another option could be to just use PyPy, which we 
already support.
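
As context for the {{dataframe.rdd.map(...)}} pattern quoted below, a small PySpark 
sketch (the column name {{x}} is made up) of why it is slow and what the 
DataFrame-only alternative looks like:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").appName("row-vs-column").getOrCreate()
df = spark.range(1000).withColumnRenamed("id", "x")

# Each record is shipped to the Python worker and wrapped in a Row object
# before the map() function runs.
doubled_rdd = df.rdd.map(lambda row: row.x * 2)

# The same computation as a column expression stays in the JVM; no Python Row
# objects are created per record.
doubled_df = df.select((F.col("x") * 2).alias("x2"))
{code}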

> Use Cython to speed up Pyspark internals
> 
>
> Key: SPARK-16569
> URL: https://issues.apache.org/jira/browse/SPARK-16569
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Maciej Bryński
>Priority: Minor
>
> CC: [~davies]
> Many operations I do are like:
> {code}
> dataframe.rdd.map(some_function)
> {code}
> In PySpark this means creating a Row object for every record, and this is slow.
> IDEA:
> Use Cython to speed up Pyspark internals
> What do you think ?
> Sample profile:
> {code}
> 
> Profile of 

[jira] [Commented] (SPARK-13286) JDBC driver doesn't report full exception

2016-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428762#comment-15428762
 ] 

Apache Spark commented on SPARK-13286:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/14722

> JDBC driver doesn't report full exception
> -
>
> Key: SPARK-13286
> URL: https://issues.apache.org/jira/browse/SPARK-13286
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>Assignee: Davies Liu
>Priority: Minor
>
> When testing some failure scenarios (inserting data into PostgreSQL where there is 
> a schema mismatch), an exception is thrown (fine so far); however, it 
> doesn't report the actual SQL error. It refers to a getNextException call, 
> but this is beyond my non-existent Java skills to deal with correctly. 
> Supporting this would help users see the SQL error quickly and resolve the 
> underlying problem.
> {noformat}
> Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO core 
> VALUES('5fdf5...',) was aborted.  Call getNextException to see the cause.
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:248)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13342) Cannot run INSERT statements in Spark

2016-08-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428753#comment-15428753
 ] 

Dongjoon Hyun commented on SPARK-13342:
---

Hi, all.
Just to bring this issue up to date, the following is the result on Spark 2.0.0.
{code}
scala> sql("create table x(a int)")
scala> sql("select * from x").show
+---+
|  a|
+---+
+---+
scala> sql("insert into x values 1")
scala> sql("select * from x").show
+---+
|  a|
+---+
|  1|
+---+
{code}

> Cannot run INSERT statements in Spark
> -
>
> Key: SPARK-13342
> URL: https://issues.apache.org/jira/browse/SPARK-13342
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0
>Reporter: neo
>
> I cannot run an INSERT statement using spark-sql. I tried with both versions 
> 1.5.1 and 1.6.0 without any luck, but it runs OK on Hive.
> These are the steps I took.
> 1) Launch hive and create the table / insert a record.
> create database test
> use test
> CREATE TABLE stgTable
> (
> sno string,
> total bigint
> );
> INSERT INTO TABLE stgTable VALUES ('12',12)
> 2) Launch spark-sql (1.5.1 or 1.6.0)
> 3) Try inserting a record from the shell
> INSERT INTO table stgTable SELECT 'sno2',224 from stgTable limit 1
> I got this error message 
> "Invalid method name: 'alter_table_with_cascade'"
> I tried changing the Hive version inside the spark-sql shell using the SET 
> command.
> I changed the Hive version
> from
> SET spark.sql.hive.version=1.2.1  (this is the default setting for my spark 
> installation)
> to
> SET spark.sql.hive.version=0.14.0
> but that did not help either



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-19 Thread Xin Ren (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428751#comment-15428751
 ] 

Xin Ren commented on SPARK-17157:
-

I guess a lot more ML algorithms are still missing R wrappers?

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML  has been 
> merged to Master. I open this JIRA for discussion of adding SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-19 Thread Iaroslav Zeigerman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428745#comment-15428745
 ] 

Iaroslav Zeigerman commented on SPARK-17024:


The issue occurs only when reading the dataset from the CSV format.
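
For reference, the usual way to refer to a literal dotted column name is to escape 
it with backticks, as in this small PySpark sketch using the data from the issue 
description (although, per the reopen comment on this issue, that reportedly no 
longer helps in the CSV case):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()
df = spark.createDataFrame([("user1", "task1"), ("user2", "task2")],
                           ["user", "user.task"])

# Backticks tell the analyzer that "user.task" is a single column name rather
# than a struct field access on the "user" column.
df.select(df["user"], df["`user.task`"]).show()
{code}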

> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of the segments in the name is the same 
> as another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the schema 
> resolution seems to be correct:
> {noformat}
> root
>  |-- user: string (nullable = true)
>  |-- user.task: string (nullable = true)
> {noformat}
> But when I try to query this DataFrame, e.g.:
> {code}
>   df.select(df("user"), df("user.task"))
> {code}
> Spark throws an exception "Can't extract value from user#2;" 
> It happens during the resolution of the LogicalPlan while processing the  
> "user.task" column.
> Here is the full stacktrace:
> {noformat}
> Can't extract value from user#2;
> org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
>   at 
> org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
> {noformat}
> Is this actually an expected behaviour? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-19 Thread Iaroslav Zeigerman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iaroslav Zeigerman updated SPARK-17024:
---
Affects Version/s: (was: 1.6.0)
   2.0.0

> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of the segments in the name is the same 
> as another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the schema 
> resolution seems to be correct:
> {noformat}
> root
>  |-- user: string (nullable = true)
>  |-- user.task: string (nullable = true)
> {noformat}
> But when I try to query this DataFrame, e.g.:
> {code}
>   df.select(df("user"), df("user.task"))
> {code}
> Spark throws an exception "Can't extract value from user#2;" 
> It happens during the resolution of the LogicalPlan while processing the  
> "user.task" column.
> Here is the full stacktrace:
> {noformat}
> Can't extract value from user#2;
> org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
>   at 
> org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
> {noformat}
> Is this actually an expected behaviour? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-17024) Weird behaviour of the DataFrame when a column name contains dots.

2016-08-19 Thread Iaroslav Zeigerman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Iaroslav Zeigerman reopened SPARK-17024:


The issue occurs in Spark 2.0.0. Now it's even worse: I can't even get an RDD 
from a DataFrame. Backquotes don't help any more.

> Weird behaviour of the DataFrame when a column name contains dots.
> --
>
> Key: SPARK-17024
> URL: https://issues.apache.org/jira/browse/SPARK-17024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Iaroslav Zeigerman
>
> When a column name contains dots and one of the segments in the name is the same 
> as another column's name, Spark treats this column as a nested structure, 
> although the actual type of the column is String/Int/etc. Example:
> {code}
>   val df = sqlContext.createDataFrame(Seq(
> ("user1", "task1"),
> ("user2", "task2")
>   )).toDF("user", "user.task")
> {code}
> Two columns "user" and "user.task". Both of them are string, and the schema 
> resolution seems to be correct:
> {noformat}
> root
>  |-- user: string (nullable = true)
>  |-- user.task: string (nullable = true)
> {noformat}
> But when I try to query this DataFrame, e.g.:
> {code}
>   df.select(df("user"), df("user.task"))
> {code}
> Spark throws an exception "Can't extract value from user#2;" 
> It happens during the resolution of the LogicalPlan while processing the  
> "user.task" column.
> Here is the full stacktrace:
> {noformat}
> Can't extract value from user#2;
> org.apache.spark.sql.AnalysisException: Can't extract value from user#2;
>   at 
> org.apache.spark.sql.catalyst.expressions.ExtractValue$.apply(complexTypeExtractors.scala:73)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:276)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$4.apply(LogicalPlan.scala:275)
>   at 
> scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
>   at scala.collection.immutable.List.foldLeft(List.scala:84)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:275)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:191)
>   at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:151)
>   at org.apache.spark.sql.DataFrame.col(DataFrame.scala:708)
>   at org.apache.spark.sql.DataFrame.apply(DataFrame.scala:696)
> {noformat}
> Is this actually an expected behaviour? 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17160) GetExternalRowField does not properly escape field names, causing generated code not to compile

2016-08-19 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17160:
--

 Summary: GetExternalRowField does not properly escape field names, 
causing generated code not to compile
 Key: SPARK-17160
 URL: https://issues.apache.org/jira/browse/SPARK-17160
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Josh Rosen
Priority: Critical


The following end-to-end test uncovered a bug in {{GetExternalRowField}}:

{code}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.catalyst.encoders._

spark.sql("set spark.sql.codegen.fallback=false")

val df = Seq(("100-200", "1", "300")).toDF("a", "b", "c")
val df2 = df.select(regexp_replace($"a", "(\\d+)", "num"))
df2.mapPartitions(x => x)(RowEncoder(df2.schema)).collect()
{code}

This causes

{code}
java.lang.Exception: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, 
Column 64: Invalid escape sequence
{code}

The generated code is

{code}
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIterator(references);
/* 003 */ }
/* 004 */
/* 005 */ final class GeneratedIterator extends 
org.apache.spark.sql.execution.BufferedRowIterator {
/* 006 */   private Object[] references;
/* 007 */   private scala.collection.Iterator inputadapter_input;
/* 008 */   private java.lang.String serializefromobject_errMsg;
/* 009 */   private java.lang.String serializefromobject_errMsg1;
/* 010 */   private UnsafeRow serializefromobject_result;
/* 011 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder 
serializefromobject_holder;
/* 012 */   private 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter 
serializefromobject_rowWriter;
/* 013 */
/* 014 */   public GeneratedIterator(Object[] references) {
/* 015 */ this.references = references;
/* 016 */   }
/* 017 */
/* 018 */   public void init(int index, scala.collection.Iterator inputs[]) {
/* 019 */ partitionIndex = index;
/* 020 */ inputadapter_input = inputs[0];
/* 021 */ this.serializefromobject_errMsg = (java.lang.String) 
references[0];
/* 022 */ this.serializefromobject_errMsg1 = (java.lang.String) 
references[1];
/* 023 */ serializefromobject_result = new UnsafeRow(1);
/* 024 */ this.serializefromobject_holder = new 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result,
 32);
/* 025 */ this.serializefromobject_rowWriter = new 
org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder,
 1);
/* 026 */   }
/* 027 */
/* 028 */   protected void processNext() throws java.io.IOException {
/* 029 */ while (inputadapter_input.hasNext()) {
/* 030 */   InternalRow inputadapter_row = (InternalRow) 
inputadapter_input.next();
/* 031 */   org.apache.spark.sql.Row inputadapter_value = 
(org.apache.spark.sql.Row)inputadapter_row.get(0, null);
/* 032 */
/* 033 */   if (false) {
/* 034 */ throw new RuntimeException(serializefromobject_errMsg);
/* 035 */   }
/* 036 */
/* 037 */   boolean serializefromobject_isNull1 = false || false;
/* 038 */   final boolean serializefromobject_value1 = 
serializefromobject_isNull1 ? false : inputadapter_value.isNullAt(0);
/* 039 */   boolean serializefromobject_isNull = false;
/* 040 */   UTF8String serializefromobject_value = null;
/* 041 */   if (!serializefromobject_isNull1 && serializefromobject_value1) 
{
/* 042 */ final UTF8String serializefromobject_value5 = null;
/* 043 */ serializefromobject_isNull = true;
/* 044 */ serializefromobject_value = serializefromobject_value5;
/* 045 */   } else {
/* 046 */ if (false) {
/* 047 */   throw new RuntimeException(serializefromobject_errMsg1);
/* 048 */ }
/* 049 */
/* 050 */ if (false) {
/* 051 */   throw new RuntimeException("The input external row cannot 
be null.");
/* 052 */ }
/* 053 */
/* 054 */ if (inputadapter_value.isNullAt(0)) {
/* 055 */   throw new RuntimeException("The 0th field 
'regexp_replace(a, (\d+), num)' of input row " +
/* 056 */ "cannot be null.");
/* 057 */ }
/* 058 */
/* 059 */ final Object serializefromobject_value8 = 
inputadapter_value.get(0);
/* 060 */ java.lang.String serializefromobject_value7 = null;
/* 061 */ if (!false) {
/* 062 */   if (serializefromobject_value8 instanceof java.lang.String) 
{
/* 063 */ serializefromobject_value7 = (java.lang.String) 
serializefromobject_value8;
/* 064 */   } else {
/* 065 */ throw new 
RuntimeException(serializefromobject_value8.getClass().getName() + " is not a 
valid " +
/* 066 */   "external type for schema of string");
/* 067 */   }
/* 068 */ }
/* 069 */ boolean 

[jira] [Commented] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance

2016-08-19 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428677#comment-15428677
 ] 

Steve Loughran commented on SPARK-17159:


# The most minimal change is to get rid of that directoryFilter and just apply it 
as a filter on the returned list of generated FileStatus entries. Against s3a, 
that will eliminate 1-4 HTTP calls per path which matches the pattern. It will 
mean a larger array of FileStatus entries is returned, but otherwise has no real 
downside.
# It *may* be possible to go further and have the glob also pick up all files 
underneath. That may or may not provide a speedup.
# The listStatus() operation can be sped up by having its file filter 
executed after the call (there is then an existing FileStatus entry, so there is 
no need to go near the FS to get the modification time).

> Improve FileInputDStream.findNewFiles list performance
> --
>
> Key: SPARK-17159
> URL: https://issues.apache.org/jira/browse/SPARK-17159
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
> Environment: spark against object stores
>Reporter: Steve Loughran
>Priority: Minor
>
> {{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that 
> calls getFileStatus() on every file, takes the output, and does listStatus() 
> on the output.
> This is going to suffer on object stores, as directory listing and getFileStatus 
> calls are so expensive. It's clear this is a problem, as the method has code to 
> detect timeouts in the window and warn of problems.
> It should be possible to make this faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17159) Improve FileInputDStream.findNewFiles list performance

2016-08-19 Thread Steve Loughran (JIRA)
Steve Loughran created SPARK-17159:
--

 Summary: Improve FileInputDStream.findNewFiles list performance
 Key: SPARK-17159
 URL: https://issues.apache.org/jira/browse/SPARK-17159
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 2.0.0
 Environment: spark against object stores
Reporter: Steve Loughran
Priority: Minor


{{FileInputDStream.findNewFiles()}} is doing a globStatus with a filter that 
calls getFileStatus() on every file, takes the output, and does listStatus() on 
the output.

This is going to suffer on object stores, as directory listing and getFileStatus 
calls are so expensive. It's clear this is a problem, as the method has code to 
detect timeouts in the window and warn of problems.

It should be possible to make this faster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10746) count ( distinct columnref) over () returns wrong result set

2016-08-19 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428617#comment-15428617
 ] 

Dongjoon Hyun commented on SPARK-10746:
---

Just as an update, Spark 2.0 now raises an exception for this case with an 
explicit error message "Distinct window functions are not supported".
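
As an aside (not from the ticket), one common way to get the equivalent result in 
PySpark is to compute the distinct count once and join it back on a constant key, 
for example:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[2]").getOrCreate()
t1 = spark.createDataFrame([(1,), (1,), (2,), (3,), (3,)], ["column"])

# Equivalent of "select count(distinct column) over () from t1": compute the
# distinct count once, then attach it to every row via a constant join key.
distinct_cnt = t1.agg(F.countDistinct("column").alias("cnt"))
result = (t1.withColumn("_k", F.lit(1))
            .join(distinct_cnt.withColumn("_k", F.lit(1)), "_k")
            .drop("_k"))
result.show()  # five rows, each carrying cnt = 3
{code}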

> count ( distinct columnref) over () returns wrong result set
> 
>
> Key: SPARK-10746
> URL: https://issues.apache.org/jira/browse/SPARK-10746
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>
> Same issue as reported against Hive (HIVE-9534). 
> The result set was expected to contain 5 rows instead of 1 row, as other vendors 
> (Oracle, Netezza, etc.) would return.
> select count( distinct column) over () from t1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17113) Job failure due to Executor OOM in offheap mode

2016-08-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-17113:
---
Assignee: Sital Kedia

> Job failure due to Executor OOM in offheap mode
> ---
>
> Key: SPARK-17113
> URL: https://issues.apache.org/jira/browse/SPARK-17113
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Sital Kedia
>Assignee: Sital Kedia
> Fix For: 2.0.1, 2.1.0
>
>
> We have been seeing many job failures due to executor OOM with the following 
> stack trace:
> {code}
> java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Digging into the code, we found out that this is an issue with cooperative 
> memory management for off-heap memory allocation. 
> In the code 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L463,
>  when the UnsafeExternalSorter is checking whether a memory page is being used 
> upstream, the base object in the case of off-heap memory is always null, so the 
> UnsafeExternalSorter does not spill the memory pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17113) Job failure due to Executor OOM in offheap mode

2016-08-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-17113.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Job failure due to Executor OOM in offheap mode
> ---
>
> Key: SPARK-17113
> URL: https://issues.apache.org/jira/browse/SPARK-17113
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Sital Kedia
>Assignee: Sital Kedia
> Fix For: 2.0.1, 2.1.0
>
>
> We have been seeing many job failures due to executor OOM with the following 
> stack trace:
> {code}
> java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0
>   at 
> org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:90)
>   at org.apache.spark.sql.execution.Sort$$anonfun$1.apply(Sort.scala:64)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$21.apply(RDD.scala:736)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:307)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:271)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Digging into the code, we found out that this is an issue with cooperative 
> memory management for off-heap memory allocation.
> In the code 
> https://github.com/sitalkedia/spark/blob/master/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L463,
> when the UnsafeExternalSorter checks whether a memory page is still being used 
> by the upstream consumer, the base object is always null for off-heap memory, 
> so the UnsafeExternalSorter never spills those memory pages.
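For illustration only, a minimal, hypothetical Scala sketch of the failure mode described above; the real logic lives in UnsafeExternalSorter.java and uses different names and types, so treat this purely as a model of why a base-object identity check degenerates for off-heap pages:

{code}
// Simplified sketch, NOT the actual Spark code: a reference comparison on
// base objects decides which pages are still in use by the upstream reader.
case class MemoryPage(baseObject: AnyRef)   // baseObject is null for off-heap pages

// Returns the pages that are safe to spill, i.e. everything except the
// upstream reader's current page.
def pagesSafeToSpill(pages: Seq[MemoryPage], upstreamPage: MemoryPage): Seq[MemoryPage] =
  pages.filterNot(_.baseObject eq upstreamPage.baseObject)

// On-heap: base objects are distinct, so only the upstream's page is retained.
// Off-heap: every baseObject is null, the `eq` check matches every page,
// and pagesSafeToSpill returns Nil -- nothing is ever spilled.
{code}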



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17158) Improve error message for numeric literal parsing

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17158:


Assignee: Apache Spark

> Improve error message for numeric literal parsing
> -
>
> Key: SPARK-17158
> URL: https://issues.apache.org/jira/browse/SPARK-17158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Srinath
>Assignee: Apache Spark
>Priority: Minor
>
> Spark currently gives confusing and inconsistent error messages for numeric 
> literals. For example:
> scala> sql("select 123456Y")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
> == SQL ==
> select 123456Y
> ---^^^
> scala> sql("select 123456S")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
> == SQL ==
> select 123456S
> ---^^^
> scala> sql("select 12345623434523434564565L")
> org.apache.spark.sql.catalyst.parser.ParseException:
> For input string: "12345623434523434564565"(line 1, pos 7)
> == SQL ==
> select 12345623434523434564565L
> ---^^^
> The problem is that we are relying on the JDK's implementations for parsing, and 
> those functions throw different error messages. This code can be found in 
> the AstBuilder.numericLiteral function.
> The proposal is that instead of using `_.toByte` to turn a string into a 
> byte, we always turn the numeric literal string into a BigDecimal, and then 
> we validate the range before turning it into a numeric value. This way, we 
> have more control over the data.
> If BigDecimal fails to parse the number, we should throw a better exception 
> than "For input string ...".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17158) Improve error message for numeric literal parsing

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17158:


Assignee: (was: Apache Spark)

> Improve error message for numeric literal parsing
> -
>
> Key: SPARK-17158
> URL: https://issues.apache.org/jira/browse/SPARK-17158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Srinath
>Priority: Minor
>
> Spark currently gives confusing and inconsistent error messages for numeric 
> literals. For example:
> scala> sql("select 123456Y")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
> == SQL ==
> select 123456Y
> ---^^^
> scala> sql("select 123456S")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
> == SQL ==
> select 123456S
> ---^^^
> scala> sql("select 12345623434523434564565L")
> org.apache.spark.sql.catalyst.parser.ParseException:
> For input string: "12345623434523434564565"(line 1, pos 7)
> == SQL ==
> select 12345623434523434564565L
> ---^^^
> The problem is that we are relying on the JDK's implementations for parsing, and 
> those functions throw different error messages. This code can be found in 
> the AstBuilder.numericLiteral function.
> The proposal is that instead of using `_.toByte` to turn a string into a 
> byte, we always turn the numeric literal string into a BigDecimal, and then 
> we validate the range before turning it into a numeric value. This way, we 
> have more control over the data.
> If BigDecimal fails to parse the number, we should throw a better exception 
> than "For input string ...".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17158) Improve error message for numeric literal parsing

2016-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428587#comment-15428587
 ] 

Apache Spark commented on SPARK-17158:
--

User 'srinathshankar' has created a pull request for this issue:
https://github.com/apache/spark/pull/14721

> Improve error message for numeric literal parsing
> -
>
> Key: SPARK-17158
> URL: https://issues.apache.org/jira/browse/SPARK-17158
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Srinath
>Priority: Minor
>
> Spark currently gives confusing and inconsistent error messages for numeric 
> literals. For example:
> scala> sql("select 123456Y")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
> == SQL ==
> select 123456Y
> ---^^^
> scala> sql("select 123456S")
> org.apache.spark.sql.catalyst.parser.ParseException:
> Value out of range. Value:"123456" Radix:10(line 1, pos 7)
> == SQL ==
> select 123456S
> ---^^^
> scala> sql("select 12345623434523434564565L")
> org.apache.spark.sql.catalyst.parser.ParseException:
> For input string: "12345623434523434564565"(line 1, pos 7)
> == SQL ==
> select 12345623434523434564565L
> ---^^^
> The problem is that we are relying on the JDK's implementations for parsing, and 
> those functions throw different error messages. This code can be found in 
> the AstBuilder.numericLiteral function.
> The proposal is that instead of using `_.toByte` to turn a string into a 
> byte, we always turn the numeric literal string into a BigDecimal, and then 
> we validate the range before turning it into a numeric value. This way, we 
> have more control over the data.
> If BigDecimal fails to parse the number, we should throw a better exception 
> than "For input string ...".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-13286) JDBC driver doesn't report full exception

2016-08-19 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-13286:
--

Assignee: Davies Liu

> JDBC driver doesn't report full exception
> -
>
> Key: SPARK-13286
> URL: https://issues.apache.org/jira/browse/SPARK-13286
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Adrian Bridgett
>Assignee: Davies Liu
>Priority: Minor
>
> When testing some failure scenarios (inserting data into PostgreSQL where there 
> is a schema mismatch), an exception is thrown (fine so far), but it doesn't 
> report the actual SQL error.  It refers to a getNextException call, which is 
> beyond my non-existent Java skills to deal with correctly.  
> Supporting this would help users see the SQL error quickly and resolve the 
> underlying problem.
> {noformat}
> Caused by: java.sql.BatchUpdateException: Batch entry 0 INSERT INTO core 
> VALUES('5fdf5...',) was aborted.  Call getNextException to see the cause.
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement$BatchResultHandler.handleError(AbstractJdbc2Statement.java:2746)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl$1.handleError(QueryExecutorImpl.java:457)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1887)
>   at 
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:405)
>   at 
> org.postgresql.jdbc2.AbstractJdbc2Statement.executeBatch(AbstractJdbc2Statement.java:2893)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:185)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:248)
>   at 
> org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:247)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at 
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {noformat}
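Not part of the original report -- just a hedged Scala sketch of how the chained exception could be surfaced, assuming the caller can catch the java.sql.BatchUpdateException thrown around executeBatch(); the method name is illustrative:

{code}
import java.sql.BatchUpdateException

// Walk the SQLException chain via getNextException (for PostgreSQL this
// carries the actual server-side error) and join the messages for logging.
def describeBatchFailure(e: BatchUpdateException): String = {
  val chained = Iterator.iterate(e.getNextException)(_.getNextException)
    .takeWhile(_ != null)
    .map(_.getMessage)
    .toSeq
  (e.getMessage +: chained).mkString("\n  caused by: ")
}
{code}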



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17158) Improve error message for numeric literal parsing

2016-08-19 Thread Srinath (JIRA)
Srinath created SPARK-17158:
---

 Summary: Improve error message for numeric literal parsing
 Key: SPARK-17158
 URL: https://issues.apache.org/jira/browse/SPARK-17158
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Srinath
Priority: Minor


Spark currently gives confusing and inconsistent error messages for numeric 
literals. For example:
scala> sql("select 123456Y")
org.apache.spark.sql.catalyst.parser.ParseException:
Value out of range. Value:"123456" Radix:10(line 1, pos 7)

== SQL ==
select 123456Y
---^^^
scala> sql("select 123456S")
org.apache.spark.sql.catalyst.parser.ParseException:
Value out of range. Value:"123456" Radix:10(line 1, pos 7)

== SQL ==
select 123456S
---^^^
scala> sql("select 12345623434523434564565L")
org.apache.spark.sql.catalyst.parser.ParseException:
For input string: "12345623434523434564565"(line 1, pos 7)

== SQL ==
select 12345623434523434564565L
---^^^
The problem is that we are relying on the JDK's implementations for parsing, and 
those functions throw different error messages. This code can be found in 
the AstBuilder.numericLiteral function.
The proposal is that instead of using `_.toByte` to turn a string into a byte, 
we always turn the numeric literal string into a BigDecimal, and then we 
validate the range before turning it into a numeric value. This way, we have 
more control over the data.
If BigDecimal fails to parse the number, we should throw a better exception 
than "For input string ...".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled

2016-08-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-15382:

Fix Version/s: 2.1.0
   2.0.1

> monotonicallyIncreasingId doesn't work when data is upsampled
> -
>
> Key: SPARK-15382
> URL: https://issues.apache.org/jira/browse/SPARK-15382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Mateusz Buśkiewicz
> Fix For: 2.0.1, 2.1.0
>
>
> Assigned ids are not unique
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import monotonicallyIncreasingId
> hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 
> 10.0).withColumn('id', monotonicallyIncreasingId()).collect()
> {code}
> Output:
> {code}
> [Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16686) Dataset.sample with seed: result seems to depend on downstream usage

2016-08-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-16686:

Fix Version/s: 2.0.1

> Dataset.sample with seed: result seems to depend on downstream usage
> 
>
> Key: SPARK-16686
> URL: https://issues.apache.org/jira/browse/SPARK-16686
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
> Environment: Spark 1.6.2 and Spark 2.0 - RC4
> Standalone
> Single-worker cluster
>Reporter: Joseph K. Bradley
>Assignee: Liang-Chi Hsieh
> Fix For: 2.0.1, 2.1.0
>
> Attachments: DataFrame.sample bug - 2.0.html
>
>
> Summary to reproduce bug:
> * Create a DataFrame DF, and sample it with a fixed seed.
> * Collect that DataFrame -> result1
> * Call a particular UDF on that DataFrame -> result2
> You would expect results 1 and 2 to use the same rows from DF, but they 
> appear not to.
> Note: result1 and result2 are both deterministic.
> See the attached notebook for details.  Cells in the notebook were executed 
> in order.
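For readers without access to the attachment, a hedged sketch of the reproduction shape described above (assumes a Spark 2.x spark-shell where {{spark}} is in scope; the UDF here is a placeholder, the notebook has the exact one):

{code}
import org.apache.spark.sql.functions.udf

val df = spark.range(0, 1000).toDF("x")
val sampled = df.sample(withReplacement = false, fraction = 0.1, seed = 42L)

// result1: the sampled rows, collected directly.
val result1 = sampled.collect()

// result2: the same sampled DataFrame with a UDF applied, then collected.
// Per this report, the underlying rows can differ from result1 even though
// the seed is fixed and both results are individually deterministic.
val plusOne = udf((x: Long) => x + 1L)
val result2 = sampled.withColumn("y", plusOne(sampled("x"))).collect()
{code}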



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14381) Review spark.ml parity for feature transformers

2016-08-19 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428557#comment-15428557
 ] 

Xusen Yin commented on SPARK-14381:
---

I believe we can resolve this.

> Review spark.ml parity for feature transformers
> ---
>
> Key: SPARK-14381
> URL: https://issues.apache.org/jira/browse/SPARK-14381
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality. List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10401) spark-submit --unsupervise

2016-08-19 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428572#comment-15428572
 ] 

Michael Gummelt commented on SPARK-10401:
-

This should probably be a separate JIRA, but I'm just adding a note here that 
{{--kill}} doesn't seem to kill the job immediately.  It invokes Mesos' 
{{killTask}} function, which runs a {{docker stop}} for docker images.  This 
sends a SIGTERM, which seems to be ignored, then sends a SIGKILL after 10s, 
which ultimately kills the job.  I'd like to find out why the SIGTERM is 
ignored.

> spark-submit --unsupervise 
> ---
>
> Key: SPARK-10401
> URL: https://issues.apache.org/jira/browse/SPARK-10401
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy, Mesos
>Affects Versions: 1.5.0
>Reporter: Alberto Miorin
>
> When I submit a streaming job with the option --supervise to the new Mesos 
> Spark dispatcher, I cannot decommission the job.
> I tried spark-submit --kill, but the dispatcher always restarts it.
> The driver and executors are both Docker containers.
> I think there should be a subcommand spark-submit --unsupervise 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-19 Thread Miao Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Miao Wang updated SPARK-17157:
--
Component/s: SparkR

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Miao Wang
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been 
> merged to master. I opened this JIRA to discuss adding a SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12868) ADD JAR via sparkSQL JDBC will fail when using a HDFS URL

2016-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428519#comment-15428519
 ] 

Apache Spark commented on SPARK-12868:
--

User 'Parth-Brahmbhatt' has created a pull request for this issue:
https://github.com/apache/spark/pull/14720

> ADD JAR via sparkSQL JDBC will fail when using a HDFS URL
> -
>
> Key: SPARK-12868
> URL: https://issues.apache.org/jira/browse/SPARK-12868
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Trystan Leftwich
>
> When trying to add a jar with a HDFS URI, i.E
> {code:sql}
> ADD JAR hdfs:///tmp/foo.jar
> {code}
> Via the spark sql JDBC interface it will fail with:
> {code:sql}
> java.net.MalformedURLException: unknown protocol: hdfs
> at java.net.URL.(URL.java:593)
> at java.net.URL.(URL.java:483)
> at java.net.URL.(URL.java:432)
> at java.net.URI.toURL(URI.java:1089)
> at 
> org.apache.spark.sql.hive.client.ClientWrapper.addJar(ClientWrapper.scala:578)
> at org.apache.spark.sql.hive.HiveContext.addJar(HiveContext.scala:652)
> at org.apache.spark.sql.hive.execution.AddJar.run(commands.scala:89)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at 
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:145)
> at org.apache.spark.sql.DataFrame.(DataFrame.scala:130)
> at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
> at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:817)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:211)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:154)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1$$anon$2.run(SparkExecuteStatementOperation.scala:151)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
> at 
> org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$1.run(SparkExecuteStatementOperation.scala:164)
> at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
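Not from the report or the linked PR -- just one commonly suggested workaround for the {{unknown protocol: hdfs}} error, as a hedged sketch: register Hadoop's URL stream handler factory (once per JVM, before anything else installs a factory) so that {{java.net.URL}} can parse {{hdfs://}} URIs:

{code}
import java.net.URL
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

// setURLStreamHandlerFactory may only be called once per JVM; it throws an
// Error if a factory is already installed, so guard accordingly in real code.
URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory())

// After registration this no longer throws MalformedURLException.
val jarUrl = new URL("hdfs:///tmp/foo.jar")
{code}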



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428518#comment-15428518
 ] 

Miao Wang commented on SPARK-17157:
---

[~felixcheung] Shall we add it to SparkR? I opened this JIRA for discussion. 
Thanks!

> Add multiclass logistic regression SparkR Wrapper
> -
>
> Key: SPARK-17157
> URL: https://issues.apache.org/jira/browse/SPARK-17157
> Project: Spark
>  Issue Type: New Feature
>Reporter: Miao Wang
>
> [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been 
> merged to master. I opened this JIRA to discuss adding a SparkR wrapper 
> for multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17157) Add multiclass logistic regression SparkR Wrapper

2016-08-19 Thread Miao Wang (JIRA)
Miao Wang created SPARK-17157:
-

 Summary: Add multiclass logistic regression SparkR Wrapper
 Key: SPARK-17157
 URL: https://issues.apache.org/jira/browse/SPARK-17157
 Project: Spark
  Issue Type: New Feature
Reporter: Miao Wang


[SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been 
merged to master. I opened this JIRA to discuss adding a SparkR wrapper for 
multiclass logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17156) Add multiclass logistic regression Scala Example

2016-08-19 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428509#comment-15428509
 ] 

Miao Wang commented on SPARK-17156:
---

I will submit a PR soon.

> Add multiclass logistic regression Scala Example
> 
>
> Key: SPARK-17156
> URL: https://issues.apache.org/jira/browse/SPARK-17156
> Project: Spark
>  Issue Type: Task
>  Components: ML
>Reporter: Miao Wang
>
> As [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been 
> merged to master, we should add a Scala example of using multiclass logistic 
> regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17156) Add multiclass logistic regression Scala Example

2016-08-19 Thread Miao Wang (JIRA)
Miao Wang created SPARK-17156:
-

 Summary: Add multiclass logistic regression Scala Example
 Key: SPARK-17156
 URL: https://issues.apache.org/jira/browse/SPARK-17156
 Project: Spark
  Issue Type: Task
  Components: ML
Reporter: Miao Wang


As [SPARK-7159][ML] Add multiclass logistic regression to Spark ML has been 
merged to master, we should add a Scala example of using multiclass logistic 
regression.
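A rough sketch of what such an example might look like (assumptions: the {{family}} param introduced by SPARK-7159 and the libsvm sample data bundled in the Spark source tree; the example eventually committed under {{examples/}} may differ):

{code}
import org.apache.spark.ml.classification.LogisticRegression

// Assumes a SparkSession named `spark` (e.g. spark-shell) and the sample
// multiclass dataset shipped with the Spark sources.
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

val lr = new LogisticRegression()
  .setFamily("multinomial")   // multiclass (softmax) logistic regression
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val model = lr.fit(training)
println(s"Coefficients:\n${model.coefficientMatrix}")
println(s"Intercepts: ${model.interceptVector}")
{code}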



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError

2016-08-19 Thread Mikael Valot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikael Valot updated SPARK-17155:
-
Description: 
The following code throws an exception in the spark shell:

{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}

It looks like a different ClassLoader is involved and cannot load my case class.
However, it works fine with a Tuple:
{code:java}
val fut = Future{ Seq((1, 2)).toDS() }
Await.result(fut, Duration.Inf).show()
+---+---+   
| _1| _2|
+---+---+
|  1|  2|
+---+---+

{code}

  was:
The following code throws an exception in the spark shell:

{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 

[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError

2016-08-19 Thread Mikael Valot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikael Valot updated SPARK-17155:
-
Description: 
The following code throws an exception in the spark shell:

{code:scala}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}


  was:
The following code throws an exception in the spark shell:

{{
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
}}

{{
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 

[jira] [Updated] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError

2016-08-19 Thread Mikael Valot (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikael Valot updated SPARK-17155:
-
Description: 
The following code throws an exception in the spark shell:

{code:java}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
{code}


  was:
The following code throws an exception in the spark shell:

{code:scala}
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
{code}

{code}
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 

[jira] [Created] (SPARK-17155) usage of a Dataset inside a Future throws MissingRequirementError

2016-08-19 Thread Mikael Valot (JIRA)
Mikael Valot created SPARK-17155:


 Summary: usage of a Dataset inside a Future throws 
MissingRequirementError
 Key: SPARK-17155
 URL: https://issues.apache.org/jira/browse/SPARK-17155
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.6.1
Reporter: Mikael Valot


The following code throws an exception in the spark shell:

{{
case class A(i1: Int, i2: Int) 

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.concurrent.Await
import scala.concurrent.duration.Duration

import sqlContext.implicits._
import org.apache.spark.sql.functions._

val fut = Future{ Seq(A(1, 2)).toDS() }

Await.result(fut, Duration.Inf).show()
}}

{{
scala.reflect.internal.MissingRequirementError: object $line8.$read not found.
at 
scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at 
scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at 
scala.reflect.internal.Mirrors$RootsBase.ensureModuleSymbol(Mirrors.scala:126)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:161)
at 
scala.reflect.internal.Mirrors$RootsBase.staticModule(Mirrors.scala:21)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1$$typecreator1$1.apply(:70)
at 
scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:231)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:231)
at 
org.apache.spark.sql.catalyst.ScalaReflection$class.localTypeOf(ScalaReflection.scala:654)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.localTypeOf(ScalaReflection.scala:30)
at 
org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:52)
at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:53)
at 
org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(:70)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at 
scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at 
scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
}}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-16152) `In` predicate does not work with null values

2016-08-19 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-16152.
-
Resolution: Invalid

Hi, [~fushar]. 
This seems to be a SQL question, and [~kevinyu98] is right: Spark, PostgreSQL, and 
MySQL are all consistent here, and `NULL IN (NULL)` evaluates to NULL. Please run 
the following query; the result is `TRUE` on all of the above SQL engines.
{code}
SELECT (NULL IN (NULL)) IS NULL
{code}

> `In` predicate does not work with null values
> -
>
> Key: SPARK-16152
> URL: https://issues.apache.org/jira/browse/SPARK-16152
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Ashar Fuadi
>
> According to 
> https://github.com/apache/spark/blob/v1.6.1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala#L134..L136:
> {code}
>  override def eval(input: InternalRow): Any = {
> val evaluatedValue = value.eval(input)
> if (evaluatedValue == null) {
>   null
> } else {
>   ...
> {code}
> we always return {{null}} when the current value is null, ignoring the 
> elements of {{list}}. Therefore, we cannot have a predicate which tests 
> whether a column contains values in e.g. {{[1, 2, 3, null]}}
> Is this a bug, or is this actually the expected behavior?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15382) monotonicallyIncreasingId doesn't work when data is upsampled

2016-08-19 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin closed SPARK-15382.
---
Resolution: Fixed

> monotonicallyIncreasingId doesn't work when data is upsampled
> -
>
> Key: SPARK-15382
> URL: https://issues.apache.org/jira/browse/SPARK-15382
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Mateusz Buśkiewicz
>
> Assigned ids are not unique
> {code}
> from pyspark.sql import Row
> from pyspark.sql.functions import monotonicallyIncreasingId
> hiveContext.createDataFrame([Row(a=1), Row(a=2)]).sample(True, 
> 10.0).withColumn('id', monotonicallyIncreasingId()).collect()
> {code}
> Output:
> {code}
> [Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=1, id=429496729600),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792),
>  Row(a=2, id=867583393792)]
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16197) Cleanup PySpark status api and example

2016-08-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16197?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-16197.
--
Resolution: Won't Fix

This minor change would be better addressed during a QA audit.

> Cleanup PySpark status api and example
> --
>
> Key: SPARK-16197
> URL: https://issues.apache.org/jira/browse/SPARK-16197
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Bryan Cutler
>Priority: Trivial
>
> Cleanup of Status API example to use SparkSession and be more consistent with 
> other examples.
> Changing this JIRA to just clean up the example; the other changes are just 
> code style, which is not really followed in other areas of PySpark.
> -also noticed that Status defines two empty classes without using 'pass' and 
> two methods that do not return 'None' explicitly if the requested info cannot be 
> fetched.  These issues do not cause any errors, but it is good practice to 
> use 'pass' on empty class definitions and return 'None' from a function if 
> the caller is expecting a return value.-



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15018) PySpark ML Pipeline raises unclear error when no stages set

2016-08-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-15018:
-
Description: 
When fitting a PySpark Pipeline with no stages, it should work as an identity 
transformer.  Instead the following error is raised:

{noformat}
Traceback (most recent call last):
  File "./spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
  File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
for stage in stages:
TypeError: 'NoneType' object is not iterable
{noformat}

The param {{stages}} needs to be an empty list and {{getStages}} should call 
{{getOrDefault}}.

Also, since the default value of {{None}} is then changed to an empty list 
{{[]}}, this never changes the value if passed in as a keyword argument.  
Instead, the {{kwargs}} value should be changed directly if {{stages is None}}.

For example
{noformat}
if stages is None:
stages = []
{noformat}
should be this
{noformat}
if stages is None:
kwargs['stages'] = []
{noformat}

However, since there is no default value in the Scala implementation, assigning 
a default here is not needed and should be cleaned up.  The pydocs should 
better indicate that stages is required to be a list.


  was:
When fitting a PySpark Pipeline with no stages, it should work as an identity 
transformer.  Instead the following error is raised:

{noformat}
Traceback (most recent call last):
  File "./spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
  File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
for stage in stages:
TypeError: 'NoneType' object is not iterable
{noformat}

The param {{stages}} needs to be an empty list and {{getStages}} should call 
{{getOrDefault}}.

Also, since the default value of {{None}} is then changed to an empty list 
{{[]}}, this never changes the value if passed in as a keyword argument.  
Instead, the {{kwargs}} value should be changed directly if {{stages is None}}.

For example
{noformat}
if stages is None:
stages = []
{noformat}
should be this
{noformat}
if stages is None:
kwargs['stages'] = []
{noformat}

However, since there is no default value in the Scala implementation, assigning 
a default here is not needed and should be cleaned up.



> PySpark ML Pipeline raises unclear error when no stages set
> ---
>
> Key: SPARK-15018
> URL: https://issues.apache.org/jira/browse/SPARK-15018
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> When fitting a PySpark Pipeline with no stages, it should work as an identity 
> transformer.  Instead the following error is raised:
> {noformat}
> Traceback (most recent call last):
>   File "./spark/python/pyspark/ml/base.py", line 64, in fit
> return self._fit(dataset)
>   File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
> for stage in stages:
> TypeError: 'NoneType' object is not iterable
> {noformat}
> The param {{stages}} needs to be an empty list and {{getStages}} should call 
> {{getOrDefault}}.
> Also, since the default value of {{None}} is then changed to an empty list 
> {{[]}}, this never changes the value if passed in as a keyword argument.  
> Instead, the {{kwargs}} value should be changed directly if {{stages is 
> None}}.
> For example
> {noformat}
> if stages is None:
> stages = []
> {noformat}
> should be this
> {noformat}
> if stages is None:
> kwargs['stages'] = []
> {noformat}
> However, since there is no default value in the Scala implementation, 
> assigning a default here is not needed and should be cleaned up.  The pydocs 
> should better indicate that stages is required to be a list.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15018) PySpark ML Pipeline raises unclear error when no stages set

2016-08-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-15018:
-
Description: 
When fitting a PySpark Pipeline with no stages, it should work as an identity 
transformer.  Instead the following error is raised:

{noformat}
Traceback (most recent call last):
  File "./spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
  File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
for stage in stages:
TypeError: 'NoneType' object is not iterable
{noformat}

The param {{stages}} needs to be an empty list and {{getStages}} should call 
{{getOrDefault}}.

Also, since the default value of {{None}} is then changed to an empty list 
{{[]}}, this never changes the value if passed in as a keyword argument.  
Instead, the {{kwargs}} value should be changed directly if {{stages is None}}.

For example
{noformat}
if stages is None:
stages = []
{noformat}
should be this
{noformat}
if stages is None:
kwargs['stages'] = []
{noformat}

However, since there is no default value in the Scala implementation, assigning 
a default here is not needed and should be cleaned up.


  was:
When fitting a PySpark Pipeline with no stages, it should work as an identity 
transformer.  Instead the following error is raised:

{noformat}
Traceback (most recent call last):
  File "./spark/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
  File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
for stage in stages:
TypeError: 'NoneType' object is not iterable
{noformat}

The param {{stages}} should be added to the default param list and 
{{getStages}} should call {{getOrDefault}}.

Also, since the default value of {{None}} is then changed to an empty list 
{{[]}}, this never changes the value if passed in as a keyword argument.  
Instead, the {{kwargs}} value should be changed directly if {{stages is None}}.

For example
{noformat}
if stages is None:
stages = []
{noformat}
should be this
{noformat}
if stages is None:
kwargs['stages'] = []
{noformat}



> PySpark ML Pipeline raises unclear error when no stages set
> ---
>
> Key: SPARK-15018
> URL: https://issues.apache.org/jira/browse/SPARK-15018
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> When fitting a PySpark Pipeline with no stages, it should work as an identity 
> transformer.  Instead the following error is raised:
> {noformat}
> Traceback (most recent call last):
>   File "./spark/python/pyspark/ml/base.py", line 64, in fit
> return self._fit(dataset)
>   File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
> for stage in stages:
> TypeError: 'NoneType' object is not iterable
> {noformat}
> The param {{stages}} needs to be an empty list and {{getStages}} should call 
> {{getOrDefault}}.
> Also, since the default value of {{None}} is then changed to an empty list 
> {{[]}}, this never changes the value if passed in as a keyword argument.  
> Instead, the {{kwargs}} value should be changed directly if {{stages is 
> None}}.
> For example
> {noformat}
> if stages is None:
> stages = []
> {noformat}
> should be this
> {noformat}
> if stages is None:
> kwargs['stages'] = []
> {noformat}
> However, since there is no default value in the Scala implementation, 
> assigning a default here is not needed and should be cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15018) PySpark ML Pipeline raises unclear error when no stages set

2016-08-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-15018:
-
Summary: PySpark ML Pipeline raises unclear error when no stages set  (was: 
PySpark ML Pipeline fails when no stages set)

> PySpark ML Pipeline raises unclear error when no stages set
> ---
>
> Key: SPARK-15018
> URL: https://issues.apache.org/jira/browse/SPARK-15018
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> When fitting a PySpark Pipeline with no stages, it should work as an identity 
> transformer.  Instead the following error is raised:
> {noformat}
> Traceback (most recent call last):
>   File "./spark/python/pyspark/ml/base.py", line 64, in fit
> return self._fit(dataset)
>   File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
> for stage in stages:
> TypeError: 'NoneType' object is not iterable
> {noformat}
> The param {{stages}} should be added to the default param list and 
> {{getStages}} should call {{getOrDefault}}.
> Also, since the default value {{None}} is then changed to an empty list 
> {{[]}}, this never changes the value if it was passed in as a keyword argument.  
> Instead, the {{kwargs}} value should be changed directly if {{stages is 
> None}}.
> For example
> {noformat}
> if stages is None:
> stages = []
> {noformat}
> should be this
> {noformat}
> if stages is None:
> kwargs['stages'] = []
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15018) PySpark ML Pipeline fails when no stages set

2016-08-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-15018:
-
Issue Type: Improvement  (was: Bug)

> PySpark ML Pipeline fails when no stages set
> 
>
> Key: SPARK-15018
> URL: https://issues.apache.org/jira/browse/SPARK-15018
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>
> When fitting a PySpark Pipeline with no stages, it should work as an identity 
> transformer.  Instead the following error is raised:
> {noformat}
> Traceback (most recent call last):
>   File "./spark/python/pyspark/ml/base.py", line 64, in fit
> return self._fit(dataset)
>   File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
> for stage in stages:
> TypeError: 'NoneType' object is not iterable
> {noformat}
> The param {{stages}} should be added to the default param list and 
> {{getStages}} should call {{getOrDefault}}.
> Also, since the default value {{None}} is then changed to an empty list 
> {{[]}}, this never changes the value if it was passed in as a keyword argument.  
> Instead, the {{kwargs}} value should be changed directly if {{stages is 
> None}}.
> For example
> {noformat}
> if stages is None:
> stages = []
> {noformat}
> should be this
> {noformat}
> if stages is None:
> kwargs['stages'] = []
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15018) PySpark ML Pipeline fails when no stages set

2016-08-19 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler updated SPARK-15018:
-
Priority: Minor  (was: Major)

> PySpark ML Pipeline fails when no stages set
> 
>
> Key: SPARK-15018
> URL: https://issues.apache.org/jira/browse/SPARK-15018
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
>
> When fitting a PySpark Pipeline with no stages, it should work as an identity 
> transformer.  Instead the following error is raised:
> {noformat}
> Traceback (most recent call last):
>   File "./spark/python/pyspark/ml/base.py", line 64, in fit
> return self._fit(dataset)
>   File "./spark/python/pyspark/ml/pipeline.py", line 99, in _fit
> for stage in stages:
> TypeError: 'NoneType' object is not iterable
> {noformat}
> The param {{stages}} should be added to the default param list and 
> {{getStages}} should call {{getOrDefault}}.
> Also, since the default value {{None}} is then changed to an empty list 
> {{[]}}, this never changes the value if it was passed in as a keyword argument.  
> Instead, the {{kwargs}} value should be changed directly if {{stages is 
> None}}.
> For example
> {noformat}
> if stages is None:
> stages = []
> {noformat}
> should be this
> {noformat}
> if stages is None:
> kwargs['stages'] = []
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17154:


Assignee: Apache Spark

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -
>
> Key: SPARK-17154
> URL: https://issues.apache.org/jira/browse/SPARK-17154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>
> When we join two DataFrames which originate from the same DataFrame, 
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
>   val selected1 = joined.select(df("col3"))
> {code}
> In this case, AnalysisException is thrown.
> Another example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), 
> "right")
>   val selected2 = rightOuterJoined.select(df("col1"))
>   selected2.show
> {code}
> In this case, we would expect to get the following answer.
> {code}
> 1
> 2
> 3
> 4
> 5
> {code}
> But the actual result is as follows.
> {code}
> 1
> 2
> null
> 4
> 5
> {code}
> The cause of the problems in these examples is that the logical plan related to 
> the right side DataFrame and the expressions of its output are re-created in 
> the analyzer (at the ResolveReferences rule) when a DataFrame has expressions 
> which share the same exprId.
> The re-created expressions are equal to the original ones except for the exprId.
> This happens when we do a self-join or a similar pattern of operations.
> In the first example, df("col3") returns a Column which includes an 
> expression, and the expression has an exprId (say id1 here).
> After the join, the expression held by the right side DataFrame (df) is 
> re-created; the old and new expressions are equal but the exprId is renewed 
> (say id2 for the new exprId here).
> Because of the mismatch of those exprIds, AnalysisException is thrown.
> In the second example, df("col1") returns a Column and the expression 
> contained in the column is assigned an exprId (say id3).
> On the other hand, a column returned by filtered("col1") has an expression 
> which has the same exprId (id3).
> After the join, the expressions in the right side DataFrame are re-created and 
> the expression assigned id3 is no longer present in the right side but is 
> present in the left side.
> So, when referring to df("col1") on the joined DataFrame, we get col1 of the 
> right side, which includes null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2016-08-19 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17154:


Assignee: (was: Apache Spark)

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -
>
> Key: SPARK-17154
> URL: https://issues.apache.org/jira/browse/SPARK-17154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kousuke Saruta
>
> When we join two DataFrames which originate from the same DataFrame, 
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
>   val selected1 = joined.select(df("col3"))
> {code}
> In this case, AnalysisException is thrown.
> Another example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), 
> "right")
>   val selected2 = rightOuterJoined.select(df("col1"))
>   selected2.show
> {code}
> In this case, we would expect to get the following answer.
> {code}
> 1
> 2
> 3
> 4
> 5
> {code}
> But the actual result is as follows.
> {code}
> 1
> 2
> null
> 4
> 5
> {code}
> The cause of the problems in these examples is that the logical plan related to 
> the right side DataFrame and the expressions of its output are re-created in 
> the analyzer (at the ResolveReferences rule) when a DataFrame has expressions 
> which share the same exprId.
> The re-created expressions are equal to the original ones except for the exprId.
> This happens when we do a self-join or a similar pattern of operations.
> In the first example, df("col3") returns a Column which includes an 
> expression, and the expression has an exprId (say id1 here).
> After the join, the expression held by the right side DataFrame (df) is 
> re-created; the old and new expressions are equal but the exprId is renewed 
> (say id2 for the new exprId here).
> Because of the mismatch of those exprIds, AnalysisException is thrown.
> In the second example, df("col1") returns a Column and the expression 
> contained in the column is assigned an exprId (say id3).
> On the other hand, a column returned by filtered("col1") has an expression 
> which has the same exprId (id3).
> After the join, the expressions in the right side DataFrame are re-created and 
> the expression assigned id3 is no longer present in the right side but is 
> present in the left side.
> So, when referring to df("col1") on the joined DataFrame, we get col1 of the 
> right side, which includes null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2016-08-19 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428438#comment-15428438
 ] 

Apache Spark commented on SPARK-17154:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/14719

> Wrong result can be returned or AnalysisException can be thrown after 
> self-join or similar operations
> -
>
> Key: SPARK-17154
> URL: https://issues.apache.org/jira/browse/SPARK-17154
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Kousuke Saruta
>
> When we join two DataFrames which originate from the same DataFrame, 
> operations on the joined DataFrame can fail.
> One reproducible example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
>   val selected1 = joined.select(df("col3"))
> {code}
> In this case, AnalysisException is thrown.
> Another example is as follows.
> {code}
> val df = Seq(
>   (1, "a", "A"),
>   (2, "b", "B"),
>   (3, "c", "C"),
>   (4, "d", "D"),
>   (5, "e", "E")).toDF("col1", "col2", "col3")
>   val filtered = df.filter("col1 != 3").select("col1", "col2")
>   val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), 
> "right")
>   val selected2 = rightOuterJoined.select(df("col1"))
>   selected2.show
> {code}
> In this case, we would expect to get the following answer.
> {code}
> 1
> 2
> 3
> 4
> 5
> {code}
> But the actual result is as follows.
> {code}
> 1
> 2
> null
> 4
> 5
> {code}
> The cause of the problems in these examples is that the logical plan related to 
> the right side DataFrame and the expressions of its output are re-created in 
> the analyzer (at the ResolveReferences rule) when a DataFrame has expressions 
> which share the same exprId.
> The re-created expressions are equal to the original ones except for the exprId.
> This happens when we do a self-join or a similar pattern of operations.
> In the first example, df("col3") returns a Column which includes an 
> expression, and the expression has an exprId (say id1 here).
> After the join, the expression held by the right side DataFrame (df) is 
> re-created; the old and new expressions are equal but the exprId is renewed 
> (say id2 for the new exprId here).
> Because of the mismatch of those exprIds, AnalysisException is thrown.
> In the second example, df("col1") returns a Column and the expression 
> contained in the column is assigned an exprId (say id3).
> On the other hand, a column returned by filtered("col1") has an expression 
> which has the same exprId (id3).
> After the join, the expressions in the right side DataFrame are re-created and 
> the expression assigned id3 is no longer present in the right side but is 
> present in the left side.
> So, when referring to df("col1") on the joined DataFrame, we get col1 of the 
> right side, which includes null.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17135) Consolidate code in linear/logistic regression where possible

2016-08-19 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428401#comment-15428401
 ] 

Gayathri Murali commented on SPARK-17135:
-

I can work on this

> Consolidate code in linear/logistic regression where possible
> -
>
> Key: SPARK-17135
> URL: https://issues.apache.org/jira/browse/SPARK-17135
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Priority: Minor
>
> There is shared code between MultinomialLogisticRegression, 
> LogisticRegression, and LinearRegression. We should consolidate where 
> possible. Also, we should move some code out of LogisticRegression.scala into 
> a separate util file or similar.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-13331) Spark network encryption optimization

2016-08-19 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-13331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reopened SPARK-13331:


> Spark network encryption optimization
> -
>
> Key: SPARK-13331
> URL: https://issues.apache.org/jira/browse/SPARK-13331
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Reporter: Dong Chen
>Priority: Minor
>
> In network/common, SASL with DIGEST-MD5 authentication is used for 
> negotiating a secure communication channel. When the SASL operation mode is 
> "auth-conf", the data transferred on the network is encrypted. The DIGEST-MD5 
> mechanism supports the following encryption algorithms: 3DES, DES, and RC4. The 
> negotiation procedure will select one of them to encrypt / decrypt the data on 
> the channel.
> However, 3DES and RC4 are relatively slow. We could add code to the 
> negotiation to support AES for better security and performance.
> The proposed solution is:
> When "auth-conf" is enabled, at the end of the original negotiation, the 
> authentication succeeds and a secure channel is built. We could add one more 
> negotiation step: the client and server negotiate whether they both support AES. 
> If yes, the key and IV used by AES will be generated by the server and sent to 
> the client through the already-secure channel. Then the encryption / decryption 
> handlers are updated to AES on both the client and server side. Subsequent data 
> transfer will use AES instead of the original encryption algorithm.
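As a rough sketch of the server-side key material step described above (an editorial 
illustration only, not Spark's actual negotiation code; the object and method names are 
made up), the AES key and IV could be generated with the standard JCE APIs:

{code}
import java.security.SecureRandom
import javax.crypto.KeyGenerator
import javax.crypto.spec.{IvParameterSpec, SecretKeySpec}

object AesKeyMaterial {
  // Generate a fresh AES key and a random IV; the 128-bit key size and the
  // 16-byte (one AES block) IV length are assumptions for this sketch.
  def generate(): (SecretKeySpec, IvParameterSpec) = {
    val keyGen = KeyGenerator.getInstance("AES")
    keyGen.init(128)
    val key = new SecretKeySpec(keyGen.generateKey().getEncoded, "AES")

    val ivBytes = new Array[Byte](16)
    new SecureRandom().nextBytes(ivBytes)
    (key, new IvParameterSpec(ivBytes))
  }
}

// The encoded key and IV would then be sent to the client over the already
// authenticated SASL channel, after which both sides swap their channel
// encryption / decryption handlers to AES-based ones.
val (aesKey, aesIv) = AesKeyMaterial.generate()
{code}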



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17154) Wrong result can be returned or AnalysisException can be thrown after self-join or similar operations

2016-08-19 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-17154:
--

 Summary: Wrong result can be returned or AnalysisException can be 
thrown after self-join or similar operations
 Key: SPARK-17154
 URL: https://issues.apache.org/jira/browse/SPARK-17154
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.2
Reporter: Kousuke Saruta


When we join two DataFrames which originate from the same DataFrame, 
operations on the joined DataFrame can fail.

One reproducible example is as follows.

{code}
val df = Seq(
  (1, "a", "A"),
  (2, "b", "B"),
  (3, "c", "C"),
  (4, "d", "D"),
  (5, "e", "E")).toDF("col1", "col2", "col3")
  val filtered = df.filter("col1 != 3").select("col1", "col2")
  val joined = filtered.join(df, filtered("col1") === df("col1"), "inner")
  val selected1 = joined.select(df("col3"))
{code}

In this case, AnalysisException is thrown.

Another example is as follows.

{code}
val df = Seq(
  (1, "a", "A"),
  (2, "b", "B"),
  (3, "c", "C"),
  (4, "d", "D"),
  (5, "e", "E")).toDF("col1", "col2", "col3")
  val filtered = df.filter("col1 != 3").select("col1", "col2")
  val rightOuterJoined = filtered.join(df, filtered("col1") === df("col1"), 
"right")
  val selected2 = rightOuterJoined.select(df("col1"))
  selected2.show
{code}

In this case, we would expect to get the following answer.
{code}
1
2
3
4
5
{code}

But the actual result is as follows.

{code}
1
2
null
4
5
{code}

The cause of the problems in these examples is that the logical plan related to 
the right side DataFrame and the expressions of its output are re-created in the 
analyzer (at the ResolveReferences rule) when a DataFrame has expressions which 
share the same exprId.
The re-created expressions are equal to the original ones except for the exprId.
This happens when we do a self-join or a similar pattern of operations.

In the first example, df("col3") returns a Column which includes an expression, 
and the expression has an exprId (say id1 here).
After the join, the expression held by the right side DataFrame (df) is re-created; 
the old and new expressions are equal but the exprId is renewed (say id2 for the 
new exprId here).
Because of the mismatch of those exprIds, AnalysisException is thrown.

In the second example, df("col1") returns a Column and the expression contained 
in the column is assigned an exprId (say id3).
On the other hand, a column returned by filtered("col1") has an expression 
which has the same exprId (id3).
After the join, the expressions in the right side DataFrame are re-created and the 
expression assigned id3 is no longer present in the right side but is present in 
the left side.
So, when referring to df("col1") on the joined DataFrame, we get col1 of the right 
side, which includes null.
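Until the underlying issue is fixed, one possible workaround (an editorial sketch only, 
not part of any proposed fix; it reuses the {{df}} and {{filtered}} values from the 
example above) is to resolve columns through explicit aliases so that no stale exprId 
reference is involved:

{code}
import org.apache.spark.sql.functions.col

// Give each side of the self-join its own alias and refer to columns by
// alias-qualified name instead of via df("...") / filtered("...").
val left  = filtered.alias("l")
val right = df.alias("r")

val aliasJoined   = left.join(right, col("l.col1") === col("r.col1"), "inner")
val aliasSelected = aliasJoined.select(col("r.col3"))  // resolved by name, not by exprId
aliasSelected.show()
{code}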




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17139) Add model summary for MultinomialLogisticRegression

2016-08-19 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428339#comment-15428339
 ] 

Seth Hendrickson commented on SPARK-17139:
--

SPARK-7159 has been merged, as an FYI. I can review this when you submit the 
PR. It would be nice to get this in soon.

> Add model summary for MultinomialLogisticRegression
> ---
>
> Key: SPARK-17139
> URL: https://issues.apache.org/jira/browse/SPARK-17139
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>
> Add model summary to multinomial logistic regression using same interface as 
> in other ML models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17140) Add initial model to MultinomialLogisticRegression

2016-08-19 Thread Seth Hendrickson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428338#comment-15428338
 ] 

Seth Hendrickson commented on SPARK-17140:
--

Going to hold off for a little bit to see what happens with 
[SPARK-10780|https://issues.apache.org/jira/browse/SPARK-10780]

> Add initial model to MultinomialLogisticRegression
> --
>
> Key: SPARK-17140
> URL: https://issues.apache.org/jira/browse/SPARK-17140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>
> We should add initial model support to Multinomial logistic regression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1

2016-08-19 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-11227:
--
Assignee: Kousuke Saruta

> Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
> 
>
> Key: SPARK-11227
> URL: https://issues.apache.org/jira/browse/SPARK-11227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.6.1, 2.0.0
> Environment: OS: CentOS 6.6
> Memory: 28G
> CPU: 8
> Mesos: 0.22.0
> HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager)
>Reporter: Yuri Saito
>Assignee: Kousuke Saruta
> Fix For: 2.0.1, 2.1.0
>
>
> When running a jar including a Spark job on an HDFS HA cluster with Mesos and 
> Spark 1.5.1, the job throws the exception "java.net.UnknownHostException: 
> nameservice1" and fails.
> I run the following in a terminal.
> {code}
> /opt/spark/bin/spark-submit \
>   --class com.example.Job /jobs/job-assembly-1.0.0.jar
> {code}
> The job then throws the message below.
> {code}
> 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
> (TID 0, spark003.example.com): java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: nameservice1
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
> at 
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.UnknownHostException: nameservice1
> ... 41 more
> {code}
> But when I changed from Spark cluster 1.5.1 to Spark cluster 1.4.0 and then ran 
> the job, the job completed successfully.
> In addition, I disabled High Availability on HDFS, 

[jira] [Resolved] (SPARK-16673) New Executor Page displays columns that used to be conditionally hidden

2016-08-19 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-16673.
---
   Resolution: Fixed
Fix Version/s: 2.1.0

> New Executor Page displays columns that used to be conditionally hidden
> ---
>
> Key: SPARK-16673
> URL: https://issues.apache.org/jira/browse/SPARK-16673
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
> Fix For: 2.1.0
>
>
> SPARK-15951 switched the Executors page to use JQuery DataTables, but it also 
> removed the functionality of conditionally hiding the Logs and Thread Dump 
> columns. In the case of the Logs column this is not a big issue, but in the 
> case of Thread Dump, previously it was never shown on the History Server 
> since it isn't available.
> We should reintroduce the functionality to hide these columns according to 
> the same conditions as before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11227) Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1

2016-08-19 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11227?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-11227.
---
   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

> Spark1.5+ HDFS HA mode throw java.net.UnknownHostException: nameservice1
> 
>
> Key: SPARK-11227
> URL: https://issues.apache.org/jira/browse/SPARK-11227
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.5.0, 1.5.1, 1.6.1, 2.0.0
> Environment: OS: CentOS 6.6
> Memory: 28G
> CPU: 8
> Mesos: 0.22.0
> HDFS: Hadoop 2.6.0-CDH5.4.0 (build by Cloudera Manager)
>Reporter: Yuri Saito
>Assignee: Kousuke Saruta
> Fix For: 2.0.1, 2.1.0
>
>
> When running a jar including a Spark job on an HDFS HA cluster with Mesos and 
> Spark 1.5.1, the job throws the exception "java.net.UnknownHostException: 
> nameservice1" and fails.
> I run the following in a terminal.
> {code}
> /opt/spark/bin/spark-submit \
>   --class com.example.Job /jobs/job-assembly-1.0.0.jar
> {code}
> The job then throws the message below.
> {code}
> 15/10/21 15:22:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 
> (TID 0, spark003.example.com): java.lang.IllegalArgumentException: 
> java.net.UnknownHostException: nameservice1
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:374)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:312)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:178)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:665)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:601)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:148)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:169)
> at 
> org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:436)
> at 
> org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:1016)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at 
> org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
> at scala.Option.map(Option.scala:145)
> at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:220)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:216)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.UnknownHostException: nameservice1
> ... 41 more
> {code}
> But when I changed from Spark cluster 1.5.1 to Spark cluster 1.4.0 and then ran 
> the job, the job completed successfully.
> In 

[jira] [Created] (SPARK-17153) [Structured streams] readStream ignores partition columns

2016-08-19 Thread Dmitri Carpov (JIRA)
Dmitri Carpov created SPARK-17153:
-

 Summary: [Structured streams] readStream ignores partition columns
 Key: SPARK-17153
 URL: https://issues.apache.org/jira/browse/SPARK-17153
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 2.0.0
Reporter: Dmitri Carpov


When parquet files are persisted using partitions, spark's `readStream` returns 
data with all `null`s for the partitioned columns.

For example:

```
case class A(id: Int, value: Int)

val data = spark.createDataset(Seq(
  A(1, 1), 
  A(2, 2), 
  A(2, 3))
)

val url = "/mnt/databricks/test"
data.write.partitionBy("id").parquet(url)
```

when data is read as stream:

```
spark.readStream.schema(spark.read.load(url).schema).parquet(url)
```

it reads:

```
id, value
null, 1
null, 2
null, 3
```

A possible reason is that `readStream` reads the parquet files directly, but when 
those are stored, the columns they are partitioned by are excluded from the file 
itself. In the given example the parquet files contain only the `value` information, 
since `id` is the partition column.
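One way to see the behavior described above (an editorial illustration only, using the 
hypothetical path from the example) is to read a single partition directory directly, 
which bypasses partition discovery and exposes what is physically stored in the files:

{code}
// Reading a leaf directory such as .../id=1 means Spark finds no partition
// directories below the given path, so the inferred schema is only what the
// parquet files themselves contain.
val leaf = spark.read.parquet("/mnt/databricks/test/id=1")
leaf.printSchema()   // only `value`; `id` lives in the directory name

// Reading from the root recovers `id` via partition discovery.
val full = spark.read.parquet("/mnt/databricks/test")
full.printSchema()   // both `id` and `value`
{code}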



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16673) New Executor Page displays columns that used to be conditionally hidden

2016-08-19 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-16673:
--
Assignee: Alex Bozarth

> New Executor Page displays columns that used to be conditionally hidden
> ---
>
> Key: SPARK-16673
> URL: https://issues.apache.org/jira/browse/SPARK-16673
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Reporter: Alex Bozarth
>Assignee: Alex Bozarth
> Fix For: 2.1.0
>
>
> SPARK-15951 switched the Executors page to use JQuery DataTables, but it also 
> removed the functionality of conditionally hiding the Logs and Thread Dump 
> columns. In the case of the Logs column this is not a big issue, but in the 
> case of Thread Dump, previously it was never shown on the History Server 
> since it isn't available.
> We should reintroduce the functionality to hide these columns according to 
> the same conditions as before.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17153) [Structured streams] readStream ignores partition columns

2016-08-19 Thread Dmitri Carpov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17153?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitri Carpov updated SPARK-17153:
--
Description: 
When parquet files are persisted using partitions, spark's `readStream` returns 
data with all `null`s for the partitioned columns.

For example:

{noformat}
case class A(id: Int, value: Int)

val data = spark.createDataset(Seq(
  A(1, 1), 
  A(2, 2), 
  A(2, 3))
)

val url = "/mnt/databricks/test"
data.write.partitionBy("id").parquet(url)
{noformat}

when data is read as stream:

{noformat}
spark.readStream.schema(spark.read.load(url).schema).parquet(url)
{noformat}

it reads:

{noformat}
id, value
null, 1
null, 2
null, 3
{noformat}

A possible reason is that `readStream` reads the parquet files directly, but when 
those are stored, the columns they are partitioned by are excluded from the file 
itself. In the given example the parquet files contain only the `value` information, 
since `id` is the partition column.

  was:
When parquet files are persisted using partitions, spark's `readStream` returns 
data with all `null`s for the partitioned columns.

For example:

```
case class A(id: Int, value: Int)

val data = spark.createDataset(Seq(
  A(1, 1), 
  A(2, 2), 
  A(2, 3))
)

val url = "/mnt/databricks/test"
data.write.partitionBy("id").parquet(url)
```

when data is read as stream:

```
spark.readStream.schema(spark.read.load(url).schema).parquet(url)
```

it reads:

```
id, value
null, 1
null, 2
null, 3
```

A possible reason is that `readStream` reads the parquet files directly, but when 
those are stored, the columns they are partitioned by are excluded from the file 
itself. In the given example the parquet files contain only the `value` information, 
since `id` is the partition column.


> [Structured streams] readStream ignores partition columns
> -
>
> Key: SPARK-17153
> URL: https://issues.apache.org/jira/browse/SPARK-17153
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Dmitri Carpov
>
> When parquet files are persisted using partitions, spark's `readStream` 
> returns data with all `null`s for the partitioned columns.
> For example:
> {noformat}
> case class A(id: Int, value: Int)
> val data = spark.createDataset(Seq(
>   A(1, 1), 
>   A(2, 2), 
>   A(2, 3))
> )
> val url = "/mnt/databricks/test"
> data.write.partitionBy("id").parquet(url)
> {noformat}
> when data is read as stream:
> {noformat}
> spark.readStream.schema(spark.read.load(url).schema).parquet(url)
> {noformat}
> it reads:
> {noformat}
> id, value
> null, 1
> null, 2
> null, 3
> {noformat}
> A possible reason is that `readStream` reads the parquet files directly, but when 
> those are stored, the columns they are partitioned by are excluded from the file 
> itself. In the given example the parquet files contain only the `value` 
> information, since `id` is the partition column.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17148) NodeManager exit because of exception “Executor is not registered”

2016-08-19 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15428289#comment-15428289
 ] 

Thomas Graves commented on SPARK-17148:
---

If this is causing the NodeManager to die, this is bad and we should fix it.  It 
should just fail the request and not kill the NM.

I assume the error was caused because someone cleaned up the data for this 
already?

> NodeManager exit because of exception “Executor is not registered”
> --
>
> Key: SPARK-17148
> URL: https://issues.apache.org/jira/browse/SPARK-17148
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.6.2
> Environment: hadoop 2.7.2 spark 1.6.2
>Reporter: cen yuhai
>
> java.lang.RuntimeException: Executor is not registered 
> (appId=application_1467288504738_1341061, execId=423)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.handleMessage(ExternalShuffleBlockHandler.java:85)
> at 
> org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.receive(ExternalShuffleBlockHandler.java:72)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:149)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:102)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:104)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
> at 
> io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:86)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:308)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:294)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:846)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


