[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-08-14 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126743#comment-16126743
 ] 

Yanbo Liang commented on SPARK-12664:
-

[~josephkb] Please go ahead. Thanks for taking over shepherding. I really have 
a lot of PRs waiting for review.

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 
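For context, a minimal sketch of how the model is used today (the `train`/`test` DataFrames, column names and layer sizes are illustrative): the transformed DataFrame carries only the integer prediction column, with no per-class scores to interpret.

{code:java}
// Illustrative sketch only; `train` and `test` are assumed DataFrames with
// "features" (Vector) and "label" (Double) columns, and the layer sizes are examples.
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier

val mlp = new MultilayerPerceptronClassifier()
  .setLayers(Array(4, 8, 3))  // input size, one hidden layer, number of classes
  .setMaxIter(100)

val model = mlp.fit(train)

// Only the integer class index is exposed; the raw per-class scores computed
// internally by the private mlpModel are not reachable from here.
model.transform(test).select("prediction").show()
{code}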



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-14 Thread Jepson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jepson updated SPARK-21733:
---
Description: 
Kafka + Spark Streaming throws this error:

{code:java}
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8003 took 11 ms
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
(TID 64178). 1740 bytes result sent to driver
17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
64186
17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 (TID 
64186)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 8004
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8004 took 8 ms
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
(TID 64186). 1740 bytes result sent to driver
h3. 17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
SIGNAL TERM
17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
{code}


  was:
Kafka + Spark Streaming throws this error:

{code:java}
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8003 took 11 ms
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
(TID 64178). 1740 bytes result sent to driver
17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
64186
17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 (TID 
64186)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 8004
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8004 took 8 ms
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
(TID 64186). 1740 bytes result sent to driver
17/08/15 09:34:29 {color:#59afe1}ERROR executor.CoarseGrainedExecutorBackend: 
RECEIVED SIGNAL TERM{color}
17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
{code}



> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>  Labels: patch
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws this error:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 

[jira] [Commented] (SPARK-4131) Support "Writing data into the filesystem from queries"

2017-08-14 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126723#comment-16126723
 ] 

xinzhang commented on SPARK-4131:
-

any progress here?

> Support "Writing data into the filesystem from queries"
> ---
>
> Key: SPARK-4131
> URL: https://issues.apache.org/jira/browse/SPARK-4131
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: XiaoJing wang
>Assignee: Fei Wang
>Priority: Critical
>   Original Estimate: 0.05h
>  Remaining Estimate: 0.05h
>
> Spark SQL does not support writing data into the filesystem from queries, 
> e.g.:
> {code}insert overwrite LOCAL DIRECTORY '/data1/wangxj/sql_spark' select * 
> from page_views;
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-14 Thread Jepson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jepson updated SPARK-21733:
---
Description: 
Kafka + Spark Streaming throws this error:

{code:java}
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8003 took 11 ms
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
(TID 64178). 1740 bytes result sent to driver
17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
64186
17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 (TID 
64186)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 8004
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8004 took 8 ms
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
(TID 64186). 1740 bytes result sent to driver
17/08/15 09:34:29 {color:#59afe1}ERROR executor.CoarseGrainedExecutorBackend: 
RECEIVED SIGNAL TERM{color}
17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
{code}


  was:
Kafka + Spark Streaming throws this error:

{code:java}
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8003 took 11 ms
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
(TID 64178). 1740 bytes result sent to driver
17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
64186
17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 (TID 
64186)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 8004
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8004 took 8 ms
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
(TID 64186). 1740 bytes result sent to driver
17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 
TERM
17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
{code}



> ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
> -
>
> Key: SPARK-21733
> URL: https://issues.apache.org/jira/browse/SPARK-21733
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.1.1
> Environment: Apache Spark2.1.1 
> CDH5.12.0 Yarn
>Reporter: Jepson
>  Labels: patch
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Kafka + Spark Streaming throws this error:
> {code:java}
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
> as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
> 17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
> 8003 took 11 ms
> 17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
> values in memory (estimated size 2.9 KB, free 1643.2 MB)
> 17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the 
> same as ending offset skipping kssh 5
> 17/08/15 09:34:14 

[jira] [Created] (SPARK-21733) ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM

2017-08-14 Thread Jepson (JIRA)
Jepson created SPARK-21733:
--

 Summary: ERROR executor.CoarseGrainedExecutorBackend: RECEIVED 
SIGNAL TERM
 Key: SPARK-21733
 URL: https://issues.apache.org/jira/browse/SPARK-21733
 Project: Spark
  Issue Type: Bug
  Components: DStreams
Affects Versions: 2.1.1
 Environment: Apache Spark2.1.1 
CDH5.12.0 Yarn


Reporter: Jepson


Kafka + Spark Streaming throws this error:

{code:java}
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:14 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8003 took 11 ms
17/08/15 09:34:14 INFO memory.MemoryStore: Block broadcast_8003 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:14 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:14 INFO executor.Executor: Finished task 7.0 in stage 8003.0 
(TID 64178). 1740 bytes result sent to driver
17/08/15 09:34:21 INFO storage.BlockManager: Removing RDD 8002
17/08/15 09:34:21 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
64186
17/08/15 09:34:21 INFO executor.Executor: Running task 7.0 in stage 8004.0 (TID 
64186)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Started reading broadcast 
variable 8004
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004_piece0 stored 
as bytes in memory (estimated size 1895.0 B, free 1643.2 MB)
17/08/15 09:34:21 INFO broadcast.TorrentBroadcast: Reading broadcast variable 
8004 took 8 ms
17/08/15 09:34:21 INFO memory.MemoryStore: Block broadcast_8004 stored as 
values in memory (estimated size 2.9 KB, free 1643.2 MB)
17/08/15 09:34:21 INFO kafka010.KafkaRDD: Beginning offset 10130733 is the same 
as ending offset skipping kssh 5
17/08/15 09:34:21 INFO executor.Executor: Finished task 7.0 in stage 8004.0 
(TID 64186). 1740 bytes result sent to driver
17/08/15 09:34:29 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 
TERM
17/08/15 09:34:29 INFO storage.DiskBlockManager: Shutdown hook called
17/08/15 09:34:29 INFO util.ShutdownHookManager: Shutdown hook called
{code}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-08-14 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108303#comment-16108303
 ] 

xinzhang edited comment on SPARK-21067 at 8/15/17 1:09 AM:
---

Hi.
 I use Parquet to avoid the issue with CREATE TABLE AS, but it still appears with 
INSERT OVERWRITE TABLE ... PARTITION, and I could not find any way to avoid it. 
Any suggestions would be very helpful.
https://issues.apache.org/jira/browse/SPARK-21725


was (Author: zhangxin0112zx):
Hi.
 I use Parquet to avoid the issue with CREATE TABLE AS, but it still appears with 
INSERT OVERWRITE TABLE ... PARTITION, and I could not find any way to avoid it. 
Any suggestions would be very helpful.
https://issues.apache.org/jira/browse/SPARK-21725

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> 

[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-08-14 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108303#comment-16108303
 ] 

xinzhang edited comment on SPARK-21067 at 8/15/17 1:07 AM:
---

Hi.
 I use Parquet to avoid the issue with CREATE TABLE AS, but it still appears with 
INSERT OVERWRITE TABLE ... PARTITION, and I could not find any way to avoid it. 
Any suggestions would be very helpful.
https://issues.apache.org/jira/browse/SPARK-21725


was (Author: zhangxin0112zx):
Hi [~srowen], could you take a look at this and give some suggestions?


> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> 

[jira] [Comment Edited] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-08-14 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16119310#comment-16119310
 ] 

xinzhang edited comment on SPARK-21067 at 8/15/17 1:04 AM:
---

Hi [~smilegator],
Can you push this bug fix forward? 
In my view, this is a major obstacle for us as we roll out and use 
Spark SQL (Thrift server).


was (Author: zhangxin0112zx):
Hi [~cloud_fan],
Can you push this bug fix forward? 
In my view, this is a major obstacle for us as we roll out and use 
Spark SQL (Thrift server).

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which states that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> 

[jira] [Updated] (SPARK-21732) Lazily init hive metastore client

2017-08-14 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-21732:
-
Description: Right now, when the Hive metastore server is down, we cannot 
create a SparkSession. It would be great if we could decouple them.
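
For illustration, the lazy-initialization pattern this asks for could look roughly like the following (names are hypothetical, not Spark's actual internals): the expensive client is only constructed on first use, so a down metastore no longer blocks session creation.

{code:java}
// Hypothetical sketch of lazy initialization; not the actual Spark code.
trait MetastoreClient {
  def getTables(db: String): Seq[String]
}

class LazyCatalog(connect: () => MetastoreClient) {
  // The client is built on first access instead of at SparkSession creation time,
  // so an unreachable metastore only fails the first call that actually needs it.
  private lazy val client: MetastoreClient = connect()

  def listTables(db: String): Seq[String] = client.getTables(db)
}
{code}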

> Lazily init hive metastore client
> -
>
> Key: SPARK-21732
> URL: https://issues.apache.org/jira/browse/SPARK-21732
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>
> Right now, when the Hive metastore server is down, we cannot create a 
> SparkSession. It would be great if we could decouple them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21732) Lazily init hive metastore client

2017-08-14 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-21732:


 Summary: Lazily init hive metastore client
 Key: SPARK-21732
 URL: https://issues.apache.org/jira/browse/SPARK-21732
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.0
Reporter: Shixiong Zhu
Assignee: Shixiong Zhu






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21617) ALTER TABLE...ADD COLUMNS broken in Hive 2.1 for DS tables

2017-08-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126657#comment-16126657
 ] 

Marcelo Vanzin commented on SPARK-21617:


https://github.com/apache/spark/pull/18849

> ALTER TABLE...ADD COLUMNS broken in Hive 2.1 for DS tables
> --
>
> Key: SPARK-21617
> URL: https://issues.apache.org/jira/browse/SPARK-21617
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marcelo Vanzin
>
> When you have a data source table and you run an "ALTER TABLE...ADD COLUMNS" 
> query, Spark will save invalid metadata to the Hive metastore.
> Namely, it will overwrite the table's schema with the data frame's schema; 
> that is not desired for data source tables (where the schema is stored in a 
> table property instead).
> Moreover, if you use a newer metastore client where 
> METASTORE_DISALLOW_INCOMPATIBLE_COL_TYPE_CHANGES is on by default, you 
> actually get an exception:
> {noformat}
> InvalidOperationException(message:The following columns have types 
> incompatible with the existing columns in their respective positions :
> c1)
>   at 
> org.apache.hadoop.hive.metastore.MetaStoreUtils.throwExceptionIfIncompatibleColTypeChange(MetaStoreUtils.java:615)
>   at 
> org.apache.hadoop.hive.metastore.HiveAlterHandler.alterTable(HiveAlterHandler.java:133)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_table_core(HiveMetaStore.java:3704)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.alter_table_with_environment_context(HiveMetaStore.java:3675)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:140)
>   at 
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:99)
>   at com.sun.proxy.$Proxy26.alter_table_with_environment_context(Unknown 
> Source)
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.alter_table_with_environmentContext(HiveMetaStoreClient.java:402)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.alter_table_with_environmentContext(SessionHiveMetaStoreClient.java:309)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:154)
>   at com.sun.proxy.$Proxy27.alter_table_with_environmentContext(Unknown 
> Source)
>   at org.apache.hadoop.hive.ql.metadata.Hive.alterTable(Hive.java:601)
> {noformat}
> That exception is handled by Spark in an odd way (see code in 
> {{HiveExternalCatalog.scala}}) which still stores invalid metadata.
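For reference, the shape of a reproduction is roughly the following (table and column names are illustrative): a data source table keeps its real schema in a table property, so letting the ALTER overwrite the Hive-level schema is what corrupts the metadata.

{code:java}
// Illustrative reproduction; table and column names are hypothetical.
// Assumes an active SparkSession named `spark`.
spark.sql("CREATE TABLE t_ds (c1 INT) USING parquet")
spark.sql("ALTER TABLE t_ds ADD COLUMNS (c2 STRING)")
// With METASTORE_DISALLOW_INCOMPATIBLE_COL_TYPE_CHANGES on (the Hive 2.1 default),
// the second statement surfaces the InvalidOperationException shown above.
{code}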



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-08-14 Thread poplav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126618#comment-16126618
 ] 

poplav commented on SPARK-19372:


I opened a PR for backporting this at 
https://github.com/apache/spark/pull/18942, thanks.

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0, 2.3.0
>
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB".
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21731) Upgrade scalastyle to 0.9

2017-08-14 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-21731:
--

 Summary: Upgrade scalastyle to 0.9
 Key: SPARK-21731
 URL: https://issues.apache.org/jira/browse/SPARK-21731
 Project: Spark
  Issue Type: Task
  Components: Build
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Trivial


No new features that I'm interested in, but it fixes an issue with the import 
order checker so that it provides more useful errors 
(https://github.com/scalastyle/scalastyle/pull/185).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-18278) SPIP: Support native submission of spark jobs to a kubernetes cluster

2017-08-14 Thread Matt Cheah (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-18278:
---
Attachment: (was: SPARK-18278 - Spark on Kubernetes Design Proposal.pdf)

> SPIP: Support native submission of spark jobs to a kubernetes cluster
> -
>
> Key: SPARK-18278
> URL: https://issues.apache.org/jira/browse/SPARK-18278
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Deploy, Documentation, Scheduler, Spark Core
>Reporter: Erik Erlandson
> Attachments: SPARK-18278 Spark on Kubernetes Design Proposal Revision 
> 2 (1).pdf
>
>
> A new Apache Spark sub-project that enables native support for submitting 
> Spark applications to a kubernetes cluster.   The submitted application runs 
> in a driver executing on a kubernetes pod, and executors lifecycles are also 
> managed as pods.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-08-14 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126583#comment-16126583
 ] 

Joseph K. Bradley commented on SPARK-12664:
---

[~yanboliang] I can take over shepherding this feature, but let me know if 
you'd like to return to it.  I just made an initial review pass.

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-08-14 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12664:
--
Shepherd: Joseph K. Bradley  (was: Yanbo Liang)

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21718) Heavy log of type: "Skipping partition based on stats ..."

2017-08-14 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126575#comment-16126575
 ] 

Xiao Li commented on SPARK-21718:
-

[~meox] Please submit a PR to fix this. 
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryTableScanExec.scala#L174
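
Until a PR demotes that statement from INFO to DEBUG, a user-side workaround is to raise that logger's threshold from the driver (a sketch assuming the default log4j 1.x backend that Spark bundles; the logger name follows from the class linked above):

{code:java}
// Sketch of a driver-side workaround; assumes the log4j 1.x backend Spark ships with.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.spark.sql.execution.columnar.InMemoryTableScanExec")
  .setLevel(Level.WARN)  // silences the INFO-level "Skipping partition based on stats" lines
{code}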

> Heavy log of type: "Skipping partition based on stats ..."
> --
>
> Key: SPARK-21718
> URL: https://issues.apache.org/jira/browse/SPARK-21718
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
> Environment: Linux
>Reporter: Gian Lorenzo Meocci
>Priority: Trivial
>
> My Spark job has three partitions. This causes many log entries of the type
> "Skipping partition based on stats ..." in SPARK_WORKDIR//stderr.
> We should add a check to disable this logging, or move the log level 
> from INFO to DEBUG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21721) Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable

2017-08-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21721:

Affects Version/s: 2.0.2
   2.2.0

> Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable
> --
>
> Key: SPARK-21721
> URL: https://issues.apache.org/jira/browse/SPARK-21721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: yzheng616
>Priority: Critical
>
> The leak comes from org.apache.spark.sql.hive.execution.InsertIntoHiveTable. 
> At line 118 it puts a staging path into the FileSystem deleteOnExit cache, and then 
> removes the path from disk at line 385, but it does not remove the path from 
> the FileSystem cache. If a streaming application keeps persisting data to a 
> partitioned Hive table, the memory keeps increasing until the JVM terminates.
> Below is a simple program to reproduce it.
> {code:java}
> package test
> import org.apache.spark.sql.SparkSession
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.fs.FileSystem
> import org.apache.spark.sql.SaveMode
> import java.lang.reflect.Field
> case class PathLeakTest(id: Int, gp: String)
> object StagePathLeak {
>   def main(args: Array[String]): Unit = {
> val spark = 
> SparkSession.builder().master("local[4]").appName("StagePathLeak").enableHiveSupport().getOrCreate()
> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
> //create a partitioned table
> spark.sql("drop table if exists path_leak");
> spark.sql("create table if not exists path_leak(id int)" +
> " partitioned by (gp String)"+
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[PathLeakTest]()
> // 2 partitions
> for (x <- 1 to 2) {
>   seq += (new PathLeakTest(x, "g" + x))
> }
> val rdd = spark.sparkContext.makeRDD[PathLeakTest](seq)
> //insert 50 records to Hive table
> for (j <- 1 to 50) {
>   val df = spark.createDataFrame(rdd)
>   //#1 InsertIntoHiveTable line 118:  add stage path to FileSystem 
> deleteOnExit cache
>   //#2 InsertIntoHiveTable line 385:  delete the path from disk but not 
> from the FileSystem cache, and it caused the leak
>   df.write.mode(SaveMode.Overwrite).insertInto("path_leak")  
> }
> 
> val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
> val deleteOnExit = getDeleteOnExit(fs.getClass)
> deleteOnExit.setAccessible(true)
> val caches = deleteOnExit.get(fs).asInstanceOf[java.util.TreeSet[Path]]
> //check FileSystem deleteOnExit cache size
> println(caches.size())
> val it = caches.iterator()
> //all staging paths are still cached even though they have already been 
> deleted from the disk
> while(it.hasNext()){
>   println(it.next());
> }
>   }
>   
>   def getDeleteOnExit(cls: Class[_]) : Field = {
> try{
>return cls.getDeclaredField("deleteOnExit")
> }catch{
>   case ex: NoSuchFieldException => return 
> getDeleteOnExit(cls.getSuperclass)
> }
> return null
>   }
> }
> {code}
>  
>  
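A sketch of the direction a fix could take (assuming Hadoop's FileSystem#cancelDeleteOnExit is available in the Hadoop version in use; this is not the actual Spark patch): once the staging directory has been deleted explicitly, it should also be dropped from the deleteOnExit cache.

{code:java}
// Sketch only, not the actual Spark change.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def cleanupStagingDir(stagingDir: Path, hadoopConf: Configuration): Unit = {
  val fs: FileSystem = stagingDir.getFileSystem(hadoopConf)
  if (fs.exists(stagingDir)) {
    fs.delete(stagingDir, true)  // remove the staging directory recursively
  }
  // Without this, the path stays in the deleteOnExit TreeSet until the JVM exits,
  // which is exactly the growth observed in long-running streaming jobs.
  fs.cancelDeleteOnExit(stagingDir)
}
{code}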



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21721) Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable

2017-08-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21721:

Priority: Critical  (was: Major)

> Memory leak in org.apache.spark.sql.hive.execution.InsertIntoHiveTable
> --
>
> Key: SPARK-21721
> URL: https://issues.apache.org/jira/browse/SPARK-21721
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.1, 2.2.0
>Reporter: yzheng616
>Priority: Critical
>
> The leak comes from org.apache.spark.sql.hive.execution.InsertIntoHiveTable. 
> At line 118 it puts a staging path into the FileSystem deleteOnExit cache, and then 
> removes the path from disk at line 385, but it does not remove the path from 
> the FileSystem cache. If a streaming application keeps persisting data to a 
> partitioned Hive table, the memory keeps increasing until the JVM terminates.
> Below is a simple program to reproduce it.
> {code:java}
> package test
> import org.apache.spark.sql.SparkSession
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.fs.FileSystem
> import org.apache.spark.sql.SaveMode
> import java.lang.reflect.Field
> case class PathLeakTest(id: Int, gp: String)
> object StagePathLeak {
>   def main(args: Array[String]): Unit = {
> val spark = 
> SparkSession.builder().master("local[4]").appName("StagePathLeak").enableHiveSupport().getOrCreate()
> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
> //create a partitioned table
> spark.sql("drop table if exists path_leak");
> spark.sql("create table if not exists path_leak(id int)" +
> " partitioned by (gp String)"+
>   " row format serde 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'"+
>   " stored as"+
> " inputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'"+
> " outputformat 
> 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'")
> var seq = new scala.collection.mutable.ArrayBuffer[PathLeakTest]()
> // 2 partitions
> for (x <- 1 to 2) {
>   seq += (new PathLeakTest(x, "g" + x))
> }
> val rdd = spark.sparkContext.makeRDD[PathLeakTest](seq)
> //insert 50 records to Hive table
> for (j <- 1 to 50) {
>   val df = spark.createDataFrame(rdd)
>   //#1 InsertIntoHiveTable line 118:  add stage path to FileSystem 
> deleteOnExit cache
>   //#2 InsertIntoHiveTable line 385:  delete the path from disk but not 
> from the FileSystem cache, and it caused the leak
>   df.write.mode(SaveMode.Overwrite).insertInto("path_leak")  
> }
> 
> val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
> val deleteOnExit = getDeleteOnExit(fs.getClass)
> deleteOnExit.setAccessible(true)
> val caches = deleteOnExit.get(fs).asInstanceOf[java.util.TreeSet[Path]]
> //check FileSystem deleteOnExit cache size
> println(caches.size())
> val it = caches.iterator()
> //all staging paths are still cached even though they have already been 
> deleted from the disk
> while(it.hasNext()){
>   println(it.next());
> }
>   }
>   
>   def getDeleteOnExit(cls: Class[_]) : Field = {
> try{
>return cls.getDeclaredField("deleteOnExit")
> }catch{
>   case ex: NoSuchFieldException => return 
> getDeleteOnExit(cls.getSuperclass)
> }
> return null
>   }
> }
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21730) Consider officially dropping PyPy pre-2.5 support

2017-08-14 Thread holdenk (JIRA)
holdenk created SPARK-21730:
---

 Summary: Consider officially dropping PyPy pre-2.5 support
 Key: SPARK-21730
 URL: https://issues.apache.org/jira/browse/SPARK-21730
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: holdenk


Jenkins currently tests with PyPy 2.5+; should we consider dropping PyPy 2.3 
support from the documentation?

cc [~davies] [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21730) Consider officially dropping PyPy pre-2.5 support

2017-08-14 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-21730:

Issue Type: Improvement  (was: Bug)

> Consider officially dropping PyPy pre-2.5 support
> -
>
> Key: SPARK-21730
> URL: https://issues.apache.org/jira/browse/SPARK-21730
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: holdenk
>
> Jenkins currently tests with PyPy 2.5+; should we consider dropping PyPy 2.3 
> support from the documentation?
> cc [~davies] [~shaneknapp]



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21729) Generic test for ProbabilisticClassifier to ensure consistent output columns

2017-08-14 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-21729:
-

 Summary: Generic test for ProbabilisticClassifier to ensure 
consistent output columns
 Key: SPARK-21729
 URL: https://issues.apache.org/jira/browse/SPARK-21729
 Project: Spark
  Issue Type: Test
  Components: ML
Affects Versions: 2.2.0
Reporter: Joseph K. Bradley


One challenge with the ProbabilisticClassifier abstraction is that it 
introduces different code paths for predictions depending on which output 
columns are turned on or off: probability, rawPrediction, prediction.  We ran 
into a bug in MLOR with this.

This task is for adding a generic test usable in all test suites for 
ProbabilisticClassifier types which does the following:
* Take a dataset + Estimator
* Fit the Estimator
* Test prediction using the model with all combinations of output columns 
turned on/off.
* Make sure the output column values match, presumably by comparing vs. the 
case with all 3 output columns turned on

CC [~WeichenXu123] since this came up in 
https://github.com/apache/spark/pull/17373
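
A rough sketch of what such a generic check could look like (helper name, type bound, and the particular column combination shown are illustrative, not the final test API):

{code:java}
// Hypothetical sketch; not the final test utility.
import org.apache.spark.ml.classification.ProbabilisticClassificationModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame

def checkPredictionConsistency[M <: ProbabilisticClassificationModel[Vector, M]](
    model: M,
    dataset: DataFrame): Unit = {
  // Reference run: rawPrediction, probability and prediction all enabled.
  val reference = model.transform(dataset).select("prediction").collect()

  // One of the reduced configurations: setting a column name to "" turns it off,
  // which exercises a different prediction code path.
  val reduced = model
    .setRawPredictionCol("")
    .setProbabilityCol("")
    .transform(dataset)
    .select("prediction")
    .collect()

  assert(reference.sameElements(reduced),
    "prediction should not depend on which output columns are enabled")
}
{code}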



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21696) State Store can't handle corrupted snapshots

2017-08-14 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-21696.
---
   Resolution: Fixed
Fix Version/s: 3.0.0
   2.2.1

Issue resolved by pull request 18928
[https://github.com/apache/spark/pull/18928]

> State Store can't handle corrupted snapshots
> 
>
> Key: SPARK-21696
> URL: https://issues.apache.org/jira/browse/SPARK-21696
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Alexander Bessonov
>Priority: Critical
> Fix For: 2.2.1, 3.0.0
>
>
> The state store's asynchronous maintenance task (generation of snapshot files) is 
> not rescheduled if it crashes, which might lead to corrupted snapshots.
> In our case, on multiple occasions, executors died during the maintenance task 
> with an Out Of Memory error, which led to the following error on recovery:
> {code:none}
> 17/08/07 20:12:24 WARN TaskSetManager: Lost task 3.1 in stage 102.0 (TID 
> 3314, dnj2-bach-r2n10.bloomberg.com, executor 94): java.io.EOFException
> at java.io.DataInputStream.readInt(DataInputStream.java:392)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:436)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:314)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:313)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:313)
> at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:220)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:186)
> at 
> org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Created] (SPARK-21728) Allow SparkSubmit to use logging

2017-08-14 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-21728:
--

 Summary: Allow SparkSubmit to use logging
 Key: SPARK-21728
 URL: https://issues.apache.org/jira/browse/SPARK-21728
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: Marcelo Vanzin
Priority: Minor


Currently, code in {{SparkSubmit}} cannot call classes or methods that 
initialize the Spark {{Logging}} framework. That is because at that time 
{{SparkSubmit}} doesn't yet know which application will run, and logging is 
initialized differently for certain special applications (notably, the shells).

It would be better if either {{SparkSubmit}} did logging initialization earlier 
based on the application to be run, or did it in a way that could be overridden 
later when the app initializes.

Without this, there are currently a few parts of {{SparkSubmit}} that 
duplicate code from other parts of Spark just to avoid logging. For example:

* 
[downloadFiles|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L860]
 replicates code from Utils.scala
* 
[createTempDir|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/DependencyUtils.scala#L54]
 replicates code from Utils.scala and installs its own shutdown hook
* a few parts of the code could use {{SparkConf}} but can't right now because 
of the logging issue.
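
For illustration, a minimal sketch of the second option above (logging 
initialization that is deferred until the application type is known and is safe 
to call more than once). The object, method, and property names here are 
hypothetical, not existing Spark APIs:

{code}
// Hypothetical sketch only (not the actual SparkSubmit code): an idempotent hook
// that is called once the application type is known, so helper code shared with
// Utils.scala could safely log instead of being duplicated.
object SubmitLogging {
  @volatile private var initialized = false

  def initializeLogging(isShell: Boolean): Unit = synchronized {
    if (!initialized) {
      // Shells get a quieter default level; ordinary applications keep INFO.
      val defaultLevel = if (isShell) "WARN" else "INFO"
      sys.props("spark.submit.hypotheticalDefaultLogLevel") = defaultLevel
      initialized = true
    }
  }
}
{code}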



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-08-14 Thread Miles Crawford (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126402#comment-16126402
 ] 

Miles Crawford commented on SPARK-18838:


[~imranr] We do run tens of thousands of tasks per stage in a few stages, yes. 
We kept running into this issue: 
https://spark.apache.org/docs/latest/tuning.html#memory-usage-of-reduce-tasks 
in some of our operations, and had to increase our partition count very high to 
get around it.

I'll try running with log level INFO on Spark 2.2.0 now and share that log with 
you when it hangs.
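
For context, a hedged sketch (not taken from this comment) of the usual knobs 
for raising the partition count so each reduce task's working set stays small; 
the values are illustrative only:

{code}
import org.apache.spark.sql.SparkSession

// Illustrative values only: raise shuffle parallelism for DataFrame/SQL jobs,
// and repartition an RDD explicitly when its partitions are too large.
val spark = SparkSession.builder().appName("partition-count-example").getOrCreate()
spark.conf.set("spark.sql.shuffle.partitions", "20000")
val wideRdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 200)
val repartitioned = wideRdd.repartition(20000)
{code}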


> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job.  For example, a 
> significant delay in receiving the `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of the 
> events per listener.  The queue used for the executor service will be bounded 
> to guarantee we do not grow the memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint. 
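
For illustration, a rough sketch of the per-listener executor idea described 
above; the Listener trait and bus class below are simplified stand-ins, not the 
actual SparkListener or LiveListenerBus types:

{code}
import java.util.concurrent.{ArrayBlockingQueue, ThreadPoolExecutor, TimeUnit}

trait Listener { def onEvent(event: Any): Unit }

// Each listener gets its own single-threaded executor with a bounded queue, so a
// slow listener only delays (or drops) its own events instead of stalling the bus.
final class PerListenerBus(listeners: Seq[Listener], queueCapacity: Int = 10000) {
  private val executors = listeners.map { l =>
    l -> new ThreadPoolExecutor(
      1, 1, 0L, TimeUnit.MILLISECONDS,
      new ArrayBlockingQueue[Runnable](queueCapacity),
      new ThreadPoolExecutor.DiscardPolicy() // bounded queue: drop events when full
    )
  }

  // Post the event to every listener's private queue; ordering is preserved per listener.
  def post(event: Any): Unit = executors.foreach { case (listener, exec) =>
    exec.execute(new Runnable { def run(): Unit = listener.onEvent(event) })
  }

  def stop(): Unit = executors.foreach { case (_, exec) => exec.shutdown() }
}
{code}

With this shape, a slow listener only backs up (or drops from) its own bounded 
queue, at the cost of one extra queue and thread per listener.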



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21715) History Server should not respond history page html content multiple times for only one http request

2017-08-14 Thread Ye Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126398#comment-16126398
 ] 

Ye Zhou commented on SPARK-21715:
-

Pull Request: https://github.com/apache/spark/pull/18941

> History Server should not respond history page html content multiple times 
> for only one http request
> 
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
>Priority: Minor
> Attachments: Performance.png, ResponseContent.png
>
>
> The UI looks fine for the home page. But when we checked the performance of 
> each individual component, we found three image download requests which take 
> much longer than expected: favicon.ico, sort_both.png, sort_desc.png. 
> These are the request addresses: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, 
> http://hostname:port/images/sort_desc.png. Later, if the user clicks on the 
> head of the table to sort a column, another request for 
> http://hostname:port/images/sort_asc.png will be sent.
> Browsers request favicon.ico by default, and all three sort_xxx.png requests 
> come from the default behavior of the dataTables jQuery plugin.
> The Spark history server starts several handlers to handle http requests, but 
> none of these requests is handled correctly; they all trigger the history 
> server to respond with the history page html content. As the screenshots 
> show, the response data types are all "text/html".
> To solve this problem, we need to download the images directory from 
> https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images
>  and put the folder under "core/src/main/resources/org/apache/spark/ui/static/". 
> We also need to modify dataTables.bootstrap.css so it points to the correct 
> images location. For the favicon.ico request, we need to add one line to the 
> html header to disable the download. 
> I can post a pull request if this is the correct way to fix this. I have 
> tried it and it works fine.
> !https://issues.apache.org/jira/secure/attachment/12881534/Performance.png!
> !https://issues.apache.org/jira/secure/attachment/12881535/ResponseContent.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6680) Be able to specify IP for spark-shell (spark driver), blocker for Docker integration

2017-08-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-6680.
---
Resolution: Duplicate

> Be able to specify IP for spark-shell (spark driver), blocker for Docker 
> integration
> ---
>
> Key: SPARK-6680
> URL: https://issues.apache.org/jira/browse/SPARK-6680
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 1.3.0
> Environment: Docker.
>Reporter: Egor Pakhomov
>Priority: Minor
>  Labels: core, deploy, docker
>
> Suppose I have 3 Docker containers - spark_master, spark_worker and 
> spark_shell. In Docker, the container's public IP has an alias like 
> "fgsdfg454534", which is only visible inside that container. When Spark uses 
> it for communication, other containers receive this alias and don't know what 
> to do with it. That's why I used SPARK_LOCAL_IP for the master and worker. 
> But it doesn't work for the Spark driver (for spark-shell - I haven't tried 
> other types of drivers). The Spark driver tells everyone the "fgsdfg454534" 
> alias for itself, and then nobody can address it. I've worked around it in 
> https://github.com/epahomov/docker-spark, but it would be better if this were 
> solved at the Spark code level.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21668) Ability to run driver programs within a container

2017-08-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-21668.

Resolution: Duplicate

> Ability to run driver programs within a container
> -
>
> Key: SPARK-21668
> URL: https://issues.apache.org/jira/browse/SPARK-21668
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Arseniy Tashoyan
>Priority: Minor
>  Labels: containers, docker, driver, spark-submit, standalone
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> When a driver program in client mode runs in a Docker container, it binds to 
> the IP address of the container, not of the host machine. This container IP 
> address is accessible only within the host machine; it is inaccessible to 
> master and worker nodes.
> For example, the host machine has IP address 192.168.216.10. When Docker 
> starts a container, it places it in a special bridged network and 
> assigns it an IP address like 172.17.0.2. Spark nodes belonging to the 
> 192.168.216.0 network cannot access the bridged network of the container. 
> Therefore, the driver program is not able to communicate with the Spark 
> cluster.
> Spark already provides the SPARK_PUBLIC_DNS environment variable for this 
> purpose. However, in this scenario, setting SPARK_PUBLIC_DNS to the host 
> machine's IP address does not work.
> Topic on StackOverflow: 
> [https://stackoverflow.com/questions/45489248/running-spark-driver-program-in-docker-container-no-connection-back-from-execu]
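
For reference, later Spark versions let the advertised address and the bind 
address be set separately; below is a hedged sketch of how a containerized 
driver might be configured. The host IP comes from the example above, the port 
values are illustrative and still need to be published by Docker, and whether 
this fully covers the scenario here is what the duplicate issue tracks:

{code}
import org.apache.spark.SparkConf

// Sketch only: advertise the host machine's address to the cluster while binding
// inside the container. spark.driver.bindAddress is available in Spark 2.1+.
val conf = new SparkConf()
  .setAppName("driver-in-container")
  .set("spark.driver.host", "192.168.216.10")  // address reachable by master and workers
  .set("spark.driver.bindAddress", "0.0.0.0")  // bind address inside the container
  .set("spark.driver.port", "7078")            // fixed ports so Docker can publish them
  .set("spark.blockManager.port", "7079")
{code}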



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21715) History Server should not respond history page html content multiple times for only one http request

2017-08-14 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21715:

Summary: History Server should not respond history page html content 
multiple times for only one http request  (was: History Server respondes 
history page html content multiple times for only one http request)

> History Server should not respond history page html content multiple times 
> for only one http request
> 
>
> Key: SPARK-21715
> URL: https://issues.apache.org/jira/browse/SPARK-21715
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
>Priority: Minor
> Attachments: Performance.png, ResponseContent.png
>
>
> The UI looks fine for the home page. But when we checked the performance of 
> each individual component, we found three image download requests which take 
> much longer than expected: favicon.ico, sort_both.png, sort_desc.png. 
> These are the request addresses: http://hostname:port/favicon.ico, 
> http://hostname:port/images/sort_both.png, 
> http://hostname:port/images/sort_desc.png. Later, if the user clicks on the 
> head of the table to sort a column, another request for 
> http://hostname:port/images/sort_asc.png will be sent.
> Browsers request favicon.ico by default, and all three sort_xxx.png requests 
> come from the default behavior of the dataTables jQuery plugin.
> The Spark history server starts several handlers to handle http requests, but 
> none of these requests is handled correctly; they all trigger the history 
> server to respond with the history page html content. As the screenshots 
> show, the response data types are all "text/html".
> To solve this problem, we need to download the images directory from 
> https://github.com/DataTables/Plugins/tree/master/integration/bootstrap/images
>  and put the folder under "core/src/main/resources/org/apache/spark/ui/static/". 
> We also need to modify dataTables.bootstrap.css so it points to the correct 
> images location. For the favicon.ico request, we need to add one line to the 
> html header to disable the download. 
> I can post a pull request if this is the correct way to fix this. I have 
> tried it and it works fine.
> !https://issues.apache.org/jira/secure/attachment/12881534/Performance.png!
> !https://issues.apache.org/jira/secure/attachment/12881535/ResponseContent.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-08-14 Thread Neil McQuarrie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil McQuarrie updated SPARK-21727:
---
Description: 
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
just fine... SparkR treats the column as ArrayType(Double). 

However, any subsequent operation on this SparkR DataFrame appears to throw an 
error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the SparkR DataFrame schema; notice the list column was successfully 
converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.


  was:
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
just fine... SparkR treats the column as having type ArrayType(Double). 

However, any subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the SparkR DataFrame schema; notice the list column was successfully 
converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), 

[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-08-14 Thread Neil McQuarrie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil McQuarrie updated SPARK-21727:
---
Description: 
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
just fine... SparkR treats the column as ArrayType(Double). 

However, any subsequent operation on this SparkR DataFrame appears to throw an 
error.


Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}


Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}


Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}


Examine the SparkR DataFrame schema; notice that the list column was 
successfully converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}


However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}


Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.


  was:
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
just fine... SparkR treats the column as ArrayType(Double). 

However, any subsequent operation on this SparkR DataFrame appears to throw an 
error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the SparkR DataFrame schema; notice the list column was successfully 
converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, 

[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-08-14 Thread Neil McQuarrie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil McQuarrie updated SPARK-21727:
---
Description: 
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
just fine... SparkR treats the column as having type ArrayType(Double). 

However, any subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the SparkR DataFrame schema; notice the list column was successfully 
converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.


  was:
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert the data.frame to a SparkR DataFrame 
just fine; SparkR treats the column as an ArrayType(Double). However, any 
subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the SparkR DataFrame schema; notice the list column was successfully 
converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) 

[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-08-14 Thread Neil McQuarrie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil McQuarrie updated SPARK-21727:
---
Description: 
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert the data.frame to a SparkR DataFrame 
just fine; SparkR treats the column as an ArrayType(Double). However, any 
subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the SparkR DataFrame schema; notice the list column was successfully 
converted to ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.


  was:
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert the data.frame to a SparkR DataFrame 
just fine; SparkR treats the column as an ArrayType(Double). However, any 
subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the DataFrame schema; the list column was successfully converted to 
ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long 

[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-08-14 Thread Neil McQuarrie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil McQuarrie updated SPARK-21727:
---
Description: 
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert the data.frame to a SparkR DataFrame 
just fine; SparkR treats the column as an ArrayType(Double). However, any 
subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))

> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the DataFrame schema; the list column was successfully converted to 
ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.


  was:
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer list 
-- i.e., each of the column elements embeds an entire R list of integers -- 
then I can convert the data.frame to a SparkR DataFrame just fine; SparkR 
treats the column as ArrayType(Double). However, any subsequent operation on 
this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the DataFrame schema; the list column was successfully converted to 
ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements 

[jira] [Updated] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-08-14 Thread Neil McQuarrie (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil McQuarrie updated SPARK-21727:
---
Description: 
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert the data.frame to a SparkR DataFrame 
just fine; SparkR treats the column as an ArrayType(Double). However, any 
subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Examine it to make sure it looks okay:
{code}
> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the DataFrame schema; the list column was successfully converted to 
ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.


  was:
Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer 
*list* -- i.e., each of the elements in the column embeds an entire R list of 
integers -- then it seems I can convert the data.frame to a SparkR DataFrame 
just fine; SparkR treats the column as an ArrayType(Double). However, any 
subsequent operation on this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))

> str(myDf) 
'data.frame':   4 obs. of  2 variables:  
 $ indices: int  1 2 3 4  
 $ data   :List of 4
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...
   ..$ : num  0 0 0 0 0 0 0 0 0 0 ...

> head(myDf)   
  indices   data 
1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the DataFrame schema; the list column was successfully converted to 
ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, 

[jira] [Created] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-08-14 Thread Neil McQuarrie (JIRA)
Neil McQuarrie created SPARK-21727:
--

 Summary: Operating on an ArrayType in a SparkR DataFrame throws 
error
 Key: SPARK-21727
 URL: https://issues.apache.org/jira/browse/SPARK-21727
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Neil McQuarrie


Previously 
[posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
 this as a stack overflow question but it seems to be a bug.

If I have an R data.frame where one of the column data types is an integer list 
-- i.e., each of the column elements embeds an entire R list of integers -- 
then I can convert the data.frame to a SparkR DataFrame just fine; SparkR 
treats the column as ArrayType(Double). However, any subsequent operation on 
this DataFrame appears to throw an error.

Create an example R data.frame:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(rep(0, 20))
{code}

Convert it to a SparkR DataFrame:
{code}
library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
sparkR.session(master = "local[*]")
mySparkDf <- as.DataFrame(myDf)
{code}

Examine the DataFrame schema; the list column was successfully converted to 
ArrayType:
{code}
> schema(mySparkDf)
StructType
|-name = "indices", type = "IntegerType", nullable = TRUE
|-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
{code}

However, operating on the SparkR DataFrame throws an error:
{code}
> collect(mySparkDf)
17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
(TID 1)
java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
java.lang.Double is not a valid external type for schema of array
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
... long stack trace ...
{code}

Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-18838) High latency of event processing for large jobs

2017-08-14 Thread Jason Dunkelberger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126022#comment-16126022
 ] 

Jason Dunkelberger edited comment on SPARK-18838 at 8/14/17 6:55 PM:
-

It's been about a week now, and we've had 100% completed runs since the 
blocking change. We've forked spark for the moment with the hack mashed on top 
of 2.2.0. Here's the diff: 
https://github.com/apache/spark/compare/v2.2.0...allenai:v2.2.0-ai2-SNAPSHOT#diff-ca0fe05a42fd5edcab8a1bdaa8e58db9

To be clear I don't think I've actually fixed anything specific. I've just 
changed the possible failures away from whatever leaves it hanging. [~irashid] 
I'll look into what you suggest. For now the goal was stability, which we've 
achieved with the naive change. I'm pretty confident that total performance has 
gone down, but again, that's still better than hanging altogether.

One other thought, I looked at subclasses of SparkListener which all go through 
LiveListenerBus (?). These seem pretty critical: -ExecutorAllocationListener- 
(ah this is the dynamic piece, so that's ok), 
BlockStatusListener/StorageListener (?) I didn't say before but we run on EMR 
via Yarn which may be relevant.


was (Author: dirkraft):
It's been about a week now, and we've had 100% completed runs since the 
blocking change. We've forked spark for the moment with the hack mashed on top 
of 2.2.0. Here's the diff: 
https://github.com/apache/spark/compare/v2.2.0...allenai:v2.2.0-ai2-SNAPSHOT#diff-ca0fe05a42fd5edcab8a1bdaa8e58db9

To be clear I don't think I've actually fixed anything specific. I've just 
changed the possible failures away from whatever leaves it hanging. [~irashid] 
I'll look into what you suggest. For now the goal was stability, which we've 
achieved with the naive change. I'm pretty confident that total performance has 
gone down, but again, that's still better than hanging altogether.

One other thought, I looked at subclasses of SparkListener which all go through 
LiveListenerBus (?). These seem pretty critical: ExecutorAllocationListener, 
BlockStatusListener/StorageListener (?) I didn't say before but we run on EMR 
via Yarn which may be relevant.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job.  For example, a 
> significant delay in receiving the `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of the 
> events per listener.  The queue used for the executor service will be bounded 
> to guarantee we do not grow the memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19372) Code generation for Filter predicate including many OR conditions exceeds JVM method size limit

2017-08-14 Thread poplav (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126182#comment-16126182
 ] 

poplav commented on SPARK-19372:


Hi [~kiszk], I ran into this failure as well.
Is it possible to backport this to 2.1.1?
Appreciate it!

> Code generation for Filter predicate including many OR conditions exceeds JVM 
> method size limit 
> 
>
> Key: SPARK-19372
> URL: https://issues.apache.org/jira/browse/SPARK-19372
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Jay Pranavamurthi
>Assignee: Kazuaki Ishizaki
> Fix For: 2.2.0, 2.3.0
>
> Attachments: wide400cols.csv
>
>
> For the attached csv file, the code below causes the exception 
> "org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(Lorg/apache/spark/sql/catalyst/InternalRow;)Z" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate" 
> grows beyond 64 KB
> Code:
> {code:borderStyle=solid}
>   val conf = new SparkConf().setMaster("local[1]")
>   val sqlContext = 
> SparkSession.builder().config(conf).getOrCreate().sqlContext
>   val dataframe =
> sqlContext
>   .read
>   .format("com.databricks.spark.csv")
>   .load("wide400cols.csv")
>   val filter = (0 to 399)
> .foldLeft(lit(false))((e, index) => 
> e.or(dataframe.col(dataframe.columns(index)) =!= s"column${index+1}"))
>   val filtered = dataframe.filter(filter)
>   filtered.show(100)
> {code}
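
Until a fixed release is available, one possible user-side workaround (a hedged 
sketch, not taken from this ticket) is to move the long OR chain into a Scala 
UDF, so the generated filter code contains only a single UDF call. Note that 
null handling inside the UDF follows Scala semantics rather than SQL semantics:

{code}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{struct, udf}

// Hedged workaround sketch: evaluate the ~400 comparisons in ordinary Scala inside
// a UDF instead of generating one 400-branch boolean expression.
val spark = SparkSession.builder().master("local[1]").getOrCreate()
val dataframe = spark.read.format("csv").load("wide400cols.csv")

val cols = dataframe.columns
// Pass the whole row to the UDF as a struct; the check is done in plain Scala.
val anyMismatch = udf((row: Row) =>
  cols.indices.exists(i => row.getString(i) != s"column${i + 1}"))
val filtered = dataframe.filter(anyMismatch(struct(cols.map(dataframe.col): _*)))
filtered.show(100)
{code}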



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21726) Check for structural integrity of the plan in QO in test mode

2017-08-14 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-21726:
---

 Summary: Check for structural integrity of the plan in QO in test 
mode
 Key: SPARK-21726
 URL: https://issues.apache.org/jira/browse/SPARK-21726
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Reynold Xin


Right now we don't have any checks in the optimizer for the structural 
integrity of the plan (e.g. that it is still resolved). It would be great if, in 
test mode, we could check whether a plan is still resolved after the execution 
of each rule, so we can catch rules that return invalid plans.
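
A hedged sketch of the idea, using simplified stand-ins for the Catalyst rule 
executor and plan types rather than the actual classes:

{code}
// Illustrative sketch only: after each rule runs in test mode, verify the plan is
// still structurally sound (here: still resolved) and fail fast with the rule's name.
trait Plan { def resolved: Boolean }
trait Rule[P <: Plan] { def name: String; def apply(plan: P): P }

class CheckedRuleExecutor[P <: Plan](isTestMode: Boolean) {
  def execute(plan: P, rules: Seq[Rule[P]]): P =
    rules.foldLeft(plan) { (current, rule) =>
      val next = rule(current)
      // A rule that turns a resolved plan into an unresolved one is a bug in the rule.
      if (isTestMode && current.resolved && !next.resolved) {
        throw new IllegalStateException(
          s"Rule ${rule.name} produced an invalid (unresolved) plan")
      }
      next
    }
}
{code}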




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21726) Check for structural integrity of the plan in QO in test mode

2017-08-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126115#comment-16126115
 ] 

Reynold Xin commented on SPARK-21726:
-

cc [~viirya] would you be interested in doing this?


> Check for structural integrity of the plan in QO in test mode
> -
>
> Key: SPARK-21726
> URL: https://issues.apache.org/jira/browse/SPARK-21726
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>
> Right now we don't have any checks in the optimizer for the structural 
> integrity of the plan (e.g. that it is still resolved). It would be great if, 
> in test mode, we could check whether a plan is still resolved after the 
> execution of each rule, so we can catch rules that return invalid plans.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-08-14 Thread Jason Dunkelberger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126030#comment-16126030
 ] 

Jason Dunkelberger commented on SPARK-18838:


I just realized I have diverged from the original performance issue of this 
ticket. If there's a better issue to track this on, let me know.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing very high event processing delay in the 
> driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt job performance significantly or even fail the job.  For example, a 
> significant delay in receiving the `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of the 
> events per listener.  The queue used for the executor service will be bounded 
> to guarantee we do not grow the memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14927) DataFrame. saveAsTable creates RDD partitions but not Hive partitions

2017-08-14 Thread Chaoyu Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126026#comment-16126026
 ] 

Chaoyu Tang commented on SPARK-14927:
-

I was not able to reproduce the issue as reported using Spark SQL 2.1.1 and 
Hive 2.1.1 from EMR. The table created in Spark SQL (partitioned or 
non-partitioned) could be accessed in Hive without any problems. I also 
examined table/partition/column metadata in HMS backend tables like TBLS, SDS, 
PARTITION_KEY_VALS, and COLUMNS_V2, and so far have not seen any discrepancies 
between the tables created from Spark and Hive.
[~rajeshc] I wonder if you are still having the issue; if so, could you share 
your sample so that I can help look into it? Thanks

> DataFrame. saveAsTable creates RDD partitions but not Hive partitions
> -
>
> Key: SPARK-14927
> URL: https://issues.apache.org/jira/browse/SPARK-14927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.1
> Environment: Mac OS X 10.11.4 local
>Reporter: Sasha Ovsankin
>
> This is a followup to 
> http://stackoverflow.com/questions/31341498/save-spark-dataframe-as-dynamic-partitioned-table-in-hive
>  . I tried to use the suggestions in the answers but couldn't make them work in 
> Spark 1.6.1.
> I am trying to create partitions programmatically from a `DataFrame`. Here is 
> the relevant code (adapted from a Spark test):
> hc.setConf("hive.metastore.warehouse.dir", "tmp/tests")
> //hc.setConf("hive.exec.dynamic.partition", "true")
> //hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
> hc.sql("create database if not exists tmp")
> hc.sql("drop table if exists tmp.partitiontest1")
> Seq(2012 -> "a").toDF("year", "val")
>   .write
>   .partitionBy("year")
>   .mode(SaveMode.Append)
>   .saveAsTable("tmp.partitiontest1")
> hc.sql("show partitions tmp.partitiontest1").show
> Full file is here: 
> https://gist.github.com/SashaOv/7c65f03a51c7e8f9c9e018cd42aa4c4a
> I get the error that the table is not partitioned:
> ==
> HIVE FAILURE OUTPUT
> ==
> SET hive.support.sql11.reserved.keywords=false
> SET hive.metastore.warehouse.dir=tmp/tests
> OK
> OK
> FAILED: Execution Error, return code 1 from 
> org.apache.hadoop.hive.ql.exec.DDLTask. Table tmp.partitiontest1 is not a 
> partitioned table
> ==
> It looks like the root cause is that 
> `org.apache.spark.sql.hive.HiveMetastoreCatalog.newSparkSQLSpecificMetastoreTable`
>  always creates the table with empty partitions.
> Any help to move this forward is appreciated.
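
For reference, a commonly suggested workaround on 1.6.x is to create the partitioned table through Hive DDL and then append with insertInto instead of saveAsTable, so that Hive owns the table layout. This is a hedged sketch, reusing the HiveContext {{hc}} and toy data from above; the DDL and the column reordering are assumptions, not a confirmed fix:

{code}
import org.apache.spark.sql.SaveMode

// Hedged workaround sketch for 1.6.x: let Hive own the table definition,
// then append into it so the metastore records the partitions.
hc.setConf("hive.exec.dynamic.partition", "true")
hc.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
hc.sql("create table if not exists tmp.partitiontest1 (val string) " +
  "partitioned by (year int) stored as parquet")

// insertInto matches columns by position, so the partition column goes last.
Seq(2012 -> "a").toDF("year", "val")
  .select("val", "year")
  .write
  .mode(SaveMode.Append)
  .insertInto("tmp.partitiontest1")

hc.sql("show partitions tmp.partitiontest1").show
{code}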



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18838) High latency of event processing for large jobs

2017-08-14 Thread Jason Dunkelberger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16126022#comment-16126022
 ] 

Jason Dunkelberger commented on SPARK-18838:


It's been about a week now, and we've had 100% completed runs since the 
blocking change. We've forked Spark for the moment with the hack mashed on top 
of 2.2.0. Here's the diff: 
https://github.com/apache/spark/compare/v2.2.0...allenai:v2.2.0-ai2-SNAPSHOT#diff-ca0fe05a42fd5edcab8a1bdaa8e58db9

To be clear I don't think I've actually fixed anything specific. I've just 
changed the possible failures away from whatever leaves it hanging. [~irashid] 
I'll look into what you suggest. For now the goal was stability, which we've 
achieved with the naive change. I'm pretty confident that total performance has 
gone down, but again, that's still better than hanging altogether.

One other thought: I looked at subclasses of SparkListener, which all go through 
LiveListenerBus (?). These seem pretty critical: ExecutorAllocationListener, 
BlockStatusListener/StorageListener (?). I didn't say before, but we run on EMR 
via YARN, which may be relevant.

> High latency of event processing for large jobs
> ---
>
> Key: SPARK-18838
> URL: https://issues.apache.org/jira/browse/SPARK-18838
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Sital Kedia
> Attachments: perfResults.pdf, SparkListernerComputeTime.xlsx
>
>
> Currently we are observing the issue of very high event processing delay in 
> the driver's `ListenerBus` for large jobs with many tasks. Many critical 
> components of the scheduler, like `ExecutorAllocationManager` and 
> `HeartbeatReceiver`, depend on the `ListenerBus` events, and this delay might 
> hurt the job performance significantly or even fail the job.  For example, a 
> significant delay in receiving the `SparkListenerTaskStart` might cause 
> `ExecutorAllocationManager` to mistakenly remove an executor which is 
> not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread 
> which loops through all the Listeners for each event and processes each event 
> synchronously 
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/LiveListenerBus.scala#L94.
>  This single-threaded processor often becomes the bottleneck for large jobs.  
> Also, if one of the Listeners is very slow, all the listeners will pay the 
> price of the delay incurred by the slow listener. In addition to that, a slow 
> listener can cause events to be dropped from the event queue, which might be 
> fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the 
> single-threaded event processor. Instead, each listener will have its own 
> dedicated single-threaded executor service. Whenever an event is posted, it 
> will be submitted to the executor service of every listener. The 
> single-threaded executor service will guarantee in-order processing of the 
> events per listener. The queue used for the executor service will be bounded 
> to guarantee we do not grow the memory indefinitely. The downside of this 
> approach is that a separate event queue per listener will increase the driver 
> memory footprint. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19471) [SQL]A confusing NullPointerException when creating table

2017-08-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-19471.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> [SQL]A confusing NullPointerException when creating table
> -
>
> Key: SPARK-19471
> URL: https://issues.apache.org/jira/browse/SPARK-19471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: Feng Zhu
>Priority: Critical
> Fix For: 2.3.0
>
>
> After upgrading our Spark from 1.6.2 to 2.1.0, I encountered a confusing 
> NullPointerException when creating a table under Spark 2.1.0, but the problem 
> does not exist in Spark 1.6.1. 
> Environment: Hive 1.2.1, Hadoop 2.6.4 
> {noformat}
>  Code  
> // spark is an instance of HiveContext 
> // merge is a Hive UDF 
> val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b 
> FROM tb_1 group by field_a, field_b") 
> df.createTempView("tb_temp") 
> spark.sql("create table tb_result stored as parquet as " + 
>   "SELECT new_a" + 
>   "FROM tb_temp" + 
>   "LEFT JOIN `tb_2` ON " + 
>   "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), 
> concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = 
> `tb_2`.`fka6862f17`") 
>  Physical Plan  
> *Project [new_a] 
> +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, 
> cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], 
> [fka6862f17], LeftOuter, BuildRight 
>:- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, 
> new_b, _nondeterministic]) 
>:  +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), 
> coordinator[target post-shuffle partition size: 1024880] 
>: +- *HashAggregate(keys=[field_a, field_b], functions=[], 
> output=[field_a, field_b]) 
>:+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, 
> Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> true])) 
>   +- *Project [fka6862f17] 
>  +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
> What does '*' mean before HashAggregate? 
>  Exception  
> org.apache.spark.SparkException: Task failed while writing rows 
> ... 
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown
>  Source) 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260)
>  
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197)
>  
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202)
>  
> at 
> 

[jira] [Assigned] (SPARK-19471) [SQL]A confusing NullPointerException when creating table

2017-08-14 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-19471:
---

Assignee: Feng Zhu

> [SQL]A confusing NullPointerException when creating table
> -
>
> Key: SPARK-19471
> URL: https://issues.apache.org/jira/browse/SPARK-19471
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: StanZhai
>Assignee: Feng Zhu
>Priority: Critical
> Fix For: 2.3.0
>
>
> After upgrading our Spark from 1.6.2 to 2.1.0, I encountered a confusing 
> NullPointerException when creating a table under Spark 2.1.0, but the problem 
> does not exist in Spark 1.6.1. 
> Environment: Hive 1.2.1, Hadoop 2.6.4 
> {noformat}
>  Code  
> // spark is an instance of HiveContext 
> // merge is a Hive UDF 
> val df = spark.sql("SELECT merge(field_a, null) AS new_a, field_b AS new_b 
> FROM tb_1 group by field_a, field_b") 
> df.createTempView("tb_temp") 
> spark.sql("create table tb_result stored as parquet as " + 
>   "SELECT new_a" + 
>   "FROM tb_temp" + 
>   "LEFT JOIN `tb_2` ON " + 
>   "if(((`tb_temp`.`new_b`) = '' OR (`tb_temp`.`new_b`) IS NULL), 
> concat('GrLSRwZE_', cast((rand() * 200) AS int)), (`tb_temp`.`new_b`)) = 
> `tb_2`.`fka6862f17`") 
>  Physical Plan  
> *Project [new_a] 
> +- *BroadcastHashJoin [if (((new_b = ) || isnull(new_b))) concat(GrLSRwZE_, 
> cast(cast((_nondeterministic * 200.0) as int) as string)) else new_b], 
> [fka6862f17], LeftOuter, BuildRight 
>:- HashAggregate(keys=[field_a, field_b], functions=[], output=[new_a, 
> new_b, _nondeterministic]) 
>:  +- Exchange(coordinator ) hashpartitioning(field_a, field_b, 180), 
> coordinator[target post-shuffle partition size: 1024880] 
>: +- *HashAggregate(keys=[field_a, field_b], functions=[], 
> output=[field_a, field_b]) 
>:+- *FileScan parquet bdp.tb_1[field_a,field_b] Batched: true, 
> Format: Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_1, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
>+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, 
> true])) 
>   +- *Project [fka6862f17] 
>  +- *FileScan parquet bdp.tb_2[fka6862f17] Batched: true, Format: 
> Parquet, Location: InMemoryFileIndex[hdfs://hdcluster/data/tb_2, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct 
> What does '*' mean before HashAggregate? 
>  Exception  
> org.apache.spark.SparkException: Task failed while writing rows 
> ... 
> java.lang.NullPointerException 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply_2$(Unknown
>  Source) 
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:260)
>  
> at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$generateResultProjection$3.apply(AggregationIterator.scala:259)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:392)
>  
> at 
> org.apache.spark.sql.execution.aggregate.TungstenAggregationIterator.next(TungstenAggregationIterator.scala:79)
>  
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source) 
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>  
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:252)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:199)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:197)
>  
> at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
>  
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:202)
>  
> at 
> 

[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-08-14 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125922#comment-16125922
 ] 

Marcelo Vanzin commented on SPARK-18085:


If your only problem is how fast to load a 100g event log, then correct, this 
change alone won't help you. I recommend you check the linked issues in this 
bug, some of which track things closer to what you want.

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125906#comment-16125906
 ] 

Hyukjin Kwon commented on SPARK-21658:
--

I can't assign this JIRA to this account ^, [~byakuinss]. Please set this if 
anyone is able.

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> Looks like {{na.replace}} missed the default value {{None}}.
> Both docs say they are aliases 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21658:


Assignee: Hyukjin Kwon

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> Looks like {{na.replace}} missed the default value {{None}}.
> Both docs say they are aliases 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-21658:


Assignee: (was: Hyukjin Kwon)

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> Looks like {{na.replace}} missed the default value {{None}}.
> Both docs say they are aliases 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-14 Thread Chin Han Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125898#comment-16125898
 ] 

Chin Han Yu commented on SPARK-21658:
-

[~hyukjin.kwon]
Hi, here is my JIRA account.

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> Looks like {{na.replace}} missed the default value {{None}}.
> Both docs say they are aliases 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21658) Adds the default None for value in na.replace in PySpark to match

2017-08-14 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-21658.
--
   Resolution: Fixed
Fix Version/s: 2.3.0

Issue resolved by pull request 18895
[https://github.com/apache/spark/pull/18895]

> Adds the default None for value in na.replace in PySpark to match
> -
>
> Key: SPARK-21658
> URL: https://issues.apache.org/jira/browse/SPARK-21658
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>  Labels: Starter
> Fix For: 2.3.0
>
>
> Looks like {{na.replace}} missed the default value {{None}}.
> Both docs say they are aliases 
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrame.replace
> http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.DataFrameNaFunctions.replace
> but the default values look different, which ends up with:
> {code}
> >>> df = spark.createDataFrame([('Alice', 10, 80.0)])
> >>> df.replace({"Alice": "a"}).first()
> Row(_1=u'a', _2=10, _3=80.0)
> >>> df.na.replace({"Alice": "a"}).first()
> Traceback (most recent call last):
>   File "", line 1, in 
> TypeError: replace() takes at least 3 arguments (2 given)
> {code}
> To take advantage of SPARK-19454, it sounds like we should match them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21563) Race condition when serializing TaskDescriptions and adding jars

2017-08-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-21563.
-
   Resolution: Fixed
 Assignee: Andrew Ash
Fix Version/s: 2.3.0
   2.2.1

> Race condition when serializing TaskDescriptions and adding jars
> 
>
> Key: SPARK-21563
> URL: https://issues.apache.org/jira/browse/SPARK-21563
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Assignee: Andrew Ash
> Fix For: 2.2.1, 2.3.0
>
>
> cc [~robert3005]
> I was seeing this exception during some running Spark jobs:
> {noformat}
> 16:16:28.294 [dispatcher-event-loop-14] ERROR 
> org.apache.spark.rpc.netty.Inbox - Ignoring error
> java.io.EOFException: null
> at java.io.DataInputStream.readFully(DataInputStream.java:197)
> at java.io.DataInputStream.readUTF(DataInputStream.java:609)
> at java.io.DataInputStream.readUTF(DataInputStream.java:564)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:127)
> at 
> org.apache.spark.scheduler.TaskDescription$$anonfun$decode$1.apply(TaskDescription.scala:126)
> at scala.collection.immutable.Range.foreach(Range.scala:160)
> at 
> org.apache.spark.scheduler.TaskDescription$.decode(TaskDescription.scala:126)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:95)
> at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
> at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
> at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
> at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> After some debugging, we determined that this is due to a race condition in 
> task serde.  cc [~irashid] [~kayousterhout] who last touched that code in 
> SPARK-19796
> The race is between adding additional jars to the SparkContext and 
> serializing the TaskDescription.
> Consider this sequence of events:
> - TaskSetManager creates a TaskDescription using a reference to the 
> SparkContext's jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L506
> - TaskDescription starts serializing, and begins writing jars: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L84
> - the size of the jar map is written out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L63
> - _on another thread_: the application adds a jar to the SparkContext's jars 
> list
> - then the entries in the jars list are serialized out: 
> https://github.com/apache/spark/blob/v2.2.0/core/src/main/scala/org/apache/spark/scheduler/TaskDescription.scala#L64
> The problem now is that the jars list is serialized as having N entries, but 
> actually N+1 entries follow that count!
> This causes task deserialization to fail in the executor, with the stacktrace 
> above.
> The same issue also likely exists for files, though I haven't observed that 
> and our application does not stress that codepath the same way it did for jar 
> additions.
> One fix here is that TaskSetManager could make an immutable copy of the jars 
> list that it passes into the TaskDescription constructor, so that list 
> doesn't change mid-serialization.
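
A minimal sketch of the shape of that fix, with illustrative names (the real TaskSetManager and TaskDescription fields differ):

{code}
// Sketch of the proposed fix only: snapshot the shared, mutable jar/file maps
// before handing them to the task description, so a concurrent sc.addJar()/addFile()
// cannot change the collection between writing its size and writing its entries.
// SimpleTaskDescription and the field names are illustrative, not Spark's classes.
case class SimpleTaskDescription(taskId: Long,
                                 addedJars: Map[String, Long],
                                 addedFiles: Map[String, Long])

def buildTaskDescription(taskId: Long,
                         sharedJars: scala.collection.Map[String, Long],
                         sharedFiles: scala.collection.Map[String, Long]): SimpleTaskDescription = {
  // toMap copies the entries into immutable maps owned by this task description.
  SimpleTaskDescription(taskId, sharedJars.toMap, sharedFiles.toMap)
}
{code}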



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12837) Spark driver requires large memory space for serialized results even there are no data collected to the driver

2017-08-14 Thread Todd Leo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125726#comment-16125726
 ] 

Todd Leo commented on SPARK-12837:
--

[~cloud_fan] Is there a ticket tracking "driver-free broadcast" ?

> Spark driver requires large memory space for serialized results even there 
> are no data collected to the driver
> --
>
> Key: SPARK-12837
> URL: https://issues.apache.org/jira/browse/SPARK-12837
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 1.5.2, 1.6.0
>Reporter: Tien-Dung LE
>Assignee: Wenchen Fan
>Priority: Critical
> Fix For: 2.2.0
>
>
> Executing a SQL statement with a large number of partitions requires a large 
> amount of driver memory even when there are no requests to collect data back 
> to the driver.
> Here are the steps to reproduce the issue.
> 1. Start spark shell with a spark.driver.maxResultSize setting
> {code:java}
> bin/spark-shell --driver-memory=1g --conf spark.driver.maxResultSize=1m
> {code}
> 2. Execute the code 
> {code:java}
> case class Toto( a: Int, b: Int)
> val df = sc.parallelize( 1 to 1e6.toInt).map( i => Toto( i, i)).toDF
> sqlContext.setConf( "spark.sql.shuffle.partitions", "200" )
> df.groupBy("a").count().saveAsParquetFile( "toto1" ) // OK
> sqlContext.setConf( "spark.sql.shuffle.partitions", 1e3.toInt.toString )
> df.repartition(1e3.toInt).groupBy("a").count().repartition(1e3.toInt).saveAsParquetFile(
>  "toto2" ) // ERROR
> {code}
> The error message is 
> {code:java}
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Total size of serialized results of 393 tasks (1025.9 KB) is bigger than 
> spark.driver.maxResultSize (1024.0 KB)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-21034) Allow filter pushdown filters through non deterministic functions for columns involved in groupby / join

2017-08-14 Thread Abhijit Bhole (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhijit Bhole updated SPARK-21034:
--
Comment: was deleted

(was: In FilterPushdownSuite.scala, it seems this case should already have been handled? Am 
I misunderstanding something?
 
{code:java}
 test("nondeterministic: push down part of filter through aggregate with 
deterministic field") {
val originalQuery = testRelation
  .groupBy('a)('a)
  .where('a > 5 && Rand(10) > 5)
  .analyze

val optimized = Optimize.execute(originalQuery.analyze)

val correctAnswer = testRelation
  .where('a > 5)
  .groupBy('a)('a)
  .where(Rand(10) > 5)
  .analyze

comparePlans(optimized, correctAnswer)
  }
{code}
)

> Allow filter pushdown filters through non deterministic functions for columns 
> involved in groupby / join
> 
>
> Key: SPARK-21034
> URL: https://issues.apache.org/jira/browse/SPARK-21034
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>
> If the column is involved in an aggregation / join, then pushing down the 
> filter should not change the results.
> Here is some sample code: 
> {code:java}
> from pyspark.sql import functions as F
> df = spark.createDataFrame([{ "a": 1, "b" : 2, "c":7}, { "a": 3, "b" : 4, "c" 
> : 8},
>{ "a": 1, "b" : 5, "c":7}, { "a": 1, "b" : 6, 
> "c":7} ])
> df.groupBy(["a"]).agg(F.sum("b")).where("a = 1").explain()
> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *HashAggregate(keys=[a#15L], functions=[sum(b#16L)])
> +- Exchange hashpartitioning(a#15L, 4)
>+- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L)])
>   +- *Project [a#15L, b#16L]
>  +- *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> >>>
> >>> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- *HashAggregate(keys=[a#15L], functions=[sum(b#16L), first(c#17L, false)])
>+- Exchange hashpartitioning(a#15L, 4)
>   +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L), 
> partial_first(c#17L, false)])
>  +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> {code}
> As you can see, the filter is not pushed down when the F.first aggregate 
> function is used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21034) Allow filter pushdown filters through non deterministic functions for columns involved in groupby / join

2017-08-14 Thread Abhijit Bhole (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125619#comment-16125619
 ] 

Abhijit Bhole commented on SPARK-21034:
---

In FilterPushdownSuite.scala, it seems this case should already have been handled? Am I 
misunderstanding something?
 
{code:java}
 test("nondeterministic: push down part of filter through aggregate with 
deterministic field") {
val originalQuery = testRelation
  .groupBy('a)('a)
  .where('a > 5 && Rand(10) > 5)
  .analyze

val optimized = Optimize.execute(originalQuery.analyze)

val correctAnswer = testRelation
  .where('a > 5)
  .groupBy('a)('a)
  .where(Rand(10) > 5)
  .analyze

comparePlans(optimized, correctAnswer)
  }
{code}


> Allow filter pushdown filters through non deterministic functions for columns 
> involved in groupby / join
> 
>
> Key: SPARK-21034
> URL: https://issues.apache.org/jira/browse/SPARK-21034
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer, SQL
>Affects Versions: 2.1.1, 2.2.0
>Reporter: Abhijit Bhole
>
> If the column is involved in an aggregation / join, then pushing down the 
> filter should not change the results.
> Here is some sample code: 
> {code:java}
> from pyspark.sql import functions as F
> df = spark.createDataFrame([{ "a": 1, "b" : 2, "c":7}, { "a": 3, "b" : 4, "c" 
> : 8},
>{ "a": 1, "b" : 5, "c":7}, { "a": 1, "b" : 6, 
> "c":7} ])
> df.groupBy(["a"]).agg(F.sum("b")).where("a = 1").explain()
> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *HashAggregate(keys=[a#15L], functions=[sum(b#16L)])
> +- Exchange hashpartitioning(a#15L, 4)
>+- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L)])
>   +- *Project [a#15L, b#16L]
>  +- *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> >>>
> >>> df.groupBy(["a"]).agg(F.sum("b"), F.first("c")).where("a = 1").explain()
> == Physical Plan ==
> *Filter (isnotnull(a#15L) && (a#15L = 1))
> +- *HashAggregate(keys=[a#15L], functions=[sum(b#16L), first(c#17L, false)])
>+- Exchange hashpartitioning(a#15L, 4)
>   +- *HashAggregate(keys=[a#15L], functions=[partial_sum(b#16L), 
> partial_first(c#17L, false)])
>  +- Scan ExistingRDD[a#15L,b#16L,c#17L]
> {code}
> As you can see, the filter is not pushed down when the F.first aggregate 
> function is used.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-08-14 Thread xinzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xinzhang updated SPARK-21725:
-
Description: 
use thriftserver create table with partitions.

session 1:
 SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
partitioned by (pt string) stored as parquet;
--ok
 !exit

session 2:
 SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
partitioned by (pt string) stored as parquet; 
--ok
 !exit

session 3:
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--ok
 !exit

session 4(do it again):
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--error
 !exit

-
17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING, 
java.lang.reflect.InvocationTargetException
..
..
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
source 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
512282-2/-ext-1/part-0 to destination 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
... 45 more
Caused by: java.io.IOException: Filesystem closed

-


The docs describe Hive metastore Parquet tables here: 
http://spark.apache.org/docs/latest/sql-programming-guide.html#parquet-files

Hive metastore Parquet table conversion
When reading from and writing to Hive metastore Parquet tables, Spark SQL will 
try to use its own Parquet support instead of Hive SerDe for better 
performance. This behavior is controlled by the 
spark.sql.hive.convertMetastoreParquet configuration, and is turned on by 
default.

I am confused: the problem appears with partitioned tables but not with 
non-partitioned tables. Does this mean Spark does not use its own Parquet support here?
Could someone suggest how I can avoid this issue?

  was:
use thriftserver create table with partitions.

session 1:
 SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
partitioned by (pt string) stored as parquet;
--ok
 !exit

session 2:
 SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
partitioned by (pt string) stored as parquet; 
--ok
 !exit

session 3:
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--ok
 !exit

session 4(do it again):
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--error
 !exit

-
17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING, 
java.lang.reflect.InvocationTargetException
..
..
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
source 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
512282-2/-ext-1/part-0 to destination 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
... 45 more
Caused by: java.io.IOException: Filesystem closed

-




> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET 

[jira] [Updated] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-08-14 Thread xinzhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xinzhang updated SPARK-21725:
-
Description: 
use thriftserver create table with partitions.

session 1:
 SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
partitioned by (pt string) stored as parquet;
--ok
 !exit

session 2:
 SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
partitioned by (pt string) stored as parquet; 
--ok
 !exit

session 3:
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--ok
 !exit

session 4(do it again):
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--error
 !exit

-
17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING, 
java.lang.reflect.InvocationTargetException
..
..
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
source 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
512282-2/-ext-1/part-0 to destination 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
... 45 more
Caused by: java.io.IOException: Filesystem closed

-



  was:
use thriftserver create table with partitions.
session 1:
 SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
partitioned by (pt string) stored as parquet;
--ok
 !exit
session 2:
 SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
partitioned by (pt string) stored as parquet; 
--ok
 !exit
session 3:
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--ok
 !exit
session 4(do it again):
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--error
 !exit

-
17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING, 
java.lang.reflect.InvocationTargetException
..
..
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
source 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
512282-2/-ext-1/part-0 to destination 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
... 45 more
Caused by: java.io.IOException: Filesystem closed

-




> spark thriftserver insert overwrite table partition select 
> ---
>
> Key: SPARK-21725
> URL: https://issues.apache.org/jira/browse/SPARK-21725
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0
> Environment: centos 6.7 spark 2.1  jdk8
>Reporter: xinzhang
>  Labels: spark-sql
>
> use thriftserver create table with partitions.
> session 1:
>  SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
> partitioned by (pt string) stored as parquet;
> --ok
>  !exit
> session 2:
>  SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
> partitioned by (pt string) stored as parquet; 
> --ok
>  !exit
> session 3:
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --ok
>  !exit
> session 4(do it again):
> --connect the thriftserver
> SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
> partition(pt='1') select count(1) count from tmp_11;
> --error
>  !exit
> 

[jira] [Created] (SPARK-21725) spark thriftserver insert overwrite table partition select

2017-08-14 Thread xinzhang (JIRA)
xinzhang created SPARK-21725:


 Summary: spark thriftserver insert overwrite table partition 
select 
 Key: SPARK-21725
 URL: https://issues.apache.org/jira/browse/SPARK-21725
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.1.0
 Environment: centos 6.7 spark 2.1  jdk8
Reporter: xinzhang


use thriftserver create table with partitions.
session 1:
 SET hive.default.fileformat=Parquet;create table tmp_10(count bigint) 
partitioned by (pt string) stored as parquet;
--ok
 !exit
session 2:
 SET hive.default.fileformat=Parquet;create table tmp_11(count bigint) 
partitioned by (pt string) stored as parquet; 
--ok
 !exit
session 3:
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--ok
 !exit
session 4(do it again):
--connect the thriftserver
SET hive.default.fileformat=Parquet;insert overwrite table tmp_10 
partition(pt='1') select count(1) count from tmp_11;
--error
 !exit

-
17/08/14 18:13:42 ERROR SparkExecuteStatementOperation: Error executing query, 
currentState RUNNING, 
java.lang.reflect.InvocationTargetException
..
..
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move 
source 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/.hive-staging_hive_2017-08-14_18-13-39_035_6303339779053
512282-2/-ext-1/part-0 to destination 
hdfs://dc-hadoop54:50001/group/user/user1/meta/hive-temp-table/user1.db/tmp_11/pt=1/part-0
at org.apache.hadoop.hive.ql.metadata.Hive.moveFile(Hive.java:2644)
at org.apache.hadoop.hive.ql.metadata.Hive.copyFiles(Hive.java:2711)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1403)
at org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(Hive.java:1324)
... 45 more
Caused by: java.io.IOException: Filesystem closed

-





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21724) Missing since information in the documentation of date functions

2017-08-14 Thread panbingkun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125520#comment-16125520
 ] 

panbingkun commented on SPARK-21724:


In DescribeFunctionCommand
info.getSince

> Missing since information in the documentation of date functions
> 
>
> Key: SPARK-21724
> URL: https://issues.apache.org/jira/browse/SPARK-21724
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Minor
>
> Currently, we have missing version information for Spark SQL's builtin 
> functions. 
> Please see https://spark-test.github.io/sparksqldoc/
> For example, we could add the version information as below:
> {code}
> spark-sql> describe function extended datediff;
> Function: datediff
> ...
> Extended Usage:
> Examples:
>   ...
> Since: 1.5.0
> {code}
> and also in the SQL builtin function documentation.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21724) Missing since information in the documentation of date functions

2017-08-14 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-21724:


 Summary: Missing since information in the documentation of date 
functions
 Key: SPARK-21724
 URL: https://issues.apache.org/jira/browse/SPARK-21724
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon
Priority: Minor


Currently, we have missing version information for Spark SQL's builtin 
functions. 
Please see https://spark-test.github.io/sparksqldoc/

For example, we could add the version information as below:

{code}
spark-sql> describe function extended datediff;
Function: datediff
...
Extended Usage:
Examples:
  ...

Since: 1.5.0
{code}

and also in the SQL builtin function documentation.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21723) Can't write LibSVM - key not found: numFeatures

2017-08-14 Thread Jan Vršovský (JIRA)
Jan Vršovský created SPARK-21723:


 Summary: Can't write LibSVM - key not found: numFeatures
 Key: SPARK-21723
 URL: https://issues.apache.org/jira/browse/SPARK-21723
 Project: Spark
  Issue Type: Bug
  Components: Input/Output, ML
Affects Versions: 2.2.0, 2.3.0
Reporter: Jan Vršovský


Writing a dataset to LibSVM format raises an exception:

{{java.util.NoSuchElementException: key not found: numFeatures}}

This happens only when the dataset was NOT read from LibSVM format before (because 
otherwise numFeatures is in its metadata). Steps to reproduce:

{code}
import org.apache.spark.ml.linalg.Vectors
val rawData = Seq((1.0, Vectors.sparse(3, Seq((0, 2.0), (1, 3.0)))),
  (4.0, Vectors.sparse(3, Seq((0, 5.0), (2, 6.0)))))
val dfTemp = spark.sparkContext.parallelize(rawData).toDF("label", "features")
dfTemp.coalesce(1).write.format("libsvm").save("...filename...")
{code}

A PR with a fix and a unit test is ready.
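
Until a fix lands, one hedged workaround is to sidestep the DataFrame libsvm writer and use the RDD-based MLUtils.saveAsLibSVMFile, which writes the vectors directly and does not rely on the numFeatures metadata key. A sketch assuming the {{dfTemp}} from the reproduction above:

{code}
// Hedged workaround sketch: convert to mllib LabeledPoints and write with the
// RDD-based saver instead of the DataFrame-based "libsvm" format.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

val labeled = dfTemp.rdd.map { row =>
  LabeledPoint(row.getDouble(0), OldVectors.fromML(row.getAs[Vector]("features")))
}
MLUtils.saveAsLibSVMFile(labeled, "...filename...")
{code}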



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125407#comment-16125407
 ] 

Sean Owen commented on SPARK-21688:
---

Solving part of the problem is better than none at all. I do agree that it's 
possible to set the variable, but to a value that's 'bad', so setting it isn't 
even a sure sign that it's optimized. 

Maybe we can warn if none of the common env variables are set. That is a 
somewhat separate issue, I suppose.
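
As a rough illustration of that idea (the variable list and the warning wording below are assumptions, not an agreed design):

{code}
// Sketch only: warn when none of the common native-BLAS thread-count
// variables is set to a non-empty value.
val blasThreadVars = Seq("MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS", "OMP_NUM_THREADS")

def warnIfBlasThreadsUnset(warn: String => Unit): Unit = {
  val anySet = blasThreadVars.exists(v => sys.env.get(v).exists(_.trim.nonEmpty))
  if (!anySet) {
    warn(s"None of ${blasThreadVars.mkString(", ")} is set; " +
      "native BLAS may oversubscribe CPU cores with its default thread count.")
  }
}

// Example: warnIfBlasThreadsUnset(msg => println(s"WARN: $msg"))
{code}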

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
>Priority: Minor
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper 
> settings native BLAS is generally better than f2j at the unit-test level, so 
> here we make the BLAS operations in SVM go through MKL BLAS and present an 
> end-to-end performance report showing that in most cases native BLAS 
> outperforms f2j BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> impacted algorithms. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21688) performance improvement in mllib SVM with native BLAS

2017-08-14 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125353#comment-16125353
 ] 

Vincent commented on SPARK-21688:
-

Sorry for the late reply. 
Yes, it's simple and easy to check the env variables in the code, but I don't 
think that's the right thing to do. 
First, I still believe that if a user decides to run on native BLAS to speed up 
his/her application, he/she should be aware of the proper settings mentioned in 
https://issues.apache.org/jira/browse/SPARK-21305. They can set 1, 2, or any 
arbitrary number of threads for native BLAS, whichever gives them better 
performance. 
Second, there are a bunch of BLAS variants (MKL, OpenBLAS, ATLAS, cuBLAS, 
etc.), and each one has a different variable name for this setting, so checking 
all of these variant settings in the code doesn't seem right.

> performance improvement in mllib SVM with native BLAS 
> --
>
> Key: SPARK-21688
> URL: https://issues.apache.org/jira/browse/SPARK-21688
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.2.0
> Environment: 4 nodes: 1 master node, 3 worker nodes
> model name  : Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz
> Memory : 180G
> num of core per node: 10
>Reporter: Vincent
>Priority: Minor
> Attachments: ddot unitest.png, mllib svm training.png, 
> native-trywait.png, svm1.png, svm2.png, svm-mkl-1.png, svm-mkl-2.png
>
>
> In the current MLlib SVM implementation, we found that the CPU is not fully 
> utilized; one reason is that f2j BLAS is hard-coded for the HingeGradient 
> computation. As we found out earlier 
> (https://issues.apache.org/jira/browse/SPARK-21305), with proper 
> settings native BLAS is generally better than f2j at the unit-test level, so 
> here we make the BLAS operations in SVM go through MKL BLAS and present an 
> end-to-end performance report showing that in most cases native BLAS 
> outperforms f2j BLAS by up to 50%.
> So, we suggest removing the f2j-fixed calls and going with native BLAS if 
> available. If this proposal is acceptable, we will move on to benchmark other 
> impacted algorithms. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-9104) expose network layer memory usage in shuffle part

2017-08-14 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9104?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125331#comment-16125331
 ] 

Saisai Shao commented on SPARK-9104:


I think it is quite useful to get the memory details of Netty, so I'm taking 
another shot at this issue (https://github.com/apache/spark/pull/18935).

> expose network layer memory usage in shuffle part
> -
>
> Key: SPARK-9104
> URL: https://issues.apache.org/jira/browse/SPARK-9104
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Zhang, Liye
>
> The default network transport is Netty, and when transferring blocks for 
> shuffle, the network layer will consume a decent amount of memory; we should 
> collect the memory usage of this part and expose it. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21024) CSV parse mode handles Univocity parser exceptions

2017-08-14 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16125293#comment-16125293
 ] 

Takeshi Yamamuro commented on SPARK-21024:
--

I talked with the author of univocity-parsers 
(https://github.com/uniVocity/univocity-parsers/issues/179#issuecomment-321434323),
 but I didn't get a workaround for this error handling. So, if we handle 
this case, I feel we might need to do something on the Spark side. This is just an FYI.

> CSV parse mode handles Univocity parser exceptions
> --
>
> Key: SPARK-21024
> URL: https://issues.apache.org/jira/browse/SPARK-21024
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> The current master cannot skip the illegal records that the Univocity parser rejects:
> This comes from the spark-user mailing list:
> https://www.mail-archive.com/user@spark.apache.org/msg63985.html
> {code}
> scala> Seq("0,1", "0,1,2,3").toDF().write.text("/Users/maropu/Desktop/data")
> scala> val df = spark.read.format("csv").schema("a int, b 
> int").option("maxColumns", "3").load("/Users/maropu/Desktop/data")
> scala> df.show
> com.univocity.parsers.common.TextParsingException: 
> java.lang.ArrayIndexOutOfBoundsException - 3
> Hint: Number of columns processed may have exceeded limit of 3 columns. Use 
> settings.setMaxColumns(int) to define the maximum number of columns your 
> input can have
> Ensure your configuration is correct, with delimiters, quotes and escape 
> sequences that match the input format you are trying to parse
> Parser Configuration: CsvParserSettings:
> Auto configuration enabled=true
> Autodetect column delimiter=false
> Autodetect quotes=false
> Column reordering enabled=true
> Empty value=null
> Escape unquoted values=false
> ...
> at 
> com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:339)
> at 
> com.univocity.parsers.common.AbstractParser.handleEOF(AbstractParser.java:195)
> at 
> com.univocity.parsers.common.AbstractParser.parseLine(AbstractParser.java:544)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser.parse(UnivocityParser.scala:191)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$5.apply(UnivocityParser.scala:308)
> at 
> org.apache.spark.sql.execution.datasources.FailureSafeParser.parse(FailureSafeParser.scala:60)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
> at 
> org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$parseIterator$1.apply(UnivocityParser.scala:312)
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
> ...
> {code}
> We could easily fix this like: 
> https://github.com/apache/spark/compare/master...maropu:HandleExceptionInParser
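
For illustration, the Spark-side change being suggested is roughly of this shape: catch the Univocity exception around the per-record parse and rethrow it as a bad record so the existing parse-mode handling (PERMISSIVE / DROPMALFORMED / FAILFAST) can drop or null it out. The wrapper name and the exact exception wiring below are assumptions, not the linked patch:

{code}
import com.univocity.parsers.common.TextParsingException
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.util.BadRecordException
import org.apache.spark.unsafe.types.UTF8String

// Sketch only: records that Univocity itself rejects (e.g. more columns than
// maxColumns) surface as BadRecordException, which the surrounding
// FailureSafeParser-style handling already knows how to treat per parse mode.
def parseWithModeHandling(parseLine: String => InternalRow)(line: String): InternalRow = {
  try {
    parseLine(line)
  } catch {
    case e: TextParsingException =>
      throw BadRecordException(() => UTF8String.fromString(line), () => None, e)
  }
}
{code}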



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org