[jira] [Created] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method

2018-09-04 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-25339:
---

 Summary: Refactor FilterPushdownBenchmark to use main method
 Key: SPARK-25339
 URL: https://issues.apache.org/jira/browse/SPARK-25339
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.0
Reporter: Yuming Wang


Wenchen commented on the PR: 
https://github.com/apache/spark/pull/22336#issuecomment-418604019
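
A hedged sketch of the shape being proposed (illustrative only, not the final Spark code): the benchmark becomes a plain object driven by a main method, so it can be launched directly instead of running as a test suite.

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative only: a benchmark driven by main() rather than a test runner.
object FilterPushdownBenchmarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("FilterPushdownBenchmark")
      .getOrCreate()
    try {
      // generate the benchmark data and run the filter-pushdown cases here,
      // printing the timings (e.g. via spark.time) instead of asserting in a test
    } finally {
      spark.stop()
    }
  }
}
{code}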






[jira] [Resolved] (SPARK-25336) Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25336.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22334
[https://github.com/apache/spark/pull/22334]

> Revert SPARK-24863 and SPARK-24748
> --
>
> Key: SPARK-25336
> URL: https://issues.apache.org/jira/browse/SPARK-25336
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
> Fix For: 2.4.0
>
>
> Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source 
> v2 APIs are out.






[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2018-09-04 Thread Yuming Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603937#comment-16603937
 ] 

Yuming Wang commented on SPARK-19145:
-

How about we target this ticket and 
[SPARK-25039|https://issues.apache.org/jira/browse/SPARK-25039] to Spark 3.0.0?

> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>Priority: Major
>
> I have a time series table with a timestamp column. 
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to 
> String before applying the filter. 
> However, in the second query no such casting is performed and the filter uses 
> a long value. 
> Below are the generated physical plans: the plan for the slower execution 
> followed by the plan for the faster execution. 
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
> 148340483100)) && (time#3262 <= 148400963100))
>+- *FileScan parquet default.cstat[time#3262] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time), 
> GreaterThanOrEqual(time,2017-01-02 19:53:51.0), 
> LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
> In Impala both queries run efficiently without any performance difference.
> Spark should be able to parse the date string and convert it to a Long/Timestamp 
> during generation of the optimized logical plan so that both queries would have 
> similar performance.
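
As a hedged illustration of the reporter's point (not part of the original report): casting the literal rather than the column keeps the comparison in the timestamp domain, so the predicate stays pushdown-friendly. A minimal sketch, assuming the same `default`.`table`:

{code:scala}
// Sketch only: cast the string literals, not the `time` column, so the
// predicate remains a timestamp comparison and can be pushed down.
spark.sql("""
  SELECT COUNT(*) AS `count`
  FROM `default`.`table`
  WHERE `time` >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)
    AND `time` <= CAST('2017-01-09 19:53:51' AS TIMESTAMP)
  LIMIT 5
""").show()
{code}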






[jira] [Created] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method

2018-09-04 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-25338:


 Summary: Several tests miss calling super.afterAll() in their 
afterAll() method
 Key: SPARK-25338
 URL: https://issues.apache.org/jira/browse/SPARK-25338
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


The following tests under {{external}} may not call {{super.afterAll()}} in 
their {{afterAll()}} method.

{code}
external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala
external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala
external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala
external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala
external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala
external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala
{code}
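
For reference, a minimal sketch of the pattern being asked for, assuming a plain ScalaTest suite mixing in {{BeforeAndAfterAll}} (the suite name and teardown body are illustrative, not taken from the listed suites):

{code:scala}
import org.scalatest.{BeforeAndAfterAll, FunSuite}

// Sketch: suite-specific teardown runs first, and super.afterAll() always
// runs, even when the local cleanup throws.
class ExampleStreamSuite extends FunSuite with BeforeAndAfterAll {
  override def afterAll(): Unit = {
    try {
      // stop the embedded Kafka/Flume/Kinesis test fixture here
    } finally {
      super.afterAll()
    }
  }
}
{code}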






[jira] [Commented] (SPARK-25306) Avoid skewed filter trees to speed up `createFilter` in ORC

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603926#comment-16603926
 ] 

Apache Spark commented on SPARK-25306:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22336

> Avoid skewed filter trees to speed up `createFilter` in ORC
> ---
>
> Key: SPARK-25306
> URL: https://issues.apache.org/jira/browse/SPARK-25306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 2.4.0
>
>
> In both ORC data sources, createFilter function has exponential time 
> complexity due to its skewed filter tree generation. This PR aims to improve 
> it by using new buildTree function.
> *REPRODUCE*
> {code}
> // Create and read 1 row table with 1000 columns
> sql("set spark.sql.orc.filterPushdown=true")
> val selectExpr = (1 to 1000).map(i => s"id c$i")
> spark.range(1).selectExpr(selectExpr: 
> _*).write.mode("overwrite").orc("/tmp/orc")
> print(s"With 0 filters, ")
> spark.time(spark.read.orc("/tmp/orc").count)
> // Increase the number of filters
> (20 to 30).foreach { width =>
>   val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
>   print(s"With $width filters, ")
>   spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
> }
> {code}
> *RESULT*
> {code}
> With 0 filters, Time taken: 653 ms
>   
> With 20 filters, Time taken: 962 ms
> With 21 filters, Time taken: 1282 ms
> With 22 filters, Time taken: 1982 ms
> With 23 filters, Time taken: 3855 ms
> With 24 filters, Time taken: 6719 ms
> With 25 filters, Time taken: 12669 ms
> With 26 filters, Time taken: 25032 ms
> With 27 filters, Time taken: 49585 ms
> With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds
> With 29 filters, Time taken: 198368 ms   // over 3 mins
> With 30 filters, Time taken: 393744 ms   // over 6 mins
> {code}
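
As a rough illustration of the idea behind {{buildTree}} (a hedged sketch, not the actual Spark patch): instead of folding predicates into a left-skewed chain of ANDs, pair them up into a balanced tree so recursive conversion no longer revisits ever-deeper subtrees.

{code:scala}
// Sketch: build a balanced AND tree over n predicates (depth ~ log2 n)
// instead of a skewed chain (depth ~ n).
sealed trait Pred
case class Leaf(name: String) extends Pred
case class And(left: Pred, right: Pred) extends Pred

def buildTree(preds: Seq[Pred]): Pred = {
  require(preds.nonEmpty, "at least one predicate expected")
  if (preds.length == 1) {
    preds.head
  } else {
    val (left, right) = preds.splitAt(preds.length / 2)
    And(buildTree(left), buildTree(right))
  }
}

// Example: buildTree((1 to 8).map(i => Leaf(s"c$i is not null"))) has depth 3.
{code}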






[jira] [Commented] (SPARK-25306) Avoid skewed filter trees to speed up `createFilter` in ORC

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603921#comment-16603921
 ] 

Apache Spark commented on SPARK-25306:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22336

> Avoid skewed filter trees to speed up `createFilter` in ORC
> ---
>
> Key: SPARK-25306
> URL: https://issues.apache.org/jira/browse/SPARK-25306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 2.4.0
>
>
> In both ORC data sources, createFilter function has exponential time 
> complexity due to its skewed filter tree generation. This PR aims to improve 
> it by using new buildTree function.
> *REPRODUCE*
> {code}
> // Create and read 1 row table with 1000 columns
> sql("set spark.sql.orc.filterPushdown=true")
> val selectExpr = (1 to 1000).map(i => s"id c$i")
> spark.range(1).selectExpr(selectExpr: 
> _*).write.mode("overwrite").orc("/tmp/orc")
> print(s"With 0 filters, ")
> spark.time(spark.read.orc("/tmp/orc").count)
> // Increase the number of filters
> (20 to 30).foreach { width =>
>   val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
>   print(s"With $width filters, ")
>   spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
> }
> {code}
> *RESULT*
> {code}
> With 0 filters, Time taken: 653 ms
>   
> With 20 filters, Time taken: 962 ms
> With 21 filters, Time taken: 1282 ms
> With 22 filters, Time taken: 1982 ms
> With 23 filters, Time taken: 3855 ms
> With 24 filters, Time taken: 6719 ms
> With 25 filters, Time taken: 12669 ms
> With 26 filters, Time taken: 25032 ms
> With 27 filters, Time taken: 49585 ms
> With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds
> With 29 filters, Time taken: 198368 ms   // over 3 mins
> With 30 filters, Time taken: 393744 ms   // over 6 mins
> {code}






[jira] [Commented] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603912#comment-16603912
 ] 

Apache Spark commented on SPARK-25091:
--

User 'cfangplus' has created a pull request for this issue:
https://github.com/apache/spark/pull/22335

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
> Attachments: 0.png, 1.png, 2.png, 3.png, 4.png
>
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows that the executors continue to hold the RDD and the memory 
> is not cleared. This results in a huge waste of executor memory. As we continue 
> to call CACHE TABLE, we run into issues where the cached tables are spilled to 
> disk instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().
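
A small sketch of the same reproduction from the Scala/DataFrame side (the table name is taken from the report; whether the memory is actually released afterwards is exactly what the report disputes):

{code:scala}
// Sketch: SQL-level and DataFrame-level cache/uncache, mirroring the steps above.
spark.sql("CACHE TABLE test.test_cache")
spark.sql("UNCACHE TABLE test.test_cache")
// expectation: executor storage memory in the UI drops back down here

spark.sql("CACHE TABLE test.test_cache")
spark.sql("CLEAR CACHE")

// DataFrame variant mentioned at the end of the report:
val df = spark.table("test.test_cache").cache()
df.count()                      // materialize the cache
df.unpersist(blocking = true)   // reported to leave executor memory in use
{code}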






[jira] [Assigned] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25091:


Assignee: Apache Spark

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Assignee: Apache Spark
>Priority: Critical
> Attachments: 0.png, 1.png, 2.png, 3.png, 4.png
>
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows that the executors continue to hold the RDD and the memory 
> is not cleared. This results in a huge waste of executor memory. As we continue 
> to call CACHE TABLE, we run into issues where the cached tables are spilled to 
> disk instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().






[jira] [Assigned] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25091:


Assignee: (was: Apache Spark)

> UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
> -
>
> Key: SPARK-25091
> URL: https://issues.apache.org/jira/browse/SPARK-25091
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Yunling Cai
>Priority: Critical
> Attachments: 0.png, 1.png, 2.png, 3.png, 4.png
>
>
> UNCACHE TABLE and CLEAR CACHE do not clean up executor memory.
> In the Spark UI, although the Storage tab shows the cached table removed, the 
> Executors tab shows that the executors continue to hold the RDD and the memory 
> is not cleared. This results in a huge waste of executor memory. As we continue 
> to call CACHE TABLE, we run into issues where the cached tables are spilled to 
> disk instead of reclaiming the memory storage. 
> Steps to reproduce:
> CACHE TABLE test.test_cache;
> UNCACHE TABLE test.test_cache;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> CACHE TABLE test.test_cache;
> CLEAR CACHE;
> == Storage shows table is not cached; Executor shows the executor storage 
> memory does not change == 
> Similar behavior when using pyspark df.unpersist().






[jira] [Commented] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasou

2018-09-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603865#comment-16603865
 ] 

Dongjoon Hyun commented on SPARK-25337:
---

[~srowen]. I reproduced this locally. The failure occurs while executing old 
Spark inside `beforeAll`, so the suite is marked as `aborted`. The root cause of 
the failure is a corrupted class path for some reason. I'm still investigating.

> HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)
> 
>
> Key: SPARK-25337
> URL: https://issues.apache.org/jira/browse/SPARK-25337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> Observed in the Scala 2.12 pull request builder consistently now. I don't see 
> this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
> hard to see how.
> CC [~sadhen]
> {code:java}
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
> Exception encountered when invoking run on a nested suite - spark-submit 
> returned with exit code 1.
> Command line: './bin/spark-submit' '--name' 'prepare testing tables' 
> '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
> 'spark.master.rest.enabled=false' '--conf' 
> 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
>  '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
> '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
>  
> '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py'
> ...
> 2018-09-04 20:00:04.949 - stdout>   File 
> "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py",
>  line 545, in sql
> 2018-09-04 20:00:04.949 - stdout>   File 
> "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
>  line 1257, in __call__
> 2018-09-04 20:00:04.949 - stdout>   File 
> "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py",
>  line 63, in deco
> 2018-09-04 20:00:04.949 - stdout>   File 
> "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
>  line 328, in get_return_value
> 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error 
> occurred while calling o27.sql.
> 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated
> {code}






[jira] [Updated] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasourc

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25337:
--
Description: 
Observed in the Scala 2.12 pull request builder consistently now. I don't see 
this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
hard to see how.

CC [~sadhen]
{code:java}
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - spark-submit 
returned with exit code 1.
Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 
'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
'spark.master.rest.enabled=false' '--conf' 
'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
'-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 
'/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py'
...
2018-09-04 20:00:04.949 - stdout>   File 
"/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py",
 line 545, in sql
2018-09-04 20:00:04.949 - stdout>   File 
"/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py",
 line 1257, in __call__
2018-09-04 20:00:04.949 - stdout>   File 
"/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py",
 line 63, in deco
2018-09-04 20:00:04.949 - stdout>   File 
"/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py",
 line 328, in get_return_value
2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error occurred 
while calling o27.sql.
2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated
{code}

  was:
Observed in the Scala 2.12 pull request builder consistently now. I don't see 
this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
hard to see how.

CC [~sadhen]
{code:java}
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - spark-submit 
returned with exit code 1.
Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 
'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
'spark.master.rest.enabled=false' '--conf' 
'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
'-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 
'/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py'
...
2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be 
instantiated
...
2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
2018-09-04 07:48:30.834 - stdout> at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.(OrcFileFormat.scala:81)
...{code}


> HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)
> 
>
> Key: SPARK-25337
> URL: https://issues.apache.org/jira/browse/SPARK-25337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> Observed in the Scala 2.12 pull request builder consistently now. I don't see 
> this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
> hard to see how.
> CC [~sadhen]
> {code:java}
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
> Exception encountered when invoking run on a nested suite - spark-submit 
> returned with exit code 1.
> Command line: './bin/spark-submit' '--name' 

[jira] [Closed] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-09-04 Thread Fangshi Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li closed SPARK-24256.
--

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> Scala case classes, tuples and Java bean classes. Spark's Dataset natively 
> supports these types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the Avro 
> Generic or Specific Record, using such Avro typed Dataset has many 
> limitations:
>  # We can not use joinWith on this Dataset since the result is a tuple, but 
> Avro types cannot be the field of this tuple.
>  # We can not use some type-safe aggregation methods on this Dataset, such as 
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  # We cannot augment an Avro SpecificRecord with additional primitive fields 
> together in a case class, which we find is a very common use case.
> The limitation is that ExpressionEncoder does not support ser/de of a Scala case 
> class/tuple whose subfields are of any other user-defined type with its own 
> Encoder.
> To address this issue, we propose a trait as a contract (between 
> ExpressionEncoder and any other user-defined Encoder) to enable case 
> classes/tuples/Java beans to support user-defined types.
> With this proposed patch and our minor modification in AvroEncoder, we remove 
> above-mentioned limitations with cluster-default conf 
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally and it has been used for a few 
> quarters. We want to propose it upstream as we think this is a useful feature 
> for making Dataset more flexible for user types.






[jira] [Resolved] (SPARK-25300) Unified the configuration parameter `spark.shuffle.service.enabled`

2018-09-04 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25300.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22306
[https://github.com/apache/spark/pull/22306]

> Unified the configuration parameter `spark.shuffle.service.enabled`
> ---
>
> Key: SPARK-25300
> URL: https://issues.apache.org/jira/browse/SPARK-25300
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.4.0
>
>
> The configuration parameter "spark.shuffle.service.enabled" is defined in 
> `package.scala`, and it is also used in many places, so we can replace it 
> with `SHUFFLE_SERVICE_ENABLED`.
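
A hedged sketch of the before/after at a call site. The constant and {{ConfigBuilder}} live in Spark's internal config package (the `package.scala` the issue refers to), so this only compiles from code placed under the {{org.apache.spark}} namespace; it is shown purely to illustrate the refactoring, not as public API.

{code:scala}
package org.apache.spark.example

import org.apache.spark.SparkConf
import org.apache.spark.internal.config.SHUFFLE_SERVICE_ENABLED

object ShuffleServiceFlagSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
    // Before: the raw string key repeated at every call site.
    val before = conf.getBoolean("spark.shuffle.service.enabled", defaultValue = false)
    // After: the single typed constant defined once in core's config package.scala.
    val after = conf.get(SHUFFLE_SERVICE_ENABLED)
    println(s"before=$before, after=$after")
  }
}
{code}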






[jira] [Assigned] (SPARK-25300) Unified the configuration parameter `spark.shuffle.service.enabled`

2018-09-04 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25300:
---

Assignee: liuxian

> Unified the configuration parameter `spark.shuffle.service.enabled`
> ---
>
> Key: SPARK-25300
> URL: https://issues.apache.org/jira/browse/SPARK-25300
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.4.0
>
>
> The configuration parameter "spark.shuffle.service.enabled" is defined in 
> `package.scala`, and it is also used in many places, so we can replace it 
> with `SHUFFLE_SERVICE_ENABLED`.






[jira] [Updated] (SPARK-25306) Avoid skewed filter trees to speed up `createFilter` in ORC

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25306:
--
Description: 
In both ORC data sources, the createFilter function has exponential time complexity 
due to its skewed filter tree generation. This PR aims to improve it by using a 
new buildTree function.

*REPRODUCE*
{code}
// Create and read 1 row table with 1000 columns
sql("set spark.sql.orc.filterPushdown=true")
val selectExpr = (1 to 1000).map(i => s"id c$i")
spark.range(1).selectExpr(selectExpr: 
_*).write.mode("overwrite").orc("/tmp/orc")
print(s"With 0 filters, ")
spark.time(spark.read.orc("/tmp/orc").count)

// Increase the number of filters
(20 to 30).foreach { width =>
  val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
  print(s"With $width filters, ")
  spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
}
{code}

*RESULT*
{code}
With 0 filters, Time taken: 653 ms  
With 20 filters, Time taken: 962 ms
With 21 filters, Time taken: 1282 ms
With 22 filters, Time taken: 1982 ms
With 23 filters, Time taken: 3855 ms
With 24 filters, Time taken: 6719 ms
With 25 filters, Time taken: 12669 ms
With 26 filters, Time taken: 25032 ms
With 27 filters, Time taken: 49585 ms
With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds
With 29 filters, Time taken: 198368 ms   // over 3 mins
With 30 filters, Time taken: 393744 ms   // over 6 mins
{code}

  was:
In the ORC data source, the `createFilter` function has exponential time complexity due 
to lack of memoization, as shown below. This issue aims to improve it.

*REPRODUCE*
{code}
// Create and read 1 row table with 1000 columns
sql("set spark.sql.orc.filterPushdown=true")
val selectExpr = (1 to 1000).map(i => s"id c$i")
spark.range(1).selectExpr(selectExpr: 
_*).write.mode("overwrite").orc("/tmp/orc")
print(s"With 0 filters, ")
spark.time(spark.read.orc("/tmp/orc").count)

// Increase the number of filters
(20 to 30).foreach { width =>
  val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
  print(s"With $width filters, ")
  spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
}
{code}

*RESULT*
{code}
With 0 filters, Time taken: 653 ms  
With 20 filters, Time taken: 962 ms
With 21 filters, Time taken: 1282 ms
With 22 filters, Time taken: 1982 ms
With 23 filters, Time taken: 3855 ms
With 24 filters, Time taken: 6719 ms
With 25 filters, Time taken: 12669 ms
With 26 filters, Time taken: 25032 ms
With 27 filters, Time taken: 49585 ms
With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds
With 29 filters, Time taken: 198368 ms   // over 3 mins
With 30 filters, Time taken: 393744 ms   // over 6 mins
{code}


> Avoid skewed filter trees to speed up `createFilter` in ORC
> ---
>
> Key: SPARK-25306
> URL: https://issues.apache.org/jira/browse/SPARK-25306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 2.4.0
>
>
> In both ORC data sources, createFilter function has exponential time 
> complexity due to its skewed filter tree generation. This PR aims to improve 
> it by using new buildTree function.
> *REPRODUCE*
> {code}
> // Create and read 1 row table with 1000 columns
> sql("set spark.sql.orc.filterPushdown=true")
> val selectExpr = (1 to 1000).map(i => s"id c$i")
> spark.range(1).selectExpr(selectExpr: 
> _*).write.mode("overwrite").orc("/tmp/orc")
> print(s"With 0 filters, ")
> spark.time(spark.read.orc("/tmp/orc").count)
> // Increase the number of filters
> (20 to 30).foreach { width =>
>   val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
>   print(s"With $width filters, ")
>   spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
> }
> {code}
> *RESULT*
> {code}
> With 0 filters, Time taken: 653 ms
>   
> With 20 filters, Time taken: 962 ms
> With 21 filters, Time taken: 1282 ms
> With 22 filters, Time taken: 1982 ms
> With 23 filters, Time taken: 3855 ms
> With 24 filters, Time taken: 6719 ms
> With 25 filters, Time taken: 12669 ms
> With 26 filters, Time taken: 25032 ms
> With 27 filters, Time taken: 49585 ms
> With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds
> With 29 filters, Time taken: 198368 ms   // over 3 mins
> With 30 filters, Time taken: 393744 ms   // over 6 mins
> {code}




[jira] [Updated] (SPARK-25306) Avoid skewed filter trees to speed up `createFilter` in ORC

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25306:
--
Summary: Avoid skewed filter trees to speed up `createFilter` in ORC  (was: 
Use cache to speed up `createFilter` in ORC)

> Avoid skewed filter trees to speed up `createFilter` in ORC
> ---
>
> Key: SPARK-25306
> URL: https://issues.apache.org/jira/browse/SPARK-25306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 2.4.0
>
>
> In ORC data source, `createFilter` function has exponential time complexity 
> due to lack of memoization like the following. This issue aims to improve it.
> *REPRODUCE*
> {code}
> // Create and read 1 row table with 1000 columns
> sql("set spark.sql.orc.filterPushdown=true")
> val selectExpr = (1 to 1000).map(i => s"id c$i")
> spark.range(1).selectExpr(selectExpr: 
> _*).write.mode("overwrite").orc("/tmp/orc")
> print(s"With 0 filters, ")
> spark.time(spark.read.orc("/tmp/orc").count)
> // Increase the number of filters
> (20 to 30).foreach { width =>
>   val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
>   print(s"With $width filters, ")
>   spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
> }
> {code}
> *RESULT*
> {code}
> With 0 filters, Time taken: 653 ms
>   
> With 20 filters, Time taken: 962 ms
> With 21 filters, Time taken: 1282 ms
> With 22 filters, Time taken: 1982 ms
> With 23 filters, Time taken: 3855 ms
> With 24 filters, Time taken: 6719 ms
> With 25 filters, Time taken: 12669 ms
> With 26 filters, Time taken: 25032 ms
> With 27 filters, Time taken: 49585 ms
> With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds
> With 29 filters, Time taken: 198368 ms   // over 3 mins
> With 30 filters, Time taken: 393744 ms   // over 6 mins
> {code}






[jira] [Resolved] (SPARK-25306) Use cache to speed up `createFilter` in ORC

2018-09-04 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-25306.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22313
[https://github.com/apache/spark/pull/22313]

> Use cache to speed up `createFilter` in ORC
> ---
>
> Key: SPARK-25306
> URL: https://issues.apache.org/jira/browse/SPARK-25306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 2.4.0
>
>
> In ORC data source, `createFilter` function has exponential time complexity 
> due to lack of memoization like the following. This issue aims to improve it.
> *REPRODUCE*
> {code}
> // Create and read 1 row table with 1000 columns
> sql("set spark.sql.orc.filterPushdown=true")
> val selectExpr = (1 to 1000).map(i => s"id c$i")
> spark.range(1).selectExpr(selectExpr: 
> _*).write.mode("overwrite").orc("/tmp/orc")
> print(s"With 0 filters, ")
> spark.time(spark.read.orc("/tmp/orc").count)
> // Increase the number of filters
> (20 to 30).foreach { width =>
>   val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
>   print(s"With $width filters, ")
>   spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
> }
> {code}
> *RESULT*
> {code}
> With 0 filters, Time taken: 653 ms
>   
> With 20 filters, Time taken: 962 ms
> With 21 filters, Time taken: 1282 ms
> With 22 filters, Time taken: 1982 ms
> With 23 filters, Time taken: 3855 ms
> With 24 filters, Time taken: 6719 ms
> With 25 filters, Time taken: 12669 ms
> With 26 filters, Time taken: 25032 ms
> With 27 filters, Time taken: 49585 ms
> With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds
> With 29 filters, Time taken: 198368 ms   // over 3 mins
> With 30 filters, Time taken: 393744 ms   // over 6 mins
> {code}






[jira] [Assigned] (SPARK-25306) Use cache to speed up `createFilter` in ORC

2018-09-04 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-25306:
---

Assignee: Dongjoon Hyun

> Use cache to speed up `createFilter` in ORC
> ---
>
> Key: SPARK-25306
> URL: https://issues.apache.org/jira/browse/SPARK-25306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Critical
> Fix For: 2.4.0
>
>
> In ORC data source, `createFilter` function has exponential time complexity 
> due to lack of memoization like the following. This issue aims to improve it.
> *REPRODUCE*
> {code}
> // Create and read 1 row table with 1000 columns
> sql("set spark.sql.orc.filterPushdown=true")
> val selectExpr = (1 to 1000).map(i => s"id c$i")
> spark.range(1).selectExpr(selectExpr: 
> _*).write.mode("overwrite").orc("/tmp/orc")
> print(s"With 0 filters, ")
> spark.time(spark.read.orc("/tmp/orc").count)
> // Increase the number of filters
> (20 to 30).foreach { width =>
>   val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ")
>   print(s"With $width filters, ")
>   spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count)
> }
> {code}
> *RESULT*
> {code}
> With 0 filters, Time taken: 653 ms
>   
> With 20 filters, Time taken: 962 ms
> With 21 filters, Time taken: 1282 ms
> With 22 filters, Time taken: 1982 ms
> With 23 filters, Time taken: 3855 ms
> With 24 filters, Time taken: 6719 ms
> With 25 filters, Time taken: 12669 ms
> With 26 filters, Time taken: 25032 ms
> With 27 filters, Time taken: 49585 ms
> With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds
> With 29 filters, Time taken: 198368 ms   // over 3 mins
> With 30 filters, Time taken: 393744 ms   // over 6 mins
> {code}






[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data

2018-09-04 Thread Matt Cheah (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603807#comment-16603807
 ] 

Matt Cheah commented on SPARK-25299:


(Changed the title to "remote storage" for a little more generalization)

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.






[jira] [Updated] (SPARK-25299) Use remote storage for persisting shuffle data

2018-09-04 Thread Matt Cheah (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matt Cheah updated SPARK-25299:
---
Summary: Use remote storage for persisting shuffle data  (was: Use 
distributed storage for persisting shuffle data)

> Use remote storage for persisting shuffle data
> --
>
> Key: SPARK-25299
> URL: https://issues.apache.org/jira/browse/SPARK-25299
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle
>Affects Versions: 2.4.0
>Reporter: Matt Cheah
>Priority: Major
>
> In Spark, the shuffle primitive requires Spark executors to persist data to 
> the local disk of the worker nodes. If executors crash, the external shuffle 
> service can continue to serve the shuffle data that was written beyond the 
> lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the 
> external shuffle service is deployed on every worker node. The shuffle 
> service shares local disk with the executors that run on its node.
> There are some shortcomings with the way shuffle is fundamentally implemented 
> right now. Particularly:
>  * If any external shuffle service process or node becomes unavailable, all 
> applications that had an executor that ran on that node must recompute the 
> shuffle blocks that were lost.
>  * Similarly to the above, the external shuffle service must be kept running 
> at all times, which may waste resources when no applications are using that 
> shuffle service node.
>  * Mounting local storage can prevent users from taking advantage of 
> desirable isolation benefits from using containerized environments, like 
> Kubernetes. We had an external shuffle service implementation in an early 
> prototype of the Kubernetes backend, but it was rejected due to its strict 
> requirement to be able to mount hostPath volumes or other persistent volume 
> setups.
> In the following [architecture discussion 
> document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40]
>  (note: _not_ an SPIP), we brainstorm various high level architectures for 
> improving the external shuffle service in a way that addresses the above 
> problems. The purpose of this umbrella JIRA is to promote additional 
> discussion on how we can approach these problems, both at the architecture 
> level and the implementation level. We anticipate filing sub-issues that 
> break down the tasks that must be completed to achieve this goal.






[jira] [Comment Edited] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2018-09-04 Thread Aaron Hiniker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603803#comment-16603803
 ] 

Aaron Hiniker edited comment on SPARK-19145 at 9/5/18 1:20 AM:
---

I found another (potentially huge) performance impact where the filter won't 
get pushed down to the reader/scan when there is a `cast` expression involved.  
I commented on the PR with more details here: 
[https://github.com/apache/spark/pull/17174#issuecomment-418566743|https://github.com/apache/spark/pull/17174#issuecomment-418566743]


was (Author: hindog):
I found another (potentially huge) performance impact where the filter won't 
get pushed down to the reader/scan when there is a `cast` expression involved.  
I commented on the PR with more details here: 
[https://github.com/apache/spark/pull/17174#issuecomment-418566743|https://github.com/apache/spark/pull/17174#issuecomment-418566743,]

> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>Priority: Major
>
> I have a time series table with a timestamp column. 
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to 
> String before applying the filter. 
> However, in the second query no such casting is performed and the filter uses 
> a long value. 
> Below are the generated physical plans: the plan for the slower execution 
> followed by the plan for the faster execution. 
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
> 148340483100)) && (time#3262 <= 148400963100))
>+- *FileScan parquet default.cstat[time#3262] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time), 
> GreaterThanOrEqual(time,2017-01-02 19:53:51.0), 
> LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
> In Impala both queries run efficiently without any performance difference.
> Spark should be able to parse the date string and convert it to a Long/Timestamp 
> during generation of the optimized logical plan so that both queries would have 
> similar performance.






[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly

2018-09-04 Thread Aaron Hiniker (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603803#comment-16603803
 ] 

Aaron Hiniker commented on SPARK-19145:
---

I found another (potentially huge) performance impact where the filter won't 
get pushed down to the reader/scan when there is a `cast` expression involved.  
I commented on the PR with more details here: 
[https://github.com/apache/spark/pull/17174#issuecomment-418566743|https://github.com/apache/spark/pull/17174#issuecomment-418566743,]

> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: gagan taneja
>Priority: Major
>
> I have a time series table with a timestamp column. 
> The following query
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than 
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to 
> String before applying the filter. 
> However, in the second query no such casting is performed and the filter uses 
> a long value. 
> Below are the generated physical plans: the plan for the slower execution 
> followed by the plan for the faster execution. 
> SELECT COUNT(*) AS `count`
>FROM `default`.`table`
>WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3339L])
>  +- *Project
> +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) 
> >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 
> 19:53:51))
>+- *FileScan parquet default.cstat[time#3314] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: 
> struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD 
> HH24:MI:SS−0800')
>   AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD 
> HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>+- Exchange SinglePartition
>   +- *HashAggregate(keys=[], functions=[partial_count(1)], 
> output=[count#3287L])
>  +- *Project
> +- *Filter ((isnotnull(time#3262) && (time#3262 >= 
> 148340483100)) && (time#3262 <= 148400963100))
>+- *FileScan parquet default.cstat[time#3262] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], 
> PartitionFilters: [], PushedFilters: [IsNotNull(time), 
> GreaterThanOrEqual(time,2017-01-02 19:53:51.0), 
> LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
> In Impala both queries run efficiently without any performance difference.
> Spark should be able to parse the date string and convert it to a Long/Timestamp 
> during generation of the optimized logical plan so that both queries would have 
> similar performance.






[jira] [Resolved] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-09-04 Thread Fangshi Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fangshi Li resolved SPARK-24256.

Resolution: Won't Fix

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> Scala case classes, tuples and Java bean classes. Spark's Dataset natively 
> supports these types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the Avro 
> Generic or Specific Record, using such Avro typed Dataset has many 
> limitations:
>  # We can not use joinWith on this Dataset since the result is a tuple, but 
> Avro types cannot be the field of this tuple.
>  # We can not use some type-safe aggregation methods on this Dataset, such as 
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  # We cannot augment an Avro SpecificRecord with additional primitive fields 
> together in a case class, which we find is a very common use case.
> The limitation is that ExpressionEncoder does not support serde of Scala case 
> class/tuple with subfields being any other user-defined type with its own 
> Encoder for serde.
> To address this issue, we propose a trait as a contract (between 
> ExpressionEncoder and any other user-defined Encoder) to enable case 
> classes/tuples/java beans to support user-defined types.
> With this proposed patch and our minor modification in AvroEncoder, we remove 
> above-mentioned limitations with cluster-default conf 
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally and have been using for a few 
> quarters. We want to propose it upstream as we think this is a useful feature 
> that makes Dataset more flexible for user types.
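For context, a minimal sketch of the limitation described above and the existing Kryo-based escape hatch. The `Opaque` and `Enriched` types are hypothetical stand-ins for an Avro SpecificRecord and an enriched case class; this is not the API proposed in the (unmerged) patch.
{code:scala}
import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical stand-in for a user-defined type (e.g. an Avro SpecificRecord)
// that ExpressionEncoder cannot derive an encoder for.
class Opaque(val bytes: Array[Byte]) extends Serializable

// A case class mixing a primitive field with the user-defined type, which is
// exactly the combination the ticket wants to support.
case class Enriched(id: Long, payload: Opaque)

object EncoderLimitationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("encoder-limitation-sketch").getOrCreate()

    // No ExpressionEncoder can be derived for Enriched because of the Opaque
    // field; the existing escape hatch is a Kryo encoder for the whole object,
    // which stores it as a single binary column and loses the id column.
    implicit val enrichedEncoder = Encoders.kryo[Enriched]

    val ds = spark.createDataset(Seq(Enriched(1L, new Opaque(Array[Byte](1, 2, 3)))))
    ds.printSchema() // a single "value: binary" column rather than (id, payload)
    spark.stop()
  }
}
{code}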



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple

2018-09-04 Thread Fangshi Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603800#comment-16603800
 ] 

Fangshi Li commented on SPARK-24256:


To summarize our discussion for this PR:
Spark-avro is now merged into Spark as a built-in data source. The upstream 
community is not merging the AvroEncoder to support Avro types in Dataset; 
instead, the plan is to expose the user-defined type API to support defining 
arbitrary user types in Dataset.

The purpose of this patch is to enable ExpressionEncoder to work together with 
other types of Encoders, while it seems upstream prefers to go with UDT. 
Given this, we can close this PR and start the discussion on UDT in 
another channel.

> ExpressionEncoder should support user-defined types as fields of Scala case 
> class and tuple
> ---
>
> Key: SPARK-24256
> URL: https://issues.apache.org/jira/browse/SPARK-24256
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Fangshi Li
>Priority: Major
>
> Right now, ExpressionEncoder supports ser/de of primitive types, as well as 
> scala case class, tuple and java bean class. Spark's Dataset natively 
> supports these mentioned types, but we find Dataset is not flexible for other 
> user-defined types and encoders.
> For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. 
> Although we can use AvroEncoder to define Dataset with types being the Avro 
> Generic or Specific Record, using such Avro typed Dataset has many 
> limitations:
>  # We cannot use joinWith on this Dataset since the result is a tuple, but 
> Avro types cannot be a field of this tuple.
>  # We cannot use some type-safe aggregation methods on this Dataset, such as 
> KeyValueGroupedDataset's reduceGroups, since the result is also a tuple.
>  # We cannot augment an Avro SpecificRecord with additional primitive fields 
> together in a case class, which we find is a very common use case.
> The limitation is that ExpressionEncoder does not support serde of Scala case 
> class/tuple with subfields being any other user-defined type with its own 
> Encoder for serde.
> To address this issue, we propose a trait as a contract (between 
> ExpressionEncoder and any other user-defined Encoder) to enable case 
> classes/tuples/java beans to support user-defined types.
> With this proposed patch and our minor modification in AvroEncoder, we remove 
> above-mentioned limitations with cluster-default conf 
> spark.expressionencoder.org.apache.avro.specific.SpecificRecord = 
> com.databricks.spark.avro.AvroEncoder$
> This is a patch we have implemented internally and have been using for a few 
> quarters. We want to propose it upstream as we think this is a useful feature 
> that makes Dataset more flexible for user types.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-09-04 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-25332:
-
Issue Type: Improvement  (was: Bug)

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now restart spark-shell, re-run spark-submit, or restart the JDBC server, and 
> run the same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> ++--+---+
> |col_name |data_type |comment|
> ++--+---+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> ++--+---+
>  
> With a datasource table it works fine (create table using parquet instead of 
> stored as parquet)
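One possible explanation (an assumption, not confirmed by the report) is that the Hive-provider tables have no persisted statistics, so a fresh session falls back to a large default size estimate and chooses a sort merge join, while the session that just wrote the data still saw it as small. A minimal sketch of two things that usually restore the broadcast plan under that assumption, using the x1/x2 tables from the repro:
{code:scala}
import org.apache.spark.sql.SparkSession

object BroadcastAfterRestartSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-after-restart-sketch")
      .enableHiveSupport()
      .getOrCreate()

    // Persist table-level statistics so a fresh session still knows x1/x2 are
    // tiny and can choose a broadcast hash join.
    spark.sql("ANALYZE TABLE x1 COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE x2 COMPUTE STATISTICS")

    // Alternatively, let Spark estimate sizes from the file system for
    // Hive-serde tables that have no catalog statistics.
    spark.conf.set("spark.sql.statistics.fallBackToHdfs", "true")

    spark.sql("SELECT * FROM x1 t1, x2 t2 WHERE t1.name = t2.name").explain()
    spark.stop()
  }
}
{code}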



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-09-04 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603761#comment-16603761
 ] 

Takeshi Yamamuro commented on SPARK-25332:
--

Probably, you need to describe this case in more detail. Also, I think it would be 
better to ask on the spark-user mailing list first.

> Instead of broadcast hash join  ,Sort merge join has selected when restart 
> spark-shell/spark-JDBC for hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Babulal
>Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet ")
>  spark.sql("insert into x1 select 'a',29")
>  spark.sql("create table x2 (name string,age int) stored as parquet '")
>  spark.sql("insert into x2_ex select 'a',29")
>  scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
> BuildRight
> :- *(2) Project [name#101, age#102]
> : +- *(2) Filter isnotnull(name#101)
> : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>  +- *(1) Project [name#103, age#104]
>  +- *(1) Filter isnotnull(name#103)
>  +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
> Format: Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> Now restart spark-shell, re-run spark-submit, or restart the JDBC server, and 
> run the same select query again
>  
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> : +- Exchange hashpartitioning(name#43, 200)
> : +- *(1) Project [name#43, age#44]
> : +- *(1) Filter isnotnull(name#43)
> : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>  +- Exchange hashpartitioning(name#45, 200)
>  +- *(3) Project [name#45, age#46]
>  +- *(3) Filter isnotnull(name#45)
>  +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
> Parquet, Location: 
> InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
> PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
> struct
>  
>  
> scala> spark.sql("desc formatted x1").show(200,false)
> ++--+---+
> |col_name |data_type |comment|
> ++--+---+
> |name |string |null |
> |age |int |null |
> | | | |
> |# Detailed Table Information| | |
> |Database |default | |
> |Table |x1 | |
> |Owner |Administrator | |
> |Created Time |Sun Aug 19 12:36:58 IST 2018 | |
> |Last Access |Thu Jan 01 05:30:00 IST 1970 | |
> |Created By |Spark 2.3.0 | |
> |Type |MANAGED | |
> |Provider |hive | |
> |Table Properties |[transient_lastDdlTime=1534662418] | |
> |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
> |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | 
> |
> |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | 
> |
> |OutputFormat 
> |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
> |Storage Properties |[serialization.format=1] | |
> |Partition Provider |Catalog | |
> ++--+---+
>  
> With a datasource table it works fine (create table using parquet instead of 
> stored as parquet)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603712#comment-16603712
 ] 

Apache Spark commented on SPARK-25258:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22179

> Upgrade kryo package to version 4.0.2
> -
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The 
> issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions 
> might encounter this issue.
> Issue description:
> In the shuffle write phase or some spilling operations, Spark will use the Kryo 
> serializer to serialize data if `spark.serializer` is set to 
> `KryoSerializer`. However, when the data contains some extremely large records, 
> KryoSerializer's MapReferenceResolver is expanded, and its `reset` 
> method will take a long time to reset all items in the writtenObjects table to 
> null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>  readObjects.clear();
>  writtenObjects.clear();
> }
> public void clear () {
>  K[] keyTable = this.keyTable;
>  for (int i = capacity + stashSize; i-- > 0;)
>   keyTable[i] = null;
>  size = 0;
>  stashSize = 0;
> }
> {code}
> I checked the kryo project in github, and this issue seems fixed in 4.0.2+
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
>  
> I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix 
> this problem.
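Until the dependency is upgraded, a minimal sketch of one possible mitigation (an assumption, not a verified fix for every workload): disabling Kryo reference tracking avoids the MapReferenceResolver reset path shown above entirely, at the cost of no longer supporting circular or shared references in serialized objects.
{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object KryoReferenceTrackingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-reference-tracking-sketch")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // With reference tracking disabled, Kryo does not go through
      // MapReferenceResolver at all, so the costly reset() is avoided.
      // Only safe when serialized objects contain no circular references.
      .set("spark.kryo.referenceTracking", "false")

    val spark = SparkSession.builder().config(conf).getOrCreate()

    // Toy shuffle just to exercise the serializer on the write path.
    val groups = spark.sparkContext
      .parallelize(1 to 1000000)
      .map(i => (i % 100, i.toLong))
      .reduceByKey(_ + _)
      .count()
    println(s"groups: $groups")
    spark.stop()
  }
}
{code}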



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25258) Upgrade kryo package to version 4.0.2

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25258:


Assignee: Apache Spark

> Upgrade kryo package to version 4.0.2
> -
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Assignee: Apache Spark
>Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The 
> issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions 
> might encounter this issue.
> Issue description:
> In the shuffle write phase or some spilling operations, Spark will use the Kryo 
> serializer to serialize data if `spark.serializer` is set to 
> `KryoSerializer`. However, when the data contains some extremely large records, 
> KryoSerializer's MapReferenceResolver is expanded, and its `reset` 
> method will take a long time to reset all items in the writtenObjects table to 
> null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>  readObjects.clear();
>  writtenObjects.clear();
> }
> public void clear () {
>  K[] keyTable = this.keyTable;
>  for (int i = capacity + stashSize; i-- > 0;)
>   keyTable[i] = null;
>  size = 0;
>  stashSize = 0;
> }
> {code}
> I checked the kryo project in github, and this issue seems fixed in 4.0.2+
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
>  
> I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix 
> this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603710#comment-16603710
 ] 

Apache Spark commented on SPARK-25258:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22179

> Upgrade kryo package to version 4.0.2
> -
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The 
> issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions 
> might encounter this issue.
> Issue description:
> In the shuffle write phase or some spilling operations, Spark will use the Kryo 
> serializer to serialize data if `spark.serializer` is set to 
> `KryoSerializer`. However, when the data contains some extremely large records, 
> KryoSerializer's MapReferenceResolver is expanded, and its `reset` 
> method will take a long time to reset all items in the writtenObjects table to 
> null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>  readObjects.clear();
>  writtenObjects.clear();
> }
> public void clear () {
>  K[] keyTable = this.keyTable;
>  for (int i = capacity + stashSize; i-- > 0;)
>   keyTable[i] = null;
>  size = 0;
>  stashSize = 0;
> }
> {code}
> I checked the kryo project in github, and this issue seems fixed in 4.0.2+
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
>  
> I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix 
> this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25258) Upgrade kryo package to version 4.0.2

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25258:


Assignee: (was: Apache Spark)

> Upgrade kryo package to version 4.0.2
> -
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The 
> issue affects all Kryo versions below 4.0.2, so it seems that all Spark versions 
> might encounter this issue.
> Issue description:
> In the shuffle write phase or some spilling operations, Spark will use the Kryo 
> serializer to serialize data if `spark.serializer` is set to 
> `KryoSerializer`. However, when the data contains some extremely large records, 
> KryoSerializer's MapReferenceResolver is expanded, and its `reset` 
> method will take a long time to reset all items in the writtenObjects table to 
> null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>  readObjects.clear();
>  writtenObjects.clear();
> }
> public void clear () {
>  K[] keyTable = this.keyTable;
>  for (int i = capacity + stashSize; i-- > 0;)
>   keyTable[i] = null;
>  size = 0;
>  stashSize = 0;
> }
> {code}
> I checked the kryo project in github, and this issue seems fixed in 4.0.2+
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
>  
> I was wondering if we can upgrade Spark's Kryo dependency to 4.0.2+ to fix 
> this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasourc

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25337:
--
Description: 
Observed in the Scala 2.12 pull request builder consistently now. I don't see 
this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
hard to see how.

CC [~sadhen]
{code:java}
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - spark-submit 
returned with exit code 1.
Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 
'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
'spark.master.rest.enabled=false' '--conf' 
'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
'-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 
'/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py'
...
2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be 
instantiated
...
2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
2018-09-04 07:48:30.834 - stdout> at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.(OrcFileFormat.scala:81)
...{code}

  was:
Observed in the Scala 2.12 pull request builder consistently now. I don't see 
this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
hard to see how.

CC [~sadhen] and [~dongjoon]
{code:java}
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - spark-submit 
returned with exit code 1.
Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 
'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
'spark.master.rest.enabled=false' '--conf' 
'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
'-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 
'/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py'
...
2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be 
instantiated
...
2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
2018-09-04 07:48:30.834 - stdout> at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.(OrcFileFormat.scala:81)
...{code}


> HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)
> 
>
> Key: SPARK-25337
> URL: https://issues.apache.org/jira/browse/SPARK-25337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> Observed in the Scala 2.12 pull request builder consistently now. I don't see 
> this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
> hard to see how.
> CC [~sadhen]
> {code:java}
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
> Exception encountered when invoking run on a nested suite - spark-submit 
> returned with exit code 1.
> Command line: './bin/spark-submit' '--name' 'prepare testing tables' 
> '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
> 'spark.master.rest.enabled=false' '--conf' 
> 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
>  '--conf' 'spark.sql.test.version.index=0' 

[jira] [Commented] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasou

2018-09-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603663#comment-16603663
 ] 

Dongjoon Hyun commented on SPARK-25337:
---

I'll take a look, [~srowen].

> HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)
> 
>
> Key: SPARK-25337
> URL: https://issues.apache.org/jira/browse/SPARK-25337
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Sean Owen
>Priority: Major
>
> Observed in the Scala 2.12 pull request builder consistently now. I don't see 
> this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
> hard to see how.
> CC [~sadhen]
> {code:java}
> org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
> Exception encountered when invoking run on a nested suite - spark-submit 
> returned with exit code 1.
> Command line: './bin/spark-submit' '--name' 'prepare testing tables' 
> '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
> 'spark.master.rest.enabled=false' '--conf' 
> 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
>  '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
> '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
>  
> '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py'
> ...
> 2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: 
> org.apache.spark.sql.sources.DataSourceRegister: Provider 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be 
> instantiated
> ...
> 2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
> 2018-09-04 07:48:30.834 - stdout> at 
> org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.(OrcFileFormat.scala:81)
> ...{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25297) Future for Scala 2.12 will block on a already shutdown ExecutionContext

2018-09-04 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-25297.
---
Resolution: Duplicate

> Future for Scala 2.12 will block on a already shutdown ExecutionContext
> ---
>
> Key: SPARK-25297
> URL: https://issues.apache.org/jira/browse/SPARK-25297
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Darcy Shen
>Priority: Major
>
> *+see 
> [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/193/]+*
> *The Units Test blocks on FileBasedWriteAheadLogWithFileCloseAfterWriteSuite 
> in Console Output.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603652#comment-16603652
 ] 

Apache Spark commented on SPARK-24748:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22334

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the Structured Streaming sources and sinks do not have a way to 
> report custom metrics. Providing an option to report custom metrics and 
> making them available via Streaming Query progress can enable sources and sinks 
> to report custom progress information (e.g. the lag metric for the Kafka source).
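For context, a minimal sketch of where such metrics would surface: StreamingQueryProgress already exposes per-source offsets and rates, and the proposal (since reverted, see SPARK-25336 in this digest) was to let sources attach additional custom metrics alongside them. The rate source and console sink below are only placeholders for a real pipeline.
{code:scala}
import org.apache.spark.sql.SparkSession

object ProgressInspectionSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("progress-inspection-sketch")
      .master("local[2]")
      .getOrCreate()

    // Placeholder source/sink; a real job would read from Kafka etc.
    val query = spark.readStream
      .format("rate")
      .load()
      .writeStream
      .format("console")
      .start()

    query.awaitTermination(10000) // let a few micro-batches run

    // StreamingQueryProgress already carries per-source offsets and rates;
    // the (reverted) proposal was to let sources attach extra custom metrics
    // alongside these fields.
    Option(query.lastProgress).foreach { progress =>
      progress.sources.foreach { s =>
        println(s"${s.description}: startOffset=${s.startOffset} endOffset=${s.endOffset}")
      }
    }

    query.stop()
    spark.stop()
  }
}
{code}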



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603651#comment-16603651
 ] 

Apache Spark commented on SPARK-24748:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22334

> Support for reporting custom metrics via Streaming Query Progress
> -
>
> Key: SPARK-24748
> URL: https://issues.apache.org/jira/browse/SPARK-24748
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently the Structured Streaming sources and sinks do not have a way to 
> report custom metrics. Providing an option to report custom metrics and 
> making them available via Streaming Query progress can enable sources and sinks 
> to report custom progress information (e.g. the lag metric for the Kafka source).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24863) Report offset lag as a custom metrics for Kafka structured streaming source

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603649#comment-16603649
 ] 

Apache Spark commented on SPARK-24863:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22334

> Report offset lag as a custom metrics for Kafka structured streaming source
> ---
>
> Key: SPARK-24863
> URL: https://issues.apache.org/jira/browse/SPARK-24863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> We can build on top of SPARK-24748 to report offset lag as a custom metric 
> for the Kafka structured streaming source.
> This is the difference between the latest offsets in Kafka at the time the 
> metric is reported (just after a micro-batch completes) and the latest 
> offset Spark has processed. It can be 0 (or close to 0) if Spark keeps up 
> with the rate at which messages are ingested into Kafka topics in steady 
> state.
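A minimal sketch of the lag computation as defined above; this is a pure illustration of the metric's definition, not the actual Kafka source implementation, and the topic/partition numbers are made up.
{code:scala}
object OffsetLagSketch {
  // Offsets keyed by (topic, partition): the latest offsets available in Kafka
  // versus the offsets the query has processed (e.g. taken from the endOffset
  // field of the Kafka SourceProgress).
  def lag(latestInKafka: Map[(String, Int), Long],
          processedBySpark: Map[(String, Int), Long]): Map[(String, Int), Long] =
    latestInKafka.map { case (tp, latest) =>
      tp -> math.max(0L, latest - processedBySpark.getOrElse(tp, 0L))
    }

  def main(args: Array[String]): Unit = {
    val latest    = Map(("events", 0) -> 1200L, ("events", 1) -> 800L)
    val processed = Map(("events", 0) -> 1200L, ("events", 1) -> 750L)
    // Partition 0 is fully caught up (lag 0); partition 1 is 50 records behind.
    println(lag(latest, processed))
  }
}
{code}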



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24863) Report offset lag as a custom metrics for Kafka structured streaming source

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603645#comment-16603645
 ] 

Apache Spark commented on SPARK-24863:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22334

> Report offset lag as a custom metrics for Kafka structured streaming source
> ---
>
> Key: SPARK-24863
> URL: https://issues.apache.org/jira/browse/SPARK-24863
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> We can build on top of SPARK-24748 to report offset lag as a custom metric 
> for the Kafka structured streaming source.
> This is the difference between the latest offsets in Kafka at the time the 
> metric is reported (just after a micro-batch completes) and the latest 
> offset Spark has processed. It can be 0 (or close to 0) if Spark keeps up 
> with the rate at which messages are ingested into Kafka topics in steady 
> state.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25336) Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603646#comment-16603646
 ] 

Apache Spark commented on SPARK-25336:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22334

> Revert SPARK-24863 and SPARK-24748
> --
>
> Key: SPARK-25336
> URL: https://issues.apache.org/jira/browse/SPARK-25336
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source 
> v2 APIs are out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25336) Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25336:
-
Summary: Revert SPARK-24863 and SPARK-24748  (was: Revert SPARK-24863 and 
SPARK 24748)

> Revert SPARK-24863 and SPARK-24748
> --
>
> Key: SPARK-25336
> URL: https://issues.apache.org/jira/browse/SPARK-25336
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Revert SPARK-24863 and SPARK 24748. We will revisit them when the data source 
> v2 APIs are out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25336) Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25336:


Assignee: (was: Apache Spark)

> Revert SPARK-24863 and SPARK-24748
> --
>
> Key: SPARK-25336
> URL: https://issues.apache.org/jira/browse/SPARK-25336
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source 
> v2 APIs are out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25336) Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603643#comment-16603643
 ] 

Apache Spark commented on SPARK-25336:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/22334

> Revert SPARK-24863 and SPARK-24748
> --
>
> Key: SPARK-25336
> URL: https://issues.apache.org/jira/browse/SPARK-25336
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source 
> v2 APIs are out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25336) Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-25336:
-
Description: Revert SPARK-24863 and SPARK-24748. We will revisit them when 
the data source v2 APIs are out.  (was: Revert SPARK-24863 and SPARK 24748. We 
will revisit them when the data source v2 APIs are out.)

> Revert SPARK-24863 and SPARK-24748
> --
>
> Key: SPARK-25336
> URL: https://issues.apache.org/jira/browse/SPARK-25336
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source 
> v2 APIs are out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25336) Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread Shixiong Zhu (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-25336:


Assignee: Shixiong Zhu

> Revert SPARK-24863 and SPARK-24748
> --
>
> Key: SPARK-25336
> URL: https://issues.apache.org/jira/browse/SPARK-25336
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Major
>
> Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source 
> v2 APIs are out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25336) Revert SPARK-24863 and SPARK-24748

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25336:


Assignee: Apache Spark

> Revert SPARK-24863 and SPARK-24748
> --
>
> Key: SPARK-25336
> URL: https://issues.apache.org/jira/browse/SPARK-25336
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source 
> v2 APIs are out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasourc

2018-09-04 Thread Sean Owen (JIRA)
Sean Owen created SPARK-25337:
-

 Summary: HiveExternalCatalogVersionsSuite + Scala 2.12 = 
NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)
 Key: SPARK-25337
 URL: https://issues.apache.org/jira/browse/SPARK-25337
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.4.0
Reporter: Sean Owen


Observed in the Scala 2.12 pull request builder consistently now. I don't see 
this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of 
hard to see how.

CC [~sadhen] and [~dongjoon]
{code:java}
org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED ***
Exception encountered when invoking run on a nested suite - spark-submit 
returned with exit code 1.
Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 
'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 
'spark.master.rest.enabled=false' '--conf' 
'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' 
'-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643'
 
'/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py'
...
2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: 
org.apache.spark.sql.sources.DataSourceRegister: Provider 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be 
instantiated
...
2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: 
org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V
2018-09-04 07:48:30.834 - stdout> at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.(OrcFileFormat.scala:81)
...{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25336) Revert SPARK-24863 and SPARK 24748

2018-09-04 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-25336:


 Summary: Revert SPARK-24863 and SPARK 24748
 Key: SPARK-25336
 URL: https://issues.apache.org/jira/browse/SPARK-25336
 Project: Spark
  Issue Type: Task
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Shixiong Zhu


Revert SPARK-24863 and SPARK 24748. We will revisit them when the data source 
v2 APIs are out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25335) Skip Zinc downloading if it's installed in the system

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25335:
--
Description: 
Zinc is 23.5MB.
{code}
$ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
{code}

Currently, Spark downloads Zinc once. However, it occurs too many times in 
build systems. This issue aims to skip Zinc downloading when the system already 
has it.
{code}
$ build/mvn clean
exec: curl --progress-bar -L 
https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
 100.0%
{code}

This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
Docker-based build system.

  was:
Zinc is 23.5MB.
{code}
$ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
{code}

Currently, Spark downloads Zinc once. However, it occurs too many times in 
build systems. This issue aims to skip Zinc downloading when the system already 
has it.
{code}
$ build/mvn clean
exec: curl --progress-bar -L 
https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
 100.0%

{code}

This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
Docker-based build system.


> Skip Zinc downloading if it's installed in the system
> -
>
> Key: SPARK-25335
> URL: https://issues.apache.org/jira/browse/SPARK-25335
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Zinc is 23.5MB.
> {code}
> $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>   % Total% Received % Xferd  Average Speed   TimeTime Time  
> Current
>  Dload  Upload   Total   SpentLeft  Speed
> 100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
> {code}
> Currently, Spark downloads Zinc once. However, it occurs too many times in 
> build systems. This issue aims to skip Zinc downloading when the system 
> already has it.
> {code}
> $ build/mvn clean
> exec: curl --progress-bar -L 
> https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>  
> 100.0%
> {code}
> This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
> Docker-based build system.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25335) Skip Zinc downloading if it's installed in the system

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25335:
--
Description: 
Zinc is 23.5MB.
{code}
$ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
{code}

Currently, Spark downloads Zinc once. However, it occurs too many times in 
build systems. This issue aims to skip Zinc downloading when the system already 
has it.
{code}
$ build/mvn clean
exec: curl --progress-bar -L 
https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
 100.0%

{code}

This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
Docker-based build system.

  was:
Zinc is 23.5MB.
{code}
$ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
{code}

Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
downloading when the system already has it.
{code}
$ build/mvn clean
exec: curl --progress-bar -L 
https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
 100.0%

{code}

This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
Docker-based build system.


> Skip Zinc downloading if it's installed in the system
> -
>
> Key: SPARK-25335
> URL: https://issues.apache.org/jira/browse/SPARK-25335
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Zinc is 23.5MB.
> {code}
> $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>   % Total% Received % Xferd  Average Speed   TimeTime Time  
> Current
>  Dload  Upload   Total   SpentLeft  Speed
> 100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
> {code}
> Currently, Spark downloads Zinc once. However, it occurs too many times in 
> build systems. This issue aims to skip Zinc downloading when the system 
> already has it.
> {code}
> $ build/mvn clean
> exec: curl --progress-bar -L 
> https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>  
> 100.0%
> 
> {code}
> This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
> Docker-based build system.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25335) Skip Zinc downloading if it's installed in the system

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25335:


Assignee: (was: Apache Spark)

> Skip Zinc downloading if it's installed in the system
> -
>
> Key: SPARK-25335
> URL: https://issues.apache.org/jira/browse/SPARK-25335
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Zinc is 23.5MB.
> {code}
> $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>   % Total% Received % Xferd  Average Speed   TimeTime Time  
> Current
>  Dload  Upload   Total   SpentLeft  Speed
> 100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
> {code}
> Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
> downloading when the system already has it.
> {code}
> $ build/mvn clean
> exec: curl --progress-bar -L 
> https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>  
> 100.0%
> 
> {code}
> This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
> Docker-based build system.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25335) Skip Zinc downloading if it's installed in the system

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25335:


Assignee: Apache Spark

> Skip Zinc downloading if it's installed in the system
> -
>
> Key: SPARK-25335
> URL: https://issues.apache.org/jira/browse/SPARK-25335
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Zinc is 23.5MB.
> {code}
> $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>   % Total% Received % Xferd  Average Speed   TimeTime Time  
> Current
>  Dload  Upload   Total   SpentLeft  Speed
> 100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
> {code}
> Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
> downloading when the system already has it.
> {code}
> $ build/mvn clean
> exec: curl --progress-bar -L 
> https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>  
> 100.0%
> 
> {code}
> This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
> Docker-based build system.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25335) Skip Zinc downloading if it's installed in the system

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603600#comment-16603600
 ] 

Apache Spark commented on SPARK-25335:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/22333

> Skip Zinc downloading if it's installed in the system
> -
>
> Key: SPARK-25335
> URL: https://issues.apache.org/jira/browse/SPARK-25335
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Zinc is 23.5MB.
> {code}
> $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>   % Total% Received % Xferd  Average Speed   TimeTime Time  
> Current
>  Dload  Upload   Total   SpentLeft  Speed
> 100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
> {code}
> Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
> downloading when the system already has it.
> {code}
> $ build/mvn clean
> exec: curl --progress-bar -L 
> https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>  
> 100.0%
> 
> {code}
> This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
> Docker-based build system.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25335) Skip Zinc downloading if it's installed in the system

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25335:
--
Description: 
Zinc is 23.5MB.
{code}
$ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
{code}

Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
downloading when the system already has it.
{code}
$ build/mvn clean
exec: curl --progress-bar -L 
https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
 100.0%

{code}

This will reduce many resources(CPU/Networks/DISK) at least in Mac and 
Docker-based build system.

  was:
Zinc is 23.5MB.
{code}
$ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
{code}

Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
downloading when the system already has it.
{code}
$ build/mvn clean
exec: curl --progress-bar -L 
https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
 100.0%
 {code}


> Skip Zinc downloading if it's installed in the system
> -
>
> Key: SPARK-25335
> URL: https://issues.apache.org/jira/browse/SPARK-25335
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Zinc is 23.5MB.
> {code}
> $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>   % Total% Received % Xferd  Average Speed   TimeTime Time  
> Current
>  Dload  Upload   Total   SpentLeft  Speed
> 100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
> {code}
> Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
> downloading when the system already has it.
> {code}
> $ build/mvn clean
> exec: curl --progress-bar -L 
> https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>  
> 100.0%
> 
> {code}
> This will save resources (CPU, network, disk), at least on Mac and in 
> Docker-based build systems.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25335) Skip Zinc downloading if it's installed in the system

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25335:
--
Priority: Minor  (was: Major)

> Skip Zinc downloading if it's installed in the system
> -
>
> Key: SPARK-25335
> URL: https://issues.apache.org/jira/browse/SPARK-25335
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Zinc is 23.5MB.
> {code}
> $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>   % Total% Received % Xferd  Average Speed   TimeTime Time  
> Current
>  Dload  Upload   Total   SpentLeft  Speed
> 100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
> {code}
> Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
> downloading when the system already has it.
> {code}
> $ build/mvn clean
> exec: curl --progress-bar -L 
> https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
>  
> 100.0%
>  
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25335) Skip Zinc downloading if it's installed in the system

2018-09-04 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-25335:
-

 Summary: Skip Zinc downloading if it's installed in the system
 Key: SPARK-25335
 URL: https://issues.apache.org/jira/browse/SPARK-25335
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 2.4.0
Reporter: Dongjoon Hyun


Zinc is 23.5MB.
{code}
$ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
  % Total% Received % Xferd  Average Speed   TimeTime Time  Current
 Dload  Upload   Total   SpentLeft  Speed
100 23.5M  100 23.5M0 0  35.4M  0 --:--:-- --:--:-- --:--:-- 35.3M
{code}

Currently, Spark downloads Zinc once, but this issue aims to skip Zinc 
downloading when the system already has it.
{code}
$ build/mvn clean
exec: curl --progress-bar -L 
https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz
 100.0%
 {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25333) Ability to add new columns in Dataset in a user-defined position

2018-09-04 Thread Walid Mellouli (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Walid Mellouli updated SPARK-25333:
---
Description: 
When we add new columns in a Dataset, they are added automatically at the end 
of the Dataset.

Consider this data frame:
{code:java}
val df = sc.parallelize(Seq(1, 2, 3)).toDF
df.printSchema


root
 |-- value: integer (nullable = true)
{code}
When we add a new column:
{code:java}
val newDf = df.withColumn("newColumn", col("value") + 1)
newDf.printSchema


root
 |-- value: integer (nullable = true)
 |-- newColumn: integer (nullable = true)
{code}
Generally users want to add new columns at the end, at the beginning, or at a 
user-defined position, depending on the use case.
 In my case, for example, we add technical columns at the beginning of a Dataset 
and business columns at the end.

  was:
When we add new columns in a Dataset, they are added automatically at the end 
of the Dataset.
{code:java}
val df = sc.parallelize(Seq(1, 2, 3)).toDF
df.printSchema


root
 |-- value: integer (nullable = true)
{code}

When we add a new column:

{code:java}
val newDf = df.withColumn("newColumn", col("value") + 1)
newDf.printSchema


root
 |-- value: integer (nullable = true)
 |-- newColumn: integer (nullable = true)
{code}

Generally users want to add new columns either at the end or in the beginning, 
depends on use cases.
 In my case for example, we add technical columns in the beginning of a Dataset 
and we add business columns at the end.

Summary: Ability to add new columns in Dataset in a user-defined 
position  (was: Ability to add new columns in the beginning of a Dataset)

> Ability to add new columns in Dataset in a user-defined position
> 
>
> Key: SPARK-25333
> URL: https://issues.apache.org/jira/browse/SPARK-25333
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Walid Mellouli
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When we add new columns in a Dataset, they are added automatically at the end 
> of the Dataset.
> Consider this data frame:
> {code:java}
> val df = sc.parallelize(Seq(1, 2, 3)).toDF
> df.printSchema
> root
>  |-- value: integer (nullable = true)
> {code}
> When we add a new column:
> {code:java}
> val newDf = df.withColumn("newColumn", col("value") + 1)
> newDf.printSchema
> root
>  |-- value: integer (nullable = true)
>  |-- newColumn: integer (nullable = true)
> {code}
> Generally users want to add new columns at the end, at the beginning, or at a 
> user-defined position, depending on the use case.
>  In my case, for example, we add technical columns at the beginning of a 
> Dataset and business columns at the end.
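For reference, a minimal spark-shell style sketch of today's workaround is below: re-selecting 
the columns in the desired order. The column names and target position are illustrative only, 
not part of the proposed API.
{code:java}
// Workaround sketch, assuming a spark-shell session (spark, sc and implicits
// already in scope). Today the only way to control where a new column lands
// is to re-select all columns in the desired order.
import org.apache.spark.sql.functions.col

val df = sc.parallelize(Seq(1, 2, 3)).toDF          // single column "value"
val withNew = df.withColumn("newColumn", col("value") + 1)

// Place "newColumn" first; any other position works the same way.
val reordered = withNew.select((Seq("newColumn") ++ df.columns).map(col): _*)
reordered.printSchema   // newColumn now appears before value in the schema
{code}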



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table

2018-09-04 Thread Ruslan Dautkhanov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603590#comment-16603590
 ] 

Ruslan Dautkhanov commented on SPARK-24316:
---

Thanks [~bersprockets] 

Is the Cloudera spark.2.3.cloudera3 parcel based on upstream Spark 2.3.*2*?

We ask because we still see this issue with the latest Cloudera Spark 2.3 parcel 
("2.3 release 3").

 

> Spark sql queries stall for  column width more than 6k for parquet based table
> --
>
> Key: SPARK-24316
> URL: https://issues.apache.org/jira/browse/SPARK-24316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0
>Reporter: Bimalendu Choudhary
>Priority: Major
>
> When we create a table from a data frame using spark sql with columns around 
> 6k or more, even simple queries of fetching 70k rows takes 20 minutes, while 
> the same table if we create through Hive with same data , the same query just 
> takes 5 minutes.
>  
> Instrumenting the code we see that the executors are looping in the while 
> loop of the function initializeInternal(). The majority of time is getting 
> spent in the for loop in below code looping through the columns and the 
> executor appears to be stalled for long time .
>   
> {code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}
> private void initializeInternal() ..
>  ..
>  for (int i = 0; i < requestedSchema.getFieldCount(); ++i)
> { ... }
> }
> {code:java}
>  {code}
>  
> When spark sql is creating table, it also stores the metadata in the 
> TBLPROPERTIES in json format. We see that if we remove this metadata from the 
> table the queries become fast , which is the case when we create the same 
> table through Hive. The exact same table takes 5 times more time with the 
> Json meta data as compared to without the json metadata.
>  
> So looks like as the number of columns are growing bigger than 5 to 6k, the 
> processing of the metadata and comparing it becomes more and more expensive 
> and the performance degrades drastically.
> To recreate the problem simply run the following query:
> import org.apache.spark.sql.SparkSession
> val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")
>  resp_data.write.format("csv").save("/tmp/filename")
>  
> The table should be created by spark sql from dataframe so that the Json meta 
> data is stored. For ex:-
> val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")
> dff.createOrReplaceTempView("my_temp_table")
>  val tmp = spark.sql("Create table tableName stored as parquet as select * 
> from my_temp_table")
>  
>  
> from pyspark.sql import SQLContext
>  sqlContext = SQLContext(sc) 
>  resp_data = spark.sql( " select * from test").limit(2000) 
>  print resp_data_fgv_1k.count() 
>  (resp_data_fgv_1k.write.option('header', 
> False).mode('overwrite').csv('/tmp/2.csv') ) 
>  
>  
> The performance seems to be even slow in the loop if the schema does not 
> match or the fields are empty and the code goes into the if condition where 
> the missing column is marked true:
> missingColumns[i] = true;
>  
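As a side note for anyone reproducing this, a small diagnostic sketch is below. It assumes the 
table was created through Spark SQL (so the JSON schema is mirrored into the Hive table 
properties) and a spark-shell session; `tableName` is a placeholder.
{code:java}
// Diagnostic sketch (assumption: table created by Spark SQL, so its schema is
// stored as JSON in the Hive table properties). For very wide tables the stored
// schema parts become large, which is the metadata cost described above.
spark.sql("SHOW TBLPROPERTIES tableName")
  .filter("key like 'spark.sql.sources.schema%'")
  .selectExpr("key", "length(value) AS value_length")
  .show(false)
{code}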



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23131) Kryo raises StackOverflow during serializing GLR model

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23131:
--
Summary: Kryo raises StackOverflow during serializing GLR model  (was: 
Stackoverflow using ML and Kryo serializer)

> Kryo raises StackOverflow during serializing GLR model
> --
>
> Key: SPARK-23131
> URL: https://issues.apache.org/jira/browse/SPARK-23131
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Peigen
>Priority: Minor
>
> When trying to use a GeneralizedLinearRegression model with SparkConf set to use 
> KryoSerializer (JavaSerializer is fine),
> it causes a StackOverflowError:
> {quote}Exception in thread "dispatcher-event-loop-34" 
> java.lang.StackOverflowError
>  at java.util.HashMap.hash(HashMap.java:338)
>  at java.util.HashMap.get(HashMap.java:556)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:61)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
>  at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62)
> {quote}
> This is very likely to be 
> [https://github.com/EsotericSoftware/kryo/issues/341]
> Upgrading Kryo to 4.0+ could probably fix this.
>  
> Wish: upgrade the Kryo version used by Spark.
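A minimal sketch of the reported setup is below, for anyone trying to reproduce it. The data 
path and the GLR parameters are illustrative assumptions, not taken from the ticket.
{code:java}
// Sketch of the reported configuration: a GLR model fitted while Kryo
// serialization is enabled. The libsvm sample path ships with the Spark source
// tree and is an assumption here; any small regression dataset would do.
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.regression.GeneralizedLinearRegression

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val spark = SparkSession.builder().config(conf).appName("glr-kryo").getOrCreate()

val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")
val glr = new GeneralizedLinearRegression().setFamily("gaussian").setLink("identity")
// The report sees the StackOverflowError when the fitted model is serialized
// with Kryo (e.g. when it is shipped to executors).
val model = glr.fit(training)
{code}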



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2

2018-09-04 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603466#comment-16603466
 ] 

Dongjoon Hyun commented on SPARK-25258:
---

[~yumwang], you wrote that you had submitted a PR, but you didn't use 
`[SPARK-25258]` in your PR. Is there any reason for that?

> Upgrade kryo package to version 4.0.2
> -
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The issue 
> affects all Kryo versions below 4.0.2, so it seems that all Spark versions 
> might encounter it.
> Issue description:
> In the shuffle write phase or some spilling operations, Spark will use the Kryo 
> serializer to serialize data if `spark.serializer` is set to 
> `KryoSerializer`. However, when the data contains some extremely large records, 
> KryoSerializer's MapReferenceResolver gets expanded, and its `reset` 
> method will take a long time to reset all items in the writtenObjects table to 
> null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>  readObjects.clear();
>  writtenObjects.clear();
> }
> public void clear () {
>  K[] keyTable = this.keyTable;
>  for (int i = capacity + stashSize; i-- > 0;)
>   keyTable[i] = null;
>  size = 0;
>  stashSize = 0;
> }
> {code}
> I checked the kryo project in github, and this issue seems fixed in 4.0.2+
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
>  
> I was wondering if we can make spark kryo package upgrade to 4.0.2+ to fix 
> this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25258) Upgrade kryo package to version 4.0.2

2018-09-04 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25258:
--
Summary: Upgrade kryo package to version 4.0.2  (was: Upgrade kryo package 
to version 4.0.2+)

> Upgrade kryo package to version 4.0.2
> -
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.3.1
>Reporter: liupengcheng
>Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The issue 
> affects all Kryo versions below 4.0.2, so it seems that all Spark versions 
> might encounter it.
> Issue description:
> In the shuffle write phase or some spilling operations, Spark will use the Kryo 
> serializer to serialize data if `spark.serializer` is set to 
> `KryoSerializer`. However, when the data contains some extremely large records, 
> KryoSerializer's MapReferenceResolver gets expanded, and its `reset` 
> method will take a long time to reset all items in the writtenObjects table to 
> null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>  readObjects.clear();
>  writtenObjects.clear();
> }
> public void clear () {
>  K[] keyTable = this.keyTable;
>  for (int i = capacity + stashSize; i-- > 0;)
>   keyTable[i] = null;
>  size = 0;
>  stashSize = 0;
> }
> {code}
> I checked the kryo project in github, and this issue seems fixed in 4.0.2+
> [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
>  
> I was wondering if we can make spark kryo package upgrade to 4.0.2+ to fix 
> this problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table

2018-09-04 Thread Bruce Robbins (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603453#comment-16603453
 ] 

Bruce Robbins commented on SPARK-24316:
---

This is likely SPARK-25164.

> Spark sql queries stall for  column width more than 6k for parquet based table
> --
>
> Key: SPARK-24316
> URL: https://issues.apache.org/jira/browse/SPARK-24316
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0
>Reporter: Bimalendu Choudhary
>Priority: Major
>
> When we create a table from a data frame using spark sql with columns around 
> 6k or more, even simple queries of fetching 70k rows takes 20 minutes, while 
> the same table if we create through Hive with same data , the same query just 
> takes 5 minutes.
>  
> Instrumenting the code we see that the executors are looping in the while 
> loop of the function initializeInternal(). The majority of time is getting 
> spent in the for loop in below code looping through the columns and the 
> executor appears to be stalled for long time .
>   
> {code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid}
> private void initializeInternal() ..
>  ..
>  for (int i = 0; i < requestedSchema.getFieldCount(); ++i)
> { ... }
> }
> {code:java}
>  {code}
>  
> When spark sql is creating table, it also stores the metadata in the 
> TBLPROPERTIES in json format. We see that if we remove this metadata from the 
> table the queries become fast , which is the case when we create the same 
> table through Hive. The exact same table takes 5 times more time with the 
> Json meta data as compared to without the json metadata.
>  
> So looks like as the number of columns are growing bigger than 5 to 6k, the 
> processing of the metadata and comparing it becomes more and more expensive 
> and the performance degrades drastically.
> To recreate the problem simply run the following query:
> import org.apache.spark.sql.SparkSession
> val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")
>  resp_data.write.format("csv").save("/tmp/filename")
>  
> The table should be created by spark sql from dataframe so that the Json meta 
> data is stored. For ex:-
> val dff =  spark.read.format("csv").load("hdfs:///tmp/test.csv")
> dff.createOrReplaceTempView("my_temp_table")
>  val tmp = spark.sql("Create table tableName stored as parquet as select * 
> from my_temp_table")
>  
>  
> from pyspark.sql import SQLContext
>  sqlContext = SQLContext(sc) 
>  resp_data = spark.sql( " select * from test").limit(2000) 
>  print resp_data_fgv_1k.count() 
>  (resp_data_fgv_1k.write.option('header', 
> False).mode('overwrite').csv('/tmp/2.csv') ) 
>  
>  
> The performance seems to be even slow in the loop if the schema does not 
> match or the fields are empty and the code goes into the if condition where 
> the missing column is marked true:
> missingColumns[i] = true;
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25334) Default SessionCatalog should support UDFs

2018-09-04 Thread JIRA


[ 
https://issues.apache.org/jira/browse/SPARK-25334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603341#comment-16603341
 ] 

Tomasz Gawęda commented on SPARK-25334:
---

If committers say it's not very important, I can start working on this. However, it 
will probably take more time for me to implement it than for someone who's 
already a Spark committer :)

> Default SessionCatalog should support UDFs
> --
>
> Key: SPARK-25334
> URL: https://issues.apache.org/jira/browse/SPARK-25334
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Tomasz Gawęda
>Priority: Major
>
> SessionCatalog calls registerFunction to add a function to function registry. 
> However, makeFunctionExpression supports only UserDefinedAggregateFunction.
> We should make makeFunctionExpression support UserDefinedFunction, as it's 
> one of functions type.
> Currently we can use persistent functions only with Hive metastore, but 
> "create function" command works also with default SessionCatalog. It 
> sometimes cause user confusion, like in 
> https://stackoverflow.com/questions/52164488/spark-hive-udf-no-handler-for-udaf-analysis-exception/52170519



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25334) Default SessionCatalog should support UDFs

2018-09-04 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/SPARK-25334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Gawęda updated SPARK-25334:
--
Summary: Default SessionCatalog should support UDFs  (was: Default 
SessionCatalog doesn't support UDFs)

> Default SessionCatalog should support UDFs
> --
>
> Key: SPARK-25334
> URL: https://issues.apache.org/jira/browse/SPARK-25334
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Tomasz Gawęda
>Priority: Major
>
> SessionCatalog calls registerFunction to add a function to function registry. 
> However, makeFunctionExpression supports only UserDefinedAggregateFunction.
> We should make makeFunctionExpression support UserDefinedFunction, as it's 
> one of functions type.
> Currently we can use persistent functions only with Hive metastore, but 
> "create function" command works also with default SessionCatalog. It 
> sometimes cause user confusion, like in 
> https://stackoverflow.com/questions/52164488/spark-hive-udf-no-handler-for-udaf-analysis-exception/52170519



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25334) Default SessionCatalog doesn't support UDFs

2018-09-04 Thread JIRA
Tomasz Gawęda created SPARK-25334:
-

 Summary: Default SessionCatalog doesn't support UDFs
 Key: SPARK-25334
 URL: https://issues.apache.org/jira/browse/SPARK-25334
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Tomasz Gawęda


SessionCatalog calls registerFunction to add a function to the function registry. 
However, makeFunctionExpression supports only UserDefinedAggregateFunction.

We should make makeFunctionExpression support UserDefinedFunction as well, since it's 
one of the function types.

Currently we can use persistent functions only with the Hive metastore, but the "create 
function" command also works with the default SessionCatalog. This sometimes causes 
user confusion, as in 
https://stackoverflow.com/questions/52164488/spark-hive-udf-no-handler-for-udaf-analysis-exception/52170519
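For illustration, a hedged sketch of the confusing flow is below. The class name is 
hypothetical, and the error text is paraphrased from the linked Stack Overflow question 
rather than taken from this ticket.
{code:java}
// Hypothetical illustration of the confusion described above. com.example.MyUpper
// is a made-up Scala UDF class; it is not part of Spark.
spark.sql("CREATE FUNCTION my_upper AS 'com.example.MyUpper'")  // accepted by the default SessionCatalog
spark.sql("SELECT my_upper('abc')").show()
// Fails at resolution time, because makeFunctionExpression only knows how to
// build UserDefinedAggregateFunction instances (the "No handler for UDAF"
// AnalysisException reported in the linked question).
{code}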



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25333) Ability to add new columns in the beginning of a Dataset

2018-09-04 Thread Walid Mellouli (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Walid Mellouli updated SPARK-25333:
---
External issue URL: https://github.com/apache/spark/pull/22332
Labels: pull-request-available  (was: )

> Ability to add new columns in the beginning of a Dataset
> 
>
> Key: SPARK-25333
> URL: https://issues.apache.org/jira/browse/SPARK-25333
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Walid Mellouli
>Priority: Minor
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When we add new columns in a Dataset, they are added automatically at the end 
> of the Dataset.
> {code:java}
> val df = sc.parallelize(Seq(1, 2, 3)).toDF
> df.printSchema
> root
>  |-- value: integer (nullable = true)
> {code}
> When we add a new column:
> {code:java}
> val newDf = df.withColumn("newColumn", col("value") + 1)
> newDf.printSchema
> root
>  |-- value: integer (nullable = true)
>  |-- newColumn: integer (nullable = true)
> {code}
> Generally users want to add new columns either at the end or in the 
> beginning, depends on use cases.
>  In my case for example, we add technical columns in the beginning of a 
> Dataset and we add business columns at the end.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25333) Ability to add new columns in the beginning of a Dataset

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25333:


Assignee: (was: Apache Spark)

> Ability to add new columns in the beginning of a Dataset
> 
>
> Key: SPARK-25333
> URL: https://issues.apache.org/jira/browse/SPARK-25333
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Walid Mellouli
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When we add new columns in a Dataset, they are added automatically at the end 
> of the Dataset.
> {code:java}
> val df = sc.parallelize(Seq(1, 2, 3)).toDF
> df.printSchema
> root
>  |-- value: integer (nullable = true)
> {code}
> When we add a new column:
> {code:java}
> val newDf = df.withColumn("newColumn", col("value") + 1)
> newDf.printSchema
> root
>  |-- value: integer (nullable = true)
>  |-- newColumn: integer (nullable = true)
> {code}
> Generally users want to add new columns either at the end or in the 
> beginning, depends on use cases.
>  In my case for example, we add technical columns in the beginning of a 
> Dataset and we add business columns at the end.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25333) Ability to add new columns in the beginning of a Dataset

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603330#comment-16603330
 ] 

Apache Spark commented on SPARK-25333:
--

User 'wmellouli' has created a pull request for this issue:
https://github.com/apache/spark/pull/22332

> Ability to add new columns in the beginning of a Dataset
> 
>
> Key: SPARK-25333
> URL: https://issues.apache.org/jira/browse/SPARK-25333
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Walid Mellouli
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When we add new columns in a Dataset, they are added automatically at the end 
> of the Dataset.
> {code:java}
> val df = sc.parallelize(Seq(1, 2, 3)).toDF
> df.printSchema
> root
>  |-- value: integer (nullable = true)
> {code}
> When we add a new column:
> {code:java}
> val newDf = df.withColumn("newColumn", col("value") + 1)
> newDf.printSchema
> root
>  |-- value: integer (nullable = true)
>  |-- newColumn: integer (nullable = true)
> {code}
> Generally users want to add new columns either at the end or in the 
> beginning, depends on use cases.
>  In my case for example, we add technical columns in the beginning of a 
> Dataset and we add business columns at the end.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25333) Ability to add new columns in the beginning of a Dataset

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25333:


Assignee: Apache Spark

> Ability to add new columns in the beginning of a Dataset
> 
>
> Key: SPARK-25333
> URL: https://issues.apache.org/jira/browse/SPARK-25333
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Walid Mellouli
>Assignee: Apache Spark
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When we add new columns in a Dataset, they are added automatically at the end 
> of the Dataset.
> {code:java}
> val df = sc.parallelize(Seq(1, 2, 3)).toDF
> df.printSchema
> root
>  |-- value: integer (nullable = true)
> {code}
> When we add a new column:
> {code:java}
> val newDf = df.withColumn("newColumn", col("value") + 1)
> newDf.printSchema
> root
>  |-- value: integer (nullable = true)
>  |-- newColumn: integer (nullable = true)
> {code}
> Generally users want to add new columns either at the end or in the 
> beginning, depends on use cases.
>  In my case for example, we add technical columns in the beginning of a 
> Dataset and we add business columns at the end.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25333) Ability to add new columns in the beginning of a Dataset

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603329#comment-16603329
 ] 

Apache Spark commented on SPARK-25333:
--

User 'wmellouli' has created a pull request for this issue:
https://github.com/apache/spark/pull/22332

> Ability to add new columns in the beginning of a Dataset
> 
>
> Key: SPARK-25333
> URL: https://issues.apache.org/jira/browse/SPARK-25333
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Walid Mellouli
>Priority: Minor
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> When we add new columns in a Dataset, they are added automatically at the end 
> of the Dataset.
> {code:java}
> val df = sc.parallelize(Seq(1, 2, 3)).toDF
> df.printSchema
> root
>  |-- value: integer (nullable = true)
> {code}
> When we add a new column:
> {code:java}
> val newDf = df.withColumn("newColumn", col("value") + 1)
> newDf.printSchema
> root
>  |-- value: integer (nullable = true)
>  |-- newColumn: integer (nullable = true)
> {code}
> Generally users want to add new columns either at the end or in the 
> beginning, depends on use cases.
>  In my case for example, we add technical columns in the beginning of a 
> Dataset and we add business columns at the end.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-25248) Audit barrier APIs for Spark 2.4

2018-09-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-25248.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22240
[https://github.com/apache/spark/pull/22240]

> Audit barrier APIs for Spark 2.4
> 
>
> Key: SPARK-25248
> URL: https://issues.apache.org/jira/browse/SPARK-25248
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
> Fix For: 2.4.0
>
>
> Make a pass over APIs added for barrier execution mode.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25333) Ability to add new columns in the beginning of a Dataset

2018-09-04 Thread Walid Mellouli (JIRA)
Walid Mellouli created SPARK-25333:
--

 Summary: Ability to add new columns in the beginning of a Dataset
 Key: SPARK-25333
 URL: https://issues.apache.org/jira/browse/SPARK-25333
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Walid Mellouli


When we add new columns in a Dataset, they are added automatically at the end 
of the Dataset.
{code:java}
val df = sc.parallelize(Seq(1, 2, 3)).toDF
df.printSchema


root
 |-- value: integer (nullable = true)
{code}

When we add a new column:

{code:java}
val newDf = df.withColumn("newColumn", col("value") + 1)
newDf.printSchema


root
 |-- value: integer (nullable = true)
 |-- newColumn: integer (nullable = true)
{code}

Generally users want to add new columns either at the end or at the beginning, 
depending on the use case.
 In my case, for example, we add technical columns at the beginning of a Dataset 
and business columns at the end.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider

2018-09-04 Thread Babulal (JIRA)
Babulal created SPARK-25332:
---

 Summary: Instead of broadcast hash join  ,Sort merge join has 
selected when restart spark-shell/spark-JDBC for hive provider
 Key: SPARK-25332
 URL: https://issues.apache.org/jira/browse/SPARK-25332
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Babulal


spark.sql("create table x1(name string,age int) stored as parquet ")
 spark.sql("insert into x1 select 'a',29")
 spark.sql("create table x2 (name string,age int) stored as parquet '")
 spark.sql("insert into x2_ex select 'a',29")
 scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain

== Physical Plan ==
*{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, 
BuildRight
:- *(2) Project [name#101, age#102]
: +- *(2) Filter isnotnull(name#101)
: +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, 
Format: Parquet, Location: 
InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, 
PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
struct
+- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
 +- *(1) Project [name#103, age#104]
 +- *(1) Filter isnotnull(name#103)
 +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, 
Format: Parquet, Location: 
InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, 
PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
struct

 

 

Now restart spark-shell (or spark-submit, or the JDBC server) and run the 
same select query again:

 

scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
== Physical Plan ==
*{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner
:- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(name#43, 200)
: +- *(1) Project [name#43, age#44]
: +- *(1) Filter isnotnull(name#43)
: +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: 
Parquet, Location: 
InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], 
PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
struct
+- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
 +- Exchange hashpartitioning(name#45, 200)
 +- *(3) Project [name#45, age#46]
 +- *(3) Filter isnotnull(name#45)
 +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: 
Parquet, Location: 
InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], 
PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: 
struct

 

 

scala> spark.sql("desc formatted x1").show(200,false)
++--+---+
|col_name |data_type |comment|
++--+---+
|name |string |null |
|age |int |null |
| | | |
|# Detailed Table Information| | |
|Database |default | |
|Table |x1 | |
|Owner |Administrator | |
|Created Time |Sun Aug 19 12:36:58 IST 2018 | |
|Last Access |Thu Jan 01 05:30:00 IST 1970 | |
|Created By |Spark 2.3.0 | |
|Type |MANAGED | |
|Provider |hive | |
|Table Properties |[transient_lastDdlTime=1534662418] | |
|Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | |
|Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | |
|InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | |
|OutputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| |
|Storage Properties |[serialization.format=1] | |
|Partition Provider |Catalog | |
++--+---+

 

With a datasource table this works fine (i.e. create table ... using parquet 
instead of stored as parquet).
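A diagnostic sketch (not a fix) is below: it only checks whether the table statistics that 
drive the broadcast decision are still visible after the restart, using the x1/x2 tables from 
the repro above.
{code:java}
// Diagnostic sketch: compare the table size estimate against the broadcast
// threshold after the restart. If the Statistics row is missing or stale,
// recomputing it may bring the broadcast hash join back, which would point at
// the statistics of hive-provider tables rather than the join planner itself.
spark.sql("DESC EXTENDED x1").filter("col_name = 'Statistics'").show(false)
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))  // 10 MB by default

spark.sql("ANALYZE TABLE x1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE x2 COMPUTE STATISTICS")
spark.sql("select * from x1 t1, x2 t2 where t1.name = t2.name").explain
{code}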



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22666) Spark datasource for image format

2018-09-04 Thread Xiangrui Meng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-22666:
-

Assignee: Weichen Xu

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Assignee: Weichen Xu
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues to some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.
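A short sketch of the proposed reader shape is below; the path is illustrative and the column 
names assume the image schema already discussed in SPARK-21866.
{code:java}
// Sketch of the proposed interface only; the exact options and schema details
// are what this ticket is meant to settle. "/data/images" is a placeholder path.
val images = spark.read.format("image").load("/data/images")
images.printSchema()
// Assuming the SPARK-21866 image schema (origin, height, width, nChannels, mode, data):
images.select("image.origin", "image.width", "image.height").show(5, false)
{code}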



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25317) MemoryBlock performance regression

2018-09-04 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603174#comment-16603174
 ] 

Marco Gaido commented on SPARK-25317:
-

I think I have a fix for this. I can submit a PR if you want, but I am still 
not sure about the root cause of the regression. My best guess is that there 
is more than one cause and the perf improvement only happens once all of them 
are fixed, which is rather strange to me.

> MemoryBlock performance regression
> --
>
> Key: SPARK-25317
> URL: https://issues.apache.org/jira/browse/SPARK-25317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wenchen Fan
>Priority: Blocker
>
> There is a performance regression when calculating the hash code for a UTF8String:
> {code:java}
>   test("hashing") {
> import org.apache.spark.unsafe.hash.Murmur3_x86_32
> import org.apache.spark.unsafe.types.UTF8String
> val hasher = new Murmur3_x86_32(0)
> val str = UTF8String.fromString("b" * 10001)
> val numIter = 10
> val start = System.nanoTime
> for (i <- 0 until numIter) {
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
>   Murmur3_x86_32.hashUTF8String(str, 0)
> }
> val duration = (System.nanoTime() - start) / 1000 / numIter
> println(s"duration $duration us")
>   }
> {code}
> To run this test in 2.3, we need to add
> {code:java}
> public static int hashUTF8String(UTF8String str, int seed) {
> return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), 
> str.numBytes(), seed);
>   }
> {code}
> to `Murmur3_x86_32`
> In my laptop, the result for master vs 2.3 is: 120 us vs 40 us



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Mihaly Toth (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603143#comment-16603143
 ] 

Mihaly Toth commented on SPARK-25331:
-

After looking into how this could be solved, there are a few potential approaches I 
can think of:
# Make the resulting file names deterministic based on the input. Currently each name 
contains a UUID, which is by nature different in each run. The open questions here are 
whether partitioning of the data can always be done the same way, and what else 
motivated adding a UUID to the name.
# Create a "write-ahead manifest file" which contains the generated file names. 
This could be used in {{ManifestFileCommitProtocol.setupJob}}, which is 
currently a no-op. We may need to store some additional data, like the partitioning, 
in order to generate the same file contents again.
# Document and mandate the use of the manifest file for consumers of the 
file output. Currently this file is not mentioned in the docs. Even if it 
were documented, it would make the life of the consumer more difficult, not 
to mention that it would be somewhat counter-intuitive.

Before rushing into the implementation it would make sense to discuss the 
direction. I would pick the first option if that is possible; a rough sketch of what 
that could look like follows below.
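A purely illustrative sketch of option 1 (not the actual {{ManifestFileCommitProtocol}} code; 
the naming scheme is an assumption):
{code:java}
// Illustrative only: derive the part-file name from the batch and partition ids
// instead of a per-run UUID, so re-running the same batch after a driver failure
// produces the same file name instead of a duplicate.
def deterministicFileName(batchId: Long, partitionId: Int, ext: String): String =
  f"part-$partitionId%05d-batch-$batchId$ext"

deterministicFileName(42L, 3, ".parquet")   // "part-00003-batch-42.parquet"
{code}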

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify file output - according to {{Sink.addBatch}} documentation the rdd 
> should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception

2018-09-04 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603126#comment-16603126
 ] 

Sujith commented on SPARK-25271:


[~cloud_fan] [~sowen] Will this cause a compatibility problem compared to the older 
version? If a user has a null record, they now get an exception with the 
current version, whereas the older version of Spark (2.2.1) won't throw any 
exception.

I think the output writers were updated in the PR below:

[https://github.com/apache/spark/pull/20521]

> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: shivusondur
>Priority: Major
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result :* Throwing exception (Working fine with spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
>   at 
> 

[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception

2018-09-04 Thread Sujith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603128#comment-16603128
 ] 

Sujith commented on SPARK-25271:


cc [~hyukjin.kwon]

> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: shivusondur
>Priority: Major
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result :* Throwing exception (Working fine with spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
>   ... 21 more
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: 

[jira] [Assigned] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25331:


Assignee: (was: Apache Spark)

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify file output - according to {{Sink.addBatch}} documentation the rdd 
> should be written only once
> I have created a wip PR with a unit test:
> https://github.com/apache/spark/pull/22331



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603120#comment-16603120
 ] 

Apache Spark commented on SPARK-25331:
--

User 'misutoth' has created a pull request for this issue:
https://github.com/apache/spark/pull/22331

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify the file output - according to the {{Sink.addBatch}} documentation, 
> the RDD should be written only once
> I have created a WIP PR with a unit test:
> https://github.com/apache/spark/pull/22331



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25331:


Assignee: Apache Spark

> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Assignee: Apache Spark
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify the file output - according to the {{Sink.addBatch}} documentation, 
> the RDD should be written only once
> I have created a WIP PR with a unit test:
> https://github.com/apache/spark/pull/22331



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Mihaly Toth (JIRA)
Mihaly Toth created SPARK-25331:
---

 Summary: Structured Streaming File Sink duplicates records in case 
of driver failure
 Key: SPARK-25331
 URL: https://issues.apache.org/jira/browse/SPARK-25331
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.1
Reporter: Mihaly Toth


Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
been started by {{FileFormatWriter.write}}, and then the resulting task sets are 
completed but in the meantime the driver dies. In such a case, repeating 
{{FileStreamSink.addBatch}} will result in duplicate writing of the data.

In the event the driver fails after the executors start processing the job, the 
processed batch will be written twice.

Steps needed:
1. call {{FileStreamSink.addBatch}}
2. make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
3. call {{FileStreamSink.addBatch}} with the same data
4. make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
5. Verify the file output - according to the {{Sink.addBatch}} documentation, 
the RDD should be written only once

I have created a WIP PR with a unit test:
https://github.com/apache/spark/pull/22331
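
For illustration, here is a minimal sketch of the idempotence guard these steps are exercising (hypothetical code, not the actual FileStreamSink implementation): a sink that remembers the latest committed batch id can skip a repeated {{addBatch}} call instead of writing the files again.

{code:scala}
// Minimal sketch, assuming a commit log that exposes the latest committed batch id.
// SimpleCommitLog and writeFiles are hypothetical names used only for illustration.
class SimpleCommitLog {
  private var latestBatchId: Option[Long] = None
  def latest: Option[Long] = latestBatchId
  def commit(batchId: Long): Unit = { latestBatchId = Some(batchId) }
}

def addBatchIdempotently(log: SimpleCommitLog, batchId: Long)(writeFiles: => Unit): Unit = {
  if (log.latest.exists(_ >= batchId)) {
    // The batch was already committed before the driver failure; skip the write
    // so the data is not duplicated.
    println(s"Skipping already committed batch $batchId")
  } else {
    writeFiles            // write the data files for this batch
    log.commit(batchId)   // record the batch only after the write succeeds
  }
}
{code}

A test along the lines of steps 1-5 above would essentially assert that the second call takes the skip branch and that the output is written only once.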




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure

2018-09-04 Thread Mihaly Toth (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihaly Toth updated SPARK-25331:

Description: 
Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
been started by {{FileFormatWriter.write}}, and then the resulting task sets are 
completed but in the meantime the driver dies. In such a case, repeating 
{{FileStreamSink.addBatch}} will result in duplicate writing of the data.

In the event the driver fails after the executors start processing the job, the 
processed batch will be written twice.

Steps needed:
# call {{FileStreamSink.addBatch}}
# make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
# call {{FileStreamSink.addBatch}} with the same data
# make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
# Verify the file output - according to the {{Sink.addBatch}} documentation, 
the RDD should be written only once

I have created a WIP PR with a unit test:
https://github.com/apache/spark/pull/22331


  was:
Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
been started by {{FileFormatWriter.write}}, and then the resulting task sets are 
completed but in the meantime the driver dies. In such a case, repeating 
{{FileStreamSink.addBatch}} will result in duplicate writing of the data.

In the event the driver fails after the executors start processing the job, the 
processed batch will be written twice.

Steps needed:
1. call {{FileStreamSink.addBatch}}
2. make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
3. call {{FileStreamSink.addBatch}} with the same data
4. make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully
5. Verify the file output - according to the {{Sink.addBatch}} documentation, 
the RDD should be written only once

I have created a WIP PR with a unit test:
https://github.com/apache/spark/pull/22331



> Structured Streaming File Sink duplicates records in case of driver failure
> ---
>
> Key: SPARK-25331
> URL: https://issues.apache.org/jira/browse/SPARK-25331
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Mihaly Toth
>Priority: Major
>
> Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has 
> been started by {{FileFormatWriter.write}}, and then the resulting task sets 
> are completed but in the meantime the driver dies. In such a case, repeating 
> {{FileStreamSink.addBatch}} will result in duplicate writing of the data.
> In the event the driver fails after the executors start processing the job, 
> the processed batch will be written twice.
> Steps needed:
> # call {{FileStreamSink.addBatch}}
> # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}}
> # call {{FileStreamSink.addBatch}} with the same data
> # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} 
> successfully
> # Verify the file output - according to the {{Sink.addBatch}} documentation, 
> the RDD should be written only once
> I have created a WIP PR with a unit test:
> https://github.com/apache/spark/pull/22331



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25271) Creating parquet table with all the columns null throws exception

2018-09-04 Thread shivusondur (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603106#comment-16603106
 ] 

shivusondur commented on SPARK-25271:
-

After further analysis of the issue, I found the following details.

In the {{SingleDirectoryWriteTask}} private class (in the 
org.apache.spark.sql.execution.datasources.FileFormatWriter file), 
{{currentWriter}} is initialized with a different OutputWriter in Spark 2.2.1 
and Spark 2.3.1, as shown below.


{code:java}
Spark-2.3.1 = currentWriter is initialized with "HiveOutputWriter"
Spark-2.2.1 = currentWriter is initialized with "ParquetOutputWriter"
{code}


So ParquetOutputWriter may be handling the null/empty values.
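
As a rough illustration of the behavioural difference (a sketch only, not the actual writer code): a writer that omits a null or empty map value entirely never hits the "empty fields are illegal" check, while a writer that opens and closes the field for every row does.

{code:scala}
// Hypothetical sketch: skip null/empty map values instead of emitting an empty field,
// which the Parquet record consumer rejects. writeField stands in for the real writer call.
def writeMapField(value: Map[String, String])(writeField: Map[String, String] => Unit): Unit = {
  if (value != null && value.nonEmpty) {
    writeField(value)
  }
  // else: omit the field completely, as the ParquetEncodingException message requires
}
{code}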

> Creating parquet table with all the columns null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: shivusondur
>Priority: Major
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result :* Throwing exception (Working fine with spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> 

[jira] [Commented] (SPARK-19355) Use map output statistics to improve global limit's parallelism

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602990#comment-16602990
 ] 

Apache Spark commented on SPARK-19355:
--

User 'jiangxb1987' has created a pull request for this issue:
https://github.com/apache/spark/pull/22330

> Use map output statistics to improve global limit's parallelism
> 
>
> Key: SPARK-19355
> URL: https://issues.apache.org/jira/browse/SPARK-19355
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> A logical Limit is actually performed by two physical operations, LocalLimit 
> and GlobalLimit.
> Most of the time, before GlobalLimit, we perform a shuffle exchange to move 
> the data to a single partition. When the limit number is very big, we 
> shuffle a lot of data to a single partition and significantly reduce 
> parallelism, in addition to paying the cost of the shuffle.
> This change tries to perform GlobalLimit without shuffling data to a single 
> partition. Instead, we perform the map stage of the shuffle and collect 
> statistics on the number of rows in each partition. Shuffled data are 
> actually all retrieved locally rather than from remote executors.
> Once we have the number of output rows in each partition, we only take the 
> required number of rows from the locally shuffled data.
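
A minimal sketch of the last step described above (illustrative only, not the actual implementation): given the row count of each partition from the map output statistics and the global limit, compute how many rows to take from each partition.

{code:scala}
// Greedy, left-to-right selection: take rows from each partition until the limit is reached.
def rowsToTakePerPartition(rowCounts: Array[Long], limit: Long): Array[Long] = {
  var remaining = limit
  rowCounts.map { count =>
    val take = math.min(count, remaining)
    remaining -= take
    take
  }
}

// Example: partitions with 10, 40 and 70 rows and LIMIT 100 -> take 10, 40 and 50 rows,
// without ever moving the data to a single partition.
// rowsToTakePerPartition(Array(10L, 40L, 70L), 100L) == Array(10L, 40L, 50L)
{code}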



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25176:


Assignee: (was: Apache Spark)

> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Mikhail Pryakhin
>Priority: Major
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which 
> transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the
> com.twitter:chill_2.11:0.8.0 dependency. This exact version of the Kryo 
> serializer contains an issue [1,2] which results in throwing 
> ClassCastExceptions when serialising a parameterised type hierarchy.
> This issue has been fixed in Kryo version 4.0.0 [3]. It would be great to 
> have this update in Spark as well. Could you please upgrade the version of 
> the com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance
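
For reference, the requested change amounts to a dependency bump along these lines (an sbt-style sketch, assuming the chill artifact coordinates stay the same; Spark's own build manages the version through its Maven/sbt dependency lists):

{code:scala}
// Hypothetical sbt sketch of the proposed upgrade: per the ticket, chill 0.9.2
// brings in the Kryo 4.x fix referenced in [3].
libraryDependencies += "com.twitter" %% "chill" % "0.9.2"
{code}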



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25176:


Assignee: Apache Spark

> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Mikhail Pryakhin
>Assignee: Apache Spark
>Priority: Major
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which 
> transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the
> com.twitter:chill_2.11:0.8.0 dependency. This exact version of the Kryo 
> serializer contains an issue [1,2] which results in throwing 
> ClassCastExceptions when serialising a parameterised type hierarchy.
> This issue has been fixed in Kryo version 4.0.0 [3]. It would be great to 
> have this update in Spark as well. Could you please upgrade the version of 
> the com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602974#comment-16602974
 ] 

Apache Spark commented on SPARK-25176:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/22179

> Kryo fails to serialize a parametrised type hierarchy
> -
>
> Key: SPARK-25176
> URL: https://issues.apache.org/jira/browse/SPARK-25176
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.2, 2.3.1
>Reporter: Mikhail Pryakhin
>Priority: Major
>
> I'm using the latest Spark version, spark-core_2.11:2.3.1, which 
> transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the
> com.twitter:chill_2.11:0.8.0 dependency. This exact version of the Kryo 
> serializer contains an issue [1,2] which results in throwing 
> ClassCastExceptions when serialising a parameterised type hierarchy.
> This issue has been fixed in Kryo version 4.0.0 [3]. It would be great to 
> have this update in Spark as well. Could you please upgrade the version of 
> the com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2?
> You can find a simple test to reproduce the issue [4].
> [1] https://github.com/EsotericSoftware/kryo/issues/384
> [2] https://github.com/EsotericSoftware/kryo/issues/377
> [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0
> [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2018-09-04 Thread Vladimir Pchelko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602909#comment-16602909
 ] 

Vladimir Pchelko edited comment on SPARK-20168 at 9/4/18 11:04 AM:
---

[~srowen]


 this bug must be covered by unit tests.

 


was (Author: vpchelko):
[~srowen]
 this bug must be covered by unit tests

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>Priority: Minor
>  Labels: kinesis, streaming
> Fix For: 2.4.0
>
>
> Kinesis client can resume from a specified timestamp while creating a stream. 
> We should have an option to pass a timestamp in the config to allow Kinesis to 
> resume from the given timestamp.
> Have started initial work and will be posting a PR after I test the patch -
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp

2018-09-04 Thread Vladimir Pchelko (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602909#comment-16602909
 ] 

Vladimir Pchelko commented on SPARK-20168:
--

[~srowen]
 this bug must be covered by unit tests

> Enable kinesis to start stream from Initial position specified by a timestamp
> -
>
> Key: SPARK-20168
> URL: https://issues.apache.org/jira/browse/SPARK-20168
> Project: Spark
>  Issue Type: Improvement
>  Components: DStreams
>Affects Versions: 2.1.0
>Reporter: Yash Sharma
>Assignee: Yash Sharma
>Priority: Minor
>  Labels: kinesis, streaming
> Fix For: 2.4.0
>
>
> Kinesis client can resume from a specified timestamp while creating a stream. 
> We should have an option to pass a timestamp in the config to allow Kinesis to 
> resume from the given timestamp.
> Have started initial work and will be posting a PR after I test the patch -
> https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24189) Spark Structured Streaming not working with Kafka transactions

2018-09-04 Thread Binzi Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602889#comment-16602889
 ] 

Binzi Cao edited comment on SPARK-24189 at 9/4/18 10:49 AM:


It seems I'm hitting a similar issue. I managed to set the kafka isolation 
level with
{code:java}
.option("kafka.isolation.level", "read_committed")
{code}
and using
{code:java}
kafka-client 1.0.0 
Spark version: 2.3.1{code}
and I'm seeing this issue:
{code:java}
[error] org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due 
to stage failure: Task 17 in stage 0.0 failed 1 times, most recent failure: 
Lost task 17.0 in stage 0.0 (TID 17, localhost, executor driver): 
java.util.concurrent.TimeoutException: Cannot fetch record for offset 145 in 
2000 milliseconds
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:271)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362)
[error] at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:152)
[error] at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:143)
[error] at 
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:205)
[error] at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[error] at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(generated.java:230)


{code}
So it looks like it is not working at all with a topic that uses Kafka transactions.

The exception was thrown here:
 
[https://github.com/apache/spark/blob/v2.3.1/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L271-L272]

Setting
{code:java}
 failOnDataLoss=false
{code}
can't fix the issue, as the exception is never caught in the 
KafkaDataConsumer.scala code.


was (Author: caobinzi):
It seems I'm hitting a similar issue. I managed to set the kafka isolation 
level with

{code:java}
.option("kafka.isolation.level", "read_committed")
{code}

and using 
{code:java}
kafka-client 1.0.0 
{code}
 and I'm seeing this issue: 


{code:java}
[error] org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due 
to stage failure: Task 17 in stage 0.0 failed 1 times, most recent failure: 
Lost task 17.0 in stage 0.0 (TID 17, localhost, executor driver): 
java.util.concurrent.TimeoutException: Cannot fetch record for offset 145 in 
2000 milliseconds
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:271)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362)
[error] at 

[jira] [Comment Edited] (SPARK-24189) Spark Structured Streaming not working with Kafka transactions

2018-09-04 Thread Binzi Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602889#comment-16602889
 ] 

Binzi Cao edited comment on SPARK-24189 at 9/4/18 10:49 AM:


It seems I'm hitting a similar issue. I managed to set the kafka isolation 
level with
{code:java}
.option("kafka.isolation.level", "read_committed")
{code}
and using
{code:java}
kafka-client 1.0.0 
Spark version: 2.3.1{code}
and I'm seeing this issue:
{code:java}
[error] org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due 
to stage failure: Task 17 in stage 0.0 failed 1 times, most recent failure: 
Lost task 17.0 in stage 0.0 (TID 17, localhost, executor driver): 
java.util.concurrent.TimeoutException: Cannot fetch record for offset 145 in 
2000 milliseconds
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:271)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362)
[error] at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:152)
[error] at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:143)
[error] at 
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:205)
[error] at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[error] at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(generated.java:230)


{code}
So it looks like it is not working at all with a topic that uses Kafka transactions.

The exception was thrown here:
 
[https://github.com/apache/spark/blob/v2.3.1/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L271-L272]

Setting
{code:java}
 failOnDataLoss=false
{code}
can't fix the issue, as the exception is never caught in the 
KafkaDataConsumer.scala code.


was (Author: caobinzi):
It seems I'm hitting a similar issue. I managed to set the kafka isolation 
level with
{code:java}
.option("kafka.isolation.level", "read_committed")
{code}
and using
{code:java}
kafka-client 1.0.0 
Spark version: 2.3.1{code}
and I'm seeing this issue:
{code:java}
[error] org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due 
to stage failure: Task 17 in stage 0.0 failed 1 times, most recent failure: 
Lost task 17.0 in stage 0.0 (TID 17, localhost, executor driver): 
java.util.concurrent.TimeoutException: Cannot fetch record for offset 145 in 
2000 milliseconds
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:271)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362)
[error] at 

[jira] [Commented] (SPARK-24189) Spark Structured Streaming not working with Kafka transactions

2018-09-04 Thread Binzi Cao (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602889#comment-16602889
 ] 

Binzi Cao commented on SPARK-24189:
---

It seems I'm hitting a similar issue. I managed to set the kafka isolation 
level with

{code:java}
.option("kafka.isolation.level", "read_committed")
{code}

and using 
{code:java}
kafka-client 1.0.0 
{code}
 and I'm seeing this issue: 


{code:java}
[error] org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due 
to stage failure: Task 17 in stage 0.0 failed 1 times, most recent failure: 
Lost task 17.0 in stage 0.0 (TID 17, localhost, executor driver): 
java.util.concurrent.TimeoutException: Cannot fetch record for offset 145 in 
2000 milliseconds
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:271)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109)
[error] at 
org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54)
[error] at 
org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362)
[error] at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:152)
[error] at 
org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:143)
[error] at 
org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:205)
[error] at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[error] at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
[error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
[error] at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(generated.java:230)


{code}


So it looks like it is not working at all with a topic that uses Kafka transactions. 

The exception was thrown here:
https://github.com/apache/spark/blob/v2.3.1/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L271-L272

Setting
{code:java}
 failOnDataLoss=false
{code}
 can't fix the issue, as the exception is never caught in the 
KafkaDataConsumer.scala code. 
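
Putting the options from this comment together, the configuration being described is roughly the following (a sketch based on the comment above; whether read_committed is honoured end to end is exactly what this ticket questions):

{code:scala}
// Sketch of the reader configuration discussed in this comment, assuming an existing
// SparkSession `spark`. The kafka. prefix passes the property through to the Kafka consumer.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test-topic")
  .option("kafka.isolation.level", "read_committed")
  .option("failOnDataLoss", "false")   // per the comment, this does not avoid the TimeoutException
  .load()
{code}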





> Spark Structured Streaming not working with Kafka transactions
> --
>
> Key: SPARK-24189
> URL: https://issues.apache.org/jira/browse/SPARK-24189
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: bharath kumar avusherla
>Priority: Major
>
> I was trying to read a Kafka transactional topic using Spark Structured Streaming 
> 2.3.0 with the Kafka option isolation-level = "read_committed", but Spark reads 
> the data immediately without waiting for the data in the topic to be 
> committed. The Spark documentation mentions that Structured Streaming 
> supports Kafka version 0.10 or higher. I am using the command below to test the 
> scenario.
> val df = spark
>  .readStream
>  .format("kafka")
>  .option("kafka.bootstrap.servers", "localhost:9092")
>  .option("subscribe", "test-topic")
>  .option("isolation-level","read_committed")
>  .load()
> Can you please let me know if transactional reads are supported in Spark 
> 2.3.0 Structured Streaming, or am I missing anything.
>  
> Thank you.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25328) Add an example for having two columns as the grouping key in group aggregate pandas UDF

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25328:


Assignee: (was: Apache Spark)

> Add an example for having two columns as the grouping key in group aggregate 
> pandas UDF
> ---
>
> Key: SPARK-25328
> URL: https://issues.apache.org/jira/browse/SPARK-25328
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> https://github.com/apache/spark/pull/20295 added an alternative interface for 
> group aggregate pandas UDFs. It does not have an example that has more than 
> one column as the grouping key in {{functions.py}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-25328) Add an example for having two columns as the grouping key in group aggregate pandas UDF

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-25328:


Assignee: Apache Spark

> Add an example for having two columns as the grouping key in group aggregate 
> pandas UDF
> ---
>
> Key: SPARK-25328
> URL: https://issues.apache.org/jira/browse/SPARK-25328
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> https://github.com/apache/spark/pull/20295 added an alternative interface for 
> group aggregate pandas UDFs. It does not have an example that has more than 
> one column as the grouping key in {{functions.py}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25328) Add an example for having two columns as the grouping key in group aggregate pandas UDF

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602846#comment-16602846
 ] 

Apache Spark commented on SPARK-25328:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/22329

> Add an example for having two columns as the grouping key in group aggregate 
> pandas UDF
> ---
>
> Key: SPARK-25328
> URL: https://issues.apache.org/jira/browse/SPARK-25328
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Xiao Li
>Priority: Major
>
> https://github.com/apache/spark/pull/20295 added an alternative interface for 
> group aggregate pandas UDFs. It does not have an example that has more than 
> one column as the grouping key in {{functions.py}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22666) Spark datasource for image format

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22666:


Assignee: Apache Spark

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Assignee: Apache Spark
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.
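
Under the proposal, usage would look roughly like this (a sketch of the intended interface only; the exact output schema is whatever SPARK-21866 settled on):

{code:scala}
// Sketch of the proposed source API, assuming an existing SparkSession `spark`
// and that the image schema discussed in SPARK-21866 is produced.
val images = spark.read.format("image").load("/path/to/images")
images.printSchema()   // expect an image struct column plus metadata such as the file path
{code}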



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22666) Spark datasource for image format

2018-09-04 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22666:


Assignee: (was: Apache Spark)

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22666) Spark datasource for image format

2018-09-04 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602845#comment-16602845
 ] 

Apache Spark commented on SPARK-22666:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/22328

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which seems to have caused some 
> issues for some users. If this issue is too complex to handle in this ticket, 
> it can be dealt with separately.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25301) When a view uses a UDF from a non-default database, Spark analyser throws AnalysisException

2018-09-04 Thread Vinod KC (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-25301:
-
Description: 
When a Hive view uses a UDF from a non-default database, the Spark analyser throws 
AnalysisException

Steps to simulate this issue
 -
 Step 1 : Run following statements in Hive
 
 ```
 CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
 CREATE DATABASE d100;
 CREATE FUNCTION d100.udf100 as 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is 
created in d100
 CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; 
 SELECT * FROM d100.v100; // query on view d100.v100 gives correct result
 ```
 Step2 : Run following statements in Spark
 -
 1) spark.sql("select * from d100.v100").show
 throws 
 ```
 org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. 
This function is neither a registered temporary function nor a permanent 
function registered in the database '*default*'
 ```

This is because, while parsing the SQL statement of the view 'select 
`d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to 
split the database name and the UDF name, and hence the Spark function registry 
tries to load the UDF 'd100.udf100' from the 'default' database.

  was:
When a Hive view uses a UDF from a non-default database, the Spark analyser throws 
AnalysisException

Steps to simulate this issue
 -
 Step 1 : Run following statements in Hive
 
```sql
CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 as 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is 
created in d100
CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; 
SELECT * FROM d100.v100; // query on view d100.v100 gives correct result
```
Step2 : Run following statements in Spark
 -
 1) spark.sql("select * from d100.v100").show
 throws 
 ```
 org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. 
This function is neither a registered temporary function nor a permanent 
function registered in the database '*default*'
 ```

This is because, while parsing the SQL statement of the view 'select 
`d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to 
split the database name and the UDF name, and hence the Spark function registry 
tries to load the UDF 'd100.udf100' from the 'default' database.


> When a view uses a UDF from a non-default database, Spark analyser throws 
> AnalysisException
> 
>
> Key: SPARK-25301
> URL: https://issues.apache.org/jira/browse/SPARK-25301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> When a Hive view uses a UDF from a non-default database, the Spark analyser 
> throws AnalysisException
> Steps to simulate this issue
>  -
>  Step 1 : Run following statements in Hive
>  
>  ```
>  CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
>  CREATE DATABASE d100;
>  CREATE FUNCTION d100.udf100 as 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is 
> created in d100
>  CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; 
>  SELECT * FROM d100.v100; // query on view d100.v100 gives correct result
>  ```
>  Step2 : Run following statements in Spark
>  -
>  1) spark.sql("select * from d100.v100").show
>  throws 
>  ```
>  org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database '*default*'
>  ```
> This is because, while parsing the SQL statement of the view 'select 
> `d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to 
> split the database name and the UDF name, and hence the Spark function 
> registry tries to load the UDF 'd100.udf100' from the 'default' database.
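
A minimal sketch of the name handling the description points at (illustrative only, not the actual parser code): the qualified reference has to be split into a database part and a function part before the registry lookup, rather than being treated as a single function name in the current database.

{code:scala}
// Hypothetical helper: split a qualified function reference such as "d100.udf100"
// into (Some("d100"), "udf100"); an unqualified name keeps the current database.
def splitFunctionName(name: String): (Option[String], String) = {
  name.split('.') match {
    case Array(db, fn) => (Some(db), fn)
    case Array(fn)     => (None, fn)
    case _             => (None, name) // real parsing must also respect backtick quoting
  }
}
{code}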



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25301) When a view uses a UDF from a non-default database, Spark analyser throws AnalysisException

2018-09-04 Thread Vinod KC (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod KC updated SPARK-25301:
-
Description: 
When a Hive view uses a UDF from a non-default database, the Spark analyser throws 
AnalysisException

Steps to simulate this issue
 -
 Step 1 : Run following statements in Hive
 
```sql
CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
CREATE DATABASE d100;
CREATE FUNCTION d100.udf100 as 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is 
created in d100
CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; 
SELECT * FROM d100.v100; // query on view d100.v100 gives correct result
```
Step2 : Run following statements in Spark
 -
 1) spark.sql("select * from d100.v100").show
 throws 
 ```
 org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. 
This function is neither a registered temporary function nor a permanent 
function registered in the database '*default*'
 ```

This is because, while parsing the SQL statement of the view 'select 
`d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to 
split the database name and the UDF name, and hence the Spark function registry 
tries to load the UDF 'd100.udf100' from the 'default' database.

  was:
When a Hive view uses a UDF from a non-default database, the Spark analyser throws 
AnalysisException

Steps to simulate this issue
 -
 In Hive
 
 1) CREATE DATABASE d100;
 2) create function d100.udf100 as 
'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is 
created in d100
 3) create view d100.v100 as select *d100.udf100*(name)  from default.emp; // 
Note : table default.emp has two columns 'name', 'address', 
 5) select * from d100.v100; // query on view d100.v100 gives correct result

In Spark
 -
 1) spark.sql("select * from d100.v100").show
 throws 
 ```
 org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. 
This function is neither a registered temporary function nor a permanent 
function registered in the database '*default*'
 ```

This is because, while parsing the SQL statement of the view 'select 
`d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to 
split the database name and the UDF name, and hence the Spark function registry 
tries to load the UDF 'd100.udf100' from the 'default' database.


> When a view uses a UDF from a non-default database, Spark analyser throws 
> AnalysisException
> 
>
> Key: SPARK-25301
> URL: https://issues.apache.org/jira/browse/SPARK-25301
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Vinod KC
>Priority: Minor
>
> When a Hive view uses a UDF from a non-default database, the Spark analyser 
> throws AnalysisException
> Steps to simulate this issue
>  -
>  Step 1 : Run following statements in Hive
>  
> ```sql
> CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address;
> CREATE DATABASE d100;
> CREATE FUNCTION d100.udf100 as 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is 
> created in d100
> CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; 
> SELECT * FROM d100.v100; // query on view d100.v100 gives correct result
> ```
> Step2 : Run following statements in Spark
>  -
>  1) spark.sql("select * from d100.v100").show
>  throws 
>  ```
>  org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database '*default*'
>  ```
> This is because, while parsing the SQL statement of the view 'select 
> `d100.udf100`(`emp`.`name`) from `default`.`emp`', the Spark parser fails to 
> split the database name and the UDF name, and hence the Spark function 
> registry tries to load the UDF 'd100.udf100' from the 'default' database.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25330) Permission issue after upgrading the Hadoop version to 2.7.7

2018-09-04 Thread Yuming Wang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-25330:

Description: 
How to reproduce:
{code:java}
# build spark
./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
-Phive-thriftserver -Pyarn

tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
spark-2.4.0-SNAPSHOT-bin-SPARK-25330
export HADOOP_PROXY_USER=user_a
bin/spark-sql

export HADOOP_PROXY_USER=user_b
bin/spark-sql{code}
 
{noformat}
Exception in thread "main" java.lang.RuntimeException: 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=user_b, access=EXECUTE, 
inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}

  was:
How to reproduce:
{code}
# build spark
./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
-Phive-thriftserver -Pyarn

tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
spark-2.4.0-SNAPSHOT-bin-SPARK-25330
export HADOOP_PROXY_USER=user_a
bin/spark-sql

export HADOOP_PROXY_USER=user_b
bin/spark-sql{code}
 
{noformat}
Exception in thread "main" java.lang.RuntimeException: 
org.apache.hadoop.security.AccessControlException: Permission denied: 
user=user_b, access=EXECUTE, 
inode="/tmp/hive-$%7Buser.name%7D/b_slng/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
at 
org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}


> Permission issue after upgrading the Hadoop version to 2.7.7
> --
>
> Key: SPARK-25330
> URL: https://issues.apache.org/jira/browse/SPARK-25330
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.3.2, 2.4.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:java}
> # build spark
> ./dev/make-distribution.sh --name SPARK-25330 --tgz  -Phadoop-2.7 -Phive 
> -Phive-thriftserver -Pyarn
> tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd 
> spark-2.4.0-SNAPSHOT-bin-SPARK-25330
> export HADOOP_PROXY_USER=user_a
> bin/spark-sql
> export HADOOP_PROXY_USER=user_b
> bin/spark-sql{code}
>  
> {noformat}
> Exception in thread "main" java.lang.RuntimeException: 
> org.apache.hadoop.security.AccessControlException: Permission denied: 
> user=user_b, access=EXECUTE, 
> inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx--
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


