[jira] [Created] (SPARK-25339) Refactor FilterPushdownBenchmark to use main method
Yuming Wang created SPARK-25339: --- Summary: Refactor FilterPushdownBenchmark to use main method Key: SPARK-25339 URL: https://issues.apache.org/jira/browse/SPARK-25339 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.4.0 Reporter: Yuming Wang Wenchen commented on the PR: https://github.com/apache/spark/pull/22336#issuecomment-418604019 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25336) Revert SPARK-24863 and SPARK-24748
[ https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25336. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22334 [https://github.com/apache/spark/pull/22334] > Revert SPARK-24863 and SPARK-24748 > -- > > Key: SPARK-25336 > URL: https://issues.apache.org/jira/browse/SPARK-25336 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > Fix For: 2.4.0 > > > Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source > v2 APIs are out.
[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603937#comment-16603937 ] Yuming Wang commented on SPARK-19145: - How about we target this ticket and [SPARK-25039|https://issues.apache.org/jira/browse/SPARK-25039] to Spark 3.0.0?
> Timestamp to String casting is slowing the query significantly
> --
>
> Key: SPARK-19145
> URL: https://issues.apache.org/jira/browse/SPARK-19145
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.1.0
> Reporter: gagan taneja
> Priority: Major
>
> I have a time series table with a timestamp column.
> The following query
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> is significantly SLOWER than
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD HH24:MI:SS−0800')
> AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD HH24:MI:SS−0800') LIMIT 5
> After investigation I found that in the first query the time column is cast to
> String before the filter is applied, whereas in the second query no such cast is
> performed and the filter compares long values.
> Below is the generated physical plan for the slower execution, followed by the
> physical plan for the faster execution.
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= '2017-01-02 19:53:51'
> AND `time` <= '2017-01-09 19:53:51' LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3339L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 19:53:51))
>                +- *FileScan parquet default.cstat[time#3314] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: struct
> SELECT COUNT(*) AS `count`
> FROM `default`.`table`
> WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD HH24:MI:SS−0800')
> AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD HH24:MI:SS−0800') LIMIT 5
> == Physical Plan ==
> CollectLimit 5
> +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L])
>    +- Exchange SinglePartition
>       +- *HashAggregate(keys=[], functions=[partial_count(1)], output=[count#3287L])
>          +- *Project
>             +- *Filter ((isnotnull(time#3262) && (time#3262 >= 148340483100)) && (time#3262 <= 148400963100))
>                +- *FileScan parquet default.cstat[time#3262] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], PartitionFilters: [], PushedFilters: [IsNotNull(time), GreaterThanOrEqual(time,2017-01-02 19:53:51.0), LessThanOrEqual(time,2017-01-09..., ReadSchema: struct
> In Impala both queries run efficiently with no performance difference.
> Spark should be able to parse the date string and convert it to a Long/Timestamp
> while generating the Optimized Logical Plan, so that both queries have similar
> performance.
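A hedged workaround sketch, not taken from the ticket itself: until Spark rewrites the string literal during optimization, the filter can be written so that the literal, not the column, is cast, which keeps the comparison on timestamps and allows pushdown. Table and column names below are the reporter's placeholders; `spark` is an assumed SparkSession.

```scala
// Sketch of a workaround: cast the *literal* to a timestamp so the `time`
// column is compared directly, instead of letting the analyzer cast the
// column to string for every row.
val fast = spark.sql("""
  SELECT COUNT(*) AS `count`
  FROM `default`.`table`
  WHERE `time` >= CAST('2017-01-02 19:53:51' AS TIMESTAMP)
    AND `time` <= CAST('2017-01-09 19:53:51' AS TIMESTAMP)
  LIMIT 5
""")
// The Filter in the plan should now show timestamp comparisons and pushed
// GreaterThanOrEqual/LessThanOrEqual filters, with no cast(time as string).
fast.explain()
```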
[jira] [Created] (SPARK-25338) Several tests miss calling super.afterAll() in their afterAll() method
Kazuaki Ishizaki created SPARK-25338: Summary: Several tests miss calling super.afterAll() in their afterAll() method Key: SPARK-25338 URL: https://issues.apache.org/jira/browse/SPARK-25338 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki The following tests under {{external}} may not call {{super.afterAll()}} in their {{afterAll()}} method. {code} external/flume/src/test/scala/org/apache/spark/streaming/flume/FlumePollingStreamSuite.scala external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaRelationSuite.scala external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaSinkSuite.scala external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/DirectKafkaStreamSuite.scala external/kafka-0-10/src/test/scala/org/apache/spark/streaming/kafka010/KafkaRDDSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/DirectKafkaStreamSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaClusterSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/KafkaStreamSuite.scala external/kafka-0-8/src/test/scala/org/apache/spark/streaming/kafka/ReliableKafkaStreamSuite.scala external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisInputDStreamBuilderSuite.scala external/kinesis-asl/src/test/scala/org/apache/spark/streaming/kinesis/KinesisStreamSuite.scala {code}
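The fix for suites like these is the standard ScalaTest idiom: run suite-local teardown inside try and delegate to super.afterAll() in a finally block, so parent-suite cleanup runs even if the local teardown throws. A minimal sketch (the suite name and teardown body are illustrative, not from the patch):

```scala
import org.scalatest.{BeforeAndAfterAll, FunSuite}

class ExampleStreamSuite extends FunSuite with BeforeAndAfterAll {
  override def afterAll(): Unit = {
    try {
      // suite-specific teardown, e.g. stopping an embedded Kafka broker
    } finally {
      super.afterAll()  // always runs, even if the teardown above throws
    }
  }
}
```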
[jira] [Commented] (SPARK-25306) Avoid skewed filter trees to speed up `createFilter` in ORC
[ https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603926#comment-16603926 ] Apache Spark commented on SPARK-25306: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22336 > Avoid skewed filter trees to speed up `createFilter` in ORC > --- > > Key: SPARK-25306 > URL: https://issues.apache.org/jira/browse/SPARK-25306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > Fix For: 2.4.0 > > > In both ORC data sources, createFilter function has exponential time > complexity due to its skewed filter tree generation. This PR aims to improve > it by using new buildTree function. > *REPRODUCE* > {code} > // Create and read 1 row table with 1000 columns > sql("set spark.sql.orc.filterPushdown=true") > val selectExpr = (1 to 1000).map(i => s"id c$i") > spark.range(1).selectExpr(selectExpr: > _*).write.mode("overwrite").orc("/tmp/orc") > print(s"With 0 filters, ") > spark.time(spark.read.orc("/tmp/orc").count) > // Increase the number of filters > (20 to 30).foreach { width => > val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") > print(s"With $width filters, ") > spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) > } > {code} > *RESULT* > {code} > With 0 filters, Time taken: 653 ms > > With 20 filters, Time taken: 962 ms > With 21 filters, Time taken: 1282 ms > With 22 filters, Time taken: 1982 ms > With 23 filters, Time taken: 3855 ms > With 24 filters, Time taken: 6719 ms > With 25 filters, Time taken: 12669 ms > With 26 filters, Time taken: 25032 ms > With 27 filters, Time taken: 49585 ms > With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds > With 29 filters, Time taken: 198368 ms // over 3 mins > With 30 filters, Time taken: 393744 ms // over 6 mins > {code}
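The idea behind the buildTree change can be sketched outside Spark: combining N predicates as a left-deep chain yields a tree of depth N (which is what drives the exponential blow-up in ORC's createFilter), while splitting the predicate list in half recursively yields depth about log2(N). The types below are purely illustrative, not Spark's actual classes.

```scala
// Illustrative sketch of balanced AND-tree construction (not Spark's code).
sealed trait Pred
case class Leaf(name: String) extends Pred
case class And(left: Pred, right: Pred) extends Pred

// Balanced build: split the (non-empty) predicate list in half and recurse.
def buildTree(preds: Seq[Pred]): Pred = preds match {
  case Seq(p) => p
  case _ =>
    val (l, r) = preds.splitAt(preds.length / 2)
    And(buildTree(l), buildTree(r))
}

def depth(p: Pred): Int = p match {
  case Leaf(_)   => 1
  case And(l, r) => 1 + math.max(depth(l), depth(r))
}

val thirty = (1 to 30).map(i => Leaf(s"c$i is not null"))
// A left-deep fold gives a tree of depth 30; the balanced build stays at 6.
val skewed   = thirty.reduceLeft[Pred](And(_, _))  // depth(skewed)   == 30
val balanced = buildTree(thirty)                   // depth(balanced) == 6
```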
[jira] [Commented] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603912#comment-16603912 ] Apache Spark commented on SPARK-25091: -- User 'cfangplus' has created a pull request for this issue: https://github.com/apache/spark/pull/22335 > UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory > - > > Key: SPARK-25091 > URL: https://issues.apache.org/jira/browse/SPARK-25091 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: Yunling Cai >Priority: Critical > Attachments: 0.png, 1.png, 2.png, 3.png, 4.png > > > UNCACHE TABLE and CLEAR CACHE do not clean up executor memory. > In the Spark UI, the Storage tab shows the cached table removed, but the > Executors tab shows that the executors continue to hold the RDD and the > memory is not cleared. This results in a huge waste of executor memory. When > we call CACHE TABLE again, the newly cached tables are spilled to disk > instead of reclaiming the memory storage. > Steps to reproduce: > CACHE TABLE test.test_cache; > UNCACHE TABLE test.test_cache; > == Storage shows the table is not cached; Executors show the executor storage > memory does not change == > CACHE TABLE test.test_cache; > CLEAR CACHE; > == Storage shows the table is not cached; Executors show the executor storage > memory does not change == > Similar behavior when using pyspark df.unpersist().
[jira] [Assigned] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25091: Assignee: Apache Spark
[jira] [Assigned] (SPARK-25091) UNCACHE TABLE, CLEAR CACHE, rdd.unpersist() does not clean up executor memory
[ https://issues.apache.org/jira/browse/SPARK-25091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25091: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasou
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603865#comment-16603865 ] Dongjoon Hyun commented on SPARK-25337: --- [~srowen], I reproduced this locally. The failure occurs while executing the old Spark version inside `beforeAll`, so the suite is marked as `aborted`. The root cause is a corrupted classpath for some reason; I'm still investigating. > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... 
> 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", > line 545, in sql > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 63, in deco > 2018-09-04 20:00:04.949 - stdout> File > "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", > line 328, in get_return_value > 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error > occurred while calling o27.sql. > 2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated > {code}
[jira] [Updated] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasourc
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25337: -- Description: Observed in the Scala 2.12 pull request builder consistently now. I don't see this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of hard to see how. CC [~sadhen] {code:java} org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** Exception encountered when invoking run on a nested suite - spark-submit returned with exit code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' ... 2018-09-04 20:00:04.949 - stdout> File "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/session.py", line 545, in sql 2018-09-04 20:00:04.949 - stdout> File "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ 2018-09-04 20:00:04.949 - stdout> File "/private/tmp/test-spark/spark-2.1.3/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco 2018-09-04 20:00:04.949 - stdout> File "/private/tmp/test-spark/spark-2.1.3/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value 2018-09-04 20:00:04.95 - stdout> py4j.protocol.Py4JJavaError: An error occurred while calling o27.sql. 
2018-09-04 20:00:04.95 - stdout> : java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.hive.execution.HiveFileFormat could not be instantiated {code} was: Observed in the Scala 2.12 pull request builder consistently now. I don't see this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of hard to see how. CC [~sadhen] {code:java} org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** Exception encountered when invoking run on a nested suite - spark-submit returned with exit code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' ... 2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be instantiated ... 
2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V 2018-09-04 07:48:30.834 - stdout> at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.(OrcFileFormat.scala:81) ...{code} > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name'
[jira] [Closed] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li closed SPARK-24256. -- > ExpressionEncoder should support user-defined types as fields of Scala case > class and tuple > --- > > Key: SPARK-24256 > URL: https://issues.apache.org/jira/browse/SPARK-24256 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > Right now, ExpressionEncoder supports ser/de of primitive types, as well as > Scala case classes, tuples, and Java bean classes. Spark's Dataset natively > supports these types, but we find Dataset is not flexible for other > user-defined types and encoders. > For example, spark-avro has an AvroEncoder for ser/de of Avro types in Dataset. > Although we can use AvroEncoder to define a Dataset whose type is an Avro > Generic or Specific Record, using such an Avro-typed Dataset has many > limitations: > # We cannot use joinWith on this Dataset, since the result is a tuple, but > Avro types cannot be fields of this tuple. > # We cannot use some type-safe aggregation methods on this Dataset, such as > KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. > # We cannot augment an Avro SpecificRecord with additional primitive fields > together in a case class, which we find is a very common use case. > The limitation is that ExpressionEncoder does not support serde of a Scala case > class/tuple whose subfields are any other user-defined type with its own > Encoder. > To address this issue, we propose a trait as a contract (between > ExpressionEncoder and any other user-defined Encoder) to enable case > classes/tuples/Java beans to support user-defined types.
> With this proposed patch and our minor modification in AvroEncoder, we remove > the above-mentioned limitations with the cluster-default conf > spark.expressionencoder.org.apache.avro.specific.SpecificRecord = > com.databricks.spark.avro.AvroEncoder$ > This is a patch we have implemented internally and have used for a few > quarters. We want to propose it upstream, as we think it is a useful feature > that makes Dataset more flexible for user-defined types.
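The exact shape of the proposed contract is not shown in the ticket; a purely hypothetical sketch of what such a trait could look like (all names invented, not the actual patch):

```scala
import org.apache.spark.sql.Encoder

// Hypothetical contract: an external encoder implements this trait so that
// ExpressionEncoder can delegate serde of a field type it does not natively
// support (e.g. an Avro SpecificRecord nested inside a case class or tuple).
trait ExternalFieldEncoder[T] {
  def fieldClass: Class[_]   // the user-defined field type this encoder handles
  def encoder: Encoder[T]    // the user-supplied Encoder used for that field
}
```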
[jira] [Resolved] (SPARK-25300) Unified the configuration parameter `spark.shuffle.service.enabled`
[ https://issues.apache.org/jira/browse/SPARK-25300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25300. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22306 [https://github.com/apache/spark/pull/22306] > Unified the configuration parameter `spark.shuffle.service.enabled` > --- > > Key: SPARK-25300 > URL: https://issues.apache.org/jira/browse/SPARK-25300 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: liuxian >Assignee: liuxian >Priority: Minor > Fix For: 2.4.0 > > > The configuration parameter "spark.shuffle.service.enabled" is defined in > `package.scala` and is used in many places, so we can replace the string > literal with `SHUFFLE_SERVICE_ENABLED`
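The cleanup pattern the ticket describes is to define the key once as a typed constant and reference that constant at every call site. A sketch using Spark's internal ConfigBuilder (the surrounding code and call site are simplified for illustration):

```scala
import org.apache.spark.internal.config.ConfigBuilder

// Defined once, in core's internal config package object in Spark itself:
val SHUFFLE_SERVICE_ENABLED =
  ConfigBuilder("spark.shuffle.service.enabled")
    .booleanConf
    .createWithDefault(false)

// Call sites then read the typed constant instead of repeating the string:
//   val enabled = conf.get(SHUFFLE_SERVICE_ENABLED)
```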
[jira] [Assigned] (SPARK-25300) Unified the configuration parameter `spark.shuffle.service.enabled`
[ https://issues.apache.org/jira/browse/SPARK-25300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25300: --- Assignee: liuxian
[jira] [Updated] (SPARK-25306) Avoid skewed filter trees to speed up `createFilter` in ORC
[ https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25306: -- Description: In both ORC data sources, createFilter function has exponential time complexity due to its skewed filter tree generation. This PR aims to improve it by using new buildTree function. *REPRODUCE* {code} // Create and read 1 row table with 1000 columns sql("set spark.sql.orc.filterPushdown=true") val selectExpr = (1 to 1000).map(i => s"id c$i") spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc") print(s"With 0 filters, ") spark.time(spark.read.orc("/tmp/orc").count) // Increase the number of filters (20 to 30).foreach { width => val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") print(s"With $width filters, ") spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) } {code} *RESULT* {code} With 0 filters, Time taken: 653 ms With 20 filters, Time taken: 962 ms With 21 filters, Time taken: 1282 ms With 22 filters, Time taken: 1982 ms With 23 filters, Time taken: 3855 ms With 24 filters, Time taken: 6719 ms With 25 filters, Time taken: 12669 ms With 26 filters, Time taken: 25032 ms With 27 filters, Time taken: 49585 ms With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds With 29 filters, Time taken: 198368 ms // over 3 mins With 30 filters, Time taken: 393744 ms // over 6 mins {code} was: In ORC data source, `createFilter` function has exponential time complexity due to lack of memoization like the following. This issue aims to improve it. 
*REPRODUCE* {code} // Create and read 1 row table with 1000 columns sql("set spark.sql.orc.filterPushdown=true") val selectExpr = (1 to 1000).map(i => s"id c$i") spark.range(1).selectExpr(selectExpr: _*).write.mode("overwrite").orc("/tmp/orc") print(s"With 0 filters, ") spark.time(spark.read.orc("/tmp/orc").count) // Increase the number of filters (20 to 30).foreach { width => val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") print(s"With $width filters, ") spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) } {code} *RESULT* {code} With 0 filters, Time taken: 653 ms With 20 filters, Time taken: 962 ms With 21 filters, Time taken: 1282 ms With 22 filters, Time taken: 1982 ms With 23 filters, Time taken: 3855 ms With 24 filters, Time taken: 6719 ms With 25 filters, Time taken: 12669 ms With 26 filters, Time taken: 25032 ms With 27 filters, Time taken: 49585 ms With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds With 29 filters, Time taken: 198368 ms // over 3 mins With 30 filters, Time taken: 393744 ms // over 6 mins {code} > Avoid skewed filter trees to speed up `createFilter` in ORC > --- > > Key: SPARK-25306 > URL: https://issues.apache.org/jira/browse/SPARK-25306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > Fix For: 2.4.0 > > > In both ORC data sources, createFilter function has exponential time > complexity due to its skewed filter tree generation. This PR aims to improve > it by using new buildTree function. 
> *REPRODUCE* > {code} > // Create and read 1 row table with 1000 columns > sql("set spark.sql.orc.filterPushdown=true") > val selectExpr = (1 to 1000).map(i => s"id c$i") > spark.range(1).selectExpr(selectExpr: > _*).write.mode("overwrite").orc("/tmp/orc") > print(s"With 0 filters, ") > spark.time(spark.read.orc("/tmp/orc").count) > // Increase the number of filters > (20 to 30).foreach { width => > val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") > print(s"With $width filters, ") > spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) > } > {code} > *RESULT* > {code} > With 0 filters, Time taken: 653 ms > > With 20 filters, Time taken: 962 ms > With 21 filters, Time taken: 1282 ms > With 22 filters, Time taken: 1982 ms > With 23 filters, Time taken: 3855 ms > With 24 filters, Time taken: 6719 ms > With 25 filters, Time taken: 12669 ms > With 26 filters, Time taken: 25032 ms > With 27 filters, Time taken: 49585 ms > With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds > With 29 filters, Time taken: 198368 ms // over 3 mins > With 30 filters, Time taken: 393744 ms // over 6 mins > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
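The reported blowup comes from `createFilter` building a left-deep (skewed) `And` tree over the pushed predicates, so per-subtree work compounds with tree depth; the fix's `buildTree` combines predicates pairwise into a balanced tree instead. A minimal sketch of the depth difference (illustrative types and names, not Spark's actual implementation):

```scala
// A left fold over n predicates yields a tree of depth n, while pairwise
// combination ("buildTree") yields depth ~log2(n) + 1.
sealed trait Pred
case class Leaf(name: String) extends Pred
case class And(left: Pred, right: Pred) extends Pred

def depth(p: Pred): Int = p match {
  case Leaf(_)   => 1
  case And(l, r) => 1 + math.max(depth(l), depth(r))
}

// Skewed: how a naive reduceLeft builds the tree -- depth grows linearly.
def skewed(preds: Seq[Pred]): Pred = preds.reduceLeft((l, r) => And(l, r))

// Balanced: combine adjacent pairs until a single tree remains.
def buildTree(preds: Seq[Pred]): Pred =
  if (preds.size == 1) preds.head
  else buildTree(preds.grouped(2).map {
    case Seq(l, r) => And(l, r)
    case Seq(l)    => l
  }.toSeq)

val preds: Seq[Pred] = (1 to 30).map(i => Leaf(s"c$i is not null"))
println(depth(skewed(preds)))    // 30
println(depth(buildTree(preds))) // 6
```

With 30 predicates the skewed tree is 30 levels deep while the balanced one is 6, which matches the roughly 2x runtime growth per extra filter seen in the RESULT table above.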
[jira] [Updated] (SPARK-25306) Avoid skewed filter trees to speed up `createFilter` in ORC
[ https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25306: -- Summary: Avoid skewed filter trees to speed up `createFilter` in ORC (was: Use cache to speed up `createFilter` in ORC) > Avoid skewed filter trees to speed up `createFilter` in ORC > --- > > Key: SPARK-25306 > URL: https://issues.apache.org/jira/browse/SPARK-25306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > Fix For: 2.4.0 > > > In ORC data source, `createFilter` function has exponential time complexity > due to lack of memoization like the following. This issue aims to improve it. > *REPRODUCE* > {code} > // Create and read 1 row table with 1000 columns > sql("set spark.sql.orc.filterPushdown=true") > val selectExpr = (1 to 1000).map(i => s"id c$i") > spark.range(1).selectExpr(selectExpr: > _*).write.mode("overwrite").orc("/tmp/orc") > print(s"With 0 filters, ") > spark.time(spark.read.orc("/tmp/orc").count) > // Increase the number of filters > (20 to 30).foreach { width => > val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") > print(s"With $width filters, ") > spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) > } > {code} > *RESULT* > {code} > With 0 filters, Time taken: 653 ms > > With 20 filters, Time taken: 962 ms > With 21 filters, Time taken: 1282 ms > With 22 filters, Time taken: 1982 ms > With 23 filters, Time taken: 3855 ms > With 24 filters, Time taken: 6719 ms > With 25 filters, Time taken: 12669 ms > With 26 filters, Time taken: 25032 ms > With 27 filters, Time taken: 49585 ms > With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds > With 29 filters, Time taken: 198368 ms // over 3 mins > With 30 filters, Time taken: 393744 ms // over 6 mins > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To 
[jira] [Resolved] (SPARK-25306) Use cache to speed up `createFilter` in ORC
[ https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-25306. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22313 [https://github.com/apache/spark/pull/22313] > Use cache to speed up `createFilter` in ORC > --- > > Key: SPARK-25306 > URL: https://issues.apache.org/jira/browse/SPARK-25306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > Fix For: 2.4.0 > > > In ORC data source, `createFilter` function has exponential time complexity > due to lack of memoization like the following. This issue aims to improve it. > *REPRODUCE* > {code} > // Create and read 1 row table with 1000 columns > sql("set spark.sql.orc.filterPushdown=true") > val selectExpr = (1 to 1000).map(i => s"id c$i") > spark.range(1).selectExpr(selectExpr: > _*).write.mode("overwrite").orc("/tmp/orc") > print(s"With 0 filters, ") > spark.time(spark.read.orc("/tmp/orc").count) > // Increase the number of filters > (20 to 30).foreach { width => > val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") > print(s"With $width filters, ") > spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) > } > {code} > *RESULT* > {code} > With 0 filters, Time taken: 653 ms > > With 20 filters, Time taken: 962 ms > With 21 filters, Time taken: 1282 ms > With 22 filters, Time taken: 1982 ms > With 23 filters, Time taken: 3855 ms > With 24 filters, Time taken: 6719 ms > With 25 filters, Time taken: 12669 ms > With 26 filters, Time taken: 25032 ms > With 27 filters, Time taken: 49585 ms > With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds > With 29 filters, Time taken: 198368 ms // over 3 mins > With 30 filters, Time taken: 393744 ms // over 6 mins > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, 
[jira] [Assigned] (SPARK-25306) Use cache to speed up `createFilter` in ORC
[ https://issues.apache.org/jira/browse/SPARK-25306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-25306: --- Assignee: Dongjoon Hyun > Use cache to speed up `createFilter` in ORC > --- > > Key: SPARK-25306 > URL: https://issues.apache.org/jira/browse/SPARK-25306 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1, 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Critical > Fix For: 2.4.0 > > > In ORC data source, `createFilter` function has exponential time complexity > due to lack of memoization like the following. This issue aims to improve it. > *REPRODUCE* > {code} > // Create and read 1 row table with 1000 columns > sql("set spark.sql.orc.filterPushdown=true") > val selectExpr = (1 to 1000).map(i => s"id c$i") > spark.range(1).selectExpr(selectExpr: > _*).write.mode("overwrite").orc("/tmp/orc") > print(s"With 0 filters, ") > spark.time(spark.read.orc("/tmp/orc").count) > // Increase the number of filters > (20 to 30).foreach { width => > val whereExpr = (1 to width).map(i => s"c$i is not null").mkString(" and ") > print(s"With $width filters, ") > spark.time(spark.read.orc("/tmp/orc").where(whereExpr).count) > } > {code} > *RESULT* > {code} > With 0 filters, Time taken: 653 ms > > With 20 filters, Time taken: 962 ms > With 21 filters, Time taken: 1282 ms > With 22 filters, Time taken: 1982 ms > With 23 filters, Time taken: 3855 ms > With 24 filters, Time taken: 6719 ms > With 25 filters, Time taken: 12669 ms > With 26 filters, Time taken: 25032 ms > With 27 filters, Time taken: 49585 ms > With 28 filters, Time taken: 98980 ms// over 1 min 38 seconds > With 29 filters, Time taken: 198368 ms // over 3 mins > With 30 filters, Time taken: 393744 ms // over 6 mins > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603807#comment-16603807 ] Matt Cheah commented on SPARK-25299: (Changed the title to "remote storage" for a little more generalization) > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. 
> In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal.
[jira] [Updated] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matt Cheah updated SPARK-25299: --- Summary: Use remote storage for persisting shuffle data (was: Use distributed storage for persisting shuffle data) > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. 
> In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal.
[jira] [Comment Edited] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603803#comment-16603803 ] Aaron Hiniker edited comment on SPARK-19145 at 9/5/18 1:20 AM: --- I found another (potentially huge) performance impact where the filter won't get pushed down to the reader/scan when there is a `cast` expression involved. I commented on the PR with more details here: [https://github.com/apache/spark/pull/17174#issuecomment-418566743|https://github.com/apache/spark/pull/17174#issuecomment-418566743] was (Author: hindog): I found another (potentially huge) performance impact where the filter won't get pushed down to the reader/scan when there is a `cast` expression involved. I commented on the PR with more details here: [https://github.com/apache/spark/pull/17174#issuecomment-418566743|https://github.com/apache/spark/pull/17174#issuecomment-418566743,] > Timestamp to String casting is slowing the query significantly > -- > > Key: SPARK-19145 > URL: https://issues.apache.org/jira/browse/SPARK-19145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: gagan taneja >Priority: Major > > i have a time series table with timestamp column > Following query > SELECT COUNT(*) AS `count` >FROM `default`.`table` >WHERE `time` >= '2017-01-02 19:53:51' > AND `time` <= '2017-01-09 19:53:51' LIMIT 5 > is significantly SLOWER than > SELECT COUNT(*) AS `count` > FROM `default`.`table` > WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD > HH24:MI:SS−0800') > AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD > HH24:MI:SS−0800') LIMIT 5 > After investigation i found that in the first query time colum is cast to > String before applying the filter > However in the second query no such casting is performed and its a filter > with long value > Below are the generate Physical plan for slower execution followed by > physical plan for faster execution > SELECT COUNT(*) AS `count` >FROM 
`default`.`table` >WHERE `time` >= '2017-01-02 19:53:51' > AND `time` <= '2017-01-09 19:53:51' LIMIT 5 > == Physical Plan == > CollectLimit 5 > +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L]) >+- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#3339L]) > +- *Project > +- *Filter ((isnotnull(time#3314) && (cast(time#3314 as string) > >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 > 19:53:51)) >+- *FileScan parquet default.cstat[time#3314] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], > PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: > struct > SELECT COUNT(*) AS `count` > FROM `default`.`table` > WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD > HH24:MI:SS−0800') > AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD > HH24:MI:SS−0800') LIMIT 5 > == Physical Plan == > CollectLimit 5 > +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L]) >+- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#3287L]) > +- *Project > +- *Filter ((isnotnull(time#3262) && (time#3262 >= > 148340483100)) && (time#3262 <= 148400963100)) >+- *FileScan parquet default.cstat[time#3262] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], > PartitionFilters: [], PushedFilters: [IsNotNull(time), > GreaterThanOrEqual(time,2017-01-02 19:53:51.0), > LessThanOrEqual(time,2017-01-09..., ReadSchema: struct > In Impala both query run efficiently without and performance difference > Spark should be able to parse the Date string and convert to Long/Timestamp > during generation of Optimized Logical Plan so that both the query would have > similar performance -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: 
[jira] [Commented] (SPARK-19145) Timestamp to String casting is slowing the query significantly
[ https://issues.apache.org/jira/browse/SPARK-19145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603803#comment-16603803 ] Aaron Hiniker commented on SPARK-19145: --- I found another (potentially huge) performance impact where the filter won't get pushed down to the reader/scan when there is a `cast` expression involved. I commented on the PR with more details here: [https://github.com/apache/spark/pull/17174#issuecomment-418566743|https://github.com/apache/spark/pull/17174#issuecomment-418566743,] > Timestamp to String casting is slowing the query significantly > -- > > Key: SPARK-19145 > URL: https://issues.apache.org/jira/browse/SPARK-19145 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: gagan taneja >Priority: Major > > i have a time series table with timestamp column > Following query > SELECT COUNT(*) AS `count` >FROM `default`.`table` >WHERE `time` >= '2017-01-02 19:53:51' > AND `time` <= '2017-01-09 19:53:51' LIMIT 5 > is significantly SLOWER than > SELECT COUNT(*) AS `count` > FROM `default`.`table` > WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD > HH24:MI:SS−0800') > AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD > HH24:MI:SS−0800') LIMIT 5 > After investigation i found that in the first query time colum is cast to > String before applying the filter > However in the second query no such casting is performed and its a filter > with long value > Below are the generate Physical plan for slower execution followed by > physical plan for faster execution > SELECT COUNT(*) AS `count` >FROM `default`.`table` >WHERE `time` >= '2017-01-02 19:53:51' > AND `time` <= '2017-01-09 19:53:51' LIMIT 5 > == Physical Plan == > CollectLimit 5 > +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3290L]) >+- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#3339L]) > +- *Project > +- *Filter ((isnotnull(time#3314) && 
(cast(time#3314 as string) > >= 2017-01-02 19:53:51)) && (cast(time#3314 as string) <= 2017-01-09 > 19:53:51)) >+- *FileScan parquet default.cstat[time#3314] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], > PartitionFilters: [], PushedFilters: [IsNotNull(time)], ReadSchema: > struct > SELECT COUNT(*) AS `count` > FROM `default`.`table` > WHERE `time` >= to_utc_timestamp('2017-01-02 19:53:51','-MM-DD > HH24:MI:SS−0800') > AND `time` <= to_utc_timestamp('2017-01-09 19:53:51','-MM-DD > HH24:MI:SS−0800') LIMIT 5 > == Physical Plan == > CollectLimit 5 > +- *HashAggregate(keys=[], functions=[count(1)], output=[count#3238L]) >+- Exchange SinglePartition > +- *HashAggregate(keys=[], functions=[partial_count(1)], > output=[count#3287L]) > +- *Project > +- *Filter ((isnotnull(time#3262) && (time#3262 >= > 148340483100)) && (time#3262 <= 148400963100)) >+- *FileScan parquet default.cstat[time#3262] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[hdfs://10.65.55.220/user/spark/spark-warehouse/cstat], > PartitionFilters: [], PushedFilters: [IsNotNull(time), > GreaterThanOrEqual(time,2017-01-02 19:53:51.0), > LessThanOrEqual(time,2017-01-09..., ReadSchema: struct > In Impala both queries run efficiently without any performance difference > Spark should be able to parse the date string and convert it to Long/Timestamp > during generation of the Optimized Logical Plan so that both queries would have > similar performance
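The faster plan above wins because the string literals are folded into long epoch values once, at planning time, leaving a long-range filter that can be pushed to Parquet; the slower plan instead casts the timestamp column to a string for every row, which also defeats pushdown. A rough sketch of the proposed rewrite, outside Spark (plain JDK time API; names are illustrative):

```scala
import java.time.LocalDateTime
import java.time.ZoneOffset
import java.time.format.DateTimeFormatter

// Parse each bound string once, up front, instead of casting the
// timestamp column to a string for every row.
val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
def toEpochMicros(s: String): Long =
  LocalDateTime.parse(s, fmt).toEpochSecond(ZoneOffset.UTC) * 1000000L

val lower = toEpochMicros("2017-01-02 19:53:51")
val upper = toEpochMicros("2017-01-09 19:53:51")

// The per-row filter is now two long comparisons, and the bounds can be
// pushed to the reader as GreaterThanOrEqual/LessThanOrEqual on `time`.
def keep(timeMicros: Long): Boolean = timeMicros >= lower && timeMicros <= upper
```

Per-row work drops from a timestamp-to-string format plus a string comparison to two primitive comparisons, and the scan can skip row groups entirely via the pushed range.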
[jira] [Resolved] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Fangshi Li resolved SPARK-24256. Resolution: Won't Fix > ExpressionEncoder should support user-defined types as fields of Scala case > class and tuple > --- > > Key: SPARK-24256 > URL: https://issues.apache.org/jira/browse/SPARK-24256 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > Right now, ExpressionEncoder supports ser/de of primitive types, as well as > scala case class, tuple and java bean class. Spark's Dataset natively > supports these mentioned types, but we find Dataset is not flexible for other > user-defined types and encoders. > For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. > Although we can use AvroEncoder to define Dataset with types being the Avro > Generic or Specific Record, using such Avro typed Dataset has many > limitations: > # We can not use joinWith on this Dataset since the result is a tuple, but > Avro types cannot be the field of this tuple. > # We can not use some type-safe aggregation methods on this Dataset, such as > KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. > # We cannot augment an Avro SpecificRecord with additional primitive fields > together in a case class, which we find is a very common use case. > The limitation is that ExpressionEncoder does not support serde of Scala case > class/tuple with subfields being any other user-defined type with its own > Encoder for serde. > To address this issue, we propose a trait as a contract(between > ExpressionEncoder and any other user-defined Encoder) to enable case > class/tuple/java bean to support user-defined types. 
> With this proposed patch and our minor modification in AvroEncoder, we remove > the above-mentioned limitations with the cluster-default conf > spark.expressionencoder.org.apache.avro.specific.SpecificRecord = > com.databricks.spark.avro.AvroEncoder$ > This is a patch we have implemented internally and it has been used for a few > quarters. We want to propose it upstream as we think this is a useful feature > to make Dataset more flexible to user types.
[jira] [Commented] (SPARK-24256) ExpressionEncoder should support user-defined types as fields of Scala case class and tuple
[ https://issues.apache.org/jira/browse/SPARK-24256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603800#comment-16603800 ] Fangshi Li commented on SPARK-24256: To summarize our discussion for this PR: spark-avro is now merged into Spark as a built-in data source. The upstream community is not merging the AvroEncoder to support Avro types in Dataset; instead, the plan is to expose the user-defined type API to support defining arbitrary user types in Dataset. The purpose of this patch is to enable ExpressionEncoder to work together with other types of Encoders, while it seems like upstream prefers to go with UDT. Given this, we can close this PR and start the discussion on UDT in another channel. > ExpressionEncoder should support user-defined types as fields of Scala case > class and tuple > --- > > Key: SPARK-24256 > URL: https://issues.apache.org/jira/browse/SPARK-24256 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Fangshi Li >Priority: Major > > Right now, ExpressionEncoder supports ser/de of primitive types, as well as > scala case class, tuple and java bean class. Spark's Dataset natively > supports these mentioned types, but we find Dataset is not flexible for other > user-defined types and encoders. > For example, spark-avro has an AvroEncoder for ser/de Avro types in Dataset. > Although we can use AvroEncoder to define Dataset with types being the Avro > Generic or Specific Record, using such Avro typed Dataset has many > limitations: > # We can not use joinWith on this Dataset since the result is a tuple, but > Avro types cannot be the field of this tuple. > # We can not use some type-safe aggregation methods on this Dataset, such as > KeyValueGroupedDataset's reduceGroups, since the result is also a tuple. > # We cannot augment an Avro SpecificRecord with additional primitive fields > together in a case class, which we find is a very common use case. 
> The limitation is that ExpressionEncoder does not support serde of Scala case > class/tuple with subfields being any other user-defined type with its own > Encoder for serde. > To address this issue, we propose a trait as a contract (between > ExpressionEncoder and any other user-defined Encoder) to enable case > class/tuple/java bean to support user-defined types. > With this proposed patch and our minor modification in AvroEncoder, we remove > the above-mentioned limitations with the cluster-default conf > spark.expressionencoder.org.apache.avro.specific.SpecificRecord = > com.databricks.spark.avro.AvroEncoder$ > This is a patch we have implemented internally and it has been used for a few > quarters. We want to propose it upstream as we think this is a useful feature > to make Dataset more flexible to user types.
[jira] [Updated] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider
[ https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-25332: - Issue Type: Improvement (was: Bug) > Instead of broadcast hash join ,Sort merge join has selected when restart > spark-shell/spark-JDBC for hive provider > --- > > Key: SPARK-25332 > URL: https://issues.apache.org/jira/browse/SPARK-25332 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Babulal >Priority: Major > > spark.sql("create table x1(name string,age int) stored as parquet ") > spark.sql("insert into x1 select 'a',29") > spark.sql("create table x2 (name string,age int) stored as parquet '") > spark.sql("insert into x2_ex select 'a',29") > scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain > == Physical Plan == > *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, > BuildRight > :- *(2) Project [name#101, age#102] > : +- *(2) Filter isnotnull(name#101) > : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, > PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: > struct > +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true])) > +- *(1) Project [name#103, age#104] > +- *(1) Filter isnotnull(name#103) > +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, > Format: Parquet, Location: > InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, > PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: > struct > > > Now Restart Spark-Shell or do spark-submit orrestart JDBCServer again and > run same select query again > > scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain > scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain > == Physical Plan == > 
*{color:#FF}(5) SortMergeJoin [{color}name#43], [name#45], Inner > :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0 > : +- Exchange hashpartitioning(name#43, 200) > : +- *(1) Project [name#43, age#44] > : +- *(1) Filter isnotnull(name#43) > : +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], > PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: > struct > +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0 > +- Exchange hashpartitioning(name#45, 200) > +- *(3) Project [name#45, age#46] > +- *(3) Filter isnotnull(name#45) > +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: > Parquet, Location: > InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], > PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: > struct > > > scala> spark.sql("desc formatted x1").show(200,false) > ++--+---+ > |col_name |data_type |comment| > ++--+---+ > |name |string |null | > |age |int |null | > | | | | > |# Detailed Table Information| | | > |Database |default | | > |Table |x1 | | > |Owner |Administrator | | > |Created Time |Sun Aug 19 12:36:58 IST 2018 | | > |Last Access |Thu Jan 01 05:30:00 IST 1970 | | > |Created By |Spark 2.3.0 | | > |Type |MANAGED | | > |Provider |hive | | > |Table Properties |[transient_lastDdlTime=1534662418] | | > |Location |file:/D:/spark_release/spark/bin/spark-warehouse/x1 | | > |Serde Library |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | > | > |InputFormat |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat | > | > |OutputFormat > |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat| | > |Storage Properties |[serialization.format=1] | | > |Partition Provider |Catalog | | > ++--+---+ > > With datasource table ,working fine ( create table using parquet instead of > stored by ) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To 
[jira] [Commented] (SPARK-25332) Instead of broadcast hash join ,Sort merge join has selected when restart spark-shell/spark-JDBC for hive provider
[ https://issues.apache.org/jira/browse/SPARK-25332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603761#comment-16603761 ] Takeshi Yamamuro commented on SPARK-25332:
--
Probably, you need to describe this case in more detail. Also, I think it would be better to ask on the spark-user mailing list first.

> Sort merge join is selected instead of broadcast hash join after restarting
> spark-shell/Spark JDBC for the Hive provider
> ---
>
> Key: SPARK-25332
> URL: https://issues.apache.org/jira/browse/SPARK-25332
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.3.0
> Reporter: Babulal
> Priority: Major
>
> spark.sql("create table x1(name string,age int) stored as parquet")
> spark.sql("insert into x1 select 'a',29")
> spark.sql("create table x2 (name string,age int) stored as parquet")
> spark.sql("insert into x2_ex select 'a',29")
>
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *(2) BroadcastHashJoin [name#101], [name#103], Inner, BuildRight
> :- *(2) Project [name#101, age#102]
> :  +- *(2) Filter isnotnull(name#101)
> :     +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
> +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
>    +- *(1) Project [name#103, age#104]
>       +- *(1) Filter isnotnull(name#103)
>          +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
>
> Now restart spark-shell (or spark-submit, or the JDBC server) and run the same select query again:
>
> scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain
> == Physical Plan ==
> *(5) SortMergeJoin [name#43], [name#45], Inner
> :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0
> :  +- Exchange hashpartitioning(name#43, 200)
> :     +- *(1) Project [name#43, age#44]
> :        +- *(1) Filter isnotnull(name#43)
> :           +- *(1) FileScan parquet default.x1[name#43,age#44] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
> +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0
>    +- Exchange hashpartitioning(name#45, 200)
>       +- *(3) Project [name#45, age#46]
>          +- *(3) Filter isnotnull(name#45)
>             +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:string,age:int>
>
> scala> spark.sql("desc formatted x1").show(200,false)
> |col_name                    |data_type                                                     |comment|
> |name                        |string                                                        |null   |
> |age                         |int                                                           |null   |
> |                            |                                                              |       |
> |# Detailed Table Information|                                                              |       |
> |Database                    |default                                                       |       |
> |Table                       |x1                                                            |       |
> |Owner                       |Administrator                                                 |       |
> |Created Time                |Sun Aug 19 12:36:58 IST 2018                                  |       |
> |Last Access                 |Thu Jan 01 05:30:00 IST 1970                                  |       |
> |Created By                  |Spark 2.3.0                                                   |       |
> |Type                        |MANAGED                                                       |       |
> |Provider                    |hive                                                          |       |
> |Table Properties            |[transient_lastDdlTime=1534662418]                            |       |
> |Location                    |file:/D:/spark_release/spark/bin/spark-warehouse/x1           |       |
> |Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |       |
> |InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |       |
> |OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|       |
> |Storage Properties          |[serialization.format=1]                                      |       |
> |Partition Provider          |Catalog                                                       |       |
>
> With a datasource table it works fine (create table using parquet instead of stored as).
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
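A plausible explanation (an assumption, not confirmed in this thread) is that the Hive-provider tables lose their size statistics after a restart, so the planner's size estimate for each table exceeds `spark.sql.autoBroadcastJoinThreshold` and it falls back to a sort merge join. A spark-shell sketch of how one might verify and work around this, assuming an active `spark` session and the `x1`/`x2` tables from the report above:

```scala
// Inspect the stored statistics for the table; if no "Statistics" row
// appears, the planner has no size estimate to base a broadcast on.
spark.sql("DESC EXTENDED x1").filter("col_name = 'Statistics'").show(false)

// Show the threshold the planner compares against (10 MB by default).
println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

// Recompute size statistics so small tables qualify for broadcast again.
spark.sql("ANALYZE TABLE x1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE x2 COMPUTE STATISTICS")

// Re-check the plan: with fresh statistics, tiny tables like these are
// expected to be planned as a BroadcastHashJoin rather than a SortMergeJoin.
spark.sql("select * from x1 t1, x2 t2 where t1.name = t2.name").explain()
```

`ANALYZE TABLE ... COMPUTE STATISTICS` is a standard Spark SQL command; whether it resolves this particular report is not verified here.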
[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603712#comment-16603712 ] Apache Spark commented on SPARK-25258:
--
User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22179

> Upgrade kryo package to version 4.0.2
> -
>
> Key: SPARK-25258
> URL: https://issues.apache.org/jira/browse/SPARK-25258
> Project: Spark
> Issue Type: Wish
> Components: Spark Core
> Affects Versions: 2.1.0, 2.3.1
> Reporter: liupengcheng
> Priority: Major
>
> Recently, we encountered a Kryo performance issue in Spark 2.1.0. The issue affects all Kryo versions below 4.0.2, so it seems that every Spark version might encounter it.
> Issue description:
> In the shuffle write phase or during some spilling operations, Spark uses the Kryo serializer to serialize data if `spark.serializer` is set to `KryoSerializer`. However, when the data contains some extremely large records, KryoSerializer's MapReferenceResolver expands, and its `reset` method takes a long time to set all items in the writtenObjects table back to null.
> com.esotericsoftware.kryo.util.MapReferenceResolver
> {code:java}
> public void reset () {
>     readObjects.clear();
>     writtenObjects.clear();
> }
> public void clear () {
>     K[] keyTable = this.keyTable;
>     for (int i = capacity + stashSize; i-- > 0;)
>         keyTable[i] = null;
>     size = 0;
>     stashSize = 0;
> }
> {code}
> I checked the kryo project on GitHub, and this issue seems to be fixed in 4.0.2+: [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28]
> I was wondering if we could upgrade Spark's Kryo dependency to 4.0.2+ to fix this problem.
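For context, the code path quoted above is only exercised when Kryo serialization is enabled. A minimal sketch of that configuration (the config keys are the standard Spark ones; the buffer sizes are illustrative, not taken from the report):

```scala
import org.apache.spark.sql.SparkSession

// Enable Kryo serialization; shuffle writes and spills then go through
// KryoSerializer, whose MapReferenceResolver is the class quoted above.
val spark = SparkSession.builder()
  .appName("kryo-example")
  .master("local[2]")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Illustrative buffer sizes: extremely large records force the serializer
  // buffer (and the resolver's writtenObjects table) to grow, which is what
  // makes the O(capacity) reset loop expensive in Kryo < 4.0.2.
  .config("spark.kryoserializer.buffer", "64k")
  .config("spark.kryoserializer.buffer.max", "512m")
  .getOrCreate()
```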
[jira] [Assigned] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25258:
Assignee: Apache Spark

> Upgrade kryo package to version 4.0.2
> -
[jira] [Assigned] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25258:
Assignee: (was: Apache Spark)

> Upgrade kryo package to version 4.0.2
> -
[jira] [Updated] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasourc
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25337: -- Description: Observed in the Scala 2.12 pull request builder consistently now. I don't see this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of hard to see how. CC [~sadhen] {code:java} org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** Exception encountered when invoking run on a nested suite - spark-submit returned with exit code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' ... 2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be instantiated ... 2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V 2018-09-04 07:48:30.834 - stdout> at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.(OrcFileFormat.scala:81) ...{code} was: Observed in the Scala 2.12 pull request builder consistently now. I don't see this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of hard to see how. 
CC [~sadhen] and [~dongjoon] {code:java} org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** Exception encountered when invoking run on a nested suite - spark-submit returned with exit code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' ... 2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be instantiated ... 2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V 2018-09-04 07:48:30.834 - stdout> at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.(OrcFileFormat.scala:81) ...{code} > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. 
I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0'
[jira] [Commented] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasou
[ https://issues.apache.org/jira/browse/SPARK-25337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603663#comment-16603663 ] Dongjoon Hyun commented on SPARK-25337: --- I'll take a look, [~srowen]. > HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) > > > Key: SPARK-25337 > URL: https://issues.apache.org/jira/browse/SPARK-25337 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Priority: Major > > Observed in the Scala 2.12 pull request builder consistently now. I don't see > this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of > hard to see how. > CC [~sadhen] > {code:java} > org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** > Exception encountered when invoking run on a nested suite - spark-submit > returned with exit code 1. > Command line: './bin/spark-submit' '--name' 'prepare testing tables' > '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' > 'spark.master.rest.enabled=false' '--conf' > 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' > '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' > > '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' > ... > 2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: > org.apache.spark.sql.sources.DataSourceRegister: Provider > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be > instantiated > ... 
> 2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: > org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V > 2018-09-04 07:48:30.834 - stdout> at > org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.<init>(OrcFileFormat.scala:81) > ...{code}
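The `$init$` in the error is how Scala 2.12 encodes trait initialization: a trait with a concrete field or initializer compiles to a JVM interface carrying a static `$init$` method that implementing classes call from their constructors, whereas Scala 2.11 put that method on a separate `FileFormat$class` helper. A class compiled against the 2.12 encoding therefore throws `NoSuchMethodError` when the trait on the classpath was compiled with 2.11 (or vice versa). A simplified sketch of the mismatch, with a hypothetical trait rather than Spark's actual `FileFormat`:

```scala
// Hypothetical trait, for illustration only (not Spark's FileFormat).
// Compiled with Scala 2.12, this becomes an interface with a static
// Demo.$init$(Demo) method that initializes the `tag` field.
trait Demo {
  val tag: String = "demo"   // concrete member forces trait initialization
}

// Impl's constructor (under the 2.12 encoding) calls Demo.$init$(this).
// If the Demo interface on the runtime classpath was compiled with 2.11,
// which has no static $init$ on the interface, loading Impl fails with:
//   java.lang.NoSuchMethodError: Demo.$init$(LDemo;)V
class Impl extends Demo
```

This matches the suite's setup: it spark-submits against previously released (2.11-built) Spark artifacts, so mixing in any 2.12-compiled class that implements `FileFormat` would produce exactly this error.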
[jira] [Resolved] (SPARK-25297) Future for Scala 2.12 will block on an already shutdown ExecutionContext
[ https://issues.apache.org/jira/browse/SPARK-25297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-25297.
---
Resolution: Duplicate

> Future for Scala 2.12 will block on an already shutdown ExecutionContext
> ---
>
> Key: SPARK-25297
> URL: https://issues.apache.org/jira/browse/SPARK-25297
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 2.4.0
> Reporter: Darcy Shen
> Priority: Major
>
> See [https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/193/]
> The unit tests block on FileBasedWriteAheadLogWithFileCloseAfterWriteSuite in the console output.
[jira] [Commented] (SPARK-24748) Support for reporting custom metrics via Streaming Query Progress
[ https://issues.apache.org/jira/browse/SPARK-24748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603652#comment-16603652 ] Apache Spark commented on SPARK-24748: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/22334 > Support for reporting custom metrics via Streaming Query Progress > - > > Key: SPARK-24748 > URL: https://issues.apache.org/jira/browse/SPARK-24748 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Arun Mahadevan >Assignee: Arun Mahadevan >Priority: Major > Fix For: 2.4.0 > > > Currently the Structured Streaming sources and sinks does not have a way to > report custom metrics. Providing an option to report custom metrics and > making it available via Streaming Query progress can enable sources and sinks > to report custom progress information (E.g. the lag metrics for Kafka source). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24863) Report offset lag as a custom metrics for Kafka structured streaming source
[ https://issues.apache.org/jira/browse/SPARK-24863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603649#comment-16603649 ] Apache Spark commented on SPARK-24863: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/22334 > Report offset lag as a custom metrics for Kafka structured streaming source > --- > > Key: SPARK-24863 > URL: https://issues.apache.org/jira/browse/SPARK-24863 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Assignee: Arun Mahadevan >Priority: Major > Fix For: 2.4.0 > > > We can build on top of SPARK-24748 to report offset lag as a custom metrics > for Kafka structured streaming source. > This is the difference between the latest offsets in Kafka the time the > metrics is reported (just after a micro-batch completes) and the latest > offset Spark has processed. It can be 0 (or close to 0) if spark keeps up > with the rate at which messages are ingested into Kafka topics in steady > state. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25336) Revert SPARK-24863 and SPARK-24748
[ https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603646#comment-16603646 ] Apache Spark commented on SPARK-25336: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/22334 > Revert SPARK-24863 and SPARK-24748 > -- > > Key: SPARK-25336 > URL: https://issues.apache.org/jira/browse/SPARK-25336 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source > v2 APIs are out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25336) Revert SPARK-24863 and SPARK-24748
[ https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25336: - Summary: Revert SPARK-24863 and SPARK-24748 (was: Revert SPARK-24863 and SPARK 24748) > Revert SPARK-24863 and SPARK-24748 > -- > > Key: SPARK-25336 > URL: https://issues.apache.org/jira/browse/SPARK-25336 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Major > > Revert SPARK-24863 and SPARK 24748. We will revisit them when the data source > v2 APIs are out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25336) Revert SPARK-24863 and SPARK-24748
[ https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25336: Assignee: (was: Apache Spark) > Revert SPARK-24863 and SPARK-24748 > -- > > Key: SPARK-25336 > URL: https://issues.apache.org/jira/browse/SPARK-25336 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Major > > Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source > v2 APIs are out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25336) Revert SPARK-24863 and SPARK-24748
[ https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-25336: - Description: Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source v2 APIs are out. (was: Revert SPARK-24863 and SPARK 24748. We will revisit them when the data source v2 APIs are out.) > Revert SPARK-24863 and SPARK-24748 > -- > > Key: SPARK-25336 > URL: https://issues.apache.org/jira/browse/SPARK-25336 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Priority: Major > > Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source > v2 APIs are out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25336) Revert SPARK-24863 and SPARK-24748
[ https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-25336: Assignee: Shixiong Zhu > Revert SPARK-24863 and SPARK-24748 > -- > > Key: SPARK-25336 > URL: https://issues.apache.org/jira/browse/SPARK-25336 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Major > > Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source > v2 APIs are out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25336) Revert SPARK-24863 and SPARK-24748
[ https://issues.apache.org/jira/browse/SPARK-25336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25336: Assignee: Apache Spark > Revert SPARK-24863 and SPARK-24748 > -- > > Key: SPARK-25336 > URL: https://issues.apache.org/jira/browse/SPARK-25336 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark >Priority: Major > > Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source > v2 APIs are out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25337) HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasourc
Sean Owen created SPARK-25337: - Summary: HiveExternalCatalogVersionsSuite + Scala 2.12 = NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;) Key: SPARK-25337 URL: https://issues.apache.org/jira/browse/SPARK-25337 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.0 Reporter: Sean Owen Observed in the Scala 2.12 pull request builder consistently now. I don't see this failing the main 2.11 builds, so assume it's 2.12-related, but, kind of hard to see how. CC [~sadhen] and [~dongjoon] {code:java} org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite *** ABORTED *** Exception encountered when invoking run on a nested suite - spark-submit returned with exit code 1. Command line: './bin/spark-submit' '--name' 'prepare testing tables' '--master' 'local[2]' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' '--conf' 'spark.sql.warehouse.dir=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '--conf' 'spark.sql.test.version.index=0' '--driver-java-options' '-Dderby.system.home=/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/warehouse-37386cdb-c0fb-405d-9442-8f0044b81643' '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.12/sql/hive/target/tmp/test7888487003559759098.py' ... 2018-09-04 07:48:30.833 - stdout> : java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.execution.datasources.orc.OrcFileFormat could not be instantiated ... 
2018-09-04 07:48:30.834 - stdout> Caused by: java.lang.NoSuchMethodError: org.apache.spark.sql.execution.datasources.FileFormat.$init$(Lorg/apache/spark/sql/execution/datasources/FileFormat;)V 2018-09-04 07:48:30.834 - stdout> at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.<init>(OrcFileFormat.scala:81) ...{code}
[jira] [Created] (SPARK-25336) Revert SPARK-24863 and SPARK-24748
Shixiong Zhu created SPARK-25336: Summary: Revert SPARK-24863 and SPARK-24748 Key: SPARK-25336 URL: https://issues.apache.org/jira/browse/SPARK-25336 Project: Spark Issue Type: Task Components: Structured Streaming Affects Versions: 2.4.0 Reporter: Shixiong Zhu Revert SPARK-24863 and SPARK-24748. We will revisit them when the data source v2 APIs are out. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25335: -- Description: Zinc is 23.5MB. {code} $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M {code} Currently, Spark downloads Zinc once. However, it occurs too many times in build systems. This issue aims to skip Zinc downloading when the system already has it. {code} $ build/mvn clean exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz 100.0% {code} This will reduce many resources(CPU/Networks/DISK) at least in Mac and Docker-based build system. was: Zinc is 23.5MB. {code} $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M {code} Currently, Spark downloads Zinc once. However, it occurs too many times in build systems. This issue aims to skip Zinc downloading when the system already has it. {code} $ build/mvn clean exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz 100.0% {code} This will reduce many resources(CPU/Networks/DISK) at least in Mac and Docker-based build system. > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Zinc is 23.5MB. 
> {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once. However, it occurs too many times in > build systems. This issue aims to skip Zinc downloading when the system > already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
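The proposed change boils down to: look for an existing `zinc` before fetching the 23.5 MB tarball. Below is a minimal Python sketch of that decision only — the real logic lives in the `build/mvn` shell script, and `download_zinc` here is a hypothetical placeholder for the `curl` fetch:

```python
import shutil

ZINC_URL = "https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz"

def download_zinc(url):
    # Placeholder for: curl --progress-bar -L <url> | tar xz ...
    # (network fetch deliberately elided in this sketch)
    raise NotImplementedError("network fetch elided in this sketch")

def ensure_zinc(which=shutil.which):
    """Return a path to zinc, downloading only as a fallback.

    `which` is injectable for testing; it defaults to a PATH lookup.
    """
    existing = which("zinc")
    if existing:
        # The system already has zinc: skip the download entirely.
        return existing
    return download_zinc(ZINC_URL)
```

The payoff is exactly what the description claims: CI hosts and Docker images that bake zinc in never pay the download again.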
[jira] [Updated] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25335: -- Description: Zinc is 23.5MB. {code} $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M {code} Currently, Spark downloads Zinc once. However, it occurs too many times in build systems. This issue aims to skip Zinc downloading when the system already has it. {code} $ build/mvn clean exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz 100.0% {code} This will reduce many resources(CPU/Networks/DISK) at least in Mac and Docker-based build system. was: Zinc is 23.5MB. {code} $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M {code} Currently, Spark downloads Zinc once, but this issue aims to skip Zinc downloading when the system already has it. {code} $ build/mvn clean exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz 100.0% {code} This will reduce many resources(CPU/Networks/DISK) at least in Mac and Docker-based build system. > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Zinc is 23.5MB. 
> {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once. However, it occurs too many times in > build systems. This issue aims to skip Zinc downloading when the system > already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25335: Assignee: (was: Apache Spark) > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once, but this issue aims to skip Zinc > downloading when the system already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25335: Assignee: Apache Spark > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once, but this issue aims to skip Zinc > downloading when the system already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603600#comment-16603600 ] Apache Spark commented on SPARK-25335: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/22333 > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once, but this issue aims to skip Zinc > downloading when the system already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25335: -- Description: Zinc is 23.5MB. {code} $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M {code} Currently, Spark downloads Zinc once, but this issue aims to skip Zinc downloading when the system already has it. {code} $ build/mvn clean exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz 100.0% {code} This will reduce many resources(CPU/Networks/DISK) at least in Mac and Docker-based build system. was: Zinc is 23.5MB. {code} $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M {code} Currently, Spark downloads Zinc once, but this issue aims to skip Zinc downloading when the system already has it. {code} $ build/mvn clean exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz 100.0% {code} > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once, but this issue aims to skip Zinc > downloading when the system already has it. 
> {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > > {code} > This will reduce many resources(CPU/Networks/DISK) at least in Mac and > Docker-based build system. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25335) Skip Zinc downloading if it's installed in the system
[ https://issues.apache.org/jira/browse/SPARK-25335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25335: -- Priority: Minor (was: Major) > Skip Zinc downloading if it's installed in the system > - > > Key: SPARK-25335 > URL: https://issues.apache.org/jira/browse/SPARK-25335 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Zinc is 23.5MB. > {code} > $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > % Total% Received % Xferd Average Speed TimeTime Time > Current > Dload Upload Total SpentLeft Speed > 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M > {code} > Currently, Spark downloads Zinc once, but this issue aims to skip Zinc > downloading when the system already has it. > {code} > $ build/mvn clean > exec: curl --progress-bar -L > https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz > > 100.0% > > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25335) Skip Zinc downloading if it's installed in the system
Dongjoon Hyun created SPARK-25335: - Summary: Skip Zinc downloading if it's installed in the system Key: SPARK-25335 URL: https://issues.apache.org/jira/browse/SPARK-25335 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 2.4.0 Reporter: Dongjoon Hyun Zinc is 23.5MB. {code} $ curl -LO https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz % Total% Received % Xferd Average Speed TimeTime Time Current Dload Upload Total SpentLeft Speed 100 23.5M 100 23.5M0 0 35.4M 0 --:--:-- --:--:-- --:--:-- 35.3M {code} Currently, Spark downloads Zinc once, but this issue aims to skip Zinc downloading when the system already has it. {code} $ build/mvn clean exec: curl --progress-bar -L https://downloads.lightbend.com/zinc/0.3.15/zinc-0.3.15.tgz 100.0% {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25333) Ability to add new columns in Dataset in a user-defined position
[ https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walid Mellouli updated SPARK-25333: --- Description: When we add new columns in a Dataset, they are added automatically at the end of the Dataset. Consider this data frame: {code:java} val df = sc.parallelize(Seq(1, 2, 3)).toDF df.printSchema root |-- value: integer (nullable = true) {code} When we add a new column: {code:java} val newDf = df.withColumn("newColumn", col("value") + 1) newDf.printSchema root |-- value: integer (nullable = true) |-- newColumn: integer (nullable = true) {code} Generally users want to add new columns either at the end, in the beginning or in a defined position, depends on use cases. In my case for example, we add technical columns in the beginning of a Dataset and we add business columns at the end. was: When we add new columns in a Dataset, they are added automatically at the end of the Dataset. {code:java} val df = sc.parallelize(Seq(1, 2, 3)).toDF df.printSchema root |-- value: integer (nullable = true) {code} When we add a new column: {code:java} val newDf = df.withColumn("newColumn", col("value") + 1) newDf.printSchema root |-- value: integer (nullable = true) |-- newColumn: integer (nullable = true) {code} Generally users want to add new columns either at the end or in the beginning, depends on use cases. In my case for example, we add technical columns in the beginning of a Dataset and we add business columns at the end. 
Summary: Ability to add new columns in Dataset in a user-defined position (was: Ability to add new columns in the beginning of a Dataset) > Ability to add new columns in Dataset in a user-defined position > > > Key: SPARK-25333 > URL: https://issues.apache.org/jira/browse/SPARK-25333 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Walid Mellouli >Priority: Minor > Labels: pull-request-available > Original Estimate: 2h > Remaining Estimate: 2h > > When we add new columns in a Dataset, they are added automatically at the end > of the Dataset. > Consider this data frame: > {code:java} > val df = sc.parallelize(Seq(1, 2, 3)).toDF > df.printSchema > root > |-- value: integer (nullable = true) > {code} > When we add a new column: > {code:java} > val newDf = df.withColumn("newColumn", col("value") + 1) > newDf.printSchema > root > |-- value: integer (nullable = true) > |-- newColumn: integer (nullable = true) > {code} > Generally users want to add new columns either at the end, in the beginning > or in a defined position, depends on use cases. > In my case for example, we add technical columns in the beginning of a > Dataset and we add business columns at the end. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
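Until an API like the requested one exists, the common workaround is to compute the desired column order yourself and pass it to `select`. The helper below sketches just that ordering step in plain Python — the returned list would be fed to `df.select(*cols)` in PySpark after the `withColumn` call. It is an illustration of the workaround, not a proposed Spark API, and the column names are invented:

```python
def columns_with_insert(existing, new_col, position):
    """Return the column order after inserting new_col at the given index.

    position=0 puts the new column first; position=len(existing) appends,
    which matches withColumn's current end-of-schema behavior.
    """
    cols = list(existing)
    cols.insert(position, new_col)
    return cols

# e.g. a technical column up front, mirroring the use case in the issue
order = columns_with_insert(["value"], "ingest_ts", 0)
print(order)  # → ['ingest_ts', 'value']; then df.withColumn(...).select(*order)
```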
[jira] [Commented] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table
[ https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603590#comment-16603590 ] Ruslan Dautkhanov commented on SPARK-24316: --- Thanks [~bersprockets] Is cloudera spark.2.3.cloudera3 parcel based on upstream Spark 2.3.*2*? As we still see this issue with latest Cloudera's Spark 2.3 parcel ("2.3 release 3"). > Spark sql queries stall for column width more than 6k for parquet based table > -- > > Key: SPARK-24316 > URL: https://issues.apache.org/jira/browse/SPARK-24316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0 >Reporter: Bimalendu Choudhary >Priority: Major > > When we create a table from a data frame using spark sql with columns around > 6k or more, even simple queries of fetching 70k rows takes 20 minutes, while > the same table if we create through Hive with same data , the same query just > takes 5 minutes. > > Instrumenting the code we see that the executors are looping in the while > loop of the function initializeInternal(). The majority of time is getting > spent in the for loop in below code looping through the columns and the > executor appears to be stalled for long time . > > {code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid} > private void initializeInternal() .. > .. > for (int i = 0; i < requestedSchema.getFieldCount(); ++i) > { ... } > } > {code:java} > {code} > > When spark sql is creating table, it also stores the metadata in the > TBLPROPERTIES in json format. We see that if we remove this metadata from the > table the queries become fast , which is the case when we create the same > table through Hive. The exact same table takes 5 times more time with the > Json meta data as compared to without the json metadata. 
> > So looks like as the number of columns are growing bigger than 5 to 6k, the > processing of the metadata and comparing it becomes more and more expensive > and the performance degrades drastically. > To recreate the problem simply run the following query: > import org.apache.spark.sql.SparkSession > val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7") > resp_data.write.format("csv").save("/tmp/filename") > > The table should be created by spark sql from dataframe so that the Json meta > data is stored. For ex:- > val dff = spark.read.format("csv").load("hdfs:///tmp/test.csv") > dff.createOrReplaceTempView("my_temp_table") > val tmp = spark.sql("Create table tableName stored as parquet as select * > from my_temp_table") > > > from pyspark.sql import SQL > Context > sqlContext = SQLContext(sc) > resp_data = spark.sql( " select * from test").limit(2000) > print resp_data_fgv_1k.count() > (resp_data_fgv_1k.write.option('header', > False).mode('overwrite').csv('/tmp/2.csv') ) > > > The performance seems to be even slow in the loop if the schema does not > match or the fields are empty and the code goes into the if condition where > the missing column is marked true: > missingColumns[i] = true; > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23131) Kryo raises StackOverflow during serializing GLR model
[ https://issues.apache.org/jira/browse/SPARK-23131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23131: -- Summary: Kryo raises StackOverflow during serializing GLR model (was: Stackoverflow using ML and Kryo serializer) > Kryo raises StackOverflow during serializing GLR model > -- > > Key: SPARK-23131 > URL: https://issues.apache.org/jira/browse/SPARK-23131 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.2.0 >Reporter: Peigen >Priority: Minor > > When trying to use GeneralizedLinearRegression model and set SparkConf to use > KryoSerializer(JavaSerializer is fine) > It causes StackOverflowException > {quote}Exception in thread "dispatcher-event-loop-34" > java.lang.StackOverflowError > at java.util.HashMap.hash(HashMap.java:338) > at java.util.HashMap.get(HashMap.java:556) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:61) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > at com.esotericsoftware.kryo.Generics.getConcreteClass(Generics.java:62) > {quote} > This is very likely to be > 
[https://github.com/EsotericSoftware/kryo/issues/341] > Upgrade Kryo to 4.0+ probably could fix this > > Wish for upgrade Kryo version for spark -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603466#comment-16603466 ] Dongjoon Hyun commented on SPARK-25258: --- [~yumwang]. You wrote that you had submitted PR, but you didn't use `[SPARK-25258]` in your PR. Is there any reason for that? > Upgrade kryo package to version 4.0.2 > - > > Key: SPARK-25258 > URL: https://issues.apache.org/jira/browse/SPARK-25258 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Priority: Major > > Recently, we encountered a kryo performance issue in spark2.1.0, and the > issue affect all kryo below 4.0.2, so it seems that all spark version might > encounter this issue. > Issue description: > In shuffle write phase or some spilling operation, spark will use kryo > serializer to serialize data if `spark.serializer` is set to > `KryoSerializer`, however, when data contains some extremely large records, > kryoSerializer's MapReferenceResolver would be expand, and it's `reset` > method will take a long time to reset all items in writtenObjects table to > null. > com.esotericsoftware.kryo.util.MapReferenceResolver > {code:java} > public void reset () { > readObjects.clear(); > writtenObjects.clear(); > } > public void clear () { > K[] keyTable = this.keyTable; > for (int i = capacity + stashSize; i-- > 0;) > keyTable[i] = null; > size = 0; > stashSize = 0; > } > {code} > I checked the kryo project in github, and this issue seems fixed in 4.0.2+ > [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28] > > I was wondering if we can make spark kryo package upgrade to 4.0.2+ to fix > this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25258) Upgrade kryo package to version 4.0.2
[ https://issues.apache.org/jira/browse/SPARK-25258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25258: -- Summary: Upgrade kryo package to version 4.0.2 (was: Upgrade kryo package to version 4.0.2+) > Upgrade kryo package to version 4.0.2 > - > > Key: SPARK-25258 > URL: https://issues.apache.org/jira/browse/SPARK-25258 > Project: Spark > Issue Type: Wish > Components: Spark Core >Affects Versions: 2.1.0, 2.3.1 >Reporter: liupengcheng >Priority: Major > > Recently, we encountered a kryo performance issue in spark2.1.0, and the > issue affect all kryo below 4.0.2, so it seems that all spark version might > encounter this issue. > Issue description: > In shuffle write phase or some spilling operation, spark will use kryo > serializer to serialize data if `spark.serializer` is set to > `KryoSerializer`, however, when data contains some extremely large records, > kryoSerializer's MapReferenceResolver would be expand, and it's `reset` > method will take a long time to reset all items in writtenObjects table to > null. > com.esotericsoftware.kryo.util.MapReferenceResolver > {code:java} > public void reset () { > readObjects.clear(); > writtenObjects.clear(); > } > public void clear () { > K[] keyTable = this.keyTable; > for (int i = capacity + stashSize; i-- > 0;) > keyTable[i] = null; > size = 0; > stashSize = 0; > } > {code} > I checked the kryo project in github, and this issue seems fixed in 4.0.2+ > [https://github.com/EsotericSoftware/kryo/commit/77935c696ee4976963aa5c6ac53d53d9b40b8bdd#diff-215fa9846e1e4e54bbeede0500de1e28] > > I was wondering if we can make spark kryo package upgrade to 4.0.2+ to fix > this problem. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
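The quoted `reset()`/`clear()` pair explains the stall: `clear()` walks the hash table's entire backing array (its capacity, not its size), so once one oversized record has grown the table, every later reset pays for the peak capacity even when only a handful of slots are occupied. A rough Python model of the cost difference follows — the variable names are invented, and the second function is shown only to illustrate the asymptotic gap, not the actual Kryo 4.0.2 patch:

```python
def clear_full_scan(table):
    """Kryo <= 4.0.1 style reset: null out every slot, O(capacity)."""
    touched = 0
    for i in range(len(table)):
        table[i] = None
        touched += 1
    return touched

def clear_occupied_only(table, occupied):
    """Alternative: touch only slots known to be in use, O(size)."""
    touched = 0
    for i in occupied:
        table[i] = None
        touched += 1
    return touched

# A table that grew to 2**20 slots while holding a single large record:
capacity, occupied = 2**20, [12345]
print(clear_full_scan([None] * capacity))                # → 1048576 slots touched
print(clear_occupied_only([None] * capacity, occupied))  # → 1 slot touched
```

With a per-spill reset, that million-slot sweep repeats on every shuffle write, which matches the reported slowdown.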
[jira] [Commented] (SPARK-24316) Spark sql queries stall for column width more than 6k for parquet based table
[ https://issues.apache.org/jira/browse/SPARK-24316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603453#comment-16603453 ] Bruce Robbins commented on SPARK-24316: --- This is likely SPARK-25164. > Spark sql queries stall for column width more than 6k for parquet based table > -- > > Key: SPARK-24316 > URL: https://issues.apache.org/jira/browse/SPARK-24316 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0, 2.2.1, 2.3.0, 2.4.0 >Reporter: Bimalendu Choudhary >Priority: Major > > When we create a table from a data frame using spark sql with columns around > 6k or more, even simple queries of fetching 70k rows takes 20 minutes, while > the same table if we create through Hive with same data , the same query just > takes 5 minutes. > > Instrumenting the code we see that the executors are looping in the while > loop of the function initializeInternal(). The majority of time is getting > spent in the for loop in below code looping through the columns and the > executor appears to be stalled for long time . > > {code:java|title=spark/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java|borderStyle=solid} > private void initializeInternal() .. > .. > for (int i = 0; i < requestedSchema.getFieldCount(); ++i) > { ... } > } > {code:java} > {code} > > When spark sql is creating table, it also stores the metadata in the > TBLPROPERTIES in json format. We see that if we remove this metadata from the > table the queries become fast , which is the case when we create the same > table through Hive. The exact same table takes 5 times more time with the > Json meta data as compared to without the json metadata. > > So looks like as the number of columns are growing bigger than 5 to 6k, the > processing of the metadata and comparing it becomes more and more expensive > and the performance degrades drastically. 
> To recreate the problem simply run the following query: > import org.apache.spark.sql.SparkSession > val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7") > resp_data.write.format("csv").save("/tmp/filename") > > The table should be created by spark sql from dataframe so that the Json meta > data is stored. For ex:- > val dff = spark.read.format("csv").load("hdfs:///tmp/test.csv") > dff.createOrReplaceTempView("my_temp_table") > val tmp = spark.sql("Create table tableName stored as parquet as select * > from my_temp_table") > > > from pyspark.sql import SQL > Context > sqlContext = SQLContext(sc) > resp_data = spark.sql( " select * from test").limit(2000) > print resp_data_fgv_1k.count() > (resp_data_fgv_1k.write.option('header', > False).mode('overwrite').csv('/tmp/2.csv') ) > > > The performance seems to be even slow in the loop if the schema does not > match or the fields are empty and the code goes into the if condition where > the missing column is marked true: > missingColumns[i] = true; > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25334) Default SessionCatalog should support UDFs
[ https://issues.apache.org/jira/browse/SPARK-25334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603341#comment-16603341 ] Tomasz Gawęda commented on SPARK-25334: --- If committers say it's not very important, I can start work on this. However, it will probably take more time for me to implement it than for someone who's already a Spark committer :) > Default SessionCatalog should support UDFs > -- > > Key: SPARK-25334 > URL: https://issues.apache.org/jira/browse/SPARK-25334 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 2.3.1 >Reporter: Tomasz Gawęda >Priority: Major > > SessionCatalog calls registerFunction to add a function to function registry. > However, makeFunctionExpression supports only UserDefinedAggregateFunction. > We should make makeFunctionExpression support UserDefinedFunction, as it's > one of functions type. > Currently we can use persistent functions only with Hive metastore, but > "create function" command works also with default SessionCatalog. It > sometimes cause user confusion, like in > https://stackoverflow.com/questions/52164488/spark-hive-udf-no-handler-for-udaf-analysis-exception/52170519 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25334) Default SessionCatalog should support UDFs
[ https://issues.apache.org/jira/browse/SPARK-25334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Gawęda updated SPARK-25334: -- Summary: Default SessionCatalog should support UDFs (was: Default SessionCatalog doesn't support UDFs) > Default SessionCatalog should support UDFs > -- -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25334) Default SessionCatalog doesn't support UDFs
Tomasz Gawęda created SPARK-25334: - Summary: Default SessionCatalog doesn't support UDFs Key: SPARK-25334 URL: https://issues.apache.org/jira/browse/SPARK-25334 Project: Spark Issue Type: Task Components: SQL Affects Versions: 2.3.1 Reporter: Tomasz Gawęda SessionCatalog calls registerFunction to add a function to the function registry. However, makeFunctionExpression supports only UserDefinedAggregateFunction. We should make makeFunctionExpression support UserDefinedFunction, as it's one of the function types. Currently we can use persistent functions only with the Hive metastore, but the "create function" command also works with the default SessionCatalog. This sometimes causes user confusion, like in https://stackoverflow.com/questions/52164488/spark-hive-udf-no-handler-for-udaf-analysis-exception/52170519 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
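The limitation described above can be pictured with a small registry analogy in plain Python (purely illustrative — the real logic lives in Spark's `SessionCatalog`/`makeFunctionExpression`, and these class names are stand-ins): an expression builder that only knows how to wrap one function type rejects everything else, which is the "No handler for UDAF"-style error users hit.

```python
# Illustrative sketch only: a registry whose expression builder handles a
# single function type, mimicking the makeFunctionExpression limitation.

class UserDefinedAggregateFunction:
    pass

class UserDefinedFunction:
    pass

def make_function_expression(clazz):
    # Current behaviour in the sketch: only aggregate functions are supported.
    instance = clazz()
    if isinstance(instance, UserDefinedAggregateFunction):
        return ("udaf-expression", clazz.__name__)
    raise ValueError(f"No handler for UDF/UDAF: {clazz.__name__}")

class MyAggregate(UserDefinedAggregateFunction):
    pass

class MyScalarUdf(UserDefinedFunction):
    pass

print(make_function_expression(MyAggregate))  # the supported path
try:
    make_function_expression(MyScalarUdf)     # the unsupported path users trip over
except ValueError as e:
    print(e)
```

Supporting `UserDefinedFunction` would amount to adding a second branch to the dispatcher, which is essentially what the ticket proposes for the real code.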
[jira] [Updated] (SPARK-25333) Ability to add new columns in the beginning of a Dataset
[ https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Walid Mellouli updated SPARK-25333: --- External issue URL: https://github.com/apache/spark/pull/22332 Labels: pull-request-available (was: ) > Ability to add new columns in the beginning of a Dataset > > > Key: SPARK-25333 > URL: https://issues.apache.org/jira/browse/SPARK-25333 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Walid Mellouli >Priority: Minor > Labels: pull-request-available > Original Estimate: 2h > Remaining Estimate: 2h > > When we add new columns to a Dataset, they are added automatically at the end > of the Dataset. > {code:java} > val df = sc.parallelize(Seq(1, 2, 3)).toDF > df.printSchema > root > |-- value: integer (nullable = true) > {code} > When we add a new column: > {code:java} > val newDf = df.withColumn("newColumn", col("value") + 1) > newDf.printSchema > root > |-- value: integer (nullable = true) > |-- newColumn: integer (nullable = true) > {code} > Generally users want to add new columns either at the end or at the beginning, > depending on the use case. > In my case for example, we add technical columns at the beginning of a > Dataset and business columns at the end. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25333) Ability to add new columns in the beginning of a Dataset
[ https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25333: Assignee: (was: Apache Spark) > Ability to add new columns in the beginning of a Dataset > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25333) Ability to add new columns in the beginning of a Dataset
[ https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603330#comment-16603330 ] Apache Spark commented on SPARK-25333: -- User 'wmellouli' has created a pull request for this issue: https://github.com/apache/spark/pull/22332 > Ability to add new columns in the beginning of a Dataset > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25333) Ability to add new columns in the beginning of a Dataset
[ https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25333: Assignee: Apache Spark > Ability to add new columns in the beginning of a Dataset > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25333) Ability to add new columns in the beginning of a Dataset
[ https://issues.apache.org/jira/browse/SPARK-25333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603329#comment-16603329 ] Apache Spark commented on SPARK-25333: -- User 'wmellouli' has created a pull request for this issue: https://github.com/apache/spark/pull/22332 > Ability to add new columns in the beginning of a Dataset > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25248) Audit barrier APIs for Spark 2.4
[ https://issues.apache.org/jira/browse/SPARK-25248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-25248. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 22240 [https://github.com/apache/spark/pull/22240] > Audit barrier APIs for Spark 2.4 > > > Key: SPARK-25248 > URL: https://issues.apache.org/jira/browse/SPARK-25248 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Major > Fix For: 2.4.0 > > > Make a pass over APIs added for barrier execution mode. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25333) Ability to add new columns in the beginning of a Dataset
Walid Mellouli created SPARK-25333: -- Summary: Ability to add new columns in the beginning of a Dataset Key: SPARK-25333 URL: https://issues.apache.org/jira/browse/SPARK-25333 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Walid Mellouli When we add new columns to a Dataset, they are added automatically at the end of the Dataset. {code:java} val df = sc.parallelize(Seq(1, 2, 3)).toDF df.printSchema root |-- value: integer (nullable = true) {code} When we add a new column: {code:java} val newDf = df.withColumn("newColumn", col("value") + 1) newDf.printSchema root |-- value: integer (nullable = true) |-- newColumn: integer (nullable = true) {code} Generally users want to add new columns either at the end or at the beginning, depending on the use case. In my case for example, we add technical columns at the beginning of a Dataset and business columns at the end. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
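Pending a built-in API, the reordering itself is just list manipulation over the existing column names. A small pure-Python helper for computing the new ordering (the DataFrame usage in the docstring is illustrative and not executed here):

```python
def columns_with_new_first(existing_columns, new_column):
    """Return a column ordering that places new_column at the front.

    With a Spark DataFrame this could be used roughly as (illustrative):
        df.withColumn("newColumn", expr) \
          .select(*columns_with_new_first(df.columns, "newColumn"))
    so the technical column appears first in the schema.
    """
    # Keep every existing column, in order, after the new one.
    return [new_column] + [c for c in existing_columns if c != new_column]

print(columns_with_new_first(["value"], "newColumn"))  # ['newColumn', 'value']
```

This is the workaround users apply today with `select`; the ticket proposes making the placement a first-class option.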
[jira] [Created] (SPARK-25332) Sort merge join is selected instead of broadcast hash join after restarting spark-shell/Spark JDBC for the Hive provider
Babulal created SPARK-25332: --- Summary: Sort merge join is selected instead of broadcast hash join after restarting spark-shell/Spark JDBC for the Hive provider Key: SPARK-25332 URL: https://issues.apache.org/jira/browse/SPARK-25332 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Babulal spark.sql("create table x1(name string,age int) stored as parquet ") spark.sql("insert into x1 select 'a',29") spark.sql("create table x2 (name string,age int) stored as parquet ") spark.sql("insert into x2 select 'a',29") scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain == Physical Plan == *{color:#14892c}(2) BroadcastHashJoin{color} [name#101], [name#103], Inner, BuildRight :- *(2) Project [name#101, age#102] : +- *(2) Filter isnotnull(name#101) : +- *(2) FileScan parquet default.x1_ex[name#101,age#102] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1, PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true])) +- *(1) Project [name#103, age#104] +- *(1) Filter isnotnull(name#103) +- *(1) FileScan parquet default.x2_ex[name#103,age#104] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2, PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct Now restart spark-shell (or spark-submit, or the JDBC server) and run the same select query again scala> spark.sql("select * from x1 t1 ,x2 t2 where t1.name=t2.name").explain == Physical Plan == *{color:#FF}(5) SortMergeJoin{color} [name#43], [name#45], Inner :- *(2) Sort [name#43 ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(name#43, 200) : +- *(1) Project [name#43, age#44] : +- *(1) Filter isnotnull(name#43) : +- *(1) FileScan parquet 
default.x1[name#43,age#44] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x1], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct +- *(4) Sort [name#45 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(name#45, 200) +- *(3) Project [name#45, age#46] +- *(3) Filter isnotnull(name#45) +- *(3) FileScan parquet default.x2[name#45,age#46] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/D:/spark_release/spark/bin/spark-warehouse/x2], PartitionFilters: [], PushedFilters: [IsNotNull(name)], ReadSchema: struct

scala> spark.sql("desc formatted x1").show(200,false)
|col_name                    |data_type                                                     |comment|
|name                        |string                                                        |null   |
|age                         |int                                                           |null   |
|                            |                                                              |       |
|# Detailed Table Information|                                                              |       |
|Database                    |default                                                       |       |
|Table                       |x1                                                            |       |
|Owner                       |Administrator                                                 |       |
|Created Time                |Sun Aug 19 12:36:58 IST 2018                                  |       |
|Last Access                 |Thu Jan 01 05:30:00 IST 1970                                  |       |
|Created By                  |Spark 2.3.0                                                   |       |
|Type                        |MANAGED                                                       |       |
|Provider                    |hive                                                          |       |
|Table Properties            |[transient_lastDdlTime=1534662418]                            |       |
|Location                    |file:/D:/spark_release/spark/bin/spark-warehouse/x1           |       |
|Serde Library               |org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe   |       |
|InputFormat                 |org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat |       |
|OutputFormat                |org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat|       |
|Storage Properties          |[serialization.format=1]                                      |       |
|Partition Provider          |Catalog                                                       |       |

With a datasource table it works fine (create table using parquet instead of stored as parquet). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
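The join-strategy flip is consistent with the planner losing the small size estimate for the Hive-provider table after restart: Spark broadcasts a side only when its estimated size stays at or below `spark.sql.autoBroadcastJoinThreshold` (10 MB by default), and when statistics are missing it falls back to a very large default size (`spark.sql.defaultSizeInBytes`), so the broadcast check fails. A plain-Python sketch of that decision (illustrative, not Spark's actual planner code):

```python
# spark.sql.autoBroadcastJoinThreshold default: 10 MB
DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD = 10 * 1024 * 1024

def pick_join_strategy(estimated_size_bytes,
                       threshold=DEFAULT_AUTO_BROADCAST_JOIN_THRESHOLD):
    # When table stats are missing, the size estimate effectively becomes
    # huge, so the broadcast check below fails and we fall back.
    if estimated_size_bytes <= threshold:
        return "BroadcastHashJoin"
    return "SortMergeJoin"

print(pick_join_strategy(4096))       # fresh session: tiny known size
print(pick_join_strategy(2**63 - 1))  # after restart: stats lost
```

In the report above, the first session still has an accurate (tiny) estimate and broadcasts; after restart the Hive-provider table carries no usable stats, so the same query plans a sort merge join.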
[jira] [Assigned] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-22666: - Assignee: Weichen Xu > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Assignee: Weichen Xu >Priority: Major > > The current API for the new image format is implemented as a standalone > feature, in order to make it reside within the mllib package. As discussed in > SPARK-21866, users should be able to load images through the more common > spark source reader interface. > This ticket is concerned with adding image reading support in the spark > source API, through either of the following interfaces: > - {{spark.read.format("image")...}} > - {{spark.read.image}} > The output is a dataframe that contains images (and the file names for > example), following the semantics discussed already in SPARK-21866. > A few technical notes: > * since the functionality is implemented in {{mllib}}, calling this function > may fail at runtime if users have not imported the {{spark-mllib}} dependency > * How to deal with very flat directories? It is common to have millions of > files in a single "directory" (like in S3), which seems to have caused some > issues to some users. If this issue is too complex to handle in this ticket, > it can be dealt with separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25317) MemoryBlock performance regression
[ https://issues.apache.org/jira/browse/SPARK-25317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603174#comment-16603174 ] Marco Gaido commented on SPARK-25317: - I think I have a fix for this. I can submit a PR if you want, but I am still not sure about the root cause of the regression. My best guess is that there is more than one reason and the perf improvement happens iff all the reasons are fixed, which is rather strange to me. > MemoryBlock performance regression > -- > > Key: SPARK-25317 > URL: https://issues.apache.org/jira/browse/SPARK-25317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Priority: Blocker > > There is a performance regression when calculating the hash code for a UTF8String: > {code:java} > test("hashing") { > import org.apache.spark.unsafe.hash.Murmur3_x86_32 > import org.apache.spark.unsafe.types.UTF8String > val hasher = new Murmur3_x86_32(0) > val str = UTF8String.fromString("b" * 10001) > val numIter = 10 > val start = System.nanoTime > for (i <- 0 until numIter) { > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > Murmur3_x86_32.hashUTF8String(str, 0) > } > val duration = (System.nanoTime() - start) / 1000 / numIter > println(s"duration $duration us") > } > {code} > To run this test in 2.3, we need to add > {code:java} > public static int hashUTF8String(UTF8String str, int seed) { > return hashUnsafeBytes(str.getBaseObject(), str.getBaseOffset(), > str.numBytes(), seed); > } > {code} > to `Murmur3_x86_32` > On my laptop, the result for master vs 2.3 is: 120 us vs 40 us -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
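The benchmark above manually unrolls the hash call 30 times per iteration and reports average microseconds per iteration. Its timing scaffolding, reduced to a language-neutral Python sketch (Python's built-in `hash` is only a stand-in for `Murmur3_x86_32.hashUTF8String`, so the absolute numbers are not comparable to the JIRA figures):

```python
import time

def bench(fn, num_iter=10, calls_per_iter=30):
    # Mirrors the JIRA snippet: call the function many times per iteration
    # and report the average microseconds per iteration.
    start = time.perf_counter_ns()
    for _ in range(num_iter):
        for _ in range(calls_per_iter):
            fn()
    return (time.perf_counter_ns() - start) // 1000 // num_iter

payload = "b" * 10001  # same payload size as the JIRA test
duration_us = bench(lambda: hash(payload))
print(f"duration {duration_us} us")
```

As with the original test, the payload and call counts are chosen to amortize timer overhead; a proper comparison would run both Spark versions with the identical harness.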
[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603143#comment-16603143 ] Mihaly Toth commented on SPARK-25331: - After looking into how this could be solved, there are a few potential ways I could think of: # Make the resulting file names deterministic based on the input. Currently the name contains a UUID, which is by nature different in each run. The question here is whether partitioning of the data can always be done the same way, and what else was the motivation for adding a UUID to the name. # Create a "write ahead manifest file" which contains the generated file names. This could be used in {{ManifestFileCommitProtocol.setupJob}}, which is currently a noop. We may need to store some additional data like partitioning in order to generate the same file contents again. # Document and mandate the use of the manifest file for the consumer of the file output. Currently this file is not mentioned in the docs. Even if it were documented, it would make the life of the consumer more difficult, not to mention that this would be somewhat counterintuitive. Before rushing into the implementation it would make sense to discuss the direction. I would pick the first option if that is possible. > Structured Streaming File Sink duplicates records in case of driver failure > --- > > Key: SPARK-25331 > URL: https://issues.apache.org/jira/browse/SPARK-25331 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.1 >Reporter: Mihaly Toth >Priority: Major > > Let's assume {{FileStreamSink.addBatch}} is called and an appropriate job has > been started by {{FileFormatWrite.write}} and then the resulting task sets > are completed but in the meantime the driver dies. 
In such a case, repeating > {{FileStreamSink.addBatch}} will result in duplicate writing of the data. > In the event the driver fails after the executors start processing the job, > the processed batch will be written twice. > Steps needed: > # call {{FileStreamSink.addBatch}} > # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} > # call {{FileStreamSink.addBatch}} with the same data > # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} > successfully > # Verify the file output - according to the {{Sink.addBatch}} documentation the rdd > should be written only once > I have created a WIP PR with a unit test: > https://github.com/apache/spark/pull/22331 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
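The first option discussed in the comment (deterministic names) hinges on deriving the file name from stable batch metadata rather than a random UUID, so a retried batch produces the same names and overwrites instead of duplicating. A hedged sketch of what such a naming scheme could look like (purely illustrative — this is not the actual `ManifestFileCommitProtocol` API, and the metadata fields are assumptions):

```python
import hashlib

def deterministic_part_file(batch_id, partition_id, schema_fingerprint, ext="csv"):
    # Re-running the same batch over the same partitioning yields the same
    # name, so a retried addBatch can overwrite rather than duplicate.
    digest = hashlib.sha256(
        f"{batch_id}:{partition_id}:{schema_fingerprint}".encode()
    ).hexdigest()[:16]
    return f"part-{partition_id:05d}-{digest}.{ext}"

name_first_try = deterministic_part_file(42, 3, "name:string,age:int")
name_retry = deterministic_part_file(42, 3, "name:string,age:int")
print(name_first_try)
assert name_first_try == name_retry  # idempotent across driver retries
```

This only works if, as the comment notes, the partitioning of the data can be reproduced exactly on retry; otherwise the per-partition contents would differ even though the names match.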
[jira] [Commented] (SPARK-25271) Creating parquet table with all columns null throws exception
[ https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603126#comment-16603126 ] Sujith commented on SPARK-25271: [~cloud_fan] [~sowen] Will this cause a compatibility problem compared to older versions? If a user has a null record, they get an exception with the current version, whereas the older version of Spark (2.2.1) won't throw any exception. I think the output writers have been updated in the below PR [https://github.com/apache/spark/pull/20521] > Creating parquet table with all columns null throws exception > > > Key: SPARK-25271 > URL: https://issues.apache.org/jira/browse/SPARK-25271 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: shivusondur >Priority: Major > > {code:java} > 1)cat /data/parquet.dat > 1$abc2$pqr:3$xyz > null{code} > > {code:java} > 2)spark.sql("create table vp_reader_temp (projects map) ROW > FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' > MAP KEYS TERMINATED BY '$'") > {code} > {code:java} > 3)spark.sql(" > LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp") > {code} > {code:java} > 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from > vp_reader_temp") > {code} > *Result:* Throws an exception (works fine with Spark 2.2.1) > {code:java} > java.lang.RuntimeException: Parquet record is malformed: empty fields are > illegal, the field should be ommited completely instead > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123) > at > 
org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180) > at > org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46) > at > org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112) > at > org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are > 
illegal, the field should be ommited completely instead > at > org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320) > at > org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89) > at >
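The stack trace above ends in parquet-mr's `endField`, which enforces the invariant named in the error message (whose spelling, "ommited", is verbatim from the library): for a null or empty collection the writer must skip the field entirely rather than open and close it with nothing inside. A toy Python model of that constraint (not the actual parquet-mr API; class and method names are illustrative):

```python
# Toy model of the parquet-mr record-consumer invariant behind
# "empty fields are illegal, the field should be ommited completely instead".

class RecordConsumer:
    def __init__(self):
        self.events = []

    def start_field(self, name):
        self.events.append(("start", name))

    def end_field(self, name):
        # parquet-mr raises here when no value was written since start_field.
        if self.events and self.events[-1] == ("start", name):
            raise ValueError(
                "empty fields are illegal, the field should be ommited "
                "completely instead"
            )
        self.events.append(("end", name))

    def add_value(self, v):
        self.events.append(("value", v))

def write_map(consumer, name, mapping):
    # Correct behaviour: omit the field entirely when the map is null/empty.
    if not mapping:
        return
    consumer.start_field(name)
    for k, v in mapping.items():
        consumer.add_value((k, v))
    consumer.end_field(name)
```

A writer that calls `start_field`/`end_field` unconditionally (the buggy path reported here for all-null rows) trips the check; the fix pattern is the early return in `write_map`.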
[jira] [Commented] (SPARK-25271) Creating parquet table with all columns null throws exception
[ https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603128#comment-16603128 ] Sujith commented on SPARK-25271: cc [~hyukjin.kwon] > Creating parquet table with all the column null throws exception > > > Key: SPARK-25271 > URL: https://issues.apache.org/jira/browse/SPARK-25271 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: shivusondur >Priority: Major > > {code:java} > 1)cat /data/parquet.dat > 1$abc2$pqr:3$xyz > null{code} > > {code:java} > 2)spark.sql("create table vp_reader_temp (projects map) ROW > FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' > MAP KEYS TERMINATED BY '$'") > {code} > {code:java} > 3)spark.sql(" > LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp") > {code} > {code:java} > 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from > vp_reader_temp") > {code} > *Result :* Throwing exception (Working fine with spark 2.2.1) > {code:java} > java.lang.RuntimeException: Parquet record is malformed: empty fields are > illegal, the field should be ommited completely instead > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31) > at > org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123) > at > org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180) > at > org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46) > at > org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112) > at > 
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125) > at > org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are > illegal, the field should be ommited completely instead > at > org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320) > at > org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165) > at > 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89) > at > org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60) > ... 21 more > {code}
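The root cause of the exception above is parquet's record-consumer invariant that a field, once started, must receive at least one value. A toy model of that check (hypothetical names, not parquet-mr's or Hive's real API) shows why writing a null or empty map trips it, and how omitting the field entirely avoids it:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of parquet's rule that a started field must contain at least one
// value ("empty fields are illegal"); names are illustrative only.
class ToyRecordConsumer {
    private int valuesInCurrentField = 0;

    void startField(String name) { valuesInCurrentField = 0; }
    void addValue(Object v) { valuesInCurrentField++; }

    void endField(String name) {
        if (valuesInCurrentField == 0)
            throw new RuntimeException("Parquet record is malformed: empty fields are "
                + "illegal, the field should be ommited completely instead");
    }
}

public class EmptyFieldSketch {
    // A writer that starts the map field even when the map has no entries trips
    // the check; the fix is to skip startField/endField for empty or null maps.
    static void writeMap(ToyRecordConsumer c, Map<String, String> m, boolean skipIfEmpty) {
        if (skipIfEmpty && (m == null || m.isEmpty())) return; // omit the field completely
        c.startField("projects");
        if (m != null) for (Map.Entry<String, String> e : m.entrySet()) c.addValue(e);
        c.endField("projects"); // throws if no values were added
    }

    public static void main(String[] args) {
        ToyRecordConsumer c = new ToyRecordConsumer();
        writeMap(c, new HashMap<>(), true);      // empty row omitted: no exception
        try {
            writeMap(c, new HashMap<>(), false); // naive writer: throws
        } catch (RuntimeException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```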
[jira] [Assigned] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25331: Assignee: (was: Apache Spark) > Structured Streaming File Sink duplicates records in case of driver failure > --- > > Key: SPARK-25331 > URL: https://issues.apache.org/jira/browse/SPARK-25331 > Project: Spark > Issue Type: Bug > Components: Structured Streaming > Affects Versions: 2.3.1 > Reporter: Mihaly Toth > Priority: Major > > Let's assume {{FileStreamSink.addBatch}} is called, an appropriate job has been started by {{FileFormatWriter.write}}, and the resulting task sets complete, but in the meantime the driver dies. In such a case, repeating {{FileStreamSink.addBatch}} will write the processed batch twice. > Steps needed: > # call {{FileStreamSink.addBatch}} > # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} > # call {{FileStreamSink.addBatch}} with the same data > # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully > # verify the file output - according to the {{Sink.addBatch}} documentation the RDD should be written only once > I have created a WIP PR with a unit test: > https://github.com/apache/spark/pull/22331
[jira] [Commented] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603120#comment-16603120 ] Apache Spark commented on SPARK-25331: -- User 'misutoth' has created a pull request for this issue: https://github.com/apache/spark/pull/22331
[jira] [Assigned] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25331: Assignee: Apache Spark
[jira] [Created] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
Mihaly Toth created SPARK-25331: --- Summary: Structured Streaming File Sink duplicates records in case of driver failure Key: SPARK-25331 URL: https://issues.apache.org/jira/browse/SPARK-25331 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.1 Reporter: Mihaly Toth Let's assume {{FileStreamSink.addBatch}} is called, an appropriate job has been started by {{FileFormatWriter.write}}, and the resulting task sets complete, but in the meantime the driver dies. In such a case, repeating {{FileStreamSink.addBatch}} will write the processed batch twice. Steps needed: 1. call {{FileStreamSink.addBatch}} 2. make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} 3. call {{FileStreamSink.addBatch}} with the same data 4. make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully 5. verify the file output - according to the {{Sink.addBatch}} documentation the RDD should be written only once I have created a WIP PR with a unit test: https://github.com/apache/spark/pull/22331
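The reproduction steps above can be simulated with a toy sink (illustrative names only, not the real ManifestFileCommitProtocol): tasks write their output files first, and only a successful commitJob records the batch in the manifest, so a driver failure between the two leaves files behind that a replayed batch duplicates:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DuplicateWriteSketch {
    static List<String> files = new ArrayList<>();   // toy output directory
    static Set<Integer> manifest = new HashSet<>();  // batches recorded by commitJob

    // Sketch of addBatch: batches already in the manifest are skipped;
    // otherwise the tasks write files, and commitJob then updates the manifest.
    static void addBatch(int batchId, boolean commitJobSucceeds) {
        if (manifest.contains(batchId)) return;       // already committed: no-op
        files.add("part-00000-batch-" + batchId);     // tasks write their files
        if (commitJobSucceeds) manifest.add(batchId); // a driver failure skips this
    }

    public static void main(String[] args) {
        addBatch(0, false); // driver dies after tasks finish but before commitJob
        addBatch(0, true);  // restarted query replays batch 0
        System.out.println(files.size()); // 2: batch 0's records were written twice
    }
}
```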
[jira] [Updated] (SPARK-25331) Structured Streaming File Sink duplicates records in case of driver failure
[ https://issues.apache.org/jira/browse/SPARK-25331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mihaly Toth updated SPARK-25331: Description: Let's assume {{FileStreamSink.addBatch}} is called, an appropriate job has been started by {{FileFormatWriter.write}}, and the resulting task sets complete, but in the meantime the driver dies. In such a case, repeating {{FileStreamSink.addBatch}} will write the processed batch twice. Steps needed: # call {{FileStreamSink.addBatch}} # make the {{ManifestFileCommitProtocol}} fail to finish its {{commitJob}} # call {{FileStreamSink.addBatch}} with the same data # make the {{ManifestFileCommitProtocol}} finish its {{commitJob}} successfully # verify the file output - according to the {{Sink.addBatch}} documentation the RDD should be written only once I have created a WIP PR with a unit test: https://github.com/apache/spark/pull/22331
[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception
[ https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16603106#comment-16603106 ] shivusondur commented on SPARK-25271: - After further analyzing the issue I found the following: in the private class SingleDirectoryWriteTask (in org.apache.spark.sql.execution.datasources.FileFormatWriter), currentWriter is initialized with a different OutputWriter in Spark 2.2.1 and Spark 2.3.1, as shown below. {code:java} Spark-2.3.1: currentWriter is initialized with "HiveOutputWriter" Spark-2.2.1: currentWriter is initialized with "ParquetOutputWriter" {code} So ParquetOutputWriter may be handling the null/empty values.
[jira] [Commented] (SPARK-19355) Use map output statistics to improve global limit's parallelism
[ https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602990#comment-16602990 ] Apache Spark commented on SPARK-19355: -- User 'jiangxb1987' has created a pull request for this issue: https://github.com/apache/spark/pull/22330 > Use map output statistics to improve global limit's parallelism > > > Key: SPARK-19355 > URL: https://issues.apache.org/jira/browse/SPARK-19355 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Liang-Chi Hsieh > Assignee: Liang-Chi Hsieh > Priority: Major > Fix For: 2.4.0 > > > A logical Limit is actually performed by two physical operations, LocalLimit and GlobalLimit. > Most of the time, before GlobalLimit we perform a shuffle exchange to move the data to a single partition. When the limit number is very big, we shuffle a lot of data to a single partition and significantly reduce parallelism, on top of the cost of the shuffle itself. > This change tries to perform GlobalLimit without shuffling data to a single partition. Instead, we perform the map stage of the shuffle and collect statistics on the number of rows in each partition. Shuffled data are then all retrieved locally, without fetching from remote executors. > Once we get the number of output rows in each partition, we take only the required number of rows from the locally shuffled data.
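The strategy described in the issue - run the map stage, collect per-partition row counts, then take rows from consecutive partitions until the limit is reached - can be sketched as follows (a simplified illustration of the row-count arithmetic, not Spark's implementation):

```java
import java.util.Arrays;

public class GlobalLimitSketch {
    // Given the number of rows each map partition produced, decide how many
    // rows to take from each partition so the total equals the limit (or the
    // total row count, if that is smaller).
    static long[] takePerPartition(long[] rowsPerPartition, long limit) {
        long[] take = new long[rowsPerPartition.length];
        long remaining = limit;
        for (int i = 0; i < rowsPerPartition.length && remaining > 0; i++) {
            take[i] = Math.min(rowsPerPartition[i], remaining);
            remaining -= take[i];
        }
        return take;
    }

    public static void main(String[] args) {
        // Three partitions produced 10, 3 and 8 rows; LIMIT 12 takes 10 + 2 + 0,
        // so no data needs to be shuffled to a single partition.
        long[] take = takePerPartition(new long[]{10, 3, 8}, 12);
        System.out.println(Arrays.toString(take)); // [10, 2, 0]
    }
}
```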
[jira] [Assigned] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25176: Assignee: (was: Apache Spark) > Kryo fails to serialize a parametrised type hierarchy > - > > Key: SPARK-25176 > URL: https://issues.apache.org/jira/browse/SPARK-25176 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.2.2, 2.3.1 > Reporter: Mikhail Pryakhin > Priority: Major > > I'm using the latest Spark version, spark-core_2.11:2.3.1, which transitively depends on com.esotericsoftware:kryo-shaded:3.0.3 via the com.twitter:chill_2.11:0.8.0 dependency. This exact version of the kryo serializer contains an issue [1,2] which results in ClassCastExceptions being thrown when serialising a parameterised type hierarchy. > This issue has been fixed in kryo version 4.0.0 [3]. It would be great to have this update in Spark as well. Could you please upgrade the version of the com.twitter:chill_2.11 dependency from 0.8.0 up to 0.9.2? > You can find a simple test to reproduce the issue [4]. > [1] https://github.com/EsotericSoftware/kryo/issues/384 > [2] https://github.com/EsotericSoftware/kryo/issues/377 > [3] https://github.com/EsotericSoftware/kryo/releases/tag/kryo-parent-4.0.0 > [4] https://github.com/mpryahin/kryo-parametrized-type-inheritance
[jira] [Assigned] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25176: Assignee: Apache Spark
[jira] [Commented] (SPARK-25176) Kryo fails to serialize a parametrised type hierarchy
[ https://issues.apache.org/jira/browse/SPARK-25176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602974#comment-16602974 ] Apache Spark commented on SPARK-25176: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/22179
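For reference, a parameterised hierarchy of the shape the issue describes (names here are illustrative; the linked repro project is the authoritative test case). Java's built-in serialization round-trips such a hierarchy fine, while kryo-shaded 3.0.3 reportedly throws ClassCastException on it, which is what the requested chill 0.9.2 (kryo 4.x) upgrade fixes:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// A minimal parameterised hierarchy (illustrative names, not from the repro).
abstract class Node<T> implements Serializable {
    T value;
    Node(T value) { this.value = value; }
}

class IntNode extends Node<Integer> {
    IntNode(Integer v) { super(v); }
}

public class KryoHierarchySketch {
    public static void main(String[] args) throws Exception {
        // Baseline: plain Java serialization handles the hierarchy correctly.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bos);
        oos.writeObject(new IntNode(42));
        oos.flush();
        IntNode back = (IntNode) new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray())).readObject();
        System.out.println(back.value); // 42
    }
}
```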
[jira] [Comment Edited] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602909#comment-16602909 ] Vladimir Pchelko edited comment on SPARK-20168 at 9/4/18 11:04 AM: --- [~srowen] this bug must be covered by unit tests. > Enable kinesis to start stream from Initial position specified by a timestamp > - > > Key: SPARK-20168 > URL: https://issues.apache.org/jira/browse/SPARK-20168 > Project: Spark > Issue Type: Improvement > Components: DStreams > Affects Versions: 2.1.0 > Reporter: Yash Sharma > Assignee: Yash Sharma > Priority: Minor > Labels: kinesis, streaming > Fix For: 2.4.0 > > > The Kinesis client can resume from a specified timestamp while creating a stream. We should have an option to pass a timestamp in the config to allow Kinesis to resume from the given timestamp. > I have started initial work and will be posting a PR after I test the patch - https://github.com/yssharma/spark/commit/11269abf8b2a533a1b10ceee80ac2c3a2a80c4e8
[jira] [Commented] (SPARK-20168) Enable kinesis to start stream from Initial position specified by a timestamp
[ https://issues.apache.org/jira/browse/SPARK-20168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602909#comment-16602909 ] Vladimir Pchelko commented on SPARK-20168: -- [~srowen] this bug must be covered by unit tests
[jira] [Comment Edited] (SPARK-24189) Spark Structured Streaming not working with Kafka Transactions
[ https://issues.apache.org/jira/browse/SPARK-24189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602889#comment-16602889 ] Binzi Cao edited comment on SPARK-24189 at 9/4/18 10:49 AM: It seems I'm hitting a similar issue. I managed to set the kafka isolation level with {code:java} .option("kafka.isolation.level", "read_committed") {code} and using {code:java} kafka-client 1.0.0 Spark version: 2.3.1{code} and I'm seeing this issue: {code:java} [error] org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 17 in stage 0.0 failed 1 times, most recent failure: Lost task 17.0 in stage 0.0 (TID 17, localhost, executor driver): java.util.concurrent.TimeoutException: Cannot fetch record for offset 145 in 2000 milliseconds [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:271) [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163) [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147) [error] at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109) [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147) [error] at org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54) [error] at org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362) [error] at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:152) [error] at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:143) [error] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) [error] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [error] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [error] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:205) [error] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [error] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) [error] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [error] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(generated.java:230) {code} So it looks like it is not working with a topic with kafka transactions at all. The exception was thrown here: https://github.com/apache/spark/blob/v2.3.1/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L271-L272 Setting {code:java} failOnDataLoss=false {code} can't fix the issue, as the exception is never caught in the KafkaDataConsumer.scala code.
[jira] [Commented] (SPARK-24189) Spark Structured Streaming not working with Kafka Transactions
[ https://issues.apache.org/jira/browse/SPARK-24189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602889#comment-16602889 ] Binzi Cao commented on SPARK-24189: --- It seems I'm hitting a similar issue. I managed to set the kafka isolation level with {code:java} .option("kafka.isolation.level", "read_committed") {code} and using {code:java} kafka-client 1.0.0 {code} and I'm seeing this issue: {code:java} [error] org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 17 in stage 0.0 failed 1 times, most recent failure: Lost task 17.0 in stage 0.0 (TID 17, localhost, executor driver): java.util.concurrent.TimeoutException: Cannot fetch record for offset 145 in 2000 milliseconds [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer.org$apache$spark$sql$kafka010$InternalKafkaConsumer$$fetchData(KafkaDataConsumer.scala:271) [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:163) [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer$$anonfun$get$1.apply(KafkaDataConsumer.scala:147) [error] at org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77) [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer.runUninterruptiblyIfPossible(KafkaDataConsumer.scala:109) [error] at org.apache.spark.sql.kafka010.InternalKafkaConsumer.get(KafkaDataConsumer.scala:147) [error] at org.apache.spark.sql.kafka010.KafkaDataConsumer$class.get(KafkaDataConsumer.scala:54) [error] at org.apache.spark.sql.kafka010.KafkaDataConsumer$CachedKafkaDataConsumer.get(KafkaDataConsumer.scala:362) [error] at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:152) [error] at org.apache.spark.sql.kafka010.KafkaSourceRDD$$anon$1.getNext(KafkaSourceRDD.scala:143) [error] at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) [error] at 
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [error] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [error] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(generated.java:205) [error] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) [error] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614) [error] at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) [error] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.agg_doAggregateWithKeys_0$(generated.java:230) {code} So it looks like it does not work at all with a topic that uses Kafka transactions. The exception was thrown here: https://github.com/apache/spark/blob/v2.3.1/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaDataConsumer.scala#L271-L272 Setting {code:java} failOnDataLoss=false {code} can't fix the issue, as the exception is never caught in the KafkaDataConsumer.scala code. > Spark Structured Streaming not working with Kafka Transactions > -- > > Key: SPARK-24189 > URL: https://issues.apache.org/jira/browse/SPARK-24189 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: bharath kumar avusherla >Priority: Major > > I was trying to read a Kafka transactional topic using Spark Structured Streaming > 2.3.0 with the Kafka option isolation-level = "read_committed", but Spark > reads the data immediately without waiting for the data in the topic to be > committed. The Spark documentation mentions that Structured Streaming > supports Kafka version 0.10 or higher. I am using the command below to test the > scenario. 
> val df = spark > .readStream > .format("kafka") > .option("kafka.bootstrap.servers", "localhost:9092") > .option("subscribe", "test-topic") > .option("isolation-level","read_committed") > .load() > Can you please let me know if transactional reads are supported in Spark > 2.3.0 Structured Streaming, or am I missing anything? > > Thank you.
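Note that the comment and the issue description spell the isolation option differently ({{kafka.isolation.level}} vs. {{isolation-level}}). For reference, a minimal PySpark sketch of the prefixed form, which is how Kafka consumer configs are passed through to the source; the broker address and topic name are illustrative, and it assumes the spark-sql-kafka-0-10 package is on the classpath:

```python
# Sketch only: Kafka consumer configs are passed with a "kafka." prefix, so
# the transactional isolation level is set as below. Broker/topic are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-committed-example").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "test-topic")
      .option("kafka.isolation.level", "read_committed")
      .load())
```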
[jira] [Assigned] (SPARK-25328) Add an example for having two columns as the grouping key in group aggregate pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25328: Assignee: (was: Apache Spark) > Add an example for having two columns as the grouping key in group aggregate > pandas UDF > --- > > Key: SPARK-25328 > URL: https://issues.apache.org/jira/browse/SPARK-25328 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > https://github.com/apache/spark/pull/20295 added an alternative interface for > group aggregate pandas UDFs. It does not have an example that has more than > one column as the grouping key in {{functions.py}}
[jira] [Assigned] (SPARK-25328) Add an example for having two columns as the grouping key in group aggregate pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-25328: Assignee: Apache Spark > Add an example for having two columns as the grouping key in group aggregate > pandas UDF > --- > > Key: SPARK-25328 > URL: https://issues.apache.org/jira/browse/SPARK-25328 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Xiao Li >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/pull/20295 added an alternative interface for > group aggregate pandas UDFs. It does not have an example that has more than > one column as the grouping key in {{functions.py}}
[jira] [Commented] (SPARK-25328) Add an example for having two columns as the grouping key in group aggregate pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-25328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602846#comment-16602846 ] Apache Spark commented on SPARK-25328: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/22329 > Add an example for having two columns as the grouping key in group aggregate > pandas UDF > --- > > Key: SPARK-25328 > URL: https://issues.apache.org/jira/browse/SPARK-25328 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Xiao Li >Priority: Major > > https://github.com/apache/spark/pull/20295 added an alternative interface for > group aggregate pandas UDFs. It does not have an example that has more than > one column as the grouping key in {{functions.py}}
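Since the ticket is documentation-only, the semantics being documented — one aggregate invocation per distinct pair of grouping columns — can be illustrated with a plain-Python sketch (no Spark required; the column names and data below are made up, and this is not the PySpark pandas UDF API itself):

```python
# Plain-Python illustration of a two-column grouping key: the aggregate runs
# once per distinct (col1, col2) pair, mirroring groupBy("col1", "col2").
from collections import defaultdict

rows = [("a", 1, 10.0), ("a", 1, 20.0), ("a", 2, 30.0), ("b", 2, 40.0)]

groups = defaultdict(list)
for c1, c2, v in rows:
    groups[(c1, c2)].append(v)  # the grouping key is a tuple of both columns

# One mean per distinct (col1, col2) pair, as a group-aggregate UDF would see.
mean_by_key = {k: sum(vs) / len(vs) for k, vs in groups.items()}
print(mean_by_key)  # {('a', 1): 15.0, ('a', 2): 30.0, ('b', 2): 40.0}
```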
[jira] [Assigned] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22666: Assignee: Apache Spark > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Assignee: Apache Spark >Priority: Major > > The current API for the new image format is implemented as a standalone > feature, in order to make it reside within the mllib package. As discussed in > SPARK-21866, users should be able to load images through the more common > spark source reader interface. > This ticket is concerned with adding image reading support in the spark > source API, through either of the following interfaces: > - {{spark.read.format("image")...}} > - {{spark.read.image}} > The output is a dataframe that contains images (and the file names for > example), following the semantics discussed already in SPARK-21866. > A few technical notes: > * since the functionality is implemented in {{mllib}}, calling this function > may fail at runtime if users have not imported the {{spark-mllib}} dependency > * How to deal with very flat directories? It is common to have millions of > files in a single "directory" (like in S3), which seems to have caused some > issues to some users. If this issue is too complex to handle in this ticket, > it can be dealt with separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22666: Assignee: (was: Apache Spark) > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Priority: Major > > The current API for the new image format is implemented as a standalone > feature, in order to make it reside within the mllib package. As discussed in > SPARK-21866, users should be able to load images through the more common > spark source reader interface. > This ticket is concerned with adding image reading support in the spark > source API, through either of the following interfaces: > - {{spark.read.format("image")...}} > - {{spark.read.image}} > The output is a dataframe that contains images (and the file names for > example), following the semantics discussed already in SPARK-21866. > A few technical notes: > * since the functionality is implemented in {{mllib}}, calling this function > may fail at runtime if users have not imported the {{spark-mllib}} dependency > * How to deal with very flat directories? It is common to have millions of > files in a single "directory" (like in S3), which seems to have caused some > issues to some users. If this issue is too complex to handle in this ticket, > it can be dealt with separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22666) Spark datasource for image format
[ https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16602845#comment-16602845 ] Apache Spark commented on SPARK-22666: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/22328 > Spark datasource for image format > - > > Key: SPARK-22666 > URL: https://issues.apache.org/jira/browse/SPARK-22666 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Timothy Hunter >Priority: Major > > The current API for the new image format is implemented as a standalone > feature, in order to make it reside within the mllib package. As discussed in > SPARK-21866, users should be able to load images through the more common > spark source reader interface. > This ticket is concerned with adding image reading support in the spark > source API, through either of the following interfaces: > - {{spark.read.format("image")...}} > - {{spark.read.image}} > The output is a dataframe that contains images (and the file names for > example), following the semantics discussed already in SPARK-21866. > A few technical notes: > * since the functionality is implemented in {{mllib}}, calling this function > may fail at runtime if users have not imported the {{spark-mllib}} dependency > * How to deal with very flat directories? It is common to have millions of > files in a single "directory" (like in S3), which seems to have caused some > issues to some users. If this issue is too complex to handle in this ticket, > it can be dealt with separately. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
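The first of the two proposed entry points can be sketched in PySpark. This is a usage sketch only: it assumes the reader proposed in this ticket is available, that spark-mllib is on the classpath, and the directory path is illustrative.

```python
# Usage sketch of the proposed image data source (hypothetical until merged).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("image-source-example").getOrCreate()

# Proposed interface: load a directory of images as a DataFrame whose rows
# carry the image struct discussed in SPARK-21866 (plus e.g. the file name).
images = spark.read.format("image").load("/data/images")
images.printSchema()
```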
[jira] [Updated] (SPARK-25301) When a view uses a UDF from a non-default database, Spark analyser throws AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-25301: - Description: When a hive view uses an UDF from a non default database, Spark analyser throws AnalysisException Steps to simulate this issue - Step 1 : Run following statements in Hive ``` CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address; CREATE DATABASE d100; CREATE FUNCTION d100.udf100 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is created in d100 CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; SELECT * FROM d100.v100; // query on view d100.v100 gives correct result ``` Step2 : Run following statements in Spark - 1) spark.sql("select * from d100.v100").show throws ``` org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. This function is neither a registered temporary function nor a permanent function registered in the database '*default*' ``` This is because, while parsing the SQL statement of the View 'select `d100.udf100`(`emp`.`name`) from `default`.`emp`' , spark parser fails to split database name and udf name and hence Spark function registry tries to load the UDF 'd100.udf100' from 'default' database. was: When a hive view uses an UDF from a non default database, Spark analyser throws AnalysisException Steps to simulate this issue - Step 1 : Run following statements in Hive ```sql CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address; CREATE DATABASE d100; CREATE FUNCTION d100.udf100 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is created in d100 CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; SELECT * FROM d100.v100; // query on view d100.v100 gives correct result ``` Step2 : Run following statements in Spark - 1) spark.sql("select * from d100.v100").show throws ``` org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. 
This function is neither a registered temporary function nor a permanent function registered in the database '*default*' ``` This is because, while parsing the SQL statement of the View 'select `d100.udf100`(`emp`.`name`) from `default`.`emp`' , spark parser fails to split database name and udf name and hence Spark function registry tries to load the UDF 'd100.udf100' from 'default' database. > When a view uses an UDF from a non default database, Spark analyser throws > AnalysisException > > > Key: SPARK-25301 > URL: https://issues.apache.org/jira/browse/SPARK-25301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > > When a hive view uses an UDF from a non default database, Spark analyser > throws AnalysisException > Steps to simulate this issue > - > Step 1 : Run following statements in Hive > > ``` > CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address; > CREATE DATABASE d100; > CREATE FUNCTION d100.udf100 as > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is > created in d100 > CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; > SELECT * FROM d100.v100; // query on view d100.v100 gives correct result > ``` > Step2 : Run following statements in Spark > - > 1) spark.sql("select * from d100.v100").show > throws > ``` > org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. > This function is neither a registered temporary function nor a permanent > function registered in the database '*default*' > ``` > This is because, while parsing the SQL statement of the View 'select > `d100.udf100`(`emp`.`name`) from `default`.`emp`' , spark parser fails to > split database name and udf name and hence Spark function registry tries to > load the UDF 'd100.udf100' from 'default' database. 
[jira] [Updated] (SPARK-25301) When a view uses a UDF from a non-default database, Spark analyser throws AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-25301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod KC updated SPARK-25301: - Description: When a hive view uses an UDF from a non default database, Spark analyser throws AnalysisException Steps to simulate this issue - Step 1 : Run following statements in Hive ```sql CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address; CREATE DATABASE d100; CREATE FUNCTION d100.udf100 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is created in d100 CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; SELECT * FROM d100.v100; // query on view d100.v100 gives correct result ``` Step2 : Run following statements in Spark - 1) spark.sql("select * from d100.v100").show throws ``` org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. This function is neither a registered temporary function nor a permanent function registered in the database '*default*' ``` This is because, while parsing the SQL statement of the View 'select `d100.udf100`(`emp`.`name`) from `default`.`emp`' , spark parser fails to split database name and udf name and hence Spark function registry tries to load the UDF 'd100.udf100' from 'default' database. was: When a hive view uses an UDF from a non default database, Spark analyser throws AnalysisException Steps to simulate this issue - In Hive 1) CREATE DATABASE d100; 2) create function d100.udf100 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is created in d100 3) create view d100.v100 as select *d100.udf100*(name) from default.emp; // Note : table default.emp has two columns 'name', 'address', 5) select * from d100.v100; // query on view d100.v100 gives correct result In Spark - 1) spark.sql("select * from d100.v100").show throws ``` org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. 
This function is neither a registered temporary function nor a permanent function registered in the database '*default*' ``` This is because, while parsing the SQL statement of the View 'select `d100.udf100`(`emp`.`name`) from `default`.`emp`' , spark parser fails to split database name and udf name and hence Spark function registry tries to load the UDF 'd100.udf100' from 'default' database. > When a view uses an UDF from a non default database, Spark analyser throws > AnalysisException > > > Key: SPARK-25301 > URL: https://issues.apache.org/jira/browse/SPARK-25301 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Vinod KC >Priority: Minor > > When a hive view uses an UDF from a non default database, Spark analyser > throws AnalysisException > Steps to simulate this issue > - > Step 1 : Run following statements in Hive > > ```sql > CREATE TABLE emp AS SELECT 'user' AS name, 'address' as address; > CREATE DATABASE d100; > CREATE FUNCTION d100.udf100 as > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFUpper'; // Note: udf100 is > created in d100 > CREATE VIEW d100.v100 AS SELECT d100.udf100(name) FROM default.emp; > SELECT * FROM d100.v100; // query on view d100.v100 gives correct result > ``` > Step2 : Run following statements in Spark > - > 1) spark.sql("select * from d100.v100").show > throws > ``` > org.apache.spark.sql.AnalysisException: Undefined function: '*d100.udf100*'. > This function is neither a registered temporary function nor a permanent > function registered in the database '*default*' > ``` > This is because, while parsing the SQL statement of the View 'select > `d100.udf100`(`emp`.`name`) from `default`.`emp`' , spark parser fails to > split database name and udf name and hence Spark function registry tries to > load the UDF 'd100.udf100' from 'default' database. 
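The root cause described above — the qualified name `d100.udf100` reaching the function registry unsplit — comes down to one missing step. A hypothetical helper sketching that step (an illustration of the idea, not Spark's actual parser code):

```python
# Hypothetical helper: split a possibly database-qualified UDF name into
# (database, function) before the registry lookup, falling back to the
# current database ('default' here) for unqualified names.
def split_function_name(name, current_db="default"):
    if "." in name:
        db, func = name.split(".", 1)
        return db, func
    return current_db, name

print(split_function_name("d100.udf100"))  # ('d100', 'udf100')
print(split_function_name("upper"))        # ('default', 'upper')
```

Without this split, the whole string "d100.udf100" is looked up as a function name inside the 'default' database, which is exactly the failure the AnalysisException reports.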
[jira] [Updated] (SPARK-25330) Permission issue after upgrading Hadoop version to 2.7.7
[ https://issues.apache.org/jira/browse/SPARK-25330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-25330: Description: How to reproduce: {code:java} # build spark ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd spark-2.4.0-SNAPSHOT-bin-SPARK-25330 export HADOOP_PROXY_USER=user_a bin/spark-sql export HADOOP_PROXY_USER=user_b bin/spark-sql{code} {noformat} Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=user_b, access=EXECUTE, inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} was: How to reproduce: {code} # build spark ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd spark-2.4.0-SNAPSHOT-bin-SPARK-25330 export HADOOP_PROXY_USER=user_a bin/spark-sql export HADOOP_PROXY_USER=user_b bin/spark-sql{code} {noformat} Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=user_b, access=EXECUTE, inode="/tmp/hive-$%7Buser.name%7D/b_slng/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) 
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} > Permission issue after upgrade hadoop version to 2.7.7 > -- > > Key: SPARK-25330 > URL: https://issues.apache.org/jira/browse/SPARK-25330 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.3.2, 2.4.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce: > {code:java} > # build spark > ./dev/make-distribution.sh --name SPARK-25330 --tgz -Phadoop-2.7 -Phive > -Phive-thriftserver -Pyarn > tar -zxf spark-2.4.0-SNAPSHOT-bin-SPARK-25330.tgz && cd > spark-2.4.0-SNAPSHOT-bin-SPARK-25330 > export HADOOP_PROXY_USER=user_a > bin/spark-sql > export HADOOP_PROXY_USER=user_b > bin/spark-sql{code} > > {noformat} > Exception in thread "main" java.lang.RuntimeException: > org.apache.hadoop.security.AccessControlException: Permission denied: > user=user_b, access=EXECUTE, > inode="/tmp/hive-$%7Buser.name%7D/user_b/668748f2-f6c5-4325-a797-fd0a7ee7f4d4":user_b:hadoop:drwx-- > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkTraverse(FSPermissionChecker.java:259) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:205) > at > org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:190){noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
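The odd directory name in the stack trace suggests the Hive {{$\{user.name\}}} placeholder was never substituted. One way the observed form could arise (this is an assumption, sketched below, not a confirmed diagnosis) is percent-encoding of the literal braces when the scratch path is turned into a URI:

```python
# Assumption: the "${user.name}" placeholder is left unexpanded and its braces
# are percent-encoded during URI conversion, producing the literal
# "/tmp/hive-$%7Buser.name%7D" prefix seen in the AccessControlException.
from urllib.parse import quote

raw = "/tmp/hive-${user.name}"
encoded = quote(raw, safe="/$")  # keep '/' and '$' unescaped
print(encoded)  # /tmp/hive-$%7Buser.name%7D
```

If that is what happens, every proxy user shares the same unexpanded scratch root, so the second user trips over the first user's mode-700 directory.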