[jira] [Created] (SPARK-26461) Use ConfigEntry for hardcoded configs for dynamicAllocation category.
Takuya Ueshin created SPARK-26461: - Summary: Use ConfigEntry for hardcoded configs for dynamicAllocation category. Key: SPARK-26461 URL: https://issues.apache.org/jira/browse/SPARK-26461 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
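All of these sub-tasks apply the same mechanical change: hardcoded config key strings scattered through Spark Core are replaced with typed ConfigEntry constants built via ConfigBuilder in org.apache.spark.internal.config, so the key, type, default and doc live in one place. A minimal sketch of what such a migration looks like (the entries, docs and defaults below are illustrative assumptions, not the exact entries added by these tickets):

{code}
// Sketch only: ConfigEntry definitions in the style of the
// org.apache.spark.internal.config package object (this code lives inside
// Spark Core itself); names, docs and defaults here are illustrative.
import java.util.concurrent.TimeUnit

private[spark] val DYN_ALLOCATION_ENABLED =
  ConfigBuilder("spark.dynamicAllocation.enabled")
    .doc("Whether to use dynamic resource allocation.")
    .booleanConf
    .createWithDefault(false)

private[spark] val DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT =
  ConfigBuilder("spark.dynamicAllocation.executorIdleTimeout")
    .timeConf(TimeUnit.SECONDS)
    .createWithDefault(60L)

// Call sites then read typed values through the entry:
//   conf.get(DYN_ALLOCATION_ENABLED)                 // Boolean
//   conf.get(DYN_ALLOCATION_EXECUTOR_IDLE_TIMEOUT)   // Long (seconds)
// instead of conf.getBoolean("spark.dynamicAllocation.enabled", false).
{code}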
[jira] [Created] (SPARK-26468) Use ConfigEntry for hardcoded configs for task category.
Takuya Ueshin created SPARK-26468: - Summary: Use ConfigEntry for hardcoded configs for task category. Key: SPARK-26468 URL: https://issues.apache.org/jira/browse/SPARK-26468 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26463) Use ConfigEntry for hardcoded configs for scheduler category.
Takuya Ueshin created SPARK-26463: - Summary: Use ConfigEntry for hardcoded configs for scheduler category. Key: SPARK-26463 URL: https://issues.apache.org/jira/browse/SPARK-26463 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26478) Use ConfigEntry for hardcoded configs for rdd category.
Takuya Ueshin created SPARK-26478: - Summary: Use ConfigEntry for hardcoded configs for rdd category. Key: SPARK-26478 URL: https://issues.apache.org/jira/browse/SPARK-26478 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26491) Use ConfigEntry for hardcoded configs for test category.
Takuya Ueshin created SPARK-26491: - Summary: Use ConfigEntry for hardcoded configs for test category. Key: SPARK-26491 URL: https://issues.apache.org/jira/browse/SPARK-26491 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26489) Use ConfigEntry for hardcoded configs for python category.
Takuya Ueshin created SPARK-26489: - Summary: Use ConfigEntry for hardcoded configs for python category. Key: SPARK-26489 URL: https://issues.apache.org/jira/browse/SPARK-26489 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26490) Use ConfigEntry for hardcoded configs for r category.
Takuya Ueshin created SPARK-26490: - Summary: Use ConfigEntry for hardcoded configs for r category. Key: SPARK-26490 URL: https://issues.apache.org/jira/browse/SPARK-26490 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26488) Use ConfigEntry for hardcoded configs for modify.acl category.
Takuya Ueshin created SPARK-26488: - Summary: Use ConfigEntry for hardcoded configs for modify.acl category. Key: SPARK-26488 URL: https://issues.apache.org/jira/browse/SPARK-26488 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26483) Use ConfigEntry for hardcoded configs for ssl category.
Takuya Ueshin created SPARK-26483: - Summary: Use ConfigEntry for hardcoded configs for ssl category. Key: SPARK-26483 URL: https://issues.apache.org/jira/browse/SPARK-26483 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26486) Use ConfigEntry for hardcoded configs for metrics category.
Takuya Ueshin created SPARK-26486: - Summary: Use ConfigEntry for hardcoded configs for metrics category. Key: SPARK-26486 URL: https://issues.apache.org/jira/browse/SPARK-26486 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26477) Use ConfigEntry for hardcoded configs for unsafe category.
Takuya Ueshin created SPARK-26477: - Summary: Use ConfigEntry for hardcoded configs for unsafe category. Key: SPARK-26477 URL: https://issues.apache.org/jira/browse/SPARK-26477 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26476) Use ConfigEntry for hardcoded configs for cleaner category.
Takuya Ueshin created SPARK-26476: - Summary: Use ConfigEntry for hardcoded configs for cleaner category. Key: SPARK-26476 URL: https://issues.apache.org/jira/browse/SPARK-26476 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26481) Use ConfigEntry for hardcoded configs for reducer category.
Takuya Ueshin created SPARK-26481: - Summary: Use ConfigEntry for hardcoded configs for reducer category. Key: SPARK-26481 URL: https://issues.apache.org/jira/browse/SPARK-26481 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26479) Use ConfigEntry for hardcoded configs for locality category.
Takuya Ueshin created SPARK-26479: - Summary: Use ConfigEntry for hardcoded configs for locality category. Key: SPARK-26479 URL: https://issues.apache.org/jira/browse/SPARK-26479 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26482) Use ConfigEntry for hardcoded configs for ui category.
Takuya Ueshin created SPARK-26482: - Summary: Use ConfigEntry for hardcoded configs for ui category. Key: SPARK-26482 URL: https://issues.apache.org/jira/browse/SPARK-26482 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26485) Use ConfigEntry for hardcoded configs for master.rest/ui categories.
Takuya Ueshin created SPARK-26485: - Summary: Use ConfigEntry for hardcoded configs for master.rest/ui categories. Key: SPARK-26485 URL: https://issues.apache.org/jira/browse/SPARK-26485 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26487) Use ConfigEntry for hardcoded configs for admin category.
Takuya Ueshin created SPARK-26487: - Summary: Use ConfigEntry for hardcoded configs for admin category. Key: SPARK-26487 URL: https://issues.apache.org/jira/browse/SPARK-26487 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26480) Use ConfigEntry for hardcoded configs for broadcast category.
Takuya Ueshin created SPARK-26480: - Summary: Use ConfigEntry for hardcoded configs for broadcast category. Key: SPARK-26480 URL: https://issues.apache.org/jira/browse/SPARK-26480 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26484) Use ConfigEntry for hardcoded configs for authenticate category.
Takuya Ueshin created SPARK-26484: - Summary: Use ConfigEntry for hardcoded configs for authenticate category. Key: SPARK-26484 URL: https://issues.apache.org/jira/browse/SPARK-26484 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26469) Use ConfigEntry for hardcoded configs for io category.
Takuya Ueshin created SPARK-26469: - Summary: Use ConfigEntry for hardcoded configs for io category. Key: SPARK-26469 URL: https://issues.apache.org/jira/browse/SPARK-26469 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26475) Use ConfigEntry for hardcoded configs for buffer category.
Takuya Ueshin created SPARK-26475: - Summary: Use ConfigEntry for hardcoded configs for buffer category. Key: SPARK-26475 URL: https://issues.apache.org/jira/browse/SPARK-26475 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26473) Use ConfigEntry for hardcoded configs for deploy category.
Takuya Ueshin created SPARK-26473: - Summary: Use ConfigEntry for hardcoded configs for deploy category. Key: SPARK-26473 URL: https://issues.apache.org/jira/browse/SPARK-26473 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26471) Use ConfigEntry for hardcoded configs for speculation category.
Takuya Ueshin created SPARK-26471: - Summary: Use ConfigEntry for hardcoded configs for speculation category. Key: SPARK-26471 URL: https://issues.apache.org/jira/browse/SPARK-26471 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26474) Use ConfigEntry for hardcoded configs for worker category.
Takuya Ueshin created SPARK-26474: - Summary: Use ConfigEntry for hardcoded configs for worker category. Key: SPARK-26474 URL: https://issues.apache.org/jira/browse/SPARK-26474 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26472) Use ConfigEntry for hardcoded configs for serializer category.
Takuya Ueshin created SPARK-26472: - Summary: Use ConfigEntry for hardcoded configs for serializer category. Key: SPARK-26472 URL: https://issues.apache.org/jira/browse/SPARK-26472 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26467) Use ConfigEntry for hardcoded configs for rpc category.
Takuya Ueshin created SPARK-26467: - Summary: Use ConfigEntry for hardcoded configs for rpc category. Key: SPARK-26467 URL: https://issues.apache.org/jira/browse/SPARK-26467 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26462) Use ConfigEntry for hardcoded configs for memory category.
Takuya Ueshin created SPARK-26462: - Summary: Use ConfigEntry for hardcoded configs for memory category. Key: SPARK-26462 URL: https://issues.apache.org/jira/browse/SPARK-26462 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26466) Use ConfigEntry for hardcoded configs for submit category.
Takuya Ueshin created SPARK-26466: - Summary: Use ConfigEntry for hardcoded configs for submit category. Key: SPARK-26466 URL: https://issues.apache.org/jira/browse/SPARK-26466 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26465) Use ConfigEntry for hardcoded configs for jars category.
Takuya Ueshin created SPARK-26465: - Summary: Use ConfigEntry for hardcoded configs for jars category. Key: SPARK-26465 URL: https://issues.apache.org/jira/browse/SPARK-26465 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26464) Use ConfigEntry for hardcoded configs for storage category.
Takuya Ueshin created SPARK-26464: - Summary: Use ConfigEntry for hardcoded configs for storage category. Key: SPARK-26464 URL: https://issues.apache.org/jira/browse/SPARK-26464 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26460) Use ConfigEntry for hardcoded kryo/kryoserializer configs.
Takuya Ueshin created SPARK-26460: - Summary: Use ConfigEntry for hardcoded kryo/kryoserializer configs. Key: SPARK-26460 URL: https://issues.apache.org/jira/browse/SPARK-26460 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: Takuya Ueshin -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26460) Use ConfigEntry for hardcoded configs for kryo/kryoserializer categories.
[ https://issues.apache.org/jira/browse/SPARK-26460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takuya Ueshin updated SPARK-26460: -- Summary: Use ConfigEntry for hardcoded configs for kryo/kryoserializer categories. (was: Use ConfigEntry for hardcoded kryo/kryoserializer configs.) > Use ConfigEntry for hardcoded configs for kryo/kryoserializer categories. > - > > Key: SPARK-26460 > URL: https://issues.apache.org/jira/browse/SPARK-26460 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Takuya Ueshin >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20415) SPARK job hangs while writing DataFrame to HDFS
[ https://issues.apache.org/jira/browse/SPARK-20415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730084#comment-16730084 ] Saisai Shao commented on SPARK-20415: - Have you tried latest version of Spark, does this problem still exist in latest version? Also can we have a way to reproduce this problem easily? > SPARK job hangs while writing DataFrame to HDFS > --- > > Key: SPARK-20415 > URL: https://issues.apache.org/jira/browse/SPARK-20415 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 2.1.0 > Environment: EMR 5.4.0 >Reporter: P K >Priority: Major > > We are in POC phase with Spark. One of the Steps is reading compressed json > files that come from sources, "explode" them into tabular format and then > write them to HDFS. This worked for about three weeks until a few days ago, > for a particular dataset, the writer just hangs. I logged in to the worker > machines and see this stack trace: > "Executor task launch worker-0" #39 daemon prio=5 os_prio=0 > tid=0x7f6210352800 nid=0x4542 runnable [0x7f61f52b3000] >java.lang.Thread.State: RUNNABLE > at org.apache.spark.unsafe.Platform.copyMemory(Platform.java:210) > at > org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.writeToMemory(UnsafeArrayData.java:311) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply6_2$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:111) > at > org.apache.spark.sql.execution.GenerateExec$$anonfun$doExecute$1$$anonfun$apply$9.apply(GenerateExec.scala:109) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:211) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:243) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:190) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:188) > at > org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:193) > at > 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > The last messages ever printed in stderr before the hang are: > 17/04/18 01:41:14 INFO DAGScheduler: Final stage: ResultStage 4 (save at > NativeMethodAccessorImpl.java:0) > 17/04/18 01:41:14 INFO DAGScheduler: Parents of final stage: List() > 17/04/18 01:41:14 INFO DAGScheduler: Missing parents: List() > 17/04/18 01:41:14 INFO DAGScheduler: Submitting ResultStage 4 > (MapPartitionsRDD[31] at save at NativeMethodAccessorImpl.java:0), which has > no missing parents > 17/04/18 01:41:14 INFO MemoryStore: Block
[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730029#comment-16730029 ] Cheng Su commented on SPARK-26164: -- [~cloud_fan], could you help take a look at [https://github.com/apache/spark/pull/23163] ? Thanks! > [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort > -- > > Key: SPARK-26164 > URL: https://issues.apache.org/jira/browse/SPARK-26164 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Cheng Su >Priority: Minor > > Problem: > Currently, Spark always requires a local sort on the partition/bucket columns before writing to an > output table [1]. The disadvantage is that the sort might waste reserved CPU time on the executor > due to spill. Hive does not require this local sort before writing the output table [2], and we saw a > performance regression when migrating Hive workloads to Spark. > > Proposal: > We can avoid the local sort by keeping a mapping between file path and output writer. When a row > targets a new file path, we create a new output writer; otherwise, we re-use the output writer that > already exists (the main change should be in FileFormatDataWriter.scala). This is very similar to what > Hive does in [2]. > Since the new behavior (avoiding the sort by keeping multiple output writers) consumes more memory on > the executor than the current behavior (multiple output writers need to be open at the same time, > versus only one), we can add a config to switch between the current and new behavior. > > [1]: spark FileFormatWriter.scala - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123] > [2]: hive FileSinkOperator.java - > [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510] > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
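The proposal above boils down to keeping one open writer per partition/bucket path instead of sorting first. A rough, self-contained sketch of that idea (OutputWriter, newWriter and partitionPath below are stand-ins for illustration, not the actual FileFormatDataWriter API):

{code}
// Rough sketch of the "keep multiple open writers" proposal, not the real change.
import scala.collection.mutable

trait OutputWriter {
  def write(row: Seq[Any]): Unit
  def close(): Unit
}

class MultiWriterTask(
    newWriter: String => OutputWriter,     // opens a writer for a given partition/bucket path
    partitionPath: Seq[Any] => String) {   // derives the output path from a row's partition columns

  // One open writer per distinct path, so the input no longer needs to be
  // sorted on the partition/bucket columns before writing.
  private val writers = mutable.HashMap.empty[String, OutputWriter]

  def writeRow(row: Seq[Any]): Unit = {
    val path = partitionPath(row)
    val writer = writers.getOrElseUpdate(path, newWriter(path))
    writer.write(row)
  }

  // The trade-off called out in the proposal: all writers stay open at once,
  // so per-task memory grows with the number of distinct partitions/buckets.
  def commit(): Unit = writers.values.foreach(_.close())
}
{code}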
[jira] [Commented] (SPARK-26459) remove UpdateNullabilityInAttributeReferences
[ https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730026#comment-16730026 ] Apache Spark commented on SPARK-26459: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23390 > remove UpdateNullabilityInAttributeReferences > - > > Key: SPARK-26459 > URL: https://issues.apache.org/jira/browse/SPARK-26459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26459) remove UpdateNullabilityInAttributeReferences
[ https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730027#comment-16730027 ] Apache Spark commented on SPARK-26459: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/23390 > remove UpdateNullabilityInAttributeReferences > - > > Key: SPARK-26459 > URL: https://issues.apache.org/jira/browse/SPARK-26459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26459) remove UpdateNullabilityInAttributeReferences
[ https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26459: Assignee: Wenchen Fan (was: Apache Spark) > remove UpdateNullabilityInAttributeReferences > - > > Key: SPARK-26459 > URL: https://issues.apache.org/jira/browse/SPARK-26459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26459) remove UpdateNullabilityInAttributeReferences
[ https://issues.apache.org/jira/browse/SPARK-26459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26459: Assignee: Apache Spark (was: Wenchen Fan) > remove UpdateNullabilityInAttributeReferences > - > > Key: SPARK-26459 > URL: https://issues.apache.org/jira/browse/SPARK-26459 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26459) remove UpdateNullabilityInAttributeReferences
Wenchen Fan created SPARK-26459: --- Summary: remove UpdateNullabilityInAttributeReferences Key: SPARK-26459 URL: https://issues.apache.org/jira/browse/SPARK-26459 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26458) OneHotEncoderModel verifies the number of category values incorrectly when tries to transform a dataframe.
duruihuan created SPARK-26458: - Summary: OneHotEncoderModel verifies the number of category values incorrectly when tries to transform a dataframe. Key: SPARK-26458 URL: https://issues.apache.org/jira/browse/SPARK-26458 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.1 Reporter: duruihuan When handleInvalid is set to "keep", one should not compare the categorySizes of transformSchema against the values in the metadata of the dataframe to be transformed, because some columns of that dataframe may contain more than one invalid value, which causes the exception described in lines 302-306 of OneHotEncoderEstimator.scala. In conclusion, I think the verifyNumOfValues check in the transformSchema method, found at line 299 in the code, should be removed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
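For context, a rough sketch of the handleInvalid = "keep" scenario the report describes, using the Spark 2.3 OneHotEncoderEstimator API. The column names and values are made up; whether this exact snippet trips the categorySizes check depends on the ML attribute metadata attached to the input column of the transformed dataframe.

{code}
// Illustrative sketch only: fit on one set of category values, transform data
// that carries values the model never saw, with handleInvalid = "keep".
import org.apache.spark.ml.feature.OneHotEncoderEstimator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("ohe-keep").getOrCreate()
import spark.implicits._

val train = Seq(0.0, 1.0, 2.0).toDF("category")   // 3 category values seen at fit time
val test  = Seq(0.0, 3.0, 4.0).toDF("category")   // contains unseen (invalid) values

val model = new OneHotEncoderEstimator()
  .setInputCols(Array("category"))
  .setOutputCols(Array("categoryVec"))
  .setHandleInvalid("keep")                        // invalid values should map to an extra bucket
  .fit(train)

// With "keep", this transform is expected to succeed; the report above says the
// categorySizes comparison in transformSchema can reject such input instead.
model.transform(test).show(false)
{code}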
[jira] [Updated] (SPARK-26446) Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
[ https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qingxin Wu updated SPARK-26446: --- Description: Add docs to describe how remove policy act while considering the property _*{{spark.dynamicAllocation.cachedExecutorIdleTimeout}}*_ in ExecutorAllocationManager. was: Add docs to describe how remove policy act while considering the property {code:java} spark.dynamicAllocation.cachedExecutorIdleTimeout {code} in ExecutorAllocationManager. > Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager > --- > > Key: SPARK-26446 > URL: https://issues.apache.org/jira/browse/SPARK-26446 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Qingxin Wu >Priority: Minor > > Add docs to describe how remove policy act while considering the property > _*{{spark.dynamicAllocation.cachedExecutorIdleTimeout}}*_ in > ExecutorAllocationManager. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
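For reference, the behavior the requested documentation would describe is roughly the following (values in the sketch are examples only): under dynamic allocation, an idle executor holding no cached blocks is reclaimed after spark.dynamicAllocation.executorIdleTimeout, while an executor holding cached data follows spark.dynamicAllocation.cachedExecutorIdleTimeout, which defaults to infinity.

{code}
// Example only: how the two idle timeouts relate under dynamic allocation.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.shuffle.service.enabled", "true")  // external shuffle service is required for dynamic allocation
  // Idle executors with no cached blocks are removed after this timeout.
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  // Executors holding cached data use this (by default infinite) timeout instead.
  .set("spark.dynamicAllocation.cachedExecutorIdleTimeout", "30min")
{code}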
[jira] [Updated] (SPARK-26446) Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
[ https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qingxin Wu updated SPARK-26446: --- Description: Add docs to describe how remove policy act while considering the property {code:java} spark.dynamicAllocation.cachedExecutorIdleTimeout {code} in ExecutorAllocationManager. was: Add docs to describe how remove policy act while considering the property {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} in ExecutorAllocationManager. > Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager > --- > > Key: SPARK-26446 > URL: https://issues.apache.org/jira/browse/SPARK-26446 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Qingxin Wu >Priority: Minor > > > Add docs to describe how remove policy act while considering the property > {code:java} > spark.dynamicAllocation.cachedExecutorIdleTimeout > {code} > in ExecutorAllocationManager. > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22579) BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be implemented using streaming
[ https://issues.apache.org/jira/browse/SPARK-22579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730007#comment-16730007 ] Andrey Siunov commented on SPARK-22579: --- Is it true that the issue has been resolved in the ticket https://issues.apache.org/jira/browse/SPARK-25905 (PR: https://github.com/apache/spark/pull/23058)? > BlockManager.getRemoteValues and BlockManager.getRemoteBytes should be > implemented using streaming > -- > > Key: SPARK-22579 > URL: https://issues.apache.org/jira/browse/SPARK-22579 > Project: Spark > Issue Type: Improvement > Components: Block Manager, Spark Core >Affects Versions: 2.1.0 >Reporter: Eyal Farago >Priority: Major > > when an RDD partition is cached on an executor bu the task requiring it is > running on another executor (process locality ANY), the cached partition is > fetched via BlockManager.getRemoteValues which delegates to > BlockManager.getRemoteBytes, both calls are blocking. > in my use case I had a 700GB RDD spread over 1000 partitions on a 6 nodes > cluster, cached to disk. rough math shows that average partition size is > 700MB. > looking at spark UI it was obvious that tasks running with process locality > 'ANY' are much slower than local tasks (~40 seconds to 8-10 minutes ratio), I > was able to capture thread dumps of executors executing remote tasks and got > this stake trace: > {quote}Thread ID Thread Name Thread StateThread Locks > 1521 Executor task launch worker-1000WAITING > Lock(java.util.concurrent.ThreadPoolExecutor$Worker@196462978}) > sun.misc.Unsafe.park(Native Method) > java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) > java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997) > java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304) > scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202) > scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218) > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > scala.concurrent.Await$.result(package.scala:190) > org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190) > org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104) > org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:582) > org.apache.spark.storage.BlockManager.getRemoteValues(BlockManager.scala:550) > org.apache.spark.storage.BlockManager.get(BlockManager.scala:638) > org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:690) > org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334) > org.apache.spark.rdd.RDD.iterator(RDD.scala:285) > org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > org.apache.spark.rdd.RDD.iterator(RDD.scala:287){quote} > digging into the code showed that the block manager first fetches all bytes > (getRemoteBytes) and then wraps it with a deserialization 
stream. This has > several drawbacks: > 1. Blocking: the requesting executor is blocked while the remote executor is > serving the block. > 2. Potentially large memory footprint on the requesting executor; in my use case > 700MB of raw bytes were stored in a ChunkedByteBuffer. > 3. Inefficient: the requesting side usually doesn't need all values at once, as it > consumes the values via an iterator. > 4. Potentially large memory footprint on the serving executor; in case the block > is cached in deserialized form, the serving executor has to serialize it into > a ChunkedByteBuffer (BlockManager.doGetLocalBytes). This is both memory & CPU > intensive; the memory footprint can be reduced by using a limited buffer for > serialization, 'spilling' to the response stream. > I suggest improving this either by implementing a full streaming mechanism or > some kind of pagination mechanism. In addition, the requesting executor should > be able to make progress with the data it already has, blocking only when the > local buffer is exhausted and the remote side hasn't delivered the next chunk of > the stream (or page, in case of pagination) yet. -- This message was sent by Atlassian JIRA
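To make the suggestion concrete, here is a rough sketch of the pagination idea only: pull the remote block in bounded ranges and let deserialization proceed as the stream is consumed, instead of materializing the whole block in a ChunkedByteBuffer first. fetchChunk is a hypothetical range-fetch call, not an existing BlockTransferService method.

{code}
// Sketch of a paginated remote block read; fetchChunk is hypothetical.
import java.io.{InputStream, SequenceInputStream}
import scala.collection.JavaConverters._

def remoteBlockStream(
    blockId: String,
    blockSize: Long,
    chunkSize: Long,
    fetchChunk: (String, Long, Long) => InputStream): InputStream = {
  // Lazily enumerate [offset, offset + chunkSize) ranges; each range is only
  // fetched once the previous chunk's stream is exhausted, so the requesting
  // executor blocks per chunk rather than per whole block.
  val chunks: Iterator[InputStream] =
    (0L until blockSize by chunkSize).iterator.map { offset =>
      val len = math.min(chunkSize, blockSize - offset)
      fetchChunk(blockId, offset, len)
    }
  new SequenceInputStream(chunks.asJavaEnumeration)
}
{code}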
[jira] [Updated] (SPARK-26446) Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
[ https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qingxin Wu updated SPARK-26446: --- Description: Add docs to describe how remove policy act while considering the property {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} in ExecutorAllocationManager. > Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager > --- > > Key: SPARK-26446 > URL: https://issues.apache.org/jira/browse/SPARK-26446 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Qingxin Wu >Priority: Minor > > Add docs to describe how remove policy act while considering the property > {{spark.dynamicAllocation.cachedExecutorIdleTimeout}} in > ExecutorAllocationManager. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26446) Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager
[ https://issues.apache.org/jira/browse/SPARK-26446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qingxin Wu updated SPARK-26446: --- Summary: Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager (was: improve doc on ExecutorAllocationManager) > Add cachedExecutorIdleTimeout docs at ExecutorAllocationManager > --- > > Key: SPARK-26446 > URL: https://issues.apache.org/jira/browse/SPARK-26446 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Qingxin Wu >Priority: Minor > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24630) SPIP: Support SQLStreaming in Spark
[ https://issues.apache.org/jira/browse/SPARK-24630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16730003#comment-16730003 ] Jackey Lee commented on SPARK-24630: [~jackylk] I have finished a detailed design doc; we can talk about it on the [mailing list|http://apache-spark-developers-list.1001551.n3.nabble.com/Support-SqlStreaming-in-spark-td24202.html] or in the [doc|https://docs.google.com/document/d/19degwnIIcuMSELv6BQ_1VQI5AIVcvGeqOm5xE2-aRA0/edit]. > SPIP: Support SQLStreaming in Spark > --- > > Key: SPARK-24630 > URL: https://issues.apache.org/jira/browse/SPARK-24630 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0, 2.2.1 >Reporter: Jackey Lee >Priority: Minor > Labels: SQLStreaming > Attachments: SQLStreaming SPIP V2.pdf > > > At present, KafkaSQL, Flink SQL (which is actually based on Calcite), > SQLStream, and StormSQL all provide a streaming SQL interface, with which users > with little knowledge of streaming can easily develop a stream processing model. > In Spark, we can also support such a SQL API based on Structured Streaming. > To support SQL Streaming, there are two key points: > 1. Analysis should be able to parse streaming-type SQL. > 2. The Analyzer should be able to map metadata information to the corresponding > Relation. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26437) Decimal data becomes bigint to query, unable to query
[ https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26437: -- Affects Version/s: 1.6.3 > Decimal data becomes bigint to query, unable to query > - > > Key: SPARK-26437 > URL: https://issues.apache.org/jira/browse/SPARK-26437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.3, 2.0.2, 2.1.3, 2.2.2, 2.3.1 >Reporter: zengxl >Priority: Major > Fix For: 3.0.0 > > > this is my sql: > create table tmp.tmp_test_6387_1224_spark stored as ORCFile as select 0.00 > as a > select a from tmp.tmp_test_6387_1224_spark > CREATE TABLE `tmp.tmp_test_6387_1224_spark`( > {color:#f79232} `a` decimal(2,2)){color} > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > When I query this table(use hive or sparksql,the exception is same), I throw > the following exception information > *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed > stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 > limit: 0* > *at > org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)* > *at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)* > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26437) Decimal data becomes bigint to query, unable to query
[ https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26437. --- Resolution: Fixed Fix Version/s: 3.0.0 > Decimal data becomes bigint to query, unable to query > - > > Key: SPARK-26437 > URL: https://issues.apache.org/jira/browse/SPARK-26437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1 >Reporter: zengxl >Priority: Major > Fix For: 3.0.0 > > > this is my sql: > create table tmp.tmp_test_6387_1224_spark stored as ORCFile as select 0.00 > as a > select a from tmp.tmp_test_6387_1224_spark > CREATE TABLE `tmp.tmp_test_6387_1224_spark`( > {color:#f79232} `a` decimal(2,2)){color} > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > When I query this table(use hive or sparksql,the exception is same), I throw > the following exception information > *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed > stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 > limit: 0* > *at > org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)* > *at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)* > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26437) Decimal data becomes bigint to query, unable to query
[ https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729945#comment-16729945 ] Dongjoon Hyun commented on SPARK-26437: --- Hi, [~zengxl]. Thank you for reporting. This is a very old issue, present since Apache Spark 1.x, which occurs when you use `decimal`. Please note the `CAST` and `decimal` in the following example. Since Spark 2.0, the `0.0` literal is interpreted as `Decimal`, so you are hitting this issue without casting, too. This is fixed on the `master` branch and will be released as Apache Spark 3.0.0.
{code}
scala> sc.version
res0: String = 1.6.3

scala> sql("drop table spark_orc")

scala> sql("create table spark_orc stored as orc as select cast(0.00 as decimal(2,2)) as a")

scala> sql("select * from spark_orc").show
...
Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 limit: 0
{code}
If you are interested, the following are the details. First, the underlying ORC issue (HIVE-13083) is fixed in Hive 1.3.0, but Spark is still using the embedded Hive 1.2.1. To avoid the underlying ORC issue, you can use the new ORC data source (`set spark.sql.orc.impl=native`). So, in Spark 2.4.0, you can use the `USING` syntax to avoid this.
{code}
scala> sql("create table spark_orc using orc as select 0.00 as a")

scala> sql("select * from spark_orc").show
+----+
|   a|
+----+
|0.00|
+----+

scala> spark.version
res2: String = 2.4.0
{code}
Second, SPARK-22977 caused a regression on CTAS at Spark 2.3.0, which was recently fixed by SPARK-25271 (Hive CTAS commands should use data source if it is convertible) for Apache Spark 3.0.0. In Spark 3.0.0, you can use the `STORED AS ORC` syntax without this problem.
{code}
scala> sql("create table spark_orc stored as orc as select 0.00 as a")

scala> sql("select * from spark_orc").show
+----+
|   a|
+----+
|0.00|
+----+

scala> spark.version
res3: String = 3.0.0-SNAPSHOT
{code}
So, I'll close this issue since this is fixed in 3.0.0.
cc [~cloud_fan], [~viirya], [~smilegator], [~hyukjin.kwon] > Decimal data becomes bigint to query, unable to query > - > > Key: SPARK-26437 > URL: https://issues.apache.org/jira/browse/SPARK-26437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1 >Reporter: zengxl >Priority: Major > > this is my sql: > create table tmp.tmp_test_6387_1224_spark stored as ORCFile as select 0.00 > as a > select a from tmp.tmp_test_6387_1224_spark > CREATE TABLE `tmp.tmp_test_6387_1224_spark`( > {color:#f79232} `a` decimal(2,2)){color} > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > When I query this table(use hive or sparksql,the exception is same), I throw > the following exception information > *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed > stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 > limit: 0* > *at > org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)* > *at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)* > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26437) Decimal data becomes bigint to query, unable to query
[ https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26437: -- Affects Version/s: 2.0.2 2.1.3 2.2.2 > Decimal data becomes bigint to query, unable to query > - > > Key: SPARK-26437 > URL: https://issues.apache.org/jira/browse/SPARK-26437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.2, 2.3.1 >Reporter: zengxl >Priority: Major > > this is my sql: > create table tmp.tmp_test_6387_1224_spark stored as ORCFile as select 0.00 > as a > select a from tmp.tmp_test_6387_1224_spark > CREATE TABLE `tmp.tmp_test_6387_1224_spark`( > {color:#f79232} `a` decimal(2,2)){color} > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > When I query this table(use hive or sparksql,the exception is same), I throw > the following exception information > *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed > stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 > limit: 0* > *at > org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)* > *at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)* > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17572) Write.df is failing on spark cluster
[ https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729921#comment-16729921 ] Tarun Parmar edited comment on SPARK-17572 at 12/27/18 10:51 PM: - I am facing similar issue, my Spark+Hadoop version is same as Sankar's. I am using Spark with RStudio without hadoop to generate parquet files and store them in local/nfs mount. What I noticed is the _temporary directory is owned by my userid but the '0' directory inside _temporary is owned by root which is probably why it is failing to delete. Already checked with RStudio, they don't think this it is an issue with sparklyr package. was (Author: tarunparmar): I am facing similar issue, my Spark+Hadoop version is same as Sankar's. I am using Spark with RStudio without hadoop to generate parquet files and store them in local/nfs mount. What I noticed is the _temporary directory is owned by my userid but the '0' directory inside _temporary is owned by root which is probably why it is failing to delete. Already checked with RStudio, they don't this it is an issue with sparklyr package. > Write.df is failing on spark cluster > > > Key: SPARK-17572 > URL: https://issues.apache.org/jira/browse/SPARK-17572 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Sankar Mittapally >Priority: Major > > Hi, > We have spark cluster with four nodes, all four nodes have NFS partition > shared(there is no HDFS), We have same uid on all servers. When we are trying > to write data we are getting following exceptions. I am not sure whether it > is a error or not and not sure will I lost the data in the output. > The command which I am using to save the data. > {code} > saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true") > {code} > {noformat} > 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. 
> java.io.IOException: Failed to rename > DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv; > isDirectory=false; length=436486316; replication=1; blocksize=33554432; > modification_time=147409940; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false} to > file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487) > at
[jira] [Commented] (SPARK-26437) Decimal data becomes bigint to query, unable to query
[ https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729925#comment-16729925 ] Dongjoon Hyun commented on SPARK-26437: --- Thanks, [~mgaido]. > Decimal data becomes bigint to query, unable to query > - > > Key: SPARK-26437 > URL: https://issues.apache.org/jira/browse/SPARK-26437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: zengxl >Priority: Major > > this is my sql: > create table tmp.tmp_test_6387_1224_spark stored as ORCFile as select 0.00 > as a > select a from tmp.tmp_test_6387_1224_spark > CREATE TABLE `tmp.tmp_test_6387_1224_spark`( > {color:#f79232} `a` decimal(2,2)){color} > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > When I query this table(use hive or sparksql,the exception is same), I throw > the following exception information > *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed > stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 > limit: 0* > *at > org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)* > *at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)* > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17572) Write.df is failing on spark cluster
[ https://issues.apache.org/jira/browse/SPARK-17572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729921#comment-16729921 ] Tarun Parmar commented on SPARK-17572: -- I am facing similar issue, my Spark+Hadoop version is same as Sankar's. I am using Spark with RStudio without hadoop to generate parquet files and store them in local/nfs mount. What I noticed is the _temporary directory is owned by my userid but the '0' directory inside _temporary is owned by root which is probably why it is failing to delete. Already checked with RStudio, they don't this it is an issue with sparklyr package. > Write.df is failing on spark cluster > > > Key: SPARK-17572 > URL: https://issues.apache.org/jira/browse/SPARK-17572 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.0.0 >Reporter: Sankar Mittapally >Priority: Major > > Hi, > We have spark cluster with four nodes, all four nodes have NFS partition > shared(there is no HDFS), We have same uid on all servers. When we are trying > to write data we are getting following exceptions. I am not sure whether it > is a error or not and not sure will I lost the data in the output. > The command which I am using to save the data. > {code} > saveDF(banking_l1_1,"banking_l1_v2.csv",source="csv",mode="append",schema="true") > {code} > {noformat} > 16/09/17 08:03:28 ERROR InsertIntoHadoopFsRelationCommand: Aborting job. > java.io.IOException: Failed to rename > DeprecatedRawLocalFileStatus{path=file:/nfspartition/sankar/banking_l1_v2.csv/_temporary/0/task_201609170802_0013_m_00/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv; > isDirectory=false; length=436486316; replication=1; blocksize=33554432; > modification_time=147409940; access_time=0; owner=; group=; > permission=rw-rw-rw-; isSymlink=false} to > file:/nfspartition/sankar/banking_l1_v2.csv/part-r-0-46a7f178-2490-444e-9110-510978eaaecb.csv > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:371) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384) > at > org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326) > at > org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:222) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelationCommand.scala:144) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1.apply(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:115) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:60) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:58) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:115) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:136) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:133) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:86) > at > org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:487) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) > at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:194) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at >
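For readers hitting the same rename failure, a minimal spark-shell (Scala) sketch is below. It only reuses the NFS path quoted in the report and is not a confirmed fix for the root-owned '0' directory; the committer setting is a generic Hadoop option that reduces the job-commit renames where the error above occurs.
{code:java}
// Sketch only: reproduce the write and try the v2 file output committer, which moves
// task output to its final location at task commit instead of at job commit.
// The target path comes from the report above; adjust it to your environment.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("nfs-write-sketch").getOrCreate()
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.fileoutputcommitter.algorithm.version", "2")

val df = spark.range(10).toDF("id")
df.write.mode("append").option("header", "true")
  .csv("file:///nfspartition/sankar/banking_l1_v2.csv")
{code}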
[jira] [Assigned] (SPARK-26450) Map of schema is built too frequently in some wide queries
[ https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26450: Assignee: Apache Spark > Map of schema is built too frequently in some wide queries > -- > > Key: SPARK-26450 > URL: https://issues.apache.org/jira/browse/SPARK-26450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Assignee: Apache Spark >Priority: Minor > > When executing queries with wide projections and wide schemas, Spark rebuilds > an attribute map for the same schema many times. > For example: > {noformat} > select * from orctbl where id1 = 1 > {noformat} > Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above > query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq > instantiation builds a map of the entire list of 6000 attributes (but not > until lazy val exprIdToOrdinal is referenced). > Whenever OrcFileFormat reads a new file, it generates a new unsafe > projection. That results in this > [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319] > getting called: > {code:java} > protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = > in.map(BindReferences.bindReference(_, inputSchema)) > {code} > For each column in the projection, this line calls bindReference. Each call > passes inputSchema, a Sequence of Attributes, to a parameter position > expecting an AttributeSeq. The compiler implicitly calls the constructor for > AttributeSeq, which (lazily) builds a map for every attribute in the schema. > Therefore, this function builds a map of the entire schema once for each > column in the projection, and it does this for each input file. For the above > example query, this accounts for 204K instantiations of AttributeSeq. > Readers for CSV and JSON tables do something similar. > In addition, ProjectExec also creates an unsafe projection for each task. As > a result, this > [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91] > gets called, which has the same issue: > {code:java} > def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = { > exprs.map(BindReferences.bindReference(_, inputSchema)) > } > {code} > The above affects all wide queries that have a projection node, regardless of > the file reader. For the example query, ProjectExec accounts for the > additional 66K instantiations of the AttributeSeq. > Spark can save time by pre-building the AttributeSeq right before the map > operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size > of schema, size of projection, number of input files (for Orc), number of > file splits (for CSV, and JSON tables), and number of tasks. > For a 6000 column CSV table with 500K records and 34 input files, the time > savings is only 6%[1] because Spark doesn't create as many unsafe projections > as compared to Orc tables. > On the other hand, for a 6000 column Orc table with 500K records and 34 input > files, the time savings is about 16%[1]. > [1] based on queries run in local mode with 8 executor threads on my laptop. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
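The change proposed in the description amounts to building the AttributeSeq once per bind call and reusing it for every column. A rough sketch of that idea (assumed helper name, not the merged patch):
{code:java}
// Sketch only: construct the attribute map once, then bind every expression against it,
// instead of letting the implicit Seq[Attribute] -> AttributeSeq conversion rebuild it per column.
import org.apache.spark.sql.catalyst.expressions.{Attribute, AttributeSeq, BindReferences, Expression}

def bindAll(in: Seq[Expression], inputSchema: Seq[Attribute]): Seq[Expression] = {
  val attrSeq = new AttributeSeq(inputSchema)  // exprIdToOrdinal map is built (lazily) only once
  in.map(BindReferences.bindReference(_, attrSeq))
}
{code}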
[jira] [Assigned] (SPARK-26450) Map of schema is built too frequently in some wide queries
[ https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26450: Assignee: (was: Apache Spark) > Map of schema is built too frequently in some wide queries > -- > > Key: SPARK-26450 > URL: https://issues.apache.org/jira/browse/SPARK-26450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > When executing queries with wide projections and wide schemas, Spark rebuilds > an attribute map for the same schema many times. > For example: > {noformat} > select * from orctbl where id1 = 1 > {noformat} > Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above > query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq > instantiation builds a map of the entire list of 6000 attributes (but not > until lazy val exprIdToOrdinal is referenced). > Whenever OrcFileFormat reads a new file, it generates a new unsafe > projection. That results in this > [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319] > getting called: > {code:java} > protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = > in.map(BindReferences.bindReference(_, inputSchema)) > {code} > For each column in the projection, this line calls bindReference. Each call > passes inputSchema, a Sequence of Attributes, to a parameter position > expecting an AttributeSeq. The compiler implicitly calls the constructor for > AttributeSeq, which (lazily) builds a map for every attribute in the schema. > Therefore, this function builds a map of the entire schema once for each > column in the projection, and it does this for each input file. For the above > example query, this accounts for 204K instantiations of AttributeSeq. > Readers for CSV and JSON tables do something similar. > In addition, ProjectExec also creates an unsafe projection for each task. As > a result, this > [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91] > gets called, which has the same issue: > {code:java} > def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = { > exprs.map(BindReferences.bindReference(_, inputSchema)) > } > {code} > The above affects all wide queries that have a projection node, regardless of > the file reader. For the example query, ProjectExec accounts for the > additional 66K instantiations of the AttributeSeq. > Spark can save time by pre-building the AttributeSeq right before the map > operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size > of schema, size of projection, number of input files (for Orc), number of > file splits (for CSV, and JSON tables), and number of tasks. > For a 6000 column CSV table with 500K records and 34 input files, the time > savings is only 6%[1] because Spark doesn't create as many unsafe projections > as compared to Orc tables. > On the other hand, for a 6000 column Orc table with 500K records and 34 input > files, the time savings is about 16%[1]. > [1] based on queries run in local mode with 8 executor threads on my laptop. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive
[ https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26021: -- Fix Version/s: (was: 2.4.1) > -0.0 and 0.0 not treated consistently, doesn't match Hive > - > > Key: SPARK-26021 > URL: https://issues.apache.org/jira/browse/SPARK-26021 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Alon Doron >Priority: Critical > Fix For: 3.0.0 > > > Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new > issue: > The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are > numerically identical but not the same double value: > In hive, 0.0 and -0.0 are equal since > https://issues.apache.org/jira/browse/HIVE-11174. > That's not the case with spark sql as "group by" (non-codegen) treats them > as different values. Since their hash is different they're put in different > buckets of UnsafeFixedWidthAggregationMap. > In addition there's an inconsistency when using the codegen, for example the > following unit test: > {code:java} > println(Seq(0.0d, 0.0d, > -0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,3] > {code:java} > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,1], [-0.0,2] > {code:java} > spark.conf.set("spark.sql.codegen.wholeStage", "false") > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,2], [-0.0,1] > Note that the only difference between the first 2 lines is the order of the > elements in the Seq. > This inconsistency is resulted by different partitioning of the Seq and the > usage of the generated fast hash map in the first, partial, aggregation. > It looks like we need to add a specific check for -0.0 before hashing (both > in codegen and non-codegen modes) if we want to fix this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
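To make the hashing problem concrete: 0.0 and -0.0 compare equal, but their bit patterns differ, so hashing the raw bits puts them in different buckets. The normalization below is only an illustration, not Spark's actual fix.
{code:java}
// -0.0 equals 0.0 numerically, but its sign bit is set, so the raw bits (and any hash
// derived from them) differ.
val bitsPos = java.lang.Double.doubleToLongBits(0.0d)   // 0L
val bitsNeg = java.lang.Double.doubleToLongBits(-0.0d)  // sign bit set

// One possible normalization before hashing (illustrative only):
def normalize(d: Double): Double = if (d == 0.0d) 0.0d else d
assert(java.lang.Double.doubleToLongBits(normalize(-0.0d)) == bitsPos)
{code}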
[jira] [Commented] (SPARK-26021) -0.0 and 0.0 not treated consistently, doesn't match Hive
[ https://issues.apache.org/jira/browse/SPARK-26021?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729804#comment-16729804 ] Dongjoon Hyun commented on SPARK-26021: --- This is reverted from `branch-2.4` via https://github.com/apache/spark/pull/23389 . > -0.0 and 0.0 not treated consistently, doesn't match Hive > - > > Key: SPARK-26021 > URL: https://issues.apache.org/jira/browse/SPARK-26021 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Sean Owen >Assignee: Alon Doron >Priority: Critical > Fix For: 3.0.0 > > > Per [~adoron] and [~mccheah] and SPARK-24834, I'm splitting this out as a new > issue: > The underlying issue is how Spark and Hive treat 0.0 and -0.0, which are > numerically identical but not the same double value: > In hive, 0.0 and -0.0 are equal since > https://issues.apache.org/jira/browse/HIVE-11174. > That's not the case with spark sql as "group by" (non-codegen) treats them > as different values. Since their hash is different they're put in different > buckets of UnsafeFixedWidthAggregationMap. > In addition there's an inconsistency when using the codegen, for example the > following unit test: > {code:java} > println(Seq(0.0d, 0.0d, > -0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,3] > {code:java} > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,1], [-0.0,2] > {code:java} > spark.conf.set("spark.sql.codegen.wholeStage", "false") > println(Seq(0.0d, -0.0d, > 0.0d).toDF("i").groupBy("i").count().collect().mkString(", ")) > {code} > [0.0,2], [-0.0,1] > Note that the only difference between the first 2 lines is the order of the > elements in the Seq. > This inconsistency is resulted by different partitioning of the Seq and the > usage of the generated fast hash map in the first, partial, aggregation. > It looks like we need to add a specific check for -0.0 before hashing (both > in codegen and non-codegen modes) if we want to fix this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26363) Avoid duplicated KV store lookups for task table
[ https://issues.apache.org/jira/browse/SPARK-26363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-26363: --- Description: In the method `taskList` (since https://github.com/apache/spark/pull/21688), the executor log value is queried in the KV store for every task (method `constructTaskData`). We can use a hashmap to reduce duplicated KV store lookups in the method. was: In https://github.com/apache/spark/pull/21688, a new field `executorLogs` is added to `TaskData` in `api.scala`: 1. The field should not belong to `TaskData` (from the meaning of the wording). 2. This is redundant with ExecutorSummary. 3. For each row in the task table, the executor log value is looked up in the KV store every time, which can be avoided for better performance at large scale. This PR proposes to reuse the executor details of the "/allexecutors" request, so that we can have a cleaner API data structure, and redundant KV store queries are avoided. > Avoid duplicated KV store lookups for task table > - > > Key: SPARK-26363 > URL: https://issues.apache.org/jira/browse/SPARK-26363 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > In the method `taskList` (since https://github.com/apache/spark/pull/21688), > the executor log value is queried in the KV store for every task (method > `constructTaskData`). > We can use a hashmap to reduce duplicated KV store lookups in the method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
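A minimal sketch of the caching idea (the names below are hypothetical; the real method lives in Spark's app status store and reads executor summaries from the KV store):
{code:java}
// Sketch only: memoize the per-executor log lookup so that repeated executor ids in the
// task list hit the in-memory cache instead of issuing another KV store query.
import scala.collection.mutable

def executorLogsForTasks(
    taskExecutorIds: Seq[String],
    lookupLogs: String => Map[String, String]): Seq[Map[String, String]] = {
  val cache = mutable.HashMap.empty[String, Map[String, String]]
  taskExecutorIds.map(id => cache.getOrElseUpdate(id, lookupLogs(id)))
}
{code}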
[jira] [Created] (SPARK-26457) Show hadoop configurations in HistoryServer environment tab
deshanxiao created SPARK-26457: -- Summary: Show hadoop configurations in HistoryServer environment tab Key: SPARK-26457 URL: https://issues.apache.org/jira/browse/SPARK-26457 Project: Spark Issue Type: New Feature Components: Spark Core, Web UI Affects Versions: 2.4.0, 2.3.2 Environment: Maybe it is good to show some Hadoop configurations in the HistoryServer environment tab for debugging Hadoop-related issues Reporter: deshanxiao -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26451) Change lead/lag argument name from count to offset
[ https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-26451: -- Docs Text: The 'lag' function in Pyspark accepted an argument 'count' which should have been called 'offset'. It has been renamed accordingly. Labels: release-notes (was: ) Issue Type: Bug (was: Documentation) > Change lead/lag argument name from count to offset > -- > > Key: SPARK-26451 > URL: https://issues.apache.org/jira/browse/SPARK-26451 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Deepyaman Datta >Assignee: Deepyaman Datta >Priority: Minor > Labels: release-notes > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26451) Change lead/lag argument name from count to offset
[ https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26451. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23357 [https://github.com/apache/spark/pull/23357] > Change lead/lag argument name from count to offset > -- > > Key: SPARK-26451 > URL: https://issues.apache.org/jira/browse/SPARK-26451 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Deepyaman Datta >Assignee: Deepyaman Datta >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26451) Change lead/lag argument name from count to offset
[ https://issues.apache.org/jira/browse/SPARK-26451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26451: Assignee: Deepyaman Datta > Change lead/lag argument name from count to offset > -- > > Key: SPARK-26451 > URL: https://issues.apache.org/jira/browse/SPARK-26451 > Project: Spark > Issue Type: Documentation > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Deepyaman Datta >Assignee: Deepyaman Datta >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26450) Map of schema is built too frequently in some wide queries
[ https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729675#comment-16729675 ] Marco Gaido commented on SPARK-26450: - Great, thanks! > Map of schema is built too frequently in some wide queries > -- > > Key: SPARK-26450 > URL: https://issues.apache.org/jira/browse/SPARK-26450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > When executing queries with wide projections and wide schemas, Spark rebuilds > an attribute map for the same schema many times. > For example: > {noformat} > select * from orctbl where id1 = 1 > {noformat} > Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above > query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq > instantiation builds a map of the entire list of 6000 attributes (but not > until lazy val exprIdToOrdinal is referenced). > Whenever OrcFileFormat reads a new file, it generates a new unsafe > projection. That results in this > [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319] > getting called: > {code:java} > protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = > in.map(BindReferences.bindReference(_, inputSchema)) > {code} > For each column in the projection, this line calls bindReference. Each call > passes inputSchema, a Sequence of Attributes, to a parameter position > expecting an AttributeSeq. The compiler implicitly calls the constructor for > AttributeSeq, which (lazily) builds a map for every attribute in the schema. > Therefore, this function builds a map of the entire schema once for each > column in the projection, and it does this for each input file. For the above > example query, this accounts for 204K instantiations of AttributeSeq. > Readers for CSV and JSON tables do something similar. > In addition, ProjectExec also creates an unsafe projection for each task. As > a result, this > [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91] > gets called, which has the same issue: > {code:java} > def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = { > exprs.map(BindReferences.bindReference(_, inputSchema)) > } > {code} > The above affects all wide queries that have a projection node, regardless of > the file reader. For the example query, ProjectExec accounts for the > additional 66K instantiations of the AttributeSeq. > Spark can save time by pre-building the AttributeSeq right before the map > operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size > of schema, size of projection, number of input files (for Orc), number of > file splits (for CSV, and JSON tables), and number of tasks. > For a 6000 column CSV table with 500K records and 34 input files, the time > savings is only 6%[1] because Spark doesn't create as many unsafe projections > as compared to Orc tables. > On the other hand, for a 6000 column Orc table with 500K records and 34 input > files, the time savings is about 16%[1]. > [1] based on queries run in local mode with 8 executor threads on my laptop. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26456) Cast date/timestamp by Date/TimestampFormatter
[ https://issues.apache.org/jira/browse/SPARK-26456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26456: Assignee: (was: Apache Spark) > Cast date/timestamp by Date/TimestampFormatter > -- > > Key: SPARK-26456 > URL: https://issues.apache.org/jira/browse/SPARK-26456 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Minor > > Currently, dates and timestamps are casted to strings by using > SimpleDateFormat. The ticket aims to switch the code on new DateFormatter and > TimestampFormatter that are already used in CSV and JSON datasources for the > same purpose. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26456) Cast date/timestamp by Date/TimestampFormatter
[ https://issues.apache.org/jira/browse/SPARK-26456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26456: Assignee: Apache Spark > Cast date/timestamp by Date/TimestampFormatter > -- > > Key: SPARK-26456 > URL: https://issues.apache.org/jira/browse/SPARK-26456 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > > Currently, dates and timestamps are casted to strings by using > SimpleDateFormat. The ticket aims to switch the code on new DateFormatter and > TimestampFormatter that are already used in CSV and JSON datasources for the > same purpose. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26456) Cast date/timestamp by Date/TimestampFormatter
Maxim Gekk created SPARK-26456: -- Summary: Cast date/timestamp by Date/TimestampFormatter Key: SPARK-26456 URL: https://issues.apache.org/jira/browse/SPARK-26456 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk Currently, dates and timestamps are cast to strings using SimpleDateFormat. The ticket aims to switch the code to the new DateFormatter and TimestampFormatter that are already used in the CSV and JSON datasources for the same purpose. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
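For context, the cast path in question is the one exercised by queries like the following (assumes a spark-shell session where `spark` is a SparkSession; the snippet only shows which casts are affected, not formatter-specific output):
{code:java}
// Casting timestamps and dates to strings goes through the formatter discussed above.
spark.sql(
  "SELECT CAST(current_timestamp() AS STRING) AS ts, CAST(DATE '2018-12-27' AS STRING) AS d"
).show(false)
{code}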
[jira] [Resolved] (SPARK-26248) Infer date type from CSV
[ https://issues.apache.org/jira/browse/SPARK-26248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk resolved SPARK-26248. Resolution: Won't Fix > Infer date type from CSV > > > Key: SPARK-26248 > URL: https://issues.apache.org/jira/browse/SPARK-26248 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > > Currently, DateType cannot be inferred from CSV. To parse CSV string, you > have to specify schema explicitly if CSV input contains dates. This ticket > aims to extend CSVInferSchema to support such inferring. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26450) Map of schema is built too frequently in some wide queries
[ https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729647#comment-16729647 ] Bruce Robbins commented on SPARK-26450: --- I can attempt a patch later today. > Map of schema is built too frequently in some wide queries > -- > > Key: SPARK-26450 > URL: https://issues.apache.org/jira/browse/SPARK-26450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > When executing queries with wide projections and wide schemas, Spark rebuilds > an attribute map for the same schema many times. > For example: > {noformat} > select * from orctbl where id1 = 1 > {noformat} > Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above > query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq > instantiation builds a map of the entire list of 6000 attributes (but not > until lazy val exprIdToOrdinal is referenced). > Whenever OrcFileFormat reads a new file, it generates a new unsafe > projection. That results in this > [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319] > getting called: > {code:java} > protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = > in.map(BindReferences.bindReference(_, inputSchema)) > {code} > For each column in the projection, this line calls bindReference. Each call > passes inputSchema, a Sequence of Attributes, to a parameter position > expecting an AttributeSeq. The compiler implicitly calls the constructor for > AttributeSeq, which (lazily) builds a map for every attribute in the schema. > Therefore, this function builds a map of the entire schema once for each > column in the projection, and it does this for each input file. For the above > example query, this accounts for 204K instantiations of AttributeSeq. > Readers for CSV and JSON tables do something similar. > In addition, ProjectExec also creates an unsafe projection for each task. As > a result, this > [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91] > gets called, which has the same issue: > {code:java} > def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = { > exprs.map(BindReferences.bindReference(_, inputSchema)) > } > {code} > The above affects all wide queries that have a projection node, regardless of > the file reader. For the example query, ProjectExec accounts for the > additional 66K instantiations of the AttributeSeq. > Spark can save time by pre-building the AttributeSeq right before the map > operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size > of schema, size of projection, number of input files (for Orc), number of > file splits (for CSV, and JSON tables), and number of tasks. > For a 6000 column CSV table with 500K records and 34 input files, the time > savings is only 6%[1] because Spark doesn't create as many unsafe projections > as compared to Orc tables. > On the other hand, for a 6000 column Orc table with 500K records and 34 input > files, the time savings is about 16%[1]. > [1] based on queries run in local mode with 8 executor threads on my laptop. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-25892) AttributeReference.withMetadata method should have return type AttributeReference
[ https://issues.apache.org/jira/browse/SPARK-25892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25892: Assignee: kevin yu > AttributeReference.withMetadata method should have return type > AttributeReference > - > > Key: SPARK-25892 > URL: https://issues.apache.org/jira/browse/SPARK-25892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jari Kujansuu >Assignee: kevin yu >Priority: Trivial > Fix For: 3.0.0 > > > AttributeReference.withMetadata method should have return type > AttributeReference instead of Attribute. > AttributeReference overrides withMetadata method defined in Attribute super > class and returns AttributeReference instance but method's return type is > Attribute unlike in other with... methods overridden by AttributeReference. > In some cases you have to cast the return value back to AttributeReference. > For example if you want to modify metadata for AttributeReference in > LogicalRelation you have to cast return value of withMetadata back to > AttributeReference because LogicalRelation takes Seq[AttributeReference] as > argument. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
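The workaround described in the quoted report boils down to a downcast like the following sketch (illustrative only, not taken from the patch):
{code:java}
// Before the change, withMetadata is typed as Attribute, so callers that need
// Seq[AttributeReference] (e.g. LogicalRelation) have to cast the result back.
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{MetadataBuilder, StringType}

val attr = AttributeReference("c", StringType)()
val newMeta = new MetadataBuilder().putString("comment", "example").build()
val updated: AttributeReference =
  attr.withMetadata(newMeta).asInstanceOf[AttributeReference]
{code}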
[jira] [Resolved] (SPARK-25892) AttributeReference.withMetadata method should have return type AttributeReference
[ https://issues.apache.org/jira/browse/SPARK-25892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25892. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22918 [https://github.com/apache/spark/pull/22918] > AttributeReference.withMetadata method should have return type > AttributeReference > - > > Key: SPARK-25892 > URL: https://issues.apache.org/jira/browse/SPARK-25892 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Jari Kujansuu >Assignee: kevin yu >Priority: Trivial > Fix For: 3.0.0 > > > AttributeReference.withMetadata method should have return type > AttributeReference instead of Attribute. > AttributeReference overrides withMetadata method defined in Attribute super > class and returns AttributeReference instance but method's return type is > Attribute unlike in other with... methods overridden by AttributeReference. > In some cases you have to cast the return value back to AttributeReference. > For example if you want to modify metadata for AttributeReference in > LogicalRelation you have to cast return value of withMetadata back to > AttributeReference because LogicalRelation takes Seq[AttributeReference] as > argument. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26455) Spark Kinesis Integration with no SSL
Shashikant Bangera created SPARK-26455: -- Summary: Spark Kinesis Integration with no SSL Key: SPARK-26455 URL: https://issues.apache.org/jira/browse/SPARK-26455 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.3.0 Reporter: Shashikant Bangera Hi, we are trying to access the endpoint through the library mentioned below and we get the SSL error; I think it uses the KCL library internally. Looking at the error: if I need to skip certificate verification, is that possible through a KCL utils call? I do not find any provision to set SSL to false within the spark-streaming-kinesis library like we do with KCL. Can you please help me with the same. compile("org.apache.spark:spark-streaming-kinesis-asl_2.11:2.3.0") { exclude group: 'org.apache.spark', module: 'spark-streaming_2.11' } Caused by: javax.net.ssl.SSLPeerUnverifiedException: Certificate for <kinesis-endpoint> doesn't match any of the subject alternative names: [kinesis-fips.us-east-1.amazonaws.com, *.kinesis.us-east-1.vpce.amazonaws.com, kinesis.us-east-1.amazonaws.com] at org.apache.http.conn.ssl.SSLConnectionSocketFactory.verifyHostname(SSLConnectionSocketFactory.java:467) at org.apache.http.conn.ssl.SSLConnectionSocketFactory.createLayeredSocket(SSLConnectionSocketFactory.java:397) at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:355) at shade.com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.connectSocket(SdkTLSSocketFactory.java:132) at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142) at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:373) at sun.reflect.GeneratedMethodAccessor19.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at shade.com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76) at shade.com.amazonaws.http.conn.$Proxy18.connect(Unknown Source) at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:381) at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:237) at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) at shade.com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72) at shade.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1238) at shade.com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1058) ... 20 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26454) IllegalArgument Exception is Thrown while creating new UDF with JAR
[ https://issues.apache.org/jira/browse/SPARK-26454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729628#comment-16729628 ] Udbhav Agrawal commented on SPARK-26454: [~dongjoon] [~cloud_fan] When the UDF is created for the first time using the JAR option, the HDFS path is converted to a local path and the JAR is added at that path. When the function is created a second time, the check finds that the JAR is already present and associated with another path, so it throws an IllegalArgumentException and then continues creating the function. > IllegalArgument Exception is Thrown while creating new UDF with JAR > - > > Key: SPARK-26454 > URL: https://issues.apache.org/jira/browse/SPARK-26454 > Project: Spark > Issue Type: Bug > Components: Spark Shell >Affects Versions: 2.3.2 >Reporter: Udbhav Agrawal >Priority: Major > > 【Test step】: > 1.launch spark-shell > 2. set role admin; > 3. create new function > CREATE FUNCTION Func AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar' > 4. Do select on the function > sql("select Func('2018-03-09')").show() > 5.Create new UDF with same JAR > sql("CREATE FUNCTION newFunc AS > 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR > 'hdfs:///tmp/super_udf/two_udfs.jar'") > 6. Do select on the new function created. > sql("select newFunc ('2018-03-09')").show() > 【Output】: > The function gets created but an IllegalArgumentException is thrown; the select > provides results but with the IllegalArgumentException. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26454) IllegalArgument Exception is Thrown while creating new UDF with JAR
Udbhav Agrawal created SPARK-26454: -- Summary: IllegalArgument Exception is Thrown while creating new UDF with JAR Key: SPARK-26454 URL: https://issues.apache.org/jira/browse/SPARK-26454 Project: Spark Issue Type: Bug Components: Spark Shell Affects Versions: 2.3.2 Reporter: Udbhav Agrawal 【Test step】: 1.launch spark-shell 2. set role admin; 3. create new function CREATE FUNCTION Func AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 'hdfs:///tmp/super_udf/two_udfs.jar' 4. Do select on the function sql("select Func('2018-03-09')").show() 5.Create new UDF with same JAR sql("CREATE FUNCTION newFunc AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 'hdfs:///tmp/super_udf/two_udfs.jar'") 6. Do select on the new function created. sql("select newFunc ('2018-03-09')").show() 【Output】: The function gets created but an IllegalArgumentException is thrown; the select provides results but with the IllegalArgumentException. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26450) Map of schema is built too frequently in some wide queries
[ https://issues.apache.org/jira/browse/SPARK-26450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729621#comment-16729621 ] Marco Gaido commented on SPARK-26450: - Thanks for this JIRA [~bersprockets]. This makes sense to me. Do you want to submit a patch for this? Otherwise I can take it over. Thanks. > Map of schema is built too frequently in some wide queries > -- > > Key: SPARK-26450 > URL: https://issues.apache.org/jira/browse/SPARK-26450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Bruce Robbins >Priority: Minor > > When executing queries with wide projections and wide schemas, Spark rebuilds > an attribute map for the same schema many times. > For example: > {noformat} > select * from orctbl where id1 = 1 > {noformat} > Assume {{orctbl}} has 6000 columns and 34 files. In that case, the above > query creates an AttributeSeq object 270,000 times[1]. Each AttributeSeq > instantiation builds a map of the entire list of 6000 attributes (but not > until lazy val exprIdToOrdinal is referenced). > Whenever OrcFileFormat reads a new file, it generates a new unsafe > projection. That results in this > [function|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala#L319] > getting called: > {code:java} > protected def bind(in: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = > in.map(BindReferences.bindReference(_, inputSchema)) > {code} > For each column in the projection, this line calls bindReference. Each call > passes inputSchema, a Sequence of Attributes, to a parameter position > expecting an AttributeSeq. The compiler implicitly calls the constructor for > AttributeSeq, which (lazily) builds a map for every attribute in the schema. > Therefore, this function builds a map of the entire schema once for each > column in the projection, and it does this for each input file. For the above > example query, this accounts for 204K instantiations of AttributeSeq. > Readers for CSV and JSON tables do something similar. > In addition, ProjectExec also creates an unsafe projection for each task. As > a result, this > [line|https://github.com/apache/spark/blob/827383a97c11a61661440ff86ce0c3382a2a23b2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/package.scala#L91] > gets called, which has the same issue: > {code:java} > def toBoundExprs(exprs: Seq[Expression], inputSchema: Seq[Attribute]): > Seq[Expression] = { > exprs.map(BindReferences.bindReference(_, inputSchema)) > } > {code} > The above affects all wide queries that have a projection node, regardless of > the file reader. For the example query, ProjectExec accounts for the > additional 66K instantiations of the AttributeSeq. > Spark can save time by pre-building the AttributeSeq right before the map > operations in {{bind}} and {{toBoundExprs}}. The time saved depends on size > of schema, size of projection, number of input files (for Orc), number of > file splits (for CSV, and JSON tables), and number of tasks. > For a 6000 column CSV table with 500K records and 34 input files, the time > savings is only 6%[1] because Spark doesn't create as many unsafe projections > as compared to Orc tables. > On the other hand, for a 6000 column Orc table with 500K records and 34 input > files, the time savings is about 16%[1]. > [1] based on queries run in local mode with 8 executor threads on my laptop. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25910) accumulator updates from previous stage attempt should not fail
[ https://issues.apache.org/jira/browse/SPARK-25910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25910. -- Resolution: Duplicate > accumulator updates from previous stage attempt should not fail > --- > > Key: SPARK-25910 > URL: https://issues.apache.org/jira/browse/SPARK-25910 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26437) Decimal data becomes bigint to query, unable to query
[ https://issues.apache.org/jira/browse/SPARK-26437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729526#comment-16729526 ] Marco Gaido commented on SPARK-26437: - cc [~dongjoon] > Decimal data becomes bigint to query, unable to query > - > > Key: SPARK-26437 > URL: https://issues.apache.org/jira/browse/SPARK-26437 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1 >Reporter: zengxl >Priority: Major > > this is my sql: > create table tmp.tmp_test_6387_1224_spark stored as ORCFile as select 0.00 > as a > select a from tmp.tmp_test_6387_1224_spark > CREATE TABLE `tmp.tmp_test_6387_1224_spark`( > {color:#f79232} `a` decimal(2,2)){color} > ROW FORMAT SERDE > 'org.apache.hadoop.hive.ql.io.orc.OrcSerde' > STORED AS INPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat' > When I query this table(use hive or sparksql,the exception is same), I throw > the following exception information > *Caused by: java.io.EOFException: Reading BigInteger past EOF from compressed > stream Stream for column 1 kind DATA position: 0 length: 0 range: 0 offset: 0 > limit: 0* > *at > org.apache.hadoop.hive.ql.io.orc.SerializationUtils.readBigInteger(SerializationUtils.java:176)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$DecimalTreeReader.next(TreeReaderFactory.java:1264)* > *at > org.apache.hadoop.hive.ql.io.orc.TreeReaderFactory$StructTreeReader.next(TreeReaderFactory.java:2004)* > *at > org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.next(RecordReaderImpl.java:1039)* > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26453) running spark sql cli is looking for wrong path of hive.metastore.warehouse.dir
[ https://issues.apache.org/jira/browse/SPARK-26453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26453. -- Resolution: Invalid This looks like a question. Questions should go to the mailing list before being filed as an issue; you are likely to get a better answer there. > running spark sql cli is looking for wrong path of > hive.metastore.warehouse.dir > > > Key: SPARK-26453 > URL: https://issues.apache.org/jira/browse/SPARK-26453 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: anubhav tarar >Priority: Major > > i started the spark sql cli and run the following sql > spark-sql> create table cars(make varchar(10)); > it give me below error > 2018-12-27 14:49:39 ERROR RetryingHMSHandler:159 - > MetaException(message:file:*/user/hive/warehouse/*cars is not a directory or > unable to create one) > Note:i have not specify hive.metastore.warehouse.dir anywhere i just > downloaded the latest spark version from offical site and try to execute sql > > further more metastore info logs is printed the right location,but looking at > above error it seems that *hive.warehouse.metastore.dir* is not pointing to > that location > > *2018-12-27 14:49:36 INFO metastore:291 - Mestastore configuration > hive.metastore.warehouse.dir changed from /user/hive/warehouse to > file:/home/anubhav/Downloads/spark-2.4.0-bin-hadoop2.7/bin/spark-warehouse* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26453) running spark sql cli is looking for wrong path of hive.metastore.warehouse.dir
[ https://issues.apache.org/jira/browse/SPARK-26453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26453: - Target Version/s: (was: 2.4.0) > running spark sql cli is looking for wrong path of > hive.metastore.warehouse.dir > > > Key: SPARK-26453 > URL: https://issues.apache.org/jira/browse/SPARK-26453 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.0 >Reporter: anubhav tarar >Priority: Major > > i started the spark sql cli and run the following sql > spark-sql> create table cars(make varchar(10)); > it give me below error > 2018-12-27 14:49:39 ERROR RetryingHMSHandler:159 - > MetaException(message:file:*/user/hive/warehouse/*cars is not a directory or > unable to create one) > Note:i have not specify hive.metastore.warehouse.dir anywhere i just > downloaded the latest spark version from offical site and try to execute sql > > further more metastore info logs is printed the right location,but looking at > above error it seems that *hive.warehouse.metastore.dir* is not pointing to > that location > > *2018-12-27 14:49:36 INFO metastore:291 - Mestastore configuration > hive.metastore.warehouse.dir changed from /user/hive/warehouse to > file:/home/anubhav/Downloads/spark-2.4.0-bin-hadoop2.7/bin/spark-warehouse* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26191) Control number of truncated fields
[ https://issues.apache.org/jira/browse/SPARK-26191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-26191. --- Resolution: Fixed Assignee: Maxim Gekk Fix Version/s: 3.0.0 > Control number of truncated fields > -- > > Key: SPARK-26191 > URL: https://issues.apache.org/jira/browse/SPARK-26191 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > Currently, the threshold for truncated fields converted to string can be > controlled via global SQL config. Need to add the maxFields parameter to all > functions/methods that potentially could produce truncated string from a > sequence of fields. > One of use cases is toFile. This method aims to output not truncated plans. > For now users has to set global config to flush whole plans. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729474#comment-16729474 ] Peiyu Zhuang edited comment on SPARK-25299 at 12/27/18 9:32 AM: Sure. I just create a [SPIP in google doc|https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing]. Here is our [design document|https://docs.google.com/document/d/1kSpbBB-sDk41LeORm3-Hfr-up98Ozm5wskvB49tUhSs/edit?usp=sharing]. was (Author: jealous): Sure. I just create a [SPIP in google doc|[https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing|https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing].]]. Here is our [design document|https://docs.google.com/document/d/1kSpbBB-sDk41LeORm3-Hfr-up98Ozm5wskvB49tUhSs/edit?usp=sharing]. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26453) running spark sql cli is looking for wrong path of hive.metastore.warehouse.dir
anubhav tarar created SPARK-26453: - Summary: running spark sql cli is looking for wrong path of hive.metastore.warehouse.dir Key: SPARK-26453 URL: https://issues.apache.org/jira/browse/SPARK-26453 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.0 Reporter: anubhav tarar I started the Spark SQL CLI and ran the following SQL: spark-sql> create table cars(make varchar(10)); It gives me the error below: 2018-12-27 14:49:39 ERROR RetryingHMSHandler:159 - MetaException(message:file:*/user/hive/warehouse/*cars is not a directory or unable to create one) Note: I have not specified hive.metastore.warehouse.dir anywhere; I just downloaded the latest Spark version from the official site and tried to execute SQL. Furthermore, the metastore info log prints the right location, but looking at the above error it seems that *hive.metastore.warehouse.dir* is not pointing to that location: *2018-12-27 14:49:36 INFO metastore:291 - Mestastore configuration hive.metastore.warehouse.dir changed from /user/hive/warehouse to file:/home/anubhav/Downloads/spark-2.4.0-bin-hadoop2.7/bin/spark-warehouse* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
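Not a confirmed fix for this report, but for readers who want to pin the warehouse location explicitly, a minimal sketch is below (spark.sql.warehouse.dir takes precedence over hive.metastore.warehouse.dir since Spark 2.0; the path is only an example):
{code:java}
// Sketch only: set the warehouse location explicitly when building the session.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("warehouse-dir-example")
  .config("spark.sql.warehouse.dir", "/user/hive/warehouse")  // example path
  .enableHiveSupport()
  .getOrCreate()

spark.sql("CREATE TABLE cars(make STRING)")
{code}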
[jira] [Commented] (SPARK-25299) Use remote storage for persisting shuffle data
[ https://issues.apache.org/jira/browse/SPARK-25299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16729474#comment-16729474 ] Peiyu Zhuang commented on SPARK-25299: -- Sure. I just create a [SPIP in google doc|[https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing|https://docs.google.com/document/d/1TA-gDw3ophy-gSu2IAW_5IMbRK_8pWBeXJwngN9YB80/edit?usp=sharing].]]. Here is our [design document|https://docs.google.com/document/d/1kSpbBB-sDk41LeORm3-Hfr-up98Ozm5wskvB49tUhSs/edit?usp=sharing]. > Use remote storage for persisting shuffle data > -- > > Key: SPARK-25299 > URL: https://issues.apache.org/jira/browse/SPARK-25299 > Project: Spark > Issue Type: New Feature > Components: Shuffle >Affects Versions: 2.4.0 >Reporter: Matt Cheah >Priority: Major > > In Spark, the shuffle primitive requires Spark executors to persist data to > the local disk of the worker nodes. If executors crash, the external shuffle > service can continue to serve the shuffle data that was written beyond the > lifetime of the executor itself. In YARN, Mesos, and Standalone mode, the > external shuffle service is deployed on every worker node. The shuffle > service shares local disk with the executors that run on its node. > There are some shortcomings with the way shuffle is fundamentally implemented > right now. Particularly: > * If any external shuffle service process or node becomes unavailable, all > applications that had an executor that ran on that node must recompute the > shuffle blocks that were lost. > * Similarly to the above, the external shuffle service must be kept running > at all times, which may waste resources when no applications are using that > shuffle service node. > * Mounting local storage can prevent users from taking advantage of > desirable isolation benefits from using containerized environments, like > Kubernetes. We had an external shuffle service implementation in an early > prototype of the Kubernetes backend, but it was rejected due to its strict > requirement to be able to mount hostPath volumes or other persistent volume > setups. > In the following [architecture discussion > document|https://docs.google.com/document/d/1uCkzGGVG17oGC6BJ75TpzLAZNorvrAU3FRd2X-rVHSM/edit#heading=h.btqugnmt2h40] > (note: _not_ an SPIP), we brainstorm various high level architectures for > improving the external shuffle service in a way that addresses the above > problems. The purpose of this umbrella JIRA is to promote additional > discussion on how we can approach these problems, both at the architecture > level and the implementation level. We anticipate filing sub-issues that > break down the tasks that must be completed to achieve this goal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26435) Support creating partitioned table using Hive CTAS by specifying partition column names
[ https://issues.apache.org/jira/browse/SPARK-26435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-26435: --- Assignee: Liang-Chi Hsieh > Support creating partitioned table using Hive CTAS by specifying partition > column names > --- > > Key: SPARK-26435 > URL: https://issues.apache.org/jira/browse/SPARK-26435 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > Spark SQL doesn't support creating partitioned table using Hive CTAS in SQL > syntax. However it is supported by using DataFrameWriter API. > {code} > val df = Seq(("a", 1)).toDF("part", "id") > df.write.format("hive").partitionBy("part").saveAsTable("t") > {code} > Hive begins to support this in newer version: > https://issues.apache.org/jira/browse/HIVE-20241: > {code} > CREATE TABLE t PARTITIONED BY (part) AS SELECT 1 as id, "a" as part > {code} > To match DataFrameWriter API, we should this support to SQL syntax. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26435) Support creating partitioned table using Hive CTAS by specifying partition column names
[ https://issues.apache.org/jira/browse/SPARK-26435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26435. - Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23376 [https://github.com/apache/spark/pull/23376] > Support creating partitioned table using Hive CTAS by specifying partition > column names > --- > > Key: SPARK-26435 > URL: https://issues.apache.org/jira/browse/SPARK-26435 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Liang-Chi Hsieh >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 3.0.0 > > > Spark SQL doesn't support creating partitioned table using Hive CTAS in SQL > syntax. However it is supported by using DataFrameWriter API. > {code} > val df = Seq(("a", 1)).toDF("part", "id") > df.write.format("hive").partitionBy("part").saveAsTable("t") > {code} > Hive begins to support this in newer version: > https://issues.apache.org/jira/browse/HIVE-20241: > {code} > CREATE TABLE t PARTITIONED BY (part) AS SELECT 1 as id, "a" as part > {code} > To match DataFrameWriter API, we should this support to SQL syntax. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org