[jira] [Commented] (SPARK-27504) File source V2: support refreshing metadata cache
[ https://issues.apache.org/jira/browse/SPARK-27504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859126#comment-16859126 ] Dongjoon Hyun commented on SPARK-27504: --- This feature will be reverted by SPARK-27961. > File source V2: support refreshing metadata cache > - > > Key: SPARK-27504 > URL: https://issues.apache.org/jira/browse/SPARK-27504 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > In file source V1, if some file is deleted manually, reading the > DataFrame/Table will throw an exception with the suggestion message "It is > possible the underlying files have been updated. You can explicitly > invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in > SQL or by recreating the Dataset/DataFrame involved.". > After refreshing the table/DataFrame, the reads should return correct results. > We should follow this in file source V2 as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
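The refresh semantics described in the ticket can be illustrated outside Spark with a toy cached file index (the class and method names below are purely illustrative, not Spark's actual `InMemoryFileIndex` API): reads are served from a stale cache and keep returning a deleted file until an explicit refresh re-lists the storage, analogous to `REFRESH TABLE`.

```python
class CachedFileIndex:
    """Toy model of a cached file listing (illustrative names only,
    not Spark's actual API)."""

    def __init__(self, list_files):
        self._list_files = list_files       # callable that lists the real files
        self._cache = list(list_files())    # metadata cached at creation time

    def files(self):
        # Reads are served from the cache, so they can go stale.
        return self._cache

    def refresh(self):
        # Analogous to REFRESH TABLE: re-list storage and replace the cache.
        self._cache = list(self._list_files())


storage = {"part-00000", "part-00001"}             # pretend file system
index = CachedFileIndex(lambda: sorted(storage))

storage.discard("part-00001")                      # file deleted out of band
stale = index.files()                              # still lists the deleted file

index.refresh()                                    # explicit invalidation
fresh = index.files()                              # now consistent with storage
```

File source V1's error message pushes the user toward exactly this explicit-refresh step; the ticket asks for file source V2 to behave the same way.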
[jira] [Assigned] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`
[ https://issues.apache.org/jira/browse/SPARK-27981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27981: Assignee: (was: Apache Spark) > Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` > -- > > Key: SPARK-27981 > URL: https://issues.apache.org/jira/browse/SPARK-27981 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This PR aims to remove the following warnings for `java.nio.Bits.unaligned` > on JDK 9/10/11/12. Please note that there are more warnings, which are beyond > the scope of this PR. > {code} > bin/spark-shell --driver-java-options=--illegal-access=warn > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar) > to method java.nio.Bits.unaligned() > ... > {code}
[jira] [Assigned] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`
[ https://issues.apache.org/jira/browse/SPARK-27981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27981: Assignee: Apache Spark > Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` > -- > > Key: SPARK-27981 > URL: https://issues.apache.org/jira/browse/SPARK-27981 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > This PR aims to remove the following warnings for `java.nio.Bits.unaligned` > on JDK 9/10/11/12. Please note that there are more warnings, which are beyond > the scope of this PR. > {code} > bin/spark-shell --driver-java-options=--illegal-access=warn > WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform > (file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar) > to method java.nio.Bits.unaligned() > ... > {code}
[jira] [Created] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`
Dongjoon Hyun created SPARK-27981: - Summary: Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()` Key: SPARK-27981 URL: https://issues.apache.org/jira/browse/SPARK-27981 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Dongjoon Hyun This PR aims to remove the following warnings for `java.nio.Bits.unaligned` on JDK 9/10/11/12. Please note that there are more warnings, which are beyond the scope of this PR. {code} bin/spark-shell --driver-java-options=--illegal-access=warn WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar) to method java.nio.Bits.unaligned() ... {code}
[jira] [Created] (SPARK-27980) Add built-in Ordered-Set Aggregate Functions: percentile_cont
Yuming Wang created SPARK-27980: --- Summary: Add built-in Ordered-Set Aggregate Functions: percentile_cont Key: SPARK-27980 URL: https://issues.apache.org/jira/browse/SPARK-27980 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
||Function||Direct Argument Type(s)||Aggregated Argument Type(s)||Return Type||Partial Mode||Description||
|{{percentile_cont(_{{fraction}}_) WITHIN GROUP (ORDER BY _{{sort_expression}}_)}}|{{double precision}}|{{double precision}} or {{interval}}|same as sort expression|No|continuous percentile: returns a value corresponding to the specified fraction in the ordering, interpolating between adjacent input items if needed|
|{{percentile_cont(_{{fractions}}_) WITHIN GROUP (ORDER BY _{{sort_expression}}_)}}|{{double precision[]}}|{{double precision}} or {{interval}}|array of sort expression's type|No|multiple continuous percentile: returns an array of results matching the shape of the _{{fractions}}_ parameter, with each non-null element replaced by the value corresponding to that percentile|
https://www.postgresql.org/docs/current/functions-aggregate.html
Other DBs:
https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_25
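For reference, the continuous-percentile semantics in the table above (a value at the given fraction of the ordering, with linear interpolation between adjacent inputs) can be sketched in plain Python. This illustrates the SQL semantics only; it is not Spark's implementation:

```python
import math

def percentile_cont(fraction, values):
    """Continuous percentile over a non-empty sequence: interpolate
    linearly between adjacent ordered inputs (sketch of the SQL semantics)."""
    ordered = sorted(values)
    # Position in the ordering (0-based) that the fraction points at.
    pos = fraction * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    if lo == hi:                      # fraction lands exactly on an input row
        return float(ordered[lo])
    weight = pos - lo                 # how far between the two adjacent rows
    return ordered[lo] * (1 - weight) + ordered[hi] * weight
```

The multi-fraction form in the second table row simply applies this per element of the fractions array.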
[jira] [Assigned] (SPARK-27979) Remove deprecated `--force` option in `build/mvn` and `run-tests.py`
[ https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27979: Assignee: Apache Spark > Remove deprecated `--force` option in `build/mvn` and `run-tests.py` > > > Key: SPARK-27979 > URL: https://issues.apache.org/jira/browse/SPARK-27979 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > Since 2.0.0, SPARK-14867 deprecated `--force` option and ignores it. This > issue cleans up the code completely at 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27979) Remove deprecated `--force` option in `build/mvn` and `run-tests.py`
[ https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27979: Assignee: (was: Apache Spark) > Remove deprecated `--force` option in `build/mvn` and `run-tests.py` > > > Key: SPARK-27979 > URL: https://issues.apache.org/jira/browse/SPARK-27979 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since 2.0.0, SPARK-14867 deprecated `--force` option and ignores it. This > issue cleans up the code completely at 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27979) Remove deprecated `--force` option in `build/mvn` and `run-tests.py`
[ https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27979: -- Summary: Remove deprecated `--force` option in `build/mvn` and `run-tests.py` (was: Remove deprecated `--force` option in `build/mvn`) > Remove deprecated `--force` option in `build/mvn` and `run-tests.py` > > > Key: SPARK-27979 > URL: https://issues.apache.org/jira/browse/SPARK-27979 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since 2.0.0, SPARK-14867 deprecated `--force` option and ignores it. This > issue cleans up the code completely at 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27979) Remove deprecated `--force` option in `build/mvn`
[ https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-27979: -- Description: Since 2.0.0, SPARK-14867 deprecated `--force` option and ignores it. This issue cleans up the code completely at 3.0.0. (was: Since 2.0.0, `--force` option is removed and deprecated. This issue remove the code completely at 3.0.0.) > Remove deprecated `--force` option in `build/mvn` > - > > Key: SPARK-27979 > URL: https://issues.apache.org/jira/browse/SPARK-27979 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Minor > > Since 2.0.0, SPARK-14867 deprecated `--force` option and ignores it. This > issue cleans up the code completely at 3.0.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27979) Remove deprecated `--force` option in `build/mvn`
Dongjoon Hyun created SPARK-27979: - Summary: Remove deprecated `--force` option in `build/mvn` Key: SPARK-27979 URL: https://issues.apache.org/jira/browse/SPARK-27979 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.0.0 Reporter: Dongjoon Hyun Since 2.0.0, the `--force` option has been deprecated and ignored. This issue removes the code completely at 3.0.0.
[jira] [Created] (SPARK-27978) Add built-in Aggregate Functions: string_agg
Yuming Wang created SPARK-27978: --- Summary: Add built-in Aggregate Functions: string_agg Key: SPARK-27978 URL: https://issues.apache.org/jira/browse/SPARK-27978 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang
||Function||Argument Type(s)||Return Type||Partial Mode||Description||
|string_agg(_{{expression}}_, _{{delimiter}}_)|({{text}}, {{text}}) or ({{bytea}}, {{bytea}})|same as argument types|No|input values concatenated into a string, separated by delimiter|
https://www.postgresql.org/docs/current/functions-aggregate.html
We can currently work around it with concat_ws(_{{delimiter}}_, collect_list(_{{expression}}_)).
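The `concat_ws(delimiter, collect_list(expression))` workaround mentioned above has the same observable result as `string_agg` for non-null inputs, which can be modeled in plain Python (an illustration of the semantics, not Spark code):

```python
def string_agg(values, delimiter):
    """Concatenate non-null input values with a delimiter, mirroring
    string_agg / concat_ws(delimiter, collect_list(...)) (sketch)."""
    return delimiter.join(v for v in values if v is not None)

result = string_agg(["a", None, "b", "c"], ", ")   # nulls are skipped
```

Note that `collect_list` does not guarantee element ordering after a shuffle, so in practice the workaround may need an explicit sort first if a deterministic concatenation order matters.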
[jira] [Comment Edited] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859067#comment-16859067 ] Hyukjin Kwon edited comment on SPARK-27966 at 6/8/19 1:22 AM: -- It doesn't have to be a perfect reproducer. It's kind of difficult for other people like me to debug deeper with the current diagnosis. was (Author: hyukjin.kwon): It doesn't have to be a perfect reproducer. It's kind of difficult for other people like me to debug deeper win the current diagnosis.. > input_file_name empty when listing files in parallel > > > Key: SPARK-27966 > URL: https://issues.apache.org/jira/browse/SPARK-27966 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 > Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11) > > Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 > Workers: 3 > Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 >Reporter: Christian Homberg >Priority: Minor > Attachments: input_file_name_bug > > > I ran into an issue similar and probably related to SPARK-26128. The > _org.apache.spark.sql.functions.input_file_name_ is sometimes empty. > > {code:java} > df.select(input_file_name()).show(5,false) > {code} > > {code:java} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > +-+ > {code} > My environment is databricks and debugging the Log4j output showed me that > the issue occurred when the files are being listed in parallel, e.g. when > {code:java} > 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 127; threshold: 32 > 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories > in parallel under:{code} > > Everything's fine as long as > {code:java} > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. 
Size of Paths: 6; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > {code} > > Setting spark.sql.sources.parallelPartitionDiscovery.threshold to > resolves the issue for me. > > *edit: the problem is not exclusively linked to listing files in parallel. > I've setup a larger cluster for which after parallel file listing the > input_file_name did return the correct filename. After inspecting the log4j > again, I assume that it's linked to some kind of MetaStore being full. I've > attached a section of the log4j output that I think should indicate why it's > failing. If you need more, please let me know.* > ** > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
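The workaround described in the report (raising the parallel-listing threshold above the number of input paths, so listing stays on the driver) would be configured roughly as below. The exact value is elided in the report and is workload-dependent; the `10000` here is only a placeholder that must exceed your path count:

```
# spark-defaults.conf (placeholder value; tune to exceed your path count)
spark.sql.sources.parallelPartitionDiscovery.threshold   10000
```

The default of 32 matches the `threshold: 32` seen in the InMemoryFileIndex log lines above.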
[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859067#comment-16859067 ] Hyukjin Kwon commented on SPARK-27966: -- It doesn't have to be a perfect reproducer. It's kind of difficult for other people like me to debug deeper win the current diagnosis.. > input_file_name empty when listing files in parallel > > > Key: SPARK-27966 > URL: https://issues.apache.org/jira/browse/SPARK-27966 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 > Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11) > > Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 > Workers: 3 > Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 >Reporter: Christian Homberg >Priority: Minor > Attachments: input_file_name_bug > > > I ran into an issue similar and probably related to SPARK-26128. The > _org.apache.spark.sql.functions.input_file_name_ is sometimes empty. > > {code:java} > df.select(input_file_name()).show(5,false) > {code} > > {code:java} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > +-+ > {code} > My environment is databricks and debugging the Log4j output showed me that > the issue occurred when the files are being listed in parallel, e.g. when > {code:java} > 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 127; threshold: 32 > 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories > in parallel under:{code} > > Everything's fine as long as > {code:java} > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 6; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. 
Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > {code} > > Setting spark.sql.sources.parallelPartitionDiscovery.threshold to > resolves the issue for me. > > *edit: the problem is not exclusively linked to listing files in parallel. > I've setup a larger cluster for which after parallel file listing the > input_file_name did return the correct filename. After inspecting the log4j > again, I assume that it's linked to some kind of MetaStore being full. I've > attached a section of the log4j output that I think should indicate why it's > failing. If you need more, please let me know.* > ** > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27970) Support Hive 3.0 metastore
[ https://issues.apache.org/jira/browse/SPARK-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27970. - Resolution: Fixed Assignee: Yuming Wang Fix Version/s: 3.0.0 > Support Hive 3.0 metastore > -- > > Key: SPARK-27970 > URL: https://issues.apache.org/jira/browse/SPARK-27970 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: screenshot-1.png > > > It seems that some users are using Hive 3.0.0, at least HDP 3.0.0: > !https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67! > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
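With this change, pointing Spark at a Hive 3.0 metastore would go through the standard metastore-version configs, roughly as below (a sketch; consult the release documentation for the exact accepted values and jar-resolution modes):

```
# spark-defaults.conf (illustrative)
spark.sql.hive.metastore.version   3.0
spark.sql.hive.metastore.jars      maven
```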
[jira] [Commented] (SPARK-27937) Revert changes introduced as a part of Automatic namespace discovery [SPARK-24149]
[ https://issues.apache.org/jira/browse/SPARK-27937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859001#comment-16859001 ] Dhruve Ashar commented on SPARK-27937: -- The exception that we started encountering is while spark tries to create a path of the logic nameservice or nameservice id configured as a part of HDFS federation. {code:java} 19/05/20 08:48:42 INFO SecurityManager: Changing modify acls groups to: 19/05/20 08:48:42 INFO SecurityManager: SecurityManager: authentication enabled; ui acls enabled; users with view permissions: Set(...); groups with view permissions: Set(); users with modify permissions: Set(); groups with modify permissions: Set(.) 19/05/20 08:48:43 INFO Client: Deleted staging directory hdfs://..:8020/user/abc/.sparkStaging/application_123456_123456 Exception in thread "main" java.io.IOException: Cannot create proxy with unresolved address: abcabcabc-nn1:8020 at org.apache.hadoop.hdfs.NameNodeProxiesClient.createNonHAProxyWithClientProtocol(NameNodeProxiesClient.java:345) at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:133) at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:351) at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:285) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2821) at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:100) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2892) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2874) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:215) at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:214) at scala.Option.map(Option.scala:146) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:214) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:213) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.hadoopFSsToAccess(YarnSparkHadoopUtil.scala:213) at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43) at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43) at org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.obtainDelegationTokens(HadoopFSDelegationTokenProvider.scala:48) {code} > Revert changes introduced as a part of Automatic namespace discovery > [SPARK-24149] > -- > > Key: SPARK-27937 > URL: https://issues.apache.org/jira/browse/SPARK-27937 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.3 >Reporter: Dhruve Ashar >Priority: Major > > Spark fails to launch for a valid deployment of HDFS while trying to get > tokens for a logical nameservice instead of an actual namenode (with HDFS > federation enabled). 
> On inspecting the source code closely, it is unclear why we were doing it and > based on the context from SPARK-24149, it solves a very specific use case of > getting the tokens for only those namenodes which are configured for HDFS > federation in the same cluster. IMHO these are better left to the user to > specify explicitly. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27937) Revert changes introduced as a part of Automatic namespace discovery [SPARK-24149]
[ https://issues.apache.org/jira/browse/SPARK-27937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859001#comment-16859001 ] Dhruve Ashar edited comment on SPARK-27937 at 6/7/19 9:27 PM: -- The exception that we started encountering is while spark tries to create a path of the logic nameservice or nameservice id configured as a part of HDFS federation as a part of the code here: https://github.com/apache/spark/blob/v2.4.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L215 {code:java} 19/05/20 08:48:42 INFO SecurityManager: Changing modify acls groups to: 19/05/20 08:48:42 INFO SecurityManager: SecurityManager: authentication enabled; ui acls enabled; users with view permissions: Set(...); groups with view permissions: Set(); users with modify permissions: Set(); groups with modify permissions: Set(.) 19/05/20 08:48:43 INFO Client: Deleted staging directory hdfs://..:8020/user/abc/.sparkStaging/application_123456_123456 Exception in thread "main" java.io.IOException: Cannot create proxy with unresolved address: abcabcabc-nn1:8020 at org.apache.hadoop.hdfs.NameNodeProxiesClient.createNonHAProxyWithClientProtocol(NameNodeProxiesClient.java:345) at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:133) at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:351) at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:285) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2821) at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:100) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2892) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2874) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389) at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356) at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:215) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:214) at scala.Option.map(Option.scala:146) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:214) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:213) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241) at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186) at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.hadoopFSsToAccess(YarnSparkHadoopUtil.scala:213) at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43) at org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43) at org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.obtainDelegationTokens(HadoopFSDelegationTokenProvider.scala:48) {code} was (Author: dhruve ashar): The exception that we started encountering is while spark tries to create a path of the logic nameservice or nameservice id configured as a part of HDFS federation. {code:java} 19/05/20 08:48:42 INFO SecurityManager: Changing modify acls groups to: 19/05/20 08:48:42 INFO SecurityManager: SecurityManager: authentication enabled; ui acls enabled; users with view permissions: Set(...); groups with view permissions: Set(); users with modify permissions: Set(); groups with modify permissions: Set(.) 
19/05/20 08:48:43 INFO Client: Deleted staging directory hdfs://..:8020/user/abc/.sparkStaging/application_123456_123456 Exception in thread "main" java.io.IOException: Cannot create proxy with unresolved address: abcabcabc-nn1:8020 at org.apache.hadoop.hdfs.NameNodeProxiesClient.createNonHAProxyWithClientProtocol(NameNodeProxiesClient.java:345) at org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:133) at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:351) at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:285) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(Dis
[jira] [Resolved] (SPARK-27870) Flush each batch for pandas UDF (for improving pandas UDFs pipeline)
[ https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27870. - Resolution: Fixed Assignee: Weichen Xu Fix Version/s: 3.0.0 > Flush each batch for pandas UDF (for improving pandas UDFs pipeline) > > > Key: SPARK-27870 > URL: https://issues.apache.org/jira/browse/SPARK-27870 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > Fix For: 3.0.0 > > > Flush each batch for pandas UDF. > This could improve performance when multiple pandas UDF plans are pipelined. > When each batch is flushed in time, downstream pandas UDFs will get pipelined > as soon as possible, and pipelining will help hide the downstream UDFs' > computation time. For example: > When the first UDF starts computing on batch-3, the second pipelined UDF can > start computing on batch-2, and the third pipelined UDF can start computing > on batch-1. > If we do not flush each batch in time, the downstream UDF's pipeline will lag > behind too much, which may increase the total processing time.
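The pipelining effect described in the ticket can be illustrated with plain Python generators standing in for chained pandas UDFs (an analogy only, not Spark's execution path): when each stage emits, i.e. "flushes", a batch as soon as it is ready, downstream stages start work on batch N while upstream stages move on to batch N+1, instead of waiting for the whole input.

```python
def stage(batches, name, log):
    """A pipeline stage that flushes each batch downstream immediately."""
    for batch in batches:
        log.append((name, batch))   # record when this stage sees the batch
        yield batch + 1             # flush: downstream can consume right away

log = []
# Three chained stages, analogous to three pipelined UDFs.
s1 = stage(iter([0, 1]), "udf1", log)
s2 = stage(s1, "udf2", log)
s3 = stage(s2, "udf3", log)
out = list(s3)
```

The resulting `log` interleaves the three stages batch by batch, rather than showing `udf1` processing everything before `udf2` starts; buffering all batches before flushing would produce the latter, serialized order.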
[jira] [Assigned] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs
[ https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-27823: - Assignee: Thomas Graves > Add an abstraction layer for accelerator resource handling to avoid > manipulating raw confs > -- > > Key: SPARK-27823 > URL: https://issues.apache.org/jira/browse/SPARK-27823 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Thomas Graves >Priority: Major > > In SPARK-27488, we extract resource requests and allocation by parsing raw > Spark confs. This hurts readability because we didn't have the abstraction at > resource level. After we merge the core changes, we should do a refactoring > and make the code more readable. > See https://github.com/apache/spark/pull/24615#issuecomment-494580663. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs
[ https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27823: Assignee: Apache Spark > Add an abstraction layer for accelerator resource handling to avoid > manipulating raw confs > -- > > Key: SPARK-27823 > URL: https://issues.apache.org/jira/browse/SPARK-27823 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Assignee: Apache Spark >Priority: Major > > In SPARK-27488, we extract resource requests and allocation by parsing raw > Spark confs. This hurts readability because we didn't have the abstraction at > resource level. After we merge the core changes, we should do a refactoring > and make the code more readable. > See https://github.com/apache/spark/pull/24615#issuecomment-494580663. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs
[ https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-27823: Assignee: (was: Apache Spark) > Add an abstraction layer for accelerator resource handling to avoid > manipulating raw confs > -- > > Key: SPARK-27823 > URL: https://issues.apache.org/jira/browse/SPARK-27823 > Project: Spark > Issue Type: Story > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiangrui Meng >Priority: Major > > In SPARK-27488, we extract resource requests and allocation by parsing raw > Spark confs. This hurts readability because we didn't have the abstraction at > resource level. After we merge the core changes, we should do a refactoring > and make the code more readable. > See https://github.com/apache/spark/pull/24615#issuecomment-494580663. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
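The refactoring the ticket describes can be sketched as moving from raw conf-string parsing to a value-level resource abstraction, with conf serialization confined to one translation point. Everything below (class name, conf-key layout) is a hypothetical plain-Python illustration, not Spark's actual API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """Hypothetical resource abstraction; Spark's real classes differ."""
    component: str  # "driver" or "executor"
    resource: str   # e.g. "gpu", "fpga"
    amount: int


def to_conf(req: ResourceRequest) -> tuple:
    # Raw-conf manipulation happens only here, at the boundary,
    # instead of being scattered through scheduling code.
    return (f"spark.{req.component}.resource.{req.resource}.amount",
            str(req.amount))


print(to_conf(ResourceRequest("executor", "gpu", 2)))
# ('spark.executor.resource.gpu.amount', '2')
```

Internals would then pass `ResourceRequest` values around and never re-parse conf strings, which is the readability gain the ticket is after.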
[jira] [Comment Edited] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858739#comment-16858739 ] Christian Homberg edited comment on SPARK-27966 at 6/7/19 3:32 PM: --- I'm afraid I can't. For one thing I can't share the data, for another even I'm not always able to reproduce the bug. For exactly the same data, code and a clean environment I get filenames and sometimes I don't. All I can provide is logging information and try to debug the issue if anyone can give me pointers. I can say though, that this has not been an issue so far with a larger spark cluster. Then again, the input data is "only" ~3,000 files, each < 1mb. So I don't get why the original cluster should have any problems regarding size. was (Author: chr_96er): I'm afraid I can't. For one thing I can't share the data, for another even I'm not always able to reproduce the bug. For exactly the same data, code and a clean environment I get filenames and sometimes I don't. All I can provide is logging information and try to debug the issue if anyone can give me pointers. > input_file_name empty when listing files in parallel > > > Key: SPARK-27966 > URL: https://issues.apache.org/jira/browse/SPARK-27966 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 > Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11) > > Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 > Workers: 3 > Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 >Reporter: Christian Homberg >Priority: Minor > Attachments: input_file_name_bug > > > I ran into an issue similar and probably related to SPARK-26128. The > _org.apache.spark.sql.functions.input_file_name_ is sometimes empty. 
> > {code:java} > df.select(input_file_name()).show(5,false) > {code} > > {code:java} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > +-+ > {code} > My environment is databricks and debugging the Log4j output showed me that > the issue occurred when the files are being listed in parallel, e.g. when > {code:java} > 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 127; threshold: 32 > 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories > in parallel under:{code} > > Everything's fine as long as > {code:java} > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 6; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > {code} > > Setting spark.sql.sources.parallelPartitionDiscovery.threshold to > resolves the issue for me. > > *edit: the problem is not exclusively linked to listing files in parallel. > I've setup a larger cluster for which after parallel file listing the > input_file_name did return the correct filename. After inspecting the log4j > again, I assume that it's linked to some kind of MetaStore being full. I've > attached a section of the log4j output that I think should indicate why it's > failing. 
If you need more, please let me know.* > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel
[ https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858739#comment-16858739 ] Christian Homberg commented on SPARK-27966: --- I'm afraid I can't. For one thing I can't share the data, for another even I'm not always able to reproduce the bug. For exactly the same data, code and a clean environment I get filenames and sometimes I don't. All I can provide is logging information and try to debug the issue if anyone can give me pointers. > input_file_name empty when listing files in parallel > > > Key: SPARK-27966 > URL: https://issues.apache.org/jira/browse/SPARK-27966 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.4.0 > Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11) > > Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 > Workers: 3 > Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2 >Reporter: Christian Homberg >Priority: Minor > Attachments: input_file_name_bug > > > I ran into an issue similar and probably related to SPARK-26128. The > _org.apache.spark.sql.functions.input_file_name_ is sometimes empty. > > {code:java} > df.select(input_file_name()).show(5,false) > {code} > > {code:java} > +-+ > |input_file_name()| > +-+ > | | > | | > | | > | | > | | > +-+ > {code} > My environment is databricks and debugging the Log4j output showed me that > the issue occurred when the files are being listed in parallel, e.g. when > {code:java} > 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 127; threshold: 32 > 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories > in parallel under:{code} > > Everything's fine as long as > {code:java} > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. 
Size of Paths: 6; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and > directories. Size of Paths: 0; threshold: 32 > {code} > > Setting spark.sql.sources.parallelPartitionDiscovery.threshold to > resolves the issue for me. > > *edit: the problem is not exclusively linked to listing files in parallel. > I've setup a larger cluster for which after parallel file listing the > input_file_name did return the correct filename. After inspecting the log4j > again, I assume that it's linked to some kind of MetaStore being full. I've > attached a section of the log4j output that I think should indicate why it's > failing. If you need more, please let me know.* > ** > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
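The log lines quoted above imply a simple decision: listing goes parallel once the number of paths exceeds the configured threshold (32 by default). A plain-Python sketch of that check, assuming a strictly-greater comparison (the 127-vs-32 and 6-vs-32 log lines are consistent with this either way; the real logic lives in InMemoryFileIndex):

```python
def uses_parallel_listing(num_paths: int, threshold: int = 32) -> bool:
    """Mirror of the decision visible in the InMemoryFileIndex logs:
    "Size of Paths: 127; threshold: 32" -> parallel listing kicks in."""
    return num_paths > threshold


print(uses_parallel_listing(127))  # True  -> parallel (the failing case)
print(uses_parallel_listing(6))    # False -> sequential (the working case)
```

Raising `spark.sql.sources.parallelPartitionDiscovery.threshold` above the number of input paths, as the reporter did, forces the sequential path and sidesteps the empty `input_file_name()` results.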
[jira] [Resolved] (SPARK-27932) Update jackson versions on 2.4.x and 2.3.x branches
[ https://issues.apache.org/jira/browse/SPARK-27932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Dettinger resolved SPARK-27932. Resolution: Won't Fix Right, I didn't get that possible fixes/workarounds were already discussed. Thanks for reporting. I think this ticket could be closed as 'Won't Fix' then. > Update jackson versions on 2.4.x and 2.3.x branches > --- > > Key: SPARK-27932 > URL: https://issues.apache.org/jira/browse/SPARK-27932 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.3.3, 2.4.3 >Reporter: Alex Dettinger >Priority: Major > > SPARK-27051 has bumped jackson versions to 2.9.8, which is good. > Would it be possible to upgrade the jackson version to >= 2.9.8 for > spark-2.4.x, spark-2.3.x ? > In case >= 2.9.8 is not possible, versions below would be ok too: > * jackson >= 2.8.11.3 > * jackson >= 2.7.9.5 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27932) Update jackson versions on 2.4.x and 2.3.x branches
[ https://issues.apache.org/jira/browse/SPARK-27932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858630#comment-16858630 ] Sean Owen commented on SPARK-27932: --- I don't see how you can update to 2.7.x and not get the behavior change? we already had this discussion and pretty much concluded not to do so. > Update jackson versions on 2.4.x and 2.3.x branches > --- > > Key: SPARK-27932 > URL: https://issues.apache.org/jira/browse/SPARK-27932 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.3.3, 2.4.3 >Reporter: Alex Dettinger >Priority: Major > > SPARK-27051 has bumped jackson versions to 2.9.8, which is good. > Would it be possible to upgrade the jackson version to >= 2.9.8 for > spark-2.4.x, spark-2.3.x ? > In case >= 2.9.8 is not possible, versions below would be ok too: > * jackson >= 2.8.11.3 > * jackson >= 2.7.9.5 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27932) Update jackson versions on 2.4.x and 2.3.x branches
[ https://issues.apache.org/jira/browse/SPARK-27932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858624#comment-16858624 ] Alex Dettinger commented on SPARK-27932: [~srowen] stated in [a somewhat related PR|https://github.com/apache/spark/pull/24493] that it appears hard to upgrade jackson-databind > 2.6 on spark 2.3.x, 2.4.x branches. A key aspect to keep in mind is that jackson-databind introduced a behavior change in 2.7 onward. I propose to keep this ticket opened a bit of time in case someone could come up with a bright idea. > Update jackson versions on 2.4.x and 2.3.x branches > --- > > Key: SPARK-27932 > URL: https://issues.apache.org/jira/browse/SPARK-27932 > Project: Spark > Issue Type: Task > Components: Spark Core >Affects Versions: 2.3.3, 2.4.3 >Reporter: Alex Dettinger >Priority: Major > > SPARK-27051 has bumped jackson versions to 2.9.8, which is good. > Would it be possible to upgrade the jackson version to >= 2.9.8 for > spark-2.4.x, spark-2.3.x ? > In case >= 2.9.8 is not possible, versions below would be ok too: > * jackson >= 2.8.11.3 > * jackson >= 2.7.9.5 > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage
[ https://issues.apache.org/jira/browse/SPARK-27973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-27973: - Assignee: Yuexin Zhang > Streaming sample DirectKafkaWordCount should mention GroupId in usage > - > > Key: SPARK-27973 > URL: https://issues.apache.org/jira/browse/SPARK-27973 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.4.3 >Reporter: Yuexin Zhang >Assignee: Yuexin Zhang >Priority: Trivial > > The DirectKafkaWordCount sample has been updated to take Consumer Group Id as > one of the input arguments, but we missed it in the sample usage: > System.err.println(s""" > |Usage: DirectKafkaWordCount <brokers> <topics> > | <brokers> is a list of one or more Kafka brokers > | <groupId> is a consumer group name to consume from topics > | <topics> is a list of one or more kafka topics to consume from > | > """.stripMargin) > Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage
[ https://issues.apache.org/jira/browse/SPARK-27973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-27973. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 24819 [https://github.com/apache/spark/pull/24819] > Streaming sample DirectKafkaWordCount should mention GroupId in usage > - > > Key: SPARK-27973 > URL: https://issues.apache.org/jira/browse/SPARK-27973 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.4.3 >Reporter: Yuexin Zhang >Assignee: Yuexin Zhang >Priority: Trivial > Fix For: 3.0.0 > > > The DirectKafkaWordCount sample has been updated to take Consumer Group Id as > one of the input arguments, but we missed it in the sample usage: > System.err.println(s""" > |Usage: DirectKafkaWordCount <brokers> <topics> > | <brokers> is a list of one or more Kafka brokers > | <groupId> is a consumer group name to consume from topics > | <topics> is a list of one or more kafka topics to consume from > | > """.stripMargin) > Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage
[ https://issues.apache.org/jira/browse/SPARK-27973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-27973: -- Priority: Trivial (was: Minor) (This is too trivial for a JIRA; the description and fix are all but identical) > Streaming sample DirectKafkaWordCount should mention GroupId in usage > - > > Key: SPARK-27973 > URL: https://issues.apache.org/jira/browse/SPARK-27973 > Project: Spark > Issue Type: Improvement > Components: Examples >Affects Versions: 2.4.3 >Reporter: Yuexin Zhang >Priority: Trivial > > The DirectKafkaWordCount sample has been updated to take Consumer Group Id as > one of the input arguments, but we missed it in the sample usage: > System.err.println(s""" > |Usage: DirectKafkaWordCount <brokers> <topics> > | <brokers> is a list of one or more Kafka brokers > | <groupId> is a consumer group name to consume from topics > | <topics> is a list of one or more kafka topics to consume from > | > """.stripMargin) > Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics> -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
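For reference, here is a plain-Python rendering of the corrected usage text the ticket asks for, with the group id argument included between brokers and topics (a reconstruction; the real example is Scala and prints via System.err):

```python
import sys

# Reconstructed usage text; argument names mirror the issue description.
USAGE = """Usage: DirectKafkaWordCount <brokers> <groupId> <topics>
  <brokers> is a list of one or more Kafka brokers
  <groupId> is a consumer group name to consume from topics
  <topics> is a list of one or more kafka topics to consume from"""


def print_usage_and_exit() -> None:
    print(USAGE, file=sys.stderr)
    sys.exit(1)
```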
[jira] [Created] (SPARK-27977) MicroBatchWriter should use StreamWriter for human-friendly textual representation (toString)
Jacek Laskowski created SPARK-27977: --- Summary: MicroBatchWriter should use StreamWriter for human-friendly textual representation (toString) Key: SPARK-27977 URL: https://issues.apache.org/jira/browse/SPARK-27977 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Jacek Laskowski The following is an extended explain for a streaming query: {code} == Parsed Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Project [value#39 AS value#0] +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Analyzed Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Project [value#39 AS value#0] +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Optimized Logical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=]) == Physical Plan == WriteToDataSourceV2 org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef +- *(1) Project [value#39] +- *(1) ScanV2 socket[value#39] (Options: [host=localhost,port=]) {code} As you may have noticed, {{WriteToDataSourceV2}} is followed by the internal representation of {{MicroBatchWriter}}, which is a mere adapter for {{StreamWriter}}, e.g. {{ConsoleWriter}}. It'd be more debugging-friendly if the plans included whatever {{StreamWriter.toString}} would return (which in the case of {{ConsoleWriter}} would be {{ConsoleWriter[numRows=..., truncate=...]}}, which gives more context). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
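The fix amounts to delegating the adapter's textual representation to the wrapped writer. A language-neutral sketch in Python (class names mimic the Scala ones; this is an illustration of the delegation, not the actual Spark code):

```python
class ConsoleWriter:
    def __init__(self, num_rows: int, truncate: bool):
        self.num_rows = num_rows
        self.truncate = truncate

    def __repr__(self) -> str:
        return f"ConsoleWriter[numRows={self.num_rows}, truncate={self.truncate}]"


class MicroBatchWriter:
    """Adapter around a stream writer. Delegating __repr__ replaces the
    opaque default (e.g. MicroBatchWriter@4737caef) with the wrapped
    writer's human-friendly text in explain output."""
    def __init__(self, writer):
        self.writer = writer

    def __repr__(self) -> str:
        return repr(self.writer)


print(MicroBatchWriter(ConsoleWriter(20, True)))
# ConsoleWriter[numRows=20, truncate=True]
```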
[jira] [Created] (SPARK-27976) Add built-in Array Functions: array_append
Yuming Wang created SPARK-27976: --- Summary: Add built-in Array Functions: array_append Key: SPARK-27976 URL: https://issues.apache.org/jira/browse/SPARK-27976 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Yuming Wang ||Function||Return Type||Description||Example||Result|| |{{array_append}}{{(}}{{anyarray}}{{,}}{{anyelement}}{{)}}|{{anyarray}}|append an element to the end of an array|{{array_append(ARRAY[1,2], 3)}}|{{{1,2,3}}}| https://www.postgresql.org/docs/current/functions-array.html Other DBs: https://phoenix.apache.org/language/functions.html#array_append https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
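The proposed function's semantics are simple; a one-line Python model of the PostgreSQL behavior from the table above:

```python
def array_append(arr, elem):
    """Append elem to the end of arr: array_append(ARRAY[1,2], 3) -> {1,2,3}."""
    return list(arr) + [elem]


print(array_append([1, 2], 3))  # [1, 2, 3]
```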
[jira] [Created] (SPARK-27975) ConsoleSink should display alias and options for streaming progress
Jacek Laskowski created SPARK-27975: --- Summary: ConsoleSink should display alias and options for streaming progress Key: SPARK-27975 URL: https://issues.apache.org/jira/browse/SPARK-27975 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.4.3 Reporter: Jacek Laskowski {{console}} sink shows itself in progress with this internal instance representation as follows: {code:json} "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@12fa674a" } {code} That is not very user-friendly and would be much better for debugging if it included the alias and options as {{socket}} does: {code} "sources" : [ { "description" : "TextSocketV2[host: localhost, port: ]", ... } ], {code} The entire sample progress looks as follows: {code} 19/06/07 11:47:18 INFO MicroBatchExecution: Streaming query made progress: { "id" : "26bedc9f-076f-4b15-8e17-f09609aaecac", "runId" : "8c365e74-7ac9-4fad-bf1b-397eb086661e", "name" : "socket-console", "timestamp" : "2019-06-07T09:47:18.969Z", "batchId" : 2, "numInputRows" : 0, "inputRowsPerSecond" : 0.0, "durationMs" : { "getEndOffset" : 0, "setOffsetRange" : 0, "triggerExecution" : 0 }, "stateOperators" : [ ], "sources" : [ { "description" : "TextSocketV2[host: localhost, port: ]", "startOffset" : 0, "endOffset" : 0, "numInputRows" : 0, "inputRowsPerSecond" : 0.0 } ], "sink" : { "description" : "org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@12fa674a" } } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
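The requested format mirrors what the socket source already does: an alias plus key/value options. A hypothetical formatter showing the shape (function and option names are assumptions for illustration, not Spark's API):

```python
def sink_description(alias: str, options: dict) -> str:
    """Format a sink like TextSocketV2 does: alias[key: value, ...]."""
    opts = ", ".join(f"{k}: {v}" for k, v in options.items())
    return f"{alias}[{opts}]"


print(sink_description("console", {"numRows": 20, "truncate": "true"}))
# console[numRows: 20, truncate: true]
```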
[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858430#comment-16858430 ] Edwin Biemond commented on SPARK-27927: --- Just doing a spark-submit on the same host (same pod) works fine. In k8s the driver just hangs when I don't have this sparkContext.stop(). > driver pod hangs with pyspark 2.4.3 and master on kubernetes > --- > > Key: SPARK-27927 > URL: https://issues.apache.org/jira/browse/SPARK-27927 > Project: Spark > Issue Type: Bug > Components: Kubernetes, PySpark >Affects Versions: 3.0.0, 2.4.3 > Environment: k8s 1.11.9 > spark 2.4.3 and master branch. >Reporter: Edwin Biemond >Priority: Major > > When we run a simple pyspark script on spark 2.4.3 or 3.0.0 the driver pod hangs > and never calls the shutdown hook. > {code:java} > #!/usr/bin/env python > from __future__ import print_function > import os > import os.path > import sys > # Are we really in Spark? > from pyspark.sql import SparkSession > spark = SparkSession.builder.appName('hello_world').getOrCreate() > print('Our Spark version is {}'.format(spark.version)) > print('Spark context information: {} parallelism={} python version={}'.format( > str(spark.sparkContext), > spark.sparkContext.defaultParallelism, > spark.sparkContext.pythonVer > )) > {code} > When we run this on kubernetes the driver and executor are just hanging. We > see the output of this python script. > {noformat} > bash-4.2# cat stdout.log > Our Spark version is 2.4.3 > Spark context information: master=k8s://https://kubernetes.default.svc:443 appName=hello_world> > parallelism=2 python version=3.6{noformat} > What works > * a simple python script with a print works fine on 2.4.3 and 3.0.0 > * same setup on 2.4.0 > * 2.4.3 spark-submit with the above pyspark > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
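Until the hang is fixed, the comments point at making sure `stop()` always runs. A hedged sketch of one way to guarantee that with a context manager (plain Python; `builder_fn` is any callable returning a SparkSession, and this is a workaround pattern, not a fix for the underlying bug):

```python
from contextlib import contextmanager


@contextmanager
def spark_session(builder_fn):
    """Guarantee stop() runs even if the job body raises, so the
    Kubernetes driver pod can exit instead of hanging."""
    spark = builder_fn()
    try:
        yield spark
    finally:
        spark.stop()


# Usage (hypothetical):
#   with spark_session(lambda: SparkSession.builder.appName('hello_world').getOrCreate()) as spark:
#       print(spark.version)
```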
[jira] [Commented] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables
[ https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858402#comment-16858402 ] Hyukjin Kwon commented on SPARK-27785: -- To me, I don't have much information about how common this typed API is. If this is common enough and asked for frequently somewhere, it might be worth doing. The problem sounds valid, but I feel I'm missing a sense of how important this API is. For instance, we probably won't expose such an API for 1 to 22 arguments like UDFs. > Introduce .joinWith() overloads for typed inner joins of 3 or more tables > - > > Key: SPARK-27785 > URL: https://issues.apache.org/jira/browse/SPARK-27785 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Josh Rosen >Priority: Major > > Today it's rather painful to do a typed dataset join of more than two tables: > {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining > on a third inner join requires users to specify a complicated join condition > (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), > resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become > even more painful if you want to layer on a fourth join. Using {{.map()}} to > flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too. > To simplify this use case, I propose to introduce a new set of overloads of > {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some > reasonable number (say, 6). For example: > {code:java} > Dataset[T].joinWith[T1, T2]( > ds1: Dataset[T1], > ds2: Dataset[T2], > condition: Column > ): Dataset[(T, T1, T2)] > Dataset[T].joinWith[T1, T2, T3]( > ds1: Dataset[T1], > ds2: Dataset[T2], > ds3: Dataset[T3], > condition: Column > ): Dataset[(T, T1, T2, T3)]{code} > I propose to do this only for inner joins (consistent with the default join > type for {{joinWith}} in case joins are not specified). 
> I haven't thought about this too much yet and am not committed to the API > proposed above (it's just my initial idea), so I'm open to suggestions for > alternative typed APIs for this. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
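The pain point is the nested result shape: chaining two-way joinWith yields ((A, B), C), which callers must flatten themselves. A plain-Python model of that flattening step, i.e. the boilerplate (and extra map()) the proposed three-table overload would eliminate:

```python
def flatten3(nested):
    """((a, b), c) -> (a, b, c): the shape change a Dataset[((A, B), C)]
    user currently pays a map() for."""
    (a, b), c = nested
    return (a, b, c)


print(flatten3(((1, "x"), 3.0)))  # (1, 'x', 3.0)
```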
[jira] [Resolved] (SPARK-27965) Add extractors for logical transforms
[ https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-27965. - Resolution: Fixed Assignee: Ryan Blue Fix Version/s: 3.0.0 > Add extractors for logical transforms > - > > Key: SPARK-27965 > URL: https://issues.apache.org/jira/browse/SPARK-27965 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Extractors can be used to make any Transform class appear like a case class > to Spark internals. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org