[jira] [Commented] (SPARK-27504) File source V2: support refreshing metadata cache

2019-06-07 Thread Dongjoon Hyun (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859126#comment-16859126
 ] 

Dongjoon Hyun commented on SPARK-27504:
---

This feature will be reverted by SPARK-27961.

> File source V2: support refreshing metadata cache
> -
>
> Key: SPARK-27504
> URL: https://issues.apache.org/jira/browse/SPARK-27504
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> In file source V1, if some file is deleted manually, reading the 
> DataFrame/Table will throw an exception with the suggestion message "It is 
> possible the underlying files have been updated. You can explicitly 
> invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
> SQL or by recreating the Dataset/DataFrame involved.".
> After refreshing the table/DataFrame, the reads should return correct results.
> We should follow it in file source V2 as well.
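The cache-staleness problem the description refers to can be sketched outside Spark with a toy file-listing cache (the class and method names below are illustrative, not Spark's actual API):

```python
import os

class FileIndexCache:
    """Toy sketch of a cached file listing: reads served from the cache
    can go stale when files are deleted on disk; refresh() invalidates
    the cache so the next read re-lists the directory."""
    def __init__(self, path):
        self.path = path
        self._files = None

    def list_files(self):
        if self._files is None:                 # populate lazily on first read
            self._files = sorted(os.listdir(self.path))
        return self._files                      # possibly stale after deletes

    def refresh(self):
        self._files = None                      # drop the cached listing
```

After a file is deleted manually, `list_files()` keeps returning the stale listing until `refresh()` is called, which mirrors the `REFRESH TABLE` behavior the description asks file source V2 to support.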



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`

2019-06-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27981:


Assignee: (was: Apache Spark)

> Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`
> --
>
> Key: SPARK-27981
> URL: https://issues.apache.org/jira/browse/SPARK-27981
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This PR aims to remove the following warnings for `java.nio.Bits.unaligned` 
> on JDK 9/10/11/12. Note that there are more warnings beyond the scope of 
> this PR.
> {code}
> bin/spark-shell --driver-java-options=--illegal-access=warn
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar)
>  to method java.nio.Bits.unaligned()
> ...
> {code}






[jira] [Assigned] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`

2019-06-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27981:


Assignee: Apache Spark

> Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`
> --
>
> Key: SPARK-27981
> URL: https://issues.apache.org/jira/browse/SPARK-27981
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This PR aims to remove the following warnings for `java.nio.Bits.unaligned` 
> on JDK 9/10/11/12. Note that there are more warnings beyond the scope of 
> this PR.
> {code}
> bin/spark-shell --driver-java-options=--illegal-access=warn
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar)
>  to method java.nio.Bits.unaligned()
> ...
> {code}






[jira] [Created] (SPARK-27981) Remove `Illegal reflective access` warning for `java.nio.Bits.unaligned()`

2019-06-07 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27981:
-

 Summary: Remove `Illegal reflective access` warning for 
`java.nio.Bits.unaligned()`
 Key: SPARK-27981
 URL: https://issues.apache.org/jira/browse/SPARK-27981
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


This PR aims to remove the following warnings for `java.nio.Bits.unaligned` on 
JDK 9/10/11/12. Note that there are more warnings beyond the scope of this PR.
{code}
bin/spark-shell --driver-java-options=--illegal-access=warn
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
(file:/Users/dhyun/APACHE/spark-release/spark-3.0/assembly/target/scala-2.12/jars/spark-unsafe_2.12-3.0.0-SNAPSHOT.jar)
 to method java.nio.Bits.unaligned()
...
{code}






[jira] [Created] (SPARK-27980) Add built-in Ordered-Set Aggregate Functions: percentile_cont

2019-06-07 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27980:
---

 Summary: Add built-in Ordered-Set Aggregate Functions: 
percentile_cont
 Key: SPARK-27980
 URL: https://issues.apache.org/jira/browse/SPARK-27980
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Direct Argument Type(s)||Aggregated Argument Type(s)||Return 
Type||Partial Mode||Description||
|{{percentile_cont(_{{fraction}}_) WITHIN GROUP (ORDER BY 
_{{sort_expression}}_)}}|{{double precision}}|{{double precision}} or 
{{interval}}|same as sort expression|No|continuous percentile: returns a value 
corresponding to the specified fraction in the ordering, interpolating between 
adjacent input items if needed|
|{{percentile_cont(_{{fractions}}_) WITHIN GROUP (ORDER 
BY_{{sort_expression}}_)}}|{{double precision[]}}|{{double precision}} or 
{{interval}}|array of sort expression's type|No|multiple continuous percentile: 
returns an array of results matching the shape of the _{{fractions}}_ 
parameter, with each non-null element replaced by the value corresponding to 
that percentile|

https://www.postgresql.org/docs/current/functions-aggregate.html

Other DBs:
https://docs.aws.amazon.com/redshift/latest/dg/r_PERCENTILE_CONT.html
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/PERCENTILE_CONTAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_25
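A pure-Python sketch of the continuous-percentile semantics in the table above (linear interpolation between adjacent ordered inputs; this illustrates the SQL semantics, not Spark internals):

```python
def percentile_cont(fraction, values):
    """Continuous percentile: the value at the given fraction of the
    ordering, interpolating between adjacent inputs when the target
    position falls between two rows."""
    ordered = sorted(values)
    if not ordered:
        return None
    pos = fraction * (len(ordered) - 1)   # fractional row position
    lo = int(pos)                         # row at or below the position
    hi = min(lo + 1, len(ordered) - 1)    # row at or above the position
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])
```

For example, `percentile_cont(0.5, [1, 2, 3, 4])` interpolates between 2 and 3 to return 2.5, matching `PERCENTILE_CONT(0.5)` in the databases linked above.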







[jira] [Assigned] (SPARK-27979) Remove deprecated `--force` option in `build/mvn` and `run-tests.py`

2019-06-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27979:


Assignee: Apache Spark

> Remove deprecated `--force` option in `build/mvn` and `run-tests.py`
> 
>
> Key: SPARK-27979
> URL: https://issues.apache.org/jira/browse/SPARK-27979
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> Since 2.0.0, SPARK-14867 has deprecated the `--force` option and ignored it. 
> This issue cleans up the code completely in 3.0.0.






[jira] [Assigned] (SPARK-27979) Remove deprecated `--force` option in `build/mvn` and `run-tests.py`

2019-06-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27979:


Assignee: (was: Apache Spark)

> Remove deprecated `--force` option in `build/mvn` and `run-tests.py`
> 
>
> Key: SPARK-27979
> URL: https://issues.apache.org/jira/browse/SPARK-27979
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since 2.0.0, SPARK-14867 has deprecated the `--force` option and ignored it. 
> This issue cleans up the code completely in 3.0.0.






[jira] [Updated] (SPARK-27979) Remove deprecated `--force` option in `build/mvn` and `run-tests.py`

2019-06-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27979:
--
Summary: Remove deprecated `--force` option in `build/mvn` and 
`run-tests.py`  (was: Remove deprecated `--force` option in `build/mvn`)

> Remove deprecated `--force` option in `build/mvn` and `run-tests.py`
> 
>
> Key: SPARK-27979
> URL: https://issues.apache.org/jira/browse/SPARK-27979
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since 2.0.0, SPARK-14867 has deprecated the `--force` option and ignored it. 
> This issue cleans up the code completely in 3.0.0.






[jira] [Updated] (SPARK-27979) Remove deprecated `--force` option in `build/mvn`

2019-06-07 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27979:
--
Description: Since 2.0.0, SPARK-14867 deprecated `--force` option and 
ignores it. This issue cleans up the code completely at 3.0.0.  (was: Since 
2.0.0, `--force` option is removed and deprecated. This issue remove the code 
completely at 3.0.0.)

> Remove deprecated `--force` option in `build/mvn`
> -
>
> Key: SPARK-27979
> URL: https://issues.apache.org/jira/browse/SPARK-27979
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Since 2.0.0, SPARK-14867 has deprecated the `--force` option and ignored it. 
> This issue cleans up the code completely in 3.0.0.






[jira] [Created] (SPARK-27979) Remove deprecated `--force` option in `build/mvn`

2019-06-07 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-27979:
-

 Summary: Remove deprecated `--force` option in `build/mvn`
 Key: SPARK-27979
 URL: https://issues.apache.org/jira/browse/SPARK-27979
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.0.0
Reporter: Dongjoon Hyun


Since 2.0.0, the `--force` option has been deprecated and ignored. This issue 
removes the code completely in 3.0.0.






[jira] [Created] (SPARK-27978) Add built-in Aggregate Functions: string_agg

2019-06-07 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27978:
---

 Summary: Add built-in Aggregate Functions: string_agg
 Key: SPARK-27978
 URL: https://issues.apache.org/jira/browse/SPARK-27978
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Argument Type(s)||Return Type||Partial Mode||Description||
|string_agg(_{{expression}}_,_{{delimiter}}_)|({{text}}, {{text}}) or 
({{bytea}}, {{bytea}})|same as argument types|No|input values concatenated into 
a string, separated by delimiter|

https://www.postgresql.org/docs/current/functions-aggregate.html

We can workaround it by concat_ws(_{{delimiter}}_, 
collect_list(_{{expression}}_)) currently.
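The semantics in the table above, sketched in plain Python (SQL NULLs are modeled as None and skipped, as SQL aggregates do; this illustrates the function's behavior, not Spark internals):

```python
def string_agg(values, delimiter):
    """Concatenate non-NULL input values into one string separated by
    the delimiter; all-NULL input yields NULL (None), as in SQL."""
    non_null = [v for v in values if v is not None]
    return delimiter.join(non_null) if non_null else None
```

This is the same result the `concat_ws(delimiter, collect_list(expression))` workaround mentioned above produces for string inputs.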






[jira] [Comment Edited] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859067#comment-16859067
 ] 

Hyukjin Kwon edited comment on SPARK-27966 at 6/8/19 1:22 AM:
--

It doesn't have to be a perfect reproducer. It's kind of difficult for other 
people like me to debug deeper with the current diagnosis.


was (Author: hyukjin.kwon):
It doesn't have to be a perfect reproducer. It's kind of difficult for other 
people like me to debug deeper win the current diagnosis..

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar to, and probably related to, SPARK-26128. The 
> _org.apache.spark.sql.functions.input_file_name_ column is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-+
> |input_file_name()|
> +-+
> | |
> | |
> | |
> | |
> | |
> +-+
> {code}
> My environment is Databricks, and debugging the Log4j output showed me that 
> the issue occurs when the files are listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've set up a larger cluster for which, after parallel file listing, 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  
>  






[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859067#comment-16859067
 ] 

Hyukjin Kwon commented on SPARK-27966:
--

It doesn't have to be a perfect reproducer. It's kind of difficult for other 
people like me to debug deeper with the current diagnosis.

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar to, and probably related to, SPARK-26128. The 
> _org.apache.spark.sql.functions.input_file_name_ column is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-+
> |input_file_name()|
> +-+
> | |
> | |
> | |
> | |
> | |
> +-+
> {code}
> My environment is Databricks, and debugging the Log4j output showed me that 
> the issue occurs when the files are listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've set up a larger cluster for which, after parallel file listing, 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  
>  






[jira] [Resolved] (SPARK-27970) Support Hive 3.0 metastore

2019-06-07 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27970.
-
   Resolution: Fixed
 Assignee: Yuming Wang
Fix Version/s: 3.0.0

> Support Hive 3.0 metastore
> --
>
> Key: SPARK-27970
> URL: https://issues.apache.org/jira/browse/SPARK-27970
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: screenshot-1.png
>
>
> It seems that some users are using Hive 3.0.0, at least HDP 3.0.0:
> !https://camo.githubusercontent.com/736d8a9f04d3960e0cdc3a8ee09aa199ce103b51/68747470733a2f2f32786262686a786336776b3376323170363274386e3464342d7770656e67696e652e6e6574646e612d73736c2e636f6d2f77702d636f6e74656e742f75706c6f6164732f323031382f31322f6864702d332e312e312d4173706172616775732e706e67!
>  






[jira] [Commented] (SPARK-27937) Revert changes introduced as a part of Automatic namespace discovery [SPARK-24149]

2019-06-07 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859001#comment-16859001
 ] 

Dhruve Ashar commented on SPARK-27937:
--

The exception that we started encountering occurs while Spark tries to create a 
path for the logical nameservice (or nameservice ID) configured as a part of 
HDFS federation. 

 
{code:java}
19/05/20 08:48:42 INFO SecurityManager: Changing modify acls groups to: 
19/05/20 08:48:42 INFO SecurityManager: SecurityManager: authentication 
enabled; ui acls enabled; users  with view permissions: Set(...); groups with 
view permissions: Set(); users  with modify permissions: Set(); groups 
with modify permissions: Set(.)
19/05/20 08:48:43 INFO Client: Deleted staging directory 
hdfs://..:8020/user/abc/.sparkStaging/application_123456_123456
Exception in thread "main" java.io.IOException: Cannot create proxy with 
unresolved address: abcabcabc-nn1:8020
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createNonHAProxyWithClientProtocol(NameNodeProxiesClient.java:345)
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:133)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:351)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:285)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2821)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:100)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2892)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2874)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:215)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:214)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:214)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:213)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.hadoopFSsToAccess(YarnSparkHadoopUtil.scala:213)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43)
at 
org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.obtainDelegationTokens(HadoopFSDelegationTokenProvider.scala:48)
{code}
 

> Revert changes introduced as a part of Automatic namespace discovery 
> [SPARK-24149]
> --
>
> Key: SPARK-27937
> URL: https://issues.apache.org/jira/browse/SPARK-27937
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.3
>Reporter: Dhruve Ashar
>Priority: Major
>
> Spark fails to launch for a valid deployment of HDFS while trying to get 
> tokens for a logical nameservice instead of an actual namenode (with HDFS 
> federation enabled). 
> On inspecting the source code closely, it is unclear why we were doing this; 
> based on the context from SPARK-24149, it solves a very specific use case of 
> getting tokens only for those namenodes which are configured for HDFS 
> federation in the same cluster. IMHO these are better left for the user to 
> specify explicitly.






[jira] [Comment Edited] (SPARK-27937) Revert changes introduced as a part of Automatic namespace discovery [SPARK-24149]

2019-06-07 Thread Dhruve Ashar (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16859001#comment-16859001
 ] 

Dhruve Ashar edited comment on SPARK-27937 at 6/7/19 9:27 PM:
--

The exception that we started encountering occurs while Spark tries to create a 
path for the logical nameservice (or nameservice ID) configured as a part of 
HDFS federation, in the code here:

https://github.com/apache/spark/blob/v2.4.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnSparkHadoopUtil.scala#L215

 
{code:java}
19/05/20 08:48:42 INFO SecurityManager: Changing modify acls groups to: 
19/05/20 08:48:42 INFO SecurityManager: SecurityManager: authentication 
enabled; ui acls enabled; users  with view permissions: Set(...); groups with 
view permissions: Set(); users  with modify permissions: Set(); groups 
with modify permissions: Set(.)
19/05/20 08:48:43 INFO Client: Deleted staging directory 
hdfs://..:8020/user/abc/.sparkStaging/application_123456_123456
Exception in thread "main" java.io.IOException: Cannot create proxy with 
unresolved address: abcabcabc-nn1:8020
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createNonHAProxyWithClientProtocol(NameNodeProxiesClient.java:345)
at 
org.apache.hadoop.hdfs.NameNodeProxiesClient.createProxyWithClientProtocol(NameNodeProxiesClient.java:133)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:351)
at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:285)
at 
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
at 
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2821)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:100)
at 
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2892)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2874)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:389)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:215)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5$$anonfun$apply$2.apply(YarnSparkHadoopUtil.scala:214)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:214)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$$anonfun$5.apply(YarnSparkHadoopUtil.scala:213)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at 
org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.hadoopFSsToAccess(YarnSparkHadoopUtil.scala:213)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43)
at 
org.apache.spark.deploy.yarn.security.YARNHadoopDelegationTokenManager$$anonfun$1.apply(YARNHadoopDelegationTokenManager.scala:43)
at 
org.apache.spark.deploy.security.HadoopFSDelegationTokenProvider.obtainDelegationTokens(HadoopFSDelegationTokenProvider.scala:48)
{code}
 


was (Author: dhruve ashar):
The exception that we started encountering is while spark tries to create a 
path of the logic nameservice or nameservice id configured as a part of HDFS 
federation. 

 

[jira] [Resolved] (SPARK-27870) Flush each batch for pandas UDF (for improving pandas UDFs pipeline)

2019-06-07 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27870.
-
   Resolution: Fixed
 Assignee: Weichen Xu
Fix Version/s: 3.0.0

> Flush each batch for pandas UDF (for improving pandas UDFs pipeline)
> 
>
> Key: SPARK-27870
> URL: https://issues.apache.org/jira/browse/SPARK-27870
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 3.0.0
>
>
> Flush each batch for pandas UDF.
> This could improve performance when multiple pandas UDF plans are pipelined.
> When each batch is flushed in time, downstream pandas UDFs get pipelined as 
> soon as possible, and pipelining helps hide the downstream UDFs' computation 
> time. For example:
> When the first UDF starts computing on batch-3, the second pipelined UDF can 
> start computing on batch-2, and the third pipelined UDF can start computing 
> on batch-1.
> If we do not flush each batch in time, the downstream UDF's pipeline will lag 
> behind too much, which may increase the total processing time.
>  
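The pipelining effect described above can be sketched with plain Python generators (a toy model, not Spark's execution machinery): because each stage yields ("flushes") a batch as soon as it is computed, downstream stages interleave with upstream ones batch by batch.

```python
order = []  # records which stage touched which batch, in time order

def udf_stage(batches, name):
    """Toy pipelined stage: emit each batch as soon as it is processed,
    so the next stage can start on batch N while this one moves to N+1."""
    for b in batches:
        order.append((name, b))
        yield b

# Two chained stages over three batches: generators flush one batch at
# a time, so the stages interleave instead of running back to back.
pipeline = udf_stage(udf_stage(iter([1, 2, 3]), "udf1"), "udf2")
result = list(pipeline)
```

Inspecting `order` afterwards shows "udf2" starting on batch 1 before "udf1" has touched batch 2, which is exactly the overlap that per-batch flushing enables.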






[jira] [Assigned] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs

2019-06-07 Thread Thomas Graves (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-27823:
-

Assignee: Thomas Graves

> Add an abstraction layer for accelerator resource handling to avoid 
> manipulating raw confs
> --
>
> Key: SPARK-27823
> URL: https://issues.apache.org/jira/browse/SPARK-27823
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Thomas Graves
>Priority: Major
>
> In SPARK-27488, we extract resource requests and allocation by parsing raw 
> Spark confs. This hurts readability because we don't have an abstraction at 
> the resource level. After we merge the core changes, we should do a refactoring 
> and make the code more readable.
> See https://github.com/apache/spark/pull/24615#issuecomment-494580663.






[jira] [Assigned] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs

2019-06-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27823:


Assignee: Apache Spark

> Add an abstraction layer for accelerator resource handling to avoid 
> manipulating raw confs
> --
>
> Key: SPARK-27823
> URL: https://issues.apache.org/jira/browse/SPARK-27823
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Apache Spark
>Priority: Major
>
> In SPARK-27488, we extract resource requests and allocation by parsing raw 
> Spark confs. This hurts readability because we don't have an abstraction at 
> the resource level. After we merge the core changes, we should do a refactoring 
> and make the code more readable.
> See https://github.com/apache/spark/pull/24615#issuecomment-494580663.






[jira] [Assigned] (SPARK-27823) Add an abstraction layer for accelerator resource handling to avoid manipulating raw confs

2019-06-07 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-27823:


Assignee: (was: Apache Spark)

> Add an abstraction layer for accelerator resource handling to avoid 
> manipulating raw confs
> --
>
> Key: SPARK-27823
> URL: https://issues.apache.org/jira/browse/SPARK-27823
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> In SPARK-27488, we extract resource requests and allocation by parsing raw 
> Spark confs. This hurts readability because we don't have an abstraction at 
> the resource level. After we merge the core changes, we should do a refactoring 
> and make the code more readable.
> See https://github.com/apache/spark/pull/24615#issuecomment-494580663.






[jira] [Comment Edited] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-07 Thread Christian Homberg (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858739#comment-16858739
 ] 

Christian Homberg edited comment on SPARK-27966 at 6/7/19 3:32 PM:
---

I'm afraid I can't. For one thing, I can't share the data; for another, I'm 
not always able to reproduce the bug. With exactly the same data, code, and a 
clean environment, I sometimes get filenames and sometimes don't. All I can 
provide is logging information, and I can try to debug the issue if anyone 
gives me pointers.

 

I can say, though, that this has not been an issue so far with a larger Spark 
cluster. Then again, the input data is "only" ~3,000 files, each < 1 MB, so I 
don't see why the original cluster should have any size-related problems.


was (Author: chr_96er):
I'm afraid I can't. For one thing I can't share the data, for another even I'm 
not always able to reproduce the bug. For exactly the same data, code and a 
clean environment I get filenames and sometimes I don't. All I can provide is 
logging information and try to debug the issue if anyone can give me pointers.

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar and probably related to SPARK-26128. The 
> _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is Databricks, and debugging the Log4j output showed me that 
> the issue occurs when the files are being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've set up a larger cluster on which, after parallel file listing, 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  ** 
>  
>  
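For context, the workaround mentioned in the report can be applied per session. A minimal sketch, assuming an active SparkSession named `spark`; the threshold value below is purely illustrative, since the actual value is elided in the report:

```python
# Hypothetical sketch: raising the threshold above the number of input
# paths makes InMemoryFileIndex list files sequentially on the driver
# instead of launching a parallel listing job.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", 10000)
```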






[jira] [Commented] (SPARK-27966) input_file_name empty when listing files in parallel

2019-06-07 Thread Christian Homberg (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858739#comment-16858739
 ] 

Christian Homberg commented on SPARK-27966:
---

I'm afraid I can't. For one thing, I can't share the data; for another, I'm 
not always able to reproduce the bug. With exactly the same data, code, and a 
clean environment, I sometimes get filenames and sometimes don't. All I can 
provide is logging information, and I can try to debug the issue if anyone 
gives me pointers.

> input_file_name empty when listing files in parallel
> 
>
> Key: SPARK-27966
> URL: https://issues.apache.org/jira/browse/SPARK-27966
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.0
> Environment: Databricks 5.3 (includes Apache Spark 2.4.0, Scala 2.11)
>  
> Worker Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
> Workers: 3
> Driver Type: 14.0 GB Memory, 4 Cores, 0.75 DBU Standard_DS3_v2
>Reporter: Christian Homberg
>Priority: Minor
> Attachments: input_file_name_bug
>
>
> I ran into an issue similar and probably related to SPARK-26128. The 
> _org.apache.spark.sql.functions.input_file_name_ is sometimes empty.
>  
> {code:java}
> df.select(input_file_name()).show(5,false)
> {code}
>  
> {code:java}
> +-----------------+
> |input_file_name()|
> +-----------------+
> |                 |
> |                 |
> |                 |
> |                 |
> |                 |
> +-----------------+
> {code}
> My environment is Databricks, and debugging the Log4j output showed me that 
> the issue occurs when the files are being listed in parallel, e.g. when 
> {code:java}
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 127; threshold: 32
> 19/06/06 11:50:47 INFO InMemoryFileIndex: Listing leaf files and directories 
> in parallel under:{code}
>  
> Everything's fine as long as
> {code:java}
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 6; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> 19/06/06 11:54:43 INFO InMemoryFileIndex: Start listing leaf files and 
> directories. Size of Paths: 0; threshold: 32
> {code}
>  
> Setting spark.sql.sources.parallelPartitionDiscovery.threshold to  
> resolves the issue for me.
>  
> *edit: the problem is not exclusively linked to listing files in parallel. 
> I've set up a larger cluster on which, after parallel file listing, 
> input_file_name did return the correct filename. After inspecting the log4j 
> again, I assume that it's linked to some kind of MetaStore being full. I've 
> attached a section of the log4j output that I think should indicate why it's 
> failing. If you need more, please let me know.*
>  ** 
>  
>  






[jira] [Resolved] (SPARK-27932) Update jackson versions on 2.4.x and 2.3.x branches

2019-06-07 Thread Alex Dettinger (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alex Dettinger resolved SPARK-27932.

Resolution: Won't Fix

Right, I didn't get that possible fixes/workarounds were already discussed. 
Thanks for reporting. I think this ticket could be closed as 'Won't Fix' then.

> Update jackson versions on 2.4.x and 2.3.x branches
> ---
>
> Key: SPARK-27932
> URL: https://issues.apache.org/jira/browse/SPARK-27932
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Alex Dettinger
>Priority: Major
>
> SPARK-27051 has bumped jackson versions to 2.9.8, which is good.
> Would it be possible to upgrade the jackson version to >= 2.9.8 for 
> spark-2.4.x, spark-2.3.x ?
> In case >= 2.9.8 is not possible, versions below would be ok too:
>  * jackson >= 2.8.11.3
>  * jackson >= 2.7.9.5
>  






[jira] [Commented] (SPARK-27932) Update jackson versions on 2.4.x and 2.3.x branches

2019-06-07 Thread Sean Owen (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858630#comment-16858630
 ] 

Sean Owen commented on SPARK-27932:
---

I don't see how you can update to 2.7.x and not get the behavior change. We 
already had this discussion and pretty much concluded not to do so.

> Update jackson versions on 2.4.x and 2.3.x branches
> ---
>
> Key: SPARK-27932
> URL: https://issues.apache.org/jira/browse/SPARK-27932
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Alex Dettinger
>Priority: Major
>
> SPARK-27051 has bumped jackson versions to 2.9.8, which is good.
> Would it be possible to upgrade the jackson version to >= 2.9.8 for 
> spark-2.4.x, spark-2.3.x ?
> In case >= 2.9.8 is not possible, versions below would be ok too:
>  * jackson >= 2.8.11.3
>  * jackson >= 2.7.9.5
>  






[jira] [Commented] (SPARK-27932) Update jackson versions on 2.4.x and 2.3.x branches

2019-06-07 Thread Alex Dettinger (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858624#comment-16858624
 ] 

Alex Dettinger commented on SPARK-27932:


[~srowen] stated in [a somewhat related 
PR|https://github.com/apache/spark/pull/24493] that it appears hard to upgrade 
jackson-databind > 2.6 on spark 2.3.x, 2.4.x branches.

A key aspect to keep in mind is that jackson-databind introduced a behavior 
change in 2.7 onward.

I propose to keep this ticket open for a while in case someone comes up with 
a bright idea.

 

 

> Update jackson versions on 2.4.x and 2.3.x branches
> ---
>
> Key: SPARK-27932
> URL: https://issues.apache.org/jira/browse/SPARK-27932
> Project: Spark
>  Issue Type: Task
>  Components: Spark Core
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Alex Dettinger
>Priority: Major
>
> SPARK-27051 has bumped jackson versions to 2.9.8, which is good.
> Would it be possible to upgrade the jackson version to >= 2.9.8 for 
> spark-2.4.x, spark-2.3.x ?
> In case >= 2.9.8 is not possible, versions below would be ok too:
>  * jackson >= 2.8.11.3
>  * jackson >= 2.7.9.5
>  






[jira] [Assigned] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage

2019-06-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen reassigned SPARK-27973:
-

Assignee: Yuexin Zhang

> Streaming sample DirectKafkaWordCount should mention GroupId in usage
> -
>
> Key: SPARK-27973
> URL: https://issues.apache.org/jira/browse/SPARK-27973
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.4.3
>Reporter: Yuexin Zhang
>Assignee: Yuexin Zhang
>Priority: Trivial
>
> The DirectKafkaWordCount sample has been updated to take Consumer Group Id as 
> one of the input arguments, but we missed it in the sample usage:
>   System.err.println(s"""
> |Usage: DirectKafkaWordCount <brokers> <topics>
> |  <brokers> is a list of one or more Kafka brokers
> |  <groupId> is a consumer group name to consume from topics
> |  <topics> is a list of one or more kafka topics to consume from
> |
> """.stripMargin)
> Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics>






[jira] [Resolved] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage

2019-06-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27973.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 24819
[https://github.com/apache/spark/pull/24819]

> Streaming sample DirectKafkaWordCount should mention GroupId in usage
> -
>
> Key: SPARK-27973
> URL: https://issues.apache.org/jira/browse/SPARK-27973
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.4.3
>Reporter: Yuexin Zhang
>Assignee: Yuexin Zhang
>Priority: Trivial
> Fix For: 3.0.0
>
>
> The DirectKafkaWordCount sample has been updated to take Consumer Group Id as 
> one of the input arguments, but we missed it in the sample usage:
>   System.err.println(s"""
> |Usage: DirectKafkaWordCount <brokers> <topics>
> |  <brokers> is a list of one or more Kafka brokers
> |  <groupId> is a consumer group name to consume from topics
> |  <topics> is a list of one or more kafka topics to consume from
> |
> """.stripMargin)
> Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics>






[jira] [Updated] (SPARK-27973) Streaming sample DirectKafkaWordCount should mention GroupId in usage

2019-06-07 Thread Sean Owen (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-27973:
--
Priority: Trivial  (was: Minor)

(This is too trivial for a JIRA; the description and fix are all but identical)

> Streaming sample DirectKafkaWordCount should mention GroupId in usage
> -
>
> Key: SPARK-27973
> URL: https://issues.apache.org/jira/browse/SPARK-27973
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 2.4.3
>Reporter: Yuexin Zhang
>Priority: Trivial
>
> The DirectKafkaWordCount sample has been updated to take Consumer Group Id as 
> one of the input arguments, but we missed it in the sample usage:
>   System.err.println(s"""
> |Usage: DirectKafkaWordCount <brokers> <topics>
> |  <brokers> is a list of one or more Kafka brokers
> |  <groupId> is a consumer group name to consume from topics
> |  <topics> is a list of one or more kafka topics to consume from
> |
> """.stripMargin)
> Usage should be: DirectKafkaWordCount <brokers> <groupId> <topics>






[jira] [Created] (SPARK-27977) MicroBatchWriter should use StreamWriter for human-friendly textual representation (toString)

2019-06-07 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-27977:
---

 Summary: MicroBatchWriter should use StreamWriter for 
human-friendly textual representation (toString)
 Key: SPARK-27977
 URL: https://issues.apache.org/jira/browse/SPARK-27977
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Jacek Laskowski


The following is an extended explain output for a streaming query:

{code}
== Parsed Logical Plan ==
WriteToDataSourceV2 
org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef
+- Project [value#39 AS value#0]
   +- Streaming RelationV2 socket[value#39] (Options: 
[host=localhost,port=])

== Analyzed Logical Plan ==
WriteToDataSourceV2 
org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef
+- Project [value#39 AS value#0]
   +- Streaming RelationV2 socket[value#39] (Options: 
[host=localhost,port=])

== Optimized Logical Plan ==
WriteToDataSourceV2 
org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef
+- Streaming RelationV2 socket[value#39] (Options: [host=localhost,port=])

== Physical Plan ==
WriteToDataSourceV2 
org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4737caef
+- *(1) Project [value#39]
   +- *(1) ScanV2 socket[value#39] (Options: [host=localhost,port=])
{code}

As you may have noticed, {{WriteToDataSourceV2}} is followed by the internal 
representation of {{MicroBatchWriter}}, which is a mere adapter for a 
{{StreamWriter}}, e.g. {{ConsoleWriter}}.

It'd be more debugging-friendly if the plans included whatever 
{{StreamWriter.toString}} returns (which in the case of {{ConsoleWriter}} would 
be {{ConsoleWriter[numRows=..., truncate=...]}}, giving more context).
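The proposed change amounts to having the adapter delegate its textual representation to the wrapped writer. A hypothetical Python analogue of the Scala classes (field names invented for illustration):

```python
class ConsoleWriter:
    """Stands in for a StreamWriter with a descriptive representation."""
    def __init__(self, num_rows, truncate):
        self.num_rows, self.truncate = num_rows, truncate

    def __repr__(self):
        return f"ConsoleWriter[numRows={self.num_rows}, truncate={self.truncate}]"

class MicroBatchWriter:
    """Adapter around a stream writer; without the override below it
    would show an opaque default like 'MicroBatchWriter object at 0x...'."""
    def __init__(self, stream_writer):
        self.stream_writer = stream_writer

    def __repr__(self):
        # Delegate to the wrapped writer for a human-friendly plan string.
        return repr(self.stream_writer)

writer = MicroBatchWriter(ConsoleWriter(num_rows=20, truncate=True))
print(writer)  # -> ConsoleWriter[numRows=20, truncate=True]
```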






[jira] [Created] (SPARK-27976) Add built-in Array Functions: array_append

2019-06-07 Thread Yuming Wang (JIRA)
Yuming Wang created SPARK-27976:
---

 Summary: Add built-in Array Functions: array_append
 Key: SPARK-27976
 URL: https://issues.apache.org/jira/browse/SPARK-27976
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Yuming Wang


||Function||Return Type||Description||Example||Result||
|{{array_append}}{{(}}{{anyarray}}{{,}}{{anyelement}}{{)}}|{{anyarray}}|append 
an element to the end of an array|{{array_append(ARRAY[1,2], 3)}}|{{{1,2,3}}}|


https://www.postgresql.org/docs/current/functions-array.html

Other DBs:
https://phoenix.apache.org/language/functions.html#array_append
https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/68fdFR3LWhx7KtHc9Iv5Qg
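The proposed semantics, following the PostgreSQL function linked above, can be sketched in plain Python. This is an illustrative model only, not the Spark implementation:

```python
def array_append(arr, elem):
    """Return a new array with elem appended to the end.

    Mirrors PostgreSQL's array_append: a NULL (None) array input
    yields a single-element array rather than an error.
    """
    if arr is None:
        return [elem]
    return arr + [elem]  # non-destructive: the input list is unchanged

print(array_append([1, 2], 3))  # -> [1, 2, 3]
```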






[jira] [Created] (SPARK-27975) ConsoleSink should display alias and options for streaming progress

2019-06-07 Thread Jacek Laskowski (JIRA)
Jacek Laskowski created SPARK-27975:
---

 Summary: ConsoleSink should display alias and options for 
streaming progress
 Key: SPARK-27975
 URL: https://issues.apache.org/jira/browse/SPARK-27975
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 2.4.3
Reporter: Jacek Laskowski


The {{console}} sink shows up in streaming progress with its internal 
instance representation, as follows:

{code:json}
  "sink" : {
"description" : 
"org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@12fa674a"
  }
{code}

That is not very user-friendly; it would be much better for debugging if it 
included the alias and options, as {{socket}} does:

{code}
  "sources" : [ {
"description" : "TextSocketV2[host: localhost, port: ]",
...
  } ],
{code}

The entire sample progress looks as follows:

{code}
19/06/07 11:47:18 INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "26bedc9f-076f-4b15-8e17-f09609aaecac",
  "runId" : "8c365e74-7ac9-4fad-bf1b-397eb086661e",
  "name" : "socket-console",
  "timestamp" : "2019-06-07T09:47:18.969Z",
  "batchId" : 2,
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "durationMs" : {
"getEndOffset" : 0,
"setOffsetRange" : 0,
"triggerExecution" : 0
  },
  "stateOperators" : [ ],
  "sources" : [ {
"description" : "TextSocketV2[host: localhost, port: ]",
"startOffset" : 0,
"endOffset" : 0,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0
  } ],
  "sink" : {
"description" : 
"org.apache.spark.sql.execution.streaming.ConsoleSinkProvider@12fa674a"
  }
}
{code}







[jira] [Commented] (SPARK-27927) driver pod hangs with pyspark 2.4.3 and master on kubenetes

2019-06-07 Thread Edwin Biemond (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858430#comment-16858430
 ] 

Edwin Biemond commented on SPARK-27927:
---

Just doing a spark-submit on the same host (same pod) works fine. In k8s the 
driver just hangs when I don't have this sparkContext.stop().

> driver pod hangs with pyspark 2.4.3 and master on kubenetes
> ---
>
> Key: SPARK-27927
> URL: https://issues.apache.org/jira/browse/SPARK-27927
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, PySpark
>Affects Versions: 3.0.0, 2.4.3
> Environment: k8s 1.11.9
> spark 2.4.3 and master branch.
>Reporter: Edwin Biemond
>Priority: Major
>
> When we run a simple pyspark app on spark 2.4.3 or 3.0.0, the driver pod hangs 
> and never calls the shutdown hook. 
> {code:java}
> #!/usr/bin/env python
> from __future__ import print_function
> import os
> import os.path
> import sys
> # Are we really in Spark?
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName('hello_world').getOrCreate()
> print('Our Spark version is {}'.format(spark.version))
> print('Spark context information: {} parallelism={} python version={}'.format(
> str(spark.sparkContext),
> spark.sparkContext.defaultParallelism,
> spark.sparkContext.pythonVer
> ))
> {code}
> When we run this on kubernetes, the driver and executor just hang. We 
> see the output of this python script. 
> {noformat}
> bash-4.2# cat stdout.log
> Our Spark version is 2.4.3
> Spark context information: <SparkContext master=k8s://https://kubernetes.default.svc:443 appName=hello_world> 
> parallelism=2 python version=3.6{noformat}
> What works
>  * a simple python with a print works fine on 2.4.3 and 3.0.0
>  * same setup on 2.4.0
>  * 2.4.3 spark-submit with the above pyspark
>  
>  
>  






[jira] [Commented] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2019-06-07 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858402#comment-16858402
 ] 

Hyukjin Kwon commented on SPARK-27785:
--

To me, I don't have much information about how common this typed API is. If 
it is common enough and frequently asked for, it might be worth doing. The 
problem sounds valid, but I feel like I'm missing the importance of this API.

For instance, we probably won't expose such an API for 1 to 22 arguments the way UDFs do.

> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Josh Rosen
>Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables: 
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining 
> on a third inner join requires users to specify a complicated join condition 
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
> resulting in a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
> even more painful if you want to layer on a fourth join. Using {{.map()}} to 
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of 
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
> reasonable number (say, 6). For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   condition: Column
> ): Dataset[(T, T1, T2)]
> Dataset[T].joinWith[T1, T2, T3](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3],
>   condition: Column
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join 
> type for {{joinWith}} in case joins are not specified).
> I haven't thought about this too much yet and am not committed to the API 
> proposed above (it's just my initial idea), so I'm open to suggestions for 
> alternative typed APIs for this.
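The doubly-nested result shape the issue describes can be illustrated with plain tuples, no Spark required (the data below is hypothetical):

```python
# Chaining two pairwise joinWith calls nests the left-hand result,
# giving the shape of Dataset[((A, B), C)]:
ab_c = [(("a1", "b1"), "c1"), (("a2", "b2"), "c2")]

# Flattening to the Dataset[(A, B, C)] shape requires an extra map
# over every row -- the performance penalty the issue mentions:
flat = [(a, b, c) for ((a, b), c) in ab_c]

print(flat)  # -> [('a1', 'b1', 'c1'), ('a2', 'b2', 'c2')]
```

A three-way `joinWith` overload would produce the flat shape directly, with no extra pass over the data.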






[jira] [Resolved] (SPARK-27965) Add extractors for logical transforms

2019-06-07 Thread Xiao Li (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-27965.
-
   Resolution: Fixed
 Assignee: Ryan Blue
Fix Version/s: 3.0.0

> Add extractors for logical transforms
> -
>
> Key: SPARK-27965
> URL: https://issues.apache.org/jira/browse/SPARK-27965
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ryan Blue
>Assignee: Ryan Blue
>Priority: Major
> Fix For: 3.0.0
>
>
> Extractors can be used to make any Transform class appear like a case class 
> to Spark internals.


