[jira] [Assigned] (SPARK-31119) Add interval value support for extract expression as source

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31119:
---

Assignee: Kent Yao

> Add interval value support for extract expression as source
> ---
>
> Key: SPARK-31119
> URL: https://issues.apache.org/jira/browse/SPARK-31119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> {code:java}
> <extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren>
> <extract source> ::= <datetime value expression> | <interval value expression>
> {code}
> We currently support only datetime values as the extract source for the extract 
> expression, but its alternative function `date_part` supports both datetime and 
> interval sources.
> For ANSI compliance and for consistency between `extract` and `date_part`, we 
> should support intervals as the extract source as well.
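A minimal sketch of what this enables (illustrative only; the interval literal 
syntax is assumed, not taken from the ticket):

{code:java}
// Both forms should accept an interval source once this change lands.
spark.sql("SELECT EXTRACT(DAY FROM INTERVAL 3 DAYS 2 HOURS)").show()
spark.sql("SELECT date_part('day', INTERVAL 3 DAYS 2 HOURS)").show()
{code}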






[jira] [Resolved] (SPARK-31119) Add interval value support for extract expression as source

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31119.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27876
[https://github.com/apache/spark/pull/27876]

> Add interval value support for extract expression as source
> ---
>
> Key: SPARK-31119
> URL: https://issues.apache.org/jira/browse/SPARK-31119
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> <extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren>
> <extract source> ::= <datetime value expression> | <interval value expression>
> {code}
> We currently support only datetime values as the extract source for the extract 
> expression, but its alternative function `date_part` supports both datetime and 
> interval sources.
> For ANSI compliance and for consistency between `extract` and `date_part`, we 
> should support intervals as the extract source as well.






[jira] [Commented] (SPARK-22299) Use OFFSET and LIMIT for JDBC DataFrameReader striping

2020-03-17 Thread Pushpinder Heer (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061359#comment-17061359
 ] 

Pushpinder Heer commented on SPARK-22299:
-

This shows as resolved, but no fix version is indicated.  Is this still active?  Thanks!

> Use OFFSET and LIMIT for JDBC DataFrameReader striping
> --
>
> Key: SPARK-22299
> URL: https://issues.apache.org/jira/browse/SPARK-22299
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 2.0.0, 2.1.0, 2.2.0
>Reporter: Zack Behringer
>Priority: Minor
>  Labels: bulk-closed
>
> Loading a large table (300M rows) from JDBC can be partitioned into tasks 
> using the column, numPartitions, lowerBound and upperBound parameters on 
> DataFrameReader.jdbc(), but that becomes troublesome if the column is 
> skewed/fragmented (as in somebody used a global sequence for the partition 
> column instead of a sequence specific to the table, or if the table becomes 
> fragmented by deletes, etc.).
> This can be worked around by using a modulus operation on the column, but 
> that will be slow unless there is already an index using the modulus 
> expression with the exact numPartitions value, so that doesn't scale well if 
> you want to change the number of partitions. Another way would be to use an 
> expression index on a hash of the partition column, but I'm not sure if JDBC 
> striping is smart enough to create hash ranges for each stripe using hashes 
> of the lower and upper bound parameters. If it is, that is great, but still 
> that requires a very large index just for this use case.
> A less invasive approach would be to use the table's physical ordering along 
> with OFFSET and LIMIT so that only the total number of records to read would 
> need to be known beforehand in order to evenly distribute, no indexes needed. 
> I realize that OFFSET and LIMIT are not standard SQL keywords.
> I also see that a list of custom predicates can be defined. I haven't tried 
> that to see if I can embed numPartitions-specific predicates, each with its 
> own OFFSET and LIMIT range.
> Some relational databases take quite a long time to count the number of 
> records in order to determine the stripe size, though, so this can also be 
> troublesome. Could a feature similar to "spark.sql.files.maxRecordsPerFile" 
> be used in conjunction with the number of executors to read manageable 
> batches (internally using OFFSET and LIMIT) until there are no more available 
> results?
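For what it's worth, a sketch of the custom-predicates idea mentioned above. The 
URL, table name, ordering column, and row count are placeholders, and, as the 
reporter notes, this is untested: it only works if the target database accepts 
OFFSET/LIMIT clauses spliced into the generated WHERE clause.

{code:java}
import java.util.Properties

val url        = "jdbc:postgresql://host:5432/db"   // placeholder
val table      = "big_table"                        // placeholder
val totalRows  = 300000000L                         // must be known/counted up front
val partitions = 200
val stripe     = (totalRows + partitions - 1) / partitions

// One predicate per partition; each task reads its own OFFSET/LIMIT window.
// Without a stable ORDER BY the windows are not guaranteed to be disjoint.
val predicates = (0 until partitions).map { i =>
  s"1=1 ORDER BY id OFFSET ${i * stripe} LIMIT $stripe"
}.toArray

val df = spark.read.jdbc(url, table, predicates, new Properties())
{code}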






[jira] [Updated] (SPARK-28593) Rename ShuffleClient to BlockStoreClient which more close to its usage

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28593:
--
Affects Version/s: (was: 2.4.0)
   3.0.0

> Rename ShuffleClient to BlockStoreClient which more close to its usage
> --
>
> Key: SPARK-28593
> URL: https://issues.apache.org/jira/browse/SPARK-28593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> After SPARK-27677, the shuffle client not only handles shuffle blocks but is 
> also responsible for locally persisted RDD blocks. For better code scalability 
> and clearer semantics, we rename ShuffleClient to BlockStoreClient. 
> Correspondingly, we rename ExternalShuffleClient to ExternalBlockStoreClient and 
> change the server-side class from ExternalShuffleBlockHandler to 
> ExternalBlockHandler. Note that we keep the name BlockTransferService, 
> because the `Service` contains both client and server, and the name 
> BlockTransferService does not refer to the shuffle client only.






[jira] [Updated] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait

2020-03-17 Thread feiwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

feiwang updated SPARK-31179:

Description: 
When reading shuffle data, several fetch requests may be sent to the same shuffle 
server.
There is a client pool, and these requests may share the same client.
When the shuffle server is busy, the request connections may time out.
For example, there are two request connections, rc1 and rc2, 
io.numConnectionsPerPeer is 1, and the connection timeout is 2 minutes.

1: rc1 holds the client lock and times out after 2 minutes.
2: rc2 holds the client lock and times out after 2 minutes.
3: rc1 starts its second retry, holds the lock, and times out after 2 minutes.
4: rc2 starts its second retry, holds the lock, and times out after 2 minutes.
5: rc1 starts its third retry, holds the lock, and times out after 2 minutes.
6: rc2 starts its third retry, holds the lock, and times out after 2 minutes.
This wastes a lot of time.
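For a rough sense of scale, a worked example of the serialization described above 
(the config names are assumed to be the usual shuffle client settings; illustrative 
only):

{code:java}
// io.numConnectionsPerPeer = 1, connection timeout = 120 s, 3 attempts each.
val timeoutSec   = 120   // e.g. spark.shuffle.io.connectionTimeout
val attempts     = 3     // attempts per request in the scenario above
val requests     = 2     // rc1 and rc2 sharing the same pooled client
val worstCaseSec = timeoutSec * attempts * requests  // = 720 s, i.e. 12 minutes
{code}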

> Fast fail the connection while last shuffle connection failed in the last 
> retry IO wait 
> 
>
> Key: SPARK-31179
> URL: https://issues.apache.org/jira/browse/SPARK-31179
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.1.0
>Reporter: feiwang
>Priority: Major
>
> When reading shuffle data, several fetch requests may be sent to the same shuffle 
> server.
> There is a client pool, and these requests may share the same client.
> When the shuffle server is busy, the request connections may time out.
> For example, there are two request connections, rc1 and rc2, 
> io.numConnectionsPerPeer is 1, and the connection timeout is 2 minutes.
> 1: rc1 holds the client lock and times out after 2 minutes.
> 2: rc2 holds the client lock and times out after 2 minutes.
> 3: rc1 starts its second retry, holds the lock, and times out after 2 minutes.
> 4: rc2 starts its second retry, holds the lock, and times out after 2 minutes.
> 5: rc1 starts its third retry, holds the lock, and times out after 2 minutes.
> 6: rc2 starts its third retry, holds the lock, and times out after 2 minutes.
> This wastes a lot of time.






[jira] [Commented] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061337#comment-17061337
 ] 

Dongjoon Hyun commented on SPARK-27665:
---

Thank you for confirmation!

> Split fetch shuffle blocks protocol from OpenBlocks
> ---
>
> Key: SPARK-27665
> URL: https://issues.apache.org/jira/browse/SPARK-27665
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> In the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks 
> protocol to describe the fetch request for shuffle blocks, which makes 
> extension work for shuffle fetching, such as SPARK-9853 and SPARK-25341, very 
> awkward. We need a new protocol just for the shuffle blocks fetcher.






[jira] [Commented] (SPARK-29278) Data Source V2: Support SHOW CURRENT CATALOG

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061332#comment-17061332
 ] 

Dongjoon Hyun commented on SPARK-29278:
---

Thanks!

> Data Source V2: Support SHOW CURRENT CATALOG
> 
>
> Key: SPARK-29278
> URL: https://issues.apache.org/jira/browse/SPARK-29278
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> Introduce the following SQL commands for Data Source V2
> {code:sql}
> CREATE NAMESPACE mycatalog.ns1.ns2
> SHOW CURRENT CATALOG
> SHOW CURRENT NAMESPACE
> {code}
>  






[jira] [Created] (SPARK-31179) Fast fail the connection while last shuffle connection failed in the last retry IO wait

2020-03-17 Thread feiwang (Jira)
feiwang created SPARK-31179:
---

 Summary: Fast fail the connection while last shuffle connection 
failed in the last retry IO wait 
 Key: SPARK-31179
 URL: https://issues.apache.org/jira/browse/SPARK-31179
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 3.1.0
Reporter: feiwang









[jira] [Commented] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts

2020-03-17 Thread Burak Yavuz (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061292#comment-17061292
 ] 

Burak Yavuz commented on SPARK-31178:
-

cc [~wenchen] [~rdblue]

> sql("INSERT INTO v2DataSource ...").collect() double inserts
> 
>
> Key: SPARK-31178
> URL: https://issues.apache.org/jira/browse/SPARK-31178
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Priority: Blocker
>
> The following unit test fails in DataSourceV2SQLSuite:
> {code:java}
> test("do not double insert on INSERT INTO collect()") {
>   import testImplicits._
>   val t1 = s"${catalogAndNamespace}tbl"
>   sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
>   val tmpView = "test_data"
>   val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data")
>   df.createOrReplaceTempView(tmpView)
>   sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect()
>   verifyTable(t1, df)
> } {code}
> The INSERT INTO is double inserting when ".collect()" is called. I think this 
> is because the V2 SparkPlans are not commands, and doExecute on a Spark plan 
> can be called multiple times causing data to be inserted multiple times.






[jira] [Created] (SPARK-31178) sql("INSERT INTO v2DataSource ...").collect() double inserts

2020-03-17 Thread Burak Yavuz (Jira)
Burak Yavuz created SPARK-31178:
---

 Summary: sql("INSERT INTO v2DataSource ...").collect() double 
inserts
 Key: SPARK-31178
 URL: https://issues.apache.org/jira/browse/SPARK-31178
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Burak Yavuz


The following unit test fails in DataSourceV2SQLSuite:
{code:java}
test("do not double insert on INSERT INTO collect()") {
  import testImplicits._
  val t1 = s"${catalogAndNamespace}tbl"
  sql(s"CREATE TABLE $t1 (id bigint, data string) USING $v2Format")
  val tmpView = "test_data"
  val df = Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "data")
  df.createOrReplaceTempView(tmpView)
  sql(s"INSERT INTO TABLE $t1 SELECT * FROM $tmpView").collect()

  verifyTable(t1, df)
} {code}
The INSERT INTO is double inserting when ".collect()" is called. I think this 
is because the V2 SparkPlans are not commands, and doExecute on a Spark plan 
can be called multiple times causing data to be inserted multiple times.






[jira] [Assigned] (SPARK-30954) TreeModelWrappers class name do not correspond to file name

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30954:
-

Assignee: kevin yu

> TreeModelWrappers class name do not correspond to file name 
> 
>
> Key: SPARK-30954
> URL: https://issues.apache.org/jira/browse/SPARK-30954
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, R
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: kevin yu
>Priority: Minor
>
> All wrappers except TreeModelWrappers have matching class and file names; we 
> should rename the Scala files accordingly.






[jira] [Resolved] (SPARK-30954) TreeModelWrappers class name do not correspond to file name

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30954.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27940
[https://github.com/apache/spark/pull/27940]

> TreeModelWrappers class name do not correspond to file name 
> 
>
> Key: SPARK-30954
> URL: https://issues.apache.org/jira/browse/SPARK-30954
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, R
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: kevin yu
>Priority: Minor
> Fix For: 3.0.0
>
>
> All wrappers except TreeModelWrappers have matching class and file names; we 
> should rename the Scala files accordingly.






[jira] [Created] (SPARK-31177) DataFrameReader.csv incorrectly reads gzip encoded CSV from S3 when it has non-".gz" extension

2020-03-17 Thread Mark Waddle (Jira)
Mark Waddle created SPARK-31177:
---

 Summary: DataFrameReader.csv incorrectly reads gzip encoded CSV 
from S3 when it has non-".gz" extension
 Key: SPARK-31177
 URL: https://issues.apache.org/jira/browse/SPARK-31177
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.4.4
Reporter: Mark Waddle


I have large CSV files that are gzipped and uploaded to S3 with 
Content-Encoding=gzip. The files have the file extension ".csv", as most web 
clients will automatically decompress the file based on the Content-Encoding 
header. Using pyspark to read these CSV files does not mimic this behavior.

Works as expected:
{code:java}
df = spark.read.csv('s3://bucket/large.csv.gz', header=True)
{code}
Does not decompress, and tries to load the entire contents of the file as the first row:
{code:java}
df = spark.read.csv('s3://bucket/large.csv', header=True)
{code}

It looks like Spark is relying on the file extension to determine whether the file 
is gzip compressed. It would be great if S3 resources, and any other HTTP-based 
resources, could consult the Content-Encoding response header as well.

I tried to find the code that determines this, but I'm not familiar with the 
code base. Any pointers would be helpful, and I can look into fixing it.






[jira] [Assigned] (SPARK-31047) Improve file listing for ViewFileSystem

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31047:
-

Assignee: Manu Zhang

> Improve file listing for ViewFileSystem
> ---
>
> Key: SPARK-31047
> URL: https://issues.apache.org/jira/browse/SPARK-31047
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
>
> https://issues.apache.org/jira/browse/SPARK-27801 improved file listing 
> for DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes a 
> single {{listLocatedStatus}} call to the namenode. 
> This ticket intends to improve the case where ViewFileSystem is used to 
> manage multiple DistributedFileSystems. ViewFileSystem also overrides the 
> {{listLocatedStatus}} method by delegating to the filesystem it resolves to, 
> e.g. DistributedFileSystem.






[jira] [Resolved] (SPARK-31047) Improve file listing for ViewFileSystem

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31047.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27801
[https://github.com/apache/spark/pull/27801]

> Improve file listing for ViewFileSystem
> ---
>
> Key: SPARK-31047
> URL: https://issues.apache.org/jira/browse/SPARK-31047
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.1.0
>
>
> https://issues.apache.org/jira/browse/SPARK-27801 improved file listing 
> for DistributedFileSystem, where {{InMemoryFileIndex.listLeafFiles}} makes a 
> single {{listLocatedStatus}} call to the namenode. 
> This ticket intends to improve the case where ViewFileSystem is used to 
> manage multiple DistributedFileSystems. ViewFileSystem also overrides the 
> {{listLocatedStatus}} method by delegating to the filesystem it resolves to, 
> e.g. DistributedFileSystem.






[jira] [Resolved] (SPARK-31123) Drop does not work after join with aliases

2020-03-17 Thread Mikel San Vicente (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mikel San Vicente resolved SPARK-31123.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Drop does not work after join with aliases
> --
>
> Key: SPARK-31123
> URL: https://issues.apache.org/jira/browse/SPARK-31123
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.2
>Reporter: Mikel San Vicente
>Priority: Major
> Fix For: 3.0.0
>
>
>  
> Hi,
> I am seeing really strange behaviour in the drop method after a join with 
> aliases. It doesn't seem to find the column when I refer to it using the 
> dataframe("columnName") syntax, but that syntax does work with other combinators, 
> like select:
> {code:java}
> case class Record(a: String, dup: String)
> case class Record2(b: String, dup: String)
> val df = Seq(Record("a", "dup")).toDF
> val df2 = Seq(Record2("a", "dup")).toDF 
> val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b"))
> val dupCol = df("dup")
> joined.drop(dupCol) // Does not drop anything
> joined.drop(func.col("a.dup")) // It drops the column  
> joined.select(dupCol) // It selects the column
> {code}
>  
>  
>  






[jira] [Resolved] (SPARK-31125) When processing new K8s state snapshots Spark treats Terminating nodes as terminated.

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31125.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27905
[https://github.com/apache/spark/pull/27905]

> When processing new K8s state snapshots Spark treats Terminating nodes as 
> terminated.
> -
>
> Key: SPARK-31125
> URL: https://issues.apache.org/jira/browse/SPARK-31125
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
> Fix For: 3.1.0
>
>







[jira] [Assigned] (SPARK-31171) size(null) should return null under ansi mode

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31171:
-

Assignee: Wenchen Fan

> size(null) should return null under ansi mode
> -
>
> Key: SPARK-31171
> URL: https://issues.apache.org/jira/browse/SPARK-31171
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>
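For reference, a minimal illustration of the intended behaviour (assuming the ANSI 
flag is {{spark.sql.ansi.enabled}} and that the legacy behaviour returns -1; not 
taken from the ticket):

{code:java}
// Set before running the query; illustrative only.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT size(null)").show()   // expected after this change: NULL

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT size(null)").show()   // legacy behaviour: -1
{code}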







[jira] [Resolved] (SPARK-31171) size(null) should return null under ansi mode

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31171.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27936
[https://github.com/apache/spark/pull/27936]

> size(null) should return null under ansi mode
> -
>
> Key: SPARK-31171
> URL: https://issues.apache.org/jira/browse/SPARK-31171
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Commented] (SPARK-28594) Allow event logs for running streaming apps to be rolled over

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061101#comment-17061101
 ] 

Dongjoon Hyun commented on SPARK-28594:
---

The `releasenotes` label is added for the 3.0.0 release.

> Allow event logs for running streaming apps to be rolled over
> -
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stephen Levett
>Assignee: Jungtaek Lim
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.0.0
>
>
> In all current Spark releases, when event logging is enabled for Spark Streaming 
> the event logs grow massively.  The files continue to grow until the 
> application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617
> addresses .inprogress files, but not event log files for applications that are still running.
> Can we identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  
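For release-note readers, a sketch of how the rolling event log feature is enabled 
(the {{spark.eventLog.rolling.*}} config names and the 128m value are assumed, 
illustrative only):

{code:java}
// Illustrative only: set when building the session, before the context starts.
val spark = org.apache.spark.sql.SparkSession.builder()
  .appName("streaming-app")
  .config("spark.eventLog.enabled", "true")
  .config("spark.eventLog.rolling.enabled", "true")
  .config("spark.eventLog.rolling.maxFileSize", "128m")
  .getOrCreate()
{code}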






[jira] [Updated] (SPARK-28594) Allow event logs for running streaming apps to be rolled over

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28594?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28594:
--
Labels: releasenotes  (was: )

> Allow event logs for running streaming apps to be rolled over
> -
>
> Key: SPARK-28594
> URL: https://issues.apache.org/jira/browse/SPARK-28594
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Stephen Levett
>Assignee: Jungtaek Lim
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.0.0
>
>
> In all current Spark releases, when event logging is enabled for Spark Streaming 
> the event logs grow massively.  The files continue to grow until the 
> application is stopped or killed.
> The Spark history server then has difficulty processing the files.
> https://issues.apache.org/jira/browse/SPARK-8617
> addresses .inprogress files, but not event log files for applications that are still running.
> Can we identify a mechanism to set a "max file" size so that the file is rolled over 
> when it reaches this size?
>  
>  






[jira] [Resolved] (SPARK-30860) Different behavior between rolling and non-rolling event log

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30860.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27764
[https://github.com/apache/spark/pull/27764]

> Different behavior between rolling and non-rolling event log
> 
>
> Key: SPARK-30860
> URL: https://issues.apache.org/jira/browse/SPARK-30860
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
> Fix For: 3.0.0
>
>
> When creating a rolling event log, the application directory is created with 
> a call to FileSystem.mkdirs, with the file permission 770. The default 
> behavior of HDFS is to set the permission of a file created with 
> FileSystem.create or FileSystem.mkdirs to (P & ^umask), where P is the 
> permission in the API call and umask is a system value set by 
> fs.permissions.umask-mode and defaults to 0022. This means, with default 
> settings, any mkdirs call can have at most 755 permissions, which causes 
> rolling event log directories to be created with 750 permissions. This causes 
> the history server to be unable to prune old applications if they are not run 
> by the same user running the history server.
> This is not a problem for non-rolling logs, because it uses 
> SparkHadoopUtils.createFile for Hadoop 2 backward compatibility, and then 
> calls FileSystem.setPermission with 770 after the file has been created. 
> setPermission doesn't have the umask applied to it, so this works fine.
> Obviously this could be fixed by changing fs.permissions.umask-mode, but I'm 
> not sure the reason that's set in the first place or if this would hurt 
> anything else. The main issue is that the behavior differs between rolling 
> and non-rolling event logs, and it might be worth updating this repo to make 
> them consistent.
>  
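A worked example of the umask arithmetic described above (illustrative only; the 
values are the ones from the description):

{code:java}
// Requested permission 0770 with the default fs.permissions.umask-mode = 0022.
val requested = Integer.parseInt("770", 8)
val umask     = Integer.parseInt("022", 8)
val effective = requested & ~umask
println(effective.toOctalString)   // prints "750": group write is masked away
{code}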






[jira] [Assigned] (SPARK-30860) Different behavior between rolling and non-rolling event log

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30860:
-

Assignee: Adam Binford

> Different behavior between rolling and non-rolling event log
> 
>
> Key: SPARK-30860
> URL: https://issues.apache.org/jira/browse/SPARK-30860
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Adam Binford
>Assignee: Adam Binford
>Priority: Major
>
> When creating a rolling event log, the application directory is created with 
> a call to FileSystem.mkdirs, with the file permission 770. The default 
> behavior of HDFS is to set the permission of a file created with 
> FileSystem.create or FileSystem.mkdirs to (P & ^umask), where P is the 
> permission in the API call and umask is a system value set by 
> fs.permissions.umask-mode and defaults to 0022. This means, with default 
> settings, any mkdirs call can have at most 755 permissions, which causes 
> rolling event log directories to be created with 750 permissions. This causes 
> the history server to be unable to prune old applications if they are not run 
> by the same user running the history server.
> This is not a problem for non-rolling logs, because it uses 
> SparkHadoopUtils.createFile for Hadoop 2 backward compatibility, and then 
> calls FileSystem.setPermission with 770 after the file has been created. 
> setPermission doesn't have the umask applied to it, so this works fine.
> Obviously this could be fixed by changing fs.permissions.umask-mode, but I'm 
> not sure the reason that's set in the first place or if this would hurt 
> anything else. The main issue is that the behavior differs between rolling 
> and non-rolling event logs, and it might be worth updating this repo to make 
> them consistent.
>  






[jira] [Updated] (SPARK-31176) Remove support for 'e'/'c' as datetime pattern charactar

2020-03-17 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-31176:
-
Summary: Remove support for 'e'/'c' as datetime pattern charactar   (was: 
Romove support for 'e'/'c' as datetime pattern charactar )

> Remove support for 'e'/'c' as datetime pattern charactar 
> -
>
> Key: SPARK-31176
> URL: https://issues.apache.org/jira/browse/SPARK-31176
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> The meaning of 'u' was day-number-of-week in SimpleDateFormat, but it was changed 
> to year in DateTimeFormatter. So we keep the old meaning of 'u' by 
> substituting 'u' with 'e' internally and then use DateTimeFormatter to parse the 
> pattern string. In DateTimeFormatter, 'e' and 'c' also represent 
> day-of-week, so we should mark them as illegal pattern characters to stay the 
> same as before. 
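A small illustration of the JDK divergence described above (plain JDK behaviour, 
not Spark code):

{code:java}
// In java.text.SimpleDateFormat, 'u' is the day number of week (1 = Monday).
new java.text.SimpleDateFormat("u").format(new java.util.Date())

// In java.time.format.DateTimeFormatter, 'u' is the year, while 'e'/'c' are the
// localized day-of-week fields that Spark substitutes 'u' into internally.
java.time.format.DateTimeFormatter.ofPattern("u").format(java.time.LocalDate.now())
java.time.format.DateTimeFormatter.ofPattern("e").format(java.time.LocalDate.now())
{code}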






[jira] [Created] (SPARK-31176) Romove support for 'e'/'c' as datetime pattern charactar

2020-03-17 Thread Kent Yao (Jira)
Kent Yao created SPARK-31176:


 Summary: Romove support for 'e'/'c' as datetime pattern charactar 
 Key: SPARK-31176
 URL: https://issues.apache.org/jira/browse/SPARK-31176
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


The meaning of 'u' was day-number-of-week in SimpleDateFormat, but it was changed 
to year in DateTimeFormatter. So we keep the old meaning of 'u' by 
substituting 'u' with 'e' internally and then use DateTimeFormatter to parse the 
pattern string. In DateTimeFormatter, 'e' and 'c' also represent 
day-of-week, so we should mark them as illegal pattern characters to stay the same 
as before. 






[jira] [Updated] (SPARK-31175) Avoid creating reverse comparator for each compare in InterpretedOrdering

2020-03-17 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-31175:
-
Summary: Avoid creating reverse comparator for each compare in 
InterpretedOrdering  (was: Avoid creating new reverse comparator per compare in 
InterpretedOrdering )

> Avoid creating reverse comparator for each compare in InterpretedOrdering
> -
>
> Key: SPARK-31175
> URL: https://issues.apache.org/jira/browse/SPARK-31175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: wuyi
>Priority: Major
>
> Currently, we create a new reverse comparator for each compare in 
> InterpretedOrdering, which can generate lots of small, short-lived objects and 
> put pressure on the JVM (GC).
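A minimal sketch of the caching pattern implied by the title (not the actual Spark 
code):

{code:java}
// Build the reversed ordering once and reuse it, instead of calling
// base.reverse inside every compare() invocation.
class CachedReverseOrdering[T](base: Ordering[T]) extends Ordering[T] {
  private val reversed = base.reverse          // allocated once
  override def compare(x: T, y: T): Int = reversed.compare(x, y)
}
{code}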






[jira] [Created] (SPARK-31175) Avoid creating new reverse comparator per compare in InterpretedOrdering

2020-03-17 Thread wuyi (Jira)
wuyi created SPARK-31175:


 Summary: Avoid creating new reverse comparator per compare in 
InterpretedOrdering 
 Key: SPARK-31175
 URL: https://issues.apache.org/jira/browse/SPARK-31175
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: wuyi


Currently, we create a new reverse comparator for each compare in 
InterpretedOrdering, which can generate lots of small, short-lived objects and 
put pressure on the JVM (GC).






[jira] [Created] (SPARK-31174) unix_timestamp() function returning NULL values for corner cases (daylight saving)

2020-03-17 Thread Piotr Skąpski (Jira)
Piotr Skąpski created SPARK-31174:
-

 Summary: unix_timestamp() function returning NULL values for 
corner cases (daylight saving)
 Key: SPARK-31174
 URL: https://issues.apache.org/jira/browse/SPARK-31174
 Project: Spark
  Issue Type: Bug
  Components: Spark Shell, SQL
Affects Versions: 2.3.4
Reporter: Piotr Skąpski


Running the code below does not return 4 values, because one timestamp is not 
created correctly (a possible problem with daylight saving time):
{code:java}
spark.sql("""SELECT from_unixtime(unix_timestamp('2020-03-08 01:00:00'), 
'yyyyMMdd') t1, from_unixtime(unix_timestamp('2020-03-08 02:00:00'), 
'yyyyMMdd') t2, from_unixtime(unix_timestamp('2020-03-08 03:00:00'), 
'yyyyMMdd') t3, from_unixtime(unix_timestamp('2020-03-08 04:00:00'), 
'yyyyMMdd') t4""").show

+--------+----+--------+--------+
|      t1|  t2|      t3|      t4|
+--------+----+--------+--------+
|20200308|null|20200308|20200308|
+--------+----+--------+--------+{code}
This NULL value caused us problems, as we did not expect it.






[jira] [Resolved] (SPARK-29278) Data Source V2: Support SHOW CURRENT CATALOG

2020-03-17 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim resolved SPARK-29278.
---
Resolution: Duplicate

> Data Source V2: Support SHOW CURRENT CATALOG
> 
>
> Key: SPARK-29278
> URL: https://issues.apache.org/jira/browse/SPARK-29278
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> Introduce the following SQL commands for Data Source V2
> {code:sql}
> CREATE NAMESPACE mycatalog.ns1.ns2
> SHOW CURRENT CATALOG
> SHOW CURRENT NAMESPACE
> {code}
>  






[jira] [Commented] (SPARK-29278) Data Source V2: Support SHOW CURRENT CATALOG

2020-03-17 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17061005#comment-17061005
 ] 

Terry Kim commented on SPARK-29278:
---

Thanks [~dongjoon]. I am resolving this since SHOW CURRENT CATALOG is not 
needed any longer (SHOW CURRENT NAMESPACE covers this case).

> Data Source V2: Support SHOW CURRENT CATALOG
> 
>
> Key: SPARK-29278
> URL: https://issues.apache.org/jira/browse/SPARK-29278
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> Introduce the following SQL commands for Data Source V2
> {code:sql}
> CREATE NAMESPACE mycatalog.ns1.ns2
> SHOW CURRENT CATALOG
> SHOW CURRENT NAMESPACE
> {code}
>  






[jira] [Updated] (SPARK-29278) Data Source V2: Support SHOW CURRENT CATALOG

2020-03-17 Thread Terry Kim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Terry Kim updated SPARK-29278:
--
Summary: Data Source V2: Support SHOW CURRENT CATALOG  (was: Implement 
CATALOG/NAMESPACE related SQL commands for Data Source V2)

> Data Source V2: Support SHOW CURRENT CATALOG
> 
>
> Key: SPARK-29278
> URL: https://issues.apache.org/jira/browse/SPARK-29278
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Terry Kim
>Priority: Major
>
> Introduce the following SQL commands for Data Source V2
> {code:sql}
> CREATE NAMESPACE mycatalog.ns1.ns2
> SHOW CURRENT CATALOG
> SHOW CURRENT NAMESPACE
> {code}
>  






[jira] [Resolved] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31170.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27933
[https://github.com/apache/spark/pull/27933]

> Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
> 
>
> Key: SPARK-31170
> URL: https://issues.apache.org/jira/browse/SPARK-31170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> In the Spark CLI, we create a Hive CliSessionState and it does not load 
> hive-site.xml. So the configurations in hive-site.xml will not take effect 
> as they do in other spark-hive integration apps.
> Also, the warehouse directory is not picked correctly. If the `default` 
> database does not exist, the CliSessionState will create one the first 
> time it talks to the metastore. The `Location` of the default DB will be 
> neither the value of spark.sql.warehouse.dir nor the user-specified value of 
> hive.metastore.warehouse.dir, but the default value of 
> hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`.






[jira] [Assigned] (SPARK-31170) Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31170:
---

Assignee: Kent Yao

> Spark Cli does not respect hive-site.xml and spark.sql.warehouse.dir
> 
>
> Key: SPARK-31170
> URL: https://issues.apache.org/jira/browse/SPARK-31170
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> In the Spark CLI, we create a Hive CliSessionState and it does not load 
> hive-site.xml. So the configurations in hive-site.xml will not take effect 
> as they do in other spark-hive integration apps.
> Also, the warehouse directory is not picked correctly. If the `default` 
> database does not exist, the CliSessionState will create one the first 
> time it talks to the metastore. The `Location` of the default DB will be 
> neither the value of spark.sql.warehouse.dir nor the user-specified value of 
> hive.metastore.warehouse.dir, but the default value of 
> hive.metastore.warehouse.dir, which will always be `/user/hive/warehouse`.






[jira] [Updated] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql

2020-03-17 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-25561:
-
Description: 
In HiveShim.scala, the current 
behahttps://issues.apache.org/jira/browse/SPARK-25561#vior is that if 
hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
call to succeed. If it fails, we'll throw a RuntimeException.

However, this might not always be the case. Hive's direct SQL functionality is 
best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark 
should handle that exception correctly if Hive falls back to ORM. 

  was:
In HiveShim.scala, the current behavior is that if 
hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
call to succeed. If it fails, we'll throw a RuntimeException.

However, this might not always be the case. Hive's direct SQL functionality is 
best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark 
should handle that exception correctly if Hive falls back to ORM. 


> HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
> --
>
> Key: SPARK-25561
> URL: https://issues.apache.org/jira/browse/SPARK-25561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Karthik Manamcheri
>Priority: Major
>
> In HiveShim.scala, the current 
> behahttps://issues.apache.org/jira/browse/SPARK-25561#vior is that if 
> hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
> call to succeed. If it fails, we'll throw a RuntimeException.
> However, this might not always be the case. Hive's direct SQL functionality 
> is best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark 
> should handle that exception correctly if Hive falls back to ORM. 






[jira] [Updated] (SPARK-25561) HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql

2020-03-17 Thread Imran Rashid (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid updated SPARK-25561:
-
Description: 
In HiveShim.scala, the current behavior is that if 
hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
call to succeed. If it fails, we'll throw a RuntimeException.

However, this might not always be the case. Hive's direct SQL functionality is 
best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark 
should handle that exception correctly if Hive falls back to ORM. 

  was:
In HiveShim.scala, the current 
behahttps://issues.apache.org/jira/browse/SPARK-25561#vior is that if 
hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
call to succeed. If it fails, we'll throw a RuntimeException.

However, this might not always be the case. Hive's direct SQL functionality is 
best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark 
should handle that exception correctly if Hive falls back to ORM. 


> HiveClient.getPartitionsByFilter throws an exception if Hive retries directSql
> --
>
> Key: SPARK-25561
> URL: https://issues.apache.org/jira/browse/SPARK-25561
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Karthik Manamcheri
>Priority: Major
>
> In HiveShim.scala, the current behavior is that if 
> hive.metastore.try.direct.sql is enabled, we expect the getPartitionsByFilter 
> call to succeed. If it fails, we'll throw a RuntimeException.
> However, this might not always be the case. Hive's direct SQL functionality 
> is best-attempt. Meaning, it will fall back to ORM if direct sql fails. Spark 
> should handle that exception correctly if Hive falls back to ORM. 






[jira] [Created] (SPARK-31173) Spark Kubernetes add tolerations and nodeName support

2020-03-17 Thread zhongwei liu (Jira)
zhongwei liu created SPARK-31173:


 Summary: Spark Kubernetes add tolerations and nodeName support
 Key: SPARK-31173
 URL: https://issues.apache.org/jira/browse/SPARK-31173
 Project: Spark
  Issue Type: New Feature
  Components: Kubernetes
Affects Versions: 3.1.0, 2.4.6
 Environment: Alibaba Cloud ACK with spark 
operator(v1beta2-1.1.0-2.4.5) and spark(2.4.5)
Reporter: zhongwei liu


When you run Spark on a serverless Kubernetes cluster (virtual-kubelet), you need 
to specify nodeSelectors, tolerations, and even nodeName to gain 
better scheduling performance. Currently Spark doesn't support tolerations. If 
you want to use this feature, you must use an admission controller webhook to 
decorate the pod, but the performance is extremely bad. Here is the benchmark:

With webhook:

Batch size: 500. Pod creation: about 7 Pods/s. All Pods running: 5 min.

Without webhook:

Batch size: 500. Pod creation: more than 500 Pods/s. All Pods running: 45 s.

Adding tolerations and nodeName support to Spark would be a great help when you want to 
run a large-scale job on a serverless Kubernetes cluster.

 

 






[jira] [Created] (SPARK-31172) RecordBinaryComparator Tests failing on Big Endian Platform (s390x)

2020-03-17 Thread Jinesh Patel (Jira)
Jinesh Patel created SPARK-31172:


 Summary: RecordBinaryComparator Tests failing on Big Endian 
Platform (s390x)
 Key: SPARK-31172
 URL: https://issues.apache.org/jira/browse/SPARK-31172
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5
 Environment: Architecture: Big Endian s390x

Operating Systems:
 * RHEL 7.x
 * RHEL 8.x
 * Ubuntu 16.04
 * Ubuntu 18.04
 * Ubuntu 19.10
 * SLES 12 SP4 and SLES 12 SP5
 * SLES 15 SP1
Reporter: Jinesh Patel
 Fix For: 3.0.0, 3.1.0, 2.4.6


Failing Test Cases in the RecordBinaryComparatorSuite:
 * testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue
 * testBinaryComparatorWhenSubtractionCanOverflowLongValue

Test cases failed after the change related to:

[Github Pull Request 
#26548|https://github.com/apache/spark/pull/26548#issuecomment-554645859]

Test Case: testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue
 * Fails due to changing the compare from `<` to `>` as the test condition (In 
little endian this is valid when the bytes are reversed, but not for big endian)

Test Case: testBinaryComparatorWhenSubtractionCanOverflowLongValue
 * Fails due to using Long.compareUnsigned
 * Possible Fix: Use signed compare for big endian platforms.
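A loose illustration of why the comparison flips between platforms (not the actual 
test code; the byte values are made up):

{code:java}
import java.nio.{ByteBuffer, ByteOrder}

// The same 8 bytes read as a native-order long give different values, so an
// unsigned comparison against the same reference flips sign across platforms.
val bytes = Array[Byte](0, 0, 0, 0, 0, 0, 0, 1)
val le = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong  // 0x0100000000000000L
val be = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getLong     // 1L
java.lang.Long.compareUnsigned(le, 2L)  // > 0 with the little-endian reading
java.lang.Long.compareUnsigned(be, 2L)  // < 0 with the big-endian reading
{code}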






[jira] [Assigned] (SPARK-31150) Parsing seconds fraction with variable length for timestamp

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31150:
---

Assignee: Kent Yao

> Parsing seconds fraction with variable length for timestamp
> ---
>
> Key: SPARK-31150
> URL: https://issues.apache.org/jira/browse/SPARK-31150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> This JIRA is to support parsing timestamp values with variable length second 
> fraction parts.
> e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse a timestamp with a 0~6 
> digit second fraction but fails for >=7 digits
> {code:java}
> select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
>  ('2019-10-06 10:11:12.'),
>  ('2019-10-06 10:11:12.0'),
>  ('2019-10-06 10:11:12.1'),
>  ('2019-10-06 10:11:12.12'),
>  ('2019-10-06 10:11:12.123UTC'),
>  ('2019-10-06 10:11:12.1234'),
>  ('2019-10-06 10:11:12.12345CST'),
>  ('2019-10-06 10:11:12.123456PST') t(v)
> 2019-10-06 03:11:12.123
> 2019-10-06 08:11:12.12345
> 2019-10-06 10:11:12
> 2019-10-06 10:11:12
> 2019-10-06 10:11:12.1
> 2019-10-06 10:11:12.12
> 2019-10-06 10:11:12.1234
> 2019-10-06 10:11:12.123456
> select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd 
> HH:mm:ss.SSSSSS[zzz]')
> NULL
> {code}
> Since 3.0, we use the Java 8 time API to parse and format timestamp values. When 
> we create the DateTimeFormatter, we use appendPattern to create the builder 
> first, where the 'S..S' part is parsed as a fixed length (= 
> 'S..S'.length). This fits the formatting side but is too strict for the parsing 
> side, because the trailing zeros are very likely to be truncated.






[jira] [Resolved] (SPARK-31150) Parsing seconds fraction with variable length for timestamp

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31150.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27906
[https://github.com/apache/spark/pull/27906]

> Parsing seconds fraction with variable length for timestamp
> ---
>
> Key: SPARK-31150
> URL: https://issues.apache.org/jira/browse/SPARK-31150
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> This JIRA is to support parsing timestamp values with variable length second 
> fraction parts.
> e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse a timestamp with a 0~6 
> digit second fraction but fails for >=7 digits
> {code:java}
> select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
>  ('2019-10-06 10:11:12.'),
>  ('2019-10-06 10:11:12.0'),
>  ('2019-10-06 10:11:12.1'),
>  ('2019-10-06 10:11:12.12'),
>  ('2019-10-06 10:11:12.123UTC'),
>  ('2019-10-06 10:11:12.1234'),
>  ('2019-10-06 10:11:12.12345CST'),
>  ('2019-10-06 10:11:12.123456PST') t(v)
> 2019-10-06 03:11:12.123
> 2019-10-06 08:11:12.12345
> 2019-10-06 10:11:12
> 2019-10-06 10:11:12
> 2019-10-06 10:11:12.1
> 2019-10-06 10:11:12.12
> 2019-10-06 10:11:12.1234
> 2019-10-06 10:11:12.123456
> select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd 
> HH:mm:ss.SSSSSS[zzz]')
> NULL
> {code}
> Since 3.0, we use the Java 8 time API to parse and format timestamp values. When 
> we create the DateTimeFormatter, we use appendPattern to create the builder 
> first, where the 'S..S' part is parsed as a fixed length (= 
> 'S..S'.length). This fits the formatting side but is too strict for the parsing 
> side, because the trailing zeros are very likely to be truncated.






[jira] [Created] (SPARK-31171) size(null) should return null under ansi mode

2020-03-17 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31171:
---

 Summary: size(null) should return null under ansi mode
 Key: SPARK-31171
 URL: https://issues.apache.org/jira/browse/SPARK-31171
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan









[jira] [Assigned] (SPARK-31164) Inconsistent rdd and output partitioning for bucket table when output doesn't contain all bucket columns

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31164:
---

Assignee: Zhenhua Wang

> Inconsistent rdd and output partitioning for bucket table when output doesn't 
> contain all bucket columns
> 
>
> Key: SPARK-31164
> URL: https://issues.apache.org/jira/browse/SPARK-31164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>
> For a bucketed table, when deciding output partitioning, if the output 
> doesn't contain all bucket columns, the result is `UnknownPartitioning`. But 
> when generating rdd, current Spark uses `createBucketedReadRDD` because it 
> doesn't check if the output contains all bucket columns. So the rdd and its 
> output partitioning are inconsistent.
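
A minimal reproduction sketch of the mismatch described above (table and column
names are illustrative; the comments restate the reported behaviour rather than
verified output):

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Bucket by (i, j), then scan with an output that drops bucket column j.
spark.sql(
  """CREATE TABLE t (i INT, j INT, k INT)
    |USING parquet
    |CLUSTERED BY (i, j) INTO 8 BUCKETS""".stripMargin)

val plan = spark.table("t").select("i", "k").queryExecution.executedPlan
// Reported as UnknownPartitioning, while the scan RDD is still built bucket-by-bucket.
println(plan.outputPartitioning)
{code}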



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31164) Inconsistent rdd and output partitioning for bucket table when output doesn't contain all bucket columns

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31164.
-
Fix Version/s: 2.4.6
   3.0.0
   Resolution: Fixed

> Inconsistent rdd and output partitioning for bucket table when output doesn't 
> contain all bucket columns
> 
>
> Key: SPARK-31164
> URL: https://issues.apache.org/jira/browse/SPARK-31164
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.3, 2.3.4, 2.4.5, 3.0.0
>Reporter: Zhenhua Wang
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>
> For a bucketed table, when deciding output partitioning, if the output 
> doesn't contain all bucket columns, the result is `UnknownPartitioning`. But 
> when generating rdd, current Spark uses `createBucketedReadRDD` because it 
> doesn't check if the output contains all bucket columns. So the rdd and its 
> output partitioning are inconsistent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28934) Add `spark.sql.compatiblity.mode`

2020-03-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-28934.
-
Resolution: Won't Fix

> Add `spark.sql.compatiblity.mode`
> -
>
> Key: SPARK-28934
> URL: https://issues.apache.org/jira/browse/SPARK-28934
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> This issue aims to add `spark.sql.compatiblity.mode` whose values are `spark` 
> or `pgSQL` case-insensitively to control PostgreSQL compatibility features.
>  
> Apache Spark 3.0.0 can start with `spark.sql.parser.ansi.enabled=false` and 
> `spark.sql.compatiblity.mode=spark`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks

2020-03-17 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060870#comment-17060870
 ] 

Wenchen Fan commented on SPARK-27665:
-

I believe it's fixed by SPARK-29435

> Split fetch shuffle blocks protocol from OpenBlocks
> ---
>
> Key: SPARK-27665
> URL: https://issues.apache.org/jira/browse/SPARK-27665
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> As the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks 
> protocol to describe the fetch request for shuffle blocks, and it causes the 
> extension work for shuffle fetching like SPARK-9853 and SPARK-25341 very 
> awkward. We need a new protocol only for shuffle blocks fetcher.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25383) Image data source supports sample pushdown

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25383:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Image data source supports sample pushdown
> --
>
> Key: SPARK-25383
> URL: https://issues.apache.org/jira/browse/SPARK-25383
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SQL
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> After SPARK-25349, we should update image data source to support sampling.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25349) Support sample pushdown in Data Source V2

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25349:
--
Issue Type: New Feature  (was: Story)

> Support sample pushdown in Data Source V2
> -
>
> Key: SPARK-25349
> URL: https://issues.apache.org/jira/browse/SPARK-25349
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Support sample pushdown would help file-based data source implementation save 
> I/O cost significantly if it can decide whether to read a file or not.
>  
> cc: [~cloud_fan]
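
For illustration, one hypothetical shape such a mix-in could take in the DSv2 read
path; this trait and its method are invented for this sketch and are not part of the
existing connector API.

{code:java}
import org.apache.spark.sql.connector.read.ScanBuilder

// Hypothetical mix-in, for illustration only.
trait SupportsPushDownSample extends ScanBuilder {

  /**
   * Ask the source to apply TABLESAMPLE itself; a file-based source could then skip
   * whole files. Returning false leaves the sampling to Spark.
   */
  def pushSample(fraction: Double, withReplacement: Boolean, seed: Long): Boolean
}
{code}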



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25349) Support sample pushdown in Data Source V2

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25349:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support sample pushdown in Data Source V2
> -
>
> Key: SPARK-25349
> URL: https://issues.apache.org/jira/browse/SPARK-25349
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Support sample pushdown would help file-based data source implementation save 
> I/O cost significantly if it can decide whether to read a file or not.
>  
> cc: [~cloud_fan]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060780#comment-17060780
 ] 

Dongjoon Hyun commented on SPARK-25728:
---

Hi, [~kiszk]. is there any update on this issue?

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25728) SPIP: Structured Intermediate Representation (Tungsten IR) for generating Java code

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25728:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> SPIP: Structured Intermediate Representation (Tungsten IR) for generating 
> Java code
> ---
>
> Key: SPARK-25728
> URL: https://issues.apache.org/jira/browse/SPARK-25728
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> This JIRA entry is to start a discussion about adding a structured intermediate 
> representation for generating Java code from a program using DataFrame or 
> Dataset API, in addition to the current String-based representation.
> This addition is based on the discussions in [a 
> thread|https://github.com/apache/spark/pull/21537#issuecomment-413268196].
> Please feel free to comment on this JIRA entry or [Google 
> Doc|https://docs.google.com/document/d/1Jzf56bxpMpSwsGV_hSzl9wQG22hyI731McQcjognqxY/edit?usp=sharing],
>  too.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25802) Use JDBC Oracle Binds from Spark SQL

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25802:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Use JDBC Oracle Binds from Spark SQL
> 
>
> Key: SPARK-25802
> URL: https://issues.apache.org/jira/browse/SPARK-25802
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nathan Loyer
>Priority: Major
>
> In case those reading aren't aware, any time a query is run against Oracle, 
> the database creates a plan and caches it. When a query is run, first it 
> checks to see if it can reuse a plan from the cache. When you use literals it 
> has to create a new plan even though there is one in the cache that matches 
> everything except for that literal value, which is the case for the 
> Spark-generated queries. Using binds/parameters instead allows the database to 
> reuse the previous plans and reduce the load on the database.
> My team is using Spark SQL with JDBC to query large amounts of data from 
> production Oracle databases. The queries built with the JDBCRDD class today 
> result in our databases having to do more work than they really need to, 
> which results in more load on our databases, which affects our users. For 
> this reason we've been investigating if it is possible to use spark sql with 
> query binds/parameters.  From what I can tell from reviewing documentation 
> and diving into the spark source code, this does not appear to be possible 
> today.
> Our spark usage looks like this:
> {code:java}
> spark.read()
> .format("jdbc")
> .option("url", connectionUrl)
> .option("dbtable", "( select c1, c2, c3 from tableName where c4 > 
> TO_DATE('2018-01-01 00:00:00', 'YYYY-MM-DD HH24:MI:SS') )")
> .option("driver", "oracle.jdbc.OracleDriver")
> .option("fetchSize", fetchSize)
> .option("lowerBound", minId)
> .option("upperBound", maxId)
> .option("partitionColumn", "ID")
> .option("numPartitions", numPartitions)
> .load();
> {code}
> So one way to alter the call to get what I am looking for would be like this:
> {code:java}
> spark.read()
> .format("jdbc")
> .option("url", connectionUrl)
> .option("dbtable", "( select c1, c2, c3 from tableName where c4 > 
> TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS') )")
> .option("driver", "oracle.jdbc.OracleDriver")
> .option("fetchSize", fetchSize)
> .option("lowerBound", minId)
> .option("upperBound", maxId)
> .option("partitionColumn", "ID")
> .option("numPartitions", numPartitions)
> .option("binds", ImmutableMap.of("timestamp", "2018-01-01 00:00:00"))
> .load();
> {code}
> The queries that Spark generates from this should look something like:
> {code:sql}
> SELECT c1, c2, c3
> FROM
>   (
> SELECT c1, c2, c3
> FROM tableName
> WHERE column > TO_DATE(:timestamp, 'YYYY-MM-DD HH24:MI:SS')
>   ) AS __SPARK_GEN_JDBC_SUBQUERY_NAME_1
> WHERE
>   ID >= :partitionLowerBound
>   AND ID < :partitionUpperBound
> {code}
> I am not certain if this parameterized query syntax is supported by all other 
> jdbc drivers or if it improves the performance on those databases or not.
> I'm also not sure if I picked the correct component or versions. Feel free to 
> correct them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26028) Design sketch for SPIP: Property Graphs, Cypher Queries, and Algorithms

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26028.
---
  Assignee: (was: Martin Junghanns)
Resolution: Won't Do

Unfortunately, the community decided not to add this.

> Design sketch for SPIP: Property Graphs, Cypher Queries, and Algorithms
> ---
>
> Key: SPARK-26028
> URL: https://issues.apache.org/jira/browse/SPARK-26028
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> Placeholder for the design discussion of SPARK-25994. The scope here is to 
> help SPIP vote instead of the final design.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26301) Consider switching from putting secret in environment variable directly to using secret reference

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26301:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Consider switching from putting secret in environment variable directly to 
> using secret reference
> -
>
> Key: SPARK-26301
> URL: https://issues.apache.org/jira/browse/SPARK-26301
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> In SPARK-26194 we proposed using an environment variable that is loaded in 
> the executor pod spec to share the generated SASL secret key between the 
> driver and the executors. However in practice this is very difficult to 
> secure. Most traditional Kubernetes deployments will handle permissions by 
> allowing wide access to viewing pod specs but restricting access to view 
> Kubernetes secrets. Now however any user that can view the pod spec can also 
> view the contents of the SASL secrets.
> An example use case where this quickly breaks down is in the case where a 
> systems administrator is allowed to look at pods that run user code in order 
> to debug failing infrastructure, but the cluster administrator should not be 
> able to view contents of secrets or other sensitive data from Spark 
> applications run by their users.
> We propose modifying the existing solution to instead automatically create a 
> Kubernetes Secret object containing the SASL encryption key, then using the 
> [secret reference feature in 
> Kubernetes|https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-environment-variables]
>  to store the data in the environment variable without putting the secret 
> data in the pod spec itself.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26354) Ability to return schema prefix before dataframe column names

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26354:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Ability to return schema prefix before dataframe column names
> -
>
> Key: SPARK-26354
> URL: https://issues.apache.org/jira/browse/SPARK-26354
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: t oo
>Priority: Major
>
> This query returns a dataframe with columns prdct, prdct, addr, pho:
> select a.prdct, b.prdct, a.addr, b.pho from ac a
> full outer join baa b on a.prdct = b.prdct
>  
> This feature Jira is about having a new config flag (defaulted to false), e.g. 
> show.schema.prefix. When true, the dataframe for the above example should have 
> columns a.prdct, b.prdct, a.addr, b.pho. This would help to clearly 
> distinguish the origin of columns with the same name in >=2 tables without having 
> to rewrite the query with explicit aliases, i.e. as a_prdct or as b_prdct. My 
> current use case is loading a dataframe into a List of Maps in Java, but it is 
> only taking the first prdct column rather than both prdct columns
>  
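
For context, a sketch of the workaround available today, which is exactly the
boilerplate the proposed flag would avoid (it assumes the ac and baa tables from the
example above are registered in the catalog):

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// Alias the tables, then re-alias every ambiguous column by hand.
val joined = spark.table("ac").as("a")
  .join(spark.table("baa").as("b"), col("a.prdct") === col("b.prdct"), "full_outer")
  .select(
    col("a.prdct").as("a_prdct"),
    col("b.prdct").as("b_prdct"),
    col("a.addr"),
    col("b.pho"))
{code}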



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31165) Multiple wrong references in Dockerfile for k8s

2020-03-17 Thread Nikolay Dimolarov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Dimolarov updated SPARK-31165:
--
Description: 
I am currently trying to follow the k8s instructions for Spark: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] and when I 
clone apache/spark on GitHub on the master branch I saw multiple wrong folder 
references after trying to build my Docker image:

 

*Issue 1: The comments in the Dockerfile reference the wrong folder for the 
Dockerfile:*
{code:java}
# If this docker file is being used in the context of building your images from 
a Spark # distribution, the docker build command should be invoked from the top 
level directory # of the Spark distribution. E.g.: # docker build -t 
spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .{code}
Well that docker build command simply won't run. I only got the following to 
run:
{code:java}
docker build -t spark:latest -f 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile . 
{code}
which is the actual path to the Dockerfile.

 

*Issue 2: jars folder does not exist*

After I read the tutorial I of course build spark first as per the instructions 
with:
{code:java}
./build/mvn -Pkubernetes -DskipTests clean package{code}
Nonetheless, in the Dockerfile I get this error when building:
{code:java}
Step 5/18 : COPY jars /opt/spark/jars
COPY failed: stat /var/lib/docker/tmp/docker-builder402673637/jars: no such 
file or directory{code}
 for which I may have found a similar issue here: 
[https://stackoverflow.com/questions/52451538/spark-for-kubernetes-test-on-mac]

I am new to Spark but I assume that this jars folder - if the build step would 
actually make it and I ran the maven build of the master branch successfully 
with the command I mentioned above - would exist in the root folder of the 
project.

 

*Issue 3: missing entrypoint.sh and decom.sh due to wrong reference*

While Issue 2 remains unresolved as I can't wrap my head around the missing 
jars folder (bin and sbin got copied successfully after I made a dummy jars 
folder) I then got stuck on these 2 steps:
{code:java}
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/ COPY 
kubernetes/dockerfiles/spark/decom.sh /opt/{code}
 
 with:
  
{code:java}
Step 8/18 : COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY failed: stat 
/var/lib/docker/tmp/docker-builder638219776/kubernetes/dockerfiles/spark/entrypoint.sh:
 no such file or directory{code}
 
 which makes sense since the path should actually be:
  
 resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
 resource-managers/kubernetes/docker/src/main/dockerfiles/spark/decom.sh
  
 *Remark*
  
 I only created one issue since this seems like somebody cleaned up the repo 
and forgot to change these. Am I missing something here? If I am, I apologise 
in advance since I am new to the Spark project. I also saw that some of these 
references were handled through vars in previous branches: 
[https://github.com/apache/spark/blob/branch-2.4/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile]
 (e.g. 2.4) but that also does not run out of the box.
  
 I am also really not sure about the affected versions since that was not 
transparent enough for me on GH - feel free to edit that field :) 
  
 I can also create a PR and change these but I need help with Issue 2 and the 
jar files since I am not sure what the correct path for that one is. Would love 
some help on this :) 
  
 Thanks in advance!
  
  
  

  was:
I am currently trying to follow the k8s instructions for Spark: 
[https://spark.apache.org/docs/latest/running-on-kubernetes.html] and when I 
clone apache/spark on GitHub on the master branch I saw multiple wrong folder 
references after trying to build my Docker image:

 

*Issue 1: The comments in the Dockerfile reference the wrong folder for the 
Dockerfile:*
{code:java}
# If this docker file is being used in the context of building your images from 
a Spark # distribution, the docker build command should be invoked from the top 
level directory # of the Spark distribution. E.g.: # docker build -t 
spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .{code}
Well that docker build command simply won't run. I only got the following to 
run:
{code:java}
docker build -t spark:latest -f 
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile . 
{code}
which is the actual path to the Dockerfile.

 

*Issue 2: jars folder does not exist*

After I read the tutorial I of course build spark first as per the instructions 
with:
{code:java}
./build/mvn -Pkubernetes -DskipTests clean package{code}
Nonetheless, in the Dockerfile I get this error when building:
{code:java}
Step 8/18 : COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY failed: stat 

[jira] [Updated] (SPARK-26373) Spark UI 'environment' tab - column to indicate default vs overridden values

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26373:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Spark UI 'environment' tab - column to indicate default vs overridden values
> 
>
> Key: SPARK-26373
> URL: https://issues.apache.org/jira/browse/SPARK-26373
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.1.0
>Reporter: t oo
>Priority: Minor
>
> Rather than just showing the name and value for each property, a new column would 
> also show whether the value is the default (show 'AS PER DEFAULT') or if it's 
> overridden (show the actual default value).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26498) Integrate barrier execution with MMLSpark's LightGBM

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26498:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Integrate barrier execution with MMLSpark's LightGBM
> 
>
> Key: SPARK-26498
> URL: https://issues.apache.org/jira/browse/SPARK-26498
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, MLlib
>Affects Versions: 3.1.0
>Reporter: Ilya Matiach
>Priority: Major
>
> I would like to use the new barrier execution mode introduced in spark 2.4 
> with LightGBM in the spark package mmlspark but I ran into some issues.
> Currently, the LightGBM distributed learner tries to figure out the number of 
> cores on the cluster and then does a coalesce and a mapPartitions, and inside 
> the mapPartitions we do a NetworkInit (where the address:port of all workers 
> needs to be passed in the constructor) and pass the data in-memory to the 
> native layer of the distributed lightgbm learner.
> With barrier execution mode, I think the code would become much more robust.  
> However, there are several issues that I am running into when trying to move 
> my code over to the new barrier execution mode scheduler:
> Does not support dynamic allocation – however, I think it would be convenient 
> if it restarted the job when the number of workers has decreased and allowed 
> the dev to decide whether to restart the job if the number of workers 
> increased
> Does not work with DataFrame or Dataset API, but I think it would be much 
> more convenient if it did.
> How does barrier execution mode deal with #partitions > #tasks?  If the 
> number of partitions is larger than the number of “tasks” or workers, can 
> barrier execution mode automatically coalesce the dataset to have # 
> partitions == # tasks?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26410) Support per Pandas UDF configuration

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26410:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support per Pandas UDF configuration
> 
>
> Key: SPARK-26410
> URL: https://issues.apache.org/jira/browse/SPARK-26410
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Xiangrui Meng
>Priority: Major
>
> We use a "maxRecordsPerBatch" conf to control the batch sizes. However, the 
> "right" batch size usually depends on the task itself. It would be nice if 
> user can configure the batch size when they declare the Pandas UDF.
> This is orthogonal to SPARK-23258 (using max buffer size instead of row 
> count).
> Besides API, we should also discuss how to merge Pandas UDFs of different 
> configurations. For example,
> {code}
> df.select(predict1(col("features"), predict2(col("features")))
> {code}
> when predict1 requests 100 rows per batch, while predict2 requests 120 rows 
> per batch.
> cc: [~icexelloss] [~bryanc] [~holdenk] [~hyukjin.kwon] [~ueshin] [~smilegator]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26589) proper `median` method for spark dataframe

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26589:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> proper `median` method for spark dataframe
> --
>
> Key: SPARK-26589
> URL: https://issues.apache.org/jira/browse/SPARK-26589
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jan Gorecki
>Priority: Minor
>
> I found multiple tickets asking for a median function to be implemented in 
> Spark. Most of those tickets link to "SPARK-6761 Approximate quantile" as a 
> duplicate of it. The thing is that an approximate quantile is a workaround for 
> the lack of a median function. Thus I am filing this Feature Request for a 
> proper, exact (not approximate) median function. I am aware of the difficulties 
> caused by a distributed environment when trying to compute a median; 
> nevertheless, I don't think those difficulties are reason enough to drop 
> the `median` function from the scope of Spark. I am not asking for an efficient 
> median but an exact median.
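
For reference, the closest thing available today is approxQuantile with a relative
error of 0.0, which does compute the exact quantile, but only as a batch action
rather than an aggregate expression; a small sketch:

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1.0, 3.0, 2.0, 10.0, 4.0).toDF("value")

// relativeError = 0.0 requests the exact quantile (potentially expensive on large data).
val Array(exactMedian) = df.stat.approxQuantile("value", Array(0.5), 0.0)
println(exactMedian)   // 3.0
{code}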



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26764) [SPIP] Spark Relational Cache

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26764:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> [SPIP] Spark Relational Cache
> -
>
> Key: SPARK-26764
> URL: https://issues.apache.org/jira/browse/SPARK-26764
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Adrian Wang
>Priority: Major
> Attachments: Relational+Cache+SPIP.pdf
>
>
> In modern database systems, relational cache is a common technology to boost 
> ad-hoc queries. While Spark provides cache natively, Spark SQL should be able 
> to utilize the relationship between relations to boost all possible queries. 
> In this SPIP, we will make Spark be able to utilize all defined cached 
> relations if possible, without explicit substitution in user query, as well 
> as keep some user defined cache available in different sessions. Materialized 
> views in many database systems provide similar function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27014) Support removal of jars and Spark binaries from Mesos driver and executor sandboxes

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27014:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support removal of jars and Spark binaries from Mesos driver and executor 
> sandboxes
> ---
>
> Key: SPARK-27014
> URL: https://issues.apache.org/jira/browse/SPARK-27014
> Project: Spark
>  Issue Type: New Feature
>  Components: Mesos
>Affects Versions: 3.1.0
>Reporter: Martin Loncaric
>Priority: Minor
>
> Currently, each Spark application run on Mesos leaves behind at least 500MB 
> of data in sandbox directories, coming from Spark binaries and copied URIs. 
> These can build up as a disk leak, causing major issues on Mesos clusters 
> unless their grace period for sandbox directories is very short.
> Spark should have a feature to delete these (from both driver and executor 
> sandboxes) on teardown.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26867) Spark Support of YARN Placement Constraint

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26867:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Spark Support of YARN Placement Constraint
> --
>
> Key: SPARK-26867
> URL: https://issues.apache.org/jira/browse/SPARK-26867
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Prabhu Joseph
>Priority: Major
>
> YARN provides Placement Constraint features, where an application can request 
> containers based on affinity / anti-affinity / cardinality to services or 
> other application containers / node attributes. This is a useful feature for 
> Spark jobs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27143) Provide REST API for JDBC/ODBC level information

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27143:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Provide REST API for JDBC/ODBC level information
> 
>
> Key: SPARK-27143
> URL: https://issues.apache.org/jira/browse/SPARK-27143
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Ajith S
>Priority: Minor
>
> Currently, for monitoring a Spark application, JDBC/ODBC information is not 
> available from REST but only via the UI. REST provides only 
> applications, jobs, stages, and environment. This Jira is targeted at providing a 
> REST API so that JDBC/ODBC-level information like session statistics and SQL 
> statistics can be provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27249:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Developers API for Transformers beyond UnaryTransformer
> ---
>
> Key: SPARK-27249
> URL: https://issues.apache.org/jira/browse/SPARK-27249
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: Everett Rush
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that 
> need more than one column from a row (ie UnaryTransformer inputs one column 
> and outputs one column) or that contain objects too expensive to initialize 
> repeatedly in a UDF such as a database connection. 
>  
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row]
> NB: This parallels the UnaryTransformer createTransformFunc method
>  
> When developers subclass this transformer, they can provide their own schema 
> for the output Row in which case the PartitionTransformer creates a row 
> encoder and executes the transformation. Alternatively the developer can set 
> output Datatype and output col name. Then the PartitionTransformer class will 
> create a new schema, a row encoder, and execute the transformation.
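
A rough sketch of what the abstract class described above might look like; the
method names and the use of the internal RowEncoder are choices made for this
sketch, not part of the proposal itself.

{code:java}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, Encoder, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

abstract class PartitionTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("partitionTransformer"))

  /** Schema of the rows produced by the partition-level function. */
  def outputSchema: StructType

  /** Parallels UnaryTransformer.createTransformFunc, but works on whole partitions,
   *  so expensive resources (e.g. a database connection) are opened once per partition. */
  def createTransformFunc: Iterator[Row] => Iterator[Row]

  override def transformSchema(schema: StructType): StructType = outputSchema

  override def transform(dataset: Dataset[_]): DataFrame = {
    val func = createTransformFunc
    implicit val enc: Encoder[Row] = RowEncoder(outputSchema)
    dataset.toDF().mapPartitions(func)
  }

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
{code}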



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27447) Add collaborate filtering Explain API in SPARKML

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27447:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add collaborate filtering Explain API in SPARKML
> 
>
> Key: SPARK-27447
> URL: https://issues.apache.org/jira/browse/SPARK-27447
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: guohao xiao
>Priority: Minor
>
> Machine learning recommender systems have supercharged the online retail 
> environment by directly targeting what the customer wants. While customers 
> are getting better product recommendations than ever before, in the age of 
> GDPR there is growing concern about customer privacy and transparency with ML 
> models. Many are asking, just why am I receiving these recommendations? While 
> the current Implicit Collaborative Filtering (CF) algorithm in spark.ml is 
> great for generating recommendations at scale, its currently lacks any method 
> to explain why a particular customer is getting the recommendations they are 
> getting. In this talk, we demonstrate a way to expand collaborative filtering 
> so that the viewing history of a customer can be directly related to their 
> recommendations. Why were you recommended footwear? Well, 40% of this 
> recommendation came from browsing runners and 20% came from the shorts you 
> recently purchased. Turns out, rethinking of the linear algebra in the 
> current spark.ml CF implementation makes this possible. We show how this is 
> done and demonstrate it implemented as a new feature in spark.ml, expanding 
> the API to allow everyone to explain recommendations at scale and create a 
> more transparent ML future.
>  
>  
> This project is going to present in Spark summit 2019:
> https://databricks.com/sparkaisummit/north-america/sessions-single-2019?id=56



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27561) Support "lateral column alias references" to allow column aliases to be used within SELECT clauses

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27561:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support "lateral column alias references" to allow column aliases to be used 
> within SELECT clauses
> --
>
> Key: SPARK-27561
> URL: https://issues.apache.org/jira/browse/SPARK-27561
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Josh Rosen
>Priority: Major
>
> Amazon Redshift has a feature called "lateral column alias references": 
> [https://aws.amazon.com/about-aws/whats-new/2018/08/amazon-redshift-announces-support-for-lateral-column-alias-reference/].
>  Quoting from that blogpost:
> {quote}The support for lateral column alias reference enables you to write 
> queries without repeating the same expressions in the SELECT list. For 
> example, you can define the alias 'probability' and use it within the same 
> select statement:
> {code:java}
> select clicks / impressions as probability, round(100 * probability, 1) as 
> percentage from raw_data;
> {code}
> {quote}
> There's more information about this feature on 
> [https://docs.aws.amazon.com/redshift/latest/dg/r_SELECT_list.html:]
> {quote}The benefit of the lateral alias reference is you don't need to repeat 
> the aliased expression when building more complex expressions in the same 
> target list. When Amazon Redshift parses this type of reference, it just 
> inlines the previously defined aliases. If there is a column with the same 
> name defined in the FROM clause as the previously aliased expression, the 
> column in the FROM clause takes priority. For example, in the above query if 
> there is a column named 'probability' in table raw_data, the 'probability' in 
> the second expression in the target list will refer to that column instead of 
> the alias name 'probability'.
> {quote}
> It would be nice if Spark supported this syntax. I don't think that this is 
> standard SQL, so it might be a good idea to research if other SQL databases 
> support similar syntax (and to see if they implement the same column 
> resolution strategy as Redshift).
> We should also consider whether this needs to be feature-flagged as part of a 
> specific SQL compatibility mode / dialect.
> One possibly-related existing ticket: SPARK-9338, which discusses the use of 
> SELECT aliases in GROUP BY expressions.
> /cc [~hvanhovell]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27616) Standalone cluster management user resource allocation

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27616?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27616:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Standalone cluster management user resource allocation
> --
>
> Key: SPARK-27616
> URL: https://issues.apache.org/jira/browse/SPARK-27616
> Project: Spark
>  Issue Type: New Feature
>  Components: Deploy
>Affects Versions: 3.1.0
>Reporter: weDataSphere
>Priority: Major
>
> Standalone clusters are currently unable to restrict user resources, and users can 
> adjust resource usage by setting parameters. To give the Standalone 
> cluster resource isolation similar to queues in YARN,
>  I want to add resource statistics and allocation policies to the Master 
> process. Administrators can adjust user profiles by adjusting configuration 
> files. When a client submits its registration to the Master, 
> the user's resource limit can be determined from the resources of the 
> submitting user, and a user resource record is added to the UI so that 
> users can view it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27753) Support SQL expressions for interval parameter in Structured Streaming

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27753:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support SQL expressions for interval parameter in Structured Streaming
> --
>
> Key: SPARK-27753
> URL: https://issues.apache.org/jira/browse/SPARK-27753
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Shixiong Zhu
>Priority: Major
>
> Structured Streaming has several methods that accept an interval string. It 
> would be great if we could use the parser to parse it so that we can also 
> support SQL expressions.
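
Two examples of the interval-string parameters being referred to; both currently
accept only a plain interval string such as "10 seconds", which is what the ticket
proposes to route through the parser so that SQL expressions work too.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val source = spark.readStream.format("rate").load()

// Interval strings today: a watermark delay and a processing-time trigger.
val withWatermark = source.withWatermark("timestamp", "10 minutes")

val query = withWatermark.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .start()
{code}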



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060768#comment-17060768
 ] 

Dongjoon Hyun commented on SPARK-27780:
---

Hi, [~irashid]. Is there any update?

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27665) Split fetch shuffle blocks protocol from OpenBlocks

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060767#comment-17060767
 ] 

Dongjoon Hyun commented on SPARK-27665:
---

[~koert]. Do you still hit the same issue at 3.0.0-preview2?

> Split fetch shuffle blocks protocol from OpenBlocks
> ---
>
> Key: SPARK-27665
> URL: https://issues.apache.org/jira/browse/SPARK-27665
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> As the current approach in OneForOneBlockFetcher, we reuse the OpenBlocks 
> protocol to describe the fetch request for shuffle blocks, and it causes the 
> extension work for shuffle fetching like SPARK-9853 and SPARK-25341 very 
> awkward. We need a new protocol only for shuffle blocks fetcher.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27785) Introduce .joinWith() overloads for typed inner joins of 3 or more tables

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27785?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27785:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Introduce .joinWith() overloads for typed inner joins of 3 or more tables
> -
>
> Key: SPARK-27785
> URL: https://issues.apache.org/jira/browse/SPARK-27785
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Josh Rosen
>Priority: Major
>
> Today it's rather painful to do a typed dataset join of more than two tables: 
> {{Dataset[A].joinWith(Dataset[B])}} returns {{Dataset[(A, B)]}} so chaining 
> on a third inner join requires users to specify a complicated join condition 
> (referencing variables like {{_1}} or {{_2}} in the join condition, AFAIK), 
> resulting a doubly-nested schema like {{Dataset[((A, B), C)]}}. Things become 
> even more painful if you want to layer on a fourth join. Using {{.map()}} to 
> flatten the data into {{Dataset[(A, B, C)]}} has a performance penalty, too.
> To simplify this use case, I propose to introduce a new set of overloads of 
> {{.joinWith}}, supporting joins of {{N > 2}} tables for {{N}} up to some 
> reasonable number (say, 6). For example:
> {code:java}
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   condition: Column
> ): Dataset[(T, T1, T2)]
> Dataset[T].joinWith[T1, T2](
>   ds1: Dataset[T1],
>   ds2: Dataset[T2],
>   ds3: Dataset[T3],
>   condition: Column
> ): Dataset[(T, T1, T2, T3)]{code}
> I propose to do this only for inner joins (consistent with the default join 
> type for {{joinWith}} in case joins are not specified).
> I haven't thought about this too much yet and am not committed to the API 
> proposed above (it's just my initial idea), so I'm open to suggestions for 
> alternative typed APIs for this.
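
To make the pain point concrete, a small sketch of what a three-table typed join
looks like today (the case classes and data are made up for the example):

{code:java}
import org.apache.spark.sql.{Dataset, SparkSession}

case class A(id: Long, x: String)
case class B(id: Long, y: String)
case class C(id: Long, z: String)

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val dsA = Seq(A(1L, "x1")).toDS()
val dsB = Seq(B(1L, "y1")).toDS()
val dsC = Seq(C(1L, "z1")).toDS()

// Today: chaining joinWith nests the schema, and the second condition has to reach
// into the tuple columns ("_1", "_2") of the intermediate Dataset[(A, B)].
val ab: Dataset[(A, B)] = dsA.joinWith(dsB, dsA("id") === dsB("id"))
val abc: Dataset[((A, B), C)] = ab.joinWith(dsC, ab("_1.id") === dsC("id"))

// Flattening to (A, B, C) needs an extra map, i.e. an extra round of (de)serialization.
val flat: Dataset[(A, B, C)] = abc.map { case ((a, b), c) => (a, b, c) }
{code}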



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27790) Support SQL INTERVAL types

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27790:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support SQL INTERVAL types
> --
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Priority: Major
>
> SQL standard defines 2 interval types:
> # year-month interval contains a YEAR field or a MONTH field or both
> # day-time interval contains DAY, HOUR, MINUTE, and SECOND (possibly fraction 
> of seconds)
> Need to add 2 new internal types YearMonthIntervalType and 
> DayTimeIntervalType, support operations defined by SQL standard as well as 
> INTERVAL literals.
> The java.time.Period and java.time.Duration can be supported as external type 
> for YearMonthIntervalType and DayTimeIntervalType.
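
As a small illustration of the proposed external type mapping (the values are
arbitrary; only java.time is used here):

{code:java}
import java.time.{Duration, Period}

// Year-month interval (YEAR, MONTH fields) -> java.time.Period
val yearMonth: Period = Period.ofYears(1).plusMonths(2)

// Day-time interval (DAY, HOUR, MINUTE, SECOND fields) -> java.time.Duration
val dayTime: Duration = Duration.ofDays(1).plusHours(2).plusMinutes(3).plusSeconds(4)
{code}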



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27780) Shuffle server & client should be versioned to enable smoother upgrade

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27780:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Shuffle server & client should be versioned to enable smoother upgrade
> --
>
> Key: SPARK-27780
> URL: https://issues.apache.org/jira/browse/SPARK-27780
> Project: Spark
>  Issue Type: New Feature
>  Components: Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Imran Rashid
>Priority: Major
>
> The external shuffle service is often upgraded at a different time than spark 
> itself.  However, this causes problems when the protocol changes between the 
> shuffle service and the spark runtime -- this forces users to upgrade 
> everything simultaneously.
> We should add versioning to the shuffle client & server, so they know what 
> messages the other will support.  This would allow better handling of mixed 
> versions, from better error msgs to allowing some mismatched versions (with 
> reduced capabilities).
> This originally came up in a discussion here: 
> https://github.com/apache/spark/pull/24565#issuecomment-493496466
> There are a few ways we could do the versioning which we still need to 
> discuss:
> 1) Version specified by config.  This allows for mixed versions across the 
> cluster and rolling upgrades.  It also will let a spark 3.0 client talk to a 
> 2.4 shuffle service.  But, may be a nuisance for users to get this right.
> 2) Auto-detection during registration with local shuffle service.  This makes 
> the versioning easy for the end user, and can even handle a 2.4 shuffle 
> service though it does not support the new versioning.  However, it will not 
> handle a rolling upgrade correctly -- if the local shuffle service has been 
> upgraded, but other nodes in the cluster have not, it will get the version 
> wrong.
> 3) Exchange versions per-connection.  When a connection is opened, the server 
> & client could first exchange messages with their versions, so they know how 
> to continue communication after that.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27799) Allow SerializerManager.canUseKryo whitelist to be extended via a configuration

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27799:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Allow SerializerManager.canUseKryo whitelist to be extended via a 
> configuration
> ---
>
> Key: SPARK-27799
> URL: https://issues.apache.org/jira/browse/SPARK-27799
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Josh Rosen
>Priority: Major
>
> Kryo serialization can offer a substantial performance boost compared to Java 
> serialization and I generally recommend that users configure Spark to use it.
> That said, in general it may not be safe to _blindly_ flip the default to 
> Kryo: certain jobs might depend on Java serialization, so switching them to 
> Kryo might cause crashes or incorrect behavior.
> However, we may know that certain data types are safe to serialize with Kryo, 
> in which case we can whitelist _just those types_ for use with Kryo 
> serialization but keep everything else using the default Java serializer.
> Back in SPARK-13926 (Spark 2.0) I added a {{SerializerManager}} to implement 
> this idea for strings, primitives, primitive arrays, and a few other data 
> types: those types will automatically use Kryo serialization when used as 
> top-level types in RDDs. However, there's no ability for users to customize / 
> extend this whitelist.
> I propose to add a new user-facing configuration, name TBD, which accepts a 
> comma-separated list of class / interface names and uses them to expand the 
> {{SerializerMananger.canUseKryo}} whitelist.
> This will allow advanced users to incrementally default to Kryo for certain 
> types (e.g. Scrooge ThriftStructs).
> This feature is useful for "data platform" teams who provide 
> Spark-as-a-service to internal customers: with this proposed configuration, 
> platform teams can configure global defaults for serialization in a way which 
> is more incremental / narrow-in-scope than simply defaulting to Kryo 
> everywhere.
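
A sketch of how such a setting might be used. The config key below is purely
illustrative, since the ticket explicitly leaves the name TBD, and the second class
name is made up; only com.twitter.scrooge.ThriftStruct refers to the Scrooge use
case mentioned above.

{code:java}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kryo-whitelist-sketch")
  // Hypothetical key (name TBD in the ticket); value is a comma-separated class list.
  .config("spark.serializer.extraKryoClasses",
          "com.twitter.scrooge.ThriftStruct,com.example.events.PageView")
  .getOrCreate()
{code}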



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27808) Ability to ignore existing files for structured streaming

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27808:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Ability to ignore existing files for structured streaming
> -
>
> Key: SPARK-27808
> URL: https://issues.apache.org/jira/browse/SPARK-27808
> Project: Spark
>  Issue Type: New Feature
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Vladimir Matveev
>Priority: Major
>
> Currently it is not easily possible to make a structured streaming query 
> ignore all of the existing data inside a directory and only process new 
> files, created after the job was started. See here for example: 
> [https://stackoverflow.com/questions/51391722/spark-structured-streaming-file-source-starting-offset]
>  
> My use case is to ignore everything which existed in the directory when the 
> streaming job is first started (and there are no checkpoints), but to behave 
> as usual when the stream is restarted, e.g. catch up reading new files since 
> the last restart. This would allow us to use the streaming job for continuous 
> processing, with all the benefits it brings, but also to keep the possibility 
> to reprocess the data in the batch fashion by a different job, drop the 
> checkpoints and make the streaming job only run for the new data.
>  
> It would be great to have an option similar to the `newFilesOnly` option on 
> the original StreamingContext.fileStream method: 
> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.StreamingContext@fileStream[K,V,F%3C:org.apache.hadoop.mapreduce.InputFormat[K,V]](directory:String,filter:org.apache.hadoop.fs.Path=%3EBoolean,newFilesOnly:Boolean)(implicitevidence$7:scala.reflect.ClassTag[K],implicitevidence$8:scala.reflect.ClassTag[V],implicitevidence$9:scala.reflect.ClassTag[F]):org.apache.spark.streaming.dstream.InputDStream[(K,V])]
> but probably with slightly different semantics, described above (ignore all 
> existing for the first run, catch up for the following runs).
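
A sketch of the kind of knob being asked for; the option name used here is
hypothetical and does not exist in the file source today.

{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StructType, TimestampType}

val spark = SparkSession.builder().master("local[*]").getOrCreate()

val eventSchema = new StructType()
  .add("id", LongType)
  .add("ts", TimestampType)

// Hypothetical option: skip files that already exist on the very first run,
// then catch up normally from the checkpoint on restarts.
val events = spark.readStream
  .format("json")
  .schema(eventSchema)
  .option("ignoreExistingFiles", "true")   // not a real option; illustration only
  .load("/data/events")
{code}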



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27911) PySpark Packages should automatically choose correct scala version

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27911:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> PySpark Packages should automatically choose correct scala version
> --
>
> Key: SPARK-27911
> URL: https://issues.apache.org/jira/browse/SPARK-27911
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.1.0
>Reporter: Michael Armbrust
>Priority: Major
>
> Today, users of pyspark (and Scala) need to manually specify the version of 
> Scala that their Spark installation is using when adding a Spark package to 
> their application. This extra configuration is confusing to users who may not 
> even know which version of Scala they are using (for example, if they 
> installed Spark using {{pip}}). The confusion here is exacerbated by releases 
> in Spark that have changed the default from {{2.11}} -> {{2.12}} -> {{2.11}}.
> https://spark.apache.org/releases/spark-release-2-4-2.html
> https://spark.apache.org/releases/spark-release-2-4-3.html
> Since Spark can know which version of Scala it was compiled for, we should 
> give users the option to automatically choose the correct version.  This 
> could be as simple as a substitution for {{$scalaVersion}} or something when 
> resolving a package (similar to SBTs support for automatically handling scala 
> dependencies).
> Here are some concrete examples of users getting it wrong and getting 
> confused:
> https://github.com/delta-io/delta/issues/6
> https://github.com/delta-io/delta/issues/63
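
As a small supporting note, Spark can already report the Scala version it ships with at runtime, which is what a {{$scalaVersion}}-style substitution would rely on (the package coordinates in the comments are only illustrative):

{code:java}
// Today a user must hard-code the suffix, e.g. --packages io.delta:delta-core_2.12:<version>;
// the proposal would let them write something like delta-core_$scalaVersion instead.
println(scala.util.Properties.versionNumberString) // e.g. 2.12.10, the Scala version on the classpath
println(org.apache.spark.SPARK_VERSION)            // the Spark version the package must match
{code}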



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27935) Introduce leftOuterJoinWith and fullOuterJoinWith

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27935:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Introduce leftOuterJoinWith and fullOuterJoinWith
> -
>
> Key: SPARK-27935
> URL: https://issues.apache.org/jira/browse/SPARK-27935
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Spencer
>Priority: Minor
>
> Currently, calling *Dataset[A].joinWith(Dataset[B], col, "left_outer")* or 
> *Dataset[A].joinWith(Dataset[B], col, "full_outer")* require users to do null 
> checks on the resulting *Dataset[(A, B)]*
>  
> To make the expected result types of outer joins more explicit, I propose a 
> couple of new joinWith functions:
> {noformat}
> def leftOuterJoinWith[U](other: Dataset[U], condition: Column): Dataset[(T, 
> Option[U])]
> def fullOuterJoinWith[U](other: Dataset[U], condition: Column): 
> Dataset[(Option[T], Option[U])]{noformat}
>  
> The return type of *fullOuterJoinWith* is imperfect, since *(None, None)* is 
> an invalid case, but still an improvement on the present interface.
>  
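
A minimal sketch of the Option-wrapping boilerplate that the proposed methods would remove, using small illustrative case classes:

{code:java}
import org.apache.spark.sql.{Dataset, SparkSession}

case class A(id: Int, x: String)
case class B(id: Int, y: String)

val spark = SparkSession.builder().master("local[*]").appName("joinWith-demo").getOrCreate()
import spark.implicits._

val as = Seq(A(1, "a1"), A(2, "a2")).toDS()
val bs = Seq(B(1, "b1")).toDS()

// Today a left_outer joinWith yields Dataset[(A, B)] where the right side can be null,
// so callers wrap it in Option themselves; leftOuterJoinWith would return this directly.
val joined: Dataset[(A, Option[B])] =
  as.joinWith(bs, as("id") === bs("id"), "left_outer")
    .map { case (a, b) => (a, Option(b)) }

joined.show()
{code}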



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19700) Design an API for pluggable scheduler implementations

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-19700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-19700:
--
Affects Version/s: (was: 2.1.0)
   3.1.0

> Design an API for pluggable scheduler implementations
> -
>
> Key: SPARK-19700
> URL: https://issues.apache.org/jira/browse/SPARK-19700
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Matt Cheah
>Priority: Major
>
> One point that was brought up in discussing SPARK-18278 was that schedulers 
> cannot easily be added to Spark without forking the whole project. The main 
> reason is that much of the scheduler's behavior fundamentally depends on the 
> CoarseGrainedSchedulerBackend class, which is not part of the public API of 
> Spark and is in fact quite a complex module. As resource management and 
> allocation continue to evolve, Spark will need to be integrated with more 
> cluster managers, but maintaining support for all possible allocators in the 
> Spark project would be untenable. Furthermore, it would be impossible for 
> Spark to support proprietary frameworks that are developed by specific users 
> for their other particular use cases.
> Therefore, this ticket proposes making scheduler implementations fully 
> pluggable. The idea is that Spark will provide a Java/Scala interface that is 
> to be implemented by a scheduler that is backed by the cluster manager of 
> interest. The user can compile their scheduler's code into a JAR that is 
> placed on the driver's classpath. Finally, as is the case in the current 
> world, the scheduler implementation is selected and dynamically loaded 
> depending on the user's provided master URL.
> Determining the correct API is the most challenging problem. The current 
> CoarseGrainedSchedulerBackend handles many responsibilities, some of which 
> will be common across all cluster managers, and some which will be specific 
> to a particular cluster manager. For example, the particular mechanism for 
> creating the executor processes will differ between YARN and Mesos, but, once 
> these executors have started running, the means to submit tasks to them over 
> the Netty RPC is identical across the board.
> We must also consider a plugin model and interface for submitting the 
> application as well, because different cluster managers support different 
> configuration options, and thus the driver must be bootstrapped accordingly. 
> For example, in YARN mode the application and Hadoop configuration must be 
> packaged and shipped to the distributed cache prior to launching the job. A 
> prototype of a Kubernetes implementation starts a Kubernetes pod that runs 
> the driver in cluster mode.
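
A purely hypothetical sketch of the shape such a pluggable interface might take; the names below are illustrative only and are not an existing Spark API:

{code:java}
// Hypothetical SPI sketch, not Spark code: a cluster-manager plugin would implement
// these traits, ship them in a JAR on the driver's classpath, and be selected at
// runtime based on the master URL.
trait PluggableSchedulerBackend {
  def start(): Unit
  def stop(): Unit
  def requestTotalExecutors(total: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
}

trait PluggableClusterManager {
  // e.g. returns true for "mycluster://host:port" style master URLs
  def canCreate(masterURL: String): Boolean
  def createSchedulerBackend(masterURL: String, conf: Map[String, String]): PluggableSchedulerBackend
}
{code}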



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27941) Serverless Spark in the Cloud

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-27941:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Serverless Spark in the Cloud
> -
>
> Key: SPARK-27941
> URL: https://issues.apache.org/jira/browse/SPARK-27941
> Project: Spark
>  Issue Type: New Feature
>  Components: Build, Deploy, Scheduler, Security, Shuffle, Spark Core
>Affects Versions: 3.1.0
>Reporter: Shuheng Dai
>Priority: Major
>
> Public cloud providers have started offering serverless container services. 
> For example, AWS offers Fargate [https://aws.amazon.com/fargate/]
> This opens up the possibility to run Spark workloads in a serverless manner 
> and remove the need to provision, maintain and manage a cluster. POC: 
> [https://github.com/mu5358271/spark-on-fargate]
> While it might not make sense for Spark to favor any particular cloud 
> provider or to support a large number of cloud providers natively, it would 
> make sense to make some of the internal Spark components more pluggable and 
> cloud friendly so that it is easier for various cloud providers to integrate. 
> For example, 
>  * authentication: IO and network encryption requires authentication via 
> securely sharing a secret, and the implementation of this is currently tied 
> to the cluster manager: yarn uses hadoop ugi, kubernetes uses a shared file 
> mounted on all pods. These can be decoupled so it is possible to swap in an 
> implementation backed by the public cloud. In the POC, this is implemented by 
> passing around an AWS KMS encrypted secret and decrypting the secret at each 
> executor, which delegates authentication and authorization to the cloud.
>  * deployment & scheduler: adding a new cluster manager and scheduler backend 
> requires changing a number of places in the Spark core package and rebuilding 
> the entire project. Having a pluggable scheduler per 
> https://issues.apache.org/jira/browse/SPARK-19700 would make it easier to add 
> different scheduler backends backed by different cloud providers.
>  * client-cluster communication: I am not very familiar with the network part 
> of the code base so I might be wrong on this. My understanding is that the 
> code base assumes that the client and the cluster are on the same network and 
> the nodes communicate with each other via hostname/ip. For security best 
> practice, it is advised to run the executors in a private protected network, 
> which may be separate from the client machine's network. Since we are 
> serverless, that means the client need to first launch the driver into the 
> private network, and the driver in turn start the executors, potentially 
> doubling job initialization time. This can be solved by dropping complete 
> serverlessness and having a persistent host in the private network, or (I do 
> not have a POC, so I am not sure if this actually works) by implementing 
> client-cluster communication via message queues in the cloud to get around 
> the network separation.
>  * shuffle storage and retrieval: external shuffle in yarn relies on the 
> existence of a persistent cluster that continues to serve shuffle files 
> beyond the lifecycle of the executors. This assumption no longer holds in a 
> serverless cluster with only transient containers. Pluggable remote shuffle 
> storage per https://issues.apache.org/jira/browse/SPARK-25299 would make it 
> easier to introduce new cloud-backed shuffle.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28022) k8s pod affinity to achieve cloud native friendly autoscaling

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28022.
---
Resolution: Invalid

From the Spark side, the suggested idea looks invalid to me. You may achieve it 
from the K8s scheduler.

BTW, for dynamic allocation, SPARK-20628 might be the better solution we have. 
It's already merged.

> k8s pod affinity to achieve cloud native friendly autoscaling 
> --
>
> Key: SPARK-28022
> URL: https://issues.apache.org/jira/browse/SPARK-28022
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Henry Yu
>Priority: Major
>
> Hi, in order to achieve cloud-native friendly autoscaling, I propose to add 
> a pod affinity feature.
> Traditionally, when we use Spark on a fixed-size YARN cluster, it makes sense to 
> spread containers across every node.
> With cloud-native resource management, we want to release a node when we don't 
> need it any more.
> The pod affinity feature aims to place all pods of a given application on some 
> nodes instead of all nodes.
> By the way, using a pod template is not a good choice; adding the application id 
> to the pod affinity term at submit time is more robust.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31165) Multiple wrong references in Dockerfile for k8s

2020-03-17 Thread Nikolay Dimolarov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nikolay Dimolarov updated SPARK-31165:
--
Affects Version/s: 2.4.5

> Multiple wrong references in Dockerfile for k8s 
> 
>
> Key: SPARK-31165
> URL: https://issues.apache.org/jira/browse/SPARK-31165
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Nikolay Dimolarov
>Priority: Minor
>
> I am currently trying to follow the k8s instructions for Spark: 
> [https://spark.apache.org/docs/latest/running-on-kubernetes.html] and when I 
> clone apache/spark on GitHub on the master branch I saw multiple wrong folder 
> references after trying to build my Docker image:
>  
> *Issue 1: The comments in the Dockerfile reference the wrong folder for the 
> Dockerfile:*
> {code:java}
> # If this docker file is being used in the context of building your images 
> from a Spark # distribution, the docker build command should be invoked from 
> the top level directory # of the Spark distribution. E.g.: # docker build -t 
> spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .{code}
> Well that docker build command simply won't run. I only got the following to 
> run:
> {code:java}
> docker build -t spark:latest -f 
> resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile . 
> {code}
> which is the actual path to the Dockerfile.
>  
> *Issue 2: jars folder does not exist*
> After I read the tutorial I of course build spark first as per the 
> instructions with:
> {code:java}
> ./build/mvn -Pkubernetes -DskipTests clean package{code}
> Nonetheless, in the Dockerfile I get this error when building:
> {code:java}
> Step 8/18 : COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
> COPY failed: stat 
> /var/lib/docker/tmp/docker-builder638219776/kubernetes/dockerfiles/spark/entrypoint.sh:
>  no such file or directory{code}
>  for which I may have found a similar issue here: 
> [https://stackoverflow.com/questions/52451538/spark-for-kubernetes-test-on-mac]
> I am new to Spark but I assume that this jars folder - if the build step 
> would actually make it and I ran the maven build of the master branch 
> successfully with the command I mentioned above - would exist in the root 
> folder of the project.
>  
> *Issue 3: missing entrypoint.sh and decom.sh due to wrong reference*
> While Issue 2 remains unresolved as I can't wrap my head around the missing 
> jars folder (bin and sbin got copied successfully after I made a dummy jars 
> folder) I then got stuck on these 2 steps:
> {code:java}
> COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/ COPY 
> kubernetes/dockerfiles/spark/decom.sh /opt/{code}
>  
>  with:
>   
> {code:java}
> Step 8/18 : COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
> COPY failed: stat 
> /var/lib/docker/tmp/docker-builder638219776/kubernetes/dockerfiles/spark/entrypoint.sh:
>  no such file or directory{code}
>  
>  which makes sense since the path should actually be:
>   
>  resource-managers/kubernetes/docker/src/main/dockerfiles/spark/entrypoint.sh
>  resource-managers/kubernetes/docker/src/main/dockerfiles/spark/decom.sh
>   
>  *Remark*
>   
>  I only created one issue since this seems like somebody cleaned up the repo 
> and forgot to change these. Am I missing something here? If I am, I apologise 
> in advance since I am new to the Spark project. I also saw that some of these 
> references were handled through vars in previous branches: 
> [https://github.com/apache/spark/blob/branch-2.4/resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile]
>  (e.g. 2.4) but that also does not run out of the box.
>   
>  I am also really not sure about the affected versions since that was not 
> transparent enough for me on GH - feel free to edit that field :) 
>   
>  I can also create a PR and change these but I need help with Issue 2 and the 
> jar files since I am not sure what the correct path for that one is. Would 
> love some help on this :) 
>   
>  Thanks in advance!
>   
>   
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28022) k8s pod affinity to achieve cloud native friendly autoscaling

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28022:
--
Shepherd:   (was: Dongjoon Hyun)

> k8s pod affinity to achieve cloud native friendly autoscaling 
> --
>
> Key: SPARK-28022
> URL: https://issues.apache.org/jira/browse/SPARK-28022
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Henry Yu
>Priority: Major
>
> Hi, in order to achieve cloud-native friendly autoscaling, I propose to add 
> a pod affinity feature.
> Traditionally, when we use Spark on a fixed-size YARN cluster, it makes sense to 
> spread containers across every node.
> With cloud-native resource management, we want to release a node when we don't 
> need it any more.
> The pod affinity feature aims to place all pods of a given application on some 
> nodes instead of all nodes.
> By the way, using a pod template is not a good choice; adding the application id 
> to the pod affinity term at submit time is more robust.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28050) DataFrameWriter support insertInto a specific table partition

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28050:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> DataFrameWriter support insertInto a specific table partition
> -
>
> Key: SPARK-28050
> URL: https://issues.apache.org/jira/browse/SPARK-28050
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> {code:java}
> // Some comments here
> val ptTableName = "mc_test_pt_table"
> sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY 
> (pt1 STRING, pt2 STRING)")
> val df = spark.sparkContext.parallelize(0 to 99, 2)
>   .map(f =>
> {
>   (s"name-$f", f)
> })
>   .toDF("name", "num")
> // if i want to insert df into a specific partition
> // say pt1='2018',pt2='0601' current api does not supported
> // only with following work around
> df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
> sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') 
> select * from ${ptTableName}_tmp_view")
> {code}
> Propose to have another API in DataframeWriter that can do somethink like:
> {code:java}
> df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
> {code}
> We have a lot of scenarios like this in our production environment; providing an 
> API like this would make things much less painful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29262) DataFrameWriter insertIntoPartition function

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060747#comment-17060747
 ] 

Dongjoon Hyun commented on SPARK-29262:
---

I close this issue as a `Duplicate` of SPARK-28050. Please track this issue 
there.

> DataFrameWriter insertIntoPartition function
> 
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: feiwang
>Priority: Minor
>
> InsertIntoPartition is a useful function.
> For SQL statements, the relevant syntax is:
> {code:java}
> insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
> {code}
> In the example above, I specify all the partition key values, so it must be a 
> static partition overwrite, regardless of whether dynamic partition overwrite is 
> enabled.
> If dynamic partition overwrite is enabled, the SQL below will only overwrite the 
> relevant partitions, not the whole table.
> If dynamic partition overwrite is disabled, it will overwrite the whole table.
> {code:java}
> insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
> {code}
> As of now, DataFrame does not support overwriting a specific partition.
> This means that, for a partitioned table, if we insert overwrite using a 
> DataFrame with dynamic partition overwrite disabled, it will always overwrite 
> the whole table.
> So, we should support insertIntoPartition for DataFrameWriter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-29262) DataFrameWriter insertIntoPartition function

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-29262.
-

> DataFrameWriter insertIntoPartition function
> 
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: feiwang
>Priority: Minor
>
> InsertIntoPartition is a useful function.
> For SQL statements, the relevant syntax is:
> {code:java}
> insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
> {code}
> In the example above, I specify all the partition key values, so it must be a 
> static partition overwrite, regardless of whether dynamic partition overwrite is 
> enabled.
> If dynamic partition overwrite is enabled, the SQL below will only overwrite the 
> relevant partitions, not the whole table.
> If dynamic partition overwrite is disabled, it will overwrite the whole table.
> {code:java}
> insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
> {code}
> As of now, DataFrame does not support overwriting a specific partition.
> This means that, for a partitioned table, if we insert overwrite using a 
> DataFrame with dynamic partition overwrite disabled, it will always overwrite 
> the whole table.
> So, we should support insertIntoPartition for DataFrameWriter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29262) DataFrameWriter insertIntoPartition function

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29262.
---
Resolution: Duplicate

> DataFrameWriter insertIntoPartition function
> 
>
> Key: SPARK-29262
> URL: https://issues.apache.org/jira/browse/SPARK-29262
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: feiwang
>Priority: Minor
>
> InsertIntoPartition is a useful function.
> For SQL statements, the relevant syntax is:
> {code:java}
> insert overwrite table tbl_a partition(p1=v1,p2=v2,...,pn=vn) select ...
> {code}
> In the example above, I specify all the partition key values, so it must be a 
> static partition overwrite, regardless of whether dynamic partition overwrite is 
> enabled.
> If dynamic partition overwrite is enabled, the SQL below will only overwrite the 
> relevant partitions, not the whole table.
> If dynamic partition overwrite is disabled, it will overwrite the whole table.
> {code:java}
> insert overwrite table tbl_a partition(p1,p2,...,pn) select ...
> {code}
> As of now, DataFrame does not support overwriting a specific partition.
> This means that, for a partitioned table, if we insert overwrite using a 
> DataFrame with dynamic partition overwrite disabled, it will always overwrite 
> the whole table.
> So, we should support insertIntoPartition for DataFrameWriter.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28254) Use More Generic Netty Interfaces to Support Fabric Network in Spark

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28254:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Use More Generic Netty Interfaces to Support Fabric Network in Spark
> 
>
> Key: SPARK-28254
> URL: https://issues.apache.org/jira/browse/SPARK-28254
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: jiafu zhang
>Priority: Minor
>
> Spark assumes all its networks are socket-based, though Netty has a wider 
> interface beyond sockets, like Channel over SocketChannel. Socket (the TCP/IP 
> protocol) has the advantage of being widely supported. Almost all types of 
> network, including fabric networks, support sockets. However, it's not as 
> efficient as other protocols designed for fabrics, like Intel OPA, which has 
> fewer protocol layers and lower latency than sockets, not to mention more 
> advanced features like flow control. Thus, from a performance point of view, 
> it's better to have Spark support other types of network protocols too. (We are 
> also proposing that Netty support more fabric semantics, like RMA (Remote Memory 
> Access).) For fabric networks, we can use OFI (OpenFabrics Interfaces) and its 
> Libfabric library to get a unified API for fabric networks from different 
> vendors. This will greatly reduce our effort. 
> To make it possible, we need to use more generic Netty interfaces in several 
> places in the network-common module. For example, use Channel instead of 
> SocketChannel in  TransportClientFactory and TransportContext class.
> Besides, we need to have more flexible options for Bootstrap and IOMode 
> class. For example in createClient method in TransportClientFactory, user can 
> pass in more options to the bootstrap instance. Currently, the options are 
> fixed for TCP/IP. 
> And in IOMode class,  we can have one more option, "OFI" for initializing OFI 
> channels based on Libfabric. Then in NettyUtils class, we can have OFI event 
> loop group, client channel class and server channel class for "OFI" IOMode.
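
An illustrative sketch (not Spark code) of what selecting transport classes by a mode flag could look like once the APIs are written against the generic {{Channel}} type; the OFI branches are placeholders:

{code:java}
import io.netty.channel.{Channel, EventLoopGroup}
import io.netty.channel.nio.NioEventLoopGroup
import io.netty.channel.socket.nio.NioSocketChannel

sealed trait IoMode
case object NioMode extends IoMode
case object OfiMode extends IoMode // hypothetical new mode backed by Libfabric

// The factories return the generic Channel/EventLoopGroup types, so callers such as
// a client factory would no longer be tied to SocketChannel.
def clientChannelClass(mode: IoMode): Class[_ <: Channel] = mode match {
  case NioMode => classOf[NioSocketChannel]
  case OfiMode => throw new UnsupportedOperationException("OFI channel not implemented in this sketch")
}

def eventLoopGroup(mode: IoMode, threads: Int): EventLoopGroup = mode match {
  case NioMode => new NioEventLoopGroup(threads)
  case OfiMode => throw new UnsupportedOperationException("OFI event loop not implemented in this sketch")
}
{code}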



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28415) Add messageHandler to Kafka 10 direct stream API

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28415.
---
Resolution: Invalid

According to the decision on https://github.com/apache/spark/pull/27022, I 
close this JIRA issue as `Invalid`. Please feel free to reopen this if there is 
an update.

> Add messageHandler to Kafka 10 direct stream API
> 
>
> Key: SPARK-28415
> URL: https://issues.apache.org/jira/browse/SPARK-28415
> Project: Spark
>  Issue Type: New Feature
>  Components: DStreams
>Affects Versions: 3.0.0
>Reporter: Michael Spector
>Priority: Major
>
> The lack of a messageHandler parameter to KafkaUtils.createDirectStream(...) in 
> the new Kafka API is what prevents us from upgrading our processes to use it, and 
> here's why:
>  # messageHandler() allowed parsing / filtering / projecting huge JSON files 
> at an early stage (only a small subset of JSON fields is required for a 
> process); without this, the current cluster configuration doesn't keep up with 
> the traffic.
>  # Transforming Kafka events right after the stream is created prevents using 
> the HasOffsetRanges interface later. This means that the whole message must be 
> propagated to the end of the pipeline, which is very inefficient.
>  
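
For context, a sketch of the usual workaround with the 0.10 API, following the pattern from the Kafka integration guide: offsets are captured on the raw stream inside foreachRDD before any projection, which is exactly the ordering constraint described in point 2 (broker address, topic, and the projection itself are illustrative):

{code:java}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val ssc = new StreamingContext(
  new SparkConf().setMaster("local[2]").setAppName("kafka-010-demo"), Seconds(10))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "demo")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

stream.foreachRDD { rdd =>
  // Offsets must be read from the untransformed RDD, before any map/filter.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val projected = rdd.map(_.value().take(100)) // stand-in for early parsing / projection
  projected.count()                            // ... real processing would go here ...
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

ssc.start()
ssc.awaitTermination()
{code}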



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28727) Request for partial least square (PLS) regression model

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28727.
---
Resolution: Won't Do

According to the current status and the above advice, I close this issue. 
Sorry, [~nikunj.m].

> Request for partial least square (PLS) regression model
> ---
>
> Key: SPARK-28727
> URL: https://issues.apache.org/jira/browse/SPARK-28727
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 3.0.0
>Reporter: Nikunj
>Priority: Major
>
> Hi.
> Is there any development going on with regards to a PLS model? Or is there a 
> plan for it in the near future? The application I am developing needs a PLS 
> model as it is mandatory in that particular industry. I am using sparklyr, 
> and have started a bit of the implementation, but was wondering if something 
> is already in the pipeline.
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28727) Request for partial least square (PLS) regression model

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28727:
--
Environment: (was: I am using Windows 10, Spark v2.3.2)

> Request for partial least square (PLS) regression model
> ---
>
> Key: SPARK-28727
> URL: https://issues.apache.org/jira/browse/SPARK-28727
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 3.0.0
>Reporter: Nikunj
>Priority: Major
>
> Hi.
> Is there any development going on with regards to a PLS model? Or is there a 
> plan for it in the near future? The application I am developing needs a PLS 
> model as it is mandatory in that particular industry. I am using sparklyr, 
> and have started a bit of the implementation, but was wondering if something 
> is already in the pipeline.
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28727) Request for partial least square (PLS) regression model

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28727:
--
Labels:   (was: PLS least partial regression square)

> Request for partial least square (PLS) regression model
> ---
>
> Key: SPARK-28727
> URL: https://issues.apache.org/jira/browse/SPARK-28727
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Affects Versions: 3.0.0
> Environment: I am using Windows 10, Spark v2.3.2
>Reporter: Nikunj
>Priority: Major
>
> Hi.
> Is there any development going on with regards to a PLS model? Or is there a 
> plan for it in the near future? The application I am developing needs a PLS 
> model as it is mandatory in that particular industry. I am using sparklyr, 
> and have started a bit of the implementation, but was wondering if something 
> is already in the pipeline.
> Thanks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28870) Snapshot event log files to support incremental reading

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28870:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Snapshot event log files to support incremental reading
> ---
>
> Key: SPARK-28870
> URL: https://issues.apache.org/jira/browse/SPARK-28870
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> This issue tracks the effort on snapshotting the current status of 
> KVStore/AppStatusListener and enable incremental reading to speed up 
> replaying event logs.
> This issue will be on top of SPARK-29111 and SPARK-29261, as SPARK-29111 will 
> add the ability to snapshot/restore from/to KVStore and SPARK-29261 will add 
> the ability to snapshot/restore of state of (SQL)AppStatusListeners.
> Note that both SPARK-29261 and SPARK-29111 will not be guaranteeing any 
> compatibility, and this issue should deal with the situation: snapshot 
> version is not compatible with SHS.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28934) Add `spark.sql.compatiblity.mode`

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060739#comment-17060739
 ] 

Dongjoon Hyun commented on SPARK-28934:
---

Shall we close this JIRA since this is superseded by SPARK-28989 and 
SPARK-28997?

> Add `spark.sql.compatiblity.mode`
> -
>
> Key: SPARK-28934
> URL: https://issues.apache.org/jira/browse/SPARK-28934
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> This issue aims to add `spark.sql.compatiblity.mode` whose values are `spark` 
> or `pgSQL` case-insensitively to control PostgreSQL compatibility features.
>  
> Apache Spark 3.0.0 can start with `spark.sql.parser.ansi.enabled=false` and 
> `spark.sql.compatiblity.mode=spark`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28948) Support passing all Table metadata in TableProvider

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-28948:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Support passing all Table metadata in TableProvider
> ---
>
> Key: SPARK-28948
> URL: https://issues.apache.org/jira/browse/SPARK-28948
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28986) from_json cannot handle data with the same name but different case in the json string.

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28986.
---
Resolution: Invalid

Hi, [~iduanyingjie]. JSON key name should be case sensitive.

{code}
   When the names within an object are not
   unique, the behavior of software that receives such an object is
   unpredictable.  Many implementations report the last name/value pair
   only.  Other implementations report an error or fail to parse the
   object, and some implementations report all of the name/value pairs,
   including duplicates.
{code}

> from_json cannot handle data with the same name but different case in the 
> json string.
> --
>
> Key: SPARK-28986
> URL: https://issues.apache.org/jira/browse/SPARK-28986
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: iduanyingjie
>Priority: Major
>
> In Spark SQL, when using from_json to parse the JSON string in a field, the 
> schema definition is case-sensitive (such as [a String] in this example), so the 
> row with id 2 is not parsed. I also can't define an additional column [A String] 
> in the schema, because schema field names in a DataFrame are not case-sensitive; 
> this feels a bit contradictory.
>  
> Code:
> {code:java}
> import org.apache.spark.sql.SparkSession
> object JsonCaseInsensitive {  
>   case class User(id: String, fields: String)
>   val users = Seq(User("1", "{\"a\": \"b\"}"), User("2", "{\"A\": \"B\"}"))
>   def main(args: Array[String]): Unit = {
> val spark = 
> SparkSession.builder().master("local").appName("JsonCaseInsensitive").getOrCreate()
> import spark.implicits._
> spark.createDataset(users)
>  .selectExpr("id", "from_json(fields, 'a String')")
>  .show()
>   }
> }
> {code}
>  Output:
> {code:java}
> +---+-+
> | id|jsontostructs(fields)|
> +---+-+
> |  1|  [b]|
> |  2|   []|
> +---+-+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29031) Materialized column to accelerate queries

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29031:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Materialized column to accelerate queries
> -
>
> Key: SPARK-29031
> URL: https://issues.apache.org/jira/browse/SPARK-29031
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Jason Guo
>Priority: Major
>  Labels: SPIP
>
> Goals
>  * Add new SQL grammar for materialized columns
>  * Implicitly rewrite SQL queries on columns of complex types if there is 
> a materialized column for the expression
>  * If the data type of the materialized column is an atomic type, enable 
> vectorized read and filter pushdown to improve performance even though the 
> original column is of a complex type
> Example
> Create a normal table
> {quote}CREATE TABLE x (
> name STRING,
> age INT,
> params STRING,
> event MAP
> ) USING parquet;
> {quote}
>  
> Add materialized columns to an existing table
> {quote}ALTER TABLE x ADD COLUMNS (
> new_age INT MATERIALIZED age + 1,
> city STRING MATERIALIZED get_json_object(params, '$.city'),
> label STRING MATERIALIZED event['label']
> );
> {quote}
>  
> When a query like the one below is issued
> {quote}SELECT name, age+1, get_json_object(params, '$.city'), event['label']
> FROM x
> WHERE event['label'] = 'newuser';
> {quote}
> It's equivalent to
> {quote}SELECT name, new_age, city, label
> FROM x
> WHERE label = 'newuser';
> {quote}
>  
> The query performance improves dramatically because
>  # The new query (after rewriting) reads the new column city (of string type) 
> instead of the whole params map; much less data needs to be read
>  # Vectorized read can be used in the new query but not in the old one, because 
> vectorized read can only be enabled when all required columns are of atomic 
> types
>  # Filters can be pushed down. Only filters on atomic columns can be pushed 
> down; the original filter event['label'] = 'newuser' is on a complex column, so 
> it cannot be pushed down
>  # The new query does not need to parse JSON any more; JSON parsing is a 
> CPU-intensive operation that impacts performance dramatically
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29038) SPIP: Support Spark Materialized View

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29038:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible and can be configured freely according to 
> specific usage scenarios. The Materialization Manager 
> automatically updates the cached data according to changes in the detail source 
> tables, simplifying the user's work. When a user submits a query, the Spark 
> optimizer rewrites the execution plan based on the available materialized views 
> to determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29059) [SPIP] Support for Hive Materialized Views in Spark SQL.

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29059:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> [SPIP] Support for Hive Materialized Views in Spark SQL.
> 
>
> Key: SPARK-29059
> URL: https://issues.apache.org/jira/browse/SPARK-29059
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Amogh Margoor
>Priority: Minor
>
> Materialized view was introduced in Apache Hive 3.0.0. Currently, Spark 
> Catalyst does not optimize queries against Hive tables using Materialized 
> View the way Apache Calcite does it for Hive. This Jira is to add support for 
> the same.
> We have developed it in our internal trunk and would like to open source it. 
> It would consist of 3 major parts:
>  # Reading MV related Hive Metadata
>  # Implication Engine which would figure out if an expression exp1 implies 
> another expression exp2 i.e., if exp1 => exp2 is a tautology. This is similar 
> to RexImplication checker in Apache Calcite.
>  # Catalyst rule to replace tables by their materialized views using the 
> Implication Engine. For example, if MV 'mv' has been created in Hive using the 
> query 'select * from foo where x > 10 && x < 110', then the query 'select * 
> from foo where x > 70 and x < 100' will be transformed into 'select * from mv 
> where x > 70 and x < 100'
> Note that the Implication Engine and the Catalyst rule are generic and can be 
> used even when Spark decides to have its own materialized views.
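
A toy illustration (not taken from the linked implementation) of the implication check for the simple numeric-range case described above:

{code:java}
// Open interval lo < x < hi, the only predicate shape this toy example handles.
final case class Interval(lo: Long, hi: Long)

// Does the query's predicate imply the MV's predicate? For ranges this is just containment.
def implies(query: Interval, mv: Interval): Boolean =
  query.lo >= mv.lo && query.hi <= mv.hi

// The example from the description: the MV covers 10 < x < 110 and the query asks for
// 70 < x < 100, so the query can be rewritten to read from the MV.
assert(implies(Interval(70, 100), Interval(10, 110)))
{code}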



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29059) [SPIP] Support for Hive Materialized Views in Spark SQL.

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29059.
---
Resolution: Later

This is closed due to inactivity. Please feel free to reopen this after you 
update your PR, [~amargoor].

> [SPIP] Support for Hive Materialized Views in Spark SQL.
> 
>
> Key: SPARK-29059
> URL: https://issues.apache.org/jira/browse/SPARK-29059
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Amogh Margoor
>Priority: Minor
>
> Materialized view was introduced in Apache Hive 3.0.0. Currently, Spark 
> Catalyst does not optimize queries against Hive tables using Materialized 
> View the way Apache Calcite does it for Hive. This Jira is to add support for 
> the same.
> We have developed it in our internal trunk and would like to open source it. 
> It would consist of 3 major parts:
>  # Reading MV related Hive Metadata
>  # Implication Engine which would figure out if an expression exp1 implies 
> another expression exp2 i.e., if exp1 => exp2 is a tautology. This is similar 
> to RexImplication checker in Apache Calcite.
>  # Catalyst rule to replace tables by their materialized views using the 
> Implication Engine. For example, if MV 'mv' has been created in Hive using the 
> query 'select * from foo where x > 10 && x < 110', then the query 'select * 
> from foo where x > 70 and x < 100' will be transformed into 'select * from mv 
> where x > 70 and x < 100'
> Note that the Implication Engine and the Catalyst rule are generic and can be 
> used even when Spark decides to have its own materialized views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29278) Implement CATALOG/NAMESPACE related SQL commands for Data Source V2

2020-03-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29278?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17060722#comment-17060722
 ] 

Dongjoon Hyun edited comment on SPARK-29278 at 3/17/20, 8:46 AM:
-

Hi, [~imback82]. 
Could you revise this issue because SPARK-29734 already adds `SHOW CURRENT 
NAMESPACE`?


was (Author: dongjoon):
Hi, [~imback82]. 
Could you revise this issue because SPARK-29734 already adds `SHOW CURRENT 
NAMESPACE`, doesn't it?

> Implement CATALOG/NAMESPACE related SQL commands for Data Source V2
> ---
>
> Key: SPARK-29278
> URL: https://issues.apache.org/jira/browse/SPARK-29278
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> Introduce the following SQL commands for Data Source V2
> {code:sql}
> CREATE NAMESPACE mycatalog.ns1.ns2
> SHOW CURRENT CATALOG
> SHOW CURRENT NAMESPACE
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29185) Add new SaveMode types for Spark SQL jdbc datasource

2020-03-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29185:
--
Affects Version/s: (was: 3.0.0)
   3.1.0

> Add new SaveMode types for Spark SQL jdbc datasource
> 
>
> Key: SPARK-29185
> URL: https://issues.apache.org/jira/browse/SPARK-29185
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Timothy Zhang
>Priority: Major
>
>  It is necessary to add new SaveMode for Delete, Update, and Upsert, such as:
>  * SaveMode.Delete
>  * SaveMode.Update
>  * SaveMode.Upsert
> This would let Spark SQL support legacy RDBMSs such as Oracle, DB2, and MySQL 
> much better. The code implementation of the current SaveMode.Append type is 
> quite flexible: all modes could share the same savePartition function and only 
> add new getStatement functions for Delete, Update, and Upsert with the SQL 
> statements DELETE FROM, UPDATE, and MERGE INTO respectively. We have an initial 
> implementation for them:
> {code:java}
> def getDeleteStatement(table: String, rddSchema: StructType, dialect: 
> JdbcDialect): String = {
> val columns = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name) + 
> "=?").mkString(" AND ")
> s"DELETE FROM ${table.toUpperCase} WHERE $columns"
>   }
>   def getUpdateStatement(table: String, rddSchema: StructType, priKeys: 
> Seq[String], dialect: JdbcDialect): String = {
> val fullCols = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name))
> val priCols = priKeys.map(dialect.quoteIdentifier(_))
> val columns = (fullCols diff priCols).map(_ + "=?").mkString(",")
> val cnditns = priCols.map(_ + "=?").mkString(" AND ")
> s"UPDATE ${table.toUpperCase} SET $columns WHERE $cnditns"
>   }
>   def getMergeStatement(table: String, rddSchema: StructType, priKeys: 
> Seq[String], dialect: JdbcDialect): String = {
> val fullCols = rddSchema.fields.map(x => dialect.quoteIdentifier(x.name))
> val priCols = priKeys.map(dialect.quoteIdentifier(_))
> val nrmCols = fullCols diff priCols
> val fullPart = fullCols.map(c => 
> s"${dialect.quoteIdentifier("SRC")}.$c").mkString(",")
> val priPart = priCols.map(c => 
> s"${dialect.quoteIdentifier("TGT")}.$c=${dialect.quoteIdentifier("SRC")}.$c").mkString("
>  AND ")
> val nrmPart = nrmCols.map(c => 
> s"$c=${dialect.quoteIdentifier("SRC")}.$c").mkString(",")
> val columns = fullCols.mkString(",")
> val placeholders = fullCols.map(_ => "?").mkString(",")
> s"MERGE INTO ${table.toUpperCase} AS ${dialect.quoteIdentifier("TGT")} " +
>   s"USING TABLE(VALUES($placeholders)) " +
>   s"AS ${dialect.quoteIdentifier("SRC")}($columns) " +
>   s"ON $priPart " +
>   s"WHEN NOT MATCHED THEN INSERT ($columns) VALUES ($fullPart) " +
>   s"WHEN MATCHED THEN UPDATE SET $nrmPart"
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


