[jira] [Created] (SPARK-30413) Avoid unnecessary WrappedArray roundtrip in GenericArrayData constructor

2020-01-02 Thread Josh Rosen (Jira)
Josh Rosen created SPARK-30413:
--

 Summary: Avoid unnecessary WrappedArray roundtrip in 
GenericArrayData constructor
 Key: SPARK-30413
 URL: https://issues.apache.org/jira/browse/SPARK-30413
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Josh Rosen
Assignee: Josh Rosen


GenericArrayData has a constructor which accepts a {{seqOrArray: Any}} 
parameter. This constructor was originally added for use in situations where we 
don't know the actual type at compile-time (e.g. when converting UDF outputs). 
It's also called (perhaps unintentionally) in code paths where we could 
plausibly and statically know that the type is {{Array[Any]}} (in which case we 
could simply call the primary constructor).

In the current version of this code there's an unnecessary performance penalty 
for going through this path when {{seqOrArray}} is already an {{Array[Any]}}: we 
convert the array into a WrappedArray and then call a method to unwrap it back 
into an array, resulting in several unnecessary method calls. See 
https://scastie.scala-lang.org/7jOHydbNTaGSU677FWA8nA for an example of where 
this can crop up.

With a small modification to this constructor's implementation, I think we can 
effectively remove this penalty.
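
A minimal standalone sketch of the kind of fast path described above (the class and 
names below are illustrative stand-ins, not the actual Spark patch): match on the 
runtime type so an {{Array[Any]}} bypasses the Seq conversion entirely.

{code:java}
// Illustrative stand-in for GenericArrayData's secondary constructor
// (assumed simplified class; not the real Spark implementation).
class ArrayHolder(val values: Array[Any]) {
  // Secondary constructor accepting Any, as GenericArrayData's does.
  def this(seqOrArray: Any) = this(seqOrArray match {
    case arr: Array[Any]                           => arr         // fast path: no WrappedArray roundtrip
    case seq: scala.collection.Seq[Any @unchecked] => seq.toArray // general case: convert once
    case other                                     => Array[Any](other)
  })
}

object ArrayHolderDemo {
  def main(args: Array[String]): Unit = {
    // Both call sites hit the secondary constructor; the Array[Any] one takes the fast path.
    val fromArray = new ArrayHolder(Array[Any](1, 2, 3): Any)
    val fromSeq   = new ArrayHolder(Seq(1, 2, 3): Any)
    println(fromArray.values.mkString(","))
    println(fromSeq.values.mkString(","))
  }
}
{code}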






[jira] [Updated] (SPARK-30214) A new framework to resolve v2 commands with a case of COMMENT ON syntax implementation

2020-01-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-30214:
-
Description: 
Currently, we have a v2 adapter for the v1 catalog (V2SessionCatalog), so all the 
table/namespace commands can be implemented via v2 APIs.

Usually, a command needs to know which catalog it needs to operate on, but 
different commands have different requirements about what to resolve. A few 
examples:

 - DROP NAMESPACE: only needs to know the name of the namespace.
 - DESC NAMESPACE: needs to look up the namespace and get metadata, but this is 
done during execution.
 - DROP TABLE: needs to do a lookup and make sure it's a table, not a (temp) view.
 - DESC TABLE: needs to look up the table and get metadata.

For namespaces, the analyzer only needs to find the catalog and the namespace 
name. The command can do the lookup during execution if needed.

For tables, most commands need the analyzer to do the lookup.

Note that tables and namespaces differ: DESC NAMESPACE testcat works and 
describes the root namespace under testcat, while DESC TABLE testcat fails if 
there is no table named testcat under the current catalog. This is because a 
namespace can be named [], but a table can't. A command should explicitly 
specify whether it operates on a namespace or a table.

In this Pull Request, we introduce a new framework to resolve v2 commands:

 1. The parser creates logical plans or commands with 
UnresolvedNamespace/UnresolvedTable/UnresolvedView/UnresolvedRelation. (CREATE 
TABLE still keeps Seq[String], as it doesn't need to look up relations.)
 2. The analyzer converts (a toy sketch of this unresolved-to-resolved pattern 
is shown after this list):
 - UnresolvedNamespace to ResolvedNamespace (contains the catalog and the 
namespace identifier)
 - UnresolvedTable to ResolvedTable (contains the catalog, the identifier and 
the Table)
 - UnresolvedView to ResolvedView (will be added later when we migrate view 
commands)
 - UnresolvedRelation to a relation.
 3. An extra analyzer rule matches commands with V1Table and converts them to 
the corresponding v1 commands. This will be added later when we migrate 
existing commands.
 4. The planner matches commands and converts them to the corresponding 
physical nodes.
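
To make the shape of this framework concrete, here is a heavily simplified toy 
sketch (these are not Spark's actual analyzer classes or rule API; everything 
below is illustrative only):

{code:java}
// Toy model of the unresolved -> resolved pattern described above.
sealed trait LogicalNode
case class UnresolvedTable(nameParts: Seq[String]) extends LogicalNode
case class ResolvedTable(catalog: String, ident: Seq[String]) extends LogicalNode
case class CommentOnTable(child: LogicalNode, comment: String) extends LogicalNode

object ResolveTables {
  // A toy analyzer "rule": resolve UnresolvedTable against the current catalog name.
  def apply(plan: LogicalNode, currentCatalog: String): LogicalNode = plan match {
    case CommentOnTable(UnresolvedTable(parts), c) =>
      CommentOnTable(ResolvedTable(currentCatalog, parts), c)
    case other => other
  }
}
// The planner (or a v1 fallback rule) can then match on ResolvedTable directly.
{code}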
We also introduce brand-new v2 commands, the COMMENT syntaxes, to illustrate 
how to work with the newly added framework.


{code:java}
COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ...
COMMENT ON TABLE ... IS ...
{code}

Details about the comment syntaxes:
In the new catalog v2 design, some properties become reserved, e.g. location 
and comment. We are going to disallow setting reserved properties directly via 
DBPROPERTIES or TBLPROPERTIES, to avoid conflicts with their related subclauses 
or dedicated commands.

This follows the practice of PostgreSQL and Presto:

https://www.postgresql.org/docs/12/sql-comment.html
https://prestosql.io/docs/current/sql/comment.html
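
For illustration, the new syntax can be exercised through the SQL API like this 
(the catalog, namespace, and table names are made up and assume a v2 catalog 
named "testcat" is configured):

{code:java}
// Illustrative usage only; "testcat" and the object names are assumptions.
spark.sql("COMMENT ON NAMESPACE testcat.ns IS 'namespace for test data'")
spark.sql("COMMENT ON TABLE testcat.ns.events IS 'raw event log'")
{code}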

  was:
Currently, we have a v2 adapter for v1 catalog (V2SessionCatalog), all the 
table/namespace commands can be implemented via v2 APIs.

Usually, a command needs to know which catalog it needs to operate, but 
different commands have different requirements about what to resolve. A few 
examples:

 - DROP NAMESPACE: only need to know the name of the namespace.
 - DESC NAMESPACE: need to lookup the namespace and get metadata, but is done 
during execution
 - DROP TABLE: need to do lookup and make sure it's a table not (temp) view.
 - DESC TABLE: need to lookup the table and get metadata.
For namespaces, the analyzer only needs to find the catalog and the namespace 
name. The command can do lookup during execution if needed.

For tables, mostly commands need the analyzer to do lookup.

Note that, table and namespace have a difference: DESC NAMESPACE testcat works 
and describes the root namespace under testcat, while DESC TABLE testcat fails 
if there is no table testcat under the current catalog. It's because namespaces 
can be named [], but tables can't. The commands should explicitly specify it 
needs to operate on namespace or table.

In this Pull Request, we introduce a new framework to resolve v2 commands:

parser creates logical plans or commands with 
UnresolvedNamespace/UnresolvedTable/UnresolvedView/UnresolvedRelation. (CREATE 
TABLE still keeps Seq[String], as it doesn't need to look up relations)
analyzer converts
 - 2.1 UnresolvedNamespace to ResolvedNamespace (contains catalog and namespace 
identifier)
 - 2.2 UnresolvedTable to ResolvedTable (contains catalog, identifier and Table)
 - 2.3 UnresolvedView to ResolvedView (will be added later when we migrate view 
commands)
 - 2.4 UnresolvedRelation to relation.
an extra analyzer rule to match commands with V1Table and converts them to 
corresponding v1 commands. This will be added later when we migrate existing 
commands
planner matches commands and converts them to the corresponding physical nodes.
We also introduce brand new v2 commands - the comment syntaxes to illustrate 
how to work with the newly added framework.



[jira] [Updated] (SPARK-30214) A new framework to resolve v2 commands with a case of COMMENT ON syntax implementation

2020-01-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-30214:
-
Description: 
Currently, we have a v2 adapter for the v1 catalog (V2SessionCatalog), so all the 
table/namespace commands can be implemented via v2 APIs.

Usually, a command needs to know which catalog it needs to operate on, but 
different commands have different requirements about what to resolve. A few 
examples:

 - DROP NAMESPACE: only needs to know the name of the namespace.
 - DESC NAMESPACE: needs to look up the namespace and get metadata, but this is 
done during execution.
 - DROP TABLE: needs to do a lookup and make sure it's a table, not a (temp) view.
 - DESC TABLE: needs to look up the table and get metadata.

For namespaces, the analyzer only needs to find the catalog and the namespace 
name. The command can do the lookup during execution if needed.

For tables, most commands need the analyzer to do the lookup.

Note that tables and namespaces differ: DESC NAMESPACE testcat works and 
describes the root namespace under testcat, while DESC TABLE testcat fails if 
there is no table named testcat under the current catalog. This is because a 
namespace can be named [], but a table can't. A command should explicitly 
specify whether it operates on a namespace or a table.

In this Pull Request, we introduce a new framework to resolve v2 commands:

 1. The parser creates logical plans or commands with 
UnresolvedNamespace/UnresolvedTable/UnresolvedView/UnresolvedRelation. (CREATE 
TABLE still keeps Seq[String], as it doesn't need to look up relations.)
 2. The analyzer converts
 - 2.1 UnresolvedNamespace to ResolvedNamespace (contains the catalog and the 
namespace identifier)
 - 2.2 UnresolvedTable to ResolvedTable (contains the catalog, the identifier 
and the Table)
 - 2.3 UnresolvedView to ResolvedView (will be added later when we migrate view 
commands)
 - 2.4 UnresolvedRelation to a relation.
 3. An extra analyzer rule matches commands with V1Table and converts them to 
the corresponding v1 commands. This will be added later when we migrate 
existing commands.
 4. The planner matches commands and converts them to the corresponding 
physical nodes.
We also introduce brand-new v2 commands, the COMMENT syntaxes, to illustrate 
how to work with the newly added framework.


{code:java}
COMMENT ON (DATABASE|SCHEMA|NAMESPACE) ... IS ...
COMMENT ON TABLE ... IS ...
{code}

Details about the comment syntaxes:
In the new catalog v2 design, some properties become reserved, e.g. location 
and comment. We are going to disallow setting reserved properties directly via 
DBPROPERTIES or TBLPROPERTIES, to avoid conflicts with their related subclauses 
or dedicated commands.

This follows the practice of PostgreSQL and Presto:

https://www.postgresql.org/docs/12/sql-comment.html
https://prestosql.io/docs/current/sql/comment.html

  was:
https://prestosql.io/docs/current/sql/comment.html
https://www.postgresql.org/docs/12/sql-comment.html

We are going to disable setting reserved properties by dbproperties or 
tblproperties directly, which needs a subclause in the create syntax or specific 
alter commands.


> A new framework to resolve v2 commands with a case of COMMENT ON syntax 
> implementation
> --
>
> Key: SPARK-30214
> URL: https://issues.apache.org/jira/browse/SPARK-30214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> Currently, we have a v2 adapter for the v1 catalog (V2SessionCatalog), so all 
> the table/namespace commands can be implemented via v2 APIs.
> Usually, a command needs to know which catalog it needs to operate on, but 
> different commands have different requirements about what to resolve. A few 
> examples:
>  - DROP NAMESPACE: only needs to know the name of the namespace.
>  - DESC NAMESPACE: needs to look up the namespace and get metadata, but this 
> is done during execution.
>  - DROP TABLE: needs to do a lookup and make sure it's a table, not a (temp) view.
>  - DESC TABLE: needs to look up the table and get metadata.
> For namespaces, the analyzer only needs to find the catalog and the namespace 
> name. The command can do the lookup during execution if needed.
> For tables, most commands need the analyzer to do the lookup.
> Note that tables and namespaces differ: DESC NAMESPACE testcat works and 
> describes the root namespace under testcat, while DESC TABLE testcat fails if 
> there is no table named testcat under the current catalog. This is because a 
> namespace can be named [], but a table can't. A command should explicitly 
> specify whether it operates on a namespace or a table.
> In this Pull Request, we introduce a new framework to resolve v2 commands:
> parser creates logical plans or commands with 
> 

[jira] [Updated] (SPARK-30214) A new framework to resolve v2 commands with a case of COMMENT ON syntax implementation

2020-01-02 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30214?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-30214:
-
Summary: A new framework to resolve v2 commands with a case of COMMENT ON 
syntax implementation  (was: Support COMMENT ON syntax)

> A new framework to resolve v2 commands with a case of COMMENT ON syntax 
> implementation
> --
>
> Key: SPARK-30214
> URL: https://issues.apache.org/jira/browse/SPARK-30214
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> https://prestosql.io/docs/current/sql/comment.html
> https://www.postgresql.org/docs/12/sql-comment.html
> We are going to disable setting reserved properties by dbproperties or 
> tblproperties directly, which needs a subclause in the create syntax or 
> specific alter commands.






[jira] [Resolved] (SPARK-30384) Needs to improve the Column name and tooltips for the Fair Scheduler Pool Table

2020-01-02 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-30384.

  Assignee: Ankit Raj Boudh
Resolution: Fixed

This issue is resolved in https://github.com/apache/spark/pull/27047

> Needs to improve the Column name and tooltips for the Fair Scheduler Pool 
> Table 
> 
>
> Key: SPARK-30384
> URL: https://issues.apache.org/jira/browse/SPARK-30384
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Assignee: Ankit Raj Boudh
>Priority: Minor
>
> There are different columns in the Fair Scheduler Pools table under the Stages tab.
> Issue 1: SchedulingMode should be separated into 'Scheduling Mode'.
> Issue 2: Minimum Share, Pool Weight and Scheduling Mode require meaningful 
> tooltips for the end user to understand.






[jira] [Updated] (SPARK-30365) When deploy mode is a client, why doesn't it support remote "spark.files" download?

2020-01-02 Thread wangzhun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wangzhun updated SPARK-30365:
-
Priority: Major  (was: Minor)

> When deploy mode is a client, why doesn't it support remote "spark.files" 
> download?
> ---
>
> Key: SPARK-30365
> URL: https://issues.apache.org/jira/browse/SPARK-30365
> Project: Spark
>  Issue Type: Question
>  Components: Spark Submit
>Affects Versions: 2.3.2
> Environment: {code:java}
>  ./bin/spark-submit \
> --master yarn  \
> --deploy-mode client \
> ..{code}
>Reporter: wangzhun
>Priority: Major
>
> {code:java}
> // In client mode, download remote files.
> var localPrimaryResource: String = null
> var localJars: String = null
> var localPyFiles: String = null
> if (deployMode == CLIENT) {
>   localPrimaryResource = Option(args.primaryResource).map {
> downloadFile(_, targetDir, sparkConf, hadoopConf, secMgr)
>   }.orNull
>   localJars = Option(args.jars).map {
> downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
>   }.orNull
>   localPyFiles = Option(args.pyFiles).map {
> downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
>   }.orNull
> }
> {code}
> The above Spark 2.3 SparkSubmit code does not download the files specified by 
> "spark.files".
> I think it is possible to download those remote files locally and add them to 
> the classpath.
> For example, this could support passing a remote hive-site.xml via --files.
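
A minimal sketch of the kind of change suggested above, mirroring how jars and 
pyFiles are localized in the quoted snippet (this is not the actual Spark patch; 
it assumes SparkSubmitArguments exposes the file list as args.files, and reuses 
the helpers from the snippet):

{code:java}
// Hypothetical addition alongside the existing client-mode downloads shown above:
// also localize the comma-separated "spark.files" list so it can be put on the
// driver classpath. downloadFileList, targetDir, sparkConf, hadoopConf and secMgr
// are the names from the quoted snippet and are assumed to be in scope.
var localFiles: String = null
if (deployMode == CLIENT) {
  localFiles = Option(args.files).map {
    downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
  }.orNull
}
{code}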






[jira] [Resolved] (SPARK-30412) Eliminate warnings in Java tests regarding to deprecated API

2020-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30412.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27081
[https://github.com/apache/spark/pull/27081]

> Eliminate warnings in Java tests regarding to deprecated API
> 
>
> Key: SPARK-30412
> URL: https://issues.apache.org/jira/browse/SPARK-30412
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Suppress warnings about deprecated Spark API in Java test suites:
> {code}
> /Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java
> Warning:Warning:line (32)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (91)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (100)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (109)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (118)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> {code}
> {code}
> /Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/Java8DatasetAggregatorSuite.java
> Warning:Warning:line (28)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (37)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (46)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (55)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (64)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> {code}
> {code}
> /Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
> Warning:Warning:line (478)java: 
> json(org.apache.spark.api.java.JavaRDD) in 
> org.apache.spark.sql.DataFrameReader has been deprecated
> {code}






[jira] [Assigned] (SPARK-30412) Eliminate warnings in Java tests regarding to deprecated API

2020-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-30412:


Assignee: Maxim Gekk

> Eliminate warnings in Java tests regarding to deprecated API
> 
>
> Key: SPARK-30412
> URL: https://issues.apache.org/jira/browse/SPARK-30412
> Project: Spark
>  Issue Type: Sub-task
>  Components: Java API, SQL
>Affects Versions: 2.4.4
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Suppress warnings about deprecated Spark API in Java test suites:
> {code}
> /Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java
> Warning:Warning:line (32)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (91)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (100)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (109)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (118)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> {code}
> {code}
> /Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/Java8DatasetAggregatorSuite.java
> Warning:Warning:line (28)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (37)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (46)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (55)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> Warning:Warning:line (64)java: 
> org.apache.spark.sql.expressions.javalang.typed in 
> org.apache.spark.sql.expressions.javalang has been deprecated
> {code}
> {code}
> /Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
> Warning:Warning:line (478)java: 
> json(org.apache.spark.api.java.JavaRDD) in 
> org.apache.spark.sql.DataFrameReader has been deprecated
> {code}






[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007215#comment-17007215
 ] 

Reynold Xin edited comment on SPARK-22231 at 1/3/20 4:18 AM:
-

[~fqaiser94] you convinced me with #2. It'd be very verbose if we only allowed 
DataFrame.withColumnRenamed to modify nested fields and added no new methods to 
Column.

#1 isn't really a problem because DataFrame.withColumnRenamed should be able to 
handle both top-level fields and struct fields.

 

Another question: can withField modify a nested field itself?

 


was (Author: rxin):
[~fqaiser94] you convinced me with #2. It'd be very verbose if we only allowed 
DataFrame.withColumnRenamed to modify nested fields and added no new methods to 
Column.

#1 isn't really a problem because DataFrame.withColumnRenamed should be able to 
handle both top-level fields and struct fields.

 

 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the context describing the setting where a model is to be used (e.g. 
> profiles, country, time, device, etc.). Here is a blog post describing the 
> details: [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranking of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions to them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and 
> then reconstruct the new dataframe as a workaround. This is extremely 
> inefficient because all the optimizations like predicate pushdown in SQL 
> cannot be performed, we cannot leverage the columnar format, and the 
> serialization and deserialization cost becomes huge even when we just want to 
> add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we missed any use cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn*, on *Column* to add or replace an existing 
> column of the same name in the nested list of structs. Here are two examples 
> demonstrating the API together with *mapItems*; the first one replaces the 
> existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007215#comment-17007215
 ] 

Reynold Xin commented on SPARK-22231:
-

[~fqaiser94] you convinced me with #2. It'd be very verbose if we only allowed 
DataFrame.withColumnRenamed to modify nested fields and added no new methods to 
Column.

#1 isn't really a problem because DataFrame.withColumnRenamed should be able to 
handle both top-level fields and struct fields.

 

 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the context describing the setting where a model is to be used (e.g. 
> profiles, country, time, device, etc.). Here is a blog post describing the 
> details: [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranking of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions to them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and 
> then reconstruct the new dataframe as a workaround. This is extremely 
> inefficient because all the optimizations like predicate pushdown in SQL 
> cannot be performed, we cannot leverage the columnar format, and the 
> serialization and deserialization cost becomes huge even when we just want to 
> add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we missed any use cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn*, on *Column* to add or replace an existing 
> column of the same name in the nested list of structs. Here are two examples 
> demonstrating the API together with *mapItems*; the first one replaces the 
> existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer 

[jira] [Updated] (SPARK-29800) Rewrite non-correlated subquery use ScalaSubquery to optimize perf

2020-01-02 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-29800:
--
Summary: Rewrite non-correlated subquery use ScalaSubquery to optimize perf 
 (was: Plan Exists 's subquery in PlanSubqueries)

> Rewrite non-correlated subquery use ScalaSubquery to optimize perf
> --
>
> Key: SPARK-29800
> URL: https://issues.apache.org/jira/browse/SPARK-29800
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Major
>







[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:28 AM:
---

I feel that the {{withField}} and {{withFieldRenamed}} methods make more sense 
to be on Column rather than Dataframe for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method on Column, you can add or replace deeply nested 
StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
    "a", $"a.b.a".withField(
      "b", lit(5))))).show(false)
+-----------------------------------+
|a                                  |
+-----------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+-----------------------------------+
{code}
You can see a fully reproducible example of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to Column, 
you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)
    ))))
  ))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column): Dataframe}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String): Dataframe}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column.
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping StructField if a reference to a StructField is passed to it. I'm not 
sure how challenging this would be to implement but presumably this should be 
possible. 


was (Author: fqaiser94):
I feel that the {{withField}} and {{withFieldRenamed}} methods make more sense 
to be on Column rather than Dataframe for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:14 AM:
---

I feel that the {{withField}} and {{withFieldRenamed}} methods make more sense 
to be on Column rather than Dataframe for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method I added to Column in my PR, you can add or replace deeply 
nested StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
    "a", $"a.b.a".withField(
      "b", lit(5))))).show(false)
+-----------------------------------+
|a                                  |
+-----------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+-----------------------------------+
{code}
You can see a fully reproducible example of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to Column, 
you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)
    ))))
  ))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column): Dataframe}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String): Dataframe}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column.
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping StructField if a reference to a StructField is passed to it. I'm not 
sure how challenging this would be to implement but presumably this should be 
possible. 


was (Author: fqaiser94):
I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:07 AM:
---

I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method I added to Column in my PR, you can add or replace deeply 
nested StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
    "a", $"a.b.a".withField(
      "b", lit(5))))).show(false)
+-----------------------------------+
|a                                  |
+-----------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+-----------------------------------+
{code}
You can see a fully reproducible example of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to Column, 
you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)
    ))))
  ))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column): Dataframe}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String): Dataframe}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column.
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping StructField if a reference to a StructField is passed to it. I'm not 
sure how challenging this would be to implement but presumably this should be 
possible. 


was (Author: fqaiser94):
I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:07 AM:
---

I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method I added to Column in my PR, you can add or replace deeply 
nested StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
    "a", $"a.b.a".withField(
      "b", lit(5))))).show(false)
+-----------------------------------+
|a                                  |
+-----------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+-----------------------------------+
{code}
You can see a fully reproducible example of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to Column, 
you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)
    ))))
  ))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column): Dataframe}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String): Dataframe}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column.
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping StructField if a reference to a StructField is passed to it. I'm not 
sure how challenging this would be to implement but presumably this should be 
possible. 


was (Author: fqaiser94):
I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on 

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 commented on SPARK-22231:
---

I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method I added to Column in my PR, you can add or replace deeply 
nested StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
    "a", $"a.b.a".withField(
      "b", lit(5))))).show(false)
+-----------------------------------+
|a                                  |
+-----------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+-----------------------------------+
{code}
You can see a fully reproducible example of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to Column, 
you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)
    ))))
  ))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column)}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String)}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column.
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping StructField if a reference to a StructField is passed to it. I'm not 
sure how challenging this would be to implement but presumably this should be 
possible. 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 

[jira] [Commented] (SPARK-27996) Spark UI redirect will be failed behind the https reverse proxy

2020-01-02 Thread Igor Shikin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007173#comment-17007173
 ] 

Igor Shikin commented on SPARK-27996:
-

Here is the PR for the History Server: 
[https://github.com/apache/spark/pull/27083]

> Spark UI redirect will be failed behind the https reverse proxy
> ---
>
> Key: SPARK-27996
> URL: https://issues.apache.org/jira/browse/SPARK-27996
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: Saisai Shao
>Priority: Minor
>
> When the Spark live/history UI is proxied behind a reverse proxy, the 
> redirect will return the wrong scheme. For example:
> If the reverse proxy is SSL enabled, the request from the client to the 
> reverse proxy is an HTTPS request, whereas if Spark's UI is not SSL enabled, 
> the request from the reverse proxy to the Spark UI is an HTTP request. Spark 
> itself treats all requests as HTTP requests, so the redirect URL starts with 
> "http" and the redirect fails on the client.
> Most reverse proxies add an additional header, "X-Forwarded-Proto", to tell 
> the backend server that the client request is an HTTPS request, so Spark 
> should leverage this header to return the correct URL.
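
A minimal sketch of the header handling described above (illustrative only, not 
the actual Spark fix; the object and method names are assumed):

{code:java}
import javax.servlet.http.HttpServletRequest

object ProxyScheme {
  // Honor the reverse proxy's X-Forwarded-Proto header so redirects keep the
  // scheme the client actually used; fall back to the request's own scheme.
  def effectiveScheme(request: HttpServletRequest): String =
    Option(request.getHeader("X-Forwarded-Proto")).getOrElse(request.getScheme)
}
{code}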






[jira] [Assigned] (SPARK-29568) Add flag to stop existing stream when new copy starts

2020-01-02 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-29568:
---

Assignee: Burak Yavuz

> Add flag to stop existing stream when new copy starts
> -
>
> Key: SPARK-29568
> URL: https://issues.apache.org/jira/browse/SPARK-29568
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
>
> In multi-tenant environments where you have multiple SparkSessions, you can 
> accidentally start multiple copies of the same stream (i.e. streams using the 
> same checkpoint location). This will cause all new instantiations of the new 
> stream to fail. However, sometimes you may want to turn off the old stream, 
> as the old stream may have turned into a zombie (you no longer have access to 
> the query handle or SparkSession).
> It would be nice to have a SQL flag that allows the stopping of the old 
> stream for such zombie cases.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29568) Add flag to stop existing stream when new copy starts

2020-01-02 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-29568.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

> Add flag to stop existing stream when new copy starts
> -
>
> Key: SPARK-29568
> URL: https://issues.apache.org/jira/browse/SPARK-29568
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Burak Yavuz
>Assignee: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> In multi-tenant environments where you have multiple SparkSessions, you can 
> accidentally start multiple copies of the same stream (i.e. streams using the 
> same checkpoint location). This will cause all new instantiations of the new 
> stream to fail. However, sometimes you may want to turn off the old stream, 
> as the old stream may have turned into a zombie (you no longer have access to 
> the query handle or SparkSession).
> It would be nice to have a SQL flag that allows the stopping of the old 
> stream for such zombie cases.
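
As a rough illustration of how such a flag could be used once available (the 
configuration key below is a guess for illustration only, not necessarily the 
name chosen by the fix):
{code:java}
// Opt in to stopping a zombie run that shares the same checkpoint location
// when a new copy of the stream starts. The key name is illustrative.
spark.conf.set("spark.sql.streaming.stopActiveRunOnRestart", "true")

val query = spark.readStream.format("rate").load()
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/checkpoints/demo")
  .start()
{code}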



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30269) Should use old partition stats to decide whether to update stats when analyzing partition

2020-01-02 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-30269:
-
Fix Version/s: 3.0.0

> Should use old partition stats to decide whether to update stats when 
> analyzing partition
> -
>
> Key: SPARK-30269
> URL: https://issues.apache.org/jira/browse/SPARK-30269
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.4, 2.4.4, 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Zhenhua Wang
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> It's an obvious bug: when analyzing partition stats, we currently compare the 
> newly computed stats against the old *table* stats (rather than that 
> partition's old stats) to decide whether to update them.
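
A sketch of the intended check (names and types are illustrative, not Spark's 
actual code): compare the newly computed stats for a partition against that 
partition's previous stats rather than the table-level stats.
{code:java}
case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

// Persist an update only when the partition's stats actually changed.
def shouldUpdateStats(oldPartitionStats: Option[Stats], newStats: Stats): Boolean =
  !oldPartitionStats.contains(newStats)
{code}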



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30373) Avoid unnecessary sort for ParquetUtils.splitFiles

2020-01-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-30373.
--
Resolution: Won't Fix

> Avoid unnecessary sort for ParquetUtils.splitFiles
> --
>
> Key: SPARK-30373
> URL: https://issues.apache.org/jira/browse/SPARK-30373
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0, 2.3.0, 2.4.0, 3.0.0
>Reporter: wuyi
>Priority: Major
>
> According to SPARK-11500, we could avoid sorting all files when schema merging 
> is disabled or mergeRespectSummaries is enabled.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30285) Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError

2020-01-02 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-30285:
--

Assignee: Wang Shuo

> Fix deadlock between LiveListenerBus#stop and 
> AsyncEventQueue#removeListenerOnError
> ---
>
> Key: SPARK-30285
> URL: https://issues.apache.org/jira/browse/SPARK-30285
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Wang Shuo
>Assignee: Wang Shuo
>Priority: Major
>
> There is a deadlock between LiveListenerBus#stop and 
> AsyncEventQueue#removeListenerOnError.
> We can reproduce it as follows:
>  # Post some events to LiveListenerBus
>  # Call LiveListenerBus#stop, which holds the synchronized lock of the bus while 
> waiting until all the events are processed by listeners, then removes all the 
> queues
>  # The event queue drains events by posting them to its listeners. If a 
> listener is interrupted, it calls AsyncEventQueue#removeListenerOnError, which 
> in turn calls bus.removeListener and tries to acquire the synchronized lock of 
> the bus, resulting in deadlock



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30285) Fix deadlock between LiveListenerBus#stop and AsyncEventQueue#removeListenerOnError

2020-01-02 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-30285.

Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 26924
[https://github.com/apache/spark/pull/26924]

> Fix deadlock between LiveListenerBus#stop and 
> AsyncEventQueue#removeListenerOnError
> ---
>
> Key: SPARK-30285
> URL: https://issues.apache.org/jira/browse/SPARK-30285
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
>Reporter: Wang Shuo
>Assignee: Wang Shuo
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> There is a deadlock between LiveListenerBus#stop and 
> AsyncEventQueue#removeListenerOnError.
> We can reproduce it as follows:
>  # Post some events to LiveListenerBus
>  # Call LiveListenerBus#stop, which holds the synchronized lock of the bus while 
> waiting until all the events are processed by listeners, then removes all the 
> queues
>  # The event queue drains events by posting them to its listeners. If a 
> listener is interrupted, it calls AsyncEventQueue#removeListenerOnError, which 
> in turn calls bus.removeListener and tries to acquire the synchronized lock of 
> the bus, resulting in deadlock
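
The lock-ordering problem can be illustrated with a small self-contained program 
(this is not Spark code, just the pattern described above; running it hangs 
forever, which is the point):
{code:java}
object DeadlockSketch extends App {
  val busLock = new Object   // stands in for the LiveListenerBus monitor

  // Stands in for the queue's dispatch thread calling removeListenerOnError:
  // it needs busLock, which the "stop" path below already holds.
  val dispatcher = new Thread(() => {
    Thread.sleep(100)
    busLock.synchronized { println("removed listener on error") }
  })
  dispatcher.start()

  // Stands in for LiveListenerBus#stop: hold the bus lock and wait for the
  // dispatcher to finish draining => neither side can make progress.
  busLock.synchronized { dispatcher.join() }
}
{code}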



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26173) Prior regularization for Logistic Regression

2020-01-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-26173.
--
Resolution: Won't Fix

> Prior regularization for Logistic Regression
> 
>
> Key: SPARK-26173
> URL: https://issues.apache.org/jira/browse/SPARK-26173
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 3.0.0
>Reporter: Facundo Bellosi
>Priority: Minor
> Attachments: Prior regularization.png
>
>
> This feature enables Maximum A Posteriori (MAP) optimization for Logistic 
> Regression based on a Gaussian prior. In practice, this is just implementing 
> a more general form of L2 regularization parameterized by a (multivariate) 
> mean and precisions (inverse of variance) vectors.
> Prior regularization is calculated through the following formula:
> !Prior regularization.png!
> where:
>  * λ: regularization parameter ({{regParam}})
>  * K: number of coefficients (weights vector length)
>  * w~i~ with prior Normal(μ~i~, β~i~^2^)
> _Reference: Bishop, Christopher M. (2006). Pattern Recognition and Machine 
> Learning (section 4.5). Berlin, Heidelberg: Springer-Verlag._
> h3. Existing implementations
> * Python: [bayes_logistic|https://pypi.org/project/bayes_logistic/]
> h2.  Implementation
>  * 2 new parameters added to {{LogisticRegression}}: {{priorMean}} and 
> {{priorPrecisions}}.
>  * 1 new class ({{PriorRegularization}}) implements the calculations of the 
> value and gradient of the prior regularization term.
>  * Prior regularization is enabled when both vectors are provided and 
> {{regParam}} > 0 and {{elasticNetParam}} < 1.
> h2. Tests
>  * {{DifferentiableRegularizationSuite}}
>  ** {{Prior regularization}}
>  * {{LogisticRegressionSuite}}
>  ** {{prior precisions should be required when prior mean is set}}
>  ** {{prior mean should be required when prior precisions is set}}
>  ** {{`regParam` should be positive when using prior regularization}}
>  ** {{`elasticNetParam` should be less than 1.0 when using prior 
> regularization}}
>  ** {{prior mean and precisions should have equal length}}
>  ** {{priors' length should match number of features}}
>  ** {{binary logistic regression with prior regularization equivalent to L2}}
>  ** {{binary logistic regression with prior regularization equivalent to L2 
> (bis)}}
>  ** {{binary logistic regression with prior regularization}}
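
For readers without the attachment, the penalty a Gaussian prior typically induces 
has the form below. This is a reconstruction consistent with the parameters listed 
above (λ, K, μ, β); the exact normalization in the attached image may differ, and 
whether β_i enters linearly or squared depends on the parameterization used there:
{code}
R(w) = \frac{\lambda}{2} \sum_{i=1}^{K} \beta_i \, (w_i - \mu_i)^2,
\qquad
\frac{\partial R}{\partial w_i} = \lambda \, \beta_i \, (w_i - \mu_i)
{code}
Setting μ_i = 0 and β_i = 1 for all i recovers plain L2 regularization, which matches 
the "equivalent to L2" test cases listed above.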



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27462) Spark hive can not choose some columns in target table flexibly, when running insert into.

2020-01-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-27462.
--
Resolution: Won't Fix

> Spark hive can not choose some columns in target table flexibly, when running 
> insert into.
> --
>
> Key: SPARK-27462
> URL: https://issues.apache.org/jira/browse/SPARK-27462
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> Spark SQL does not support choosing a subset of the target table's columns when 
> running
> {code:java}
> insert into tableA select ... from tableB;{code}
> This feature is supported by Hive, so I think the grammar should be 
> consistent with Hive.
> Hive supports the following forms of 'insert into':
> {code:java}
> insert into gja_test_spark select * from gja_test;
> insert into gja_test_spark(key, value, other) select key, value, other from 
> gja_test;
> insert into gja_test_spark(key, value) select value, other from gja_test;
> insert into gja_test_spark(key, other) select value, other from gja_test;
> insert into gja_test_spark(value, other) select value, other from 
> gja_test;{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread Henry Davidge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007131#comment-17007131
 ] 

Henry Davidge commented on SPARK-22231:
---

I agree that these methods make more sense on DataFrame with the caveat that 
they need sane behavior for arrays of structs as well to avoid the need for 
column-level APIs. Otherwise you have the same problem as today if you want to 
add / remove a field from an array of structs. I think ideally 
{{df.withColumn("array_of_structs.new_field", expr("array_of_structs.field_a + 
array_of_structs.field_b"))}} would be equivalent to 
{{transform(array_of_structs, el -> with_column("new_field", el.field_a + 
el.field_b))}} in a column-level API.

Btw, this comes up a lot with genomic data, so we added a version of 
{{withField}} to the Glow library that I would love to deprecate: 
https://github.com/projectglow/glow/blob/master/core/src/main/scala/io/projectglow/sql/optimizer/hlsOptimizerRules.scala#L32-L37
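
For comparison, this is what the same operation requires today without any 
field-level helpers, i.e. rebuilding each struct inside the array by hand 
(runnable on Spark 3.0, where {{transform}} is available in the Scala DSL; the 
column names are illustrative only):
{code:java}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 2)).toDF("field_a", "field_b")
  .select(array(struct($"field_a", $"field_b")).as("array_of_structs"))

// Re-list every existing field just to append one derived field.
val withNewField = df.withColumn("array_of_structs",
  transform($"array_of_structs", el =>
    struct(el("field_a"), el("field_b"), (el("field_a") + el("field_b")).as("new_field"))))
{code}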

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema looks like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be performed, we 
> cannot leverage the columnar format, and the serialization and deserialization 
> cost becomes huge even when we just want to add a new column at the nested 
> level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we miss any use-case.  
> The first API we added is *mapItems* on dataframe, which takes a function from 
> *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007116#comment-17007116
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/2/20 11:25 PM:


Hi folks, I can personally affirm that this would be a valuable feature to have 
in Spark. Looking around, it's clear to me that other people have a need for 
this feature as well e.g. SPARK-16483,  [Mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Applying-transformation-on-a-struct-inside-an-array-td18934.html],
 etc. 

A few things have changed since the last comments in this discussion. Support 
for higher-order functions has been 
[added|https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html]
 to the Apache Spark project and in particular, there are now {{transform}} and 
{{filter}} functions available for operating on ArrayType columns. These would 
be the equivalent of the {{mapItems}} and {{filterItems}} functions that were 
previously proposed in this ticket. To complete this ticket then, I think all 
that is needed is adding the {{withField}}, {{withFieldRenamed}}, and {{drop}} 
methods to the Column class. Looking through the discussion, I would summarize 
the signatures for these new methods as follows: 
 - {{def withField(fieldName: String, field: Column): Column}}
 Returns a new StructType column with field added/replaced based on name.

 - {{def drop(fieldNames: String*)}}
 Returns a new StructType column with field dropped.

 - {{def withFieldRenamed(existingName: String, newName: String): Column}}
 Returns a new StructType column with field renamed.

Since it didn't seem like anybody was actively working on this, I went ahead 
and created a pull request to add a {{withField}} method to the {{Column}} 
class that conforms with the specs discussed in this ticket. You can review the 
PR here: [https://github.com/apache/spark/pull/27066]

As this is my first PR to the Apache Spark project, I wanted to keep the PR 
small. However, I wouldn't mind writing the {{drop}} and {{withFieldRenamed}} 
methods as well in separate PRs once the current PR is accepted. 


was (Author: fqaiser94):
Hi folks, I can personally affirm that this would be a valuable feature to have 
in Spark. Looking around, its clear to me that other people have a need for 
this feature as well e.g. SPARK-16483,  [Mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Applying-transformation-on-a-struct-inside-an-array-td18934.html],
 etc. 

A few things have changed since the last comments in this discussion. Support 
for higher-order functions has been 
[added|https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html]
 to the Apache Spark project and in particular, there are now {{transform}} and 
{{filter}} functions available for operating on ArrayType columns. These would 
be the equivalent of the {{mapItems}} and {{filterItems}} functions that were 
previously proposed in this ticket. To complete this ticket then, I think all 
that is needed is adding the {{withField}}, {{withFieldRenamed}}, and {{drop}} 
methods to the Column class. Looking through the discussion, I can summarize 
the signatures for these new methods should be as follows: 
 - {{def withField(fieldName: String, field: Column): Column}}
 Returns a new StructType column with field added/replaced based on name.

 - {{def drop(colNames: String*)}}
 Returns a new StructType column with field dropped.

 - {{def withFieldRenamed(existingName: String, newName: String): Column}}
 Returns a new StructType column with field renamed.

Since it didn't seem like anybody was actively working on this, I went ahead 
and created a pull request to add a {{withField}} method to the {{Column}} 
class that conforms with the specs discussed in this ticket. You can review the 
PR here: [https://github.com/apache/spark/pull/27066]

As this is my first PR to the Apache Spark project, I wanted to keep the PR 
small. However, I wouldn't mind writing the {{drop}} and {{withFieldRenamed}} 
methods as well in separate PRs once the current PR is accepted. 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before 

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007117#comment-17007117
 ] 

Reynold Xin commented on SPARK-22231:
-

Makes sense. One question (I've asked about this before): should the 3 
functions be on DataFrame, or on Column? 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema looks like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be performed, we 
> cannot leverage the columnar format, and the serialization and deserialization 
> cost becomes huge even when we just want to add a new column at the nested 
> level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we miss any use-case.  
> The first API we added is *mapItems* on dataframe, which takes a function from 
> *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+++
> // |foo| bar|   items|
> // +---+++
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+++
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---++--+
> // |foo|bar |items |
> // 

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007116#comment-17007116
 ] 

fqaiser94 commented on SPARK-22231:
---

Hi folks, I can personally affirm that this would be a valuable feature to have 
in Spark. Looking around, it's clear to me that other people have a need for 
this feature as well e.g. SPARK-16483,  [Mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Applying-transformation-on-a-struct-inside-an-array-td18934.html],
 etc. 

A few things have changed since the last comments in this discussion. Support 
for higher-order functions has been 
[added|https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html]
 to the Apache Spark project and in particular, there are now {{transform}} and 
{{filter}} functions available for operating on ArrayType columns. These would 
be the equivalent of the {{mapItems}} and {{filterItems}} functions that were 
previously proposed in this ticket. To complete this ticket then, I think all 
that is needed is adding the {{withField}}, {{withFieldRenamed}}, and {{drop}} 
methods to the Column class. Looking through the discussion, I would summarize 
the signatures for these new methods as follows: 
 - {{def withField(fieldName: String, field: Column): Column}}
 Returns a new StructType column with field added/replaced based on name.

 - {{def drop(colNames: String*)}}
 Returns a new StructType column with field dropped.

 - {{def withFieldRenamed(existingName: String, newName: String): Column}}
 Returns a new StructType column with field renamed.

Since it didn't seem like anybody was actively working on this, I went ahead 
and created a pull request to add a {{withField}} method to the {{Column}} 
class that conforms with the specs discussed in this ticket. You can review the 
PR here: [https://github.com/apache/spark/pull/27066]

As this is my first PR to the Apache Spark project, I wanted to keep the PR 
small. However, I wouldn't mind writing the {{drop}} and {{withFieldRenamed}} 
methods as well in separate PRs once the current PR is accepted. 
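
To make the proposal concrete, here is a usage sketch. The first block is 
hypothetical until the PR lands; the second shows the status-quo workaround the 
first call would replace (runnable in spark-shell):
{code:java}
// Hypothetical usage of the proposed Column methods (not yet in Spark):
//   df.withColumn("s", $"s".withField("d", lit("hello")))
//   df.withColumn("s", $"s".withFieldRenamed("a", "a_renamed"))
//   df.withColumn("s", $"s".drop("b"))

// Today's equivalent for the first call: rebuild the struct by hand,
// re-listing every existing field.
import org.apache.spark.sql.functions._
import spark.implicits._

val df    = Seq((1, 2, 3)).toDF("a", "b", "c").select(struct($"a", $"b", $"c").as("s"))
val added = df.withColumn("s", struct($"s.a", $"s.b", $"s.c", lit("hello").as("d")))
{code}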

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema looks like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be performed, we 
> cannot leverage the columnar format, and the serialization and deserialization 
> cost becomes huge even when we just want to add a new column at the nested 
> level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we miss any use-case.  
> The first API we added is *mapItems* on dataframe which take a function from 
> *Column* to *Column*, and then apply the function on nested 

[jira] [Updated] (SPARK-27495) SPIP: Support Stage level resource configuration and scheduling

2020-01-02 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-27495:
--
Target Version/s: 3.0.0

> SPIP: Support Stage level resource configuration and scheduling
> ---
>
> Key: SPARK-27495
> URL: https://issues.apache.org/jira/browse/SPARK-27495
> Project: Spark
>  Issue Type: Epic
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
>  Labels: SPIP
>
> *Q1.* What are you trying to do? Articulate your objectives using absolutely 
> no jargon.
> Objectives:
>  # Allow users to specify task and executor resource requirements at the 
> stage level. 
>  # Spark will use the stage level requirements to acquire the necessary 
> resources/executors and schedule tasks based on the per stage requirements.
> Many times users have different resource requirements for different stages of 
> their application so they want to be able to configure resources at the stage 
> level. For instance, you have a single job that has 2 stages. The first stage 
> does some  ETL which requires a lot of tasks, each with a small amount of 
> memory and 1 core each. Then you have a second stage where you feed that ETL 
> data into an ML algorithm. The second stage only requires a few executors but 
> each executor needs a lot of memory, GPUs, and many cores.  This feature 
> allows the user to specify the task and executor resource requirements for 
> the ETL Stage and then change them for the ML stage of the job. 
> Resources include cpu, memory (on heap, overhead, pyspark, and off heap), and 
> extra Resources (GPU/FPGA/etc). It has the potential to allow for other 
> things like limiting the number of tasks per stage, specifying other 
> parameters for things like shuffle, etc. Initially I would propose we only 
> support resources as they are now. So Task resources would be cpu and other 
> resources (GPU, FPGA), that way we aren't adding in extra scheduling things 
> at this point.  Executor resources would be cpu, memory, and extra 
> resources(GPU,FPGA, etc). Changing the executor resources will rely on 
> dynamic allocation being enabled.
> Main use cases:
>  # ML use case where user does ETL and feeds it into an ML algorithm where 
> it’s using the RDD API. This should work with barrier scheduling as well once 
> it supports dynamic allocation.
>  # This adds the framework/api for Spark's own internal use.  In the future 
> (not covered by this SPIP), Catalyst could control the stage level resources 
> as it finds the need to change it between stages for different optimizations. 
> For instance, with the new columnar plugin to the query planner we can insert 
> stages into the plan that would change running something on the CPU in row 
> format to running it on the GPU in columnar format. This API would allow the 
> planner to make sure the stages that run on the GPU get the corresponding GPU 
> resources it needs to run. Another possible use case for catalyst is that it 
> would allow catalyst to add in more optimizations to where the user doesn’t 
> need to configure container sizes at all. If the optimizer/planner can handle 
> that for the user, everyone wins.
> This SPIP focuses on the RDD API but we don’t exclude the Dataset API. I 
> think the DataSet API will require more changes because it specifically hides 
> the RDD from the users via the plans and catalyst can optimize the plan and 
> insert things into the plan. The only way I’ve found to make this work with 
> the Dataset API would be modifying all the plans to be able to get the 
> resource requirements down into where it creates the RDDs, which I believe 
> would be a lot of change.  If other people know better options, it would be 
> great to hear them.
> *Q2.* What problem is this proposal NOT designed to solve?
> The initial implementation is not going to add Dataset APIs.
> We are starting with allowing users to specify a specific set of 
> task/executor resources and plan to design it to be extendable, but the first 
> implementation will not support changing generic SparkConf configs and only 
> specific limited resources.
> This initial version will have a programmatic API for specifying the resource 
> requirements per stage, we can add the ability to perhaps have profiles in 
> the configs later if its useful.
> *Q3.* How is it done today, and what are the limits of current practice?
> Currently this is either done by having multiple spark jobs or requesting 
> containers with the max resources needed for any part of the job.  To do this 
> today, you can break it into separate jobs where each job requests the 
> corresponding resources needed, but then you have to write the data out 
> 
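
For readers trying to picture the programmatic API mentioned above, one possible 
shape is sketched below; all class and method names ({{ResourceProfileBuilder}}, 
{{ExecutorResourceRequests}}, {{TaskResourceRequests}}, {{RDD.withResources}}) are 
assumptions about where the design could land, not something this SPIP text commits 
to, and the snippet assumes a spark-shell session where {{sc}} is available:
{code:java}
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Per-stage requirements for the ML stage: few executors, lots of memory and GPUs.
val execReqs = new ExecutorResourceRequests().cores(8).memory("16g").resource("gpu", 2)
val taskReqs = new TaskResourceRequests().cpus(4).resource("gpu", 1)
val profile  = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

// The ETL stage runs with the default profile; the ML stage asks for its own.
val etl = sc.parallelize(1 to 1000000).map(_ * 2)
val ml  = etl.map(_ + 1).withResources(profile)
{code}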

[jira] [Assigned] (SPARK-30387) Improve YarnClientSchedulerBackend log message

2020-01-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30387:


Assignee: jobit mathew

> Improve YarnClientSchedulerBackend log message
> --
>
> Key: SPARK-30387
> URL: https://issues.apache.org/jira/browse/SPARK-30387
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: jobit mathew
>Priority: Trivial
>
> ShutdownHook of 
> YarnClientSchedulerBackend prints "Stopped" which can be improved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30387) Improve YarnClientSchedulerBackend log message

2020-01-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30387.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27049
[https://github.com/apache/spark/pull/27049]

> Improve YarnClientSchedulerBackend log message
> --
>
> Key: SPARK-30387
> URL: https://issues.apache.org/jira/browse/SPARK-30387
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Assignee: jobit mathew
>Priority: Trivial
> Fix For: 3.0.0
>
>
> ShutdownHook of 
> YarnClientSchedulerBackend prints "Stopped" which can be improved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30341) check overflow for interval arithmetic operations

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30341.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26995
[https://github.com/apache/spark/pull/26995]

> check overflow for interval arithmetic operations
> -
>
> Key: SPARK-30341
> URL: https://issues.apache.org/jira/browse/SPARK-30341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> The interval arithmetic functions, e.g. 
> add/subtract/negative/multiply/divide, should enable overflow checking when 
> ANSI mode is on; when ANSI mode is off, add/subtract/negative should return 
> NULL on overflow, as multiply/divide already do.
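
As a self-contained illustration of the two behaviors (not the actual Spark code), 
using the months field of an interval as an example:
{code:java}
// Checked addition: throw on overflow when ANSI mode is on, otherwise return
// None, which corresponds to a NULL result.
def addMonths(a: Int, b: Int, ansiEnabled: Boolean): Option[Int] =
  try Some(Math.addExact(a, b))
  catch {
    case e: ArithmeticException => if (ansiEnabled) throw e else None
  }
{code}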



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30341) check overflow for interval arithmetic operations

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30341:
---

Assignee: Kent Yao

> check overflow for interval arithmetic operations
> -
>
> Key: SPARK-30341
> URL: https://issues.apache.org/jira/browse/SPARK-30341
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>
> The interval arithmetic functions, e.g. 
> add/subtract/negative/multiply/divide, should enable overflow checking when 
> ANSI mode is on; when ANSI mode is off, add/subtract/negative should return 
> NULL on overflow, as multiply/divide already do.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30284) CREATE VIEW should track the current catalog and namespace

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30284.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26923
[https://github.com/apache/spark/pull/26923]

> CREATE VIEW should track the current catalog and namespace
> --
>
> Key: SPARK-30284
> URL: https://issues.apache.org/jira/browse/SPARK-30284
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30412) Eliminate warnings in Java tests regarding to deprecated API

2020-01-02 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30412:
--

 Summary: Eliminate warnings in Java tests regarding to deprecated 
API
 Key: SPARK-30412
 URL: https://issues.apache.org/jira/browse/SPARK-30412
 Project: Spark
  Issue Type: Sub-task
  Components: Java API, SQL
Affects Versions: 2.4.4
Reporter: Maxim Gekk


Suppress warnings about deprecated Spark API in Java test suites:
{code}
/Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java
Warning:Warning:line (32)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
Warning:Warning:line (91)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
Warning:Warning:line (100)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
Warning:Warning:line (109)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
Warning:Warning:line (118)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
{code}
{code}
/Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/Java8DatasetAggregatorSuite.java
Warning:Warning:line (28)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
Warning:Warning:line (37)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
Warning:Warning:line (46)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
Warning:Warning:line (55)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
Warning:Warning:line (64)java: 
org.apache.spark.sql.expressions.javalang.typed in 
org.apache.spark.sql.expressions.javalang has been deprecated
{code}
{code}
/Users/maxim/proj/eliminate-warnings-part2/sql/core/src/test/java/test/org/apache/spark/sql/JavaDataFrameSuite.java
Warning:Warning:line (478)java: 
json(org.apache.spark.api.java.JavaRDD) in 
org.apache.spark.sql.DataFrameReader has been deprecated
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30174) Eliminate warnings :part 4

2020-01-02 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006953#comment-17006953
 ] 

Maxim Gekk commented on SPARK-30174:


[~shivuson...@gmail.com] Are you still working on this? If so, could you write 
in the ticket how you are going to fix the warnings, please.

> Eliminate warnings :part 4
> --
>
> Key: SPARK-30174
> URL: https://issues.apache.org/jira/browse/SPARK-30174
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: jobit mathew
>Priority: Minor
>
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala
> {code:java}
> Warning:Warning:line (127)value ENABLE_JOB_SUMMARY in class 
> ParquetOutputFormat is deprecated: see corresponding Javadoc for more 
> information.
>   && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
> Warning:Warning:line (261)class ParquetInputSplit in package hadoop is 
> deprecated: see corresponding Javadoc for more information.
> new org.apache.parquet.hadoop.ParquetInputSplit(
> Warning:Warning:line (272)method readFooter in class ParquetFileReader is 
> deprecated: see corresponding Javadoc for more information.
> ParquetFileReader.readFooter(sharedConf, filePath, 
> SKIP_ROW_GROUPS).getFileMetaData
> Warning:Warning:line (442)method readFooter in class ParquetFileReader is 
> deprecated: see corresponding Javadoc for more information.
>   ParquetFileReader.readFooter(
> {code}
> sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/parquet/ParquetWriteBuilder.scala
> {code:java}
>  Warning:Warning:line (91)value ENABLE_JOB_SUMMARY in class 
> ParquetOutputFormat is deprecated: see corresponding Javadoc for more 
> information.
>   && conf.get(ParquetOutputFormat.ENABLE_JOB_SUMMARY) == null) {
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30172) Eliminate warnings: part3

2020-01-02 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006952#comment-17006952
 ] 

Maxim Gekk commented on SPARK-30172:


[~Ankitraj] Are you still working on this?

> Eliminate warnings: part3
> -
>
> Key: SPARK-30172
> URL: https://issues.apache.org/jira/browse/SPARK-30172
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> /sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/ScriptTransformationExec.scala
> Warning:Warning:line (422)method initialize in class AbstractSerDe is 
> deprecated: see corresponding Javadoc for more information.
> serde.initialize(null, properties)
> /sql/hive/src/main/scala/org/apache/spark/sql/hive/hiveUDFs.scala
> Warning:Warning:line (216)method initialize in class GenericUDTF is 
> deprecated: see corresponding Javadoc for more information.
>   protected lazy val outputInspector = 
> function.initialize(inputInspectors.toArray)
> Warning:Warning:line (342)class UDAF in package exec is deprecated: see 
> corresponding Javadoc for more information.
>   new GenericUDAFBridge(funcWrapper.createFunction[UDAF]())
> Warning:Warning:line (503)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
> def serialize(buffer: AggregationBuffer): Array[Byte] = {
> Warning:Warning:line (523)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
> def deserialize(bytes: Array[Byte]): AggregationBuffer = {
> Warning:Warning:line (538)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
> case class HiveUDAFBuffer(buf: AggregationBuffer, canDoMerge: Boolean)
> Warning:Warning:line (538)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
> case class HiveUDAFBuffer(buf: AggregationBuffer, canDoMerge: Boolean)
> /sql/hive/src/main/java/org/apache/hadoop/hive/ql/io/orc/SparkOrcNewRecordReader.java
> Warning:Warning:line (44)java: getTypes() in org.apache.orc.Reader has 
> been deprecated
> Warning:Warning:line (47)java: getTypes() in org.apache.orc.Reader has 
> been deprecated
> /sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
> Warning:Warning:line (2,368)method readFooter in class ParquetFileReader 
> is deprecated: see corresponding Javadoc for more information.
> val footer = ParquetFileReader.readFooter(
> /sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUDAFSuite.scala
> Warning:Warning:line (202)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def getNewAggregationBuffer: AggregationBuffer = new 
> MockUDAFBuffer(0L, 0L)
> Warning:Warning:line (204)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def reset(agg: AggregationBuffer): Unit = {
> Warning:Warning:line (212)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def iterate(agg: AggregationBuffer, parameters: Array[AnyRef]): 
> Unit = {
> Warning:Warning:line (221)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def merge(agg: AggregationBuffer, partial: Object): Unit = {
> Warning:Warning:line (231)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def terminatePartial(agg: AggregationBuffer): AnyRef = {
> Warning:Warning:line (236)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def terminate(agg: AggregationBuffer): AnyRef = 
> terminatePartial(agg)
> Warning:Warning:line (257)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def getNewAggregationBuffer: AggregationBuffer = {
> Warning:Warning:line (266)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def reset(agg: AggregationBuffer): Unit = {
> Warning:Warning:line (277)trait AggregationBuffer in class 
> GenericUDAFEvaluator is deprecated: see corresponding Javadoc for more 
> information.
>   override def iterate(agg: AggregationBuffer, parameters: Array[AnyRef]): 
> 

[jira] [Commented] (SPARK-30171) Eliminate warnings: part2

2020-01-02 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006949#comment-17006949
 ] 

Maxim Gekk commented on SPARK-30171:


[~srowen] SPARK-30258 fixes the warnings in AvroFunctionsSuite.scala but not the 
ones for parsedOptions.ignoreExtension. I am not sure how we can avoid the 
warnings related to ignoreExtension.

> Eliminate warnings: part2
> -
>
> Key: SPARK-30171
> URL: https://issues.apache.org/jira/browse/SPARK-30171
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: ABHISHEK KUMAR GUPTA
>Priority: Minor
>
> AvroFunctionsSuite.scala
> Warning:Warning:line (41)method to_avro in package avro is deprecated (since 
> 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' instead.
> val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b"))
> Warning:Warning:line (41)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroDF = df.select(to_avro('id).as("a"), to_avro('str).as("b"))
> Warning:Warning:line (54)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, 
> avroTypeStr)), df)
> Warning:Warning:line (54)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroDF.select(from_avro('a, avroTypeLong), from_avro('b, 
> avroTypeStr)), df)
> Warning:Warning:line (59)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroStructDF = df.select(to_avro('struct).as("avro"))
> Warning:Warning:line (70)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
> checkAnswer(avroStructDF.select(from_avro('avro, avroTypeStruct)), df)
> Warning:Warning:line (76)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val avroStructDF = df.select(to_avro('struct).as("avro"))
> Warning:Warning:line (118)method to_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.to_avro' 
> instead.
> val readBackOne = dfOne.select(to_avro($"array").as("avro"))
> Warning:Warning:line (119)method from_avro in package avro is deprecated 
> (since 3.0.0): Please use 'org.apache.spark.sql.avro.functions.from_avro' 
> instead.
>   .select(from_avro($"avro", avroTypeArrStruct).as("array"))
> AvroPartitionReaderFactory.scala
> Warning:Warning:line (64)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> if (parsedOptions.ignoreExtension || 
> partitionedFile.filePath.endsWith(".avro")) {
> AvroFileFormat.scala
> Warning:Warning:line (98)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
>   if (parsedOptions.ignoreExtension || file.filePath.endsWith(".avro")) {
> AvroUtils.scala
> Warning:Warning:line (55)value ignoreExtension in class AvroOptions is 
> deprecated (since 3.0): Use the general data source option pathGlobFilter for 
> filtering file names
> inferAvroSchemaFromFiles(files, conf, parsedOptions.ignoreExtension,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-29760) Document VALUES statement in SQL Reference.

2020-01-02 Thread Ankit Raj Boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29760?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Raj Boudh updated SPARK-29760:

Comment: was deleted

(was: @Sean R. Owen , i will raise PR for this.)

> Document VALUES statement in SQL Reference.
> ---
>
> Key: SPARK-29760
> URL: https://issues.apache.org/jira/browse/SPARK-29760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.4
>Reporter: jobit mathew
>Priority: Minor
>
> spark-sql also supports *VALUES *.
> {code:java}
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three');
> 1   one
> 2   two
> 3   three
> Time taken: 0.015 seconds, Fetched 3 row(s)
> spark-sql>
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') limit 2;
> 1   one
> 2   two
> Time taken: 0.014 seconds, Fetched 2 row(s)
> spark-sql>
> spark-sql> VALUES (1, 'one'), (2, 'two'), (3, 'three') order by 2;
> 1   one
> 3   three
> 2   two
> Time taken: 0.153 seconds, Fetched 3 row(s)
> spark-sql>
> {code}
> *VALUES* can also be used along with INSERT INTO or SELECT.
> refer: https://www.postgresql.org/docs/current/sql-values.html 
> So please confirm whether VALUES should also be documented.
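
A small illustration of those two usages via {{spark.sql}} (the table and column 
names are made up for the example):
{code:java}
spark.sql("CREATE TABLE t(id INT, name STRING) USING parquet")
spark.sql("INSERT INTO t VALUES (1, 'one'), (2, 'two')")
spark.sql("SELECT * FROM VALUES (1, 'one'), (2, 'two') AS v(id, name)").show()
{code}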



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-02 Thread Sanket Reddy (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sanket Reddy updated SPARK-30411:
-
Description: 
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
 drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases

>>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>>AS orc");

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example

Now after saveAsTable
 >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
 >>> ('Fifth', 5)]
 >>> df = spark.createDataFrame(data)
 >>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
 -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
 saveAsTable overwrites the permissions.

insertInto preserves the parent directory permissions.
 >>> spark.sql("DROP table redsanket_db.example");
 DataFrame[]
 >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
 >>> STORED AS orc");
 DataFrame[]
 >>> df.write.format("orc").insertInto('redsanket_db.example')

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
 drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
 This is either a limitation of the API based on the write mode, in which case
the behavior has to be documented, or it needs to be fixed.

  was:
-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
drwxr-x--T   - redsanket users 0 2019-12-04 
20:15 /tmp/my_databases

>>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>>AS orc");

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwxr-x--T   - redsanket users  0 2019-12-04 20:20 
/tmp/my_databases/example


Now after saveAsTable
>>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 
>>> 5)]
>>> df = spark.createDataFrame(data)
>>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwx--   - redsanket users  0 2019-12-04 20:23 
/tmp/my_databases/example
Overwrites the permissions

Insert into honors preserving parent directory permissions.
>>> spark.sql("DROP table redsanket_db.example");
DataFrame[]
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
DataFrame[]
>>> df.write.format("orc").insertInto('redsanket_db.example')

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwxr-x--T   - schintap users  0 2019-12-04 20:43 
/tmp/my_databases/example
It is either limitation of the API based on the mode and the behavior has to be 
documented or needs to be fixed


> saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms
> ---
>
> Key: SPARK-30411
> URL: https://issues.apache.org/jira/browse/SPARK-30411
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Sanket Reddy
>Priority: Minor
>
> -bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
>  drwxr-x--T - redsanket users 0 2019-12-04 20:15 /tmp/my_databases
> >>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> >>>STORED AS orc");
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:20 /tmp/my_databases/example
> Now after saveAsTable
>  >>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), 
> ('Fifth', 5)]
>  >>> df = spark.createDataFrame(data)
>  >>> 
> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
>  -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwx-- - redsanket users 0 2019-12-04 20:23 /tmp/my_databases/example
>  saveAsTable overwrites the permissions.
> insertInto preserves the parent directory permissions.
>  >>> spark.sql("DROP table redsanket_db.example");
>  DataFrame[]
>  >>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) 
> STORED AS orc");
>  DataFrame[]
>  >>> df.write.format("orc").insertInto('redsanket_db.example')
> -bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
>  drwxr-x--T - redsanket users 0 2019-12-04 20:43 /tmp/my_databases/example
>  This is either a limitation of the API based on the write mode, in which case
> the behavior has to be documented, or it needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30411) saveAsTable does not honor spark.hadoop.hive.warehouse.subdir.inherit.perms

2020-01-02 Thread Sanket Reddy (Jira)
Sanket Reddy created SPARK-30411:


 Summary: saveAsTable does not honor 
spark.hadoop.hive.warehouse.subdir.inherit.perms
 Key: SPARK-30411
 URL: https://issues.apache.org/jira/browse/SPARK-30411
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Sanket Reddy


-bash-4.2$ hdfs dfs -ls /tmp | grep my_databases
drwxr-x--T   - redsanket users 0 2019-12-04 
20:15 /tmp/my_databases

>>>spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>>AS orc");

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwxr-x--T   - redsanket users  0 2019-12-04 20:20 
/tmp/my_databases/example


Now after saveAsTable
>>> data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 
>>> 5)]
>>> df = spark.createDataFrame(data)
>>> df.write.format("orc").mode('overwrite').saveAsTable('redsanket_db.example')
-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwx--   - redsanket users  0 2019-12-04 20:23 
/tmp/my_databases/example
saveAsTable overwrites the permissions.

insertInto preserves the parent directory permissions.
>>> spark.sql("DROP table redsanket_db.example");
DataFrame[]
>>> spark.sql("CREATE TABLE redsanket_db.example(bcookie string, ip int) STORED 
>>> AS orc");
DataFrame[]
>>> df.write.format("orc").insertInto('redsanket_db.example')

-bash-4.2$ hdfs dfs -ls /tmp/my_databases | grep example
drwxr-x--T   - schintap users  0 2019-12-04 20:43 
/tmp/my_databases/example
This is either a limitation of the API based on the write mode, in which case
the behavior has to be documented, or it needs to be fixed.
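
As a minimal sketch of the workaround implied above (PySpark, reusing the names from this repro), pre-creating the table and loading it with insertInto keeps the parent directory permissions that saveAsTable in overwrite mode clobbers:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

data = [('First', 1), ('Second', 2), ('Third', 3), ('Fourth', 4), ('Fifth', 5)]
df = spark.createDataFrame(data)

# Create the table once so the warehouse directory is set up with inherited permissions...
spark.sql("CREATE TABLE IF NOT EXISTS redsanket_db.example(bcookie STRING, ip INT) STORED AS ORC")

# ...then load it with insertInto, which reuses the existing directory instead of
# re-creating it the way saveAsTable with mode('overwrite') does.
df.write.format("orc").insertInto("redsanket_db.example")
{code}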



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30397) [pyspark] Writer applied to custom model changes type of keys' dict from int to str

2020-01-02 Thread Radhwane Chebaane (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006862#comment-17006862
 ] 

Radhwane Chebaane commented on SPARK-30397:
---

This issue comes from the native behaviour of the Python `json` module: when it
dumps a dict with integer keys, it converts them to string keys.
As a workaround you can split your dict into two list params (a keys list and a
values list), since element types inside lists are kept unchanged.
For more details about the Python json module:
[https://stackoverflow.com/questions/1450957/pythons-json-module-converts-int-dictionary-keys-to-strings/34346202]
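
A minimal illustration of both the behaviour and the workaround (plain Python, no Spark involved):

{code:python}
import json

# json serializes dict keys as JSON object names, which must be strings,
# so integer keys come back as strings after a dump/load round trip.
original = {1: 10, 2: 20}
round_tripped = json.loads(json.dumps(original))
print(round_tripped)  # {'1': 10, '2': 20} -- keys are now str

# Workaround: store keys and values as two parallel lists; list elements
# keep their types through the round trip, and the dict can be rebuilt.
keys, values = list(original.keys()), list(original.values())
restored = dict(zip(*json.loads(json.dumps([keys, values]))))
print(restored)  # {1: 10, 2: 20} -- keys are int again
{code}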

> [pyspark] Writer applied to custom model changes type of keys' dict from int 
> to str
> ---
>
> Key: SPARK-30397
> URL: https://issues.apache.org/jira/browse/SPARK-30397
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Jean-Marc Montanier
>Priority: Major
>
> Hello,
>  
> I have a custom model that I'm trying to persist. Within this custom model 
> there is a python dict mapping from int to int. When the model is saved (with 
> write().save('path')), the keys of the dict are modified from int to str.
>  
> You can find below the code to reproduce the issue:
> {code:python}
> #!/usr/bin/env python3
> # -*- coding: utf-8 -*-
> """
> @author: Jean-Marc Montanier
> @date: 2019/12/31
> """
> from pyspark.sql import SparkSession
> from pyspark import keyword_only
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml import Estimator, Model
> from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
> from pyspark.ml.param import Param, Params
> from pyspark.ml.param.shared import HasInputCol, HasOutputCol
> from pyspark.sql.types import IntegerType
> from pyspark.sql.functions import udf
> spark = SparkSession \
> .builder \
> .appName("ImputeNormal") \
> .getOrCreate()
> class CustomFit(Estimator,
> HasInputCol,
> HasOutputCol,
> DefaultParamsReadable,
> DefaultParamsWritable,
> ):
> @keyword_only
> def __init__(self, inputCol="inputCol", outputCol="outputCol"):
> super(CustomFit, self).__init__()
> self._setDefault(inputCol="inputCol", outputCol="outputCol")
> kwargs = self._input_kwargs
> self.setParams(**kwargs)
> @keyword_only
> def setParams(self, inputCol="inputCol", outputCol="outputCol"):
> """
> setParams(self, inputCol="inputCol", outputCol="outputCol")
> """
> kwargs = self._input_kwargs
> self._set(**kwargs)
> return self
> def _fit(self, data):
> inputCol = self.getInputCol()
> outputCol = self.getOutputCol()
> categories = data.where(data[inputCol].isNotNull()) \
> .groupby(inputCol) \
> .count() \
> .orderBy("count", ascending=False) \
> .limit(2)
> categories = dict(categories.toPandas().set_index(inputCol)["count"])
> for cat in categories:
> categories[cat] = int(categories[cat])
> return CustomModel(categories=categories,
>input_col=inputCol,
>output_col=outputCol)
> class CustomModel(Model,
>   DefaultParamsReadable,
>   DefaultParamsWritable):
> input_col = Param(Params._dummy(), "input_col", "Name of the input 
> column")
> output_col = Param(Params._dummy(), "output_col", "Name of the output 
> column")
> categories = Param(Params._dummy(), "categories", "Top categories")
> def __init__(self, categories: dict = None, input_col="input_col", 
> output_col="output_col"):
> super(CustomModel, self).__init__()
> self._set(categories=categories, input_col=input_col, 
> output_col=output_col)
> def get_output_col(self) -> str:
> """
> output_col getter
> :return:
> """
> return self.getOrDefault(self.output_col)
> def get_input_col(self) -> str:
> """
> input_col getter
> :return:
> """
> return self.getOrDefault(self.input_col)
> def get_categories(self):
> """
> categories getter
> :return:
> """
> return self.getOrDefault(self.categories)
> def _transform(self, data):
> input_col = self.get_input_col()
> output_col = self.get_output_col()
> categories = self.get_categories()
> def get_cat(val):
> if val is None:
> return -1
> if val not in categories:
> return -1
> return int(categories[val])
> get_cat_udf = 

[jira] [Created] (SPARK-30410) Calculating size of table having large number of partitions causes flooding logs

2020-01-02 Thread Zhenhua Wang (Jira)
Zhenhua Wang created SPARK-30410:


 Summary: Calculating size of table having large number of 
partitions causes flooding logs
 Key: SPARK-30410
 URL: https://issues.apache.org/jira/browse/SPARK-30410
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.5, 3.0.0
Reporter: Zhenhua Wang


For a partitioned table, if the number of partitions is very large, e.g. tens of
thousands or even more, calculating its total size floods the logs.

The flooding happens in:
 # `calculateLocationSize` prints a starting and an ending message for the
location size calculation, and it is called once per partition;
 # `bulkListLeafFiles` prints all partition paths.
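
As a stopgap until the logging itself is toned down, the offending loggers can be raised to WARN from PySpark through the JVM gateway. This is only a rough sketch: the class names below are my guesses about where `calculateLocationSize` and `bulkListLeafFiles` live, so adjust them to whatever class names appear in the flooded log lines.

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
jvm = spark.sparkContext._jvm

# Assumed logger names -- verify against the classes printed in the flooded logs.
noisy_loggers = [
    "org.apache.spark.sql.execution.command.CommandUtils",
    "org.apache.spark.sql.execution.datasources.InMemoryFileIndex",
]
for name in noisy_loggers:
    # Raise these loggers to WARN so the per-partition INFO messages are dropped.
    jvm.org.apache.log4j.LogManager.getLogger(name).setLevel(jvm.org.apache.log4j.Level.WARN)
{code}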



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30403) Fix the NoSuchElementException exception when enable AQE with InSubquery use case

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30403.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27068
[https://github.com/apache/spark/pull/27068]

> Fix the NoSuchElementException exception when enable AQE with InSubquery use 
> case
> -
>
> Key: SPARK-30403
> URL: https://issues.apache.org/jira/browse/SPARK-30403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> After merging [https://github.com/apache/spark/pull/25854], we also need to
> handle the InSubquery case when building the SubqueryMap in the
> InsertAdaptiveSparkPlan rule. Otherwise we will get a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30403) Fix the NoSuchElementException exception when enable AQE with InSubquery use case

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30403:
---

Assignee: Ke Jia

> Fix the NoSuchElementException exception when enable AQE with InSubquery use 
> case
> -
>
> Key: SPARK-30403
> URL: https://issues.apache.org/jira/browse/SPARK-30403
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>
> After merging [https://github.com/apache/spark/pull/25854], we also need to
> handle the InSubquery case when building the SubqueryMap in the
> InsertAdaptiveSparkPlan rule. Otherwise we will get a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27996) Spark UI redirect will be failed behind the https reverse proxy

2020-01-02 Thread Stijn De Haes (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006823#comment-17006823
 ] 

Stijn De Haes commented on SPARK-27996:
---

[~ishikin] I am stumbling on this as well. Can you maybe already make a PR with 
your solution to get the ball rolling?

> Spark UI redirect will be failed behind the https reverse proxy
> ---
>
> Key: SPARK-27996
> URL: https://issues.apache.org/jira/browse/SPARK-27996
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.4.3
>Reporter: Saisai Shao
>Priority: Minor
>
> When the Spark live/history UI is proxied behind a reverse proxy, the redirect
> will return the wrong scheme. For example:
> If the reverse proxy is SSL enabled, the request from the client to the reverse
> proxy is an HTTPS request, whereas if Spark's UI is not SSL enabled, the request
> from the reverse proxy to the Spark UI is an HTTP request. Spark itself treats
> all requests as HTTP requests, so the redirect URL starts with "http", and the
> redirect fails on the client side.
> Most reverse proxies add an additional header, "X-Forwarded-Proto", to tell the
> backend server that the client request is an HTTPS request, so Spark should
> leverage this header to return the correct URL.
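
To illustrate the header mechanism (this is not Spark's actual redirect code, just a sketch of the logic the description suggests; the function name is made up):

{code:python}
def redirect_location(headers, host, path):
    """Build a redirect URL, trusting X-Forwarded-Proto when a proxy sets it."""
    # An SSL-terminating proxy reports the original scheme in this header;
    # without it, fall back to the scheme the backend itself speaks.
    scheme = headers.get("X-Forwarded-Proto", "http")
    return "{}://{}{}".format(scheme, host, path)

# Behind an HTTPS reverse proxy the redirect stays on https:
print(redirect_location({"X-Forwarded-Proto": "https"}, "spark.example.com", "/jobs/"))
# Direct access without the header falls back to http:
print(redirect_location({}, "spark.example.com", "/jobs/"))
{code}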



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30407) reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30407.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27074
[https://github.com/apache/spark/pull/27074]

> reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe
> 
>
> Key: SPARK-30407
> URL: https://issues.apache.org/jira/browse/SPARK-30407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
> Fix For: 3.0.0
>
>
> Working on [https://github.com/apache/spark/pull/26813]. With AQE enabled, the
> metric info of AdaptiveSparkPlanExec is not reset when running the test
> DataFrameCallbackSuite#get numRows metrics by callback.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30407) reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30407:
---

Assignee: Ke Jia

> reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe
> 
>
> Key: SPARK-30407
> URL: https://issues.apache.org/jira/browse/SPARK-30407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Assignee: Ke Jia
>Priority: Major
>
> Working on [https://github.com/apache/spark/pull/26813]. With AQE enabled, the
> metric info of AdaptiveSparkPlanExec is not reset when running the test
> DataFrameCallbackSuite#get numRows metrics by callback.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30196) Bump lz4-java version to 1.7.0

2020-01-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro reopened SPARK-30196:
--

> Bump lz4-java version to 1.7.0
> --
>
> Key: SPARK-30196
> URL: https://issues.apache.org/jira/browse/SPARK-30196
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30196) Bump lz4-java version to 1.7.0

2020-01-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006819#comment-17006819
 ] 

Takeshi Yamamuro commented on SPARK-30196:
--

I'll keep this open until the root cause is found and resolved.

> Bump lz4-java version to 1.7.0
> --
>
> Key: SPARK-30196
> URL: https://issues.apache.org/jira/browse/SPARK-30196
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Spark Core
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23516) I think it is unnecessary to transfer unroll memory to storage memory

2020-01-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-23516.
--
Resolution: Won't Fix

> I think it is unnecessary to transfer unroll memory to storage memory 
> --
>
> Key: SPARK-23516
> URL: https://issues.apache.org/jira/browse/SPARK-23516
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Minor
>
> Now the _StaticMemoryManager_ mode has been removed.
>  And for _UnifiedMemoryManager_, unroll memory is also storage memory, so I
> think it is unnecessary to actually release unroll memory and then acquire
> storage memory again.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17454) Use Mesos disk resources

2020-01-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-17454.
--
Resolution: Won't Fix

> Use Mesos disk resources
> 
>
> Key: SPARK-17454
> URL: https://issues.apache.org/jira/browse/SPARK-17454
> Project: Spark
>  Issue Type: Improvement
>  Components: Mesos
>Affects Versions: 2.0.0
>Reporter: Chris Bannister
>Priority: Major
>
> Currently the driver will accept offers from Mesos which have enough RAM for
> the executor, until its max cores are reached. There is no way to control the
> required CPUs or disk for each executor; it would be very useful to be able to
> apply something similar to spark.mesos.constraints to resource offers instead
> of to attributes on the offer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-26560) Repeating select on udf function throws analysis exception - function not registered

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-26560:
---

Assignee: Jungtaek Lim

> Repeating select on udf function throws analysis exception - function not 
> registered
> 
>
> Key: SPARK-26560
> URL: https://issues.apache.org/jira/browse/SPARK-26560
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Haripriya
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> In spark-shell,
> 1. Create the new function
> sql("CREATE FUNCTION last_day_test AS 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
> 'hdfs:///tmp/two_udfs.jar'")
> 2. Perform select over the udf
> {code:java}
> scala> sql(" select last_day_test('2015-08-22')")
>  res1: org.apache.spark.sql.DataFrame = [default.last_day_test(2015-08-22): 
> string]
> scala> sql(" select last_day_test('2015-08-22')")
>  org.apache.spark.sql.AnalysisException: Undefined function: 'last_day_test'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.; line 1 pos 8
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:1167)
>  at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:145)
>  at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:124)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$53.apply(Analyzer.scala:1244)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$53.apply(Analyzer.scala:1244)
>  at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:1243)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:1227)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:122)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:127)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 

[jira] [Resolved] (SPARK-26780) Improve shuffle read using ReadAheadInputStream

2020-01-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-26780.
--
Resolution: Won't Fix

> Improve  shuffle read using ReadAheadInputStream 
> -
>
> Key: SPARK-26780
> URL: https://issues.apache.org/jira/browse/SPARK-26780
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: liuxian
>Priority: Major
>
> Using _ReadAheadInputStream_ to improve shuffle read performance.
>  _ReadAheadInputStream_ can reduce CPU utilization with almost no performance
> regression.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26560) Repeating select on udf function throws analysis exception - function not registered

2020-01-02 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26560.
-
Fix Version/s: 3.0.0
   2.4.5
   Resolution: Fixed

Issue resolved by pull request 27075
[https://github.com/apache/spark/pull/27075]

> Repeating select on udf function throws analysis exception - function not 
> registered
> 
>
> Key: SPARK-26560
> URL: https://issues.apache.org/jira/browse/SPARK-26560
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: Haripriya
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> In spark-shell,
> 1. Create the new function
> sql("CREATE FUNCTION last_day_test AS 
> 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFLastDayTest' USING JAR 
> 'hdfs:///tmp/two_udfs.jar'")
> 2. Perform select over the udf
> {code:java}
> scala> sql(" select last_day_test('2015-08-22')")
>  res1: org.apache.spark.sql.DataFrame = [default.last_day_test(2015-08-22): 
> string]
> scala> sql(" select last_day_test('2015-08-22')")
>  org.apache.spark.sql.AnalysisException: Undefined function: 'last_day_test'. 
> This function is neither a registered temporary function nor a permanent 
> function registered in the database 'default'.; line 1 pos 8
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.failFunctionLookup(SessionCatalog.scala:1167)
>  at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction0(HiveSessionCatalog.scala:145)
>  at 
> org.apache.spark.sql.hive.HiveSessionCatalog.lookupFunction(HiveSessionCatalog.scala:124)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$53.apply(Analyzer.scala:1244)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6$$anonfun$applyOrElse$53.apply(Analyzer.scala:1244)
>  at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:53)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:1243)
>  at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$16$$anonfun$applyOrElse$6.applyOrElse(Analyzer.scala:1227)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
>  at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$transformExpressionsDown$1.apply(QueryPlan.scala:85)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:107)
>  at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:106)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:118)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1$1.apply(QueryPlan.scala:122)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>  at scala.collection.immutable.List.foreach(List.scala:381)
>  at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>  at scala.collection.immutable.List.map(List.scala:285)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$1(QueryPlan.scala:122)
>  at 
> org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:127)
>  at 
> 

[jira] [Created] (SPARK-30409) Use `NoOp` datasource in SQL benchmarks

2020-01-02 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30409:
--

 Summary: Use `NoOp` datasource in SQL benchmarks
 Key: SPARK-30409
 URL: https://issues.apache.org/jira/browse/SPARK-30409
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.4.4
Reporter: Maxim Gekk


Currently, SQL benchmarks use the `count()`, `collect()` and `foreach(_ => ())`
actions. These actions have additional overhead. For example, `collect()`
converts column values to external types and pulls the data to the driver. We
need to unify the benchmarks on the `NoOp` datasource, except for the benchmarks
of `count()` or `collect()` themselves.
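
For reference, routing a benchmark's output to the no-op datasource looks roughly like this (a PySpark sketch; the Scala benchmarks would use the same format("noop") idea, and the DataFrame below is just a placeholder for the benchmarked query):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 10 * 1000 * 1000)  # placeholder for the query under benchmark

# The no-op sink materializes every row but discards the output, so the timing
# reflects the query itself rather than collect()/count() or driver-side conversion.
df.write.format("noop").mode("overwrite").save()
{code}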



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30408) orderBy in sortBy clause is removed by EliminateSorts

2020-01-02 Thread APeng Zhang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30408?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

APeng Zhang updated SPARK-30408:

Description: 
OrderBy in sortBy clause will be removed by EliminateSorts.

code to reproduce:
{code:java}
val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("c", 3, 6) ).toDF("a", "b", "c") 
val groupData = dataset.orderBy("b")
val sortData = groupData.sortWithinPartitions("c")
{code}
The content of groupData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5]
partition 2: 
[c,3,6]{code}
The content of sortData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5], 
[c,3,6]{code}
 

UT to cover this defect:

In EliminateSortsSuite.scala
{code:java}
test("should not remove orderBy in sortBy clause") {
  val plan = testRelation.orderBy('a.asc).sortBy('b.desc)
  val optimized = Optimize.execute(plan.analyze)
  val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze
  comparePlans(optimized, correctAnswer)
}{code}
 

 
 This test will fail because sortBy was removed by EliminateSorts.

  was:
OrderBy in sortBy clause will be removed by EliminateSorts.

code to reproduce:

 
{code:java}
val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("c", 3, 6) ).toDF("a", "b", "c") 
val groupData = dataset.orderBy("b")
val sortData = groupData.sortWithinPartitions("c")
{code}
The content of groupData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5]
partition 2: 
[c,3,6]{code}
The content of sortData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5], 
[c,3,6]{code}
 

UT to cover this defect:

In EliminateSortsSuite.scala
{code:java}
test("should not remove orderBy in sortBy clause") {
  val plan = testRelation.orderBy('a.asc).sortBy('b.desc)
  val optimized = Optimize.execute(plan.analyze)
  val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze
  comparePlans(optimized, correctAnswer)
}{code}
 

 
This test will be failed because sortBy was removed by EliminateSorts.


> orderBy in sortBy clause is removed by EliminateSorts
> -
>
> Key: SPARK-30408
> URL: https://issues.apache.org/jira/browse/SPARK-30408
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4
>Reporter: APeng Zhang
>Priority: Major
>
> OrderBy in sortBy clause will be removed by EliminateSorts.
> code to reproduce:
> {code:java}
> val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("c", 3, 6) ).toDF("a", "b", 
> "c") 
> val groupData = dataset.orderBy("b")
> val sortData = groupData.sortWithinPartitions("c")
> {code}
> The content of groupData is:
> {code:java}
> partition 0: 
> [a,1,4]
> partition 1: 
> [b,2,5]
> partition 2: 
> [c,3,6]{code}
> The content of sortData is:
> {code:java}
> partition 0: 
> [a,1,4]
> partition 1: 
> [b,2,5], 
> [c,3,6]{code}
>  
> UT to cover this defect:
> In EliminateSortsSuite.scala
> {code:java}
> test("should not remove orderBy in sortBy clause") {
>   val plan = testRelation.orderBy('a.asc).sortBy('b.desc)
>   val optimized = Optimize.execute(plan.analyze)
>   val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze
>   comparePlans(optimized, correctAnswer)
> }{code}
>  
>  
>  This test will fail because sortBy was removed by EliminateSorts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30408) orderBy in sortBy clause is removed by EliminateSorts

2020-01-02 Thread APeng Zhang (Jira)
APeng Zhang created SPARK-30408:
---

 Summary: orderBy in sortBy clause is removed by EliminateSorts
 Key: SPARK-30408
 URL: https://issues.apache.org/jira/browse/SPARK-30408
 Project: Spark
  Issue Type: Bug
  Components: Optimizer
Affects Versions: 2.4.4, 2.4.3, 2.4.2, 2.4.1, 2.4.0
Reporter: APeng Zhang


OrderBy in sortBy clause will be removed by EliminateSorts.

code to reproduce:

 
{code:java}
val dataset = Seq( ("a", 1, 4), ("b", 2, 5), ("c", 3, 6) ).toDF("a", "b", "c") 
val groupData = dataset.orderBy("b")
val sortData = groupData.sortWithinPartitions("c")
{code}
The content of groupData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5]
partition 2: 
[c,3,6]{code}
The content of sortData is:
{code:java}
partition 0: 
[a,1,4]
partition 1: 
[b,2,5], 
[c,3,6]{code}
 

UT to cover this defect:

In EliminateSortsSuite.scala
{code:java}
test("should not remove orderBy in sortBy clause") {
  val plan = testRelation.orderBy('a.asc).sortBy('b.desc)
  val optimized = Optimize.execute(plan.analyze)
  val correctAnswer = testRelation.orderBy('a.asc).sortBy('b.desc).analyze
  comparePlans(optimized, correctAnswer)
}{code}
 

 
This test will fail because sortBy was removed by EliminateSorts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27229) GroupBy Placement in Intersect Distinct

2020-01-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-27229.
--
Resolution: Won't Fix

> GroupBy Placement in Intersect Distinct
> ---
>
> Key: SPARK-27229
> URL: https://issues.apache.org/jira/browse/SPARK-27229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> The Intersect operator will be replaced by a Left Semi Join in the Optimizer.
> for example:
> SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
>  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
> a2<=>b2
> If Tab1 and Tab2 are too large, the join will be very slow. We can reduce the
> table data before the join by placing a group-by operator under the join, that
> is:
> ==>  
> SELECT a1, a2 FROM 
>(SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X
>LEFT SEMI JOIN 
>(SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y
> ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
> Then we have smaller table data when executing the join, because the group by
> has cut lots of data.
>  
> A PR will be submitted soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27229) GroupBy Placement in Intersect Distinct

2020-01-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006748#comment-17006748
 ] 

Takeshi Yamamuro commented on SPARK-27229:
--

I'll close this for now because the corresponding pr is stale. Please reopen 
this if necessary.

> GroupBy Placement in Intersect Distinct
> ---
>
> Key: SPARK-27229
> URL: https://issues.apache.org/jira/browse/SPARK-27229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Major
>
> The Intersect operator will be replaced by a Left Semi Join in the Optimizer.
> for example:
> SELECT a1, a2 FROM Tab1 INTERSECT SELECT b1, b2 FROM Tab2
>  ==>  SELECT DISTINCT a1, a2 FROM Tab1 LEFT SEMI JOIN Tab2 ON a1<=>b1 AND 
> a2<=>b2
> If Tab1 and Tab2 are too large, the join will be very slow. We can reduce the
> table data before the join by placing a group-by operator under the join, that
> is:
> ==>  
> SELECT a1, a2 FROM 
>(SELECT a1,a2 FROM Tab1 GROUP BY a1,a2) X
>LEFT SEMI JOIN 
>(SELECT b1,b2 FROM Tab2 GROUP BY b1,b2) Y
> ON X.a1<=>Y.b1 AND X.a2<=>Y.b2
> Then we have smaller table data when executing the join, because the group by
> has cut lots of data.
>  
> A PR will be submitted soon.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26739) Standardized Join Types for DataFrames

2020-01-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006745#comment-17006745
 ] 

Takeshi Yamamuro commented on SPARK-26739:
--

I'll close this for now because the corresponding PR is stale. Please reopen
this if necessary.

> Standardized Join Types for DataFrames
> --
>
> Key: SPARK-26739
> URL: https://issues.apache.org/jira/browse/SPARK-26739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Skyler Lehan
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Currently, in the join functions on 
> [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
>  the join types are defined via a string parameter called joinType. In order 
> for a developer to know which joins are possible, they must look up the API 
> call for join. While this works fine, it can cause the developer to make a 
> typo resulting in improper joins and/or unexpected errors that aren't evident 
> at compile time. The objective of this improvement would be to allow 
> developers to use a common definition for join types (by enum or constants) 
> called JoinTypes. This would contain the possible joins and remove the 
> possibility of a typo. It would also allow Spark to alter the names of the 
> joins in the future without impacting end-users.
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this solves is extremely narrow, it would not solve anything 
> other than providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> "left_outer")
> {code}
> Where they manually type the join type. As stated above, this:
>  * Requires developers to manually type in the join
>  * Can cause possibility of typos
>  * Restricts renaming of join types, as it's a literal string
>  * Does not restrict and/or compile check the join type being used, leading 
> to runtime errors
> h3. *Q4.* What is new in your approach and why do you think it will be 
> successful?
> The new approach would use constants or *more preferably an enum*, something 
> like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> JoinType.LEFT_OUTER)
> {code}
> This would provide:
>  * In code reference/definitions of the possible join types
>  ** This subsequently allows the addition of scaladoc of what each join type 
> does and how it operates
>  * Removes possibilities of a typo on the join type
>  * Provides compile time checking of the join type (only if an enum is used)
> To clarify, if JoinType is a constant, it would just fill in the joinType 
> string parameter for users. If an enum is used, it would restrict the domain 
> of possible join types to whatever is defined in the future JoinType enum. 
> The enum is preferred, however it would take longer to implement.
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This will make the join function 
> easier to wield and lead to less runtime errors. It will save time by 
> bringing join type validation at compile time. It will also provide in code 
> reference to the join types, which saves the developer time of having to look 
> up and navigate the multiple join functions to find the possible join types. 
> In addition to that, the resulting constants/enum would have documentation on 
> how that join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types 
> could be impacted if an enum is chosen as the common definition. This risk 
> can be mitigated by using string constants. The string constants would be the 
> exact same string as the string literals used today. For example:
> {code:java}
> JoinType.INNER = "inner"
> {code}
> If an enum is still the preferred way of defining the join types, new join 
> functions could be added that take in these enums and the join calls that 
> contain string parameters for joinType could be deprecated. This would give 
> developers a chance to change over to the new join types.
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> Mid-term exam would be the addition of a common definition of the join types 
> and additional join functions that take in the join type enum/constant. The 
> final exam would be working tests written to check the 

[jira] [Resolved] (SPARK-26739) Standardized Join Types for DataFrames

2020-01-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-26739.
--
Resolution: Won't Fix

> Standardized Join Types for DataFrames
> --
>
> Key: SPARK-26739
> URL: https://issues.apache.org/jira/browse/SPARK-26739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Skyler Lehan
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Currently, in the join functions on 
> [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
>  the join types are defined via a string parameter called joinType. In order 
> for a developer to know which joins are possible, they must look up the API 
> call for join. While this works fine, it can cause the developer to make a 
> typo resulting in improper joins and/or unexpected errors that aren't evident 
> at compile time. The objective of this improvement would be to allow 
> developers to use a common definition for join types (by enum or constants) 
> called JoinTypes. This would contain the possible joins and remove the 
> possibility of a typo. It would also allow Spark to alter the names of the 
> joins in the future without impacting end-users.
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this solves is extremely narrow, it would not solve anything 
> other than providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> "left_outer")
> {code}
> Where they manually type the join type. As stated above, this:
>  * Requires developers to manually type in the join
>  * Can cause possibility of typos
>  * Restricts renaming of join types, as it's a literal string
>  * Does not restrict and/or compile check the join type being used, leading 
> to runtime errors
> h3. *Q4.* What is new in your approach and why do you think it will be 
> successful?
> The new approach would use constants or *more preferably an enum*, something 
> like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> JoinType.LEFT_OUTER)
> {code}
> This would provide:
>  * In code reference/definitions of the possible join types
>  ** This subsequently allows the addition of scaladoc of what each join type 
> does and how it operates
>  * Removes possibilities of a typo on the join type
>  * Provides compile time checking of the join type (only if an enum is used)
> To clarify, if JoinType is a constant, it would just fill in the joinType 
> string parameter for users. If an enum is used, it would restrict the domain 
> of possible join types to whatever is defined in the future JoinType enum. 
> The enum is preferred, however it would take longer to implement.
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This will make the join function 
> easier to wield and lead to less runtime errors. It will save time by 
> bringing join type validation at compile time. It will also provide in code 
> reference to the join types, which saves the developer time of having to look 
> up and navigate the multiple join functions to find the possible join types. 
> In addition to that, the resulting constants/enum would have documentation on 
> how that join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types 
> could be impacted if an enum is chosen as the common definition. This risk 
> can be mitigated by using string constants. The string constants would be the 
> exact same string as the string literals used today. For example:
> {code:java}
> JoinType.INNER = "inner"
> {code}
> If an enum is still the preferred way of defining the join types, new join 
> functions could be added that take in these enums and the join calls that 
> contain string parameters for joinType could be deprecated. This would give 
> developers a chance to change over to the new join types.
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> Mid-term exam would be the addition of a common definition of the join types 
> and additional join functions that take in the join type enum/constant. The 
> final exam would be working tests written to check the functionality of these 
> new join functions and the join functions that take a string for joinType 
> would 

[jira] [Commented] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2020-01-02 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006742#comment-17006742
 ] 

Takeshi Yamamuro commented on SPARK-27033:
--

I'll close this for now because the corresponding pr is stale. Please reopen 
this if necessary. Anyway, thanks for the work!

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> Currently, filters like "select * from table where a + 1 >= 3" cannot be
> pushed down. This optimizer rule can convert it to "select * from table where
> a >= 3 - 1", and then to "select * from table where a >= 2".
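
To check what actually reaches the data source, one can compare the physical plans of the raw and the simplified filters. A minimal PySpark sketch (the path is made up, and whether anything appears under PushedFilters depends on the source and version):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.range(0, 100).toDF("a").write.mode("overwrite").parquet("/tmp/pushdown_demo")
df = spark.read.parquet("/tmp/pushdown_demo")

# Arithmetic form: inspect the scan node's PushedFilters section in the output.
df.filter("a + 1 >= 3").explain()

# Pre-simplified form the proposed rule would produce:
df.filter("a >= 2").explain()
{code}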



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2020-01-02 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-27033.
--
Resolution: Won't Fix

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> Currently, filters like "select * from table where a + 1 >= 3" cannot be
> pushed down. This optimizer rule can convert it to "select * from table where
> a >= 3 - 1", and then to "select * from table where a >= 2".



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24906) Adaptively set split size for columnar file to ensure the task read data size fit expectation

2020-01-02 Thread Lior Chaga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006691#comment-17006691
 ] 

Lior Chaga commented on SPARK-24906:


Looking at the PR, I see it only using configurable estimations for Struct, Map
and Array types, not something based on row groups or any sampling of real data.
Is there additional development effort on this ticket?
I'd love to try and take part with my limited Scala skills.

Anyway, not every heuristic is suitable for every use case. Rough estimations
like the one in the attached PR are not good for complex schemas, or for data
sources that have many null values in the participating columns.
On the other hand, sampling the metadata has a higher cost and might be overkill
for the naive cases.
Perhaps a solution might be to introduce a pluggable estimation strategy,
providing the naive implementation by default but allowing Spark users to
provide their own?
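
To make that concrete, here is a rough sketch of the naive enlargement factor from the issue description, with the per-column size estimate as the piece a pluggable strategy could replace (all names are made up; nothing like this exists in Spark today):

{code:python}
from typing import Callable, Dict, List

# Naive per-column size estimates in bytes; a pluggable strategy could swap this
# for row-group statistics or sampled metadata instead.
NAIVE_SIZES: Dict[str, int] = {"int": 4, "long": 8, "float": 4, "double": 8}


def split_enlargement_factor(table_schema: Dict[str, str],
                             selected_columns: List[str],
                             estimate: Callable[[str], int] = NAIVE_SIZES.get) -> float:
    """Ratio of the full row width to the width of the columns actually read."""
    full_width = sum(estimate(t) for t in table_schema.values())
    selected_width = sum(estimate(table_schema[c]) for c in selected_columns)
    return full_width / selected_width


# The example from the description: 20 int + 20 long columns, query touches one of each.
schema = {"i%d" % n: "int" for n in range(20)}
schema.update({"l%d" % n: "long" for n in range(20)})
factor = split_enlargement_factor(schema, ["i0", "l0"])
print(factor)                            # 20.0, i.e. (20*4 + 20*8) / (4 + 8)
print(int(factor * 128 * 1024 * 1024))   # an adaptively enlarged maxPartitionBytes
{code}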

> Adaptively set split size for columnar file to ensure the task read data size 
> fit expectation
> -
>
> Key: SPARK-24906
> URL: https://issues.apache.org/jira/browse/SPARK-24906
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Jason Guo
>Priority: Major
> Attachments: image-2018-07-24-20-26-32-441.png, 
> image-2018-07-24-20-28-06-269.png, image-2018-07-24-20-29-24-797.png, 
> image-2018-07-24-20-30-24-552.png
>
>
> For columnar files, when Spark SQL reads a table, each split will be 128 MB by
> default since spark.sql.files.maxPartitionBytes defaults to 128MB. Even when a
> user sets it to a large value, such as 512MB, a task may read only a few MB or
> even hundreds of KB, because the table (Parquet) may consist of dozens of
> columns while the SQL only needs a few of them, and Spark will prune the
> unnecessary columns.
>  
> In this case, Spark's DataSourceScanExec can enlarge maxPartitionBytes
> adaptively.
> For example, suppose there are 40 columns, 20 integer and another 20 long. When
> a query uses one integer column and one long column, maxPartitionBytes should
> be 20 times larger: (20*4+20*8) / (4+8) = 20.
>  
> With this optimization, the number of tasks will be smaller and the job will
> run faster. More importantly, for a very large cluster (more than 10 thousand
> nodes), it will relieve the RM's scheduling pressure.
>  
> Here is the test
>  
> The table named test2 has more than 40 columns and there are more than 5 TB 
> data each hour.
> When we issue a very simple query 
>  
> {code:java}
> select count(device_id) from test2 where date=20180708 and hour='23'{code}
>  
> There are 72176 tasks and the duration of the job is 4.8 minutes
> !image-2018-07-24-20-26-32-441.png!
>  
> Most tasks last less than 1 second and read less than 1.5 MB data
> !image-2018-07-24-20-28-06-269.png!
>  
> After the optimization, there are only 1615 tasks and the job lasts only 30
> seconds. It is almost 10 times faster.
> !image-2018-07-24-20-29-24-797.png!
>  
> The median of read data is 44.2MB. 
> !image-2018-07-24-20-30-24-552.png!
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30407) reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe

2020-01-02 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30407?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30407:
---
Description: 
Working on [https://github.com/apache/spark/pull/26813]. With AQE enabled, the
metric info of AdaptiveSparkPlanExec is not reset when running the test
DataFrameCallbackSuite#get numRows metrics by callback.

 

  was:
Working on [PR#26813|[https://github.com/apache/spark/pull/26813]]. With on 
AQE, the metric info of AdaptiveSparkPlanExec does not reset when running the 
test DataFrameCallbackSuite#get numRows metrics by callback.

 


> reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe
> 
>
> Key: SPARK-30407
> URL: https://issues.apache.org/jira/browse/SPARK-30407
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> Working on [https://github.com/apache/spark/pull/26813]. With AQE enabled, the
> metric info of AdaptiveSparkPlanExec is not reset when running the test
> DataFrameCallbackSuite#get numRows metrics by callback.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30407) reset the metrics info of AdaptiveSparkPlanExec plan when enable aqe

2020-01-02 Thread Ke Jia (Jira)
Ke Jia created SPARK-30407:
--

 Summary: reset the metrics info of AdaptiveSparkPlanExec plan when 
enable aqe
 Key: SPARK-30407
 URL: https://issues.apache.org/jira/browse/SPARK-30407
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


Working on [PR#26813|[https://github.com/apache/spark/pull/26813]]. With AQE
enabled, the metric info of AdaptiveSparkPlanExec is not reset when running the
test DataFrameCallbackSuite#get numRows metrics by callback.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org