[jira] [Created] (SPARK-32641) withField + getField on null structs returns incorrect results

2020-08-17 Thread fqaiser94 (Jira)
fqaiser94 created SPARK-32641:
-

 Summary: withField + getField on null structs returns incorrect 
results
 Key: SPARK-32641
 URL: https://issues.apache.org/jira/browse/SPARK-32641
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: fqaiser94


 

There is a bug in the way the {{SimplifyExtractValueOps}} optimizer rule 
(sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala)
is currently coded, which yields incorrect results in scenarios like the 
following: 
{code:java}
sql("SELECT CAST(NULL AS struct) struct_col")
.select($"struct_col".withField("d", lit(4)).getField("d").as("d"))

// currently returns this:
+---+
|d  |
+---+
|4  |
+---+

// when in fact it should return this: 
+----+
|d   |
+----+
|null|
+----+
{code}
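
A minimal sketch of what goes wrong, assuming a spark-shell session (so {{sql}}, 
{{$}}, and the {{org.apache.spark.sql.functions}} helpers are available); the 
struct type here is illustrative: 
{code:scala}
import org.apache.spark.sql.functions._

val df = sql("SELECT CAST(NULL AS struct<a:int>) struct_col")

// What the buggy simplification effectively computes: the added literal,
// regardless of whether the struct itself is null.
df.select(lit(4).as("d")).show()

// A null-safe equivalent of the intended semantics: the extracted value
// must stay null whenever the struct itself is null.
df.select(when($"struct_col".isNull, lit(null)).otherwise(lit(4)).as("d")).show()
{code}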
 

 






[jira] [Updated] (SPARK-32521) WithFields Expression should not be foldable

2020-08-03 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32521:
--
Description: 
The following query currently fails on the master branch: 
{code:scala}
sql("SELECT named_struct('a', 1, 'b', 2) a")
.select($"a".withField("c", lit(3)).as("a"))
.show(false)

// java.lang.UnsupportedOperationException: Cannot evaluate expression: 
with_fields(named_struct(a, 1, b, 2), c, 3)
{code}
This happens because the Catalyst optimizer tries to statically evaluate the 
{{WithFields}} expression (via the {{ConstantFolding}} rule); however, it cannot 
do so because {{WithFields}} is {{Unevaluable}}. 

The likely solution here is to change the {{WithFields}} expression so that it 
is never {{foldable}}. This should solve the issue because Catalyst will then 
no longer attempt to statically evaluate it. 

 

 

  was:
The following query currently fails on the master branch: 
{code:scala}
sql("SELECT named_struct('a', 1, 'b', 2) a")
.select($"a".withField("c", lit(3)).as("a"))
.show(false)

// java.lang.UnsupportedOperationException: Cannot evaluate expression: 
with_fields(named_struct(a, 1, b, 2), c, 3)
{code}
This happens because the Catalyst optimizer tries to statically evaluate the 
{{WithFields}} expression (via the {{ConstantFolding}} rule); however, it cannot 
do so because {{WithFields}} is {{Unevaluable}}. 

The likely solution here is to change the {{WithFields}} expression so that it 
is never {{foldable}}. This should solve the issue because Catalyst will then 
no longer attempt to statically evaluate it. 

 

 


> WithFields Expression should not be foldable
> 
>
> Key: SPARK-32521
> URL: https://issues.apache.org/jira/browse/SPARK-32521
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: fqaiser94
>Priority: Major
>
> The following query currently fails on the master branch: 
> {code:scala}
> sql("SELECT named_struct('a', 1, 'b', 2) a")
> .select($"a".withField("c", lit(3)).as("a"))
> .show(false)
> // java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> with_fields(named_struct(a, 1, b, 2), c, 3)
> {code}
> This happens because the Catalyst optimizer tries to statically evaluate the 
> {{WithFields}} expression (via the {{ConstantFolding}} rule); however, it 
> cannot do so because {{WithFields}} is {{Unevaluable}}. 
> The likely solution here is to change the {{WithFields}} expression so that 
> it is never {{foldable}}. This should solve the issue because Catalyst will 
> then no longer attempt to statically evaluate it. 
>  
>  






[jira] [Created] (SPARK-32521) WithFields Expression should not be foldable

2020-08-03 Thread fqaiser94 (Jira)
fqaiser94 created SPARK-32521:
-

 Summary: WithFields Expression should not be foldable
 Key: SPARK-32521
 URL: https://issues.apache.org/jira/browse/SPARK-32521
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: fqaiser94


The following query currently fails on the master branch: 
{code:scala}
sql("SELECT named_struct('a', 1, 'b', 2) a")
.select($"a".withField("c", lit(3)).as("a"))
.show(false)

// java.lang.UnsupportedOperationException: Cannot evaluate expression: 
with_fields(named_struct(a, 1, b, 2), c, 3)
{code}
This happens because the Catalyst optimizer tries to statically evaluate the 
{{WithFields}} expression (via the {{ConstantFolding}} rule); however, it cannot 
do so because {{WithFields}} is {{Unevaluable}}. 

The likely solution here is to change the {{WithFields}} expression so that it 
is never {{foldable}}. This should solve the issue because Catalyst will then 
no longer attempt to statically evaluate it. 
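
A rough sketch of the proposed change, assuming a simplified shape of the 
expression (the real class lives in Spark's catalyst module and has more 
members); pinning {{foldable}} to false keeps {{ConstantFolding}} from ever 
calling {{eval()}} on an {{Unevaluable}} node: 
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{Expression, Unevaluable}
import org.apache.spark.sql.types.DataType

// Simplified stand-in for the real WithFields expression; names and members
// here are illustrative, not Spark's actual source.
case class WithFieldsSketch(
    structExpr: Expression,
    names: Seq[String],
    valExprs: Seq[Expression]) extends Unevaluable {

  override def children: Seq[Expression] = structExpr +: valExprs
  override def nullable: Boolean = structExpr.nullable
  override def dataType: DataType = structExpr.dataType // sketch: ignores the added fields

  // The key change: never report this node as foldable, even when every
  // child is a foldable literal, so ConstantFolding leaves it alone.
  override def foldable: Boolean = false
}
{code}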

 

 






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column nested inside a StructType 
Column (with similar semantics to the existing {{drop}} method on {{Dataset}}).
It should also be able to handle deeply nested columns through the same API. 
This is similar to the {{withField}} method that was recently added in 
SPARK-31317, and we can likely re-use some of that infrastructure. 

The public-facing method signature should be something along the following 
lines: 

{noformat}
def dropFields(fieldNames: String*): Column
{noformat}


  was:
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).
This is similar to the {{withField}} method that was recently added in 
SPARK-31317 and can likely re-use some of that infrastructure. 

The public-facing method signature should be something along the following 
lines: 

{noformat}
def dropFields(fieldNames: String*): Column
{noformat}



> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class. 
> This method should allow users to drop a column nested inside a StructType 
> Column (with similar semantics to the existing {{drop}} method on 
> {{Dataset}}).
> It should also be able to handle deeply nested columns through the same API. 
> This is similar to the {{withField}} method that was recently added in 
> SPARK-31317, and we can likely re-use some of that infrastructure.
> The public-facing method signature should be something along the following 
> lines: 
> {noformat}
> def dropFields(fieldNames: String*): Column
> {noformat}
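
A hypothetical usage sketch of the proposed API (it follows the signature and 
semantics described in this ticket; this is not a shipped implementation): 
{code:scala}
import org.apache.spark.sql.functions._

val df = sql("SELECT named_struct('a', 1, 'b', 2, 'c', 3) struct_col")

// Drop a single field nested inside the struct column; the remaining
// fields are left untouched.
df.select($"struct_col".dropFields("b").as("struct_col"))
// expected schema: struct_col: struct<a:int, c:int>
{code}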






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).
This is similar to the {{withField}} method that was recently added in 
SPARK-31317 and can likely re-use some of that infrastructure. 

The public-facing method signature should be something along the following 
lines: 

{noformat}
def dropFields(fieldNames: String*): Column
{noformat}


  was:
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).
This is similar to the {{withField}} method that was recently added in 
SPARK-31317 and can likely re-use some of that infrastructure. 


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class. 
> This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).
> This is similar to the {{withField}} method that was recently added in 
> SPARK-31317 and can likely re-use some of that infrastructure. 
> The public-facing method signature should be something along the following 
> lines: 
> {noformat}
> def dropFields(fieldNames: String*): Column
> {noformat}






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class. 
This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).
This is similar to the {{withField}} method that was recently added in 
SPARK-31317 and can likely re-use some of that infrastructure. 

  was:
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class. 
> This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).
> This is similar to the {{withField}} method that was recently added in 
> SPARK-31317 and can likely re-use some of that infrastructure. 






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231), add a new 
{{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).

  was:
Based on the discussions in the parent ticket (SPARK-22231) and following on, 
add a new {{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231), add a new 
> {{dropFields}} method to the {{Column}} class.
>  This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22231) and following on, 
add a new {{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).

  was:
Based on the discussions in the parent ticket (SPARK-22241) and following on, 
add a new {{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22231) and following on, 
> add a new {{dropFields}} method to the {{Column}} class.
>  This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the extensive discussions in the parent ticket, it was determined we 
should add a new {{dropFields}} method to the {{Column}} class.
This method should allow users to drop a column nested inside another 
StructType Column, with similar semantics to the {{drop}} method on {{Dataset}}.
Ideally, this method should be able to handle dropping columns at arbitrary 
levels of nesting in a StructType Column. 




  was:
Based on the discussions in the parent ticket, add a new {{dropFields}} method 
to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the extensive discussions in the parent ticket, it was determined we 
> should add a new {{dropFields}} method to the {{Column}} class.
> This method should allow users to drop a column nested inside another 
> StructType Column, with similar semantics to the {{drop}} method on 
> {{Dataset}}.
> Ideally, this method should be able to handle dropping columns at arbitrary 
> levels of nesting in a StructType Column. 






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket (SPARK-22241) and following on, 
add a new {{dropFields}} method to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).

  was:
Based on the extensive discussions in the parent ticket, it was determined we 
should add a new {{dropFields}} method to the {{Column}} class.
This method should allow users to drop a column nested inside another 
StructType Column, with similar semantics to the {{drop}} method on {{Dataset}}.
Ideally, this method should be able to handle dropping columns at arbitrary 
levels of nesting in a StructType Column. 





> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket (SPARK-22241) and following on, 
> add a new {{dropFields}} method to the {{Column}} class.
>  This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket, add a new {{dropFields}} method 
to the {{Column}} class.
 This method should allow users to drop a column {{nested inside another 
StructType Column}} (with similar semantics to the {{drop}} method on 
{{Dataset}}).

  was:
Based on the discussions in the parent ticket, 

Add a new {{dropFields}} method to the {{Column}} class.
This method should allow users to drop a {{StructField}} in a {{StructType}} 
column (with similar semantics to the {{drop}} method on {{Dataset}}).


> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket, add a new {{dropFields}} 
> method to the {{Column}} class.
>  This method should allow users to drop a column {{nested inside another 
> StructType Column}} (with similar semantics to the {{drop}} method on 
> {{Dataset}}).






[jira] [Updated] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

fqaiser94 updated SPARK-32511:
--
Description: 
Based on the discussions in the parent ticket, 

Add a new {{dropFields}} method to the {{Column}} class.
This method should allow users to drop a {{StructField}} in a {{StructType}} 
column (with similar semantics to the {{drop}} method on {{Dataset}}).

> Add dropFields method to Column class
> -
>
> Key: SPARK-32511
> URL: https://issues.apache.org/jira/browse/SPARK-32511
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: fqaiser94
>Priority: Major
>
> Based on the discussions in the parent ticket, 
> Add a new {{dropFields}} method to the {{Column}} class.
> This method should allow users to drop a {{StructField}} in a {{StructType}} 
> column (with similar semantics to the {{drop}} method on {{Dataset}}).






[jira] [Created] (SPARK-32511) Add dropFields method to Column class

2020-07-31 Thread fqaiser94 (Jira)
fqaiser94 created SPARK-32511:
-

 Summary: Add dropFields method to Column class
 Key: SPARK-32511
 URL: https://issues.apache.org/jira/browse/SPARK-32511
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: fqaiser94









[jira] [Commented] (SPARK-31317) Add withField method to Column class

2020-07-07 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152894#comment-17152894
 ] 

fqaiser94 commented on SPARK-31317:
---

Done. 

> Add withField method to Column class
> 
>
> Key: SPARK-31317
> URL: https://issues.apache.org/jira/browse/SPARK-31317
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: DB Tsai
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-03-31 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072212#comment-17072212
 ] 

fqaiser94 commented on SPARK-22231:
---

Makes sense, thanks!

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We often need to work on the nested list of structs by applying functions 
> to them. Sometimes we're dropping or adding new columns in the nested list 
> of structs. Currently, there is no easy solution in open source Apache Spark 
> to perform those operations using SQL primitives; many people just convert 
> the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be performed, we 
> cannot leverage the columnar format, and the serialization and 
> deserialization cost becomes really huge even when we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to open source it in Spark upstream. We would like to socialize the 
> API design to see if we missed any use-cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and then applies the function to the nested 
> dataframe. Here is an example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: struct (containsNull = true)
> // |    |    |-- a: integer (nullable = true)
> // |    |    |-- b: double (nullable = true)
> result.show(false)
> // +---+----+----------------------+
> // |foo|bar |items                 |
> // +---+----+----------------------+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 |20.0|[[20,21.0], [21,22.0]]|
> // +---+----+----------------------+
> {code}

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-03-31 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071945#comment-17071945
 ] 

fqaiser94 commented on SPARK-22231:
---

Excellent, thanks Reynold. 
If somebody could assign this ticket to me, that would be great as I have the 
code ready for all 3 methods. 
Pull request for the first one ({{withField}}) is ready here: 
[https://github.com/apache/spark/pull/27066]
Looking forward to the reviews. 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- title_id: integer (nullable = true)
>  |    |    |-- scores: double (nullable = true)
> ...
> {code}
> We often need to work on the nested list of structs by applying functions 
> to them. Sometimes we're dropping or adding new columns in the nested list 
> of structs. Currently, there is no easy solution in open source Apache Spark 
> to perform those operations using SQL primitives; many people just convert 
> the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be performed, we 
> cannot leverage the columnar format, and the serialization and 
> deserialization cost becomes really huge even when we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to open source it in Spark upstream. We would like to socialize the 
> API design to see if we missed any use-cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and then applies the function to the nested 
> dataframe. Here is an example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: struct (containsNull = true)
> // |    |    |-- a: integer (nullable = true)
> // ||

[jira] [Commented] (SPARK-16483) Unifying struct fields and columns

2020-01-26 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023920#comment-17023920
 ] 

fqaiser94 commented on SPARK-16483:
---

This is very similar to 
[SPARK-22231|https://issues.apache.org/jira/browse/SPARK-22231#] where there 
has been more discussion. 

> Unifying struct fields and columns
> --
>
> Key: SPARK-16483
> URL: https://issues.apache.org/jira/browse/SPARK-16483
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Simeon Simeonov
>Priority: Major
>  Labels: sql
>
> This issue comes as a result of an exchange with Michael Armbrust outside of 
> the usual JIRA/dev list channels.
> DataFrame provides a full set of manipulation operations for top-level 
> columns. They can be added, removed, modified and renamed. The same is not 
> yet true about fields inside structs, even though, from a logical standpoint, 
> Spark users may very well want to perform the same operations on struct fields, 
> especially since automatic schema discovery from JSON input tends to create 
> deeply nested structs.
> Common use-cases include:
>  - Remove and/or rename struct field(s) to adjust the schema
>  - Fix a data quality issue with a struct field (update/rewrite)
> To do this with the existing API by hand requires manually calling 
> {{named_struct}} and listing all fields, including ones we don't want to 
> manipulate. This leads to complex, fragile code that cannot survive schema 
> evolution.
> It would be far better if the various APIs that can now manipulate top-level 
> columns were extended to handle struct fields at arbitrary locations or, 
> alternatively, if we introduced new APIs for modifying any field in a 
> dataframe, whether it is a top-level one or one nested inside a struct.
> Purely for discussion purposes (overloaded methods are not shown):
> {code:java}
> class Column(val expr: Expression) extends Logging {
>   // ...
>   // matches Dataset.schema semantics
>   def schema: StructType
>   // matches Dataset.select() semantics
>   // '* support allows multiple new fields to be added easily, saving 
> cumbersome repeated withColumn() calls
>   def select(cols: Column*): Column
>   // matches Dataset.withColumn() semantics of add or replace
>   def withColumn(colName: String, col: Column): Column
>   // matches Dataset.drop() semantics
>   def drop(colName: String): Column
> }
> class Dataset[T] ... {
>   // ...
>   // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema)
>   def cast(newSchema: StructType): DataFrame
> }
> {code}
> The benefit of the above API is that it unifies manipulating top-level & 
> nested columns. The addition of {{schema}} and {{select()}} to {{Column}} 
> allows for nested field reordering, casting, etc., which is important in data 
> exchange scenarios where field position matters. That's also the reason to 
> add {{cast}} to {{Dataset}}: it improves consistency and readability (with 
> method chaining). Another way to think of {{Dataset.cast}} is as the Spark 
> schema equivalent of {{Dataset.as}}. {{as}} is to {{cast}} as a Scala 
> encodable type is to a {{StructType}} instance.
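
A sketch of the {{named_struct}}-style workaround described above, with 
hypothetical column names ({{address}}, {{street}}, {{city}}, {{country}} are 
illustrative only): rewriting a single nested field today means re-listing 
every sibling field by hand. 
{code:scala}
import org.apache.spark.sql.functions._

val df = sql(
  "SELECT named_struct('street', '1 Main St', 'city', 'Toronto', 'country', 'ca') address")

// Rebuild the entire struct just to change one field; any field not
// re-listed here would silently disappear after a schema change.
df.withColumn("address", struct(
  $"address.street".as("street"),
  $"address.city".as("city"),
  upper($"address.country").as("country") // the only field we wanted to change
))
{code}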






[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-26 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023919#comment-17023919
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/26/20 8:35 PM:


[~rxin] no problems, take your time.

Apparently the Spark 3.0 code freeze is coming up soon, so for those who are 
interested in seeing these features sooner rather than later, I wanted to share 
a [library|https://github.com/fqaiser94/mse] I've written that adds 
{{withField}}, {{withFieldRenamed}}, and {{dropFields}} methods to the Column 
class implicitly. The signatures of the methods follow pretty much what we've 
discussed so far in this ticket. Hopefully it's helpful to others. 


was (Author: fqaiser94):
@rxin no problems, take your time.

Apparently the Spark 3.0 code freeze is coming up soon, so for those who are 
interested in seeing these features sooner rather than later, I wanted to share 
a [library|https://github.com/fqaiser94/mse] I've written that adds 
{{withField}}, {{withFieldRenamed}}, and {{dropFields}} methods to the Column 
class implicitly. The signatures of the methods follow pretty much what we've 
discussed so far in this ticket. Hopefully it's helpful to others. 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- title_id: integer (nullable = true)
>  |    |    |-- scores: double (nullable = true)
> ...
> {code}
> We often need to work on the nested list of structs by applying functions 
> to them. Sometimes we're dropping or adding new columns in the nested list 
> of structs. Currently, there is no easy solution in open source Apache Spark 
> to perform those operations using SQL primitives; many people just convert 
> the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be performed, we 
> cannot leverage the columnar format, and the serialization and 
> deserialization cost becomes really huge even when we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to open source it in Spark upstream. We would like to socialize the 
> API design to see if we missed any use-cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and then applies the function to the nested 
> dataframe. Here is an example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of 

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-26 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023919#comment-17023919
 ] 

fqaiser94 commented on SPARK-22231:
---

@rxin no problems, take your time.

Apparently the Spark 3.0 code freeze is coming up soon, so for those who are 
interested in seeing these features sooner rather than later, I wanted to share 
a [library|https://github.com/fqaiser94/mse] I've written that adds 
{{withField}}, {{withFieldRenamed}}, and {{dropFields}} methods to the Column 
class implicitly. The signatures of the methods follow pretty much what we've 
discussed so far in this ticket. Hopefully it's helpful to others. 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- title_id: integer (nullable = true)
>  |    |    |-- scores: double (nullable = true)
> ...
> {code}
> We often need to work on the nested list of structs by applying functions 
> to them. Sometimes we're dropping or adding new columns in the nested list 
> of structs. Currently, there is no easy solution in open source Apache Spark 
> to perform those operations using SQL primitives; many people just convert 
> the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be performed, we 
> cannot leverage the columnar format, and the serialization and 
> deserialization cost becomes really huge even when we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to open source it in Spark upstream. We would like to socialize the 
> API design to see if we missed any use-cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and then applies the function to the nested 
> dataframe. Here is an example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = 

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-13 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014252#comment-17014252
 ] 

fqaiser94 commented on SPARK-22231:
---

[~rxin] do you have any other questions/thoughts on this? 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema can look like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- title_id: integer (nullable = true)
>  |    |    |-- scores: double (nullable = true)
> ...
> {code}
> We often need to work on the nested list of structs by applying functions 
> to them. Sometimes we're dropping or adding new columns in the nested list 
> of structs. Currently, there is no easy solution in open source Apache Spark 
> to perform those operations using SQL primitives; many people just convert 
> the data into an RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as a workaround. This is extremely inefficient 
> because optimizations like predicate pushdown in SQL cannot be performed, we 
> cannot leverage the columnar format, and the serialization and 
> deserialization cost becomes really huge even when we just want to add a new 
> column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to open source it in Spark upstream. We would like to socialize the 
> API design to see if we missed any use-cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and then applies the function to the nested 
> dataframe. Here is an example:
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column:
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // |    |-- element: struct (containsNull = true)
> // |    |    |-- a: integer (nullable = true)
> // |    |    |-- b: double (nullable = true)
> result.show(false)
> // +---+----+----------------------+
> // |foo|bar |items                 |
> // +---+----+----------------------+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-03 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007543#comment-17007543
 ] 

fqaiser94 commented on SPARK-22231:
---

Just to be complete, there is a third option here. 
 We don't necessarily have to add new methods to Column or Dataset. 
 We could just add ({{add_field}}, {{rename_field}}, {{drop_field}}) methods to 
the {{org.apache.spark.sql.functions}} Object. 
 That said, I still feel it's more natural for these methods to belong to 
Column. 
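
For concreteness, hypothetical signatures for that third option (illustrative 
only; these functions were never part of Spark): 
{code:scala}
import org.apache.spark.sql.Column

object StructFunctions {
  def add_field(struct: Column, fieldName: String, value: Column): Column = ???
  def rename_field(struct: Column, existingName: String, newName: String): Column = ???
  def drop_field(struct: Column, fieldName: String): Column = ???
}
{code}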

Regarding your question: *Can withField modify a nested field itself?*
 Short answer: no, not in the implementation I was imagining.

Long answer: to be honest, I initially came across this Jira while trying to 
write a method that would do exactly that (i.e. modify a deeply nested field 
directly). 
 However, if you were to write {{Column.withField}} so that it can *only* 
modify StructFields at the top level of a StructType column passed to it, this 
is flexible enough to support not only all of the use-cases mentioned so far 
in this Jira (in a succinct way) but *also* to write a new method that builds 
on top of {{Column.withField}} (or the underlying {{BinaryExpression}}) to 
directly modify nested fields. Here is a quick implementation of this (that 
I'm not proud of) which could be included in the 
{{org.apache.spark.sql.functions}} object:
{code:java}
object functions {
  import org.apache.spark.sql.catalyst.expressions.NamedExpression
  import org.apache.spark.sql.functions._

  /**
   * @param nestedStructField e.g. $"a.b", where a and b are both StructType
   * @param fieldName         field to add/replace in nestedStructField, based on name
   * @param fieldValue        value to assign to fieldName
   * @return a copy of the top-level struct column with fieldName added/replaced
   */
  def add_nested_field(
  nestedStructField: Column,
  fieldName: String,
  fieldValue: Column): Column = {
val parentColumnAndChildFieldNamePairs: Seq[(Column, String)] = {
  // TODO: handle case where nested Column name contains escaped "."
  val parentColumnNames = getAllNestedStructColumnNames(nestedStructField)
      val childFieldNames =
        parentColumnNames.drop(1).map(_.split("\\.").last) :+ fieldName
  parentColumnNames.map(col).zip(childFieldNames)
}

val updatedNestedStructColumn = parentColumnAndChildFieldNamePairs.reverse
  .foldLeft(fieldValue) {
case (childCol, (parentCol, childFieldName)) =>
  parentCol.withField(childFieldName, childCol)
  }

updatedNestedStructColumn
  }

  private def getAllNestedStructColumnNames(structColumn: Column): Seq[String] = {
    val nameParts = structColumn.expr.asInstanceOf[NamedExpression]
      .toAttribute.name.split("\\.").toSeq

var result = Seq.empty[String]
var i = 0
while (i < nameParts.length) {
  result = result :+ nameParts.slice(0, i + 1).mkString(".")
  i += 1
}
result
  }
}

{code}
Which could then be applied as follows:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(Row(Row(1, 2, 3), Row(4, null, 6))))),
  StructType(Seq(
    StructField("a", StructType(Seq(
      StructField("a", StructType(Seq(
        StructField("a", IntegerType),
        StructField("b", IntegerType),
        StructField("c", IntegerType)))),
      StructField("b", StructType(Seq(
        StructField("a", IntegerType),
        StructField("b", IntegerType),
        StructField("c", IntegerType)))))))))
).cache

data.show(false)
+--------------------+
|a                   |
+--------------------+
|[[1, 2, 3], [4,, 6]]|
+--------------------+

data.printSchema
root
 |-- a: struct (nullable = true)
 |    |-- a: struct (nullable = true)
 |    |    |-- a: integer (nullable = true)
 |    |    |-- b: integer (nullable = true)
 |    |    |-- c: integer (nullable = true)
 |    |-- b: struct (nullable = true)
 |    |    |-- a: integer (nullable = true)
 |    |    |-- b: integer (nullable = true)
 |    |    |-- c: integer (nullable = true)

// replace deeply nested field
data.withColumn("newA", functions.add_nested_field($"a.b", "b", 
lit(5))).show(false)
+--+--+
|a |newA  |
+--+--+
|[[1, 2, 3], [4,, 6]]  |[[1, 2, 3], [4, 5, 6]]|
+--+--+

// add new deeply nested field
data.withColumn("newA", functions.add_nested_field($"a.b", "d", 
lit("hello"))).show(false)
++---+
|a   |newA   |
++---+
|[[1, 2, 3], [4,, 6]]|[[1, 2, 3], [4,, 6, hello]]|
++---+

{code}
 Another reason why I would want this feature as a separate method is because 
for {{Column.withField}} to support modifying 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:28 AM:
---

I feel that the {{withField}} and {{withFieldRenamed}} methods make more sense 
to be on Column rather than Dataframe for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method on Column, you can add or replace deeply nested 
StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
"a", $"a.b.a".withField(
  "b", lit(5).show(false)
+---+
|a                                  |
+---+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+---+
{code}
You can see fully reproducible examples of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to 
Column, you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)))))))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
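
A {{MapType(StructType, StructType)}} column would look similar; a rough sketch, assuming Spark 3.0's {{transform_values}} together with the proposed {{withField}}:
{code:java}
import org.apache.spark.sql.functions.{lit, transform_values}

// Hypothetical map-of-structs example: add a field "d" to every value struct.
val mapData = spark.sql("SELECT map('k1', named_struct('a', 1, 'b', 2)) AS m")
mapData
  .withColumn("newM", transform_values('m, (_, v) => v.withField("d", lit("hello"))))
  .show(false)
{code}
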
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column): Dataframe}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String): Dataframe}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * A {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column; for now, rebuilding the struct covers it 
(see the sketch below).
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping a StructField if a reference to one is passed to it. I'm not sure how 
challenging this would be to implement, but presumably it should be possible. 
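
For reference, a minimal sketch of the rebuild-the-struct workaround mentioned above (column and field names are illustrative):
{code:java}
import org.apache.spark.sql.functions.struct

// Hypothetical workaround: "drop" field b from struct column a by
// rebuilding the struct from the fields we want to keep.
df.withColumn("a", struct($"a.a".as("a"), $"a.c".as("c")))
{code}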


was (Author: fqaiser94):
I feel that the {{withField}} and {{withFieldRenamed}} methods make more sense 
to be on Column rather than Dataframe for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:14 AM:
---

I feel that the {{withField}} and {{withFieldRenamed}} methods make more sense 
to be on Column rather than Dataframe for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method I added to Column in my PR, you can add or replace deeply 
nested StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
    "a", $"a.b.a".withField(
      "b", lit(5))))).show(false)
+-----------------------------------+
|a                                  |
+-----------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+-----------------------------------+
{code}
You can see fully reproducible examples of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to Column, 
you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)))))))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column): Dataframe}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String): Dataframe}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column.
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping StructField if a reference to a StructField is passed to it. I'm not 
sure how challenging this would be to implement but presumably this should be 
possible. 


was (Author: fqaiser94):
I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:07 AM:
---

I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method I added to Column in my PR, you can add or replace deeply 
nested StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
    "a", $"a.b.a".withField(
      "b", lit(5))))).show(false)
+-----------------------------------+
|a                                  |
+-----------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+-----------------------------------+
{code}
You can see fully reproducible examples of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to Column, 
you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)))))))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column): Dataframe}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String): Dataframe}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column.
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping StructField if a reference to a StructField is passed to it. I'm not 
sure how challenging this would be to implement but presumably this should be 
possible. 


was (Author: fqaiser94):
I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on 


[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191
 ] 

fqaiser94 commented on SPARK-22231:
---

I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be 
on Column for the following reasons:

+*Reason #1*+

These methods are intended to only ever operate on {{StructType}} columns. 
 It does *not* make much sense to me to add methods to Dataframe that operate 
only on {{StructType}} columns. 
 It does however make sense to me to add methods to Column that operate only on 
{{StructType}} columns and there is lots of precedent for this e.g. the 
existing {{getField}} method on Column operates only on {{StructType}} columns, 
the existing {{getItem}} method on Column operates only on {{ArrayType}} and 
{{MapType}} columns, etc.

+*Reason #2*+

I'm struggling to see how one would express operations on deeply nested 
StructFields easily if you were to put these methods on Dataframe. Using the 
{{withField}} method I added to Column in my PR, you can add or replace deeply 
nested StructFields like this:
{code:java}
data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+

data.withColumn("a", 'a.withField(
  "b", $"a.b".withField(
    "a", $"a.b.a".withField(
      "b", lit(5))))).show(false)
+-----------------------------------+
|a                                  |
+-----------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]|
+-----------------------------------+
{code}
You can see fully reproducible examples of this in my PR: 
[https://github.com/apache/spark/pull/27066]

Another common use-case that would be hard to express by adding the methods to 
Dataframe would be operations on {{ArrayType(StructType)}} or 
{{MapType(StructType, StructType)}} columns. By adding the methods to Column, 
you can express such operations naturally using the recently added 
higher-order-functions feature in Spark:
{code:java}
val data = spark.createDataFrame(
  sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))),
  StructType(Seq(
    StructField("array", ArrayType(StructType(Seq(
      StructField("a", IntegerType),
      StructField("b", IntegerType),
      StructField("c", IntegerType)))))))
).cache

data.show(false)
+----------------------+
|array                 |
+----------------------+
|[[1, 2, 3], [4, 5, 6]]|
+----------------------+

data.withColumn("newArray", transform('array, structElem =>
  structElem.withField("d", lit("hello")))).show(false)
+----------------------+------------------------------------+
|array                 |newArray                            |
+----------------------+------------------------------------+
|[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]|
+----------------------+------------------------------------+
{code}
 +*Reason #3*+

To add these methods to Dataframe, we would need signatures that would look 
strange compared to the methods available on Dataframe today. Off the top of my 
head, their signatures could look like this:
 * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column)}}
 Returns a new Dataframe with a new Column (colName) containing a copy of 
structColumn with field added/replaced based on name.
 * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String)}}
 Returns a new Dataframe with existingFieldName in structColumn renamed to 
newFieldName.

Maybe these signatures could be refined further? I'm not sure. 

 

For these reasons, it seems more natural and logical to me to have the 
{{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. 
Regarding the proposed {{drop}} method, I think it's debatable: 
 * {{drop}} method could be added to Column. 
 I can't at present think of a particular use-case which would necessitate 
adding the {{drop}} method to Column.
 * The existing {{drop}} method on Dataframe could be augmented to support 
dropping StructField if a reference to a StructField is passed to it. I'm not 
sure how challenging this would be to implement but presumably this should be 
possible. 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 

[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007116#comment-17007116
 ] 

fqaiser94 edited comment on SPARK-22231 at 1/2/20 11:25 PM:


Hi folks, I can personally affirm that this would be a valuable feature to have 
in Spark. Looking around, it's clear to me that other people have a need for 
this feature as well e.g. SPARK-16483,  [Mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Applying-transformation-on-a-struct-inside-an-array-td18934.html],
 etc. 

A few things have changed since the last comments in this discussion. Support 
for higher-order functions has been 
[added|https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html]
 to the Apache Spark project and in particular, there are now {{transform}} and 
{{filter}} functions available for operating on ArrayType columns. These would 
be the equivalent of the {{mapItems}} and {{filterItems}} functions that were 
previously proposed in this ticket. To complete this ticket, then, I think all 
that is needed is to add the {{withField}}, {{withFieldRenamed}}, and {{drop}} 
methods to the Column class. From the discussion, the signatures for these new 
methods should be as follows (a usage sketch follows the list): 
 - {{def withField(fieldName: String, field: Column): Column}}
 Returns a new StructType column with field added/replaced based on name.

 - {{def drop(fieldNames: String*): Column}}
 Returns a new StructType column with the named fields dropped.

 - {{def withFieldRenamed(existingName: String, newName: String): Column}}
 Returns a new StructType column with field renamed.
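
For illustration, usage could look like this (a sketch; only {{withField}} is implemented in the PR below, while {{withFieldRenamed}} and {{drop}} remain proposals):
{code:java}
import org.apache.spark.sql.functions.lit

// Sketch of the proposed Column API; withFieldRenamed and drop are proposals.
val df = spark.sql("SELECT named_struct('a', 1, 'b', 2) AS s")
df.select(
  $"s".withField("c", lit(3)).as("added"),       // add/replace a field
  $"s".withFieldRenamed("a", "x").as("renamed"), // proposed
  $"s".drop("b").as("dropped")                   // proposed
).show(false)
{code}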

Since it didn't seem like anybody was actively working on this, I went ahead 
and created a pull request to add a {{withField}} method to the {{Column}} 
class that conforms with the specs discussed in this ticket. You can review the 
PR here: [https://github.com/apache/spark/pull/27066]

As this is my first PR to the Apache Spark project, I wanted to keep the PR 
small. However, I wouldn't mind writing the {{drop}} and {{withFieldRenamed}} 
methods as well in separate PRs once the current PR is accepted. 


was (Author: fqaiser94):
Hi folks, I can personally affirm that this would be a valuable feature to have 
in Spark. Looking around, it's clear to me that other people have a need for 
this feature as well e.g. SPARK-16483,  [Mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Applying-transformation-on-a-struct-inside-an-array-td18934.html],
 etc. 

A few things have changed since the last comments in this discussion. Support 
for higher-order functions has been 
[added|https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html]
 to the Apache Spark project and in particular, there are now {{transform}} and 
{{filter}} functions available for operating on ArrayType columns. These would 
be the equivalent of the {{mapItems}} and {{filterItems}} functions that were 
previously proposed in this ticket. To complete this ticket, then, I think all 
that is needed is to add the {{withField}}, {{withFieldRenamed}}, and {{drop}} 
methods to the Column class. From the discussion, the signatures for these new 
methods should be as follows: 
 - {{def withField(fieldName: String, field: Column): Column}}
 Returns a new StructType column with field added/replaced based on name.

 - {{def drop(colNames: String*)}}
 Returns a new StructType column with field dropped.

 - {{def withFieldRenamed(existingName: String, newName: String): Column}}
 Returns a new StructType column with field renamed.

Since it didn't seem like anybody was actively working on this, I went ahead 
and created a pull request to add a {{withField}} method to the {{Column}} 
class that conforms with the specs discussed in this ticket. You can review the 
PR here: [https://github.com/apache/spark/pull/27066]

As this is my first PR to the Apache Spark project, I wanted to keep the PR 
small. However, I wouldn't mind writing the {{drop}} and {{withFieldRenamed}} 
methods as well in separate PRs once the current PR is accepted. 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before 

[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-01-02 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007116#comment-17007116
 ] 

fqaiser94 commented on SPARK-22231:
---

Hi folks, I can personally affirm that this would be a valuable feature to have 
in Spark. Looking around, it's clear to me that other people have a need for 
this feature as well e.g. SPARK-16483,  [Mailing 
list|http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Applying-transformation-on-a-struct-inside-an-array-td18934.html],
 etc. 

A few things have changed since the last comments in this discussion. Support 
for higher-order functions has been 
[added|https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html]
 to the Apache Spark project and in particular, there are now {{transform}} and 
{{filter}} functions available for operating on ArrayType columns. These would 
be the equivalent of the {{mapItems}} and {{filterItems}} functions that were 
previously proposed in this ticket. To complete this ticket, then, I think all 
that is needed is to add the {{withField}}, {{withFieldRenamed}}, and {{drop}} 
methods to the Column class. From the discussion, the signatures for these new 
methods should be as follows: 
 - {{def withField(fieldName: String, field: Column): Column}}
 Returns a new StructType column with field added/replaced based on name.

 - {{def drop(colNames: String*)}}
 Returns a new StructType column with field dropped.

 - {{def withFieldRenamed(existingName: String, newName: String): Column}}
 Returns a new StructType column with field renamed.

Since it didn't seem like anybody was actively working on this, I went ahead 
and created a pull request to add a {{withField}} method to the {{Column}} 
class that conforms with the specs discussed in this ticket. You can review the 
PR here: [https://github.com/apache/spark/pull/27066]

As this is my first PR to the Apache Spark project, I wanted to keep the PR 
small. However, I wouldn't mind writing the {{drop}} and {{withFieldRenamed}} 
methods as well in separate PRs once the current PR is accepted. 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema looks like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  |    |-- element: struct (containsNull = false)
>  |    |    |-- title_id: integer (nullable = true)
>  |    |    |-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into an RDD to work on the nested level of data, and 
> then reconstruct the new dataframe as a workaround. This is extremely 
> inefficient because all the optimizations like predicate pushdown in SQL 
> cannot be performed, we cannot leverage the columnar format, and the 
> serialization and deserialization cost becomes really huge even when we just 
> want to add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to open source it in Spark upstream. We would like to socialize the 
> API design to see if we have missed any use-cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and then applies the function on nested
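
In current Spark, the behavior described for *mapItems* can be sketched with the built-in {{transform}} higher-order function (the field names and arithmetic are illustrative):
{code:java}
import org.apache.spark.sql.functions.transform

// Approximation of the described mapItems using Spark's transform HOF:
// apply a Column => Column function to every struct in the items array.
df.withColumn("items", transform('items, item =>
  item.withField("scores", item.getField("scores") * 2)))
{code}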