[jira] [Created] (SPARK-32641) withField + getField on null structs returns incorrect results
fqaiser94 created SPARK-32641: - Summary: withField + getField on null structs returns incorrect results Key: SPARK-32641 URL: https://issues.apache.org/jira/browse/SPARK-32641 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: fqaiser94

There is a bug in the way the optimizer rule SimplifyExtractValueOps (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala) is currently coded, which yields incorrect results in scenarios like the following:

{code:java}
sql("SELECT CAST(NULL AS struct) struct_col")
  .select($"struct_col".withField("d", lit(4)).getField("d").as("d"))

// currently returns this:
+---+
|d  |
+---+
|4  |
+---+

// when in fact it should return this:
+----+
|d   |
+----+
|null|
+----+
{code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
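The unsound rewrite is easy to see even without Spark. In the plain-Scala model below (illustrative names only, not Spark's actual internals), a NULL struct is `None`: `withField` on a NULL struct must stay NULL, so a subsequent `getField` must also be NULL, while a rewrite in the spirit of SimplifyExtractValueOps returns the added literal unconditionally:

```scala
object NullStructModel {
  // A struct value is a map of field names to values; None models a NULL struct.
  type Struct = Option[Map[String, Any]]

  // withField: adding a field to a NULL struct must leave it NULL
  def withField(s: Struct, name: String, value: Any): Struct =
    s.map(_ + (name -> value))

  // getField: extracting a field from a NULL struct must yield NULL (None)
  def getField(s: Struct, name: String): Option[Any] =
    s.flatMap(_.get(name))

  // The unsound rewrite: getField(withField(s, n, v), n) ==> v
  // This is only valid when s is known to be non-null.
  def unsoundRewrite(s: Struct, name: String, value: Any): Option[Any] =
    Some(value)
}
```

For a non-null struct the two paths agree, which is why the rewrite is only safe when the child struct is provably non-null.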
[jira] [Updated] (SPARK-32521) WithFields Expression should not be foldable
[ https://issues.apache.org/jira/browse/SPARK-32521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32521: -- Description: The following query currently fails on the master branch: {code:scala} sql("SELECT named_struct('a', 1, 'b', 2) a") .select($"a".withField("c", lit(3)).as("a")) .show(false) // java.lang.UnsupportedOperationException: Cannot evaluate expression: with_fields(named_struct(a, 1, b, 2), c, 3) {code} This happens because the Catalyst optimizer tries to statically evaluate the {{WithFields Expression}} (via the {{ConstantFolding}} rule), however it cannot do so because {{WithFields Expression}} is {{Unevaluable}}. The likely solution here is to change {{WithFields Expression}} so that it is not {{foldable}} under any circumstance. This should solve the issue because then Catalyst will no longer attempt to statically evaluate {{WithFields Expression}} anymore. was: The following query currently fails on the master branch: {code:scala} sql("SELECT named_struct('a', 1, 'b', 2) a") .select($"a".withField("c", lit(3)).as("a")) .show(false) // java.lang.UnsupportedOperationException: Cannot evaluate expression: with_fields(named_struct(a, 1, b, 2), c, 3) {code} This happens because the Catalyst optimizer tries to statically evaluate the {{WithFields Expression}} (via the {{ConstantFolding}} rule), however it cannot do so because {{WithFields Expression}} is {{Unevaluable}}. The likely solution here is to change {{WithFields Expression}} so that it is not {{foldable}} under any circumstance. This should solve the issue because then Catalyst will no longer attempt to statically evaluate {{WithFields Expression}} anymore. 
> WithFields Expression should not be foldable > > > Key: SPARK-32521 > URL: https://issues.apache.org/jira/browse/SPARK-32521 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: fqaiser94 >Priority: Major > > The following query currently fails on the master branch: > {code:scala} > sql("SELECT named_struct('a', 1, 'b', 2) a") > .select($"a".withField("c", lit(3)).as("a")) > .show(false) > // java.lang.UnsupportedOperationException: Cannot evaluate expression: > with_fields(named_struct(a, 1, b, 2), c, 3) > {code} > This happens because the Catalyst optimizer tries to statically evaluate the > {{WithFields Expression}} (via the {{ConstantFolding}} rule), however it > cannot do so because {{WithFields Expression}} is {{Unevaluable}}. > The likely solution here is to change {{WithFields Expression}} so that it is > not {{foldable}} under any circumstance. This should solve the issue because > then Catalyst will no longer attempt to statically evaluate {{WithFields > Expression}} anymore.
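The interaction described above can be sketched without Catalyst. In the plain-Scala model below (class and rule names are illustrative, not Spark's actual classes), a constant-folding pass eagerly evaluates anything marked foldable, so an expression that is foldable yet unevaluable throws, while a non-foldable one is simply left alone:

```scala
object ConstantFoldingModel {
  sealed trait Expr { def foldable: Boolean; def eval(): Any }

  // A literal is trivially foldable and evaluable.
  final case class Literal(value: Any) extends Expr {
    val foldable = true
    def eval(): Any = value
  }

  // Models an Unevaluable expression such as WithFields: eval() always throws.
  final case class WithFieldsLike(children: Seq[Expr], foldable: Boolean) extends Expr {
    def eval(): Any =
      throw new UnsupportedOperationException("Cannot evaluate expression")
  }

  // A ConstantFolding-style rule: statically evaluate anything marked foldable.
  def constantFold(e: Expr): Expr =
    if (e.foldable) Literal(e.eval()) else e
}
```

In this model, the proposed fix corresponds to always constructing the expression with `foldable = false`, so the folding rule never attempts the evaluation.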
[jira] [Created] (SPARK-32521) WithFields Expression should not be foldable
fqaiser94 created SPARK-32521: - Summary: WithFields Expression should not be foldable Key: SPARK-32521 URL: https://issues.apache.org/jira/browse/SPARK-32521 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: fqaiser94 The following query currently fails on the master branch: {code:scala} sql("SELECT named_struct('a', 1, 'b', 2) a") .select($"a".withField("c", lit(3)).as("a")) .show(false) // java.lang.UnsupportedOperationException: Cannot evaluate expression: with_fields(named_struct(a, 1, b, 2), c, 3) {code} This happens because the Catalyst optimizer tries to statically evaluate the {{WithFields Expression}} (via the {{ConstantFolding}} rule), however it cannot do so because {{WithFields Expression}} is {{Unevaluable}}. The likely solution here is to change {{WithFields Expression}} so that it is not {{foldable}} under any circumstance. This should solve the issue because then Catalyst will no longer attempt to statically evaluate {{WithFields Expression}} anymore.
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column nested inside a StructType Column (with similar semantics to the existing {{drop}} method on {{Dataset}}). It should also be able to handle deeply nested columns through the same API. This is similar to the {{withField}} method that was recently added in SPARK-31317 and likely we can re-use some of that "infrastructure." The public-facing method signature should be something along the following lines: {noformat} def dropFields(fieldNames: String*): Column {noformat} was: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). This is similar to the {{withField}} that was recently added in SPARK-31317 and likely can re-use some of that infrastructure. The public-facing method signature should be something along the following lines: {noformat} def dropFields(fieldNames: String*): Column {noformat} > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column nested inside a StructType > Column (with similar semantics to the existing {{drop}} method on > {{Dataset}}). > It should also be able to handle deeply nested columns through the same API. 
> This is similar to the {{withField}} method that was recently added in > SPARK-31317 and likely we can re-use some of that "infrastructure." > The public-facing method signature should be something along the following > lines: > {noformat} > def dropFields(fieldNames: String*): Column > {noformat}
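A plain-Scala sketch of the proposed semantics, using a `Map` as a stand-in for a StructType column (the real method would operate on Catalyst expressions, and the dot-separated path syntax used below for the deeply nested case is an illustrative assumption, not a settled API):

```scala
object DropFieldsModel {
  // A struct column modeled as a map of field names to values.
  type Struct = Map[String, Any]

  // Mirrors the proposed varargs signature: def dropFields(fieldNames: String*): Column
  def dropFields(struct: Struct, fieldNames: String*): Struct =
    fieldNames.foldLeft(struct)(_ - _)

  // One way to reach deeper levels of nesting, via a dot-separated path.
  def dropNested(struct: Struct, path: String): Struct =
    path.split("\\.", 2) match {
      case Array(leaf) => struct - leaf
      case Array(head, rest) =>
        struct.get(head) match {
          case Some(inner: Map[String, Any] @unchecked) =>
            struct + (head -> dropNested(inner, rest))
          case _ => struct // no such struct field: nothing to drop
        }
    }
}
```

Like `Dataset.drop`, dropping a field that does not exist is a no-op rather than an error in this sketch.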
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). This is similar to the {{withField}} that was recently added in SPARK-31317 and likely can re-use some of that infrastructure. The public-facing method signature should be something along the following lines: {noformat} def dropFields(fieldNames: String*): Column {noformat} was: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). This is similar to the {{withField}} that was recently added in SPARK-31317 and likely can re-use some of that infrastructure. > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}). > This is similar to the {{withField}} that was recently added in SPARK-31317 > and likely can re-use some of that infrastructure. 
> The public-facing method signature should be something along the following > lines: > {noformat} > def dropFields(fieldNames: String*): Column > {noformat}
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). This is similar to the {{withField}} that was recently added in SPARK-31317 and likely can re-use some of that infrastructure. was: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}). > This is similar to the {{withField}} that was recently added in SPARK-31317 > and likely can re-use some of that infrastructure.
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231), add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). was: Based on the discussions in the parent ticket (SPARK-22231) and following on, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231), add a new > {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}).
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22231) and following on, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). was: Based on the discussions in the parent ticket (SPARK-22241) and following on, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22231) and following on, > add a new {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}).
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the extensive discussions in the parent ticket, it was determined we should add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column nested inside another StructType Column, with similar semantics to the {{drop}} method on {{Dataset}}. Ideally, this method should be able to handle dropping columns at arbitrary levels of nesting in a StructType Column. was: Based on the discussions in the parent ticket, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the extensive discussions in the parent ticket, it was determined we > should add a new {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column nested inside another > StructType Column, with similar semantics to the {{drop}} method on > {{Dataset}}. > Ideally, this method should be able to handle dropping columns at arbitrary > levels of nesting in a StructType Column.
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket (SPARK-22241) and following on, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). was: Based on the extensive discussions in the parent ticket, it was determined we should add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column nested inside another StructType Column, with similar semantics to the {{drop}} method on {{Dataset}}. Ideally, this method should be able to handle dropping columns at arbitrary levels of nesting in a StructType Column. > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket (SPARK-22241) and following on, > add a new {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}).
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket, add a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a column {{nested inside another StructType Column}} (with similar semantics to the {{drop}} method on {{Dataset}}). was: Based on the discussions in the parent ticket, Added a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a {{StructField}} in a {{StructType}} column (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket, add a new {{dropFields}} > method to the {{Column}} class. > This method should allow users to drop a column {{nested inside another > StructType Column}} (with similar semantics to the {{drop}} method on > {{Dataset}}).
[jira] [Updated] (SPARK-32511) Add dropFields method to Column class
[ https://issues.apache.org/jira/browse/SPARK-32511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] fqaiser94 updated SPARK-32511: -- Description: Based on the discussions in the parent ticket, Added a new {{dropFields}} method to the {{Column}} class. This method should allow users to drop a {{StructField}} in a {{StructType}} column (with similar semantics to the {{drop}} method on {{Dataset}}). > Add dropFields method to Column class > - > > Key: SPARK-32511 > URL: https://issues.apache.org/jira/browse/SPARK-32511 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: fqaiser94 >Priority: Major > > Based on the discussions in the parent ticket, > Added a new {{dropFields}} method to the {{Column}} class. > This method should allow users to drop a {{StructField}} in a {{StructType}} > column (with similar semantics to the {{drop}} method on {{Dataset}}).
[jira] [Created] (SPARK-32511) Add dropFields method to Column class
fqaiser94 created SPARK-32511: - Summary: Add dropFields method to Column class Key: SPARK-32511 URL: https://issues.apache.org/jira/browse/SPARK-32511 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: fqaiser94
[jira] [Commented] (SPARK-31317) Add withField method to Column class
[ https://issues.apache.org/jira/browse/SPARK-31317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17152894#comment-17152894 ] fqaiser94 commented on SPARK-31317: --- Done. > Add withField method to Column class > > > Key: SPARK-31317 > URL: https://issues.apache.org/jira/browse/SPARK-31317 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Priority: Major > Fix For: 3.1.0
[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072212#comment-17072212 ] fqaiser94 commented on SPARK-22231: --- Makes sense, thanks! > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find the great > content to fulfill the unique tastes of our members. Before building a > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the contexts describing the setting for where a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id at a > given country, our data schema can be looked like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. Sometimes, we're dropping or adding new columns in the > nested list of structs. 
Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into RDD to work on the nested level of data, and then > reconstruct the new dataframe as workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL can not be > performed, we can not leverage on the columnar format, and the serialization > and deserialization cost becomes really huge even we just want to add a new > column in the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we miss any use-case. > The first API we added is *mapItems* on dataframe which take a function from > *Column* to *Column*, and then apply the function on nested dataframe. Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // |    |-- element: double (containsNull = true) > result.show() > // +---+----+--------------------+ > // |foo| bar|               items| > // +---+----+--------------------+ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+----+--------------------+ > {code} > Now, with the ability of applying a function in the nested dataframe, we can > add a new function, *withColumn* in *Column* to add or replace the existing > column that has the same name in the nested list of struct. 
Here are two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // |    |-- element: struct (containsNull = true) > // |    |    |-- a: integer (nullable = true) > // |    |    |-- b: double (nullable = true) > result.show(false) > // +---+----+----------------------+ > // |foo|bar |items                 | > // +---+----+----------------------+ > // |10 |10.0|[[10,11.0], [11,12.0]]| > // |20 |20.0|[[20,21.0], [21,22.0]]| > // +---+----+----------------------+ > {code}
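The *mapItems* proposal quoted above can be modeled in plain Scala over local collections (illustrative only; the proposed API operates on DataFrame columns, not on a `Seq` of case classes):

```scala
object MapItemsModel {
  // Same shape as the Data case class in the quoted example.
  final case class Data(foo: Int, bar: Double, items: Seq[Double])

  // mapItems: apply an element-level function to every item of the nested
  // collection, leaving the top-level columns untouched.
  def mapItems(rows: Seq[Data])(f: Double => Double): Seq[Data] =
    rows.map(row => row.copy(items = row.items.map(f)))
}
```

The point of the proposal is that Spark would perform this transformation on the columnar representation directly, avoiding the RDD round-trip and its serialization cost.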
[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071945#comment-17071945 ] fqaiser94 commented on SPARK-22231: --- Excellent, thanks Reynold. If somebody could assign this ticket to me, that would be great as I have the code ready for all 3 methods. Pull request for the first one ({{withField}}) is ready here: [https://github.com/apache/spark/pull/27066] Looking forward to the reviews. > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Jeremy Smith >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find the great > content to fulfill the unique tastes of our members. Before building a > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the contexts describing the setting for where a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id at a > given country, our data schema can be looked like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. 
Sometimes, we're dropping or adding new columns in the > nested list of structs. Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into RDD to work on the nested level of data, and then > reconstruct the new dataframe as workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL can not be > performed, we can not leverage on the columnar format, and the serialization > and deserialization cost becomes really huge even we just want to add a new > column in the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we miss any use-case. > The first API we added is *mapItems* on dataframe which take a function from > *Column* to *Column*, and then apply the function on nested dataframe. Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+++ > // |foo| bar| items| > // +---+++ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+++ > {code} > Now, with the ability of applying a function in the nested dataframe, we can > add a new function, *withColumn* in *Column* to add or replace the existing > column that has the same name in the nested list of struct. 
Here is two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a: integer (nullable = true) > // ||
[jira] [Commented] (SPARK-16483) Unifying struct fields and columns
[ https://issues.apache.org/jira/browse/SPARK-16483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023920#comment-17023920 ] fqaiser94 commented on SPARK-16483: --- This is very similar to [SPARK-22231|https://issues.apache.org/jira/browse/SPARK-22231] where there has been more discussion. > Unifying struct fields and columns > -- > > Key: SPARK-16483 > URL: https://issues.apache.org/jira/browse/SPARK-16483 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.1 >Reporter: Simeon Simeonov >Priority: Major > Labels: sql > > This issue comes as a result of an exchange with Michael Armbrust outside of > the usual JIRA/dev list channels. > DataFrame provides a full set of manipulation operations for top-level > columns. They can be added, removed, modified and renamed. The same is not > yet true about fields inside structs; from a logical standpoint, Spark users > may very well want to perform the same operations on struct fields, > especially since automatic schema discovery from JSON input tends to create > deeply nested structs. > Common use-cases include: > - Remove and/or rename struct field(s) to adjust the schema > - Fix a data quality issue with a struct field (update/rewrite) > To do this with the existing API by hand requires manually calling > {{named_struct}} and listing all fields, including ones we don't want to > manipulate. This leads to complex, fragile code that cannot survive schema > evolution. > It would be far better if the various APIs that can now manipulate top-level > columns were extended to handle struct fields at arbitrary locations or, > alternatively, if we introduced new APIs for modifying any field in a > dataframe, whether it is a top-level one or one nested inside a struct. > Purely for discussion purposes (overloaded methods are not shown): > {code:java} > class Column(val expr: Expression) extends Logging { > // ... 
> // matches Dataset.schema semantics > def schema: StructType > // matches Dataset.select() semantics > // '* support allows multiple new fields to be added easily, saving > cumbersome repeated withColumn() calls > def select(cols: Column*): Column > // matches Dataset.withColumn() semantics of add or replace > def withColumn(colName: String, col: Column): Column > // matches Dataset.drop() semantics > def drop(colName: String): Column > } > class Dataset[T] ... { > // ... > // Equivalent to sparkSession.createDataset(toDF.rdd, newSchema) > def cast(newSchema: StructType): DataFrame > } > {code} > The benefit of the above API is that it unifies manipulating top-level & > nested columns. The addition of {{schema}} and {{select()}} to {{Column}} > allows for nested field reordering, casting, etc., which is important in data > exchange scenarios where field position matters. That's also the reason to > add {{cast}} to {{Dataset}}: it improves consistency and readability (with > method chaining). Another way to think of {{Dataset.cast}} is as the Spark > schema equivalent of {{Dataset.as}}. {{as}} is to {{cast}} as a Scala > encodable type is to a {{StructType}} instance.
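To make the proposed struct-field operations concrete, here is a minimal, Spark-free sketch in Python that models a struct column value as a dict. The function names mirror the proposed Column methods but everything here is illustrative, not the actual Spark API; the null-struct behavior is an assumption based on the semantics discussed in the related tickets:

```python
# Spark-free sketch: a struct value modeled as a Python dict.
# `with_field` / `drop_field` are hypothetical stand-ins for the proposed
# Column-level methods, not real Spark functions.

def with_field(struct, name, value):
    """Return a copy of `struct` with field `name` added or replaced.
    A null struct stays null (assumed semantics)."""
    if struct is None:
        return None
    return {**struct, name: value}

def drop_field(struct, name):
    """Return a copy of `struct` without field `name`; null stays null."""
    if struct is None:
        return None
    return {k: v for k, v in struct.items() if k != name}

s = {"a": 1, "b": 2}
print(with_field(s, "c", 3))     # {'a': 1, 'b': 2, 'c': 3}
print(drop_field(s, "a"))        # {'b': 2}
print(with_field(None, "c", 3))  # None
```

The point of the sketch is that both operations return a rebuilt struct rather than mutating in place, which is how any Catalyst implementation would have to behave as well.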
[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023919#comment-17023919 ] fqaiser94 edited comment on SPARK-22231 at 1/26/20 8:35 PM: [~rxin] no problems, take your time. Apparently the Spark 3.0 code freeze is coming up soon, so for those who are interested in seeing these features sooner rather than later, I wanted to share a [library|https://github.com/fqaiser94/mse] I've written that adds {{withField}}, {{withFieldRenamed}}, and {{dropFields}} methods to the Column class implicitly. The signatures of the methods follow pretty much what we've discussed so far in this ticket. Hopefully it's helpful to others. > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Jeremy Smith >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find great > content that fulfills the unique tastes of our members. Before building > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the context describing the setting where a model is to be used (e.g. > profiles, country, time, device, etc.)
Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id at a > given country, our data schema can look like this: > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. Sometimes, we're dropping or adding new columns in the > nested list of structs. Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into an RDD to work on the nested level of data, and then > reconstruct the new dataframe as a workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL cannot be > performed, we cannot leverage the columnar format, and the serialization > and deserialization cost becomes really huge even if we just want to add a new > column at the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we missed any use-cases. > The first API we added is *mapItems* on dataframe, which takes a function from > *Column* to *Column* and applies the function to the nested dataframe.
Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+----+--------------------+ > // |foo| bar|               items| > // +---+----+--------------------+ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+----+--------------------+ > {code} > Now, with the ability to apply a function in the nested dataframe, we can > add a new function, *withColumn* in *Column*, to add or replace the existing > column that has the same name in the nested list of
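The mapItems semantics described above can be sketched in plain Python, with rows modeled as dicts; `map_items` is a hypothetical stand-in for the proposed Spark API, not a real function:

```python
# Plain-Python sketch of the proposed mapItems semantics: apply a function
# to every element of a nested array column, leaving other columns untouched.

def map_items(rows, col_name, fn):
    """Return new rows with fn applied to each element of rows[i][col_name]."""
    return [{**row, col_name: [fn(x) for x in row[col_name]]} for row in rows]

data = [
    {"foo": 10, "bar": 10.0, "items": [10.1, 10.2, 10.3, 10.4]},
    {"foo": 20, "bar": 20.0, "items": [20.1, 20.2, 20.3, 20.4]},
]

result = map_items(data, "items", lambda item: item * 2.0)
print(result[0]["items"])  # [20.2, 20.4, 20.6, 20.8]
```

Each row is rebuilt rather than mutated, mirroring the immutable-column semantics of the Spark proposal.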
[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17014252#comment-17014252 ] fqaiser94 commented on SPARK-22231: --- [~rxin] do you have any other questions/thoughts on this? > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Jeremy Smith >Priority: Major
[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007543#comment-17007543 ] fqaiser94 commented on SPARK-22231: --- Just to be complete, there is a third option here. We don't necessarily have to add new methods to Column or Dataset. We could just add ({{add_field}}, {{rename_field}}, {{drop_field}}) methods to the {{org.apache.spark.sql.functions}} Object. That said, I still feel it's more natural for these methods to belong to Column. Regarding your question: *Can withField modify a nested field itself?* Short answer: no, not in the implementation I was imagining. Long answer: to be honest, I initially came across this Jira while trying to write a method that would do exactly that (i.e. modify a deeply nested field directly). However, if you were to write {{Column.withField}} so that it can *only* modify StructFields at the top level of a StructType column passed to it, this is flexible enough to support not only all of the use-cases mentioned so far in this Jira (in a succinct way) but *also* you could write a new method that builds on top of {{Column.withField}} (or the underlying {{BinaryExpression}}) to directly modify nested fields. Here is a quick implementation of this (that I'm not proud of) which could be included in the {{org.apache.spark.sql.functions}} Object: {code:java} object functions { import org.apache.spark.sql.catalyst.expressions.NamedExpression import org.apache.spark.sql.functions._ /** * * @param nestedStructField : e.g. $"a.b" where a and b are both StructType. * @param fieldName : Field to add/replace in nestedStructField based on name.
* @param fieldValue : Value to assign to fieldName * @return a copy of the top-level nestedStructField column with fieldName added/replaced */ def add_nested_field( nestedStructField: Column, fieldName: String, fieldValue: Column): Column = { val parentColumnAndChildFieldNamePairs: Seq[(Column, String)] = { // TODO: handle case where nested Column name contains escaped "." val parentColumnNames = getAllNestedStructColumnNames(nestedStructField) val childFieldNames = parentColumnNames.drop(1).map(_.split("\\.").last) :+ fieldName parentColumnNames.map(col).zip(childFieldNames) } val updatedNestedStructColumn = parentColumnAndChildFieldNamePairs.reverse .foldLeft(fieldValue) { case (childCol, (parentCol, childFieldName)) => parentCol.withField(childFieldName, childCol) } updatedNestedStructColumn } private def getAllNestedStructColumnNames(structColumn: Column): Seq[String] = { val nameParts = structColumn.expr.asInstanceOf[NamedExpression].toAttribute.name.split("\\.").toSeq var result = Seq.empty[String] var i = 0 while (i < nameParts.length) { result = result :+ nameParts.slice(0, i + 1).mkString(".") i += 1 } result } } {code} Which could then be applied as follows: {code:java} val data = spark.createDataFrame( sc.parallelize(Seq(Row(Row(Row(1, 2, 3), Row(4, null, 6))))), StructType(Seq(StructField("a", StructType(Seq(StructField("a", StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))), StructField("b", StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType))))))))))).cache data.show(false) +--------------------+ |a | +--------------------+ |[[1, 2, 3], [4,, 6]]| +--------------------+ data.printSchema root |-- a: struct (nullable = true) ||-- a: struct (nullable = true) |||-- a: integer (nullable = true) |||-- b: integer (nullable = true) |||-- c: integer (nullable = true) ||-- b: struct (nullable = true) |||-- a: integer (nullable = true) |||-- b: integer (nullable = true) |||-- c: integer (nullable = true) // replace deeply nested field data.withColumn("newA", functions.add_nested_field($"a.b", "b", lit(5))).show(false) +--------------------+----------------------+ |a |newA | +--------------------+----------------------+ |[[1, 2, 3], [4,, 6]]|[[1, 2, 3], [4, 5, 6]]| +--------------------+----------------------+ // add new deeply nested field data.withColumn("newA", functions.add_nested_field($"a.b", "d", lit("hello"))).show(false) +--------------------+---------------------------+ |a |newA | +--------------------+---------------------------+ |[[1, 2, 3], [4,, 6]]|[[1, 2, 3], [4,, 6, hello]]| +--------------------+---------------------------+ {code} Another reason why I would want this feature as a separate method is because for {{Column.withField}} to support modifying
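The fold in add_nested_field above — rebuild each parent struct from the leaf upwards, replacing exactly one child at every level — can be sketched language-agnostically in Python, with structs modeled as dicts. This is an illustration of the technique, not the Spark code:

```python
# Sketch of the nested-field update: to modify a field at dotted path "a.b",
# recursively rebuild every parent struct along the path, replacing exactly
# one child at each level. Structs are modeled as plain dicts.

def add_nested_field(struct, path, name, value):
    """Add/replace `name` in the struct found at dotted `path`, returning a
    new top-level struct; parents along the path are shallow-copied."""
    if not path:
        return {**struct, name: value}
    head, _, rest = path.partition(".")
    return {**struct, head: add_nested_field(struct[head], rest, name, value)}

data = {"a": {"a": {"a": 1, "b": 2, "c": 3},
              "b": {"a": 4, "b": None, "c": 6}}}

# replace a deeply nested field (a.b.b = 5)
print(add_nested_field(data, "a.b", "b", 5)["a"]["b"])   # {'a': 4, 'b': 5, 'c': 6}
# add a new deeply nested field (a.b.d = "hello")
print(add_nested_field(data, "a.b", "d", "hello")["a"]["b"])
```

Note that the original `data` is left untouched: each call returns a new structure, which matches the copy-on-write behavior a Catalyst expression would have.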
[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191 ] fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:28 AM: --- I feel that the {{withField}} and {{withFieldRenamed}} methods make more sense on Column rather than Dataframe for the following reasons: +*Reason #1*+ These methods are intended to only ever operate on {{StructType}} columns. It does *not* make much sense to me to add methods to Dataframe that operate only on {{StructType}} columns. It does however make sense to me to add methods to Column that operate only on {{StructType}} columns, and there is lots of precedent for this, e.g. the existing {{getField}} method on Column operates only on {{StructType}} columns, the existing {{getItem}} method on Column operates only on {{ArrayType}} and {{MapType}} columns, etc. +*Reason #2*+ I'm struggling to see how one would express operations on deeply nested StructFields easily if you were to put these methods on Dataframe. Using the {{withField}} method on Column, you can add or replace deeply nested StructFields like this: {code:java} data.show(false) +---------------------------------+ |a | +---------------------------------+ |[[1, 2, 3], [[4,, 6], [7, 8, 9]]]| +---------------------------------+ data.withColumn("a", 'a.withField( "b", $"a.b".withField( "a", $"a.b.a".withField( "b", lit(5))))).show(false) +-----------------------------------+ |a | +-----------------------------------+ |[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]| +-----------------------------------+ {code} You can see fully reproducible examples of this in my PR: [https://github.com/apache/spark/pull/27066] Another common use-case that would be hard to express by adding the methods to Dataframe would be operations on {{ArrayType(StructType)}} or {{MapType(StructType, StructType)}} columns.
By adding the methods to Column, you can express such operations naturally using the recently added higher-order-functions feature in Spark: {code:java} val data = spark.createDataFrame( sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6))))), StructType(Seq(StructField("array", ArrayType(StructType(Seq(StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))))))).cache data.show(false) +----------------------+ |array | +----------------------+ |[[1, 2, 3], [4, 5, 6]]| +----------------------+ data.withColumn("newArray", transform('array, structElem => structElem.withField("d", lit("hello")))).show(false) +----------------------+------------------------------------+ |array |newArray | +----------------------+------------------------------------+ |[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]| +----------------------+------------------------------------+ {code} +*Reason #3*+ To add these methods to Dataframe, we would need signatures that would look strange compared to the methods available on Dataframe today. Off the top of my head, their signatures could look like this: * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column): Dataframe}} Returns a new Dataframe with a new Column (colName) containing a copy of structColumn with field added/replaced based on name. * {{def withFieldRenamed(structColumn: Column, existingFieldName: String, newFieldName: String): Dataframe}} Returns a new Dataframe with existingFieldName in structColumn renamed to newFieldName. Maybe these signatures could be refined further? I'm not sure. For these reasons, it seems more natural and logical to me to have the {{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. Regarding the proposed {{drop}} method, I think it's debatable: * A {{drop}} method could be added to Column. I can't at present think of a particular use-case which would necessitate adding the {{drop}} method to Column. * The existing {{drop}} method on Dataframe could be augmented to support dropping a StructField if a reference to a StructField is passed to it.
I'm not sure how challenging this would be to implement, but presumably it should be possible.
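The higher-order-function pattern above — transform over an ArrayType(StructType) column, adding a field to every struct element — can be sketched in plain Python. The names `transform` and `with_field` here are stand-ins modeled on the Spark discussion, operating on dicts and lists rather than Columns:

```python
# Sketch of the higher-order-function pattern: apply a field-adding function
# to every struct element of an array column. Structs are dicts, arrays are
# lists; `transform`/`with_field` are illustrative stand-ins, not Spark APIs.

def with_field(struct, name, value):
    """Return a copy of the struct with `name` added or replaced."""
    return {**struct, name: value}

def transform(arr, fn):
    """Apply fn to each element of an array value (like SQL's transform)."""
    return [fn(elem) for elem in arr]

rows = [{"array": [{"a": 1, "b": 2, "c": 3}, {"a": 4, "b": 5, "c": 6}]}]
new_rows = [
    {**row, "newArray": transform(row["array"],
                                  lambda s: with_field(s, "d", "hello"))}
    for row in rows
]
print(new_rows[0]["newArray"])
# [{'a': 1, 'b': 2, 'c': 3, 'd': 'hello'}, {'a': 4, 'b': 5, 'c': 6, 'd': 'hello'}]
```

Because the per-element operation is just a function from struct to struct, composing it with a generic array transform is enough; no array-specific variant of withField is needed.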
[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191 ] fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:14 AM: --- I feel that the {{withField}} and {{withFieldRenamed}} methods make more sense to be on Column rather than Dataframe for the following reasons: +*Reason #1*+ These methods are intended to only ever operate on {{StructType}} columns. It does *not* make much sense to me to add methods to Dataframe that operate only on {{StructType}} columns. It does however make sense to me to add methods to Column that operate only on {{StructType}} columns and there is lots of precedent for this e.g. the existing {{getField}} method on Column operates only on {{StructType}} columns, the existing {{getItem}} method on Column operates only on {{ArrayType}} and {{MapType}} columns, etc. +*Reason #2*+ I'm struggling to see how one would express operations on deeply nested StructFields easily if you were to put these methods on Dataframe. Using the {{withField}} method I added to Column in my PR, you can add or replace deeply nested StructFields like this: {code:java} data.show(false) +-+ |a | +-+ |[[1, 2, 3], [[4,, 6], [7, 8, 9]]]| +-+ data.withColumn("a", 'a.withField( "b", $"a.b".withField( "a", $"a.b.a".withField( "b", lit(5).show(false) +---+ |a | +---+ |[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]| +---+ {code} You can see a fully reproducible examples of this in my PR: [https://github.com/apache/spark/pull/27066] Another common use-case that would be hard to express by adding the methods to Dataframe would be operations on {{ArrayType(StructType)}} or {{MapType(}}{{StructType}}{{, StructType)}} columns. 
By adding the methods to Column, you can express such operations naturally using the recently added higher-order-functions feature in Spark: {code:java} val data = spark.createDataFrame( sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6), StructType(Seq( StructField("array", ArrayType(StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType) )) ).cache data.show(false) +--+ |array | +--+ |[[1, 2, 3], [4, 5, 6]]| +--+ data.withColumn("newArray", transform('array, structElem => structElem.withField("d", lit("hello".show(false) +--++ |array |newArray | +--++ |[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]| +--++ {code} +*Reason #3*+ To add these methods to Dataframe, we would need signatures that would look strange compared to the methods available on Dataframe today. Off the top of my head, their signatures could look like this: * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column)}}{{: Dataframe}} Returns a new Dataframe with a new Column (colName) containing a copy of structColumn with field added/replaced based on name. * {{def withFieldRenamed(}}{{structColumn}}{{: Column, existingFieldName: String, newFieldName: String): Dataframe}} Returns a new Dataframe with existingFieldName in structColumn renamed to newFieldName. Maybe these signatures could be refined further? I'm not sure. For these reasons, it seems more natural and logical to me to have the {{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. Regarding the proposed {{drop}} method, I think its debatable: * {{drop}} method could be added to Column. I can't at present think of a particular use-case which would necessitate adding the {{drop}} method to Column. * The existing {{drop}} method on Dataframe could be augmented to support dropping StructField if a reference to a StructField is passed to it. 
I'm not sure how challenging this would be to implement but presumably this should be possible. was (Author: fqaiser94): I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be on Column for the following reasons: +*Reason #1*+ These methods are intended to only ever operate on {{StructType}} columns. It does *not* make much sense to me to add methods to Dataframe that operate only on {{StructType}} columns. It does however make sense to me to add methods to Column that operate only on {{StructType}} columns and there is lots of precedent for this e.g. the existing {{getField}} method on
[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007191#comment-17007191 ] fqaiser94 edited comment on SPARK-22231 at 1/3/20 3:07 AM: --- I feel strongly that {{withField}} and {{withFieldRenamed}} methods should be on Column for the following reasons: +*Reason #1*+ These methods are intended to only ever operate on {{StructType}} columns. It does *not* make much sense to me to add methods to Dataframe that operate only on {{StructType}} columns. It does however make sense to me to add methods to Column that operate only on {{StructType}} columns and there is lots of precedent for this e.g. the existing {{getField}} method on Column operates only on {{StructType}} columns, the existing {{getItem}} method on Column operates only on {{ArrayType}} and {{MapType}} columns, etc. +*Reason #2*+ I'm struggling to see how one would express operations on deeply nested StructFields easily if you were to put these methods on Dataframe. Using the {{withField}} method I added to Column in my PR, you can add or replace deeply nested StructFields like this: {code:java} data.show(false) +-+ |a | +-+ |[[1, 2, 3], [[4,, 6], [7, 8, 9]]]| +-+ data.withColumn("a", 'a.withField( "b", $"a.b".withField( "a", $"a.b.a".withField( "b", lit(5).show(false) +---+ |a | +---+ |[[1, 2, 3], [[4, 5, 6], [7, 8, 9]]]| +---+ {code} You can see a fully reproducible examples of this in my PR: [https://github.com/apache/spark/pull/27066] Another common use-case that would be hard to express by adding the methods to Dataframe would be operations on {{ArrayType(StructType)}} or {{MapType(}}{{StructType}}{{, StructType)}} columns. 
By adding the methods to Column, you can express such operations naturally using the recently added higher-order-functions feature in Spark: {code:java} val data = spark.createDataFrame( sc.parallelize(Seq(Row(List(Row(1, 2, 3), Row(4, 5, 6), StructType(Seq( StructField("array", ArrayType(StructType(Seq( StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType) )) ).cache data.show(false) +--+ |array | +--+ |[[1, 2, 3], [4, 5, 6]]| +--+ data.withColumn("newArray", transform('array, structElem => structElem.withField("d", lit("hello".show(false) +--++ |array |newArray | +--++ |[[1, 2, 3], [4, 5, 6]]|[[1, 2, 3, hello], [4, 5, 6, hello]]| +--++ {code} +*Reason #3*+ To add these methods to Dataframe, we would need signatures that would look strange compared to the methods available on Dataframe today. Off the top of my head, their signatures could look like this: * {{def withField(colName: String, structColumn: Column, fieldName: String, field: Column)}}{{: Dataframe}}{{}} Returns a new Dataframe with a new Column (colName) containing a copy of structColumn with field added/replaced based on name. * {{def withFieldRenamed(}}{{structColumn}}{{: Column, existingFieldName: String, newFieldName: String): Dataframe}} Returns a new Dataframe with existingFieldName in structColumn renamed to newFieldName. Maybe these signatures could be refined further? I'm not sure. For these reasons, it seems more natural and logical to me to have the {{withField}} and {{withFieldRenamed}} methods on Column rather than Dataframe. Regarding the proposed {{drop}} method, I think its debatable: * {{drop}} method could be added to Column. I can't at present think of a particular use-case which would necessitate adding the {{drop}} method to Column. * The existing {{drop}} method on Dataframe could be augmented to support dropping StructField if a reference to a StructField is passed to it. 
I'm not sure how challenging this would be to implement, but presumably it should be possible.
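To make Reason #3 concrete, here is a rough sketch (purely hypothetical; this helper is not part of the PR, and the names are illustrative) of what a DataFrame-level {{withField}} would likely reduce to, assuming the Column-level {{withField}} from the PR exists:

{code:scala}
import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical DataFrame-level helper. Note that it can only target one
// struct per call and still delegates to a Column-level operation, which
// suggests Column is the natural home for this functionality.
def withField(df: DataFrame, colName: String, structColumn: Column,
              fieldName: String, field: Column): DataFrame =
  df.withColumn(colName, structColumn.withField(fieldName, field))
{code}

Updating a deeply nested field through such a helper would require repeated calls with intermediate {{getField}} projections, whereas the {{Column}} method simply chains.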
[jira] [Comment Edited] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17007116#comment-17007116 ] fqaiser94 edited comment on SPARK-22231 at 1/2/20 11:25 PM: Hi folks, I can personally affirm that this would be a valuable feature to have in Spark. Looking around, it's clear to me that other people have a need for this feature as well, e.g. SPARK-16483, [Mailing list|http://apache-spark-developers-list.1001551.n3.nabble.com/Spark-SQL-Applying-transformation-on-a-struct-inside-an-array-td18934.html], etc.

A few things have changed since the last comments in this discussion. Support for higher-order functions has been [added|https://databricks.com/blog/2018/11/16/introducing-new-built-in-functions-and-higher-order-functions-for-complex-data-types-in-apache-spark.html] to the Apache Spark project; in particular, there are now {{transform}} and {{filter}} functions available for operating on ArrayType columns. These are the equivalent of the {{mapItems}} and {{filterItems}} functions that were previously proposed in this ticket.

To complete this ticket, then, I think all that is needed is to add the {{withField}}, {{withFieldRenamed}}, and {{drop}} methods to the {{Column}} class. Summarizing the discussion, the signatures for these new methods should be as follows:
- {{def withField(fieldName: String, field: Column): Column}} Returns a new StructType column with the field added/replaced based on name.
- {{def drop(fieldNames: String*): Column}} Returns a new StructType column with the fields dropped.
- {{def withFieldRenamed(existingName: String, newName: String): Column}} Returns a new StructType column with the field renamed.

Since it didn't seem like anybody was actively working on this, I went ahead and created a pull request to add a {{withField}} method to the {{Column}} class that conforms with the specs discussed in this ticket.
You can review the PR here: [https://github.com/apache/spark/pull/27066] As this is my first PR to the Apache Spark project, I wanted to keep the PR small. However, I wouldn't mind writing the {{drop}} and {{withFieldRenamed}} methods as well in separate PRs once the current PR is accepted.
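For illustration, usage of the three proposed methods might look like the following (a sketch against the proposed signatures; of these, only {{withField}} is implemented in the linked PR):

{code:scala}
// Assuming df has a StructType column `a` with fields `b` and `c`:
df.withColumn("a", $"a".withField("d", lit(4)))      // add (or replace) field d
df.withColumn("a", $"a".withFieldRenamed("b", "x"))  // rename field b to x
df.withColumn("a", $"a".drop("c"))                   // drop field c
{code}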
> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-22231
>                 URL: https://issues.apache.org/jira/browse/SPARK-22231
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 2.2.0
>            Reporter: DB Tsai
>            Assignee: Jeremy Smith
>            Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great
> content to fulfill the unique tastes of our members. Before building a
> recommendation algorithm, we need to prepare the training, testing, and
> validation datasets in Apache Spark. Due to the nature of ranking problems,
> we have a nested list of items to be ranked in one column, and the top level
> is the contexts describing the setting for where a model is to be used (e.g.
> profiles, country, time, device, etc.) Here is a blog post describing the
> details: [Distributed Time Travel for Feature Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>
> To be more concrete, for the ranks of videos for a given profile_id at a
> given country, our data schema looks like this:
> {code:java}
> root
> |-- profile_id: long (nullable = true)
> |-- country_iso_code: string (nullable = true)
> |-- items: array (nullable = false)
> |    |-- element: struct (containsNull = false)
> |    |    |-- title_id: integer (nullable = true)
> |    |    |-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some
> functions on them. Sometimes, we're dropping or adding new columns in the
> nested list of structs.
> Currently, there is no easy solution in open-source Apache Spark to perform
> those operations using SQL primitives; many people just convert the data into
> an RDD to work on the nested level of data, and then reconstruct the new
> dataframe as a workaround. This is extremely inefficient because all the
> optimizations like predicate pushdown in SQL cannot be performed, we cannot
> leverage the columnar format, and the serialization and deserialization cost
> becomes really huge even if we just want to add a new column at the nested
> level.
> We built a solution internally at Netflix which we're very happy with. We
> plan to make it open source in Spark upstream. We would like to socialize the
> API design to see if we missed any use-cases.
> The first API we added is *mapItems* on dataframe, which takes a function from
> *Column* to *Column*, and then applies the function on nested