Hi Michael,

Well, for nested structs, I saw in the tests the behaviour defined by SPARK-12512 for the "a.b.c" handling in withColumn, and even if it's not ideal for me, I managed to make it work anyway like this:

> df.withColumn("a", struct(struct(myUDF(df("a.b.c"))))) // I didn't put back the aliases but you see what I mean

What I'd like to make work, in essence, is something like this:

> val someFunc : String => String = ???
> val myUDF = udf(someFunc)
> df.withColumn("a.b[*].c", myUDF(df("a.b[*].c")))

The fact is that, in order to be consistent with the previous API, maybe I'd have to put something like a struct(array(struct(…, which would be troublesome because I'd have to parse the arbitrary input string and create something like "a.b[*].c" => struct(array(struct(

I realise the ambiguity implied in this kind of column expression, but for now there doesn't seem to be a way to cleanly update data in place at an arbitrary depth. I'll try to work on a PR that would make this possible, but any pointers would be appreciated.

Regards,
Olivier.
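The string-parsing step mentioned above ("a.b[*].c" => struct(array(struct() can be sketched in plain Scala, without touching Spark. This is only an illustration under stated assumptions: the names `PathStep`, `Field`, `EachElement`, `parsePath` and `wrapperSkeleton` are hypothetical and not part of any Spark API, and the `[*]` suffix is assumed to mean "every element of this array". The sketch only renders the nesting as text; it does not build Column expressions.

```scala
object NestedPathSketch {
  // Hypothetical model of a column path; none of this is Spark API.
  sealed trait PathStep
  final case class Field(name: String) extends PathStep // a named field: "a", "b", "c"
  case object EachElement extends PathStep               // the assumed "[*]" marker

  // Split "a.b[*].c" on dots; a segment ending in "[*]" becomes
  // the field itself followed by an EachElement step.
  def parsePath(path: String): List[PathStep] =
    path.split('.').toList.flatMap { segment =>
      if (segment.endsWith("[*]")) List(Field(segment.dropRight(3)), EachElement)
      else List(Field(segment))
    }

  // Every step except the last is a container. Its kind is decided by the
  // step that follows it: a following Field means the current step holds
  // named fields (struct), a following EachElement means it is an array.
  def wrapperSkeleton(steps: List[PathStep]): String =
    steps.zip(steps.drop(1)).map {
      case (_, Field(_))    => "struct("
      case (_, EachElement) => "array("
    }.mkString

  def main(args: Array[String]): Unit = {
    println(parsePath("a.b[*].c"))                  // List(Field(a), Field(b), EachElement, Field(c))
    println(wrapperSkeleton(parsePath("a.b[*].c"))) // struct(array(struct(
    println(wrapperSkeleton(parsePath("a.b.c")))    // struct(struct(
  }
}
```

Turning this skeleton into real Column expressions is the hard part discussed in this thread: the sibling fields and aliases at each level would still have to be put back, and the array level needs some way to apply the function to each element.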

On Fri, Sep 16, 2016 12:42 AM, Michael Armbrust mich...@databricks.com wrote:

Is what you are looking for a withColumn that supports in-place modification of nested columns? Or is it some other problem?

On Wed, Sep 14, 2016 at 11:07 PM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

I tried to use the RowEncoder but got stuck along the way.

The main issue really is that, even if it's possible (however tedious) to pattern match Rows generically and target the nested field that you need to modify, Rows are an immutable data structure without a method like a case class's copy or any kind of lens to create a brand new object. I ended up stuck at the step "target and extract the field to update", without any way to update the original Row with the new value.

To sum up, I tried:

* using only the DataFrame API itself + my UDF, which works for nested structs as long as no arrays are along the way
* trying to create a UDF that can apply on a Row and pattern match recursively the path I needed to explore/modify
* trying to create a UDT, but we seem to be stuck in a strange middle ground with 2.0, because some parts of the API ended up private while some stayed public, making it impossible to use now (I'd be glad if I'm mistaken)

All of these failed for me, and I ended up converting the rows to JSON and updating them using JSONPath, which is… something I'd like to avoid, 'pretty please'.

On Thu, Sep 15, 2016 5:20 AM, Michael Allman mich...@videoamp.com wrote:

Hi Guys,

Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's not a public API, but it is publicly accessible. I used it recently to correct some bad data in a few nested columns in a dataframe. It wasn't an easy job, but it made it possible. In my particular case I was not working with arrays.

Olivier, I'm interested in seeing what you come up with.

Thanks,
Michael

On Sep 14, 2016, at 10:44 AM, Fred Reiss <freiss....@gmail.com> wrote:

+1 to this request.
I talked last week with a product group within IBM that is struggling with the same issue. It's pretty common in data cleaning applications for data in the early stages to have nested lists or sets with inconsistent or incomplete schema information.

Fred

On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

Hi everyone,

I'm currently trying to create a generic transformation mechanism on a DataFrame to modify an arbitrary column regardless of the underlying schema.

It's "relatively" straightforward for complex types like struct<struct<…>> to apply an arbitrary UDF on the column and replace the data "inside" the struct; however, I'm struggling to make it work for complex types containing arrays along the way, like struct<array<struct<…>>>.

Michael Armbrust seemed to allude on the mailing list/forum to a way of using Encoders to do that. I'd be interested in any pointers, especially considering that it's not possible to output any Row or GenericRowWithSchema from a UDF (thanks to https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657 it seems).

To sum up, I'd like to find a way to apply a transformation on complex nested datatypes (arrays and structs) on a DataFrame, updating the value itself.

Regards,
Olivier Girardot

Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94