I tried to use the RowEncoder but got stuck along the way.

The main issue is that even if it's possible (however tedious) to pattern match generically on Rows and target the nested field you need to modify, a Row is an immutable data structure with no method like a case class's copy and no kind of lens to create a brand new object. I ended up stuck at the step "target and extract the field to update", without any way to update the original Row with the new value.

To sum up, I tried:
* using only the DataFrame API itself plus my UDF, which works for nested structs as long as no arrays are along the way
* creating a UDF that can be applied to a Row and pattern match recursively along the path I needed to explore/modify
* creating a UDT, but we seem to be stuck in a strange middle ground with 2.0, because some parts of the API ended up private while others stayed public, making it impossible to use now (I'd be glad if I'm mistaken)

All of these failed for me, and I ended up converting the rows to JSON and updating them using JSONPath, which is… something I'd like to avoid, 'pretty please'.

On Thu, Sep 15, 2016 5:20 AM, Michael Allman mich...@videoamp.com wrote:

Hi Guys,

Have you tried org.apache.spark.sql.catalyst.encoders.RowEncoder? It's not a public API, but it is publicly accessible. I used it recently to correct some bad data in a few nested columns in a dataframe. It wasn't an easy job, but it made it possible. In my particular case I was not working with arrays.

Olivier, I'm interested in seeing what you come up with.

Thanks,
Michael

On Sep 14, 2016, at 10:44 AM, Fred Reiss <freiss....@gmail.com> wrote:

+1 to this request. I talked last week with a product group within IBM that is struggling with the same issue. It's pretty common in data cleaning applications for data in the early stages to have nested lists or sets with inconsistent or incomplete schema information.

Fred

On Tue, Sep 13, 2016 at 8:08 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

Hi everyone,

I'm currently trying to create a generic transformation mechanism on a DataFrame to modify an arbitrary column regardless of the underlying schema. It's "relatively" straightforward for complex types like struct<struct<…>> to apply an arbitrary UDF on the column and replace the data "inside" the struct; however, I'm struggling to make it work for complex types containing arrays along the way, like struct<array<struct<…>>>.

Michael Armbrust seemed to allude on the mailing list/forum to a way of using Encoders to do that. I'd be interested in any pointers, especially considering that it's not possible to output any Row or GenericRowWithSchema from a UDF (thanks to https://github.com/apache/spark/blob/v2.0.0/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala#L657 it seems).

To sum up, I'd like to find a way to apply a transformation on complex nested datatypes (arrays and struct) on a Dataframe, updating the value itself.

Regards,

Olivier Girardot

Olivier Girardot | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
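
For illustration only, the "update" step discussed in this thread (rebuild an immutable nested value along a path, mapping over any arrays met on the way) can be sketched outside of Spark. This is a plain-Python model under stated assumptions, not Spark code: structs are modelled as dicts and arrays as lists, and the names update_at, row and the field names are hypothetical. It does not address getting the result back into a Row or returning it from a UDF.

```python
from typing import Any, Callable, Sequence


def update_at(value: Any, path: Sequence[str], fn: Callable[[Any], Any]) -> Any:
    """Return a new value with `fn` applied to the field reached by `path`.

    Assumption: structs are dicts and arrays are lists (stand-ins, not Spark
    types). Nothing is mutated; every container on the path is rebuilt.
    """
    if not path:
        # End of the path: this is the value to transform.
        return fn(value)
    if value is None:
        # Null struct or array: there is nothing below it to update.
        return None
    if isinstance(value, list):
        # Array along the way: apply the same remaining path to each element.
        return [update_at(item, path, fn) for item in value]
    if not isinstance(value, dict):
        raise TypeError(f"cannot descend into {type(value).__name__} at {path[0]!r}")
    head, rest = path[0], path[1:]
    if head not in value:
        raise KeyError(head)
    # Copy the struct, replacing only the field that lies on the path.
    return {**value, head: update_at(value[head], rest, fn)}


# Hypothetical sample shaped like struct<array<struct<struct<...>>>>.
row = {
    "id": 1,
    "order": {
        "items": [
            {"sku": "a-1", "price": {"amount": 10.0, "currency": "eur"}},
            {"sku": "b-2", "price": {"amount": 4.5, "currency": "usd"}},
        ]
    },
}

fixed = update_at(row, ["order", "items", "price", "currency"], str.upper)
```

The array case does not consume a path segment, which is what lets a single path such as order.items.price.currency cross the array. The original row is left untouched, so the same function could be applied repeatedly with different paths.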