Let's say I have a Spark (PySpark) dataframe with the following schema:

root
 |-- myarray: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- myindices: array (nullable = true)
 |    |-- element: integer (containsNull = true)
It looks like:

+--------------------+----------+
|             myarray| myindices|
+--------------------+----------+
|                 [A]|       [0]|
|              [B, C]|       [1]|
|        [D, E, F, G]|    [0, 2]|
+--------------------+----------+

How can I use the second array to index the first? My goal is to create a new dataframe that would look like:

+--------------------+----------+------+
|             myarray| myindices|result|
+--------------------+----------+------+
|                 [A]|       [0]|   [A]|
|              [B, C]|       [1]|   [C]|
|        [D, E, F, G]|    [0, 2]|[D, F]|
+--------------------+----------+------+

(It is safe to assume that the contents of myindices are always within the bounds of myarray for the row in question, so there are no out-of-bounds problems.)

It appears that the .getItem() method only works with a single argument, so I might need a UDF here, but I know of no way to create a UDF that takes more than one column as input. Any solutions, with or without UDFs?
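For concreteness, the per-row logic I'm after is just positional selection. A minimal sketch in plain Python is below; the commented-out Spark wiring underneath shows how I imagine a UDF would wrap it, but that part is an untested assumption and the names (select_by_indices, pick) are mine:

```python
def select_by_indices(arr, idx):
    # Pick the elements of `arr` at the positions listed in `idx`,
    # preserving the order given by `idx`.
    return [arr[i] for i in idx]

# Per-row behaviour matching the table above:
# select_by_indices(["D", "E", "F", "G"], [0, 2]) returns ["D", "F"]

# Sketch of the Spark wrapping I have in mind (untested assumption):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, StringType
# pick = udf(select_by_indices, ArrayType(StringType()))
# df = df.withColumn("result", pick(df["myarray"], df["myindices"]))
```

This is just the behaviour I want per row; whether the two-column UDF call above is the right Spark idiom is exactly my question.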