I have some code that recovers a complex structured row from a dataset. The row contains several ARRAY fields (mostly ArrayType(IntegerType)), which I populate with Array[java.lang.Integer], since that seems to be the only representation the Spark row serializer will accept.
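Here is roughly what the population code looks like (schema, field names, and the local SparkSession are simplified for illustration):

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder().master("local[*]").getOrCreate()

    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("values", ArrayType(IntegerType), nullable = true)))

    // Boxed Array[java.lang.Integer] is the only element representation
    // the row serializer seems to accept for ArrayType(IntegerType).
    val rows = Seq(
      Row(1, Array[java.lang.Integer](1, 2, 3)),
      Row(2, null))  // this ARRAY field is assigned null

    val ds = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)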
If the dataset is written out to a file (Parquet in this case) and then read back in, Row.getList() (from either Scala or Java) works fine and I get a List. But if I simply feed the freshly created dataset into another dataset iterator, Row.getList() throws:

    java.lang.ClassCastException: [Ljava.lang.Integer; cannot be cast to scala.collection.Seq

On top of that, the array fields that were assigned null show up as non-null empty arrays in the direct path, yet after the write/read round trip they come back as actual nulls.

Why isn't the behavior consistent between the two paths? And why is there no Row.getArray()? Will any of this nonsense be fixed in 3.0?
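P.S. A condensed sketch of the two paths I'm comparing, continuing from the snippet above (the output path is just for illustration, and the getAs[AnyRef] match is the only workaround I've found so far):

    // Path 1: iterate the freshly created dataset directly.
    // This is where row.getList[Integer](1) blows up with the
    // ClassCastException for me, and where the null field shows
    // up as an empty array.
    ds.collect().foreach { row =>
      val xs = row.getAs[AnyRef]("values") match {
        case null        => null
        case a: Array[_] => a.toSeq  // raw Java array (direct path)
        case s: Seq[_]   => s        // Seq (after a file round trip)
        case other       => other    // anything else, pass through
      }
      println(s"id=${row.getInt(0)} values=$xs")
    }

    // Path 2: round-trip through Parquet. Here getList() works,
    // and the null ARRAY field comes back as an actual null.
    ds.write.mode("overwrite").parquet("/tmp/int_array_repro")
    val reread = spark.read.parquet("/tmp/int_array_repro")
    reread.collect().foreach { row =>
      val xs = if (row.isNullAt(1)) null else row.getList[Integer](1)
      println(s"id=${row.getInt(0)} values=$xs")
    }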