Issue with job serialization formats mangling results

Aaron Wiebe Fri, 23 Oct 2015 11:24:13 -0700

Hey folks,

I've been working on a rather odd issue for a while now, and I'm going
to need a hand here.


In one field of a table, I have YAML inside the field (including
\n's).  Regardless of the storage format (Parquet, ORC, JSON using the
openx serde), Hive will unpack the newlines (even though they're
actually ASCII '\' and 'n') within the field and badly mangle the
results.  This _only_ happens when I apply any type of filter (even if
it doesn't hit that field, provided the column is in the result set).

I've tested this with Hive 1.1.0+cdhCrap and Hive 1.2.1 mainline, both
tez (0.7.0) and MR.  Results are identical.

For example - count(*) returns 1.  Select * from table; returns the
one properly formatted row.  Add any where clause, and I get 63 rows -
the yaml is unpacked in a mangled format.

I've then created ORC and Parquet versions of this same table.  The
behavior remains... select * works, any filter creates horribly
mangled results.

To reproduce, throw this into a file (as a single line):

{"id":1,"order_id":8,"number":1,"broken":"#\n---\nstuff\nstuff2: \"stuff3\"\nstuff4: '730'\nstuff5: []\n","last":null}
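
Mail wrapping can split the record above, so here is a minimal Python sketch for generating the file instead of pasting it. It also checks the point made earlier: on disk the field holds the two characters '\' and 'n', not real line breaks, so the record is a single physical line. The filename wtf.json and the helper names are assumptions for illustration only; the file still has to be copied into the table location by hand.

```python
import json

# The record from the message. Here the YAML is a real multi-line Python
# string; JSON encoding turns each newline into the two characters '\' 'n'.
record = {
    "id": 1,
    "order_id": 8,
    "number": 1,
    "broken": "#\n---\nstuff\nstuff2: \"stuff3\"\nstuff4: '730'\nstuff5: []\n",
    "last": None,
}


def encode(rec):
    """Serialize one record as compact, single-line JSON."""
    return json.dumps(rec, separators=(",", ":"))


def write_repro(path="wtf.json"):
    """Write the record as one line (hypothetical filename)."""
    with open(path, "w") as f:
        f.write(encode(record) + "\n")


line = encode(record)

if __name__ == "__main__":
    write_repro()
    # One physical line on disk: no raw newline inside the record.
    print("raw newlines in record:", line.count("\n"))
    # Escaped newlines as stored, and real newlines once the JSON is decoded.
    print("escaped \\n sequences:", line.count("\\n"))
    print("newlines after decoding:", json.loads(line)["broken"].count("\n"))
```

The decoded value is what a JSON reader hands back for the column, so this only shows where the real newlines come from; it does not explain why a filter changes the row count.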

Then:

create external table wtf (id int, order_id int, number int, broken
string, last string) ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE location '/user/aaron/wtf';

Then:

select * from wtf;

And:

select * from wtf where broken is not null;

will return very different results.  Creating this as an ORC or
Parquet table with "create table wtf2 like wtf stored as
<orc|parquet>; insert into wtf2 select * from wtf" will result in the
same issue.

-Aaron
