parquet protobuf output and aws athena support

Jin Yi Mon, 15 Mar 2021 12:51:13 -0700

using ParquetProtoWriters
<https://ci.apache.org/projects/flink/flink-docs-master/api/java/org/apache/flink/formats/parquet/protobuf/ParquetProtoWriters.html>,
does anyone have this working with aws athena ingestion via aws glue crawls?


the parquet files being generated by our flink job look fine at a binary
level, but aws glue crawlers reading these files from s3 don't seem to be
able to deserialize the row data properly.  the schema is correctly picked
up, but the actual unmarshalling of the rows seems to fail (with no helpful
logs).
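
for context, a binary-level check of this kind could be sketched roughly as
below (python, stdlib only).  this is only an illustration under stated
assumptions: `check_parquet_framing` is a hypothetical helper name, not part
of flink, parquet-tools, or pqrs, and it assumes the whole file has been read
into memory.  it verifies just the outer framing of the file (the leading and
trailing `PAR1` magic bytes and the little-endian footer length stored before
the trailing magic).

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at the start and end of an unencrypted parquet file


def check_parquet_framing(data: bytes) -> int:
    """Return the footer metadata length if `data` has valid parquet framing.

    Hypothetical helper for illustration. It checks only the magic bytes and
    the footer length field; it does not decode the footer or any row data.
    """
    # smallest possible layout: 4 (magic) + 4 (footer length) + 4 (magic)
    if len(data) < 12:
        raise ValueError("too short to be a parquet file")
    if data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    # the footer length is a 4-byte little-endian integer before the trailing magic
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    if footer_len > len(data) - 12:
        raise ValueError("footer length exceeds file size")
    return footer_len
```

a file can pass a check like this, and even have a readable schema in its
footer, while the row data pages still fail to decode, so it does not rule out
a problem in how the rows themselves were written.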

likewise, using parquet-tools or pqrs
<https://github.com/manojkarthick/pqrs> locally shows the same behavior:
the metadata reads perfectly fine, but the actual row data does not.

i'd like to verify that this is just a relatively atypical combination of
formats (parquet and protos) that doesn't have widespread tooling support
vs something i'm overlooking on my end.  for example, must i define the
table manually in athena using a create table statement (most examples of
parquet/protobuf use this approach) and not rely on the schema inferred by
the aws glue crawler?  i didn't go this route because it seemed counter
to the spirit of the parquet format, which embeds the schema in the file.

thanks!
