I'll answer my own question :)

I think I've managed to make it work by creating a "WrappingReadSupport" that 
wraps the DataWritableReadSupport, and inserting a "WrappingMaterializer" that 
converts the ArrayWritable produced by the original materializer into String[]. 
Further down the pipeline, the String[] poses no issues with the Tuple 
serialization, and it seems to be OK.
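
For the record, the wrapper looks roughly like this. It's only a minimal sketch 
against parquet-mr's ReadSupport / RecordMaterializer API (package names may be 
parquet.* instead of org.apache.parquet.* depending on your version); the class 
names match what I described above, the bodies are illustrative:

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.Writable;
import org.apache.parquet.hadoop.api.InitContext;
import org.apache.parquet.hadoop.api.ReadSupport;
import org.apache.parquet.io.api.GroupConverter;
import org.apache.parquet.io.api.RecordMaterializer;
import org.apache.parquet.schema.MessageType;

public class WrappingReadSupport extends ReadSupport<String[]> {

    private final DataWritableReadSupport delegate = new DataWritableReadSupport();

    @Override
    public ReadContext init(InitContext context) {
        // let Hive's read support do all the schema work
        return delegate.init(context);
    }

    @Override
    public RecordMaterializer<String[]> prepareForRead(Configuration configuration,
            Map<String, String> keyValueMetaData, MessageType fileSchema,
            ReadContext readContext) {
        return new WrappingMaterializer(
                delegate.prepareForRead(configuration, keyValueMetaData, fileSchema, readContext));
    }

    /** Delegates to Hive's materializer, then converts ArrayWritable to String[]. */
    private static final class WrappingMaterializer extends RecordMaterializer<String[]> {

        private final RecordMaterializer<ArrayWritable> inner;

        WrappingMaterializer(RecordMaterializer<ArrayWritable> inner) {
            this.inner = inner;
        }

        @Override
        public String[] getCurrentRecord() {
            Writable[] fields = inner.getCurrentRecord().get();
            String[] row = new String[fields.length];
            for (int i = 0; i < fields.length; i++) {
                row[i] = fields[i] == null ? null : fields[i].toString();
            }
            return row;
        }

        @Override
        public GroupConverter getRootConverter() {
            // the actual record assembly stays entirely in Hive's converter
            return inner.getRootConverter();
        }
    }
}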

Now ... let's write those String[] to Parquet too :)
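
The write side should look something like the untested sketch below. 
"flink.parquet.schema" is just a made-up configuration key, and the schema is 
parsed in init() because ParquetOutputFormat instantiates the WriteSupport 
reflectively through its no-arg constructor. It assumes a flat all-string 
schema such as "message row { optional binary c0 (UTF8); optional binary c1 (UTF8); }":

import java.util.HashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.parquet.io.api.Binary;
import org.apache.parquet.io.api.RecordConsumer;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class StringArrayWriteSupport extends WriteSupport<String[]> {

    private MessageType schema;
    private RecordConsumer consumer;

    @Override
    public WriteContext init(Configuration configuration) {
        // "flink.parquet.schema" is a made-up key; the schema has to come from
        // the Configuration because ParquetOutputFormat creates this class
        // through its no-arg constructor
        schema = MessageTypeParser.parseMessageType(configuration.get("flink.parquet.schema"));
        return new WriteContext(schema, new HashMap<String, String>());
    }

    @Override
    public void prepareForWrite(RecordConsumer recordConsumer) {
        this.consumer = recordConsumer;
    }

    @Override
    public void write(String[] record) {
        consumer.startMessage();
        for (int i = 0; i < record.length; i++) {
            if (record[i] != null) { // fields are optional: skip nulls
                consumer.startField(schema.getFieldName(i), i);
                consumer.addBinary(Binary.fromString(record[i]));
                consumer.endField(schema.getFieldName(i), i);
            }
        }
        consumer.endMessage();
    }
}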


From: Gwenhael Pasquiers [mailto:gwenhael.pasqui...@ericsson.com]
Sent: Friday, 18 December 2015 10:04
To: user@flink.apache.org
Subject: Reading Parquet/Hive

Hi,

I'm trying to read Parquet/Hive data using Parquet's ParquetInputFormat and 
Hive's DataWritableReadSupport.

I get an error when the TupleSerializer tries to create an instance of 
ArrayWritable using reflection, because ArrayWritable has no no-args 
constructor.
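
For context, the job is wired up roughly as below ("/path/to/parquet" is a 
placeholder, and package names depend on the parquet-mr version). 
env.createInput() produces Tuple2<Void, ArrayWritable> records, and it is the 
TupleSerializer for that tuple type that tries to instantiate ArrayWritable 
reflectively:

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.io.parquet.read.DataWritableReadSupport;
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.hadoop.ParquetInputFormat;

public class ReadParquetJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        ParquetInputFormat.setReadSupportClass(job, DataWritableReadSupport.class);
        FileInputFormat.addInputPath(job, new Path("/path/to/parquet")); // placeholder

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // createInput() wraps each record in a Tuple2; serializing that tuple
        // is what triggers the reflective instantiation of ArrayWritable
        DataSet<Tuple2<Void, ArrayWritable>> input = env.createInput(
                new HadoopInputFormat<Void, ArrayWritable>(
                        new ParquetInputFormat<ArrayWritable>(),
                        Void.class, ArrayWritable.class, job));

        input.print();
    }
}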

I've been able to make it work when executing on a local cluster by copying the 
ArrayWritable class into my own sources and adding the constructor. I guess the 
classpath built by Maven puts my code first and lets me shadow the original 
class. However, when running on the real cluster (YARN on Cloudera) the 
exception comes back (I guess the original class comes first in the classpath 
there).

Do you have an idea of how I could make this work?

I think I'm tied to the ArrayWritable type because DataWritableReadSupport 
extends ReadSupport<ArrayWritable>.

Would it be possible (and not too complicated) to make a DataSource that does 
not generate Tuples and lets me convert the ArrayWritable into a friendlier 
type like String[]? Any other ideas are welcome too!

B.R.

Gwenhaël PASQUIERS
