I started from this guide:

https://github.com/FelixNeutatz/parquet-flinktacular

Best,
Flavio

On 24 Apr 2017 6:36 pm, "Jörn Franke" <jornfra...@gmail.com> wrote:

> Why not use a Parquet-only format? Not sure why you need an
> Avro-Parquet format.
>
> On 24. Apr 2017, at 18:19, Lukas Kircher <lukas.kirc...@uni-konstanz.de>
> wrote:
>
> Hello,
>
> I am trying to read Parquet files from HDFS and am running into problems. I
> use Avro for the schema. Here is a basic example:
>
> public static void main(String[] args) throws Exception {
>     ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>
>     Job job = Job.getInstance();
>     HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(
>         new AvroParquetInputFormat(), Void.class, Customer.class, job);
>     FileInputFormat.addInputPath((JobConf) job.getConfiguration(),
>         new org.apache.hadoop.fs.Path("/tmp/tpchinput/01/customer_parquet"));
>
>     // Read only the c_custkey column by requesting an Avro projection schema
>     Schema projection = Schema.createRecord(Customer.class.getSimpleName(), null, null, false);
>     List<Schema.Field> fields = Arrays.asList(
>         new Schema.Field("c_custkey", Schema.create(Schema.Type.INT), null, (Object) null)
>     );
>     projection.setFields(fields);
>     AvroParquetInputFormat.setRequestedProjection(job, projection);
>
>     DataSet<Tuple2<Void, Customer>> dataset = env.createInput(hif);
>     dataset.print();
> }
>
> If I submit this to the job manager I get the following stack trace:
>
> java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
>     at misc.Misc.main(Misc.java:29)
>
> The problem is that I use the parquet-avro dependency (which provides
> AvroParquetInputFormat) in version 1.9.0, which relies on Avro 1.8.0, while
> flink-core relies on Avro 1.7.7. Avro 1.8.0 changed the default-value
> parameter of the Schema.Field constructor from Jackson's JsonNode to Object,
> so the 1.7.7 classes loaded at runtime don't have the constructor my code was
> compiled against. FYI, the dependency tree looks like this:
>
> [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @
> flink-experiments ---
> [INFO] ...:1.0-SNAPSHOT
> [INFO] +- org.apache.flink:flink-java:jar:1.2.0:compile
> [INFO] |  +- org.apache.flink:flink-core:jar:1.2.0:compile
> [INFO] |  |  \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for
> conflict with 1.8.0)
> [INFO] |  \- org.apache.flink:flink-shaded-hadoop2:jar:1.2.0:compile
> [INFO] |     \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for
> duplicate)
> [INFO] \- org.apache.parquet:parquet-avro:jar:1.9.0:compile
> [INFO]    \- org.apache.avro:avro:jar:1.8.0:compile
>
> Fixing the above NoSuchMethodError just leads to further problems.
> Downgrading parquet-avro to an older version creates other conflicts, as
> there is no version that uses Avro 1.7.7 like Flink does.
>
> Is there a way around this or can you point me to another approach to read
> Parquet data from HDFS? How do you normally go about this?
>
> Thanks for your help,
> Lukas
>
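One possible way around a clash like the one described above is to force Maven to resolve a single Avro version for the whole build via dependencyManagement. A minimal sketch, assuming Avro 1.8.0 is binary-compatible with the 1.7.7 API that flink-core 1.2.0 was compiled against (an assumption worth verifying by actually running the job):

```xml
<!-- Sketch: pin one Avro version for all transitive dependencies.
     Assumption: Avro 1.8.0 still provides the 1.7.7 methods Flink calls. -->
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.avro</groupId>
      <artifactId>avro</artifactId>
      <version>1.8.0</version>
    </dependency>
  </dependencies>
</dependencyManagement>
```

If pinning 1.8.0 breaks Flink at runtime instead, relocating Avro inside the user jar with the maven-shade-plugin is the other common escape hatch.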
