Why not use a Parquet-only input format? I'm not sure why you need the AvroParquetInputFormat here.
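To illustrate the suggestion: a sketch (untested, assuming parquet-hadoop is on the classpath) of reading the same file through Parquet's "example" `Group` API instead of parquet-avro, which keeps Avro out of the picture entirely. The input path and the projected column are taken from the original mail; the class name and projection string are mine.

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopInputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.example.ExampleInputFormat;

public class ParquetOnlyRead {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        Job job = Job.getInstance();
        // ExampleInputFormat is ParquetInputFormat<Group> wired to
        // GroupReadSupport -- no Avro classes involved.
        HadoopInputFormat<Void, Group> hif = new HadoopInputFormat<>(
                new ExampleInputFormat(), Void.class, Group.class, job);
        FileInputFormat.addInputPath(job, new Path("/tmp/tpchinput/01/customer_parquet"));

        // Column projection as a Parquet message type instead of an Avro schema.
        job.getConfiguration().set("parquet.read.schema",
                "message Customer { required int32 c_custkey; }");

        DataSet<Tuple2<Void, Group>> dataset = env.createInput(hif);
        dataset.print();
    }
}
```

You then pull fields out of each `Group` by name (e.g. `group.getInteger("c_custkey", 0)`) rather than through a generated Avro class.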
> On 24. Apr 2017, at 18:19, Lukas Kircher <lukas.kirc...@uni-konstanz.de> wrote:
>
> Hello,
>
> I am trying to read Parquet files from HDFS and having problems. I use Avro
> for the schema. Here is a basic example:
>
>     public static void main(String[] args) throws Exception {
>         ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
>
>         Job job = Job.getInstance();
>         HadoopInputFormat<Void, Customer> hif = new HadoopInputFormat<>(
>                 new AvroParquetInputFormat(), Void.class, Customer.class, job);
>         FileInputFormat.addInputPath((JobConf) job.getConfiguration(),
>                 new org.apache.hadoop.fs.Path("/tmp/tpchinput/01/customer_parquet"));
>         Schema projection = Schema.createRecord(Customer.class.getSimpleName(), null, null, false);
>         List<Schema.Field> fields = Arrays.asList(
>                 new Schema.Field("c_custkey", Schema.create(Schema.Type.INT), null, (Object) null)
>         );
>         projection.setFields(fields);
>         AvroParquetInputFormat.setRequestedProjection(job, projection);
>
>         DataSet<Tuple2<Void, Customer>> dataset = env.createInput(hif);
>         dataset.print();
>     }
>
> If I submit this to the job manager I get the following stack trace:
>
>     java.lang.NoSuchMethodError: org.apache.avro.Schema$Field.<init>(Ljava/lang/String;Lorg/apache/avro/Schema;Ljava/lang/String;Ljava/lang/Object;)V
>         at misc.Misc.main(Misc.java:29)
>
> The problem is that I use the parquet-avro dependency (which provides
> AvroParquetInputFormat) in version 1.9.0, which relies on the avro dependency
> in version 1.8.0. flink-core itself relies on the avro dependency in version 1.7.7.
> Jfyi, the dependency tree looks like this:
>
>     [INFO] --- maven-dependency-plugin:2.8:tree (default-cli) @ flink-experiments ---
>     [INFO] ...:1.0-SNAPSHOT
>     [INFO] +- org.apache.flink:flink-java:jar:1.2.0:compile
>     [INFO] |  +- org.apache.flink:flink-core:jar:1.2.0:compile
>     [INFO] |  |  \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for conflict with 1.8.0)
>     [INFO] |  \- org.apache.flink:flink-shaded-hadoop2:jar:1.2.0:compile
>     [INFO] |     \- (org.apache.avro:avro:jar:1.7.7:compile - omitted for duplicate)
>     [INFO] \- org.apache.parquet:parquet-avro:jar:1.9.0:compile
>     [INFO]    \- org.apache.avro:avro:jar:1.8.0:compile
>
> Fixing the above NoSuchMethodError just leads to further problems.
> Downgrading parquet-avro to an older version creates other conflicts, as there
> is no version that uses avro 1.7.7 like Flink does.
>
> Is there a way around this, or can you point me to another approach to read
> Parquet data from HDFS? How do you normally go about this?
>
> Thanks for your help,
> Lukas
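If you do want to keep parquet-avro, one common workaround for this kind of version clash (a sketch, untested against Flink 1.2 specifically) is to relocate Avro inside the fat job jar with the maven-shade-plugin, so parquet-avro 1.9.0 uses its own Avro 1.8.0 copy while the cluster classpath keeps Flink's 1.7.7:

```xml
<!-- Sketch: relocate Avro in the user jar to avoid the 1.7.7/1.8.0 clash.
     The shadedPattern name is arbitrary. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>org.apache.avro</pattern>
            <shadedPattern>shaded.org.apache.avro</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>
```

Caveat: relocation also rewrites the Avro references in your generated `Customer` class, so Flink's runtime will no longer recognize it as an Avro type and will fall back to generic serialization; whether that is acceptable depends on your job.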