Hi all,

We have a large amount of data in HDFS stored in Avro format, and we do not want to convert it to a Hive-supported format. That is why we have developed a custom InputFormat and Deserializer. Our custom InputFormat does not support file splitting, because the Avro schema (which describes the structure of the records stored in the file) is placed at the end of the file. Here is how our AvroInputFormat is implemented:
    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class AvroInputFormat extends FileInputFormat<LongWritable, Text> {

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(InputSplit inputSplit,
                JobConf job, Reporter reporter) throws IOException {
            FileSplit split = (FileSplit) inputSplit;
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(job);
            return new AvroRecordReader(file, fs);
        }

        @Override
        protected boolean isSplitable(FileSystem fs, Path filename) {
            // Never split the input files: the Avro schema sits at the end of each file.
            return false;
        }

        class AvroRecordReader implements RecordReader<LongWritable, Text> {

            private final Path file;
            private final FileSystem fs;

            public AvroRecordReader(Path file, FileSystem fs) {
                this.file = file;
                this.fs = fs;
            }

            @Override
            public void close() throws IOException {
            }

            @Override
            public LongWritable createKey() {
                return new LongWritable(1);
            }

            @Override
            public Text createValue() {
                return new Text();
            }

            @Override
            public long getPos() throws IOException {
                return 0;
            }

            @Override
            public float getProgress() throws IOException {
                return 0;
            }

            @Override
            public boolean next(LongWritable key, Text value) throws IOException {
                // Read the whole file into the value.
                FSDataInputStream fileIn = null;
                try {
                    fileIn = fs.open(file);
                    byte[] bytes = new byte[1024];
                    int readBytes = 0;
                    while ((readBytes = fileIn.read(bytes)) > 0) {
                        value.append(bytes, 0, readBytes);
                    }
                } finally {
                    if (fileIn != null) {
                        fileIn.close();
                    }
                }
                return false;
            }
        }
    }

Are there any problems with that implementation?

Also, we think we do not need an OutputFormat implementation, since we are not going to store data in Avro format. That is why we pass org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat in the OUTPUTFORMAT clause when creating the table. For instance:

    CREATE EXTERNAL TABLE avro_data (Url STRING, ContentType STRING)
    ROW FORMAT SERDE 'test.AvroSerDe'
    STORED AS
        INPUTFORMAT 'test.AvroInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION '/data/avro';

Is this correct?

Regarding our SerDe: again, we are not going to store data in Avro format, and that is why we implement only Deserializer. Here is our dummy/testing implementation:

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.Deserializer;
    import org.apache.hadoop.hive.serde2.SerDeException;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
    import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
    import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
    import org.apache.hadoop.io.Writable;

    public class AvroSerDe implements Deserializer {

        private int count = 0;
        private ObjectInspector objectInspector;

        @Override
        public void initialize(Configuration conf, Properties tbl) throws SerDeException {
            objectInspector = createObjectInspector();
        }

        @Override
        public Object deserialize(Writable value) throws SerDeException {
            // Dummy implementation: ignore the input and emit a generated row.
            String url = "http://dummy.com/?p=" + count;
            count++;
            String contentType = "dummy web site content";
            LinkedList<String> columns = new LinkedList<String>();
            columns.add(url);
            columns.add(contentType);
            return columns;
        }

        @Override
        public ObjectInspector getObjectInspector() throws SerDeException {
            return objectInspector;
        }

        private ObjectInspector createObjectInspector() {
            List<String> structFieldNames = new ArrayList<String>();
            structFieldNames.add("URL");
            structFieldNames.add("ContentType");

            List<ObjectInspector> structFieldObjectInspectors = new ArrayList<ObjectInspector>();
            structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);
            structFieldObjectInspectors.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector);

            return ObjectInspectorFactory.getStandardStructObjectInspector(
                    structFieldNames, structFieldObjectInspectors);
        }
    }

When we execute a SELECT statement against the avro_data table, we do not get any results. When logging is added to these classes, it seems that AvroSerDe is not used and only AvroInputFormat is called.
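To rule out the SerDe itself, here is a minimal standalone sketch we could use to exercise it outside Hive (the AvroSerDeSmokeTest class is just an illustration for debugging, not part of our code):

    import java.util.Properties;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hive.serde2.objectinspector.StructField;
    import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
    import org.apache.hadoop.io.Text;

    public class AvroSerDeSmokeTest {
        public static void main(String[] args) throws Exception {
            // Initialize the SerDe the same way Hive would.
            AvroSerDe serde = new AvroSerDe();
            serde.initialize(new Configuration(), new Properties());

            // Our dummy deserialize() ignores its input, so any Writable will do.
            Object row = serde.deserialize(new Text("ignored"));

            // Read the columns back through the ObjectInspector, as Hive does.
            StructObjectInspector oi = (StructObjectInspector) serde.getObjectInspector();
            for (StructField field : oi.getAllStructFieldRefs()) {
                System.out.println(field.getFieldName() + " = "
                        + oi.getStructFieldData(row, field));
            }
        }
    }

If this prints the dummy URL and content type, the SerDe and its ObjectInspector work in isolation, which would point the problem at how the table wires the InputFormat and the SerDe together.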
Anyway, when the data_avro table is defined in the following way:

    CREATE EXTERNAL TABLE data_avro (Url STRING, ContentType STRING)
    ROW FORMAT SERDE 'test.AvroSerDe'
    STORED AS TEXTFILE
    LOCATION '/data/avro';

we get many dummy results:

    ...
    http://dummy.com/?p=39089    dummy web site content
    http://dummy.com/?p=39090    dummy web site content
    http://dummy.com/?p=39091    dummy web site content
    Time taken: 5.616 seconds

Note that the /data/avro folder contains only 6 sample files, yet they get split into 39091 chunks (presumably one per line, since with STORED AS TEXTFILE our AvroInputFormat is not used and Hive falls back to its default line-oriented input format). We do not want to split the input files, for the reasons mentioned above.

Do you have any ideas regarding these problems? Thank you in advance!

Regards