Hello, I am writing about a problem I have with developing a custom AvroInputFormat class. I do not yet have a fully clear picture of the design, so I will describe my goal in the hope of getting better help from you.
First, I need to join multiple Avro files together. To do this, I wrote a custom implementation of FileInputFormat that works with multiple paths. Second, I need to control the number of records in each split. For this I resorted to a rather dirty solution. In each split I store:
1. The paths of the files in which the corresponding records are stored;
2. The first usable sync point in the first file;
3. The offset, expressed as a number of objects, from that sync point in the first file.
The InputFormat:
1. Uses SeekableInput, ReflectData, DatumReader and DataFileReader to iterate over all the records in all the files;
2. Creates the splits, storing the needed information.
The RecordReader then:
1. Opens the first file;
2. Syncs to the sync point;
3. Iterates until the offset is reached, again with SeekableInput, ReflectData, DatumReader and DataFileReader;
4. Starts reading the records one by one to produce the output.
The biggest bottleneck is that I can only use the sync point to jump directly to a position in a file, and I cannot use any "seek" to make this faster. Do you have any advice for this?
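For illustration, the split bookkeeping described above could be sketched roughly as follows. This is a simplified, hypothetical model rather than the real implementation: the names RecordSplit and SplitPlanner are invented, the recordCount field is an added assumption (so that a reader would know where to stop), and the per-block sync positions and record counts are assumed to have been collected already during the scan the InputFormat performs. No Avro or Hadoop classes are used.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical split descriptor mirroring the three items stored per split. */
final class RecordSplit {
    final List<String> paths = new ArrayList<>(); // files holding this split's records, in read order
    final long firstSyncPoint;                    // sync position to jump to in the first file
    final long recordOffset;                      // records to skip after that sync point
    long recordCount;                             // assumption: number of records the split emits

    RecordSplit(String firstPath, long firstSyncPoint, long recordOffset) {
        this.paths.add(firstPath);
        this.firstSyncPoint = firstSyncPoint;
        this.recordOffset = recordOffset;
    }
}

/** Hypothetical planner that cuts a sequence of files into fixed-size record splits. */
final class SplitPlanner {
    /**
     * blockStarts[f][b] is the sync position of block b in file f and
     * blockCounts[f][b] is the number of records in that block (assumed inputs).
     */
    static List<RecordSplit> plan(String[] paths, long[][] blockStarts,
                                  long[][] blockCounts, long recordsPerSplit) {
        if (recordsPerSplit <= 0) {
            throw new IllegalArgumentException("recordsPerSplit must be positive");
        }
        List<RecordSplit> splits = new ArrayList<>();
        RecordSplit current = null;
        for (int f = 0; f < paths.length; f++) {
            for (int b = 0; b < blockCounts[f].length; b++) {
                long consumed = 0; // records of this block already assigned to a split
                while (consumed < blockCounts[f][b]) {
                    if (current == null) {
                        // A new split starts inside this block, 'consumed' records after its sync point.
                        current = new RecordSplit(paths[f], blockStarts[f][b], consumed);
                    } else if (!current.paths.get(current.paths.size() - 1).equals(paths[f])) {
                        // The open split continues into the next file.
                        current.paths.add(paths[f]);
                    }
                    long take = Math.min(blockCounts[f][b] - consumed,
                                         recordsPerSplit - current.recordCount);
                    consumed += take;
                    current.recordCount += take;
                    if (current.recordCount == recordsPerSplit) {
                        splits.add(current);
                        current = null;
                    }
                }
            }
        }
        if (current != null) {
            splits.add(current); // last split may hold fewer records
        }
        return splits;
    }
}
```

For example, with two files whose blocks hold 3, 2 and 4 records and a limit of 4 records per split, this sketch yields three splits; the second one starts one record past the sync point of the second block of the first file and continues into the second file. In this model the number of records a reader has to skip is always smaller than the number of records in one block.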
