Any suggestions?
On Sat, Dec 21, 2013 at 6:05 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <[email protected]> wrote:
> Hello,
> I have a 340 MB Avro data file that contains records sorted and identified
> by a unique id (duplicate records exist). At the beginning of every unique
> record a synchronization point is created with DataFileWriter.sync(). (I
> cannot, or do not want to, save the sync points, and I do not want to use
> SortedKeyValueFile as the output format for the M/R job.)
>
> There are at least 25k synchronization points in the 340 MB file.
>
> Ex:
> Marker1_RecordA1_RecordA2_RecordA3_Marker2_RecordB1_RecordB2
>
> As the records are sorted, for efficient retrieval a binary search is
> performed using the attached code.
>
> Most of the time the search is successful; at times the code throws the
> following exception:
> ------
> org.apache.avro.AvroRuntimeException: java.io.IOException: Invalid sync!
> at org.apache.avro.file.DataFileStream.hasNext(DataFileStream.java:210
> ------
>
> Questions:
> 1) Is it OK to have 25k sync points in a 340 MB file? Does it cost
> performance while reading?
> 2) I note down the position that was used to invoke fileReader.sync(mid).
> If I catch the AvroRuntimeException, close and reopen the file, and call
> sync(mid) again, I do not see the exception. Why does Avro throw the
> exception the first time but not afterwards?
> 3) Is there a limit on the number of times sync() can be invoked?
> 4) When sync(position) is invoked, is any 0 <= position <= file.size()
> valid? If yes, why do I see the AvroRuntimeException (#2)?
>
> Regards,
> Deepak
>
> --
> Deepak
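For reference, the sync-based binary search the quoted mail describes can be sketched as follows. This is a self-contained toy model, not the Avro API itself: the sync(pos) method here only simulates DataFileReader.sync(position), which positions the reader at the first synchronization marker at or after the given byte offset (which is why any 0 <= position <= file size is a legal argument). The class and method names (SyncBinarySearch, findRecord) are illustrative, and it assumes one unique id per sync block, as in the thread.

```java
import java.util.Arrays;

public class SyncBinarySearch {
    // Byte offsets of the sync markers, ascending (one marker per unique id,
    // matching the Marker1_RecordA..._Marker2_RecordB... layout in the mail).
    private final long[] syncOffsets;
    // The sorted id stored in each sync block.
    private final long[] ids;
    private final long fileSize;

    SyncBinarySearch(long[] syncOffsets, long[] ids, long fileSize) {
        this.syncOffsets = syncOffsets;
        this.ids = ids;
        this.fileSize = fileSize;
    }

    // Simulates DataFileReader.sync(pos): position at the first marker at or
    // after pos. Returns syncOffsets.length to mean "synced past the last
    // marker" (the real reader would then report hasNext() == false).
    int sync(long pos) {
        int i = Arrays.binarySearch(syncOffsets, pos);
        return i >= 0 ? i : -i - 1;
    }

    // Binary search over byte positions: pick mid, sync(mid), read the first
    // id after the marker, and narrow the range. Returns the id if found,
    // -1 otherwise.
    long findRecord(long targetId) {
        long lo = 0, hi = fileSize;
        while (lo < hi) {
            long mid = lo + (hi - lo) / 2;
            int block = sync(mid);
            if (block == syncOffsets.length) { // synced past the last marker
                hi = mid;
                continue;
            }
            long id = ids[block];
            if (id == targetId) return id;
            if (id < targetId) {
                // Target lies in a later block; skip past this marker.
                lo = syncOffsets[block] + 1;
            } else {
                // The first marker >= mid is already too far; any earlier
                // marker is < mid, so searching below mid is sufficient.
                hi = mid;
            }
        }
        return -1;
    }
}
```

With markers at offsets 0, 100, and 200 holding ids 10, 20, and 30 in a 300-byte "file", findRecord(20) syncs to the marker at 200 first (id 30, too big), then to the one at 100, and finds 20; findRecord(25) exhausts the range and returns -1.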
