Thanks Doug. Here is a comparison.

Load Avro record size: roughly 15 KB

I used the same payload with a schema that has around 2k fields, and also
with another schema that has 5 fields. In both cases I reused the Avro
object, building it with a builder once. The test wrote 1M records, the
same amount of data (1 GB), to a local drive, and I ran it a few times
single-threaded. Average TPS with the smaller schema is 40K, whereas with
the bigger schema it drops to 10K, even though both write the same amount
of data. Since I am only creating the Avro object once in both cases, it
looks like there is an overhead in the DataFileWriter too in the case of
bigger schemas.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import com.google.common.base.Stopwatch;

public class LoadGenerator {
    DataFileWriter<User> dataFileWriter;
    DatumWriter<User> datumWriter;
    FileSystem hdfsFileSystem;
    Configuration conf;
    Path path;
    OutputStream outStream;
    User user;
    Stopwatch stopwatch = new Stopwatch();

    public static void main(String[] args) {
        try {
            new LoadGenerator().load();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void load() throws IOException {
        conf = new Configuration();
        hdfsFileSystem = FileSystem.get(conf);
        datumWriter = new SpecificDatumWriter<User>(User.class);
        dataFileWriter = new DataFileWriter<User>(datumWriter);
        path = new Path("/projects/tmp/load.avro");
        outStream = hdfsFileSystem.create(path, true);
        dataFileWriter.create(User.getClassSchema(), outStream);
        // Create and load users
        Random random = new Random();
        int numRecords = 1000000;
        user = User.newBuilder().build(); // built once, reused each iteration
        stopwatch.start();
        for (int i = 0; i < numRecords; i++) {
            user.setFirstName("testName" + random.nextLong());
            user.setFavoriteNumber(random.nextInt());
            user.setFavoriteColor("blue" + random.nextFloat());
            user.setData(ByteBuffer.wrap(new byte[15000]));
            dataFileWriter.append(user);
        }
        dataFileWriter.close();
        long elapsedTime = stopwatch.elapsedTime(TimeUnit.SECONDS);
        System.out.println("Time elapsed for load() is " + elapsedTime + " s");
    }
}
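Doug's suggestion in the quoted reply below (build once so the defaults are populated, then reuse that single instance, resetting only the fields that change) would look roughly like this. `User` here is a minimal hand-written stand-in for the Avro-generated class and `writeAll` is a hypothetical name, so treat it as a sketch rather than a drop-in:

```java
// Minimal stand-in for an Avro-generated class, only to illustrate the
// reuse pattern; the real User is generated from the Avro schema.
class User {
    String firstName;
    int favoriteNumber;
    void setFirstName(String v) { firstName = v; }
    void setFavoriteNumber(int v) { favoriteNumber = v; }
}

public class ReuseSketch {
    // Build one instance up front, then mutate it per record.
    static int writeAll(int numRecords) {
        User user = new User(); // with Avro: User.newBuilder().build(), once
        int written = 0;
        for (int i = 0; i < numRecords; i++) {
            user.setFirstName("testName" + i); // reset only the changing fields
            user.setFavoriteNumber(i);
            written++; // with Avro: dataFileWriter.append(user)
        }
        return written;
    }

    public static void main(String[] args) {
        System.out.println("records written: " + writeAll(1_000_000));
    }
}
```

As Doug notes, this only works cleanly if the same fields are set on every record; otherwise the defaults would have to be restored by hand.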

On Mon, Jan 29, 2018 at 11:01 AM, Doug Cutting <> wrote:

> Builders have some inherent overheads.  Things could be optimized to
> better minimize this, but it will likely always be faster to reuse a single
> instance when writing.
> The deepCopy's are probably of the default values of each field you're not
> setting.  If you're only setting a few fields then you might use a builder
> to create a single instance so its defaults are set, then reuse that
> instance as you write, setting only those few fields you need to differ
> from the default.  (This only works if you're setting the same fields every
> time.  Otherwise you'd need to restore the default value.)
> An optimization for Avro here might be to inline default values for
> immutable types when generating the build() method.
> Doug
> On Fri, Jan 26, 2018 at 9:04 AM, Nishanth S <>
> wrote:
>> Hello everyone,
>> We have a process that reads data from a local file share, serializes it,
>> and writes it to HDFS in Avro format. I am just wondering if I am building
>> the Avro objects correctly. For every record that is read from the binary
>> file we create an equivalent Avro object in the below format.
>> Parent p = new Parent();
>> LOGHDR hdr = LOGHDR.newBuilder().build()
>> MSGHDR msg = MSGHDR.newBuilder().build()
>> p.setHdr(hdr);
>> p.setMsg(msg);
>> p..
>> p..set
>> datumFileWriter.write(p);
>> This Avro schema has around 1,800 fields, including 26 nested types. I did
>> some load testing and found that serializing the same object to disk is
>> about 6x faster than constructing a new object every time; when a new Avro
>> object is constructed each time, much of the time is spent in
>> GenericData.deepCopy(). Has anyone run into a similar problem? We are
>> using Avro 1.8.2.
>> Thanks,
>> Nishanth
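The 6x gap reported above is consistent with per-record construction paying an allocation plus a deep copy of every default field on each build(), which a reused instance pays only once. A self-contained sketch of the two styles side by side — plain Java with a hypothetical `Record` stand-in, not Avro itself:

```java
import java.util.Arrays;

public class BuildVsReuse {
    // Stand-in for a wide generated record: the array mimics many default fields.
    static class Record {
        long[] fields;
        Record(int width) { fields = new long[width]; }
        // Mimics the deep copy of defaults that build() performs each time.
        Record(Record defaults) { fields = Arrays.copyOf(defaults.fields, defaults.fields.length); }
    }

    // Style 1: construct a fresh record per iteration (n deep copies).
    static long buildPerRecord(Record defaults, int n) {
        long sum = 0;
        for (int i = 0; i < n; i++) {
            Record r = new Record(defaults); // copies all defaults every time
            r.fields[0] = i;
            sum += r.fields[0];
        }
        return sum;
    }

    // Style 2: copy defaults once, then reset only the changing field.
    static long reuseOne(Record defaults, int n) {
        Record r = new Record(defaults); // one deep copy total
        long sum = 0;
        for (int i = 0; i < n; i++) {
            r.fields[0] = i;
            sum += r.fields[0];
        }
        return sum;
    }

    public static void main(String[] args) {
        Record defaults = new Record(2000); // ~2k fields, like the bigger schema
        int n = 100_000;
        // Both styles produce identical records; the second avoids n-1 copies.
        System.out.println(buildPerRecord(defaults, n) == reuseOne(defaults, n));
    }
}
```

The wider the schema, the more each per-iteration copy costs, which matches the 2k-field schema slowing down while the 5-field schema does not.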
