Thanks for taking a look, Doug. This was a cooked-up schema I used for
testing; it has only 4 fields. I ran a simple test writing 1M records,
with close to 1 GB of data written to disk, and TPS was consistent at
44K. In this case I did not see much difference between reusing the
object and calling build() on every iteration. However, with the actual
schema, which has close to 2K fields, I can achieve only 5K TPS with no
reuse and 9K TPS with reuse (data size written to disk is again 1 GB).
I just added a bytes field to both my test schema and the actual schema
to increase the data volume for the test; all other fields are left at
their default values. Is there any other way to improve performance? We
do not use Avro's sorting capabilities, so I also tried setting
order=ignore on a major chunk of the fields, but that had no impact.
Appreciate you taking a look.
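
For reference, the reuse variant I tested looks roughly like this (a
sketch; numRecords, random, and dataFileWriter are as in the test
program quoted further down):

    // build the instance once, outside the loop, so defaults are set
    User user = User.newBuilder().build();
    for (int i = 0; i < numRecords; i++) {
        // overwrite only the varying fields, then re-append the same instance
        user.setFirstName("testName" + random.nextLong());
        user.setData(ByteBuffer.wrap(new byte[15000]));
        dataFileWriter.append(user);
    }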

Thanks,
Nishanth

On Mon, Feb 5, 2018 at 9:34 AM, Doug Cutting <cutt...@gmail.com> wrote:

> Your code builds a new builder and instance each time through the loop:
>
>   for (int i = 0; i < numRecords; i++) {
>       user = User.newBuilder().build();
>       ...
>
> How does it perform if you move that second line outside the loop?
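>
> i.e., something like this sketch:
>
>   User user = User.newBuilder().build();   // built once, outside the loop
>   for (int i = 0; i < numRecords; i++) {
>       // ... set fields and append as before, reusing the same instance
>   }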
>
> Thanks,
>
> Doug
>
>
> On Fri, Feb 2, 2018 at 3:50 PM, Nishanth S <nishanth.2...@gmail.com>
> wrote:
>
>> Thanks Doug. Here is a comparison.
>>
>> Load test Avro record size: roughly 15 KB.
>>
>> I used the same payload with one schema that has around 2K fields and
>> another schema that has 5 fields. In both cases I reused the Avro
>> object, calling the builder once. The test wrote 1M records, i.e. the
>> same amount of data (1 GB), to a local drive. I ran it a few times,
>> single threaded. Average TPS with the smaller schema is 40K, whereas
>> with the bigger schema it drops to 10K, even though both write the
>> same amount of data. Since I am only creating the Avro object once in
>> both cases, it looks like there is an overhead in the DataFileWriter
>> too for bigger schemas.
>>
>>
>>
>> import java.io.IOException;
>> import java.io.OutputStream;
>> import java.nio.ByteBuffer;
>> import java.util.Random;
>> import java.util.concurrent.TimeUnit;
>>
>> import org.apache.avro.file.CodecFactory;
>> import org.apache.avro.file.DataFileWriter;
>> import org.apache.avro.io.DatumWriter;
>> import org.apache.avro.specific.SpecificDatumWriter;
>> import org.apache.hadoop.conf.Configuration;
>> import org.apache.hadoop.fs.FileSystem;
>> import org.apache.hadoop.fs.Path;
>>
>> import com.google.common.base.Stopwatch;
>>
>> public class LoadGenerator {
>>
>>     public static void main(String[] args) {
>>         try {
>>             new LoadGenerator().load();
>>         } catch (IOException e) {
>>             e.printStackTrace();
>>         }
>>     }
>>
>>     DataFileWriter<User> dataFileWriter;
>>     DatumWriter<User> datumWriter;
>>     FileSystem hdfsFileSystem;
>>     Configuration conf;
>>     Path path;
>>     OutputStream outStream;
>>     User user;
>>     Stopwatch stopwatch = Stopwatch.createStarted();
>>
>>     public void load() throws IOException {
>>         conf = new Configuration();
>>         hdfsFileSystem = FileSystem.get(conf);
>>         datumWriter = new SpecificDatumWriter<User>(User.class);
>>         dataFileWriter = new DataFileWriter<User>(datumWriter);
>>         dataFileWriter.setCodec(CodecFactory.snappyCodec());
>>         path = new Path("/projects/tmp/load.avro");
>>         outStream = hdfsFileSystem.create(path, true);
>>         dataFileWriter.create(User.getClassSchema(), outStream);
>>         dataFileWriter.setFlushOnEveryBlock(false);
>>
>>         // Create and load User records
>>         int numRecords = 1000000;
>>         Random random = new Random();  // one Random, not a new one per field
>>         for (int i = 0; i < numRecords; i++) {
>>             user = User.newBuilder().build();  // new instance on every iteration
>>             user.setFirstName("testName" + random.nextLong());
>>             user.setFavoriteNumber(Integer.valueOf(random.nextInt()));
>>             user.setFavoriteColor("blue" + random.nextFloat());
>>             user.setData(ByteBuffer.wrap(new byte[15000]));  // ~15 KB payload
>>             dataFileWriter.append(user);
>>         }
>>         dataFileWriter.close();
>>         stopwatch.stop();
>>         long elapsedTime = stopwatch.elapsed(TimeUnit.SECONDS);
>>         System.out.println("Time elapsed for load() is " + elapsedTime);
>>     }
>> }
>>
>> On Mon, Jan 29, 2018 at 11:01 AM, Doug Cutting <cutt...@gmail.com> wrote:
>>
>>> Builders have some inherent overheads.  Things could be optimized to
>>> reduce this, but it will likely always be faster to reuse a single
>>> instance when writing.
>>>
>>> The deepCopy calls are probably copying the default values of each field
>>> you're not setting.  If you're only setting a few fields, you might use a
>>> builder to create a single instance so its defaults are set, then reuse
>>> that instance as you write, setting only the few fields that need to
>>> differ from the defaults.  (This only works if you're setting the same
>>> fields every time.  Otherwise you'd need to restore the default value.)
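>>>
>>> E.g., a rough sketch of that pattern (assuming, for illustration, that
>>> firstName is the only field that varies per record):
>>>
>>>   // build once so every field carries its schema default
>>>   User user = User.newBuilder().build();
>>>   for (int i = 0; i < 1000000; i++) {
>>>       // overwrite only the fields that vary per record
>>>       user.setFirstName("name" + i);
>>>       // append() serializes immediately, so reusing the instance is safe
>>>       dataFileWriter.append(user);
>>>   }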
>>>
>>> An optimization for Avro here might be to inline default values for
>>> immutable types when generating the build() method.
>>>
>>> Doug
>>>
>>> On Fri, Jan 26, 2018 at 9:04 AM, Nishanth S <nishanth.2...@gmail.com>
>>> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> We have a process that reads data from a local file share, serializes
>>>> it, and writes it to HDFS in Avro format. I am just wondering if I am
>>>> building the Avro objects correctly. For every record that is read
>>>> from the binary file we create an equivalent Avro object in the form
>>>> below.
>>>>
>>>> Parent p = new Parent();
>>>> LOGHDR hdr = LOGHDR.newBuilder().build();
>>>> MSGHDR msg = MSGHDR.newBuilder().build();
>>>> p.setHdr(hdr);
>>>> p.setMsg(msg);
>>>> // ... remaining setters elided ...
>>>> datumFileWriter.write(p);
>>>>
>>>> This Avro schema has around 1,800 fields, including 26 nested types.
>>>> I did some load testing and found that if I serialize the same object
>>>> to disk, performance is 6x faster than constructing a new object each
>>>> time (p.build). When a new Avro object is constructed every time
>>>> using RecordBuilder.build(), much of the time is spent in
>>>> GenericData.deepCopy(). Has anyone run into a similar problem? We are
>>>> using Avro 1.8.2.
>>>>
>>>> Thanks,
>>>> Nishanth
>>>>
