Thanks for taking a look, Doug. This was a cooked-up schema I used for testing; it only has 4 fields. I did a simple test writing 1M records, with close to 1 GB of data written to disk, and TPS has been consistent at 44K. In this case I did not see much of a difference between reusing the object and doing a build() for every iteration. However, in the case of the actual schema, which has close to 2K fields, I can achieve only 5K TPS with no reuse and 9K TPS with reuse (data size written to disk is 1 GB). I just added a bytes field to both my test schema and the actual schema to increase the data volume for the test; the rest are all default values. Is there any other way to improve performance? We do not use the Avro sorting capabilities, so I also tried setting order=ignore for a major chunk of fields, but that did not have an impact. Appreciate you taking a look.
Thanks,
Nishanth

On Mon, Feb 5, 2018 at 9:34 AM, Doug Cutting <cutt...@gmail.com> wrote:

> Your code builds a new builder and instance each time through the loop:
>
>     for (int i = 0; i < 1000000; i++) {
>         user = User.newBuilder().build();
>         ...
>
> How does it perform if you move that second line outside the loop?
>
> Thanks,
>
> Doug
>
> On Fri, Feb 2, 2018 at 3:50 PM, Nishanth S <nishanth.2...@gmail.com> wrote:
>
>> Thanks, Doug. Here is a comparison.
>>
>> Load Avro record size: roughly 15 KB.
>>
>> I have used the same payload with a schema that has around 2K fields
>> and also with another schema that has 5 fields. I reused the Avro
>> object in both cases, using a builder once. The test was run for 1M
>> records, writing the same amount of data (1 GB) to a local drive. I ran
>> this a few times, single threaded. Average TPS in the case of the smaller
>> schema is 40K, whereas with the bigger schema it drops down to 10K even
>> though both are writing the same amount of data. Since I am only creating
>> the Avro object once in both cases, it looks like there is an overhead in
>> the DataFileWriter too in the case of bigger schemas.
>>
>> public static void main(String[] args) {
>>     try {
>>         new LoadGenerator().load();
>>     } catch (IOException e) {
>>         e.printStackTrace();
>>     }
>> }
>>
>> DataFileWriter<User> dataFileWriter;
>> DatumWriter<User> datumWriter;
>> FileSystem hdfsFileSystem;
>> Configuration conf;
>> Path path;
>> OutputStream outStream;
>> User user;
>> com.google.common.base.Stopwatch stopwatch =
>>     new com.google.common.base.Stopwatch().start();
>>
>> public void load() throws IOException {
>>     conf = new Configuration();
>>     hdfsFileSystem = FileSystem.get(conf);
>>     datumWriter = new SpecificDatumWriter<User>(User.class);
>>     dataFileWriter = new DataFileWriter<User>(datumWriter);
>>     dataFileWriter.setCodec(CodecFactory.snappyCodec());
>>     path = new Path("/projects/tmp/load.avro");
>>     outStream = hdfsFileSystem.create(path, true);
>>     dataFileWriter.create(User.getClassSchema(), outStream);
>>     dataFileWriter.setFlushOnEveryBlock(false);
>>     // Create and load User
>>     int numRecords = 1000000;
>>     for (int i = 0; i < numRecords; i++) {
>>         user = User.newBuilder().build();
>>         user.setFirstName("testName" + new Random().nextLong());
>>         user.setFavoriteNumber(Integer.valueOf(new Random().nextInt()));
>>         user.setFavoriteColor("blue" + new Random().nextFloat());
>>         user.setData(ByteBuffer.wrap(new byte[15000]));
>>         dataFileWriter.append(user);
>>     }
>>     dataFileWriter.close();
>>     stopwatch.stop();
>>     long elapsedTime = stopwatch.elapsedTime(TimeUnit.SECONDS);
>>     System.out.println("Time elapsed for load() is " + elapsedTime);
>> }
>>
>> On Mon, Jan 29, 2018 at 11:01 AM, Doug Cutting <cutt...@gmail.com> wrote:
>>
>>> Builders have some inherent overheads. Things could be optimized to
>>> better minimize this, but it will likely always be faster to reuse a
>>> single instance when writing.
>>>
>>> The deepCopy's are probably of the default values of each field you're
>>> not setting.
>>> If you're only setting a few fields, then you might use a
>>> builder to create a single instance so its defaults are set, then reuse
>>> that instance as you write, setting only those few fields you need to
>>> differ from the default. (This only works if you're setting the same
>>> fields every time. Otherwise you'd need to restore the default value.)
>>>
>>> An optimization for Avro here might be to inline default values for
>>> immutable types when generating the build() method.
>>>
>>> Doug
>>>
>>> On Fri, Jan 26, 2018 at 9:04 AM, Nishanth S <nishanth.2...@gmail.com> wrote:
>>>
>>>> Hello everyone,
>>>>
>>>> We have a process that reads data from a local file share, serializes
>>>> it, and writes it to HDFS in Avro format. I am just wondering if I am
>>>> building the Avro objects correctly. For every record that is read from
>>>> the binary file, we create an equivalent Avro object in the below format:
>>>>
>>>> Parent p = new Parent();
>>>> LOGHDR hdr = LOGHDR.newBuilder().build();
>>>> MSGHDR msg = MSGHDR.newBuilder().build();
>>>> p.setHdr(hdr);
>>>> p.setMsg(msg);
>>>> p..
>>>> p..set
>>>> datumFileWriter.write(p);
>>>>
>>>> This Avro schema has around 1800 fields, including 26 nested types. I
>>>> did some load testing and found that if I serialize the same object to
>>>> disk, performance is 6x faster than constructing a new object every
>>>> time. When a new Avro object is constructed each time using
>>>> RecordBuilder.build(), much of the time is spent in
>>>> GenericData.deepCopy(). Has anyone run into a similar problem? We are
>>>> using Avro 1.8.2.
>>>>
>>>> Thanks,
>>>> Nishanth
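The reuse pattern Doug describes (build one instance so its defaults are populated, then mutate only the changing fields inside the write loop) can be sketched without Avro on the classpath at all. The `User` class below is a hypothetical, plain-Java stand-in for the generated Avro class, with made-up field names and defaults; it illustrates only the object-reuse idea, not Avro's actual builder internals.

```java
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {
    // Hypothetical stand-in for an Avro-generated class: the builder's
    // build() populates schema defaults (in real Avro this is where the
    // costly GenericData.deepCopy of default values happens per call).
    static class User {
        String firstName = "unknown";   // assumed schema default
        int favoriteNumber = 0;         // assumed schema default
        String favoriteColor = "none";  // assumed schema default

        static Builder newBuilder() { return new Builder(); }

        static class Builder {
            User build() { return new User(); }
        }
    }

    public static void main(String[] args) {
        // Build ONE instance up front so its defaults are set...
        User user = User.newBuilder().build();
        List<String> written = new ArrayList<>();

        // ...then reuse it in the hot loop, setting only the fields that
        // change. This is safe only because the record is "written" before
        // the next mutation, and the same fields are set every iteration,
        // so the untouched fields keep their defaults.
        for (int i = 0; i < 3; i++) {
            user.firstName = "testName" + i;
            written.add(user.firstName + "/" + user.favoriteColor);
        }

        for (String s : written) {
            System.out.println(s);
        }
    }
}
```

With a real Avro `SpecificRecord`, the same shape applies: hoist `User.newBuilder().build()` above the loop and call the setters inside it, which is the change that moved the thread's numbers from 5K to 9K TPS.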