Thanks, Doug. Here is a comparison. Load Avro record size: roughly 15 KB.
I used the same payload with a schema that has around 2k fields, and with another schema that has 5 fields. I reused the Avro object in both cases, using a builder once. The test wrote 1M records, the same amount of data (1 GB) in each case, to a local drive, and was run a few times single-threaded. Average TPS with the smaller schema is 40K, whereas with the bigger schema it drops to 10K, even though both write the same amount of data. Since I am only creating the Avro object once in both cases, it looks like there is overhead in the DataFileWriter too in the case of bigger schemas.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadGenerator {

    public static void main(String[] args) {
        try {
            new LoadGenerator().load();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    DataFileWriter<User> dataFileWriter;
    DatumWriter<User> datumWriter;
    FileSystem hdfsFileSystem;
    Configuration conf;
    Path path;
    OutputStream outStream;
    User user;
    // Starts timing at construction, so setup cost is included.
    com.google.common.base.Stopwatch stopwatch = new com.google.common.base.Stopwatch().start();

    public void load() throws IOException {
        conf = new Configuration();
        hdfsFileSystem = FileSystem.get(conf);
        datumWriter = new SpecificDatumWriter<User>(User.class);
        dataFileWriter = new DataFileWriter<User>(datumWriter);
        dataFileWriter.setCodec(CodecFactory.snappyCodec());
        path = new Path("/projects/tmp/load.avro");
        outStream = hdfsFileSystem.create(path, true);
        dataFileWriter.create(User.getClassSchema(), outStream);
        dataFileWriter.setFlushOnEveryBlock(false);

        // Create and load User records.
        int numRecords = 1000000;
        for (int i = 0; i < numRecords; i++) {
            user = User.newBuilder().build();
            user.setFirstName("testName" + new Random().nextLong());
            user.setFavoriteNumber(Integer.valueOf(new Random().nextInt()));
            user.setFavoriteColor("blue" + new Random().nextFloat());
            user.setData(ByteBuffer.wrap(new byte[15000]));
            dataFileWriter.append(user);
        }
        dataFileWriter.close();

        stopwatch.stop();
        long elapsedTime = stopwatch.elapsedTime(TimeUnit.SECONDS);
        System.out.println("Time elapsed for load() is " + elapsedTime);
    }
}
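For reference, the reuse variant of that loop looks roughly like this. This is a sketch against the same generated User class, not the exact test code: the User is built once, so the builder/deepCopy cost of the defaults is paid a single time, and only the changing fields are set per record. DataFileWriter serializes each datum during append(), so mutating the instance between appends is safe.

        // Sketch: single-instance reuse. Assumes the dataFileWriter and
        // numRecords set up in load() above.
        User reused = User.newBuilder().build(); // builder + deepCopy of defaults runs once
        byte[] payload = new byte[15000];        // reuse the ~15 KB payload buffer as well
        Random random = new Random();
        for (int i = 0; i < numRecords; i++) {
            reused.setFirstName("testName" + random.nextLong());
            reused.setFavoriteNumber(random.nextInt());
            reused.setFavoriteColor("blue" + random.nextFloat());
            reused.setData(ByteBuffer.wrap(payload));
            dataFileWriter.append(reused);       // serialized immediately; safe to mutate after
        }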
On Mon, Jan 29, 2018 at 11:01 AM, Doug Cutting <cutt...@gmail.com> wrote:

> Builders have some inherent overheads. Things could be optimized to
> better minimize this, but it will likely always be faster to reuse a single
> instance when writing.
>
> The deepCopy's are probably of the default values of each field you're not
> setting. If you're only setting a few fields then you might use a builder
> to create a single instance so its defaults are set, then reuse that
> instance as you write, setting only those few fields you need to differ
> from the default. (This only works if you're setting the same fields every
> time. Otherwise you'd need to restore the default value.)
>
> An optimization for Avro here might be to inline default values for
> immutable types when generating the build() method.
>
> Doug
>
> On Fri, Jan 26, 2018 at 9:04 AM, Nishanth S <nishanth.2...@gmail.com> wrote:
>
>> Hello Everyone,
>>
>> We have a process that reads data from a local file share, serializes it,
>> and writes it to HDFS in Avro format. I am just wondering if I am building
>> the Avro objects correctly. For every record read from the binary file we
>> create an equivalent Avro object in the format below.
>>
>> Parent p = new Parent();
>> LOGHDR hdr = LOGHDR.newBuilder().build();
>> MSGHDR msg = MSGHDR.newBuilder().build();
>> p.setHdr(hdr);
>> p.setMsg(msg);
>> p..
>> p..set
>> datumFileWriter.write(p);
>>
>> This Avro schema has around 1800 fields, including 26 nested types. I did
>> some load testing and found that if I serialize the same object to disk,
>> performance is 6x faster than constructing a new object (p.build) each
>> time. When a new Avro object is constructed every time using
>> RecordBuilder.build(), much of the time is spent in GenericData.deepCopy().
>> Has anyone run into a similar problem? We are using Avro 1.8.2.
>>
>> Thanks,
>> Nishanth
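A rough sketch of the pattern Doug describes, applied to the schema from the original message. It assumes the same generated Parent/LOGHDR/MSGHDR classes; the input loop (hasMoreRecords/nextSeqNo) and the setter shown are hypothetical placeholders, not real API from that schema:

        // Build one fully-defaulted instance up front; the deepCopy of the
        // ~1800 default values happens only here, not per record.
        Parent p = Parent.newBuilder().build();
        LOGHDR hdr = LOGHDR.newBuilder().build();
        MSGHDR msg = MSGHDR.newBuilder().build();
        p.setHdr(hdr);
        p.setMsg(msg);

        while (hasMoreRecords()) {        // hypothetical input loop
            // Set only the fields that change per record; the setter name is
            // an illustrative placeholder for the real generated setter.
            hdr.setSeqNo(nextSeqNo());    // hypothetical
            // Per Doug's caveat: if a field is set for some records but not
            // others, restore its default here before writing.
            datumFileWriter.write(p);     // same writer call as in the snippet
                                          // above; the record is serialized
                                          // immediately, so reuse is safe
        }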