Thanks Doug. Here is a comparison.

Avro record size: roughly 15 KB.

I used the same payload with a schema that has around 2,000 fields and
with another schema that has 5 fields. In both cases I reused the Avro
object, building it once with a builder. The test wrote 1M records, the
same amount of data (1 GB), to a local drive, and I ran it a few times
single threaded. Average TPS with the smaller schema is 40K, whereas with
the bigger schema it drops to 10K even though both write the same amount
of data. Since I am only creating the Avro object once in both cases, it
looks like there is overhead in the DataFileWriter too for bigger schemas.
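For reference, the small schema is shaped roughly like this (a sketch from
memory, not the exact file: field names match the generated setters in the
code below, lastName stands in for the fifth field, and every field carries
a default so that User.newBuilder().build() succeeds without setting it):

{
  "type": "record",
  "name": "User",
  "doc": "Sketch of the 5-field test schema; names and types are assumed.",
  "fields": [
    {"name": "firstName", "type": ["null", "string"], "default": null},
    {"name": "lastName", "type": ["null", "string"], "default": null},
    {"name": "favoriteNumber", "type": ["null", "int"], "default": null},
    {"name": "favoriteColor", "type": ["null", "string"], "default": null},
    {"name": "data", "type": "bytes", "default": ""}
  ]
}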
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.Random;
import java.util.concurrent.TimeUnit;

import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.google.common.base.Stopwatch;

public class LoadGenerator {

    public static void main(String[] args) {
        try {
            new LoadGenerator().load();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public void load() throws IOException {
        Stopwatch stopwatch = Stopwatch.createStarted();

        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(conf);

        DatumWriter<User> datumWriter = new SpecificDatumWriter<>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<>(datumWriter);
        dataFileWriter.setCodec(CodecFactory.snappyCodec());
        dataFileWriter.setFlushOnEveryBlock(false);

        Path path = new Path("/projects/tmp/load.avro");
        OutputStream outStream = fileSystem.create(path, true);
        dataFileWriter.create(User.getClassSchema(), outStream);

        // Build the record once so all defaults are populated, then reuse the
        // same instance, resetting only the fields that change per record.
        User user = User.newBuilder().build();
        Random random = new Random();

        int numRecords = 1000000;
        for (int i = 0; i < numRecords; i++) {
            user.setFirstName("testName" + random.nextLong());
            user.setFavoriteNumber(random.nextInt());
            user.setFavoriteColor("blue" + random.nextFloat());
            user.setData(ByteBuffer.wrap(new byte[15000])); // ~15 KB payload
            dataFileWriter.append(user);
        }
        dataFileWriter.close();

        stopwatch.stop();
        System.out.println("Time elapsed for load() is "
            + stopwatch.elapsed(TimeUnit.SECONDS) + " s");
    }
}
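For contrast, the slower variant from my earlier test built a fresh record
on every iteration, roughly like this (a sketch; same generated User class,
writer, and locals as above):

    for (int i = 0; i < numRecords; i++) {
        // A new builder per record: build() deep-copies the default value
        // of every field that is not explicitly set, which is where the
        // GenericData.deepCopy time shows up in profiles.
        User u = User.newBuilder()
            .setFirstName("testName" + random.nextLong())
            .setFavoriteNumber(random.nextInt())
            .setFavoriteColor("blue" + random.nextFloat())
            .setData(ByteBuffer.wrap(new byte[15000]))
            .build();
        dataFileWriter.append(u);
    }

Since that copy cost grows with the field count, this variant would
penalize the ~2k-field schema even further than the reuse version above.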
On Mon, Jan 29, 2018 at 11:01 AM, Doug Cutting <[email protected]> wrote:
> Builders have some inherent overheads. Things could be optimized to
> better minimize this, but it will likely always be faster to reuse a single
> instance when writing.
>
> The deepCopy calls are probably of the default values of each field you're not
> setting. If you're only setting a few fields then you might use a builder
> to create a single instance so its defaults are set, then reuse that
> instance as you write, setting only those few fields you need to differ
> from the default. (This only works if you're setting the same fields every
> time. Otherwise you'd need to restore the default value.)
>
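That restore caveat is worth spelling out. If a field is only set on some
records, I take it the default would have to be put back by hand before the
next append, something like this (a sketch; optionalNote is a hypothetical
nullable field with a null default, not one of my actual fields):

    if (hasNote) {
        user.setOptionalNote(note);
    } else {
        // hypothetical field: restore the schema default manually
        user.setOptionalNote(null);
    }
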
> An optimization for Avro here might be to inline default values for
> immutable types when generating the build() method.
>
> Doug
>
> On Fri, Jan 26, 2018 at 9:04 AM, Nishanth S <[email protected]>
> wrote:
>
>> Hello Everyone,
>>
>> We have a process that reads data from a local file share, serializes
>> it, and writes it to HDFS in Avro format. I am just wondering if I am
>> building the Avro objects correctly. For every record read from the
>> binary file we create an equivalent Avro object in the format below.
>>
>> Parent p = new Parent();
>> LOGHDR hdr = LOGHDR.newBuilder().build();
>> MSGHDR msg = MSGHDR.newBuilder().build();
>> p.setHdr(hdr);
>> p.setMsg(msg);
>> p.set...
>> p.set...
>> datumFileWriter.write(p);
>>
>> This Avro schema has around 1,800 fields, including 26 nested types
>> within it. I did some load testing and found that if I serialize the
>> same object to disk, performance is 6x faster than constructing a new
>> object (p.build()) each time. When a new Avro object is constructed
>> every time using RecordBuilder.build(), much of the time is spent in
>> GenericData.deepCopy(). Has anyone run into a similar problem? We are
>> using Avro 1.8.2.
>>
>> Thanks,
>> Nishanth
>>