Re: Dynamic Schema

Martin Kleppmann Wed, 02 Apr 2014 14:03:19 -0700

Hi Amit,

The Avro data file format requires the writer to know the schema from the 
start, because all records in the file are then written with the same schema. 
So there probably isn't an alternative to what you're doing -- to buffer as 
much as you can in memory, write it out to a file when the memory buffer is 
full, and then start a new file.
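
As a rough illustration of that buffer-then-roll approach, here is a minimal Python sketch. It does not call any Avro library: the schema is built as a plain dict in Avro's JSON schema form, and the names infer_schema, RollingWriter, AVRO_TYPES and the write_file callback are all hypothetical. It assumes flat records with primitive values, and it bounds the buffer by record count rather than by actual memory use, which is a simplification.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

# Assumed mapping from Python value types to Avro primitive type names.
AVRO_TYPES = {
    type(None): "null", bool: "boolean", int: "long",
    float: "double", str: "string", bytes: "bytes",
}


def infer_schema(records: List[Record], name: str = "Buffered") -> Dict[str, Any]:
    """Derive one Avro record schema (as a dict) covering every buffered record."""
    branches: Dict[str, List[str]] = {}
    for rec in records:
        for key, value in rec.items():
            avro_type = AVRO_TYPES[type(value)]
            seen = branches.setdefault(key, [])
            if avro_type not in seen:
                seen.append(avro_type)
    fields = []
    for key, types in branches.items():
        missing_somewhere = any(key not in rec for rec in records)
        if missing_somewhere or "null" in types:
            # Optional field: "null" goes first so that the null default is valid.
            union = ["null"] + [t for t in types if t != "null"]
            fields.append({"name": key, "type": union, "default": None})
        else:
            fields.append({"name": key, "type": types[0] if len(types) == 1 else types})
    return {"type": "record", "name": name, "fields": fields}


class RollingWriter:
    """Buffers records; when the buffer is full, computes a schema and starts a new file."""

    def __init__(self, max_records: int,
                 write_file: Callable[[int, Dict[str, Any], List[Record]], None]):
        self.max_records = max_records
        self.write_file = write_file  # hypothetical: writes one data file with one schema
        self.buffer: List[Record] = []
        self.file_index = 0

    def append(self, record: Record) -> None:
        self.buffer.append(record)
        if len(self.buffer) >= self.max_records:
            self.flush()

    def flush(self) -> None:
        if not self.buffer:
            return
        schema = infer_schema(self.buffer)  # schema is fixed before the file is written
        self.write_file(self.file_index, schema, self.buffer)
        self.file_index += 1
        self.buffer = []
```

In a real setup the write_file callback would hand the schema and records to an Avro data file writer. Each output file then has exactly one schema, known before its first record is written.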

You can't change the schema of a data file once it has been written, but you 
can run a background process which merges several data files together, and 
writes the result to a new file. You can make the merged file's schema the 
union of all the input file schemas, or you can write some application-specific 
code which combines the schemas into one, and evolve all the records into that 
merged schema. This can be done by streaming through the files -- you don't 
need to keep all the data in memory.
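
To make the second option (application-specific combining of schemas) concrete, here is a hedged Python sketch. It works on schemas as plain dicts in Avro's JSON form and on records as dicts, and leaves the actual file reading and writing to whatever Avro library you use. The function names merge_record_schemas and evolve are hypothetical, and it assumes flat record schemas whose field types are primitives or unions of primitives. Note that this is a field-by-field combination, not a top-level Avro union of the input record schemas.

```python
from typing import Any, Dict, Iterable, Iterator, List

Schema = Dict[str, Any]
Record = Dict[str, Any]


def merge_record_schemas(schemas: List[Schema], name: str = "Merged") -> Schema:
    """Combine record schemas field by field into a single record schema."""
    branches: Dict[str, List[Any]] = {}   # field name -> type branches seen
    seen_in: Dict[str, int] = {}          # field name -> number of schemas containing it
    for schema in schemas:
        for field in schema["fields"]:
            ftype = field["type"]
            types = ftype if isinstance(ftype, list) else [ftype]
            target = branches.setdefault(field["name"], [])
            for t in types:
                if t not in target:
                    target.append(t)
            seen_in[field["name"]] = seen_in.get(field["name"], 0) + 1
    fields = []
    for fname, types in branches.items():
        optional = seen_in[fname] < len(schemas) or "null" in types
        if optional:
            # Fields absent from some inputs become nullable with a null default.
            union = ["null"] + [t for t in types if t != "null"]
            fields.append({"name": fname, "type": union, "default": None})
        else:
            fields.append({"name": fname, "type": types[0] if len(types) == 1 else types})
    return {"type": "record", "name": name, "fields": fields}


def evolve(records: Iterable[Record], merged: Schema) -> Iterator[Record]:
    """Stream records one at a time, filling fields they lack with None."""
    names = [f["name"] for f in merged["fields"]]
    for rec in records:
        yield {n: rec.get(n) for n in names}
```

Because evolve is a generator, the merge job can read each input file record by record and write straight to the output file, so memory use stays independent of the file sizes.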

Martin

On 1 Apr 2014, at 21:55, amit nanda <[email protected]> wrote:
> I have very dynamic data that I want to write to an Avro file. The solution I 
> have is to store all that data in memory, then calculate the schema, and 
> then start writing. 
> 
> This causes the files to be smaller in size, because of memory 
> limitations.
> 
> What I am looking for is to start writing data as and when it is collected, 
> but how should I compute the schema in this case? Can I change the schema of 
> an Avro file?
> 
> Thanks
> Amit
