Hi Juan! I originally considered showing you the AvroSchemaGenerator, but I thought it was a bit complex and very specific to XML Schema itself. I think you would have better luck understanding how either Protobuf or Thrift schemas are converted to Avro instead, as those are more generic, and the feature set more closely maps to Avro.
To answer your question, I was never able to find a use case where creating an Avro schema from only a list of fields worked for me. That was okay in my case, because I could just use the corresponding XML element name and namespace when creating the record. You might have better luck, depending on your use case? I unfortunately do not know of an existing tool that solves your problem, and I poked around the existing code and JIRA tickets for a bit and came up empty.

I originally thought you could write a clone function yourself, creating a new schema as you recursively descend through the old one, adding in any changes you want to make along the way. (The comparison tool I showed you would make a good template.) That said, you might have better luck using the Avro Schema IDL[1], rather than rolling your own?

Good luck!
Mike

[1] http://avro.apache.org/docs/1.7.7/idl.html

On Wed, Aug 20, 2014 at 3:19 AM, Juan Rodríguez Hortalá <[email protected]> wrote:

> Hi Michael,
>
> Thanks a lot for your suggestion. I've found particularly interesting the
> class
> https://github.com/mikepigott/xml-to-avro/blob/master/avro-to-xml/src/main/java/org/apache/avro/xml/AvroSchemaGenerator.java,
> which I understand generates an Avro schema by visiting an XML document.
> I assume that you used a fresh name for the record in each node;
> otherwise you might have encountered problems like the following.
> Starting from a Schema object 'personSchema' containing the following
> schema:
>
> {
>   "type" : "record",
>   "name" : "Person",
>   "namespace" : "test",
>   "doc" : "Schema for test.SchemasTest$Person",
>   "fields" : [ {
>     "name" : "age",
>     "type" : "int"
>   }, {
>     "name" : "name",
>     "type" : [ "null", "string" ]
>   } ]
> }
>
> the following code works ok:
>
> Schema twoPersons = Schema.createRecord(Arrays.asList(
>     new Schema.Field(personSchema.getName() + "_1", personSchema,
>         personSchema.getDoc() + " _1", null),
>     new Schema.Field(personSchema.getName() + "_2", personSchema,
>         personSchema.getDoc() + " _2", null)));
>
> but when I use the new Schema object twoPersons it's pretty easy to
> encounter an exception. For example:
>
> System.out.println(new Schema.Parser().setValidate(true).parse(
>     twoPersons.toString()));
>
> throws
>
> org.apache.avro.SchemaParseException: No name in schema:
> {"type":"record","fields":[{"name":"Person_1","type":{"type":"record","name":"Person","namespace":"test","doc":"Schema
> for
> test.SchemasTest$Person","fields":[{"name":"age","type":"int"},{"name":"name","type":["null","string"]}]},"doc":"Schema
> for test.SchemasTest$Person
> _1"},{"name":"Person_2","type":"test.Person","doc":"Schema for
> test.SchemasTest$Person _2"}]}
>     at org.apache.avro.Schema.getRequiredText(Schema.java:1221)
>     at org.apache.avro.Schema.parse(Schema.java:1092)
>     at org.apache.avro.Schema$Parser.parse(Schema.java:953)
>     at org.apache.avro.Schema$Parser.parse(Schema.java:943)
>     at com.lambdoop.sdk.core.SchemasTest.createRecordFailTest(SchemasTest.java:232)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:606)
>     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
>     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
>     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
>     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
>     at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
>     at org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
>     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
>     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
>     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
>     at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
>     at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
>     at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
>     at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
>
> Adding the name with twoPersons.addProp("name", "twoPersons") doesn't
> work because "name" is a reserved property. SchemaBuilder cannot be used
> either, because it doesn't allow adding Schema objects to a field, but
> just creating schemas from scratch.
>
> Another problem I have is that when I convert the schemas to Jackson's
> JsonNode. Starting from an empty schema like
>
> {
>   "type" : "record",
>   "name" : "Person",
>   "namespace" : "test",
>   "fields" : [ ]
> }
>
> if I add a field with schema Person by manipulating the JsonNode, when I
> convert back to an Avro Schema object I get a "Can't redefine:
> test.Person".
> My conclusions then are:
> - every record needs to have a name
> - two records with the same name must have the same schema
>
> That is not very surprising, as it corresponds to what is specified in
> http://avro.apache.org/docs/current/spec.html. I was wondering if anyone
> knows about a library for transforming Avro schemas that is able to do
> things like adding an existing schema as a new field of another schema,
> and that has already dealt with these details.
>
> Thanks a lot for your help,
>
> Greetings,
>
> Juan Rodríguez
>
>
> 2014-08-19 7:04 GMT-07:00 Michael Pigott <[email protected]>:
>
>> Hi Juan,
>> That sounds really complex. Would you instead be able to build or
>> retrieve the original Avro Schema objects, and then build a new Schema
>> from its definition? For my work on transforming XML to Avro and
>> back[1], I wrote a comparison tool to confirm that two Avro Schemas are
>> equivalent by recursively descending through both schemas[2]. Perhaps
>> you can use something similar to build a transformed Avro schema in
>> memory, by applying your transformations on the fly?
>>
>> Good luck!
>> Mike
>>
>> [1] https://issues.apache.org/jira/browse/AVRO-457
>> [2] https://github.com/mikepigott/xml-to-avro/blob/master/avro-to-xml/src/test/java/org/apache/avro/xml/UtilsForTests.java
>>
>>
>> On Tue, Aug 19, 2014 at 2:23 AM, Juan Rodríguez Hortalá <
>> [email protected]> wrote:
>>
>>> Hi list,
>>>
>>> I'm working on a project in Java where we have a DSL working on
>>> GenericRecord objects, over which we define record transformation
>>> operations like projections, filters and so on. This implies that the
>>> Avro schema of the records evolves by adding and deleting record
>>> fields. As a result, the Avro schemas used are different in each
>>> program, depending on the operations used. Hence I have to define Avro
>>> schema transformations and generate new schemas as modifications of
>>> other schemas.
>>> For that, the Avro schema builder classes are only useful for the
>>> starting schema, and so is a POJO-to-schema mapping like avro-jackson.
>>> The main problem I face is that in Avro, by design, "schema objects
>>> are logically immutable", as stated in the documentation. So far I've
>>> taken the approach of converting the schema to a string, parsing it
>>> with Jackson, manipulating its representation as a JsonNode, and then
>>> parsing it back to Avro. In that latter step I sometimes have problems
>>> because Avro records are named, and anonymous records are not always
>>> legal in complete schemas; or because the same record name cannot be
>>> used twice in two child fields of a parent record. I was then thinking
>>> of using generated schema names, with an increasing ID or a random
>>> UUID. Anyway, my question is: is the approach I'm describing correct?
>>> Are you aware of some library for creating new Avro schemas by
>>> manipulating an input schema? Maybe those capabilities are already
>>> present in Avro's Java API but I haven't noticed.
>>>
>>> Any help will be welcome. Thanks a lot in advance.
>>>
>>> Greetings,
>>>
>>> Juan Rodríguez Hortalá
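[Editor's note] The JSON-level fix for both errors discussed in the thread can be sketched without the Avro Java API at all: give the wrapper record its own name, inline the `test.Person` definition only once, and refer to it by its full name afterwards. A minimal sketch in Python, using only the stdlib `json` module as a stand-in for the Jackson `JsonNode` manipulation Juan describes (the wrapper name `TwoPersons` is made up for illustration):

```python
import json

# The original schema from the thread (doc attributes omitted for brevity).
person = {
    "type": "record",
    "name": "Person",
    "namespace": "test",
    "fields": [
        {"name": "age", "type": "int"},
        {"name": "name", "type": ["null", "string"]},
    ],
}

# A record schema must carry a name ("No name in schema" otherwise), and a
# named type may be *defined* only once ("Can't redefine: test.Person");
# every later occurrence must be a reference by full name.
two_persons = {
    "type": "record",
    "name": "TwoPersons",  # hypothetical name for illustration
    "namespace": "test",
    "fields": [
        {"name": "Person_1", "type": person},         # first use: definition
        {"name": "Person_2", "type": "test.Person"},  # later uses: reference
    ],
}

print(json.dumps(two_persons, indent=2))
```

In the Java API the same shape is reached with the four-argument `Schema.createRecord(name, doc, namespace, isError)` followed by `setFields(...)`, rather than the name-less `createRecord(List<Field>)` overload used in the failing snippet above.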

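[Editor's note] Mike's suggestion of a clone function that recursively descends through the old schema, combined with Juan's idea of generated names, can also be sketched at the JSON level. The function below is my own illustration, not an existing Avro utility: it clones a parsed schema while appending a suffix to every record name, so the clone can coexist with the original without the "same name used twice" collisions. (A full version would also rewrite string name references and handle enum and fixed types.)

```python
def rename_records(schema, suffix):
    """Recursively clone an Avro schema given as parsed JSON, appending
    `suffix` to every record name; the input is never mutated."""
    if isinstance(schema, list):
        # Union: clone each branch.
        return [rename_records(s, suffix) for s in schema]
    if isinstance(schema, dict):
        out = dict(schema)
        if out.get("type") == "record":
            out["name"] = out["name"] + suffix
            out["fields"] = [
                dict(f, type=rename_records(f["type"], suffix))
                for f in out["fields"]
            ]
        elif out.get("type") == "array":
            out["items"] = rename_records(out["items"], suffix)
        elif out.get("type") == "map":
            out["values"] = rename_records(out["values"], suffix)
        return out
    # Primitive types and name references pass through unchanged.
    return schema

person = {
    "type": "record", "name": "Person", "namespace": "test",
    "fields": [{"name": "age", "type": "int"},
               {"name": "name", "type": ["null", "string"]}],
}

clone = rename_records(person, "_v2")
print(clone["name"])  # -> Person_v2; `person` itself is left untouched
```

The same recursive-descent shape carries over directly to Java against immutable `Schema` objects: match on `schema.getType()`, rebuild records with `Schema.createRecord(...)` plus fresh `Schema.Field` instances, and return primitives as-is.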