Re: Schema-Aware PCollections revisited

2018-03-05 Thread Reuven Lax
Of course! I think some BeamSQL folks should be involved as well, as this directly affects SQL work. Anton especially has expressed interest in Row and schemas. Reuven On Mon, Mar 5, 2018 at 4:30 AM Jean-Baptiste Onofré wrote: > Cool, > > can I work with you on this

Re: Schema-Aware PCollections revisited

2018-03-05 Thread Jean-Baptiste Onofré
Cool, can I work with you on this (sharing a branch, for instance)? Thanks! Regards JB On 03/05/2018 01:01 PM, Reuven Lax wrote: > Yes, I do have a PoC in progress. The Beam Row class was being refactored, so I > paused to wait for that to finish. > > > On Sun, Mar 4, 2018 at 8:24 PM

Re: Schema-Aware PCollections revisited

2018-03-05 Thread Reuven Lax
Yes, I do have a PoC in progress. The Beam Row class was being refactored, so I paused to wait for that to finish. On Sun, Mar 4, 2018 at 8:24 PM Jean-Baptiste Onofré wrote: > Hi Reuven, > > I'm reviving this discussion as I think it would be a great addition. > > We had

Re: Schema-Aware PCollections revisited

2018-03-04 Thread Jean-Baptiste Onofré
Hi Reuven, I'm reviving this discussion as I think it would be a great addition. We had some discussion on the fly, but I think that now, as a base for discussion, it would be great to have a feature branch where we can start a sketch/implementation and discuss. @Reuven, did you start a PoC with what you proposed: -

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
On Mon, Feb 5, 2018 at 9:06 PM, Kenneth Knowles wrote: > Joining late, but very interested. Commented on the doc. Since there's a > forked discussion between doc and thread, I want to say this on the thread: > > 1. I have used JSON schema in production for describing the

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
I would add a use case: a single serialization mechanism across a pipeline. JSON allows handling generic records (JsonObject) as well as POJO serialization, and both are compatible. Compared to Avro's built-in mechanism, it is not intrusive in the models, which is a key feature of an API. It also

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Kenneth Knowles
Joining late, but very interested. Commented on the doc. Since there's a forked discussion between doc and thread, I want to say this on the thread: 1. I have used JSON schema in production for describing the structure of analytics events and it is OK but not great. If you are sure your data is

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
None: JSON-P (the spec, so no specific implementation is required) as the record API, and a custom light wrapper for the schema, like https://github.com/Talend/component-runtime/blob/master/component-form/component-form-model/src/main/java/org/talend/sdk/component/form/model/jsonschema/JsonSchema.java (note this code

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
Which JSON library are you thinking of? At least in Java, there's always been the problem of having no good standard JSON library. On Mon, Feb 5, 2018 at 12:03 PM, Romain Manni-Bucau wrote: > > > On Feb 5, 2018 at 19:54, "Reuven Lax" wrote: > > multiplying by

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Romain Manni-Bucau
On Feb 5, 2018 at 19:54, "Reuven Lax" wrote: Multiplying by 1.0 doesn't really solve the right problems. The number type used by JavaScript (and by extension, the standard for JSON) only has 53 bits of precision. I've seen many, many bugs caused by this - the input

Re: Schema-Aware PCollections revisited

2018-02-05 Thread Reuven Lax
Multiplying by 1.0 doesn't really solve the right problems. The number type used by JavaScript (and by extension, the standard for JSON) only has 53 bits of precision. I've seen many, many bugs caused by this - the input data may easily contain numbers too large for 53 bits. In addition,
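
To illustrate the 53-bit limit mentioned above, here is a minimal Java sketch (not taken from the thread). It assumes numbers are carried as IEEE 754 doubles, as JavaScript does; the class and method names are hypothetical and chosen only for this example.

```java
import java.math.BigDecimal;

public class JsonNumberPrecision {
    // Largest integer n such that every integer in [-n, n] is exactly
    // representable as an IEEE 754 double (53-bit significand).
    public static final long MAX_SAFE_INTEGER = (1L << 53) - 1;

    /** Returns true if converting the long to a double loses no information. */
    public static boolean survivesDoubleRoundTrip(long value) {
        // new BigDecimal(double) yields the exact value held by the double,
        // so this comparison is exact even near Long.MAX_VALUE.
        return new BigDecimal(value).compareTo(new BigDecimal((double) value)) == 0;
    }

    public static void main(String[] args) {
        long safe = MAX_SAFE_INTEGER;   // 2^53 - 1
        long unsafe = (1L << 53) + 1;   // 2^53 + 1, needs 54 bits
        System.out.println(safe + " exact as double: " + survivesDoubleRoundTrip(safe));
        System.out.println(unsafe + " exact as double: " + survivesDoubleRoundTrip(unsafe));
    }
}
```

A 64-bit ID or nanosecond timestamp in the input data can fall in the second category, which is the class of bug described above.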

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
I'm off tonight, but can we try to do it next week (tomorrow)? If not, please reply to this thread with the outcomes and I'll catch up tomorrow morning. On Feb 4, 2018 at 20:23, "Reuven Lax" wrote: Cool, let's chat about this on slack for a bit (which I realized I've been signed out of

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Reuven Lax
Cool, let's chat about this on slack for a bit (which I realized I've been signed out of for some time). Reuven On Sun, Feb 4, 2018 at 9:21 AM, Jean-Baptiste Onofré wrote: > Sorry guys, I was off today. Happy to be part of the party too ;) > > Regards > JB > > On 02/04/2018

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Jean-Baptiste Onofré
Sorry guys, I was off today. Happy to be part of the party too ;) Regards JB On 02/04/2018 06:19 PM, Reuven Lax wrote: > Romain, since you're interested maybe the two of us should put together a > proposal for how to set these things (hints, schema) on PCollections? I don't > think it'll be hard

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Reuven Lax
Romain, since you're interested maybe the two of us should put together a proposal for how to set these things (hints, schema) on PCollections? I don't think it'll be hard - the previous list thread on hints already agreed on a general approach, and we would just need to flesh it out. BTW in the

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
2018-02-04 17:53 GMT+01:00 Reuven Lax : > > > On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau > wrote: > >> >> 2018-02-04 17:37 GMT+01:00 Reuven Lax : >> >>> I'm not sure where proto comes from here. Proto is one example of a type >>>

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Reuven Lax
On Sun, Feb 4, 2018 at 8:42 AM, Romain Manni-Bucau wrote: > > 2018-02-04 17:37 GMT+01:00 Reuven Lax : > >> I'm not sure where proto comes from here. Proto is one example of a type >> that has a schema, but only one example. >> >> 1. In the initial

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
2018-02-04 17:37 GMT+01:00 Reuven Lax : > I'm not sure where proto comes from here. Proto is one example of a type > that has a schema, but only one example. > > 1. In the initial prototype I want to avoid modifying the PCollection API. > So I think it's best to create a special

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Reuven Lax
I'm not sure where proto comes from here. Proto is one example of a type that has a schema, but only one example. 1. In the initial prototype I want to avoid modifying the PCollection API. So I think it's best to create a special SchemaCoder, and pass the schema into this coder. Later we might

Re: Schema-Aware PCollections revisited

2018-02-04 Thread Romain Manni-Bucau
@Reuven: is the proto only about passing the schema or also the generic type? There are 2.5 topics to solve for this issue: 1. How to pass the schema 1.a. hints? 2. What is the generic record type associated with a schema and how to express a schema relative to it I would be happy to help on 1.a and 2

Re: Schema-Aware PCollections revisited

2018-02-03 Thread Reuven Lax
One more thing. If anyone here has experience with various OSS metadata stores (Kafka Schema Registry is one example), would you like to collaborate on implementation? I want to make sure that source schemas can be stored in a variety of OSS metadata stores, and be easily pulled into a Beam

Re: Schema-Aware PCollections revisited

2018-02-03 Thread Reuven Lax
Hi all, If there are no concerns, I would like to start working on a prototype. It's just a prototype, so I don't think it will have the final API (e.g. for the prototype I'm going to avoid changing the API of PCollection, and use a "special" Coder instead). Also even once we go beyond the prototype,

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
If you need help on the JSON part I'm happy to help. To give a few hints on what is very doable: we can add an Avro module to Johnzon (ASF json{p,b} impl) to back jsonp by Avro (I guess it will be one of the first to be asked for), for instance. Romain Manni-Bucau @rmannibucau

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
Hmm, either it is semantically a hint or it is deducible from the transform. Taking the union of both, you cover all cases. How it is then forwarded from the transform to the runtime is part of the runner API, not the user (pipeline) API, so I'm not sure I see the case you reference where it has a semantic API. Can

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
I don't think "hint" is the right API, as schema is not a hint (it has semantic meaning). However I think the API for schema should look similar to any "hint" API. On Wed, Jan 31, 2018 at 11:40 AM, Romain Manni-Bucau wrote: > > > On Jan 31, 2018 at 20:16, "Reuven Lax"

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Romain Manni-Bucau
On Jan 31, 2018 at 20:16, "Reuven Lax" wrote: As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a JSON schema, or an Avro schema, or a Calcite schema, etc. there should be adapters that allow setting a

Re: Schema-Aware PCollections revisited

2018-01-31 Thread Reuven Lax
As to the question of how a schema should be specified, I want to support several common schema formats. So if a user has a JSON schema, or an Avro schema, or a Calcite schema, etc. there should be adapters that allow setting a schema from any of them. I don't think we should prefer one over the

Re: Schema-Aware PCollections revisited

2018-01-30 Thread Jean-Baptiste Onofré
Hi, I think we should avoid mixing two things in the discussion (and so in the document): 1. The element of the collection and the schema itself are two different things. In essence, Beam should not enforce any schema. That's why I think it's a good idea to set the schema optionally on the

Re: Schema-Aware PCollections revisited

2018-01-29 Thread Romain Manni-Bucau
On Jan 30, 2018 at 01:09, "Reuven Lax" wrote: On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau wrote: > Hi > > I have some questions on this: how would hierarchical schemas work? It seems this > is not really supported by the ecosystem (outside of custom

Re: Schema-Aware PCollections revisited

2018-01-29 Thread Romain Manni-Bucau
Hi I have some questions on this: how would hierarchical schemas work? It seems this is not really supported by the ecosystem (outside of custom stuff) :(. How would it integrate smoothly with other generic record types - N bridges? Concretely, I wonder if using a JSON API couldn't be beneficial: JSON-P is a

Re: Schema-Aware PCollections revisited

2018-01-28 Thread Jean-Baptiste Onofré
Hi Reuven, Thanks for the update! As I'm working with you on this, I fully agree, and it's a great doc gathering the ideas. It's clearly something we have to add ASAP in Beam, because it would allow new use cases for our users (in a simple way) and open new areas for the runners (for instance dataframe