Re: Pubsub to Beam SQL

2018-05-10 Thread Anton Kedin
Shared the doc. There is already a table provider for Kafka with CSV records. The implementation at the moment doesn't touch the IO itself, just wraps it. Implementing Kafka JSON records can be as easy as wrapping KafkaIO with JsonToRow
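
For context, the Kafka table provider referenced above maps a DDL statement onto KafkaIO without touching the IO itself; a CSV-backed table might be declared roughly like this (table name, location, and property keys are illustrative of the idea, not the provider's exact contract):

```sql
CREATE TABLE kafka_orders (
  f_id INT,
  f_name VARCHAR
)
TYPE 'kafka'
LOCATION 'localhost:9092/orders_topic'
TBLPROPERTIES '{"format": "csv"}'
```

Supporting JSON records would keep the same DDL surface and swap the CSV parsing step for a JsonToRow transform inside the wrapper, as Anton suggests.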

Re: Pubsub to Beam SQL

2018-05-10 Thread Reuven Lax
I think it's even easier for other sources. PubSub is a tricky one (for us at least) because Dataflow overrides the Beam native PubSub source with something different. Kafka is a pure Beam source. On Thu, May 10, 2018 at 1:39 PM Ismaël Mejía wrote: > Hi, Jumping a bit late to this

Re: Pubsub to Beam SQL

2018-05-10 Thread Ismaël Mejía
Hi, Jumping a bit late to this discussion. This sounds super nice. But I could not access the document. How hard would it be to do this for other 'unbounded' sources, e.g. Kafka? On Sat, May 5, 2018 at 2:56 AM Andrew Pilloud wrote: > I don't think we should jump to adding a

Re: Pubsub to Beam SQL

2018-05-04 Thread Andrew Pilloud
I don't think we should jump to adding an extension, but TBLPROPERTIES is already a DDL extension and it isn't user-friendly. We should strive for a world where no one needs to use it. SQL needs the timestamp to be exposed as a column, we can't hide it without changing the definition of GROUP BY. I
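
To illustrate why the timestamp must surface as a column: Beam SQL's windowing functions take it as an ordinary argument inside GROUP BY (table and column names here are illustrative):

```sql
SELECT f_name, COUNT(*) AS event_count
FROM kafka_orders
GROUP BY f_name, TUMBLE(event_timestamp, INTERVAL '1' MINUTE)
```

If the timestamp were hidden inside the source configuration, there would be nothing for TUMBLE to reference.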

Re: Pubsub to Beam SQL

2018-05-04 Thread Anton Kedin
There are a few aspects of the event timestamp definition in SQL that we are talking about here: - configuring the source. E.g. for PubsubIO you can choose whether to extract the event timestamp from the message attributes or the message publish time: - this is source-specific and cannot

Re: Pubsub to Beam SQL

2018-05-04 Thread Raghu Angadi
On Thu, May 3, 2018 at 12:47 PM Anton Kedin wrote: > I think it makes sense for the case when timestamp is provided in the > payload (including pubsub message attributes). We can mark the field as an > event timestamp. But if the timestamp is internally defined by the source >

Re: Pubsub to Beam SQL

2018-05-03 Thread Ankur Goenka
I like the idea of exposing source timestamp in TBLPROPERTIES which is closely tied to source (KafkaIO, KinesisIO, MqttIO, AmqpIO, unbounded FileIO, PubSubIO). Exposing timestamp as a top level keyword will break the symmetry between streaming and batch pipelines. TBLPROPERTIES gives us
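
A sketch of the TBLPROPERTIES approach being discussed here, with the timestamp configuration kept source-specific so the DDL grammar stays identical across bounded and unbounded sources (the property key shown is a hypothetical name, not a settled one):

```sql
CREATE TABLE sensor_readings (
  device_id VARCHAR,
  reading DOUBLE
)
TYPE 'kafka'
LOCATION 'localhost:9092/sensor_topic'
TBLPROPERTIES '{"timestamp.attribute": "eventTime"}'
```

The trade-off raised in the thread is that this keeps the grammar uniform but buries a concept that is fundamental to Beam's data model inside an opaque JSON blob.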

Re: Pubsub to Beam SQL

2018-05-03 Thread Kenneth Knowles
It is an interesting question for Beam DDL - since timestamps are fundamental to Beam's data model, should we have a DDL extension that makes it very explicit? Seems nice, but perhaps TBLPROPERTIES is a way to stage the work, getting the functionality in place first and the parsing second. What

Re: Pubsub to Beam SQL

2018-05-03 Thread Andrew Pilloud
I like to avoid magic too. I might not have been entirely clear in what I was asking. Here is an example of what I had in mind, replacing the TBLPROPERTIES with a more generic TIMESTAMP option: CREATE TABLE table_name ( publishTimestamp TIMESTAMP, attributes MAP(VARCHAR, VARCHAR), payload
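
Reconstructing the shape of Andrew's suggestion as a hypothetical (this TIMESTAMP option was a proposal in this thread, not an implemented syntax; column and location names are placeholders):

```sql
CREATE TABLE pubsub_topic (
  publishTimestamp TIMESTAMP,
  attributes MAP<VARCHAR, VARCHAR>,
  payload ROW<name VARCHAR, size INT>
)
TIMESTAMP 'publishTimestamp'
TYPE 'pubsub'
LOCATION 'projects/my-project/topics/my-topic'
```

The idea is that a first-class TIMESTAMP clause names which column drives the event time, instead of encoding that in TBLPROPERTIES.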

Re: Pubsub to Beam SQL

2018-05-03 Thread Reuven Lax
I believe PubSubIO already exposes the publish timestamp if no timestamp attribute is set. On Thu, May 3, 2018 at 12:52 PM Anton Kedin wrote: > A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We > will probably need a way to expose a message publish

Re: Pubsub to Beam SQL

2018-05-03 Thread Anton Kedin
A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We will probably need a way to expose a message publish timestamp if we want to use it as an event timestamp, but that will be consumed by the same wrapper/transform without adding anything schema or SQL-specific to PubsubIO

Re: Pubsub to Beam SQL

2018-05-03 Thread Anton Kedin
I think it makes sense for the case when timestamp is provided in the payload (including pubsub message attributes). We can mark the field as an event timestamp. But if the timestamp is internally defined by the source (pubsub message publish time) and not exposed in the event body, then we need

Re: Pubsub to Beam SQL

2018-05-03 Thread Reuven Lax
Are you planning on integrating this directly into PubSubIO, or adding a follow-on transform? On Wed, May 2, 2018 at 10:30 AM Anton Kedin wrote: > Hi > > I am working on adding functionality to support querying Pubsub messages > directly from Beam SQL. > > *Goal* > Provide Beam

Re: Pubsub to Beam SQL

2018-05-03 Thread Andrew Pilloud
This sounds awesome! Is event timestamp something that we need to specify for every source? If so, I would suggest we add this as a first-class option on CREATE TABLE rather than something hidden in TBLPROPERTIES. Andrew On Wed, May 2, 2018 at 10:30 AM Anton Kedin wrote: >

Pubsub to Beam SQL

2018-05-02 Thread Anton Kedin
Hi I am working on adding functionality to support querying Pubsub messages directly from Beam SQL. *Goal* Provide Beam users a pure SQL solution to create pipelines with Pubsub as a data source, without the need to set up the pipelines in Java before applying the query. *High level
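
For reference, the Pubsub table shape this proposal works toward exposes the event timestamp as a regular column alongside the message attributes and payload, roughly like this (project and topic names are placeholders, and the property key is one plausible spelling):

```sql
CREATE TABLE pubsub_messages (
  event_timestamp TIMESTAMP,
  attributes MAP<VARCHAR, VARCHAR>,
  payload ROW<id VARCHAR, amount DOUBLE>
)
TYPE 'pubsub'
LOCATION 'projects/my-project/topics/my-topic'
TBLPROPERTIES '{"timestampAttributeKey": "ts"}'
```

With a table declared like this, queries can window on event_timestamp directly, while the table provider wraps PubsubIO plus a conversion transform under the hood.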