Shared the doc.
There is already a table provider for Kafka with CSV records. The
implementation at the moment doesn't touch the IO itself, just wraps it.
Implementing Kafka JSON records can be as easy as wrapping KafkaIO with
JsonToRow.
I think even easier for other sources. PubSub is a tricky one (for us at
least) because Dataflow overrides the Beam native PubSub source with
something different. Kafka is a pure Beam source.
On Thu, May 10, 2018 at 1:39 PM Ismaël Mejía wrote:
> Hi, Jumping a bit late to this
Hi, Jumping a bit late to this discussion. This sounds super nice. But I
could not access the document.
How hard would it be to do this for other 'unbounded' sources, e.g. Kafka?
On Sat, May 5, 2018 at 2:56 AM Andrew Pilloud wrote:
> I don't think we should jump to adding a
I don't think we should jump to adding an extension, but TBLPROPERTIES is
already a DDL extension and it isn't user friendly. We should strive for a
world where no one needs to use it. SQL needs the timestamp to be exposed
as a column, we can't hide it without changing the definition of GROUP BY.
I
There are a few aspects of the event timestamp definition in SQL that we
are talking about here:
- configuring the source. E.g. for PubsubIO you can choose whether to
extract event timestamp from the message attributes or the message publish
time:
- this is source-specific and cannot
On Thu, May 3, 2018 at 12:47 PM Anton Kedin wrote:
> I think it makes sense for the case when timestamp is provided in the
> payload (including pubsub message attributes). We can mark the field as an
> event timestamp. But if the timestamp is internally defined by the source
>
I like the idea of exposing source timestamp in TBLPROPERTIES which is
closely tied to source (KafkaIO, KinesisIO, MqttIO, AmqpIO, unbounded
FileIO, PubSubIO).
Exposing timestamp as a top level keyword will break the symmetry between
streaming and batch pipelines.
TBLPROPERTIES gives us
It is an interesting question for Beam DDL - since timestamps are
fundamental to Beam's data model, should we have a DDL extension that makes
it very explicit? Seems nice, but perhaps TBLPROPERTIES is a way to stage
the work, getting the functionality in place first and the parsing second.
What
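Concretely, staging the work through TBLPROPERTIES could look something
like the following (the timestampAttributeKey property name is
illustrative, not a settled spelling):

```sql
-- Hypothetical sketch: the property name used to select the
-- event-timestamp attribute is illustrative.
CREATE TABLE pubsub_events (
  attributes MAP<VARCHAR, VARCHAR>,
  payload VARBINARY
)
TYPE 'pubsub'
LOCATION 'projects/my-project/topics/my-topic'
TBLPROPERTIES '{"timestampAttributeKey": "eventTimestamp"}';
```

Once the parsing work lands, the same information could move into an
explicit DDL clause without changing the runtime behavior.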
I like to avoid magic too. I might not have been entirely clear in what I
was asking. Here is an example of what I had in mind, replacing the
TBLPROPERTIES
with a more generic TIMESTAMP option:
CREATE TABLE table_name (
publishTimestamp TIMESTAMP,
attributes MAP<VARCHAR, VARCHAR>,
payload
I believe PubSubIO already exposes the publish timestamp if no timestamp
attribute is set.
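Filling out that truncated snippet, the full shape of what I have in mind
might be as follows (the trailing TIMESTAMP clause is the hypothetical
extension under discussion, not existing syntax; the payload schema and
topic path are made up):

```sql
-- Hypothetical sketch: the TIMESTAMP clause is the proposed extension;
-- field names and topic path are illustrative.
CREATE TABLE table_name (
  publishTimestamp TIMESTAMP,
  attributes MAP<VARCHAR, VARCHAR>,
  payload ROW<id BIGINT, name VARCHAR>
)
TYPE 'pubsub'
LOCATION 'projects/my-project/topics/my-topic'
TIMESTAMP 'publishTimestamp';
```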
On Thu, May 3, 2018 at 12:52 PM Anton Kedin wrote:
> A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We
> will probably need a way to expose a message publish
A SQL-specific wrapper+custom transforms for PubsubIO should suffice. We
will probably need a way to expose a message publish timestamp if we
want to use it as an event timestamp, but that will be consumed by the same
wrapper/transform without adding anything schema- or SQL-specific to
PubsubIO.
I think it makes sense for the case when timestamp is provided in the
payload (including pubsub message attributes). We can mark the field as an
event timestamp. But if the timestamp is internally defined by the source
(pubsub message publish time) and not exposed in the event body, then we
need
Are you planning on integrating this directly into PubSubIO, or add a
follow-on transform?
On Wed, May 2, 2018 at 10:30 AM Anton Kedin wrote:
> Hi
>
> I am working on adding functionality to support querying Pubsub messages
> directly from Beam SQL.
>
> *Goal*
> Provide Beam
This sounds awesome!
Is event timestamp something that we need to specify for every source? If
so, I would suggest we add this as a first-class option on CREATE TABLE
rather than something hidden in TBLPROPERTIES.
Andrew
On Wed, May 2, 2018 at 10:30 AM Anton Kedin wrote:
>
Hi
I am working on adding functionality to support querying Pubsub messages
directly from Beam SQL.
*Goal*
Provide Beam users a pure SQL solution to create the pipelines with
Pubsub as a data source, without the need to set up the pipelines in Java
before applying the query.
*High level