Hi Gwen,

As you said, I see Bottled Water and Sqoop addressing slightly different use
cases, so I don't see this feature as a Sqoop killer.  However, I did have a
question about your comment that the transaction-log (CDC) approach will have
problems with very large, very active databases.

I get that you need a single producer transmitting the transaction log
changes to Kafka in order.  However, on the consumer side you can have a
topic per table and then partition those topics by primary key to achieve
nice parallelism.  So the producer seems to be the potential bottleneck,
but I imagine you can scale it vertically as needed and put proper HA in
place.
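
To make this concrete, here is a minimal sketch of the per-table,
key-partitioned layout I have in mind (the topic name, key, and payload are
made up for illustration, and it assumes Kafka's default hash-by-key
partitioner):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ChangePublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer =
                     new KafkaProducer<>(props)) {
                String primaryKey = "42";  // e.g. users.id
                String change = "{\"id\":42,\"name\":\"new name\"}";
                // The default partitioner hashes the key, so every change to a
                // given row lands in the same partition (per-row order is
                // preserved) while different rows spread across partitions.
                producer.send(new ProducerRecord<>(
                    "dbserver.public.users", primaryKey, change));
            }
        }
    }

Consumers can then read each table topic's partitions in parallel while
still seeing every row's changes in order.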

Would love to hear your thoughts on this.

Jonathan



On Thu, Apr 30, 2015 at 5:09 PM, Gwen Shapira <gshap...@cloudera.com> wrote:

> I feel a need to respond to the Sqoop-killer comment :)
>
> 1) Note that most databases have a single transaction log per DB, and in
> order to get a correct view of the DB you need to read it in order
> (otherwise transactions will get mixed up). This means you are limited to
> a single producer reading data from the log, writing it to a single
> partition, and a single consumer reading it back. If the database is very
> large and very active, you may run into some issues there...
>
> Because Sqoop doesn't try to catch up with all the changes but instead takes
> a snapshot (from multiple mappers in parallel), we can Sqoop 10TB databases
> very rapidly.
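>
> For illustration, a typical parallel snapshot import looks something like
> this (the connection string, table, and paths are made up):
>
>     sqoop import \
>       --connect jdbc:postgresql://dbhost/mydb \
>       --table users \
>       --split-by id \
>       --num-mappers 8 \
>       --target-dir /data/snapshots/users
>
> Each mapper reads its own slice of the id range in parallel.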
>
> 2) If HDFS is the eventual target for the Postgres data, then postgresql ->
> kafka -> HDFS seems less optimal than postgresql -> HDFS directly (in
> parallel). There are good reasons to get Postgres data into Kafka, but if the
> eventual goal is HDFS (or HBase), I suspect Sqoop still has a place.
>
> 3) Due to its parallelism and general-purpose JDBC connector, I suspect
> Sqoop is a very viable way of getting data into Kafka as well.
>
> Gwen
>
>
> On Thu, Apr 30, 2015 at 2:27 PM, Jan Filipiak <jan.filip...@trivago.com>
> wrote:
>
> > Hello Everyone,
> >
> > I am quite excited about the recent example of replicating PostgreSQL
> > changes to Kafka. My view of the log compaction feature had always been a
> > very sceptical one, but now, with its great potential exposed to the wider
> > public, I think it's an awesome feature. Especially when pulling this data
> > into HDFS as a snapshot, it is (IMO) a Sqoop killer. So I want to thank
> > everyone who had the vision to build this kind of system at a time when I
> > could not imagine it.
> >
> > There is one open question that I would like people to help me with. When
> > pulling a snapshot of a partition into HDFS using a Camus-like application,
> > I feel the need to keep a set of all keys read so far and to stop as soon
> > as I find a key that is already in my set. I use this as an indicator of
> > how far log compaction has already progressed, and I only pull up to that
> > point. This works quite well, as I do not need to keep the messages in
> > memory, only their keys.
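> >
> > As a minimal sketch of that loop (the topic name and the String keys are
> > assumptions, and I am using the current Java consumer API for brevity;
> > with multiple partitions you would keep one key set per partition, since
> > compaction progresses independently in each):
> >
> >     import java.time.Duration;
> >     import java.util.Collections;
> >     import java.util.HashSet;
> >     import java.util.Properties;
> >     import java.util.Set;
> >     import org.apache.kafka.clients.consumer.ConsumerRecord;
> >     import org.apache.kafka.clients.consumer.KafkaConsumer;
> >
> >     public class SnapshotPuller {
> >         public static void main(String[] args) {
> >             Properties props = new Properties();
> >             props.put("bootstrap.servers", "localhost:9092");
> >             props.put("group.id", "snapshot-puller");
> >             props.put("auto.offset.reset", "earliest");
> >             props.put("key.deserializer",
> >                 "org.apache.kafka.common.serialization.StringDeserializer");
> >             props.put("value.deserializer",
> >                 "org.apache.kafka.common.serialization.StringDeserializer");
> >
> >             Set<String> seen = new HashSet<>();  // keys read so far
> >             try (KafkaConsumer<String, String> consumer =
> >                      new KafkaConsumer<>(props)) {
> >                 consumer.subscribe(
> >                     Collections.singletonList("users.compacted"));
> >                 boolean done = false;
> >                 while (!done) {
> >                     for (ConsumerRecord<String, String> rec :
> >                             consumer.poll(Duration.ofSeconds(1))) {
> >                         if (!seen.add(rec.key())) {
> >                             // A repeated key means we have reached the
> >                             // not-yet-compacted tail: stop the snapshot.
> >                             done = true;
> >                             break;
> >                         }
> >                         // Stand-in for the HDFS snapshot write.
> >                         System.out.println(rec.value());
> >                     }
> >                 }
> >             }
> >         }
> >     }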
> >
> > The question I want to raise with the community is:
> >
> > How do you prevent pulling the same record twice (in different versions),
> > and would it be beneficial if the "OffsetResponse" also returned the last
> > offset compacted so far, so that the application could just pull up to
> > that point?
> >
> > Looking forward to some recommendations and comments.
> >
> > Best
> > Jan
> >
> >
>
