Jay,

You can query after the fact, but you're not necessarily going to get the
same value back. There could easily be dozens of changes to the document in
the oplog, so the delta you see may not even make sense given the current
state of the document. Even if you can apply the delta, you'd still be
seeing data that is newer than the update. You can of course take this
shortcut, but it won't give correct results. And if the data has been
deleted since then, you won't even be able to write the full record... As
far as I know, the way the oplog is exposed won't let you do something
like pin a query to the state of the DB at a specific point in the oplog,
and you may be reading from the beginning of the oplog, so I don't think
there's a way to get correct results by just querying the DB for the full
documents.
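To make the race concrete, here's a toy sketch (Python, purely
illustrative; `apply_set` and the delta shape are stand-ins of my own, not
MongoDB's actual oplog entry format) of how applying an old delta to the
current document yields a state that never existed:

```python
# Document history: v1 -> v2 -> v3, driven by two oplog deltas.
doc_v1 = {"_id": 1, "qty": 10, "status": "new"}

delta_a = {"$set": {"qty": 5}}             # the delta the connector is processing
delta_b = {"$set": {"status": "shipped"}}  # a later update, already applied in the DB

def apply_set(doc, delta):
    # Minimal $set application, just enough for this illustration.
    updated = dict(doc)
    updated.update(delta["$set"])
    return updated

doc_v2 = apply_set(doc_v1, delta_a)  # true state after delta_a
doc_v3 = apply_set(doc_v2, delta_b)  # current state in the DB

# The shortcut: query the DB *now* (it already reflects delta_b), then
# apply delta_a to that. The result leaks data newer than the update.
shortcut = apply_set(doc_v3, delta_a)
print(shortcut)  # {'_id': 1, 'qty': 5, 'status': 'shipped'} -- not the real v2
```

The record the connector should emit for delta_a is v2 (status still
"new"), but the shortcut produces a document that already shows the later
"shipped" status.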

Strictly speaking you don't need to get all the data in memory, you just
need a record of the current set of values somewhere. This is what I was
describing after those two options -- if you do an initial dump to
Kafka, you could track only offsets in memory and read back full values as
needed to apply deltas, but this of course requires random reads into your
Kafka topic (though it may perform fine in practice depending on the
workload).
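A rough sketch of that bookkeeping (Python; the `DeltaReconstituter` name
and the plain list standing in for a Kafka topic are my own inventions --
in a real connector the random read would be a consumer seek() + poll()
against the output topic):

```python
class DeltaReconstituter:
    """Track, per document key, the offset of the last full value written."""

    def __init__(self, topic_store):
        self.topic_store = topic_store  # append-only list standing in for a Kafka topic
        self.last_offset = {}           # doc _id -> offset of last full record

    def emit_full(self, key, full_doc):
        # Write a full document to the topic and remember where it landed.
        offset = len(self.topic_store)
        self.topic_store.append((key, full_doc))
        self.last_offset[key] = offset
        return full_doc

    def apply_delta(self, key, set_fields):
        # Random read back into the topic to recover the last full value.
        # In real Kafka this seek can be expensive, but only offsets (not
        # whole documents) are held in memory.
        _, prev = self.topic_store[self.last_offset[key]]
        merged = {**prev, **set_fields}
        return self.emit_full(key, merged)

r = DeltaReconstituter([])
r.emit_full(1, {"_id": 1, "qty": 10, "status": "new"})  # initial dump
doc = r.apply_delta(1, {"qty": 5})                      # delta arrives later
print(doc)  # {'_id': 1, 'qty': 5, 'status': 'new'}
```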

-Ewen

On Fri, Jan 29, 2016 at 9:12 AM, Jay Kreps <j...@confluent.io> wrote:

> Hey Ewen, how come you need to get it all in memory for approach (1)? I
> guess the obvious thing to do would just be to query for the record
> after-image when you get the diff--e.g. just read a batch of changes and
> multi-get the final values. I don't know how bad the overhead of this would
> be...batching might reduce it a fair amount. The guarantees for this are
> slightly different than the pure oplog too (you get the current value, not
> necessarily every intermediate value) but that should be okay for most
> uses.
>
> -Jay
>
> On Fri, Jan 29, 2016 at 8:54 AM, Ewen Cheslack-Postava <e...@confluent.io>
> wrote:
>
> > Sunny,
> >
> > As I said on Twitter, I'm stoked to hear you're working on a Mongo
> > connector! It struck me as a pretty natural source to tackle since it
> > does such a nice job of cleanly exposing the op log.
> >
> > Regarding the problem of only getting deltas, unfortunately there is not
> > a trivial solution here -- if you want to generate the full updated
> > record, you're going to have to have a way to recover the original
> > document.
> >
> > In fact, I'm curious how you were thinking of even bootstrapping. Are you
> > going to do a full dump and then start reading the op log? Is there a
> > good way to do the dump and figure out the exact location in the op log
> > at which the query generating the dump was performed? I know that internally
> > mongo effectively does these two steps, but I'm not sure if the necessary
> > info is exposed via normal queries.
> >
> > If you want to reconstitute the data, I can think of a couple of options:
> >
> > 1. Try to reconstitute inline in the connector. This seems difficult to
> > make work in practice. At some point you basically have to query for the
> > entire data set to bring it into memory; the connector is then
> > effectively just applying the deltas to its in-memory copy and
> > generating one output record containing the full document each time it
> > applies an update.
> > 2. Make the connector send just the updates and have a separate stream
> > processing job perform the reconstitution and send to another topic. In
> > this case, the first topic should not be compacted, but the second one
> > could be.
> >
> > Unfortunately, without additional hooks into the database, there's not
> > much you can do besides this pretty heavyweight process. There may be
> > some tricks you can use to reduce the amount of memory used during the
> > process (e.g. keep a small cache of actual records and for the rest only
> > store Kafka offsets for the last full value, performing a (possibly
> > expensive) random read as necessary to get the full document value
> > back), but to get full correctness you will need to perform this process.
> >
> > In terms of Kafka Connect supporting something like this, I'm not sure
> > how general it could be made, or that you even want to perform the
> > process inline with the Kafka Connect job. If it's an issue that
> > repeatedly arises across a variety of systems, then we should consider
> > how to address it more generally.
> >
> > -Ewen
> >
> > On Tue, Jan 26, 2016 at 8:43 PM, Sunny Shah <su...@tinyowl.co.in> wrote:
> >
> > >
> > > Hi,
> > >
> > > We are trying to write a Kafka Connect connector for MongoDB. The issue
> > > is that MongoDB does not provide the entire changed document for update
> > > operations; it just provides the modified fields.
> > >
> > > If Kafka allowed custom log compaction, it would be possible to
> > > eventually merge an entire document with subsequent updates to create
> > > an entire record again.
> > >
> > > As Ewen pointed out to me on Twitter, this is not possible, so what is
> > > the Kafka Connect way of solving this issue?
> > >
> > > @Ewen, thanks a lot for the really quick answer on Twitter.
> > >
> > > --
> > > Thanks and Regards,
> > >  Sunny
> > >
> >
> >
> >
> > --
> > Thanks,
> > Ewen
> >
>



-- 
Thanks,
Ewen
