Maybe
https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/



On Wed, Apr 26, 2017 at 2:49 PM, Tobias Eriksson <
tobias.eriks...@qvantel.com> wrote:

> Hi
>
> I would like to make a dump of the database, in JSON format, to KAFKA
>
> The database contains lots of data, millions and in some cases billions of
> “rows”
>
> I will provide the customer with an export of the data, where they can
> read it off of a KAFKA topic
>
>
>
> My thinking was to have it scalable such that I will distribute the token
> range of all available partition-keys to a number of (N) processes
> (JSON-Producers)
>
> First I will have a process which will read through the available tokens
> and then publish them on a KAFKA “Coordinator” Topic
>
> And then I can create 1, 10, 20 or N processes that will act as Producers
> to the real KAFKA topic, and pick available tokens/partition-keys off of
> the “Coordinator” Topic
>
> One by one until all the “rows” have been processed.
>
> So the JOSN-Producer will take e.g. a range of 1000 “rows” and convert
> them into my own JSON format and post to KAFKA
>
> And then after that take another 1000 “rows” and then …. And then another
> 1000 “rows” and so on, until it is done.
>
>
>
> I base my idea on how I believe Apache Spark Connector accomplishes data
> locality, i.e. being aware of where tokens reside and figured that since
> that is possible it should be possible to create a job-list in a KAFKA
> topic, and have each Producer pick jobs from there, and read up data from
> Cassandra based on the partition key (token) and then post the JSON on the
> export KAFKA topic.
>
> https://dzone.com/articles/data-locality-w-cassandra-how
>
>
>
>
>
> Would you consider this a good idea ?
>
> Would there in fact be a better idea, what would that be then ?
>
>
>
> -Tobias
>
>
>

Reply via email to