Maybe https://www.confluent.io/blog/kafka-connect-cassandra-sink-the-perfect-match/
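If you do end up rolling your own along the lines you describe, a rough sketch of the coordinator-topic approach might look like the code below. This is only a sketch, assuming the DataStax Java driver 3.x and plain kafka-clients; the topic names, keyspace/table, contact point, and partition-key column "pk" are placeholders, not anything from your setup.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.math.BigInteger;
import java.util.Collections;
import java.util.Properties;

public class TokenRangeExport {

    static Properties kafkaProps() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "kafka:9092");   // placeholder address
        p.put("key.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put("value.serializer",
              "org.apache.kafka.common.serialization.StringSerializer");
        p.put("key.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
        p.put("value.deserializer",
              "org.apache.kafka.common.serialization.StringDeserializer");
        return p;
    }

    // Coordinator: split the Murmur3 token space into `slices` contiguous,
    // non-overlapping sub-ranges and publish each one as a "job".
    static void publishJobs(KafkaProducer<String, String> producer, int slices) {
        BigInteger min  = BigInteger.valueOf(Long.MIN_VALUE);
        BigInteger max  = BigInteger.valueOf(Long.MAX_VALUE);
        BigInteger step = max.subtract(min).divide(BigInteger.valueOf(slices));
        for (int i = 0; i < slices; i++) {
            BigInteger lo = min.add(step.multiply(BigInteger.valueOf(i)));
            BigInteger hi = (i == slices - 1)
                    ? max : lo.add(step).subtract(BigInteger.ONE);
            producer.send(new ProducerRecord<>("export-coordinator",
                    Integer.toString(i), lo + ":" + hi));   // job = "lo:hi"
        }
    }

    // One job: read the rows in [lo, hi] as JSON and post them to the
    // export topic. SELECT JSON (Cassandra 2.2+) makes the server do the
    // encoding; replace it with your own formatting for a custom JSON shape.
    static void exportRange(Session session,
                            KafkaProducer<String, String> producer,
                            long lo, long hi) {
        ResultSet rs = session.execute(
                "SELECT JSON * FROM my_ks.my_table "          // placeholders
              + "WHERE token(pk) >= ? AND token(pk) <= ?", lo, hi);
        for (Row row : rs) {               // the driver pages transparently
            producer.send(new ProducerRecord<>("export-topic",
                    row.getString("[json]")));
        }
    }

    // Worker loop: each of the N JSON-producer processes runs this; a
    // shared consumer group makes Kafka spread the jobs across them.
    static void workerLoop(Session session, KafkaProducer<String, String> producer) {
        Properties p = kafkaProps();
        p.put("group.id", "json-exporters");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(p)) {
            consumer.subscribe(Collections.singletonList("export-coordinator"));
            while (true) {
                for (ConsumerRecord<String, String> job : consumer.poll(1000)) {
                    String[] bounds = job.value().split(":");
                    exportRange(session, producer,
                            Long.parseLong(bounds[0]), Long.parseLong(bounds[1]));
                }
            }
        }
    }

    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder()
                     .addContactPoint("cassandra").build();   // placeholder
             Session session = cluster.connect();
             KafkaProducer<String, String> producer =
                     new KafkaProducer<>(kafkaProps())) {
            publishJobs(producer, 1024);   // run once, as the coordinator
            workerLoop(session, producer); // or run as one of the N workers
        }
    }
}

Two caveats on the sketch: it slices the Murmur3 token space evenly instead of using the cluster's actual token ranges (which is how the Spark connector gets locality), and it commits coordinator offsets on Kafka's default auto-commit schedule, so a worker dying mid-range can lose a job; for a reliable export you would commit a job's offset only after its range has been fully produced.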
On Wed, Apr 26, 2017 at 2:49 PM, Tobias Eriksson <tobias.eriks...@qvantel.com> wrote:
> Hi,
>
> I would like to make a dump of the database, in JSON format, to Kafka.
> The database contains lots of data: millions, and in some cases billions, of "rows".
> I will provide the customer with an export of the data, which they can read off of a Kafka topic.
>
> My thinking is to make it scalable by distributing the token ranges of all available partition keys to a number of (N) processes (JSON producers).
> First I will have a process that reads through the available tokens and publishes them on a Kafka "coordinator" topic.
> Then I can create 1, 10, 20, or N processes that act as producers to the real Kafka topic and pick available tokens/partition keys off of the "coordinator" topic, one by one, until all the "rows" have been processed.
> So a JSON producer will take e.g. a range of 1000 "rows", convert them into my own JSON format, and post them to Kafka; then it will take another 1000 "rows", and another, and so on, until it is done.
>
> I base my idea on how I believe the Apache Spark connector accomplishes data locality, i.e. being aware of where tokens reside. Since that is possible, I figured it should be possible to create a job list in a Kafka topic, have each producer pick jobs from there, read the data from Cassandra based on the partition key (token), and post the JSON on the export Kafka topic.
> https://dzone.com/articles/data-locality-w-cassandra-how
>
> Would you consider this a good idea?
> Would there in fact be a better idea, and if so, what would that be?
>
> -Tobias