
On Wed, Apr 26, 2017 at 2:49 PM, Tobias Eriksson wrote:

> Hi,
> I would like to dump the database, in JSON format, to Kafka.
> The database contains a lot of data: millions, and in some cases billions,
> of “rows”.
> I will provide the customer with an export of the data, which they can
> read off a Kafka topic.
> My thinking is to make it scalable by distributing the token range of all
> available partition keys across a number (N) of processes
> (JSON-Producers).
> First, one process will read through the available tokens and publish
> them on a Kafka “Coordinator” topic.
> Then I can create 1, 10, 20 or N processes that act as Producers to the
> real Kafka topic and pick available tokens/partition keys off the
> “Coordinator” topic, one by one, until all the “rows” have been processed.
> So a JSON-Producer will take e.g. a range of 1000 “rows”, convert them
> into my own JSON format and post them to Kafka, then take another 1000
> “rows”, and so on until it is done.
> I base my idea on how I believe the Apache Spark Connector accomplishes
> data locality, i.e. by being aware of where tokens reside. Since that is
> possible, I figured it should also be possible to create a job list in a
> Kafka topic and have each Producer pick jobs from there, read the data
> from Cassandra based on the partition key (token), and then post the JSON
> to the export Kafka topic.
> Would you consider this a good idea?
> If there is a better approach, what would it be?
> -Tobias
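
For illustration, the job-list idea described above could be sketched roughly as follows. This is only a minimal sketch under some assumptions: the cluster uses the default Murmur3Partitioner (token range -2^63 to 2^63-1), the table name ks.events and partition key id are hypothetical, and the actual Kafka and Cassandra client calls are left out so that only the range splitting, the per-range query and the batching of 1000 "rows" are shown.

```python
import json
from itertools import islice

# Token bounds of the Murmur3Partitioner (assumed to be the one in use).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_range(n):
    """Split the full token ring into n contiguous (start, end) jobs."""
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (total * i) // n for i in range(n)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(table, pk, start, end):
    """Build the CQL statement a producer would run for one job.

    Ranges are treated as (start, end], except the first one, which
    also includes its lower bound so that no token is skipped.
    """
    op = ">=" if start == MIN_TOKEN else ">"
    return (
        f"SELECT * FROM {table} "
        f"WHERE token({pk}) {op} {start} AND token({pk}) <= {end}"
    )


def json_batches(rows, size=1000):
    """Yield lists of JSON strings, at most `size` rows per list."""
    it = iter(rows)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield [json.dumps(row) for row in batch]


if __name__ == "__main__":
    # Each job would be published as one message on the "Coordinator" topic.
    for start, end in split_token_range(4):
        print(range_query("ks.events", "id", start, end))

    # Stand-in for rows read from Cassandra (hypothetical data).
    fake_rows = ({"id": i} for i in range(2500))
    for batch in json_batches(fake_rows):
        print(len(batch), "rows ready to post to the export topic")
```

Splitting by token range rather than by individual partition key keeps the "Coordinator" topic small: it holds N range jobs instead of one message per key, and each producer can page through its range with the driver's normal result paging.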
