Hi Prabhjot,

Thanks - yes, there are almost too many ways to do it. One other option is to 
use HDFS for the locality side of things, but it would be nice if we could use 
Kafka for both.

Thanks,
Ben

-----Original Message-----
From: Prabhjot Bharaj [mailto:prabhbha...@gmail.com] 
Sent: 06 November 2015 11:41
To: users@kafka.apache.org
Subject: Re: Locality question

Hi Ben,

I have a similar use case, where data from the processing layer gets onto the 
storage layer, and the data for each db is distributed amongst the storage 
machines based on some logic.
The architecture we had agreed upon was to name topics after the storage 
machines, so the processing layer sends all data meant for one storage machine 
onto a topic that only that storage machine will consume.
If you later change the logic for distributing data amongst the storage 
machines, the processing machines can take a time-based approach to sending 
the data to the respective topics. In this case, you won't need to know which 
partitions reside on which machine. If you have, let's say, x partitions, you 
can keep writing data to this topic (meant for one storage machine) without 
worrying about where those partitions reside in the Kafka cluster.
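For illustration, a minimal sketch of this topic-per-storage-machine producer 
(the broker address, topic naming, and routing helper below are placeholders, 
not from our actual setup):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StorageMachineProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder address
        props.put("key.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
            "org.apache.kafka.common.serialization.StringSerializer");
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        // The topic name is the target storage machine's name, so the
        // producer never needs to know which brokers host its partitions.
        String topic = storageMachineFor("db-42");       // placeholder routing
        producer.send(new ProducerRecord<>(topic, "db-42", "row-data"));
        producer.close();
    }

    // Placeholder for whatever logic distributes dbs among storage machines.
    static String storageMachineFor(String db) {
        return "storage-machine-01";
    }
}

All the distribution logic stays in the processing layer; Kafka's own 
placement of the topic's x partitions is invisible to the producer.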

The other architecture was similar to the one you are planning - to have db 
names as topics, each with its own partitions (more than 200 per topic). This 
design would have killed our system: we have lots of dbs, and having that many 
topics and partitions on the transfer layer (like Kafka) might have required 
higher-end hardware as well.
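To put rough (purely illustrative) numbers on it: 1,000 dbs x 200 partitions 
per topic is 200,000 partitions, and with a replication factor of 3 that is 
600,000 partition replicas for the cluster to track.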


Your other question still remains unanswered here - where should you deploy 
your storage machines? On the Kafka cluster, or separately (requiring a 
network hop for your data)? Both approaches have their pros and cons.

I would love to hear the community's thoughts on this.

Thanks,
Prabhjot

On Fri, Nov 6, 2015 at 1:50 PM, Young, Ben <ben.yo...@sungard.com> wrote:

> Hi,
>
> I've had a look over the website and searched the archives, but I 
> can't find any obvious answers to this, so apologies if it's been asked 
> before.
>
> I'm investigating potentially using Kafka as the transaction log for 
> our in-memory database technology. The idea is that Kafka's partitioning 
> and replication will "automatically" give us sharding and hot-standby 
> capabilities in the db (obviously with a fair amount of work).
>
> The database can ingest hundreds of gigabytes of data extremely 
> quickly - easily enough to saturate any reasonable network connection - 
> so I've thought about co-locating the db on the same nodes of the 
> Kafka cluster that actually store the data, to cut the network out of 
> the loading process entirely. We'd also probably want the db topology 
> to be defined first, and the Kafka partitioning to follow. I can see 
> how to use the partitioner class to assign a specific partition to a 
> key, but I can't currently see how to assign partitions to known 
> machines upfront. Is this possible?
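> For concreteness, the kind of partitioner I mean would look roughly like 
> this with the Java producer's Partitioner interface (a sketch only - the 
> hashing scheme is a placeholder, and it assumes every record is keyed):
>
> import java.util.Map;
> import org.apache.kafka.clients.producer.Partitioner;
> import org.apache.kafka.common.Cluster;
>
> public class ShardPartitioner implements Partitioner {
>     @Override
>     public int partition(String topic, Object key, byte[] keyBytes,
>                          Object value, byte[] valueBytes, Cluster cluster) {
>         int numPartitions = cluster.partitionsForTopic(topic).size();
>         // Deterministic mapping: the same key always lands on the same
>         // partition, and therefore on the same db shard.
>         return (key.hashCode() & 0x7fffffff) % numPartitions;
>     }
>
>     @Override public void close() {}
>     @Override public void configure(Map<String, ?> configs) {}
> }
>
> (wired in via the partitioner.class producer config)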
>
> Does the plan sound reasonable in general? I've also considered a 
> log-shipping approach like Flume, but Kafka seems simplest all round, 
> and I really like the idea of just being able to set the log offset to 
> zero to reload on startup.
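> Roughly what I mean, sketched with the Java consumer (the broker address 
> and topic name are placeholders):
>
> import java.util.Arrays;
> import java.util.Properties;
> import org.apache.kafka.clients.consumer.KafkaConsumer;
> import org.apache.kafka.common.TopicPartition;
>
> public class ReloadOnStartup {
>     public static void main(String[] args) {
>         Properties props = new Properties();
>         props.put("bootstrap.servers", "kafka-broker:9092"); // placeholder
>         props.put("group.id", "db-reload");
>         props.put("key.deserializer",
>             "org.apache.kafka.common.serialization.StringDeserializer");
>         props.put("value.deserializer",
>             "org.apache.kafka.common.serialization.StringDeserializer");
>         KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
>
>         TopicPartition tp = new TopicPartition("txn-log", 0); // placeholder topic
>         consumer.assign(Arrays.asList(tp));
>         consumer.seek(tp, 0L); // rewind to offset zero
>         // poll() from here replays the whole transaction log into the db
>     }
> }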
>
> Thanks,
> Ben Young
>
>
> Ben Young . Principal Software Engineer . Adaptiv .
>
>
>


--
---------------------------------------------------------
"There are only 10 types of people in the world: Those who understand binary, 
and those who don't"
