The data does have a timestamp: it is actually email campaigns with a
scheduled send time. But since a campaign can be scheduled ahead of time
(e.g., two days ahead), I cannot process it when it arrives; it has to wait
until its actual scheduled send time. As you can tell, the ordering within a
6-minute window does not matter, but it does matter that messages in one
6-minute window are sent out before those in the next window, hence reading
at the end of every 6 minutes.
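
To make the windowing concrete, here is a rough sketch (Python; the
"campaigns.YYYYMMDD.HHMM" topic-name scheme and the function names are just
illustrative, not what we actually run) of how a producer would pick the
per-window topic from the scheduled send time, and how a consumer would know
when a window is safe to read:

    from datetime import datetime, timedelta, timezone

    WINDOW = timedelta(minutes=6)

    def window_start(scheduled_send_time: datetime) -> datetime:
        # Floor the scheduled send time (assumed timezone-aware, UTC) to the
        # start of its 6-minute window.
        epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
        return epoch + ((scheduled_send_time - epoch) // WINDOW) * WINDOW

    def topic_for(scheduled_send_time: datetime) -> str:
        # e.g. "campaigns.20140811.1836" for the window starting 18:36 UTC.
        start = window_start(scheduled_send_time)
        return "campaigns." + start.strftime("%Y%m%d.%H%M")

    def window_is_readable(window_start_time: datetime, now: datetime) -> bool:
        # A window's topic is only consumed once the whole window has passed,
        # so everything scheduled inside it goes out before the next window.
        return now >= window_start_time + WINDOW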

Therefore it's hard for me to put the messages into different partitions of
a single topic: 6 minutes of data might already be too big for a single
partition, let alone the offset-management chaos.

It seems that a fast queue system is the right tool for this, but it involves
more setup and cluster-maintenance overhead. My thought is to use the
existing Kafka cluster, in the hope that the topic deletion API will be
available soon. In the meantime, a cron job would clean up the outdated
topics from ZooKeeper.
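
Roughly what I have in mind for that cron job is the sketch below (using the
kazoo ZooKeeper client; the topic-name format, ZK hosts and retention period
are placeholders, and the 0.8 metadata path /brokers/topics is assumed). Note
it only removes the topic metadata from ZooKeeper; the log directories on the
brokers would still have to age out via retention or be cleaned separately,
which is why a real delete-topic API would be much nicer:

    # cleanup_topics.py: run from cron to drop ZooKeeper metadata for
    # per-window topics older than RETENTION.
    from datetime import datetime, timedelta
    from kazoo.client import KazooClient

    RETENTION = timedelta(days=3)            # keep each window's topic ~3 days
    PREFIX = "campaigns."                    # illustrative topic-name scheme
    ZK_HOSTS = "zk1:2181,zk2:2181,zk3:2181"  # placeholder ZooKeeper ensemble

    def cleanup(now=None):
        now = now or datetime.utcnow()
        zk = KazooClient(hosts=ZK_HOSTS)
        zk.start()
        try:
            for topic in zk.get_children("/brokers/topics"):
                if not topic.startswith(PREFIX):
                    continue
                window = datetime.strptime(topic[len(PREFIX):], "%Y%m%d.%H%M")
                if now - window > RETENTION:
                    # Remove only the ZooKeeper znode; broker log dirs are
                    # left to the normal retention settings.
                    zk.delete("/brokers/topics/" + topic, recursive=True)
        finally:
            zk.stop()

    if __name__ == "__main__":
        cleanup()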

Let me know what you think,
Thanks,
Chen


On Mon, Aug 11, 2014 at 6:53 PM, Philip O'Toole <
philip.oto...@yahoo.com.invalid> wrote:

> Why do you need to read it every 6 minutes? Why not just read it as it
> arrives? If it naturally arrives in 6 minute bursts, you'll read it in 6
> minute bursts, no?
>
> Perhaps the data does not have timestamps embedded in it, so that is why
> you are relying on time-based topic names? In that case I would have an
> intermediate stage that tags the data with the timestamp, and then writes
> it to a single topic, and then processes it at your leisure in a third
> stage.
>
> Perhaps I am still missing a key difficulty with your system.
>
> Your original suggestion is going to be difficult to get working. You'll
> quickly run out of file descriptors, amongst other issues.
>
> Philip
>
>
>
>
> ---------------------------------
> http://www.philipotoole.com
>
> > On Aug 11, 2014, at 6:42 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> >
> > "And if you can't consume it all within 6 minutes, partition the topic
> > until you can run enough consumers such that you can keep up.", this is
> > what I intend to do for each 6min -topic.
> >
> > What I really need is a partitioned queue: each 6 minutes of data can go
> > into a separate partition, so that I can read that specific partition at
> > the end of each 6-minute window. Redis naturally fits this case, but the
> > only issue is performance (well, also some tricks to ensure reliable
> > message delivery). As I said, we have Kafka infrastructure in place, so
> > if I can make the design work with Kafka without too much effort, I
> > would rather go this path instead of setting up another queue system.
> >
> > Chen
> >
> > Chen
> >
> >
> > On Mon, Aug 11, 2014 at 6:07 PM, Philip O'Toole <
> > philip.oto...@yahoo.com.invalid> wrote:
> >
> >> It's still not clear to me why you need to create so many topics.
> >>
> >> Write the data to a single topic and consume it when it arrives. It
> >> doesn't matter if it arrives in bursts, as long as you can process it
> >> all within 6 minutes, right?
> >>
> >> And if you can't consume it all within 6 minutes, partition the topic
> >> until you can run enough consumers such that you can keep up. The fact
> >> that you are thinking about so many topics is a sign your design is
> >> wrong, or Kafka is the wrong solution.
> >> Kafka is the wrong solution.
> >>
> >> Philip
> >>
> >>>> On Aug 11, 2014, at 5:18 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> >>>
> >>> Philip,
> >>> That is right. There is a huge amount of data flushed into the topic
> >>> within each 6 minutes. Then at the end of each 6 min, I only want to read
> >>> from that specific topic, and data within that topic has to be processed
> >>> as fast as possible. I was originally using a redis queue for this
> >>> purpose, but it takes much longer to process a redis queue than a kafka
> >>> queue (testing data is 2M messages). Since we already have the kafka
> >>> infrastructure set up, instead of seeking other tools (ActiveMQ,
> >>> RabbitMQ, etc.), I would rather make use of kafka, although it does not
> >>> seem like a common kafka use case.
> >>>
> >>> Chen
> >>>
> >>>
> >>>> On Mon, Aug 11, 2014 at 5:01 PM, Philip O'Toole <philip.oto...@yahoo.com.invalid> wrote:
> >>>> I'd love to know more about what you're trying to do here. It sounds
> >>>> like you're trying to create topics on a schedule, trying to make it
> >>>> easy to locate data for a given time range? I'm not sure it makes sense
> >>>> to use Kafka in this manner.
> >>>>
> >>>> Can you provide more detail?
> >>>>
> >>>>
> >>>> Philip
> >>>>
> >>>>
> >>>> -----------------------------------------
> >>>> http://www.philipotoole.com
> >>>>
> >>>>
> >>>> On Monday, August 11, 2014 4:45 PM, Chen Wang <chen.apache.s...@gmail.com> wrote:
> >>>>
> >>>>
> >>>>
> >>>> Todd,
> >>>> I actually only intend to keep each topic valid for 3 days at most.
> >>>> Each of our topics has 3 partitions, so it's around 3*240*3 = 2160
> >>>> partitions. Since there is no API for deleting topics, I guess I could
> >>>> set up a cron job deleting the outdated topics (folders) from zookeeper.
> >>>> Do you know when the delete topic API will be available in kafka?
> >>>> Chen
> >>>>
> >>>>
> >>>>
> >>>> On Mon, Aug 11, 2014 at 3:47 PM, Todd Palino <tpal...@linkedin.com.invalid> wrote:
> >>>>
> >>>>> You need to consider your total partition count as you do this. After
> >>>>> 30 days, assuming 1 partition per topic, you have 7200 partitions.
> >>>>> Depending on how many brokers you have, this can start to be a problem.
> >>>>> We just found an issue on one of our clusters that has over 70k
> >>>>> partitions that there's now a problem with doing actions like a
> >>>>> preferred replica election for all topics because the JSON object that
> >>>>> gets written to the zookeeper node to trigger it is too large for
> >>>>> Zookeeper's default 1 MB data size.
> >>>>>
> >>>>> You also need to think about the number of open file handles. Even
> >>>>> with no data, there will be open files for each topic.
> >>>>>
> >>>>> -Todd
> >>>>>
> >>>>>
> >>>>>> On 8/11/14, 2:19 PM, "Chen Wang" <chen.apache.s...@gmail.com> wrote:
> >>>>>>
> >>>>>> Folks,
> >>>>>> Is there any potential issue with creating 240 topics every day?
> >>>>>> Although the retention of each topic is set to be 2 days, I am a
> >>>>>> little concerned that since right now there is no delete topic api,
> >>>>>> the zookeepers might be overloaded.
> >>>>>> Thanks,
> >>>>>> Chen
> >>
>
