Albert, you certainly can use Kafka (and it will probably work quite well);
you'll just need to make sure your consumers are written to match the
available options. I think I may not have a good picture of what you need
to do. Is it that you have a stream of documents coming in, where each
document can go through crawling, keyword extraction, language detection,
and indexation in any order, or are the steps ordered?

The difference between a workqueue and what Kafka provides is that with a
workqueue all the documents sit in one big queue, all the consumers have
equal access to each document, and documents can be processed out of order
if required. In Kafka, to have multiple consumers on one subscription, the
data is partitioned and consumers are assigned to particular partitions
(though this assignment can happen in a coordinated fashion at runtime).
Each consumer will only work through the data in its own partitions, with
messages marked as processed in a strict order. You can use this
partitioning to control which consumer will process a message, but you may
not want to; a sketch of that follows.
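
If you did want that control, keyed production is the usual mechanism:
records with the same key hash to the same partition, so they all reach the
same consumer. Here's a minimal sketch using the Java producer client from
more recent Kafka releases; the broker address, topic name, and key scheme
are all made up for illustration.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DocumentProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                  "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Records with the same key land on the same partition, so keying by
        // crawl server (a hypothetical scheme) pins one server's documents to
        // one consumer. The "documents" topic name is made up for this example.
        producer.send(new ProducerRecord<>("documents", "crawl-server-X",
                                           "{...document json...}"));
        producer.close();
    }
}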
Does this help at all? One caveat: without forethought, using Kafka you
wouldn't be able to suddenly increase the number of consumers to process a
backlog, since you can have at most one consumer per partition. With
forethought, though, you could start with a larger number of partitions so
as to be able to increase the number of consumers later, with each consumer
processing fewer partitions at a time.
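
To make the consumer side concrete, here's a rough sketch using the newer
Java consumer client (the high-level consumer current as of this thread
differs in its details, but the group semantics are the same). All consumers
sharing a group.id divide the topic's partitions among themselves, so each
message is delivered to one and exactly one consumer in the group, which
sounds like what you described. The names here are made up:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DocumentWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed
        props.put("group.id", "document-processors");   // one logical subscriber
        props.put("enable.auto.commit", "false");       // commit after processing
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // Workers sharing this group.id split the partitions between them;
        // start more workers (up to the partition count) to absorb a backlog.
        consumer.subscribe(Collections.singletonList("documents"));
        while (true) {
            ConsumerRecords<String, String> records =
                consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                process(record.value()); // your per-document stage goes here
            }
            consumer.commitSync(); // offsets advance in strict order per partition
        }
    }

    private static void process(String doc) { /* keyword extraction, etc. */ }
}

If you created the topic with, say, 24 partitions up front, you could later
grow the group from 4 workers to as many as 24 before hitting the
one-consumer-per-partition ceiling.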

As for Storm, I don't believe there's anything stopping each bolt from just
being concerned with a single document at a time, and if your stages are
sequential then your use of Kafka might well be equivalent to a simple
linear Storm topology.
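
Something like this hypothetical skeleton, using the storm-kafka spout;
KeywordBolt, LanguageBolt, and IndexBolt are stand-ins for your actual
stages, not real classes:

import backtype.storm.generated.StormTopology;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;

public class DocumentTopology {
    // A simple linear pipeline: Kafka spout -> keywords -> language -> index.
    // The three bolts are hypothetical placeholders for your own stages.
    public static StormTopology build(SpoutConfig spoutConfig) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("documents", new KafkaSpout(spoutConfig), 4);
        builder.setBolt("keywords", new KeywordBolt(), 8)
               .shuffleGrouping("documents");
        builder.setBolt("language", new LanguageBolt(), 8)
               .shuffleGrouping("keywords");
        builder.setBolt("index", new IndexBolt(), 4)
               .shuffleGrouping("language");
        return builder.createTopology();
    }
}

Each bolt here handles a single document per tuple, which would match your
atomic, per-document actions.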

Christian

On Thu, Oct 9, 2014 at 11:57 PM, Albert Vila <albert.v...@augure.com> wrote:

> Hi
>
> We process data in real time, and we are taking a look at Storm and Spark
> Streaming too; however, our actions are atomic, done at a document level, so
> I don't know if it fits into something like Storm/Spark.
>
> Regarding what you said, Christian: isn't Kafka used for scenarios like the
> one I described? I mean, we do have work queues right now with Gearman, but
> with a bunch of workers on each step. I thought we could change that to a
> producer and a bunch of consumers (where each message should reach one and
> exactly one consumer).
>
> And what I said about the data locality, it was only an optimization we did
> some time ago because we were moving more data back then. Maybe now it's not
> necessary, and we could move messages around the system using Kafka, which
> would allow us to simplify the architecture a little bit. I've seen people
> saying they move TBs of data every day using Kafka.
>
> Just to be clear on the size of each document/message, we are talking about
> tweets, blog posts, ... (in 90% of cases the size is less than 50 KB).
>
> Regards
>
> On 9 October 2014 20:02, Christian Csar <cac...@gmail.com> wrote:
>
> > Apart from your data locality problem, it sounds like what you want is a
> > workqueue. Kafka's consumer structure doesn't lend itself too well to
> > that use case, as a single partition of a topic should only have one
> > consumer instance per logical subscriber of the topic, and that consumer
> > would not be able to mark jobs as completed except in a strict order
> > (while maintaining an at-least-once processing guarantee). This is not
> > to say it cannot be done, but I believe your workqueue would end up
> > working a bit strangely if built with Kafka.
> >
> > Christian
> >
> > On 10/09/2014 06:13 AM, William Briggs wrote:
> > > Manually managing data locality will become difficult to scale. Kafka is
> > > one potential tool you can use to help scale, but by itself, it will not
> > > solve your problem. If you need the data in near-real time, you could use
> > > a technology like Spark or Storm to stream data from Kafka and perform
> > > your processing. If you can batch the data, you might be better off
> > > pulling it into a distributed filesystem like HDFS, and using MapReduce,
> > > Spark or another scalable processing framework to handle your
> > > transformations. Once you've paid the initial price for moving the
> > > document into HDFS, your network traffic should be fairly manageable;
> > > most clusters built for this purpose will schedule work to be run local
> > > to the data, and typically have separate, high-speed network interfaces
> > > and a dedicated switch in order to optimize intra-cluster communications
> > > when moving data is unavoidable.
> > >
> > > -Will
> > >
> > > On Thu, Oct 9, 2014 at 7:57 AM, Albert Vila <albert.v...@augure.com> wrote:
> > >
> > >> Hi
> > >>
> > >> I just came across Kafka when I was trying to find solutions to scale
> > >> our current architecture.
> > >>
> > >> We are currently downloading and processing 6M documents per day from
> > >> online and social media. We have a different workflow for each type of
> > >> document, but some of the steps are keyword extraction, language
> > >> detection, clustering, classification, indexation, .... We are using
> > >> Gearman to dispatch the jobs to workers, and we have some queues on a
> > >> database.
> > >>
> > >> I'm wondering if we could integrate Kafka into the current workflow,
> > >> and whether it's feasible. One of our main discussions is whether we
> > >> have to go to a fully distributed architecture or to a semi-distributed
> > >> one. I mean, distribute everything or process some steps on the same
> > >> machine (crawling, keyword extraction, language detection, indexation).
> > >> We don't know which one scales better; each one has pros and cons.
> > >>
> > >> Now we have a semi-distributed one, as we had network problems given
> > >> the amount of data we were moving around. So now, all documents crawled
> > >> on server X are later dispatched through Gearman to the same server.
> > >> What we dispatch on Gearman is only the document id, and the document
> > >> data remains on the crawling server in a Memcached, so the network
> > >> traffic is kept to a minimum.
> > >>
> > >> What do you think?
> > >> Is it feasible to remove all the database queues and Gearman and move
> > >> to Kafka? As Kafka is mainly message-based, I think we would have to
> > >> move the messages around; should we take the network into account?
> > >> Might we face the same problems?
> > >> If so, is there a way to isolate some steps to be processed on the same
> > >> machine, to avoid network traffic?
> > >>
> > >> Any help or comment will be appreciated. And if someone has had a
> > >> similar problem and has knowledge about the architectural approach, it
> > >> will be more than welcome.
> > >>
> > >> Thanks
> > >>
> > >
> >
> >
>
>
> --
> *Albert Vila*
> R&D Manager & Software Developer
