Re: Using C* and CAS to coordinate workers

Jan Algermissen Fri, 04 Apr 2014 09:49:30 -0700

Hi DuyHai,


On 04 Apr 2014, at 13:58, DuyHai Doan <doanduy...@gmail.com> wrote:

> @Jan
> 
>  This subject of distributed workers & queues has been discussed in the 
> mailing list many times.

Sorry + thanks.

Unfortunately, I do not want to use C* as a queue, but to coordinate workers 
that page through an (XML) data feed of events every N seconds.

Let me try again:

- I have N instances of the same system, replicated to ensure work is
  being done despite failure of instances
- the instances are masterless and know nothing about each other. Giving
  them an integer ID isn’t really possible and the number of instances isn’t
  really known
- there is a schedule controlling how often the feed is read, say once every
  minute
- an administrator of the ‘feed polling’ might change the schedule
- the worker instances check for work every, e.g., 10 secs
- once a worker starts, it checks whether there is work to do (the schedule
  aspect) and if so, starts polling the feed until the last event has been
  reached
- during that time, no other worker must poll the feed
- once the working worker is done, it saves the timestamp or ID of the last
  seen event and sets the next schedule

- the processing of the events might take much longer than the schedule 
intervals
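
The requirements above could be sketched as a lease that a worker must win through a compare-and-set before it polls. The Python below is only an in-memory stand-in, not a Cassandra client: `LeaseStore`, `run_once`, the table names `feed_lease` and `feed_state`, and the TTL and interval values are all hypothetical. The CQL in the comment shows the kind of conditional statements the class imitates.

```python
import time

# Hypothetical CQL that the in-memory class below stands in for
# (table and column names are assumptions, not taken from this thread):
#
#   INSERT INTO feed_lease (feed, owner) VALUES (?, ?)
#     IF NOT EXISTS USING TTL 300;                          -- acquire
#   UPDATE feed_state SET last_event = ?, next_run = ?
#     WHERE feed = ? IF last_event = ?;                     -- save progress
#   DELETE FROM feed_lease WHERE feed = ? IF owner = ?;     -- release


class LeaseStore:
    """In-memory stand-in for a table that is only changed via compare-and-set."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._leases = {}  # feed -> (owner, expires_at)

    def acquire(self, feed, owner, ttl):
        """Imitates INSERT ... IF NOT EXISTS USING TTL; True if we got the lease."""
        now = self._clock()
        current = self._leases.get(feed)
        if current is not None and current[1] > now:
            return False  # an unexpired lease is held by some worker
        self._leases[feed] = (owner, now + ttl)
        return True

    def release(self, feed, owner):
        """Imitates DELETE ... IF owner = ?; only the holder may release."""
        current = self._leases.get(feed)
        if current is not None and current[0] == owner:
            del self._leases[feed]
            return True
        return False


def run_once(store, feed, worker_id, state, now, poll_feed, interval=60, ttl=300):
    """One wake-up of a worker: poll only if work is due and the lease is free."""
    if now < state["next_run"]:
        return False  # the schedule says there is nothing to do yet
    if not store.acquire(feed, worker_id, ttl):
        return False  # another worker is currently polling
    try:
        # poll_feed pages through the feed and returns the last event seen
        state["last_event"] = poll_feed(state["last_event"])
        state["next_run"] = now + interval
    finally:
        store.release(feed, worker_id)
    return True
```

Because the workers cannot be given integer IDs, `worker_id` could be a random value such as `str(uuid.uuid4())` generated at startup. Since processing may take much longer than the schedule interval, the lease TTL would have to exceed the longest expected run or be renewed while the worker is still busy; otherwise a second worker could acquire the lease mid-run.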

I hope this explains better what I am up to. Maybe I can adapt your suggestion,
I just do not see how.

Jan



> Basically one implementation can be:
> 
> 1) p data providers, c data consumers
> 2) create partitions (physical rows) of arbitrary number of columns (let's 
> say 10 000, not too big though). Partition key = bucket number (#b)
> 3) assign an integer id (pId) to each provider, same for each consumer (cId)
> 4) each provider can only write messages in bucket number such that #b mod p 
> = pId mod p
> 5) once the provider reaches 10 000 messages per bucket, it switches to the 
> next one with new #b = old #b + p
> 6) the consumers follow the same rule for bucket switching
> 
> Example:
> 
>  p = 5, c = 3
> 
>  - p1 writes messages into buckets {1,6,11,16...} // 1, 1+5, 1+5+5, ....
>  - p2 writes messages into buckets {2,7,12,17...} // 2, 2+5, 2+5+5,...
>  - p3 writes messages into buckets {3,8,13,18...}
>  - p4 writes messages into buckets {4,9,14,19...}
>  - p5 writes messages into buckets {5,10,15,20...} 
> 
>  - c1 consumes messages from buckets {1,4,7,10...} // 1, 1+3, 1+3+3...
>  - c2 consumes messages from buckets {2,5,8,11...}
>  - c3 consumes messages from buckets {3,6,9,12...}
> 
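
The bucket rule quoted above (bucket number mod group size equals member id mod group size) can be checked with a few lines of Python. This is only an illustration of the arithmetic; the helper name `buckets_for` is hypothetical and nothing here talks to Cassandra.

```python
from itertools import count, islice


def buckets_for(member_id, group_size):
    """Return an iterator over member_id, member_id + group_size, ...

    Every bucket b produced satisfies b mod group_size == member_id mod group_size.
    """
    return count(member_id, group_size)


p, c = 5, 3  # 5 providers, 3 consumers, as in the quoted example

# First four buckets for each provider and each consumer
producers = {f"p{i}": list(islice(buckets_for(i, p), 4)) for i in range(1, p + 1)}
consumers = {f"c{i}": list(islice(buckets_for(i, c), 4)) for i in range(1, c + 1)}
```

With these values, `producers["p1"]` is `[1, 6, 11, 16]` and `consumers["c3"]` is `[3, 6, 9, 12]`, so every bucket has exactly one writer and one reader.
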
> Of course, consumers can not re-put messages into the bucket otherwise the 
> counting (10 000 elements/bucket) is screwed
> 
> Alternatively, you can insert messages with a TTL to automatically expire 
> "consumed buckets" after a while, saving you the hassle of cleaning up old 
> buckets to reclaim disk space.
> 
> 
>  There are other implementations based on a distributed lock using C* CAS, 
> but the above algorithm does not require any lock.
> 
> Regards
> 
>  Duy Hai DOAN
> 
> 
> 
> 
> 
> On Fri, Apr 4, 2014 at 12:47 PM, prem yadav <ipremya...@gmail.com> wrote:
> Oh ok. I thought you did not have a cassandra cluster already. Sorry about 
> that.
> 
> 
> On Fri, Apr 4, 2014 at 11:42 AM, Jan Algermissen <jan.algermis...@nordsc.com> 
> wrote:
> 
> On 04 Apr 2014, at 11:18, prem yadav <ipremya...@gmail.com> wrote:
> 
>> Though Cassandra can work, to me it looks like you could use a persistent 
>> queue (for example RabbitMQ) to implement this. All your workers can 
>> subscribe to a queue.
>> In fact, why not just MySQL?
> 
> Hey, I have got a C* cluster that can (potentially) do CAS.
> 
> Why would I set up a MySQL cluster to solve that problem?
> 
> And yeah, I could use a queue or redis or whatnot, but I want to avoid yet 
> another moving part :-)
> 
> Jan
> 
> 
>> 
>> 
>> On Thu, Apr 3, 2014 at 11:44 PM, Jan Algermissen 
>> <jan.algermis...@nordsc.com> wrote:
>> Hi,
>> 
>> maybe someone knows a nice solution to the following problem:
>> 
>> I have N worker processes that are intentionally masterless and do not know 
>> about each other - they are stateless and independent instances of a given 
>> service system.
>> 
>> These workers need to poll an event feed, say about every 10 seconds and 
>> persist a state after processing the polled events so the next worker knows 
>> where to continue processing events.
>> 
>> I would like to use C*’s CAS feature to coordinate the workers and protect 
>> the shared state (a row or cell in  a C* key space, too).
>> 
>> Has anybody done something similar and can suggest a ‘clever’ data model 
>> design and interaction?
>> 
>> 
>> 
>> Jan
>> 
> 
> 
>
