Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-24 Thread Flavio Percoco
On 09/23/2014 11:59 PM, Joe Gordon wrote:
 
 
 On Tue, Sep 23, 2014 at 9:13 AM, Zane Bitter zbit...@redhat.com
 mailto:zbit...@redhat.com wrote:
 
 On 22/09/14 22:04, Joe Gordon wrote:
 
 To me this is less about valid or invalid choices. The Zaqar team is
 comparing Zaqar to SQS, but after digging into the two of them, Zaqar
 barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
 important part of SQS: that the message will be delivered and will
 never be lost by SQS.
 
 
 I agree that this is the most important feature. Happily, Flavio has
 clarified this in his other thread[1]:
 
  *Zaqar's vision is to provide a cross-cloud interoperable,
   fully-reliable messaging service at scale that is both easy and
   non-invasive for deployers and users.*
 
   ...
 
   Zaqar aims to be a fully-reliable service, therefore messages should
   never be lost under any circumstances except for when the message's
   expiration time (ttl) is reached
 
 So Zaqar _will_ guarantee reliable delivery.
 
 Zaqar doesn't have the same scaling properties as SQS.
 
 
 This is true. (That's not to say it won't scale, but it doesn't
 scale in exactly the same way that SQS does because it has a
 different architecture.)
 
 It appears that the main reason for this is the ordering guarantee,
 which was introduced in response to feedback from users. So this is
 clearly a different design choice: SQS chose reliability plus
 effectively infinite scalability, while Zaqar chose reliability plus
 FIFO. It's not feasible to satisfy all three simultaneously, so the
 options are:
 
 1) Implement two separate modes and allow the user to decide
 2) Continue to choose FIFO over infinite scalability
 3) Drop FIFO and choose infinite scalability instead
 
 This is one of the key points on which we need to get buy-in from
 the community on selecting one of these as the long-term strategy.
 
 Zaqar is aiming for low latency per message, SQS doesn't appear
 to be.
 
 
 I've seen no evidence that Zaqar is actually aiming for that. There
 are waaay lower-latency ways to implement messaging if you don't
 care about durability (you wouldn't do store-and-forward, for a
 start). If you see a lot of talk about low latency, it's probably
 because for a long time people insisted on comparing Zaqar to
 RabbitMQ instead of SQS.
 
 
 I thought this was why Zaqar uses Falcon and not Pecan/WSME?
 
 For an application like Marconi where throughput and latency is of
 paramount importance, I recommend Falcon over
 Pecan. https://wiki.openstack.org/wiki/Zaqar/pecan-evaluation#Recommendation
 
 Yes, that statement mentions throughput first, but it mentions
 latency as well.

Right, but that doesn't make low-latency the main goal and as I've
already said, the fact that latency is not the main goal doesn't mean
the team will overlook it.

  
 
 
 (Let's also be careful not to talk about high latency as if it were
 a virtue in itself; it's simply something we would happily trade off
 for other properties. Zaqar _is_ making that trade-off.)
 
 So if Zaqar isn't SQS what is Zaqar and why should I use it?
 
 
 If you are a small-to-medium user of an SQS-like service, Zaqar is
 like SQS but better because not only does it never lose your
 messages but they always arrive in order, and you have the option to
 fan them out to multiple subscribers. If you are a very large user
 along one particular dimension (I believe it's number of messages
 delivered from a single queue, but probably Gordon will correct me
 :D) then Zaqar may not _yet_ have a good story for you.
 
 cheers,
 Zane.
 
 [1]
 http://lists.openstack.org/pipermail/openstack-dev/2014-September/046809.html
 
 
 ___________________________________________
 OpenStack-dev mailing list
 OpenStack-dev@lists.openstack.org
 http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
 
 
 
 
 


-- 
@flaper87
Flavio Percoco



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-24 Thread Flavio Percoco
On 09/24/2014 03:48 AM, Devananda van der Veen wrote:
 I've taken a bit of time out of this thread, and I'd like to jump back
 in now and attempt to summarize what I've learned and hopefully frame
 it in such a way that it helps us to answer the question Thierry
 asked:
 


I *loved* it! Thanks a lot for taking the time.

 On Fri, Sep 19, 2014 at 2:00 AM, Thierry Carrez thie...@openstack.org wrote:

 The underlying question being... can Zaqar evolve to ultimately reach
 the massive scale use case Joe, Clint and Devananda want it to reach, or
 are those design choices so deeply rooted in the code and architecture
 that Zaqar won't naturally mutate to support that use case.
 
 
 I also want to sincerely thank everyone who has been involved in this
 discussion, and helped to clarify the different viewpoints and
 uncertainties which have surrounded Zaqar lately. I hope that all of
 this helps provide the Zaqar team guidance on a path forward, as I do
 believe that a scalable cloud-based messaging service would greatly
 benefit the OpenStack ecosystem.
 
 Use cases
 --
 
 So, I'd like to start from the perspective of a hypothetical user
 evaluating messaging services for the new application that I'm
 developing. What does my application need from a messaging service so
 that it can grow and become hugely popular with all the hipsters of
 the world? In other words, what might my architectural requirements
 be?
 
 (This is certainly not a complete list of features, and it's not meant
 to be -- it is a list of things that I *might* need from a messaging
 service. But feel free to point out any glaring omissions I may have
 made anyway :) )
 
 1. Durability: I can't risk losing any messages
   Example: Using a queue to process votes. Every vote should count.
 
 2. Single Delivery - each message must be processed *exactly* once
   Example: Using a queue to process votes. Every vote must be counted only 
 once.
 
 3. Low latency to interact with service
   Example: Single threaded application that can't wait on external calls
 
 4. Low latency of a message through the system
   Example: Video streaming. Messages are very time-sensitive.
 
 5. Aggregate throughput
   Example: Ad banner processing. Remember when sites could get
 slash-dotted? I need a queue resilient to truly massive spikes in
 traffic.
 
 6. FIFO - When ordering matters
   Example: I can't stop a job that hasn't started yet.
 
 
 So, as a developer, I actually probably never need all of these in a
 single application -- but depending on what I'm doing, I'm going to
 need some of them. Hopefully, the examples above give some idea of
 what I have in mind for different sorts of applications I might
 develop which would require these guarantees from a messaging service.
 
 Why is this at all interesting or relevant? Because I think Zaqar and
 SQS are actually, in their current forms, trying to meet different
 sets of requirements. And, because I have not actually seen an
 application using a cloud which requires the things that Zaqar is
 guaranteeing - which doesn't mean they don't exist - it frames my past
 judgements about Zaqar in a much better way than simply "I have
 doubts". It explains _why_ I have those doubts.
 
 I'd now like to offer the following as a summary of this email thread
 and the available documentation on SQS and Zaqar, as far as which of
 the above requirements are satisfied by each service and why I believe
 that. If there are fallacies in here, please correct me.
 
 SQS
 --
 
 Requirements it meets: 1, 5
 
 The SQS documentation states that it guarantees durability of messages
 (1) and handles unlimited throughput (5).
 
 It does not guarantee once-and-only-once delivery (2) and requires
 applications that care about this to de-duplicate on the receiving
 side.
 
 It also does not guarantee message order (6), making it unsuitable for
 certain uses.
 
 SQS is not an application-local service nor does it use a wire-level
 protocol, so from this I infer that (3) and (4) were not design goals.
 
 
 Zaqar
 
 
 Requirements it meets: 1*, 2, 6
 
 Zaqar states that it aims to guarantee message durability (1) but does
 so by pushing the guarantee of durability into the storage layer.
 Thus, Zaqar will not be able to guarantee durability of messages when
 a storage pool fails, is misconfigured, or what have you. Therefore I
 do not feel that message durability is a strong guarantee of Zaqar
 itself; in some configurations, this guarantee may be possible based
 on the underlying storage, but this capability will need to be exposed
 in such a way that users can make informed decisions about which Zaqar
 storage back-end (or flavor) to use for their application based on
 whether or not they need durability.

I agree with the above but I would like to add a couple of things.

The first one is just a clarification on flavors. Flavors are not
required in order to use pools, whereas pools are required in order to
use flavors. Operators can 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-24 Thread Flavio Percoco
On 09/24/2014 12:06 AM, Joe Gordon wrote:
 
 
 On Tue, Sep 23, 2014 at 2:40 AM, Flavio Percoco fla...@redhat.com
 mailto:fla...@redhat.com wrote:
 
 On 09/23/2014 05:13 AM, Clint Byrum wrote:
  Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:
 
 [snip]
 
 
  To me this is less about valid or invalid choices. The Zaqar team is
  comparing Zaqar to SQS, but after digging into the two of them, zaqar
  barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
  important parts of SQS: the message will be delivered and will never be
  lost by SQS. Zaqar doesn't have the same scaling properties as SQS. 
 Zaqar
  is aiming for low latency per message, SQS doesn't appear to be. So if
  Zaqar isn't SQS what is Zaqar and why should I use it?
 
 
  I have to agree. I'd like to see a simple, non-ordered, high latency,
  high scale messaging service that can be used cheaply by cloud operators
  and users. What I see instead is a very powerful, ordered, low latency,
  medium scale messaging service that will likely cost a lot to scale out
  to the thousands of users level.
 
 I don't fully agree :D
 
 Let me break the above down into several points:
 
 * Zaqar team is comparing Zaqar to SQS: True, we're comparing to the
 *type* of service SQS is but not *all* the guarantees it gives. We're
 not working on an exact copy of the service but on a service capable of
 addressing the same use cases.
 
 * Zaqar is not guaranteeing reliability: This is not true. Yes, the
 current default write concern for the mongodb driver is `acknowledge`
 but that's a bug, not a feature [0] ;)
 
 * Zaqar doesn't have the same scaling properties as SQS: What are SQS
 scaling properties? We know they have a big user base, we know they have
 lots of connections, queues and what not but we don't have numbers to
 compare ourselves with.
 
  
 Here is *a* number: 30k messages per second on a single queue:
 http://java.dzone.com/articles/benchmarking-sqs

I know how to get those links and I had read them before. For example,
here's[0] one that is two years older, tests a different scenario, and
has a quite different result.

My point is that it's not as simple as saying X doesn't scale as well as Y. We
know, based on Zaqar's architecture, that depending on the storage there
are some scaling limits the service could hit but without more (or
proper) load tests I think that's just an assumption based on what we
know about the service architecture and not the storage itself. There
are benchmarks about mongodb but it'd be unfair to use those as the
definitive reference since the schema plays a huge role there.

And with this, I'm neither saying Zaqar scales unlimitedly regardless of
the storage backend nor that there are no limits at all. I'm aware
there's a lot to improve in the service.

[0]
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/throughput.html

Thanks for sharing,
Flavio

  
 
 
  * Zaqar is aiming for low latency per message: This is not true and I'd
  be curious to know where this came from. A couple of things to
  consider:
  
  - First and foremost, low latency is a very relative measure and it
  depends on each use-case.
  - The benchmarks Kurt did were purely informative. I believe it's good
  to do them every once in a while but this doesn't mean the team is
  mainly focused on that.
  - Not being focused on 'low-latency' does not mean the team will
  overlook performance.
 
 * Zaqar has FIFO and SQS doesn't: FIFO won't hurt *your use-case* if
 ordering is not a requirement but the lack of it does when ordering is a
 must.
 
 * Scaling out Zaqar will cost a lot: In terms of what? I'm pretty sure
 it's not for free but I'd like to understand better this point and
 figure out a way to improve it, if possible.
 
 * If Zaqar isn't SQS then what is it? Why should I use it?: I don't
 believe Zaqar is SQS as I don't believe nova is EC2. Do they share
 similar features and provide similar services? Yes, does that mean you
 can address similar use cases, hence similar users? Yes.
 
 In addition to the above, I believe Zaqar is a simple service, easy to
 install and to interact with. From a user perspective the semantics are
 few and the concepts are neither new nor difficult to grasp. From an
 operator's perspective, I don't believe it adds tons of complexity. It
 does require the operator to deploy a replicated storage environment but
 I believe all services require that.
 
 Cheers,
 Flavio
 
 P.S: Sorry for my late answer or lack of it. I lost *all* my emails
 yesterday and I'm working on recovering them.
 
 [0] https://bugs.launchpad.net/zaqar/+bug/1372335
 
 --
 @flaper87
 Flavio Percoco
 
 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-24 Thread Zane Bitter

On 23/09/14 17:59, Joe Gordon wrote:

  Zaqar is aiming for low latency per message, SQS doesn't appear to be.




I've seen no evidence that Zaqar is actually aiming for that. There are
waaay lower-latency ways to implement messaging if you don't care about
durability (you wouldn't do store-and-forward, for a start). If you see a
lot of talk about low latency, it's probably because for a long time people
insisted on comparing Zaqar to RabbitMQ instead of SQS.


I thought this was why Zaqar uses Falcon and not Pecan/WSME?

For an application like Marconi where throughput and latency is of
paramount importance, I recommend Falcon over Pecan.
https://wiki.openstack.org/wiki/Zaqar/pecan-evaluation#Recommendation

Yes, that statement mentions throughput first, but it mentions latency
as well.


I think we're talking about two different kinds of latency - latency for 
a message passing end-to-end through the system, and latency for a 
request to the API (which also affects throughput, and may not be a 
great choice of word).


By not caring about the former, which Zaqar and SQS don't, you can add 
guarantees like "never loses your message", which Zaqar and SQS have.


By not caring about the latter you can add a lot of cost to operating 
the service and... that's about it. (Which is why *both* Zaqar and 
clearly SQS *do* care about it.) There's really no upside to doing more 
work than you need to on every API request, of which there will be *a 
lot*. The latency trade-off here is against using the same framework 
as... a handful of other OpenStack projects - I can't even say all other 
OpenStack projects, since there are at least 2 or 3 frameworks in use 
out there already. IMHO the whole Falcon vs. Pecan thing is a storm in a 
teacup.


cheers,
Zane.



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-24 Thread Zane Bitter

On 23/09/14 19:29, Devananda van der Veen wrote:

On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter zbit...@redhat.com wrote:

On 22/09/14 17:06, Joe Gordon wrote:


If 50,000 messages per second doesn't count as small-to-moderate then
Zaqar does not fulfill a major SQS use case.



It's not a drop-in replacement, but as I mentioned you can recreate the SQS
semantics exactly *and* get the scalability benefits of that approach by
sharding at the application level and then round-robin polling.
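Concretely, the sharding-plus-round-robin approach Zane describes could be sketched like this; `ShardedQueue` and the minimal `queue`/`post`/`claim` client interface are hypothetical stand-ins, not the real python-zaqarclient API:

```python
import itertools
import random


class ShardedQueue(object):
    """Emulate SQS-style semantics on top of N ordered Zaqar queues.

    Spreading posts randomly across shards gives up cross-shard FIFO
    (as SQS does) in exchange for spreading one logical queue over
    several storage pools. The `client.queue(name)` / `post` / `claim`
    interface is a hypothetical simplification of a messaging client.
    """

    def __init__(self, client, name, shards=4):
        self.queues = [client.queue('%s-%d' % (name, i))
                       for i in range(shards)]
        self._poll = itertools.cycle(self.queues)

    def post(self, message):
        # Random shard per message: no ordering guarantee across shards.
        random.choice(self.queues).post(message)

    def claim(self, limit=10):
        # Round-robin polling: try each shard at most once per call.
        for _ in range(len(self.queues)):
            messages = next(self._poll).claim(limit=limit)
            if messages:
                return messages
        return []
```

Everything posted is eventually claimed, just not necessarily in posting order, which is exactly the SQS-like trade-off being discussed.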



This response seems dismissive of application developers deciding what
cloud-based messaging system to use for their application.

If I'm evaluating two messaging services, and one of them requires my
application to implement autoscaling and pool management, and the
other does not, I'm clearly going to pick the one which makes my
application development *simpler*.


This is absolutely true, but the point I was trying to make earlier in 
the thread is that for other use cases you can make exactly the same 
argument going in the other direction: if I'm evaluating two messaging 
services, and one of them requires my application to implement 
reordering of messages by sequence number, and the other does not, I'm 
clearly going to pick the one which makes my application development 
*simpler*.


So it's not a question of do we make developers do more work?. It's a 
question of *which* developers do we make do more work?.



Also, choices made early in a
product's lifecycle (like, say, a facebook game) about which
technology they use (like, say, for messaging) are often informed by
hopeful expectations of explosive growth and fame.

So, based on what you've said, if I were a game developer comparing
SQS and Zaqar today, it seems clear that, if I picked Zaqar, and my
game gets really popular, it's also going to have to be engineered to
handle autoscaling of queues in Zaqar. Based on that, I'm going to
pick SQS. Because then I don't have to worry about what I'm going to
do when my game has 100 million users and there's still just one
queue.


I totally agree, and that's why I'm increasingly convinced that Zaqar 
should eventually offer the choice of either. Happily, thanks to the 
existence of Flavours, I believe this can be implemented in the future 
as an optional distribution layer *above* the storage back end without 
any major changes to the current API or architecture. (One open 
question: would this require dropping browsability from the API?)


The key question here is if we're satisfied with the possibility of 
adding this in the future, or if we want to require Zaqar to dump the 
users with the in-order use case in favour of the users with the 
massive-scale use case. If we wanted that then a major re-think would be 
in order.


cheers,
Zane.



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-24 Thread Gordon Sim

Apologies in advance for possible repetition and pedantry...

On 09/24/2014 02:48 AM, Devananda van der Veen wrote:

2. Single Delivery - each message must be processed *exactly* once
   Example: Using a queue to process votes. Every vote must be counted only 
once.


It is also important to consider the ability of the publisher to 
reliably publish a message exactly once. If that can't be done, there 
may need to be de-duplication even if there is an exactly-once delivery 
guarantee of messages from the queue (because there could exist two 
copies of the same logical message).
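A receiving-side de-duplicator for this situation might look like the following sketch; the `client_id` field, attached by the publisher to every logical message and reused on publish retries, is an assumed application convention, not part of Zaqar or SQS:

```python
class DeduplicatingConsumer(object):
    """Drop messages whose publisher-assigned id was already processed.

    This is the usual receiving-side answer when the publisher may
    retry a publish after a timeout, or when the queue itself only
    guarantees at-least-once delivery.
    """

    def __init__(self, handler):
        self.handler = handler
        self._seen = set()   # in production: a bounded or persistent store

    def process(self, message):
        msg_id = message['client_id']   # assumed convention
        if msg_id in self._seen:
            return False                # duplicate, skip it
        self._seen.add(msg_id)
        self.handler(message)
        return True
```

In the voting example, a retried "vote" message carries the same `client_id` and is counted once.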



5. Aggregate throughput
   Example: Ad banner processing. Remember when sites could get
slash-dotted? I need a queue resilient to truly massive spikes in
traffic.


A massive spike in traffic can also be handled by allowing the queue to 
grow, rather than increasing the throughput. This is obviously only 
effective if it is indeed a spike and the rate of ingress drops again to 
allow the backlog to be processed.


So scaling up aggregate throughput is certainly an important requirement 
for some. However the example illustrates another, which is scaling the 
size of the queue (because the bottleneck for throughput may be in the 
application processing or this processing may be temporarily 
unavailable). The latter is something that both Zaqar and SQS I suspect 
would do quite well at.



6. FIFO - When ordering matters
   Example: I can't stop a job that hasn't started yet.


I think "FIFO" is insufficiently precise.

The most extreme requirement is total ordering, i.e. all messages are 
assigned a place in a fixed sequence and the order in which they are 
seen is the same for all receivers.


The example you give above is really causal ordering. Since the need to 
stop a job is caused by the starting of that job, the stop request must 
come after the start request. However the ordering of the stop request 
for task A with respect to a stop request for task B may not be defined 
(e.g. if they are triggered concurrently).


The pattern in use is also relevant. For multiple competing consumers, 
if there are ordering requirements such as the one in your example, it 
is not sufficient to *deliver* the messages in order, they must also be 
*processed* in order.


If I have two consumers processing task requests, and give the 'start A' 
message to one, and then the 'stop A' message to another it is possible 
that the second, though dispatched by the messaging service after the 
first message, is still processed before it.


One way to avoid that would be to have the application use a separate 
queue per processing consumer, and ensure causally related messages are 
sent through the same queue. The downside is less adaptive load 
balancing and resiliency. Another option is to have the messaging 
service recognise message groupings and ensure that messages in a group 
in which a previously delivered message has not yet been acknowledged 
are delivered only to the same consumer as that previous message.
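The first option, keeping causally related messages on one queue, usually comes down to routing by a shared group key; a minimal sketch (the group-key convention is assumed, not something either service defines):

```python
import zlib


def queue_for_group(group_key, queue_names):
    """Route all messages of one causal group to the same queue.

    Messages that must be processed in order (e.g. 'start A' then
    'stop A') share a group key ('A'), so they land in the same
    queue and therefore reach the same consumer in order.
    """
    # crc32 is stable across runs, so routing is deterministic.
    index = zlib.crc32(group_key.encode('utf-8')) % len(queue_names)
    return queue_names[index]
```

The cost, as noted above, is that load balancing across consumers becomes only as good as the distribution of group keys.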


[...]

Zaqar relies on a store-and-forward architecture, which is not
amenable to low-latency message processing (4).


I don't think store-and-forward precludes low-latency ('low' is of 
course subjective). Polling however is not a good fit for latency 
sensitive applications.



Again, as with SQS, it is not a wire-level protocol,


It is a wire-level protocol, but as it is based on HTTP it doesn't 
support asynchronous delivery of messages from server to client at present.



so I don't believe low-latency connectivity (3) was a design goal.


Agreed (and that is the important thing, so sorry for the nitpicking!).



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Gordon Sim

On 09/22/2014 05:58 PM, Zane Bitter wrote:

On 22/09/14 10:11, Gordon Sim wrote:

As I understand it, pools don't help scaling a given queue since all the
messages for that queue must be in the same pool. At present traffic
through different Zaqar queues are essentially entirely orthogonal
streams. Pooling can help scale the number of such orthogonal streams,
but to be honest, that's the easier part of the problem.


But I think it's also the important part of the problem. When I talk
about scaling, I mean 1 million clients sending 10 messages per second
each, not 10 clients sending 1 million messages per second each.


I wasn't really talking about high throughput per producer (which I 
agree is not going to be a good fit), but about e.g. a large number of 
subscribers for the same set of messages, e.g. publishing one message 
per second to 10,000 subscribers.


Even at much smaller scale, expanding from 10 subscribers to say 100 
seems relatively modest but the subscriber related load would increase 
by a factor of 10. I think handling these sorts of changes is also an 
important part of the problem (though perhaps not a part that Zaqar is 
focused on).



When a user gets to the point that individual queues have massive
throughput, it's unlikely that a one-size-fits-all cloud offering like
Zaqar or SQS is _ever_ going to meet their needs. Those users will want
to spin up and configure their own messaging systems on Nova servers,
and at that kind of size they'll be able to afford to. (In fact, they
may not be able to afford _not_ to, assuming per-message-based pricing.)


[...]

If scaling the number of communicants on a given communication channel
is a goal however, then strict ordering may hamper that. If it does, it
seems to me that this is not just a policy tweak on the underlying
datastore to choose the desired balance between ordering and scale, but
a more fundamental question on the internal structure of the queue
implementation built on top of the datastore.


I agree with your analysis, but I don't think this should be a goal.


I think it's worth clarifying that alongside the goals since scaling can 
mean different things to different people. The implication then is that 
there is some limit in the number of producers and/or consumers on a 
queue beyond which the service won't scale and applications need to 
design around that.



Note that the user can still implement this themselves using
application-level sharding - if you know that in-order delivery is not
important to you, then randomly assign clients to a queue and then poll
all of the queues in the round-robin. This yields _exactly_ the same
semantics as SQS.


You can certainly leave the problem of scaling in this dimension to the 
application itself by having them split the traffic into orthogonal 
streams or hooking up orthogonal streams to provide an aggregated stream.


A true distributed queue isn't entirely trivial, but it may well be that 
most applications can get by with a much simpler approximation.


Distributed (pub-sub) topic semantics are easier to implement, but if 
the application is responsible for keeping the partitions connected, 
then it also takes on part of the burden for availability and redundancy.



The reverse is true of SQS - if you want FIFO then you have to implement
re-ordering by sequence number in your application. (I'm not certain,
but it also sounds very much like this situation is ripe for losing
messages when your client dies.)
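Such consumer-side re-ordering by sequence number might be sketched as follows; publisher-assigned sequence numbers are an assumed convention, and note the caveat just mentioned: anything sitting in the buffer is lost if the client dies:

```python
import heapq


class Reorderer(object):
    """Re-establish FIFO on the consumer side of an unordered queue.

    Publishers stamp each message with a monotonically increasing
    sequence number; the consumer buffers out-of-order arrivals and
    releases them only once the next expected number shows up.
    """

    def __init__(self):
        self._next = 0
        self._pending = []   # min-heap of (seq, message)

    def push(self, seq, message):
        """Accept one arrival; return whatever is now ready, in order."""
        heapq.heappush(self._pending, (seq, message))
        ready = []
        while self._pending and self._pending[0][0] == self._next:
            ready.append(heapq.heappop(self._pending)[1])
            self._next += 1
        return ready
```

A gap in the sequence (a lost or still-in-flight message) stalls delivery of everything behind it, which is the complexity being pushed into the application.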

So the question is: in which use case do we want to push additional
complexity into the application? The case where there are truly massive
volumes of messages flowing to a single point?  Or the case where the
application wants the messages in order?


I think the first case is more generally about increasing the number of 
communicating parties (publishers or subscribers or both).


For competing consumers ordering isn't usually a concern since you are 
processing in parallel anyway (if it is important you need some notion 
of message grouping within which order is preserved and some stickiness 
between group and consumer).


For multiple non-competing consumers the choice needn't be as simple as 
total ordering or no ordering at all. Many systems quite naturally only 
define partial ordering which can be guaranteed more scalably.


That's not to deny that there are indeed cases where total ordering may 
be required however.



I'd suggest both that the former applications are better able to handle
that extra complexity and that the latter applications are probably more
common. So it seems that the Zaqar team made a good decision.


If that was a deliberate decision it would be worth clarifying in the 
goals. It seems to be a different conclusion from that reached by SQS 
and as such is part of the answer to the question that began the thread.



(Aside: it follows that Zaqar probably should have a maximum throughput
quota for each queue; or that it should report usage 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Flavio Percoco
On 09/23/2014 05:13 AM, Clint Byrum wrote:
 Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:

[snip]


 To me this is less about valid or invalid choices. The Zaqar team is
 comparing Zaqar to SQS, but after digging into the two of them, zaqar
 barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
 important parts of SQS: the message will be delivered and will never be
 lost by SQS. Zaqar doesn't have the same scaling properties as SQS. Zaqar
 is aiming for low latency per message, SQS doesn't appear to be. So if
 Zaqar isn't SQS what is Zaqar and why should I use it?

 
 I have to agree. I'd like to see a simple, non-ordered, high latency,
 high scale messaging service that can be used cheaply by cloud operators
 and users. What I see instead is a very powerful, ordered, low latency,
 medium scale messaging service that will likely cost a lot to scale out
 to the thousands of users level.

I don't fully agree :D

Let me break the above down into several points:

* Zaqar team is comparing Zaqar to SQS: True, we're comparing to the
*type* of service SQS is but not *all* the guarantees it gives. We're
not working on an exact copy of the service but on a service capable of
addressing the same use cases.

* Zaqar is not guaranteeing reliability: This is not true. Yes, the
current default write concern for the mongodb driver is `acknowledge`
but that's a bug, not a feature [0] ;)
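For illustration, the write concern a client requests from MongoDB can be expressed directly in a standard MongoDB connection URI; `w` and `journal` are real URI options, though how (or whether) a given Zaqar release passes them through to the driver is not shown in this thread, and the host names below are placeholders:

```python
# Acknowledged-by-primary-only writes can still be lost on failover;
# requiring majority acknowledgement plus journaling trades write
# latency for durability.
uri = ('mongodb://replica0,replica1,replica2/'
       '?replicaSet=zaqar&w=majority&journal=true')
```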

* Zaqar doesn't have the same scaling properties as SQS: What are SQS
scaling properties? We know they have a big user base, we know they have
lots of connections, queues and what not but we don't have numbers to
compare ourselves with.

* Zaqar is aiming for low latency per message: This is not true and I'd
be curious to know where this came from. A couple of things to consider:

- First and foremost, low latency is a very relative measure and it
depends on each use-case.
- The benchmarks Kurt did were purely informative. I believe it's good
to do them every once in a while but this doesn't mean the team is
mainly focused on that.
- Not being focused on 'low-latency' does not mean the team will
overlook performance.

* Zaqar has FIFO and SQS doesn't: FIFO won't hurt *your use-case* if
ordering is not a requirement but the lack of it does when ordering is a
must.

* Scaling out Zaqar will cost a lot: In terms of what? I'm pretty sure
it's not for free but I'd like to understand better this point and
figure out a way to improve it, if possible.

* If Zaqar isn't SQS then what is it? Why should I use it?: I don't
believe Zaqar is SQS as I don't believe nova is EC2. Do they share
similar features and provide similar services? Yes, does that mean you
can address similar use cases, hence similar users? Yes.

In addition to the above, I believe Zaqar is a simple service, easy to
install and to interact with. From a user perspective the semantics are
few and the concepts are neither new nor difficult to grasp. From an
operator's perspective, I don't believe it adds tons of complexity. It
does require the operator to deploy a replicated storage environment but
I believe all services require that.

Cheers,
Flavio

P.S: Sorry for my late answer or lack of it. I lost *all* my emails
yesterday and I'm working on recovering them.

[0] https://bugs.launchpad.net/zaqar/+bug/1372335

-- 
@flaper87
Flavio Percoco



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Flavio Percoco
On 09/23/2014 10:58 AM, Gordon Sim wrote:
 On 09/22/2014 05:58 PM, Zane Bitter wrote:
 On 22/09/14 10:11, Gordon Sim wrote:
 As I understand it, pools don't help scaling a given queue since all the
 messages for that queue must be in the same pool. At present traffic
 through different Zaqar queues are essentially entirely orthogonal
 streams. Pooling can help scale the number of such orthogonal streams,
 but to be honest, that's the easier part of the problem.

 But I think it's also the important part of the problem. When I talk
 about scaling, I mean 1 million clients sending 10 messages per second
 each, not 10 clients sending 1 million messages per second each.
 
 I wasn't really talking about high throughput per producer (which I
 agree is not going to be a good fit), but about e.g. a large number of
 subscribers for the same set of messages, e.g. publishing one message
 per second to 10,000 subscribers.
 
 Even at much smaller scale, expanding from 10 subscribers to say 100
 seems relatively modest but the subscriber related load would increase
 by a factor of 10. I think handling these sorts of changes is also an
 important part of the problem (though perhaps not a part that Zaqar is
 focused on).
 
 When a user gets to the point that individual queues have massive
 throughput, it's unlikely that a one-size-fits-all cloud offering like
 Zaqar or SQS is _ever_ going to meet their needs. Those users will want
 to spin up and configure their own messaging systems on Nova servers,
 and at that kind of size they'll be able to afford to. (In fact, they
 may not be able to afford _not_ to, assuming per-message-based pricing.)
 
 [...]
 If scaling the number of communicants on a given communication channel
 is a goal however, then strict ordering may hamper that. If it does, it
 seems to me that this is not just a policy tweak on the underlying
 datastore to choose the desired balance between ordering and scale, but
 a more fundamental question on the internal structure of the queue
 implementation built on top of the datastore.

 I agree with your analysis, but I don't think this should be a goal.
 
 I think it's worth clarifying that alongside the goals since scaling can
 mean different things to different people. The implication then is that
 there is some limit in the number of producers and/or consumers on a
 queue beyond which the service won't scale and applications need to
 design around that.

Agreed. The above is not part of Zaqar's goals. That is to say, each
store knows best how to distribute its own reads and writes.
Nonetheless, drivers can be smart about this and be implemented in ways
that get the most out of the backend.


 Note that the user can still implement this themselves using
 application-level sharding - if you know that in-order delivery is not
 important to you, then randomly assign clients to a queue and then poll
 all of the queues in round-robin. This yields _exactly_ the same
 semantics as SQS.
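The application-level sharding described above can be sketched in a few
lines. The `post`/`claim` client API and the in-memory backend here are
hypothetical stand-ins (so the example runs standalone), not the actual
Zaqar client:

```python
import itertools
import random
from collections import defaultdict, deque

class InMemoryClient:
    """Toy stand-in for a real queue client (hypothetical API)."""
    def __init__(self):
        self.queues = defaultdict(deque)

    def post(self, queue, message):
        self.queues[queue].append(message)

    def claim(self, queue):
        q = self.queues[queue]
        return q.popleft() if q else None

class ShardedQueue:
    """Approximate SQS-like semantics on top of N FIFO queues:
    producers pick a shard at random, consumers poll the shards
    round-robin, deliberately trading global ordering for the ability
    to spread a single logical queue across multiple pools."""
    def __init__(self, client, name, shards=4):
        self.client = client
        self.shards = ["%s-%d" % (name, i) for i in range(shards)]
        self._rr = itertools.cycle(self.shards)

    def post(self, message):
        # Random shard assignment spreads producer load.
        self.client.post(random.choice(self.shards), message)

    def claim(self):
        # Poll each shard at most once per call, round-robin.
        for _ in self.shards:
            msg = self.client.claim(next(self._rr))
            if msg is not None:
                return msg
        return None
```

Within each shard messages still arrive in order; across shards they do
not, which is exactly the SQS-style guarantee.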
 
 You can certainly leave the problem of scaling in this dimension to the
 application itself by having them split the traffic into orthogonal
 streams or hooking up orthogonal streams to provide an aggregated stream.
 
 A true distributed queue isn't entirely trivial, but it may well be that
 most applications can get by with a much simpler approximation.
 
 Distributed (pub-sub) topic semantics are easier to implement, but if
 the application is responsible for keeping the partitions connected,
 then it also takes on part of the burden for availability and redundancy.
 
 The reverse is true of SQS - if you want FIFO then you have to implement
 re-ordering by sequence number in your application. (I'm not certain,
 but it also sounds very much like this situation is ripe for losing
 messages when your client dies.)
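The client-side re-ordering Zane mentions amounts to a reorder buffer
keyed by a producer-assigned sequence number. A minimal sketch (generic,
not SQS-specific code) might look like:

```python
import heapq

class ReorderBuffer:
    """Restore FIFO order from an unordered delivery stream.

    Out-of-order messages are held back until the gap before them is
    filled. Note the failure mode raised in the thread: messages parked
    here have already been consumed from the queue, so they are lost if
    the client dies before delivering them.
    """
    def __init__(self):
        self.next_seq = 0
        self._held = []  # min-heap of (seq, payload)

    def push(self, seq, payload):
        """Accept one arrival; return messages now deliverable, in order."""
        heapq.heappush(self._held, (seq, payload))
        ready = []
        while self._held and self._held[0][0] == self.next_seq:
            ready.append(heapq.heappop(self._held)[1])
            self.next_seq += 1
        return ready
```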

 So the question is: in which use case do we want to push additional
 complexity into the application? The case where there are truly massive
 volumes of messages flowing to a single point?  Or the case where the
 application wants the messages in order?
 
 I think the first case is more generally about increasing the number of
 communicating parties (publishers or subscribers or both).
 
 For competing consumers ordering isn't usually a concern since you are
 processing in parallel anyway (if it is important you need some notion
 of message grouping within which order is preserved and some stickiness
 between group and consumer).
 
 For multiple non-competing consumers the choice needn't be as simple as
 total ordering or no ordering at all. Many systems quite naturally only
 define partial ordering which can be guaranteed more scalably.
 
 That's not to deny that there are indeed cases where total ordering may
 be required however.
 
 I'd suggest both that the former applications are better able to handle
 that extra complexity and that the latter applications are probably more
 common. So it seems that the Zaqar team made a good decision.
 
 If that was 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Zane Bitter

On 23/09/14 08:58, Flavio Percoco wrote:

I believe the guarantee is still useful and it currently does not
represent an issue for the service nor the user. 2 things could happen
to FIFO in the future:

1. It's made optional and we allow users to opt in on a per-flavor
basis. (I personally don't like this one because it makes
interoperability even harder).


Hmm, I'm not so sure this is such a bad option. I criticised flavours 
earlier in this thread on the assumption that it meant every storage 
back-end would have its own semantics, and those would be surfaced to 
the user in the form of flavours - that does indeed make 
interoperability very hard.


The same issue does not arise for options implemented in Zaqar itself. 
If every back-end supports FIFO semantics but Zaqar has a layer that 
distributes the queue among multiple backends or not, depending on the 
flavour selected by the user, then there would be no impact on 
interoperability as the same semantics would be available regardless of 
the back-end chosen by the operator.



2. It's removed completely (Again, I personally don't like this one
because I don't think we have strong enough cases to require this to
happen).

That said, there's just one thing I think will happen for now: it'll be
kept as-is unless there are strong cases that'd require (1) or (2). All
this should be considered in the discussion of the API v2, whenever that
happens.


I think this is it ;)

cheers,
Zane.



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Zane Bitter

On 22/09/14 22:04, Joe Gordon wrote:

To me this is less about valid or invalid choices. The Zaqar team is
comparing Zaqar to SQS, but after digging into the two of them, zaqar
barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
important parts of SQS: the message will be delivered and will never be
lost by SQS.


I agree that this is the most important feature. Happily, Flavio has 
clarified this in his other thread[1]:


 *Zaqar's vision is to provide a cross-cloud interoperable,
  fully-reliable messaging service at scale that is both, easy and not
  invasive, for deployers and users.*

  ...

  Zaqar aims to be a fully-reliable service, therefore messages should
  never be lost under any circumstances except for when the message's
  expiration time (ttl) is reached

So Zaqar _will_ guarantee reliable delivery.


Zaqar doesn't have the same scaling properties as SQS.


This is true. (That's not to say it won't scale, but it doesn't scale in 
exactly the same way that SQS does because it has a different architecture.)


It appears that the main reason for this is the ordering guarantee, 
which was introduced in response to feedback from users. So this is 
clearly a different design choice: SQS chose reliability plus 
effectively infinite scalability, while Zaqar chose reliability plus 
FIFO. It's not feasible to satisfy all three simultaneously, so the 
options are:


1) Implement two separate modes and allow the user to decide
2) Continue to choose FIFO over infinite scalability
3) Drop FIFO and choose infinite scalability instead

This is one of the key points on which we need to get buy-in from the 
community on selecting one of these as the long-term strategy.



Zaqar is aiming for low latency per message, SQS doesn't appear to be.


I've seen no evidence that Zaqar is actually aiming for that. There are 
waaay lower-latency ways to implement messaging if you don't care about 
durability (you wouldn't do store-and-forward, for a start). If you see 
a lot of talk about low latency, it's probably because for a long time 
people insisted on comparing Zaqar to RabbitMQ instead of SQS.


(Let's also be careful not to talk about high latency as if it were a 
virtue in itself; it's simply something we would happily trade off for 
other properties. Zaqar _is_ making that trade-off.)



So if Zaqar isn't SQS what is Zaqar and why should I use it?


If you are a small-to-medium user of an SQS-like service, Zaqar is like 
SQS but better because not only does it never lose your messages but 
they always arrive in order, and you have the option to fan them out to 
multiple subscribers. If you are a very large user along one particular 
dimension (I believe it's number of messages delivered from a single 
queue, but probably Gordon will correct me :D) then Zaqar may not _yet_ 
have a good story for you.


cheers,
Zane.

[1] 
http://lists.openstack.org/pipermail/openstack-dev/2014-September/046809.html




Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Fox, Kevin M
Flavio wrote The reasoning, as explained in another
email, is that from a use-case perspective, strict ordering won't hurt
you if you don't need it whereas having to implement it in the client
side because the service doesn't provide it can be a PITA.

The reasoning is flawed though. If performance is a concern, having strict 
ordering costs you even when you may not care about ordering!

For example, is it better to implement a video streaming service on tcp or udp 
if firewalls aren't a concern? The latter. Why? Because ordering is a problem 
for these systems! If you have frames 1, 2 and 3..., and frame 2 gets lost on 
the first transmit and needs resending, but 3 gets there, the system has to 
hold frame 3 while waiting for frame 2. But by the time frame 2 gets there, 
frame 3 doesn't matter because the system needs to move on to frame 5 now. The 
human eye doesn't care to wait for retransmits of frames; it only cares about 
the now. So because of the ordering, the eye sees 3 dropped frames instead of 
just one, making the system worse, not better.
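Kevin's frame example is head-of-line blocking, and the effect is easy
to model with a toy calculation (purely illustrative, not real streaming
code):

```python
def display_times(arrivals, ordered):
    """arrivals[i] is when frame i reaches the receiver. With ordered
    delivery a frame cannot be shown before all earlier frames have
    been; with unordered delivery each frame is shown on arrival."""
    times = []
    for t in arrivals:
        if ordered and times and times[-1] > t:
            t = times[-1]  # blocked behind an earlier, later-arriving frame
        times.append(t)
    return times

# Frame 1 is lost and retransmitted (arrives at t=5); frames 2-4 were on time.
arrivals = [0, 5, 2, 3, 4]
unordered = display_times(arrivals, ordered=False)  # [0, 5, 2, 3, 4]
ordered = display_times(arrivals, ordered=True)     # [0, 5, 5, 5, 5]
```

With strict ordering, one retransmit pushes three subsequent frames past
their deadline as well, which is the "3 dropped frames instead of just
one" above.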

Yeah, I know it's a bit of a silly example. No one would implement video 
streaming on top of messaging like that. But it does make the point that 
something that seemingly only provides good things (order is always better than 
disorder, right?) sometimes has unintended and negative side effects. In 
lossless systems, it can show up as unnecessary latency or higher CPU loads.

I think your option 1 will make Zaqar much more palatable to those who don't 
need the strict ordering requirement.

I'm glad you want to make hard things like guaranteed ordering available so 
that users don't have to deal with it themselves if they don't want to. It's a 
great feature. But it is also an anti-feature in some cases. The ramifications 
of its requirements are higher than you think, and a feature to just disable it 
shouldn't be very costly to implement.

Part of the controversy right now, I think, has been not understanding the use 
case here, and by insisting that FIFO is only ever positive, it makes others 
who know its negatives question what other assumptions were made in Zaqar and 
makes them a little gun-shy.

Please do reconsider this stance.

Thanks,
Kevin



From: Flavio Percoco [fla...@redhat.com]
Sent: Tuesday, September 23, 2014 5:58 AM
To: openstack-dev@lists.openstack.org
Subject: Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed 
Queues

On 09/23/2014 10:58 AM, Gordon Sim wrote:
 On 09/22/2014 05:58 PM, Zane Bitter wrote:
 On 22/09/14 10:11, Gordon Sim wrote:
 As I understand it, pools don't help scaling a given queue since all the
 messages for that queue must be in the same pool. At present traffic
 through different Zaqar queues are essentially entirely orthogonal
 streams. Pooling can help scale the number of such orthogonal streams,
 but to be honest, that's the easier part of the problem.

 But I think it's also the important part of the problem. When I talk
 about scaling, I mean 1 million clients sending 10 messages per second
 each, not 10 clients sending 1 million messages per second each.

 I wasn't really talking about high throughput per producer (which I
 agree is not going to be a good fit), but about e.g. a large number of
 subscribers for the same set of messages, e.g. publishing one message
 per second to 10,000 subscribers.

 Even at much smaller scale, expanding from 10 subscribers to say 100
 seems relatively modest but the subscriber related load would increase
 by a factor of 10. I think handling these sorts of changes is also an
 important part of the problem (though perhaps not a part that Zaqar is
 focused on).

 When a user gets to the point that individual queues have massive
 throughput, it's unlikely that a one-size-fits-all cloud offering like
 Zaqar or SQS is _ever_ going to meet their needs. Those users will want
 to spin up and configure their own messaging systems on Nova servers,
 and at that kind of size they'll be able to afford to. (In fact, they
 may not be able to afford _not_ to, assuming per-message-based pricing.)

 [...]
 If scaling the number of communicants on a given communication channel
 is a goal however, then strict ordering may hamper that. If it does, it
 seems to me that this is not just a policy tweak on the underlying
 datastore to choose the desired balance between ordering and scale, but
 a more fundamental question on the internal structure of the queue
 implementation built on top of the datastore.

 I agree with your analysis, but I don't think this should be a goal.

 I think it's worth clarifying that alongside the goals since scaling can
 mean different things to different people. The implication then is that
 there is some limit in the number of producers and/or consumers on a
 queue beyond which the service won't scale and applications need to
 design around that.

Agreed. The above is not part of Zaqar's goals. That is to say that each

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Flavio Percoco
On 09/23/2014 09:29 PM, Fox, Kevin M wrote:
 Flavio wrote The reasoning, as explained in another
 email, is that from a use-case perspective, strict ordering won't hurt
 you if you don't need it whereas having to implement it in the client
 side because the service doesn't provide it can be a PITA.
 
 The reasoning is flawed though. If performance is a concern, having strict 
 ordering costs you even when you may not care about ordering!
 
 For example, is it better to implement a video streaming service on tcp or 
 udp if firewalls aren't a concern? The latter. Why? Because ordering is a 
 problem for these systems! If you have frames 1, 2 and 3..., and frame 2 gets 
 lost on the first transmit and needs resending, but 3 gets there, the system 
 has to hold frame 3 while waiting for frame 2. But by the time frame 2 gets 
 there, frame 3 doesn't matter because the system needs to move on to frame 5 
 now. The human eye doesn't care to wait for retransmits of frames; it only 
 cares about the now. So because of the ordering, the eye sees 3 dropped 
 frames instead of just one, making the system worse, not better.
 
 Yeah, I know it's a bit of a silly example. No one would implement video 
 streaming on top of messaging like that. But it does make the point that 
 something that seemingly only provides good things (order is always better 
 than disorder, right?) sometimes has unintended and negative side effects. 
 In lossless systems, it can show up as unnecessary latency or higher CPU 
 loads.
 
 I think your option 1 will make Zaqar much more palatable to those who don't 
 need the strict ordering requirement.
 
 I'm glad you want to make hard things like guaranteed ordering available so 
 that users don't have to deal with it themselves if they don't want to. It's 
 a great feature. But it is also an anti-feature in some cases. The 
 ramifications of its requirements are higher than you think, and a feature to 
 just disable it shouldn't be very costly to implement.
 
 Part of the controversy right now, I think, has been not understanding the 
 use case here, and by insisting that FIFO is only ever positive, it makes 
 others who know its negatives question what other assumptions were made in 
 Zaqar and makes them a little gun-shy.
 
 Please do reconsider this stance.


Hey Kevin,

FWIW, I explicitly said from a use-case perspective, which in the
context of the emails I was replying to referred to the need (or not)
for FIFO, not to the impact it has on other areas like performance.

In no way did I try to insist that FIFO is only ever positive, and I've
also explicitly said in several other emails that it *does* have an
impact on performance.

That said, I agree that if FIFO's reality in Zaqar changes, it'll likely
be towards option (1).

Thanks for your feedback,
Flavio

 
 Thanks,
 Kevin
 
 
 
 From: Flavio Percoco [fla...@redhat.com]
 Sent: Tuesday, September 23, 2014 5:58 AM
 To: openstack-dev@lists.openstack.org
 Subject: Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed 
 Queues
 
 On 09/23/2014 10:58 AM, Gordon Sim wrote:
 On 09/22/2014 05:58 PM, Zane Bitter wrote:
 On 22/09/14 10:11, Gordon Sim wrote:
 As I understand it, pools don't help scaling a given queue since all the
 messages for that queue must be in the same pool. At present traffic
 through different Zaqar queues are essentially entirely orthogonal
 streams. Pooling can help scale the number of such orthogonal streams,
 but to be honest, that's the easier part of the problem.

 But I think it's also the important part of the problem. When I talk
 about scaling, I mean 1 million clients sending 10 messages per second
 each, not 10 clients sending 1 million messages per second each.

 I wasn't really talking about high throughput per producer (which I
 agree is not going to be a good fit), but about e.g. a large number of
 subscribers for the same set of messages, e.g. publishing one message
 per second to 10,000 subscribers.

 Even at much smaller scale, expanding from 10 subscribers to say 100
 seems relatively modest but the subscriber related load would increase
 by a factor of 10. I think handling these sorts of changes is also an
 important part of the problem (though perhaps not a part that Zaqar is
 focused on).

 When a user gets to the point that individual queues have massive
 throughput, it's unlikely that a one-size-fits-all cloud offering like
 Zaqar or SQS is _ever_ going to meet their needs. Those users will want
 to spin up and configure their own messaging systems on Nova servers,
 and at that kind of size they'll be able to afford to. (In fact, they
 may not be able to afford _not_ to, assuming per-message-based pricing.)

 [...]
 If scaling the number of communicants on a given communication channel
 is a goal however, then strict ordering may hamper that. If it does, it
 seems to me that this is not just a policy tweak on the underlying
 datastore to choose

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Joe Gordon
On Tue, Sep 23, 2014 at 9:13 AM, Zane Bitter zbit...@redhat.com wrote:

 On 22/09/14 22:04, Joe Gordon wrote:

 To me this is less about valid or invalid choices. The Zaqar team is
 comparing Zaqar to SQS, but after digging into the two of them, zaqar
 barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
 important parts of SQS: the message will be delivered and will never be
 lost by SQS.


 I agree that this is the most important feature. Happily, Flavio has
 clarified this in his other thread[1]:

  *Zaqar's vision is to provide a cross-cloud interoperable,
   fully-reliable messaging service at scale that is both, easy and not
   invasive, for deployers and users.*

   ...

   Zaqar aims to be a fully-reliable service, therefore messages should
   never be lost under any circumstances except for when the message's
   expiration time (ttl) is reached

 So Zaqar _will_ guarantee reliable delivery.

  Zaqar doesn't have the same scaling properties as SQS.


 This is true. (That's not to say it won't scale, but it doesn't scale in
 exactly the same way that SQS does because it has a different architecture.)

 It appears that the main reason for this is the ordering guarantee, which
 was introduced in response to feedback from users. So this is clearly a
 different design choice: SQS chose reliability plus effectively infinite
 scalability, while Zaqar chose reliability plus FIFO. It's not feasible to
 satisfy all three simultaneously, so the options are:

 1) Implement two separate modes and allow the user to decide
 2) Continue to choose FIFO over infinite scalability
 3) Drop FIFO and choose infinite scalability instead

 This is one of the key points on which we need to get buy-in from the
 community on selecting one of these as the long-term strategy.

  Zaqar is aiming for low latency per message, SQS doesn't appear to be.


 I've seen no evidence that Zaqar is actually aiming for that. There are
 waaay lower-latency ways to implement messaging if you don't care about
 durability (you wouldn't do store-and-forward, for a start). If you see a
 lot of talk about low latency, it's probably because for a long time people
 insisted on comparing Zaqar to RabbitMQ instead of SQS.


I thought this was why Zaqar uses Falcon and not Pecan/WSME?

For an application like Marconi where throughput and latency is of
paramount importance, I recommend Falcon over Pecan.
https://wiki.openstack.org/wiki/Zaqar/pecan-evaluation#Recommendation

Yes, that statement mentions throughput, but it mentions latency
as well.



 (Let's also be careful not to talk about high latency as if it were a
 virtue in itself; it's simply something we would happily trade off for
 other properties. Zaqar _is_ making that trade-off.)

  So if Zaqar isn't SQS what is Zaqar and why should I use it?


 If you are a small-to-medium user of an SQS-like service, Zaqar is like
 SQS but better because not only does it never lose your messages but they
 always arrive in order, and you have the option to fan them out to multiple
 subscribers. If you are a very large user along one particular dimension (I
 believe it's number of messages delivered from a single queue, but probably
 Gordon will correct me :D) then Zaqar may not _yet_ have a good story for
 you.

 cheers,
 Zane.

 [1] http://lists.openstack.org/pipermail/openstack-dev/2014-
 September/046809.html




Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Joe Gordon
On Tue, Sep 23, 2014 at 2:40 AM, Flavio Percoco fla...@redhat.com wrote:

 On 09/23/2014 05:13 AM, Clint Byrum wrote:
  Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:

 [snip]

 
  To me this is less about valid or invalid choices. The Zaqar team is
  comparing Zaqar to SQS, but after digging into the two of them, zaqar
  barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
  important parts of SQS: the message will be delivered and will never be
  lost by SQS. Zaqar doesn't have the same scaling properties as SQS. Zaqar
  is aiming for low latency per message, SQS doesn't appear to be. So if
  Zaqar isn't SQS what is Zaqar and why should I use it?
 
 
  I have to agree. I'd like to see a simple, non-ordered, high latency,
  high scale messaging service that can be used cheaply by cloud operators
  and users. What I see instead is a very powerful, ordered, low latency,
  medium scale messaging service that will likely cost a lot to scale out
  to the thousands of users level.

 I don't fully agree :D

 Let me break the above down into several points:

 * Zaqar team is comparing Zaqar to SQS: True, we're comparing to the
 *type* of service SQS is but not *all* the guarantees it gives. We're
 not working on an exact copy of the service but on a service capable of
 addressing the same use cases.

 * Zaqar is not guaranteeing reliability: This is not true. Yes, the
 current default write concern for the mongodb driver is `acknowledge`
 but that's a bug, not a feature [0] ;)

 * Zaqar doesn't have the same scaling properties as SQS: What are SQS
 scaling properties? We know they have a big user base, we know they have
 lots of connections, queues and what not but we don't have numbers to
 compare ourselves with.


Here is *a* number:
30k messages per second on a single queue:
http://java.dzone.com/articles/benchmarking-sqs



 * Zaqar is aiming for low latency per message: This is not true and I'd
 be curious to know where this came from. A couple of things to
 consider:

 - First and foremost, low latency is a very relative measure, and it
 depends on each use case.
 - The benchmarks Kurt did were purely informative. I believe it's good
 to do them every once in a while, but this doesn't mean the team is
 mainly focused on that.
 - Not being focused on 'low-latency' does not mean the team will
 overlook performance.

 * Zaqar has FIFO and SQS doesn't: FIFO won't hurt *your use case* if
 ordering is not a requirement, but the lack of it does when ordering is
 a must.

 * Scaling out Zaqar will cost a lot: In terms of what? I'm pretty sure
 it's not free, but I'd like to understand this point better and
 figure out a way to improve it, if possible.

 * If Zaqar isn't SQS then what is it? Why should I use it?: I don't
 believe Zaqar is SQS just as I don't believe nova is EC2. Do they share
 similar features and provide similar services? Yes. Does that mean they
 can address similar use cases, and hence similar users? Yes.

 In addition to the above, I believe Zaqar is a simple service, easy to
 install and to interact with. From a user perspective the semantics are
 few and the concepts are neither new nor difficult to grasp. From an
 operator's perspective, I don't believe it adds tons of complexity. It
 does require the operator to deploy a replicated storage environment but
 I believe all services require that.

 Cheers,
 Flavio

 P.S: Sorry for my late answer or lack of it. I lost *all* my emails
 yesterday and I'm working on recovering them.

 [0] https://bugs.launchpad.net/zaqar/+bug/1372335

 --
 @flaper87
 Flavio Percoco



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Clint Byrum
Excerpts from Joe Gordon's message of 2014-09-23 14:59:33 -0700:
 On Tue, Sep 23, 2014 at 9:13 AM, Zane Bitter zbit...@redhat.com wrote:
 
  On 22/09/14 22:04, Joe Gordon wrote:
 
  To me this is less about valid or invalid choices. The Zaqar team is
  comparing Zaqar to SQS, but after digging into the two of them, zaqar
  barely looks like SQS. Zaqar doesn't guarantee what IMHO is the most
  important parts of SQS: the message will be delivered and will never be
  lost by SQS.
 
 
  I agree that this is the most important feature. Happily, Flavio has
  clarified this in his other thread[1]:
 
   *Zaqar's vision is to provide a cross-cloud interoperable,
fully-reliable messaging service at scale that is both, easy and not
invasive, for deployers and users.*
 
...
 
Zaqar aims to be a fully-reliable service, therefore messages should
never be lost under any circumstances except for when the message's
expiration time (ttl) is reached
 
  So Zaqar _will_ guarantee reliable delivery.
 
   Zaqar doesn't have the same scaling properties as SQS.
 
 
  This is true. (That's not to say it won't scale, but it doesn't scale in
  exactly the same way that SQS does because it has a different architecture.)
 
  It appears that the main reason for this is the ordering guarantee, which
  was introduced in response to feedback from users. So this is clearly a
  different design choice: SQS chose reliability plus effectively infinite
  scalability, while Zaqar chose reliability plus FIFO. It's not feasible to
  satisfy all three simultaneously, so the options are:
 
  1) Implement two separate modes and allow the user to decide
  2) Continue to choose FIFO over infinite scalability
  3) Drop FIFO and choose infinite scalability instead
 
  This is one of the key points on which we need to get buy-in from the
  community on selecting one of these as the long-term strategy.
 
   Zaqar is aiming for low latency per message, SQS doesn't appear to be.
 
 
  I've seen no evidence that Zaqar is actually aiming for that. There are
  waaay lower-latency ways to implement messaging if you don't care about
  durability (you wouldn't do store-and-forward, for a start). If you see a
  lot of talk about low latency, it's probably because for a long time people
  insisted on comparing Zaqar to RabbitMQ instead of SQS.
 
 
 I thought this was why Zaqar uses Falcon and not Pecan/WSME?
 
 For an application like Marconi where throughput and latency is of
 paramount importance, I recommend Falcon over Pecan.
 https://wiki.openstack.org/wiki/Zaqar/pecan-evaluation#Recommendation
 
  Yes, that statement mentions throughput, but it mentions latency
  as well.
 

I definitely see where that may have subtly suggested the wrong
thing, if indeed latency isn't a top concern.

I think what it probably should say is something like this:

For an application like Marconi where there will be many repetitive,
small requests, a lighter weight solution such as Falcon is preferred
over Pecan.

As in, we care about the cost of all those requests, not so much about
the latency.



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Devananda van der Veen
On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter zbit...@redhat.com wrote:
 On 22/09/14 17:06, Joe Gordon wrote:

  If 50,000 messages per second doesn't count as small-to-moderate then Zaqar
  does not fulfill a major SQS use case.


 It's not a drop-in replacement, but as I mentioned you can recreate the SQS
 semantics exactly *and* get the scalability benefits of that approach by
 sharding at the application level and then round-robin polling.


This response seems dismissive of application developers deciding what
cloud-based messaging system to use for their application.

If I'm evaluating two messaging services, and one of them requires my
application to implement autoscaling and pool management, and the
other does not, I'm clearly going to pick the one which makes my
application development *simpler*. Also, choices made early in a
product's lifecycle (like, say, a facebook game) about which
technology they use (like, say, for messaging) are often informed by
hopeful expectations of explosive growth and fame.

So, based on what you've said, if I were a game developer comparing
SQS and Zaqar today, it seems clear that if I picked Zaqar and my
game got really popular, it would also have to be engineered to
handle autoscaling of queues in Zaqar. Based on that, I'm going to
pick SQS. Because then I don't have to worry about what I'm going to
do when my game has 100 million users and there's still just one
queue.

Granted, it has become apparent that Zaqar is targeting a different
audience than SQS. I'm going to follow up shortly with more on that
topic.

-Devananda



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-23 Thread Devananda van der Veen
I've taken a bit of time out of this thread, and I'd like to jump back
in now and attempt to summarize what I've learned and hopefully frame
it in such a way that it helps us to answer the question Thierry
asked:

On Fri, Sep 19, 2014 at 2:00 AM, Thierry Carrez thie...@openstack.org wrote:

 The underlying question being... can Zaqar evolve to ultimately reach
 the massive scale use case Joe, Clint and Devananda want it to reach, or
 are those design choices so deeply rooted in the code and architecture
 that Zaqar won't naturally mutate to support that use case.


I also want to sincerely thank everyone who has been involved in this
discussion, and helped to clarify the different viewpoints and
uncertainties which have surrounded Zaqar lately. I hope that all of
this helps provide the Zaqar team guidance on a path forward, as I do
believe that a scalable cloud-based messaging service would greatly
benefit the OpenStack ecosystem.

Use cases
--

So, I'd like to start from the perspective of a hypothetical user
evaluating messaging services for the new application that I'm
developing. What does my application need from a messaging service so
that it can grow and become hugely popular with all the hipsters of
the world? In other words, what might my architectural requirements
be?

(This is certainly not a complete list of features, and it's not meant
to be -- it is a list of things that I *might* need from a messaging
service. But feel free to point out any glaring omissions I may have
made anyway :) )

1. Durability: I can't risk losing any messages
  Example: Using a queue to process votes. Every vote should count.

2. Single Delivery - each message must be processed *exactly* once
  Example: Using a queue to process votes. Every vote must be counted only once.

3. Low latency to interact with service
  Example: Single threaded application that can't wait on external calls

4. Low latency of a message through the system
  Example: Video streaming. Messages are very time-sensitive.

5. Aggregate throughput
  Example: Ad banner processing. Remember when sites could get
slash-dotted? I need a queue resilient to truly massive spikes in
traffic.

6. FIFO - When ordering matters
  Example: I can't stop a job that hasn't started yet.


So, as a developer, I actually probably never need all of these in a
single application -- but depending on what I'm doing, I'm going to
need some of them. Hopefully, the examples above give some idea of
what I have in mind for different sorts of applications I might
develop which would require these guarantees from a messaging service.

Why is this at all interesting or relevant? Because I think Zaqar and
SQS are actually, in their current forms, trying to meet different
sets of requirements. And, because I have not actually seen an
application using a cloud which requires the things that Zaqar is
guaranteeing - which doesn't mean they don't exist - it frames my past
judgements about Zaqar in a much better way than simply "I have
doubts." It explains _why_ I have those doubts.

I'd now like to offer the following as a summary of this email thread
and the available documentation on SQS and Zaqar, as far as which of
the above requirements are satisfied by each service and why I believe
that. If there are fallacies in here, please correct me.

SQS
--

Requirements it meets: 1, 5

The SQS documentation states that it guarantees durability of messages
(1) and handles unlimited throughput (5).

It does not guarantee once-and-only-once delivery (2) and requires
applications that care about this to de-duplicate on the receiving
side.
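
The receiving-side de-duplication that SQS pushes onto applications can be sketched roughly like this (the class, the `process` callback, and the message IDs are all illustrative, not any real SDK API):

```python
# Hypothetical sketch of receiver-side de-duplication for an
# at-least-once queue such as SQS. In production the "seen" set would
# be a shared store with a TTL (e.g. Redis SETNX), not a Python set.
class DedupingConsumer:
    def __init__(self, process):
        self.process = process
        self.seen = set()

    def handle(self, message_id, body):
        if message_id in self.seen:
            return False  # duplicate delivery: drop it silently
        self.seen.add(message_id)
        self.process(body)
        return True

consumer = DedupingConsumer(process=lambda body: None)
assert consumer.handle("m-1", "vote:alice") is True
assert consumer.handle("m-1", "vote:alice") is False  # redelivery ignored
```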

It also does not guarantee message order (6), making it unsuitable for
certain uses.

SQS is not an application-local service nor does it use a wire-level
protocol, so from this I infer that (3) and (4) were not design goals.


Zaqar
--

Requirements it meets: 1*, 2, 6

Zaqar states that it aims to guarantee message durability (1) but does
so by pushing the guarantee of durability into the storage layer.
Thus, Zaqar will not be able to guarantee durability of messages when
a storage pool fails, is misconfigured, or what have you. Therefore I
do not feel that message durability is a strong guarantee of Zaqar
itself; in some configurations, this guarantee may be possible based
on the underlying storage, but this capability will need to be exposed
in such a way that users can make informed decisions about which Zaqar
storage back-end (or flavor) to use for their application based on
whether or not they need durability.

Single delivery of messages (2) is provided for by the claim semantics
in Zaqar's API. FIFO (6) ordering was an architectural choice made
based on feedback from users.
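
As a rough illustration of how claim semantics enable single processing (a toy in-memory model, not Zaqar's actual implementation or HTTP API): a claimed message is invisible to other consumers until the claim's TTL lapses, and is only removed for good when the claimant deletes it, so a crashed consumer does not lose the message.

```python
import time

# Toy claim-based queue. A consumer claims a message for a limited
# time; delete() acknowledges successful processing. If the claimant
# dies, the claim expires and the message becomes claimable again.
class ClaimQueue:
    def __init__(self):
        self.messages = {}   # msg_id -> [body, claim_expiry or None]
        self.next_id = 0

    def post(self, body):
        self.next_id += 1
        self.messages[self.next_id] = [body, None]
        return self.next_id

    def claim(self, ttl):
        now = time.monotonic()
        for msg_id, rec in self.messages.items():
            if rec[1] is None or rec[1] <= now:  # unclaimed or expired
                rec[1] = now + ttl
                return msg_id, rec[0]
        return None

    def delete(self, msg_id):
        self.messages.pop(msg_id, None)  # ack: message is gone for good

q = ClaimQueue()
q.post("job-1")
msg_id, body = q.claim(ttl=30)
assert q.claim(ttl=30) is None  # no other consumer can claim it meanwhile
q.delete(msg_id)                # processed successfully
```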

Aggregate throughput of a single queue (5) is not scalable beyond the
reach of a single storage pool. This makes it possible for an
application to outgrow Zaqar when its total throughput needs exceed
the capacity of a single pool. This would also make it 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Gordon Sim

On 09/19/2014 09:13 PM, Zane Bitter wrote:

SQS offers very, very limited guarantees, and it's clear that the reason
for that is to make it massively, massively scalable in the way that
e.g. S3 is scalable while also remaining comparably durable (S3 is
supposedly designed for 11 nines, BTW).

Zaqar, meanwhile, seems to be promising the world in terms of
guarantees. (And then taking it away in the fine print, where it says
that the operator can disregard many of them, potentially without the
user's knowledge.)

On the other hand, IIUC Zaqar does in fact have a sharding feature
(Pools) which is its answer to the massive scaling question.


There are different dimensions to the scaling problem.

As I understand it, pools don't help scaling a given queue since all the 
messages for that queue must be in the same pool. At present, traffic 
through different Zaqar queues consists of entirely orthogonal 
streams. Pooling can help scale the number of such orthogonal streams, 
but to be honest, that's the easier part of the problem.


There is also the possibility of using the sharding capabilities of the 
underlying storage. But the pattern of use will determine how effective 
that can be.


So for example, on the ordering question, if order is defined by a 
single sequence number held in the database and atomically incremented 
for every message published, that is not likely to be something where 
the database's sharding is going to help in scaling the number of 
concurrent publications.
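
A minimal sketch of the contention being described: every publisher, whichever shard its message ultimately lands in, must first pass through the same counter (the class and names are mine for illustration, not Zaqar's):

```python
import threading

# Total ordering via one atomically incremented sequence number:
# the lock around the counter is a global serialization point, so
# adding shards does not add concurrent publication capacity.
class SequencedPublisher:
    def __init__(self, num_shards):
        self.lock = threading.Lock()   # the global ordering bottleneck
        self.seq = 0
        self.shards = [[] for _ in range(num_shards)]

    def publish(self, body):
        with self.lock:                # all publishers contend here
            self.seq += 1
            seq = self.seq
        self.shards[seq % len(self.shards)].append((seq, body))
        return seq

p = SequencedPublisher(num_shards=4)
seqs = [p.publish(f"m{i}") for i in range(8)]
assert seqs == list(range(1, 9))  # globally ordered, via one counter
```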


Though sharding would allow scaling the total number of messages on the 
queue (by distributing them over multiple shards), the total ordering of 
those messages reduces its effectiveness in scaling the number of 
concurrent getters (e.g. the concurrent subscribers in pub-sub) since 
they will all be getting the messages in exactly the same order.


Strict ordering impacts the competing consumers case also (and is in my 
opinion of limited value as a guarantee anyway). At any given time, the 
head of the queue is in one shard, and all concurrent claim requests 
will contend for messages in that same shard. Though the unsuccessful 
claimants may then move to another shard as the head moves, they will 
all again try to access the messages in the same order.


So if Zaqar's goal is to scale the number of orthogonal queues, and the 
number of messages held at any time within these, the pooling facility 
and any sharding capability in the underlying store for a pool would 
likely be effective even with the strict ordering guarantee.


If scaling the number of communicants on a given communication channel 
is a goal however, then strict ordering may hamper that. If it does, it 
seems to me that this is not just a policy tweak on the underlying 
datastore to choose the desired balance between ordering and scale, but 
a more fundamental question on the internal structure of the queue 
implementation built on top of the datastore.


I also get the impression, perhaps wrongly, that providing the strict 
ordering guarantee wasn't necessarily an explicit requirement, but was 
simply a property of the underlying implementation(?).




Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Zane Bitter

On 22/09/14 10:11, Gordon Sim wrote:

On 09/19/2014 09:13 PM, Zane Bitter wrote:

SQS offers very, very limited guarantees, and it's clear that the reason
for that is to make it massively, massively scalable in the way that
e.g. S3 is scalable while also remaining comparably durable (S3 is
supposedly designed for 11 nines, BTW).

Zaqar, meanwhile, seems to be promising the world in terms of
guarantees. (And then taking it away in the fine print, where it says
that the operator can disregard many of them, potentially without the
user's knowledge.)

On the other hand, IIUC Zaqar does in fact have a sharding feature
(Pools) which is its answer to the massive scaling question.


There are different dimensions to the scaling problem.


Many thanks for this analysis, Gordon. This is really helpful stuff.


As I understand it, pools don't help scaling a given queue since all the
messages for that queue must be in the same pool. At present, traffic
through different Zaqar queues consists of entirely orthogonal
streams. Pooling can help scale the number of such orthogonal streams,
but to be honest, that's the easier part of the problem.


But I think it's also the important part of the problem. When I talk 
about scaling, I mean 1 million clients sending 10 messages per second 
each, not 10 clients sending 1 million messages per second each.


When a user gets to the point that individual queues have massive 
throughput, it's unlikely that a one-size-fits-all cloud offering like 
Zaqar or SQS is _ever_ going to meet their needs. Those users will want 
to spin up and configure their own messaging systems on Nova servers, 
and at that kind of size they'll be able to afford to. (In fact, they 
may not be able to afford _not_ to, assuming per-message-based pricing.)



There is also the possibility of using the sharding capabilities of the
underlying storage. But the pattern of use will determine how effective
that can be.

So for example, on the ordering question, if order is defined by a
single sequence number held in the database and atomically incremented
for every message published, that is not likely to be something where
the database's sharding is going to help in scaling the number of
concurrent publications.

Though sharding would allow scaling the total number of messages on the
queue (by distributing them over multiple shards), the total ordering of
those messages reduces its effectiveness in scaling the number of
concurrent getters (e.g. the concurrent subscribers in pub-sub) since
they will all be getting the messages in exactly the same order.

Strict ordering impacts the competing consumers case also (and is in my
opinion of limited value as a guarantee anyway). At any given time, the
head of the queue is in one shard, and all concurrent claim requests
will contend for messages in that same shard. Though the unsuccessful
claimants may then move to another shard as the head moves, they will
all again try to access the messages in the same order.

So if Zaqar's goal is to scale the number of orthogonal queues, and the
number of messages held at any time within these, the pooling facility
and any sharding capability in the underlying store for a pool would
likely be effective even with the strict ordering guarantee.


IMHO this is (or should be) the goal - support enormous numbers of 
small-to-moderate sized queues.



If scaling the number of communicants on a given communication channel
is a goal however, then strict ordering may hamper that. If it does, it
seems to me that this is not just a policy tweak on the underlying
datastore to choose the desired balance between ordering and scale, but
a more fundamental question on the internal structure of the queue
implementation built on top of the datastore.


I agree with your analysis, but I don't think this should be a goal.

Note that the user can still implement this themselves using 
application-level sharding - if you know that in-order delivery is not 
important to you, then randomly assign clients to a queue and then poll 
all of the queues in round-robin. This yields _exactly_ the same 
semantics as SQS.
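
A sketch of that application-level pattern, under my own naming (nothing here is a real Zaqar or SQS API): clients scatter sends at random across independent queues, and a consumer polls the queues round-robin. Per-queue FIFO still holds, but there is no global order across queues, which is exactly the SQS-style trade.

```python
import itertools
import random

# Application-level sharding: N independent queues plus round-robin
# polling. Delivery is complete but globally unordered.
class ShardedQueues:
    def __init__(self, num_queues):
        self.queues = [[] for _ in range(num_queues)]
        self._rr = itertools.cycle(range(num_queues))

    def send(self, body):
        # a client could equally pin itself to one randomly chosen queue
        random.choice(self.queues).append(body)

    def receive(self):
        # round-robin poll: try each queue once per pass
        for _ in range(len(self.queues)):
            q = self.queues[next(self._rr)]
            if q:
                return q.pop(0)
        return None

qs = ShardedQueues(num_queues=4)
for i in range(10):
    qs.send(f"msg-{i}")
got = [qs.receive() for _ in range(10)]
# all messages arrive, but interleaving across queues is unspecified
assert sorted(got) == sorted(f"msg-{i}" for i in range(10))
```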


The reverse is true of SQS - if you want FIFO then you have to implement 
re-ordering by sequence number in your application. (I'm not certain, 
but it also sounds very much like this situation is ripe for losing 
messages when your client dies.)
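
The client-side re-ordering being described might look roughly like this (illustrative names; the producer is assumed to stamp a monotonically increasing sequence number on each message). Note the failure mode Zane hints at: anything held in the pending buffer is lost if the client dies before a gap is filled.

```python
import heapq

# Consumer-side FIFO reconstruction over an unordered transport:
# buffer out-of-order arrivals in a min-heap and release messages
# only once the next expected sequence number has arrived.
class Reorderer:
    def __init__(self):
        self.next_seq = 1
        self.pending = []  # min-heap of (seq, body)

    def receive(self, seq, body):
        """Return the messages that are now deliverable, in order."""
        heapq.heappush(self.pending, (seq, body))
        out = []
        while self.pending and self.pending[0][0] == self.next_seq:
            out.append(heapq.heappop(self.pending)[1])
            self.next_seq += 1
        return out

r = Reorderer()
assert r.receive(2, "b") == []           # hold: seq 1 not seen yet
assert r.receive(1, "a") == ["a", "b"]   # gap filled, flush in order
assert r.receive(3, "c") == ["c"]
```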


So the question is: in which use case do we want to push additional 
complexity into the application? The case where there are truly massive 
volumes of messages flowing to a single point? Or the case where the 
application wants the messages in order?


I'd suggest both that the former applications are better able to handle 
that extra complexity and that the latter applications are probably more 
common. So it seems that the Zaqar team made a good decision.


(Aside: it follows that Zaqar probably should have a maximum throughput 
quota for each queue; or that it should report usage information 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Joe Gordon
On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter zbit...@redhat.com wrote:

 On 22/09/14 10:11, Gordon Sim wrote:

 On 09/19/2014 09:13 PM, Zane Bitter wrote:

 SQS offers very, very limited guarantees, and it's clear that the reason
 for that is to make it massively, massively scalable in the way that
 e.g. S3 is scalable while also remaining comparably durable (S3 is
 supposedly designed for 11 nines, BTW).

 Zaqar, meanwhile, seems to be promising the world in terms of
 guarantees. (And then taking it away in the fine print, where it says
 that the operator can disregard many of them, potentially without the
 user's knowledge.)

 On the other hand, IIUC Zaqar does in fact have a sharding feature
 (Pools) which is its answer to the massive scaling question.


 There are different dimensions to the scaling problem.


 Many thanks for this analysis, Gordon. This is really helpful stuff.

  As I understand it, pools don't help scaling a given queue since all the
 messages for that queue must be in the same pool. At present, traffic
 through different Zaqar queues consists of entirely orthogonal
 streams. Pooling can help scale the number of such orthogonal streams,
 but to be honest, that's the easier part of the problem.


 But I think it's also the important part of the problem. When I talk about
 scaling, I mean 1 million clients sending 10 messages per second each, not
 10 clients sending 1 million messages per second each.

 When a user gets to the point that individual queues have massive
 throughput, it's unlikely that a one-size-fits-all cloud offering like
 Zaqar or SQS is _ever_ going to meet their needs. Those users will want to
 spin up and configure their own messaging systems on Nova servers, and at
 that kind of size they'll be able to afford to. (In fact, they may not be
 able to afford _not_ to, assuming per-message-based pricing.)


Running a message queue that has a high guarantee of not losing a message
is hard, and SQS promises exactly that: it *will* deliver your message. If a
use case can handle occasionally dropping messages then running your own MQ
makes more sense.

SQS is designed to handle massive queues as well, while I haven't found any
examples of queues that have 1 million messages/second being sent or
received, 30k to 100k messages/second is not unheard of [0][1][2].

[0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
[1] http://java.dzone.com/articles/benchmarking-sqs
[2]
http://www.slideshare.net/AmazonWebServices/massive-message-processing-with-amazon-sqs-and-amazon-dynamodb-arc301-aws-reinvent-2013-28431182


  There is also the possibility of using the sharding capabilities of the
 underlying storage. But the pattern of use will determine how effective
 that can be.

 So for example, on the ordering question, if order is defined by a
 single sequence number held in the database and atomically incremented
 for every message published, that is not likely to be something where
 the database's sharding is going to help in scaling the number of
 concurrent publications.

 Though sharding would allow scaling the total number of messages on the
 queue (by distributing them over multiple shards), the total ordering of
 those messages reduces its effectiveness in scaling the number of
 concurrent getters (e.g. the concurrent subscribers in pub-sub) since
 they will all be getting the messages in exactly the same order.

 Strict ordering impacts the competing consumers case also (and is in my
 opinion of limited value as a guarantee anyway). At any given time, the
 head of the queue is in one shard, and all concurrent claim requests
 will contend for messages in that same shard. Though the unsuccessful
 claimants may then move to another shard as the head moves, they will
 all again try to access the messages in the same order.

 So if Zaqar's goal is to scale the number of orthogonal queues, and the
 number of messages held at any time within these, the pooling facility
 and any sharding capability in the underlying store for a pool would
 likely be effective even with the strict ordering guarantee.


 IMHO this is (or should be) the goal - support enormous numbers of
 small-to-moderate sized queues.


If 50,000 messages per second doesn't count as small-to-moderate then Zaqar
does not fulfill a major SQS use case.




  If scaling the number of communicants on a given communication channel
 is a goal however, then strict ordering may hamper that. If it does, it
 seems to me that this is not just a policy tweak on the underlying
 datastore to choose the desired balance between ordering and scale, but
 a more fundamental question on the internal structure of the queue
 implementation built on top of the datastore.


 I agree with your analysis, but I don't think this should be a goal.

 Note that the user can still implement this themselves using
 application-level sharding - if you know that in-order delivery is not
 important to you, then randomly assign clients to a queue 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Zane Bitter

On 22/09/14 17:06, Joe Gordon wrote:

On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter zbit...@redhat.com wrote:


On 22/09/14 10:11, Gordon Sim wrote:


On 09/19/2014 09:13 PM, Zane Bitter wrote:


SQS offers very, very limited guarantees, and it's clear that the reason
for that is to make it massively, massively scalable in the way that
e.g. S3 is scalable while also remaining comparably durable (S3 is
supposedly designed for 11 nines, BTW).

Zaqar, meanwhile, seems to be promising the world in terms of
guarantees. (And then taking it away in the fine print, where it says
that the operator can disregard many of them, potentially without the
user's knowledge.)

On the other hand, IIUC Zaqar does in fact have a sharding feature
(Pools) which is its answer to the massive scaling question.



There are different dimensions to the scaling problem.



Many thanks for this analysis, Gordon. This is really helpful stuff.

  As I understand it, pools don't help scaling a given queue since all the

messages for that queue must be in the same pool. At present, traffic
through different Zaqar queues consists of entirely orthogonal
streams. Pooling can help scale the number of such orthogonal streams,
but to be honest, that's the easier part of the problem.



But I think it's also the important part of the problem. When I talk about
scaling, I mean 1 million clients sending 10 messages per second each, not
10 clients sending 1 million messages per second each.

When a user gets to the point that individual queues have massive
throughput, it's unlikely that a one-size-fits-all cloud offering like
Zaqar or SQS is _ever_ going to meet their needs. Those users will want to
spin up and configure their own messaging systems on Nova servers, and at
that kind of size they'll be able to afford to. (In fact, they may not be
able to afford _not_ to, assuming per-message-based pricing.)



Running a message queue that has a high guarantee of not losing a message
is hard, and SQS promises exactly that: it *will* deliver your message. If a
use case can handle occasionally dropping messages then running your own MQ
makes more sense.

SQS is designed to handle massive queues as well, while I haven't found any
examples of queues that have 1 million messages/second being sent or
received, 30k to 100k messages/second is not unheard of [0][1][2].

[0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
[1] http://java.dzone.com/articles/benchmarking-sqs
[2]
http://www.slideshare.net/AmazonWebServices/massive-message-processing-with-amazon-sqs-and-amazon-dynamodb-arc301-aws-reinvent-2013-28431182


Thanks for digging those up, that's really helpful input. I think number 
[1] kind of summed up part of what I'm arguing here though:


But once your requirements get above 35k messages per second, chances 
are you need custom solutions anyway; not to mention that while SQS is 
cheap, it may become expensive with such loads.



  There is also the possibility of using the sharding capabilities of the

underlying storage. But the pattern of use will determine how effective
that can be.

So for example, on the ordering question, if order is defined by a
single sequence number held in the database and atomically incremented
for every message published, that is not likely to be something where
the database's sharding is going to help in scaling the number of
concurrent publications.

Though sharding would allow scaling the total number of messages on the
queue (by distributing them over multiple shards), the total ordering of
those messages reduces its effectiveness in scaling the number of
concurrent getters (e.g. the concurrent subscribers in pub-sub) since
they will all be getting the messages in exactly the same order.

Strict ordering impacts the competing consumers case also (and is in my
opinion of limited value as a guarantee anyway). At any given time, the
head of the queue is in one shard, and all concurrent claim requests
will contend for messages in that same shard. Though the unsuccessful
claimants may then move to another shard as the head moves, they will
all again try to access the messages in the same order.

So if Zaqar's goal is to scale the number of orthogonal queues, and the
number of messages held at any time within these, the pooling facility
and any sharding capability in the underlying store for a pool would
likely be effective even with the strict ordering guarantee.



IMHO this is (or should be) the goal - support enormous numbers of
small-to-moderate sized queues.



If 50,000 messages per second doesn't count as small-to-moderate then Zaqar
does not fulfill a major SQS use case.


It's not a drop-in replacement, but as I mentioned you can recreate the 
SQS semantics exactly *and* get the scalability benefits of that 
approach by sharding at the application level and then round-robin polling.


As I also mentioned, this is pretty easy to implement, and is only 
required for really big applications that are more likely 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Joe Gordon
On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter zbit...@redhat.com wrote:

 On 22/09/14 17:06, Joe Gordon wrote:

 On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter zbit...@redhat.com wrote:

  On 22/09/14 10:11, Gordon Sim wrote:

  On 09/19/2014 09:13 PM, Zane Bitter wrote:

  SQS offers very, very limited guarantees, and it's clear that the
 reason
 for that is to make it massively, massively scalable in the way that
 e.g. S3 is scalable while also remaining comparably durable (S3 is
 supposedly designed for 11 nines, BTW).

 Zaqar, meanwhile, seems to be promising the world in terms of
 guarantees. (And then taking it away in the fine print, where it says
 that the operator can disregard many of them, potentially without the
 user's knowledge.)

 On the other hand, IIUC Zaqar does in fact have a sharding feature
 (Pools) which is its answer to the massive scaling question.


 There are different dimensions to the scaling problem.


 Many thanks for this analysis, Gordon. This is really helpful stuff.

   As I understand it, pools don't help scaling a given queue since all
 the

 messages for that queue must be in the same pool. At present, traffic
 through different Zaqar queues consists of entirely orthogonal
 streams. Pooling can help scale the number of such orthogonal streams,
 but to be honest, that's the easier part of the problem.


 But I think it's also the important part of the problem. When I talk
 about
 scaling, I mean 1 million clients sending 10 messages per second each,
 not
 10 clients sending 1 million messages per second each.

 When a user gets to the point that individual queues have massive
 throughput, it's unlikely that a one-size-fits-all cloud offering like
 Zaqar or SQS is _ever_ going to meet their needs. Those users will want
 to
 spin up and configure their own messaging systems on Nova servers, and at
 that kind of size they'll be able to afford to. (In fact, they may not be
 able to afford _not_ to, assuming per-message-based pricing.)


 Running a message queue that has a high guarantee of not losing a message
 is hard, and SQS promises exactly that: it *will* deliver your message. If a
 use case can handle occasionally dropping messages then running your own
 MQ
 makes more sense.

 SQS is designed to handle massive queues as well, while I haven't found
 any
 examples of queues that have 1 million messages/second being sent or
 received, 30k to 100k messages/second is not unheard of [0][1][2].

 [0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
 [1] http://java.dzone.com/articles/benchmarking-sqs
 [2]
 http://www.slideshare.net/AmazonWebServices/massive-
 message-processing-with-amazon-sqs-and-amazon-
 dynamodb-arc301-aws-reinvent-2013-28431182


 Thanks for digging those up, that's really helpful input. I think number
 [1] kind of summed up part of what I'm arguing here though:

 But once your requirements get above 35k messages per second, chances are
 you need custom solutions anyway; not to mention that while SQS is cheap,
 it may become expensive with such loads.


If you don't require the reliability guarantees that SQS provides then
perhaps. But I would be surprised to hear that a user can set up something
with this level of uptime for less:

Amazon SQS runs within Amazon’s high-availability data centers, so queues
will be available whenever applications need them. To prevent messages from
being lost or becoming unavailable, all messages are stored redundantly
across multiple servers and data centers. [1]




There is also the possibility of using the sharding capabilities of the

 underlying storage. But the pattern of use will determine how effective
 that can be.

 So for example, on the ordering question, if order is defined by a
 single sequence number held in the database and atomically incremented
 for every message published, that is not likely to be something where
 the database's sharding is going to help in scaling the number of
 concurrent publications.

 Though sharding would allow scaling the total number of messages on the
 queue (by distributing them over multiple shards), the total ordering of
 those messages reduces its effectiveness in scaling the number of
 concurrent getters (e.g. the concurrent subscribers in pub-sub) since
 they will all be getting the messages in exactly the same order.

 Strict ordering impacts the competing consumers case also (and is in my
 opinion of limited value as a guarantee anyway). At any given time, the
 head of the queue is in one shard, and all concurrent claim requests
 will contend for messages in that same shard. Though the unsuccessful
 claimants may then move to another shard as the head moves, they will
 all again try to access the messages in the same order.

 So if Zaqar's goal is to scale the number of orthogonal queues, and the
 number of messages held at any time within these, the pooling facility
 and any sharding capability in the underlying store for a pool would
 likely be effective even 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Joe Gordon
On Mon, Sep 22, 2014 at 7:04 PM, Joe Gordon joe.gord...@gmail.com wrote:



 On Mon, Sep 22, 2014 at 5:47 PM, Zane Bitter zbit...@redhat.com wrote:

 On 22/09/14 17:06, Joe Gordon wrote:

 On Mon, Sep 22, 2014 at 9:58 AM, Zane Bitter zbit...@redhat.com wrote:

  On 22/09/14 10:11, Gordon Sim wrote:

  On 09/19/2014 09:13 PM, Zane Bitter wrote:

  SQS offers very, very limited guarantees, and it's clear that the
 reason
 for that is to make it massively, massively scalable in the way that
 e.g. S3 is scalable while also remaining comparably durable (S3 is
 supposedly designed for 11 nines, BTW).

 Zaqar, meanwhile, seems to be promising the world in terms of
 guarantees. (And then taking it away in the fine print, where it says
 that the operator can disregard many of them, potentially without the
 user's knowledge.)

 On the other hand, IIUC Zaqar does in fact have a sharding feature
 (Pools) which is its answer to the massive scaling question.


 There are different dimensions to the scaling problem.


 Many thanks for this analysis, Gordon. This is really helpful stuff.

   As I understand it, pools don't help scaling a given queue since all
 the

 messages for that queue must be in the same pool. At present, traffic
 through different Zaqar queues consists of entirely orthogonal
 streams. Pooling can help scale the number of such orthogonal streams,
 but to be honest, that's the easier part of the problem.


 But I think it's also the important part of the problem. When I talk
 about
 scaling, I mean 1 million clients sending 10 messages per second each,
 not
 10 clients sending 1 million messages per second each.

 When a user gets to the point that individual queues have massive
 throughput, it's unlikely that a one-size-fits-all cloud offering like
 Zaqar or SQS is _ever_ going to meet their needs. Those users will want
 to
 spin up and configure their own messaging systems on Nova servers, and
 at
 that kind of size they'll be able to afford to. (In fact, they may not
 be
 able to afford _not_ to, assuming per-message-based pricing.)


 Running a message queue with a high guarantee of not losing messages
 is hard, and SQS promises exactly that: it *will* deliver your message.
 If a use case can handle occasionally dropping messages, then running
 your own MQ makes more sense.

 SQS is designed to handle massive queues as well. While I haven't found
 any examples of queues that have 1 million messages/second being sent or
 received, 30k to 100k messages/second is not unheard of [0][1][2].

 [0] https://www.youtube.com/watch?v=zwLC5xmCZUs#t=22m53s
 [1] http://java.dzone.com/articles/benchmarking-sqs
 [2]
 http://www.slideshare.net/AmazonWebServices/massive-
 message-processing-with-amazon-sqs-and-amazon-
 dynamodb-arc301-aws-reinvent-2013-28431182


 Thanks for digging those up, that's really helpful input. I think number
 [1] kind of summed up part of what I'm arguing here though:

 But once your requirements get above 35k messages per second, chances
 are you need custom solutions anyway; not to mention that while SQS is
 cheap, it may become expensive with such loads.


 If you don't require the reliability guarantees that SQS provides then
 perhaps. But I would be surprised to hear that a user can set up something
 with this level of uptime for less:

 Amazon SQS runs within Amazon’s high-availability data centers, so queues
 will be available whenever applications need them. To prevent messages from
 being lost or becoming unavailable, all messages are stored redundantly
 across multiple servers and data centers. [1]




 There is also the possibility of using the sharding capabilities of the
 underlying storage. But the pattern of use will determine how effective
 that can be.

 So for example, on the ordering question, if order is defined by a
 single sequence number held in the database and atomically incremented
 for every message published, that is not likely to be something where
 the database's sharding is going to help in scaling the number of
 concurrent publications.

 Though sharding would allow scaling the total number of messages on the
 queue (by distributing them over multiple shards), the total ordering of
 those messages reduces its effectiveness in scaling the number of
 concurrent getters (e.g. the concurrent subscribers in pub-sub), since
 they will all be getting the messages in exactly the same order.

 Strict ordering impacts the competing consumers case also (and is in my
 opinion of limited value as a guarantee anyway). At any given time, the
 head of the queue is in one shard, and all concurrent claim requests
 will contend for messages in that same shard. Though the unsuccessful
 claimants may then move to another shard as the head moves, they will
 all again try to access the messages in the same order.

 So if Zaqar's goal is to scale the number of orthogonal queues, and the
 number of messages held at any time within these, the pooling facility
 and any 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Clint Byrum
Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:
  [...]
 If you don't require the reliability guarantees that SQS provides then
 perhaps. But I would be surprised to hear that a user can set up something
 with this level of uptime for less:
 
 Amazon SQS runs within Amazon’s high-availability data centers, so queues
 will be available whenever applications need them. To prevent messages from
 being lost or becoming unavailable, all messages are stored redundantly
 across multiple servers and data centers. [1]
 

This is pretty easily doable with gearman or even just using Redis
directly. But it is still ops for end users. The AWS users I've talked to
who use SQS do so because they like that they can use RDS, SQS, and ELB,
and have only one type of thing to operate: their app.
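
For what it's worth, the Redis variant Clint mentions is usually built on
the classic RPOPLPUSH "reliable queue" pattern. The sketch below models it
with in-memory deques so it needs no server; the function names mirror the
Redis commands, but this is an illustration of the pattern, not anyone's
production code:

```python
from collections import deque

# Two "Redis lists": pending work and in-flight (claimed) messages.
pending = deque()
processing = deque()

def lpush(item):
    pending.appendleft(item)

def rpoplpush():
    """Atomically move one message from pending to processing,
    mirroring Redis's RPOPLPUSH: the message is never in limbo."""
    if not pending:
        return None
    item = pending.pop()
    processing.appendleft(item)
    return item

def ack(item):
    # An LREM against the processing list in real Redis.
    processing.remove(item)

def recover():
    """After a consumer crash, requeue anything left un-acked."""
    while processing:
        lpush(processing.pop())

lpush("msg-1")
lpush("msg-2")

assert rpoplpush() == "msg-1"   # consumer takes the oldest message
recover()                       # consumer crashed before ack(): requeued
assert rpoplpush() == "msg-2"
assert rpoplpush() == "msg-1"   # nothing was lost, only redelivered
ack("msg-2")
ack("msg-1")                    # both processed and acknowledged
```

The point stands, though: someone still has to run the Redis cluster,
write the recovery sweep, and monitor it - which is exactly the ops burden
SQS users are paying to avoid.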

 
 
 There is also the possibility of using the sharding capabilities of the
 
  underlying storage. But the pattern of use will determine how effective
  that can be.
 
  So for example, on the ordering question, if order is defined by a
  single sequence number held in the database and atomically incremented
  for every message published, that is not likely to be something where
  the databases sharding is going to help in scaling the number of
  concurrent publications.
 
  Though sharding would allow scaling the total number messages on the
  queue (by distributing them over multiple shards), the total ordering of
  those messages reduces it's effectiveness in scaling the number of
  concurrent getters (e.g. the concurrent subscribers in pub-sub) since
  they will all be getting the messages in exactly the same order.
 
  Strict ordering impacts the competing consumers case also (and is in my
  opinion of limited value as a guarantee anyway). At any given time, the
  head of the queue is in one shard, and all concurrent claim 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-22 Thread Joe Gordon
On Mon, Sep 22, 2014 at 8:13 PM, Clint Byrum cl...@fewbar.com wrote:

 Excerpts from Joe Gordon's message of 2014-09-22 19:04:03 -0700:
   [...]
 This is pretty easily doable with gearman or even just using Redis
 directly. But it is still ops for end users. The AWS users I've talked to
 who use SQS do so because they like that they can use RDS, SQS, and ELB,
 and have only one type of thing to operate: their app.

  [...]

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Flavio Percoco
On 09/18/2014 09:25 PM, Gordon Sim wrote:
 On 09/18/2014 03:45 PM, Flavio Percoco wrote:
 On 09/18/2014 04:09 PM, Gordon Sim wrote:
 Is the replication synchronous or asynchronous with respect to client
 calls? E.g. will the response to a post of messages be returned only
 once the replication of those messages is confirmed? Likewise when
 deleting a message, is the response only returned when replicas of the
 message are deleted?

 It depends on the driver implementation and/or storage configuration.
 For example, in the mongodb driver, we use the default write concern
 called acknowledged. This means that as soon as the message gets to
 the master node (note it's not written on disk yet nor replicated) zaqar
 will receive a confirmation and then send the response back to the
 client.
 
 So in that mode it's unreliable. If there is a failure right after the
 response is sent, the message may be lost, but the client believes it has
 been confirmed and so will not resend.
 
 This is also configurable by the deployer by changing the
 default write concern in the mongodb uri using
 `?w=SOME_WRITE_CONCERN`[0].

 [0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w
 
 So you could change that to majority to get reliable publication
 (at-least-once).

Right, to help with the fight for a world with saner defaults, I think
it'd be better to use majority as the default write concern in the
mongodb driver.
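
For illustration only: the write concern can be requested right in the
connection URI. The snippet below merely assembles such a URI - the host
names, replica set name, and database name are invented, and a real
deployment may need different options:

```python
from urllib.parse import urlencode

# Hypothetical replica set; every host and name here is made up.
hosts = "mongo1.example:27017,mongo2.example:27017,mongo3.example:27017"
options = {
    "replicaSet": "zaqar",  # replica set name (assumption)
    "w": "majority",        # ack only after a majority of members have the write
    "journal": "true",      # additionally wait for the on-disk journal
}
uri = "mongodb://%s/zaqar?%s" % (hosts, urlencode(options))
print(uri)
```

With `w=majority` the scenario Gordon describes - a message acknowledged
by the master alone and then lost in a failover - can no longer happen
silently, at the cost of higher write latency.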


 What do you mean by 'streaming messages'?

 I'm sorry, that went out wrong. I had the browsability term in my head
 and went with something even worse. By streaming messages I meant
 polling messages without claiming them. In other words, at-least-once is
 guaranteed by default, whereas once-and-only-once is guaranteed only if
 claims are used.
 
 I don't see that the claim mechanism brings any stronger guarantee; it
 just offers a competing-consumer behaviour where browsing is
 non-competing (non-destructive). In both cases you require the client to
 be able to remember which messages it had processed in order to ensure
 exactly once. The claim reduces the scope of any doubt, but the client
 still needs to be able to determine whether it has already processed any
 message in the claim.

The client needs to remember which messages it has processed only if it
doesn't delete (ack) them after processing them. It's true the client
could also fail after having processed a message, which means it won't be
able to ack it.

That said, being able to prevent other consumers from consuming a specific
message can bring a stronger guarantee, depending on how messages are
processed. I mean, claiming a message guarantees that, throughout the
duration of that claim, no other client will be able to consume the
claimed messages, which allows each message to be consumed only once.
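
To make those claim semantics concrete, here is a minimal in-memory
sketch - not Zaqar's actual implementation; the names and structure are
invented - of how a claim with a TTL keeps competing consumers from
seeing the same message:

```python
import time
import uuid

class Queue:
    """Tiny in-memory model of Zaqar-style claims (illustration only)."""

    def __init__(self):
        self.messages = {}  # msg_id -> body
        self.claims = {}    # msg_id -> (claim_id, expires_at)

    def post(self, body):
        msg_id = uuid.uuid4().hex
        self.messages[msg_id] = body
        return msg_id

    def claim(self, limit, ttl):
        """Claim up to `limit` currently unclaimed messages for `ttl` seconds."""
        claim_id = uuid.uuid4().hex
        now = time.monotonic()
        claimed = []
        for msg_id, body in self.messages.items():
            active = self.claims.get(msg_id)
            if active and active[1] > now:
                continue  # another consumer holds this message
            self.claims[msg_id] = (claim_id, now + ttl)
            claimed.append((msg_id, body))
            if len(claimed) == limit:
                break
        return claim_id, claimed

    def delete(self, msg_id):
        """Ack: remove the message so it can never be consumed again."""
        self.messages.pop(msg_id, None)
        self.claims.pop(msg_id, None)

q = Queue()
for i in range(4):
    q.post("job-%d" % i)

# Two competing consumers: while both claims are active they never overlap.
_, batch_a = q.claim(limit=2, ttl=30)
_, batch_b = q.claim(limit=2, ttl=30)
ids_a = {m for m, _ in batch_a}
ids_b = {m for m, _ in batch_b}
assert ids_a.isdisjoint(ids_b)
```

While a claim is active, a second consumer simply skips over the claimed
messages; if the first consumer dies without deleting them, the TTL
expires and the messages become claimable again - which is where the
at-least-once (rather than exactly-once) behaviour comes from.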

 
 [...]
 That marker is a sequence number of some kind that is used to provide
 ordering to queries? Is it generated by the database itself?

 It's a sequence number to provide ordering to queries, correct.
 Depending on the driver, it may be generated by Zaqar or the database.
 In mongodb's case it's generated by Zaqar[0].
 
 Zaqar increments a counter held within the database, am I reading that
 correctly? So mongodb is responsible for the ordering and atomicity of
 multiple concurrent requests for a marker?

Yes.

The message posting code is here[0] in case you'd like to take a look at
the logic used for mongodb (any feedback is obviously very welcome):

[0]
https://github.com/openstack/zaqar/blob/master/zaqar/queues/storage/mongodb/messages.py#L494
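
As a side note, the contention point being discussed can be sketched
without any database at all - the essence is that every publisher must
pass through one atomic increment per queue. The toy model below (purely
illustrative, not the driver's code) uses a lock where the mongodb driver
uses an atomic update:

```python
import threading

class MarkerCounter:
    """A per-queue monotonically increasing marker. In the mongodb driver
    this is an atomic increment on a counter document; here a lock plays
    that role (toy model, not the real code)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def next(self):
        with self._lock:  # the single serialization point for the queue
            self._value += 1
            return self._value

counter = MarkerCounter()
markers = []
markers_lock = threading.Lock()

def publish(n):
    # Each "publisher" must obtain a marker for every message it posts.
    for _ in range(n):
        marker = counter.next()
        with markers_lock:
            markers.append(marker)

threads = [threading.Thread(target=publish, args=(100,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# 8 concurrent publishers, yet the markers come out unique and dense.
assert sorted(markers) == list(range(1, 801))
```

The markers are exactly what the FIFO guarantee needs - and also why a
single queue cannot spread its publishers across shards.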

Cheers,
Flavio

-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Flavio Percoco
On 09/18/2014 07:19 PM, Joe Gordon wrote:
 
 
 On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco fla...@redhat.com
 mailto:fla...@redhat.com wrote:
 
 On 09/18/2014 04:09 PM, Gordon Sim wrote:
  On 09/18/2014 12:31 PM, Flavio Percoco wrote:
  On 09/17/2014 10:36 PM, Joe Gordon wrote:
  My understanding of Zaqar is that it's like SQS. SQS uses distributed
  queues, which have a few unusual properties [0]:
 
 
   Message Order
 
  Amazon SQS makes a best effort to preserve order in messages, but due 
 to
  the distributed nature of the queue, we cannot guarantee you will
  receive messages in the exact order you sent them. If your system
  requires that order be preserved, we recommend you place sequencing
  information in each message so you can reorder the messages upon
  receipt.
 
 
  Zaqar guarantees FIFO. To be more precise, it does that relying on the
  storage backend ability to do so as well. Depending on the storage 
 used,
  guaranteeing FIFO may have some performance penalties.
 
  Would it be accurate to say that at present Zaqar does not use
  distributed queues, but holds all queue data in a storage mechanism of
  some form which may internally distribute that data among servers but
  provides Zaqar with a consistent data model of some form?
 
 I think this is accurate. The queue's distribution depends on the
 storage ability to do so and deployers will be able to choose what
 storage works best for them based on this as well. I'm not sure how
 useful this separation is from a user perspective but I do see the
 relevance when it comes to implementation details and deployments.
 
 
  [...]
  As of now, Zaqar fully relies on the storage replication/clustering
  capabilities to provide data consistency, availability and fault
  tolerance.
 
  Is the replication synchronous or asynchronous with respect to client
  calls? E.g. will the response to a post of messages be returned only
  once the replication of those messages is confirmed? Likewise when
  deleting a message, is the response only returned when replicas of the
  message are deleted?
 
 It depends on the driver implementation and/or storage configuration.
 For example, in the mongodb driver, we use the default write concern
 called acknowledged. This means that as soon as the message gets to
 the master node (note it's not written on disk yet nor replicated) zaqar
 will receive a confirmation and then send the response back to the
 client. This is also configurable by the deployer by changing the
 default write concern in the mongodb uri using
 `?w=SOME_WRITE_CONCERN`[0].
 
 
 This means that by default Zaqar cannot guarantee a message will be
 delivered at all. A message can be acknowledged and then the 'master
 node' crashes and the message is lost. Zaqar's ability to guarantee
 delivery is limited by the reliability of a single node, while something
 like swift can only lose a piece of data if 3 machines crash at the
 same time.

Correct, as mentioned in my reply to Gordon, I also think `majority` is
a saner default for the write concern in this case.

I'm glad you mentioned Swift. A while back we discussed having a storage
driver for it. I thought we had a blueprint for that, but we don't. Last
time we discussed it, Swift seemed to cover everything we
needed, IIRC. Anyway, just a thought.

Flavio

 [0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w
 
 
  However, as far as consuming messages is concerned, it can
  guarantee once-and-only-once and/or at-least-once delivery depending on
  the message pattern used to consume messages. Using pop or claims
   guarantees the former, whereas streaming messages out of Zaqar
   guarantees the latter.
 
  From what I can see, pop provides unreliable delivery (i.e. its similar
  to no-ack). If the delete call using pop fails while sending back the
  response, the messages are removed but didn't get to the client.
 
 Correct, pop works like no-ack. If you want to have pop+ack, it is
 possible to claim just 1 message and then delete it.
 
 
  What do you mean by 'streaming messages'?
 
 I'm sorry, that went out wrong. I had the browsability term in my head
 and went with something even worse. By streaming messages I meant
 polling messages without claiming them. In other words, at-least-once is
  guaranteed by default, whereas once-and-only-once is guaranteed only if
  claims are used.
 
 
  [...]
  Based on our short conversation on IRC last night, I understand you're
  concerned that FIFO may result in performance issues. That's a valid
  concern and I think the right answer is that it depends on the storage.
  If the storage has a built-in FIFO guarantee then there's nothing 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Flavio Percoco
On 09/18/2014 07:16 PM, Devananda van der Veen wrote:
 On Thu, Sep 18, 2014 at 8:54 AM, Devananda van der Veen
 devananda@gmail.com wrote:
 On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco fla...@redhat.com wrote:
 On 09/18/2014 04:09 PM, Gordon Sim wrote:
 On 09/18/2014 12:31 PM, Flavio Percoco wrote:
 Zaqar guarantees FIFO. To be more precise, it does that relying on the
 storage backend ability to do so as well. Depending on the storage used,
 guaranteeing FIFO may have some performance penalties.

 Would it be accurate to say that at present Zaqar does not use
 distributed queues, but holds all queue data in a storage mechanism of
 some form which may internally distribute that data among servers but
 provides Zaqar with a consistent data model of some form?

 I think this is accurate. The queue's distribution depends on the
 storage ability to do so and deployers will be able to choose what
 storage works best for them based on this as well. I'm not sure how
 useful this separation is from a user perspective but I do see the
 relevance when it comes to implementation details and deployments.

 Guaranteeing FIFO and not using a distributed queue architecture
 *above* the storage backend are both scale-limiting design choices.
 That Zaqar's scalability depends on the storage back end is not a
 desirable thing in a cloud-scale messaging system in my opinion,
 because this will prevent use at scales which can not be accommodated
 by a single storage back end.

 
 It may be worth qualifying this a bit more.
 
 While no single instance of any storage back-end is infinitely
 scalable, some of them are really darn fast. That may be enough for
 the majority of use cases. It's not outside the realm of possibility
 that the inflection point [0] where these design choices result in
  performance limitations is at the very high end of scale-out, e.g.
 public cloud providers who have the resources to invest further in
 improving zaqar.
 
 As an example of what I mean, let me refer to the 99th percentile
 response time graphs in Kurt's benchmarks [1]... increasing the number
 of clients with write-heavy workloads was enough to drive latency from
 10ms to 200 ms with a single service. That latency significantly
 improved as storage and application instances were added, which is
 good, and what I would expect. These benchmarks do not (and were not
 intended to) show the maximal performance of a public-cloud-scale
 deployment -- but they do show that performance under different
 workloads improves as additional services are started.
 
 While I have no basis for comparing the configuration of the
 deployment he used in those tests to what a public cloud operator
 might choose to deploy, and presumably such an operator would put
 significant work into tuning storage and running more instances of
 each service and thus shift that inflection point to the right, my
 point is that, by depending on a single storage instance, Zaqar has
 pushed the *ability* to scale out down into the storage
 implementation. Given my experience scaling SQL and NoSQL data stores
 (in my past life, before working on OpenStack) I have a knee-jerk
 reaction against believing that this approach will result in a
 public-cloud-scale messaging system.

Thanks for the more detailed explanation of your concern, I appreciate it.

Let me start by saying that I agree that pushing message distribution
down to the storage layer may result in scaling limitations for some
scenarios.

That said, Zaqar already has the concept of pools. Pools allow operators
to add more storage clusters to Zaqar and balance the load between them.
It is possible to distribute the data across these pools on a per-queue
basis. While the messages of a queue are not distributed across multiple
*pools* - all the messages for queue X will live in a single pool - I do
believe this per-queue distribution helps to address the above concern
and pushes that limitation farther away.

Let me explain how pools currently work a bit better. As of now, each
pool has a URI pointing to a storage cluster and a weight. This weight
is used to balance load between pools every time a queue is created.
Once it's created, Zaqar keeps the information of the queue-pool
association in a catalogue that is used to know where the queue lives.
We'll likely add new algorithms to have a better and more even
distribution of queues across the registered pools.
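
A rough sketch of that flow, with the pool names, weights, and structure
all invented for illustration:

```python
import random

class Catalogue:
    """Sketch of weighted pool selection (names, weights, and structure
    are invented; this is not Zaqar's catalogue code)."""

    def __init__(self, pools):
        self.pools = pools        # pool name -> weight
        self.assignments = {}     # queue name -> pool name

    def register(self, queue):
        # Pick a pool by weight on first sight, then pin the queue to it.
        if queue not in self.assignments:
            names = list(self.pools)
            weights = [self.pools[name] for name in names]
            self.assignments[queue] = random.choices(names, weights=weights)[0]
        return self.assignments[queue]

catalogue = Catalogue({"mongo-a": 3, "mongo-b": 1})
pool = catalogue.register("queue-x")
assert catalogue.register("queue-x") == pool   # the association is stable
assert pool in ("mongo-a", "mongo-b")
```

Once a queue is registered, every later lookup returns the same pool,
which is why all the messages for queue X stay together.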

I'm sure message distribution could be implemented in Zaqar but I'm not
convinced we should do so right now. The reason being it would bring in
a whole lot of new issues to the project that I think we can and should
avoid for now.

Thanks for the feedback, Devananda.
Flavio


 
 -Devananda
 
 [0] http://en.wikipedia.org/wiki/Inflection_point -- in this context,
 I mean the point on the graph of throughput vs latency where the
 derivative goes from near-zero (linear growth) to non-zero
 (exponential growth)
 
 [1] 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Thierry Carrez
Joe Gordon wrote:
 On Thu, Sep 18, 2014 at 9:02 AM, Devananda van der Veen
 devananda@gmail.com mailto:devananda@gmail.com wrote:
 - guaranteed message order
 - not distributing work across a configurable number of back ends
 
 These are scale-limiting design choices which are reflected in the
 API's characteristics.
 
 I agree with Clint and Devananda

The underlying question being... can Zaqar evolve to ultimately reach
the massive scale use case Joe, Clint and Devananda want it to reach, or
are those design choices so deeply rooted in the code and architecture
that Zaqar won't naturally mutate to support that use case.

The Zaqar team has shown great willingness to adapt in order to support
more use cases, but I guess there may be architectural design choices
that would just mean starting over?

-- 
Thierry Carrez (ttx)



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Flavio Percoco
On 09/19/2014 11:00 AM, Thierry Carrez wrote:
 Joe Gordon wrote:
 On Thu, Sep 18, 2014 at 9:02 AM, Devananda van der Veen
 devananda@gmail.com mailto:devananda@gmail.com wrote:
 - guaranteed message order
 - not distributing work across a configurable number of back ends

 These are scale-limiting design choices which are reflected in the
 API's characteristics.

 I agree with Clint and Devananda
 
 The underlying question being... can Zaqar evolve to ultimately reach
 the massive scale use case Joe, Clint and Devananda want it to reach, or
 are those design choices so deeply rooted in the code and architecture
 that Zaqar won't naturally mutate to support that use case.
 
 The Zaqar team has shown great willingness to adapt in order to support
 more use cases, but I guess there may be architectural design choices
 that would just mean starting over ?


Zaqar has scaling capabilities that go beyond depending on a single
storage cluster. As I mentioned in my previous email, the support for
storage pools allows the operator to scale out the storage layer and
balance the load across them.

There's always space for improvement and I definitely wouldn't go that
far to say there's a need to start over.

-- 
@flaper87
Flavio Percoco



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Eoghan Glynn


 Hi All,
 
 My understanding of Zaqar is that it's like SQS. SQS uses distributed queues,
 which have a few unusual properties [0]:
 Message Order
 
 
 Amazon SQS makes a best effort to preserve order in messages, but due to the
 distributed nature of the queue, we cannot guarantee you will receive
 messages in the exact order you sent them. If your system requires that
 order be preserved, we recommend you place sequencing information in each
 message so you can reorder the messages upon receipt.
 At-Least-Once Delivery
 
 
 Amazon SQS stores copies of your messages on multiple servers for redundancy
 and high availability. On rare occasions, one of the servers storing a copy
 of a message might be unavailable when you receive or delete the message. If
 that occurs, the copy of the message will not be deleted on that unavailable
 server, and you might get that message copy again when you receive messages.
 Because of this, you must design your application to be idempotent (i.e., it
 must not be adversely affected if it processes the same message more than
 once).
 Message Sample
 
 
 The behavior of retrieving messages from the queue depends whether you are
 using short (standard) polling, the default behavior, or long polling. For
 more information about long polling, see Amazon SQS Long Polling .
 
 With short polling, when you retrieve messages from the queue, Amazon SQS
 samples a subset of the servers (based on a weighted random distribution)
 and returns messages from just those servers. This means that a particular
 receive request might not return all your messages. Or, if you have a small
 number of messages in your queue (less than 1000), it means a particular
 request might not return any of your messages, whereas a subsequent request
 will. If you keep retrieving from your queues, Amazon SQS will sample all of
 the servers, and you will receive all of your messages.
 
 The following figure shows short polling behavior of messages being returned
 after one of your system components makes a receive request. Amazon SQS
 samples several of the servers (in gray) and returns the messages from those
 servers (Message A, C, D, and B). Message E is not returned to this
 particular request, but it would be returned to a subsequent request.
 
 Presumably SQS has these properties because they make the system scalable. If
 so, does Zaqar have the same properties (not just making these same
 guarantees in the API, but actually having these properties in the
 backends)? And if not, why? I looked on the wiki [1] for information on
 this, but couldn't find anything.

The premise of this thread is flawed, I think.

It seems to be predicated on a direct quote from the public
documentation of a closed-source system justifying some
assumptions about the internal architecture and design goals
of that closed-source system.

It then proceeds to hold zaqar to account for not making
the same choices as that closed-source system.

This puts the zaqar folks in a no-win situation, as it's hard
to refute such arguments when they have no visibility over
the innards of that closed-source system.

Sure, the assumption may well be correct that the designers
of SQS made the choice to expose applications to out-of-order
messages because this was the only practical way of achieving their
scalability goals.

But since the code isn't on github and the design discussions
aren't publicly archived, we have no way of validating that.

Would it be more reasonable to compare against a cloud-scale
messaging system that folks may have more direct knowledge
of?

For example, is HP Cloud Messaging[1] rolled out in full
production by now?

Is it still cloning the original Marconi API, or has it kept
up with the evolution of the API? Has the nature of this API
been seen as the root cause of any scalability issues?

Cheers,
Eoghan

[1] 
http://www.openstack.org/blog/2013/05/an-introductory-tour-of-openstack-cloud-messaging-as-a-service

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Clint Byrum
Excerpts from Eoghan Glynn's message of 2014-09-19 04:23:55 -0700:
 
  Hi All,
  
  My understanding of Zaqar is that it's like SQS. SQS uses distributed 
  queues,
  which have a few unusual properties [0]:
  [snip: SQS documentation quote, as above]
  Presumably SQS has these properties because it makes the system scalable, if
  so does Zaqar have the same properties (not just making these same
  guarantees in the API, but actually having these properties in the
  backends)? And if not, why? I looked on the wiki [1] for information on
  this, but couldn't find anything.
 
 The premise of this thread is flawed I think.
 
 It seems to be predicated on a direct quote from the public
 documentation of a closed-source system justifying some
 assumptions about the internal architecture and design goals
 of that closed-source system.
 
 It then proceeds to hold zaqar to account for not making
 the same choices as that closed-source system.
 

I don't think we want Zaqar to make the same choices. OpenStack's
constraints are different from AWS's.

I want to highlight that our expectations are for the API to support
deploying at scale. SQS _clearly_ started from a point of extreme scale
for the deployer, and thus is a good example of an API that is limited
enough to scale like that.

What has always been the concern is that Zaqar would make it extremely
complicated and/or costly to get to that level.

 This puts the zaqar folks in a no-win situation, as it's hard
 to refute such arguments when they have no visibility over
 the innards of that closed-source system.
 

Nobody expects to know the insides. But the outsides, the parts that
are public, are brilliant because they are _limited_, and yet they still
support many many use cases.

 Sure, the assumption may well be correct that the designers
 of SQS made the choice to expose applications to out-of-order
 messages as this was the only practical way of achieving their
 scalability goals.
 
 But since the code isn't on github and the design discussions
 aren't publicly archived, we have no way of validating that.
 

We don't need to see the code. Not requiring ordering makes the whole
problem easier to reason about. You don't need explicit pools anymore.
Just throw messages wherever, and make sure that everywhere gets
polled on a reasonable enough frequency. This is the kind of thing
operations loves. No global state means no split brain to avoid, no
synchronization. Does it solve all problems? no. But it solves a single
one, REALLY well.
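The unordered design Clint describes can be sketched in a few lines: writes go to any pool at random, reads poll every pool, and no global ordering state is kept, so pools never need to coordinate. A hypothetical in-memory model, not Zaqar code.

```python
import random

# Minimal sketch of an unordered, pooled queue: "just throw messages
# wherever, and make sure that everywhere gets polled". No cross-pool
# state means no synchronization and no split brain to avoid.

class UnorderedQueue:
    def __init__(self, n_pools, seed=0):
        self._pools = [[] for _ in range(n_pools)]
        self._rng = random.Random(seed)

    def post(self, message):
        # Write to any pool at random; no ordering coordination needed.
        self._rng.choice(self._pools).append(message)

    def receive_all(self):
        # Poll every pool; order across pools is not guaranteed.
        got = []
        for pool in self._pools:
            got.extend(pool)
            pool.clear()
        return got
```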

Frankly I don't understand why there would be this argument to hold on
to so many use cases and so much API surface area. Zaqar's life gets
easier without ordering guarantees or message browsing. And it still
retains _many_ of its potential users.

___

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Gordon Sim

On 09/19/2014 08:53 AM, Flavio Percoco wrote:

On 09/18/2014 09:25 PM, Gordon Sim wrote:

I don't see that the claim mechanism brings any stronger guarantee; it
just offers a competing-consumer behaviour where browsing is
non-competing (non-destructive). In both cases you require the client to
be able to remember which messages it has processed in order to ensure
exactly-once. The claim reduces the scope of any doubt, but the client
still needs to be able to determine whether it has already processed any
message in the claim.


The client needs to remember which messages it has processed if it
doesn't delete (ack) them after processing. It's true the
client could also fail after having processed the message, which means it
won't be able to ack it.

That said, being able to prevent other consumers from consuming a specific
message can bring a stronger guarantee, depending on how messages are
processed. I mean, claiming a message guarantees that throughout the
duration of that claim, no other client will be able to consume the
claimed messages, which means they can be consumed only once.


I think 'exactly once' means different things when used for competing 
consumers and non-competing consumers. For the former it means the 
message is processed by only one consumer, and only once. For the latter 
it means every consumer processes the message exactly once.


Using a claim provides the competing consumer behaviour. To me this is a 
'different' guarantee from non-competing consumer rather than a 
'stronger' one, and it is orthogonal to the reliability of the delivery.
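The distinction between the two consumption styles can be sketched as follows. Browsing is non-destructive (every consumer sees every message), while a claim hides messages from other claimants for its duration. This is a hypothetical in-memory model of the semantics being discussed, not the actual Zaqar API.

```python
# Sketch contrasting non-competing (browse) and competing (claim)
# consumption. Claimed messages stay in the queue but become invisible
# to further claims, which is what gives the competing-consumer
# "only one consumer processes it" behaviour.

class Queue:
    def __init__(self):
        self._messages = []   # each entry: [id, body, claimed]

    def post(self, mid, body):
        self._messages.append([mid, body, False])

    def browse(self):
        """Non-competing: every consumer sees every message."""
        return [(m[0], m[1]) for m in self._messages]

    def claim(self, limit):
        """Competing: claimed messages are invisible to other claims."""
        claimed = []
        for m in self._messages:
            if not m[2] and len(claimed) < limit:
                m[2] = True
                claimed.append((m[0], m[1]))
        return claimed
```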


However, we differ only on the terminology used; I believe we are on the same 
page as far as the semantics go :-)



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Vipul Sabhaya
On Fri, Sep 19, 2014 at 4:23 AM, Eoghan Glynn egl...@redhat.com wrote:



  Hi All,
 
  My understanding of Zaqar is that it's like SQS. SQS uses distributed
 queues,
  which have a few unusual properties [0]:
  [snip: SQS documentation quote, as above]
  Presumably SQS has these properties because it makes the system
 scalable, if
  so does Zaqar have the same properties (not just making these same
  guarantees in the API, but actually having these properties in the
  backends)? And if not, why? I looked on the wiki [1] for information on
  this, but couldn't find anything.

 The premise of this thread is flawed I think.

 It seems to be predicated on a direct quote from the public
 documentation of a closed-source system justifying some
 assumptions about the internal architecture and design goals
 of that closed-source system.

 It then proceeds to hold zaqar to account for not making
 the same choices as that closed-source system.

 This puts the zaqar folks in a no-win situation, as it's hard
 to refute such arguments when they have no visibility over
 the innards of that closed-source system.

 Sure, the assumption may well be correct that the designers
 of SQS made the choice to expose applications to out-of-order
 messages as this was the only practical way of achieving their
 scalability goals.

 But since the code isn't on github and the design discussions
 aren't publicly archived, we have no way of validating that.

 Would it be more reasonable to compare against a cloud-scale
 messaging system that folks may have more direct knowledge
 of?

 For example, is HP Cloud Messaging[1] rolled out in full
 production by now?


Unfortunately the HP Cloud Messaging service was decommissioned.


 Is it still cloning the original Marconi API, or has it kept
 up with the evolution of the API? Has the nature of this API
 been seen as the root cause of any scalability issues?


We created a RabbitMQ-backed implementation that aimed to be API compatible
with Marconi.  This proved difficult given some of the API issues that have
been discussed in this very thread.  Our implementation could never be fully
API compatible with Marconi (there really isn't an easy way to map AMQP to
HTTP without losing serious functionality).

We also worked closely with the Marconi team, trying to get upstream to
support AMQP — the Marconi team also came to the same conclusion that their
API was not a good fit for such a backend.

Now — we are looking at options.  One that intrigues us has also been
suggested on these threads, specifically building a ‘managed messaging
service’ that could provision various messaging technologies (rabbit,
kafka, etc), and at the end of the day hand off the protocol native to the
messaging technology to the end user.



 Cheers,
 Eoghan

 [1]
 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-19 Thread Zane Bitter

On 18/09/14 10:55, Flavio Percoco wrote:

On 09/18/2014 04:24 PM, Clint Byrum wrote:

Great job highlighting what our friends over at Amazon are doing.

It's clear from these snippets, and a few other pieces of documentation
for SQS I've read, that the Amazon team approached SQS from a _massive_
scaling perspective. I think what may be forcing a lot of this frustration
with Zaqar is that it was designed with a much smaller scale in mind.

I think as long as that is the case, the design will remain in question.
I'd be comfortable saying that the use cases I've been thinking about
are entirely fine with the limitations SQS has.


I think these are pretty strong comments with not enough arguments to
defend them.


I actually more or less agree with Clint here. As Joe noted (and *thank 
you* Joe for starting this thread - the first one to compare Zaqar to 
something relevant!), SQS offers very, very limited guarantees, and it's 
clear that the reason for that is to make it massively, massively 
scalable in the way that e.g. S3 is scalable while also remaining 
comparably durable (S3 is supposedly designed for 11 nines, BTW).


Zaqar, meanwhile, seems to be promising the world in terms of 
guarantees. (And then taking it away in the fine print, where it says 
that the operator can disregard many of them, potentially without the 
user's knowledge.)


On the other hand, IIUC Zaqar does in fact have a sharding feature 
(Pools) which is its answer to the massive scaling question. I don't 
know enough details to comment further except to say that it evidently 
has been carefully thought about at some level, and it's really 
frustrating for the Zaqar folks when people just assume that it hasn't 
without doing any research. On the face of it sharding is a decent 
solution for this problem. Maybe we need to dig into the details and 
make sure folks are satisfied that there are no hidden dragons.



Saying that Zaqar was designed with a smaller scale in mind without
actually saying why you think so is not fair, besides not being true. So
please, do share why you think Zaqar was not designed for big scales and
provide comments that will help the project to grow and improve.

- Is it because of the storage technologies that have been chosen?
- Is it because of the API?
- Is it because of the programming language/framework?


I didn't read Clint and Devananda's comments as an attack on any of 
these things (although I agree that there have been far too many such 
attacks over the last 12 months from people who didn't bother to do 
their homework first). They're looking at Zaqar from first principles 
and finding that it promises too much, raising the concern the team may 
in future reach a point where they are unable to meet the needs of 
future users (perhaps for scaling reasons) without breaking existing 
users who have come to rely on those promises.



So far, we've just discussed the API semantics and not zaqar's
scalability, which makes your comments even more surprising.


What guarantees you offer can determine pretty much all of the design 
tradeoffs (e.g. latency vs. durability) that you have to make. Some of 
those (e.g. random access to messages) are baked in to the API, but 
others are not. It's actually a real concern to me to see elsewhere in 
this thread that you are punting to operators on many of the latter.


IMO the Zaqar team needs to articulate an opinionated vision of just 
what Zaqar actually is, and why it offers value. And by 'value' here I 
mean there should be $ signs attached.


For example, it makes no sense to me that Zaqar should ever be able to 
run in a mode that doesn't guarantee delivery of messages. There are a 
million and one easy, cheap ways to set up a system that _might_ deliver 
your message. One server running a message broker is sufficient. But if 
you want reliable delivery, then you'll probably need at least 3 (for 
still pretty low values of reliable). I did some back-of-the-envelope 
math with the AWS pricing and _conservatively_ for any application 
receiving 100k messages per hour (~30 per second) it's cheaper to use 
SQS than to spin up those servers yourself.
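Zane's back-of-the-envelope comparison can be reproduced with placeholder numbers. All prices below are hypothetical stand-ins, not actual AWS rates: assume some per-million-request price for the hosted queue, a monthly cost per small server, and three servers for even modestly reliable self-hosted delivery.

```python
# Hypothetical figures for illustration only -- not real AWS pricing.
PRICE_PER_MILLION_REQUESTS = 0.50   # assumed hosted-queue rate, $/1M requests
SERVER_MONTHLY_COST = 60.0          # assumed cost of one small server, $/month
SERVERS_FOR_RELIABILITY = 3         # minimum for "pretty low values of reliable"

def hosted_monthly_cost(msgs_per_hour, requests_per_msg=3):
    # send + receive + delete ~= 3 API requests per message
    requests = msgs_per_hour * 24 * 30 * requests_per_msg
    return requests / 1_000_000 * PRICE_PER_MILLION_REQUESTS

def self_run_monthly_cost():
    return SERVERS_FOR_RELIABILITY * SERVER_MONTHLY_COST
```

With these assumed numbers, 100k messages/hour is about 216M requests/month, i.e. roughly $108 hosted versus $180 for three self-run servers, which is the shape of the argument above even if the exact figures differ.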


In other words, a service that *guarantees* delivery of messages *has* 
to be run by the cloud operator because for the overwhelming majority of 
applications, the user cannot do so economically.


(I'm assuming here that AWS's _relative_ pricing accurately reflects 
their _relative_ cost basis, which is almost certainly not strictly 
true, but I expect a close enough approximation for these purposes.)


What I would like to hear in this thread is:

Zaqar is We-Never-Ever-Ever-Ever-Lose-Your-Message as a Service 
(WNEEELYMaaS), and it has to be in OpenStack because only the cloud 
operator can provide that cost-effectively.


What I'm hearing instead is:

- We'll probably deliver your message.
- We can guarantee that we'll deliver your message, but only on clouds 
where the operator has chosen to configure Mongo with 

Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Flavio Percoco
On 09/17/2014 10:36 PM, Joe Gordon wrote:
 Hi All,
 
 My understanding of Zaqar is that it's like SQS. SQS uses distributed
 queues, which have a few unusual properties [0]:
 
 
 Message Order
 
 Amazon SQS makes a best effort to preserve order in messages, but due to
 the distributed nature of the queue, we cannot guarantee you will
 receive messages in the exact order you sent them. If your system
 requires that order be preserved, we recommend you place sequencing
 information in each message so you can reorder the messages upon receipt.
 

Zaqar guarantees FIFO. To be more precise, it does so by relying on the
storage backend's ability to do the same. Depending on the storage used,
guaranteeing FIFO may have some performance penalties.

We have discussed adding a way to enable/disable some features - like
FIFO - on a per-queue basis, and now that we have flavors, we may work on
allowing things like enabling/disabling FIFO on a per-flavor basis.


 At-Least-Once Delivery
 
 Amazon SQS stores copies of your messages on multiple servers for
 redundancy and high availability. On rare occasions, one of the servers
 storing a copy of a message might be unavailable when you receive or
 delete the message. If that occurs, the copy of the message will not be
 deleted on that unavailable server, and you might get that message copy
 again when you receive messages. Because of this, you must design your
 application to be idempotent (i.e., it must not be adversely affected if
 it processes the same message more than once).

As of now, Zaqar fully relies on the storage replication/clustering
capabilities to provide data consistency, availability and fault
tolerance. However, as far as consuming messages is concerned, it can
guarantee once-and-only-once and/or at-least-once delivery depending on
the message pattern used to consume messages. Using pop or claims
guarantees the former, whereas streaming messages out of Zaqar guarantees
the latter.


 Message Sample
 
 The behavior of retrieving messages from the queue depends on whether you
 are using short (standard) polling, the default behavior, or long
 polling. For more information about long polling, see Amazon SQS Long
 Polling
 http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html.
 
 With short polling, when you retrieve messages from the queue, Amazon
 SQS samples a subset of the servers (based on a weighted random
 distribution) and returns messages from just those servers. This means
 that a particular receive request might not return all your messages.
 Or, if you have a small number of messages in your queue (less than
 1000), it means a particular request might not return any of your
 messages, whereas a subsequent request will. If you keep retrieving from
 your queues, Amazon SQS will sample all of the servers, and you will
 receive all of your messages.
 
 The following figure shows short polling behavior of messages being
 returned after one of your system components makes a receive request.
 Amazon SQS samples several of the servers (in gray) and returns the
 messages from those servers (Message A, C, D, and B). Message E is not
 returned to this particular request, but it would be returned to a
 subsequent request.

Image here:
http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/images/ArchOverview_Receive.png

Currently the best and only way to pull messages from Zaqar is by using
short polling. We'll work on long-polling and websocket support.
Requests are guaranteed to return available messages if there are any.

 Presumably SQS has these properties because it makes the
 system scalable, if so does Zaqar have the same properties (not just
 making these same guarantees in the API, but actually having these
 properties in the backends)? And if not, why?

Based on our short conversation on IRC last night, I understand you're
concerned that FIFO may result in performance issues. That's a valid
concern and I think the right answer is that it depends on the storage.
If the storage has a built-in FIFO guarantee then there's nothing Zaqar
needs to do there. On the other hand, if the storage does not have
built-in support for FIFO, Zaqar will cover it in the driver
implementation. In the mongodb driver, each message has a marker that is
used to guarantee FIFO.
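The marker idea Flavio describes can be sketched as a per-queue monotonic counter: each posted message is stamped with the next value, and reads sort (or resume) by marker to give FIFO. In the real mongodb driver the marker lives in the database; this in-memory version is only illustrative.

```python
import itertools

# Illustrative per-queue marker store: a monotonic counter per queue
# gives each message a position, and listing sorts by that position,
# yielding FIFO even if the underlying store has no native ordering.

class MarkerStore:
    def __init__(self):
        self._counters = {}
        self._messages = []   # (queue, marker, body)

    def post(self, queue, body):
        counter = self._counters.setdefault(queue, itertools.count(1))
        marker = next(counter)
        self._messages.append((queue, marker, body))
        return marker

    def list(self, queue, after_marker=0):
        """Return messages in FIFO order, resuming after a known marker."""
        return sorted(
            (m for m in self._messages
             if m[0] == queue and m[1] > after_marker),
            key=lambda m: m[1],
        )
```

The `after_marker` parameter also shows why markers double as pagination cursors: a client that remembers the last marker it saw can resume in order.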


 I looked on the wiki [1]
 for information on this, but couldn't find anything.
 


We have this[0] section in the FAQ but it definitely doesn't answer your
questions. I'm sure we had this documented way better but I can't find
the link. :(


[0]
https://wiki.openstack.org/wiki/Zaqar/Frequently_asked_questions#How_does_Zaqar_compare_to_AWS_.28SQS.2FSNS.29.3F

Hope the above helps,
Flavio

-- 
@flaper87
Flavio Percoco

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Gordon Sim

On 09/18/2014 12:31 PM, Flavio Percoco wrote:

On 09/17/2014 10:36 PM, Joe Gordon wrote:

My understanding of Zaqar is that it's like SQS. SQS uses distributed
queues, which have a few unusual properties [0]:


 Message Order

Amazon SQS makes a best effort to preserve order in messages, but due to
the distributed nature of the queue, we cannot guarantee you will
receive messages in the exact order you sent them. If your system
requires that order be preserved, we recommend you place sequencing
information in each message so you can reorder the messages upon receipt.



Zaqar guarantees FIFO. To be more precise, it does that relying on the
storage backend ability to do so as well. Depending on the storage used,
guaranteeing FIFO may have some performance penalties.


Would it be accurate to say that at present Zaqar does not use 
distributed queues, but holds all queue data in a storage mechanism of 
some form which may internally distribute that data among servers but 
provides Zaqar with a consistent data model of some form?


[...]

As of now, Zaqar fully relies on the storage replication/clustering
capabilities to provide data consistency, availability and fault
tolerance.


Is the replication synchronous or asynchronous with respect to client 
calls? E.g. will the response to a post of messages be returned only 
once the replication of those messages is confirmed? Likewise when 
deleting a message, is the response only returned when replicas of the 
message are deleted?



However, as far as consuming messages is concerned, it can
guarantee once-and-only-once and/or at-least-once delivery depending on
the message pattern used to consume messages. Using pop or claims
guarantees the former whereas streaming messages out of Zaqar guarantees
the latter.


 From what I can see, pop provides unreliable delivery (i.e. it's similar 
to no-ack). If the delete call using pop fails while sending back the 
response, the messages are removed but didn't get to the client.


What do you mean by 'streaming messages'?

[...]

Based on our short conversation on IRC last night, I understand you're
concerned that FIFO may result in performance issues. That's a valid
concern and I think the right answer is that it depends on the storage.
If the storage has a built-in FIFO guarantee then there's nothing Zaqar
needs to do there. On the other hand, if the storage does not have
built-in support for FIFO, Zaqar will cover it in the driver
implementation. In the mongodb driver, each message has a marker that is
used to guarantee FIFO.


That marker is a sequence number of some kind that is used to provide 
ordering to queries? Is it generated by the database itself?



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Clint Byrum
Great job highlighting what our friends over at Amazon are doing.

It's clear from these snippets, and a few other pieces of documentation
for SQS I've read, that the Amazon team approached SQS from a _massive_
scaling perspective. I think what may be forcing a lot of this frustration
with Zaqar is that it was designed with a much smaller scale in mind.

I think as long as that is the case, the design will remain in question.
I'd be comfortable saying that the use cases I've been thinking about
are entirely fine with the limitations SQS has.

Excerpts from Joe Gordon's message of 2014-09-17 13:36:18 -0700:
 Hi All,
 
 My understanding of Zaqar is that it's like SQS. SQS uses distributed
 queues, which have a few unusual properties [0]:
  [snip: SQS documentation quote, as above]
 Presumably SQS has these properties because it makes the system scalable,
 if so does Zaqar have the same properties (not just making these same
 guarantees in the API, but actually having these properties in the
 backends)? And if not, why? I looked on the wiki [1] for information on
 this, but couldn't find anything.
 
 
 
 
 
 [0]
 http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/DistributedQueues.html
 [1] https://wiki.openstack.org/wiki/Zaqar

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Flavio Percoco
On 09/18/2014 04:09 PM, Gordon Sim wrote:
 On 09/18/2014 12:31 PM, Flavio Percoco wrote:
 On 09/17/2014 10:36 PM, Joe Gordon wrote:
 My understanding of Zaqar is that it's like SQS. SQS uses distributed
 queues, which have a few unusual properties [0]:


  Message Order

 Amazon SQS makes a best effort to preserve order in messages, but due to
 the distributed nature of the queue, we cannot guarantee you will
 receive messages in the exact order you sent them. If your system
 requires that order be preserved, we recommend you place sequencing
 information in each message so you can reorder the messages upon
 receipt.


 Zaqar guarantees FIFO. To be more precise, it does that relying on the
 storage backend ability to do so as well. Depending on the storage used,
 guaranteeing FIFO may have some performance penalties.
 
 Would it be accurate to say that at present Zaqar does not use
 distributed queues, but holds all queue data in a storage mechanism of
 some form which may internally distribute that data among servers but
 provides Zaqar with a consistent data model of some form?

I think this is accurate. The queue's distribution depends on the
storage's ability to do so, and deployers will be able to choose what
storage works best for them based on this as well. I'm not sure how
useful this separation is from a user perspective but I do see the
relevance when it comes to implementation details and deployments.


 [...]
 As of now, Zaqar fully relies on the storage replication/clustering
 capabilities to provide data consistency, availability and fault
 tolerance.
 
 Is the replication synchronous or asynchronous with respect to client
 calls? E.g. will the response to a post of messages be returned only
 once the replication of those messages is confirmed? Likewise when
 deleting a message, is the response only returned when replicas of the
 message are deleted?

It depends on the driver implementation and/or storage configuration.
For example, in the mongodb driver, we use the default write concern
called acknowledged. This means that as soon as the message gets to
the master node (note it's not written on disk yet nor replicated) zaqar
will receive a confirmation and then send the response back to the
client. This is also configurable by the deployer by changing the
default write concern in the mongodb uri using `?w=SOME_WRITE_CONCERN`[0].

[0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w
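To make the trade-off concrete, here is a minimal, self-contained sketch (plain Python, not Zaqar or MongoDB code; the class and method names are invented for illustration) of what the write concern changes: with w=1 the client is acknowledged as soon as the primary has the write, so a primary crash can lose an acknowledged message, while w="majority" delays the ack until most replicas have it.

```python
# Illustrative sketch only: how the write-concern setting changes
# durability. With w=1 ("acknowledged"), the client gets an ack as soon
# as the primary has the write; with w="majority", the ack waits until
# a majority of replicas also have it.

class ReplicaSet:
    def __init__(self, replicas=3):
        self.nodes = [[] for _ in range(replicas)]  # node 0 is the primary

    def write(self, msg, w=1):
        """Replicate msg to nodes until the write concern is satisfied."""
        need = (len(self.nodes) // 2 + 1) if w == "majority" else int(w)
        for node in self.nodes[:need]:
            node.append(msg)
        return "ack"  # the client sees success at this point

    def crash_primary(self):
        # A new primary is elected; writes not yet replicated are gone.
        self.nodes.pop(0)

    def read_all(self):
        return set(self.nodes[0])

rs = ReplicaSet()
rs.write("m1", w=1)            # acked, but held only on the primary
rs.write("m2", w="majority")   # acked after 2 of 3 nodes have it
rs.crash_primary()
print(rs.read_all())           # {'m2'} -- m1 was acknowledged yet lost
```

This is the failure mode discussed later in the thread: with the default acknowledged concern, a message can be confirmed to the client and still be lost if the master fails before replication.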

 
 However, as far as consuming messages is concerned, it can
 guarantee once-and-only-once and/or at-least-once delivery depending on
 the message pattern used to consume messages. Using pop or claims
 guarantees the former whereas streaming messages out of Zaqar guarantees
 the latter.
 
 From what I can see, pop provides unreliable delivery (i.e. it's similar
 to no-ack). If the delete call using pop fails while sending back the
 response, the messages are removed but didn't get to the client.

Correct, pop works like no-ack. If you want to have pop+ack, it is
possible to claim just 1 message and then delete it.
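To illustrate the difference between the two consumption patterns, here is a hedged in-memory sketch (the names are illustrative, not Zaqar's actual API or driver code): pop is a destructive read, while a claim hides a message from other consumers for a TTL so the consumer can ack it with a delete.

```python
import time

# Minimal in-memory model of the two patterns discussed above.
# pop deletes on read (at-most-once); claim hides a message for ttl
# seconds so the consumer can ack (delete) it, giving at-least-once.

class Queue:
    def __init__(self):
        self.messages = {}        # id -> body
        self.claimed_until = {}   # id -> claim expiry timestamp
        self._next_id = 0

    def post(self, body):
        self._next_id += 1
        self.messages[self._next_id] = body
        return self._next_id

    def pop(self):
        """Destructive read: if the response is lost, so is the message."""
        if self.messages:
            mid = min(self.messages)
            return mid, self.messages.pop(mid)

    def claim(self, ttl=30):
        """Hide the oldest unclaimed message from other consumers."""
        now = time.time()
        for mid in sorted(self.messages):
            if self.claimed_until.get(mid, 0) < now:
                self.claimed_until[mid] = now + ttl
                return mid, self.messages[mid]

    def delete(self, mid):
        """Ack: remove the message once it has been processed."""
        self.messages.pop(mid, None)
        self.claimed_until.pop(mid, None)

q = Queue()
q.post("task")
mid, body = q.claim(ttl=30)   # hidden from other consumers
assert q.claim() is None      # a second consumer sees nothing
q.delete(mid)                 # ack after processing
```

If the consumer crashes before the delete, the claim expires and the message becomes visible again, which is why duplicates are possible and processing must be idempotent.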

 
 What do you mean by 'streaming messages'?

I'm sorry, that came out wrong. I had the browsability term in my head
and went with something even worse. By streaming messages I meant
polling messages without claiming them. In other words, at-least-once is
guaranteed by default, whereas once-and-only-once is guaranteed only if
claims are used.

 
 [...]
 Based on our short conversation on IRC last night, I understand you're
 concerned that FIFO may result in performance issues. That's a valid
 concern and I think the right answer is that it depends on the storage.
 If the storage has a built-in FIFO guarantee then there's nothing Zaqar
 needs to do there. On the other hand, if the storage does not have
 built-in support for FIFO, Zaqar will cover it in the driver
 implementation. In the mongodb driver, each message has a marker that is
 used to guarantee FIFO.
 
 That marker is a sequence number of some kind that is used to provide
 ordering to queries? Is it generated by the database itself?

It's a sequence number to provide ordering to queries, correct.
Depending on the driver, it may be generated by Zaqar or the database.
In mongodb's case it's generated by Zaqar[0].

[0]
https://github.com/openstack/zaqar/blob/master/zaqar/queues/storage/mongodb/queues.py#L103-L185
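The marker mechanism can be sketched as follows; this is an illustrative model of the technique (a per-queue counter incremented atomically, with listing done in marker order), not the actual mongodb driver code linked above.

```python
import itertools
import threading

# Illustrative sketch of the FIFO marker idea: each queue holds a
# counter that is atomically incremented per posted message (the lock
# stands in for the database's atomic increment), and consumers list
# messages in marker order.

class MarkerQueue:
    def __init__(self):
        self._lock = threading.Lock()
        self._counter = itertools.count(1)
        self.messages = []  # list of (marker, body)

    def post(self, body):
        with self._lock:
            marker = next(self._counter)
            self.messages.append((marker, body))
        return marker

    def list_after(self, marker=0):
        """Return messages strictly after `marker`, in FIFO order."""
        return sorted(m for m in self.messages if m[0] > marker)

q = MarkerQueue()
for body in ("a", "b", "c"):
    q.post(body)
print(q.list_after(1))  # [(2, 'b'), (3, 'c')]
```

The marker also gives consumers a natural resume point: a client remembers the last marker it saw and asks only for newer messages.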

-- 
@flaper87
Flavio Percoco



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Flavio Percoco
On 09/18/2014 04:24 PM, Clint Byrum wrote:
 Great job highlighting what our friends over at Amazon are doing.
 
 It's clear from these snippets, and a few other pieces of documentation
 for SQS I've read, that the Amazon team approached SQS from a _massive_
 scaling perspective. I think what may be forcing a lot of this frustration
 with Zaqar is that it was designed with a much smaller scale in mind.
 
 I think as long as that is the case, the design will remain in question.
 I'd be comfortable saying that the use cases I've been thinking about
 are entirely fine with the limitations SQS has.

I think these are pretty strong comments with not enough arguments to
defend them.

Saying that Zaqar was designed with a smaller scale in mind without
actually saying why you think so is unfair, besides not being true. So
please, do share why you think Zaqar was not designed for large scales and
provide comments that will help the project to grow and improve.

- Is it because of the storage technologies that have been chosen?
- Is it because of the API?
- Is it because of the programming language/framework?

So far, we've just discussed the API semantics and not zaqar's
scalability, which makes your comments even more surprising.

Flavio

 
 Excerpts from Joe Gordon's message of 2014-09-17 13:36:18 -0700:
 Hi All,

 My understanding of Zaqar is that it's like SQS. SQS uses distributed
 queues, which have a few unusual properties [0]:
 Message Order

 Amazon SQS makes a best effort to preserve order in messages, but due to
 the distributed nature of the queue, we cannot guarantee you will receive
 messages in the exact order you sent them. If your system requires that
 order be preserved, we recommend you place sequencing information in each
 message so you can reorder the messages upon receipt.
 At-Least-Once Delivery

 Amazon SQS stores copies of your messages on multiple servers for
 redundancy and high availability. On rare occasions, one of the servers
 storing a copy of a message might be unavailable when you receive or delete
 the message. If that occurs, the copy of the message will not be deleted on
 that unavailable server, and you might get that message copy again when you
 receive messages. Because of this, you must design your application to be
 idempotent (i.e., it must not be adversely affected if it processes the
 same message more than once).
 Message Sample

 The behavior of retrieving messages from the queue depends whether you are
 using short (standard) polling, the default behavior, or long polling. For
 more information about long polling, see Amazon SQS Long Polling
 http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-long-polling.html
 .

 With short polling, when you retrieve messages from the queue, Amazon SQS
 samples a subset of the servers (based on a weighted random distribution)
 and returns messages from just those servers. This means that a particular
 receive request might not return all your messages. Or, if you have a small
 number of messages in your queue (less than 1000), it means a particular
 request might not return any of your messages, whereas a subsequent request
 will. If you keep retrieving from your queues, Amazon SQS will sample all
 of the servers, and you will receive all of your messages.

 The following figure shows short polling behavior of messages being
 returned after one of your system components makes a receive request.
 Amazon SQS samples several of the servers (in gray) and returns the
 messages from those servers (Message A, C, D, and B). Message E is not
 returned to this particular request, but it would be returned to a
 subsequent request.



 Presumably SQS has these properties because they make the system scalable,
 if so does Zaqar have the same properties (not just making these same
 guarantees in the API, but actually having these properties in the
 backends)? And if not, why? I looked on the wiki [1] for information on
 this, but couldn't find anything.





 [0]
 http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/DistributedQueues.html
 [1] https://wiki.openstack.org/wiki/Zaqar
 
 


-- 
@flaper87
Flavio Percoco



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Steve Lewis
On September 18, 2014 7:45 AM, Flavio Percoco wrote:

 On 09/18/2014 04:09 PM, Gordon Sim wrote:

 However, as far as consuming messages is concerned, it can
 guarantee once-and-only-once and/or at-least-once delivery depending on
 the message pattern used to consume messages. Using pop or claims
 guarantees the former whereas streaming messages out of Zaqar guarantees
 the latter.

 From what I can see, pop provides unreliable delivery (i.e. it's similar
 to no-ack). If the delete call using pop fails while sending back the
 response, the messages are removed but didn't get to the client.
 
 Correct, pop works like no-ack. If you want to have pop+ack, it is
 possible to claim just 1 message and then delete it.

Having some experience using SQS I would expect that there would be a mode
something like what SQS provides where a message gets picked up by one 
consumer (let's say by short-polling) and is hidden for a timeout duration from 
other consumers, so that the consumer has time to ack. Is that how you are 
using the term 'claim' in this case?



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Flavio Percoco
On 09/18/2014 05:42 PM, Steve Lewis wrote:
 On September 18, 2014 7:45 AM, Flavio Percoco wrote:
 
 On 09/18/2014 04:09 PM, Gordon Sim wrote:

 However, as far as consuming messages is concerned, it can
 guarantee once-and-only-once and/or at-least-once delivery depending on
 the message pattern used to consume messages. Using pop or claims
 guarantees the former whereas streaming messages out of Zaqar guarantees
 the latter.

 From what I can see, pop provides unreliable delivery (i.e. it's similar
 to no-ack). If the delete call using pop fails while sending back the
 response, the messages are removed but didn't get to the client.

 Correct, pop works like no-ack. If you want to have pop+ack, it is
 possible to claim just 1 message and then delete it.
 
 Having some experience using SQS I would expect that there would be a mode
 something like what SQS provides where a message gets picked up by one 
 consumer (let's say by short-polling) and is hidden for a timeout duration from
 other consumers, so that the consumer has time to ack. Is that how you are
 using the term 'claim' in this case?

Correct, that's the name of the feature in Zaqar. You can find a bit of
extra information here[0]. Hope this helps.

[0] https://wiki.openstack.org/wiki/Zaqar/specs/api/v1.1#Claims

Cheers,
Flavio

-- 
@flaper87
Flavio Percoco



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Devananda van der Veen
On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco fla...@redhat.com wrote:
 On 09/18/2014 04:09 PM, Gordon Sim wrote:
 On 09/18/2014 12:31 PM, Flavio Percoco wrote:
 Zaqar guarantees FIFO. To be more precise, it does that relying on the
 storage backend ability to do so as well. Depending on the storage used,
 guaranteeing FIFO may have some performance penalties.

 Would it be accurate to say that at present Zaqar does not use
 distributed queues, but holds all queue data in a storage mechanism of
 some form which may internally distribute that data among servers but
 provides Zaqar with a consistent data model of some form?

 I think this is accurate. The queue's distribution depends on the
 storage ability to do so and deployers will be able to choose what
 storage works best for them based on this as well. I'm not sure how
 useful this separation is from a user perspective but I do see the
 relevance when it comes to implementation details and deployments.

Guaranteeing FIFO and not using a distributed queue architecture
*above* the storage backend are both scale-limiting design choices.
That Zaqar's scalability depends on the storage back end is not a
desirable thing in a cloud-scale messaging system in my opinion,
because this will prevent use at scales which can not be accommodated
by a single storage back end.

And based on my experience consulting for companies whose needs grew
beyond the capabilities of a single storage backend, moving to
application-aware sharding required a significant amount of
rearchitecture in most cases.

-Devananda



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Devananda van der Veen
On Thu, Sep 18, 2014 at 7:55 AM, Flavio Percoco fla...@redhat.com wrote:
 On 09/18/2014 04:24 PM, Clint Byrum wrote:
 Great job highlighting what our friends over at Amazon are doing.

 It's clear from these snippets, and a few other pieces of documentation
 for SQS I've read, that the Amazon team approached SQS from a _massive_
 scaling perspective. I think what may be forcing a lot of this frustration
 with Zaqar is that it was designed with a much smaller scale in mind.

 I think as long as that is the case, the design will remain in question.
 I'd be comfortable saying that the use cases I've been thinking about
 are entirely fine with the limitations SQS has.

 I think these are pretty strong comments with not enough arguments to
 defend them.


Please see my prior email. I agree with Clint's assertions here.

 Saying that Zaqar was designed with a smaller scale in mind without
 actually saying why you think so is unfair, besides not being true. So
 please, do share why you think Zaqar was not designed for large scales and
 provide comments that will help the project to grow and improve.

 - Is it because of the storage technologies that have been chosen?
 - Is it because of the API?
 - Is it because of the programming language/framework?

It is not because of the storage technology or because of the
programming language.

 So far, we've just discussed the API semantics and not zaqar's
 scalability, which makes your comments even more surprising.

- guaranteed message order
- not distributing work across a configurable number of back ends

These are scale-limiting design choices which are reflected in the
API's characteristics.



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Joe Gordon
On Thu, Sep 18, 2014 at 9:02 AM, Devananda van der Veen 
devananda@gmail.com wrote:

 On Thu, Sep 18, 2014 at 7:55 AM, Flavio Percoco fla...@redhat.com wrote:
  On 09/18/2014 04:24 PM, Clint Byrum wrote:
  Great job highlighting what our friends over at Amazon are doing.
 
  It's clear from these snippets, and a few other pieces of documentation
  for SQS I've read, that the Amazon team approached SQS from a _massive_
  scaling perspective. I think what may be forcing a lot of this
 frustration
  with Zaqar is that it was designed with a much smaller scale in mind.
 
  I think as long as that is the case, the design will remain in question.
  I'd be comfortable saying that the use cases I've been thinking about
  are entirely fine with the limitations SQS has.
 
  I think these are pretty strong comments with not enough arguments to
  defend them.
 

 Please see my prior email. I agree with Clint's assertions here.

  Saying that Zaqar was designed with a smaller scale in mind without
  actually saying why you think so is unfair, besides not being true. So
  please, do share why you think Zaqar was not designed for large scales and
  provide comments that will help the project to grow and improve.
 
  - Is it because of the storage technologies that have been chosen?
  - Is it because of the API?
  - Is it because of the programming language/framework?

 It is not because of the storage technology or because of the
 programming language.

  So far, we've just discussed the API semantics and not zaqar's
  scalability, which makes your comments even more surprising.

 - guaranteed message order
 - not distributing work across a configurable number of back ends

 These are scale-limiting design choices which are reflected in the
 API's characteristics.


I agree with Clint and Devananda





Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Devananda van der Veen
On Thu, Sep 18, 2014 at 8:54 AM, Devananda van der Veen
devananda@gmail.com wrote:
 On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco fla...@redhat.com wrote:
 On 09/18/2014 04:09 PM, Gordon Sim wrote:
 On 09/18/2014 12:31 PM, Flavio Percoco wrote:
 Zaqar guarantees FIFO. To be more precise, it does that relying on the
 storage backend ability to do so as well. Depending on the storage used,
 guaranteeing FIFO may have some performance penalties.

 Would it be accurate to say that at present Zaqar does not use
 distributed queues, but holds all queue data in a storage mechanism of
 some form which may internally distribute that data among servers but
 provides Zaqar with a consistent data model of some form?

 I think this is accurate. The queue's distribution depends on the
 storage ability to do so and deployers will be able to choose what
 storage works best for them based on this as well. I'm not sure how
 useful this separation is from a user perspective but I do see the
 relevance when it comes to implementation details and deployments.

 Guaranteeing FIFO and not using a distributed queue architecture
 *above* the storage backend are both scale-limiting design choices.
 That Zaqar's scalability depends on the storage back end is not a
 desirable thing in a cloud-scale messaging system in my opinion,
 because this will prevent use at scales which can not be accommodated
 by a single storage back end.


It may be worth qualifying this a bit more.

While no single instance of any storage back-end is infinitely
scalable, some of them are really darn fast. That may be enough for
the majority of use cases. It's not outside the realm of possibility
that the inflection point [0] where these design choices result in
performance limitations is at the very high end of scale-out, e.g.
public cloud providers who have the resources to invest further in
improving zaqar.

As an example of what I mean, let me refer to the 99th percentile
response time graphs in Kurt's benchmarks [1]... increasing the number
of clients with write-heavy workloads was enough to drive latency from
10ms to 200 ms with a single service. That latency significantly
improved as storage and application instances were added, which is
good, and what I would expect. These benchmarks do not (and were not
intended to) show the maximal performance of a public-cloud-scale
deployment -- but they do show that performance under different
workloads improves as additional services are started.

While I have no basis for comparing the configuration of the
deployment he used in those tests to what a public cloud operator
might choose to deploy, and presumably such an operator would put
significant work into tuning storage and running more instances of
each service and thus shift that inflection point to the right, my
point is that, by depending on a single storage instance, Zaqar has
pushed the *ability* to scale out down into the storage
implementation. Given my experience scaling SQL and NoSQL data stores
(in my past life, before working on OpenStack) I have a knee-jerk
reaction against believing that this approach will result in a
public-cloud-scale messaging system.

-Devananda

[0] http://en.wikipedia.org/wiki/Inflection_point -- in this context,
I mean the point on the graph of throughput vs latency where the
derivative goes from near-zero (linear growth) to non-zero
(exponential growth)

[1] https://wiki.openstack.org/wiki/Zaqar/Performance/PubSub/Redis



Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Joe Gordon
On Thu, Sep 18, 2014 at 7:45 AM, Flavio Percoco fla...@redhat.com wrote:

 On 09/18/2014 04:09 PM, Gordon Sim wrote:
  On 09/18/2014 12:31 PM, Flavio Percoco wrote:
  On 09/17/2014 10:36 PM, Joe Gordon wrote:
  My understanding of Zaqar is that it's like SQS. SQS uses distributed
  queues, which have a few unusual properties [0]:
 
 
   Message Order
 
  Amazon SQS makes a best effort to preserve order in messages, but due
 to
  the distributed nature of the queue, we cannot guarantee you will
  receive messages in the exact order you sent them. If your system
  requires that order be preserved, we recommend you place sequencing
  information in each message so you can reorder the messages upon
  receipt.
 
 
  Zaqar guarantees FIFO. To be more precise, it does that relying on the
  storage backend ability to do so as well. Depending on the storage used,
  guaranteeing FIFO may have some performance penalties.
 
  Would it be accurate to say that at present Zaqar does not use
  distributed queues, but holds all queue data in a storage mechanism of
  some form which may internally distribute that data among servers but
  provides Zaqar with a consistent data model of some form?

 I think this is accurate. The queue's distribution depends on the
 storage ability to do so and deployers will be able to choose what
 storage works best for them based on this as well. I'm not sure how
 useful this separation is from a user perspective but I do see the
 relevance when it comes to implementation details and deployments.


  [...]
  As of now, Zaqar fully relies on the storage replication/clustering
  capabilities to provide data consistency, availability and fault
  tolerance.
 
  Is the replication synchronous or asynchronous with respect to client
  calls? E.g. will the response to a post of messages be returned only
  once the replication of those messages is confirmed? Likewise when
  deleting a message, is the response only returned when replicas of the
  message are deleted?

 It depends on the driver implementation and/or storage configuration.
 For example, in the mongodb driver, we use the default write concern
 called acknowledged. This means that as soon as the message gets to
 the master node (note it's not written on disk yet nor replicated) zaqar
 will receive a confirmation and then send the response back to the
 client. This is also configurable by the deployer by changing the
 default write concern in the mongodb uri using `?w=SOME_WRITE_CONCERN`[0].


This means that by default Zaqar cannot guarantee a message will be
delivered at all. A message can be acknowledged and then the 'master node'
crashes and the message is lost. Zaqar's ability to guarantee delivery is
limited by the reliability of a single node, while something like swift can
only lose a piece of data if 3 machines crash at the same time.



 [0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w

 
  However, as far as consuming messages is concerned, it can
  guarantee once-and-only-once and/or at-least-once delivery depending on
  the message pattern used to consume messages. Using pop or claims
  guarantees the former whereas streaming messages out of Zaqar guarantees
  the latter.
 
  From what I can see, pop provides unreliable delivery (i.e. it's similar
  to no-ack). If the delete call using pop fails while sending back the
  response, the messages are removed but didn't get to the client.

 Correct, pop works like no-ack. If you want to have pop+ack, it is
 possible to claim just 1 message and then delete it.

 
  What do you mean by 'streaming messages'?

 I'm sorry, that came out wrong. I had the browsability term in my head
 and went with something even worse. By streaming messages I meant
 polling messages without claiming them. In other words, at-least-once is
 guaranteed by default, whereas once-and-only-once is guaranteed only if
 claims are used.

 
  [...]
  Based on our short conversation on IRC last night, I understand you're
  concerned that FIFO may result in performance issues. That's a valid
  concern and I think the right answer is that it depends on the storage.
  If the storage has a built-in FIFO guarantee then there's nothing Zaqar
  needs to do there. On the other hand, if the storage does not have a
  built-in support for FIFO, Zaqar will cover it in the driver
  implementation. In the mongodb driver, each message has a marker that is
  used to guarantee FIFO.
 
  That marker is a sequence number of some kind that is used to provide
  ordering to queries? Is it generated by the database itself?

 It's a sequence number to provide ordering to queries, correct.
 Depending on the driver, it may be generated by Zaqar or the database.
 In mongodb's case it's generated by Zaqar[0].

 [0]

 https://github.com/openstack/zaqar/blob/master/zaqar/queues/storage/mongodb/queues.py#L103-L185

 --
 @flaper87
 Flavio Percoco


Re: [openstack-dev] [Zaqar] Zaqar and SQS Properties of Distributed Queues

2014-09-18 Thread Gordon Sim

On 09/18/2014 03:45 PM, Flavio Percoco wrote:

On 09/18/2014 04:09 PM, Gordon Sim wrote:

Is the replication synchronous or asynchronous with respect to client
calls? E.g. will the response to a post of messages be returned only
once the replication of those messages is confirmed? Likewise when
deleting a message, is the response only returned when replicas of the
message are deleted?


It depends on the driver implementation and/or storage configuration.
For example, in the mongodb driver, we use the default write concern
called acknowledged. This means that as soon as the message gets to
the master node (note it's not written on disk yet nor replicated) zaqar
will receive a confirmation and then send the response back to the
client.


So in that mode it's unreliable. If there is failure right after the 
response is sent the message may be lost, but the client believes it has 
been confirmed so will not resend.



This is also configurable by the deployer by changing the
default write concern in the mongodb uri using `?w=SOME_WRITE_CONCERN`[0].

[0] http://docs.mongodb.org/manual/reference/connection-string/#uri.w


So you could change that to majority to get reliable publication 
(at-least-once).


[...]

 From what I can see, pop provides unreliable delivery (i.e. it's similar
to no-ack). If the delete call using pop fails while sending back the
response, the messages are removed but didn't get to the client.


Correct, pop works like no-ack. If you want to have pop+ack, it is
possible to claim just 1 message and then delete it.


Right, claim+delete is ack (and if the claim is replicated and 
recoverable etc you can verify whether deletion occurred to ensure 
message is processed only once). Using delete-with-pop is no-ack, i.e.
at-most-once.



What do you mean by 'streaming messages'?


I'm sorry, that came out wrong. I had the browsability term in my head
and went with something even worse. By streaming messages I meant
polling messages without claiming them. In other words, at-least-once is
guaranteed by default, whereas once-and-only-once is guaranteed only if
claims are used.


I don't see that the claim mechanism brings any stronger guarantee, it 
just offers a competing consumer behaviour where browsing is 
non-competing (non-destructive). In both cases you require the client to 
be able to remember which messages it had processed in order to ensure 
exactly once. The claim reduces the scope of any doubt, but the client 
still needs to be able to determine whether it has already processed any 
message in the claim.


[...]

That marker is a sequence number of some kind that is used to provide
ordering to queries? Is it generated by the database itself?


It's a sequence number to provide ordering to queries, correct.
Depending on the driver, it may be generated by Zaqar or the database.
In mongodb's case it's generated by Zaqar[0].


Zaqar increments a counter held within the database, am I reading that 
correctly? So mongodb is responsible for the ordering and atomicity of 
multiple concurrent requests for a marker?



[0]
https://github.com/openstack/zaqar/blob/master/zaqar/queues/storage/mongodb/queues.py#L103-L185



