Your question doesn't quite provide enough information to permit an
answer.

First, the number of users doesn't tell us how many queries will happen in
real time. What you need is the number of queries per second, and that
number can easily vary by a factor of 1000 for the same number of users.

Second, querying 10 billion rows can take less than a millisecond for the
query itself. Or minutes. What kind of query do you mean to do?

If you are talking about something like messaging and personalization, with
high usage per day and very fast response requirements, then Drill is
likely to be inappropriate almost by definition. The problem is that Drill
spends a lot of time (100 ms or more) planning how to execute the query, on
the theory that most queries in Drill will be complex enough that this
planning will save seconds or tens of seconds. That is a fine trade-off for
complex queries. If you just want to show the last ten messages for a user,
it is a very bad trade-off and you should probably use something other than
SQL to do this.

The modern trend for data-oriented web access is to expose a REST interface
that is framed in terms of your business needs. This interface and the
resulting data are manipulated using browser-resident JavaScript. To make
this efficient, you want a database that has direct JavaScript access and,
preferably, one that already has a storage module written for meteor.js or
a similar package.

If you are talking about something like account maintenance, where you need
to do complex queries very rarely (say once a month), then SQL is much more
plausible, since users are likely to accept a long (1 s or more) delay to
access the information. Even here, a good data abstraction in the form of a
REST microservice is likely to be a good idea.

So... back to your question.

Let's compute some query rates:

100 million users accessing the system once per month, mostly during peak
hours, will cause 100 million accesses / (30 days * 20,000 peak seconds /
day) = 100e6 / 600e3 ≈ 167, i.e. fewer than 200 queries per second.

100 million users with 10 million active users accessing the system 100
times per day during peak hours will cause 100 * 10 million queries/day /
20,000 peak seconds / day = 1e9 / 20e3 = 50e3 queries per second.
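The two rates above are just back-of-the-envelope arithmetic; a short sketch
makes the assumptions explicit (the 20,000 peak seconds/day figure assumes
most traffic lands in roughly 5.5 peak hours):

```python
# Back-of-the-envelope query-rate arithmetic for the two scenarios above.
PEAK_SECONDS_PER_DAY = 20_000  # assumed: traffic concentrates in ~5.5 peak hours

# Scenario 1: 100 million users, one access per user per month.
monthly_accesses = 100e6
qps_1 = monthly_accesses / (30 * PEAK_SECONDS_PER_DAY)
print(f"scenario 1: {qps_1:.0f} queries/second")  # 167, i.e. < 200

# Scenario 2: 10 million active users, 100 accesses each per day.
daily_accesses = 10e6 * 100
qps_2 = daily_accesses / PEAK_SECONDS_PER_DAY
print(f"scenario 2: {qps_2:.0f} queries/second")  # 50000
```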

Note that the number of *concurrent* queries does not matter here. At the
first rate, if you have a system that responds in 1 ms, there will be
essentially no concurrency, whereas a horrible system that responds in
10 s will require a concurrency of 2000 simultaneous queries. What you want
is throughput at an acceptable response time.
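The relationship behind those numbers is Little's law: average concurrency
equals throughput times response time. A one-line sketch:

```python
# Little's law: average queries in flight = throughput (qps) * latency (s).
def concurrency(qps: float, latency_s: float) -> float:
    return qps * latency_s

print(concurrency(200, 0.001))  # 0.2 -> essentially no concurrency
print(concurrency(200, 10.0))   # 2000.0 -> 2000 simultaneous queries
```

This is why "concurrent queries" is a derived number, not a requirement you
should specify directly.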

The number of concurrent users is also almost irrelevant. What you want is
usage pattern × user population to get queries per second.

So the answer is that the first rate of 200 queries per second could
probably be handled by Drill given enough hardware, but a programmatic
interface to even a fairly small-scale database would probably be much,
much better. In general, I expect that a document-oriented database will
suit your needs much better than a relational one. My guess is that not
very many analytical query engines like Drill will be cost effective in
this range, while almost every document-oriented database will be able to
handle it.

For the second rate of 50,000 queries per second, there are very few
SQL-based systems that will suit your needs; programmatic access to a
document-oriented database is likely to be your only cost-effective
solution here. My company (MapR), for instance, makes a database that would
likely work, but many will not.

On Sun, Feb 14, 2016 at 4:27 AM, <[email protected]> wrote:

> hi,     i want to build a app which need for support sevaral  hundred
> million users realtime query from about ten billion row records. dose
> apache drill fit for this requirement? dose it support High concurrency?
> dose it need mass hardware resource to archive the low latency
> performance?  resource exchange performance? i use hbase as database.
>
>
>                         李启明 from China
