If you care about maximizing your throughput, you probably don't want to ingest through the proxy (you can probably get ~10x faster ingest using the Java BatchWriter API).

I wouldn't avoid the proxy server purely because you're using batch_scans, though. If you look at the Java implementation of the BatchScanner, it essentially keeps a queue onto which many servers concurrently throw results, and hands the client a Java Iterator over that queue. With that in mind, it's very similar to what the proxy server is doing for you.
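
To make the queue-plus-iterator idea concrete, here is a rough Python sketch of the pattern. It is not the actual BatchScanner or proxy code, just an illustration under assumptions: the names batch_scan, servers, and max_queued are made up, and each "server" is a plain callable standing in for a scan against one tablet server.

```python
import queue
import threading

_DONE = object()  # sentinel: each producer puts one when it has finished


def batch_scan(servers, max_queued=100):
    """Yield results from every server as they arrive (order not guaranteed).

    `servers` is a list of callables, each returning an iterable of results.
    They are hypothetical stand-ins for per-tablet-server scans.
    """
    # Bounded queue: producers block when the consumer falls behind,
    # which is the memory the scanner (or proxy) has to hold for queuing.
    results = queue.Queue(maxsize=max_queued)

    def produce(scan):
        try:
            for item in scan():
                results.put(item)
        finally:
            results.put(_DONE)  # always signal completion, even on error

    for scan in servers:
        threading.Thread(target=produce, args=(scan,), daemon=True).start()

    # The client only ever sees this iterator over the shared queue.
    remaining = len(servers)
    while remaining:
        item = results.get()
        if item is _DONE:
            remaining -= 1
        else:
            yield item
```

Calling batch_scan with a few fake servers and iterating over the result gives you every item from every server, interleaved in whatever order the threads happen to deliver them. The bounded queue is also where the resource cost comes from: whoever owns that queue has to buffer results for each active scan.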

On 4/14/14, 12:12 PM, David O'Gwynn wrote:
Ah, thanks Eric, that answers my question. It sounds like using the
proxy server for batch_scans and ingest is a bit beyond its scope. Are
there plans for beefing up the proxy to handle a wider range of
purposes from multiple clients?

Thanks,
David

On Mon, Apr 14, 2014 at 11:06 AM, Eric Newton <[email protected]> wrote:
High ingest and batch scans use resources within the proxy for queuing
data.  If I were using a proxy for these activities, I would want to
have a proxy for each client.  Administrative requests, and even basic
single-range scans are simple pass-throughs with a much lower chance
of overloading the proxy.


On Mon, Apr 14, 2014 at 9:56 AM, David Medinets
<[email protected]> wrote:
"number of proxy servers should be proportional to the number of clients" -
I hate to be pedantic, but this is a very general statement. Can you be
more specific? Should the proportion be 1:1 or 5:1? What factors affect
the ratio?


On Mon, Apr 14, 2014 at 9:32 AM, Eric Newton <[email protected]> wrote:

The number of proxy servers should be proportional to the number of
clients.

The proxy can talk to all the tablet servers, but the client of the
proxy only has the proxy to make requests on its behalf.

As always, it's going to depend on what you want to do, what your
schema looks like, and the total number of servers you have.

-Eric

On Sun, Apr 13, 2014 at 11:58 PM, David O'Gwynn <[email protected]> wrote:
Hi community,

I was reading a thread "Error stressing with pyaccumulo app" from
February, and the topic of the optimal number of proxy servers for a
cluster of a given size came up. Does anyone have any insight into
that question? Is there a thread in the archive that addresses this
question directly?

My gut tells me that you should have a number proportional to the
number of tablet servers, but I'm afraid I don't really understand
what the proxy server is doing.

