Re: benchmarking

2018-08-28 Thread Jeremy Kepner
Our nodes are usually 20+ cores and 100+ GB RAM.

On Tue, Aug 28, 2018 at 10:18:24PM +0300, guy sharon wrote:
> hi Jeremy,
> 
> Do you have any information on how you configure them and what kind of
> hardware they run on?
> 
> Thanks,
> Guy.
> 
> 
> 
> On Tue, Aug 28, 2018 at 3:44 PM Jeremy Kepner  wrote:
> 
> > FYI, Single node Accumulo instances are our most popular deployment.
> > We have hundreds of them.   Accumulo is so fast that it can replace
> > what would normally require 20 MySQL servers.
> >
> > Regards.  -Jeremy
> >
> > On Tue, Aug 28, 2018 at 07:38:37AM +, Sean Busbey wrote:
> > > Hi Guy,
> > >
> > > Apache Accumulo is designed for horizontally scaling out for large scale
> > workloads that need to do random reads and writes. There's a non-trivial
> > amount of overhead that comes with a system aimed at doing that on
> > thousands of nodes.
> > >
> > > If your use case works for a single laptop with such a small number of
> > entries and exhaustive scans, then Accumulo is probably not the correct
> > tool for the job.
> > >
> > > For example, on my laptop (i7 2 cores, 8GiB memory) with that dataset
> > size you can just rely on a file format like Apache Avro:
> > >
> > > busbey$ time java -jar avro-tools-1.7.7.jar random --codec snappy
> > --count 6300000 --schema '{ "type": "record", "name": "entry", "fields": [
> > { "name": "field0", "type": "string" } ] }' ~/Downloads/6.3m_entries.avro
> > > Aug 28, 2018 12:31:13 AM org.apache.hadoop.util.NativeCodeLoader 
> > > WARNING: Unable to load native-hadoop library for your platform... using
> > builtin-java classes where applicable
> > > test.seed=1535441473243
> > >
> > > real  0m5.451s
> > > user  0m5.922s
> > > sys   0m0.656s
> > > busbey$ ls -lah ~/Downloads/6.3m_entries.avro
> > > -rwxrwxrwx  1 busbey  staff   186M Aug 28 00:31
> > /Users/busbey/Downloads/6.3m_entries.avro
> > > busbey$ time java -jar avro-tools-1.7.7.jar tojson
> > ~/Downloads/6.3m_entries.avro | wc -l
> > >  6300000
> > >
> > > real  0m4.239s
> > > user  0m6.026s
> > > sys   0m0.721s
> > >
> > > I'd recommend that you start at >= 5 nodes if you want to look at rough
> > per-node throughput capabilities.
> > >
> > >
> > > On 2018/08/28 06:59:38, guy sharon  wrote:
> > > > hi Mike,
> > > >
> > > > Thanks for the links.
> > > >
> > > > My current setup is a 4 node cluster (tserver, master, gc, monitor)
> > running
> > > > on Alpine Docker containers on a laptop with an i7 processor (8 cores)
> > with
> > > > 16GB of RAM. As an example I'm running a count of all entries for a
> > table
> > > > with 6.3M entries with "accumulo shell -u root -p secret  -e "scan -t
> > > > benchmark_table -np" | wc -l" and it takes 43 seconds. Not sure if
> > this is
> > > > reasonable or not. Seems a little slow to me. What do you think?
> > > >
> > > > BR,
> > > > Guy.
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Aug 27, 2018 at 4:43 PM Michael Wall 
> > wrote:
> > > >
> > > > > Hi Guy,
> > > > >
> > > > > Here are a couple links I found.  Can you tell us more about your
> > setup
> > > > > and what you are seeing?
> > > > >
> > > > > https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
> > > > > https://www.youtube.com/watch?v=Ae9THpmpFpM
> > > > >
> > > > > Mike
> > > > >
> > > > >
> > > > > On Sat, Aug 25, 2018 at 5:09 PM guy sharon <
> > guy.sharon.1...@gmail.com>
> > > > > wrote:
> > > > >
> > > > >> hi,
> > > > >>
> > > > >> I've just started working with Accumulo and I think I'm
> > experiencing slow
> > > > >> reads/writes. I'm aware of the recommended configuration. Does
> > anyone know
> > > > >> of any standard benchmarks and benchmarking tools I can use to tell
> > if the
> > > > >> performance I'm getting is reasonable?
> > > > >>
> > > > >>
> > > > >>
> > > >
> >


Re: benchmarking

2018-08-28 Thread guy sharon
hi Jeremy,

Do you have any information on how you configure them and what kind of
hardware they run on?

Thanks,
Guy.



On Tue, Aug 28, 2018 at 3:44 PM Jeremy Kepner  wrote:

> FYI, Single node Accumulo instances are our most popular deployment.
> We have hundreds of them.   Accumulo is so fast that it can replace
> what would normally require 20 MySQL servers.
>
> Regards.  -Jeremy
>
> On Tue, Aug 28, 2018 at 07:38:37AM +, Sean Busbey wrote:
> > Hi Guy,
> >
> > Apache Accumulo is designed for horizontally scaling out for large scale
> workloads that need to do random reads and writes. There's a non-trivial
> amount of overhead that comes with a system aimed at doing that on
> thousands of nodes.
> >
> > If your use case works for a single laptop with such a small number of
> entries and exhaustive scans, then Accumulo is probably not the correct
> tool for the job.
> >
> > For example, on my laptop (i7 2 cores, 8GiB memory) with that dataset
> size you can just rely on a file format like Apache Avro:
> >
> > busbey$ time java -jar avro-tools-1.7.7.jar random --codec snappy
> --count 6300000 --schema '{ "type": "record", "name": "entry", "fields": [
> { "name": "field0", "type": "string" } ] }' ~/Downloads/6.3m_entries.avro
> > Aug 28, 2018 12:31:13 AM org.apache.hadoop.util.NativeCodeLoader 
> > WARNING: Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> > test.seed=1535441473243
> >
> > real  0m5.451s
> > user  0m5.922s
> > sys   0m0.656s
> > busbey$ ls -lah ~/Downloads/6.3m_entries.avro
> > -rwxrwxrwx  1 busbey  staff   186M Aug 28 00:31
> /Users/busbey/Downloads/6.3m_entries.avro
> > busbey$ time java -jar avro-tools-1.7.7.jar tojson
> ~/Downloads/6.3m_entries.avro | wc -l
> >  6300000
> >
> > real  0m4.239s
> > user  0m6.026s
> > sys   0m0.721s
> >
> > I'd recommend that you start at >= 5 nodes if you want to look at rough
> per-node throughput capabilities.
> >
> >
> > On 2018/08/28 06:59:38, guy sharon  wrote:
> > > hi Mike,
> > >
> > > Thanks for the links.
> > >
> > > My current setup is a 4 node cluster (tserver, master, gc, monitor)
> running
> > > on Alpine Docker containers on a laptop with an i7 processor (8 cores)
> with
> > > 16GB of RAM. As an example I'm running a count of all entries for a
> table
> > > with 6.3M entries with "accumulo shell -u root -p secret  -e "scan -t
> > > benchmark_table -np" | wc -l" and it takes 43 seconds. Not sure if
> this is
> > > reasonable or not. Seems a little slow to me. What do you think?
> > >
> > > BR,
> > > Guy.
> > >
> > >
> > >
> > >
> > > On Mon, Aug 27, 2018 at 4:43 PM Michael Wall 
> wrote:
> > >
> > > > Hi Guy,
> > > >
> > > > Here are a couple links I found.  Can you tell us more about your
> setup
> > > > and what you are seeing?
> > > >
> > > > https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
> > > > https://www.youtube.com/watch?v=Ae9THpmpFpM
> > > >
> > > > Mike
> > > >
> > > >
> > > > On Sat, Aug 25, 2018 at 5:09 PM guy sharon <
> guy.sharon.1...@gmail.com>
> > > > wrote:
> > > >
> > > >> hi,
> > > >>
> > > >> I've just started working with Accumulo and I think I'm
> experiencing slow
> > > >> reads/writes. I'm aware of the recommended configuration. Does
> anyone know
> > > >> of any standard benchmarks and benchmarking tools I can use to tell
> if the
> > > >> performance I'm getting is reasonable?
> > > >>
> > > >>
> > > >>
> > >
>


Re: benchmarking

2018-08-28 Thread Mike Miller
Measuring scan performance by piping output from the shell is not the best
way: a lot of time is wasted printing output to the terminal. You are
better off measuring with the BatchScanner API directly.
An example can be found here:
https://accumulo.apache.org/tour/batch-scanner/
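As a rough stand-in that needs no Accumulo at all, the pipeline below shows that merely formatting and counting ~6.3M lines costs real time on its own (the row format in the `awk` program is invented for illustration):

```shell
#!/bin/bash
# Illustration only, no database involved: formatting ~6.3M rows and piping
# them through `wc -l` has a measurable cost of its own, so timing
# `accumulo shell ... | wc -l` folds client-side printing into the "scan" time.
time seq 6300000 | awk '{ printf "row_%s cf:cq []\tvalue\n", $1 }' | wc -l
```

If this alone takes a few seconds on the same machine, a meaningful slice of a 43-second shell-based count is client-side formatting rather than tablet-server work.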


On Tue, Aug 28, 2018 at 2:50 PM guy sharon 
wrote:

> hi Sean,
>
> Thanks for the advice. I tried bringing up a 5 tserver cluster on AWS with
> Muchos (https://github.com/apache/fluo-muchos). My first attempt was
> using servers with 2 vCPU, 8GB RAM (m5d.large on AWS). The Hadoop datanodes
> were colocated with the tservers and the Accumulo master was on the same
> server as the Hadoop namenode. I populated a table with 6M entries using a
> modified version of
> org.apache.accumulo.examples.simple.helloworld.InsertWithBatchWriter from
> Accumulo (the only thing I modified was the number of entries as it usually
> inserts 50k). I then did a count with "bin/accumulo shell -u root -p secret
> -e "scan -t hellotable -np" | wc -l". That took 15 seconds. I then upgraded
> to m5d.xlarge instances (4vCPU, 16GB RAM) and got the exact same result, so
> it seems upgrading the servers doesn't help.
>
> Is this expected or am I doing something terribly wrong?
>
> BR,
> Guy.
>
>
>
> On Tue, Aug 28, 2018 at 10:38 AM Sean Busbey  wrote:
>
>> Hi Guy,
>>
>> Apache Accumulo is designed for horizontally scaling out for large scale
>> workloads that need to do random reads and writes. There's a non-trivial
>> amount of overhead that comes with a system aimed at doing that on
>> thousands of nodes.
>>
>> If your use case works for a single laptop with such a small number of
>> entries and exhaustive scans, then Accumulo is probably not the correct
>> tool for the job.
>>
>> For example, on my laptop (i7 2 cores, 8GiB memory) with that dataset
>> size you can just rely on a file format like Apache Avro:
>>
>> busbey$ time java -jar avro-tools-1.7.7.jar random --codec snappy --count
>> 6300000 --schema '{ "type": "record", "name": "entry", "fields": [ {
>> "name": "field0", "type": "string" } ] }' ~/Downloads/6.3m_entries.avro
>> Aug 28, 2018 12:31:13 AM org.apache.hadoop.util.NativeCodeLoader 
>> WARNING: Unable to load native-hadoop library for your platform... using
>> builtin-java classes where applicable
>> test.seed=1535441473243
>>
>> real  0m5.451s
>> user  0m5.922s
>> sys   0m0.656s
>> busbey$ ls -lah ~/Downloads/6.3m_entries.avro
>> -rwxrwxrwx  1 busbey  staff   186M Aug 28 00:31
>> /Users/busbey/Downloads/6.3m_entries.avro
>> busbey$ time java -jar avro-tools-1.7.7.jar tojson
>> ~/Downloads/6.3m_entries.avro | wc -l
>>  6300000
>>
>> real  0m4.239s
>> user  0m6.026s
>> sys   0m0.721s
>>
>> I'd recommend that you start at >= 5 nodes if you want to look at rough
>> per-node throughput capabilities.
>>
>>
>> On 2018/08/28 06:59:38, guy sharon  wrote:
>> > hi Mike,
>> >
>> > Thanks for the links.
>> >
>> > My current setup is a 4 node cluster (tserver, master, gc, monitor)
>> running
>> > on Alpine Docker containers on a laptop with an i7 processor (8 cores)
>> with
>> > 16GB of RAM. As an example I'm running a count of all entries for a
>> table
>> > with 6.3M entries with "accumulo shell -u root -p secret  -e "scan -t
>> > benchmark_table -np" | wc -l" and it takes 43 seconds. Not sure if this
>> is
>> > reasonable or not. Seems a little slow to me. What do you think?
>> >
>> > BR,
>> > Guy.
>> >
>> >
>> >
>> >
>> > On Mon, Aug 27, 2018 at 4:43 PM Michael Wall  wrote:
>> >
>> > > Hi Guy,
>> > >
>> > > Here are a couple links I found.  Can you tell us more about your
>> setup
>> > > and what you are seeing?
>> > >
>> > > https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
>> > > https://www.youtube.com/watch?v=Ae9THpmpFpM
>> > >
>> > > Mike
>> > >
>> > >
>> > > On Sat, Aug 25, 2018 at 5:09 PM guy sharon > >
>> > > wrote:
>> > >
>> > >> hi,
>> > >>
>> > >> I've just started working with Accumulo and I think I'm experiencing
>> slow
>> > >> reads/writes. I'm aware of the recommended configuration. Does
>> anyone know
>> > >> of any standard benchmarks and benchmarking tools I can use to tell
>> if the
>> > >> performance I'm getting is reasonable?
>> > >>
>> > >>
>> > >>
>> >
>>
>


Re: benchmarking

2018-08-28 Thread guy sharon
hi Sean,

Thanks for the advice. I tried bringing up a 5 tserver cluster on AWS with
Muchos (https://github.com/apache/fluo-muchos). My first attempt was using
servers with 2 vCPU, 8GB RAM (m5d.large on AWS). The Hadoop datanodes were
colocated with the tservers and the Accumulo master was on the same server
as the Hadoop namenode. I populated a table with 6M entries using a
modified version of
org.apache.accumulo.examples.simple.helloworld.InsertWithBatchWriter from
Accumulo (the only thing I modified was the number of entries as it usually
inserts 50k). I then did a count with "bin/accumulo shell -u root -p secret
-e "scan -t hellotable -np" | wc -l". That took 15 seconds. I then upgraded
to m5d.xlarge instances (4vCPU, 16GB RAM) and got the exact same result, so
it seems upgrading the servers doesn't help.

Is this expected or am I doing something terribly wrong?

BR,
Guy.



On Tue, Aug 28, 2018 at 10:38 AM Sean Busbey  wrote:

> Hi Guy,
>
> Apache Accumulo is designed for horizontally scaling out for large scale
> workloads that need to do random reads and writes. There's a non-trivial
> amount of overhead that comes with a system aimed at doing that on
> thousands of nodes.
>
> If your use case works for a single laptop with such a small number of
> entries and exhaustive scans, then Accumulo is probably not the correct
> tool for the job.
>
> For example, on my laptop (i7 2 cores, 8GiB memory) with that dataset size
> you can just rely on a file format like Apache Avro:
>
> busbey$ time java -jar avro-tools-1.7.7.jar random --codec snappy --count
> 6300000 --schema '{ "type": "record", "name": "entry", "fields": [ {
> "name": "field0", "type": "string" } ] }' ~/Downloads/6.3m_entries.avro
> Aug 28, 2018 12:31:13 AM org.apache.hadoop.util.NativeCodeLoader 
> WARNING: Unable to load native-hadoop library for your platform... using
> builtin-java classes where applicable
> test.seed=1535441473243
>
> real  0m5.451s
> user  0m5.922s
> sys   0m0.656s
> busbey$ ls -lah ~/Downloads/6.3m_entries.avro
> -rwxrwxrwx  1 busbey  staff   186M Aug 28 00:31
> /Users/busbey/Downloads/6.3m_entries.avro
> busbey$ time java -jar avro-tools-1.7.7.jar tojson
> ~/Downloads/6.3m_entries.avro | wc -l
>  6300000
>
> real  0m4.239s
> user  0m6.026s
> sys   0m0.721s
>
> I'd recommend that you start at >= 5 nodes if you want to look at rough
> per-node throughput capabilities.
>
>
> On 2018/08/28 06:59:38, guy sharon  wrote:
> > hi Mike,
> >
> > Thanks for the links.
> >
> > My current setup is a 4 node cluster (tserver, master, gc, monitor)
> running
> > on Alpine Docker containers on a laptop with an i7 processor (8 cores)
> with
> > 16GB of RAM. As an example I'm running a count of all entries for a table
> > with 6.3M entries with "accumulo shell -u root -p secret  -e "scan -t
> > benchmark_table -np" | wc -l" and it takes 43 seconds. Not sure if this
> is
> > reasonable or not. Seems a little slow to me. What do you think?
> >
> > BR,
> > Guy.
> >
> >
> >
> >
> > On Mon, Aug 27, 2018 at 4:43 PM Michael Wall  wrote:
> >
> > > Hi Guy,
> > >
> > > Here are a couple links I found.  Can you tell us more about your setup
> > > and what you are seeing?
> > >
> > > https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
> > > https://www.youtube.com/watch?v=Ae9THpmpFpM
> > >
> > > Mike
> > >
> > >
> > > On Sat, Aug 25, 2018 at 5:09 PM guy sharon 
> > > wrote:
> > >
> > >> hi,
> > >>
> > >> I've just started working with Accumulo and I think I'm experiencing
> slow
> > >> reads/writes. I'm aware of the recommended configuration. Does anyone
> know
> > >> of any standard benchmarks and benchmarking tools I can use to tell
> if the
> > >> performance I'm getting is reasonable?
> > >>
> > >>
> > >>
> >
>


Re: benchmarking

2018-08-28 Thread Michael Wall
Hi Guy,

I can't say if that is reasonable without more info.  How are you running
datanodes, namenodes and zookeepers?  Also, what are the JVM options for
each process?  Can you share your Dockerfiles?  What OS are you on?  How
much of the host's CPU and memory can Docker use?  What is the data in your benchmark_table?

Like Sean mentioned, running multiple tservers will help to distribute the
load.  You may or may not have headroom.  It is possible to run multiple
tservers on the same host, even without docker.

Like Jeremy mentioned, I have seen better performance than you are getting
on a single-node cluster, but I usually use the standalone MiniAccumulo for
that, not a full cluster setup with HDFS.

Mike

On Tue, Aug 28, 2018 at 2:59 AM guy sharon 
wrote:

> hi Mike,
>
> Thanks for the links.
>
> My current setup is a 4 node cluster (tserver, master, gc, monitor)
> running on Alpine Docker containers on a laptop with an i7 processor (8
> cores) with 16GB of RAM. As an example I'm running a count of all entries
> for a table with 6.3M entries with "accumulo shell -u root -p secret  -e
> "scan -t benchmark_table -np" | wc -l" and it takes 43 seconds. Not sure if
> this is reasonable or not. Seems a little slow to me. What do you think?
>
> BR,
> Guy.
>
>
>
>
> On Mon, Aug 27, 2018 at 4:43 PM Michael Wall  wrote:
>
>> Hi Guy,
>>
>> Here are a couple links I found.  Can you tell us more about your setup
>> and what you are seeing?
>>
>> https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
>> https://www.youtube.com/watch?v=Ae9THpmpFpM
>>
>> Mike
>>
>>
>> On Sat, Aug 25, 2018 at 5:09 PM guy sharon 
>> wrote:
>>
>>> hi,
>>>
>>> I've just started working with Accumulo and I think I'm experiencing
>>> slow reads/writes. I'm aware of the recommended configuration. Does anyone
>>> know of any standard benchmarks and benchmarking tools I can use to tell if
>>> the performance I'm getting is reasonable?
>>>
>>>
>>>


Re: benchmarking

2018-08-28 Thread Jeremy Kepner
FYI, Single node Accumulo instances are our most popular deployment.
We have hundreds of them.   Accumulo is so fast that it can replace
what would normally require 20 MySQL servers.

Regards.  -Jeremy

On Tue, Aug 28, 2018 at 07:38:37AM +, Sean Busbey wrote:
> Hi Guy,
> 
> Apache Accumulo is designed for horizontally scaling out for large scale 
> workloads that need to do random reads and writes. There's a non-trivial 
> amount of overhead that comes with a system aimed at doing that on thousands 
> of nodes.
> 
> If your use case works for a single laptop with such a small number of 
> entries and exhaustive scans, then Accumulo is probably not the correct tool 
> for the job.
> 
> For example, on my laptop (i7 2 cores, 8GiB memory) with that dataset size 
> you can just rely on a file format like Apache Avro:
> 
> busbey$ time java -jar avro-tools-1.7.7.jar random --codec snappy --count 
> 6300000 --schema '{ "type": "record", "name": "entry", "fields": [ { "name": 
> "field0", "type": "string" } ] }' ~/Downloads/6.3m_entries.avro
> Aug 28, 2018 12:31:13 AM org.apache.hadoop.util.NativeCodeLoader 
> WARNING: Unable to load native-hadoop library for your platform... using 
> builtin-java classes where applicable
> test.seed=1535441473243
> 
> real  0m5.451s
> user  0m5.922s
> sys   0m0.656s
> busbey$ ls -lah ~/Downloads/6.3m_entries.avro 
> -rwxrwxrwx  1 busbey  staff   186M Aug 28 00:31 
> /Users/busbey/Downloads/6.3m_entries.avro
> busbey$ time java -jar avro-tools-1.7.7.jar tojson 
> ~/Downloads/6.3m_entries.avro | wc -l
>  6300000
> 
> real  0m4.239s
> user  0m6.026s
> sys   0m0.721s
> 
> I'd recommend that you start at >= 5 nodes if you want to look at rough 
> per-node throughput capabilities.
> 
> 
> On 2018/08/28 06:59:38, guy sharon  wrote: 
> > hi Mike,
> > 
> > Thanks for the links.
> > 
> > My current setup is a 4 node cluster (tserver, master, gc, monitor) running
> > on Alpine Docker containers on a laptop with an i7 processor (8 cores) with
> > 16GB of RAM. As an example I'm running a count of all entries for a table
> > with 6.3M entries with "accumulo shell -u root -p secret  -e "scan -t
> > benchmark_table -np" | wc -l" and it takes 43 seconds. Not sure if this is
> > reasonable or not. Seems a little slow to me. What do you think?
> > 
> > BR,
> > Guy.
> > 
> > 
> > 
> > 
> > On Mon, Aug 27, 2018 at 4:43 PM Michael Wall  wrote:
> > 
> > > Hi Guy,
> > >
> > > Here are a couple links I found.  Can you tell us more about your setup
> > > and what you are seeing?
> > >
> > > https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
> > > https://www.youtube.com/watch?v=Ae9THpmpFpM
> > >
> > > Mike
> > >
> > >
> > > On Sat, Aug 25, 2018 at 5:09 PM guy sharon 
> > > wrote:
> > >
> > >> hi,
> > >>
> > >> I've just started working with Accumulo and I think I'm experiencing slow
> > >> reads/writes. I'm aware of the recommended configuration. Does anyone 
> > >> know
> > >> of any standard benchmarks and benchmarking tools I can use to tell if 
> > >> the
> > >> performance I'm getting is reasonable?
> > >>
> > >>
> > >>
> > 


Re: benchmarking

2018-08-28 Thread Sean Busbey
Hi Guy,

Apache Accumulo is designed for horizontally scaling out for large scale 
workloads that need to do random reads and writes. There's a non-trivial amount 
of overhead that comes with a system aimed at doing that on thousands of nodes.

If your use case works for a single laptop with such a small number of entries 
and exhaustive scans, then Accumulo is probably not the correct tool for the 
job.

For example, on my laptop (i7 2 cores, 8GiB memory) with that dataset size you 
can just rely on a file format like Apache Avro:

busbey$ time java -jar avro-tools-1.7.7.jar random --codec snappy --count 
6300000 --schema '{ "type": "record", "name": "entry", "fields": [ { "name": 
"field0", "type": "string" } ] }' ~/Downloads/6.3m_entries.avro
Aug 28, 2018 12:31:13 AM org.apache.hadoop.util.NativeCodeLoader 
WARNING: Unable to load native-hadoop library for your platform... using 
builtin-java classes where applicable
test.seed=1535441473243

real  0m5.451s
user  0m5.922s
sys   0m0.656s
busbey$ ls -lah ~/Downloads/6.3m_entries.avro 
-rwxrwxrwx  1 busbey  staff   186M Aug 28 00:31 
/Users/busbey/Downloads/6.3m_entries.avro
busbey$ time java -jar avro-tools-1.7.7.jar tojson 
~/Downloads/6.3m_entries.avro | wc -l
 6300000

real  0m4.239s
user  0m6.026s
sys   0m0.721s

I'd recommend that you start at >= 5 nodes if you want to look at rough 
per-node throughput capabilities.


On 2018/08/28 06:59:38, guy sharon  wrote: 
> hi Mike,
> 
> Thanks for the links.
> 
> My current setup is a 4 node cluster (tserver, master, gc, monitor) running
> on Alpine Docker containers on a laptop with an i7 processor (8 cores) with
> 16GB of RAM. As an example I'm running a count of all entries for a table
> with 6.3M entries with "accumulo shell -u root -p secret  -e "scan -t
> benchmark_table -np" | wc -l" and it takes 43 seconds. Not sure if this is
> reasonable or not. Seems a little slow to me. What do you think?
> 
> BR,
> Guy.
> 
> 
> 
> 
> On Mon, Aug 27, 2018 at 4:43 PM Michael Wall  wrote:
> 
> > Hi Guy,
> >
> > Here are a couple links I found.  Can you tell us more about your setup
> > and what you are seeing?
> >
> > https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
> > https://www.youtube.com/watch?v=Ae9THpmpFpM
> >
> > Mike
> >
> >
> > On Sat, Aug 25, 2018 at 5:09 PM guy sharon 
> > wrote:
> >
> >> hi,
> >>
> >> I've just started working with Accumulo and I think I'm experiencing slow
> >> reads/writes. I'm aware of the recommended configuration. Does anyone know
> >> of any standard benchmarks and benchmarking tools I can use to tell if the
> >> performance I'm getting is reasonable?
> >>
> >>
> >>
> 


Re: benchmarking

2018-08-28 Thread guy sharon
hi Mike,

Thanks for the links.

My current setup is a 4 node cluster (tserver, master, gc, monitor) running
on Alpine Docker containers on a laptop with an i7 processor (8 cores) with
16GB of RAM. As an example I'm running a count of all entries for a table
with 6.3M entries with "accumulo shell -u root -p secret  -e "scan -t
benchmark_table -np" | wc -l" and it takes 43 seconds. Not sure if this is
reasonable or not. Seems a little slow to me. What do you think?

BR,
Guy.




On Mon, Aug 27, 2018 at 4:43 PM Michael Wall  wrote:

> Hi Guy,
>
> Here are a couple links I found.  Can you tell us more about your setup
> and what you are seeing?
>
> https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
> https://www.youtube.com/watch?v=Ae9THpmpFpM
>
> Mike
>
>
> On Sat, Aug 25, 2018 at 5:09 PM guy sharon 
> wrote:
>
>> hi,
>>
>> I've just started working with Accumulo and I think I'm experiencing slow
>> reads/writes. I'm aware of the recommended configuration. Does anyone know
>> of any standard benchmarks and benchmarking tools I can use to tell if the
>> performance I'm getting is reasonable?
>>
>>
>>


Re: benchmarking

2018-08-27 Thread Mike Walch
Hi Guy,

If you are looking to improve performance, you should also check out the
2.0 documentation below:

https://accumulo.apache.org/docs/2.0/troubleshooting/performance

-Mike

On Mon, Aug 27, 2018 at 9:43 AM Michael Wall  wrote:

> Hi Guy,
>
> Here are a couple links I found.  Can you tell us more about your setup
> and what you are seeing?
>
> https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
> https://www.youtube.com/watch?v=Ae9THpmpFpM
>
> Mike
>
>
> On Sat, Aug 25, 2018 at 5:09 PM guy sharon 
> wrote:
>
>> hi,
>>
>> I've just started working with Accumulo and I think I'm experiencing slow
>> reads/writes. I'm aware of the recommended configuration. Does anyone know
>> of any standard benchmarks and benchmarking tools I can use to tell if the
>> performance I'm getting is reasonable?
>>
>>
>>


Re: benchmarking

2018-08-27 Thread Michael Wall
Hi Guy,

Here are a couple links I found.  Can you tell us more about your setup and
what you are seeing?

https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
https://www.youtube.com/watch?v=Ae9THpmpFpM

Mike


On Sat, Aug 25, 2018 at 5:09 PM guy sharon 
wrote:

> hi,
>
> I've just started working with Accumulo and I think I'm experiencing slow
> reads/writes. I'm aware of the recommended configuration. Does anyone know
> of any standard benchmarks and benchmarking tools I can use to tell if the
> performance I'm getting is reasonable?
>
>
>


benchmarking

2018-08-25 Thread guy sharon
hi,

I've just started working with Accumulo and I think I'm experiencing slow
reads/writes. I'm aware of the recommended configuration. Does anyone know
of any standard benchmarks and benchmarking tools I can use to tell if the
performance I'm getting is reasonable?


Re: Accumulo Caching for benchmarking

2012-08-07 Thread Steven Troxell
Are there other considerations I should be aware of to ensure independent
runs outside of stopping/restarting tablet servers and clearing OS cache?

I ran a test with 2 tablet servers active,  got 1 query to come back in 10
hours.   Ran /bin/stop-all and ./bin/start-all to get a comparison test
with 10 tservers,   cleared the cache using Eric's command on the 2 tablet
servers I had used for the first run before, and now I already had 4
queries return in under 2 minutes.

These could be awesome performance gains, but I'm a bit skeptical, especially
considering the client code isn't even using batchscans (as well as
assorted other inefficiencies).

Is there some other dependency between the tests I haven't accounted for?

On Mon, Aug 6, 2012 at 2:41 PM, Steven Troxell steven.trox...@gmail.com wrote:

 For anyone else curious about this, it seems the OS caching played a much
 larger role for me than TServer caching.  I actually measured performance
 increase after just stopping/restarting TServers to clear cache. (could
 also have been biased by being a weekend run on the cluster).

 However, I noticed an immediate difference when clearing the OS cache
 through Eric's command: the first few queries, which had generally been
 returning in tenths of seconds, were now up in the minutes range.




 On Sat, Aug 4, 2012 at 1:21 PM, Steven Troxell 
  steven.trox...@gmail.com wrote:

 thanks everyone, that should definitely help me out,  while I feel silly
 for ignoring this issue at first, it should be interesting to see how much
 this influences the results.



  On Sat, Aug 4, 2012 at 7:19 AM, Eric Newton eric.new...@gmail.com wrote:

 You can drop the OS caches between runs:

 # echo 1 > /proc/sys/vm/drop_caches
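Expanded into a between-runs helper, as a sketch under stated assumptions (Linux; `echo 1` drops only the page cache, while `echo 3` would also drop dentry and inode caches; the function name is made up here):

```shell
#!/bin/bash
# Sketch: flush dirty pages, then drop the OS page cache before the next run.
# The /proc write needs root; otherwise it is skipped with a note, so the
# script is safe to run anywhere. Function name is illustrative.
drop_os_caches() {
    sync    # flush dirty pages to disk first, so dropping the cache is clean
    if [ "$(id -u)" -eq 0 ]; then
        echo 1 2>/dev/null > /proc/sys/vm/drop_caches \
            || echo "drop_caches write failed (read-only /proc?)" >&2
    else
        echo "drop_caches skipped: not root" >&2
    fi
}
drop_os_caches
```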


  On Fri, Aug 3, 2012 at 9:41 PM, Christopher Tubbs ctubb...@gmail.com wrote:

 Steve-

 I would probably design the experiment to test different cluster sizes
 as completely independent. That means, taking the entire thing down
 and back up again (possibly even rebooting the boxes, and/or
 re-initializing the cluster at the new size). I'd also do several runs
 while it is up at a particular cluster size, to capture any
 performance difference between the first and a later run due to OS or
 TServer caching, for analysis later.

 Essentially, when in doubt, take more data...

 --L
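Christopher's protocol can be sketched as a harness skeleton (the cluster-control and query commands here are placeholders, not real Accumulo invocations):

```shell
#!/bin/bash
# Skeleton of the protocol above: take the whole cluster down and back up at
# each size, then repeat runs at that size. restart_cluster and run_query are
# placeholders for your own stop-all/start-all scripts and client invocation.
restart_cluster() { echo "restart with $1 tservers (placeholder)"; }
run_query()       { echo "size $1, run $2 (placeholder)"; }

for size in 10 8; do
    restart_cluster "$size"     # fully independent setup per cluster size
    for run in 1 2 3; do        # several runs: first is cold, later ones warm
        run_query "$size" "$run"
    done
done
```

Recording per-run timings inside `run_query` then lets you separate cold-cache from warm-cache behaviour at each cluster size.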


 On Fri, Aug 3, 2012 at 5:50 PM, Steven Troxell 
 steven.trox...@gmail.com wrote:
  Hi  all,
 
  I am running a benchmarking project on accumulo looking at RDF
 queries for
  clusters with different node sizes.   While I intend to look at
 caching for
  optimizing each individual run, I do NOT want caching to
 interfere for
  example between runs involving the use of 10 and 8 tablet servers.
 
  Up to now I'd just been killing nodes via the bin/stop-here.sh script
 but I
  realize that may have allowed caching from previous runs with
 different node
  sizes to influence my results.   It seemed weird to me for example
 when I
  realized dropping nodes actually increased performance (as measured
 by query
  return times) in some cases (though I acknowledge the code I'm
 working with
  has some serious issues with how ineffectively it is actually
 utilizing
  accumulo, but that's an issue I intend to address later).
 
  I suppose one way would be between a change of node sizes,  stop and
 restart
  ALL nodes ( as opposed to what I'd been doing in just killing 2 nodes
 for
  example in transitioning from a 10 to 8 node test).  Will this be
 sure to
  clear the influence of caching across runs, and is there any cleaner
 way to
  do this?
 
  thanks,
  Steve







Re: Accumulo Caching for benchmarking

2012-08-07 Thread Eric Newton
Index caching is on by default in 1.4, and it's not particularly large.
 So, if your index suddenly fit entirely in cache with 10 servers, you
would see much better performance.

-Eric

On Tue, Aug 7, 2012 at 10:57 AM, Steven Troxell steven.trox...@gmail.com wrote:

 Are there other considerations I should be aware of to ensure independent
 runs outside of stopping/restarting tablet servers and clearing OS cache?

 I ran a test with 2 tablet servers active,  got 1 query to come back in 10
 hours.   Ran /bin/stop-all and ./bin/start-all to get a comparison test
 with 10 tservers,   cleared the cache using Eric's command on the 2 tablet
 servers I had used for the first run before, and now I already had 4
 queries return in under 2 minutes.

 These could be awesome performance gains, but I'm a bit skeptical,
 especially considering the client code isn't even using batchscans (as well
 as assorted other inefficiencies).

 Is there some other dependency between the tests I haven't accounted for?


 On Mon, Aug 6, 2012 at 2:41 PM, Steven Troxell 
  steven.trox...@gmail.com wrote:

 For anyone else curious about this, it seems the OS caching played a much
 larger role for me than TServer caching.  I actually measured performance
 increase after just stopping/restarting TServers to clear cache. (could
 also have been biased by being a weekend run on the cluster).

 However, I noticed an immediate difference when clearing the OS cache
 through Eric's command: the first few queries, which had generally been
 returning in tenths of seconds, were now up in the minutes range.




 On Sat, Aug 4, 2012 at 1:21 PM, Steven Troxell 
  steven.trox...@gmail.com wrote:

 thanks everyone, that should definitely help me out,  while I feel silly
 for ignoring this issue at first, it should be interesting to see how much
 this influences the results.



  On Sat, Aug 4, 2012 at 7:19 AM, Eric Newton eric.new...@gmail.com wrote:

 You can drop the OS caches between runs:

 # echo 1 > /proc/sys/vm/drop_caches


 On Fri, Aug 3, 2012 at 9:41 PM, Christopher Tubbs 
  ctubb...@gmail.com wrote:

 Steve-

 I would probably design the experiment to test different cluster sizes
 as completely independent. That means, taking the entire thing down
 and back up again (possibly even rebooting the boxes, and/or
 re-initializing the cluster at the new size). I'd also do several runs
 while it is up at a particular cluster size, to capture any
 performance difference between the first and a later run due to OS or
 TServer caching, for analysis later.

 Essentially, when in doubt, take more data...

 --L


 On Fri, Aug 3, 2012 at 5:50 PM, Steven Troxell 
 steven.trox...@gmail.com wrote:
  Hi all,

  I am running a benchmarking project on Accumulo looking at RDF queries for
  clusters with different node sizes.  While I intend to look at caching for
  optimizing each individual run, I do NOT want caching to interfere, for
  example, between runs involving the use of 10 and 8 tablet servers.

  Up to now I'd just been killing nodes via the bin/stop-here.sh script, but I
  realize that may have allowed caching from previous runs with different node
  sizes to influence my results.  It seemed weird to me, for example, when I
  realized dropping nodes actually increased performance (as measured by query
  return times) in some cases (though I acknowledge the code I'm working with
  has some serious issues with how ineffectively it is actually utilizing
  Accumulo, but that's an issue I intend to address later).

  I suppose one way would be, between a change of node sizes, to stop and
  restart ALL nodes (as opposed to what I'd been doing in just killing 2
  nodes, for example, in transitioning from a 10- to 8-node test).  Will this
  be sure to clear the influence of caching across runs, and is there any
  cleaner way to do this?

  thanks,
  Steve








Re: Accumulo Caching for benchmarking

2012-08-06 Thread Steven Troxell
For anyone else curious about this, it seems the OS caching played a much
larger role for me than TServer caching.  I actually measured a performance
increase after just stopping/restarting TServers to clear cache (could
also have been biased by being a weekend run on the cluster).

However, I noticed an immediate difference when clearing the OS caches
through Eric's command: the first few queries, which had generally been
returning in tenths of a second, were now up in the minutes range.



On Sat, Aug 4, 2012 at 1:21 PM, Steven Troxell steven.trox...@gmail.com wrote:

 thanks everyone, that should definitely help me out.  While I feel silly
 for ignoring this issue at first, it should be interesting to see how much
 this influences the results.



 On Sat, Aug 4, 2012 at 7:19 AM, Eric Newton eric.new...@gmail.com wrote:

 You can drop the OS caches between runs:

 # echo 1 > /proc/sys/vm/drop_caches


 On Fri, Aug 3, 2012 at 9:41 PM, Christopher Tubbs ctubb...@gmail.com wrote:

 Steve-

 I would probably design the experiment to test different cluster sizes
 as completely independent. That means, taking the entire thing down
 and back up again (possibly even rebooting the boxes, and/or
 re-initializing the cluster at the new size). I'd also do several runs
 while it is up at a particular cluster size, to capture any
 performance difference between the first and a later run due to OS or
 TServer caching, for analysis later.

 Essentially, when in doubt, take more data...

 --L


 On Fri, Aug 3, 2012 at 5:50 PM, Steven Troxell steven.trox...@gmail.com
 wrote:
  Hi all,

  I am running a benchmarking project on Accumulo looking at RDF queries for
  clusters with different node sizes.  While I intend to look at caching for
  optimizing each individual run, I do NOT want caching to interfere, for
  example, between runs involving the use of 10 and 8 tablet servers.

  Up to now I'd just been killing nodes via the bin/stop-here.sh script, but I
  realize that may have allowed caching from previous runs with different node
  sizes to influence my results.  It seemed weird to me, for example, when I
  realized dropping nodes actually increased performance (as measured by query
  return times) in some cases (though I acknowledge the code I'm working with
  has some serious issues with how ineffectively it is actually utilizing
  Accumulo, but that's an issue I intend to address later).

  I suppose one way would be, between a change of node sizes, to stop and
  restart ALL nodes (as opposed to what I'd been doing in just killing 2
  nodes, for example, in transitioning from a 10- to 8-node test).  Will this
  be sure to clear the influence of caching across runs, and is there any
  cleaner way to do this?

  thanks,
  Steve






Re: Accumulo Caching for benchmarking

2012-08-04 Thread Eric Newton
You can drop the OS caches between runs:

# echo 1 > /proc/sys/vm/drop_caches
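
As context for the command above: writing 1 to drop_caches frees only the page
cache, while 3 also drops dentry and inode caches, and a sync beforehand
flushes dirty pages so the drop is complete.  A minimal sketch (the helper name
is mine, and it must run as root on every tablet server):

```shell
# Sketch of a per-run cache reset; run as root on each node between benchmarks.
drop_caches() {
  sync                                # flush dirty pages to disk first
  # 1 = page cache only, 2 = dentries and inodes, 3 = both
  echo 3 > /proc/sys/vm/drop_caches
}
```

Note this only clears the OS cache; the TServer block cache is cleared by
restarting the tablet servers, as discussed earlier in the thread.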


On Fri, Aug 3, 2012 at 9:41 PM, Christopher Tubbs ctubb...@gmail.com wrote:

 Steve-

 I would probably design the experiment to test different cluster sizes
 as completely independent. That means, taking the entire thing down
 and back up again (possibly even rebooting the boxes, and/or
 re-initializing the cluster at the new size). I'd also do several runs
 while it is up at a particular cluster size, to capture any
 performance difference between the first and a later run due to OS or
 TServer caching, for analysis later.

 Essentially, when in doubt, take more data...

 --L


 On Fri, Aug 3, 2012 at 5:50 PM, Steven Troxell steven.trox...@gmail.com
 wrote:
  Hi all,

  I am running a benchmarking project on Accumulo looking at RDF queries for
  clusters with different node sizes.  While I intend to look at caching for
  optimizing each individual run, I do NOT want caching to interfere, for
  example, between runs involving the use of 10 and 8 tablet servers.

  Up to now I'd just been killing nodes via the bin/stop-here.sh script, but I
  realize that may have allowed caching from previous runs with different node
  sizes to influence my results.  It seemed weird to me, for example, when I
  realized dropping nodes actually increased performance (as measured by query
  return times) in some cases (though I acknowledge the code I'm working with
  has some serious issues with how ineffectively it is actually utilizing
  Accumulo, but that's an issue I intend to address later).

  I suppose one way would be, between a change of node sizes, to stop and
  restart ALL nodes (as opposed to what I'd been doing in just killing 2
  nodes, for example, in transitioning from a 10- to 8-node test).  Will this
  be sure to clear the influence of caching across runs, and is there any
  cleaner way to do this?

  thanks,
  Steve



Re: Accumulo Caching for benchmarking

2012-08-03 Thread Christopher Tubbs
Steve-

I would probably design the experiment to test different cluster sizes
as completely independent. That means, taking the entire thing down
and back up again (possibly even rebooting the boxes, and/or
re-initializing the cluster at the new size). I'd also do several runs
while it is up at a particular cluster size, to capture any
performance difference between the first and a later run due to OS or
TServer caching, for analysis later.

Essentially, when in doubt, take more data...

--L
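
Christopher's protocol (full restart per cluster size, a cache drop, then
several runs per size) could be scripted roughly as below.  The driver name
run_query.sh and the conf/slaves host list are placeholders for whatever your
benchmark actually invokes; passwordless root ssh to the nodes is assumed:

```shell
# Hypothetical harness: treats each cluster size as an independent experiment.
run_trials() {
  size=$1; trials=$2
  ./bin/stop-all.sh && ./bin/start-all.sh          # full restart at this size
  for host in $(head -n "$size" conf/slaves); do
    ssh "root@$host" 'sync; echo 3 > /proc/sys/vm/drop_caches'  # cold OS cache
  done
  mkdir -p results
  for t in $(seq 1 "$trials"); do                  # several runs: run 1 is cold;
    /usr/bin/time -p ./run_query.sh \
      2> "results/${size}nodes_run${t}.time"       # later runs show warm-cache times
  done
}
```

Keeping the cold-cache and warm-cache timings separate per cluster size is what
makes the "take more data" advice usable for analysis later.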


On Fri, Aug 3, 2012 at 5:50 PM, Steven Troxell steven.trox...@gmail.com wrote:
 Hi all,

 I am running a benchmarking project on Accumulo looking at RDF queries for
 clusters with different node sizes.  While I intend to look at caching for
 optimizing each individual run, I do NOT want caching to interfere, for
 example, between runs involving the use of 10 and 8 tablet servers.

 Up to now I'd just been killing nodes via the bin/stop-here.sh script, but I
 realize that may have allowed caching from previous runs with different node
 sizes to influence my results.  It seemed weird to me, for example, when I
 realized dropping nodes actually increased performance (as measured by query
 return times) in some cases (though I acknowledge the code I'm working with
 has some serious issues with how ineffectively it is actually utilizing
 Accumulo, but that's an issue I intend to address later).

 I suppose one way would be, between a change of node sizes, to stop and
 restart ALL nodes (as opposed to what I'd been doing in just killing 2
 nodes, for example, in transitioning from a 10- to 8-node test).  Will this
 be sure to clear the influence of caching across runs, and is there any
 cleaner way to do this?

 thanks,
 Steve


Re: Software used for benchmarking a cloud

2012-06-04 Thread William Slacum
Accumulo comes with continuous ingest and query scripts that can
benchmark how many KV-pairs you have coming into and out of a cluster.
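
A rough sketch of running those scripts follows.  The paths and script names
below are from memory of the Accumulo 1.4-era source layout
(test/system/continuous) and may differ in your release; check the README in
that directory before relying on them:

```shell
# Sketch only: script names assumed from the 1.4-era continuous ingest suite.
cd "$ACCUMULO_HOME/test/system/continuous"
cp continuous-env.sh.example continuous-env.sh  # set instance, table, credentials
./start-ingest.sh       # begin writing randomly linked entries
./start-walkers.sh      # random-walk queries over what was ingested
./stop-ingest.sh        # stop; compare ingest/query rates on the monitor page
```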

On Mon, Jun 4, 2012 at 12:57 PM, Hider, Sandy sandy.hi...@jhuapl.edu wrote:
 All,

 I am wondering what others have used to benchmark a cloud.  Has anyone found
 open-source software for this, or rolled their own?



 Thanks in advance,



 Sandy