Re: benchmarking
Our nodes are usually 20+ cores and 100+ GB RAM.

On Tue, Aug 28, 2018 at 10:18:24PM +0300, guy sharon wrote:
> hi Jeremy,
>
> Do you have any information on how you configure them and what kind of
> hardware they run on?
>
> Thanks,
> Guy.
Re: benchmarking
hi Jeremy,

Do you have any information on how you configure them and what kind of
hardware they run on?

Thanks,
Guy.

On Tue, Aug 28, 2018 at 3:44 PM Jeremy Kepner wrote:
> FYI, single node Accumulo instances are our most popular deployment.
> We have hundreds of them. Accumulo is so fast that it can replace
> what would normally require 20 MySQL servers.
>
> Regards. -Jeremy
Re: benchmarking
Measuring scan performance by piping output from the shell is not the best
way; a lot of time is wasted printing output to the terminal. You are better
off measuring with the BatchScanner API directly. An example can be found
here:

https://accumulo.apache.org/tour/batch-scanner/

On Tue, Aug 28, 2018 at 2:50 PM guy sharon wrote:
> hi Sean,
>
> Thanks for the advice. I tried bringing up a 5 tserver cluster on AWS
> with Muchos (https://github.com/apache/fluo-muchos). My first attempt was
> using servers with 2 vCPU and 8GB RAM (m5d.large on AWS). The Hadoop
> datanodes were colocated with the tservers, and the Accumulo master was
> on the same server as the Hadoop namenode. I populated a table with 6M
> entries and then did a count with "bin/accumulo shell -u root -p secret
> -e "scan -t hellotable -np" | wc -l". That took 15 seconds. I then
> upgraded to m5d.xlarge instances (4 vCPU, 16GB RAM) and got exactly the
> same result, so it seems upgrading the servers doesn't help.
>
> Is this expected or am I doing something terribly wrong?
>
> BR,
> Guy.
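Mike's point is easy to demonstrate without a cluster: most of the cost of counting via `shell ... | wc -l` is formatting and writing every entry, not iterating over it. Below is a minimal Python sketch of that idea, using an in-memory list of pairs as a hypothetical stand-in for a scanner (nothing here is Accumulo API):

```python
import io
import time

# Hypothetical stand-in for a table scan: 200k (key, value) pairs in memory.
entries = [(f"row{i:07d}", f"value{i}") for i in range(200_000)]

def count_only(scan):
    # What a client-side count needs to do: iterate, never format or print.
    return sum(1 for _ in scan)

def count_via_printing(scan, out):
    # What `shell ... | wc -l` measures: format every entry, write it to a
    # stream, and count lines on the other end.
    n = 0
    for key, value in scan:
        out.write(f"{key} {value}\n")
        n += 1
    return n

t0 = time.perf_counter()
n_fast = count_only(iter(entries))
t1 = time.perf_counter()
n_slow = count_via_printing(iter(entries), io.StringIO())
t2 = time.perf_counter()

print(f"count only: {n_fast} in {t1 - t0:.3f}s; "
      f"format+print: {n_slow} in {t2 - t1:.3f}s")
```

On a real cluster the gap is larger still, since the shell also round-trips every entry over the network before `wc` ever sees it.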
Re: benchmarking
hi Sean,

Thanks for the advice. I tried bringing up a 5 tserver cluster on AWS with
Muchos (https://github.com/apache/fluo-muchos). My first attempt was using
servers with 2 vCPU and 8GB RAM (m5d.large on AWS). The Hadoop datanodes
were colocated with the tservers, and the Accumulo master was on the same
server as the Hadoop namenode. I populated a table with 6M entries using a
modified version of
org.apache.accumulo.examples.simple.helloworld.InsertWithBatchWriter from
Accumulo (the only thing I modified was the number of entries, as it
usually inserts 50k). I then did a count with "bin/accumulo shell -u root
-p secret -e "scan -t hellotable -np" | wc -l". That took 15 seconds. I
then upgraded to m5d.xlarge instances (4 vCPU, 16GB RAM) and got exactly
the same result, so it seems upgrading the servers doesn't help.

Is this expected or am I doing something terribly wrong?

BR,
Guy.

On Tue, Aug 28, 2018 at 10:38 AM Sean Busbey wrote:
> Hi Guy,
>
> Apache Accumulo is designed for horizontally scaling out for large scale
> workloads that need to do random reads and writes. There's a non-trivial
> amount of overhead that comes with a system aimed at doing that on
> thousands of nodes.
>
> [...]
>
> I'd recommend that you start at >= 5 nodes if you want to look at rough
> per-node throughput capabilities.
Re: benchmarking
Hi Guy,

I can't say if that is reasonable without more info. How are you running
datanodes, namenodes and zookeepers? Also, what are the JVM options for
each process? Can you share your dockerfiles? What OS are you on? How much
of your OS can Docker take? What is the data in your benchmark_table?

Like Sean mentioned, running multiple tservers will help to distribute the
load. You may or may not have headroom. It is possible to run multiple
tservers on the same host, even without Docker.

Like Jeremy mentioned, I have seen better performance than you are getting
on a single node cluster, but I usually use the standalone mini accumulo
for that, not a full cluster setup with HDFS.

Mike

On Tue, Aug 28, 2018 at 2:59 AM guy sharon wrote:
> hi Mike,
>
> Thanks for the links.
>
> My current setup is a 4 node cluster (tserver, master, gc, monitor)
> running on Alpine Docker containers on a laptop with an i7 processor
> (8 cores) with 16GB of RAM. As an example I'm running a count of all
> entries for a table with 6.3M entries with "accumulo shell -u root -p
> secret -e "scan -t benchmark_table -np" | wc -l" and it takes 43
> seconds. Not sure if this is reasonable or not. Seems a little slow to
> me. What do you think?
>
> BR,
> Guy.
Re: benchmarking
FYI, single node Accumulo instances are our most popular deployment. We
have hundreds of them. Accumulo is so fast that it can replace what would
normally require 20 MySQL servers.

Regards. -Jeremy

On Tue, Aug 28, 2018 at 07:38:37AM +0000, Sean Busbey wrote:
> Hi Guy,
>
> Apache Accumulo is designed for horizontally scaling out for large scale
> workloads that need to do random reads and writes. There's a non-trivial
> amount of overhead that comes with a system aimed at doing that on
> thousands of nodes.
>
> [...]
>
> I'd recommend that you start at >= 5 nodes if you want to look at rough
> per-node throughput capabilities.
Re: benchmarking
Hi Guy,

Apache Accumulo is designed for horizontally scaling out for large scale
workloads that need to do random reads and writes. There's a non-trivial
amount of overhead that comes with a system aimed at doing that on
thousands of nodes.

If your use case works for a single laptop with such a small number of
entries and exhaustive scans, then Accumulo is probably not the correct
tool for the job.

For example, on my laptop (i7 2 cores, 8GiB memory) with that dataset size
you can just rely on a file format like Apache Avro:

busbey$ time java -jar avro-tools-1.7.7.jar random --codec snappy --count 6300000 --schema '{ "type": "record", "name": "entry", "fields": [ { "name": "field0", "type": "string" } ] }' ~/Downloads/6.3m_entries.avro
Aug 28, 2018 12:31:13 AM org.apache.hadoop.util.NativeCodeLoader
WARNING: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
test.seed=1535441473243

real    0m5.451s
user    0m5.922s
sys     0m0.656s
busbey$ ls -lah ~/Downloads/6.3m_entries.avro
-rwxrwxrwx  1 busbey  staff  186M Aug 28 00:31 /Users/busbey/Downloads/6.3m_entries.avro
busbey$ time java -jar avro-tools-1.7.7.jar tojson ~/Downloads/6.3m_entries.avro | wc -l
 6300000

real    0m4.239s
user    0m6.026s
sys     0m0.721s

I'd recommend that you start at >= 5 nodes if you want to look at rough
per-node throughput capabilities.

On 2018/08/28 06:59:38, guy sharon wrote:
> hi Mike,
>
> Thanks for the links.
>
> My current setup is a 4 node cluster (tserver, master, gc, monitor)
> running on Alpine Docker containers on a laptop with an i7 processor
> (8 cores) with 16GB of RAM. [...]
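Sean's point generalizes beyond Avro: at this data size a flat local file is enough. Here is a standard-library-only sketch in the same spirit, with JSON lines standing in for Avro and the record count scaled down, so it only illustrates the shape of his demo rather than reproducing it:

```python
import json
import os
import tempfile
import time

# Write records with a single string field (like Sean's schema) to a local
# flat file, then scan them back and count. Count is scaled down from 6.3M
# so the demo runs in a couple of seconds anywhere.
n_records = 100_000
path = os.path.join(tempfile.mkdtemp(), "entries.jsonl")

t0 = time.perf_counter()
with open(path, "w") as f:
    for i in range(n_records):
        f.write(json.dumps({"field0": f"entry-{i}"}) + "\n")
write_s = time.perf_counter() - t0

t0 = time.perf_counter()
with open(path) as f:
    count = sum(1 for _ in f)  # the "exhaustive scan"
scan_s = time.perf_counter() - t0

print(f"wrote {n_records} in {write_s:.2f}s, scanned {count} in {scan_s:.2f}s")
```

No tservers, no HDFS, no RPC: for small exhaustive-scan workloads the flat file wins by skipping all of that machinery.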
Re: benchmarking
hi Mike,

Thanks for the links.

My current setup is a 4 node cluster (tserver, master, gc, monitor)
running on Alpine Docker containers on a laptop with an i7 processor
(8 cores) with 16GB of RAM. As an example I'm running a count of all
entries for a table with 6.3M entries with "accumulo shell -u root -p
secret -e "scan -t benchmark_table -np" | wc -l" and it takes 43 seconds.
Not sure if this is reasonable or not. Seems a little slow to me. What do
you think?

BR,
Guy.

On Mon, Aug 27, 2018 at 4:43 PM Michael Wall wrote:
> Hi Guy,
>
> Here are a couple links I found. Can you tell us more about your setup
> and what you are seeing?
>
> https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
> https://www.youtube.com/watch?v=Ae9THpmpFpM
>
> Mike
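For a sense of scale, the reported numbers work out to roughly 150k entries per second through the shell pipeline, a nontrivial share of which is the shell formatting and printing every entry rather than the scan itself:

```python
# Back-of-envelope rate for the reported scan: 6.3M entries in 43 seconds.
entries = 6_300_000
seconds = 43
rate = entries / seconds
print(f"{rate:,.0f} entries/sec")  # roughly 147k entries/sec
```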
Re: benchmarking
Hi Guy,

If you are looking to improve performance, you should also check out the
2.0 documentation below:

https://accumulo.apache.org/docs/2.0/troubleshooting/performance

-Mike

On Mon, Aug 27, 2018 at 9:43 AM Michael Wall wrote:
> Hi Guy,
>
> Here are a couple links I found. [...]
Re: benchmarking
Hi Guy,

Here are a couple links I found. Can you tell us more about your setup and
what you are seeing?

https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
https://www.youtube.com/watch?v=Ae9THpmpFpM

Mike

On Sat, Aug 25, 2018 at 5:09 PM guy sharon wrote:
> hi,
>
> I've just started working with Accumulo and I think I'm experiencing
> slow reads/writes. I'm aware of the recommended configuration. Does
> anyone know of any standard benchmarks and benchmarking tools I can use
> to tell if the performance I'm getting is reasonable?
benchmarking
hi, I've just started working with Accumulo and I think I'm experiencing slow reads/writes. I'm aware of the recommended configuration. Does anyone know of any standard benchmarks and benchmarking tools I can use to tell if the performance I'm getting is reasonable?
Re: Accumulo Caching for benchmarking
Are there other considerations I should be aware of to ensure independent
runs, outside of stopping/restarting tablet servers and clearing the OS
cache?

I ran a test with 2 tablet servers active and got 1 query to come back in
10 hours. I then ran ./bin/stop-all.sh and ./bin/start-all.sh to get a
comparison test with 10 tservers, cleared the cache using Eric's command on
the 2 tablet servers I had used for the first run, and now I already had 4
queries return in under 2 minutes. This could be awesome performance gains,
but I'm a bit skeptical, especially considering the client code isn't even
using batch scans (as well as assorted other inefficiencies). Is there some
other dependency between the tests I haven't accounted for?

On Mon, Aug 6, 2012 at 2:41 PM, Steven Troxell <steven.trox...@gmail.com> wrote:
> For anyone else curious about this, it seems the OS caching played a much
> larger role for me than TServer caching. I actually measured a
> performance increase after just stopping/restarting TServers to clear
> cache (could also have been biased by being a weekend run on the
> cluster). However, I noticed an immediate difference when clearing the OS
> cache with Eric's command: the first few queries that had generally been
> returning in tenths of seconds were now up in the minutes range.
>
> On Sat, Aug 4, 2012 at 1:21 PM, Steven Troxell <steven.trox...@gmail.com> wrote:
>> thanks everyone, that should definitely help me out. While I feel silly
>> for ignoring this issue at first, it should be interesting to see how
>> much this influences the results.
>>
>> On Sat, Aug 4, 2012 at 7:19 AM, Eric Newton <eric.new...@gmail.com> wrote:
>>> You can drop the OS caches between runs:
>>>
>>> # echo 1 > /proc/sys/vm/drop_caches
>>>
>>> On Fri, Aug 3, 2012 at 9:41 PM, Christopher Tubbs <ctubb...@gmail.com> wrote:
>>>> Steve-
>>>>
>>>> I would probably design the experiment to test different cluster
>>>> sizes as completely independent. That means taking the entire thing
>>>> down and back up again (possibly even rebooting the boxes, and/or
>>>> re-initializing the cluster at the new size). I'd also do several
>>>> runs while it is up at a particular cluster size, to capture any
>>>> performance difference between the first and a later run due to OS
>>>> or TServer caching, for analysis later. Essentially, when in doubt,
>>>> take more data...
>>>>
>>>> --L
>>>>
>>>> On Fri, Aug 3, 2012 at 5:50 PM, Steven Troxell
>>>> <steven.trox...@gmail.com> wrote:
>>>>> Hi all,
>>>>>
>>>>> I am running a benchmarking project on Accumulo looking at RDF
>>>>> queries for clusters with different node sizes. While I intend to
>>>>> look at caching for optimizing each individual run, I do NOT want
>>>>> caching to interfere, for example, between runs involving the use
>>>>> of 10 and 8 tablet servers.
>>>>>
>>>>> Up to now I'd just been killing nodes via the bin/stop-here.sh
>>>>> script, but I realize that may have allowed caching from previous
>>>>> runs with different node sizes to influence my results. It seemed
>>>>> weird to me, for example, when I realized dropping nodes actually
>>>>> increased performance (as measured by query return times) in some
>>>>> cases (though I acknowledge the code I'm working with has some
>>>>> serious issues with how ineffectively it is actually utilizing
>>>>> Accumulo, but that's an issue I intend to address later).
>>>>>
>>>>> I suppose one way would be, between a change of node sizes, to stop
>>>>> and restart ALL nodes (as opposed to what I'd been doing in just
>>>>> killing 2 nodes, for example, in transitioning from a 10 to 8 node
>>>>> test). Will this be sure to clear the influence of caching across
>>>>> runs, and is there any cleaner way to do this?
>>>>>
>>>>> thanks,
>>>>> Steve
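Christopher's "several runs, more data" advice can be sketched as a tiny harness that keeps every per-run time instead of a single average, so a cold-cache first run stands out from the warm runs. `run_query` is a hypothetical placeholder for the real benchmark query:

```python
import statistics
import time

def run_query():
    # Placeholder workload; in the real benchmark this would issue the
    # actual query against the cluster under test.
    return sum(i * i for i in range(200_000))

def timed_runs(fn, n=5):
    # Keep individual run times; don't collapse them into one number,
    # or caching effects between first and later runs become invisible.
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return times

times = timed_runs(run_query)
print(f"first run: {times[0]:.4f}s, "
      f"median of later runs: {statistics.median(times[1:]):.4f}s")
```

With a real cluster you would also drop caches (and possibly restart tservers) between configurations, as discussed above, so that runs at different cluster sizes stay independent.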
Re: Accumulo Caching for benchmarking
Index caching is on by default in 1.4, and it's not particularly large. So,
if your index suddenly fit entirely in cache with 10 servers, you would see
much better performance.

-Eric

On Tue, Aug 7, 2012 at 10:57 AM, Steven Troxell <steven.trox...@gmail.com> wrote:
> Are there other considerations I should be aware of to ensure independent
> runs, outside of stopping/restarting tablet servers and clearing the OS
> cache?
>
> [...]
Re: Software used for benchmarking a cloud
Accumulo comes with continuous ingest and query scripts that can benchmark how many KV-pairs you have coming into and out of a cluster. On Mon, Jun 4, 2012 at 12:57 PM, Hider, Sandy sandy.hi...@jhuapl.edu wrote: All, I am wondering what others have found to help perform benchmarks of a cloud? Have others found any open source software for this or rolled their own? Thanks in advance, Sandy
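[Editor's note] A sketch of driving the continuous ingest suite mentioned above, assuming the 1.x source-distribution layout under test/system/continuous (script names and paths vary between releases, so check your distribution):

```shell
cd "$ACCUMULO_HOME/test/system/continuous"
cp continuous-env.sh.example continuous-env.sh   # set instance, zookeepers, user, table here
./start-ingest.sh       # begin writing randomly linked key-value entries
./start-walkers.sh      # random-walk queries over what has been ingested
# ...let it run; watch ingest and scan rates on the monitor page...
./stop-walkers.sh
./stop-ingest.sh
./run-verify.sh         # MapReduce job that checks no ingested entries were lost
```

The ingest and walker rates reported while this runs give per-node throughput numbers comparable across cluster sizes.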