Measuring scan performance by piping output from the shell is not the best way. A lot of time is wasted printing output to the terminal. You are better off measuring the difference using the Batch Scanner API directly. An example can be found here: https://accumulo.apache.org/tour/batch-scanner/
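As a rough illustration of measuring inside the client instead of through the shell pipe, here is a minimal Java sketch that counts and times one full pass over an Iterable. ScanTimer, Result and timeFullPass are hypothetical names, not part of Accumulo, and the in-memory list in main() is only a stand-in so the sketch runs without a cluster. Accumulo's Scanner and BatchScanner are Iterable over key/value entries, so a scanner set up as in the tour page could be passed in the same way.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not an Accumulo class): counts and times a full
// pass over any Iterable without printing the entries.
class ScanTimer {

    /** Outcome of one timed pass: entries seen and elapsed wall time. */
    static final class Result {
        final long entries;
        final long elapsedNanos;

        Result(long entries, long elapsedNanos) {
            this.entries = entries;
            this.elapsedNanos = elapsedNanos;
        }

        /** Throughput in entries per second; guards against a zero elapsed time. */
        double entriesPerSecond() {
            return entries / (Math.max(1L, elapsedNanos) / 1e9);
        }
    }

    /** Iterates the source once, counting entries and measuring elapsed time. */
    static <T> Result timeFullPass(Iterable<T> source) {
        long start = System.nanoTime();
        long count = 0;
        for (T ignored : source) {
            count++; // only count; no formatting or printing per entry
        }
        return new Result(count, System.nanoTime() - start);
    }

    public static void main(String[] args) {
        // Stand-in data (assumption): a real test would pass a scanner here.
        List<String> standIn = new ArrayList<>();
        for (int i = 0; i < 1_000_000; i++) {
            standIn.add("row" + i);
        }
        Result r = timeFullPass(standIn);
        System.out.println(String.format("%d entries in %d ms (%.0f entries/s)",
                r.entries, r.elapsedNanos / 1_000_000, r.entriesPerSecond()));
    }
}
```

Counting in the client like this leaves out the per-entry formatting and printing done by the shell, so the timing should mostly reflect the scan itself.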

On Tue, Aug 28, 2018 at 2:50 PM guy sharon <[email protected]> wrote:

> hi Sean,
>
> Thanks for the advice. I tried bringing up a 5 tserver cluster on AWS with
> Muchos (https://github.com/apache/fluo-muchos). My first attempt was
> using servers with 2 vCPU, 8GB RAM (m5d.large on AWS). The Hadoop datanodes
> were colocated with the tservers and the Accumulo master was on the same
> server as the Hadoop namenode. I populated a table with 6M entries using a
> modified version of
> org.apache.accumulo.examples.simple.helloworld.InsertWithBatchWriter from
> Accumulo (the only thing I modified was the number of entries as it usually
> inserts 50k). I then did a count with "bin/accumulo shell -u root -p secret
> -e "scan -t hellotable -np" | wc -l". That took 15 seconds. I then upgraded
> to m5d.xlarge instances (4vCPU, 16GB RAM) and got the exact same result, so
> it seems upgrading the servers doesn't help.
>
> Is this expected or am I doing something terribly wrong?
>
> BR,
> Guy.
>
> On Tue, Aug 28, 2018 at 10:38 AM Sean Busbey <[email protected]> wrote:
>
>> Hi Guy,
>>
>> Apache Accumulo is designed for horizontally scaling out for large scale
>> workloads that need to do random reads and writes. There's a non-trivial
>> amount of overhead that comes with a system aimed at doing that on
>> thousands of nodes.
>>
>> If your use case works for a single laptop with such a small number of
>> entries and exhaustive scans, then Accumulo is probably not the correct
>> tool for the job.
>>
>> For example, on my laptop (i7 2 cores, 8GiB memory) with that dataset
>> size you can just rely on a file format like Apache Avro:
>>
>> busbey$ time java -jar avro-tools-1.7.7.jar random --codec snappy --count
>> 6300000 --schema '{ "type": "record", "name": "entry", "fields": [ {
>> "name": "field0", "type": "string" } ] }' ~/Downloads/6.3m_entries.avro
>> Aug 28, 2018 12:31:13 AM org.apache.hadoop.util.NativeCodeLoader <clinit>
>> WARNING: Unable to load native-hadoop library for your platform... using
>> builtin-java classes where applicable
>> test.seed=1535441473243
>>
>> real 0m5.451s
>> user 0m5.922s
>> sys 0m0.656s
>> busbey$ ls -lah ~/Downloads/6.3m_entries.avro
>> -rwxrwxrwx 1 busbey staff 186M Aug 28 00:31
>> /Users/busbey/Downloads/6.3m_entries.avro
>> busbey$ time java -jar avro-tools-1.7.7.jar tojson
>> ~/Downloads/6.3m_entries.avro | wc -l
>> 6300000
>>
>> real 0m4.239s
>> user 0m6.026s
>> sys 0m0.721s
>>
>> I'd recommend that you start at >= 5 nodes if you want to look at rough
>> per-node throughput capabilities.
>>
>> On 2018/08/28 06:59:38, guy sharon <[email protected]> wrote:
>> > hi Mike,
>> >
>> > Thanks for the links.
>> >
>> > My current setup is a 4 node cluster (tserver, master, gc, monitor)
>> > running on Alpine Docker containers on a laptop with an i7 processor
>> > (8 cores) with 16GB of RAM. As an example I'm running a count of all
>> > entries for a table with 6.3M entries with "accumulo shell -u root -p
>> > secret -e "scan -t benchmark_table -np" | wc -l" and it takes 43
>> > seconds. Not sure if this is reasonable or not. Seems a little slow to
>> > me. What do you think?
>> >
>> > BR,
>> > Guy.
>> >
>> > On Mon, Aug 27, 2018 at 4:43 PM Michael Wall <[email protected]> wrote:
>> >
>> > > Hi Guy,
>> > >
>> > > Here are a couple links I found. Can you tell us more about your
>> > > setup and what you are seeing?
>> > >
>> > > https://accumulo.apache.org/papers/accumulo-benchmarking-2.1.pdf
>> > > https://www.youtube.com/watch?v=Ae9THpmpFpM
>> > >
>> > > Mike
>> > >
>> > > On Sat, Aug 25, 2018 at 5:09 PM guy sharon <[email protected]> wrote:
>> > >
>> > >> hi,
>> > >>
>> > >> I've just started working with Accumulo and I think I'm experiencing
>> > >> slow reads/writes. I'm aware of the recommended configuration. Does
>> > >> anyone know of any standard benchmarks and benchmarking tools I can
>> > >> use to tell if the performance I'm getting is reasonable?
