How many cores are on your servers? There are several thread counts you can change. Even one extra thread per server adds up once you have enough servers in the cluster.
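For concreteness, the main client-side thread count in play here is the numQueryThreads argument to createBatchScanner, which bounds how many tablet servers a single scan queries in parallel. A minimal sketch, assuming the Accumulo 1.x Java client API; the table name, range, and thread count are placeholders:

import java.util.Collections;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class ThreadCountSketch {
  // numQueryThreads is the knob under discussion: how many tablet
  // servers this one scan will talk to concurrently.
  static long countEntries(Connector connector, String table, int numQueryThreads)
      throws TableNotFoundException {
    long count = 0;
    BatchScanner scanner =
        connector.createBatchScanner(table, Authorizations.EMPTY, numQueryThreads);
    try {
      scanner.setRanges(Collections.singleton(new Range())); // full table, for illustration
      for (Entry<Key,Value> entry : scanner) {
        count++; // a real client would decode the key and value here
      }
    } finally {
      scanner.close();
    }
    return count;
  }
}

Server-side, a property such as tserver.readahead.concurrent.max caps how many scans each tablet server services concurrently, which is one reason extra client threads stop paying off past a point.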
On Fri, May 20, 2016 at 2:54 AM, Mario Pastorelli <[email protected]> wrote:

> You mean the BatchScanner number of threads? I've made it parametric, and
> I usually use 1 or 2 threads per tablet server. Going higher doesn't seem
> to do anything for performance.
>
> On Thu, May 19, 2016 at 6:21 PM, David Medinets <[email protected]> wrote:
>
>> Have you tuned thread counts?
>>
>> On May 19, 2016 11:08 AM, "Mario Pastorelli" <[email protected]> wrote:
>>
>>> Hey people,
>>> I'm trying to tune query performance a bit to see how fast it can go,
>>> and I thought it would be great to have comments from the community.
>>> The problem I'm trying to solve with Accumulo is the following: we want
>>> to store the entities that have been at a certain location on a certain
>>> day. The location is a Long and the entity id is a Long. I want to be
>>> able to scan ~1M rows in a few seconds, possibly in less than one.
>>> Right now, I'm doing the following:
>>>
>>> 1. I'm using a sharding byte at the start of the rowId to keep the data
>>>    in the same range distributed across the cluster
>>> 2. all the records are encoded; a single record is composed of:
>>>    1. rowId: 1 shard byte + 3 bytes for the day
>>>    2. column family: 8 bytes for the long corresponding to the hash of
>>>       the location
>>>    3. column qualifier: 8 bytes corresponding to the identifier of the
>>>       entity
>>>    4. value: 2 bytes for some additional information
>>> 3. I use a BatchScanner because I don't need sorting and it's faster
>>>
>>> As expected, it takes a few seconds to scan 1M rows, but now I'm
>>> wondering if I can improve on that. My ideas are the following:
>>>
>>> 1. set table.compaction.major.ratio to 1, because I don't care about
>>>    ingest performance and this should improve query performance
>>> 2. pre-split tables to match the number of servers and then use a shard
>>>    byte as the first byte of the rowId. As far as I understand, this
>>>    should improve both writing and reading, because both should work in
>>>    parallel across the cluster
>>> 3. enable the bloom filter on the table
>>>
>>> Do you think those ideas make sense? Furthermore, I have two questions:
>>>
>>> 1. considering that a single entry is only 22 bytes but I'm going to
>>>    scan ~1M records per query, do you think I should change the
>>>    BatchScanner buffers somehow?
>>> 2. anything else I can do to improve the scan speed? Again, I don't
>>>    care about ingestion time
>>>
>>> Thanks for the help!
>>>
>>> --
>>> Mario Pastorelli | TERALYTICS
>>> software engineer
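A minimal sketch of the 22-byte record layout and the shard fan-out query described in the thread, again assuming the Accumulo 1.x Java API; NUM_SHARDS, sharding by entity hash, and the days-since-epoch day encoding are illustrative assumptions, not details given above:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

import org.apache.accumulo.core.client.BatchScanner;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;

public class RecordLayoutSketch {
  static final int NUM_SHARDS = 16; // assumption: keep in sync with the table's pre-splits

  // rowId: 1 shard byte + 3 bytes for the day (big-endian days since epoch)
  static byte[] rowId(int shard, int day) {
    return new byte[] { (byte) shard, (byte) (day >>> 16), (byte) (day >>> 8), (byte) day };
  }

  // write side: 1 + 3 + 8 + 8 + 2 = 22 bytes per entry, matching the thread
  static Mutation encode(long locationHash, long entityId, int day, short extra) {
    int shard = Math.floorMod(Long.hashCode(entityId), NUM_SHARDS); // assumed shard function
    Mutation m = new Mutation(rowId(shard, day));
    m.put(ByteBuffer.allocate(8).putLong(locationHash).array(),     // CF: 8-byte location hash
        ByteBuffer.allocate(8).putLong(entityId).array(),           // CQ: 8-byte entity id
        new Value(ByteBuffer.allocate(2).putShort(extra).array())); // value: 2 extra bytes
    return m;
  }

  // read side: one exact-row range per shard for the day, restricted to one
  // location's column family; the BatchScanner pulls all shards in parallel
  static void queryDay(BatchScanner scanner, int day, long locationHash) {
    List<Range> ranges = new ArrayList<>();
    for (int shard = 0; shard < NUM_SHARDS; shard++) {
      ranges.add(Range.exact(new Text(rowId(shard, day))));
    }
    scanner.setRanges(ranges);
    scanner.fetchColumnFamily(new Text(ByteBuffer.allocate(8).putLong(locationHash).array()));
  }
}

Because the shard byte leads the row, one logical (day, location) query fans out into NUM_SHARDS ranges, which is exactly the access pattern the BatchScanner parallelizes.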
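The three tuning ideas, and the buffer question, all map onto table properties and split points that can be set through tableOperations(). A sketch under the same Accumulo 1.x API assumptions; the 4M scan buffer and the column-family bloom functor are illustrative choices, not recommendations made in the thread:

import java.util.SortedSet;
import java.util.TreeSet;

import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.hadoop.io.Text;

public class TuningSketch {
  static void tune(Connector connector, String table, int numShards)
      throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
    // idea 1: trade ingest performance for read performance by compacting
    // down to fewer files per tablet
    connector.tableOperations().setProperty(table, "table.compaction.major.ratio", "1");

    // idea 3: bloom filters; these queries filter on column family rather
    // than doing exact row lookups, so the column-family functor is the one
    // that would plausibly help
    connector.tableOperations().setProperty(table, "table.bloom.enabled", "true");
    connector.tableOperations().setProperty(table, "table.bloom.key.functor",
        "org.apache.accumulo.core.file.keyfunctor.ColumnFamilyFunctor");

    // question 1: the server-side scan batch buffer; larger batches mean
    // fewer round trips for a 22-byte-per-entry, ~1M-entry scan
    connector.tableOperations().setProperty(table, "table.scan.max.memory", "4M");

    // idea 2: pre-split on the shard byte so each shard gets its own tablet
    // and reads and writes spread across the cluster
    SortedSet<Text> splits = new TreeSet<>();
    for (int shard = 1; shard < numShards; shard++) {
      splits.add(new Text(new byte[] { (byte) shard }));
    }
    connector.tableOperations().addSplits(table, splits);
  }
}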
