Hi Mario,

Not sure where this plays into your data integrity, but have you looked into
these settings in hdfs-site.xml?

dfs.client.read.shortcircuit
dfs.client.read.shortcircuit.skip.checksum
dfs.domain.socket.path
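In case it's useful, these are roughly the entries (the socket path below is
only an example; it has to point at a path the datanodes are configured with,
and short-circuit reads also need the native libhadoop library available):

    <property>
      <name>dfs.client.read.shortcircuit</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.client.read.shortcircuit.skip.checksum</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.domain.socket.path</name>
      <value>/var/lib/hadoop-hdfs/dn_socket</value>
    </property>

The skip.checksum one is the setting that touches data integrity: it skips the
checksum verification on local short-circuit reads in exchange for speed, so
leave it off if that check matters to you.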
These make for a somewhat dramatic increase in HDFS read performance if the
data is distributed well enough around. I can't speak as much to the scanner
params, but you may look into these as well.

Marc

On Thu, May 19, 2016 at 10:08 AM, Mario Pastorelli <[email protected]> wrote:

> Hey people,
> I'm trying to tune the query performance a bit to see how fast it can go,
> and I thought it would be great to have comments from the community. The
> problem that I'm trying to solve in Accumulo is the following: we want to
> store the entities that have been in a certain location on a certain day.
> The location is a Long and the entity id is a Long. I want to be able to
> scan ~1M rows in a few seconds, possibly less than one. Right now, I'm
> doing the following things:
>
> 1. I'm using a sharding byte at the start of the rowId to keep the data
>    in the same range distributed across the cluster
> 2. all the records are encoded; a single record is composed of
>    1. rowId: 1 shard byte + 3 bytes for the day
>    2. column family: 8 bytes for the long corresponding to the hash of
>       the location
>    3. column qualifier: 8 bytes corresponding to the identifier of the
>       entity
>    4. value: 2 bytes for some additional information
> 3. I use a batch scanner because I don't need sorting and it's faster
>
> As expected, it takes a few seconds to scan 1M rows, but now I'm wondering
> if I can improve it. My ideas are the following:
>
> 1. set table.compaction.major.ratio to 1 because I don't care about
>    ingestion performance and this should improve query performance
> 2. pre-split the table to match the number of servers and then use a
>    shard byte as the first byte of the rowId. This should improve both
>    writing and reading because, as far as I understand, both should then
>    work in parallel
> 3. enable the bloom filter on the table
>
> Do you think those ideas make sense? Furthermore, I have two questions:
>
> 1. considering that a single entry is only 22 bytes but I'm going to
>    scan ~1M records per query, do you think I should change the
>    BatchScanner buffers somehow?
> 2. anything else to improve the scan speed? Again, I don't care about
>    the ingestion time
>
> Thanks for the help!
>
> --
> Mario Pastorelli | TERALYTICS
>
> *software engineer*
>
> Teralytics AG | Zollstrasse 62 | 8005 Zurich | Switzerland
> phone: +41794381682
> email: [email protected]
> www.teralytics.net
>
> Company registration number: CH-020.3.037.709-7 | Trade register Canton
> Zurich
> Board of directors: Georg Polzer, Luciano Franceschina, Mark Schmitz,
> Yann de Vries
>
> This e-mail message contains confidential information which is for the
> sole attention and use of the intended recipient. Please notify us at
> once if you think that it may not be intended for you and delete it
> immediately.
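For what it's worth, this is roughly how I picture the key layout and batch
scan you describe (Java, Accumulo client API; the class name, shard count and
helper names are just made up for illustration):

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    import org.apache.accumulo.core.client.BatchScanner;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.hadoop.io.Text;

    // Sketch of the layout above: 1 shard byte + 3 day bytes in the row,
    // 8-byte location hash in the family, 8-byte entity id in the qualifier.
    public class LocationDayScan {

      static final int NUM_SHARDS = 16; // made-up shard count

      // 4-byte rowId: shard byte followed by the day encoded in 3 bytes.
      static byte[] rowId(int shard, int daysSinceEpoch) {
        return new byte[] {
            (byte) shard,
            (byte) (daysSinceEpoch >>> 16),
            (byte) (daysSinceEpoch >>> 8),
            (byte) daysSinceEpoch };
      }

      // Scan one day for one location across all shards with a BatchScanner.
      static void scanDay(Connector conn, String table, int day, long locationHash)
          throws Exception {
        BatchScanner scanner = conn.createBatchScanner(table, Authorizations.EMPTY, 10);
        try {
          List<Range> ranges = new ArrayList<Range>();
          for (int shard = 0; shard < NUM_SHARDS; shard++) {
            ranges.add(Range.exact(new Text(rowId(shard, day))));
          }
          scanner.setRanges(ranges);
          // Restrict to the location via the 8-byte column family.
          scanner.fetchColumnFamily(new Text(ByteBuffer.allocate(8).putLong(locationHash).array()));
          for (Map.Entry<Key,Value> entry : scanner) {
            long entityId = ByteBuffer.wrap(entry.getKey().getColumnQualifier().copyBytes()).getLong();
            // ... use entityId and the 2-byte value payload ...
          }
        } finally {
          scanner.close();
        }
      }
    }

With one exact Range per shard byte, the BatchScanner can hit all the tablets
in parallel, which is exactly where the pre-splitting in your point 2 should
pay off.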

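On the table-tuning side, the pieces you list (pre-splits, compaction ratio,
bloom filter, plus the server-side scan batch size) can all be set through
TableOperations; something along these lines, where the property values are
only starting points to benchmark, not recommendations:

    import java.util.SortedSet;
    import java.util.TreeSet;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.hadoop.io.Text;

    // Sketch of the tuning ideas from the list above; values are guesses to measure.
    public class TableTuning {

      static void tuneTable(Connector conn, String table, int numShards) throws Exception {
        // Pre-split on the shard byte so each shard lands on its own tablet/server.
        SortedSet<Text> splits = new TreeSet<Text>();
        for (int shard = 1; shard < numShards; shard++) {
          splits.add(new Text(new byte[] { (byte) shard }));
        }
        conn.tableOperations().addSplits(table, splits);

        // Keep tablets compacted down to fewer files; costs ingest work, helps reads.
        conn.tableOperations().setProperty(table, "table.compaction.major.ratio", "1");

        // Bloom filters mostly help point lookups, so measure whether they pay
        // off for range scans like these.
        conn.tableOperations().setProperty(table, "table.bloom.enabled", "true");

        // Server-side scan batch size; with ~1M entries of ~22 bytes per query,
        // a larger batch means fewer round trips.
        conn.tableOperations().setProperty(table, "table.scan.max.memory", "4M");
      }
    }

That last property is probably the closest thing to the "BatchScanner buffers"
you ask about; on the client side, the number of query threads you pass to
createBatchScanner matters as well.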