Thanks for all the comments below. Very helpful! On the last point, around "small indexes", do you mean the case where the set of row keys is small, but each key has many column families and column qualifiers? What order of magnitude would you consider small? A few million keys? A few billion? Or, put another way, keys with tens or hundreds of column families/qualifiers?
I have another question around the use of column families and qualifiers. Would it be good or bad practice to have many column families/qualifiers per row? I was wondering whether there would be any point in using these almost as extensions to the key, i.e. the column family/qualifier would end up being the last part of the key. I understand column families can also be used to control how the data gets stored to maximise scanning. I was just wondering whether there are drawbacks to having many of these. (I've put a few rough sketches at the end of this message to make these questions concrete.)

Chris

On 28 November 2012 20:31, Eric Newton <[email protected]> wrote:

> Some comments inlined below:
>
> On Wed, Nov 28, 2012 at 2:49 PM, Chris Burrell <[email protected]> wrote:
>
>> Hi
>>
>> I am trialling Accumulo on a small (tiny) cluster and wondering how best
>> to tune it. I have 1 master + 2 tservers. The master has 8Gb of RAM and
>> the tservers have 16Gb each.
>>
>> I have set the walog size to 2Gb with an external memory map of 9Gb. The
>> ratio is still the default of 3. I've also upped the heap of each tserver
>> to 2Gb.
>>
>> I'm trying to achieve high-speed ingest via batch writers held on several
>> other servers. I'm loading two separate tables.
>>
>> Here are some questions I have:
>> - Does the config above sound sensible? Or overkill?
>
> Looks good to me, assuming you aren't doing other things (like map/reduce)
> on the machines.
>
>> - Is it preferable to have more servers with lower specs?
>
> Yes. Mostly to get more drives.
>
>> - Is this the best way to maximise use of the memory?
>
> It's not bad. You may want to have larger block caches and a smaller
> in-memory map. But if you want to write-mostly, read-little, this is good.
>
>> - Does the fact I have 3x2Gb walogs mean that the remaining 3Gb in the
>> external memory map can be used while compactions occur?
>
> Yes. You will want to increase the size or number of logs. With that many
> servers, failures will hopefully be very rare. I would go with changing 3
> to 8. Having lots of logs on a tablet is no big deal if you have disk
> space and don't expect many failures.
>
>> - When minor compactions occur, does this halt ingest on that particular
>> tablet? Or tablet server?
>
> Only if memory fills before the compactions finish. The monitor page will
> indicate this by displaying "hold time." When this happens the tserver
> will self-tune and start minor compactions earlier during future ingest.
>
>> - I have pre-split the tables six ways, but I'm not entirely sure that's
>> preferable if I only have 2 servers while trying it out? Perhaps 2 ways
>> might be better?
>
> Not for that reason, but to be able to use more cores concurrently. Aim
> for 50-100 tablets/node.
>
>> - Does the batch upload through the shell client give significantly
>> better performance stats?
>
> Using map/reduce to create RFiles is more efficient. But it also increases
> latency: you can only see the data when the whole file is loaded.
>
> When a file is batch-loaded, its index is read, and the file is assigned
> to matching tablets. With small indexes, you can batch-load terabytes in
> minutes.
>
> -Eric
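
To make the column family/qualifier question concrete, this is the kind of mutation I have in mind. The table, row, and field names are made up purely for illustration, and "writer" is a BatchWriter like the one in the ingest sketch further down (exception handling omitted):

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    // One logical record keyed by "user123"; the column family names the
    // section of the record and the qualifier carries what would otherwise
    // have been appended to the row key (here, an event timestamp).
    Mutation m = new Mutation(new Text("user123"));
    m.put(new Text("purchases"),           // column family
          new Text("2012-11-28T20:31:00"), // column qualifier acting as a key extension
          new Value("order-4421".getBytes()));
    writer.addMutation(m);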
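And this is roughly what I understand by using column families to control how the data is stored for scanning: assigning families to locality groups so a scan over one family doesn't have to read the others. The table and group names are again hypothetical, and "connector" is an org.apache.accumulo.core.client.Connector obtained elsewhere:

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Set;
    import org.apache.hadoop.io.Text;

    // Put the "purchases" family in its own locality group so scans that
    // only need purchases can skip the other column families on disk.
    Map<String, Set<Text>> groups = new HashMap<String, Set<Text>>();
    groups.put("purchaseGroup", Collections.singleton(new Text("purchases")));
    connector.tableOperations().setLocalityGroups("mytable", groups);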
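For reference, this is how I've been applying the tuning discussed above from the client API; it's only a sketch, the values are just the ones from my setup, and the property names are worth double-checking against the documentation ("mytable" is a placeholder, and the tserver heap itself is set in accumulo-env.sh rather than here):

    // Instance-wide settings: walog size and in-memory map size.
    connector.instanceOperations().setProperty("tserver.walog.max.size", "2G");
    connector.instanceOperations().setProperty("tserver.memory.maps.max", "9G");
    // Per-table: allow more walogs per tablet before forcing a minor
    // compaction, along the lines of Eric's "changing 3 to 8".
    connector.tableOperations().setProperty("mytable",
        "table.compaction.minor.logs.threshold", "8");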
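On the pre-splitting and batch-writer side, this is the shape of what my ingest clients are doing, as a rough sketch against the 1.4-style API with made-up split points and buffer sizes:

    import java.util.TreeSet;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.hadoop.io.Text;

    // Pre-split the table so ingest spreads across tablets (and cores) up front.
    TreeSet<Text> splits = new TreeSet<Text>();
    for (String s : new String[] {"d", "h", "l", "p", "t", "x"})
      splits.add(new Text(s));
    connector.tableOperations().addSplits("mytable", splits);

    // Batch writer: 100MB client-side buffer, 60s max latency, 4 send threads.
    BatchWriter writer = connector.createBatchWriter("mytable", 100000000L, 60000L, 4);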
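And for the bulk-loading route Eric describes, my understanding is that after a map/reduce job has written RFiles (e.g. with AccumuloFileOutputFormat), they are handed to the table along these lines (I believe the shell's bulk-import command does the equivalent). Directory names are placeholders:

    // Register the RFiles in /bulkload/rfiles with the table; any files that
    // cannot be assigned are moved to the failures directory. The final flag
    // controls whether the server assigns timestamps on load.
    connector.tableOperations().importDirectory("mytable",
        "/bulkload/rfiles", "/bulkload/failures", false);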
