Steve-

I would probably design the experiment to test different cluster sizes
as completely independent. That means taking the entire cluster down
and bringing it back up (possibly even rebooting the boxes and/or
re-initializing the cluster at the new size). I'd also do several runs
at each cluster size while it is up, to capture any performance
difference between the first and a later run due to OS or TServer
caching, for analysis later.
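
Something like the following is what I have in mind. It's only a rough
sketch, assuming an Accumulo 1.4-style install (bin/stop-all.sh,
bin/start-all.sh, conf/slaves) and Linux boxes you can ssh into;
run_rdf_benchmark.sh is a stand-in for whatever drives your queries.

    # Rough sketch of one benchmark cycle per cluster size. Assumes an
    # Accumulo 1.4-style layout, Linux, and sudo access for the cache
    # drop; the benchmark driver name is made up.
    for nodes in 10 8; do
        # Take the whole stack down between sizes, not just the extra
        # tservers.
        "$ACCUMULO_HOME"/bin/stop-all.sh

        # Start each size cold: flush and drop the OS page cache on
        # every box (Linux-specific, needs root).
        for host in $(cat "$ACCUMULO_HOME"/conf/slaves); do
            ssh "$host" 'sync && echo 3 | sudo tee /proc/sys/vm/drop_caches'
        done

        # Trim conf/slaves to $nodes tablet servers here, then bring
        # the cluster back up (re-initialize too, if you want the
        # sizes fully independent).
        "$ACCUMULO_HOME"/bin/start-all.sh

        # Several runs per size: run 1 is cold; later runs expose
        # OS/TServer cache effects.
        for run in 1 2 3; do
            ./run_rdf_benchmark.sh "$nodes" "$run"   # hypothetical driver
        done
    done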

Essentially, when in doubt, take more data...

--L


On Fri, Aug 3, 2012 at 5:50 PM, Steven Troxell <steven.trox...@gmail.com> wrote:
> Hi all,
>
> I am running a benchmarking project on Accumulo looking at RDF queries for
> clusters of different node sizes.  While I intend to look at caching when
> optimizing each individual run, I do NOT want caching to interfere, for
> example, between runs involving the use of 10 and 8 tablet servers.
>
> Up to now I'd just been killing nodes via the bin/stop-here.sh script, but I
> realize that may have allowed caching from previous runs with different node
> sizes to influence my results.  It seemed weird to me, for example, when I
> realized that dropping nodes actually increased performance (as measured by
> query return times) in some cases (though I acknowledge the code I'm working
> with has some serious issues with how ineffectively it actually utilizes
> Accumulo, but that's an issue I intend to address later).
>
> I suppose one way would be, between changes of node size, to stop and
> restart ALL nodes (as opposed to what I'd been doing in just killing 2
> nodes, for example, when transitioning from a 10- to an 8-node test).  Will
> this be sure to clear the influence of caching across runs, and is there
> any cleaner way to do this?
>
> thanks,
> Steve
