As long as you're managing your expectations (which it sounds like you've considered well), there could be some worth in it.

A concern would be how using a different filesystem implementation actually impacts the validity of your benchmark though.

e.g. with a local FS (which is what MAC, i.e. MiniAccumuloCluster, uses by default), a disk seek might cost, say, 10ms, but on your real HDFS cluster it might cost 200ms. If IteratorA does more seeks but is more efficient on the retrieved data, while IteratorB does fewer seeks but is less efficient on the retrieved data, then the ranking you measure on MAC could be the opposite of what you'd see on a production system.

I guess another way to put it is that total wall time for a query might be deceiving in a test environment.
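
To make the seek-cost concern concrete, here is a rough sketch of a toy cost model in Python. Everything in it is an assumption for illustration: the linear model (seeks times seek cost, plus processing time), the per-query seek counts, and the processing times are made up, and only the 10ms and 200ms seek costs come from the example above. It is not a measurement of anything.

```python
def wall_time_ms(seeks, seek_cost_ms, cpu_ms):
    """Toy model: total time = time spent seeking + time spent processing."""
    return seeks * seek_cost_ms + cpu_ms


# Hypothetical per-query profiles (invented numbers, not measurements).
ITERATOR_A = {"seeks": 100, "cpu_ms": 500}   # more seeks, cheaper processing
ITERATOR_B = {"seeks": 20, "cpu_ms": 2000}   # fewer seeks, costlier processing


def faster(seek_cost_ms):
    """Return which hypothetical iterator wins at the given seek cost."""
    a = wall_time_ms(ITERATOR_A["seeks"], seek_cost_ms, ITERATOR_A["cpu_ms"])
    b = wall_time_ms(ITERATOR_B["seeks"], seek_cost_ms, ITERATOR_B["cpu_ms"])
    return "A" if a < b else "B"


if __name__ == "__main__":
    print("local FS (10ms seeks): ", faster(10))
    print("HDFS     (200ms seeks):", faster(200))
```

With these made-up numbers the winner flips between the two seek costs, which is the kind of inversion a MAC-only benchmark could hide.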

Dave Hardcastle wrote:
Hi,

Is it crazy to use a MiniAccumuloCluster to measure the *relative*
performance of two different implementations of iterators?

Obviously it would be better to do it on a real Accumulo cluster, but
that's not possible for several reasons.

The approach would be something like:
- Fire up a Mini cluster
- Bulk import a file
- Start timer
- Set up a BatchScanner with one of the iterator stacks and use it to
query for lots of different ranges
- Iterate through the results of this
- Stop timer

Repeat with the other implementation of the iterators.
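
The steps above could be wrapped in a small timing harness along these lines. This is only a sketch under stated assumptions: it is written in Python rather than against the Accumulo client API, and `run_query` is a hypothetical zero-argument callable standing in for "set up a BatchScanner with one iterator stack, query the ranges, and iterate through the results". The warm-up run and the use of the median over several trials are additions not mentioned in the original steps.

```python
import statistics
import time


def time_trials(run_query, trials=5, warmups=1):
    """Return the median wall time in seconds of run_query over several trials.

    run_query is a hypothetical callable that performs one full scan
    with a given iterator stack and consumes all of the results.
    """
    for _ in range(warmups):
        run_query()  # untimed, so caches and JIT are warm before measuring
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        run_query()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare(implementations, trials=5):
    """Time each named implementation and return {name: median_seconds}."""
    return {name: time_trials(fn, trials) for name, fn in implementations.items()}
```

Usage would look like `compare({"stackA": scan_with_a, "stackB": scan_with_b})`, where both callables are placeholders you would write yourself.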

Of course, the difference in performance may not be measurable, if the
time is dominated by the disk-seek time, but that would still be useful
information. And the absolute performance wouldn't be representative of
what you'd get on a real cluster as there's no network latency in these
trials, but that's fine as I'm mainly interested in which of the two
implementations of the iterators is most performant.

Similarly, could the same approach be used to compare the performance on
SSD vs hard disk?

Thanks,

Dave.
