What size cluster, and what is the HDFS block size, compared to the file sizes? I'm wondering if the blocks for the large file were disproportionately burdening a small number of datanodes, whereas the small files were more evenly distributed.
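As a back-of-the-envelope check (a sketch only; the 128 MB default block size and the example file sizes are assumptions, so substitute your cluster's actual values), you can estimate how many HDFS blocks each input occupies, which bounds how many datanodes can hold a piece of it:

```python
import math

def hdfs_block_count(file_size_bytes, block_size_bytes=128 * 1024 * 1024):
    """Number of HDFS blocks a file occupies (the last block may be partial)."""
    return max(1, math.ceil(file_size_bytes / block_size_bytes))

# Hypothetical sizes: one 5 GiB archive vs. ten 512 MiB archives.
single_file_blocks = hdfs_block_count(5 * 1024**3)         # 40 blocks in one file
many_file_blocks = 10 * hdfs_block_count(512 * 1024**2)    # 4 blocks in each of ten files

print(single_file_blocks, many_file_blocks)
```

The total block count is the same either way, but block *placement* can differ: you can see where each block of a given file actually landed with `hdfs fsck /path/to/file -files -blocks -locations`.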
--
Christopher L Tubbs II
http://gravatar.com/ctubbsii

On Tue, May 21, 2013 at 1:30 PM, Patrick Lynch <[email protected]> wrote:
> user@accumulo,
>
> I was working with the Wikipedia Accumulo ingest examples, and I was trying
> to get the ingest of a single archive file to be as fast as ingesting
> multiple archives through parallelization. I increased the number of ways
> the job split the single archive so that all the servers could work on
> ingesting at the same time. What I noticed, however, was that having all the
> servers work on ingesting the same file was still not nearly as fast as
> using multiple ingest files. I was wondering if I could have some insight
> into the design of the Wikipedia ingest that could explain this phenomenon.
>
> Thank you for your time,
> Patrick Lynch
