Re: Wikisearch Performance Question

Josh Elser Tue, 21 May 2013 11:16:59 -0700

Let me see if I understand what you're asking: you took one mediawiki archive and split it into n archives, each 1/n the size of the original. You then took n _different_ mediawiki archives and ingested those. You tried to get the speed of ingesting many different archives to be as fast as splitting an original single archive?

Are you using gzip'ed input files? Have you tried just decompressing the gzip into plaintext? Hadoop will naturally split uncompressed text and give you nice balancing.
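
For context on the suggestion above: a gzip stream cannot be read from an arbitrary offset, so Hadoop's stock text input handling gives a whole .gz file to one mapper, while an uncompressed file can be cut into roughly block-sized pieces. Below is a minimal Python sketch of the decompress step plus a back-of-the-envelope split count. The function names, paths, and the 128 MB split size are assumptions for illustration only; real split counts depend on the job's input format and configuration, and whether the wikisearch ingest follows the stock behaviour is not something this thread confirms.

```python
import gzip
import math
import os
import shutil


def decompress_gzip(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream a .gz file out to plaintext without loading it all into memory."""
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return os.path.getsize(dst_path)


def estimate_splits(file_size, split_size, splittable):
    """Rough number of map tasks for one input file (illustrative only).

    A non-splittable file, such as a gzip archive, is read by a single
    mapper regardless of its size.
    """
    if not splittable or file_size <= split_size:
        return 1
    return math.ceil(file_size / split_size)


if __name__ == "__main__":
    # Hypothetical numbers: a 10 GB archive and a 128 MB split size.
    size = 10 * 1024 ** 3
    split = 128 * 1024 ** 2
    print("gzip'ed:  ", estimate_splits(size, split, splittable=False))
    print("plaintext:", estimate_splits(size, split, splittable=True))
```

If the estimate for the compressed file comes out as 1 while the plaintext estimate is in the dozens, that difference alone could account for a single archive ingesting much more slowly than several separate ones.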

I haven't looked at the ingest code in a long time. Not sure if it ever received much love.


On 5/21/13 1:30 PM, Patrick Lynch wrote:

user@accumulo,

I was working with the Wikipedia Accumulo ingest examples, and I was
trying to get the ingest of a single archive file to be as fast as
ingesting multiple archives through parallelization. I increased the
number of ways the job split the single archive so that all the servers
could work on ingesting at the same time. What I noticed, however, was
that having all the servers work on ingesting the same file was still
not nearly as fast as using multiple ingest files. I was wondering if I
could have some insight into the design of the Wikipedia ingest that
could explain this phenomenon.

Thank you for your time,
Patrick Lynch
