Thanks everyone for the advice! What I ended up doing was what I did before, but using a number of splits much closer to the number of nodes, which has resulted in good performance so far. Using decompressed input may be faster, but for my purposes space is more valuable than time.
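Here's roughly what that change looks like in the job driver, as a minimal sketch rather than the actual Wikisearch driver. The property key "wikipedia.ingest.groups" is my assumption for whatever numGroups maps to in your copy of the example, so substitute the real key:

    // Minimal sketch: request a split count close to the number of ingest
    // nodes instead of an arbitrarily large numGroups. The property key
    // below is an assumption; use whatever numGroups reads in your setup.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class IngestDriverSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            int ingestNodes = 8; // assumption: nodes available for ingest
            conf.setInt("wikipedia.ingest.groups", ingestNodes);
            Job job = Job.getInstance(conf, "wikipedia-ingest");
            // ... input/output formats, paths, and Accumulo table setup
            //     exactly as in the stock Wikisearch example
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }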
-----Original Message-----
From: Eric Newton <[email protected]>
To: user <[email protected]>
Sent: Tue, May 21, 2013 3:54 pm
Subject: Re: Wikisearch Performance Question

The files decompress remarkably fast, too. I seem to recall about 8 minutes on our hardware.

I could not get map/reduce to split on blocks in bzip'd files. That gave me a long tail, since the English file is so much bigger. Uncompressing the files is the way to go.

-Eric

On Tue, May 21, 2013 at 2:58 PM, Josh Elser <[email protected]> wrote:

You should see much better ingest performance with decompressed input. Hadoop will also 'naturally' handle the splits for you based on the HDFS block size.

On 5/21/13 2:35 PM, Patrick Lynch wrote:

I think your description is accurate, except that I split the single archive into a much greater number of pieces than the number of different archives I ingested. Specifically, I set numGroups to a higher number; I didn't split the archive by hand in HDFS. The archives are bzip2-ed, not gzip-ed. Will decompressing still have the same benefit?

-----Original Message-----
From: Josh Elser <[email protected]>
To: user <[email protected]>
Sent: Tue, May 21, 2013 2:16 pm
Subject: Re: Wikisearch Performance Question

Let me see if I understand what you're asking: you took one mediawiki archive and split it into n archives of size 1/n the original. You then took n _different_ mediawiki archives and ingested those. You tried to get the speed of ingesting many different archives to be as fast as splitting an original single archive?

Are you using gzip'ed input files? Have you tried just decompressing the gzip into plaintext? Hadoop will naturally split uncompressed text and give you nice balancing.

I haven't looked at the ingest code in a long time. Not sure if it ever received much love.

On 5/21/13 1:30 PM, Patrick Lynch wrote:

user@accumulo,

I was working with the Wikipedia Accumulo ingest examples, and I was trying to get the ingest of a single archive file to be as fast as ingesting multiple archives through parallelization. I increased the number of ways the job split the single archive so that all the servers could work on ingesting at the same time. What I noticed, however, was that having all the servers work on ingesting the same file was still not nearly as fast as using multiple ingest files. I was wondering if I could get some insight into the design of the Wikipedia ingest that would explain this phenomenon.

Thank you for your time,
Patrick Lynch
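P.S. The splittability behavior Eric and Josh describe can be checked directly. The sketch below mirrors the decision Hadoop's FileInputFormat-based readers make (it is not the Wikisearch code itself): a compressed file only gets multiple splits if its codec implements SplittableCompressionCodec, which BZip2Codec does in newer Hadoop versions and gzip never does, and a custom record reader like the Wikipedia XML reader still has to honor split boundaries itself, which is one plausible reason a bzip'd file came through as a single long-tail split.

    // Checks whether Hadoop considers an input file splittable. Mirrors the
    // isSplitable() logic in FileInputFormat subclasses like TextInputFormat.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.io.compress.SplittableCompressionCodec;

    public class SplittabilityCheck {
        public static void main(String[] args) {
            // assumption: the default file name is purely illustrative
            Path input = new Path(args.length > 0 ? args[0]
                    : "enwiki-pages-articles.xml.bz2");
            CompressionCodecFactory factory =
                    new CompressionCodecFactory(new Configuration());
            CompressionCodec codec = factory.getCodec(input);
            if (codec == null) {
                System.out.println(input + ": uncompressed, splits fall on HDFS block boundaries");
            } else if (codec instanceof SplittableCompressionCodec) {
                System.out.println(input + ": codec is splittable (the reader must still honor split boundaries)");
            } else {
                System.out.println(input + ": codec forces a single split per file");
            }
        }
    }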
