According to https://issues.apache.org/jira/browse/HADOOP-7823 , it should be possible to split bzip2 files in Hadoop 1.1.
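
For anyone who wants to check that on their own cluster, something along these lines should work (a rough sketch against the old mapred API; the class name and the input-path argument are placeholders, not code from the Wikisearch example). On Hadoop 1.1+ a large .bz2 file should come back as several splits; on older releases it stays a single split:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: print the input splits TextInputFormat would create for a
// bzip2-compressed file. With HADOOP-7823 (Hadoop 1.1+) BZip2Codec is
// splittable, so a large .bz2 input should produce more than one split.
public class Bzip2SplitCheck {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(new Configuration(), Bzip2SplitCheck.class);
    // args[0] is a placeholder, e.g. /wikipedia/enwiki-pages-articles.xml.bz2
    FileInputFormat.setInputPaths(job, new Path(args[0]));

    TextInputFormat format = new TextInputFormat();
    format.configure(job);  // initializes the compression codec factory

    InputSplit[] splits = format.getSplits(job, 0);
    System.out.println(splits.length + " splits");
    for (InputSplit split : splits) {
      System.out.println(split);
    }
  }
}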
On Tue, May 21, 2013 at 3:54 PM, Eric Newton <[email protected]> wrote:
> The files decompress remarkably fast, too. I seem to recall about 8
> minutes on our hardware.
>
> I could not get map/reduce to split on blocks in bzip'd files.
>
> That gave me a long tail since the English file is so much bigger.
>
> Uncompressing the files is the way to go.
>
> -Eric
>
>
> On Tue, May 21, 2013 at 2:58 PM, Josh Elser <[email protected]> wrote:
>
>> You should see much better ingest performance having decompressed input.
>> Hadoop will also 'naturally' handle the splits for you based on the HDFS
>> block size.
>>
>>
>> On 5/21/13 2:35 PM, Patrick Lynch wrote:
>>
>>> I think your description is accurate, except that I split the single
>>> archive into a much greater number of pieces than the number of
>>> different archives I ingested. Specifically, I set numGroups to a higher
>>> number; I didn't split the archive by hand in HDFS. The archives are
>>> bzip2-ed, not gzip-ed. Will decompressing still have the same benefit?
>>>
>>>
>>> -----Original Message-----
>>> From: Josh Elser <[email protected]>
>>> To: user <[email protected]>
>>> Sent: Tue, May 21, 2013 2:16 pm
>>> Subject: Re: Wikisearch Performance Question
>>>
>>> Let me see if I understand what you're asking: you took one mediawiki
>>> archive and split it into n archives of size 1/n the original. You then
>>> took n _different_ mediawiki archives and ingested those. You tried to
>>> get the speed of ingesting many different archives to be as fast as
>>> splitting an original single archive?
>>>
>>> Are you using gzip'ed input files? Have you tried just decompressing the
>>> gzip into plaintext? Hadoop will naturally split uncompressed text and
>>> give you nice balancing.
>>>
>>> I haven't looked at the ingest code in a long time. Not sure if it ever
>>> received much love.
>>>
>>> On 5/21/13 1:30 PM, Patrick Lynch wrote:
>>>
>>>> user@accumulo,
>>>>
>>>> I was working with the Wikipedia Accumulo ingest examples, and I was
>>>> trying to get the ingest of a single archive file to be as fast as
>>>> ingesting multiple archives through parallelization. I increased the
>>>> number of ways the job split the single archive so that all the servers
>>>> could work on ingesting at the same time. What I noticed, however, was
>>>> that having all the servers work on ingesting the same file was still
>>>> not nearly as fast as using multiple ingest files. I was wondering if I
>>>> could have some insight into the design of the Wikipedia ingest that
>>>> could explain this phenomenon.
>>>>
>>>> Thank you for your time,
>>>> Patrick Lynch
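
Following up on the advice above to simply decompress the archives before ingest: here is a rough sketch (hypothetical class name, placeholder paths; not code from the Wikisearch example) of expanding a .bz2 file inside HDFS so that later jobs read plain text and Hadoop splits it on HDFS block boundaries:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Sketch: decompress a .bz2 file in HDFS so that MapReduce jobs read
// uncompressed text and get one split per HDFS block.
public class DecompressInHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path in = new Path(args[0]);  // placeholder, e.g. /wikipedia/enwiki.xml.bz2
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(in);  // BZip2Codec from the .bz2 suffix

    // Name the output by stripping the compression suffix.
    Path out = new Path(CompressionCodecFactory.removeSuffix(
        in.toString(), codec.getDefaultExtension()));

    InputStream is = codec.createInputStream(fs.open(in));
    OutputStream os = fs.create(out);
    IOUtils.copyBytes(is, os, conf, true);  // closes both streams when done
  }
}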
