According to https://issues.apache.org/jira/browse/HADOOP-7823 , it should be possible to split bzip2 files in Hadoop 1.1.
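
For anyone who wants to check that on their own cluster, something along these lines should work (a rough sketch against the old mapred API; the class name and the input-path argument are placeholders, not code from the Wikisearch example). On Hadoop 1.1+ a large .bz2 file should come back as several splits; on older releases it stays a single split:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: print the input splits TextInputFormat would create for a
// bzip2-compressed file. With HADOOP-7823 (Hadoop 1.1+) BZip2Codec is
// splittable, so a large .bz2 input should produce more than one split.
public class Bzip2SplitCheck {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(new Configuration(), Bzip2SplitCheck.class);
    // args[0] is a placeholder, e.g. /wikipedia/enwiki-pages-articles.xml.bz2
    FileInputFormat.setInputPaths(job, new Path(args[0]));

    TextInputFormat format = new TextInputFormat();
    format.configure(job);  // initializes the compression codec factory

    InputSplit[] splits = format.getSplits(job, 0);
    System.out.println(splits.length + " splits");
    for (InputSplit split : splits) {
      System.out.println(split);
    }
  }
}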
On Tue, May 21, 2013 at 3:54 PM, Eric Newton <[email protected]> wrote:
> The files decompress remarkably fast, too. I seem to recall about 8
> minutes on our hardware.
>
> I could not get map/reduce to split on blocks in bzip'd files.
>
> That gave me a long tail since the English file is so much bigger.
>
> Uncompressing the files is the way to go.
>
> -Eric
>
>
> On Tue, May 21, 2013 at 2:58 PM, Josh Elser <[email protected]> wrote:
>
>> You should see much better ingest performance having decompressed input.
>> Hadoop will also 'naturally' handle the splits for you based on the HDFS
>> block size.
>>
>>
>> On 5/21/13 2:35 PM, Patrick Lynch wrote:
>>
>>> I think your description is accurate, except that I split the single
>>> archive into a much greater number of pieces than the number of
>>> different archives I ingested. Specifically, I set numGroups to a higher
>>> number; I didn't split the archive by hand in HDFS. The archives are
>>> bzip2-ed, not gzip-ed. Will decompressing still have the same benefit?
>>>
>>>
>>> -----Original Message-----
>>> From: Josh Elser <[email protected]>
>>> To: user <[email protected]>
>>> Sent: Tue, May 21, 2013 2:16 pm
>>> Subject: Re: Wikisearch Performance Question
>>>
>>> Let me see if I understand what you're asking: you took one mediawiki
>>> archive and split it into n archives of size 1/n the original. You then
>>> took n _different_ mediawiki archives and ingested those. You tried to
>>> get the speed of ingesting many different archives to be as fast as
>>> splitting an original single archive?
>>>
>>> Are you using gzip'ed input files? Have you tried just decompressing the
>>> gzip into plaintext? Hadoop will naturally split uncompressed text and
>>> give you nice balancing.
>>>
>>> I haven't looked at the ingest code in a long time. Not sure if it ever
>>> received much love.
>>>
>>> On 5/21/13 1:30 PM, Patrick Lynch wrote:
>>>
>>>> user@accumulo,
>>>>
>>>> I was working with the Wikipedia Accumulo ingest examples, and I was
>>>> trying to get the ingest of a single archive file to be as fast as
>>>> ingesting multiple archives through parallelization. I increased the
>>>> number of ways the job split the single archive so that all the servers
>>>> could work on ingesting at the same time. What I noticed, however, was
>>>> that having all the servers work on ingesting the same file was still
>>>> not nearly as fast as using multiple ingest files. I was wondering if I
>>>> could have some insight into the design of the Wikipedia ingest that
>>>> could explain this phenomenon.
>>>>
>>>> Thank you for your time,
>>>> Patrick Lynch
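
Following up on the advice above to simply decompress the archives before ingest: here is a rough sketch (hypothetical class name, placeholder paths; not code from the Wikisearch example) of expanding a .bz2 file inside HDFS so that later jobs read plain text and Hadoop splits it on HDFS block boundaries:

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

// Sketch: decompress a .bz2 file in HDFS so that MapReduce jobs read
// uncompressed text and get one split per HDFS block.
public class DecompressInHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path in = new Path(args[0]);  // placeholder, e.g. /wikipedia/enwiki.xml.bz2
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(in);  // BZip2Codec from the .bz2 suffix

    // Name the output by stripping the compression suffix.
    Path out = new Path(CompressionCodecFactory.removeSuffix(
        in.toString(), codec.getDefaultExtension()));

    InputStream is = codec.createInputStream(fs.open(in));
    OutputStream os = fs.create(out);
    IOUtils.copyBytes(is, os, conf, true);  // closes both streams when done
  }
}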
