Tony, I think the first step would be to verify if the S3 filesystem implementation rename works as expected.
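Before testing rename, it may help to pin down exactly which source and destination pairs the move has to cover. Below is a minimal sketch in plain Java (no Hadoop classes), assuming each object keeps its relative location once the temp folder segment is dropped. The class and method names (TempMovePlan, finalPathFor, plan) and the example bucket are hypothetical. Each pair would then be handed to FileSystem.rename(src, dst), which can report failure by returning false, so the return value needs checking for every pair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: plans temp-to-final moves using plain strings only.
final class TempMovePlan {

    // Maps <dataRoot>/<tmpFolder>/<rest> to <dataRoot>/<rest>.
    static String finalPathFor(String dataRoot, String tmpFolder, String source) {
        // Tolerate a trailing slash on the data root.
        String root = dataRoot.endsWith("/")
                ? dataRoot.substring(0, dataRoot.length() - 1)
                : dataRoot;
        String tmpPrefix = root + "/" + tmpFolder + "/";
        if (!source.startsWith(tmpPrefix)) {
            throw new IllegalArgumentException("Not under temp folder: " + source);
        }
        return root + "/" + source.substring(tmpPrefix.length());
    }

    // Returns {source, destination} pairs in the order the sources were given.
    static List<String[]> plan(String dataRoot, String tmpFolder, List<String> sources) {
        List<String[]> pairs = new ArrayList<String[]>();
        for (String source : sources) {
            pairs.add(new String[] {source, finalPathFor(dataRoot, tmpFolder, source)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Bucket and folder names are made up for illustration.
        List<String> sources = Arrays.asList(
                "s3://examplebucket/data/_tmp/typeA/part-00000",
                "s3://examplebucket/data/_tmp/typeB/part-00000");
        for (String[] pair : plan("s3://examplebucket/data", "_tmp", sources)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

Keeping the planning separate from the filesystem call means the list of pairs can be printed and checked by eye before anything on S3 is touched.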
Thx

On Fri, Feb 1, 2013 at 7:12 AM, Tony Burton <[email protected]> wrote:

> Thanks for the reply Alejandro. Using a temp output directory was my first
> guess as well. What’s the best way to proceed? I’ve come across
> FileSystem.rename but it’s consistently returning false for whatever Paths
> I provide. Specifically, I need to copy the following:
>
> s3://<path to data>/<tmp folder>/<object type 1>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 1>/part-nnnnn
> s3://<path to data>/<tmp folder>/<object type 2>/part-00000
> …
> s3://<path to data>/<tmp folder>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<tmp folder>/<object type m>/part-nnnnn
>
> to
>
> s3://<path to data>/<object type 1>/part-00000
> …
> s3://<path to data>/<object type 1>/part-nnnnn
> s3://<path to data>/<object type 2>/part-00000
> …
> s3://<path to data>/<object type 2>/part-nnnnn
> …
> s3://<path to data>/<object type m>/part-nnnnn
>
> without doing a copyToLocal.
>
> Any tips? Are there any better alternatives to FileSystem.rename? Or would
> using the AWS Java SDK be a better solution?
>
> Thanks!
>
> Tony
>
> From: Alejandro Abdelnur [mailto:[email protected]]
> Sent: 31 January 2013 18:45
> To: [email protected]
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Hi Tony, from what I understand your problem is not with MTOF but with you
> wanting to run 2 jobs using the same output directory; the second job will
> fail because the output dir already exists.
> My take would be tweaking your jobs to use a temp output dir, and moving
> the output to the required (final) location upon completion.
>
> thx
>
> On Thu, Jan 31, 2013 at 8:22 AM, Tony Burton <[email protected]>
> wrote:
>
> Hi everyone,
>
> Some of you might recall this topic, which I worked on with the list's
> help back in August last year - see email trail below. Despite the initial
> success of the discovery, I had to shelve the approach, as I ended up using
> a different solution (for reasons I forget!) in the implementation that
> was ultimately used for that particular project.
>
> I'm now in a position to be working on a similar new task, where I've
> successfully implemented the combination of LazyOutputFormat and
> MultipleOutputs using hadoop 1.0.3 to write out to multiple custom output
> locations. However, I've hit another snag which I'm hoping you might help
> me work through.
>
> I'm going to be running daily tasks to extract data from XML files
> (specifically, the data stored in certain nodes of the XML), stored on AWS
> S3 using object names with the following format:
>
> s3://inputbucket/data/2013/1/13/<list of xml data files.bz2>
>
> I want to extract items from the XML and write out as follows:
>
> s3://outputbucket/path/<xml node name>/20130113/<output from MR job>
>
> For one day of data, this works fine. I pass in s3://inputbucket/data and
> s3://outputbucket/path as input and output arguments, along with my run
> date (20130113) which gets manipulated and appended where appropriate to
> form the precise read and write locations, for example
>
> FileInputFormat.setInputPaths(job, "s3://inputbucket/data");
> FileOutputFormat.setOutputPath(job, new Path("s3://outputbucket/path"));
>
> Then MultipleOutputs adds on my XML node names underneath
> s3://outputbucket/path automatically.
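The way the run date feeds both locations can be sketched as below. This is an illustration in plain Java, not the job's actual code: the class RunDatePaths and its method names are hypothetical, and it assumes the input folders use unpadded month and day numbers (2013/1/13), as in the example above. The second method builds the kind of base path that MultipleOutputs.write takes as its last argument, ending in a file-name prefix such as "part".

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical helper: derives read and write locations from a run date.
final class RunDatePaths {

    // Parses a compact run date such as "20130113"; throws on malformed input.
    static LocalDate parseRunDate(String runDate) {
        return LocalDate.parse(runDate, DateTimeFormatter.BASIC_ISO_DATE);
    }

    // Input side: <inputBase>/<year>/<month>/<day>, month and day unpadded.
    static String inputPath(String inputBase, String runDate) {
        LocalDate date = parseRunDate(runDate);
        return inputBase + "/" + date.getYear()
                + "/" + date.getMonthValue()
                + "/" + date.getDayOfMonth();
    }

    // Output side: <xml node name>/<run date>/part, relative to the job output path.
    static String baseOutputPath(String xmlNodeName, String runDate) {
        parseRunDate(runDate); // validation only; the compact form is reused as-is
        return xmlNodeName + "/" + runDate + "/part";
    }

    public static void main(String[] args) {
        // "nodeName" stands in for a real XML node name.
        System.out.println(inputPath("s3://inputbucket/data", "20130113"));
        System.out.println(baseOutputPath("nodeName", "20130113"));
    }
}
```

With the arguments shown in main, the first line printed is s3://inputbucket/data/2013/1/13 and the second is nodeName/20130113/part.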
>
> However, for the next day's run, the job gets to
> FileOutputFormat.setOutputPath and sees that the output path
> (s3://outputbucket/path) already exists, and throws a
> FileAlreadyExistsException from FileOutputFormat.checkOutputSpecs() - even
> though my ultimate subdirectory, to be constructed by MultipleOutputs, does
> not already exist.
>
> Is there any way around this? I'm given hope by this, from
> http://hadoop.apache.org/docs/r1.0.3/api/org/apache/hadoop/fs/FileAlreadyExistsException.html:
> "public class FileAlreadyExistsException extends IOException - Used when
> target file already exists for any operation *and is not configured to be
> overwritten*" (my emphasis). Is it possible to deconfigure the overwrite
> protection?
>
> If not, I suppose one other way ahead is to create my own FileOutputFormat
> where the checkOutputSpecs() is a bit less fussy; another might be to write
> to a "temp" directory and programmatically move it to the desired output
> when the job completes successfully, although this is starting to feel a
> bit "hacky" to me.
>
> Thanks for any feedback!
>
> Tony
>
> ________________________________________
> From: Harsh J [[email protected]]
> Sent: 31 August 2012 10:47
> To: [email protected]
> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
>
> Good finding, that OF slipped my mind. We can mention in the
> MultipleOutputs javadocs for the new API to use the LazyOutputFormat for
> the job-level config. Could you please file a JIRA for this under the
> MAPREDUCE project on the Apache JIRA?
>
> On Fri, Aug 31, 2012 at 2:32 PM, Tony Burton <[email protected]>
> wrote:
> > Hi Harsh,
> >
> > I tried using NullOutputFormat as you suggested, however simply using
> >
> > job.setOutputFormatClass(NullOutputFormat.class);
> >
> > resulted in no output at all.
> > Although I've not tried overriding getOutputCommitter in
> > NullOutputFormat as you suggested, I discovered LazyOutputFormat, which
> > only writes when it has to: "the output file is created only when the
> > first record is emitted for a given partition" (from "Hadoop: The
> > Definitive Guide").
> >
> > Instead of
> >
> > job.setOutputFormatClass(TextOutputFormat.class);
> >
> > use LazyOutputFormat like this:
> >
> > LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
> >
> > So now my unnamed MultipleOutputs are handling the segmented results, and
> > LazyOutputFormat is suppressing the default output. Good job!
> >
> > Tony
> >
> > ________________________________________
> > From: Harsh J [[email protected]]
> > Sent: 29 August 2012 17:05
> > To: [email protected]
> > Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >
> > Hi Tony,
> >
> > On Wed, Aug 29, 2012 at 9:30 PM, Tony Burton <[email protected]> wrote:
> >> Success so far!
> >>
> >> I followed the example given by Tom on the link to the
> >> MultipleOutputs.html API you suggested.
> >>
> >> I implemented a WordCount MR job using hadoop 1.0.3 and segmented the
> >> output depending on word length: output to directory "sml" for less
> >> than 10 characters, "med" for between 10 and 20 characters, "lrg"
> >> otherwise.
> >>
> >> I used out.write(key, new IntWritable(sum), generateFileName(key,
> >> sum)); to write the output, and generateFileName to create the custom
> >> directory name/filename. You need to provide the start of the
> >> filename as well, otherwise your output files will be -r-00000,
> >> -r-00001 etc. (so, for example, return "sml/part"; etc)
> >
> > Thanks for these notes, should be helpful for those who search!
> >
> >> Also required: as Tom states, override Reducer.setup() to create the
> >> MultipleOutputs. However, Tom's puzzle left for the reader is that you
> >> also need to override Reducer.cleanup() and call close() on your
> >> MultipleOutputs object.
> >> Forget to do this and your segmented files will be empty.
> >
> > Ah yes this is important. Non closure of files would have you wait for
> > an hour for data to get available to readers (open writer lease expiry
> > period).
> >
> >> One observation: although it's not the end of the world, as well as my
> >> segmented output I also get a zero-size part-r-00000 file in the base
> >> of my output path. Is there any way to prevent creation of this file?
> >
> > Set the OutputFormat to NullOutputFormat.
> >
> > In case you face issues doing this in new API (you may notice some odd
> > behavior) try to extend NullOutputFormat and in its getOutputCommitter
> > method i.e.
> > http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/mapreduce/lib/output/NullOutputFormat.html#getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext),
> > return a FileOutputCommitter object. By default it returns a no-op
> > OutputCommitter that may not gel well with a file-based writer such as
> > MultipleOutputs. Then set this new OutputFormat as your job's output
> > format.
> >
> >> Thanks again Harsh for pointing the way.
> >>
> >> Tony
> >>
> >> -----Original Message-----
> >> From: Tony Burton [mailto:[email protected]]
> >> Sent: 29 August 2012 11:38
> >> To: [email protected]
> >> Subject: RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Thanks Harsh! Will try it out and report back later.
> >>
> >> -----Original Message-----
> >> From: Harsh J [mailto:[email protected]]
> >> Sent: 29 August 2012 11:12
> >> To: [email protected]
> >> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>
> >> Hi Tony,
> >>
> >> Seeing your new question, I recalled Tom's post to a user once, here:
> >> https://groups.google.com/a/cloudera.org/d/msg/cdh-user/pdyVyydt5Ys/1CaLukt4v1AJ
> >>
> >> This specific call allows you to specify / characters in your name,
> >> that gets translated into creation of directories automatically:
> >> http://hadoop.apache.org/common/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(KEYOUT,%20VALUEOUT,%20java.lang.String)
> >> (The last argument is where you will need to specify the path)
> >>
> >> Try it out and let us know!
> >>
> >> On Tue, Aug 28, 2012 at 7:06 PM, Tony Burton <[email protected]> wrote:
> >>> Hi Harsh
> >>>
> >>> Thanks for the reply - my understanding is that with MultipleOutputs I
> >>> can write differently named files into the same target directory. With
> >>> MultipleTextOutputFormat I was able to override the target directory
> >>> name to perform the segmentation, by overriding
> >>> generateFileNameForKeyValue().
> >>>
> >>> Does the 1.0.3 MultipleOutputs give me the ability to alter the target
> >>> directory name as well as the file name?
> >>>
> >>> Thanks,
> >>>
> >>> Tony
> >>>
> >>> -----Original Message-----
> >>> From: Harsh J [mailto:[email protected]]
> >>> Sent: 28 August 2012 13:44
> >>> To: [email protected]
> >>> Subject: Re: hadoop 1.0.3 equivalent of MultipleTextOutputFormat
> >>>
> >>> The Multiple*OutputFormat have been deprecated in favor of the
> >>> generic MultipleOutputs API. Would using that instead work for you?
> >>>
> >>> On Tue, Aug 28, 2012 at 6:05 PM, Tony Burton <[email protected]> wrote:
> >>>> Hi,
> >>>>
> >>>> I've seen that org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
> >>>> is good for writing results into (for example) different directories
> >>>> created on the fly. However, now I'm implementing a MapReduce job
> >>>> using Hadoop 1.0.3, I see that the new API no longer supports
> >>>> MultipleTextOutputFormat. Is there an equivalent that I can use, or
> >>>> will it be supported in a future release?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Tony
> >>>>
> >>>> **********************************************************************
> >>>> This email and any attachments are confidential, protected by
> >>>> copyright and may be legally privileged. If you are not the
> >>>> intended recipient, then the dissemination or copying of this email
> >>>> is prohibited. If you have received this in error, please notify the
> >>>> sender by replying by email and then delete the email completely from
> >>>> your system. Neither Sporting Index nor the sender accepts
> >>>> responsibility for any virus, or any other defect which might affect
> >>>> any computer or IT system into which the email is received and/or
> >>>> opened. It is the responsibility of the recipient to scan the email
> >>>> and no responsibility is accepted for any loss or damage arising in
> >>>> any way from receipt or use of this email. Sporting Index Ltd is a
> >>>> company registered in England and Wales with company number 2636842,
> >>>> whose registered office is at Gateway House, Milverton Street,
> >>>> London, SE11 4AP. Sporting Index Ltd is authorised and regulated by
> >>>> the UK Financial Services Authority (reg. no. 150404) and Gambling
> >>>> Commission (reg. no. 000-027343-R-308898-001). Any financial
> >>>> promotion contained herein has been issued and approved by Sporting
> >>>> Index Ltd.
--
Alejandro
