That should work as well.

On Tue, Oct 23, 2012 at 9:06 PM, David Parks <[email protected]> wrote:

> I might very well be overthinking this. But I have a cluster I'm firing up
> on EC2 that I want to keep utilized. I have some other unrelated jobs that
> don't need to wait for the downloads, so I don't want to fill all the map
> slots with long downloads. I'd rather the other jobs run in parallel while
> the downloads are happening.
>
> From: [email protected] [mailto:[email protected]] On Behalf Of Seetharam Venkatesh
> Sent: Tuesday, October 23, 2012 1:10 PM
> To: [email protected]
> Subject: Re: Large input files via HTTP
>
> Well, it depends. :-) If the XML cannot be split, then you'd end up with
> only one map task for the entire set of files. I think it'd make sense to
> have multiple splits so you can get an even spread of copies across maps,
> retry only a failed copy, and not manage the scheduling of the downloads
> yourself.
>
> Look at DistCp for some intelligent splitting.
>
> What are the constraints that you are working with?
>
> On Mon, Oct 22, 2012 at 5:59 PM, David Parks <[email protected]> wrote:
>
> Would it make sense to write a map job that takes an unsplittable XML file
> (which defines all of the files I need to download), and have that one map
> job kick off the downloads in multiple threads? That way I can easily
> manage the most efficient download pattern within the map job, and my
> output is emitted as key/value pairs straight to the reducer step.
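A rough sketch of that single-map, multithreaded pattern, for illustration
only. It is untested; the "/staging" path and the pool size of 5 are made
up, the job would need an unsplittable whole-file input format to hand the
manifest to one mapper, and for brevity the manifest is treated as
whitespace-separated URLs rather than parsed as real XML:

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ManifestDownloadMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void map(LongWritable key, Text manifest, final Context context)
      throws IOException, InterruptedException {
    // Illustrative shortcut: one URL per whitespace-separated token.
    List<String> urls = Arrays.asList(manifest.toString().trim().split("\\s+"));

    // One pool caps total concurrency; honoring per-host connection limits
    // would need a pool (or semaphore) per host.
    ExecutorService pool = Executors.newFixedThreadPool(5);
    for (final String url : urls) {
      pool.submit(new Runnable() {
        public void run() {
          try {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path dst = new Path("/staging", new Path(new URL(url).getPath()).getName());
            InputStream in = new URL(url).openStream();
            IOUtils.copyBytes(in, fs.create(dst), 65536, true); // closes both streams
            synchronized (context) { // Context is not thread-safe
              context.write(new Text(url), new Text(dst.toString()));
            }
          } catch (Exception e) {
            // A real job would retry and/or increment a failure counter here.
          }
        }
      });
    }
    pool.shutdown();
    while (!pool.awaitTermination(60, TimeUnit.SECONDS)) {
      context.progress(); // keep the long-running task from hitting the timeout
    }
  }
}

One caveat: everything rides on a single task, so one failure means
re-downloading the whole manifest, which is exactly the retry point
Venkatesh raises above.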
> From: [email protected] [mailto:[email protected]] On Behalf Of Seetharam Venkatesh
> Sent: Tuesday, October 23, 2012 7:28 AM
> To: [email protected]
> Subject: Re: Large input files via HTTP
>
> One possible way is to first create a list of files with tuples <host:port,
> filePath>. Then use a map-only job to pull each file using NLineInputFormat.
>
> Another way is to write an HttpInputFormat and HttpRecordReader and stream
> the data in a map-only job.
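A minimal sketch of that first approach (untested; the class names,
"/staging" destination, and output layout are made up): the input file lists
one URL per line, NLineInputFormat gives each map task a single line, and a
failed map re-fetches only its own file.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HttpFetchJob {

  // Each record is one line of the URL list, e.g. "http://host:80/path/file.gz".
  public static class FetchMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      URL url = new URL(line.toString().trim());
      FileSystem fs = FileSystem.get(context.getConfiguration());
      Path dst = new Path("/staging", new Path(url.getPath()).getName());
      InputStream in = url.openStream();
      OutputStream out = fs.create(dst);
      byte[] buf = new byte[65536];
      int n;
      while ((n = in.read(buf)) != -1) {
        out.write(buf, 0, n);
        context.progress(); // multi-gigabyte pulls must not trip the task timeout
      }
      out.close();
      in.close();
      context.write(line, new Text(dst.toString())); // record where the file landed
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "http-fetch");
    job.setJarByClass(HttpFetchJob.class);
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);                     // map-only
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1); // one file per map task
    FileInputFormat.addInputPath(job, new Path(args[0]));   // the URL list
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The HttpInputFormat/HttpRecordReader variant would instead open the URL
inside a custom RecordReader, so the bytes stream directly into map() calls
without the staging copy.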
> On Mon, Oct 22, 2012 at 1:54 AM, David Parks <[email protected]> wrote:
>
> I want to create a MapReduce job which reads many multi-gigabyte input
> files from various HTTP sources & processes them nightly.
>
> Is there a reasonably flexible way to acquire the files in the Hadoop job
> itself? I expect the initial downloads to take many hours, and I'd hope I
> can optimize the # of connections (example: I'm limited to 5 connections
> to one host, whereas another host has a 3-connection limit, so I'd
> maximize each as much as possible). Also, the set of files to download
> will change a little over time, so the input list should be easily
> configurable (in a config file or equivalent).
>
> - Is it normal to perform batch downloads like this *before* running the
>   MapReduce job?
> - Or is it OK to include such steps in with the job?
> - It seems handy to keep the whole process as one neat package in Hadoop
>   if possible.
> - What class should I implement if I wanted to manage this myself? Would
>   I just extend TextInputFormat, for example, and do the HTTP processing
>   there? Or am I creating a FileSystem?
>
> Thanks,
> David

--
Regards,
Venkatesh
Phone: (408) 658-8368
EMail: [email protected]
http://in.linkedin.com/in/seetharamvenkatesh
http://about.me/SeetharamVenkatesh

"Perfection (in design) is achieved not when there is nothing more to add,
but rather when there is nothing more to take away."
- Antoine de Saint-Exupéry
