One possible way is to first create a list file with one <host:port, filePath> tuple per line, then use a map-only job with NLineInputFormat to pull each file. A rough sketch of that follows.
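Something like this, as a minimal sketch under stated assumptions: each input line carries a full source URL and an HDFS destination path separated by a tab, and HttpFetchJob / FetchMapper and that line layout are illustrative names, not an existing API.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HttpFetchJob {

  /** Assumes each input line is "http://host:port/path<TAB>hdfs-dest-path". */
  public static class FetchMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      Path dest = new Path(parts[1]);
      FileSystem fs = dest.getFileSystem(context.getConfiguration());

      InputStream in = new URL(parts[0]).openStream();
      FSDataOutputStream out = fs.create(dest, true);
      try {
        // Stream the HTTP body straight into HDFS, calling progress()
        // so the task isn't killed as unresponsive on a long download.
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
          out.write(buf, 0, n);
          context.progress();
        }
      } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "http-fetch");
    job.setJarByClass(HttpFetchJob.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);   // one download per map task
    FileInputFormat.addInputPath(job, new Path(args[0])); // the URL list file
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);                       // map-only
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One line per split gives one download per task; since the list file is plain text, it's also easy to regenerate nightly as the set of source files changes. Per-host connection limits aren't enforced here, but you could approximate them by capping how many of a host's URLs you emit into the list at once.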
Another way is to write an HttpInputFormat and HttpRecordReader and stream the data in a map-only job; a rough sketch follows below the quoted message.

On Mon, Oct 22, 2012 at 1:54 AM, David Parks <[email protected]> wrote:
> I want to create a MapReduce job which reads many multi-gigabyte input files
> from various HTTP sources & processes them nightly.
>
> Is there a reasonably flexible way to acquire the files in the Hadoop job
> itself? I expect the initial downloads to take many hours and I'd hope I
> can optimize the # of connections (example: I'm limited to 5 connections to
> one host, whereas another host has a 3-connection limit, so maximize as much
> as possible). Also the set of files to download will change a little over
> time, so the input list should be easily configurable (in a config file or
> equivalent).
>
> - Is it normal to perform batch downloads like this *before* running the
> mapreduce job?
> - Or is it ok to include such steps in with the job?
> - It seems handy to keep the whole process as one neat package in Hadoop if
> possible.
> - What class should I implement if I wanted to manage this myself? Would I
> just extend TextInputFormat, for example, and do the HTTP processing there?
> Or am I creating a FileSystem?
>
> Thanks,
> David

--
Regards,
Venkatesh

“Perfection (in design) is achieved not when there is nothing more to add,
but rather when there is nothing more to take away.”
- Antoine de Saint-Exupéry
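Here is the sketch mentioned above: one possible shape for an HttpInputFormat/HttpRecordReader pair under the org.apache.hadoop.mapreduce API, with one split per URL and the reader streaming the response line by line. The URLS_KEY config key, the HttpSplit class, and the line-oriented record shape are all assumptions made for illustration.

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class HttpInputFormat extends InputFormat<LongWritable, Text> {

  /** Comma-separated source URLs; this config key is an assumption. */
  public static final String URLS_KEY = "http.input.urls";

  /** One split per URL. Splits must be Writable so they can be shipped to tasks. */
  public static class HttpSplit extends InputSplit implements Writable {
    private Text url = new Text();
    public HttpSplit() {}
    public HttpSplit(String u) { url.set(u); }
    public String getUrl() { return url.toString(); }
    @Override public long getLength() { return 0; }                  // unknown up front
    @Override public String[] getLocations() { return new String[0]; } // no data locality
    @Override public void write(DataOutput out) throws IOException { url.write(out); }
    @Override public void readFields(DataInput in) throws IOException { url.readFields(in); }
  }

  /** Streams the HTTP response, handing each line to the mapper as a record. */
  public static class HttpRecordReader extends RecordReader<LongWritable, Text> {
    private BufferedReader reader;
    private LongWritable key = new LongWritable();
    private Text value = new Text();
    private long lineNo = 0;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
      String url = ((HttpSplit) split).getUrl();
      reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
    }
    @Override
    public boolean nextKeyValue() throws IOException {
      String line = reader.readLine();
      if (line == null) return false;
      key.set(lineNo++);
      value.set(line);
      return true;
    }
    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0; }  // response length unknown
    @Override public void close() throws IOException { if (reader != null) reader.close(); }
  }

  @Override
  public List<InputSplit> getSplits(JobContext ctx) {
    // Assumes URLS_KEY has been set, e.g. conf.setStrings(URLS_KEY, url1, url2, ...).
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String u : ctx.getConfiguration().getStrings(URLS_KEY)) {
      splits.add(new HttpSplit(u));
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit s, TaskAttemptContext c) {
    return new HttpRecordReader();
  }
}

A job would then call job.setInputFormatClass(HttpInputFormat.class) and put the URL list into the configuration under that key. This skips the intermediate copy to HDFS entirely, though per-host connection limits would still have to be managed by how many of a host's URLs you put into the split list at one time.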
