I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources and processes them nightly. Is there a reasonably flexible way to acquire the files within the Hadoop job itself? I expect the initial downloads to take many hours, and I'd like to optimize the number of connections per host (for example, I'm limited to 5 connections to one host while another host allows only 3, so I want to use as many as each host permits). The set of files to download will also change a little over time, so the input list should be easily configurable (in a config file or equivalent).

- Is it normal to perform batch downloads like this *before* running the MapReduce job?
- Or is it OK to include such steps in the job itself? It seems handy to keep the whole process as one neat package in Hadoop if possible.
- What class should I implement if I wanted to manage this myself? Would I just extend TextInputFormat, for example, and do the HTTP processing there? Or should I be implementing a FileSystem? (A rough sketch of what I have in mind is below.)
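For concreteness, here is roughly the shape of what I'm imagining: a custom InputFormat (new `org.apache.hadoop.mapreduce` API) whose splits are just URLs and whose RecordReader streams lines straight from the HTTP source, so the download happens inside the map tasks. The class names and the `http.input.urls` config key are made up by me; this is only a sketch, not something I've run:

    import java.io.*;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.*;

    public class HttpLineInputFormat extends InputFormat<LongWritable, Text> {

        // One split per URL; the URL list comes from the job configuration,
        // so it can live in a plain config file that I update as sources change.
        @Override
        public List<InputSplit> getSplits(JobContext context) {
            List<InputSplit> splits = new ArrayList<InputSplit>();
            for (String url : context.getConfiguration()
                                     .getStrings("http.input.urls", new String[0])) {
                splits.add(new UrlSplit(url));
            }
            return splits;
        }

        @Override
        public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                                    TaskAttemptContext context) {
            return new UrlRecordReader();
        }

        // A split that is just a URL string; Writable so Hadoop can ship it to tasks.
        public static class UrlSplit extends InputSplit implements Writable {
            private String url;

            public UrlSplit() {}
            public UrlSplit(String url) { this.url = url; }

            @Override public long getLength() { return 0; }           // unknown until we connect
            @Override public String[] getLocations() { return new String[0]; }
            @Override public void write(DataOutput out) throws IOException { out.writeUTF(url); }
            @Override public void readFields(DataInput in) throws IOException { url = in.readUTF(); }
            public String getUrl() { return url; }
        }

        // Streams the remote file line by line; the download is part of the map task.
        public static class UrlRecordReader extends RecordReader<LongWritable, Text> {
            private BufferedReader reader;
            private long lineNo = 0;
            private LongWritable key = new LongWritable();
            private Text value = new Text();

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
                URL url = new URL(((UrlSplit) split).getUrl());
                reader = new BufferedReader(new InputStreamReader(url.openStream()));
            }

            @Override
            public boolean nextKeyValue() throws IOException {
                String line = reader.readLine();
                if (line == null) return false;
                key.set(lineNo++);
                value.set(line);
                return true;
            }

            @Override public LongWritable getCurrentKey() { return key; }
            @Override public Text getCurrentValue() { return value; }
            @Override public float getProgress() { return 0; }        // length unknown, so no real progress
            @Override public void close() throws IOException { if (reader != null) reader.close(); }
        }
    }

I don't see how an approach like this would let me cap the number of simultaneous connections per host, though, since each split becomes its own map task and they're scheduled independently. Is that the kind of thing a custom FileSystem would handle better, or is it a sign the downloads belong outside the job?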
Thanks, David
