One possible way is to first create a list file with one <host:port, filePath> tuple per line, then use a map-only job with NLineInputFormat to pull each file. A rough sketch of that follows.
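Something like this, as a minimal sketch under stated assumptions: each input line carries a full source URL and an HDFS destination path separated by a tab, and HttpFetchJob / FetchMapper and that line layout are illustrative names, not an existing API.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class HttpFetchJob {

  /** Assumes each input line is "http://host:port/path<TAB>hdfs-dest-path". */
  public static class FetchMapper
      extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] parts = line.toString().split("\t");
      Path dest = new Path(parts[1]);
      FileSystem fs = dest.getFileSystem(context.getConfiguration());

      InputStream in = new URL(parts[0]).openStream();
      FSDataOutputStream out = fs.create(dest, true);
      try {
        // Stream the HTTP body straight into HDFS, calling progress()
        // so the task isn't killed as unresponsive on a long download.
        byte[] buf = new byte[64 * 1024];
        int n;
        while ((n = in.read(buf)) != -1) {
          out.write(buf, 0, n);
          context.progress();
        }
      } finally {
        IOUtils.closeStream(in);
        IOUtils.closeStream(out);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "http-fetch");
    job.setJarByClass(HttpFetchJob.class);
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1);   // one download per map task
    FileInputFormat.addInputPath(job, new Path(args[0])); // the URL list file
    job.setMapperClass(FetchMapper.class);
    job.setNumReduceTasks(0);                       // map-only
    job.setOutputFormatClass(NullOutputFormat.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

One line per split gives one download per task; since the list file is plain text, it's also easy to regenerate nightly as the set of source files changes. Per-host connection limits aren't enforced here, but you could approximate them by capping how many of a host's URLs you emit into the list at once.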
Another way is to write an HttpInputFormat and HttpRecordReader and stream the data in a map-only job; a rough sketch follows below the quoted message.

On Mon, Oct 22, 2012 at 1:54 AM, David Parks <[email protected]> wrote:
> I want to create a MapReduce job which reads many multi-gigabyte input files
> from various HTTP sources & processes them nightly.
>
> Is there a reasonably flexible way to acquire the files in the Hadoop job
> itself? I expect the initial downloads to take many hours and I'd hope I
> can optimize the # of connections (example: I'm limited to 5 connections to
> one host, whereas another host has a 3-connection limit, so maximize as much
> as possible). Also the set of files to download will change a little over
> time, so the input list should be easily configurable (in a config file or
> equivalent).
>
> - Is it normal to perform batch downloads like this *before* running the
> mapreduce job?
> - Or is it ok to include such steps in with the job?
> - It seems handy to keep the whole process as one neat package in Hadoop if
> possible.
> - What class should I implement if I wanted to manage this myself? Would I
> just extend TextInputFormat, for example, and do the HTTP processing there?
> Or am I creating a FileSystem?
>
> Thanks,
> David

--
Regards,
Venkatesh

“Perfection (in design) is achieved not when there is nothing more to add,
but rather when there is nothing more to take away.”
- Antoine de Saint-Exupéry
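Here is the sketch mentioned above: one possible shape for an HttpInputFormat/HttpRecordReader pair under the org.apache.hadoop.mapreduce API, with one split per URL and the reader streaming the response line by line. The URLS_KEY config key, the HttpSplit class, and the line-oriented record shape are all assumptions made for illustration.

import java.io.BufferedReader;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class HttpInputFormat extends InputFormat<LongWritable, Text> {

  /** Comma-separated source URLs; this config key is an assumption. */
  public static final String URLS_KEY = "http.input.urls";

  /** One split per URL. Splits must be Writable so they can be shipped to tasks. */
  public static class HttpSplit extends InputSplit implements Writable {
    private Text url = new Text();
    public HttpSplit() {}
    public HttpSplit(String u) { url.set(u); }
    public String getUrl() { return url.toString(); }
    @Override public long getLength() { return 0; }                  // unknown up front
    @Override public String[] getLocations() { return new String[0]; } // no data locality
    @Override public void write(DataOutput out) throws IOException { url.write(out); }
    @Override public void readFields(DataInput in) throws IOException { url.readFields(in); }
  }

  /** Streams the HTTP response, handing each line to the mapper as a record. */
  public static class HttpRecordReader extends RecordReader<LongWritable, Text> {
    private BufferedReader reader;
    private LongWritable key = new LongWritable();
    private Text value = new Text();
    private long lineNo = 0;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext ctx) throws IOException {
      String url = ((HttpSplit) split).getUrl();
      reader = new BufferedReader(new InputStreamReader(new URL(url).openStream()));
    }
    @Override
    public boolean nextKeyValue() throws IOException {
      String line = reader.readLine();
      if (line == null) return false;
      key.set(lineNo++);
      value.set(line);
      return true;
    }
    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() { return 0; }  // response length unknown
    @Override public void close() throws IOException { if (reader != null) reader.close(); }
  }

  @Override
  public List<InputSplit> getSplits(JobContext ctx) {
    // Assumes URLS_KEY has been set, e.g. conf.setStrings(URLS_KEY, url1, url2, ...).
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (String u : ctx.getConfiguration().getStrings(URLS_KEY)) {
      splits.add(new HttpSplit(u));
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit s, TaskAttemptContext c) {
    return new HttpRecordReader();
  }
}

A job would then call job.setInputFormatClass(HttpInputFormat.class) and put the URL list into the configuration under that key. This skips the intermediate copy to HDFS entirely, though per-host connection limits would still have to be managed by how many of a host's URLs you put into the split list at one time.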
