Hi Steve,

Here is my understanding: as long as you implement InputFormat, you should be able to use the hadoopFile API in SparkContext to create an RDD. Suppose you have a custom InputFormat, call it CustomizedInputFormat<K, V>, where K is the key type and V is the value type. You can create an RDD with CustomizedInputFormat in the following way:
Let sc denote the SparkContext and path the path to a file readable by CustomizedInputFormat. Then

  val rdd: RDD[(K, V)] = sc.hadoopFile(path, classOf[CustomizedInputFormat], classOf[K], classOf[V])

creates an RDD of (K, V) pairs read with CustomizedInputFormat.

Hope this helps,
Liquan

On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> When I experimented with an InputFormat I had used in Hadoop for a long
> time, I found
> 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
> class), not org.apache.hadoop.mapreduce.lib.input.FileInputFormat
> 2) initialize needs to be called in the constructor
> 3) the key/value types - mine extends FileInputFormat<Text, Text> - must not
> be Hadoop Writables, since those are not serializable, but extends
> FileInputFormat<StringBuffer, StringBuffer> does work - I don't think this
> is allowed in Hadoop
>
> Are these statements correct? If so, it seems like most Hadoop
> InputFormats - certainly the custom ones I create - require serious
> modifications to work. Does anyone have samples of using a Hadoop
> InputFormat?
>
> Since I am working with problems where a directory with multiple files is
> processed, and some of the files are many gigabytes in size with multiline
> complex records, an input format is a requirement.

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
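
For a more complete illustration of the same pattern, here is a minimal, self-contained sketch. It uses the built-in KeyValueTextInputFormat (from the old org.apache.hadoop.mapred API, which is what hadoopFile expects) as a stand-in for a custom format, with a placeholder input path and app name. It also copies the Text Writables into plain Strings right away, which is one way around the serialization issue Steve raises:

  import org.apache.hadoop.io.Text
  import org.apache.hadoop.mapred.KeyValueTextInputFormat
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  object HadoopFileExample {
    def main(args: Array[String]): Unit = {
      // local[*] is just for local testing of the sketch
      val conf = new SparkConf().setAppName("hadoopFile example").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // hadoopFile takes an InputFormat from the old org.apache.hadoop.mapred API;
      // newAPIHadoopFile is the counterpart for org.apache.hadoop.mapreduce formats.
      val raw: RDD[(Text, Text)] = sc.hadoopFile(
        "/path/to/input",                 // placeholder path
        classOf[KeyValueTextInputFormat], // stand-in for CustomizedInputFormat
        classOf[Text],
        classOf[Text])

      // Text is a Writable, not java.io.Serializable, and the RecordReader may
      // reuse record objects, so copy them into plain Strings immediately.
      val rdd: RDD[(String, String)] = raw.map { case (k, v) => (k.toString, v.toString) }

      rdd.take(5).foreach(println)
      sc.stop()
    }
  }

With a custom format you would substitute classOf[CustomizedInputFormat] and its key/value classes in the same call; the map-to-String step is only needed when the key or value types are Writables that you intend to shuffle, cache, or collect.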