Hi Steve,

Hadoop has both old-style and new-style APIs -- the Java packages are
"mapred" vs. "mapreduce".  Spark supports both of these via sc.hadoopFile()
and sc.newAPIHadoopFile().  Maybe you need to switch to the new-API versions
of those methods?
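
For example, here is a minimal sketch of the new-API call, using the
built-in org.apache.hadoop.mapreduce.lib.input.TextInputFormat (a custom
new-API InputFormat class would be passed the same way; the path is just a
placeholder):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// newAPIHadoopFile takes the InputFormat class plus the key and value classes
val rdd = sc.newAPIHadoopFile(
  "hdfs:///path/to/input",    // placeholder path
  classOf[TextInputFormat],   // new-API (mapreduce) input format
  classOf[LongWritable],      // key class the format produces
  classOf[Text])              // value class the format produces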

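On the Writable point in your message below: one workaround sketch (assuming
the format yields Text-like keys and values, and "rdd" stands for whatever
pair RDD you read) is to convert to plain Strings immediately, so nothing
that gets shuffled or collected is a Writable:

// Hadoop Writables such as Text are not java.io.Serializable,
// so map them to Strings right after reading
val strings = rdd.map { case (k, v) => (k.toString, v.toString) }
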
On Tue, Sep 23, 2014 at 5:43 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:

> Well I had one and tried that - my message tells what I found:
> 1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>,
>  not org.apache.hadoop.mapreduce.InputFormat<K,V>
> 2) Hadoop expects K and V to be Writables - I always use Text - Text is
> not Serializable and will not work with Spark - StringBuffer will work with
> Spark but not (as far as I know) with Hadoop
> - Telling me what the documentation SAYS is all well and good, but I just
> tried it and want to hear from people with real working examples
>
> On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
>
>> Hi Steve,
>>
>> Here is my understanding: as long as you implement InputFormat, you
>> should be able to use the hadoopFile API in SparkContext to create an RDD.
>> Suppose you have a customized InputFormat which we call
>> CustomizedInputFormat<K, V> where K is the key type and V is the value
>> type. You can create an RDD with CustomizedInputFormat in the following way:
>>
>> Let sc denote the SparkContext variable and path denote the path to the
>> input for CustomizedInputFormat; we use
>>
>> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>> classOf[CustomizedInputFormat], classOf[K], classOf[V])
>>
>> to create an RDD of (K,V) with CustomizedInputFormat.
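>>
>> As a concrete sketch (untested), the same call with a built-in old-API
>> format, org.apache.hadoop.mapred.KeyValueTextInputFormat, which yields
>> Text keys and values:
>>
>> import org.apache.hadoop.io.Text
>> import org.apache.hadoop.mapred.KeyValueTextInputFormat
>>
>> val kv = sc.hadoopFile(path, classOf[KeyValueTextInputFormat],
>>   classOf[Text], classOf[Text])
>> // Text is not Serializable, so convert before shuffling or collecting
>> val pairs = kv.map { case (k, v) => (k.toString, v.toString) }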
>>
>> Hope this helps,
>> Liquan
>>
>> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com>
>> wrote:
>>
>>>  When I experimented with an InputFormat I had used in Hadoop for a long
>>> time, I found
>>> 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the
>>> deprecated class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat)
>>> 2) initialize needs to be called in the constructor
>>> 3) the key/value types - mine was extends FileInputFormat<Text, Text> -
>>> must not be Hadoop Writables, since those are not serializable; extends
>>> FileInputFormat<StringBuffer, StringBuffer> does work, but I don't think
>>> that is allowed in Hadoop
>>>
>>> Are these statements correct? If so, it seems like most Hadoop
>>> InputFormats - certainly the custom ones I create - require serious
>>> modifications to work. Does anyone have working samples of using a Hadoop
>>> InputFormat with Spark?
>>>
>>> Since I am working with problems where a directory of multiple files is
>>> processed, and some of those files are many gigabytes in size with complex
>>> multiline records, an InputFormat is a requirement.
>>>
>>
>>
>>
>> --
>> Liquan Pei
>> Department of Physics
>> University of Massachusetts Amherst
>>
>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>
