Well, I had one and tried that - my message tells what I found:
1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>, not org.apache.hadoop.mapreduce.InputFormat<K,V>.
2) Hadoop expects K and V to be Writables - I always use Text - Text is not Serializable and will not work with Spark - StringBuffer will work with Spark but not (as far as I know) with Hadoop.
Telling me what the documentation SAYS is all well and good, but I just tried it and want to hear from people with real working examples.
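One commonly suggested pattern for point 2 - sketched here with the built-in old-API TextInputFormat, since a custom format would slot into the same call - is to keep the Writable key/value types in the hadoopFile call itself and convert them to plain serializable types (String, Long) in the very first map, before Spark ever needs to serialize a record. The local master and the path argument are placeholders; this is a sketch, not a tested program:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HadoopFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "hadoopFileSketch")
    // Load with an old-API (org.apache.hadoop.mapred) InputFormat;
    // the keys and values arrive as Hadoop Writables.
    val raw: RDD[(LongWritable, Text)] =
      sc.hadoopFile(args(0), classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text])
    // Convert the Writables to serializable types immediately, before
    // any shuffle or collect, so Spark never has to serialize Text.
    val lines: RDD[(Long, String)] =
      raw.map { case (k, v) => (k.get, v.toString) }
    lines.take(5).foreach(println)
    sc.stop()
  }
}
```

With this shape the Writables never cross a serialization boundary, so the fact that Text is not java.io.Serializable does not matter.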
On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
> Hi Steve,
>
> Here is my understanding: as long as you implement InputFormat, you should
> be able to use the hadoopFile API in SparkContext to create an RDD. Suppose
> you have a customized InputFormat, which we call CustomizedInputFormat<K, V>,
> where K is the key type and V is the value type. You can create an RDD with
> CustomizedInputFormat in the following way:
>
> Let sc denote the SparkContext variable and path denote the path to a file
> for CustomizedInputFormat; we use
>
> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>
> to create an RDD of (K, V) with CustomizedInputFormat.
>
> Hope this helps,
> Liquan
>
> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> When I experimented with using an InputFormat I had used in Hadoop for a
>> long time, I found
>> 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
>> class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat)
>> 2) initialize needs to be called in the constructor
>> 3) the type - mine extends FileInputFormat<Text, Text> - must not be a
>> Hadoop Writable - those are not serializable - but extends
>> FileInputFormat<StringBuffer, StringBuffer> does work - I don't think this
>> is allowed in Hadoop
>>
>> Are these statements correct? If so, it seems like most Hadoop
>> InputFormats - certainly the custom ones I create - require serious
>> modifications to work. Does anyone have samples of use of a Hadoop
>> InputFormat?
>>
>> Since I am working with problems where a directory with multiple files is
>> processed, and some files are many gigabytes in size with multiline,
>> complex records, an input format is a requirement.
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
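On point 1: SparkContext also exposes newAPIHadoopFile, which takes an org.apache.hadoop.mapreduce.InputFormat, so the new-API formats are usable as well. A sketch with the built-in new-API TextInputFormat (a custom format extending the new-API FileInputFormat would be passed the same way; the path argument is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

object NewApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "newApiSketch")
    // newAPIHadoopFile accepts org.apache.hadoop.mapreduce.InputFormat
    // implementations - here the new-API TextInputFormat.
    val raw = sc.newAPIHadoopFile(args(0), classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], new Configuration())
    // Same caveat as with the old API: map the Writables to serializable
    // types before any shuffle or collect.
    val lines = raw.map { case (k, v) => (k.get, v.toString) }
    lines.take(5).foreach(println)
    sc.stop()
  }
}
```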