Well, I had one and tried that - my message tells what I found:
1) Spark only accepts org.apache.hadoop.mapred.InputFormat<K,V>, not org.apache.hadoop.mapreduce.InputFormat<K,V>.
2) Hadoop expects K and V to be Writables - I always use Text - Text is not Serializable and will not work with Spark - StringBuffer will work with Spark but not (as far as I know) with Hadoop.
Telling me what the documentation SAYS is all well and good, but I just tried it and want to hear from people with real working examples.
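One commonly suggested pattern for point 2 - sketched here with the built-in old-API TextInputFormat, since a custom format would slot into the same call - is to keep the Writable key/value types in the hadoopFile call itself and convert them to plain serializable types (String, Long) in the very first map, before Spark ever needs to serialize a record. The local master and the path argument are placeholders; this is a sketch, not a tested program:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object HadoopFileSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "hadoopFileSketch")
    // Load with an old-API (org.apache.hadoop.mapred) InputFormat;
    // the keys and values arrive as Hadoop Writables.
    val raw: RDD[(LongWritable, Text)] =
      sc.hadoopFile(args(0), classOf[TextInputFormat],
        classOf[LongWritable], classOf[Text])
    // Convert the Writables to serializable types immediately, before
    // any shuffle or collect, so Spark never has to serialize Text.
    val lines: RDD[(Long, String)] =
      raw.map { case (k, v) => (k.get, v.toString) }
    lines.take(5).foreach(println)
    sc.stop()
  }
}
```

With this shape the Writables never cross a serialization boundary, so the fact that Text is not java.io.Serializable does not matter.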
On Tue, Sep 23, 2014 at 5:29 PM, Liquan Pei <liquan...@gmail.com> wrote:
> Hi Steve,
>
> Here is my understanding: as long as you implement InputFormat, you should
> be able to use the hadoopFile API in SparkContext to create an RDD. Suppose
> you have a customized InputFormat, which we call CustomizedInputFormat<K, V>,
> where K is the key type and V is the value type. You can create an RDD with
> CustomizedInputFormat in the following way:
>
> Let sc denote the SparkContext variable and path denote the path to a file
> for CustomizedInputFormat; we use
>
> val rdd: RDD[(K, V)] = sc.hadoopFile(path,
>   classOf[CustomizedInputFormat], classOf[K], classOf[V])
>
> to create an RDD of (K, V) with CustomizedInputFormat.
>
> Hope this helps,
> Liquan
>
> On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> When I experimented with using an InputFormat I had used in Hadoop for a
>> long time, I found
>> 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
>> class, not org.apache.hadoop.mapreduce.lib.input.FileInputFormat)
>> 2) initialize needs to be called in the constructor
>> 3) the type - mine extends FileInputFormat<Text, Text> - must not be a
>> Hadoop Writable - those are not serializable - but extends
>> FileInputFormat<StringBuffer, StringBuffer> does work - I don't think this
>> is allowed in Hadoop
>>
>> Are these statements correct? If so, it seems like most Hadoop
>> InputFormats - certainly the custom ones I create - require serious
>> modifications to work. Does anyone have samples of use of a Hadoop
>> InputFormat?
>>
>> Since I am working with problems where a directory with multiple files is
>> processed, and some files are many gigabytes in size with multiline,
>> complex records, an input format is a requirement.
>
> --
> Liquan Pei
> Department of Physics
> University of Massachusetts Amherst

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
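On point 1: SparkContext also exposes newAPIHadoopFile, which takes an org.apache.hadoop.mapreduce.InputFormat, so the new-API formats are usable as well. A sketch with the built-in new-API TextInputFormat (a custom format extending the new-API FileInputFormat would be passed the same way; the path argument is a placeholder):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

object NewApiSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "newApiSketch")
    // newAPIHadoopFile accepts org.apache.hadoop.mapreduce.InputFormat
    // implementations - here the new-API TextInputFormat.
    val raw = sc.newAPIHadoopFile(args(0), classOf[TextInputFormat],
      classOf[LongWritable], classOf[Text], new Configuration())
    // Same caveat as with the old API: map the Writables to serializable
    // types before any shuffle or collect.
    val lines = raw.map { case (k, v) => (k.get, v.toString) }
    lines.take(5).foreach(println)
    sc.stop()
  }
}
```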