Hi Steve,

Here is my understanding: as long as you implement InputFormat, you should be able to use the hadoopFile API in SparkContext to create an RDD. Suppose you have a custom InputFormat, call it CustomizedInputFormat<K, V>, where K is the key type and V is the value type. You can create an RDD with CustomizedInputFormat in the following way:
Let sc denote the SparkContext and path the path to a file readable by CustomizedInputFormat. Then

  val rdd: RDD[(K, V)] = sc.hadoopFile(path, classOf[CustomizedInputFormat], classOf[K], classOf[V])

creates an RDD of (K, V) pairs read with CustomizedInputFormat.

Hope this helps,
Liquan

On Tue, Sep 23, 2014 at 5:13 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
> When I experimented with an InputFormat I had used in Hadoop for a long
> time, I found
> 1) it must extend org.apache.hadoop.mapred.FileInputFormat (the deprecated
> class), not org.apache.hadoop.mapreduce.lib.input.FileInputFormat
> 2) initialize needs to be called in the constructor
> 3) the key/value types - mine extends FileInputFormat<Text, Text> - must not
> be Hadoop Writables, since those are not serializable, but extends
> FileInputFormat<StringBuffer, StringBuffer> does work - I don't think this
> is allowed in Hadoop
>
> Are these statements correct? If so, it seems like most Hadoop
> InputFormats - certainly the custom ones I create - require serious
> modifications to work. Does anyone have samples of using a Hadoop
> InputFormat?
>
> Since I am working with problems where a directory with multiple files is
> processed, and some of the files are many gigabytes in size with multiline
> complex records, an input format is a requirement.

--
Liquan Pei
Department of Physics
University of Massachusetts Amherst
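
For a more complete illustration of the same pattern, here is a minimal, self-contained sketch. It uses the built-in KeyValueTextInputFormat (from the old org.apache.hadoop.mapred API, which is what hadoopFile expects) as a stand-in for a custom format, with a placeholder input path and app name. It also copies the Text Writables into plain Strings right away, which is one way around the serialization issue Steve raises:

  import org.apache.hadoop.io.Text
  import org.apache.hadoop.mapred.KeyValueTextInputFormat
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.rdd.RDD

  object HadoopFileExample {
    def main(args: Array[String]): Unit = {
      // local[*] is just for local testing of the sketch
      val conf = new SparkConf().setAppName("hadoopFile example").setMaster("local[*]")
      val sc = new SparkContext(conf)

      // hadoopFile takes an InputFormat from the old org.apache.hadoop.mapred API;
      // newAPIHadoopFile is the counterpart for org.apache.hadoop.mapreduce formats.
      val raw: RDD[(Text, Text)] = sc.hadoopFile(
        "/path/to/input",                 // placeholder path
        classOf[KeyValueTextInputFormat], // stand-in for CustomizedInputFormat
        classOf[Text],
        classOf[Text])

      // Text is a Writable, not java.io.Serializable, and the RecordReader may
      // reuse record objects, so copy them into plain Strings immediately.
      val rdd: RDD[(String, String)] = raw.map { case (k, v) => (k.toString, v.toString) }

      rdd.take(5).foreach(println)
      sc.stop()
    }
  }

With a custom format you would substitute classOf[CustomizedInputFormat] and its key/value classes in the same call; the map-to-String step is only needed when the key or value types are Writables that you intend to shuffle, cache, or collect.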