Re: Reading data from HDFS

Sandeep Deshmukh Fri, 25 Nov 2016 20:15:57 -0800

Thanks. It helped. I faced serialisation issues, so moved the DoFn code
into a static class and it worked.


Used following imports.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

At least now able to move on for input side. Would check for output too.

Thanks  & Regards,
Sandeep

On Sat, Nov 26, 2016 at 1:35 AM, Ismaël Mejía <[email protected]> wrote:

> Hello,
>
> I achieved this (reading a text file from HdfsSource) like this:
>
>         PCollection<String> data = pipeline
>                 .apply("ReadFromHDFS", Read.from(
>                         HDFSFileSource.from(options.getInput(),
>                                 TextInputFormat.class,
> LongWritable.class, Text.class))
>                 )
>                 .apply("ExtractPayload", ParDo.of(new
> DoFn<KV<LongWritable, Text>, String>() {
>                     @ProcessElement
>                     public void processElement(ProcessContext c) throws
> Exception {
>                         c.output(c.element().getValue().toString());
>                     }
>                 }));
>
> Probably there is a better way, but this one worked for me. Writing I
> think it was easier, I think it was something like this:
>
>             .apply("WriteToHDFS", Write.to(new 
> HDFSFileSink(options.getOutput(),
> TextOutputFormat.class))
>
> Hope it helps,
> Ismaël
>
>
> On Fri, Nov 25, 2016 at 6:44 PM, Sandeep Deshmukh <[email protected]
> > wrote:
>
>> Hi,
>>
>> I am trying to user recently added support for Apex runner. This is to
>> run the program[1] using Apex on Hadoop cluster. The program is getting
>> launched successfully.
>>
>> I would like to change the input and output to HDFS. I looked at the
>> HDFSFileSource and planning to use the same. I would reading simple text
>> file from HDFS and same way writing to HDFS.
>>
>> I tried something like below, but looks like missing something trivial.
>>
>>  Pipeline p = Pipeline.create(options);
>> HDFSFileSource<IntWritable, Text> source =
>>                 HDFSFileSource.from("filePath",
>>                     SequenceFileInputFormat.class, IntWritable.class,
>> Text.class);
>>
>> p.apply("ReadLines", source)
>>        .apply(new CountWords())
>>        .apply(....)
>>
>> What would be the right format and <Key,Value> to use for this.
>>
>> [1] https://github.com/tweise/apex-samples/blob/master/beam-apex
>> -wordcount/src/main/java/com/example/myapexapp/Application.java
>>
>> Regards,
>> Sandeep
>>
>
>

Re: Reading data from HDFS

Reply via email to