Yes, I am using the Apex runner.

Setting withNumShards(1) produced a single output file. Thanks, that is exactly
what I was looking for.
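
For reference, a minimal sketch of the change (the surrounding transform is
taken from the snippet quoted below; only the withNumShards(1) call is new):

    .apply("WriteToHDFS",
        Write.to(new HDFSFileSink(options.getOutput(), TextOutputFormat.class))
            // force a single shard so all output lands in one file
            .withNumShards(1));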

Regards,
Sandeep

On Mon, Nov 28, 2016 at 6:14 PM, Amit Sela <[email protected]> wrote:

> Hi Sandeep,
>
> Are you using the Apex runner? The default number of output shards is
> runner-dependent.
>
> You can set the number of write shards; see the Write documentation:
> https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Write.java#L75
>
> Does that answer your question?
>
> On Mon, Nov 28, 2016 at 2:22 PM Sandeep Deshmukh <[email protected]>
> wrote:
>
>> I am able to read from and write to HDFS successfully. Here is my code
>> snippet.
>>
>>         p.apply("ReadFromHDFS",
>>                 Read.from(HDFSFileSource.from(options.getInputFile(),
>>                         TextInputFormat.class, LongWritable.class, Text.class)))
>>          .apply("ExtractPayload", ParDo.of(new ExtractString()))
>>          .apply(new CountWords())
>>          .apply("WriteToHDFS", Write.to(new HDFSFileSink(options.getOutput(),
>>                  TextOutputFormat.class)));
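>>
>> (A hypothetical sketch of the ExtractString DoFn used above; the thread
>> does not show it, but going by Ismaël's anonymous DoFn and the note further
>> down about moving the DoFn into a static class to avoid serialisation
>> issues, it would look roughly like this:)
>>
>>         // static nested class so Beam can serialise it without capturing
>>         // the enclosing instance
>>         static class ExtractString extends DoFn<KV<LongWritable, Text>, String> {
>>             @ProcessElement
>>             public void processElement(ProcessContext c) throws Exception {
>>                 // keep only the Text payload; drop the LongWritable offset key
>>                 c.output(c.element().getValue().toString());
>>             }
>>         }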
>>
>> One observation is that the output is one file per word. How can I store
>> all the data in a single file and avoid creating one file per word?
>>
>> Regards,
>> Sandeep
>>
>> On Sat, Nov 26, 2016 at 9:45 AM, Sandeep Deshmukh <
>> [email protected]> wrote:
>>
>> Thanks, it helped. I faced serialisation issues, so I moved the DoFn code
>> into a static class, and it worked.
>>
>> I used the following imports:
>>
>> import org.apache.hadoop.io.LongWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
>>
>> At least I am now able to move on with the input side. I will check the
>> output side too.
>>
>> Thanks & Regards,
>> Sandeep
>>
>> On Sat, Nov 26, 2016 at 1:35 AM, Ismaël Mejía <[email protected]> wrote:
>>
>> Hello,
>>
>> I achieved this (reading a text file from HDFS with HDFSFileSource) like this:
>>
>>         PCollection<String> data = pipeline
>>                 .apply("ReadFromHDFS", Read.from(
>>                         HDFSFileSource.from(options.getInput(),
>>                                 TextInputFormat.class, LongWritable.class,
>>                                 Text.class)))
>>                 .apply("ExtractPayload",
>>                         ParDo.of(new DoFn<KV<LongWritable, Text>, String>() {
>>                             @ProcessElement
>>                             public void processElement(ProcessContext c)
>>                                     throws Exception {
>>                                 c.output(c.element().getValue().toString());
>>                             }
>>                         }));
>>
>> Probably there is a better way, but this one worked for me. Writing was
>> easier, I think; it was something like this:
>>
>>             .apply("WriteToHDFS",
>>                     Write.to(new HDFSFileSink(options.getOutput(),
>>                             TextOutputFormat.class)));
>>
>> Hope it helps,
>> Ismaël
>>
>>
>> On Fri, Nov 25, 2016 at 6:44 PM, Sandeep Deshmukh <
>> [email protected]> wrote:
>>
>> Hi,
>>
>> I am trying to use the recently added support for the Apex runner, to run
>> the program[1] using Apex on a Hadoop cluster. The program launches
>> successfully.
>>
>> I would like to change the input and output to HDFS. I looked at
>> HDFSFileSource and am planning to use it. I would be reading a simple text
>> file from HDFS and writing to HDFS the same way.
>>
>> I tried something like the snippet below, but it looks like I am missing
>> something trivial.
>>
>>         Pipeline p = Pipeline.create(options);
>>         HDFSFileSource<IntWritable, Text> source =
>>                 HDFSFileSource.from("filePath",
>>                         SequenceFileInputFormat.class, IntWritable.class,
>>                         Text.class);
>>
>>         p.apply("ReadLines", source)
>>                 .apply(new CountWords())
>>                 .apply(....)
>>
>> What would be the right format and <Key, Value> types to use for this?
>>
>> [1] https://github.com/tweise/apex-samples/blob/master/beam-apex-wordcount/src/main/java/com/example/myapexapp/Application.java
>>
>> Regards,
>> Sandeep
>>
