Hi,
The sampling functions are exposed in
org.apache.flink.api.java.utils.DataSetUtils, so you can write
something like:
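final HadoopInputFormat<LongWritable, Text> inputFormat =
    HadoopInputs.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, hdfsPath);
final DataSet<Tuple2<LongWritable, Text>> input =
    environment.createInput(inputFormat).withParameters(configs);
// sampleWithSize(input, withReplacement, numSamples) draws a fixed number of records;
// sample(input, withReplacement, fraction) takes a fraction instead, e.g. 20.0 / 150.0.
final DataSet<Tuple2<LongWritable, Text>> output =
    DataSetUtils.sampleWithSize(input, false, 20000000);
// print() collects the result on the client; for 20 million records a file sink is likely the better choice.
output.print();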
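For intuition on what a fixed-size sample without replacement does, here is a minimal plain-Java sketch of reservoir sampling (Algorithm R). It is only an illustration of the general technique, not Flink's implementation; the class name ReservoirSampleSketch and the method sample are made up for this example, and the data is a toy list standing in for the HDFS records.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Illustrative only: plain-Java reservoir sampling (Algorithm R), not Flink code.
// The class and method names are hypothetical.
class ReservoirSampleSketch {

    /** Returns up to k elements drawn uniformly without replacement in a single pass. */
    static <T> List<T> sample(Iterator<T> records, int k, long seed) {
        Random random = new Random(seed);
        List<T> reservoir = new ArrayList<>(k);
        long seen = 0;
        while (records.hasNext()) {
            T record = records.next();
            seen++;
            if (reservoir.size() < k) {
                // The first k records fill the reservoir directly.
                reservoir.add(record);
            } else {
                // Pick a slot uniformly in [0, seen); the new record replaces
                // an existing one only if the slot falls inside the reservoir,
                // which happens with probability k / seen.
                long slot = (long) (random.nextDouble() * seen);
                if (slot < k) {
                    reservoir.set((int) slot, record);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 150; i++) {
            data.add(i);
        }
        // 20 out of 150, mirroring the 20 million out of 150 million in the question.
        System.out.println(sample(data.iterator(), 20, 42L));
    }
}
```

The point of the single pass is that the input never has to be held in memory as a whole; only the k sampled records are kept. If an approximate sample size is good enough, the fraction-based variant avoids keeping a reservoir at all.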
Regards,
Timo
On 11/21/17 at 2:46 PM, sohimankotia wrote:
Hi,
I have a directory in HDFS containing 20 files with 150 million records.
I just want 20 million random records from that directory (sampled data).
I see that there are a few implementations in Flink:
https://github.com/eBay/Flink/tree/master/flink-java/src/main/java/org/apache/flink/api/java/sampling
Can someone provide a code example showing how to use these?
Here is my code to read from the HDFS files:
final org.apache.flink.api.java.hadoop.mapred.HadoopInputFormat<LongWritable, Text> inputFormat =
    HadoopInputs.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, hdfsPath);
final DataSource<Tuple2<LongWritable, Text>> input =
    environment.createInput(inputFormat).withParameters(configs);
--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/