Hi Elliot,

Right now, I see the following options for reading CSV/TSV files:

- Read CSV files (ExecutionEnvironment.readCsvFile()) into Tuples (at most
25 fields in Java, 22 in Scala) and, if necessary, map the Tuples to POJOs
in a subsequent Map function. I would recommend this approach if the field
limit is not a problem for you. The CsvReader can be configured in several
ways; for example, the record and field delimiters (',', '\t', ...) can be
changed.

- Read the CSV file as a text file (ExecutionEnvironment.readTextFile()),
which gives you each line of the file as a String. You can parse that line
and create a POJO from it in a subsequent Map function (just as you did in
your example). This is more generic but leaves the parsing of each line up
to you.
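
As a rough sketch of the second option, the per-line parsing can live in a
small plain-Java helper that the Map function calls. The Rating class and
the RatingParser helper below are hypothetical stand-ins (only the name
Rating appears in your example), and the two-field "category<TAB>score"
layout is assumed from your sample line. Nothing here depends on Flink.

```java
/**
 * Hypothetical stand-in for the Rating class from the example.
 * Assumption: for Flink to treat it as a POJO type, the real class would
 * also need to be public and live in its own file.
 */
final class Rating {
    public String category;
    public int score;

    public Rating() {}

    public Rating(String category, int score) {
        this.category = category;
        this.score = score;
    }
}

/** Hypothetical helper that turns one TSV line into a Rating. */
final class RatingParser {
    private RatingParser() {}

    /** Parses one "category<TAB>score" line and rejects malformed input. */
    static Rating parse(String line) {
        // A limit of -1 keeps trailing empty fields, so "a\t" yields 2 fields.
        String[] fields = line.split("\t", -1);
        if (fields.length != 2) {
            throw new IllegalArgumentException(
                "Expected 2 fields but got " + fields.length + ": " + line);
        }
        // Integer.parseInt throws NumberFormatException on a non-numeric score.
        return new Rating(fields[0], Integer.parseInt(fields[1].trim()));
    }
}
```

Inside a job, the helper would be called from the Map function applied to
the result of readTextFile(), in the same way as the lambda in your
example. Keeping the parsing in one place also makes it easy to decide
what to do with malformed lines (fail, skip, or log them).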

See the DataSource documentation for details:
0.8.1:
http://ci.apache.org/projects/flink/flink-docs-release-0.8/programming_guide.html#data-sources
0.9-SNAPSHOT:
http://ci.apache.org/projects/flink/flink-docs-master/programming_guide.html#data-sources

Best, Fabian

2015-03-05 10:58 GMT+01:00 Robert Metzger <rmetz...@apache.org>:

> Hi Elliot,
>
> Right now there is no tooling support for reading CSV/TSV data into a
> POJO, but there is a pull request open where a user contributes such a
> feature: https://github.com/apache/flink/pull/426
> So it's probably only a matter of days until it is available in master.
>
> Your suggested approach of using a mapper is perfectly fine.
> You can make it a bit easier by using env.readCsvFile(), which parses
> the fields into the types for you.
>
> Sorry that the feature is not yet available for you.
>
> Please let us know if you have more questions regarding Flink.
>
>
> Best,
> Robert
>
>
> On Thu, Mar 5, 2015 at 10:18 AM, Elliot West <tea...@gmail.com> wrote:
>
>> Hello,
>>
>> As a new Flink user I wondered if there are any existing approaches or
>> practices for reading file formats such as CSV, TSV, etc. as DataSets or
>> POJOs? My current approach can be illustrated with a contrived example:
>>
>> // Simulating a TSV file DataSet
>>
>> DataSet<String> tsvRatings = env.fromElements("category-1\t10");
>>
>> // Mapping to a POJO
>>
>> DataSet<Rating> ratings = tsvRatings.map(line -> {
>>   String[] elements = line.split("\t");
>>   return new Rating(elements[0], Integer.parseInt(elements[1]));
>> });
>>
>>
>> While such a mapping could be implemented in a more general form, I'm
>> keen to avoid wheel reinvention and therefore wonder if there are already
>> good ways of doing this?
>>
>> Thanks - Elliot.
>>
>>
>
