Hi Elliot,

right now, I see the following options to read CSV/TSV files:
- Read CSV files (ExecutionEnvironment.readCsvFile()) into Tuples (at most 25 fields for Java, 22 for Scala) and map the Tuples to POJOs in a subsequent Map function (if necessary). I would recommend this approach if the field limit is not a problem for you. The CsvReader can be configured in several ways. For example, the record and field delimiters (',', '\t', ...) can be changed.
- Read the CSV file as a text file (ExecutionEnvironment.readTextFile()), which gives you each line of the file as a String. You can parse that line and create a POJO from it in a subsequent Map function (just as you did in your example). This is more generic but leaves the parsing of the line up to you.

See the DataSource documentation for details:
0.8.1: http://ci.apache.org/projects/flink/flink-docs-release-0.8/programming_guide.html#data-sources
0.9-SNAPSHOT: http://ci.apache.org/projects/flink/flink-docs-master/programming_guide.html#data-sinks

Best,
Fabian

2015-03-05 10:58 GMT+01:00 Robert Metzger <rmetz...@apache.org>:

> Hi Elliot,
>
> Right now there is no tooling support for reading CSV/TSV data into a
> POJO, but there is a pull request open where a user contributes such a
> feature: https://github.com/apache/flink/pull/426
> So it's probably only a matter of days until it is available in master.
>
> Your suggested approach of using a mapper is perfectly fine.
> You can do it a bit easier by using env.readCsvFile(). It will do the
> parsing into the types for you.
>
> Sorry that the feature is not already available for you.
>
> Please let us know if you have more questions regarding Flink.
>
>
> Best,
> Robert
>
>
> On Thu, Mar 5, 2015 at 10:18 AM, Elliot West <tea...@gmail.com> wrote:
>
>> Hello,
>>
>> As a new Flink user I wondered if there are any existing approaches or
>> practices for reading file formats such as CSV, TSV, etc. as DataSets or
>> POJOs?
>> My current approach can be illustrated with a contrived example:
>>
>> // Simulating a TSV file DataSet
>> DataSet<String> tsvRatings = env.fromElements("category-1\t10");
>>
>> // Mapping to a POJO
>> DataSet<Rating> ratings = tsvRatings.map(line -> {
>>   String[] elements = line.split("\t");
>>   return new Rating(elements[0], Integer.parseInt(elements[1]));
>> });
>>
>> While such a mapping could be implemented in a more general form, I'm
>> keen to avoid wheel reinvention and therefore wonder if there are already
>> good ways of doing this?
>>
>> Thanks - Elliot.
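
For the second option above (reading the file as text and parsing each line yourself), the parsing step could be pulled out into a small helper so that the delimiter is configurable and malformed lines fail with a clear message. The following is only a rough sketch in plain Java with no Flink dependency. The class name TsvRatingParser, the method parseLine and the field names of Rating are made up for illustration; only the Rating(String, int) constructor shape is taken from the example in the question.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink: turns one delimited line into a Rating POJO.
final class TsvRatingParser {

    // Assumed shape of the Rating POJO from the question; the field names are guesses.
    static final class Rating {
        final String category;
        final int score;

        Rating(String category, int score) {
            this.category = category;
            this.score = score;
        }
    }

    // Splits one line on a literal delimiter and builds a Rating from its two fields.
    static Rating parseLine(String line, String delimiter) {
        // Pattern.quote treats the delimiter literally; the -1 limit keeps trailing empty fields.
        String[] fields = line.split(Pattern.quote(delimiter), -1);
        if (fields.length != 2) {
            throw new IllegalArgumentException(
                "Expected 2 fields but found " + fields.length + ": " + line);
        }
        return new Rating(fields[0], Integer.parseInt(fields[1].trim()));
    }

    public static void main(String[] args) {
        Rating rating = parseLine("category-1\t10", "\t");
        System.out.println(rating.category + " -> " + rating.score); // prints "category-1 -> 10"
    }
}
```

Inside a job, such a helper would presumably be called from the Map function in the same way as in the original example, e.g. tsvRatings.map(line -> TsvRatingParser.parseLine(line, "\t")).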