Hi Robert,

We are not using HDFS. We have a large file that is already split into 8 parts, each of them on a node that runs a separate task manager, at the same path and with the same name. The job manager runs on another node. If I start a job that uses `readTextFile`, I get an exception saying that the input file was not found and the splits could not be created. (The exception disappears if I create an empty file with the given name on the job manager.)

What I'd like is to read a different file on each node and process that. Is there a way to do this?
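For reference, here is the direction I'm experimenting with: a custom `InputFormat` that creates one generic split per parallel instance, so the job manager never needs to see the file, and each task manager opens the same node-local path in `open()`. This is only a rough, untested sketch (the class name and all details are made up by me, not taken from any Flink example):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

import org.apache.flink.api.common.io.DefaultInputSplitAssigner;
import org.apache.flink.api.common.io.InputFormat;
import org.apache.flink.api.common.io.statistics.BaseStatistics;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.io.GenericInputSplit;
import org.apache.flink.core.io.InputSplitAssigner;

// Sketch: every parallel instance reads the same path on its own node.
public class LocalFileInputFormat implements InputFormat<String, GenericInputSplit> {

    private final String localPath;          // same path on every task manager
    private transient BufferedReader reader; // per-task state, not serialized
    private transient String nextLine;

    public LocalFileInputFormat(String localPath) {
        this.localPath = localPath;
    }

    @Override
    public void configure(Configuration parameters) {
        // nothing to configure
    }

    @Override
    public BaseStatistics getStatistics(BaseStatistics cachedStatistics) {
        return null; // no statistics; the job manager cannot see the files
    }

    @Override
    public GenericInputSplit[] createInputSplits(int minNumSplits) {
        // One generic split per parallel instance. No file metadata is
        // touched here, so this should run fine on the job manager.
        GenericInputSplit[] splits = new GenericInputSplit[minNumSplits];
        for (int i = 0; i < minNumSplits; i++) {
            splits[i] = new GenericInputSplit(i, minNumSplits);
        }
        return splits;
    }

    @Override
    public InputSplitAssigner getInputSplitAssigner(GenericInputSplit[] splits) {
        return new DefaultInputSplitAssigner(splits);
    }

    @Override
    public void open(GenericInputSplit split) throws IOException {
        // Runs on the task manager, so this opens the node-local file.
        reader = new BufferedReader(new FileReader(localPath));
        nextLine = reader.readLine();
    }

    @Override
    public boolean reachedEnd() {
        return nextLine == null;
    }

    @Override
    public String nextRecord(String reuse) throws IOException {
        String current = nextLine;
        nextLine = reader.readLine();
        return current;
    }

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}
```

With 8 task managers, one slot each, and parallelism 8, I would expect every node to get exactly one split and therefore read its own file, but I don't know whether the default split assigner actually guarantees that, or whether a fast node could grab two splits. Does that look like the right direction?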
Thanks,
Daniel

On Sun, Jun 14, 2015 at 8:31 PM, Robert Metzger <rmetz...@apache.org> wrote:

> Hi Daniel,
>
> Are the files in HDFS?
> What exactly do you mean by "`readTextFile` wants to read the file on the
> JobManager"? The JobManager does not read input files.
> Also, Flink assigns input splits locally (when reading from distributed
> file systems). In the JobManager log you can see how many splits are
> assigned locally and how many do remote reads. Usually the number of
> remote reads is very low.
>
> On Sun, Jun 14, 2015 at 11:18 AM, Dániel Bali <balijanosdan...@gmail.com>
> wrote:
>
>> Hi Márton,
>>
>> Thanks for the reply! I suppose I have to implement `createInputSplits`
>> too, then. I tried looking at the documentation for the InputFormat
>> interface, but I can't see how I could force it to load separate files
>> on separate task managers instead of one file on the job manager. Where
>> is this behavior decided? Or am I misunderstanding something about how
>> this all works?
>>
>> Cheers,
>> Daniel
>>
>> On Sun, Jun 14, 2015 at 7:02 PM, Márton Balassi <balassi.mar...@gmail.com>
>> wrote:
>>
>>> Hi Dani,
>>>
>>> The batch API does not expose an addSource-like method, but you can
>>> always write your own InputFormat and pass it directly to the
>>> constructor of the DataSource. DataSource extends DataSet, so you will
>>> get all the usual methods in the end. For an example, have a look
>>> here. [1]
>>>
>>> [1]
>>> https://github.com/dataArtisans/flink-dataflow/blob/master/src/main/java/com/dataartisans/flink/dataflow/translation/FlinkTransformTranslators.java#L133
>>>
>>> Best,
>>>
>>> Marton
>>>
>>> On Sun, Jun 14, 2015 at 4:34 PM, Dániel Bali <balijanosdan...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> We are running an experiment on a cluster and we have a large input
>>>> split into multiple files. We'd like to run a Flink job that reads the
>>>> local file on each instance and processes it. Is there a way to do
>>>> this in the batch environment? `readTextFile` wants to read the file
>>>> on the JobManager and split it right there, which is not what we want.
>>>>
>>>> We solved it in the streaming environment by using `addSource`, but
>>>> there is no similar function in the batch version. Does anybody know
>>>> how this could be done?
>>>>
>>>> Thanks!
>>>> Daniel
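P.S. If the format sketch above is sound, I assume the job itself would be wired up roughly like this (also untested; the paths and job name are placeholders):

```java
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class LocalReadJob {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(8); // one parallel instance per task manager

        // LocalFileInputFormat is the sketch from earlier in this message.
        DataSet<String> lines = env.createInput(new LocalFileInputFormat("/data/input/part"));
        lines.writeAsText("/data/output/result");

        env.execute("node-local read");
    }
}
```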