Yes, you are right. SplittableIterator would cause each worker to list all the
files. Thanks!

Best,
Lu

On Fri, Aug 16, 2019 at 12:33 AM Zhu Zhu <reed...@gmail.com> wrote:

> Hi Lu,
>
> I think it's OK to choose either way, as long as it works.
> Though I'm not sure how you would extend SplittableIterator in your case:
> the underlying format is ParallelIteratorInputFormat, and its processing is
> not tied to a particular subtask index.
>
> Thanks,
> Zhu Zhu
>
> Lu Niu <qqib...@gmail.com> wrote on Fri, Aug 16, 2019 at 12:48 AM:
>
>> Hi, Zhu
>>
>> Thanks for the reply! I found that using SplittableIterator is also doable
>> to some extent. How should I choose between the two?
>>
>> Best
>> Lu
>>
>> On Wed, Aug 14, 2019 at 8:02 PM Zhu Zhu <reed...@gmail.com> wrote:
>>
>>> Hi Lu,
>>>
>>> Implementing your own *InputFormat* and the *InputSplitAssigner* it
>>> creates (which has the interface "InputSplit getNextInputSplit(String
>>> host, int taskId)") should work if you want to assign InputSplits to
>>> tasks according to the task index and file name patterns.
>>> To assign 2 *InputSplit*s in one request, you can implement a new
>>> *InputSplit* that wraps multiple *FileInputSplit*s. You may also need to
>>> define in your *InputFormat* how to process the new *InputSplit*.
>>>
>>> Thanks,
>>> Zhu Zhu
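
[Editor's note: the scheme above can be sketched as follows. This is a
minimal, self-contained illustration of the assignment logic only; the types
below (MultiFileSplit, createSplits, getNextInputSplit) are hypothetical
stand-ins, not Flink's real InputSplit/InputSplitAssigner interfaces.]

```java
import java.util.*;

// Sketch: group files into fixed-size "multi-file splits" and hand each
// split to the subtask whose index matches the split number. Simplified
// stand-ins for Flink's InputSplit / InputSplitAssigner, not the real API.
public class MultiFileAssignment {

    // Stand-in for a custom InputSplit wrapping several files.
    static final class MultiFileSplit {
        final int splitNumber;
        final List<String> files;
        MultiFileSplit(int splitNumber, List<String> files) {
            this.splitNumber = splitNumber;
            this.files = files;
        }
    }

    // Build one split per task: sort files by name, then chunk them so that
    // e.g. file01+file02 form split 0 and file03+file04 form split 1.
    static List<MultiFileSplit> createSplits(List<String> files, int filesPerSplit) {
        List<String> sorted = new ArrayList<>(files);
        Collections.sort(sorted);
        List<MultiFileSplit> splits = new ArrayList<>();
        for (int i = 0; i < sorted.size(); i += filesPerSplit) {
            List<String> chunk =
                sorted.subList(i, Math.min(i + filesPerSplit, sorted.size()));
            splits.add(new MultiFileSplit(splits.size(), new ArrayList<>(chunk)));
        }
        return splits;
    }

    // Stand-in for InputSplitAssigner.getNextInputSplit(host, taskId):
    // each subtask receives exactly the split whose number equals its index,
    // and null once its split has already been handed out.
    static MultiFileSplit getNextInputSplit(List<MultiFileSplit> splits,
                                            Set<Integer> assigned, int taskId) {
        if (taskId < splits.size() && assigned.add(taskId)) {
            return splits.get(taskId);
        }
        return null; // nothing left for this subtask
    }

    public static void main(String[] args) {
        List<MultiFileSplit> splits = createSplits(
                Arrays.asList("file03", "file01", "file04", "file02"), 2);
        Set<Integer> assigned = new HashSet<>();
        System.out.println(getNextInputSplit(splits, assigned, 0).files); // [file01, file02]
        System.out.println(getNextInputSplit(splits, assigned, 1).files); // [file03, file04]
        System.out.println(getNextInputSplit(splits, assigned, 0));       // null
    }
}
```

In a real implementation, MultiFileSplit would implement
org.apache.flink.core.io.InputSplit, and the assignment logic would live in
the InputSplitAssigner returned by the custom InputFormat, as Zhu Zhu
describes.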
>>>
>>> Lu Niu <qqib...@gmail.com> wrote on Thu, Aug 15, 2019 at 12:26 AM:
>>>
>>>> Hi,
>>>>
>>>> I have a data set backed by a directory of files in which file names
>>>> are meaningful.
>>>>
>>>> folder1
>>>>    +-----file01
>>>>    +-----file02
>>>>    +-----file03
>>>>    +-----file04
>>>>
>>>> I want to control the file assignment in my Flink application. For
>>>> example, when the parallelism is 2, worker 1 gets file01 and file02 to
>>>> read and worker 2 gets file03 and file04. Each worker should also get
>>>> both of its files at once, because reading requires jumping back and
>>>> forth between the two files.
>>>>
>>>> What's the best way to do this? It seems FileInputFormat is not
>>>> extensible enough for this case.
>>>>
>>>> Best
>>>> Lu
>>>>