I'm afraid the filter operations are not chained, because at least two of
them consume the same DataSet as input. It is true, however, that the
intermediate results are not materialized.

It is also correct that the filter operators are deployed co-located with
the data sources, so there is no network traffic. However, the data will
still be serialized and deserialized between the non-chained operators,
even if they reside on the same machine.
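
For reference, here is a minimal sketch (Java DataSet API) of the pattern
under discussion. The fromElements source stands in for the custom
InputFormat, and the tag names and output paths are made up:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class TaggedSplitSketch {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env =
                    ExecutionEnvironment.getExecutionEnvironment();

            // Stand-in for the custom InputFormat: each record is tagged
            // (field 0) with the name of the dataset it belongs to. In the
            // real job these tuples would come from
            // env.readFile(yourInputFormat, path).
            DataSet<Tuple2<String, String>> tagged = env.fromElements(
                    Tuple2.of("pathA", "<a>...</a>"),
                    Tuple2.of("pathB", "<b>...</b>"),
                    Tuple2.of("pathA", "<a>...</a>"));

            // Two filters consuming the same DataSet. Because the source
            // has two downstream consumers, the filters are not chained to
            // it, so records are serialized/deserialized on the way to each
            // filter, even when source and filter run in the same
            // TaskManager.
            DataSet<Tuple2<String, String>> dataSetA =
                    tagged.filter(t -> t.f0.equals("pathA"));
            DataSet<Tuple2<String, String>> dataSetB =
                    tagged.filter(t -> t.f0.equals("pathB"));

            dataSetA.writeAsText("/tmp/out-a");
            dataSetB.writeAsText("/tmp/out-b");

            // One plan, one execution: the input is read only once, even
            // though two datasets are produced.
            env.execute("tagged split sketch");
        }
    }

Calling env.getExecutionPlan() instead of execute() returns the plan as a
JSON string, which shows how the operators end up connected.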



On Thu, Oct 22, 2015 at 11:49 AM, Gábor Gévay <gga...@gmail.com> wrote:

> Hello!
>
> > I have thought about a workaround where the InputFormat would return
> > Tuple2s whose first field is the name of the dataset to which a record
> > belongs. This would, however, require me to filter the read data once
> > for each dataset, or to do a groupReduce, which is some overhead I'm
> > looking to prevent.
>
> I think that those two filters might not have that much overhead,
> because of several optimizations Flink does under the hood:
> - The dataset of Tuple2s won't be materialized, but instead will be
> streamed directly to the two filter operators.
> - The input format and the two filters will probably end up on the
> same machine, because of chaining, so there won't be
> serialization/deserialization between them.
>
> Best,
> Gabor
>
>
>
> 2015-10-22 11:38 GMT+02:00 Pieter Hameete <phame...@gmail.com>:
> > Good morning!
> >
> > I have the following use case:
> >
> > My program reads nested data (in this specific case XML) based on
> > projections (path expressions) of this data. Often multiple paths are
> > projected onto the same input. I would like each path to result in its
> > own dataset.
> >
> > Is it possible to generate more than one dataset using a single readFile
> > operation, to prevent reading the input twice?
> >
> > I have thought about a workaround where the InputFormat would return
> > Tuple2s whose first field is the name of the dataset to which a record
> > belongs. This would, however, require me to filter the read data once
> > for each dataset, or to do a groupReduce, which is some overhead I'm
> > looking to prevent.
> >
> > Is there a better (lower-overhead) workaround for doing this? Or is
> > there some mechanism in Flink that would allow me to do this?
> >
> > Cheers!
> >
> > - Pieter
>
