Re: Dataset filter improvement

Stephan Ewen Wed, 10 Feb 2016 02:48:17 -0800

Why not use an abstract base class and N subclasses?
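
A minimal sketch of that idea, in plain Java with lists standing in for
Flink data sets so that it runs without any dependency. The names
GeneratedObj, BaseClassSketch, mapAll, and only are hypothetical, as are the
"type1"/"type3" tags; MyObj1 and MyObj3 are borrowed from the quoted example.
In a real job the single pass would presumably be a flatMap producing a data
set of the base type.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical common base type for everything the mapper can generate.
abstract class GeneratedObj {
    final String key;

    GeneratedObj(String key) {
        this.key = key;
    }
}

final class MyObj1 extends GeneratedObj {
    MyObj1(String key) {
        super(key);
    }
}

final class MyObj3 extends GeneratedObj {
    MyObj3(String key) {
        super(key);
    }
}

final class BaseClassSketch {
    // One pass over the input yields a single, uniformly typed collection.
    // Each input element is a {typeTag, payload} pair (an assumption).
    static List<GeneratedObj> mapAll(List<String[]> start) {
        List<GeneratedObj> out = new ArrayList<>();
        for (String[] t : start) {
            if ("type1".equals(t[0])) {
                out.add(new MyObj1(t[1]));
            } else if ("type3".equals(t[0])) {
                out.add(new MyObj3(t[1]));
            }
            // Unknown tags emit nothing, which combines map and filter.
        }
        return out;
    }

    // Later stages narrow the collection to one subclass with a type check.
    static <T extends GeneratedObj> List<T> only(List<GeneratedObj> all, Class<T> type) {
        List<T> out = new ArrayList<>();
        for (GeneratedObj o : all) {
            if (type.isInstance(o)) {
                out.add(type.cast(o));
            }
        }
        return out;
    }
}
```

The source is read once, and each typed view is a cheap instanceof-style
filter over the shared result.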

On Wed, Feb 10, 2016 at 10:05 AM, Fabian Hueske <fhue...@gmail.com> wrote:


> Unfortunately, there is no Either<1,...,n>.
> You could implement something like a Tuple3<Option<Type1>, Option<Type2>,
> Option<Type3>>. However, Flink does not provide an Option type (Java 8
> comes with Optional). You would need to implement it yourself, including
> the TypeInfo and Serializer. You can get some inspiration from the Either
> type info/serializer if you want to go this way.
>
> Using a byte array would also work but doesn't look much easier than the
> Option approach to me.
>
> 2016-02-10 9:47 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>
>> Yes, I then join the intermediate datasets I create with each other.
>> What I'd need is an Either<1,...,n>. Would it be possible to add that?
>> Otherwise I was thinking of generating a Tuple2<String,byte[]> and, in the
>> subsequent filter+map/flatMap, deserializing only those elements I want to
>> group together (e.g. "someEventType".equals(t.f0)) in order to generate
>> the typed datasets.
>> Which one do you think is the best solution?
>>
>> On Wed, Feb 10, 2016 at 9:40 AM, Fabian Hueske <fhue...@gmail.com> wrote:
>>
>>> Hi Flavio,
>>>
>>> I did not completely understand which objects should go where, but here
>>> are some general guidelines:
>>>
>>> - early filtering is mostly a good idea (unless evaluating the filter
>>> expression is very expensive)
>>> - you can use a flatMap function to combine a map and a filter
>>> - applying multiple functions on the same data set does not necessarily
>>> materialize the data set (in memory or on disk). In most cases it prevents
>>> chaining, hence there is serialization overhead. In some cases where the
>>> forked data streams are joined again, the data set must be materialized in
>>> order to avoid deadlocks.
>>> - it is not possible to write a map that generates two different types,
>>> but you could implement a mapper that returns an Either<First, Second> type.
>>>
>>> Hope this helps,
>>> Fabian
>>>
>>> 2016-02-10 8:43 GMT+01:00 Flavio Pompermaier <pomperma...@okkam.it>:
>>>
>>>> Any help on this?
>>>> On 9 Feb 2016 18:03, "Flavio Pompermaier" <pomperma...@okkam.it> wrote:
>>>>
>>>>> Hi to all,
>>>>>
>>>>> in my program I have a Dataset that generates different types of
>>>>> objects depending on the incoming element.
>>>>> Thus it's like a Map<Tuple2,Object>.
>>>>> In order to type the different generated datasets I do something
>>>>> like this:
>>>>>
>>>>> Dataset<Tuple2> start =...
>>>>>
>>>>> Dataset<MyObj1> ds1 = start.filter().map(..);
>>>>> Dataset<MyObj1> ds2 = start.filter().map(..);
>>>>> Dataset<MyObj3> ds3 = start.filter().map(..);
>>>>> Dataset<MyObj3> ds4 = start.filter().map(..);
>>>>>
>>>>> However, this is very inefficient (I think because Flink needs to
>>>>> materialize the entire source dataset for every slot).
>>>>>
>>>>> It's much more efficient to group the generation of objects of the
>>>>> same type. E.g.:
>>>>>
>>>>> Dataset<Tuple2> start =..
>>>>>
>>>>> Dataset<MyObj1> tmp1 = start.map(..);
>>>>> Dataset<MyObj3> tmp2 = start.map(..);
>>>>> Dataset<MyObj1> ds1 = tmp1.filter();
>>>>> Dataset<MyObj1> ds2 = tmp1.filter();
>>>>> Dataset<MyObj3> ds3 = tmp2.filter();
>>>>> Dataset<MyObj3> ds4 = tmp2.filter();
>>>>>
>>>>> Increasing the number of slots per task manager makes things worse
>>>>> and worse :)
>>>>> Is there a way to improve this situation? Is it possible to write a
>>>>> "map" that generates different types of objects and then filter them
>>>>> by the generated class type?
>>>>>
>>>>> Best,
>>>>> Flavio
>>>>>
>>>
>>
>>
>
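
A minimal sketch of the Either-based mapper mentioned in the quoted
guidelines, again in plain Java so that it runs standalone. Either2 is a
hand-rolled, hypothetical stand-in (Flink ships its own Either type, which
this sketch deliberately does not use), and EitherSketch, mapAll, and the
"number"/"text" tags are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical two-way Either: holds exactly one of a left or a right value.
final class Either2<L, R> {
    private final L left;
    private final R right;
    private final boolean isLeft;

    private Either2(L left, R right, boolean isLeft) {
        this.left = left;
        this.right = right;
        this.isLeft = isLeft;
    }

    static <L, R> Either2<L, R> left(L value) {
        return new Either2<L, R>(value, null, true);
    }

    static <L, R> Either2<L, R> right(R value) {
        return new Either2<L, R>(null, value, false);
    }

    boolean isLeft() {
        return isLeft;
    }

    L left() {
        if (!isLeft) {
            throw new IllegalStateException("not a left value");
        }
        return left;
    }

    R right() {
        if (isLeft) {
            throw new IllegalStateException("not a right value");
        }
        return right;
    }
}

final class EitherSketch {
    // Combined map + filter: emits zero or one Either per input element.
    // Each input element is a {typeTag, payload} pair (an assumption).
    static List<Either2<Integer, String>> mapAll(List<String[]> start) {
        List<Either2<Integer, String>> out = new ArrayList<>();
        for (String[] t : start) {
            if ("number".equals(t[0])) {
                out.add(Either2.<Integer, String>left(Integer.parseInt(t[1])));
            } else if ("text".equals(t[0])) {
                out.add(Either2.<Integer, String>right(t[1]));
            }
            // Any other tag is filtered out by emitting nothing.
        }
        return out;
    }
}
```

Downstream code can then split on isLeft() to recover the two typed views
from a single pass over the source.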
