Re: Udf Performance and Object Creation

Stephan Ewen Fri, 14 Aug 2015 09:17:59 -0700

Yes, map() is like a convenience function around mapPartition().
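
As a rough illustration of that relationship (a sketch only, not Flink's actual implementation), a per-record function can be wrapped into a per-partition function that loops over the records of its partition. The interfaces SimpleMap and SimpleMapPartition and the helper asMapPartition below are hypothetical stand-ins written for this example; they are simplified and are not the Flink classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class MapAsMapPartition {

    // Hypothetical stand-in for a per-record map function (not the Flink interface).
    interface SimpleMap<I, O> {
        O map(I value);
    }

    // Hypothetical stand-in for a per-partition function (not the Flink interface).
    interface SimpleMapPartition<I, O> {
        void mapPartition(Iterable<I> values, Consumer<O> out);
    }

    // Wraps a per-record function into a per-partition function:
    // the single mapper instance is called once for every record of the partition.
    static <I, O> SimpleMapPartition<I, O> asMapPartition(SimpleMap<I, O> mapper) {
        return (values, out) -> {
            for (I value : values) {
                out.accept(mapper.map(value));
            }
        };
    }

    public static void main(String[] args) {
        SimpleMap<String, Integer> length = String::length;
        List<Integer> results = new ArrayList<>();
        asMapPartition(length).mapPartition(Arrays.asList("a", "bb", "ccc"), results::add);
        System.out.println(results); // prints [1, 2, 3]
    }
}
```

In this sketch the mapper object is created once and then invoked per record, which matches the "one mapper instance per parallel task" description quoted below.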

On Fri, Aug 14, 2015 at 6:09 PM, Flavio Pompermaier <pomperma...@okkam.it>
wrote:


> Hi Stephan, thanks for the reply!
> Now it's clearer. If I understood correctly, map and mapPartition are
> the same iff I have only one slot per task manager, right?
>
> I was convinced I had posted those questions in this thread as the 3rd or 4th
> message. Didn't I?
> On 14 Aug 2015 17:57, "Stephan Ewen" <se...@apache.org> wrote:
>
>> Hi!
>>
>> (1) A mapper is created once per parallel task. So if you create a
>> program that runs a map() transformation with a parallelism of n, you will
>> have n mapper instances in the cluster. Some may be on the same
>> TaskManager, if the TaskManager has multiple slots.
>>
>> (2) I would really like that. But it means Java has to deal with both
>> managed and unmanaged memory at the same time, which is quite a heavy
>> addition. C# has some form of support for that.
>>
>> BTW: Where did you originally post these questions? I have not seen them
>> before...
>>
>> On Fri, Aug 14, 2015 at 5:43 PM, Flavio Pompermaier <pomperma...@okkam.it>
>> wrote:
>>
>>> Any insight into these 2 questions?
>>> On 12 Aug 2015 17:38, "Flavio Pompermaier" <pomperma...@okkam.it> wrote:
>>>
>>>> This is something I've never understood in depth: isn't a mapper
>>>> created for each record? If it's created only once per task manager, then
>>>> it's not so different from mapPartition. What am I missing here?
>>>>
>>>> And then a more philosophical question: all big data frameworks need
>>>> to manage memory very efficiently somehow (Flink even has to reserve
>>>> a fraction of the entire memory in order to have control over it). Wouldn't
>>>> it be simpler if Java finally released some APIs (even marked as unsafe,
>>>> it doesn't change that much) to allow full control of
>>>> memory? It would make a lot of sense for all big data platforms (at least
>>>> for non-UDF code...).
>>>>
>>>> Best,
>>>> Flavio
>>>> On 12 Aug 2015 12:44, "Timo Walther" <twal...@apache.org> wrote:
>>>>
>>>>> Hello Michael,
>>>>>
>>>>> Whenever you write a Java program, you should avoid object creation if
>>>>> you want it to be efficient, because every created object needs to be
>>>>> garbage collected later (which slows down your program).
>>>>> You can have small Pojos; just try to avoid calling "new" in your
>>>>> functions:
>>>>>
>>>>> Instead of:
>>>>>
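>>>>> import org.apache.flink.api.common.functions.MapFunction;
>>>>>
>>>>> class Mapper implements MapFunction<String,Pojo> {
>>>>> public Pojo map(String s) {
>>>>>     Pojo p = new Pojo(); // allocates a new object for every record
>>>>>     p.f = s;
>>>>>     return p;
>>>>> }
>>>>> }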
>>>>>
>>>>> do:
>>>>>
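>>>>> import org.apache.flink.api.common.functions.MapFunction;
>>>>>
>>>>> class Mapper implements MapFunction<String,Pojo> {
>>>>> private Pojo p = new Pojo(); // allocated once per Mapper instance
>>>>> public Pojo map(String s) {
>>>>>     p.f = s;
>>>>>     return p; // the same object is returned for every record
>>>>> }
>>>>> }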
>>>>>
>>>>> Then an object is only created once per Mapper and not per record.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Regards,
>>>>> Timo
>>>>>
>>>>>
>>>>>
>>>>> On 12.08.2015 11:53, Michael Huelfenhaus wrote:
>>>>>
>>>>>> Hello
>>>>>>
>>>>>> I have a question about programming user-defined functions: is
>>>>>> it still the case, as in the old Stratosphere times, that object creation
>>>>>> should be avoided at all costs? I ask because in some of the examples
>>>>>> Tuples and other objects are now created before returning them.
>>>>>>
>>>>>> I am going to have a streaming plan with at least 6 steps, and I am going
>>>>>> to use Pojos. Performance-wise, is it a big improvement to define one big
>>>>>> Pojo that can be used by all the steps, or is it better to have smaller
>>>>>> ones that send less data but create more objects?
>>>>>>
>>>>>> Thanks
>>>>>> Michael
>>>>>>
>>>>>
>>>>>
>>
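
One consequence of the reuse pattern in Timo's quoted example is worth spelling out: every call to map() hands back the same object, so anything that keeps a reference to an earlier result will see it overwritten by the next record. The plain-Java sketch below shows only that aliasing effect. The class names ReuseSketch, AllocatingMapper and ReusingMapper are made up for this illustration and do not use any Flink API; whether the aliasing matters in a given job depends on how the runtime and downstream functions handle records, which this thread does not settle.

```java
public class ReuseSketch {

    // Minimal Pojo with one public field, as in the quoted example.
    static class Pojo {
        String f;
    }

    // Variant 1: allocates a fresh Pojo for every record.
    static class AllocatingMapper {
        Pojo map(String s) {
            Pojo p = new Pojo();
            p.f = s;
            return p;
        }
    }

    // Variant 2: allocates one Pojo per mapper instance and overwrites it per record.
    static class ReusingMapper {
        private final Pojo p = new Pojo();

        Pojo map(String s) {
            p.f = s;
            return p;
        }
    }

    public static void main(String[] args) {
        ReusingMapper reusing = new ReusingMapper();
        Pojo first = reusing.map("a");
        Pojo second = reusing.map("b");
        System.out.println(first == second); // prints true: same object both times
        System.out.println(first.f);         // prints b: the earlier result was overwritten

        AllocatingMapper allocating = new AllocatingMapper();
        System.out.println(allocating.map("a") == allocating.map("b")); // prints false
    }
}
```

So the reusing variant saves allocations, but it is only safe when each result is fully consumed (for example serialized or copied) before the next call to map().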
