Yes, map() is like a convenience function around mapPartition().

On Fri, Aug 14, 2015 at 6:09 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
> Hi Stephan, thanks for the reply!
> Now it's clearer. If I understood correctly, map and mapPartition are
> the same iff I have only one slot per task manager, right?
>
> I was convinced I had posted those questions in this thread as the 3rd or
> 4th message, hadn't I?
> On 14 Aug 2015 17:57, "Stephan Ewen" <se...@apache.org> wrote:
>
>> Hi!
>>
>> (1) A mapper is created once per parallel task. So if you create a
>> program that runs a map() transformation with a parallelism of n, you will
>> have n mapper instances in the cluster. Some may be on the same
>> TaskManager, if the TaskManager has multiple slots.
>>
>> (2) I would really like that. But it means Java has to deal with both
>> managed and unmanaged memory at the same time, which is quite a heavy
>> addition. C# has some form of support for that.
>>
>> BTW: Where did you originally post these questions? I have not seen them
>> before...
>>
>> On Fri, Aug 14, 2015 at 5:43 PM, Flavio Pompermaier <pomperma...@okkam.it> wrote:
>>
>>> Any insight about these 2 questions..?
>>> On 12 Aug 2015 17:38, "Flavio Pompermaier" <pomperma...@okkam.it> wrote:
>>>
>>>> This is something I've never understood in depth: isn't a mapper
>>>> created for each record? If it's created only once per task manager then
>>>> it's not so different from mapPartition. What am I missing here?
>>>>
>>>> And then a more philosophical question: all big data frameworks need
>>>> somehow to manage memory very efficiently (Flink has even thought to reserve
>>>> a fraction of the entire memory in order to have control over it). Wouldn't
>>>> it be simpler if Java finally released some APIs (even marked as unsafe,
>>>> it doesn't change that much) to allow full control of the
>>>> memory? It would make a lot of sense for all big data platforms (at least
>>>> for non-UDF code...).
>>>>
>>>> Best,
>>>> Flavio
>>>> On 12 Aug 2015 12:44, "Timo Walther" <twal...@apache.org> wrote:
>>>>
>>>>> Hello Michael,
>>>>>
>>>>> every time you code a Java program you should avoid object creation if
>>>>> you want an efficient program, because every created object needs to be
>>>>> garbage collected later (which slows down your program).
>>>>> You can have small Pojos, just try to avoid calling "new" in your
>>>>> functions:
>>>>>
>>>>> Instead of:
>>>>>
>>>>> class Mapper implements MapFunction<String,Pojo> {
>>>>>     public Pojo map(String s) {
>>>>>         Pojo p = new Pojo();   // allocates a new object for every record
>>>>>         p.f = s;
>>>>>         return p;
>>>>>     }
>>>>> }
>>>>>
>>>>> do:
>>>>>
>>>>> class Mapper implements MapFunction<String,Pojo> {
>>>>>     private Pojo p = new Pojo();   // allocated once per Mapper instance
>>>>>     public Pojo map(String s) {
>>>>>         p.f = s;
>>>>>         return p;
>>>>>     }
>>>>> }
>>>>>
>>>>> Then an object is only created once per Mapper and not per record.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Regards,
>>>>> Timo
>>>>>
>>>>> On 12.08.2015 11:53, Michael Huelfenhaus wrote:
>>>>>
>>>>>> Hello
>>>>>>
>>>>>> I have a question about programming user-defined functions: is it
>>>>>> still the case, as in the old Stratosphere times, that object creation
>>>>>> should be avoided at all costs? I ask because some of the examples
>>>>>> now create Tuples and other objects before returning them.
>>>>>>
>>>>>> I am going to have a streaming plan with at least 6 steps, and I am
>>>>>> going to use Pojos. Performance-wise, is it a big improvement to define
>>>>>> one big Pojo that can be used by all the steps, or is it better to have
>>>>>> smaller ones, sending less data but creating more objects?
>>>>>>
>>>>>> Thanks
>>>>>> Michael
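
For readers of the archive, the two points made in this thread (map() as a convenience around mapPartition(), and one reused object per mapper instance) can be illustrated with a minimal sketch in plain Java. It does not use Flink at all: the MapFunction and MapPartitionFunction interfaces below are simplified, hypothetical stand-ins, with java.util.function.Consumer taking the place of Flink's Collector, and asMapPartition, runTask and the allocation counter are names invented for this illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

public class MapVsMapPartition {

    // Hypothetical stand-ins for Flink's interfaces, simplified so the
    // sketch runs without Flink on the classpath.
    interface MapFunction<I, O> {
        O map(I value);
    }

    interface MapPartitionFunction<I, O> {
        void mapPartition(Iterable<I> values, Consumer<O> out);
    }

    static class Pojo {
        static int created = 0; // allocation counter, for illustration only
        String f;

        Pojo() {
            created++;
        }
    }

    // Object reuse as suggested in the thread: one Pojo per mapper instance.
    static class ReusingMapper implements MapFunction<String, Pojo> {
        private final Pojo p = new Pojo();

        @Override
        public Pojo map(String s) {
            p.f = s;
            return p;
        }
    }

    // map() expressed through mapPartition(): call the mapper once per
    // record of the partition and emit each result.
    static <I, O> MapPartitionFunction<I, O> asMapPartition(MapFunction<I, O> mapper) {
        return (values, out) -> {
            for (I value : values) {
                out.accept(mapper.map(value));
            }
        };
    }

    // Simulates one parallel task: one mapper instance for one partition.
    static List<String> runTask(List<String> partition) {
        List<String> emitted = new ArrayList<>();
        MapPartitionFunction<String, Pojo> task = asMapPartition(new ReusingMapper());
        // Read the field immediately: the same Pojo is overwritten by the next record.
        task.mapPartition(partition, pojo -> emitted.add(pojo.f));
        return emitted;
    }

    public static void main(String[] args) {
        // Parallelism 2: two partitions, hence two mapper instances.
        List<String> first = runTask(Arrays.asList("a", "b", "c"));
        List<String> second = runTask(Arrays.asList("d", "e"));
        System.out.println(first + " " + second + " pojos created: " + Pojo.created);
    }
}
```

In this sketch five records are processed with only two Pojo allocations, one per simulated task. Note that the consumer copies the field out right away; holding on to the emitted Pojo reference itself would leave every stored entry pointing at the same, repeatedly overwritten object.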