Let me understand. You said that your graph is about 400GB, and that you
don't have 400GB of main memory in your cluster (assuming for now that
400GB of edge-based input would actually result in that amount of memory
used on the heap, which is not the case). If THIS is your problem, and not,
for example, that the MESSAGES you create would exceed your available
memory, or that your graph is going to grow even more during the
computation (because you use the mutation API), then the out-of-core graph
should be the way to go for you. It is very simple: you specify how many
partitions a worker keeps in memory, and Giraph will keep only that number
of partitions in memory. The rest are stored on disk and loaded into
memory only when they are computed on (causing one of the partitions
currently in memory to be spilled to disk).

The options are giraph.useOutOfCoreGraph=true to enable it and
giraph.maxPartitionsInMemory (default 10) to control the number of
partitions kept in memory.
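
For example (a minimal sketch, untested; the keys are the ones above, and
I'm assuming only the plain Hadoop-style setters that GiraphConfiguration
inherits from Configuration):

    import org.apache.giraph.conf.GiraphConfiguration;

    public class OutOfCoreSetup {
      public static void main(String[] args) {
        GiraphConfiguration conf = new GiraphConfiguration();
        // Spill partitions beyond the in-memory limit to local disk.
        conf.setBoolean("giraph.useOutOfCoreGraph", true);
        // How many partitions each worker keeps in memory (default 10).
        conf.setInt("giraph.maxPartitionsInMemory", 10);
        // ... then set your vertex class, input/output formats, and
        // submit the job as usual.
      }
    }

If you launch through GiraphRunner, you should be able to pass the same
keys on the command line with -ca, e.g. -ca giraph.useOutOfCoreGraph=true.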
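
As for the giraph.doOutputDuringComputation option Maja suggested below
(for the case where the RESULT rather than the graph is what's huge):
since writeVertex is then called right after compute() for each vertex,
the vertex value can act as a small per-superstep buffer. A rough sketch
of that pattern, assuming the 1.0-style Vertex API; PairListWritable and
derivePair are hypothetical placeholders for your own type and logic:

    import java.io.IOException;

    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;

    public class PairVertex
        extends Vertex<LongWritable, PairListWritable, NullWritable, Text> {
      @Override
      public void compute(Iterable<Text> messages) throws IOException {
        PairListWritable buffer = getValue(); // hypothetical list Writable
        buffer.clear(); // drop the pairs written in the last superstep
        for (Text m : messages) {
          buffer.add(derivePair(m)); // hypothetical application logic
        }
        // With giraph.doOutputDuringComputation=true, the output format's
        // writeVertex runs right after compute() returns, so 'buffer' is
        // flushed to HDFS instead of accumulating across supersteps.
        voteToHalt();
      }
    }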


On Tue, May 21, 2013 at 2:18 PM, Sebastian Schelter <[email protected]> wrote:

> It simply means that not all partitions of the graph are in memory all
> the time. If you don't have enough memory, some of them might get spilled
> to disk.
>
> On 21.05.2013 14:16, Han JU wrote:
> > Thanks, that's a good point.
> > But for the moment I just want to try out different solutions on Hadoop
> > and compare them. So I'd like to see how they perform under general
> > conditions.
> >
> > Do you happen to know what out-of-core graph means?
> >
> > Thanks.
> >
> >
> > 2013/5/21 Sebastian Schelter <[email protected]>
> >
> >> Ah, I see. I have worked on similar things in recommender systems. Here
> >> the problem is generally that you get a result quadratic in the number
> >> of interactions per item. If you have some top sellers in your data,
> >> those might account for most of the large result. It helps very much to
> >> throw out the few most popular items (if your application allows that).
> >>
> >> Best,
> >> Sebastian
> >>
> >>
> >> On 21.05.2013 12:10, Han JU wrote:
> >>> Hi Sebastian,
> >>>
> >>> It's something like frequent item pairs out of transaction data.
> >>> I need all these pairs with a rather low support (say 2), so the
> >>> result could be very big.
> >>>
> >>>
> >>>
> >>> 2013/5/21 Sebastian Schelter <[email protected]>
> >>>
> >>>> Hello Han,
> >>>>
> >>>> out of curiosity, what do you compute that has such a big result?
> >>>>
> >>>> Best,
> >>>> Sebastian
> >>>>
> >>>> On 21.05.2013 11:52, Han JU wrote:
> >>>>> Hi Maja,
> >>>>>
> >>>>> The input graph of my problem is not big; it's the calculation
> >>>>> result that is very big.
> >>>>> In fact, what does out-of-core graph mean? Where can I find some
> >>>>> examples of this, and of output during computation?
> >>>>>
> >>>>> Thanks.
> >>>>>
> >>>>>
> >>>>>
> >>>>> 2013/5/17 Maja Kabiljo <[email protected]>
> >>>>>
> >>>>>>  Hi JU,
> >>>>>>
> >>>>>>  One thing you can try is to use the out-of-core graph
> >>>>>> (the giraph.useOutOfCoreGraph option).
> >>>>>>
> >>>>>>  I don't know what your exact use case is – is it the graph that
> >>>>>> is huge, or the data which you calculate in your application? In
> >>>>>> the second case, there is the 'giraph.doOutputDuringComputation'
> >>>>>> option you might want to try out. When that is turned on, during
> >>>>>> each superstep writeVertex will be called immediately after compute
> >>>>>> is called for that vertex. This means that you can store the data
> >>>>>> you want to write in the vertex, write it, and clear the data
> >>>>>> before going on to the next vertex.
> >>>>>>
> >>>>>>  Maja
> >>>>>>
> >>>>>>   From: Han JU <[email protected]>
> >>>>>> Reply-To: "[email protected]" <[email protected]>
> >>>>>> Date: Friday, May 17, 2013 8:38 AM
> >>>>>> To: "[email protected]" <[email protected]>
> >>>>>> Subject: What if the resulting graph is larger than the memory?
> >>>>>>
> >>>>>>   Hi,
> >>>>>>
> >>>>>>  It's me again.
> >>>>>> After a day's work I've coded a Giraph solution for my problem at
> >>>>>> hand. I gave it a run on a medium dataset and it's notably faster
> >>>>>> than other approaches.
> >>>>>>
> >>>>>>  However, the goal is to process larger inputs. For example, I
> >>>>>> have a larger dataset whose result graph is about 400GB when
> >>>>>> represented in edge format as a text file. And I think the edges
> >>>>>> that the algorithm creates all reside in the cluster's memory. So
> >>>>>> does that mean that for this big dataset I need a cluster with
> >>>>>> ~400GB of main memory? Is there any possibility to output "on the
> >>>>>> go", meaning I don't need to construct the whole graph, and an edge
> >>>>>> is output to HDFS immediately instead of being created in main
> >>>>>> memory and then output?
> >>>>>>
> >>>>>>  Thanks!
> >>>>>> --
> >>>>>> *JU Han*
> >>>>>>
> >>>>>>  Software Engineer Intern @ KXEN Inc.
> >>>>>>  UTC - Université de Technologie de Compiègne
> >>>>>>  *GI06 - Fouille de Données et Décisionnel*
> >>>>>>
> >>>>>>  +33 0619608888
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
> >>
> >
> >
>
>


-- 
   Claudio Martella
   [email protected]
