This is an interesting discussion. I have had some success running GraphX on large graphs with more than a billion edges, using clusters of different sizes up to 64 machines. However, performance goes down when I double the cluster size to 128 r3.xlarge machines. Does anyone have experience with very large GraphX clusters?
@Ovidiu-Cristian, @Alexis and @Alexander, could you please share the Spark / GraphX configurations that work best for you?

Thanks,
-Khaled

On Fri, Mar 11, 2016 at 1:25 PM, John Lilley <john.lil...@redpoint.net> wrote:

We have almost zero node info – just an identifying integer.

John Lilley

On Fri, Mar 11, 2016 at 11:24 AM, Alexis Roos <alexis.r...@gmail.com> wrote:

Also, we keep the node info minimal, as needed for connected components, and rejoin later.

Alexis

On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:

We use it in prod: 70 boxes, 61GB RAM each.

GraphX Connected Components works fine on 250M vertices and 1B edges (takes about 5-10 min).

Spark likes memory, so use r3.2xlarge boxes (61GB). For example, 10 x r3.2xlarge (61GB) work much faster than 20 x r3.xlarge (30.5GB), especially if you have skewed data.

Also, use checkpoints before and after Connected Components to reduce DAG delays.

You can also try enabling Kryo and registering the classes used in your RDDs.
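To make that advice concrete, here is a minimal Scala sketch of the setup Alexander describes (Kryo registration plus checkpoints around Connected Components), assuming a Spark 1.3+ GraphX job. The app name and HDFS paths are illustrative, and spark.rdd.compress is only an assumption about the kind of "light compression" that comes up later in the thread:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{GraphLoader, GraphXUtils}

object CCWithKryoAndCheckpoints {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("ConnectedComponents")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Light compression of serialized RDD partitions (assumption: this is
      // the kind of compression mentioned later in the thread).
      .set("spark.rdd.compress", "true")
    // Register GraphX's internal classes with Kryo.
    GraphXUtils.registerKryoClasses(conf)

    val sc = new SparkContext(conf)
    sc.setCheckpointDir("hdfs:///tmp/cc-checkpoints") // illustrative path

    // One "src dst" pair per line, as in the synthetic input below.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")
    graph.checkpoint() // checkpoint before CC

    val cc = graph.connectedComponents()
    cc.checkpoint()     // checkpoint after CC, truncating the long Pregel lineage
    cc.vertices.count() // materialize so the checkpoints are actually written
  }
}

Checkpointing trades extra I/O for a truncated lineage; the long lineage built up by iterative jobs is where the DAG delays come from.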
On Fri, Mar 11, 2016 at 8:07 AM, John Lilley <john.lil...@redpoint.net> wrote:

I suppose for a 2.6bn case we'd need Long:

// Emits <edges> edges forming chains of <groupsize> vertices each:
// 1 2, 2 3, 3 4, then 5 6, 6 7, 7 8, ... (vertex ids skip by one
// between chains, so each chain is its own connected component).
public class GenCCInput {
  public static void main(String[] args) {
    if (args.length != 2) {
      System.err.println("Usage: \njava GenCCInput <edges> <groupsize>");
      System.exit(-1);
    }
    long edges = Long.parseLong(args[0]);
    long groupSize = Long.parseLong(args[1]);
    long currentEdge = 1;
    long currentGroupSize = 0;
    for (long i = 0; i < edges; i++) {
      System.out.println(currentEdge + " " + (currentEdge + 1));
      if (currentGroupSize == 0) {
        currentGroupSize = 2;
      } else {
        currentGroupSize++;
      }
      if (currentGroupSize >= groupSize) {
        // Close this chain and skip an id to start the next component.
        currentGroupSize = 0;
        currentEdge += 2;
      } else {
        currentEdge++;
      }
    }
  }
}

John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781 705 2077
Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net

On Fri, Mar 11, 2016 at 8:46 AM, John Lilley <john.lil...@redpoint.net> wrote:

Ovidiu,

IMHO, this is one of the biggest issues facing GraphX and Spark. There are a lot of knobs and levers to pull to affect performance, with very little guidance about which settings work in general. We cannot ship software that requires end-user tuning; it just has to work. Unfortunately, GraphX seems very sensitive to the size of the working set relative to available RAM, and it fails catastrophically rather than gracefully when the working set is too large. It is also very sensitive to the nature of the data. For example, if we build a test file with an input-edge representation like:

1 2
2 3
3 4
5 6
6 7
7 8
…

this represents a graph with connected components in groups of four. We found experimentally that when this data is input in clustered order, the required memory is lower and the runtime is much faster than when it is input in random order. This makes intuitive sense because of the additional communication required for the random order.

Our 1bn-edge test case was of this same form, input in clustered order, with groups of 10 vertices per component. It failed at 8 x 60GB. This is the kind of data that our application processes, so it is a realistic test for us. I've found that social media test data sets tend to follow power-law distributions, and that GraphX has much less trouble with them.

A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges in 10-vertex components, using the synthetic test input described above. I would be curious to know whether this works, what settings you use to succeed, and whether it continues to succeed with random input order.

As for the C++ algorithm, it scales multi-core. It exhibits O(N^2) behavior for large data sets, but it processes the 1bn-edge case on a single 60GB node in about 20 minutes. It degrades gracefully along the O(N^2) curve, and additional memory reduces the runtime.

John Lilley
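For reference, here is a Scala sketch that generates the same chained-edge input as GenCCInput directly as an RDD and runs Connected Components on it. This is an illustration, not John's harness: sc.range needs Spark 1.5+, groupSize must be at least 2, and the repartition is only a cheap stand-in for writing the input file in random order.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object SyntheticCC {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("SyntheticCC"))
    val numEdges  = args(0).toLong // e.g. 2600000000 for the 2.6bn case
    val groupSize = args(1).toLong // e.g. 10 vertices per component
    val randomize = args.length > 2 && args(2) == "random"

    // Each component is a chain of (groupSize - 1) edges:
    // 1-2, 2-3, 3-4, then 5-6, ... matching GenCCInput's output.
    val edges = sc.range(0L, numEdges).map { i =>
      val group  = i / (groupSize - 1) // which component this edge falls in
      val offset = i % (groupSize - 1) // position within the chain
      val src    = group * groupSize + offset + 1
      Edge(src, src + 1, 1)
    }
    // Shuffling the edges approximates random input order.
    val input = if (randomize) edges.repartition(edges.partitions.length) else edges

    val cc = Graph.fromEdges(input, defaultValue = 1).connectedComponents()
    println("components: " + cc.vertices.map(_._2).distinct().count())
  }
}

For the 2.6bn-edge test described above, this would be submitted with arguments 2600000000 10, and again with a trailing "random" to compare the two input orders.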
On Fri, Mar 11, 2016 at 8:14 AM, Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr> wrote:

Hi,

I wonder what Spark version and parameter configuration you used. I was able to run CC on 1.8bn edges in about 8 minutes (23 iterations) using 16 nodes with around 80GB RAM each (Spark 1.5, default parameters).

John: I suppose your C++ app (algorithm) does not scale, if you used only one node.

I don't understand the claim that RDD serialization takes excessive time: excessive compared to the total time, or to some other expected time?

For the different RDD timings you have the event logs, the UI console, and a bunch of papers describing how to measure different things. lihu: did you use some incomplete tool, or what exactly are you looking for?

Best,
Ovidiu

On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:

A colleague did the experiments and I don't know exactly how he observed that. I think it was indirect: from the Spark diagnostics indicating the amount of I/O, he deduced that this was RDD serialization. Also, when he added light compression to RDD serialization, this improved matters.

John Lilley

On Fri, Mar 11, 2016 at 7:58 AM, lihu <lihu...@gmail.com> wrote:

Hi John,

I am very interested in your experiment. How did you establish that RDD serialization cost a lot of time: from the logs, or with some other tools?

On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:

Andrew,

We conducted some tests using GraphX to solve the connected-components problem and were disappointed. On 8 nodes of 16GB each, we could not get above 100M edges. On 8 nodes of 60GB each, we could not process 1bn edges. RDD serialization would take excessive time and then we would get failures. By contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a single 16GB node in about an hour. I think that a very large cluster will do better, but we did not explore that.

John Lilley

On Thu, Mar 10, 2016 at 2:44 PM, Andrew A <andrew.a...@gmail.com> wrote:

Hi, is there anyone who uses GraphX in production? What is the largest graph you have processed with Spark, and what cluster did you use for it?

I tried to calculate PageRank on the 1GB LiveJournal (LJ) edge dataset, using LiveJournalPageRank from the Spark examples, and I ran into large shuffles produced by Spark, which fail my Spark job.

Thank you,
Andrew

--
Thanks,
-Khaled