Also, we keep the node info to the minimum needed for connected components and rejoin the full records later.
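The "keep node info minimal, rejoin later" pattern does not depend on Spark; its shape can be sketched single-process. A minimal sketch (class and method names are mine, not from the thread): run union-find over bare integer ids only, then join the component labels back onto the full records.

```java
import java.util.HashMap;
import java.util.Map;

public class MinimalCCRejoin {

    // Union-find over dense int ids 0..maxId; path halving keeps finds cheap.
    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];
            x = parent[x];
        }
        return x;
    }

    // Connected components over bare ids only: no vertex attributes involved.
    static Map<Integer, Integer> computeComponents(int[][] edges, int maxId) {
        int[] parent = new int[maxId + 1];
        for (int i = 0; i <= maxId; i++) parent[i] = i;
        for (int[] e : edges) {
            int ra = find(parent, e[0]), rb = find(parent, e[1]);
            if (ra != rb) parent[ra] = rb;
        }
        Map<Integer, Integer> label = new HashMap<>();
        for (int v = 0; v <= maxId; v++) label.put(v, find(parent, v));
        return label;
    }

    public static void main(String[] args) {
        // Full records, kept out of the component computation entirely.
        Map<Integer, String> records = new HashMap<>();
        for (int v = 1; v <= 8; v++) records.put(v, "record-" + v);

        // Groups-of-four edge shape from the thread: components {1..4} and {5..8}.
        int[][] edges = {{1,2},{2,3},{3,4},{5,6},{6,7},{7,8}};
        Map<Integer, Integer> label = computeComponents(edges, 8);

        // Rejoin: attach each component label back to its full record.
        for (int v = 1; v <= 8; v++)
            System.out.println(records.get(v) + " -> component " + label.get(v));
    }
}
```

The same split applies in Spark: run CC on a stripped `Graph[Unit, Unit]`-style structure and join the resulting vertex-to-component RDD back to the attribute RDD afterward.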
Alexis

On Fri, Mar 11, 2016 at 10:12 AM, Alexander Pivovarov <apivova...@gmail.com> wrote:

> We use it in prod.
>
> 70 boxes, 61GB RAM each.
>
> GraphX Connected Components works fine on 250M vertices and 1B edges
> (takes about 5-10 min).
>
> Spark likes memory, so use r3.2xlarge boxes (61GB). For example, 10 x
> r3.2xlarge (61GB) work much faster than 20 x r3.xlarge (30.5GB),
> especially if you have skewed data.
>
> Also, use checkpoints before and after Connected Components to reduce
> DAG delays.
>
> You can also try enabling Kryo and registering the classes used in the RDDs.
>
> On Fri, Mar 11, 2016 at 8:07 AM, John Lilley <john.lil...@redpoint.net> wrote:
>
>> I suppose for a 2.6bn case we'd need long:
>>
>> public class GenCCInput {
>>   public static void main(String[] args) {
>>     if (args.length != 2) {
>>       System.err.println("Usage: \njava GenCCInput <edges> <groupsize>");
>>       System.exit(-1);
>>     }
>>     long edges = Long.parseLong(args[0]);
>>     long groupSize = Long.parseLong(args[1]);
>>     long currentEdge = 1;
>>     long currentGroupSize = 0;
>>     for (long i = 0; i < edges; i++) {
>>       System.out.println(currentEdge + " " + (currentEdge + 1));
>>       if (currentGroupSize == 0) {
>>         currentGroupSize = 2;
>>       } else {
>>         currentGroupSize++;
>>       }
>>       if (currentGroupSize >= groupSize) {
>>         currentGroupSize = 0;
>>         currentEdge += 2;
>>       } else {
>>         currentEdge++;
>>       }
>>     }
>>   }
>> }
>>
>> John Lilley
>> Chief Architect, RedPoint Global Inc.
>> T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781 705 2077
>> Skype: jlilley.redpoint | john.lil...@redpoint.net | www.redpoint.net
>>
>> From: John Lilley [mailto:john.lil...@redpoint.net]
>> Sent: Friday, March 11, 2016 8:46 AM
>> To: Ovidiu-Cristian MARCU <ovidiu-cristian.ma...@inria.fr>
>> Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org; Geoff Thompson <geoff.thomp...@redpoint.net>
>> Subject: RE: Graphx
>>
>> Ovidiu,
>>
>> IMHO, this is one of the biggest issues facing GraphX and Spark. There
>> are a lot of knobs and levers to pull to affect performance, with very
>> little guidance about which settings work in general. We cannot ship
>> software that requires end-user tuning; it just has to work.
>> Unfortunately, GraphX seems very sensitive to working-set size relative
>> to available RAM, and it fails catastrophically rather than gracefully
>> when the working set is too large. It is also very sensitive to the
>> nature of the data. For example, suppose we build a test file with an
>> input-edge representation like:
>>
>> 1 2
>> 2 3
>> 3 4
>> 5 6
>> 6 7
>> 7 8
>> ...
>>
>> This represents a graph whose connected components are groups of four
>> vertices. We found experimentally that when this data is input in
>> clustered order, the required memory is lower and the runtime is much
>> faster than when the data is input in random order. This makes
>> intuitive sense because of the additional communication required for
>> the random order.
>>
>> Our 1bn-edge test case was of this same form, input in clustered order,
>> with groups of 10 vertices per component. It failed at 8 x 60GB. This
>> is the kind of data that our application processes, so it is a
>> realistic test for us. I've found that social-media test data sets tend
>> to follow power-law distributions, and that GraphX has much less
>> trouble with them.
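One way to see that input order affects only cost, not results: a shuffled edge list is the same graph, so the components must match; only locality (and, in GraphX, communication) changes. A small single-process sketch (names are illustrative, not from the thread) that generates the clustered groups-of-N input described above, shuffles a copy, and labels each vertex with the smallest id in its component:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class OrderInsensitiveCC {

    // Clustered-order generator matching GenCCInput from the thread:
    // chains of `groupSize` vertices, one component per chain.
    static List<long[]> clusteredEdges(long edges, long groupSize) {
        List<long[]> out = new ArrayList<>();
        long currentEdge = 1, currentGroupSize = 0;
        for (long i = 0; i < edges; i++) {
            out.add(new long[]{currentEdge, currentEdge + 1});
            currentGroupSize = (currentGroupSize == 0) ? 2 : currentGroupSize + 1;
            if (currentGroupSize >= groupSize) {
                currentGroupSize = 0;
                currentEdge += 2;
            } else {
                currentEdge++;
            }
        }
        return out;
    }

    private static long find(Map<Long, Long> parent, long x) {
        long root = x;
        while (parent.get(root) != root) root = parent.get(root);
        while (parent.get(x) != root) {          // path compression
            long next = parent.get(x);
            parent.put(x, root);
            x = next;
        }
        return root;
    }

    // Label every vertex with the smallest vertex id in its component;
    // the labeling is independent of the order the edges arrive in.
    static Map<Long, Long> components(List<long[]> edges) {
        Map<Long, Long> parent = new HashMap<>();
        for (long[] e : edges) {
            parent.putIfAbsent(e[0], e[0]);
            parent.putIfAbsent(e[1], e[1]);
            long ra = find(parent, e[0]), rb = find(parent, e[1]);
            if (ra != rb) parent.put(Math.max(ra, rb), Math.min(ra, rb));
        }
        Map<Long, Long> label = new HashMap<>();
        for (long v : parent.keySet()) label.put(v, find(parent, v));
        return label;
    }

    public static void main(String[] args) {
        List<long[]> clustered = clusteredEdges(30, 4);    // 10 chains of 4 vertices
        List<long[]> shuffled = new ArrayList<>(clustered);
        Collections.shuffle(shuffled, new Random(42));     // same edges, random order
        System.out.println("same components regardless of order: "
                + components(clustered).equals(components(shuffled)));
    }
}
```

GraphX's connectedComponents labels components by minimum vertex id in the same way, which is what makes results comparable across input orderings even when memory and runtime differ sharply.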
>> A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn
>> edges in 10-vertex components, using the synthetic test input I
>> describe above. I would be curious to know whether this works, what
>> settings you use to succeed, and whether it continues to succeed for
>> random input order.
>>
>> As for the C++ algorithm, it scales multi-core. It exhibits O(N^2)
>> behavior for large data sets, but it processes the 1bn-edge case on a
>> single 60GB node in about 20 minutes. It degrades gracefully along the
>> O(N^2) curve, and additional memory reduces the runtime.
>>
>> John Lilley
>>
>> From: Ovidiu-Cristian MARCU [mailto:ovidiu-cristian.ma...@inria.fr]
>> Sent: Friday, March 11, 2016 8:14 AM
>> To: John Lilley <john.lil...@redpoint.net>
>> Cc: lihu <lihu...@gmail.com>; Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
>> Subject: Re: Graphx
>>
>> Hi,
>>
>> I wonder what version of Spark and what parameter configuration you
>> used. I was able to run CC for 1.8bn edges in about 8 minutes (23
>> iterations) using 16 nodes with around 80GB RAM each (Spark 1.5,
>> default parameters).
>>
>> John: I suppose your C++ app (algorithm) does not scale, if you used
>> only one node. I don't understand how the RDD serialization is taking
>> excessive time: relative to the total time, or to some other expected
>> time?
>>
>> For the different RDD times you have events, the UI console, and a
>> bunch of papers describing how to measure different things. lihu: did
>> you use some incomplete tool, or what are you looking for?
>>
>> Best,
>> Ovidiu
>>
>> On 11 Mar 2016, at 16:02, John Lilley <john.lil...@redpoint.net> wrote:
>>
>> A colleague did the experiments, and I don't know exactly how he
>> observed that. I think it was indirect: from the Spark diagnostics
>> indicating the amount of I/O, he deduced that this was RDD
>> serialization.
>> Also, when he added light compression to RDD serialization, this
>> improved matters.
>>
>> John Lilley
>> Chief Architect, RedPoint Global Inc.
>>
>> From: lihu [mailto:lihu...@gmail.com]
>> Sent: Friday, March 11, 2016 7:58 AM
>> To: John Lilley <john.lil...@redpoint.net>
>> Cc: Andrew A <andrew.a...@gmail.com>; u...@spark.incubator.apache.org
>> Subject: Re: Graphx
>>
>> Hi, John:
>>
>> I am very interested in your experiment. How did you determine that RDD
>> serialization cost a lot of time: from the logs, or from some other
>> tool?
>>
>> On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <john.lil...@redpoint.net> wrote:
>>
>> Andrew,
>>
>> We conducted some tests using GraphX to solve the connected-components
>> problem and were disappointed. On 8 nodes of 16GB each, we could not
>> get above 100M edges. On 8 nodes of 60GB each, we could not process
>> 1bn edges. RDD serialization would take excessive time, and then we
>> would get failures. By contrast, we have a C++ algorithm that solves
>> 1bn edges using memory+disk on a single 16GB node in about an hour. I
>> think that a very large cluster would do better, but we did not explore
>> that.
>>
>> John Lilley
>> Chief Architect, RedPoint Global Inc.
>>
>> From: Andrew A [mailto:andrew.a...@gmail.com]
>> Sent: Thursday, March 10, 2016 2:44 PM
>> To: u...@spark.incubator.apache.org
>> Subject: Graphx
>>
>> Hi, is there anyone who uses GraphX in production? What is the maximum
>> size of graph you have processed with Spark, and what cluster did you
>> use for it?
>> I tried calculating PageRank on roughly 1GB of edges (the LiveJournal
>> dataset, for LiveJournalPageRank from the Spark examples), and I ran
>> into large shuffles produced by Spark that failed my job.
>>
>> Thank you,
>> Andrew
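For a sense of what that job computes: PageRank is a damped power iteration over the out-link structure. A tiny single-machine sketch (illustrative only; this is not the GraphX or LiveJournalPageRank implementation, and the 3-vertex graph is made up) with the usual damping factor 0.85:

```java
import java.util.Arrays;

public class TinyPageRank {

    // Power iteration for PageRank with damping factor d on n vertices.
    // outLinks[v] lists the vertices that v links to.
    static double[] pageRank(int[][] outLinks, int n, int iters) {
        double d = 0.85;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);
        for (int it = 0; it < iters; it++) {
            double[] next = new double[n];
            Arrays.fill(next, (1 - d) / n);          // teleport mass
            for (int v = 0; v < n; v++) {
                if (outLinks[v].length == 0) {
                    // Dangling vertex: spread its rank uniformly.
                    for (int u = 0; u < n; u++) next[u] += d * rank[v] / n;
                } else {
                    double share = d * rank[v] / outLinks[v].length;
                    for (int u : outLinks[v]) next[u] += share;
                }
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
        int[][] outLinks = {{1, 2}, {2}, {0}};
        System.out.println(Arrays.toString(pageRank(outLinks, 3, 50)));
    }
}
```

At cluster scale the expensive part is exactly what Andrew hit: each iteration shuffles every vertex's rank contributions along its edges, so shuffle volume grows with the edge count per iteration.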