I suppose for a 2.6bn-edge case we’d need long:
public class GenCCInput {
  public static void main(String[] args) {
    if (args.length != 2) {
      System.err.println("Usage:\njava GenCCInput <edges> <groupSize>");
      System.exit(-1);
    }
    // Example for the 2.6bn case: java GenCCInput 2600000000 10
    long edges = Long.parseLong(args[0]);
    long groupSize = Long.parseLong(args[1]);
    long currentEdge = 1;
    long currentGroupSize = 0;
    for (long i = 0; i < edges; i++) {
      System.out.println(currentEdge + " " + (currentEdge + 1));
      if (currentGroupSize == 0) {
        currentGroupSize = 2; // the first edge of a group covers two vertices
      } else {
        currentGroupSize++;
      }
      if (currentGroupSize >= groupSize) {
        currentGroupSize = 0; // start a new component
        currentEdge += 2;     // skip a vertex so components stay disconnected
      } else {
        currentEdge++;
      }
    }
  }
}
John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | [email protected] | www.redpoint.net
From: John Lilley [mailto:[email protected]]
Sent: Friday, March 11, 2016 8:46 AM
To: Ovidiu-Cristian MARCU <[email protected]>
Cc: lihu <[email protected]>; Andrew A <[email protected]>;
[email protected]; Geoff Thompson <[email protected]>
Subject: RE: Graphx
Ovidiu,
IMHO, this is one of the biggest issues facing GraphX and Spark. There are a
lot of knobs and levers to pull to affect performance, with very little
guidance about which settings work in general. We cannot ship software that
requires end-user tuning; it just has to work. Unfortunately, GraphX seems very
sensitive to working-set size relative to available RAM, and it fails
catastrophically rather than gracefully when the working set is too large. It
is also very sensitive to the nature of the data. For example, if we build a
test file with an input-edge representation like:
1 2
2 3
3 4
5 6
6 7
7 8
…
this represents a graph with connected components in groups of four. We found
experimentally that when this data is input in clustered order, the required
memory is lower and the runtime is much faster than when it is input in random
order. This makes intuitive sense, given the additional communication required
for the random order.
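To reproduce the clustered-vs-random comparison, one simple option is to shuffle the generated edge file. Here is a minimal sketch; the class name, file handling, and fixed seed are illustrative choices of mine, and it assumes the edge list fits in heap:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffleEdges {
    // Return a shuffled copy of the lines; a fixed seed keeps runs repeatable.
    static List<String> shuffled(List<String> lines, long seed) {
        List<String> copy = new ArrayList<>(lines);
        Collections.shuffle(copy, new Random(seed));
        return copy;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: clustered-order edge file, args[1]: randomized output file
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        Files.write(Paths.get(args[1]), shuffled(lines, 42L));
    }
}
```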
Our 1bn-edge test case was of this same form, input in clustered order, with
groups of 10 vertices per component. It failed at 8 x 60GB. This is the kind
of data that our application processes, so it is a realistic test for us. I’ve
found that social media test data sets tend to follow power-law distributions,
and that GraphX has far less trouble with them.
A comparable test scaled to your cluster (16 x 80GB) would be 2.6bn edges in
10-vertex components using the synthetic test input I describe above. I would
be curious to know if this works and what settings you use to succeed, and if
it continues to succeed for random input order.
As for the C++ algorithm, it scales across cores on a single machine. It
exhibits O(N^2) behavior for large data sets, but it processes the 1bn-edge
case on a single 60GB node in about 20 minutes. It degrades gracefully along
the O(N^2) curve, and additional memory reduces the runtime.
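For reference, the classic in-memory building block for single-node connected components is union-find (disjoint sets). The sketch below is only an illustration of that general technique, not the C++ implementation discussed above (which is multi-core and disk-backed); the class and method names are mine:

```java
public class UnionFind {
    private final int[] parent;

    public UnionFind(int maxVertex) {
        parent = new int[maxVertex + 1];
        for (int i = 0; i <= maxVertex; i++) {
            parent[i] = i; // each vertex starts in its own component
        }
    }

    // Find the representative of v's component, halving paths as we go.
    public int find(int v) {
        while (parent[v] != v) {
            parent[v] = parent[parent[v]];
            v = parent[v];
        }
        return v;
    }

    // Merge the components containing a and b.
    public void union(int a, int b) {
        parent[find(a)] = find(b);
    }
}
```

Feeding it the groups-of-four edge list above (1-2, 2-3, 3-4, 5-6, ...) yields find(1) == find(4) but find(1) != find(5).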
John Lilley
From: Ovidiu-Cristian MARCU [mailto:[email protected]]
Sent: Friday, March 11, 2016 8:14 AM
To: John Lilley <[email protected]>
Cc: lihu <[email protected]>; Andrew A <[email protected]>;
[email protected]
Subject: Re: Graphx
Hi,
I wonder which version of Spark and which parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using
16 nodes with around 80GB RAM each (Spark 1.5, default parameters).
John: I suppose your C++ app (algorithm) does not scale, since you used only
one node.
I don’t understand how RDD serialization could take excessive time; compared
to the total time, or to what expected time?
For the different RDD times there are events, the UI console, and a bunch of
papers describing how to measure different things. lihu: did you use some
incomplete tool, or what are you looking for?
Best,
Ovidiu
On 11 Mar 2016, at 16:02, John Lilley <[email protected]> wrote:
A colleague ran the experiments, so I don’t know exactly how he observed that.
I believe it was indirect: from the Spark diagnostics indicating the amount of
I/O, he deduced that it was RDD serialization. Also, when he added light
compression to RDD serialization, matters improved.
John Lilley
From: lihu [mailto:[email protected]]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <[email protected]>
Cc: Andrew A <[email protected]>; [email protected]
Subject: Re: Graphx
Hi, John:
I am very interested in your experiment. How did you determine that RDD
serialization cost a lot of time: from the logs, or with some other tool?
On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <[email protected]> wrote:
Andrew,
We conducted some tests using GraphX to solve the connected-components problem
and were disappointed. On 8 nodes of 16GB each, we could not get above 100M
edges. On 8 nodes of 60GB each, we could not process 1bn edges. RDD
serialization would take excessive time, and then we would get failures. By
contrast, we have a C++ algorithm that solves the 1bn-edge case using
memory+disk on a single 16GB node in about an hour. I think a very large
cluster would do better, but we did not explore that.
John Lilley
From: Andrew A [mailto:[email protected]]
Sent: Thursday, March 10, 2016 2:44 PM
To: [email protected]<mailto:[email protected]>
Subject: Graphx
Hi, is there anyone who uses GraphX in production? What is the maximum graph
size you have processed with Spark, and what cluster did you use for it?
I tried to calculate PageRank on the ~1 GB LiveJournal (LJ) edge dataset with
LiveJournalPageRank from the Spark examples, and I ran into large shuffles
produced by Spark that failed my job.
Thank you,
Andrew