PS: This is the code I use to generate clustered test data:
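public class GenCCInput {
    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: \njava GenCCInput <edges> <groupsize>");
            System.exit(-1);
        }
        int edges = Integer.parseInt(args[0]);      // total number of edges to print
        int groupSize = Integer.parseInt(args[1]);  // vertices per connected group
        // Despite its name, currentEdge holds the source vertex id of the next edge.
        int currentEdge = 1;
        int currentGroupSize = 0;
        for (int i = 0; i < edges; i++) {
            // Each edge links a vertex to its successor, so a group is a chain.
            System.out.println(currentEdge + " " + (currentEdge + 1));
            if (currentGroupSize == 0) {
                // The first edge of a group contributes two vertices.
                currentGroupSize = 2;
            } else {
                currentGroupSize++;
            }
            if (currentGroupSize >= groupSize) {
                // Group is full: skip one vertex so the next edge starts a new group.
                currentGroupSize = 0;
                currentEdge += 2;
            } else {
                currentEdge++;
            }
        }
    }
}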
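A minimal sketch for sanity-checking the generator on small inputs; it is not part of the original program. Each group is a chain of groupSize vertices joined by groupSize - 1 edges, so for groupSize >= 2 the expected number of connected components is edges / (groupSize - 1), rounded up. The class name CheckCCInput and its methods are hypothetical; generate() repeats the loop above but collects the edges in memory instead of printing them, and the component count uses a plain union-find.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for small test runs; not part of the original post.
final class CheckCCInput {

    // Produces the same edge sequence that GenCCInput prints, as {src, dst} pairs.
    static List<long[]> generate(int edges, int groupSize) {
        List<long[]> out = new ArrayList<>();
        long current = 1;
        int currentGroupSize = 0;
        for (int i = 0; i < edges; i++) {
            out.add(new long[] {current, current + 1});
            currentGroupSize = (currentGroupSize == 0) ? 2 : currentGroupSize + 1;
            if (currentGroupSize >= groupSize) {
                currentGroupSize = 0;
                current += 2;   // skip a vertex to start a new group
            } else {
                current++;
            }
        }
        return out;
    }

    // Union-find root lookup with path halving.
    static long find(Map<Long, Long> parent, long v) {
        parent.putIfAbsent(v, v);
        long p = parent.get(v);
        while (p != v) {
            long gp = parent.get(p);
            parent.put(v, gp);  // point v at its grandparent
            v = gp;
            p = parent.get(v);
        }
        return v;
    }

    // Counts connected components among the vertices that appear in the edge list.
    static int countComponents(List<long[]> edges) {
        Map<Long, Long> parent = new HashMap<>();
        int components = 0;
        for (long[] e : edges) {
            if (!parent.containsKey(e[0])) components++;  // new vertex, new component
            long a = find(parent, e[0]);
            if (!parent.containsKey(e[1])) components++;
            long b = find(parent, e[1]);
            if (a != b) {
                parent.put(a, b);                         // merge the two components
                components--;
            }
        }
        return components;
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: \njava CheckCCInput <edges> <groupsize>");
            System.exit(-1);
        }
        List<long[]> edges = generate(Integer.parseInt(args[0]), Integer.parseInt(args[1]));
        System.out.println(edges.size() + " edges, " + countComponents(edges) + " components");
    }
}
```

For example, 6 edges with a group size of 3 gives the chains 1-2-3, 4-5-6 and 7-8-9, that is, 3 components. Because the edges are held in memory, this is only meant for small inputs.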
John Lilley
Chief Architect, RedPoint Global Inc.
T: +1 303 541 1516 | M: +1 720 938 5761 | F: +1 781-705-2077
Skype: jlilley.redpoint | [email protected] | www.redpoint.net
From: Ovidiu-Cristian MARCU [mailto:[email protected]]
Sent: Friday, March 11, 2016 8:14 AM
To: John Lilley <[email protected]>
Cc: lihu <[email protected]>; Andrew A <[email protected]>;
[email protected]
Subject: Re: Graphx
Hi,
I wonder which version of Spark and which parameter configuration you used.
I was able to run CC for 1.8bn edges in about 8 minutes (23 iterations) using
16 nodes with around 80GB RAM each (Spark 1.5, default parameters).
John: I suppose your C++ app (algorithm) does not scale if you used only one
node.
I don’t understand in what sense RDD serialization is taking excessive time:
compared to the total time, or to some other expected time?
For the different RDD times you have the events, the UI console, and a bunch
of papers describing how to measure different things. lihu: did you use some
incomplete tool, or what are you looking for?
Best,
Ovidiu
On 11 Mar 2016, at 16:02, John Lilley <[email protected]> wrote:
A colleague did the experiments and I don’t know exactly how he observed that.
I think it was indirect: from the Spark diagnostics indicating the amount of
I/O, he deduced that this was RDD serialization. Also, when he added light
compression to RDD serialization, this improved matters.
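As an illustration only (the exact settings used in those experiments are not known), compression of serialized RDD partitions can be switched on with the standard Spark properties spark.rdd.compress and spark.io.compression.codec. The sketch below just assembles them as spark-submit arguments; the class name RddCompressionConf is hypothetical, and the choice of lz4 as a "light" codec is an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: builds spark-submit "--conf" arguments for RDD compression.
final class RddCompressionConf {

    // Assumed settings; the values actually used in the experiments are not known.
    static Map<String, String> settings() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Compress serialized RDD partitions (e.g. StorageLevel.MEMORY_ONLY_SER).
        conf.put("spark.rdd.compress", "true");
        // Codec used for internal data such as RDD partitions and shuffle outputs.
        conf.put("spark.io.compression.codec", "lz4");
        return conf;
    }

    // Renders the settings as a single string of spark-submit arguments.
    static String toSubmitArgs(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("--conf ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toSubmitArgs(settings()));
    }
}
```

Note that spark.rdd.compress only affects RDDs persisted in serialized form, so it trades some CPU time for less memory and I/O.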
From: lihu [mailto:[email protected]]
Sent: Friday, March 11, 2016 7:58 AM
To: John Lilley <[email protected]>
Cc: Andrew A <[email protected]>; [email protected]
Subject: Re: Graphx
Hi, John:
I am very interested in your experiment. How did you determine that RDD
serialization costs a lot of time: from the log, or with some other tools?
On Fri, Mar 11, 2016 at 8:46 PM, John Lilley <[email protected]> wrote:
Andrew,
We conducted some tests using GraphX to solve the connected-components
problem and were disappointed. On 8 nodes of 16GB each, we could not get above
100M edges. On 8 nodes of 60GB each, we could not process 1bn edges. RDD
serialization would take excessive time and then we would get failures. By
contrast, we have a C++ algorithm that solves 1bn edges using memory+disk on a
single 16GB node in about an hour. I think that a very large cluster will do
better, but we did not explore that.
From: Andrew A [mailto:[email protected]]
Sent: Thursday, March 10, 2016 2:44 PM
To: [email protected]
Subject: Graphx
Hi, is there anyone who uses GraphX in production? What is the maximum graph
size you have processed with Spark, and what cluster do you use for it?
I tried to calculate PageRank on the 1 Gb LJ edges dataset (the one used for
LiveJournalPageRank in the Spark examples) and ran into large-volume shuffles
produced by Spark, which make my Spark job fail.
Thank you,
Andrew