Hi Ankur,

I have another question, w.r.t. edge/partition scheduling:

For instance, I have a machine with 2*4 cores (L1 cache: 32K) and 32GB of memory, and an 80GB local edge file on disk. When I load the file using sc.textFile (minPartitions = 16, PartitionStrategy.RandomVertexCut), what happens?

1) How much data will be loaded into memory?
2) How many partitions will be stored in memory?
3) Will the thread/task on each core read only one edge from memory at a time and then compute on it?
3.1) Which edge in memory is chosen to be read into the cache?
3.2) How are those partitions scheduled?

Best,
Yifan
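P.S. For concreteness, a minimal sketch of the flow discussed above, assuming Spark 1.0's GraphX API; the file path, the comma-separated parsing, and the Double edge property are placeholders for my actual data:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx._

    // Local mode with 8 worker threads, one per core.
    val sc = new SparkContext(new SparkConf()
      .setAppName("graphx-partitioning-sketch")
      .setMaster("local[8]"))

    // Parse "srcID, dstID, edgeProperty" lines into Edge objects. minPartitions is
    // set to the number of cores (or a multiple of it), so every core gets at least
    // one edge partition to work on.
    val edges = sc.textFile("/path/to/edges.txt", minPartitions = sc.defaultParallelism)
      .map { line =>
        val fields = line.split(",").map(_.trim)
        Edge(fields(0).toLong, fields(1).toLong, fields(2).toDouble)
      }

    // Build the graph; every vertex referenced by an edge gets the default attribute (1).
    val graph = Graph.fromEdges(edges, 1)

    // Optional: reshuffle the edges. The number of edge partitions (and hence the
    // level of parallelism) stays the same; only where each edge lands changes.
    val partitioned = graph.partitionBy(PartitionStrategy.EdgePartition2D)

    println(s"edge partitions: ${partitioned.edges.partitions.length}")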
On Jul 15, 2014, at 12:06 PM, Yifan LI <iamyifa...@gmail.com> wrote:

> Dear Ankur,
>
> Thanks so much!
>
> Btw, is there any possibility to customise the partition strategy as we wish?
>
>
> Best,
> Yifan
>
> On Jul 11, 2014, at 10:20 PM, Ankur Dave <ankurd...@gmail.com> wrote:
>
>> Hi Yifan,
>>
>> When you run Spark on a single machine, it uses a local mode where one task per core can be executed at a time -- that is, the level of parallelism is the same as the number of cores.
>>
>> To take advantage of this, when you load a file using sc.textFile, you should set the minPartitions argument to be the number of cores (available from sc.defaultParallelism) or a multiple thereof. This will split up your local edge file and allow you to take advantage of all the machine's cores.
>>
>> Once you've loaded the edge RDD with the appropriate number of partitions and constructed a graph using it, GraphX will leave the edge partitioning alone. During graph computation, each vertex will automatically be copied to the edge partitions where it is needed, and the computation will execute in parallel on each of the edge partitions (cores).
>>
>> If you later call Graph.partitionBy, it will by default preserve the number of edge partitions, but shuffle the edges around according to the partition strategy. This won't change the level of parallelism, but it might decrease the amount of inter-core communication.
>>
>> Hope that helps! By the way, do continue to post your GraphX questions to the Spark user list if possible. I'll probably still be the one answering them, but that way others can benefit as well.
>>
>> Ankur
>>
>>
>> On Fri, Jul 11, 2014 at 3:05 AM, Yifan LI <iamyifa...@gmail.com> wrote:
>> Hi Ankur,
>>
>> I am doing graph computation using GraphX on a single multicore machine (not a cluster), but it seems I couldn't find enough docs on how GraphX partitions a graph on a multicore machine. Could you give me some introduction or docs?
>>
>> For instance, I have one single edge file (not HDFS, etc.) that follows the "srcID, dstID, edgeProperties" format, maybe 100MB or 500GB in size, and the latest Spark 1.0.0 (with GraphX) has been installed on a 64-bit, 8-CPU machine. I plan to build my own algorithm application on it.
>>
>> - By default, how is the edge data partitioned? To each CPU? Or to each process?
>>
>> - If I later specify a partition strategy in partitionBy(), e.g. PartitionStrategy.EdgePartition2D, what will happen? Will it work?
>>
>>
>> Thanks in advance! :)
>>
>> Best,
>> Yifan LI
>> Univ. Paris-Sud/ Inria, Paris, France
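Regarding the question above about customising the partition strategy: a hedged sketch of what a user-defined strategy could look like, assuming PartitionStrategy is an ordinary (non-sealed) trait in the installed Spark version; SourceIdPartition is a made-up name and the hashing choice is arbitrary.

    import org.apache.spark.graphx._

    // Hypothetical custom strategy (assumes PartitionStrategy is extensible in this
    // Spark version): assign each edge by its source vertex id only, so all
    // out-edges of a vertex land in the same partition.
    object SourceIdPartition extends PartitionStrategy {
      override def getPartition(src: VertexId, dst: VertexId, numParts: PartitionID): PartitionID =
        (math.abs(src) % numParts).toInt
    }

    // Used exactly like the built-in strategies (graph as in the earlier sketch):
    // val bySource = graph.partitionBy(SourceIdPartition)

If PartitionStrategy turns out to be sealed in that release, the built-in strategies (RandomVertexCut, CanonicalRandomVertexCut, EdgePartition1D, EdgePartition2D) are the available options.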