Thanks Daniel. Do you have any code fragments showing CoGroup or Join across two RDDs? I don't think an index would help much, because this is an N x M operation that compares every element of one RDD against every element of the other. Each comparison is complex, as it needs to peer into a complex JSON structure.
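For reference, the two approaches under discussion could be sketched roughly as follows in PySpark. This is only a sketch under stated assumptions: the sample data, the `matches` predicate (same "city" field) and the function names are all hypothetical, and the Spark functions expect an existing SparkContext passed in as `sc`. `match_local` is the plain-Python nested loop, kept as a reference; `match_cartesian` does the full N x M comparison with `cartesian` plus `filter`; `match_join` is only valid when a match implies that both sides share some key that can be extracted up front.

```python
import json
from itertools import product

# Hypothetical sample data: each element is (JSON string, number), as in the thread.
A = [
    ('{"city": "SF", "tags": ["x", "y"]}', 1),
    ('{"city": "NY", "tags": ["z"]}', 2),
]
B = [
    ('{"city": "SF", "tags": ["y"]}', 10),
    ('{"city": "LA", "tags": ["x"]}', 20),
    ('{"city": "NY", "tags": ["q"]}', 30),
]


def matches(a, b):
    """Hypothetical predicate: same city. Replace with the real comparison."""
    return json.loads(a[0])["city"] == json.loads(b[0])["city"]


def match_local(a_items, b_items):
    """Plain-Python reference: the nested loop from the original question."""
    result = {}
    for a, b in product(a_items, b_items):
        if matches(a, b):
            result.setdefault(a, []).append(b)
    return result


def match_cartesian(sc, a_items, b_items):
    """Full N x M comparison on Spark: cartesian, then filter, then group by A."""
    rdd_a = sc.parallelize(a_items)
    rdd_b = sc.parallelize(b_items)
    return (rdd_a.cartesian(rdd_b)                      # every (a, b) pair
                 .filter(lambda pair: matches(pair[0], pair[1]))
                 .groupByKey()                          # key is the element of A
                 .mapValues(list))


def match_join(sc, a_items, b_items):
    """Cheaper variant, valid only if a match implies an equal key."""
    def key(item):
        return json.loads(item[0])["city"]

    rdd_a = sc.parallelize(a_items).keyBy(key)          # (city, a)
    rdd_b = sc.parallelize(b_items).keyBy(key)          # (city, b)
    # join yields (city, (a, b)); drop the key and group the b's under each a.
    return rdd_a.join(rdd_b).values().groupByKey().mapValues(list)
```

With a live SparkContext, `match_cartesian(sc, A, B).collect()` should give the same groupings as `match_local(A, B)`, though element order is not guaranteed. If the comparison cannot be reduced to an equal key, `cartesian` is the direct equivalent of the nested loop, at the cost of materializing all N x M pairs.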
On Mon, Aug 15, 2016 at 1:24 PM, Daniel Imberman <daniel.imber...@gmail.com> wrote:

> There's no real way of doing nested for-loops with RDDs because the whole
> idea is that you could have so much data in the RDD that it would be really
> ugly to store it all in one worker.
>
> There are, however, ways to handle what you're asking about.
>
> I would personally use something like CoGroup or Join between the two
> RDDs. If index matters, you can use ZipWithIndex on both before you join
> and then see which indexes match up.
>
> On Mon, Aug 15, 2016 at 1:15 PM Eric Ho <e...@analyticsmd.com> wrote:
>
>> I've nested foreach loops like this:
>>
>> for i in A[i] do:
>>     for j in B[j] do:
>>         append B[j] to some list if B[j] 'matches' A[i] in some fashion.
>>
>> Each element in A or B is some complex structure like:
>> (
>>     some complex JSON,
>>     some number
>> )
>>
>> Question: if A and B were represented as RDDs (e.g. RDD(A) and RDD(B)),
>> how would my code look?
>> Are there any RDD operators that would allow me to loop through both RDDs
>> like the above procedural code?
>> I can't find any RDD operators or code fragments that would allow me
>> to do this.
>>
>> Thing is: by the time I composed RDD(A), this RDD would contain
>> elements in array B as well as array A.
>> Same argument for RDD(B).
>>
>> Any pointers much appreciated.
>>
>> Thanks.
>>
>> --
>> -eric ho

--
-eric ho