Does this same functionality exist with Java?

On 17 April 2015 at 02:23, Evo Eftimov <evo.efti...@isecc.com> wrote:
> You can use
>
> def partitionBy(partitioner: Partitioner): RDD[(K, V)]
> Return a copy of the RDD partitioned using the specified partitioner
>
> The https://github.com/amplab/spark-indexedrdd stuff looks pretty cool
> and is something which adds valuable functionality to Spark, e.g. the
> point lookups, PROVIDED it can be executed from within a function
> running on worker executors
>
> Can somebody from Databricks shed more light here?
>
> -----Original Message-----
> From: Wang, Ningjun (LNG-NPV) [mailto:ningjun.w...@lexisnexis.com]
> Sent: Thursday, April 16, 2015 9:39 PM
> To: user@spark.apache.org
> Subject: RE: How to join RDD keyValuePairs efficiently
>
> Evo
>
> > partition the large doc RDD based on the hash function on the key,
> > i.e. the docid
>
> What API do I use to do this?
>
> By the way, loading the entire dataset into memory causes an
> OutOfMemory problem because it is too large (I only have one machine
> with 16GB and 4 cores).
>
> I found something called IndexedRDD on the web:
> https://github.com/amplab/spark-indexedrdd
>
> Has anybody used it?
>
> Ningjun
>
> -----Original Message-----
> From: Evo Eftimov [mailto:evo.efti...@isecc.com]
> Sent: Thursday, April 16, 2015 12:18 PM
> To: 'Sean Owen'; Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: RE: How to join RDD keyValuePairs efficiently
>
> Ningjun, to speed up your current design you can do the following:
>
> 1. partition the large doc RDD based on the hash function on the key,
> i.e. the docid
>
> 2. persist the large dataset in memory so it is available for
> subsequent queries without reloading and repartitioning for every
> search query
>
> 3. partition the small doc dataset in the same way - this will result
> in co-located small and large RDD partitions with the same key
>
> 4. run the join - the match is not going to be "sequential"; it is
> based on the hash of the key, and RDD elements with the same key will
> be co-located on the same cluster node
>
> OR simply go for Sean's suggestion - under the hood it works in a
> slightly different way: the filter is executed in mappers running in
> parallel on every node, and by passing the small set of doc IDs to
> each filter (mapper) you essentially replicate it on every node, so
> each mapper instance has its own copy and runs with it when filtering
>
> And finally, you can prototype both options described above and
> measure and compare their performance
>
> -----Original Message-----
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Thursday, April 16, 2015 5:02 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: How to join RDD keyValuePairs efficiently
>
> This would be much, much faster if your set of IDs were simply a Set,
> and you passed that to a filter() call that just kept the docs whose
> ID is in the set.
>
> On Thu, Apr 16, 2015 at 4:51 PM, Wang, Ningjun (LNG-NPV)
> <ningjun.w...@lexisnexis.com> wrote:
> > Does anybody have a solution for this?
> >
> > From: Wang, Ningjun (LNG-NPV)
> > Sent: Tuesday, April 14, 2015 10:41 AM
> > To: user@spark.apache.org
> > Subject: How to join RDD keyValuePairs efficiently
> >
> > I have an RDD that contains millions of Document objects. Each
> > document has a unique Id that is a string. I need to find documents
> > by their ids quickly.
> > Currently I use RDD join as follows.
> >
> > First I save the RDD as an object file:
> >
> > val allDocs: RDD[Document] = getDocs() // this RDD contains 7 million
> >                                        // Document objects
> > allDocs.saveAsObjectFile("/temp/allDocs.obj")
> >
> > Then I wrote a function to find documents by ids:
> >
> > def findDocumentsByIds(docids: RDD[String]) = {
> >   // docids contains fewer than 100 items
> >   val allDocs: RDD[Document] = sc.objectFile[Document]("/temp/allDocs.obj")
> >   val idAndDocs = allDocs.keyBy(d => d.id)
> >   docids.map(id => (id, id)).join(idAndDocs).map(t => t._2._2)
> > }
> >
> > I found that this is very slow. I suspect it scans the entire 7
> > million Document objects in "/temp/allDocs.obj" sequentially to find
> > the desired documents.
> >
> > Is there any efficient way to do this?
> >
> > One option I am thinking of is that instead of storing the
> > RDD[Document] as an object file, I store each document in a separate
> > file with the filename equal to the docid. This way I can find a
> > document quickly by docid. However, this means I need to save the RDD
> > to 7 million small files, which will take a very long time and may
> > cause IO problems with so many small files.
> >
> > Is there any other way?
> >
> > Ningjun
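
For reference, here is a minimal sketch (not from the thread) of the partition-and-join approach Evo describes above. It assumes a SparkContext named sc, a stand-in Document case class, and an arbitrary partition count of 16; the object-file path is taken from Ningjun's code.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

case class Document(id: String, text: String)  // stand-in for the real Document class

val partitioner = new HashPartitioner(16)      // partition count is an assumption

// 1. Key the large RDD by docid and hash-partition it on that key.
val idAndDocs: RDD[(String, Document)] =
  sc.objectFile[Document]("/temp/allDocs.obj")
    .keyBy(_.id)
    .partitionBy(partitioner)

// 2. Persist it so later queries do not reload and repartition the 7 million
//    documents (MEMORY_AND_DISK spills to disk instead of failing with OOM).
idAndDocs.persist(StorageLevel.MEMORY_AND_DISK)

// 3. Partition the small docid RDD with the same partitioner, and
// 4. run the join - matching keys are now in co-located partitions.
def findDocumentsByIds(docids: RDD[String]): RDD[Document] =
  docids.map(id => (id, id))
        .partitionBy(partitioner)
        .join(idAndDocs)
        .map { case (_, (_, doc)) => doc }

Since both sides of the join share the same partitioner, only the small docid RDD is shuffled; the cached document partitions stay where they are.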
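
And a sketch of Sean's filter() suggestion, under the same assumptions (sc, Document, and the object-file path from the thread). For a hundred or so ids a plain Scala Set captured in the task closure is enough:

import org.apache.spark.rdd.RDD

// Load (or better, load once and cache) the large RDD.
val allDocs: RDD[Document] = sc.objectFile[Document]("/temp/allDocs.obj")

// The small Set is shipped to every executor with the task closure, so each
// mapper filters with its own local copy - no join and no shuffle is needed.
def findDocumentsByIds(docids: Set[String]): RDD[Document] =
  allDocs.filter(doc => docids.contains(doc.id))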