I have an RDD that contains millions of Document objects. Each document has an 
unique Id that is a string. I need to find the documents by ids quickly. 
Currently I used RDD join as follow

First I save the RDD as object file

allDocs : RDD[Document] = getDocs()  // this RDD contains 7 million Document 
objects
allDocs.saveAsObjectFile("/temp/allDocs.obj")

Then I wrote a function to find documents by Ids

def findDocumentsByIds(docids: RDD[String]) = {
// docids contains less than 100 item

val allDocs : RDD[Document] =sc.objectFile[Document]( ("/temp/allDocs.obj")

val idAndDocs = allDocs.keyBy(d => dv.id)
    docids.map(id => (id,id)).join(idAndDocs).map(t => t._2._2)
}

I found that this is very slow. I suspect it scan the entire 7 million Document 
objects in "/temp/allDocs.obj" sequentially to find the desired document.

Is there any efficient way to do this?

One option I am thinking is that instead of storing the RDD[Document] as object 
file, I store each document in a separate file with filename equal to the 
docid. This way I can find a document quickly by docid. However this means I 
need to save the RDD to 7 million small file which will take a very long time 
to save and may cause IO problems with so many small files.

Is there any other way?



Ningjun

Reply via email to