I don't need it to be 100% random. How about randomly picking a few partitions
and returning all docs in those partitions? Is
rdd.mapPartitionsWithIndex() the right method to use to process just a small
portion of the partitions?
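
Something like this rough sketch is what I have in mind (untested; the
number of partitions to keep is just illustrative):

import scala.util.Random

// Untested sketch: keep all docs from a few randomly chosen partitions.
val rdd = sc.objectFile[Document]("C:/temp/docs.obj")
val keep = Random.shuffle(rdd.partitions.indices.toList).take(3).toSet

val sampled = rdd.mapPartitionsWithIndex { (idx, iter) =>
  if (keep(idx)) iter else Iterator.empty
}

I assume tasks would still be launched for every partition, but the skipped
ones should return without iterating over their data. (Maybe
PartitionPruningRDD could avoid scheduling the pruned partitions entirely,
but I haven't tried it.)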

Ningjun

-----Original Message-----
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, May 21, 2015 11:30 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow

I guess the fundamental issue is that these aren't stored in a way that allows 
random access to a Document.

Underneath, Hadoop has a concept of a MapFile, which is like a SequenceFile
with an index of offsets into the file where records begin. Although Spark
doesn't use it, you could maybe create some custom RDD that takes advantage of
this format to grab random elements efficiently.
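
For example, if the docs were re-saved as a MapFile keyed by a sequential
LongWritable ID, you could fetch random records directly. A very rough,
untested sketch (the path, key scheme, and BytesWritable value type are just
for illustration; you'd need the documents serialized into some Writable):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.io.{BytesWritable, LongWritable, MapFile}
import scala.util.Random

val conf = new Configuration()
val fs = FileSystem.get(conf)
// Assumes the MapFile was written with keys 0 .. count-1
val reader = new MapFile.Reader(fs, "/data/docs.map", conf)

val value = new BytesWritable()
val sample = (1 to 70).map { _ =>
  val key = new LongWritable(Random.nextInt(7000000).toLong)
  reader.get(key, value) // seeks via the index, reads just one record
  value.copyBytes()      // deserialize into a Document in real code
}
reader.close()

That would avoid touching the rest of the 29.7 GB entirely.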

Other things come to mind but I think they're all slower -- like hashing all 
the docs and taking the smallest n in each of k partitions to get a pretty 
uniform random sample of kn docs.
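
A rough sketch of that idea (untested; assumes an rdd: RDD[Document] in
scope and a reasonably well-distributed Document.hashCode):

// Take the n docs with the smallest hash in each partition. Note this
// materializes each partition in memory to sort it, and still reads
// everything -- hence "slower".
val n = 10
val sampled = rdd.mapPartitions { iter =>
  iter.map(doc => (doc.hashCode, doc))
    .toArray.sortBy(_._1).take(n)
    .iterator
}.map(_._2)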


On Thu, May 21, 2015 at 4:04 PM, Wang, Ningjun (LNG-NPV) 
<ningjun.w...@lexisnexis.com> wrote:
> Is there any other way to solve the problem? Let me state the use case:
>
>
>
> I have an RDD[Document] that contains over 7 million items. The RDD needs
> to be saved to persistent storage (currently I save it as an object file on
> disk). Then I need to get a small random sample of Document objects (e.g.
> 10,000 documents). How can I do this quickly? The rdd.sample() method does
> not help because it needs to read the entire RDD of 7 million Documents
> from disk, which takes a very long time.
>
>
>
> Ningjun
>
>
>
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Tuesday, May 19, 2015 4:51 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: rdd.sample() methods very slow
>
>
>
> The way these files are accessed is inherently sequential. There isn't,
> in general, a way to know where record N is in a file like this and jump
> to it. So they must be read through to be sampled.
>
>
>
>
>
> On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) 
> <ningjun.w...@lexisnexis.com> wrote:
>
> Hi
>
>
>
> I have an RDD[Document] that contains 7 million objects, saved in the file
> system as an object file. I want to get a random sample of about 70 objects
> from it using the rdd.sample() method. It is very slow.
>
>
>
>
>
> val rdd: RDD[Document] = sc
>   .objectFile[Document]("C:/temp/docs.obj")
>   .sample(false, 0.00001D, 0L)
>   .cache()
>
> val count = rdd.count()
>
>
>
> From the Spark UI, I see that Spark is trying to read the entire object
> file in the folder “C:/temp/docs.obj”, which is about 29.7 GB. Of course
> this is very slow. Why does Spark read all 7 million objects when I only
> need a random sample of 70?
>
>
>
> Is there any efficient way to get a random sample of 70 objects without
> reading through the entire object file?
>
>
>
> Ningjun
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
