Re: rdd.sample() methods very slow

2015-05-21 Thread Reynold Xin
You can do something like this:

import scala.util.Random
import org.apache.spark.rdd.PartitionPruningRDD

val myRdd = ...

// This samples 10% of the partitions.
val rddSampledByPartition = PartitionPruningRDD.create(myRdd, i => Random.nextDouble() < 0.1)

// Take the first 10 elements out of each kept partition.
rddSampledByPartition.mapPartitions { iter => iter.take(10) }
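
Note that PartitionPruningRDD lives in org.apache.spark.rdd, and mapPartitions
is lazy, so you still need an action to actually pull the sample back, e.g.
something like:

// Materialize the sampled elements on the driver (illustrative val name).
val sampledDocs = rddSampledByPartition.mapPartitions { iter => iter.take(10) }.collect()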



On Thu, May 21, 2015 at 11:36 AM, Sean Owen  wrote:

> If sampling whole partitions is sufficient (or part of a partition),
> sure, you could use mapPartitionsWithIndex and decide whether to process a
> partition at all based on its index, and skip the rest. That's much faster.
>
> On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
>  wrote:
> > I don't need to be 100% random. How about randomly picking a few
> > partitions and returning all docs in those partitions? Is
> > rdd.mapPartitionsWithIndex() the right method to use to process just a
> > small portion of the partitions?
> >
> > Ningjun
>


Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
If sampling whole partitions is sufficient (or part of a partition),
sure, you could use mapPartitionsWithIndex and decide whether to process a
partition at all based on its index, and skip the rest. That's much faster.
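
For example, something along these lines (just a sketch; samplePartitions and
the seed handling are made up for illustration, not an existing API):

import scala.reflect.ClassTag
import scala.util.Random

import org.apache.spark.rdd.RDD

// Keep roughly `fraction` of the partitions, decided per partition index with a
// seeded RNG, and skip the rest. A task still runs for every partition, but a
// skipped partition's iterator is never consumed, so its records aren't read.
def samplePartitions[T: ClassTag](rdd: RDD[T], fraction: Double, seed: Long = 0L): RDD[T] =
  rdd.mapPartitionsWithIndex { (idx, iter) =>
    if (new Random(seed + idx).nextDouble() < fraction) iter else Iterator.empty
  }

// e.g. keep ~10% of the partitions, then cap how many elements you pull back:
// val sample = samplePartitions(docs, 0.1).take(10000)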

On Thu, May 21, 2015 at 7:07 PM, Wang, Ningjun (LNG-NPV)
 wrote:
> I don't need to be 100% random. How about randomly picking a few partitions and
> returning all docs in those partitions? Is
> rdd.mapPartitionsWithIndex() the right method to use to process just a small
> portion of the partitions?
>
> Ningjun




RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
I don't need to be 100% random. How about randomly picking a few partitions and
returning all docs in those partitions? Is
rdd.mapPartitionsWithIndex() the right method to use to process just a small
portion of the partitions?

Ningjun

-Original Message-
From: Sean Owen [mailto:so...@cloudera.com] 
Sent: Thursday, May 21, 2015 11:30 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow

I guess the fundamental issue is that these aren't stored in a way that allows 
random access to a Document.

Underneath, Hadoop has a concept of a MapFile which is like a SequenceFile with 
an index of offsets into the file where records begin. Although Spark doesn't
use it, you could maybe create some custom RDD that takes advantage of this 
format to grab random elements efficiently.

Other things come to mind but I think they're all slower -- like hashing all 
the docs and taking the smallest n in each of k partitions to get a pretty 
uniform random sample of kn docs.


On Thu, May 21, 2015 at 4:04 PM, Wang, Ningjun (LNG-NPV) 
 wrote:
> Is there any other way to solve the problem? Let me state the use case
>
>
>
> I have an RDD[Document] that contains over 7 million items. The RDD needs
> to be saved on persistent storage (currently I save it as an object file on
> disk).
> Then I need to get a small random sample of Document objects (e.g.
> 10,000 documents). How can I do this quickly? The rdd.sample() method
> does not help because it needs to read the entire RDD of 7 million
> Documents from disk, which takes a very long time.
>
>
>
> Ningjun
>
>
>
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Tuesday, May 19, 2015 4:51 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: rdd.sample() methods very slow
>
>
>
> The way these files are accessed is inherently sequential-access.
> There isn't, in general, a way to know where record N is in a file like
> this and jump to it. So they must be read to be sampled.
>
>
>
>
>
> On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) 
>  wrote:
>
> Hi
>
>
>
> I have an RDD[Document] that contains 7 million objects and it is
> saved in the file system as an object file. I want to get a random sample of
> about 70 objects from it using the rdd.sample() method. It is very slow.
>
>
>
>
>
> val rdd : RDD[Document] =
> sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.1D,
> 0L).cache()
>
> val count = rdd.count()
>
>
>
> From the Spark UI, I see Spark is trying to read all of the object files in
> the folder “C:/temp/docs.obj”, which is about 29.7 GB. Of course this
> is very slow. Why does Spark try to read the entire 7 million objects
> while I only need to return a random sample of 70 objects?
>
>
>
> Is there any efficient way to get a random sample of 70 objects
> without reading through all of the object files?
>
>
>
> Ningjun
>
>
>
>




Re: rdd.sample() methods very slow

2015-05-21 Thread Sean Owen
I guess the fundamental issue is that these aren't stored in a way
that allows random access to a Document.

Underneath, Hadoop has a concept of a MapFile which is like a
SequenceFile with an index of offsets into the file where records
begin. Although Spark doesn't use it, you could maybe create some
custom RDD that takes advantage of this format to grab random elements
efficiently.
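
To illustrate the idea (not a custom RDD, just a driver-side lookup; the path,
key scheme and value type below are assumptions, and it only works if the data
was written as a MapFile in exactly this way):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{IntWritable, MapFile, Text}
import scala.util.Random

// Assume the docs were written as a MapFile keyed by a dense IntWritable id
// (0 until numDocs), with the serialized document as a Text value.
val conf = new Configuration()
val reader = new MapFile.Reader(new Path("/data/docs.map"), conf)
val numDocs = 7000000            // total number of keys written to the file
val value = new Text()
val rnd = new Random(0L)
val sample = (1 to 70).map { _ =>
  reader.get(new IntWritable(rnd.nextInt(numDocs)), value)  // index seek + one record read
  value.toString
}
reader.close()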

Other things come to mind but I think they're all slower -- like
hashing all the docs and taking the smallest n in each of k partitions
to get a pretty uniform random sample of kn docs.
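
Roughly like this, as a sketch (hashSample is a made-up helper, not an API; as
said, it still has to read every record once):

import scala.reflect.ClassTag

import org.apache.spark.rdd.RDD

// Key each record by a hash, keep the n records with the smallest hashes in
// each partition, and collect the winners. With k partitions that's about
// k * n records, approximating a uniform sample.
def hashSample[T: ClassTag](rdd: RDD[T], nPerPartition: Int): Array[T] =
  rdd.mapPartitions { iter =>
    iter.map(x => (x.##, x))        // x.## is just hashCode; any well-mixed hash works
      .toArray
      .sortBy(_._1)                 // fine for a sketch; a bounded heap would avoid
      .take(nPerPartition)          // materializing the whole partition in memory
      .iterator
  }.map(_._2).collect()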


On Thu, May 21, 2015 at 4:04 PM, Wang, Ningjun (LNG-NPV)
 wrote:
> Is there any other way to solve the problem? Let me state the use case
>
>
>
> I have an RDD[Document] that contains over 7 million items. The RDD needs to be
> saved on persistent storage (currently I save it as an object file on disk).
> Then I need to get a small random sample of Document objects (e.g. 10,000
> documents). How can I do this quickly? The rdd.sample() method does not help
> because it needs to read the entire RDD of 7 million Documents from disk, which
> takes a very long time.
>
>
>
> Ningjun
>
>
>
> From: Sean Owen [mailto:so...@cloudera.com]
> Sent: Tuesday, May 19, 2015 4:51 PM
> To: Wang, Ningjun (LNG-NPV)
> Cc: user@spark.apache.org
> Subject: Re: rdd.sample() methods very slow
>
>
>
> The way these files are accessed is inherently sequential-access. There
> isn't, in general, a way to know where record N is in a file like this and
> jump to it. So they must be read to be sampled.
>
>
>
>
>
> On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV)
>  wrote:
>
> Hi
>
>
>
> I have an RDD[Document] that contains 7 million objects and it is saved in
> the file system as an object file. I want to get a random sample of about 70
> objects from it using the rdd.sample() method. It is very slow.
>
>
>
>
>
> val rdd : RDD[Document] =
> sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.1D,
> 0L).cache()
>
> val count = rdd.count()
>
>
>
> From the Spark UI, I see Spark is trying to read all of the object files in the
> folder “C:/temp/docs.obj”, which is about 29.7 GB. Of course this is very
> slow. Why does Spark try to read the entire 7 million objects while I only need
> to return a random sample of 70 objects?
>
>
>
> Is there any efficient way to get a random sample of 70 objects without
> reading through all of the object files?
>
>
>
> Ningjun
>
>
>
>




RE: rdd.sample() methods very slow

2015-05-21 Thread Wang, Ningjun (LNG-NPV)
Is there any other way to solve the problem? Let me state the use case

I have an RDD[Document] that contains over 7 million items. The RDD needs to be saved
on persistent storage (currently I save it as an object file on disk). Then I
need to get a small random sample of Document objects (e.g. 10,000 documents).
How can I do this quickly? The rdd.sample() method does not help because it
needs to read the entire RDD of 7 million Documents from disk, which takes a very
long time.

Ningjun

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Tuesday, May 19, 2015 4:51 PM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: rdd.sample() methods very slow

The way these files are accessed is inherently sequential-access. There isn't,
in general, a way to know where record N is in a file like this and jump to it. So
they must be read to be sampled.


On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) 
<ningjun.w...@lexisnexis.com> wrote:
Hi

I have an RDD[Document] that contains 7 million objects and it is saved in the file
system as an object file. I want to get a random sample of about 70 objects from
it using the rdd.sample() method. It is very slow.


val rdd : RDD[Document] = 
sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.1D, 0L).cache()
val count = rdd.count()

From the Spark UI, I see Spark is trying to read all of the object files in the folder
“C:/temp/docs.obj”, which is about 29.7 GB. Of course this is very slow. Why
does Spark try to read the entire 7 million objects while I only need to return a
random sample of 70 objects?

Is there any efficient way to get a random sample of 70 objects without reading
through all of the object files?

Ningjun




Re: rdd.sample() methods very slow

2015-05-19 Thread Sean Owen
The way these files are accessed is inherently sequential-access. There
isn't, in general, a way to know where record N is in a file like this and
jump to it. So they must be read to be sampled.


On Tue, May 19, 2015 at 9:44 PM, Wang, Ningjun (LNG-NPV) <
ningjun.w...@lexisnexis.com> wrote:

>  Hi
>
>
>
> I have an RDD[Document] that contains 7 million objects and it is saved in
> the file system as an object file. I want to get a random sample of about 70
> objects from it using the rdd.sample() method. It is very slow.
>
>
>
>
>
> val rdd : RDD[Document] =
> sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.1D,
> 0L).cache()
>
> val count = rdd.count()
>
>
>
> From the Spark UI, I see Spark is trying to read all of the object files in the
> folder “C:/temp/docs.obj”, which is about 29.7 GB. Of course this is very
> slow. Why does Spark try to read the entire 7 million objects while I only need
> to return a random sample of 70 objects?
>
>
>
> Is there any efficient way to get a random sample of 70 objects without
> reading through all of the object files?
>
>
>
> Ningjun
>
>
>


rdd.sample() methods very slow

2015-05-19 Thread Wang, Ningjun (LNG-NPV)
Hi

I have an RDD[Document] that contains 7 million objects and it is saved in the file
system as an object file. I want to get a random sample of about 70 objects from
it using the rdd.sample() method. It is very slow.


val rdd : RDD[Document] = 
sc.objectFile[Document]("C:/temp/docs.obj").sample(false, 0.1D, 0L).cache()
val count = rdd.count()

From the Spark UI, I see Spark is trying to read all of the object files in the
folder "C:/temp/docs.obj", which is about 29.7 GB. Of course this is very slow.
Why does Spark try to read the entire 7 million objects while I only need to
return a random sample of 70 objects?

Is there any efficient way to get a random sample of 70 objects without reading
through all of the object files?

Ningjun