from:"Maximo Gurmendez"

Re: Random Forest MLlib

2015-09-10 Thread Maximo Gurmendez

Hi Yasemin,
   We had the same question and found this:

https://issues.apache.org/jira/browse/SPARK-6884

Thanks,
   Maximo

On Sep 10, 2015, at 9:09 AM, Yasemin Kaya 
mailto:godo...@gmail.com>> wrote:

Hi ,

I am using Random Forest Alg. for recommendation system. I get users and users' 
response yes or no (1/0). But I want to learn the probability of the trees. 
Program says x user yes but with how much probability, I want to get these 
probabilities.

Best,
yasemin
--
hiç ender hiç

Re: Partitioning a RDD for training multiple classifiers

2015-09-09 Thread Maximo Gurmendez

Adding an example (very raw), to see if my understanding is correct:

 val repartitioned = bidRdd.partitionBy(new Partitioner {
def numPartitions: Int = 100

def getPartition(clientId: Any): Int = hash(clientId) % 100
}

val cachedRdd =  repartitioned.cache()

val client1Rdd = cachedRdd.filter({case (clientId:String,v:String) => 
clientId.equals(“Client 1")})

val client2Rdd = cachedRdd.filter({case (clientId:String,v:String) => 
clientId.equals(“Client 2")})

MyModel.train(client1Rdd)
MyModel.train(client2Rdd)

When the first MyModel.train() runs it will trigger an action and will cause 
the repartition of the original bigRdd and its caching.

Can someone confirm if the following statements are true?

1) When I execute an action on client2Rdd, it will only read from the partition 
that corresponds to Client 2 (it won’t iterate over ALL items originally in 
bigRdd)
2) The caching happens in a way that preserves the partitioning by client Id 
(and the locality)

Thanks,
   Maximo

PD: I am aware that this might cause imbalance of data, but I can probably 
mitigate that with a smarter partitioner.


On Sep 9, 2015, at 9:30 AM, Maximo Gurmendez 
mailto:mgurmen...@dataxu.com>> wrote:

Thanks Ben for your answer. I’ll explore what happens under the hoods in a  
data frame.

Regarding the ability to split a large RDD into n RDDs without requiring n 
passes to the large RDD.  Can partitionBy() help? If I partition by a key that 
corresponds to the the split criteria (i..e client id) and then cache each of 
those RDDs. Will that lessen the effect of repeated large traversals (since 
Spark will figure out that for each smaller RDD it just needs to traverse a 
subset of the partitions)?

Thanks!
   Máximo


On Sep 8, 2015, at 11:32 AM, Ben Tucker 
mailto:ben...@chatid.com>> wrote:

Hi Maximo —

This is a relatively naive answer, but I would consider structuring the RDD 
into a DataFrame, then saving the 'splits' using something like 
DataFrame.write.parquet(hdfs_path, byPartition=('client')). You could then read 
a DataFrame from each resulting parquet directory and do your per-client work 
from these. You mention re-using the splits, so this solution might be worth 
the file-writing time.

Does anyone know of a method that gets a collection of DataFrames — one for 
each partition, in the byPartition=('client') sense — from a 'big' DataFrame? 
Basically, the equivalent of writing by partition and creating a DataFrame for 
each result, but skipping the HDFS step.


On Tue, Sep 8, 2015 at 10:47 AM, Maximo Gurmendez 
mailto:mgurmen...@dataxu.com>> wrote:
Hi,
I have a RDD that needs to be split (say, by client) in order to train n 
models (i.e. one for each client). Since most of the classifiers that come with 
ml-lib only can accept an RDD as input (and cannot build multiple models in one 
pass - as I understand it), the only way to train n separate models is to 
create n RDDs (by filtering the original one).

Conceptually:

rdd1,rdd2,rdd3 = splitRdds(bigRdd)

the function splitRdd would use the standard filter mechanism .  I would then 
need to submit n training spark jobs. When I do this, will it mean that it will 
traverse the bigRdd n times? Is there a better way to persist the splitted rdd 
(i.e. save the split RDD in a cache)?

I could cache the bigRdd, but not sure that would be ver efficient either since 
it will require the same number of passes anyway (I think - but I’m relatively 
new to Spark). Also I’m planning on reusing the individual splits (rdd1, rdd2, 
etc so would be convenient to have them individually cached).

Another problem is that the splits are could be very skewed (i.e. one split 
could represent a large percentage of the original bigRdd ). So saving the 
split RDDs to disk (at least, naively) could be a challenge.

Is there any better way of doing this?

Thanks!
   Máximo

Re: Partitioning a RDD for training multiple classifiers

2015-09-09 Thread Maximo Gurmendez

Thanks Ben for your answer. I’ll explore what happens under the hoods in a  
data frame.

Regarding the ability to split a large RDD into n RDDs without requiring n 
passes to the large RDD.  Can partitionBy() help? If I partition by a key that 
corresponds to the the split criteria (i..e client id) and then cache each of 
those RDDs. Will that lessen the effect of repeated large traversals (since 
Spark will figure out that for each smaller RDD it just needs to traverse a 
subset of the partitions)?

Thanks!
   Máximo


On Sep 8, 2015, at 11:32 AM, Ben Tucker 
mailto:ben...@chatid.com>> wrote:

Hi Maximo —

This is a relatively naive answer, but I would consider structuring the RDD 
into a DataFrame, then saving the 'splits' using something like 
DataFrame.write.parquet(hdfs_path, byPartition=('client')). You could then read 
a DataFrame from each resulting parquet directory and do your per-client work 
from these. You mention re-using the splits, so this solution might be worth 
the file-writing time.

Does anyone know of a method that gets a collection of DataFrames — one for 
each partition, in the byPartition=('client') sense — from a 'big' DataFrame? 
Basically, the equivalent of writing by partition and creating a DataFrame for 
each result, but skipping the HDFS step.


On Tue, Sep 8, 2015 at 10:47 AM, Maximo Gurmendez 
mailto:mgurmen...@dataxu.com>> wrote:
Hi,
I have a RDD that needs to be split (say, by client) in order to train n 
models (i.e. one for each client). Since most of the classifiers that come with 
ml-lib only can accept an RDD as input (and cannot build multiple models in one 
pass - as I understand it), the only way to train n separate models is to 
create n RDDs (by filtering the original one).

Conceptually:

rdd1,rdd2,rdd3 = splitRdds(bigRdd)

the function splitRdd would use the standard filter mechanism .  I would then 
need to submit n training spark jobs. When I do this, will it mean that it will 
traverse the bigRdd n times? Is there a better way to persist the splitted rdd 
(i.e. save the split RDD in a cache)?

I could cache the bigRdd, but not sure that would be ver efficient either since 
it will require the same number of passes anyway (I think - but I’m relatively 
new to Spark). Also I’m planning on reusing the individual splits (rdd1, rdd2, 
etc so would be convenient to have them individually cached).

Another problem is that the splits are could be very skewed (i.e. one split 
could represent a large percentage of the original bigRdd ). So saving the 
split RDDs to disk (at least, naively) could be a challenge.

Is there any better way of doing this?

Thanks!
   Máximo

Partitioning a RDD for training multiple classifiers

2015-09-08 Thread Maximo Gurmendez

Hi,
I have a RDD that needs to be split (say, by client) in order to train n 
models (i.e. one for each client). Since most of the classifiers that come with 
ml-lib only can accept an RDD as input (and cannot build multiple models in one 
pass - as I understand it), the only way to train n separate models is to 
create n RDDs (by filtering the original one). 

Conceptually:

rdd1,rdd2,rdd3 = splitRdds(bigRdd)  

the function splitRdd would use the standard filter mechanism .  I would then 
need to submit n training spark jobs. When I do this, will it mean that it will 
traverse the bigRdd n times? Is there a better way to persist the splitted rdd 
(i.e. save the split RDD in a cache)? 

I could cache the bigRdd, but not sure that would be ver efficient either since 
it will require the same number of passes anyway (I think - but I’m relatively 
new to Spark). Also I’m planning on reusing the individual splits (rdd1, rdd2, 
etc so would be convenient to have them individually cached). 

Another problem is that the splits are could be very skewed (i.e. one split 
could represent a large percentage of the original bigRdd ). So saving the 
split RDDs to disk (at least, naively) could be a challenge. 

Is there any better way of doing this?

Thanks!
   Máximo

Hadoop Distributed Cache

2014-09-10 Thread Maximo Gurmendez

Hi,
  As part of SparkContext.newAPIHadoopRDD(). Would Spark support an InputFormat 
that uses Hadoop’s distributed cache?
Thanks,
   Máximo
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Re: Random Forest MLlib

Re: Partitioning a RDD for training multiple classifiers

Re: Partitioning a RDD for training multiple classifiers

Partitioning a RDD for training multiple classifiers

Hadoop Distributed Cache

5 matches

Site Navigation

Mail list logo

Footer information