Spark Checkpointing behavior

2016-02-27 Thread Tarek Elgamal
Hi,

I am trying to understand the behavior of rdd.checkpoint() in Spark. I am
running the JavaPageRank example on a 1 GB graph and checkpointing the
*ranks* RDD inside each iteration (between lines 125 and 126 in the linked
example). Spark execution starts when it hits the *collect()* action. I
expected that after each iteration the intermediate ranks would be
materialized and written to the checkpoint directory, but it seems that the
RDD is only written once, at the end of the program, even though I invoke
ranks.checkpoint() inside the for loop. Is that the default behavior?

Note that I am caching the RDD before checkpointing in order to avoid
recomputation.
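
For concreteness, here is a minimal, self-contained sketch of the pattern I
am describing (the class name, the checkpoint directory, and the toy rank
update are placeholders, not my actual PageRank code):

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CheckpointLoopSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("Checkpoint Loop Sketch");
    JavaSparkContext sc = new JavaSparkContext(conf);
    sc.setCheckpointDir("/tmp/spark-checkpoints"); // placeholder checkpoint dir

    JavaRDD<Double> ranks = sc.parallelize(Arrays.asList(1.0, 1.0, 1.0));
    for (int i = 0; i < 10; i++) {
      // stand-in for the per-iteration rank update
      ranks = ranks.map(r -> 0.15 + 0.85 * r);
      ranks.cache();      // cache so checkpointing does not recompute the lineage
      ranks.checkpoint(); // only marks the RDD; nothing is written here
    }
    // The first action is where checkpoint data actually gets written.
    System.out.println(ranks.collect());
  }
}

With this pattern I would have expected one checkpoint per iteration, but I
only see the final RDD being written.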

Best Regards,
Tarek


Re: Problem in running MLlib SVM

2015-11-28 Thread Tarek Elgamal
According to the documentation
<http://spark.apache.org/docs/latest/mllib-linear-methods.html>, by default
the outcome is positive if wTx >= 0 and negative otherwise. I suppose that
wTx is the "score" in my case. If the score is at least 0 and the label is
positive, I return 1, which counts as a correct classification, and I return
0 otherwise. Do you have any idea how to classify a point as positive or
negative using this score, or whether I should use another function?
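
For what it is worth, here is a minimal sketch of the alternative I am
considering (the helper class ThresholdedAccuracySketch is made up for
illustration; the assumption is that if the default threshold is kept, i.e.
clearThreshold() is not called, predict() returns the 0/1 class label
directly rather than the raw score):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.regression.LabeledPoint;

public class ThresholdedAccuracySketch {
  // With the default threshold in place, predict() returns 0.0 or 1.0,
  // so it can be compared against the true label directly.
  public static double accuracy(SVMModel model, JavaRDD<LabeledPoint> data) {
    long correct = data.filter(p -> model.predict(p.features()) == p.label()).count();
    return (double) correct / data.count();
  }
}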

On Sat, Nov 28, 2015 at 5:14 AM, Jeff Zhang <zjf...@gmail.com> wrote:

> if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
>   return 1; // correct classification
> } else {
>   return 0;
> }
>
>
>
> I suspect score is always between 0 and 1
>
>
>
> On Sat, Nov 28, 2015 at 10:39 AM, Tarek Elgamal <tarek.elga...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am trying to run the straightforward SVM example, but I am getting
>> low accuracy (around 50%) when I predict on the same data I used for
>> training. I am probably doing the prediction in the wrong way. My code is
>> below. I would appreciate any help.
>>
>>
>> import org.apache.spark.SparkConf;
>> import org.apache.spark.SparkContext;
>> import org.apache.spark.api.java.JavaRDD;
>> import org.apache.spark.api.java.function.Function;
>> import org.apache.spark.api.java.function.Function2;
>> import org.apache.spark.mllib.classification.SVMModel;
>> import org.apache.spark.mllib.classification.SVMWithSGD;
>> import org.apache.spark.mllib.regression.LabeledPoint;
>> import org.apache.spark.mllib.util.MLUtils;
>>
>> public class SimpleDistSVM {
>>   public static void main(String[] args) {
>>     SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
>>     SparkContext sc = new SparkContext(conf);
>>     String inputPath = args[0];
>>
>>     // Read training data in LIBSVM format
>>     JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, inputPath).toJavaRDD();
>>
>>     // Run the training algorithm to build the model.
>>     int numIterations = 3;
>>     final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations);
>>
>>     // Clear the default threshold so predict() returns raw scores.
>>     model.clearThreshold();
>>
>>     // Predict points in the training set and map to an RDD of 0/1 values,
>>     // where 0 is a misclassification and 1 is a correct classification.
>>     JavaRDD<Integer> classification = data.map(new Function<LabeledPoint, Integer>() {
>>       public Integer call(LabeledPoint p) {
>>         int label = (int) p.label();
>>         double score = model.predict(p.features());
>>         if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
>>           return 1; // correct classification
>>         } else {
>>           return 0;
>>         }
>>       }
>>     });
>>
>>     // Sum up all values in the RDD to get the number of correctly classified examples.
>>     int sum = classification.reduce(new Function2<Integer, Integer, Integer>() {
>>       public Integer call(Integer arg0, Integer arg1) throws Exception {
>>         return arg0 + arg1;
>>       }
>>     });
>>
>>     // Compute accuracy as the fraction of correctly classified examples.
>>     double accuracy = ((double) sum) / ((double) classification.count());
>>     System.out.println("Accuracy = " + accuracy);
>>   }
>> }
>>
>
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Problem in running MLlib SVM

2015-11-27 Thread Tarek Elgamal
Hi,

I am trying to run the straightforward SVM example, but I am getting low
accuracy (around 50%) when I predict on the same data I used for
training. I am probably doing the prediction in the wrong way. My code is
below. I would appreciate any help.


import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.classification.SVMWithSGD;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

public class SimpleDistSVM {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf().setAppName("SVM Classifier Example");
    SparkContext sc = new SparkContext(conf);
    String inputPath = args[0];

    // Read training data in LIBSVM format
    JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(sc, inputPath).toJavaRDD();

    // Run the training algorithm to build the model.
    int numIterations = 3;
    final SVMModel model = SVMWithSGD.train(data.rdd(), numIterations);

    // Clear the default threshold so predict() returns raw scores.
    model.clearThreshold();

    // Predict points in the training set and map to an RDD of 0/1 values,
    // where 0 is a misclassification and 1 is a correct classification.
    JavaRDD<Integer> classification = data.map(new Function<LabeledPoint, Integer>() {
      public Integer call(LabeledPoint p) {
        int label = (int) p.label();
        double score = model.predict(p.features());
        if ((score >= 0 && label == 1) || (score < 0 && label == 0)) {
          return 1; // correct classification
        } else {
          return 0;
        }
      }
    });

    // Sum up all values in the RDD to get the number of correctly classified examples.
    int sum = classification.reduce(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer arg0, Integer arg1) throws Exception {
        return arg0 + arg1;
      }
    });

    // Compute accuracy as the fraction of correctly classified examples.
    double accuracy = ((double) sum) / ((double) classification.count());
    System.out.println("Accuracy = " + accuracy);
  }
}
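
As an aside, I am also looking at MLlib's built-in BinaryClassificationMetrics
as a sanity check. A minimal sketch along those lines (the helper class
SvmEvalSketch is made up for illustration; it assumes a model and data like
the ones above, with the threshold cleared so predict() returns raw scores):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.classification.SVMModel;
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics;
import org.apache.spark.mllib.regression.LabeledPoint;

import scala.Tuple2;

public class SvmEvalSketch {
  // Pair each example's raw score with its true label and let MLlib compute AUC.
  public static double areaUnderROC(SVMModel model, JavaRDD<LabeledPoint> data) {
    JavaRDD<Tuple2<Object, Object>> scoreAndLabels = data.map(p ->
        new Tuple2<Object, Object>(model.predict(p.features()), p.label()));
    BinaryClassificationMetrics metrics =
        new BinaryClassificationMetrics(JavaRDD.toRDD(scoreAndLabels));
    return metrics.areaUnderROC();
  }
}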


Contribute code to MLlib

2015-05-18 Thread Tarek Elgamal
Hi,

I would like to contribute an algorithm to the MLlib project. I have
implemented a scalable PCA algorithm on Spark. It is scalable for both tall
and fat matrices, and the paper describing it has been accepted for
publication at the SIGMOD 2015 conference. I looked at the guidelines in the
following link:

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-MLlib-specificContributionGuidelines

I believe that most of the guidelines apply in my case; however, the code
is written in Java, and it was not clear from the guidelines whether the
MLlib project accepts Java code or not.
My algorithm can be found in this repository:
https://github.com/Qatar-Computing-Research-Institute/sPCA

Any help on how to make it suitable for the MLlib project would be greatly
appreciated.

Best Regards,
Tarek Elgamal