I spoke with SK offline about this, it looks like the difference in timings
came from the fact that he was training 100 models for 100 iterations and
taking the total time (vs. my example which trains a single model for 100
iterations). I'm posting my response here, though, because I think it's
worth sharing.
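The arithmetic behind that difference can be sketched with invented numbers: if each iteration costs roughly the same, total time scales with models × iterations, so 100 models × 100 iterations is ~100x the wall-clock time of a single model × 100 iterations.

```python
# Back-of-the-envelope sketch (per-iteration cost is invented):
per_iteration_ms = 10  # assumed constant cost of one iteration

one_model_ms = 100 * per_iteration_ms        # 1 model x 100 iterations
all_models_ms = 100 * 100 * per_iteration_ms  # 100 models x 100 iterations each

print(all_models_ms // one_model_ms)  # 100 -- a 100x difference in totals
```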
Hmm... something is fishy here.
That's a *really* small dataset for a Spark job, so almost all your time
will be spent in scheduling overheads, but you should still be able to train
a logistic regression model with the default options and 100 iterations in
<1s on a single machine.
Are you caching your data?
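For a sense of scale behind the "<1s" claim: a plain-Python sketch (no Spark, no vectorization) of batch gradient descent for logistic regression, 100 iterations on a 200-row × 3-feature dataset like the one in this thread. The data values and step size are invented for illustration.

```python
import math
import random
import time

# Synthetic stand-in for the dataset described in the thread:
# 200 rows, 3 integer-count features, binary labels (rule is invented).
random.seed(0)
X = [[random.randint(0, 10) for _ in range(3)] for _ in range(200)]
y = [1 if sum(row) > 15 else 0 for row in X]

def sigmoid(z):
    # Overflow-safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

w = [0.0, 0.0, 0.0]
step = 0.01  # assumed learning rate

start = time.perf_counter()
for _ in range(100):  # 100 iterations, as in the thread
    grad = [0.0, 0.0, 0.0]
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j in range(3):
            grad[j] += (p - yi) * xi[j]
    w = [wj - step * gj / len(X) for wj, gj in zip(w, grad)]
elapsed = time.perf_counter() - start

print(elapsed < 1.0)  # comfortably under a second, even in pure Python
```

If even interpreted Python finishes this in milliseconds, a tuned single-machine solver should too, which is why multi-second timings point at per-job overhead rather than compute.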
The dataset is quite small: 5.6 KB. It has 200 rows with 3 feature columns
and 1 label column. I split it 80% for the training set and 20% for the
test set. The features are integer counts and the labels are binary (1/0).
thanks
Those are interesting numbers. You haven't mentioned the dataset size in
your thread. This is a classic scalability-versus-performance question,
assuming your baseline numbers are correct and everything on your cluster
is tuned properly.
Putting on my outsider's cap, there are multiple possible reasons for this:
Num iterations: for LR and SVM, I am using the default value of 100. For
all the other parameters I am also using the default values; I am pretty
much reusing the code from BinaryClassification.scala. For Decision Tree, I
don't see any parameter for number of iterations in the example code, so I
did not set one.
Also - what hardware are you running the cluster on? And what is the local
machine hardware?
On Tue, Sep 2, 2014 at 11:57 AM, Evan R. Sparks
wrote:
> How many iterations are you running? Can you provide the exact details
> about the size of the dataset? (how many data points, how many features)
How many iterations are you running? Can you provide the exact details
about the size of the dataset? (how many data points, how many features) Is
this sparse or dense - and for the sparse case, how many non-zeroes? How
many partitions is your data RDD?
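The sparse-vs-dense question can be answered by counting non-zeroes. A sketch with invented values (integer-count features like the ones in this thread are usually effectively dense):

```python
# Toy feature matrix (values invented) -- count non-zero entries to
# estimate density. Density near 1.0 means a dense representation is fine.
rows = [[2, 0, 5],
        [1, 3, 0],
        [4, 4, 4],
        [0, 0, 1]]

nnz = sum(1 for row in rows for v in row if v != 0)
total = sum(len(row) for row in rows)
print(nnz, total)  # 8 12 -- about two-thirds dense
```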
For very small datasets the scheduling overhead will dominate the total
runtime.
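Concretely, with invented but plausible numbers: when per-task scheduling overhead dwarfs the per-task compute on a few-KB dataset, almost all wall-clock time is overhead.

```python
# Illustrative overhead model (all numbers invented):
overhead_ms = 50   # assumed per-task scheduling/launch cost
compute_ms = 1     # assumed gradient work on ~200 rows
tasks = 100        # e.g. one task per iteration on a single partition

total_ms = tasks * (overhead_ms + compute_ms)
useful_ms = tasks * compute_ms
print(total_ms, useful_ms)  # 5100 100 -- ~2% of the time is useful work
```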