Re: MLlib: Linear regression: Loss was due to java.lang.ArrayIndexOutOfBoundsException

2014-12-15 Thread Xiangrui Meng
Is it possible that the feature dimension changed after filtering?
This can happen if you use the LIBSVM format but don't specify the number
of features. -Xiangrui
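
For reference, here is a minimal PySpark sketch of pinning the dimension when
loading a filtered LIBSVM subset. The path is hypothetical, and the dimension
is only an assumption taken from the failing index 150323 in the trace below:

from pyspark.mllib.util import MLUtils

# Pin numFeatures so the subset's dimension matches the superset's;
# otherwise the dimension is inferred from the subset alone and can be
# smaller than the largest index the trained weight vector must cover.
subset = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/filtered_subset",
                                numFeatures=150324)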

On Tue, Dec 9, 2014 at 4:54 AM, Sameer Tilak  wrote:
> Hi All,
>
>
> I was able to run LinearRegressionWithSGD on a larger dataset (> 2 GB,
> sparse). I have now filtered the data and am running regression on a
> subset of it (~200 MB). I see this error, which is strange since it was
> running fine with the superset data. Is this a formatting issue (which I
> doubt) or some other issue in data preparation? I confirmed that there is
> no empty line in my dataset. Any help with this will be highly
> appreciated.
>
>
> 14/12/08 20:32:03 WARN TaskSetManager: Lost TID 5 (task 3.0:1)
> 14/12/08 20:32:03 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
> java.lang.ArrayIndexOutOfBoundsException: 150323
> at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:231)
> at breeze.linalg.operators.DenseVector_SparseVector_Ops$$anon$129.apply(SparseVectorOps.scala:216)
> at breeze.linalg.operators.BinaryRegistry$class.apply(BinaryOp.scala:60)
> at breeze.linalg.VectorOps$$anon$178.apply(Vector.scala:391)
> at breeze.linalg.NumericOps$class.dot(NumericOps.scala:83)
> at breeze.linalg.DenseVector.dot(DenseVector.scala:47)
> at org.apache.spark.mllib.optimization.LeastSquaresGradient.compute(Gradient.scala:125)
> at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:180)
> at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:179)
> at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
> at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
> at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
> at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
> at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
> at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
> at org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
> at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
> at org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> at org.apache.spark.scheduler.Task.run(Task.scala:51)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
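
A quick way to check Xiangrui's hypothesis is to compare the largest feature
index present in the filtered data against the dimension the weight vector
was built with. A hedged sketch, where "points" is a hypothetical name for
the filtered RDD of LabeledPoint with SparseVector features:

# If max_idx >= the training dimension, the dense-by-sparse dot product
# in the trace above throws ArrayIndexOutOfBoundsException.
max_idx = (points.filter(lambda p: len(p.features.indices) > 0)
                 .map(lambda p: int(p.features.indices.max()))
                 .max())
print max_idx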





Re: MLLib Linear regression

2014-10-08 Thread Xiangrui Meng
The proper step size partially depends on the Lipschitz constant of
the objective. You should let the machine try different combinations
of parameters and select the best. We are working with people from
AMPLab to make hyperparameter tuning easier in MLlib 1.2. For the
theory, Nesterov's book "Introductory Lectures on Convex Optimization"
is a good one.
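
As an illustration, a minimal sketch of such a manual search in PySpark
(parsedData is a hypothetical RDD of LabeledPoint; for brevity the MSE is
computed on the training data itself, though a held-out split is preferable
when there is enough data):

from pyspark.mllib.regression import LinearRegressionWithSGD

# Try each (step, iterations) combination and keep the one with the
# lowest mean squared error.
best = None
for step in [1.0, 0.1, 0.01, 0.001]:
    for iters in [100, 200, 400]:
        model = LinearRegressionWithSGD.train(parsedData,
                                              iterations=iters, step=step)
        mse = parsedData.map(
            lambda p: (p.label - model.predict(p.features)) ** 2).mean()
        if best is None or mse < best[0]:
            best = (mse, step, iters)
print best  # (MSE, step, iterations) of the best combination found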

We didn't use line search in the current implementation of
LinearRegression; we should definitely add that option in the
future.

Best,
Xiangrui

On Wed, Oct 8, 2014 at 7:21 AM, Sameer Tilak  wrote:
> Hi Xiangrui,
> Changing the default step size to 0.01 made a huge difference. The results
> make sense when I use A + B + C + D. MSE is ~0.07 and the outcome matches
> the domain knowledge.
>
> I was wondering whether there is any documentation on the parameters and
> when/how to vary them.
>
>> Date: Tue, 7 Oct 2014 15:11:39 -0700
>> Subject: Re: MLLib Linear regression
>> From: men...@gmail.com
>> To: ssti...@live.com
>> CC: user@spark.apache.org
>
>>
>> Did you test different regularization parameters and step sizes? In
>> the combination that works, I don't see "A + D". Did you test that
>> combination? Is there any linear dependency between A's columns and
>> D's columns? -Xiangrui
>>
>> On Tue, Oct 7, 2014 at 1:56 PM, Sameer Tilak  wrote:
>> > BTW, one detail:
>> >
>> > When the number of iterations is 100, all weights are zero or below and
>> > the indices are only from set A.
>> >
>> > When the number of iterations is 150, I see 30+ non-zero weights (when
>> > sorted by weight) and the indices are distributed across all sets.
>> > However, the MSE is high (5.xxx) and the result does not match the
>> > domain knowledge.
>> >
>> > When the number of iterations is 400, I see 30+ non-zero weights (when
>> > sorted by weight) and the indices are distributed across all sets.
>> > However, the MSE is high (6.xxx) and the result does not match the
>> > domain knowledge.
>> >
>> > Any help will be highly appreciated.
>> >
>> >
>> > 
>> > From: ssti...@live.com
>> > To: user@spark.apache.org
>> > Subject: MLLib Linear regression
>> > Date: Tue, 7 Oct 2014 13:41:03 -0700
>> >
>> >
>> > Hi All,
>> > I have the following classes of features:
>> >
>> > class A: 15000 features
>> > class B: 170 features
>> > class C: 900 features
>> > class D: 6000 features
>> >
>> > I use linear regression (over sparse data). I get excellent results with
>> > low RMSE (~0.06) for the following combinations of classes:
>> > 1. A + B + C
>> > 2. B + C + D
>> > 3. A + B
>> > 4. A + C
>> > 5. B + D
>> > 6. C + D
>> > 7. D
>> >
>> > Unfortunately, when I use A + B + C + D (all the features) I get results
>> > that don't make any sense -- all weights are zero or below, and the
>> > indices are only from set A. I also get a high MSE. I changed the number
>> > of iterations from 100 to 150, 250, and even 400. I still get an MSE of
>> > 5-6. Are there any other parameters I can play with? Any insight on what
>> > could be wrong? Is it somehow unable to scale up to 22K features? (I
>> > highly doubt that.)








Re: MLlib Linear Regression Mismatch

2014-10-01 Thread Krishna Sankar
Thanks Burak. Step size 0.01 worked for b) and step=0.0001 for c)!
Cheers
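
For reference: the squared-loss gradient is (w . x - y) x, so with features
around 4000 its magnitude scales like 4000^2 ~ 1.6e7 times the prediction
error, which is why case c) needs such a small step. A hedged sketch of c)
retrained with the reported step size (assumes sc, data, and array are in
scope as in the snippets quoted below):

lrm = LinearRegressionWithSGD.train(sc.parallelize(data), step=0.0001,
                                    initialWeights=array([1.0]))
print lrm.weights
print lrm.predict([4000])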


On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz  wrote:

> Hi,
>
> It appears that the step size is so high that the model diverges with
> the added noise.
> Could you try by setting the step size to be 0.1 or 0.01?
>
> Best,
> Burak
>
> - Original Message -
> From: "Krishna Sankar" 
> To: user@spark.apache.org
> Sent: Wednesday, October 1, 2014 12:43:20 PM
> Subject: MLlib Linear Regression Mismatch
>
> Guys,
> Obviously I am doing something wrong. Maybe 4 points are too small a
> dataset.
> Can you help me figure out why the following doesn't work?
> a) This works:
>
> data = [
>LabeledPoint(0.0, [0.0]),
>LabeledPoint(10.0, [10.0]),
>LabeledPoint(20.0, [20.0]),
>LabeledPoint(30.0, [30.0])
> ]
> lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
> initialWeights=array([1.0]))
> print lrm
> print lrm.weights
> print lrm.intercept
> lrm.predict([40])
>
> output:
> 
>
> [ 1.]
> 0.0
>
> 40.0
>
> b) By perturbing the y a little bit, the model gives wrong results:
>
> data = [
>LabeledPoint(0.0, [0.0]),
>LabeledPoint(9.0, [10.0]),
>LabeledPoint(22.0, [20.0]),
>LabeledPoint(32.0, [30.0])
> ]
> lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
> initialWeights=array([1.0])) # should be 1.09x -0.60
> print lrm
> print lrm.weights
> print lrm.intercept
> lrm.predict([40])
>
> Output:
> 
>
> [ -8.20487463e+203]
> 0.0
>
> -3.2819498532740317e+205
>
> c) Same story here - wrong results. Actually nan:
>
> data = [
>LabeledPoint(18.9, [3910.0]),
>LabeledPoint(17.0, [3860.0]),
>LabeledPoint(20.0, [4200.0]),
>LabeledPoint(16.6, [3660.0])
> ]
> lrm = LinearRegressionWithSGD.train(sc.parallelize(data),
> initialWeights=array([1.0])) # should be ~ 0.006582x -7.595170
> print lrm
> print lrm.weights
> print lrm.intercept
> lrm.predict([4000])
>
> Output: <pyspark.mllib.regression.LinearRegressionModel object at 0x109666b90>
>
> [ nan]
> 0.0
>
> nan
>
> Cheers & Thanks

