Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Maheshakya Wijewardena
Hi,

Suppose there is data frame called goods with columns "barcode" and
"items". Some of the values in the column "items" can be null.

I want to get the barcode and the respective items from the table, adhering
to the following rules:

   - If "items" is null -> output 0
   - else -> output "items" (the actual value in the column)

I would write a query like:

*SELECT barcode, IF(items is null, 0, items) FROM goods*

But this query fails with the error:

*unresolved operator 'Project [if (IS NULL items#1) 0 else items#1 AS
c0#132]; *

It seems I can only use numerical values inside this IF statement, but when
a column name is used, it fails.

Is there any workaround to do this?
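A couple of equivalent formulations may be worth trying (untested here against 1.4.1): `COALESCE`, or an explicit `CASE WHEN`. The rule itself is simple null replacement, shown below in plain Python for illustration with made-up barcodes:

```python
# Candidate Spark SQL rewrites of the same rule (assumptions, not verified
# against Spark 1.4.1):
#   SELECT barcode, COALESCE(items, 0) FROM goods
#   SELECT barcode, CASE WHEN items IS NULL THEN 0 ELSE items END FROM goods

# The rule applied row by row in plain Python; the rows are made-up samples.
rows = [("8901", 4), ("8902", None), ("8903", 0)]
cleaned = [(barcode, 0 if items is None else items) for barcode, items in rows]
print(cleaned)  # the None item is replaced with 0, real values pass through
```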

Best regards.
-- 
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 : http://wso2.com/
Email: mahesha...@wso2.com
Mobile: +94711228855


Re: Using a variable (a column name) in an IF statement in Spark SQL

2015-10-08 Thread Maheshakya Wijewardena
Spark version: 1.4.1
The schema is "barcode STRING, items INT"

On Thu, Oct 8, 2015 at 10:48 PM, Michael Armbrust <mich...@databricks.com>
wrote:

> Hmm, that looks like it should work to me.  What version of Spark?  What
> is the schema of goods?


-- 
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 : http://wso2.com/
Email: mahesha...@wso2.com
Mobile: +94711228855


Re: Model weights of linear regression becomes abnormal values

2015-05-27 Thread Maheshakya Wijewardena
Thanks for the information. I'll try that out with Spark 1.4.

On Thu, May 28, 2015 at 9:54 AM, DB Tsai dbt...@dbtsai.com wrote:

 LinearRegressionWithSGD requires tuning the step size and number of
 iterations very carefully. Please try the linear regression with
 elastic-net implementation in the Spark 1.4 ML framework, which uses a
 quasi-Newton method and determines the step size automatically. That
 implementation also matches the results from R.

 Sincerely,

 DB Tsai
 ---
 Blog: https://www.dbtsai.com
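
The step-size sensitivity mentioned above shows up even in a minimal pure-Python gradient-descent sketch (illustrative only; not Spark code, and the data is made up):

```python
# Fit y = w*x by least-squares gradient descent; only the step size varies.
def fit_weight(step, iters=50):
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [2.0, 4.0, 6.0, 8.0]          # exact fit at w = 2
    w = 0.0
    for _ in range(iters):
        # gradient of the mean squared error with respect to w
        grad = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= step * grad
    return w

print(fit_weight(0.01))  # small step: converges close to 2.0
print(fit_weight(1.0))   # large step: overshoots every iteration and diverges
```

With a divergent step size the weight magnitude grows without bound, which is typically how `nan` values appear once floating-point overflow sets in.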


 On Wed, May 27, 2015 at 9:08 PM, Maheshakya Wijewardena
 mahesha...@wso2.com wrote:
 
  Hi,
 
  I'm trying to use Spark's LinearRegressionWithSGD in PySpark with the
  attached dataset. The code is attached. When I check the model weights
  vector after training, it contains `nan` values.
 
  [nan,nan,nan,nan,nan,nan,nan,nan]
 
  But for some data sets, this problem does not occur. What might be the
  reason for this?
  Is this an issue with the data I'm using or a bug?
 
  Best regards.
 
  --
  Pruthuvi Maheshakya Wijewardena
  Software Engineer
  WSO2 Lanka (Pvt) Ltd
  Email: mahesha...@wso2.com
  Mobile: +94711228855
 
 
 
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org




-- 
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 Lanka (Pvt) Ltd
Email: mahesha...@wso2.com
Mobile: +94711228855


Fwd: Model weights of linear regression becomes abnormal values

2015-05-27 Thread Maheshakya Wijewardena
Hi,

I'm trying to use Spark's *LinearRegressionWithSGD* in PySpark with the
attached dataset. The code is attached. When I check the model weights
vector after training, it contains `nan` values.

[nan,nan,nan,nan,nan,nan,nan,nan]

But for some data sets, this problem does not occur. What might be the
reason for this?
Is this an issue with the data I'm using or a bug?

Best regards.

-- 
Pruthuvi Maheshakya Wijewardena
Software Engineer
WSO2 Lanka (Pvt) Ltd
Email: mahesha...@wso2.com
Mobile: +94711228855
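
One common cause of nan weights with SGD is features on very different scales (the attached data mixes single digits with values in the thousands), which makes a fixed step size far too large for some coordinates. A minimal standardization sketch in plain Python (illustrative only; in MLlib, pyspark.mllib.feature.StandardScaler plays this role), using the first five values of the second column of the attached dataset:

```python
# Standardize one feature column to zero mean and unit variance.
def standardize(column):
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = var ** 0.5 or 1.0            # guard against a constant column
    return [(v - mean) / std for v in column]

# First five values of the second feature column from the attached dataset.
scaled = standardize([148.0, 85.0, 183.0, 89.0, 137.0])
print(scaled)  # zero mean, unit variance
```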
6,148,72,35,0,336,627,50,1
1,85,66,29,0,266,351,31,0
8,183,64,0,0,233,672,32,1
1,89,66,23,94,281,167,21,0
0,137,40,35,168,431,2288,33,1
5,116,74,0,0,256,201,30,0
3,78,50,32,88,310,248,26,1
10,115,0,0,0,353,134,29,0
2,197,70,45,543,305,158,53,1
8,125,96,0,0,0,232,54,1
4,110,92,0,0,376,191,30,0
10,168,74,0,0,380,537,34,1
10,139,80,0,0,271,1441,57,0
1,189,60,23,846,301,398,59,1
5,166,72,19,175,258,587,51,1
7,100,0,0,0,300,484,32,1
0,118,84,47,230,458,551,31,1
7,107,74,0,0,296,254,31,1
1,103,30,38,83,433,183,33,0
1,115,70,30,96,346,529,32,1
3,126,88,41,235,393,704,27,0
import sys
from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD
from numpy import array

# Load and parse data
def parse_point(line):
values = [float(x) for x in line.split(',')]
return LabeledPoint(values[0], values[1:])

sc = SparkContext(appName='LinearRegression')
# Add path to your dataset.
data = sc.textFile('dummy_data_sest.csv')
parsedData = data.map(parse_point)

# Build the model
model = LinearRegressionWithSGD.train(parsedData)

# Check model weight vector
print(model.weights)