Hi,
I modified the example code for logistic regression to compute the
classification error; please see below. However, the code fails when it makes a
call to:
labelsAndPreds.filter(lambda (v, p): v != p).count()
with the following error (something related to NumPy or the dot product):
File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/classification.py", line 65, in predict
    margin = _dot(x, self._coeff) + self._intercept
File "/opt/spark-1.0.0-bin-hadoop2/python/pyspark/mllib/_common.py", line 443, in _dot
    return vec.dot(target)
AttributeError: 'numpy.ndarray' object has no attribute 'dot'
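For what it's worth, a plain NumPy array is supposed to expose .dot on any
recent NumPy build, so this may be version-related. A quick standalone check
(nothing Spark-specific, just a made-up two-element vector):

```python
import numpy as np

# Sanity check: does this NumPy build expose ndarray.dot?
v = np.array([1.0, 2.0])
w = np.array([3.0, 4.0])
print(hasattr(v, "dot"))  # expect True on any reasonably recent NumPy
print(v.dot(w))           # 1*3 + 2*4 = 11.0
```

If this prints False on the worker nodes, the cluster's Python may be picking
up an older NumPy than the driver.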
FYI, I am running the code with spark-submit, i.e.
./bin/spark-submit examples/src/main/python/mllib/logistic_regression2.py
The full script is posted below in case it is useful:
import time

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint
# Load and parse the data
def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    if values[0] == -1:  # Convert -1 labels to 0 for MLlib
        values[0] = 0
    return LabeledPoint(values[0], values[1:])
sc = SparkContext(appName="PythonLR")
# start timing
start = time.time()
#start = time.clock()
data = sc.textFile("sWAMSpark_train.csv")
parsedData = data.map(parsePoint)
# Build the model
model = LogisticRegressionWithSGD.train(parsedData)
#load test data
testdata = sc.textFile("sWSpark_test.csv")
parsedTestData = testdata.map(parsePoint)
# Evaluate the model on the test data
labelsAndPreds = parsedTestData.map(lambda p: (p.label,
model.predict(p.features)))
testErr = labelsAndPreds.filter(lambda (v, p): v != p).count() /
float(parsedTestData.count())
print("Test Error = " + str(testErr))
end = time.time()
print("Elapsed time = " + str(end - start))
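In case it helps narrow things down, here is the parsing logic from parsePoint
isolated from Spark (with a made-up sample line), so it can be sanity-checked
locally without pyspark:

```python
# Same label-conversion logic as parsePoint, minus the LabeledPoint wrapper.
def parse_values(line):
    values = [float(x) for x in line.split(',')]
    if values[0] == -1:  # convert -1 labels to 0
        values[0] = 0
    return values[0], values[1:]

print(parse_values("-1,0.5,2.0"))  # label -1 becomes 0
print(parse_values("1,0.5,2.0"))   # label 1 is left alone
```

This part behaves as expected for me, so the problem seems to be inside
predict rather than in the parsing.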