How do I specify the data types in a DF

2015-07-30 Thread afarahat
Hello;
I have a simple file of mobile IDFAs. They look like:
gregconv = ['00013FEE-7561-47F3-95BC-CA18D20BCF78',
'000D9B97-2B54-4B80-AAA1-C1CB42CFBF3A',
'000F9E1F-BC7E-47E1-BF68-C68F6D987B96']
I am trying to turn this RDD into a DataFrame:

ConvRecord = Row("IDFA")

gregconvdf = gregconv.map(lambda x: ConvRecord(*x)).toDF()
I get the following error:

Traceback (most recent call last):
  File "", line 1, in 
  File "/homes/afarahat/aofspark/share/spark/python/pyspark/sql/context.py", line 60, in toDF
    return sqlContext.createDataFrame(self, schema, sampleRatio)
  File "/homes/afarahat/aofspark/share/spark/python/pyspark/sql/context.py", line 351, in createDataFrame
    _verify_type(row, schema)
  File "/homes/afarahat/aofspark/share/spark/python/pyspark/sql/types.py", line 1027, in _verify_type
    "length of fields (%d)" % (len(obj), len(dataType.fields)))
ValueError: Length of object (36) does not match with length of fields (1)
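
The error comes from ConvRecord(*x): unpacking a 36-character IDFA string
spreads it into 36 positional fields, while the Row schema declares only
one. A minimal fix, sketched assuming gregconv is an RDD of IDFA strings
and sqlContext already exists:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, StringType

ConvRecord = Row("IDFA")

# pass each string as ONE field instead of unpacking its 36 characters
gregconvdf = gregconv.map(lambda x: ConvRecord(x)).toDF()

# or skip Row and hand createDataFrame an explicit schema with the types
schema = StructType([StructField("IDFA", StringType(), True)])
gregconvdf = sqlContext.createDataFrame(gregconv.map(lambda x: (x,)), schema)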



How can the RegressionMetrics produce negative R2 and explained variance?

2015-07-12 Thread afarahat
Hello;
I am using the MLlib ALS recommender. To select the optimal rank, I use a
set of users who interacted with multiple items as my test set. I get the
predictions for these users, compare them to the observed ratings, and use
RegressionMetrics to estimate R^2.
I keep getting a negative value:
r2 = -1.18966999676, explained var = -1.18955347415, count = 11620309
Here is my PySpark code:

import math
from pyspark.ml.recommendation import ALS
from pyspark.mllib.evaluation import RegressionMetrics

train1.cache()
test1.cache()

numIterations = 10
for i in range(10):
    rank = int(40 + i * 10)
    als = ALS(rank=rank, maxIter=numIterations, implicitPrefs=False)
    model = als.fit(train1)
    # drop NaN predictions (users/items unseen at training time)
    predobs = (model.transform(test1)
                    .select("prediction", "rating")
                    .map(lambda p: (p.prediction, p.rating))
                    .filter(lambda p: not math.isnan(p[0])))
    metrics = RegressionMetrics(predobs)
    mycount = predobs.count()
    myr2 = metrics.r2
    myvar = metrics.explainedVariance
    print "hooo", rank, " r2 = ", myr2, " explained var = ", myvar, " count = ", mycount




ALS :how to set numUserBlocks and numItemBlocks

2015-06-25 Thread afarahat
Any guidance on how to set these two parameters?
I have far more users (hundreds of millions) than items.
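
For context, here is how the two knobs appear in the ml-pipeline ALS; the
values below are purely illustrative, not recommendations:

from pyspark.ml.recommendation import ALS

als = ALS(rank=20,
          maxIter=10,
          numUserBlocks=200,  # user side is huge, so many more blocks
          numItemBlocks=10)   # item side is small, few blocks suffice
model = als.fit(ratings)      # `ratings` is an assumed training DataFrame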
Thanks
Ayman



HiveContext/Spark much slower than Hive

2015-06-24 Thread afarahat
I have a simple HQL query (below). In Hive it takes maybe 10 minutes to
complete; with Spark it seems to take forever. The table is partitioned by
"datestamp". I am using Spark 1.3.1.
How can I tune/optimize this?

Here is the query:

tumblruser = hiveCtx.sql("""
    select s_mobile_id, receive_time
    from mx3.post_tp_annotated_mb_impr
    where ad_id = 30590918987 and datestamp >= '20150623'
""")


Thanks
Ayman




How to get the ALS reconstruction error

2015-06-20 Thread afarahat
Hello;
I am fitting ALS models and would like to get an initial idea of the number
of factors. I want to use the reconstruction error on the training data as
a measure. Does the API expose the reconstruction error?
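
In case it is not exposed, one workaround I am considering is to compute
the training RMSE by hand; a minimal sketch, assuming `model` comes from
ALS.train and `ratings` is the training RDD of (user, item, rating) tuples:

import math

pairs = ratings.map(lambda r: (r[0], r[1]))
preds = model.predictAll(pairs).map(lambda r: ((r[0], r[1]), r[2]))
joined = ratings.map(lambda r: ((r[0], r[1]), r[2])).join(preds)
mse = joined.map(lambda kv: (kv[1][0] - kv[1][1]) ** 2).mean()
print "train RMSE =", math.sqrt(mse)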
Thanks
Ayman



Un-persist RDD in a loop

2015-06-19 Thread afarahat
Hello;
I am trying to find the optimal number of factors in ALS. To that end, I am
scanning various values and evaluating the MSE. Do I need to un-persist the
RDDs between loop iterations, or will the resources (memory) get freed and
re-assigned automatically between iterations?

for i in range(5):
    rank = 5 + int(i)
    # imodel = ALS.trainImplicit(smallratings, rank, numIterations)
    imodel = ALS.train(smallratings, rank, numIterations)
    predictions = imodel.predictAll(testdata) \
                        .map(lambda r: ((r[0], r[1]), r[2]))
    ratesAndPreds = smallratings.map(lambda r: ((r[0], r[1]), r[2])) \
                                .join(predictions)
    MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2) \
                       .reduce(lambda x, y: x + y) / ratesAndPreds.count()
    print "ho ho ", rank, " ", MSE
    predictions.unpersist()
    ratesAndPreds.unpersist()
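
For what it is worth, unpersist only frees storage for RDDs that were
actually persisted; calling it on an unpersisted RDD is harmless but does
nothing. A minimal sketch of an explicit pairing, under the same names:

ratesAndPreds = smallratings.map(lambda r: ((r[0], r[1]), r[2])) \
                            .join(predictions) \
                            .persist()      # reused by two actions below
sse = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1]) ** 2) \
                   .reduce(lambda x, y: x + y)
MSE = sse / ratesAndPreds.count()
ratesAndPreds.unpersist()                   # done with it, free the memory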



Matrix Multiplication and mllib.recommendation

2015-06-17 Thread afarahat
Hello;
I am trying to get predictions after running the ALS model.
The model trains fine. In the prediction/recommendation step, I have about
30,000 products and 90 million users.
When I try predictAll, it fails.
I have been trying to reformulate the problem as a matrix multiplication,
where I first get the product features, broadcast them, and then do a dot
product. It is still very slow. Any idea why?
Here is a sample of the code:

import numpy

def doMultiply(x):
    # score one user's factor vector against every broadcast product vector
    a = []
    mylen = len(pf.value)
    for i in range(mylen):
        myprod = numpy.dot(x, pf.value[i][1])
        a.append(myprod)
    return a

myModel = MatrixFactorizationModel.load(sc, "FlurryModelPath")
# I need to select which products to broadcast, but let's try a sample
m1 = myModel.productFeatures().sample(False, 0.001)
pf = sc.broadcast(m1.collect())
uf = myModel.userFeatures()
f1 = uf.map(lambda x: (x[0], doMultiply(x[1])))
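
A likely cause of the slowness is the per-product Python loop above.
Stacking the broadcast factors into a single numpy matrix turns each user
into one BLAS-backed matrix-vector product; a hedged sketch (pmat and pfb
are new, illustrative names):

prods = myModel.productFeatures().sample(False, 0.001).collect()
pmat = numpy.array([feats for (pid, feats) in prods])  # (n_products, rank)
pfb = sc.broadcast(pmat)

f1 = myModel.userFeatures().map(
    lambda x: (x[0], pfb.value.dot(numpy.array(x[1]))))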



Pyspark Dense Matrix Multiply: One of them can fit in Memory

2015-06-16 Thread afarahat
Hello
I would like to multiply two matrices:

C = A * B

where A is m x k and B is k x l, with k, l << m, so that B can easily fit
in memory.
Any ideas or suggestions on how to do that in PySpark?
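
One obvious approach, sketched under the assumption that A is an RDD of
(row_index, numpy row vector) pairs and B is a local numpy array of shape
(k, l):

import numpy

B_b = sc.broadcast(B)

# each row of C is the matching row of A times the broadcast B
C = A.map(lambda row: (row[0], numpy.dot(row[1], B_b.value)))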
Thanks
Ayman



ALS predictAll not completing

2015-06-15 Thread afarahat
Hello;
I have a data set of about 80 million users and 12,000 items (very sparse).
I can get the training part working, no problem (the model has 20 factors).
However, when I try predictAll for 80 million users x 10 items, the job
does not complete.
When I use a smaller data set, say 500k or a million, it completes.
Any ideas or suggestions?
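
One thing I may try is scoring the users in hash buckets instead of one
giant predictAll; a hedged sketch, assuming `model` is the trained
MatrixFactorizationModel, `users` is an RDD of user ids, and `items` holds
the ~10 item ids (the 10-bucket split is illustrative):

items_rdd = sc.parallelize(items)
for b in range(10):
    # b=b pins the loop variable inside the lambda
    pairs = users.filter(lambda u, b=b: u % 10 == b).cartesian(items_rdd)
    preds = model.predictAll(pairs)
    preds.saveAsTextFile("predictions/bucket_%d" % b)  # hypothetical path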
Thanks
Ayman


