TensorFlow

2016-12-12 Thread Meeraj Kunnumpurath
Hello,

Is there anything available in Spark similar to TensorFlow? I am looking
for a mechanism for performing nearest neighbour search on vectorized
image data.
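
(For reference: Spark's ML package gained approximate nearest neighbour
search via locality-sensitive hashing around the 2.1 release. A minimal
sketch, assuming a data frame df whose "features" column holds the image
vectors; the column name and the query vector are made up.)

import org.apache.spark.ml.feature.BucketedRandomProjectionLSH
import org.apache.spark.ml.linalg.Vectors

// Hash the vectors so that nearby vectors tend to land in the same bucket.
val lsh = new BucketedRandomProjectionLSH()
  .setBucketLength(2.0)
  .setNumHashTables(3)
  .setInputCol("features")
  .setOutputCol("hashes")

val model = lsh.fit(df)

// Approximate 10 nearest neighbours of a query vector (dimensions must
// match the feature vectors in df).
val query = Vectors.dense(0.2, 0.7, 0.1)
model.approxNearestNeighbors(df, query, 10).show()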

Regards

-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Logistic regression using gradient ascent

2016-11-30 Thread Meeraj Kunnumpurath
Hello,

Out of curiosity, I have been trying to implement logistic regression
using gradient ascent. I am using the Spark ML feature extraction packages
and data frames, not any of the implemented algorithms. I would be
grateful if any of you could cast an eye over it and provide some feedback.

https://github.com/kunnum/sandbox/blob/master/classification/src/main/scala/com/ss/ml/classification/lr/LRWithGradientAscent.scala
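
(For readers who want the algorithm without the Spark plumbing, here is a
minimal self-contained sketch of gradient ascent on the log-likelihood in
plain Scala; all names and hyper-parameters are illustrative only.)

// w := w + alpha * X^T (y - sigmoid(Xw)), repeated for a fixed number of steps.
def sigmoid(z: Double) = 1.0 / (1.0 + math.exp(-z))

def fit(xs: Array[Array[Double]], ys: Array[Double],
        alpha: Double = 0.1, steps: Int = 1000): Array[Double] = {
  val w = Array.fill(xs.head.length)(0.0)
  for (_ <- 1 to steps) {
    // residual y - p(y = 1 | x) for each example
    val errs = xs.zip(ys).map { case (x, y) =>
      y - sigmoid(x.zip(w).map { case (a, b) => a * b }.sum)
    }
    // move each coefficient up the gradient of the log-likelihood
    for (j <- w.indices)
      w(j) += alpha * xs.zip(errs).map { case (x, e) => x(j) * e }.sum
  }
  w
}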

Regards

-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Re: UDF for gradient ascent

2016-11-26 Thread Meeraj Kunnumpurath
One thing I noticed inside the UDF is that the original column names from
the data frame have disappeared and the columns are now called col1, col2,
etc.
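
(A sketch of one workaround, assuming the names are being dropped when
struct() wraps the columns: alias each field explicitly so the Row seen
inside the UDF keeps the original names. pr is the UDF from the quoted
code below.)

import org.apache.spark.sql.functions.struct

// Name every field of the struct after its source column.
val allCols = struct(df.columns.map(c => df(c).as(c)): _*)
df.withColumn("probability", pr(allCols))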

Regards
Meeraj

On Sat, Nov 26, 2016 at 7:31 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> Hello,
>
> I have a dataset of features on which I want to compute the likelihood
> value for implementing gradient ascent for estimating coefficients. I have
> written a UDF that computes the probability function on each feature, as
> shown below.
>
> def getLikelihood(cfs : List[(String, Double)], df: DataFrame) = {
>   val pr = udf((r: Row) => {
>     cfs.foldLeft(0.0)((x, y) => x * 1 / Math.pow(Math.E, r.getAs[Double](y._1) * y._2))
>   })
>   df.withColumn("probability", pr(struct(df.columns.map(df(_)) : _*)))
>     .agg(sum('probability)).first.get(0)
> }
>
> When I run it I get a long exception trace listing some generated code, as
> shown below.
>
> org.codehaus.commons.compiler.CompileException: File 'generated.java',
> Line 2445, Column 34: Expression "scan_isNull1" is not an rvalue
> at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174)
> at org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(
> UnitCompiler.java:6036)
> at org.codehaus.janino.UnitCompiler.getConstantValue2(
> UnitCompiler.java:4440)
> at org.codehaus.janino.UnitCompiler.access$9900(UnitCompiler.java:185)
> at org.codehaus.janino.UnitCompiler$11.visitAmbiguousName(
> UnitCompiler.java:4417)
>
> This is line 2445 in the generated code,
>
> /* 2445 */ Object project_arg = scan_isNull1 ? null :
> project_converter2.apply(scan_value1);
>
> Many thanks
>
>
>
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
>
> *00 971 50 409 0169*
> *mee...@servicesymphony.com <mee...@servicesymphony.com>*
>



-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


UDF for gradient ascent

2016-11-26 Thread Meeraj Kunnumpurath
Hello,

I have a dataset of features on which I want to compute the likelihood
value for implementing gradient ascent for estimating coefficients. I have
written a UDF that computes the probability function on each feature, as
shown below.

def getLikelihood(cfs : List[(String, Double)], df: DataFrame) = {
  val pr = udf((r: Row) => {
    cfs.foldLeft(0.0)((x, y) => x * 1 / Math.pow(Math.E, r.getAs[Double](y._1) * y._2))
  })
  df.withColumn("probability", pr(struct(df.columns.map(df(_)) : _*)))
    .agg(sum('probability)).first.get(0)
}

When I run it I get a long exception trace listing some generated code, as
shown below.

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line
2445, Column 34: Expression "scan_isNull1" is not an rvalue
at org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:10174)
at
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:6036)
at
org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:4440)
at org.codehaus.janino.UnitCompiler.access$9900(UnitCompiler.java:185)
at
org.codehaus.janino.UnitCompiler$11.visitAmbiguousName(UnitCompiler.java:4417)

This is line 2445 in the generated code,

/* 2445 */ Object project_arg = scan_isNull1 ? null :
project_converter2.apply(scan_value1);

Many thanks
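
(Two hedged observations, untested against the same data: first, the fold
is seeded with 0.0, so the running product is always zero; a seed of 1.0
would give the product of the e^(-x*w) terms, i.e. e^(-sum of x*w).
Second, the same quantity can be built from Column expressions, which
avoids the Row-based UDF and the generated-code path that fails above.
A sketch:)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{exp, sum}

// Build -(x . w) as a Column expression and exponentiate; no struct/Row UDF.
def getLikelihood(cfs: List[(String, Double)], df: DataFrame) = {
  val linear = cfs.map { case (c, w) => df(c) * w }.reduce(_ + _)
  df.withColumn("probability", exp(-linear))
    .agg(sum("probability")).first.get(0)
}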



-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Re: Logistic Regression Match Error

2016-11-19 Thread Meeraj Kunnumpurath
Thank you, it was the escape character: option("escape", "\"").
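
(For anyone landing here with the same error, the working read looks
roughly like this, assuming the data/amazon_baby.csv layout from the
thread, where a literal quote inside a quoted field is doubled:)

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")   // fields containing commas are quoted
  .option("escape", "\"")  // "" inside a quoted field is a literal quote
  .csv("data/amazon_baby.csv")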

Regards

On Sat, Nov 19, 2016 at 11:10 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> I tried .option("quote", "\""), which I believe is the default; still the
> same error. This is the offending record.
>
> Primo 4-In-1 Soft Seat Toilet Trainer and Step Stool White with Pastel
> Blue Seat,"I chose this potty for my son because of the good reviews. I do
> not like it. I'm honestly baffled by all the great reviews now that I have
> this thing in front of me.1)It is made of cheap material, feels flimsy, the
> grips on the bottom of the thing do nothing to keep it in place when the
> child sits on it.2)It comes apart into 5 or 6 different pieces and all my
> son likes to do is take it apart. I did not want a potty that would turn
> into a toy, and this has just become like a puzzle for him, with all the
> different pieces.3)It is a little big for him. He is young still but he's a
> big boy for his age. I looked at one of the pictures posted and he looks
> about the same size as the curly haired kid reading the book, but the potty
> in that picture is NOT this potty! This one is a little bigger and he can't
> get quite touch his feet on the ground, which is important.4)And one final
> thing, maybe most importantly, the ""soft"" seat is not so soft. Doesn't
> seem very comfortable to me. It's just plastic on top of plastic... and
> after my son sits on it for just a few minutes his butt has horrible red
> marks all over it! Definitely not comfortable.So, overall, i'm not
> impressed at all.I gave it 2 stars because... it gets the job done I
> suppose, and for a child a little bit older than my son it might fit a
> little better. Also I really liked the idea that it was 4-in-1.Overall
> though, I do not suggest getting this potty. Look elseware!It's probably
> best to actually go to a store and look at them first hand, and not order
> online. That's what I should have done in the first place.",2
>
> On Sat, Nov 19, 2016 at 10:59 PM, Meeraj Kunnumpurath <
> mee...@servicesymphony.com> wrote:
>
>> Digging through, it looks like an issue with reading the CSV. Some of the
>> data have embedded commas in them, and these fields are rightly quoted.
>> However, the CSV reader seems to be getting into a pickle when the records
>> contain a mix of quoted and unquoted fields. Fields are only quoted when
>> there are commas within them; otherwise they are unquoted.
>>
>> Regards
>> Meeraj
>>
>> On Sat, Nov 19, 2016 at 10:10 PM, Meeraj Kunnumpurath <
>> mee...@servicesymphony.com> wrote:
>>
>>> Hello,
>>>
>>> I have the following code that trains a mapping of review text to
>>> ratings. I use a tokenizer to split the reviews into words, and a count
>>> vectorizer to build the feature vectors. However, when I train the
>>> classifier I get a match error. Any pointers will be very helpful.
>>>
>>> The code is below,
>>>
>>> val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()
>>> import spark.implicits._
>>>
>>> val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
>>> val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
>>> val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")
>>>
>>> val isGood = udf((x: Int) => if (x >= 4) 1 else 0)
>>>
>>> val words = tk.transform(df.withColumn("label", isGood('rating)))
>>> val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)
>>>
>>> val classifier = new LogisticRegression()
>>>
>>> training.show(10)
>>>
>>> val simpleModel = classifier.fit(training)
>>> simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)
>>>
>>>
>>> And the error I get is below.
>>>
>>> 16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0
>>> (TID 9)
>>> scala.MatchError: [null,1.0,(257358,[0,1,2,3,4,5
>>> ,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,58,
>>> 68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,169,2
>>> 08,219,221,235,249,255,260,353,355,371,431,442,641,711,972,
>>> 1065,1411,1663,1

Re: Logistic Regression Match Error

2016-11-19 Thread Meeraj Kunnumpurath
I tried .option("quote", "\""), which I believe is the default; still the
same error. This is the offending record.

Primo 4-In-1 Soft Seat Toilet Trainer and Step Stool White with Pastel Blue
Seat,"I chose this potty for my son because of the good reviews. I do not
like it. I'm honestly baffled by all the great reviews now that I have this
thing in front of me.1)It is made of cheap material, feels flimsy, the
grips on the bottom of the thing do nothing to keep it in place when the
child sits on it.2)It comes apart into 5 or 6 different pieces and all my
son likes to do is take it apart. I did not want a potty that would turn
into a toy, and this has just become like a puzzle for him, with all the
different pieces.3)It is a little big for him. He is young still but he's a
big boy for his age. I looked at one of the pictures posted and he looks
about the same size as the curly haired kid reading the book, but the potty
in that picture is NOT this potty! This one is a little bigger and he can't
get quite touch his feet on the ground, which is important.4)And one final
thing, maybe most importantly, the ""soft"" seat is not so soft. Doesn't
seem very comfortable to me. It's just plastic on top of plastic... and
after my son sits on it for just a few minutes his butt has horrible red
marks all over it! Definitely not comfortable.So, overall, i'm not
impressed at all.I gave it 2 stars because... it gets the job done I
suppose, and for a child a little bit older than my son it might fit a
little better. Also I really liked the idea that it was 4-in-1.Overall
though, I do not suggest getting this potty. Look elseware!It's probably
best to actually go to a store and look at them first hand, and not order
online. That's what I should have done in the first place.",2

On Sat, Nov 19, 2016 at 10:59 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> Digging through, it looks like an issue with reading the CSV. Some of the
> data have embedded commas in them, and these fields are rightly quoted.
> However, the CSV reader seems to be getting into a pickle when the records
> contain a mix of quoted and unquoted fields. Fields are only quoted when
> there are commas within them; otherwise they are unquoted.
>
> Regards
> Meeraj
>
> On Sat, Nov 19, 2016 at 10:10 PM, Meeraj Kunnumpurath <
> mee...@servicesymphony.com> wrote:
>
>> Hello,
>>
>> I have the following code that trains a mapping of review text to
>> ratings. I use a tokenizer to split the reviews into words, and a count
>> vectorizer to build the feature vectors. However, when I train the
>> classifier I get a match error. Any pointers will be very helpful.
>>
>> The code is below,
>>
>> val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()
>> import spark.implicits._
>>
>> val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
>> val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
>> val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")
>>
>> val isGood = udf((x: Int) => if (x >= 4) 1 else 0)
>>
>> val words = tk.transform(df.withColumn("label", isGood('rating)))
>> val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)
>>
>> val classifier = new LogisticRegression()
>>
>> training.show(10)
>>
>> val simpleModel = classifier.fit(training)
>> simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)
>>
>>
>> And the error I get is below.
>>
>> 16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID
>> 9)
>> scala.MatchError: [null,1.0,(257358,[0,1,2,3,4,5
>> ,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,58,
>> 68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,169,
>> 208,219,221,235,249,255,260,353,355,371,431,442,641,711,
>> 972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,6288,
>> 7294,8951,9758,12203,18319,21779,48525,72732,75420,146476
>> ,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,2.0,1.
>> 0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,1.0,1.
>> 0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.
>> 0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.
>> 0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.
>> 0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class
>> org.apache.spark.sql.catalyst.expre

Re: Logistic Regression Match Error

2016-11-19 Thread Meeraj Kunnumpurath
Digging through, it looks like an issue with reading the CSV. Some of the
data have embedded commas in them, and these fields are rightly quoted.
However, the CSV reader seems to be getting into a pickle when the records
contain a mix of quoted and unquoted fields. Fields are only quoted when
there are commas within them; otherwise they are unquoted.

Regards
Meeraj

On Sat, Nov 19, 2016 at 10:10 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> Hello,
>
> I have the following code that trains a mapping of review text to
> ratings. I use a tokenizer to split the reviews into words, and a count
> vectorizer to build the feature vectors. However, when I train the
> classifier I get a match error. Any pointers will be very helpful.
>
> The code is below,
>
> val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()
> import spark.implicits._
>
> val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
> val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
> val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")
>
> val isGood = udf((x: Int) => if (x >= 4) 1 else 0)
>
> val words = tk.transform(df.withColumn("label", isGood('rating)))
> val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)
>
> val classifier = new LogisticRegression()
>
> training.show(10)
>
> val simpleModel = classifier.fit(training)
> simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)
>
>
> And the error I get is below.
>
> 16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID
> 9)
> scala.MatchError: [null,1.0,(257358,[0,1,2,3,4,
> 5,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,
> 58,68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,
> 169,208,219,221,235,249,255,260,353,355,371,431,442,641,
> 711,972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,
> 6288,7294,8951,9758,12203,18319,21779,48525,72732,75420,
> 146476,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,
> 2.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,
> 1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,
> 1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
> 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
> 1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])] (of class
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
> at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.
> apply(LogisticRegression.scala:266)
> at org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.
> apply(LogisticRegression.scala:266)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(
> MemoryStore.scala:214)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(
> BlockManager.scala:919)
> at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(
> BlockManager.scala:910)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
> at org.apache.spark.storage.BlockManager.doPutIterator(
> BlockManager.scala:910)
> at org.apache.spark.storage.BlockManager.getOrElseUpdate(
> BlockManager.scala:668)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)
>
> Many thanks
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
>
> *00 971 50 409 0169*
> *mee...@servicesymphony.com <mee...@servicesymphony.com>*
>



-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Logistic Regression Match Error

2016-11-19 Thread Meeraj Kunnumpurath
Hello,

I have the following code that trains a mapping of review text to ratings.
I use a tokenizer to split the reviews into words, and a count vectorizer
to build the feature vectors. However, when I train the classifier I get a
match error. Any pointers will be very helpful.

The code is below,

val spark = SparkSession.builder().appName("Logistic Regression").master("local").getOrCreate()
import spark.implicits._

val df = spark.read.option("header", "true").option("inferSchema", "true").csv("data/amazon_baby.csv")
val tk = new Tokenizer().setInputCol("review").setOutputCol("words")
val cv = new CountVectorizer().setInputCol("words").setOutputCol("features")

val isGood = udf((x: Int) => if (x >= 4) 1 else 0)

val words = tk.transform(df.withColumn("label", isGood('rating)))
val Array(training, test) = cv.fit(words).transform(words).randomSplit(Array(0.8, 0.2), 1)

val classifier = new LogisticRegression()

training.show(10)

val simpleModel = classifier.fit(training)
simpleModel.evaluate(test).predictions.select("words", "label", "prediction", "probability").show(10)


And the error I get is below.

16/11/19 22:06:45 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 9)
scala.MatchError:
[null,1.0,(257358,[0,1,2,3,4,5,6,7,8,9,10,13,15,16,20,25,27,29,34,37,40,42,45,48,49,52,58,68,71,76,77,86,89,93,98,99,100,108,109,116,122,124,129,169,208,219,221,235,249,255,260,353,355,371,431,442,641,711,972,1065,1411,1663,1776,1925,2596,2957,3355,3828,4860,6288,7294,8951,9758,12203,18319,21779,48525,72732,75420,146476,192184],[3.0,8.0,1.0,1.0,4.0,2.0,7.0,4.0,2.0,1.0,1.0,2.0,1.0,4.0,3.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,1.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])]
(of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
at
org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at
org.apache.spark.ml.classification.LogisticRegression$$anonfun$6.apply(LogisticRegression.scala:266)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at
org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:214)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:919)
at
org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:910)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:866)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:910)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:668)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:330)

Many thanks
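
(A hedged aside: the row in the MatchError starts with a null field, which
suggests some reviews or ratings came back null from the CSV parse. Quite
apart from the escaping fix found later in the thread, dropping nulls
before fitting is a quick way to confirm; a sketch:)

// Discard rows where review or rating failed to parse.
val clean = df.na.drop(Seq("review", "rating"))
val words = tk.transform(clean.withColumn("label", isGood('rating)))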
-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Re: Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
That was a bit of a brute-force search, so I changed the code to use a UDF
to compute the dot product between the two TF-IDF vectors and sort on the
new column.

package com.ss.ml.clustering

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.ml.feature.{IDF, Tokenizer, HashingTF}
import org.apache.spark.ml.linalg.Vector

object ClusteringBasics extends App {

  val spark = SparkSession.builder().appName("Clustering Basics").master("local").getOrCreate()
  import spark.implicits._

  val df = spark.read.option("header", "false").csv("data")

  val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words")
  val tf = new HashingTF().setInputCol("words").setOutputCol("tf")
  val idf = new IDF().setInputCol("tf").setOutputCol("tf-idf")

  val df1 = tf.transform(tk.transform(df))
  val idfs = idf.fit(df1).transform(df1)

  val nn = nearestNeighbour("http://dbpedia.org/resource/Barack_Obama", idfs)
  println(nn)

  def nearestNeighbour(uri: String, ds: DataFrame) : String = {
    val tfIdfSrc = ds.filter(s"_c0 == '$uri'").take(1)(0).getAs[Vector]("tf-idf")
    def dotProduct(vectorA: Vector) = {
      var dp = 0.0
      var index = vectorA.size - 1
      for (i <- 0 to index) {
        dp += vectorA(i) * tfIdfSrc(i)
      }
      dp
    }
    val dpUdf = udf((v1: Vector, v2: Vector) => dotProduct(v1))
    ds.filter(s"_c0 != '$uri'").withColumn("dp", dpUdf('tf-idf)).sort("dp").take(1)(0).getString(1)
  }

}


However, that is generating the exception below,

Exception in thread "main" java.lang.RuntimeException: Unsupported literal
type class org.apache.spark.ml.feature.IDF idf_e49381a285dd
at
org.apache.spark.sql.catalyst.expressions.Literal$.apply(literals.scala:57)
at org.apache.spark.sql.functions$.lit(functions.scala:101)
at org.apache.spark.sql.Column.$minus(Column.scala:672)
at
com.ss.ml.clustering.ClusteringBasics$.nearestNeighbour(ClusteringBasics.scala:36)
at
com.ss.ml.clustering.ClusteringBasics$.delayedEndpoint$com$ss$ml$clustering$ClusteringBasics$1(ClusteringBasics.scala:22)
at
com.ss.ml.clustering.ClusteringBasics$delayedInit$body.apply(ClusteringBasics.scala:8)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at com.ss.ml.clustering.ClusteringBasics$.main(ClusteringBasics.scala:8)
at com.ss.ml.clustering.ClusteringBasics.main(ClusteringBasics.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
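
(The trace points at Column.$minus: in Scala, 'tf-idf is not one symbol
but the symbol 'tf minus the val idf, which is how the IDF estimator ends
up being passed to lit() as a "literal". Referring to the hyphenated
column with col(), and passing both arguments the UDF expects, should get
past it; a sketch, sorted descending on the assumption that nearest means
largest dot product:)

import org.apache.spark.sql.functions.{col, desc}

// col("tf-idf") names the column; 'tf-idf parses as ('tf - idf).
ds.filter(s"_c0 != '$uri'")
  .withColumn("dp", dpUdf(col("tf-idf"), col("tf-idf")))
  .sort(desc("dp"))
  .take(1)(0).getString(1)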

On Sun, Nov 13, 2016 at 10:56 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> This is what I have done; is there a better way of doing it?
>
>   val df = spark.read.option("header", "false").csv("data")
>
>
>   val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words")
>
>   val tf = new HashingTF().setInputCol("words").setOutputCol("tf")
>
>   val idf = new IDF().setInputCol("tf").setOutputCol("tf-idf")
>
>
>   val df1 = tf.transform(tk.transform(df))
>
>   val idfs = idf.fit(df1).transform(df1)
>
>
>   println(nearestNeighbour("http://dbpedia.org/resource/Barack_Obama", idfs))
>
>
>   def nearestNeighbour(uri: String, ds: DataFrame) : String = {
>
> var res : Row = null
>
> var metric : Double = 0
>
>     val tfIdfSrc = ds.filter(s"_c0 == '$uri'").take(1)(0).getAs[Vector]("tf-idf")
>
> ds.filter("_c0 != '" + uri + "'").foreach { r =>
>
>   val tfIdfDst = r.getAs[Vector]("tf-idf")
>
>   val dp = dotProduct(tfIdfSrc, tfIdfDst)
>
>   if (dp > metric) {
>
> res = r
>
> metric = dp
>
>   }
>
> }
>
> return res.getAs[String]("_c1")
>
>   }
>
>
>   def cosineSimilarity(vectorA: Vector, vectorB: Vector) = {
>
> var dotProduct = 0.0
>
> var normA = 0.0
>
> var normB = 0.0
>
>     var index = vectorA.size - 1
>
> for (i <- 0 to 

Re: Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
This is what I have done; is there a better way of doing it?

  val df = spark.read.option("header", "false").csv("data")

  val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words")
  val tf = new HashingTF().setInputCol("words").setOutputCol("tf")
  val idf = new IDF().setInputCol("tf").setOutputCol("tf-idf")

  val df1 = tf.transform(tk.transform(df))
  val idfs = idf.fit(df1).transform(df1)

  println(nearestNeighbour("http://dbpedia.org/resource/Barack_Obama", idfs))

  def nearestNeighbour(uri: String, ds: DataFrame) : String = {
    var res : Row = null
    var metric : Double = 0
    val tfIdfSrc = ds.filter(s"_c0 == '$uri'").take(1)(0).getAs[Vector]("tf-idf")
    ds.filter("_c0 != '" + uri + "'").foreach { r =>
      val tfIdfDst = r.getAs[Vector]("tf-idf")
      val dp = dotProduct(tfIdfSrc, tfIdfDst)
      if (dp > metric) {
        res = r
        metric = dp
      }
    }
    return res.getAs[String]("_c1")
  }

  def cosineSimilarity(vectorA: Vector, vectorB: Vector) = {
    var dotProduct = 0.0
    var normA = 0.0
    var normB = 0.0
    var index = vectorA.size - 1
    for (i <- 0 to index) {
      dotProduct += vectorA(i) * vectorB(i)
      normA += Math.pow(vectorA(i), 2)
      normB += Math.pow(vectorB(i), 2)
    }
    (dotProduct / (Math.sqrt(normA) * Math.sqrt(normB)))
  }

  def dotProduct(vectorA: Vector, vectorB: Vector) = {
    var dp = 0.0
    var index = vectorA.size - 1
    for (i <- 0 to index) {
      dp += vectorA(i) * vectorB(i)
    }
    dp
  }

On Sun, Nov 13, 2016 at 7:04 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> Hello,
>
> I have a dataset containing TF-IDF vectors for a corpus of documents. How
> do I perform a nearest neighbour search on the dataset, using cosine
> similarity?
>
>   val df = spark.read.option("header", "false").csv("data")
>
>   val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words")
>
>   val tf = new HashingTF().setInputCol("words").setOutputCol("tf")
>
>   val idf = new IDF().setInputCol("tf").setOutputCol("tf-idf")
>
>   val df1 = tf.transform(tk.transform(df))
>
>   idf.fit(df1).transform(df1).select("tf-idf").show(10)
> Thank you
>
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
>
> *00 971 50 409 0169*
> *mee...@servicesymphony.com <mee...@servicesymphony.com>*
>



-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Nearest neighbour search

2016-11-13 Thread Meeraj Kunnumpurath
Hello,

I have a dataset containing TF-IDF vectors for a corpus of documents. How
do I perform a nearest neighbour search on the dataset, using cosine
similarity?

  val df = spark.read.option("header", "false").csv("data")

  val tk = new Tokenizer().setInputCol("_c2").setOutputCol("words")

  val tf = new HashingTF().setInputCol("words").setOutputCol("tf")

  val idf = new IDF().setInputCol("tf").setOutputCol("tf-idf")

  val df1 = tf.transform(tk.transform(df))

  idf.fit(df1).transform(df1).select("tf-idf").show(10)
Thank you
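
(One route, sketched under the assumption that the tf-idf column above is
what gets compared: L2-normalise the vectors first, after which the plain
dot product of any two rows equals their cosine similarity.)

import org.apache.spark.ml.feature.Normalizer

// After p = 2 normalisation, dot(a, b) == cosine(a, b).
val norm = new Normalizer()
  .setInputCol("tf-idf")
  .setOutputCol("tf-idf-norm")
  .setP(2.0)
val normed = norm.transform(idf.fit(df1).transform(df1))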

-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Re: RowMatrix from DenseVector

2016-10-13 Thread Meeraj Kunnumpurath
Apologies, an oversight: I had a mix of mllib and ml imports.
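
(For the archive, the compiling version with consistent mllib imports:)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// RowMatrix is an mllib type, so the vectors must be mllib vectors too.
val features = df.rdd.map(r =>
  Vectors.dense(r.getAs[Double]("constant"), r.getAs[Double]("sqft_living")))
val rowMatrix = new RowMatrix(features, features.count(), 2)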

On Thu, Oct 13, 2016 at 2:27 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> Hello,
>
> How do I create a row matrix from dense vectors? The following code
> doesn't compile.
>
> val features = df.rdd.map(r => Vectors.dense(r.getAs[Double]("constant"), r.getAs[Double]("sqft_living")))
> val rowMatrix = new RowMatrix(features, features.count(), 2)
>
> The compiler error
>
> Error:(24, 33) type mismatch;
>  found   : org.apache.spark.rdd.RDD[org.apache.spark.ml.linalg.Vector]
>  required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
>   val rowMatrix = new RowMatrix(features, features.count(), 2)
> ^
>
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
>
> *00 971 50 409 0169*
> *mee...@servicesymphony.com <mee...@servicesymphony.com>*
>



-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


RowMatrix from DenseVector

2016-10-13 Thread Meeraj Kunnumpurath
Hello,

How do I create a row matrix from dense vectors? The following code
doesn't compile.

val features = df.rdd.map(r =>
  Vectors.dense(r.getAs[Double]("constant"), r.getAs[Double]("sqft_living")))
val rowMatrix = new RowMatrix(features, features.count(), 2)

The compiler error

Error:(24, 33) type mismatch;
 found   : org.apache.spark.rdd.RDD[org.apache.spark.ml.linalg.Vector]
 required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
  val rowMatrix = new RowMatrix(features, features.count(), 2)
    ^

-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Re: Linear Regression Error

2016-10-12 Thread Meeraj Kunnumpurath
If I drop the last feature on the third model, the error seems to go away.
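
(That fits: lat_plus_long is exactly lat + long, so the third feature set
is linearly dependent and the normal-equations Cholesky solve, lapack.dppsv,
hits a singular matrix. If all the features must stay, forcing the
iterative solver avoids the decomposition; a sketch:)

// "l-bfgs" bypasses the WeightedLeastSquares/Cholesky path that asserts.
val lr = new LinearRegression().setSolver("l-bfgs")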

On Wed, Oct 12, 2016 at 11:52 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> Hello,
>
> I have some code trying to compare linear regression coefficients with
> three sets of features, as shown below. On the third one, I get an
> assertion error.
>
> This is the code,
>
> object MultipleRegression extends App {
>
>   val spark = SparkSession.builder().appName("Regression Model Builder").master("local").getOrCreate()
>
>   import spark.implicits._
>
>   val training = build("kc_house_train_data.csv", "train", spark)
>   val test = build("kc_house_test_data.csv", "test", spark)
>
>   val lr = new LinearRegression()
>
>   val m1 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", "bathrooms", "lat", "long")))
>   println(s"Coefficients: ${m1.coefficients}, Intercept: ${m1.intercept}")
>
>   val m2 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms")))
>   println(s"Coefficients: ${m2.coefficients}, Intercept: ${m2.intercept}")
>
>   val m3 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms", "bedrooms_squared", "log_sqft_living", "lat_plus_long")))
>   println(s"Coefficients: ${m3.coefficients}, Intercept: ${m3.intercept}")
>
>
>   def build(path: String, view: String, spark: SparkSession) = {
>
> val toDouble = udf((x: String) => x.toDouble)
> val product = udf((x: Double, y: Double) => x * y)
> val sum = udf((x: Double, y: Double) => x + y)
> val log = udf((x: Double) => scala.math.log(x))
>
> spark.read.
>   option("header", "true").
>   csv(path).
>   withColumn("sqft_living", toDouble('sqft_living)).
>   withColumn("price", toDouble('price)).
>   withColumn("bedrooms", toDouble('bedrooms)).
>   withColumn("bathrooms", toDouble('bathrooms)).
>   withColumn("lat", toDouble('lat)).
>   withColumn("long", toDouble('long)).
>   withColumn("bedrooms_squared", product('bedrooms, 'bedrooms)).
>   withColumn("bed_bath_rooms", product('bedrooms, 'bathrooms)).
>   withColumn("lat_plus_long", sum('lat, 'long)).
>   withColumn("log_sqft_living", log('sqft_living))
>
>   }
>
>   def buildLp(r: Row, input: String*) = {
> var features = input.map(r.getAs[Double](_)).toArray
> new LabeledPoint(r.getAs[Double]("price"), Vectors.dense(features))
>   }
>
> }
>
>
> This is the error I get.
>
> Exception in thread "main" java.lang.AssertionError: assertion failed:
> lapack.dppsv returned 9.
> at scala.Predef$.assert(Predef.scala:170)
> at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(
> CholeskyDecomposition.scala:40)
> at org.apache.spark.ml.optim.WeightedLeastSquares.fit(
> WeightedLeastSquares.scala:140)
> at org.apache.spark.ml.regression.LinearRegression.
> train(LinearRegression.scala:180)
> at org.apache.spark.ml.regression.LinearRegression.
> train(LinearRegression.scala:70)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> at com.ss.ml.regression.MultipleRegression$.delayedEndpoint$com$ss$ml$
> regression$MultipleRegression$1(MultipleRegression.scala:36)
> at com.ss.ml.regression.MultipleRegression$delayedInit$body.apply(
> MultipleRegression.scala:12)
> at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
> at scala.runtime.AbstractFunction0.apply$mcV$
> sp(AbstractFunction0.scala:12)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.App$$anonfun$main$1.apply(App.scala:76)
> at scala.collection.immutable.List.foreach(List.scala:381)
> at scala.collection.generic.TraversableForwarder$class.
> foreach(TraversableForwarder.scala:35)
> at scala.App$class.main(App.scala:76)
> at com.ss.ml.regression.MultipleRegression$.main(
> MultipleRegression.scala:12)
> at com.ss.ml.regression.MultipleRegression.main(MultipleRegression.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(
> NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:483)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
>
>
> Does anyone know what is going wrong here?
>
> Many thanks
>
> --
> *Meeraj Kunnumpurath*
>
>
> *Director and Executive Principal*
> *Service Symphony Ltd*
> *00 44 7702 693597*
>
> *00 971 50 409 0169*
> *mee...@servicesymphony.com <mee...@servicesymphony.com>*
>



-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Linear Regression Error

2016-10-12 Thread Meeraj Kunnumpurath
Hello,

I have some code trying to compare linear regression coefficients with
three sets of features, as shown below. On the third one, I get an
assertion error.

This is the code,

object MultipleRegression extends App {



  val spark = SparkSession.builder().appName("Regression Model Builder").master("local").getOrCreate()

  import spark.implicits._

  val training = build("kc_house_train_data.csv", "train", spark)
  val test = build("kc_house_test_data.csv", "test", spark)

  val lr = new LinearRegression()

  val m1 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", "bathrooms", "lat", "long")))
  println(s"Coefficients: ${m1.coefficients}, Intercept: ${m1.intercept}")

  val m2 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms")))
  println(s"Coefficients: ${m2.coefficients}, Intercept: ${m2.intercept}")

  val m3 = lr.fit(training.map(r => buildLp(r, "sqft_living", "bedrooms", "bathrooms", "lat", "long", "bed_bath_rooms", "bedrooms_squared", "log_sqft_living", "lat_plus_long")))
  println(s"Coefficients: ${m3.coefficients}, Intercept: ${m3.intercept}")


  def build(path: String, view: String, spark: SparkSession) = {

val toDouble = udf((x: String) => x.toDouble)
val product = udf((x: Double, y: Double) => x * y)
val sum = udf((x: Double, y: Double) => x + y)
val log = udf((x: Double) => scala.math.log(x))

spark.read.
  option("header", "true").
  csv(path).
  withColumn("sqft_living", toDouble('sqft_living)).
  withColumn("price", toDouble('price)).
  withColumn("bedrooms", toDouble('bedrooms)).
  withColumn("bathrooms", toDouble('bathrooms)).
  withColumn("lat", toDouble('lat)).
  withColumn("long", toDouble('long)).
  withColumn("bedrooms_squared", product('bedrooms, 'bedrooms)).
  withColumn("bed_bath_rooms", product('bedrooms, 'bathrooms)).
  withColumn("lat_plus_long", sum('lat, 'long)).
  withColumn("log_sqft_living", log('sqft_living))

  }

  def buildLp(r: Row, input: String*) = {
var features = input.map(r.getAs[Double](_)).toArray
new LabeledPoint(r.getAs[Double]("price"), Vectors.dense(features))
  }

}


This is the error I get.

Exception in thread "main" java.lang.AssertionError: assertion failed:
lapack.dppsv returned 9.
at scala.Predef$.assert(Predef.scala:170)
at
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:40)
at
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:140)
at
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:70)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
at
com.ss.ml.regression.MultipleRegression$.delayedEndpoint$com$ss$ml$regression$MultipleRegression$1(MultipleRegression.scala:36)
at
com.ss.ml.regression.MultipleRegression$delayedInit$body.apply(MultipleRegression.scala:12)
at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.App$$anonfun$main$1.apply(App.scala:76)
at scala.collection.immutable.List.foreach(List.scala:381)
at
scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
at scala.App$class.main(App.scala:76)
at
com.ss.ml.regression.MultipleRegression$.main(MultipleRegression.scala:12)
at com.ss.ml.regression.MultipleRegression.main(MultipleRegression.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)


Does anyone know what is going wrong here?

Many thanks

-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Matrix Operations

2016-10-12 Thread Meeraj Kunnumpurath
Hello,

Does anyone have examples of doing matrix operations (multiplication,
transpose, inverse, etc.) using the Spark ML API?

Many thanks
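
(The local linear-algebra types cover multiply and transpose; neither ml
nor mllib exposes an inverse, but Breeze, which Spark bundles, does. A
sketch with made-up values:)

import org.apache.spark.mllib.linalg.{DenseMatrix, Matrices}
import breeze.linalg.{inv, DenseMatrix => BDM}

val a = Matrices.dense(2, 2, Array(1.0, 3.0, 2.0, 4.0)) // column-major
val product = a.multiply(a.asInstanceOf[DenseMatrix])   // multiplication
val t = a.transpose                                     // transpose
val aInv = inv(new BDM(2, 2, a.toArray))                // inverse via Breeze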

-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


Re: UDF on multiple columns

2016-10-12 Thread Meeraj Kunnumpurath
This is what I do at the moment,

def build(path: String, spark: SparkSession) = {
  val toDouble = udf((x: String) => x.toDouble)
  val df = spark.read.
option("header", "true").
csv(path).
withColumn("sqft_living", toDouble('sqft_living)).
withColumn("price", toDouble('price)).
withColumn("bedrooms", toDouble('bedrooms)).
withColumn("bathrooms", toDouble('bathrooms)).
withColumn("lat", toDouble('lat)).
withColumn("long", toDouble('long))
  df.createOrReplaceTempView("sales")
  spark.sql("select bedrooms * bedrooms, bedrooms * bathrooms, lat +
long, log(sqft_living), price from sales")
}
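
(The SQL route works; the direct UDF equivalent is just a function of two
arguments, with one Column per argument at the call site. A sketch against
the same frame:)

import org.apache.spark.sql.functions.udf

// One Scala parameter per input column.
val product = udf((x: Double, y: Double) => x * y)
val df2 = df.withColumn("bed_bath_rooms", product('bedrooms, 'bathrooms))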


On Wed, Oct 12, 2016 at 9:56 PM, Meeraj Kunnumpurath <
mee...@servicesymphony.com> wrote:

> Hello,
>
> How do I write a UDF that operates on two columns? For example, how do I
> introduce a new column which is the product of two columns already on the
> dataframe.
>
> Many thanks
> Meeraj
>



-- 
*Meeraj Kunnumpurath*


*Director and Executive Principal*
*Service Symphony Ltd*
*00 44 7702 693597*

*00 971 50 409 0169*
*mee...@servicesymphony.com <mee...@servicesymphony.com>*


UDF on multiple columns

2016-10-12 Thread Meeraj Kunnumpurath
Hello,

How do I write a UDF that operates on two columns? For example, how do I
introduce a new column which is the product of two columns already on the
dataframe.

Many thanks
Meeraj