kafka + mysql filtering problem
Hi all, I want to read some filtering rules from MySQL (via the JDBC MySQL driver) -- specifically a char column containing a field and a value -- and apply them to a Kafka streaming input. The main idea is to drive this from a web UI (Livy server). Any suggestions or guidelines? For example, I have this:

    object Streaming {
      def main(args: Array[String]) {
        if (args.length < 4) {
          System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
          System.exit(1)
        }
        val Array(zkQuorum, group, topics, numThreads) = args
        val spc = SparkContext.getOrCreate()
        val ssc = new StreamingContext(spc, Seconds(3))
        val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topics -> 5)).map(_._2)

        /* TEST MYSQL */
        val sqlContext = new SQLContext(spc)
        val prop = new java.util.Properties
        val url = "jdbc:mysql://52.22.38.81:3306/tmp"
        val tbl_users = "santander_demo_users"
        val tbl_rules = "santander_demo_filters"
        val tbl_campaigns = "santander_demo_campaigns"
        prop.setProperty("user", "root")
        prop.setProperty("password", "Exalitica2014")
        val users = sqlContext.read.jdbc(url, tbl_users, prop)
        val rules = sqlContext.read.jdbc(url, tbl_rules, prop)
        val campaigns = sqlContext.read.jdbc(url, tbl_campaigns, prop)

        val toolbox = currentMirror.mkToolBox()
        val toRemove = "\"".toSet
        var mto = "0"

        def rule_apply(n: Int, t: String, rules: DataFrame): String = {
          // reading rules from mysql
          val r = rules.filter(rules("CID") === n).select("FILTER_DSC").first()(0).toString
          // using mkToolBox for pre-processing rules
          toolbox.eval(toolbox.parse("""
            val mto = """ + t + """
            if (""" + r + """) "true" else "false"
          """)).toString
        }
        /* TEST MYSQL */

        lines.map { x =>
          if (x.split(",").length > 1) {
            // reading from kafka input
            mto = x.split(",")(5).filterNot(toRemove)
          }
        }

        val msg = rule_apply(1, mto, rules)
        val word = lines.map(x => msg)
        word.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The problem is that the mto variable always falls back to "0" after mapping the lines DStream.
I tried to move rule_apply inside the map, but then I get a "not serializable" error for the mkToolBox class. Thanks in advance.

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com
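One way to sidestep both problems (a sketch, not the original code): since mkToolBox is not serializable and driver-side vars mutated inside a DStream map never make it back to the driver, the rules can instead be parsed on the driver into a plain, serializable predicate and applied inside the map. The "field-index operator threshold" rule format below (e.g. "5 > 1000") is an assumption for illustration -- the real FILTER_DSC column may hold arbitrary expressions.

```scala
// Hypothetical rule format: "<field index> <operator> <threshold>", e.g. "5 > 1000".
// A Rule is a plain case class, so it serializes cleanly into Spark closures.
case class Rule(fieldIdx: Int, op: String, value: Double) {
  // Apply the rule to one comma-separated Kafka record.
  def matches(record: String): Boolean = {
    val fields = record.split(",")
    if (fields.length <= fieldIdx) return false
    val v = fields(fieldIdx).filterNot(_ == '"').toDouble
    op match {
      case ">"  => v > value
      case "<"  => v < value
      case "==" => v == value
      case _    => false
    }
  }
}

// Parse a rule description fetched from the FILTER_DSC column (on the driver).
def parseRule(dsc: String): Rule = {
  val Array(idx, op, value) = dsc.trim.split("\\s+")
  Rule(idx.toInt, op, value.toDouble)
}

val rule = parseRule("5 > 1000")
println(rule.matches("a,b,c,d,e,2500"))  // field 5 is 2500, so true
```

With this shape the filtering happens where the data is -- e.g. `lines.filter(rule.matches)` -- instead of trying to ship a value back into a driver-side var.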
TF-IDF Question
Hi all! I have a .txt file where each row is the collection of terms of one document, separated by spaces. For example:

1 "Hola spark"
2 ...

I followed this example from the Spark site: https://spark.apache.org/docs/latest/mllib-feature-extraction.html and I get something like this:

tfidf.first()
org.apache.spark.mllib.linalg.Vector = (1048576,[35587,884670],[3.458767233,3.458767233])

My reading of this:

1. I don't know what the first parameter "1048576" is, but it is always the same number (maybe the number of terms).
2. I think the second parameter "[35587,884670]" holds the terms of the first line of my .txt file.
3. I think the third parameter "[3.458767233,3.458767233]" holds the TF-IDF values for my terms.

Does anyone know the exact interpretation of this? And, regarding the second point, if those values are the terms, how can I match them back to the original term strings ("[35587=>Hola,884670=>spark]")?

Regards and thanks in advance.
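For what it's worth, 1048576 is 2^20, HashingTF's default feature dimension, and the indices are hash buckets rather than dictionary ids: each term is mapped to (hashCode mod numFeatures). That means the mapping can be reconstructed by hashing the vocabulary yourself, as this sketch of the 1.x hashing scheme shows:

```scala
// HashingTF's default vector size: 2^20 = 1048576, fixed regardless of
// how many distinct terms the corpus actually contains.
val numFeatures = 1 << 20

// Sketch of how MLlib 1.x HashingTF maps a term to an index:
// the term's hash code, reduced to a non-negative value mod numFeatures.
def indexOf(term: String): Int = {
  val raw = term.## % numFeatures
  if (raw < 0) raw + numFeatures else raw
}

// To match indices back to terms, hash each original term and invert the map.
val vocab = Seq("Hola", "spark")
val reverse: Map[Int, String] = vocab.map(t => indexOf(t) -> t).toMap

println(indexOf("Hola") + " => " + reverse(indexOf("Hola")))
```

Note the inversion is only as good as the hashing: two terms can collide into the same bucket, in which case the index maps to both.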
null Error in ALS model predict
Hi all! I have an RDD[(Int, Int, Double, Double)] where the first two Int values are an id and a product, respectively. I trained an implicit ALS model and want to make predictions from this RDD. I tried two things, which I thought were equivalent:

1. Convert the RDD to RDD[(Int, Int)] and use model.predict(RDD[(Int, Int)]) -- this works for me!
2. Apply model.predict(Int, Int) inside a map, for example:

    val ratings = RDD[(Int, Int, Double, Double)].map { case (id, rubro, rating, resp) =>
      model.predict(id, rubro)
    }

where ratings is an RDD[Double]. With the second way, when I call ratings.first() I get the following error: Why does this happen? I need to use this second way. Thanks in advance,
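The likely culprit in way 2 is that model.predict(id, rubro) touches the model's internal factor RDDs, and RDD operations cannot be invoked from inside another RDD's closure on the executors. The usual workaround is to keep way 1 (bulk predict on the pairs) and then join the scores back on the (id, rubro) key to recover the extra fields. A sketch of that join shape, with plain collections standing in for RDDs and all values hypothetical:

```scala
// (id, rubro, rating, resp) tuples, as in the original RDD.
val data = Seq((1, 10, 3.0, 1.0), (2, 20, 4.0, 0.0))

// Stand-in for model.predict(RDD[(Int, Int)]): one score per (id, rubro) pair.
// With real RDDs this would be predictions keyed by (user, product).
val predictions = Map((1, 10) -> 0.8, (2, 20) -> 0.3)

// Join the scores back on the (id, rubro) key, keeping rating and resp --
// the same shape as an RDD keyBy + join.
val joined = data.map { case (id, rubro, rating, resp) =>
  (id, rubro, rating, resp, predictions((id, rubro)))
}

println(joined)
```

With RDDs the equivalent is `data.keyBy(t => (t._1, t._2)).join(predictions.keyBy(p => (p.user, p.product)))`, which stays entirely on the executors.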
RE: Effects problems in logistic regression
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!

From: Franco Barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, 18 December 2014 16:42
To: 'DB Tsai'
CC: 'Sean Owen'; user@spark.apache.org
Subject: RE: Effects problems in logistic regression

Thanks, I will try.

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, 18 December 2014 16:24
To: Franco Barrientos
CC: Sean Owen; user@spark.apache.org
Subject: Re: Effects problems in logistic regression

Can you try LogisticRegressionWithLBFGS? I verified that it converges to the same result trained by R's glmnet package without regularization. The problem with LogisticRegressionWithSGD is that it is very slow to converge and, much of the time, very sensitive to the step size, which can lead to a wrong answer. The regularization logic in MLlib is not entirely correct, and it penalizes the intercept. In general, with really high regularization, all the coefficients will be zero except the intercept. In logistic regression, the non-zero intercept can be understood as the prior probability of each class; in linear regression, it will be the mean of the response. I'll have a PR to fix this issue.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos <franco.barrien...@exalitica.com> wrote:

Yes, without the "amounts" variables the results are similar. When I put in other variables it's fine.

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, 18 December 2014 14:22
To: Franco Barrientos
CC: user@spark.apache.org
Subject: Re: Effects problems in logistic regression

Are you sure this is an apples-to-apples comparison? For example, does your SAS process normalize or otherwise transform the data first? Is the optimization configured similarly in both cases -- same regularization, etc.? Are you sure you are pulling out the intercept correctly? It is a separate value from the logistic regression model in Spark.

On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos <franco.barrien...@exalitica.com> wrote:

Hi all! I have a problem with LogisticRegressionWithSGD. When I train a data set with one variable (the amount of an item) plus an intercept, I get weights of (-0.4021, -207.1749) for the two features, respectively. This doesn't make sense to me, because when I run a logistic regression on the same data in SAS I get the weights (-2.6604, 0.000245). The range of this variable is 0 to 59102, with a mean of 1158. The problem comes when I want to calculate the probability for each user in the data set: the probability is zero or near zero in many cases, because when Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result is a huge number -- in fact, infinity for Spark. How should I treat this variable? And why does this happen?

Thanks,
RE: Effects problems in logistic regression
Thanks, I will try.
RE: Effects problems in logistic regression
Yes, without the "amounts" variables the results are similar. When I put in other variables it's fine.
Effects problems in logistic regression
Hi all! I have a problem with LogisticRegressionWithSGD. When I train a data set with one variable (the amount of an item) plus an intercept, I get weights of (-0.4021, -207.1749) for the two features, respectively. This doesn't make sense to me, because when I run a logistic regression on the same data in SAS I get the weights (-2.6604, 0.000245). The range of this variable is 0 to 59102, with a mean of 1158. The problem comes when I want to calculate the probability for each user in the data set: the probability is zero or near zero in many cases, because when Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result is a huge number -- in fact, infinity for Spark. How should I treat this variable? And why does this happen?

Thanks,
Percentile
Hi folks! Does anyone know how I can calculate, for each element of a variable in an RDD, its percentile? I tried to calculate it through Spark SQL with subqueries, but I think that is impossible in Spark SQL. Any idea will be welcome. Thanks in advance,
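One shape that avoids subqueries entirely (a sketch on plain collections, sample values hypothetical): sort the values, take each value's rank, and define percentile as rank / (n - 1). With an RDD the same pattern is rdd.sortBy(identity).zipWithIndex() followed by a join back on the value.

```scala
val values = Seq(12.0, 5.0, 7.0, 20.0, 5.0)
val n = values.size

// Rank each distinct value by its position in sorted order; ties take the
// lowest rank. Percentile = rank / (n - 1), so min -> 0.0 and max -> 1.0.
val percentile: Map[Double, Double] =
  values.sorted.zipWithIndex
    .groupBy(_._1)
    .map { case (v, ranks) => v -> ranks.head._2.toDouble / (n - 1) }

// Attach the percentile to every original element.
val result = values.map(v => (v, percentile(v)))
println(result)
```

The same join-on-value step is what replaces the correlated subquery: the ranking is computed once, then looked up per element.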
join 2 tables
I have 2 tables in a Hive context, and I want to select one field from each table where the ids of the two tables are equal. For example:

    val tmp2 = sqlContext.sql("select a.ult_fecha, b.pri_fecha from fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id=b.id")

but I get an error.
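Since the failing message is not shown, one thing worth trying is the explicit JOIN ... ON form, which Spark SQL's parser accepts where the comma-separated implicit join can fail (table and column names as in the original query):

```scala
// Same query rewritten with an explicit join; semantically identical to the
// comma-join with the WHERE equality condition.
val tmp2Query = """
  SELECT a.ult_fecha, b.pri_fecha
  FROM fecha_ult_compra_u3m a
  JOIN fecha_pri_compra_u3m b
    ON a.id = b.id
"""
// then: val tmp2 = sqlContext.sql(tmp2Query)
println(tmp2Query)
```

If the error persists with this form, it is more likely about the tables themselves (not registered in the Hive context, or a missing column) than the join syntax.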
S3 table to spark sql
How can I create a date field in Spark SQL? I have an S3 table and I load it into an RDD:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD

    case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int,
                       sku: String, unidades: Double, monto: Double)

    val tabla = sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt")
      .map(_.split(","))
      .map(p => trx_u3m(p(0).trim, p(1).trim, p(2).trim, p(3).trim.toInt,
                        p(4).trim, p(5).trim.toDouble, p(6).trim.toDouble))

    tabla.registerTempTable("trx_u3m")

Now my problem is: how can I transform the string variable fechau3m into a date variable?
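One approach (a sketch; the "yyyy-MM-dd" pattern is an assumption -- adjust it to whatever format fechau3m actually uses): parse the string into a java.sql.Date inside the row mapping and declare the case-class field as a date instead of a String.

```scala
import java.text.SimpleDateFormat
import java.sql.Date

// Parse a date string like "2014-11-30" into java.sql.Date, the type
// Spark SQL maps to its DateType.
def toSqlDate(s: String): Date = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  new Date(fmt.parse(s.trim).getTime)
}

// In the case class, declare the field as java.sql.Date instead of String:
//   case class trx_u3m(id: String, local: String, fechau3m: java.sql.Date, ...)
// and in the row mapping, replace p(2).trim with:
//   toSqlDate(p(2))
println(toSqlDate("2014-11-30"))
```

Note that SimpleDateFormat is not thread-safe, so inside an RDD map it is safest to construct it per partition (e.g. with mapPartitions) rather than share one instance.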