kafka + mysql filtering problem
Hi all, I want to read some filtering rules from MySQL (via the JDBC MySQL driver) -- specifically a char column containing a field and a value -- and apply them to a Kafka streaming input. The main idea is to drive this from a web UI (Livy server). Any suggestions or guidelines? For example, I have this:

    object Streaming {
      def main(args: Array[String]) {
        if (args.length < 4) {
          System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
          System.exit(1)
        }
        val Array(zkQuorum, group, topics, numThreads) = args
        val spc = SparkContext.getOrCreate()
        val ssc = new StreamingContext(spc, Seconds(3))
        val lines = KafkaUtils.createStream(ssc, zkQuorum, group, Map(topics -> 5)).map(_._2)

        /* TEST MYSQL */
        val sqlContext = new SQLContext(spc)
        val prop = new java.util.Properties
        val url = "jdbc:mysql://52.22.38.81:3306/tmp"
        val tbl_users = "santander_demo_users"
        val tbl_rules = "santander_demo_filters"
        val tbl_campaigns = "santander_demo_campaigns"
        prop.setProperty("user", "root")
        prop.setProperty("password", "Exalitica2014")
        val users = sqlContext.read.jdbc(url, tbl_users, prop)
        val rules = sqlContext.read.jdbc(url, tbl_rules, prop)
        val campaigns = sqlContext.read.jdbc(url, tbl_campaigns, prop)

        val toolbox = currentMirror.mkToolBox()
        val toRemove = "\"".toSet
        var mto = "0"

        def rule_apply(n: Int, t: String, rules: DataFrame): String = {
          // reading rules from mysql
          val r = rules.filter(rules("CID") === n).select("FILTER_DSC").first()(0).toString
          // using mkToolBox for pre-processing rules
          toolbox.eval(toolbox.parse("""
            val mto = """ + t + """
            if (""" + r + """) "true" else "false"
          """)).toString
        }
        /* TEST MYSQL */

        lines.map { x =>
          if (x.split(",").length > 1) {
            // reading from kafka input
            mto = x.split(",")(5).filterNot(toRemove)
          }
        }

        val msg = rule_apply(1, mto, rules)
        val word = lines.map(x => msg)
        word.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

The problem is that the mto variable always falls back to "0" after mapping the lines DStream.
I tried to move rule_apply inside the map, but then I get a "not serializable" error for the mkToolBox class. Thanks in advance.

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893
franco.barrien...@exalitica.com
www.exalitica.com
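One way to sidestep both problems (a sketch, not the original code): since mkToolBox is not serializable and driver-side vars mutated inside a DStream map never make it back to the driver, the rules can instead be parsed on the driver into a plain, serializable predicate and applied inside the map. The "field-index operator threshold" rule format below (e.g. "5 > 1000") is an assumption for illustration -- the real FILTER_DSC column may hold arbitrary expressions.

```scala
// Hypothetical rule format: "<field index> <operator> <threshold>", e.g. "5 > 1000".
// A Rule is a plain case class, so it serializes cleanly into Spark closures.
case class Rule(fieldIdx: Int, op: String, value: Double) {
  // Apply the rule to one comma-separated Kafka record.
  def matches(record: String): Boolean = {
    val fields = record.split(",")
    if (fields.length <= fieldIdx) return false
    val v = fields(fieldIdx).filterNot(_ == '"').toDouble
    op match {
      case ">"  => v > value
      case "<"  => v < value
      case "==" => v == value
      case _    => false
    }
  }
}

// Parse a rule description fetched from the FILTER_DSC column (on the driver).
def parseRule(dsc: String): Rule = {
  val Array(idx, op, value) = dsc.trim.split("\\s+")
  Rule(idx.toInt, op, value.toDouble)
}

val rule = parseRule("5 > 1000")
println(rule.matches("a,b,c,d,e,2500"))  // field 5 is 2500, so true
```

With this shape the filtering happens where the data is -- e.g. `lines.filter(rule.matches)` -- instead of trying to ship a value back into a driver-side var.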
TF-IDF Question
Hi all! I have a .txt file where each row is the collection of terms of one document, separated by spaces. For example:

1 "Hola spark"
2 ...

I followed this example from the Spark site: https://spark.apache.org/docs/latest/mllib-feature-extraction.html and I get something like this:

tfidf.first()
org.apache.spark.mllib.linalg.Vector = (1048576,[35587,884670],[3.458767233,3.458767233])

My reading of this:

1. I don't know what the first parameter "1048576" is, but it is always the same number (maybe the number of terms).
2. I think the second parameter "[35587,884670]" holds the terms of the first line of my .txt file.
3. I think the third parameter "[3.458767233,3.458767233]" holds the TF-IDF values for my terms.

Does anyone know the exact interpretation of this? And, regarding the second point, if those values are the terms, how can I match them back to the original term strings ("[35587=>Hola,884670=>spark]")?

Regards and thanks in advance.
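For what it's worth, 1048576 is 2^20, HashingTF's default feature dimension, and the indices are hash buckets rather than dictionary ids: each term is mapped to (hashCode mod numFeatures). That means the mapping can be reconstructed by hashing the vocabulary yourself, as this sketch of the 1.x hashing scheme shows:

```scala
// HashingTF's default vector size: 2^20 = 1048576, fixed regardless of
// how many distinct terms the corpus actually contains.
val numFeatures = 1 << 20

// Sketch of how MLlib 1.x HashingTF maps a term to an index:
// the term's hash code, reduced to a non-negative value mod numFeatures.
def indexOf(term: String): Int = {
  val raw = term.## % numFeatures
  if (raw < 0) raw + numFeatures else raw
}

// To match indices back to terms, hash each original term and invert the map.
val vocab = Seq("Hola", "spark")
val reverse: Map[Int, String] = vocab.map(t => indexOf(t) -> t).toMap

println(indexOf("Hola") + " => " + reverse(indexOf("Hola")))
```

Note the inversion is only as good as the hashing: two terms can collide into the same bucket, in which case the index maps to both.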
null Error in ALS model predict
Hi all! I have an RDD[(Int, Int, Double, Double)] where the first two Int values are an id and a product, respectively. I trained an implicit ALS model and want to make predictions from this RDD. I tried two things, which I thought were equivalent:

1. Convert the RDD to RDD[(Int, Int)] and use model.predict(RDD[(Int, Int)]) -- this works for me!
2. Apply model.predict(Int, Int) inside a map, for example:

    val ratings = RDD[(Int, Int, Double, Double)].map { case (id, rubro, rating, resp) =>
      model.predict(id, rubro)
    }

where ratings is an RDD[Double]. With the second way, when I call ratings.first() I get the following error: Why does this happen? I need to use this second way. Thanks in advance,
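The likely culprit in way 2 is that model.predict(id, rubro) touches the model's internal factor RDDs, and RDD operations cannot be invoked from inside another RDD's closure on the executors. The usual workaround is to keep way 1 (bulk predict on the pairs) and then join the scores back on the (id, rubro) key to recover the extra fields. A sketch of that join shape, with plain collections standing in for RDDs and all values hypothetical:

```scala
// (id, rubro, rating, resp) tuples, as in the original RDD.
val data = Seq((1, 10, 3.0, 1.0), (2, 20, 4.0, 0.0))

// Stand-in for model.predict(RDD[(Int, Int)]): one score per (id, rubro) pair.
// With real RDDs this would be predictions keyed by (user, product).
val predictions = Map((1, 10) -> 0.8, (2, 20) -> 0.3)

// Join the scores back on the (id, rubro) key, keeping rating and resp --
// the same shape as an RDD keyBy + join.
val joined = data.map { case (id, rubro, rating, resp) =>
  (id, rubro, rating, resp, predictions((id, rubro)))
}

println(joined)
```

With RDDs the equivalent is `data.keyBy(t => (t._1, t._2)).join(predictions.keyBy(p => (p.user, p.product)))`, which stays entirely on the executors.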
RE: Effects problems in logistic regression
Thanks again DB Tsai, LogisticRegressionWithLBFGS works for me!

From: Franco Barrientos [mailto:franco.barrien...@exalitica.com]
Sent: Thursday, 18 December 2014 16:42
To: 'DB Tsai'
CC: 'Sean Owen'; user@spark.apache.org
Subject: RE: Effects problems in logistic regression

Thanks, I will try.

From: DB Tsai [mailto:dbt...@dbtsai.com]
Sent: Thursday, 18 December 2014 16:24
To: Franco Barrientos
CC: Sean Owen; user@spark.apache.org
Subject: Re: Effects problems in logistic regression

Can you try LogisticRegressionWithLBFGS? I verified that it converges to the same result trained by R's glmnet package without regularization. The problem with LogisticRegressionWithSGD is that it is very slow to converge and, much of the time, very sensitive to the step size, which can lead to a wrong answer. The regularization logic in MLlib is not entirely correct, and it penalizes the intercept. In general, with really high regularization, all the coefficients will be zero except the intercept. In logistic regression, the non-zero intercept can be understood as the prior probability of each class; in linear regression, it will be the mean of the response. I'll have a PR to fix this issue.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai

On Thu, Dec 18, 2014 at 10:50 AM, Franco Barrientos <franco.barrien...@exalitica.com> wrote:

Yes, without the "amounts" variables the results are similar. When I put in other variables it's fine.

From: Sean Owen [mailto:so...@cloudera.com]
Sent: Thursday, 18 December 2014 14:22
To: Franco Barrientos
CC: user@spark.apache.org
Subject: Re: Effects problems in logistic regression

Are you sure this is an apples-to-apples comparison? For example, does your SAS process normalize or otherwise transform the data first? Is the optimization configured similarly in both cases -- same regularization, etc.? Are you sure you are pulling out the intercept correctly? It is a separate value from the logistic regression model in Spark.

On Thu, Dec 18, 2014 at 4:34 PM, Franco Barrientos <franco.barrien...@exalitica.com> wrote:

Hi all! I have a problem with LogisticRegressionWithSGD. When I train a data set with one variable (the amount of an item) plus an intercept, I get weights of (-0.4021, -207.1749) for the two features, respectively. This doesn't make sense to me, because when I run a logistic regression on the same data in SAS I get the weights (-2.6604, 0.000245). The range of this variable is 0 to 59102, with a mean of 1158. The problem comes when I want to calculate the probability for each user in the data set: the probability is zero or near zero in many cases, because when Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result is a huge number -- in fact, infinity for Spark. How should I treat this variable? And why does this happen?

Thanks,
RE: Effects problems in logistic regression
Thanks, I will try.
RE: Effects problems in logistic regression
Yes, without the "amounts" variables the results are similar. When I put in other variables it's fine.
Effects problems in logistic regression
Hi all! I have a problem with LogisticRegressionWithSGD. When I train a data set with one variable (the amount of an item) plus an intercept, I get weights of (-0.4021, -207.1749) for the two features, respectively. This doesn't make sense to me, because when I run a logistic regression on the same data in SAS I get the weights (-2.6604, 0.000245). The range of this variable is 0 to 59102, with a mean of 1158. The problem comes when I want to calculate the probability for each user in the data set: the probability is zero or near zero in many cases, because when Spark calculates exp(-1*(-0.4021+(-207.1749)*amount)) the result is a huge number -- in fact, infinity for Spark. How should I treat this variable? And why does this happen?

Thanks,
Percentile
Hi folks! Does anyone know how I can calculate, for each element of a variable in an RDD, its percentile? I tried to calculate it through Spark SQL with subqueries, but I think that is impossible in Spark SQL. Any idea will be welcome. Thanks in advance,
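One shape that avoids subqueries entirely (a sketch on plain collections, sample values hypothetical): sort the values, take each value's rank, and define percentile as rank / (n - 1). With an RDD the same pattern is rdd.sortBy(identity).zipWithIndex() followed by a join back on the value.

```scala
val values = Seq(12.0, 5.0, 7.0, 20.0, 5.0)
val n = values.size

// Rank each distinct value by its position in sorted order; ties take the
// lowest rank. Percentile = rank / (n - 1), so min -> 0.0 and max -> 1.0.
val percentile: Map[Double, Double] =
  values.sorted.zipWithIndex
    .groupBy(_._1)
    .map { case (v, ranks) => v -> ranks.head._2.toDouble / (n - 1) }

// Attach the percentile to every original element.
val result = values.map(v => (v, percentile(v)))
println(result)
```

The same join-on-value step is what replaces the correlated subquery: the ranking is computed once, then looked up per element.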
join 2 tables
I have 2 tables in a Hive context, and I want to select one field from each table where the ids of the two tables are equal. For example:

    val tmp2 = sqlContext.sql("select a.ult_fecha, b.pri_fecha from fecha_ult_compra_u3m as a, fecha_pri_compra_u3m as b where a.id=b.id")

but I get an error.
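Since the failing message is not shown, one thing worth trying is the explicit JOIN ... ON form, which Spark SQL's parser accepts where the comma-separated implicit join can fail (table and column names as in the original query):

```scala
// Same query rewritten with an explicit join; semantically identical to the
// comma-join with the WHERE equality condition.
val tmp2Query = """
  SELECT a.ult_fecha, b.pri_fecha
  FROM fecha_ult_compra_u3m a
  JOIN fecha_pri_compra_u3m b
    ON a.id = b.id
"""
// then: val tmp2 = sqlContext.sql(tmp2Query)
println(tmp2Query)
```

If the error persists with this form, it is more likely about the tables themselves (not registered in the Hive context, or a missing column) than the join syntax.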
S3 table to spark sql
How can I create a date field in Spark SQL? I have an S3 table and I load it into an RDD:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.createSchemaRDD

    case class trx_u3m(id: String, local: String, fechau3m: String, rubro: Int,
                       sku: String, unidades: Double, monto: Double)

    val tabla = sc.textFile("s3n://exalitica.com/trx_u3m/trx_u3m.txt")
      .map(_.split(","))
      .map(p => trx_u3m(p(0).trim, p(1).trim, p(2).trim, p(3).trim.toInt,
                        p(4).trim, p(5).trim.toDouble, p(6).trim.toDouble))

    tabla.registerTempTable("trx_u3m")

Now my problem is: how can I transform the string variable fechau3m into a date variable?
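One approach (a sketch; the "yyyy-MM-dd" pattern is an assumption -- adjust it to whatever format fechau3m actually uses): parse the string into a java.sql.Date inside the row mapping and declare the case-class field as a date instead of a String.

```scala
import java.text.SimpleDateFormat
import java.sql.Date

// Parse a date string like "2014-11-30" into java.sql.Date, the type
// Spark SQL maps to its DateType.
def toSqlDate(s: String): Date = {
  val fmt = new SimpleDateFormat("yyyy-MM-dd")
  new Date(fmt.parse(s.trim).getTime)
}

// In the case class, declare the field as java.sql.Date instead of String:
//   case class trx_u3m(id: String, local: String, fechau3m: java.sql.Date, ...)
// and in the row mapping, replace p(2).trim with:
//   toSqlDate(p(2))
println(toSqlDate("2014-11-30"))
```

Note that SimpleDateFormat is not thread-safe, so inside an RDD map it is safest to construct it per partition (e.g. with mapPartitions) rather than share one instance.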