[R] LDA select number of topics
Hi all,

I recently came across this great post by Nikita Murzintcev: http://rpubs.com/nikita-moor/107657. If I understood correctly, according to Griffiths (2004) I should select 11 topics? However, the other metrics seem to suggest quite different numbers of topics. Eleven topics is about the right number, but apart from the fact that it works better in my case, how do I know which metric to rely on? That is, if I want to report this in a paper, can I simply say that I relied on Griffiths (2004), without explaining why not Arun (2010), for example?

Thanks,

[Attachment: dda_topics.pdf (Adobe PDF document)]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[R] Regression model
Hi,

I'm trying to fit a regression model, but there is something wrong with it. The dataset contains 85 observations for 85 students. Those observations are counts of several actions, and the dependent variable is the final score. More precisely, I have 5 IVs and one DV. I'm trying to build a regression model to check whether those variables can predict the final score.

I'm attaching the output of several steps, but I tried the following procedure:

- build a model with only those two variables (IA and IC)
- summary shows that none of them is a significant predictor of the final outcome
- the test for multicollinearity revealed tolerance below 0.2 (potential problem)
- build two new models, each having only one of those variables as a predictor
- both models show that the variable used for the model is a significant predictor. Separately they are significant, together not. Probably a multicollinearity problem, but...
- as I keep adding other variables to one or the other model, Multiple R-squared slightly increases
- I tried to compare different models using anova, but none of them seems to be better

How to determine which model is better?

Thanks

    lm.all.1 <- lm(mark~IA+IC, data=social_presence_data)
    summary(lm.all.1)

    Call:
    lm(formula = mark ~ IA + IC, data = social_presence_data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.5969 -0.2573  0.2599  0.5819  1.2955

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.78938    0.24599  11.339   <2e-16 ***
    IA           0.02844    0.04503   0.632    0.530
    IC           0.01979    0.02601   0.761    0.449
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.031 on 79 degrees of freedom
    Multiple R-squared:  0.12,    Adjusted R-squared:  0.09774
    F-statistic: 5.387 on 2 and 79 DF,  p-value: 0.006407

    1/vif(lm.all.1)
           IA        IC
    0.1719037 0.1719037

    dwt(lm.all.1)
     lag Autocorrelation D-W Statistic p-value
       1      0.09176706      1.815883   0.372
     Alternative hypothesis: rho != 0

    lm.all.2 <- lm(mark~IA, data=social_presence_data)
    lm.all.3 <- lm(mark~IC, data=social_presence_data)

    anova(lm.all.2, lm.all.3)
    Analysis of Variance Table

    Model 1: mark ~ IA
    Model 2: mark ~ IC
      Res.Df    RSS Df Sum of Sq F Pr(>F)
    1     80 84.604
    2     80 84.413  0   0.19141

    anova(lm.all.1, lm.all.3)
    Analysis of Variance Table

    Model 1: mark ~ IA + IC
    Model 2: mark ~ IC
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     79 83.989
    2     80 84.413 -1  -0.42402 0.3988 0.5295

    anova(lm.all.1, lm.all.2)
    Analysis of Variance Table

    Model 1: mark ~ IA + IC
    Model 2: mark ~ IA
      Res.Df    RSS Df Sum of Sq      F Pr(>F)
    1     79 83.989
    2     80 84.604 -1  -0.61543 0.5789  0.449

    summary(lm.all.2)

    Call:
    lm(formula = mark ~ IA, data = social_presence_data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.5409 -0.2539  0.2283  0.5793  1.2956

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.88517    0.21078  13.688  < 2e-16 ***
    IA           0.05961    0.01862   3.202  0.00196 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.028 on 80 degrees of freedom
    Multiple R-squared:  0.1136,  Adjusted R-squared:  0.1025
    F-statistic: 10.25 on 1 and 80 DF,  p-value: 0.001962

    summary(lm.all.3)

    Call:
    lm(formula = mark ~ IC, data = social_presence_data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.6320 -0.2562  0.2590  0.5764  1.2585

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.76364    0.24168  11.435  < 2e-16 ***
    IC           0.03473    0.01074   3.233  0.00178 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.027 on 80 degrees of freedom
    Multiple R-squared:  0.1156,  Adjusted R-squared:  0.1045
    F-statistic: 10.45 on 1 and 80 DF,  p-value: 0.001779

    lm.all.3.1 <- lm(mark~IC+AU, data=social_presence_data)
    summary(lm.all.3.1)

    Call:
    lm(formula = mark ~ IC + AU, data = social_presence_data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.5951 -0.2618  0.2378  0.5907  1.2619

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  2.77600    0.24499  11.331  < 2e-16 ***
    IC           0.03276    0.01191   2.752  0.00735 **
    AU           0.04994    0.12697   0.393  0.69514
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 1.033 on 79 degrees of freedom
    Multiple R-squared:  0.1173,  Adjusted R-squared:  0.09496
    F-statistic: 5.249 on 2 and 79 DF,  p-value: 0.007236
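The pattern described in this post (each predictor significant alone, neither significant together, low tolerance) can be reproduced on simulated data. The sketch below is only an illustration under stated assumptions: the data frame `d` and its generating process are invented stand-ins, not the poster's `social_presence_data`, and tolerance is computed by hand in base R rather than with `vif()` from the car package. It also shows why `anova()` printed no F test for the two single-predictor models: they are not nested, so an information criterion such as AIC is one common alternative for ranking them.

```r
# Simulated stand-in for the posted data (the real data set is not available).
set.seed(1)
n    <- 82
IA   <- rpois(n, 10)                    # hypothetical activity count
IC   <- 2 * IA + rpois(n, 2)            # second count, strongly tied to IA
mark <- 2.8 + 0.06 * IA + rnorm(n)      # outcome driven by the shared signal
d    <- data.frame(mark, IA, IC)

fit_both <- lm(mark ~ IA + IC, data = d)
fit_IA   <- lm(mark ~ IA, data = d)
fit_IC   <- lm(mark ~ IC, data = d)

# Tolerance = 1 - R^2 from regressing one predictor on the other(s);
# it is the reciprocal of the variance inflation factor.
tolerance_IA <- 1 - summary(lm(IA ~ IC, data = d))$r.squared

# Nested comparison: does IC add anything once IA is already in the model?
nested <- anova(fit_IA, fit_both)

# The single-predictor models are not nested in each other, so no F test
# applies; AIC gives one way to rank all three candidates.
aic <- AIC(fit_IA, fit_IC, fit_both)

print(tolerance_IA)
print(nested)
print(aic)
```

With predictors this collinear, the data cannot separate the two effects, so the choice between the single-predictor models usually has to rest on subject-matter grounds rather than on a test.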
Re: [R] Regression model
No, it's not homework, it's just some initial analysis, but still... and thanks for the recommendation.

On Thu, Nov 21, 2013 at 4:42 PM, Rolf Turner <r.tur...@auckland.ac.nz> wrote:

> (1) Is this homework? (This list doesn't do homework for people!)
> (Animals maybe, but not people! :-) )
>
> (2) Your question isn't really an R question but rather a
> statistics/linear modelling question. It is possible that you might
> get some insight from Frank Harrell's book "Regression Modeling
> Strategies" (Springer, 2001).
>
> cheers,
>
> Rolf Turner
Re: [R] lmerTest
Thanks Uwe,

I wasn't quite sure about that one... When I build the model with that particular variable, that is what happens. I have to check why...

Best,
Srecko

On Sun, Oct 13, 2013 at 5:45 AM, Uwe Ligges <lig...@statistik.tu-dortmund.de> wrote:

> On 13.10.2013 02:52, srecko joksimovic wrote:
>> ok, ok... thanks. I'll try with R-sig-ME
>
> Or for short, you are trying to estimate more coefficients than you
> have degrees of freedom, which is what
>
>     rank of X = 1660 < ncol(X) = 1895
>
> tries to tell us.
>
> Best,
> Uwe Ligges
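The "rank of X < ncol(X)" message can be checked before fitting anything, using only base R. The sketch below is an assumption-laden toy, not the poster's model: the data frame `d` and the columns `y`, `g`, and `x` are hypothetical, and the factor `g` deliberately has as many levels as there are rows, which is one typical way to end up with more fixed-effect columns than the data can identify (for example, an ID-like variable entered as a fixed effect).

```r
# Hypothetical toy data: a factor with one level per row.
d <- data.frame(
  y = c(1.2, 0.7, 2.1, 1.9, 0.4, 1.1),
  g = factor(c("a", "b", "c", "d", "e", "f")),
  x = c(1, 2, 3, 4, 5, 6)
)

# Fixed-effects design matrix, built the same way a model formula would.
X <- model.matrix(~ g + x, data = d)

rank_X <- qr(X)$rank                 # number of identifiable columns
print(c(rank = rank_X, ncol = ncol(X)))

# Counting levels per factor often points at the offending variable.
print(sapply(Filter(is.factor, d), nlevels))
```

Here `X` has 7 columns (intercept, five dummies, `x`) but only 6 rows, so its rank cannot exceed 6, which is the same situation the error message reports on a larger scale.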
[R] lmerTest
Hi,

I'm trying to use the lmer function from the lmerTest package because, if I understood correctly, it allows better inference than the lmer method from the lme4 package. However, whatever I do I keep getting this error:

    Error in lme4::lFormula(formula = mark ~ ssCount + sTime+ :
      rank of X = 1660 < ncol(X) = 1895

Any ideas what could be the problem?

thanks,
Srecko
Re: [R] lmerTest
ok, ok... thanks. I'll try with R-sig-ME

On Sat, Oct 12, 2013 at 5:43 PM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:

>> Any idea what could be the problem?
>
> Hmmm... posting in html? No reproducible example? Not posting on
> R-sig-ME? Just some ideas... reading the Posting Guide might be
> helpful to you.
>
> Jeff Newmiller
> Sent from my phone. Please excuse my brevity.
[R] multilevel analysis
I have an example of multilevel analysis with 3 levels, but the data are non-normally distributed. In the case of a normal distribution, I would perform a multilevel linear analysis using the lme function, but what should I do in the case of a non-normal distribution?

thanks,
Srecko
Re: [R] multilevel analysis
I thought so, but then I found this:

    "Normality: The assumption of normality states that the error terms at every level of the model are normally distributed."

Maybe I misinterpreted something.

On Mon, Sep 30, 2013 at 3:06 PM, David Winsemius <dwinsem...@comcast.net> wrote:

> On Sep 30, 2013, at 2:50 PM, srecko joksimovic wrote:
>
>> I have an example of multilevel analysis with 3 levels, but data are
>> non-normally distributed. In case of normal distribution, I would
>> perform multilevel linear analysis using lme function, but what
>> should I do in case of non-normal distribution?
>
> But normal distribution is not a requirement for linear models.
> Please review your theory.
>
> David Winsemius
> Alameda, CA, USA
Re: [R] multilevel analysis
Thanks for your comments, David and Bert.

The best would be to provide an example. Let's say we have a dataset like this one:

    IDEmployee  Company   OU            CountViewPortal  CountLogin  TimeOnTask  Performance
    1           Company1  Company1.OU1  21               33          627.8       4.3
    2           Company1  Company1.OU2  45               54          34.8        2.3
    3           Company2  Company1.OU1  23               33          3.8         1.0
    4           Company2  Company1.OU1  34               12          44.8        2.3
    5           Company2  Company1.OU2  55               22          55.8        4.5
    6           Company2  Company1.OU3  45               44          34.8        3

I want to see if there is a correlation between CountViewPortal and Performance. Moreover, I'd like to reveal the influence of CountViewPortal + TimeOnTask on Performance. However, I expect that employees within an OU, and then within a Company, behave similarly. Thus, I'll have 3 levels: employee, OU, Company.

In R, I would do something like this:

    randomInterceptCount <- lme(Performance ~ CountViewPortal, data=analysis, random=~1|OU/Company1, method="ML")

But then the point is that CountViewPortal, CountLogin and TimeOnTask are non-normally distributed. I guess that my question is, what should I do in the case of a non-normal distribution?

I really appreciate your help. Thanks again!
Srecko

On Mon, Sep 30, 2013 at 5:14 PM, David Winsemius <dwinsem...@comcast.net> wrote:

> On Sep 30, 2013, at 3:22 PM, srecko joksimovic wrote:
>
>> I thought so, but then I found this:
>>
>> "Normality: The assumption of normality states that the error terms
>> at every level of the model are normally distributed."
>>
>> maybe I misinterpreted something.
>
> Notice that it is the _error_terms_ that are to be normally
> distributed, not the data itself. One might even infer that normally
> distributed data might be suspect because the correct distribution
> should be a mixture of normals. Since the errors never are going to
> fit on a straight line on a QQ plot, the real question is how far
> from Normal and what the impact might be on the quantities being
> estimated.
>
> --
> David Winsemius
> Alameda, CA, USA
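David's distinction between the distribution of the data and the distribution of the error terms can be seen in a small simulation. The sketch below is a deliberately simplified, single-level illustration using `lm()` from base R; the variables `count_view` and `performance` and their generating process are assumptions made up for the example, not the poster's data. The same kind of residual check would apply to the residuals of a multilevel fit.

```r
set.seed(42)
n <- 200

# A strongly right-skewed predictor, as count or time-on-task data often are.
count_view  <- rexp(n, rate = 1 / 30)

# The outcome depends linearly on it, with normally distributed errors.
performance <- 1 + 0.05 * count_view + rnorm(n, sd = 0.5)

fit <- lm(performance ~ count_view)

# The predictor itself is far from normal ...
p_predictor <- shapiro.test(count_view)$p.value

# ... but the normality assumption concerns the residuals, not the predictor.
p_residuals <- shapiro.test(resid(fit))$p.value

print(c(predictor = p_predictor, residuals = p_residuals))

# A QQ plot of the residuals is the usual visual check.
if (interactive()) {
  qqnorm(resid(fit))
  qqline(resid(fit))
}
```

The slope is still recovered well even though the predictor is heavily skewed, which is the point David makes: how far the residuals depart from normal, and what that does to the estimates, is the question worth asking.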
Re: [R] Unrecognized token
Thanks William,

actually, the combination that works is:

    with(list(id=c("1234","abcd")), paste(paste("select * from tbl_user where student_id = '78789D'", sep=""), " order by date_time", sep=""))

Maybe I should try to replace double quotes with single (the opposite of what I was doing...)

On Tue, Sep 17, 2013 at 9:16 AM, William Dunlap <wdun...@tibco.com> wrote:

> Look at the query strings your code produces:
>
>     with(list(id=c("1234","abcd")),
>          paste(paste("select * from tbl_user where student_id = ", id, sep=""), " order by date_time", sep="")
>     )
>     [1] "select * from tbl_user where student_id = 1234 order by date_time"
>     [2] "select * from tbl_user where student_id = abcd order by date_time"
>
> I suspect that the abcd should have quotes around it. If student_id
> is stored as string data the 1234 should probably also have quotes
> around it. Replace
>     id
> with
>     "\"", id, "\""
> and you may get a query that works.
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
[R] Unrecognized token
Hi,

when I generate a query using the sqldf library, like this:

    query = paste(paste("select * from tbl_user where student_id = ", id, sep=""), " order by date_time", sep="")
    student <- sqldf(query)

everything works fine when the id is 21328, 82882, or something like that. But when the id is something like 78789D, there is an error:

    Error in sqliteExecStatement(con, statement, bind.data) :
      RS-DBI driver: (error in statement: unrecognized token: "78789D")

I tried replacing single quotes with double, but it still doesn't work...

thanks,
Srecko
Re: [R] Unrecognized token
Yes, you are right... the other is definitely not a valid query.

thanks

On Tue, Sep 17, 2013 at 11:22 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:

>     id <- c("21328","78789D")
>     query <- paste(paste("select * from tbl_user where student_id = ", id, sep=""), " order by date_time", sep="")
>     query
>     [1] "select * from tbl_user where student_id = 21328 order by date_time"
>     [2] "select * from tbl_user where student_id = 78789D order by date_time"
>
> Now, does the second string look like valid SQL to you? In
> particular, the 78789D is a problem.
>
> On the other hand...
>
>     query <- paste(paste("select * from tbl_user where student_id = '", id, sep=""), "' order by date_time", sep="")
>     query
>     [1] "select * from tbl_user where student_id = '21328' order by date_time"
>     [2] "select * from tbl_user where student_id = '78789D' order by date_time"
>
> As others have pointed out, in this case escaping does not appear to
> be key to getting valid SQL syntax... but looking at the query before
> shipping it off to a database engine seems to me to be an obvious
> technique you should learn.
>
> Jeff Newmiller
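The fix discussed in this thread (wrap the id in single quotes so SQL reads it as a string literal) can be packaged as a small helper. This is a minimal sketch in base R only: `sql_quote` is a hypothetical helper name, not part of sqldf or any other package, and the query is only built and printed here, not executed. Doubling embedded single quotes is the standard SQL way to escape them inside a string literal.

```r
id <- c("21328", "78789D")

# Hypothetical helper: double any embedded single quote, then wrap the
# value in single quotes so SQL parses it as a string literal.
sql_quote <- function(x) {
  paste0("'", gsub("'", "''", x, fixed = TRUE), "'")
}

query <- sprintf(
  "select * from tbl_user where student_id = %s order by date_time",
  sql_quote(id)
)

# Inspect the generated SQL before sending it to the database engine.
print(query)
```

Unquoted, `78789D` is neither a valid number nor a valid identifier, which is why the tokenizer rejects it while `21328` passes as a numeric literal. Where the database interface supports bound parameters, those are generally a safer choice than pasting values into the query text.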
Re: [R] Unrecognized token
thanks, Jeff, good point... I'll try that

On Tue, Sep 17, 2013 at 9:43 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:

> Why don't you print the 'query' variable with each id value and
> consider what the SQL syntax is for number and string literals. Then
> study the use of escaping in strings (\\) to fix the query.
>
> Jeff Newmiller
> Sent from my phone. Please excuse my brevity.
Re: [R] Unrecognized token
There is no difference, the same query structure is in both cases:

    6683   character  character  select * from students where student_id = 6683 order by date_time
    4738D  character  character  select * from students where student_id = 4738D order by date_time

and still it is the same error

On Tue, Sep 17, 2013 at 9:47 AM, srecko joksimovic <sreckojoksimo...@gmail.com> wrote:

> thanks, Jeff, good point... I'll try that
[R] split on change occurence
Hi,

I had an example like this:

    id  user  action
    1   12    login
    2   12    view
    3   12    view
    4   12    view
    5   12    login
    6   12    view
    7   12    view
    8   12    login

which I split using

    split(dat1, cumsum(dat1$action=="login"))

If I had a similar example:

    id  user  IP
    1   12    ip1
    2   12    ip1
    3   12    ip2
    4   12    ip2
    5   12    ip2
    6   12    ip3
    7   12    ip3
    8   12    ip3

how can I split the data frame to obtain the following structure:

    #1
    1   12    ip1
    2   12    ip1

    #2
    3   12    ip2
    4   12    ip2
    5   12    ip2

    #3
    6   12    ip3
    7   12    ip3
    8   12    ip3

thanks,
Srecko
Re: [R] split on change occurence
Thanks... I don't know why I didn't try that... I guess I was in a hurry... I apologize for posting such a simple question.

On Mon, Sep 16, 2013 at 3:44 PM, Rui Barradas <ruipbarra...@sapo.pt> wrote:

> Hello,
>
> That's an even simpler case for ?split.
>
>     dat <- read.table(text = "
>     id user IP
>     1 12 ip1
>     2 12 ip1
>     3 12 ip2
>     4 12 ip2
>     5 12 ip2
>     6 12 ip3
>     7 12 ip3
>     8 12 ip3
>     ", header = TRUE)
>
>     split(dat, dat$IP)
>
> Hope this helps,
>
> Rui Barradas
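`split(dat, dat$IP)` groups by value, which matches the requested output as long as each IP occurs in a single consecutive block. If an IP can come back later and the goal is to split "on change", a run identifier is needed instead. The sketch below is only an illustration under that assumption: the data are a modified version of the posted example in which `ip1` recurs, and `run_id` is a name introduced here.

```r
dat <- data.frame(
  id   = 1:8,
  user = 12,
  # Hypothetical variation of the posted data: ip1 recurs in rows 6-7.
  IP   = c("ip1", "ip1", "ip2", "ip2", "ip2", "ip1", "ip1", "ip3"),
  stringsAsFactors = FALSE
)

# Grouping by value merges both ip1 blocks into one piece.
by_value <- split(dat, dat$IP)

# Start a new group whenever the IP differs from the previous row.
run_id <- cumsum(c(TRUE, dat$IP[-1] != dat$IP[-nrow(dat)]))
by_run <- split(dat, run_id)

print(length(by_value))
print(length(by_run))
print(by_run)
```

This is the same `cumsum()` idea as in the login example from the question, with "value changed" taking the place of "action is login".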
[R] Add new calculated column to data frame
Hi,

I have the following data set:

    id  event   time (in sec)
    1   add     1373502892
    2   add     1373502972
    3   delete  1373502995
    4   view    1373503896
    5   add     1373503996
    ...

I'd like to add a new column "time on task", which is the time elapsed between two events (id2 - id1...). What would be the best approach to do that?

Thanks,
Srecko
Re: [R] Add new calculated column to data frame
Thanks Arun,

this is great. However, it should be just a little bit different:

    #  id  event       time time_on_task
    #1  1    add 1373502892           80
    #2  2    add 1373502972           23
    #3  3 delete 1373502995          901
    #4  4   view 1373503896          100
    #5  5    add 1373503996           NA

When I calculate the difference, I need to know how long each activity was. It is id2-id1 for the first activity...

On Thu, Aug 29, 2013 at 11:03 AM, arun <smartpink...@yahoo.com> wrote:

> Hi,
> Try:
>
>     dat1<- read.table(text="
>     id event time
>     1 add 1373502892
>     2 add 1373502972
>     3 delete 1373502995
>     4 view 1373503896
>     5 add 1373503996
>     ",sep="",header=TRUE,stringsAsFactors=FALSE)
>     dat1$time_on_task<- c(NA,diff(dat1$time))
>     dat1
>     #  id  event       time time_on_task
>     #1  1    add 1373502892           NA
>     #2  2    add 1373502972           80
>     #3  3 delete 1373502995           23
>     #4  4   view 1373503896          901
>     #5  5    add 1373503996          100
>
>     #Not sure whether this depends on the values of "event" or not..
>
> A.K.
Re: [R] Add new calculated column to data frame
Hi Arun, There is one more question... you explained me how to use split(dat1,cumsum(dat1$action==login)) in one of previous questions, and that is great. Now, if I have something like this: id moduleevent time time_on_task 1 sys login 1373502892 80 2 taskadd 1373502892 80 3 taskadd 1373502972 23 4 sys login 1373502892 80 5 list delete 1373502995 901 6 list view 1373503896 100 7 taskadd 1373503996 NA I know how to split at each login occurrence, and I know how to add new column with time differences. But, how to add new column category which will be calculated based on columns module and even? For example if module=task and event=add = category= A... Srecko On Thu, Aug 29, 2013 at 11:22 AM, arun smartpink...@yahoo.com wrote: Hi Srecko, No problem. Regards, Arun From: srecko joksimovic sreckojoksimo...@gmail.com To: arun smartpink...@yahoo.com Sent: Thursday, August 29, 2013 2:22 PM Subject: Re: [R] Add new calculated column to data frame Sorry... I should figure it out... thanks so much! Srecko On Thu, Aug 29, 2013 at 11:21 AM, arun smartpink...@yahoo.com wrote: Hi, The one you showed is: dat1$time_on_task- c(diff(dat1$time),NA) dat1 # id event time time_on_task #1 1add 1373502892 80 #2 2add 1373502972 23 #3 3 delete 1373502995 901 #4 4 view 1373503896 100 #5 5add 1373503996 NA From: srecko joksimovic sreckojoksimo...@gmail.com To: arun smartpink...@yahoo.com Cc: R help r-help@r-project.org Sent: Thursday, August 29, 2013 2:15 PM Subject: Re: [R] Add new calculated column to data frame Thanks Arun, this is great. However, it should be just a little bit different: # id event time time_on_task #1 1add 1373502892 80 #2 2add 1373502972 23 #3 3 delete 1373502995 901 #4 4 view 1373503896 100 #5 5add 1373503996 NA When I calculate difference, I need to know how long each activity was. It is id2-id1 for the first activity... 
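For the category question above, one possible approach is a lookup table keyed on both columns. The following is only a sketch under stated assumptions: the `categories` table and the `category` column name are hypothetical, and the A to D assignments are taken from the sample rows that appear later in this thread, not from any real mapping.

```r
# Sample rows from the thread (only the columns needed here).
dat1 <- data.frame(
  id     = 1:7,
  module = c("sys", "task", "task", "sys", "list", "list", "task"),
  event  = c("login", "add", "add", "login", "delete", "view", "add"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per (module, event) combination.
categories <- data.frame(
  module   = c("task", "sys", "list", "list"),
  event    = c("add", "login", "delete", "view"),
  category = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Build a combined key on both sides and look every row up with match().
# match() keeps the row order of dat1; combinations missing from the
# lookup table end up as NA.
key_dat <- paste(dat1$module, dat1$event, sep = "|")
key_cat <- paste(categories$module, categories$event, sep = "|")
dat1$category <- categories$category[match(key_dat, key_cat)]

print(dat1)
```

merge(dat1, categories) would give the same column, but by default it sorts the result on the key columns, so the original event order would have to be restored afterwards.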
Re: [R] Add new calculated column to data frame
Thanks Berend,

I don't know why I didn't try that before posting the question... anyway, thanks for your help.

Srecko

On Thu, Aug 29, 2013 at 11:34 AM, Berend Hasselman b...@xs4all.nl wrote:

On 29-08-2013, at 20:15, srecko joksimovic sreckojoksimo...@gmail.com wrote:

Thanks Arun, this is great. However, it should be just a little bit different:

#  id  event       time time_on_task
#1  1    add 1373502892           80
#2  2    add 1373502972           23
#3  3 delete 1373502995          901
#4  4   view 1373503896          100
#5  5    add 1373503996           NA

When I calculate the difference, I need to know how long each activity was. It is id2-id1 for the first activity...

then why don't you try

dat1$time_on_task <- c(diff(dat1$time), NA)

Berend
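To make the difference between the two variants discussed in this thread concrete, here is a small self-contained sketch. The data are the five sample rows from the original question; `time_since_prev` is a hypothetical column name used only to show the other alignment.

```r
# Sample data from the thread; time is in seconds.
dat1 <- data.frame(
  id    = 1:5,
  event = c("add", "add", "delete", "view", "add"),
  time  = c(1373502892, 1373502972, 1373502995, 1373503896, 1373503996),
  stringsAsFactors = FALSE
)

# diff() returns n - 1 differences, so one NA is needed to pad to n rows.
# NA at the end: each row gets the time until the NEXT event,
# i.e. how long that activity lasted.
dat1$time_on_task <- c(diff(dat1$time), NA)

# NA at the start: each row gets the time since the PREVIOUS event.
dat1$time_since_prev <- c(NA, diff(dat1$time))

print(dat1)
```

Which alignment is right depends on whether the timestamp marks the start or the end of an activity; the thread settles on the first one.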
Re: [R] Add new calculated column to data frame
Hi Arun, this could do the work... Thanks so much!

On Thu, Aug 29, 2013 at 3:10 PM, arun smartpink...@yahoo.com wrote:

Hi,
It's not really clear, but you can try this:

dat1 <- read.table(text="
id module event time time_on_task Categ url
1 sys login 1373502892 80 B http://post/add?id=42&idp=45
2 task add 1373502892 80 A http://post/add?id=33&idp=45
3 task add 1373502972 23 A http://post/add?id=34&idp=45
4 sys login 1373502892 80 B http://post/add?id=39&idp=42
5 list delete 1373502995 901 C http://post/add?id=37&idp=41
6 list view 1373503896 100 D http://post/add?id=36&idp=46
7 task add 1373503996 NA A http://post/add?id=31&idp=45
", sep="", header=TRUE, stringsAsFactors=FALSE)
vec1 <- as.numeric(gsub(".*\\?.*=(\\d+)\\&.*", "\\1", dat1$url[dat1$Categ=="A"]))
vec1
#[1] 33 34 31
dat2 <- read.table(text="
id idpost idtopic iduser
1 45 33 101
2 46 34 102
3 47 33 103
4 48 33 101
5 49 35 104
", sep="", header=TRUE)
dat1$Categ[dat1$Categ=="A"][!vec1 %in% dat2$idtopic] <- "F"
dat1
#  id module  event       time time_on_task Categ                          url
#1  1    sys  login 1373502892           80     B http://post/add?id=42&idp=45
#2  2   task    add 1373502892           80     A http://post/add?id=33&idp=45
#3  3   task    add 1373502972           23     A http://post/add?id=34&idp=45
#4  4    sys  login 1373502892           80     B http://post/add?id=39&idp=42
#5  5   list delete 1373502995          901     C http://post/add?id=37&idp=41
#6  6   list   view 1373503896          100     D http://post/add?id=36&idp=46
#7  7   task    add 1373503996           NA     F http://post/add?id=31&idp=45
A.K.

From: srecko joksimovic sreckojoksimo...@gmail.com
To: arun smartpink...@yahoo.com
Sent: Thursday, August 29, 2013 5:38 PM
Subject: Re: [R] Add new calculated column to data frame

Hi Arun,
I really appreciate your help, and we did a great job :) but now I think that R can do anything, so I'd like to try one more thing, if you don't mind...
from the table with categories,

#  id module  event       time time_on_task Categ url
#1  1    sys  login 1373502892           80     B http:
#2  2   task    add 1373502892           80     A http:
#3  3   task    add 1373502972           23     A http:
#4  4    sys  login 1373502892           80     B http:
#5  5   list delete 1373502995          901     C
#6  6   list   view 1373503896          100     D
#7  7   task    add 1373503996           NA     A

I'd like to use only a certain category (for example A). Each of these rows has a URL whose format is something like http://post/add?id=33&idp=45. The first step would be to extract this id (33 in this case). Based on that value, I want to find all iduser values in the following table:

id idpost idtopic iduser
 1     45      33    101
 2     46      34    102
 3     47      33    103
 4     48      33    101
 5     49      35    104

The next step would be to check whether at least one of these values (iduser) is not in the vector users (ids only). If that is the case, I want to change the category to F; if not, I want to keep the same category. If this is too much for one question, I'll implement this in Java, but I'd really like to try this with R. Maybe the id extraction from the URL is the most important problem... I tried most of these steps, but I'm still not able to put them all together...

Thank you so much for your time.
Srecko

On Thu, Aug 29, 2013 at 12:22 PM, arun smartpink...@yahoo.com wrote:

Hi Srecko,
No problem.
Arun

From: srecko joksimovic sreckojoksimo...@gmail.com
To: arun smartpink...@yahoo.com
Sent: Thursday, August 29, 2013 3:19 PM
Subject: Re: [R] Add new calculated column to data frame

This is great Arun, thank you again. I was thinking of using sqldf and issuing a query for each module-action combination, but this is much better. Since I have a table with categories (module, action, category), I could create the vector levels from the first two columns and the vector labels from the category column, and that should do the work...

Best,
Srecko

On Thu, Aug 29, 2013 at 12:16 PM, arun smartpink...@yahoo.com wrote:

Hi Srecko,
You didn't mention the order in which the letters are assigned.
If you need a different order, just change the order in the levels=c() argument.
Arun

----- Original Message -----
From: arun smartpink...@yahoo.com
To: srecko joksimovic sreckojoksimo...@gmail.com
Cc: R help r-help@r-project.org
Sent: Thursday, August 29, 2013 3:13 PM
Subject: Re: [R] Add new calculated column to data frame

Hi,
You could try this:

dat1 <- read.table(text="
id module event time
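Since the id extraction from the URL is named above as the main difficulty, here is a minimal sketch of that single step in base R. It assumes the query parameters are separated by "&" (the separator appears to have been lost in the archived text), and `extract_id` is a hypothetical helper name, not something from the thread.

```r
# Hypothetical helper: pull the numeric value of the "id" query parameter
# out of each URL; URLs without such a parameter give NA.
extract_id <- function(url) {
  # regexec() returns the full match plus the capture group for every URL;
  # "[?&]id=" ensures that "idp=" is not picked up by mistake.
  m <- regmatches(url, regexec("[?&]id=([0-9]+)", url))
  id_chr <- vapply(
    m,
    function(x) if (length(x) == 2) x[2] else NA_character_,
    character(1)
  )
  as.numeric(id_chr)
}

urls <- c(
  "http://post/add?id=33&idp=45",
  "http://post/add?id=34&idp=45",
  "http://",
  "http://post/add?id=31&idp=45"
)
ids <- extract_id(urls)
print(ids)
```

Compared with a single gsub() call, this version does not return the whole URL unchanged when nothing matches, so rows without an id show up as NA instead of triggering a coercion warning.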
Re: [R] Add new calculated column to data frame
Thanks, I'll try this as well.
Srecko

On Thu, Aug 29, 2013 at 3:26 PM, arun smartpink...@yahoo.com wrote:

Hi Srecko,
Try this:

dat1 <- read.table(text="
id module event time time_on_task Categ url
1 sys login 1373502892 80 B http://
2 task add 1373502892 80 A http://post/add?id=33&idp=67
3 task add 1373502972 23 A http://post/add?id=34&idp=67
4 sys login 1373502892 80 B http://
5 list delete 1373502995 901 C http://
6 list view 1373503896 100 D http://
7 task add 1373503996 NA A http://post/add?id=35&idp=99
", sep="", header=TRUE, stringsAsFactors=FALSE)
vec1 <- as.numeric(gsub(".*\\?.*=(\\d+)\\&.*", "\\1", dat1$url[dat1$Categ=="A"]))
dat2 <- read.table(text="
id idpost idtopic iduser
1 45 33 101
2 46 34 102
3 47 33 103
4 48 33 101
5 49 35 104
", sep="", header=TRUE)
student_list <- c(101:102, 104:107)
vec2 <- with(dat2, tapply(iduser, list(idtopic), FUN=function(x) all(x %in% student_list)))
dat1$Categ[dat1$Categ=="A"][match(vec1, as.numeric(names(vec2)))[!vec2]] <- "F"
dat1
#  id module  event       time time_on_task Categ                          url
#1  1    sys  login 1373502892           80     B                      http://
#2  2   task    add 1373502892           80     F http://post/add?id=33&idp=67
#3  3   task    add 1373502972           23     A http://post/add?id=34&idp=67
#4  4    sys  login 1373502892           80     B                      http://
#5  5   list delete 1373502995          901     C                      http://
#6  6   list   view 1373503896          100     D                      http://
#7  7   task    add 1373503996           NA     A http://post/add?id=35&idp=99
A.K.

From: srecko joksimovic sreckojoksimo...@gmail.com
To: arun smartpink...@yahoo.com
Sent: Thursday, August 29, 2013 6:04 PM
Subject: Re: [R] Add new calculated column to data frame

"Did you mean to separate the number 33 from the link?" Yes, that is correct.
It should be something like this:

#  id module  event       time time_on_task Categ                          url
#1  1    sys  login 1373502892           80     B                      http://
#2  2   task    add 1373502892           80     A http://post/add?id=33&idp=67
#3  3   task    add 1373502972           23     A http://post/add?id=34&idp=67
#4  4    sys  login 1373502892           80     B                      http://
#5  5   list delete 1373502995          901     C                      http://
#6  6   list   view 1373503896          100     D                      http://
#7  7   task    add 1373503996           NA     A http://post/add?id=35&idp=99

From this table I should get 3 rows with 3 URLs: http://post/add?id=33&idp=67, http://post/add?id=34&idp=67, and http://post/add?id=35&idp=99. For each of them, I need to extract the id (33, 34, and 35). Once I do that, I need to obtain the users from this table, again for each id:

id idpost idtopic iduser
 1     45      33    101
 2     46      34    102
 3     47      33    103
 4     48      33    101
 5     49      35    104

This means:

id = 33 => 101, 103
id = 34 => 102
id = 35 => 104

Next, for each vector I need to check whether or not all its values are in the students list (101, 102, 104, 105, 106, 107):

id = 33 => FALSE (since 103 is not in the list)
id = 34 => TRUE
id = 35 => TRUE

This means that the category for row 2 in the first table is not A any more, but F...

Thanks,
Srecko

On Thu, Aug 29, 2013 at 2:56 PM, arun smartpink...@yahoo.com wrote:

Hi Srecko,
Did you mean to separate the number 33 from the link? Could you provide a reproducible example with the output you expected?
Tx.
Arun
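The three steps described above (extract the id, collect the users per topic, compare against the student list) can also be written without positional indexing into the subset of A rows. This is a sketch rather than the method used in the thread. It assumes "&" separates the URL parameters, and it leaves a row unchanged when its topic id does not occur in dat2, which is an assumption the thread does not settle. The names `topic`, `all_students`, and `to_f` are made up for the example.

```r
# Sample data from the thread (only the columns needed here).
dat1 <- data.frame(
  id    = 1:7,
  Categ = c("B", "A", "A", "B", "C", "D", "A"),
  url   = c("http://", "http://post/add?id=33&idp=67",
            "http://post/add?id=34&idp=67", "http://", "http://",
            "http://", "http://post/add?id=35&idp=99"),
  stringsAsFactors = FALSE
)
dat2 <- data.frame(
  idtopic = c(33, 34, 33, 33, 35),
  iduser  = c(101, 102, 103, 101, 104)
)
student_list <- c(101, 102, 104, 105, 106, 107)

# Step 1: topic id per row (NA when the URL carries no id parameter).
m <- regmatches(dat1$url, regexec("[?&]id=([0-9]+)", dat1$url))
topic <- as.numeric(vapply(
  m, function(x) if (length(x) == 2) x[2] else NA_character_, character(1)
))

# Step 2: for every topic, are ALL of its users in the student list?
all_students <- vapply(
  split(dat2$iduser, dat2$idtopic),
  function(u) all(u %in% student_list),
  logical(1)
)

# Step 3: relabel category A rows whose topic has at least one non-student.
ok <- unname(all_students[as.character(topic)])  # NA for unknown topics
to_f <- dat1$Categ == "A" & !is.na(ok) & !ok
dat1$Categ[to_f] <- "F"

print(dat1)
```

Looking the result up by topic id for each row, instead of by position, means the outcome does not depend on the category A rows and the topics appearing in the same order.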
[R] Iterate over rows and update values based on condition
Hi,
I have a data set with a structure similar to this:

id user action
 1   12  login
 2   12   view
 3   12   view
 4   12   view
 5   12  login
 6   12   view
 7   12   view
 8   12  login

I want to create a list of sessions, that is, to split the table at every occurrence of login. Using Java (or some other language), I would probably iterate through the rows and create a new List instance on every login, but I guess there is a more efficient way to do that in R?

Thanks
Re: [R] Iterate over rows and update values based on condition
This is great! Thank you so much.

On Tue, Aug 27, 2013 at 3:06 PM, arun smartpink...@yahoo.com wrote:

Hi,
Maybe this helps:

dat1 <- read.table(text="
id user action
1 12 login
2 12 view
3 12 view
4 12 view
5 12 login
6 12 view
7 12 view
8 12 login
", sep="", header=TRUE, stringsAsFactors=FALSE)
split(dat1, cumsum(dat1$action=="login"))
#$`1`
#  id user action
#1  1   12  login
#2  2   12   view
#3  3   12   view
#4  4   12   view
#
#$`2`
#  id user action
#5  5   12  login
#6  6   12   view
#7  7   12   view
#
#$`3`
#  id user action
#8  8   12  login
A.K.

----- Original Message -----
From: srecko joksimovic sreckojoksimo...@gmail.com
To: R-help@r-project.org
Cc:
Sent: Tuesday, August 27, 2013 3:29 PM
Subject: [R] Iterate over rows and update values based on condition

Hi,
I have a data set with structure similar to this:

id user action
 1   12  login
 2   12   view
 3   12   view
 4   12   view
 5   12  login
 6   12   view
 7   12   view
 8   12  login

I want to create a list of sessions. That means to split table on every occurrence of login. Using Java (or some other language), I would probably iterate through rows and create new List instance on every login, but I guess there is more efficient way to do that using R?

Thanks
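The reason the cumsum() approach works may be easier to see with the intermediate vector written out. The sketch below uses the eight sample rows from the question; the `session` column and the `n_events` summary are hypothetical additions for illustration.

```r
dat1 <- data.frame(
  id     = 1:8,
  user   = 12,
  action = c("login", "view", "view", "view", "login", "view", "view", "login"),
  stringsAsFactors = FALSE
)

# TRUE on every login row; cumsum() counts TRUE as 1, so the running total
# goes up by one at each login and stays constant in between.
is_login <- dat1$action == "login"
dat1$session <- cumsum(is_login)

# One data frame per session, same result as splitting on cumsum() directly.
sessions <- split(dat1, dat1$session)

# Example per-session summary: number of events in each session.
n_events <- vapply(sessions, nrow, integer(1))
print(n_events)
```

Keeping the counter as a column can be handy when the sessions are later aggregated rather than processed one list element at a time. If the log held several users, the counter would presumably have to restart for each user, for example with ave(as.integer(is_login), dat1$user, FUN = cumsum); the sample data contain only one user, so that case is not exercised here.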