Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
Simon Urbanek wrote:

>> I could *not* reproduce it; that is, ‘table’ is as fast on the
>> non-ASCII factor as it is on the ASCII factor.
>
> Strange - are you sure you get the right locale names? Make sure it's
> listed in locale -a.

Yes, I managed to reproduce it now, using a locale listed in ‘locale -a’.
There is a performance hit, though *much* smaller than on Windows.

> FWIW if you care about speed you should use tabulate() instead - it's
> much faster and incurs no penalty:

Yes, that’s the solution I ended up using:

  res = tabulate(x, nbins=nlevels(x)) # nbins needed for levels that don’t occur
  names(res) = levels(x)
  res

(Though I’m not sure it’s *guaranteed* that factors are internally stored
in a way that makes this work, i.e., as the numbers 1, 2, ... for level
1, 2, ...)

Anyway, do you think it’s worth trying to change the ‘table’ function the
way I outlined in my first post¹? This should eliminate the performance hit
on all platforms. However, it will introduce a performance hit (CPU and
memory use) if the elements of ‘exclude’ make up a large part of the
factor(s).

¹ http://permalink.gmane.org/gmane.comp.lang.r.devel/26576

--
Karl Ove Hufthammer

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
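A minimal sketch of the tabulate() approach discussed in this message. The factor below is made up for illustration (the level "c" is declared but never occurs), and the comparison against table() is only a sanity check, not a benchmark.

```r
# Illustrative factor: level "c" is declared but never occurs
x <- factor(c("b", "a", "b"), levels = c("a", "b", "c"))

# Count occurrences per level via the factor's integer codes;
# nbins keeps a slot for levels that do not occur
res <- tabulate(x, nbins = nlevels(x))
names(res) <- levels(x)

# The counts should agree with what table() reports
same <- all(res == as.vector(table(x)))
```

Without nbins, tabulate() would stop at the largest code that actually occurs, so trailing empty levels would be dropped and the names would no longer line up.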
Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows
Karl Ove Hufthammer wrote:

> Anyway, do you think it’s worth trying to change the ‘table’ function the
> way I outlined in my first post¹? This should eliminate the performance
> hit on all platforms.

Some additional notes:

‘table’ uses ‘factor’ directly, but also indirectly, in ‘addNA’. The
definition of ‘addNA’ ends with:

  if (!any(is.na(ll))) ll <- c(ll, NA)
  factor(x, levels = ll, exclude = NULL)

Which is slow for non-ASCII levels. One *could* fix this by changing the
last line to

  attr(x, "levels") = ll

But one soon ends up changing every function that uses ‘factor’ in this
way, which seems like the wrong approach. The problem lies inside ‘factor’,
and that’s where it should be fixed, if feasible.

BTW, the definition of ‘addNA’ looks suboptimal in a different way. The
last line is always executed, even if the factor *does* contain NA values
(and of course NA levels). For this case, basically it’s doing nothing, just
taking a very long time doing it (at least on Windows). Moving the last line
inside the ‘if’ clause, and adding an ‘else return(x)’ would fix this
(correct me if I’m wrong).

--
Karl Ove Hufthammer
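To make concrete what any replacement for the last line of ‘addNA’ has to preserve, here is a small worked example of its current behaviour. The input factor is invented for illustration. Note that the integer codes change as well as the levels, which a change to the levels attribute alone would presumably not reproduce.

```r
x <- factor(c("a", "b", NA))   # NA is a missing value here, not a level
y <- addNA(x)                  # NA becomes an extra (last) level

lev_x  <- levels(x)            # "a" "b"
lev_y  <- levels(y)            # "a" "b" NA
code_x <- as.integer(x)        # 1 2 NA
code_y <- as.integer(y)        # 1 2 3
```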
[Rd] aggregate(as.formula(some formula), data, function) error when called from in a function
I'm having a problem with aggregate.formula when I call it in a function
and the formula is converted from a string in the function.

I think my problem may also only occur when the left hand side of the
formula is cbind(...)

Here is example code that generates a dataset and then the error.

The first function agg2 fails

  agg2(FALSE)
  do agg 2
  Error in m[[2L]][[2L]] : object of type 'symbol' is not subsettable

but if I have it return what it is going to pass to aggregate and then
pass that myself, it works. I can use this for a workaround (agg3) where
one function does this itself. I'm confused by the behavior. Is there some
way to not have to use a separate function to make the call?

==
# start R code
# idea: in a function, count the number of instances
# of some factor (y) associated with another
# factor (x). aggregate.formula appears to be
# able to do this... but I have a problem if all of the following:
# (1) It is called in a function
# (2) the formula is created using as.formula(character)
# calling aggregate with the same formula (created with as.formula)
# outside the function works fine.
agg2 <- function(test=FALSE) {
  # create a factor y
  dat <- data.frame(y=sample(LETTERS[1:3],100,replace=TRUE))
  # create a factor x
  dat$x <- sample(letters[1:4],100,replace=TRUE)
  # make a column of 1s and zeros
  # 1 when that row has that level of y
  # 0 otherwise
  lvls <- levels(dat$y)
  dat$ya <- 1*(dat[,1] == lvls[1])
  dat$yb <- 1*(dat[,1] == lvls[2])
  dat$yc <- 1*(dat[,1] == lvls[3])
  # this works fine if you give the exact function
  agg1 <- aggregate(cbind(ya,yb,yc)~x,data=dat,sum)
  # and fine if you accept
  fo <- as.formula("cbind(ya,yb,yc)~x")
  if(test) {
    return(list(fo=fo,data=dat))
  }
  cat("do agg 2\n")
  agg2 <- aggregate(fo,data=dat,sum)
  list(agg1,agg2)
}
agg2(FALSE)
ag <- agg2(TRUE)
ag$fo
aggregate(ag$fo,ag$data,sum)
agg3 <- function() {
  ag <- agg2(TRUE)
  ag$fo
  aggregate(ag$fo,ag$data,sum)
}
agg3()
# end R code
==

Paul Bailey
University of Maryland
Re: [Rd] aggregate(as.formula(some formula), data, function) error when called from in a function
On Wed, Jan 26, 2011 at 2:04 PM, Paul Bailey pdbai...@umd.edu wrote:
> I'm having a problem with aggregate.formula when I call it in a function
> and the formula is converted from a string in the function.
> [...]
> The first function agg2 fails
>
>   agg2(FALSE)
>   do agg 2
>   Error in m[[2L]][[2L]] : object of type 'symbol' is not subsettable
> [...]
>   fo <- as.formula("cbind(ya,yb,yc)~x")
> [...]
>   agg2 <- aggregate(fo,data=dat,sum)
> [...]
> Paul Bailey
> University of Maryland

The problem is that the aggregate statement:

  agg2 <- aggregate(fo, data = dat, sum)

is using non-standard evaluation and is literally looking at fo rather
than fo's value. This may be a bug in aggregate.formula, but at any rate
you could try replacing that statement with the following to force fo to
be evaluated:

  agg2 <- do.call(aggregate, list(fo, data = dat, FUN = sum))

--
Statistics Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
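The do.call() workaround can be sketched on a tiny, fixed data set so the result is easy to check by hand. The function name count_by_x and the four-row data frame are assumptions made up for this illustration; only the do.call() line comes from the reply above.

```r
# Hypothetical helper: builds the formula from a string inside a function
# and aggregates via do.call(), so aggregate() receives the formula object
# itself rather than the symbol `fo`
count_by_x <- function() {
  dat <- data.frame(
    x  = c("a", "a", "b", "b"),
    ya = c(1, 0, 1, 1),
    yb = c(0, 1, 0, 0)
  )
  fo <- as.formula("cbind(ya, yb) ~ x")
  do.call(aggregate, list(fo, data = dat, FUN = sum))
}

res <- count_by_x()
```

Because do.call() evaluates the list elements before building the call, the call that aggregate.formula inspects contains the formula and data frame as values.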
[Rd] Error handling with frozen RCurl function calls + Identification of frozen R processes
Dear list,

I'm tackling an empirical research problem that requires me to address a
whole bunch of conceptual and technical details at the same time, which
leaves little time for the nitty-gritty details of the components
involved. Having said this, I'm lacking the time at the moment to dive
deeply into parallel computing and HTTP requests via RCurl, and I hope you
can help me out with one or two pressing issues of my crawler/scraper:

Once a day, I'm running

  RCurl::getURIAsynchronous(x=URL.frontier.sub, multiHandle=my.multi.handle)

within an lapply() construct in order to read chunks of deterministically
composed URLs from a host. There are courtesy time delays implemented
between the individual HTTP requests (5 times the time the last request
from this host took) so that I'm not clogging the host. I'm causing about
15 minutes of traffic per day.

The problem is that 'getURIAsynchronous()' sometimes simply freezes and I
don't have a clue why. I also can't reproduce the error, as it's totally
erratic. I tried to put the function inside a try() or tryCatch() construct
to no avail. Also, I've experimented with a couple of timeout options of
Curl, but honestly didn't really understand all the implications. None
worked so far. It seems that upon an error 'getURIAsynchronous()' simply
does not give control back to the R process.

Additionally, due to a lack of profound knowledge in parallel computing,
the program is scripted to run a bunch of R processes independently.
Communication between them takes place via variables they read from and
write to disc in order to have some sort of shared environment (horrible,
I know ;-)).

So here are my specific questions:

1) Is it possible to catch connection or timeout errors in RCurl functions
   in a way that allows me to implement my customized error handling? If
   so, could you guide me to some examples, please?
2) Can I somehow identify frozen Rterm or Rscript processes (e.g. via
   using Sys.getpid()?) in order to shut them down and reinitialize them?

You'll find my session info below. Thanks for any hints or advice!

Janko

sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] tcltk     tools     stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] RCurl_1.5-0.1    bitops_1.0-4.1   XML_3.2-0.2      RMySQL_0.7-5
 [5] filehash_2.1-1   hash_2.0.1       timeDate_2130.91 RODBC_1.3-2
 [9] MiscPsycho_1.6   statmod_1.4.8    debug_1.2.4      mvbutils_2.5.4
[13] DBI_0.2-5        cwhmisc_2.1      lattice_0.19-13

loaded via a namespace (and not attached):
[1] grid_2.12.1
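Regarding question 2, one possible approach (a sketch only, not something proposed in the thread) fits the existing file-based communication: each worker periodically writes its Sys.getpid() to a heartbeat file, and a supervisor treats a file that has not been refreshed for too long as a sign of a frozen process. The names write_heartbeat, is_stale and read_pid, as well as the 600 second threshold, are hypothetical and chosen for illustration.

```r
# Worker side: record this process's PID; rewriting the file refreshes
# its modification time, which serves as the heartbeat
write_heartbeat <- function(path) {
  writeLines(as.character(Sys.getpid()), path)
  invisible(path)
}

# Supervisor side: TRUE if the heartbeat is older than max_age seconds
is_stale <- function(path, max_age, now = Sys.time()) {
  age <- as.numeric(difftime(now, file.info(path)$mtime, units = "secs"))
  age > max_age
}

# Supervisor side: PID recorded by the worker
read_pid <- function(path) as.integer(readLines(path, n = 1L))

hb <- tempfile("heartbeat_")
write_heartbeat(hb)
fresh_now  <- is_stale(hb, max_age = 600)                          # just written
stale_late <- is_stale(hb, max_age = 600, now = Sys.time() + 3600) # seen an hour later

# A supervisor could then terminate the recorded process, e.g.
# tools::pskill(read_pid(hb))   # left commented out on purpose
```

A worker that hangs inside a blocking call stops refreshing its file, so the supervisor can notice it without any cooperation from the frozen process.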
[Rd] Dealing with R list objects in C/C++
Hi,

I'd like to construct an R list object in C++, fill it with relevant data,
and pass it to an R function which will return a different list object
back. I have browsed through all the R manuals, and examples under
tests/Embedding, but can't figure out the correct way. Below is my code
snippet:

#include <Rinternals.h>
// Rf_initEmbeddedR and other setups already performed

SEXP arg, ret;

// this actually creates a pairlist. I can't find any API that creates a list
PROTECT(arg = allocList(3));

// I want the first element to be type integer, second double, and third a vector.
INTEGER(arg)[0] = 1;  // <- runtime exception: INTEGER() can only be applied to a 'integer', not a 'pairlist'
REAL(arg)[1] = 2.5;   // control never reached here

VECTOR_PTR(arg)[2] = allocVector(REALSXP, 4);
REAL(VECTOR_PTR(arg)[2])[0] = 10.0;
REAL(VECTOR_PTR(arg)[2])[1] = 11.0;
REAL(VECTOR_PTR(arg)[2])[2] = 12.0;
REAL(VECTOR_PTR(arg)[2])[3] = 13.0;

PROTECT(call = lang2(install(entryPoint.c_str()), arg));

ret = R_tryEval(call, R_GlobalEnv, errorOccurred);

I'll be grateful if you can point me to any online docs/samples.

Thanks in advance,
Wayne
Re: [Rd] Dealing with R list objects in C/C++
Hi Wayne,

On 26 January 2011 at 17:56, wayne.zh...@barclayscapital.com wrote:
| Hi,
|
| I'd like to construct an R list object in C++, fill it with relevant data, and pass it to an R function which will return a different list object back. I have browsed through all the R manuals, and examples under tests/Embedding, but can't figure out the correct way. Below is my code snippet:
|
| #include <Rinternals.h>
| // Rf_initEmbeddedR and other setups already performed
|
| SEXP arg, ret;
|
| // this actually creates a pairlist. I can't find any API that creates a list
| PROTECT(arg = allocList(3));
|
| // I want the first element to be type integer, second double, and third a vector.
| INTEGER(arg)[0] = 1;  // <- runtime exception: INTEGER() can only be applied to a 'integer', not a 'pairlist'
| REAL(arg)[1] = 2.5;   // control never reached here
|
| VECTOR_PTR(arg)[2] = allocVector(REALSXP, 4);
| REAL(VECTOR_PTR(arg)[2])[0] = 10.0;
| REAL(VECTOR_PTR(arg)[2])[1] = 11.0;
| REAL(VECTOR_PTR(arg)[2])[2] = 12.0;
| REAL(VECTOR_PTR(arg)[2])[3] = 13.0;
|
| PROTECT(call = lang2(install(entryPoint.c_str()), arg));
|
| ret = R_tryEval(call, R_GlobalEnv, errorOccurred);
|
|
| I'll be grateful if you can point me to any online docs/samples.

This is a non-trivial problem when you use the C API provided by R. It is
all documented, but you need to study 'Writing R Extensions' in some
detail, as well as maybe 'R Programming' by Gentleman and/or 'Software for
Data Analysis' by Chambers.

But there is another API you can use. It is provided by RInside (to embed
R inside C++) which uses Rcpp (for R and C++ integration). Install those
two packages from CRAN, and then drop the few lines below as a file, say,
wayne.cpp in the examples/standard/ directory of RInside. Saying 'make
wayne' will build an executable, using proper flags and linker options, and
you can run that:

edd@max:~/svn/rinside/pkg/inst/examples/standard$ make wayne
g++ -I/usr/share/R/include -I/usr/local/lib/R/site-library/Rcpp/include -I/usr/local/lib/R/site-library/RInside/include -O3 -pipe -g -Wall wayne.cpp -L/usr/lib64/R/lib -lR -lblas -llapack -L/usr/local/lib/R/site-library/Rcpp/lib -lRcpp -Wl,-rpath,/usr/local/lib/R/site-library/Rcpp/lib -L/usr/local/lib/R/site-library/RInside/lib -lRInside -Wl,-rpath,/usr/local/lib/R/site-library/RInside/lib -o wayne
edd@max:~/svn/rinside/pkg/inst/examples/standard$ ./wayne
Showing list content:
L[0] 1
L[1] 2.5
L[2][0] 10
L[2][1] 11
Showing list content:
L[0] 42
L[1] 42
L[2][0] 10
L[2][1] 42
edd@max:~/svn/rinside/pkg/inst/examples/standard$

The code creates a list as you spec'ed with int, double and vector. The
list is shown on stdout, then passed to R, transformed by R and shown
again at the C++ level.

Questions on RInside and Rcpp are welcome on the rcpp-devel list.

Hope this helps, Dirk

-
#include <RInside.h>                    // for the embedded R via RInside

void show(const Rcpp::List & L) {
    // this function is cumbersome as we haven't defined << operators
    std::cout << "Showing list content:\n";
    std::cout << "L[0] " << Rcpp::as<int>(L[0]) << std::endl;
    std::cout << "L[1] " << Rcpp::as<double>(L[1]) << std::endl;
    Rcpp::IntegerVector v = Rcpp::as<Rcpp::IntegerVector>(L[2]);
    std::cout << "L[2][0] " << v[0] << std::endl;
    std::cout << "L[2][1] " << v[1] << std::endl;
}

int main(int argc, char *argv[]) {

    // create an embedded R instance
    RInside R(argc, argv);

    Rcpp::List mylist(3);
    mylist[0] = 1;
    mylist[1] = 2.5;
    Rcpp::IntegerVector v(2); v[0] = 10; v[1] = 11; // with C++0x we could assign directly
    mylist[2] = v;
    show(mylist);

    // make the list visible to R under the name myRlist
    R["myRlist"] = mylist;
    std::string r_code = "myRlist[[1]] = 42; myRlist[[2]] = 42.0; myRlist[[3]][2] = 42; myRlist";

    // evaluate the R code and bring the modified list back to C++
    Rcpp::List reslist = R.parseEval(r_code);
    show(reslist);

    exit(0);
}
-

--
Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com
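For readers without RInside at hand, the R side of the round trip above can be replayed in a plain R session. The starting list below is an assumption about what the C++ code hands to R (an integer, a double, and an integer vector), written out by hand rather than produced by Rcpp.

```r
# Assumed R-side equivalent of the list built in C++
myRlist <- list(1L, 2.5, c(10L, 11L))

# The statements from r_code, run directly
myRlist[[1]] = 42; myRlist[[2]] = 42.0; myRlist[[3]][2] = 42

# Assigning the double 42 into the integer vector coerces it to double
third_type <- typeof(myRlist[[3]])
```

The values match the second block of the program output (42, 42, 10, 42).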
Re: [Rd] Dealing with R list objects in C/C++
On 01/26/2011 02:56 PM, wayne.zh...@barclayscapital.com wrote:
> Hi,
>
> I'd like to construct an R list object in C++, fill it with relevant
> data, and pass it to an R function which will return a different list
> object back. I have browsed through all the R manuals, and examples under
> tests/Embedding, but can't figure out the correct way. Below is my code
> snippet:
>
> #include <Rinternals.h>
> // Rf_initEmbeddedR and other setups already performed
>
> SEXP arg, ret;
>
> // this actually creates a pairlist. I can't find any API that creates a list
> PROTECT(arg = allocList(3));

Allocate a list of length 3 via SEXPTYPE VECSXP

  PROTECT(arg = allocVector(VECSXP, 3));

> // I want the first element to be type integer, second double, and third a vector.
> INTEGER(arg)[0] = 1;  // <- runtime exception: INTEGER() can only be applied to a 'integer', not a 'pairlist'

set the first element of the list to an integer vector of length 1, and
assign a value

  SET_VECTOR_ELT(arg, 0, allocVector(INTSXP, 1));
  INTEGER(VECTOR_ELT(arg, 0))[0] = 1;

or more succinctly

  SET_VECTOR_ELT(arg, 0, ScalarInteger(1));

> REAL(arg)[1] = 2.5;   // control never reached here

and the second element

  SET_VECTOR_ELT(arg, 1, ScalarReal(2.5));

> VECTOR_PTR(arg)[2] = allocVector(REALSXP, 4);

and for the third allocate a REALSXP and then fill

  SET_VECTOR_ELT(arg, 2, allocVector(REALSXP, 4));

next lines should be ok as

  REAL(VECTOR_ELT(arg, 2))[0] = 10.0;

or with less typing as

  double *x = REAL(VECTOR_ELT(arg, 2));
  x[0] = 10.0; x[1] = 11.0; x[2] = 12.0; x[3] = 13.0;

> REAL(VECTOR_PTR(arg)[2])[0] = 10.0;
> REAL(VECTOR_PTR(arg)[2])[1] = 11.0;
> REAL(VECTOR_PTR(arg)[2])[2] = 12.0;
> REAL(VECTOR_PTR(arg)[2])[3] = 13.0;
>
> PROTECT(call = lang2(install(entryPoint.c_str()), arg));

not sure where entryPoint.c_str() is coming from, but

  PROTECT(call = lang2(install("fun"), arg));

with some debate about whether install("fun") should be PROTECT'ed.

> ret = R_tryEval(call, R_GlobalEnv, errorOccurred);

likely PROTECT(ret = ...) while checking errorOccurred, etc.

Hope that helps,

Martin

> I'll be grateful if you can point me to any online docs/samples.
>
> Thanks in advance,
> Wayne

--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
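Seen from the R side, the distinction behind the original runtime error is the one between a pairlist (what allocList() creates) and a generic vector, i.e. an ordinary list (what allocVector(VECSXP, n) creates). A short sketch of the object the corrected C code is meant to build, written in R for illustration:

```r
# R-level equivalent of the three-element VECSXP built in C
arg <- list(1L, 2.5, c(10, 11, 12, 13))

type_list     <- typeof(arg)                  # "list"     (VECSXP in C)
type_pairlist <- typeof(pairlist(1L, 2.5))    # "pairlist" (LISTSXP in C)
elem_types    <- sapply(arg, typeof)          # "integer" "double" "double"
```

Each element is itself an R vector, which is why the C code has to store a separately allocated INTSXP or REALSXP in every slot instead of writing numbers into the list directly.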