Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows

2011-01-26 Thread Karl Ove Hufthammer
Simon Urbanek wrote:

>> I could *not* reproduce it; that is, ‘table’ is as fast on the non-ASCII
>> factor as it is on the ASCII factor.
>
> Strange - are you sure you get the right locale names? Make sure it's
> listed in locale -a.

Yes, I managed to reproduce it now, using a locale listed in ‘locale -a’.
There is a performance hit, though *much* smaller than on Windows.

> FWIW if you care about speed you should use tabulate() instead - it's much
> faster and incurs no penalty:

Yes, that's the solution I ended up using:

res = tabulate(x, nbins=nlevels(x)) # nbins needed for levels that don’t occur
names(res) = levels(x)
res

(Though I’m not sure it’s *guaranteed* that factors are internally stored in a
way that makes this work, i.e., as the numbers 1, 2, ... for levels 1, 2, ...)
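
On the parenthetical concern: as far as I know, ?factor describes a factor as a vector of integer codes with a "levels" attribute, so as.integer(x) should give 1, 2, ... up to nlevels(x), with NA for missing values. A minimal sketch with made-up data that checks the tabulate() approach against table() (both leave NA values out by default):

```r
# Made-up example data: a factor with non-ASCII levels, one of which never occurs.
x <- factor(c("\u00e6", "\u00f8", "\u00e5", "\u00e6"),
            levels = c("\u00e6", "\u00f8", "\u00e5", "\u00f6"))

# The integer codes index into levels(x): 1 for level 1, 2 for level 2, ...
codes <- as.integer(x)

# tabulate() counts the codes; nbins keeps the trailing unused level.
res <- tabulate(x, nbins = nlevels(x))
names(res) <- levels(x)

# Reference result from table(), reduced to a plain named vector.
ref <- table(x)
ref_vec <- setNames(as.vector(ref), names(ref))
```

If the two approaches agree, res and ref_vec hold the same counts under the same names, including a zero for the unused level.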

Anyway, do you think it’s worth trying to change the ‘table’ function the way I
outlined in my first post¹? This should eliminate the performance hit on all
platforms. However, it will introduce a performance hit (CPU and memory use)
if the elements of ‘exclude’ make up a large part of the factor(s).

¹ http://permalink.gmane.org/gmane.comp.lang.r.devel/26576

-- 
Karl Ove Hufthammer

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows

2011-01-26 Thread Karl Ove Hufthammer
Karl Ove Hufthammer wrote:

> Anyway, do you think it’s worth trying to change the ‘table’ function the
> way I outlined in my first post¹? This should eliminate the performance
> hit on all platforms.

Some additional notes: ‘table’ uses ‘factor’ directly, but also indirectly, 
in ‘addNA’. The definition of ‘addNA’ ends with:

    if (!any(is.na(ll)))
        ll <- c(ll, NA)
    factor(x, levels = ll, exclude = NULL)

Which is slow for non-ASCII levels. One *could* fix this by changing the 
last line to

  attr(x, "levels") = ll

But one soon ends up changing every function that uses ‘factor’ in this way, 
which seems like the wrong approach. The problem lies inside ‘factor’,
and that’s where it should be fixed, if feasible.

BTW, the definition of ‘addNA’ looks suboptimal in a different way. The last
line is always executed, even if the factor *does* contain NA values (and of
course NA levels). In this case, it is basically doing nothing, just taking
a very long time doing it (at least on Windows). Moving the last line inside
the ‘if’ clause, and adding an ‘else return(x)’ would fix this (correct me if
I’m wrong).
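
A minimal sketch of that suggestion, for illustration only. The name addNA_sketch is made up, and the first two lines of the body are an assumption about how ‘addNA’ begins. One case where the shortcut might differ from the current behaviour is a factor that has an NA level and *also* NA integer codes, since ‘factor’ would remap those codes to the NA level.

```r
# Hypothetical variant of addNA() following the suggestion above; the name
# addNA_sketch is made up, and this is not the code in base R.
addNA_sketch <- function(x, ifany = FALSE) {
  if (!is.factor(x)) x <- factor(x)
  if (ifany && !any(is.na(x))) return(x)
  ll <- levels(x)
  if (!any(is.na(ll))) {
    # Only rebuild the factor when an NA level actually has to be added.
    ll <- c(ll, NA)
    factor(x, levels = ll, exclude = NULL)
  } else {
    # NA is already a level: skip the expensive factor() call.
    x
  }
}

f1 <- factor(c("a", "b", NA))                  # no NA level yet
f2 <- factor(c("a", "b", NA), exclude = NULL)  # NA is already a level
r1 <- addNA_sketch(f1)
r2 <- addNA_sketch(f2)
```

For these two inputs the sketch should give the same levels and codes as addNA(f1) and addNA(f2).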

-- 
Karl Ove Hufthammer



[Rd] aggregate(as.formula("some formula"), data, function) error when called from in a function

2011-01-26 Thread Paul Bailey
I'm having a problem with aggregate.formula when I call it in a function and
the formula is converted from a string in the function.

I think my problem may also only occur when the left hand side of the formula 
is cbind(...)

Here is example code that generates a dataset and then the error. 

The first function agg2 fails

> agg2(FALSE)
do agg 2
Error in m[[2L]][[2L]] : object of type 'symbol' is not subsettable

but if I have it return what it is going to pass to aggregate and then pass
that to aggregate myself, it works. I can use this for a workaround (agg3),
where a second function makes the call itself.

I'm confused by the behavior. Is there some way to not have to use a separate 
function to make the call ?


==
# start R code
# idea: in a function, count the number of instances
# of some factor (y) associated with another
# factor (x). aggregate.formula appears to be
# able to do this... but I have a problem if all of the following:
# (1) It is called in a function
# (2) the formula is created using as.formula(character)
# calling aggregate with the same formula (created with as.formula)
# outside the function works fine.
agg2 <- function(test=FALSE) {
  # create a factor y
  # (stringsAsFactors=TRUE is the default before R 4.0.0)
  dat <- data.frame(y=sample(LETTERS[1:3],100,replace=TRUE),
                    stringsAsFactors=TRUE)
  # create a factor x
  dat$x <- sample(letters[1:4],100,replace=TRUE)
  # make a column of 1s and zeros
  # 1 when that row has that level of y
  # 0 otherwise
  lvls <- levels(dat$y)
  dat$ya <- 1*(dat[,1] == lvls[1])
  dat$yb <- 1*(dat[,1] == lvls[2])
  dat$yc <- 1*(dat[,1] == lvls[3])
  # this works fine if you give the formula directly
  agg1 <- aggregate(cbind(ya,yb,yc)~x,data=dat,sum)
  # and fine if you accept
  fo <- as.formula("cbind(ya,yb,yc)~x")
  if(test) {
    return(list(fo=fo,data=dat))
  }
  cat("do agg 2\n")
  agg2 <- aggregate(fo,data=dat,sum)
  list(agg1,agg2)
}
agg2(FALSE)
ag <- agg2(TRUE)
ag$fo
aggregate(ag$fo,ag$data,sum)


agg3 <- function() {
  ag <- agg2(TRUE)
  ag$fo
  aggregate(ag$fo,ag$data,sum)
}
agg3()

# end R code
==
Paul Bailey
University of Maryland


Re: [Rd] aggregate(as.formula("some formula"), data, function) error when called from in a function

2011-01-26 Thread Gabor Grothendieck
On Wed, Jan 26, 2011 at 2:04 PM, Paul Bailey <pdbai...@umd.edu> wrote:
> [...]

The problem is that the aggregate statement:

agg2 <- aggregate(fo, data = dat, sum)

is using non-standard evaluation and is literally looking at fo rather
than fo's value.  This may be a bug in aggregate.formula but at any
rate you could try replacing that statement with the following to
force fo to be evaluated:

agg2 <- do.call(aggregate, list(fo, data = dat, FUN = sum))
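
To illustrate the difference, here is a minimal sketch using a made-up helper, arg_class, which is not part of aggregate. It inspects its unevaluated argument through match.call(), which is assumed here to be roughly what aggregate.formula does with m[[2L]]. Called directly, it sees the symbol fo; called through do.call, it sees the formula itself.

```r
# Hypothetical helper that looks at its *unevaluated* argument via
# match.call(); the name arg_class is made up for this illustration.
arg_class <- function(formula) {
  m <- match.call()
  class(m[[2L]])
}

fo <- as.formula("cbind(ya, yb) ~ x")

direct <- arg_class(fo)                 # call holds the symbol `fo`: "name"
forced <- do.call(arg_class, list(fo))  # call holds the formula: "formula"
```

Subsetting a symbol with [[2L]] is what fails with "object of type 'symbol' is not subsettable", whereas the formula object can be subset to reach its left-hand side.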

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



[Rd] Error handling with frozen RCurl function calls + Identification of frozen R processes

2011-01-26 Thread Janko Thyson
Dear list,

I'm tackling an empirical research problem that requires me to address a whole
bunch of conceptual and/or technical details at the same time which cuts
time short for all the nitty-gritty details of the components involved.
Having said this, I'm lacking the time at the moment to deeply dive into
parallel computing and HTTP requests via RCurl and I hope you can help me
out with one or two imminent issues of my crawler/scraper:

Once a day, I'm running 'RCurl::getURIAsynchronous(x=URL.frontier.sub,
multiHandle=my.multi.handle)' within an lapply()-construct in order to read
chunks of deterministically composed URLs from a host. There are courtesy
time delays implemented between the individual http requests (5 times the
time the last request from this host took) so that I'm not clogging the
host. I'm causing about 15 minutes of traffic per day. The problem is that
'getURIAsynchronous()' simply freezes sometimes and I don't have a clue
why. I also can't reproduce the error as it's totally erratic.

I tried to put the function inside a try() or tryCatch() construct to no
avail. Also, I've experimented with a couple of timeout options of Curl, but
honestly didn't really understand all the implications. None worked so far.
It seems that upon an error 'getURIAsynchronous()' simply does not
give control back to the R process. Additionally, due to a lack of profound
knowledge in parallel computing, the program is scripted to run a bunch of R
processes independently. Communication between them takes place via
variables they read from and write to disc in order to have some sort of
shared environment (horrible, I know ;-)). 

So here are my specific questions:
1) Is it possible to catch connection or timeout errors in RCurl functions
that allow me to implement my customized error handling? If so, could you
guide me to some examples, please?
2) Can I somehow identify frozen Rterm or Rscript processes (e.g. via
using Sys.getpid()?) in order to shut them down and reinitialize them? 

You'll find my session info below.

Thanks for any hints or advice! 
Janko

> sessionInfo()
R version 2.12.1 (2010-12-16)
Platform: i386-pc-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=German_Germany.1252   
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C   
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] tcltk tools stats graphics  grDevices utils datasets 
[8] methods   base 

other attached packages:
 [1] RCurl_1.5-0.1    bitops_1.0-4.1   XML_3.2-0.2      RMySQL_0.7-5
 [5] filehash_2.1-1   hash_2.0.1       timeDate_2130.91 RODBC_1.3-2
 [9] MiscPsycho_1.6   statmod_1.4.8    debug_1.2.4      mvbutils_2.5.4
[13] DBI_0.2-5        cwhmisc_2.1      lattice_0.19-13

loaded via a namespace (and not attached):
[1] grid_2.12.1



[Rd] Dealing with R list objects in C/C++

2011-01-26 Thread Wayne.Zhang
Hi,

I'd like to construct an R list object in C++, fill it with relevant data, and 
pass it to an R function which will return a different list object back.  I 
have browsed through all the R manuals, and examples under tests/Embedding, but 
can't figure out the correct way.  Below is my code snippet:

#include <Rinternals.h>
// Rf_initEmbeddedR and other setups already performed

SEXP arg, ret;

// this actually creates a pairlist.  I can't find any API that creates a list
PROTECT(arg = allocList(3));

// I want the first element to be type integer, second double, and third a vector.
INTEGER(arg)[0]  = 1;    // <- runtime exception: INTEGER() can only be applied to a 'integer', not a 'pairlist'
REAL(arg)[1] = 2.5;      // control never reached here

VECTOR_PTR(arg)[2] = allocVector(REALSXP, 4);
REAL(VECTOR_PTR(arg)[2])[0] = 10.0;
REAL(VECTOR_PTR(arg)[2])[1] = 11.0;
REAL(VECTOR_PTR(arg)[2])[2] = 12.0;
REAL(VECTOR_PTR(arg)[2])[3] = 13.0;

PROTECT(call = lang2(install(entryPoint.c_str()), arg));

ret = R_tryEval(call, R_GlobalEnv, &errorOccurred);


I'll be grateful if you can point me to any online docs/samples.

Thanks in advance,
Wayne




Re: [Rd] Dealing with R list objects in C/C++

2011-01-26 Thread Dirk Eddelbuettel

Hi Wayne,

On 26 January 2011 at 17:56, wayne.zh...@barclayscapital.com wrote:
| [...]

This is a non-trivial problem when you use the C API provided by R. It is all
documented, but you need to study the 'Writing R Extensions' in some detail,
as well as maybe 'R Programming' by Gentleman and/or 'Software for Data
Analysis' by Chambers.

But there is another API you can use. It is provided by RInside (to embed R
inside C++) which uses Rcpp (for R and C++ integration).  Install those two
packages from CRAN, and then drop the few lines below as a file, say,
wayne.cpp in the examples/standard/ directory of RInside. Saying 'make wayne'
will build an executable, using proper flags and linker options, and you can
run that:

edd@max:~/svn/rinside/pkg/inst/examples/standard$ make wayne
g++ -I/usr/share/R/include -I/usr/local/lib/R/site-library/Rcpp/include 
-I/usr/local/lib/R/site-library/RInside/include -O3 -pipe -g -Wall
wayne.cpp  -L/usr/lib64/R/lib -lR  -lblas -llapack 
-L/usr/local/lib/R/site-library/Rcpp/lib -lRcpp 
-Wl,-rpath,/usr/local/lib/R/site-library/Rcpp/lib 
-L/usr/local/lib/R/site-library/RInside/lib -lRInside 
-Wl,-rpath,/usr/local/lib/R/site-library/RInside/lib -o wayne
edd@max:~/svn/rinside/pkg/inst/examples/standard$ ./wayne 
Showing list content:
L[0] 1
L[1] 2.5
L[2][0] 10
L[2][1] 11
Showing list content:
L[0] 42
L[1] 42
L[2][0] 10
L[2][1] 42
edd@max:~/svn/rinside/pkg/inst/examples/standard$ 

The code creates a list as you spec'ed with int, double and vector. The list is shown
on stdout, then passed to R, transformed by R and shown again at the C++ level.

Questions on RInside and Rcpp are welcome on the rcpp-devel list.

Hope this helps,  Dirk

-
#include <RInside.h>                    // for the embedded R via RInside

void show(const Rcpp::List & L) {
    // this function is cumbersome as we haven't defined << operators
    std::cout << "Showing list content:\n";
    std::cout << "L[0] " << Rcpp::as<int>(L[0]) << std::endl;
    std::cout << "L[1] " << Rcpp::as<double>(L[1]) << std::endl;
    Rcpp::IntegerVector v = Rcpp::as<Rcpp::IntegerVector>(L[2]);
    std::cout << "L[2][0] " << v[0] << std::endl;
    std::cout << "L[2][1] " << v[1] << std::endl;
}

int main(int argc, char *argv[]) {

    // create an embedded R instance
    RInside R(argc, argv);

    Rcpp::List mylist(3);
    mylist[0] = 1;
    mylist[1] = 2.5;
    Rcpp::IntegerVector v(2); v[0] = 10; v[1] = 11; // with C++0x we could assign directly
    mylist[2] = v;
    show(mylist);

    R["myRlist"] = mylist;
    std::string r_code = "myRlist[[1]] = 42; myRlist[[2]] = 42.0; myRlist[[3]][2] = 42; myRlist";

    Rcpp::List reslist = R.parseEval(r_code);
    show(reslist);

    exit(0);
}
-


-- 
Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com



Re: [Rd] Dealing with R list objects in C/C++

2011-01-26 Thread Martin Morgan
On 01/26/2011 02:56 PM, wayne.zh...@barclayscapital.com wrote:
> Hi,
>
> I'd like to construct an R list object in C++, fill it with relevant data,
> and pass it to an R function which will return a different list object back.
> I have browsed through all the R manuals, and examples under tests/Embedding,
> but can't figure out the correct way.  Below is my code snippet:
>
> #include <Rinternals.h>
> // Rf_initEmbeddedR and other setups already performed
>
> SEXP arg, ret;
>
> // this actually creates a pairlist.  I can't find any API that creates a
> // list
> PROTECT(arg = allocList(3));

Allocate a list of length 3 via SEXPTYPE VECSXP

  PROTECT(arg = allocVector(VECSXP, 3));

 
> // I want the first element to be type integer, second double, and third a
> // vector.
> INTEGER(arg)[0]  = 1;    // <- runtime exception: INTEGER() can
> // only be applied to a 'integer', not a 'pairlist'

set the first element of the list to an integer vector of length 1, and
assign a value

  SET_VECTOR_ELT(arg, 0, allocVector(INTSXP, 1));
  INTEGER(VECTOR_ELT(arg, 0))[0] = 1;

or more succinctly

  SET_VECTOR_ELT(arg, 0, ScalarInteger(1));

> REAL(arg)[1] = 2.5;   // control never reached here

and the second element

  SET_VECTOR_ELT(arg, 1, ScalarReal(2.5));

> VECTOR_PTR(arg)[2] = allocVector(REALSXP, 4);

and for the third allocate a REALSXP and then fill

  SET_VECTOR_ELT(arg, 2, allocVector(REALSXP, 4));

next lines should be ok as REAL(VECTOR_ELT(arg, 2))[0] = 10.0; or with
less typing as

  double *x = REAL(VECTOR_ELT(arg, 2));
  x[0] = 10.0; x[1] = 11.0; x[2] = 12.0; x[3] = 13.0;

> REAL(VECTOR_PTR(arg)[2])[0] = 10.0;
> REAL(VECTOR_PTR(arg)[2])[1] = 11.0;
> REAL(VECTOR_PTR(arg)[2])[2] = 12.0;
> REAL(VECTOR_PTR(arg)[2])[3] = 13.0;
>
> PROTECT(call = lang2(install(entryPoint.c_str()), arg));

not sure where entryPoint.c_str() is coming from, but

 PROTECT(call = lang2(install(fun), arg));

with some debate about whether install(fun) should be PROTECT'ed.

 
> ret = R_tryEval(call, R_GlobalEnv, &errorOccurred);

likely PROTECT(ret = ...) while checking errorOccurred, etc.
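
For orientation, a rough R-level picture of what the C calls above assemble and evaluate. This is a sketch under assumptions: `fun` here is a hypothetical stand-in for whatever R function the C code names, and tryCatch is only a loose analogue of R_tryEval with its errorOccurred flag.

```r
# R-level picture of the list the C calls above assemble (VECSXP of length 3).
arg <- list(1L,                  # INTSXP of length 1, like ScalarInteger(1)
            2.5,                 # REALSXP of length 1, like ScalarReal(2.5)
            c(10, 11, 12, 13))   # REALSXP of length 4

# `fun` is a hypothetical stand-in for the R function named in install(...).
fun <- function(l) {
  list(n = length(l), types = vapply(l, typeof, character(1)))
}

# Loose analogue of R_tryEval(call, R_GlobalEnv, &errorOccurred):
# evaluate the call and record whether an error occurred.
errorOccurred <- FALSE
ret <- tryCatch(fun(arg), error = function(e) {
  errorOccurred <<- TRUE
  NULL
})
```

Inspecting typeof(arg) and ret$types on the R side is a quick way to confirm that the C code built a generic list rather than a pairlist.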

Hope that helps,

Martin

 
 
> I'll be grateful if you can point me to any online docs/samples.
>
> Thanks in advance,
> Wayne
 


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793
