Re: [R] Create a new variable and concatenation inside a "for" loop

2016-05-02 Thread Raubertas, Richard
What nonsense.  There is a group of finger-waggers on this list who jump on 
every poster who uses the name of an R function as a variable name.  R is 
perfectly capable of distinguishing the two, so if 'c' (or 'data' or 'df', 
etc.) is the natural name for a variable then go ahead and use it.  Mr. 
Newmiller provides an excellent example of this:  he recommends 'C' instead of 
'c', apparently without realizing that 'C' is also a built-in R 
function--because there is "no such problem".
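
A quick illustration of the point (editorial aside, not in the original message):

c <- 1:3    # a variable named 'c'
c(c, 4)     # in call position R still finds the function: returns 1 2 3 4
C(factor(1:3), contr.treatment)   # and 'C' is indeed a base R function (it sets contrasts)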

Richard Raubertas

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Jeff Newmiller
Sent: Wednesday, April 27, 2016 3:58 PM
To: Gordon, Fabiana; 'r-help@R-project.org'
Subject: Re: [R] Create a new variable and concatenation inside a "for" loop

"c" an extremely commonly-used function. Functions are first-class objects that 
occupy the same namespaces that variables do, so they can obscure each other. 
In short, don't use variables called "c" (R is case sensitive, so "C" has no 
such problem).

Wherever possible, avoid incremental concatenation like the plague. If you feel 
you must use it, at least concatenate in lists and then combine once at the end 
with functions like unlist or do.call. Better still, pre-allocate a vector or 
matrix-like object filled with placeholder values such as NA, and then overwrite 
each element in the loop, as in your first example.
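
For example (an editorial sketch of both alternatives; object names are
illustrative):

# pre-allocate with NA, then overwrite each row inside the loop
res <- matrix(NA_real_, nrow = 5, ncol = 2)
for (i in 1:5) {
  res[i, ] <- rnorm(2)
}

# or: collect the pieces in a list and combine once at the end
pieces <- vector("list", 5)
for (i in 1:5) pieces[[i]] <- rnorm(2)
res <- do.call(rbind, pieces)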
-- 
Sent from my phone. Please excuse my brevity.

On April 27, 2016 3:25:14 PM GMT+01:00, "Gordon, Fabiana" 
 wrote:
>Hello,
>
>Suppose you need a loop to create a new variable, i.e., you are
>not reading data from outside the loop. This is a simple example in
>Matlab code,
>
>for i=1:5
>r1=randn
>r2=randn
>r=[r1 r2]
>c(i,:)=r;   % creation of each row of c; the ":" symbol indicates
>all columns. In R this would be [i,]
>end
>
>The output of interest is c, which I'm creating inside the "for" loop
>-also the index used in the loop is used to create c. In R I had to
>create c as an empty vector (numeric()) outside the loop, otherwise I
>get an error message saying that c doesn't exist.
>
>The other issue is the concatenation. In each iteration I'm creating
>the rows of c by placing the new row  (r) below the previous one so
>that c becomes a 5 x 2 matrix.
>In R, it seems that I have no choice but to use the function "rbind". I
>managed to write this code in R. However, I'm not sure what to do if,
>instead of creating a new variable using the index in the "for" loop, I
>wanted to use the index to read data. E.g., suppose I have a 2 x 10
>matrix X and I want to calculate the sin() of each 2 x 2
>sub-matrix of X and store it in a matrix A. Then the code would be
>something like this,
>
>for i=1:5
>A(:, 2*i-1:2*i)= sin(X(:, 2*i-1:2*i))   % the ":" symbol indicates all
>rows
>end
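
A rough R translation of both loops (editorial sketch; rnorm() plays the role
of Matlab's randn):

# first loop: build the 5 x 2 matrix row by row; pre-allocating avoids the
# "object not found" error (and avoids naming the result 'c' at all)
cc <- matrix(NA_real_, nrow = 5, ncol = 2)
for (i in 1:5) {
  cc[i, ] <- rnorm(2)
}

# second loop: sin() of each 2 x 2 sub-matrix of a 2 x 10 matrix X
X <- matrix(rnorm(20), nrow = 2)
A <- matrix(NA_real_, nrow = 2, ncol = 10)
for (i in 1:5) {
  cols <- (2 * i - 1):(2 * i)     # columns 2*i-1 : 2*i, as in the Matlab code
  A[, cols] <- sin(X[, cols])
}
# sin() is vectorized, so A <- sin(X) gives the same result without any loop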
>
>Many Thanks,
>
>Fabiana
>
>
>Dr Fabiana Gordon
>
>Senior Statistical Consultant
>Statistical Advisory Service, School Of Public Health,
>Imperial College London
>1st Floor, Stadium House, 68 Wood Lane,
>London W12 7RH.
>
>Tel: 020 7594 1749
>Email:
>fabiana.gor...@imperial.ac.uk
>Web: 
>www.imperial.ac.uk/research-and-innovation/support-for-staff/stats-advice-service/


Re: [R] lsqlin in R package pracma

2015-08-27 Thread Raubertas, Richard
Is it really that complicated?  This looks like an ordinary quadratic 
programming problem, and 'solve.QP' from the 'quadprog' package seems to solve 
it without user-specified starting values (using the C, d, A, and b defined in 
the example quoted below):

library(quadprog)
Dmat <- t(C) %*% C
dvec <- t(C) %*% d
Amat <- -1 * t(A)
bvec <- -1 * b

rslt <- solve.QP(Dmat, dvec, Amat, bvec)
sum((C %*% rslt$solution - d)^2)

[1] 0.01758538

Richard Raubertas
Merck & Co.

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Hans W Borchers
Sent: Wednesday, August 26, 2015 6:22 AM
To: r-h...@stat.math.ethz.ch
Subject: Re: [R] lsqlin in R package pracma

On Mon Aug 24 Wang, Xue, Ph.D. Wang.Xue at mayo.edu wrote
 I am looking for an R version of the Matlab function lsqlin. I came across
 the R pracma package, which has a lsqlin function. Compared with Matlab
 lsqlin, the R version does not allow inequality constraints.
 I am wondering if this functionality will be available in the future. I would
 also like to get your opinion on which R package/function is best for
 solving least-squares minimization problems with linear inequality constraints.
 Thanks very much for your time and attention!


Solving (linear) least-squares problems with linear inequality constraints
is more difficult than one would expect. Inspecting the MATLAB code reveals
that it employs advanced methods such as active-set (for linear inequality
constraints) and interior-point (for bound constraints).

Function nlsLM() in package *minpack.lm* supports bound constraints if that
is sufficient for you. The same is true for *nlmrt*. Convex optimization
might be a promising approach for linear inequality constraints, but there
is no easy-to-handle convex solver in R at this moment.

So the most straightforward way would be to use constrOptim(), that is,
optim() with linear constraints. It requires a reasonable starting point, and
you have to keep your fingers crossed that you can find such a point in the
interior of the feasible region.

If someone wants to try: here is the example from the MATLAB lsqlin page:

C <- matrix(c(
0.9501,   0.7620,   0.6153,   0.4057,
0.2311,   0.4564,   0.7919,   0.9354,
0.6068,   0.0185,   0.9218,   0.9169,
0.4859,   0.8214,   0.7382,   0.4102,
0.8912,   0.4447,   0.1762,   0.8936), 5, 4, byrow=TRUE)
d <- c(0.0578, 0.3528, 0.8131, 0.0098, 0.1388)
A <- matrix(c(
0.2027,   0.2721,   0.7467,   0.4659,
0.1987,   0.1988,   0.4450,   0.4186,
0.6037,   0.0152,   0.9318,   0.8462), 3, 4, byrow=TRUE)
b <- c(0.5251, 0.2026, 0.6721)

The least-squares function to be minimized is  ||C x - d||_2 , and the
constraints are  A x <= b :

f <- function(x) sum((C %*% x - d)^2)

The solution x0 returned by MATLAB has a minimum of  f(x0) = 0.01759204 .
This point does not lie in the interior and cannot be used for a start.
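
One workable start here is the origin: all entries of b are positive, so
A %*% rep(0, 4) = 0 < b holds strictly and x0 = 0 lies in the interior. An
editorial sketch with constrOptim(), which enforces ui %*% x - ci >= 0, so
ui = -A and ci = -b encode A x <= b:

x0 <- rep(0, 4)                     # strictly feasible starting point
g  <- function(x) as.vector(2 * t(C) %*% (C %*% x - d))   # gradient of f
res <- constrOptim(x0, f, g, ui = -A, ci = -b)
res$value   # should come out close to the MATLAB minimum of 0.01759204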



Re: [R] [R-pkgs] New package list for analyzing list survey experiments

2010-07-13 Thread Raubertas, Richard
I agree that 'list' is a terrible package name, but only secondarily 
because it is the name of a data type.  The primary problem is that it is 
so generic as to be almost totally uninformative about what the package does.

For some reason package writers seem to prefer maximally uninformative 
names for their packages.  To take some examples of recently announced 
packages, can anyone guess what the packages 'FDTH', 'rtv', or 'lavaan' 
do?  Why the aversion to informative names along the lines of 
'Freq_dist_and_histogram', 'RandomTimeVariables', and 
'Latent_Variable_Analysis', respectively?

R.Raubertas

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Jeffrey J. Hallman
 Sent: Monday, July 12, 2010 10:09 AM
 To: r-h...@stat.math.ethz.ch
 Subject: Re: [R] [R-pkgs] New package list for analyzing list
 survey experiments
 
 I know nothing about your package, but "list" is a terrible name for it,
 as "list" is also the name of a data type in R.
 -- 
 Jeff
 


Re: [R] Cforest and Random Forest memory use

2010-06-18 Thread Raubertas, Richard
Max, 
My disagreement was really just about the single statement 'I suspect 
that 1M points are pretty densely packed into 40-dimensional space' in 
your original post.  On the larger issue of diminishing returns with 
the size of a training set, I agree with your points below.

Rich

 -Original Message-
 From: Max Kuhn [mailto:mxk...@gmail.com] 
 Sent: Friday, June 18, 2010 1:35 PM
 To: Bert Gunter
 Cc: Raubertas, Richard; Matthew OKane; r-help@r-project.org
 Subject: Re: [R] Cforest and Random Forest memory use
 
 Rich's calculations are correct, but from a practical standpoint I
 think that using all the data for the model is overkill for a few
 reasons:
 
 - the calculations that you show implicitly assume that the predictor
 values can be reliably differentiated from each other. Unless they are
 deterministic calculations (e.g. number of hydrogen bonds, % GC in a
 sequence), they are subject to measurement error. We don't know anything
 about the context here, but in the lab sciences, the measurement variation
 can make the *effective* number of predictor values much less than n. So
 you can have millions of predictor values but you might only be able
 to differentiate k << n values reliably.
 
 - the important dimensionality to consider is based on how many of
 those 40 are relevant to the outcome. Again, we don't know the context
 of the data, but there is a strong prior towards the number of
 important variables being less than 40.
 
 - We've had to consider these types of problems a lot. We might have
 200K samples (compounds in this case) and 1000 predictors that appear
 to matter. Ensembles of trees tended to do very well, as did kernel
 methods. In either of those two classes of models, the prediction time
 for a single new observation is very long. So we looked at how
 performance was affected if we were to reduce the training set size.
 In essence, we found that 50% of the data could be used with no
 appreciable effect on performance. We could make the percentage
 smaller if we used the predictor values to sample the data set for
 prediction; if we had m samples in the training set, the next sample
 added would have to have maximum dissimilarity to the existing m
 samples (see the sketch after this message).
 
 - If you are going to do any feature selection, you would be better
 off segregating a percentage of those million samples as a hold-out
 set to validate the selection process (a few people from Merck have
 written excellent papers on the selection bias problem). Similarly, if
 this is a classification problem, any ROC curve analysis is most
 effective when the cutoffs are derived from a separate hold-out data
 set. Just dumping all those samples in a training set seems like a
 lost opportunity.
 
 Again, these are not refutations of your calculations. I just think
 that there are plenty of non-theoretical arguments for not using all
 of those values for the training set.
 
 Thanks,
 
 Max
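
One way to code the maximum-dissimilarity sampling described above is caret's
maxDissim(); an editorial sketch on made-up data:

library(caret)
set.seed(1)
pool <- as.data.frame(matrix(rnorm(5000), ncol = 5))   # 1000 candidate samples
seed <- pool[sample(nrow(pool), 10), ]                 # small starting subset
picks <- maxDissim(seed, pool, n = 50)   # rows of 'pool' that are successively
                                         # most dissimilar to the growing set
train <- rbind(seed, pool[picks, ])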
 On Fri, Jun 18, 2010 at 11:41 AM, Bert Gunter 
 gunter.ber...@gene.com wrote:
  Rich is right, of course. One way to think about it is this (paraphrased
  from the section on the Curse of Dimensionality in Hastie et al's
  Statistical Learning book): suppose 10 uniformly distributed points on a
  line give what you consider to be adequate coverage of the line. Then in 40
  dimensions, you'd need 10^40 uniformly distributed points to give
  equivalent coverage.
 
  Various other aspects of the curse of dimensionality are discussed in the
  book, one of which is that in high dimensions, most points are closer to
  the boundaries than to each other. As Rich indicates, this has profound
  implications for what one can sensibly do with such data. One example is:
  nearest neighbor procedures don't make much sense (as nobody is likely to
  have anybody else nearby). Which Rich's little simulation nicely
  demonstrated.
 
  Cheers to all,
 
  Bert Gunter
  Genentech Nonclinical Statistics
 
 
 

Re: [R] Cforest and Random Forest memory use

2010-06-17 Thread Raubertas, Richard
 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Max Kuhn
 Sent: Monday, June 14, 2010 10:19 AM
 To: Matthew OKane
 Cc: r-help@r-project.org
 Subject: Re: [R] Cforest and Random Forest memory use
 
 The first thing that I would recommend is to avoid the formula
 interface to models. The internals that R uses to create matrices
 from a formula+data set are not efficient. If you had a large number
 of variables, I would have automatically pointed to that as a source
 of issues. cforest and ctree only have formula interfaces though, so
 you are stuck on that one. The randomForest package has both
 interfaces, so that might be better.
 
 Probably the issue is the depth of the trees. With that many
 observations, you are likely to get extremely deep trees. You might
 try limiting the depth of the tree and see if that has an effect on
 performance.
 
 We run into these issues with large compound libraries; in those cases
 we do whatever we can to avoid ensembles of trees or kernel methods.
 If you want those, you might need to write your own code that is
 hyper-efficient and tuned to your particular data structure (as we
 did).
 
 On another note... are this many observations really needed? You have
 40ish variables; I suspect that 1M points are pretty densely packed
 into 40-dimensional space. 

This did not seem right to me:  40-dimensional space is very, very big
and even a million observations will be thinly spread.  There is probably 
some analytic result from the theory of coverage processes about this, 
but I just did a quick simulation.  If a million samples are independently 
and randomly distributed in a 40-d unit hypercube, then 90% of the points 
in the hypercube will be more than one-quarter of the maximum possible 
distance (sqrt(40)) from the nearest sample.  And about 40% of the hypercube 
will be more than one-third of the maximum possible distance to the nearest 
sample.  So the samples do not densely cover the space at all.
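
A brute-force version of that simulation (editorial sketch; shrink n to make
it run quickly):

set.seed(1)
d <- 40; n <- 1e6
samp <- matrix(runif(n * d), ncol = d)       # n random points in the unit hypercube
maxd <- sqrt(d)                              # maximum possible distance
probe <- matrix(runif(500 * d), ncol = d)    # points at which to check coverage
nn <- apply(probe, 1, function(p)
  sqrt(min(rowSums(sweep(samp, 2, p)^2))))   # distance to the nearest sample
mean(nn > maxd / 4)   # fraction of the cube beyond 1/4 of the max distance (~0.9)
mean(nn > maxd / 3)   # fraction beyond 1/3 (~0.4)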

One implication is that modeling the relation of a response to 40 predictors 
will inevitably require a lot of smoothing, even with a million data points.

Richard Raubertas
Merck & Co.

 Do you lose much by sampling the data set
 or allocating a large portion to a test set? If you have thousands of
 predictors, I could see the need for so many observations, but I'm
 wondering if many of the samples are redundant.
 
 Max
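
An editorial sketch of the depth-limiting suggestion quoted above, using
randomForest (data and parameter values are made up):

library(randomForest)
set.seed(1)
X <- matrix(rnorm(10000 * 40), ncol = 40)   # stand-in for the real predictors
y <- factor(rbinom(10000, 1, 0.5))          # stand-in response
fit <- randomForest(X, y, ntree = 500,
                    nodesize = 100,   # larger terminal nodes -> shallower trees
                    maxnodes = 64)    # hard cap on terminal nodes per tree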
 
 On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane 
 mlok...@gmail.com wrote:
  Answers added below.
  Thanks again,
  Matt
 
  On 11 June 2010 14:28, Max Kuhn mxk...@gmail.com wrote:
 
  Also, you have not said:
 
   - your OS: Windows Server 2003 64-bit
   - your version of R: 2.11.1 64-bit
   - your version of party: 0.9-9995
 
 
 
  - your code:
    test.cf <- cforest(formula = badflag ~ ., data = example,
                       control = cforest_control(teststat = 'max',
                           testtype = 'Teststatistic', replace = FALSE,
                           ntree = 500, savesplitstats = FALSE, mtry = 10))
 
  - what "large data set" means: 1 million observations, 40+ variables,
    around 200MB
  - what "very large model objects" means: anything which breaks
 
  So... how is anyone supposed to help you?
 
  Max
 
 
 
 
 
 -- 
 
 Max
 


Re: [R] Extending a vector to length n

2009-04-16 Thread Raubertas, Richard
The following approach works for both of your examples:

xx <- rep(x, length.out=n)
xx[(length(x)+1):n] <- NA

Thus:

> n <- 2
> aa <- rep(a, length.out=n)
> aa[(length(a)+1):n] <- NA
> aa
[1] "2008-01-01" NA
> bb <- rep(b, length.out=n)
> bb[(length(b)+1):n] <- NA
> bb
[1] a    <NA>
Levels: a
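
Wrapped as a reusable helper (editorial sketch; 'pad_to' is a hypothetical
name, not an existing function):

pad_to <- function(x, n) {
  # pad x with NAs up to length n; rep() recycles, which preserves the
  # class and attributes that length<- and c() can drop
  stopifnot(n >= length(x))
  out <- rep(x, length.out = n)
  if (n > length(x)) out[(length(x) + 1):n] <- NA
  out
}
pad_to(as.Date("2008-01-01"), 3)   # [1] "2008-01-01" NA NA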


R. Raubertas
Merck & Co
 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of hadley wickham
 Sent: Wednesday, April 15, 2009 10:55 AM
 To: r-help
 Subject: [R] Extending a vector to length n
 
 In general, how can I increase a vector of length m (< n) to length n
 by padding it with n - m missing values, without losing attributes?
 The two approaches I've tried, using length<- and adding missings with
 c(), do not work in general:
 
 > a <- as.Date("2008-01-01")
 > c(a, NA)
 [1] "2008-01-01" NA
 > length(a) <- 2
 > a
 [1] 13879    NA
 
 
 > b <- factor(a)
 > c(b, NA)
 [1]  1 NA
 > length(b) <- 2
 > b
 [1] a    <NA>
 Levels: a
 
 Hadley
 
 
 
 -- 
 http://had.co.nz/
 


Re: [R] Semantics of sequences in R

2009-02-23 Thread Raubertas, Richard
 

 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Duncan Murdoch
 Sent: Sunday, February 22, 2009 4:13 PM
 
 I think this was posted to the wrong list, so my followup is going to 
 R-devel.
 
 On 22/02/2009 3:42 PM, Stavros Macrakis wrote:
  Inspired by the exchange between Rolf Turner and Wacek Kusnierczyk, I
  thought I'd clear up for myself the exact relationship among the
  various sequence concepts in R, including not only generic vectors
  (lists) and atomic vectors, but also pairlists, factor sequences,
  date/time sequences, and difftime sequences.
  
  I tabulated type of sequence vs. property to see if I could make sense
  of all this.  The properties I looked at were the predicates
  is.{vector,list,pairlist}; whether various sequence operations (c,
  rev, unique, sort, rle) can be used on objects of the various types,
  and if relevant, whether they preserve the type of the input; and what
  the length of class( as.XXX (1:2) ) is.
  
  Here are the results (code to reproduce at end of email):
  
   numer list  plist fact  POSIXct difft
  is.vectorTRUE  TRUE  FALSE FALSE FALSE   FALSE
  is.list  FALSE TRUE  TRUE  FALSE FALSE   FALSE
  is.pairlist  FALSE FALSE TRUE  FALSE FALSE   FALSE
  c_keep?  TRUE  TRUE  FALSE FALSE TRUEFALSE
  rev_keep?TRUE  TRUE  FALSE TRUE  TRUETRUE
  unique_keep? TRUE  TRUE  Err TRUE  TRUEFALSE
  sort_keep?   TRUE  Err Err TRUE  TRUETRUE
  rle_len  2 Err Err Err Err   Err
  
  Alas, this tabulation, rather than clarifying things for me, just
  confused me more -- the diverse treatment of sequences by various
  operations is all rather bewildering.
 
 But you are asking lots of different questions, so of course you should
 get different answers.  For example, the first three rows are behaving
 exactly as documented.  (Perhaps the functions should have been designed
 differently, but a pretty-looking matrix isn't an argument for that.
 Give some examples of how the documented behaviour is causing problems.)
 
 I think some of the operations in the later rows are undocumented
 (generally pairlists tend not to be documented, even if in some cases
 they are supported), and it might make sense to make them more
 consistent in the undocumented cases.  But it may make more sense to
 completely hide pairlists, for instance, and then several more of the
 examples are behaving as documented.  (BTW, your description of your
 last row doesn't match what you did, as far as I can see.)
 
  Wouldn't it be easier to teach, learn, and use R if there were more
  consistency in the treatment of sequences?  
 
 Which ones in particular should change?  What should they change to? 
 What will break when you do that?

Okay, here is one that should change:  'c()' should do something useful 
with factors, for example return a factor whose levels are the union of 
the levels of the arguments.  Note that precedent for this already exists 
in base R:

> f1 <- factor(letters[1:3])
> f2 <- factor(letters[3:5])
> c(f1, f2)
[1] 1 2 3 1 2 3
> str(rbind(data.frame(f=f1), data.frame(f=f2)))
'data.frame':   6 obs. of  1 variable:
 $ f: Factor w/ 5 levels "a","b","c","d",..: 1 2 3 3 4 5

So the code and documentation already exist in 'rbind.data.frame'.  As 
for what would break, well, it is hard to imagine any possible use for 
the current behavior, or who could have made use of it.  But you never 
know I guess ...
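
A sketch of the proposed behavior (editorial; 'c_factor' is a hypothetical
name, not an existing function):

c_factor <- function(...) {
  # concatenate factors, taking the union of the arguments' levels
  fs   <- list(...)
  levs <- unique(unlist(lapply(fs, levels)))
  factor(unlist(lapply(fs, as.character)), levels = levs)
}
c_factor(f1, f2)
# [1] a b c c d e
# Levels: a b c d e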

Rich Raubertas
Merck & Co.

 
   I understand that in
  long-running projects like S/R, there is an accumulation of
  contributions by a variety of authors, but perhaps the time has come
  for some cleanup at least for the base library?
 
 Generally R core members are reluctant to take on work just because
 someone else thinks it would be nice if they did.  If you want to do
 this, that's one thing, but if you are just saying that it would be nice
 if someone else did it, then it's much less likely to get done.  To get
 someone else to do it you need to convince them that it's a valuable use
 of their time, and I don't see that yet.
 
 Duncan Murdoch
 