[R-sig-eco] Using rq() for least absolute deviation regression

2011-04-26 Thread Jane Shevtsov
I've seen several websites say that the function rq() from the package
quantreg can be used to do least absolute deviation regression. How do
you go about doing this and what's the connection between quantile
regression and LAD? (I'm very new to the former topic.)

Thanks,
Jane

-- 
-
Jane Shevtsov
Ecology Ph.D. candidate, University of Georgia
co-founder, www.worldbeyondborders.org
Check out my blog, Perceiving Wholes: http://perceivingwholes.blogspot.com

In the long run, education intended to produce a molecular
geneticist, a systems ecologist, or an immunologist is inferior, both
for the individual and for society, than that intended to produce a
broadly educated person who has also written a dissertation. --John
Janovy, Jr., On Becoming a Biologist

___
R-sig-ecology mailing list
R-sig-ecology@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-ecology


Re: [R-sig-eco] Using rq() for least absolute deviation regression

2011-04-26 Thread Farrar . David
The short answer is that what you seem to want is the rq() default, with
tau not specified (the default is tau = 0.5).

In general, rq() minimizes a sum of weighted absolute residuals.  The
weights depend on tau (the conditional quantile of interest) and on the
sign of each residual.  With tau = 0.5 the weights are equal, so median
regression in rq() is LAD.
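
To make the tau = 0.5 case concrete, here is a minimal sketch in R. It assumes the quantreg package is installed; the simulated data, the object names, and the helper rho() are illustrative and not part of the original reply.

```r
library(quantreg)

# Hypothetical data: a linear trend with heavy-tailed noise
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + rt(100, df = 2)
dat <- data.frame(x = x, y = y)

# rq() with tau left at its default (0.5) is median regression, i.e. LAD
fit_lad <- rq(y ~ x, data = dat)
coef(fit_lad)

# Ordinary least squares, for comparison only
fit_ols <- lm(y ~ x, data = dat)

# The LAD fit has a sum of absolute residuals no larger than the OLS fit
sum(abs(resid(fit_lad)))
sum(abs(resid(fit_ols)))

# Quantile-regression loss: residual u is weighted by tau if u >= 0
# and by (1 - tau) if u < 0
rho <- function(u, tau) u * (tau - (u < 0))

# At tau = 0.5 both weights are 0.5, so the loss is |u| / 2
u <- c(-2, -0.5, 0, 1, 3)
all.equal(rho(u, 0.5), abs(u) / 2)
```

Other quantiles use the same call with tau set explicitly, for example rq(y ~ x, tau = 0.9, data = dat).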




Re: [R-sig-eco] subsetting data in R

2011-04-26 Thread Ben Bolker
  If this isn't already answered:

  I don't quite understand the question: what do you mean by "do a
complete data set from an object in R"?  What do you mean by "the
subsetting is dangerous ... as you need to specify the levels for all
your factors again"?

  (What do your 3000 columns of data represent?  If these are predictor
variables I hope you have a truly enormous number of responses ...)

  It may have been mentioned already, but droplevels(subset(...)) will
probably do what you want.  (I have tried very hard over the years to
get drop.levels= to be an optional argument to subset(), but so far I
have failed.)  droplevels() is an improvement over the drop.levels()
function in gdata because (1) it is in base R and (2) it doesn't reorder
the factor by default (which is what gdata::drop.levels [insanely in my
opinion] does).
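
A minimal sketch of the droplevels(subset(...)) idiom, using base R only. The data frame below is a hypothetical stand-in for the poster's data: only the column name influencia and its three levels come from the thread, and sp1 is an invented presence/absence column.

```r
# Hypothetical stand-in data: one factor with three levels, one species column
pa <- data.frame(
  influencia = factor(c("AP", "AID", "AII", "AID", "AP", "AII")),
  sp1 = c(1, 0, 1, 1, 0, 0)
)

# subset() alone keeps all three levels, although only AID rows remain
pa_aid <- subset(pa, influencia == "AID")
levels(pa_aid$influencia)

# droplevels() applied to the data frame drops unused levels
# from every factor column in one step
pa_aid2 <- droplevels(subset(pa, influencia == "AID"))
levels(pa_aid2$influencia)
table(pa_aid2$influencia)
```

Because droplevels() works on the whole data frame, there is no need to re-specify each factor by hand, which seems to be the concern raised about data sets with many columns.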

On 11-04-24 11:21 AM, Manuel Spínola wrote:
 Thank you for all the responses.
 
 Is there a way to do a complete data set from an object in R?
 I have a data set with more than 3000 columns.
 
 The subsetting is ok but it could be dangerous if you are using other 
 factors to do some analysis as you need to specify the levels for all 
 your factors again.
 
 Best,
 
 Manuel
 
 On 24/04/2011 08:30 a.m., Gustavo Carvalho wrote:
 pa2 <- subset(pa, influencia == "AP")
 pa2$influencia <- factor(pa2$influencia)
 levels(pa2$influencia)

 On Sun, Apr 24, 2011 at 11:24 AM, Manuel Spínola <mspinol...@gmail.com>
 wrote:
 Thank you very much for your response, Christian, Roman, and Sarah.

 Sarah,

 I am trying your suggestion but I cannot see the levels:

 pa2 = factor(subset(pa, influencia=="AP")$influencia)
 levels(pa2$influencia)
 Error in pa2$influencia : $ operator is invalid for atomic vectors

 Best,

 Manuel



 On 24/04/2011 07:51 a.m., Sarah Goslee wrote:
 By default, read.csv() turns character variables into factors, using all
 the unique values as the levels.

 subset() retains those levels by default, as they are a vital element of
 the data. If you are studying some attribute of men and women, say height,
 even if you are only looking at the heights for women it's important to
 remember that men still exist.

 If you don't want influencia to be a factor, you can change that in the
 import with stringsAsFactors=FALSE.

 If you do want influencia to be a factor, but want the unused levels to be
 removed, you can use factor() to do that.

 > testdata <- data.frame(group=c("A", "B", "C", "A", "B", "C"), value=1:6)
 > testdata
   group value
 1     A     1
 2     B     2
 3     C     3
 4     A     4
 5     B     5
 6     C     6
 > str(testdata)
 'data.frame': 6 obs. of  2 variables:
  $ group: Factor w/ 3 levels "A","B","C": 1 2 3 1 2 3
  $ value: int  1 2 3 4 5 6
 > subset(testdata, group=="A")
   group value
 1     A     1
 4     A     4
 > subset(testdata, group=="A")$group
 [1] A A
 Levels: A B C
 > ?subset
 > factor(subset(testdata, group=="A")$group)
 [1] A A
 Levels: A

 Sarah

 On Sun, Apr 24, 2011 at 9:04 AM, Manuel Spínola <mspinol...@gmail.com>
 wrote:
 Dear list members,

 I have a question regarding subsetting a data set in R.

 I created an object for my data:

 pa = read.csv("espec_indic.csv", header = T, sep=",", check.names = F)

 levels(pa$influencia)
 [1] "AID" "AII" "AP"

 The object has 3 levels for influencia (AP, AID, AII)

 Now I subset only observations with influencia = AID

 pa2 = subset(pa, influencia=="AID")

 but if I ask for the levels of influencia, it still shows me the 3 levels,
 AP, AID, AII.

 levels(pa2$influencia)
 [1] "AID" "AII" "AP"

 Why is that?

 I was thinking that I was creating a new data frame with only AID as a
 level for influencia.

 How can I make a completely new object containing only the observations
 for AID, so that the only level for influencia is indeed AID?

 Best,

 Manuel



 --
 *Manuel Spínola, Ph.D.*
 Instituto Internacional en Conservación y Manejo de Vida Silvestre
 Universidad Nacional
 Apartado 1350-3000
 Heredia
 COSTA RICA
 mspin...@una.ac.cr
 mspinol...@gmail.com
 Teléfono: (506) 2277-3598
 Fax: (506) 2237-7036
 Personal website: Lobito de río
 https://sites.google.com/site/lobitoderio/
 Institutional website: ICOMVIS http://www.icomvis.una.ac.cr/



Re: [R-sig-eco] The final result of TWINSPAN

2011-04-26 Thread Dave Roberts

Dear List,

Earlier this year on an (undoubtedly ill-advised) lark I coded up 
an R version of TWINSPAN.  It's far from a polished package at this 
point, but the code does run.  One of the interesting features is that 
you can partition a PCO or NMDS in addition to the traditional CA. To be 
clear, I am not a TWINSPAN fan either, but I wanted it for a methods 
paper I was working on.


The problem is that I based the code on Hill, Bunce & Shaw (1975,
J. of Ecol. 63:597-613), which is what I had available.  Apparently the 
algorithm in the commercial TWINSPAN is significantly modified from the 
original, but I couldn't find a description of the actual algorithm 
anywhere in the literature.  It is probably described in the User Manual 
of the software, but I was not sufficiently motivated to chase down a 
copy.  I do have a copy of the FORTRAN code, but it was apparently 
written in FORTRAN II, and is basically inscrutable, even to an old 
FORTRAN dog like me.


So, if somebody has a clear description of the actual algorithm 
(and I think it is disturbing that I could not find one), it would be 
possible to code it up in native R.  The alternative, to write a wrapper 
for the original FORTRAN code, is not a trivial task.  I gave it a couple 
of days and gave up.


--

David W. Roberts          office 406-994-4548
Professor and Head        FAX    406-994-3190
Department of Ecology     email  drobe...@montana.edu
Montana State University
Bozeman, MT 59717-3460

On 04/14/2011 01:57 AM, Jari Oksanen wrote:

On 14/04/11 10:37 AM, Yong Zhang <2010202...@njau.edu.cn> wrote:


Dear all,

I conducted the two-way indicator species analysis using TWINSPAN program, and
following is the final result:

  0111
  00011011
  011000111
   01001001

I have to certify my analysis: I want to classify the above 24 sampling sites
into 3 major groups based on 7 biotic metrics. The names of my 24 samples are
site1 to site24, from left to right, and I set the cut levels to 0, 2, 5, 10,
20, the maximum level of divisions to 6, and the maximum group size for
division to 3.

Now, my question is whether my settings are correct. And how should I classify
these sites into 3 groups according to this final result?

Dear Yong Zhang,

This is not an R issue, because there is no TWINSPAN in R. However, the
answer to your question is that strictly speaking you cannot group your data
into three major groups with TWINSPAN. TWINSPAN is a bisection method so
that first division gives you two groups, and second splits each of these
into two groups so that the next choice is to have four groups. However, in
this case one of the groups was so small (3 plots were split off from the
others in the first division, and then these were split into groups of 2
plots and 1 plot) that you can probably ignore the second division of the
small group.

If your goal was as vague as wanting to "classify 24 sites into 3 major
groups" you could do better than use TWINSPAN: what's the problem with proper
classification methods in R? Moreover, have you checked that your biotic
metrics suit the pseudospecies cut level concept of TWINSPAN?
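
As one illustration of what a classification into exactly 3 groups could look like in base R, here is a minimal sketch using hierarchical clustering. The choice of hclust() with average linkage is an assumption made for the example, not a method recommended in this message, and the 24 x 7 matrix is random placeholder data standing in for the poster's biotic metrics.

```r
# Placeholder data: 24 sites (rows) by 7 biotic metrics (columns)
set.seed(42)
metrics <- matrix(rnorm(24 * 7), nrow = 24, ncol = 7,
                  dimnames = list(paste0("site", 1:24),
                                  paste0("metric", 1:7)))

# Standardize the metrics so that no single one dominates the distances
d <- dist(scale(metrics))

# Agglomerative clustering; unlike a bisection method, the resulting tree
# can be cut into any number of groups, including 3
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)

# Group membership per site and group sizes
groups
table(groups)
```

A bisection method yields 2, then 4, then 8 groups, whereas cutree() accepts any k between 1 and the number of sites.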

Cheers, jari oksanen



Re: [R-sig-eco] subsetting data in R

2011-04-26 Thread Manuel Spínola
Thank you very much Ben.

I was doing an indicator species analysis on the subsetted data, but the
other levels were still present in the subset and the analysis was
taking them into account.
My 3000 columns are plant species presence/absence data.

Best,

Manuel


Re: [R-sig-eco] The final result of TWINSPAN

2011-04-26 Thread Jari Oksanen
Dave,

Hill, Bunce & Shaw describe the general idea of TWINSPAN, but the
implementation is more complicated. Martin Kent and Paddy Coker do a great
job of explaining the twists in their book (Vegetation Description and
Analysis: A Practical Approach). If I remember correctly, the TWINSPAN
manual was also more detailed, but I lost it somewhere when I moved around
(for the kids: it was a bunch of paper; PDF had not yet been invented when
TWINSPAN was published).

I don't think that the actual TWINSPAN is easily extended beyond CA. Each
step is a two-stage one-dimensional ordination on a current subset, where
the first stage selects indicators and the second stage is polarized for the
indicator species. The final split is based on site ordination and
indicators are secondary (which we see in misclassifications if you try to
use the provided key for the data that was classified in TWINSPAN). The
polarization stage is particularly challenging when working with
dissimilarities (PCO, NMDS).

I don't think that the FORTRAN I have is completely impenetrable. I think
the largest problem is the design principle: R code should run silently and
return a result, but TWINSPAN prints as it goes along and returns only a part
of the result. Incorporating that in R would require stripping most PRINT and
WRITE statements and having the subroutines return useful data directly.

I also wrote a small, fun test of the TWINSPAN principle, where the splitting
and pre-defined pseudospecies were replaced with a regression tree split.
I'll send you a copy of that and the FORTRAN (IV, I think) code I have in a
separate message.

Cheers, Jari Oksanen
