Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

2024-01-06 Thread Andy

Hi Tim

This is brilliant - thank you!!

I've had to tweak the basePath line a bit (I am on a Linux machine), but 
having done that, the code works as intended. This is a truly helpful 
contribution that gives me ideas about how to work it through for the 
missing fields, which is one of the major sticking points I kept bumping 
up against.


Thank you so much for this.

All the best
Andy

On 05/01/2024 13:59, Howard, Tim G (DEC) wrote:

Here's a simplified version of how I would do it, using `textreadr` but 
otherwise base functions. I haven't done it
all, but have a few examples of finding the correct row then extracting the 
right data.
I made a duplicate of the file you provided, so this loops through the two 
identical files, extracts a few parts,
then sticks those parts in a data frame.

#
library(textreadr)

# recommend not using setwd(), but instead just include the
# path as follows
basePath <- file.path("C:","temp")
files <- list.files(path=basePath, pattern = "docx$")

length(files)
# 2

# initialize a list to put the data in
myList <- vector(mode = "list", length = length(files))

for(i in 1:length(files)){
   fileDat <- read_docx(file.path(basePath, files[[i]]))
   # get the data you want, here one line per item to make it clearer
   # assume consistency among articles
   ttl <- fileDat[[1]]
   src <- fileDat[[2]]
   dt <- fileDat[[3]]
   aut <- fileDat[grepl("Byline:",fileDat)]
   aut <- trimws(sub("Byline:","",aut), whitespace = "[\\h\\v]")
   pg <- fileDat[grepl("Pg.",fileDat)]
   pg <- as.integer(sub(".*Pg. ([[:digit:]]+)","\\1",pg))
   len <- fileDat[grepl("Length:", fileDat)]
   len <- as.integer(sub("Length:.{1}([[:digit:]]+) .*","\\1",len))
   myList[[i]] <- data.frame("title"  = ttl,
                             "source" = src,
                             "date"   = dt,
                             "author" = aut,
                             "page"   = pg,
                             "length" = len)
}

# roll up the list to a data frame. Many ways to do this.
myDF <- do.call("rbind",myList)

#
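Since the eventual aim downthread is a spreadsheet or CSV, the rolled-up
myDF can be written straight out with base R. A minimal sketch; the toy
data frame and the file name are illustrative stand-ins:

```r
# Toy stand-in for the rolled-up myDF produced by the loop above
myDF <- data.frame(title = "Example title", page = 16L, length = 515L)

# write.csv produces a file that Calc/Excel opens directly
write.csv(myDF, file = "articles.csv", row.names = FALSE)
```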

Hope that helps.
Tim





Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

2024-01-04 Thread Andy

Hi folks

Thanks for your help and suggestions - very much appreciated.

I now have some working code, using this file I uploaded for public 
access: 
https://docs.google.com/document/d/1QwuaWZk6tYlWQXJ3WLczxC8Cda6zVERk/edit?usp=sharing&ouid=103065135255080058813&rtpof=true&sd=true 



The small code segment that now works is as follows:

###

# Load libraries
library(textreadr)
library(tcltk)
library(tidyverse)
#library(officer)
#library(stringr) #for splitting and trimming raw data
#library(tidyr) #for converting to wide format

# I'd like to keep this as it enables more control over the selected 
directories

filepath <- setwd(tk_choose.dir())

# The following correctly lists the names of all 9 files in my test 
directory

files <- list.files(filepath, ".docx")
files
length(files)

# Ideally, I'd like to skip this step by being able to automatically 
read in the name of each file, but one step at a time:
filename <- "Now they want us to charge our electric cars from litter 
bins.docx"


# This produces the file content as output when run, and identifies the 
fields that I want to extract.

read_docx(filename) %>%
  str_split(",") %>%
  unlist() %>%
  str_trim()

###

What I'd like to try and accomplish next is to extract the data from 
selected fields and append to a spreadsheet (Calc or Excel) under 
specific columns, or if it is easier to write a CSV which I can then use 
later.


The fields I want to extract are illustrated with reference to the above 
file, viz.:


The title: "Now they want us to charge our electric cars from litter bins"
The name of the newspaper: "Mail on Sunday (London)"
The publication date: "September 24, 2023" (in date format, preferably 
separated into month and year (day is not important))

The section: "NEWS"
The page number(s): "16" (as numeric)
The length: "515" (as numeric)
The author: "Anna Mikhailova"
The subject: from the Subject section, but this is to match a value e.g. 
GREENWASHING >= 50% (here this value is 51% so would be included). A 
match moves onto select the highest value under the section "Industry" 
(here it is ELECTRIC MOBILITY (91%)) and appends this text and % value. 
If no match with 'Greenwashing', then appends 'Null' and moves onto the 
next file in the directory.
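A base-R sketch of that threshold test, assuming each subject appears on
its own line in the form 'KEYWORD (NN%)' (the exact Lexis+ layout may
differ; the sample strings below are illustrative):

```r
# Illustrative subject lines; the real Lexis+ layout may differ
subjects <- c("GREENWASHING (51%)", "ELECTRIC MOBILITY (91%)")

# Return the percentage for a keyword, or NA if the keyword is absent
extract_pct <- function(lines, keyword) {
  hit <- grep(keyword, lines, value = TRUE)
  if (length(hit) == 0) return(NA_integer_)
  as.integer(sub(".*\\((\\d+)%\\).*", "\\1", hit[1]))
}

pct  <- extract_pct(subjects, "GREENWASHING")  # 51
keep <- !is.na(pct) && pct >= 50               # TRUE, so keep this article
```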


###

The theory I am working with is if I can figure out how to extract these 
fields and append correctly, then the rest should just be wrapping this 
up in a for loop.


However, I am struggling to get my head around the extraction and append 
part. If I can get it to work for one of these fields, I suspect that I 
can repeat the basic syntax to extract and append the remaining fields.
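For the date field above, base R can parse the "September 24, 2023"
layout and split out month and year. A sketch, assuming an English
locale for the month name:

```r
# Parse the Lexis+ date string, then format out the pieces.
# %B (full month name) assumes an English locale.
d  <- as.Date("September 24, 2023", format = "%B %d, %Y")
yr <- format(d, "%Y")   # "2023"
mo <- format(d, "%B")   # "September"
```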


Therefore, if someone can either suggest a syntax or point me to a 
useful tutorial, that would be splendid.


Thank you in anticipation.

Best wishes
Andy



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

2023-12-30 Thread Andy
An update: Running this block of code:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

filepath <- setwd(tk_choose.dir())

filename <- "Now they want us to charge our electric cars from litter 
bins.docx"

#full_filename <- paste0(filepath, filename)
full_filename <- paste(filepath, filename, sep="/")

if (!file.exists(full_filename)) {
   message("File missing")
} else {
   content <- read_docx(full_filename) |>
     docx_summary()
   # this reads docx for the full filename and
   # passes it ( |> command) to the next line
   # which summarises it.
   # the result is saved in a data frame object
   # called content which we shall show some
   # heading into from

   head(content)
}


Results in this error now: Error in x$doc_obj : $ operator is invalid 
for atomic vectors

Thank you.
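One possible cause, offered here as an assumption rather than something
confirmed in the thread: if textreadr was also attached in the session
(as in earlier attempts), its read_docx masks officer's, so
docx_summary() receives a plain character vector instead of an rdocx
object, which produces exactly this "$ operator is invalid" message.
Which package a name currently resolves to can be checked with base
tools (shown with a base function; substitute read_docx in a live
session):

```r
# Which environment does a function come from?
environmentName(environment(mean))   # "base"
find("mean")                         # "package:base"

# Being explicit sidesteps masking entirely, e.g.
#   content <- officer::read_docx(full_filename) |> officer::docx_summary()
```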



On 30/12/2023 12:12, Andy wrote:
> Hi Eric
>
> Thanks for that. That seems to fix one problem (the lack of a 
> separator), but introduces a new one when I complete the function 
> Calum proposed: Error in docx_summary() : argument "x" is missing, with 
> no default
>
> The whole code so far looks like this:
>
>
> # Load libraries
> library(tcltk)
> library(tidyverse)
> library(officer)
>
> filepath <- setwd(tk_choose.dir())
>
> filename <- "Now they want us to charge our electric cars from litter 
> bins.docx"
> #full_filename <- paste0(filepath, filename) # Calum's original suggestion
>
> full_filename <- paste(filepath, filename, sep="/") # Eric's proposed fix
>
> #lets double check the file does exist! # The rest here is Calum's 
> suggestion
> if (!file.exists(full_filename)) {
>   message("File missing")
> } else {
>   content <- read_docx(full_filename)
>   docx_summary()
>   # this reads docx for the full filename and
>   # passes it ( |> command) to the next line
>   # which summarises it.
>   # the result is saved in a data frame object
>   # called content which we shall show some
>   # heading into from
>
>   head(content)
> }
>
>
> Running this, results in the error cited above.
>
> Thanks as always :-)
>
>
>
>
> On 30/12/2023 11:58, Eric Berger wrote:
>> full_filename <- paste(filepath, filename,sep="/")
>
>



Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

2023-12-30 Thread Andy
Hi Eric

Thanks for that. That seems to fix one problem (the lack of a 
separator), but introduces a new one when I complete the function Calum 
proposed: Error in docx_summary() : argument "x" is missing, with no default

The whole code so far looks like this:


# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

filepath <- setwd(tk_choose.dir())

filename <- "Now they want us to charge our electric cars from litter 
bins.docx"
#full_filename <- paste0(filepath, filename) # Calum's original suggestion

full_filename <- paste(filepath, filename, sep="/") # Eric's proposed fix

#lets double check the file does exist! # The rest here is Calum's 
suggestion
if (!file.exists(full_filename)) {
   message("File missing")
} else {
   content <- read_docx(full_filename)
   docx_summary()
   # this reads docx for the full filename and
   # passes it ( |> command) to the next line
   # which summarises it.
   # the result is saved in a data frame object
   # called content which we shall show some
   # heading into from

   head(content)
}


Running this, results in the error cited above.
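The error is the classic symptom of a function called with no argument:
in the block above, docx_summary() sits on its own line with nothing
piped into it. The same shape, sketched with a base function so it runs
anywhere (the |> pipe needs R >= 4.1):

```r
# The |> pipe supplies the first argument, so sort() receives x.
# Calling sort() bare would raise
#   'argument "x" is missing, with no default'
# -- the same error docx_summary() gives when the pipe is dropped.
x <- c(3, 1, 2)
sorted <- x |> sort()   # equivalent to sort(x)

# The fix in the snippet above is therefore to restore the pipe:
#   content <- read_docx(full_filename) |>
#     docx_summary()
```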

Thanks as always :-)




On 30/12/2023 11:58, Eric Berger wrote:
> full_filename <- paste(filepath, filename,sep="/")




Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

2023-12-30 Thread Andy

Good idea, El - thanks.

The link is 
https://docs.google.com/document/d/1QwuaWZk6tYlWQXJ3WLczxC8Cda6zVERk/edit?usp=sharing&ouid=103065135255080058813&rtpof=true&sd=true


This is helpful.

From the article, which is typical of Lexis+ output, I want to extract 
the following fields and append to a Calc/ Excel spreadsheet. Given the 
volume of articles I have to work through, if this can be iterative and 
semi-automatic, that would be a god send and I might be able to do some 
actual research on the articles before I reach my pensionable age. :-)


Title
Newspaper
Date
Section and page number
Length
Byline
Subject (only if the threshold of coverage for a specific subject is 
>=50% is reached (e.g. Greenwashing (51%)) - if not, enter 'nil' and 
move onto the next article in the folder


This is the ambition. I am clearly a long way short of that though.

Many thanks.
Andy


On 30/12/2023 00:08, Dr Eberhard W Lisse wrote:

Andy,

you can always open a public Dropbox or Google folder and post the link.

el

On 29/12/2023 22:37, Andy wrote:

Thanks - I'll have a look at these options too.

I'm happy to send over a sample document, but wasn't aware if
attachments are allowed. The documents come Lexis+, so require user
  credentials to log in, but I could upload the file somewhere if
that would help? Any ideas for a good location to do so?

[...]



Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

2023-12-30 Thread Andy
Thanks Ivan and Calum

I continue to appreciate your support.

Calum, I entered the code snippet you provided, and it returns 'file 
missing'. Looking at this, while the object 'full_filename' exists, what 
is happening is that the path from getwd() is being appended to the 
title of the article, but without the '/' between the end of the path 
name (here 'TEST' and the name of the article. In other words, 
full_filename is reading "~/TESTNow they want us to charge our electric 
cars from litter bins.docx", so logically, this file doesn't exist. To 
work, the '/' needs to be inserted to differentiate between the end of 
the path name and the start of the article name. I've tried both paste0, 
as you suggested, and paste but neither do the trick.
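For what it's worth, base R's file.path() inserts the separator
automatically, which sidesteps the missing-'/' problem regardless of how
the directory was chosen. A small sketch with an illustrative path:

```r
# file.path() puts the separator between components, so no manual
# "/" bookkeeping is needed
filepath <- "~/TEST"   # stand-in for the chosen directory
filename <- "Now they want us to charge our electric cars from litter bins.docx"
full_filename <- file.path(filepath, filename)
```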

Is this a result of me using the tcltk folder selection that you 
remarked on? I wanted to keep that so that the selection is interactive, 
but if there are better ways of doing this I am open to suggestions.

Thanks again, both.

Best wishes
Andrew


On 29/12/2023 22:25, CALUM POLWART wrote:
>
>
> help(read_docx) says that the function only imports one docx file. In
> order to read multiple files, use a for loop or the lapply function.
>
>
> I told you people will suggest better ways to loop!!
>
>
>
> docx_summary(read_docx("Now they want us to charge our electric cars
> from litter bins.docx")) should work.
>
>
> Ivan thanks for spotting my fail! Since the OP is new to all this I'm 
> going to suggest a little tweak to this code which we can then build 
> into a for loop:
>
> filepath <- getwd() #you will want to change this later. You are doing 
> something with tcl to pick a directory which seems rather fancy! But 
> keep doing it for now or set the directory here ending in a /
>
> filename <- "Now they want us to charge our electric cars from litter 
> bins.docx"
>
> full_filename <- paste0(filepath, filename)
>
> #lets double check the file does exist!
> if (!file.exists(full_filename)) {
>   message("File missing")
> } else {
>   content <- read_docx(full_filename) |>
>     docx_summary()
>     # this reads docx for the full filename and
>     # passes it ( |> command) to the next line
>     # which summarises it.
>     # the result is saved in a data frame object
>     # called content which we shall show some
>     # heading into from
>
>    head(content)
> }
>
> Let's get this bit working before we try and loop
>



Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

2023-12-29 Thread Andy

Thanks - I'll have a look at these options too.

I'm happy to send over a sample document, but wasn't aware if 
attachments are allowed. The documents come Lexis+, so require user 
credentials to log in, but I could upload the file somewhere if that 
would help? Any ideas for a good location to do so?



On 29/12/2023 20:25, Dr Eberhard W Lisse wrote:

I would also look at https://pandoc.org perhaps which can
export a number of formats...

And for spreadsheets https://github.com/jqnatividad/qsv is my
goto weapon.  Can also read and write XLSX and others.

A sample document or two would always be helpful...

el

On 29/12/2023 21:01, CALUM POLWART wrote:

It sounded like he looked at officeR but I would agree

content <- officer::docx_summary("filename.docx")

Would get the text content into an object called content.

That object is a data.frame so you can then manipulate it.
To be more specific, we might need an example of the DF

[...]

On Fri, Dec 29, 2023 at 10:14 AM Andy 
wrote:

[...]

I'd like to be able to accomplish the following:

(1) Append the title, the month, the author, the number of
words, and page number(s) to a spreadsheet

(2) Read each article and extract keywords (in the docs,
these are listed in 'Subject' section as a list of
keywords with a percentage showing the extent to which the
keyword features in the article (e.g., FAST FASHION (72%))
and to append the keyword and the % coverage to the same
row in the spreadsheet.  However, I want to ensure that
the keyword coverage meets the threshold of >= 50%; if
not, then pass onto the next article in the directory.
Rinse and repeat for the entire directory.

[...]



Re: [R] Help request: Parsing docx files for key words and appending to a spreadsheet

2023-12-29 Thread Andy
Hi Roy (& others)

Many thanks for the advice - well taken. Thanks also to the others who 
have responded so quickly - I thought I might have to wait days!! :-)

I'm on a Linux (Mint) machine. Below, I document three attempts, two 
using officer and the last now using textreadr

My attempts so far using 'officer':

##

(1) First Attempt:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

setwd(tk_choose.dir())

doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)

files <- list.files(getwd(), ".docx")
files
length(files)

## This works to here - obtain a list of docx files in directory 'TEST 
with 9 files'. However, the next line
doc_in <- read_docx(files)

Results in this error: Error in filetype %in% c("docx") && 
grepl("^([fh]ttp)", file) : 'length = 9' in coercion to 'logical(1)'

No idea how to debug that.

Even when trying Calum's suggestion with officer:

content <- officer::docx_summary("Now they want us to charge our 
electric cars from litter bins.docx") # A title of one of the articles

The error returned is: Error in x$doc_obj : $ operator is invalid for 
atomic vectors


##
(2) Second Attempt:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

setwd(tk_choose.dir())

doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)

files <- list.files(getwd(), ".docx")
files
length(files)

docx_summary(doc_path, preserve = FALSE)
## At this point, the error is: Error in x$doc_obj : $ operator is 
invalid for atomic vectors

So, not sure how I am passing an atomic vector or if there is something 
I am supposed to set to make this something else?

##
(3) Third attempt - now trying with textreadr (Thanks for the help on 
installing this, Calum):

# Load libraries
library(tcltk)
library(tidyverse)
library(textreadr)

folder <- setwd(tk_choose.dir())

files <- list.files(folder, ".docx")
files
length(files)

doc <- read_docx("Now they want us to charge our electric cars from 
litter bins.docx") # One of the 9 files in the folder

read_docx(doc, skip = 0, remove.empty = TRUE, trim = TRUE) # To test 
against one file

## The last line returns the following error: Error in filetype %in% 
c("docx") && grepl("^([fh]ttp)", file) : 'length = 38' in coercion to 
'logical(1)'

##
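The 'length = 38' message is consistent with read_docx() receiving the
38 lines of already-read text held in doc rather than a file path: it
expects a single path, so iteration belongs outside it (for example with
lapply, as the help page quoted later in the thread suggests). The
pattern, sketched with base readLines() so it runs anywhere:

```r
# One path in, one character vector of lines out; lapply() covers a
# whole directory. (readLines stands in for textreadr::read_docx here.)
tmp <- tempfile(fileext = ".txt")
writeLines(c("Title line", "Byline: Someone"), tmp)

files <- tmp                      # stands in for the vector of .docx paths
docs  <- lapply(files, readLines) # NOT readLines(files) on the whole vector
length(docs)                      # one element per file
```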
And so I am going around in circles and not at all clear on how I can 
make progress.

I am sure that there must be a way, but the suggestions on-line each 
lead to the above errors.

Thanks for any further help.

Best wishes, and thanks
Andy


On 29/12/2023 18:25, Roy Mendelssohn - NOAA Federal wrote:
> Hi Andy:
>
> I don’t have an answer but I do have what I hope is some friendly advice.  
> Generally the more information you can provide,  the more likely you will get 
> help that is useful.  In your case you say that you tried several packages 
> and they didn’t do what you wanted.  Providing that code,  as well as why 
> they didn’t do what you wanted (be specific)  would greatly facilitate things.
>
> Happy new year,
>
> -Roy
>
>
>> On Dec 29, 2023, at 10:14 AM, Andy  wrote:
>>
>> Hello
>>
>> I am trying to work through a problem, but feel like I've gone down a rabbit 
>> hole. I'd very much appreciate any help.
>>
>> The task: I have several directories of multiple (some directories, up to 
>> 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that I want 
>> to iterate through to append to a spreadsheet only those articles that 
>> satisfy a condition (i.e., a specific keyword is present for >= 50% coverage 
>> of the subject matter). Lexis+ has a very specific structure and keywords 
>> are given in the row "Subject".
>>
>> I'd like to be able to accomplish the following:
>>
>> (1) Append the title, the month, the author, the number of words, and page 
>> number(s) to a spreadsheet
>>
>> (2) Read each article and extract keywords (in the docs, these are listed in 
>> 'Subject' section as a list of keywords with a percentage showing the extent 
>> to which the keyword features in the article (e.g., FAST FASHION (72%)) and 
>> to append the keyword and the % coverage to the same row in the spreadsheet. 
>> However, I want to ensure that the keyword coverage meets the threshold of 
>> >= 50%; if not, then pass onto the next article in the directory. Rinse and 
>> repeat for the entire directory.
>>
>> So far, I've tried working through some Stack Overflow-based solutions, but 
>> most seem to use the textreadr package, which is now deprecated; others use 
>> either the officer or the officedown packages. 

[R] Help request: Parsing docx files for key words and appending to a spreadsheet

2023-12-29 Thread Andy

Hello

I am trying to work through a problem, but feel like I've gone down a 
rabbit hole. I'd very much appreciate any help.


The task: I have several directories of multiple (some directories, up 
to 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that 
I want to iterate through to append to a spreadsheet only those articles 
that satisfy a condition (i.e., a specific keyword is present for >= 50% 
coverage of the subject matter). Lexis+ has a very specific structure 
and keywords are given in the row "Subject".


I'd like to be able to accomplish the following:

(1) Append the title, the month, the author, the number of words, and 
page number(s) to a spreadsheet


(2) Read each article and extract keywords (in the docs, these are 
listed in 'Subject' section as a list of keywords with a percentage 
showing the extent to which the keyword features in the article (e.g., 
FAST FASHION (72%)) and to append the keyword and the % coverage to the 
same row in the spreadsheet. However, I want to ensure that the keyword 
coverage meets the threshold of >= 50%; if not, then pass onto the next 
article in the directory. Rinse and repeat for the entire directory.


So far, I've tried working through some Stack Overflow-based solutions, 
but most seem to use the textreadr package, which is now deprecated; 
others use either the officer or the officedown packages. However, these 
packages don't appear to do what I want the program to do, at least not 
in any of the examples I have found, nor in the vignettes and relevant 
package manuals I've looked at.


The first question is: is what I am intending to do even possible using 
R? If it is, then where do I start with this? If these docx files were 
converted to UTF-8 plain text, would that make the task easier?


I am not a confident coder, and am really only just getting my head 
around R so appreciate a steep learning curve ahead, but of course, I 
don't know what I don't know, so any pointers in the right direction 
would be a big help.


Many thanks in anticipation

Andy



Re: [R] checkpointing

2021-12-14 Thread Andy Jacobson via R-help

I have been using DMTCP successfully for a long-running optim() task. This is a 
single-core process running on a large linux cluster with slurm as the job 
manager. This cluster places an 8-hour limit on individual jobs, and since my 
cost function takes 11 minutes to compute, I need many such jobs run 
sequentially. To make DMTCP work, I have had to rework file I/O to avoid 
references to temporary files written to /tmp, but other than that...optim() is 
checkpointed just before 8 hours is up, and then resumed successfully in a 
subsequent batch job running on a different core of the cluster.

While I have an answer for my particular task, it would still be useful to 
checkpoint using the scheme Henrik suggests. Thanks all for the interesting 
conversation!

-Andy



On 12/14/21 5:39 PM, Henrik Bengtsson wrote:

On Tue, Dec 14, 2021 at 1:17 AM Andy Jacobson  wrote:


Those are good points, Duncan. I am experimenting with a nice checkpointing 
tool called DMTCP. It operates on the system level but is quite OS-dependent. 
It can be found at http://dmtcp.sourceforge.net/index.html.

Still, it would be nice to be able to checkpoint calls within R to potentially 
long-running processes like optim().


Teasing idea. Imagine if we could come up with some de-facto standard
API for this and that such a framework could be called automatically
by R. Something similar to how user interrupts are checked (e.g.
R_CheckUserInterrupt()) on a regular basis by the R engine and
through-out the R code. That could help troubleshooting and debugging,
e.g. sending the checkpoint to someone else or going backwards in
time.

Pasting in the below since I failed to hit Reply *All* the other day,
and it was only Richard who got it:

A few weeks ago, I played around with DMTCP (Distributed MultiThreaded
CheckPointing ) for Linux (https://github.com/dmtcp/dmtcp).  I'm
sharing in case someone is interested in investigating this further.
Also, somewhere on the DMTCP wiki, they asked for testing with R by
more experienced users.

"DMTCP is a tool to transparently checkpoint the state of multiple
simultaneous applications, including multi-threaded and distributed
applications. It operates directly on the user binary executable,
without any Linux kernel modules or other kernel modifications."

They seem to be able to run this with HPC jobs, open files, Linux
containers, and even MPI, and so on.  I've only tested it very quickly
with interactive R and it seems to work.  Obviously more testing needs
to be done to identify when it doesn't work.  For example, I'd have a
hard time believing it would work out of the box with local parallel PSOCK
workers.  They mention "plug-ins", so maybe there's a way to add
support for specific use cases one by one.

Different academic HPC environments appear to use it, e.g.

* https://docs.nersc.gov/development/checkpoint-restart/dmtcp/
* http://wiki.orc.gmu.edu/mkdocs/Creating_Checkpoints_%28DMTCP%29/
* https://wiki.york.ac.uk/display/RCS/VK21%29+Checkpointing+with+DMTCP

That's all I have time for now,

Henrik



-Andy

On 12/13/21 11:51 AM, Duncan Murdoch wrote:

On 13/12/2021 12:58 p.m., Greg Minshall wrote:

Jeff,


This sounds like an OS feature, not an R feature... certainly not a
portable R feature.


i'm not arguing for it, but this seems to me like something that could
be a language feature.



R functions can call libraries written in other languages, and can start 
processes, etc.  R doesn't know everything going on in every function call, and 
would have a lot of trouble saving it.

If you added some limitations, e.g. a process that periodically has its entire 
state stored in R variables, then it would be a lot easier.

Duncan Murdoch


--
Andy Jacobson
a...@yovo.org



--
Andy Jacobson
andy.jacob...@noaa.gov

NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305

303/497-4916



Re: [R] checkpointing

2021-12-14 Thread Andy Jacobson

Those are good points, Duncan. I am experimenting with a nice checkpointing 
tool called DMTCP. It operates on the system level but is quite OS-dependent. 
It can be found at http://dmtcp.sourceforge.net/index.html.

Still, it would be nice to be able to checkpoint calls within R to potentially 
long-running processes like optim().
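One workable pattern inside plain R, in the spirit of state periodically
stored in R variables: write the loop so each iteration persists its
state with saveRDS() and resumes from the file if it exists. A minimal
sketch; optim() itself cannot be resumed mid-call this way, but a
hand-rolled iteration, or restarting optim() from the last saved
parameters, can:

```r
# Minimal restartable loop: persist state each iteration, resume from
# the checkpoint file if a previous run left one behind.
ckpt  <- file.path(tempdir(), "state.rds")
state <- if (file.exists(ckpt)) readRDS(ckpt) else list(i = 0L, total = 0)

while (state$i < 10L) {
  state$i     <- state$i + 1L
  state$total <- state$total + state$i   # the "expensive" work
  saveRDS(state, ckpt)                   # checkpoint after each step
}
```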

-Andy

On 12/13/21 11:51 AM, Duncan Murdoch wrote:

On 13/12/2021 12:58 p.m., Greg Minshall wrote:

Jeff,


This sounds like an OS feature, not an R feature... certainly not a
portable R feature.


i'm not arguing for it, but this seems to me like something that could
be a language feature.



R functions can call libraries written in other languages, and can start 
processes, etc.  R doesn't know everything going on in every function call, and 
would have a lot of trouble saving it.

If you added some limitations, e.g. a process that periodically has its entire 
state stored in R variables, then it would be a lot easier.

Duncan Murdoch


--
Andy Jacobson
a...@yovo.org



[R] checkpointing

2021-12-13 Thread Andy Jacobson via R-help

Has anyone ever considered what it would take to implement checkpointing in R, 
so that long-running processes could be interrupted and resumed later, from a 
different process or even a different machine?

Thanks,

Andy

--
Andy Jacobson
andy.jacob...@noaa.gov

NOAA Global Monitoring Lab
325 Broadway
Boulder, Colorado 80305

303/497-4916



Re: [R] levels

2020-07-18 Thread andy elprama
Thanks, I will check it out.

Op za 18 jul. 2020 om 00:47 schreef Chris Gordon-Smith <
c.gordonsm...@gmail.com>:

> There is an interesting item on stringsAsFactors in this useR! 2020
> session:
>
> https://www.youtube.com/watch?v=X_eDHNVceCU
>
> It's about 27 minutes in.
>
> Chris Gordon-Smith
> On 15/07/2020 17:16, Marc Schwartz via R-help wrote:
>
> On Jul 15, 2020, at 4:31 AM, andy elprama  
>  wrote:
>
> Dear R-users,
>
> Something strange happened within the command "levels"
>
> R version 3.6.1
> name <- c("a","b","c")
> values <- c(1,2,3)
> data <- data.frame(name,values)
> levels(data$name)
> [1] "a" "b" "c"
>
> R version 4.0
> name <- c("a","b","c")
> values <- c(1,2,3)
> data <- data.frame(name,values)
> levels(data$name)
> [1] NULL
>
> What is happening here?
>
> Hi,
>
> The default value for 'stringsAsFactors' for data.frame() and read.table() 
> changed from TRUE to FALSE in version 4.0.0, per the news() file:
>
> "R now uses a stringsAsFactors = FALSE default, and hence by default no 
> longer converts strings to factors in calls to data.frame() and read.table()."
>
>
> Using 4.0.2:
>
> data <- data.frame(name, values, stringsAsFactors = TRUE)
>
>
> levels(data$name)
>
> [1] "a" "b" "c"
>
>
> If you see behavioral changes from one version of R to another, especially 
> major version increments, check the news() file.
>
> Regards,
>
> Marc Schwartz
>
>


[R] levels

2020-07-15 Thread andy elprama
Dear R-users,

Something strange happened within the command "levels"

R version 3.6.1
name <- c("a","b","c")
values <- c(1,2,3)
data <- data.frame(name,values)
levels(data$name)
[1] "a" "b" "c"

R version 4.0
name <- c("a","b","c")
values <- c(1,2,3)
data <- data.frame(name,values)
levels(data$name)
[1] NULL

What is happening here?



Re: [R] regular expression, stringr::str_view, grep

2020-04-29 Thread Andy Spada

This highlights the literal meaning of the last ] in your correct_brackets:

aff <- c("affgfk]ing", "fgok", "rafgkah]e","a fgk", "bafghk]")

To me, too, the missing_brackets looks more like what was desired, and
returns correct results for a PCRE. Perhaps the regular expression
should have been rewritten:

desired_brackets <- "af+g[^m$][^A-Z]"
grep(desired_brackets, aff, value = TRUE) ### correct result
str_view(aff, desired_brackets) ### correct result

Regards,
Andy


On 28.04.2020 18:41:50, David Winsemius wrote:


On 4/28/20 2:29 AM, Sigbert Klinke wrote:

Hi,

we gave students the task to construct a regular expression selecting
some texts. One send us back a program which gives different results
on stringr::str_view and grep.

The problem is "[^[A-Z]]" / "[^[A-Z]" at the end of the regular
expression. I would have expected that all four calls would give the
same result; interpreting [ and ] within [...] as the characters `[`
and `]`. Obviously this not the case and moreover stringr::str_view
and grep interpret the regular expressions differently.

Any ideas?

Thanks Sigbert

---

aff <- c("affgfking", "fgok", "rafgkahe","a fgk", "bafghk", "affgm",
 "baffgkit", "afffhk", "affgfking", "fgok", "rafgkahe", "afg.K",
 "bafghk", "aff gm", "baffg kit", "afffhgk")


TL;DR: different versions of regex character class syntax:




correct_brackets <- "af+g[^m$][^[A-Z]]"

To me that looks "incorrect" because of an unnecessary square-bracket.

missing_brackets <- "af+g[^m$][^[A-Z]"

And that one looks complete. To my mind it looks like the negation of
a character class with "[" and the range A-Z.


library("stringr")



I think this is the root of your problem. If you execute ?regex you
should be given a choice of two different help pages and if you go to
the one from pkg stringr it says in the Usage section:

regex
The default. Uses ICU regular expressions.

So that's probably different than the base regex convention which uses
TRE regular expressions.


You should carefully review:


help('stringi-search-charclass', package = "stringi")

 I think you should also find that adding square brackets around ranges
is not needed in either type of regex syntax, but that stringi's regex
(unlike base R's TRE regex) does allow multiple disjoint ranges inside
the outer square brackets of a character class. I've never seen that
in base R regex. So I think that this base regex pattern,
grepl("([a-c]|[r-t])", letters), matches the same strings as this stringi
pattern: str_view(letters, "[[a-c][r-t]]").
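The disjoint-range point can be checked directly (a quick sketch, assuming the stringr package is installed):

```r
library(stringr)

# base R (TRE): union of two ranges spelled as an alternation
base_hits <- letters[grepl("^([a-c]|[r-t])$", letters)]

# stringr (ICU): nested character classes inside one outer set
icu_hits <- letters[str_detect(letters, "^[[a-c][r-t]]$")]

identical(base_hits, icu_hits)   # TRUE: both select a b c r s t
```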






[R] nlme::gls potential bug

2019-01-31 Thread Andy Beet via R-help
Hi there,


I have been using the nlme::gls package created in R to fit a pretty 
simple model (linear with AR error)

y(t) = beta*x(t) + e(t),  where e(t) = rho*e(t-1) + Z(t)  and  Z(t) ~ N(0, sig^2)

I call the R routine

glsObj <- nlme::gls(y ~ x -1, data=data, correlation = 
nlme::corAR1(form= ~x), method="ML")

All seems fine.


In addition, I have also coded the likelihood myself and maximized it 
for beta, rho and sigma.

I get the exact same estimates of beta and rho, (as nlme::gls) but the 
estimate of sigma is not the same and i can not figure out why.

The maximum likelihood estimator for sigma under this model is

sig^2 = ( (1-rho^2)*u(1)^2 + sum( (u(t) - rho*u(t-1))^2 ) ) / n

where the sum is t=2,...,n and

u(t) = y(t) - X(t)*beta
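Transcribed directly into R, the estimator above looks like this (a sketch of the formula only, with beta and rho taken as already estimated):

```r
# ML estimate of sigma^2 for y(t) = beta*x(t) + e(t) with AR(1) errors;
# u are the raw residuals u(t) = y(t) - x(t)*beta.
ml_sigma2 <- function(y, x, beta, rho) {
  u <- y - x * beta
  n <- length(u)
  ((1 - rho^2) * u[1]^2 + sum((u[-1] - rho * u[-n])^2)) / n
}
```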


I have read the mixed-effects models in S and S-Plus book (nlme::gls 
code is based directly on this) and this problem is specified on page 
204 eq (5.5). I have also calculated sigma based on (5.7) -after the 
transformation documented (5.2) -and i do not get the same value as 
either the package or my implementation.

Any advice would be most welcomed. Is there a bug in the estimation of 
sigma in this package?

Thanks

Andy

-- 
Andy Beet
Ecosystem Dynamics & Assessment Branch
Northeast Fisheries Science Center
NOAA Fisheries Service
166 Water Street
Woods Hole, MA 02543
tel: 508-495-2073




Re: [R] word stemming for corpus linguistics

2016-07-26 Thread Andy Wolfe

Hi

Thanks for following up on this thread.

I've opted for this, albeit circuitous, route: use the tm package to 
stem the document and then use writeCorpus to write the stemmed document 
to disk, so that I can open it up and do the concordancing piece.
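For later readers of the archive, that route is roughly the following (directory paths are hypothetical; assumes the tm and SnowballC packages):

```r
library(tm)

# read plain-text documents, stem each one, then write the stemmed copies
# back to disk so they can be concordanced with other tools
corpus <- VCorpus(DirSource("texts/"))
corpus <- tm_map(corpus, stemDocument)   # Porter stemmer via SnowballC
writeCorpus(corpus, path = "stemmed/")
```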


Many thanks - this'll do me fine until I come across a better (read, 
more elegant) solution.

Best
Andy


On 26/07/16 14:05, Paul Johnston wrote:

Hi

I use the tm_map() with stemDocument used as an argument

Looking at a particular file before stemming

writeLines(as.character(data_mined_volatile[[1]]))

## The European Union is a "force for social injustice" which backs "the haves 
rather than the have-nots", Iain Duncan Smith has said.
## The ex-work and pensions secretary said "uncontrolled migration" drove down 
wages and increased the cost of living.
## He appealed to people "who may have done OK from the EU" to "think about the 
people that haven't".
## But Labour's Alan Johnson said the EU protected workers and stopped them from being 
"exploited".
## The former Labour home secretary accused the Leave campaign of dismissing such 
protections as "red tape".
## In other EU referendum campaign developments:
## Thirteen former US secretaries of state and defence and national security advisers, including 
Madeleine Albright and Leon Panetta, say in a letter to the Times that the UK's "place and 
influence" in the world would be diminished if it left the EU - and Europe would be 
"dangerously weakened"
## A British Chambers of Commerce survey suggests most business people back 
Remain but the gap with those backing Leave has narrowed.
## Five former heads of Nato claimed the UK would lose influence and "give succour 
to its enemies" by leaving the EU - claims dismissed as scaremongering by Boris 
Johnson
## Mr Corbyn is launching his party's battle bus, saying Labour votes will be 
crucial if the Remain side is to win
## The official Scottish campaign to keep the UK in the European Union is due 
to be launched in Edinburgh
## Mr Duncan Smith's speech came after he told the Sun Germany had a "de facto veto" over 
David Cameron's EU renegotiations, with Angela Merkel blocking the PM's plans for an 
"emergency brake" on EU migration.
## Downing Street said curbs it negotiated on in-work benefits for EU migrants were a 
"more effective" way forward.
## Follow the latest developments on BBC EU referendum live
## Laura Kuenssberg: Can Leave win over the have-nots


Now look at the same text after stemming

corpus <- data_mined_volatile
corpus <- tm_map(corpus,stemDocument)

writeLines(as.character(corpus[[1]]))

## The European Union is a "forc for social injustice" which back "the have rather 
than the have-nots", Iain Duncan Smith has said.
## The ex-work and pension secretari said "uncontrol migration" drove down wage 
and increas the cost of living.
## He appeal to peopl "who may have done OK from the EU" to "think about the peopl 
that haven't".
## But Labour Alan Johnson said the EU protect worker and stop them from be 
"exploited".
## The former Labour home secretari accus the Leav campaign of dismiss such protect as 
"red tape".
## In other EU referendum campaign developments:
## Thirteen former US secretari of state and defenc and nation secur advisers, includ Madelein 
Albright and Leon Panetta, say in a letter to the Time that the UK "place and influence" 
in the world would be diminish if it left the EU - and Europ would be "danger weakened"
## A British Chamber of Commerc survey suggest most busi peopl back Remain but 
the gap with those back Leav has narrowed.
## Five former head of Nato claim the UK would lose influenc and "give succour to it 
enemies" by leav the EU - claim dismiss as scaremong by Bori Johnson
## Mr Corbyn is launch his parti battl bus, say Labour vote will be crucial if 
the Remain side is to win
## The offici Scottish campaign to keep the UK in the European Union is due to 
be launch in Edinburgh
## Mr Duncan Smith speech came after he told the Sun Germani had a "de facto veto" over 
David Cameron EU renegotiations, with Angela Merkel block the PM plan for an "emerg 
brake" on EU migration.
## Down Street said curb it negoti on in-work benefit for EU migrant were a "more 
effective" way forward.
## Follow the latest develop on BBC EU referendum live
## Laura Kuenssberg: Can Leav win over the have-not

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Andy Wolfe
Sent: 26 July 2016 09:14
To: r-help@r-project.org
Subject: Re: [R] word stemming for corpus linguistics

Hi Paul

I have seen this - it's part of the tm package mentioned originally. So, I've 
tried it again and perhaps I'm using stemDocument incorrectly, but 

Re: [R] word stemming for corpus linguistics

2016-07-26 Thread Andy Wolfe

Hi Paul

I have seen this - it's part of the tm package mentioned originally. So, 
I've tried it again and perhaps I'm using stemDocument incorrectly, but 
this is what I am doing:


> library(tm)
Loading required package: NLP
> text.v <- scan(file.choose(), what = 'char', sep = '\n')
Read 938 items
> text.stem.v <- stemDocument(text.v, language = 'english')

But it isn't changing anything in the body of the text I'm passing to it 
- the words are unlemmatized/ unstemmed.
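One possible explanation (my assumption, not confirmed in the thread): on a plain character vector, stemDocument() hands each element to SnowballC::wordStem(), which treats every element as a single word, so whole lines of text come back unchanged. Tokenizing first shows the stemmer working:

```r
library(SnowballC)

wordStem(c("facilitating", "facilitated", "facilitates"), language = "english")
# -> "facilit" "facilit" "facilit"

# to stem a whole line, split it into words first and re-join:
line <- "facilitates facilitating"
paste(wordStem(strsplit(line, " ")[[1]], language = "english"), collapse = " ")
```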


When I try using SnowballC, the error returned is that tm_map doesn't 
have a method to work with objects of class 'character'.


Again, the problem is that tm doesn't seem to allow for concordance 
analysis ... or perhaps it does and I just haven't figured out how to do 
it, so am happy to be shown some documentation on that process, and 
whether that is applied before or after the text is transformed into a 
DTM because searching on-line hasn't (yet) thrown anything back.


Thanks.
Andy


On 26/07/16 08:50, Paul Johnston wrote:

Suggest look at http://www.inside-r.org/packages/cran/tm/docs/stemDocument



-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Andy Wolfe
Sent: 26 July 2016 08:10
To: r-help@r-project.org
Subject: [R] word stemming for corpus linguistics

Hi list

On a piece of work I'm doing in corpus linguistics, using a combo of texts by Gries 
"Quantitative Corpus Linguistics with R: A Practical Introduction" and Jockers "Text 
Analysis with R for Students of Literature", which are both really excellent by the way, I 
want to stem or lemmatize the words so that, for e.g., 'facilitating', 'facilitated', and 
'facilitates' all become 'facilit'.

In text mining, using a combination of the packages 'tm' and 'SnowballC'
this is feasible, but then I am finding that working with the DTM (document 
term matrix) becomes difficult for when I want to do concordance (or key word 
in context) analysis.

So, two questions:

(1) is there a package for R version 3.3.1 that can work with corpus 
linguistics? and/ or

(2) is there a way of doing concordance analysis using the tm package as part 
of the whole text mining process?

I appreciate any help. Thanks.

Andy




[R] word stemming for corpus linguistics

2016-07-26 Thread Andy Wolfe
Hi list

On a piece of work I'm doing in corpus linguistics, using a combo of 
texts by Gries "Quantitative Corpus Linguistics with R: A Practical 
Introduction" and Jockers "Text Analysis with R for Students of 
Literature", which are both really excellent by the way, I want to stem 
or lemmatize the words so that, for e.g., 'facilitating', 'facilitated', 
and 'facilitates' all become 'facilit'.

In text mining, using a combination of the packages 'tm' and 'SnowballC' 
this is feasible, but then I am finding that working with the DTM 
(document term matrix) becomes difficult for when I want to do 
concordance (or key word in context) analysis.

So, two questions:

(1) is there a package for R version 3.3.1 that can work with corpus 
linguistics? and/ or

(2) is there a way of doing concordance analysis using the tm package as 
part of the whole text mining process?

I appreciate any help. Thanks.

Andy




Re: [R] Random Forest classification

2016-04-18 Thread Liaw, Andy
This is explained in the "Details" section of the help page for partialPlot.

Best
Andy
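For archive readers, a minimal example of the kind of call being discussed (the vertical axis is the centered log-probability quantity defined in the Details section of ?partialPlot, not a probability, which is why it can be negative):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)

# partial dependence of the class "setosa" on Petal.Width
partialPlot(rf, pred.data = iris, x.var = "Petal.Width",
            which.class = "setosa")
```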

> -Original Message-
> From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Jesús Para
> Fernández
> Sent: Tuesday, April 12, 2016 1:17 AM
> To: r-help@r-project.org
> Subject: [R] Random Forest classification
> 
> Hi,
> 
> To evaluate the partial influence of a factor with a random forest whose
> response is OK/NOK, I'm using partialPlot, with the factor on the x axis and
> the y axis ranging between -1 and 1. What do this -1 and 1 mean?
> 
> An example:
> 
> https://www.dropbox.com/s/4b92lqxi3592r0d/Captura.JPG?dl=0
> 
> 
> Thanks for all!!!

Notice:  This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (2000 Galloping Hill Road, Kenilworth,
New Jersey, USA 07033), and/or its affiliates (direct contact information
for affiliates is available at
http://www.merck.com/contact/contacts.html) that may be confidential,
proprietary copyrighted and/or legally privileged. It is intended solely
for the use of the individual or entity named on this message. If you are
not the intended recipient, and have received this message in error,
please notify us immediately by reply e-mail and then delete it from 
your system.

[R] HELP - as.numeric changing column data

2016-01-06 Thread Andy Schneider
Hi -

I'm trying to plot some data and having a lot of trouble! I have a simple
dataset consisting of two columns - income_per_capita and mass_beauty_value.
When I read the data in and plot it, I get the attached plot Mass Beauty
Non-Numeric:
You can see that, while it contains all the values, the income_per_capita
axis is out of order and there are some weird vertical lines happening.

To fix this, I converted both columns to numerics using:

mass_beauty$income_per_capita <- as.numeric(mass_beauty$income_per_capita)
mass_beauty$mass_beauty_value <- as.numeric(mass_beauty$mass_beauty_value)

When I did this, I noticed that my income_per_capita column's values
suddenly changed. Whereas I have values extending all the way to 30,000 or
so before, now they maxed out at around 1,400. While at first I thought they
might at least have changed to scale, it unfortunately looks like changes
were more or less random. But, they plotted much better:

Does anyone have any solution for how I can convert my income_per_capita
column to a plottable numeric without changing up its values? I've tried
doing as.numeric(as.character(mass_beauty_value$income_per_capita)) but it
didn't work.
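A likely cause (an assumption, since the raw data isn't shown): the column came in as a factor, and as.numeric() on a factor returns the internal level codes, not the printed values:

```r
f <- factor(c("30000", "1400", "250"))
as.numeric(f)                  # 3 1 2 -- the level codes, not the values
as.numeric(as.character(f))    # 30000 1400 250 -- the actual values

# if the text contains thousands separators, strip them before converting:
g <- factor(c("30,000", "1,400"))
as.numeric(gsub(",", "", as.character(g)))   # 30000 1400
```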

Thanks so much for your help!


[R] Most appropriate function for the following optimisation issue?

2015-10-20 Thread Andy Yuan
Hello
 
Please could you help me to select the most appropriate/fastest function to use 
for the following constraint optimisation issue?
 
Objective function:
 
Min: Sum( (X[i] - S[i] )^2) 
 
Subject to constraint :
 
Sum (B[i] x X[i]) =0
 
where i = 1, ..., n and S[i] and B[i] are real numbers
 
Need to solve for X
 
Example:
 
Assume n=3
 
S <- c(-0.5, 7.8, 2.3)
B <- c(0.42, 1.12, 0.78)
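This particular problem needs no iterative optimiser: it is a Euclidean projection onto the hyperplane Sum(B[i]*X[i]) = 0, and a Lagrange-multiplier argument (my addition, not from the original post) gives a closed form:

```r
S <- c(-0.5, 7.8, 2.3)
B <- c(0.42, 1.12, 0.78)

# minimise sum((X - S)^2) subject to sum(B * X) = 0:
#   X = S - B * sum(B * S) / sum(B^2)
X <- S - B * sum(B * S) / sum(B * B)

sum(B * X)   # 0 (up to floating-point rounding): constraint satisfied
```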
 
Many thanks
AY




[R] Login

2014-05-27 Thread Andy Siddaway
Dear R help,

I cannot log in to my account. I am keen to remove the posting I made to R
help from google web searches - see
http://r.789695.n4.nabble.com/R-software-installation-problem-td4659556.html


Thanks,

Andy

Dr Andy Siddaway
Registered Clinical Psychologist/
MRC Clinical Research Training Fellow,
Behavioural Science Centre
Stirling Management School
Cottrell Building
University of Stirling
Stirling
Scotland
FK9 4LA



Re: [R] rpart and randomforest results

2014-04-07 Thread Liaw, Andy
Hi Sonja,

How did you build the rpart tree (i.e., what settings did you use in 
rpart.control)?  Rpart by default will use cross validation to prune back the 
tree, whereas RF doesn't need that.  There are other more subtle differences as 
well.  If you want to compare single tree results, you really want to make sure 
the settings in the two are as close as possible.  Also, how did you compute 
the pseudo R2, on test set, or some other way?

Best,
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Schillo, Sonja
Sent: Thursday, April 03, 2014 3:58 PM
To: Mitchell Maltenfort
Cc: r-help@r-project.org
Subject: Re: [R] rpart and randomforest results

Hi,

the random forest should do that, you're totally right. As far as I know it 
does so by randomly selecting the variables considered for a split (but here we 
set the option for how many variables to consider at each split to the number 
of variables available so that I thought that the random forest does not have 
the chance to randomly select the variables). The next thing that randomforest 
does is bootstrapping. But here again we set the option to the number of cases 
we have in the data set so that no bootstrapping should be done.
We tried to take all the randomness from the randomforest away.
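For reference, the settings described above look roughly like this (a sketch on iris, since the original data isn't shown; even with these options randomForest() and rpart() differ in split search and pruning, so identical trees aren't guaranteed):

```r
library(randomForest)

p <- ncol(iris) - 1
rf1 <- randomForest(Species ~ ., data = iris,
                    ntree    = 1,            # a single tree
                    mtry     = p,            # consider every variable per split
                    replace  = FALSE,        # no bootstrap resampling ...
                    sampsize = nrow(iris))   # ... and use every case
```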

Is that plausible and does anyone have another idea?

Thanks
Sonja


Von: Mitchell Maltenfort [mailto:mmal...@gmail.com]
Gesendet: Dienstag, 1. April 2014 13:32
An: Schillo, Sonja
Cc: r-help@r-project.org
Betreff: Re: [R] rpart and randomforest results


Is it possible that the random forest is somehow adjusting for optimism or 
overfitting?
On Apr 1, 2014 7:27 AM, Schillo, Sonja 
sonja.schi...@uni-due.de wrote:
Hi all,

I have a question on rpart and randomforest results:

We calculated a single regression tree using rpart and got a pseudo-r2 of 
roundabout 10% (which is not too bad compared to a linear regression on this 
data). Encouraged by this we grew a whole regression forest on the same data 
set using randomforest. But we got  pretty bad pseudo-r2 values for the 
randomforest (even sometimes negative values for some option settings).
We then thought that if we built only one single tree with the randomforest 
routine we should get a result similar to that of rpart. So we set the options 
for randomforest to only one single tree but the resulting pseudo-r2 value was 
negative aswell.

Does anyone have a clue as to why the randomforest results are so bad whereas 
the rpart result is quite ok?
Is our assumption that a single tree grown by randomforest should give similar 
results as a tree grown by rpart wrong?
What am I missing here?

Thanks a lot for your help!
Sonja



Re: [R] randomForest warning: The response has five or fewer unique values. Are you sure you want to do regression?

2014-03-24 Thread Liaw, Andy
If you are using the code, that's not really using randomForest directly.  I 
don't understand the data structure you have (since you did not show anything) 
so can't really tell you much.  In any case, that warning came from 
randomForest() when it is run in regression mode but the response has fewer 
than five distinct values.  It may be legitimate regression data, and if so you 
can safely ignore the warning (that's why it's not an error).  It's there to 
catch the cases when people try to do classification with class labels 1, 2, 
..., k and forgot to make it a factor.

Best,
Andy Liaw

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Sean Porter
Sent: Thursday, March 20, 2014 3:27 AM
To: r-help@r-project.org
Subject: [R] randomForest warning: The response has five or fewer unique 
values. Are you sure you want to do regression?

Hello everyone,

 

I'm relatively new to R and to the randomForest package, and have scoured
the archives for help with no luck. I am trying to perform a regression on a
set of predictors and response variables to determine the most important
predictors. I have 100 response variables collected from 14 sites and 8
predictor variables from the same 14 sites. I run the code to perform the
randomForest  regression given by Pitcher et al 2011   (
http://gradientforest.r-forge.r-project.org/biodiversity-survey.pdf ). 

 

However, after running the code I get the warning:

 

 In randomForest.default(m, y, ...) :

  The response has five or fewer unique values.  Are you sure you want to do
regression?

 

And it produces a set of 500 regression trees for only 3 species, even
though the response file contains 100 species. I noticed that in the
example by Pitcher they get 500 trees for only 90 species even though
they input 110 species in the response data.

 

Why am I getting the warning/how do I solve it, and why is randomForest
producing trees for only 3 species when I am looking at 100 species
(response variables)?

 

Many thanks

 

Sean

 




Re: [R] Variable importance - ANN

2013-12-04 Thread Liaw, Andy
You can try something like this:
http://pubs.acs.org/doi/abs/10.1021/ci050022a

Basically similar idea to what is done in random forests: permute predictor 
variable one at a time and see how much that degrades prediction performance.

Cheers,
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Giulia Di Lauro
Sent: Wednesday, December 04, 2013 6:42 AM
To: r-help@r-project.org
Subject: [R] Variable importance - ANN

Hi everybody,
I created a neural network for a regression analysis with package ANN, but
now I need to know which is the significance of each predictor variable in
explaining the dependent variable. I thought to analyze the weight, but I
don't know how to do it.

Thanks in advance,
Giulia Di Lauro.



Re: [R] How do I extract Random Forest Terms and Probabilities?

2013-12-02 Thread Liaw, Andy
#2 can be done simply with predict(fmi, type = "prob").  See the help page for
predict.randomForest().

Best,
Andy


-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of arun
Sent: Tuesday, November 26, 2013 6:57 PM
To: R help
Subject: Re: [R] How do I extract Random Forest Terms and Probabilities?



Hi,
For the first part, you could do:

fmi2 <- fmi
attributes(fmi2$terms) <- NULL
capture.output(fmi2$terms)
#[1] Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

A.k.

On Tuesday, November 26, 2013 3:55 PM, Lopez, Dan lopez...@llnl.gov wrote:
Hi R Experts,

I need your help with two question regarding randomForest.


1.       When I run a Random Forest model how do I extract the formula I used 
so that I can store it in a character vector in a dataframe?
For example the dataframe might look like this if I am running models using the 
IRIS dataset
#ModelID,Type,

#001,RF,Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width

fmi <- randomForest(Species ~ ., iris, mtry = 3, ntree = 500)
#I know one place where the information is in fmi$terms but not sure how to 
extract just the formula info. Or perhaps there is somewhere else in fmi that I 
could get this?


2.       How do I get the probabilities (probability-like values) from the 
model that was run? I know for the test set I can use predict. And I know to 
extract the classifications from the model I use fmi$predicted. But where are 
the probabilities?


Dan
Workforce Analyst
HRIM - Workforce Analytics & Metrics
LLNL




Re: [R] interpretation of MDS plot in random forest

2013-12-02 Thread Liaw, Andy
Yes, that's part of the intention anyway.  One can also use them to do 
clustering.
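For example (a sketch; per the randomForest documentation, cmdscale() on 1 - proximity is what MDSplot() computes internally):

```r
library(randomForest)

set.seed(1)
iris.rf <- randomForest(Species ~ ., iris, proximity = TRUE)

# scaling coordinates of the proximity matrix, then k-means on them
sc <- cmdscale(1 - iris.rf$proximity, k = 2)
cl <- kmeans(sc, centers = 3, nstart = 10)
table(cluster = cl$cluster, species = iris$Species)
```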

Best,
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Massimo Bressan
Sent: Monday, December 02, 2013 6:34 AM
To: r-help@r-project.org
Subject: [R] interpretation of MDS plot in random forest

Given this general example:

set.seed(1)

data(iris)

iris.rf <- randomForest(Species ~ ., iris, proximity=TRUE, keep.forest=TRUE)

#varImpPlot(iris.rf)

#varUsed(iris.rf)

MDSplot(iris.rf, iris$Species)

I’ve been reading the documentation about random forests (to the best of my 
admittedly poor knowledge), but I’m in trouble with the correct interpretation 
of the MDS plot, and I hope someone can give me some clues.

What is intended for “the scaling coordinates of the proximity matrix”?


I think I understand that the objective here is to present the distances 
among species in a parsimonious and visual way (of lower dimensionality).

Is this therefore parallel to what the principal components are 
in a classical PCA?

Are the scaling coordinates DIM1 and DIM2 the eigenvectors of the 
proximity matrix?

If that is correct, how would you find the eigenvalues for those 
eigenvectors? And what are the eigenvalues representing?


What do these two dimensions in the plot say about the different 
iris species? Their relative distance in terms of proximity within the 
space of DIM1 and DIM2?

How to choose for the k parameter (number of dimensions for the scaling 
coordinates)?

And finally how would you explain the plot in simple terms?
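A small base-R sketch of what (as far as I understand it) MDSplot() computes: classical multidimensional scaling, via cmdscale(), on 1 - proximity. The 3 x 3 matrix below is a hypothetical stand-in for rf$proximity, just to show where the coordinates and eigenvalues come from:

```r
# Hypothetical symmetric proximity matrix (stand-in for rf$proximity):
prox <- matrix(c(1.0, 0.8, 0.2,
                 0.8, 1.0, 0.3,
                 0.2, 0.3, 1.0), nrow = 3)

# Classical MDS on the dissimilarity 1 - proximity:
mds <- cmdscale(1 - prox, k = 2, eig = TRUE)

mds$points  # the "Dim 1"/"Dim 2" scaling coordinates that MDSplot() draws
mds$eig     # eigenvalues; their relative magnitudes play a role analogous
            # to the variances of principal components in PCA
```

So points that the forest considers similar (high proximity) land close together, and the eigenvalues answer the "how much does each dimension capture" question.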

Thank you for any feedback
Best regards

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Notice:  This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
New Jersey, USA 08889), and/or its affiliates (direct contact information
for affiliates is available at
http://www.merck.com/contact/contacts.html) that may be confidential,
proprietary, copyrighted and/or legally privileged. It is intended solely
for the use of the individual or entity named on this message. If you are
not the intended recipient, and have received this message in error,
please notify us immediately by reply e-mail and then delete it from 
your system.
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Split type in the RandomForest package

2013-11-20 Thread Liaw, Andy
Classification trees use the Gini index, whereas the regression trees use sum 
of squared errors.  They are hard-wired into the C/Fortran code, so not 
easily changeable.

Best,
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Cheng, Chuan
Sent: Monday, September 30, 2013 6:30 AM
To: 'R-help@r-project.org'
Subject: [R] Split type in the RandomForest package

Hi guys,

I'm new to Random Forest package and I'd like to know what type of split is 
used in the package for classification? Or can I configure the package to use 
different split type (like simple split alongside single attribute axis or 
linear split based on several attributes etc..)

Thanks a lot!

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What is the difference between Mean Decrease Accuracy produced by importance(foo) vs foo$importance in a Random Forest Model?

2013-11-19 Thread Liaw, Andy
The difference is importance(..., scale=TRUE).  See the help page for detail.  
If you extract the $importance component from a randomForest object, you do not 
get the scaling.
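A hedged way to see the relationship in code (assumes the randomForest package; the scaled MeanDecreaseAccuracy should be the raw value divided by its standard error, which is stored in $importanceSD — worth verifying against your version):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)

raw    <- rf$importance[, "MeanDecreaseAccuracy"]
scaled <- importance(rf, scale = TRUE)[, "MeanDecreaseAccuracy"]

# scaled = raw / importanceSD (MeanDecreaseGini is never scaled, which is
# why it was the only identical column in the two outputs):
all.equal(scaled, raw / rf$importanceSD[, "MeanDecreaseAccuracy"])
```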

Best,
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Lopez, Dan
Sent: Wednesday, November 13, 2013 12:16 PM
To: R help (r-help@r-project.org)
Subject: [R] What is the difference between Mean Decrease Accuracy produced by 
importance(foo) vs foo$importance in a Random Forest Model?

Hi R Expert Community,

My question: What is the difference between Mean Decrease Accuracy produced by 
importance(foo) vs foo$importance in a Random Forest Model?

I ran a Random Forest classification model where the classifier is binary. I 
stored the model in object FOREST_model. I than ran importance(FOREST_model) 
and FOREST_model$importance. I usually use the prior but decided to learn more 
about what is in summary(randomForest ) so I ran the latter. I expected both to 
produce identical output. Mean Decrease Gini is the only thing that is 
identical in both.

I looked at ? Random Forest and Package 'randomForest' documentation and didn't 
find any info explaining this difference.

I am not including a reproducible example because this is most likely 
something, perhaps simple, such as one  is divided by something (if so, what?), 
that I am just not aware of.


importance(FOREST_model)

                       HC          TER MeanDecreaseAccuracy MeanDecreaseGini
APPT_TYP_CD_LL 0.16025157 -0.521041660           0.15670297        12.793624
ORG_NAM_LL     0.20886631 -0.952057325           0.20208393       107.137049
NEW_DISCIPLINE 0.20685079 -0.960719435           0.20076762        86.495063


FOREST_model$importance


                         HC           TER MeanDecreaseAccuracy MeanDecreaseGini
APPT_TYP_CD_LL 0.0049473962 -3.727629e-03         0.0045949805        12.793624
ORG_NAM_LL     0.0090715845 -2.401016e-02         0.0077298067       107.137049
NEW_DISCIPLINE 0.0130672572 -2.656671e-02         0.0114583178        86.495063

Dan Lopez
LLNL, HRIM, Workforce Analytics & Metrics


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] FW: Nadaraya-Watson kernel

2013-11-07 Thread Liaw, Andy
Use KernSmooth (one of the recommended packages that are included in R 
distribution).  E.g.,

> library(KernSmooth)
KernSmooth 2.23 loaded
Copyright M. P. Wand 1997-2009
> x <- seq(0, 1, length=201)
> y <- 4 * cos(2*pi*x) + rnorm(x)
> f <- locpoly(x, y, degree=0, kernel="epan", bandwidth=.1)
> plot(x, y)
> lines(f, lwd=2)

Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Ms khulood aljehani
Sent: Tuesday, November 05, 2013 9:49 AM
To: r-h...@stat.math.ethz.ch
Subject: [R] FW: Nadaraya-Watson kernel

From: aljehan...@hotmail.com
To: r-help@r-project.org
Subject: Nadaraya-Watson kernel
Date: Tue, 5 Nov 2013 17:42:13 +0300




Hello
 
I want to compute the Nadaraya-Watson kernel estimate when the kernel 
function is the Epanechnikov kernel.
I use the command
ksmooth(x, y, kernel="normal", bandwidth, ...)
 
The kernel argument accepts only the "normal" and "box" kernels;
I want to compute the estimate when the kernel is Epanechnikov.
 
 
thank you
 
  
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Override setClass and setMethod in a package R 3.0.1

2013-07-11 Thread Andy Pranata
Hi,

I am having problem to override setClass  and setMethod in a package

Requirement: R 3.0.1, Rtools

Please find below the steps to recreate the problem:
# 1. Create class and method

setClass(Class="AAA",
   representation(
   name="character",
   val="numeric"
 )
)

setMethod("*",
  signature(e1 = "numeric", e2 = "AAA"),
  definition=function (e1, e2) {
 e2@val = e2@val * e1
 e2
 }
)

# 2/ save the file in C:\\AAA.r

# 3/ build a package
  setwd("C:")
  package.skeleton(name="AAA", code_files="C:\\AAA.r")
  system("R CMD build AAA")
  system("R CMD INSTALL --build AAA")

# 4/ testing
 library(AAA)
 x = new("AAA")
 x@val = 100
 -1 * x
An object of class "AAA"
Slot "name":
character(0)

Slot "val":
[1] -100

Slot "type":
character(0)

# 5/ override class and method

setClass(Class="AAA",
   representation(
   name="character",
   val="numeric",
   type="character",
   desc="character"
 )
)

setMethod("*",
 signature(e1 = "numeric", e2 = "AAA"),
  definition=function (e1, e2) {
 if (e2@type == "double"){
 e2@val = e2@val * (2 * e1)
 } else {
 e2@val = e2@val * e1
 }
  }
 )

# 6/ testing
 y = new("AAA")
 y@val = 25
 y@type = "double"
 -1 * y
Error in -1 * y : invalid object (non-function) used as method


Please advise
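Not a definitive diagnosis, but one difference stands out: unlike the original method in step 1, the overriding definition in step 5 never returns e2, so the method returns the value of the last assignment (a numeric) rather than an AAA object. A self-contained sketch with the return added, run outside any package (the class and slot names are taken from the post):

```r
library(methods)

setClass("AAA", representation(val = "numeric", type = "character"))

setMethod("*", signature(e1 = "numeric", e2 = "AAA"),
  function(e1, e2) {
    if (length(e2@type) > 0 && e2@type == "double") {
      e2@val <- e2@val * (2 * e1)
    } else {
      e2@val <- e2@val * e1
    }
    e2  # return the modified object -- the overriding method omitted this
  })

y <- new("AAA", val = 25, type = "double")
(-1 * y)@val   # -50
```

Whether this also clears the package-override error I cannot say without the package, but the missing return would bite either way.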

-- 
-- Andy --

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Lattice different colours for bars

2013-06-13 Thread Andrew McFadden (Andy)
Hi all

Perhaps this is a tortuous methodology. I was trying to use lattice to produce a 
barchart showing the number positive and negative over time. I wasn't quite 
sure how to create a different colour for the values of arbo$Ikeda in the example 
below, i.e. red for "Ikeda" and green for "Neg".


library(reshape)
library(lattice)

Time=c(rep(6,17), rep(5,17), rep(4,17),
   rep(3,17),rep(2,17), rep(1,17))
Ikeda=c(rep("Ikeda",6),rep("Neg",11),
rep("Ikeda",0),rep("Neg",17),
rep("Ikeda",1),rep("Neg",16),
rep("Ikeda",0),rep("Neg",17),
rep("Ikeda",0),rep("Neg",17),
rep("Ikeda",0),rep("Neg",17))
Theileria=c(rep("Other",6),rep("Neg",11),
 rep("Other",12),rep("Neg",5),
 rep("Other",12),rep("Neg",5),
 rep("Other",14),rep("Neg",3),
 rep("Other",14),rep("Neg",3),
 rep("Other",13),rep("Neg",4))
value=c(rep(1,102))
arbo=data.frame(Time, Ikeda, Theileria, value)

arbo$Time=as.factor(arbo$Time)
levels(arbo$Time)

arbo$Time=factor(arbo$Time,
levels=c(1,2,3,4,5,6),
labels=c("Dec 2008", "Dec 2009", "Dec 2010",
 "Dec 2011", "Jun 2012", "Dec 2012")
)

mdat=melt(arbo,measure.var=c(4),id.var=c(1:3),na.rm=FALSE)
mdat=cast(mdat,Time +Ikeda~variable,fun.aggregate = c(sum))

barchart(value~Ikeda|Time, data = mdat,
type="count",
cex=1.1,
xlab="PCR positive over time",
aspect = c(1.5), layout = c(6, 1),
stack=FALSE,
strip=strip.custom(strip.names=FALSE, strip.levels=TRUE, bg="light blue"),
par.strip.text=list(cex=1.1), scales=list(cex=c(1.1)))

Any suggestions on how to do this would be appreciated.
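One common approach (a sketch on a hypothetical toy data frame rather than the full arbo example) is to map the bars to groups= and supply the two fill colours through par.settings with simpleTheme():

```r
library(lattice)

# Toy stand-in for the aggregated mdat data frame:
d <- data.frame(Time  = rep(c("Dec 2011", "Dec 2012"), each = 2),
                Ikeda = rep(c("Ikeda", "Neg"), times = 2),
                value = c(1, 16, 6, 11))

# groups= splits the bars by Ikeda; simpleTheme() sets the fill colours
# in the order of the factor levels ("Ikeda" red, "Neg" green):
barchart(value ~ Ikeda | Time, data = d, groups = Ikeda,
         par.settings = simpleTheme(col = c("red", "green")),
         xlab = "PCR positive over time")
```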

Regards

Andrew

Investigation and Diagnostic Centre- Upper Hutt



This email message and any attachment(s) is intended solely for the addressee(s)
named above. The information it contains is confidential and may be legally
privileged.  Unauthorised use of the message, or the information it contains,
may be unlawful. If you have received this message by mistake please call the
sender immediately on 64 4 8940100 or notify us by return email and erase the
original message and attachments. Thank you.

The Ministry for Primary Industries accepts no responsibility for changes
made to this email or to any attachments after transmission from the office.

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Maximum likelihood estimation of ARMA(1,1)-GARCH(1,1)

2013-04-08 Thread Andy Yeh
Hello

Following some standard textbooks on ARMA(1,1)-GARCH(1,1) (e.g. Ruey
Tsay's Analysis of Financial Time Series), I try to write an R program
to estimate the key parameters of an ARMA(1,1)-GARCH(1,1) model for
Intel's stock returns. For some reason, I cannot decipher what
is wrong with my R program. The R package fGarch already gives me the
answer, but my customized function does not seem to produce the same
result.

I would like to build an R program that helps estimate the baseline
ARMA(1,1)-GARCH(1,1) model. Then I would like to adapt this baseline
script to fit different GARCH variants (e.g. EGARCH, NGARCH, and
TGARCH). It would be much appreciated if you could provide some
guidance in this case. The code below is the R script for estimating
the 6 parameters of an ARMA(1,1)-GARCH(1,1) model for Intel's stock
returns. At any rate, I would be glad to know your thoughts and
insights. If you have a similar example, please feel free to share
your extant code in R. Many thanks in advance.

Emily



# This R script offers a suite of functions for estimating  the
volatility dynamics based on the standard ARMA(1,1)-GARCH(1,1) model
and its variants.
# The baseline ARMA(1,1) model characterizes the dynamic evolution of
the return generating process.
# The baseline GARCH(1,1) model depicts the the return volatility
dynamics over time.
# We can extend the GARCH(1,1) volatility model to a variety of
alternative specifications to capture the potential asymmetry for a
better comparison:
# GARCH(1,1), EGARCH(1,1),  NGARCH(1,1), and TGARCH(1,1).

options(scipen=10)

intel = read.csv(file="intel.csv")
summary(intel)

raw_data= as.matrix(intel$logret)

library(fGarch)
garchFit(~arma(1,1)+garch(1,1), data=raw_data, trace=FALSE)


negative_log_likelihood_arma11_garch11=
function(theta, data)
{mean =theta[1]
 delta=theta[2]
 gamma=theta[3]
 omega=theta[4]
 alpha=theta[5]
 beta= theta[6]

 r= ts(data)
 n= length(r)

 u= vector(length=n)
 u= ts(u)
 u[1]= r[1]- mean

 for (t in 2:n)
 {u[t]= r[t]- mean- delta*r[t-1]- gamma*u[t-1]}

 h= vector(length=n)
 h= ts(h)
 h[1]= omega/(1-alpha-beta)

 for (t in 2:n)
 {h[t]= omega+ alpha*(u[t-1]^2)+ beta*h[t-1]}

 #return(-sum(dnorm(u[2:n], mean=mean, sd=sqrt(h[2:n]), log=TRUE)))
 pi=3.141592653589793238462643383279502884197169399375105820974944592
 return(-sum(-0.5*log(2*pi) -0.5*log(h[2:n]) -0.5*(u[2:n]^2)/h[2:n]))
}


#theta0=c(0, +0.78, -0.79, +0.018, +0.06, +0.93, 0.01)
theta0=rep(0.01,6)
negative_log_likelihood_arma11_garch11(theta=theta0, data=raw_data)


alpha= proc.time()
maximum_likelihood_fit_arma11_garch11=
nlm(negative_log_likelihood_arma11_garch11,
p=theta0,
data=raw_data,
hessian=TRUE,
iterlim=500)
#optim(theta0,
#  negative_log_likelihood_arma11_garch11,
#  data=raw_data,
#  method="L-BFGS-B",
#  
upper=c(+0.,+0.,+0.,0.,0.,0.),
#  
lower=c(-0.,-0.,-0.,0.0001,0.0001,0.0001),
#  hessian=TRUE)

# We record the end time and calculate the total runtime for the above work.
omega= proc.time()
runtime= omega-alpha
zhours = floor(runtime/60/60)
zminutes=floor(runtime/60- zhours*60)
zseconds=floor(runtime- zhours*60*60- zminutes*60)
print(paste("It takes ", zhours, " hour(s), ", zminutes, " minute(s), and ",
zseconds, " second(s) to finish running this R program", sep=""))


maximum_likelihood_fit_arma11_garch11

sqrt(diag(solve(maximum_likelihood_fit_arma11_garch11$hessian)))

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] SVD on very large data matrix

2013-04-08 Thread Andy Cooper


Dear All,

I need to perform an SVD on a very large data matrix, of dimension ~ 500,000 x 
1,000, and I am looking
for an efficient algorithm that can perform an approximate (partial) SVD to 
extract on the order of the top 50
right and left singular vectors.

Would be very grateful for any advice on what R-packages are available to 
perform such a task, what the RAM requirement is, and indeed what would be the 
state-of-the-art in terms of numerical algorithms and programming
language to use to accomplish this task.


with many thanks in advance,

Andy Cooper

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SVD on very large data matrix

2013-04-08 Thread Andy Cooper
Dear Berend,

thanks to everyone who responded. I should point out that the data matrix is not 
sparse. Yes, the irlba package seemed of interest; however, annoyingly, the 
package doesn't specify how large "large" is, and when you read the 
paper on which it is based, it deals with rather smaller data matrices than the 
ones I have, so it is unclear how it would perform and scale.

So, no one has direct experience running irlba on a data matrix as large as 
500,000 x 1,000 or larger?

kind regards
Andy






 From: Berend Hasselman b...@xs4all.nl

Cc: r-help@R-project.org r-help@r-project.org 
Sent: Monday, 8 April 2013, 20:31
Subject: Re: [R] SVD on very large data matrix




 
 
 Dear All,
 
 I need to perform a SVD on a very large data matrix, of dimension ~ 500,000 x 
 1,000 , and I am looking
 for an efficient algorithm that can perform an approximate (partial) SVD to 
 extract on the order of the top 50
 right and left singular vectors.
 
 Would be very grateful for any advice on what R-packages are available to 
 perform such a task, what the RAM requirement is, and indeed what would be 
 the state-of-the-art in terms of numerical algorithms and programming
 language to use to accomplish this task.


Info found with package sos and findFn(svd) and scrolling through the list 
for something relevant.

Have a look at package irlba.
It can work with dense matrices and sparse matrices as provided by package 
Matrix, according to the documentation.

Berend
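For the archive, a minimal sketch of a truncated SVD with irlba (the matrix here is shrunk for illustration; how irlba scales to 500,000 x 1,000 will depend on available RAM, roughly the size of the dense input plus the factors):

```r
library(irlba)

set.seed(1)
# Small stand-in for the 500,000 x 1,000 dense matrix:
A <- matrix(rnorm(2000 * 100), nrow = 2000)

# Compute only the top 50 singular triplets:
s <- irlba(A, nv = 50)

dim(s$u)     # left singular vectors,  2000 x 50
dim(s$v)     # right singular vectors, 100 x 50
length(s$d)  # 50 singular values
```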
[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Creating 3d partial dependence plots

2013-03-20 Thread Liaw, Andy
It needs to be done by hand, in that partialPlot() does not handle more than 
one variable at a time.  You need to modify its code to do that (and be ready 
to wait even longer, as it can be slow).

Andy
 
-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Jerrod Parker
Sent: Sunday, March 03, 2013 7:08 PM
To: r-help@r-project.org
Subject: [R] Creating 3d partial dependence plots

Help,

I've been having a difficult time trying to create 3d partial dependence
plots using rgl.  It looks like this question has been asked a couple
times, but I'm unable to find a clear answer googling.  I've tried creating
x, y, and z variables by extracting them from the partialPlot output to no
avail.  I've seen these plots used several times in articles, and I think
they would help me a great deal looking at interactions.  Could someone
provide a coding example using randomForest and rgl?  It would be greatly
appreciated.

Thank you,
Jerrod Parker

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R software installation problem

2013-02-25 Thread Andy Siddaway
Dear Sarah,

Thanks for your email. I'll describe the problem, but without the screenshots,
then.


Firstly, I think I’ve correctly installed R.

I have installed R for Windows via the R site, CRAN and then UK –
University of Bristol or UK – Imperial College London. Both times, I have
installed the ‘base’ version. I installed the 64-bit version, which I’m
running (I’ve got Office 2010).



When installed, R appears to be the same (in terms of the menus and layout) as
the screenshots shown as part of YouTube R installation tutorials - except
for the following (error?) message, which I have copied and pasted
exactly below. This message is displayed in the R Console box, and I cannot
close this box or open a new script:



R version 2.15.2 (2012-10-26) -- Trick or Treat

Copyright (C) 2012 The R Foundation for Statistical Computing

ISBN 3-900051-07-0

Platform: x86_64-w64-mingw32/x64 (64-bit)



R is free software and comes with ABSOLUTELY NO WARRANTY.

You are welcome to redistribute it under certain conditions.

Type 'license()' or 'licence()' for distribution details.



  Natural language support but running in an English locale



R is a collaborative project with many contributors.

Type 'contributors()' for more information and

'citation()' on how to cite R or R packages in publications.



Type 'demo()' for some demos, 'help()' for on-line help, or

'help.start()' for an HTML browser interface to help.

Type 'q()' to quit R.





This ‘Trick or Treat’ message also appears (in the Console box) when I
downloaded RStudio.



Any tips or guidance on resolving this problem would be really appreciated!



Many thanks,



Andy Siddaway




On 25 February 2013 00:10, Sarah Goslee sarah.gos...@gmail.com wrote:

 Hi Andy,

 This list strips most forms of attachments.

 Instead, you need to tell us what OS and version you're using, how
 you're trying to install, and what's going wrong, in detail.

 Sarah

 On Sun, Feb 24, 2013 at 8:58 AM, Andy Siddaway
 andysidda...@googlemail.com wrote:
  Dear R-help,
 
  Please could I have some quick guidance on what I'm doing wrong when
 trying
  to instal R software? (I have read the R-FAQs and instructions, and
 watched
  youtube instructional videos on installing R, but they didn't help)
 
  I've attached screenshots to hopefully make what I've done clearer.
 Basically,
  R doesn't seem to be installing correctly and I can't figure out why.
 It's
  probably a simple error which a non-(complete)-novice would notice.
 
  Thanks very much,
 
  Andy Siddaway
  Trainee Clinical Psychologist
  University of Hertfordshire (UK)
 


 --
 Sarah Goslee
 http://www.functionaldiversity.org


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] R software installation problem

2013-02-24 Thread Andy Siddaway
Dear R-help,

Please could I have some quick guidance on what I'm doing wrong when trying
to instal R software? (I have read the R-FAQs and instructions, and watched
youtube instructional videos on installing R, but they didn't help)

I've attached screenshots to hopefully make what I've done clearer. Basically,
R doesn't seem to be installing correctly and I can't figure out why. It's
probably a simple error which a non-(complete)-novice would notice.

Thanks very much,

Andy Siddaway
Trainee Clinical Psychologist
University of Hertfordshire (UK)
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Getting WinBUGS Leuk example to work from R using R2winBUGS

2013-02-17 Thread Andy Cox
I am trying to learn to use winBUGS from R, I have experience with R.
I have managed to successfully run a simple example from R with no
problems. I have been trying to run the Leuk: Survival from winBUGS
examples Volume 1. I have managed to run this from winBUGS GUI with no
problems. My problem is, try as I might (and I have been trying and
searching for days), I cannot get it to run using R2WinBUGS. I am sure
it is something simple.

The error message I get if I try and set inits in the script is

Error in bugs(data = L, inits = inits,
  parameters.to.save = params, model.file = "model.txt",  :
  Number of initialized chains (length(inits)) != n.chains

I know this means I have not initialised some of the chains, but I am
pasting the inits code from winbugs examples manual and all other
settings seem to me to be the same as when run on the winBUGS GUI.
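A hedged guess at the cause, for the archive: R2WinBUGS expects inits to be a list with one element per chain, each element itself a named list of starting values, whereas the WinBUGS examples manual shows the inner list only. A base-R sketch of the expected shape:

```r
nc <- 1  # number of chains

# One inner list of starting values per chain -- note the extra list() level:
inits <- rep(list(list(beta = 0.0, dL0 = rep(1.0, 17))), nc)

length(inits) == nc   # this is the condition bugs() checks
str(inits[[1]])
```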



If I try inits=NULL I get another error message

display(log)
check(C:/BUGS/model.txt)
model is syntactically correct
data(C:/BUGS/data.txt)
data loaded
compile(1)
model compiled
gen.inits()
shape parameter (r) of gamma dL0[1] too small -- cannot sample
thin.updater(1)
update(500)
command #Bugs:update cannot be executed (is greyed out)
set(beta)

Which indicates to me I will still have problems after solving the
first one!! I am about to give up on using winBUGS, please can someone
save me? I know I am probably going to look stupid, but everyone has
to learn:-)

I have also tried changing nc <- 2 (on advice), which doesn't work and
gives an uninitialised chain error.

I am using winBUGS 1.4.3 on Windows XP 2002 SP3

My R code is below, many thanks for at least reading this far.

   rm(list = ls())

L <- list(N = 42, T = 17, eps = 1.0E-10,
obs.t = c(1, 1, 2, 2, 3, 4, 4, 5, 5, 8, 8, 8, 8, 11, 11, 12,
12, 15, 17, 22, 23, 6,
6, 6, 6, 7, 9, 10, 10, 11, 13, 16, 17, 19, 20, 22, 23, 25, 32,
32, 34, 35),
fail = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0,
0),
Z = c(0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
   -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5,
-0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5, -0.5,
-0.5),
t = c(1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 15, 16, 17, 22, 23, 35))

### 5.4. Analysis using WinBUGS
library(R2WinBUGS)  # Load the R2WinBUGS library CHOOSE to use WinBUGS
#library(R2OpenBUGS)# Load the R2OpenBUGS library CHOOSE
to use OpenBUGS
setwd("C://BUGS")

# Save BUGS description of the model to working directory
sink("model.txt")
cat("

model
{
# Set up data
for(i in 1:N) {
for(j in 1:T) {

# risk set = 1 if obs.t >= t
Y[i,j] <- step(obs.t[i] - t[j] + eps)
# counting process jump = 1 if obs.t in [ t[j], t[j+1] )
# i.e. if t[j] <= obs.t < t[j+1]
dN[i, j] <- Y[i, j] * step(t[j + 1] - obs.t[i] - eps) * fail[i]
}
}
# Model
for(j in 1:T) {
for(i in 1:N) {
dN[i, j] ~ dpois(Idt[i, j]) # Likelihood
Idt[i, j] <- Y[i, j] * exp(beta * Z[i]) * dL0[j] # Intensity
}
dL0[j] ~ dgamma(mu[j], c)
mu[j] <- dL0.star[j] * c # prior mean hazard
# Survivor function = exp(-Integral{l0(u)du})^exp(beta*z)
S.treat[j] <- pow(exp(-sum(dL0[1 : j])), exp(beta * -0.5));
S.placebo[j] <- pow(exp(-sum(dL0[1 : j])), exp(beta * 0.5));
}
c <- 0.001
r <- 0.1
for (j in 1 : T) {
dL0.star[j] <- r * (t[j + 1] - t[j])
}
beta ~ dnorm(0.0,0.01)
}


", fill=TRUE)
sink()

params <- c("beta", "S.placebo", "S.treat")

inits <- list(beta = 0.0,
 dL0 = c(1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,
 1.0,1.0,1.0,1.0,1.0,1.0, 1.0,1.0))

# MCMC settings
nc <- 1 # Number of chains
ni <- 1000  # Number of draws from posterior (for each chain)
ns <- 1000  # Number of sims (n.sims)
nb <- floor(ni/2)   # Number of draws to discard as burn-in
nt <- max(1, floor(nc * (ni - nb) / ns))# Thinning rate
Lout <- list()



# Start Gibbs sampler: Run model in WinBUGS and save results in object
called out
out <- bugs(data = L, inits = inits, parameters.to.save = params,
model.file = "model.txt",
n.thin = nt, n.chains = nc, n.burnin = nb, n.iter = ni, debug = T, DIC
= TRUE,digits=5,
codaPkg=FALSE, working.directory = getwd())

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How do I make R randomForest model size smaller?

2012-12-04 Thread Liaw, Andy
Try the following:

set.seed(100)
rf1 <- randomForest(Species ~ ., data=iris)
set.seed(100)
rf2 <- randomForest(iris[1:4], iris$Species)
object.size(rf1)
object.size(rf2)
str(rf1)
str(rf2)

You can try it on your own data.  That should give you some hints about why the 
formula interface should be avoided with large datasets.
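Beyond avoiding the formula interface, one can also slim down the fitted object before saving. A hedged sketch (assumes randomForest; the $forest component is what predict() needs, so the per-observation training output can usually be dropped, though this is worth testing against your version):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(iris[1:4], iris$Species, ntree = 50)

slim <- rf
# Drop bulky training-time components; $forest is retained for prediction:
slim$votes <- slim$predicted <- slim$oob.times <- slim$err.rate <- NULL

object.size(slim) < object.size(rf)
head(predict(slim, iris[1:4]))  # still predicts from the retained forest
```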

Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of John Foreman
Sent: Monday, December 03, 2012 3:43 PM
To: r-help@r-project.org
Subject: [R] How do I make R randomForest model size smaller?

I've been training randomForest models on 7 million rows of data (41
features). Here's an example call:

myModel <- randomForest(RESPONSE~., data=mydata, ntree=50, maxnodes=30)

I thought surely with only 50 trees and 30 terminal nodes that the memory
footprint of myModel would be small. But it's 65 megs in a dump file. The
object seems to be holding all sorts of predicted, actual, and vote data
from the training process.

What if I just want the forest and that's it? I want a tiny dump file that
I can load later to make predictions off of quickly. I feel like the forest
by itself shouldn't be all that large...

Anyone know how to strip this sucker down to just something I can make
predictions off of going forward?

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Different results from random.Forest with test option and using predict function

2012-12-04 Thread Liaw, Andy
Without data to reproduce what you saw, we can only guess.

One possibility is due to tie-breaking.  There are several places where ties 
can occur and are broken at random, including at the prediction step.  One 
difference between the two ways of doing prediction is that when it's all done 
within randomForest(), the test set prediction is performed as each tree is 
grown.  If there is any tie that needs to be broken at any prediction step, it 
will affect the RNG stream used by the subsequent tree growing step.

You can also inspect/compare the forest components of the randomForest 
objects to see if they are the same.  At least the first tree in both should be 
identical.
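A sketch of that comparison (assumes randomForest; getTree() extracts one tree's split table, and keep.forest=TRUE is needed in the xtest call because supplying a test set otherwise discards the forest):

```r
library(randomForest)

set.seed(100)
rf1 <- randomForest(iris[1:4], iris$Species, ntree = 51)
set.seed(100)
rf2 <- randomForest(iris[1:4], iris$Species, ntree = 51,
                    xtest = iris[1:4], ytest = iris$Species,
                    keep.forest = TRUE)

# Split-by-split comparison of the first tree grown in each forest:
identical(getTree(rf1, k = 1), getTree(rf2, k = 1))
```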

Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of tdbuskirk
Sent: Monday, December 03, 2012 6:31 PM
To: r-help@r-project.org
Subject: [R] Different results from random.Forest with test option and using 
predict function

Hello R Gurus,

I am perplexed by the different results I obtained when I ran code like
this:
set.seed(100)
test1 <- randomForest(BinaryY~., data=Xvars, trees=51, mtry=5, seed=200)
predict(test1, newdata=cbind(NewBinaryY, NewXs), type="response")

and this code:
set.seed(100)
test2 <- randomForest(BinaryY~., data=Xvars, trees=51, mtry=5, seed=200,
xtest=NewXs, ytest=NewBinaryY)

The confusion matrices for the two forests I thought would be the same by
virtue of the same seed settings, but they differ as do the predicted values
as well as the votes.  At first I thought it was just the way ties were
broken, so I changed the number of trees to an odd number so there are no
ties anymore.  

Can anyone shed light on what I am hoping is a simple oversight?  I just
can't figure out why the results of the predictions from these two forests
applied to the NewBinaryYs and NewX data sets would not be the same.

Thanks for any hints and help.

Sincerely,

Trent Buskirk



--
View this message in context: 
http://r.789695.n4.nabble.com/Different-results-from-random-Forest-with-test-option-and-using-predict-function-tp4651970.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Partial dependence plot in randomForest package (all flat responses)

2012-11-26 Thread Liaw, Andy
Not unless we have more information.  Please read the Posting Guide to see how 
to make it easier for people to answer your question.

Best,
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Oritteropus
Sent: Thursday, November 22, 2012 2:02 PM
To: r-help@r-project.org
Subject: [R] Partial dependence plot in randomForest package (all flat 
responses)

Hi,
I'm trying to make a partial plot with package randomForest in R. After I
perform my random forest object I type

partialPlot(data.rforest, pred.data=act2, x.var="centroid", "C") 

where data.rforest is my randomforest object, act2 is the original dataset,
centroid is one of the predictor and C is one of the classes in my response
variable. 
Whatever predictor or response class I try I always get a plot with a
straight line (a completely flat response). Similarly, If I set a
categorical variable as predictor, I get a barplot with all the bar with the
same height. I suppose I'm doing something wrong here because all other
analysis on the same rforest object seem correct (e.g. varImp or MDSplot).
Is it possible it is related to some option set in random forest object? Can
somebody see the problem here?
Thanks for your time



--
View this message in context: 
http://r.789695.n4.nabble.com/Partial-dependence-plot-in-randomForest-package-all-flat-responses-tp4650470.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random Forest for multiple categorical variables

2012-10-17 Thread Liaw, Andy
How about taking the combination of the two?  E.g., gamma = factor(paste(alpha, 
beta1, sep=":")) and use gamma as the response.
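A quick sketch of that combined-response idea with toy data (the values here are made up to mirror the poster's alpha/beta1 columns; the later randomForest call is indicated only in a comment, with `predictors` a hypothetical data frame):

```r
# Build the combined class exactly as suggested: one factor whose levels
# are the observed alpha:beta pairs.
d <- data.frame(alpha = rep("alpha", 10),
                beta1 = rep(c("beta1", "beta2"), each = 5))
d$gamma <- factor(paste(d$alpha, d$beta1, sep = ":"))

levels(d$gamma)
# "alpha:beta1" "alpha:beta2"

# Then fit on the combined factor, e.g.:
# randomForest(gamma ~ ., data = cbind(predictors, gamma = d$gamma))
```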

Best,
Andy
 

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Gyanendra Pokharel
Sent: Tuesday, October 16, 2012 10:47 PM
To: R-help@r-project.org
Subject: [R] Random Forest for multiple categorical variables

Dear all,

I have the following data set.

    V1  V2  V3  V4  V5  V6  V7  V8  V9  V10  alpha  beta
1    1  11   1  11   1  11   1  11   1   11  alpha  beta1
2    2  12   2  12   2  12   2  12   2   12  alpha  beta1
3    3  13   3  13   3  13   3  13   3   13  alpha  beta1
4    4  14   4  14   4  14   4  14   4   14  alpha  beta1
5    5  15   5  15   5  15   5  15   5   15  alpha  beta1
6    6  16   6  16   6  16   6  16   6   16  alpha  beta2
7    7  17   7  17   7  17   7  17   7   17  alpha  beta2
8    8  18   8  18   8  18   8  18   8   18  alpha  beta1
9    9  19   9  19   9  19   9  19   9   19  alpha  beta2
10  10  20  10  20  10  20  10  20  10   20  alpha  beta2

I want to use the randomForest classification. If there is one categorical
variable with different classes, we can use

randomForest(resp ~ ., data, ...), but here I need to classify the data
with two categorical variables. Any idea will be great.

Thanks


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random Forest - Extract

2012-10-03 Thread Liaw, Andy
1.  Not sure what you want.  What details are you looking for exactly?  If 
you call predict(trainset) without the newdata argument, you will get the 
(out-of-bag) prediction of the training set, which is exactly the predicted  
component of the RF object.

2. If you set type="vote" and norm.votes=FALSE, you will get the counts 
instead of proportions.
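For example (a sketch on iris, assuming the randomForest package is installed; the seed and tree count are illustrative):

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 25)

prop <- predict(rf, iris, type = "vote")                      # proportions
cnt  <- predict(rf, iris, type = "vote", norm.votes = FALSE)  # raw counts

# Proportions sum to 1 per row; raw counts sum to the number of trees,
# since every tree votes when newdata is supplied.
all(abs(rowSums(prop) - 1) < 1e-12)
all(rowSums(cnt) == 25)
```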

Best,
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Lopez, Dan
Sent: Wednesday, September 26, 2012 9:05 PM
To: R help (r-help@r-project.org)
Subject: [R] Random Forest - Extract

Hello,

I have two Random Forest (RF) related questions.


1.   How do I view the classifications for the detail data of my training 
data (aka trainset) that I used to build the model? I know there is an object 
called predicted which I believe is a vector. To view the detail for my testset 
I use the below-bind the columns together. I was trying to do something similar 
for my trainset  but without putting it through the predict function. Instead 
taking directly from the randomForest which I stored in FOREST_model. I really 
need to get to this information to do some comparison of certain cases.

RF_DTL <- cbind(testset, predict(FOREST_model, testset, type="response"))



2.   In the RF model in R the predict function has three possible 
arguments: "response", "vote" or "prob". I noticed "vote" and "prob" are 
identical for all records in my data set. Is this typical? If so then what is 
the point of having these two arguments? Ease of use?

Dan



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] interpret the importance output?

2012-08-29 Thread Liaw, Andy
The type=1 importance measure in RF compares the prediction error of each 
tree on the OOB data with the prediction error of the same tree on the OOB data 
with the values of one variable randomly shuffled.  If the variable has no 
predictive power, then the two should be very close, and there's 50% chance 
that the difference is negative.  If the variable is important, then 
shuffling the values should significantly degrade the prediction in the form of 
increased MSE.  The importance measure takes the mean of these per-tree 
differences and then divides it by the SD of the differences.

With that, I hope it's clear that only v2 and v4 in your example are 
potentially important.
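A small simulation shows the pattern Andy describes (a sketch, assuming the randomForest package; the variable names and data are invented): a genuinely predictive variable gets a clearly positive %IncMSE, while pure noise sits near zero and can go negative.

```r
library(randomForest)

set.seed(42)
x <- data.frame(signal = runif(300), noise = runif(300))
y <- 2 * x$signal + rnorm(300, sd = 0.1)

rf  <- randomForest(x, y, importance = TRUE)

# type = 1: mean permutation increase in MSE, divided by its SD
imp <- importance(rf, type = 1)
imp
# "signal" should dominate; "noise" should hover around zero.
```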

Best,
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Johnathan Mercer
Sent: Monday, August 27, 2012 11:40 AM
To: r-h...@stat.math.ethz.ch
Subject: [R] interpret the importance output?

> importance(rfor.pdp11_t25.comb1, type=1)
  %IncMSE
v1 -0.28956401263
v2  1.92865561147
v3 -0.63443929130
v4  1.58949137047
v5  0.03190940065

I wasn't entirely confident with interpreting these results based on the
documentation.
Could you please interpret?


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Significance of interaction depends on factor reference level - lmer/AIC model averaging

2012-06-30 Thread Andy Robertson
Dear R users,

 

I am using lmer combined with AIC model selection and averaging (in the
MuMIn package) to try and assess how isotope values (which indicate diet)
vary within a population of animals.

 

I have multiple measures from individuals (variable 'Tattoo') and multiple
individuals within social groups within 4 locations (A, B, C ,D) crucially I
am interested if there are differences between sexes and age classes
(variable AGECAT2) and whether this differs with location.

However, whether or not I get a significant sex:location interaction depends
on which location is my reference level and I cannot understand why this is
the case. It seems to be due to the fact that the standard error associated
with my interactions varies depending on which level is the reference.

 

Any help or advice would be appreciated,

 

Andrew Robertson

 

Below is the example code of what I am doing and an example of the model
summary and model averaging results with location A as the ref level or
location B.

 

if A is the reference level...

 

#full model

Amodel <- lmer(d15N ~ (AGECAT2 + Sex + Location1 + AGECAT2:Location1 +
Sex:Location1 + AGECAT2:Sex + (1|Year) + (1|Location1/Socialgroup/Tattoo)),
REML=FALSE, data=nocubs)

 

#standardise model

Amodels <- standardize(Amodel, standardize.y=FALSE)

 

#dredge models

summary(model.avg(get.models(Adredge, cumsum(weight) <= 0.95)))

 

Then the average model coefficients indicate no sex by location interaction

 
Component models:
  df  logLikAICc Delta Weight
235   13 -765.33 1557.28  0.00   0.68
1235  15 -764.55 1559.91  2.63   0.18
3  9 -771.64 1561.57  4.29   0.08
12345 17 -763.67 1562.37  5.09   0.05
 
Term codes:
AGECAT2   c.Sex   Location1   AGECAT2:c.Sex
c.Sex:Location1 
  1   2   3   4
5 
 
Model-averaged coefficients: 
                       Estimate Std. Error z value Pr(>|z|)
(Intercept)            8.673592   0.474524  18.279  < 2e-16 ***
c.Sex                  0.095375   0.452065   0.211    0.833
Location1B            -3.972882   0.556575   7.138  < 2e-16 ***
Location1C            -3.61       0.531858   6.831  < 2e-16 ***
Location1D            -3.348665   0.539143   6.211  < 2e-16 ***
c.Sex:Location1B      -0.372653   0.513492   0.726    0.468
c.Sex:Location1C       0.428299   0.511254   0.838    0.402
c.Sex:Location1D      -0.757582   0.512586   1.478    0.139
AGECAT2OLD            -0.179772   0.150842   1.192    0.233
AGECAT2YEARLING       -0.009596   0.132328   0.073    0.942
AGECAT2OLD:c.Sex       0.045963   0.296471   0.155    0.877
AGECAT2YEARLING:c.Sex -0.323985   0.268919   1.205    0.228
---
 

And the full model summary looks like this..

 

 

Linear mixed model fit by maximum likelihood 

Formula: d15N ~ (AGECAT2 + Sex + Location1 + AGECAT2:Location1 +
Sex:Location1 +  AGECAT2:Sex + (1 | Year) + (1 |
Location1/Socialgroup/Tattoo)) 

   Data: nocubs 

  AIC  BIC logLik deviance REMLdev
 1568 1670 -761.1     1522    1534

Random effects:

Groups NameVariance Std.Dev.

Tattoo:(Socialgroup:Location1) (Intercept) 0.35500  0.59582 

 Socialgroup:Location1  (Intercept) 0.35620  0.59682 

 Location1  (Intercept) 0.0  0.0 

 Year   (Intercept) 0.0  0.0 

 Residual   0.49584  0.70416 

Number of obs: 608, groups: Tattoo:(Socialgroup:Location1), 132;
Socialgroup:Location1, 22; Location1, 4; Year, 2

 

Fixed effects:
                            Estimate Std. Error t value
(Intercept)                  8.83179    0.52961  16.676
AGECAT2OLD                  -0.44101    0.41081  -1.074
AGECAT2YEARLING              0.01805    0.38698   0.047
SexMale                     -0.11346    0.51239  -0.221
Location1B                  -3.97880    0.63063  -6.309
Location1C                  -4.04816    0.60404  -6.702
Location1D                  -3.36389    0.63304  -5.314
AGECAT2OLD:Location1B        0.44198    0.54751   0.807
AGECAT2YEARLING:Location1B  -0.22134    0.52784  -0.419
AGECAT2OLD:Location1C        0.20684    0.50157   0.412
AGECAT2YEARLING:Location1C   0.24132    0.47770   0.505
AGECAT2OLD:Location1D        0.53653    0.52778   1.017
AGECAT2YEARLING:Location1D   0.51755    0.51038   1.014
SexMale:Location1B          -0.02442    0.57546  -0.042
SexMale:Location1C           0.74680    0.58128   1.285
SexMale:Location1D          -0.41800    0.59505  -0.702
AGECAT2OLD:SexMale          -0.08907    0.32513  -0.274
AGECAT2YEARLING:SexMale     -0.40146    0.30409  -1.320

 

 

If location B is the reference level then the average model coefficients
indicate an age by sex interaction in location C.

 

Component models:
  df  logLikAICc Delta Weight
235   13 -765.33 1557.28  0.00   0.68
1235  15 -764.55 1559.91  2.63   0.18
3  9 -771.64 1561.57  4.29   0.08
12345 17 -763.67 1562.37  5.09   0.05
 
Term codes:
AGECAT2 

Re: [R] DCC-GARCH model

2012-06-28 Thread andy
Hello Marcin,

did you get the answer to your questions. I have the same questions and
would appreciate your help if you found the answers.

Thanks,
Ankur 

--
View this message in context: 
http://r.789695.n4.nabble.com/DCC-GARCH-model-tp3524387p4634776.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Multivariate P-GARCH Model

2012-06-27 Thread andy
Hi,

I am trying to estimate a multivariate P-GARCH model for two factors x and y. I
have selected P-GARCH to study the leverage effects. Is there any toolkit in
R that can help me do this? 

Thanks,
Andy

--
View this message in context: 
http://r.789695.n4.nabble.com/Multivariate-P-GARCH-Model-tp4634654.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Stratified Sampling with randomForest Regression

2012-06-01 Thread Liaw, Andy
Yes, you need to modify both the R and the underlying C code.  It's the the 
source package on CRAN (the .tar.gz file).

Andy
 

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Josh Browning
Sent: Friday, June 01, 2012 10:48 AM
To: r-help@r-project.org
Subject: [R] Stratified Sampling with randomForest Regression

Hi All,

 

I'm using R's randomForest package (and it's quite awesome!) but I'd
really like to do some stratified sampling with a regression problem.
However, it appears that the package was designed to only accommodate
stratified sampling for classification purposes (see
https://stat.ethz.ch/pipermail/r-help/2006-November/117477.html).  As
Andy suggests in the link just mentioned, I'm trying to modify the
source code.  However, it appears that I may also need to modify the C
code that randomForest is calling, is that correct?  If so, how do I
access that code?

 

Or, has anyone modified the package to allow for stratified sampling in
regression problems?

 

Please let me know if I'm not being clear enough with this question, and
thanks for helping me out!

 

Josh



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] lattice: add a marginal histogram on top of the colorkey of a levelplot?

2012-05-29 Thread Andy Bunn
Lattice experts:

Can you think of a way to produce a levelplot as below and then add a histogram 
of the z variable to the top margin of the plot that would sit on top of the 
color key?



x <- seq(pi/4, 5 * pi, length.out = 100)
y <- seq(pi/4, 5 * pi, length.out = 100)
r <- as.vector(sqrt(outer(x^2, y^2, "+")))
grid <- expand.grid(x=x, y=y)
grid$z <- cos(r^2) * exp(-r/(pi^3))
my.levs <- seq(-1,1,by=0.1)
my.cols <- grey(0:length(my.levs)/length(my.levs))
levelplot(z~x*y, grid, at=my.levs, scales=list(log="e"), xlab="",
  ylab="", colorkey = list(space = 'top'), col.regions = my.cols)
# is there a way to add a marginal histogram above the colorkey?
histogram(~z, grid, breaks=my.levs, col=my.cols, xlab='', ylab='',
scales = list(draw = FALSE),
par.settings = list(axis.line = list(col = "transparent")))

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Question about random Forest function in R

2012-05-29 Thread Liaw, Andy
Hi Kelly,

The function has a limitation that it cannot handle any column in your x that 
is a categorical variable with more than 32 categories.  One possibility is to 
see if you can bin some of the categories into one to get below 32 categories.
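One base-R sketch of the binning idea (no randomForest needed for this step; the category names and counts are invented): keep the 31 most frequent levels and lump everything else into a single "other" level so the factor fits under the 32-level limit.

```r
set.seed(1)
f <- factor(sample(paste0("cat", 1:60), 2000, replace = TRUE))
nlevels(f)   # 60: too many for randomForest

# Keep the 31 most common levels; pool the rest.
keep <- names(sort(table(f), decreasing = TRUE))[1:31]
f2   <- factor(ifelse(f %in% keep, as.character(f), "other"))
nlevels(f2)  # 32: the 31 kept levels plus "other"
```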

Andy 

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Kelly Cool
Sent: Tuesday, May 29, 2012 10:47 AM
To: r-help@r-project.org
Subject: [R] Question about random Forest function in R



Hello, 

I am trying to run the random Forest function on a data.frame using the 
following code..

myrf <- randomForest(y=sample_data_metal, x=Train, importance=TRUE, 
proximity=TRUE)


However, an error occurs saying "can not handle categorical predictors with 
more than 32 categories". 

My x=Train data.frame is quite large and my y=sample_data_metal is one 
column. 

I'm not sure how to go about fixing this error or if there is even a way to get 
around this error. Thanks in advance for any help. 

[[alternative HTML version deleted]]


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random Forest Classification_ForestCombination

2012-05-29 Thread Liaw, Andy
As long as you can remember that the summaries such as variable importance, OOB 
predictions, and OOB error rates are not applicable, I think that should be 
fine.
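Sketched on iris (assuming the randomForest package; the two overlapping subsamples stand in for the poster's two different training sets):

```r
library(randomForest)

set.seed(1)
rfA <- randomForest(Species ~ ., data = iris[sample(150, 120), ], ntree = 50)
rfB <- randomForest(Species ~ ., data = iris[sample(150, 120), ], ntree = 50)

# combine() just pools the trees; use the result only for predict() on
# new data -- its OOB error and importance summaries are no longer valid.
rfAB <- combine(rfA, rfB)
rfAB$ntree
# 100
```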

Andy 

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Nikita Desai
Sent: Wednesday, May 23, 2012 1:51 PM
To: r-help@R-project.org
Subject: [R] Random Forest Classification_ForestCombination

Hello,

I am aware of the fact that the combine() function in the Random Forest package 
of R is meant to combine forests built from the same training set, but is there 
any way to combine trees built on different training sets? Both the training 
datasets used contain the same variables and classes, but their sizes are 
different.

Thanks



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random forests prediction

2012-05-14 Thread Liaw, Andy
I don't think this is so hard to explain.  If you evaluate AUC using either OOB 
prediction or on a test set (or something like CV or bootstrap), that would be 
what I expect for most data.  When you add more variables (that are, say, less 
informative) to a model, the model has to look harder to find the informative 
ones, and thus you pay a penalty.  One exception to that is if some of the 
new variables happen to have very strong interaction with some of the old 
variables, then you may see improved performance.

I've said it several times before, but it seems to be worth repeating:  Don't 
use the training set for evaluating models:  that almost never makes sense.

Andy


-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of matt
Sent: Friday, May 11, 2012 3:43 PM
To: r-help@r-project.org
Subject: [R] Random forests prediction

Hi all,

I have a strange problem when applying RF in R. 
I have a set of variables with which I obtain an AUC of 0.67.

I do have a second set of variables that have an AUC of 0.57. 

When I merge the first and second set of variables, the AUC becomes 0.64. 

I would expect the prediction to become better as I add variables that do
have some predictive power?
This is even more strange as the AUC on the training set increased when I
added more variables (while the AUC of the validation set thus decreased).

Is there anyone who has experienced the same and/or who know what could be
the reason?

Thanks,

Matthijs

--
View this message in context: 
http://r.789695.n4.nabble.com/Random-forests-prediction-tp4627409.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] No Data in randomForest predict

2012-05-14 Thread Liaw, Andy
It doesn't:  You just get an error if there are NAs in the data; e.g.,

R> rf1 = randomForest(iris[1:4], iris[[5]])
R> predict(rf1, newdata=data.frame(Sepal.Length=1, Sepal.Width=2, 
Petal.Length=3, Petal.Width=NA))
Error in predict.randomForest(rf1, newdata = data.frame(Sepal.Length = 1,  : 
  missing values in newdata
 
Andy

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Jennifer Corcoran
Sent: Saturday, May 05, 2012 5:17 PM
To: r-help@r-project.org
Subject: [R] No Data in randomForest predict

I would like to ask a general question about the randomForest predict
function and how it handles No Data values.  I understand that you can omit
No Data values while developing the randomForest object, but how does it
handle No Data in the prediction phase?  I would like the output to be NA
if any (not just all) of the input data have an NA value. It is not clear
to me if this is the default or if I need to add an argument in the predict
function.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random forests prediction

2012-05-14 Thread Liaw, Andy
That's not how RF works at all.  The setting of mtry is irrelevant to this.

Andy 

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of matt
Sent: Monday, May 14, 2012 10:22 AM
To: r-help@r-project.org
Subject: Re: [R] Random forests prediction

But shouldn't it be resolved when I set mtry to the maximum number of
variables? 
Then the model explores all the variables for the next step, so it will
still be able to find the better ones? And then in the later steps it could
use the (less important) variables.

Matthijs

--
View this message in context: 
http://r.789695.n4.nabble.com/Random-forests-prediction-tp4627409p4629944.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Hmisc::xYplot - text on xaxis

2012-04-26 Thread Andy Bunn
Hello, I'm making a simple plot using xYplot in the Hmisc library and having 
problems with labeling the values on the x-axis. Using the reproducible example 
below, how can I have the text (jan, feb,mar, etc.) in place of 1:12. 

Thanks, AB


x <- c(seq(0,0.5,by=0.1), seq(0.5,0,by=-0.1))
ci <- rnorm(12, 0, sd=0.1)
xupper <- x + ci
xlower <- x - ci
mo.fac <- c("jan", "feb", "mar", "apr", "may", "jun", "jul",
 "aug", "sep", "oct", "nov", "dec")
foo <- data.frame(mo=1:12, mo.fac, x, xupper, xlower)

# example 1: works but I want text (jan, feb, etc.) and not numbers on the x 
# axis
xYplot(Cbind(x,xlower,xupper) ~ mo, data=foo)

# example 2: doesn't work
xYplot(Cbind(x,xlower,xupper) ~ mo.fac, data=foo)

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Partial Dependence and RandomForest

2012-04-17 Thread Liaw, Andy
Note that the partialPlot() function also returns the x-y pairs being plotted, 
so you can work from there if you wish.  As to SD, my guess is you want some 
sort of confidence interval or band around the curve?  I do not know of any 
theory to produce that, but that may well just be my ignorance.
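For instance (a sketch on iris, assuming the randomForest package; the variable and class chosen here are illustrative):

```r
library(randomForest)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris)

# partialPlot() invisibly returns list(x, y); plot = FALSE skips drawing.
pp <- partialPlot(rf, iris, x.var = "Petal.Width",
                  which.class = "versicolor", plot = FALSE)

# The x value where partial dependence peaks, queried rather than eyeballed:
pp$x[which.max(pp$y)]
```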

Andy 

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of jmc
Sent: Friday, April 13, 2012 11:20 AM
To: r-help@r-project.org
Subject: Re: [R] Partial Dependence and RandomForest

Thank you Andy.  I obviously neglected to read into the help file and,
frustratingly, could have known this all along.  However, I am still
interested in knowing the relative maximum value in the partial plots via
query instead of visual interpretation (and possibly getting at other
statistical measures like standard deviation).  Is it possible to do this? 
I will keep investigating, but would appreciate a hint in the right
direction if you have time.

--
View this message in context: 
http://r.789695.n4.nabble.com/Partial-Dependence-and-RandomForest-tp4549705p4555146.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] loess function take

2012-04-13 Thread Liaw, Andy
Alternatively, use only a subset to run loess(): either a random sample, 
something like every k-th (sorted) data value, or the quantiles.  It's 
hard for me to imagine that that many data points are going to improve your 
model much at all (unless you use a tiny span).
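A sketch of the subsampling idea in base R (the data are simulated and the span value is illustrative; fitting 2,000 sampled rows stands in for fitting all 100,000):

```r
set.seed(1)
n <- 1e5
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)

# Fit loess on a 2,000-row random sample instead of all 100,000 rows;
# the fitted curve changes very little, but the fit is far faster.
idx <- sample(n, 2000)
fit <- loess(y ~ x, data = data.frame(x, y)[idx, ], span = 0.3)

pred <- predict(fit, newdata = data.frame(x = c(0.25, 0.75)))
round(pred, 2)   # close to the true values sin(pi/2) = 1, sin(3*pi/2) = -1
```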

Andy


From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Uwe Ligges

On 12.04.2012 05:49, arunkumar wrote:
 Hi

 The function loess takes very long time if the dataset is very huge
 I have around 100 records
 and used only one independent variable. still it takes very long time

 Any suggestion to reduce the time


Use another method that is computationally less expensive for that many 
observations.

Uwe Ligges


 -
 Thanks in Advance
  Arun
 --
 View this message in context: 
 http://r.789695.n4.nabble.com/loess-function-take-tp4550896p4550896.html
 Sent from the R help mailing list archive at Nabble.com.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Partial Dependence and RandomForest

2012-04-13 Thread Liaw, Andy
Please read the help page for the partialPlot() function and make sure you 
learn about all its arguments (in particular, which.class).

Andy 

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of jmc
Sent: Wednesday, April 11, 2012 2:44 PM
To: r-help@r-project.org
Subject: [R] Partial Dependence and RandomForest

Hello all~

I am interested in clarifying something more conceptual, so I won't be
providing any data or code here.  

From what I understand, partial dependence plots can help you understand the
relative dependence on a variable, and the subsequent values of that
variable, after averaging out the effects of the other input variables. 
This is great, but what I am interested in knowing is how that relates to
each predictor class, not just the overall prediction.

Is it possible to plot partial dependence per class?  Specifically, I'd like
to know the important threshold values of my most important variables.

Thank you for your time,


--
View this message in context: 
http://r.789695.n4.nabble.com/Partial-Dependence-and-RandomForest-tp4549705p4549705.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Execution speed in randomForest

2012-04-13 Thread Liaw, Andy
Without seeing your code, it's hard to say much more, but do avoid using 
formula when you have large data.

Andy 
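For illustration, a small timing sketch of the two interfaces (made-up data; exact timings will vary by machine):

```r
library(randomForest)

set.seed(1)
n <- 5000
dat <- data.frame(matrix(rnorm(n * 10), n, 10))
dat$y <- dat$X1 + rnorm(n)

## Formula interface: convenient, but model.frame()/terms() copy the
## data and keep the terms object around -- costly when n is large.
t.formula <- system.time(randomForest(y ~ ., data = dat, ntree = 50))

## Direct x/y interface: hands the predictors straight to the fitting
## code with no formula machinery.
t.xy <- system.time(randomForest(x = dat[, 1:10], y = dat$y, ntree = 50))

t.formula["elapsed"]
t.xy["elapsed"]
```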

-Original Message-
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
Behalf Of Jason & Caroline Shaw
Sent: Friday, April 06, 2012 1:20 PM
To: jim holtman
Cc: r-help@r-project.org
Subject: Re: [R] Execution speed in randomForest

The CPU time and elapsed time are essentially identical. (That is, the
system time is negligible.)

Using Rprof, I just ran the code twice.  The first time, while
randomForest is doing its thing, there are 850 consecutive lines which
read:
.C randomForest.default randomForest randomForest.formula randomForest
Upon running it a second time, this time taking 285 seconds to
complete, there are 14201 such lines, with nothing intervening.

There shouldn't be interference from elsewhere on the machine.  This
is the only memory- and CPU-intensive process.  I don't know how to
check what kind of paging is going on, but since the machine has 16GB
of memory and I am using maybe 3 or 4 at most, I hope paging is not an
issue.

I'm on a CentOS 5 box running R 2.15.0.

On Fri, Apr 6, 2012 at 12:45 PM, jim holtman jholt...@gmail.com wrote:
 Are you looking at the CPU or the elapsed time?  If it is the elapsed
 time, then also capture the CPU time to see if it is different.  Also
 consider the use of the Rprof function to see where time is being
 spent.  What else is running on the machine?  Are you doing any
 paging?  What type of system are you running on?  Use some of the
 system level profiling tools.  If on Windows, then use perfmon.

 On Fri, Apr 6, 2012 at 11:28 AM, Jason & Caroline Shaw
 los.sh...@gmail.com wrote:
 I am using the randomForest package.  I have found that multiple runs
 of precisely the same command can generate drastically different run
 times.  Can anyone with knowledge of this package provide some insight
 as to why this would happen and whether there's anything I can do
 about it?  Here are some details of what I'm doing:

 - Data: ~80,000 rows, with 10 columns (one of which is the class label)
 - I randomly select 90% of the data to use to build 500 trees.

 And this is what I find:

 - Execution times of randomForest() using the entire dataset (in
 seconds): 20.65, 20.93, 20.79, 21.05, 21.00, 21.52, 21.22, 21.22
 - Execution times of randomForest() using the 90% selection: 17.78,
 17.74, 126.52, 241.87, 17.56, 17.97, 182.05, 17.82 -- Note the 3rd,
 4th, and 7th.
 - When the speed is slow, it often stutters, with one or a few trees
 being produced very quickly, followed by a slow build taking 10 or 20
 seconds
 - The oob results are indistinguishable between the fast and slow runs.

 I select the 90% of my data by using sample() to generate indices and
 then subsetting, like: selection <- data[sample, ].  I thought perhaps
 this subsetting was getting repeated, rather than storing in memory a
 new copy of all that data, so I tried circumventing this with
 eval(data[sample,]).  Probably barking up the wrong tree -- it had no
 effect, and doesn't explain the run-to-run variation (really, I'm just
 not clear on what eval() is for).  I have also tried garbage
 collecting with gc() between each run, and adding a Sys.sleep() for 5
 seconds, but neither of these has helped either.

 Any ideas?

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



 --
 Jim Holtman
 Data Munger Guru

 What is the problem that you are trying to solve?
 Tell me what you want to do, not how you want to do it.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Imputing missing values using LSmeans (i.e., population marginal means) - advice in R?

2012-04-05 Thread Liaw, Andy
Don't know how you searched, but perhaps this might help:

https://stat.ethz.ch/pipermail/r-help/2007-March/128064.html 
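One common way to do this kind of imputation in R (a sketch with made-up data, and not necessarily identical to the approach in the linked post) is to fit an additive two-way model with SITE and YEAR as factors and predict the unobserved cells:

```r
## Hypothetical data in the poster's shape: SITE, YEAR, COUNT, with some
## site-year combinations never surveyed.
set.seed(1)
dat <- expand.grid(SITE = factor(1:20), YEAR = factor(1990:1999))
dat$COUNT <- rpois(nrow(dat), lambda = 200)
dat$COUNT[sample(nrow(dat), 30)] <- NA   # unsurveyed site-years

## Additive two-way model on the observed cells (YEAR as a factor, as
## the poster intends); lm() drops the NA rows automatically.
fit <- lm(COUNT ~ SITE + YEAR, data = dat)

## Fill the missing cells with their fitted values.
miss <- is.na(dat$COUNT)
dat$COUNT[miss] <- predict(fit, newdata = dat[miss, ])

## Estimated total population per year.
totals <- tapply(dat$COUNT, dat$YEAR, sum)
```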

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Jenn Barrett
 Sent: Tuesday, April 03, 2012 1:23 AM
 To: r-help@r-project.org
 Subject: [R] Imputing missing values using LSmeans (i.e., 
 population marginal means) - advice in R?
 
 Hi folks,
 
 I have a dataset that consists of counts over a ~30 year 
 period at multiple (200) sites. Only one count is conducted 
 at each site in each year; however, not all sites are 
 surveyed in all years. I need to impute the missing values 
 because I need an estimate of the total population size 
 (i.e., sum of counts across all sites) in each year as input 
 to another model. 
 
  head(newdat, 40)
     SITE YEAR COUNT
 1      1 1975 12620
 2      1 1976 13499
 3      1 1977 45575
 4      1 1978 21919
 5      1 1979 33423
 ...
 37     2 1975     4
 38     2 1978 40322
 39     2 1979     7
 40     2 1980 16244
 
 
 It was suggested to me by a statistician to use LSmeans to do 
 this; however, I do not have SAS, nor do I know anything much 
 about SAS. I have spent DAYS reading about these LSmeans 
 and while (I think) I understand what they are, I have 
 absolutely no idea how to a) calculate them in R and b) how 
 to use them to impute my missing values in R. Again, I've 
 searched the mail lists, internet and literature and have not 
 found any documentation to advise on how to do this - I'm lost.
 
 I've looked at popMeans, but have no clue how to use this 
 with predict() - if this is even the route to go. Any advice 
 would be much appreciated. Note that YEAR will be treated as 
 a factor and not a linear variable (i.e., the relationship 
 between COUNT and YEAR is not linear - rather there are highs 
 and lows about every 10 or so years).
 
 One thought I did have was to just set up a loop to calculate 
 the least-squares estimates as:
 
 Yij = (I*Yi. + J*Y.j - Y..)/[(I-1)(J-1)]
 where I = number of treatments and J = number of blocks (so I = sites and
 J = years), Yi. and Y.j are the corresponding treatment and block totals,
 and Y.. is the grand total. I found this formula in some stats
 lecture handouts by UC Davis on unbalanced data and 
 LSMeans...but does it yield the same thing as using the 
 LSmeans estimates? Does it make any sense? Thoughts?
 
 Many thanks in advance.
 
 Jenn
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


Re: [R] Question about randomForest

2012-04-04 Thread Liaw, Andy
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Saruman
 
 I dont see how this answered the original question of the poster.
 
 He was quite clear: the value of the predictions coming out 
 of RF do not
 match what comes out of the predict function using the same 
 RF object and
 the same data. Therefore, what is predict() doing that is 
 different from RF?
 Yes, RF is making its predictions using OOB, but nowhere does 
 it say what
 predict() is doing; indeed, it says if newdata is not given, then the
 results are just the OOB predictions. But if newdata = olddata, then
 predict(newdata) != OOB predictions. So what is it then? 

Let me make this as clear as I possibly can:  If predict() is called without 
newdata, all it can do is assume prediction on the training set is desired.  In 
that case it returns the OOB prediction.  If newdata is given in predict(), it 
assumes it is new data and thus makes predictions using all trees.  If you 
just feed the training data as newdata, then yes, you will get overfitted 
predictions.  It almost never makes sense (to me anyway) to make predictions on 
the training set.
 
 Opens another issue, which is if newdata is close but not 
 exactly oldata,
 then you get overfitted results?

Possibly, depending on how close the new data are to the training set.  This 
applies to nearly _ALL_ methods, not just RF.

Andy
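A minimal sketch of the distinction described above, using iris purely for illustration:

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)

oob   <- predict(rf)                  # no newdata: OOB predictions
refit <- predict(rf, newdata = iris)  # training data as newdata: all trees vote

mean(oob   != iris$Species)  # honest OOB error estimate
mean(refit != iris$Species)  # overfitted; typically much lower
```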
 
 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Question-about-randomForest-tp41
11311p4529770.html
 Sent from the R help mailing list archive at Nabble.com.
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


Re: [R] Memory limits for MDSplot in randomForest package

2012-03-30 Thread Liaw, Andy
Sam,

As you've probably seen, all the MDSplot() function does is feed 1 - proximity 
to the cmdscale() function.  Some suggestions and clarifications:

1. If all you want is the proximity matrix, you can run randomForest() with 
keep.forest=FALSE to save memory.  You will likely want to run a somewhat large 
number of trees if you're interested in proximity, and with the large number of 
data points, the trees are going to be quite large as well.

2. The proximity matrix is n x n, so if you have about 19000 data points, that's 
a 19000 by 19000 matrix, which takes approx. 2.8GB of memory to store a copy.

3. I tried making up a 19000^2 cross-product matrix, then tried cmdscale(1-xx, 
k=5).  The memory usage seems to peak at around 16.3GB, but I killed it after 
more than two hours.  Thus I suspect it really is the eigen decomposition in 
cmdscale() on such a large matrix that's taking up the time.

My suggestion is to see if you can find some efficient ways of doing eigen 
decomposition on such large matrices.  You might be able to make the proximity 
matrix sparse (e.g., by thresholding), and see if there are packages that can 
do the decomposition on the sparse form.

Best,
Andy
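For reference, the core of what MDSplot() computes can be sketched as follows (small data for illustration only; the 19000-point case hits the cmdscale() bottleneck discussed above):

```r
library(randomForest)

set.seed(1)
## keep.forest=FALSE as suggested above: we only need the proximities.
rf <- randomForest(iris[, 1:4], iris$Species,
                   proximity = TRUE, keep.forest = FALSE)

## MDSplot(rf, iris$Species) essentially does this:
mds <- cmdscale(1 - rf$proximity, k = 2)
plot(mds, col = as.integer(iris$Species), pch = as.integer(iris$Species),
     xlab = "Dim 1", ylab = "Dim 2")
```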


 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Sam Albers
 Sent: Friday, March 23, 2012 3:31 PM
 To: r-help@r-project.org
 Subject: [R] Memory limits for MDSplot in randomForest package
 
 Hello,
 
 I am struggling to produce an MDS plot using the randomForest package
  with a moderately large data set. My data set has one categorical
  response variable, 7 predictor variables and just under 19000
 observations. That means my proximity matrix is approximately 133000
 by 133000 which is quite large. To train a random forest on this large
 a dataset I have to use my institutions high performance computer.
 Using this setup I was able to train a randomForest with the proximity
 argument set to TRUE. At this point I wanted to construct an MDSplot
 using the following:
 
 MDSplot(nech.rf, nech.d$pd.fl, palette=c(1,2,3), 
 pch=as.numeric(nech.d$pd.fl))
 
 where nech.rf is the randomForest object and nech.d$pd.fl is the
 classification factor. Now with the architecture listed below, I've
 been waiting for approximately 2 days for this to run. My issue is
 that I am not sure if this will ever run.
 
 Can anyone recommend a way to tweak the MDSplot function to run a
 little faster? I tried changing the cmdscale arguments (i.e.
 eigenvalues) within the MDSplot function a little but that didn't seem
 to have any effect of the overall running time using a much smaller
 data set. Or even if someone could comment whether I am dreaming that
 this will actually ever run?
 
 This is probably the best computer that I will have access to so I was
 hoping that somehow I could get this to run. I was just hoping that
 someone reading the list might have some experience with randomForests
 and using large datasets and might be able to comment on my situation.
 Below the architecture information I have constructed a dummy example
 to illustrate what I am doing but given the nature of the problem,
 this doesn't completely reflect my situation.
 
 Any help would be much appreciated!
 
 Thanks!
 
 Sam
 
 
 
 Computer specs and sessionInfo()
 
 OS: Suse Linux
 Memory: 64 GB
 Processors: Intel Itanium 2, 64 x 1500 MHz
 
 And:
 
  sessionInfo()
 R version 2.6.2 (2008-02-08)
 ia64-unknown-linux-gnu
 
 locale:
 LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLA
 TE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8
 ;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC
 _MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
 
 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base
 
 other attached packages:
 [1] randomForest_4.6-6
 
 loaded via a namespace (and not attached):
 [1] rcompgen_0.1-17
 
 
 ###
 # Dummy Example
 ###
 
 require(randomForest)
 set.seed(17)
 
  ## Number of points
  x <- 10
 
  df <- rbind(
    data.frame(var1=runif(x, 10, 50),
               var2=runif(x, 2, 7),
               var3=runif(x, 0.2, 0.35),
               var4=runif(x, 1, 2),
               var5=runif(x, 5, 8),
               var6=runif(x, 1, 2),
               var7=runif(x, 5, 8),
               cls=factor("CLASS-2")),
    data.frame(var1=runif(x, 10, 50),
               var2=runif(x, -3, 3),
               var3=runif(x, 0.1, 0.25),
               var4=runif(x, 1, 2),
               var5=runif(x, 5, 8),
               var6=runif(x, 1, 2),
               var7=runif(x, 5, 8),
               cls=factor("CLASS-1"))
  )
 
  df.rf <- randomForest(y=df[,8], x=df[,1:7], proximity=TRUE,
                        importance=TRUE)
 
  MDSplot(df.rf, df$cls, k=2, palette=c(1,2,3,4),
          pch=as.numeric(df$cls))
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html

Re: [R] fitted values with locfit

2012-03-28 Thread Liaw, Andy
I believe you are expecting the software to do what it did not claim being able 
to do.  predict.locfit() does not have a type argument, nor can it take on 
"terms".  When you specify two variables in the smooth, a bivariate smooth is 
done, so you get one bivariate smooth function, not the sum of two univariate 
smooths.  If the latter is what you want, use packages that fit additive 
models.

Best,
Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Soberon 
 Velez, Alexandra Pilar
 Sent: Monday, March 19, 2012 5:13 AM
 To: r-help@r-project.org
 Subject: [R] fitted values with locfit
 
 Dear memberships,
 
 
 
 I'm trying to estimate the following multivariate local 
 regression model using the locfit package:
 
 BMI=m1(RCC)+m2(WCC)
 
 where (m1) and (m2) are unknown smooth functions.
 
 
 My problem is that once I get the regression done I cannot 
 get the fitted values of each of this smooth functions (m1) 
 and (m2). What I write is the following
 
  library(locfit)
  
  data(ais)
  fit2 <- locfit.raw(x=lp(ais$RCC,h=0.5,deg=1)+lp(ais$WCC,deg=1,h=0.75),
                     y=ais$BMI, ev=dat(), kt="prod", kern="gauss")
  g21 <- predict(fit2, type="terms")
 
 
 If I done this on the computer the results of (g21) is a 
 vector when I should have a matrix with 2 columns (one for 
 each fitted smooth function).
 
 
  Please, does somebody know how I can get the estimated fitted 
  values of both smooth functions (m1) and (m2) using a local 
  linear regression with kernel weights as in this example?
 
 
 thanks a lot in advance I'm very desperate.
 
 Alexandra
 
 
   [[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


[R] job opening at Merck Research Labs, NJ USA

2012-03-20 Thread Liaw, Andy
The Biometrics Research department at the Merck Research Laboratories has an 
open position to be located in Rahway, New Jersey, USA:

This position will be responsible for imaging and bio-signal biomarkers 
projects including analysis of preclinical, early clinical, and experimental 
medicine imaging and EEG data. Responsibilities include all phases of data 
analysis from processing of raw imaging and EEG data to derivation of 
endpoints. Part of the responsibilities is development and implementation of 
novel statistical methods and software for analysis of imaging and bio-signal 
data.  This position will closely collaborate with Imaging and Clinical 
Pharmacology departments; Experimental Medicine; Early and Late Stage 
Development Statistics; and Modeling and Simulation.  Publication and 
presentation of the results is highly encouraged as is collaboration with 
external experts. 

Education Minimum Requirement:  PhD in Statistics, Applied Mathematics, 
Physics, Computer Science, Engineering, or related fields
Required Experience and Skills: Education should include Statistics related 
courses or equivalently working experience should involve data analysis and 
statistical modeling for at least 1 year. Excellent computing skills, R and/or 
SAS , MATLAB  in Linux and Windows environment; working knowledge of parallel 
computing; C, C++,  or Fortran programming.  Dissertation or experience in at 
least one of these areas: statistical image and signal analysis; data mining 
and machine learning; mathematical modeling in medicine and biology;  general 
statistical research
Desired Experience and Skills -  education in and/or experience with EEG and 
Imaging data analysis; stochastic modeling; functional data analysis; 
familiarity with wavelet analysis and other spectral analysis methods


Please apply electronically at:
http://www.merck.com/careers/search-and-apply/search-jobs/home.html 
Click on "Experienced Opportunities", and search by Requisition ID: BIO003546 
and email CV to:
vladimir_svet...@merck.com


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Using categorical variables in package randomForest.

2012-03-13 Thread Liaw, Andy
The way to represent categorical variables is with factors.  See ?factor.  
randomForest() will handle factors appropriately, as most modeling functions in 
R.

Andy 
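A minimal sketch (made-up data) of representing categorical predictors and a 0/1 class label as factors:

```r
library(randomForest)

set.seed(1)
n <- 200
dat <- data.frame(
  colour = factor(sample(c("red", "green", "blue"), n, replace = TRUE)),
  size   = factor(sample(c("S", "M", "L"), n, replace = TRUE)),
  x1     = rnorm(n),
  y      = factor(sample(0:1, n, replace = TRUE))  # class label as a factor
)

rf <- randomForest(y ~ ., data = dat)
rf$type  # "classification": the factor response triggers classification
```

Had y been left numeric 0/1, randomForest() would have fit a regression forest instead, which is exactly the behaviour the poster is worried about.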

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of abhishek
 Sent: Tuesday, March 13, 2012 8:11 AM
 To: r-help@r-project.org
  Subject: [R] Using categorical variables in package randomForest.
 
 Hello,
 
  I am sorry if there are already posts that answer this 
  question, but I
  tried to find them before making this post. I did not really 
  find relevant
  posts.
 
 I am using randomForest package for building a two class 
 classifier. There
 are categorical variables and numerical variables in my data. 
 Different
 categorical variables have different number of categories 
 from 2 to 10. I am
 not sure about how to represent the categorical data.
 For example, I am using 0 and 1 for variables that have only 
 two categories.
  But I suspect the program is analysing the values as 
  numerical. Do you have
  any idea how I can use the categorical variables for 
  building a two-class
  classifier? I am using a factor consisting of 0 and 1 for the
  classification target.
 
 Thank you for your ideas.
 
 -
 abhishek
 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Using-caegorical-variables-in-pa
ckage-randomForest-tp4468923p4468923.html
 Sent from the R help mailing list archive at Nabble.com.
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


Re: [R] Help on reshape function

2012-03-06 Thread Liaw, Andy
Just using the reshape() function in base R:

df.long <- reshape(df, varying=list(names(df)[4:7]), direction="long")

This also gives two extra columns (time and id) that can be dropped.

Andy 
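A self-contained version of that call, reconstructing df from the quoted message below:

```r
df <- data.frame(ID1 = rep(1, 9),
                 ID2 = rep(c("A", "B", "C"), each = 3),
                 ID3 = "E",
                 X1 = c(1,4,3,5,2,4,6,4,2),
                 X2 = c(6,8,9,6,7,8,9,6,7),
                 X3 = c(7,6,7,5,6,5,6,7,5),
                 X4 = c(1,2,1,2,3,1,2,1,2))

## Stack X1..X4 into one long column, here explicitly named "X".
df.long <- reshape(df, varying = list(names(df)[4:7]),
                   v.names = "X", direction = "long")

## Drop the bookkeeping columns reshape() adds:
df.long <- df.long[, c("ID1", "ID2", "ID3", "X")]
```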

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of R. Michael Weylandt
 Sent: Tuesday, March 06, 2012 8:45 AM
 To: mails
 Cc: r-help@r-project.org
 Subject: Re: [R] Help on reshape function
 
 library(reshape2)
 
  melt(df, id.vars = c("ID1", "ID2", "ID3"))[, -4]
 # To drop an extraneous column (but you should take a look and see
 what it is for future reference)
 
 Michael
 
 On Tue, Mar 6, 2012 at 6:17 AM, mails mails00...@gmail.com wrote:
  Hello,
 
 
  I am trying to reshape a data.frame in wide format into long format.
  Although in the reshape R documentation
  the programmers list some examples, I am struggling to bring 
 my data.frame
  into long and then transform it back into wide format. The 
 data.frame I look
  at is:
 
 
  df <- data.frame(ID1 = c(1,1,1,1,1,1,1,1,1),
                   ID2 = c("A", "A", "A", "B", "B", "B", "C", "C", "C"),
                   ID3 = c("E", "E", "E", "E", "E", "E", "E", "E", "E"),
                   X1 = c(1,4,3,5,2,4,6,4,2),
                   X2 = c(6,8,9,6,7,8,9,6,7),
                   X3 = c(7,6,7,5,6,5,6,7,5),
                   X4 = c(1,2,1,2,3,1,2,1,2))
 
  df
   ID1 ID2 ID3 X1 X2 X3 X4
  1   1   A   E  1  6  7  1
  2   1   A   E  4  8  6  2
  3   1   A   E  3  9  7  1
  4   1   B   E  5  6  5  2
  5   1   B   E  2  7  6  3
  6   1   B   E  4  8  5  1
  7   1   C   E  6  9  6  2
  8   1   C   E  4  6  7  1
  9   1   C   E  2  7  5  2
 
  I want to use the reshape function to get the following result:
 
  df
   ID1 ID2 ID3 X
  1   1   A   E  1
  2   1   A   E  4
  3   1   A   E  3
  4   1   B   E  5
  5   1   B   E  2
  6   1   B   E  4
  7   1   C   E  6
  8   1   C   E  4
  9   1   C   E  2
 
  10   1   A   E  6
  11   1   A   E  8
  12   1   A   E  9
  13   1   B   E  6
  14   1   B   E  7
  15   1   B   E  8
  16   1   C   E  9
  17   1   C   E  6
  18   1   C   E  7
 
  19   1   A   E  7
  20   1   A   E  6
  21   1   A   E  7
  22   1   B   E  5
  23   1   B   E  6
  24   1   B   E  5
  25   1   C   E  6
  26   1   C   E  7
  27   1   C   E  5
 
  28   1   A   E  1
  29   1   A   E  2
  30   1   A   E  1
  31   1   B   E  2
  32   1   B   E  3
  33   1   B   E  1
  34   1   C   E  2
  35   1   C   E  1
  36   1   C   E  2
 
 
  Can anyone help?
 
  Cheers
 
 
 
  --
  View this message in context: 
 http://r.789695.n4.nabble.com/Help-on-reshape-function-tp44494
64p4449464.html
  Sent from the R help mailing list archive at Nabble.com.
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 


Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

2012-02-29 Thread Liaw, Andy
That's why I said you need the book.  The details are all in the book.


From: Michael [mailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 1:49 PM
To: Liaw, Andy
Cc: r-help
Subject: Re: [R] Good and modern Kernel Regression package in R with 
auto-bandwidth?

Thanks Andy.

I am reading the locfit document...

but not sure how to do the CV and bandwidth selection...

Here is a quote about the function regband: it doesn't seem to be usable?

Basically I am looking for a locfit that comes with an automatic bandwidth 
selection so that I am essentially parameter free for the local-regression 
step...

-

regband                 Bandwidth selectors for local regression

Description:

     Function to compute local regression bandwidths for local linear
     regression, implemented as a front end to locfit().

     This function is included for comparative purposes only. Plug-in
     selectors are based on flawed logic, make unreasonable and
     restrictive assumptions and do not use the full power of the
     estimates available in Locfit. Any relation between the results
     produced by this function and desirable estimates are entirely
     coincidental.

Usage:

     regband(formula, what = c("CP", "GCV", "GKK", "RSW"), deg = 1, ...)

2012/2/23 Liaw, Andy andy_l...@merck.commailto:andy_l...@merck.com
If that's the kind of framework you'd like to work in, use locfit, which has 
the predict() method for evaluating new data.  There are several different 
bandwidth selectors in that package for your choosing.

Kernel smoothers don't really fit the framework of creating a model object, 
followed by predicting new data using that fitted model object, very well 
because of their local nature.  Think of k-nn classification, which has a 
similar problem:  the model needs to be computed for every data point you 
want to predict.

Andy


From: Michael [mailto:comtech@gmail.commailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 10:06 AM

To: Liaw, Andy
Cc: Bert Gunter; r-help
Subject: Re: [R] Good and modern Kernel Regression package in R with 
auto-bandwidth?

Thank you Andy!

I went through the KernSmooth package but I don't see a way to use the fitted 
function to do the predict part...


data <- data.frame(z=z, x=x)

datanew <- data.frame(z=z, x=x)

lmfit <- lm(z ~ x, data=data)

lmforecast <- predict(lmfit, newdata=datanew)

Am I missing anything here?

Thanks!
2012/2/23 Liaw, Andy andy_l...@merck.commailto:andy_l...@merck.com
In short, pick your poison...

Is there any particular reason why the tools that shipped with R itself (e.g., 
KernSmooth) are inadequate for you?

I like using the locfit package because it has many tools, including the ones 
that the author didn't think were optimal.  You may need the book to get most 
mileage out of it though.

Andy


From: Michael [mailto:comtech@gmail.commailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 12:25 AM
To: Liaw, Andy
Cc: Bert Gunter; r-help

Subject: Re: [R] Good and modern Kernel Regression package in R with 
auto-bandwidth?

I meant it's very slow when I use cv.aic...

On Wed, Feb 22, 2012 at 11:24 PM, Michael 
comtech@gmail.commailto:comtech@gmail.com wrote:
Is np an okay package to use?

I am worried about the multi-start thing... and also it's very slow...


On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy 
andy_l...@merck.commailto:andy_l...@merck.com wrote:
Bert's question aside (I was going to ask about laundry, but that's much harder 
than taxes...), my understanding of the situation is that "optimal" is in the 
eye of the beholder.  There are at least two schools of thought on which is 
the better way of automatically selecting bandwidth: plug-in methods or 
CV-type.  The last I checked, the jury was still out.

Andy

 -Original Message-
 From: r-help-boun...@r-project.orgmailto:r-help-boun...@r-project.org
 [mailto:r-help-boun...@r-project.orgmailto:r-help-boun...@r-project.org] On 
 Behalf Of Bert Gunter
 Sent: Wednesday, February 22, 2012 6:03 PM
 To: Michael
 Cc: r-help
 Subject: Re: [R] Good and modern Kernel Regression package in
 R with auto-bandwidth?

 Would you like it to do your taxes for you too? :-)

 Bert

 Sent from my iPhone -- please excuse typos.

 On Feb 22, 2012, at 11:46 AM, Michael 
 comtech@gmail.commailto:comtech@gmail.com wrote:

  Hi all,
 
  I am looking for a good and modern Kernel Regression
 package in R, which
  has the following features:
 
  1) It has cross-validation
  2) It can automatically choose the optimal bandwidth
  3) It doesn't have random effect - i.e. if I run the
 function at different
  times on the same data-set, the results should be exactly
 the same... I am
  trying np, but I am seeing:
 
  Multistart 1 of 1 |
  Multistart 1 of 1 |
  ...
 
  It looks like in order to do the optimization, it's doing
  multiple-random-start optimization... am I right?
 
 
  Could you please

Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

2012-02-23 Thread Liaw, Andy
In short, pick your poison...

Is there any particular reason why the tools that shipped with R itself (e.g., 
KernSmooth) are inadequate for you?

I like using the locfit package because it has many tools, including the ones 
that the author didn't think were optimal.  You may need the book to get most 
mileage out of it though.

Andy


From: Michael [mailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 12:25 AM
To: Liaw, Andy
Cc: Bert Gunter; r-help
Subject: Re: [R] Good and modern Kernel Regression package in R with 
auto-bandwidth?

I meant it's very slow when I use cv.aic...

On Wed, Feb 22, 2012 at 11:24 PM, Michael 
comtech@gmail.commailto:comtech@gmail.com wrote:
Is np an okay package to use?

I am worried about the multi-start thing... and also it's very slow...


On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy 
andy_l...@merck.commailto:andy_l...@merck.com wrote:
Bert's question aside (I was going to ask about laundry, but that's much harder 
than taxes...), my understanding of the situation is that "optimal" is in the 
eye of the beholder.  There are at least two schools of thought on which is 
the better way of automatically selecting bandwidth: plug-in methods or 
CV-type.  The last I checked, the jury was still out.

Andy

 -Original Message-
 From: r-help-boun...@r-project.orgmailto:r-help-boun...@r-project.org
 [mailto:r-help-boun...@r-project.orgmailto:r-help-boun...@r-project.org] On 
 Behalf Of Bert Gunter
 Sent: Wednesday, February 22, 2012 6:03 PM
 To: Michael
 Cc: r-help
 Subject: Re: [R] Good and modern Kernel Regression package in
 R with auto-bandwidth?

 Would you like it to do your taxes for you too? :-)

 Bert

 Sent from my iPhone -- please excuse typos.

 On Feb 22, 2012, at 11:46 AM, Michael 
 comtech@gmail.commailto:comtech@gmail.com wrote:

  Hi all,
 
  I am looking for a good and modern Kernel Regression
 package in R, which
  has the following features:
 
  1) It has cross-validation
  2) It can automatically choose the optimal bandwidth
  3) It doesn't have random effect - i.e. if I run the
 function at different
  times on the same data-set, the results should be exactly
 the same... I am
  trying np, but I am seeing:
 
  Multistart 1 of 1 |
  Multistart 1 of 1 |
  ...
 
  It looks like in order to do the optimization, it's doing
  multiple-random-start optimization... am I right?
 
 
  Could you please give me some pointers?
 
  I did some google search but there are so many packages
 that do this... I
  just wanted to find the best/modern one to use...
 
  Thank you!
 
 [[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

Notice:  This e-mail message, together with any attachme...{{dropped:27}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

2012-02-23 Thread Liaw, Andy
If that's the kind of framework you'd like to work in, use locfit, which has 
a predict() method for evaluating new data.  There are several different 
bandwidth selectors in that package for you to choose from.

Kernel smoothers don't really fit the framework of creating a model object 
and then predicting new data with that fitted model object, because of their 
local nature.  Think of k-nn classification, which has a similar problem: 
the model needs to be computed for every data point you want to predict.

Andy


From: Michael [mailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 10:06 AM
To: Liaw, Andy
Cc: Bert Gunter; r-help
Subject: Re: [R] Good and modern Kernel Regression package in R with 
auto-bandwidth?

Thank you Andy!

I went through the KernSmooth package, but I don't see a way to use the fitted 
object to do the prediction part...


data <- data.frame(z=z, x=x)

datanew <- data.frame(z=z, x=x)

lmfit <- lm(z ~ x, data=data)

lmforecast <- predict(lmfit, newdata=datanew)

Am I missing anything here?

Thanks!
2012/2/23 Liaw, Andy andy_l...@merck.com
In short, pick your poison...

Is there any particular reason why the tools that ship with R itself (e.g., 
KernSmooth) are inadequate for you?

I like using the locfit package because it has many tools, including the ones 
that the author didn't think were optimal.  You may need the book to get the 
most mileage out of it, though.

Andy


From: Michael [mailto:comtech@gmail.com]
Sent: Thursday, February 23, 2012 12:25 AM
To: Liaw, Andy
Cc: Bert Gunter; r-help

Subject: Re: [R] Good and modern Kernel Regression package in R with 
auto-bandwidth?

I meant it's very slow when I use cv.aic...

On Wed, Feb 22, 2012 at 11:24 PM, Michael comtech@gmail.com wrote:
Is np an okay package to use?

I am worried about the multi-start thing... and also it's very slow...


On Wed, Feb 22, 2012 at 8:35 PM, Liaw, Andy andy_l...@merck.com wrote:
Bert's question aside (I was going to ask about laundry, but that's much harder 
than taxes...), my understanding of the situation is that "optimal" is in the 
eye of the beholder.  There are at least two schools of thought on which is 
the better way of automatically selecting the bandwidth: plug-in methods or 
CV-type methods.  The last I checked, the jury was still out.

Andy

 -Original Message-
 From: r-help-boun...@r-project.org
 [mailto:r-help-boun...@r-project.org] On 
 Behalf Of Bert Gunter
 Sent: Wednesday, February 22, 2012 6:03 PM
 To: Michael
 Cc: r-help
 Subject: Re: [R] Good and modern Kernel Regression package in
 R with auto-bandwidth?

 Would you like it to do your taxes for you too? :-)

 Bert

 Sent from my iPhone -- please excuse typos.

 On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote:

  Hi all,
 
  I am looking for a good and modern Kernel Regression
 package in R, which
  has the following features:
 
  1) It has cross-validation
  2) It can automatically choose the optimal bandwidth
  3) It doesn't have random effect - i.e. if I run the
 function at different
  times on the same data-set, the results should be exactly
 the same... I am
  trying np, but I am seeing:
 
  Multistart 1 of 1 |
  Multistart 1 of 1 |
  ...
 
  It looks like in order to do the optimization, it's doing
  multiple-random-start optimization... am I right?
 
 
  Could you please give me some pointers?
 
  I did some google search but there are so many packages
 that do this... I
  just wanted to find the best/modern one to use...
 
  Thank you!
 
 [[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

Notice:  This e-mail message, together with any attachments, contains
information of Merck & Co., Inc. (One Merck Drive, Whitehouse Station,
New Jersey, USA 08889), and/or its affiliates (direct contact information
for affiliates is available at
http://www.merck.com/contact/contacts.html) that may be confidential,
proprietary, copyrighted and/or legally privileged. It is intended solely
for the use of the individual or entity named on this message. If you

Re: [R] Good and modern Kernel Regression package in R with auto-bandwidth?

2012-02-22 Thread Liaw, Andy
Bert's question aside (I was going to ask about laundry, but that's much harder 
than taxes...), my understanding of the situation is that "optimal" is in the 
eye of the beholder.  There are at least two schools of thought on which is 
the better way of automatically selecting the bandwidth: plug-in methods or 
CV-type methods.  The last I checked, the jury was still out.

Andy 
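As a rough illustration of the two schools, one hedged sketch (assuming numeric vectors `x` and `y`; names are illustrative, not from this thread):

```r
# Plug-in school: KernSmooth ships with R
library(KernSmooth)
h.pi   <- dpill(x, y)                      # direct plug-in bandwidth
fit.pi <- locpoly(x, y, bandwidth = h.pi)  # local linear fit on a grid

# CV school: np on CRAN
library(np)
bw.cv  <- npregbw(y ~ x)                   # least-squares cross-validation by default
fit.cv <- npreg(bws = bw.cv)
```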

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Bert Gunter
 Sent: Wednesday, February 22, 2012 6:03 PM
 To: Michael
 Cc: r-help
 Subject: Re: [R] Good and modern Kernel Regression package in 
 R with auto-bandwidth?
 
 Would you like it to do your taxes for you too? :-)
 
 Bert
 
 Sent from my iPhone -- please excuse typos.
 
 On Feb 22, 2012, at 11:46 AM, Michael comtech@gmail.com wrote:
 
  Hi all,
  
  I am looking for a good and modern Kernel Regression 
 package in R, which
  has the following features:
  
  1) It has cross-validation
  2) It can automatically choose the optimal bandwidth
  3) It doesn't have random effect - i.e. if I run the 
 function at different
  times on the same data-set, the results should be exactly 
 the same... I am
  trying np, but I am seeing:
  
  Multistart 1 of 1 |
  Multistart 1 of 1 |
  ...
  
  It looks like in order to do the optimization, it's doing
  multiple-random-start optimization... am I right?
  
  
  Could you please give me some pointers?
  
  I did some google search but there are so many packages 
 that do this... I
  just wanted to find the best/modern one to use...
  
  Thank you!
  
 [[alternative HTML version deleted]]
  
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Random Forest Package

2012-02-01 Thread Liaw, Andy
You should be able to use the Rgui menu to install packages.
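Equivalently, from the R console (a binary install needs neither Rtools nor MiKTeX; those are only required when building packages from source):

```r
# Fetches a pre-built binary package on Windows
install.packages("randomForest")
library(randomForest)
```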

Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Niratha
 Sent: Wednesday, February 01, 2012 5:16 AM
 To: r-help@r-project.org
 Subject: [R] Random Forest Package
 
  Hi,
   I have installed R version 2.14 on Windows 7.  I want to use the
  randomForest package.  I installed Rtools and MiKTeX 2.9, but I am not
  able to read the DESCRIPTION file and it is also not possible to build
  the package.  When I run R CMD INSTALL --build randomForest in Windows,
  it shows the error R CMD is not recognized as an internal or
  external command.
  
  Thanks
  Niratha
 
 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Random-Forest-Package-tp4347424p
4347424.html
 Sent from the R help mailing list archive at Nabble.com.
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] randomForest: proximity for new objects using an existing rf

2012-02-01 Thread Liaw, Andy
There's an alternative, but it may not be any more efficient in time or 
memory...

You can run predict() on the training set once, setting nodes=TRUE.  That will 
give you an n by ntree matrix recording which node of each tree each data point 
falls in.  For any new data, run predict() with nodes=TRUE as well, then compute 
the proximity by hand by counting how often any given pair lands in the same 
terminal node of each tree.
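A rough sketch of that counting step, assuming a fitted forest `rf` and predictor matrices `Xtrain` and `Xnew` (illustrative names, not from the original post):

```r
library(randomForest)

# Which terminal node each point falls in, per tree (n x ntree matrices);
# predict() with nodes=TRUE attaches this as the "nodes" attribute
nodes.tr  <- attr(predict(rf, Xtrain, nodes = TRUE), "nodes")
nodes.new <- attr(predict(rf, Xnew,  nodes = TRUE), "nodes")

# Proximity = fraction of trees in which a pair shares a terminal node
prox <- matrix(0, nrow(nodes.new), nrow(nodes.tr))
for (k in seq_len(ncol(nodes.tr)))
    prox <- prox + outer(nodes.new[, k], nodes.tr[, k], "==")
prox <- prox / ncol(nodes.tr)
```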

Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Kilian
 Sent: Wednesday, February 01, 2012 5:39 AM
 To: r-help@r-project.org
 Subject: [R] randomForest: proximity for new objects using an 
 existing rf
 
 Dear all,
 
 using an existing random forest, I would like to calculate 
 the proximity
 for a new test object, i.e. the similarity between the new 
 object and the
 old training objects which were used for building the random 
 forest. I do
 not want to build a new random forest based on both old and 
 new objects.
 
 Currently, my workaround is to calculate the proximites of a 
 combined data
 set consisting of training and new objects like this:
 
  model <- randomForest(Xtrain, Ytrain)  # build random forest
  nnew <- nrow(Xnew)  # number of new objects
  Xcombi <- rbind(Xnew, Xtrain)  # combine new objects and training objects
  predcombi <- predict(model, Xcombi, proximity=TRUE)  # calculate proximities
  proxcombi <- predcombi$proximity  # get proximities of combined dataset
  proxnew <- proxcombi[(1:nnew), -(1:nnew)]  # get proximities of new objects only
 
 But this approach causes a lot of wasted computation time as I am not
 interested in the proximities among the training objects 
 themselves but
 only among the training objects and the new objects. With 
 1000 training
 objects and 5 new objects, I have to calculate a 1005x1005 
 proximity matrix
 to get the essential 5x1000 matrix of the new objects only.
 
 Am I doing something wrong? I read through the documentation 
 but could not
 find another solution. Any advice would be highly appreciated.
 
 Thanks in advance!
 Kilian
 
   [[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X)

2012-02-01 Thread Liaw, Andy
Hi Ista,

When you write a package, you have to anticipate what users will throw at the 
code.  I could insist that users only input matrices where none of the column 
names are empty, but that's not something I wish to impose on users.  I could 
add the name if it's empty, but as a user I wouldn't want a function to do 
that, either.  That's why I need to look for a workaround.

Using which() seems rather clumsy for the purpose, as I need to combine those 
with the non-empty ones, and preserving ordering would be a mess.

Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Ista Zahn
 Sent: Wednesday, February 01, 2012 5:45 AM
 To: r-help@r-project.org
 Subject: Re: [R] indexing by empty string (was RE: Error in 
 predict.randomForest ... subscript out of bounds with NULL name in X)
 
 Hi Andy,
 
 On Tuesday, January 31, 2012 08:44:13 AM Liaw, Andy wrote:
  I'm not exactly sure if this is a problem with indexing by 
 name; i.e., is
  the following behavior by design?  The problem is that 
 names or dimnames
  that are empty seem to be treated differently, and one 
 can't index by them:
  
  R> junk = 1:3
  R> names(junk) = c("a", "b", "")
  R> junk
  a b   
  1 2 3 
  R> junk[""]
  <NA> 
    NA 
  R> junk = matrix(1:4, 2, 2)
  R> colnames(junk) = c("a", "")
  R> junk[, ""]
  Error: subscript out of bounds
 
 You can index them by number, e.g.,
 junk[, 2]
 
 and you can use which() to find the numbers where the colname 
 is empty.
 
 junk[, which(colnames(junk) == "")]
 
 
  
  I may need to find workaround...
 
 Going back to the original issue with predict, I don't think you need a 
 workaround. I think you need to give your matrix some colnames.
 
 Best,
 Ista
 
  
   -Original Message-
   From: r-help-boun...@r-project.org
   [mailto:r-help-boun...@r-project.org] On Behalf Of 
 Czerminski, Ryszard
   Sent: Wednesday, January 25, 2012 10:39 AM
   To: r-help@r-project.org
   Subject: [R] Error in predict.randomForest ... subscript out
   of bounds with NULL name in X
   
    RF trains fine with X, but fails on prediction
    
    > library(randomForest)
    > chirps <- c(20,16.0,19.8,18.4,17.1,15.5,14.7,17.1,15.4,16.2,15,17.2,16,17,14.1)
    > temp <- c(88.6,71.6,93.3,84.3,80.6,75.2,69.7,82,69.4,83.3,78.6,82.6,80.6,83.5,76.3)
    > X <- cbind(1, chirps)
    > rf <- randomForest(X, temp)
    > yp <- predict(rf, X)
    
    Error in predict.randomForest(rf, X) : subscript out of bounds
    
    BTW: Just found out that apparently predict() does not like a NULL name in
    X, because this works fine:
    > one <- rep(1, length(chirps))
    > X <- cbind(one, chirps)
    > rf <- randomForest(X, temp)
    > yp <- predict(rf, X)
   
   Ryszard Czerminski
   AstraZeneca Pharmaceuticals LP
   35 Gatehouse Drive
   Waltham, MA 02451
   USA
   781-839-4304
   ryszard.czermin...@astrazeneca.com
   
   
   --
   
   Confidentiality Notice: This message is private and may
   ...{{dropped:11}}
   
   __
   R-help@r-project.org mailing list
   https://stat.ethz.ch/mailman/listinfo/r-help
   PLEASE do read the posting guide
   http://www.R-project.org/posting-guide.html
   and provide commented, minimal, self-contained, reproducible code.
  
  Notice:  This e-mail message, together with any 
 attachme...{{dropped:11}}
  
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Bivariate Partial Dependence Plots in Random Forests

2012-01-31 Thread Liaw, Andy
The reason that it's not implemented is its computational cost.  Some 
users have done it on their own using the same idea.  It's just that it takes 
too much memory for even moderately sized data.  It can be done much more 
efficiently in MART because computational shortcuts were used.
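The brute-force version some users wrote can be sketched as follows (hedged: `rf`, `dat`, and the two variable names are illustrative, not from the original posts):

```r
library(randomForest)

# Bivariate partial dependence of rf on variables v1 and v2, brute force:
# for each (a, b) on a grid, set v1 = a and v2 = b for ALL rows of the
# training data and average the predictions
pd2 <- function(rf, dat, v1, v2, ngrid = 25) {
    g1 <- seq(min(dat[[v1]]), max(dat[[v1]]), length.out = ngrid)
    g2 <- seq(min(dat[[v2]]), max(dat[[v2]]), length.out = ngrid)
    z <- outer(g1, g2, Vectorize(function(a, b) {
        tmp <- dat
        tmp[[v1]] <- a
        tmp[[v2]] <- b
        mean(predict(rf, tmp))
    }))
    list(x = g1, y = g2, z = z)  # suitable for persp() or contour()
}
```

This is the memory/time trade-off mentioned above: every grid cell requires a full pass of predictions over the data.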

Best,
Andy

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Lucie Bland
 Sent: Friday, January 27, 2012 5:01 AM
 To: r-help@r-project.org
 Subject: [R] Bivariate Partial Dependence Plots in Random Forests
 
 Hello,
 
  
 
 I was wondering if anyone knew of an R function/R code to 
 plot bivariate
 (3 dimensional) partial dependence plots in random forests 
 (randomForest
 package). 
 
  
 
 It is apparently possible using the rgl package
 (http://esapubs.org/archive/ecol/E088/173/appendix-C.htm) or there may
 be a more direct function such as the pairplot() in MART (multiple
 additive regression trees)?
 
  
 
 Many thanks,
 
  
 
 Lucie
 
  
 
 My Computer:
 
 HP Z400 Workstation,
 
 16.0 GB, Windows 7 Professional, Intel(R) Xeon(R) CPU, W365 3.20 GHz
 3.19 GHz
 
 64bit
 
  
 
 My R version:
 
 R version 2.14.1 (2011-12-22) 64 bit
 
 
 
 The Zoological Society of London is incorporated by Royal Charter
 Principal Office England. Company Number RC000749
 Registered address: 
 Regent's Park, London, England NW1 4RY
 Registered Charity in England and Wales no. 208728 
 
 __
 ___
 This e-mail has been sent in confidence to the named=2...{{dropped:21}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] indexing by empty string (was RE: Error in predict.randomForest ... subscript out of bounds with NULL name in X)

2012-01-31 Thread Liaw, Andy
I'm not exactly sure if this is a problem with indexing by name; i.e., is the 
following behavior by design?  The problem is that names or dimnames that are 
empty seem to be treated differently, and one can't index by them:

R> junk = 1:3
R> names(junk) = c("a", "b", "")
R> junk
a b   
1 2 3 
R> junk[""]
<NA> 
  NA 
R> junk = matrix(1:4, 2, 2)
R> colnames(junk) = c("a", "")
R> junk[, ""]
Error: subscript out of bounds

I may need to find workaround...

 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Czerminski, Ryszard
 Sent: Wednesday, January 25, 2012 10:39 AM
 To: r-help@r-project.org
 Subject: [R] Error in predict.randomForest ... subscript out 
 of bounds with NULL name in X
 
  RF trains fine with X, but fails on prediction
  
  > library(randomForest)
  > chirps <- c(20,16.0,19.8,18.4,17.1,15.5,14.7,17.1,15.4,16.2,15,17.2,16,17,14.1)
  > temp <- c(88.6,71.6,93.3,84.3,80.6,75.2,69.7,82,69.4,83.3,78.6,82.6,80.6,83.5,76.3)
  > X <- cbind(1, chirps)
  > rf <- randomForest(X, temp)
  > yp <- predict(rf, X)
  Error in predict.randomForest(rf, X) : subscript out of bounds
  
  BTW: Just found out that apparently predict() does not like a NULL name in
  X, because this works fine:
  
  > one <- rep(1, length(chirps))
  > X <- cbind(one, chirps)
  > rf <- randomForest(X, temp)
  > yp <- predict(rf, X)
 
 Ryszard Czerminski
 AstraZeneca Pharmaceuticals LP
 35 Gatehouse Drive
 Waltham, MA 02451
 USA
 781-839-4304
 ryszard.czermin...@astrazeneca.com
 
 
 --
 
 Confidentiality Notice: This message is private and may 
 ...{{dropped:11}}
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Variable selection based on both training and testing data

2012-01-30 Thread Liaw, Andy
Variable selection is part of the training process -- it chooses the model.  By 
definition, test data is used only for testing (evaluating the chosen model).

If you find a package or function that does variable selection on test data, 
run from it!

Best,
Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Jin Minming
 Sent: Monday, January 30, 2012 8:14 AM
 To: r-help@r-project.org
 Subject: [R] Variable selection based on both training and 
 testing data
 
 Dear all,
 
 The variable selection in regression is usually determined by 
 the training data using AIC or F value, such as stepAIC. Is 
 there some R package that can consider both the training and 
 test dataset? For example, I have two separate training data 
 and test data. Firstly, a regression model is obtained by 
 using training data, and then this model is tested by using 
 test data. This process continues in order to find some 
 possible optimal models in terms of RMSE or R2 for both 
 training and test data. 
 
 Thanks,
 
 Jim
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] contour(): Thickness contour labels

2012-01-21 Thread Andy Richling
Hi,

I want to display some contour labels. It works well, but the thickness of
the labels is too low. I tried the labcex argument

contour(x,y,z, labcex=2)

but only the size rises, not the thickness of the labels. Only if I set
the value to labcex=10 is the thickness good, but then the size is too big,
so I can't see anything ;)

Is there any command to raise the thickness of contour labels, like the

lwd=?? argument

does for the width of lines?

Thanks for help :)

Andy

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] tm package, custom reader

2012-01-14 Thread Andy Adamiec
On Sat, Jan 14, 2012 at 12:41 PM, Milan Bouchet-Valat nalimi...@club.fr wrote:

 On Saturday 14 January 2012 at 12:24 -0600, Andy Adamiec wrote:
  Hi Milan,
 
 
  The xml solr files are not in a typical format, here is an example
  http://www.omegahat.org/RSXML/solr.xml
  I'm not sure how to parse the documents with out using solrDocs.R
  function, and how to make the function compatible with a tm package.
 Indeed, this doesn't seem to be easy to parse using the generic XML
 source from tm. So it will be easier for you to create your own custom
 source from scratch. Have a look at the source.R and reader.R files in
 the tm source: you need to replicate the behavior of one of the sources.

 The code should include the following functions:

 readSorl <- FunctionGenerator(function(...) {
     function(elem, language, id) {
         # Use elem$content, which contains an item set by SorlSource() below,
         # and create a PlainTextDocument() from it,
         # putting the data where appropriate (text, meta-data)
     }
 })

 SorlSource <- function(x) {
     # Parse the XML file using functions from solrDocs.R, and
     # create content, which is a list with one item for each document,
     # to pass to readSorl() one by one

     s <- tm:::.Source(readSorl, "UTF-8", length(content), FALSE,
                       seq(1, length(content)), 0, FALSE)
     s$Content <- content
     s$URI <- match.call()$x
     class(s) <- c("SorlSource", "Source")
     s
 }

 getElem <- function(x) UseMethod("getElem", x)
 getElem.SorlSource <- function(x) {
     list(content = x$Content[[x$Position]], uri = match.call()$x)
 }

 eoi <- function(x) UseMethod("eoi", x)
 eoi.SorlSource <- function(x) length(x$Content) <= x$Position


 Hope this helps



[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] What is the function for smoothing splines with the smoothing parameter selected by generalized maximum likelihood?

2012-01-09 Thread Liaw, Andy
See the gss package on CRAN.

Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of ali_protocol
 Sent: Monday, January 09, 2012 7:13 AM
 To: r-help@r-project.org
 Subject: [R] What is the function for smoothing splines with 
 the smoothing parameter selected by generalized maximum likelihood?
 
  Dear all, 
  
  I am new to R and I am a biotechnologist.  I want to fit a smoothing spline
  with the smoothing parameter selected by generalized maximum likelihood.
  I was wondering which function implements this and, if possible, how I can
  find the fitted value at a certain point (or predict from the fitted
  spline, if that is the correct language).
 
 --
 View this message in context: 
 http://r.789695.n4.nabble.com/What-is-the-function-for-smoothi
ng-splines-with-the-smoothing-parameter-selected-by-generalized- 
maxi-tp4278275p4278275.html
 Sent from the R help mailing list archive at Nabble.com.
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables

2011-12-05 Thread Liaw, Andy
Tree-based models (such as RF) are invariant to monotonic transformations of the 
predictor (x) variables, because they only use the ranks of the variables, not 
their actual values.  More specifically, they look for splits that are at the 
mid-points of unique values.  Thus the resulting trees are basically identical 
regardless of how you transform the x variables.
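A quick way to see the invariance, under the assumption that setting the same random seed gives the same bootstrap samples (simulated data, purely illustrative):

```r
library(randomForest)

set.seed(42)
x <- runif(200, 1, 10)
y <- sin(x) + rnorm(200, sd = 0.1)

# same seed before each fit, so the trees see identical bootstrap samples
set.seed(1); rf.raw <- randomForest(y ~ x,    data = data.frame(x = x,         y = y))
set.seed(1); rf.log <- randomForest(y ~ logx, data = data.frame(logx = log(x), y = y))

# log() is monotone, so every split partitions the training data identically
# and the (OOB) fitted values should agree
all.equal(predict(rf.raw), predict(rf.log))
```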

Of course, the one, probably minor, difference is that, e.g., mid-points can be 
different between the original and transformed data.  While this doesn't impact 
the training data, it can impact prediction on test data (although the 
difference should be slight).

Transformation of the response variable is quite another thing.  RF needs it 
just as much as others if the situation calls for it.

Cheers,
Andy
 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of gianni lavaredo
 Sent: Monday, December 05, 2011 1:41 PM
 To: r-help@r-project.org
 Subject: [R] explanation why RandomForest don't require a 
 transformations (e.g. logarithmic) of variables
 
 Dear Researches,
 
  sorry for the easy and common question.  I am trying to justify the idea
  that RandomForest doesn't require transformations (e.g. logarithmic) of
  variables, comparing this non-parametric method with e.g. linear
  regression.  In the literature, to study my phenomenon I need to apply a
  logarithmic transformation to describe my model, but I found RF doesn't
  require this approach.  Could someone suggest texts or a bibliography
  to study?
 
 thanks in advance
 
 Gianni
 
   [[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
Notice:  This e-mail message, together with any attachme...{{dropped:11}}

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] explanation why RandomForest don't require a transformations (e.g. logarithmic) of variables

2011-12-05 Thread Liaw, Andy
You should see no differences beyond what you'd get by running RF a second time 
with a different random number seed.

Best,
Andy


From: gianni lavaredo [mailto:gianni.lavar...@gmail.com]
Sent: Monday, December 05, 2011 2:19 PM
To: Liaw, Andy
Cc: r-help@r-project.org
Subject: Re: [R] explanation why RandomForest don't require a transformations 
(e.g. logarithmic) of variables

About the statement "because they only use the ranks of the variables": using a 
leave-one-out approach, in each iteration the predictor variable ranks change 
slightly every time RF builds the model, especially for the variables with low 
importance. Is it correct to attribute this to the random splitting?

Thanks in advance
Gianni


On Mon, Dec 5, 2011 at 7:59 PM, Liaw, Andy andy_l...@merck.com wrote:
Tree based models (such as RF) are invariant to monotonic transformations of the 
predictor (x) variables, because they only use the ranks of the variables, not 
their actual values.  More specifically, they look for splits that are at the 
mid-points of unique values.  Thus the resulting trees are basically identical 
regardless of how you transform the x variables.

Of course, there are minor differences: e.g., mid-points can be 
different between the original and transformed data.  While this doesn't impact 
the training data, it can impact the prediction on test data (although the 
difference should be slight).
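This invariance is easy to check empirically. The sketch below (toy data, added for illustration; not from the original thread) fits the same forest on a raw and a log-transformed predictor with a fixed seed, so both forests draw identical bootstrap samples, and compares the final OOB MSEs:

```r
library(randomForest)

# Toy regression data with a skewed predictor
set.seed(1)
d <- data.frame(x = rexp(200, rate = 0.2))
d$y <- sin(d$x / 5) + rnorm(200, sd = 0.1)

# Same seed before each fit, so both forests see the same bootstraps
set.seed(42)
rf.raw <- randomForest(y ~ x, data = d)
set.seed(42)
rf.log <- randomForest(y ~ x, data = transform(d, x = log(x)))

# OOB MSEs should agree, up to the slight mid-point effect noted above,
# because log() preserves the ranks of x and hence every split
c(raw = tail(rf.raw$mse, 1), log = tail(rf.log$mse, 1))
```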

Transformation of the response variable is quite another thing.  RF needs it 
just as much as others if the situation calls for it.

Cheers,
Andy


 -Original Message-
 From: r-help-boun...@r-project.org
 [mailto:r-help-boun...@r-project.org] On 
 Behalf Of gianni lavaredo
 Sent: Monday, December 05, 2011 1:41 PM
 To: r-help@r-project.org
 Subject: [R] explanation why RandomForest don't require a
 transformations (e.g. logarithmic) of variables

 Dear Researchers,

 Sorry for the easy and common question. I am trying to
 justify the idea that
 RandomForest doesn't require a transformation (e.g. logarithmic) of
 variables, comparing this non-parametric method with e.g. linear
 regression. In the literature on my phenomenon I need to apply a
 logarithmic transformation to describe my model, but I found RF doesn't
 require this approach. Could someone suggest a text or
 bibliography
 to study?

 thanks in advance

 Gianni



Re: [R] Random Forests in R

2011-12-01 Thread Liaw, Andy
The first version of the package was created by re-writing the main program in 
the original Fortran as C, and calls other Fortran subroutines that were mostly 
untouched, so dynamic memory allocation can be done.  Later versions have most 
of the Fortran code translated/re-written in C.  Currently the only Fortran 
part is the node splitting in classification trees.

Andy

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Peter Langfelder
 Sent: Thursday, December 01, 2011 12:33 AM
 To: Axel Urbiz
 Cc: R-help@r-project.org
 Subject: Re: [R] Random Forests in R
 
 On Wed, Nov 30, 2011 at 7:48 PM, Axel Urbiz 
 axel.ur...@gmail.com wrote:
  I understand the original implementation of Random Forest 
 was done in
 Fortran code. In the source files of the R implementation 
 there is a note:
 "C wrapper for random forests: get input from R and drive 
 the Fortran
 routines." I'm far from an expert on this... does that mean that the
 implementation in R is through calls to C functions only 
 (not Fortran)?
 
  So, would knowing C be enough to understand this code, or 
 Fortran is also
  necessary?
 
 I haven't seen the C and Fortran code for Random Forest but I
 understand the note to say that R code calls some C functions that
 pre-process (possibly re-format etc) the data, then call the actual
 Random Forest method that's written in Fortran, then possibly
 post-process the output and return it to R. It would imply that to
 understand the actual Random Forest code, you will have to read the
 Fortran source code.
 
 Best,
 
 Peter
 


Re: [R] Question about randomForest

2011-11-28 Thread Liaw, Andy
Not only that, but in the same help page, same Value section, it says:

predicted   the predicted values of the input data based on out-of-bag 
samples
 
so people really should read the help pages instead of speculating...

If the error rates were not based on OOB samples, they would drop to (near) 0 
rather quickly, as each tree is intentionally overfitting its training set.
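A small illustration of this point (an editorial sketch on iris, not from the thread): the stored `predicted` component and `err.rate` are out-of-bag, while re-predicting the training data resubstitutes it through every tree and looks almost perfect:

```r
library(randomForest)
data(iris)

set.seed(7)
rf <- randomForest(Species ~ ., data = iris)

# OOB error: each row is predicted only by trees that did NOT see it
oob.err <- mean(rf$predicted != iris$Species)

# Resubstitution error: every tree also votes on rows it was trained on,
# so this is optimistically close to zero
resub.err <- mean(predict(rf, iris) != iris$Species)

c(OOB = oob.err, resubstitution = resub.err)
```

This is why `predict.randomForest` on the training set does not reproduce the stored OOB predictions.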

Andy
 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Weidong Gu
 Sent: Sunday, November 27, 2011 10:56 AM
 To: Matthew Francis
 Cc: r-help@r-project.org
 Subject: Re: [R] Question about randomForest
 
 Matthew,
 
 Your interpretation of calculating error rates based on the training
 data is incorrect.
 
 In Andy Liaw's help file: "err.rate: (classification only) vector of
 error rates of the prediction on the input data, the i-th element
 being the (OOB) error rate for all trees up to the i-th."
 
 My understanding is that the error rate is calculated by throwing the
 OOB cases (after a few trees, all cases in the original data would
 serve as OOB for some trees) to all the trees up to the i-th for which
 they are OOB and taking the majority vote. The plot of a rf object
 shows that OOB error declines quickly as the ensemble becomes
 sizable: the increased variation among trees works! (If the errors were
 based on the training sets, you wouldn't see such a drop, since each
 tree overfits its training set.)
 
 Weidong
 
 
 On Sun, Nov 27, 2011 at 3:21 AM, Matthew Francis
 mattjamesfran...@gmail.com wrote:
  Thanks for the help. Let me explain in more detail how I think that
  randomForest works so that you (or others) can more easily see the
  error of my ways.
 
  The function first takes a random sample of the data, of the size
  specified by the sampsize argument. With this it fully grows a tree
  resulting in a horribly over-fitted classifier for the 
 random sub-set.
  It then repeats this again with a different sample to generate the
  next tree and so on.
 
  Now, my understanding is that after each tree is constructed, a test
  prediction for the *whole* training data set is made by 
 combining the
  results of all trees (so e.g. for classification the 
 majority votes of
  all individual tree predictions). From this an error rate is
  determined (applicable to the ensemble applied to the training data)
  and reported in the err.rate member of the returned randomForest
  object. If you look at the error rate (or plot it using the default
  plot method) you see that it starts out very high when only 
 1 or a few
  over-fitted trees are contributing, but once the forest gets larger
  the error rate drops since the ensemble is doing its job. It doesn't
  make sense to me that this error rate is for a sub-set of the data,
  since the sub-set in question changes at each step (i.e. at 
 each tree
  construction)?
 
  By doing cross-validation test making 'training' and 'test' 
 sets from
  the data I have, I do find that I get error rates on the test sets
  comparable to the error rate that is obtained from the prediction
  member of the returned randomForest object. So that does seem to be
  the 'correct' error.
 
  By my understanding the error reported for the ith tree is that
  obtained using all trees up to and including the ith tree to make an
  ensemble prediction. Therefore the final error reported 
 should be the
  same as that obtained using the predict.randomForest function on the
  training set, because by my understanding that should return an
  identical result to that used to generate the error rate 
 for the final
  tree constructed??
 
  Sorry that is a bit long winded, but I hope someone can point out
  where I'm going wrong and set me straight.
 
  Thanks!
 
  On Sun, Nov 27, 2011 at 11:44 AM, Weidong Gu 
 anopheles...@gmail.com wrote:
  Hi Matthew,
 
  The error rate reported by randomForest is the prediction 
 error based
  on out-of-bag OOB data. Therefore, it is different from prediction
  error on the original data  since each tree was built 
 using bootstrap
  samples (about 70% of the original data), and the error 
 rate of OOB is
  likely higher than the prediction error of the original data as you
  observed.
 
  Weidong
 
  On Sat, Nov 26, 2011 at 3:02 PM, Matthew Francis
  mattjamesfran...@gmail.com wrote:
  I've been using the R package randomForest but there is 
 an aspect I
  cannot work out the meaning of. After calling the randomForest
  function, the returned object contains an element called 
 prediction,
  which is the prediction obtained using all the trees (at 
 least that's
  my understanding). I've checked that this prediction set 
 has the error
  rate as reported by err.rate.
 
  However, if I send the training data back into the the
  predict.randomForest function I find I get a different 
 result to the
  stored set of predictions. This is true for both 
 classification and
  regression. I find the predictions obtained

Re: [R] tuning random forest. An unexpected result

2011-11-23 Thread Liaw, Andy
Gianni,

You should not tune ntree in cross-validation or other validation methods, 
and especially should not be using OOB MSE to do so.

1. At ntree=1, you are using only about 36% of the data to assess the 
performance of a single random tree.  This number can vary wildly.  I'd say 
don't bother looking at OOB measures of anything with ntree < 30.  If you want 
an exercise in probability, compute the number of trees you need to have the 
desired probability that all n data points are out-of-bag at least k times, and 
don't look at ntree < k.

2. If you just plot the randomForest object using the generic plot() function, 
you will see that it gives you the vector of MSEs for ntree=1 to the max.  
That's why you need not use other methods such as cross-validation.

3. As mentioned in the article you cited, RF is insensitive to ntree, and they 
settled on ntree=250.  Also, as we mentioned in the R News article, too many 
trees do not degrade prediction performance, only computational cost (which 
is trivial even for moderate sizes of data set).

4. It is not wise to optimize parameters of a model like that.  When all of 
the MSE estimates are within a few percent of each other, you're likely just 
chasing noise in the evaluation process.
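Point 2 in practice (a sketch on the same airquality data as the original post, added for illustration): one fitted forest already contains the OOB MSE for every number of trees, so there is no need to loop over ntree values:

```r
library(randomForest)

data(airquality)
aq <- na.omit(airquality)

set.seed(1)
rf <- randomForest(Ozone ~ ., data = aq, ntree = 500)

# rf$mse[k] is the OOB MSE of the ensemble made of the first k trees
plot(rf)            # MSE vs. number of trees, from a single fit
which.min(rf$mse)   # very noisy for small k, as point 1 warns; the curve
                    # flattens rather than having a meaningful minimum
```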

Just my $0.02...

Best,
Andy


 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of gianni lavaredo
 Sent: Thursday, November 17, 2011 6:29 PM
 To: r-help@r-project.org
 Subject: [R] tuning random forest. An unexpected result
 
 Dear Researchers,
 
 I am using RF (in regression mode) to analyze several metrics 
 extracted from
 images. I am tuning RF with a loop over different ranges of 
 mtry, tree
 and nodesize, selecting the lowest value of OOB MSE:
 
 mtry from 1 to 5
 nodesize from 1 to 10
 tree from 1 to 500
 
 using this paper as a reference:
 
 Palmer, D. S., O'Boyle, N. M., Glen, R. C.,  Mitchell, J. B. 
 O. (2007).
 Random Forest Models To Predict Aqueous Solubility. Journal 
 of Chemical
 Information and Modeling, 47, 150-158.
 
 my problem is the following, using data(airquality):
 
 the tuning parameters with the lowest value are:
 
  print(result.mtry.df[result.mtry.df$RMSE == 
 min(result.mtry.df$RMSE),])
 *RMSE  = 15.44751
 MSE = 238.6257
 mtry = 3
 nodesize = 5
 tree = 35*
 
 the number of trees is very low, different from what I 
 read in several
 publications.
 
 And the second-lowest value is a tuning setting with *tree = 1*
 
 print(head(result.mtry.df[
 with(result.mtry.df, order(MSE)), ]))
         RMSE      MSE mtry nodesize tree
 12035 15.44751 238.6257    3        5   35
 18001 15.44861 238.6595    4        7    1
 7018  16.02354 256.7539    2        5   18
 20031 16.02536 256.8121    5        1   31
 11037 16.02862 256.9165    3        3   37
 11612 16.05162 257.6544    3        4  112
 
 I am wondering if I am wrong in the settings or there are some 
 aspects I don't
 consider.
 Thanks for your attention, and thanks in advance for suggestions and help.
 
 Gianni
 


Re: [R] arima.sim: innov querry

2011-11-22 Thread Andy Bunn
 On 22/11/11 13:04, Andy Bunn wrote:
  Apologies for thickness - I'm sure that this operates as documented
 and with good reason. However...
 
  My understanding of arima.sim() is obviously imperfect. In the
 example below I assume that x1 and x2 are similar white noise processes
 with a mean of 5 and a standard deviation of 1. I thought x3 should be
 an AR1 process but still have a mean of 5 and a sd of 1. Why does x3
 have a mean of ~7? Obviously I'm missing something fundamental about
 the burn in or the innovations.
 
  x1 <- rnorm(1e3, mean=5, sd=1)
  summary(x1)
  x2 <- arima.sim(list(order=c(0,0,0)), n=1e3, mean=5, sd=1)
  summary(x2)
  x3 <- arima.sim(list(order=c(1,0,0), ar=0.3), n=1e3, mean=5, sd=1)
  summary(x3) # why does x3 have a mean of ~7?
 
  X_t = 0.3 * X_{t-1} + E_t
 
 where E_t ~ N(5,1).
 
 So E(X_t) = 0.3*E(X_{t-1}) + E(E_t), i.e
 
  mu = 0.3*mu + 5, whence
 
  mu = 5/0.7 = 7.1429 approx. = 7
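A quick numerical check of the derivation above (added for illustration):

```r
set.seed(1)
# mean=5 applies to the innovations, so the process mean is 5 / (1 - 0.3)
x3 <- arima.sim(list(order = c(1, 0, 0), ar = 0.3), n = 1e5, mean = 5, sd = 1)
mean(x3)   # close to 5 / 0.7 = 7.1429

# To get a population mean of 5, shift *after* simulating mean-0 innovations
x3b <- 5 + arima.sim(list(order = c(1, 0, 0), ar = 0.3), n = 1e5, sd = 1)
mean(x3b)  # close to 5
```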

Of course, stupid of me. I should not send r-help requests out at the end of 
the day. But now I do have a more nuanced question. I'm trying to simulate an 
ARMA(1,1) process where the underlying distribution is log normal. At the end 
of the process, can I use arima.sim in conjunction with rlnorm with the 
parameters below? Thanks in advance for advice. -A

mu <- -0.935338
sigma <- 0.4762476
# the dist'n I want but with white noise
x1 <- rlnorm(1e5, meanlog=mu, sdlog=sigma)
# how can I add these arima coefs to a log-normal dist'n yet keep parameters mu
# and sigma?
ar1 <- 0.6621
ma1 <- -0.1473
# This is not it:
x2 <- arima.sim(list(order=c(1,0,1), ar=ar1, ma=ma1),
  n = 1e3, rand.gen=rlnorm, meanlog=mu, sdlog=sigma)
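One possible approach (an editorial sketch, not from the thread, and assuming the ARMA dependence is wanted on the log scale): simulate a Gaussian ARMA(1,1) whose marginal standard deviation equals sigma, then shift by mu and exponentiate, so the margins stay lognormal(mu, sigma):

```r
mu    <- -0.935338
sigma <- 0.4762476
ar1   <- 0.6621
ma1   <- -0.1473

# Innovation sd chosen so the marginal variance of the Gaussian ARMA(1,1)
# equals sigma^2: var(z) = sd.e^2 * (1 + 2*ar*ma + ma^2) / (1 - ar^2)
sd.e <- sigma * sqrt((1 - ar1^2) / (1 + 2 * ar1 * ma1 + ma1^2))

set.seed(1)
z <- arima.sim(list(order = c(1, 0, 1), ar = ar1, ma = ma1),
               n = 1e5, sd = sd.e)
x2 <- exp(mu + z)  # marginally lognormal(mu, sigma), ARMA-dependent on log scale

c(meanlog = mean(log(x2)), sdlog = sd(log(x2)))  # should be near mu and sigma
```

Passing `rand.gen=rlnorm` directly, as in the snippet above it, makes the *innovations* lognormal but does not keep the marginal parameters mu and sigma.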









 
 So all is in harmony.  OMM! :-)
 
  cheers,
 
  Rolf Turner
 
 P. S. If you want the population mean of x3 to be 5, add 5 *after*
 generating
 x3 from innovations with mean 0.
 
  R. T.



[R] arima.sim: innov querry

2011-11-21 Thread Andy Bunn
Apologies for thickness - I'm sure that this operates as documented and with 
good reason. However...

My understanding of arima.sim() is obviously imperfect. In the example below I 
assume that x1 and x2 are similar white noise processes with a mean of 5 and a 
standard deviation of 1. I thought x3 should be an AR1 process but still have a 
mean of 5 and a sd of 1. Why does x3 have a mean of ~7? Obviously I'm missing 
something fundamental about the burn in or the innovations.

x1 <- rnorm(1e3, mean=5, sd=1)
summary(x1)
x2 <- arima.sim(list(order=c(0,0,0)), n=1e3, mean=5, sd=1)
summary(x2)
x3 <- arima.sim(list(order=c(1,0,0), ar=0.3), n=1e3, mean=5, sd=1)
summary(x3) # why does x3 have a mean of ~7?



Re: [R] equal spacing of the polygons in levelplot key (lattice)

2011-11-16 Thread Andy Bunn


 -Original Message-
 From: Dennis Murphy [mailto:djmu...@gmail.com]
 Sent: Tuesday, November 15, 2011 8:54 PM
 To: Andy Bunn
 Cc: r-help@r-project.org
 Subject: Re: [R] equal spacing of the polygons in levelplot key
 (lattice)
 
 Hi:
 
 Does this work?

Thanks Dennis.

This almost works. Is there a way to make the rectangles in the key the same 
size? In this example five rectangles of the same area evenly arrayed? Can the 
key be coerced into being categorical?

The data I want to work with are not spatial but it occurs to me that this is a 
common mapping task (e.g., in this example you might want to label these colors 
'low', 'kind of low', 'medium low', etc. or map land covers or such.) I'll look 
at the sp or raster plotting equivalent. 






 
 # library('lattice')
  levs <- as.vector(quantile(volcano, c(0, 0.1, 0.5, 0.9, 0.99, 1)))
  levelplot(volcano, at = levs,
  colorkey = list(labels = list(at = levs,
 labels = levs)))
 
 HTH,
 Dennis
 
 On Tue, Nov 15, 2011 at 1:12 PM, Andy Bunn andy.b...@wwu.edu wrote:
  Given the example:
  R> (levs <- quantile(volcano, c(0, 0.1, 0.5, 0.9, 0.99, 1)))
    0%  10%  50%  90%  99% 100%
    94  100  124  170  189  195
  R> levelplot(volcano, at=levs)
 
  How can I make the key categorical with the size of the divisions
 equally spaced in the key? E.g., five equal size rectangles with labels
 at levs c(100,124,170,189,195)?
 
  Apologies if this is obvious.
 
  -A
 
  R version
                 _
   platform       i386-pc-mingw32
   arch           i386
   os             mingw32
   system         i386, mingw32
   status
   major          2
   minor          14.0
   year           2011
   month          10
   day            31
   svn rev        57496
   language       R
   version.string R version 2.14.0 (2011-10-31)
 


Re: [R] equal spacing of the polygons in levelplot key (lattice)

2011-11-16 Thread Andy Bunn
 -Original Message-
 From: Dennis Murphy [mailto:djmu...@gmail.com]
 Sent: Wednesday, November 16, 2011 11:22 AM
 To: Andy Bunn
 Cc: r-help@r-project.org
 Subject: Re: [R] equal spacing of the polygons in levelplot key
 (lattice)
 
 OK, how about this instead?
 
 # library('lattice')
  levs <- as.vector(quantile(volcano, c(0, 0.1, 0.5, 0.9, 0.99, 1)))
  levq <- seq(min(levs), max(levs), length = 6)
  levelplot(volcano, at = levs,
 colorkey = list(at = levq,
   labels = list(at = levq,
 labels = levs)))
 

Whoa. Tricky. That's great. Thanks!

 Dennis
 
 On Wed, Nov 16, 2011 at 10:27 AM, Andy Bunn andy.b...@wwu.edu wrote:
 
 
  -Original Message-
  From: Dennis Murphy [mailto:djmu...@gmail.com]
  Sent: Tuesday, November 15, 2011 8:54 PM
  To: Andy Bunn
  Cc: r-help@r-project.org
  Subject: Re: [R] equal spacing of the polygons in levelplot key
  (lattice)
 
  Hi:
 
  Does this work?
 
  Thanks Dennis.
 
  This almost works. Is there a way to make the rectangles in the key
 the same size? In this example five rectangles of the same area evenly
 arrayed? Can the key be coerced into being categorical?
 
  The data I want to work with are not spatial but it occurs to me that
 this is a common mapping task (e.g., in this example you might want to
 label these colors 'low', 'kind of low', 'medium low', etc. or map land
 covers or such.) I'll look at the sp or raster plotting equivalent.
 
 
 
 
 
 
 
  # library('lattice')
   levs <- as.vector(quantile(volcano, c(0, 0.1, 0.5, 0.9, 0.99, 1)))
   levelplot(volcano, at = levs,
               colorkey = list(labels = list(at = levs,
                                                      labels = levs)))
 
  HTH,
  Dennis
 
  On Tue, Nov 15, 2011 at 1:12 PM, Andy Bunn andy.b...@wwu.edu
 wrote:
   Given the example:
    R> (levs <- quantile(volcano, c(0, 0.1, 0.5, 0.9, 0.99, 1)))
      0%  10%  50%  90%  99% 100%
      94  100  124  170  189  195
    R> levelplot(volcano, at=levs)
  
   How can I make the key categorical with the size of the divisions
  equally spaced in the key? E.g., five equal size rectangles with
 labels
  at levs c(100,124,170,189,195)?
  
   Apologies if this is obvious.
  
   -A
  
   R version
                  _
    platform       i386-pc-mingw32
    arch           i386
    os             mingw32
    system         i386, mingw32
    status
    major          2
    minor          14.0
    year           2011
    month          10
    day            31
    svn rev        57496
    language       R
    version.string R version 2.14.0 (2011-10-31)
  


Re: [R] gsDesign

2011-11-15 Thread Liaw, Andy
Hi Dongli,

Questions about usage of specific contributed packages are best directed toward 
the package maintainer/author first, as they are likely the best sources of 
information, and they don't necessarily subscribe to or keep up with the daily 
deluge of R-help messages.

(In this particular case, I'm quite sure the package maintainer for gsDesign 
doesn't keep up with R-help.)

Best,
Andy
 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Dongli Zhou
 Sent: Monday, November 14, 2011 6:13 PM
 To: Marc Schwartz
 Cc: r-help@r-project.org
 Subject: Re: [R] gsDesign
 
 Hi, Marc,
 
 Thank you very much for the reply. I'm using the gsDesign 
 function to create an object of type gsDesign. But the inputs 
 do not include the 'ratio' argument.
 
 Dongli 
 
 On Nov 14, 2011, at 5:50 PM, Marc Schwartz 
 marc_schwa...@me.com wrote:
 
  On Nov 14, 2011, at 4:11 PM, Dongli Zhou wrote:
  
  I'm trying to use gsDesign for a noninferiority trial with binary
  endpoint. Did anyone know how to specify the trial with 
 different sample
  sizes for two treatment groups? Thanks in advance!
  
  
  Hi,
  
  Presuming that you are using the nBinomial() function, see 
 the 'ratio' argument, which defines the desired sample size 
 ratio between the two groups.
  
  See ?nBinomial and the examples there, which does include 
 one using the 'ratio' argument.
  
  HTH,
  
  Marc Schwartz
  
 


[R] equal spacing of the polygons in levelplot key (lattice)

2011-11-15 Thread Andy Bunn
Given the example:
R> (levs <- quantile(volcano, c(0, 0.1, 0.5, 0.9, 0.99, 1)))
   0%  10%  50%  90%  99% 100% 
   94  100  124  170  189  195 
R> levelplot(volcano, at=levs)

How can I make the key categorical with the size of the divisions equally 
spaced in the key? E.g., five equal size rectangles with labels at levs 
c(100,124,170,189,195)? 

Apologies if this is obvious. 

-A

R version
_
 platform   i386-pc-mingw32  
 arch   i386 
 os mingw32  
 system i386, mingw32
 status  
 major  2
 minor  14.0 
 year   2011 
 month  10   
 day31   
 svn rev57496
 language   R
 version.string R version 2.14.0 (2011-10-31)



Re: [R] randomForest - NaN in %IncMSE

2011-09-23 Thread Liaw, Andy
You are not giving anyone much to go on.  Please read the posting guide and see 
how to ask your question in a way that's easier for others to answer.  At the 
_very_ least, show what commands you used, what your data looks like, etc.

Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Katharine Miller
 Sent: Tuesday, September 20, 2011 1:43 PM
 To: r-help@r-project.org
 Subject: [R] randomForest - NaN in %IncMSE
 
 Hi
 
 I am having a problem using varImpPlot in randomForest.  I 
 get the error
 message Error in plot.window(xlim = xlim, ylim = ylim, log = 
 ) :   need
 finite 'xlim' values
 
 When print $importance, several variables have NaN under 
 %IncMSE.   There
 are no NaNs in the original data.  Can someone help me figure 
 out what is
 happening here?
 
 Thanks!
 


Re: [R] class weights with Random Forest

2011-09-13 Thread Liaw, Andy
The current classwt option in the randomForest package has been there since 
the beginning, and is different from how the official Fortran code (version 4 
and later) implements class weights.  It simply accounts for the class weights 
in the Gini index calculation when splitting nodes, exactly as how a single 
CART tree is done when given class weights.  Prof. Breiman came up with the 
newer class weighting scheme implemented in the newer version of his Fortran 
code after we found that simply using the weights in the Gini index didn't seem 
to help much in extremely unbalanced data (say 1:100 or worse).  If using 
weighted Gini helps in your situation, by all means do it.  I can only say that 
in the past it didn't give us the result we were expecting.
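For comparison, the balanced-sampling alternative that the thread contrasts with `classwt` looks like this (an editorial sketch on synthetic data; the weight and sample-size values are illustrative only):

```r
library(randomForest)

# Synthetic unbalanced two-class problem (roughly 12:1)
set.seed(1)
n <- 1000
x <- data.frame(a = rnorm(n), b = rnorm(n))
y <- factor(ifelse(x$a + rnorm(n) > 2, "rare", "common"))
table(y)

# classwt: the weights enter the Gini calculation at each split
rf.w <- randomForest(x, y, classwt = c(common = 1, rare = 50))

# strata/sampsize: draw a balanced bootstrap sample for every tree,
# often more effective for severe (e.g. 1:100) imbalance
k <- min(table(y))
rf.s <- randomForest(x, y, strata = y, sampsize = c(k, k))

# Per-class OOB error rates under the two schemes
rbind(classwt  = rf.w$confusion[, "class.error"],
      sampsize = rf.s$confusion[, "class.error"])
```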

Best,
Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of James Long
 Sent: Tuesday, September 13, 2011 2:10 AM
 To: r-help@r-project.org
 Subject: [R] class weights with Random Forest
 
 Hi All,
 
 I am looking for a reference that explains how the 
 randomForest function in
 the randomForest package uses the classwt parameter. Here:
 
 http://tolstoy.newcastle.edu.au/R/e4/help/08/05/12088.html
 
 Andy Liaw suggests not using classwt. And according to:
 
 http://r.789695.n4.nabble.com/R-help-with-RandomForest-classwt
 -option-td817149.html
 
 it has not been implemented as of 2007. However it improved 
 classification
 performance for a problem I am working on, more than 
 adjusting the sampsize
 parameter. So I'm wondering if it has been implemented 
 recently (since 2007)
 or if there is a detailed explanation of what this 
 unimplemented version is
 doing.
 
 Thanks!
 James
 


Re: [R] randomForest memory footprint

2011-09-08 Thread Liaw, Andy
It looks like you are building a regression model.  With such a large number of 
rows, you should try to limit the size of the trees by setting nodesize to 
something larger than the default (5).  The issue, I suspect, is the fact that 
the largest possible tree has about 2*(n/nodesize) nodes, and each node 
takes a row in a matrix to store.  Multiply that by the number of trees you are 
trying to build, and you see how the memory can be gobbled up quickly.  Boosted 
trees don't usually run into this problem because one usually boosts very small 
trees (usually no more than 10 terminal nodes per tree).
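In sketch form (synthetic stand-in data, added for illustration; the exact parameter values are assumptions, not recommendations): raising `nodesize`, or capping `maxnodes`, shrinks the per-tree storage that was exhausting the allocation:

```r
library(randomForest)

# Stand-in for the 500k x 7 regression problem (smaller so it runs anywhere)
set.seed(1)
n <- 2e4
d <- data.frame(matrix(rnorm(n * 7), ncol = 7))
y <- rnorm(n)

# Default nodesize = 5 allows trees with roughly 2 * n / 5 nodes each;
# nodesize = 100 cuts that by a factor of ~20, and maxnodes caps it outright
rf <- randomForest(x = d, y = y, ntree = 20, nodesize = 100, maxnodes = 2000)
print(rf)
```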

Best,
Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of John Foreman
 Sent: Wednesday, September 07, 2011 2:46 PM
 To: r-help@r-project.org
 Subject: [R] randomForest memory footprint
 
 Hello, I am attempting to train a random forest model using the
 randomForest package on 500,000 rows and 8 columns (7 predictors, 1
 response). The data set is the first block of data from the UCI
 Machine Learning Repo dataset Record Linkage Comparison Patterns
 with the slight modification that I dropped two columns with lots of
 NA's and I used knn imputation to fill in other gaps.
 
 When I load in my dataset, R uses no more than 100 megs of RAM. I'm
 running a 64-bit R with ~4 gigs of RAM available. When I execute the
 randomForest() function, however I get memory complaints. Example:
 
  summary(mydata1.clean[,3:10])
   cmp_fname_c1 cmp_lname_c1   cmp_sex   cmp_bd
   cmp_bm   cmp_by  cmp_plz is_match
  Min.   :0.   Min.   :0.   Min.   :0.   Min.   :0.
 Min.   :0.   Min.   :0.   Min.   :0.0   FALSE:572820
  1st Qu.:0.2857   1st Qu.:0.1000   1st Qu.:1.   1st Qu.:0.
 1st Qu.:0.   1st Qu.:0.   1st Qu.:0.0   TRUE :  2093
  Median :1.   Median :0.1818   Median :1.   Median :0.
 Median :0.   Median :0.   Median :0.0
  Mean   :0.7127   Mean   :0.3156   Mean   :0.9551   Mean   :0.2247
 Mean   :0.4886   Mean   :0.2226   Mean   :0.00549
  3rd Qu.:1.   3rd Qu.:0.4286   3rd Qu.:1.   3rd Qu.:0.
 3rd Qu.:1.   3rd Qu.:0.   3rd Qu.:0.0
  Max.   :1.   Max.   :1.   Max.   :1.   Max.   :1.
 Max.   :1.   Max.   :1.   Max.   :1.0
 mydata1.rf.model2 <- randomForest(x = 
 mydata1.clean[,3:9], y=mydata1.clean[,10], ntree=100)
 Error: cannot allocate vector of size 877.2 Mb
 In addition: Warning messages:
 1: In dim(data) <- dim :
   Reached total allocation of 3992Mb: see help(memory.size)
 2: In dim(data) <- dim :
   Reached total allocation of 3992Mb: see help(memory.size)
 3: In dim(data) <- dim :
   Reached total allocation of 3992Mb: see help(memory.size)
 4: In dim(data) <- dim :
   Reached total allocation of 3992Mb: see help(memory.size)
 
 Other techniques such as boosted trees handle the data size just fine.
 Are there any parameters I can adjust such that I can use a value of
 100 or more for ntree?
 
 Thanks,
 John
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 



Re: [R] randomForest partial dependence plot variable names

2011-08-09 Thread Liaw, Andy
See if the following is close to what you're looking for.  If not, please give 
more detail on what you want to do.

library(randomForest)
data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
ozone.rf <- randomForest(Ozone ~ ., airquality, importance=TRUE)
imp <- importance(ozone.rf)  # get the importance measures
impvar <- rownames(imp)[order(imp[, 1], decreasing=TRUE)]  # get the sorted names
op <- par(mfrow=c(2, 3))
for (i in seq_along(impvar)) {
    partialPlot(ozone.rf, airquality, impvar[i], xlab=impvar[i],
                main=paste("Partial Dependence on", impvar[i]), ylim=c(30, 70))
}
par(op)
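Since the original question was about automating this across many species models, one way to extend the same idea is to write each model's plots to its own file. This is a sketch, not from the original reply; the `models` list and file names are hypothetical:

```r
library(randomForest)

data(airquality)
airquality <- na.omit(airquality)
set.seed(131)
# Hypothetical named list of fitted models, one per species
models <- list(ozone = randomForest(Ozone ~ ., airquality, importance = TRUE))

for (sp in names(models)) {
  rf  <- models[[sp]]
  imp <- importance(rf)
  # rownames(imp) carries the predictor names the importance matrix is linked to
  impvar <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]
  pdf(paste0(sp, "_partial_plots.pdf"))
  for (i in seq_along(impvar)) {
    partialPlot(rf, airquality, impvar[i], xlab = impvar[i],
                main = paste("Partial Dependence on", impvar[i]))
  }
  dev.off()
}
```

Passing `impvar[i]` positionally, as in the help-page example, is deliberate: partialPlot evaluates its x.var argument non-standardly, and an indexed character vector is the documented way to use it inside a loop.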

Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Katharine Miller
 Sent: Thursday, August 04, 2011 4:38 PM
 To: r-help@r-project.org
 Subject: [R] randomForest partial dependence plot variable names
 
 Hello,
 
 I am running randomForest models on a number of species.  I 
 would like to be
 able to automate the printing of dependence plots for the 
 most important
 variables in each model, but I am unable to figure out how to 
 enter the
 variable names into my code.  I had originally thought to 
 extract them from
 the $importance matrix after sorting by metric (e.g. %IncMSE), but the
 importance matrix is n by 2 - containing only the data for each metric
 (%IncMSE and IncNodePurity).  It is clearly linked to the 
 variable names,
 but I am unsure how to extract those names for use in scripting.  Any
 assistance would be greatly appreciated as I am currently typing the
 variable names into each partialPlot call for every model I run, and that
 is taking a LONG time.
 
 Thanks!
 
 



Re: [R] convert a splus randomforest object to R

2011-08-09 Thread Liaw, Andy
You really need to follow the suggestions in the posting guide to get the best 
help from this list.  

Which versions of randomForest are you using in S-PLUS and R?  Which version of 
R are you using?  When you restore the object into R, what does str(object) 
say?  Have you also tried dump()/source() as the R Data Import/Export manual 
suggests?
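For reference, the dump()/source() route mentioned above looks roughly like this. This is a sketch only, using the object and file names from the original post; whether the restored object is usable by predict() still depends on the two packages' forest structures matching:

```r
## In S-PLUS (syntax per the S-PLUS manuals; not testable here):
# dump("cost.rf", fileout = "cost.rf.q")

## In R:
# library(randomForest)
# source("cost.rf.q")   # recreates cost.rf in the workspace
# str(cost.rf$forest)   # check whether the forest component has the
#                       # list structure R's predict.randomForest expects
```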

Andy 

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Zhiming Ni
 Sent: Tuesday, August 02, 2011 8:11 PM
 To: r-help@r-project.org
 Subject: [R] convert a splus randomforest object to R
 
 Hi,
  
 I have a randomforest object cost.rf that was created in splus 8.0,
 now I need to use this trained RF model in R. So in Splus, I 
 dump the RF
 file as below
  
  data.dump("cost.rf", file="cost.rf.txt", oldStyle=T) 
 
 then in R, restore the dumped file,
 
 library(foreign)
 
  data.restore("cost.rf.txt")
 
 it works fine and able to restore the cost.rf object. But when I try
 to pass a new data through this randomforest object using predict()
 function, it gives me error message.
 
 in R:
 
 library(randomForest)
 set.seed(2211)
 
  pred <- predict(cost.rf, InputData[ , ])
 
 Error in object$forest$cutoff : $ operator is invalid for 
 atomic vectors
 
 
 Looks like after restoring the dump file, the object is not compatible
 in R. Have anyone successfully converted a splus randomforest 
 object to
 R? what will be the appropriate method to do this?
 
 Thanks in advance.
 
 Jimmy
 
 
 



Re: [R] squared pie chart - is there such a thing?

2011-07-25 Thread Liaw, Andy
Has anyone suggested mosaic displays?  That's the closest I can think of as a 
square pie chart... 
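A minimal base-R illustration of the mosaic idea, using the x=50/y=30/z=20 example from further down the thread (this sketch is mine, not from the original reply; a one-way table gives proportional rectangular slices of the square):

```r
# Three components summing to 100, drawn as proportionally sized tiles
mytotal <- c(x = 50, y = 30, z = 20)
mosaicplot(as.table(mytotal), main = "Square 'pie'",
           color = c("steelblue", "orange", "grey70"))
```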

 -Original Message-
 From: r-help-boun...@r-project.org 
 [mailto:r-help-boun...@r-project.org] On Behalf Of Naomi Robbins
 Sent: Sunday, July 24, 2011 7:09 AM
 To: Thomas Levine
 Cc: r-help@r-project.org
 Subject: Re: [R] squared pie chart - is there such a thing?
 
 I don't usually use stacked bar charts since it is difficult 
 to compare 
 lengths that don't have
 a common baseline.
 
 Naomi
 
 On 7/23/2011 11:14 PM, Thomas Levine wrote:
  How about just a stacked bar plot?
 
  barplot(matrix(c(3,5,3),3,1),horiz=T,beside=F)
 
  Tom
 
   On Fri, Jul 22, 2011 at 7:14 AM, Naomi Robbins nbrgra...@optonline.net wrote:
  Hello!
   It's a shot in the dark, but I'll try. If one has a total of 100
  (e.g., %), and three components of the total, e.g.,
  mytotal=data.frame(x=50,y=30,z=20), - one could build a 
 pie chart with
  3 sectors representing x, y, and z according to their 
 proportions in
  the total.
  I am wondering if it's possible to build something very 
 similar, but
  not on a circle but in a square - such that the total area of the
  square is the sum of the components and the components (x, 
 y, and z)
  are represented on a square as shapes with right angles (squares,
  rectangles, L-shapes, etc.). I realize there are many possible
  positions and shapes - even for 3 components. But I don't 
 really care
  where components are located within the square - as long 
 as they are
  there.
 
  Is there a package that could do something like that?
  Thanks a lot!
 
  -
 
  I included waffle charts in Creating More Effective Graphs.
  The reaction was very negative; many readers let me know
  that they didn't like them. To create them I just drew a table
  in Word with 10 rows and 10 columns. Then I shaded the
  backgrounds of cells so for your example we would shade
  50 cells one color, 30 another, and 20 a third color.
 
  Naomi
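The 10-by-10 shaded-table approach described above can also be drawn directly in base R. A sketch (my addition, not from the original message; the colors are arbitrary):

```r
# 100 cells filled column by column: 50 of one color, 30 of a second, 20 of a third
vals <- c(x = 50, y = 30, z = 20)
grid <- matrix(rep(seq_along(vals), vals), nrow = 10)
image(1:10, 1:10, grid, col = c("steelblue", "orange", "grey70"),
      axes = FALSE, xlab = "", ylab = "")
```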
 
  -
 
 
  Naomi B. Robbins
  11 Christine Court
  Wayne, NJ 07470
  973-694-6009
 
   na...@nbr-graphs.com
 
  http://www.nbr-graphs.com
 
  Author of Creating More Effective Graphs
  http://www.nbr-graphs.com/bookframe.html
 
  //
 
 
 
  [[alternative HTML version deleted]]
 
  __
  R-help@r-project.org mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help
  PLEASE do read the posting guide 
 http://www.R-project.org/posting-guide.html
  and provide commented, minimal, self-contained, reproducible code.
 
 
 
 
 
 



Re: [R] Bounding ellipse for any set of points

2011-07-21 Thread Andy Lyons

The mvee() function is intended to be released under the BSD license.

Copyright (c) 2009, Nima Moshtagh
Copyright (c) 2011, Andy Lyons
All rights reserved.
http://www.opensource.org/licenses/bsd-license.php

Redistribution and use in source and binary forms, with or without 
modification, are permitted provided that the following conditions are met:


Redistributions of source code must retain the above copyright notice, 
this list of conditions and the following disclaimer.
Redistributions in binary form must reproduce the above copyright 
notice, this list of conditions and the following disclaimer in the 
documentation and/or other materials provided with the distribution.


THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS AS IS 
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE 
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE 
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE 
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR 
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF 
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS 
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN 
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) 
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 
POSSIBILITY OF SUCH DAMAGE.




Date: Thu, 24 Mar 2011 17:23:00 -0700
From: Andy Lyons ajly...@berkeley.edu
To: r-help@r-project.org
Subject: [R] Bounding ellipse for any set of points
Message-ID: 6.2.1.2.2.20110324051124.117a0...@calmail.berkeley.edu
Content-Type: text/plain; charset=us-ascii


   After a lot of effort I developed the following function to compute the
   bounding ellipse (also known as minimum volume enclosing ellipsoid) for
   any set of points. This script is limited to two dimensions, but I believe
   with minor modification the algorithm should work for 3 or more dimensions.

   It seems to work well (although I don't know if it can be optimized to run
   faster) and hope it may be useful to someone else. Andy

##
## mvee()
## Uses the Khachiyan Algorithm to find the minimum volume enclosing
## ellipsoid (MVEE) of a set of points. In two dimensions, this is just
## the bounding ellipse (this function is limited to two dimensions).
## Adapted by Andy Lyons from Matlab code by Nima Moshtagh.
## Copyright (c) 2009, Nima Moshtagh
## [1] http://www.mathworks.com/matlabcentral/fileexchange/9542
## [2] http://www.mathworks.com/matlabcentral/fileexchange/13844
## [3] http://stackoverflow.com/questions/1768197/bounding-ellipse
##
## Parameters
## xy         a two-column data frame containing x and y coordinates;
##            if NULL, a random sample set of 10 points will be generated
## tolerance  a tolerance value (default = 0.005)
## plotme     FALSE/TRUE. Plots the points and ellipse. Default TRUE.
## max.iter   the maximum number of iterations. If the script tries this
##            number of iterations but still can't get to the tolerance
##            value, it displays an error message and returns NULL
## shiftxy    TRUE/FALSE. If TRUE, applies a shift to the coordinates to
##            make them smaller and speed up the matrix calculations, then
##            reverses the shift on the center point of the resulting
##            ellipsoid. Default TRUE.
## Output: a list containing the center form matrix equation of the
##         ellipse, i.e. a 2x2 matrix A and a 2x1 vector C representing
##         the center of the ellipse such that:
##             (x - C)' A (x - C) = 1
##         Also in the list is a 2x1 vector elps.axes.lngth whose elements
##         are one-half the lengths of the major and minor axes (variables
##         a and b), and alpha, the angle of rotation.
##
mvee <- function(xy = NULL, tolerance = 0.005, plotme = TRUE, max.iter = 500,
                 shiftxy = TRUE) {
    if (is.null(xy)) {
        xy <- data.frame(x = runif(10, 100, 200), y = runif(10, 100, 200))
    } else if (ncol(xy) != 2) {
        warning("xy must be a two-column data frame")
        return(NULL)
    }
    ## Number of points
    n <- nrow(xy)
    ## Dimension of the points (2)
    d <- ncol(xy)
    if (n <= d) return(NULL)
    ## Apply a uniform shift to the xy coordinates to make matrix
    ## calculations computationally simpler (if x and y are very large,
    ## for example UTM coordinates, this may be necessary to prevent a
    ## 'computationally singular matrix' error)
    if (shiftxy) {
        xy.min <- sapply(xy, FUN = min
