[R] glm gamma scale parameter
I would like the option to specify alternative scale parameters when using the gamma family, log-link glm. In particular, I would like the option to specify any of the following:

1. the maximum likelihood estimate
2. the moment estimator (Pearson's)
3. the total deviance estimator

Is this easy? Possible? In addition, I would like to know what estimation process (maximum likelihood?) R uses to estimate the parameter, if somebody knows that off the top of their head or can point me to something to read.

I did read the help and searched the archives, but I'm a bit confused trying to reconcile the terminology I'm used to with R terminology as we're transitioning to R, so if I missed an obvious way to do this, or stated this question in a way that's incomprehensible, my apologies.

Jill Willie
Open Seas, Safeco Insurance
[EMAIL PROTECTED]
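[Editorial note, not part of the original post: a hedged sketch of how the three estimators can be computed from a fitted gamma GLM. The model formula, the data frame name dat, and the variables y and x are placeholders.]

library(MASS)   # for gamma.shape()

# Placeholder model: y and x are assumed columns of a data frame 'dat'
fit <- glm(y ~ x, family = Gamma(link = "log"), data = dat)

# 1. Maximum likelihood estimate of the dispersion (phi = 1/shape)
phi_ml <- 1 / gamma.shape(fit)$alpha

# 2. Moment/Pearson estimator: sum of squared Pearson residuals over residual df
phi_pearson <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual

# 3. Total deviance estimator: residual deviance over residual df
phi_deviance <- fit$deviance / fit$df.residual

# Any of these can be passed to summary() to recompute standard errors and tests
summary(fit, dispersion = phi_pearson)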
Re: [R] glm gamma scale parameter
Thank you. You are correct, the shape parameter is what I need to change, I think; I see how to use the MASS package to do it... or if not, at least I have enough now to figure it out. A question to reconcile terminology, which will speed me up if you have time to help me a bit more: phi = 'scale parameter' vs. 'dispersion parameter' vs. 'shape parameter'? Excerpts below from the R intro manual defining phi and from the Statistics Complements discussion.

R intro: the distribution of y is of the form

  f_Y(y; mu, phi) = exp((A/phi) * (y*lambda(mu) - gamma(lambda(mu))) + tau(y, phi))

where phi is a scale parameter (possibly known) and is constant for all observations, A represents a prior weight, assumed known but possibly varying with the observations, and mu is the mean of y. So it is assumed that the distribution of y is determined by its mean and possibly a scale parameter as well.

Statistics Complements to Modern Applied Statistics with S, Fourth edition, by W. N. Venables and B. D. Ripley (Springer), section 7.6 Gamma models: "The role of the dispersion parameter for the Gamma family is rather different. This is a parametric family which can be fitted by maximum likelihood, including its shape parameter."

Jill Willie
Open Seas, Safeco Insurance
[EMAIL PROTECTED]

-----Original Message-----
From: Prof Brian Ripley [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, February 06, 2007 12:25 PM
To: WILLIE, JILL
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] glm gamma scale parameter

On Tue, 6 Feb 2007, Prof Brian Ripley wrote:

I think you mean 'shape parameter'. If so, see the MASS package and ?gamma.shape. Also http://www.stats.ox.ac.uk/pub/MASS4/#Complements leads to several pages of discussion.

glm() _is_ providing you with the MLE of the scale parameter, but really no estimate of the shape (although summary.glm makes use of one).

On Tue, 6 Feb 2007, WILLIE, JILL wrote:

> [Original question quoted in full in the first message of this thread; trimmed here.]

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road,          +44 1865 272866 (PA)
Oxford OX1 3TG, UK           Fax: +44 1865 272595
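[Editorial note, not part of the original exchange: a small illustration of the terminology on simulated data (all names and values below are made up). MASS estimates the shape alpha by maximum likelihood, and the dispersion phi used by summary.glm is its reciprocal.]

library(MASS)   # gamma.shape(), gamma.dispersion()

set.seed(1)
# Simulated gamma response with a log link; 'shape_true' plays the role of alpha
x <- runif(200)
mu <- exp(1 + 2 * x)
shape_true <- 3
y <- rgamma(200, shape = shape_true, rate = shape_true / mu)

fit <- glm(y ~ x, family = Gamma(link = "log"))

gamma.shape(fit)        # ML estimate of the shape alpha, with its standard error
gamma.dispersion(fit)   # ML estimate of the dispersion phi = 1/alpha
summary(fit, dispersion = gamma.dispersion(fit))   # SEs based on the ML dispersion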
[R] reducing RODBC odbcQuery memory use?
Basic questions:

1. Can I avoid having RODBC use so much memory (35 times the data size or more) when making a data.frame and then an .rda file via sqlQuery/save?
2. If not, is there some more appropriate way from within R to pull large data sets (2-5GB) into .rda files from SQL? [See the batched-fetch sketch appended after this message for one possible approach.]
3. I get an unexpectedly high ratio of virtual memory to memory use (10:1). Can that be avoided?

Testing details (R transcript below), on a 1-CPU, 1GB-RAM Windows machine:

1. Tested the bigger input table (AUTCombinedWA_BILossCost_1per): size is 20MB in SQL, 10 rows, 2 numeric columns, 55 integer columns; it consumes 35kb of memory and 80kb of virtual memory to execute the sqlQuery command. The memory is not released after the step finishes, nor upon execution of odbcCloseAll() or gc().
2. Tested a small input table: size is 2MB in SQL, 1 row, 2 numeric columns, 55 integer columns; it consumes 55000kb of memory and 515000kb of virtual memory (VM seems oddly high to me) to execute the sqlQuery command.
3. Concluded the high memory use is isolated to the odbcQuery step within the sqlQuery function, as opposed to sqlGetResults or ODBC itself.

Relevant R session transcript:

library(RODBC)
channel <- odbcConnect("psmrd")
df_OnePer <- data.frame(sqlQuery(channel, "select * from AUTCombinedWA_BILossCost_1per"))
save(df_OnePer, file = "df_OnePer.rda")

Additional testing details: I exited R, which released all memory cleanly, then started R again and loaded the .rda saved in the prior step as below. This confirmed that relatively little of the memory is consumed going from .rda to data frame, isolating the problem to the RODBC step:

load("df_OnePer.rda")
df <- data.frame(df_OnePer)

I closed R, opened MS Access and used the same DSN (psmrd) to import AUTCombinedWA_BILossCost_1per into MS Access, which required about 3kb of memory and 2kb of virtual.

And finally, I have this excerpt from Prof Brian Ripley that seems potentially relevant (if it's not just confusion because I called them 'byte-size' when really I should have said they're integers, just having values limited to 1-255). In any case, I'm unable to see from the RODBC help how to specify this:

> ...sqlQuery returns a data frame directly. I think you need to tell RODBC to
> translate your 'byte-sized factors' to numeric, as it will be going through
> character if these are a type it does not know about.

I read all the RODBC help, read all of the data import guide and searched the help archives, but can't find an answer. Would appreciate advice, experience, or direction.

Jill Willie
Open Seas, Safeco Insurance
[EMAIL PROTECTED]

-----Original Message-----
From: Prof Brian Ripley [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 25, 2007 12:05 AM
To: WILLIE, JILL
Cc: r-help@stat.math.ethz.ch
Subject: Re: [R] Size of data vs. needed memory...rule of thumb?

On Wed, 24 Jan 2007, WILLIE, JILL wrote:

> I have been searching all day and most of last night, but can't find any
> benchmarking or recommendations regarding R system requirements for very
> large (2-5GB) data sets to help guide our hardware configuration. If anybody
> has experience with this they're willing to share, or could anybody point me
> in a direction that might be productive to research, it would be much
> appreciated. Specifically: will R simply use as much memory as the OS makes
> available to it, unlimited?

Under most OSes. Because Windows has no means to limit the amount made available, R under Windows does have its own limiting mechanism (which you hit in the examples below). R under Linux will allow you to run a 4GB process on a machine with 2GB RAM, but you are likely to get around 0.7% usage.
(One of my colleagues did that on a server earlier this week, hence the very specific answer.)

> Is there a multi-threading version of R, or packages?

Not to run computations in R. Some parts of R (e.g. GUIs) and some libraries (e.g. some BLAS) are multithreaded. There are multi-process packages, e.g. Rmpi, rpvm, snow.

> Does the core R package support 64-bit?

Yes, and has for many years.

> Should I expect to see any difference in how memory's handled under that version?

Yes, because the address space will not get seriously fragmented. See the appropriate section in R-admin.html (referenced from INSTALL).

> Is 3GB of memory to 1GB of data a reasonable ballpark?

I'd say it was a bit low, but it really depends on the analysis you are doing, how the 1GB of data is made up (many rows? many cols? etc.) and so on. Had you asked me to suggest a ratio I would have said 5.

> Our testing thus far has been on a Windows 32-bit box with 1GB of RAM and 1 CPU;
> it appears to indicate something like 3GB of RAM for every 1GB of SQL table
> (ex-indexes, byte-sized factors). At this point, we're planning on setting up a
> dual-core 64-bit Linux box with 16GB of RAM for starters, since we have
> summed-down SQL tables of approx 2-5GB generally. Here's the detail, just for
> context, or in case I'm misinterpreting the results, or in case there's some
> more memory-efficient way to get data in R's binary format than going with the
> data.frame. [Remainder of the quoted post trimmed; it is reproduced in full below.]
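[Editorial note, not part of the original thread: a hedged sketch of one way to reduce the peak footprint, fetching the result set in batches via odbcQuery() followed by repeated sqlGetResults(max = ...) calls. The batch size, the stopping test, and the reuse of the DSN and table names from the post above are assumptions, not something tested against this data.]

library(RODBC)

channel <- odbcConnect("psmrd")                       # DSN name taken from the post above
odbcQuery(channel, "select * from AUTCombinedWA_BILossCost_1per")

chunks <- list()
repeat {
  # Fetch the next 100,000 rows of the pending result set (batch size is arbitrary)
  block <- sqlGetResults(channel, max = 100000)
  if (!is.data.frame(block) || nrow(block) == 0) break
  chunks[[length(chunks) + 1]] <- block
  if (nrow(block) < 100000) break                     # last, short batch fetched
}
df_OnePer <- do.call(rbind, chunks)

save(df_OnePer, file = "df_OnePer.rda")
odbcClose(channel)

[Note that do.call(rbind, ...) still assembles the full data frame in memory, so this mainly helps if each batch is processed or written out before the next one is fetched, rather than accumulated.]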
[R] Size of data vs. needed memory...rule of thumb?
I have been searching all day and most of last night, but can't find any benchmarking or recommendations regarding R system requirements for very large (2-5GB) data sets to help guide our hardware configuration. If anybody has experience with this they're willing to share, or could anybody point me in a direction that might be productive to research, it would be much appreciated. Specifically:

- Will R simply use as much memory as the OS makes available to it, unlimited?
- Is there a multi-threading version of R, or packages?
- Does the core R package support 64-bit, and should I expect to see any difference in how memory's handled under that version?
- Is 3GB of memory to 1GB of data a reasonable ballpark?

Our testing thus far has been on a Windows 32-bit box with 1GB of RAM and 1 CPU; it appears to indicate something like 3GB of RAM for every 1GB of SQL table (ex-indexes, byte-sized factors). At this point, we're planning on setting up a dual-core 64-bit Linux box with 16GB of RAM for starters, since we have summed-down SQL tables of approx 2-5GB generally.

Here's the detail, just for context, or in case I'm misinterpreting the results, or in case there's some more memory-efficient way to get data in R's binary format than going with the data.frame.

R session:

library(RODBC)
channel <- odbcConnect("psmrd")
FivePer <- data.frame(sqlQuery(channel, "select * from AUTCombinedWA_BILossCost_5per"))
Error: cannot allocate vector of size 2000 Kb
In addition: Warning messages:
1: Reached total allocation of 1023Mb: see help(memory.size)
2: Reached total allocation of 1023Mb: see help(memory.size)

ODBC connection:

Microsoft SQL Server ODBC Driver Version 03.86.1830
Data Source Name: psmrd
Server: psmrdcdw01\modeling
Database: OpenSeas_Work1
Language: (Default)
Translate Character Data: Yes
Log Long Running Queries: No
Log Driver Statistics: No
Use Integrated Security: Yes
Use Regional Settings: No
Prepared Statements Option: Drop temporary procedures on disconnect
Use Failover Server: No
Use ANSI Quoted Identifiers: Yes
Use ANSI Nulls, Paddings and Warnings: Yes
Data Encryption: No

Please be patient, I'm a new R user (or at least I'm trying to be... at this point I'm mostly a new R-help reader); I'd appreciate being pointed in the right direction if this isn't the right help list to send this question to, or if this question is poorly worded (I did read the posting guide).

Jill Willie
Open Seas, Safeco Insurance
[EMAIL PROTECTED]
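[Editorial note, not part of the original post: since the warning points at help(memory.size), a short sketch of the Windows-only memory diagnostics that apply here. The 3000 Mb figure is only an example; whether the limit can actually be raised on 32-bit Windows also depends on the OS configuration.]

memory.size()              # Mb currently in use by R
memory.size(max = TRUE)    # maximum Mb obtained from the OS so far in this session
memory.limit()             # current ceiling in Mb
memory.limit(size = 3000)  # attempt to raise the ceiling (example value)

# Rough size of an object already in the workspace, had the query succeeded:
# object.size(FivePer)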
[R] Can we do GLM on 2GB data set with R?
We are wanting to use R instead of, or in addition to, our existing stats package because of its huge assortment of stat functions. But we routinely need to fit GLM models to files that are approximately 2-4GB (as SQL tables, un-indexed, with tinyint-sized fields except for the response and weight variables). Is this feasible, does anybody know, given sufficient hardware, using R? It appears to use a great deal of memory on the small files I've tested. I've read the data import, memory.limit and memory.size general documentation, but can't seem to find a way to tell what the boundaries are or to roughly gauge the needed memory, other than by trial and error. I've started by testing with data.frame and have run out of memory on my PC. I'm new to R, so please be forgiving if this is a poorly-worded question.

Jill Willie
Open Seas, Safeco Insurance
[EMAIL PROTECTED]
206-545-5673
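[Editorial note, not part of the original post: one commonly suggested route when the data will not fit comfortably in memory is the biglm package, whose bigglm() fits GLMs by processing the data in chunks. The formula, variable names, data frame dat, and chunk size below are placeholders, and how well a Gamma fit of this size behaves is untested here.]

library(biglm)

# 'dat' stands in for the full data set; bigglm() also has methods that read
# successive chunks from a database connection rather than a data frame (see ?bigglm)
fit <- bigglm(losscost ~ var1 + var2, data = dat,
              family = Gamma(link = "log"), chunksize = 10000)
summary(fit)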