Hey everyone, I don't have a question. Instead some helpful advice with things I've learned from trying to connect 'R' to databases using RODBC.
ROBDC is a very handy tool that, once you have everything fixed up nicely, is a great way to have scripts run fairly autonomously, safe in the knowledge that data isn't accidentally messed up. But I was fairly ignorant about databases, and I suspect maybe some others out there are as well. 1) Connecting: the function odbcConnect() works very well, presuming you first have things set up properly with your connection to a database. For instance, what is the "data source name", or *dsn*, that odbcConnect() asks for? In my ignorant view, I see this as a "nickname" that you have created on your computer (i.e., *outside* of 'R') to make a connection between your computer & the database you wish to talk to. I have set up my database connections through windows software, so I can only speak intelligently to that. There is a windows program called "ODBC Data Source Administrator" likely hidden somewhere on your computer. If you have Windows version 7, simply search for "ODBC" in that Start menu "search programs & files" box. You should find the ODBC data source administrator. Again, this is specific to my Windows 7 computer, but here's what I then did: click on the User DSN tab. Then click the "add" button. Choose a driver type (e.g., SQL Server)... this is dependent upon the database you are connecting to. In the next window, give it a name & description, then get the server name (the server where the database resides, your IT admin would likely know this). I don't want to go into too many other details on setting this up, beyond saying you will likely need to change the "default database" to the one which you want to access. Finally, the *name* you give this new User DSN is the *dsn* you will use when calling odbcConnect(). Once you've set up this data sourcing junk properly, the odbcConnect() call is trivial. (Famous last words...) 2) Querying: sqlQuery() works in a fairly straight-forward manner, presuming 2 things: (a) you have set up the dsn correctly & connected to it (see above), and (b) you know something of database querying language. I don't want to get bogged down by any details here, but I find this website to have good tutorials (although horrible search capabilities or reference-style documentation): http://sqlzoo.net/ Final point on querying: with an accidentally typo, your query can go from 2,000 rows long to 2,000,000,000 rows long (e.g., if you perform a join incorrectly). So it's sometimes worthwhile to make your first query a "count(*)" query so that you know how many rows of data you'll get back. 3) Populating the database: here is where I mostly wanted to focus. Dr. Ripley has some great documentation on RODBC. (E.g., http://cran.r-project.org/web/packages/RODBC/vignettes/RODBC.pdf has great info beyond what can be found in the help files.) But it seemed there was *very little* info on how to populate a database. If you're anything like me, you are generating useful results (or so I pretend) and you want to share them with your organization. So my details below help to *populate an existing database table*. Most of this is useful for other cases, too. One of the trickiest things in making sure that you populate the database correctly is making sure that you have the right *data type*. Databases are quite picky about data types; they are optimized for fast access & minimal storage, so if they can do something in 2 bytes instead of 16, they will. Therefore, there are a bunch of different data types. Instead of guessing which data types are being used, I simply query the database and grab the data types (this will work even if the database has columns but is empty), then use this information when populating. In short, I do the following: library(RODBC) dbCon <- odbcConnect(db) tmp <- sqlColumns(dbCon, dbTable) ## this function grabs a bunch of info about the columns) varTypes <- as.character(tmp$TYPE_NAME) names(varTypes) <- as.character(tmp$COLUMN_NAME) sqlSave(dbCon, dataSet, dbTable, append=TRUE, rownames=FALSE, varTypes=varTypes) The "secret ingredient" here is varTypes : this is a named character vector of data types, as said in many other threads. But done in this fashion above, you're bound to have the types set up properly. E.g., varTypes might look something like this: >varTypes my_primary_key_id my_foreign_key_id ID "bigint" "bigint" "varchar" barcode lane type "varchar" "int" "varchar" predict_value "float" Another thing I find handy is to pre-generate the primary key in the data table, since this is usually some unique numerical identifier, but otherwise just gibberish. Below I have written a wrapper function that works quite nicely for me, and that I hope others out there might find handy. A HUGE thanks to Dr. Ripley for making an excellent package!! Regards, Mike #################################################################################### #################################################################################### db.populate <- function(dataSet=NULL, dbTable=NULL, primeKey=NULL, db="blah", check.names=TRUE, verbose=FALSE, safer=TRUE, fast=TRUE, test=FALSE, nastring=NULL) { iAm <- "db.populate" if (is.null(dataSet) | is.null(dbTable)) stop(paste(iAm,": Both \"dataSet\" and \"dbtable\" variables must be", " provided.", sep="")) ### connect to the database & query tables. dbCon <- odbcConnect(db) tmp <- sqlColumns(dbCon, dbTable) varTypes <- as.character(tmp$TYPE_NAME) names(varTypes) <- as.character(tmp$COLUMN_NAME) if (!is.null(primeKey)) { myQuery <- paste("Select max(",primeKey,") from ",dbTable) primeKeys <- seq(1,dim(dataSet)[1]) + sqlQuery(dbCon, myQuery)[1,1] dataSet <- cbind(primeKeys,dataSet) ; names(dataSet)[1] <- primeKey } ## end if clause to create prime Key if (check.names) { if (length(setdiff(names(dataSet),names(varTypes)))!=0) { message(paste(iAm,": column names of \"dataSet\" do not match", " those of \"dbTable\", ",dbTable, sep="")) message("\n\tNames of \"dataSet\":") print(names(dataSet)) message(paste("\n\tNames in \"dbTable\",",dbTable,":")) print(names(varTypes)) stop(paste(iAm,": stopped due to this mis-match.", sep="")) } ## end if clause to see if "dataSet" & "dbTable" names match dataSet <- dataSet[,names(varTypes)] } ## end if clause to check names & re-arrange "dataSet" as needed message(paste(iAm, ": populating table ",dbTable, " with \"dataSet\"", sep="")) sqlSave(dbCon, dataSet, dbTable, append=TRUE, rownames=FALSE, verbose=verbose, safer=safer, varTypes=varTypes, fast=fast, test=test, nastring=nastring) odbcClose(dbCon) message(paste(iAm, ": added \"dataSet\" to table ",dbTable, sep="")) if (verbose) { message(paste(iAm, ": ",dim(dataSet)[1]," rows added.", sep="")) if (!is.null(primeKey)) { message(paste(iAm, ": primary key ",primeKey," updated.")) message(paste("\t",primeKey," values from ",primeKeys[1], " to ",primeKeys[length(primeKeys)], "are the newly updated data", sep="")) } } #################################################################################### #################################################################################### Finally, a few words to act as good keys if someone out there does a search for info: R R-help RODBC ODBC database SQL table sqlSave obdcConnect [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.