----- Original Message -----
On Fri, 15 Feb 2008 09:15:32 -0500, "Patrick Dolan" wrote:
> Hi folks-
> I seem to remember people on this list mentioning data sets that
> are publically available. Can anybody point me to some? I'm
> particularly interested in ones that might be related to public health.
> Thanks!
> Patrick

In addition to the suggestions already made, you should take a look at
the StatLib dataset archive at Carnegie Mellon; see
http://lib.stat.cmu.edu/datasets/

Datasets here seem to fall into three categories:

(1) Specialized reference datasets that produce well-known results and
can be used to determine whether a particular statistical package
consistently produces accurate results (e.g., is the new version of
Excel any better than the previous version?). These are provided in
the NIST Statistical Reference Datasets.

(2) Copies of datasets provided with a number of statistics textbooks,
including the statistics book "DATA", a compilation of datasets by
Andrews and Herzberg (A&H provide a description of each dataset along
with the data, which, needless to say, are much better to have in
electronic form).

(3) Miscellaneous datasets covering a variety of topics, including
public health. Some examples include:

Arsenic
This datafile contains measurements of drinking water and toenail
levels of arsenic, as well as related covariates, for 21 individuals
with private wells in N.H. Source: Karagas MR, Morris JS, Weiss JE,
Spate V, Baskett C, Greenberg ER. Toenail Samples as an Indicator of
Drinking Water Arsenic Exposure. Cancer Epidemiology, Biomarkers and
Prevention 1996;5:849-852. (Therese.A.Stukel_AT_Dartmouth.EDU)
(MS Word format) [21/Jul/98] (5 kbytes)

arsenic.zip
This is a zip file containing the arsenic data from Southwestern
Taiwan that were used in the analysis reported in the National Academy
of Sciences reports on arsenic (NRC 1999 and 2001), analyzed by
Morales et al. (2000) and also discussed by Ryan (2004). See the
following for a good summary, plus links to the NAS reports:
http://www4.nas.edu/news.nsf/isbn/0309076293?OpenDocument
Submitted by Louise Ryan ([EMAIL PROTECTED]). [19/Dec/03] (35 kbytes)

backache
This file contains the `backache in pregnancy' data analysed in
Exercise D.2 of Problem-Solving: A Statistician's Guide, 2nd edn., by C.
Chatfield, Chapman and Hall, 1995. ([EMAIL PROTECTED]) [2/Oct/95]
(16 kbytes)

bodyfat
Lists estimates of the percentage of body fat determined by underwater
weighing and various body circumference measurements for 252 men.
Submitted by Roger Johnson ([EMAIL PROTECTED]) [2/Oct/95] (35 kbytes)

detroit
Data on annual homicides in Detroit, 1961-73, from Gunst & Mason's
book `Regression Analysis and its Application', Marcel Dekker.
Contains data on 14 relevant variables collected by J.C. Fisher.
([EMAIL PROTECTED]) [10/Feb/92] (3357 bytes)
(NOTE: some consider homicide a public health issue)

lupus
87 persons with lupus nephritis. Followed up 15+ years. 35 deaths.
Var = duration of disease. Over 40 baseline variables available from
authors. Submitted by Todd Mackenzie ([EMAIL PROTECTED]) (4k)

NLTCS
This data set is an extract from the National Long Term Care Survey
(NLTCS). The 16 binary variables in the extract are functional
disability measures: 6 activities of daily living and 10 instrumental
activities of daily living, pooled over the 1982, 1984, 1989, and 1994
waves of the survey. The Center for Demographic Studies, Duke
University, gave its permission to redistribute the 2^16 extract via
placement on StatLib under the NLTCS Data Use Agreement. If you
download the data, please provide the Center for Demographic Studies,
Duke University, with your name and contact information
(e-mail [EMAIL PROTECTED]). (149k)

Plasma_Retinol
This datafile (N=315) investigates the relationship between personal
characteristics and dietary factors, and plasma concentrations of
retinol, beta-carotene and other carotenoids. Analysis unpublished.
Related paper: Nierenberg DW, Stukel TA, Baron JA, Dain BJ, Greenberg
ER. Determinants of plasma levels of beta-carotene and retinol.
American Journal of Epidemiology 1989;130:511-521.
(Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format)
[21/Jul/98][28/Nov/01] (26 kbytes)

PM10
The data are a subsample of 500 observations from a data set that
originates in a study where air pollution at a road is related to
traffic volume and meteorological variables, collected by the
Norwegian Public Roads Administration. The response variable
(column 1) consists of hourly values of the logarithm of the
concentration of PM10 (particles), measured at Alnabru in Oslo,
Norway, between October 2001 and August 2003. The predictor variables
(columns 2 to 8) are the logarithm of the number of cars per hour,
temperature 2 meters above ground (degrees C), wind speed
(meters/second), the temperature difference between 25 and 2 meters
above ground (degrees C), wind direction (degrees between 0 and 360),
hour of day, and day number from October 1, 2001.
Submitted by Magne Aldrin ([EMAIL PROTECTED]). [28/Jul/04] (19 kbytes)

And given that we're in a presidential election year, political
junkies might have fun with the following:

fl2000.txt
County data from the 2000 Presidential Election in Florida. For each
of the 67 Florida counties, the data include the type of voting
machine used, the number of columns in the presidential ballot, the
undervote, the overvote, and the official certified votes for each of
the twelve presidential candidates. Of particular interest are the
Buchanan vote in Palm Beach county, and the overvote as a function of
voting machine type and number of columns (see Agresti and Presnell,
"Misvotes, Undervotes, and Overvotes: The 2000 Presidential Election
in Florida," Statistical Science, Vol. 17, No. 4, 1-5, 2002).
Submitted by ([EMAIL PROTECTED]). [28/Jan/03] (8.0 kbytes)

At the bottom of the page is a short list of other sites that also
provide data.

-Mike Palij
New York University
[EMAIL PROTECTED]
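
For anyone who wants to try the PM10 file, here is a minimal Python sketch of reading data laid out as that entry describes (column 1 the response, columns 2 to 8 the predictors). Several things here are assumptions rather than facts about the StatLib file: that it is whitespace-delimited with no header row, that "logarithm" means the natural logarithm, and the column labels, which are my own shorthand. The two sample rows are invented purely to exercise the code and are not real observations.

```python
import math

# Hypothetical labels for the eight columns described in the PM10 entry;
# these names are shorthand and do not come from the file itself.
PM10_COLUMNS = [
    "log_pm10", "log_cars_per_hour", "temp_2m_c", "wind_speed_ms",
    "temp_diff_25m_2m_c", "wind_direction_deg", "hour_of_day", "day_number",
]

def parse_pm10(text):
    """Parse whitespace-delimited rows (assumed layout) into dicts of floats."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(PM10_COLUMNS):
            raise ValueError(
                f"expected {len(PM10_COLUMNS)} fields, got {len(fields)}"
            )
        rows.append(dict(zip(PM10_COLUMNS, map(float, fields))))
    return rows

def pm10_concentration(row):
    """Undo the log transform, ASSUMING natural logarithms were used."""
    return math.exp(row["log_pm10"])

# Invented sample rows for illustration only (not real data).
sample = """\
3.5 7.2 9.1 4.8 -0.1 74.4 20 600
3.0 6.9 -1.3 2.1 0.4 215.0 8 196
"""
rows = parse_pm10(sample)
```

If the real file turns out to have a header line or a different delimiter, the field-count check should fail on the first row instead of silently mislabeling columns.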
