----- Original Message ----- 
On Fri, 15 Feb 2008 09:15:32 -0500,  "Patrick Dolan" wrote:
> Hi folks-
> I seem to remember people on this list mentioning data sets that
> are publicly available.  Can anybody point me to some? I'm
> particularly interested in ones that might be related to public health.
> Thanks!
> Patrick

In addition to the suggestions already made, you should
take a look at the StatLib dataset archive at Carnegie Mellon: see
http://lib.stat.cmu.edu/datasets/

Datasets here seem to fall into three categories:

(1) Specialized reference datasets that produce well-known results
and can be used to determine whether a particular statistical package
consistently produces accurate results (e.g., is the new version of
Excel any better than the previous version?).  These are provided in
the NIST Statistical Reference Datasets.

(2)  Copies of datasets provided with a number of statistics textbooks,
including the statistics book "DATA", a compilation of datasets by
Andrews and Herzberg (A&H provide a description of each dataset along
with the data which, needless to say, are much better to have
in electronic form).

(3)  Miscellaneous datasets covering a variety of topics including
public health.  Some examples include:

Arsenic
    This datafile contains measurements of drinking water and toenail levels
of arsenic, as well as related covariates, for 21 individuals with private
wells in N.H. Source: Karagas MR, Morris JS, Weiss JE, Spate V, Baskett C,
Greenberg ER. Toenail Samples as an Indicator of Drinking Water Arsenic
Exposure. Cancer Epidemiology, Biomarkers and Prevention 1996;5:849-852.
(Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format) [21/Jul/98] (5 kbytes)

arsenic.zip
    This is a zip file containing the arsenic data from Southwestern Taiwan
that was used in the analysis reported by the National Academy of Sciences
reports on arsenic (NRC 1999 and 2001), analyzed by Morales et al. (2000)
and also discussed by Ryan (2004). See the following for a good summary,
plus links to the NAS reports:
http://www4.nas.edu/news.nsf/isbn/0309076293?OpenDocument Submitted by
Louise Ryan ([EMAIL PROTECTED]). [19/Dec/03](35 kbytes)

backache
    This file contains the `backache in pregnancy' data analysed in Exercise
D.2 of Problem-Solving: A Statistician's Guide, 2nd edn., by C. Chatfield,
Chapman and Hall, 1995. ([EMAIL PROTECTED]) [2/Oct/95] (16 kbytes)

bodyfat
    Lists estimates of the percentage of body fat determined by underwater
weighing and various body circumference measurements for 252 men. Submitted
by Roger Johnson ([EMAIL PROTECTED]) [2/Oct/95](35 kbytes)

detroit
    Data on annual homicides in Detroit, 1961-73, from Gunst & Mason's book
`Regression Analysis and its Application', Marcel Dekker. Contains data on
14 relevant variables collected by J.C. Fisher.
([EMAIL PROTECTED]) [10/Feb/92] (3357 bytes)
(NOTE:  some consider homicide a public health issue)

lupus
    87 persons with lupus nephritis. Followed up 15+ years. 35 deaths. Var =
duration of disease. Over 40 baseline variables available from authors.
Submitted by Todd Mackenzie ([EMAIL PROTECTED]) (4k)

NLTCS
    This data set is an extract from the National Long Term Care Survey
(NLTCS). 16 binary variables in the extract are functional disability
measures: 6 activities of daily living and 10 instrumental activities of
daily living, pooled over 1982, 1984, 1989, and 1994 waves of the survey.
The Center for Demographic Studies, Duke University, gave its permission to
redistribute the 2^16 extract via placement on StatLib under the NLTCS Data
Use Agreement. If you download the data, please provide the Center for
Demographic Studies, Duke University, with your name and contact information
(e-mail [EMAIL PROTECTED]).(149k)

Plasma_Retinol
    This datafile (N=315) investigates the relationship between personal
characteristics and dietary factors, and plasma concentrations of retinol,
beta-carotene and other carotenoids. Analysis unpublished. Related paper:
Nierenberg DW, Stukel TA, Baron JA, Dain BJ, Greenberg ER. Determinants of
plasma levels of beta-carotene and retinol. American Journal of Epidemiology
1989;130:511-521.(Therese.A.Stukel_AT_Dartmouth.EDU) (MS Word format)
[21/Jul/98][28/Nov/01] (26 kbytes)

PM10
    The data are a subsample of 500 observations from a data set that
originates in a study where air pollution at a road is related to traffic
volume and meteorological variables, collected by the Norwegian Public Roads
Administration. The response variable (column 1) consists of hourly values of
the logarithm of the concentration of PM10 (particles), measured at Alnabru
in Oslo, Norway, between October 2001 and August 2003. The predictor
variables (columns 2 to 8) are the logarithm of the number of cars per hour,
temperature 2 meters above ground (degrees C), wind speed (meters/second),
the temperature difference between 25 and 2 meters above ground (degrees
C), wind direction (degrees between 0 and 360), hour of day, and day number
from October 1, 2001. Submitted by Magne Aldrin ([EMAIL PROTECTED]).
[28/Jul/04] (19 kbytes)
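
If you want to pull the PM10 file into Python, here is a minimal sketch of a parser that follows the column order described above. Note the assumptions: I am guessing that the file is plain text with eight numeric fields per row separated by whitespace or commas, the column names are my own invention (the description gives only the order), and the function name parse_pm10 is hypothetical. Check the actual file before relying on it.

```python
import re

# Hypothetical column names, in the order given in the dataset description
# (column 1 = response, columns 2 to 8 = predictors).
PM10_COLUMNS = [
    "log_pm10",
    "log_cars_per_hour",
    "temp_2m_c",
    "wind_speed_ms",
    "temp_diff_25m_2m_c",
    "wind_direction_deg",
    "hour_of_day",
    "day_number",
]


def parse_pm10(text):
    """Parse rows of eight numeric fields into dicts keyed by PM10_COLUMNS.

    Assumes whitespace- or comma-separated values. Lines that do not have
    exactly eight numeric fields (headers, descriptive text, blank lines)
    are skipped rather than treated as errors.
    """
    rows = []
    for line in text.splitlines():
        fields = [f for f in re.split(r"[,\s]+", line.strip()) if f]
        if len(fields) != len(PM10_COLUMNS):
            continue
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # non-numeric line with eight tokens
        rows.append(dict(zip(PM10_COLUMNS, values)))
    return rows


# Usage with a made-up row (NOT real data from the file):
sample = "some header text\n1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0\n"
for row in parse_pm10(sample):
    print(row["log_pm10"], row["day_number"])
```

Skipping malformed lines silently is a convenience for a first look at the data; for real analysis you would probably want to count or report the skipped lines.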

And given that we're in a presidential election, political junkies might
have fun with the following:

fl2000.txt
    County data from the 2000 Presidential Election in Florida. For each of
the 67 Florida counties, the data include the type of voting machine used,
the number of columns in the presidential ballot, the undervote, the
overvote, and the official certified votes for each of the twelve
presidential candidates. Of particular interest are the Buchanan vote in
Palm Beach county, and the overvote as a function of voting machine type and
number of columns (see Agresti and Presnell, "Misvotes, Undervotes, and
Overvotes: The 2000 Presidential Election in Florida," Statistical Science,
Vol. 17, No. 4, 1-5, 2002). Submitted by ([EMAIL PROTECTED]). [28/Jan/03]
(8.0kbytes)


At the bottom of the page is a short list of other sites that also
provide data.
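
A footnote on category (1): one common way to score a package against a certified value is the log relative error, which is roughly the number of significant digits the package got right. Below is a minimal Python sketch of the idea. The data are small illustrative values whose mean and standard deviation are known exactly by construction; they are not read from a NIST file, and the function name and the cap of 15 digits are my own assumptions.

```python
import math
import statistics


def log_relative_error(computed, certified):
    """Approximate number of correct significant digits, capped at 15.

    Uses -log10(|computed - certified| / |certified|). When the certified
    value is zero, falls back to -log10(|computed|).
    """
    if computed == certified:
        return 15.0
    if certified == 0:
        return min(15.0, -math.log10(abs(computed)))
    return min(15.0, -math.log10(abs(computed - certified) / abs(certified)))


# Illustrative data: the exact mean is 10000002 and the exact sample
# standard deviation is 1, which can be verified by hand.
data = [10000001.0, 10000003.0, 10000002.0]
exact_mean = 10000002.0
exact_sd = 1.0

print("mean LRE:", log_relative_error(statistics.mean(data), exact_mean))
print("sd LRE:  ", log_relative_error(statistics.stdev(data), exact_sd))
```

To test another package, you would substitute its reported mean and standard deviation for the statistics calls and compare them against the certified values that accompany the reference dataset.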

-Mike Palij
New York University
[EMAIL PROTECTED]


