I am posting to this thread that has been quiet for some time because I
remembered the following question.
Christophe Pallier wrote:
Hi,
Can you provide examples of data formats that are problematic to read and
clean with R ?
Today I had a data manipulation problem that I don't know how to
If I understand correctly (from your Perl script)
1. you count the number of occurrences of each (echo, muga) pair in the
first file.
2. you remove from the second file the lines that correspond to these
occurrences.
If this is indeed your aim, here's a solution in R:
cumcount <- function(x) {
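The posted solution is cut off in this listing after its first line, so the following is not the poster's code. It is only a minimal sketch of the two steps described above, under the assumption that "the lines that correspond to these occurrences" means: if a pair occurs n times in the first file, its first n lines are dropped from the second file. The name cumcount is borrowed from the truncated line, but its body here is a guess, and the data frames first and second are made-up stand-ins for the two files.

```r
# Hypothetical helper: running count of each value, in order of appearance.
# For c("a", "b", "a") it gives 1, 1, 2.
cumcount <- function(x) {
  ave(seq_along(x), x, FUN = seq_along)
}

# Assumed toy data standing in for the two files (columns echo, muga).
first  <- data.frame(echo = c(1, 1, 2), muga = c("a", "a", "b"))
second <- data.frame(echo  = c(1, 1, 1, 2, 3),
                     muga  = c("a", "a", "a", "b", "c"),
                     value = 1:5)

# Step 1: count the occurrences of each (echo, muga) pair in the first file.
key1 <- paste(first$echo, first$muga, sep = "\t")
n1   <- table(key1)

# Step 2: for each line of the second file, look up how many lines of its
# pair must be removed (0 if the pair never occurs in the first file) ...
key2  <- paste(second$echo, second$muga, sep = "\t")
quota <- as.vector(n1)[match(key2, names(n1))]
quota[is.na(quota)] <- 0

# ... and keep only the lines beyond that quota.
kept <- second[cumcount(key2) > quota, ]
print(kept)
```

Working on a pasted key keeps the sketch short; it assumes the separator (a tab here) never appears inside the echo or muga values.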
As a tangent to this thread, there is a very relevant
article in the latest issue of the RSS magazine Significance,
which I have just received:
Dr Fisher's Casebook
The trouble with data
Significance, Vol 4 (2007) Issue 2.
Full current contents at
--- [EMAIL PROTECTED] wrote:
As a tangent to this thread, there is a very
relevant
article in the latest issue of the RSS magazine
Significance,
which I have just received:
Dr Fisher's Casebook
The trouble with data
Significance, Vol 4 (2007) Issue 2.
Full current contents at
[ Arrggh, not reply, but reply to all, cross my fingers again, sorry Peter! ]
Hmm,
I don't think you need a retain statement.
if first.patientID ;
or
if last.patientID ;
ought to do it.
It's actually better than the Vilno version, I must admit, a bit more concise:
if ( not
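For readers following along in R rather than SAS: in a SAS data step with a BY statement on sorted data, first.patientID and last.patientID flag the first and last record of each patient. A rough base R counterpart, offered only as a sketch and not as anything posted in this thread, uses duplicated(). The data frame visits and its columns are invented for illustration and are assumed to be sorted by patient already.

```r
# Assumed toy data, already sorted by patientID and visit.
visits <- data.frame(patientID = c(1, 1, 2, 3, 3, 3),
                     visit     = c(1, 2, 1, 1, 2, 3))

# Roughly "if first.patientID;": keep the first record of each patient.
first_rows <- visits[!duplicated(visits$patientID), ]

# Roughly "if last.patientID;": keep the last record of each patient.
last_rows <- visits[!duplicated(visits$patientID, fromLast = TRUE), ]

print(first_rows)
print(last_rows)
```

Unlike the SAS version, duplicated() does not need the rows of a patient to be adjacent, but "first" and "last" only mean what you expect if the data are sorted within each patient.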
(Ted Harding) sent the following at 10/06/2007 09:28:
... much snipped ...
(As is implicit in many comments in Robert's blog, and indeed also
from many postings to this list over time and undoubtedly well
known to many of us in practice, a lot of the problems with data
files arise at the
Chris Evans wrote:
Thanks Ted, great thread and I'm impressed with EpiData that I've
discovered through this. I'd still like something that is even more
integrated with R but maybe some day, if EpiData go fully open source as
I think they are doing (A full conversion plan to secure this and
On 10-Jun-07 02:16:46, Gabor Grothendieck wrote:
That can be elegantly handled in R through R's object
oriented programming by defining a class for the fancy input.
See this post:
https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
for a simple example of that style.
On 6/9/07,
Douglas Bates wrote:
Frank Harrell indicated that it is possible to do a lot of difficult
data transformation within R itself if you try hard enough but that
sometimes means working against the S language and its whole object
view to accomplish what you want and it can require knowledge of
On 6/10/07, Ted Harding [EMAIL PROTECTED] wrote:
... a lot of the problems with data
files arise at the data gathering and entry stages, where people
can behave as if stuffing unpaired socks and unattributed underwear
randomly into a drawer, and then banging it shut.
Not specifically
Since R is supposed to be a complete programming language, I wonder
why these tools couldn't be implemented in R (unless speed is the
issue). Of course, it's a naive desire to have a single language that
does everything, but it seems that R currently has most of the
functions necessary to do the
On 10-Jun-07 14:04:44, Sarah Goslee wrote:
On 6/10/07, Ted Harding [EMAIL PROTECTED] wrote:
... a lot of the problems with data
files arise at the data gathering and entry stages, where people
can behave as if stuffing unpaired socks and unattributed underwear
randomly into a drawer, and
On 10-Jun-07 19:27:50, Stephen Tucker wrote:
Since R is supposed to be a complete programming language,
I wonder why these tools couldn't be implemented in R
(unless speed is the issue). Of course, it's a naive desire
to have a single language that does everything, but it seems
that R
An important potential benefit of R solutions shared by awk, sed, ...
is that they provide a reproducible way to document exactly how one
got from one version of the data to the next. This seems to be the main
problem with handicraft methods like editing Excel files, it is too
easy to
Embarrassingly, I don't know awk or sed but R's code seems to be
shorter for most tasks than Python, which is my basis for comparison.
It's true that R's more powerful data structures usually aren't
necessary for the data cleaning, but sometimes in the filtering
process I will pick out lines that
Here are some examples of the type of data crunching you might have to do.
In response to the requests by Christophe Pallier and Martin Stevens.
Before I started developing Vilno, some six years ago, I had been working in
the pharmaceutical industry for eight years (it's not easy to show you actual
That can be elegantly handled in R through R's object oriented programming
by defining a class for the fancy input. See this post:
https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
for a simple example of that style.
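The linked post is not reproduced in this listing, so what follows is only a guess at the style being described, not a copy of it. One way to "define a class for the fancy input" in R is to register a conversion from character to a formal class and then name that class in colClasses when reading. The class name num.with.commas and the inline sample data are assumptions made for this sketch.

```r
library(methods)

# Hypothetical class for numbers written with thousands separators.
setClass("num.with.commas")

# Conversion used by read.csv() when a column is given this class.
setAs("character", "num.with.commas",
      function(from) as.numeric(gsub(",", "", from, fixed = TRUE)))

# Assumed sample input; the second column holds the "fancy" numbers.
txt <- 'id,amount\n1,"1,234"\n2,"56"'
dat <- read.csv(text = txt, colClasses = c("integer", "num.with.commas"))

print(dat)
str(dat)
```

The cleaning rule then lives in one place and is applied at read time, instead of being patched onto the data frame afterwards.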
On 6/9/07, Robert Wilkins [EMAIL PROTECTED] wrote:
Here are
Hi,
Can you provide examples of data formats that are problematic to read and
clean with R ?
The only problematic cases I have encountered were cases with multiline
and/or varying length records (optional information). Then, it is sometimes
a good idea to preprocess the data to present in a
On 08-Jun-07 08:27:21, Christophe Pallier wrote:
Hi,
Can you provide examples of data formats that are problematic
to read and clean with R ?
The only problematic cases I have encountered were cases with
multiline and/or varying length records (optional information).
Then, it is
On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
As noted on the R-project web site itself ( www.r-project.org -
Manuals - R Data Import/Export ), it can be cumbersome to prepare
messy and dirty data for analysis with the R tool itself. I've also
seen at least one S programming book (one of
I had mentioned exactly the same thing to others and the feedback I got is -
'when you have a hammer, everything will look like a nail'
^_^.
On 6/7/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
Robert Wilkins wrote:
As noted on the R-project web site itself ( www.r-project.org -
Manuals -
Is there an example available of this sort of problematic data that
requires this kind of data screening and filtering? For many of us,
this issue would be nice to learn about, and deal with within R. If a
package could be created, that would be optimal for some of us. I
would like to
Martin Henry H. Stevens sent the following at 08/06/2007 15:11:
Is there an example available of this sort of problematic data that
requires this kind of data screening and filtering? For many of us,
this issue would be nice to learn about, and deal with within R. If a
package could be
For Windows users, EpiData Entry http://www.epidata.dk/ is an
excellent (free) tool for data entry and documentation.--Dale
On 6/8/07, Chris Evans [EMAIL PROTECTED] wrote:
Martin Henry H. Stevens sent the following at 08/06/2007 15:11:
Is there an example available of this sort of
Dale Steele wrote:
For Windows users, EpiData Entry http://www.epidata.dk/ is an
excellent (free) tool for data entry and documentation.--Dale
Note that EpiData seems to work well under Linux using Wine.
Frank
On 6/8/07, Douglas Bates [EMAIL PROTECTED] wrote:
Other responses in this thread have mentioned 'little language'
filters like awk, which is fine for those who were raised in the Bell
Labs tradition of programming (why type three characters when two
character names should suffice for
As noted on the R-project web site itself ( www.r-project.org -
Manuals - R Data Import/Export ), it can be cumbersome to prepare
messy and dirty data for analysis with the R tool itself. I've also
seen at least one S programming book (one of the yellow Springer ones)
that says, more briefly, the
An additional option for Windows users is Micro Osiris
http://www.microsiris.com/
best
robert
On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
As noted on the R-project web site itself ( www.r-project.org -
Manuals - R Data Import/Export ), it can be cumbersome to prepare
messy and dirty
Robert Wilkins wrote:
As noted on the R-project web site itself ( www.r-project.org -
Manuals - R Data Import/Export ), it can be cumbersome to prepare
messy and dirty data for analysis with the R tool itself. I've also
seen at least one S programming book (one of the yellow Springer ones)