[R] Text Mining in R

2016-05-17 Thread Burhan ul haq
Hi,

Wishing you all well.

I am exploring text mining with R. Here is where I need help:

1. The starting point is a data frame

worder1 <- c("I am, taking 2", "are these the three samples?",
             "He speaks differently to you, aint it !",
             "This is distilled - my dear, now give me $3",
             "I saved 2500 this month.")
df1 <- data.frame(id=1:5, words=worder1)

here in dput format:

dput(df1)
structure(list(id = 1:5, words = structure(c(3L, 1L, 2L, 5L,
4L), .Label = c("are these the three samples?",
"He speaks differently to you, aint it !",
"I am, taking 2", "I saved 2500 this month.",
"This is distilled - my dear, now give me $3"
), class = "factor")), .Names = c("id", "words"), row.names = c(NA,
-5L), class = "data.frame")


2. The corpus rituals ...

library(tm)
corp1 <- Corpus(VectorSource(df1$words))
inspect(corp1)
class(corp1)

corp1 <- tm_map(corp1, removeNumbers)
corp1 <- tm_map(corp1, removePunctuation)
corp1 <- tm_map(corp1, removeWords, stopwords("english"))
corp1 <- tm_map(corp1, stripWhitespace)
class(corp1)


3. Getting to the analysis

tdm1 <- TermDocumentMatrix(corp1)
inspect(tdm1[1:5,])
dtm1 <- DocumentTermMatrix(corp1)
inspect(dtm1[1:5,])

4. Now here is the problem

If I apply a transformation that is not listed in getTransformations(), I can
no longer convert the corpus to a tdm or dtm:

corp1 <- tm_map(corp1, tolower)
class(corp1)
tdm1.2 <- TermDocumentMatrix(corp1)
dtm1.2 <- DocumentTermMatrix(corp1)

The error returned is:

Error: inherits(doc, "TextDocument") is not TRUE

5. The explanations I found on the internet suggest either

a) corp1 <- tm_map(corp1, content_transformer(tolower))
which in my case returns error:
Error in UseMethod("content", x) :
  no applicable method for 'content' applied to an object of class
"character"

b) corpus_clean <- tm_map(corp1, PlainTextDocument)
which results in loss of all the meta data

I would appreciate any help. Lastly, to keep the doc ids in the R corpus,
should step 2 be changed to:
corp1 <- Corpus(DataframeSource(df1))

from:
corp1 <- Corpus(VectorSource(df1$words))
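
A possible way around both errors, offered as a sketch rather than a verified fix for the tm version used above: once tm_map(corp1, tolower) has run, the documents are plain character strings, so content_transformer() has nothing to unwrap. Rebuilding the corpus and applying the wrapped transformation from the start avoids that. The sketch assumes tm 0.7 or later, where DataframeSource() expects columns named doc_id and text; the ids "doc1" to "doc5" are made up for illustration.

```r
library(tm)

# Same five sentences as in step 1, stored as character (not factor).
words <- c("I am, taking 2", "are these the three samples?",
           "He speaks differently to you, aint it !",
           "This is distilled - my dear, now give me $3",
           "I saved 2500 this month.")

# Assumption: tm >= 0.7, which requires the columns doc_id and text.
docs <- data.frame(doc_id = paste0("doc", 1:5), text = words,
                   stringsAsFactors = FALSE)

# Start from a fresh corpus; do not reuse one already passed through tolower.
corp <- VCorpus(DataframeSource(docs))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeWords, stopwords("english"))
corp <- tm_map(corp, stripWhitespace)

dtm <- DocumentTermMatrix(corp)
Docs(dtm)   # document ids carried over from doc_id
```

Because every step keeps the documents as text documents, the matrix can be built afterwards and its rows should be labelled with the ids from the data frame.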

Thanks /


-

Some of the references I explored:
http://stackoverflow.com/questions/25638503/tm-loses-the-metadata-when-applying-tm-map
http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument
http://stackoverflow.com/questions/24771165/r-project-no-applicable-method-for-meta-applied-to-an-object-of-class-charact
http://stackoverflow.com/questions/25551514/termdocumentmatrix-errors-in-r
http://stackoverflow.com/questions/20699111/tm-map-error-message-in-r
http://stackoverflow.com/questions/31996891/error-in-usemethodmeta-x-no-applicable-method-for-meta-applied-to-an-ob
http://stackoverflow.com/questions/11876740/r-stemming-a-string-document-corpus

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Sum of Numeric Values in a DF Column

2016-04-18 Thread Burhan ul haq
Dear Gunter /  Heiberger,

Thanks for the help. This is what I was looking for:

> ... and here is a non-dplyr solution:
>
>> z <-gsub("[^[:digit:]]"," ",dd$Lower)
>
>> sapply(strsplit(z," +"),function(x)sum(as.numeric(x),na.rm=TRUE))
> [1] 105  67  60 100  80

That also explains why "unlist" was not the right tool here: I did not want
a grand total, but rather a sum for each row.
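
For reference, a small base-R sketch of the difference between the per-row sums and the grand total. It uses only the "Lower" column from the dput in this thread; the en dashes are written as \u2013 escapes so the snippet does not depend on file encoding, and the variable names are illustrative.

```r
# The "Lower" column from the thread, en dashes written as \u2013.
lower <- c("R 72\u201333",
           "R/Coalition 27(23 R, 4 D)\u201312 D, 1 Ind.",
           "R 36\u201324",
           "R 64\u201335, 1 Ind.",
           "D 52\u201328")

# One character vector of digit runs per row.
digits <- regmatches(lower, gregexpr("[0-9]+", lower))

# Per-row sums: keep the list structure and sum inside each element.
row_sums <- vapply(digits, function(x) sum(as.numeric(x)), numeric(1))

# Grand total: unlist() flattens the rows together first.
grand_total <- sum(as.numeric(unlist(digits)))

row_sums      # 105 67 60 100 80
grand_total   # 412
```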


Br /

On Mon, Apr 18, 2016 at 10:57 PM, Bert Gunter <bgunter.4...@gmail.com>
wrote:

> ... and a slightly more efficient non-dplyr 1-liner:
>
> > sapply(strsplit(dd$Lower,"[^[:digit:]]"),
> function(x)sum(as.numeric(x), na.rm=TRUE))
>
> [1] 105  67  60 100  80
>
> Cheers,
> Bert
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Mon, Apr 18, 2016 at 10:43 AM, Bert Gunter <bgunter.4...@gmail.com>
> wrote:
> > ... and here is a non-dplyr solution:
> >
> >> z <-gsub("[^[:digit:]]"," ",dd$Lower)
> >
> >> sapply(strsplit(z," +"),function(x)sum(as.numeric(x),na.rm=TRUE))
> > [1] 105  67  60 100  80
> >
> >
> > Cheers,
> > Bert
> > Bert Gunter
> >
> > "The trouble with having an open mind is that people keep coming along
> > and sticking things into it."
> > -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >
> >
> > On Mon, Apr 18, 2016 at 10:07 AM, Richard M. Heiberger <r...@temple.edu>
> wrote:
> >> ## Continuing with your data
> >>
> >> AA <- stringr::str_extract_all(dd[[2]],"[[:digit:]]+")
> >> BB <- lapply(AA, as.numeric)
> >> ## I think you are looking for one of the following two expressions
> >> sum(unlist(BB))
> >> sapply(BB, sum)
> >>
> >>
> >> On Mon, Apr 18, 2016 at 12:48 PM, Burhan ul haq <ulh...@gmail.com>
> wrote:
> >>> Hi,
> >>>
> >>> I request help with the following:
> >>>
> >>> INPUT: A data frame where column "Lower" is a character containing
> numeric
> >>> values (different count or occurrences of numeric values in each row,
> >>> mostly 2)
> >>>
> >>>> dput(dd)
> >>> structure(list(State = c("Alabama", "Alaska", "Arizona", "Arkansas",
> >>> "California"), Lower = c("R 72–33", "R/Coalition 27(23 R, 4 D)–12 D, 1
> >>> Ind.",
> >>> "R 36–24", "R 64–35, 1 Ind.", "D 52–28"), Upper = c("R 26–8, 1 Ind.",
> >>> "R/Coalition 15(14 R, 1 D)–5 D", "R 18–12", "R 24–11", "D 26–14"
> >>> )), .Names = c("State", "Lower", "Upper"), row.names = c(NA,
> >>> 5L), class = "data.frame")
> >>>
> >>> PROBLEM: Need to extract all numeric values and sum them. There are few
> >>> exceptions like row2. But these can be ignored and will be fixed
> manually
> >>>
> >>> SOLUTION SO FAR:
> >>> str_extract_all(dd[[2]],"[[:digit:]]+"), returns a list of numbers as
> >>> character. I am unable to unlist it, because it mixes them all
> together, ...
> >>>
> >>> And if I may add, is there a "dplyr" way of doing it ...
> >>>
> >>>
> >>> Thanks

[R] Sum of Numeric Values in a DF Column

2016-04-18 Thread Burhan ul haq
Hi,

I request help with the following:

INPUT: A data frame where column "Lower" is a character containing numeric
values (different count or occurrences of numeric values in each row,
mostly 2)

> dput(dd)
structure(list(State = c("Alabama", "Alaska", "Arizona", "Arkansas",
"California"), Lower = c("R 72–33",
"R/Coalition 27(23 R, 4 D)–12 D, 1 Ind.",
"R 36–24", "R 64–35, 1 Ind.", "D 52–28"), Upper = c("R 26–8, 1 Ind.",
"R/Coalition 15(14 R, 1 D)–5 D", "R 18–12", "R 24–11", "D 26–14"
)), .Names = c("State", "Lower", "Upper"), row.names = c(NA,
5L), class = "data.frame")

PROBLEM: I need to extract all numeric values and sum them. There are a few
exceptions, like row 2, but these can be ignored and will be fixed manually.

SOLUTION SO FAR:
str_extract_all(dd[[2]],"[[:digit:]]+"), returns a list of numbers as
character. I am unable to unlist it, because it mixes them all together, ...

And if I may add, is there a "dplyr" way of doing it ...


Thanks


Re: [R] R-help Digest, Vol 157, Issue 25

2016-03-24 Thread Burhan ul haq
Thanks to Boris Steipe, Jim Lemon and  Ivan Calandra for replying.

I made a mistake while copying; there is an equal number of values for each
country.

@ Ivan,

If there were a different number of values per country, and we wanted to fill
the missing values with
1) NA, or
2) the average of the remaining values,

how would we "impute" such data?
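
One way the two options could look in base R, as a sketch on made-up numbers (the country values below are hypothetical, not the scraped data): pad each vector to the longest length with NA, then optionally replace each NA with the mean of the observed values in the same row.

```r
# Hypothetical unequal-length values per country.
vals <- list(Armenia = c(10, 20, 30), Austria = c(5, 15), Belarus = c(8))

n <- max(lengths(vals))

# Option 1: pad the short vectors with NA (one row per country).
pad_na <- t(sapply(vals, function(v) c(v, rep(NA_real_, n - length(v)))))

# Option 2: replace each NA with the mean of the values present in that row.
pad_mean <- t(apply(pad_na, 1, function(r) {
  r[is.na(r)] <- mean(r, na.rm = TRUE)
  r
}))

pad_na
pad_mean
```

Row-mean filling is only one simple choice; whether it is appropriate depends on why the values are missing.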


Thanks again /



[R] Splitting a vector into data frame

2016-03-24 Thread Burhan ul haq
Hi,

1. I have scraped some data from the web, subset shown below

> dput(temp.data)
c("Armenia", "Armenia", "43827", "39200", "35700", "36700", "39341",
"30571", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", " 0",
"0", "0", "0", "0", "Austria", "Austria", "135417", "166200",
"144500", "147300", "163211", "162536", "155412", "133667", "134962",
"146440", "131188", "11", "10", "8", "35000")

2. The corresponding list of countries, is as follows

> dput(raw.country)
c("Armenia", "Austria", "Belarus", "Belgium", "Brazil", "Bulgaria",
"Canada", "Castile-Leon (Hiszania)", "Catalonia", "Chile", "Colombia",
"Costarica", "Croatia", "Cyprus", "Czech Republic", "Ecuador",
"Estonia", "Finland", "France", "Georgia", "Germany", "Ghana",
"Greece", "Hungary", "Indonesia", "Iran", "Ireland", "Israel",
"Italy", "Kazakhstan", "Kyrgyzstan", "Latvia", "Lithuania", "Macedonia",
"Malaysia", "Mexico", "Moldova", "Mongolia", "Netherland", "Norway",
"Pakistan", "Panama", "Paraguay", "Peru", "Poland", "Portugal",
"Puertorico", "Romania", "Russia", "Serbia", "Slovakia", "Slovenia",
"Spain", "Sweden", "Switzerland", "Tunisia", "Ukraine", "United Kingdom",
"USA", "Venezuela", "Vltava", "World Total")


3. I want to organize the data into a data frame, where each row contains
the 20 values for the corresponding country.
The country name, which appears twice, needs to be ignored. Something like:

Armenia "43827", "39200", "35700", "36700", "39341",
"30571", "0", "0", "0", "0", "0", "0", "0", "0", "0", "0", " 0",
"0", "0", "0", "0",

"Austria", "135417", "166200",
"144500", "147300", "163211", "162536", "155412", "133667", "134962",
"146440", "131188", "11", "10", "8", "35000"

and so on
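
A minimal sketch of one way to do this in base R, using a shortened, made-up subset of temp.data and raw.country. It assumes that every country block starts with one or more copies of the country name and that names never occur among the values; blocks of unequal length are padded with NA.

```r
# Shortened illustrative input (not the full scraped vector).
temp.data   <- c("Armenia", "Armenia", "43827", "39200", " 0",
                 "Austria", "Austria", "135417", "166200")
raw.country <- c("Armenia", "Austria", "Belarus")

is.country <- temp.data %in% raw.country
# A block starts at a country name that is not preceded by another name.
starts <- is.country & !c(FALSE, head(is.country, -1))
grp    <- cumsum(starts)

# Values per block, with the repeated country names dropped.
vals <- split(as.numeric(trimws(temp.data[!is.country])), grp[!is.country])

# Pad to a common length and bind into one row per country.
n   <- max(lengths(vals))
mat <- do.call(rbind, lapply(vals, function(v) c(v, rep(NA_real_, n - length(v)))))
colnames(mat) <- paste0("v", seq_len(n))

res <- data.frame(country = temp.data[starts], mat, stringsAsFactors = FALSE)
res
```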


Thanks /



Re: [R] Error in upgrading ggplot2

2016-03-03 Thread Burhan ul haq
Thanks. I will try both options, 1) another mirror and 2) upgrading R, and
report back if the issue persists.


Br /

On Fri, Mar 4, 2016 at 10:56 AM, Jeff Newmiller <jdnew...@dcn.davis.ca.us>
wrote:

> The usual thing to try in cases like this is another mirror.
>
> Another worthwhile step is upgrading your R software to the latest... if
> only to comply with the Posting Guide.
> --
> Sent from my phone. Please excuse my brevity.
>
> On March 3, 2016 9:33:05 PM PST, Burhan ul haq <ulh...@gmail.com> wrote:
>>
>> Hi,
>>
>> I was planning to use GGally, which required me to upgrade ggplot2 but
>> despite trying multiple times, I have been unable to do so:
>>
>> The ggplot2 downloads and installs, but when I load it, I get the following
>> message:
>>
>> > library("ggplot2", lib.loc="/usr/local/lib/R/site-library")
>> Error in get(method, envir = home) :
>>   lazy-load database '/usr/local/lib/R/site-library/ggplot2/R/ggplot2.rdb'
>> is corrupt
>> In addition: Warning message:
>> In get(method, envir = home) : internal error -3 in R_decompress1
>> Error: package or namespace load failed for ‘ggplot2’
>>
>> The session info is as follows:
>>
>> > sessionInfo()
>> R version 3.2.2 (2015-08-14)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 14.04.1 LTS
>>
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C
>>  [6] LC_MESSAGES=C LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=C LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics  grDevices utils datasets  methods   base
>>
>> other attached packages:
>> [1] scales_0.3.0   reshape2_1.4.1 dplyr_0.4.3
>>
>> loaded via a namespace (and not attached):
>>  [1] Rcpp_0.12.3 assertthat_0.1 digest_0.6.8 MASS_7.3-40 R6_2.1.1 grid_3.2.2
>>  [7] plyr_1.8.3 gtable_0.1.2 DBI_0.3.1 magrittr_1.5 stringi_1.0-1 lazyeval_0.1.10
>> [13] proto_0.3-10 tools_3.2.2 stringr_1.0.0 munsell_0.4.2 parallel_3.2.2 colorspace_1.2-6
>>
>>
>> Thanks

[R] Error in upgrading ggplot2

2016-03-03 Thread Burhan ul haq
Hi,

I was planning to use GGally, which required me to upgrade ggplot2 but
despite trying multiple times, I have been unable to do so:

The ggplot2 downloads and installs, but when I load it, I get the following
message:

> library("ggplot2", lib.loc="/usr/local/lib/R/site-library")
Error in get(method, envir = home) :
  lazy-load database '/usr/local/lib/R/site-library/ggplot2/R/ggplot2.rdb'
is corrupt
In addition: Warning message:
In get(method, envir = home) : internal error -3 in R_decompress1
Error: package or namespace load failed for ‘ggplot2’

The session info is as follows:

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=C LC_COLLATE=C LC_MONETARY=C
 [6] LC_MESSAGES=C LC_PAPER=C LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=C LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] scales_0.3.0   reshape2_1.4.1 dplyr_0.4.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.3 assertthat_0.1 digest_0.6.8 MASS_7.3-40 R6_2.1.1 grid_3.2.2
 [7] plyr_1.8.3 gtable_0.1.2 DBI_0.3.1 magrittr_1.5 stringi_1.0-1 lazyeval_0.1.10
[13] proto_0.3-10 tools_3.2.2 stringr_1.0.0 munsell_0.4.2 parallel_3.2.2 colorspace_1.2-6


Thanks


[R] Grep Help

2016-02-22 Thread Burhan ul haq
Hi,

# 1) I have read in a CSV file

df = read.csv(file="GiftCards - v1.csv",stringsAsFactors=FALSE)
head(df)
str(df)

# 2) converted to a tbl_df
library(dplyr)
df2 = tbl_df(df)

# 3) fixed the names to remove leading "X" character
n = names(df2)
n2 = sub(pattern="^X","",n)
names(df2) = n2

# 4) somehow the col names are character strings, requiring me to use
# backquotes: df2$`2006` instead of df2$2006  # ---> PROBLEM 1


# 5) I need to remove the leading $ sign followed by spaces to extract
# values. The problem is that it could be a two or three digit number. I am
# able to retrieve two digits correctly, but miss out on the leading third
# digit.
df2$`2006`= gsub("^(.+)([0-9]{2,3}\\.[0-9]{2})","\\2",df2$`2006`) # --> Problem 2

# 6) dump for the data frame

df2 <-
structure(list(`2006` = structure(c(3L, 2L, 1L), .Label = c("$
24.81",
"$ 39.16", "$   146.20"), class = "factor"), `2007` = structure(c(3L,
2L, 1L), .Label = c("$   26.25", "$ 41.95", "$   156.24"
), class = "factor"), `2008` = structure(c(3L, 2L, 1L), .Label = c("$
24.92",
"$ 40.54", "$   147.33"), class = "factor"), `2009` = structure(c(3L,
2L, 1L), .Label = c("$   23.63", "$ 39.80", "$   139.91"
), class = "factor"), `2010` = structure(c(3L, 2L, 1L), .Label = c("$
24.78",
"$ 41.48", "$   145.61"), class = "factor"), `2011` = structure(c(3L,
2L, 1L), .Label = c("$   27.80", "$ 43.23", "$   155.43"
), class = "factor"), `2012` = structure(c(3L, 2L, 1L), .Label = c("$
28.79",
"$ 43.75", "$   156.86"), class = "factor"), `2013` = structure(c(3L,
2L, 1L), .Label = c("$   29.80", "$ 45.16", "$   163.16"
), class = "factor")), .Names = c("2006", "2007", "2008", "2009",
"2010", "2011", "2012", "2013"), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L))
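
A short sketch touching both problems, on three illustrative values rather than the full data frame. On Problem 2: the leading "^(.+)" is greedy, so it takes as many characters as it can and leaves only the minimum of two digits for the second group. Deleting everything that is not a digit or a decimal point sidesteps the issue. On Problem 1: names that start with a digit are not syntactic in R, so they need backquotes (or [[ ]]) however the frame was built.

```r
# Illustrative values in the same shape as the gift card columns.
x <- c("$   146.20", "$ 39.16", "$   24.81")

# Drop every character that is not a digit or a dot, then convert.
amount <- as.numeric(gsub("[^0-9.]", "", x))
amount

# Removing the "X" prefix that read.csv adds to numeric column names.
clean_names <- sub("^X", "", c("X2006", "X2007"))

# Non-syntactic names need backquotes or [[ ]] to be accessed.
df <- data.frame(`2006` = x, check.names = FALSE, stringsAsFactors = FALSE)
df$`2006` <- as.numeric(gsub("[^0-9.]", "", df$`2006`))
df[["2006"]]
```

If the columns are factors, as in the dput above, gsub() coerces them to character first, so the same call should work unchanged.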



Thanks for the help


Br /



Re: [R] Text Input from a Non Delimited File

2014-02-09 Thread Burhan ul haq
Hi,

Minor Additions:

The original file was as follows:

##  ---
GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
1 10038 Carl Allwood M Sutton  Ashfield Harriers 02:38:40 1 02:38:40
2 10098 Adam Holland M Votwo/USN 02:41:25 2 02:41:25
3 13007 Pumlani Bangani M 02:43:23 3 02:43:23
4 10028 Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
5 10187 Peter Stockdale M 02:45:26 5 02:45:25
6 10064 Jared Bethell M Harlow RC 02:46:43 6 02:46:40
7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
8 13009 Rod Harris M 02:47:47 8 02:47:45
9 10033 Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
10 10037 Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
11 10048 Pavel Toropov M 02:50:41 11 02:50:41
12 10008 Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
13 10044 Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
14 10380 Ludovic Renou M 02:53:37 14 02:53:34
15 10056 Alex Keenan M 02:53:48 15 02:53:47
##  ---

Available here:
http://www.coltishalljaguars.co.uk/wp-content/uploads/2011/09/Robin-hood2011.pdf

I am able to match a single entry with the regular expression:
^(\d+),(\d+),( )(.)*(M |F )(.)*(\d{2}):(\d{2}):(\d{2})( )(\d{1,})( )(\d{2}):(\d{2}):(\d{2})

But I am unable to use the back references properly to insert the commas
that delimit the text.

I believe regular expressions pertain to R as much as they do to
Sublime, but please let me know if I should post this to the
Sublime forum instead.
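
For what it is worth, the splitting can also be attempted directly in R. The following is only a sketch on three of the lines shown above; it assumes the gender is a lone "M" or "F" surrounded by spaces, that Cat (when present) is a two-digit number, and that the last three fields always have the time, position, time shape. Names containing a standalone "M" or "F" would break it. The pattern and column handling are illustrative.

```r
lines <- c(
  "3 13007 Pumlani Bangani M 02:43:23 3 02:43:23",
  "6 10064 Jared Bethell M Harlow RC 02:46:43 6 02:46:40",
  "7 13003 Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44"
)

pat <- paste0(
  "^(\\d+) (\\d+) ",     # GunPos, RaceNo
  "(.+?) ([MF]) ",       # Name (lazy), Gender
  "((?:\\d{2} )?)",      # optional two-digit Cat, captured with its space
  "(.*?) ?",             # optional Club (lazy)
  "(\\d{2}:\\d{2}:\\d{2}) (\\d+) (\\d{2}:\\d{2}:\\d{2})$"  # GunTime, ChipPos, ChipTime
)

# One character vector per line: full match followed by the nine groups.
m <- regmatches(lines, regexec(pat, lines, perl = TRUE))
fields <- do.call(rbind, m)[, -1, drop = FALSE]

res <- as.data.frame(fields, stringsAsFactors = FALSE)
names(res) <- c("GunPos", "RaceNo", "Name", "Gender", "Cat", "Club",
                "GunTime", "ChipPos", "ChipTime")
res$Cat <- trimws(res$Cat)   # drop the space captured with Cat
res
```

Every group always takes part in the match (possibly as an empty string), so each line yields the same number of fields and the rows bind cleanly.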



\\Cheers


On Mon, Feb 10, 2014 at 3:48 AM, Burhan ul haq ulh...@gmail.com wrote:
 Hi,

 I am trying to read in a file, which is not delimited by any specific
 characters.

 Something as follows:
 ##  ---
 GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
 1,10038, Carl Allwood M Sutton  Ashfield Harriers 02:38:40 1 02:38:40
 2,10098, Adam Holland M Votwo/USN 02:41:25 2 02:41:25
 3,13007, Pumlani Bangani M 02:43:23 3 02:43:23
 4,10028, Anthony Jackson M Sittingbourne Striders 02:44:39 4 02:44:39
 5,10187, Peter Stockdale M 02:45:26 5 02:45:25
 6,10064, Jared Bethell M Harlow RC 02:46:43 6 02:46:40
 7,13003, Sarah Harris F 35 Long Eaton RC 02:47:47 7 02:47:44
 8,13009, Rod Harris M 02:47:47 8 02:47:45
 9,10033, Carl Sommer M Huncote Harriers 02:47:59 9 02:47:58
 10,10037, Peter Swaine M Charnwood AC 02:49:28 10 02:49:27
 11,10048, Pavel Toropov M 02:50:41 11 02:50:41
 12,10008, Derek Dunne M 45 Treasury Running Club 02:51:42 12 02:51:40
 13,10044, Matthew Nutt M Scunthorpe 02:52:20 13 02:52:15
 14,10380, Ludovic Renou M 02:53:37 14 02:53:34
 15,10056, Alex Keenan M 02:53:48 15 02:53:47
 ##  ---


 As I failed to read it in via R or Excel, I used a text editor with
 regular expressions, sublime to be exact. I was trying to convert it
 in CSV format, and was successful to put commas for the first two
 entries, as follows:

 ##  ---
 GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
 1,10038, Carl Allwood ,M ,Sutton  Ashfield Harriers 02:38:40 1 02:38:40
 2,10098, Adam Holland ,M ,Votwo/USN 02:41:25 2 02:41:25
 3,13007, Pumlani Bangani ,M ,02:43:23 3 02:43:23
 4,10028, Anthony Jackson ,M ,Sittingbourne Striders 02:44:39 4 02:44:39
 5,10187, Peter Stockdale ,M ,02:45:26 5 02:45:25
 6,10064, Jared Bethell ,M ,Harlow RC 02:46:43 6 02:46:40
 7,13003, Sarah Harris ,F ,35 Long Eaton RC 02:47:47 7 02:47:44
 8,13009, Rod Harris ,M ,02:47:47 8 02:47:45
 9,10033, Carl Sommer ,M ,Huncote Harriers 02:47:59 9 02:47:58
 10,10037, Peter Swaine ,M ,Charnwood AC 02:49:28 10 02:49:27
 11,10048, Pavel Toropov ,M ,02:50:41 11 02:50:41
 12,10008, Derek Dunne ,M ,45 Treasury Running Club 02:51:42 12 02:51:40
 13,10044, Matthew Nutt ,M ,Scunthorpe 02:52:20 13 02:52:15
 14,10380, Ludovic Renou ,M ,02:53:37 14 02:53:34
 15,10056, Alex Keenan ,M ,02:53:48 15 02:53:47
 ##  ---

 I am failing after that, I tried to search the expression:
 (.)*(\d{2}:\d{2}:\d{2})( )
 and replace it with: \1,\2,\3, with the result:

 ##  ---
 GunPos RaceNo Name Gender Cat Club GunTime ChipPos ChipTime
 ,02:38:40, 1 02:38:40
  ,02:41:25, 2 02:41:25
 ##  ---

 How do I fix the regular expression here? If you examine the later
 entries, some names contain a hyphen or have three parts, so other
 approaches do not work well.

 Secondly, is there a better way to handle this problem? The original
 input file is in PDF format. I copied the text and made a txt file out
 of it.

 The input txt file is attached.

 Thanks in advance for any suggestions.

 \\Cheers


Re: [R] Generate Variable Length Strings from Various Sources

2014-01-18 Thread Burhan ul haq
Hi Rainer,

Thanks for the tip.

Your suggestion works perfectly. However, in keeping with the R mantra of
avoiding for loops, I propose the following alternative:

# number of strings to be created
n <- 50

# random length of each string
v.length = sample( c( 2:4), n, rep = TRUE )

# letter sources
src.1 = LETTERS[ 1:10 ]
src.2 = LETTERS[ 11:20 ]
src.3 = "z"
src.4 = c( 1, 2 )

# turn into a list
src <- list( src.1, src.2, src.3, src.4 )

# pick one source at random, then draw len characters from it
my.g = function(len,src)
{
  my.s = src[[ sample( 1:4, 1 ) ]]
  tmp = sample(my.s,len,rep=TRUE)
  n1 = paste(tmp,collapse="")
  n1
} # end

sapply(v.length,my.g,src)
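
A variant of the same idea, sketched with illustrative names: the source for each string is drawn up front, and characters are picked by index with sample.int(). The indexing matters because sample(x, ...) treats a single number x >= 1 as 1:x, which can surprise when a source has length one.

```r
set.seed(42)  # only to make the sketch reproducible

sources <- list(LETTERS[1:10], LETTERS[11:20], "z", c("1", "2"))
n <- 50

lens <- sample(2:4, n, replace = TRUE)                   # length of each string
pick <- sample(seq_along(sources), n, replace = TRUE)    # source of each string

make_string <- function(len, i) {
  pool <- sources[[i]]
  # Sample positions, not values, so length-one pools behave as expected.
  paste(pool[sample.int(length(pool), len, replace = TRUE)], collapse = "")
}

strings <- mapply(make_string, lens, pick)
head(strings)
```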


// Cheers.



[R] Generate Variable Length Strings from Various Sources

2014-01-15 Thread Burhan ul haq
Hi,

I am trying to generate variable length strings from variable sources as
follows:

# ---- 8< ----
# Function to generate a string, given:
#   its length(passed as len)
#   and the source(passed as src)
my.f = function(len,src)
{
tmp = sample(src,len,rep=FALSE)
n1 = paste(tmp,collapse="")
n1
} # end

# count
n=50

# length of names, a variable indicating string length
v.length = sample(c(2,3,4),n,rep=TRUE)

# letter sources
src.1 = LETTERS[1:10]
src.2 = LETTERS[11:20]
src.3 = "z"
src.4 = c(1,2)

# Issue
#s.ind = sample(c(src.1,src.2),n,rep=TRUE)
s.ind = sample(c(src.1,src.3,src.4),n,rep=TRUE)

# Generate n strings, whose length is given by v.length, randomly
# using sources (src.1 to src.4)
unlist(lapply(v.length,my.f,s.ind))
# ---- 8< ----

# ISSUE - Details:
How do I randomly pass one source (src.1, src.2, src.3 or src.4) per string?
I have tried with and without quotes around the source names. With quotes,
the names themselves are sampled. Without quotes it runs, but the letters
are then drawn from a mix of all sources, such as "A" from src.1 and "z"
from src.3, whereas I want only one source at a time for each name.

# Result with quotes:
> dput(r1)
c("src.4src.1src.4", "src.1src.4src.4", "src.4src.3", "src.4src.3src.4",
"src.4src.4", "src.1src.4src.4", "src.1src.1src.4src.3",
"src.1src.1src.1src.4",
"src.4src.1src.4src.3", "src.1src.4src.4", "src.3src.1src.4",
"src.4src.3src.1", "src.1src.3src.1src.3", "src.4src.1src.1src.1",
"src.4src.3src.4", "src.3src.3src.4", "src.1src.3src.1src.1",
"src.3src.3src.1src.4", "src.1src.1src.3", "src.3src.4src.3",
"src.3src.4src.3", "src.4src.1src.4src.3", "src.1src.3src.4src.3",
"src.4src.1", "src.1src.3src.4", "src.3src.4src.3", "src.4src.3",
"src.3src.3", "src.3src.4", "src.4src.4", "src.1src.4src.1src.4",
"src.1src.4src.1", "src.3src.3", "src.3src.1src.4", "src.1src.3src.1src.3",
"src.3src.4src.1", "src.4src.3src.1", "src.1src.4src.1src.4",
"src.3src.4src.1src.4", "src.1src.3src.4src.3", "src.4src.4src.3",
"src.4src.1src.3src.1", "src.3src.3", "src.1src.4src.4", "src.4src.1src.4",
"src.3src.3", "src.1src.1", "src.3src.1src.1", "src.1src.3",
"src.3src.4src.4src.3")


# Result without quotes:
> dput(r1)
c("IGC", "B1I", "BB", "G1C", "AE", "GBE", "2DJA", "CIAG", "IGE1",
"G22", "EFD", "DGI", "BFzB", "1FI1", "JFH", "EJA", "IEzF", "FJGB",
"I2z", "IFC", "FFE", "IzJE", "FJ1I", "BI", "FJG", "EJB", "GF",
"AD", "IJ", "IE", "BCGA", "G1F", "FF", "GBB", "FGCJ", "1ID",
"FzA", "GJ12", "FC2G", "FCJ2", "zIJ", "GHFB", "AI", "EFB", "2GI",
"FF", "22", "EI1", "EG", "FC21")



Thanks in advance.


/ Cheers



Re: [R] Relative Cumulative Frequency of Event Occurence

2013-11-29 Thread Burhan ul haq
Hi Arun,

Thanks a lot. It works perfectly.

Here is the complete code, for anyone interested in seeing the relative
cumulative frequency oscillate towards the expected value:

library(ggplot2)

# Bernoulli trial where:
v.fly=c("G","B") # Outcome is Green or Blue fly
n=100 # No of Events / Trials
v.smp = seq(1:n) # Event Id
v.fst = sample(v.fly,n,rep=T) # Simulating First Draw
v.sec = sample(v.fly,n,rep=T)  # Simulating Second Draw
df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec) # Clumping in a DF
# Event occurs if the color is the same in both draws
df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE))
# Relative Frequency
df.1$Rel.Freq = with(df.1, cumsum(E.Occur==TRUE)/(seq_len(nrow(df.1))))
df.1$Rel.Freq = round(df.1$Rel.Freq,2)

ggplot(df.1, aes(x=sample,y=Rel.Freq)) +
  geom_line(col="green",size=2) +
  geom_abline(intercept=0.5,slope=0) +
  geom_point(col="blue") +
  labs(x="Sample No",y="Relative Cum Freq",
       title="Rel Cum Freq approaching 0.5 Value") +
  annotate("text",x=60,y=0.53,label="Probability of 0.5")
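
The key step, the moving denominator, can be checked in isolation on the ten-row example table from the original question. This is a small base-R sketch with illustrative variable names.

```r
# Did.E.occur? column from the ten-row example table.
occurred <- c("No", "Yes", "No", "No", "Yes",
              "No", "Yes", "Yes", "No", "Yes") == "Yes"

# Running count of occurrences divided by the running number of trials.
rel_freq <- cumsum(occurred) / seq_along(occurred)

round(rel_freq, 3)
# 0.000 0.500 0.333 0.250 0.400 0.333 0.429 0.500 0.444 0.500
```

seq_along() plays the role that seq_len(nrow(df.1)) plays in the data frame version: it supplies 1, 2, 3, ... as the number of trials seen so far.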



Cheers !


On Thu, Nov 28, 2013 at 9:40 PM, arun smartpink...@yahoo.com wrote:

 HI,
 From the dput() version of df.1, it looks like you want:
 cumsum(df.1[,4]=="Yes")/seq_len(nrow(df.1))
  [1] 0.0000000 0.5000000 0.3333333 0.2500000 0.4000000 0.3333333 0.4285714
  [8] 0.5000000 0.4444444 0.5000000


 A.K.


 On Thursday, November 28, 2013 11:26 AM, Burhan ul haq ulh...@gmail.com
 wrote:
 Hi,

 My objective is to calculate Relative (Cumulative) Frequency of Event
 Occurrence - something as follows:

 Sample.Number 1st.Fly 2nd.Fly  Did.E.occur? Relative.Cum.Frequency.of.E
 1 G B No 0.000
 2 B B Yes 0.500
 3 B G No 0.333
 4 G B No 0.250
 5 G G Yes 0.400
 6 G B No 0.333
 7 B B Yes 0.429
 8 G G Yes 0.500
 9 G B No 0.444
 10 B B Yes 0.500

 Please refer to the code below:
 ##
 # 1.
 v.fly=c("G","B") # Outcome is Green or Blue fly

 # 2.
 n=10 # No of Events / Trials

 # 3.
 v.smp = seq(1:n) # Event Id

 # 4.
 v.fst = sample(v.fly,n,rep=T) # Simulating First Draw

 # 5.
 v.sec = sample(v.fly,n,rep=T)  # Simulating Second Draw

 # 6.
 df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec) # Clumping in a DF

 # 7.
 # Event occurs if the color is the same in both draws
 df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE))

 # 8.
 df.1$Rel.Freq = with(df.1, cumsum(E.occur)/(E.Occur)) # Relative Frequency
 # This line does NOT work; the denominator part needs fixing
 ##

 Problem is with #8, specifically the part:
 cumsum(E.occur)/(E.Occur)

 The denominator E.Occur is a fixed value, instead of a moving count. I have
 tried nrow(), length() but none provides a moving version of row count, as
 cumsum does for the True values, occurring so far.

 > dput(df.1)
 structure(list(Sample.Number = 1:10, X1st.Fly = c("G", "B", "B",
 "G", "G", "G", "B", "G", "G", "B"), X2nd.Fly = c("B", "B", "G",
 "B", "G", "B", "B", "G", "B", "B"), Did.E.occur. = c("No", "Yes",
 "No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"),
 Relative.Cum.Frequency.of.E = c(0,
 0.5, 0.333, 0.25, 0.4, 0.333, 0.429, 0.5, 0.444, 0.5)), .Names =
 c("Sample.Number",
 "X1st.Fly", "X2nd.Fly", "Did.E.occur.", "Relative.Cum.Frequency.of.E"
 ), class = "data.frame", row.names = c(NA, -10L))


 Cheers !



[R] Help with Cast Function

2013-11-29 Thread Burhan ul haq
Hi,

This is the input data frame:

###
df.1 = read.table(header=T,text="
id gender WMC_alcohol WMC_caffeine WMC_no.drug RT_alcohol RT_caffeine RT_no.drug
1 1 female 3.7 3.7 3.9 488 236 371
2 2 female 6.4 7.3 7.9 607 376 349
3 3 female 4.6 7.4 7.3 643 226 412
4 4 male 6.4 7.8 8.2 684 206 252
5 5 female 4.9 5.2 7.0 593 262 439
6 6 male 5.4 6.6 7.2 492 230 464
7 7 male 7.9 7.9 8.9 690 259 327
8 8 male 4.1 5.9 4.5 486 230 305
9 9 female 5.2 6.2 7.2 686 273 327
10 10 female 6.2 7.4 7.8 645 240 498
  ")
###

This is the desired output:
###
id gender drug WMC RT
1 1 female alcohol 3.7 488
2 2 female alcohol 6.4 607
3 3 female alcohol 4.6 643
4 4 male alcohol 6.4 684
5 5 female alcohol 4.9 593
6 6 male alcohol 5.4 492
7 7 male alcohol 7.9 690
8 8 male alcohol 4.1 486
9 9 female alcohol 5.2 686
10 10 female alcohol 6.2 645
11 1 female caffeine 3.7 236
12 2 female caffeine 7.3 376
###

I know some melt and cast magic is required. But I was unable to sort it
myself.

Here are the dput versions:

Input Data Frame
###
> dput(df.1)
structure(list(id = 1:10, gender = structure(c(1L, 1L, 1L, 2L,
1L, 2L, 2L, 2L, 1L, 1L), .Label = c("female", "male"), class = "factor"),
WMC_alcohol = c(3.7, 6.4, 4.6, 6.4, 4.9, 5.4, 7.9, 4.1, 5.2,
6.2), WMC_caffeine = c(3.7, 7.3, 7.4, 7.8, 5.2, 6.6, 7.9,
5.9, 6.2, 7.4), WMC_no.drug = c(3.9, 7.9, 7.3, 8.2, 7, 7.2,
8.9, 4.5, 7.2, 7.8), RT_alcohol = c(488L, 607L, 643L, 684L,
593L, 492L, 690L, 486L, 686L, 645L), RT_caffeine = c(236L,
376L, 226L, 206L, 262L, 230L, 259L, 230L, 273L, 240L), RT_no.drug =
c(371L,
349L, 412L, 252L, 439L, 464L, 327L, 305L, 327L, 498L)), .Names =
c("id",
"gender", "WMC_alcohol", "WMC_caffeine", "WMC_no.drug", "RT_alcohol",
"RT_caffeine", "RT_no.drug"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10"))


Output Data Frame
###
> dput(df.output)
structure(list(id = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L,
1L, 2L), gender = structure(c(1L, 1L, 1L, 2L, 1L, 2L, 2L, 2L,
1L, 1L, 1L, 1L), .Label = c("female", "male"), class = "factor"),
drug = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L), .Label = c("alcohol", "caffeine"), class = "factor"),
WMC = c(3.7, 6.4, 4.6, 6.4, 4.9, 5.4, 7.9, 4.1, 5.2, 6.2,
3.7, 7.3), RT = c(488L, 607L, 643L, 684L, 593L, 492L, 690L,
486L, 686L, 645L, 236L, 376L)), .Names = c("id", "gender",
"drug", "WMC", "RT"), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12"))


Cheers !



Re: [R] Help with Cast Function

2013-11-29 Thread Burhan ul haq
Hi,


First, a big thanks to all those who replied.


I am including all the replies in one email for easier reference later:

# Input from David
# 
  reshape(df.1, idvar=1:2, sep="_", direction="long",
          varying=names(df.1)[3:8])
# 


# Input from Dennis
# 
  dfr1 <- reshape(df.1, idvar = c("id", "gender"), v.names = c("WMC", "RT"),
                  timevar = "type", times = c("alcohol", "caffeine", "no.drug"),
                  varying = list(3:5, 6:8), direction = "long")
  rownames(dfr1) <- NULL
  dfr1
# 


# Input from Arun
# 
  library(reshape2)
  library(plyr)
  join_all(lapply(c("WMC","RT"), function(x)
    transform(melt(df.1[,c(1:2,grep(x,names(df.1)))],
                   id.vars=c("id","gender"), var="drug"),
              drug=gsub(".*\\_","",drug))),
    by=c("id","gender","drug"))
# 
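
For later reference, the base reshape route can be checked end to end. The following is a minimal sketch, not one of the replies above: it rebuilds df.1 from the posted dput values and names the time column "drug" (an assumption taken from the desired output), so the result should carry the columns id, gender, drug, WMC, RT in that order.

```r
# Wide input, values copied from the posted dput(df.1)
df.1 <- data.frame(
  id = 1:10,
  gender = c("female", "female", "female", "male", "female",
             "male", "male", "male", "female", "female"),
  WMC_alcohol  = c(3.7, 6.4, 4.6, 6.4, 4.9, 5.4, 7.9, 4.1, 5.2, 6.2),
  WMC_caffeine = c(3.7, 7.3, 7.4, 7.8, 5.2, 6.6, 7.9, 5.9, 6.2, 7.4),
  WMC_no.drug  = c(3.9, 7.9, 7.3, 8.2, 7, 7.2, 8.9, 4.5, 7.2, 7.8),
  RT_alcohol   = c(488L, 607L, 643L, 684L, 593L, 492L, 690L, 486L, 686L, 645L),
  RT_caffeine  = c(236L, 376L, 226L, 206L, 262L, 230L, 259L, 230L, 273L, 240L),
  RT_no.drug   = c(371L, 349L, 412L, 252L, 439L, 464L, 327L, 305L, 327L, 498L)
)

# One list element per long-format variable; each element lists the
# wide columns in the same order as 'times'
long <- reshape(
  df.1,
  direction = "long",
  idvar     = c("id", "gender"),
  varying   = list(c("WMC_alcohol", "WMC_caffeine", "WMC_no.drug"),
                   c("RT_alcohol", "RT_caffeine", "RT_no.drug")),
  v.names   = c("WMC", "RT"),
  timevar   = "drug",
  times     = c("alcohol", "caffeine", "no.drug")
)
rownames(long) <- NULL   # drop the "1.alcohol"-style row names

head(long, 12)           # compare with the desired output in the question
```

Spelling out varying, v.names and times is more verbose than relying on sep="_" to split the column names, but it does not depend on how reshape guesses the split.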


Cheers !


On Sat, Nov 30, 2013 at 1:20 AM, David Winsemius dwinsem...@comcast.net wrote:


 On Nov 29, 2013, at 9:42 AM, Burhan ul haq wrote:

  Hi,
 
  This is the input data frame:
 
  ###
  df.1 = read.table(header=T,text="
  id gender WMC_alcohol WMC_caffeine WMC_no.drug RT_alcohol RT_caffeine RT_no.drug
  1 1 female 3.7 3.7 3.9 488 236 371
  2 2 female 6.4 7.3 7.9 607 376 349
  3 3 female 4.6 7.4 7.3 643 226 412
  4 4 male 6.4 7.8 8.2 684 206 252
  5 5 female 4.9 5.2 7.0 593 262 439
  6 6 male 5.4 6.6 7.2 492 230 464
  7 7 male 7.9 7.9 8.9 690 259 327
  8 8 male 4.1 5.9 4.5 486 230 305
  9 9 female 5.2 6.2 7.2 686 273 327
  10 10 female 6.2 7.4 7.8 645 240 498
  ")
  ###
 
  This is the desired output:
  ###
  id gender drug WMC RT
  1 1 female alcohol 3.7 488
  2 2 female alcohol 6.4 607
  3 3 female alcohol 4.6 643
  4 4 male alcohol 6.4 684
  5 5 female alcohol 4.9 593
  6 6 male alcohol 5.4 492
  7 7 male alcohol 7.9 690
  8 8 male alcohol 4.1 486
  9 9 female alcohol 5.2 686
  10 10 female alcohol 6.2 645
  11 1 female caffeine 3.7 236
  12 2 female caffeine 7.3 376
  ###
 
  I know some melt and cast magic is required. But I was unable to sort it
  myself.
 
 # this is base::reshape

 reshape(df.1, idvar=1:2, sep="_", direction="long",
         varying=names(df.1)[3:8])


 
  Cheers !
 

 David Winsemius
 Alameda, CA, USA





Re: [R] Relative Cumulative Frequency of Event Occurence

2013-11-29 Thread Burhan ul haq
Hi Arun,

Thanks again. Comment noted :)

Amazing use of regular expressions in your solutions. Is there any reference
or book you would recommend?


Cheers !


On Fri, Nov 29, 2013 at 10:56 PM, arun smartpink...@yahoo.com wrote:

 Hi Burhan,

 No problem.  One suggestion in this code would be:
   with(df.1, cumsum(E.Occur==TRUE)/(seq_len(nrow(df.1))))  ## ==TRUE is not needed
  identical(with(df.1, cumsum(E.Occur)/(seq_len(nrow(df.1)))),
            with(df.1, cumsum(E.Occur==TRUE)/(seq_len(nrow(df.1)))))


  is.logical(TRUE)
 #[1] TRUE


 is.logical("Yes")
 #[1] FALSE
 A.K.






 On Friday, November 29, 2013 12:36 PM, Burhan ul haq ulh...@gmail.com
 wrote:

 Hi Arun,

 Thanks a lot. It works perfectly.

 Here is the complete code - for all those who are interested to see Rel
 Cum Freq oscillating to reach the Expected Value

 library(ggplot2)

 # Bernoulli trial where:
 v.fly=c("G","B") # Outcome is Green or Blue fly
 n=100 # No of Events / Trials
 v.smp = seq(1:n) # Event Id
 v.fst = sample(v.fly,n,rep=T) # Simulating First Draw
 v.sec = sample(v.fly,n,rep=T)  # Simulating Second Draw
 df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec) # Clumping in a DF
 df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE)) # Event occurs if color is the same in both draws
 df.1$Rel.Freq = with(df.1, cumsum(E.Occur==TRUE)/(seq_len(nrow(df.1)))) # Relative Frequency
 df.1$Rel.Freq = round(df.1$Rel.Freq,2)

 ggplot(df.1, aes(x=sample,y=Rel.Freq)) +
   geom_line(col="green",size=2) +
   geom_abline(intercept=0.5,slope=0) +
   geom_point(col="blue") +
   labs(x="Sample No",y="Relative Cum Freq",title="Rel Cum Freq approaching 0.5 Value") +
   annotate("text",x=60,y=0.53,label="Probability of 0.5")
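
The same simulation can be run without the plotting step to see the numbers directly. This is a minimal sketch under stated assumptions: the seed value is arbitrary and only there for reproducibility, and the variable names are illustrative rather than taken from the code above.

```r
set.seed(1)                 # arbitrary seed, assumed here for reproducibility
n   <- 100                  # number of trials
fly <- c("G", "B")          # green or blue fly

fst <- sample(fly, n, replace = TRUE)   # first draw
sec <- sample(fly, n, replace = TRUE)   # second draw

E.Occur  <- fst == sec                   # event: both draws show the same color
Rel.Freq <- cumsum(E.Occur) / seq_len(n) # running count / trials so far

# The last running value is, by construction, the overall proportion
c(last = Rel.Freq[n], overall = mean(E.Occur))
```

Comparing fst and sec directly already yields a logical vector, so neither ifelse() nor ==TRUE is needed.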



 Cheers !



 On Thu, Nov 28, 2013 at 9:40 PM, arun smartpink...@yahoo.com wrote:

 HI,
 From the dput() version of df.1, it looks like you want:
 cumsum(df.1[,4]=="Yes")/seq_len(nrow(df.1))
  [1] 0.0000000 0.5000000 0.3333333 0.2500000 0.4000000 0.3333333 0.4285714
  [8] 0.5000000 0.4444444 0.5000000
 
 
 A.K.
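
The idea behind this one-liner is that the running count of events comes from cumsum() and the moving denominator is simply the trial index. A small sketch, assuming only the Did.E.occur. values posted in the dput() of df.1 (the variable names below are illustrative):

```r
# Did.E.occur. column as posted in the question
occurred <- c("No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")

events_so_far <- cumsum(occurred == "Yes")  # 0 1 1 1 2 2 3 4 4 5
trials_so_far <- seq_along(occurred)        # 1 2 3 ... 10, the moving row count

rel_freq <- events_so_far / trials_so_far
round(rel_freq, 3)   # matches the Relative.Cum.Frequency.of.E column
```

nrow() and length() return a single number, which is why they cannot serve as a moving count; seq_along() or seq_len(nrow(df.1)) returns one value per row.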
 
 
 
 
 
 




[R] Relative Cumulative Frequency of Event Occurence

2013-11-28 Thread Burhan ul haq
Hi,

My objective is to calculate Relative (Cumulative) Frequency of Event
Occurrence - something as follows:

Sample.Number 1st.Fly 2nd.Fly  Did.E.occur? Relative.Cum.Frequency.of.E
1 G B No 0.000
2 B B Yes 0.500
3 B G No 0.333
4 G B No 0.250
5 G G Yes 0.400
6 G B No 0.333
7 B B Yes 0.429
8 G G Yes 0.500
9 G B No 0.444
10 B B Yes 0.500

Please refer to the code below:
##
# 1.
v.fly=c("G","B") # Outcome is Green or Blue fly

# 2.
n=10 # No of Events / Trials

# 3.
v.smp = seq(1:n) # Event Id

# 4.
v.fst = sample(v.fly,n,rep=T) # Simulating First Draw

# 5.
v.sec = sample(v.fly,n,rep=T)  # Simulating Second Draw

# 6.
df.1 = data.frame(sample = v.smp, fst=v.fst, sec = v.sec) # Clumping in a DF

# 7.
df.1$E.Occur = with(df.1, ifelse(fst==sec,TRUE,FALSE)) # Event occurs if color is the same in both draws

# 8.
df.1$Rel.Freq = with(df.1, cumsum(E.Occur)/(E.Occur)) # Relative Frequency
# This line does NOT work; the denominator part needs fixing
##

Problem is with #8, specifically the part:
cumsum(E.Occur)/(E.Occur)

The denominator E.Occur is a fixed value, instead of a moving count. I have
tried nrow(), length() but none provides a moving version of row count, as
cumsum does for the True values, occurring so far.

dput(df.1)
structure(list(Sample.Number = 1:10, X1st.Fly = c("G", "B", "B",
"G", "G", "G", "B", "G", "G", "B"), X2nd.Fly = c("B", "B", "G",
"B", "G", "B", "B", "G", "B", "B"), Did.E.occur. = c("No", "Yes",
"No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes"),
Relative.Cum.Frequency.of.E = c(0,
0.5, 0.333, 0.25, 0.4, 0.333, 0.429, 0.5, 0.444, 0.5)), .Names = c("Sample.Number",
"X1st.Fly", "X2nd.Fly", "Did.E.occur.", "Relative.Cum.Frequency.of.E"
), class = "data.frame", row.names = c(NA, -10L))


Cheers !



[R] Generating Frequency Values

2013-11-26 Thread Burhan ul haq
Hi,

My problem is as follows:

INPUT:
Frequency from one column and  value of Piglets from another one

OUTPUT:
Repeat this Piglet value as per the Frequency
i.e.
Piglet 1, Frequency 3, implies 1,1,1
Piglet 7, Frequency 2, implies 7,7

SOLUTION:
This is what I have tried so far:

1. A helper function:
 dput(fn.1)
function (df.1, vt.1)
{
i = c(1)
for (i in seq_along(dim(df.1)[1])) {
print(i)
temp = rep(df.1$Piglets[i], df.1$Frequency[i])
append(vt.1, values = temp)
}
}

2. A dummy data frame:
dput(df.1)
structure(list(Piglets = 5:14, Frequency = c(1L, 0L, 2L, 3L,
3L, 9L, 8L, 5L, 3L, 2L)), .Names = c("Piglets", "Frequency"), class =
"data.frame", row.names = c(NA,
-10L))

3. A dummy vector to hold results:
 dput(vt.1)
1

4. Finally the function call:
fn.1(df.1, vt.1)

5. The result is:
[1] 1

PROBLEM:
The result is not a repetition of Piglet value as per their respective
frequencies.



Thanks in advance for guidance and help.

CheeRs !


p.s I have used caps for my heading / sections, nothing else is implied by
their use.



Re: [R] Generating Frequency Values

2013-11-26 Thread Burhan ul haq
Hi,

A big thanks to everyone who replied. But special ones to Berend for
pointing out my mistakes, that will really help me in future.



Cheers !


On Tue, Nov 26, 2013 at 11:19 PM, Berend Hasselman b...@xs4all.nl wrote:


 On 26-11-2013, at 15:59, Burhan ul haq ulh...@gmail.com wrote:

  Hi,
 
  My problem is as follows:
 
  INPUT:
  Frequency from one column and  value of Piglets from another one
 
  OUTPUT:
  Repeat this Piglet value as per the Frequency
  i.e.
  Piglet 1, Frequency 3, implies 1,1,1
  Piglet 7, Frequency 2, implies 7,7
 
  SOLUTION:
  This is what I have tried so far:
 
  1. A helper function:
  dput(fn.1)
  function (df.1, vt.1)
  {
 i = c(1)
 for (i in seq_along(dim(df.1)[1])) {
 print(i)
 temp = rep(df.1$Piglets[i], df.1$Frequency[i])
 append(vt.1, values = temp)
 }
  }
 

 There is a lot wrong with your function.
 You should assign the result of append to vt.1
 The function should return vt.1
 Use seq_len instead of seq_along.

 The function should be something like this

 fn.1 <- function (df.1, vt.1)
 {
     for (i in seq_len(length.out=dim(df.1)[1])) {
         print(i)
         temp = rep(df.1$Piglets[i], df.1$Frequency[i])
         vt.1 <- append(vt.1, values = temp)
     }
     vt.1
 }

 But Sarah’s solution is the way to go.

 Berend
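
Sarah's reply is not reproduced in this thread. A vectorised call along the following lines is presumably what is meant; treat it as a sketch under that assumption, with the data frame rebuilt from the posted dput values and the names expanded and looped chosen for illustration.

```r
# Data frame from the question
df.1 <- data.frame(Piglets   = 5:14,
                   Frequency = c(1L, 0L, 2L, 3L, 3L, 9L, 8L, 5L, 3L, 2L))

# rep() is vectorised over 'times': each Piglets value is repeated
# as often as its Frequency says (a frequency of 0 drops the value)
expanded <- rep(df.1$Piglets, times = df.1$Frequency)

# Loop version for comparison, starting from an empty vector
looped <- integer(0)
for (i in seq_len(nrow(df.1))) {
  looped <- append(looped, rep(df.1$Piglets[i], df.1$Frequency[i]))
}

identical(expanded, looped)   # both approaches give the same vector
table(expanded)               # counts per litter size, mirrors Frequency
```

Note that calling the corrected fn.1 with the dummy vector vt.1 = 1 would leave that 1 at the front of the result; starting from an empty vector avoids this.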




