Re: [R] Create a categorical variable using the deciles of data

2022-06-15 Thread Richard O'Keefe
I had the advantage of studying
   Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988)
   "The New S Language".  Wadsworth & Brooks/Cole.
before starting to use R.  That book was reissued by CRC Press
only a few years ago.  It's *still* a pretty darned good intro
to R.  cut, pretty, and quantile are all there.  They are
pretty basic.
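
For anyone who has not met these functions, here is a minimal sketch of the quantile-then-cut idea. The data are simulated and the labels D1..D10 are my own invention, not anything from the original post.

```r
# Simulated data standing in for the poster's variable (an assumption).
set.seed(1)
x <- rnorm(100)

# Decile boundaries: the 0%, 10%, ..., 100% quantiles of x.
breaks <- quantile(x, probs = seq(0, 1, by = 0.1))

# cut() turns the numeric vector into a factor with one level per decile;
# include.lowest = TRUE keeps the minimum value inside the first interval.
decile <- cut(x, breaks = breaks, include.lowest = TRUE,
              labels = paste0("D", 1:10))

table(decile)
```

With heavily tied data the quantiles may not be distinct, in which case cut() will complain about non-unique breaks and the grouping needs more thought.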

I strongly recommend looking for a copy of that book and
skimming through appendix 1 repeatedly until you have a rough
idea of what's there.  A lot has been added since then, and R
never did search the current working directory as part of the
environment, but a lot has NOT changed.  category(..) has gone,
that's the main difference I recall.


On Wed, 15 Jun 2022 at 01:49, Ebert,Timothy Aaron  wrote:

> A problem in R is that there are several dozen ways to do any of these
> basic activities. I used the approach that I could get to work the fastest
> and tried to make it somewhat general. I do not know functions like ?pretty
> that Rui used, nor ?quantile or ?cut. It is a difficulty in learning R
> where the internet or sites like this one are the "teacher." There are
> plenty of books, but they too take one approach to solve a problem rather
> than "here is a problem" and "these are all possible solutions." I
> appreciate seeing alternative solutions.
>
> Tim
>
> -Original Message-
> From: R-help  On Behalf Of Richard O'Keefe
> Sent: Tuesday, June 14, 2022 9:08 AM
> To: anteneh asmare 
> Cc: R Project Help 
> Subject: Re: [R] Create a categorical variable using the deciles of data
>
> [External Email]
>
> Can you explain why you are not using
> ?quantile
> to find the deciles then
> ?cut
> to construct the factor?
> What have I misunderstood?
>
> On Tue, 14 Jun 2022 at 23:29, anteneh asmare  wrote:
>
> > I want to create a categorical variable using the deciles of the
> > following data frame to divide the individuals into 10 equal groups.
> > I tried the following code:
> > data_catigocal<-data.frame(c(1:5))
> > # create categorical vector using deciles
> > group_vector <-
> > c('0-10','11-20','21-30','31-40','41-50','51-60','61-70','71-80','81-90','91-100')
> > # Add categorical variable to the data_catigocal
> > data_catigocal$decile <- factor(group_vector)
> > # print data frame
> > data_catigocal
> >
> > can any one help me with the r code
> > Kind regards,
> > Hana
> >
> >
>
>

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Is there a package that can do Fuzzy name matching to standardize names in a single column

2022-06-15 Thread Ashim Kapoor
Dear Gregg,

This is what I  meant :-

> df1
             Names
1        John Good
2      Joe Jackson
3    Bob A. Barker
4     John B. Good
5   Joe J. Jackson
6 Bob Allen Barker
7        John Good
8 Joe Jack Johnson
9       Bob Barker

> stringdist_left_join(df1,df1,by="Names",max_dist = 3)
            Names.x          Names.y
1         John Good        John Good
2         John Good     John B. Good
3         John Good        John Good
4       Joe Jackson      Joe Jackson
5       Joe Jackson   Joe J. Jackson
6     Bob A. Barker    Bob A. Barker
7     Bob A. Barker       Bob Barker
8      John B. Good        John Good
9      John B. Good     John B. Good
10     John B. Good        John Good
11   Joe J. Jackson      Joe Jackson
12   Joe J. Jackson   Joe J. Jackson
13 Bob Allen Barker Bob Allen Barker
14        John Good        John Good
15        John Good     John B. Good
16        John Good        John Good
17 Joe Jack Johnson Joe Jack Johnson
18       Bob Barker    Bob A. Barker
19       Bob Barker       Bob Barker
>


You can join a table to itself while tinkering with the max_dist argument.
Please notice the clusters that have formed. This has to be cleaned up.

This is similar to the answer by Jan van der Laan.
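
One possible way to do the clean-up, sketched with base R only: compute edit distances between the unique spellings, cluster them, and replace every member of a cluster with its most frequent spelling. The threshold of 3 is simply carried over from max_dist above, and the column name Standard is made up for illustration. Ties are resolved in favour of the spelling that appears first.

```r
# Names as in df1 above.
df1 <- data.frame(Names = c("John Good", "Joe Jackson", "Bob A. Barker",
                            "John B. Good", "Joe J. Jackson",
                            "Bob Allen Barker", "John Good",
                            "Joe Jack Johnson", "Bob Barker"))

# Pairwise generalized Levenshtein distances between the unique spellings.
u <- unique(df1$Names)
d <- adist(u)
rownames(d) <- colnames(d) <- u

# Single-linkage clustering, cut at an edit distance of 3 (assumed threshold).
cl <- cutree(hclust(as.dist(d), method = "single"), h = 3)

# Within each cluster, take the most frequent spelling as the standard form.
counts <- table(df1$Names)
standard <- tapply(u, cl, function(nm) nm[which.max(counts[nm])])

# Map every row back to the standard spelling of its cluster.
df1$Standard <- as.vector(standard[as.character(cl[match(df1$Names, u)])])
df1
```

Single linkage can chain unrelated names together on a large table, so the clusters still need a manual look before anything is overwritten.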

Best Regards,
Ashim

On Wed, Jun 15, 2022 at 9:13 PM Gregg Powell  wrote:
>
>
> Hello Ashim and kind regards for you taking the time to answer back.
>
>
> > library(fuzzyjoin)
> > ?stringdist_left_join
>
> -this will join two tables, but what I am trying to do is just standardize 
> the similarly spelled duplicate names in just the first column of a single 
> table.
>
> I don't think fuzzyjoin will help me in that regard.
>
> Thanks.
> Gregg
> Arizona, USA
>
> --- Original Message ---
> On Wednesday, June 15th, 2022 at 8:04 AM, Ashim Kapoor 
>  wrote:
>
>
> >
>
> >
>
> > Dear Gregg,
> >
>
> > Check this out:
> >
>
> > library(fuzzyjoin)
> > ?stringdist_left_join
> >
>
> > Best Regards,
> > Ashim
> >
>
> > On Wed, Jun 15, 2022 at 8:28 PM Gregg Powell via R-help
> > r-help@r-project.org wrote:
> >
>
> > > Have data sets where there are names, in the first column, client names 
> > > in the second, and Client start date in the third.
> > >
>
> > > There are thousands of these records with thousands of 
> > > names/clients/client start dates. The name is entered each time the 
> > > person begins with a new client such that each person has many entries in 
> > > the name column. Often the names were not entered in a consistent way. 
> > > With and without middle initial, middle name, or various abbreviations 
> > > such as ",RN" at the end of the name.
> > >
>
> > > Is there a package that can do fuzzy name matching so that the names in 
> > > name column get replaced with a "standardized" format - where some type 
> > > of machine learning can pick the most common spelling of each repeat name 
> > > and replace the different variations with the common spelling?
> > >
>
> > > I included an example below. First table includes the names with the 
> > > various spellings. Second table depicts what I hope to achieve.
> > >
>
> > > Again - this is on a large scale - there are something like 10,000 
> > > records with names that need to be standardized.
> > >
>
> > > Name              Client    Client Start Date
> > > John Good         Client 1  1/1/2020
> > > Joe Jackson       Client 2  6/1/2020
> > > Bob A. Barker     Client 3  8/1/2020
> > > John B. Good      Client 4  10/1/2020
> > > Joe J. Jackson    Client 5  12/1/2020
> > > Bob Allen Barker  Client 6  1/1/2021
> > > John Good         Client 7  5/1/2021
> > > Joe Jack Jackson  Client 8  8/1/2021
> > > Bob Barker        Client 9  12/1/2021
> > >
> > > Name              Client    Client Start Date
> > > John Good         Client 1  1/1/2020
> > > Joe J. Jackson    Client 2  6/1/2020
> > > Bob A. Barker     Client 3  8/1/2020
> > > John Good         Client 4  10/1/2020
> > > Joe J. Jackson    Client 5  12/1/2020
> > > Bob A. Barker     Client 6  1/1/2021
> > > John Good         Client 7  5/1/2021
> > > Joe J. Jackson    Client 8  8/1/2021
> > > Bob A. Barker     Client 9  12/1/2021
> > >
>
> > > THANKS!
> > >
>
> > > Gregg Powell
> > >
>
> > > Arizona, USA

Re: [R] How do I "teach" R that columns 1, 2, and 3 are the Year-Month-Day

2022-06-15 Thread Bert Gunter
Look again -- perhaps in your spam folder.

Here's what he said:

"
Reprex:

dta <- read.table( text =
"Yr  Mo Dy   Fuel
2021  7 25   50.45
2021  8 27   61.48
2021  9 26   59.07
2021 11  4   55.40
2021 11 22   30.63
2021 11 26   41.35
2021 12  6   32.81
2022  1 14   49.86
2022  4 29   62.99
2022  6 11   89.37
", header=TRUE )
dta$Dtm <- with( dta, as.Date( ISOdate( Yr, Mo, Dy ) ) )
with( dta, plot( Dtm, Fuel ) )

The ISOdate function returns a POSIXct which includes time-of-day.
Analyses that don't need time can instead rely on the Date type to
avoid issues with timezones. "
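
For the character dates in the original question, a small sketch along the same lines (my own illustration, not part of Jeff's post): as.Date() parses "YYYY-MM-DD" strings directly, and the ISOdate() lines show the POSIXct versus Date distinction he describes.

```r
# Values as typed in the original question.
Fuel <- c(50.45, 61.48, 59.07, 55.40, 30.63, 41.35, 32.81, 49.86, 62.99,
          89.37)
Dates <- as.Date(c("2021-07-25", "2021-08-27", "2021-09-26", "2021-11-04",
                   "2021-11-22", "2021-11-26", "2021-12-06", "2022-01-14",
                   "2022-04-29", "2022-06-11"))

# Dates is now of class "Date", so plot() draws a proper time axis.
class(Dates)
plot(Dates, Fuel)

# ISOdate() returns a POSIXct (date plus time of day, noon GMT by default);
# as.Date() drops the time-of-day part.
stamp <- ISOdate(2021, 7, 25)
class(stamp)
as.Date(stamp)
```

For strings in another layout, as.Date() takes a format argument, for example format = "%m/%d/%Y".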


Bert Gunter

On Wed, Jun 15, 2022 at 9:58 AM Gregory Coats via R-help
 wrote:
>
> I do not see any posting on this topic from Jeff Newmiller.
> I seek a way to “teach” R that "2021-07-25” represents a Year, Month, and Day.
> Greg Coats
>
> > Fuel <- c(50.45, 61.48, 59.07, 55.40, 30.63, 41.35, 32.81, 49.86, 62.99, 
> > 89.37)
> > plot (Fuel)
> > Dates <- c("2021-07-25", "2021-08-27", "2021-09-26", "2021-11-04", 
> > "2021-11-22", "2021-11-26", "2021-12-06", "2022-01-14", "2022-04-29", 
> > "2022-06-11")
> > plot (Dates)
> Error in plot.window(...) : need finite 'ylim' values
> In addition: Warning messages:
> 1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
> 2: In min(x) : no non-missing arguments to min; returning Inf
> 3: In max(x) : no non-missing arguments to max; returning -Inf
>
> > xyplot (Dates, Fuel)
> Error in xyplot(Dates, Fuel) : could not find function "xyplot"
>
> > On Jun 15, 2022, at 6:02 AM, Martin Maechler  
> > wrote:
> > Jeff Newmiller's answer which was much shorter *and* only
> > used base R  instead of an extra package
>
>



Re: [R] Model Comparision for case control studies in R

2022-06-15 Thread anteneh asmare
Dear Tim, Thanks. The first vector
y<-c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1) is the disease status y
(1=Case, 0=Control). The covariates age, smoking status and hypertension
are independent (uncorrelated). Unconditional logistic regression
will be used, but I need to compare other models with logistic regression
rather than fitting only the logistic regression.
There is no matching in the data, so conditional logistic regression does not apply.
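
To make the request concrete, here is a rough sketch of one such comparison. The choice of a probit model as the second method and of AIC as the yardstick are assumptions for illustration only; the names dat, fit_logit and fit_probit are made up.

```r
# Sample data from the original post.
y <- c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1)
age <- c(45,23,56,67,23,23,28,56,45,47,36,37,33,35,38,39,43,28,39,41)
smoking <- c(0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,1,1,1,0,1)
hypertension <- c(1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,1,0,0,1,0)
dat <- data.frame(y, age, smoking, hypertension)

# Method 1: unconditional logistic regression.
fit_logit <- glm(y ~ age + factor(smoking) + factor(hypertension),
                 data = dat, family = binomial(link = "logit"))

# Method 2 (assumed alternative): the same predictors with a probit link.
fit_probit <- glm(y ~ age + factor(smoking) + factor(hypertension),
                  data = dat, family = binomial(link = "probit"))

# Side-by-side AIC; lower values indicate the better trade-off of fit
# against complexity on this data set.
aic <- AIC(fit_logit, fit_probit)
aic
```

With only 20 observations the difference between the two is unlikely to mean much, so this shows the mechanics rather than a recommendation.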
Best,
Hana
On 6/15/22, Ebert,Timothy Aaron  wrote:
> Disease status is missing from the sample data.
> Are age, disease, smoking, and/or hypertension correlated in any way or are
> they independent (correlation=0)?
> Are the correlations large enough to adversely influence your model?
> Tim
>
> -Original Message-
> From: R-help  On Behalf Of anteneh asmare
> Sent: Wednesday, June 15, 2022 7:29 AM
> To: r-help@r-project.org
> Subject: [R] Model Comparision for case control studies in R
>
> [External Email]
>
> y<-c(0,1,1,0,0,1,0,0,1,1,1,0,1,1,1,0,0,0,0,1)
> age<-c(45,23,56,67,23,23,28,56,45,47,36,37,33,35,38,39,43,28,39,41)
> smoking<-c(0,1,1,1,0,0,0,0,0,1,1,0,0,1,0,1,1,1,0,1)
> hypertension<-c(1,1,0,1,0,1,0,1,1,0,1,1,1,1,1,1,0,0,1,0)
> data<-data.frame(y,age,smoking,hypertension)
> data
> model<-glm(y~age+factor(smoking)+factor(hypertension), data, family =
> binomial(link = "logit"),na.action = na.omit)
> summary(model)
> from above sample data I want to study a case-control study on male
> individuals with my response variable y, disease status (1=Case,
> 0=Control) with covariates age, smoking status(1=Yes, 0=No)  and
> hypertension, hypertensive (1=Yes, 0=No). I want to fit the model to predict
> the disease status using at least two different methods. And to make model
> comparisons. I think logistic regression will be the best fit for this case
> control study. Do we have other options in addition to logistic regression?
> My objective is to fit the model to predict the disease status using at
> least two different methods.
> Kind regards,
> Hana
>
>



Re: [R] How do I "teach" R that columns 1, 2, and 3 are the Year-Month-Day

2022-06-15 Thread Gregory Coats via R-help
I do not see any posting on this topic from Jeff Newmiller.
I seek a way to “teach” R that "2021-07-25” represents a Year, Month, and Day.
Greg Coats

> Fuel <- c(50.45, 61.48, 59.07, 55.40, 30.63, 41.35, 32.81, 49.86, 62.99, 
> 89.37)
> plot (Fuel)
> Dates <- c("2021-07-25", "2021-08-27", "2021-09-26", "2021-11-04", 
> "2021-11-22", "2021-11-26", "2021-12-06", "2022-01-14", "2022-04-29", 
> "2022-06-11")
> plot (Dates)
Error in plot.window(...) : need finite 'ylim' values
In addition: Warning messages:
1: In xy.coords(x, y, xlabel, ylabel, log) : NAs introduced by coercion
2: In min(x) : no non-missing arguments to min; returning Inf
3: In max(x) : no non-missing arguments to max; returning -Inf

> xyplot (Dates, Fuel)
Error in xyplot(Dates, Fuel) : could not find function "xyplot"

> On Jun 15, 2022, at 6:02 AM, Martin Maechler  
> wrote:
> Jeff Newmiller's answer which was much shorter *and* only
> used base R  instead of an extra package




Re: [R] Is there a package that can do Fuzzy name matching to standardize names in a single column

2022-06-15 Thread Chris Evans
This isn't my expert area but I have at times encountered issues relating to it 
and I think this isn't "just"
(as in "just standardize the similarly spelled duplicate names"). I once 
thought about trying to work out how
many names I have in citations to my work. Over the years I have seen my name 
as:
Chris Evans
Evans, Chris
Christopher Evans
Evans, Christopher
C.D.H.Evans
Evans, C.D.H.
and a great one that a bank once gave me: DR CHRISTOPHE D EVANS (honestly ... 
why?)

Then there are all the misspellings as you say.  Back in the days of snail mail 
reprint requests I used to
get teased about getting a fair few addressed to "Christ Evans".

Then there are things that add permutations of my qualifications (OK, perhaps 
not in your data but you have 
the "Jr." and perhaps "III" or the like.  I think these are more common in the 
USA than the UK.)

I also suspect that having names of "non-English" origins in there may 
complicate things too.  I still get
Spanish naming conventions wrong and know that the default order of given name 
/ family name is reversed
in Japanese but that many Japanese know that much of the world won't know this 
so reverse their name order
for things going outside Japan.

I think there's nothing trivial or "just" about doing this but I suspect there 
are established, accepted,
and always fallible ways of doing it but I have a nasty suspicion that some are 
proprietary and not at all
open source.

I think you may have to start with the issue of commas: are they being used 
before terminal qualifiers 
(", Jr.", ", Dr." ...) or are they reversing family name and given name 
("Evans, Chris")?  I might start
by counting the numbers of commas in the entries and hoping it's always zero or 
one.   If it is, I would
then look at the parts after the commas and see if I could get a list of common 
terminal qualifiers so I
would know they were not being treated as names (I know a man called Doctor 
Ronnie Doctor, but I suspect
he is never typed in as "Ronnie Doctor, Doctor"!)

If you have reversed given/family names you might want to try generating all 
the reversals and looking
for matches.
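
A rough base R sketch of the comma step, in case it helps. The example names and the list of qualifiers are invented for illustration and would have to be built from the real data.

```r
# Hypothetical entries showing the two uses of a comma.
nm <- c("Chris Evans", "Evans, Chris", "Evans, C.D.H.",
        "John Good, RN", "Joe Smith, Jr.")

# Number of commas per entry; anything above one needs a manual look.
n_commas <- lengths(regmatches(nm, gregexpr(",", nm, fixed = TRUE)))

# Hypothetical list of terminal qualifiers; extend after inspecting the data.
qualifiers <- c("RN", "Jr.", "Sr.", "Dr.", "II", "III", "IV")

# Text before and after the comma ("" when there is no single comma).
before <- trimws(sub(",.*$", "", nm))
after <- ifelse(n_commas == 1, trimws(sub("^[^,]*,", "", nm)), "")

# A single comma followed by something that is not a qualifier is treated
# as "Family, Given" and swapped back to "Given Family".
is_reversed <- n_commas == 1 & !(after %in% qualifiers)
cleaned <- ifelse(is_reversed, paste(after, before), before)

data.frame(nm, n_commas, cleaned)
```

Entries with two or more commas are only truncated at the first comma here, so they should be pulled out and checked by hand.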

Then I might start to drill into full stops abbreviating names ("C. Evans", 
"Evans, C.", "Evans, C.D.H.") 
and what about "Evans, CDH"?  Can you assume that text segments all in upper 
case can be split, i.e. 
always translate "CDH" into "C.D.H.".  But then you have to deal with "II", 
"III" and even "IV" I guess.
Good job you're not doing British or French monarchs: "Henry VIII" and "Louis 
XVI" (I am not sure if 
you can change your name to "Henry VIII" by deed poll in the UK.  I do know you 
can't change it to "Jesus
Christ".)
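
A sketch of the initials idea, under the assumption that any all-capital token that is not a Roman numeral is a run of initials. The function name expand_initials is made up.

```r
# Turn all-capital tokens such as "CDH" into "C.D.H.", leaving tokens that
# look like Roman numerals (built from I, V and X only) untouched.
expand_initials <- function(x) {
  tokens <- strsplit(x, " ", fixed = TRUE)[[1]]
  is_caps <- grepl("^[A-Z]{2,}$", tokens) & !grepl("^[IVX]+$", tokens)
  tokens[is_caps] <- vapply(
    tokens[is_caps],
    function(tok) paste0(paste(strsplit(tok, "")[[1]], collapse = "."), "."),
    character(1)
  )
  paste(tokens, collapse = " ")
}

expand_initials("Evans, CDH")
expand_initials("Henry VIII")
```

The assumption breaks for entries typed entirely in capitals, like the bank's version of my name, and for anyone whose initials happen to be I, V or X, so it can only be a first pass.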

One tangential thing that might help is if you have other demographics: you 
might want to see if gender
(though it can change), age (will change), d.o.b. (shouldn't) might help you 
disaggregate some matches.

Enough already! Challenging stuff.

Very best (all),

Chris


- Original Message -
> From: "Gregg Powell via R-help" 
> To: "Ashim Kapoor" 
> Cc: "r-help" , h...@r-project.org, 
> "r-help-requ...@lists.r-project.org"
> 
> Sent: Wednesday, 15 June, 2022 17:43:14
> Subject: Re: [R] Is there a package that can do Fuzzy name matching to 
> standardize names in a single column

> Hello Ashim and kind regards for you taking the time to answer back.
> 
> 
>> library(fuzzyjoin)
>> ?stringdist_left_join
> 
> -this will join two tables, but what I am trying to do is just standardize the
> similarly spelled duplicate names in just the first column of a single table.
> 
> I don't think fuzzyjoin will help me in that regard.
> 
> Thanks.
> Gregg
> Arizona, USA
> 
> --- Original Message ---
> On Wednesday, June 15th, 2022 at 8:04 AM, Ashim Kapoor 
> wrote:
> 
> 
>> 
> 
>> 
> 
>> Dear Gregg,
>> 
> 
>> Check this out:
>> 
> 
>> library(fuzzyjoin)
>> ?stringdist_left_join
>> 
> 
>> Best Regards,
>> Ashim
>> 
> 
>> On Wed, Jun 15, 2022 at 8:28 PM Gregg Powell via R-help
>> r-help@r-project.org wrote:
>> 
> 
>> > Have data sets where there are names, in the first column, client names in 
>> > the
>> > second, and Client start date in the third.
>> > 
> 
>> > There are thousands of these records with thousands of names/clients/client
>> > start dates. The name is entered each time the person begins with a new 
>> > client
>> > such that each person has many entries in the name column. Often the names 
>> > were
>> > not entered in a consistent way. With and without middle initial, middle 
>> > name,
>> > or various abbreviations such as ",RN" at the end of the name.
>> > 
> 
>> > Is there a package that can do fuzzy name matching so that the names in 
>> > name
>> > column get replaced with a "standardized" format - where some type of 
>> > machine
>> > learning can pick the most common spelling of each repeat name and replace 
>> > the
>> > different variations with the common spelling?
>> > 
> 
>> > I included an example below. First table includes the names with the 
>> 

Re: [R] Is there a package that can do Fuzzy name matching to standardize names in a single column

2022-06-15 Thread Gregg Powell via R-help

Hello Ashim and kind regards for you taking the time to answer back.


> library(fuzzyjoin)
> ?stringdist_left_join

-this will join two tables, but what I am trying to do is just standardize the 
similarly spelled duplicate names in just the first column of a single table.

I don't think fuzzyjoin will help me in that regard.

Thanks.
Gregg
Arizona, USA

--- Original Message ---
On Wednesday, June 15th, 2022 at 8:04 AM, Ashim Kapoor  
wrote:


> 

> 

> Dear Gregg,
> 

> Check this out:
> 

> library(fuzzyjoin)
> ?stringdist_left_join
> 

> Best Regards,
> Ashim
> 

> On Wed, Jun 15, 2022 at 8:28 PM Gregg Powell via R-help
> r-help@r-project.org wrote:
> 

> > Have data sets where there are names, in the first column, client names in 
> > the second, and Client start date in the third.
> > 

> > There are thousands of these records with thousands of names/clients/client 
> > start dates. The name is entered each time the person begins with a new 
> > client such that each person has many entries in the name column. Often the 
> > names were not entered in a consistent way. With and without middle 
> > initial, middle name, or various abbreviations such as ",RN" at the end of 
> > the name.
> > 

> > Is there a package that can do fuzzy name matching so that the names in 
> > name column get replaced with a "standardized" format - where some type of 
> > machine learning can pick the most common spelling of each repeat name and 
> > replace the different variations with the common spelling?
> > 

> > I included an example below. First table includes the names with the 
> > various spellings. Second table depicts what I hope to achieve.
> > 

> > Again - this is on a large scale - there are something like 10,000 records 
> > with names that need to be standardized.
> > 

> > Name              Client    Client Start Date
> > John Good         Client 1  1/1/2020
> > Joe Jackson       Client 2  6/1/2020
> > Bob A. Barker     Client 3  8/1/2020
> > John B. Good      Client 4  10/1/2020
> > Joe J. Jackson    Client 5  12/1/2020
> > Bob Allen Barker  Client 6  1/1/2021
> > John Good         Client 7  5/1/2021
> > Joe Jack Jackson  Client 8  8/1/2021
> > Bob Barker        Client 9  12/1/2021
> >
> > Name              Client    Client Start Date
> > John Good         Client 1  1/1/2020
> > Joe J. Jackson    Client 2  6/1/2020
> > Bob A. Barker     Client 3  8/1/2020
> > John Good         Client 4  10/1/2020
> > Joe J. Jackson    Client 5  12/1/2020
> > Bob A. Barker     Client 6  1/1/2021
> > John Good         Client 7  5/1/2021
> > Joe J. Jackson    Client 8  8/1/2021
> > Bob A. Barker     Client 9  12/1/2021
> > 

> > THANKS!
> > 

> > Gregg Powell
> > 

> > Arizona, USA



Re: [R] Is there a package that can do Fuzzy name matching to standardize names in a single column

2022-06-15 Thread Bert Gunter
As these are English names and appear to be present always as **first
?? last** (you didn't specify but that's how your example shows it),
maybe something like the following might be a start:

1. Use strsplit() to split the names into their constituent parts.
2. Find the last *meaningful* part in each vector (e.g. Joe Smith Jr.
should exclude Jr. and choose Smith)
3. Split the names into the groups of identical unique last parts
4. Split each of the groups of last names into subgroups based on the
first one or more letters of first name so that, e.g. Joe and Joseph
would be in the same subgroup of Smith. Of course Joe and John would
be also, so you see the problem...
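
Steps 1 to 4 might be sketched like this in base R. The suffix list is a guess, "Joe Smith Jr." is added only to exercise step 2, and grouping on the first letter alone is the crudest version of step 4.

```r
# Names from the example, plus one invented entry with a suffix.
nm <- c("John Good", "Joe Jackson", "Bob A. Barker", "John B. Good",
        "Joe J. Jackson", "Bob Allen Barker", "John Good",
        "Joe Jack Jackson", "Bob Barker", "Joe Smith Jr.")

# Hypothetical suffixes that should not count as a last name.
suffixes <- c("Jr.", "Sr.", "II", "III", "RN")

# 1. Split each name into its parts.
parts <- strsplit(nm, "[ ,]+")

# 2. Last meaningful part: drop suffixes, then take the final element.
last_name <- vapply(parts, function(p) {
  p <- p[!(p %in% suffixes)]
  p[length(p)]
}, character(1))

# 3. and 4. Group by last name, then by first letter of the first name.
first_initial <- substr(vapply(parts, `[`, character(1), 1), 1, 1)
groups <- split(nm, list(last_name, first_initial), drop = TRUE)
groups
```

As noted above, Joe and John land in the same subgroup with this rule, so each group is a list of candidates to review and not a final answer.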

Other Issues:
Are Joe Smith and Joe Smith Jr. the same person?
Misspellings? Typos?  Is Arlene Smith the same as Alene Smith?

Some sort of clustering of the names might also be appropriate. See
https://cran.r-project.org/web/views/Cluster.html  for ideas.

Cheers,
Bert

On Wed, Jun 15, 2022 at 7:58 AM Gregg Powell via R-help
 wrote:
>
> Have data sets where there are names, in the first column, client names in 
> the second, and Client start date in the third.
>
> There are thousands of these records with thousands of names/clients/client 
> start dates. The name is entered each time the person begins with a new 
> client such that each person has many entries in the name column. Often the 
> names were not entered in a consistent way. With and without middle initial, 
> middle name, or various abbreviations such as ",RN" at the end of the name.
>
> Is there a package that can do fuzzy name matching so that the names in name 
> column get replaced with a "standardized" format - where some type of machine 
> learning can pick the most common spelling of each repeat name and replace 
> the different variations with the common spelling?
>
> I included an example below. First table includes the names with the various 
> spellings. Second table depicts what I hope to achieve.
>
> Again - this is on a large scale - there are something like 10,000 records 
> with names that need to be standardized.
>
>
> Name              Client    Client Start Date
> John Good         Client 1  1/1/2020
> Joe Jackson       Client 2  6/1/2020
> Bob A. Barker     Client 3  8/1/2020
> John B. Good      Client 4  10/1/2020
> Joe J. Jackson    Client 5  12/1/2020
> Bob Allen Barker  Client 6  1/1/2021
> John Good         Client 7  5/1/2021
> Joe Jack Jackson  Client 8  8/1/2021
> Bob Barker        Client 9  12/1/2021
>
> Name              Client    Client Start Date
> John Good         Client 1  1/1/2020
> Joe J. Jackson    Client 2  6/1/2020
> Bob A. Barker     Client 3  8/1/2020
> John Good         Client 4  10/1/2020
> Joe J. Jackson    Client 5  12/1/2020
> Bob A. Barker     Client 6  1/1/2021
> John Good         Client 7  5/1/2021
> Joe J. Jackson    Client 8  8/1/2021
> Bob A. Barker     Client 9  12/1/2021
>
>
>
> THANKS!
>
> Gregg Powell
>
> Arizona, USA



Re: [R] Is there a package that can do Fuzzy name matching to standardize names in a single column

2022-06-15 Thread Ashim Kapoor
Dear Gregg,

Check this out:

library(fuzzyjoin)
?stringdist_left_join

Best Regards,
Ashim

On Wed, Jun 15, 2022 at 8:28 PM Gregg Powell via R-help
 wrote:
>
> Have data sets where there are names, in the first column, client names in 
> the second, and Client start date in the third.
>
> There are thousands of these records with thousands of names/clients/client 
> start dates. The name is entered each time the person begins with a new 
> client such that each person has many entries in the name column. Often the 
> names were not entered in a consistent way. With and without middle initial, 
> middle name, or various abbreviations such as ",RN" at the end of the name.
>
> Is there a package that can do fuzzy name matching so that the names in name 
> column get replaced with a "standardized" format - where some type of machine 
> learning can pick the most common spelling of each repeat name and replace 
> the different variations with the common spelling?
>
> I included an example below. First table includes the names with the various 
> spellings. Second table depicts what I hope to achieve.
>
> Again - this is on a large scale - there are something like 10,000 records 
> with names that need to be standardized.
>
>
> Name              Client    Client Start Date
> John Good         Client 1  1/1/2020
> Joe Jackson       Client 2  6/1/2020
> Bob A. Barker     Client 3  8/1/2020
> John B. Good      Client 4  10/1/2020
> Joe J. Jackson    Client 5  12/1/2020
> Bob Allen Barker  Client 6  1/1/2021
> John Good         Client 7  5/1/2021
> Joe Jack Jackson  Client 8  8/1/2021
> Bob Barker        Client 9  12/1/2021
>
> Name              Client    Client Start Date
> John Good         Client 1  1/1/2020
> Joe J. Jackson    Client 2  6/1/2020
> Bob A. Barker     Client 3  8/1/2020
> John Good         Client 4  10/1/2020
> Joe J. Jackson    Client 5  12/1/2020
> Bob A. Barker     Client 6  1/1/2021
> John Good         Client 7  5/1/2021
> Joe J. Jackson    Client 8  8/1/2021
> Bob A. Barker     Client 9  12/1/2021
>
>
>
> THANKS!
>
> Gregg Powell
>
> Arizona, USA



Re: [R] Extracting only some coefficients for the logistic regression model and its plot

2022-06-15 Thread anteneh asmare
Dear Rui, Thanks it works!
Best,
Hana
On 6/15/22, Rui Barradas  wrote:
> Hello,
>
> With ggplot it's easy, add color = id and coord_flip().
>
>
> ggplot(ORCI, aes(id, OR, color = id)) +
>    geom_point() +
>    geom_errorbar(aes(ymin = `2.5 %`, ymax = `97.5 %`)) +
>    coord_flip() +
>    theme_bw()
>
>
> Hope this helps,
>
> Rui Barradas
>
> Às 08:15 de 15/06/2022, anteneh asmare escreveu:
>> Dear Rui, thanks a lot. Is it possible to have the horizontal line
>> for scale OR value on the Y axis and different colors for the box plots?
>> Best,
>> Hana
>> On 6/15/22, Rui Barradas  wrote:
>>> Hello,
>>>
>>> To extract all but the first 2 rows, use a negative index on the rows.
>>> I will also coerce to data.frame and add an id column, which will be
>>> needed to plot the confidence intervals.
>>>
>>>
>>> ORCI <- exp(cbind(OR = coef(model ), confint(model )))[-(1:2), ]
>>> ORCI <- cbind.data.frame(ORCI, id = row.names(ORCI))
>>>
>>>
>>> Now the base R and ggplot plots. In both cases plot the bars first, then
>>> the points.
>>>
>>> 1. Base R
>>>
>>>
>>> ymin <- min(apply(ORCI[2:3], 1, range)[1,])
>>> ymax <- max(apply(ORCI[2:3], 1, range)[2,])
>>>
>>> plot((ymin + ymax)/2,
>>>      type = "n",
>>>      xaxt = "n",
>>>      xlim = c(0.5, 5.5),
>>>      ylim = c(ymin, ymax),
>>>      xlab = "X3",
>>>      ylab = "Odds Ratio")
>>> with(ORCI, arrows(x0 = seq_along(id),
>>>                   y0 = `2.5 %`,
>>>                   y1 = `97.5 %`,
>>>                   code = 3,
>>>                   angle = 90))
>>> points(OR ~ seq_along(id), ORCI, pch = 16)
>>> axis(1, at = seq_along(ORCI$id), labels = ORCI$id)
>>>
>>>
>>>
>>> 2. Package ggplot2
>>>
>>>
>>> library(ggplot2)
>>>
>>> ggplot(ORCI, aes(id, OR)) +
>>>   geom_errorbar(aes(ymin = `2.5 %`, ymax = `97.5 %`)) +
>>>   geom_point() +
>>>   theme_bw()
>>>
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>>
>>> Às 21:01 de 14/06/2022, anteneh asmare escreveu:
>>>> sample_data =
>>>> read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
>>>> head(sample_data)
>>>> model = glm(Y~0+X1+X2+X3,family=binomial,data=sample_data)
>>>> summary(model)
>>>> exp(coef(model ))
>>>> exp(cbind(OR = coef(model ), confint(model )))
>>>> I have the above sample data for a logistic regression with a
>>>> categorical predictor.
>>>> When I run the above code I get the following output:
>>>>             OR       2.5 %     97.5 %
>>>> X1  1.67639337 1.352583976 2.09856514
>>>> X2  1.23377720 1.071959330 1.42496949
>>>> X3A 0.01157565 0.001429430 0.08726854
>>>> X3B 0.06627849 0.008011818 0.54419759
>>>> X3C 0.01118084 0.001339984 0.08721028
>>>> X3D 0.01254032 0.001545240 0.09539880
>>>> X3E 0.10654454 0.013141540 0.87369972
>>>> However, I want to extract the OR and CI only for the factor levels. My
>>>> desired output would be:
>>>>             OR       2.5 %     97.5 %
>>>> X3A 0.01157565 0.001429430 0.08726854
>>>> X3B 0.06627849 0.008011818 0.54419759
>>>> X3C 0.01118084 0.001339984 0.08721028
>>>> X3D 0.01254032 0.001545240 0.09539880
>>>> X3E 0.10654454 0.013141540 0.87369972
>>>> Can anyone help me with the code to extract it?
>>>> Additionally, I want to plot the extracted ORs with their confidence
>>>> intervals.
>>>> Can you also help me with the code for a plot or box plot?
>>>> Kind regards,
>>>> Hana
>>>
>
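
A minimal base R sketch of the flipped, coloured interval plot discussed above, assuming the odds ratios and limits are typed in from the quoted output rather than recomputed from the model. The column names lwr and upr, the colour choice, the log-scaled axis and the dashed reference line at OR = 1 are illustrative assumptions, not part of the original replies.

```r
# Odds ratios and 95% limits copied from the output quoted in the thread.
# The column names lwr/upr are hypothetical, chosen only for this sketch.
ORCI <- data.frame(
  id  = c("X3A", "X3B", "X3C", "X3D", "X3E"),
  OR  = c(0.01157565, 0.06627849, 0.01118084, 0.01254032, 0.10654454),
  lwr = c(0.001429430, 0.008011818, 0.001339984, 0.001545240, 0.013141540),
  upr = c(0.08726854, 0.54419759, 0.08721028, 0.09539880, 0.87369972)
)

pos  <- seq_len(nrow(ORCI))       # one row of the plot per factor level
cols <- pos + 1L                  # one palette colour per level (arbitrary choice)

# Empty frame first; odds ratios go on a log-scaled x axis.
plot(ORCI$OR, pos,
     type = "n", log = "x", yaxt = "n",
     xlim = c(min(ORCI$lwr), max(1, ORCI$upr)),
     ylim = c(0.5, nrow(ORCI) + 0.5),
     xlab = "Odds ratio (log scale)", ylab = "X3")
abline(v = 1, lty = 2)                               # OR = 1 reference line
segments(ORCI$lwr, pos, ORCI$upr, pos, col = cols)   # horizontal intervals
points(ORCI$OR, pos, pch = 16, col = cols)           # point estimates on top
axis(2, at = pos, labels = ORCI$id, las = 1)
```

Drawing the intervals before the points keeps the estimates visible, in the same order as the quoted replies (bars first, then points).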



Re: [R] Extracting only some coefficients for the logistic regression model and its plot

2022-06-15 Thread anteneh asmare
Dear Rui, thanks a lot. Is it possible to have the horizontal line
for the OR value scale on the Y axis, and a different color for each
of the box plots?
Best,
Hana
On 6/15/22, Rui Barradas  wrote:
> Hello,
>
> To extract all but the first 2 rows, use a negative index on the rows.
> I will also coerce to data.frame and add an id column, which will be
> needed to plot the confidence intervals.
>
>
> ORCI <- exp(cbind(OR = coef(model ), confint(model )))[-(1:2), ]
> ORCI <- cbind.data.frame(ORCI, id = row.names(ORCI))
>
>
> Now the base R and ggplot plots. In both cases plot the bars first, then
> the points.
>
> 1. Base R
>
>
> ymin <- min(apply(ORCI[2:3], 1, range)[1,])
> ymax <- max(apply(ORCI[2:3], 1, range)[2,])
>
> plot((ymin + ymax)/2,
>   type = "n",
>   xaxt = "n",
>   xlim = c(0.5, 5.5),
>   ylim = c(ymin, ymax),
>   xlab = "X3",
>   ylab = "Odds Ratio")
> with(ORCI, arrows(x0 = seq_along(id),
>                   y0 = `2.5 %`,
>                   y1 = `97.5 %`,
>                   code = 3,
>                   angle = 90))
> points(OR ~ seq_along(id), ORCI, pch = 16)
> axis(1, at = seq_along(ORCI$id), labels = ORCI$id)
>
>
>
> 2. Package ggplot2
>
>
> library(ggplot2)
>
> ggplot(ORCI, aes(id, OR)) +
>   geom_errorbar(aes(ymin = `2.5 %`, ymax = `97.5 %`)) +
>   geom_point() +
>   theme_bw()
>
>
> Hope this helps,
>
> Rui Barradas
>
> Às 21:01 de 14/06/2022, anteneh asmare escreveu:
>> sample_data =
>> read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
>> head(sample_data)
>> model = glm(Y~0+X1+X2+X3,family=binomial,data=sample_data)
>> summary(model)
>> exp(coef(model ))
>> exp(cbind(OR = coef(model ), confint(model )))
>> I have the above sample data for a logistic regression with a
>> categorical predictor.
>> When I run the above code I get the following output:
>>             OR       2.5 %     97.5 %
>> X1  1.67639337 1.352583976 2.09856514
>> X2  1.23377720 1.071959330 1.42496949
>> X3A 0.01157565 0.001429430 0.08726854
>> X3B 0.06627849 0.008011818 0.54419759
>> X3C 0.01118084 0.001339984 0.08721028
>> X3D 0.01254032 0.001545240 0.09539880
>> X3E 0.10654454 0.013141540 0.87369972
>> However, I want to extract the OR and CI only for the factor levels. My
>> desired output would be:
>>             OR       2.5 %     97.5 %
>> X3A 0.01157565 0.001429430 0.08726854
>> X3B 0.06627849 0.008011818 0.54419759
>> X3C 0.01118084 0.001339984 0.08721028
>> X3D 0.01254032 0.001545240 0.09539880
>> X3E 0.10654454 0.013141540 0.87369972
>> Can anyone help me with the code to extract it?
>> Additionally, I want to plot the extracted ORs with their confidence
>> intervals.
>> Can you also help me with the code for a plot or box plot?
>> Kind regards,
>> Hana
>>
>
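
The negative row index in the quoted reply relies on the two numeric predictors coming first. A hedged alternative is to select rows by name, which keeps working if the order of terms changes. The sketch below uses simulated data as a stand-in for the file in the thread, so the numbers will differ; the object names sim, fit, or_ci and or_x3 are hypothetical. It also uses confint.default() (Wald intervals) to stay within base stats, whereas confint() on a glm gives profile-likelihood intervals.

```r
# Simulated stand-in for the thread's data set (assumption: same variable
# names Y, X1, X2 and a five-level factor X3; the values are made up).
set.seed(1)
n <- 500
sim <- data.frame(
  X1 = rnorm(n),
  X2 = rnorm(n),
  X3 = factor(sample(LETTERS[1:5], n, replace = TRUE))
)
sim$Y <- rbinom(n, 1, plogis(0.5 * sim$X1 + 0.2 * sim$X2 - 0.3))

# Same model form as in the thread: no intercept, so every X3 level
# gets its own coefficient (X3A ... X3E).
fit <- glm(Y ~ 0 + X1 + X2 + X3, family = binomial, data = sim)

# Odds ratios with Wald 95% limits on the odds-ratio scale.
or_ci <- exp(cbind(OR = coef(fit), confint.default(fit)))

# Keep only the rows whose names start with "X3", whatever their position.
keep  <- grepl("^X3", rownames(or_ci))
or_x3 <- or_ci[keep, , drop = FALSE]
print(or_x3)
```

The drop = FALSE argument keeps the result a matrix even when only one row matches.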
