Dear Srinivas,
You can try using trigrams, a special case of N-grams, often used in Natural
Language Processing.
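
To make the trigram idea concrete, here is a minimal sketch in base R. The helper names `trigrams` and `trigram_sim` are made up for illustration and do not come from any package; the padding and the choice of Jaccard similarity are assumptions, not the only way to compare trigram sets.

```r
# Split a string into its set of overlapping character trigrams.
# (Hypothetical helper, not from a package.)
trigrams <- function(x) {
  x <- paste0("  ", tolower(x), " ")   # pad so the word edges form trigrams too
  n <- nchar(x)
  unique(substring(x, 1:(n - 2), 3:n))
}

# Jaccard similarity between the trigram sets of two strings:
# shared trigrams divided by all distinct trigrams.
trigram_sim <- function(a, b) {
  ta <- trigrams(a)
  tb <- trigrams(b)
  length(intersect(ta, tb)) / length(union(ta, tb))
}

trigram_sim("Srinivas", "Srinvas")    # a likely typo pair: most trigrams shared
trigram_sim("Srinivas", "Raghavan")   # unrelated names: few or no trigrams shared
```

Names whose similarity is above some threshold could then be treated as variants of one another; the threshold would have to be tuned on the real data.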
I am interested in grouping/clustering these names into those which are
similar letter to letter. Are there any text clustering algorithms in R
which can group names of similar type?
Hans-Joerg Bibiko's function Levenshtein would help; cf. below for an
example (very clumsy with two loops, but you can tweak that with apply
stuff).
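
As a separate sketch of the same idea without explicit loops: current versions of R ship `adist()` in the utils package, which returns a whole matrix of edit distances at once. This is not the function posted in this thread, the sample names are made up, and the cut height of 2 is an arbitrary assumption that would need tuning.

```r
# Made-up sample names with typos (not data from the thread).
student_names <- c("Srinivas", "Srinvas", "Srinivaas",
                   "Raghavan", "Raghvan", "Ragavan")

# Pairwise Levenshtein (edit) distances as a square matrix.
d <- adist(student_names)
dimnames(d) <- list(student_names, student_names)

# Hierarchical clustering on the distance matrix.
hc <- hclust(as.dist(d), method = "average")

# Cut the tree: names that merge at an average distance of 2 or less
# end up in the same group (2 is an assumed threshold).
groups <- cutree(hc, h = 2)
split(student_names, groups)
```

With 30,000 names the full distance matrix gets large, so it may be worth running this on the unique names only.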
HTH,
STG
levenshtein <- function(string1, string2, case=TRUE, map=NULL) {
# levenshtein algorithm in R
#
#
On Fri, Jan 23, 2009 at 08:28, Stefan Th. Gries stgr...@gmail.com wrote:
Hans-Joerg Bibiko's function Levenshtein would help; cf. below for an
example (very clumsy with two loops, but you can tweak that with apply
stuff).
Like this maybe (sorry, should've thought about that earlier):
[...]
Ed
--
Ed Merkle, PhD
Assistant Professor
Dept. of Psychology
Wichita State University
Wichita, KS 67260
Date: Thu, 22 Jan 2009 16:33:03 +0530
From: srinivasa raghavan srinivasrag...@gmail.com
Subject: [R] text vector clustering
To: r-help@r-project.org
Message-ID
Hi,
I am a new user of R, using R 2.8.1 on Windows 2003. I have a CSV file with
a single column which contains 30,000 student names. Typos were made
while entering these student names. The actual number of distinct names is
1000. However, we don't have that list for keyword search.
I am interested
Simply doing a tabulation and isolating the cases with only one entry
might have been a possibility if the count discrepancy weren't so
high. It appears you have a greater degree of corruption than would be
expected just from typos.
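
For reference, the tabulation idea would look roughly like this in base R. The names are made up for illustration; with the real data the vector would come from the CSV column.

```r
# Made-up sample data (not from the thread).
student_names <- c("Anil", "Anil", "Anil", "Anli",
                   "Meera", "Meera", "Mera")

# Count how often each distinct spelling occurs.
counts <- table(student_names)

# Spellings that occur exactly once are candidate typos.
singletons <- names(counts)[as.vector(counts) == 1]
singletons
```

This only helps when correct spellings are repeated and each typo is rare, which is why it may not be enough here.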
Have you looked at the packages referenced at: