Re: [R] text vector clustering

2009-01-26 Thread San Miguel Martín, Eduardo
Dear srinivas, You can try using trigrams, a special case of N-grams, often used in Natural Language Processing. I am interested in grouping/clustering these names so that those which are similar letter to letter fall together. Is there any text clustering algorithm in R which can group names of similar type
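The trigram suggestion above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names `trigrams` and `trigram_sim`, the space padding, and the choice of Jaccard similarity are all assumptions.

```r
# Split a string into its overlapping character trigrams.
# Padding with spaces (an assumption) lets word starts and ends count too.
trigrams <- function(x) {
  x <- paste("  ", tolower(x), " ", sep = "")
  n <- nchar(x)
  unique(substring(x, 1:(n - 2), 3:n))
}

# Jaccard similarity between the trigram sets of two strings:
# shared trigrams divided by all distinct trigrams (1 = identical sets).
trigram_sim <- function(a, b) {
  ta <- trigrams(a)
  tb <- trigrams(b)
  length(intersect(ta, tb)) / length(union(ta, tb))
}

# A likely typo variant should score higher than an unrelated name.
trigram_sim("Srinivas", "Srinivasa")
trigram_sim("Srinivas", "Eduardo")
```

One minus this similarity can serve as a distance for a clustering routine such as `hclust`.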

Re: [R] text vector clustering

2009-01-23 Thread Stefan Th. Gries
Hans-Joerg Bibiko's function Levenshtein would help; cf. below for an example (very clumsy with two loops, but you can tweak that with apply stuff). HTH, STG levenshtein <- function(string1, string2, case=TRUE, map=NULL) { # levenshtein algorithm in R # #

Re: [R] text vector clustering

2009-01-23 Thread Stefan Th. Gries
On Fri, Jan 23, 2009 at 08:28, Stefan Th. Gries stgr...@gmail.com wrote: Hans-Joerg Bibiko's function Levenshtein would help; cf. below for an example (very clumsy with two loops, but you can tweak that with apply stuff). Like this maybe (sorry, should've thought about that earlier): [...]
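As a rough sketch of the Levenshtein approach, here is one way to get pairwise edit distances without explicit loops and then group the names. It does not use the `levenshtein` function posted in the thread (its body is not reproduced here); instead it assumes `adist()` from the `utils` package, which only exists in R 2.14.0 and later, so it would not run on R 2.8.1. The sample names and the cut height of 2.5 are made up for illustration.

```r
# Hypothetical sample of misspelled names (not data from the thread).
names_vec <- c("Srinivas", "Srinivaas", "Srinvas",
               "Eduardo", "Edaurdo", "Stefan", "Stephan")

# Pairwise Levenshtein (edit) distances, ignoring case.
d <- adist(names_vec, ignore.case = TRUE)
dimnames(d) <- list(names_vec, names_vec)

# Average-linkage hierarchical clustering on the distance matrix.
hc <- hclust(as.dist(d), method = "average")

# Cut the tree at an assumed height; names within about two edits
# of each other end up in the same group.
groups <- cutree(hc, h = 2.5)
split(names_vec, groups)
```

With 30,000 names the full distance matrix has 900 million entries, so in practice it may be worth deduplicating the names first and clustering only the distinct spellings.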

Re: [R] text vector clustering

2009-01-23 Thread Ed Merkle
Ed -- Ed Merkle, PhD Assistant Professor Dept. of Psychology Wichita State University Wichita, KS 67260 Date: Thu, 22 Jan 2009 16:33:03 +0530 From: srinivasa raghavan srinivasrag...@gmail.com Subject: [R] text vector clustering To: r-help@r-project.org Message-ID

[R] text vector clustering

2009-01-22 Thread srinivasa raghavan
Hi, I am a new user of R, using R 2.8.1 on Windows 2003. I have a CSV file with a single column which contains 30,000 student names. Typos were made when these student names were entered. The actual list contains 1,000 names. However, we don't have that list for a keyword search. I am interested

Re: [R] text vector clustering

2009-01-22 Thread David Winsemius
Simply doing a tabulation and isolating the cases with only one entry might have been a possibility if the count discrepancy weren't so high. It appears you have a greater degree of corruption than would be expected just from typos. Have you looked at the packages referenced at:
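The tabulation idea can be sketched in a few lines of base R. The sample vector below is invented for illustration; with real data the names would come from the CSV file.

```r
# Hypothetical sample: correct spellings repeat, typos tend to occur once.
names_vec <- c("Anna", "Anna", "Anna", "Ana", "Bob", "Bob", "Bbo")

# Count how often each distinct spelling occurs.
counts <- table(names_vec)

# Spellings that appear exactly once are candidate typos.
singletons <- names(counts)[counts == 1]
singletons
```

As noted above, this only isolates typos well when most entries are correct, so that true names occur many times and misspellings occur once.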