On 6/4/14, David Noel <[email protected]> wrote: > On 6/3/14, David Noel <[email protected]> wrote: >> Does anyone know of any libraries to clean common strings >> from a set of strings (Java, preferably)? > > I wound up rolling one of my own. It wasn't too bad.
Can't seem to settle on an ideal algorithm. Essentially the problem is that I have frequently repeated, somewhat similar substrings appended and prepended to each news article. Because I know the strings appear at either the beginning or the end of the article I'm able to reduce the search space somewhat. I've come up with two algorithms. The question I have is how to select the optimal substring for each document based on its Levenshtein distance. A rough overview of each algorithm I've come up with is as follows: Pull a list of all the domains I've crawled from the database. For each domain in the result set, pull all of the corresponding articles for that domain. .For each article in our result set ..For each next article our result set ...For every substring starting at the beginning of each documents (tokenized by word, not character.. so each substring consists of one more word of the document than the last) ....Generate and store the Levenshtein distance for each substring ...End for ...For every generated Levenshtein distance ....Find the furthest position of the most often repeated Levenshtein distance ...End for ..End for .End for End for But there are problems with this algorithm. Say for example I am comparing two strings (where each character in the string represents some word): string x: aaaaabcdeeeefghiiipqr string y: aaaaajkleeeemnoiiituv The Levenshtein distances are: 000001233333456666789 The frequency of each Levenshtein distance is: dist, freq, furthest position 0,5,4 1,1,5 2,1,6 3,5,11 4,1,12 5,1,13 6,4,17 7,1,18 8,1,19 9,1,20 This algorithm would see that the most common frequency was 5, that the corresponding distance was 3, and that the furthest position was 11. So it would return the substring x[0,11], or "aaaaabcdeeee". But really we would like the substring "aaaaabcdeeeefghiii". So what's a different approach? Maybe we could find the furthest position where the Levenshtein distance is repeated more than once. Pull a list of all the domains I've crawled from the database. For each domain in the result set, pull all of the corresponding articles for that domain. .For each article in our result set ..For each next article our result set ...For every substring starting at the beginning of each documents ....Generate and store the Levenshtein distance for each substring ...End for ...For every generated Levenshtein distance ....Find the furthest position where the Levenshtein distance is repeated more than once ...End for ..End for .End for End for Again, our strings are: string x: aaaaabcdeeeefghiiipqr string y: aaaaajkleeeemnoiiituv The Levenshtein distances are: 000001233333456666789 The frequency of each Levenshtein distance is: dist, freq, furthest position 0,5,4 1,1,5 2,1,6 3,5,11 4,1,12 5,1,13 6,4,17 7,1,18 8,1,19 9,1,20 ...and so our algorithm would see that the distance measure of 6 with a frequency of 4 and last position 17 was the furthest distance of a frequency greater than 1, so we'd return "aaaaabcdeeeefghiii" But there are problems with this algorithm too. Say our strings are: string x: aaaaabcdeeeefghiiipqrstuvwwdefghi string y: aaaaajkleeeemnoiiituvwxyzwwabcjkl The Levenshtein distances are: 0 0 0 0 0 1 2 3 3 3 3 3 4 5 6 6 6 6 7 8 9 10 11 12 13 14 14 15 16 17 18 19 20 This algorithm would see that the distance measure of the string ending in "ww" (or, "aaaaabcdeeeefghiiipqrstuvww") was the furthest position in the string with a frequency of distance measure greater than 1 and return it as the result. But as you can see there is a whole lot of noise in that string. What we really want is the string "aaaaabcdeeeefghiii". It's the longest similar string with the least amount of noise. So how do we write an algorithm for that? My first thought is maybe we use some relative measure of the Levenshtein distance. In other words, we want the longest string where the Levenshtein distance does not vary by more than a certain amount from the last most frequent distance measure, where the amount that it may vary is determined by some ratio of the distance measure to its position in the string. This is the point where I don't know how to proceed. Does any of this make sense? I can't seem to find any established algorithms for fuzzy or approximate string matching using Levenshtein distances, but I'm sure someone somewhere has done it before and come up with an optimal solution for this. Does anyone have any familiarity with this problem? What is the preferred algorithm for this? Any thoughts would be appreciated. This was long. Hope it makes sense. -David
