Roger, Thank you, I will start working on it per your suggestions, though I would love to get mine cleaned up easily, I want to provide a solution that other users can use :)
On Sun, Sep 22, 2013 at 11:14 PM, Roger Binns <[email protected]> wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On 22/09/13 19:41, Kristopher Roy wrote: > > I have a table of songs, several have similar titles I can't find where > > to get started. I tried this but its not right. Select SongTitle, > > COUNT(SongTitle) AS LIKE_COUNT FROM Songs > > I did work with a database that came from a company that sold music. > Their source data came from the various record companies and was a > complete mess. (Yes record companies would make mistakes even for their > own artists!) For our purposes the data had to be denormalised, > deduplicated and many items merged where the differences weren't important. > > Fixing up the data required probabilistic matching, and can't be done in > simple SQL queries. For example spelling mistakes had to be accounted > for, truncations, case differences, punctuation differences, numeric > differences (eg "Song One" vs "Song 1", "Album 3" vs "Album III", "Vol 4" > vs "Volume 4."), mixes (eg "Song One" vs "Song One (Radio Edit)"), > compositions of multiple artists or contributing to another artists songs > so "artist" becomes murky, etc. > > This could only be achieved by processing all the data in advance. > Essentially every artist had to be scored against every other to see if > they were the same (but not similar), same for every album of that artist > against their other albums, and finally of all the songs within each > album. It required a lot of inspecting the matches, finding anomalies, > doing google searches to find canonical information, adding heuristics, > making sure that heuristics changes did not break existing good matches, > and endless repeats until things are good enough. > > If you are trying to do a good job, then you will need to do something > like that. > > If you are trying to fix up your own collection, then first go in and fix > all the meta data. Musicbrainz is a good source for authoritative > information and there are plenty of apps out there to help you edit and > update tags. > > If you want a quick fix, then add another column to your SQLite database > that contains the normalised song title. You will need to iterate over > all your existing data to calculate a normalised title. For example > convert to all upper case, remove all punctuation, convert multiple spaces > to single, remove "digits -" as you gave in your example, truncating to 30 > characters, and whatever else is relevant for your data. You can now do > matching against the normalised title column for each title. > > Roger > > > -----BEGIN PGP SIGNATURE----- > Version: GnuPG v1.4.12 (GNU/Linux) > > iEYEARECAAYFAlI/3EAACgkQmOOfHg372QSuHgCgla77zTSx5knJL036AMpU0Unx > JnEAoJ9Cx/kocO3ue4xafKFkM7BVEviE > =RDic > -----END PGP SIGNATURE----- > _______________________________________________ > sqlite-users mailing list > [email protected] > http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users > -- Thank You, Kristopher C. Roy _______________________________________________ sqlite-users mailing list [email protected] http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

