Re: [sqlite] Compare Similar Cells

2013-09-23 Thread Kristopher Roy
Roger,
Thank you, I will start working on it per your suggestions, though I would
love to get mine cleaned up easily, I want to provide a solution that other
users can use :)


On Sun, Sep 22, 2013 at 11:14 PM, Roger Binns  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 22/09/13 19:41, Kristopher Roy wrote:
> > I have a table of songs, several have similar titles I can't find where
> > to get started. I tried this but its not right. Select SongTitle,
> > COUNT(SongTitle) AS LIKE_COUNT FROM Songs
>
> I did work with a database that came from a company that sold music.
> Their source data came from the various record companies and was a
> complete mess.  (Yes record companies would make mistakes even for their
> own artists!)  For our purposes the data had to be denormalised,
> deduplicated and many items merged where the differences weren't important.
>
> Fixing up the data required probabilistic matching, and can't be done in
> simple SQL queries.  For example spelling mistakes had to be accounted
> for, truncations, case differences, punctuation differences, numeric
> differences (eg "Song One" vs "Song 1", "Album 3" vs "Album III", "Vol 4"
> vs "Volume 4."), mixes (eg "Song One" vs "Song One (Radio Edit)"),
> compositions of multiple artists or contributing to another artists songs
> so "artist" becomes murky, etc.
>
> This could only be achieved by processing all the data in advance.
> Essentially every artist had to be scored against every other to see if
> they were the same (but not similar), same for every album of that artist
> against their other albums, and finally of all the songs within each
> album.  It required a lot of inspecting the matches, finding anomalies,
> doing google searches to find canonical information, adding heuristics,
> making sure that heuristics changes did not break existing good matches,
> and endless repeats until things are good enough.
>
> If you are trying to do a good job, then you will need to do something
> like that.
>
> If you are trying to fix up your own collection, then first go in and fix
> all the meta data.  Musicbrainz is a good source for authoritative
> information and there are plenty of apps out there to help you edit and
> update tags.
>
> If you want a quick fix, then add another column to your SQLite database
> that contains the normalised song title.  You will need to iterate over
> all your existing data to calculate a normalised title.  For example
> convert to all upper case, remove all punctuation, convert multiple spaces
> to single, remove "digits -" as you gave in your example, truncating to 30
> characters, and whatever else is relevant for your data.  You can now do
> matching against the normalised title column for each title.
>
> Roger
>
>
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.12 (GNU/Linux)
>
> iEYEARECAAYFAlI/3EAACgkQmOOfHg372QSuHgCgla77zTSx5knJL036AMpU0Unx
> JnEAoJ9Cx/kocO3ue4xafKFkM7BVEviE
> =RDic
> -END PGP SIGNATURE-
> ___
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>



-- 
Thank You,
Kristopher C. Roy
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Compare Similar Cells

2013-09-23 Thread Stephen Chrzanowski
To the OP:
I've got a tonne of media on my Drobo, and I feel your pain.  Between my
MP3s, pictures, videos and books, keeping tabs on file names, physical file
locations, its quite the nightmare, and by the sounds of it, Roger has been
exposed to this issue more so than I have been, and with my limited
experience with just my central repository, well, its a daunting and time
consuming task.

For MP3s, there are plenty of applications out there, both paid and free,
that will handle tags with ease, check file names, and I've even heard of
applications that will "listen" to each MP3 and see how close they sound to
each other.  A few years back, on my first initial attempt to organize my
MP3s, I used Winamp to edit the tags and rename the physical files,
however, moving the files to an appropriate directory was still a 'manual'
process IIRC, but the physical file management did become easier with the
file renames, and Google came in to help me track down lyrics and put the
name of a song to an MP3.  Then there's how iTunes handles naming files it
downloads.  *sighs and goes and curls up in a corner*

Basically, there is no magic bullet for organizing based on file names.  As
Roger mentions, a lot of it is going to be a manual process.

If you're developing an application that would at least get you started,
what I would do is this:
- Get a full directory list of all files to be taken into consideration and
put them into two tables.  One table for the paths that contain the files,
and another table to hold the file names.  FK between these two tables.
- Run through the list of file names and extract each word and put each
word into a third table, with another FK relationship to just the file name
table.  For proper normalization, I could also put a 4th table in the mix
to make a "many to many" relationship between the word list and file name
list, which I would ensure the third table has only unique words.
Depending on how you write and your preferences, this would be up to you.
- Have a sorted list of words (Table 3) in a list box on the GUI so that
when I click on a word, another list box is populated with the file names
and/or the paths of the files found.
- Have some kind of event/trigger that would allow me to rename, move, or
edit the MP3 ID3 tags of the file in the second list box.  Update the
database, word list (if needed) and continue on.

The reason for the word list is that even though you can search for a
particular word, you'd never really be able to ensure a 100% hit on every
file by doing a manual key entry, not to mention what you would type in
with proper spelling may show up in the word list as being spelled
incorrectly.  I would then go through each word entry and see if I can find
duplicate file names in the second list box.  I could also introduce
reading the ID tags directly from the MP3 and edit as is.

There are many ways to handle this behemoth task.  A query, to get you
started, like [ select * from FileList where FileName like '%demon%' ]
would be something towards what you'd have to use..
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Compare Similar Cells

2013-09-23 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 22/09/13 19:41, Kristopher Roy wrote:
> I have a table of songs, several have similar titles I can't find where
> to get started. I tried this but its not right. Select SongTitle,
> COUNT(SongTitle) AS LIKE_COUNT FROM Songs

I did work with a database that came from a company that sold music.
Their source data came from the various record companies and was a
complete mess.  (Yes record companies would make mistakes even for their
own artists!)  For our purposes the data had to be denormalised,
deduplicated and many items merged where the differences weren't important.

Fixing up the data required probabilistic matching, and can't be done in
simple SQL queries.  For example spelling mistakes had to be accounted
for, truncations, case differences, punctuation differences, numeric
differences (eg "Song One" vs "Song 1", "Album 3" vs "Album III", "Vol 4"
vs "Volume 4."), mixes (eg "Song One" vs "Song One (Radio Edit)"),
compositions of multiple artists or contributing to another artists songs
so "artist" becomes murky, etc.

This could only be achieved by processing all the data in advance.
Essentially every artist had to be scored against every other to see if
they were the same (but not similar), same for every album of that artist
against their other albums, and finally of all the songs within each
album.  It required a lot of inspecting the matches, finding anomalies,
doing google searches to find canonical information, adding heuristics,
making sure that heuristics changes did not break existing good matches,
and endless repeats until things are good enough.

If you are trying to do a good job, then you will need to do something
like that.

If you are trying to fix up your own collection, then first go in and fix
all the meta data.  Musicbrainz is a good source for authoritative
information and there are plenty of apps out there to help you edit and
update tags.

If you want a quick fix, then add another column to your SQLite database
that contains the normalised song title.  You will need to iterate over
all your existing data to calculate a normalised title.  For example
convert to all upper case, remove all punctuation, convert multiple spaces
to single, remove "digits -" as you gave in your example, truncating to 30
characters, and whatever else is relevant for your data.  You can now do
matching against the normalised title column for each title.

Roger


-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.12 (GNU/Linux)

iEYEARECAAYFAlI/3EAACgkQmOOfHg372QSuHgCgla77zTSx5knJL036AMpU0Unx
JnEAoJ9Cx/kocO3ue4xafKFkM7BVEviE
=RDic
-END PGP SIGNATURE-
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Compare Similar Cells

2013-09-22 Thread Kristopher Roy
Lets say one songtitle is "Happy Day" and another one "12 - Happy Day"


On Sun, Sep 22, 2013 at 8:28 PM, Igor Tandetnik  wrote:

> On 9/22/2013 10:41 PM, Kristopher Roy wrote:
>
>> I have a table of songs, several have similar titles I can't find where to
>> get started. I tried this but its not right.
>> Select SongTitle, COUNT(SongTitle) AS LIKE_COUNT FROM Songs
>>
>
> What do you mean by "similar titles"? Show a representative sample of data
> in your table, and the result you would like to obtain from the query when
> run on that sample.
> --
> Igor Tandetnik
>
> __**_
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-**bin/mailman/listinfo/sqlite-**users
>



-- 
Thank You,
Kristopher C. Roy
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Compare Similar Cells

2013-09-22 Thread Igor Tandetnik

On 9/22/2013 10:41 PM, Kristopher Roy wrote:

I have a table of songs, several have similar titles I can't find where to
get started. I tried this but its not right.
Select SongTitle, COUNT(SongTitle) AS LIKE_COUNT FROM Songs


What do you mean by "similar titles"? Show a representative sample of 
data in your table, and the result you would like to obtain from the 
query when run on that sample.

--
Igor Tandetnik

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users