Re: [Mixxx-devel] Directory-hash calculation during library-rescan
On Wed, 5 Jun 2013 22:46:57 -0700 (PDT), Steven Boswell II <ulatekh-/e1597as9lqavxtiumw...@public.gmane.org> wrote:

> The issue is that the directory-hash calculation isn't order-independent;
> it calculates the directory hash by appending all the filenames to one
> string. If we changed it to calculate the hash of each filename
> individually, and XORed those together, then it would be
> order-independent, and shouldn't detect fewer changes than the previous
> method. Thoughts?

I'm guessing this should be seen as a generic, recursive hash(files[n]) -> hash(parent_dir) problem, in which case using XOR wouldn't report a file's movement to a grandparent, because ((a ^ b) ^ c) == (a ^ (b ^ c)); surely the point is to do a top-down traversal and actually detect such movements?

My first instinct would be to wrap QDirIterator with a qsort() so the ordering is deterministic and doesn't fluctuate across OSes and filesystems; after that it's an optimization problem, though not a trivial one.

I read a paper a while back that used MD6 hashes because they're immune to ordering and can be massively parallelized on GPUs, and that used a header/footer block hash for faster discrimination. It was for file deduplication, though. In Mixxx's case I don't see how you can avoid recomputing the hash of a file that suddenly pops up; you can never be certain it's the same file that disappeared from another location until a full rehash... unless the move is done from within the Mixxx UI or the hash is embedded in the file's metadata.

-- p
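For illustration, the sort-before-hashing idea could look something like the sketch below: collect QDirIterator's entries, sort them, and hash the sorted list, so the same set of names always produces the same digest regardless of on-disk ordering. This is not Mixxx's actual scanner code; the function name hashDirectorySorted, the SHA-1 choice, and the NUL separator are assumptions made for the example, and std::sort stands in for qsort() since the entries are QStrings.

// Sketch of the "wrap QDirIterator in a sort" idea. Not Mixxx's actual
// scanner code; hashDirectorySorted() and the SHA-1 choice are assumptions.
#include <QCryptographicHash>
#include <QDir>
#include <QDirIterator>
#include <QStringList>
#include <algorithm>

QByteArray hashDirectorySorted(const QString& path) {
    // QDirIterator yields entries in filesystem-dependent order,
    // so collect the names and sort them for determinism.
    QStringList names;
    QDirIterator it(path, QDir::Files | QDir::Dirs | QDir::NoDotAndDotDot);
    while (it.hasNext()) {
        it.next();
        names.append(it.fileName());
    }
    std::sort(names.begin(), names.end());

    // Hash the sorted names; the same set of names now always produces
    // the same digest no matter how the filesystem stores them.
    QCryptographicHash hash(QCryptographicHash::Sha1);
    for (const QString& name : names) {
        hash.addData(name.toUtf8());
        hash.addData("\0", 1);  // separator, so "ab"+"c" != "a"+"bc"
    }
    return hash.result();
}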
Re: [Mixxx-devel] Directory-hash calculation during library-rescan
We don't hash the file contents during library scanning. For context, the hash Steven is referring to is a per-directory hash that is used to decide whether the directory should be rescanned. The way it works today is that it concatenates all the names of the files/dirs in the directory and hashes that; if the hash differs from what we saw before, we rescan the directory.

Steven is right that we should switch to something order-independent. I've also wanted to include mtimes in there for a while, but I'm not sure whether that would cost an extra stat on all the common filesystems we care about.

On Jun 9, 2013 8:53 AM, petah <mi...@djpetah.com> wrote:

> My first instinct would be to wrap QDirIterator with a qsort() so it's
> deterministic and doesn't fluctuate across different OSes/filesystems;
> then it's an optimization problem, but not trivial.
>
> [...] you can never be certain it was the same file that disappeared from
> another location until a full rehash... unless the move operation is done
> from within the Mixxx UI or the hash is embedded in the file's metadata.
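A minimal sketch of the order-independent variant described above, assuming a qHash-based per-entry hash combined with XOR and an mtime mixed into each entry; the function name and the field separator are illustrative, not Mixxx's actual implementation:

// Sketch of the order-independent scheme: hash each entry on its own,
// XOR the results, optionally mixing in the mtime. hashDirectoryUnordered()
// and the separator are assumptions, not the real Mixxx code.
#include <QDir>
#include <QDirIterator>
#include <QFileInfo>
#include <QHash>

quint64 hashDirectoryUnordered(const QString& path) {
    quint64 combined = 0;
    QDirIterator it(path, QDir::Files | QDir::Dirs | QDir::NoDotAndDotDot);
    while (it.hasNext()) {
        it.next();
        const QFileInfo info = it.fileInfo();
        // Per-entry hash over name + mtime. Whether lastModified() costs
        // an extra stat() depends on the platform; the iterator may
        // already have the data cached.
        const QString entry = info.fileName() + QLatin1Char('\0')
                + QString::number(info.lastModified().toMSecsSinceEpoch());
        // XOR makes the combined value independent of iteration order.
        combined ^= quint64(qHash(entry));
    }
    return combined;
}

One caveat with XOR combining: if two distinct entries ever hash to the same value, adding or removing both of them cancels out and goes undetected; summing the per-entry hashes modulo 2^64 is a common alternative that avoids that particular cancellation.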
Re: [Mixxx-devel] Directory-hash calculation during library-rescan
On Sun, 9 Jun 2013 09:03:12 -0400, RJ Ryan <russelljryan-re5jqeeqqe8avxtiumw...@public.gmane.org> wrote:

> Steven is right that we should switch to something order independent.
> I've also wanted to include mtimes in there for a while but not sure if
> that would result in an extra stat or not on all the common filesystems
> we care about.

Hmm, just using the filename would worry me. I have a lot of tracks with the same name and even the same metadata; usually I got the crappy version first (p2p, a friend, a nasty vinyl rip) and then the high-quality, legit version. Timestamps quickly go haywire even if you don't travel across timezones with an NTP-synced laptop, and I wouldn't trust a filesystem's metadata anyway, since the medium could be handed over by someone else at the last minute. I don't think git uses either.

Before a gig I really need to be absolutely sure the file I'm queuing doesn't have a huge glitch halfway through, or isn't the low-gain version that'll come out sounding like crap.

Presumably embedding metadata isn't on the table because of WAV/AIFF support?

WRT moving files, IIRC Traktor solved it by requiring you to do it from within their UI, or it'd pop up a "where is it now?" dialog; that was a pain in the ass, though.

-- p
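The header/footer block hash petah mentioned in his first reply could look something like the sketch below. Qt has no MD6, so SHA-256 stands in, and the 64 KiB block size is an arbitrary assumption; as he says, a matching fingerprint would still only be a hint, and certainty requires a full rehash.

// Sketch of a header/footer block fingerprint. Qt has no MD6, so SHA-256
// stands in; the 64 KiB block size is an arbitrary assumption.
#include <QCryptographicHash>
#include <QFile>

QByteArray quickContentFingerprint(const QString& filePath) {
    const qint64 kBlock = 64 * 1024;
    QFile f(filePath);
    if (!f.open(QIODevice::ReadOnly)) {
        return QByteArray();
    }
    QCryptographicHash hash(QCryptographicHash::Sha256);
    // Hash the first block...
    hash.addData(f.read(kBlock));
    // ...and, for files bigger than one block, the last block too.
    if (f.size() > kBlock) {
        f.seek(f.size() - kBlock);
        hash.addData(f.read(kBlock));
    }
    // Mix in the size so a truncated copy with the same header differs.
    hash.addData(QByteArray::number(f.size()));
    return hash.result();
}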
[Mixxx-devel] Directory-hash calculation during library-rescan
Right now I'm rescanning my track collection on my laptop. I expected it to be a short operation, but it's taking forever. I'm guessing it's because I moved a large part of my track collection to a different partition on my hard drive, even though the path remained the same. QDirIterator doesn't present the tracks in any particular order; the order is dictated by how they happen to be stored in the underlying filesystem, so moving the track collection permuted the files in each directory.

The issue is that the directory-hash calculation isn't order-independent: it calculates the directory hash by appending all the filenames to one string. If we changed it to calculate the hash of each filename individually, and XORed those together, then it would be order-independent and shouldn't detect fewer changes than the previous method. Thoughts?

Steven Boswell
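A tiny self-contained demonstration of the order-(in)dependence being discussed, with made-up filenames; concatHash mimics the current scheme and xorHash the proposed one:

// Made-up filenames; concatHash mimics the current order-sensitive scheme,
// xorHash the proposed order-independent one.
#include <QCryptographicHash>
#include <QDebug>
#include <QHash>
#include <QStringList>

int main() {
    const QStringList order1 = {"intro.mp3", "drop.mp3", "outro.mp3"};
    const QStringList order2 = {"outro.mp3", "intro.mp3", "drop.mp3"};

    auto concatHash = [](const QStringList& names) {
        QString all;
        for (const QString& n : names) {
            all += n;
        }
        return QCryptographicHash::hash(all.toUtf8(),
                                        QCryptographicHash::Sha1);
    };
    auto xorHash = [](const QStringList& names) {
        quint64 h = 0;
        for (const QString& n : names) {
            h ^= quint64(qHash(n));
        }
        return h;
    };

    // Same files, permuted: the concatenation hash reports a "change"...
    qDebug() << (concatHash(order1) == concatHash(order2));  // false
    // ...while the XOR hash does not.
    qDebug() << (xorHash(order1) == xorHash(order2));        // true
    return 0;
}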