Re: [Mixxx-devel] Directory-hash calculation during library-rescan

2013-06-09 Thread petah
On Wed, 5 Jun 2013 22:46:57 -0700 (PDT)
Steven Boswell II ulatekh-/e1597as9lqavxtiumw...@public.gmane.org wrote:
 Right now I'm rescanning my track collection on my laptop, and I expected it 
 to be a short operation, but it's taking forever.  I'm guessing it's because 
 I moved a large part of my track collection to a different partition on my 
 hard drive, even though the path remained the same.  The issue is that 
 QDirIterator doesn't present the tracks in any particular order; it'll be 
 dictated by how they happened to be stored in the underlying filesystem.  
 When I moved the track collection, that permuted the files in each directory.
 
 The issue is that the directory-hash calculation isn't order-independent; it 
 calculates the directory hash by appending all the filenames to one string.  
 If we changed it to calculate the hash of each filename individually, and 
 XORed those together, then it would be order-independent, and shouldn't 
 detect fewer changes than the previous method.
 
 Thoughts?
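[Editor's note: Steven's XOR idea can be sketched in a few lines. This is illustrative Python, not actual Mixxx code, and the function name is made up.]

```python
# Sketch of the proposed order-independent directory hash: hash each
# entry name separately, then fold the digests together with XOR.
import hashlib

def dir_hash_xor(names):
    """Combine per-filename hashes with XOR; ordering cannot matter."""
    acc = 0
    for name in names:
        digest = hashlib.sha1(name.encode("utf-8")).digest()
        acc ^= int.from_bytes(digest, "big")
    return acc

# Any permutation of the same names yields the same hash:
assert dir_hash_xor(["a.mp3", "b.mp3"]) == dir_hash_xor(["b.mp3", "a.mp3"])
```

One caveat: XOR is self-inverse, so two identical inputs cancel out; a directory can't contain duplicate names, so that is harmless here.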

I'm guessing it should be seen as a generic, recursive hash(files[n]) -> 
hash(parent_dir) problem, in which case using XOR wouldn't report a file's 
movement to a grandparent directory, because XOR is associative and commutative: 
((a ^ b) ^ c) == (a ^ (b ^ c)). Surely the point is to do a top-down traversal 
and actually detect such movements?

My first instinct would be to wrap QDirIterator with a qsort() so the result is 
deterministic and doesn't fluctuate across OSes/filesystems; after that it's an 
optimization problem, though not a trivial one.
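[Editor's note: the sort-first alternative could look roughly like this. Again a sketch, not Mixxx code; the NUL separator is my own assumption.]

```python
# Sort the names before hashing so the result is deterministic no
# matter what order the filesystem returns entries in.
import hashlib

def dir_hash_sorted(names):
    # A separator avoids collisions like "ab" + "c" vs. "a" + "bc".
    joined = "\x00".join(sorted(names))
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

assert dir_hash_sorted(["b.mp3", "a.mp3"]) == dir_hash_sorted(["a.mp3", "b.mp3"])
```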

I read a paper a while back that used MD6 hashes because they're 
order-independent and can be massively parallelized on GPUs, and it used a 
header/footer block hash for faster discrimination. That was for file 
deduplication, though; in Mixxx's case I don't see how you can avoid recomputing 
the hash of a file that suddenly popped up. You can never be certain it's the 
same file that disappeared from another location until a full rehash... unless 
the move operation is done from within the Mixxx UI or the hash is embedded in 
the file's metadata.
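[Editor's note: the header/footer-block trick amounts to a cheap pre-filter: hash only the file size plus the first and last chunk before committing to a full rehash. A rough sketch, with a made-up name and an arbitrary block size.]

```python
import hashlib
import os

def quick_fingerprint(path, block=64 * 1024):
    """Cheap discriminator: file size + first and last `block` bytes."""
    size = os.path.getsize(path)
    h = hashlib.sha1(str(size).encode())
    with open(path, "rb") as f:
        h.update(f.read(block))                # header block
        if size > block:
            f.seek(max(size - block, block))   # don't re-read the header
            h.update(f.read(block))            # footer block
    return h.hexdigest()
```

A mismatch proves two files differ; a match only says a full rehash is worth running.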

-- p

--
How ServiceNow helps IT people transform IT departments:
1. A cloud service to automate IT design, transition and operations
2. Dashboards that offer high-level views of enterprise services
3. A single system of record for all IT processes
http://p.sf.net/sfu/servicenow-d2d-j
___
Get Mixxx, the #1 Free MP3 DJ Mixing software Today
http://mixxx.org


Mixxx-devel mailing list
Mixxx-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mixxx-devel

Re: [Mixxx-devel] Directory-hash calculation during library-rescan

2013-06-09 Thread RJ Ryan
We don't hash the file contents during library scanning. For context, the
hash Steven is referring to is a per-directory hash that is used to
determine whether the directory should be rescanned. The way it works today
is that it concatenates all the names of files/dirs in the directory and
hashes that. If the hash is different from what we saw before, we rescan the
directory.

Steven is right that we should switch to something order-independent. I've also
wanted to include mtimes in there for a while, but I'm not sure whether that
would cost an extra stat on all the common filesystems we care about.
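[Editor's note: folding mtimes in might look like this, assuming a single directory pass. This uses Python's os.scandir() rather than Qt's QDirIterator purely for illustration; on Windows the listing already carries the stat data, while on POSIX DirEntry.stat() costs one stat() per entry but caches the result.]

```python
import hashlib
import os

def dir_hash_with_mtimes(path):
    """Hash entry names plus mtimes, in sorted (deterministic) order."""
    h = hashlib.sha1()
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        st = entry.stat(follow_symlinks=False)  # cached after first call
        h.update(f"{entry.name}\x00{st.st_mtime_ns}\n".encode("utf-8"))
    return h.hexdigest()
```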

Re: [Mixxx-devel] Directory-hash calculation during library-rescan

2013-06-09 Thread petah

Hmm, just using the filename would worry me; I have a lot of tracks with the 
same name and even the same metadata. Usually I got the crappy version first 
(p2p, a friend, a nasty vinyl rip) and the high-quality, legit version later. 
Timestamps quickly go haywire even if you don't travel across timezones with an 
NTP-synced laptop, and I wouldn't trust a filesystem's metadata anyway, since 
the medium could be handed over by someone else at the last minute. I don't 
think git trusts either one.
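[Editor's note: git indeed ignores both names and timestamps for identity; a blob's id is a hash over a short header plus the raw bytes. A minimal sketch of that idea:]

```python
import hashlib

def blob_id(data: bytes) -> str:
    """git-style content id: name and mtime play no role at all."""
    header = f"blob {len(data)}\x00".encode()
    return hashlib.sha1(header + data).hexdigest()

# The empty file hashes to git's well-known empty-blob id:
assert blob_id(b"") == "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```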

Before a gig I really need to be absolutely sure the file I'm queuing doesn't 
have a huge glitch halfway through and isn't the low-gain version that'll come 
out sounding like crap. Presumably embedding metadata isn't on the table because 
of WAV/AIFF support?

Regarding moving files, IIRC Traktor solved this by requiring you to do the move 
from within its UI; otherwise it popped up a "where is it now?" dialog, which 
was a pain in the ass.

-- p





[Mixxx-devel] Directory-hash calculation during library-rescan

2013-06-05 Thread Steven Boswell II
Right now I'm rescanning my track collection on my laptop, and I expected it to 
be a short operation, but it's taking forever.  I'm guessing it's because I 
moved a large part of my track collection to a different partition on my hard 
drive, even though the path remained the same.  The issue is that QDirIterator 
doesn't present the tracks in any particular order; it'll be dictated by how 
they happened to be stored in the underlying filesystem.  When I moved the 
track collection, that permuted the files in each directory.

The issue is that the directory-hash calculation isn't order-independent; it 
calculates the directory hash by appending all the filenames to one string.  If 
we changed it to calculate the hash of each filename individually, and XORed 
those together, then it would be order-independent, and shouldn't detect fewer 
changes than the previous method.

Thoughts?

Steven Boswell