Re: [PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

2017-03-01 Thread Gregory Szorc
On Wed, Mar 1, 2017 at 4:46 PM, Mike Hommey  wrote:

> On Wed, Mar 01, 2017 at 04:34:43PM -0800, Gregory Szorc wrote:
> > On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori <
> fo...@lares.dti.ne.jp>
> > wrote:
> >
> > > # HG changeset patch
> > > # User FUJIWARA Katsunori 
> > > # Date 1488380487 -32400
> > > #  Thu Mar 02 00:01:27 2017 +0900
> > > # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> > > # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> > > similar: allow similarity detection to use sha256 for digesting file
> > > contents
> > >
> > > Before this patch, similarity detection logic (used for addremove and
> > > automv) uses SHA-1 digesting. But this cause incorrect rename
> > > detection, if:
> > >
> > >   - removing file A and adding file B occur at same committing, and
> > >   - SHA-1 hash values of file A and B are same
> > >
> > > This may prevent security experts from managing sample files for
> > > SHAttered issue in Mercurial repository, for example.
> > >
> > >   https://security.googleblog.com/2017/02/announcing-first-
> > > sha1-collision.html
> > >   https://shattered.it/
> > >
> > > Hash collision itself isn't so serious for core repository
> > > functionality of Mercurial, described by mpm as below, though.
> > >
> > >   https://www.mercurial-scm.org/wiki/mpm/SHA1
> > >
> > > HOW ABOUT:
> > >
> > >   - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
> > >
> >
> > SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's
> > likely no good reason to use SHA-256 for simple identity checks.
> >
> >
> > >
> > > ease (handling problematic files safely by default) or
> > > performance?
> > >
> >
> >
> > On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and
> > SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time
> content
> > identity checks, hashing shouldn't be a bottleneck, at least not for most
> > repos.
> >
> > So I think it is fine to change this function from SHA-1 to SHA-512
> > assuming the hashes don't "leak" into storage. If they end up being
> stored
> > or used for something other than identity checks, then we need to bloat
> > scope to discuss our general hashing future. And that needs its own
> thread
> > ;)
>
> With hashing, there is *always* the risk of collision. It might be tiny,
> but it still exists. Why not just compare the contents when the hash
> match? Then it doesn't really matter what the hash is. The hash is just
> a shortcut to avoid comparing full contents in a O(n^2) fashion.
>
> There aren't going to be that many hash matches anyways, comparing the
> content then should not make a significant difference in speed, but
> would guarantee that the "similarity" is real.
>

Yeah, the objects will already be in memory. Under the hood bytes.__eq__
just looks at the buffer length then falls back to memcmp, which should be
on the order of 10 GB/s. So using __eq__ here seems reasonable.


>
> (BTW, interestingly, in terms of similarity detection, while the
> SHAttered PDFs are not 100% identical, they are 80%+ similar)
>
> Mike
>
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: [PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

2017-03-01 Thread Mike Hommey
On Wed, Mar 01, 2017 at 04:34:43PM -0800, Gregory Szorc wrote:
> On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori 
> wrote:
> 
> > # HG changeset patch
> > # User FUJIWARA Katsunori 
> > # Date 1488380487 -32400
> > #  Thu Mar 02 00:01:27 2017 +0900
> > # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> > # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> > similar: allow similarity detection to use sha256 for digesting file
> > contents
> >
> > Before this patch, similarity detection logic (used for addremove and
> > automv) uses SHA-1 digesting. But this cause incorrect rename
> > detection, if:
> >
> >   - removing file A and adding file B occur at same committing, and
> >   - SHA-1 hash values of file A and B are same
> >
> > This may prevent security experts from managing sample files for
> > SHAttered issue in Mercurial repository, for example.
> >
> >   https://security.googleblog.com/2017/02/announcing-first-
> > sha1-collision.html
> >   https://shattered.it/
> >
> > Hash collision itself isn't so serious for core repository
> > functionality of Mercurial, described by mpm as below, though.
> >
> >   https://www.mercurial-scm.org/wiki/mpm/SHA1
> >
> > HOW ABOUT:
> >
> >   - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
> >
> 
> SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's
> likely no good reason to use SHA-256 for simple identity checks.
> 
> 
> >
> > ease (handling problematic files safely by default) or
> > performance?
> >
> 
> 
> On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and
> SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time content
> identity checks, hashing shouldn't be a bottleneck, at least not for most
> repos.
> 
> So I think it is fine to change this function from SHA-1 to SHA-512
> assuming the hashes don't "leak" into storage. If they end up being stored
> or used for something other than identity checks, then we need to bloat
> scope to discuss our general hashing future. And that needs its own thread
> ;)

With hashing, there is *always* the risk of collision. It might be tiny,
but it still exists. Why not just compare the contents when the hash
match? Then it doesn't really matter what the hash is. The hash is just
a shortcut to avoid comparing full contents in a O(n^2) fashion.

There aren't going to be that many hash matches anyways, comparing the
content then should not make a significant difference in speed, but
would guarantee that the "similarity" is real.

(BTW, interestingly, in terms of similarity detection, while the
SHAttered PDFs are not 100% identical, they are 80%+ similar)

Mike
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: [PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

2017-03-01 Thread Gregory Szorc
On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori 
wrote:

> # HG changeset patch
> # User FUJIWARA Katsunori 
> # Date 1488380487 -32400
> #  Thu Mar 02 00:01:27 2017 +0900
> # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> similar: allow similarity detection to use sha256 for digesting file
> contents
>
> Before this patch, similarity detection logic (used for addremove and
> automv) uses SHA-1 digesting. But this cause incorrect rename
> detection, if:
>
>   - removing file A and adding file B occur at same committing, and
>   - SHA-1 hash values of file A and B are same
>
> This may prevent security experts from managing sample files for
> SHAttered issue in Mercurial repository, for example.
>
>   https://security.googleblog.com/2017/02/announcing-first-
> sha1-collision.html
>   https://shattered.it/
>
> Hash collision itself isn't so serious for core repository
> functionality of Mercurial, described by mpm as below, though.
>
>   https://www.mercurial-scm.org/wiki/mpm/SHA1
>
> HOW ABOUT:
>
>   - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
>

SHA-512 should be faster than SHA-256 on 64-bit hardware. So, there's
likely no good reason to use SHA-256 for simple identity checks.


>
> ease (handling problematic files safely by default) or
> performance?
>


On my Skylake at 4.0 GHz, SHA-1 is capable of running at ~975 MB/s and
SHA-512 at ~700 MB/s. Both are fast enough that for simple one-time content
identity checks, hashing shouldn't be a bottleneck, at least not for most
repos.

So I think it is fine to change this function from SHA-1 to SHA-512
assuming the hashes don't "leak" into storage. If they end up being stored
or used for something other than identity checks, then we need to bloat
scope to discuss our general hashing future. And that needs its own thread
;)


>
>   - what name of config knob is reasonable to control digesting algorithm ?
>
> or should we fully switch to more secure digest alrgorithm ?
>
> BTW, almost all of SHA-1 clients in Mercurial source tree applies it
> on other than file contents (e.g. list of parent hash ids, file path,
> URL, and so on). But some SHA-1 clients below apply it on file
> contents.
>
>   - patch.trydiff()
>
> SHA-1 is applied on "blob SIZE\0" header + file content (only if
> experimental.extendedheader.index is enabled)
>
>   - largefiles
>
> The former should be less serious.
>
> On the other hand, the latter causes unintentional unification between
> largefiles, which cause same SHA-1 hash value, at checking files out.
>


>
> diff --git a/mercurial/similar.py b/mercurial/similar.py
> --- a/mercurial/similar.py
> +++ b/mercurial/similar.py
> @@ -12,9 +12,15 @@ import hashlib
>  from .i18n import _
>  from . import (
>  bdiff,
> +error,
>  mdiff,
>  )
>
> +DIGESTERS = {
> +'sha1': hashlib.sha1,
> +'sha256': hashlib.sha256,
> +}
> +
>  def _findexactmatches(repo, added, removed):
>  '''find renamed files that have no changes
>
> @@ -23,19 +29,25 @@ def _findexactmatches(repo, added, remov
>  '''
>  numfiles = len(added) + len(removed)
>
> +digest = repo.ui.config('ui', 'similaritydigest', 'sha256')
> +if digest not in DIGESTERS:
> +raise error.ConfigError(_("ui.similaritydigest value is invalid:
> %s")
> +% digest)
> +digester = DIGESTERS[digest]
> +
>  # Get hashes of removed files.
>  hashes = {}
>  for i, fctx in enumerate(removed):
>  repo.ui.progress(_('searching for exact renames'), i,
> total=numfiles,
>   unit=_('files'))
> -h = hashlib.sha1(fctx.data()).digest()
> +h = digester(fctx.data()).digest()
>  hashes[h] = fctx
>
>  # For each added file, see if it corresponds to a removed file.
>  for i, fctx in enumerate(added):
>  repo.ui.progress(_('searching for exact renames'), i +
> len(removed),
>  total=numfiles, unit=_('files'))
> -h = hashlib.sha1(fctx.data()).digest()
> +h = digester(fctx.data()).digest()
>  if h in hashes:
>  yield (hashes[h], fctx)
>
> ___
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
>
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel


Re: [PATCH RFC] similar: allow similarity detection to use sha256 for digesting file contents

2017-03-01 Thread Martin von Zweigbergk via Mercurial-devel
On Wed, Mar 1, 2017 at 7:02 AM, FUJIWARA Katsunori
 wrote:
> # HG changeset patch
> # User FUJIWARA Katsunori 
> # Date 1488380487 -32400
> #  Thu Mar 02 00:01:27 2017 +0900
> # Node ID 018d9759cb93f116007d4640341a82db6cf2d45c
> # Parent  0bb3089fe73527c64f1afc40b86ecb8dfe7fd7aa
> similar: allow similarity detection to use sha256 for digesting file contents
>
> Before this patch, similarity detection logic (used for addremove and
> automv) uses SHA-1 digesting. But this cause incorrect rename
> detection, if:
>
>   - removing file A and adding file B occur at same committing, and
>   - SHA-1 hash values of file A and B are same

If rename detection in this very special case is all that's lost, I
don't think the extra complexity (code and config) is worth it.

>
> This may prevent security experts from managing sample files for
> SHAttered issue in Mercurial repository, for example.
>
>   https://security.googleblog.com/2017/02/announcing-first-sha1-collision.html
>   https://shattered.it/
>
> Hash collision itself isn't so serious for core repository
> functionality of Mercurial, described by mpm as below, though.
>
>   https://www.mercurial-scm.org/wiki/mpm/SHA1
>
> HOW ABOUT:
>
>   - which should we use default algorithm SHA-1, SHA-256 or SHA-512 ?
>
> ease (handling problematic files safely by default) or
> performance?
>
>   - what name of config knob is reasonable to control digesting algorithm ?
>
> or should we fully switch to more secure digest alrgorithm ?

Yes, I'd rather wait until we do that.

>
> BTW, almost all of SHA-1 clients in Mercurial source tree applies it
> on other than file contents (e.g. list of parent hash ids, file path,
> URL, and so on). But some SHA-1 clients below apply it on file
> contents.
>
>   - patch.trydiff()
>
> SHA-1 is applied on "blob SIZE\0" header + file content (only if
> experimental.extendedheader.index is enabled)
>
>   - largefiles
>
> The former should be less serious.
>
> On the other hand, the latter causes unintentional unification between
> largefiles, which cause same SHA-1 hash value, at checking files out.
>
> diff --git a/mercurial/similar.py b/mercurial/similar.py
> --- a/mercurial/similar.py
> +++ b/mercurial/similar.py
> @@ -12,9 +12,15 @@ import hashlib
>  from .i18n import _
>  from . import (
>  bdiff,
> +error,
>  mdiff,
>  )
>
> +DIGESTERS = {
> +'sha1': hashlib.sha1,
> +'sha256': hashlib.sha256,
> +}
> +
>  def _findexactmatches(repo, added, removed):
>  '''find renamed files that have no changes
>
> @@ -23,19 +29,25 @@ def _findexactmatches(repo, added, remov
>  '''
>  numfiles = len(added) + len(removed)
>
> +digest = repo.ui.config('ui', 'similaritydigest', 'sha256')
> +if digest not in DIGESTERS:
> +raise error.ConfigError(_("ui.similaritydigest value is invalid: %s")
> +% digest)
> +digester = DIGESTERS[digest]
> +
>  # Get hashes of removed files.
>  hashes = {}
>  for i, fctx in enumerate(removed):
>  repo.ui.progress(_('searching for exact renames'), i, total=numfiles,
>   unit=_('files'))
> -h = hashlib.sha1(fctx.data()).digest()
> +h = digester(fctx.data()).digest()
>  hashes[h] = fctx
>
>  # For each added file, see if it corresponds to a removed file.
>  for i, fctx in enumerate(added):
>  repo.ui.progress(_('searching for exact renames'), i + len(removed),
>  total=numfiles, unit=_('files'))
> -h = hashlib.sha1(fctx.data()).digest()
> +h = digester(fctx.data()).digest()
>  if h in hashes:
>  yield (hashes[h], fctx)
>
> ___
> Mercurial-devel mailing list
> Mercurial-devel@mercurial-scm.org
> https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel
___
Mercurial-devel mailing list
Mercurial-devel@mercurial-scm.org
https://www.mercurial-scm.org/mailman/listinfo/mercurial-devel