Re: btrfs-image hash collision option, super slow

2017-11-13 Thread Chris Murphy
On Mon, Nov 13, 2017 at 1:02 AM, Piotr Pawłow  wrote:
> W dniu 13.11.2017 o 04:42, Chris Murphy pisze:
>> Strange. I was using 4.3.3 and it had been running for over 9 hours at
>> the time I finally cancelled it.
>
> If you're compiling from source, the usual advice would be to "make clean" 
> and make sure you're using the correct executable.

Interesting, using btrfs-progs-4.13.3-1.fc28.x86_64, it goes super
fast with -ss (minutes). But when using 4.3.3 built myself (nothing
special, autogen, configure, make) it's super slow, hours.




-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-image hash collision option, super slow

2017-11-13 Thread Piotr Pawłow
W dniu 13.11.2017 o 04:42, Chris Murphy pisze:
> Strange. I was using 4.3.3 and it had been running for over 9 hours at
> the time I finally cancelled it.

If you're compiling from source, the usual advice would be to "make clean" and 
make sure you're using the correct executable.

If your fs is very large then caching may be a problem. With the brute force 
method it was a good idea to cache the results. With the "reverse crc" method 
caching is not very useful. It's only marginally faster on my root fs, and on 
larger filesystems searching the cache will be slower than finding collisions. 
You can remove the code and see if it makes any difference:

diff --git a/image/main.c b/image/main.c
index 4cffbdba..f77b1504 100644
--- a/image/main.c
+++ b/image/main.c
@@ -500,19 +500,9 @@ static char *find_collision(struct metadump_struct *md, 
char *name,
    u32 name_len)
 {
    struct name *val;
-   struct rb_node *entry;
-   struct name tmp;
    int found;
    int i;
 
-   tmp.val = name;
-   tmp.len = name_len;
-   entry = tree_search(>name_tree, , name_cmp, 0);
-   if (entry) {
-   val = rb_entry(entry, struct name, n);
-   free(name);
-   return val->sub;
-   }
 
    val = malloc(sizeof(struct name));
    if (!val) {
@@ -548,7 +538,6 @@ static char *find_collision(struct metadump_struct *md, 
char *name,
    }
    }
 
-   tree_insert(>name_tree, >n, name_cmp);
    return val->sub;
 }

>
>> Unfortunately there are no CRC32 collisions for any file name having 4 or 
>> less characters when you have to keep the same file name length, and there 
>> may be no collisions for longer file names when you limit the result to 
>> ASCII only.
> Gotcha.

Yeah, it also means for short file names an attacker can easily find the real 
name by finding all collisions and filtering out obviously nonsensical names, 
so it's more of an obfuscation than sanitization :/

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-image hash collision option, super slow

2017-11-12 Thread Chris Murphy
On Sun, Nov 12, 2017 at 7:05 AM, Piotr Pawłow  wrote:
>
>>>It could definitely be improved -- I believe there are some good
>>> (but non-trivial) algorithms for finding preimages for CRC32 checksums
>>> out there. It's just that btrfs-image doesn't use them.
>
> I implemented a faster method using a reverse CRC32 table, which is in 
> btrfs-progs since release 4.13.1.
>
> On my 20GB root fs SSD:
>
> # echo 3 >/proc/sys/vm/drop_caches && time btrfs-image -s /dev/mapper/pp-root 
> /dev/null >/dev/null 2>&1
>
> real0m22,023s
> user0m14,812s
> sys0m2,508s
> # echo 3 >/proc/sys/vm/drop_caches && time btrfs-image -ss 
> /dev/mapper/pp-root /dev/null >/dev/null 2>&1
>
> real0m31,977s
> user0m17,984s
> sys0m2,632s
>
> Slower, but not 2 orders of magnitude slower :)


Strange. I was using 4.3.3 and it had been running for over 9 hours at
the time I finally cancelled it.


>
>> The other part I'm not groking is that some filenames fail with:
>>
>> WARNING: cannot find a hash collision for 'Tool', generating garbage,
>> it won't match indexes
>
> Unfortunately there are no CRC32 collisions for any file name having 4 or 
> less characters when you have to keep the same file name length, and there 
> may be no collisions for longer file names when you limit the result to ASCII 
> only.

Gotcha.



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-image hash collision option, super slow

2017-11-12 Thread Duncan
Piotr Pawłow posted on Sun, 12 Nov 2017 15:05:44 +0100 as excerpted:

>> The other part I'm not groking is that some filenames fail with:
>>
>> WARNING: cannot find a hash collision for 'Tool', generating garbage,
>> it won't match indexes
> 
> Unfortunately there are no CRC32 collisions for any file name having 4
> or less characters when you have to keep the same file name length, and
> there may be no collisions for longer file names when you limit the
> result to ASCII only.

Hmm...  Sounds reasonable, tho I hadn't thought/read of it before.  But 
now that I am...

What's the statistical collision spread as the number of characters 
increases?

More worrying than not having /any/ collisions might be having only one, 
or for that matter, some known small number less than say 1024, and 
certainly less than 64 or so, since for the only one collision 
possibility, once you know it wasn't the one you have, you know which one 
it actually was, and chances are, with a bit of information about the 
target in question, such as a suspected crime or the department at a 
competing company the filesystem was from, 64 or even something like 1024 
possibilities are trivial to look up in some rainbow table somewhere and 
see which possibilities look interesting.   Get a few of those and pretty 
soon you can start putting a story together.

So it might make sense to (at least have an option to, maybe -sss) force 
garbage for anything under say 6 chars or whatever, depending on how fast 
the number of collisions rise, to avoid the information leak due to too 
few collisions issue, particularly given that once we assume people 
consider it worth going to the trouble in the first place, they likely 
/are/ going to be concerned about such low number of collisions 
information leakage.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-image hash collision option, super slow

2017-11-12 Thread Piotr Pawłow

>>It could definitely be improved -- I believe there are some good
>> (but non-trivial) algorithms for finding preimages for CRC32 checksums
>> out there. It's just that btrfs-image doesn't use them.

I implemented a faster method using a reverse CRC32 table, which is in 
btrfs-progs since release 4.13.1.

On my 20GB root fs SSD:

# echo 3 >/proc/sys/vm/drop_caches && time btrfs-image -s /dev/mapper/pp-root 
/dev/null >/dev/null 2>&1

real    0m22,023s
user    0m14,812s
sys    0m2,508s
# echo 3 >/proc/sys/vm/drop_caches && time btrfs-image -ss /dev/mapper/pp-root 
/dev/null >/dev/null 2>&1

real    0m31,977s
user    0m17,984s
sys    0m2,632s

Slower, but not 2 orders of magnitude slower :)

> The other part I'm not groking is that some filenames fail with:
>
> WARNING: cannot find a hash collision for 'Tool', generating garbage,
> it won't match indexes

Unfortunately there are no CRC32 collisions for any file name having 4 or less 
characters when you have to keep the same file name length, and there may be no 
collisions for longer file names when you limit the result to ASCII only.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-image hash collision option, super slow

2017-11-11 Thread Chris Murphy
On Sat, Nov 11, 2017 at 5:42 PM, Hugo Mills  wrote:
> On Sat, Nov 11, 2017 at 05:18:33PM -0700, Chris Murphy wrote:
>> OK this might be in the stupid questions category, but I'm not
>> understanding the purpose of computing hash collisions with -ss. Or
>> more correctly, why it's taking so much longer than -s.
>>
>> It seems like what we'd want is every filename to have the same hash,
>> but for the file to go through a PBKDF so the hashes we get aren't
>> (easily) brute forced. So I totally understand that -ss should take
>> much longer than -s, but this is at least two orders magnitude longer
>> (so far). That's why I'm confused.
>>
>> -s option on this file system took 5 minutes, start to finish.
>> -ss option is at 8 hours and counting.
>>
>> The other part I'm not groking is that some filenames fail with:
>>
>> WARNING: cannot find a hash collision for 'Tool', generating garbage,
>> it won't match indexes
>>
>> So? That seems like an undesirable outcome. And if it were just being
>> pushed through a PBKDF function, it wouldn't fail. Every
>> file/directory "Tool" would get the same hash on *this* run of
>> btrs-image. If I run it again, or someone else runs it, they'd get
>> some other hash (same hashes for each instance of "Tool" on their
>> filesystem).
>
>In the FS tree, you can go from the inode of the file to its name
> (where the inode is in the index, and the name is stored in the
> corresponding data item). Alternatively, you can go from the filename
> to the inode. In the latter case, since the keys are a structured 17
> byte object, you obviously can't fit the whole filename into the key,
> so the filename is hashed (using, IIRC, CRC32), and it's the hash that
> appears in the key of the index.
>
>When an image is made without the -s options, the whole metadata is
> stored, including all the filenames in the data items. For some
> people, that's a security risk, and they don't want their filenames
> leaking out, so -s exists to put junk in the filename records.
> However, it doesn't change the hashes in the index to correspond with
> the modified filenames, because that would at minimum require the
> whole tree to be rebuilt (because all the items would have different
> hashes, and hence different ordering in the index). This is a bad
> thing for debugging, because you're not getting the details of the
> tree as it was in the broken filesystem. So, in this case, the image
> is actually broken, because the filenames don't match the hashes.
>
>Most of the time, that's absolutely fine, because the thing being
> debugged is somewhere else, and it doesn't matter that "ls" on the
> restored FS won't work right.
>
>However, in some (possibly hypothetical) cases, it _does_ matter,
> and you do need the hashes to match the filenames. This is where -ss
> comes in. We can't generate random filenames and then take the hashes
> of those, because of the undesirability of rewriting the whole FS tree
> to reindex it with the changed hashes. So, what -ss tries to do is
> stick with the original hashes and find arbitrary filenames which
> match them. It's (I think) CRC32, so it shouldn't be too hard, but
> it's still non-trivial amounts of work to reverse engineer a
> human-readable ASCII filename which hashes to a given value.
> Particularly if, as was the case when Josef wrote it, a simple
> brute-force algorithm was used.
>
>It could definitely be improved -- I believe there are some good
> (but non-trivial) algorithms for finding preimages for CRC32 checksums
> out there. It's just that btrfs-image doesn't use them. However, it's
> not an option that's needed very often, so it's probably not worth
> putting in the effort to fix it up. (I definitely remember Josef
> commenting on IRC when he wrote -s and -ss that it could almost
> certainly be done more efficiently, but he had bigger fish to fry at
> the time, like fixing the broken FS he was working on)
>
>As to the thing where it's not finding a pre-image at all -- I'm
> guessing here, but it's possible that this is a case where two of the
> orginal filenames hashed to the same value. If that happens, one of
> the hashes is incremented by a small integer in a predictable way
> before storage. So it may be that the resulting value isn't mappable
> to an ASCII pre-image, or that the search just gives up before finding
> one.
>

Super explanation, thanks Hugo. Maybe it's worth an additional note in
the man page, just how expensive it is. I'm definitely regretting not
starting this imaging in a tmux session.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs-image hash collision option, super slow

2017-11-11 Thread Hugo Mills
On Sat, Nov 11, 2017 at 05:18:33PM -0700, Chris Murphy wrote:
> OK this might be in the stupid questions category, but I'm not
> understanding the purpose of computing hash collisions with -ss. Or
> more correctly, why it's taking so much longer than -s.
> 
> It seems like what we'd want is every filename to have the same hash,
> but for the file to go through a PBKDF so the hashes we get aren't
> (easily) brute forced. So I totally understand that -ss should take
> much longer than -s, but this is at least two orders magnitude longer
> (so far). That's why I'm confused.
> 
> -s option on this file system took 5 minutes, start to finish.
> -ss option is at 8 hours and counting.
> 
> The other part I'm not groking is that some filenames fail with:
> 
> WARNING: cannot find a hash collision for 'Tool', generating garbage,
> it won't match indexes
> 
> So? That seems like an undesirable outcome. And if it were just being
> pushed through a PBKDF function, it wouldn't fail. Every
> file/directory "Tool" would get the same hash on *this* run of
> btrs-image. If I run it again, or someone else runs it, they'd get
> some other hash (same hashes for each instance of "Tool" on their
> filesystem).

   In the FS tree, you can go from the inode of the file to its name
(where the inode is in the index, and the name is stored in the
corresponding data item). Alternatively, you can go from the filename
to the inode. In the latter case, since the keys are a structured 17
byte object, you obviously can't fit the whole filename into the key,
so the filename is hashed (using, IIRC, CRC32), and it's the hash that
appears in the key of the index.

   When an image is made without the -s options, the whole metadata is
stored, including all the filenames in the data items. For some
people, that's a security risk, and they don't want their filenames
leaking out, so -s exists to put junk in the filename records.
However, it doesn't change the hashes in the index to correspond with
the modified filenames, because that would at minimum require the
whole tree to be rebuilt (because all the items would have different
hashes, and hence different ordering in the index). This is a bad
thing for debugging, because you're not getting the details of the
tree as it was in the broken filesystem. So, in this case, the image
is actually broken, because the filenames don't match the hashes.

   Most of the time, that's absolutely fine, because the thing being
debugged is somewhere else, and it doesn't matter that "ls" on the
restored FS won't work right.

   However, in some (possibly hypothetical) cases, it _does_ matter,
and you do need the hashes to match the filenames. This is where -ss
comes in. We can't generate random filenames and then take the hashes
of those, because of the undesirability of rewriting the whole FS tree
to reindex it with the changed hashes. So, what -ss tries to do is
stick with the original hashes and find arbitrary filenames which
match them. It's (I think) CRC32, so it shouldn't be too hard, but
it's still non-trivial amounts of work to reverse engineer a
human-readable ASCII filename which hashes to a given value.
Particularly if, as was the case when Josef wrote it, a simple
brute-force algorithm was used.

   It could definitely be improved -- I believe there are some good
(but non-trivial) algorithms for finding preimages for CRC32 checksums
out there. It's just that btrfs-image doesn't use them. However, it's
not an option that's needed very often, so it's probably not worth
putting in the effort to fix it up. (I definitely remember Josef
commenting on IRC when he wrote -s and -ss that it could almost
certainly be done more efficiently, but he had bigger fish to fry at
the time, like fixing the broken FS he was working on)

   As to the thing where it's not finding a pre-image at all -- I'm
guessing here, but it's possible that this is a case where two of the
orginal filenames hashed to the same value. If that happens, one of
the hashes is incremented by a small integer in a predictable way
before storage. So it may be that the resulting value isn't mappable
to an ASCII pre-image, or that the search just gives up before finding
one.

   Hugo.

-- 
Hugo Mills | Yes, this is an example of something that becomes
hugo@... carfax.org.uk | less explosive as a one-to-one cocrystal with TNT.
http://carfax.org.uk/  | (Hexanitrohexaazaisowurtzitane)
PGP: E2AB1DE4  |Derek Lowe


signature.asc
Description: Digital signature


btrfs-image hash collision option, super slow

2017-11-11 Thread Chris Murphy
OK this might be in the stupid questions category, but I'm not
understanding the purpose of computing hash collisions with -ss. Or
more correctly, why it's taking so much longer than -s.

It seems like what we'd want is every filename to have the same hash,
but for the file to go through a PBKDF so the hashes we get aren't
(easily) brute forced. So I totally understand that -ss should take
much longer than -s, but this is at least two orders magnitude longer
(so far). That's why I'm confused.

-s option on this file system took 5 minutes, start to finish.
-ss option is at 8 hours and counting.

The other part I'm not groking is that some filenames fail with:

WARNING: cannot find a hash collision for 'Tool', generating garbage,
it won't match indexes

So? That seems like an undesirable outcome. And if it were just being
pushed through a PBKDF function, it wouldn't fail. Every
file/directory "Tool" would get the same hash on *this* run of
btrs-image. If I run it again, or someone else runs it, they'd get
some other hash (same hashes for each instance of "Tool" on their
filesystem).



-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html