Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-25 Thread Rob Crittenden

Petr Viktorin wrote:

On 07/24/2012 08:36 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 04:50 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Rob Crittenden wrote:

Petr Viktorin wrote:

On 07/24/2012 02:49 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as
time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to
use an
integer value.  Instead, we can take time and attribute name. To
get
reasonable 'randomness' these values are then hashed with sha1
and
use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it
doesn't
add any randomness at all. It just obscures what's really
happening.

Hence using quotes to describe it. We don't need randomness in the
task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second
resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have
enough
guarantee for $date (time.time()) to be unique in sufficiently
tight
conditions, thus use of SHA-1 to generate something that has
better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique,
neither is SHA1("indextask_$date_$attribute"). Hashing has no
effect
on the chance of collisions.

You could use Python's pseudorandom number generator
(random.randint)
instead of random.SystemRandom. It's not cryptographically secure
but
it's enough to avoid collisions, and it doesn't use up system
entropy
(except for initial seeding, through `import random`).
"indextask_$date_$attribute_$pseudorandomvalue" should be unique
enough.

Using random here is really bad.
What we ideally need is a method to increment sequential calls for
the same attribute.We use seconds to differentiate
between all tasks but that is not really required, tasks that were
processed will be removed.



Or maybe use $date_$attribute and just warn (ignore the error) if
there's a duplicate -- if a reindex task for the same attribute
already
exists from the same second, do we really need to start a new one?



That is true.

We can generate a UUID, I think that is probably a better/safer thing
to use overall.

Updated patch attached.

2012-07-24T14:36:31Z INFO Creating task to index attribute: memberOf
2012-07-24T14:36:31Z DEBUG Task id:
cn=indextask_memberOf_135624333918176170_14302,cn=index,cn=tasks,cn=config


2012-07-24T14:36:32Z INFO Indexing finished
...
2012-07-24T14:36:43Z INFO Creating task to index attribute: memberUser
2012-07-24T14:36:43Z DEBUG Task id:
cn=indextask_memberUser_135624334038532670_14302,cn=index,cn=tasks,cn=config



2012-07-24T14:36:44Z INFO Indexing finished

You can note that clock_seq does not change, it is because uuid.uuid1()
uses nanosecond resolution and uuid_generate_time() if the latter is
available. If we happen to ask for uuids within sub-nanosecond
interval,
clock_seq will be different then.

I'm extracting only time and clock_seq instead of pasting full uuid
value because uuid_generate_time() will leak out ethernet MAC address
for 48-bit node. We don't need more bits here so I drop these 48 bits
and avoid publishing the ethernet MAC address, even in logs.

--
/ Alexander Bokovoy


Yes, this approach will work. Just some technical comments:



+self.sub_dict['TIME'] = cn_uuid.get_time()
+self.sub_dict['SEQ'] = cn_uuid.get_clock_seq()


Use the attributes, `cn_uuid.time` and `cn_uuid.clock_seq`. The
accessor methods are undocumented.

Fixed.




+self.sub_dict['ATTR'] = attribute

-# Refresh the time to make uniqueness more probable. Add on
some
-# randomness for good measure.
-self.sub_dict['TIME'] = int(time.time()) + r.randint(0,1)
+cn = self._template_str("indextask_${ATTR}_${TIME}_${SEQ}")


You're overwriting sub_dict['TIME'], which is used elsewhere in the
class. That could cause trouble.

Since it was set here and others are using it, I kept it updated in a
new version as well, based on cn_uuid.time. But since cn_uuid.time is in
nanoseconds resolution, I have to divide th

Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-25 Thread Petr Viktorin

On 07/24/2012 08:36 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 04:50 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Rob Crittenden wrote:

Petr Viktorin wrote:

On 07/24/2012 02:49 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as
time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To
get
reasonable 'randomness' these values are then hashed with sha1
and
use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it
doesn't
add any randomness at all. It just obscures what's really
happening.

Hence using quotes to describe it. We don't need randomness in the
task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second
resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently
tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique,
neither is SHA1("indextask_$date_$attribute"). Hashing has no effect
on the chance of collisions.

You could use Python's pseudorandom number generator
(random.randint)
instead of random.SystemRandom. It's not cryptographically secure
but
it's enough to avoid collisions, and it doesn't use up system
entropy
(except for initial seeding, through `import random`).
"indextask_$date_$attribute_$pseudorandomvalue" should be unique
enough.

Using random here is really bad.
What we ideally need is a method to increment sequential calls for
the same attribute.We use seconds to differentiate
between all tasks but that is not really required, tasks that were
processed will be removed.



Or maybe use $date_$attribute and just warn (ignore the error) if
there's a duplicate -- if a reindex task for the same attribute
already
exists from the same second, do we really need to start a new one?



That is true.

We can generate a UUID, I think that is probably a better/safer thing
to use overall.

Updated patch attached.

2012-07-24T14:36:31Z INFO Creating task to index attribute: memberOf
2012-07-24T14:36:31Z DEBUG Task id:
cn=indextask_memberOf_135624333918176170_14302,cn=index,cn=tasks,cn=config

2012-07-24T14:36:32Z INFO Indexing finished
...
2012-07-24T14:36:43Z INFO Creating task to index attribute: memberUser
2012-07-24T14:36:43Z DEBUG Task id:
cn=indextask_memberUser_135624334038532670_14302,cn=index,cn=tasks,cn=config


2012-07-24T14:36:44Z INFO Indexing finished

You can note that clock_seq does not change, it is because uuid.uuid1()
uses nanosecond resolution and uuid_generate_time() if the latter is
available. If we happen to ask for uuids within sub-nanosecond interval,
clock_seq will be different then.

I'm extracting only time and clock_seq instead of pasting full uuid
value because uuid_generate_time() will leak out ethernet MAC address
for 48-bit node. We don't need more bits here so I drop these 48 bits
and avoid publishing the ethernet MAC address, even in logs.

--
/ Alexander Bokovoy


Yes, this approach will work. Just some technical comments:



+self.sub_dict['TIME'] = cn_uuid.get_time()
+self.sub_dict['SEQ'] = cn_uuid.get_clock_seq()


Use the attributes, `cn_uuid.time` and `cn_uuid.clock_seq`. The
accessor methods are undocumented.

Fixed.




+self.sub_dict['ATTR'] = attribute

-# Refresh the time to make uniqueness more probable. Add on
some
-# randomness for good measure.
-self.sub_dict['TIME'] = int(time.time()) + r.randint(0,1)
+cn = self._template_str("indextask_${ATTR}_${TIME}_${SEQ}")


You're overwriting sub_dict['TIME'], which is used elsewhere in the
class. That could cause trouble.

Since it was set here and others are using it, I kept it updated in a
new version as well, based on cn_uuid.time. But since cn_uuid.time is in
nanoseconds resolution, I have to divide the value by 1e9.


There'

Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Alexander Bokovoy

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 04:50 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Rob Crittenden wrote:

Petr Viktorin wrote:

On 07/24/2012 02:49 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To
get
reasonable 'randomness' these values are then hashed with sha1 and
use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it
doesn't
add any randomness at all. It just obscures what's really happening.

Hence using quotes to describe it. We don't need randomness in the
task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second
resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique,
neither is SHA1("indextask_$date_$attribute"). Hashing has no effect
on the chance of collisions.

You could use Python's pseudorandom number generator (random.randint)
instead of random.SystemRandom. It's not cryptographically secure but
it's enough to avoid collisions, and it doesn't use up system entropy
(except for initial seeding, through `import random`).
"indextask_$date_$attribute_$pseudorandomvalue" should be unique
enough.

Using random here is really bad.
What we ideally need is a method to increment sequential calls for
the same attribute.We use seconds to differentiate
between all tasks but that is not really required, tasks that were
processed will be removed.



Or maybe use $date_$attribute and just warn (ignore the error) if
there's a duplicate -- if a reindex task for the same attribute already
exists from the same second, do we really need to start a new one?



That is true.

We can generate a UUID, I think that is probably a better/safer thing
to use overall.

Updated patch attached.

2012-07-24T14:36:31Z INFO Creating task to index attribute: memberOf
2012-07-24T14:36:31Z DEBUG Task id:
cn=indextask_memberOf_135624333918176170_14302,cn=index,cn=tasks,cn=config
2012-07-24T14:36:32Z INFO Indexing finished
...
2012-07-24T14:36:43Z INFO Creating task to index attribute: memberUser
2012-07-24T14:36:43Z DEBUG Task id:
cn=indextask_memberUser_135624334038532670_14302,cn=index,cn=tasks,cn=config

2012-07-24T14:36:44Z INFO Indexing finished

You can note that clock_seq does not change, it is because uuid.uuid1()
uses nanosecond resolution and uuid_generate_time() if the latter is
available. If we happen to ask for uuids within sub-nanosecond interval,
clock_seq will be different then.

I'm extracting only time and clock_seq instead of pasting full uuid
value because uuid_generate_time() will leak out ethernet MAC address
for 48-bit node. We don't need more bits here so I drop these 48 bits
and avoid publishing the ethernet MAC address, even in logs.

--
/ Alexander Bokovoy


Yes, this approach will work. Just some technical comments:



+self.sub_dict['TIME'] = cn_uuid.get_time()
+self.sub_dict['SEQ'] = cn_uuid.get_clock_seq()


Use the attributes, `cn_uuid.time` and `cn_uuid.clock_seq`. The 
accessor methods are undocumented.

Fixed.




+self.sub_dict['ATTR'] = attribute

-# Refresh the time to make uniqueness more probable. Add on some
-# randomness for good measure.
-self.sub_dict['TIME'] = int(time.time()) + r.randint(0,1)
+cn = self._template_str("indextask_${ATTR}_${TIME}_${SEQ}")


You're overwriting sub_dict['TIME'], which is used elsewhere in the 
class. That could cause trouble.

Since it was set here and others are using it, I kept it updated in a
new version as well, based on cn_uuid.time. But since cn_uuid.time is in
nanoseconds resolution, I have to divide the value by 1e9.


There's no reason to use sub_dict here; you can simply d

Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Petr Viktorin

On 07/24/2012 04:50 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Rob Crittenden wrote:

Petr Viktorin wrote:

On 07/24/2012 02:49 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To
get
reasonable 'randomness' these values are then hashed with sha1 and
use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it
doesn't
add any randomness at all. It just obscures what's really happening.

Hence using quotes to describe it. We don't need randomness in the
task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second
resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique,
neither is SHA1("indextask_$date_$attribute"). Hashing has no effect
on the chance of collisions.

You could use Python's pseudorandom number generator (random.randint)
instead of random.SystemRandom. It's not cryptographically secure but
it's enough to avoid collisions, and it doesn't use up system entropy
(except for initial seeding, through `import random`).
"indextask_$date_$attribute_$pseudorandomvalue" should be unique
enough.

Using random here is really bad.
What we ideally need is a method to increment sequential calls for
the same attribute.We use seconds to differentiate
between all tasks but that is not really required, tasks that were
processed will be removed.



Or maybe use $date_$attribute and just warn (ignore the error) if
there's a duplicate -- if a reindex task for the same attribute already
exists from the same second, do we really need to start a new one?



That is true.

We can generate a UUID, I think that is probably a better/safer thing
to use overall.

Updated patch attached.

2012-07-24T14:36:31Z INFO Creating task to index attribute: memberOf
2012-07-24T14:36:31Z DEBUG Task id:
cn=indextask_memberOf_135624333918176170_14302,cn=index,cn=tasks,cn=config
2012-07-24T14:36:32Z INFO Indexing finished
...
2012-07-24T14:36:43Z INFO Creating task to index attribute: memberUser
2012-07-24T14:36:43Z DEBUG Task id:
cn=indextask_memberUser_135624334038532670_14302,cn=index,cn=tasks,cn=config

2012-07-24T14:36:44Z INFO Indexing finished

You can note that clock_seq does not change, it is because uuid.uuid1()
uses nanosecond resolution and uuid_generate_time() if the latter is
available. If we happen to ask for uuids within sub-nanosecond interval,
clock_seq will be different then.

I'm extracting only time and clock_seq instead of pasting full uuid
value because uuid_generate_time() will leak out ethernet MAC address
for 48-bit node. We don't need more bits here so I drop these 48 bits
and avoid publishing the ethernet MAC address, even in logs.

--
/ Alexander Bokovoy


Yes, this approach will work. Just some technical comments:



+self.sub_dict['TIME'] = cn_uuid.get_time()
+self.sub_dict['SEQ'] = cn_uuid.get_clock_seq()


Use the attributes, `cn_uuid.time` and `cn_uuid.clock_seq`. The accessor 
methods are undocumented.



+self.sub_dict['ATTR'] = attribute

-# Refresh the time to make uniqueness more probable. Add on some
-# randomness for good measure.
-self.sub_dict['TIME'] = int(time.time()) + r.randint(0,1)
+cn = self._template_str("indextask_${ATTR}_${TIME}_${SEQ}")


You're overwriting sub_dict['TIME'], which is used elsewhere in the 
class. That could cause trouble.

There's no reason to use sub_dict here; you can simply do:

cn = 'indextask_%s_%s_%s' % (attribute, cn_uuid.time, cn_uuid.clock_seq)



--
Petr³


___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel


Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Alexander Bokovoy

On Tue, 24 Jul 2012, Rob Crittenden wrote:

Petr Viktorin wrote:

On 07/24/2012 02:49 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To get
reasonable 'randomness' these values are then hashed with sha1 and
use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it doesn't
add any randomness at all. It just obscures what's really happening.

Hence using quotes to describe it. We don't need randomness in the task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique,
neither is SHA1("indextask_$date_$attribute"). Hashing has no effect
on the chance of collisions.

You could use Python's pseudorandom number generator (random.randint)
instead of random.SystemRandom. It's not cryptographically secure but
it's enough to avoid collisions, and it doesn't use up system entropy
(except for initial seeding, through `import random`).
"indextask_$date_$attribute_$pseudorandomvalue" should be unique enough.

Using random here is really bad.
What we ideally need is a method to increment sequential calls for
the same attribute.We use seconds to differentiate
between all tasks but that is not really required, tasks that were
processed will be removed.



Or maybe use $date_$attribute and just warn (ignore the error) if
there's a duplicate -- if a reindex task for the same attribute already
exists from the same second, do we really need to start a new one?



That is true.

We can generate a UUID, I think that is probably a better/safer thing 
to use overall.

Updated patch attached.

2012-07-24T14:36:31Z INFO Creating task to index attribute: memberOf
2012-07-24T14:36:31Z DEBUG Task id: 
cn=indextask_memberOf_135624333918176170_14302,cn=index,cn=tasks,cn=config
2012-07-24T14:36:32Z INFO Indexing finished
...
2012-07-24T14:36:43Z INFO Creating task to index attribute: memberUser
2012-07-24T14:36:43Z DEBUG Task id: 
cn=indextask_memberUser_135624334038532670_14302,cn=index,cn=tasks,cn=config
2012-07-24T14:36:44Z INFO Indexing finished

You can note that clock_seq does not change, it is because uuid.uuid1()
uses nanosecond resolution and uuid_generate_time() if the latter is
available. If we happen to ask for uuids within sub-nanosecond interval,
clock_seq will be different then.

I'm extracting only time and clock_seq instead of pasting full uuid
value because uuid_generate_time() will leak out ethernet MAC address
for 48-bit node. We don't need more bits here so I drop these 48 bits
and avoid publishing the ethernet MAC address, even in logs.

--
/ Alexander Bokovoy
>From e5262b5625e8e3b2deaf9228ca8a53dcbea90593 Mon Sep 17 00:00:00 2001
From: Alexander Bokovoy 
Date: Tue, 24 Jul 2012 12:07:23 +0300
Subject: [PATCH 3/3] Rework task naming in LDAP updates to avoid conflicting
 names in certain cases

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an integer value.
Instead, we generate an UUID and use its 60-bit time, 14-bit sequential number,
and attribute name.

https://fedorahosted.org/freeipa/ticket/2942
---
 ipaserver/install/ldapupdate.py |   17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/ipaserver/install/ldapupdate.py b/ipaserver/install/ldapupdate.py
index 
c64139889d9f84866ac0cd358ed3a3a7d95af

Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Rob Crittenden

Petr Viktorin wrote:

On 07/24/2012 02:49 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To get
reasonable 'randomness' these values are then hashed with sha1 and
use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

 indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it doesn't
add any randomness at all. It just obscures what's really happening.

Hence using quotes to describe it. We don't need randomness in the task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique,
neither is SHA1("indextask_$date_$attribute"). Hashing has no effect
on the chance of collisions.

You could use Python's pseudorandom number generator (random.randint)
instead of random.SystemRandom. It's not cryptographically secure but
it's enough to avoid collisions, and it doesn't use up system entropy
(except for initial seeding, through `import random`).
"indextask_$date_$attribute_$pseudorandomvalue" should be unique enough.

Using random here is really bad.
What we ideally need is a method to increment sequential calls for
the same attribute.We use seconds to differentiate
between all tasks but that is not really required, tasks that were
processed will be removed.



Or maybe use $date_$attribute and just warn (ignore the error) if
there's a duplicate -- if a reindex task for the same attribute already
exists from the same second, do we really need to start a new one?



That is true.

We can generate a UUID, I think that is probably a better/safer thing to 
use overall.


rob

___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel


Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Petr Viktorin

On 07/24/2012 02:49 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To get
reasonable 'randomness' these values are then hashed with sha1 and use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

 indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it doesn't
add any randomness at all. It just obscures what's really happening.

Hence using quotes to describe it. We don't need randomness in the task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique,
neither is SHA1("indextask_$date_$attribute"). Hashing has no effect
on the chance of collisions.

You could use Python's pseudorandom number generator (random.randint)
instead of random.SystemRandom. It's not cryptographically secure but
it's enough to avoid collisions, and it doesn't use up system entropy
(except for initial seeding, through `import random`).
"indextask_$date_$attribute_$pseudorandomvalue" should be unique enough.

Using random here is really bad.
What we ideally need is a method to increment sequential calls for
the same attribute.We use seconds to differentiate
between all tasks but that is not really required, tasks that were
processed will be removed.



Or maybe use $date_$attribute and just warn (ignore the error) if 
there's a duplicate -- if a reindex task for the same attribute already 
exists from the same second, do we really need to start a new one?


--
Petr³


___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel


Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Alexander Bokovoy

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To get
reasonable 'randomness' these values are then hashed with sha1 and use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

 indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it doesn't
add any randomness at all. It just obscures what's really happening.

Hence using quotes to describe it. We don't need randomness in the task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique, 
neither is SHA1("indextask_$date_$attribute"). Hashing has no effect 
on the chance of collisions.


You could use Python's pseudorandom number generator (random.randint) 
instead of random.SystemRandom. It's not cryptographically secure but 
it's enough to avoid collisions, and it doesn't use up system entropy 
(except for initial seeding, through `import random`).

"indextask_$date_$attribute_$pseudorandomvalue" should be unique enough.

Using random here is really bad. What we ideally need is a method to
increment sequential calls for the same attribute. We use seconds to 
differentiate
between all tasks but that is not really required, tasks that were
processed will be removed.

--
/ Alexander Bokovoy

___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel


Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Petr Viktorin

On 07/24/2012 02:06 PM, Alexander Bokovoy wrote:

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To get
reasonable 'randomness' these values are then hashed with sha1 and use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

  indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it doesn't
add any randomness at all. It just obscures what's really happening.

Hence using quotes to describe it. We don't need randomness in the task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


My point is that if "indextask_$date_$attribute" is not unique, neither 
is SHA1("indextask_$date_$attribute"). Hashing has no effect on the 
chance of collisions.


You could use Python's pseudorandom number generator (random.randint) 
instead of random.SystemRandom. It's not cryptographically secure but 
it's enough to avoid collisions, and it doesn't use up system entropy 
(except for initial seeding, through `import random`).

"indextask_$date_$attribute_$pseudorandomvalue" should be unique enough.


Same with repeating [tasktime, attribute] two times.

This can be reduced as SHA-1 output does not depend on size of the
input message.



--
Petr³


___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel


Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Alexander Bokovoy

On Tue, 24 Jul 2012, Petr Viktorin wrote:

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To get
reasonable 'randomness' these values are then hashed with sha1 and use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

  indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it 
doesn't add any randomness at all. It just obscures what's really 
happening.

Hence using quotes to describe it. We don't need randomness in the task
names, we need something that avoids collisions.

An issue here is in time.time() -- it may give us sub-second resolution
if underlying OS supports it, it may not. Having a second-level
resolution is not enough, especially on fast machines, so we can't
simply use int(times.time()) as it was in the original version.

indextask_$date_$attribute has this issue that we don't have enough
guarantee for $date (time.time()) to be unique in sufficiently tight
conditions, thus use of SHA-1 to generate something that has better
chances to avoid collisions than $data_$attribute.


Same with repeating [tasktime, attribute] two times.

This can be reduced as SHA-1 output does not depend on size of the
input message.


--
/ Alexander Bokovoy

___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel


Re: [Freeipa-devel] [PATCH] 0064 Rework task naming in LDAP updates to avoid conflicts

2012-07-24 Thread Petr Viktorin

On 07/24/2012 12:01 PM, Alexander Bokovoy wrote:

Hi,

There are two problems in task naming in LDAP updates:

1. Randomness may be scarce in virtual machines
2. Random number is added to the time value rounded to a second

The second issue leads to values that may repeat themselves as time
only grows and random number is non-negative as well, so
t2+r2 can be equal to t1+t2 generated earlier.

Since task name is a DN, there is no strict requirement to use an
integer value.  Instead, we can take time and attribute name. To get
reasonable 'randomness' these values are then hashed with sha1 and use
the resulting string as task name.

SHA1 may technically be an overkill here as we could simply use

   indextask_$date_$attribute

where $date is a value of time.time() but SHA1 gives a resonable
'randomness' into the string.


What kind of randomness do you mean? SHA1 is deterministic, it doesn't 
add any randomness at all. It just obscures what's really happening.

Same with repeating [tasktime, attribute] two times.



> -root_logger.debug("Task id: %s", dn)
> +root_logger.debug("Task id: %s", str(dn))
This change is unnecessary; the "%s" means "convert to str".




I was hit by this issue today, see
https://fedorahosted.org/freeipa/ticket/2942





--
Petr³


___
Freeipa-devel mailing list
Freeipa-devel@redhat.com
https://www.redhat.com/mailman/listinfo/freeipa-devel