Re: [Unicon-group] Venn's revenge - a data normalization problem

Hugh Sasse Sun, 14 Sep 2008 05:52:14 -0700

On Sun, 14 Sep 2008, Art Eschenlauer wrote:

> Is there a clever (i.e., cheap) approach for searching a group of sets for
> a "test" set?


If you are looking to see if any set contains the test set, I'd explore
joining them all together, and if that superset contains the test set
then subdivide the problem repeatedly until you find the original set
that does.  But that is from that statement of the problem....

> I suppose another way of asking this question is whether there is an
> Icon-like way of hashing sets (by using hashing that is built into the
> implementation, avoiding brute-force explicit computation of a hash in
> Icon code)?

That sounds to me like an implementation detail, but maybe you are really
talking about uniqueness here.
> 
> This is the essence a subproblem of a problem that I am working on (the
> problem happens to be inventorying versions of files on many computers,
> many of them similar).  Here is the long description of this subproblem:
> 
> I want to maintain a group of properties related to instances of an entity.
> 
> Many instances share exactly the same group of property values.
> There may be many instances of the entity, and the number of attributes is
> not known a priori.

I got to here and with 6 abstractions and two constraints, I sort of
glazed over!  I think that's the 7 chunks issue about short term
memory, and "cognitive load" kicking in.  If only human brains had
kept pace with Moore's law...

You said it was about files, and versions.  Are permissions and
dates involved?  This would make it more concrete.  Anyway,
continuing...

> It may be (and frequently is) that there is a different number of
> properties applicable to different instances.
> Because there are so many instances, it is too expensive to maintain a
> list of all property values for each instance.

On reading this I immediately thought of Programming Pearls, chapter
1 where Jon Bentley discusses bit fields...
> 
> Each property value consists of a property identifier and three values,
> which are grouped in an Icon record.

The critical think I think, is if one considers those as enumerated
types then how many values can each take?

> (Unicon objects may be an acceptable alternative.)
> I collect the properties for a given instance of the entity in a
> non-deterministic order.
> 
> What I would like to do is to save memory by associating a reference to a
> structure, so that each of the many instances with the same group of
> property values will only need to store a reference to the structure
> representing that group.

If I understand this then I think you could represent that group as
single number [0, count(properties) * count(identifiers) *
count(value1) * count(value2) * count(value3)) -- a half open
interval, where count() tells you how many possible values there
are.

> To simplify the problem, I am assuming fewer than 256 attributes; however,
> I am assuming more than 256 groups.
> 
> My thought was to "normalize" my data using an Icon set structure for each
> group of property value records.
> (I believe that a table structure could work just as well.)
> Each entity record would have a field referencing such a set.
> Each time that I learn a new property value for the instance represented
> by the entity record, I would go to a master list of sets, look for or
> construct a set consisting of the union of the newly acquired property
> value and the entity record's current set, and assign a reference to this
> other set to the entity record's properties field.

I think, if I have this right, you could set a value in a table used as
a bitfield, or increment the value if you wanted to count them.

> "Look for" means constructing a set that is a union of the record's
> current set and the new property value and then testing each set in the
> master list for equivalence to the constructed set.
> (I am thinking of using the seteq procedure from sets.icn in the IPL to
> test for set equivalence.)
> 
> The problem that I see with such an approach is that, if N were the
> maximum size of each set and M were the number of sets, then this search
> would be O(M*N).
> So, this approach doesn't scale well.
> 
> Is there a clever alternative for searching a group of sets for a "test" set?
> It is not a requirement that the structure actually be an Icon set.
> One issue that confounds my problem is that since the properties may not
> become known in a particular order, I cannot see how to use a finite state
> machine (e.g., construct a tree that I can walk until I find the node that
> I am looking for or find that node to be missing and add it).

Are you saying that you know in advance of the search what {propert,
identifier, value1, value2, value3} you are looking for?  I think that's
a different problem which is O(N), because you basically have to look at
each file till you find it, looking at all of them if you wish to count
how often it occurs.
> 
> Any ideas?
> 
        Hugh

-------------------------------------------------------------------------
This SF.Net email is sponsored by the Moblin Your Move Developer's challenge
Build the coolest Linux based applications with Moblin SDK & win great prizes
Grand prize is a trip for two to an Open Source event anywhere in the world
http://moblin-contest.org/redirect.php?banner_id=100&url=/
_______________________________________________
Unicon-group mailing list
Unicon-group@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/unicon-group

Re: [Unicon-group] Venn's revenge - a data normalization problem

Reply via email to