Since the list will be closing soon, I thought I'd put up for discussion an
idea a friend and I had back in 2000.  I put it here partly as a thought
provoker, and partly so it doesn't get lost.  I had it on Microsoft's
community groups, but it never got any traffic so died a death. (By the way,
my Outlook spelling check wanted to change "Microsoft's" to "Microfossil's")

I'm adding my original stuff to the bottom of this message, and I extended
the idea in some thoughts I put up at
http://emeraldglenlodge.co.nz/superpick.html.  Part of the extension was to
point out that if the marks (like the attribute marks) were changed to be
9C-9F, then there wouldn't be a clash with the characters used on the
internet.  Even using 0C-0F would work better.


Interestingly, having followed the recent threads Dawn has engaged in at
comp.databases.theory, I have retreated a bit from my original position.
When you think about it, the fact that we have (in general) only one level
of data 'nesting' means that we don't get a hierarchical structure that is
difficult to understand as a single conceptual "thang".  Codd's original
paper drew back from having relations within relations, maybe because he
didn't consider the Pick idea of limiting the depth of the structure?



Anyway, following is the original idea (although maybe calling it
"SuperPick" was a conceit - I could be modest and follow established
precedent and call it "Johnson")


Regards, Keith.



SuperPick
 Copyright Keith Johnson 2000
 Background
 My experience has been as an application programmer using Pick-type
databases.  Within these, all data is represented as an ASCII string using
delimiters to separate fields.  Pick allows three levels of fields, called
attributes, values, and sub-values, using characters 254, 253 and 252 as the
respective delimiters.  I was seeking a method of storing data which would
be similar, but which could cope with theoretically unlimited nesting.  This
structure would work well for the sort of data I see in my work - names,
dates, addresses, money, product codes, etc.  It would also convert easily
to an XML form, which I see as the coming data interchange format.
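
 To make that background concrete, here is a small sketch in Python (just a
handy illustration language; the phone-number layout is invented for the
example) of a Pick-style record held as one delimited string:

 AM, VM, SVM = chr(254), chr(253), chr(252)  # attribute, value, sub-value marks

 # A made-up record: attribute 1 is a name, attribute 2 repeats phone numbers
 record = "Higginbottom" + AM + "555-1234" + VM + "555-9876"

 attributes = record.split(AM)      # fields found by counting delimiters
 phones = attributes[1].split(VM)   # repeating values within attribute 2
 print(attributes[0])               # Higginbottom
 print(phones)                      # ['555-1234', '555-9876']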

 My colleague Ron Knox one day came up with the idea that 'brackets' would
allow for nesting of any depth.  We refined this idea over time into a data
structure that Ron has called "Noble" (as in, it's not base!).
"SuperPick" is Noble with a data map concept added, which allows the data to
be easily converted to XML.



 Considering XML
 XML itself is interesting, and I could see Pick-like things in it, such as
repeating fields, but it annoys me to see the verbosity associated with the
tag mechanism.  In Pick, one describes fields by their position within the
record - that is, by counting delimiters.  From my experience with Pick,
parsing out fields for manipulation does not have an adverse effect on
performance as long as you try to avoid extremely long strings, and actively
code to avoid re-parsing long strings as much as possible.  XML, being
verbose, would be more vulnerable to that sort of performance problem.  The
mechanism required to pull a field out of XML is more complex than delimiter
counting, as it has to match the tag strings surrounding the field.  This is
more difficult than it sounds, because tags do not have to be unique.
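
 As a small illustration of the difference (Python and its ElementTree
module used purely for the comparison; the two-author fragment is taken from
the example further down):

 import xml.etree.ElementTree as ET

 AM = chr(254)
 record = "Danzig" + AM + "Wardlaw, Lee"
 print(record.split(AM)[1])             # Pick style: count delimiters

 doc = ET.fromstring("<Detail><Author>Danzig</Author>"
                     "<Author>Wardlaw, Lee</Author></Detail>")
 print(doc.findall("Author")[1].text)   # XML style: match tag strings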



 Format of example
 Described below is a mechanism for storing data - "SuperPick".  Under this
mechanism there is the data itself, and a map.  Unlike Pick, the map is
required.  Both map and data are stored as text strings with four special
characters - file start, file end, record separator, and field separator.
In this example I have used the left square bracket as the file start, the
right square bracket as the file end, the pipe as the record separator, and
the backslash as the field separator - that is, []|\ respectively.  This is
not to say these are the characters that would in fact be used, just that
they are clear.



 The example
 The map is


[Customer\0\file|FirstName\1|LastName\2|CreditLimit\3|OrderEntry\4\file|OrderID\4,1|OrderDetail\4,2\file|Title\4,2,1|Author\4,2,2|Price\4,2,3]

  While the data is

 [Amy\Higginbottom\5000\[16273\[Number, the Language of Science\Danzig\5.95|Tales of Grandpa Cat\Wardlaw, Lee\6.58]]]



 The map
 Taking the map first, it consists of a file with ten records.  If we put
each record on a separate line, we can see that each one is made up of two
or three fields.  The fields are a name, a position, and a type.  The type
defaults to a standard one, which may be called a "generic field".


 Customer\0\file

 FirstName\1

 LastName\2

 CreditLimit\3

 OrderEntry\4\file

 OrderID\4,1

 OrderDetail\4,2\file

 Title\4,2,1

 Author\4,2,2

 Price\4,2,3



 The map means that data is held in a file called "Customer" in four fields.
The first three fields are standard ones called FirstName, LastName, and
CreditLimit.  The fourth is a sub-file called OrderEntry.  The OrderEntry
sub-file has two fields: the first is a standard one called OrderID and the
second is a further sub-file called OrderDetail.  OrderDetail contains three
fields, which are Title, Author, and Price.
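
 To make the map format concrete, here is a minimal Python sketch (Python is
just my illustration language, not part of the proposal) that splits the
example map into its records and fields, applying the default type:

 FILE_START, FILE_END, REC_SEP, FLD_SEP = "[", "]", "|", "\\"

 map_string = ("[Customer\\0\\file|FirstName\\1|LastName\\2|CreditLimit\\3"
               "|OrderEntry\\4\\file|OrderID\\4,1|OrderDetail\\4,2\\file"
               "|Title\\4,2,1|Author\\4,2,2|Price\\4,2,3]")

 for rec in map_string.strip(FILE_START + FILE_END).split(REC_SEP):
     fields = rec.split(FLD_SEP)
     name, position = fields[0], fields[1]
     ftype = fields[2] if len(fields) > 2 else "generic field"  # the default
     print(name, position, ftype)

 The map itself never nests, so plain splitting is enough here; the data,
with its nested brackets, needs the depth-aware splitting shown in the later
sketch.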



 Positional referencing
 The positional numbers are the key concept in the mechanism.  The field
separators define which field a datum actually is, and a missing one will
totally destroy the meaning.  The positions do, however, give a way to refer
to data within the structure in an unambiguous way, using a notation whereby
Customer{1} is the first name, Customer{4,2,2} is the set of authors, and
Customer{4,2.2,3} is the price "6.58" in the example (the period in 4,2.2
selecting the second record of the OrderDetail sub-file).  That is, while
Customer is the entire file, Customer{} is a record from that file.  A
perspicacious comment I recently read said that the dynamic reference (like
VAR<3>) in Pick was not a variable, but a process.  Here I am claiming the
curly brackets for SuperPick, in a line like

 READ CUSTOMER.REC{} FROM CUSTOMERFILE,ID ELSE STOP.

 Then I could have a line like AUTHORS = CUSTOMER.REC{4,2,2}, which in Pick
terms would be "Danzig" : CHAR(254) : "Wardlaw, Lee".  This would be
different from AUTHORS{} = CUSTOMER.REC{4,2,2}, which would be
"Danzig|Wardlaw, Lee" - the pipe representing whatever was used as the
record separator.
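
 To show that the curly-bracket reference really is just a mechanical walk
of the structure, here is a rough Python sketch (my illustration language;
it handles only comma paths like {4,2,2}, not the record-selecting {4,2.2,3}
form) that parses the example data and fetches the set of authors:

 def split_top(s, sep):
     """Split s on sep, ignoring separators inside nested [...] sub-files."""
     parts, depth, start = [], 0, 0
     for i, c in enumerate(s):
         if c == "[":
             depth += 1
         elif c == "]":
             depth -= 1
         elif c == sep and depth == 0:
             parts.append(s[start:i])
             start = i + 1
     parts.append(s[start:])
     return parts

 def parse_file(s):
     """Turn '[...]' into a list of records, each a list of fields;
     a bracketed field becomes a nested sub-file."""
     records = []
     for rec in split_top(s[1:-1], "|"):
         records.append([parse_file(f) if f.startswith("[") else f
                         for f in split_top(rec, "\\")])
     return records

 def fetch(records, path):
     """Gather the field at a 1-based position path across all records,
     in the spirit of Customer{4,2,2} being the set of authors."""
     results = []
     for record in records:
         value = record[path[0] - 1]
         if len(path) == 1:
             results.append(value)
         else:
             results.extend(fetch(value, path[1:]))  # descend into sub-file
     return results

 data = ("[Amy\\Higginbottom\\5000\\[16273\\[Number, the Language of "
         "Science\\Danzig\\5.95|Tales of Grandpa Cat\\Wardlaw, Lee\\6.58]]]")
 customer = parse_file(data)
 print(fetch(customer, (4, 2, 2)))   # ['Danzig', 'Wardlaw, Lee']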



 The data
 If we now look at the data

  [Amy\Higginbottom\5000\[16273\[Number, the Language of Science\Danzig\5.95|Tales of Grandpa Cat\Wardlaw, Lee\6.58]]]

 We can see that it is a file of one record.  Laying this out with an indent
for each sub-file shows the record structure as below



 Amy\Higginbottom\5000\

        16273\

                Number, the Language of Science\Danzig\5.95

                Tales of Grandpa Cat\Wardlaw, Lee\6.58



 Expanding this and labeling each field gives

 FirstName        Amy

 LastName        Higginbottom

 CreditLimit       5000

        OrderEntry

        OrderID 16273

                        OrderDetail

                        Title      Number, the Language of Science

                        Author  Danzig

                        Price    5.95

                        OrderDetail

                        Title      Tales of Grandpa Cat

                        Author  Wardlaw, Lee

                        Price    6.58



 In XML form
 And from there, I can put the data into the XML form

 <Customer>
        <FirstName>Amy</FirstName>
        <LastName>Higginbottom</LastName>
        <CreditLimit>5000</CreditLimit>
        <OrderEntry>
                <OrderID>16273</OrderID>
                <OrderDetail>
                        <Title>Number, the Language of Science</Title>
                        <Author>Danzig</Author>
                        <Price>5.95</Price>
                </OrderDetail>
                <OrderDetail>
                        <Title>Tales of Grandpa Cat</Title>
                        <Author>Wardlaw, Lee</Author>
                        <Price>6.58</Price>
                </OrderDetail>
        </OrderEntry>
 </Customer>
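
 A sketch of how mechanical this conversion is, again in Python: it assumes
the data has already been parsed into nested lists (as in the earlier
sketch), and it encodes the example map as a dictionary from position to
name, writing the file's own position 0 as the empty tuple:

 # The example map: position path -> (tag name, is-subfile)
 MAP = {
     ():        ("Customer", True),
     (1,):      ("FirstName", False),
     (2,):      ("LastName", False),
     (3,):      ("CreditLimit", False),
     (4,):      ("OrderEntry", True),
     (4, 1):    ("OrderID", False),
     (4, 2):    ("OrderDetail", True),
     (4, 2, 1): ("Title", False),
     (4, 2, 2): ("Author", False),
     (4, 2, 3): ("Price", False),
 }

 def to_xml(records, path=(), indent=0):
     name, pad, lines = MAP[path][0], "        " * indent, []
     for record in records:               # one element per record
         lines.append(pad + "<%s>" % name)
         for pos, field in enumerate(record, start=1):
             child = path + (pos,)
             child_name, is_file = MAP[child]
             if is_file:                  # sub-file: recurse, one tag per record
                 lines.extend(to_xml(field, child, indent + 1))
             else:
                 lines.append("        " * (indent + 1) +
                              "<%s>%s</%s>" % (child_name, field, child_name))
         lines.append(pad + "</%s>" % name)
     return lines

 customer = [["Amy", "Higginbottom", "5000",
              [["16273",
                [["Number, the Language of Science", "Danzig", "5.95"],
                 ["Tales of Grandpa Cat", "Wardlaw, Lee", "6.58"]]]]]]
 print("\n".join(to_xml(customer)))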




 Comparison
 The previous XML was the original example I picked up from the Internet
when I was first looking at XML and thought "this is SO verbose".  While
it's not entirely a fair comparison, the XML version is 402 characters while
the map and data added together are 263 characters.  I have done tests that
indicate one could reduce the typical XML by about 50-60%, and that a zipped
file would be about 20% shorter than zipping the original XML.



 Some extra thoughts
 The mechanism as described does not cover using XML attributes
interchangeably with tags for storing data.  This is not outside the XML
specification, but attributes do seem to belong in the DTD in my opinion.
However, an extension to the types in the map could easily cover this.



 The map does not intrude into areas covered by the XML DTD, but perhaps it
should for SuperPick itself.  I could see it defining whether a field is a
date or a number, and whether it is mandatory.  A useful extension would be
a limit on the number of records allowed in a file.  The practical
circumstance covered here is something like an address (a multi-value in
Pick, a sub-file with one field in SuperPick) where you want to limit the
number of lines so it fits on a label.  Within a SuperPick dictionary, you
could refer to the map field names directly, or to something like
@SRECORD{4,2.2,2} perhaps.



 The map is an intrinsic part of a file, but one could have a separate
dictionary item (something like Pick's file translations) that links this
file to another.  A query would then return its results in the form of
another map (covering only the data set requested) and the requested data.
The query results could include data from files at one, two, or more removes
from the one initially interrogated.  Logically, any 'jump' to another file
results in a new sub-file in the query results.



 One way to implement this would be to use Berkeley DB
(http://www.sleepycat.com/products.html) and add keys to the map.  In this
case the map might gain records like

 keypartA\1\key|keypartB\2\key...

 The key would be a string, and a record would be another string - Berkeley
DB will support this.
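
 As a minimal sketch of the storage side (using Python's standard dbm module
as a stand-in for Berkeley DB, and an invented key "AMY001"), the engine
only has to map one string to another:

 import dbm

 record = ("[Amy\\Higginbottom\\5000\\[16273\\[Number, the Language of "
           "Science\\Danzig\\5.95|Tales of Grandpa Cat\\Wardlaw, Lee\\6.58]]]")

 with dbm.open("customers", "c") as store:   # create the key/value file
     store["AMY001"] = record                # key string -> record string
     print(store["AMY001"].decode())         # dbm stores values as bytes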



 Weaknesses
 The obvious one is that the data is not as easily read as XML.  However, it
is not impossible to read the data if one has the map.  It would be easy to
write software that presents the data in a readable form (something like the
intermediate forms I used above) with perhaps a zooming facility for
sub-files.



 A map that puts a lot of usually-empty fields first would make records with
lots of leading delimiters.  The data would take more space than required,
possibly more than an XML version.  It is not possible to build the map
without knowing the full structure required, because there is a difference
between fields and sub-files.  My feeling is that this would provide a
gentle push to make the structure an "efficient" representation of the data,
in these terms.  Also, I can see that it would be relatively easy to go
through a file counting the fields, and then to re-structure the file to be
more 'efficient' with an automatic process.

