Re: [SQL] Table Design for Hierarchical Data
You could also consider the genealogical approach, e.g.
postg...@dynacom=# \d paintgentypes
Table "public.paintgentypes"
Column | Type| Modifiers
-+---+---
id | integer | not null default
nextval(('public.paintgentypes_id_seq'::text)::regclass)
name| text | not null
parents | integer[] |
Indexes:
"paintgentypes_pkey" PRIMARY KEY, btree (id)
"paintgentypes_name2" UNIQUE, btree (name) WHERE parents IS NULL
"paintgentypes_uk" UNIQUE, btree (name, parents)
"paintgentypes_first" btree (first(parents))
"paintgentypes_last" btree (last(parents))
"paintgentypes_level" btree (level(parents))
"paintgentypes_name" btree (name)
"paintgentypes_parents" gin (parents gin__int_ops)
The indexes are based on the contrib/intarray package.
It is very fast to do any operation on this tree.
Also it is very fast to search for the parent of any node, or the
children of any node, or the whole subtree of any node, or the depth of any
node in the tree.
The parents of any node to the root, i.e. the path of any node to the root are
depicted as
parents[0] : immediate parent
parents[1] : immediate parent of the above parent
.
parents[n] : root of the tree
Στις Tuesday 06 April 2010 20:33:18 ο/η Lee Hachadoorian έγραψε:
> Please point me to another listserv or forum if this question is more
> appropriately addressed elsewhere.
>
> I am trying to come up with a structure to store employment data by NAICS
> (North American Industrial Classification System). The data uses a
> hierarchical encoding scheme ranging between 2 and 5 digits. That is, each
> 2-digit code includes all industries beginning with the same two digits. 61
> includes 611 which includes 6111, 6112, 6113, etc. A portion of the
> hierarchy is shown after the sig.
>
> A standard way to store hierarchical data is the adjacency list model, where
> each node's parent appears as an attribute (table column). So 6111 would
> list 611 as its parent. Since NAICS uses a hierarchical encoding scheme, the
> node's name is the same as the node's id, and the parent can always be
> derived from the node's id. Storing the parent id separately would seem to
> violate a normal form (because of the redundancy).
>
> One way to store this data would be to store at the most granular level
> (5-digit NAICS) and then aggregate up if I wanted employment at the 4-, 3-,
> or 2-digit level. The problem is that because of nondisclosure rules, the
> data is sometimes censored at the more specific level. I might, for example,
> have data for 6114, but not 61141, 61142, 61143. For a different branch of
> the tree, I might have data at the 5-digit level while for yet another
> branch I might have data only to the 3-digit level (not 4 or 5). I think
> that means I have to store all data at multiple levels, even if some of the
> higher-level data could be reconstructed from other, lower-level data.
>
> Specifically I'd like to know if this should be a single table or should
> there be a separate table for each level of the hierarchy (four in all)? If
> one table, should the digits be broken into separate columns? Should parent
> ids be stored in each node?
>
> More generally, what questions should I be asking to help decide what
> structure makes the most sense? Are there any websites, forums, or books
> that cover this kind of problem?
>
> Regards,
> --Lee
>
--
Achilleas Mantzios
--
Sent via pgsql-sql mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Table Design for Hierarchical Data
On 04/07/10 11:00, Achilleas Mantzios wrote:
Column | Type| Modifiers
-+---+---
id | integer | not null default
nextval(('public.paintgentypes_id_seq'::text)::regclass)
name| text | not null
parents | integer[] |
The parents of any node to the root, i.e. the path of any node to the root are
depicted as
parents[0] : immediate parent
parents[1] : immediate parent of the above parent
.
parents[n] : root of the tree
what this schema gives?
(1) the parent branch in one select.
what else?
nothing.
compare it to a nested-tree
id | integer | NOT NULL
name| text | not null
parent | integer |
l | numeric
r | numeric
(1) parent branch in one select
(2) child subtree in one select
(it makes a sence!)
--
Sent via pgsql-sql mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Table Design for Hierarchical Data
Achilleas Mantzios wrote: You could also consider the genealogical approach, e.g. The parents of any node to the root, i.e. the path of any node to the root are depicted as parents[0] : immediate parent parents[1] : immediate parent of the above parent What I have more than one parent? regards, Yeb Havinga -- Sent via pgsql-sql mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Table Design for Hierarchical Data
On 6 April 2010 21:33, Lee Hachadoorian wrote: > More generally, what questions should I be asking to help decide what > structure makes the most sense? Are there any websites, forums, or books > that cover this kind of problem? Haven't you thought about ltree contrib? From the description of ltree: "This module implements a data type ltree for representing labels of data stored in a hierarchical tree-like structure". http://www.postgresql.org/docs/8.4/interactive/ltree.html -- Sergey Konoplev Blog: http://gray-hemp.blogspot.com / Linkedin: http://ru.linkedin.com/in/grayhemp / JID/GTalk: [email protected] / Skype: gray-hemp / ICQ: 29353802 -- Sent via pgsql-sql mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Table Design for Hierarchical Data
Στις Wednesday 07 April 2010 10:53:00 ο/η silly sad έγραψε:
> On 04/07/10 11:00, Achilleas Mantzios wrote:
>
> > Column | Type| Modifiers
> > -+---+---
> > id | integer | not null default
> > nextval(('public.paintgentypes_id_seq'::text)::regclass)
> > name| text | not null
> > parents | integer[] |
>
> > The parents of any node to the root, i.e. the path of any node to the root
> > are depicted as
> > parents[0] : immediate parent
> > parents[1] : immediate parent of the above parent
> > .
> > parents[n] : root of the tree
>
> what this schema gives?
>
> (1) the parent branch in one select.
1st the number of selects has nothing to do with speed
2nd as you will see below, the number of select is always 1, for any basic tree
operation.
> what else?
> nothing.
>
No, you are wrong.
1) find immediate father
parents[0] (O(1) complexity)
2) find root
parents[level(parents)] (O(1) complexity)
3) insert a node under a father
O(1) complexity
4) find all immediate children of a father node: (e.g. 2)
SELECT * from paintgentypes where parents[1] =2; (caution: NON indexed select)
or
SELECT * from paintgentypes where itoar(2) ~ parents and level(parents)=(level
of node 2 )+1;
5) find all children and grandchildren of a father node: (e.g. 2)
SELECT * from paintgentypes where itoar(2) ~ parents and level(parents)<=(level
of node 2 )+2;
6) find whole subtree of a node (e.g. 2)
SELECT * from paintgentypes where itoar(2) ~ parents;
In PostgreSQL, the above model i think is superior to nested trees in every
apsect.
This is due to the excellent intarray module.
PS
Excuse me for the typo in the previous mail.
Arrays in postgresql are 1-based.
> compare it to a nested-tree
>
> id | integer | NOT NULL
> name| text | not null
> parent | integer |
> l | numeric
> r | numeric
>
> (1) parent branch in one select
> (2) child subtree in one select
> (it makes a sence!)
>
>
>
--
Achilleas Mantzios
--
Sent via pgsql-sql mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Table Design for Hierarchical Data
Στις Wednesday 07 April 2010 11:26:29 ο/η Sergey Konoplev έγραψε: > On 6 April 2010 21:33, Lee Hachadoorian wrote: > > More generally, what questions should I be asking to help decide what > > structure makes the most sense? Are there any websites, forums, or books > > that cover this kind of problem? > > Haven't you thought about ltree contrib? From the description of > ltree: "This module implements a data type ltree for representing > labels of data stored in a hierarchical tree-like structure". > > http://www.postgresql.org/docs/8.4/interactive/ltree.html Thats definitely worth checking out. Personally i didn't follow this apprach cause it seemd a little bit more restricting than doing it my way. However, for this case especially, it looks like it solves a photograph of the original problem of this therad. Besides, the authors of this fine contrib module are the same known PostgreSQL contributors of tsearch2, postgresql FTS, intarray (which i heavily use for the tree representation), etc... > > > -- > Sergey Konoplev > > Blog: http://gray-hemp.blogspot.com / > Linkedin: http://ru.linkedin.com/in/grayhemp / > JID/GTalk: [email protected] / Skype: gray-hemp / ICQ: 29353802 > -- Achilleas Mantzios -- Sent via pgsql-sql mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Table Design for Hierarchical Data
Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε: > Achilleas Mantzios wrote: > > You could also consider the genealogical approach, e.g. > > > > > > The parents of any node to the root, i.e. the path of any node to the root > > are depicted as > > parents[0] : immediate parent > > parents[1] : immediate parent of the above parent > > > What I have more than one parent? Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph. This a totally different problem. > > regards, > Yeb Havinga > > -- Achilleas Mantzios -- Sent via pgsql-sql mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Table Design for Hierarchical Data
On Tue, Apr 6, 2010 at 2:34 PM, Little, Douglas wrote: > Hey Lee, > > I’m on [email protected] Thanks for the pointer. I'm looking at their archives now. > Ie a row is a point in time, or average for a period of time. Are the > Numbers actual, or estimates?To be useful, you’ll want to be able to > insert new records while retaining history and projections and easily be > able to build forecasts. Yes, each row represents employment for a given time period (usually year), geography (county, ZIP code, etc.), and NAICS code. I'm planning to partition by year, particularly since the agency we get data from releases preliminary data first, final data later, so that I can easily drop or disinherit preliminary data. Numbers are actual employment based on establishment reporting. > I suppose you can get data at the 61, 611, and 6111 level. You want to be > able to accurately sum where code like ‘61%’ We would never do a sum like 61%, because that would double or triple count data all the way down the hierarchy. The employment for NAICS 61 at a particular geography and time is the same as the sum of all 3-digit children (61#) which is also the same as the sum of all 4-digit grandchildren (61##), etc. On Tue, Apr 6, 2010 at 6:04 PM, Michael Glaesemann wrote: > Another is nested sets which performs quite nicely for loads which are more > read than write (which I suspect is the case here). You are right that we will be reading more than writing, but the SQL looks complicated, and I don't have the skills to build the kind of application layer that would allow our users to work with data stored as a nested set. >> The problem is that because of nondisclosure rules, the >> data is sometimes censored at the more specific level. > > I don't know if this is per-user or per-category or what, but it may be > something you store separately from the main table. Suppression is per-category. All users at our org will have access to the same data. More info below. On Tue, Apr 6, 2010 at 8:23 PM, Steve Crawford wrote: > reports. The CTE/recursive-query features in 8.4 are great for this. But in > the case you have described, the number of levels is well defined as is the > type of information associated with each level. The number of levels is well-defined, but we won't have data for all time-periods/geographies/NAICS levels. Because of nondisclosure rules, we might have 4-digit data at the county level but only 2-digit data at the ZIP code, and not have full coverage for all years. > One question that might impact this is the coding of your source data. Is it > all full 5-digit coding or are some records coded at a high level of detail > and others only to the top-level? We actually don't usually get below 4-digit, but the answer is the latter: some data is available at a detailed level and some data only at the top level. > What do you mean by censored? Is the data supplied to you pre-aggregated to > some level and censored to preserve confidentiality or are do you have the > record-level source data and the responsibility to suppress data in your > reports? Is the data suppression ad-hoc (i.e. someone comes to you and says The state agency gives us pre-aggregated data not microdata. The exact suppression rule is we don't get any data for a cell with fewer than 3 firms or where one firm has >80% of the total employment. Thus, the smaller the cells (smaller geography, more specific NAICS categorization), the more likely we run into empty cells. On Wed, Apr 7, 2010 at 2:17 AM, Scott Marlowe wrote: > Since it's an identifier and not really a numeric per se, I'd store it > as text. I mean it could as easily be a 5 character alpha code as 5 > character number code. Yes, already decided to store as text. Thanks for the substring index suggestion. On Wed, Apr 7, 2010 at 4:26 AM, Sergey Konoplev wrote: > Haven't you thought about ltree contrib? From the description of > ltree: "This module implements a data type ltree for representing > labels of data stored in a hierarchical tree-like structure". > > http://www.postgresql.org/docs/8.4/interactive/ltree.html No I was not familiar with this, and it looks really promising. Thanks for the pointer. It's a little repetitive, but it looks like the path should be stored as 61.611.6111. Assuming I define a column naics as ltree, being able to query WHERE nlevel(naics) = [2|3|4] will work nicely, and with the right views, my users never have to see it. Thanks again to everyone who replied. Any further remarks, questions, comments are welcome. --Lee -- Lee Hachadoorian PhD Student, Geography Program in Earth & Environmental Sciences CUNY Graduate Center -- Sent via pgsql-sql mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
Re: [SQL] Table Design for Hierarchical Data
Achilleas Mantzios wrote: Στις Wednesday 07 April 2010 11:06:44 ο/η Yeb Havinga έγραψε: Achilleas Mantzios wrote: You could also consider the genealogical approach, e.g. The parents of any node to the root, i.e. the path of any node to the root are depicted as parents[0] : immediate parent parents[1] : immediate parent of the above parent What I have more than one parent? Then it is no longer neither a tree, nor a hierarchical structure, but rather a graph. This a totally different problem. My question was actually an attempt to point at the inability of what you call the 'genealogical approach' database design to store information of more than one parent. regards, Yeb Havinga -- Sent via pgsql-sql mailing list ([email protected]) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-sql
