Mapping property values to a discrete set, and referring to them by their
'id', is quite reminiscent of a foreign key in a relational database. Why not
take the next step and make a node for each value, and link all data nodes
to the value nodes? This is then a kind of index, a category index.
I was thinking about doing this for the OSM importer myself, but I have an
aversion to the number of relationships that would then appear. It is still
worth considering, as a relationship takes less space than a string.
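
To make the category index idea concrete, here is a minimal sketch in plain
Python rather than the Neo4j API. The class and method names (CategoryIndex,
tag, find) are made up for illustration, and the dictionaries only stand in
for nodes and relationships:

```python
from collections import defaultdict


class CategoryIndex:
    """Toy in-memory model: one value node per distinct (property, value)."""

    def __init__(self):
        self.value_nodes = {}              # (prop, value) -> value node id
        self.members = defaultdict(set)    # value node id -> data node ids
        self.links = defaultdict(dict)     # data node id -> {prop: value node id}

    def _value_node(self, prop, value):
        # Create the value node the first time a value is seen, so values
        # that are not known in advance are handled as well.
        key = (prop, value)
        if key not in self.value_nodes:
            self.value_nodes[key] = len(self.value_nodes)
        return self.value_nodes[key]

    def tag(self, data_node, prop, value):
        # Stand-in for creating a relationship from data node to value node.
        vid = self._value_node(prop, value)
        self.members[vid].add(data_node)
        self.links[data_node][prop] = vid

    def find(self, prop, value):
        # Look up the value node, then follow its relationships.
        vid = self.value_nodes.get((prop, value))
        return set() if vid is None else set(self.members[vid])


index = CategoryIndex()
index.tag(1, "highway", "residential")
index.tag(2, "highway", "residential")
index.tag(3, "highway", "primary")
print(sorted(index.find("highway", "residential")))  # [1, 2]
```

Each string value is stored once, on its value node, and every data node
holds only a link to it. The cost is one relationship per data node per
indexed property.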
Another trick I discussed with the Neo4j guys (to mixed response) was
to use Lucene to index the property values, but then not actually save
the value on the node. The value then exists only in the Lucene index.
If the only purpose of the value is to find nodes using the index, this
is certainly easier than adding relationships. The primary negative
comment from the Neo4j guys was that Lucene is not protected from
failure like the Neo4j core, and if the index is lost you cannot recreate
it without the original properties. So I'm still favouring the
category index approach.
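
The trade-off of index-only values can be shown with a small sketch. This is
plain Python with a dictionary standing in for Lucene, not the real Lucene or
Neo4j API; add_node and rebuild_index are hypothetical helpers:

```python
from collections import defaultdict

nodes = {}                       # node id -> properties stored on the node
value_index = defaultdict(set)   # (prop, value) -> node ids; stand-in for Lucene


def add_node(node_id, stored, index_only):
    # Only the 'stored' properties are saved on the node itself;
    # the 'index_only' properties exist in the index and nowhere else.
    nodes[node_id] = dict(stored)
    for prop, value in {**stored, **index_only}.items():
        value_index[(prop, value)].add(node_id)


def rebuild_index():
    # A rebuild can only use what is actually stored on the nodes.
    rebuilt = defaultdict(set)
    for node_id, props in nodes.items():
        for prop, value in props.items():
            rebuilt[(prop, value)].add(node_id)
    return rebuilt


add_node(1, {"name": "Main Street"}, {"highway": "residential"})

found = value_index[("highway", "residential")]
print(found)                       # the node can be found through the index
print("highway" in nodes[1])       # but the value is not on the node
print(("highway", "residential") in rebuild_index())  # and a rebuild loses it
```

The last line is the objection in a nutshell: once the index is gone, nothing
on the nodes can bring the index-only values back.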
In cases where the value diversity is very high (very many different
values), the index can be split into a tree to improve performance.
In cases where very many data nodes link to very few index nodes, there is
another trick I'm fond of: the composite index, which indexes multiple
properties at the same time. This increases the number of index nodes and
decreases the number of data nodes connected to each index node, which is
better for query traversal performance :-)
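
A rough sketch of the composite index effect, again in plain Python. The
sample data and the build_index helper are invented for illustration; the
point is only to compare the number of index nodes and the fan-out per
index node:

```python
from collections import defaultdict

# Hypothetical sample data: node id -> properties.
data = {
    1: {"highway": "residential", "surface": "asphalt"},
    2: {"highway": "residential", "surface": "gravel"},
    3: {"highway": "residential", "surface": "asphalt"},
    4: {"highway": "primary", "surface": "asphalt"},
}


def build_index(data, props):
    # One index node per distinct combination of the given properties.
    index = defaultdict(set)
    for node_id, values in data.items():
        key = tuple(values.get(p) for p in props)
        index[key].add(node_id)
    return index


single = build_index(data, ["highway"])
composite = build_index(data, ["highway", "surface"])

for name, idx in (("single", single), ("composite", composite)):
    largest = max(len(members) for members in idx.values())
    print(name, "index nodes:", len(idx), "largest fan-out:", largest)
```

With this sample data the single-property index has 2 index nodes with up to
3 data nodes each, while the composite index has 3 index nodes with at most
2 data nodes each.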
On Tue, Jul 27, 2010 at 9:19 PM, Davide dav...@davidesavazzi.net wrote:
Lately I've played with some OpenStreetMap data...
Nodes imported have many properties with a small set of values (road
type, point-of-interest type, colour, ...) but I don't know in advance
the set of values (sometimes a new value can become standard,
sometimes an invalid value is present).
Other node properties are just unique text (address, url).
To speed up the import process I've tried to apply some kind of
compression. I've seen that Neo4j encodes property names using a
sequence of integers, so I've tried to do the same for the values of all
the properties that I know contain only a small set of values.
With this encoding the database is obviously much smaller.
After importing sweden.osm the database dir is 552M:
100M neostore.propertystore.db
220M neostore.propertystore.db.arrays
227M neostore.propertystore.db.strings
With 'compression' on it is 344M:
100M neostore.propertystore.db
220M neostore.propertystore.db.arrays
20M neostore.propertystore.db.strings
property value dictionary entries: 16286
property value dictionary size: 387378 bytes
I don't know if this is a common use case, but it would be cool to
have this kind of compression out of the box!
WDYT?
Regards,
--
Davide Savazzi
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user