Mapping property values to a discrete set and referring to them by their 'id' is quite reminiscent of a foreign key in a relational database. Why not take the next step and make a node for each value, and link all data nodes to the value nodes? That is then a kind of index, a category index.
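To make the idea concrete, here is a rough sketch in plain Python rather than the Neo4j API. The class and method names are made up for illustration, and a dict stands in for the graph: each distinct (property, value) pair gets one value node, and data nodes are linked to it instead of storing the string themselves.

```python
from collections import defaultdict


class CategoryIndex:
    """Toy in-memory stand-in for a graph with one value node per distinct value."""

    def __init__(self):
        # (property name, value) -> id of the value node
        self.value_nodes = {}
        # value node id -> ids of the data nodes linked to it
        self.links = defaultdict(set)

    def link(self, data_node_id, prop, value):
        """Link a data node to the value node for (prop, value), creating it if new."""
        key = (prop, value)
        if key not in self.value_nodes:
            self.value_nodes[key] = len(self.value_nodes)
        value_id = self.value_nodes[key]
        self.links[value_id].add(data_node_id)
        return value_id

    def find(self, prop, value):
        """Return the data nodes linked to the value node, i.e. an index lookup."""
        value_id = self.value_nodes.get((prop, value))
        if value_id is None:
            return set()
        return set(self.links[value_id])


index = CategoryIndex()
index.link(1, "highway", "residential")
index.link(2, "highway", "residential")
index.link(3, "highway", "motorway")
print(sorted(index.find("highway", "residential")))  # [1, 2]
```

Each string is stored once, on its value node, and a lookup by value is just a walk over that node's links. In a real graph the links would be relationships, which is where the relationship count comes from.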
I was thinking about doing this for the OSM importer myself, but I have an aversion to the number of relationships that would then appear. It is still worth considering, as a relationship takes less space than a string.

Another trick I discussed with the neo4j guys (to a mixed response) was to use lucene to index the property values, but not actually save the value to the node. The value then exists only in the lucene index. If the only purpose of the value is to find nodes using the index, this is certainly easier than adding relationships. The primary negative comment from the neo4j guys was that lucene is not protected from failure like the neo4j core, so without the original properties you cannot recreate the index if you ever need to. So I'm still favouring the category index approach.

In cases where the value diversity is very high (very many different values), the index can be split into a tree to improve performance. In cases where very many data nodes link to very few index nodes, there is another trick I'm fond of: the composite index. Indexing multiple properties at the same time increases the number of index nodes and decreases the number of data nodes connected to each index node, which is better for query traversal performance :-)

On Tue, Jul 27, 2010 at 9:19 PM, Davide <[email protected]> wrote:
> Lately I've played with some OpenStreetMap data...
> Nodes imported have many properties with a small set of values (road
> type, point-of-interest type, colour, ...) but I don't know in advance
> the set of values (sometimes a new value can become standard,
> sometimes an invalid value is present).
> Other node properties are just unique text (address, url).
> To speed up the import process I've tried to apply some kind of
> compression. I've seen that Neo4j encodes property names using a
> sequence of integers, and I've tried to do the same for the values of
> all the properties which I know contain only a small set.
>
> With this encoding the database is obviously much smaller..
>
> after importing sweden.osm the database dir is 552M:
> 100M neostore.propertystore.db
> 220M neostore.propertystore.db.arrays
> 227M neostore.propertystore.db.strings
>
> with 'compression' on is 344M:
> 100M neostore.propertystore.db
> 220M neostore.propertystore.db.arrays
> 20M neostore.propertystore.db.strings
> property value dictionary entries: 16286
> property value dictionary size: 387378 bytes
>
> I don't know if this is a common use case, but it would be cool to
> have this kind of compression out of the box!
>
> WDYT?
>
> Regards,
> --
> Davide Savazzi
> _______________________________________________
> Neo4j mailing list
> [email protected]
> https://lists.neo4j.org/mailman/listinfo/user

