Re: [Neo4j] property value encoding

2010-07-28 Thread Davide
On Tue, Jul 27, 2010 at 22:29, Craig Taverner cr...@amanzi.com wrote:
 Mapping property values to a discrete set, and referring to them using
 their 'id' is quite reminiscent of a foreign key
 in a relational database.

Yes, with a relational database I would create foreign keys and maybe bitmap
indexes on the columns used for search. But I was thinking about compression
algorithms like these:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0605ahuja/index.html

 Why not take the next step and make a node for each value, and link
 all data nodes to the value nodes?

I've thought about using nodes and relationships to index these values,
but as you said, this would generate a large number of relationships.

I have 200 indexed properties, and the dictionary of encoded values
contains 15k entries (many properties have a set of hundreds of possible
values).
For now I have only 600k nodes, but each node has one or more encoded
properties.

Considering that maintaining an index is expensive, an acceptable
trade-off might be to put the dictionary of encoded values in the graph, but
create relationships from dictionary entries to nodes only for the
OpenStreetMap properties used for search.
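
A minimal sketch of that trade-off, assuming a plain in-memory model rather
than the Neo4j API: every value is encoded through the dictionary, but links
from a dictionary entry to data nodes are created only for properties listed
in SEARCHABLE. All names here (SEARCHABLE, set_property, find) are
hypothetical and chosen for illustration.

```python
from collections import defaultdict

SEARCHABLE = {"amenity"}  # hypothetical choice of properties used for search

value_ids = {}                   # (property, value) -> dictionary entry id
links = defaultdict(set)         # dictionary entry id -> linked data node ids
node_props = defaultdict(dict)   # node id -> {property: dictionary entry id}


def set_property(node_id, prop, value):
    """Store the encoded value; link entry -> node only if prop is searchable."""
    entry = value_ids.setdefault((prop, value), len(value_ids))
    node_props[node_id][prop] = entry
    if prop in SEARCHABLE:
        links[entry].add(node_id)
    return entry


def find(prop, value):
    """Return linked node ids; empty for unknown or non-searchable values."""
    entry = value_ids.get((prop, value))
    if entry is None:
        return set()
    return links.get(entry, set())


set_property(1, "amenity", "cafe")
set_property(2, "amenity", "cafe")
set_property(1, "colour", "red")   # encoded, but no relationship is created
```

In this sketch only two links exist for three stored property values, which
is the saving the trade-off is after; the cost is that "colour" cannot be
searched through the dictionary.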

 In cases where very many data nodes link to very few index nodes,
 there is another trick I'm fond of, and that is the
 composite index

I need more information on how this composite index is implemented :)

-- 
Davide Savazzi
___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


[Neo4j] property value encoding

2010-07-27 Thread Davide
Lately I've been playing with some OpenStreetMap data...
Imported nodes have many properties with a small set of values (road
type, point-of-interest type, colour, ...), but I don't know the set of
values in advance (sometimes a new value becomes standard, sometimes an
invalid value is present).
Other node properties are just unique text (address, url).
To speed up the import process I've tried to apply some kind of
compression. I've seen that Neo4j encodes property names using a
sequence of integers, so I've tried to do the same for the values of all
the properties that I know contain only a small set.
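
A minimal sketch of this kind of value encoding, assuming a single in-memory
dictionary shared by all encoded properties. The ValueDictionary class and
its methods are hypothetical and are not part of Neo4j; ids are handed out on
first sight, so the set of values does not have to be known in advance.

```python
class ValueDictionary:
    """Maps each distinct property value to a small integer id (hypothetical)."""

    def __init__(self):
        self._ids = {}      # value -> id
        self._values = []   # id -> value

    def encode(self, value):
        # Assign the next free id the first time a value is seen.
        if value not in self._ids:
            self._ids[value] = len(self._values)
            self._values.append(value)
        return self._ids[value]

    def decode(self, value_id):
        return self._values[value_id]

    def __len__(self):
        return len(self._values)


dictionary = ValueDictionary()
road_types = ["residential", "primary", "residential", "footway"]
encoded = [dictionary.encode(v) for v in road_types]
```

The node would then store the integer instead of the string, and the string
is kept once in the dictionary.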

With this encoding the database is much smaller:

after importing sweden.osm the database dir is 552M:
100M neostore.propertystore.db
220M neostore.propertystore.db.arrays
227M neostore.propertystore.db.strings

with 'compression' on it is 344M:
100M neostore.propertystore.db
220M neostore.propertystore.db.arrays
20M neostore.propertystore.db.strings
property value dictionary entries: 16286
property value dictionary size: 387378 bytes
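
The figures above can be checked with a little arithmetic. The numbers are
taken from the listing; the variable names are only for illustration.

```python
before_total_mb = 552        # database dir without encoding
after_total_mb = 344         # database dir with encoding
strings_before_mb = 227      # neostore.propertystore.db.strings, before
strings_after_mb = 20        # neostore.propertystore.db.strings, after
dictionary_bytes = 387378
dictionary_entries = 16286

saved_mb = before_total_mb - after_total_mb               # total saving
saved_fraction = saved_mb / before_total_mb               # share of original size
strings_saved_mb = strings_before_mb - strings_after_mb   # saving in the string store
avg_entry_bytes = dictionary_bytes / dictionary_entries   # average dictionary entry

print(f"saved {saved_mb}M ({saved_fraction:.1%}), "
      f"string store saved {strings_saved_mb}M, "
      f"average dictionary entry {avg_entry_bytes:.1f} bytes")
```

Almost all of the saving comes from the string store, and the dictionary
itself stays well under 1M.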

I don't know if this is a common use case, but it would be cool to
have this kind of compression out of the box!

WDYT?

Regards,
-- 
Davide Savazzi


Re: [Neo4j] property value encoding

2010-07-27 Thread Craig Taverner
Mapping property values to a discrete set, and referring to them using their
'id' is quite reminiscent of a foreign key in a relational database. Why not
take the next step and make a node for each value, and link all data nodes
to the value nodes? This is then a kind of index, a category index.
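
A minimal sketch of such a category index, assuming an in-memory model instead
of real Neo4j nodes and relationships: each (property, value) key stands for
one value node, and each member of its set stands for one relationship to a
data node. CategoryIndex and its methods are hypothetical names.

```python
from collections import defaultdict


class CategoryIndex:
    """One 'value node' per (property, value); members model relationships."""

    def __init__(self):
        self._members = defaultdict(set)

    def link(self, node_id, prop, value):
        # Equivalent to creating a relationship data node -> value node.
        self._members[(prop, value)].add(node_id)

    def nodes_with(self, prop, value):
        # Equivalent to traversing from the value node to its data nodes.
        return set(self._members.get((prop, value), set()))

    def value_node_count(self):
        return len(self._members)

    def relationship_count(self):
        return sum(len(m) for m in self._members.values())


index = CategoryIndex()
index.link(1, "highway", "residential")
index.link(2, "highway", "residential")
index.link(3, "highway", "primary")
index.link(3, "surface", "asphalt")
```

The relationship count grows with every indexed property on every data node,
which is the cost being weighed here.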

I was thinking about doing this for the OSM importer myself, but I have an
aversion to the number of relationships that would then appear. It is still
worth considering, as a relationship takes less space than a string.

Another trick I discussed with the Neo4j guys (to mixed response) was
to use Lucene to index the property values, but then not actually save
the value to the node. This means that the value exists only in the
Lucene index. If the only purpose of the value is to find nodes using
the index, this is certainly easier than adding relationships. The primary
negative comment from the Neo4j guys was that Lucene is not protected from
failure like the Neo4j core, and without the original properties you cannot
recreate the index if that becomes necessary. So I'm still favouring the
category index approach.

In cases where the value diversity is very high (very many different
values), the index can be split into a tree to improve performance.

In cases where very many data nodes link to very few index nodes, there is
another trick I'm fond of, and that is the composite index: indexing
multiple properties at the same time. This increases the number of index
nodes and decreases the number of data nodes connected to each index node,
which is better for query traversal performance :-)
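
A minimal sketch of the composite index idea, assuming an in-memory model
where each index key stands for one index node. The sample data and the
build_index helper are hypothetical; the point is only to show how combining
properties changes the fan-out per index node.

```python
from collections import defaultdict

# Hypothetical sample data: node id -> properties.
nodes = {
    1: {"highway": "residential", "surface": "asphalt"},
    2: {"highway": "residential", "surface": "gravel"},
    3: {"highway": "residential", "surface": "asphalt"},
    4: {"highway": "primary", "surface": "asphalt"},
}


def build_index(nodes, properties):
    """Group node ids under one key per combination of the given properties."""
    index = defaultdict(set)
    for node_id, props in nodes.items():
        key = tuple(props[p] for p in properties)
        index[key].add(node_id)
    return index


def max_fan_out(index):
    """Largest number of data nodes attached to a single index node."""
    return max(len(members) for members in index.values())


single = build_index(nodes, ["highway"])
composite = build_index(nodes, ["highway", "surface"])
```

With this sample data the single-property index has 2 index nodes with up to
3 data nodes each, while the composite index has 3 index nodes with at most 2
data nodes each.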
