Thanks so much for that explanation. Put it in a blog post.
Michael

On 02.02.2011 at 01:33, Craig Taverner wrote:
> Hi Michael,
>
> I agree that _all_ is a strong word :-)
>
> What I do is provide a mapper interface for the domain modeler to implement
> based on their own understanding of what 'order' and 'resolution' to provide
> to the index. I have a set of pre-defined mappers for general cases, but do
> not suppose they will satisfy all users' needs. Hopefully, should this index
> get developed further, we would have a much wider range of mappers that do
> cover all or most cases.
>
> So, let me give some examples. The simple ones are the integer and float
> mappers, where I have a few static factory methods for configuring them with
> various resolutions and origins, and defaults for those also. The default
> integer mapper, if I remember correctly, maps the integer to itself, and the
> default float mapper maps the float to its integer value (so 2.1 and 2.9 map
> to integer 2). There are configurations that auto-configure the mapper based
> on a sample dataset, setting the origin to the average value and choosing
> resolution to divide the space into a reasonable number of steps.
>
> The only string mapper I have configured simply maps each character in the
> string to an integer in the 94-character range from SPACE to SPACE+94 (126).
> It has a depth parameter, which controls the resolution of the first level
> of the index. This means that AB and AC are in adjacent index nodes. There
> is also a utility method for auto-configuring the range and depth based on a
> sample dataset. This mapper should work well when indexing, for example, a
> random selection out of a dictionary of words. However, if the sample data
> conforms to a certain pattern, and there are internal correlations, then a
> different mapper should be used (or written). The auto-config method does
> look for common prefixes in the sample data to ensure that the depth is set
> correctly to not have all words in the same index node.
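
To make the mapper description above concrete, here is a rough Python sketch of what such value-to-index mappers might look like. It is not Craig's code: the function names (`float_mapper`, `auto_float_mapper`, `string_mapper`), the `steps` parameter, the use of floor rounding, and the radix of 95 are all assumptions made for illustration.

```python
import math
from statistics import mean

def float_mapper(origin=0.0, resolution=1.0):
    """Return a function mapping a float to an integer index step.

    With the defaults, 2.1 and 2.9 both map to 2. Floor rounding is an
    assumption; the email only says "its integer value".
    """
    def to_index(value):
        return math.floor((value - origin) / resolution)
    return to_index

def auto_float_mapper(sample, steps=100):
    """Configure a float mapper from sample data (hypothetical helper).

    Origin is the sample average; resolution divides the sample range
    into `steps` buckets.
    """
    origin = mean(sample)
    spread = max(sample) - min(sample)
    resolution = spread / steps if spread > 0 else 1.0
    return float_mapper(origin, resolution)

SPACE = 32
# Code points 32..126 inclusive are 95 values. The email calls this a
# 94-character range, so the exact radix here is an assumption.
RADIX = 95

def string_mapper(depth=2):
    """Return a function mapping the first `depth` characters to an integer.

    Each character becomes one digit in base RADIX, so strings that differ
    only in their last considered character get neighbouring keys.
    """
    def to_index(text):
        key = 0
        # Short strings are padded with SPACE, which maps to digit 0.
        for ch in text[:depth].ljust(depth):
            digit = min(max(ord(ch) - SPACE, 0), RADIX - 1)
            key = key * RADIX + digit
        return key
    return to_index
```

With `depth=2`, "AB" and "AC" differ by exactly one key, which matches the "adjacent index nodes" behaviour described above. A larger depth gives a finer first-level resolution, which is what the common-prefix check in the auto-config would need to raise when all sample words share a prefix.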
>
> The number of properties in the index does need to be defined at index
> creation. So you need to configure the index with all property names and
> types (and mappers) up-front, and then you simply add nodes to the index as
> you go. If the node does not have one of the configured properties, it gets
> a special index value for that. The resulting tree that is built is like an
> n-dimensional pyramid with lots of gaps (depending on the evenness of the
> property value distribution). Properties that do not need a tree of much
> depth (e.g. a discrete set of categories, or tags) will cause the tree to
> collapse to n-1 dimensions at higher levels. So the total tree height is the
> height of the most diverse property indexed.
>
> My expectation is that this index will not perform as well as dedicated
> single-property, auto-balancing indices when queried with a single property
> condition, but when queried with complex conditions involving multiple
> properties it should perform much better than indexes that use separate
> trees for each property and then perform set-intersection on the
> result-sets. The traverser will apply the conditions on all properties in
> the tree, narrowing the result set as early as possible in the search.
>
> And now I've moved from what is coded into what is still envisioned, so I'd
> better stop writing .... ;-)
>
> On Wed, Feb 2, 2011 at 1:11 AM, Michael Hunger <
> michael.hun...@neotechnology.com> wrote:
>
>> Craig,
>> how do you map _all_ properties to the integer (or rather numeric/long?)
>> space?
>>
>> Does the mapping then also rely on the alphabetic (or comparison) order of
>> the properties?
>>
>> Interesting approach. Is your space then as n-dimensional as the number of
>> properties you have?
>>
>> Cheers
>>
>> Michael
>>
>> On 02.02.2011 at 00:59, Craig Taverner wrote:
>>
>>> Here is a crazy idea - how about taking the properties you care about and
>>> dropping them into a combined Lucene index?
>>> Then all results for nodes with
>>> the same properties would be 'ambiguous'. Moving this forward to degrees of
>>> ambiguity might be possible by creating the combined 'value' using a reduced
>>> resolution of the properties (to increase similarity so the index will see
>>> them as identical).
>>>
>>> Another option is the 'still very much in progress' composite index I
>>> started in December. Since all properties are mapped into normal integer
>>> space, the euclidean distance of the first level index nodes from each other
>>> is a discrete measure of similarity. A distance of zero means that the nodes
>>> attach to the same index node, and are very similar or identical. Higher
>>> values mean greater dissimilarity. This index theoretically supports any
>>> number of properties of any type (including strings) and allows you to plug
>>> in your own value->index mappers, which means you can control what you mean
>>> by 'similar'.
>>>
>>> On Tue, Feb 1, 2011 at 9:52 PM, Ben Sand <b...@bensand.com> wrote:
>>>
>>>> I was working on a project that used matching algorithms a while back.
>>>>
>>>> What you have is an n-dimensional matching problem. I can't remember
>>>> specifically what the last project was using, but this and the linked
>>>> algos may be what you're looking for:
>>>> http://en.wikipedia.org/wiki/Mahalanobis_distance
>>>>
>>>> On 2 February 2011 07:34, Tim McNamara <paperl...@timmcnamara.co.nz>
>>>> wrote:
>>>>
>>>>> Say I have two nodes,
>>>>>
>>>>> { "type": "person", "name": "Neo" }
>>>>> { "type": "person", "name": "Neo" }
>>>>>
>>>>> Over time, I learn their locations. They both live in the same city. This
>>>>> increases the chances that they're the same person. However, over time it
>>>>> turns out that their ages differ, therefore it's far less likely that they
>>>>> are the same Neo.
>>>>>
>>>>> Is there anything inside of Neo4j that attempts to determine how close two
>>>>> nodes are? E.g. to what extent their subtrees and properties match?
>>>>> Additionally, can anyone suggest literature for algorithms for
>>>>> disambiguating the two entities?
>>>>>
>>>>> If I wanted to implement something that searches for similarities, that
>>>>> returns a probability of a match, can I do this within the database or
>>>>> should I implement it within the application?
>>>>>
>>>>> --
>>>>> Tim McNamara
>>>>> @timClicks
>>>>> http://timmcnamara.co.nz
>>>>>
>>>>> _______________________________________________
>>>>> Neo4j mailing list
>>>>> User@lists.neo4j.org
>>>>> https://lists.neo4j.org/mailman/listinfo/user
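
For anyone reading this thread later: the idea from Craig's first reply, that the euclidean distance between first-level index positions acts as a discrete similarity measure, could be sketched in application code roughly as below. This is only an illustration under stated assumptions and not part of Neo4j or of the composite index itself. The `MISSING` sentinel, the helper names, the city value, the ages, and the three toy mappers are all hypothetical.

```python
import math

# Hypothetical sentinel for "node lacks this property". The thread only says
# such nodes get "a special index value"; -1 is an arbitrary choice here.
MISSING = -1

def index_coords(node, mappers):
    """Map each configured property of a node to one integer coordinate."""
    return tuple(
        mapper(node[name]) if name in node else MISSING
        for name, mapper in mappers.items()
    )

def index_distance(a, b, mappers):
    """Euclidean distance between the index coordinates of two nodes."""
    return math.dist(index_coords(a, mappers), index_coords(b, mappers))

# Toy mappers: the choice of mapper decides what counts as 'similar'.
mappers = {
    "name": lambda s: ord(s[0]) - 32,   # first character only
    "city": lambda s: ord(s[0]) - 32,   # first character only
    "age": lambda years: years // 10,   # decade buckets
}

neo_a = {"type": "person", "name": "Neo", "city": "Springfield", "age": 31}
neo_b = {"type": "person", "name": "Neo", "city": "Springfield", "age": 34}
neo_c = {"type": "person", "name": "Neo", "city": "Springfield", "age": 58}

print(index_distance(neo_a, neo_b, mappers))  # same index cell
print(index_distance(neo_a, neo_c, mappers))  # two age buckets apart
```

A distance of zero only says that two nodes fall into the same coarse cell, so it marks candidates for a closer comparison and is not a probability of a match. The sentinel also distorts distances whenever a property is missing on one side, so a real implementation would likely treat that case separately.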