Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

Craig Taverner Tue, 01 Feb 2011 16:33:31 -0800

Hi Michael,

I agree that _all_ is a strong work :-)

What I do is provide a mapper interface for the domain modeler to implement
based on their own understanding of what 'order' and 'resolution' to provide
to the index. I have a set of pre-defined mappers for general cases, but do
not suppose they will satisfy all users needs. Hopefully, should this index
get developed further, we would have a much wider range of mappers that do
cover all or most cases.

So, let me give some examples. The simple ones are the integer and float
mappers, where I have a few static factory methods for configuring them with
various resolutions and origins, and defaults for those also. The default
integer mapper, if I remember correctly, maps the integer to itself, and the
default float mapper maps the float to its integer value (so 2.1 and 2.9 map
to integer 2). There are configurations that auto-configure the mapper based
on a sample dataset, setting the origin to the average value and choosing
resolution to divide the space into a reasonable number of steps.

The only string mapper I have configured simply maps each character in the
string to an integer in the 94-character range from SPACE to SPACE+94 (126).
It has a depth parameter, which controls the resolution of the first level
of the index. This means that AB and AC are in adjacent index nodes. There
is also a utility method for auto-configuring the range and depth based on a
sample dataset. This mapper should work well when indexing, for example, a
random selection out of a dictionary of words. However, if the sample data
conforms to a certain pattern, and there are internal correlations, then a
different mapper should be used (or written). The auto-config method does
look for common-prefixes in the sample data to ensure that the depth is set
correctly to not have all words in the same index node.

The number of properties in the index does need to be defined at index
creation. So you need to configure the index with all property names and
types (and mappers) up-front, and then you simply add nodes to the index as
you go. If the node does not have one of the configured properties, it gets
a special index value for that. The resulting tree that is built is like an
n-dimensional pyramid with lots of gaps (depending on the evenness of the
property value distribution). Properties that do not need a tree of much
depth (eg. a discrete set of categories, or tags) will cause the tree to
collapse to n-1 dimensions at higher levels. So the total tree height is the
hight of the most diverse property indexed.

My expectation is that this index will not perform as well as dedicated
single-property, auto-balancing indices, when queried with a single property
condition, but when queried with complex conditions involving multiple
conditions it should perform much better than indexes that use separate
trees for each property and then perform set-intersection on the
result-sets. The traverser will apply the conditions on all properties in
the tree, narrowing the result set as early as possible in the search.

And now I've moved from what is coded into what is still envisioned, so I'd
better stop writing .... ;-)

On Wed, Feb 2, 2011 at 1:11 AM, Michael Hunger <
michael.hun...@neotechnology.com> wrote:

> Craig,
> how do you map _all_ properties to the integer (or rather numeric/long?)
> space.
>
> Does the mapping then also rely on the alphabetic (or comparision) order of
> the properties?
>
> Interesting approach. Is your space then as n-dimensional as the numbers of
> properties you have?
>
> Cheers
>
> Michael
>
> Am 02.02.2011 um 00:59 schrieb Craig Taverner:
>
> > Here is a crazy idea - how about taking the properties you care about and
> > dropping them into a combined lucene index? Then all results for nodes
> with
> > the same properties would be 'ambiguous'. Moving this forward to degrees
> of
> > ambiguity might be possible by creating the combined 'value' using a
> reduced
> > resolution of the properties (to increase similarity so the index will
> see
> > them as identical).
> >
> > Another option is the 'still very much in progress' composite index I
> > started in December. Since all properties are mapped into normal integer
> > space, the euclidean distance of the first level index nodes from each
> other
> > is a discrete measure of similarity. A distance of zero means that the
> nodes
> > attach to the same index node, and are very similar or identical. Higher
> > values mean greater dissimilarity. This index theoretically supports any
> > number of properties of any type (including strings) and allows you to
> plug
> > in your own value->index mappers, which means you can control what you
> mean
> > by 'similar'.
> >
> > On Tue, Feb 1, 2011 at 9:52 PM, Ben Sand <b...@bensand.com> wrote:
> >
> >> I was working on a project that used matching algorithms a while back.
> >>
> >> What you have is an n-dimensional matching problem. I can't remember
> >> specifically what the last project were using, but this and the linked
> >> algos
> >> may be what you're looking for:
> >> http://en.wikipedia.org/wiki/Mahalanobis_distance
> >>
> >> On 2 February 2011 07:34, Tim McNamara <paperl...@timmcnamara.co.nz>
> >> wrote:
> >>
> >>> Say I have two nodes,
> >>>
> >>>
> >>> { "type": "person", "name": "Neo" }
> >>> { "type": "person", "name": "Neo" }
> >>>
> >>>
> >>>
> >>> Over time, I learn their locations. They both live in the same city.
> This
> >>> increases the chances that they're the same person. However, over time
> it
> >>> turns out that their ages differ, therefore it's far less likely that
> >> they
> >>> are the same Neo.
> >>>
> >>>
> >>> Is there anything inside of Neo4j that attempts to determine how close
> >> two
> >>> nodes are? E.g. to what extent their subtrees and properties match?
> >>> Additionally, can anyone suggest literature for algorithms for
> >>> disambiguating the two entities?
> >>>
> >>>
> >>> If I wanted to implement something that searches for similarities, that
> >>> returns a probability of a match, can I do this within the database or
> >>> should I implement it within the application?
> >>>
> >>>
> >>> --
> >>> Tim McNamara
> >>> @timClicks
> >>> http://timmcnamara.co.nz
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> Neo4j mailing list
> >>> User@lists.neo4j.org
> >>> https://lists.neo4j.org/mailman/listinfo/user
> >>>
> >> _______________________________________________
> >> Neo4j mailing list
> >> User@lists.neo4j.org
> >> https://lists.neo4j.org/mailman/listinfo/user
> >>
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
>
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

Reply via email to