Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

Craig Taverner Wed, 02 Feb 2011 05:35:35 -0800

Hi Michael,

It is fun to see you thinking this through :-)


I have had a chance to discuss my crazy ideas with some of the neo4j team
last year, but this is the first time with you. And since December there is
even code to review (see https://github.com/craigtaverner/amanzi-index).

You are correct that I connect the data nodes directly to the index nodes.
And therefor the keys in the index nodes are representations of the data, at
various 'resolutions' as you climb the tree. This also means that each index
node represents a range of values, and so inequality operations can be
performed on the index nodes themselves. This also leads to another
interesting side effect, and that is that statistics can be calculated on
the index tree without having to touch the original data. To fully realize
this we also need some counters on the index nodes. The plan is that the
index itself becomes the basis for very fast statistics calculations.

Your suggestion for relationships is also related to another idea I
discussed last year, and that is to use a range of relationship types, where
the type itself contains information about the range of the child nodes, so
that traversing the index tree can be optimized by using only the
relationship type in most tests, and accessing node or relationship
properties only for edge cases. I think the idea makes sense, but there are
complications in that we can reach the capacity of the database if we
overuse relationship type for something that logically would normally be
seen as a relationship property.

Cheers, Craig

On Wed, Feb 2, 2011 at 1:46 AM, Michael Hunger <
michael.hun...@neotechnology.com> wrote:

> Craig,
>
> The interesting thing is - if you do this index in-graph you can directly
> attach the
> index-nodes to the real nodes and have them mimic the original node but
> instead of having
> the property value per property you have the mapped value.
>
> And you can many of such indexes in graph attached via a "index-name"
> relationship to the real node.
> Would also be easy to update the index nodes that way or to remove them
> altogether.
>
> But perhaps it is even better to use relationships pointing to the nodes
> contain the mapped property values
> and have the index nodes just be an ordering structure for those
> relationships.
>
> Cheers
>
> Michael
>
> Am 02.02.2011 um 01:33 schrieb Craig Taverner:
>
> > Hi Michael,
> >
> > I agree that _all_ is a strong work :-)
> >
> > What I do is provide a mapper interface for the domain modeler to
> implement
> > based on their own understanding of what 'order' and 'resolution' to
> provide
> > to the index. I have a set of pre-defined mappers for general cases, but
> do
> > not suppose they will satisfy all users needs. Hopefully, should this
> index
> > get developed further, we would have a much wider range of mappers that
> do
> > cover all or most cases.
> >
> > So, let me give some examples. The simple ones are the integer and float
> > mappers, where I have a few static factory methods for configuring them
> with
> > various resolutions and origins, and defaults for those also. The default
> > integer mapper, if I remember correctly, maps the integer to itself, and
> the
> > default float mapper maps the float to its integer value (so 2.1 and 2.9
> map
> > to integer 2). There are configurations that auto-configure the mapper
> based
> > on a sample dataset, setting the origin to the average value and choosing
> > resolution to divide the space into a reasonable number of steps.
> >
> > The only string mapper I have configured simply maps each character in
> the
> > string to an integer in the 94-character range from SPACE to SPACE+94
> (126).
> > It has a depth parameter, which controls the resolution of the first
> level
> > of the index. This means that AB and AC are in adjacent index nodes.
> There
> > is also a utility method for auto-configuring the range and depth based
> on a
> > sample dataset. This mapper should work well when indexing, for example,
> a
> > random selection out of a dictionary of words. However, if the sample
> data
> > conforms to a certain pattern, and there are internal correlations, then
> a
> > different mapper should be used (or written). The auto-config method does
> > look for common-prefixes in the sample data to ensure that the depth is
> set
> > correctly to not have all words in the same index node.
> >
> > The number of properties in the index does need to be defined at index
> > creation. So you need to configure the index with all property names and
> > types (and mappers) up-front, and then you simply add nodes to the index
> as
> > you go. If the node does not have one of the configured properties, it
> gets
> > a special index value for that. The resulting tree that is built is like
> an
> > n-dimensional pyramid with lots of gaps (depending on the evenness of the
> > property value distribution). Properties that do not need a tree of much
> > depth (eg. a discrete set of categories, or tags) will cause the tree to
> > collapse to n-1 dimensions at higher levels. So the total tree height is
> the
> > hight of the most diverse property indexed.
> >
> > My expectation is that this index will not perform as well as dedicated
> > single-property, auto-balancing indices, when queried with a single
> property
> > condition, but when queried with complex conditions involving multiple
> > conditions it should perform much better than indexes that use separate
> > trees for each property and then perform set-intersection on the
> > result-sets. The traverser will apply the conditions on all properties in
> > the tree, narrowing the result set as early as possible in the search.
> >
> > And now I've moved from what is coded into what is still envisioned, so
> I'd
> > better stop writing .... ;-)
> >
> > On Wed, Feb 2, 2011 at 1:11 AM, Michael Hunger <
> > michael.hun...@neotechnology.com> wrote:
> >
> >> Craig,
> >> how do you map _all_ properties to the integer (or rather numeric/long?)
> >> space.
> >>
> >> Does the mapping then also rely on the alphabetic (or comparision) order
> of
> >> the properties?
> >>
> >> Interesting approach. Is your space then as n-dimensional as the numbers
> of
> >> properties you have?
> >>
> >> Cheers
> >>
> >> Michael
> >>
> >> Am 02.02.2011 um 00:59 schrieb Craig Taverner:
> >>
> >>> Here is a crazy idea - how about taking the properties you care about
> and
> >>> dropping them into a combined lucene index? Then all results for nodes
> >> with
> >>> the same properties would be 'ambiguous'. Moving this forward to
> degrees
> >> of
> >>> ambiguity might be possible by creating the combined 'value' using a
> >> reduced
> >>> resolution of the properties (to increase similarity so the index will
> >> see
> >>> them as identical).
> >>>
> >>> Another option is the 'still very much in progress' composite index I
> >>> started in December. Since all properties are mapped into normal
> integer
> >>> space, the euclidean distance of the first level index nodes from each
> >> other
> >>> is a discrete measure of similarity. A distance of zero means that the
> >> nodes
> >>> attach to the same index node, and are very similar or identical.
> Higher
> >>> values mean greater dissimilarity. This index theoretically supports
> any
> >>> number of properties of any type (including strings) and allows you to
> >> plug
> >>> in your own value->index mappers, which means you can control what you
> >> mean
> >>> by 'similar'.
> >>>
> >>> On Tue, Feb 1, 2011 at 9:52 PM, Ben Sand <b...@bensand.com> wrote:
> >>>
> >>>> I was working on a project that used matching algorithms a while back.
> >>>>
> >>>> What you have is an n-dimensional matching problem. I can't remember
> >>>> specifically what the last project were using, but this and the linked
> >>>> algos
> >>>> may be what you're looking for:
> >>>> http://en.wikipedia.org/wiki/Mahalanobis_distance
> >>>>
> >>>> On 2 February 2011 07:34, Tim McNamara <paperl...@timmcnamara.co.nz>
> >>>> wrote:
> >>>>
> >>>>> Say I have two nodes,
> >>>>>
> >>>>>
> >>>>> { "type": "person", "name": "Neo" }
> >>>>> { "type": "person", "name": "Neo" }
> >>>>>
> >>>>>
> >>>>>
> >>>>> Over time, I learn their locations. They both live in the same city.
> >> This
> >>>>> increases the chances that they're the same person. However, over
> time
> >> it
> >>>>> turns out that their ages differ, therefore it's far less likely that
> >>>> they
> >>>>> are the same Neo.
> >>>>>
> >>>>>
> >>>>> Is there anything inside of Neo4j that attempts to determine how
> close
> >>>> two
> >>>>> nodes are? E.g. to what extent their subtrees and properties match?
> >>>>> Additionally, can anyone suggest literature for algorithms for
> >>>>> disambiguating the two entities?
> >>>>>
> >>>>>
> >>>>> If I wanted to implement something that searches for similarities,
> that
> >>>>> returns a probability of a match, can I do this within the database
> or
> >>>>> should I implement it within the application?
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Tim McNamara
> >>>>> @timClicks
> >>>>> http://timmcnamara.co.nz
> >>>>>
> >>>>>
> >>>>>
> >>>>> _______________________________________________
> >>>>> Neo4j mailing list
> >>>>> User@lists.neo4j.org
> >>>>> https://lists.neo4j.org/mailman/listinfo/user
> >>>>>
> >>>> _______________________________________________
> >>>> Neo4j mailing list
> >>>> User@lists.neo4j.org
> >>>> https://lists.neo4j.org/mailman/listinfo/user
> >>>>
> >>> _______________________________________________
> >>> Neo4j mailing list
> >>> User@lists.neo4j.org
> >>> https://lists.neo4j.org/mailman/listinfo/user
> >>
> >> _______________________________________________
> >> Neo4j mailing list
> >> User@lists.neo4j.org
> >> https://lists.neo4j.org/mailman/listinfo/user
> >>
> > _______________________________________________
> > Neo4j mailing list
> > User@lists.neo4j.org
> > https://lists.neo4j.org/mailman/listinfo/user
>
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user
>
_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

Reply via email to