Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

2011-02-02 Thread Craig Taverner
Hi Michael,

It is fun to see you thinking this through :-)

I have had a chance to discuss my crazy ideas with some of the neo4j team
last year, but this is the first time with you. And since December there is
even code to review (see https://github.com/craigtaverner/amanzi-index).

You are correct that I connect the data nodes directly to the index nodes.
And therefor the keys in the index nodes are representations of the data, at
various 'resolutions' as you climb the tree. This also means that each index
node represents a range of values, and so inequality operations can be
performed on the index nodes themselves. This also leads to another
interesting side effect, and that is that statistics can be calculated on
the index tree without having to touch the original data. To fully realize
this we also need some counters on the index nodes. The plan is that the
index itself becomes the basis for very fast statistics calculations.

Your suggestion for relationships is also related to another idea I
discussed last year, and that is to use a range of relationship types, where
the type itself contains information about the range of the child nodes, so
that traversing the index tree can be optimized by using only the
relationship type in most tests, and accessing node or relationship
properties only for edge cases. I think the idea makes sense, but there are
complications in that we can reach the capacity of the database if we
overuse relationship type for something that logically would normally be
seen as a relationship property.

Cheers, Craig

On Wed, Feb 2, 2011 at 1:46 AM, Michael Hunger 
michael.hun...@neotechnology.com wrote:

 Craig,

 The interesting thing is - if you do this index in-graph you can directly
 attach the
 index-nodes to the real nodes and have them mimic the original node but
 instead of having
 the property value per property you have the mapped value.

 And you can many of such indexes in graph attached via a index-name
 relationship to the real node.
 Would also be easy to update the index nodes that way or to remove them
 altogether.

 But perhaps it is even better to use relationships pointing to the nodes
 contain the mapped property values
 and have the index nodes just be an ordering structure for those
 relationships.

 Cheers

 Michael

 Am 02.02.2011 um 01:33 schrieb Craig Taverner:

  Hi Michael,
 
  I agree that _all_ is a strong work :-)
 
  What I do is provide a mapper interface for the domain modeler to
 implement
  based on their own understanding of what 'order' and 'resolution' to
 provide
  to the index. I have a set of pre-defined mappers for general cases, but
 do
  not suppose they will satisfy all users needs. Hopefully, should this
 index
  get developed further, we would have a much wider range of mappers that
 do
  cover all or most cases.
 
  So, let me give some examples. The simple ones are the integer and float
  mappers, where I have a few static factory methods for configuring them
 with
  various resolutions and origins, and defaults for those also. The default
  integer mapper, if I remember correctly, maps the integer to itself, and
 the
  default float mapper maps the float to its integer value (so 2.1 and 2.9
 map
  to integer 2). There are configurations that auto-configure the mapper
 based
  on a sample dataset, setting the origin to the average value and choosing
  resolution to divide the space into a reasonable number of steps.
 
  The only string mapper I have configured simply maps each character in
 the
  string to an integer in the 94-character range from SPACE to SPACE+94
 (126).
  It has a depth parameter, which controls the resolution of the first
 level
  of the index. This means that AB and AC are in adjacent index nodes.
 There
  is also a utility method for auto-configuring the range and depth based
 on a
  sample dataset. This mapper should work well when indexing, for example,
 a
  random selection out of a dictionary of words. However, if the sample
 data
  conforms to a certain pattern, and there are internal correlations, then
 a
  different mapper should be used (or written). The auto-config method does
  look for common-prefixes in the sample data to ensure that the depth is
 set
  correctly to not have all words in the same index node.
 
  The number of properties in the index does need to be defined at index
  creation. So you need to configure the index with all property names and
  types (and mappers) up-front, and then you simply add nodes to the index
 as
  you go. If the node does not have one of the configured properties, it
 gets
  a special index value for that. The resulting tree that is built is like
 an
  n-dimensional pyramid with lots of gaps (depending on the evenness of the
  property value distribution). Properties that do not need a tree of much
  depth (eg. a discrete set of categories, or tags) will cause the tree to
  collapse to n-1 dimensions at higher levels. So the total tree height is
 the
  hight 

Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

2011-02-01 Thread Peter Neubauer
Tim,
I don't know enough about disambiguation algos to point you in any
special direction, but implementation-wise you could hook into the
event framework, http://wiki.neo4j.org/content/Event_framework, in
order to do e.g. graph matching queries upon any modifying transaction
changing database content.

HTH

Cheers,

/peter neubauer

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org               - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.



On Tue, Feb 1, 2011 at 9:34 PM, Tim McNamara
paperl...@timmcnamara.co.nz wrote:
 Say I have two nodes,


 { type: person, name: Neo }
 { type: person, name: Neo }



 Over time, I learn their locations. They both live in the same city. This 
 increases the chances that they're the same person. However, over time it 
 turns out that their ages differ, therefore it's far less likely that they 
 are the same Neo.


 Is there anything inside of Neo4j that attempts to determine how close two 
 nodes are? E.g. to what extent their subtrees and properties match? 
 Additionally, can anyone suggest literature for algorithms for disambiguating 
 the two entities?


 If I wanted to implement something that searches for similarities, that 
 returns a probability of a match, can I do this within the database or should 
 I implement it within the application?


 --
 Tim McNamara
 @timClicks
 http://timmcnamara.co.nz



 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

2011-02-01 Thread Ben Sand
I was working on a project that used matching algorithms a while back.

What you have is an n-dimensional matching problem. I can't remember
specifically what the last project were using, but this and the linked algos
may be what you're looking for:
http://en.wikipedia.org/wiki/Mahalanobis_distance

On 2 February 2011 07:34, Tim McNamara paperl...@timmcnamara.co.nz wrote:

 Say I have two nodes,


 { type: person, name: Neo }
 { type: person, name: Neo }



 Over time, I learn their locations. They both live in the same city. This
 increases the chances that they're the same person. However, over time it
 turns out that their ages differ, therefore it's far less likely that they
 are the same Neo.


 Is there anything inside of Neo4j that attempts to determine how close two
 nodes are? E.g. to what extent their subtrees and properties match?
 Additionally, can anyone suggest literature for algorithms for
 disambiguating the two entities?


 If I wanted to implement something that searches for similarities, that
 returns a probability of a match, can I do this within the database or
 should I implement it within the application?


 --
 Tim McNamara
 @timClicks
 http://timmcnamara.co.nz



 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

2011-02-01 Thread Craig Taverner
Here is a crazy idea - how about taking the properties you care about and
dropping them into a combined lucene index? Then all results for nodes with
the same properties would be 'ambiguous'. Moving this forward to degrees of
ambiguity might be possible by creating the combined 'value' using a reduced
resolution of the properties (to increase similarity so the index will see
them as identical).

Another option is the 'still very much in progress' composite index I
started in December. Since all properties are mapped into normal integer
space, the euclidean distance of the first level index nodes from each other
is a discrete measure of similarity. A distance of zero means that the nodes
attach to the same index node, and are very similar or identical. Higher
values mean greater dissimilarity. This index theoretically supports any
number of properties of any type (including strings) and allows you to plug
in your own value-index mappers, which means you can control what you mean
by 'similar'.

On Tue, Feb 1, 2011 at 9:52 PM, Ben Sand b...@bensand.com wrote:

 I was working on a project that used matching algorithms a while back.

 What you have is an n-dimensional matching problem. I can't remember
 specifically what the last project were using, but this and the linked
 algos
 may be what you're looking for:
 http://en.wikipedia.org/wiki/Mahalanobis_distance

 On 2 February 2011 07:34, Tim McNamara paperl...@timmcnamara.co.nz
 wrote:

  Say I have two nodes,
 
 
  { type: person, name: Neo }
  { type: person, name: Neo }
 
 
 
  Over time, I learn their locations. They both live in the same city. This
  increases the chances that they're the same person. However, over time it
  turns out that their ages differ, therefore it's far less likely that
 they
  are the same Neo.
 
 
  Is there anything inside of Neo4j that attempts to determine how close
 two
  nodes are? E.g. to what extent their subtrees and properties match?
  Additionally, can anyone suggest literature for algorithms for
  disambiguating the two entities?
 
 
  If I wanted to implement something that searches for similarities, that
  returns a probability of a match, can I do this within the database or
  should I implement it within the application?
 
 
  --
  Tim McNamara
  @timClicks
  http://timmcnamara.co.nz
 
 
 
  ___
  Neo4j mailing list
  User@lists.neo4j.org
  https://lists.neo4j.org/mailman/listinfo/user
 
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

2011-02-01 Thread Michael Hunger
Craig,
how do you map _all_ properties to the integer (or rather numeric/long?) space.

Does the mapping then also rely on the alphabetic (or comparision) order of the 
properties?

Interesting approach. Is your space then as n-dimensional as the numbers of 
properties you have?

Cheers

Michael

Am 02.02.2011 um 00:59 schrieb Craig Taverner:

 Here is a crazy idea - how about taking the properties you care about and
 dropping them into a combined lucene index? Then all results for nodes with
 the same properties would be 'ambiguous'. Moving this forward to degrees of
 ambiguity might be possible by creating the combined 'value' using a reduced
 resolution of the properties (to increase similarity so the index will see
 them as identical).
 
 Another option is the 'still very much in progress' composite index I
 started in December. Since all properties are mapped into normal integer
 space, the euclidean distance of the first level index nodes from each other
 is a discrete measure of similarity. A distance of zero means that the nodes
 attach to the same index node, and are very similar or identical. Higher
 values mean greater dissimilarity. This index theoretically supports any
 number of properties of any type (including strings) and allows you to plug
 in your own value-index mappers, which means you can control what you mean
 by 'similar'.
 
 On Tue, Feb 1, 2011 at 9:52 PM, Ben Sand b...@bensand.com wrote:
 
 I was working on a project that used matching algorithms a while back.
 
 What you have is an n-dimensional matching problem. I can't remember
 specifically what the last project were using, but this and the linked
 algos
 may be what you're looking for:
 http://en.wikipedia.org/wiki/Mahalanobis_distance
 
 On 2 February 2011 07:34, Tim McNamara paperl...@timmcnamara.co.nz
 wrote:
 
 Say I have two nodes,
 
 
 { type: person, name: Neo }
 { type: person, name: Neo }
 
 
 
 Over time, I learn their locations. They both live in the same city. This
 increases the chances that they're the same person. However, over time it
 turns out that their ages differ, therefore it's far less likely that
 they
 are the same Neo.
 
 
 Is there anything inside of Neo4j that attempts to determine how close
 two
 nodes are? E.g. to what extent their subtrees and properties match?
 Additionally, can anyone suggest literature for algorithms for
 disambiguating the two entities?
 
 
 If I wanted to implement something that searches for similarities, that
 returns a probability of a match, can I do this within the database or
 should I implement it within the application?
 
 
 --
 Tim McNamara
 @timClicks
 http://timmcnamara.co.nz
 
 
 
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user
 
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user
 
 ___
 Neo4j mailing list
 User@lists.neo4j.org
 https://lists.neo4j.org/mailman/listinfo/user

___
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user


Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

2011-02-01 Thread Michael Hunger
Thanks so much for that explanation.

Put it in a blog post.

Michael

Am 02.02.2011 um 01:33 schrieb Craig Taverner:

 Hi Michael,
 
 I agree that _all_ is a strong work :-)
 
 What I do is provide a mapper interface for the domain modeler to implement
 based on their own understanding of what 'order' and 'resolution' to provide
 to the index. I have a set of pre-defined mappers for general cases, but do
 not suppose they will satisfy all users needs. Hopefully, should this index
 get developed further, we would have a much wider range of mappers that do
 cover all or most cases.
 
 So, let me give some examples. The simple ones are the integer and float
 mappers, where I have a few static factory methods for configuring them with
 various resolutions and origins, and defaults for those also. The default
 integer mapper, if I remember correctly, maps the integer to itself, and the
 default float mapper maps the float to its integer value (so 2.1 and 2.9 map
 to integer 2). There are configurations that auto-configure the mapper based
 on a sample dataset, setting the origin to the average value and choosing
 resolution to divide the space into a reasonable number of steps.
 
 The only string mapper I have configured simply maps each character in the
 string to an integer in the 94-character range from SPACE to SPACE+94 (126).
 It has a depth parameter, which controls the resolution of the first level
 of the index. This means that AB and AC are in adjacent index nodes. There
 is also a utility method for auto-configuring the range and depth based on a
 sample dataset. This mapper should work well when indexing, for example, a
 random selection out of a dictionary of words. However, if the sample data
 conforms to a certain pattern, and there are internal correlations, then a
 different mapper should be used (or written). The auto-config method does
 look for common-prefixes in the sample data to ensure that the depth is set
 correctly to not have all words in the same index node.
 
 The number of properties in the index does need to be defined at index
 creation. So you need to configure the index with all property names and
 types (and mappers) up-front, and then you simply add nodes to the index as
 you go. If the node does not have one of the configured properties, it gets
 a special index value for that. The resulting tree that is built is like an
 n-dimensional pyramid with lots of gaps (depending on the evenness of the
 property value distribution). Properties that do not need a tree of much
 depth (eg. a discrete set of categories, or tags) will cause the tree to
 collapse to n-1 dimensions at higher levels. So the total tree height is the
 hight of the most diverse property indexed.
 
 My expectation is that this index will not perform as well as dedicated
 single-property, auto-balancing indices, when queried with a single property
 condition, but when queried with complex conditions involving multiple
 conditions it should perform much better than indexes that use separate
 trees for each property and then perform set-intersection on the
 result-sets. The traverser will apply the conditions on all properties in
 the tree, narrowing the result set as early as possible in the search.
 
 And now I've moved from what is coded into what is still envisioned, so I'd
 better stop writing  ;-)
 
 On Wed, Feb 2, 2011 at 1:11 AM, Michael Hunger 
 michael.hun...@neotechnology.com wrote:
 
 Craig,
 how do you map _all_ properties to the integer (or rather numeric/long?)
 space.
 
 Does the mapping then also rely on the alphabetic (or comparision) order of
 the properties?
 
 Interesting approach. Is your space then as n-dimensional as the numbers of
 properties you have?
 
 Cheers
 
 Michael
 
 Am 02.02.2011 um 00:59 schrieb Craig Taverner:
 
 Here is a crazy idea - how about taking the properties you care about and
 dropping them into a combined lucene index? Then all results for nodes
 with
 the same properties would be 'ambiguous'. Moving this forward to degrees
 of
 ambiguity might be possible by creating the combined 'value' using a
 reduced
 resolution of the properties (to increase similarity so the index will
 see
 them as identical).
 
 Another option is the 'still very much in progress' composite index I
 started in December. Since all properties are mapped into normal integer
 space, the euclidean distance of the first level index nodes from each
 other
 is a discrete measure of similarity. A distance of zero means that the
 nodes
 attach to the same index node, and are very similar or identical. Higher
 values mean greater dissimilarity. This index theoretically supports any
 number of properties of any type (including strings) and allows you to
 plug
 in your own value-index mappers, which means you can control what you
 mean
 by 'similar'.
 
 On Tue, Feb 1, 2011 at 9:52 PM, Ben Sand b...@bensand.com wrote:
 
 I was working on a project that used matching algorithms a while back.
 
 What 

Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

2011-02-01 Thread Michael Hunger
Craig,

The interesting thing is - if you do this index in-graph you can directly 
attach the 
index-nodes to the real nodes and have them mimic the original node but instead 
of having
the property value per property you have the mapped value.

And you can many of such indexes in graph attached via a index-name 
relationship to the real node.
Would also be easy to update the index nodes that way or to remove them 
altogether.

But perhaps it is even better to use relationships pointing to the nodes 
contain the mapped property values
and have the index nodes just be an ordering structure for those relationships.

Cheers

Michael

Am 02.02.2011 um 01:33 schrieb Craig Taverner:

 Hi Michael,
 
 I agree that _all_ is a strong work :-)
 
 What I do is provide a mapper interface for the domain modeler to implement
 based on their own understanding of what 'order' and 'resolution' to provide
 to the index. I have a set of pre-defined mappers for general cases, but do
 not suppose they will satisfy all users needs. Hopefully, should this index
 get developed further, we would have a much wider range of mappers that do
 cover all or most cases.
 
 So, let me give some examples. The simple ones are the integer and float
 mappers, where I have a few static factory methods for configuring them with
 various resolutions and origins, and defaults for those also. The default
 integer mapper, if I remember correctly, maps the integer to itself, and the
 default float mapper maps the float to its integer value (so 2.1 and 2.9 map
 to integer 2). There are configurations that auto-configure the mapper based
 on a sample dataset, setting the origin to the average value and choosing
 resolution to divide the space into a reasonable number of steps.
 
 The only string mapper I have configured simply maps each character in the
 string to an integer in the 94-character range from SPACE to SPACE+94 (126).
 It has a depth parameter, which controls the resolution of the first level
 of the index. This means that AB and AC are in adjacent index nodes. There
 is also a utility method for auto-configuring the range and depth based on a
 sample dataset. This mapper should work well when indexing, for example, a
 random selection out of a dictionary of words. However, if the sample data
 conforms to a certain pattern, and there are internal correlations, then a
 different mapper should be used (or written). The auto-config method does
 look for common-prefixes in the sample data to ensure that the depth is set
 correctly to not have all words in the same index node.
 
 The number of properties in the index does need to be defined at index
 creation. So you need to configure the index with all property names and
 types (and mappers) up-front, and then you simply add nodes to the index as
 you go. If the node does not have one of the configured properties, it gets
 a special index value for that. The resulting tree that is built is like an
 n-dimensional pyramid with lots of gaps (depending on the evenness of the
 property value distribution). Properties that do not need a tree of much
 depth (eg. a discrete set of categories, or tags) will cause the tree to
 collapse to n-1 dimensions at higher levels. So the total tree height is the
 hight of the most diverse property indexed.
 
 My expectation is that this index will not perform as well as dedicated
 single-property, auto-balancing indices, when queried with a single property
 condition, but when queried with complex conditions involving multiple
 conditions it should perform much better than indexes that use separate
 trees for each property and then perform set-intersection on the
 result-sets. The traverser will apply the conditions on all properties in
 the tree, narrowing the result set as early as possible in the search.
 
 And now I've moved from what is coded into what is still envisioned, so I'd
 better stop writing  ;-)
 
 On Wed, Feb 2, 2011 at 1:11 AM, Michael Hunger 
 michael.hun...@neotechnology.com wrote:
 
 Craig,
 how do you map _all_ properties to the integer (or rather numeric/long?)
 space.
 
 Does the mapping then also rely on the alphabetic (or comparision) order of
 the properties?
 
 Interesting approach. Is your space then as n-dimensional as the numbers of
 properties you have?
 
 Cheers
 
 Michael
 
 Am 02.02.2011 um 00:59 schrieb Craig Taverner:
 
 Here is a crazy idea - how about taking the properties you care about and
 dropping them into a combined lucene index? Then all results for nodes
 with
 the same properties would be 'ambiguous'. Moving this forward to degrees
 of
 ambiguity might be possible by creating the combined 'value' using a
 reduced
 resolution of the properties (to increase similarity so the index will
 see
 them as identical).
 
 Another option is the 'still very much in progress' composite index I
 started in December. Since all properties are mapped into normal integer
 space, the euclidean distance of the first level index