Re: [Neo4j] Implementing disambiguation algorithms in Neo4j
Hi Michael,

It is fun to see you thinking this through :-) I have had a chance to discuss my crazy ideas with some of the Neo4j team last year, but this is the first time with you. And since December there is even code to review (see https://github.com/craigtaverner/amanzi-index).

You are correct that I connect the data nodes directly to the index nodes. Therefore the keys in the index nodes are representations of the data, at various 'resolutions' as you climb the tree. This also means that each index node represents a range of values, so inequality operations can be performed on the index nodes themselves. This leads to another interesting side effect: statistics can be calculated on the index tree without having to touch the original data. To fully realize this we also need some counters on the index nodes. The plan is that the index itself becomes the basis for very fast statistics calculations.

Your suggestion for relationships is also related to another idea I discussed last year: using a range of relationship types, where the type itself contains information about the range of the child nodes, so that traversing the index tree can be optimized by testing only the relationship type in most cases, and accessing node or relationship properties only for edge cases. I think the idea makes sense, but there is a complication: we can reach the capacity of the database if we overuse relationship types for something that would logically be seen as a relationship property.

Cheers, Craig

On Wed, Feb 2, 2011 at 1:46 AM, Michael Hunger michael.hun...@neotechnology.com wrote:

Craig, The interesting thing is - if you do this index in-graph you can directly attach the index nodes to the real nodes and have them mimic the original node, but instead of having the property value per property you have the mapped value. And you can have many such indexes in-graph, attached via an index-name relationship to the real node.
Would also be easy to update the index nodes that way, or to remove them altogether. But perhaps it is even better to use relationships pointing to the nodes that contain the mapped property values, and have the index nodes just be an ordering structure for those relationships.

Cheers, Michael
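Craig's index structure (index nodes keyed by value ranges at coarser resolutions up the tree, with counters so statistics can be computed without touching the data nodes) could be sketched roughly as follows. This is a hypothetical Python illustration, not the amanzi-index code; the factor-of-ten fan-out and all names are assumptions:

```python
class IndexNode:
    """One node of a range-keyed index tree with a statistics counter."""

    def __init__(self, low, resolution):
        self.low = low
        self.resolution = resolution          # bucket width at this level
        self.high = low + resolution * 10     # each node spans 10 child buckets
        self.count = 0                        # counter for fast statistics
        self.children = {}                    # bucket key -> child IndexNode
        self.values = []                      # raw values, leaf level only

    def insert(self, value):
        self.count += 1
        if self.resolution == 1:              # leaf: store the value itself
            self.values.append(value)
            return
        key = int((value - self.low) // self.resolution)
        child = self.children.get(key)
        if child is None:
            child = IndexNode(self.low + key * self.resolution,
                              self.resolution // 10)
            self.children[key] = child
        child.insert(value)

    def count_in_range(self, lo, hi):
        """Count values in the half-open range [lo, hi) using counters where possible."""
        if hi <= self.low or lo >= self.high:
            return 0                          # inequality test on the node's own range
        if lo <= self.low and hi >= self.high:
            return self.count                 # fully covered: answer from the counter
        if self.resolution == 1:
            return sum(1 for v in self.values if lo <= v < hi)
        return sum(c.count_in_range(lo, hi) for c in self.children.values())


root = IndexNode(0, 10)                       # covers [0, 100)
for v in (5, 12, 15, 47, 93):
    root.insert(v)
print(root.count_in_range(10, 50))            # counts 12, 15 and 47 -> 3
```

Because every index node knows its own range and carries a count, a range query descends toward the data only for partially overlapping nodes, answering fully covered subtrees from the counters alone.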
Re: [Neo4j] Implementing disambiguation algorithms in Neo4j
Tim,

I don't know enough about disambiguation algorithms to point you in any special direction, but implementation-wise you could hook into the event framework, http://wiki.neo4j.org/content/Event_framework, in order to run e.g. graph-matching queries whenever a modifying transaction changes database content.

HTH

Cheers,
/peter neubauer

GTalk: neubauer.peter
Skype: peter.neubauer
Phone: +46 704 106975
LinkedIn: http://www.linkedin.com/in/neubauer
Twitter: http://twitter.com/peterneubauer

http://www.neo4j.org - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.

On Tue, Feb 1, 2011 at 9:34 PM, Tim McNamara paperl...@timmcnamara.co.nz wrote:

Say I have two nodes:

{ type: person, name: Neo }
{ type: person, name: Neo }

Over time, I learn their locations. They both live in the same city. This increases the chances that they're the same person. However, over time it turns out that their ages differ, so it's far less likely that they are the same Neo.

Is there anything inside of Neo4j that attempts to determine how close two nodes are? E.g. to what extent their subtrees and properties match? Additionally, can anyone suggest literature on algorithms for disambiguating the two entities? If I wanted to implement something that searches for similarities and returns a probability of a match, can I do this within the database or should I implement it within the application?

-- Tim McNamara
@timClicks
http://timmcnamara.co.nz

___ Neo4j mailing list User@lists.neo4j.org https://lists.neo4j.org/mailman/listinfo/user
Re: [Neo4j] Implementing disambiguation algorithms in Neo4j
I was working on a project that used matching algorithms a while back. What you have is an n-dimensional matching problem. I can't remember specifically what the last project was using, but this and the linked algorithms may be what you're looking for: http://en.wikipedia.org/wiki/Mahalanobis_distance
Re: [Neo4j] Implementing disambiguation algorithms in Neo4j
Here is a crazy idea - how about taking the properties you care about and dropping them into a combined Lucene index? Then all results for nodes with the same properties would be 'ambiguous'. Moving this forward to degrees of ambiguity might be possible by creating the combined 'value' using a reduced resolution of the properties (to increase similarity, so the index will see them as identical).

Another option is the 'still very much in progress' composite index I started in December. Since all properties are mapped into normal integer space, the Euclidean distance of the first-level index nodes from each other is a discrete measure of similarity. A distance of zero means that the nodes attach to the same index node, and are very similar or identical. Higher values mean greater dissimilarity. This index theoretically supports any number of properties of any type (including strings) and allows you to plug in your own value-index mappers, which means you can control what you mean by 'similar'.
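A hypothetical sketch of the distance measure Craig describes (this is not the amanzi-index code; the property set, resolutions, and floor-division mapping are illustrative assumptions):

```python
import math

def map_props(props, resolutions):
    """Map each property to an integer key at the configured resolution."""
    return tuple(int(props[k] // r) for k, r in resolutions.items())

def index_distance(a, b, resolutions):
    """Euclidean distance between the first-level index keys of two nodes."""
    va, vb = map_props(a, resolutions), map_props(b, resolutions)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))


resolutions = {"age": 10, "height_cm": 5}
neo1 = {"age": 30, "height_cm": 180}
neo2 = {"age": 33, "height_cm": 181}   # lands in the same index node as neo1
neo3 = {"age": 45, "height_cm": 180}

print(index_distance(neo1, neo2, resolutions))  # 0.0 -> 'very similar'
print(index_distance(neo1, neo3, resolutions))  # 1.0 -> less similar
```

Choosing coarser resolutions widens the buckets, which is exactly the "reduced resolution to increase similarity" knob mentioned above: what counts as 'similar' is controlled entirely by the mappers.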
Re: [Neo4j] Implementing disambiguation algorithms in Neo4j
Craig, how do you map _all_ properties to the integer (or rather numeric/long?) space? Does the mapping then also rely on the alphabetic (or comparison) order of the properties? Interesting approach. Is your space then as n-dimensional as the number of properties you have?

Cheers, Michael
Re: [Neo4j] Implementing disambiguation algorithms in Neo4j
Thanks so much for that explanation. Put it in a blog post.

Michael

Am 02.02.2011 um 01:33 schrieb Craig Taverner:

Hi Michael,

I agree that _all_ is a strong word :-) What I do is provide a mapper interface for the domain modeler to implement based on their own understanding of what 'order' and 'resolution' to provide to the index. I have a set of pre-defined mappers for general cases, but do not suppose they will satisfy all users' needs. Hopefully, should this index get developed further, we would have a much wider range of mappers that do cover all or most cases.

So, let me give some examples. The simple ones are the integer and float mappers, where I have a few static factory methods for configuring them with various resolutions and origins, and defaults for those also. The default integer mapper, if I remember correctly, maps the integer to itself, and the default float mapper maps the float to its integer value (so 2.1 and 2.9 map to integer 2). There are configurations that auto-configure the mapper based on a sample dataset, setting the origin to the average value and choosing a resolution that divides the space into a reasonable number of steps.

The only string mapper I have configured simply maps each character in the string to an integer in the 94-character range from SPACE to SPACE+94 (126). It has a depth parameter, which controls the resolution of the first level of the index. This means that AB and AC are in adjacent index nodes. There is also a utility method for auto-configuring the range and depth based on a sample dataset. This mapper should work well when indexing, for example, a random selection out of a dictionary of words. However, if the sample data conforms to a certain pattern, and there are internal correlations, then a different mapper should be used (or written). The auto-config method does look for common prefixes in the sample data to ensure that the depth is set correctly, so as not to have all words in the same index node.

The number of properties in the index does need to be defined at index creation. So you need to configure the index with all property names and types (and mappers) up-front, and then you simply add nodes to the index as you go. If a node does not have one of the configured properties, it gets a special index value for that. The resulting tree is like an n-dimensional pyramid with lots of gaps (depending on the evenness of the property value distribution). Properties that do not need a tree of much depth (e.g. a discrete set of categories, or tags) will cause the tree to collapse to n-1 dimensions at higher levels. So the total tree height is the height of the most diverse property indexed.

My expectation is that this index will not perform as well as dedicated single-property, auto-balancing indices when queried with a single property condition, but when queried with complex conditions involving multiple properties it should perform much better than indexes that use separate trees for each property and then perform set-intersection on the result sets. The traverser will apply the conditions on all properties in the tree, narrowing the result set as early as possible in the search.

And now I've moved from what is coded into what is still envisioned, so I'd better stop writing ;-)
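The integer, float, and string mappers described in Craig's explanation could look roughly like this. These are illustrative stand-ins, not the amanzi-index interfaces; the key-packing scheme in the string mapper is an assumption:

```python
def integer_mapper(value, origin=0, resolution=1):
    """Default-style integer mapper: offset from an origin, bucketed by resolution."""
    return (value - origin) // resolution

def float_mapper(value):
    """Float mapper: truncate to the integer value, so 2.1 and 2.9 both map to 2."""
    return int(value)

def string_mapper(value, depth=2):
    """Map the first `depth` characters into the 94-character range from SPACE.

    Each character becomes an offset 0..93 from SPACE (codes 32..126);
    shorter strings are padded with 0, and the offsets are packed base-94
    so that nearby strings get adjacent keys.
    """
    key = 0
    for i in range(depth):
        c = ord(value[i]) - ord(' ') if i < len(value) else 0
        key = key * 94 + min(max(c, 0), 93)
    return key


print(float_mapper(2.9))                          # 2
print(string_mapper("AB"), string_mapper("AC"))   # adjacent keys, 1 apart
```

With depth 2, 'AB' and 'AC' differ by exactly one in mapped space, which is why they land in adjacent first-level index nodes; raising the depth sharpens the first-level resolution for datasets with long common prefixes.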
Re: [Neo4j] Implementing disambiguation algorithms in Neo4j
Craig,

The interesting thing is - if you do this index in-graph you can directly attach the index nodes to the real nodes and have them mimic the original node, but instead of having the property value per property you have the mapped value. And you can have many such indexes in-graph, attached via an index-name relationship to the real node. Would also be easy to update the index nodes that way, or to remove them altogether. But perhaps it is even better to use relationships pointing to the nodes that contain the mapped property values, and have the index nodes just be an ordering structure for those relationships.

Cheers, Michael