Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

Michael Hunger Tue, 01 Feb 2011 16:40:58 -0800

Thanks so much for that explanation.

Put it in a blog post.


Michael

Am 02.02.2011 um 01:33 schrieb Craig Taverner:

> Hi Michael,
> 
> I agree that _all_ is a strong work :-)
> 
> What I do is provide a mapper interface for the domain modeler to implement
> based on their own understanding of what 'order' and 'resolution' to provide
> to the index. I have a set of pre-defined mappers for general cases, but do
> not suppose they will satisfy all users needs. Hopefully, should this index
> get developed further, we would have a much wider range of mappers that do
> cover all or most cases.
> 
> So, let me give some examples. The simple ones are the integer and float
> mappers, where I have a few static factory methods for configuring them with
> various resolutions and origins, and defaults for those also. The default
> integer mapper, if I remember correctly, maps the integer to itself, and the
> default float mapper maps the float to its integer value (so 2.1 and 2.9 map
> to integer 2). There are configurations that auto-configure the mapper based
> on a sample dataset, setting the origin to the average value and choosing
> resolution to divide the space into a reasonable number of steps.
> 
> The only string mapper I have configured simply maps each character in the
> string to an integer in the 94-character range from SPACE to SPACE+94 (126).
> It has a depth parameter, which controls the resolution of the first level
> of the index. This means that AB and AC are in adjacent index nodes. There
> is also a utility method for auto-configuring the range and depth based on a
> sample dataset. This mapper should work well when indexing, for example, a
> random selection out of a dictionary of words. However, if the sample data
> conforms to a certain pattern, and there are internal correlations, then a
> different mapper should be used (or written). The auto-config method does
> look for common-prefixes in the sample data to ensure that the depth is set
> correctly to not have all words in the same index node.
> 
> The number of properties in the index does need to be defined at index
> creation. So you need to configure the index with all property names and
> types (and mappers) up-front, and then you simply add nodes to the index as
> you go. If the node does not have one of the configured properties, it gets
> a special index value for that. The resulting tree that is built is like an
> n-dimensional pyramid with lots of gaps (depending on the evenness of the
> property value distribution). Properties that do not need a tree of much
> depth (eg. a discrete set of categories, or tags) will cause the tree to
> collapse to n-1 dimensions at higher levels. So the total tree height is the
> hight of the most diverse property indexed.
> 
> My expectation is that this index will not perform as well as dedicated
> single-property, auto-balancing indices, when queried with a single property
> condition, but when queried with complex conditions involving multiple
> conditions it should perform much better than indexes that use separate
> trees for each property and then perform set-intersection on the
> result-sets. The traverser will apply the conditions on all properties in
> the tree, narrowing the result set as early as possible in the search.
> 
> And now I've moved from what is coded into what is still envisioned, so I'd
> better stop writing .... ;-)
> 
> On Wed, Feb 2, 2011 at 1:11 AM, Michael Hunger <
> michael.hun...@neotechnology.com> wrote:
> 
>> Craig,
>> how do you map _all_ properties to the integer (or rather numeric/long?)
>> space.
>> 
>> Does the mapping then also rely on the alphabetic (or comparision) order of
>> the properties?
>> 
>> Interesting approach. Is your space then as n-dimensional as the numbers of
>> properties you have?
>> 
>> Cheers
>> 
>> Michael
>> 
>> Am 02.02.2011 um 00:59 schrieb Craig Taverner:
>> 
>>> Here is a crazy idea - how about taking the properties you care about and
>>> dropping them into a combined lucene index? Then all results for nodes
>> with
>>> the same properties would be 'ambiguous'. Moving this forward to degrees
>> of
>>> ambiguity might be possible by creating the combined 'value' using a
>> reduced
>>> resolution of the properties (to increase similarity so the index will
>> see
>>> them as identical).
>>> 
>>> Another option is the 'still very much in progress' composite index I
>>> started in December. Since all properties are mapped into normal integer
>>> space, the euclidean distance of the first level index nodes from each
>> other
>>> is a discrete measure of similarity. A distance of zero means that the
>> nodes
>>> attach to the same index node, and are very similar or identical. Higher
>>> values mean greater dissimilarity. This index theoretically supports any
>>> number of properties of any type (including strings) and allows you to
>> plug
>>> in your own value->index mappers, which means you can control what you
>> mean
>>> by 'similar'.
>>> 
>>> On Tue, Feb 1, 2011 at 9:52 PM, Ben Sand <b...@bensand.com> wrote:
>>> 
>>>> I was working on a project that used matching algorithms a while back.
>>>> 
>>>> What you have is an n-dimensional matching problem. I can't remember
>>>> specifically what the last project were using, but this and the linked
>>>> algos
>>>> may be what you're looking for:
>>>> http://en.wikipedia.org/wiki/Mahalanobis_distance
>>>> 
>>>> On 2 February 2011 07:34, Tim McNamara <paperl...@timmcnamara.co.nz>
>>>> wrote:
>>>> 
>>>>> Say I have two nodes,
>>>>> 
>>>>> 
>>>>> { "type": "person", "name": "Neo" }
>>>>> { "type": "person", "name": "Neo" }
>>>>> 
>>>>> 
>>>>> 
>>>>> Over time, I learn their locations. They both live in the same city.
>> This
>>>>> increases the chances that they're the same person. However, over time
>> it
>>>>> turns out that their ages differ, therefore it's far less likely that
>>>> they
>>>>> are the same Neo.
>>>>> 
>>>>> 
>>>>> Is there anything inside of Neo4j that attempts to determine how close
>>>> two
>>>>> nodes are? E.g. to what extent their subtrees and properties match?
>>>>> Additionally, can anyone suggest literature for algorithms for
>>>>> disambiguating the two entities?
>>>>> 
>>>>> 
>>>>> If I wanted to implement something that searches for similarities, that
>>>>> returns a probability of a match, can I do this within the database or
>>>>> should I implement it within the application?
>>>>> 
>>>>> 
>>>>> --
>>>>> Tim McNamara
>>>>> @timClicks
>>>>> http://timmcnamara.co.nz
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> Neo4j mailing list
>>>>> User@lists.neo4j.org
>>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>>> 
>>>> _______________________________________________
>>>> Neo4j mailing list
>>>> User@lists.neo4j.org
>>>> https://lists.neo4j.org/mailman/listinfo/user
>>>> 
>>> _______________________________________________
>>> Neo4j mailing list
>>> User@lists.neo4j.org
>>> https://lists.neo4j.org/mailman/listinfo/user
>> 
>> _______________________________________________
>> Neo4j mailing list
>> User@lists.neo4j.org
>> https://lists.neo4j.org/mailman/listinfo/user
>> 
> _______________________________________________
> Neo4j mailing list
> User@lists.neo4j.org
> https://lists.neo4j.org/mailman/listinfo/user

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user

Re: [Neo4j] Implementing disambiguation algorithms in Neo4j

Reply via email to