Craig,

The interesting thing is: if you build this index in-graph, you can attach
the index nodes directly to the real nodes and have them mimic the original
node, except that for each property they hold the mapped value instead of
the raw property value.

And you can have many such indexes in the graph, each attached to the real
node via an "index-name" relationship. That would also make it easy to
update the index nodes or to remove them altogether.

But perhaps it is even better to have the relationships pointing to the
nodes carry the mapped property values, and let the index nodes just be an
ordering structure for those relationships.
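
A minimal sketch of the first variant (an index node attached via an
"index-name" relationship, carrying the mapped values), using the embedded
Java API; the relationship type, property names and mapped value are just
placeholders, and graphDb / realNode are assumed to be in scope:

import org.neo4j.graphdb.*;

// attach an index node that mimics the real node, but stores the mapped
// value per property instead of the raw property value
RelationshipType AGE_INDEX = DynamicRelationshipType.withName( "age_index" );

Transaction tx = graphDb.beginTx();
try
{
    Node indexNode = graphDb.createNode();
    indexNode.setProperty( "age", 3 );   // e.g. age 37 mapped to bucket 3
    indexNode.createRelationshipTo( realNode, AGE_INDEX );
    tx.success();
}
finally
{
    tx.finish();
}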

Cheers

Michael

Am 02.02.2011 um 01:33 schrieb Craig Taverner:

> Hi Michael,
> 
> I agree that _all_ is a strong word :-)
> 
> What I do is provide a mapper interface for the domain modeler to implement
> based on their own understanding of what 'order' and 'resolution' to provide
> to the index. I have a set of pre-defined mappers for general cases, but do
> not suppose they will satisfy all users' needs. Hopefully, should this index
> get developed further, we would have a much wider range of mappers that do
> cover all or most cases.
> 
> So, let me give some examples. The simple ones are the integer and float
> mappers, where I have a few static factory methods for configuring them with
> various resolutions and origins, and defaults for those also. The default
> integer mapper, if I remember correctly, maps the integer to itself, and the
> default float mapper maps the float to its integer value (so 2.1 and 2.9 map
> to integer 2). There are configurations that auto-configure the mapper based
> on a sample dataset, setting the origin to the average value and choosing
> resolution to divide the space into a reasonable number of steps.
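> 
> To make that concrete, the mapper idea is roughly along these lines (a
> sketch only; the names here are illustrative, not the actual code):
> 
> public interface IndexMapper<T>
> {
>     // map a property value into the discrete integer index space
>     int toIndexValue( T value );
> }
> 
> // float mapper with configurable origin and resolution: with origin 0.0
> // and resolution 1.0, both 2.1 and 2.9 map to bucket 2
> public class FloatMapper implements IndexMapper<Float>
> {
>     private final float origin;
>     private final float resolution;
> 
>     public FloatMapper( float origin, float resolution )
>     {
>         this.origin = origin;
>         this.resolution = resolution;
>     }
> 
>     public int toIndexValue( Float value )
>     {
>         return (int) Math.floor( ( value - origin ) / resolution );
>     }
> }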
> 
> The only string mapper I have configured simply maps each character in the
> string to an integer in the 94-character range from SPACE to SPACE+94 (126).
> It has a depth parameter, which controls the resolution of the first level
> of the index. This means that AB and AC are in adjacent index nodes. There
> is also a utility method for auto-configuring the range and depth based on a
> sample dataset. This mapper should work well when indexing, for example, a
> random selection out of a dictionary of words. However, if the sample data
> conforms to a certain pattern, and there are internal correlations, then a
> different mapper should be used (or written). The auto-config method does
> look for common-prefixes in the sample data to ensure that the depth is set
> correctly to not have all words in the same index node.
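> 
> In code, the character mapping amounts to something like this (a sketch;
> the real code has more configuration around it):
> 
> // map the first 'depth' characters into one integer, base 95, relative
> // to SPACE, so with depth = 2 the strings "AB" and "AC" land in adjacent
> // first-level index nodes
> public int mapString( String s, int depth )
> {
>     int value = 0;
>     for ( int i = 0; i < depth; i++ )
>     {
>         int c = i < s.length() ? s.charAt( i ) - ' ' : 0;  // SPACE maps to 0
>         c = Math.max( 0, Math.min( 94, c ) );               // clamp to the range
>         value = value * 95 + c;
>     }
>     return value;
> }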
> 
> The number of properties in the index does need to be defined at index
> creation. So you need to configure the index with all property names and
> types (and mappers) up-front, and then you simply add nodes to the index as
> you go. If the node does not have one of the configured properties, it gets
> a special index value for that. The resulting tree that is built is like an
> n-dimensional pyramid with lots of gaps (depending on the evenness of the
> property value distribution). Properties that do not need a tree of much
> depth (eg. a discrete set of categories, or tags) will cause the tree to
> collapse to n-1 dimensions at higher levels. So the total tree height is the
> height of the most diverse property indexed.
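> 
> As a hypothetical usage sketch (the class and method names below are made
> up; only the shape of the configuration matters):
> 
> // all property names, types and mappers are declared at creation time;
> // nodes are then simply added, and missing properties get a special value
> CompositeIndex index = new CompositeIndex( graphDb, "people" )
>         .withProperty( "age", new IntegerMapper( 0, 5 ) )   // origin 0, resolution 5
>         .withProperty( "name", new StringMapper( 2 ) )      // depth 2
>         .withProperty( "category", new IntegerMapper( 0, 1 ) );
> 
> for ( Node person : people )
> {
>     index.add( person );
> }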
> 
> My expectation is that this index will not perform as well as dedicated
> single-property, auto-balancing indices, when queried with a single property
> condition, but when queried with complex conditions involving multiple
> conditions it should perform much better than indexes that use separate
> trees for each property and then perform set-intersection on the
> result-sets. The traverser will apply the conditions on all properties in
> the tree, narrowing the result set as early as possible in the search.
> 
> And now I've moved from what is coded into what is still envisioned, so I'd
> better stop writing .... ;-)
> 
> On Wed, Feb 2, 2011 at 1:11 AM, Michael Hunger <
> michael.hun...@neotechnology.com> wrote:
> 
>> Craig,
>> how do you map _all_ properties to the integer (or rather numeric/long?)
>> space.
>> 
>> Does the mapping then also rely on the alphabetic (or comparison) order of
>> the properties?
>> 
>> Interesting approach. Is your space then as n-dimensional as the number of
>> properties you have?
>> 
>> Cheers
>> 
>> Michael
>> 
>> Am 02.02.2011 um 00:59 schrieb Craig Taverner:
>> 
>>> Here is a crazy idea - how about taking the properties you care about and
>>> dropping them into a combined lucene index? Then all results for nodes
>> with
>>> the same properties would be 'ambiguous'. Moving this forward to degrees
>> of
>>> ambiguity might be possible by creating the combined 'value' using a
>> reduced
>>> resolution of the properties (to increase similarity so the index will
>> see
>>> them as identical).
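>>> 
>>> As a sketch of that first idea, with the integrated index API
>>> (org.neo4j.graphdb.index); the key name and the rounding are only examples:
>>> 
>>> Index<Node> index = graphDb.index().forNodes( "similarity" );
>>> 
>>> // combine reduced-resolution property values into a single key, so
>>> // 'similar enough' nodes collide on the same indexed value
>>> String combined = node.getProperty( "type" )
>>>         + "|" + node.getProperty( "name" )
>>>         + "|" + ( (Integer) node.getProperty( "age" ) / 10 );  // decade resolution
>>> index.add( node, "combined", combined );
>>> 
>>> // later: every hit with the same combined value is an 'ambiguous' candidate
>>> IndexHits<Node> candidates = index.get( "combined", combined );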
>>> 
>>> Another option is the 'still very much in progress' composite index I
>>> started in December. Since all properties are mapped into normal integer
>>> space, the euclidean distance of the first level index nodes from each
>> other
>>> is a discrete measure of similarity. A distance of zero means that the
>> nodes
>>> attach to the same index node, and are very similar or identical. Higher
>>> values mean greater dissimilarity. This index theoretically supports any
>>> number of properties of any type (including strings) and allows you to
>> plug
>>> in your own value->index mappers, which means you can control what you
>> mean
>>> by 'similar'.
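>>> 
>>> The distance itself is then just plain euclidean distance on the mapped
>>> integer coordinates, something like (sketch):
>>> 
>>> // 0 means 'same first-level index node' (very similar or identical),
>>> // larger values mean greater dissimilarity
>>> public double indexDistance( int[] a, int[] b )
>>> {
>>>     long sum = 0;
>>>     for ( int i = 0; i < a.length; i++ )
>>>     {
>>>         long d = a[i] - b[i];
>>>         sum += d * d;
>>>     }
>>>     return Math.sqrt( sum );
>>> }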
>>> 
>>> On Tue, Feb 1, 2011 at 9:52 PM, Ben Sand <b...@bensand.com> wrote:
>>> 
>>>> I was working on a project that used matching algorithms a while back.
>>>> 
>>>> What you have is an n-dimensional matching problem. I can't remember
>>>> specifically what the last project was using, but this and the linked
>>>> algos
>>>> may be what you're looking for:
>>>> http://en.wikipedia.org/wiki/Mahalanobis_distance
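>>>> 
>>>> (For reference, the Mahalanobis distance between two points x and y is
>>>> d(x, y) = sqrt( (x - y)^T S^-1 (x - y) ), where S is the covariance
>>>> matrix of the data; it reduces to plain euclidean distance when S is
>>>> the identity.)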
>>>> 
>>>> On 2 February 2011 07:34, Tim McNamara <paperl...@timmcnamara.co.nz>
>>>> wrote:
>>>> 
>>>>> Say I have two nodes,
>>>>> 
>>>>> 
>>>>> { "type": "person", "name": "Neo" }
>>>>> { "type": "person", "name": "Neo" }
>>>>> 
>>>>> 
>>>>> 
>>>>> Over time, I learn their locations. They both live in the same city.
>> This
>>>>> increases the chances that they're the same person. However, over time
>> it
>>>>> turns out that their ages differ, therefore it's far less likely that
>>>> they
>>>>> are the same Neo.
>>>>> 
>>>>> 
>>>>> Is there anything inside of Neo4j that attempts to determine how close
>>>> two
>>>>> nodes are? E.g. to what extent their subtrees and properties match?
>>>>> Additionally, can anyone suggest literature for algorithms for
>>>>> disambiguating the two entities?
>>>>> 
>>>>> 
>>>>> If I wanted to implement something that searches for similarities, that
>>>>> returns a probability of a match, can I do this within the database or
>>>>> should I implement it within the application?
>>>>> 
>>>>> 
>>>>> --
>>>>> Tim McNamara
>>>>> @timClicks
>>>>> http://timmcnamara.co.nz
>>>>> 
>>>>> 
>>>>> 

_______________________________________________
Neo4j mailing list
User@lists.neo4j.org
https://lists.neo4j.org/mailman/listinfo/user
