Re: [Neo4j] Querying for nodes that have no relationship to a specific node

2010-07-28 Thread Alberto Perdomo
Hi everyone,

I would have an SQL db for the app alongside the graph db.

I would store users as nodes in the graph in addition to storing them
in SQL. On those nodes I store attributes like male/female, age or
date of birth, etc.
I would have one relationship type for friendship, which doesn't
present any kind of problem, and I would do the standard type of
queries neo4jr-social provides (e.g. friend suggestions, degrees of
separation, friends in common, ...)

We want to measure the compatibility/taste match/whatever between
users in the background, meaning for instance how much two users have
in common. This is done in Ruby. The result will be an integer between
0 and 100. BTW, this value is symmetric, so it could be modelled as a
bidirectional relationship.

Let's say I have 10k users and for every user I calculate the match
between him and 10 other users.
If I store all the results I calculate, that's potentially up to 100k
new relationships every day / 3M relationships every month. If I store
this in SQL it can turn into a bottleneck very fast: the table will
soon grow too big and the queries will get slower and slower.

That's when I started thinking of storing those relationships in
Neo4j, because it's meant to handle a very large number of nodes and
relationships really efficiently. I can model the match as a
relationship and either store the value inside the relationship or
encode it in the relationship type, e.g. 'match_high', 'match_medium',
'match_low'.
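
For illustration, a rough sketch of the two options against the
embedded Java API (the type names and the 'value' property are made
up, and both writes must run inside a transaction):

import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

enum MatchTypes implements RelationshipType {
    MATCH,                               // option (a): one type, score as property
    MATCH_HIGH, MATCH_MEDIUM, MATCH_LOW  // option (b): score bucketed into the type
}

class MatchModel {
    // Option (a): keep the 0-100 score on the relationship itself.
    static void storeAsProperty(Node a, Node b, int score) {
        Relationship r = a.createRelationshipTo(b, MatchTypes.MATCH);
        r.setProperty("value", score);
    }

    // Option (b): encode the score bucket in the relationship type.
    static void storeAsType(Node a, Node b, int score) {
        RelationshipType t = score >= 67 ? MatchTypes.MATCH_HIGH
                : score >= 34 ? MatchTypes.MATCH_MEDIUM
                : MatchTypes.MATCH_LOW;
        a.createRelationshipTo(b, t);
    }
}

Since a Neo4j relationship can be read from either end node, a single
relationship covers the symmetric value.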

Now back to step 1: selecting the users I'll be calculating new
relationships with. They must match certain criteria, e.g.
female/male, similar age, etc., and the choice could be pseudo-random.
Thinking in SQL, the first step is to query for all users that match
the criteria and don't have a relationship with user A.

Then yesterday, looking at the Neo4j docs, I got the impression this
kind of query cannot be done directly. I could select all the users
that match the criteria from SQL, then query all the relationships for
A from Neo4j, subtract those from the array of valid users and pick n
users at random. Because n is a low value, perhaps 10, this looks to
me like a very inefficient way of doing it. It will also be fast at
the beginning but get slower as the relationship density grows over
time...
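
Sketched in Java, that subtract-and-pick workaround would look roughly
like this (reusing the hypothetical MatchTypes above; the 'userId'
property and the candidate ids coming from SQL are assumptions):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;

class CandidatePicker {
    // Subtract A's existing match partners from the SQL candidates,
    // then pick n of the remainder at random.
    static List<Long> pick(Node userA, List<Long> sqlCandidates, int n) {
        Set<Long> alreadyMatched = new HashSet<Long>();
        for (Relationship r : userA.getRelationships(MatchTypes.MATCH)) {
            alreadyMatched.add((Long) r.getOtherNode(userA).getProperty("userId"));
        }
        List<Long> remaining = new ArrayList<Long>(sqlCandidates);
        remaining.removeAll(alreadyMatched);
        Collections.shuffle(remaining);
        return remaining.subList(0, Math.min(n, remaining.size()));
    }
}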

Maybe I should consider a different strategy. I've also been
considering storing only high or interesting values, but it would be
more useful to have the top n users for A ordered by relationship
value. If I go ahead with this then I could just store it in SQL.

This is not what we strive for, but if I don't find a better way I
guess we'll have to live with it. The solution should also scale
easily; it should still work with, for instance, 100k users.

Any thoughts or comments?
What would you recommend?

Thanks for the help, guys!
Alberto.


[Neo4j] Querying for nodes that have no relationship to a specific node

2010-07-28 Thread David Montag
Hi Alberto,

Okay, interesting. You want to calculate some metric between pairs of
users, so it's not a friend-of-a-friend scenario or anything like
that, which would have been great in a graph db. This is just all/some
pairs of random users. That you can do with your SQL db or Neo4j or
whatever db you want.

But then you need to store the result. You can store these metrics as
relationships in neo4j, and then just update them for each user when
you recompute. You can find the user nodes via indexing. Maybe it's
acceptable for some metrics to be out of date, so you can just
recompute them continuously in a background process.
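
A rough sketch of that flow, assuming the Neo4j 1.x IndexService API,
a hypothetical 'userId' index key and the MatchTypes from the earlier
sketch:

import org.neo4j.graphdb.*;
import org.neo4j.index.IndexService;

class MatchRecompute {
    // Look both users up in the index, then create or refresh the MATCH
    // relationship between them with the newly computed score.
    static void update(GraphDatabaseService db, IndexService index,
                       long idA, long idB, int score) {
        Transaction tx = db.beginTx();
        try {
            Node a = index.getSingleNode("userId", idA);
            Node b = index.getSingleNode("userId", idB);
            Relationship match = null;
            for (Relationship r : a.getRelationships(MatchTypes.MATCH)) {
                if (r.getOtherNode(a).equals(b)) { match = r; break; }
            }
            if (match == null) {
                match = a.createRelationshipTo(b, MatchTypes.MATCH);
            }
            match.setProperty("value", score);
            tx.success();
        } finally {
            tx.finish();
        }
    }
}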

Depending on your scenario, if your users know each other, it might be
interesting to start computing in a foaf style order (breadth first).
Remember, the power is in the relationships. Isolated nodes are not
interesting.

David

--
Sent from cell, excuse typos.

On Wednesday, July 28, 2010, Alberto Perdomo alberto.perd...@gmail.com wrote:
 [quoted message snipped]



[Neo4j] Neo4j and JPA refactoring

2010-07-28 Thread Peter Neubauer
Hi there,
I noticed the big update to the JPA implementation for Neo4j at
https://svn.neo4j.org/laboratory/components/neo-persistence/trunk/ .
Avishay, could you shed some light on what has changed and improved,
and on the current state of the project?



Cheers,

/peter neubauer

COO and Sales, Neo Technology

GTalk:      neubauer.peter
Skype       peter.neubauer
Phone       +46 704 106975
LinkedIn   http://www.linkedin.com/in/neubauer
Twitter      http://twitter.com/peterneubauer

http://www.neo4j.org               - Your high performance graph database.
http://www.thoughtmade.com - Scandinavia's coolest Bring-a-Thing party.


[Neo4j] Read-only transactions?

2010-07-28 Thread Tim Jones
Hi,

Is it possible to mark a transaction as being read-only? It's taking a
while for my transaction to shut down, even though there are no writes
to commit.

Thanks,
Tim


[Neo4j] Unable to build a master-slave system

2010-07-28 Thread George Ciubotaru
Hello,

I'm trying to build a high availability system with neo4j as explained here: 
http://wiki.neo4j.org/content/Online_Backup_HA. In theory everything looks 
pretty simple and straightforward... but once I try to run the slave process 
I'm getting the following exception:

Throwing away org.neo4j.onlinebackup.net.connecttomaster...@1f1fba0
java.nio.channels.NotYetConnectedException
  at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(Unknown Source)
  at sun.nio.ch.SocketChannelImpl.write(Unknown Source)
  at org.neo4j.onlinebackup.net.Connection.write(Connection.java:238)
  at 
org.neo4j.onlinebackup.net.ConnectToMasterJob.sendGreeting(ConnectToMasterJob.java:55)
  at 
org.neo4j.onlinebackup.net.ConnectToMasterJob.performJob(ConnectToMasterJob.java:141)
  at org.neo4j.onlinebackup.net.JobEater.run(JobEater.java:32)

... followed by this exception on the master side:

Connection closed Connection[slave_ip_address:11587]
org.neo4j.onlinebackup.net.SocketException: Connection[slave_ip_address:11587] 
error reading
Throwing away org.neo4j.onlinebackup.net.handleincommingslave...@fd13b5
  at org.neo4j.onlinebackup.net.Connection.read(Connection.java:210)
  at 
org.neo4j.onlinebackup.net.HandleIncommingSlaveJob.getGreeting(HandleIncommingSlaveJob.java:41)
  at 
org.neo4j.onlinebackup.net.HandleIncommingSlaveJob.performJob(HandleIncommingSlaveJob.java:160)
  at org.neo4j.onlinebackup.net.JobEater.run(JobEater.java:93)
Caused by: java.nio.channels.ClosedChannelException
  at sun.nio.ch.SocketChannelImpl.ensureReadOpen(Unknown Source)
  at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
  at org.neo4j.onlinebackup.net.Connection.read(Connection.java:205)
  ... 3 more
null chain job

Any idea what might be wrong here? (I'm running everything on 64-bit
Windows (7 or Server 2008 R2), with neo4j-kernel 1.0 and online-backup
0.5.)

Thank you,
George


Re: [Neo4j] Querying for nodes that have no relationship to a specific node

2010-07-28 Thread Alberto Perdomo
Hi David,


 But then you need to store the result. You can store these metrics as
 relationships in neo4j, and then just update them for each user when
 you recompute. You can find the user nodes via indexing. Maybe it's
 acceptable that some metrics are out of date, so you can just
 background process them continuously.

I already have background processes that go through all users and
calculate new pairs. But in order to do that I need to exclude the
pairs I already have, because recalculating them would be silly, and
as the relationship density grows the probability of hitting a pair
again gets higher and higher...
Would I be able to do that kind of query using indexing?

 Depending on your scenario, if your users know each other, it might be
 interesting to start computing in a foaf style order (breadth first).
 Remember, the power is in the relationships. Isolated nodes are not
 interesting.

You mean I should first look for possible pairs among users that are
friends of friends, instead of picking randomly? We are also
interested in storing the friendship relationship, so that sounds
interesting.
That would be a different type of query: traverse the graph from node
A to nodes which are friends of friends of A and have no match
relationship with A. I guess that is not difficult to implement using
Neo4j?
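
Sketched with the Neo4j 1.x traverser API (FRIEND and MATCH being
hypothetical relationship types), it could look like:

import java.util.ArrayList;
import java.util.List;
import org.neo4j.graphdb.*;

class FoafCandidates {
    enum Types implements RelationshipType { FRIEND, MATCH }

    // Friends-of-friends of A that have no MATCH relationship with A yet.
    static List<Node> find(final Node userA) {
        Traverser foaf = userA.traverse(
            Traverser.Order.BREADTH_FIRST,
            new StopEvaluator() {
                public boolean isStopNode(TraversalPosition pos) {
                    return pos.depth() >= 2;  // don't go past friends of friends
                }
            },
            new ReturnableEvaluator() {
                public boolean isReturnableNode(TraversalPosition pos) {
                    return pos.depth() == 2;  // only friends of friends
                }
            },
            Types.FRIEND, Direction.BOTH);
        List<Node> candidates = new ArrayList<Node>();
        for (Node n : foaf) {
            boolean matched = false;
            for (Relationship r : n.getRelationships(Types.MATCH)) {
                if (r.getOtherNode(n).equals(userA)) { matched = true; break; }
            }
            if (!matched) candidates.add(n);
        }
        return candidates;
    }
}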

Thanks for your input David!


[Neo4j] Stumped by performance issue in traversal - would take a month to run!

2010-07-28 Thread Jeff Klann
Hi, I have an algorithm running on my little server that is very very slow.
It's a recommendation traversal (for all A and B in the catalog of items:
for each item A, how many customers also purchased another item in the
catalog B). It's processed 90 items in about 8 hours so far! Before I dive
deeper into trying to figure out the performance problem, I thought I'd
email the list to see if more experienced people have ideas.

Some characteristics of my datastore: its size is pretty moderate for
a database application. 7500 items; not sure how many customers and
purchases (how can I find the size of an index?) but probably ~1
million customers. The relationshipstore + nodestore are under 500mb.
(The propertystore is huge but I don't access it much in traversals.)

The possibilities I see are:

1) *Neo4J is just slow.* Probably not slower than Postgres, which I
was using previously, but maybe I need to switch to a distributed
map-reduce db in the cloud and give up the very nice graph modeling
approach? I didn't think this would be a problem, because my data size
is pretty moderate and Neo4J is supposed to be fast.

2) *I just need more RAM.* I definitely need more RAM - I have a
measly 1GB currently. But would this get my 20-day traversal down to a
few hours? Doesn't seem like it'd have THAT much impact. I'm running
Linux and not much else besides Neo4j, so I've got 650m of physical
RAM available. Using a 300m heap, about 300m memory-map.

3) *There's some secret about Neo4J performance I don't know.* Is
there something I'm unaware of that Neo4J is doing? When I access a
property, does it load a chunk of properties I don't care about? For
the current node/edge or others? I turned off log rotation and I
commit after each item A. Are there other performance tips I might
have missed?

4) *My algorithm is inefficient.* It's a fairly naive algorithm and
maybe there are some optimizations I can do. It looks like:

 For each item A in the catalog:
   For each customer C that has purchased that item:
     For each item B that customer purchased:
       Update the co-occurrence edge between A and B.
       (If the edge exists, add one to its weight. If it doesn't
       exist, create it with weight one.)
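
In Java against the embedded API that is roughly the following
(PURCHASED and CO_OCCURS are made-up type names; purchases are assumed
to point from customer to item):

import org.neo4j.graphdb.*;

class CoOccurrence {
    enum Types implements RelationshipType { PURCHASED, CO_OCCURS }

    static void run(GraphDatabaseService db, Iterable<Node> catalog) {
        for (Node itemA : catalog) {
            Transaction tx = db.beginTx();  // committed once per item A
            try {
                for (Relationship p : itemA.getRelationships(Types.PURCHASED, Direction.INCOMING)) {
                    Node customer = p.getStartNode();
                    for (Relationship q : customer.getRelationships(Types.PURCHASED, Direction.OUTGOING)) {
                        Node itemB = q.getEndNode();
                        if (itemB.equals(itemA)) continue;
                        // The suspected hot spot: a linear scan over A's
                        // co-occurrence edges for every (customer, B) pair.
                        Relationship edge = null;
                        for (Relationship r : itemA.getRelationships(Types.CO_OCCURS)) {
                            if (r.getOtherNode(itemA).equals(itemB)) { edge = r; break; }
                        }
                        if (edge == null) {
                            itemA.createRelationshipTo(itemB, Types.CO_OCCURS)
                                 .setProperty("weight", 1);
                        } else {
                            edge.setProperty("weight", (Integer) edge.getProperty("weight") + 1);
                        }
                    }
                }
                tx.success();
            } finally {
                tx.finish();
            }
        }
    }
}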

This is O(n^2) worst case, but practically it'll be much better due to the
sparseness of purchases. The large number of customers slows it down,
though. The slowest part, I suspect, is the last line. It's a lot of finding
and re-finding edges between As and Bs and updating the edge properties. I
don't see much way around it, though. I wrote another version that avoids
this but is always O(n^2), and it takes about 15 minutes per A to check
against all B (which would also take a month). The version above seems to be
averaging 3 customers/sec, which doesn't seem that slow until you realize
that some of these items were purchased by thousands of customers.

I'd hate to give up on Neo4J. I really like the graph database concept. But
can it handle data? I hope someone sees something I'm doing wrong.

Thanks,
Jeff Klann


Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

2010-07-28 Thread Rick Bullotta
Jeff, when you're doing your traversal/update process, how often do you
commit the transactions?

-Original Message-
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Jeff Klann
Sent: Wednesday, July 28, 2010 11:20 AM
To: Neo4j user discussions
Subject: [Neo4j] Stumped by performance issue in traversal - would take a
month to run!

[quoted message snipped]


Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

2010-07-28 Thread Rick Bullotta
Oh, and you DEFINITELY need more RAM!

-Original Message-
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Jeff Klann
Sent: Wednesday, July 28, 2010 11:20 AM
To: Neo4j user discussions
Subject: [Neo4j] Stumped by performance issue in traversal - would take a
month to run!

[quoted message snipped]


Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

2010-07-28 Thread Tim Jones
I can't give too much help on this unfortunately, but as far as
possibility 1) goes, my database contains around 8 million nodes, and
I traverse them in about 15 seconds for retrievals. It's 2.8GB on
disk, and the machine has 4GB of RAM. I allocate a 1GB heap to the
JDK.

Inserts take a little longer because of the approach I use - inserting
200K nodes now takes a few minutes. I then have a separate step to
remove duplicates that takes about 10-15 minutes.

It seems to me that you might be better off doing something similar:
creating a new relationship type PURCHASED_BOTH with an attribute
'count: 1' and always adding this relationship between products A and
B.

Then run a post-processing job that retrieves all PURCHASED_BOTH
relationships for each product A, builds an in-memory map so that you
keep only one of these relationships, and updates the 'count'
attribute in memory for that relationship. Delete the duplicates and
commit. This way you get your desired result in 2 passes instead of
doing everything in one go.

It seems a bit of a fiddle and I can't guarantee it'll improve
performance (just to stress - I may be waaay off the mark here, but it
works for me). I think it will, though, because it means your loop
only has to create relationships instead of performing updates. Oh,
and make sure that you aren't performing one operation per transaction
- you could group together several tens of thousands before committing
(I do 50,000 inserts before committing when I'm running this
post-processing operation, and it's fine).
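
A sketch of that batching pattern against the embedded Java API (the
source of node pairs and the type name are placeholders):

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.Transaction;

class BatchedInsert {
    // Group many writes into each transaction instead of committing one
    // operation at a time; 50,000 is the batch size reported above.
    static void createAll(GraphDatabaseService db, Iterable<Node[]> pairs,
                          RelationshipType type) {
        int ops = 0;
        Transaction tx = db.beginTx();
        try {
            for (Node[] pair : pairs) {
                pair[0].createRelationshipTo(pair[1], type)
                       .setProperty("count", 1);
                if (++ops % 50000 == 0) {  // commit, then start a fresh transaction
                    tx.success();
                    tx.finish();
                    tx = db.beginTx();
                }
            }
            tx.success();
        } finally {
            tx.finish();
        }
    }
}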

Tim



- Original Message 
 From: Jeff Klann jkl...@iupui.edu
 To: Neo4j user discussions user@lists.neo4j.org
 Sent: Wed, July 28, 2010 4:20:28 PM
 Subject: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

 [quoted message snipped]

Re: [Neo4j] Querying for nodes that have no relationship to a specific node

2010-07-28 Thread Mattias Persson
One benefit of Neo4j is that you can get rid of these pesky background
jobs and instead calculate such things on the fly quite fast, without
needing to store the calculated info at all. Have you tried that?
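
For example, something like friends in common could be computed per
request rather than stored (a made-up metric, sketched against the
embedded Java API with a hypothetical FRIEND type):

import java.util.HashSet;
import java.util.Set;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

class OnTheFly {
    enum Types implements RelationshipType { FRIEND }

    // Count the friends two users share, at query time.
    static int friendsInCommon(Node a, Node b) {
        Set<Node> friendsOfA = new HashSet<Node>();
        for (Relationship r : a.getRelationships(Types.FRIEND)) {
            friendsOfA.add(r.getOtherNode(a));
        }
        int common = 0;
        for (Relationship r : b.getRelationships(Types.FRIEND)) {
            if (friendsOfA.contains(r.getOtherNode(b))) common++;
        }
        return common;
    }
}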

2010/7/28, Alberto Perdomo alberto.perd...@gmail.com:
 [quoted message snipped]



-- 
Mattias Persson, [matt...@neotechnology.com]
Hacker, Neo Technology
www.neotechnology.com


Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

2010-07-28 Thread Jeff Klann
Thank you both for your responses.

- I will get some more RAM tomorrow and give Neo4J another shot.
Hopefully that's a huge factor.
- Tim, I like your algorithm trick! It would save a lot of
reading/writing but would definitely require more memory due to the
massive increase in # of edges.
- Transactions are not the issue, unless reading AFTER committing a
transaction is somehow slower? I'm only committing after each of 7,000
items, and like I said it took 8 hours to run through 90-some items...
committing is not where the time is being spent.

To gauge the performance problem, I wanted to see how many customers
are purchasing each item, and I'm concerned that even this query is
taking a really long time. It's simple:

 For each item A
   Count the number of relationships to a customer

It took 15 minutes to do 200 items. That's almost 5 seconds an item
just to count the number of customers who purchased it! (Looks like on
average about 5,000 customers each, ranging from 300 to 200,000.)
That's a NINE HOUR query! Considering that Neo4J advertises it can
traverse 1M relationships/sec on commodity hardware, I would expect
this to be much faster. (Even if it were 50k customers per item,
that'd be 7,000 items * 50,000 customers / 1M traversals/sec = 350
seconds; 6 minutes would be much more reasonable.)
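
Spelled out against the embedded API (PURCHASED being an assumed type
name), the count is just a walk over the item's relationship chain, so
with a cold cache and little RAM it is plausibly bounded by disk seeks
rather than the advertised in-memory traversal rate:

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

class PurchaseCount {
    enum Types implements RelationshipType { PURCHASED }

    // Counting touches every relationship record; this API version has
    // no precomputed degree to read off the node.
    static int countPurchasers(Node item) {
        int count = 0;
        for (Relationship r : item.getRelationships(Types.PURCHASED, Direction.INCOMING)) {
            count++;
        }
        return count;
    }
}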

My commodity hardware will have a lot more memory tomorrow, hopefully
that'll solve these problems!

Thanks,
Jeff Klann
p.s. My propertystore is big because I was naive on import and stored
everything as string properties (this will change). How does that affect
performance?

On Wed, Jul 28, 2010 at 11:53 AM, Tim Jones bogol...@ymail.com wrote:

 [quoted message snipped]

Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

2010-07-28 Thread Rick Bullotta
Hi, Jeff.

If you are committing after each item, it definitely will slow down
performance. Start a single transaction, commit when you're all done
with the entire traversal, and report back the results. You will still
see the changes you've made prior to committing the transaction, as
long as you're on the same execution thread.

Rick

-Original Message-
From: user-boun...@lists.neo4j.org [mailto:user-boun...@lists.neo4j.org] On
Behalf Of Jeff Klann
Sent: Wednesday, July 28, 2010 5:43 PM
To: Neo4j user discussions
Subject: Re: [Neo4j] Stumped by performance issue in traversal - would take
a month to run!

[quoted message snipped]

Re: [Neo4j] Stumped by performance issue in traversal - would take a month to run!

2010-07-28 Thread Jeff Klann
I don't think that's the problem. Here's why...

When I was importing my data, it eventually slowed down to a crawl
(though it was pretty fast at first). Someone pointed out that since I
was trying to do it all in one transaction, it was filling the Java
heap. They suggested I commit after every 40,000 node/edge creations
(that's empirically when the slowdown happened). I did, and then the
import zipped along just fine.

I'm only committing after the outer pass through an item, which again
is only after tens of thousands of writes/property updates.

Hmm, that makes me wonder if it's possible I'm not committing often
enough. Well, when I have more memory we'll see how it does.

- Jeff Klann
p.s. The simple counter in my last post isn't using transactions at all.

On Wed, Jul 28, 2010 at 5:48 PM, Rick Bullotta 
rick.bullo...@burningskysoftware.com wrote:

 [quoted message snipped]

Re: [Neo4j] property value encoding

2010-07-28 Thread Davide
On Tue, Jul 27, 2010 at 22:29, Craig Taverner cr...@amanzi.com wrote:
 Mapping property values to a discrete set, and referring to them using
 their 'id' is quite reminiscent of a foreign key
 in a relational database.

Yes, with a relational database I would create foreign keys and maybe bitmap
indexes on columns used for search. But I was thinking about a compression
algorithm like these:
http://www.ibm.com/developerworks/data/library/techarticle/dm-0605ahuja/index.html

 Why not take the next step and make a node for each value, and link
 all data nodes to the value nodes?

I've thought about using nodes and relationships to index these
values, but as you said, this would generate a big number of
relationships.

I have 200 properties indexed, and the dictionary of encoded values
contains 15k entries (many properties have a set of hundreds of
possible values).
Right now I have only 600k nodes, but each node has from one to
several encoded properties.

Considering that maintaining an index is expensive, maybe an
acceptable trade-off is to put the dictionary of encoded values in the
graph, but create relationships from dictionary entries to nodes only
for the OpenStreetMap properties used for search.
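
A sketch of that "node per value" layout (HAS_VALUE and its direction,
data node pointing at dictionary entry, are assumptions):

import java.util.ArrayList;
import java.util.List;
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipType;

class ValueDictionary {
    enum Types implements RelationshipType { HAS_VALUE }

    // With one node per dictionary entry, "find all data nodes carrying
    // this value" becomes a traversal from the value node, no index needed.
    static List<Node> nodesWithValue(Node valueNode) {
        List<Node> result = new ArrayList<Node>();
        for (Relationship r : valueNode.getRelationships(Types.HAS_VALUE, Direction.INCOMING)) {
            result.add(r.getStartNode());
        }
        return result;
    }
}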

 In cases where very many data nodes link to very few index nodes,
 there is another trick I'm fond of, and that is the
 composite index

I need more information on the implementation of this composite index :)

-- 
Davide Savazzi