Re: Scaling issue
I've summed up the result of this thread here: http://picolisp.com/5000/-2-I.html with some explanations.

/Henrik

On Fri, May 14, 2010 at 8:59 AM, Henrik Sarvell hsarv...@gmail.com wrote:
> OK, since I can't rely on sorting by date anyway, let's forget that idea.
>
> Yes, since it seemed I had to involve dates anyway, I simply chose a date far enough back in time that if someone is looking for something older than that, they might as well use Google.
>
> Anyway, the above is scanning 19 remotes containing indexes for 10 000 articles each, and returns in 3-4 seconds, which is OK for me; problem solved as far as I'm concerned. I have to add, though, that all remotes are currently on the same machine. Had they been truly distributed it would be faster, especially if the other machines were in the same data center.

--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
> One thing first though: since articles are indexed when they're parsed, and PL isn't doing any kind of sorting automatically on insert, they should be sorted by date automatically, with the latest articles at the end of the database file, since I suppose they're just appended?

While this is correct in principle, I would not rely on it. If an object should ever be deleted from that database file, the space would be reused by the next new object, and the assumption would break.

> How can I simply start walking from the end of the file until I've found, say, 25 matches? This procedure should be the absolutely fastest way of getting what I want.

Currently I see no easy way. The only function that walks a database file directly is 'seq', but it can only step forwards.

> I know about your 'iter' example earlier, and it seems like a good fit, if it starts walking at the right end?

Yes, 'iter' (and the related 'scan') can walk in both directions. You need only pass the keys inverted (i.e. Beg and End swapped).

If I understand it right, however, you solved the problem in your next mail(s) by using the date index, and starting at 6 months ago?

Cheers,
- Alex
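The approach under discussion, walking an ordered index from the newest end and stopping after 25 hits, can be sketched language-neutrally. This is a hypothetical Python illustration of the idea only, not PicoLisp's actual 'iter'/'scan' API; the list-of-pairs "index" and all names are stand-ins:

```python
# Sketch: walk a date-ordered index from the newest end and stop
# after collecting a fixed number of matches, instead of scanning
# the whole index. "index" stands in for the B-tree that 'iter' or
# 'scan' would walk; names here are invented for the example.

def newest_matches(index, pred, limit=25):
    """index: list of (date, article) pairs sorted ascending by date."""
    hits = []
    for date, article in reversed(index):   # walk from the newest end
        if pred(article):
            hits.append(article)
            if len(hits) == limit:          # stop early: no full scan
                break
    return hits

index = [(d, f"art{d}") for d in range(100)]
print(newest_matches(index, lambda a: int(a[3:]) % 2 == 0, limit=3))
# → ['art98', 'art96', 'art94']
```

The early `break` is the whole point: the cost is proportional to how far back the matches sit, not to the total index size.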
Re: Scaling issue
On Fri, May 14, 2010 at 7:55 AM, Alexander Burger a...@software-lab.de wrote:
> While this is correct in principle, I would not rely on it. If an object should ever be deleted from that database file, the space would be reused by the next new object, and the assumption would break.

OK, since I can't rely on sorting by date anyway, let's forget that idea.

> If I understand it right, however, you solved the problem in your next mail(s) by using the date index, and starting at 6 months ago?

Yes, since it seemed I had to involve dates anyway, I simply chose a date far enough back in time that if someone is looking for something older than that, they might as well use Google.

Anyway, the above is scanning 19 remotes containing indexes for 10 000 articles each, and returns in 3-4 seconds, which is OK for me; problem solved as far as I'm concerned. I have to add, though, that all remotes are currently on the same machine. Had they been truly distributed it would be faster, especially if the other machines were in the same data center.
Re: Scaling issue
Everything is running smoothly now. I intend to make a write-up about this on the wiki, maybe this weekend.

One thing first, though: since articles are indexed when they're parsed, and PL isn't doing any kind of sorting automatically on insert, they should be sorted by date automatically, with the latest articles at the end of the database file, since I suppose they're just appended?

How can I simply start walking from the end of the file until I've found, say, 25 matches? This procedure should be the absolutely fastest way of getting what I want.

I know about your 'iter' example earlier, and it seems like a good fit, if it starts walking at the right end?
Re: Scaling issue
See my prior post for context. I've been testing a few different approaches, and this is the fastest so far:

   (de getArticles (W)
      (let Goal
         (goal
            (quote
               @Word W
               (select (@Wcs)
                  ((word +WordCount @Word))
                  (same @Word @Wcs word) ) ) )
         (do 25
            (NIL (prove Goal))
            (bind @
               (pr (cons (; @Wcs article) (; @Wcs picoStamp)))
               (unless (flush) (bye)) ) )
         (bye) ) )

Where the remote E/R is:

   (class +WordCount +Entity)
   (rel article (+Ref +Number))
   (rel word (+Aux +Ref +Number) (article))
   (rel count (+Number))
   (rel picoStamp (+Ref +Number))
Re: Scaling issue
On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
> My code simply stops executing (as if waiting for the next entry, but it never gets it) when I run out of entries to fetch. Really strange, and a traceAll confirms this; the last output is a call to rd1.

What happens on the remote side after all entries are sent? If the remote doesn't 'close' (or 'bye'), then the receiving end doesn't know it is done.

> This is my rd1:
>
>    (dm rd1 (Sock)
>       (or
>          (in Sock (rd))
>          (nil
>             (close Sock) ) ) )

This looks all right, but isn't obviously the problem, as it hangs in 'rd'.

>    (de getArticles (W)
>       (for Wc (sortBy '+Gh (collect 'word '+WordCount W) 'picoStamp)
>          (pr (cons (; Wc article) (; Wc picoStamp)))
>          (unless (flush) (bye)) ) )

What happens if you do (bye) after the 'for' loop is done? I assume that 'getArticles' is executed in the (eval @) below:

>    (task (port (+ *IdxNum 4040))
>       (let? Sock (accept @)
>          (unless (fork)
>             (in Sock
>                (while (rd)
>                   (sync)
>                   (out Sock
>                      (eval @) ) ) )
>             (bye) )
>          (close Sock) ) )

This looks OK, because (bye) is called after the 'while' loop is done. Perhaps there is something in the way 'getArticles' is invoked here? You could change the second last line to (! bye) and see if it is indeed reached. I would suspect it isn't.

Cheers,
- Alex
Re: Scaling issue
Ah I see, so the issue is on the remote side then. What did your code look like there? Did you use (prove)?
Re: Scaling issue
On Mon, May 10, 2010 at 09:04:48AM +0200, Henrik Sarvell wrote:
> Ah I see, so the issue is on the remote side then. What did your code look like there? Did you use (prove)?

There were several scenarios. In cases where only a few hits are to be expected, I used 'collect':

   (for Obj (collect 'var '+Cls (...))
      (pr Obj)
      (unless (flush) (bye)) )

The 'flush' is there for two purposes: (1) to get the data sent immediately (without it being held in a local buffer), and (2) to have immediate feedback. When the receiving side closes the connection (i.e. the GUI is not interested in more results, or the client has quit), 'flush' returns NIL and the local query can be terminated.

In other cases, where there were potentially many hits (so that I didn't want to use 'collect'), I used the low-level tree iteration function 'iter' (which is also used internally by 'collect'):

   (iter (tree 'var '+Cls)
      '((Obj)
         (pr Obj)
         (unless (flush) (bye)) )
      (cons From)
      (cons Till T) )
   (bye)

So 'iter' is quite efficient, as it avoids the overhead of Pilog, but can still deliver an unlimited number of hits. Note, however, that you have to pass the proper 'from' and 'till' arguments. They must have the right structure for the index tree's key. For a '+Key' index this would be simply 'From' and 'Till'. For a '+Ref' (like in the shown case) it must be '(From . NIL)' and '(Till . T)'. 'db', 'collect' and the Pilog functions take care of such details automatically.

For more complex queries, involving more than one index, yes, I used Pilog and 'prove'. Each call to 'prove' returns (and sends) a single object.

For plain Pilog queries, i.e. without any special requirements like a defined sorting order, you can get along even without any custom functions/methods on the remote side. The 'remote/2' predicate can handle this transparently by executing its clauses on all remote machines. I have examples for that, but they are probably beyond the scope of this mail.

Cheers,
- Alex
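The control flow behind sending each hit with (pr Obj) and aborting when 'flush' fails can be sketched outside PicoLisp. This is a hedged Python illustration of the streaming idea only; the sorted pair list, the from/till bounds (mirroring 'iter''s range arguments), and all names are invented for the example:

```python
# Sketch of the producer/consumer idea behind (pr ...) + (flush):
# the remote streams hits one at a time, in key order, and is simply
# abandoned as soon as the consumer stops reading, instead of ever
# materializing the whole result list.

def stream_hits(tree, frm, till):
    """Yield (key, obj) pairs with frm <= key <= till, one at a time.
    tree: list of (key, obj) pairs sorted ascending by key."""
    for key, obj in tree:
        if key < frm:
            continue
        if key > till:
            break                    # past the range: stop iterating
        yield key, obj               # like (pr Obj) followed by (flush)

tree = [(k, f"obj{k}") for k in range(10)]
first_three = []
for key, obj in stream_hits(tree, 3, 8):
    first_three.append(obj)
    if len(first_three) == 3:        # consumer loses interest: the
        break                        # generator does no further work
print(first_three)
# → ['obj3', 'obj4', 'obj5']
```

The generator never computes hits the consumer didn't ask for, which is the same property 'flush' gives the remote: a closed connection terminates the loop.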
Re: Scaling issue
One final question: how did you define the rd1 mechanism? Simply doing:

   (dm rd1 (Sock)
      (in Sock (rd)) )

will read the whole result, not just the first result, won't it? I'm a little bit confused, since the reference says that 'rd' will read the first item from the current input channel, but when I look at my current usage of 'rd' I get the feeling it will read the whole result? Maybe something is needed on the remote? At the moment there is simply a 'collect' followed by a sort there.

I hope I'm not too cryptic.

/Henrik
Re: Scaling issue
Hi Henrik,

> One final question: how did you define the rd1 mechanism?

In the mentioned case, I used the following method in the +Agent class:

   (dm rd1 (Sock)
      (when (assoc Sock (: socks))
         (rot (: socks) (index @ (: socks)))
         (ext (: ext)
            (or
               (in Sock (rd))
               (nil
                  (close Sock)
                  (pop (:: socks)) ) ) ) ) )

This looks a little complicated, as each agent maintains a list of open sockets (in 'socks'). But if you omit the 'socks' management, it is basically just

   (ext (: ext) (in Sock (rd)))

followed by 'close' if the remote side closed the connection.

> Simply doing:
>
>    (dm rd1 (Sock)
>       (in Sock (rd)) )
>
> will read the whole result, not just the first result, won't it?

This should not be the case. It depends on what the other side sends. If it sends a list, you'll get the whole list. In the examples we discussed, however, the query results were sent one by one.

> I'm a little bit confused since it says in the reference that rd will read the first item from the current input channel but when I look

Yes, analogous to 'read', 'line', 'char' etc.

> Maybe something is needed on the remote? At the moment there is simply a collect and sort by there.

Could it be that the remote sends the result of 'collect'? That would be the whole list then.

Cheers,
- Alex
Re: Scaling issue
So I gather the *Ext mapping is absolutely necessary, regardless of whether 'remote' or 'ext' is used.

I took a look at the *Ext section again; could I maybe use this:

   (setq *Ext  # Define extension functions
      (mapcar
         '((@Host @Port @Ext)
            (let Sock NIL
               (cons @Ext
                  (curry (@Host @Port @Ext Sock) (Obj)
                     (when (or Sock (setq Sock (connect @Host @Port)))
                        (ext @Ext
                           (out Sock (pr (cons 'qsym Obj)))
                           (prog1 (in Sock (rd))
                              (unless @
                                 (close Sock)
                                 (off Sock) ) ) ) ) ) ) ) )
         '(localhost localhost)
         '(4041 4042)
         '(40 80) ) )

And then with 'ext' I need to create that single look-ahead queue in the local code you talked about earlier, but how? I mean, at the moment the problem is that I get too many articles in my local code, since all the remotes send all their articles at once, thus swamping the local process. I'll show you what I'm using now:

   (dm evalAll @
      (let Result
         (make
            (for N (getMachine This localhost)
               (later (chain (cons 'void))
                  (eval This N (rest)) ) ) )
         (wait 5000 (not (memq 'void Result)))
         Result ) )

(Note that this logic does not respect a multi-machine environment; I will add that when/if my current single machine is not enough.)

This one will evaluate code on all remotes and return all the results. If the result contains, let's say, more than 10 000 articles, it will choke as it is now. That's why I need that single look-ahead you talked about, but I don't know how to implement it.

If it were just about returning the 25 newest articles, I could have each remote simply return its 25 newest ones and then sort again locally. In that case I would get 50 back, and not 10 000 as in this case. And when I want the next slice of results, 25-50, I suppose I could return 50 from each remote, but this is a very ugly solution that doesn't scale very well.

On Sun, Apr 25, 2010 at 12:05 PM, Alexander Burger a...@software-lab.de wrote:
> Hi Henrik,
>
>> I've reviewed the '*Ext' part in the manual, and I will need something different, as I will have several nodes on each machine, on different ports (starting with simply localhost). I suppose I could have simply modified it if I had had one node per machine?
>
> With node you mean a server process? What makes you think that the example limits it to one node? IIRC, the example is in fact a simplified version (perhaps too simplified?) of a system where there were many servers, of equal and different types, on each host.
>
>> Anyway, what would the whole procedure you've described look like if I have two external nodes listening on 4041 and 4042 respectively, but both on localhost, and the E/R in question looks like this?:
>>
>>    (class +Article +Entity)
>>    (rel aid (+Key +Number))
>>    (rel title (+String))
>>    (rel htmlUrl (+Key +String))
>>    # (rel body (+Blob))
>>    (rel pubDate (+Ref +Number))
>
> Side question: Is there a special reason why 'pubDate' is a '+Number' and not a '+Date'? Should work that way, though.
>
>> In this case I want to fetch articles 25 - 50, sorted by pubDate, from both nodes
>
> Unfortunately, this cannot be achieved directly with an '+Aux' relation, because the article number and the date cannot be organized into a single index with a primary and a secondary sorting criterion. There is no other way than fetching and then sorting them, I think:
>
>    (by '((This) (: pubDate)) sort (collect 'aid '+Article 25 50))
>
> Thus, the send part from a node to the central server would be
>
>    (for Article
>       (by '((This) (: pubDate)) sort
>          (collect 'aid '+Article 25 50) )
>       (pr Article)      # Send the article object
>       (NIL (flush)) )   # Flush the socket
>
> The 'flush' is important, not so much to immediately send the data, but to detect whether the other side (the central server) has closed the connection, perhaps because it isn't interested in further data. 'flush' returns NIL if it cannot send the data successfully, and thus causes the 'for' loop to terminate.
>
>> So as far as I've understood it, a (setq *Ext ... ) section is needed, and then the specific logic described in your previous post, in the form of something using 'ext' or maybe 'remote'?
>
> Yes. '*Ext' is necessary if remote objects are accessed locally. 'remote' might be handy if Pilog is used for remote queries; this is not the case in the above example. But 'ext' is needed on the central server, with the proper offsets for the clients. This can all be encapsulated in the +Agent objects.
>
> Cheers,
> - Alex
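The point about the missing primary/secondary index, that a slice fetched by one key must be re-sorted locally by another, can be shown with a small sketch. Python is used here for neutrality; the field names follow the +Article E/R model from the mail, but the data and helper names are invented:

```python
# The limitation described above: the index is ordered by 'aid', but
# the wanted order is by 'pubDate', so a slice fetched by aid has to
# be re-sorted locally. Dates below are arbitrary demo values.

articles = [
    {"aid": n, "pubDate": (n * 7) % 5}   # made-up dates for the demo
    for n in range(1, 11)
]

# analogue of (collect 'aid '+Article 25 50): take a slice by aid ...
slice_by_aid = [a for a in articles if 3 <= a["aid"] <= 6]

# ... then sort that slice by pubDate, as with (by '(...) sort ...)
by_date = sorted(slice_by_aid, key=lambda a: a["pubDate"])
print([a["aid"] for a in by_date])
# → [5, 3, 6, 4]
```

Note that this only sorts the fetched slice; it is not a global sort by date, which is exactly why the thread moves on to merging per-remote streams.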
Re: Scaling issue
On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
> So I gather the *Ext mapping is absolutely necessary regardless of whether remote or ext is used.

Yes. Only in the case where you do not intend to communicate whole objects between the remote and the local application, but only scalar data like strings, numbers, or lists of those, could you do without it. I would say this would be quite a limitation. You need to communicate whole objects, at least because you want to compare them locally to find the biggest (see below).

> I took a look at the *Ext section again, could I use this maybe:
>
>    (setq *Ext  # Define extension functions
>       ...
>                (off Sock) ) ) ) ) ) ) ) )
>       '(localhost localhost)
>       '(4041 4042)
>       (40 80) ) )

Yes, that's good. The example in the docu was not sufficient, as it has a single port hard-coded.

> And then with 'ext' I need to create that single look-ahead queue in the local code you talked about earlier, but how?

The look-ahead queue of a single object per connection consisted simply of a list: the first result sent from each remote host. What I did was:

1. Starting a new query, a list of connections to all remote hosts is opened:

      (extract
         '((Agent)
            (query Agent arguments) )
         (list of agents) )

   This returns a list of all agent objects which succeeded to connect. I used that list to initialize a Pilog query.

2. Then you fetch the first answer from each connection. I used a method 'rd1' in the agent class for that:

      (extract 'rd1 (list of open agents))

   'extract' is used here, as it behaves like 'mapcar' but filters all NIL items out of the result. A NIL item will be returned in the first 'extract' if the connection cannot be opened, and in the second one if that remote host has no results to send.

   So now you have a list of results: the first (highest, biggest, newest?) object from each remote host.

3. Now the main query loop starts. Each time a new result is requested, e.g. from the GUI, you just need to find the object with the highest, biggest, newest attribute in that list. You take it from the list (e.g. with 'prog1'), and immediately refill the slot in the list by calling 'rd1' for that host again.

   If that 'rd1' returns NIL, it means this remote host has no more results, so you delete it from the list of open agents. If it returns non-NIL, you store the read value into the slot.

In that way, the list of received items constitutes a kind of look-ahead structure, always containing the items which might be returned next to the caller.

> I mean at the moment the problem is that I get too many articles in my local code since all the remotes send all their articles at once, thus swamping

There cannot be any swamping. All remote processes will send their results, yes, but only until the TCP queue fills up, or until they have no more results. The local process doesn't see anything of that; it just fetches the next result with 'rd1' whenever it needs one. You don't have to worry at all whether the GUI calls for the next result once, or 50 times. Each time simply the next result is returned. This works well, and produces no more load than is necessary.

Cheers,
- Alex
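The three steps above can be condensed into a sketch of the look-ahead merge. This hedged Python version models each remote as an iterator over a stream that is already sorted newest-first; the slot-refill loop plays the role of the repeated 'rd1' calls, and all names are hypothetical:

```python
# Minimal sketch of the look-ahead scheme: keep one pending result per
# remote, repeatedly hand out the best pending one, and refill exactly
# that slot from the same remote. Each remote yields (stamp, article)
# pairs, descending by stamp.

def merge_newest(remotes):
    """Yield (stamp, article) pairs globally newest-first."""
    # Steps 1+2: open connections, read one look-ahead item from each
    slots = []
    for it in remotes:
        first = next(it, None)
        if first is not None:           # drop remotes with no results
            slots.append([first, it])
    # Step 3: take the newest pending item, then refill its slot
    while slots:
        best = max(slots, key=lambda s: s[0][0])
        yield best[0]
        nxt = next(best[1], None)       # like calling 'rd1' again
        if nxt is None:
            slots.remove(best)          # remote exhausted: forget it
        else:
            best[0] = nxt

r1 = iter([(9, "a"), (5, "b"), (1, "c")])
r2 = iter([(8, "x"), (7, "y")])
print([art for _, art in merge_newest([r1, r2])])
# → ['a', 'x', 'y', 'b', 'c']
```

Only one item per remote is ever buffered locally, which is why the "swamping" Henrik worries about cannot occur: unread results stay in the senders' TCP queues.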
Re: Scaling issue
Ah, so the key is to have the connections in a list; I should have understood that. Thanks for the help, I'll try it out!
Re: Scaling issue
I've been reading up a bit on the remote stuff. I haven't made the articles distributed yet, but let's assume I have, with 10 000 articles per remote. Let's also assume that I have remade the word indexes to now work with real +Ref +Links on each remote, linking words and articles (not simply numbers for subsequent use with (id) locally). So with the refs in place I could use the full remote logic to run Pilog queries on the remotes.

Now a search is made for all articles containing the word picolisp, for instance. I then need to be able to get an arbitrary slice back of the total, which needs to be sorted by time. I have a hard time understanding how this can be achieved in any sensible way except through one of the following:

Central Command:

1.) The remotes are set up so that remote one contains the oldest articles, remote two the second oldest articles, and so on (this is the case naturally, as a new remote is spawned when the newest one is full).

2.) Each remote then returns how many articles it has that contain picolisp. This is needed for the pagination anyway, in order to display the correct number of pages, and can be done pretty trivially through the count tree mechanism described earlier in this thread.

3.) The local logic now determines which remote(s) should be queried in order to get 25 correct articles, issues the queries to be executed remotely, and displays the returned articles.

If pagination is scrapped, the total count is not needed; it's possible to have a More Results button instead, and I'm fine with that kind of interface too. In most cases the count is not important for the user anyway. In that way the following might be possible:

Cascading:

1.) The newest remote is queried first and can quickly determine through count tree that it has the requested articles, quickly fetches them, and returns them.

2.)
If it doesn't contain them, it will pass the request on to the second newest remote, which might contain all of the requested articles, or a subset, in which case the missing ones will be returned from the third newest remote through the same mechanism.

3.) The end result is that the correct articles now end up in the first remote, which will return them to the local process.

Did I miss something; might this problem be solved in a cleverer way? /Henrik

On Thu, Apr 15, 2010 at 12:55 PM, Henrik Sarvell hsarv...@gmail.com wrote: To simply be able to pass along simple commands like collect and db, i.e. the *Ext stuff, was overkill; it works just fine except in this special case, when there are thousands of articles in a feed. I'm planning to distribute the whole DB except users and what feeds they subscribe to. Everything else will be article-centric and remote. I will also keep local records of which feeds have articles in which remote, so I don't query remotes for nothing.

On Thu, Apr 15, 2010 at 12:17 PM, Alexander Burger a...@software-lab.de wrote: On Thu, Apr 15, 2010 at 09:12:18AM +0200, Henrik Sarvell wrote: On the other hand, if I'm to follow my own thinking to its logical conclusion, I should make the articles distributed too, with blobs and all.

What was the rationale to use object IDs instead of direct remote access via '*Ext'? I can't remember at the moment.
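The Central Command idea above, fetching per-remote counts first and then querying only the remotes that cover the requested slice, amounts to simple offset arithmetic. A hedged Python sketch with hypothetical names, assuming remotes are ordered and counts come from the per-remote count trees:

```python
def plan_page(counts, offset, limit):
    """counts[i] = number of matching articles on remote i, in slice order.
    Returns (remote, skip, take) triples covering the requested slice,
    so only those remotes need to be queried for actual articles."""
    plan = []
    for remote, n in enumerate(counts):
        if limit == 0:
            break
        if offset >= n:          # this whole remote lies before the slice
            offset -= n
            continue
        take = min(limit, n - offset)
        plan.append((remote, offset, take))
        limit -= take
        offset = 0               # later remotes are read from their start
    return plan

# Page 2 (offset 25, 25 per page) over remotes holding 10, 30 and 50 matches:
print(plan_page([10, 30, 50], 25, 25))  # [(1, 15, 15), (2, 0, 10)]
```

The total count needed for the page links falls out for free as sum(counts); with a More Results button, even that can be skipped and only the plan executed.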
Re: Scaling issue
On the other hand, if I'm to follow my own thinking to its logical conclusion, I should make the articles distributed too, with blobs and all.

On Wed, Apr 14, 2010 at 9:51 PM, Henrik Sarvell hsarv...@gmail.com wrote: I don't know, Alex; remember that we disconnected stuff. I'll paste the remote E/R again (all of it, there is nothing else on the remotes):

   (class +WordCount +Entity)
   (rel article (+Ref +Number))
   (rel word (+Aux +Ref +Number) (article))
   (rel count (+Number))

The numbers here can then be used in the main app with (id) to actually locate the objects in question. Could the *Ext functionality still be used somehow? I have a hard time understanding how, if I don't map the feed (parent) to article (child) relationship remotely. I mean, at some point I will have to filter all retrieved articles against a set of articles fetched locally (all articles belonging to my Twitter feed), if I don't store the connections remotely. Storing the feed to article links remotely will let me avoid checking locally, and it's that check that is the bottleneck at the moment.

I suppose you could find some clever way of speeding up the local filtering; at the moment I'm simply loading all Twitter articles with collect and then throwing away all remotely retrieved articles that are not in that list. However, that just seems like a duct-tape solution; even if it works to begin with, it won't work for long. /Henrik

On Sun, Apr 11, 2010 at 4:13 PM, Alexander Burger a...@software-lab.de wrote: On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote: Thanks Alex, I will go for the reversed range and check out select/3.

Let me mention that since picoLisp-3.0.1 we have a separate documentation of 'select/3', in doc/select.html.
Re: Scaling issue
To simply be able to pass along simple commands like collect and db, i.e. the *Ext stuff, was overkill; it works just fine except in this special case, when there are thousands of articles in a feed. I'm planning to distribute the whole DB except users and what feeds they subscribe to. Everything else will be article-centric and remote. I will also keep local records of which feeds have articles in which remote, so I don't query remotes for nothing.

On Thu, Apr 15, 2010 at 12:17 PM, Alexander Burger a...@software-lab.de wrote: On Thu, Apr 15, 2010 at 09:12:18AM +0200, Henrik Sarvell wrote: On the other hand, if I'm to follow my own thinking to its logical conclusion, I should make the articles distributed too, with blobs and all.

What was the rationale to use object IDs instead of direct remote access via '*Ext'? I can't remember at the moment.
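The plan of keeping local records of which feeds have articles in which remote is essentially a small routing table. A hedged Python sketch with hypothetical names, showing the shape of such a record:

```python
class FeedRouter:
    """Local map from feed id to the remotes known to hold its articles,
    so a search can skip remotes that cannot contribute anything."""

    def __init__(self):
        self.routes = {}

    def record(self, feed, remote):
        # Called whenever an article of 'feed' is indexed on 'remote'
        self.routes.setdefault(feed, set()).add(remote)

    def remotes_for(self, feed):
        # Only these remotes need to be queried for this feed
        return self.routes.get(feed, set())

router = FeedRouter()
router.record("twitter", 4041)
router.record("twitter", 4042)
router.record("hn", 4041)
print(sorted(router.remotes_for("twitter")))  # [4041, 4042]
print(router.remotes_for("nosuchfeed"))       # set()
```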
Re: Scaling issue
I don't know, Alex; remember that we disconnected stuff. I'll paste the remote E/R again (all of it, there is nothing else on the remotes):

   (class +WordCount +Entity)
   (rel article (+Ref +Number))
   (rel word (+Aux +Ref +Number) (article))
   (rel count (+Number))

The numbers here can then be used in the main app with (id) to actually locate the objects in question. Could the *Ext functionality still be used somehow? I have a hard time understanding how, if I don't map the feed (parent) to article (child) relationship remotely. I mean, at some point I will have to filter all retrieved articles against a set of articles fetched locally (all articles belonging to my Twitter feed), if I don't store the connections remotely. Storing the feed to article links remotely will let me avoid checking locally, and it's that check that is the bottleneck at the moment.

I suppose you could find some clever way of speeding up the local filtering; at the moment I'm simply loading all Twitter articles with collect and then throwing away all remotely retrieved articles that are not in that list. However, that just seems like a duct-tape solution; even if it works to begin with, it won't work for long. /Henrik

On Sun, Apr 11, 2010 at 4:13 PM, Alexander Burger a...@software-lab.de wrote: On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote: Thanks Alex, I will go for the reversed range and check out select/3.

Let me mention that since picoLisp-3.0.1 we have a separate documentation of 'select/3', in doc/select.html.
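The local filtering described above (collect all of a feed's articles, then throw away remote results not in that list) is a set-membership test, and its cost depends heavily on the structure used. A hedged Python sketch of the distinction, with hypothetical names:

```python
def filter_slow(remote_ids, feed_ids):
    # List membership: each test scans feed_ids, so roughly
    # len(remote_ids) * len(feed_ids) comparisons in the worst case
    return [i for i in remote_ids if i in feed_ids]

def filter_fast(remote_ids, feed_ids):
    # Building one hash set up front makes each membership test
    # O(1) on average, so the filter is linear overall
    feed_set = set(feed_ids)
    return [i for i in remote_ids if i in feed_set]

remote_hits = [3, 17, 42, 99]             # e.g. articles matching a word
twitter_articles = list(range(50))        # e.g. all articles in the feed
print(filter_fast(remote_hits, twitter_articles))  # [3, 17, 42]
```

Even the fast version still has to fetch the whole feed's article list once, which is why storing the feed link remotely, and filtering on the remote side, removes the bottleneck entirely.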
Re: Scaling issue
Hi Henrik,

   (class +ArFeLink +Entity)
   (rel article (+Aux +Ref +Link) (feed) NIL (+Article))
   (rel feed (+Ref +Link) NIL (+Feed))

   (collect 'feed '+ArFeLink Obj Obj 'article)

takes forever (2 mins); I need it to take something like a maximum of 2 seconds... Can this be fixed by adding some index or key, or do I need to make this part of the DB distributed and chopped up so I can run this in parallel?

This is already the proper index. Is it perhaps the case that there are simply too many articles fetched at once? How many articles does the above 'collect' return? And are these articles all needed at that time?

If you talk about 2 seconds, I assume you don't want the user to have to wait, so it is a GUI interaction. In such cases it is typical not to fetch all data from the DB, but only the first chunk, e.g. to display them in the GUI. It would be better then to use a Pilog query, returning the results one by one (as done in +QueryChart). To get results analogous to the above 'collect', you could create a query like

   (let Q
      (goal
         (quote
            @Obj Obj
            (db feed +ArFeLink @Obj @Feed)
            (val @Article @Feed article) ) )
      ...
      (do 20                    # Then fetch the first 20 articles
         (NIL (prove Q))        # More?
         (bind @                # Bind the result values
            (println @Article)  # Use the article
            ... ) ) )

Instead of 'bind' you could also simply use 'get' to extract the @Article: (get @ '@Article). Before doing so, I would test it interactively, e.g.

   : (? (db feed +ArFeLink {ART} @Feed) (val @Article @Feed article))

if '{ART}' is an article. Note that the above is not tested.

Cheers, - Alex
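The do-20 loop above, fetching only the first chunk instead of materializing everything, corresponds to incremental consumption of a lazy result stream. A hedged Python analogue using a generator (hypothetical names; the real mechanism is Pilog's 'prove' yielding one binding at a time):

```python
from itertools import islice

def article_query():
    """Stand-in for the Pilog query: yields matching articles one by one
    on demand, instead of building the whole result set up front."""
    for n in range(1, 46):           # pretend 45 articles match
        yield f"article-{n}"

results = article_query()
first_page = list(islice(results, 20))   # like (do 20 (NIL (prove Q)) ...)
print(len(first_page))                   # 20
second_page = list(islice(results, 20))  # picks up where the first stopped
print(second_page[0])                    # article-21
```

The point is that the query state survives between chunks, so asking for the next 20 results costs only 20 steps, no matter how large the total result set is.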
Re: Scaling issue
I see, I should've known about that one (I'm using it to get similar articles already). What's additionally needed is:

1.) Calculating the total count somehow, without retrieving all articles.

2.) Somehow sorting by date, so I get say the 25 first articles.

If those two can also be achieved in a manner that won't require me to fetch all articles, then I can use Pilog in this manner to fetch the results when it comes to getting all articles under all feeds under a specific tag. At the moment I'm fetching all of them at once and using head; not optimal.

However, it won't work with the word indexes; a redesign of how the system works is needed, I think. When searching for articles belonging to a certain feed containing a word in the content, I now let the distributed indexes return all articles, and then I simply use filter to get at the articles. And to do that I of course need to fetch all the articles in a certain feed, which works fine for most feeds, but not Twitter, as it now probably contains more than 10 000 articles.

The only solution I can see to this is to simply store the feed to article mapping remotely too, i.e. each word index server contains this info too for the articles it's mapping, resulting in an E/R section looking like this:

   (class +WordCount +Entity)
   (rel article (+Ref +Number))
   (rel word (+Aux +Ref +Number) (article))
   (rel count (+Number))

   (class +ArFeLink +Entity)
   (rel article (+Aux +Ref +Number) (feed))
   (rel feed (+Ref +Number))

Then I could simply filter by feed remotely. /Henrik

On Sun, Apr 11, 2010 at 9:25 AM, Alexander Burger a...@software-lab.de wrote: Hi Henrik,

   (class +ArFeLink +Entity)
   (rel article (+Aux +Ref +Link) (feed) NIL (+Article))
   (rel feed (+Ref +Link) NIL (+Feed))

   (collect 'feed '+ArFeLink Obj Obj 'article)

takes forever (2 mins); I need it to take something like a maximum of 2 seconds... Can this be fixed by adding some index or key, or do I need to make this part of the DB distributed and chopped up so I can run this in parallel?
This is already the proper index. Is it perhaps the case that there are simply too many articles fetched at once? How many articles does the above 'collect' return? And are these articles all needed at that time?

If you talk about 2 seconds, I assume you don't want the user to have to wait, so it is a GUI interaction. In such cases it is typical not to fetch all data from the DB, but only the first chunk, e.g. to display them in the GUI. It would be better then to use a Pilog query, returning the results one by one (as done in +QueryChart). To get results analogous to the above 'collect', you could create a query like

   (let Q
      (goal
         (quote
            @Obj Obj
            (db feed +ArFeLink @Obj @Feed)
            (val @Article @Feed article) ) )
      ...
      (do 20                    # Then fetch the first 20 articles
         (NIL (prove Q))        # More?
         (bind @                # Bind the result values
            (println @Article)  # Use the article
            ... ) ) )

Instead of 'bind' you could also simply use 'get' to extract the @Article: (get @ '@Article). Before doing so, I would test it interactively, e.g.

   : (? (db feed +ArFeLink {ART} @Feed) (val @Article @Feed article))

if '{ART}' is an article. Note that the above is not tested.

Cheers, - Alex
Re: Scaling issue
On Sun, Apr 11, 2010 at 12:25:42PM +0200, Henrik Sarvell wrote: What's additionally needed is: 1.) Calculating the total count somehow, without retrieving all articles.

If it is simply the count of all articles in the DB, you can get it directly from a '+Key' or '+Ref' index. I don't quite remember the E/R model, but I found this in an old mail:

   (class +Article +Entity)
   (rel aid (+Key +Number))
   (rel title (+Idx +String))
   (rel htmlUrl (+Key +String))

With that,

   (count (tree 'aid '+Article))

or

   (count (tree 'htmlUrl '+Article))

will give all articles having the property 'aid' or 'htmlUrl' (not, however, via 'title', as an '+Idx' index creates more than one tree node per object). If you need distinguished counts (e.g. for groups of articles or according to certain features), it might be necessary to build more indexes, or simply maintain counts during import.

2.) Somehow sorting by date, so I get say the 25 first articles.

This is also best done with a dedicated index, e.g. (rel dat (+Ref +Date)) in '+Article'. Then you could specify a reversed range (T . NIL) for a Pilog query

   (? (db dat +Article (T . NIL) @Article) (show @Article))

This will start with the newest article, and step backwards. Even easier might be if you specify a range of dates, say from today till one week ago. Then you could use 'collect'

   (collect 'dat '+Article (date) (- (date) 7))

or, as 'today' is not very informative,

   (collect 'dat '+Article T (- (date) 7))

When searching for articles belonging to a certain feed containing a word in the content, I now let the distributed indexes return all articles, and then I simply use filter to get at the articles. And to do that I of course need to fetch all the articles in a certain feed, which works fine for most feeds, but not Twitter, as it now probably contains more than 10 000 articles.

I think that usually it should not be necessary to fetch all articles, if you build a combined query with the 'select/3' predicate.
The only solution I can see to this is to simply store the feed to article mapping remotely too, i.e. each word index server contains this info too for ...

Then I could simply filter by feed remotely.

Not sure. But I feel that I would use distributed processing here only if there is no other way (i.e. the parallel search with 'select/3').

Cheers, - Alex
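Alex's two index tricks in the mail above, counting straight from an index tree and taking a date slice with a possibly reversed range, can be illustrated on a plain sorted sequence. A hedged Python sketch using bisect, with hypothetical names and dates as integers for brevity (a stand-in for the B-tree, not the PicoLisp API):

```python
import bisect

def index_count(index):
    # Like (count (tree 'aid '+Article)): a +Key/+Ref tree holds exactly
    # one entry per object that has the property, so counting is cheap
    return len(index)

def collect_range(index, lo, hi, newest_first=True):
    """Like (collect 'dat '+Article hi lo): every entry with lo <= d <= hi.
    Returning the hits reversed mirrors the (T . NIL) trick of stepping
    backwards from the newest entry."""
    i = bisect.bisect_left(index, lo)
    j = bisect.bisect_right(index, hi)
    hits = index[i:j]
    return hits[::-1] if newest_first else hits

dates = [1, 3, 3, 7, 10, 14, 20]           # sorted ascending, like the tree
print(index_count(dates))                   # 7
print(collect_range(dates, 20 - 7, 20))     # [20, 14]  (the last "week")
```

Both operations touch only the index, never the article objects themselves, which is the whole point of answering counts and date slices without fetching articles.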
Re: Scaling issue
Thanks Alex, I will go for the reversed range and check out select/3. I'm already using collect with dates extensively, but in this case it wouldn't work, as I need the 25 newest regardless of exactly when they were published. /Henrik

On Sun, Apr 11, 2010 at 1:27 PM, Alexander Burger a...@software-lab.de wrote: On Sun, Apr 11, 2010 at 12:25:42PM +0200, Henrik Sarvell wrote: What's additionally needed is: 1.) Calculating the total count somehow, without retrieving all articles.

If it is simply the count of all articles in the DB, you can get it directly from a '+Key' or '+Ref' index. I don't quite remember the E/R model, but I found this in an old mail:

   (class +Article +Entity)
   (rel aid (+Key +Number))
   (rel title (+Idx +String))
   (rel htmlUrl (+Key +String))

With that, (count (tree 'aid '+Article)) or (count (tree 'htmlUrl '+Article)) will give all articles having the property 'aid' or 'htmlUrl' (not, however, via 'title', as an '+Idx' index creates more than one tree node per object). If you need distinguished counts (e.g. for groups of articles or according to certain features), it might be necessary to build more indexes, or simply maintain counts during import.

2.) Somehow sorting by date, so I get say the 25 first articles.

This is also best done with a dedicated index, e.g. (rel dat (+Ref +Date)) in '+Article'. Then you could specify a reversed range (T . NIL) for a Pilog query

   (? (db dat +Article (T . NIL) @Article) (show @Article))

This will start with the newest article, and step backwards. Even easier might be if you specify a range of dates, say from today till one week ago. Then you could use 'collect'

   (collect 'dat '+Article (date) (- (date) 7))

or, as 'today' is not very informative,

   (collect 'dat '+Article T (- (date) 7))

When searching for articles belonging to a certain feed containing a word in the content, I now let the distributed indexes return all articles, and then I simply use filter to get at the articles.
And to do that I of course need to fetch all the articles in a certain feed, which works fine for most feeds, but not Twitter, as it now probably contains more than 10 000 articles.

I think that usually it should not be necessary to fetch all articles, if you build a combined query with the 'select/3' predicate.

The only solution I can see to this is to simply store the feed to article mapping remotely too, i.e. each word index server contains this info too for ...

Then I could simply filter by feed remotely.

Not sure. But I feel that I would use distributed processing here only if there is no other way (i.e. the parallel search with 'select/3').

Cheers, - Alex
Re: Scaling issue
On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote: Thanks Alex, I will go for the reversed range and check out select/3.

Let me mention that since picoLisp-3.0.1 we have a separate documentation of 'select/3', in doc/select.html.