Re: Scaling
Hi Thorsten,

> > Distribution involves separate machines, connected via TCP. On each
> > machine, typically several PicoLisp database processes are running,
> Is my interpretation right, that the 'several PicoLisp database processes'
> running on one machine form a 'PicoLisp process family' that is considered
> as one application with one database?

Yes, but there may also be several such "families" on a single machine.

A single application, operating on a single database, consists of a parent
process with an arbitrary number of child processes. This structure is
necessary because the synchronization of all processes that access a given
database must go via a common parent (family IPC uses simple pipes).

"A single database" usually means a single directory, containing all files
of that database. Theoretically, a database may consist of at most 65536
files, but this doesn't make sense in a typical Unix environment, because
of too many file descriptors and other resource problems. A single file
can contain at most 4 Tera objects (42-bit object IDs).

It makes good sense to run several applications (= databases) on a single
machine, to get a better load distribution. I have no general rule; for
optimal tuning some experimentation is required. It depends mostly on the
number of CPU cores and the amount of available RAM (file buffer cache).

For the program logic (how those applications communicate with each
other), it doesn't matter which application is running on which machine,
as long as everything is properly configured. I had an admin application
for connecting/starting/stopping the individual apps.

> How do you split up the databases? Rather by rows or rather by columns (I

Not on that level, but on a functional level. For example, we had many
databases (about 70) collecting data from filer volumes, sending some of
their data to a second layer (also 70) which in turn sent some boiled-down
stuff to a single dedicated database containing some global data. Another
front-end application queried all the lower levels to generate statistics
and user reports, and contained a rule database (in Pilog) about what to
do on the lower levels.

> know they are not 2D tables in picolisp, what I mean is: does every DB cover

Right.

> the whole class hierarchy, but only a fraction of the objects, or does each

Yes, this was the case for the first and second layer described above. In
each layer all databases had the same model (E/R definitions, in fact the
same program code).

> DB cover a fraction of the class hierarchy, but all objects belonging to
> these classes?

So each application is a complete class hierarchy in itself, independent
from (but knowing about) the other DBs.

But what I described was for that concrete use case. I had only a single
project with such large DBs until now. Probably many other designs are
possible. As Henrik said, the stress is on ease of designing such
structures, not on a given framework. The philosophy of PicoLisp was
always to go for a vertical approach, with easy access from the lowest to
the highest levels.

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe
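For illustration, a minimal sketch of how such a multi-file database is
typically laid out, following the 'dbs'/'pool' pattern from the PicoLisp
distribution (the '+Article' class and the block-size choices here are
hypothetical):

   (dbs
      (3 +Article)              # objects in a file with 512-byte blocks
      (2 (+Article aid))        # 'aid' index in a 256-byte-block file
      (4 (+Article pubDate)) )  # 'pubDate' index in a 1024-byte-block file

   (pool "db/app/" *Dbs)        # open the directory containing all DB files

'dbs' records the file layout in '*Dbs', and 'pool' then opens one file
per entry under the given directory.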
Re: Scaling
Hi Alexander,

> Distribution involves separate machines, connected via TCP. On each
> machine, typically several PicoLisp database processes are running,
>
> Changes to the individual DBs have to be done the normal way (e.g. the
> 'put>' family of methods), where each application (PicoLisp process
> family) is maintaining its own DB.

Is my interpretation right, that the 'several PicoLisp database processes'
running on one machine form a 'PicoLisp process family' that is considered
as one application with one database? So it is one database per machine,
using several processes on that machine, that has to be changed
individually, but can be queried as part of a distributed net of databases
on several machines connected via TCP?

How do you split up the databases? Rather by rows or rather by columns? (I
know they are not 2D tables in picolisp; what I mean is: does every DB
cover the whole class hierarchy, but only a fraction of the objects, or
does each DB cover a fraction of the class hierarchy, but all objects
belonging to these classes?)

Cheers
Thorsten
Re: Scaling
Hi Thorsten, in addition to what Henrik wrote:

> So dividing a database in several smaller files and accessing them with
> something like id or ext gives a distributed faster database, and when doing

Dividing the database into multiple files is the "normal" approach to
designing a DB application in PicoLisp, so this is not what I would call
"distributed".

Distribution involves separate machines, connected via TCP. On each
machine, typically several PicoLisp database processes are running, and
they exchange objects via 'id' or 'ext', but - more importantly - can do
remote calls (via 'pr', 'rd' etc., i.e. the PLIO protocol mentioned in the
other mail) and remote queries (see "doc/refR.html#remote/2").

Direct remote DB operations involve only read accesses (queries). Changes
to the individual DBs have to be done the normal way (e.g. the 'put>'
family of methods), where each application (PicoLisp process family) is
maintaining its own DB.

Hmm, that's all rather hard to explain, and unfortunately not formally
documented yet (except for Henrik's great descriptions).

> so i.e. in an Amazon EC2 account the database might (automagically) end up on
> different servers, thus becoming faster and (almost endlessly) scalable.

Yes, though the current system doesn't have any mechanism for dynamic
relocation of database processes yet. Actually, I was planning something
along those lines, but the project where I would have needed it was
terminated :(

> Is anybody using Emacs/Gnus for this mailing list and can give some advice
> how to make that work?

Yes, our Argentinian friends. By now, they should be up ;-)

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe
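A minimal sketch of such a remote call over PLIO, assuming a hypothetical
server on localhost port 4040 that reads an expression with 'rd',
evaluates it, and sends the result back with 'pr' (much like the
'task'/'fork' servers shown later in this thread):

   (let? Sock (connect "localhost" 4040)
      (out Sock (pr '(+ 1 2 3)))   # send an expression, PLIO-encoded
      (prog1
         (in Sock (rd))            # read back the evaluated result
         (close Sock) ) )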
Re: Scaling
Hi Henrik,

thanks, that's an interesting read. So dividing a database in several
smaller files and accessing them with something like id or ext gives a
distributed, faster database, and when doing so i.e. in an Amazon EC2
account the database might (automagically) end up on different servers,
thus becoming faster and (almost endlessly) scalable.

I have no practical experience with deploying picolisp or the Amazon
cloud, so I'm just guessing; I just want to get a general idea of what can
be done with picolisp and what not.

Thorsten

PS Is anybody using Emacs/Gnus for this mailing list and can give some
advice how to make that work?

2011/3/24 Henrik Sarvell
> Hi Thorsten.
>
> Here is a description of a real world example:
> http://picolisp.com/5000/-2-I.html
>
> In that article you will also find some links to functions that might or
> might not be of use to you, such as (ext).
>
> When it comes to distributed data and PicoLisp you don't get much for free
> (apart from the aforementioned ext functionality). It's more like a
> framework with which you are able to create something more specific.
>
> In short, you won't get something like Cassandra, Hadoop or Riak out of
> the box, but you could certainly create something like them with the
> tools that you do have.
>
> And you could probably create something similar to those three with less
> hassle than it took to create them in their respective languages (Java /
> Erlang).
>
> /Henrik
>
> On Thu, Mar 24, 2011 at 6:11 PM, Thorsten <
> gruenderteam.ber...@googlemail.com> wrote:
>> Hallo,
>> I recently discovered (amazing) picolisp and have a few (I hope not too
>> naive) questions. I write one mail for each question so as not to mix up
>> things.
>>
>> I read in the documentation about distributed picolisp databases, the
>> ability to make picolisp apps faster and faster by adding hardware cores
>> (and using different pipes of the underlying Linux OS?), and the
>> possibility to deploy picolisp apps in the cloud. But these things are
>> only mentioned, without further explanation.
>>
>> Since scaling and concurrency are all the hype in the Java world (Scala,
>> Clojure) I would like to know a bit more about the capabilities and
>> limits of picolisp in this area, and how these things are achieved in
>> practice (i.e. how to deploy a picolisp app in the cloud?)
>>
>> Thanks
>> Thorsten
Re: Scaling
Hi Thorsten.

Here is a description of a real world example:
http://picolisp.com/5000/-2-I.html

In that article you will also find some links to functions that might or
might not be of use to you, such as (ext).

When it comes to distributed data and PicoLisp you don't get much for free
(apart from the aforementioned ext functionality). It's more like a
framework with which you are able to create something more specific.

In short, you won't get something like Cassandra, Hadoop or Riak out of
the box, but you could certainly create something like them with the tools
that you do have.

And you could probably create something similar to those three with less
hassle than it took to create them in their respective languages (Java /
Erlang).

/Henrik

On Thu, Mar 24, 2011 at 6:11 PM, Thorsten <
gruenderteam.ber...@googlemail.com> wrote:
> Hallo,
> I recently discovered (amazing) picolisp and have a few (I hope not too
> naive) questions. I write one mail for each question so as not to mix up
> things.
>
> I read in the documentation about distributed picolisp databases, the
> ability to make picolisp apps faster and faster by adding hardware cores
> (and using different pipes of the underlying Linux OS?), and the
> possibility to deploy picolisp apps in the cloud. But these things are
> only mentioned, without further explanation.
>
> Since scaling and concurrency are all the hype in the Java world (Scala,
> Clojure) I would like to know a bit more about the capabilities and
> limits of picolisp in this area, and how these things are achieved in
> practice (i.e. how to deploy a picolisp app in the cloud?)
>
> Thanks
> Thorsten
Re: Scaling issue
I've summed up the result of this thread here:
http://picolisp.com/5000/-2-I.html with some explanations.

/Henrik

On Fri, May 14, 2010 at 8:59 AM, Henrik Sarvell wrote:
> OK, since I can't rely on sorting by date anyway, let's forget that idea.
>
> Yes, since it seemed I had to involve dates anyway, I simply chose a
> date far back enough in time that if someone is looking for something
> older they might as well use Google.
>
> Anyway, the above is scanning 19 remotes containing indexes for 10 000
> articles each and returns in 3-4 seconds, which is OK for me; problem
> solved as far as I'm concerned. I have to add though that all remotes
> are currently on the same machine; had they been truly distributed it
> would be faster, especially if the other machines were in the same
> data center.
>
> On Fri, May 14, 2010 at 7:55 AM, Alexander Burger
> wrote:
>> On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
>>> One thing first though: since articles are indexed when they're parsed,
>>> and PL isn't doing any kind of sorting automatically on insert, then
>>> they should be sorted by date automatically, with the latest articles
>>> at the end of the database file, since I suppose they're just appended?
>>
>> While this is correct in principle, I would not rely on it. If there
>> should ever be an object deleted from that database file, the space
>> would be reused by the next new object, and the assumption would break.
>>
>>> How can I simply start walking from the end of the file until I've
>>> found, say, 25 matches? This procedure should be the absolutely fastest
>>> way of getting what I want.
>>
>> Currently I see no easy way. The only function that walks a database
>> file directly is 'seq', but it can only step forwards.
>>
>>> I know about your iter example earlier and it seems like a good fit if
>>> it starts walking at the right end?
>>
>> Yes, 'iter' (and the related 'scan') can walk in both directions. You
>> only need to pass inverted keys (i.e. Beg > End).
>>
>> If I understand it right, however, you solved the problem in your next
>> mail(s) by using the date index, and starting at 6 months ago?
>>
>> Cheers,
>> - Alex
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
OK, since I can't rely on sorting by date anyway, let's forget that idea.

Yes, since it seemed I had to involve dates anyway, I simply chose a date
far back enough in time that if someone is looking for something older
they might as well use Google.

Anyway, the above is scanning 19 remotes containing indexes for 10 000
articles each and returns in 3-4 seconds, which is OK for me; problem
solved as far as I'm concerned. I have to add though that all remotes are
currently on the same machine; had they been truly distributed it would be
faster, especially if the other machines were in the same data center.

On Fri, May 14, 2010 at 7:55 AM, Alexander Burger wrote:
> On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
>> One thing first though: since articles are indexed when they're parsed,
>> and PL isn't doing any kind of sorting automatically on insert, then
>> they should be sorted by date automatically, with the latest articles
>> at the end of the database file, since I suppose they're just appended?
>
> While this is correct in principle, I would not rely on it. If there
> should ever be an object deleted from that database file, the space
> would be reused by the next new object, and the assumption would break.
>
>> How can I simply start walking from the end of the file until I've
>> found, say, 25 matches? This procedure should be the absolutely fastest
>> way of getting what I want.
>
> Currently I see no easy way. The only function that walks a database
> file directly is 'seq', but it can only step forwards.
>
>> I know about your iter example earlier and it seems like a good fit if
>> it starts walking at the right end?
>
> Yes, 'iter' (and the related 'scan') can walk in both directions. You
> only need to pass inverted keys (i.e. Beg > End).
>
> If I understand it right, however, you solved the problem in your next
> mail(s) by using the date index, and starting at 6 months ago?
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
> One thing first though: since articles are indexed when they're parsed,
> and PL isn't doing any kind of sorting automatically on insert, then
> they should be sorted by date automatically, with the latest articles
> at the end of the database file, since I suppose they're just appended?

While this is correct in principle, I would not rely on it. If there
should ever be an object deleted from that database file, the space would
be reused by the next new object, and the assumption would break.

> How can I simply start walking from the end of the file until I've
> found, say, 25 matches? This procedure should be the absolutely fastest
> way of getting what I want.

Currently I see no easy way. The only function that walks a database file
directly is 'seq', but it can only step forwards.

> I know about your iter example earlier and it seems like a good fit if
> it starts walking at the right end?

Yes, 'iter' (and the related 'scan') can walk in both directions. You only
need to pass inverted keys (i.e. Beg > End).

If I understand it right, however, you solved the problem in your next
mail(s) by using the date index, and starting at 6 months ago?

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
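For example, walking a date index newest-first might look like this (a
sketch with made-up names; the consed Beg/End format for a '+Ref' index is
explained in Alex's other mail in this thread):

   (iter (tree 'pubDate '+Article)
      '((Obj) (println Obj))
      (cons (date) T)   # Beg: today, inclusive
      (cons 0) )        # End: 0; Beg > End, so the walk runs newest first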
Re: Scaling issue
Sorry for the spam but the prior listing is not correct, it didn't manage
to return sorted by date; this one does though:

(de getArticles (W)
   (let Goal
      (goal
         (quote
            @Word W
            @Date (cons (- (stamp> '+Gh) (* 6 31 86400)) (stamp> '+Gh))
            (select (@Wcs)
               ((picoStamp +WordCount @Date) (word +WordCount @Word))
               (same @Word @Wcs word)
               (range @Date @Wcs picoStamp) ) ) )
      (do 25
         (NIL (prove Goal))
         (bind @
            (pr (cons (; @Wcs article) (; @Wcs picoStamp)))
            (unless (flush) (bye)) ) )
      (bye) ) )

On Thu, May 13, 2010 at 9:36 PM, Henrik Sarvell wrote:
> See my prior post for context.
>
> I've been testing a few different approaches and this is the fastest so far:
>
> (de getArticles (W)
>    (let Goal
>       (goal
>          (quote
>             @Word W
>             (select (@Wcs)
>                ((word +WordCount @Word))
>                (same @Word @Wcs word) ) ) )
>       (do 25
>          (NIL (prove Goal))
>          (bind @
>             (pr (cons (; @Wcs article) (; @Wcs picoStamp)))
>             (unless (flush) (bye)) ) )
>       (bye) ) )
>
> Where the remote ER is:
>
> (class +WordCount +Entity) #
> (rel article   (+Ref +Number))
> (rel word      (+Aux +Ref +Number) (article))
> (rel count     (+Number))
> (rel picoStamp (+Ref +Number))
>
> On Thu, May 13, 2010 at 9:12 PM, Henrik Sarvell wrote:
>> Everything is running smoothly now, I intend to make a write up on the
>> wiki this weekend maybe on this.
>>
>> One thing first though: since articles are indexed when they're parsed,
>> and PL isn't doing any kind of sorting automatically on insert, then
>> they should be sorted by date automatically, with the latest articles
>> at the end of the database file, since I suppose they're just appended?
>>
>> How can I simply start walking from the end of the file until I've
>> found, say, 25 matches? This procedure should be the absolutely fastest
>> way of getting what I want.
>>
>> I know about your iter example earlier and it seems like a good fit if
>> it starts walking at the right end?
>>
>> On Tue, May 11, 2010 at 9:09 AM, Alexander Burger wrote:
>>> On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
>>>> My code simply stops executing (as if waiting for the next entry but
>>>> it never gets it) when I run out of entries to fetch, really strange,
>>>> and a traceAll confirms this: the last output is a call to rd1>.
>>>
>>> What happens on the remote side, after all entries are sent? If the
>>> remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
>>> it is done.
>>>
>>>> This is my rd1>:
>>>>
>>>> (dm rd1> (Sock)
>>>>    (or
>>>>       (in Sock (rd))
>>>>       (nil
>>>>          (close Sock) ) ) )
>>>
>>> This looks all right, but isn't obviously the problem, as it hangs in
>>> 'rd'.
>>>
>>>> (de getArticles (W)
>>>>    (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
>>>>       (pr (cons (; Wc article) (; Wc picoStamp)))
>>>>       (unless (flush) (bye)) ) )
>>>
>>> What happens if you do (bye) after the 'for' loop is done?
>>>
>>> I assume that 'getArticles' is executed in the (eval @) below
>>>
>>>> (task (port (+ *IdxNum 4040))
>>>>    (let? Sock (accept @)
>>>>       (unless (fork)
>>>>          (in Sock
>>>>             (while (rd)
>>>>                (sync)
>>>>                (out Sock
>>>>                   (eval @) ) ) )
>>>>          (bye) )
>>>>       (close Sock) ) )
>>>
>>> This looks OK, because (bye) is called after the while loop is done.
>>> Perhaps there is something in the way 'getArticles' is invoked here? You
>>> could change the second last line to (! bye) and see if it is indeed
>>> reached. I would suspect it isn't.
>>>
>>> Cheers,
>>> - Alex
>>> --
>>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
See my prior post for context.

I've been testing a few different approaches and this is the fastest so far:

(de getArticles (W)
   (let Goal
      (goal
         (quote
            @Word W
            (select (@Wcs)
               ((word +WordCount @Word))
               (same @Word @Wcs word) ) ) )
      (do 25
         (NIL (prove Goal))
         (bind @
            (pr (cons (; @Wcs article) (; @Wcs picoStamp)))
            (unless (flush) (bye)) ) )
      (bye) ) )

Where the remote ER is:

(class +WordCount +Entity) #
(rel article   (+Ref +Number))
(rel word      (+Aux +Ref +Number) (article))
(rel count     (+Number))
(rel picoStamp (+Ref +Number))

On Thu, May 13, 2010 at 9:12 PM, Henrik Sarvell wrote:
> Everything is running smoothly now, I intend to make a write up on the
> wiki this weekend maybe on this.
>
> One thing first though: since articles are indexed when they're parsed,
> and PL isn't doing any kind of sorting automatically on insert, then
> they should be sorted by date automatically, with the latest articles
> at the end of the database file, since I suppose they're just appended?
>
> How can I simply start walking from the end of the file until I've
> found, say, 25 matches? This procedure should be the absolutely fastest
> way of getting what I want.
>
> I know about your iter example earlier and it seems like a good fit if
> it starts walking at the right end?
>
> On Tue, May 11, 2010 at 9:09 AM, Alexander Burger wrote:
>> On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
>>> My code simply stops executing (as if waiting for the next entry but
>>> it never gets it) when I run out of entries to fetch, really strange,
>>> and a traceAll confirms this: the last output is a call to rd1>.
>>
>> What happens on the remote side, after all entries are sent? If the
>> remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
>> it is done.
>>
>>> This is my rd1>:
>>>
>>> (dm rd1> (Sock)
>>>    (or
>>>       (in Sock (rd))
>>>       (nil
>>>          (close Sock) ) ) )
>>
>> This looks all right, but isn't obviously the problem, as it hangs in
>> 'rd'.
>>
>>> (de getArticles (W)
>>>    (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
>>>       (pr (cons (; Wc article) (; Wc picoStamp)))
>>>       (unless (flush) (bye)) ) )
>>
>> What happens if you do (bye) after the 'for' loop is done?
>>
>> I assume that 'getArticles' is executed in the (eval @) below
>>
>>> (task (port (+ *IdxNum 4040))
>>>    (let? Sock (accept @)
>>>       (unless (fork)
>>>          (in Sock
>>>             (while (rd)
>>>                (sync)
>>>                (out Sock
>>>                   (eval @) ) ) )
>>>          (bye) )
>>>       (close Sock) ) )
>>
>> This looks OK, because (bye) is called after the while loop is done.
>> Perhaps there is something in the way 'getArticles' is invoked here? You
>> could change the second last line to (! bye) and see if it is indeed
>> reached. I would suspect it isn't.
>>
>> Cheers,
>> - Alex
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
Everything is running smoothly now, I intend to make a write up on the
wiki this weekend maybe on this.

One thing first though: since articles are indexed when they're parsed,
and PL isn't doing any kind of sorting automatically on insert, then they
should be sorted by date automatically, with the latest articles at the
end of the database file, since I suppose they're just appended?

How can I simply start walking from the end of the file until I've found,
say, 25 matches? This procedure should be the absolutely fastest way of
getting what I want.

I know about your iter example earlier and it seems like a good fit if it
starts walking at the right end?

On Tue, May 11, 2010 at 9:09 AM, Alexander Burger wrote:
> On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
>> My code simply stops executing (as if waiting for the next entry but
>> it never gets it) when I run out of entries to fetch, really strange,
>> and a traceAll confirms this: the last output is a call to rd1>.
>
> What happens on the remote side, after all entries are sent? If the
> remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
> it is done.
>
>> This is my rd1>:
>>
>> (dm rd1> (Sock)
>>    (or
>>       (in Sock (rd))
>>       (nil
>>          (close Sock) ) ) )
>
> This looks all right, but isn't obviously the problem, as it hangs in
> 'rd'.
>
>> (de getArticles (W)
>>    (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
>>       (pr (cons (; Wc article) (; Wc picoStamp)))
>>       (unless (flush) (bye)) ) )
>
> What happens if you do (bye) after the 'for' loop is done?
>
> I assume that 'getArticles' is executed in the (eval @) below
>
>> (task (port (+ *IdxNum 4040))
>>    (let? Sock (accept @)
>>       (unless (fork)
>>          (in Sock
>>             (while (rd)
>>                (sync)
>>                (out Sock
>>                   (eval @) ) ) )
>>          (bye) )
>>       (close Sock) ) )
>
> This looks OK, because (bye) is called after the while loop is done.
> Perhaps there is something in the way 'getArticles' is invoked here? You
> could change the second last line to (! bye) and see if it is indeed
> reached. I would suspect it isn't.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
> My code simply stops executing (as if waiting for the next entry but
> it never gets it) when I run out of entries to fetch, really strange,
> and a traceAll confirms this: the last output is a call to rd1>.

What happens on the remote side, after all entries are sent? If the remote
doesn't 'close' (or 'bye'), then the receiving end doesn't know it is
done.

> This is my rd1>:
>
> (dm rd1> (Sock)
>    (or
>       (in Sock (rd))
>       (nil
>          (close Sock) ) ) )

This looks all right, but isn't obviously the problem, as it hangs in
'rd'.

> (de getArticles (W)
>    (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
>       (pr (cons (; Wc article) (; Wc picoStamp)))
>       (unless (flush) (bye)) ) )

What happens if you do (bye) after the 'for' loop is done?

I assume that 'getArticles' is executed in the (eval @) below

> (task (port (+ *IdxNum 4040))
>    (let? Sock (accept @)
>       (unless (fork)
>          (in Sock
>             (while (rd)
>                (sync)
>                (out Sock
>                   (eval @) ) ) )
>          (bye) )
>       (close Sock) ) )

This looks OK, because (bye) is called after the while loop is done.
Perhaps there is something in the way 'getArticles' is invoked here? You
could change the second last line to (! bye) and see if it is indeed
reached. I would suspect it isn't.

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
My code simply stops executing (as if waiting for the next entry but it
never gets it) when I run out of entries to fetch, really strange, and a
traceAll confirms this: the last output is a call to rd1>.

I know for a fact that 2 results should be returned, but then when I try
to fetch the third and think I should get NIL, something goes really
wrong, some race condition or a never ending wait for something that
refuses to happen.

This is my rd1>:

(dm rd1> (Sock)
   (or
      (in Sock (rd))
      (nil
         (close Sock) ) ) )

And on the remote:

(de getArticles (W)
   (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
      (pr (cons (; Wc article) (; Wc picoStamp)))
      (unless (flush) (bye)) ) )

And the go of the remote:

(de go ()
   ..
   (rollback)
   (task (port (+ *IdxNum 4040))
      (let? Sock (accept @)
         (unless (fork)
            (in Sock
               (while (rd)
                  (sync)
                  (out Sock
                     (eval @) ) ) )
            (bye) )
         (close Sock) ) )
   (forked) )

On Mon, May 10, 2010 at 9:50 AM, Alexander Burger wrote:
> On Mon, May 10, 2010 at 09:04:48AM +0200, Henrik Sarvell wrote:
>> Ah I see, so the issue is on the remote side then, what did your code
>> look like there, did you use (prove)?
>
> There were several scenarios. In cases where only a few hits are to be
> expected, I used 'collect':
>
>    (for Obj (collect 'var '+Cls (...))
>       (pr Obj)
>       (unless (flush) (bye)) )
>
> The 'flush' is there for two purposes: (1) to get the data sent
> immediately (without holding it in a local buffer), and (2) to have
> immediate feedback. When the receiving side should close the connection
> (i.e. the GUI is not interested in more results, or the client has
> quit), 'flush' returns NIL and the local query can be terminated.
>
> In other cases, where there were potentially many hits (so that I didn't
> want to use 'collect'), I used the low-level tree iteration function
> 'iter' (which is also used internally by 'collect'):
>
>    (iter (tree 'var '+Cls)
>       '((Obj)
>          (pr Obj)
>          (unless (flush) (bye)) )
>       (cons From)
>       (cons Till T) )
>    (bye)
>
> So 'iter' is quite efficient, as it avoids the overhead of Pilog, but
> can still deliver an unlimited number of hits.
>
> Note, however, that you have to pass the proper 'from' and 'till'
> arguments. They must have the right structure for the index tree's key.
> For a '+Key' index this would be simply 'From' and 'Till'. For a '+Ref'
> (like in the shown case) it must be '(From . NIL)' and '(Till . T)'.
> 'db', 'collect' and the Pilog functions take care of such details
> automatically.
>
> For more complex queries, involving more than one index, yes, I used
> Pilog and 'prove'. Each call to 'prove' returns (and sends) a single
> object.
>
> For plain Pilog queries, i.e. without any special requirements like a
> defined sorting order, you can get along even without any custom
> functions/methods on the remote side. The 'remote/2' predicate can
> handle this transparently by executing its clauses on all remote
> machines. I have examples for that, but they are probably beyond the
> scope of this mail.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
On Mon, May 10, 2010 at 09:04:48AM +0200, Henrik Sarvell wrote:
> Ah I see, so the issue is on the remote side then, what did your code
> look like there, did you use (prove)?

There were several scenarios. In cases where only a few hits are to be
expected, I used 'collect':

   (for Obj (collect 'var '+Cls (...))
      (pr Obj)
      (unless (flush) (bye)) )

The 'flush' is there for two purposes: (1) to get the data sent
immediately (without holding it in a local buffer), and (2) to have
immediate feedback. When the receiving side should close the connection
(i.e. the GUI is not interested in more results, or the client has quit),
'flush' returns NIL and the local query can be terminated.

In other cases, where there were potentially many hits (so that I didn't
want to use 'collect'), I used the low-level tree iteration function
'iter' (which is also used internally by 'collect'):

   (iter (tree 'var '+Cls)
      '((Obj)
         (pr Obj)
         (unless (flush) (bye)) )
      (cons From)
      (cons Till T) )
   (bye)

So 'iter' is quite efficient, as it avoids the overhead of Pilog, but can
still deliver an unlimited number of hits.

Note, however, that you have to pass the proper 'from' and 'till'
arguments. They must have the right structure for the index tree's key.
For a '+Key' index this would be simply 'From' and 'Till'. For a '+Ref'
(like in the shown case) it must be '(From . NIL)' and '(Till . T)'. 'db',
'collect' and the Pilog functions take care of such details automatically.

For more complex queries, involving more than one index, yes, I used Pilog
and 'prove'. Each call to 'prove' returns (and sends) a single object.

For plain Pilog queries, i.e. without any special requirements like a
defined sorting order, you can get along even without any custom
functions/methods on the remote side. The 'remote/2' predicate can handle
this transparently by executing its clauses on all remote machines. I have
examples for that, but they are probably beyond the scope of this mail.

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
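To make the From/Till key structures concrete, a small sketch (with a
hypothetical 'nr' +Key relation and 'dat' +Ref relation on an imaginary
'+Item' class):

   (iter (tree 'nr '+Item) println 10 20)                   # +Key: plain keys
   (iter (tree 'dat '+Item) println (cons 10) (cons 20 T))  # +Ref: consed keys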
Re: Scaling issue
Ah I see, so the issue is on the remote side then, what did your code look
like there, did you use (prove)?

On Mon, May 10, 2010 at 7:22 AM, Alexander Burger wrote:
> Hi Henrik,
>
>> One final question, how did you define the rd1> mechanism?
>
> In the mentioned case, I used the following method in the +Agent class
>
> (dm rd1> (Sock)
>    (when (assoc Sock (: socks))
>       (rot (: socks) (index @ (: socks)))
>       (ext (: ext)
>          (or
>             (in Sock (rd))
>             (nil
>                (close Sock)
>                (pop (:: socks)) ) ) ) ) )
>
> This looks a little complicated, as each agent maintains a list of open
> sockets (in 'socks'). But if you omit the 'socks' management, it is
> basically just
>
> (ext (: ext) (in Sock (rd)))
>
> followed by 'close' if the remote side closed the connection.
>
>> Simply doing:
>>
>> (dm rd1> (Sock)
>>    (in Sock (rd)))
>>
>> will read the whole result, not just the first result, won't it?
>
> This should not be the case. It depends on what the other side sends. If
> it sends a list, you'll get the whole list. In the examples we
> discussed, however, the query results were sent one by one.
>
>> I'm a little bit confused since it says in the reference that rd will
>> "read the first item from the current input channel" but when I look
>
> Yes, analogous to 'read', 'line', 'char' etc.
>
>> Maybe something is needed on the remote? At the moment there is simply
>> a collect and sort by there.
>
> Could it be that the remote sends the result of 'collect'? This would be
> the whole list then.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
Hi Henrik,

> One final question, how did you define the rd1> mechanism?

In the mentioned case, I used the following method in the +Agent class

   (dm rd1> (Sock)
      (when (assoc Sock (: socks))
         (rot (: socks) (index @ (: socks)))
         (ext (: ext)
            (or
               (in Sock (rd))
               (nil
                  (close Sock)
                  (pop (:: socks)) ) ) ) ) )

This looks a little complicated, as each agent maintains a list of open
sockets (in 'socks'). But if you omit the 'socks' management, it is
basically just

   (ext (: ext) (in Sock (rd)))

followed by 'close' if the remote side closed the connection.

> Simply doing:
>
> (dm rd1> (Sock)
>    (in Sock (rd)))
>
> will read the whole result, not just the first result, won't it?

This should not be the case. It depends on what the other side sends. If
it sends a list, you'll get the whole list. In the examples we discussed,
however, the query results were sent one by one.

> I'm a little bit confused since it says in the reference that rd will
> "read the first item from the current input channel" but when I look

Yes, analogous to 'read', 'line', 'char' etc.

> Maybe something is needed on the remote? At the moment there is simply
> a collect and sort by there.

Could it be that the remote sends the result of 'collect'? This would be
the whole list then.

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
One final question, how did you define the rd1> mechanism? Simply doing:

(dm rd1> (Sock)
   (in Sock (rd)))

will read the whole result, not just the first result, won't it?

I'm a little bit confused since it says in the reference that rd will
"read the first item from the current input channel" but when I look at my
current usage of rd I get the feeling it will read the whole result?

Maybe something is needed on the remote? At the moment there is simply a
collect and sort by there.

I hope I'm not too cryptic.

/Henrik

On Sun, Apr 25, 2010 at 5:08 PM, Henrik Sarvell wrote:
> Ah so the key is to have the connections in a list, I should have
> understood that.
>
> Thanks for the help, I'll try it out!
>
> On Sun, Apr 25, 2010 at 4:51 PM, Alexander Burger
> wrote:
>> On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
>>> So I gather the *Ext mapping is absolutely necessary regardless of
>>> whether remote or ext is used.
>>
>> Yes.
>>
>> Only in case you do not intend to communicate whole objects between the
>> remote and local application, but only scalar data like strings,
>> numbers, or lists of those. I would say this would be quite a
>> limitation. You need to communicate whole objects, at least because you
>> want to compare them locally to find the biggest (see below).
>>
>>> I took a look at the *Ext section again, could I use this maybe:
>>>
>>> (setq *Ext  # Define extension functions
>>> ...
>>>    (off Sock) ) ) ) ) ) ) ) )
>>>    '(localhost localhost)
>>>    '(4041 4042)
>>>    (40 80) ) )
>>
>> Yes, that's good. The example in the docu was not sufficient, as it has
>> a single port hard-coded.
>>
>>> And then with 'ext' I need to create that single look ahead queue in
>>> the local code you talked about earlier, but how?
>>
>> The look-ahead queue of a single object per connection consisted simply
>> of a list, the first result sent from each remote host.
>>
>> What I did was:
>>
>> 1. Starting a new query, a list of connections to all remote hosts is
>>    opened:
>>
>>       (extract
>>          '((Agent)
>>             (query> Agent ) )
>>          (list of agents) )
>>
>>    This returns a list of all agent objects which succeeded to connect.
>>    I used that list to initialize a Pilog query.
>>
>> 2. Then you fetch the first answer from each connection. I used a method
>>    'rd1>' in the agent class for that:
>>
>>       (extract 'rd1> (list of open agents))
>>
>>    'extract' is used here, as it behaves like 'mapcar' but filters all
>>    NIL items out of the result. A NIL item will be returned in the first
>>    'extract' if the connection cannot be opened, and in the second one
>>    if that remote host has no results to send.
>>
>>    So now you have a list of results, the first (highest, biggest,
>>    newest?) object from each remote host.
>>
>> 3. Now the main query loop starts. Each time a new result is requested,
>>    e.g. from the GUI, you just need to find the object with the highest,
>>    biggest, newest attribute in that list. You take it from the list
>>    (e.g. with 'prog1'), and immediately fill the slot in the list by
>>    calling 'rd1>' for that host again.
>>
>>    If that 'rd1>' returns NIL, it means this remote host has no more
>>    results, so you delete it from the list of open agents. If it returns
>>    non-NIL, you store the read value into the slot.
>>
>> In that way, the list of received items constitutes a kind of look-ahead
>> structure, always containing the items which might be returned next to
>> the caller.
>>
>>> I mean at the moment the problem is that I get too many articles in my
>>> local code since all the remotes send all their articles at once, thus
>>> swamping
>>
>> There cannot be any swamping. All remote processes will send their
>> results, yes, but only until the TCP queue fills up, or until they have
>> no more results. The local process doesn't see anything of that, it just
>> fetches the next result with 'rd1>' whenever it needs one.
>>
>> You don't have to worry at all whether the GUI calls for the next result
>> 50 times, or 1 times. Each time simply the next result is returned.
>> This works well, and produces no more load than is necessary.
>>
>> Cheers,
>> - Alex
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
Ah so the key is to have the connections in a list, I should have
understood that.

Thanks for the help, I'll try it out!

On Sun, Apr 25, 2010 at 4:51 PM, Alexander Burger wrote:
> On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
>> So I gather the *Ext mapping is absolutely necessary regardless of
>> whether remote or ext is used.
>
> Yes.
>
> Only in case you do not intend to communicate whole objects between the
> remote and local application, but only scalar data like strings,
> numbers, or lists of those. I would say this would be quite a
> limitation. You need to communicate whole objects, at least because you
> want to compare them locally to find the biggest (see below).
>
>> I took a look at the *Ext section again, could I use this maybe:
>>
>> (setq *Ext  # Define extension functions
>> ...
>>    (off Sock) ) ) ) ) ) ) ) )
>>    '(localhost localhost)
>>    '(4041 4042)
>>    (40 80) ) )
>
> Yes, that's good. The example in the docu was not sufficient, as it has
> a single port hard-coded.
>
>> And then with 'ext' I need to create that single look ahead queue in
>> the local code you talked about earlier, but how?
>
> The look-ahead queue of a single object per connection consisted simply
> of a list, the first result sent from each remote host.
>
> What I did was:
>
> 1. Starting a new query, a list of connections to all remote hosts is
>    opened:
>
>       (extract
>          '((Agent)
>             (query> Agent ) )
>          (list of agents) )
>
>    This returns a list of all agent objects which succeeded to connect.
>    I used that list to initialize a Pilog query.
>
> 2. Then you fetch the first answer from each connection. I used a method
>    'rd1>' in the agent class for that:
>
>       (extract 'rd1> (list of open agents))
>
>    'extract' is used here, as it behaves like 'mapcar' but filters all
>    NIL items out of the result. A NIL item will be returned in the first
>    'extract' if the connection cannot be opened, and in the second one
>    if that remote host has no results to send.
>
>    So now you have a list of results, the first (highest, biggest,
>    newest?) object from each remote host.
>
> 3. Now the main query loop starts. Each time a new result is requested,
>    e.g. from the GUI, you just need to find the object with the highest,
>    biggest, newest attribute in that list. You take it from the list
>    (e.g. with 'prog1'), and immediately fill the slot in the list by
>    calling 'rd1>' for that host again.
>
>    If that 'rd1>' returns NIL, it means this remote host has no more
>    results, so you delete it from the list of open agents. If it returns
>    non-NIL, you store the read value into the slot.
>
> In that way, the list of received items constitutes a kind of look-ahead
> structure, always containing the items which might be returned next to
> the caller.
>
>> I mean at the moment the problem is that I get too many articles in my
>> local code since all the remotes send all their articles at once, thus
>> swamping
>
> There cannot be any swamping. All remote processes will send their
> results, yes, but only until the TCP queue fills up, or until they have
> no more results. The local process doesn't see anything of that, it just
> fetches the next result with 'rd1>' whenever it needs one.
>
> You don't have to worry at all whether the GUI calls for the next result
> 50 times, or 1 times. Each time simply the next result is returned.
> This works well, and produces no more load than is necessary.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
> So I gather the *Ext mapping is absolutely necessary regardless of whether
> remote or ext is used.

Yes.

Only in case you do not intend to communicate whole objects between the
remote and local application, but only scalar data like strings, numbers,
or lists of those. I would say this would be quite a limitation. You need
to communicate whole objects, at least because you want to compare them
locally to find the biggest (see below).

> I took a look at the *Ext section again, could I use this maybe:
>
> (setq *Ext  # Define extension functions
> ...
>    (off Sock) ) ) ) ) ) ) ) )
>    '(localhost localhost)
>    '(4041 4042)
>    (40 80) ) )

Yes, that's good. The example in the docu was not sufficient, as it has a
single port hard-coded.

> And then with 'ext' I need to create that single look ahead queue in the
> local code you talked about earlier, but how?

The look-ahead queue of a single object per connection consisted simply of
a list, the first result sent from each remote host.

What I did was:

1. Starting a new query, a list of connections to all remote hosts is
   opened:

      (extract
         '((Agent)
            (query> Agent ) )
         (list of agents) )

   This returns a list of all agent objects which succeeded to connect. I
   used that list to initialize a Pilog query.

2. Then you fetch the first answer from each connection. I used a method
   'rd1>' in the agent class for that:

      (extract 'rd1> (list of open agents))

   'extract' is used here, as it behaves like 'mapcar' but filters all
   NIL items out of the result. A NIL item will be returned in the first
   'extract' if the connection cannot be opened, and in the second one
   if that remote host has no results to send.

   So now you have a list of results, the first (highest, biggest,
   newest?) object from each remote host.

3. Now the main query loop starts. Each time a new result is requested,
   e.g. from the GUI, you just need to find the object with the highest,
   biggest, newest attribute in that list. You take it from the list
   (e.g. with 'prog1'), and immediately fill the slot in the list by
   calling 'rd1>' for that host again.

   If that 'rd1>' returns NIL, it means this remote host has no more
   results, so you delete it from the list of open agents. If it returns
   non-NIL, you store the read value into the slot.

In that way, the list of received items constitutes a kind of look-ahead
structure, always containing the items which might be returned next to the
caller.

> I mean at the moment the problem is that I get too many articles in my local
> code since all the remotes send all their articles at once, thus swamping

There cannot be any swamping. All remote processes will send their
results, yes, but only until the TCP queue fills up, or until they have no
more results. The local process doesn't see anything of that, it just
fetches the next result with 'rd1>' whenever it needs one.

You don't have to worry at all whether the GUI calls for the next result
50 times, or 1 times. Each time simply the next result is returned.
This works well, and produces no more load than is necessary.

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
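A rough sketch of the merge loop in step 3, assuming a hypothetical global
'*Agents' holding the (Agent . Item) pairs built in steps 1 and 2, and
items that compare directly with '>':

   (de nextResult ()
      (when *Agents
         (let Best (maxi cdr *Agents)   # slot holding the "biggest" item
            (prog1 (cdr Best)           # return it to the caller
               (if (rd1> (car Best))    # and refill the slot from that agent
                  (con Best @)
                  (setq *Agents (delq Best *Agents)) ) ) ) ) )

Each call returns the globally next item and keeps exactly one look-ahead
object per open connection.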
Re: Scaling issue
So I gather the *Ext mapping is absolutely necessary regardless of whether
remote or ext is used.

I took a look at the *Ext section again, could I use this maybe:

(setq *Ext  # Define extension functions
   (mapcar
      '((@Host @Port @Ext)
         (let Sock NIL
            (cons @Ext
               (curry (@Host @Ext Sock) (Obj)
                  (when (or Sock (setq Sock (connect @Host @Port)))
                     (ext @Ext
                        (out Sock (pr (cons 'qsym Obj)))
                        (prog1 (in Sock (rd))
                           (unless @
                              (close Sock)
                              (off Sock) ) ) ) ) ) ) ) )
      '(localhost localhost)
      '(4041 4042)
      (40 80) ) )

And then with 'ext' I need to create that single look ahead queue in the
local code you talked about earlier, but how?

I mean, at the moment the problem is that I get too many articles in my
local code since all the remotes send all their articles at once, thus
swamping the local process. I'll show you what I'm using now:

(dm evalAll> @
   (let Result
      (make
         (for N (getMachine> This "localhost")
            (later (chain (cons "void"))
               (eval> This N (rest)) ) ) )
      (wait 5000 (not (memq "void" Result)))
      Result ) )

(Note that this logic does not respect a multi machine environment, I will
add that when/if my current single machine is not enough.)

This one will evaluate code on all remotes and return all the results. If
the result contains, let's say, more than 10 000 articles it will choke as
it is now. That's why I need that single look ahead you talked about, but
I don't know how to implement it.

If it was just about returning the 25 newest articles I could have each
remote simply return the 25 newest ones and then sort again locally. In
that case I would get 50 back and not 10 000 in this case. And when I want
the next result, which will be 25-50, I suppose I could return 50 from
each remote then, but this is a very ugly solution that doesn't scale very
well (see the sketch after this message).

On Sun, Apr 25, 2010 at 12:05 PM, Alexander Burger wrote:
> Hi Henrik,
>
>> I've reviewed the *Ext part in the manual and I will need something
>> different as I will have several nodes on each machine on different ports
>> (starting with simply localhost). I suppose I could have simply modified
>> it if I had had one node per machine?
>
> With "node" you mean a server process? What makes you think that the
> example limits it to one node? IIRC, the example is in fact a simplified
> version (perhaps too simplified?) of a system where there were many
> servers, of equal and different types, on each host.
>
>> Anyway, what would the whole procedure you've described look like if I
>> have two external nodes listening on 4041 and 4042 respectively but on
>> localhost both of them, and the E/R in question looks like this?:
>>
>> (class +Article +Entity)
>> (rel aid (+Key +Number))
>> (rel title (+String))
>> (rel htmlUrl (+Key +String)) #
>> (rel body (+Blob))
>> (rel pubDate (+Ref +Number))
>
> Side question: Is there a special reason why 'pubDate' is a '+Number'
> and not a '+Date'? Should work that way, though.
>
>> In this case I want to fetch article 25 - 50 sorted by pubDate from both
>> nodes
>
> Unfortunately, this cannot be achieved directly with an '+Aux' relation,
> because the article number and the date cannot be organized into a
> single index with a primary and secondary sorting criterion.
>
> There is no other way than fetching and then sorting them, I think:
>
>    (by '((This) (: pubDate)) sort (collect 'aid '+Article 25 50))
>
> Thus, the "send" part from a node to the central server would be
>
>    (for Article
>       (by
>          '((This) (: pubDate))
>          sort
>          (collect 'aid '+Article 25 50) )
>       (pr Article)       # Send the article object
>       (NIL (flush)) )    # Flush the socket
>
> The 'flush' is important, not so much to immediately send the data, but
> to detect whether the other side (the central server) has closed the
> connection, perhaps because it isn't interested in further data.
>
> 'flush' returns NIL if it cannot send the data successfully, and thus
> causes the 'for' loop to terminate.
>
>> So as far as I've understood it a (setq *Ext ... ) section is needed and
>> then the specific logic described in your previous post in the form of
>> something using 'ext' or maybe 'remote'?
>
> Yes. '*Ext' is necessary if remote objects are accessed locally.
>
> 'remote' might be handy if Pilog is used for remote queries. This is not
> the case in the above example.
>
> But 'ext' is needed on the central server, with the proper offsets for
> the clients. This can all be encapsulated in the +Agent objects.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
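The local re-sort mentioned in Henrik's mail might look roughly like this
(a sketch; 'Results' is assumed to be a flat list of (article . pubDate)
pairs, 25 collected from each remote):

   (head 25
      (flip
         (by cdr sort Results) ) )   # newest 25 across all remotes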
Re: Scaling issue
Hi Henrik,

> I've reviewed the *Ext part in the manual and I will need something
> different as I will have several nodes on each machine on different ports
> (starting with simply localhost). I suppose I could have simply modified
> it if I had had one node per machine?

With "node" you mean a server process? What makes you think that the
example limits it to one node? IIRC, the example is in fact a simplified
version (perhaps too simplified?) of a system where there were many
servers, of equal and different types, on each host.

> Anyway, what would the whole procedure you've described look like if I
> have two external nodes listening on 4041 and 4042 respectively but on
> localhost both of them, and the E/R in question looks like this?:
>
> (class +Article +Entity)
> (rel aid (+Key +Number))
> (rel title (+String))
> (rel htmlUrl (+Key +String)) #
> (rel body (+Blob))
> (rel pubDate (+Ref +Number))

Side question: Is there a special reason why 'pubDate' is a '+Number' and
not a '+Date'? Should work that way, though.

> In this case I want to fetch article 25 - 50 sorted by pubDate from both
> nodes

Unfortunately, this cannot be achieved directly with an '+Aux' relation,
because the article number and the date cannot be organized into a single
index with a primary and secondary sorting criterion.

There is no other way than fetching and then sorting them, I think:

   (by '((This) (: pubDate)) sort (collect 'aid '+Article 25 50))

Thus, the "send" part from a node to the central server would be

   (for Article
      (by
         '((This) (: pubDate))
         sort
         (collect 'aid '+Article 25 50) )
      (pr Article)       # Send the article object
      (NIL (flush)) )    # Flush the socket

The 'flush' is important, not so much to immediately send the data, but to
detect whether the other side (the central server) has closed the
connection, perhaps because it isn't interested in further data.

'flush' returns NIL if it cannot send the data successfully, and thus
causes the 'for' loop to terminate.

> So as far as I've understood it a (setq *Ext ... ) section is needed and
> then the specific logic described in your previous post in the form of
> something using 'ext' or maybe 'remote'?

Yes. '*Ext' is necessary if remote objects are accessed locally.

'remote' might be handy if Pilog is used for remote queries. This is not
the case in the above example.

But 'ext' is needed on the central server, with the proper offsets for the
clients. This can all be encapsulated in the +Agent objects.

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
I've done some refactoring and rewriting of my +Agent; I will employ
various ways of fetching/setting remote data, but the technique you've
described above will be prominent.

I've reviewed the *Ext part in the manual and I will need something
different as I will have several nodes on each machine on different ports
(starting with simply localhost). I suppose I could have simply modified
it if I had had one node per machine?

Anyway, what would the whole procedure you've described look like if I
have two external nodes listening on 4041 and 4042 respectively but on
localhost both of them, and the E/R in question looks like this?:

(class +Article +Entity)
(rel aid (+Key +Number))
(rel title (+String))
(rel htmlUrl (+Key +String)) #
(rel body (+Blob))
(rel pubDate (+Ref +Number))

In this case I want to fetch article 25 - 50 sorted by pubDate from both
nodes (if additional relations are needed to facilitate the sorting, feel
free to add them to the E/R).

So as far as I've understood it, a (setq *Ext ... ) section is needed and
then the specific logic described in your previous post in the form of
something using 'ext' or maybe 'remote'?

/Henrik

On Wed, Apr 21, 2010 at 8:08 PM, Alexander Burger wrote:
> On Wed, Apr 21, 2010 at 06:35:30PM +0200, Henrik Sarvell wrote:
>> At first my remotes will be on the same machine so yes they could all be
>> forked from the main process.
>
> That's all right. On the other hand, the remote processes might be
> different _programs_ (i.e. starting from a separate 'main', 'go' etc.),
> so I would rather expect them not to fork from the same process.
>
> On which machine(s) all these processes run at the end of the day
> doesn't matter. Can well be all on localhost.
>
>> I suppose that means they will start to block and what consequences will
>> there be? If very bad how do I prevent it in the best way?
>
> Blocking is no problem at all in this context. As we discussed in
> previous examples (and also as shown in the docu of *Ext and related
> functions), the remote server spawns a child process for each query
> request. Such a query can block if the central server doesn't eat away
> all results quickly enough, but this doesn't matter.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
On Wed, Apr 21, 2010 at 06:35:30PM +0200, Henrik Sarvell wrote:
> At first my remotes will be on the same machine so yes they could all be
> forked from the main process.

That's all right. On the other hand, the remote processes might be
different _programs_ (i.e. starting from a separate 'main', 'go' etc.), so
I would rather expect them not to fork from the same process.

On which machine(s) all these processes run at the end of the day doesn't
matter. Can well be all on localhost.

> I suppose that means they will start to block and what consequences will
> there be? If very bad how do I prevent it in the best way?

Blocking is no problem at all in this context. As we discussed in previous
examples (and also as shown in the docu of *Ext and related functions),
the remote server spawns a child process for each query request. Such a
query can block if the central server doesn't eat away all results quickly
enough, but this doesn't matter.

Cheers,
- Alex
--
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
One small question before I start working. > Now all remote databases start sending their results, ordered by date. > They are actually busy only until the TCP queue fills up, or until the > connection is closed. If the queue is filled up, they will block, so it > is advisable that they are all fork'ed children. At first my remotes will be on the same machine, so yes, they could all be forked from the main process. In the future though I might want to put them on different machines; that's why I want them to be completely separate processes right from the start. I suppose that means they will start to block, and what consequences will there be? If very bad, how do I prevent it in the best way? Sorry for the possibly silly questions, but my knowledge of TCP is very limited.
Re: Scaling issue
That was a clever one I must say :) OK, I'll redistribute the articles first and then get back to you for a few of the details with regard to the above. The above can then also be used to fetch articles by feed, tag or any other attribute, because they must always be sorted by date.
Re: Scaling issue
Hi Henrik, > So with the refs in place I could use the full remote logic to run pilog > queries on the remotes. OK > Now a search is made for all articles containing the word "picolisp" for > instance. I then need to be able to get an arbitrary slice back of the total > which needs to be sorted by time. I have a hard time understanding how this > can be achieved in any sensible way except through one of the following: > > Central Command: > ... > Cascading: > ... I think both solutions are feasible. This is because you are in the lucky situation that you can separate the articles on the remote machine according to their age. In a general case (e.g. if the data are not "archived" like here, but are subject to permanent change), this would not be so easy. However: I think there is a solution that is simpler, as well as more general (not assuming anything about the locations of the articles). I did this in another project, where I collected items from remote machines sorted by attributes (not date, but counts and sizes). The first thing is that you define the index to be an +Aux, combining the search key with the date. So if you search for a key like "picolisp" on a single remote machine, you get all hits sorted by date. No extra sorting required. Then, each remote machine has a function (e.g. 'sendRefDatArticles') defined, which simply iterates the index tree (with 'collect', or better a Pilog query) and sends each found object with 'pr' to the current output channel. When it has sent all hits, it terminates. Then on the central server you open connections with *Ext enabled to each remote client. This can be done with +Agent objects taking care of the details (maintaining the connections, communicating via 'ext' etc.). The actual query then sends out a remote command like

   (start> Agent 'sendRefDatArticles "picolisp" (someDate))

Now all remote databases start sending their results, ordered by date. They are actually busy only until the TCP queue fills up, or until the connection is closed. If the queue is filled up, they will block, so it is advisable that they are all fork'ed children. The central server then reads a single object from each connection into a list. Now, to return the results one by one to the actual caller (e.g. the GUI), it always picks the object with the highest date from that list, and reads the next item into that place in the list. The list is effectively a single-object look-ahead on each connection. When one of the connections returns NIL, it means the list of hits on that machine is exhausted, the remote child process terminated, and the connection can be closed. So the GUI calls a function (or, more probably, a proper Pilog predicate) which always returns the next available object with the highest date. With that, you can fetch 1, 25, or thousands of objects in order. Cheers, - Alex -- UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
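Sketched in code (not from the original mails), the look-ahead merge could look roughly like this. Assumptions: 'Socks' holds the open connections, each remote is already streaming its hits newest-first with 'pr', the articles have a 'dat' property, '*Ext' is configured so that 'get' works on the received objects, and the per-connection 'ext' offsets as well as the closing of exhausted connections are left to the +Agent layer:

   (de mergeByDate (Socks N)
      (let Heads                       # One look-ahead object per socket
         (mapcar '((S) (cons (in S (rd)) S)) Socks)
         (make
            (do N
               (NIL (setq Heads (filter car Heads)))  # All exhausted?
               (let Best (maxi '((H) (get (car H) 'dat)) Heads)
                  (link (car Best))                   # Emit the newest hit
                  (set Best (in (cdr Best) (rd))) ) ) ) ) )  # Refill slot

   # E.g. the first 25 hits, in date order:
   # (mergeByDate Socks 25)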
Re: Scaling issue
I've been reading up a bit on the remote stuff. I haven't made the articles distributed yet, but let's assume I have, with 10 000 articles per remote. Let's also assume that I have remade the word indexes to now work with real +Ref +Links on each remote that link words and articles (not simply numbers for subsequent use with (id) locally). So with the refs in place I could use the full remote logic to run pilog queries on the remotes. Now a search is made for all articles containing the word "picolisp" for instance. I then need to be able to get an arbitrary slice back of the total which needs to be sorted by time. I have a hard time understanding how this can be achieved in any sensible way except through one of the following:

Central Command:

1.) The remotes are set up so that remote one contains the oldest articles, remote two the second oldest articles and so on (this is the case naturally, as a new remote is spawned when the newest one is "full").

2.) Each remote then returns how many articles it has that contain "picolisp". This is needed for the pagination anyway, in order to display the correct number of pages, and can be done pretty trivially through the count tree mechanism described earlier in this thread.

3.) The local logic now determines which remote(s) should be queried in order to get 25 correct articles, issues the queries to be executed remotely and displays the returned articles.

If pagination is scrapped, the total count is not needed; it's possible to have a "More Results" button instead, and I'm fine with that kind of interface too. In most cases the count is not important for the user anyway. In that way the following might be possible:

Cascading:

1.) The newest remote is queried first and can quickly determine through the count tree that it has the requested articles, quickly fetches them and returns them.

2.) If it doesn't contain them, it will pass on the request to the second newest remote, which might contain all of the requested articles, or a subset, in which case the missing ones will be returned from the third newest remote through the same mechanism.

3.) The end result is that the correct articles now end up in the first remote, which will return them to the local side.

Did I miss something? Might this problem be solved in a cleverer way? /Henrik
Re: Scaling issue
To simply be able to pass along simple commands like 'collect' and 'db', i.e. the *Ext stuff was overkill. That works just fine, except in this special case where there are thousands of articles in a feed. I'm planning to distribute the whole DB except users and what feeds they subscribe to. Everything else will be article-centric and remote. I will also keep local records of which feeds have articles in which remote, so I don't query remotes for nothing.
Re: Scaling issue
On Thu, Apr 15, 2010 at 09:12:18AM +0200, Henrik Sarvell wrote: > On the other hand, if I'm to follow my own thinking to its logical > conclusion I should make the articles distributed too, with blobs and all. What was the rationale to use object IDs instead of direct remote access via '*Ext'? I can't remember at the moment. -- UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
Hi Henrik, > Could the *Ext functionality still be used somehow? I have a hard time > understanding how if I don't map the feed (parent) -> article (child) > relationship remotely, I mean at some point I will have to filter all Sorry, I probably lost the overview of the total application structure. But if I understand the question right: Though *Ext is not intended in that way (it gives access to the complete remote object), you might still use it locally with 'id', as *Ext preserves the object id (it only maps the DB file number part to the remote range). That is, if you apply 'id' to an object received from a remote DB, it still gives the correct number. Does this help? Cheers, - Alex -- UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
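As a two-line illustration of this point (assuming 'Sock' and the offset 'Offs' match the local '*Ext' setup):

   (let Obj (in Sock (ext Offs (rd)))  # An object read from a remote DB
      (id Obj) )                       # -> the same number as on the remote side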
Re: Scaling issue
On the other hand, if I'm to follow my own thinking to its logical conclusion I should make the articles distributed too, with blobs and all.
Re: Scaling issue
I don't know, Alex; remember that we disconnected stuff. I'll paste the remote E/R again (all of it, there is nothing else on the remotes):

(class +WordCount +Entity)
(rel article (+Ref +Number))
(rel word (+Aux +Ref +Number) (article))
(rel count (+Number))

The numbers here can then be used in the main app with (id) to actually locate the objects in question. Could the *Ext functionality still be used somehow? I have a hard time understanding how, if I don't map the feed (parent) -> article (child) relationship remotely. I mean, at some point I will have to filter all retrieved articles against a set of articles fetched locally (all articles belonging to my Twitter feed), if I don't store the connections remotely. Storing the feed -> article links remotely will let me avoid checking locally, and it's that check that is the bottleneck at the moment. I suppose you could find some clever way of speeding up the local filtering; at the moment I'm simply loading all Twitter articles with collect and then throwing away all remotely retrieved articles that are not in that list. However, that just seems like a duct tape solution; even if it works to begin with, it won't work for long. /Henrik
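The filtering described here might be sketched as follows ('Feed' and 'Hits' are assumed names: the feed object, and the article numbers returned by the word index servers; as described above, (id N) locates the article locally):

   (let Local (collect 'feed '+ArFeLink Feed Feed 'article)  # All the feed's articles
      (filter '((N) (member (id N) Local)) Hits) )           # Keep matching hits

With thousands of articles in 'Local', this linear 'member' scan per hit is what makes the check expensive.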
Re: Scaling issue
On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote: > Thanks Alex, I will go for the reversed range and check out select/3. Let me mention that since picoLisp-3.0.1 we have a separate documentation of 'select/3', in "doc/select.html". -- UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
Re: Scaling issue
Thanks Alex, I will go for the reversed range and check out select/3. I'm already using collect with dates extensively, but in this case it wouldn't work, as I need the 25 newest regardless of exactly when they were published. /Henrik
Re: Scaling issue
On Sun, Apr 11, 2010 at 12:25:42PM +0200, Henrik Sarvell wrote: > What's additionally needed is: > > 1.) Calculating total count somehow without retrieving all articles. If it is simply the count of all articles in the DB, you can get it directly from a '+Key' or '+Ref' index. I don't quite remember the E/R model, but I found this in an old mail:

   (class +Article +Entity)
   (rel aid (+Key +Number))
   (rel title (+Idx +String))
   (rel htmlUrl (+Key +String))

With that, (count (tree 'aid '+Article)) or (count (tree 'htmlUrl '+Article)) will give the number of all articles having the property 'aid' or 'htmlUrl' (not, however, via 'title', as an '+Idx' index creates more than one tree node per object). If you need distinguished counts (e.g. for groups of articles or according to certain features), it might be necessary to build more indexes, or simply maintain counts during import. > 2.) Somehow sorting by date so I get say the 25 first articles. This is also best done with a dedicated index, e.g.

   (rel dat (+Ref +Date))

in '+Article'. Then you could specify a reversed range (T . NIL) for a Pilog query

   (? (db dat +Article (T . NIL) @Article) (show @Article))

This will start with the newest article, and step backwards. Even easier might be to specify a range of dates, say from today till one week ago. Then you could use 'collect'

   (collect 'dat '+Article (date) (- (date) 7))

or, as 'today' is not very informative,

   (collect 'dat '+Article T (- (date) 7))

> When searching for articles belonging to a certain feed containing a word in > the content I now let the distributed indexes return all articles and then I > simply use filter to get at the articles. And to do that I of course need to > fetch all the articles in a certain feed, which works fine for most feeds > but not Twitter as it now probably contains more than 10 000 articles. I think that usually it should not be necessary to fetch all articles, if you build a combined query with the 'select/3' predicate. > The only solution I can see to this is to simply store the feed -> article > mapping remotely too, i.e. each word index server contains this info too for > ... > Then I could simply filter by feed remotely. Not sure. But I feel that I would use distributed processing here only if there is no other way (i.e. the parallel search with 'select/3'). Cheers, - Alex -- UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
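Building on the reversed range above, the 25 newest articles might be fetched without assuming any date window, stepping the query with 'prove' (a sketch following the same 'goal'/'prove' pattern as the 'do 20' example further down in this thread):

   (let Q (goal '((db dat +Article (T . NIL) @Article)))
      (make
         (do 25
            (NIL (prove Q))                # Stop when no more articles
            (link (get @ '@Article)) ) ) ) # Collect the next newest one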
Re: Scaling issue
I see, I should've known about that one (I'm using it to get similar articles already). What's additionally needed is:

1.) Calculating total count somehow without retrieving all articles.

2.) Somehow sorting by date so I get say the 25 first articles.

If those two can also be achieved in a manner that won't require me to fetch all articles, then I can use Pilog in this manner to fetch the results when it comes to getting all articles under all feeds under a specific tag. At the moment I'm fetching all of them at once and using 'head'; not optimal. However, it won't work with the word indexes; a redesign of how the system works is needed, I think. When searching for articles belonging to a certain feed containing a word in the content, I now let the distributed indexes return all articles, and then I simply use 'filter' to get at the articles. And to do that I of course need to fetch all the articles in a certain feed, which works fine for most feeds, but not Twitter, as it now probably contains more than 10 000 articles. The only solution I can see to this is to simply store the feed -> article mapping remotely too, i.e. each word index server contains this info too for the articles it's mapping, resulting in an E/R section looking like this:

(class +WordCount +Entity) #
(rel article (+Ref +Number))
(rel word (+Aux +Ref +Number) (article))
(rel count (+Number))

(class +ArFeLink +Entity)
(rel article (+Aux +Ref +Number) (feed))
(rel feed (+Ref +Number))

Then I could simply filter by feed remotely. /Henrik
Re: Scaling issue
Hi Henrik,

> (class +ArFeLink +Entity)
> (rel article (+Aux +Ref +Link) (feed) NIL (+Article))
> (rel feed (+Ref +Link) NIL (+Feed))
>
> (collect 'feed '+ArFeLink Obj Obj 'article) takes forever (2 mins); I need it to take something like maximum 2 seconds...
>
> Can this be fixed by adding some index or key, or do I need to make this part of the DB distributed and chopped up, so I can run this in parallel?

This is already the proper index. Is it perhaps the case that there are simply too many articles fetched at once? How many articles does the above 'collect' return? And are these articles all needed at that time? If you talk about 2 seconds, I assume you don't want the user having to wait, so it is a GUI interaction. In such cases it is typical not to fetch all data from the DB, but only the first chunk, e.g. to display them in the GUI. It would be better, then, to use a Pilog query, returning the results one by one (as done in +QueryChart). To get results analogous to the above 'collect', you could create a query like

   (let Q
      (goal
         (quote
            @Obj Obj
            (db feed +ArFeLink @Obj @Feed)
            (val @Article @Feed article) ) )
      ...
      (do 20                     # Then fetch the first 20 articles
         (NIL (prove Q))         # More?
         (bind @                 # Bind the result values
            (println @Article)   # Use the article
            ...

Instead of 'bind' you could also simply use 'get' to extract the @Article: (get @ '@Article). Before doing so, I would test it interactively, e.g.

   : (? (db feed +ArFeLink {ART} @Feed) (val @Article @Feed article))

if '{ART}' is an article. Note that the above is not tested. Cheers, - Alex -- UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe