Re: Scaling

2011-03-24 Thread Alexander Burger
Hi Thorsten,

> > Distribution involves separate machines, connected via TCP. On each
> > machine, typically several PicoLisp database processes are running,

> Is my interpretation right, that the ' several PicoLisp database processes'
> running on one machine form a 'PicoLisp process family' that is considered
> as one application with one database?

Yes, but there may also be several such "families" on a single machine.

A single application, operating on a single database, consists of a
parent process with an arbitrary number of child processes. This
structure is necessary because synchronization of all processes that
access a given database must go via a common parent (family IPC uses
simple pipes).
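
A minimal sketch of such a family (the path and port are arbitrary, and
the serving loop is the same pattern that shows up in the "Scaling
issue" thread further down):

   (pool "db/app/")                  # parent process opens the database

   (task (port 4040)                 # parent accepts client connections
      (let? Sock (accept @)
         (unless (fork)              # children share the DB; they are
            (in Sock                 # synchronized through the parent
               (while (rd)
                  (sync)
                  (out Sock (eval @)) ) )
            (bye) )
         (close Sock) ) )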

"A single database" means usually a single directory, containing all
files of that database. Theoretically, a database may consist of
maximally 65536 files, but this dosn't make sense in a typical Unix
environment, because of too many file descriptors and other resource
problems. A single file can contain maximally 4 Tera objects (42 bit
object ID).
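
For illustration only, such a directory is typically laid out over a few
files with 'dbs' before 'pool' opens it. The sizes and the placement
below are arbitrary, and '+Article' just stands for some entity class
(one with that name appears in the "Scaling issue" thread further down):

   (dbs
      (1 +Article)                   # article objects in one file
      (2 (+Article aid htmlUrl))     # key indexes in a second file
      (3 (+Article pubDate)) )       # the date index in a third file

   (pool "db/app/" *Dbs)             # open the directory with these sizes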

It makes good sense to run several applications (= databases) on a
single machine, to get a better load distribution. I have no general
rule; some experimentation is required for optimal tuning. It depends
mostly on the number of CPU cores and the amount of available RAM (file
buffer cache).

For the program logic (how those applications communicate with each
other), it doesn't matter which application is running on which machine,
as long as everything is properly configured. I had an admin application
for connecting/starting/stopping the individual apps.


> How do you split up the databases? Rather by rows or rather by columns (I

Not on that level, but on a functional level. For example, we had many
databases (about 70) collecting data from filer volumes, sending some of
their data to a second layer (also 70), which in turn sent some boiled-down
data to a single dedicated database containing some global data.
Another front-end application queried all the lower levels to generate
statistics and user reports, and contained a rule database (in Pilog) about
what to do on the lower levels.

> know they are not 2D tables in picolisp, what I mean is: does every DB cover

Right.

> the whole class hierarchy, but only a fraction of the objects, or does each

Yes, this was the case for the first and second layer described above.
In each layer all databases had the same model (E/R definitions, in fact
the same program code).

> DB cover a fraction of the class hierarchy, but all objects belonging to
> these classes?

So each application is a complete class hierarchy in itself, independent
from (but knowing about) the other DBs.

But what I described was for that concrete use case. I had only a single
project with such large DBs until now. Probably many other designs are
possible. As Henrik said, stress is on ease of designing such
structures, not on a given framework. The philosophy of PicoLisp was
always to go for a vertical approach, with easy access from the lowest
to the highest levels.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: Scaling

2011-03-24 Thread Thorsten
Hi Alexander,

> Distribution involves separate machines, connected via TCP. On each
> machine, typically several PicoLisp database processes are running,
>


> Changes to the individual DBs have to be done the normal way (e.g. the
> 'put>' family of methods), where each application (PicoLisp process
> family) is maintaining its own DB.
>

Is my interpretation right, that the ' several PicoLisp database processes'
running on one machine form a 'PicoLisp process family' that is considered
as one application with one database? So it is one database per machine,
using several processes on that machine, that has to be changed
individually, but can be queried as part of a distributed network of databases
on several machines connected via TCP?

How do you split up the databases? Rather by rows or rather by columns (I
know they are not 2D tables in picolisp, what I mean is: does every DB cover
the whole class hierarchy, but only a fraction of the objects, or does each
DB cover a fraction of the class hierarchy, but all objects belonging to
these classes?

Cheers
Thorsten


Re: Scaling

2011-03-24 Thread Alexander Burger
Hi Thorsten,

in addition to what Henrik wrote:

> So dividing a database in several smaller files and accessing them with
> something like id or ext gives a distributed faster database, and when doing

Dividing the database into multiple files is the "normal" approach to
design a DB application in PicoLisp, so this is not what I would call
"distributed".

Distribution involves separate machines, connected via TCP. On each
machine, typically several PicoLisp database processes are running, and
they exchange objects via 'id' or 'ext', but - more importantly - can do
remote calls (via 'pr', 'rd' etc., i.e. the PLIO protocol mentioned in
the other mail) and remote queries (see "doc/refR.html#remote/2").
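
For illustration, such a remote call boils down to a PLIO round trip
like this on the calling side (the port and the expression sent are just
placeholders; the receiving process runs a (while (rd) ...) loop as
shown in the "Scaling issue" thread below):

   (let? Sock (connect "localhost" 4040)
      (out Sock (pr '(getArticles 42)))   # send an expression, PLIO-encoded
      (prog1 (in Sock (rd))               # read the reply the same way
         (close Sock) ) )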

Direct remote DB operations involve only read accesses (queries).
Changes to the individual DBs have to be done the normal way (e.g. the
'put>' family of methods), where each application (PicoLisp process
family) is maintaining its own DB.
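
On the owning side such a change is just the usual transaction, e.g.
(a sketch, where 'Article' stands for some object of that application):

   (dbSync)                             # synchronize with the process family
   (put> Article 'title "A new title")
   (commit 'upd)                        # commit and notify sibling processes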

Hmm, that's all rather hard to explain, and unfortunately not formally
documented yet (except for Henrik's great descriptions).


> so ie in an Amazon EC2 account the database might (automagically) end up on
> different servers, thus becoming faster and (almost endlessly) scalable.

Yes, though the current system doesn't have any mechanism for dynamic
relocation of database processes yet. Actually, I was planning
something along those lines, but the project where I would have needed
that was terminated :(


> Is anybody using Emacs/Gnus for this mailing list and can give some advice
> how to make that work?

Yes, our Argentinian friends. By now, they should be up ;-)

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe


Re: Scaling

2011-03-24 Thread Thorsten
Hi Henrik,
thanks, that's an interesting read.
So dividing a database into several smaller files and accessing them with
something like 'id' or 'ext' gives a distributed, faster database, and when
doing so, e.g. in an Amazon EC2 account, the database might (automagically)
end up on different servers, thus becoming faster and (almost endlessly)
scalable.
I have no practical experience with deploying PicoLisp or the Amazon cloud,
so I'm just guessing; I just want to get a general idea of what can be done
with PicoLisp and what not.

Thorsten

PS
Is anybody using Emacs/Gnus for this mailing list and can give some advice
how to make that work?


2011/3/24 Henrik Sarvell 

> Hi Thorsten.
>
> Here is a description of a real world example:
> http://picolisp.com/5000/-2-I.html
>
> In that article you will also find some links to functions that might or
> might not be of use to you, such as (ext).
>
> When it comes to distributed data and PicoLisp you don't get much for free
> (apart from the aforementioned ext functionality). It's more like a
> framework with which you are able to create something more specific.
>
> In short, you won't get something like Cassandra, Hadoop or Riak out of the
> box but you could certainly create something like them with the tools that
> you do have.
>
> And you could probably create something similar to those three with less
> hassle than it took to create them in their respective languages (Java /
> Erlang).
>
> /Henrik
>
>
>
> On Thu, Mar 24, 2011 at 6:11 PM, Thorsten <
> gruenderteam.ber...@googlemail.com> wrote:
>
>> Hallo,
>> I recently discovered (amazing) picolisp and have a few (I hope not too
>> naive) questions. I write one mail for each question to not mix up
>> things.
>>
>> I read in the documentation about distributed picolisp databases, the
>> ability to make picolisp apps faster and faster by adding hardware cores
>> (and using different pipes of the underlying Linux OS?), and the
>> possibility to deploy picolisp apps in the cloud. But these things are
>> only mentioned, without further explanation.
>>
>> Since scaling and concurrency are all the hype in the Java world (Scala,
>> Clojure), I would like to know a bit more about the capabilities and limits
>> of picolisp in this area, and how these things are achieved in practice
>> (i.e. how to deploy a picolisp app in the cloud?)
>>
>> Thanks
>> Thorsten
>>
>>
>


Re: Scaling

2011-03-24 Thread Henrik Sarvell
Hi Thorsten.

Here is a description of a real world example:
http://picolisp.com/5000/-2-I.html

In that article you will also find some links to functions that might or
might not be of use to you, such as (ext).

When it comes to distributed data and PicoLisp you don't get much for free
(apart from the aforementioned ext functionality). It's more like a
framework with which you are able to create something more specific.

In short, you won't get something like Cassandra, Hadoop or Riak out of the
box but you could certainly create something like them with the tools that
you do have.

And you could probably create something similar to those three with less
hassle than it took to create them in their respective languages (Java /
Erlang).

/Henrik


On Thu, Mar 24, 2011 at 6:11 PM, Thorsten <
gruenderteam.ber...@googlemail.com> wrote:

> Hallo,
> I recently discovered (amazing) picolisp and have a few (I hope not too
> naive) questions. I write one mail for each question to not mix up
> things.
>
> I read in the documentation about distributed picolisp databases, the
> ability to make picolisp apps faster and faster by adding hardware cores
> (and using different pipes of the underlying Linux OS?), and the
> possibility to deploy picolisp apps in the cloud. But these things are
> only mentioned, without further explanation.
>
> Since scaling and concurrency are all the hype in the Java world (Scala,
> Clojure), I would like to know a bit more about the capabilities and limits
> of picolisp in this area, and how these things are achieved in practice
> (i.e. how to deploy a picolisp app in the cloud?)
>
> Thanks
> Thorsten
>
>


Re: Scaling issue

2010-05-20 Thread Henrik Sarvell
I've summed up the result of this thread here:
http://picolisp.com/5000/-2-I.html with some explanations.

/Henrik



On Fri, May 14, 2010 at 8:59 AM, Henrik Sarvell  wrote:
> OK since I can't rely on sorting by date anyway let's forget that idea.
>
> Yes since it seemed I had to involve dates anyway I simply chose a
> date far back enough in time that if someone is looking for something
> they might as well use Google.
>
> Anyway the above is scanning 19 remotes containing indexes for 10 000
> articles each and returns in 3-4 seconds which is OK for me, problem
> solved as far as I'm concerned. I have to add though that all remotes
> are currently on the same machine, had they been truly distributed it
> would be faster, especially if the other machines were in the same
> data center.
>
> On Fri, May 14, 2010 at 7:55 AM, Alexander Burger  
> wrote:
>> On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
>>> One thing first though, since articles are indexed when they're parsed
>>> and PL isn't doing any kind of sorting automatically on insert then
>>> they should be sorted by date automatically with the latest articles
>>> at the end of the database file since I suppose they're just appended?
>>
>> While this is correct in principle, I would not rely on it. If there
>> should ever be an object deleted from that database file, the space
>> would be reused by the next new object, and the assumption would break.
>>
>>
>>> How can I simply start walking from the end of the file until I've
>>> found say 25 matches? This procedure should be the absolutely fastest
>>> way of getting what I want.
>>
>> Currently I see no easy way. The only function that walks a database
>> file directly is 'seq', but it can only step forwards.
>>
>>
>>> I know about your iter example earlier and it seems like a good fit if
>>> it starts walking in the right end?
>>
>> Yes, 'iter' (and the related 'scan') can walk in both directions. You
>> need only to pass inverted keys (i.e. Beg > End).
>>
>>
>> If I understand it right, however, you solved the problem in your next
>> mail(s) by using the date index, and starting at 6 months ago?
>>
>> Cheers,
>> - Alex
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>>
>
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-14 Thread Henrik Sarvell
OK since I can't rely on sorting by date anyway let's forget that idea.

Yes since it seemed I had to involve dates anyway I simply chose a
date far back enough in time that if someone is looking for something
they might as well use Google.

Anyway the above is scanning 19 remotes containing indexes for 10 000
articles each and returns in 3-4 seconds which is OK for me, problem
solved as far as I'm concerned. I have to add though that all remotes
are currently on the same machine, had they been truly distributed it
would be faster, especially if the other machines were in the same
data center.

On Fri, May 14, 2010 at 7:55 AM, Alexander Burger  wrote:
> On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
>> One thing first though, since articles are indexed when they're parsed
>> and PL isn't doing any kind of sorting automatically on insert then
>> they should be sorted by date automatically with the latest articles
>> at the end of the database file since I suppose they're just appended?
>
> While this is correct in principle, I would not rely on it. If there
> should ever be an object deleted from that database file, the space
> would be reused by the next new object, and the assumption would break.
>
>
>> How can I simply start walking from the end of the file until I've
>> found say 25 matches? This procedure should be the absolutely fastest
>> way of getting what I want.
>
> Currently I see no easy way. The only function that walks a database
> file directly is 'seq', but it can only step forwards.
>
>
>> I know about your iter example earlier and it seems like a good fit if
>> it starts walking in the right end?
>
> Yes, 'iter' (and the related 'scan') can walk in both directions. You
> need only to pass inverted keys (i.e. Beg > End).
>
>
> If I understand it right, however, you solved the problem in your next
> mail(s) by using the date index, and starting at 6 months ago?
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-13 Thread Alexander Burger
On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
> One thing first though, since articles are indexed when they're parsed
> and PL isn't doing any kind of sorting automatically on insert then
> they should be sorted by date automatically with the latest articles
> at the end of the database file since I suppose they're just appended?

While this is correct in principle, I would not rely on it. If there
should ever be an object deleted from that database file, the space
would be reused by the next new object, and the assumption would break.


> How can I simply start walking from the end of the file until I've
> found say 25 matches? This procedure should be the absolutely fastest
> way of getting what I want.

Currently I see no easy way. The only function that walks a database
file directly is 'seq', but it can only step forwards.


> I know about your iter example earlier and it seems like a good fit if
> it starts walking in the right end?

Yes, 'iter' (and the related 'scan') can walk in both directions. You
need only to pass inverted keys (i.e. Beg > End).
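
For example (just a sketch, using the 'aid' '+Key' index from the
'+Article' E/R further down in this thread; the key values are
arbitrary):

   (iter (tree 'aid '+Article)
      '((Obj) (println (; Obj aid)))
      999999999      # Beg: a high key
      0 )            # End: a low key, so the walk runs backwards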


If I understand it right, however, you solved the problem in your next
mail(s) by using the date index, and starting at 6 months ago?

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-13 Thread Henrik Sarvell
Sorry for the spam, but the prior listing is not correct; it didn't
manage to return results sorted by date. This one does, though:

(de getArticles (W)
   # Prove up to 25 index hits for word W from the last ~6 months
   # and send them as (article . picoStamp) pairs
   (let Goal
      (goal
         (quote
            @Word W
            @Date (cons (- (stamp> '+Gh) (* 6 31 86400)) (stamp> '+Gh))
            (select (@Wcs)
               ((picoStamp +WordCount @Date) (word +WordCount @Word))
               (same @Word @Wcs word)
               (range @Date @Wcs picoStamp) ) ) )
      (do 25
         (NIL (prove Goal))
         (bind @
            (pr (cons (; @Wcs article) (; @Wcs picoStamp)))
            (unless (flush) (bye)) ) )
      (bye) ) )



On Thu, May 13, 2010 at 9:36 PM, Henrik Sarvell  wrote:
> See my prior post for context.
>
> I've been testing a few different approaches and this is the fastest so far:
>
> (de getArticles (W)
>    (let Goal
>       (goal
>          (quote
>             @Word W
>             (select (@Wcs)
>                ((word +WordCount @Word))
>                (same @Word @Wcs word) ) ) )
>       (do 25
>          (NIL (prove Goal))
>          (bind @
>             (pr (cons (; @Wcs article) (; @Wcs picoStamp)))
>             (unless (flush) (bye)) ) )
>       (bye) ) )
>
> Where the remote ER is:
>
> (class +WordCount +Entity) #
> (rel article   (+Ref +Number))
> (rel word      (+Aux +Ref +Number) (article))
> (rel count     (+Number))
> (rel picoStamp (+Ref +Number))
>
>
>
> On Thu, May 13, 2010 at 9:12 PM, Henrik Sarvell  wrote:
>> Everything is running smoothly now, I intend to make a write up on the
>> wiki this weekend maybe on this.
>>
>> One thing first though, since articles are indexed when they're parsed
>> and PL isn't doing any kind of sorting automatically on insert then
>> they should be sorted by date automatically with the latest articles
>> at the end of the database file since I suppose they're just appended?
>>
>> How can I simply start walking from the end of the file until I've
>> found say 25 matches? This procedure should be the absolutely fastest
>> way of getting what I want.
>>
>> I know about your iter example earlier and it seems like a good fit if
>> it starts walking in the right end?
>>
>>
>>
>>
>> On Tue, May 11, 2010 at 9:09 AM, Alexander Burger  wrote:
>>> On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
 My code simply stops executing (as if waiting for the next entry but
 it never gets it) when I run out of entries to fetch, really strange
 and a traceAll confirms this, the last output is a call to rd1>.
>>>
>>> What happens on the remote side, after all entries are sent? If the
>>> remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
>>> it is done.
>>>
>>>
 This is my rd1>:

 (dm rd1> (Sock)
    (or
       (in Sock (rd))
       (nil
          (close Sock) ) ) )
>>>
>>> This looks all right, but isn't obviously the problem, as it hangs in
>>> 'rd'.
>>>
>>>
 (de getArticles (W)
    (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
       (pr (cons (; Wc article) (; Wc picoStamp)))
       (unless (flush) (bye)) ) )
>>>
>>> What happens if you do (bye) after the 'for' loop is done?
>>>
>>> I assume that 'getArticles' is executed in the (eval @) below
>>>
>>>
    (task (port (+ *IdxNum 4040))
       (let? Sock (accept @)
          (unless (fork)
             (in Sock
                (while (rd)
                   (sync)
                   (out Sock
                      (eval @) ) ) )
             (bye) )
          (close Sock) ) )
>>>
>>> This looks OK, because (bye) is called after the while loop is done.
>>> Perhaps there is something in the way 'getArticles' is invoked here? You
>>> could change the second last line to (! bye) and see if it is indeed
>>> reached. I would suspect it isn't.
>>>
>>> Cheers,
>>> - Alex
>>> --
>>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>>>
>>
>
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-13 Thread Henrik Sarvell
See my prior post for context.

I've been testing a few different approaches and this is the fastest so far:

(de getArticles (W)
   (let Goal
      (goal
         (quote
            @Word W
            (select (@Wcs)
               ((word +WordCount @Word))
               (same @Word @Wcs word) ) ) )
      (do 25
         (NIL (prove Goal))
         (bind @
            (pr (cons (; @Wcs article) (; @Wcs picoStamp)))
            (unless (flush) (bye)) ) )
      (bye) ) )

Where the remote ER is:

(class +WordCount +Entity) #
(rel article   (+Ref +Number))
(rel word  (+Aux +Ref +Number) (article))
(rel count (+Number))
(rel picoStamp (+Ref +Number))



On Thu, May 13, 2010 at 9:12 PM, Henrik Sarvell  wrote:
> Everything is running smoothly now, I intend to make a write up on the
> wiki this weekend maybe on this.
>
> One thing first though, since articles are indexed when they're parsed
> and PL isn't doing any kind of sorting automatically on insert then
> they should be sorted by date automatically with the latest articles
> at the end of the database file since I suppose they're just appended?
>
> How can I simply start walking from the end of the file until I've
> found say 25 matches? This procedure should be the absolutely fastest
> way of getting what I want.
>
> I know about your iter example earlier and it seems like a good fit if
> it starts walking in the right end?
>
>
>
>
> On Tue, May 11, 2010 at 9:09 AM, Alexander Burger  wrote:
>> On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
>>> My code simply stops executing (as if waiting for the next entry but
>>> it never gets it) when I run out of entries to fetch, really strange
>>> and a traceAll confirms this, the last output is a call to rd1>.
>>
>> What happens on the remote side, after all entries are sent? If the
>> remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
>> it is done.
>>
>>
>>> This is my rd1>:
>>>
>>> (dm rd1> (Sock)
>>>    (or
>>>       (in Sock (rd))
>>>       (nil
>>>          (close Sock) ) ) )
>>
>> This looks all right, but isn't obviously the problem, as it hangs in
>> 'rd'.
>>
>>
>>> (de getArticles (W)
>>>    (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
>>>       (pr (cons (; Wc article) (; Wc picoStamp)))
>>>       (unless (flush) (bye)) ) )
>>
>> What happens if you do (bye) after the 'for' loop is done?
>>
>> I assume that 'getArticles' is executed in the (eval @) below
>>
>>
>>>    (task (port (+ *IdxNum 4040))
>>>       (let? Sock (accept @)
>>>          (unless (fork)
>>>             (in Sock
>>>                (while (rd)
>>>                   (sync)
>>>                   (out Sock
>>>                      (eval @) ) ) )
>>>             (bye) )
>>>          (close Sock) ) )
>>
>> This looks OK, because (bye) is called after the while loop is done.
>> Perhaps there is something in the way 'getArticles' is invoked here? You
>> could change the second last line to (! bye) and see if it is indeed
>> reached. I would suspect it isn't.
>>
>> Cheers,
>> - Alex
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>>
>
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-13 Thread Henrik Sarvell
Everything is running smoothly now, I intend to make a write up on the
wiki this weekend maybe on this.

One thing first though, since articles are indexed when they're parsed
and PL isn't doing any kind of sorting automatically on insert then
they should be sorted by date automatically with the latest articles
at the end of the database file since I suppose they're just appended?

How can I simply start walking from the end of the file until I've
found say 25 matches? This procedure should be the absolutely fastest
way of getting what I want.

I know about your iter example earlier and it seems like a good fit if
it starts walking in the right end?




On Tue, May 11, 2010 at 9:09 AM, Alexander Burger  wrote:
> On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
>> My code simply stops executing (as if waiting for the next entry but
>> it never gets it) when I run out of entries to fetch, really strange
>> and a traceAll confirms this, the last output is a call to rd1>.
>
> What happens on the remote side, after all entries are sent? If the
> remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
> it is done.
>
>
>> This is my rd1>:
>>
>> (dm rd1> (Sock)
>>    (or
>>       (in Sock (rd))
>>       (nil
>>          (close Sock) ) ) )
>
> This looks all right, but isn't obviously the problem, as it hangs in
> 'rd'.
>
>
>> (de getArticles (W)
>>    (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
>>       (pr (cons (; Wc article) (; Wc picoStamp)))
>>       (unless (flush) (bye)) ) )
>
> What happens if you do (bye) after the 'for' loop is done?
>
> I assume that 'getArticles' is executed in the (eval @) below
>
>
>>    (task (port (+ *IdxNum 4040))
>>       (let? Sock (accept @)
>>          (unless (fork)
>>             (in Sock
>>                (while (rd)
>>                   (sync)
>>                   (out Sock
>>                      (eval @) ) ) )
>>             (bye) )
>>          (close Sock) ) )
>
> This looks OK, because (bye) is called after the while loop is done.
> Perhaps there is something in the way 'getArticles' is invoked here? You
> could change the second last line to (! bye) and see if it is indeed
> reached. I would suspect it isn't.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-11 Thread Alexander Burger
On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
> My code simply stops executing (as if waiting for the next entry but
> it never gets it) when I run out of entries to fetch, really strange
> and a traceAll confirms this, the last output is a call to rd1>.

What happens on the remote side, after all entries are sent? If the
remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
it is done.


> This is my rd1>:
> 
> (dm rd1> (Sock)
>    (or
>       (in Sock (rd))
>       (nil
>          (close Sock) ) ) )

This looks all right, but isn't obviously the problem, as it hangs in
'rd'.


> (de getArticles (W)
>    (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
>       (pr (cons (; Wc article) (; Wc picoStamp)))
>       (unless (flush) (bye)) ) )

What happens if you do (bye) after the 'for' loop is done?

I assume that 'getArticles' is executed in the (eval @) below


>    (task (port (+ *IdxNum 4040))
>       (let? Sock (accept @)
>          (unless (fork)
>             (in Sock
>                (while (rd)
>                   (sync)
>                   (out Sock
>                      (eval @) ) ) )
>             (bye) )
>          (close Sock) ) )

This looks OK, because (bye) is called after the while loop is done.
Perhaps there is something in the way 'getArticles' is invoked here? You
could change the second last line to (! bye) and see if it is indeed
reached. I would suspect it isn't.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-10 Thread Henrik Sarvell
My code simply stops executing (as if waiting for the next entry but
it never gets it) when I run out of entries to fetch, really strange
and a traceAll confirms this, the last output is a call to rd1>.

I know for a fact that 2 results should be returned but then when I
try to fetch the third and think I should get NIL something goes
really wrong, some race condition or a never ending wait for something
that refuses to happen.

This is my rd1>:

(dm rd1> (Sock)
   (or
      (in Sock (rd))
      (nil
         (close Sock) ) ) )

And on the remote:

(de getArticles (W)
   (for Wc (sortBy> '+Gh (collect 'word '+WordCount W) 'picoStamp)
      (pr (cons (; Wc article) (; Wc picoStamp)))
      (unless (flush) (bye)) ) )

And the go of the remote:

(de go ()
   ..
   (rollback)
   (task (port (+ *IdxNum 4040))          # accept connections on this port
      (let? Sock (accept @)
         (unless (fork)                   # child: serve this connection
            (in Sock
               (while (rd)                # read expressions via PLIO
                  (sync)
                  (out Sock
                     (eval @) ) ) )       # evaluate and send back the result
            (bye) )
         (close Sock) ) )
   (forked) )




On Mon, May 10, 2010 at 9:50 AM, Alexander Burger  wrote:
> On Mon, May 10, 2010 at 09:04:48AM +0200, Henrik Sarvell wrote:
>> Ah I see, so the issue is on the remote side then, what did your code
>> look like there, did you use (prove)?
>
> There were several scenarios. In cases where only a few hits are to be
> expected, I used 'collect':
>
>    (for Obj (collect 'var '+Cls (...))
>       (pr Obj)
>       (unless (flush) (bye)) )
>
> The 'flush' is there for two purposes: (1) to get the data sent
> immediately (without holding in a local buffer), and (2) to have an
> immediate feedback. When the receiving side should close the connection
> (i.e. the GUI is not interested in more results, or the client has
> quit), 'flush' returns NIL and the local query can be terminated.
>
>
> In other cases, where there were potentially many hits (so that I didn't
> want to use 'collect'), I used the low-level tree iteration function
> 'iter' (which is also used internally by 'collect'):
>
>    (iter (tree 'var '+Cls)
>       '((Obj)
>          (pr Obj)
>          (unless (flush) (bye)) )
>       (cons From)
>       (cons Till T) )
>    (bye) )
>
> So 'iter' is quite efficient, as it avoids the overhead of Pilog, but
> still can deliver an unlimited number of hits.
>
> Note, however, that you have to pass the proper 'from' and 'till'
> arguments. They must have the right structure of the index tree's key.
> For a '+Key' index this would be simply 'From' and 'Till'. For a '+Ref'
> (like in the shown case) it must be '(From . NIL)' and '(Till . T)'.
> 'db', 'collect' and the Pilog functions take care of such details
> automatically.
>
>
> For more complex queries, involving more than one index, yes, I used Pilog
> and 'prove'. Each call to 'prove' returns (and sends) a single object.
>
>
> For plain Pilog queries, i.e. without any special requirements like a
> defined sorting order, you can get along even without any custom
> functions/methods on the remote side. The 'remote/2' predicate can
> handle this transparently by executing its clauses on all remote
> machines. I have examples for that, but they are probably beyond the
> scope of this mail.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-10 Thread Alexander Burger
On Mon, May 10, 2010 at 09:04:48AM +0200, Henrik Sarvell wrote:
> Ah I see, so the issue is on the remote side then, what did your code
> look like there, did you use (prove)?

There were several scenarios. In cases where only a few hits are to be
expected, I used 'collect':

   (for Obj (collect 'var '+Cls (...))
      (pr Obj)
      (unless (flush) (bye)) )

The 'flush' is there for two purposes: (1) to get the data sent
immediately (without holding in a local buffer), and (2) to have an
immediate feedback. When the receiving side should close the connection
(i.e. the GUI is not interested in more results, or the client has
quit), 'flush' returns NIL and the local query can be terminated.


In other cases, where there were potentially many hits (so that I didn't
want to use 'collect'), I used the low-level tree iteration function
'iter' (which is also used internally by 'collect'):

   (iter (tree 'var '+Cls)
      '((Obj)
         (pr Obj)
         (unless (flush) (bye)) )
      (cons From)
      (cons Till T) )
   (bye) )

So 'iter' is quite efficient, as it avoids the overhead of Pilog, but
still can deliver an unlimited number of hits.

Note, however, that you have to pass the proper 'from' and 'till'
arguments. They must have the right structure of the index tree's key.
For a '+Key' index this would be simply 'From' and 'Till'. For a '+Ref'
(like in the shown case) it must be '(From . NIL)' and '(Till . T)'.
'db', 'collect' and the Pilog functions take care of such details
automatically.
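
A sketch with explicit keys for the '+Ref +Number' index from this
thread (the two timestamps are arbitrary example values):

   (iter (tree 'picoStamp '+WordCount)
      '((Obj)
         (pr Obj)
         (unless (flush) (bye)) )
      (cons 1273190400)       # From, for a '+Ref': (From . NIL)
      (cons 1273795200 T) )   # Till, for a '+Ref': (Till . T)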


For more complex queries, involving more than one index, yes, I used Pilog
and 'prove'. Each call to 'prove' returns (and sends) a single object.


For plain Pilog queries, i.e. without any special requirements like a
defined sorting order, you can get along even without any custom
functions/methods on the remote side. The 'remote/2' predicate can
handle this transparently by executing its clauses on all remote
machines. I have examples for that, but they are probably beyond the
scope of this mail.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-10 Thread Henrik Sarvell
Ah I see, so the issue is on the remote side then, what did your code
look like there, did you use (prove)?



On Mon, May 10, 2010 at 7:22 AM, Alexander Burger  wrote:
> Hi Henrik,
>
>> One final question, how did you define the rd1> mechanism?
>
> In the mentioned case, I used the following method in the +Agent class
>
>    (dm rd1> (Sock)
>       (when (assoc Sock (: socks))
>          (rot (: socks) (index @ (: socks)))
>          (ext (: ext)
>             (or
>                (in Sock (rd))
>                (nil
>                   (close Sock)
>                   (pop (:: socks)) ) ) ) ) )
>
> This looks a little complicated, as each agent maintains a list of open
> sockets (in 'socks'). But if you omit the 'socks' management, it is
> basically just
>
>    (ext (: ext) (in Sock (rd)))
>
> followed by 'close' if the remote side closed the connection.
>
>
>> Simply doing:
>>
>> (dm rd1> (Sock)
>>    (in Sock (rd)))
>>
>> will read the whole result, not just the first result, won't it?
>
> This should not be the case. It depends on what the other side sends. If
> it sends a list, you'll get the whole list. In the examples we
> discussed, however, the query results were sent one by one.
>
>
>> I'm a little bit confused since it says in the reference that rd will
>> "read the first item from the current input channel" but when I look
>
> Yes, analogous to 'read', 'line', 'char' etc.
>
>> Maybe something is needed on the remote? At the moment there is simply
>> a collect and sort by there.
>
> Could it be that remote sends the result of 'collect'? This would be the
> whole list then.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-09 Thread Alexander Burger
Hi Henrik,

> One final question, how did you define the rd1> mechanism?

In the mentioned case, I used the following method in the +Agent class

   (dm rd1> (Sock)
      (when (assoc Sock (: socks))
         (rot (: socks) (index @ (: socks)))
         (ext (: ext)
            (or
               (in Sock (rd))
               (nil
                  (close Sock)
                  (pop (:: socks)) ) ) ) ) )

This looks a little complicated, as each agent maintains a list of open
sockets (in 'socks'). But if you omit the 'socks' management, it is
basically just

   (ext (: ext) (in Sock (rd)))

followed by 'close' if the remote side closed the connection.


> Simply doing:
> 
> (dm rd1> (Sock)
>(in Sock (rd)))
> 
> will read the whole result, not just the first result, won't it?

This should not be the case. It depends on what the other side sends. If
it sends a list, you'll get the whole list. In the examples we
discussed, however, the query results were sent one by one.


> I'm a little bit confused since it says in the reference that rd will
> "read the first item from the current input channel" but when I look

Yes, analogous to 'read', 'line', 'char' etc.

> Maybe something is needed on the remote? At the moment there is simply
> a collect and sort by there.

Could it be that remote sends the result of 'collect'? This would be the
whole list then.
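
The difference, as a sketch using the '+WordCount' E/R from this thread:

   (for Obj (collect 'word '+WordCount W)   # send results one by one,
      (pr Obj)                              # so each 'rd' yields one object
      (unless (flush) (bye)) )

versus

   (pr (collect 'word '+WordCount W))       # send the whole list at once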

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-05-09 Thread Henrik Sarvell
One final question, how did you define the rd1> mechanism?

Simply doing:

(dm rd1> (Sock)
   (in Sock (rd)))

will read the whole result, not just the first result, won't it?

I'm a little bit confused since it says in the reference that rd will
"read the first item from the current input channel" but when I look
at my current usage of rd I get the feeling it will read the whole
result?

Maybe something is needed on the remote? At the moment there is simply
a collect and sort by there.

I hope I'm not too cryptic.

/Henrik




On Sun, Apr 25, 2010 at 5:08 PM, Henrik Sarvell  wrote:
> Ah so the key is to have the connections in a list, I should have understood
> that.
>
> Thanks for the help, I'll try it out!
>
>
>
> On Sun, Apr 25, 2010 at 4:51 PM, Alexander Burger 
> wrote:
>>
>> On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
>> > So I gather the *Ext mapping is absolutely necessary regardless of
>> > whether
>> > remote or ext is used.
>>
>> Yes.
>>
>> Only in case you do not intend to communicate whole objects between the
>> remote and local application, but only scalar data like strings,
>> numbers, or lists of those. I would say this would be quite a
>> limitation. You need to communicate whole objects, at least because you
>> want to compare them locally to find the biggest (see below).
>>
>>
>> > I took at the *Ext section again, could I use this maybe:
>> >
>> > (setq *Ext  # Define extension functions
>> > ...
>> >                (off Sock) ) ) ) ) ) ) ) )
>> >       '(localhost localhost)
>> >       '(4041 4042)
>> >       (40 80) ) )
>>
>> Yes, that's good. The example in the docu was not sufficient, as it has
>> a single port hard-coded.
>>
>>
>> > And then with *ext* I need to create that single look ahead queue in the
>> > local code you talked about earlier, but how?
>>
>> The look ahead queue of a single object per connection consisted simply of
>> a list, the first result sent from each remote host.
>>
>> What I did was:
>>
>> 1. Starting a new query, a list of connections to all remote hosts is
>>   opened:
>>
>>      (extract
>>         '((Agent)
>>            (query> Agent ) )
>>         (list of agents) )
>>
>>   This returns a list of all agent objects who succeeded to connect. I
>>   used that list to initialize a Pilog query.
>>
>> 2. Then you fetch the first answer from each connection. I used a method
>>   'rd1>' in the agent class for that:
>>
>>      (extract 'rd1> (list of open agents))
>>
>>   'extract' is used here, as it behaves like 'mapcar' but filters all
>>   NIL items out of the result. A NIL item will be returned in the first
>>   'extract' if the connection cannot be opened, and in the second one
>>   if that remote host has no results to send.
>>
>>   So now you have a list of results, the first (highest, biggest,
>>   newest?) object from each remote host.
>>
>> 3. Now the main query loop starts. Each time a new result is requested,
>>   e.g. from the GUI, you just need to find the object with the highest,
>>   biggest, newest attribute in that list. You take it from the list
>>   (e.g. with 'prog1'), and immediately fill the slot in the list by
>>   calling 'rd1>' for that host again.
>>
>>   If that 'rd1>' returns NIL, it means this remote host has no more
>>   results, so you delete it from the list of open agents. If it returns
>>   non-NIL, you store the read value into the slot.
>>
>> In that way, the list of received items constitutes a kind of look-ahead
>> structure, always containing the items which might be returned next to
>> the caller.
>>
>>
>> > I mean at the moment the problem is that I get too many articles in my
>> > local
>> > code since all the remotes send all their articles at once, thus
>> > swamping
>>
>> There cannot be any swamping. All remote processes will send their
>> results, yes, but only until the TCP queue fills up, or until they have
>> no more results. The local process doesn't see anything of that, it just
>> fetches the next result with 'rd1>' whenever it needs one.
>>
>> You don't have to worry at all whether the GUI calls for the next result
>> 50 times, or 10 000 times. Each time simply the next result is returned.
>> This works well, and produces no more load than necessary.
>>
>> Cheers,
>> - Alex
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>
>
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-25 Thread Henrik Sarvell
Ah so the key is to have the connections in a list, I should have understood
that.

Thanks for the help, I'll try it out!



On Sun, Apr 25, 2010 at 4:51 PM, Alexander Burger wrote:

> On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
> > So I gather the *Ext mapping is absolutely necessary regardless of
> whether
> > remote or ext is used.
>
> Yes.
>
> Only in case you do not intend to communicate whole objects between the
> remote and local application, but only scalar data like strings,
> numbers, or lists of those. I would say this would be quite a
> limitation. You need to communicate whole objects, at least because you
> want to compare them locally to find the biggest (see below).
>
>
> > I took at the *Ext section again, could I use this maybe:
> >
> > (setq *Ext  # Define extension functions
> > ...
> >   (off Sock) ) ) ) ) ) ) ) )
> >   '(localhost localhost)
> >   '(4041 4042)
> >   (40 80) ) )
>
> Yes, that's good. The example in the docu was not sufficient, as it has
> a single port hard-coded.
>
>
> > And then with *ext* I need to create that single look ahead queue in the
> > local code you talked about earlier, but how?
>
> The look ahead queue of a single object per connection consisted simply of
> a list, the first result sent from each remote host.
>
> What I did was:
>
> 1. Starting a new query, a list of connections to all remote hosts is
>   opened:
>
>  (extract
> '((Agent)
>(query> Agent ) )
> (list of agents) )
>
>   This returns a list of all agent objects who succeeded to connect. I
>   used that list to initialize a Pilog query.
>
> 2. Then you fetch the first answer from each connection. I used a method
>   'rd1>' in the agent class for that:
>
>  (extract 'rd1> (list of open agents))
>
>   'extract' is used here, as it behaves like 'mapcar' but filters all
>   NIL items out of the result. A NIL item will be returned in the first
>   'extract' if the connection cannot be opened, and in the second one
>   if that remote host has no results to send.
>
>   So now you have a list of results, the first (highest, biggest,
>   newest?) object from each remote host.
>
> 3. Now the main query loop starts. Each time a new result is requested,
>   e.g. from the GUI, you just need to find the object with the highest,
>   biggest, newest attribute in that list. You take it from the list
>   (e.g. with 'prog1'), and immediately fill the slot in the list by
>   calling 'rd1>' for that host again.
>
>   If that 'rd1>' returns NIL, it means this remote host has no more
>   results, so you delete it from the list of open agents. If it returns
>   non-NIL, you store the read value into the slot.
>
> In that way, the list of received items constitutes a kind of look-ahead
> structure, always containing the items which might be returned next to
> the caller.
>
>
> > I mean at the moment the problem is that I get too many articles in my
> local
> > code since all the remotes send all their articles at once, thus swamping
>
> There cannot be any swamping. All remote processes will send their
> results, yes, but only until the TCP queue fills up, or until they have
> no more results. The local process doesn't see anything of that, it just
> fetches the next result with 'rd1>' whenever it needs one.
>
> You don't have to worry at all whether the GUI calls for the next result
> 50 times, or 10 000 times. Each time simply the next result is returned.
> This works well, and produces no more load than necessary.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>


Re: Scaling issue

2010-04-25 Thread Alexander Burger
On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
> So I gather the *Ext mapping is absolutely necessary regardless of whether
> remote or ext is used.

Yes.

Only in case you do not intend to communicate whole objects between the
remote and local application, but only scalar data like strings,
numbers, or lists of those. I would say this would be quite a
limitation. You need to communicate whole objects, at least because you
want to compare them locally to find the biggest (see below).


> I took at the *Ext section again, could I use this maybe:
> 
> (setq *Ext  # Define extension functions
> ...
>   (off Sock) ) ) ) ) ) ) ) )
>   '(localhost localhost)
>   '(4041 4042)
>   (40 80) ) )

Yes, that's good. The example in the docu was not sufficient, as it has
a single port hard-coded.


> And then with *ext* I need to create that single look ahead queue in the
> local code you talked about earlier, but how?

The look ahead queue of a single object per connection consisted simply of
a list, the first result sent from each remote host.

What I did was:

1. Starting a new query, a list of connections to all remote hosts is
   opened:

   (extract
      '((Agent)
         (query> Agent ) )
      (list of agents) )

   This returns a list of all agent objects who succeeded to connect. I
   used that list to initialize a Pilog query.

2. Then you fetch the first answer from each connection. I used a method
   'rd1>' in the agent class for that:

  (extract 'rd1> (list of open agents))

   'extract' is used here, as it behaves like 'mapcar' but filters all
   NIL items out of the result. A NIL item will be returned in the first
   'extract' if the connection cannot be opened, and in the second one
   if that remote host has no results to send.

   So now you have a list of results, the first (highest, biggest,
   newest?) object from each remote host.

3. Now the main query loop starts. Each time a new result is requested,
   e.g. from the GUI, you just need to find the object with the highest,
   biggest, newest attribute in that list. You take it from the list
   (e.g. with 'prog1'), and immediately fill the slot in the list by
   calling 'rd1>' for that host again.

   If that 'rd1>' returns NIL, it means this remote host has no more
   results, so you delete it from the list of open agents. If it returns
   non-NIL, you store the read value into the slot.

In that way, the list of received items constitutes a kind of look-ahead
structure, always containing the items which might be returned next to
the caller.
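
A hedged sketch of step 3 (not the actual code from that project):
assume step 2 left a global '*Look', a list of (Agent . Obj) pairs, and
that the objects carry a numeric 'pubDate' as in the E/R discussed
earlier; 'nextResult' is just a made-up name:

   (de nextResult ()
      (let? Best (maxi '((X) (; (cdr X) pubDate)) *Look)
         (prog1 (cdr Best)                # newest object in the look-ahead
            (if (rd1> (car Best))         # read that agent's next object
               (con Best @)               # refill the slot
               (del Best '*Look) ) ) ) )  # agent exhausted: drop its slot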


> I mean at the moment the problem is that I get too many articles in my local
> code since all the remotes send all their articles at once, thus swamping

There cannot be any swamping. All remote processes will send their
results, yes, but only until the TCP queue fills up, or until they have
no more results. The local process doesn't see anything of that, it just
fetches the next result with 'rd1>' whenever it needs one.

You don't have to worry at all whether the GUI calls for the next result
50 times, or 10 000 times. Each time simply the next result is returned.
This works well, and produces no more load than necessary.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-25 Thread Henrik Sarvell
So I gather the *Ext mapping is absolutely necessary regardless of whether
remote or ext is used.

I took a look at the *Ext section again, could I use this maybe:

(setq *Ext  # Define extension functions
   (mapcar
      '((@Host @Port @Ext)
         (let Sock NIL
            (cons @Ext
               (curry (@Host @Ext Sock) (Obj)
                  (when (or Sock (setq Sock (connect @Host @Port)))
                     (ext @Ext
                        (out Sock (pr (cons 'qsym Obj)))
                        (prog1 (in Sock (rd))
                           (unless @
                              (close Sock)
                              (off Sock) ) ) ) ) ) ) ) )
      '(localhost localhost)
      '(4041 4042)
      (40 80) ) )

And then with *ext* I need to create that single look ahead queue in the
local code you talked about earlier, but how?

I mean at the moment the problem is that I get too many articles in my local
code since all the remotes send all their articles at once, thus swamping
the local process. I'll show you what I'm using now:

(dm evalAll> @
   (let Result
      (make
         (for N (getMachine> This "localhost")
            (later (chain (cons "void"))
               (eval> This N (rest)) ) ) )
      (wait 5000 (not (memq "void" Result)))
      Result ) )

(Note that this logic does not respect a multi-machine environment; I will
add that when/if my current single machine is not enough.)

This one will evaluate code on all remotes and return all the results. If the
result contains let's say more than 10 000 articles I will choke as it is
now. That's why I need that single look ahead you talked about, but I don't
know how to implement it.

If it was just about returning the 25 newest articles I could have each
remote simply return the 25 newest ones and then sort again locally. In that
case I would get 50 back and not 10 000 in this case. And when I want the
next result which will be 25-50 I suppose I could return 50 from each remote
then but this is a very ugly solution that doesn't scale very well.




On Sun, Apr 25, 2010 at 12:05 PM, Alexander Burger wrote:

> Hi Henrik,
>
> > I've reviewed the *Ext part in the manual and I will need something
> > different as I will have several nodes on each machine on different ports
> > (starting with simply localhost). I suppose I could have simply modified
> it
> > if I had had one node per machine?
>
> With "node" you mean a server process? What makes you think that the
> example limits it to one node? IIRC, the example is in fact a simplified
> version (perhaps too simplified?) of a system where there were many
> servers, of equal and different types, on each host.
>
>
> > Anyway, what would the whole procedure you've described look like if I
> have
> > two external nodes listening on 4041 and 4042 respectively but on
> localhost
> > both of them, and the E/R in question looks like this?:
> >
> > (class +Article +Entity)
> > (rel aid   (+Key +Number))
> > (rel title (+String))
> > (rel htmlUrl   (+Key +String)) #
> > (rel body  (+Blob))
> > (rel pubDate   (+Ref +Number))
>
> Side question: Is there a special reason why 'pubDate' is a '+Number'
> and not a '+Date'? Should work that way, though.
>
>
> > In this case I want to fetch article 25 - 50 sorted by pubDate from both
> > nodes
>
> Unfortunately, this cannot be achieved directly with an '+Aux' relation,
> because the article number and the date cannot be organized into a
> single index with a primary and secondary sorting criterion.
>
> There is no other way than fetching and then sorting them, I think:
>
>   (by '((This) (: pubDate)) sort (collect 'aid '+Article 25 50))
>
> Thus, the "send" part from a node to the central server would be
>
>    (for Article
>       (by
>          '((This) (: pubDate))
>          sort
>          (collect 'aid '+Article 25 50) )
>       (pr Article)     # Send the article object
>       (NIL (flush)) )  # Flush the socket
>
> The 'flush' is important, not so much to immediately send the data, but
> to detect whether the other side (the central server) has closed the
> connection, perhaps because it isn't interested in further data.
>
> 'flush' returns NIL if it cannot send the data successfully, and thus
> causes the 'for' loop to terminate.
>
>
>
> > So as far as I've understood it a (setq *Ext ... ) section is needed and
> > then the specific logic described in your previous post in the form of
> > something using *ext* or maybe *remote*?
>
> Yes. '*Ext' is necessary if remote objects are accessed locally.
>
> 'remote' might be handy if Pilog is used for remote queries. This is not
> the case in the above example.
>
> But 'ext' is needed on the central server, with the proper offsets for
> the clients. This can be all encapsulated in the +Agent objects.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>


Re: Scaling issue

2010-04-25 Thread Alexander Burger
Hi Henrik,

> I've reviewed the *Ext part in the manual and I will need something
> different as I will have several nodes on each machine on different ports
> (starting with simply localhost). I suppose I could have simply modified it
> if I had had one node per machine?

With "node" you mean a server process? What makes you think that the
example limits it to one node? IIRC, the example is in fact a simplified
version (perhaps too simplified?) of a system where there were many
servers, of equal and different types, on each host.


> Anyway, what would the whole procedure you've described look like if I have
> two external nodes listening on 4041 and 4042 respectively but on localhost
> both of them, and the E/R in question looks like this?:
> 
> (class +Article +Entity)
> (rel aid   (+Key +Number))
> (rel title (+String))
> (rel htmlUrl   (+Key +String)) #
> (rel body  (+Blob))
> (rel pubDate   (+Ref +Number))

Side question: Is there a special reason why 'pubDate' is a '+Number'
and not a '+Date'? Should work that way, though.


> In this case I want to fetch article 25 - 50 sorted by pubDate from both
> nodes

Unfortunately, this cannot be achieved directly with an '+Aux' relation,
because the article number and the date cannot be organized into a
single index with a primary and secondary sorting criterion.

There is no other way than fetching and then sorting them, I think:

   (by '((This) (: pubDate)) sort (collect 'aid '+Article 25 50))

Thus, the "send" part from a node to the central server would be

   (for Article
      (by
         '((This) (: pubDate))
         sort
         (collect 'aid '+Article 25 50) )
      (pr Article)     # Send the article object
      (NIL (flush)) )  # Flush the socket

The 'flush' is important, not so much to immediately send the data, but
to detect whether the other side (the central server) has closed the
connection, perhaps because it isn't interested in further data.

'flush' returns NIL if it cannot send the data successfully, and thus
causes the 'for' loop to terminate.
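
On the central server, the matching receive loop is then roughly (a
sketch; 'Sock' is an open connection to one node, and the 'ext' offset
handling mentioned below is omitted):

   (make
      (in Sock
         (while (rd)                 # one +Article object per 'rd'
            (link @) ) ) )           # collect them for local re-sorting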



> So as far as I've understood it a (setq *Ext ... ) section is needed and
> then the specific logic described in your previous post in the form of
> something using *ext* or maybe *remote*?

Yes. '*Ext' is necessary if remote objects are accessed locally.

'remote' might be handy if Pilog is used for remote queries. This is not
the case in the above example.

But 'ext' is needed on the central server, with the proper offsets for
the clients. This can be all encapsulated in the +Agent objects.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-24 Thread Henrik Sarvell
I've done some refactoring and rewriting of my +Agent, I will employ various
ways of fetching/setting remote data but the technique you've described
above will be prominent.

I've reviewed the *Ext part in the manual and I will need something
different as I will have several nodes on each machine on different ports
(starting with simply localhost). I suppose I could have simply modified it
if I had had one node per machine?

Anyway, what would the whole procedure you've described look like if I have
two external nodes listening on 4041 and 4042 respectively but on localhost
both of them, and the E/R in question looks like this?:

(class +Article +Entity)
(rel aid   (+Key +Number))
(rel title (+String))
(rel htmlUrl   (+Key +String)) #
(rel body  (+Blob))
(rel pubDate   (+Ref +Number))

In this case I want to fetch article 25 - 50 sorted by pubDate from both
nodes (if additional relations are needed to facilitate the sorting feel
free to add them to the E/R).

So as far as I've understood it a (setq *Ext ... ) section is needed and
then the specific logic described in your previous post in the form of
something using *ext* or maybe *remote*?

/Henrik


On Wed, Apr 21, 2010 at 8:08 PM, Alexander Burger wrote:

> On Wed, Apr 21, 2010 at 06:35:30PM +0200, Henrik Sarvell wrote:
> > At first my remotes will be on the same machine so yes they could all be
> > forked from the main process.
>
> That's all right. On the other hand, the remote processes might be
> different _programs_ (i.e. starting from a separate 'main', 'go' etc.),
> so I would rather expect them not to fork from the same process.
>
> On which machine(s) all these processes run at the end of the day
> doesn't matter. Can well be all on localhost.
>
>
> > I suppose that means they will start to block and what consequences will
> > there be? If very bad how do I prevent it in the best way?
>
> Blocking is no problem at all in this context. As we discussed in
> previous examples (and also as shown in the docu of *Ext and related
> functions), the remote server spawns a child process for each query
> request. Such a query can block if the central server doesn't eat away
> all results quickly enough, but this doesn't matter.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>


Re: Scaling issue

2010-04-21 Thread Alexander Burger
On Wed, Apr 21, 2010 at 06:35:30PM +0200, Henrik Sarvell wrote:
> At first my remotes will be on the same machine so yes they could all be
> forked from the main process.

That's all right. On the other hand, the remote processes might be
different _programs_ (i.e. starting from a separate 'main', 'go' etc.),
so I would rather expect them not to fork from the same process.

On which machine(s) all these processes run at the end of the day
doesn't matter. Can well be all on localhost.


> I suppose that means they will start to block and what consequences will
> there be? If very bad how do I prevent it in the best way?

Blocking is no problem at all in this context. As we discussed in
previous examples (and also as shown in the docu of *Ext and related
functions), the remote server spawns a child process for each query
request. Such a query can block if the central server doesn't eat away
all results quickly enough, but this doesn't matter.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-21 Thread Henrik Sarvell
One small question before I start working.

Now all remote databases start sending their results, ordered by date.
> They are actually busy only until the TCP queue fills up, or until the
> connection is closed. If the queue is filled up, they will block so that
> it is advisable that they are all fork'ed children.
>

At first my remotes will be on the same machine so yes they could all be
forked from the main process.

In the future though I might want to put them on different machines; that's
why I want them to be completely separate processes right from the start.

I suppose that means they will start to block and what consequences will
there be? If very bad how do I prevent it in the best way?

Sorry for the possibly silly questions but my knowledge of TCP is very
limited.



On Tue, Apr 20, 2010 at 6:11 PM, Henrik Sarvell  wrote:

> That was a clever one I must say :)
>
> OK I'll redistribute the articles first and then get back to you for a few
> of the details with regards to the above.
>
> The above can then also be used to fetch articles by feed, tag or any other
> attribute because they must always be sorted by date.
>
>
>
>
> On Tue, Apr 20, 2010 at 5:22 PM, Alexander Burger wrote:
>
>> Hi Henrik,
>>
>> > So with the refs in place I could use the full remote logic to run pilog
>> > queries on the remotes.
>>
>> OK
>>
>> > Now a search is made for all articles containing the word "picolisp" for
>> > instance. I then need to be able to get an arbitrary slice back of the
>> total
>> > which needs to be sorted by time. I have a hard time understanding how
>> this
>> > can be achieved in any sensible way except through one of the following:
>> >
>> > Central Command:
>> > ...
>> > Cascading:
>> > ...
>>
>> I think both solutions are feasible. This is because you are in the
>> lucky situation that you can separate the articles on the remote machine
>> according to their age. In a general case (e.g. if the data are not
>> "archived" like here, but are subject to permanent change), this would
>> not be so easy.
>>
>>
>> However: I think there is a solution that is simpler, as well as more
>> general (not assuming anything about the locations of the articles).
>>
>> I did this in another project, where I collected items from remote
>> machines sorted by attributes (not date, but counts and sizes).
>>
>>
>> The first thing is that you define the index to be an +Aux, combining
>> the search key with the date. So if you search for a key like "picolisp"
>> on a single remote machine, you get all hits sorted by date. No extra
>> sorting required.
>>
>> Then, each remote machine has a function (e.g. 'sendRefDatArticles')
>> defined, which simply iterates the index tree (with 'collect', or better
>> a pilog query) and sends each found object with 'pr' to the current
>> output channel. When it has sent all hits, it terminates.
>>
>> Then on the central server you open connections with *Ext enabled to
>> each remote client. This can be done with +Agent objects taking care of
>> the details (maintaining the connections, communicating via 'ext' etc.).
>>
>> The actual query then sends out a remote command like
>>
>>   (start> Agent 'sendRefDatArticles "picolisp" (someDate))
>>
>> Now all remote databases start sending their results, ordered by date.
>> They are actually busy only until the TCP queue fills up, or until the
>> connection is closed. If the queue is filled up, they will block so that
>> it is advisable that they are all fork'ed children.
>>
>> The central server then reads a single object from each connection into
>> a list. Now, to return the results one by one to the actual caller (e.g.
>> the GUI), it always picks the object with the highest date from that
>> list, and reads the next item into that place in the list. The list is
>> effectively a single-object look-ahead on each connection. When one of
>> the connections returns NIL, it means the list of hits on that machine
>> is exhausted, the remote child process terminated, and the connection
>> can be closed.
>>
>> So the GUI calls a function (or, more probably, a proper pilog predicate)
>> which always returns the next available object with the highest date.
>> With that, you can fetch 1, 25, or thousands of objects in order.
>>
>> Cheers,
>> - Alex
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>>
>
>


Re: Scaling issue

2010-04-20 Thread Henrik Sarvell
That was a clever one I must say :)

OK I'll redistribute the articles first and then get back to you for a few
of the details with regards to the above.

The above can then also be used to fetch articles by feed, tag or any other
attribute because they must always be sorted by date.



On Tue, Apr 20, 2010 at 5:22 PM, Alexander Burger wrote:

> Hi Henrik,
>
> > So with the refs in place I could use the full remote logic to run pilog
> > queries on the remotes.
>
> OK
>
> > Now a search is made for all articles containing the word "picolisp" for
> > instance. I then need to be able to get an arbitrary slice back of the
> total
> > which needs to be sorted by time. I have a hard time understanding how
> this
> > can be achieved in any sensible way except through one of the following:
> >
> > Central Command:
> > ...
> > Cascading:
> > ...
>
> I think both solutions are feasible. This is because you are in the
> lucky situation that you can separate the articles on the remote machine
> according to their age. In a general case (e.g. if the data are not
> "archived" like here, but are subject to permanent change), this would
> not be so easy.
>
>
> However: I think there is a solution that is simpler, as well as more
> general (not assuming anything about the locations of the articles).
>
> I did this in another project, where I collected items from remote
> machines sorted by attributes (not date, but counts and sizes).
>
>
> The first thing is that you define the index to be an +Aux, combining
> the search key with the date. So if you search for a key like "picolisp"
> on a single remote machine, you get all hits sorted by date. No extra
> sorting required.
>
> Then, each remote machine has a function (e.g. 'sendRefDatArticles')
> defined, which simply iterates the index tree (with 'collect', or better
> a pilog query) and sends each found object with 'pr' to the current
> output channel. When it has sent all hits, it terminates.
>
> Then on the central server you open connections with *Ext enabled to
> each remote client. This can be done with +Agent objects taking care of
> the details (maintaining the connections, communicating via 'ext' etc.).
>
> The actual query then sends out a remote command like
>
>   (start> Agent 'sendRefDatArticles "picolisp" (someDate))
>
> Now all remote databases start sending their results, ordered by date.
> They are actually busy only until the TCP queue fills up, or until the
> connection is closed. If the queue is filled up, they will block so that
> it is advisable that they are all fork'ed children.
>
> The central server then reads a single object from each connection into
> a list. Now, to return the results one by one to the actual caller (e.g.
> the GUI), it always picks the object with the highest date from that
> list, and reads the next item into that place in the list. The list is
> effectively a single-object look-ahead on each connection. When one of
> the connections returns NIL, it means the list of hits on that machine
> is exhausted, the remote child process terminated, and the connection
> can be closed.
>
> So the GUI calls a function (or, more probably, a proper pilog predicate)
> which always returns the next available object with the highest date.
> With that, you can fetch 1, 25, or thousands of objects in order.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>


Re: Scaling issue

2010-04-20 Thread Alexander Burger
Hi Henrik,

> So with the refs in place I could use the full remote logic to run pilog
> queries on the remotes.

OK

> Now a search is made for all articles containing the word "picolisp" for
> instance. I then need to be able to get an arbitrary slice back of the total
> which needs to be sorted by time. I have a hard time understanding how this
> can be achieved in any sensible way except through one of the following:
> 
> Central Command:
> ...
> Cascading:
> ...

I think both solutions are feasible. This is because you are in the
lucky situation that you can separate the articles on the remote machine
according to their age. In a general case (e.g. if the data are not
"archived" like here, but are subject to permanent change), this would
not be so easy.


However: I think there is a solution that is simpler, as well as more
general (not assuming anything about the locations of the articles).

I did this in another project, where I collected items from remote
machines sorted by attributes (not date, but counts and sizes).


The first thing is that you define the index to be an +Aux, combining
the search key with the date. So if you search for a key like "picolisp"
on a single remote machine, you get all hits sorted by date. No extra
sorting required.
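
In terms of the E/R definitions discussed in this thread, such a combined
index might look roughly like this (the class and relation names here are
made up for illustration, not taken from the actual application):

   (class +WordRef +Entity)
   (rel word (+Aux +Ref +Number) (dat))   # Index key is effectively (word date)
   (rel dat (+Number))                    # Publication date of the article
   (rel article (+Ref +Number))           # The article itself (number or +Link)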

Then, each remote machine has a function (e.g. 'sendRefDatArticles')
defined, which simply iterates the index tree (with 'collect', or better
a pilog query) and sends each found object with 'pr' to the current
output channel. When it has sent all hits, it terminates.
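
A minimal sketch of such a function, based on the hypothetical '+WordRef'
entity above (untested; 'Date' is used as an upper bound, and the reversed
'collect' range walks the hits newest first):

   (de sendRefDatArticles (Word Date)
      (for Obj (collect 'word '+WordRef (list Word Date) (list Word))
         (pr Obj)            # Send each hit to the current output channel
         (NIL (flush)) ) )   # Stop when the central server closes the connection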

Then on the central server you open connections with *Ext enabled to
each remote client. This can be done with +Agent objects taking care of
the details (maintaining the connections, communicating via 'ext' etc.).

The actual query then sends out a remote command like

   (start> Agent 'sendRefDatArticles "picolisp" (someDate))

Now all remote databases start sending their results, ordered by date.
They are actually busy only until the TCP queue fills up, or until the
connection is closed. If the queue is filled up, they will block so that
it is advisable that they are all fork'ed children.

The central server then reads a single object from each connection into
a list. Now, to return the results one by one to the actual caller (e.g.
the GUI), it always picks the object with the highest date from that
list, and reads the next item into that place in the list. The list is
effectively a single-object look-ahead on each connection. When one of
the connections returns NIL, it means the list of hits on that machine
is exhausted, the remote child process terminated, and the connection
can be closed.
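
A rough, untested sketch of that merge step (all names are made up: '*Socks'
is assumed to hold the open connections to the nodes, and 'dat' the date
property of the received objects, readable via *Ext):

   # One (socket . look-ahead-object) pair per node
   (setq *Conns
      (mapcar '((Sock) (cons Sock (in Sock (rd)))) *Socks) )

   (de nextNewest ()
      # Pick the connection whose look-ahead object has the highest date,
      # return that object, and refill its slot from the same socket.
      (let C (maxi '((X) (and (cdr X) (get (cdr X) 'dat))) *Conns)
         (when (cdr C)
            (prog1 (cdr C)
               (con C (in (car C) (rd))) ) ) ) )   # NIL means "node exhausted"

Calling 'nextNewest' 25 times then yields one page of results (closing
exhausted connections is omitted in this sketch).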

So the GUI calls a function (or, more probably, a proper pilog predicate)
which always returns the next available object with the highest date.
With that, you can fetch 1, 25, or thousands of objects in order.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-20 Thread Henrik Sarvell
I've been reading up a bit on the remote stuff. I haven't made the articles
distributed yet, but let's assume I have, with 10 000 articles per remote.
Let's also assume that I have remade the word indexes to now work with real
+Ref +Links on each remote, linking words and articles (not simply numbers
for subsequent use with (id) locally).

So with the refs in place I could use the full remote logic to run pilog
queries on the remotes.

Now a search is made for all articles containing the word "picolisp" for
instance. I then need to be able to get an arbitrary slice back of the total
which needs to be sorted by time. I have a hard time understanding how this
can be achieved in any sensible way except through one of the following:

Central Command:

1.) The remotes are set up so that remote one contains the oldest articles,
remote two the second oldest articles, and so on (this is the case naturally,
as a new remote is spawned when the newest one is "full").

2.) Each remote then returns how many articles it has that contain
"picolisp". This is needed for the pagination anyway, in order to display the
correct number of pages, and can be done pretty trivially through the
count tree mechanism described earlier in this thread.

3.) The local logic now determines which remote(s) should be queried in
order to get 25 correct articles, issues the queries to be executed remotely
and displays the returned articles.

If pagination is scrapped the total count is not needed; it's possible to
have a "More Results" button instead, and I'm fine with that kind of
interface too. In most cases the count is not important for the user anyway.
In that case the following might be possible:

Cascading:

1.) The newest remote is queried first and can quickly determine through the
count tree whether it has the requested articles; if so, it quickly fetches
them and returns them.

2.) If it doesn't contain them, it will pass on the request to the second
newest remote, which might contain all of the requested articles or only a
subset, in which case the missing ones will be returned from the third newest
remote through the same mechanism.

3.) The end result is that the correct articles now end up in the first
remote, which will return them to the local application.

Did I miss something, or might this problem be solved in a cleverer way?

/Henrik






On Thu, Apr 15, 2010 at 12:55 PM, Henrik Sarvell  wrote:

> To simply be able to pass along simple commands like 'collect' and 'db',
> i.e. the *Ext stuff was overkill. That works just fine, except in this
> special case where there are thousands of articles in a feed.
>
> I'm planning to distribute the whole DB except users and what feeds they
> subscribe to. Everything else will be article centric and remote. I will
> also keep local records of which feeds have articles in which remote so I
> don't query remotes for nothing.
>
>
>
>
>
> On Thu, Apr 15, 2010 at 12:17 PM, Alexander Burger 
> wrote:
>
>> On Thu, Apr 15, 2010 at 09:12:18AM +0200, Henrik Sarvell wrote:
>> > On the other hand, if I'm to follow my own thinking to its logical
>> > conclusion I should make the articles distributed too, with blobs and
>> all.
>>
>> What was the rationale to use object IDs instead of direct remote access
>> via '*Ext'? I can't remember at the moment.
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>>
>
>


Re: Scaling issue

2010-04-15 Thread Henrik Sarvell
To simply be able to pass along simple commands like 'collect' and 'db',
i.e. the *Ext stuff was overkill. That works just fine, except in this
special case where there are thousands of articles in a feed.

I'm planning to distribute the whole DB except users and what feeds they
subscribe to. Everything else will be article centric and remote. I will
also keep local records of which feeds have articles in which remote so I
don't query remotes for nothing.




On Thu, Apr 15, 2010 at 12:17 PM, Alexander Burger wrote:

> On Thu, Apr 15, 2010 at 09:12:18AM +0200, Henrik Sarvell wrote:
> > On the other hand, if I'm to follow my own thinking to its logical
> > conclusion I should make the articles distributed too, with blobs and
> all.
>
> What was the rationale to use object IDs instead of direct remote access
> via '*Ext'? I can't remember at the moment.
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>


Re: Scaling issue

2010-04-15 Thread Alexander Burger
On Thu, Apr 15, 2010 at 09:12:18AM +0200, Henrik Sarvell wrote:
> On the other hand, if I'm to follow my own thinking to its logical
> conclusion I should make the articles distributed too, with blobs and all.

What was the rationale to use object IDs instead of direct remote access
via '*Ext'? I can't remember at the moment.
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-15 Thread Alexander Burger
Hi Henrik,

> Could the *Ext functionality still be used somehow? I have a hard time
> understanding how if I don't map the feed (parent) -> article (child)
> relationship remotely, I mean at some point I will have to filter all

Sorry, I probably lost the overview of the total application structure.

But if I understand the question right: Though *Ext is not intended in
that way (it gives access to the complete remote object), you might
still use it locally with 'id', as *Ext preserves the object id (it only
maps the DB file number part to the remote range). That is, if you
apply 'id' to an object received from a remote DB, it still gives
the correct number.
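
For example (sketch; 'Sock' stands for the connection to the remote DB):

   (let Obj (in Sock (rd))   # An external symbol sent by the remote with 'pr'
      (id Obj) )             # -> its plain object number, unaffected by the
                             #    file number offset that *Ext adds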

Does this help?

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-15 Thread Henrik Sarvell
On the other hand, if I'm to follow my own thinking to its logical
conclusion I should make the articles distributed too, with blobs and all.


On Wed, Apr 14, 2010 at 9:51 PM, Henrik Sarvell  wrote:

> I don't know Alex, remember that we disconnected stuff, I'll paste the
> remote E/R again (all of it, there is nothing else on the remotes):
>
>
> (class +WordCount +Entity)
> (rel article   (+Ref +Number))
> (rel word  (+Aux +Ref +Number) (article))
> (rel count (+Number))
>
> The numbers here can then be used in the main app with (id) to actually
> locate the objects in question.
>
> Could the *Ext functionality still be used somehow? I have a hard time
> understanding how if I don't map the feed (parent) -> article (child)
> relationship remotely, I mean at some point I will have to filter all
> retrieved articles against a set of articles fetched locally (all articles
> belonging to my Twitter feed), if I don't store the connections remotely.
> Storing the feed -> article links remotely will let me avoid checking
> locally, and it's that check that is the bottleneck at the moment.
>
> I suppose you could find some clever way of speeding up the local
> filtering, at the moment I'm simply loading all Twitter articles with
> collect and then throwing away all remotely retrieved articles that are not
> in that list. However that just seems like a duct tape solution, even if it
> works to begin with it won't work for long.
>
> /Henrik
>
>
>
> On Sun, Apr 11, 2010 at 4:13 PM, Alexander Burger wrote:
>
>> On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote:
>> > Thanks Alex, I will go for the reversed range and check out
>> select/3.
>>
>> Let me mention that since picoLisp-3.0.1 we have a separate
>> documentation of 'select/3', in "doc/select.html".
>> --
>> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>>
>
>


Re: Scaling issue

2010-04-14 Thread Henrik Sarvell
I don't know Alex, remember that we disconnected stuff, I'll paste the
remote E/R again (all of it, there is nothing else on the remotes):

(class +WordCount +Entity)
(rel article   (+Ref +Number))
(rel word  (+Aux +Ref +Number) (article))
(rel count (+Number))

The numbers here can then be used in the main app with (id) to actually
locate the objects in question.

Could the *Ext functionality still be used somehow? I have a hard time
understanding how, if I don't map the feed (parent) -> article (child)
relationship remotely. I mean, at some point I will have to filter all
retrieved articles against a set of articles fetched locally (all articles
belonging to my Twitter feed), if I don't store the connections remotely.
Storing the feed -> article links remotely will let me avoid checking
locally, and it's that check that is the bottleneck at the moment.

I suppose you could find some clever way of speeding up the local filtering;
at the moment I'm simply loading all Twitter articles with collect and then
throwing away all remotely retrieved articles that are not in that list.
However, that just seems like a duct tape solution; even if it works to begin
with, it won't work for long.

/Henrik


On Sun, Apr 11, 2010 at 4:13 PM, Alexander Burger wrote:

> On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote:
> > Thanks Alex, I will go for the reversed range and check out select/3.
>
> Let me mention that since picoLisp-3.0.1 we have a separate
> documentation of 'select/3', in "doc/select.html".
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>


Re: Scaling issue

2010-04-11 Thread Alexander Burger
On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote:
> Thanks Alex, I will go for the reversed range and check out select/3.

Let me mention that since picoLisp-3.0.1 we have a separate
documentation of 'select/3', in "doc/select.html".
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-11 Thread Henrik Sarvell
Thanks Alex, I will go for the reversed range and check out select/3.

I'm already using collect with dates extensively but in this case it
wouldn't work as I need the 25 newest regardless of exactly when they were
published.

/Henrik

On Sun, Apr 11, 2010 at 1:27 PM, Alexander Burger wrote:

> On Sun, Apr 11, 2010 at 12:25:42PM +0200, Henrik Sarvell wrote:
> > What's additionally needed is:
> >
> > 1.) Calculating total count somehow without retrieving all articles.
>
> If it is simply the count of all articles in the DB, you can get it
> directly from a '+Key' or '+Ref' index. I don't quite remember the E/R
> model, but I found this in an old mail:
>
>   (class +Article +Entity)
>   (rel aid   (+Key +Number))
>   (rel title (+Idx +String))
>   (rel htmlUrl   (+Key +String))
>
> With that, (count (tree 'aid '+Article)) or (count (tree 'htmlUrl
> '+Article)) will give the count of all articles having the property 'aid' or
> 'htmlUrl' (not, however, via 'title', as an '+Idx' index creates more
> than one tree node per object).
>
> If you need distinguished counts (e.g. for groups of articles or
> according to certain features), it might be necessary to build more
> indexes, or simply maintain counts during import.
>
>
> > 2.) Somehow sorting by date so I get say the 25 first articles.
>
> This is also best done with a dedicated index, e.g.
>
>   (rel dat (+Ref +Date))
>
> in '+Article'. Then you could specify a reversed range (T . NIL) for a
> pilog query
>
>   (? (db dat +Article (T . NIL) @Article) (show @Article))
>
> This will start with the newest article, and step backwards. Even easier
> might be if you specify a range of dates, say from today till one week
> ago. Then you could use 'collect'
>
>   (collect 'dat '+Article (date) (- (date) 7))
>
> or, as 'today' is not very informative,
>
>   (collect 'dat '+Article T (- (date) 7))
>
>
> > When searching for articles belonging to a certain feed containing a word
> in
> > the content I now let the distributed indexes return all articles and
> then I
> > simply use filter to get at the articles. And to do that I of course need
> to
> > fetch all the articles in a certain feed, which works fine for most feeds
> > but not Twitter as it now probably contains more than 10 000 articles.
>
> I think that usually it should not be necessary to fetch all articles,
> if you build a combined query with the 'select/3' predicate.
>
>
> > The only solution I can see to this is to simply store the feed ->
> article
> > mapping remotely too, ie each word index server contains this info too
> for
> > ...
> > Then I could simply filter by feed remotely.
>
> Not sure. But I feel that I would use distributed processing here only
> if there is no other way (i.e. the parallel search with 'select/3').
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>


Re: Scaling issue

2010-04-11 Thread Alexander Burger
On Sun, Apr 11, 2010 at 12:25:42PM +0200, Henrik Sarvell wrote:
> What's additionally needed is:
> 
> 1.) Calculating total count somehow without retrieving all articles.

If it is simply the count of all articles in the DB, you can get it
directly from a '+Key' or '+Ref' index. I don't quite remember the E/R
model, but I found this in an old mail:

   (class +Article +Entity)
   (rel aid   (+Key +Number))
   (rel title (+Idx +String))
   (rel htmlUrl   (+Key +String))

With that, (count (tree 'aid '+Article)) or (count (tree 'htmlUrl
'+Article)) will give the count of all articles having the property 'aid' or
'htmlUrl' (not, however, via 'title', as an '+Idx' index creates more
than one tree node per object).

If you need distinguished counts (e.g. for groups of articles or
according to certain features), it might be necessary to build more
indexes, or simply maintain counts during import.
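
Maintaining such a count during import can be as simple as this sketch
(the 'cnt' property on the feed object is just an example, not part of the
posted E/R):

   # Bump a per-feed article counter whenever an article is imported
   (put> Feed 'cnt (inc (or (get Feed 'cnt) 0)))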


> 2.) Somehow sorting by date so I get say the 25 first articles.

This is also best done with a dedicated index, e.g.

   (rel dat (+Ref +Date))

in '+Article'. Then you could specify a reversed range (T . NIL) for a
pilog query

   (? (db dat +Article (T . NIL) @Article) (show @Article))

This will start with the newest article, and step backwards. Even easier
might be if you specify a range of dates, say from today till one week
ago. Then you could use 'collect'

   (collect 'dat '+Article (date) (- (date) 7))

or, as 'today' is not very informative,

   (collect 'dat '+Article T (- (date) 7))


> When searching for articles belonging to a certain feed containing a word in
> the content I now let the distributed indexes return all articles and then I
> simply use filter to get at the articles. And to do that I of course need to
> fetch all the articles in a certain feed, which works fine for most feeds
> but not Twitter as it now probably contains more than 10 000 articles.

I think that usually it should not be necessary to fetch all articles,
if you build a combined query with the 'select/3' predicate.


> The only solution I can see to this is to simply store the feed -> article
> mapping remotely too, ie each word index server contains this info too for
> ...
> Then I could simply filter by feed remotely.

Not sure. But I feel that I would use distributed processing here only
if there is no other way (i.e. the parallel search with 'select/3').

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: Scaling issue

2010-04-11 Thread Henrik Sarvell
I see, I should've known about that one (I'm using it to get similar
articles already).

What's additionally needed is:

1.) Calculating total count somehow without retrieving all articles.

2.) Somehow sorting by date so I get say the 25 first articles.

If those two can also be achieved in a manner that won't require me to fetch
all articles, then I can use Pilog in this way to fetch the results when
it comes to getting all articles under all feeds under a specific tag. At
the moment I'm fetching all of them at once and using 'head', which is not
optimal.

However, it won't work with the word indexes; a redesign of how the system
works is needed, I think.

When searching for articles belonging to a certain feed containing a word in
the content I now let the distributed indexes return all articles and then I
simply use filter to get at the articles. And to do that I of course need to
fetch all the articles in a certain feed, which works fine for most feeds
but not Twitter as it now probably contains more than 10 000 articles.

The only solution I can see to this is to simply store the feed -> article
mapping remotely too, i.e. each word index server contains this info too for
the articles it's mapping, resulting in an E/R section looking like this:

(class +WordCount +Entity) #
(rel article   (+Ref +Number))
(rel word  (+Aux +Ref +Number) (article))
(rel count (+Number))

(class +ArFeLink +Entity)
(rel article   (+Aux +Ref +Number) (feed))
(rel feed  (+Ref +Number))

Then I could simply filter by feed remotely.

/Henrik


On Sun, Apr 11, 2010 at 9:25 AM, Alexander Burger wrote:

> Hi Henrik,
>
> > (class +ArFeLink +Entity)
> > (rel article   (+Aux +Ref +Link) (feed) NIL (+Article))
> > (rel feed  (+Ref +Link) NIL (+Feed))
> >
> > (collect 'feed '+ArFeLink Obj Obj 'article) takes forever (2 mins) I need
> it
> > to take something like maximum 2 seconds...
> >
> > Can this be fixed by adding some index or key or do I need make this part
> of
> > the DB distributed and chopped up so I can run this in parallel?
>
> This is already the proper index. Is it perhaps the case that there are
> simply too many articles fetched at once? How many articles does the
> above 'collect' return? And are these articles all needed at that time?
>
> If you talk about 2 seconds, I assume you don't want the user having to
> wait, so it is a GUI interaction. In such cases it is typical not to
> fetch all data from the DB, but only the first chunk e.g. to display
> them in the GUI. It would be better then to use a Pilog query, returning
> the results one by one (as done in +QueryChart).
>
> To get results analogous to the above 'collect', you could create a query
> like
>
>   (let Q
>      (goal
>         (quote
>            @Obj Obj
>            (db feed +ArFeLink @Obj @Feed)
>            (val @Article @Feed article) ) )
>      ...
>      (do 20                   # Then fetch the first 20 articles
>         (NIL (prove Q))       # More?
>         (bind @               # Bind the result values
>            (println @Article) # Use the article
>            ...
>
> Instead of 'bind' you could also simply use 'get' to extract the
> @Article: (get @ '@Article).
>
> Before doing so, I would test it interactively, e.g.
>
> : (? (db feed +ArFeLink {ART} @Feed) (val @Article @Feed article))
>
> if '{ART}' is an article.
>
> Note that the above is not tested.
>
> Cheers,
> - Alex
> --
> UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe
>


Re: Scaling issue

2010-04-11 Thread Alexander Burger
Hi Henrik,

> (class +ArFeLink +Entity)
> (rel article   (+Aux +Ref +Link) (feed) NIL (+Article))
> (rel feed  (+Ref +Link) NIL (+Feed))
> 
> (collect 'feed '+ArFeLink Obj Obj 'article) takes forever (2 mins) I need it
> to take something like maximum 2 seconds...
> 
> Can this be fixed by adding some index or key or do I need make this part of
> the DB distributed and chopped up so I can run this in parallel?

This is already the proper index. Is it perhaps the case that there are
simply too many articles fetched at once? How many articles does the
above 'collect' return? And are these articles all needed at that time?

If you talk about 2 seconds, I assume you don't want the user having to
wait, so it is a GUI interaction. In such cases it is typical not to
fetch all data from the DB, but only the first chunk e.g. to display
them in the GUI. It would be better then to use a Pilog query, returning
the results one by one (as done in +QueryChart).

To get results analogous to the above 'collect', you could create a query
like

   (let Q
      (goal
         (quote
            @Obj Obj
            (db feed +ArFeLink @Obj @Feed)
            (val @Article @Feed article) ) )
      ...
      (do 20                   # Then fetch the first 20 articles
         (NIL (prove Q))       # More?
         (bind @               # Bind the result values
            (println @Article) # Use the article
            ...

Instead of 'bind' you could also simply use 'get' to extract the
@Article: (get @ '@Article).

Before doing so, I would test it interactively, e.g.

: (? (db feed +ArFeLink {ART} @Feed) (val @Article @Feed article))

if '{ART}' is an article.

Note that the above is not tested.

Cheers,
- Alex
-- 
UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe