Re: Scaling issue

2010-05-20 Thread Henrik Sarvell
I've summed up the result of this thread here:
http://picolisp.com/5000/-2-I.html with some explanations.

/Henrik



On Fri, May 14, 2010 at 8:59 AM, Henrik Sarvell hsarv...@gmail.com wrote:
 OK since I can't rely on sorting by date anyway let's forget that idea.

 Yes, since it seemed I had to involve dates anyway, I simply chose a
 date far enough back in time that if someone is looking for something
 older than that they might as well use Google.

 Anyway the above is scanning 19 remotes containing indexes for 10 000
 articles each and returns in 3-4 seconds which is OK for me, problem
 solved as far as I'm concerned. I have to add though that all remotes
 are currently on the same machine, had they been truly distributed it
 would be faster, especially if the other machines were in the same
 data center.

 On Fri, May 14, 2010 at 7:55 AM, Alexander Burger a...@software-lab.de 
 wrote:
 On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
 One thing first though, since articles are indexed when they're parsed
 and PL isn't doing any kind of sorting automatically on insert then
 they should be sorted by date automatically with the latest articles
 at the end of the database file since I suppose they're just appended?

 While this is correct in principle, I would not rely on it. If there
 should ever be an object deleted from that database file, the space
 would be reused by the next new object, and the assumption would break.


 How can I simply start walking from the end of the file until I've
 found say 25 matches? This procedure should be the absolutely fastest
 way of getting what I want.

 Currently I see no easy way. The only function that walks a database
 file directly is 'seq', but it can only step forwards.


 I know about your iter example earlier and it seems like a good fit if
 it starts walking in the right end?

 Yes, 'iter' (and the related 'scan') can walk in both directions. You
 need only to pass inverted keys (i.e. Beg > End).


 If I understand it right, however, you solved the problem in your next
 mail(s) by using the date index, and starting at 6 months ago?

 Cheers,
 - Alex




Re: Scaling issue

2010-05-14 Thread Alexander Burger
On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
 One thing first though, since articles are indexed when they're parsed
 and PL isn't doing any kind of sorting automatically on insert then
 they should be sorted by date automatically with the latest articles
 at the end of the database file since I suppose they're just appended?

While this is correct in principle, I would not rely on it. If there
should ever be an object deleted from that database file, the space
would be reused by the next new object, and the assumption would break.


 How can I simply start walking from the end of the file until I've
 found say 25 matches? This procedure should be the absolutely fastest
 way of getting what I want.

Currently I see no easy way. The only function that walks a database
file directly is 'seq', but it can only step forwards.


 I know about your iter example earlier and it seems like a good fit if
 it starts walking in the right end?

Yes, 'iter' (and the related 'scan') can walk in both directions. You
need only to pass inverted keys (i.e. Beg > End).
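
For illustration, a minimal (untested) sketch of such an inverted-key walk,
using the 'dat' (+Ref +Date) index on '+Article' suggested elsewhere in this
thread; the six-month bound is just a placeholder:

   (iter (tree 'dat '+Article)        # Walk the date index newest-first
      '((Obj)
         (println (get Obj 'dat) Obj) )
      (cons (date) T)                 # Beg: today, largest key for a '+Ref'
      (cons (- (date) 183)) )         # End: roughly six months back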


If I understand it right, however, you solved the problem in your next
mail(s) by using the date index, and starting at 6 months ago?

Cheers,
- Alex


Re: Scaling issue

2010-05-14 Thread Henrik Sarvell
OK since I can't rely on sorting by date anyway let's forget that idea.

Yes, since it seemed I had to involve dates anyway, I simply chose a
date far enough back in time that if someone is looking for something
older than that they might as well use Google.

Anyway the above is scanning 19 remotes containing indexes for 10 000
articles each and returns in 3-4 seconds which is OK for me, problem
solved as far as I'm concerned. I have to add though that all remotes
are currently on the same machine, had they been truly distributed it
would be faster, especially if the other machines were in the same
data center.

On Fri, May 14, 2010 at 7:55 AM, Alexander Burger a...@software-lab.de wrote:
 On Thu, May 13, 2010 at 09:12:06PM +0200, Henrik Sarvell wrote:
 One thing first though, since articles are indexed when they're parsed
 and PL isn't doing any kind of sorting automatically on insert then
 they should be sorted by date automatically with the latest articles
 at the end of the database file since I suppose they're just appended?

 While this is correct in principle, I would not rely on it. If there
 should ever be an object deleted from that database file, the space
 would be reused by the next new object, and the assumption would break.


 How can I simply start walking from the end of the file until I've
 found say 25 matches? This procedure should be the absolutely fastest
 way of getting what I want.

 Currently I see no easy way. The only function that walks a database
 file directly is 'seq', but it can only step forwards.


 I know about your iter example earlier and it seems like a good fit if
 it starts walking in the right end?

 Yes, 'iter' (and the related 'scan') can walk in both directions. You
 need only to pass inverted keys (i.e. Beg > End).


 If I understand it right, however, you solved the problem in your next
 mail(s) by using the date index, and starting at 6 months ago?

 Cheers,
 - Alex



Re: Scaling issue

2010-05-13 Thread Henrik Sarvell
Everything is running smoothly now. I intend to make a write-up about this
on the wiki, maybe this weekend.

One thing first though, since articles are indexed when they're parsed
and PL isn't doing any kind of sorting automatically on insert then
they should be sorted by date automatically with the latest articles
at the end of the database file since I suppose they're just appended?

How can I simply start walking from the end of the file until I've
found say 25 matches? This procedure should be the absolutely fastest
way of getting what I want.

I know about your iter example earlier and it seems like a good fit if
it starts walking in the right end?




On Tue, May 11, 2010 at 9:09 AM, Alexander Burger a...@software-lab.de wrote:
 On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
 My code simply stops executing (as if waiting for the next entry but
 it never gets it) when I run out of entries to fetch, really strange
 and a traceAll confirms this, the last output is a call to rd1.

 What happens on the remote side, after all entries are sent? If the
 remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
 it is done.


 This is my rd1:

 (dm rd1 (Sock)
    (or
       (in Sock (rd))
       (nil
          (close Sock) ) ) )

 This looks all right, but isn't obviously the problem, as it hangs in
 'rd'.


 (de getArticles (W)
    (for Wc (sortBy '+Gh (collect 'word '+WordCount W) 'picoStamp)
       (pr (cons (; Wc article) (; Wc picoStamp)))
       (unless (flush) (bye)) ) )

 What happens if you do (bye) after the 'for' loop is done?

 I assume that 'getArticles' is executed in the (eval @) below


 (task (port (+ *IdxNum 4040))
    (let? Sock (accept @)
       (unless (fork)
          (in Sock
             (while (rd)
                (sync)
                (out Sock
                   (eval @) ) ) )
          (bye) )
       (close Sock) ) )

 This looks OK, because (bye) is called after the while loop is done.
 Perhaps there is something in the way 'getArticles' is invoked here? You
 could change the second last line to (! bye) and see if it is indeed
 reached. I would suspect it isn't.

 Cheers,
 - Alex



Re: Scaling issue

2010-05-13 Thread Henrik Sarvell
See my prior post for context.

I've been testing a few different approaches and this is the fastest so far:

(de getArticles (W)
   (let Goal
      (goal
         (quote
            @Word W
            (select (@Wcs)
               ((word +WordCount @Word))
               (same @Word @Wcs word) ) ) )
      (do 25
         (NIL (prove Goal))
         (bind @
            (pr (cons (; @Wcs article) (; @Wcs picoStamp)))
            (unless (flush) (bye)) ) ) )
   (bye) )

Where the remote ER is:

(class +WordCount +Entity) #
(rel article   (+Ref +Number))
(rel word  (+Aux +Ref +Number) (article))
(rel count (+Number))
(rel picoStamp (+Ref +Number))



On Thu, May 13, 2010 at 9:12 PM, Henrik Sarvell hsarv...@gmail.com wrote:
 Everything is running smoothly now. I intend to make a write-up about this
 on the wiki, maybe this weekend.

 One thing first though, since articles are indexed when they're parsed
 and PL isn't doing any kind of sorting automatically on insert then
 they should be sorted by date automatically with the latest articles
 at the end of the database file since I suppose they're just appended?

 How can I simply start walking from the end of the file until I've
 found say 25 matches? This procedure should be the absolutely fastest
 way of getting what I want.

 I know about your iter example earlier and it seems like a good fit if
 it starts walking in the right end?




 On Tue, May 11, 2010 at 9:09 AM, Alexander Burger a...@software-lab.de wrote:
 On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
 My code simply stops executing (as if waiting for the next entry but
 it never gets it) when I run out of entries to fetch, really strange
 and a traceAll confirms this, the last output is a call to rd1.

 What happens on the remote side, after all entries are sent? If the
 remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
 it is done.


 This is my rd1:

 (dm rd1 (Sock)
    (or
       (in Sock (rd))
       (nil
          (close Sock) ) ) )

 This looks all right, but isn't obviously the problem, as it hangs in
 'rd'.


 (de getArticles (W)
    (for Wc (sortBy '+Gh (collect 'word '+WordCount W) 'picoStamp)
       (pr (cons (; Wc article) (; Wc picoStamp)))
       (unless (flush) (bye)) ) )

 What happens if you do (bye) after the 'for' loop is done?

 I assume that 'getArticles' is executed in the (eval @) below


 (task (port (+ *IdxNum 4040))
    (let? Sock (accept @)
       (unless (fork)
          (in Sock
             (while (rd)
                (sync)
                (out Sock
                   (eval @) ) ) )
          (bye) )
       (close Sock) ) )

 This looks OK, because (bye) is called after the while loop is done.
 Perhaps there is something in the way 'getArticles' is invoked here? You
 could change the second last line to (! bye) and see if it is indeed
 reached. I would suspect it isn't.

 Cheers,
 - Alex




Re: Scaling issue

2010-05-11 Thread Alexander Burger
On Mon, May 10, 2010 at 11:50:52PM +0200, Henrik Sarvell wrote:
 My code simply stops executing (as if waiting for the next entry but
 it never gets it) when I run out of entries to fetch, really strange
 and a traceAll confirms this, the last output is a call to rd1.

What happens on the remote side, after all entries are sent? If the
remote doesn't 'close' (or 'bye'), then the receiving end doesn't know
it is done.


 This is my rd1:
 
 (dm rd1 (Sock)
    (or
       (in Sock (rd))
       (nil
          (close Sock) ) ) )

This looks all right, but isn't obviously the problem, as it hangs in
'rd'.


 (de getArticles (W)
    (for Wc (sortBy '+Gh (collect 'word '+WordCount W) 'picoStamp)
       (pr (cons (; Wc article) (; Wc picoStamp)))
       (unless (flush) (bye)) ) )

What happens if you do (bye) after the 'for' loop is done?
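
Spelled out, that variant would be roughly (a sketch based on the code above,
untested):

(de getArticles (W)
   (for Wc (sortBy '+Gh (collect 'word '+WordCount W) 'picoStamp)
      (pr (cons (; Wc article) (; Wc picoStamp)))
      (unless (flush) (bye)) )
   (bye) )  # terminate the child process, which also closes the connection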

I assume that 'getArticles' is executed in the (eval @) below


(task (port (+ *IdxNum 4040))
   (let? Sock (accept @)
      (unless (fork)
         (in Sock
            (while (rd)
               (sync)
               (out Sock
                  (eval @) ) ) )
         (bye) )
      (close Sock) ) )

This looks OK, because (bye) is called after the while loop is done.
Perhaps there is something in the way 'getArticles' is invoked here? You
could change the second last line to (! bye) and see if it is indeed
reached. I would suspect it isn't.

Cheers,
- Alex


Re: Scaling issue

2010-05-10 Thread Henrik Sarvell
Ah I see, so the issue is on the remote side then. What did your code
look like there? Did you use (prove)?



On Mon, May 10, 2010 at 7:22 AM, Alexander Burger a...@software-lab.de wrote:
 Hi Henrik,

 One final question, how did you define the rd1 mechanism?

 In the mentioned case, I used the following method in the +Agent class

    (dm rd1 (Sock)
       (when (assoc Sock (: socks))
          (rot (: socks) (index @ (: socks)))
          (ext (: ext)
             (or
                (in Sock (rd))
                (nil
                   (close Sock)
                   (pop (:: socks)) ) ) ) ) )

 This looks a little complicated, as each agent maintains a list of open
 sockets (in 'socks'). But if you omit the 'socks' management, it is
 basically just

    (ext (: ext) (in Sock (rd)))

 followed by 'close' if the remote side closed the connection.


 Simply doing:

 (dm rd1 (Sock)
    (in Sock (rd)))

 will read the whole result, not just the first result, won't it?

 This should not be the case. It depends on what the other side sends. If
 it sends a list, you'll get the whole list. In the examples we
 discussed, however, the query results were sent one by one.


 I'm a little bit confused since it says in the reference that rd will
 read the first item from the current input channel but when I look

 Yes, analogous to 'read', 'line', 'char' etc.

 Maybe something is needed on the remote? At the moment there is simply
 a collect and sort by there.

 Could it be that remote sends the result of 'collect'? This would be the
 whole list then.

 Cheers,
 - Alex



Re: Scaling issue

2010-05-10 Thread Alexander Burger
On Mon, May 10, 2010 at 09:04:48AM +0200, Henrik Sarvell wrote:
 Ah I see, so the issue is on the remote side then. What did your code
 look like there? Did you use (prove)?

There were several scenarios. In cases where only a few hits are to be
expected, I used 'collect':

   (for Obj (collect 'var '+Cls (...))
      (pr Obj)
      (unless (flush) (bye)) )

The 'flush' is there for two purposes: (1) to get the data sent
immediately (without holding it in a local buffer), and (2) to get
immediate feedback. If the receiving side closes the connection
(i.e. the GUI is not interested in more results, or the client has
quit), 'flush' returns NIL and the local query can be terminated.


In other cases, where there were potentially many hits (so that I didn't
want to use 'collect'), I used the low-level tree iteration function
'iter' (which is also used internally by 'collect'):

   (iter (tree 'var '+Cls)
      '((Obj)
         (pr Obj)
         (unless (flush) (bye)) )
      (cons From)
      (cons Till T) )
   (bye)

So 'iter' is quite efficient, as it avoids the overhead of Pilog, but
still can deliver an unlimited number of hits.

Note, however, that you have to pass the proper 'from' and 'till'
arguments. They must have the right structure of the index tree's key.
For a '+Key' index this would be simply 'From' and 'Till'. For a '+Ref'
(like in the shown case) it must be '(From . NIL)' and '(Till . T)'.
'db', 'collect' and the Pilog functions take care of such details
automatically.
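
To make the two key shapes concrete, an untested sketch using the 'aid' and
'dat' relations mentioned elsewhere in this thread (the range values are
arbitrary placeholders):

   # '+Key' index: plain keys
   (iter (tree 'aid '+Article) println 1 1000)

   # '+Ref' index: keys are (value . object) pairs
   (iter (tree 'dat '+Article) println
      (cons (- (date) 7))   # From, padded with NIL
      (cons (date) T) )     # Till, padded with T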


For more complex queries, involving more than one index, yes, I used Pilog
and 'prove'. Each call to 'prove' returns (and sends) a single object.


For plain Pilog queries, i.e. without any special requirements like a
defined sorting order, you can get along even without any custom
functions/methods on the remote side. The 'remote/2' predicate can
handle this transparently by executing its clauses on all remote
machines. I have examples for that, but they are probably beyond the
scope of this mail.

Cheers,
- Alex


Re: Scaling issue

2010-05-09 Thread Henrik Sarvell
One final question, how did you define the rd1 mechanism?

Simply doing:

(dm rd1 (Sock)
   (in Sock (rd)))

will read the whole result, not just the first result, won't it?

I'm a little bit confused since it says in the reference that rd will
read the first item from the current input channel but when I look
at my current usage of rd I get the feeling it will read the whole
result?

Maybe something is needed on the remote? At the moment there is simply
a collect and sort by there.

I hope I'm not too cryptic.

/Henrik




On Sun, Apr 25, 2010 at 5:08 PM, Henrik Sarvell hsarv...@gmail.com wrote:
 Ah so the key is to have the connections in a list, I should have understood
 that.

 Thanks for the help, I'll try it out!



 On Sun, Apr 25, 2010 at 4:51 PM, Alexander Burger a...@software-lab.de
 wrote:

 On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
  So I gather the *Ext mapping is absolutely necessary regardless of
  whether
  remote or ext is used.

 Yes.

  The only exception is if you do not intend to communicate whole objects between the
 remote and local application, but only scalar data like strings,
 numbers, or lists of those. I would say this would be quite a
 limitation. You need to communicate whole objects, at least because you
 want to compare them locally to find the biggest (see below).


  I took a look at the *Ext section again, could I use this maybe:
 
 (setq *Ext  # Define extension functions
 ...
                               (off Sock) ) ) ) ) ) ) ) )
    '(localhost localhost)
    '(4041 4042)
    (40 80) ) )

 Yes, that's good. The example in the docu was not sufficient, as it has
 a single port hard-coded.


 And then with *ext* I need to create that single look ahead queue in the
  local code you talked about earlier, but how?

 The look ahead queue of a single object per connection consisted simply of
 a list, the first result sent from each remote host.

 What I did was:

 1. Starting a new query, a list of connections to all remote hosts is
    opened:

       (extract
          '((Agent)
             (query Agent arguments) )
          (list of agents) )

    This returns a list of all agent objects who succeeded to connect. I
    used that list to initialize a Pilog query.

 2. Then you fetch the first answer from each connection. I used a method
    'rd1' in the agent class for that:

       (extract 'rd1 (list of open agents))

    'extract' is used here, as it behaves like 'mapcar' but filters all
    NIL items out of the result. A NIL item will be returned in the first
    'extract' if the connection cannot be opened, and in the second one
    if that remote host has no results to send.

    So now you have a list of results, the first (highest, biggest,
    newest?) object from each remote host.

 3. Now the main query loop starts. Each time a new result is requested,
    e.g. from the GUI, you just need to find the object with the highest,
    biggest, newest attribute in that list. You take it from the list
    (e.g. with 'prog1'), and immediately fill the slot in the list by
    calling 'rd1' for that host again.

    If that 'rd1' returns NIL, it means this remote host has no more
    results, so you delete it from the list of open agents. If it returns
    non-NIL, you store the read value into the slot.

 In that way, the list of received items constitutes a kind of look-ahead
 structure, always containing the items which might be returned next to
 the caller.


  I mean at the moment the problem is that I get too many articles in my
  local
  code since all the remotes send all their articles at once, thus
  swamping

 There cannot be any swamping. All remote processes will send their
 results, yes, but only until the TCP queue fills up, or until they have
 no more results. The local process doesn't see anything of that, it just
 fetches the next result with 'rd1' whenever it needs one.

 You don't have to worry at all whether the GUI calls for the next result
 50 times, or 1 times. Each time simply the next result is returned.
 This works well, and produces not more load than is necessary.

 Cheers,
 - Alex




Re: Scaling issue

2010-05-09 Thread Alexander Burger
Hi Henrik,

 One final question, how did you define the rd1 mechanism?

In the mentioned case, I used the following method in the +Agent class

   (dm rd1 (Sock)
      (when (assoc Sock (: socks))
         (rot (: socks) (index @ (: socks)))
         (ext (: ext)
            (or
               (in Sock (rd))
               (nil
                  (close Sock)
                  (pop (:: socks)) ) ) ) ) )

This looks a little complicated, as each agent maintains a list of open
sockets (in 'socks'). But if you omit the 'socks' management, it is
basically just

   (ext (: ext) (in Sock (rd)))

followed by 'close' if the remote side closed the connection.
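
Spelled out, that reduced form could look like this (an untested sketch, not
the method as actually used):

   (dm rd1 (Sock)
      (or
         (ext (: ext) (in Sock (rd)))   # read one item, mapping the remote's object IDs
         (nil (close Sock)) ) )         # end of input: close the socket and return NIL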


 Simply doing:
 
 (dm rd1 (Sock)
(in Sock (rd)))
 
 will read the whole result, not just the first result, won't it?

This should not be the case. It depends on what the other side sends. If
it sends a list, you'll get the whole list. In the examples we
discussed, however, the query results were sent one by one.


 I'm a little bit confused since it says in the reference that rd will
 read the first item from the current input channel but when I look

Yes, analogous to 'read', 'line', 'char' etc.

 Maybe something is needed on the remote? At the moment there is simply
 a collect and sort by there.

Could it be that remote sends the result of 'collect'? This would be the
whole list then.

Cheers,
- Alex


Re: Scaling issue

2010-04-25 Thread Henrik Sarvell
So I gather the *Ext mapping is absolutely necessary regardless of whether
remote or ext is used.

I took a look at the *Ext section again, could I use this maybe:

(setq *Ext  # Define extension functions
   (mapcar
      '((@Host @Port @Ext)
         (let Sock NIL
            (cons @Ext
               (curry (@Host @Ext Sock) (Obj)
                  (when (or Sock (setq Sock (connect @Host @Port)))
                     (ext @Ext
                        (out Sock (pr (cons 'qsym Obj)))
                        (prog1 (in Sock (rd))
                           (unless @
                              (close Sock)
                              (off Sock) ) ) ) ) ) ) ) )
      '(localhost localhost)
      '(4041 4042)
      (40 80) ) )

And then with *ext* I need to create that single look ahead queue in the
local code you talked about earlier, but how?

I mean at the moment the problem is that I get too many articles in my local
code since all the remotes send all their articles at once, thus swamping
the local process. I'll show you what I'm using now:

(dm evalAll @
   (let Result
      (make
         (for N (getMachine This localhost)
            (later (chain (cons void))
               (eval This N (rest)) ) ) )
      (wait 5000 (not (memq void Result)))
      Result ) )

(Note that this logic does not respect a multi machine environment, I will
add that when/if my current single machine is not enough.)

This one will evaluate code on all remotes and return all the results. If the
result contains let's say more than 10 000 articles I will choke as it is
now. That's why I need that single look ahead you talked about, but I don't
know how to implement it.

If it was just about returning the 25 newest articles I could have each
remote simply return its 25 newest ones and then sort again locally. In that
case I would get 50 back instead of 10 000. And when I want the next batch
of results, articles 25-50, I suppose I could return 50 from each remote,
but this is a very ugly solution that doesn't scale very well.




On Sun, Apr 25, 2010 at 12:05 PM, Alexander Burger a...@software-lab.de wrote:

 Hi Henrik,

  I've reviewed the '*Ext' part in the manual and I will need something
  different as I will have several nodes on each machine on different ports
  (starting with simply localhost). I suppose I could have simply modified it
  if I had had one node per machine?

 With node you mean a server process? What makes you think that the
 example limits it to one node? IIRC, the example is in fact a simplified
 version (perhaps too simplified?) of a system where there were many
 servers, of equal and different types, on each host.


  Anyway, what would the whole procedure you've described look like if I
 have
  two external nodes listening on 4041 and 4042 respectively but on
 localhost
  both of them, and the E/R in question looks like this?:
 
  (class +Article +Entity)
  (rel aid   (+Key +Number))
  (rel title (+String))
  (rel htmlUrl   (+Key +String)) #
  (rel body  (+Blob))
  (rel pubDate   (+Ref +Number))

 Side question: Is there a special reason why 'pubDate' is a '+Number'
 and not a '+Date'? Should work that way, though.


  In this case I want to fetch article 25 - 50 sorted by pubDate from both
  nodes

 Unfortunately, this cannot be achieved directly with an '+Aux' relation,
 because the article number and the date cannot be organized into a
 single index with a primary and secondary sorting criterion.

 There is no other way than fetching and then sorting them, I think:

   (by '((This) (: pubDate)) sort (collect 'aid '+Article 25 50))

 Thus, the send part from a node to the central server would be

   (for Article
      (by
         '((This) (: pubDate))
         sort
         (collect 'aid '+Article 25 50) )
      (pr Article)     # Send the article object
      (NIL (flush)) )  # Flush the socket

 The 'flush' is important, not so much to immediately send the data, but
 to detect whether the other side (the central server) has closed the
 connection, perhaps because it isn't interested in further data.

 'flush' returns NIL if it cannot send the data successfully, and thus
 causes the 'for' loop to terminate.



  So as far as I've understood it a (setq *Ext ... ) section is needed and
  then the specific logic described in your previous post in the form of
  something using *ext* or maybe *remote*?

 Yes. '*Ext' is necessary if remote objects are accessed locally.

 'remote' might be handy if Pilog is used for remote queries. This is not
 the case in the above example.

 But 'ext' is needed on the central server, with the proper offsets for
 the clients. This can be all encapsulated in the +Agent objects.

 Cheers,
 - Alex



Re: Scaling issue

2010-04-25 Thread Alexander Burger
On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
 So I gather the *Ext mapping is absolutely necessary regardless of whether
 remote or ext is used.

Yes.

The only exception is if you do not intend to communicate whole objects between the
remote and local application, but only scalar data like strings,
numbers, or lists of those. I would say this would be quite a
limitation. You need to communicate whole objects, at least because you
want to compare them locally to find the biggest (see below).


 I took a look at the *Ext section again, could I use this maybe:
 
 (setq *Ext  # Define extension functions
 ...
   (off Sock) ) ) ) ) ) ) ) )
   '(localhost localhost)
   '(4041 4042)
   (40 80) ) )

Yes, that's good. The example in the docu was not sufficient, as it has
a single port hard-coded.


 And then with *ext* I need to create that single look ahead queue in the
 local code you talked about earlier, but how?

The look ahead queue of a single object per connection consisted simply of
a list, the first result sent from each remote host.

What I did was:

1. Starting a new query, a list of connections to all remote hosts is
   opened:

      (extract
         '((Agent)
            (query Agent arguments) )
         (list of agents) )

   This returns a list of all agent objects who succeeded to connect. I
   used that list to initialize a Pilog query.

2. Then you fetch the first answer from each connection. I used a method
   'rd1' in the agent class for that:

      (extract 'rd1 (list of open agents))

   'extract' is used here, as it behaves like 'mapcar' but filters all
   NIL items out of the result. A NIL item will be returned in the first
   'extract' if the connection cannot be opened, and in the second one
   if that remote host has no results to send.

   So now you have a list of results, the first (highest, biggest,
   newest?) object from each remote host.

3. Now the main query loop starts. Each time a new result is requested,
   e.g. from the GUI, you just need to find the object with the highest,
   biggest, newest attribute in that list. You take it from the list
   (e.g. with 'prog1'), and immediately fill the slot in the list by
   calling 'rd1' for that host again.

   If that 'rd1' returns NIL, it means this remote host has no more
   results, so you delete it from the list of open agents. If it returns
   non-NIL, you store the read value into the slot.

In that way, the list of received items constitutes a kind of look-ahead
structure, always containing the items which might be returned next to
the caller.
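
A minimal (untested) sketch of that step-3 loop, under simplifying
assumptions: a global '*Look' holding (Agent . Obj) pairs, a 'pubDate'
property used for the ordering, and a hypothetical no-argument 'rd1>' method
that reads the next object from an agent's socket:

   (de nextResult ()
      (when *Look                                  # any agents still delivering?
         (let Best (maxi '((P) (; (cdr P) pubDate)) *Look)
            (prog1 (cdr Best)                      # return the newest buffered object
               (if (rd1> (car Best))               # refill that agent's slot
                  (con Best @)
                  (del Best '*Look) ) ) ) ) )      # agent exhausted: drop it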


 I mean at the moment the problem is that I get too many articles in my local
 code since all the remotes send all their articles at once, thus swamping

There cannot be any swamping. All remote processes will send their
results, yes, but only until the TCP queue fills up, or until they have
no more results. The local process doesn't see anything of that, it just
fetches the next result with 'rd1' whenever it needs one.

You don't have to worry at all whether the GUI calls for the next result
50 times, or 1 times. Each time simply the next result is returned.
This works well, and produces not more load than is necessary.

Cheers,
- Alex


Re: Scaling issue

2010-04-25 Thread Henrik Sarvell
Ah so the key is to have the connections in a list, I should have understood
that.

Thanks for the help, I'll try it out!



On Sun, Apr 25, 2010 at 4:51 PM, Alexander Burger a...@software-lab.de wrote:

 On Sun, Apr 25, 2010 at 03:17:55PM +0200, Henrik Sarvell wrote:
  So I gather the *Ext mapping is absolutely necessary regardless of
 whether
  remote or ext is used.

 Yes.

  The only exception is if you do not intend to communicate whole objects between the
 remote and local application, but only scalar data like strings,
 numbers, or lists of those. I would say this would be quite a
 limitation. You need to communicate whole objects, at least because you
 want to compare them locally to find the biggest (see below).


  I took a look at the *Ext section again, could I use this maybe:
 
  (setq *Ext  # Define extension functions
  ...
(off Sock) ) ) ) ) ) ) ) )
'(localhost localhost)
'(4041 4042)
(40 80) ) )

 Yes, that's good. The example in the docu was not sufficient, as it has
 a single port hard-coded.


  And then with *ext* I need to create that single look ahead queue in the
  local code you talked about earlier, but how?

 The look ahead queue of a single object per connection consisted simply of
 a list, the first result sent from each remote host.

 What I did was:

 1. Starting a new query, a list of connections to all remote hosts is
   opened:

  (extract
 '((Agent)
(query Agent arguments) )
 (list of agents) )

   This returns a list of all agent objects who succeeded to connect. I
   used that list to initialize a Pilog query.

 2. Then you fetch the first answer from each connection. I used a method
   'rd1' in the agent class for that:

  (extract 'rd1 (list of open agents))

   'extract' is used here, as it behaves like 'mapcar' but filters all
    NIL items out of the result. A NIL item will be returned in the first
    'extract' if the connection cannot be opened, and in the second one
   if that remote host has no results to send.

   So now you have a list of results, the first (highest, biggest,
   newest?) object from each remote host.

 3. Now the main query loop starts. Each time a new result is requested,
   e.g. from the GUI, you just need to find the object with the highest,
   biggest, newest attribute in that list. You take it from the list
   (e.g. with 'prog1'), and immediately fill the slot in the list by
   calling 'rd1' for that host again.

    If that 'rd1' returns NIL, it means this remote host has no more
   results, so you delete it from the list of open agents. If it returns
   non-NIL, you store the read value into the slot.

 In that way, the list of received items constitutes a kind of look-ahead
 structure, always containing the items which might be returned next to
 the caller.


  I mean at the moment the problem is that I get too many articles in my
 local
  code since all the remotes send all their articles at once, thus swamping

 There cannot be any swamping. All remote processes will send their
 results, yes, but only until the TCP queue fills up, or until they have
 no more results. The local process doesn't see anything of that, it just
 fetches the next result with 'rd1' whenever it needs one.

 You don't have to worry at all whether the GUI calls for the next result
 50 times, or 1 times. Each time simply the next result is returned.
 This works well, and produces not more load than is necessary.

 Cheers,
 - Alex



Re: Scaling issue

2010-04-20 Thread Henrik Sarvell
I've been reading up a bit on the remote stuff, I haven't made the articles
distributed yet but let's assume I have, with 10 000 articles per remote.
Let's also assume that I have remade the word indexes to now work with real
+Ref +Links on each remote that links words and articles (not simply numbers
for subsequent use with (id) locally).

So with the refs in place I could use the full remote logic to run pilog
queries on the remotes.

Now a search is made for all articles containing the word picolisp for
instance. I then need to be able to get an arbitrary slice back of the total
which needs to be sorted by time. I have a hard time understanding how this
can be achieved in any sensible way except through one of the following:

Central Command:

1.) The remotes are set up so that remote one contains the oldest articles,
remote two the second oldest articles and so on (this is the case naturally
as a new remote is spawned when the newest one is full).

2.) Each remote then returns how many articles it has that contain
picolisp. This is needed for the pagination anyway in order to display a
correct amount of page numbers and can be done pretty trivially through the
count tree mechanism described earlier in this thread.

3.) The local logic now determines which remote(s) should be queried in
order to get 25 correct articles, issues the queries to be executed remotely
and displays the returned articles.

If pagination is scrapped the total count is not needed; it's possible to
have a More Results button instead, and I'm fine with that kind of interface
too. In most cases the count is not important for the user anyway. In that
way the following might be possible:

Cascading:

1.) The newest remote is queried first and can quickly determine through
count tree that it has the requested articles, quickly fetches them and
returns them.

2.) If it doesn't contain them it will pass on the request to the second
newest remote which might contain all of the requested articles, or a subset
in which case the missing ones will be returned from the third newest remote
through the same mechanism.

3.) The end result is that the correct articles now end up in the first
remote which will return them to the local.

Did I miss something? Might this problem be solved in a cleverer way?

/Henrik






On Thu, Apr 15, 2010 at 12:55 PM, Henrik Sarvell hsarv...@gmail.com wrote:

 To simply be able to pass along simple commands like collect and db, i.e. the
 *Ext stuff was overkill. That approach works just fine except in this special
 case where there are thousands of articles in a feed.

 I'm planning to distribute the whole DB except users and what feeds they
 subscribe to. Everything else will be article centric and remote. I will
 also keep local records of which feeds have articles in which remote so I
 don't query remotes for nothing.





 On Thu, Apr 15, 2010 at 12:17 PM, Alexander Burger 
  a...@software-lab.de wrote:

 On Thu, Apr 15, 2010 at 09:12:18AM +0200, Henrik Sarvell wrote:
  On the other hand, if I'm to follow my own thinking to its logical
  conclusion I should make the articles distributed too, with blobs and
 all.

 What was the rationale to use object IDs instead of direct remote access
 via '*Ext'? I can't remember at the moment.





Re: Scaling issue

2010-04-15 Thread Henrik Sarvell
On the other hand, if I'm to follow my own thinking to its logical
conclusion I should make the articles distributed too, with blobs and all.


On Wed, Apr 14, 2010 at 9:51 PM, Henrik Sarvell hsarv...@gmail.com wrote:

 I don't know Alex, remember that we disconnected stuff, I'll paste the
 remote E/R again (all of it, there is nothing else on the remotes):


 (class +WordCount +Entity)
 (rel article   (+Ref +Number))
 (rel word  (+Aux +Ref +Number) (article))
 (rel count (+Number))

 The numbers here can then be used in the main app with (id) to actually
 locate the objects in question.

 Could the *Ext functionality still be used somehow? I have a hard time
 understanding how if I don't map the feed (parent) - article (child)
 relationship remotely, I mean at some point I will have to filter all
 retrieved articles against a set of articles fetched locally (all articles
 belonging to my Twitter feed), if I don't store the connections remotely.
 Storing the feed - article links remotely will let me avoid checking
 locally, and it's that check that is the bottleneck at the moment.

 I suppose you could find some clever way of speeding up the local
 filtering, at the moment I'm simply loading all Twitter articles with
 collect and then throwing away all remotely retrieved articles that are not
 in that list. However that just seems like a duct tape solution, even if it
 works to begin with it won't work for long.

 /Henrik



 On Sun, Apr 11, 2010 at 4:13 PM, Alexander Burger a...@software-lab.de wrote:

 On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote:
  Thanks Alex, I will go for the reversed range and check out
 select/3.

 Let me mention that since picoLisp-3.0.1 we have a separate
 documentation of 'select/3', in doc/select.html.





Re: Scaling issue

2010-04-15 Thread Henrik Sarvell
To simply be able to pass along simple commands like collect and db, i.e. the
*Ext stuff was overkill. That approach works just fine except in this special
case where there are thousands of articles in a feed.

I'm planning to distribute the whole DB except users and what feeds they
subscribe to. Everything else will be article centric and remote. I will
also keep local records of which feeds have articles in which remote so I
don't query remotes for nothing.




On Thu, Apr 15, 2010 at 12:17 PM, Alexander Burger a...@software-lab.de wrote:

 On Thu, Apr 15, 2010 at 09:12:18AM +0200, Henrik Sarvell wrote:
  On the other hand, if I'm to follow my own thinking to its logical
  conclusion I should make the articles distributed too, with blobs and
 all.

 What was the rationale to use object IDs instead of direct remote access
 via '*Ext'? I can't remember at the moment.



Re: Scaling issue

2010-04-14 Thread Henrik Sarvell
I don't know Alex, remember that we disconnected stuff, I'll paste the
remote E/R again (all of it, there is nothing else on the remotes):

(class +WordCount +Entity)
(rel article   (+Ref +Number))
(rel word  (+Aux +Ref +Number) (article))
(rel count (+Number))

The numbers here can then be used in the main app with (id) to actually
locate the objects in question.

Could the *Ext functionality still be used somehow? I have a hard time
understanding how if I don't map the feed (parent) - article (child)
relationship remotely, I mean at some point I will have to filter all
retrieved articles against a set of articles fetched locally (all articles
belonging to my Twitter feed), if I don't store the connections remotely.
Storing the feed - article links remotely will let me avoid checking
locally, and it's that check that is the bottleneck at the moment.

I suppose you could find some clever way of speeding up the local filtering,
at the moment I'm simply loading all Twitter articles with collect and then
throwing away all remotely retrieved articles that are not in that list.
However that just seems like a duct tape solution, even if it works to begin
with it won't work for long.

/Henrik


On Sun, Apr 11, 2010 at 4:13 PM, Alexander Burger a...@software-lab.de wrote:

 On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote:
  Thanks Alex, I will go for the reversed range and check out select/3.

 Let me mention that since picoLisp-3.0.1 we have a separate
 documentation of 'select/3', in doc/select.html.



Re: Scaling issue

2010-04-11 Thread Alexander Burger
Hi Henrik,

 (class +ArFeLink +Entity)
 (rel article   (+Aux +Ref +Link) (feed) NIL (+Article))
 (rel feed  (+Ref +Link) NIL (+Feed))
 
 (collect 'feed '+ArFeLink Obj Obj 'article) takes forever (2 mins) I need it
 to take something like maximum 2 seconds...
 
 Can this be fixed by adding some index or key or do I need make this part of
 the DB distributed and chopped up so I can run this in parallel?

This is already the proper index. Is it perhaps the case that there are
simply too many articles fetched at once? How many articles does the
above 'collect' return? And are these articles all needed at that time?

If you talk about 2 seconds, I assume you don't want the user having to
wait, so it is a GUI interaction. In such cases it is typical not to
fetch all data from the DB, but only the first chunk e.g. to display
them in the GUI. It would be better then to use a Pilog query, returning
the results one by one (as done in +QueryChart).

To get results analogous to the above 'collect', you could create a query
like

   (let Q
      (goal
         (quote
            @Obj Obj
            (db feed +ArFeLink @Obj @Feed)
            (val @Article @Feed article) ) )
      ...
      (do 20                    # Then fetch the first 20 articles
         (NIL (prove Q))        # More?
         (bind @                # Bind the result values
            (println @Article)  # Use the article
            ...

Instead of 'bind' you could also simply use 'get' to extract the
@Article: (get @ '@Article).

Before doing so, I would test it interactively, e.g.

: (? (db feed +ArFeLink {ART} @Feed) (val @Article @Feed article))

if '{ART}' is an article.

Note that the above is not tested.

Cheers,
- Alex


Re: Scaling issue

2010-04-11 Thread Henrik Sarvell
I see, I should've known about that one (I'm using it to get similar
articles already).

What's additionally needed is:

1.) Calculating total count somehow without retrieving all articles.

2.) Somehow sorting by date so I get say the 25 first articles.

If those two can also be achieved in a manner that won't require me to fetch
all articles then I can use Pilog in this manner to fetch the results when
it comes to getting all articles under all feeds under a specific tag. At
the moment I'm fetching all of them at once and using head, not optimal.

However, it won't work with the word indexes, a redesign of how the system
works is needed I think.

When searching for articles belonging to a certain feed containing a word in
the content I now let the distributed indexes return all articles and then I
simply use filter to get at the articles. And to do that I of course need to
fetch all the articles in a certain feed, which works fine for most feeds
but not Twitter as it now probably contains more than 10 000 articles.

The only solution I can see to this is to simply store the feed - article
mapping remotely too, i.e. each word index server contains this info too for
the articles it's mapping, resulting in an E/R section looking like this:

(class +WordCount +Entity) #
(rel article   (+Ref +Number))
(rel word  (+Aux +Ref +Number) (article))
(rel count (+Number))

(class +ArFeLink +Entity)
(rel article   (+Aux +Ref +Number) (feed))
(rel feed  (+Ref +Number))

Then I could simply filter by feed remotely.

/Henrik


On Sun, Apr 11, 2010 at 9:25 AM, Alexander Burger a...@software-lab.de wrote:

 Hi Henrik,

  (class +ArFeLink +Entity)
  (rel article   (+Aux +Ref +Link) (feed) NIL (+Article))
  (rel feed  (+Ref +Link) NIL (+Feed))
 
  (collect 'feed '+ArFeLink Obj Obj 'article) takes forever (2 mins) I need
 it
  to take something like maximum 2 seconds...
 
  Can this be fixed by adding some index or key or do I need make this part
 of
  the DB distributed and chopped up so I can run this in parallel?

 This is already the proper index. Is it perhaps the case that there are
  simply too many articles fetched at once? How many articles does the
 above 'collect' return? And are these articles all needed at that time?

 If you talk about 2 seconds, I assume you don't want the user having to
 wait, so it is a GUI interaction. In such cases it is typical not to
 fetch all data from the DB, but only the first chunk e.g. to display
 them in the GUI. It would be better then to use a Pilog query, returning
 the results one by one (as done in +QueryChart).

 To get results analogous to the above 'collect', you could create a query
 like

    (let Q
       (goal
          (quote
             @Obj Obj
             (db feed +ArFeLink @Obj @Feed)
             (val @Article @Feed article) ) )
       ...
       (do 20                    # Then fetch the first 20 articles
          (NIL (prove Q))        # More?
          (bind @                # Bind the result values
             (println @Article)  # Use the article
             ...

 Instead of 'bind' you could also simply use 'get' to extract the
 @Article: (get @ '@Article).

 Before doing so, I would test it interactively, e.g.

 : (? (db feed +ArFeLink {ART} @Feed) (val @Article @Feed article))

 if '{ART}' is an article.

 Note that the above is not tested.

 Cheers,
 - Alex



Re: Scaling issue

2010-04-11 Thread Alexander Burger
On Sun, Apr 11, 2010 at 12:25:42PM +0200, Henrik Sarvell wrote:
 What's additionally needed is:
 
 1.) Calculating total count somehow without retrieving all articles.

If it is simply the count of all articles in the DB, you can get it
directly from a '+Key' or '+Ref' index. I don't quite remember the E/R
model, but I found this in an old mail:

   (class +Article +Entity)
   (rel aid   (+Key +Number))
   (rel title (+Idx +String))
   (rel htmlUrl   (+Key +String))

With that, (count (tree 'aid '+Article)) or (count (tree 'htmlUrl
'+Article)) will give all articles having the property 'aid' or
'htmlUrl' (not, however, via 'title', as an '+Idx' index creates more
than one tree node per object).

If you need distinguished counts (e.g. for groups of articles or
according to certain features), it might be necessary to build more
indexes, or simply maintain counts during import.


 2.) Somehow sorting by date so I get say the 25 first articles.

This is also best done with a dedicated index, e.g.

   (rel dat (+Ref +Date))

in '+Article'. Then you could specify a reversed range (T . NIL) for a
pilog query

   (? (db dat +Article (T . NIL) @Article) (show @Article))

This will start with the newest article, and step backwards. Even easier
might be if you specify a range of dates, say from today till one week
ago. Then you could use 'collect'

   (collect 'dat '+Article (date) (- (date) 7))

or, as 'today' is not very informative,

   (collect 'dat '+Article T (- (date) 7))


 When searching for articles belonging to a certain feed containing a word in
 the content I now let the distributed indexes return all articles and then I
 simply use filter to get at the articles. And to do that I of course need to
 fetch all the articles in a certain feed, which works fine for most feeds
 but not Twitter as it now probably contains more than 10 000 articles.

I think that usually it should not be necessary to fetch all articles,
if you build a combined query with the 'select/3' predicate.


 The only solution I can see to this is to simply store the feed - article
 mapping remotely too, ie each word index server contains this info too for
 ...
 Then I could simply filter by feed remotely.

Not sure. But I feel that I would use distributed processing here only
if there is no other way (i.e. the parallel search with 'select/3').

Cheers,
- Alex


Re: Scaling issue

2010-04-11 Thread Henrik Sarvell
Thanks Alex, I will go for the reversed range and check out select/3.

I'm already using collect with dates extensively but in this case it
wouldn't work as I need the 25 newest regardless of exactly when they were
published.

/Henrik

On Sun, Apr 11, 2010 at 1:27 PM, Alexander Burger a...@software-lab.de wrote:

 On Sun, Apr 11, 2010 at 12:25:42PM +0200, Henrik Sarvell wrote:
  What's additionally needed is:
 
  1.) Calculating total count somehow without retrieving all articles.

 If it is simply the count of all articles in the DB, you can get it
 directly from a '+Key' or '+Ref' index. I don't quite remember the E/R
 model, but I found this in an old mail:

   (class +Article +Entity)
   (rel aid   (+Key +Number))
   (rel title (+Idx +String))
   (rel htmlUrl   (+Key +String))

 With that, (count (tree 'aid '+Article)) or (count (tree 'htmlUrl
 '+Article)) will give all articles having the property 'aid' or
 'htmlUrl' (not, however, via 'title', as an '+Idx' index creates more
 than one tree node per object).

 If you need distinguished counts (e.g. for groups of articles or
 according to certain features), it might be necessary to build more
 indexes, or simply maintain counts during import.


  2.) Somehow sorting by date so I get say the 25 first articles.

 This is also best done with a dedicated index, e.g.

   (rel dat (+Ref +Date))

 in '+Article'. Then you could specify a reversed range (T . NIL) for a
 pilog query

   (? (db dat +Article (T . NIL) @Article) (show @Article))

 This will start with the newest article, and step backwards. Even easier
 might be if you specify a range of dates, say from today till one week
 ago. Then you could use 'collect'

   (collect 'dat '+Article (date) (- (date) 7))

 or, as 'today' is not very informative,

   (collect 'dat '+Article T (- (date) 7))


  When searching for articles belonging to a certain feed containing a word
 in
  the content I now let the distributed indexes return all articles and
 then I
  simply use filter to get at the articles. And to do that I of course need
 to
  fetch all the articles in a certain feed, which works fine for most feeds
  but not Twitter as it now probably contains more than 10 000 articles.

 I think that usually it should not be necessary to fetch all articles,
 if you build a combined query with the 'select/3' predicate.


  The only solution I can see to this is to simply store the feed -
 article
  mapping remotely too, ie each word index server contains this info too
 for
  ...
  Then I could simply filter by feed remotely.

 Not sure. But I feel that I would use distributed processing here only
 if there is no other way (i.e. the parallel search with 'select/3').

 Cheers,
 - Alex



Re: Scaling issue

2010-04-11 Thread Alexander Burger
On Sun, Apr 11, 2010 at 02:19:23PM +0200, Henrik Sarvell wrote:
 Thanks Alex, I will go for the reversed range and check out select/3.

Let me mention that since picoLisp-3.0.1 we have a separate
documentation of 'select/3', in doc/select.html.