date:20091103

Re: +Idx problems maybe?

2009-11-03 Thread Henrik Sarvell

I'll try with the one you suggested, thanks for the clarifications!

/Henrik

On Tue, Nov 3, 2009 at 8:38 AM, Alexander Burger a...@software-lab.de wrote:
 Hi Henrik,

 I took a look at the pilog file. I already get what same and range are
 doing, but what are part, head and fold doing?

 You are on the right track. You used 'tolr', but this actually makes
 sense only in combination with the '+Sn' (Soundex) prefix. The whole
 matter is rather complicated, because there are so many combinations of
 index types and Pilog comparison functions possible.


 I would say that we have the following typical use cases for string
 searches (I'll leave out numerical searches, which usually combine with
 'same' or 'range').

 1. Exact searches. You have either a unique index

      (rel key (+Key +String))

   or a non-unique index

      (rel key (+Ref +String))

   and you can compare results in Pilog with

      (same @Str @Cls key)

   for exact matches, or with

      (head @Str @Cls key)

   for dictionary searches (searching only for the beginning of
   strings). These are case-sensitive searches.


 2. Folded searches. They make use of the 'fold' function which keeps
   only letters, converted to lower case, and digits.

      (rel key (+Fold +Ref +String))
      ...
      (fold @Str @Cls key)

   This searches only for the beginning of strings. We use it typically
   for telephone numbers.


   If a search for individual words in a key is desired, we can use

      (rel key (+List +Fold +Ref +String))
      ...
      (fold @Str @Cls key)

   This stores only the strings in the list (not the substrings) in
   'fold'ed representation. So each word can be found by dictionary
   search. This requires changes to the GUI and import functions,
   though, as 'key' is not a string but a list of strings.


   Finally, we can also index folded substrings:

      (rel key (+Fold +Idx +String))
      ...
      (part @Str @Cls key)

   This is perhaps what you need. If you go for it, I'd recommend you
   download once more the latest testing release, as the 'part' function
   was changed recently.


 3. Tolerant searches. They return first all exact (case-sensitive)
   matches of partial strings, and then the matches according to the
   soundex algorithm (the first letter is compared exactly
   (case-sensitive), the rest checks for similarity). This mainly makes
   sense for personal names.

      (rel key (+Sn +Idx +String))
      ...
      (tolr @Str @Cls key)
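
 To illustrate the folding used in case 2, here is a minimal sketch. It
 assumes a stock PicoLisp where 'fold' is built in; the phone number is
 made up for the example.

```picolisp
# 'fold' keeps only letters (converted to lower case) and digits
(println (fold " 1A 2-b/3"))           # prints "1a2b3"

# A made-up phone number: spaces and punctuation are dropped,
# only the digits remain
(println (fold "+49 (0)821 123-45"))
```

 A search string folded the same way matches the stored key no matter
 how the user typed the separators.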


 Concerning space consumption, the '+Key' and '+Ref' indexes are the most
 economical ones. They create only a single entry in the index tree per
 key.

 Then follow the '+List +Ref +String' indexes, which create an entry per
 word.

 Most space-hungry are the '+Idx' indexes, as they create an entry for
 each substring down to a length of three, and '+Sn' adds one more for
 the soundex key.
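
 To get a feeling for the '+Idx' cost, the following rough sketch lists
 candidate entries for a single key. It assumes that "substring" here
 means every tail of the key that is at least three characters long;
 'tails' is a hypothetical helper, not part of the PicoLisp library, and
 the real '+Idx' code may differ in details.

```picolisp
# Hypothetical helper: all tails of Str with at least Min characters
(de tails (Str Min)
   (make
      (for (L (chop Str)  (nth L Min)  (cdr L))
         (link (pack L)) ) ) )

(println (tails "Burger" 3))   # prints ("Burger" "urger" "rger" "ger")
```

 Under that assumption a six-character key produces four entries, where
 a '+Key' or '+Ref' index would store one.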

 Cheers,
 - Alex
 --
 UNSUBSCRIBE: mailto:picol...@software-lab.de?subject=unsubscribe


Re: 64bit segmentation fault when matching on long lists

2009-11-03 Thread Henrik Sarvell

I started with this approach yesterday, first in order to capture the
feed type, which I am now able to do.

I noticed that some RSS feeds have attributes in their item tags,
therefore the above won't work 100% of the time.

(in rss.xml
   (while (from item)
      (println
         (make
            (loop
               (NIL (chain (till )))
               (char)
               (T (tail '`(chop item) @)) ) ) ) ) )

This will accurately capture the item tag all the time, I think, but
then we need some way of discarding the attributes and the closing .
I tried with an immediate (till ) after the (from), but it didn't
have the intended result. Any suggestions here?

/Henrik


On Sun, Nov 1, 2009 at 6:26 PM, Alexander Burger a...@software-lab.de wrote:
 On Sun, Nov 01, 2009 at 01:49:59PM +0100, Henrik Sarvell wrote:
 It's a good question with a very simple answer, many many feeds out
 there are completely broken, sometimes they don't conform to
 standards, that's a good scenario but often they have unmatched tags
 or unclosed attributes.

 Ouch. I see.

 So what do you think about the following:

 (while (from item)
    (println                               # Instead of printing
       (make                               # do further matching
          (loop
             (NIL (chain (till )))         # Collect until next tag
             (char)                        # Skip ''
             (T (tail '`(chop item) @)) ) ) ) )  # See if we got item

 The 'make' will give you smaller chunks of data, which are easier to
 'match'.

 Cheers,
 - Alex
