Re: database setup in language learning

2014-07-22 Thread Alexander Burger

Hi Steven,

> I have decided that man's best friend is 1) a dog, 2) hexdump. ;)

Indeed! Though in this case, when exploring the structure of a PicoLisp
database, I would recommend 'show' and 'edit'.

Perhaps you already found http://picolisp.com/wiki/?usingedit ? It
explains this specifically in the DB context.


> I'm a bit confused with which is the "entity object" and which are the
> "B-Tree nodes".

Entities are objects of class '+Wrds', for example. Each such object may
grow to arbitrary size, has the list of its classes in the value, and
relations in the properties.

B-Tree nodes, on the other hand, are not objects in the OOP-sense. They
store the tree data in their value (the property list is empty), and
keep a fixed size (as determined by the block size of that DB file)
and split into two if this size is exceeded.

It doesn't matter in which file these symbols reside. A single DB file
may hold both entities and tree nodes, though often trees are in their
own files because of different block size requirements.


> I think I will need eventually is for the Wrds file to hold lemmas, so
> objects that are semantically unique. And before had used a List prefix
> as a way of holding the article ("Teil" "r") ("Teil" "s") for that
> reason. Later there will be verbs which can be reflexive or not. And
> these are also part of the word semantically, are different lemmas.

Interesting!


> > In all (+List +String) cases it is a bit tedious to handle these
> > data in the GUI. Note that you can use +ListTextField to map a
> > single text field to a list of strings.
> 
> This in any case is a must, and if it is easier for the GUI to use and
> to reformat a simple string (for display) then just do that. And so
> maybe stick with just the (rel wrd (+Key +String))?

Yes, that's best I think. Using plain +String, with +Key because they
are unique, and e.g. (gui '(+E/R +TextField) '(wrd : home obj) 60) for
the GUI.
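
As a rough sketch of the DB side of this (not taken from the program in
this thread: the scratch file name and the sample lemmas are assumptions,
and it expects to run under 'pil' so that the DB library is loaded):

```picolisp
# Sketch only; "lemmas.db" and the sample data are made up for illustration
(class +Wrds +Entity)
(rel wrd (+Key +String))               # one unique string per lemma

(pool (tmp "lemmas.db"))               # scratch DB in the process temp dir

# 'request!' returns the existing object or creates a new one
(for S '("Teil, r" "Teil, s" "vorstellen, sich")
   (request! '(+Wrds) 'wrd S) )

# Exact lookup through the unique index
(show (db 'wrd '+Wrds "Teil, r"))

# All lemmas in key order
(for Obj (collect 'wrd '+Wrds)
   (prinl (get Obj 'wrd)) )
```

A real application would presumably pass a fixed path (and '*Dbs') to
'pool' instead of a temporary file.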

♪♫ Alex
-- 
UNSUBSCRIBE: mailto:picolisp@software-lab.de?subject=Unsubscribe

Re: database setup in language learning

2014-07-22 Thread steven

Hi Alex,

I have decided that man's best friend is 1) a dog, 2) hexdump. ;)

What I did was try out the various prefix classes you mentioned with
the +List, +Ref, +String etc. in different combinations, and just went
and looked at them. More on that below.

> Exactly. But this may not be a good idea if the first file has a
> rather small block size, because the B-Tree nodes are only happy if
> they have enough space to store several key-value pairs. As opposed
> to entity objects (which may occupy more than one block, see above),
> a node splits into two if it needs more space. So you end up with
> many (perhaps too small) nodes.

I'm a bit confused with which is the "entity object" and which are the
"B-Tree nodes". The key-value pair seems like it was in file 1 in any
case, and the index files had values, with a link back to the
object name (?). Thought I had it figured out, then was not so sure.

> In fact, I never used (+Ref +List +String), always found it not
> useful. I would expect the same for (+Key +List +String). Because you
> can always store the whole string instead - without splitting it into
> a list - in a (+Key +String) or (+Ref +String), right?
> 
> What turned out more useful is (+List +Ref +String), i.e. indexing the
> individual words, or, more typical, (+List +Fold +Ref +String).
> 

Concrete examples will help. This is all about language (sorry!!), what
I think I will need eventually is for the Wrds file to hold lemmas, so
objects that are semantically unique. And before had used a List prefix
as a way of holding the article ("Teil" "r") ("Teil" "s") for that
reason. Later there will be verbs which can be reflexive or not. And
these are also part of the word semantically, are different lemmas.

I don't think I would want to index across these with +List anyway, and
maybe the best solution is just to write them into a single string and
not fold them (e.g. "Teil, r" or "vorstellen, sich"). Then of course
you have cases where the usage (or context) makes a word different
(süß) and the plot thickens! Languages get intricate rather fast.

For my test program here it is not important, and I don't want to get
too bogged down on this, just get it completely done (so with updates
and GUI ). But it will be important later, to have lemmas and signs as
the two main data files, with indexes, categories and such.

> In all (+List +String) cases it is a bit tedious to handle these
> data in the GUI. Note that you can use +ListTextField to map a
> single text field to a list of strings.

This in any case is a must, and if it is easier for the GUI to use and
to reformat a simple string (for display) then just do that. And so
maybe stick with just the (rel wrd (+Key +String))?

Re: database setup in language learning

2014-07-22 Thread Alexander Burger

Hi Steven,

> > Hello marmorine,
> Ummm... Steven. ;)

Oops! Sorry :)


> Tried all sorts of size combinations, found one that is more compact, to
> use (0 +Wrds) instead of the block size 1 for it. But then I guess you
> should take a look with "size" to make sure you are not cutting things
> too close?

No reason to worry. If space conservation is important, I would go with
dbs 0, meaning a block size of 64 bytes. Each block has a 6-byte link
field in the beginning, so that you have a net capacity of 58 bytes per
block.

If a word list gets longer than that, the DB simply allocates one or
more additional blocks for that object, so you will never hit a limit.
Just access times will slow down a little.
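
To make the numbers concrete: the block size is 64 bytes shifted left by
the scale factor given in 'dbs'. A small sketch that prints the sizes for
the first few factors, with the net capacity after the 6-byte link
('blkSize' is a hypothetical helper name, not a built-in):

```picolisp
# Block size for a 'dbs' scale factor N: 64 shifted left by N
# ('blkSize' is a made-up helper for illustration)
(de blkSize (N)
   (>> (- N) 64) )

(for N (range 0 6)
   (prinl N ": " (blkSize N) " bytes/block, "
      (- (blkSize N) 6) " net" ) )
```

So a factor of 0 gives 64 bytes, 1 gives 128, 2 gives 256 (the default)
and 4 gives 1024.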


> What I am doing is to translate more or less 1:1 a little Python
> program of mine into picolisp, because it uses scheduled learning, and
> then set up an index in my database only for those data fields where I
> had presets for queries, and this is what comes out (and then
> add updates and a gui to it in due time):
> 
>    (class +Wrds +Entity)
>    (rel wrd (+Key +List +String))
> ...
> I had (rel wrd (+Ref +List +String)) before and that didn't work,

In fact, I never used (+Ref +List +String), always found it not useful.
I would expect the same for (+Key +List +String). Because you can always
store the whole string instead - without splitting it into a list - in a
(+Key +String) or (+Ref +String), right?

What turned out more useful is (+List +Ref +String), i.e. indexing the
individual words, or, more typical, (+List +Fold +Ref +String).

In all (+List +String) cases it is a bit tedious to handle these data in
the GUI. Note that you can use +ListTextField to map a single text field
to a list of strings.


> Do I have to initialize data by the way, or will things default to
> standard values, or just "not be there" until used?

Right. Search and access functions return NIL if no data are found.

> And I take it you
> only list the indexes you are pushing off into another file, otherwise
> they default to being in the first file?

Exactly. But this may not be a good idea if the first file has a rather
small block size, because the B-Tree nodes are only happy if they have
enough space to store several key-value pairs. As opposed to entity
objects (which may occupy more than one block, see above), a node splits
into two if it needs more space. So you end up with many (perhaps too
small) nodes.

♪♫ Alex

Re: database setup in language learning

2014-07-21 Thread steven

On Thu, 17 Jul 2014 09:13:38 +0200
Alexander Burger wrote:

> Hello marmorine,
> 

Ummm... Steven. ;)

> 
> A few minor notes:
> 

No quoting for local transient symbols, and the (use (G N) ...) has also
been corrected. These are small things, but very useful for me.

> 
> You may be able to tune that a bit. If you call (pool "words02d.db")
> the default is a single-file database, with a block size of 256.
> 
>    (mapcar size (collect 'wrd '+Wrds))
> 
> Also, it might be better to put the indexes into a separate file(s),
>    (dbs
>       (1 +Wrds)   # 128 Byte/block
>       (4 (+Wrds wid wrd)) )   # 1024 Byte/block
> 

This is what I have been looking at in the meantime, with a break for
the latest heat wave (!), just how things look on the disk, where
things are stored and so forth. The "mechanics" of things.

Tried all sorts of size combinations, found one that is more compact, to
use (0 +Wrds) instead of the block size 1 for it. But then I guess you
should take a look with "size" to make sure you are not cutting things
too close?

> Hmm, I've never done this, but there might be a very efficient
> solution: If you have all words in a single file (and no objects of
> other types), you can access them directly, without index, with the
> 'id' function. You first determine the ID of the last object in the
> file:
> 
>    (setq *Max (id (for (S (seq (db: +Wrds)) S (seq S)) S)))
> 

Actually what I did was your suggestion, Alex. I found it in an old
post in the mail archives. This one works great though, and so I have a
choice, which is nice. And so I left out the wid, used it here.

What I am doing is to translate more or less 1:1 a little Python
program of mine into picolisp, because it uses scheduled learning, and
then set up an index in my database only for those data fields where I
had presets for queries, and this is what comes out (and then
add updates and a gui to it in due time):

   (class +Wrds +Entity)
   (rel wrd (+Key +List +String))
   (rel inf (+String))
   (rel dsp (+Number))
   (rel err (+Ref +Number))
   (rel idat (+Ref +Date))
   (rel ibox (+Ref +Number))
   (rel ldat (+Date))
   (rel lbox (+Ref +Number))
   (rel ndat (+Ref +Date))
   (dbs
      (1 +Wrds)
      (4 (+Wrds wrd idat ndat))
      (0 (+Wrds err ibox lbox)) )
   (pool "words2i.db" *Dbs)

I had (rel wrd (+Ref +List +String)) before and that didn't work,
the collect doesn't find anything in that case:

(mapcan cdr (collect 'wrd '+Wrds (list W "a") (list W "z") 'wrd ) )

Do I have to initialize data by the way, or will things default to
standard values, or just "not be there" until used? And I take it you
only list the indexes you are pushing off into another file, otherwise
they default to being in the first file?


Re: database setup in language learning

2014-07-17 Thread Alexander Burger

On Thu, Jul 17, 2014 at 09:13:38AM +0200, Alexander Burger wrote:
> And then later try (as in your case)
> 
>    (while (id (db: +Wrds) (rand 1 *Max)))
> 
> because there may be holes in the file due to deleted objects.

Sorry, there are two errors: You must also call 'ext?' to check if you
found a valid object, and use 'until' instead of 'while':

   (until (ext? (id `(db: +Wrds) (rand 1 *Max))))
   ... object in '@' ...

♪♫ Alex

Re: database setup in language learning

2014-07-17 Thread Alexander Burger

Hello marmorine,

> (note: a bit long being a first post)

No problem :)


> post a first attempt to check form and convention and company, so here
> the latest variant (working):
> ...

Fine. That looks good.


A few minor notes:

1. As "___ " is a local transient symbol, the single quote is not really
   needed.


2. (and M (<> (setq G (cadr (assoc (key) *Art))) " ") )

   The symbols 'G' and 'N' are free variables. Doesn't harm probably, but
   I would recommend to put a

      (use (G N)

   somewhere before the 'loop'.



> I've noticed the pil db is considerably larger than the original
> sqlite, even with only part of the data. If that is normal that is also
> OK with me, just wanted to check.

You may be able to tune that a bit. If you call (pool "words02d.db") the
default is a single-file database, with a block size of 256.

Now it depends on the average sizes of the word lists, but if most of
them are rather short, you waste some space in each block. You can take
a look at the sizes if you create a (perhaps smaller) database with
typical data, and then do

   (mapcar size (collect 'wrd '+Wrds))

Also, it might be better to put the indexes into a separate file(s),
both for space and speed considerations. Especially the 'wrd' index has
rather long keys. Assuming that a block size for the words of 128 is
enough, I would do

   (dbs
      (1 +Wrds)   # 128 Byte/block
      (4 (+Wrds wid wrd)) )   # 1024 Byte/block

to put both indexes into the second DB file, and then

   (pool "words02d.db" *Dbs)


> Mostly I am wondering about what to index and what not, also whether an
> idx is of any benefit to me.

I think the rule is simply: Index what you want to search for.

An 'idx' is only useful for an in-memory structure. You could build the
whole system as a large 'idx' tree, if it fits into memory. But you lose
persistency then.
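
For illustration, a minimal in-memory 'idx' sketch (the variable name
'*Lemmas' and the sample words are assumptions; nothing here touches the
database, so the tree is gone when the process exits):

```picolisp
(off *Lemmas)                          # start with an empty tree
(for W '("Teil" "Haus" "Baum")
   (idx '*Lemmas W T) )                # third argument T means insert

(println (idx '*Lemmas))               # all keys in sorted order
(println (bool (idx '*Lemmas "Haus"))) # lookup: T if the key is present
```

Called with only the variable, 'idx' returns the keys as a sorted list; a
lookup of a missing key returns NIL.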


> The wid key (see below) seems necessary because I want to be able to
> random sample first of all.

Hmm, I've never done this, but there might be a very efficient solution:
If you have all words in a single file (and no objects of other types),
you can access them directly, without index, with the 'id' function. You
first determine the ID of the last object in the file:

   (setq *Max (id (for (S (seq (db: +Wrds)) S (seq S)) S)))

And then later try (as in your case)

   (while (id (db: +Wrds) (rand 1 *Max)))

because there may be holes in the file due to deleted objects.

Keep in mind that this would hang forever (like your version too), if
the DB is empty!


> deletions). I am writing it to run on a Nanonote among other things
> (picolisp already compiled and running!), so memory and speed could

Cool!


> It's funny, I was dead set on programming lisp, kind of a principle
> thing, but now that I have gotten my feet wet a bit I am starting to
> think that picolisp really IS the best choice for what I have in mind.
> Because of the db, the gui, it's "all just there". A little gem.

I'm glad to hear that :)

♪♫ Alex

Re: database setup in language learning

2014-07-16 Thread Henrik Sarvell

Hi, I just skimmed your post quickly, this project/code might be of
use for you: https://bitbucket.org/hsarvell/indexer

See also the dbs function http://software-lab.de/doc/refD.html#dbs
(with more examples of usage in the example project shipping with
PicoLisp) for how to possibly make the size smaller.

On Thu, Jul 17, 2014 at 6:31 AM, marmorine wrote:
>
> (note: a bit long being a first post)
>
> I've been writing variations of a (very!) small learning program just
> to learn picolisp, and to gather the basics of what I need for a coming
> project (also language learning, but with sign). And just wanted to
> post a first attempt to check form and convention and company, so here
> the latest variant (working):
>
> # Title:  guess the article, picolisp
>
> (de gtl NIL
>    (class +Wrds +Entity)
>    (rel wid (+Key +Number))
>    (rel wrd (+Key +List +String))
>    (pool "words02d.db")
>    (setq *Max (maxKey (tree 'wid '+Wrds)))
>    (setq *Art '(("r" "der") ("e" "die") ("s" "das") (" " " ") ) )
>    (seed (time))
>    (loop
>       (prin "number of words (q to quit): ")
>       (T (= "q" (setq N (read))) 'end)
>       (do N
>          (until (db 'wid '+Wrds (rand 1 *Max)))
>          (let W (car (get @ 'wrd))
>             (prinl '"___ " W)
>             (let M
>                (mapcan cdr
>                   (collect 'wrd '+Wrds
>                      (list W "a") (list W "z") 'wrd ) )
>                (while
>                   (and M (<> (setq G (cadr (assoc (key) *Art))) " ") )
>                   (ifn (index G M)
>                      (prinl "you dummy!")
>                      (prinl G " " W ", hurrah!")
>                      (when (setq M (delete G M))
>                         (prinl "there's more!")) ) ) ) ) ) ) )
>
> And am now about to expand the fields with this as the next step:
>
>    (rel inf (+String))
>    (rel shown (+Number))
>    (rel wrong (+Number))
>    (rel init_dat (+Date))
>    (rel init_box (+Number))
>    (rel last_dat (+Date))
>    (rel last_box (+Number))
>    (rel next_dat (+Date))
>
> And have questions about that.
>
> I've noticed the pil db is considerably larger than the original
> sqlite, even with only part of the data. If that is normal that is also
> OK with me, just wanted to check.
>
> Mostly I am wondering about what to index and what not, also whether an
> idx is of any benefit to me.
>
> The wid key (see below) seems necessary because I want to be able to
> random sample first of all. The wrd key (word and article) also seem
> necessary. But do I really want to add an index for all the additional
> data fields (to track scheduled learning)? And what about when I also
> add a variety of categories (tags)?
>
> I have tried both iter'ing the whole db as well as collecting in
> various versions and notice no real difference in response (with 2600
> records). So I am kind of torn, for queries it seems a good and natural
> thing to index, but maybe it's overkill for my size of db? I think I
> want to avoid unnecessary complexity, might be a better way to say that.
>
> I'm not sure how you go about deciding that, since "it depends". For my
> case: The final db in any case could be up to 5000 records (and the
> same again later for sign language), with data fields updated often,
> but where the records themselves are largely static (no additions or
> deletions). I am writing it to run on a Nanonote among other things
> (picolisp already compiled and running!), so memory and speed could
> also be a factor. Queries could be somewhat complex for searching or
> maintenance, but "in-play" would be fairly simple. Just to sketch
> things out a bit.
>
> It's funny, I was dead set on programming lisp, kind of a principle
> thing, but now that I have gotten my feet wet a bit I am starting to
> think that picolisp really IS the best choice for what I have in mind.
> Because of the db, the gui, it's "all just there". A little gem.
