Re: [opencog-dev] Re: [Link Grammar] Cosine similarity, PCA, sheaves (algebraic topology)

2017-06-20 Thread Linas Vepstas
Yeah, I know symbolic dynamics pretty well. I think I wrote most of the
Wikipedia article on "subshifts of finite type" and a rainbow of related
topics - the product topology, the cylinder sets, e.g. most of
"measure-preserving dynamical system". There's a vast network of related
topics, and they're all interesting.

--linas

On Mon, Jun 19, 2017 at 9:56 PM, Ben Goertzel  wrote:

> On Tue, Jun 20, 2017 at 5:59 AM, Linas Vepstas wrote:
> >> , and see how your grammar+semantic approach will be effective (somehow
> >> adding a non-linear embedding in the phase space, as I already discussed
> >> with Ben)
> >
> >
> > Ben has not yet relayed this to me.
> >
> > -- Linas
>
> Yeah, it seemed you were already pretty busy!
>
> The short summary is: For any complex dynamical system, if one embeds
> the system's states in a K-dimensional space appropriately, and then
> divides the relevant region of that K-dimensional space into discrete
> cells... then each trajectory of that system becomes a series of
> "words" in a certain language (where each of the discrete cells
> corresponds to a word)...   I guess you are probably familiar with
> this technique, which is "symbolic dynamics"
>
> One can then characterize a dynamical system, in various ways, via the
> inferred grammar of this "symbolic-dynamical language" ...
>
> I did work on this a couple decades ago using various Markovian
> grammar inference tools I hacked myself...
>
> Enzo at Cisco, as it turns out, had been thinking about applying
> similar methods to characterize the complex dynamics of some Cisco
> networks...
>
> So we have been discussing this as an interesting application of the
> OpenCog-based grammar inference tools we're now developing ...
>
> There's plenty more, but that's the high-level summary...
>
> (Part of the "plenty  more" is that there may be a use of deep (or
> shallow, depending on the case) neural networks to help with the
> initial stage where one embeds the complex system's states in a
> K-dimensional space.  In a different context, word2vec and adagram are
> examples of the power of modern NNs for dimensional embedding.)
>
> -- Ben
>
>
> --
> Ben Goertzel, PhD
> http://goertzel.org
>
> "I am God! I am nothing, I'm play, I am freedom, I am life. I am the
> boundary, I am the peak." -- Alexander Scriabin
>
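
As a rough, self-contained illustration of the symbolic-dynamics construction
Ben describes above: delay-embed a trajectory into a K-dimensional space,
partition that space into discrete cells, and read the trajectory off as a
sequence of cell labels ("words") that a grammar learner could then consume.
Everything below (the function names, the bin counts, the toy sine-wave
trajectory) is illustrative only; none of it is OpenCog or Link Grammar code.

    import numpy as np

    def delay_embed(x, K, lag=1):
        """Takens-style delay embedding of a 1-D series into K dimensions."""
        n = len(x) - (K - 1) * lag
        return np.stack([x[i * lag : i * lag + n] for i in range(K)], axis=1)

    def symbolize(points, bins_per_dim=4):
        """Map each embedded point to the label of the grid cell it falls in."""
        lo, hi = points.min(axis=0), points.max(axis=0)
        # per-dimension bin index for each point
        idx = np.floor((points - lo) / (hi - lo + 1e-12) * bins_per_dim).astype(int)
        idx = np.clip(idx, 0, bins_per_dim - 1)
        # flatten the multi-index into a single cell label ("word")
        weights = bins_per_dim ** np.arange(points.shape[1])
        return idx @ weights

    # toy trajectory: a noisy sine wave standing in for the complex system
    t = np.linspace(0, 20 * np.pi, 5000)
    x = np.sin(t) + 0.05 * np.random.randn(len(t))
    words = symbolize(delay_embed(x, K=3), bins_per_dim=4)
    print(words[:20])  # the "sentence" over which a grammar would be inferred

The interesting part, as Ben says, is then characterizing the system via the
grammar inferred over that word sequence.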


Re: [opencog-dev] Re: [Link Grammar] Cosine similarity, PCA, sheaves (algebraic topology)

2017-06-19 Thread Ben Goertzel
The Python version of Adagram seems incomplete and untested, so I
think I'd rather deal with the Julia implementation at this point...
Julia is not that complicated, and I don't love Python anyway...

Regarding the deficiencies of n-grams, I agree with Linas.  However,
my suggestion is to modify Adagram to use inputs obtained from parse
trees (the MST, information-theory-based parse trees that Linas's code
produces) rather than simply from word sequences.  So then it will be
a non-gram-based Adagram...  The "gram" part of Adagram is not what
interests me; what interests me is the ability of that particular NN
architecture/algorithm to make word2vec-style, dimensionally reduced
vectors in a way that automagically carries out word-sense
disambiguation along the way...  I believe it can do this same trick
if fed features from the MST parses, rather than being fed n-gram-ish
features from raw word sequences...  But this will require some code
changes and experimentation...
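
A rough sketch, in Python rather than Julia and purely for illustration, of
what that feature change might look like: generate (target, context) training
pairs from MST parse links instead of from a fixed window over the raw word
sequence. The link-pair format and all names below are hypothetical stand-ins,
not the actual Adagram or OpenCog interfaces.

    def window_pairs(sentence, window=2):
        """Standard n-gram-ish skip-gram contexts from the raw word sequence."""
        for i, target in enumerate(sentence):
            for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
                if j != i:
                    yield (target, sentence[j])

    def mst_pairs(links):
        """Contexts taken from MST parse links instead of linear adjacency."""
        for left, right in links:
            yield (left, right)
            yield (right, left)

    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    # hypothetical MST parse of the sentence, given as word-to-word links
    links = [("cat", "the"), ("sat", "cat"), ("sat", "on"),
             ("on", "mat"), ("mat", "the")]

    print(list(window_pairs(sentence))[:6])
    print(list(mst_pairs(links))[:6])

Either stream of (target, context) pairs could be fed to a word2vec/Adagram-style
trainer; the bet is that the link-based contexts carry more grammatical signal
than the window-based ones.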

-- Ben



On Mon, Jun 19, 2017 at 5:52 PM, Linas Vepstas <linasveps...@gmail.com> wrote:
> Thanks.  The point of this is that we're not using n-grams for anything.
> We're using sheaves. So any algo that has "gram" in its name is immediately
> disqualified. The bet is that doing grammar correctly, using sheaves, will
> get you much, much better results than using n-grams. And that's the
> point.
>
> --linas
>
> On Mon, Jun 19, 2017 at 4:33 AM, Enzo Fenoglio (efenogli)
> <efeno...@cisco.com> wrote:
>>
>> Hi Linas
>>
>> Nice working with you guys on interesting stuff.
>>
>>
>>
>> PCA is a linear method, not suited to this kind of problem. I
>> strongly suggest moving to ANNs.
>>
>>
>>
>> About Adagram: there is a Python implementation,
>> https://github.com/lopuhin/python-adagram , of the original Julia
>> implementation posted by Ben.  Or you may have a look at Sensegram,
>> http://aclweb.org/anthology/W/W16/W16-1620.pdf , with code at
>> https://github.com/tudarmstadt-lt/sensegram . I am not aware of an ANN for
>> Adagram, but there are plenty for skipgram, for example
>> https://keras.io/preprocessing/sequence/#skipgrams
>>
>>
>>
>> bye
>>
>> e
>>
>>
>>
>>
>>
>>
>>
>> From: Linas Vepstas [mailto:linasveps...@gmail.com]
>> Sent: lundi 19 juin 2017 11:16
>> To: opencog <opencog@googlegroups.com>; Curtis M. Faith
>> <curtis.m.fa...@gmail.com>
>> Cc: Ruiting Lian <ruit...@hansonrobotics.com>; Enzo Fenoglio (efenogli)
>> <efeno...@cisco.com>; Hugo Latapie (hlatapie) <hlata...@cisco.com>; Andres
>> Suarez <suarezand...@gmail.com>; link-grammar
>> <link-gram...@googlegroups.com>
>> Subject: Re: [opencog-dev] Re: [Link Grammar] Cosine similarity, PCA,
>> sheaves (algebraic topology)
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Jun 19, 2017 at 3:31 AM, Ben Goertzel <b...@goertzel.org> wrote:
>>
>>
>> Regarding "hidden multivariate logistic regression", as you hint at
>> the end of your document ... it seems you are gradually inching toward
>> my suggestion of using neural nets here...
>>
>>
>>
>> Maybe. I want to understand the data first, before I start applying random
>> algorithms to it. BTW, the previous report, showing graphs and distributions
>> of various sorts: it's been expanded and cleaned up with lots of new stuff.
>> Nothing terribly exciting.  I can send the current version, if you care.
>>
>>
>> However, we haven't gotten to experimenting with that yet, because we are
>> still getting stuck with weird Guile problems in trying to get the MST
>> parsing done ... we (Curtis) can get through MST-parsing maybe
>> 800-1500 sentences before it crashes (and it doesn't crash when
>> examined with GDB, which is frustrating...)
>>
>>
>>
>> Arghhh. OK, I just now merged one more tweak to the text-ingestion that
>> might allow you to progress.  Some back-story:
>>
>> Back when Curtis was complaining about the large amount of CPU time
>> spent in garbage collection, that was because the script *manually* triggered
>> a GC after each sentence. I presume that Curtis was not aware of this. Now
>> he is.
>>
>> The reason for doing this was that, without it, mem usage would blow up:
>> link-grammar was returning these strings that were 10 or 20 MBytes long, and
>> the GC was perfectly happy to let these clog up RAM. That works out to a
>> gigabyte every 50 or 100 sentences, so I was forcing GC to run pretty much
>> constantly: maybe a few times a second.

Re: [opencog-dev] Re: [Link Grammar] Cosine similarity, PCA, sheaves (algebraic topology)

2017-06-19 Thread Linas Vepstas
On Mon, Jun 19, 2017 at 3:31 AM, Ben Goertzel  wrote:

>
> Regarding "hidden multivariate logistic regression", as you hint at
> the end of your document ... it seems you are gradually inching toward
> my suggestion of using neural nets here...
>

Maybe. I want to understand the data first, before I start applying random
algorithms to it. BTW, the previous report, showing graphs and
distributions of various sorts: it's been expanded and cleaned up with lots
of new stuff. Nothing terribly exciting.  I can send the current version,
if you care.

>
> However, we haven't gotten to experimenting with that yet, because we are
> still getting stuck with weird Guile problems in trying to get the MST
> parsing done ... we (Curtis) can get through MST-parsing maybe
> 800-1500 sentences before it crashes (and it doesn't crash when
> examined with GDB, which is frustrating...)
>

Arghhh. OK, I just now merged one more tweak to the text-ingestion that
might allow you to progress.  Some back-story:

Back when Curtis was complaining about the large amount of CPU time
spent in garbage collection, that was because the script *manually*
triggered a GC after each sentence. I presume that Curtis was not aware of
this. Now he is.

The reason for doing this was that, without it, mem usage would blow up:
link-grammar was returning these strings that were 10 or 20 MBytes long, and
the GC was perfectly happy to let these clog up RAM. That works out to
a gigabyte every 50 or 100 sentences, so I was forcing GC to run pretty
much constantly: maybe a few times a second.

This appears to have exposed an obscure guile bug. Each of those giant
strings contains scheme code, which guile interprets/compiles and then
runs. It appears that high-frequency GC pulls the rug out from under the
compiler/interpreter, leading to a weird hang. I think I know how to turn
this into a simple test case, but haven't yet.

Avoiding the high-frequency GC avoids the weird hang.  And that's what the
last few github merges do. Basically, it checks, after every sentence, if
RAM usage is above 750MBytes, and then forces a GC if it is.  This is
enough to keep RAM usage low, while still avoiding the other ills and
diseases.
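
The actual check lives in the Guile Scheme pipeline; purely as an illustration
of the logic (the threshold value and all names below are placeholders, and the
/proc read is Linux-specific), a Python rendering would look something like this:

    import gc
    import os

    RSS_LIMIT_BYTES = 750 * 1024 * 1024
    PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")

    def current_rss_bytes():
        """Resident set size of this process, read from /proc (Linux only)."""
        with open("/proc/self/statm") as f:
            return int(f.read().split()[1]) * PAGE_SIZE

    def maybe_collect():
        """Collect only when RAM use crosses the threshold, not after every sentence."""
        if current_rss_bytes() > RSS_LIMIT_BYTES:
            gc.collect()

    for sentence in ["a", "stream", "of", "sentences"]:
        # ... parse the sentence, update the counts, etc. ...
        maybe_collect()

This keeps memory bounded without the constant collection that was tickling the
guile bug.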

For me, it's been running for over a week without any problems. It runs at
maybe a few sentences per second. Not sure, it's not something I measure. So
pretty slow, but I kind of don't care, because after a week, it's 20 or 40
million observations of words, which is plenty for me. Too much,
actually: the datasets get too big, and I need to trim them.

This has no effect at all on new, unmerged Curtis code. It won't fix his
crash. It's only for the existing pipeline.  So set it running on some other
machine, and while Curtis debugs, you'll at least get some data piling up.
Run it stock, straight out of the box, don't tune it or tweak it, and it
should work fine.

--linas
