YES - For a very long time I've wanted an assistant that watches what I do 
and helps me - this is my
ultimate goal.

I want to reduce entropy - I want to discover similar tiddlers and merge
them.

I have been thinking about how to do this for 30 odd years (not for 
tiddlers - but for text representing ideas)

When I make a new tiddler I ask myself "Have I done this or something like
this before?" - or "Has anybody else done this?"

I know of one good algorithm for this.

I make a new tiddler T and want to find the most similar tiddler to T in a 
collection of tiddlers.

For this https://en.wikipedia.org/wiki/Normalized_compression_distance is 
very good.

This is very simple: A is similar to B if
size(compress(A ++ B)) is just a bit larger than size(compress(A)).

If size(compress(A ++ B)) is close to size(compress(A)) + size(compress(B))
then A and B are very dissimilar.

++ is concatenation.

Why is this? Compression algorithms look for similarities between different
parts of a text -
they compress well when they find similarities.
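To make this concrete, here's a rough sketch in Python (my language choice
for illustration - any compressor would do; zlib is just the one in the
standard library):

```python
# Normalized compression distance (NCD), sketched with zlib.
# NCD(A, B) = (C(A++B) - min(C(A), C(B))) / max(C(A), C(B))
# where C(X) is the compressed size of X.
import zlib

def csize(text: str) -> int:
    """Compressed size of a string, in bytes."""
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(a: str, b: str) -> float:
    """Near 0 for very similar texts, near 1 for very dissimilar ones."""
    ca, cb, cab = csize(a), csize(b), csize(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)
```

Note that zlib's 32 KB window limits this to shortish texts - tiddler-sized
chunks are fine.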

This is a crazy good algorithm - I gave it all the paragraphs in a book I'd
written, typed a new paragraph, and it ranked
any similar paragraphs it could find.

The problem is that it's very inefficient - it compresses the new tiddler
against every tiddler in the collection, so with a few million tiddlers
it would be way too slow.

My next idea would be to use the rsync algorithm for plagiarism detection.
This
is linear in the number of characters of the new tiddler - it finds short
fragments of identical text
very, very quickly - this is why universities etc. use it to detect cheating
students.

So I'd propose using rsync to build a set of candidate tiddlers, then
normalized compression distance
to rank the candidates.
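A sketch of the two-stage pipeline (Python again; I stand in for the rsync
rolling checksum with fixed-width character n-gram fingerprints - the same
fragment-matching idea, much simplified):

```python
# Stage 1: cheap fingerprint overlap picks a shortlist of candidates.
# Stage 2: normalized compression distance ranks the shortlist.
import zlib

def csize(text):
    return len(zlib.compress(text.encode("utf-8"), 9))

def ncd(a, b):
    ca, cb, cab = csize(a), csize(b), csize(a + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def fingerprints(text, n=8):
    """Hashes of every n-character window - the shingling idea behind
    rsync-style fragment matching."""
    return {hash(text[i:i + n]) for i in range(len(text) - n + 1)}

def most_similar(new_text, tiddlers, k=10):
    """Shortlist k tiddlers by fingerprint overlap, then rank by NCD."""
    fp = fingerprints(new_text)
    shortlist = sorted(tiddlers,
                       key=lambda t: -len(fp & fingerprints(t)))[:k]
    return sorted(shortlist, key=lambda t: ncd(new_text, t))
```

A real version would index the fingerprints once up front rather than
recomputing them for every query.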

Could also use TF*IDF similarity to make candidates.
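For completeness, a bare-bones TF*IDF candidate scorer in Python (no
stemming, no smoothing - tokenising is just a lowercase split, so treat it
as a sketch):

```python
# TF*IDF vectors plus cosine similarity for candidate generation.
import math
from collections import Counter

def tfidf_vectors(docs):
    """One sparse TF*IDF vector (a dict) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    df = Counter(word for toks in tokenised for word in set(toks))
    n = len(docs)
    return [{w: (c / len(toks)) * math.log(n / df[w])
             for w, c in Counter(toks).items()}
            for toks in tokenised]

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(weight * v[w] for w, weight in u.items() if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```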

Ultimately I'd like to find all paragraphs on the planet, name them by their
SHA checksums,
then put them into an entropy reduction machine that finds similarities and
throws out
duplicates and near-matching data.
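Naming by checksum makes the exact-duplicate part trivial - a sketch using
SHA-256 (the particular hash is my choice here; any strong checksum works):

```python
# Content-address paragraphs by checksum; identical paragraphs
# collapse to a single entry.  Near-duplicates still need a
# similarity pass (NCD, fingerprints) on top of this.
import hashlib

def name(paragraph: str) -> str:
    """SHA-256 checksum of a (whitespace-trimmed) paragraph."""
    return hashlib.sha256(paragraph.strip().encode("utf-8")).hexdigest()

def dedupe(paragraphs):
    """Map checksum -> paragraph; exact duplicates are thrown out."""
    store = {}
    for p in paragraphs:
        store.setdefault(name(p), p)
    return store
```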

It seems to me that intelligence is partly the ability to recognise
similarities between things,
so I think an entropy reduction machine would be great.

What attracted me to the TW in the first place was the granularity of the 
tiddlers.

They must be not too small and not too big and capture a single idea -
there's
a Swedish word for this, "lagom" - that and transclusion to combine ideas are
fundamental to building large structures by combining smaller things.

The problem with search engines is that you have to think up a query.

In similarity detection, what you have written becomes the query - and you
ask
"what is the most similar thing you can find to <this>?"

This topic has fascinated me for years - I view entropy reduction as one of
the key
unsolved problems in computer science.

Cheers

/Joe

On Tuesday, 22 January 2019 05:27:56 UTC+1, Rob Hoelz wrote:
>
> Again, thanks for sharing, Joe!  I looked through the PDF and had a few 
> thoughts:
>
>   * Did you do any additional processing of the tiddler bodies, eg. 
> stemming, chunking into bigrams/trigrams, or stripping out various wikitext 
> elements like URLs?  If you did, I'd be curious to hear how that affected 
> your results!
>   * During the talk, you mention the idea of an "assistant" that sits off 
> to the side and helps you work on tiddlers as you type.  I often think that 
> it would be helpful if TiddlyWiki offered me suggestions for tiddlers that 
> might be related to what I'm currently writing, and I think perhaps your 
> TF-IDF "significant term" detection approach might make for a step in the 
> right direction.  Perhaps the top N TF-IDF terms for each tiddler could be 
> encoded as a vector, and tiddlers whose vectors have the highest cosine 
> similarity could be offered as matches in this regard - what do you think?
>
> -Rob
>
> On Monday, January 21, 2019 at 12:03:09 PM UTC-6, Rob Hoelz wrote:
>>
>> Thanks, Joe!  I'll read over that PDF you sent over; as far as the code 
>> goes, I think the PDF documentation describing the methodology should 
>> suffice.
>>
>> -Rob
>>
>> On Monday, January 21, 2019 at 11:33:31 AM UTC-6, Joe Armstrong wrote:
>>>
>>> The code I wrote was a bit messy and just as an experiment. 
>>> Good enough for proof of concept but not for production - it was just 
>>> written to test a few ideas.
>>>
>>> I don't mind sending you a private copy - but explaining how it works 
>>> would be low priority.
>>>
>>> A better idea would be for me to put it up on github together with my 
>>> library of Erlang code that
>>> parses and mucks with tiddlers - I'm trying to programmatically create 
>>> TWs from other data sources.
>>>
>>> If you saw the talk you'd see that we're interested in "Communicating 
>>> TW's" I can imagine TW's sending messages
>>> to each other - but this is a long way off ...
>>>
>>> I did make a little writeup that explains the method (enclosed) - the 
>>> code was just a prototype and written in Erlang - the problem at the moment 
>>> is that this is not integrated in any way with a live TW - Our idea was to 
>>> integrate this through a socket interface.
>>>
>>> At the moment I'm learning the TW so hopefully when I understand more 
>>> I'll figure out how to
>>> connect the TW to Erlang through a socket and fun and games will follow 
>>> :-)
>>>
>>> The TF*IDF algorithm is very simple (see the writeup) most of the work 
>>> is in tokenising the input
>>> into words - from  then on it's easy (in pure JS) - integrating this 
>>> with the TW would then be
>>> as they say "an exercise to the reader" (that's what I say when I don't 
>>> know how to do this :-)
>>>
>>> Cheers
>>>
>>> /Joe
>>>
>>>
>>> On Monday, 21 January 2019 18:04:10 UTC+1, Rob Hoelz wrote:
>>>>
>>>> Hi everyone (especially Jeremy and Joe) -
>>>>
>>>> I finally got around to watching this talk, and I was enraptured the 
>>>> whole time, especially by the part about inferring tags and using TF-IDF 
>>>> to 
>>>> come up with more accurate suggestions.  Is the source code for your work 
>>>> freely available?  I tried my hand at tag inference using forests of 
>>>> decision trees a few months back, and I'd like to study alternative 
>>>> approaches!
>>>>
>>>> Thanks,
>>>> Rob
>>>>
>>>
