On Friday, May 17, 2013 1:46:35 AM UTC+2, infernoape wrote:
>
>
> So in a couple hundred pdf files where each pdf has 1 to 40 pages of text, 
> I'm pushin 4mb including tiddlywiki and I want to get back down to speedy 
> 2mb size. Did that answer your question?
>

I waited a bit to see if someone would come up with a solution, because I only 
have an idea of how it could work.
 
About 2 years ago I took a closer look at text indexing / full text 
searching. But since it wasn't practical for my TW use case, I stopped 
investigating. Reading your description of your workflow, I think it 
would make sense to take a closer look again. If you read on, you'll 
see why it wouldn't make sense for 10+ PDFs. 100++ is a different matter :)

-----

Have a look at: http://lookups.pageforest.com/
On the left side you can copy and paste some text (text from one of your PDFs).

Top right you have a button "Build Tree" -> it creates a searchable index, 
where the text search is astonishingly fast. 
"Build Tree" also prints a compression result. For small (~2 kByte) texts the 
compression rate is about 50%; for big texts (~600 kByte) it is much better. 

The text search input is bottom left!

Click the top-right buttons "Load Dictionary" and "Build Tree", then enter any 
word in the search input (bottom left) and see the magic, 
eg: "wor .. ld". As you can see, results are pretty fast and update as you 
type. 

This isn't exactly what you need, since it is a dictionary lookup, but it 
could be adjusted to work as a "full text search". To create a workflow 
that works for you, it would need a new "TrieSearchPlugin" that can handle 
the hidden index tiddlers. Every PDF would get its own index. Similar to 
your existing workflow, but the indexes would be much smaller than the 
"plain text", and searching should be quite fast. 
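To make the idea concrete, here is a minimal trie sketch in plain JavaScript. This is NOT the pageforest code, and "TrieSearchPlugin" above is just a hypothetical name -- it only illustrates the principle: build an index once per PDF, then do fast incremental prefix lookups as the user types.

```javascript
// Minimal trie: one node per character, words marked at their last node.
class TrieNode {
  constructor() {
    this.children = {}; // char -> TrieNode
    this.isWord = false;
  }
}

class Trie {
  constructor() { this.root = new TrieNode(); }

  insert(word) {
    let node = this.root;
    for (const ch of word.toLowerCase()) {
      if (!node.children[ch]) node.children[ch] = new TrieNode();
      node = node.children[ch];
    }
    node.isWord = true;
  }

  // All indexed words starting with `prefix`. This is what makes
  // search-as-you-type fast: each keystroke only walks one level deeper.
  wordsWithPrefix(prefix) {
    const p = prefix.toLowerCase();
    let node = this.root;
    for (const ch of p) {
      node = node.children[ch];
      if (!node) return [];
    }
    const results = [];
    const walk = (n, acc) => {
      if (n.isWord) results.push(p + acc);
      for (const ch of Object.keys(n.children)) walk(n.children[ch], acc + ch);
    };
    walk(node, "");
    return results;
  }
}
```

A real index would also map each word back to the tiddler(s) / PDF pages it occurs in, and would use a compact serialized encoding instead of live objects, but the lookup mechanics are the same.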

All components are open source, so it would be possible to integrate the 
stuff into a TW. IMO the problem with the library is that it isn't ready 
to be used as a TW plugin. Some heavy refactoring and some adjustments would 
be necessary. eg: it only handles English text well, because öäü ... and 
the like are ignored ... imo it can't handle numbers, eg: 2013

It can't handle typos either, so some kind of "fuzzy search" would be cool, but 
that would need more pre-processing ....
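Just to sketch what I mean by "fuzzy search" (my own illustration, not part of the lookups library): filter candidate words by edit distance, so a misspelled query still hits. A real plugin would want to bound this search inside the trie itself rather than scan a flat word list.

```javascript
// Classic Levenshtein edit distance via dynamic programming:
// dp[i][j] = edits needed to turn a[0..i) into b[0..j).
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Keep only words within `maxDist` edits of the (possibly misspelled) query.
function fuzzyMatches(query, words, maxDist = 1) {
  const q = query.toLowerCase();
  return words.filter((w) => editDistance(q, w) <= maxDist);
}
```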

----------- tl;dr
The background 

John Resig (creator of jQuery) blogged about a dictionary lookup algorithm 
in 2011:

   - initial post: http://ejohn.org/blog/dictionary-lookups-in-javascript/#postcomment
   - follow-up: http://ejohn.org/blog/javascript-trie-performance-analysis/#postcomment
   
You can have a look at the blog posts, but *the interesting stuff is in the 
comment sections* :) eg: discussions about memory usage, lookup speed, index 
creation speed, search algorithms ....

Near the end of the second blog post's comment section, Mike Koss came up 
with a working implementation (http://lookups.pageforest.com/) that works 
with tries (no typo). 

Some links about the theoretical background (for those who are interested :)

   - http://en.wikipedia.org/wiki/Directed_acyclic_graph
   - http://en.wikipedia.org/wiki/Trie
   - http://en.wikipedia.org/wiki/Suffix_tree

have fun!

mario


-- 
You received this message because you are subscribed to the Google Groups 
"TiddlyWikiDev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tiddlywikidev?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

