On Friday, May 17, 2013 1:46:35 AM UTC+2, infernoape wrote:
> So in a couple hundred PDF files, where each PDF has 1 to 40 pages of text,
> I'm pushing 4 MB including TiddlyWiki, and I want to get back down to a
> speedy 2 MB size. Did that answer your question?
I waited a bit to see if someone would come up with a solution, because I only have an idea of how it could work. About two years ago I took a closer look at text indexing / full-text searching, but since it wasn't practical for my TW use case, I stopped investigating. Reading the description of your workflow, I think it would make sense to take a closer look again. If you read on, you'll see why it wouldn't make sense for 10+ PDFs. 100+ is a different matter :)

-----

Have a look at: http://lookups.pageforest.com/

On the left side you can copy/paste some text (text from one of your PDFs). Top right there is a button "Build Tree", which creates a searchable index where the text search is astonishingly fast. "Build Tree" also prints a compression result: for small (~2 kB) texts it is about a 50% compression rate; for big texts (~600 kB) it is much better.

The text search input is bottom left! Click the top right buttons "Load Dictionary" and "Build Tree", then enter any word in the search input (bottom left) and see the magic, e.g. "wor .. ld". As you can see, results are fast and update as you type.

This isn't exactly what you need, since it is a dictionary lookup, but it could be adjusted to be used as a full-text search. To create a workflow that works for you, it would need a new "TrieSearchPlugin" that can handle the hidden index tiddlers. Every PDF would get its own index. Similar to your existing workflow, but the indexes would be much smaller than the plain text, and searching should be quite fast. All components are open source, so it would be possible to integrate the stuff into a TW.

IMO the problem with the library is that it isn't ready to be used as a TW plugin. Some heavy refactoring and some adjustments would be necessary. E.g. it can only handle English text well, because öäü and such are ignored; IMO it can't handle numbers, e.g. 2013. It can't handle typos, so some type of "fuzzy search" would be cool, which would need more pre-processing ...
-----------

tl;dr The background:

John Resig (creator of jQuery) blogged about a dictionary lookup algorithm in 2011:

- initial post: http://ejohn.org/blog/dictionary-lookups-in-javascript/#postcomment
- follow-up: http://ejohn.org/blog/javascript-trie-performance-analysis/#postcomment

You can have a look at the blog posts, but *the interesting stuff is in the comment sections* :) e.g. discussion about memory usage, lookup speed, index creation speed, search algorithms ... Near the end of the second post's comment section, Mike Koss came up with a working installation (http://lookups.pageforest.com/) that works with Tries (no typo).

Some links about the theoretical background (for those who are interested :)

- http://en.wikipedia.org/wiki/Directed_acyclic_graph
- http://en.wikipedia.org/wiki/Trie
- http://en.wikipedia.org/wiki/Suffix_tree

have fun!
mario

--
You received this message because you are subscribed to the Google Groups "TiddlyWikiDev" group.
