On Friday, May 17, 2013 1:46:35 AM UTC+2, infernoape wrote:
>
>
> So in a couple hundred pdf files where each pdf has 1 to 40 pages of text, 
> I'm pushin 4mb including tiddlywiki and I want to get back down to speedy 
> 2mb size. Did that answer your question?
>

I waited a bit to see if someone would come up with a solution, because I only 
have an idea of how it could work.
 
About 2 years ago I took a closer look at text indexing / full text 
searching. But since it wasn't practical for my TW use case, I stopped 
investigating. Reading your description of your workflow, I think it 
would make sense to take a closer look again. If you read on, you'll 
see why it wouldn't make sense for 10+ PDFs. 100++ is a different matter :)

-----

Have a look at: http://lookups.pageforest.com/
On the left side you can copy and paste some text (text from one of your PDFs).

Top right you have a button "Build Tree" -> it creates a searchable index, 
where the text search is astonishingly fast. 
"Build Tree" also prints a compression result. For small (~2 kByte) texts the 
compression rate is about 50%; for big texts (~600 kByte) it is much better. 

The text search input is bottom left!

Click the top-right buttons "Load Dictionary" and "Build Tree", then enter any 
word in the search input (bottom left) and see the magic, 
eg: "wor .. ld". As you can see, results are pretty fast and update as you 
type. 

This isn't exactly what you need, since it is a dictionary lookup, but it 
could be adjusted to work as a "full text search". To create a workflow 
that works for you, it would need a new "TrieSearchPlugin" that can handle 
the hidden index tiddlers. Every PDF would get its own index. Similar to 
your existing workflow, but the indexes would be much smaller than the 
"plain text", and searching should be quite fast. 
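To make the idea concrete, here is a minimal trie sketch in plain JavaScript. This is NOT the pageforest code, and "TrieSearchPlugin" above is just a hypothetical name -- it only illustrates the principle: build an index once per PDF, then do fast incremental prefix lookups as the user types.

```javascript
// Minimal trie: one node per character, words marked at their last node.
class TrieNode {
  constructor() {
    this.children = {}; // char -> TrieNode
    this.isWord = false;
  }
}

class Trie {
  constructor() { this.root = new TrieNode(); }

  insert(word) {
    let node = this.root;
    for (const ch of word.toLowerCase()) {
      if (!node.children[ch]) node.children[ch] = new TrieNode();
      node = node.children[ch];
    }
    node.isWord = true;
  }

  // All indexed words starting with `prefix`. This is what makes
  // search-as-you-type fast: each keystroke only walks one level deeper.
  wordsWithPrefix(prefix) {
    const p = prefix.toLowerCase();
    let node = this.root;
    for (const ch of p) {
      node = node.children[ch];
      if (!node) return [];
    }
    const results = [];
    const walk = (n, acc) => {
      if (n.isWord) results.push(p + acc);
      for (const ch of Object.keys(n.children)) walk(n.children[ch], acc + ch);
    };
    walk(node, "");
    return results;
  }
}
```

A real index would also map each word back to the tiddler(s) / PDF pages it occurs in, and would use a compact serialized encoding instead of live objects, but the lookup mechanics are the same.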

All components are open source, so it would be possible to integrate the 
stuff into a TW. IMO the problem with the library is that it isn't ready 
to be used as a TW plugin. Some heavy refactoring and some adjustments would 
be necessary. eg: it only handles English text well, because öäü ... and 
the like are ignored ... imo it can't handle numbers, eg: 2013

It can't handle typos either, so some kind of "fuzzy search" would be cool, but 
that would need more pre-processing ....
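Just to sketch what I mean by "fuzzy search" (my own illustration, not part of the lookups library): filter candidate words by edit distance, so a misspelled query still hits. A real plugin would want to bound this search inside the trie itself rather than scan a flat word list.

```javascript
// Classic Levenshtein edit distance via dynamic programming:
// dp[i][j] = edits needed to turn a[0..i) into b[0..j).
function editDistance(a, b) {
  const dp = Array.from({ length: a.length + 1 }, (_, i) => [i]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                  // deletion
        dp[i][j - 1] + 1,                                  // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Keep only words within `maxDist` edits of the (possibly misspelled) query.
function fuzzyMatches(query, words, maxDist = 1) {
  const q = query.toLowerCase();
  return words.filter((w) => editDistance(q, w) <= maxDist);
}
```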

----------- tl;dr
The background 

John Resig (creator of jQuery) blogged about a dictionary lookup algorithm 
in 2011:

   - initial post: http://ejohn.org/blog/dictionary-lookups-in-javascript/#postcomment
   - follow-up: http://ejohn.org/blog/javascript-trie-performance-analysis/#postcomment
   
You can have a look at the blog posts, but *the interesting stuff is in the 
comment sections* :) eg: discussions about memory usage, lookup speed, index 
creation speed, search algorithms ....

Near the end of the second blog post's comment section, Mike Koss came up 
with a working implementation (http://lookups.pageforest.com/) that works 
with tries (no typo). 

Some links about the theoretical background (for those who are interested :)

   - http://en.wikipedia.org/wiki/Directed_acyclic_graph
   - http://en.wikipedia.org/wiki/Trie
   - http://en.wikipedia.org/wiki/Suffix_tree

have fun!

mario


-- 
You received this message because you are subscribed to the Google Groups 
"TiddlyWikiDev" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/tiddlywikidev?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.

