> If immediate reindexing of modified documents is strictly required you may
> need to drop Nutch and go for a stand-alone Solr with a lot of scripting and
> some file alteration monitor you can use cross-platform.

Thanks Markus, I'll see if I really need that. One thing I might do is simply
use an existing desktop indexer and just use Tika to parse files (mostly I
want to get a list of indexed terms).

On Mon, Aug 15, 2011 at 1:43 PM, Markus Jelsma <[email protected]> wrote:

> > The KDE thing is very interesting, thanks for the link! I was hoping for
> > something cross-platform though.
>
> KDE is almost pure Qt so most of it is cross-platform. You might want to
> check with their lists for details and feasibility.
>
> > As regards using Nutch: how would it handle file updates? It seems to me
> > a Web crawler would only get new files and changes on each crawl, whereas
> > a desktop search engine like Spotlight for instance indexes a file as
> > soon as it gets made or modified.
>
> Nutch will crawl a (local) url and set its fetch time: a timestamp
> incremented by a constant (default 30 days) or based on some algorithm.
> At that time in the future the url becomes eligible for refetch and all
> the associated processing.
>
> You can also hook up some file alteration monitor daemon that can run some
> script to reindex a specific file in Solr. This cannot be used with Nutch,
> it will not recrawl and index a url if it is not eligible for fetch.
> This is not a big problem as both Nutch and Solr use the Tika libs for
> document parsing but may become a problem if both use different versions
> and if you have custom Nutch plugins.
> To be short: forced reindexing of a given url cannot go through Nutch.
>
> > There's also this document I found on the Web: it describes some problems
> > with using Nutch on the personal scale owing to its specialization for
> > web crawling. It says there is a limit on files crawled per directory,
> > and size of files crawled. This was all I was able to find under "Nutch
> > desktop search" in Google. However, now that I look at it more closely
> > it's from 2004, so it seems to me Nutch might have gotten rid of these
> > problems in the interim....
>
> There are limits indeed but they are configurable: num outlinks (applies to
> directory lists as well), max content limit and such.
>
> If immediate reindexing of modified documents is strictly required you may
> need to drop Nutch and go for a stand-alone Solr with a lot of scripting
> and some file alteration monitor you can use cross-platform.
>
> Good luck
>
> > http://docs.google.com/viewer?a=v&q=cache:bDjjs__eYPcJ:www.commercenet.com/images/0/06/CN-TR-04-04.pdf+nutch+desktop+search&hl=en&gl=us&pid=bl&srcid=ADGEESg12Bq0VDGk3FpevwOHIdbfr1bCkEZ3CH1yojEliyfeCJv_3JhGRe1gMPx66LiywsUYFWJhKKzsLBVoCtATNcghrW4DRLWlT5sd4YhIWMVaQjMKs5xN-8vqTOHFV2pw9bzCtoQY&sig=AHIEtbTpxSL0xmZJxa5CWm8MzDWD4vyAAg
> >
> > Thanks,
> >
> > Andrew
> >
> > On Mon, Aug 15, 2011 at 6:07 AM, Markus Jelsma <[email protected]> wrote:
> > > With Nutch you can crawl your FS with ease and index to a Solr
> > > instance. It'll surely work. But you may also be interested in the
> > > cool KDE technologies that are specifically built for desktop search.
> > >
> > > http://thomasmcguire.wordpress.com/2009/10/03/akonadi-nepomuk-and-strigi-explained/
> > >
> > > On Monday 15 August 2011 04:41:11 Andrew Naylor wrote:
> > > > Any suggestions for the best way to get desktop search in the
> > > > Lucene/Solr/Nutch/Tika ecosystem? I want to be able to access (from
> > > > my own program) lists of terms that are indexed and weights for each
> > > > file, for example, but if a filesystem indexer and index updater
> > > > already exists somewhere I'd like to use it rather than write my own.
> > > >
> > > > I'm planning on working in Clojure, btw, not that that should make
> > > > any difference---
> > > >
> > > > Thanks,
> > > >
> > > > Andrew
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
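
For what it's worth, the "file alteration monitor plus a script that
reindexes one file in Solr" idea from the thread could look roughly like the
sketch below. It is only an illustration under some assumptions: it polls
modification times with the Python standard library (which works
cross-platform, unlike most native monitor daemons), and it assumes the Solr
instance has the extracting request handler (Solr Cell, which uses Tika)
mapped at /update/extract. The function names and the base URL are made up
for the example; actually sending the file to Solr and handling deleted
files are left out.

```python
import os
import urllib.parse


def scan_mtimes(root):
    """Return {path: mtime} for every regular file under root."""
    mtimes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.stat(path).st_mtime
            except OSError:
                # File vanished or is unreadable between listing and stat.
                continue
    return mtimes


def changed_paths(previous, current):
    """Paths that are new, or whose mtime differs from the last snapshot."""
    return sorted(p for p, m in current.items() if previous.get(p) != m)


def extract_url(solr_base, path):
    """Build the URL a script would POST the file to for reindexing.

    Assumes the extracting handler lives at <solr_base>/update/extract and
    that the file path is used as the document id (both are assumptions).
    """
    params = urllib.parse.urlencode({"literal.id": path, "commit": "true"})
    return f"{solr_base.rstrip('/')}/update/extract?{params}"
```

A polling loop would keep the previous snapshot, call scan_mtimes() every few
seconds, and POST each path returned by changed_paths() to the URL from
extract_url(). Since this bypasses Nutch entirely, it matches the point made
above that forced reindexing of a single file cannot go through Nutch.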

