> The KDE thing is very interesting, thanks for the link!  I was hoping for
> something cross-platform though.

KDE is built almost entirely on Qt, so most of it is cross-platform. You might 
want to check with their mailing lists for details and feasibility.

> 
> As regards using Nutch: how would it handle file updates?  It seems to me a
> Web crawler would only get new files and changes on each crawl, whereas a
> desktop search engine like Spotlight for instance indexes a file as soon as
> it gets made or modified.

Nutch will crawl a (local) URL and set its next fetch time by incrementing a 
timestamp, either by a constant interval (30 days by default) or by one computed 
by a fetch-schedule algorithm. When that fetch time arrives, the URL becomes 
eligible for refetch and all the associated processing.
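The scheduling above can be sketched in a few lines of Python (a rough sketch; 
the 30-day constant corresponds to Nutch's default fetch interval and is 
hard-coded here as an assumption):

```python
from datetime import datetime, timedelta

# Default refetch interval; Nutch's default fetch interval is 30 days.
DEFAULT_INTERVAL = timedelta(days=30)

def next_fetch_time(last_fetch: datetime,
                    interval: timedelta = DEFAULT_INTERVAL) -> datetime:
    """Return when a URL next becomes eligible for refetch."""
    return last_fetch + interval

def is_eligible(last_fetch: datetime, now: datetime,
                interval: timedelta = DEFAULT_INTERVAL) -> bool:
    """A URL is only refetched once its scheduled fetch time has passed."""
    return now >= next_fetch_time(last_fetch, interval)
```

A file modified the day after a crawl therefore sits unindexed until the 
interval elapses, which is the crux of the desktop-search mismatch.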

You can also hook up a file alteration monitor daemon that runs a script to 
reindex a specific file in Solr. This cannot be done through Nutch: it will not 
recrawl and reindex a URL that is not yet eligible for fetch.
That is not a big problem, since both Nutch and Solr use the Tika libraries for 
document parsing, but it may become one if the two use different Tika versions 
or if you have custom Nutch plugins.
In short: forced reindexing of a given URL cannot go through Nutch.
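As a rough sketch of what such a reindex script could look like, assuming a 
local Solr with the standard ExtractingRequestHandler (Solr Cell, which wraps 
Tika) mounted at /update/extract; the host, port, and use of the absolute path 
as document id are my assumptions:

```python
import os
import urllib.parse
import urllib.request

def extract_url(solr_base: str, doc_id: str) -> str:
    """Build the Solr Cell URL that indexes one document and commits."""
    params = urllib.parse.urlencode({"literal.id": doc_id, "commit": "true"})
    return f"{solr_base}/update/extract?{params}"

def reindex_file(solr_base: str, path: str) -> None:
    """POST a single file's raw bytes to Solr for Tika extraction."""
    url = extract_url(solr_base, os.path.abspath(path))
    with open(path, "rb") as f:
        req = urllib.request.Request(
            url, data=f.read(),
            headers={"Content-Type": "application/octet-stream"})
        urllib.request.urlopen(req)  # raises on HTTP errors

# e.g. reindex_file("http://localhost:8983/solr", "/home/me/notes.pdf")
```

Because Solr does the Tika parsing itself here, this path sidesteps the 
Nutch/Solr Tika version mismatch mentioned above.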

> 
> There's also this document I found on the Web: it describes some problems
> with using Nutch on the personal scale owing to its specialization for web
> crawling----it says there is a limit on files crawled per directory, and
> size of files crawled.  This was all I was able to find under "Nutch
> desktop search" in Google.  However, now that I look at it more closely
> it's from 2004, so it seems to me Nutch might have gotten rid of these
> problems in the interim....

There are indeed limits, but they are configurable: the number of outlinks 
taken per page (which also applies to directory listings), the maximum content 
size, and such.
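For example, you could raise those limits in conf/nutch-site.xml; the property 
names below are the ones I recall from nutch-default.xml, and the values are 
purely illustrative, so double-check against your Nutch version:

```xml
<!-- Illustrative overrides in conf/nutch-site.xml -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>1000</value> <!-- also caps entries taken from a directory listing -->
</property>
<property>
  <name>file.content.limit</name>
  <value>-1</value> <!-- -1 disables content truncation for file:// URLs -->
</property>
```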

If immediate reindexing of modified documents is strictly required, you may 
need to drop Nutch and go for a stand-alone Solr instance with a fair amount of 
scripting and a file alteration monitor that works cross-platform.
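A minimal cross-platform way to get the file-alteration-monitor part with 
nothing but the standard library is to poll modification times (a sketch only; 
a real setup would rather use an inotify/FSEvents wrapper, and `reindex` here 
is a stand-in for whatever script posts the file to Solr):

```python
import os
import time
from typing import Callable, Dict, List

def scan(root: str) -> Dict[str, float]:
    """Map every file under root to its last-modified time."""
    mtimes = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                mtimes[path] = os.path.getmtime(path)
            except OSError:
                pass  # file vanished between listing and stat
    return mtimes

def changed_files(before: Dict[str, float],
                  after: Dict[str, float]) -> List[str]:
    """Return paths that are new or whose mtime differs from last scan."""
    return [p for p, m in after.items() if before.get(p) != m]

def watch(root: str, reindex: Callable[[str], None],
          interval: float = 2.0) -> None:
    """Poll forever, calling reindex(path) for each changed file."""
    state = scan(root)
    while True:
        time.sleep(interval)
        current = scan(root)
        for path in changed_files(state, current):
            reindex(path)
        state = current
```

Polling is crude but genuinely portable, which matters given the 
cross-platform requirement.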

Good luck

> 
> http://docs.google.com/viewer?a=v&q=cache:bDjjs__eYPcJ:www.commercenet.com/
> images/0/06/CN-TR-04-04.pdf+nutch+desktop+search&hl=en&gl=us&pid=bl&srcid=A
> DGEESg12Bq0VDGk3FpevwOHIdbfr1bCkEZ3CH1yojEliyfeCJv_3JhGRe1gMPx66LiywsUYFWJh
> KKzsLBVoCtATNcghrW4DRLWlT5sd4YhIWMVaQjMKs5xN-8vqTOHFV2pw9bzCtoQY&sig=AHIEtb
> TpxSL0xmZJxa5CWm8MzDWD4vyAAg
> 
> Thanks,
> 
> Andrew
> 
> On Mon, Aug 15, 2011 at 6:07 AM, Markus Jelsma
> 
> <[email protected]>wrote:
> > With Nutch you can crawl your FS with ease and index to a Solr instance.
> > It'll
> > surely work. But you may also be interested in the cool KDE technologies
> > that
> > are specifically built for desktop search.
> > 
> > http://thomasmcguire.wordpress.com/2009/10/03/akonadi-nepomuk-and-strigi-
> > explained/
> > 
> > On Monday 15 August 2011 04:41:11 Andrew Naylor wrote:
> > > Any suggestions for the best way to get desktop search in the
> > > Lucene/Solr/Nutch/Tika ecosystem?  I want to be able to access (from my
> > 
> > own
> > 
> > > program) lists of terms that are indexed and weights for each file, for
> > > example, but if a filesystem indexer and index updater already exists
> > > somewhere I'd like to use it rather than write my own.
> > > 
> > > I'm planning on working in Clojure, btw, not that that should make any
> > > difference---
> > > 
> > > Thanks,
> > > 
> > > Andrew
> > 
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
