Re: Solr: using to index large folders recursively containing lots of different documents, and querying over the web

Lance Norskog Mon, 17 Jan 2011 19:47:33 -0800

Solr itself does all three things. There is no need for Nutch; that is
needed for crawling web sites, not file systems (which is what the
original question specifies).
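
For the recursive indexing part, a minimal sketch of the idea might look
like the following. It assumes the ExtractingRequestHandler (Solr Cell) is
enabled and mapped at /update/extract, and that the file path is acceptable
as the document id. The host, port and the helper names iter_files and
extract_url are made up for illustration. It only walks the folder tree and
builds the request URLs; posting each file's bytes to its URL is left out.

```python
import os
from urllib.parse import urlencode

# Assumption: Solr Cell is mapped at this path on a local Solr instance.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"


def iter_files(root):
    """Yield every file path under root, descending into all subfolders."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def extract_url(path, base_url=SOLR_EXTRACT_URL):
    """Build the URL for posting one file, using its path as the document id."""
    # literal.id sets the id field of the resulting Solr document.
    return base_url + "?" + urlencode({"literal.id": path})
```

Each file would then be sent as the body of a POST to its URL, followed by
a commit once the batch is done.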


Solr operates as a web service, running in any Java servlet container.
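
Querying it is then just an HTTP request. As a rough sketch, this builds a
search URL with hit highlighting switched on. The host, port, the /select
path and the field name "text" are assumptions about your setup, and
search_url is a hypothetical helper name; q, rows, hl, hl.fl and wt are
standard Solr request parameters.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with the default select handler.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def search_url(user_query, highlight_field="text", rows=10):
    """Build a query URL that asks Solr for matches plus highlighted snippets."""
    params = [
        ("q", user_query),          # the text the user typed
        ("rows", rows),             # how many matches to return
        ("hl", "true"),             # turn on hit highlighting
        ("hl.fl", highlight_field),  # field to take snippets from (assumed name)
        ("wt", "json"),             # response format
    ]
    return SOLR_SELECT_URL + "?" + urlencode(params)
```

The highlighted snippets in the response can serve as the abstracts shown
next to each matching file.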

Detecting changes to files is trickier: no implementation of the
real-time update system is available for Windows, so you would have to
implement that yourself. Otherwise you can poll the file system and
re-index altered files.
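
The polling approach could be sketched roughly like this: take a snapshot
of modification times, compare it with the previous snapshot, and re-index
whatever differs. The function names scan and diff are hypothetical, and
comparing modification times is only one possible way to detect an altered
file.

```python
import os


def scan(root):
    """Return {path: last-modified time} for every file under root."""
    snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                snapshot[path] = os.stat(path).st_mtime
            except OSError:
                # The file vanished between listing and stat; skip it.
                continue
    return snapshot


def diff(previous, current):
    """Compare two snapshots; return (new or altered paths, deleted paths)."""
    changed = sorted(p for p, mtime in current.items() if previous.get(p) != mtime)
    deleted = sorted(p for p in previous if p not in current)
    return changed, deleted
```

A scheduled task would call scan at some interval, re-index the changed
paths, delete the removed ones from the index, and keep the new snapshot
for the next run.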

On Fri, Jan 14, 2011 at 4:54 AM, Markus Jelsma
<markus.jel...@openindex.io> wrote:
> Nutch can crawl the file system as well. Nutch 1.x can also provide search but
> this is delegated to Solr in Nutch 2.x. Solr can provide the search and Nutch
> can provide Solr with content from your intranet.
>
> On Friday 14 January 2011 13:17:52 Cathy Hemsley wrote:
>> Hi,
>> Thanks for suggesting this.
>> However, I'm not sure a 'crawler' will work, as the various pages are not
>> necessarily linked (it's complicated:  basically our intranet is a dynamic
>> and managed collection of independently published web sites, and users
>> find information using categorisation and/or text searching), so we need
>> something that will index all the files in a given folder, rather than
>> follow links like a crawler. Can Nutch do this? As well as the other
>> requirements below?
>> Regards
>> Cathy
>>
>> On 14 January 2011 12:09, Markus Jelsma <markus.jel...@openindex.io> wrote:
>> > Please visit the Nutch project. It is a powerful crawler and can
>> > integrate with Solr.
>> >
>> > http://nutch.apache.org/
>> >
>> > > Hi Solr users,
>> > >
>> > > I hope you can help.  We are migrating our intranet web site management
>> > > system to Windows 2008 and need a replacement for Index Server to do
>> > > the text searching.  I am trying to establish if Lucene and Solr is a
>> > > feasible replacement, but I cannot find the answers to these questions:
>> > >
>> > > 1. Can Solr be set up to recursively index a folder containing an
>> > > indeterminate and variable large number of subfolders, containing files
>> > > of all types:  XML, HTML, PDF, DOC, spreadsheets, powerpoint
>> > > presentations, text files etc.  If so, how?
>> > > 2. Can Solr be queried over the web and return a list of files that
>> > > match a search query entered by a user, and also return the abstracts
>> > > for these files, as well as 'hit highlighting'.  If so, how?
>> > > 3. Can Solr be run as a service (like Index Server) that automatically
>> > > detects changes to the files within the indexed folder and updates the
>> > > index? If so, how?
>> > >
>> > > Thanks for your help
>> > >
>> > > Cathy Hemsley
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Lance Norskog
goks...@gmail.com
