Hello Solr users! I need suggestions on the best and most bullet-proof way to index data from multiple websites.
The sites are:

- different websites,
- running on different CMSes (Drupal, Plone, SharePoint, WordPress, etc.),
- controlled by different site owners (somebody else is in charge of each of the sites).

Currently we have a setup where:

1) some of the websites push new and updated content directly to Solr, and
2) other websites are crawled by Nutch, which pushes the content to Solr.

(50 websites / 250,000 pages)

This works out pretty well, but only because the first group of sites is under my own control. If something goes wrong or I need to upgrade Solr, I can easily log in to these sites, make technical changes and reindex the full content.

What if I want to give other sites the same possibility to push content directly to my Solr index? This would be nice because:

- some of the websites contain restricted content that is somewhat tricky to expose to the crawler,
- many CMSes have existing Solr modules that can push content to Solr out of the box,
- content can be pushed to the Solr index instantly.

But what if something goes wrong in this process and I do not have access to log in to the CMS, make changes, start a reindex, etc.? In that case the content from that site will be missing until a CMS developer has time to help me out.

Can anybody give advice on how to handle this in an easy way? Should I stick to the model of having a crawler between the websites and Solr? Or use some other kind of proxy service (a rough sketch of what I have in mind follows after my signature)?

Kind regards,

Bjørn Axelsen

Fagkommunikation
Webbureau som formidler viden
Schillerhuset · Nannasgade 28 · 2200 København N · +45 60660669 · i...@fagkommunikation.dk · fagkommunikation.dk
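
PS: To make the "proxy service" idea a bit more concrete, here is a very rough sketch of what I am thinking about: a small store-and-forward service that keeps a local copy of every document a CMS pushes and forwards it to Solr, so I can replay everything into a fresh index after an upgrade without needing access to the CMSes. The core name, directory paths and function names below are just placeholders, and I am assuming the CMSes push plain JSON documents; nothing here is meant as a finished design.

# Minimal store-and-forward sketch (placeholder names, plain JSON push assumed).
import json
import os
import requests

SOLR_UPDATE_URL = "http://localhost:8983/solr/sites/update?commit=true"  # placeholder core
STORE_DIR = "/var/spool/solr-push-proxy"  # local copy of every pushed document

def accept_push(site_id, doc):
    """Called when a CMS pushes a document: keep a copy on disk, then forward to Solr."""
    site_dir = os.path.join(STORE_DIR, site_id)
    os.makedirs(site_dir, exist_ok=True)
    # Naive: assumes the document id is safe to use as a file name.
    with open(os.path.join(site_dir, doc["id"] + ".json"), "w") as f:
        json.dump(doc, f)
    # Solr's /update handler accepts a JSON array of documents.
    requests.post(SOLR_UPDATE_URL, json=[doc]).raise_for_status()

def reindex_all():
    """Replay every stored document, e.g. into a freshly created index after an upgrade."""
    for site_id in os.listdir(STORE_DIR):
        for name in os.listdir(os.path.join(STORE_DIR, site_id)):
            with open(os.path.join(STORE_DIR, site_id, name)) as f:
                doc = json.load(f)
            requests.post(SOLR_UPDATE_URL, json=[doc]).raise_for_status()

The point is only that the proxy, not the CMSes, would own a replayable copy of the content. Is something like this a reasonable direction, or is sticking with the crawler the saner choice?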