Hi list,

This is a question I am hoping will prompt some decent input. I have been 
trying to build Nutch trunk recently and have been pondering whether to store 
large data in a back-end HBase or MySQL database and then utilise the 
DataImportHandler (DIH) to import it into Solr for search capabilities... or 
to pass the solrindex command within the crawl process to send data directly 
to Solr, effectively removing back-end database storage altogether. At this 
stage I do not quite know how 'large' the data will get, as this idea is still 
in development, but to give an example: we wish to implement a prototype 
system which will crawl a local authority Intranet site, and if reasonable 
results can be achieved we can then progress to the other 31 local authorities 
throughout the country. In the latter case I expect that data volumes could be 
classed as huge. I was wondering if anyone can provide insight into the pros 
and cons of both approaches and, if possible, any examples of production 
implementations that allow a comparison. I realise that there are a couple of 
questions here and I do not expect definitive answers to all (or any) of them, 
but it would be great to receive any feedback on this topic.
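For reference, the second approach I am describing would look roughly like the 
following. This is only a sketch: the paths, segment name and Solr URL are 
placeholders, and the exact sub-commands and arguments vary between Nutch 
versions, so please treat it as illustrative rather than exact.

```shell
# Hypothetical crawl cycle that pushes parsed pages straight to Solr,
# skipping any separate back-end store. Paths/URLs are placeholders.
bin/nutch inject crawldb urls                       # seed the crawldb
bin/nutch generate crawldb segments                 # create a fetch list
bin/nutch fetch segments/20110101000000             # fetch the segment
bin/nutch parse segments/20110101000000             # parse fetched content
bin/nutch updatedb crawldb segments/20110101000000  # update the crawldb
bin/nutch invertlinks linkdb segments/20110101000000
# Index directly into Solr rather than writing to HBase/MySQL first:
bin/nutch solrindex http://localhost:8983/solr/ crawldb linkdb segments/20110101000000
```

Under the first approach those last steps would instead write to the database, 
with a DIH data-config on the Solr side doing the import.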

Thank you,
Lewis

