using tika inside SOLR vs using nutch
Hi,

What are the pros and cons of these two approaches?

1. Use Nutch to crawl the file system, parse the files, perform other data manipulation, and eventually index into Solr.
2. Use Solr DataImportHandlers and plugins to perform this task.

Note that I have tens of millions of docs which I need to handle the first time, and then delta imports of around 100k docs per day. Each doc may be up to 100 MB.

--
View this message in context: http://lucene.472066.n3.nabble.com/using-tika-inside-SOLR-vs-using-nutch-tp4089120.html
Sent from the Solr - User mailing list archive at Nabble.com.
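For approach 2, a minimal sketch of a DataImportHandler config that walks a directory and hands each file to Tika for extraction, roughly matching the Solr 4.x DIH of this thread's era. The paths and field names here are hypothetical, and the setup assumes the `solr-dataimporthandler-extras` and Tika jars are on the classpath:

```xml
<!-- data-config.xml: hypothetical baseDir and field names -->
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <!-- walk the file system and emit one row per file -->
    <entity name="files" processor="FileListEntityProcessor"
            baseDir="/data/docs" fileName=".*" recursive="true"
            rootEntity="false" dataSource="null">
      <!-- parse each file with Tika inside the Solr JVM -->
      <entity name="tika" processor="TikaEntityProcessor"
              url="${files.fileAbsolutePath}" format="text"
              dataSource="bin">
        <field column="text" name="content"/>
      </entity>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>
```

Note that with this setup the parsing work (and the memory for those 100 MB documents) lands inside the Solr JVM itself, which is the trade-off discussed in the reply below.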
Re: using tika inside SOLR vs using nutch
If you have tens of millions of documents to parse and you do that job inside Solr, it means you will put that workload on Solr. If many queries are also hitting your Solr node, CPU and RAM may not be enough to handle parsing and querying at the same time. Parsing documents in Nutch is batch processing. If you parse inside Solr, you avoid the step of sending the parsed documents from Nutch to Solr; but if you parse the documents on the Nutch side, running on Hadoop with many machines, then doing that job as Map/Reduce may be a good choice for you.

2013/9/10 adfel70 adfe...@gmail.com
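The Nutch-side batch flow described above can be sketched as the usual Nutch 1.x crawl cycle (the commands of the era of this thread; paths and the Solr URL are hypothetical):

```shell
# Nutch 1.x crawl cycle sketch; each step is a MapReduce job,
# so it scales out across a Hadoop cluster
bin/nutch inject crawl/crawldb urls/             # seed the crawl db
bin/nutch generate crawl/crawldb crawl/segments  # select files to fetch
s=`ls -d crawl/segments/* | tail -1`             # newest segment
bin/nutch fetch $s                               # fetch the documents
bin/nutch parse $s                               # Tika parsing, in batch
bin/nutch updatedb crawl/crawldb $s              # record crawl status
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb $s
```

The point of this split is that only the final `solrindex` step touches Solr; all fetching and parsing runs as batch jobs on the Hadoop side.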