using tika inside SOLR vs using nutch

2013-09-10 Thread adfel70
Hi

What are the pros and cons of these two approaches?
1. Use Nutch to crawl the file system + parse files + perform other data
manipulation, and eventually index to Solr.
2. Use Solr DataImportHandler plugins to perform this task.


Note that I have tens of millions of docs which I need to handle the first
time, and then delta imports of around 100k docs per day.
Each doc may be up to 100 MB.
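For scale, the numbers above work out roughly as follows (a back-of-the-envelope sketch; only the figures quoted above are assumed):

```python
# Steady-state rate for the delta imports described above.
docs_per_day = 100_000
seconds_per_day = 24 * 60 * 60
docs_per_second = docs_per_day / seconds_per_day
print(f"steady-state rate: {docs_per_second:.2f} docs/sec")

# Worst-case daily volume if every doc hit the 100 MB upper bound.
max_doc_mb = 100
tb_per_day = docs_per_day * max_doc_mb / 1_000_000
print(f"worst-case volume: {tb_per_day:.1f} TB/day")
```

Even at only ~1.2 docs/sec sustained, the worst-case parse volume is large enough that where the Tika parsing runs matters.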





--
View this message in context: 
http://lucene.472066.n3.nabble.com/using-tika-inside-SOLR-vs-using-nutch-tp4089120.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: using tika inside SOLR vs using nutch

2013-09-10 Thread Furkan KAMACI
If you have tens of millions of documents to parse and you want to do that
job inside Solr, it means you will put that workload on Solr. If there are
many queries hitting your Solr node, you should consider that CPU and RAM
may not be enough while it is both parsing documents and answering
somebody's queries.
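To make the "parsing inside Solr" option concrete: Solr's ExtractingRequestHandler (Solr Cell) runs Tika inside the Solr JVM, so each request like the one below consumes Solr's own CPU and heap (host, core name, document id, and file path are illustrative):

```shell
# Send a raw file to Solr; Tika parsing happens inside the Solr process.
curl "http://localhost:8983/solr/collection1/update/extract?literal.id=doc1&commit=true" \
  -F "myfile=@/path/to/document.pdf"
```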

Parsing documents in Nutch is batch processing. If you do the parsing in
Solr instead, you avoid having to wait for those documents to be sent
from Nutch to Solr.

If you parse those documents on the Nutch side, run it on Hadoop, and
have many machines, then doing that job as a Map/Reduce job may be a good
choice for you.
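The Nutch-side route is a sequence of batch steps; a minimal sketch of one crawl cycle ending with indexing into Solr (Nutch 1.x command names; the `crawl/` paths and Solr URL are illustrative):

```shell
# One Nutch 1.x crawl cycle; each step is a batch (Map/Reduce-capable) job.
bin/nutch inject crawl/crawldb urls          # seed the crawl database
bin/nutch generate crawl/crawldb crawl/segments
segment=$(ls -d crawl/segments/* | tail -1)  # newest segment
bin/nutch fetch "$segment"
bin/nutch parse "$segment"                   # Tika parsing runs here, outside Solr
bin/nutch updatedb crawl/crawldb "$segment"
bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb "$segment"
```

Because each step is its own job, the heavy parse phase can be scaled across a Hadoop cluster independently of the Solr node that serves queries.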
