Hi:

I'm currently working on a platform for crawling a large number of PDF files. 
Using Nutch (and Tika) I'm able to extract the textual content of the files 
and store it in Solr, but now we want to extract the content of the PDFs 
page by page; that is, we want to store several Solr fields (one per page 
in the document). Is there any recommended way of accomplishing this in 
Nutch/Solr? With a parse plugin I could store the text from each page in 
the document's metadata; would anything else be needed?
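To make the idea concrete, here is a rough sketch (plain Python, no actual Nutch/Solr code) of the mapping I have in mind: the parse plugin would produce one metadata entry per page, and the indexer would turn those into per-page Solr fields. The `page_N_t` field names are only an assumption; they would need a matching dynamic field (e.g. `page_*_t`) in the Solr schema, and the real page texts would come from Tika/PDFBox rather than being passed in directly.

```python
def pages_to_solr_doc(doc_id, page_texts):
    """Build a flat Solr document dict with one field per PDF page.

    page_texts: list of strings, one per page, in page order
    (in practice these would come out of the parse plugin's metadata).
    """
    doc = {"id": doc_id, "page_count": len(page_texts)}
    for n, text in enumerate(page_texts, start=1):
        # Hypothetical dynamic-field naming convention: page_1_t, page_2_t, ...
        doc["page_%d_t" % n] = text
    return doc

# Example with fake page texts:
doc = pages_to_solr_doc("report.pdf", ["intro text", "methods text"])
print(doc["page_1_t"])    # -> intro text
print(doc["page_count"])  # -> 2
```

Does storing the pages this way (one dynamic field per page) sound reasonable, or is there a more idiomatic approach?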

Regards
--
"It is only in the mysterious equation of love that any 
logical reasons can be found."
"Good programmers often confuse Halloween (31 OCT) with 
Christmas (25 DEC)"
