AFAIK, you cannot get that directly. Although the raw PDF content is written as-is into the segments, there is no Nutch command that extracts it back out as a file. You will need to take a dump and then copy out the underlying content (it will look similar to what you see when you open a PDF file in a text editor like Notepad++) and save it as a PDF document.
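If you do go the dump route, the copy-and-save step can be scripted instead of done by hand. Below is a rough sketch (mine, not part of Nutch; the function name and the marker-based carving are assumptions) that scans a dump file for `%PDF-` ... `%%EOF` byte ranges and writes each one out as a .pdf file:

```python
# Hypothetical helper to carve PDF documents out of a raw dump file.
# Assumption: each PDF in the dump runs from a "%PDF-" header to the
# next "%%EOF" trailer. Incrementally-updated PDFs contain several
# "%%EOF" markers, so such files may come out truncated.
def carve_pdfs(dump_path, out_prefix="extracted"):
    with open(dump_path, "rb") as f:
        data = f.read()

    count = 0
    pos = 0
    while True:
        start = data.find(b"%PDF-", pos)
        if start == -1:
            break
        end = data.find(b"%%EOF", start)
        if end == -1:
            break  # header without a trailer; stop rather than guess
        end += len(b"%%EOF")

        with open(f"{out_prefix}-{count}.pdf", "wb") as out:
            out.write(data[start:end])
        count += 1
        pos = end
    return count
```

This is only a convenience over the manual approach described above; the extracted files are exactly the bytes that were fetched, so anything Nutch mangled in the dump stays mangled.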
If you don't want to take that pain, one simple way is to tweak the PDF plugin / Fetcher code and make it write the raw content to a PDF file. If you just want to get PDF files from a server and don't need the features of a crawler, you are better off using the HTTrack tool (http://www.httrack.com/).

Here is a shameless plug of my answer on StackOverflow: http://stackoverflow.com/questions/10007178/how-do-i-save-the-origin-html-file-with-apache-nutch. The question there was a broader one: "how to save the files that are crawled by Nutch".

Thanks,
Tejas

On Sat, Nov 24, 2012 at 12:30 PM, hudvin <[email protected]> wrote:
> I need to extract fetched pdf files. I can extract text by using the following
> command
>
> bin/nutch readseg -dump crawl-test/segments/20110201114/ dump -nogenerate
> -noparse -noparsedata -noparsetext
>
> But I need raw pdf files, not pure text.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-extract-fetched-files-pdf-tp4022202.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

