AFAIK, you cannot get that directly. Although the raw PDF bytes are stored
in the segments, there is no Nutch command that extracts them back into
standalone files. You would need to take a segment dump and then copy out
the underlying content (it will look similar to what you see when you open
a PDF file in a text editor like Notepad++) and save it as a PDF document.
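That manual copy step can be automated. Below is a minimal, hypothetical Java sketch (plain JDK only, no Nutch APIs; the class and method names are my own, not Nutch's) that carves the first "%PDF" ... "%%EOF" span out of a dump file's bytes. It is deliberately naive: real PDFs can contain several %%EOF markers due to incremental updates, so treat it as a starting point rather than a robust extractor.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper: carve an embedded PDF out of a raw segment dump.
public class PdfCarver {

    // Find the first occurrence of needle in haystack at or after 'from'.
    static int indexOf(byte[] haystack, byte[] needle, int from) {
        outer:
        for (int i = from; i <= haystack.length - needle.length; i++) {
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) continue outer;
            }
            return i;
        }
        return -1;
    }

    /**
     * Returns the bytes from the first "%PDF" header up to and including
     * the next "%%EOF" trailer, or null if no PDF span is found.
     */
    public static byte[] carve(byte[] dump) {
        byte[] header = "%PDF".getBytes();
        byte[] trailer = "%%EOF".getBytes();
        int start = indexOf(dump, header, 0);
        if (start < 0) return null;
        int end = indexOf(dump, trailer, start);
        if (end < 0) return null;
        return Arrays.copyOfRange(dump, start, end + trailer.length);
    }

    // Usage: java PdfCarver <dump-file> <output.pdf>
    public static void main(String[] args) throws IOException {
        byte[] dump = Files.readAllBytes(Path.of(args[0]));
        byte[] pdf = carve(dump);
        if (pdf != null) {
            Files.write(Path.of(args[1]), pdf);
        }
    }
}
```

If a dump contains several fetched PDFs, you would loop: after each carve, resume searching for the next "%PDF" past the previous trailer.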

If you don't want to take that pain, one simple way is to tweak the PDF
plugin / Fetcher code and make it write the fetched content out to a PDF
file.

If you just want to get PDF files from a server and don't need the features
of a crawler, you are better off using the HTTrack tool
(http://www.httrack.com/).
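For example, a sketch of an HTTrack invocation that mirrors a site's PDFs (the URL and output directory are placeholders, and filter behavior varies by version, so check `httrack --help` before relying on this):

```shell
# Mirror http://example.com/docs/ into ./pdfs, keeping PDF links.
# "+*.pdf" is an HTTrack filter that whitelists PDF URLs; HTML pages
# are still scanned so that links to PDFs can be discovered.
httrack "http://example.com/docs/" -O ./pdfs "+*.pdf"
```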

Here is a shameless plug of my answer on StackOverflow:
http://stackoverflow.com/questions/10007178/how-do-i-save-the-origin-html-file-with-apache-nutch
The question there was a broader one: "how to save the files that are
crawled by Nutch".

Thanks,
Tejas


On Sat, Nov 24, 2012 at 12:30 PM, hudvin <[email protected]> wrote:

> I need to extract fetched pdf files. I can extract text by using following
> command
>
> bin/nutch readseg -dump crawl-test/segments/20110201114/ dump -nogenerate
> -noparse -noparsedata -noparsetex
>
> But I need raw pdf files, not pure text.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/How-to-extract-fetched-files-pdf-tp4022202.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
