PDF support? Does crawl parse pdf?

2005-08-31 Thread Diane Palla
Does Nutch have a way to parse pdf files, that is, application/pdf 
content type files?

I noticed a plugin variable setting in default.properties:

plugin.pdf=org.apache.nutch.parse.pdf*

I never changed this file.

Is that the right value?

I am using Nutch 0.7.

What do I have to do to make Nutch parse pdf files?

When I do the crawl, I get this error with application/pdf files:

050831 145126 fetch okay, but can't parse 
mainurl/research/126900/126969/126969.pdf, reason: failed(2,203): 
Content-Type not text/html: application/pdf


If it's not possible, in what future version of Nutch do the developers 
expect to support application/pdf content types and make such pdf parsing 
available?


Diane Palla
Web Services Developer
Seton Hall University
973 313-6199
[EMAIL PROTECTED]




Bryan Woliner [EMAIL PROTECTED]
08/23/2005 05:22 PM
To: nutch-user@lucene.apache.org
Subject: Adding small batches of fetched URLs to a larger aggregate segment/index

Hi,

I have a number of sites that I want to crawl, then merge their segments 
and create a single index. One of the main reasons I want to do this is 
that I want some of the sites in my index to be crawled on a daily basis, 
others on a weekly basis, etc. Each time I re-crawl a site, I want to add 
the fetched URLs to a single aggregate segment/index. I have a couple of 
questions about doing this:

1. Is it possible to use a different regex.urlfilter.txt file for each 
site that I am crawling? If so, how would I do this? (See the first 
sketch below.)

2. If I have a very large segment that is indexed (my aggregate index) 
and I want to add another (much smaller) set of fetched URLs to this 
index, what is the best way to do this? It seems like merging the small 
and large segments and then re-indexing the whole thing would be very 
time consuming -- especially if I wanted to add new small sets of 
fetched URLs frequently. (See the second sketch below.)

Thanks for any suggestions you have to offer,
Bryan
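
On question 2, re-indexing the whole aggregate can be avoided: index the 
small new segment on its own, then merge the resulting small index into 
the aggregate index. That merge is what Lucene's IndexWriter.addIndexes 
call is for (Nutch ships an IndexMerger tool built on the same idea). A 
minimal sketch against the Lucene 1.4 API bundled with Nutch 0.7; the 
index paths are hypothetical:

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class MergeSmallIndex {
    public static void main(String[] args) throws Exception {
      // Open the existing aggregate index for appending (create = false).
      IndexWriter writer =
          new IndexWriter("index-aggregate", new StandardAnalyzer(), false);
      // Merge in the index built from the latest small crawl; documents
      // already in the aggregate are left untouched.
      Directory small = FSDirectory.getDirectory("index-small", false);
      writer.addIndexes(new Directory[] { small });
      writer.optimize();  // optional: compacts index segments after the merge
      writer.close();
    }
  }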



Re: PDF support? Does crawl parse pdf?

2005-08-31 Thread Piotr Kosiorowski

Hello Diane,
There is a plugin to parse pdf files. You have to enable it in 
nutch-site.xml (just copy the entry from nutch-default.xml and edit it 
there).

You have to change the plugin.includes property to include the parse-pdf 
plugin: change [...] parse-(text|html) [...] to 
[...] parse-(text|html|pdf) [...].
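
For example, a nutch-site.xml along these lines; copy the plugin.includes 
entry from your own nutch-default.xml and only add |pdf to the parse-(...) 
group (the rest of the value below is illustrative):

  <?xml version="1.0"?>
  <nutch-conf>
    <property>
      <name>plugin.includes</name>
      <!-- only the parse-(...) group changes; keep the rest of the
           value exactly as it appears in your nutch-default.xml -->
      <value>protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
    </property>
  </nutch-conf>

Note that in 0.7 parsing happens at fetch time, so pdf URLs fetched 
before the change need to be re-fetched.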
Regards
Piotr

Diane Palla wrote:
Does Nutch have a way to parse pdf files, that is, application/pdf 
content type files?
[...]