Hi again,
I'm getting this error:
The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are
enabled via the plugin.includes system property, and all claim to
support the content type application/pdf, but they are not mapped to it
in the parse-plugins.xml file.
Should I add

<mimeType name="application/pdf">
  <plugin id="parse-pdf"/>
</mimeType>
to conf/parse-plugins.xml?
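Or, since the error message names org.apache.nutch.parse.tika.TikaParser as the enabled parser, should the mapping point at parse-tika instead? Something like the following is my guess, with the alias entry based on the stock parse-plugins.xml layout:

<mimeType name="application/pdf">
  <plugin id="parse-tika"/>
</mimeType>

<!-- plus an alias mapping the plugin id to the extension, if not already present -->
<aliases>
  <alias name="parse-tika"
         extension-id="org.apache.nutch.parse.tika.TikaParser"/>
</aliases>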
Regards,
On 5/22/12 2:06 PM, Piet van Remortel wrote:
Another option is

<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

which uses Tika, and Tika handles PDF parsing.
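In full, the property block in conf/nutch-site.xml would look roughly like this (the description text here is just a placeholder):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Plugins to include; parse-tika handles PDF among other formats.</description>
</property>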
On Tue, May 22, 2012 at 1:00 PM, Tolga<[email protected]> wrote:
Hi again,
I am getting this error: org.apache.nutch.parse.ParseException: parser
not found for contentType=application/pdf. I googled and found that I
have to add a plugin.includes line to include the pdf extension. However, I
already have that line. In fact, the whole <property> block looks like
this:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
<description>Some long description</description>
</property>
However, I still get that error.
What am I missing?
Thanks,
On 5/22/12 12:44 PM, Lewis John Mcgibbney wrote:
Yes, well then you should either set this property to -1 (a safeguard
to ensure that you definitely fetch and parse all of your PDFs in
full) or to a sensible value that reflects the size of the PDFs and
other documents you expect to obtain during your crawl. The first
option has the downside that the parser can occasionally choke on
rather large files...
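For example, in conf/nutch-site.xml (a sketch; note the value is in bytes, so the default of 65536 is only 64 KB, well below a 4.7 MB PDF):

<property>
  <name>http.content.limit</name>
  <!-- -1 disables the limit; otherwise give a byte count, e.g. 10485760 for 10 MB -->
  <value>-1</value>
</property>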
On Tue, May 22, 2012 at 10:36 AM, Tolga<[email protected]> wrote:
What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
Yes I know.
If your PDFs are larger than this, they will either be truncated or
may not be crawled at all. Please look thoroughly at your log
output... you may wish to use the http.verbose and fetcher.verbose
properties as well.
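To enable those, something roughly like this in conf/nutch-site.xml:

<property>
  <name>http.verbose</name>
  <value>true</value>
</property>
<property>
  <name>fetcher.verbose</name>
  <value>true</value>
</property>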
On Tue, May 22, 2012 at 10:31 AM, Tolga<[email protected]> wrote:
The value is 65536
On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
Check your http.content.limit value, and also make sure that you
haven't changed anything within the Tika mimeType mappings.
On Tue, May 22, 2012 at 9:06 AM, Tolga<[email protected]> wrote:
Sorry, I forgot to include my original problem: PDF files are not
being crawled. I even changed -topN to 10.
-------- Original Message --------
Subject: PDF not crawled/indexed
Date: Tue, 22 May 2012 10:48:15 +0300
From: Tolga<[email protected]>
To: [email protected]
Hi,
I am crawling my website with this command:
bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
http://localhost:8983/solr/ -depth 20 -topN 5
Is it a good idea to vary the directory name like this? Or should I
always delete the indexes prior to crawling and stick to the same
directory name?
Regards,