Another option is

 
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

which uses Tika, and Tika can parse PDF.
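
If it helps, the whole block in conf/nutch-site.xml would then look roughly
like this (just a sketch, not copied from a working config; the description
text is a placeholder):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Plugin regex; parse-tika handles application/pdf among
  other formats.</description>
</property>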
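
Also, on the http.content.limit discussion further down: that 65536 is in
bytes if I remember the stock config right, i.e. only 64 KB, which is well
under a 4.7 MB PDF. A sketch of raising it, along the lines Lewis suggests:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Maximum length of downloaded content, in bytes; -1 removes
  the limit (with the caveat that the parser can choke on very large files).
  Alternatively, set it to something comfortably above your largest PDF,
  e.g. 10485760 for 10 MB.</description>
</property>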


On Tue, May 22, 2012 at 1:00 PM, Tolga <[email protected]> wrote:

> Hi again,
>
> I am getting this error: org.apache.nutch.parse.ParseException: parser
> not found for contentType=application/pdf. I googled and found out that I
> have to add a plugin.includes line to include pdf extension. However, I
> already have that line. Actually, the whole <property> block looks like
> this:
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> <description>Some long description</description>
> </property>
>
> However, I still get that error.
>
> What am I missing?
>
> Thanks,
>
>
> On 5/22/12 12:44 PM, Lewis John Mcgibbney wrote:
>
>> Yes, well then you should either set this property to -1 (which is a
>> safeguard to ensure that you definitely crawl and parse all of your
>> PDFs) or to a safe, responsible value that reflects the size of the
>> PDFs or other documents which you expect to obtain during your
>> crawl. The first option has the downside that on occasion the parser
>> can choke on rather large files...
>>
>> On Tue, May 22, 2012 at 10:36 AM, Tolga <[email protected]> wrote:
>>
>>> What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
>>>
>>> On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
>>>
>>>> Yes I know.
>>>>
>>>> If your PDFs are larger than this, then they will either be truncated
>>>> or not crawled at all. Please look thoroughly at your log output...
>>>> you may wish to use the http.verbose and fetcher.verbose properties as
>>>> well.
>>>>
>>>> On Tue, May 22, 2012 at 10:31 AM, Tolga <[email protected]> wrote:
>>>>
>>>>> The value is 65536
>>>>>
>>>>> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
>>>>>
>>>>>> Try your http.content.limit and also make sure that you haven't
>>>>>> changed anything within the Tika mimeType mappings.
>>>>>>
>>>>>> On Tue, May 22, 2012 at 9:06 AM, Tolga <[email protected]> wrote:
>>>>>>
>>>>>>> Sorry, I forgot to also mention my original problem: PDF files are not
>>>>>>> crawled. I even modified -topN to be 10.
>>>>>>>
>>>>>>>
>>>>>>> -------- Original Message --------
>>>>>>> Subject:        PDF not crawled/indexed
>>>>>>> Date:   Tue, 22 May 2012 10:48:15 +0300
>>>>>>> From:   Tolga <[email protected]>
>>>>>>> To:     [email protected]
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am crawling my website with this command:
>>>>>>>
>>>>>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
>>>>>>> http://localhost:8983/solr/ -depth 20 -topN 5
>>>>>>>
>>>>>>> Is it a good idea to modify the directory name? Should I always
>>>>>>> delete
>>>>>>> indexes prior to crawling and stick to the same directory name?
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>>
>>>>
>>
>>
