Re: PDF not crawled/indexed

Lewis John Mcgibbney Tue, 22 May 2012 02:26:31 -0700

Sorry, I should have been more explicit about the exact file location:

http://svn.apache.org/repos/asf/nutch/trunk/conf/parse-plugins.xml
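
As for the other setting mentioned earlier in the thread, http.content.limit: a minimal sketch of an override in conf/nutch-site.xml, assuming the stock default of 65536 bytes is still in effect. With that default, any PDF larger than 64 KB is truncated on fetch, and a truncated PDF will usually fail to parse. The value -1 below is only an illustration (it disables truncation); a large positive byte count works as well if you want to keep an upper bound.

```xml
<!-- conf/nutch-site.xml: local overrides of nutch-default.xml -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- Illustrative value: a negative number means no truncation.
         The shipped default is 65536 (bytes). -->
    <value>-1</value>
  </property>
</configuration>
```

The mimeType mappings themselves are in the parse-plugins.xml linked above. If that file is unmodified, there should be nothing to change there.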


hth

On Tue, May 22, 2012 at 10:19 AM, Tolga <[email protected]> wrote:
> By tika mimeType settings, do you mean protocol-http?
>
> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
>>
>> Check your http.content.limit, and also make sure that you haven't
>> changed anything within the tika mimeType mappings.
>>
>> On Tue, May 22, 2012 at 9:06 AM, Tolga <[email protected]> wrote:
>>>
>>> Sorry, I forgot to also add my original problem. PDF files are not
>>> crawled.
>>> I even modified -topN to be 10.
>>>
>>>
>>> -------- Original Message --------
>>> Subject:        PDF not crawled/indexed
>>> Date:   Tue, 22 May 2012 10:48:15 +0300
>>> From:   Tolga<[email protected]>
>>> To:     [email protected]
>>>
>>>
>>>
>>> Hi,
>>>
>>> I am crawling my website with this command:
>>>
>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
>>> http://localhost:8983/solr/ -depth 20 -topN 5
>>>
>>> Is it a good idea to modify the directory name? Should I always delete
>>> indexes prior to crawling and stick to the same directory name?
>>>
>>> Regards,
>>>
>>
>>
>



-- 
Lewis
