Well, the value is in bytes, so anything above the default (65536 bytes,
i.e. 64 KB) is truncated. Ferdy also introduced a parser.skip.truncated
property, which is set to true by default. The justification is that
parsing truncated content can sometimes take extremely high levels of CPU,
which then leads to the parser choking.
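
If you want larger PDFs fetched and parsed in full, both settings can be
overridden in nutch-site.xml. A minimal sketch (the 10 MB limit is only an
example value, and http.verbose/fetcher.verbose are optional, purely for
more detailed log output):

  <!-- maximum content size in bytes; -1 removes the limit entirely,
       at the risk of the parser choking on very large files -->
  <property>
    <name>http.content.limit</name>
    <value>10485760</value>
  </property>

  <!-- default is true, i.e. documents truncated at the limit above are
       skipped by the parser; set to false only if you want Nutch to
       attempt parsing truncated content anyway -->
  <property>
    <name>parser.skip.truncated</name>
    <value>true</value>
  </property>

  <!-- optional: more verbose protocol/fetcher logging -->
  <property>
    <name>http.verbose</name>
    <value>true</value>
  </property>
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
  </property>

That also answers Piet's question below: content over the limit is cut at
the limit, and with parser.skip.truncated=true those truncated documents
are then skipped by the parser, so they never reach the index. A sketch of
how to verify this from a segment follows the quoted thread.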

On Tue, May 22, 2012 at 10:47 AM, Piet van Remortel
<[email protected]> wrote:
> I have been dealing with the exact same issues, and I wonder what happens
> to PDFs that exceed the file size limit: are they cropped (and partly
> parsed?) or fully ignored? I seem to observe parsing problems in PDFs
> since using a file size limit. Setting the limit to -1 indeed caused
> consistent choke errors on large pages/files, so setting a hard limit
> seemed the only option.
>
> thanks
>
> Piet
>
>
> On Tue, May 22, 2012 at 11:44 AM, Lewis John Mcgibbney
> <[email protected]> wrote:
>
>> Yes, well then you should either set this property to -1 (which is a
>> safeguard to ensure that you definitely crawl and parse all of your
>> PDFs) or to a safe, responsible value that reflects the size of the
>> PDFs or other documents you expect to obtain during your crawl. The
>> first option has the downside that on occasion the parser can choke on
>> rather large files...
>>
>> On Tue, May 22, 2012 at 10:36 AM, Tolga <[email protected]> wrote:
>> > What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
>> >
>> > On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
>> >>
>> >> Yes I know.
>> >>
>> >> If your PDFs are larger than this, then they will either be truncated
>> >> or may not be crawled at all. Please look thoroughly at your log
>> >> output... you may wish to enable the http.verbose and fetcher.verbose
>> >> properties as well.
>> >>
>> >> On Tue, May 22, 2012 at 10:31 AM, Tolga <[email protected]> wrote:
>> >>>
>> >>> The value is 65536
>> >>>
>> >>> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
>> >>>>
>> >>>> Check your http.content.limit, and also make sure that you haven't
>> >>>> changed anything within the Tika mimeType mappings.
>> >>>>
>> >>>> On Tue, May 22, 2012 at 9:06 AM, Tolga <[email protected]> wrote:
>> >>>>>
>> >>>>> Sorry, I forgot to also add my original problem. PDF files are not
>> >>>>> crawled.
>> >>>>> I even modified -topN to be 10.
>> >>>>>
>> >>>>>
>> >>>>> -------- Original Message --------
>> >>>>> Subject:        PDF not crawled/indexed
>> >>>>> Date:   Tue, 22 May 2012 10:48:15 +0300
>> >>>>> From:   Tolga <[email protected]>
>> >>>>> To:     [email protected]
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> Hi,
>> >>>>>
>> >>>>> I am crawling my website with this command:
>> >>>>>
>> >>>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
>> >>>>> http://localhost:8983/solr/ -depth 20 -topN 5
>> >>>>>
>> >>>>> Is it a good idea to modify the directory name? Should I always
>> >>>>> delete indexes prior to crawling and stick to the same directory
>> >>>>> name?
>> >>>>>
>> >>>>> Regards,
>> >>>>>
>> >>>>
>> >>
>> >>
>> >
>>
>>
>>
>> --
>> Lewis
>>
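
For Piet's cropped-vs-ignored question, one way to verify what actually got
stored for a given URL is to dump the fetch segment and look at the stored
content and parse data. A rough sketch, assuming a Nutch 1.x install and a
segment under crawl/segments (the segment name and output directory are
placeholders):

  bin/nutch readseg -dump crawl/segments/20120522103000 readseg-out
  less readseg-out/dump

Documents cut off at http.content.limit will show content no larger than
the limit, and with parser.skip.truncated=true they will have no parse
data, which is why they never make it into Solr.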



-- 
Lewis
