Hi Piet,

We will hopefully be pushing 1.5 in the next few days, so please watch
this space.

Thanks

On Tue, May 22, 2012 at 11:43 AM, Piet van Remortel
<[email protected]> wrote:
> Ok thanks, that property indeed seems like the right solution, but it's
> not part of the 1.4 release that I currently use.
> The current source trunk does include it, though.
>
> On Tue, May 22, 2012 at 12:31 PM, Lewis John Mcgibbney <
> [email protected]> wrote:
>
>> Well, the value is in bytes, so anything above the default (65536) is
>> truncated.
>> Ferdy also introduced a parser.skip.truncated property, which is set to
>> true by default. The justification for this is that parsing such
>> documents can sometimes take extremely high levels of CPU, which then
>> leads to the parser choking.
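>>
>> As a sketch of the kind of nutch-site.xml override involved (just an
>> illustration, not a recommendation), parsing of truncated content can
>> be switched back on with:
>>
>>   <property>
>>     <name>parser.skip.truncated</name>
>>     <value>false</value>
>>   </property>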
>>
>> On Tue, May 22, 2012 at 10:47 AM, Piet van Remortel
>> <[email protected]> wrote:
>> > I have been dealing with the exact same issues, and I wonder what
>> > happens to PDFs that exceed the file size limit: are they cropped (and
>> > partly parsed?) or fully ignored? I seem to observe parsing problems in
>> > PDFs since I started using a file size limit. Setting the limit to -1
>> > indeed caused consistent choke errors on large pages/files, so setting
>> > a hard limit seemed the only option.
>> >
>> > thanks
>> >
>> > Piet
>> >
>> >
>> > On Tue, May 22, 2012 at 11:44 AM, Lewis John Mcgibbney <
>> > [email protected]> wrote:
>> >
>> >> Yes, well then you should either set this property to -1 (which is a
>> >> safeguard to ensure that you definitely crawl and parse all of your
>> >> PDFs) or to a safe, responsible value that reflects the size of the
>> >> PDFs or other documents you expect to obtain during your crawl. The
>> >> first option has the downside that on occasion the parser can choke on
>> >> rather large files...
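>> >>
>> >> As a rough illustration of the two options in nutch-site.xml (the
>> >> 10 MB figure below is only an example, not a recommendation):
>> >>
>> >>   <!-- option 1: no limit at all -->
>> >>   <property>
>> >>     <name>http.content.limit</name>
>> >>     <value>-1</value>
>> >>   </property>
>> >>
>> >>   <!-- option 2: a sized limit; 10485760 bytes = 10 MB, pick a figure
>> >>        that covers your largest expected PDFs -->
>> >>   <property>
>> >>     <name>http.content.limit</name>
>> >>     <value>10485760</value>
>> >>   </property>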
>> >>
>> >> On Tue, May 22, 2012 at 10:36 AM, Tolga <[email protected]> wrote:
>> >> > What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
>> >> >
>> >> > On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
>> >> >>
>> >> >> Yes I know.
>> >> >>
>> >> >> If your PDFs are larger than this then they will either be truncated
>> >> >> or not crawled at all. Please look thoroughly at your log output...
>> >> >> you may wish to use the http.verbose and fetcher.verbose properties
>> >> >> as well.
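>> >> >>
>> >> >> For example, an override along these lines in nutch-site.xml (just
>> >> >> an illustration) turns the extra logging on:
>> >> >>
>> >> >>   <property>
>> >> >>     <name>http.verbose</name>
>> >> >>     <value>true</value>
>> >> >>   </property>
>> >> >>   <property>
>> >> >>     <name>fetcher.verbose</name>
>> >> >>     <value>true</value>
>> >> >>   </property>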
>> >> >>
>> >> >> On Tue, May 22, 2012 at 10:31 AM, Tolga <[email protected]> wrote:
>> >> >>>
>> >> >>> The value is 65536
>> >> >>>
>> >> >>> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
>> >> >>>>
>> >> >>>> Try your http.content.limit and also make sure that you haven't
>> >> >>>> changed anything within the Tika mimeType mappings.
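>> >> >>>>
>> >> >>>> The Tika mapping lives in conf/parse-plugins.xml (and parse-tika
>> >> >>>> has to be enabled in plugin.includes); as a rough illustration,
>> >> >>>> check against your own copy, the kind of PDF entry to look for is:
>> >> >>>>
>> >> >>>>   <mimeType name="application/pdf">
>> >> >>>>     <plugin id="parse-tika" />
>> >> >>>>   </mimeType>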
>> >> >>>>
>> >> >>>> On Tue, May 22, 2012 at 9:06 AM, Tolga <[email protected]> wrote:
>> >> >>>>>
>> >> >>>>> Sorry, I also forgot to add my original problem: PDF files are
>> >> >>>>> not crawled. I even modified -topN to be 10.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> -------- Original Message --------
>> >> >>>>> Subject:        PDF not crawled/indexed
>> >> >>>>> Date:   Tue, 22 May 2012 10:48:15 +0300
>> >> >>>>> From:   Tolga <[email protected]>
>> >> >>>>> To:     [email protected]
>> >> >>>>>
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> Hi,
>> >> >>>>>
>> >> >>>>> I am crawling my website with this command:
>> >> >>>>>
>> >> >>>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
>> >> >>>>> http://localhost:8983/solr/ -depth 20 -topN 5
>> >> >>>>>
>> >> >>>>> Is it a good idea to modify the directory name? Should I always
>> >> >>>>> delete indexes prior to crawling and stick to the same directory
>> >> >>>>> name?
>> >> >>>>>
>> >> >>>>> Regards,
>> >> >>>>>
>> >> >>>>
>> >> >>
>> >> >>
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Lewis
>> >>
>>
>>
>>
>> --
>> Lewis
>>



-- 
Lewis
