I have been dealing with the exact same issue, and I wonder what happens
to PDFs that exceed the file size limit: are they truncated (and partly
parsed?) or ignored entirely? I seem to observe parsing problems in PDFs
since setting a file size limit. Setting the limit to -1 indeed caused
consistent choke errors on large pages/files, so setting a hard limit
seemed the only option.

thanks

Piet


On Tue, May 22, 2012 at 11:44 AM, Lewis John Mcgibbney <
[email protected]> wrote:

> Yes, well then you should either set this property to -1 (which is a
> safeguard to ensure that you definitely crawl and parse all of your
> PDFs) or, as a safer alternative, to a responsible value that reflects
> the size of the PDFs or other documents you expect to obtain during
> your crawl. The first option has the downside that on occasion the
> parser can choke on rather large files...
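
The property in question is http.content.limit, set in conf/nutch-site.xml
(the default is 65536 bytes). A minimal sketch of overriding it, with the
10 MB value chosen purely as an illustration rather than a recommendation:

    <!-- conf/nutch-site.xml -->
    <property>
      <name>http.content.limit</name>
      <!-- maximum content size in bytes; -1 disables the limit, at the
           risk of the parser choking on very large files -->
      <value>10485760</value>
    </property>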
>
> On Tue, May 22, 2012 at 10:36 AM, Tolga <[email protected]> wrote:
> > What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
> >
> > On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
> >>
> >> Yes I know.
> >>
> >> If your PDFs are larger than this, then they will either be truncated
> >> or not crawled at all. Please look thoroughly at your log output...
> >> you may wish to use the http.verbose and fetcher.verbose properties as
> >> well.
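
The two verbose properties mentioned above can also be switched on in
conf/nutch-site.xml; a rough sketch:

    <property>
      <name>http.verbose</name>
      <value>true</value>
    </property>
    <property>
      <name>fetcher.verbose</name>
      <value>true</value>
    </property>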
> >>
> >> On Tue, May 22, 2012 at 10:31 AM, Tolga<[email protected]>  wrote:
> >>>
> >>> The value is 65536
> >>>
> >>> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
> >>>>
> >>>> Try your http.content.limit setting, and also make sure that you
> >>>> haven't changed anything within the Tika mimeType mappings.
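
The Tika mimeType mappings referred to here live in conf/parse-plugins.xml.
The sketch below shows what a PDF mapping typically looks like; check your
own copy rather than relying on this, and make sure parse-tika is listed in
plugin.includes:

    <!-- conf/parse-plugins.xml: route PDFs to the Tika parser -->
    <mimeType name="application/pdf">
      <plugin id="parse-tika" />
    </mimeType>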
> >>>>
> >>>> On Tue, May 22, 2012 at 9:06 AM, Tolga<[email protected]>    wrote:
> >>>>>
> >>>>> Sorry, I forgot to also mention my original problem: PDF files are
> >>>>> not crawled. I even modified -topN to be 10.
> >>>>>
> >>>>>
> >>>>> -------- Original Message --------
> >>>>> Subject:        PDF not crawled/indexed
> >>>>> Date:   Tue, 22 May 2012 10:48:15 +0300
> >>>>> From:   Tolga<[email protected]>
> >>>>> To:     [email protected]
> >>>>>
> >>>>>
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am crawling my website with this command:
> >>>>>
> >>>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
> >>>>> http://localhost:8983/solr/ -depth 20 -topN 5
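
One way to check whether a given PDF actually made it into the crawldb is
bin/nutch readdb; the crawl directory and URL below are hypothetical
examples, not taken from this thread:

    # statistics for the whole crawldb
    bin/nutch readdb crawl-2012-05-22T10-48-15/crawldb -stats

    # status of one specific URL
    bin/nutch readdb crawl-2012-05-22T10-48-15/crawldb -url http://example.com/docs/report.pdf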
> >>>>>
> >>>>> Is it a good idea to modify the directory name? Should I always
> >>>>> delete indexes prior to crawling and stick to the same directory
> >>>>> name?
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>
> >>
> >>
> >
>
>
>
> --
> Lewis
>
