RE: PDF not crawled/indexed

Markus Jelsma Tue, 22 May 2012 02:39:43 -0700
Please read the description.
 
 
-----Original message-----
> From:Tolga <[email protected]>
> Sent: Tue 22-May-2012 11:37
> To: [email protected]
> Subject: Re: PDF not crawled/indexed
> 
> What is that value's unit? Kilobytes? My PDF file is 4.7 MB.
> 
> On 5/22/12 12:34 PM, Lewis John Mcgibbney wrote:
> > Yes I know.
> >
> > If your PDFs are larger than this, they will either be truncated
> > or may not be crawled. Please look thoroughly at your log output...
> > you may wish to use the http.verbose and fetcher.verbose properties as
> > well.
> >
> > On Tue, May 22, 2012 at 10:31 AM, Tolga<[email protected]>  wrote:
> >> The value is 65536
> >>
> >> On 5/22/12 12:14 PM, Lewis John Mcgibbney wrote:
> >>> Try checking your http.content.limit, and also make sure that you
> >>> haven't changed anything within the Tika mimeType mappings.
> >>>
> >>> On Tue, May 22, 2012 at 9:06 AM, Tolga<[email protected]>    wrote:
> >>>> Sorry, I forgot to also add my original problem. PDF files are not
> >>>> crawled.
> >>>> I even modified -topN to be 10.
> >>>>
> >>>>
> >>>> -------- Original Message --------
> >>>> Subject:        PDF not crawled/indexed
> >>>> Date:   Tue, 22 May 2012 10:48:15 +0300
> >>>> From:   Tolga<[email protected]>
> >>>> To:     [email protected]
> >>>>
> >>>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> I am crawling my website with this command:
> >>>>
> >>>> bin/nutch crawl urls -dir crawl-$(date +%FT%H-%M-%S) -solr
> >>>> http://localhost:8983/solr/ -depth 20 -topN 5
> >>>>
> >>>> Is it a good idea to modify the directory name? Should I always delete
> >>>> indexes prior to crawling and stick to the same directory name?
> >>>>
> >>>> Regards,
> >>>>
> >>>
> >
> >
>
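
For anyone reading the archive later, here is a rough sketch of the size comparison this thread is about. It assumes http.content.limit is measured in bytes, which is what the property's description in nutch-default.xml says, and it treats 1 MB as 1024 * 1024 bytes. The variable names are made up for illustration.

```shell
#!/bin/sh
# Assumption: http.content.limit is a byte count, per its description
# in nutch-default.xml. All variable names here are illustrative only.
limit_bytes=65536                        # value reported in the thread
pdf_bytes=$((47 * 1024 * 1024 / 10))     # roughly 4.7 MB, integer arithmetic

# A negative limit means "no truncation", so only a non-negative limit
# smaller than the file size can cut the download short.
if [ "$limit_bytes" -ge 0 ] && [ "$pdf_bytes" -gt "$limit_bytes" ]; then
    verdict="truncated"
else
    verdict="fetched whole"
fi

echo "limit=${limit_bytes} bytes ($((limit_bytes / 1024)) KiB), pdf=${pdf_bytes} bytes: ${verdict}"
```

If the comparison reports truncation, the usual remedy is to override http.content.limit in conf/nutch-site.xml with a value larger than the biggest expected file, or with a negative value to turn truncation off. Check the description shipped with your Nutch version before relying on this.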