Re: Nutch 2.2.1: PDF issue

remi tassing Sun, 13 Apr 2014 17:09:17 -0700

Hi Laxmi,

Could you provide some examples?



On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]> wrote:

> Hi Sebastian,
>
> Yes, you are right, there is *no* title defined in the PDF's "info"
> container, and that is when Nutch returns empty titles, whereas Google
> somehow returns a title from the content of the PDF document even if
> there is no title defined in its "info" container aka PDF
> properties/metadata. Not sure why Tika's behavior has been set that way.
>
>
>
>
> On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]> wrote:
>
> > Hi,
> >
> > can you provide a concrete example?
> > What does Google show as title?
> > If there is no title defined in PDF's "info" container
> > (aka properties aka meta data) it must be, e.g.,
> > - file name / URL
> > - first heading
> > or something similar.
> >
> > Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
> > If in doubt, you should check the behavior of the current
> > Tika version and possibly ask on the Tika mailing list
> > if you think it's a defect of the PDF parser.
> >
> > Thanks,
> > Sebastian
> >
> >
> > On 04/12/2014 11:20 PM, A Laxmi wrote:
> > > Hi,
> > >
> > > Nutch doesn't seem to grab the title of PDF files when there is *no
> > > title* defined in PDF properties, whereas Google does. Could someone
> > > explain if any additional tweaking has to be done on the Nutch side
> > > so it does not return an empty title?
> > >
> > > Thanks!
> > >
> >
> >
>
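
For reference, the fallback order Sebastian lists (title from the "info"
container, otherwise a first heading, otherwise the file name / URL) could
be sketched roughly as below. This is only an illustration under stated
assumptions, not what Nutch, Tika, or Google actually do: the class and
method names are hypothetical, and "first heading" is approximated as the
first non-blank line of the extracted text.

```java
// Hypothetical helper for illustration; not part of Nutch or Tika.
class TitleFallback {

    /**
     * Picks a display title for a parsed document.
     * Order: metadata title, first non-blank line of text, file name from URL.
     */
    static String chooseTitle(String metaTitle, String text, String url) {
        // 1. Use the title from the PDF "info" container if it is non-empty.
        if (metaTitle != null && !metaTitle.trim().isEmpty()) {
            return metaTitle.trim();
        }

        // 2. Assumption: treat the first non-blank line of the extracted
        //    text as a stand-in for the "first heading".
        if (text != null) {
            for (String line : text.split("\\R")) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    return trimmed;
                }
            }
        }

        // 3. Fall back to the file name taken from the URL, or the URL itself.
        if (url == null) {
            return "";
        }
        String path = url;
        int cut = path.indexOf('?');
        if (cut >= 0) {
            path = path.substring(0, cut);   // drop query string
        }
        cut = path.indexOf('#');
        if (cut >= 0) {
            path = path.substring(0, cut);   // drop fragment
        }
        String name = path.substring(path.lastIndexOf('/') + 1);
        return name.isEmpty() ? url : name;
    }
}
```

In Nutch, logic of this kind would presumably sit in a custom parse or
indexing filter that runs when the parsed title comes back empty; where
exactly to hook it in is left open here.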
