Hi Laxmi,

Could you provide some examples?

On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]> wrote:

> Hi Sebastian,
>
> Yes, you are right, there is *no* title defined in the PDF's "info"
> container, and that is when Nutch returns empty titles, whereas Google
> somehow returns a title from the content of the PDF document even if
> there is no title defined in its "info" container, aka PDF
> properties/metadata. Not sure why Tika's behavior has been set that way.
>
> On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]> wrote:
>
> > Hi,
> >
> > can you provide a concrete example?
> > What does Google show as title?
> > If there is no title defined in PDF's "info" container
> > (aka properties aka meta data) it must be, e.g.,
> > - file name / URL
> > - first heading
> > or something similar.
> >
> > Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
> > If in doubt, you should check the behavior of the current
> > Tika version and possibly ask on the Tika mailing list
> > if you think it's a defect of the PDF parser.
> >
> > Thanks,
> > Sebastian
> >
> > On 04/12/2014 11:20 PM, A Laxmi wrote:
> > > Hi,
> > >
> > > Nutch doesn't seem to grab the title of PDF files when there is *no
> > > title* defined in the PDF properties, whereas Google does. Could someone
> > > explain whether any additional tweaking has to be done on the Nutch side
> > > so it does not return an empty title?
> > >
> > > Thanks!
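
For illustration only, the fallback order suggested in the thread (metadata title, then first heading, then file name / URL) could be sketched roughly as below. This is not Nutch or Tika code, and nothing here is known about what Google actually does. The function name `fallback_title` and its parameters are hypothetical, and "first heading" is approximated as the first non-empty line of the extracted text, which is an assumption.

```python
import posixpath
from urllib.parse import urlparse, unquote


def fallback_title(metadata_title, body_text, url):
    """Pick a display title for a document (hypothetical helper, not a Nutch/Tika API).

    Order: metadata title, then first non-empty text line, then file name from the URL.
    """
    # 1. Title from the PDF "info" container, if present and non-blank.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    # 2. First non-empty line of the extracted text, used here as a
    #    rough stand-in for "first heading" (an assumption).
    for line in (body_text or "").splitlines():
        if line.strip():
            return line.strip()

    # 3. File name taken from the URL path; fall back to the URL itself
    #    when the path has no file name component.
    name = posixpath.basename(urlparse(url).path)
    return unquote(name) or url
```

A post-parse step of this kind would only apply when the parser hands back an empty title; whether it belongs in a Nutch plugin or in Tika itself is the open question in the thread.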

