Hi Remi & Sebastian: Here is the example: http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
When Nutch crawls the above, it doesn't grab the title since there is no title defined in the pdf properties. When the same file was searched in Google, you can see the title - https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf

Thanks..

On Sun, Apr 13, 2014 at 8:08 PM, remi tassing <[email protected]> wrote:

> Hi Laxmi,
>
> Could you provide some examples?
>
>
> On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]> wrote:
>
> > Hi Sebastian,
> >
> > Yes, you are right, there is *no* title defined in the PDF's "info"
> > container and that is when Nutch is returning empty titles where as Google
> > somehow returns the title from the content of the PDF document even if
> > there is no title defined in its "info" container aka PDF
> > properties/metadata. Not sure why Tika's behavior has been set that way.
> >
> >
> > On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > can you provide a concrete example?
> > > What does Google show as title?
> > > If there is no title defined in PDF's "info" container
> > > (aka properties aka meta data) it must be, e.g.,
> > > - file name / URL
> > > - first heading
> > > or something similar.
> > >
> > > Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
> > > In doubt, you should check the behavior of the current
> > > Tika version and ev. ask on the Tika mailing list
> > > if you thinks it's a defect of the PDF parser.
> > >
> > > Thanks,
> > > Sebastian
> > >
> > >
> > > On 04/12/2014 11:20 PM, A Laxmi wrote:
> > > > Hi,
> > > >
> > > > Nutch doesn't seem to grab the title of PDF files when there is *no
> > > > title* defined in PDF properties where as Google does. Could someone
> > > > explain if
> > > > any additional tweaking has to be done from Nutch side so it does not
> > > > return empty title?
> > > >
> > > > Thanks!
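
For reference, the fallback order Sebastian mentions in the thread (title from the PDF "info" container, else first heading, else file name / URL) could be sketched roughly as follows. This is only an illustration in Python and not what Nutch, Tika, or Google actually do: `fallback_title` and its parameters are hypothetical names, and "first heading" is approximated here as the first non-empty line of the extracted text, which is an assumption.

```python
import posixpath
from urllib.parse import unquote, urlparse


def fallback_title(metadata_title, body_text, url):
    """Pick a display title when the PDF metadata title may be missing.

    Hypothetical helper: the name, parameters, and ordering are
    assumptions for illustration, not part of Nutch or Tika.
    """
    # 1. Prefer the title from the PDF "info" container when it is set.
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()

    # 2. Otherwise approximate "first heading" with the first
    #    non-empty line of the extracted text (an assumption).
    for line in (body_text or "").splitlines():
        if line.strip():
            return line.strip()

    # 3. Otherwise fall back to the file name taken from the URL path,
    #    or the URL itself if the path has no file name.
    name = unquote(posixpath.basename(urlparse(url).path))
    return name or url
```

With an empty metadata title and no extracted text, the example URL above would fall through to the file name, `greene-etal-sofew2006proc.pdf`. In Nutch something like this would presumably live in a custom parse or indexing filter, but I have not tried that.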

