Hi Bin -

> *I am guessing maybe instead of parsing the raw pdf file, Google is actually
> taking advantage of other pages within the same domain/site and using the
> anchor text as the PDF file title if the PDF property is missing title*

Any ideas on how this behavior can be incorporated in Nutch 2.2.1?

Thanks for your observations!!


On Sun, Apr 13, 2014 at 11:41 PM, Bin Wang <[email protected]> wrote:

> Here are some observations that I noticed, not sure if they will be helpful
> or not:
>
> (1) You can see the version of the parsed PDF cached by Google using Google
> Cache:
>
> http://webcache.googleusercontent.com/search?q=cache:FP2qlSjDH1wJ:www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf+&cd=1&hl=en&ct=clnk&gl=us
>
> When I looked into the source code of the Google Cache version, I could not
> see the complete title anywhere in the page or in the meta data.
>
> For example:
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta
> name="CreationDate" content="D:20080201131312-06'00'"><meta
> name="Author" content="Pookey"><meta name="Creator" content="Acrobat
> PDFMaker 8.1 for Word">...
>
> Even the title has been broken into pieces that are scattered all around
> the Google cached version.
>
> (2) If you go one level up from the PDF file, you will end up on this page
> (I am not sure whether it is simply one level up or it is actually because
> it has a link to the pdf file):
> http://www.srs.fs.usda.gov/econ/data/forestincentives/
> You can see the complete title right there in the source code:
> ...
> <dt><a href="greene-etal-sofew2006proc.pdf">Existing and Potential
> Incentives for Practicing Sustainable Forestry on Non-industrial Private
> Forest Lands</a> (pdf 294 KB)</dt>
> <dd>John L. Greene, Michael A. Kilgore, Michael G. Jacobson, Steven E.
> Daniels and Thomas J. Straka. Proceedings, Southern Forest Economics
> Workshop (2006)</dd>
> ...
>
> I am guessing maybe instead of parsing the raw pdf file, Google is actually
> taking advantage of other pages within the same domain/site and using the
> anchor text as the PDF file title if the PDF property is missing title.
>
> Thanks!
>
> /usr/bin
>
>
> On Sun, Apr 13, 2014 at 7:56 PM, A Laxmi <[email protected]> wrote:
>
> > Hi Remi & Sebastian:
> >
> > Here is the example:
> >
> > http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
> >
> > When Nutch crawls the above, it doesn't grab the title since there is no
> > title defined in the pdf properties. When the same file was searched in
> > Google, you can see the title -
> >
> > https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
> >
> > Thanks..
> >
> >
> > On Sun, Apr 13, 2014 at 8:08 PM, remi tassing <[email protected]> wrote:
> >
> > > Hi Laxmi,
> > >
> > > Could you provide some examples?
> > >
> > >
> > > On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]> wrote:
> > >
> > > > Hi Sebastian,
> > > >
> > > > Yes, you are right, there is *no* title defined in the PDF's "info"
> > > > container, and that is when Nutch returns empty titles, whereas
> > > > Google somehow returns the title from the content of the PDF
> > > > document even if there is no title defined in its "info" container
> > > > aka PDF properties/metadata. Not sure why Tika's behavior has been
> > > > set that way.
> > > >
> > > >
> > > > On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > can you provide a concrete example?
> > > > > What does Google show as title?
> > > > > If there is no title defined in PDF's "info" container
> > > > > (aka properties aka meta data) it must be, e.g.,
> > > > > - file name / URL
> > > > > - first heading
> > > > > or something similar.
> > > > >
> > > > > Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
> > > > > If in doubt, you should check the behavior of the current
> > > > > Tika version and possibly ask on the Tika mailing list
> > > > > if you think it's a defect of the PDF parser.
> > > > >
> > > > > Thanks,
> > > > > Sebastian
> > > > >
> > > > >
> > > > > On 04/12/2014 11:20 PM, A Laxmi wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Nutch doesn't seem to grab the title of PDF files when there is
> > > > > > *no title* defined in PDF properties whereas Google does. Could
> > > > > > someone explain if any additional tweaking has to be done from
> > > > > > Nutch side so it does not return empty title?
> > > > > >
> > > > > > Thanks!
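
For reference, the fallback order discussed in this thread (PDF metadata title first, then anchor text from pages that link to the PDF, then the file name from the URL) could be sketched roughly as below. This is only a language-neutral illustration in Python, not Nutch or Tika code: `choose_title`, `normalize`, and the `GENERIC_ANCHORS` stoplist are hypothetical names, and it assumes the crawler can hand over the anchor texts of the inlinks at indexing time. Whether Google actually works this way is a guess, as noted above.

```python
from collections import Counter
from posixpath import basename
from urllib.parse import unquote, urlparse

# Hypothetical stoplist: anchors that say nothing about the linked document.
GENERIC_ANCHORS = {"pdf", "download", "click here", "here", "link"}


def normalize(text):
    """Collapse runs of whitespace (anchor text often wraps across lines)."""
    return " ".join((text or "").split())


def choose_title(pdf_title, inlink_anchors, url):
    """Pick a display title: PDF metadata, else anchor text, else file name."""
    title = normalize(pdf_title)
    if title:
        return title

    # Count usable anchor texts from pages that link to the PDF.
    counts = Counter()
    for anchor in inlink_anchors:
        cleaned = normalize(anchor)
        if cleaned and cleaned.lower() not in GENERIC_ANCHORS:
            counts[cleaned] += 1
    if counts:
        # Prefer the most frequent anchor; break ties by taking the longer one.
        return max(counts, key=lambda a: (counts[a], len(a)))

    # Last resort: the file name taken from the URL path (or the URL itself).
    return unquote(basename(urlparse(url).path)) or url
```

Called with an empty metadata title, the anchor text from the forestincentives index page, and the PDF URL from this thread, the function returns the anchor text with its line breaks collapsed. With no usable anchors it falls back to `greene-etal-sofew2006proc.pdf`. In Nutch the same decision would have to live in Java, somewhere that has access to both the parsed title and the inlink anchors.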

