Thanks, Sebastian! Yes, I would follow what Bin suggested.

Bin - Sorry for the delay in my response! What you mentioned gave me the proper background on how Google crawls/indexes such documents with no title. Thank you so much!! The statement from Tom White's book was very helpful. Thanks for bringing that up.
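
A minimal sketch of the inlink idea, assuming a text dump in the format Bin showed (a "<url> Inlinks:" line followed by "fromUrl: ... anchor: ..." lines). The function names and the "longest anchor wins" rule are only illustrative assumptions, not anything Nutch itself does:

```python
import re
from collections import defaultdict

# Matches one inlink line of the dump, e.g.
# " fromUrl: http://www.st-georg.ch/ anchor: Gästebuch"
INLINK_RE = re.compile(r"^\s*fromUrl:\s*(\S+)\s+anchor:\s*(.*)$")


def parse_linkdb_dump(lines):
    """Map each target URL to a list of (from_url, anchor_text) pairs."""
    inlinks = defaultdict(list)
    target = None
    for raw in lines:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.endswith("Inlinks:"):
            # Header line: "<target url> Inlinks:"
            target = stripped[: -len("Inlinks:")].strip()
            continue
        match = INLINK_RE.match(line)
        if match and target is not None:
            inlinks[target].append((match.group(1), match.group(2).strip()))
    return dict(inlinks)


def fallback_title(url, inlinks):
    """Hypothetical heuristic: longest non-empty anchor text, or None."""
    anchors = [anchor for _, anchor in inlinks.get(url, []) if anchor]
    return max(anchors, key=len) if anchors else None
```

The result of fallback_title() could then be used as the title whenever the parser returns an empty one. How to wire that into the Nutch indexing step is not covered here.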

On Tue, Apr 15, 2014 at 1:33 PM, Sebastian Nagel <[email protected]> wrote:

> There is also Google scholar which knows 3 versions and 18 citations of the
> "Sustainable Forestry" paper:
>
> http://scholar.google.de/scholar?cluster=9858548666546300620&hl=de&as_sdt=0,5
>
> Citations and links from bibliographic lists provide, of course, big
> support.
>
> "Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"
> Since Tika / PDFbox does not extract clean text, I follow
> Bin Wang: try it using inlinks.
>
> Sebastian
>
> On 04/15/2014 03:28 AM, Bin Wang wrote:
> > Hi Laxmi,
> >
> > One can easily see in Google's crawldb / linkdb, there is only one page
> > that links directly to that paper by using "link: operator", which is its
> > parent page:
> >
> > https://www.google.com/#q=link:+http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
> >
> > Per you question:
> >
> > "Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"
> > FYI, I am also an entry-level Nutcher, but I bet the linkdb might contain
> > the information that you need.
> > Cited From Nutch Wiki:
> >
> > 1. The link database, or linkdb. This contains the list of known links
> > to each URL, including both the source URL and anchor text of the link.
> >
> > btw, you can view the linkdb content by using the cmd "bin/nutch readlinkdb
> > crawl/linkdb/ -dump tmp", and it will generate a text file containing the
> > content like below:
> >
> > http://105558.netguestbook.com/ Inlinks:
> >  fromUrl: http://105559.netguestbook.com/ anchor:
> >
> > http://105559.netguestbook.com/ Inlinks:
> >  fromUrl: http://www.st-georg.ch/ anchor: Gästebuch
> >
> > http://105560.netguestbook.com/ Inlinks:
> >  fromUrl: http://105559.netguestbook.com/ anchor:
> >
> > http://105985.forums.motigo.com/ Inlinks:
> >  fromUrl: http://www.vkgf-info.de/ anchor: Forum
> >
> > http://1254.virgilio.it/ Inlinks:
> >  fromUrl: http://www.virgilio.it/ anchor: 1254
> >
> > If you have successfully scrape that site in Nutch. You will have your
> > linkdb looks like this:
> >
> > http://www.srs.fs.usda.gov/econ/data/forestincentives/ Inlinks:
> >  fromUrl: http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
> >  anchor: Existing and Potential Incentives for Practicing Sustainable
> >  Forestry on Non-industrial Private Forest Lands
> >
> > After you invert the links, you can index the linkdb by including the
> > -linkdb parameter when you index using solr.
> >
> > Tom White also mentioned in this book "Hadoop: The Definite Guide" about
> > the linkdb: "However, most algorithms for calculating a page’s importance
> > (or quality) need the opposite information, that is, what pages contain
> > outlinks that point to the current page. This information is not readily
> > available when crawling. Also, the indexing process benefits from taking
> > into account the anchor text on inlinks so that this text may semantically
> > enrich the text of the current page.", which makes me more confident that
> > the anchor text is the secret...
> >
> > Anyway, I think the discussion has been slightly switched from Nutch to
> > Solr... if you need more information about how to query the linkdb in solr.
> > You can ask that in the solr community, maybe? :)
> >
> > /usr/bin
> >
> > On Mon, Apr 14, 2014 at 7:43 AM, A Laxmi <[email protected]> wrote:
> >
> >> Hi Bin -
> >>
> >> *I am guessing maybe instead of parsing the raw pdf file, Google is
> >> actually taking advantage of other pages within the same domain/site and
> >> use the anchor text as the PDF file title if the PDF property is missing
> >> title*
> >>
> >> Any ideas on how this behavior can be incorporated in Nutch 2.2.1?
> >>
> >> Thanks for your observations!!
> >>
> >> On Sun, Apr 13, 2014 at 11:41 PM, Bin Wang <[email protected]> wrote:
> >>
> >>> Here are some observations that I noticed, not sure if will be helpful or
> >>> not:
> >>>
> >>> (1) You can see the version of parsed PDF cached by Google using Google
> >>> Cache:
> >>>
> >>> http://webcache.googleusercontent.com/search?q=cache:FP2qlSjDH1wJ:www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf+&cd=1&hl=en&ct=clnk&gl=us
> >>> When I looked into the source code of Google Cache version, I can not even
> >>> see the complete title name anywhere in the page nor the meta data:
> >>>
> >>> For example:
> >>> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta
> >>> name="CreationDate" content="D:20080201131312-06'00'"><meta
> >>> name="Author" content="Pookey"><meta name="Creator" content="Acrobat
> >>> PDFMaker 8.1 for Word">...
> >>> Even the title has been broken into pieces that scattered all around google
> >>> cached version.
> >>> (2) If you go one level up the PDF file, you will end up in this page(I am
> >>> not sure it is just simple one level up or it is actually because it has a
> >>> link to the pdf file):
> >>> http://www.srs.fs.usda.gov/econ/data/forestincentives/
> >>> You can see the title that perfectly lying in the source code:
> >>> ...
> >>> <dt><a href="greene-etal-sofew2006proc.pdf">Existing and Potential
> >>> Incentives for Practicing Sustainable Forestry on Non-industrial Private
> >>> Forest Lands</a> (pdf 294 KB)</dt>
> >>> <dd>John L. Greene, Michael A. Kilgore, Michael G. Jacobson, Steven E.
> >>> Daniels and Thomas J. Straka. Proceedings, Southern Forest Economics
> >>> Workshop (2006)</dd>
> >>> ...
> >>>
> >>> I am guessing maybe instead of parsing the raw pdf file, Google is actually
> >>> taking advantage of other pages within the same domain/site and use the
> >>> anchor text as the PDF file title if the PDF property is missing title.
> >>>
> >>> Thanks!
> >>>
> >>> /usr/bin
> >>>
> >>> On Sun, Apr 13, 2014 at 7:56 PM, A Laxmi <[email protected]> wrote:
> >>>
> >>>> Hi Remi & Sebastian:
> >>>>
> >>>> Here is the example:
> >>>>
> >>>> http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
> >>>>
> >>>> When Nutch crawls the above, it doesn't grab the title since there is no
> >>>> title defined in the pdf properties. When the same file was searched in
> >>>> Google, you can see the title -
> >>>>
> >>>> https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
> >>>>
> >>>> Thanks..
> >>>>
> >>>> On Sun, Apr 13, 2014 at 8:08 PM, remi tassing <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi Laxmi,
> >>>>>
> >>>>> Could you provide some examples?
> >>>>>
> >>>>> On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]> wrote:
> >>>>>
> >>>>>> Hi Sebastian,
> >>>>>>
> >>>>>> Yes, you are right, there is *no* title defined in the PDF's "info"
> >>>>>> container and that is when Nutch is returning empty titles where as
> >>>>>> Google somehow returns the title from the content of the PDF document
> >>>>>> even if there is no title defined in its "info" container aka PDF
> >>>>>> properties/metadata. Not sure why Tika's behavior has been set that way.
> >>>>>>
> >>>>>> On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> can you provide a concrete example?
> >>>>>>> What does Google show as title?
> >>>>>>> If there is no title defined in PDF's "info" container
> >>>>>>> (aka properties aka meta data) it must be, e.g.,
> >>>>>>> - file name / URL
> >>>>>>> - first heading
> >>>>>>> or something similar.
> >>>>>>>
> >>>>>>> Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
> >>>>>>> In doubt, you should check the behavior of the current
> >>>>>>> Tika version and ev. ask on the Tika mailing list
> >>>>>>> if you thinks it's a defect of the PDF parser.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Sebastian
> >>>>>>>
> >>>>>>> On 04/12/2014 11:20 PM, A Laxmi wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Nutch doesn't seem to grab the title of PDF files when there is *no
> >>>>>>>> title* defined in PDF properties where as Google does. Could someone
> >>>>>>>> explain if any additional tweaking has to be done from Nutch side so
> >>>>>>>> it does not return empty title?
> >>>>>>>>
> >>>>>>>> Thanks!

