There is also Google Scholar, which knows 3 versions and 18 citations of
the "Sustainable Forestry" paper:
http://scholar.google.de/scholar?cluster=9858548666546300620&hl=de&as_sdt=0,5

Citations and links from bibliographic lists provide, of course, strong
support.

> "Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"

Since Tika / PDFBox does not extract clean text, I follow Bin Wang:
try it using inlinks.

Sebastian

On 04/15/2014 03:28 AM, Bin Wang wrote:
> Hi Laxmi,
>
> Using the "link:" operator, one can easily see that in Google's
> crawldb / linkdb there is only one page that links directly to that
> paper, which is its parent page:
> https://www.google.com/#q=link:+http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
>
> Per your question:
>
> "Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"
> FYI, I am also an entry-level Nutcher, but I bet the linkdb might contain
> the information that you need.
> Cited from the Nutch Wiki:
>
> 1. The link database, or linkdb. This contains the list of known links
> to each URL, including both the source URL and anchor text of the link.
>
> btw, you can view the linkdb content by using the cmd "bin/nutch readlinkdb
> crawl/linkdb/ -dump tmp", and it will generate a text file containing
> content like below:
>
> http://105558.netguestbook.com/ Inlinks:
>  fromUrl: http://105559.netguestbook.com/ anchor:
>
> http://105559.netguestbook.com/ Inlinks:
>  fromUrl: http://www.st-georg.ch/ anchor: Gästebuch
>
> http://105560.netguestbook.com/ Inlinks:
>  fromUrl: http://105559.netguestbook.com/ anchor:
>
> http://105985.forums.motigo.com/ Inlinks:
>  fromUrl: http://www.vkgf-info.de/ anchor: Forum
>
> http://1254.virgilio.it/ Inlinks:
>  fromUrl: http://www.virgilio.it/ anchor: 1254
>
>
> If you have successfully scraped that site in Nutch, your linkdb will
> contain an entry like this (the PDF is the target, its parent page is
> the source of the inlink):
>
> http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf Inlinks:
>  fromUrl: http://www.srs.fs.usda.gov/econ/data/forestincentives/
>  anchor: Existing and Potential Incentives for Practicing Sustainable
>  Forestry on Non-industrial Private Forest Lands
>
> After you invert the links, you can index the linkdb by including the
> -linkdb parameter when you index using Solr.
>
> Tom White also mentioned the linkdb in his book "Hadoop: The Definitive
> Guide": "However, most algorithms for calculating a page’s importance
> (or quality) need the opposite information, that is, what pages contain
> outlinks that point to the current page. This information is not readily
> available when crawling. Also, the indexing process benefits from taking
> into account the anchor text on inlinks so that this text may semantically
> enrich the text of the current page.", which makes me more confident that
> the anchor text is the secret...
>
> Anyway, I think the discussion has slightly switched from Nutch to
> Solr... If you need more information about how to query the linkdb in
> Solr, you can ask that in the Solr community, maybe? :)
>
> /usr/bin
>
>
>
> On Mon, Apr 14, 2014 at 7:43 AM, A Laxmi <[email protected]> wrote:
>
>> Hi Bin -
>>
>> *I am guessing maybe instead of parsing the raw pdf file, Google is
>> actually taking advantage of other pages within the same domain/site and
>> use the anchor text as the PDF file title if the PDF property is missing
>> title*
>>
>> Any ideas on how this behavior can be incorporated in Nutch 2.2.1?
>>
>> Thanks for your observations!!
>>
>>
>> On Sun, Apr 13, 2014 at 11:41 PM, Bin Wang <[email protected]> wrote:
>>
>>> Here are some observations that I noticed, not sure if they will be
>>> helpful or not:
>>>
>>> (1) You can see the version of the parsed PDF cached by Google using
>>> Google Cache:
>>>
>>> http://webcache.googleusercontent.com/search?q=cache:FP2qlSjDH1wJ:www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf+&cd=1&hl=en&ct=clnk&gl=us
>>>
>>> When I looked into the source code of the Google Cache version, I could
>>> not see the complete title anywhere in the page or in the metadata.
>>>
>>> For example:
>>> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta
>>> name="CreationDate" content="D:20080201131312-06'00'"><meta name="Author"
>>> content="Pookey"><meta name="Creator" content="Acrobat PDFMaker 8.1
>>> for Word">...
>>> Even the title has been broken into pieces that are scattered all around
>>> the Google cached version.
>>>
>>> (2) If you go one level up from the PDF file, you end up on this page
>>> (I am not sure whether it is simply one level up or whether it is
>>> because it has a link to the PDF file):
>>> http://www.srs.fs.usda.gov/econ/data/forestincentives/
>>> You can see the complete title sitting in the source code:
>>> ...
>>> <dt><a href="greene-etal-sofew2006proc.pdf">Existing and Potential
>>> Incentives for Practicing Sustainable Forestry on Non-industrial Private
>>> Forest Lands</a> (pdf 294 KB)</dt>
>>> <dd>John L. Greene, Michael A. Kilgore, Michael G. Jacobson, Steven E.
>>> Daniels and Thomas J. Straka. Proceedings, Southern Forest Economics
>>> Workshop (2006)</dd>
>>> ...
>>>
>>> I am guessing maybe instead of parsing the raw pdf file, Google is
>>> actually taking advantage of other pages within the same domain/site
>>> and use the anchor text as the PDF file title if the PDF property is
>>> missing title.
>>>
>>> Thanks!
>>>
>>> /usr/bin
>>>
>>>
>>> On Sun, Apr 13, 2014 at 7:56 PM, A Laxmi <[email protected]> wrote:
>>>
>>>> Hi Remi & Sebastian:
>>>>
>>>> Here is the example:
>>>>
>>>> http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
>>>>
>>>> When Nutch crawls the above, it doesn't grab the title since there is
>>>> no title defined in the PDF properties. When the same file is searched
>>>> for in Google, you can see the title:
>>>>
>>>> https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
>>>>
>>>> Thanks..
>>>>
>>>>
>>>>
>>>> On Sun, Apr 13, 2014 at 8:08 PM, remi tassing <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Laxmi,
>>>>>
>>>>> Could you provide some examples?
>>>>>
>>>>>
>>>>> On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]> wrote:
>>>>>
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> Yes, you are right, there is *no* title defined in the PDF's "info"
>>>>>> container, and that is when Nutch returns empty titles, whereas
>>>>>> Google somehow returns the title from the content of the PDF
>>>>>> document even if there is no title defined in its "info" container,
>>>>>> aka PDF properties/metadata. Not sure why Tika's behavior has been
>>>>>> set that way.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> can you provide a concrete example?
>>>>>>> What does Google show as title?
>>>>>>> If there is no title defined in PDF's "info" container
>>>>>>> (aka properties aka meta data) it must be, e.g.,
>>>>>>> - file name / URL
>>>>>>> - first heading
>>>>>>> or something similar.
>>>>>>>
>>>>>>> Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
>>>>>>> If in doubt, you should check the behavior of the current
>>>>>>> Tika version and possibly ask on the Tika mailing list
>>>>>>> if you think it's a defect of the PDF parser.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>
>>>>>>> On 04/12/2014 11:20 PM, A Laxmi wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Nutch doesn't seem to grab the title of PDF files when there is
>>>>>>>> *no title* defined in PDF properties, whereas Google does. Could
>>>>>>>> someone explain if any additional tweaking has to be done on the
>>>>>>>> Nutch side so it does not return an empty title?
>>>>>>>>
>>>>>>>> Thanks!
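
For illustration only, here is a minimal sketch of the "use inlink anchor text when the PDF has no title" idea from this thread. It is plain Python working on a text dump, not a Nutch plugin and not how Nutch or Google actually do it. It assumes the dump layout shown in the "readlinkdb -dump" example quoted above (a "<url> Inlinks:" line followed by "fromUrl: ... anchor: ..." lines); the function names and the "most frequent anchor wins" rule are my own assumptions.

```python
import re
from collections import Counter, defaultdict

# Assumed dump layout, copied from the example quoted in this thread:
#   <target URL> Inlinks:
#    fromUrl: <source URL> anchor: <anchor text, possibly empty>
TARGET_RE = re.compile(r"^(\S+)\s+Inlinks:\s*$")
INLINK_RE = re.compile(r"^\s*fromUrl:\s+(\S+)\s+anchor:\s*(.*)$")


def parse_linkdb_dump(text):
    """Map each target URL to a list of (source URL, anchor text) pairs."""
    inlinks = defaultdict(list)
    current = None
    for line in text.splitlines():
        target = TARGET_RE.match(line)
        if target:
            current = target.group(1)
            continue
        link = INLINK_RE.match(line)
        if link and current is not None:
            inlinks[current].append((link.group(1), link.group(2).strip()))
    return dict(inlinks)


def title_with_anchor_fallback(url, parsed_title, inlinks):
    """Keep the parsed title if present, else use the most common anchor."""
    if parsed_title and parsed_title.strip():
        return parsed_title.strip()
    # Hypothetical rule: pick the most frequent non-empty anchor text.
    anchors = Counter(a for _, a in inlinks.get(url, []) if a)
    return anchors.most_common(1)[0][0] if anchors else ""


if __name__ == "__main__":
    pdf = ("http://www.srs.fs.usda.gov/econ/data/forestincentives/"
           "greene-etal-sofew2006proc.pdf")
    dump = (
        pdf + " Inlinks:\n"
        " fromUrl: http://www.srs.fs.usda.gov/econ/data/forestincentives/"
        " anchor: Existing and Potential Incentives for Practicing"
        " Sustainable Forestry on Non-industrial Private Forest Lands\n"
    )
    links = parse_linkdb_dump(dump)
    # Tika returned no title for this PDF, so the anchor text is used.
    print(title_with_anchor_fallback(pdf, "", links))
```

Inside Nutch itself the same decision would more naturally live in an indexing filter that sees both the parsed title and the inlink anchors; how inlinks are stored differs between Nutch 1.x and 2.x, so treat the input format here as an assumption.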

