Re: Nutch 2.2.1: PDF issue

Sebastian Nagel Tue, 15 Apr 2014 10:34:49 -0700

There is also Google Scholar, which knows 3 versions and 18 citations of the
"Sustainable Forestry" paper:


 http://scholar.google.de/scholar?cluster=9858548666546300620&hl=de&as_sdt=0,5

Citations and links from bibliographic lists are, of course, a big help.

> "Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"
Since Tika / PDFBox does not extract clean text, I agree with
Bin Wang: try it using inlinks.
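
To make the inlink idea concrete, here is a minimal sketch in Python. It is an offline experiment, not part of Nutch. It assumes the text format of the "readlinkdb -dump" output quoted further down in this thread. The function names and the "longest anchor wins" heuristic are my own assumptions for illustration.

```python
import re
from collections import defaultdict

# Matches one inlink line of the dump, e.g.
#   " fromUrl: http://example.org/ anchor: Some text"
# The layout is assumed from the sample dump quoted in this thread.
INLINK_RE = re.compile(r"^\s*fromUrl:\s*(\S+)\s+anchor:\s*(.*)$")


def parse_linkdb_dump(lines):
    """Map each target URL to a list of (source URL, anchor text) pairs."""
    inlinks = defaultdict(list)
    target = None
    for raw in lines:
        line = raw.rstrip("\n")
        match = INLINK_RE.match(line)
        if match:
            if target is not None:
                inlinks[target].append((match.group(1), match.group(2).strip()))
            continue
        stripped = line.strip()
        if stripped.endswith("Inlinks:"):
            # Header line: "<target URL> Inlinks:"
            target = stripped[: -len("Inlinks:")].strip()
    return dict(inlinks)


def fallback_title(parsed_title, url, inlinks):
    """Keep the parsed title if present, else use an inlink anchor text.

    Hypothetical heuristic: choose the longest non-empty anchor, and
    fall back to the URL itself when there is no usable anchor.
    """
    if parsed_title and parsed_title.strip():
        return parsed_title.strip()
    anchors = [anchor for _, anchor in inlinks.get(url, []) if anchor]
    return max(anchors, key=len) if anchors else url
```

Usage would be along the lines of parse_linkdb_dump(open("tmp/part-00000")), where the file name is also an assumption, followed by fallback_title(title_from_tika, url, inlinks) for each document with an empty title.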

Sebastian

On 04/15/2014 03:28 AM, Bin Wang wrote:
> Hi Laxmi,
> 
> Using Google's "link:" operator, one can easily see that in Google's
> equivalent of the crawldb / linkdb only one page links directly to that
> paper, namely its parent page:
> https://www.google.com/#q=link:+http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
> 
> Per your question:
> 
> "Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"
> FYI, I am also an entry-level Nutcher, but I bet the linkdb might contain
> the information that you need.
> Cited From Nutch Wiki:
> 
>    1. The link database, or linkdb. This contains the list of known links
>    to each URL, including both the source URL and anchor text of the link.
> 
> By the way, you can view the linkdb content with the command "bin/nutch readlinkdb
> crawl/linkdb/ -dump tmp", which generates a text file with content like
> the following:
> 
> http://105558.netguestbook.com/ Inlinks:
>  fromUrl: http://105559.netguestbook.com/ anchor:
> 
> http://105559.netguestbook.com/ Inlinks:
>  fromUrl: http://www.st-georg.ch/ anchor: Gästebuch
> 
> http://105560.netguestbook.com/ Inlinks:
>  fromUrl: http://105559.netguestbook.com/ anchor:
> 
> http://105985.forums.motigo.com/        Inlinks:
>  fromUrl: http://www.vkgf-info.de/ anchor: Forum
> 
> http://1254.virgilio.it/        Inlinks:
>  fromUrl: http://www.virgilio.it/ anchor: 1254
> 
> 
> If you have successfully crawled that site with Nutch, your linkdb will
> contain an entry like this:
> 
> http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf Inlinks:
>  fromUrl: http://www.srs.fs.usda.gov/econ/data/forestincentives/ anchor: Existing and Potential Incentives for Practicing Sustainable Forestry on Non-industrial Private Forest Lands
> 
> After you invert the links, you can include the linkdb in indexing by passing
> the -linkdb parameter when you index with Solr.
> 
> Tom White also writes about the linkdb in his book "Hadoop: The
> Definitive Guide": "However, most algorithms for calculating a page’s importance
> (or quality) need the opposite information, that is, what pages contain
> outlinks that point to the current page. This information is not readily
> available when crawling. Also, the indexing process benefits from taking
> into account the anchor text on inlinks so that this text may semantically
> enrich the text of the current page.", which makes me more confident that
> the anchor text is the secret...
> 
> Anyway, I think the discussion has shifted slightly from Nutch to
> Solr... If you need more information about how to query the linkdb in Solr,
> maybe you can ask in the Solr community? :)
> 
> /usr/bin
> 
> 
> 
> On Mon, Apr 14, 2014 at 7:43 AM, A Laxmi <[email protected]> wrote:
> 
>> Hi Bin -
>>
>>>
>>
>> *I am guessing that instead of parsing the raw PDF file, Google is
>> actually taking advantage of other pages within the same domain/site and
>> using the anchor text as the PDF file title if the PDF's title property is
>> missing*
>>
>> Any ideas on how this behavior can be incorporated in Nutch 2.2.1?
>>
>> Thanks for your observations!!
>>
>>
>> On Sun, Apr 13, 2014 at 11:41 PM, Bin Wang <[email protected]> wrote:
>>
>>> Here are some observations that I made; not sure if they will be helpful
>>> or not:
>>>
>>> (1) You can see the version of the parsed PDF cached by Google using Google
>>> Cache:
>>>
>>> http://webcache.googleusercontent.com/search?q=cache:FP2qlSjDH1wJ:www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf+&cd=1&hl=en&ct=clnk&gl=us
>>>
>>> When I looked into the source code of the Google Cache version, I could not
>>> see the complete title anywhere in the page or in the metadata.
>>>
>>> For example:
>>> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta
>>> name="CreationDate" content="D:20080201131312-06&#39;00&#39;"><meta
>>> name="Author" content="Pookey"><meta name="Creator" content="Acrobat PDFMaker
>>> 8.1 for Word">...
>>> Even the title has been broken into pieces that are scattered all around the
>>> Google cached version.
>>> (2) If you go one level up from the PDF file, you end up on this page (I am
>>> not sure whether it is simply one level up or whether it matters that the
>>> page has a link to the PDF file):
>>> http://www.srs.fs.usda.gov/econ/data/forestincentives/
>>> You can see the complete title right there in the source code:
>>> ...
>>> <dt><a href="greene-etal-sofew2006proc.pdf">Existing and Potential
>>> Incentives for Practicing Sustainable Forestry on Non-industrial Private
>>> Forest Lands</a> (pdf 294 KB)</dt>
>>>   <dd>John L. Greene, Michael A. Kilgore, Michael G. Jacobson, Steven E.
>>> Daniels and Thomas J. Straka. Proceedings, Southern Forest Economics
>>> Workshop (2006)</dd>
>>> ...
>>>
>>> I am guessing that instead of parsing the raw PDF file, Google is actually
>>> taking advantage of other pages within the same domain/site and using the
>>> anchor text as the PDF file title if the PDF's title property is missing.
>>>
>>> Thanks!
>>>
>>> /usr/bin
>>>
>>>
>>> On Sun, Apr 13, 2014 at 7:56 PM, A Laxmi <[email protected]> wrote:
>>>
>>>> Hi Remi & Sebastian:
>>>>
>>>> Here is the example:
>>>>
>>>> http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
>>>>
>>>> When Nutch crawls the above, it doesn't grab the title since there is no
>>>> title defined in the PDF properties. When the same file is searched for in
>>>> Google, you can see the title -
>>>>
>>>> https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
>>>>
>>>> Thanks..
>>>>
>>>>
>>>>
>>>> On Sun, Apr 13, 2014 at 8:08 PM, remi tassing <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Laxmi,
>>>>>
>>>>> Could you provide some examples?
>>>>>
>>>>>
>>>>> On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> Yes, you are right, there is *no* title defined in the PDF's "info"
>>>>>> container, and that is when Nutch returns empty titles, whereas Google
>>>>>> somehow returns the title from the content of the PDF document even if
>>>>>> there is no title defined in its "info" container, aka PDF
>>>>>> properties/metadata. Not sure why Tika's behavior has been set that way.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> can you provide a concrete example?
>>>>>>> What does Google show as title?
>>>>>>> If there is no title defined in PDF's "info" container
>>>>>>> (aka properties aka meta data) it must be, e.g.,
>>>>>>> - file name / URL
>>>>>>> - first heading
>>>>>>> or something similar.
>>>>>>>
>>>>>>> Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
>>>>>>> If in doubt, you should check the behavior of the current
>>>>>>> Tika version and possibly ask on the Tika mailing list
>>>>>>> if you think it's a defect of the PDF parser.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Sebastian
>>>>>>>
>>>>>>>
>>>>>>> On 04/12/2014 11:20 PM, A Laxmi wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Nutch doesn't seem to grab the title of PDF files when there is *no
>>>>>>>> title* defined in the PDF properties, whereas Google does. Could someone
>>>>>>>> explain whether any additional tweaking has to be done on the Nutch side
>>>>>>>> so that it does not return an empty title?
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
