Re: Nutch 2.2.1: PDF issue

A Laxmi Tue, 15 Apr 2014 20:29:19 -0700

Thanks, Sebastian! Yes, I would follow what Bin suggested.

Bin - Sorry for the delay in my response! What you mentioned gave me proper
background on how Google crawls/indexes such documents with no title. Thank
you so much!! The statement from Tom White's book was very helpful. Thanks
for bringing that up.



On Tue, Apr 15, 2014 at 1:33 PM, Sebastian Nagel <[email protected]
> wrote:

> There is also Google scholar which knows 3 versions and 18 citations of the
> "Sustainable Forestry" paper:
>
>
> http://scholar.google.de/scholar?cluster=9858548666546300620&hl=de&as_sdt=0,5
>
> Citations and links from bibliographic lists provide, of course, big
> support.
>
> > "Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"
> Since Tika / PDFBox does not extract clean text, I would follow
> Bin Wang's suggestion: try it using inlinks.
>
> Sebastian
>
> On 04/15/2014 03:28 AM, Bin Wang wrote:
> > Hi Laxmi,
> >
> > Using Google's "link:" operator (in effect a look into Google's crawldb /
> > linkdb), one can easily see that only one page links directly to that
> > paper, namely its parent page:
> >
> https://www.google.com/#q=link:+http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
> >
> > Per your question:
> >
> > "Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"
> > FYI, I am also an entry-level Nutcher, but I bet the linkdb might contain
> > the information that you need.
> > Cited From Nutch Wiki:
> >
> >    1. The link database, or linkdb. This contains the list of known links
> >    to each URL, including both the source URL and anchor text of the
> >    link.
> >
> > By the way, you can view the linkdb content with the command
> > "bin/nutch readlinkdb crawl/linkdb/ -dump tmp", which generates a text
> > dump with content like the following:
> >
> > http://105558.netguestbook.com/ Inlinks:
> >  fromUrl: http://105559.netguestbook.com/ anchor:
> >
> > http://105559.netguestbook.com/ Inlinks:
> >  fromUrl: http://www.st-georg.ch/ anchor: Gästebuch
> >
> > http://105560.netguestbook.com/ Inlinks:
> >  fromUrl: http://105559.netguestbook.com/ anchor:
> >
> > http://105985.forums.motigo.com/        Inlinks:
> >  fromUrl: http://www.vkgf-info.de/ anchor: Forum
> >
> > http://1254.virgilio.it/        Inlinks:
> >  fromUrl: http://www.virgilio.it/ anchor: 1254
> >
> >
> > If you have successfully crawled that site with Nutch, your linkdb
> > should contain an entry like this:
> >
> > http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf Inlinks:
> >  fromUrl: http://www.srs.fs.usda.gov/econ/data/forestincentives/ anchor: Existing and Potential Incentives for Practicing Sustainable Forestry on Non-industrial Private Forest Lands
> >
> > After you invert the links, you can index the linkdb by including the
> > -linkdb parameter when you index using solr.
> >
> > Tom White also wrote about the linkdb in his book "Hadoop: The Definitive
> > Guide": "However, most algorithms for calculating a page’s importance
> > (or quality) need the opposite information, that is, what pages contain
> > outlinks that point to the current page. This information is not readily
> > available when crawling. Also, the indexing process benefits from taking
> > into account the anchor text on inlinks so that this text may semantically
> > enrich the text of the current page." That makes me more confident that
> > the anchor text is the secret...
> >
> > Anyway, I think the discussion has drifted slightly from Nutch to
> > Solr... If you need more information about how to query the linkdb in
> > Solr, you could ask in the Solr community, maybe? :)
> >
> > /usr/bin
> >
> >
> >
> > On Mon, Apr 14, 2014 at 7:43 AM, A Laxmi <[email protected]> wrote:
> >
> >> Hi Bin -
> >>
> >>>
> >>
> >> *I am guessing maybe instead of parsing the raw pdf file, Google is
> >> actually taking advantage of other pages within the same domain/site and
> >> using the anchor text as the PDF file title if the PDF properties lack a
> >> title*
> >>
> >> Any ideas on how this behavior can be incorporated in Nutch 2.2.1?
> >>
> >> Thanks for your observations!!
> >>
> >>
> >> On Sun, Apr 13, 2014 at 11:41 PM, Bin Wang <[email protected]>
> wrote:
> >>
> >>> Here are some observations; I am not sure whether they will be helpful
> >>> or not:
> >>>
> >>> (1) You can see the version of parsed PDF cached by Google using Google
> >>> Cache:
> >>>
> >>>
> >>
> http://webcache.googleusercontent.com/search?q=cache:FP2qlSjDH1wJ:www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf+&cd=1&hl=en&ct=clnk&gl=us
> >>> When I looked into the source code of the Google Cache version, I could
> >>> not even see the complete title anywhere in the page or in the metadata.
> >>>
> >>> For example:
> >>> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> >>> <meta name="CreationDate" content="D:20080201131312-06&#39;00&#39;">
> >>> <meta name="Author" content="Pookey">
> >>> <meta name="Creator" content="Acrobat PDFMaker 8.1 for Word">...
> >>> Even the title has been broken into pieces that are scattered all
> >>> around the Google cached version.
> >>> (2) If you go one level up from the PDF file, you end up on this page (I
> >>> am not sure whether it matters simply because it is one level up, or
> >>> because it actually links to the PDF file):
> >>> http://www.srs.fs.usda.gov/econ/data/forestincentives/
> >>> You can see the complete title right there in the source code:
> >>> ...
> >>> <dt><a href="greene-etal-sofew2006proc.pdf">Existing and Potential
> >>> Incentives for Practicing Sustainable Forestry on Non-industrial Private
> >>> Forest Lands</a> (pdf 294 KB)</dt>
> >>>   <dd>John L. Greene, Michael A. Kilgore, Michael G. Jacobson, Steven E.
> >>> Daniels and Thomas J. Straka. Proceedings, Southern Forest Economics
> >>> Workshop (2006)</dd>
> >>> ...
> >>>
> >>> I am guessing that, instead of parsing the raw PDF file, Google is
> >>> actually taking advantage of other pages within the same domain/site and
> >>> using the anchor text as the PDF file title if the PDF properties lack a
> >>> title.
> >>>
> >>> Thanks!
> >>>
> >>> /usr/bin
> >>>
> >>>
> >>> On Sun, Apr 13, 2014 at 7:56 PM, A Laxmi <[email protected]>
> wrote:
> >>>
> >>>> Hi Remi & Sebastian:
> >>>>
> >>>> Here is the example:
> >>>>
> >>>>
> >>>
> >>
> http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
> >>>>
> >>>> When Nutch crawls the above, it doesn't grab the title since there is
> >>>> no title defined in the PDF properties. When you search for the same
> >>>> file in Google, you can see the title:
> >>>>
> >>>>
> >>>>
> >>>
> >>
> https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
> >>>>
> >>>> Thanks..
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Apr 13, 2014 at 8:08 PM, remi tassing <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi Laxmi,
> >>>>>
> >>>>> Could you provide some examples?
> >>>>>
> >>>>>
> >>>>> On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]>
> >>> wrote:
> >>>>>
> >>>>>> Hi Sebastian,
> >>>>>>
> >>>>>> Yes, you are right, there is *no* title defined in the PDF's "info"
> >>>>>> container, and that is when Nutch returns empty titles, whereas
> >>>>>> Google somehow returns the title from the content of the PDF document
> >>>>>> even if there is no title defined in its "info" container aka PDF
> >>>>>> properties/metadata. Not sure why Tika's behavior has been set that
> >>>>>> way.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <
> >>>>>> [email protected]
> >>>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> can you provide a concrete example?
> >>>>>>> What does Google show as title?
> >>>>>>> If there is no title defined in PDF's "info" container
> >>>>>>> (aka properties aka meta data) it must be, e.g.,
> >>>>>>> - file name / URL
> >>>>>>> - first heading
> >>>>>>> or something similar.
> >>>>>>>
> >>>>>>> Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
> >>>>>>> If in doubt, you should check the behavior of the current
> >>>>>>> Tika version and possibly ask on the Tika mailing list
> >>>>>>> if you think it's a defect of the PDF parser.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Sebastian
> >>>>>>>
> >>>>>>>
> >>>>>>> On 04/12/2014 11:20 PM, A Laxmi wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> Nutch doesn't seem to grab the title of PDF files when there is
> >>>>>>>> *no title* defined in the PDF properties, whereas Google does.
> >>>>>>>> Could someone explain whether any additional tweaking has to be
> >>>>>>>> done on the Nutch side so that it does not return an empty title?
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>
>
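
For anyone finding this thread later, here is a minimal sketch of the
inlink-anchor fallback discussed above. It is not Nutch code and not
necessarily how Google does it. It assumes you already have a text dump in
the "URL Inlinks: / fromUrl: ... anchor: ..." layout shown in Bin's sample
(keyed by the target URL, as in the netguestbook entries). The function
names parse_linkdb_dump and fallback_title are hypothetical, and preferring
the longest anchor is just a heuristic chosen for illustration.

```python
import re
from collections import defaultdict

# One inlink line of the dump, e.g. " fromUrl: http://a/ anchor: Some text"
INLINK_RE = re.compile(r"fromUrl:\s*(\S+)\s+anchor:\s*(.*)$")


def parse_linkdb_dump(text):
    """Map each target URL to a list of (from_url, anchor) pairs.

    Assumes the dump layout from the sample in this thread; it is not an
    official Nutch parser.
    """
    inlinks = defaultdict(list)
    target = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith("Inlinks:"):
            # Header line: "<target URL> Inlinks:"
            target = line[: -len("Inlinks:")].strip()
            continue
        match = INLINK_RE.match(line)
        if match and target is not None:
            inlinks[target].append((match.group(1), match.group(2).strip()))
    return dict(inlinks)


def fallback_title(url, parsed_title, inlinks):
    """Pick a title: the parsed one, else an inlink anchor, else the file name."""
    if parsed_title and parsed_title.strip():
        return parsed_title.strip()
    anchors = [anchor for _, anchor in inlinks.get(url, []) if anchor]
    if anchors:
        # Heuristic (an assumption): the longest anchor is the most descriptive.
        return max(anchors, key=len)
    # Last resort, in the spirit of Sebastian's "file name / URL" option.
    return url.rstrip("/").rsplit("/", 1)[-1]


SAMPLE = """\
http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf Inlinks:
 fromUrl: http://www.srs.fs.usda.gov/econ/data/forestincentives/ anchor: Existing and Potential Incentives for Practicing Sustainable Forestry on Non-industrial Private Forest Lands

http://105558.netguestbook.com/ Inlinks:
 fromUrl: http://105559.netguestbook.com/ anchor:
"""

if __name__ == "__main__":
    links = parse_linkdb_dump(SAMPLE)
    pdf = ("http://www.srs.fs.usda.gov/econ/data/forestincentives/"
           "greene-etal-sofew2006proc.pdf")
    print(fallback_title(pdf, "", links))
```

With the sample above, the PDF has an empty parsed title, so the anchor text
from its parent page is used. A URL whose only inlink has an empty anchor
falls back to the last path segment of the URL.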
