Re: Nutch 2.2.1: PDF issue

Bin Wang Mon, 14 Apr 2014 18:29:24 -0700

Hi Laxmi,

Using Google's "link:" operator, which is a rough window into Google's own
equivalent of the crawldb / linkdb, you can see that only one page links
directly to that paper, and that page is its parent page:
https://www.google.com/#q=link:+http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf


Per your question:

"Any ideas on how this behavior can be incorporated in Nutch 2.2.1?"
FYI, I am also an entry-level Nutcher, but I bet the linkdb might contain
the information that you need.
Quoted from the Nutch wiki:

   1. The link database, or linkdb. This contains the list of known links
   to each URL, including both the source URL and anchor text of the link.

By the way, you can view the linkdb content with the command "bin/nutch
readlinkdb crawl/linkdb/ -dump tmp", which writes a text dump into the tmp
directory with content like the following:

http://105558.netguestbook.com/ Inlinks:
 fromUrl: http://105559.netguestbook.com/ anchor:

http://105559.netguestbook.com/ Inlinks:
 fromUrl: http://www.st-georg.ch/ anchor: Gästebuch

http://105560.netguestbook.com/ Inlinks:
 fromUrl: http://105559.netguestbook.com/ anchor:

http://105985.forums.motigo.com/        Inlinks:
 fromUrl: http://www.vkgf-info.de/ anchor: Forum

http://1254.virgilio.it/        Inlinks:
 fromUrl: http://www.virgilio.it/ anchor: 1254
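
A dump in this shape is easy to post-process. Below is a minimal Python
sketch, assuming the text layout shown above (a URL followed by "Inlinks:",
then indented "fromUrl: ... anchor: ..." lines). The function name
parse_linkdb_dump is my own and not part of Nutch, and the real dump may
differ between versions.

```python
import re
from collections import defaultdict

# Assumed layout, based on the sample dump above:
#   <url><whitespace>Inlinks:
#    fromUrl: <source url> anchor: <anchor text, may be empty>
HEADER = re.compile(r"^(\S+)\s+Inlinks:\s*$")
INLINK = re.compile(r"^\s*fromUrl:\s+(\S+)\s+anchor:\s*(.*)$")


def parse_linkdb_dump(text):
    """Return {target_url: [(from_url, anchor_text), ...]}."""
    inlinks = defaultdict(list)
    current = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            # A new target URL starts; following inlink lines belong to it.
            current = header.group(1)
            continue
        link = INLINK.match(line)
        if link and current is not None:
            inlinks[current].append((link.group(1), link.group(2).strip()))
    return dict(inlinks)


sample = """\
http://105559.netguestbook.com/ Inlinks:
 fromUrl: http://www.st-georg.ch/ anchor: Gästebuch

http://105560.netguestbook.com/ Inlinks:
 fromUrl: http://105559.netguestbook.com/ anchor:
"""

for url, links in parse_linkdb_dump(sample).items():
    print(url, links)
```

Each target URL maps to the pages that link to it together with the anchor
text they use, which is the piece of information that matters here.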


If you have successfully crawled that site with Nutch, your linkdb should
contain an entry that looks like this:

http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf Inlinks:
 fromUrl: http://www.srs.fs.usda.gov/econ/data/forestincentives/ anchor: Existing and Potential Incentives for Practicing Sustainable Forestry on Non-industrial Private Forest Lands
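
To make the idea concrete, here is a small Python sketch of the fallback
itself: if the parsed title is empty, use an anchor text from the inlinks.
This is not how Nutch or Google actually does it. Choosing the most frequent
non-empty anchor (ties broken by longer text) is just one heuristic I am
assuming, and pick_title is a made-up name.

```python
from collections import Counter


def pick_title(parsed_title, inlinks):
    """Return parsed_title, or fall back to an inlink anchor text.

    inlinks is a list of (from_url, anchor_text) pairs for one page.
    Heuristic (an assumption, not Nutch behavior): prefer the most
    frequent non-empty anchor, breaking ties by longer text.
    """
    if parsed_title and parsed_title.strip():
        return parsed_title.strip()
    anchors = Counter(a.strip() for _, a in inlinks if a and a.strip())
    if not anchors:
        return ""
    return max(anchors, key=lambda a: (anchors[a], len(a)))


inlinks = [
    ("http://www.srs.fs.usda.gov/econ/data/forestincentives/",
     "Existing and Potential Incentives for Practicing Sustainable "
     "Forestry on Non-industrial Private Forest Lands"),
]
print(pick_title("", inlinks))
```

With a single inlink the choice is trivial, but on a larger crawl generic
anchors such as "pdf" or "download" would probably need to be filtered out
before picking one.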

After you invert the links, you can include the linkdb in indexing by passing
the -linkdb parameter when you index with Solr.

Tom White's book "Hadoop: The Definitive Guide" also says this about the
linkdb: "However, most algorithms for calculating a page’s importance
(or quality) need the opposite information, that is, what pages contain
outlinks that point to the current page. This information is not readily
available when crawling. Also, the indexing process benefits from taking
into account the anchor text on inlinks so that this text may semantically
enrich the text of the current page.", which makes me more confident that
the anchor text is the secret...

Anyway, I think the discussion has drifted slightly from Nutch to Solr... If
you need more information about how to query the linkdb data in Solr, you
could ask in the Solr community, maybe? :)

/usr/bin



On Mon, Apr 14, 2014 at 7:43 AM, A Laxmi <[email protected]> wrote:

> Hi Bin -
>
> *I am guessing maybe instead of parsing the raw pdf file, Google is
> actually taking advantage of other pages within the same domain/site and
> use the anchor text as the PDF file title if the PDF property is missing
> title*
>
> Any ideas on how this behavior can be incorporated in Nutch 2.2.1?
>
> Thanks for your observations!!
>
>
> On Sun, Apr 13, 2014 at 11:41 PM, Bin Wang <[email protected]> wrote:
>
> > Here are some observations that I noticed, not sure if will be helpful or
> > not:
> >
> > (1) You can see the version of parsed PDF cached by Google using Google
> > Cache:
> >
> > http://webcache.googleusercontent.com/search?q=cache:FP2qlSjDH1wJ:www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf+&cd=1&hl=en&ct=clnk&gl=us
> > When I looked into the source code of Google Cache version, I can not even
> > see the complete title name anywhere in the page nor the meta data:
> >
> > For example:
> > <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta
> > name="CreationDate" content="D:20080201131312-06&#39;00&#39;"><meta
> > name="Author" content="Pookey"><meta name="Creator" content="Acrobat
> > PDFMaker 8.1 for Word">...
> > Even the title has been broken into pieces that scattered all around google
> > cached version.
> > (2) If you go one level up the PDF file, you will end up in this page(I am
> > not sure it is just simple one level up or it is actually because it has a
> > link to the pdf file):
> > http://www.srs.fs.usda.gov/econ/data/forestincentives/
> > You can see the title that perfectly lying in the source code:
> > ...
> > <dt><a href="greene-etal-sofew2006proc.pdf">Existing and Potential
> > Incentives for Practicing Sustainable Forestry on Non-industrial Private
> > Forest Lands</a> (pdf 294 KB)</dt>
> >   <dd>John L. Greene, Michael A. Kilgore, Michael G. Jacobson, Steven E.
> > Daniels and Thomas J. Straka. Proceedings, Southern Forest Economics
> > Workshop (2006)</dd>
> > ...
> >
> > I am guessing maybe instead of parsing the raw pdf file, Google is actually
> > taking advantage of other pages within the same domain/site and use the
> > anchor text as the PDF file title if the PDF property is missing title.
> >
> > Thanks!
> >
> > /usr/bin
> >
> >
> > On Sun, Apr 13, 2014 at 7:56 PM, A Laxmi <[email protected]> wrote:
> >
> > > Hi Remi & Sebastian:
> > >
> > > Here is the example:
> > >
> > > http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
> > >
> > > When Nutch crawls the above, it doesn't grab the title since there is no
> > > title defined in the pdf properties. When the same file was searched in
> > > Google, you can see the title -
> > >
> > > https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
> > >
> > > Thanks..
> > >
> > >
> > >
> > > On Sun, Apr 13, 2014 at 8:08 PM, remi tassing <[email protected]>
> > > wrote:
> > >
> > > > Hi Laxmi,
> > > >
> > > > Could you provide some examples?
> > > >
> > > >
> > > > On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Sebastian,
> > > > >
> > > > > Yes, you are right, there is *no* title defined in the PDF's "info"
> > > > > container and that is when Nutch is returning empty titles where as
> > > > > Google somehow returns the title from the content of the PDF document
> > > > > even if there is no title defined in its "info" container aka PDF
> > > > > properties/metadata. Not sure why Tika's behavior has been set that
> > > > > way.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]>
> > > > > wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > can you provide a concrete example?
> > > > > > What does Google show as title?
> > > > > > If there is no title defined in PDF's "info" container
> > > > > > (aka properties aka meta data) it must be, e.g.,
> > > > > > - file name / URL
> > > > > > - first heading
> > > > > > or something similar.
> > > > > >
> > > > > > Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
> > > > > > In doubt, you should check the behavior of the current
> > > > > > Tika version and ev. ask on the Tika mailing list
> > > > > > if you thinks it's a defect of the PDF parser.
> > > > > >
> > > > > > Thanks,
> > > > > > Sebastian
> > > > > >
> > > > > >
> > > > > > On 04/12/2014 11:20 PM, A Laxmi wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > Nutch doesn't seem to grab the title of PDF files when there is
> > > > > > > *no title* defined in PDF properties where as Google does. Could
> > > > > > > someone explain if any additional tweaking has to be done from
> > > > > > > Nutch side so it does not return empty title?
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
