Re: Nutch 2.2.1: PDF issue

A Laxmi Mon, 14 Apr 2014 06:44:59 -0700

Hi Bin -

*I am guessing that, instead of parsing the raw PDF file, Google is
actually taking advantage of other pages within the same domain/site and
using the anchor text as the PDF file title when the PDF properties lack a
title.*

Any ideas on how this behavior can be incorporated in Nutch 2.2.1?
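
To make the question concrete, here is a rough sketch of the anchor-text fallback described in the quoted guess. It is plain Python using only the standard library, not Nutch or Tika API. The names `AnchorCollector` and `choose_title` are hypothetical, and the fallback order (metadata title, then anchor text, then file name) is an assumption; it is not known to be what Google does. The HTML is the snippet from the index page quoted further down.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath


class AnchorCollector(HTMLParser):
    """Collect the anchor text of every <a href> that resolves to target_url."""

    def __init__(self, base_url, target_url):
        super().__init__()
        self.base_url = base_url
        self.target_url = target_url
        self.anchors = []      # anchor texts found for target_url
        self._buffer = None    # text pieces of the <a> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Resolve relative links against the page URL before comparing.
            if href and urljoin(self.base_url, href) == self.target_url:
                self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # Collapse line breaks and repeated spaces inside the anchor.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.anchors.append(text)
            self._buffer = None


def choose_title(metadata_title, anchors, url):
    """Assumed fallback order: metadata title, anchor text, file name."""
    if metadata_title and metadata_title.strip():
        return metadata_title.strip()
    if anchors:
        # Heuristic (an assumption): prefer the longest anchor text.
        return max(anchors, key=len)
    return posixpath.basename(urlparse(url).path)


page_url = "http://www.srs.fs.usda.gov/econ/data/forestincentives/"
pdf_url = page_url + "greene-etal-sofew2006proc.pdf"
html = """<dt><a href="greene-etal-sofew2006proc.pdf">Existing and Potential
Incentives for Practicing Sustainable Forestry on Non-industrial Private
Forest Lands</a> (pdf 294 KB)</dt>"""

collector = AnchorCollector(page_url, pdf_url)
collector.feed(html)
collector.close()

# The PDF has no metadata title, so the anchor text is used instead.
print(choose_title(None, collector.anchors, pdf_url))
```

In a crawler the anchor texts would presumably come from the stored inlinks of the PDF's URL rather than from re-parsing the linking page, but the selection logic would look much the same.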

Thanks for your observations!!


On Sun, Apr 13, 2014 at 11:41 PM, Bin Wang <[email protected]> wrote:

> Here are some observations; I am not sure whether they will be helpful or
> not:
>
> (1) You can see the parsed version of the PDF cached by Google using Google
> Cache:
>
> http://webcache.googleusercontent.com/search?q=cache:FP2qlSjDH1wJ:www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf+&cd=1&hl=en&ct=clnk&gl=us
> When I looked at the source code of the Google Cache version, I could not
> find the complete title anywhere in the page or in the metadata:
>
> For example:
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> <meta name="CreationDate" content="D:20080201131312-06&#39;00&#39;">
> <meta name="Author" content="Pookey">
> <meta name="Creator" content="Acrobat PDFMaker 8.1 for Word">...
> Even the title has been broken into pieces that are scattered throughout
> the Google cached version.
> (2) If you go one level up from the PDF file, you end up on this page (I am
> not sure whether what matters is that it is one level up or that it has a
> link to the PDF file):
> http://www.srs.fs.usda.gov/econ/data/forestincentives/
> You can see the complete title in its source code:
> ...
> <dt><a href="greene-etal-sofew2006proc.pdf">Existing and Potential
> Incentives for Practicing Sustainable Forestry on Non-industrial Private
> Forest Lands</a> (pdf 294 KB)</dt>
>   <dd>John L. Greene, Michael A. Kilgore, Michael G. Jacobson, Steven E.
> Daniels and Thomas J. Straka. Proceedings, Southern Forest Economics
> Workshop (2006)</dd>
> ...
>
> I am guessing that, instead of parsing the raw PDF file, Google is actually
> taking advantage of other pages within the same domain/site and using the
> anchor text as the PDF file title when the PDF properties lack a title.
>
> Thanks!
>
> /usr/bin
>
>
> On Sun, Apr 13, 2014 at 7:56 PM, A Laxmi <[email protected]> wrote:
>
> > Hi Remi & Sebastian:
> >
> > Here is the example:
> >
> >
> > http://www.srs.fs.usda.gov/econ/data/forestincentives/greene-etal-sofew2006proc.pdf
> >
> > When Nutch crawls the above, it doesn't grab the title since there is no
> > title defined in the PDF properties. When the same file is searched for in
> > Google, you can see the title:
> >
> >
> >
> > https://www.google.com/#q=http:%2F%2Fwww.srs.fs.usda.gov%2Fecon%2Fdata%2Fforestincentives%2Fgreene-etal-sofew2006proc.pdf
> >
> > Thanks..
> >
> >
> >
> > On Sun, Apr 13, 2014 at 8:08 PM, remi tassing <[email protected]>
> > wrote:
> >
> > > Hi Laxmi,
> > >
> > > Could you provide some examples?
> > >
> > >
> > > On Mon, Apr 14, 2014 at 2:31 AM, A Laxmi <[email protected]> wrote:
> > >
> > > > Hi Sebastian,
> > > >
> > > > Yes, you are right, there is *no* title defined in the PDF's "info"
> > > > container, and that is when Nutch returns empty titles, whereas Google
> > > > somehow returns the title from the content of the PDF document even if
> > > > there is no title defined in its "info" container, aka PDF
> > > > properties/metadata. Not sure why Tika's behavior has been set that way.
> > > >
> > > >
> > > >
> > > >
> > > > On Sun, Apr 13, 2014 at 7:06 AM, Sebastian Nagel <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > can you provide a concrete example?
> > > > > What does Google show as title?
> > > > > If there is no title defined in PDF's "info" container
> > > > > (aka properties aka meta data) it must be, e.g.,
> > > > > - file name / URL
> > > > > - first heading
> > > > > or something similar.
> > > > >
> > > > > Nutch 2.2.1 is using Tika 1.3 to parse PDFs.
> > > > > If in doubt, you should check the behavior of the current
> > > > > Tika version and possibly ask on the Tika mailing list
> > > > > if you think it's a defect of the PDF parser.
> > > > >
> > > > > Thanks,
> > > > > Sebastian
> > > > >
> > > > >
> > > > > On 04/12/2014 11:20 PM, A Laxmi wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Nutch doesn't seem to grab the title of PDF files when there is
> > > > > > *no title* defined in the PDF properties, whereas Google does. Could
> > > > > > someone explain whether any additional tweaking has to be done on
> > > > > > the Nutch side so that it does not return an empty title?
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> >
>
