Found it right after I asked. :)  BTW, the command is wrong on the wiki.  I
need to get around to making an account so I can fix things.

I ran it on the pdf url and it only gives me one title.  But it's pretty
long.  Could that be the problem?

The url is http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf if
you want to check yourself.
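
For anyone following along, here is a minimal sketch of the behaviour described further down in the thread for the NUTCH-1004 patch (skip empty titles, and still fail if two real titles remain). It is written in Python purely for illustration and is not the actual Nutch code; `pick_title` is a hypothetical helper name.

```python
def pick_title(titles):
    """Return the one title to index, ignoring empty values.

    Hypothetical helper: it mimics the behaviour described for the
    NUTCH-1004 patch and is not taken from Nutch itself.
    """
    # Drop None, empty, and whitespace-only titles.
    non_empty = [t.strip() for t in titles if t and t.strip()]
    if len(non_empty) > 1:
        # Two or more real titles: a single-valued field would still
        # reject this document, so surface the values for inspection.
        raise ValueError(f"multiple non-empty titles: {non_empty}")
    return non_empty[0] if non_empty else None
```

With this, an empty second title is harmless, while two genuine titles still raise an error that lists both values.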

On Wed, Nov 2, 2011 at 9:18 AM, Markus Jelsma <[email protected]> wrote:

> bin/nutch parsechecker <url>
>
> see also:
> http://wiki.apache.org/nutch/CommandLineOptions
>
> On Wednesday 02 November 2011 14:16:10 Bai Shen wrote:
> > Parsechecker tool?  Where do I find that?
> >
> > On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma <[email protected]> wrote:
> > > > I'm running the latest version of 1.4  We just rebuilt it last week.
> > > > Is that patch included?
> > >
> > > Yes, so you actually have more than one non-zero-length title coming
> > > from your parser. Please try the parsechecker tool and confirm, but I'm
> > > not sure it is capable of showing multiple titles.
> > >
> > > > And where would it get multiple titles from?
> > >
> > > Most likely from PDF or other document types. You can check with a
> > > stand-alone Tika.
> > >
> > > > How do I tell what the titles are so I can see if they're valid or
> > > > not?
> > > >
> > > > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma <[email protected]> wrote:
> > > > > > This should work around the problem in most cases. The parser can
> > > > > > output two titles of which one is actually empty. This patch (in
> > > > > > 1.4) skips empty titles.
> > > > > >
> > > > > > If this doesn't work you really have two _valid_ titles coming from
> > > > > > your document.
> > > > >
> > > > > https://issues.apache.org/jira/browse/NUTCH-1004
> > > > >
> > > > > > > It looks like the issue I'm encountering is the same one as here.
> > > > > > >
> > > > > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > > > >
> > > > > > > I'm not really sure what the linked bug is since that involves the
> > > > > > > HTML parser and I'm seeing this problem with a PDF file.
> > > > > >
> > > > > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen <[email protected]> wrote:
> > > > > > > I'm getting an exception when I try to commit to Solr.  Looking
> > > > > > > at the Solr log, it's showing that title is getting multiple
> > > > > > > values when it's not a multivalue field.  None of my code does
> > > > > > > anything with the title, so I'm not sure why this is happening.
> > > > > > >
> > > > > > > > How can I look at the pending commit and determine why and/or
> > > > > > > > delete the extraneous values?  The document in question is a pdf
> > > > > > > > if that makes a difference.
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
