Gotcha.  Thanks.

On Wed, Nov 2, 2011 at 10:11 AM, Markus Jelsma wrote:

> The output is a bit misleading indeed. The file has two valid titles and two
> are being extracted. The title and the filename are both seen as titles by
> Tika.
>
> You can spot this behaviour better using the indexchecker tool.
>
> Please consult the Tika wiki, docs or mailing list on how to proceed. Either
> that or make your Solr schema field for title multiValued and deal with it
> appropriately in your search front-end.
>
> Cheers
>
>
> On Wednesday 02 November 2011 15:02:11 Bai Shen wrote:
> > Found it right after I asked. :)  BTW, the command is wrong on the wiki. I
> > need to get around to making an account so I can fix things.
> >
> > I ran it on the pdf url and it only gives me one title.  But it's pretty
> > long.  Could that be the problem?
> >
> > The url is
> > http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf if you want
> > to check yourself.
> >
> > On Wed, Nov 2, 2011 at 9:18 AM, Markus Jelsma wrote:
> > > bin/nutch parsechecker <url>
> > >
> > > see also:
> > > http://wiki.apache.org/nutch/CommandLineOptions
> > >
> > > On Wednesday 02 November 2011 14:16:10 Bai Shen wrote:
> > > > Parsechecker tool?  Where do I find that?
> > > >
> > > > On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma wrote:
> > > > > > I'm running the latest version of 1.4  We just rebuilt it last
> > > > > > week. Is that patch included?
> > > > >
> > > > > Yes, so you actually have more than one non-zero length title coming
> > > > > from your parser. Please try the parsechecker tool and confirm, but
> > > > > I'm not sure it is capable of showing multiple titles.
> > > > >
> > > > > > And where would it get multiple titles from?
> > > > >
> > > > > Most likely from PDF or other document types. You can check with a
> > > > > stand-alone Tika.
> > > > >
> > > > > > How do I tell what the titles
> > > > > > are so I can see if they're valid or not?
> > > > > >
> > > > > > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma wrote:
> > > > > > > This should work around the problem in most cases. The parser can
> > > > > > > output two titles of which one is actually empty. This patch (in
> > > > > > > 1.4) skips empty titles.
> > > > > > >
> > > > > > > If this doesn't work you really have two _valid_ titles coming
> > > > > > > from your document.
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/NUTCH-1004
> > > > > > >
> > > > > > > > It looks like the issue I'm encountering is the same one as
> > > > > > > > here.
> > > > > > > >
> > > > > > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > > > > > >
> > > > > > > > I'm not really sure what the linked bug is since that involves
> > > > > > > > the HTML parser and I'm seeing this problem with a PDF file.
> > > > > > > >
> > > > > > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen wrote:
> > > > > > > > > I'm getting an exception when I try to commit to Solr.
> > > > > > > > > Looking at the Solr log, it's showing that title is getting
> > > > > > > > > multiple values when it's not a multivalue field.  None of
> > > > > > > > > my code does anything with the title, so I'm not sure why
> > > > > > > > > this is happening.
> > > > > > > > >
> > > > > > > > > How can I look at the pending commit and determine why and/or
> > > > > > > > > delete the extraneous values?  The document in question is a
> > > > > > > > > pdf if that makes a difference.
> > >
> > > --
> > > Markus Jelsma - CTO - Openindex
> > > http://www.linkedin.com/in/markus17
> > > 050-8536620 / 06-50258350
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>
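
On the multiValued route suggested above: once title is multiValued, the
front end has to choose which value to show. Below is a minimal Python sketch
of one way to do that, assuming the titles arrive as a plain list of strings.
The function name pick_display_title and the sample values are hypothetical;
none of this is part of Nutch, Tika or Solr.

```python
from typing import Iterable, Optional


def pick_display_title(titles: Iterable[Optional[str]]) -> str:
    """Return the first non-empty title, or an empty string if there is none."""
    for title in titles:
        # Skip missing values and titles that are empty or whitespace only.
        if title and title.strip():
            return title.strip()
    return ""


if __name__ == "__main__":
    # Hypothetical values: an empty title, a document title, and a file name
    # that the parser also reported as a title.
    doc_titles = ["", "Example document title", "SIPRIYB11summaryNL.pdf"]
    print(pick_display_title(doc_titles))
```

Taking the first non-empty value is only one possible policy. Preferring the
longest value, or dropping values that look like file names, may suit some
collections better.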
