The output is indeed a bit misleading. The file has two valid titles, and both are being extracted: Tika sees both the document title and the filename as titles.
You can spot this behaviour better using the indexchecker tool. Please consult the Tika wiki, docs or mailing list on how to proceed. Either that or make your Solr schema field for title multiValued and deal with it appropriately in your search front-end.

Cheers

On Wednesday 02 November 2011 15:02:11 Bai Shen wrote:
> Found it right after I asked. :) BTW, the command is wrong on the wiki. I
> need to get around to making an account so I can fix things.
>
> I ran it on the pdf url and it only gives me one title. But it's pretty
> long. Could that be the problem?
>
> The url is
> http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf if you want
> to check yourself.
>
> On Wed, Nov 2, 2011 at 9:18 AM, Markus Jelsma <[email protected]> wrote:
> > bin/nutch parsechecker <url>
> >
> > see also:
> > http://wiki.apache.org/nutch/CommandLineOptions
> >
> > On Wednesday 02 November 2011 14:16:10 Bai Shen wrote:
> > > Parsechecker tool? Where do I find that?
> > >
> > > On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma <[email protected]> wrote:
> > > > > I'm running the latest version of 1.4 We just rebuilt it last
> > > > > week. Is that patch included?
> > > >
> > > > Yes, so you actually have more than one non-zero length titles coming
> > > > from your parser. Please try the parsechecker tool and confirm, but
> > > > i'm not sure it is capable of showing multiple titles.
> > > >
> > > > > And where would it get multiple titles from?
> > > >
> > > > Most likely from PDF or other document types. You can check with a
> > > > stand-alone Tika.
> > > >
> > > > > How do I tell what the titles
> > > > > are so I can see if they're valid or not?
> > > > >
> > > > > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma <[email protected]> wrote:
> > > > > > This should work around the problem in most cases. The parser can
> > > > > > output two titles of which one is actually empty. This patch
> > > > > > (in 1.4) skips empty titles.
> > > > > >
> > > > > > If this doesn't work you really have two _valid_ titles coming
> > > > > > from your document.
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/NUTCH-1004
> > > > > >
> > > > > > > It looks like the issue I'm encountering is the same one as
> > > > > > > here.
> > > > > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > > > > >
> > > > > > > I'm not really sure what the linked bug is since that involves
> > > > > > > the HTML parser and I'm seeing this problem with a PDF file.
> > > > > > >
> > > > > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen <[email protected]> wrote:
> > > > > > > > I'm getting an exception when I try to commit to Solr.
> > > > > > > > Looking at the Solr log, it's showing that title is getting
> > > > > > > > multiple values when it's not a multivalue field. None of
> > > > > > > > my code does anything with the title, so I'm not sure why
> > > > > > > > this is happening.
> > > > > > > >
> > > > > > > > How can I look at the pending commit and determine why and/or
> > > > > > > > delete the extraneous values? The document in question is a
> > > > > > > > pdf if that makes a difference.
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
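
One way to "deal with it appropriately in your search front-end", as the reply above suggests, could look like the sketch below. It is only an illustration in Python and not part of Nutch, Tika or Solr. The function names pick_title and unique_titles and the sample values are hypothetical, and it assumes the front-end receives the title as a list of strings once the Solr field is multiValued. Skipping empty values mirrors what the thread says the NUTCH-1004 patch does on the indexing side.

```python
from typing import Iterable, List, Optional


def pick_title(values: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first non-empty title, or None if every value is empty."""
    for value in values:
        stripped = (value or "").strip()
        if stripped:
            return stripped
    return None


def unique_titles(values: Iterable[Optional[str]]) -> List[str]:
    """Drop empty titles and exact duplicates, keeping the original order."""
    seen = set()
    result = []
    for value in values:
        stripped = (value or "").strip()
        if stripped and stripped not in seen:
            seen.add(stripped)
            result.append(stripped)
    return result


if __name__ == "__main__":
    # Hypothetical values: a document title plus a filename reported as a title.
    titles = ["Some Document Title", "", "document.pdf"]
    print(pick_title(titles))
    print(unique_titles(titles))
```

Which value to show is a front-end decision. Taking the first non-empty one is the simplest choice, but it depends on the order in which the values were indexed, which this sketch does not control.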

