Ah you're right. There's an issue for this. You're welcome to submit a patch:
https://issues.apache.org/jira/browse/NUTCH-1140

I'll mark it for 1.5, seems it isn't yet.

> Actually, it turns out it's a Nutch issue. Tika outputs the correct title
> for the pdf. However, the indexer-more plugin is adding in the filename
> due to the HTTP header "Content-Disposition".
>
> Is there a way to turn this off while keeping the other functionality of
> the plugin? I'd prefer not to have a bunch of tweaks in the Nutch code.
>
> On Wed, Nov 2, 2011 at 10:11 AM, Markus Jelsma
> <[email protected]> wrote:
> > The output is a bit misleading indeed. The file has two valid titles
> > and two are being extracted. The title and the filename are both seen
> > as titles by Tika.
> >
> > You can spot this behaviour better using the indexchecker tool.
> >
> > Please consult the Tika wiki, docs or mailing list on how to proceed.
> > Either that or make your Solr schema field for title multiValued and
> > deal with it appropriately in your search front-end.
> >
> > Cheers
> >
> > On Wednesday 02 November 2011 15:02:11 Bai Shen wrote:
> > > Found it right after I asked. :) BTW, the command is wrong on the
> > > wiki. I need to get around to making an account so I can fix things.
> > >
> > > I ran it on the pdf url and it only gives me one title. But it's
> > > pretty long. Could that be the problem?
> > >
> > > The url is
> > > http://www.sipri.org/yearbook/2011/files/SIPRIYB11summaryNL.pdf
> > > if you want to check yourself.
> > >
> > > On Wed, Nov 2, 2011 at 9:18 AM, Markus Jelsma
> > > <[email protected]> wrote:
> > > > bin/nutch parsechecker <url>
> > > >
> > > > see also:
> > > > http://wiki.apache.org/nutch/CommandLineOptions
> > > >
> > > > On Wednesday 02 November 2011 14:16:10 Bai Shen wrote:
> > > > > Parsechecker tool? Where do I find that?
> > > > >
> > > > > On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma
> > > > > <[email protected]> wrote:
> > > > > > > I'm running the latest version of 1.4. We just rebuilt it
> > > > > > > last week. Is that patch included?
> > > > > >
> > > > > > Yes, so you actually have more than one non-zero-length title
> > > > > > coming from your parser. Please try the parsechecker tool and
> > > > > > confirm, but I'm not sure it is capable of showing multiple
> > > > > > titles.
> > > > > >
> > > > > > > And where would it get multiple titles from?
> > > > > >
> > > > > > Most likely from PDF or other document types. You can check
> > > > > > with a stand-alone Tika.
> > > > > >
> > > > > > > How do I tell what the titles are so I can see if they're
> > > > > > > valid or not?
> > > > > > >
> > > > > > > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma
> > > > > > > <[email protected]> wrote:
> > > > > > > > This should work around the problem in most cases. The
> > > > > > > > parser can output two titles of which one is actually
> > > > > > > > empty. This patch (in 1.4) skips empty titles.
> > > > > > > >
> > > > > > > > If this doesn't work you really have two _valid_ titles
> > > > > > > > coming from your document.
> > > > > > > >
> > > > > > > > https://issues.apache.org/jira/browse/NUTCH-1004
> > > > > > > >
> > > > > > > > > It looks like the issue I'm encountering is the same one
> > > > > > > > > as here.
> > > > > > > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > > > > > > >
> > > > > > > > > I'm not really sure what the linked bug is since that
> > > > > > > > > involves the HTML parser and I'm seeing this problem
> > > > > > > > > with a PDF file.
> > > > > > > > >
> > > > > > > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen
> > > > > > > > > <[email protected]> wrote:
> > > > > > > > > > I'm getting an exception when I try to commit to Solr.
> > > > > > > > > > Looking at the Solr log, it's showing that title is
> > > > > > > > > > getting multiple values when it's not a multivalue
> > > > > > > > > > field. None of my code does anything with the title, so
> > > > > > > > > > I'm not sure why this is happening.
> > > > > > > > > >
> > > > > > > > > > How can I look at the pending commit and determine why
> > > > > > > > > > and/or delete the extraneous values? The document in
> > > > > > > > > > question is a pdf if that makes a difference.
> > > >
> > > > --
> > > > Markus Jelsma - CTO - Openindex
> > > > http://www.linkedin.com/in/markus17
> > > > 050-8536620 / 06-50258350
> >
> > --
> > Markus Jelsma - CTO - Openindex
> > http://www.linkedin.com/in/markus17
> > 050-8536620 / 06-50258350
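
For anyone who ends up dealing with the extra title outside of Nutch (for example in a client that feeds or reads the index), here is a minimal sketch in Python of collapsing several candidate titles into a single value for a non-multiValued field. This is not Nutch, Tika or Solr code: the helper name `pick_title`, the rule "keep the first non-empty value", and the sample strings are all assumptions made purely for illustration.

```python
from typing import Iterable, Optional


def pick_title(candidates: Iterable[Optional[str]]) -> Optional[str]:
    """Return the first non-empty title, or None if there is none.

    Hypothetical helper (not part of Nutch/Tika/Solr): it skips missing
    or blank values and keeps exactly one title, so the result fits a
    single-valued field.
    """
    for candidate in candidates:
        if candidate is None:
            continue
        title = candidate.strip()
        if title:
            return title
    return None


# Made-up example values: an empty title, a document title, and a
# filename such as one taken from a Content-Disposition header.
titles = ["", "Some document title", "report.pdf"]
print(pick_title(titles))
```

Whether the first non-empty value is the right one to keep depends on the order in which the titles are emitted, so that order would need to be checked (e.g. with the indexchecker tool) before relying on a rule like this.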

