Parsechecker tool? Where do I find that?

On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma <[email protected]> wrote:

> > I'm running the latest version of 1.4. We just rebuilt it last week. Is
> > that patch included?
>
> Yes, so you actually have more than one non-zero length titles coming from
> your parser. Please try the parsechecker tool and confirm, but I'm not
> sure it is capable of showing multiple titles.
>
> > And where would it get multiple titles from?
>
> Most likely from PDF or other document types. You can check with a
> stand-alone Tika.
>
> > How do I tell what the titles
> > are so I can see if they're valid or not?
> >
> > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma <[email protected]> wrote:
> > > This should work around the problem in most cases. The parser can
> > > output two titles of which one is actually empty. This patch (in 1.4)
> > > skips empty titles.
> > >
> > > If this doesn't work you really have two _valid_ titles coming from
> > > your document.
> > >
> > > https://issues.apache.org/jira/browse/NUTCH-1004
> > >
> > > > It looks like the issue I'm encountering is the same one as here.
> > > >
> > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > >
> > > > I'm not really sure what the linked bug is since that involves the
> > > > HTML parser and I'm seeing this problem with a PDF file.
> > > >
> > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen <[email protected]> wrote:
> > > > > I'm getting an exception when I try to commit to Solr. Looking at
> > > > > the Solr log, it's showing that title is getting multiple values
> > > > > when it's not a multivalue field. None of my code does anything
> > > > > with the title, so I'm not sure why this is happening.
> > > > >
> > > > > How can I look at the pending commit and determine why and/or
> > > > > delete the extraneous values? The document in question is a pdf
> > > > > if that makes a difference.

