bin/nutch parsechecker <url>

See also: http://wiki.apache.org/nutch/CommandLineOptions
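
To make the title question in the quoted thread concrete, here is a minimal sketch of the check being discussed: drop empty titles, then see whether more than one non-empty title remains. This is not Nutch's or Tika's actual code; the function names and the sample titles are hypothetical, and the list of titles stands in for whatever your parser reports for the document.

```python
def non_empty_titles(titles):
    """Return the titles that still contain text after trimming whitespace."""
    return [t.strip() for t in titles if t is not None and t.strip()]


def classify_titles(titles):
    """Classify parser output as 'none', 'single', or 'multiple' valid titles."""
    valid = non_empty_titles(titles)
    if not valid:
        return "none", valid
    if len(valid) == 1:
        return "single", valid
    return "multiple", valid


if __name__ == "__main__":
    # Hypothetical parser output: one real title plus an empty one.
    # Skipping the empty value leaves a single title.
    print(classify_titles(["Annual Report", ""]))
    # Hypothetical parser output: two genuinely different titles.
    # Skipping empty values does not help here; both are valid.
    print(classify_titles(["Annual Report", "Untitled-1"]))
```

If the second case matches your document, the field really does receive two valid titles, so the duplicate has to be resolved before the document reaches a single-valued Solr field.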

On Wednesday 02 November 2011 14:16:10 Bai Shen wrote:
> Parsechecker tool? Where do I find that?
>
> On Tue, Nov 1, 2011 at 4:56 PM, Markus Jelsma <[email protected]> wrote:
> > > I'm running the latest version of 1.4 We just rebuilt it last week.
> > > Is that patch included?
> >
> > Yes, so you actually have more than one non-zero length titles coming
> > from your parser. Please try the parsechecker tool and confirm, but i'm
> > not sure it is capable of showing multiple titles.
> >
> > > And where would it get multiple titles from?
> >
> > Most likely from PDF or other document types. You can check with a
> > stand-alone Tika.
> >
> > > How do I tell what the titles
> > > are so I can see if they're valid or not?
> > >
> > > On Tue, Nov 1, 2011 at 4:33 PM, Markus Jelsma <[email protected]> wrote:
> > > > This should work around the problem in most cases. The parser can
> > > > output two titles of which one is actually empty. This patch (in
> > > > 1.4) skips empty titles.
> > > >
> > > > If this doesn't work you really have two _valid_ titles coming from
> > > > your document.
> > > >
> > > > https://issues.apache.org/jira/browse/NUTCH-1004
> > > >
> > > > > It looks like the issue I'm encountering is the same one as here.
> > > > > http://lucene.472066.n3.nabble.com/multiple-values-encountered-for-non-multiValued-field-title-td1446817.html
> > > > >
> > > > > I'm not really sure what the linked bug is since that involves the
> > > > > HTML parser and I'm seeing this problem with a PDF file.
> > > > >
> > > > > On Tue, Nov 1, 2011 at 3:41 PM, Bai Shen <[email protected]> wrote:
> > > > > > I'm getting an exception when I try to commit to Solr. Looking
> > > > > > at the Solr log, it's showing that title is getting multiple
> > > > > > values when it's not a multivalue field. None of my code does
> > > > > > anything with the title, so I'm not sure why this is happening.
> > > > > >
> > > > > > How can I look at the pending commit and determine why and/or
> > > > > > delete the extraneous values? The document in question is a pdf
> > > > > > if that makes a difference.

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

