> I ran the nutch readdb and dumped it to a text file, I found the entry
> for one of them:
> 
> http://url/Alpha.docx   Version: 7
> Status: 3 (db_gone)
> Fetch time: Thu Feb 02 18:42:19 GMT 2012
> Modified time: Thu Jan 01 00:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 0.21058823
> Signature: null
> Metadata: _pst_: gone(11), lastModified=0: http://url/Alpha.docx
> 
> I guess the problem is that it is "gone", but I really don't know why
> -- the file does exist, and Nutch seems able to find/parse it when I
> run the checkers.

Hmm, the db_gone status prevents it from being indexed, of course. It is 
perfectly possible for the checkers to pass while the fetcher fails. There may 
have been an error; I remember you were using a proxy earlier, and that's 
likely the problem here too, since the checkers don't use proxy configurations.

Check the logs to make sure.
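For example, something along these lines. The log path is the usual location under a local Nutch runtime, and the sample lines below are invented purely to demonstrate the grep; run it against your real logs/hadoop.log instead:

```shell
# Stand-in for logs/hadoop.log; these lines are invented for
# illustration only, not real Nutch output.
cat > /tmp/hadoop.log <<'EOF'
fetching http://url/Alpha.docx
fetch of http://url/Alpha.docx failed: gone(11)
fetching http://url/Beta.docx
EOF

# Pull every line mentioning the document, with one line of
# trailing context to catch the error reported right after it:
grep -A 1 "Alpha.docx" /tmp/hadoop.log
```

A proxy or HTTP failure on the fetch will usually show up within a line or two of the URL.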

> Wouldn't the URL filter block it at that level?  In any
> case, it doesn't match on anything that has a - in the
> regex-urlfilter.xml file, so I don't think it is being filtered out
> there.

That's good. But remember: to pass, a URL _must_ match a regex prefixed by a +. 
This, however, is not your problem, because in that case it wouldn't have ended 
up in the CrawlDB at all.
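For reference, the stock filter file ends with rules along these lines (first matching rule wins, so order matters):

```
# skip URLs containing certain characters, as these are probable queries
-[?*!@=]

# accept anything else
+.
```

If your URL reaches the final `+.` without hitting a `-` rule first, it passes the filter.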


> Is there another thing that I could look at?
> 
> The only thing that dumps out errors is the hadoop logs, and there is
> a lot going on there...is there anything in particular that I should
> look for near where it crawls that file?
> I don't see anything
> error-related near it or the other missing files.

Check the fetcher output thoroughly. Grep around. You should find it.
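Since you already dumped the CrawlDB to a text file, you can grep that too; the _pst_ metadata field usually names the cause. A sketch, using the exact entry you quoted above as a stand-in (point the greps at your actual dump file):

```shell
# Stand-in dump using the entry quoted earlier in this thread;
# replace /tmp/crawldb.txt with your real readdb dump file.
cat > /tmp/crawldb.txt <<'EOF'
http://url/Alpha.docx   Version: 7
Status: 3 (db_gone)
Metadata: _pst_: gone(11), lastModified=0: http://url/Alpha.docx
EOF

# List every db_gone entry together with the URL above it:
grep -B 1 "db_gone" /tmp/crawldb.txt

# The protocol-status metadata tells you why the fetcher gave up:
grep "_pst_" /tmp/crawldb.txt
```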

> -- Chris
> 
> On Mon, Dec 19, 2011 at 2:00 PM, Markus Jelsma
> <[email protected]> wrote:
> > Check if it is in your CrawlDB at all. Debug further from that point on.
> > If it is not, then why? Perhaps some URL filter? If it is, did it get an
> > error?
> > 
> >> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
> >> a document isn't getting added to my Solr Index.
> >> 
> >> I can use the parsechecker and indexchecker to verify the link to the
> >> docx file, and they both can get to it and parse it just fine.  But
> >> when I use the crawl command, it doesn't appear.  What config file
> >> should I be checking?  Do those tools use the same settings, or is
> >> there something different about the way they operate?
> >> 
> >> Any help would be appreciated!
> >> 
> >> -- Chris
