> I ran the nutch readdb and dumped it to a text file, I found the entry
> for one of them:
>
> http://url/Alpha.docx   Version: 7
> Status: 3 (db_gone)
> Fetch time: Thu Feb 02 18:42:19 GMT 2012
> Modified time: Thu Jan 01 00:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 3888000 seconds (45 days)
> Score: 0.21058823
> Signature: null
> Metadata: _pst_: gone(11), lastModified=0: http://url/Alpha.docx
>
> I guess the problem is that it is "gone" but I really don't know why
> -- the file does exist and nutch can seem to find/parse it with the
> checker runs.
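For reference, a dump like the one above can be produced with the `readdb` tool. A minimal sketch, assuming a Nutch 1.x install run from its home directory and a crawl directory named `crawl` (both paths are examples, not from this thread):

```shell
# Overall status counts in the CrawlDB (db_gone, db_fetched, ...)
bin/nutch readdb crawl/crawldb -stats

# Dump the CrawlDB to plain text, then look up the entry for one URL
bin/nutch readdb crawl/crawldb -dump crawldb-dump
grep -A 10 'http://url/Alpha.docx' crawldb-dump/part-00000
```

The `-A 10` prints the status and metadata lines that follow the URL in the dump.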
Hmm, the status db_gone prevents it from being indexed, of course. It is
perfectly possible for the checkers to pass but for the fetcher to fail.
There may have been an error, and I remember you using a proxy earlier;
that's likely the problem here too. The checkers don't use proxy
configurations. Check the logs to make sure.

> Wouldn't the URL filter block it at that level? In any
> case, it doesn't match on anything that has a - in the
> regex-urlfilter.xml file, so I don't think it is being filtered out
> there.

That's good. But remember, to pass it _must_ match a regex prefixed by a +.
This, however, is not your problem, because in that case it wouldn't have
ended up in the CrawlDB at all.

> Is there another thing that I could look at?
>
> The only thing that dumps out errors is the hadoop logs, and there is
> a lot going on there...is there anything in particular that I should
> look for near where it crawls that file? I don't see anything
> error-related near it or the other missing files.

Check the fetcher output thoroughly. Grep around. You should find it.

> -- Chris
>
> On Mon, Dec 19, 2011 at 2:00 PM, Markus Jelsma
> <[email protected]> wrote:
> > Check if it is in your CrawlDB at all. Debug further from that point on.
> > If it is not, then why? Perhaps some URL filter? If it is, did it get an
> > error?
> >
> >> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
> >> a document isn't getting added to my Solr Index.
> >>
> >> I can use the parsechecker and indexchecker to verify the link to the
> >> docx file, and they both can get to it and parse it just fine. But
> >> when I use the crawl command, it doesn't appear. What config file
> >> should I be checking? Do those tools use the same settings, or is
> >> there something different about the way they operate?
> >>
> >> Any help would be appreciated!
> >>
> >> -- Chris
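To make the log grep concrete, here is a minimal sketch. The log lines below are synthetic stand-ins, just to show the pattern; a real local run writes to `logs/hadoop.log`, and the exact messages depend on your Nutch version:

```shell
# Synthetic stand-in for logs/hadoop.log (invented lines for illustration)
cat > /tmp/hadoop.log <<'EOF'
2011-12-19 14:05:01 INFO  fetcher.Fetcher - fetching http://url/Alpha.docx
2011-12-19 14:05:02 ERROR http.Http - Failed to get protocol output
2011-12-19 14:05:02 INFO  fetcher.Fetcher - fetch of http://url/Alpha.docx failed with: Http code=404
EOF

# Find every line mentioning the URL, then narrow to error-ish lines
grep 'Alpha.docx' /tmp/hadoop.log
grep 'Alpha.docx' /tmp/hadoop.log | grep -i -E 'error|fail|exception'
```

The second grep keeps only the lines that explain why the fetcher marked the page gone; if a proxy is the cause, the failure should show up here.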

