I ran nutch readdb and dumped the CrawlDB to a text file, and I found the entry for one of the missing documents:
http://url/Alpha.docx    Version: 7
Status: 3 (db_gone)
Fetch time: Thu Feb 02 18:42:19 GMT 2012
Modified time: Thu Jan 01 00:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 0.21058823
Signature: null
Metadata: _pst_: gone(11), lastModified=0

I guess the problem is that it is "gone", but I really don't know why -- the file does exist, and Nutch can find and parse it when I run the checkers. Wouldn't the URL filter block it at that level anyway? In any case, it doesn't match anything that has a "-" rule in the regex-urlfilter.xml file, so I don't think it is being filtered out there.

Is there anything else I could look at? The only thing that dumps out errors is the Hadoop logs, and there is a lot going on in there. Is there anything in particular I should look for near where it crawls that file? I don't see anything error-related near it or the other missing files.

-- Chris

On Mon, Dec 19, 2011 at 2:00 PM, Markus Jelsma <[email protected]> wrote:
> Check if it is in your CrawlDB at all. Debug further from that point on. If it
> is not, then why? Perhaps some URL filter? If it is, did it get an error?
>
>> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
>> a document isn't getting added to my Solr Index.
>>
>> I can use the parsechecker and indexchecker to verify the link to the
>> docx file, and they both can get to it and parse it just fine. But
>> when I use the crawl command, it doesn't appear. What config file
>> should I be checking? Do those tools use the same settings, or is
>> there something different about the way they operate?
>>
>> Any help would be appreciated!
>>
>> -- Chris
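[Editor's note] The debugging steps discussed above can be sketched as a few shell commands. This is a minimal sketch assuming a default Nutch 1.x layout; the paths `crawl/crawldb` and `logs/hadoop.log` are hypothetical and must be adjusted to the actual install:

```shell
# Query the CrawlDB for one URL directly instead of dumping everything
# (readdb supports -url in Nutch 1.x):
#   bin/nutch readdb crawl/crawldb -url http://url/Alpha.docx

# Narrow the Hadoop log to the lines that touch the document, plus a
# little context; fetcher/ProtocolStatus lines near it usually explain
# a db_gone status (404, robots.txt denial, redirect, exception):
#   grep -n -B2 -A2 'Alpha.docx' logs/hadoop.log \
#     | grep -iE 'fetch|protocolstatus|exception|denied|notfound'

# The same kind of filter applied to the dump record pulls out the two
# fields that matter here: Status and the protocol status (_pst_).
# A sample record is inlined below so the pipeline is self-contained:
printf '%s\n' \
  'http://url/Alpha.docx Version: 7 Status: 3 (db_gone)' \
  'Metadata: _pst_: gone(11), lastModified=0' \
  | grep -E 'Status:|_pst_:'
```

`_pst_: gone(11)` is the protocol status the fetcher recorded, so finding the fetch attempt in the log is the most direct way to see what response the server actually returned.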

