I ran the Nutch readdb command and dumped the CrawlDB to a text file,
and I found the entry for one of the missing documents:

http://url/Alpha.docx   Version: 7
Status: 3 (db_gone)
Fetch time: Thu Feb 02 18:42:19 GMT 2012
Modified time: Thu Jan 01 00:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 3888000 seconds (45 days)
Score: 0.21058823
Signature: null
Metadata: _pst_: gone(11), lastModified=0: http://url/Alpha.docx
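For reference, the dump above came from commands along these lines (the
crawl/crawldb path is an assumption -- substitute your own crawl
directory):

```shell
# Dump the whole CrawlDB to text files under crawldb_dump/
bin/nutch readdb crawl/crawldb -dump crawldb_dump

# Or look up just the one URL's entry directly
bin/nutch readdb crawl/crawldb -url http://url/Alpha.docx
```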

I guess the problem is that it is marked "gone", but I really don't
know why -- the file does exist, and the checker runs can find and
parse it just fine.  Wouldn't the URL filter block it at that level
too?  In any case, the URL doesn't match any of the exclusion patterns
(the lines starting with -) in the regex-urlfilter.txt file, so I
don't think it is being filtered out there.  Is there anything else I
could look at?

The only place that dumps out errors is the Hadoop logs, and there is
a lot going on there... Is there anything in particular that I should
look for near where it fetches that file?  I don't see anything
error-related near it or the other missing files.
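In case it matters, this is how I've been searching the log (in a real
crawl the log is logs/hadoop.log; the stand-in file below just makes
the grep runnable as-is, and the sample line is an assumption about
what a fetch failure might look like):

```shell
# Build a tiny stand-in log so the command is self-contained
printf 'fetch of http://url/Alpha.docx failed with: Http code=404\n' > /tmp/hadoop.log

# Pull every line mentioning the document, with line numbers
grep -n "Alpha.docx" /tmp/hadoop.log
```

Against the real logs/hadoop.log I'd also add -C 3 for a few lines of
context around each hit.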

-- Chris



On Mon, Dec 19, 2011 at 2:00 PM, Markus Jelsma
<[email protected]> wrote:
> Check if it is in your CrawlDB at all. Debug further from that point on. If it
> is not, then why? Perhaps some URL filter? If it is, did it get an error?
>
>> I'm trying to crawl a SharePoint 2010 site, and I'm confused as to why
>> a document isn't getting added to my Solr Index.
>>
>> I can use the parsechecker and indexchecker to verify the link to the
>> docx file, and they both can get to it and parse it just fine.  But
>> when I use the crawl command, it doesn't appear.  What config file
>> should I be checking?  Do those tools use the same settings, or is
>> there something different about the way they operate?
>>
>> Any help would be appreciated!
>>
>> -- Chris
