Re: How can I influence Content-Type checking?

2006-07-23 Thread Jayant Kumar Gandhi

Have you enabled the msword parse plugin in nutch-site.xml? You will have
to enable that plugin for msword parsing to work.
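
For reference, here is a minimal sketch of what that setting might look
like. It assumes the stock plugin.includes property and the parse-msword
plugin id; the other plugin names in the value are only illustrative and
should be copied from your own nutch-default.xml. The root element of
nutch-site.xml differs between Nutch versions, so only the property
itself is shown.

```xml
<!-- Assumed fragment for conf/nutch-site.xml: overrides plugin.includes
     so that parse-msword is loaded alongside the text and HTML parsers.
     The surrounding plugin list is illustrative; take the real default
     from nutch-default.xml and add "msword" to the parse-(...) group. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|msword)|index-basic|query-(basic|site|url)</value>
</property>
```

After changing the value, re-run the crawl so that the fetched documents
are parsed again with the new plugin set.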

Cheers,
Jayant

On 7/20/06, SKUHRA, Milan <[EMAIL PROTECTED]> wrote:


Hello,



I am trying to use Nutch to crawl specific URLs like these:

http://server.domain/appname/get?id=34&view=content

http://server.domain/appname/get?id=35&view=content

http://server.domain/appname/get?id=36&view=content



(So I use the "-depth 1" option when running the crawl.)

Some URLs return HTML pages, but others return files.

After running the crawl, I found this record in the log file:



060719 160036 fetch okay, but can't parse
http://server.domain/appname/get?id=36&view=content, reason:
failed(2,203): Content-Type not text/html: application/msword



It looks like the crawl infers a Content-Type from the URL and then
compares it with the received Content-Type.

Is it possible to use the received Content-Type without checking it
against the Content-Type implied by the URL?

How can I resolve this problem?



Thank you for your reply.

Milan Skuhra








--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi
M.Tech. Computer Tech. Class of 2007,
IIT Delhi


How can I influence Content-Type checking?

2006-07-20 Thread SKUHRA, Milan
Hello,

I am trying to use Nutch to crawl specific URLs like these:

http://server.domain/appname/get?id=34&view=content

http://server.domain/appname/get?id=35&view=content

http://server.domain/appname/get?id=36&view=content

(So I use the "-depth 1" option when running the crawl.)

Some URLs return HTML pages, but others return files.

After running the crawl, I found this record in the log file:

060719 160036 fetch okay, but can't parse
http://server.domain/appname/get?id=36&view=content, reason:
failed(2,203): Content-Type not text/html: application/msword

It looks like the crawl infers a Content-Type from the URL and then
compares it with the received Content-Type.

Is it possible to use the received Content-Type without checking it
against the Content-Type implied by the URL?

How can I resolve this problem?

Thank you for your reply.

Milan Skuhra