Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-13 Thread Jérôme Charron
I think you have activated the parse-mspowerpoint plugin, but not the lib-jakarta-poi plugin. Just activate the lib-jakarta-poi plugin and it should work. Thanks, that worked. For information: if you download the latest code available in the trunk, the manual activation of the

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-09 Thread Jérôme Charron
So in this case, the MIME type is correct, so the file should be passed to the parse-mspowerpoint plugin, but it's not. Now that the plugin has been committed, how do we actually make it work (yes, I've read http://issues.apache.org/jira/browse/NUTCH-88 )? I think you have activated the

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-07 Thread Jérôme Charron
Funnily enough, I was thinking the other way around: could it be a requirement for someone that two plugins parse the same content-type? One plugin does some parts of the parsing, then hands over the page to another one, _à la_ Visitor. It could be an interesting feature. But for now, I

nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Ayyanar Inbamohan
Hi all, I am using the PowerPoint plugin from JIRA, and when I crawl my application, which has a link to the .ppt file, Nutch 7.0 does not fetch the PowerPoint files at all. I am crawling my local application at http://localhost:8080/search_sample/index.html; I have given this URL in the url.intranet, I

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Michael Nebel
Hi, have you checked the filters (regex-urlfilter or crawl-urlfilter)? The .ppt extension is disabled by default. Regards Michael Ayyanar Inbamohan wrote: Hi all, I am using the PowerPoint plugin from JIRA, and when I crawl my application, which has a link to the .ppt file, Nutch 7.0 is not
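
To make the filter advice above concrete, here is a minimal sketch of how a first-match "+pattern" / "-pattern" URL filter behaves. It is written in Python for illustration only and is not Nutch code. The rule texts, the suffix list, and the names `parse_rules` and `accepts` are assumptions made for this example; check your own crawl-urlfilter.txt for the real patterns.

```python
import re

# Hypothetical rule sets in a "+pattern" / "-pattern" line format.
# The suffix lists are illustrative, not copied from any Nutch release.
RULES_DEFAULT = r"""
# skip some binary suffixes, including .ppt
-\.(gif|jpg|zip|ppt|exe)$
# accept everything on the local host
+^http://localhost:8080/
"""

# Same rules with "ppt" removed from the exclusion list.
RULES_PPT_ENABLED = r"""
-\.(gif|jpg|zip|exe)$
+^http://localhost:8080/
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_regex) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False
```

With the first rule set the .ppt URL is dropped before it is ever fetched, which would explain a crawl that silently ignores PowerPoint links; removing `ppt` from the exclusion line lets the same URL through.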

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Ayyanar Inbamohan
Hi Michael, I have enabled the ppt extension in crawl-urlfilter.txt. Now it is fetching the PowerPoint files, but I am getting the following error, because the content type of the ppt files is not recognized by Nutch. 050906 175342 fetching http://localhost:8080/search_sample/kmportal3.ppt 050906 175342

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Jérôme Charron
I have enabled the ppt extension in crawl-urlfilter.txt. Now it is fetching the PowerPoint files, but I am getting the following error, because the content type of the ppt files is not recognized by Nutch. Looking at the code, here is a copy of the comment of the ParserFactory (the class that chooses

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Sébastien LE CALLONNEC
--- Jérôme Charron [EMAIL PROTECTED] wrote: Yes, you are right, but my response was a short-term solution. 1. A quick solution could be to check that a plugin can be associated with many content-types (if so, one just needs to add application/powerpoint in the mspowerpoint plugin xml).

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Jérôme Charron
I remember having played with that a wee bit, but the problem was that the plugins themselves are riddled with pieces of code like the one below, found in MSWordParser in release 0.7: Yes, it's true, each parse plugin checks in its code the content-type of the provided content. As you
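
As a rough illustration of the point above (the content-type check living inside each parser), the sketch below contrasts a parser with one hard-coded content-type against one whose accepted types are passed in from outside. It is Python pseudostructure for discussion, not the MSWordParser code referred to in the thread; the class names, `ParseError`, and the placeholder return value are all hypothetical.

```python
class ParseError(Exception):
    """Raised when a parser is handed content it does not claim."""


class HardCodedParser:
    # The accepted type is baked into the class, so adding an alias such as
    # "application/powerpoint" means editing and rebuilding the parser.
    CONTENT_TYPE = "application/vnd.ms-powerpoint"

    def parse(self, content_type, data):
        if content_type != self.CONTENT_TYPE:
            raise ParseError("unexpected content type: " + content_type)
        return "parsed %d bytes" % len(data)  # placeholder for real parsing


class ConfigurableParser:
    # The accepted types come from outside (for example a plugin descriptor),
    # so one parser can be associated with several content-types.
    def __init__(self, accepted_types):
        self.accepted = frozenset(t.lower() for t in accepted_types)

    def parse(self, content_type, data):
        if content_type.lower() not in self.accepted:
            raise ParseError("unexpected content type: " + content_type)
        return "parsed %d bytes" % len(data)  # placeholder for real parsing
```

The second variant is one way to let configuration, rather than code, decide which content-types reach a parser; whether that decision belongs in the plugin or in a central file is exactly what the rest of the thread debates.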

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki
Jérôme Charron wrote: I remember having played with that a wee bit, but the problem was that the plugins themselves are riddled with pieces of code like the one below, found in MSWordParser in release 0.7: Yes, it's true, each parse plugin checks in its code the content-type of the provided

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Doug Cutting
Andrzej Bialecki wrote: 3. implement a catch-all plugin, which is equivalent to the Unix command strings(1) (I have an implementation of that which I can contribute). And turn it off/on in the config: if it's off, then the unknown content is skipped and logged; if it's on, then make the best
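
For readers unfamiliar with strings(1), here is a minimal Python sketch of what such a catch-all fallback might do: pull runs of printable ASCII out of arbitrary bytes, with a switch to skip unknown content instead. This is not the implementation offered in the message above; the function names, the 4-character minimum, and the `enabled` flag are assumptions for illustration.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


def catch_all_parse(data, enabled):
    """Best-effort text for unknown content, or None when the fallback is off."""
    if not enabled:
        return None  # caller would skip and log the unknown content
    return " ".join(extract_strings(data))
```

For example, `extract_strings(b"\x00\x01Hello, Nutch\x00ab\x02slide one\xff")` keeps the two long runs and drops the two-character fragment.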

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Jérôme Charron
This is possible now by simply configuring a catch-all plugin to match the empty suffix and removing the empty suffix from other plugins. So it seems the problem is not that this is currently impossible, but rather that it would be better to alter the configuration than the plugin

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki
Jérôme Charron wrote: I really don't like this solution of centralizing this kind of information. I think it's the plugin's responsibility to claim the content-type/path-suffix it can handle. However, what happens if more than one plugin claims that it can handle any given content-type? E.g.
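
One possible answer to the question above, sketched in Python under stated assumptions: let each plugin claim its content-types, and use a separately configured preference list only to order the candidates when several plugins claim the same type. This is a hypothetical policy for discussion, not how Nutch resolves the conflict; `ParserRegistry`, `register`, `candidates`, and the plugin ids used in the usage note are illustrative.

```python
from collections import defaultdict


class ParserRegistry:
    def __init__(self, preference=()):
        # content-type -> plugin ids, in registration order
        self._claims = defaultdict(list)
        # plugin ids in preferred order, earliest first
        self._preference = list(preference)

    def register(self, plugin_id, content_types):
        """Record the content-types a plugin claims for itself."""
        for content_type in content_types:
            self._claims[content_type.lower()].append(plugin_id)

    def candidates(self, content_type):
        """Plugins claiming this type: preferred ones first, then the rest."""
        claimed = self._claims.get(content_type.lower(), [])

        def rank(plugin_id):
            if plugin_id in self._preference:
                return self._preference.index(plugin_id)
            return len(self._preference)

        # sorted() is stable, so unlisted plugins keep registration order
        return sorted(claimed, key=rank)
```

If `parse-text` claims text/plain and text/html, `parse-html` claims text/html, and the preference list names `parse-html`, then `candidates("text/html")` yields `parse-html` before `parse-text`. The claims stay with the plugins while the tie-break stays in configuration.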