Re: [Nutch-general] Link Analysis in OC

2005-09-06 Thread Kelvin Tan
No, and that's something that will be worked into the OC before it gets merged into SVN: support for both host-based and score-based fetchlist prioritization. On Tue, 6 Sep 2005 19:17:42 -0700 (PDT), Michael Ji wrote: > hi Kelvin: > > Does OC support Link Analysis directly? > > I guess we have to

Re: [Nutch-general] scope filter in OC

2005-09-06 Thread Kelvin Tan
There is a FLFilter in OC which uses Nutch's regex-urlfilter.txt. I believe it's called NutchUrlFLFilter. On Tue, 6 Sep 2005 19:32:15 -0700 (PDT), Michael Ji wrote: > Hi Kelvin: > > Does OC support domain crawling like url-filter.txt? > If so, how do I insert the seed domain list into OC? > > I saw OC

RE: com.sun.net.ssl Error

2005-09-06 Thread EM
This: http://lucene.apache.org/nutch/tutorial.html says the requirements are flexible (I'm sticking to Sun's Java anyway). It also says Linux is preferred, although I've been using XP + cygwin for a huge part of my tests and work and I've encountered zero problems due to that configuration. -Or

scope filter in OC

2005-09-06 Thread Michael Ji
Hi Kelvin: Does OC support domain crawling like url-filter.txt? If so, how do I insert the seed domain list into OC? I saw OC's org.supermind.crawl.scope package, but didn't see a similar concept. Thanks, Michael Ji

Re: how to fetch all web pages on one site

2005-09-06 Thread Michael Ji
I think you need to do several runs. The first run just crawls the homepage of the site. I use the screen output as the log information. Not sure what other logs there are. Michael Ji, --- AJ Chen <[EMAIL PROTECTED]> wrote: > I'm testing nutch whole-web crawling with just one > url in a text file. > Bu

Link Analysis in OC

2005-09-06 Thread Michael Ji
hi Kelvin: Does OC support Link Analysis directly? I guess we have to use updateDB and then use DistributeLinkAnalysisTool to generate the pageRank score for individual site. Will there be another scenario that we could get Link Analysis Score from OC? thanks, Michael Ji

Re: I runed Nutch crawl but got an "FileNotFountException" ,why?

2005-09-06 Thread Michael Ji
Check your urls file: does it exist in the folder where you run the crawler? Michael Ji --- mu xiaofeng <[EMAIL PROTECTED]> wrote: > Hi, > > I ran this command "nutch crawl urls -dir > crawl.test -depth 3 > -threads 2" but got a 'FileNotFoundException'. Why?
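A pre-flight check along the lines suggested above could look like the following minimal sketch. It assumes that the first argument to `nutch crawl` (`urls` in the quoted command) is a plain text file of seed URLs, one per line, resolved relative to the directory the crawl is started from. The helper name `check_seed_file` is hypothetical and not part of Nutch.

```python
from pathlib import Path


def check_seed_file(path):
    """Return the non-empty seed URLs in `path`, or raise FileNotFoundError.

    Hypothetical helper, not part of Nutch: it only verifies that the seed
    file exists relative to the current working directory.
    """
    seed_file = Path(path)
    if not seed_file.is_file():
        raise FileNotFoundError(
            f"seed file {seed_file} not found in {Path.cwd()}"
        )
    lines = seed_file.read_text(encoding="utf-8").splitlines()
    # Ignore blank lines so an empty trailing line is not treated as a URL.
    return [line.strip() for line in lines if line.strip()]
```

Running this from the same directory as the crawl command would show whether the missing file is the seed list itself.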

Re: com.sun.net.ssl Error

2005-09-06 Thread Michael Ji
Why the JVM from IBM? All the Java packages are from Sun, right? Michael Ji --- "Vanderdray, Jake" <[EMAIL PROTECTED]> wrote: > I'm trying to get nutch-0.7 set up on a RedHat > Enterprise 3 > machine. I've installed the JVM from IBM and gotten > tomcat up and > running, but when I try to use ant to

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki
Jérôme Charron wrote: I really don't like this solution of centralizing this kind of information. I think it's the plugin's responsibility to claim the content-type/path-suffix it can handle. However, what happens if more than one plugin claims that it can handle any given content-type? E.g. ht

RE: httpd/unix-directory

2005-09-06 Thread EM
The issue happened quite a lot with my last fetchlist (I'm using the official 0.7), the next time it happens I can send you a list of urls if you like? -Original Message- From: Michael Nebel [mailto:[EMAIL PROTECTED] Sent: Tuesday, September 06, 2005 2:42 PM To: nutch-user@lucene.apache.

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Jérôme Charron
> This is possible now by simply configuring a catch-all plugin to match > the empty suffix and removing the empty suffix from other plugins. So > it seems the problem is not that this is currently impossible, but > rather that it would be better to alter the configuration than the > plugin definit

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Doug Cutting
Andrzej Bialecki wrote: 3. implement a catch-all plugin, which is equivalent to a Unix command strings(1) (I have an implementation of that which I can contribute). And turn it off/on in the config, if it's off, then the unknown content is skipped and logged, if it's on - then make the best eff
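The catch-all idea quoted above can be illustrated with a rough sketch of what a strings(1)-style extractor does: keep runs of printable characters and drop everything else. This is not Andrzej's implementation (which is not shown in the thread); the function names, the `catch_all_enabled` switch, and the default minimum run length of 4 are assumptions for illustration.

```python
import re


def extract_strings(data, min_len=4):
    """Return runs of at least `min_len` printable ASCII bytes found in `data`."""
    # Printable ASCII (space through tilde) plus tab, repeated min_len+ times.
    pattern = re.compile(rb"[\t\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


def parse_unknown(data, catch_all_enabled):
    """Best-effort text for unknown content, or None when the catch-all is off.

    Hypothetical switch: a caller would skip and log the document on None.
    """
    if not catch_all_enabled:
        return None
    return " ".join(extract_strings(data))
```

For example, `extract_strings(b"\x00\x01Hello world\x00ab\x00Nutch\xff")` keeps `Hello world` and `Nutch` and drops the two-byte run `ab`.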

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Jérôme Charron
> 3. implement a catch-all plugin, which is equivalent to a Unix command > strings(1) (I have an implementation of that which I can contribute). > And turn it off/on in the config, if it's off, then the unknown content > is skipped and logged, if it's on - then make the best effort to extract > tex

Re: Content-type mismatch for Excel

2005-09-06 Thread Jérôme Charron
> I took at random some xls-files from the internet, crawled them and saw > some errors. I haven't been able to check the errors further. So I can't > give you a more specific description of the problem :-( If you're > interested, I can mail you the url with my test-documents "off-list". Yes, I'm

Re: Content-type mismatch for Excel

2005-09-06 Thread Michael Nebel
Hi Jérôme, Jérôme Charron wrote: The changes are not difficult, but I still observe some other problems with this plugin. Ok, what kind of problems? I took at random some xls-files from the internet, crawled them and saw some errors. I haven't been able to check the errors further. So I can'

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Michael Nebel
Hi Ayyanar, sorry for the delay, but I've been out of office for some hours. Have you activated the plugins? You need to extend the plugin.includes. Mine looks like this, for example: plugin.includes nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexc
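To see why a plugin missing from `plugin.includes` never runs, here is a small sketch of how such a value might be evaluated. It assumes the value is treated as a regular expression that must match the whole plugin id; that is an assumption about Nutch's behavior, not something stated in the message. The value below is a hypothetical, shortened one, not the truncated value quoted above, and `is_enabled` is an illustrative name.

```python
import re

# Hypothetical, shortened plugin.includes value (NOT the one from the message).
PLUGIN_INCLUDES = (
    r"nutch-extensionpoints|protocol-httpclient|urlfilter-regex"
    r"|parse-(text|html|mspowerpoint)"
)


def is_enabled(plugin_id, includes=PLUGIN_INCLUDES):
    """True if the whole plugin id matches the includes expression."""
    return re.fullmatch(includes, plugin_id) is not None
```

With this value, `is_enabled("parse-mspowerpoint")` is true while `is_enabled("parse-pdf")` is false, so adding a parser means adding its name to the alternation.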

Re: httpd/unix-directory

2005-09-06 Thread Michael Nebel
Hi, looking at my Apache, I get directory listings as "Content-Type: text/html", not "httpd/unix-directory"... What kind of server are you crawling? Regards Michael EM wrote: Shouldn't "httpd/unix-directory" be parsed? Message from the logs: fetch okay, but can't parse http:///,

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Andrzej Bialecki
Jérôme Charron wrote: I remember having played with that a wee bit, but the problem was that the plugins themselves are riddled with pieces of code like the one below, found in MSWordParser in release 0.7: Yes, it's true, each parse plugin checks in its code the content-type of the provided c

Wildcards and different sites in Nutch

2005-09-06 Thread Mark Johannes
Hello, I have 2 questions about Nutch. 1. Does Nutch support wildcards, as Lucene does? I tried to use a * in my search query and nothing happened. Is there any way to pass my search query directly to the Lucene QueryParser? 2. I want to crawl and index a set of intranet sites. After cra

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Jérôme Charron
> > I remember having played with that a wee bit, but the problem was that > the plugins themselves are riddled with pieces of code like the one > below, found in MSWordParser in release 0.7: Yes, it's true, each parse plugin checks in its code the content-type of the provided content. As you no

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Sébastien LE CALLONNEC
--- Jérôme Charron <[EMAIL PROTECTED]> wrote: > Yes, you are right, but my response was a short-term solution. > 1. A quick solution could be to check that a plugin can be > associated to > many content-types (if so, you just need to add application/powerpoint > in the > mspowerpoint plugin

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Jérôme Charron
> Is it not supposed to be the other way around, Nutch needing to be more > complacent with old servers that return "application/powerpoint"? The > thing is, there are some servers out there which _do_ return that MIME > Type, and supposedly, one would want to index them as well... As we > can't ha

RE: com.sun.net.ssl Error

2005-09-06 Thread Sébastien LE CALLONNEC
Hi Jake, You probably need to install JSSE as well. http://java.sun.com/products/jsse/ Regards, Sebastien. --- "Vanderdray, Jake" <[EMAIL PROTECTED]> wrote: > I'm trying to get nutch-0.7 set up on a RedHat Enterprise 3 > machine. I've installed the JVM from IBM and gotten tomcat up a

com.sun.net.ssl Error

2005-09-06 Thread Vanderdray, Jake
I'm trying to get nutch-0.7 setup on a RedHat Enterprise 3 machine. I've installed the JVM from IBM and gotten tomcat up and running, but when I try to use ant to compile nutch, I get a bunch of errors like this: compile: [echo] Compiling plugin: protocol-httpclient [javac] Compi

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Sébastien LE CALLONNEC
Hi there, Is it not supposed to be the other way around, with Nutch needing to be more lenient with old servers that return "application/powerpoint"? The thing is, there are some servers out there which _do_ return that MIME type, and presumably one would want to index them as well... As we can'
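One way to be lenient with such servers is to normalize known aliases to a single canonical type before choosing a parser. The sketch below is only an illustration of that idea, not how Nutch resolves content types. `application/vnd.ms-powerpoint` is the registered type; the alias table and the function name `normalize_content_type` are assumptions.

```python
CANONICAL_PPT = "application/vnd.ms-powerpoint"

# Illustrative alias table (an assumption); extend it as servers require.
MIME_ALIASES = {
    "application/powerpoint": CANONICAL_PPT,
    "application/mspowerpoint": CANONICAL_PPT,
    "application/x-mspowerpoint": CANONICAL_PPT,
}


def normalize_content_type(header_value):
    """Map a Content-Type header value to a canonical, lower-case MIME type."""
    # Drop parameters such as "; charset=..." and ignore case.
    base = header_value.split(";", 1)[0].strip().lower()
    return MIME_ALIASES.get(base, base)
```

The design choice here is to keep parsers strict about one type each and to absorb server quirks in a single lookup table.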

I runed Nutch crawl but got an "FileNotFountException" ,why?

2005-09-06 Thread mu xiaofeng
Hi, I ran this command "nutch crawl urls -dir crawl.test -depth 3 -threads 2" but got a 'FileNotFoundException'. Why?

Re: why nutch taking application/msword for powerpoint

2005-09-06 Thread Jérôme Charron
> 050906 175342 fetch okay, but can't parse > http://localhost:8080/search_sample/kmportal3.ppt, > reason: failed(2,203): Content-Type not > application/msword: application/powerpoint See my response to your previous mail. Jérôme -- http://motrech.free.fr/ http://www.frutch.org/
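When many such lines appear in a fetch log, it can help to pull out the expected and the actual content type mechanically. The following is a minimal sketch that assumes every mismatch line has exactly the shape of the one quoted above; the regular expression and the name `parse_mismatch` are illustrative, not part of Nutch.

```python
import re

# Pattern modeled on the single log line quoted in the message (an assumption
# that other mismatch lines look the same).
MISMATCH = re.compile(
    r"fetch okay, but can't parse (?P<url>\S+), "
    r"reason: failed\((?P<codes>[\d,]+)\): "
    r"Content-Type not (?P<expected>\S+): (?P<actual>\S+)"
)


def parse_mismatch(line):
    """Return (url, expected, actual) for a mismatch line, else None."""
    match = MISMATCH.search(line)
    if match is None:
        return None
    return match.group("url"), match.group("expected"), match.group("actual")
```

Applied to the quoted line, this reports that the parser expected `application/msword` while the server sent `application/powerpoint`.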

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Jérôme Charron
> > I have enabled the ppt extension in > crawl-urlfilter.txt. Now it is fetching the PowerPoint > files, > but I am getting the following error, because the ppt files' > content type is not accepted by Nutch. Looking at the code, here is a copy of the comment of the ParserFactory (the class that ch

Re: Content-type mismatch for Excel

2005-09-06 Thread Jérôme Charron
> > there are some modifications necessary, because the xls-plugin still uses > an old interface. Yes, it uses some old interfaces. I have made the changes in my local copy for committing to the trunk. But I have not tested it yet (I will commit in a few days if no objections from other de

why nutch taking application/msword for powerpoint

2005-09-06 Thread Ayyanar Inbamohan
Hi All, when I crawl the PowerPoint files, by creating hrefs in my HTML files, the PowerPoint files are fetched, but while parsing I am getting the following error: 050906 175342 fetch okay, but can't parse http://localhost:8080/search_sample/kmportal3.ppt, reason: failed(2,203): Content-Type no

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Ayyanar Inbamohan
Hi Michael, I have enabled the ppt extension in crawl-urlfilter.txt. Now it is fetching the PowerPoint files, but I am getting the following error, because the ppt files' content type is not accepted by Nutch: 050906 175342 fetching http://localhost:8080/search_sample/kmportal3.ppt 050906 175342

Re: Content-type mismatch for Excel

2005-09-06 Thread Michael Nebel
Hi, there are some modifications necessary, because the xls-plugin still uses an old interface. The changes are not difficult, but I still observe some other problems with this plugin. Regards Michael Ayyanar Inbamohan wrote: Hi jerome, Now i am trying nutch 7.0. I am using the

Re: nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Michael Nebel
Hi, have you checked the filters (regex-urlfilter or crawl-urlfilter)? The ending ".ppt" is disabled by default. Regards Michael Ayyanar Inbamohan wrote: Hi all, I am using the powerpoint plugin from JIRA, and when i crawl my application having link to the ppt, nutch 7.0 is not
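The effect of a suffix being disabled in a URL filter file can be sketched as follows. This assumes the common reading of such files: each rule is a `+` (accept) or `-` (reject) followed by a regular expression, the first rule whose pattern is found in the URL decides, and URLs matching no rule are rejected. The rules below are hypothetical and not copied from any Nutch file; `parse_rules` and `accepts` are illustrative names.

```python
import re

# Hypothetical rules in the "+regex / -regex" style (not from a real file).
RULES = r"""
# skip URLs ending in these suffixes; listing ppt here blocks PowerPoint files
-\.(gif|jpg|zip|ppt)$
# accept anything else on the local test server
+^http://localhost:8080/
"""


def parse_rules(text):
    """Turn rule lines into (allow, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """First matching rule decides; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False
```

Under these assumptions, the `.ppt` URL from the thread is rejected until `ppt` is removed from the suffix list in the reject rule.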

nutch 7.0 not fetching powerpoint, plugin is present

2005-09-06 Thread Ayyanar Inbamohan
Hi all, I am using the powerpoint plugin from JIRA, and when I crawl my application, which has a link to the ppt, nutch 7.0 is not fetching the PowerPoint files at all. I am crawling my local application at http://localhost:8080/search_sample/index.html; this url, i have given in the url.intranet, i ga

Re: Content-type mismatch for Excel

2005-09-06 Thread Ayyanar Inbamohan
Hi Jérôme, now I am trying nutch 7.0. I am using the plugin from JIRA, but while building the plugin using ant, I am still getting two exceptions from the excel plugin: compile: [echo] Compiling plugin: parse-msexcel [javac] Compiling 3 source files to /home/oss/nutch-0.7/build/parse-msexc