No, and that's something that will be worked into the OC before it gets merged
into SVN: support for both host-based and score-based fetchlist prioritization.
On Tue, 6 Sep 2005 19:17:42 -0700 (PDT), Michael Ji wrote:
> hi Kelvin:
>
> Does OC support Link Analysis directly?
>
> I guess we have to use updateDB and then use DistributeLinkAnalysisTool to generate the pageRank score for individual site.
There is a FLFilter in OC which uses Nutch's regex-urlfilter.txt. I believe it's
called NutchUrlFLFilter
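For the domain-crawling case, the same regex-urlfilter.txt mechanism can whitelist seed domains. A rough sketch (the domains below are placeholders, not from this thread):

```
# accept pages on the seed domains only (hypothetical examples)
+^http://([a-z0-9]*\.)*example\.com/
+^http://([a-z0-9]*\.)*example\.org/

# reject everything else
-.
```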
On Tue, 6 Sep 2005 19:32:15 -0700 (PDT), Michael Ji wrote:
> Hi Kelvin:
>
> Does OC support domain crawling like url-filter.txt?
> If so, how to insert the seeds domain list to OC?
>
> I saw OC's org.supermind.crawl.scope package, didn't see a similar concept.
This:
http://lucene.apache.org/nutch/tutorial.html
says the requirements are flexible (I'm sticking to Sun's Java anyway).
It also says Linux is preferred, although I've been using XP + cygwin for a
huge part of my tests and work and I've encountered zero problems due to
that configuration.
-Or
Hi Kelvin:
Does OC support domain crawling like url-filter.txt?
If so, how to insert the seeds domain list to OC?
I saw OC's org.supermind.crawl.scope package, didn't
see a similar concept.
thanks,
Michael Ji
I think you need to do several runs. The first run just
crawls the homepage of the site.
I use the screen output as the log information. Not
sure what other logs there are.
Michael Ji,
--- AJ Chen <[EMAIL PROTECTED]> wrote:
> I'm testing nutch whole-web crawling with just one
> url in a text file.
> Bu
hi Kelvin:
Does OC support Link Analysis directly?
I guess we have to use updateDB and then use
DistributeLinkAnalysisTool to generate the pageRank
score for individual site.
Will there be another scenario that we could get Link
Analysis Score from OC?
thanks,
Michael Ji
check your urls file: does it exist in the folder
where you run the crawler?
Michael Ji
--- mu xiaofeng <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I ran this command "nutch crawl urls -dir
> crawl.test -depth 3
> -threads 2" but got a 'FileNotFoundException', why?
>
Why a JVM from IBM? All the Java packages are from Sun, right?
Michael Ji
--- "Vanderdray, Jake" <[EMAIL PROTECTED]> wrote:
> I'm trying to get nutch-0.7 setup on a RedHat
> Enterprise 3
> machine. I've installed the JVM from IBM and gotten
> tomcat up and
> running, but when I try to use ant to
Jérôme Charron wrote:
I really don't like this solution of centralizing this kind of information.
I think it's the plugin's responsibility to claim the
content-type/path-suffix it can handle.
However, what happens if more than one plugin claims that it can handle
any given content-type? E.g. ht
The issue happened quite a lot with my last fetchlist (I'm using the
official 0.7), the next time it happens I can send you a list of urls if you
like?
-Original Message-
From: Michael Nebel [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 06, 2005 2:42 PM
To: nutch-user@lucene.apache.
> This is possible now by simply configuring a catch-all plugin to match
> the empty suffix and removing the empty suffix from other plugins. So
> it seems the problem is not that this is currently impossible, but
> rather that it would be better to alter the configuration than the
> plugin definit
Andrzej Bialecki wrote:
3. implement a catch-all plugin, which is equivalent to a Unix command
strings(1) (I have an implementation of that which I can contribute).
And turn it off/on in the config: if it's off, then the unknown content
is skipped and logged; if it's on, then make the best effort to extract text.
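As a rough illustration of what such a catch-all could do (a hypothetical sketch, not Andrzej's actual implementation), a strings(1)-style pass keeps only runs of printable ASCII above a minimum length:

```java
public class CatchAllStrings {

    // Collect runs of printable ASCII (space..tilde) of at least minRun
    // characters, separated by newlines, mimicking Unix strings(1).
    static String extractStrings(byte[] content, int minRun) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        for (byte b : content) {
            char c = (char) (b & 0xFF);
            if (c >= 32 && c < 127) {
                run.append(c);
            } else {
                if (run.length() >= minRun) {
                    if (out.length() > 0) out.append('\n');
                    out.append(run);
                }
                run.setLength(0);
            }
        }
        if (run.length() >= minRun) {
            if (out.length() > 0) out.append('\n');
            out.append(run);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Hi" is too short to keep; "!words" survives the filter
        byte[] sample = {72, 105, 0, 1, 33, 119, 111, 114, 100, 115, 0};
        System.out.println(extractStrings(sample, 4));
    }
}
```

A real plugin would wrap this in the Parser extension point and return the extracted text as parse data; only the extraction step is shown here.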
> I took at random some xls-files from the internet, crawled them and saw
> some errors. I haven't been able to check the errors further. So I can't
> give you a more specific description of the problem :-( If you're
> interested, I can mail you the url with my test-documents "off-list".
Yes, I'm
Hi Jérôme,
Jérôme Charron wrote:
The changes are not difficult, but I still
observe some other problems with this plugin.
Ok, what kind of problems?
I took at random some xls-files from the internet, crawled them and saw
some errors. I haven't been able to check the errors further. So I can't
give you a more specific description of the problem :-(
Hi Ayyanar,
sorry for the delay, but I've been out of office for some hours.
Have you activated the plugins? You need to extend the plugin.includes.
Mine, for example, looks like this:
plugin.includes
nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexc
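Spelled out as a property in nutch-site.xml it would look something like the following (the exact plugin list below is an assumption; keep whichever parsers you actually need):

```xml
<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msword|pdf|rtf|rss|js|msexcel)</value>
</property>
```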
Hi,
looking at my Apache, I get directory listings as "Content-Type:
text/html", not "httpd/unix-directory"... What kind of server are you
crawling?
Regards
Michael
EM wrote:
Shouldn't "httpd/unix-directory" be parsed?
Message from the logs:
fetch okay, but can't parse http:///,
Jérôme Charron wrote:
I remember having played with that a wee bit, but the problem was that
the plugins themselves are riddled with pieces of code like the one
below, found in MSWordParser in release 0.7:
Yes, it's true, each parse plugin checks in its code the content-type of the
provided content.
Hello,
I have 2 questions about Nutch.
1. Does Nutch support wildcards, since Lucene does?
I tried to use a * in my search query and nothing happened. Is there any
way to pass my search query directly to the Lucene QueryParser?
2. I want to crawl and index a set of intranet sites. After cra
--- Jérôme Charron <[EMAIL PROTECTED]> wrote:
> Yes, you are right, but my response was a short-term solution.
> 1. A quick solution could be to check that a plugin can be
> associated to
> many content-types (if so, we just need to add application/powerpoint
> in the
> mspowerpoint plugin
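Jérôme's option 1 amounts to letting the plugin itself accept several aliases for the same format. A hypothetical sketch of such a guard (the type list is illustrative; it is not code from the mspowerpoint plugin):

```java
import java.util.Arrays;
import java.util.List;

public class ContentTypeGuard {

    // Hypothetical alias list; 0.7-era plugins typically hard-code one type.
    static final List<String> POWERPOINT_TYPES = Arrays.asList(
            "application/vnd.ms-powerpoint",
            "application/mspowerpoint",
            "application/powerpoint");   // sent by some old servers

    // A parser would call this before parsing and fail early otherwise.
    static boolean accepts(String contentType) {
        return POWERPOINT_TYPES.contains(contentType);
    }

    public static void main(String[] args) {
        System.out.println(accepts("application/powerpoint"));
    }
}
```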
Hi Jake,
You probably need to install JSSE as well.
http://java.sun.com/products/jsse/
Regards,
Sebastien.
--- "Vanderdray, Jake" <[EMAIL PROTECTED]> wrote:
> I'm trying to get nutch-0.7 setup on a RedHat Enterprise 3
> machine. I've installed the JVM from IBM and gotten tomcat up a
I'm trying to get nutch-0.7 setup on a RedHat Enterprise 3
machine. I've installed the JVM from IBM and gotten tomcat up and
running, but when I try to use ant to compile nutch, I get a bunch of
errors like this:
compile:
[echo] Compiling plugin: protocol-httpclient
[javac] Compi
Hi there,
Is it not supposed to be the other way around, Nutch needing to be more
complacent with old servers that return "application/powerpoint"? The
thing is, there are some servers out there which _do_ return that MIME
Type, and supposedly, one would want to index them as well... As we
can'
Hi,
I ran this command "nutch crawl urls -dir crawl.test -depth 3
-threads 2" but got a 'FileNotFoundException', why?
> 050906 175342 fetch okay, but can't parse
> http://localhost:8080/search_sample/kmportal3.ppt,
> reason: failed(2,203): Content-Type not
> application/msword: application/powerpoint
See my response to your previous mail
Jérôme
--
http://motrech.free.fr/
http://www.frutch.org/
>
> I have enabled the ppt extension from the
> crawl-urlfilter.txt, Now it is fetching the powerpoint
> files,
> But I am getting the following error, because the ppt files'
> content type is not handled by nutch..
Looking at the code, here is a copy of the comment of the ParserFactory (the
class that ch
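Conceptually, the ParserFactory maps a content type to the plugin that claimed it. A simplified, self-contained sketch of that lookup, with a hypothetical alias entry and a catch-all fallback (the registry contents are assumptions, not Nutch's actual tables):

```java
import java.util.HashMap;
import java.util.Map;

public class ParserLookup {

    // Hypothetical registry: content type -> id of the plugin claiming it.
    static final Map<String, String> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("application/msword", "parse-msword");
        REGISTRY.put("application/mspowerpoint", "parse-mspowerpoint");
        // alias for old servers that send the legacy type
        REGISTRY.put("application/powerpoint", "parse-mspowerpoint");
    }

    // Fall back to a catch-all plugin when no parser claims the type.
    static String parserFor(String contentType) {
        return REGISTRY.getOrDefault(contentType, "parse-catchall");
    }

    public static void main(String[] args) {
        System.out.println(parserFor("application/powerpoint"));
    }
}
```

With a table like this, "application/powerpoint" resolves to the same parser as the canonical type instead of failing with "Content-Type not application/msword".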
>
> there are some modifications necessary, because the xls-plugin still
> uses an old interface.
Yes, it uses some old interfaces. I have made the changes in my local copy
for committing in the trunk.
But I have not tested it yet (I will commit in a few days if no
objections for other de
Hi All,
when I crawl the powerpoint files, by creating hrefs in
my html files,
the powerpoint files are fetched, but while parsing I
am getting the following error:
050906 175342 fetch okay, but can't parse
http://localhost:8080/search_sample/kmportal3.ppt,
reason: failed(2,203): Content-Type not application/msword: application/powerpoint
Hi Michael,
I have enabled the ppt extension in
crawl-urlfilter.txt, and now it is fetching the powerpoint
files,
but I am getting the following error, because the ppt files'
content type is not handled by nutch..
050906 175342 fetching
http://localhost:8080/search_sample/kmportal3.ppt
050906 175342
Hi,
there are some modifications necessary, because the xls-plugin still
uses an old interface. The changes are not difficult, but I still
observe some other problems with this plugin.
Regards
Michael
Ayyanar Inbamohan wrote:
Hi jerome,
Now I am trying nutch 0.7. I am using the
Hi,
have you checked the filters (regex-urlfilter or crawl-urlfilter)? The
ending ".ppt" is disabled by default.
Regards
Michael
Ayyanar Inbamohan wrote:
Hi all,
I am using the powerpoint plugin from JIRA, and when I
crawl my application having a link to the ppt, nutch 0.7
is not
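A hedged sketch of the change Michael suggests in crawl-urlfilter.txt (the exact default suffix list varies by release; this fragment is illustrative):

```
# the default reject rule skips .ppt among other suffixes, e.g.:
# -\.(gif|GIF|jpg|JPG|...|ppt|PPT|...)$
# either remove ppt|PPT from that rule, or accept it explicitly first:
+\.ppt$
```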
Hi all,
I am using the powerpoint plugin from JIRA, and when I
crawl my application having a link to the ppt, nutch 0.7
is not at all fetching the powerpoint files.
I am crawling my local application
http://localhost:8080/search_sample/index.html
this url I have given in the url.intranet,
i ga
Hi jerome,
Now I am trying nutch 0.7. I am using the plugin from
JIRA,but still while building the plugin using ant,i
am getting two exceptions from the excel plugin
compile:
[echo] Compiling plugin: parse-msexcel
[javac] Compiling 3 source files to
/home/oss/nutch-0.7/build/parse-msexc