Hello,
Some time ago someone on the list mentioned a problem with the Nutch
tutorial (I cannot find that email now). I checked it today and
he/she was right. If you follow the Nutch Intranet Crawling tutorial
you will end up with a not very interesting index.
This is because it recommends users to
Piotr Kosiorowski wrote:
I can commit such changes for the 0.7 release (that means today) if I get
positive feedback from other committers.
+1
--
Best regards,
Andrzej Bialecki
I got it to work now. It wasn't selecting the directory I had chosen, so I
typed it in and it works fine.
BTW very cool tool
-J
- Original Message -
From: Fredrik Andersson [EMAIL PROTECTED]
To: nutch-dev@lucene.apache.org
Sent: Sunday, August 07, 2005 6:16 PM
Subject: Re: luke??
That's
A very basic facility seems to be missing in Nutch. If I have a list of
2000 URLs in the Nutch DB and want to ignore external links, I have to
build a regex-filter with thousands of different domains I want to
crawl. There is no parameter to crawl only those domains and ignore
external links.
At these
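A minimal sketch of the facility being asked for: instead of maintaining thousands of regex rules by hand, collect the hosts of the seed URLs once and accept only links whose host is in that set. This is an illustration only, not Nutch code; the class name `SeedHostFilter` and its methods are hypothetical, and matching is done on the exact host name rather than on a registered domain.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Nutch: accepts only URLs whose host
// also appears in the seed list, so external links are ignored.
public class SeedHostFilter {
    private final Set<String> allowedHosts = new HashSet<>();

    public SeedHostFilter(List<String> seedUrls) {
        for (String seed : seedUrls) {
            String host = hostOf(seed);
            if (host != null) {
                allowedHosts.add(host);
            }
        }
    }

    // Returns the lower-cased host of a URL, or null if it cannot be parsed.
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    // True when the URL's host is one of the seed hosts.
    public boolean accept(String url) {
        String host = hostOf(url);
        return host != null && allowedHosts.contains(host);
    }
}
```

With a seed list containing `http://lucene.apache.org/nutch/`, a call such as `accept("http://lucene.apache.org/nutch/bot.html")` passes while a link to any host outside the seed list is rejected.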
Hello,
I just created an issue in JIRA
http://issues.apache.org/jira/browse/NUTCH-79 containing the code for
fault tolerant searching. I think it is too late to include it in the 0.7
release, but I will wait for comments and test it in the meantime.
I would like to commit it when release would be
Piotr Kosiorowski wrote:
Looking around in JIRA I found out I cannot resolve an issue. I am not
sure how it works but I suspect I lack some rights to do so. Am I right?
I have added you to the nutch-developers Jira group. Now you should be
able to resolve issues, etc.
Doug
Thanks. It works.
Piotr
Doug Cutting wrote:
Piotr Kosiorowski wrote:
Looking around in JIRA I found out I cannot resolve an issue. I am
not sure how it works but I suspect I lack some rights to do so. Am I
right?
I have added you to the nutch-developers Jira group. Now you should be
Piotr Kosiorowski wrote:
So I have installed forrest and modified
src/site/src/documentation/content/xdocs.
Then I ran 'forrest', and it generated content in src/site/build/site.
And now the questions:
Should I copy src/site/build/site to site and commit it?
Yes. I'm impressed that you got
Is there any way to filter results to English via search, so I can set up a
multi-language search? I thought I saw somewhere that you could put
something into the HTML form, a switch sent while submitting the form, that
would use a plugin to filter the results. I know I had seen some benchmarks
on
No problem for me. I have just run the test crawl on
http://lucene.apache.org/nutch as described in the new tutorial, and a lot
of PDF and PNG files were causing big exceptions and stack traces in the
log. I thought that people (usually using Nutch for the first time)
might think that they did something
Jay Pound wrote:
1.) We need to split up chunks of data into sub-folders so as not to exceed
the filesystem's limit on the number of files in a single
directory, like the way squid splits up its data into directories.
I agree. I am currently using reiser with NDFS so this is
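A minimal sketch of the sub-folder idea, assuming each chunk is identified by a numeric id: the id is hashed into two directory levels of 256 entries each, so no single directory accumulates all the files. The class name `ShardedPath`, the `blk_` file prefix, and the two-level, 256-way fan-out are assumptions made for illustration; they are not taken from NDFS.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration, not NDFS code: spreads chunk files over
// root/<xx>/<yy>/ so that no single directory holds every file.
public class ShardedPath {

    // Maps a chunk id to root/<level1>/<level2>/blk_<id>, where both
    // levels are two-digit hex values derived from the id's hash.
    static Path pathFor(Path root, long blockId) {
        int h = Long.hashCode(blockId);
        int level1 = (h >>> 8) & 0xFF; // first directory level: 256 buckets
        int level2 = h & 0xFF;         // second directory level: 256 buckets
        return root.resolve(String.format("%02x", level1))
                   .resolve(String.format("%02x", level2))
                   .resolve("blk_" + blockId);
    }

    public static void main(String[] args) {
        System.out.println(pathFor(Paths.get("data"), 4660L));
    }
}
```

Because the directory is derived from the id alone, a reader can locate a chunk without scanning or keeping a separate index of directories.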
Hello,
We should probably change user agent string in nutch-default.xml to
point to Apache site. The only question is http.agent.version - should
we set it to 0.07 for release and 0.08-dev for future work? I do not
know how it was used previously.
Current values:
<property>
+1
Piotr Kosiorowski wrote:
Hello,
We should probably change user agent string in nutch-default.xml to
point to Apache site. The only question is http.agent.version - should
we set it to 0.07 for release and 0.08-dev for future work? I do not
know how it was used previously.
Current
[EMAIL PROTECTED] wrote:
- <value>http://www.nutch.org/docs/en/bot.html</value>
+ <value>http://lucene.apache.org/nutch/bot.html</value>
I think this should now be:
http://lucene.apache.org/nutch/bot.html
The docs/en pages have mostly been reduced to the about page, whose
translations I hate to
Hi,
can someone please tell me what is the technical difference between
org.apache.nutch.io.Writable and java.io.Externalizable?
For me that looks very similar and Externalizable is available since
jdk 1.1.
What do I miss?
Thanks for any hints.
Stefan
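To make the comparison concrete, here is a small sketch that implements both styles on one class. The `Writable` interface below is a local stand-in written from memory of the `write(DataOutput)` / `readFields(DataInput)` shape; treat its exact signature as an assumption rather than a copy of org.apache.nutch.io.Writable. The class `PageScore` and its fields are hypothetical. `java.io.Externalizable`, `ObjectOutput`, and `ObjectInput` are the standard JDK types.

```java
import java.io.*;

// Assumed stand-in for org.apache.nutch.io.Writable (signature from memory).
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical value class implementing both serialization styles.
public class PageScore implements Writable, Externalizable {
    String url = "";
    float score;

    public PageScore() {} // Externalizable requires a public no-arg constructor

    public PageScore(String url, float score) {
        this.url = url;
        this.score = score;
    }

    // Writable style: only the field data goes to the stream.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeFloat(score);
    }

    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        score = in.readFloat();
    }

    // Externalizable style: same field code, since ObjectOutput extends
    // DataOutput and ObjectInput extends DataInput.
    public void writeExternal(ObjectOutput out) throws IOException { write(out); }
    public void readExternal(ObjectInput in) throws IOException { readFields(in); }

    // Serializes with a plain DataOutputStream.
    static byte[] writableBytes(PageScore p) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            p.write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Serializes through ObjectOutputStream, which also writes a stream
    // header and a class descriptor before the field data.
    static byte[] externalizableBytes(PageScore p) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
                oos.writeObject(p);
            }
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The per-field code is identical in both styles; the visible difference in this sketch is the stream wrapper, since going through `ObjectOutputStream` adds a header and class metadata that the plain `DataOutput` path does not write.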
Stefan Groschupf wrote:
can someone please tell me what is the technical difference between
org.apache.nutch.io.Writable and java.io.Externalizable?
For me that looks very similar and Externalizable is available since
jdk 1.1.
What do I miss?
You don't miss much!
I avoided using Java's
What do others think?
I think RMI isn't a good idea. I wasted a lot of time with it. I
like the Nutch RPC very much.
However I think usage of Externalizable is a good idea, first it is a
very small change.
Second many users use nutch for very custom things and usage of
Externalizable