Paul Baclace wrote:
Here is a patch for improving the error message that is displayed
when an intranet crawl command line is given a file instead of a
directory of files containing URLs.
I have committed this to the mapred branch.
Thanks, Paul!
Doug
NDFS is not recommended in 0.7. The version of NDFS in the mapred
branch is much improved. Note, however, that the mapred branch is
substantially different from 0.7 and is still incomplete.
Doug
Transbuerg Tian wrote:
hi, all friends,
I downloaded Nutch 0.7 and want to use NDFS independently.
[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329474 ]
Doug Cutting commented on NUTCH-92:
---
A minor detail:
In Searcher, instead of
int[] getDocFreqs(Term[]);
The new method will probably have to be something like
public
I will postpone the merge of the mapred branch into trunk until I have a
chance to (a) add some MapReduce documentation; and (b) implement
MapReduce-based dedup.
Doug
Doug Cutting wrote:
Currently we have three versions of nutch: trunk, 0.7 and mapred. This
increases the chances
For now, look at the source for crawl/Crawl.java.
I'll try to add some documentation ASAP.
Doug
Steffen Viken Valvåg wrote:
Hi,
I'm playing around with the mapreduce branch, and got it working for a
simple intranet crawl by following the nutch tutorial on
Kelvin Tan wrote:
Each of these stages will be handled in its own thread (except for HTML parsing
and scoring, which may actually benefit from having multiple threads). With the
introduction of non-blocking IO, I think threads should be used only where
parallel computation offers performance
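A minimal, hypothetical sketch of the staged hand-off Kelvin describes (the stage names, queue size, and class are illustrative, not Nutch's actual fetcher code): two stages joined by a bounded queue, each running in its own thread, so that a CPU-bound stage such as parsing could later be given several worker threads:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical two-stage pipeline (fetch -> parse) joined by a bounded
// queue. Each stage runs in its own thread; the queue provides
// back-pressure when the parser falls behind the fetcher.
public class StagePipeline {
    static final String POISON = "__END__"; // sentinel that shuts down the parser

    public static List<String> run(List<String> urls) {
        BlockingQueue<String> fetched = new ArrayBlockingQueue<>(16);
        List<String> parsed = Collections.synchronizedList(new ArrayList<>());

        Thread fetcher = new Thread(() -> {
            try {
                for (String u : urls) fetched.put("content-of:" + u); // stand-in for network IO
                fetched.put(POISON);
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread parser = new Thread(() -> {
            try {
                for (String page = fetched.take(); !page.equals(POISON); page = fetched.take()) {
                    parsed.add(page.toUpperCase()); // stand-in for CPU-bound HTML parsing
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        fetcher.start(); parser.start();
        try { fetcher.join(); parser.join(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return parsed;
    }
}
```

The point of the bounded queue is exactly Kelvin's: threads mark stage boundaries, and only the stage that benefits from parallel computation needs more than one of them.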
Apache Wiki wrote:
1. The SVN repository consists of the following areas:
a. '''trunk''' [ ... ]
a. '''Release-x.x''' branches [ ... ]
This should also mention tags, fixed versions of the code where no
development occurs.
I also would prefer that tag names and branch names are distinct,
I assume that in most NDFS-based configurations the production search
system will not run out of NDFS. Rather, indexes will be created
offline for a deployment (i.e., merging things to create an index per
search node), then copied out of NDFS to the local filesystem on a
production search
[EMAIL PROTECTED] wrote:
I, too, am looking forward to this, but I am wondering what that will
do to Kelvin Tan's recent contribution, especially since I saw that
both MapReduce and Kelvin's code change how FetchListEntry works. If
merging mapred to trunk means losing Kelvin's changes, then I
Jeremy Bensley (sent by Nabble.com) wrote:
I have been experimenting with MapReduce to perform some distributed tasks
aside from the normal fetch/index routine of Nutch, and overall have had much
success.
I'm glad to hear this!
Today I have been experimenting with running extended duration
Jérôme Charron wrote:
Is svn.apache.org down, or is the problem on my side?
A good way to answer this is to look at:
http://monitoring.apache.org/status/
It looks like SVN is currently up. And it works for me too.
Doug
Piotr Kosiorowski wrote:
Is anyone working on preparing the release?
I am not.
If not I can spent some time on it in an hour or so.
+1
Thanks,
Doug
What API are you using to get hits, NutchBean or OpenSearchServlet? If
you're using OpenSearchServlet, then, with 1000 hits, most of your time
is probably spent constructing summaries. Do you need the summaries?
If not, use NutchBean instead, or modify OpenSearchServlet to not
generate
Piotr Kosiorowski wrote:
After making a tar I was trying to go through crawl tutorial.
- tar xvfz nutch-0.7.tar.gz
bin/nutch - is not executable (and nutch-daemon.sh too).
It is strange nobody reported it so far so it may still be my fault.
No, it looks like a problem with ant's tar task,
Jay Pound wrote:
is the org.apache.nutch.crawl package a part of the nightly builds?
No. Nightly builds are from trunk. The mapred code is in a separate
branch in subversion. After the 0.7 release, when the mapred branch is
folded into trunk, then it will be in nightly builds. Until then
Fuad Efendi wrote:
Which parameter should I pass to Crawl? Should it be a directory
containing something, and in which format?
As before, inject takes a flat text file of URLs, one per line. If you
wish to inject DMOZ urls, there is now a utility main() that will
convert the DMOZ file to such a file.
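Assuming the seed format described above (a flat text file, one URL per line; trimming whitespace and skipping blank lines are extra assumptions for robustness, not documented behavior), a minimal parser for such a file's contents might look like:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of reading an inject-style seed list: plain text, one URL per
// line. Blank lines are skipped; everything else is taken verbatim.
public class SeedList {
    public static List<String> parse(String text) {
        List<String> urls = new ArrayList<>();
        for (String line : text.split("\n")) {
            line = line.trim();
            if (!line.isEmpty()) urls.add(line);
        }
        return urls;
    }
}
```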
Piotr Kosiorowski wrote:
I think we all refer to 0.7 as the next release number (and 0.6 as
current), so nutch-default.xml contains the wrong version. In fact it
should still contain the -dev suffix.
To make this undocumented convention documented, I would also like to
suggest naming releases with an X.Y format and naming
Piotr Kosiorowski wrote:
I read your email ten times and still I am not sure
what the problem is.
The problem is with me.
Doug Cutting wrote:
[EMAIL PROTECTED] wrote:
- <value>http://www.nutch.org/docs/en/bot.html</value>
+ <value>http://lucene.apache.org/nutch/bot.html</value>
I clicked
Piotr Kosiorowski wrote:
Will do it tomorrow - I wanted to put down a kind of release checklist
in the Wiki - starting with where to change version numbers. I would
like to cover a release howto as well - but in fact I am not sure how
to make a release yet. I will try to gather this information.
A
Piotr Kosiorowski wrote:
Looking around in JIRA I found out I cannot resolve an issue. I am not
sure how it works but I suspect I lack some rights to do so. Am I right?
I have added you to the nutch-developers Jira group. Now you should be
able to resolve issues, etc.
Doug
Piotr Kosiorowski wrote:
So I have installed forrest and modified
src/site/src/documentation/content/xdocs.
Then run 'forrest'. And it generated content in src/site/build/site.
And now the questions:
Should I copy src/site/build/site to site and commit it?
Yes. I'm impressed that you got
Jay Pound wrote:
1.) we need to split up chunks of data into sub-folders so as not to run
up against the filesystem's physical limits on the number of files in a
single directory, like the way squid splits up its data into directories.
I agree. I am currently using reiser with NDFS so this is
+1
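The squid-style layout Jay describes can be sketched as a two-level hashed directory tree (the class name, fanout, and path format are hypothetical, not NDFS code): each file lands in a subdirectory derived from a hash of its name, so no single directory accumulates an unbounded number of entries.

```java
// Hypothetical sketch of squid-style directory sharding: spread files
// across a fixed two-level tree keyed by a hash of the file name.
// With fanout 256, each level has at most 256 subdirectories.
public class Sharder {
    public static String shardPath(String name, int fanout) {
        int h = name.hashCode();
        int level1 = Math.floorMod(h, fanout);          // first directory level
        int level2 = Math.floorMod(h / fanout, fanout); // second directory level
        return String.format("%02x/%02x/%s", level1, level2, name);
    }
}
```

The mapping is deterministic, so a file can always be located again from its name alone without scanning directories.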
Piotr Kosiorowski wrote:
Hello,
We should probably change user agent string in nutch-default.xml to
point to Apache site. The only question is http.agent.version - should
we set it to 0.07 for release and 0.08-dev for future work? I do not
know how it was used previously.
Current
[EMAIL PROTECTED] wrote:
- <value>http://www.nutch.org/docs/en/bot.html</value>
+ <value>http://lucene.apache.org/nutch/bot.html</value>
I think this should now be:
http://lucene.apache.org/nutch/bot.html
The docs/en pages have mostly been reduced to the about page, whose
translations I hate to
Stefan Groschupf wrote:
can someone please tell me what is the technical difference between
org.apache.nutch.io.Writable and java.io.Externalizable?
For me that looks very similar and Externalizable is available since
jdk 1.1.
What do I miss?
You don't miss much!
I avoided using Java's
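The practical difference Doug is alluding to can be sketched with a minimal Writable-style class (hypothetical, not Nutch's actual interface): the object reads and writes only its own fields through DataInput/DataOutput, with no class descriptors in the stream, whereas Externalizable still runs through ObjectOutputStream, which records class metadata alongside the data:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Minimal sketch of the Writable idea: serialization is just the raw
// field values in a known order, nothing else. The reader must already
// know which class it is reconstructing.
public class UrlDatum {
    String url;
    long fetchTime;

    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(fetchTime);
    }

    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        fetchTime = in.readLong();
    }

    // Convenience round-trip helpers for illustration.
    public byte[] toBytes() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            write(new DataOutputStream(bos));
            return bos.toByteArray();
        } catch (IOException e) { throw new RuntimeException(e); }
    }

    public static UrlDatum fromBytes(byte[] bytes) {
        try {
            UrlDatum d = new UrlDatum();
            d.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return d;
        } catch (IOException e) { throw new RuntimeException(e); }
    }
}
```

Because nothing but field values is written, the format is compact and fast, at the cost of no self-description or versioning in the stream itself.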
Stefan Groschupf wrote:
http://wiki.apache.org/nutch/Presentations
Can you explan what this means: Page 20:
- scheduling is the bottleneck, not disk, network or CPU?
I mean that neither the CPUs, disks or network are at 100% of capacity.
Disks are running around 50% busy, CPUs a bit higher, and
Jay Pound wrote:
Doug, I also ran into this when I was testing NDFS: the system would have
to wait for the namenode to tell the datanodes what data to receive and
which data to replicate.
When did you test this? Which version of Nutch? How many nodes? My
benchmark results from just a few days
Fredrik Andersson wrote:
I just ported a lot of old 0.6 code to 0.7-dev/mapred. Lots of stuff
has changed I see! One thing I can't quite grasp though, is why the
Hit.getScore() has been removed in favour of the TopDocs-thingie
instead?
Hit.getScore() was generalized to Hit.getSortValue() in
Ilia S. Yatsenko wrote:
Why hits.getTotal() ignore hitsPerSite?
hits.getTotal() always returns the total number of hits, regardless of
site. hitsPerSite is a filter on hits as they are displayed. This is
the way Google and Yahoo handle this too. Search for NutchAnalysis
there. If you look
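A sketch (not Nutch's actual implementation) of the distinction Doug describes: the total counts every hit, while hitsPerSite only caps how many hits per site are displayed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative per-site display filter: the full hit list is untouched
// (so getTotal() would still report all hits); only the displayed list
// is limited to hitsPerSite entries per site.
public class SiteCollapse {
    // Each hit is modeled as {site, url} for simplicity.
    public static List<String> display(List<String[]> hits, int hitsPerSite) {
        Map<String, Integer> perSite = new HashMap<>();
        List<String> shown = new ArrayList<>();
        for (String[] hit : hits) {
            int n = perSite.merge(hit[0], 1, Integer::sum); // count hits seen per site
            if (n <= hitsPerSite) shown.add(hit[1]);
        }
        return shown;
    }
}
```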