Stefan Neufeind wrote:
Can you maybe also help me out with sort=title?
Lucene's sorting works with indexed, non-tokenized fields. The title field is
tokenized. If you need to sort by title then you'd need to add a plugin
that indexes another field (e.g., sortTitle) containing the
un-tokenized
The .job file is a jar file for submission to Hadoop's MapReduce. It is
Hadoop-specific, although very similar to war and ear files.
Teruhiko Kurosaka wrote:
Nutch's top-level build.xml file's default target is job,
and it builds a zip file called nutch-0.8-dev.job.
<project name="Nutch" ...>
Andrzej Bialecki wrote:
0.8 is pretty stable now, I think we should start considering a release
soon, within the next month's time frame.
+1
Are there substantial features still missing from 0.8 that were
supported in 0.7?
Are there any showstopping bugs, things that worked in 0.7 that are
The nightly build is not mirrored. It is only available from
cvs.apache.org, which has been down, but is now up.
http://cvs.apache.org/dist/lucene/nutch/nightly/
Note that no nightly build was done last night, since Subversion was down.
Doug
Michael Plax wrote:
I tried randomly some of
Andrzej Bialecki wrote:
Unfortunately, this is still an existing problem, and neither Nutch nor
Lucene does the right job here. Please see NUTCH-92 for more
information, and a sketch of a solution for this issue.
Lucene's MultiSearcher now implements this correctly, no? But Nutch's
It sounds like you're sorting a segment index after dedup, rather than a
merged index. It also looks like there's a bug in IndexSorter. But you
should be able to work around it by merging your segment indexes after
deduping, so there are no deletions.
Please file a bug in Jira.
Doug
Andrzej Bialecki wrote:
I think it should be possible to put your binary at the Apache site,
probably Doug will be the right person to talk to ...
Have you tried attaching it to a Jira issue?
If that fails, you could attach it to a page on the Wiki, no?
Doug
Chris Fellows wrote:
I'm having what appears to be the same issue on 0.8
trunk. I can get through inject, generate, fetch and
updatedb, but am getting the IOException: No input
directories on invertlinks and cannot figure out why.
I'm only using nutch on a single local windows
machine. Any
NutchBean.getContent() and NutchBean.getParseData() do this, but require
a HitDetails instance. In the non-distributed case, the only required
field of the HitDetails for these calls is url. In the distributed
case, the segment field must also be provided, so that the request can
be routed
Dennis Kubes wrote:
I think that I am not fully understanding the role
the segments directory and its contents play.
A segment is simply a set of urls fetched in the same round, and data
associated with these urls. The content subdirectory contains the raw
http content. The parse-text
[EMAIL PROTECTED] wrote:
First question. Updatedb won't run against the segment so what can I do to
salvage it? Is the segment salvageable?
Probably. I think you're hitting some current bugs in DFS MapReduce.
Once these are fixed, your updatedb runs should succeed!
Second question,
[EMAIL PROTECTED] wrote:
Actually, I think that updatedb won't run because the fetched segment
didn't
complete correctly. Don't know whether the instructions in the 0.7 FAQ
apply:
%touch /index/segments/2005somesegment/fetcher.done
Ah. That's different. No, the 0.7 trick probably won't
Scott Simpson wrote:
I don't quite understand how to set up distributed searching with
relation to DFS (and the Tom White documents don't discuss this either).
There are three databases with relation to Nutch:
1. Web database (dfs)
2. Segments (regular fs)
3. The index (regular fs)
From your
Folks can say whether they'll attend at:
http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1
Doug
Shawn Gervais wrote:
I was not able to use the literal instructions, as my indexes and
segments are in DFS while the document presumes a local filesystem
installation
Search performance is not good with DFS-based indexes and segments. This
is not recommended.
Distributed search is not meant
Elwin wrote:
When I use the httpclient.HttpResponse to get http content in nutch, I often
get SocketTimeoutExceptions.
Can I solve this problem by enlarging the value of http.timeout in conf
file?
Perhaps, if you're working with slow sites. But, more likely, you're
using too many fetcher
Jason Camp wrote:
Unfortunately in our scenario, bandwidth is cheap at our fetching datacenter,
but adding additional disk capacity is expensive - so we are fetching
the data and sending it back to another cluster (by exporting segments
from ndfs, copy, importing).
But to perform the copies, you're
mikeyc wrote:
Any idea how the 'plugins' directory gets populated? I noticed
microformats-hreview was not there. It does exist in the build directory
with its jar and class files. Could this be the issue?
The plugins directory exists in release builds. When developing,
plugins live in
Shawn Gervais wrote:
When I have been at the terminal to observe the timed out process before
it is reaped, I have seen that it continues to use 100% of a single
processor. strace of the java process did not produce any usable leads.
When the reduce task is reassigned, either to the same
Shawn Gervais wrote:
When I perform a crawl large enough to observe the fetch process for
an extended period of time (1M pages over 16 nodes, in this case), I
notice there is one map task which performs _very_ poorly compared to
the others:
4905 pages, 33094 errors, 3.5 pages/s, 432 kb/s,
Ken Krugler wrote:
Anyway, curious if anybody has insights here. We've done a fair amount
of poking around, to no avail. I don't think there's any way to get the
blocks back, as they definitely seem to be gone, and file recovery on
Linux seems pretty iffy. I'm mostly interested in figuring out
You can limit the number of pages by using the -topN parameter. This
limits the number of pages fetched in each round. Pages are prioritized
by how well-linked they are. The maximum number of pages that can be
fetched is topN*depth.
Doug
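To make the arithmetic concrete, here is a minimal sketch of a crawl command using these options (the directory names are examples, not taken from this thread):
% bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
With -depth 3 and -topN 1000, each of the three rounds fetches at most 1000 pages, so the whole crawl fetches at most 3000 pages.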
Olena Medelyan wrote:
Hi,
I'm using the crawl
Blocks are not deleted immediately. Check back in a while to see that
they're actually removed.
Doug
Dennis Kubes wrote:
Is there a way to delete files from the DFS? I used the dfs -rm option, but
the data blocks still are there.
Dennis
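A hedged sketch of the check (paths are examples; the shell prefix shown is bin/hadoop, which may be bin/nutch ndfs in older checkouts):
% bin/hadoop dfs -rm /user/nutch/old-data
% bin/hadoop dfs -ls /user/nutch
The block files on the datanodes are only removed once the namenode tells them to invalidate the blocks, so the space is reclaimed a little while after the -rm.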
Dennis Kubes wrote:
Here it is for the list, I will try to put it on the wiki as well.
Thanks for writing this!
I've added a few comments below.
Some things are assumed for this tutorial. First, you will need root level
access to all of the boxes you are deploying to.
Root access should
Dennis Kubes wrote:
localhost:9000: command-line: line 0: Bad configuration option:
ConnectTimeout
devcluster02:9000: command-line: line 0: Bad configuration option:
ConnectTimeout
[ ... ]
localhost:9000: command-line: line 0: Bad configuration option:
ConnectTimeout
devcluster02:9000:
Dennis Kubes wrote:
: command not foundlaves.sh: line 29:
: command not foundlaves.sh: line 32:
localhost: ssh: \015: Name or service not known
devcluster02: ssh: \015: Name or service not known
And still getting this error:
060316 175355 parsing file:/nutch/search/conf/hadoop-site.xml
Jérôme Charron wrote:
I can reproduce this with nutch-0.8 with the neko html parser (it seems that script
tags are not removed).
You can switch the html parser implementation to tagsoup. In my tests, all
is ok.
(property parser.html.impl)
Should we switch the default from neko to tagsoup? Are there
Olive g wrote:
Is hadoop/nutch scalable at all, or can I tune some other parameters?
I'm not sure what you're asking. How long does it take to run this on a
single machine? My guess is that it's much longer. So things are
scaling: they're running faster when more hardware is added. In all
relevant posts in the mailing list archive, but I think I'm
missing something. For example, here's a snippet from a post from Doug
Cutting:
<snip>
that said, one can implement OR as a filter (replacing or altering
BasicQueryFilter) that scans for terms whose text is OR in the
default field.
</snip>
I just fixed this.
Thanks,
Doug
ArentJan Banck wrote:
on: http://lucene.apache.org/nutch/issue_tracking.html
http://nagoya.apache.org/jira/browse/Nutch no longer works.
Should be: http://issues.apache.org/jira/browse/Nutch
- Arent-Jan
Andrzej Bialecki wrote:
What I infer is,
1. For every refetch, the score of files (but not the directory) is
increasing
This is curious, it should not be so. However, it's the same in the
vanilla version of Nutch (without this patch), so we'll address this
separately.
The OPIC
David Odmark wrote:
So am I correct in believing that in order to implement boolean OR using
Nutch search and a QueryFilter, one must also (minimally) hack the
NutchAnalysis.jj file to produce a new analyzer? Also, given that a
Nutch Query object doesn't seem to have a method to add a
Andrzej Bialecki wrote:
Doug Cutting wrote:
are refetched, their links are processed again. I think the easiest
way to fix this would be to change ParseOutputFormat to not generate
STATUS_LINKED crawldata when a page has been refetched. That way
scores would only be adjusted for links
Byron Miller wrote:
Anything I should change/tweak on my fetcher config
for the .8 release? I'm only getting 5 pages/sec and I was
getting nearly 50 on .7 with 125 threads going. Does
.8 not use threads like .7 did?
Byron,
Have you tried again more recently? A number of bugs have been fixed in
Jon Blower wrote:
My guess is that the source program is not available on your version of
FreeBSD. Try running the source program (with no arguments) from the
command line or type man source. Do you see anything? If not, you
probably don't have the source program, which is called by the
Richard Braman wrote:
when you get an error while fetching, and you get the
org.apache.nutch.protocol.retrylater because the max retries have been
reached, nutch says it has given up and will retry later, when does that
retry occur? How would you make a fetchlist of all urls that have
failed?
Vanderdray, Jacob wrote:
I've changed the language a bit. If you're interested, take a
look:
http://wiki.apache.org/nutch/NutchTutorial
This looks great! Thanks so much for adding this to the wiki!
We might add something to the Step-by-Step introduction to the effect
that: This
Richard Braman wrote:
Can someone confirm this:
You start a crawldb from a list of urls and you generate a fetch list,
which is akin to seeding your crawldb. When you fetch it just fetches
those seed urls.
When you do your next round of generate/fetch/update, the fetch list
will have the
Richard Braman wrote:
I really do think nutch is great, but I echo Matthias's comments that the
community needs to come together and contribute more back. And that
comes with the requirement of making sure volunteers are given access to
make their contributions part of the project.
Here's how
David Wallace wrote:
Also, I've lost count of the number of times someone has posted
something to the effect of "I'll pay someone to give me Nutch support",
simply because they find the existing documentation and mailing lists
inadequate. Usually, that person gets told that the best way to get
Florent Gluck wrote:
In the hadoop jobtracker's log, I can see several tasks being lost as follows:
060306 184155 Aborting job job_hyhtho
060306 184156 Task 'task_m_7qgat2' has been lost.
060306 184156 Aborting job job_hyhtho
060306 184156 Task 'task_m_lph5qs' has been lost.
060306 184156 Aborting
Monu Ogbe wrote:
Caused by: java.lang.InstantiationException:
org.apache.nutch.searcher.Query
at java.lang.Class.newInstance0(Unknown Source)
at java.lang.Class.newInstance(Unknown Source)
at
org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.jav
It
Matthias Jaekle wrote:
Maybe we should move the tutorial to the wiki so it can be commented on.
+1
+1
Doug
It looks like the child JVM is silently exiting. The error reading
child output just shows that the child's standard output has been
closed, and the child error says the JVM exited with non-zero.
Perhaps you can get a core dump by setting 'ulimit -c' to something big.
JVM core dumps can
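A small sketch of the idea on a tasktracker node, assuming a bash shell (the daemon script name may differ in your checkout):
% ulimit -c unlimited
% bin/nutch-daemon.sh start tasktracker
If a child JVM then dies hard, it should leave a core file in the task's working directory that can be inspected with a debugger.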
0.7 and 0.8 are not compatible. You need to re-crawl. Sorry!
Once we have a 1.0 release then we'll make sure things are back-compatible.
Doug
Martin Gutbrod wrote:
I changed from 0.7.1 to one of the latest nightly builds (0.8) and
now search for url: fields fail. E.g. [ url:my.doman.com ]
Vanderdray, Jacob wrote:
I get the same thing from my linux box. The only reference I can find
to linkmap.html is a commented out line in forrest.properties.
FWIW: I've already made the changes to my copy of mailing_lists.xml. Let me
know if you want me to just send someone that.
Have you edited conf/hadoop-env.sh, and defined JAVA_HOME there?
Doug
Håvard W. Kongsgård wrote:
I am unable to set java_home in bin/hadoop, is there a bug? I have used
nutch 0.7.1 with the same java path.
localhost: Error: JAVA_HOME is not set.
if [ -f $HADOOP_HOME/conf/hadoop-env.sh ];
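For reference, a minimal sketch of the hadoop-env.sh edit (the JDK path below is an assumption; substitute your own):
# conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/j2sdk1.5-sun
Daemons started over ssh do not inherit your login environment, so defining JAVA_HOME in this file, rather than only in .bashrc, is what makes it visible to them.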
Rafit Izhak_Ratzin wrote:
I just checked out the latest svn version (376446) and built it from scratch.
When I tried to run the jobtracker I got the following message in the
jobtracker log file:
060209 164707 Property 'sun.cpu.isalist' is
Exception in thread "main" java.lang.NullPointerException
Michael Nebel wrote:
I upgraded to the last version from the svn today. After having some
nuts and bolts fixes (missing hadoop-site.xml, webapps-dir).
I just fixed these issues.
I finally
tried to inject a new set of urls. Doing so, I get the exception below.
I am not seeing this. Are you
Michael Nebel wrote:
Now it's complaining about a missing class
org/apache/nutch/util/LogFormatter :-(
That's been moved to Hadoop: org.apache.hadoop.util.LogFormatter.
Doug
The file packaged in the jar is used for the defaults. It is read from
the jar file. So it should not need to be committed to Nutch.
Mike Smith wrote:
There is no settings file for Hadoop in conf/. Should it be
hadoop-default.xml?
It seems this file is not committed but it is packaged into
Chris Schneider wrote:
Also, since we've been running this crawl for quite some time, we'd like
to preserve the segment data if at all possible. Could someone please
recommend a way to recover as gracefully as possible from this
condition? The Crawl.main process died with the following
Steve Betts wrote:
I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster,
but it does allow it to complete.
I find xpdf much faster than PDFBox.
http://www.mail-archive.com/nutch-dev@incubator.apache.org/msg00161.html
Does this work any better for you?
Doug
Chris Schneider wrote:
I'm trying to bring up a MapReduce system, but am confused about how to
control the logging level. It seems like most of the Nutch code is still
logging the way it used to, but the -logLevel parameter that was getting
passed to each tool's main() method no longer exists
Michael Plax wrote:
Question summary:
Q: How can I set up the crawler to index an entire web site?
I'm trying to run crawl with command from tutorial
1. In urls file I have start page (index.html).
2. In the configuration file conf/crawl-urlfilter.txt domain was changed.
3. I run: $ bin/nutch
Florent Gluck wrote:
I then decided to switch to using the old http protocol plugin:
protocol-http (in nutch-default.xml) instead of protocol-httpclient
With the old protocol I got 5 as expected.
There have been a number of complaints about unreliable fetching with
protocol-httpclient, so
Matt Zytaruk wrote:
I am having this same problem during the reduce phase of fetching, and
am now seeing:
060119 132458 Task task_r_obwceh timed out. Killing.
That is a different problem: a different timeout. This happens when a
task does not report status for too long; it is then assumed
Matt Kangas wrote:
Doug, would it make sense to print a LOG.info() message every time the
fetcher bumps into one of these db.max limits? This would help users
find out when they need to adjust their configuration.
I can prepare a patch if it seems sensible.
Sure, this is sensible. But it's
Insurance Squared Inc. wrote:
I'm trying to determine if there's a better way to whitelist a large
number of domains than just adding them as a regular expression in the
filter.
Have a look at the urlfilter-prefix plugin. This is more efficient for
filtering urls by a large list of domains.
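As a hedged illustration, the plugin's whitelist file is just one allowed URL prefix per line (the file name below is the plugin's conventional default; check the urlfilter.prefix.file property and make sure the plugin is listed in plugin.includes):
% cat conf/prefix-urlfilter.txt
http://www.example.com/
http://news.example.org/
URLs that do not start with one of the listed prefixes are dropped, which scales much better than one huge regular-expression alternation.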
Neal Whitley wrote:
Now here's another question.
How can I obtain the exact number of search results being displayed on the
screen? I have been fishing around and cannot find a variable being
output to the page with this data.
In my example below 81 total matches were found. But because of the
Pushpesh Kr. Rajwanshi wrote:
Just wanted to confirm: this distributed crawl you
did, was it using nutch version 0.7.1 or some other version? And was that a
successful distributed crawl using map reduce or some workaround for
distributed crawl?
No, this is 0.8-dev. This was using in early
Teruhiko Kurosaka wrote:
Can I use MapReduce to run Nutch on a multi CPU system?
Yes.
I want to run the index job on two (or four) CPUs
on a single system. I'm not trying to distribute the job
over multiple systems.
If MapReduce is the way to go,
do I just specify config parameters
David Wallace wrote:
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code. I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.
Note that this is only done when collecting anchor texts, not when
Earl Cahill wrote:
Any chance you could walk through your implementation?
Like how the twenty boxes were assigned? Maybe
upload your confs somewhere, and outline what commands
you actually ran?
All 20 boxes are configured identically, running a Debian 2.4 kernel.
These are dual-processor
Pushpesh Kr. Rajwanshi wrote:
I want to know if anyone is able to successfully run a distributed crawl on
multiple machines, crawling millions of pages, and how hard it is to
do that. Do I just have to do some configuration and setup, or some
implementation work as well?
I recently performed a
Can you please describe the higher-level problem you're trying to solve?
Doug
Matt Zytaruk wrote:
Hello,
I am trying to implement a system where to get the score for certain
documents in a query, I need to average the score of two different
documents for that query. Does anyone have any
Nguyen Ngoc Giang wrote:
I'm writing a small program which just utilizes Nutch as a crawler only,
with no search functionality. The program should be able to return page
content given an url input.
In the mapred branch this is directly supported by NutchBean.
Doug
Did you update the crawldb after the first fetch? The mapred crawler
does not update the next-fetch date of pages when the fetch list is
generated, as 0.7 did. So, until that changes, you must update the
crawldb before you next generate a fetch list.
Doug
Florent Gluck wrote:
Hi,
As a
Florent Gluck wrote:
8. invertlinks linkdb segments/SEG_NAME
This should be instead:
invertlinks linkdb segments
Doug
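Pulling this and the surrounding answers together, a hedged sketch of one 0.8-style round on the command line (directory names are examples; capturing the newest segment with ls is just one way to do it):
% bin/nutch generate crawl/crawldb crawl/segments -topN 1000
% s=`ls -d crawl/segments/* | tail -1`
% bin/nutch fetch $s
% bin/nutch updatedb crawl/crawldb $s
% bin/nutch invertlinks crawl/linkdb crawl/segments
% bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $s
Note the updatedb before the next generate, as described above.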
Ben Halsted wrote:
When I checked the fetch status pages in the JobTracker web GUI I saw that I
was getting, on average, more errors than pages.
95 pages, 119 errors, 1.0 pages/s, 63 kb/s
Is there a way to find out what the errors are?
Look in the tasktracker logs. Typically they're max delays
Thomas Delnoij wrote:
So, say I want to set up a machine as a DataNode that has two or more disks,
do I have to configure and set up a DataNode daemon for every disk? How else
could I use all disks if the ndfs.data.dir property only accepts one path
(assuming I don't want to rely on MS Windows'
Håvard W. Kongsgård wrote:
- I want to index about 50 – 100 sites with lots of documents, is it
best to use the Intranet Crawling or Whole-web Crawling method?
The intranet style is simpler and hence a good place to start. If it
doesn't work well for you then you might try the whole-web style.
Matt Zytaruk wrote:
Indeed, that does work, although that ends up slowing down the fetch a
fair amount because a lot of threads end up idle, waiting, and I was
hoping to avoid that slowdown if possible.
What should these threads be doing?
If you have a site with N pages to fetch, and you
Matt Zytaruk wrote:
Well, if we want to fetch pages from N different sites, ideally we
should be able to have N threads running, without any of them having to
wait. I guess ideally what the fetcher should probably do is instead of
waiting, put the url it was trying to fetch back into the queue
Ben Halsted wrote:
I'm getting the dreaded "Too many open files" error.
I've checked my system settings for file-max:
$ cat /proc/sys/fs/file-nr
2677 1945 478412
$ cat /proc/sys/fs/file-max
478412
What does 'ulimit -n' print? Look in /etc/security/limits.conf to
increase the limit.
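A hedged illustration (the user name and values are examples): first see what the per-process limit actually is,
% ulimit -n
1024
and if it is the common 1024 default, raise it for the user running the fetcher by adding lines like these to /etc/security/limits.conf, then logging in again:
nutch soft nofile 16384
nutch hard nofile 16384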
What
Ben Halsted wrote:
I've modified the auto-crawl to always use a pre-existing crawldb. If I run
it multiple times I get multiple linkdb, segments, indexes, and index
directories.
Is it possible to merge the results using the bin/nutch commands?
You should also have it use a single linkdb.
Ben Halsted wrote:
I was wondering what the required file structure is for the web gui to work
properly.
Are all of these required?
/db/crawldb
/db/index
/db/indexes
/db/segments
/db/linkdb
The indexes directory is not used when a merged index is present.
The crawldb and
Ben Halsted wrote:
When I merge this stuff, do I need to merge the segments/* for each crawl
into a single segments directory? Or is there data in the merged index file
that will direct the web component to the correct segment?
Put the segments in a single directory. The index only has the
James Nelson wrote:
I need to sort the search results on two fields for a project I'm
working on, but nutch only seems to support sorting on one. I'm
wondering if I missed something and there is actually a way or if
there is a reason for restricting sort to one field that I'm not aware
of.
The explain page lists all stored fields by calling the toHtml()
method of HitDetails. You can also list things with:
for (int i = 0; i < detail.getLength(); i++) {
String field = detail.getField(i);
String value = detail.getValue(i);
...
}
Doug
Byron Miller wrote:
I'm looking to see
It looks like you are using ndfs but not running any datanodes. An ndfs
filesystem requires one namenode and at least one datanode, typically a
large number running on different machines. Look at the
bin/start-all.sh script for an example of what is started in a typical
mapred/ndfs
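For orientation, the usual way to bring the daemons up is the script Doug mentions; roughly (the host list comes from conf/slaves):
% bin/start-all.sh
This starts a namenode and a jobtracker on the local machine, plus a datanode and a tasktracker on every host listed in conf/slaves.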
Ken van Mulder wrote:
Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher
gets progressively slower though, dropping down to about ~15 pages/s
after about 2-3 hours or so and continues to slow down. I've seen a few
references on these lists to the issue, but I'm not clear on
Byron Miller wrote:
For example I've been tweaking max merge/min merge and
such and I've been able to double my performance
without increasing anything but cpu load.
Smaller maxMergeDocs will cost you in the end, since these will
eventually be merged during the index optimization at the end.
Ken Krugler wrote:
We're only using the html text parsers, so I don't think that's the
problem. Plus we're dumping the thread stack when it hangs, and it's always
in the ChunkedInputStream.exhaustInputStream() process (see trace below).
The trace did not make it.
Have you tried protocol-http
Ken van Mulder wrote:
As a side note, does anyone have any recommendations for profiling
software? I've used the standard hprof, which slows down the process too
much for my needs, and jmp, which seems pretty unstable.
I recommend 'kill -QUIT' as a poor-man's profiler. With a few stack
dumps
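A quick sketch of that poor-man's profiling (any way of finding the JVM's pid works; jps exists on 1.5 JDKs):
% ps ax | grep java
% kill -QUIT <pid>
Each SIGQUIT makes the JVM print a full thread dump to its stdout, so it lands in the task or daemon log; a handful of dumps taken a few seconds apart shows where the time is going.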
Byron Miller wrote:
<property>
  <name>indexer.mergeFactor</name>
  <value>350</value>
  <description>
  </description>
</property>
Initially the high index merge factor caused out-of-file-handle
errors, but increasing the others along with it
seemed to help get around that.
That is a very large mergeFactor,
The only link on http://shopthar.com/ to the domain shopthar.com is a
link to http://shopthar.com/. So a crawl starting from that page that
only visits pages in shopthar.com will only find that one page.
% wget -q -O - http://shopthar.com/ | grep shopthar.com
<tr><td colspan=2>Welcome to
TL wrote:
You mentioned that as a rule of thumb each node should
only have about 20M pages. What's the main bottleneck
that's encountered around 20M pages? Disk I/O, CPU
speed?
Either or both, depending on your hardware, index, traffic, etc.
CPU-time to compute results serially can average
Murray Hunter wrote:
We tested search for a 20 Million page index on a dual core 64 bit machine
with 8 GB of ram using storage of the nutch data on another server through
linux nfs, and its performance was terrible. It looks like the bottleneck
was nfs, so I was wondering how you had your
Andy Lee wrote:
Not to become a one-person thread or anything (and I'll shut up if this
attempt gets no answers), but this seems like a straightforward
question. Is there some design principle I'm missing that would be
violated if clauses could be removed from a query?
No, not that I can
Andy Lee wrote:
Thanks, Doug. In that case, please consider this a request for a
couple of API changes which you may be planning anyway:
* addClause() and removeClause() methods in Query.
* Setters in Query.Clause for its term/phrase.
Please submit a bug report, ideally with a patch file
Ngoc Giang Nguyen wrote:
I'm running Nutch to crawl some specific websites where I know the web admins
personally. So is there any way to change the settings of the target web
servers such that they give my Nutch higher priority, let's say unlimited
access, assuming they are all Apache servers?
Earl Cahill wrote:
1. Sounds like some of you have some glue programs
that help run the whole process. Are these going to
end up in subversion sometime? I am guessing there is
much duplicated effort.
I'm not sure what you mean. I set environment variables in my .bashrc,
then simply use
Rod Taylor wrote:
Tell me how it behaves during the sort phase.
I ran 8 jobs simultaneously. Very high await time (1200) and it was
doing about 22MB/sec data writes. Nearly 0 reads from disk (everything
would be cached in memory).
This is during the sort part? This first writes a big file,
Ilya Kasnacheev wrote:
So I only get the HTMLised version, which is useless if I need only the page
rating (top Explanation.getValue()). How would I get the page rating (i.e. a
number from 0 to 1 showing how relevant a Hit was to a Query) from nutch?
Explanations are not a good way to get this, as, for each
Paul van Brouwershaven wrote:
The AcceptEnv option is only available with ssh 3.9. Debian currently
only has 3.8.1p1 in stable and testing (4.2 in unstable).
Is there another way to solve the env. problem?
I don't know. The Fedora and Debian systems that I use have AcceptEnv.
Doug
Rod Taylor wrote:
I see. Is there any way to speed up this phase? It seems to be taking as
long to run the sort phase as it did to download the data.
It would appear that nearly 30% of the time for the nutch fetch segment
is spent doing the sorts, so I'm well off the 20% overhead number you
Rod Taylor wrote:
Virtually no IO reported at all. Averages about 200kB/sec read and
writes are usually 0, but burst to 120MB/sec for under 1 second once
every 30 seconds or so.
That's strange. I wonder what it's doing. Can you use 'kill -QUIT' to
get a thread dump? Try a few of these to
Try the following on your system:
bin/nutch org.apache.nutch.io.TestSequenceFile -fast -count 2000
-megabytes 100 foo
Tell me how it behaves during the sort phase.
Thanks,
Doug
Gal Nitzan wrote:
I only have two log files:
-rw-r--r-- 1 root root 8090 Oct 3 07:01
nutch-root-jobtracker-kunzon.log
-rw-r--r-- 1 root root 4290 Oct 3 07:01 nutch-root-namenode-kunzon.log
The tasktracker logs would be on the machines running the tasktracker,
which might be