Re: Sorting in nutch-webinterface - how?
Stefan Neufeind wrote: Can you maybe also help me out with sort=title? Lucene's sorting works with indexed, non-tokenized fields. The title field is tokenized. If you need to sort by title then you'd need to add a plugin that indexes another field (e.g., sortTitle) containing the un-tokenized title, perhaps lowercased, if you want case-independent sorting. http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Sort.html Doug
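For illustration, a minimal Lucene-level sketch of that idea (not a complete Nutch indexing plugin); it assumes the Lucene 1.9-era Field/Sort API that 0.8 ships with, and "sortTitle" is just the example field name from above:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

public class TitleSortSketch {
  // Indexing side: add an un-tokenized (here also lowercased) copy of the title,
  // since Lucene can only sort on indexed, non-tokenized fields.
  public static void addSortTitle(Document doc, String title) {
    doc.add(new Field("sortTitle", title.toLowerCase(),
                      Field.Store.NO, Field.Index.UN_TOKENIZED));
  }

  // Search side: ask Lucene to sort hits on that field.
  public static Hits searchByTitle(IndexSearcher searcher, Query query) throws IOException {
    return searcher.search(query, new Sort("sortTitle"));
  }
}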
Re: .job file?
The .job file is a jar file for submission to Hadoop's MapReduce. It is Hadoop-specific, although very similar to war and ear files. Teruhiko Kurosaka wrote: Nutch's top-level build.xml file's default target is job, and it builds a zip file called nutch-0.8-dev.job. <project name="Nutch" default="job"> ... <target name="job" depends="compile"> <jar jarfile="${build.dir}/${final.name}.job"> <zipfileset dir="${build.classes}"/> <zipfileset dir="${conf.dir}" excludes="*.template"/> <zipfileset dir="${lib.dir}" prefix="lib" includes="**/*.jar" excludes="hadoop-*.jar"/> <zipfileset dir="${build.plugins}" prefix="plugins"/> </jar> </target> I've heard of .jar, .war, and .ear files, but not .job files. What is this? What (application servers?) are supposed to understand .job files? Is this part of the new J2EE spec? -kuro
0.8 release soon?
Andrzej Bialecki wrote: 0.8 is pretty stable now, I think we should start considering a release soon, within the next month's time frame. +1 Are there substantial features still missing from 0.8 that were supported in 0.7? Are there any showstopping bugs, things that worked in 0.7 that are broken in 0.8? Doug
Re: Can't access nightly build nutch 0.8
The nightly build is not mirrored. It is only available from cvs.apache.org, which has been down, but is now up. http://cvs.apache.org/dist/lucene/nutch/nightly/ Note that no nightly build was done last night, since Subversion was down. Doug Michael Plax wrote: I tried randomly some of them (~10). I will try again. Thank you, Michael - Original Message - From: Jérôme Charron [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Thursday, May 11, 2006 12:39 PM Subject: Re: Can't access nightly build nutch 0.8 I'm trying (5/10-5/11) to download nightly build of nutch but I get The page cannot be displayed. Do you have some catalina logs? Oups... sorry... you get a page cannot be displayed while loading? (if so, forgot my previous message ... there is currently some problems with some apache servers... try a mirror) Jérôme
Re: MultiSearcher skewed IDF values
Andrzej Bialecki wrote: Unfortunately, this is still an existing problem, and neither Nutch nor Lucene does the right job here. Please see NUTCH-92 for more information, and a sketch of a solution for this issue. Lucene's MultiSearcher now implements this correctly, no? But Nutch's distributed search does not. Two round trips to each node are required: the first to get IDF information for the query, and the second to get hits. Doug
Re: Problem with sorting index
It sounds like you're sorting a segment index after dedup, rather than a merged index. It also looks like there's a bug in IndexSorter. But you should be able to work around it by merging your segment indexes after deduping, so there are no deletions. Please file a bug in Jira. Doug Michael wrote: When i'm trying to use IndexSorter, i'm getting this error: Exception in thread main java.lang.IllegalArgumentException: attempt to access a deleted document at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:282) at org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:104) at org.apache.nutch.indexer.IndexSorter$SortingReader.document(IndexSorter.java:170) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:186) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88) at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:579) at org.apache.nutch.indexer.IndexSorter.sort(IndexSorter.java:240) at org.apache.nutch.indexer.IndexSorter.main(IndexSorter.java:291) Anyone knows how to fix this? Michael
Re: Admin Gui beta test (was Re: ATB: Heritrix)
Andrzej Bialecki wrote: I think it should be possible to put your binary at the Apache site, probably Doug will be the right person to talk to ... Have you tried attaching it to a Jira issue? If that fails, you could attach it to a page on the Wiki, no? Doug
Re: java.io.IOException: No input directories specified in
Chris Fellows wrote: I'm having what appears to be the same issue on 0.8 trunk. I can get through inject, generate, fetch and updatedb, but am getting the IOException: No input directories on invertlinks and cannot figure out why. I'm only using nutch on a single local windows machine. Any ideas? Configuration has not changed since checking out from svn. The handling of Windows pathnames is still buggy in Hadoop 0.1.1. You might try replacing your lib/hadoop-0.1.1.jar file with the latest Hadoop nightly jar, from: http://cvs.apache.org/dist/lucene/hadoop/nightly/ The file name code has been extensively re-written. The next Hadoop release (0.2), containing these fixes, will be made in around a week. Doug
Re: How to get Text and Parse data for URL
NutchBean.getContent() and NutchBean.getParseData() do this, but require a HitDetails instance. In the non-distributed case, the only required field of the HitDetails for these calls is url. In the distributed case, the segment field must also be provided, so that the request can be routed to a node serving that segment. These are implemented by FetchedSegments.java and DistributedSearch.java. Doug Dennis Kubes wrote: Can somebody direct me on how to get the stored text and parse metadata for a given url? Dennis
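For illustration, a rough sketch of the non-distributed case described above against the 0.8-dev API; the NutchBean(Configuration) constructor and the HitDetails(String[] fields, String[] values) constructor are assumptions to be checked against your sources:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public class ContentForUrl {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    // Non-distributed case: a HitDetails carrying only the url field is enough.
    HitDetails details =
        new HitDetails(new String[] { "url" }, new String[] { args[0] });
    byte[] content = bean.getContent(details);        // raw fetched content
    ParseData parseData = bean.getParseData(details); // parse metadata (outlinks, headers, ...)
    System.out.println(parseData);
    System.out.println(content.length + " bytes of content");
  }
}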
Re: How to get Text and Parse data for URL
Dennis Kubes wrote: I think that I am not fully understanding the role the segments directory and its contents play. A segment is simply a set of urls fetched in the same round, and data associated with these urls. The content subdirectory contains the raw http content. The parse-text subdirectory contains the extracted text, used when indexing and when building snippets for hits. The index subdirectory holds a Lucene index of the pages in the segment. Etc. It is an independent chunk of Nutch data. In 0.8, each segment subdirectory is further split into parts, the result of distributed processing. The parts are split by the hash of the url. Does that help? Doug
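As a rough picture of what this describes (the segment name is a placeholder timestamp; exact subdirectory names and part numbering should be checked against your version):

segments/20060424103456/        one fetch round
  content/       raw http content
  parse-text/    extracted text, used for indexing and for building snippets
  index/         Lucene index of the pages in this segment
  ...            other per-page data (fetch status, parse metadata)

In 0.8, each of these subdirectories is further split into parts (e.g. part-0, part-1, ...) by the hash of the url, the result of distributed processing.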
Re: java.io.IOException: Cannot create file
[EMAIL PROTECTED] wrote: First question. Updatedb won't run against the segment so what can I do to salvage it? Is the segment salvageable? Probably. I think you're hitting some current bugs in DFS and MapReduce. Once these are fixed, then your updatedb's should succeed! Second question, should I raise an issue in JIRA quoting the errors below? Yes, please. *** Excerpt from hadoop-site.xml <property> <name>mapred.system.dir</name> <value>/home/nutch/hadoop/mapred/system</value> </property> Unlike the other paths, mapred.system.dir is not a local path, but a path in the default filesystem, dfs in your case. Your setting is fine, I just thought I'd mention that. Timed out. java.io.IOException: Task process exit with nonzero status of 143. These 143's are a mystery to me. We really need to figure out what is causing these! One suggestion I found on the net was to try passing '-Xrs' to java, i.e., setting mapred.child.java.opts to include it. Another idea is to put 'ulimit -c unlimited' in one's conf/hadoop-env.sh, so that these will cause core dumps. Then, hopefully, we can use gdb to see where the JVM crashed. I have not had time recently to try either of these on a cluster, the only place where this problem has been seen. java.rmi.RemoteException: java.io.IOException: Cannot create file /user/root/crawlA/segments/20060419162433/parse_text/part-5/data on client DFSClient_task_r_poobc6 This bug is triggered by the previous bug. In the first case the output is started, then the task jvm crashes. But DFS waits a minute before it will let another task create a file with the same name (to time out the other writer). So if the replacement task starts within a minute, then this error is thrown. I think Owen is working on a patch for this which will make DFSClient try to open the file for at least a minute before throwing an exception. We should have that committed today. This won't fix the 143's, but should allow your jobs to complete in spite of them. Thanks for your patience, Doug
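For example, a hedged sketch of the two workarounds mentioned (the -Xmx value is only a placeholder; check the default for your version). In hadoop-site.xml:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m -Xrs</value>
</property>

and in conf/hadoop-env.sh:

ulimit -c unlimited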
Re: java.io.IOException: Cannot create file
[EMAIL PROTECTED] wrote: Actually, I think that updatedb won't run because the fetched segment didn't complete correctly. Don't know whether the instructions in the 0.7 FAQ apply: %touch /index/segments/2005somesegment/fetcher.done Ah. That's different. No, the 0.7 trick probably won't work. What errors are you seeing from this? I'd expect you'd see unexpected eof in the updatedb map task, for truncated outputs in this segment. http://issues.apache.org/jira/browse/HADOOP-153 would fix that, once implemented. Doug
Re: Using Nutch's distributed search server mode
Scott Simpson wrote: I don't quite understand how to set up distributed searching with relation to DFS (and the Tom White documents don't discuss this either). There are three databases with relation to Nutch: 1. Web database (dfs) 2. Segments (regular fs) 3. The index (regular fs) From your message above, I assume that the segments and index go in the regular file system and the web database is distributed across dfs. We put only a portion of the segments and index on each node and the search is distributed from Tomcat to all the nodes at once. If we don't use DFS for the segments and index, we'll lose the redundancy if a node is dead and we may lose search results. Is this true? The distributed search code is currently a bit neglected. It doesn't yet take advantage of MapReduce. The best way to use it today is to keep the master copy of your segments and indexes in dfs, then, when you're (manually) starting distributed search servers, copy segments and indexes from dfs to temporary local storage and start the distributed search servers against those. Then construct a search-servers.txt that will be picked up by NutchBean to construct the DistributedSearch.Client. Long-term, I think we should automate this by having a distributed search MapReduce task. Each task will start by copying required data to local disk, starting a search server on that data, then reporting that search server back through the job tracker. Currently this can be done by setting the task's status to be the host:port string of the search server, then calling getMapTaskReports() to get the host:port of all servers. The map task can then simply loop forever doing nothing. If a search server dies, then the MapReduce system will automatically start a new one. To launch a new version of the index, start a new such MapReduce job, and, once it is running, switch the DistributedSearch.Client to use its servers and kill the old job. The temporary space will be reclaimed when the job is killed. One will have to be sure that the number of input splits naming search server tasks is no greater than numNodes*mapred.tasktracker.tasks.maximum, so that all of the tasks will run simultaneously. But none of that's implemented yet! Doug
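As a sketch of the manual procedure described above (host names, port and paths are made up; the 'bin/nutch server <port> <dir>' form and the one-host-port-per-line search-servers.txt format should be checked against your version):

# On each search node, after copying its share of the segments and indexes
# from dfs to local disk (e.g. under /data/local/search), start a server:
bin/nutch server 9999 /data/local/search

# On the front end, list the servers in search-servers.txt, which NutchBean
# uses to build the DistributedSearch.Client -- one "host port" per line:
node1.example.com 9999
node2.example.com 9999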
Re: nutch user meeting in San Francisco: May 18th
Folks can say whether they'll attend at: http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1 Doug
Re: Using Nutch's distributed search server mode
Shawn Gervais wrote: I was not able to use the literal instructions, as my indexes and segments are in DFS while the document presumes a local filesystem installation Search performance is not good with DFS-based indexes and segments. This is not recommended. Distributed search is not meant for a single merged index, but rather for searching multiple indexes. With distributed search, each node will typically have (a local copy of) a few segments and either a merged index for just those segments, or separate indexes for each segment. When I examine the search results I see many duplicate results. Looking at it further it seems like the results of performing the same search across all 16 nodes is being combined into one result set - duplicates and all. I can only assume that I need to somehow partition my index or segments, but I'm unsure how to do that. It looks like you're searching the same dfs-resident index 16 times. Doug
Re: java.net.SocketTimeoutException: Read timed out
Elwin wrote: When I use the httpclient.HttpResponse to get http content in nutch, I often get SocketTimeoutExceptions. Can I solve this problem by enlarging the value of http.timeout in conf file? Perhaps, if you're working with slow sites. But, more likely, you're using too many fetcher threads and exceeding your available bandwidth, causing threads to starve and timeout. Doug
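For reference, both knobs mentioned above live in conf/nutch-site.xml; the values below are only examples:

<property>
  <name>http.timeout</name>
  <value>30000</value><!-- milliseconds -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>20</value>
</property>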
Re: Question about crawldb and segments
Jason Camp wrote: Unfortunately in our scenario, bw is cheap at our fetching datacenter, but adding additional disk capacity is expensive - so we are fetching the data and sending it back to another cluster (by exporting segments from ndfs, copy, importing). But to perform the copies, you're using a lot of bandwidth to your indexing datacenter, no? Copying segments probably takes almost as much bandwidth as fetching them... I know this sounds a bit messy, but it was the only way we could come up with to utilize the benefits of both datacenters. Ideally, I'd love to be able to have all of the servers in one cluster, and define which servers I want to perform which tasks, so for instance we could use the one group of servers to fetch the data, but the other group of servers to store the data and perform the indexing/etc. If there's a better way to do something like this than what we're doing, or if you think we're just insane for doing it this way, please let me know :) Thanks! You can use different sets of machines for dfs and MapReduce, by starting them in differently configured installations. So you could run dfs only in your indexing datacenter, and MapReduce in both datacenters configured to talk to the same dfs, at the indexing datacenter. Then your fetch tasks at the fetching datacenter would write their output to the indexing datacenter's dfs. And parse/updatedb/generate/index/etc. could all run at the other datacenter. Does that make sense? Doug
Re: plugins directory
mikeyc wrote: Any idea how the 'plugins' directory gets populated? I noticed microformats-hreview was not there. It does exist in the build directory with its jar and class files. Could this be the issue? The plugins directory exists in release builds. When developing, plugins live in build/plugins. If you're developing you should generally work from a subversion checkout, not a downloaded release. Doug
Re: How best to debug failed fetch-reduce task
Shawn Gervais wrote: When I have been at the terminal to observe the timed out process before it is reaped, I have seen that it continues to use 100% of a single processor. strace of the java process did not produce any usable leads. When the reduce task is reassigned, either to the same machine or another, it will die around the same percentage completion. Did you try 'kill -QUIT' the process? That should print a stack trace for every thread. Is there an option I can enable somewhere that will allow for more verbose output to be written to the logs? Any other suggestions on debugging this issue? You could add some print statements to FetcherOutputFormat.java, in the RecordWriter.write() method, printing each key (URL) written. That might let you figure out what page is hanging things. It seems to me that it might be possible to take a snapshot of the task while it is running (i.e. data and the task job jar), so that I can debug it in isolation without re-running an entire fetch process. I am unsure of how this might be done, though. Once you know the page (assuming it is deterministic) then you should be able to run a fetch of just that page to test things. Doug
Re: When Nutch fetches using mapred ...
Shawn Gervais wrote: When I perform a search large enough to observe the fetch process for an extended period of time (1M pages over 16 nodes, in this case), I notice there is one map task which performs _very_ poorly compared to the others: 4905 pages, 33094 errors, 3.5 pages/s, 432 kb/s, versus 46639 pages, 13227 errors, 43.9 pages/s, 4547 kb/s, It is deficient in terms of raw pages/sec, execution time (it is the last map task to complete), and the number of errors encountered. As I said, there seems to always be exactly one map task like this. Different fetch executions will have the thread assigned to different machines -- there doesn't seem to be any pattern. What the heck is going on here? My suspicion is that you're trying to fetch a large number of pages from a single site. Fetch tasks are partitioned by host name. All urls with a given host are fetched in a single fetcher map task. Grep the errors from the log on the slow node: I'll bet most are from a single host name. To fix this, try setting generate.max.per.host. A good value might be something like topN/(mapred.map.tasks*fetcher.threads.fetch). So if you're setting -topN to 10M and running with 10 fetch tasks and using 100 threads, then each fetch task will fetch around 1M urls, 10,000 per thread. Fetching a single host is single-threaded, so any host with more than 10,000 urls will slow the overall fetch. Here's another way to think about it: If you're fetching a page/second per host (fetcher.server.delay) and your fetch tasks are averaging around an hour (3600 seconds) then any host which has more than 3600 pages will cause its fetch tasks to run slower than the others and/or to have high error rates. Doug
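For example, plugging in the numbers above, 10,000,000 / (10 * 100) = 10,000, so a hedged starting point in nutch-site.xml would be:

<property>
  <name>generate.max.per.host</name>
  <value>10000</value>
</property>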
Re: lost NDFS blocks following network reorg
Ken Krugler wrote: Anyway, curious if anybody has insights here. We've done a fair amount of poking around, to no avail. I don't think there's any way to get the blocks back, as they definitely seem to be gone, and file recovery on Linux seems pretty iffy. I'm mostly interested in figuring out if this is a known issue (Of course you can't change the server names and expect it to work), or whether it's a symptom of lurking NDFS bugs. It's hard to tell, after the fact, whether stuff like this is pilot error or a bug. Others have reported similar things, so it's either a bug or it's too easy to make pilot errors. So something needs to change. But what? We need to start testing stuff like this systematically. A reproducible test case would make this much easier to diagnose. I'm sorry I can't be more helpful. I'm sorry you lost data. Doug
Re: How to terminate the crawl?
You can limit the number of pages by using the -topN parameter. This limits the number of pages fetched in each round. Pages are prioritized by how well-linked they are. The maximum number of pages that can be fetched is topN*depth. Doug Olena Medelyan wrote: Hi, I'm using the crawl tool in nutch to crawl web starting from a set of URL seeds. The crawl normally finishes after the specified depth was reached. Is it possible to terminate after a pre-defined number of pages or a text data of a pre-defined size (e.g. 500 MB) has been crawled? Thank you for any hints! Regards, Olena
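For example, a crawl limited this way (directory name and numbers are placeholders) fetches at most depth * topN = 3 * 500 pages:

bin/nutch crawl urls -dir crawled -depth 3 -topN 500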
Re: Delete Files from NDFS
Blocks are not deleted immediately. Check back in a while to see that they're actually removed. Doug Dennis Kubes wrote: Is there a way to delete files from the DFS? I used the dfs -rm option, but the data blocks still are there. Dennis
Re: Nutch and Hadoop Tutorial Finished
Dennis Kubes wrote: Here it is for the list, I will try to put it on the wiki as well. Thanks for writing this! I've added a few comments below. Some things are assumed for this tutorial. First, you will need root level access to all of the boxes you are deploying to. Root access should not be required (although it is sometimes convenient). I have certainly run large-scale crawls w/o root. The only way to get Nutch 0.8 Dev as of this writing that I know of is through Subversion. Nightly builds of Nutch's trunk (currently 0.8-dev) are available from: http://cvs.apache.org/dist/lucene/nutch/nightly/ Add a build.properties file and inside of it add a variable called dist.dir with its value as the location where you want to build nutch. So if you are building on a linux machine it would look something like this: dist.dir=/path/to/build This is optional. So log into the master nodes and all of the slave nodes as root. Create the nutch user and the different filesystems with the following commands: mkdir /nutch mkdir /nutch/search mkdir /nutch/filesystem mkdir /nutch/home useradd -d /nutch/home -g users nutch chown -R nutch:users /nutch passwd nutch nutchuserpassword You can of course run things as any user. I always run things as myself, but that may not be appropriate in all environments. First we are going to edit the ssh daemon. The line that reads #PermitUserEnvironment no should be changed to yes and the daemon restarted. This will need to be done on all nodes. vi /etc/ssh/sshd_config PermitUserEnvironment yes This is not required (although it can be useful). If you see errors from ssh when running scripts, then try changing the value of HADOOP_SSH_OPTS in conf/hadoop-env.sh. Once we have the ssh daemon configured, the ssh keys created and copied to all of the nodes we will need to create an environment file for ssh to use. When nutch logs in to the slave nodes using ssh, the environment file creates the environment variables for the shell. The environment file is created under the nutch home .ssh directory. We will create the environment file on the master node and copy it to all of the slave nodes. vi /nutch/home/.ssh/environment .. environment variables Then copy it to all of the slave nodes using scp: scp /nutch/home/.ssh/environment [EMAIL PROTECTED]:/nutch/home/.ssh/environment One can now instead put environment variables in conf/hadoop-env.sh, since not all versions of ssh support PermitUserEnvironment. cd /nutch/search scp -r /nutch/search/* [EMAIL PROTECTED]:/nutch/search Note that, after the initial copy, you can set NUTCH_MASTER in your conf/hadoop-env.sh and it will use rsync to update the code running on each slave when you start daemons on that slave. The first time all of the nodes are started there may be the ssh dialog asking to add the hosts to the known_hosts file. You will have to type in yes for each one and hit enter. The output may be a little weird the first time but just keep typing yes and hitting enter if the dialogs keep appearing. A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting bin/start-all.sh. Thanks again for providing this! Doug
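For illustration, the kind of conf/hadoop-env.sh settings referred to above; the paths, host and the exact NUTCH_MASTER value format are assumptions:

export JAVA_HOME=/usr/lib/j2sdk1.4-sun
# empty the ssh options if your ssh rejects options such as ConnectTimeout
export HADOOP_SSH_OPTS=""
# rsync the code from the master when daemons are started on a slave
export NUTCH_MASTER=nutch@master.example.com:/nutch/search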
Re: Help Setting Up Nutch 0.8 Distributed
Dennis Kubes wrote: localhost:9000: command-line: line 0: Bad configuration option: ConnectTimeout devcluster02:9000: command-line: line 0: Bad configuration option: ConnectTimeout [ ... ] localhost:9000: command-line: line 0: Bad configuration option: ConnectTimeout devcluster02:9000: command-line: line 0: Bad configuration option: ConnectTimeout The launch of the datanodes and tasktrackers failed, since your version of ssh does not support the ConnectTimeout option. Edit conf/nutch-env.sh, and add a 'export HADOOP_SSH_OPTS=' line to remove this option. Doug
Re: Help Setting Up Nutch 0.8 Distributed
Dennis Kubes wrote: : command not foundlaves.sh: line 29: : command not foundlaves.sh: line 32: localhost: ssh: \015: Name or service not known devcluster02: ssh: \015: Name or service not known And still getting this error: 060316 175355 parsing file:/nutch/search/conf/hadoop-site.xml Exception in thread main java.io.IOException: Cannot create file /tmp/hadoop/mapred/system/submit_mmuodk/job.jar on client DFSClient_-913777457 at org.apache.hadoop.ipc.Client.call(Client.java:301) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141) at org.apache.hadoop.dfs.$Proxy0.create(Unknown Source) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSCli ent.java:587) at org My ssh version is: openssh-clients-3.6.1p2-33.30.3 openssh-server-3.6.1p2-33.30.3 openssh-askpass-gnome-3.6.1p2-33.30.3 openssh-3.6.1p2-33.30.3 openssh-askpass-3.6.1p2-33.30.3 Is it something to do with my slaves file? The \015 looks like a file has a CR where perhaps an LF is expected? What does 'od -c conf/slaves' print? What happens when you try something like 'bin/slaves uptime'? Doug
Re: javascript in summaries [nutch-0.7.1]
Jérôme Charron wrote: I reproduce this with nutch-0.8 with neko html parser (it seems that script tags are not removed). You can switch the html parser implementation to tagsoup. In my tests, all is ok. (property parser.html.impl) Should we switch the default from neko to tagsoup? Are there cases where neko is better? Doug
Re: Question on scalability
Olive g wrote: Is hadoop/nutch scalable at all or I can tune some other parameters? I'm not sure what you're asking. How long does it take to run this on a single machine? My guess is that it's much longer. So things are scaling: they're running faster when more hardware is added. In all cases you're using the same number of machines, but varying parameters and seeing different performance, as one would expect. For your current configuration, indexing appears fastest when the number of reduce tasks equals the number of nodes. I already have: mapred.map.tasks set to 100 mapred.job.tracker is not local mapred.tasktracker.tasks.maximum is 2. and everything else is default. How are you storing things? Are you using dfs? Are your nodes single-cpu or dual-cpu? My guess is single-cpu, in which case you might see more consistent performance with mapred.tasktracker.tasks.maximum=1. How many disks do you have per node? If you have multiple drives, then configuring mapred.local.dir to contain a list of directories, one per drive, might make things faster. Doug
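For example, spreading task-local data over two drives could look like this in hadoop-site.xml (paths are placeholders; the value is a comma-separated list):

<property>
  <name>mapred.local.dir</name>
  <value>/disk1/hadoop/mapred/local,/disk2/hadoop/mapred/local</value>
</property>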
Re: Boolean OR QueryFilter
This looks like a good approach. Note also that you will probably need to change BasicQueryFilter and perhaps other filters to work correctly with optional terms. Nguyen Ngoc Giang wrote: Sorry, I'm a newbie in OS, and I'm not familiar with the way of updating patches :D I'll try to put my solution here first to receive comments from our community. Since we must differentiate 3 possibilities: must have, may have and must not have; we need at least 2 boolean variables in org.apache.nutch.searcher.Query. In fact, these 2 boolean variables are isRequired and isProhibited. - In the first step, I define an OR token separately in the jj file. This will be put before WORD. So it will look like this: <OR: "OR"> - Second, I define a new function called disjunction: void disjunction() : {} { <OR> nonOpOrTerm() } - Third, in the function parse(), I declare a boolean variable disj: boolean disj; - Fourth, inside parse(), once we finished looking ahead, we examine the existence of the OR token: ( LOOKAHEAD ... )? // check OR (disjunction() { disj = true; })* - Finally, I changed the handling portion in parse(): if (stop && field == Clause.DEFAULT_FIELD && terms.size()==1 && isStopWord(array[0])) { // ignore stop words only when single, unadorned terms in default field } else { if (prohibited) query.addProhibitedPhrase(array, field); else if (disj) query.addOptionalPhrase(array, field); else query.addRequiredPhrase(array, field); } After this point, I have finished changing the jj file. Please note that I also have to add the method addOptionalPhrase() in org.apache.nutch.searcher.Query. This method basically sets isRequired=false and isProhibited=false. The rest has been taken care of by Nutch already. Regards, Giang On 3/15/06, Laurent Michenaud [EMAIL PROTECTED] wrote: I would like to use Boolean Query too :) -Original Message- From: Alexander Hixon [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 March 2006 08:38 To: nutch-user@lucene.apache.org Subject: RE: Boolean OR QueryFilter Maybe you could post the code on JIRA, if anyone else wishes to use Boolean operators in their search queries..? We could probably get a developer or two to put this in the 0.8 release? Since it IS open source. ;) Just a thought, Alex -Original Message- From: Nguyen Ngoc Giang [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 March 2006 3:45 PM To: nutch-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Boolean OR QueryFilter Hi David, I also did a similar task. In fact, I hacked into the jj code to add the definition for OR and NOT. If you need any help, don't hesitate to contact me :). Regards, Giang PS: I also believe that a hack to the jj code is necessary. On 3/8/06, David Odmark [EMAIL PROTECTED] wrote: Hi all, We're trying to implement a nutch app (version 0.8) that allows for Boolean OR e.g. (this OR that) AND (something OR other). I've found some relevant posts in the mailing list archive, but I think I'm missing something. For example, here's a snippet from a post from Doug Cutting: <snip> that said, one can implement OR as a filter (replacing or altering BasicQueryFilter) that scans for terms whose text is OR in the default field. </snip> The problem I'm finding is that the NutchAnalysis analyzer seems to be swallowing all boolean terms by the time the QueryFilter is even executed (perhaps because OR is a stop word?). 
To wit: String queryText = "this OR that"; org.apache.nutch.searcher.Query query = org.apache.nutch.searcher.Query.parse(queryText, conf); for (int i = 0; i < query.getTerms().length; i++) { System.out.println("Term = " + query.getTerms()[i]); } This results in output that looks like this: Term = this Term = that So am I correct in believing that in order to implement boolean OR using Nutch search and a QueryFilter, one must also (minimally) hack the NutchAnalysis.jj file to produce a new analyzer? Also, given that a Nutch Query object doesn't seem to have a method to add a non-required Term or Phrase, does that need to be modified as well? Sorry for the long post, and thanks in advance... -David Odmark
Re: Site: invalid Jira link
I just fixed this. Thanks, Doug ArentJan Banck wrote: on: http://lucene.apache.org/nutch/issue_tracking.html http://nagoya.apache.org/jira/browse/Nutch no longer works. Should be: http://issues.apache.org/jira/browse/Nutch - Arent-Jan
Re: Adaptive Refetching
Andrzej Bialecki wrote: What i infer is, 1. For every refetch, the score of files (but not the directory) is increasing This is curious, it should not be so. However, it's the same in the vanilla version of Nutch (without this patch), so we'll address this separately. The OPIC algorithm is not really designed for re-fetching. It assumes that each link is seen only once. When pages are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links in the original version of a page. This is not perfect, but considerably better than what happens now. Incrementally updating the score would require re-processing the parser outputs to find outlinks from the previous version of the page and then subtracting their contribution from the page's score. This is possible, but not easy. Doug
Re: Boolean OR QueryFilter
David Odmark wrote: So am I correct in believing that in order to implement boolean OR using Nutch search and a QueryFilter, one must also (minimally) hack the NutchAnalysis.jj file to produce a new analyzer? Also, given that a Nutch Query object doesn't seem to have a method to add a non-required Term or Phrase, does that need to be modified as well? It looks like you might need to make sure that OR is not a stop word. Or use syntax like 'this +OR that', since required words are not stopped. Or use something like 'this operator:OR that'. Doug
Re: Adaptive Refetching
Andrzej Bialecki wrote: Doug Cutting wrote: are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links in the original version of a page. This is not perfect, but considerably better But then we would miss any new links from that page. I think it's not acceptable. Think e.g. of news sites, where links from the same page are changing on a daily or even hourly basis. Good point. Then maybe we should add a new status just for this, STATUS_REFRESH_LINK. If this is the only datum for a page, then the page could be added with its inherited score, but otherwise, if it is an already known page, the score increment is ignored. That way the scores for existing pages would not change due to recrawling, but new pages would still be added with a score influenced by the page that linked to them. Still not perfect, but better. If you remember, some time ago I proposed a different solution: to involve linkDB in score calculations, and to store these partial OPIC score values in Inlink. This would allow us to track score contributions per source/target pair. Newly discovered links would get the initial partial score value from the originating page, and we could track these values if the original page's score changes (e.g. the number of links increases, or the page's score is updated). Involving the linkdb in score calculations means that the linkdb is involved in crawldb updates, which makes crawldb updates much slower, since the linkdb generally has many times more entries than the crawldb. The linkdb is not required for batch crawling and OPIC scoring, a common case. So if we wish to implement things this way we should make it optional. For example, an initial crawl could be done using the current algorithm while subsequent crawls could use a slower, incrementally updating algorithm. BTW: I've been toying with some patches to implement pluggable scoring mechanisms, it would be easy to provide hooks for custom scoring implementations. Scores are just float values, so they would be sufficient for a wide range of scoring mechanisms, for others the newly added CrawlDatum.metadata could be used. +1 Doug
Re: .8 svn - fetcher performance..
Byron Miller wrote: Anything i should change/tweak on my fetcher config for .8 release? i'm only getting 5 pages/sec and i was getting nearly 50 on .7 with 125 threads going. Does .8 not use threads like 7 did? Byron, Have you tried again more recently? A number of bugs have been fixed in 0.8 in the past few weeks. I think it is now much more stable. Doug
Re: Problems with hadoop
Jon Blower wrote: My guess is that the source program is not available on your version of FreeBSD. Try running the source program (with no arguments) from the command line or type man source. Do you see anything? If not, you probably don't have the source program, which is called by the hadoop script. The source command is a shell builtin which effectively inserts the content of another shell script within a shell script, so that the sourced script can, e.g., set local variables, etc. Doug
Re: retry later
Richard Braman wrote: when you get an error while fetching, and you get the org.apache.nutch.protocol.retrylater because the max retries have been reached, nutch says it has given up and will retry later, when does that retry occur? How would you make a fetchlist of all urls that have failed? Is this information maintained somewhere? Each url in the crawldb has a retry count, the number of times it has been tried without a conclusive result. When the maximum (db.fetch.retry.max) is reached, the page is considered gone. Until then it will be generated for fetch along with other pages. There is no command that generates a fetchlist for only pages whose retry count is greater than zero. Doug
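For reference, the retry limit mentioned is configured in conf/nutch-site.xml; 3 is only an example value:

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
</property>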
Re: Tutorial on the Wiki
Vanderdray, Jacob wrote: I've changed the language a bit. If you're interested, take a look: http://wiki.apache.org/nutch/NutchTutorial This looks great! Thanks so much for adding this to the wiki! We might add something to the Step-by-Step introduction to the effect that: This also permits more control over the crawl process, and incremental crawling. Does that address others' concerns? Doug
Re: still not so clear to me
Richard Braman wrote: Can someone confirm this: You start a crawldb from a list of urls and you generate a fetch list, which is akin to seeding your crawldb. When you fetch it just fetches those seed urls. When you do your next round of generate/fetch/update, the fetch list will have the links found while parsing the pages in the original urls. Then on your next round, it will fetch the links found during the previous fetch. So with each round of fetching, nutch goes deeper and deeper into the web, only fetching urls it hasn't previously fetched. The generate command generates a fetch list first based on the seed urls, then on the links found on that page (for each subsequent iteration), then on the links on those pages, and so forth and so on until the entire domain is crawled, if you limit the domains with a filter. This all sounds right to me. Some clarifications: - urls are filtered before adding them to the crawldb, so the db only ever contains urls that pass the filter. - the db contains both urls that have been fetched and those that have not been fetched. When you find a new link to a url that is already in the db it does not add a new entry to the db, but rather just updates the existing entry's score. - higher-scoring pages are generated in preference to lower-scoring pages when the -topN option is used. So a page discovered in the first round might not be fetched until the fourth round, when enough other links have been found to that page to warrant fetching it. Thus, when topN is specified, crawling is not totally breadth-first. Doug
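As a sketch of one such round using the individual commands (the crawl/ layout and the -topN value are placeholders):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment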
Re: project vitality?
Richard Braman wrote: I really do think nutch is great, but I echo Matthias's comments that the community needs to come together and contribute more back. And that comes with the requirement of making sure volunteers are given access to make their contributions part of the project. Here's how it works: One has to be a committer to directly change the code. One may be invited to become a committer if one contributes a number of non-trivial, consistently exemplary patches. Exemplary patches: 1. are easy for a committer to apply; 2. fix one thing; 3. fix it well; 4. are well formatted, using Sun's coding conventions; 5. are well documented, with Javadoc for all non-private items; 6. pass all existing unit tests; 7. include new unit tests; 8. etc. An exemplary patch is thus something that a committer can commit with little hesitation. It follows that exemplary patches will be committed quickly. Lesser patches are likely to languish. For example, a committer might be reluctant to take on a poorly constructed patch for a bug that only affects niche users, since it may take a lot of time to turn it into code worthy of committing. Most committers are already doing as much as they can to help the project. The trick is not to get the committers to do more work, but for others to do more work for the committers, and, eventually, to get more committers. Putting the faqs and tutorial on the website and not the wiki may be one of the two biggest problems in getting people started learning nutch. If you think these should move, don't just complain: file a bug, make your case, submit a patch, etc. The website is part of the source and is governed by the same process. Doug
Re: project vitality?
David Wallace wrote: Also, I've lost count of the number of times someone has posted something to the effect of I'll pay someone to give me Nutch support, simply because they find the existing documentation and mailing lists inadequate. Usually, that person gets told that the best way to get Nutch support is to ask questions on the mailing list; but since questions often go unanswered, this isn't a very good way to get Nutch support at all. I agree this is a problem, but it is also an opportunity. I do try to answer Nutch questions whenever I have time, and most other Nutch developers are also active on these lists. The problem is simply that there are more questions than question answering hours. All of this is acceptable in a product that hasn't yet reached version 1.0. The code has moved ahead faster than the documentation; and that's fine, provided the documentation will eventually catch up. Yes, I hope it will. Maybe, once 0.8 is deemed production-worthy, the team should down tools, stop coding, and put some effort into really producing a really lovely set of documentation, including a comprehensive FAQ. I believe that this will help grow the user base, faster than adding new features ever could. That would be nice. Once things settle down it will also be easier for support organizations, consultants, book authors, etc, to step in and improve documentation too. Doug
Re: issues w/ new nutch versions
Florent Gluck wrote: In hadoop jobtracker's log, I can see several tasks being losts as follow: 060306 184155 Aborting job job_hyhtho 060306 184156 Task 'task_m_7qgat2' has been lost. 060306 184156 Aborting job job_hyhtho 060306 184156 Task 'task_m_lph5qs' has been lost. 060306 184156 Aborting job job_hyhtho It seems there are some sort of timeouts. Weird, the machines are properly configured (hasn't changed) and it definitely works w/ nutch the previous nutch version (as of end of Jan.). I fixed some bugs today in Hadoop which could cause this. Please try updating again and see if you still have this problem. Sorry! Doug
Re: Help with bin/nutch server 8081 crawl
Monu Ogbe wrote: Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query at java.lang.Class.newInstance0(Unknown Source) at java.lang.Class.newInstance(Unknown Source) at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.jav It looks like Query no longer has a no-arg constructor, probably since the patch which makes all Configurations non-static. A no-arg constructor is required in order to pass something via an RPC. The fix might be as simple as adding the no-arg constructor, but perhaps not, since the query would then have a null configuration. At a glance, the query execution code doesn't appear to use the configuration, so this might work... Doug
Re: Moving tutorial link to wiki
Matthias Jaekle wrote: Maybe we should move the tutorial to the wiki so it can be commented on. +1 +1 Doug
Re: exception during fetch using hadoop
It looks like the child JVM is silently exiting. The error reading child output just shows that the child's standard output has been closed, and the child error says the JVM exited with non-zero. Perhaps you can get a core dump by setting 'ulimit -c' to something big. JVM core dumps can be informative. This doesn't look like something that should kill a crawl, though. Are you using tasktrackers and a jobtracker, or running things with a local jobtracker? With a tasktracker this task would be retried. Are you seeing this? Does a given task consistently fail when retried? Doug Mike Smith wrote: I have been getting this exception during fetching for almost a month. This exception stops the whole crawl. It happens on and off! Any Idea?? We are really stuck with this problem. I am using 3 data nodes and 1 name server. 060223 173809 task_m_b8ibww fetching http://www.heartcenter.com/94fall.pdf 060223 173809 task_m_b8ibww fetching http://www.medinfo.co.uk/conditions/tenosynovitis.html 060223 173809 task_m_b8ibww fetching http://www.boncholesterol.com/whatsnew/index.shtml 060223 173809 task_m_b8ibww fetching http://www.drcranton.com/hrt/promise_of_longevity.htm 060223 173809 task_m_b8ibww fetching http://www.drcranton.com/hrt/promise_of_longevity.htm 060223 173809 task_m_b8ibww Error reading child output java.io.IOException: Bad file descriptor at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:194) at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411) at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183) at java.io.InputStreamReader.read(InputStreamReader.java:167) at java.io.BufferedReader.fill(BufferedReader.java:136) at java.io.BufferedReader.readLine(BufferedReader.java:299) at java.io.BufferedReader.readLine(BufferedReader.java:362) at org.apache.hadoop.mapred.TaskRunner.logStream(TaskRunner.java:170) at org.apache.hadoop.mapred.TaskRunner.access$100(TaskRunner.java:29) at org.apache.hadoop.mapred.TaskRunner$1.run(TaskRunner.java:137) 060223 173809 task_r_3h1pex 0.1667% reduce copy 060223 173809 Server connection on port 50050 from xx: exiting 060223 173809 Server connection on port 50050 from xx: exiting 060223 173809 task_m_b8ibww Child Error java.io.IOException: Task process exit with nonzero status. at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:144) at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:97) 060223 173812 task_m_b8ibww done; removing files.
Re: url: search fail
0.7 and 0.8 are not compatible. You need to re-crawl. Sorry! Once we have a 1.0 release then we'll make sure things are back-compatible. Doug Martin Gutbrod wrote: I changed from 0.7.1 to one of the latest nightly builds (0.8) and now search for url: fields fail. E.g. [ url:my.doman.com ] Has anybody similar experiences? Should I switch back to 0.7.1 ? Log file shows: 2006-02-24 11:17:11 StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception java.lang.NullPointerException at org.apache.nutch.searcher.FieldQueryFilter.filter(FieldQueryFilter.java:63) at org.apache.nutch.searcher.QueryFilters.filter(QueryFilters.java:106) at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:94) at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239) at org.apache.jsp.search_jsp._jspService(search_jsp.java:251) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) at javax.servlet.http.HttpServlet.service(HttpServlet.java:856) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) at javax.servlet.http.HttpServlet.service(HttpServlet.java:856) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:287) at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:84) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929) at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705) at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:534)
Re: Link to Search Interface for List
Vanderdray, Jacob wrote: I get the same thing from my linux box. The only reference I can find to linkmap.html is a commented out line in forrest.properties. FWIW: I've already made the changes to my copy of mailing_lists.xml. Let me know if you want me to just send someone that. I think I just fixed that problem. Forrest 0.7 seems to choke on ext: links in the tabs.xml file. Once those are removed it works. Doug
Re: Problem/bug setting java_home in hadoop nightly 16.02.06
Have you edited conf/hadoop-env.sh, and defined JAVA_HOME there? Doug Håvard W. Kongsgård wrote: I am unable to set java_home in bin/hadoop, is there a bug? I have used nutch 0.7.1 with the same java path. localhost: Error: JAVA_HOME is not set. if [ -f $HADOOP_HOME/conf/hadoop-env.sh ]; then source ${HADOOP_HOME}/conf/hadoop-env.sh fi # some Java parameters if [ "$JAVA_HOME" != "/usr/lib/java" ]; then #echo "run java in $JAVA_HOME" JAVA_HOME=$JAVA_HOME fi if [ "$JAVA_HOME" = "" ]; then echo "Error: JAVA_HOME is not set." exit 1 fi JAVA=$JAVA_HOME/bin/java JAVA_HEAP_MAX=-Xmx1000m System: SUSE 10 64-bit | Java 1.4.2
Re: The latest svn version is not stable
Rafit Izhak_Ratzin wrote: I just checked out the latest svn version (376446) and built it from scratch. When I tried to run the jobtracker I got the following message in the jobtracker log file: 060209 164707 Property 'sun.cpu.isalist' is Exception in thread "main" java.lang.NullPointerException Okay. I think I just fixed this. Please give it a try. Thanks, Doug
Re: nutch inject problem with hadoop
Michael Nebel wrote: I upgraded to the last version from the svn today. After having some nuts and bolts fixes (missing hadoop-site.xml, webapps-dir). I just fixed these issues. I finally tried to inject a new set of urls. Doing so, I get the exception below. I am not seeing this. Are you still seeing it, with the current sources? If so, can you provide more details? What OS, JVM? Thanks, Doug
Re: nutch inject problem with hadoop
Michael Nebel wrote: Now it's complaining about a missing class org/apache/nutch/util/LogFormatter :-( That's been moved to Hadoop: org.apache.hadoop.util.LogFormatter. Doug
Re: hadoop-default.xml
The file packaged in the jar is used for the defaults. It is read from the jar file. So it should not need to be committed to Nutch. Mike Smith wrote: There is no setting file for Hadoop in conf/. Should it be hadoop-default.xml? It seems this file is not committed but it is packaged into hadoop jar file. Thanks, Mike.
Re: Recovering from Socket closed
Chris Schneider wrote: Also, since we've been running this crawl for quite some time, we'd like to preserve the segment data if at all possible. Could someone please recommend a way to recover as gracefully as possible from this condition? The Crawl .main process died with the following output: 060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246 Exception in thread main java.io.IOException: timed out waiting for response at org.apache.nutch.ipc.Client.call(Client.java:296) at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127) at $Proxy1.submitJob(Unknown Source) at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259) at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288) at org.apache.nutch.indexer.Indexer.index(Indexer.java:263) at org.apache.nutch.crawl.Crawl.main(Crawl.java:127) However, it definitely seems as if the JobTracker is still waiting for the job to finish (no failed jobs). Have you looked at the web ui? It will show if things are still running. This is on the jobtracker host at port 50030 by default. The bug here is that the RPC call times out while the map task is computing splits. The fix is that the job tracker should not compute splits until after it has returned from the submitJob RPC. Please submit a bug in Jira to help remind us to fix this. To recover, first determine if the indexing has completed. If it has not, then use the 'index' command to index things, followed by 'dedup' and 'merge'. Look at the source for Crawl.java: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java?view=markup All you need to do to complete the crawl is to complete the last few steps manually. Cheers, Doug
Re: Parsing PDF Nutch Achilles heel?
Steve Betts wrote: I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster, but it does allow it to complete. I find xpdf much faster than PDFBox. http://www.mail-archive.com/nutch-dev@incubator.apache.org/msg00161.html Does this work any better for you? Doug
Re: How do I control log level with MapReduce?
Chris Schneider wrote: I'm trying to bring up a MapReduce system, but am confused about how to control the logging level. It seems like most of the Nutch code is still logging the way it used to, but the -logLevel parameter that was getting passed to each tool's main() method no longer exists (not that these main methods are getting called by Crawl.java, of course). Previously, if -logLevel was omitted, each tool would set its logLevel field to INFO, but those fields no longer exist either. The result seems to be that the logging level defaults all the way back to the LogFormatter, which sets all of its handlers to FINEST. I was sort of expecting there to be a new configuration property (perhaps a job configuration property?) that would control the logging level, but I don't see anything like this. Any guidance would be greatly appreciated. There is no config property to control logging level. That would be a useful addition, if someone wishes to contribute it. In the meantime, Nutch uses Java's built-in logging mechanism. Instructions for configuring that are in: http://java.sun.com/j2se/1.4.2/docs/api/java/util/logging/LogManager.html Doug
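For illustration, Java's built-in logging can be configured through a properties file; the logger names and levels below are only examples:

# logging.properties
handlers=java.util.logging.ConsoleHandler
.level=INFO
java.util.logging.ConsoleHandler.level=INFO
org.apache.nutch.level=FINE

# then pass -Djava.util.logging.config.file=/path/to/logging.properties to the JVM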
Re: Can't index some pages
Michael Plax wrote: Question summary: Q: How can I set up the crawler to index an entire web site? I'm trying to run crawl with the command from the tutorial 1. In the urls file I have the start page (index.html). 2. In the configuration file conf/crawl-urlfilter.txt the domain was changed. 3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 > crawl.log 4. Crawling is finished 5. I run: bin/nutch readdb crawled/db -stats output: $ bin/nutch readdb crawledtottaly/db -stats run java in C:\Sun\AppServer\jdk 060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml 060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml 060118 155526 No FS indicated, using default:local Stats for [EMAIL PROTECTED] --- Number of pages: 63 Number of links: 3906 6. I get fewer pages than I expected. This is a common question, but there's not a common answer. The problem could be that urls are blocked by your url filter, or by http.max.delays, or something else. What might help is if the fetcher and crawl db printed more detailed statistics. In particular, the fetcher could categorize failures and periodically print a list of failure counts by category. The crawl db updater could also list the number of urls that are filtered. In the meantime, please examine the logs, particularly watching for errors while fetching. Doug
Re: So many Unfetched Pages using MapReduce
Florent Gluck wrote: I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient With the old protocol I got 5 as expected. There have been a number of complaints about unreliable fetching with protocol-httpclient, so I've switched the default back to protocol-http. Doug
Re: Error at end of MapReduce run with indexing
Matt Zytaruk wrote: I am having this same problem during the reduce phase of fetching, and am now seeing: 060119 132458 Task task_r_obwceh timed out. Killing. That is a different problem: a different timeout. This happens when a task does not report status for too long then it is assumed to be hung. Will the jobtracker restart this job? It will retry that task up to three times. If so, if I change the ipc timeout in the config, will the tasktracker read in the new value when the job restarts? The ipc timeout is not the relevant timeout. The task timeout is what's involved here. And, no, at present I think the tasktracker only reads this when it is started, not per job. Doug
Re: Can't index some pages
Matt Kangas wrote: Doug, would it make sense to print a LOG.info() message every time the fetcher bumps into one of these db.max limits? This would help users find out when they need to adjust their configuration. I can prepare a patch if it seems sensible. Sure, this is sensible. But it isn't done in the fetcher; it happens when the links are read, during the db update. Doug
Re: large filter file, time to update db
Insurance Squared Inc. wrote: I'm trying to determine if there's a better way to whitelist a large number of domains than just adding them as a regular expression in the filter. Have a look at the urlfilter-prefix plugin. This is more efficient for filtering urls by a large list of domains. Doug
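A sketch of what such a whitelist might look like, assuming the urlfilter-prefix plugin reads one URL prefix per line from a plain-text file (the exact file name and the property that points to it, e.g. prefix-urlfilter.txt / urlfilter.prefix.file, should be checked against the plugin source for your version):
http://www.example-insurer.com/
http://quotes.example.org/
http://www.another-domain.net/
Each line is matched as a simple string prefix rather than a regular expression, which is why this approach scales better to thousands of domains.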
Re: Full Range of Results Not Showing
Neal Whitley wrote: Now here's another question. How can I obtain the exact number of search results being displayed on the screen? I have been fishing around and cannot find a variable being output to the page with this data. In my example below 81 total matches were found. But because of the grouping in the initial result set (hitsPerSite=2) it is showing only 46 listings, some of which are grouped under more from site. This is causing a slight problem with pagination because the pager thinks there are 81 matches and extends itself for a range of 81, when we really want a value of 46 when hitsPerSite=2. Perhaps something like this on search.jsp:
if (hitsPerSite == 0) {
  // grab the full result set
  maxPages = (int)hits.getTotal();
} else {
  // grab the short result set
  maxPages = ???some variable OR some math here to obtain value???;
}
It seems to me this is not a problem when using the default Next button to move from page to page. But with any sort of pagination used with hitsPerSite we need to know what we are actually viewing on the screen.
The site-deduping is performed at query time. If you ask for the top N hits without site duplication then Nutch finds more than N hits and removes those from duplicate sites dynamically. So unless you make N very large, we don't know the total number of site-de-duplicated hits. Doug
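If all you need is the count of hits actually returned for the current request (rather than the unknowable de-duplicated grand total), something along these lines may work in search.jsp. This is only a sketch against the Hits API as I understand it (getLength() for the number of hits actually returned, totalIsExact() to tell whether getTotal() is a real count); verify the method names against your version:
// assume 'hits' is the Hits object already obtained in search.jsp
int shown = hits.getLength();          // hits actually returned after site-grouping
long total = hits.getTotal();          // can over-count when hitsPerSite != 0
boolean exact = hits.totalIsExact();   // false when de-duping made the total inexact
int maxPages = (hitsPerSite == 0 || exact) ? (int) total : shown;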
Re: Is any one able to successfully run Distributed Crawl?
Pushpesh Kr. Rajwanshi wrote: Just wanted to confirm: was this distributed crawl done using nutch version 0.7.1 or some other version? And was that a successful distributed crawl using map reduce, or some workaround for distributed crawl? No, this is 0.8-dev. This was done in early December using the version of Nutch then in the mapred branch. This version has since been merged into the trunk and will eventually be released as 0.8. I believe everything in my previous message is still relevant to the current trunk. Doug
Re: Multi CPU support
Teruhiko Kurosaka wrote: Can I use MapReduce to run Nutch on a multi-CPU system? Yes. I want to run the index job on two (or four) CPUs on a single system. I'm not trying to distribute the job over multiple systems. If MapReduce is the way to go, do I just specify config parameters like these: mapred.tasktracker.tasks.maximum=2, mapred.job.tracker=localhost:9001, mapred.reduce.tasks=2 (or 1?), and run bin/start-all.sh? That should work. You'd probably want to set the default number of map tasks to be a multiple of the number of CPUs, and the number of reduce tasks to be exactly the number of CPUs. Don't use start-all.sh, but rather just:
bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker
Must I use NDFS for MapReduce? No. Doug
Re: Multiple anchors on same site - what's better than making these unique?
David Wallace wrote: I've been grubbing around with Nutch for a while now, although I'm still working with 0.7 code. I notice that when anchors are collected for a document, they're made unique by domain and by anchor text. Note that this is only done when collecting anchor texts, not when computing page scores. Suppose my site has 3 pages with links to page X, and the same anchor text. I'd kind of like to score page X higher than a page where there's only one incoming link with that anchor text. But I don't want to have this effect swamping the other calculations of page score. In other words, if my site has 1000 pages with links to page X, this page should score a wee bit higher than a similar page with just one incoming link, but not 1000 times higher. I'm thinking of doing some maths with the number of repetitions of an anchor, then including the result in the page score. Something like log(10+n), or maybe n/(n+2); where n is the number of incoming links with the same anchor text. Either of these formulas would make 1000 incoming links score roughly 3 times higher than a single incoming link, which seems about right to me. Page scores currently are sqrt(OPIC) in the Nutch trunk. http://www.nabble.com/-Fwd%3A-Fetch-list-priority--t360125.html#a997304 The OPIC calculation does not consider the domain or anchor text. Hope this helps. Doug
Re: Is any one able to successfully run Distributed Crawl?
Earl Cahill wrote: Any chance you could walk through your implementation? Like how the twenty boxes were assigned? Maybe upload your confs somewhere, and outline what commands you actually ran?
All 20 boxes are configured identically, running Debian with a 2.4 kernel. These are dual-processor boxes with 2GB of RAM each. Each machine has four drives, mounted as a RAID on /export/crawlspace. This cluster uses NFS to mount home directories, so I did not have to set NUTCH_MASTER in order to rsync copies of nutch to all machines. I installed JDK 1.5 in ~/local/java, Ant in ~/local/ant and Subversion in ~/local/svn.
My ~/.ssh/environment contains:
JAVA_HOME=/home/dcutting/local/java
NUTCH_OPTS=-server
NUTCH_LOG_DIR=/export/crawlspace/tmp/dcutting/logs
NUTCH_SLAVES=/home/dcutting/.slaves
I added the following to ~/.bash_profile, then logged out and back in:
export `cat ~/.ssh/environment`
I added the following to /etc/ssh/sshd_config on all hosts:
PermitUserEnvironment yes
My ~/.slaves file contains a list of all 20 slave hosts, one per line.
My ~/src/nutch/conf/mapred-default.xml contains:
<nutch-conf>
<property>
  <name>mapred.map.tasks</name>
  <value>1000</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>39</value>
</property>
</nutch-conf>
My ~/src/nutch/conf/nutch-site.xml contains:
<nutch-conf>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)</value>
</property>
<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
</property>
<!-- NDFS -->
<property>
  <name>fs.default.name</name>
  <value>adminhost:8009</value>
</property>
<property>
  <name>ndfs.name.dir</name>
  <value>/export/crawlspace/tmp/dcutting/ndfs/names</value>
</property>
<property>
  <name>ndfs.data.dir</name>
  <value>/export/crawlspace/tmp/dcutting/ndfs</value>
</property>
<!-- MapReduce -->
<property>
  <name>mapred.job.tracker</name>
  <value>adminhost:8010</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/export/crawlspace/tmp/dcutting/local</value>
</property>
<property>
  <name>mapred.child.heap.size</name>
  <value>500m</value>
</property>
</nutch-conf>
My ~/src/nutch/conf/crawl-urlfilter.txt contains:
# skip file:, ftp:, mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept everything else
+.
To run the crawl I gave the following commands on the master host:
# checkout nutch sources and build them
mkdir ~/src
cd ~/src
~/local/svn co https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
cd nutch
~/local/ant/bin/ant
# install config files named above in ~/src/nutch/conf
# create dmoz/urls file
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8.gz dmoz/urls
# create required directories on slaves
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/logs
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/local
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/ndfs/names
# start nutch daemons
bin/start-all.sh
# copy dmoz/urls into ndfs
bin/nutch ndfs -put dmoz dmoz
# crawl
nohup bin/nutch crawl dmoz -dir crawl -depth 4 -topN 1600 /dev/null crawl.log
Then I visited http://master:50030/ to monitor progress. I think that's it! Doug
Re: Is any one able to successfully run Distributed Crawl?
Pushpesh Kr. Rajwanshi wrote: I want to know if anyone has been able to successfully run a distributed crawl on multiple machines involving crawling millions of pages, and how hard it is to do that. Do I just have to do some configuration and setup, or some implementation as well? I recently performed a four-level deep crawl, starting from urls in DMOZ, limiting each level to 16M urls. This ran on 20 machines, taking around 24 hours, using about 100Mbit, and retrieved around 50M pages. I used Nutch unmodified, specifying only a few configuration options. So, yes, it is possible. Doug
Re: Linking Document scores together in a query
Can you please describe the higher-level problem you're trying to solve? Doug Matt Zytaruk wrote: Hello, I am trying to implement a system where, to get the score for certain documents in a query, I need to average the scores of two different documents for that query. Does anyone have any bright ideas on what the best way to implement such a system would be? I've been investigating and thus far haven't been able to find a way that didn't degrade performance horribly. Any help would be appreciated. Thanks in advance. -Matt Zytaruk
Re: How to get page content given URL only?
Nguyen Ngoc Giang wrote: I'm writing a small program which just utilizes Nutch as a crawler only, with no search functionality. The program should be able to return page content given an url input. In the mapred branch this is directly supported by NutchBean. Doug
Re: Incremental crawl w/ map reduce
Did you update the crawldb after the first fetch? The mapred crawler does not update the next-fetch date of pages when the fetch list is generated, as it did in 0.7. So, until that changes, you must update the crawldb before you next generate a fetch list. Doug Florent Gluck wrote: Hi, As a test, I recently did a quick incremental crawl. First, I did a crawl with 10 seed urls using 4 nodes (1 jobTracker/nameNode + 3 taskTrackers/dataNodes). So far, so good, the fetches were distributed among the 3 nodes (3/3/4) and a segment was generated. Running a quick -stats on the crawldb showed me the 10 links were there. I also did a dump and everything was fine. Then, I injected a new url and crawled again, generating a second segment. While it was running, I looked at the logs expecting to only see the fetch of the new url I added, but instead I saw it was fetching all the previous urls again. Why is that? These were already fetched and my understanding is that they should only be fetched again after 30 days (or whatever value is specified in nutch-site.xml). What am I missing here? Thanks, Flo
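A rough sketch of one incremental round with the 0.8-style commands, assuming the usual argument order of crawldb first and then the segment (check the usage output of bin/nutch for your build; the segment path below is purely illustrative):
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
bin/nutch fetch crawl/segments/20060501120000
bin/nutch updatedb crawl/crawldb crawl/segments/20060501120000
Running updatedb after every fetch is what records the next-fetch dates, so the next generate step can skip recently fetched pages.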
Re: mapred branch: IOException in invertlinks (No input directories specified)
Florent Gluck wrote: 8. invertlinks linkdb segments/SEG_NAME This should be instead: invertlinks linkdb segments Doug
Re: Fetch Errors
Ben Halsted wrote: When I check the fetch status pages in the JobTracker web GUI I see that I am getting, on average, more errors than pages: 95 pages, 119 errors, 1.0 pages/s, 63 kb/s. Is there a way to find out what the errors are? Look in the tasktracker logs. Typically they're max delays exceeded. I recently increased the default for this parameter, which helps a lot. Doug
Re: NDFS / WebDB Question
Thomas Delnoij wrote: So, say I want to set up a machine as a DataNode that has two or more disks, do I have to configure and set up a DataNode daemon for every disk? How else could I use all disks if the ndfs.data.dir property only accepts one path (assuming I don't want to rely on MS Windows' dynamic discs or similar OS-specific features)? You can list multiple paths in ndfs.data.dir. Paths which do not exist are ignored. Doug
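A hedged sketch of such a configuration in nutch-site.xml, assuming the list is comma-separated (that is how later Hadoop handles the equivalent dfs.data.dir; confirm the separator for your NDFS version), with one directory per physical disk (the /disk1 and /disk2 paths are just examples):
<property>
  <name>ndfs.data.dir</name>
  <value>/disk1/ndfs/data,/disk2/ndfs/data</value>
</property>
The datanode then spreads blocks across the listed directories, and, as noted above, any path that does not exist is simply ignored.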
Re: Crawl auto updated in nutch?
Håvard W. Kongsgård wrote: - I want to index about 50 – 100 sites with lots of documents. Is it best to use the Intranet Crawling or the Whole-web Crawling method? The intranet style is simpler and hence a good place to start. If it doesn't work well for you then you might try the whole-web style. - Is the crawl auto-updated in nutch, or must I run a cron task? It is not auto-updated. Doug
Re: Fetcher url sorting
Matt Zytaruk wrote: Indeed, that does work, although that ends up slowing down the fetch a fair amount because a lot of threads end up idle, waiting, and I was hoping to avoid that slowdown if possible. What should these threads be doing? If you have a site with N pages to fetch, and you want to fetch them all politely, then it will take at least fetcher.server.delay*N to fetch them all. The fetch list is sorted by the hash of the url, so accesses to each host should be spread fairly evenly through the list. Capping the number of pages per host (generate.max.per.host) will help, or, if you know the webmasters in question, you can consider increasing fetcher.threads.per.host. Doug
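To make that arithmetic concrete: assuming the default fetcher.server.delay of about 5 seconds (check nutch-default.xml for your version), a single host with 10,000 queued pages needs at least 10,000 x 5 s, roughly 14 hours, to fetch politely, no matter how many fetcher threads you run; the extra threads can only wait.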
Re: Fetcher url sorting
Matt Zytaruk wrote: Well, if we want to fetch pages from N different sites, ideally we should be able to have N threads running, without any of them having to wait. I guess ideally what the fetcher should probably do is instead of waiting, put the url it was trying to fetch back into the queue to be tried later on, and grab a different one. The fetcher used to do this, and it ended up with huge queues. We capped the size of the queues, and dropped urls when their queue was full. But the fetcher still spent an age at the end, mostly idle, with a single thread emptying its queue. And there were some bugs in the queue synchronization that caused things to sometimes hang, but no one could ever figure out why. So the current fetcher's strategy is to, instead of queuing urls in order to drop them later, drop them now. And instead of queuing urls in order to wait later, wait now. It makes things a lot simpler. In the end the performance is similar, but you can see the cost of crawling big sites immediately, rather than only later. In either case you need to choose to drop things or run slowly. I'm not so sure that accesses to each host are spread evenly throughout the list, because the fetch list I was doing had tens of thousands of different hosts and I was still getting a large amount of threads trying to access the same host at the same time, even with only 50 threads. Although maybe I'm wrong and that is how it would act if the hosts were spread evenly throughout, I'm not sure, it just seems like a lot. They're not spread exactly evenly, but randomly, which can be a bit lumpy. What percentage of urls in the fetch list are from a host that is exceeding max delays? If it is near 2%, or that host is slower than average, then you'll probably have issues with 50 threads. Doug
Re: Merging many indexes
Ben Halsted wrote: I'm getting the dreaded Too many open files error. I've checked my system settings for file-max:
$ cat /proc/sys/fs/file-nr
2677 1945 478412
$ cat /proc/sys/fs/file-max
478412
What does 'ulimit -n' print? Look in /etc/security/limits.conf to increase the limit. What would be the best way to work around (or fix) this? Merging 10 indexes at a time and then merging the results down until I get just one index? Yes. You can decrease indexer.mergeFactor to make this happen. Perhaps we should decrease the default. With the addition of crc files, the number of open files is doubled. So 50 indexes with 10 open files each yields 1000 open files, and the JVM itself needs more than 24, which puts you over a typical 1024 per-process limit. So I guess the default should be decreased to 30 or so. What about the dedup process? It seems to be able to manage the 100+ indexes fine, but if I switch the process and merge the indexes first and then remove dupes, I think it may speed up the process. Ideas? Then you end up with dupes still taking space in your final index, which is not optimal for search. Doug
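If 'ulimit -n' shows the usual 1024, raising the per-user limit in /etc/security/limits.conf looks roughly like this (the user name 'nutch' is only an example; pam_limits must be enabled and you need to log in again for the change to take effect):
nutch  soft  nofile  16384
nutch  hard  nofile  16384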
Re: merging auto-crawls
Ben Halsted wrote: I've modified the auto-crawl to always use a pre-existing crawldb. If I run it multiple times I get multiple linkdb, segments, indexes, and index directories. Is it possible to merge the results using the bin/nutch commands? You should also have it use a single linkdb. Then use 'bin/nutch dedup' and 'bin/nutch merge' across both indexes directories to create a new index with everything. Doug
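A very rough sketch of that sequence, assuming the 0.8-era command syntax (the exact arguments to dedup and merge vary between versions, so check the usage messages printed by 'bin/nutch dedup' and 'bin/nutch merge' before relying on this; crawl-a and crawl-b stand for the output directories of two separate auto-crawl runs):
# mark duplicates across all per-crawl indexes
bin/nutch dedup crawl-a/indexes crawl-b/indexes
# merge them into a single index for the web ui
bin/nutch merge crawl-merged/index crawl-a/indexes crawl-b/indexes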
Re: Filesystem structure for the web front-end.
Ben Halsted wrote: I was wondering what the required file structure is for the web gui to work properly. Are all of these required?
/db/crawldb
/db/index
/db/indexes
/db/segments
/db/linkdb
The indexes directory is not used when a merged index is present. The crawldb and segments/*/crawl_parse directories are not used by the web ui. Also -- What is the proper way to merge segments and indexes? Can I simply move segments all into one directory then re-index it, or is there a better way? You should update the linkdb so that it contains links from all segments. Then you can use the dedup and merge commands to create a new index. Ideally you should also re-index after updating the linkdb, but this is not required. Doug
Re: merging auto-crawls
Ben Halsted wrote: When I merge this stuff, do I need to merge the segments/* for each crawl into a single segments directory? Or is there data in the merged index file that will direct the web component to the correct segment? Put the segments in a single directory. The index only has the segment name, not its full path. Please keep folks on the list updated as to how this works for you. I have not yet used things in this way with the mapred branch, but it is a common use case. Perhaps we can add a 'crawl more' option to the crawl command that automates this. Doug
Re: sorting on multiple fields
James Nelson wrote: I need to sort the search results on two fields for a project I'm working on, but nutch only seems to support sorting on one. I'm wondering if I missed something and there is actually a way or if there is a reason for restricting sort to one field that I'm not aware of. Sorting results by multiple fields is not yet supported in Nutch, but would not be too hard to add, since Lucene supports it. Doug
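For reference, at the Lucene level a two-field sort is just a Sort built from several SortFields, along these lines (a sketch using plain Lucene classes of that era; wiring it into Nutch's searcher and query path is the part that would need new code, and the "site" field is only an example):

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// sort first by an untokenized "site" field, then by relevance score
Sort sort = new Sort(new SortField[] {
  new SortField("site"),
  SortField.FIELD_SCORE
});
// hits = searcher.search(luceneQuery, sort);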
Re: Which fields can you call via detail.getvalue(....) out of the box?
The explain page lists all stored fields by calling the toHtml() method of HitDetails. You can also list things with:
for (int i = 0; i < detail.getLength(); i++) {
  String field = detail.getField(i);
  String value = detail.getValue(i);
  ...
}
Doug Byron Miller wrote: I'm looking to see if I can pull a meta description in lieu of summary for some content and wondering if this is indexed - is there an easy way to see the fields indexed by default and how they're exposed through the nutch bean?
Re: mapred error on windows
It looks like you are using ndfs but not running any datanodes. An ndfs filesystem requires one namenode and at least one datanode, typically a large number running on different machines. Look at the bin/start-all.sh script for an example of what is started in a typical mapred/ndfs deployment. Doug Kashif Khadim wrote: I am unable to crawl with mapred on windows. I get this error after I run: bin/nutch crawl urls
Error:
051030 004819 parsing file:/C:/nutch/mapred/conf/nutch-default.xml
051030 004819 parsing file:/C:/nutch/mapred/conf/nutch-site.xml
051030 004819 Server listener on port 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004822 Server connection on port 9009 from 80.139.7.173: starting
051030 004914 Server connection on port 9009 from 80.139.7.173: starting
051030 004931 While choosing target, totalMachines is 0
051030 004931 Target-length is 0, below MIN_REPLICATION (1)
051030 004931 Server handler on 9009 call error: java.io.IOException: Cannot create file /tmp/nutch/mapred/system/submit_ogykqp/job.xml
java.io.IOException: Cannot create file /tmp/nutch/mapred/system/submit_ogykqp/job.xml
 at org.apache.nutch.ndfs.NameNode.create(NameNode.java:98)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:582)
 at org.apache.nutch.ipc.RPC$1.call(RPC.java:187)
 at org.apache.nutch.ipc.Server$Handler.run(Server.java:198)
051030 004934 Server connection on port 9009 from 80.139.7.173: exiting
Thanks.
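As a sketch, a minimal single-machine ndfs setup would start both daemons locally before running the crawl, using the same nutch-daemon.sh mechanism mentioned elsewhere in this digest (adjust fs.default.name in nutch-site.xml to point at the namenode; verify the daemon names against your start-all.sh):
bin/nutch-daemon.sh start namenode
bin/nutch-daemon.sh start datanode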
Re: fetch questions - freezing
Ken van Mulder wrote: Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher gets progressively slower though, dropping down to ~15 pages/s after about 2-3 hours, and continues to slow down. I've seen a few references on these lists to the issue, but I'm not clear on whether it's expected behaviour or whether there's a solution to it. I've also noticed that the process takes up more and more memory as it runs; is this expected as well? What parse plugins do you have enabled? The best way to diagnose these problems is to 'kill -QUIT' an offending fetcher process. This will dump the stack of every fetcher thread. This will likely look quite different at the start of your run than later in the run, and that difference should point to the problem. In the past I have seen these symptoms primarily with parser plugins. I have also seen threads hang infinitely in a socket read, but that is much rarer. Doug
Re: Peak index performance
Byron Miller wrote: For example, I've been tweaking max merge/min merge and such, and I've been able to double my performance without increasing anything but CPU load. Smaller maxMergeDocs will cost you in the end, since these will eventually be merged during the index optimization at the end. I would just leave this at Integer.MAX_VALUE. Larger minMergeDocs will improve performance, but by using more heap. So watch your heap size as you increase this and leave a healthy margin for safety. This is the best way to tweak indexing performance. Larger mergeFactors may improve performance somewhat, but by using more file handles. In general, the maximum number of file handles is around 10-20x (depending on plugins) the mergeFactor. So raising this above 50 on most systems is risky, and the performance improvements are marginal, so I wouldn't bother. Doug
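As an illustration, the knob Doug recommends corresponds to properties along these lines in nutch-site.xml (property names as I recall them from nutch-default.xml of that period; double-check yours, and the value 500 is just an example):
<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value>
</property>
<!-- leave indexer.maxMergeDocs at Integer.MAX_VALUE as suggested above,
     and keep indexer.mergeFactor modest, e.g. 10-50 -->
Raising indexer.minMergeDocs buffers more documents in RAM before a segment is written, so watch the JVM heap as you increase it.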
Re: fetch questions - freezing
Ken Krugler wrote: We're only using the html text parsers, so I don't think that's the problem. Plus we're dumping the thread stack when it hangs, and it's always in the ChunkedInputStream.exhaustInputStream() call (see trace below). The trace did not make it. Have you tried protocol-http instead of protocol-httpclient? Is it any better? What JVM are you running? I get fewer socket hangs in 1.5 than 1.4. Also, the mapred fetcher has been changed to succeed even when threads hang. Perhaps we should change the 0.7 fetcher similarly? I think we should probably go even farther, and kill threads which take longer than a timeout to process a url. Thread.stop() is theoretically unsafe, but I've used it in the past for this sort of thing and never traced subsequent problems back to it... Doug
Re: fetch questions - freezing
Ken van Mulder wrote: As a side note, does anyone have any recommendations for profiling software? I've used the standard hprof, which slows down the process too much for my needs, and jmp, which seems pretty unstable. I recommend 'kill -QUIT' as a poor-man's profiler. With a few stack dumps you can usually get a decent idea of where the time is going. If you want to get fancy you can 'kill -QUIT' every minute or so, then use 'sort | uniq -c | sort -nv' to see where you're spending a lot of time. Doug
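A throwaway version of that poor-man's profiler might look like this, assuming the fetcher runs as a single local JVM whose stdout/stderr goes to fetcher.log (the pgrep pattern and log file name are only examples):
PID=`pgrep -f org.apache.nutch.fetcher.Fetcher`
# take a thread dump once a minute; stop it with Ctrl-C when you have enough
while true; do kill -QUIT $PID; sleep 60; done
# later, count the hottest stack frames across the accumulated dumps
grep 'at org.apache' fetcher.log | sort | uniq -c | sort -nr | head -20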
Re: Peak index performance
Byron Miller wrote:
<property>
  <name>indexer.mergeFactor</name>
  <value>350</value>
  <description></description>
</property>
Initially a high index merge factor caused out-of-file-handle errors, but increasing the others along with it seemed to help get around that. That is a very large mergeFactor, larger than I would recommend. How many documents do you index in a run? More than 350*500=175,000? If not then you're not hitting a merge yet. What does 'ulimit -n' show? Does your performance actually change much when you lower this? Doug
Re: crawl problems
The only link on http://shopthar.com/ to the domain shopthar.com is a link to http://shopthar.com/. So a crawl starting from that page that only visits pages in shopthar.com will only find that one page.
% wget -q -O - http://shopthar.com/ | grep shopthar.com
<tr><td colspan=2>Welcome to shopthar.com</td></td></tr>
<a href=http://shopthar.com/>shopthar.com</a> |
Doug
Earl Cahill wrote: I am trying to do a crawl on trunk of one of my sites, and it isn't working. I make a file urls that just contains the site:
http://shopthar.com/
In my conf/crawl-urlfilter.txt I have:
+^http://shopthar.com/
I then do:
bin/nutch crawl urls -dir crawl.test -depth 100 -threads 20
It kicks in and I get repeating chunks like:
051019 010450 Updating /home/nutch/nutch/trunk/crawl.test/db
051019 010450 Updating for /home/nutch/nutch/trunk/crawl.test/segments/20051019010449
051019 010450 Finishing update
051019 010450 Update finished
051019 010450 FetchListTool started
051019 010450 Overall processing: Sorted 0 entries in 0.0 seconds.
051019 010450 Overall processing: Sorted NaN entries/second
051019 010450 FetchListTool completed
051019 010450 logging at INFO
for ages, but I only see two nutch hits in my access log: one for my robots.txt and one for my front page. Nothing else. The crawl finishes, then I do a search and can only get hits for the front page. When I do the search via lynx, I get a momentary Bad partial reference! Stripping lead dots. I can't imagine this is really the problem, but pretty well all my links are relative. I mean, nutch has to be able to follow relative links, right? Ideas? Thanks, Earl
Re: Nutch Search Speed Concern
TL wrote: You mentioned that as a rule of thumb each node should only have about 20M pages. What's the main bottleneck that's encountered around 20M pages? Disk i/o , cpu speed? Either or both, depending on your hardware, index, traffic, etc. CPU-time to compute results serially can average up to a second or more with ~20M page indexes. And the total amount of i/o time per query on indexes this size can be more than a second. If you can spread the i/o over multiple spindles then it may not be the bottleneck. Doug
Re: Nutch Search Speed Concern
Murray Hunter wrote: We tested search for a 20 million page index on a dual-core 64-bit machine with 8 GB of RAM, storing the nutch data on another server through Linux NFS, and its performance was terrible. It looks like the bottleneck was NFS, so I was wondering how you had your storage set up. Are you using NDFS, or is it split up over multiple servers? For good search performance, indexes and segments should always reside on local volumes, not in NDFS and not in NFS. Ideally these can be spread across the available local volumes, to permit more parallel disk i/o. As a rule of thumb, searching starts to get slow with more than around 20M pages per node. Systems larger than that should benefit from distributed search. Doug
Re: Do you believe in Clause sanity?
Andy Lee wrote: Not to become a one-person thread or anything (and I'll shut up if this attempt gets no answers), but this seems like a straightforward question. Is there some design principle I'm missing that would be violated if clauses could be removed from a query? No, not that I can think of. The public constructors for Query are limited in order to prohibit certain things that are not yet supported, like optional and nested clauses. Doug
Re: Do you believe in Clause sanity?
Andy Lee wrote: Thanks, Doug. In that case, please consider this a request for a couple of API changes which you may be planning anyway:
* addClause() and removeClause() methods in Query.
* Setters in Query.Clause for its term/phrase.
Please submit a bug report, ideally with a patch file attached. Doug
Re: Unlimited access to a web server for Nutch
Ngoc Giang Nguyen wrote: I'm running Nutch to crawl some specific websites whose web admins I know personally. So is there any way to change the settings of the target web servers such that they give my Nutch higher priority, let's say unlimited access, assuming they are all Apache servers? Usually I observe that Nutch hits a lot of HTTP max-delay errors even when I set the timeout quite large and the network connections are near perfect (I also double-checked by visiting those websites in a browser, and they respond well). Try something like fetcher.server.delay=0 and fetcher.threads.per.host=10. Doug
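In nutch-site.xml that would look something like the following (the values are only a starting point for servers you control; being this aggressive against servers you don't control is impolite):
<property>
  <name>fetcher.server.delay</name>
  <value>0.0</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>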
Re: a simple map reduce tutorial
Earl Cahill wrote: 1. Sounds like some of you have some glue programs that help run the whole process. Are these going to end up in subversion sometime? I am guessing there is much duplicated effort.
I'm not sure what you mean. I set environment variables in my .bashrc, then simply use 'bin/start-all.sh' and 'bin/nutch crawl'.
2. Not sure how to test that my index actually worked. Starting catalina in my index directory didn't work this time.
NutchBean now looks for things in the subdirectory of the connected directory named 'crawl'. Is that an improvement or is it just confusing?
3. What do you all think of setting up some test directories to crawl, in say http://lucene.apache.org/nutch/test/ Thinking it would be kind of cool to have junit run through a whole process on external pages.
I think it would be better to have the junit tests start jetty then crawl localhost. I'd love to see some end-to-end unit tests like that.
4. Any way that http://spack.net/nutch/SimpleMapReduceTutorial.html http://spack.net/nutch/GettingNutchRunningOnUbuntu.html can get on the wiki? I am using apache-ish style and would change to whatever, but as fun as these are to write, I would like to see them used.
You should be able to add them to the wiki yourself. Just fill out: http://wiki.apache.org/nutch/UserPreferences Thanks, Doug
Re: mapred Sort Progress Reports
Rod Taylor wrote: Tell me how it behaves during the sort phase. I ran 8 jobs simultaneously. Very high await time (1200) and it was doing about 22MB/sec data writes. Nearly 0 reads from disk (everything would be cached in memory). This is during the sort part? This first writes a big file, then reads it, then sorts it. With 20M records I think the file is around 2.5GB, so eight of these would be 20GB. Do you have 20GB of RAM? Doug
Re: How to get real Explanation instead of crippled HTML version?
Ilya Kasnacheev wrote: So I only get the HTMLised version, which is useless if I need only the page rating (the top Explanation.getValue()). How would I get the page rating (i.e. a number from 0 to 1 showing how relevant a Hit was to a Query) from nutch? Explanations are not a good way to get this, as, for each explanation, the query must be re-executed. In recent versions of Nutch the score can be retrieved from a hit with ((FloatWritable)hit.getSortValue()).get(). Doug
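A small sketch of that in code, assuming a default relevance-sorted search where the sort value is the score (method names and the FloatWritable package are per the NutchBean/Hits API as I recall it; verify against your version):

import org.apache.nutch.io.FloatWritable;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.Hits;

Hits hits = bean.search(query, 10);
for (int i = 0; i < hits.getLength(); i++) {
  Hit hit = hits.getHit(i);
  float score = ((FloatWritable) hit.getSortValue()).get();  // relevance score for this hit
  // ...
}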
Re: MapReduce
Paul van Brouwershaven wrote: The AcceptEnv option is only available with OpenSSH 3.9; Debian currently only has 3.8.1p1 in stable and testing (4.2 in unstable). Is there another way to solve the environment problem? I don't know. The Fedora and Debian systems that I use have AcceptEnv. Doug
Re: mapred Sort Progress Reports
Rod Taylor wrote: I see. Is there any way to speed up this phase? It seems to be taking as long to run the sort phase as it did to download the data. It would appear that nearly 30% of the time for the nutch fetch segment is spent doing the sorts, so I'm well off the 20% overhead number you seem to be able to achieve for a full cycle. 5 machines (4 CPUs) each with 8 tasks, with a load average of about 5, and they run Redhat. Context switches are low (under 1500/second). There is virtually no IO (boxes have plenty of ram) but the kernel is doing a bunch of work, as 50% of CPU time is in system (unsure what; I'm not familiar with the Linux DTrace-type tools). Sorting is usually i/o bound on mapred.local.dir. When eight tasks are using the same device this could become a bottleneck. Use iostat or sar to view disk i/o statistics. My plan is to permit one to specify a list of directories for mapred.local.dir and have the sorting (and everything else) select randomly among these for temporary local files. That way all devices can be used in parallel. As a workaround you could try starting eight tasktrackers, each configured with a different device for mapred.local.dir. Yes, that's a pain, but it would give us an idea of whether my analysis is correct. Doug
Re: mapred Sort Progress Reports
Rod Taylor wrote: Virtually no IO reported at all. Averages about 200kB/sec read and writes are usually 0, but burst to 120MB/sec for under 1 second once every 30 seconds or so. That's strange. I wonder what it's doing. Can you use 'kill -QUIT' to get a thread dump? Try a few of these to sample the stack and see where it seems to be spending time. Doug
Re: mapred Sort Progress Reports
Try the following on your system:
bin/nutch org.apache.nutch.io.TestSequenceFile -fast -count 2000 -megabytes 100 foo
Tell me how it behaves during the sort phase. Thanks, Doug
Re: MapRed - how can I get the fetcher logs?
Gal Nitzan wrote: I only have two log files:
-rw-r--r-- 1 root root 8090 Oct 3 07:01 nutch-root-jobtracker-kunzon.log
-rw-r--r-- 1 root root 4290 Oct 3 07:01 nutch-root-namenode-kunzon.log
The tasktracker logs would be on the machines running the tasktracker, which might be different than your namenode and jobtracker. Also note that the jobtracker's web interface shows summary statistics for each fetcher task. Doug