Re: Sorting in nutch-webinterface - how?
Stefan Neufeind wrote: Can you maybe also help me out with sort=title? Lucene's sorting works with indexed, non-tokenized fields. The title field is tokenized. If you need to sort by title then you'd need to add a plugin that indexes another field (e.g., sortTitle) containing the un-tokenized title, perhaps lowercased, if you want case-independent sorting. http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Sort.html Doug
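For illustration, a minimal Lucene-level sketch of that idea (not a complete Nutch indexing plugin); it assumes the Lucene 1.9-era Field/Sort API that 0.8 ships with, and "sortTitle" is just the example field name from above:

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;

public class TitleSortSketch {
  // Indexing side: add an un-tokenized (here also lowercased) copy of the title,
  // since Lucene can only sort on indexed, non-tokenized fields.
  public static void addSortTitle(Document doc, String title) {
    doc.add(new Field("sortTitle", title.toLowerCase(),
                      Field.Store.NO, Field.Index.UN_TOKENIZED));
  }

  // Search side: ask Lucene to sort hits on that field.
  public static Hits searchByTitle(IndexSearcher searcher, Query query) throws IOException {
    return searcher.search(query, new Sort("sortTitle"));
  }
}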
Re: .job file?
The .job file is a jar file for submission to Hadoop's MapReduce. It is Hadoop-specific, although very similar to war and ear files. Teruhiko Kurosaka wrote: Nutch's top-level build.xml file's default target is job, and it builds a zip file called nutch-0.8-dev.job. <project name="Nutch" default="job"> ... <target name="job" depends="compile"> <jar jarfile="${build.dir}/${final.name}.job"> <zipfileset dir="${build.classes}"/> <zipfileset dir="${conf.dir}" excludes="*.template"/> <zipfileset dir="${lib.dir}" prefix="lib" includes="**/*.jar" excludes="hadoop-*.jar"/> <zipfileset dir="${build.plugins}" prefix="plugins"/> </jar> </target> I've heard of .jar, .war, and .ear files, but not .job files. What is this? What (application servers?) are supposed to understand .job files? Is this part of the new J2EE spec? -kuro
0.8 release soon?
Andrzej Bialecki wrote: 0.8 is pretty stable now, I think we should start considering a release soon, within the next month's time frame. +1 Are there substantial features still missing from 0.8 that were supported in 0.7? Are there any showstopping bugs, things that worked in 0.7 that are broken in 0.8? Doug
Re: Can't access nightly build nutch 0.8
The nightly build is not mirrored. It is only available from cvs.apache.org, which has been down, but is now up. http://cvs.apache.org/dist/lucene/nutch/nightly/ Note that no nightly build was done last night, since Subversion was down. Doug Michael Plax wrote: I tried randomly some of them (~10). I will try again. Thank you, Michael - Original Message - From: Jérôme Charron [EMAIL PROTECTED] To: nutch-user@lucene.apache.org Sent: Thursday, May 11, 2006 12:39 PM Subject: Re: Can't access nightly build nutch 0.8 I'm trying (5/10-5/11) to download nightly build of nutch but I get The page cannot be displayed. Do you have some catalina logs? Oups... sorry... you get a page cannot be displayed while loading? (if so, forgot my previous message ... there is currently some problems with some apache servers... try a mirror) Jérôme
Re: MultiSearcher skewed IDF values
Andrzej Bialecki wrote: Unfortunately, this is still an existing problem, and neither Nutch nor Lucene does the right job here. Please see NUTCH-92 for more information, and a sketch of a solution for this issue. Lucene's MultiSearcher now implements this correctly, no? But Nutch's distributed search does not. Two round trips to each node are required: the first to get IDF information for the query, and the second to get hits. Doug
Re: Problem with sorting index
It sounds like you're sorting a segment index after dedup, rather than a merged index. It also looks like there's a bug in IndexSorter. But you should be able to work around it by merging your segment indexes after deduping, so there are no deletions. Please file a bug in Jira. Doug Michael wrote: When i'm trying to use IndexSorter, i'm getting this error: Exception in thread main java.lang.IllegalArgumentException: attempt to access a deleted document at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:282) at org.apache.lucene.index.FilterIndexReader.document(FilterIndexReader.java:104) at org.apache.nutch.indexer.IndexSorter$SortingReader.document(IndexSorter.java:170) at org.apache.lucene.index.SegmentMerger.mergeFields(SegmentMerger.java:186) at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88) at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:579) at org.apache.nutch.indexer.IndexSorter.sort(IndexSorter.java:240) at org.apache.nutch.indexer.IndexSorter.main(IndexSorter.java:291) Anyone knows how to fix this? Michael
Re: Admin Gui beta test (was Re: ATB: Heritrix)
Andrzej Bialecki wrote: I think it should be possible to put your binary at the Apache site, probably Doug will be the right person to talk to ... Have you tried attaching it to a Jira issue? If that fails, you could attach it to a page on the Wiki, no? Doug
Re: java.io.IOException: No input directories specified in
Chris Fellows wrote: I'm having what appears to be the same issue on 0.8 trunk. I can get through inject, generate, fetch and updatedb, but am getting the IOException: No input directories on invertlinks and cannot figure out why. I'm only using nutch on a single local windows machine. Any ideas? Configuration has not changed since checking out from svn. The handling of Windows pathnames is still buggy in Hadoop 0.1.1. You might try replacing your lib/hadoop-0.1.1.jar file with the latest Hadoop nightly jar, from: http://cvs.apache.org/dist/lucene/hadoop/nightly/ The file name code has been extensively re-written. The next Hadoop release (0.2), containing these fixes, will be made in around a week. Doug
Re: How to get Text and Parse data for URL
NutchBean.getContent() and NutchBean.getParseData() do this, but require a HitDetails instance. In the non-distributed case, the only required field of the HitDetails for these calls is url. In the distributed case, the segment field must also be provided, so that the request can be routed to a node serving that segment. These are implemented by FetchedSegments.java and DistributedSearch.java. Doug Dennis Kubes wrote: Can somebody direct me on how to get the stored text and parse metadata for a given url? Dennis
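For illustration, a rough sketch of the non-distributed case described above against the 0.8-dev API; the NutchBean(Configuration) constructor and the HitDetails(String[] fields, String[] values) constructor are assumptions to be checked against your sources:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.ParseData;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.util.NutchConfiguration;

public class ContentForUrl {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);
    // Non-distributed case: a HitDetails carrying only the url field is enough.
    HitDetails details =
        new HitDetails(new String[] { "url" }, new String[] { args[0] });
    byte[] content = bean.getContent(details);        // raw fetched content
    ParseData parseData = bean.getParseData(details); // parse metadata (outlinks, headers, ...)
    System.out.println(parseData);
    System.out.println(content.length + " bytes of content");
  }
}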
Re: How to get Text and Parse data for URL
Dennis Kubes wrote: I think that I am not fully understanding the role the segments directory and its contents play. A segment is simply a set of urls fetched in the same round, and data associated with these urls. The content subdirectory contains the raw http content. The parse-text subdirectory contains the extracted text, used when indexing and when building snippets for hits. The index subdirectory holds a Lucene index of the pages in the segment. Etc. It is an independent chunk of Nutch data. In 0.8, each segment subdirectory is further split into parts, the result of distributed processing. The parts are split by the hash of the url. Does that help? Doug
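As a rough picture of what this describes (the segment name is a placeholder timestamp; exact subdirectory names and part numbering should be checked against your version):

segments/20060424103456/        one fetch round
  content/       raw http content
  parse-text/    extracted text, used for indexing and for building snippets
  index/         Lucene index of the pages in this segment
  ...            other per-page data (fetch status, parse metadata)

In 0.8, each of these subdirectories is further split into parts (e.g. part-0, part-1, ...) by the hash of the url, the result of distributed processing.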
Re: java.io.IOException: Cannot create file
[EMAIL PROTECTED] wrote: First question. Updatedb won't run against the segment so what can I do to salvage it? Is the segment salvageable? Probably. I think you're hitting some current bugs in DFS and MapReduce. Once these are fixed, then your updatedb's should succeed! Second question, should I raise an issue in JIRA quoting the errors below? Yes, please. *** Excerpt from hadoop-site.xml <property> <name>mapred.system.dir</name> <value>/home/nutch/hadoop/mapred/system</value> </property> Unlike the other paths, mapred.system.dir is not a local path, but a path in the default filesystem, dfs in your case. Your setting is fine, I just thought I'd mention that. Timed out. java.io.IOException: Task process exit with nonzero status of 143. These 143's are a mystery to me. We really need to figure out what is causing these! One suggestion I found on the net was to try passing '-Xrs' to java, i.e., setting mapred.child.java.opts to include it. Another idea is to put 'ulimit -c unlimited' in one's conf/hadoop-env.sh, so that these will cause core dumps. Then, hopefully, we can use gdb to see where the JVM crashed. I have not had time recently to try either of these on a cluster, the only place where this problem has been seen. java.rmi.RemoteException: java.io.IOException: Cannot create file /user/root/crawlA/segments/20060419162433/parse_text/part-5/data on client DFSClient_task_r_poobc6 This bug is triggered by the previous bug. In the first case the output is started, then the task jvm crashes. But DFS waits a minute before it will let another task create a file with the same name (to time out the other writer). So if the replacement task starts within a minute, then this error is thrown. I think Owen is working on a patch for this which will make DFSClient try to open the file for at least a minute before throwing an exception. We should have that committed today. This won't fix the 143's, but should allow your jobs to complete in spite of them. Thanks for your patience, Doug
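For example, a hedged sketch of the two workarounds mentioned (the -Xmx value is only a placeholder; check the default for your version). In hadoop-site.xml:

<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m -Xrs</value>
</property>

and in conf/hadoop-env.sh:

ulimit -c unlimited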
Re: java.io.IOException: Cannot create file
[EMAIL PROTECTED] wrote: Actually, I think that updatedb won't run because the fetched segment didn't complete correctly. Don't know whether the instructions in the 0.7 FAQ apply: %touch /index/segments/2005somesegment/fetcher.done Ah. That's different. No, the 0.7 trick probably won't work. What errors are you seeing from this? I'd expect you'd see unexpected eof in the updatedb map task, for truncated outputs in this segment. http://issues.apache.org/jira/browse/HADOOP-153 would fix that, once implemented. Doug
Re: Using Nutch's distributed search server mode
Scott Simpson wrote: I don't quite understand how to set up distributed searching with relation to DFS (and the Tom White documents don't discuss this either). There are three databases with relation to Nutch: 1. Web database (dfs) 2. Segments (regular fs) 3. The index (regular fs) From your message above, I assume that the segments and index go in the regular file system and the web database is distributed across dfs. We put only a portion of the segments and index on each node and the search is distributed from Tomcat to all the nodes at once. If we don't use DFS for the segments and index, we'll lose the redundancy if a node is dead and we may lose search results. Is this true? The distributed search code is currently a bit neglected. It doesn't yet take advantage of MapReduce. The best way to use it today is to keep the master copy of your segments and indexes in dfs, then, when you're (manually) starting distributed search servers, copy segments and indexes from dfs to temporary local storage and start the distributed search servers against those. Then construct a search-servers.txt that will be picked up by NutchBean to construct the DistributedSearch.Client. Long-term, I think we should automate this by having a distributed search MapReduce task. Each task will start by copying required data to local disk, starting a search server on that data, then reporting that search server back through the job tracker. Currently this can be done by setting the task's status to be the host:port string of the search server, then calling getMapTaskReports() to get the host:port of all servers. The map task can then simply loop forever doing nothing. If a search server dies, then the MapReduce system will automatically start a new one. To launch a new version of the index, start a new such MapReduce job, and, once it is running, switch the DistributedSearch.Client to use its servers and kill the old job. The temporary space will be reclaimed when the job is killed. One will have to be sure that the number of input splits naming search server tasks is no greater than numNodes*mapred.tasktracker.tasks.maximum, so that all of the tasks will run simultaneously. But none of that's implemented yet! Doug
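As a sketch of the manual procedure described above (host names, port and paths are made up; the 'bin/nutch server <port> <dir>' form and the one-host-port-per-line search-servers.txt format should be checked against your version):

# On each search node, after copying its share of the segments and indexes
# from dfs to local disk (e.g. under /data/local/search), start a server:
bin/nutch server 9999 /data/local/search

# On the front end, list the servers in search-servers.txt, which NutchBean
# uses to build the DistributedSearch.Client -- one "host port" per line:
node1.example.com 9999
node2.example.com 9999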
Re: nutch user meeting in San Francisco: May 18th
Folks can say whether they'll attend at: http://www.evite.com/app/publicUrl/[EMAIL PROTECTED]/nutch-1 Doug
Re: Using Nutch's distributed search server mode
Shawn Gervais wrote: I was not able to use the literal instructions, as my indexes and segments are in DFS while the document presumes a local filesystem installation Search performance is not good with DFS-based indexes and segments. This is not recommended. Distributed search is not meant for a single merged index, but rather for searching multiple indexes. With distributed search, each node will typically have (a local copy of) a few segments and either a merged index for just those segments, or separate indexes for each segment. When I examine the search results I see many duplicate results. Looking at it further it seems like the results of performing the same search across all 16 nodes is being combined into one result set - duplicates and all. I can only assume that I need to somehow partition my index or segments, but I'm unsure how to do that. It looks like you're searching the same dfs-resident index 16 times. Doug
Re: java.net.SocketTimeoutException: Read timed out
Elwin wrote: When I use the httpclient.HttpResponse to get http content in nutch, I often get SocketTimeoutExceptions. Can I solve this problem by enlarging the value of http.timeout in conf file? Perhaps, if you're working with slow sites. But, more likely, you're using too many fetcher threads and exceeding your available bandwidth, causing threads to starve and timeout. Doug
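For reference, both knobs mentioned above live in conf/nutch-site.xml; the values below are only examples:

<property>
  <name>http.timeout</name>
  <value>30000</value><!-- milliseconds -->
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>20</value>
</property>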
Re: Question about crawldb and segments
Jason Camp wrote: Unfortunately in our scenario, bw is cheap at our fetching datacenter, but adding additional disk capacity is expensive - so we are fetching the data and sending it back to another cluster (by exporting segments from ndfs, copy, importing). But to perform the copies, you're using a lot of bandwidth to your indexing datacenter, no? Copying segments probably takes almost as much bandwidth as fetching them... I know this sounds a bit messy, but it was the only way we could come up with to utilize the benefits of both datacenters. Ideally, I'd love to be able to have all of the servers in one cluster, and define which servers I want to perform which tasks, so for instance we could use the one group of servers to fetch the data, but the other group of servers to store the data and perform the indexing/etc. If there's a better way to do something like this than what we're doing, or if you think we're just insane for doing it this way, please let me know :) Thanks! You can use different sets of machines for dfs and MapReduce, by starting them in differently configured installations. So you could run dfs only in your indexing datacenter, and MapReduce in both datacenters configured to talk to the same dfs, at the indexing datacenter. Then your fetch tasks at the fetching datacenter would write their output to the indexing datacenter's dfs. And parse/updatedb/generate/index/etc. could all run at the other datacenter. Does that make sense? Doug
Re: plugins directory
mikeyc wrote: Any idea how the 'plugins' directory gets populated? I noticed microformats-hreview was not there. It does exist in the build directory with its jar and class files. Could this be the issue? The plugins directory exists in release builds. When developing, plugins live in build/plugins. If you're developing you should generally work from a subversion checkout, not a downloaded release. Doug
Re: How best to debug failed fetch-reduce task
Shawn Gervais wrote: When I have been at the terminal to observe the timed out process before it is reaped, I have seen that it continues to use 100% of a single processor. strace of the java process did not produce any usable leads. When the reduce task is reassigned, either to the same machine or another, it will die around the same percentage completion. Did you try 'kill -QUIT' the process? That should print a stack trace for every thread. Is there an option I can enable somewhere that will allow for more verbose output to be written to the logs? Any other suggestions on debugging this issue? You could add some print statements to FetcherOutputFormat.java, in the RecordWriter.write() method, printing each key (URL) written. That might let you figure out what page is hanging things. It seems to me that it might be possible to take a snapshot of the task while it is running (i.e. data and the task job jar), so that I can debug it in isolation without re-running an entire fetch process. I am unsure of how this might be done, though. Once you know the page (assuming it is deterministic) then you should be able to run a fetch of just that page to test things. Doug
Re: When Nutch fetches using mapred ...
Shawn Gervais wrote: When I perform a search large enough to observe the fetch process for an extended period of time (1M pages over 16 nodes, in this case), I notice there is one map task which performs _very_ poorly compared to the others: 4905 pages, 33094 errors, 3.5 pages/s, 432 kb/s, versus 46639 pages, 13227 errors, 43.9 pages/s, 4547 kb/s, It is deficient in terms of raw pages/sec, execution time (it is the last map task to complete), and the number of errors encountered. As I said, there seems to always be exactly one map task like this. Different fetch executions will have the thread assigned to different machines -- there doesn't seem to be any pattern. What the heck is going on here? My suspicion is that you're trying to fetch a large number of pages from a single site. Fetch tasks are partitioned by host name. All urls with a given host are fetched in a single fetcher map task. Grep the errors from the log on the slow node: I'll bet most are from a single host name. To fix this, try setting generate.max.per.host. A good value might be something like topN/(mapred.map.tasks*fetcher.threads.fetch). So if you're setting -topN to 10M and running with 10 fetch tasks and using 100 threads, then each fetch task will fetch around 1M urls, 10,000 per thread. Fetching a single host is single-threaded, so any host with more than 10,000 urls will slow the overall fetch. Here's another way to think about it: If you're fetching a page/second per host (fetcher.server.delay) and your fetch tasks are averaging around an hour (3600 seconds) then any host which has more than 3600 pages will cause its fetch tasks to run slower than the others and/or to have high error rates. Doug
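For example, plugging in the numbers above, 10,000,000 / (10 * 100) = 10,000, so a hedged starting point in nutch-site.xml would be:

<property>
  <name>generate.max.per.host</name>
  <value>10000</value>
</property>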
Re: lost NDFS blocks following network reorg
Ken Krugler wrote: Anyway, curious if anybody has insights here. We've done a fair amount of poking around, to no avail. I don't think there's any way to get the blocks back, as they definitely seem to be gone, and file recovery on Linux seems pretty iffy. I'm mostly interested in figuring out if this is a known issue (Of course you can't change the server names and expect it to work), or whether it's a symptom of lurking NDFS bugs. It's hard to tell, after the fact, whether stuff like this is pilot error or a bug. Others have reported similar things, so it's either a bug or it's too easy to make pilot errors. So something needs to change. But what? We need to start testing stuff like this systematically. A reproducible test case would make this much easier to diagnose. I'm sorry I can't be more helpful. I'm sorry you lost data. Doug
Re: How to terminate the crawl?
You can limit the number of pages by using the -topN parameter. This limits the number of pages fetched in each round. Pages are prioritized by how well-linked they are. The maximum number of pages that can be fetched is topN*depth. Doug Olena Medelyan wrote: Hi, I'm using the crawl tool in nutch to crawl web starting from a set of URL seeds. The crawl normally finishes after the specified depth was reached. Is it possible to terminate after a pre-defined number of pages or a text data of a pre-defined size (e.g. 500 MB) has been crawled? Thank you for any hints! Regards, Olena
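For example, a crawl limited this way (directory name and numbers are placeholders) fetches at most depth * topN = 3 * 500 pages:

bin/nutch crawl urls -dir crawled -depth 3 -topN 500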
Re: Delete Files from NDFS
Blocks are not deleted immediately. Check back in a while to see that they're actually removed. Doug Dennis Kubes wrote: Is there a way to delete files from the DFS? I used the dfs -rm option, but the data blocks still are there. Dennis
Re: Nutch and Hadoop Tutorial Finished
Dennis Kubes wrote: Here it is for the list, I will try to put it on the wiki as well. Thanks for writing this! I've added a few comments below. Some things are assumed for this tutorial. First, you will need root level access to all of the boxes you are deploying to. Root access should not be required (although it is sometimes convenient). I have certainly run large-scale crawls w/o root. The only way to get Nutch 0.8 Dev as of this writing that I know of is through Subversion. Nightly builds of Nutch's trunk (currently 0.8-dev) are available from: http://cvs.apache.org/dist/lucene/nutch/nightly/ Add a build.properties file and inside of it add a variable called dist.dir with its value as the location where you want to build nutch. So if you are building on a linux machine it would look something like this: dist.dir=/path/to/build This is optional. So log into the master nodes and all of the slave nodes as root. Create the nutch user and the different filesystems with the following commands: mkdir /nutch mkdir /nutch/search mkdir /nutch/filesystem mkdir /nutch/home useradd -d /nutch/home -g users nutch chown -R nutch:users /nutch passwd nutch nutchuserpassword You can of course run things as any user. I always run things as myself, but that may not be appropriate in all environments. First we are going to edit the ssh daemon. The line that reads #PermitUserEnvironment no should be changed to yes and the daemon restarted. This will need to be done on all nodes. vi /etc/ssh/sshd_config PermitUserEnvironment yes This is not required (although it can be useful). If you see errors from ssh when running scripts, then try changing the value of HADOOP_SSH_OPTS in conf/hadoop-env.sh. Once we have the ssh daemon configured, the ssh keys created and copied to all of the nodes we will need to create an environment file for ssh to use. When nutch logs in to the slave nodes using ssh, the environment file creates the environment variables for the shell. The environment file is created under the nutch home .ssh directory. We will create the environment file on the master node and copy it to all of the slave nodes. vi /nutch/home/.ssh/environment .. environment variables Then copy it to all of the slave nodes using scp: scp /nutch/home/.ssh/environment [EMAIL PROTECTED]:/nutch/home/.ssh/environment One can now instead put environment variables in conf/hadoop-env.sh, since not all versions of ssh support PermitUserEnvironment. cd /nutch/search scp -r /nutch/search/* [EMAIL PROTECTED]:/nutch/search Note that, after the initial copy, you can set NUTCH_MASTER in your conf/hadoop-env.sh and it will use rsync to update the code running on each slave when you start daemons on that slave. The first time all of the nodes are started there may be the ssh dialog asking to add the hosts to the known_hosts file. You will have to type in yes for each one and hit enter. The output may be a little weird the first time but just keep typing yes and hitting enter if the dialogs keep appearing. A command like 'bin/slaves.sh uptime' is a good way to test that things are configured correctly before attempting bin/start-all.sh. Thanks again for providing this! Doug
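For illustration, the kind of conf/hadoop-env.sh settings referred to above; the paths, host and the exact NUTCH_MASTER value format are assumptions:

export JAVA_HOME=/usr/lib/j2sdk1.4-sun
# empty the ssh options if your ssh rejects options such as ConnectTimeout
export HADOOP_SSH_OPTS=""
# rsync the code from the master when daemons are started on a slave
export NUTCH_MASTER=nutch@master.example.com:/nutch/search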
Re: Help Setting Up Nutch 0.8 Distributed
Dennis Kubes wrote: localhost:9000: command-line: line 0: Bad configuration option: ConnectTimeout devcluster02:9000: command-line: line 0: Bad configuration option: ConnectTimeout [ ... ] localhost:9000: command-line: line 0: Bad configuration option: ConnectTimeout devcluster02:9000: command-line: line 0: Bad configuration option: ConnectTimeout The launch of the datanodes and tasktrackers failed, since your version of ssh does not support the ConnectTimeout option. Edit conf/nutch-env.sh, and add a 'export HADOOP_SSH_OPTS=' line to remove this option. Doug
Re: Help Setting Up Nutch 0.8 Distributed
Dennis Kubes wrote: : command not foundlaves.sh: line 29: : command not foundlaves.sh: line 32: localhost: ssh: \015: Name or service not known devcluster02: ssh: \015: Name or service not known And still getting this error: 060316 175355 parsing file:/nutch/search/conf/hadoop-site.xml Exception in thread main java.io.IOException: Cannot create file /tmp/hadoop/mapred/system/submit_mmuodk/job.jar on client DFSClient_-913777457 at org.apache.hadoop.ipc.Client.call(Client.java:301) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:141) at org.apache.hadoop.dfs.$Proxy0.create(Unknown Source) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSCli ent.java:587) at org My ssh version is: openssh-clients-3.6.1p2-33.30.3 openssh-server-3.6.1p2-33.30.3 openssh-askpass-gnome-3.6.1p2-33.30.3 openssh-3.6.1p2-33.30.3 openssh-askpass-3.6.1p2-33.30.3 Is it something to do with my slaves file? The \015 looks like a file has a CR where perhaps an LF is expected? What does 'od -c conf/slaves' print? What happens when you try something like 'bin/slaves uptime'? Doug
Re: javascript in summaries [nutch-0.7.1]
Jérôme Charron wrote: I reproduce this with nutch-0.8 with neko html parser (it seems that script tags are not removed). You can switch the html parser implementation to tagsoup. In my tests, all is ok. (property parser.html.impl) Should we switch the default from neko to tagsoup? Are there cases where neko is better? Doug
Re: Question on scalability
Olive g wrote: Is hadoop/nutch scalable at all or I can tune some other parameters? I'm not sure what you're asking. How long does it take to run this on a single machine? My guess is that it's much longer. So things are scaling: they're running faster when more hardware is added. In all cases you're using the same number of machines, but varying parameters and seeing different performance, as one would expect. For your current configuration, indexing appears fastest when the number of reduce tasks equals the number of nodes. I already have: mapred.map.tasks set to 100 mapred.job.tracker is not local mapred.tasktracker.tasks.maximum is 2. and everything else is default. How are you storing things? Are you using dfs? Are your nodes single-cpu or dual-cpu? My guess is single-cpu, in which case you might see more consistent performance with mapred.tasktracker.tasks.maximum=1. How many disks do you have per node? If you have multiple drives, then configuring mapred.local.dir to contain a list of directories, one per drive, might make things faster. Doug
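For example, spreading task-local data over two drives could look like this in hadoop-site.xml (paths are placeholders; the value is a comma-separated list):

<property>
  <name>mapred.local.dir</name>
  <value>/disk1/hadoop/mapred/local,/disk2/hadoop/mapred/local</value>
</property>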
Re: Boolean OR QueryFilter
This looks like a good approach. Note also that you will probably need to change BasicQueryFilter and perhaps other filters to work correctly with optional terms. Nguyen Ngoc Giang wrote: Sorry, I'm a newbie in OS, and I'm not familiar with the way of updating patches :D I'll try to put my solution here first to receive comments from our community. Since we must differentiate 3 possibilities: must have, may have and must not have; we need at least 2 boolean variables in org.apache.nutch.searcher.Query. In fact, these 2 boolean variables are isRequired and isProhibited. - In the first step, I define an OR token separately in the jj file. This will be put before WORD. So it will look like this: <OR: "OR"> - Second, I define a new function called disjunction: void disjunction() : {} { <OR> nonOpOrTerm() } - Third, in the function parse(), I declare a boolean variable disj: boolean disj; - Fourth, inside parse(), once we finished looking ahead, we examine the existence of the OR token: ( LOOKAHEAD ... )? // check OR (disjunction() { disj = true; })* - Finally, I changed the handling portion in parse(): if (stop && field == Clause.DEFAULT_FIELD && terms.size()==1 && isStopWord(array[0])) { // ignore stop words only when single, unadorned terms in default field } else { if (prohibited) query.addProhibitedPhrase(array, field); else if (disj) query.addOptionalPhrase(array, field); else query.addRequiredPhrase(array, field); } After this point, I have finished changing the jj file. Please note that I also have to add the method addOptionalPhrase() in org.apache.nutch.searcher.Query. This method basically sets isRequired=false and isProhibited=false. The rest has been taken care of by Nutch already. Regards, Giang On 3/15/06, Laurent Michenaud [EMAIL PROTECTED] wrote: I would like to use Boolean Query too :) -Original Message- From: Alexander Hixon [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 March 2006 08:38 To: nutch-user@lucene.apache.org Subject: RE: Boolean OR QueryFilter Maybe you could post the code on JIRA, if anyone else wishes to use Boolean operators in their search queries..? We could probably get a developer or two to put this in the 0.8 release? Since it IS open source. ;) Just a thought, Alex -Original Message- From: Nguyen Ngoc Giang [mailto:[EMAIL PROTECTED] Sent: Wednesday, 15 March 2006 3:45 PM To: nutch-user@lucene.apache.org; [EMAIL PROTECTED] Subject: Re: Boolean OR QueryFilter Hi David, I also did a similar task. In fact, I hacked into the jj code to add the definition for OR and NOT. If you need any help, don't hesitate to contact me :). Regards, Giang PS: I also believe that a hack to the jj code is necessary. On 3/8/06, David Odmark [EMAIL PROTECTED] wrote: Hi all, We're trying to implement a nutch app (version 0.8) that allows for Boolean OR e.g. (this OR that) AND (something OR other). I've found some relevant posts in the mailing list archive, but I think I'm missing something. For example, here's a snippet from a post from Doug Cutting: <snip> that said, one can implement OR as a filter (replacing or altering BasicQueryFilter) that scans for terms whose text is OR in the default field. </snip> The problem I'm finding is that the NutchAnalysis analyzer seems to be swallowing all boolean terms by the time the QueryFilter is even executed (perhaps because OR is a stop word?). 
To wit: String queryText = "this OR that"; org.apache.nutch.searcher.Query query = org.apache.nutch.searcher.Query.parse(queryText, conf); for (int i = 0; i < query.getTerms().length; i++) { System.out.println("Term = " + query.getTerms()[i]); } This results in output that looks like this: Term = this Term = that So am I correct in believing that in order to implement boolean OR using Nutch search and a QueryFilter, one must also (minimally) hack the NutchAnalysis.jj file to produce a new analyzer? Also, given that a Nutch Query object doesn't seem to have a method to add a non-required Term or Phrase, does that need to be modified as well? Sorry for the long post, and thanks in advance... -David Odmark
Re: Site: invalid Jira link
I just fixed this. Thanks, Doug ArentJan Banck wrote: on: http://lucene.apache.org/nutch/issue_tracking.html http://nagoya.apache.org/jira/browse/Nutch no longer works. Should be: http://issues.apache.org/jira/browse/Nutch - Arent-Jan
Re: Adaptive Refetching
Andrzej Bialecki wrote: What i infer is, 1. For every refetch, the score of files (but not the directory) is increasing This is curious, it should not be so. However, it's the same in the vanilla version of Nutch (without this patch), so we'll address this separately. The OPIC algorithm is not really designed for re-fetching. It assumes that each link is seen only once. When pages are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links in the original version of a page. This is not perfect, but considerably better than what happens now. Incrementally updating the score would require re-processing the parser outputs to find outlinks from the previous version of the page and then subtracting their contribution from the page's score. This is possible, but not easy. Doug
Re: Boolean OR QueryFilter
David Odmark wrote: So am I correct in believing that in order to implement boolean OR using Nutch search and a QueryFilter, one must also (minimally) hack the NutchAnalysis.jj file to produce a new analyzer? Also, given that a Nutch Query object doesn't seem to have a method to add a non-required Term or Phrase, does that need to be modified as well? It looks like you might need to make sure that OR is not a stop word. Or use syntax like 'this +OR that', since required words are not stopped. Or use something like 'this operator:OR that'. Doug
Re: Adaptive Refetching
Andrzej Bialecki wrote: Doug Cutting wrote: are refetched, their links are processed again. I think the easiest way to fix this would be to change ParseOutputFormat to not generate STATUS_LINKED crawldata when a page has been refetched. That way scores would only be adjusted for links in the original version of a page. This is not perfect, but considerably better But then we would miss any new links from that page. I think it's not acceptable. Think e.g. of news sites, where links from the same page are changing on a daily or even hourly basis. Good point. Then maybe we should add a new status just for this, STATUS_REFRESH_LINK. If this is the only datum for a page, then the page could be added with its inherited score, but otherwise, if it is an already known page, the score increment is ignored. That way the scores for existing pages would not change due to recrawling, but new pages would still be added with a score influenced by the page that linked to them. Still not perfect, but better. If you remember, some time ago I proposed a different solution: to involve linkDB in score calculations, and to store these partial OPIC score values in Inlink. This would allow us to track score contributions per source/target pair. Newly discovered links would get the initial partial score value from the originating page, and we could track these values if the original page's score changes (e.g. the number of links increases, or the page's score is updated). Involving the linkdb in score calculations means that the linkdb is involved in crawldb updates, which makes crawldb updates much slower, since the linkdb generally has many times more entries than the crawldb. The linkdb is not required for batch crawling and OPIC scoring, a common case. So if we wish to implement things this way we should make it optional. For example, an initial crawl could be done using the current algorithm while subsequent crawls could use a slower, incrementally updating algorithm. BTW: I've been toying with some patches to implement pluggable scoring mechanisms, it would be easy to provide hooks for custom scoring implementations. Scores are just float values, so they would be sufficient for a wide range of scoring mechanisms, for others the newly added CrawlDatum.metadata could be used. +1 Doug
Re: .8 svn - fetcher performance..
Byron Miller wrote: Anything i should change/tweak on my fetcher config for .8 release? i'm only getting 5 pages/sec and i was getting nearly 50 on .7 with 125 threads going. Does .8 not use threads like 7 did? Byron, Have you tried again more recently? A number of bugs have been fixed in 0.8 in the past few weeks. I think it is now much more stable. Doug
Re: Problems with hadoop
Jon Blower wrote: My guess is that the source program is not available on your version of FreeBSD. Try running the source program (with no arguments) from the command line or type man source. Do you see anything? If not, you probably don't have the source program, which is called by the hadoop script. The source command is a shell builtin which effectively inserts the content of another shell script within a shell script, so that the sourced script can, e.g., set local variables, etc. Doug
Re: retry later
Richard Braman wrote: when you get an error while fetching, and you get the org.apache.nutch.protocol.retrylater because the max retries have been reached, nutch says it has given up and will retry later, when does that retry occur? How would you make a fetchlist of all urls that have failed? Is this information maintained somewhere? Each url in the crawldb has a retry count, the number of times it has been tried without a conclusive result. When the maximum (db.fetch.retry.max) is reached, the page is considered gone. Until then it will be generated for fetch along with other pages. There is no command that generates a fetchlist for only pages whose retry count is greater than zero. Doug
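For reference, the retry limit mentioned is configured in conf/nutch-site.xml; 3 is only an example value:

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
</property>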
Re: Tutorial on the Wiki
Vanderdray, Jacob wrote: I've changed the language a bit. If you're interested, take a look: http://wiki.apache.org/nutch/NutchTutorial This looks great! Thanks so much for adding this to the wiki! We might add something to the Step-by-Step introduction to the effect that: This also permits more control over the crawl process, and incremental crawling. Does that address others' concerns? Doug
Re: still not so clear to me
Richard Braman wrote: Can someone confirm this: You start a crawldb from a list of urls and you generate a fetch list, which is akin to seeding your crawldb. When you fetch it just fetches those seed urls. When you do your next round of generate/fetch/update, the fetch list will have the links found while parsing the pages in the original urls. Then on your next round, it will fetch the links found during the previous fetch. So with each round of fetching, nutch goes deeper and deeper into the web, only fetching urls it hasn't previously fetched. The generate command generates a fetch list first based on the seed urls, then on the links found on that page (for each subsequent iteration), then on the links on those pages, and so forth and so on until the entire domain is crawled, if you limit the domains with a filter. This all sounds right to me. Some clarifications: - urls are filtered before adding them to the crawldb, so the db only ever contains urls that pass the filter. - the db contains both urls that have been fetched and those that have not been fetched. When you find a new link to a url that is already in the db it does not add a new entry to the db, but rather just updates the existing entry's score. - higher-scoring pages are generated in preference to lower-scoring pages when the -topN option is used. So a page discovered in the first round might not be fetched until the fourth round, when enough other links have been found to that page to warrant fetching it. Thus, when topN is specified, crawling is not totally breadth-first. Doug
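As a sketch of one such round using the individual commands (the crawl/ layout and the -topN value are placeholders):

bin/nutch generate crawl/crawldb crawl/segments -topN 1000
segment=`ls -d crawl/segments/* | tail -1`
bin/nutch fetch $segment
bin/nutch updatedb crawl/crawldb $segment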
Re: project vitality?
Richard Braman wrote: I really do think nutch is great, but I echo Matthias's comments that the community needs to come together and contribute more back. And that comes with the requirement of making sure volunteers are given access to make their contributions part of the project. Here's how it works: One has to be a committer to directly change the code. One may be invited to become a committer if one contributes a number of non-trivial, consistently exemplary patches. Exemplary patches: 1. are easy for a committer to apply; 2. fix one thing; 3. fix it well; 4. are well formatted, using Sun's coding conventions; 5. are well documented, with Javadoc for all non-private items; 6. pass all existing unit tests; 7. include new unit tests; 8. etc. An exemplary patch is thus something that a committer can commit with little hesitation. It follows that exemplary patches will be committed quickly. Lesser patches are likely to languish. For example, a committer might be reluctant to take on a poorly constructed patch for a bug that only affects niche users, since it may take a lot of time to turn it into code worthy of committing. Most committers are already doing as much as they can to help the project. The trick is not to get the committers to do more work, but for others to do more work for the committers, and, eventually, to get more committers. Putting the faqs and tutorial on the website and not the wiki may be one of the two biggest problems in getting people started learning nutch. If you think these should move, don't just complain: file a bug, make your case, submit a patch, etc. The website is part of the source and is governed by the same process. Doug
Re: project vitality?
David Wallace wrote: Also, I've lost count of the number of times someone has posted something to the effect of I'll pay someone to give me Nutch support, simply because they find the existing documentation and mailing lists inadequate. Usually, that person gets told that the best way to get Nutch support is to ask questions on the mailing list; but since questions often go unanswered, this isn't a very good way to get Nutch support at all. I agree this is a problem, but it is also an opportunity. I do try to answer Nutch questions whenever I have time, and most other Nutch developers are also active on these lists. The problem is simply that there are more questions than question answering hours. All of this is acceptable in a product that hasn't yet reached version 1.0. The code has moved ahead faster than the documentation; and that's fine, provided the documentation will eventually catch up. Yes, I hope it will. Maybe, once 0.8 is deemed production-worthy, the team should down tools, stop coding, and put some effort into really producing a really lovely set of documentation, including a comprehensive FAQ. I believe that this will help grow the user base, faster than adding new features ever could. That would be nice. Once things settle down it will also be easier for support organizations, consultants, book authors, etc, to step in and improve documentation too. Doug
Re: issues w/ new nutch versions
Florent Gluck wrote: In hadoop jobtracker's log, I can see several tasks being losts as follow: 060306 184155 Aborting job job_hyhtho 060306 184156 Task 'task_m_7qgat2' has been lost. 060306 184156 Aborting job job_hyhtho 060306 184156 Task 'task_m_lph5qs' has been lost. 060306 184156 Aborting job job_hyhtho It seems there are some sort of timeouts. Weird, the machines are properly configured (hasn't changed) and it definitely works w/ nutch the previous nutch version (as of end of Jan.). I fixed some bugs today in Hadoop which could cause this. Please try updating again and see if you still have this problem. Sorry! Doug
Re: Help with bin/nutch server 8081 crawl
Monu Ogbe wrote: Caused by: java.lang.InstantiationException: org.apache.nutch.searcher.Query at java.lang.Class.newInstance0(Unknown Source) at java.lang.Class.newInstance(Unknown Source) at org.apache.hadoop.io.WritableFactories.newInstance(WritableFactories.jav It looks like Query no longer has a no-arg constructor, probably since the patch which makes all Configurations non-static. A no-arg constructor is required in order to pass something via an RPC. The fix might be as simple as adding the no-arg constructor, but perhaps not, since the query would then have a null configuration. At a glance, the query execution code doesn't appear to use the configuration, so this might work... Doug
Re: Moving tutorial link to wiki
Matthias Jaekle wrote: Maybe we should move the tutorial to the wiki so it can be commented on. +1 +1 Doug
Re: exception during fetch using hadoop
It looks like the child JVM is silently exiting. The error reading child output just shows that the child's standard output has been closed, and the child error says the JVM exited with non-zero. Perhaps you can get a core dump by setting 'ulimit -c' to something big. JVM core dumps can be informative. This doesn't look like something that should kill a crawl, though. Are you using tasktrackers and a jobtracker, or running things with a local jobtracker? With a tasktracker this task would be retried. Are you seeing this? Does a given task consistently fail when retried? Doug Mike Smith wrote: I have been getting this exception during fetching for almost a month. This exception stops the whole crawl. It happens on and off! Any Idea?? We are really stuck with this problem. I am using 3 data nodes and 1 name server. 060223 173809 task_m_b8ibww fetching http://www.heartcenter.com/94fall.pdf 060223 173809 task_m_b8ibww fetching http://www.medinfo.co.uk/conditions/tenosynovitis.html 060223 173809 task_m_b8ibww fetching http://www.boncholesterol.com/whatsnew/index.shtml 060223 173809 task_m_b8ibww fetching http://www.drcranton.com/hrt/promise_of_longevity.htm 060223 173809 task_m_b8ibww fetching http://www.drcranton.com/hrt/promise_of_longevity.htm 060223 173809 task_m_b8ibww Error reading child output java.io.IOException: Bad file descriptor at java.io.FileInputStream.readBytes(Native Method) at java.io.FileInputStream.read(FileInputStream.java:194) at sun.nio.cs.StreamDecoder$CharsetSD.readBytes(StreamDecoder.java:411) at sun.nio.cs.StreamDecoder$CharsetSD.implRead(StreamDecoder.java:453) at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:183) at java.io.InputStreamReader.read(InputStreamReader.java:167) at java.io.BufferedReader.fill(BufferedReader.java:136) at java.io.BufferedReader.readLine(BufferedReader.java:299) at java.io.BufferedReader.readLine(BufferedReader.java:362) at org.apache.hadoop.mapred.TaskRunner.logStream(TaskRunner.java:170) at org.apache.hadoop.mapred.TaskRunner.access$100(TaskRunner.java:29) at org.apache.hadoop.mapred.TaskRunner$1.run(TaskRunner.java:137) 060223 173809 task_r_3h1pex 0.1667% reduce copy 060223 173809 Server connection on port 50050 from xx: exiting 060223 173809 Server connection on port 50050 from xx: exiting 060223 173809 task_m_b8ibww Child Error java.io.IOException: Task process exit with nonzero status. at org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:144) at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:97) 060223 173812 task_m_b8ibww done; removing files.
Re: url: search fail
0.7 and 0.8 are not compatible. You need to re-crawl. Sorry! Once we have a 1.0 release then we'll make sure things are back-compatible. Doug Martin Gutbrod wrote: I changed from 0.7.1 to one of the latest nightly builds (0.8) and now search for url: fields fail. E.g. [ url:my.doman.com ] Has anybody similar experiences? Should I switch back to 0.7.1 ? Log file shows: 2006-02-24 11:17:11 StandardWrapperValve[jsp]: Servlet.service() for servlet jsp threw exception java.lang.NullPointerException at org.apache.nutch.searcher.FieldQueryFilter.filter(FieldQueryFilter.java:63) at org.apache.nutch.searcher.QueryFilters.filter(QueryFilters.java:106) at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:94) at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239) at org.apache.jsp.search_jsp._jspService(search_jsp.java:251) at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94) at javax.servlet.http.HttpServlet.service(HttpServlet.java:856) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236) at javax.servlet.http.HttpServlet.service(HttpServlet.java:856) at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252) at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173) at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198) at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) at org.apache.catalina.valves.RequestFilterValve.process(RequestFilterValve.java:287) at org.apache.catalina.valves.RemoteAddrValve.invoke(RemoteAddrValve.java:84) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109) at org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104) at org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520) at org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929) at org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160) at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799) at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705) at org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577) at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684) at java.lang.Thread.run(Thread.java:534)
Re: Link to Search Interface for List
Vanderdray, Jacob wrote: I get the same thing from my linux box. The only reference I can find to linkmap.html is a commented out line in forrest.properties. FWIW: I've already made the changes to my copy of mailing_lists.xml. Let me know if you want me to just send someone that. I think I just fixed that problem. Forrest 0.7 seems to choke on ext: links in the tabs.xml file. Once those are removed it works. Doug
Re: Problem/bug setting java_home in hadoop nightly 16.02.06
Have you edited conf/hadoop-env.sh, and defined JAVA_HOME there? Doug Håvard W. Kongsgård wrote: I am unable to set java_home in bin/hadoop, is there a bug? I have used nutch 0.7.1 with the same java path. localhost: Error: JAVA_HOME is not set. if [ -f $HADOOP_HOME/conf/hadoop-env.sh ]; then source ${HADOOP_HOME}/conf/hadoop-env.sh fi # some Java parameters if [ "$JAVA_HOME" != "/usr/lib/java" ]; then #echo "run java in $JAVA_HOME" JAVA_HOME=$JAVA_HOME fi if [ "$JAVA_HOME" = "" ]; then echo "Error: JAVA_HOME is not set." exit 1 fi JAVA=$JAVA_HOME/bin/java JAVA_HEAP_MAX=-Xmx1000m System: SUSE 10 64-bit | Java 1.4.2
Re: The latest svn version is not stable
Rafit Izhak_Ratzin wrote: I just checked out the latest svn version (376446) and built it from scratch. When I tried to run the jobtracker I got the following message in the jobtracker log file: 060209 164707 Property 'sun.cpu.isalist' is Exception in thread "main" java.lang.NullPointerException Okay. I think I just fixed this. Please give it a try. Thanks, Doug
Re: nutch inject problem with hadoop
Michael Nebel wrote: I upgraded to the last version from the svn today. After having some nuts and bolts fixes (missing hadoop-site.xml, webapps-dir). I just fixed these issues. I finally tried to inject a new set of urls. Doing so, I get the exception below. I am not seeing this. Are you still seeing it, with the current sources? If so, can you provide more details? What OS, JVM? Thanks, Doug
Re: nutch inject problem with hadoop
Michael Nebel wrote: Now it's complaining about a missing class org/apache/nutch/util/LogFormatter :-( That's been moved to Hadoop: org.apache.hadoop.util.LogFormatter. Doug
Re: hadoop-default.xml
The file packaged in the jar is used for the defaults. It is read from the jar file. So it should not need to be committed to Nutch. Mike Smith wrote: There is no setting file for Hadoop in conf/. Should it be hadoop-default.xml? It seems this file is not committed but it is packaged into hadoop jar file. Thanks, Mike.
Re: Recovering from Socket closed
Chris Schneider wrote: Also, since we've been running this crawl for quite some time, we'd like to preserve the segment data if at all possible. Could someone please recommend a way to recover as gracefully as possible from this condition? The Crawl .main process died with the following output: 060129 221129 Indexer: adding segment: /user/crawler/crawl-20060129091444/segments/20060129200246 Exception in thread main java.io.IOException: timed out waiting for response at org.apache.nutch.ipc.Client.call(Client.java:296) at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127) at $Proxy1.submitJob(Unknown Source) at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259) at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288) at org.apache.nutch.indexer.Indexer.index(Indexer.java:263) at org.apache.nutch.crawl.Crawl.main(Crawl.java:127) However, it definitely seems as if the JobTracker is still waiting for the job to finish (no failed jobs). Have you looked at the web ui? It will show if things are still running. This is on the jobtracker host at port 50030 by default. The bug here is that the RPC call times out while the map task is computing splits. The fix is that the job tracker should not compute splits until after it has returned from the submitJob RPC. Please submit a bug in Jira to help remind us to fix this. To recover, first determine if the indexing has completed. If it has not, then use the 'index' command to index things, followed by 'dedup' and 'merge'. Look at the source for Crawl.java: http://svn.apache.org/viewcvs.cgi/lucene/nutch/trunk/src/java/org/apache/nutch/crawl/Crawl.java?view=markup All you need to do to complete the crawl is to complete the last few steps manually. Cheers, Doug
Re: Parsing PDF Nutch Achilles heel?
Steve Betts wrote: I am using PDFBox-0.7.2-log4j.jar. That doesn't make it run a lot faster, but it does allow it to complete. I find xpdf much faster than PDFBox. http://www.mail-archive.com/nutch-dev@incubator.apache.org/msg00161.html Does this work any better for you? Doug
Re: How do I control log level with MapReduce?
Chris Schneider wrote: I'm trying to bring up a MapReduce system, but am confused about how to control the logging level. It seems like most of the Nutch code is still logging the way it used to, but the -logLevel parameter that was getting passed to each tool's main() method no longer exists (not that these main methods are getting called by Crawl.java, of course). Previously, if -logLevel was omitted, each tool would set its logLevel field to INFO, but those fields no longer exist either. The result seems to be that the logging level defaults all the way back to the LogFormatter, which sets all of its handlers to FINEST. I was sort of expecting there to be a new configuration property (perhaps a job configuration property?) that would control the logging level, but I don't see anything like this. Any guidance would be greatly appreciated. There is no config property to control logging level. That would be a useful addition, if someone wishes to contribute it. In the meantime, Nutch uses Java's built-in logging mechanism. Instructions for configuring that are in: http://java.sun.com/j2se/1.4.2/docs/api/java/util/logging/LogManager.html Doug
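For illustration, Java's built-in logging can be configured through a properties file; the logger names and levels below are only examples:

# logging.properties
handlers=java.util.logging.ConsoleHandler
.level=INFO
java.util.logging.ConsoleHandler.level=INFO
org.apache.nutch.level=FINE

# then pass -Djava.util.logging.config.file=/path/to/logging.properties to the JVM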
Re: Can't index some pages
Michael Plax wrote: Question summary: Q: How can I set up the crawler to index an entire web site? I'm trying to run crawl with the command from the tutorial 1. In the urls file I have the start page (index.html). 2. In the configuration file conf/crawl-urlfilter.txt the domain was changed. 3. I run: $ bin/nutch crawl urls -dir crawledtottaly -depth 10 > crawl.log 4. Crawling is finished 5. I run: bin/nutch readdb crawled/db -stats output: $ bin/nutch readdb crawledtottaly/db -stats run java in C:\Sun\AppServer\jdk 060118 155526 parsing file:/C:/nutch/conf/nutch-default.xml 060118 155526 parsing file:/C:/nutch/conf/nutch-site.xml 060118 155526 No FS indicated, using default:local Stats for [EMAIL PROTECTED] --- Number of pages: 63 Number of links: 3906 6. I get fewer pages than I expected. This is a common question, but there's not a common answer. The problem could be that urls are blocked by your url filter, or by http.max.delays, or something else. What might help is if the fetcher and crawl db printed more detailed statistics. In particular, the fetcher could categorize failures and periodically print a list of failure counts by category. The crawl db updater could also list the number of urls that are filtered. In the meantime, please examine the logs, particularly watching for errors while fetching. Doug
Re: So many Unfetched Pages using MapReduce
Florent Gluck wrote: I then decided to switch to using the old http protocol plugin: protocol-http (in nutch-default.xml) instead of protocol-httpclient With the old protocol I got 5 as expected. There have been a number of complaints about unreliable fetching with protocol-httpclient, so I've switched the default back to protocol-http. Doug
Re: Error at end of MapReduce run with indexing
Matt Zytaruk wrote: I am having this same problem during the reduce phase of fetching, and am now seeing: 060119 132458 Task task_r_obwceh timed out. Killing. That is a different problem: a different timeout. This happens when a task does not report status for too long then it is assumed to be hung. Will the jobtracker restart this job? It will retry that task up to three times. If so, if I change the ipc timeout in the config, will the tasktracker read in the new value when the job restarts? The ipc timeout is not the relevant timeout. The task timeout is what's involved here. And, no, at present I think the tasktracker only reads this when it is started, not per job. Doug
Re: Can't index some pages
Matt Kangas wrote: Doug, would it make sense to print a LOG.info() message every time the fetcher bumps into one of these db.max limits? This would help users find out when they need to adjust their configuration. I can prepare a patch if it seems sensible. Sure, this is sensible. But it isn't done in the fetcher; it happens when the links are read, during the db update. Doug
Re: large filter file, time to update db
Insurance Squared Inc. wrote: I'm trying to determine if there's a better way to whitelist a large number of domains than just adding them as a regular expression in the filter. Have a look at the urlfilter-prefix plugin. This is more efficient for filtering urls by a large list of domains. Doug
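A sketch of what such a whitelist might look like, assuming the urlfilter-prefix plugin reads one URL prefix per line from a plain-text file (the exact file name and the property that points to it, e.g. prefix-urlfilter.txt / urlfilter.prefix.file, should be checked against the plugin source for your version):
http://www.example-insurer.com/
http://quotes.example.org/
http://www.another-domain.net/
Each line is matched as a simple string prefix rather than a regular expression, which is why this approach scales better to thousands of domains.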
Re: Full Range of Results Not Showing
Neal Whitley wrote: Now here's another question. How can I obtain the exact number of search results being displayed on the screen? I have been fishing around and cannot find a variable being output to the page with this data. In my example below 81 total matches were found. But because of the grouping in the initial result set (hitsPerSite=2) it is showing only 46 listings, some of which are grouped under more from site. This is causing a slight problem with pagination because the pager thinks there are 81 matches and extends itself for a range of 81, when we really want a value of 46 when hitsPerSite=2. Perhaps something like this on search.jsp:
if (hitsPerSite == 0) {
  // grab the full result set
  maxPages = (int)hits.getTotal();
} else {
  // grab the short result set
  maxPages = ???some variable OR some math here to obtain value???;
}
It seems to me this is not a problem when using the default Next button to move from page to page. But with any sort of pagination used with hitsPerSite we need to know what we are actually viewing on the screen.
The site-deduping is performed at query time. If you ask for the top N hits without site duplication then Nutch finds more than N hits and removes those from duplicate sites dynamically. So unless you make N very large, we don't know the total number of site-de-duplicated hits. Doug
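If all you need is the count of hits actually returned for the current request (rather than the unknowable de-duplicated grand total), something along these lines may work in search.jsp. This is only a sketch against the Hits API as I understand it (getLength() for the number of hits actually returned, totalIsExact() to tell whether getTotal() is a real count); verify the method names against your version:
// assume 'hits' is the Hits object already obtained in search.jsp
int shown = hits.getLength();          // hits actually returned after site-grouping
long total = hits.getTotal();          // can over-count when hitsPerSite != 0
boolean exact = hits.totalIsExact();   // false when de-duping made the total inexact
int maxPages = (hitsPerSite == 0 || exact) ? (int) total : shown;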
Re: Is any one able to successfully run Distributed Crawl?
Pushpesh Kr. Rajwanshi wrote: Just wanted to confirm: was this distributed crawl done using nutch version 0.7.1 or some other version? And was that a successful distributed crawl using map reduce, or some workaround for distributed crawl? No, this is 0.8-dev. This was done in early December using the version of Nutch then in the mapred branch. This version has since been merged into the trunk and will eventually be released as 0.8. I believe everything in my previous message is still relevant to the current trunk. Doug
Re: Multi CPU support
Teruhiko Kurosaka wrote: Can I use MapReduce to run Nutch on a multi-CPU system? Yes. I want to run the index job on two (or four) CPUs on a single system. I'm not trying to distribute the job over multiple systems. If MapReduce is the way to go, do I just specify config parameters like these: mapred.tasktracker.tasks.maximum=2, mapred.job.tracker=localhost:9001, mapred.reduce.tasks=2 (or 1?), and run bin/start-all.sh? That should work. You'd probably want to set the default number of map tasks to be a multiple of the number of CPUs, and the number of reduce tasks to be exactly the number of CPUs. Don't use start-all.sh, but rather just:
bin/nutch-daemon.sh start tasktracker
bin/nutch-daemon.sh start jobtracker
Must I use NDFS for MapReduce? No. Doug
Re: Multiple anchors on same site - what's better than making these unique?
David Wallace wrote: I've been grubbing around with Nutch for a while now, although I'm still working with 0.7 code. I notice that when anchors are collected for a document, they're made unique by domain and by anchor text. Note that this is only done when collecting anchor texts, not when computing page scores. Suppose my site has 3 pages with links to page X, and the same anchor text. I'd kind of like to score page X higher than a page where there's only one incoming link with that anchor text. But I don't want to have this effect swamping the other calculations of page score. In other words, if my site has 1000 pages with links to page X, this page should score a wee bit higher than a similar page with just one incoming link, but not 1000 times higher. I'm thinking of doing some maths with the number of repetitions of an anchor, then including the result in the page score. Something like log(10+n), or maybe n/(n+2); where n is the number of incoming links with the same anchor text. Either of these formulas would make 1000 incoming links score roughly 3 times higher than a single incoming link, which seems about right to me. Page scores currently are sqrt(OPIC) in the Nutch trunk. http://www.nabble.com/-Fwd%3A-Fetch-list-priority--t360125.html#a997304 The OPIC calculation does not consider the domain or anchor text. Hope this helps. Doug
Re: Is any one able to successfully run Distributed Crawl?
Earl Cahill wrote: Any chance you could walk through your implementation? Like how the twenty boxes were assigned? Maybe upload your confs somewhere, and outline what commands you actually ran?
All 20 boxes are configured identically, running Debian with a 2.4 kernel. These are dual-processor boxes with 2GB of RAM each. Each machine has four drives, mounted as a RAID on /export/crawlspace. This cluster uses NFS to mount home directories, so I did not have to set NUTCH_MASTER in order to rsync copies of nutch to all machines. I installed JDK 1.5 in ~/local/java, Ant in ~/local/ant and Subversion in ~/local/svn.
My ~/.ssh/environment contains:
JAVA_HOME=/home/dcutting/local/java
NUTCH_OPTS=-server
NUTCH_LOG_DIR=/export/crawlspace/tmp/dcutting/logs
NUTCH_SLAVES=/home/dcutting/.slaves
I added the following to ~/.bash_profile, then logged out and back in:
export `cat ~/.ssh/environment`
I added the following to /etc/ssh/sshd_config on all hosts:
PermitUserEnvironment yes
My ~/.slaves file contains a list of all 20 slave hosts, one per line.
My ~/src/nutch/conf/mapred-default.xml contains:
<nutch-conf>
<property>
  <name>mapred.map.tasks</name>
  <value>1000</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>39</value>
</property>
</nutch-conf>
My ~/src/nutch/conf/nutch-site.xml contains:
<nutch-conf>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
</property>
<property>
  <name>generate.max.per.host</name>
  <value>100</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)</value>
</property>
<property>
  <name>parser.html.impl</name>
  <value>tagsoup</value>
</property>
<!-- NDFS -->
<property>
  <name>fs.default.name</name>
  <value>adminhost:8009</value>
</property>
<property>
  <name>ndfs.name.dir</name>
  <value>/export/crawlspace/tmp/dcutting/ndfs/names</value>
</property>
<property>
  <name>ndfs.data.dir</name>
  <value>/export/crawlspace/tmp/dcutting/ndfs</value>
</property>
<!-- MapReduce -->
<property>
  <name>mapred.job.tracker</name>
  <value>adminhost:8010</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/mapred/system</value>
</property>
<property>
  <name>mapred.local.dir</name>
  <value>/export/crawlspace/tmp/dcutting/local</value>
</property>
<property>
  <name>mapred.child.heap.size</name>
  <value>500m</value>
</property>
</nutch-conf>
My ~/src/nutch/conf/crawl-urlfilter.txt contains:
# skip file:, ftp:, mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|png)$
# skip URLs containing certain characters as probable queries, etc.
[EMAIL PROTECTED]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/.+?)/.*?\1/.*?\1/
# accept everything else
+.
To run the crawl I gave the following commands on the master host:
# checkout nutch sources and build them
mkdir ~/src
cd ~/src
~/local/svn co https://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
cd nutch
~/local/ant/bin/ant
# install config files named above in ~/src/nutch/conf
# create dmoz/urls file
wget http://rdf.dmoz.org/rdf/content.rdf.u8.gz
gunzip content.rdf.u8.gz
mkdir dmoz
bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8.gz dmoz/urls
# create required directories on slaves
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/logs
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/local
bin/slaves.sh mkdir -p /export/crawlspace/tmp/dcutting/ndfs/names
# start nutch daemons
bin/start-all.sh
# copy dmoz/urls into ndfs
bin/nutch ndfs -put dmoz dmoz
# crawl
nohup bin/nutch crawl dmoz -dir crawl -depth 4 -topN 1600 /dev/null crawl.log
Then I visited http://master:50030/ to monitor progress. I think that's it! Doug
Re: Is any one able to successfully run Distributed Crawl?
Pushpesh Kr. Rajwanshi wrote: I want to know if anyone has been able to successfully run a distributed crawl on multiple machines involving crawling millions of pages, and how hard it is to do that. Do I just have to do some configuration and setup, or some implementation as well? I recently performed a four-level deep crawl, starting from urls in DMOZ, limiting each level to 16M urls. This ran on 20 machines, taking around 24 hours, using about 100Mbit, and retrieved around 50M pages. I used Nutch unmodified, specifying only a few configuration options. So, yes, it is possible. Doug
Re: Linking Document scores together in a query
Can you please describe the higher-level problem you're trying to solve? Doug Matt Zytaruk wrote: Hello, I am trying to implement a system where, to get the score for certain documents in a query, I need to average the scores of two different documents for that query. Does anyone have any bright ideas on what the best way to implement such a system would be? I've been investigating and thus far haven't been able to find a way that didn't degrade performance horribly. Any help would be appreciated. Thanks in advance. -Matt Zytaruk
Re: How to get page content given URL only?
Nguyen Ngoc Giang wrote: I'm writing a small program which just utilizes Nutch as a crawler only, with no search functionality. The program should be able to return page content given an url input. In the mapred branch this is directly supported by NutchBean. Doug
Re: Incremental crawl w/ map reduce
Did you update the crawldb after the first fetch? The mapred crawler does not update the next-fetch date of pages when the fetch list is generated, as it did in 0.7. So, until that changes, you must update the crawldb before you next generate a fetch list. Doug Florent Gluck wrote: Hi, As a test, I recently did a quick incremental crawl. First, I did a crawl with 10 seed urls using 4 nodes (1 jobTracker/nameNode + 3 taskTrackers/dataNodes). So far, so good, the fetches were distributed among the 3 nodes (3/3/4) and a segment was generated. Running a quick -stats on the crawldb showed me the 10 links were there. I also did a dump and everything was fine. Then, I injected a new url and crawled again, generating a second segment. While it was running, I looked at the logs expecting to only see the fetch of the new url I added, but instead I saw it was fetching all the previous urls again. Why is that? These were already fetched and my understanding is that they should only be fetched again after 30 days (or whatever value is specified in nutch-site.xml). What am I missing here? Thanks, Flo
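A rough sketch of one incremental round with the 0.8-style commands, assuming the usual argument order of crawldb first and then the segment (check the usage output of bin/nutch for your build; the segment path below is purely illustrative):
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
bin/nutch fetch crawl/segments/20060501120000
bin/nutch updatedb crawl/crawldb crawl/segments/20060501120000
Running updatedb after every fetch is what records the next-fetch dates, so the next generate step can skip recently fetched pages.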
Re: mapred branch: IOException in invertlinks (No input directories specified)
Florent Gluck wrote: 8. invertlinks linkdb segments/SEG_NAME This should be instead: invertlinks linkdb segments Doug
Re: Fetch Errors
Ben Halsted wrote: When I check the fetch status pages in the JobTracker web GUI I see that I am getting, on average, more errors than pages: 95 pages, 119 errors, 1.0 pages/s, 63 kb/s. Is there a way to find out what the errors are? Look in the tasktracker logs. Typically they're max delays exceeded. I recently increased the default for this parameter, which helps a lot. Doug
Re: NDFS / WebDB Question
Thomas Delnoij wrote: So, say I want to set up a machine as a DataNode that has two or more disks, do I have to configure and set up a DataNode daemon for every disk? How else could I use all disks if the ndfs.data.dir property only accepts one path (assuming I don't want to rely on MS Windows' dynamic discs or similar OS-specific features)? You can list multiple paths in ndfs.data.dir. Paths which do not exist are ignored. Doug
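A hedged sketch of such a configuration in nutch-site.xml, assuming the list is comma-separated (that is how later Hadoop handles the equivalent dfs.data.dir; confirm the separator for your NDFS version), with one directory per physical disk (the /disk1 and /disk2 paths are just examples):
<property>
  <name>ndfs.data.dir</name>
  <value>/disk1/ndfs/data,/disk2/ndfs/data</value>
</property>
The datanode then spreads blocks across the listed directories, and, as noted above, any path that does not exist is simply ignored.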
Re: Crawl auto updated in nutch?
Håvard W. Kongsgård wrote: - I want to index about 50 – 100 sites with lots of documents. Is it best to use the Intranet Crawling or the Whole-web Crawling method? The intranet style is simpler and hence a good place to start. If it doesn't work well for you then you might try the whole-web style. - Is the crawl auto-updated in nutch, or must I run a cron task? It is not auto-updated. Doug
Re: Fetcher url sorting
Matt Zytaruk wrote: Indeed, that does work, although that ends up slowing down the fetch a fair amount because a lot of threads end up idle, waiting, and I was hoping to avoid that slowdown if possible. What should these threads be doing? If you have a site with N pages to fetch, and you want to fetch them all politely, then it will take at least fetcher.server.delay*N to fetch them all. The fetch list is sorted by the hash of the url, so accesses to each host should be spread fairly evenly through the list. Capping the number of pages per host (generate.max.per.host) will help, or, if you know the webmasters in question, you can consider increasing fetcher.threads.per.host. Doug
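To make that arithmetic concrete: assuming the default fetcher.server.delay of about 5 seconds (check nutch-default.xml for your version), a single host with 10,000 queued pages needs at least 10,000 x 5 s, roughly 14 hours, to fetch politely, no matter how many fetcher threads you run; the extra threads can only wait.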
Re: Fetcher url sorting
Matt Zytaruk wrote: Well, if we want to fetch pages from N different sites, ideally we should be able to have N threads running, without any of them having to wait. I guess ideally what the fetcher should probably do is instead of waiting, put the url it was trying to fetch back into the queue to be tried later on, and grab a different one. The fetcher used to do this, and it ended up with huge queues. We capped the size of the queues, and dropped urls when their queue was full. But the fetcher still spent an age at the end, mostly idle, with a single thread emptying its queue. And there were some bugs in the queue synchronization that caused things to sometimes hang, but no one could ever figure out why. So the current fetcher's strategy is to, instead of queuing urls in order to drop them later, drop them now. And instead of queuing urls in order to wait later, wait now. It makes things a lot simpler. In the end the performance is similar, but you can see the cost of crawling big sites immediately, rather than only later. In either case you need to choose to drop things or run slowly. I'm not so sure that accesses to each host are spread evenly throughout the list, because the fetch list I was doing had tens of thousands of different hosts and I was still getting a large amount of threads trying to access the same host at the same time, even with only 50 threads. Although maybe I'm wrong and that is how it would act if the hosts were spread evenly throughout, I'm not sure, it just seems like a lot. They're not spread exactly evenly, but randomly, which can be a bit lumpy. What percentage of urls in the fetch list are from a host that is exceeding max delays? If it is near 2%, or that host is slower than average, then you'll probably have issues with 50 threads. Doug
Re: Merging many indexes
Ben Halsted wrote: I'm getting the dreaded Too many open files error. I've checked my system settings for file-max:
$ cat /proc/sys/fs/file-nr
2677 1945 478412
$ cat /proc/sys/fs/file-max
478412
What does 'ulimit -n' print? Look in /etc/security/limits.conf to increase the limit. What would be the best way to work around (or fix) this? Merging 10 indexes at a time and then merging the results down until I get just one index? Yes. You can decrease indexer.mergeFactor to make this happen. Perhaps we should decrease the default. With the addition of crc files, the number of open files is doubled. So 50 indexes with 10 open files each yields 1000 open files, and the JVM itself needs more than 24, which puts you over a typical 1024 per-process limit. So I guess the default should be decreased to 30 or so. What about the dedup process? It seems to be able to manage the 100+ indexes fine, but if I switch the process and merge the indexes first and then remove dupes, I think it may speed up the process. Ideas? Then you end up with dupes still taking space in your final index, which is not optimal for search. Doug
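If 'ulimit -n' shows the usual 1024, raising the per-user limit in /etc/security/limits.conf looks roughly like this (the user name 'nutch' is only an example; pam_limits must be enabled and you need to log in again for the change to take effect):
nutch  soft  nofile  16384
nutch  hard  nofile  16384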
Re: merging auto-crawls
Ben Halsted wrote: I've modified the auto-crawl to always use a pre-existing crawldb. If I run it multiple times I get multiple linkdb, segments, indexes, and index directories. Is it possible to merge the results using the bin/nutch commands? You should also have it use a single linkdb. Then use 'bin/nutch dedup' and 'bin/nutch merge' across both indexes directories to create a new index with everything. Doug
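A very rough sketch of that sequence, assuming the 0.8-era command syntax (the exact arguments to dedup and merge vary between versions, so check the usage messages printed by 'bin/nutch dedup' and 'bin/nutch merge' before relying on this; crawl-a and crawl-b stand for the output directories of two separate auto-crawl runs):
# mark duplicates across all per-crawl indexes
bin/nutch dedup crawl-a/indexes crawl-b/indexes
# merge them into a single index for the web ui
bin/nutch merge crawl-merged/index crawl-a/indexes crawl-b/indexes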
Re: Filesystem structure for the web front-end.
Ben Halsted wrote: I was wondering what the required file structure is for the web gui to work properly. Are all of these required?
/db/crawldb
/db/index
/db/indexes
/db/segments
/db/linkdb
The indexes directory is not used when a merged index is present. The crawldb and segments/*/crawl_parse directories are not used by the web ui. Also -- What is the proper way to merge segments and indexes? Can I simply move segments all into one directory then re-index it, or is there a better way? You should update the linkdb so that it contains links from all segments. Then you can use the dedup and merge commands to create a new index. Ideally you should also re-index after updating the linkdb, but this is not required. Doug
Re: merging auto-crawls
Ben Halsted wrote: When I merge this stuff, do I need to merge the segments/* for each crawl into a single segments directory? Or is there data in the merged index file that will direct the web component to the correct segment? Put the segments in a single directory. The index only has the segment name, not its full path. Please keep folks on the list updated as to how this works for you. I have not yet used things in this way with the mapred branch, but it is a common use case. Perhaps we can add a 'crawl more' option to the crawl command that automates this. Doug
Re: sorting on multiple fields
James Nelson wrote: I need to sort the search results on two fields for a project I'm working on, but nutch only seems to support sorting on one. I'm wondering if I missed something and there is actually a way or if there is a reason for restricting sort to one field that I'm not aware of. Sorting results by multiple fields is not yet supported in Nutch, but would not be too hard to add, since Lucene supports it. Doug
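For reference, at the Lucene level a two-field sort is just a Sort built from several SortFields, along these lines (a sketch using plain Lucene classes of that era; wiring it into Nutch's searcher and query path is the part that would need new code, and the "site" field is only an example):

import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// sort first by an untokenized "site" field, then by relevance score
Sort sort = new Sort(new SortField[] {
  new SortField("site"),
  SortField.FIELD_SCORE
});
// hits = searcher.search(luceneQuery, sort);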
Re: Which fields can you call via detail.getvalue(....) out of the box?
The explain page lists all stored fields by calling the toHtml() method of HitDetails. You can also list things with:
for (int i = 0; i < detail.getLength(); i++) {
  String field = detail.getField(i);
  String value = detail.getValue(i);
  ...
}
Doug Byron Miller wrote: I'm looking to see if I can pull a meta description in lieu of summary for some content and wondering if this is indexed - is there an easy way to see the fields indexed by default and how they're exposed through the nutch bean?
Re: mapred error on windows
It looks like you are using ndfs but not running any datanodes. An ndfs filesystem requires one namenode and at least one datanode, typically a large number running on different machines. Look at the bin/start-all.sh script for an example of what is started in a typical mapred/ndfs deployment. Doug Kashif Khadim wrote: I am unable to crawl with mapred on windows. I get this error after I run: bin/nutch crawl urls
Error:
051030 004819 parsing file:/C:/nutch/mapred/conf/nutch-default.xml
051030 004819 parsing file:/C:/nutch/mapred/conf/nutch-site.xml
051030 004819 Server listener on port 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004819 Server handler on 9009: starting
051030 004822 Server connection on port 9009 from 80.139.7.173: starting
051030 004914 Server connection on port 9009 from 80.139.7.173: starting
051030 004931 While choosing target, totalMachines is 0
051030 004931 Target-length is 0, below MIN_REPLICATION (1)
051030 004931 Server handler on 9009 call error: java.io.IOException: Cannot create file /tmp/nutch/mapred/system/submit_ogykqp/job.xml
java.io.IOException: Cannot create file /tmp/nutch/mapred/system/submit_ogykqp/job.xml
 at org.apache.nutch.ndfs.NameNode.create(NameNode.java:98)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:582)
 at org.apache.nutch.ipc.RPC$1.call(RPC.java:187)
 at org.apache.nutch.ipc.Server$Handler.run(Server.java:198)
051030 004934 Server connection on port 9009 from 80.139.7.173: exiting
Thanks.
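As a sketch, a minimal single-machine ndfs setup would start both daemons locally before running the crawl, using the same nutch-daemon.sh mechanism mentioned elsewhere in this digest (adjust fs.default.name in nutch-site.xml to point at the namenode; verify the daemon names against your start-all.sh):
bin/nutch-daemon.sh start namenode
bin/nutch-daemon.sh start datanode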
Re: fetch questions - freezing
Ken van Mulder wrote: Initially, it's able to reach ~25 pages/s with 150 threads. The fetcher gets progressively slower though, dropping down to ~15 pages/s after about 2-3 hours, and continues to slow down. I've seen a few references on these lists to the issue, but I'm not clear on whether it's expected behaviour or whether there's a solution to it. I've also noticed that the process takes up more and more memory as it runs; is this expected as well? What parse plugins do you have enabled? The best way to diagnose these problems is to 'kill -QUIT' an offending fetcher process. This will dump the stack of every fetcher thread. This will likely look quite different at the start of your run than later in the run, and that difference should point to the problem. In the past I have seen these symptoms primarily with parser plugins. I have also seen threads hang infinitely in a socket read, but that is much rarer. Doug
Re: Peak index performance
Byron Miller wrote: For example, I've been tweaking max merge/min merge and such, and I've been able to double my performance without increasing anything but CPU load. Smaller maxMergeDocs will cost you in the end, since these will eventually be merged during the index optimization at the end. I would just leave this at Integer.MAX_VALUE. Larger minMergeDocs will improve performance, but by using more heap. So watch your heap size as you increase this and leave a healthy margin for safety. This is the best way to tweak indexing performance. Larger mergeFactors may improve performance somewhat, but by using more file handles. In general, the maximum number of file handles is around 10-20x (depending on plugins) the mergeFactor. So raising this above 50 on most systems is risky, and the performance improvements are marginal, so I wouldn't bother. Doug
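As an illustration, the knob Doug recommends corresponds to properties along these lines in nutch-site.xml (property names as I recall them from nutch-default.xml of that period; double-check yours, and the value 500 is just an example):
<property>
  <name>indexer.minMergeDocs</name>
  <value>500</value>
</property>
<!-- leave indexer.maxMergeDocs at Integer.MAX_VALUE as suggested above,
     and keep indexer.mergeFactor modest, e.g. 10-50 -->
Raising indexer.minMergeDocs buffers more documents in RAM before a segment is written, so watch the JVM heap as you increase it.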
Re: fetch questions - freezing
Ken Krugler wrote: We're only using the html text parsers, so I don't think that's the problem. Plus we're dumping the thread stack when it hangs, and it's always in the ChunkedInputStream.exhaustInputStream() call (see trace below). The trace did not make it. Have you tried protocol-http instead of protocol-httpclient? Is it any better? What JVM are you running? I get fewer socket hangs in 1.5 than 1.4. Also, the mapred fetcher has been changed to succeed even when threads hang. Perhaps we should change the 0.7 fetcher similarly? I think we should probably go even farther, and kill threads which take longer than a timeout to process a url. Thread.stop() is theoretically unsafe, but I've used it in the past for this sort of thing and never traced subsequent problems back to it... Doug
Re: fetch questions - freezing
Ken van Mulder wrote: As a side note, does anyone have any recommendations for profiling software? I've used the standard hprof, which slows down the process too much for my needs, and jmp, which seems pretty unstable. I recommend 'kill -QUIT' as a poor-man's profiler. With a few stack dumps you can usually get a decent idea of where the time is going. If you want to get fancy you can 'kill -QUIT' every minute or so, then use 'sort | uniq -c | sort -nv' to see where you're spending a lot of time. Doug
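A throwaway version of that poor-man's profiler might look like this, assuming the fetcher runs as a single local JVM whose stdout/stderr goes to fetcher.log (the pgrep pattern and log file name are only examples):
PID=`pgrep -f org.apache.nutch.fetcher.Fetcher`
# take a thread dump once a minute; stop it with Ctrl-C when you have enough
while true; do kill -QUIT $PID; sleep 60; done
# later, count the hottest stack frames across the accumulated dumps
grep 'at org.apache' fetcher.log | sort | uniq -c | sort -nr | head -20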
Re: Peak index performance
Byron Miller wrote:
<property>
  <name>indexer.mergeFactor</name>
  <value>350</value>
  <description></description>
</property>
Initially a high index merge factor caused out-of-file-handle errors, but increasing the others along with it seemed to help get around that. That is a very large mergeFactor, larger than I would recommend. How many documents do you index in a run? More than 350*500=175,000? If not then you're not hitting a merge yet. What does 'ulimit -n' show? Does your performance actually change much when you lower this? Doug
Re: crawl problems
The only link on http://shopthar.com/ to the domain shopthar.com is a link to http://shopthar.com/. So a crawl starting from that page that only visits pages in shopthar.com will only find that one page.
% wget -q -O - http://shopthar.com/ | grep shopthar.com
<tr><td colspan=2>Welcome to shopthar.com</td></td></tr>
<a href=http://shopthar.com/>shopthar.com</a> |
Doug
Earl Cahill wrote: I am trying to do a crawl on trunk of one of my sites, and it isn't working. I make a file urls that just contains the site:
http://shopthar.com/
In my conf/crawl-urlfilter.txt I have:
+^http://shopthar.com/
I then do:
bin/nutch crawl urls -dir crawl.test -depth 100 -threads 20
It kicks in and I get repeating chunks like:
051019 010450 Updating /home/nutch/nutch/trunk/crawl.test/db
051019 010450 Updating for /home/nutch/nutch/trunk/crawl.test/segments/20051019010449
051019 010450 Finishing update
051019 010450 Update finished
051019 010450 FetchListTool started
051019 010450 Overall processing: Sorted 0 entries in 0.0 seconds.
051019 010450 Overall processing: Sorted NaN entries/second
051019 010450 FetchListTool completed
051019 010450 logging at INFO
for ages, but I only see two nutch hits in my access log: one for my robots.txt and one for my front page. Nothing else. The crawl finishes, then I do a search and can only get hits for the front page. When I do the search via lynx, I get a momentary Bad partial reference! Stripping lead dots. I can't imagine this is really the problem, but pretty well all my links are relative. I mean, nutch has to be able to follow relative links, right? Ideas? Thanks, Earl
Re: Nutch Search Speed Concern
TL wrote: You mentioned that as a rule of thumb each node should only have about 20M pages. What's the main bottleneck that's encountered around 20M pages? Disk i/o , cpu speed? Either or both, depending on your hardware, index, traffic, etc. CPU-time to compute results serially can average up to a second or more with ~20M page indexes. And the total amount of i/o time per query on indexes this size can be more than a second. If you can spread the i/o over multiple spindles then it may not be the bottleneck. Doug
Re: Nutch Search Speed Concern
Murray Hunter wrote: We tested search for a 20 million page index on a dual-core 64-bit machine with 8 GB of RAM, storing the nutch data on another server through Linux NFS, and its performance was terrible. It looks like the bottleneck was NFS, so I was wondering how you had your storage set up. Are you using NDFS, or is it split up over multiple servers? For good search performance, indexes and segments should always reside on local volumes, not in NDFS and not in NFS. Ideally these can be spread across the available local volumes, to permit more parallel disk i/o. As a rule of thumb, searching starts to get slow with more than around 20M pages per node. Systems larger than that should benefit from distributed search. Doug
Re: Do you believe in Clause sanity?
Andy Lee wrote: Not to become a one-person thread or anything (and I'll shut up if this attempt gets no answers), but this seems like a straightforward question. Is there some design principle I'm missing that would be violated if clauses could be removed from a query? No, not that I can think of. The public constructors for Query are limited in order to prohibit certain things that are not yet supported, like optional and nested clauses. Doug
Re: Do you believe in Clause sanity?
Andy Lee wrote: Thanks, Doug. In that case, please consider this a request for a couple of API changes which you may be planning anyway:
* addClause() and removeClause() methods in Query.
* Setters in Query.Clause for its term/phrase.
Please submit a bug report, ideally with a patch file attached. Doug
Re: Unlimited access to a web server for Nutch
Ngoc Giang Nguyen wrote: I'm running Nutch to crawl some specific websites whose web admins I know personally. So is there any way to change the settings of the target web servers such that they give my Nutch higher priority, let's say unlimited access, assuming they are all Apache servers? Usually I observe that Nutch hits a lot of HTTP max-delay errors even when I set the timeout quite large and the network connections are near perfect (I also double-checked by visiting those websites in a browser, and they respond well). Try something like fetcher.server.delay=0 and fetcher.threads.per.host=10. Doug
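In nutch-site.xml that would look something like the following (the values are only a starting point for servers you control; being this aggressive against servers you don't control is impolite):
<property>
  <name>fetcher.server.delay</name>
  <value>0.0</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>10</value>
</property>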
Re: a simple map reduce tutorial
Earl Cahill wrote: 1. Sounds like some of you have some glue programs that help run the whole process. Are these going to end up in subversion sometime? I am guessing there is much duplicated effort.
I'm not sure what you mean. I set environment variables in my .bashrc, then simply use 'bin/start-all.sh' and 'bin/nutch crawl'.
2. Not sure how to test that my index actually worked. Starting catalina in my index directory didn't work this time.
NutchBean now looks for things in the subdirectory of the connected directory named 'crawl'. Is that an improvement or is it just confusing?
3. What do you all think of setting up some test directories to crawl, in say http://lucene.apache.org/nutch/test/ Thinking it would be kind of cool to have junit run through a whole process on external pages.
I think it would be better to have the junit tests start jetty then crawl localhost. I'd love to see some end-to-end unit tests like that.
4. Any way that http://spack.net/nutch/SimpleMapReduceTutorial.html http://spack.net/nutch/GettingNutchRunningOnUbuntu.html can get on the wiki? I am using apache-ish style and would change to whatever, but as fun as these are to write, I would like to see them used.
You should be able to add them to the wiki yourself. Just fill out: http://wiki.apache.org/nutch/UserPreferences Thanks, Doug
Re: mapred Sort Progress Reports
Rod Taylor wrote: Tell me how it behaves during the sort phase. I ran 8 jobs simultaneously. Very high await time (1200) and it was doing about 22MB/sec data writes. Nearly 0 reads from disk (everything would be cached in memory). This is during the sort part? This first writes a big file, then reads it, then sorts it. With 20M records I think the file is around 2.5GB, so eight of these would be 20GB. Do you have 20GB of RAM? Doug
Re: How to get real Explanation instead of crippled HTML version?
Ilya Kasnacheev wrote: So I only get the HTMLised version, which is useless if I need only the page rating (the top Explanation.getValue()). How would I get the page rating (i.e. a number from 0 to 1 showing how relevant a Hit was to a Query) from nutch? Explanations are not a good way to get this, as, for each explanation, the query must be re-executed. In recent versions of Nutch the score can be retrieved from a hit with ((FloatWritable)hit.getSortValue()).get(). Doug
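A small sketch of that in code, assuming a default relevance-sorted search where the sort value is the score (method names and the FloatWritable package are per the NutchBean/Hits API as I recall it; verify against your version):

import org.apache.nutch.io.FloatWritable;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.Hits;

Hits hits = bean.search(query, 10);
for (int i = 0; i < hits.getLength(); i++) {
  Hit hit = hits.getHit(i);
  float score = ((FloatWritable) hit.getSortValue()).get();  // relevance score for this hit
  // ...
}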
Re: MapReduce
Paul van Brouwershaven wrote: The AcceptEnv option is only available with OpenSSH 3.9; Debian currently only has 3.8.1p1 in stable and testing (4.2 in unstable). Is there another way to solve the environment problem? I don't know. The Fedora and Debian systems that I use have AcceptEnv. Doug
Re: mapred Sort Progress Reports
Rod Taylor wrote: I see. Is there any way to speed up this phase? It seems to be taking as long to run the sort phase as it did to download the data. It would appear that nearly 30% of the time for the nutch fetch segment is spent doing the sorts, so I'm well off the 20% overhead number you seem to be able to achieve for a full cycle. 5 machines (4 CPUs) each with 8 tasks, with a load average of about 5, and they run Redhat. Context switches are low (under 1500/second). There is virtually no IO (boxes have plenty of ram) but the kernel is doing a bunch of work, as 50% of CPU time is in system (unsure what; I'm not familiar with the Linux DTrace-type tools). Sorting is usually i/o bound on mapred.local.dir. When eight tasks are using the same device this could become a bottleneck. Use iostat or sar to view disk i/o statistics. My plan is to permit one to specify a list of directories for mapred.local.dir and have the sorting (and everything else) select randomly among these for temporary local files. That way all devices can be used in parallel. As a workaround you could try starting eight tasktrackers, each configured with a different device for mapred.local.dir. Yes, that's a pain, but it would give us an idea of whether my analysis is correct. Doug
Re: mapred Sort Progress Reports
Rod Taylor wrote: Virtually no IO reported at all. Averages about 200kB/sec read and writes are usually 0, but burst to 120MB/sec for under 1 second once every 30 seconds or so. That's strange. I wonder what it's doing. Can you use 'kill -QUIT' to get a thread dump? Try a few of these to sample the stack and see where it seems to be spending time. Doug
Re: mapred Sort Progress Reports
Try the following on your system:
bin/nutch org.apache.nutch.io.TestSequenceFile -fast -count 2000 -megabytes 100 foo
Tell me how it behaves during the sort phase. Thanks, Doug
Re: MapRed - how can I get the fetcher logs?
Gal Nitzan wrote: I only have two log files:
-rw-r--r-- 1 root root 8090 Oct 3 07:01 nutch-root-jobtracker-kunzon.log
-rw-r--r-- 1 root root 4290 Oct 3 07:01 nutch-root-namenode-kunzon.log
The tasktracker logs would be on the machines running the tasktracker, which might be different than your namenode and jobtracker. Also note that the jobtracker's web interface shows summary statistics for each fetcher task. Doug