PruneRegexTool
Nutchers,

I know that I have seen many posts to this list regarding the usage of nutch's prune tool (org.apache.nutch.tools.PruneIndexTool) and that many of those posts noted the difficulty of having to pass Lucene queries as parameters (for those of us who don't already have a firm understanding of Lucene queries). I also know that many users wish there were a pruning tool that could prune an index based on regex rules (like those used by regex-urlfilter.txt). A while back, a friend of mine who is more java savvy than I am (I am not java savvy at all) wrote such a tool, which we called PruneRegexTool.

The reason for this post is twofold: first, to share this tool with the nutch community, and second, to ask for help in updating it to work with nutch 0.8.1, since it currently only works with 0.7.1. I'm not exactly sure why it doesn't work with 0.8.1, but I can think of at least two probable reasons: the tool uses Lucene 1.4.3, which is now outdated, and it is based on the PruneIndexTool, which probably changed between nutch 0.7.1 and nutch 0.8.1. As I have already noted, my understanding of java is quite poor, so I don't know whether updating this tool for nutch 0.8.1 is a 20 minute or a 20 hour task (I'm guessing somewhere in between).

Anyway, I have included the .java and .class files for this tool (as attachments), as well as the instructions that my friend provided me regarding how to use it. If anyone updates it for use with nutch 0.8.1, I would be eternally grateful to get a copy of the updated code, as well as any changes in the instructions for how to use this tool properly.

Final Disclaimer: I cannot make any assurances about this tool and the instructions provided, except that it worked fine for me when I was still using nutch 0.7.1.

Instructions:

*Stage 1:* Compiling PruneRegexTool

Everything in this stage except for the very last step (running javac on PruneRegexTool.java) only needs to be done once. The last step only needs to be done again if changes are made to PruneRegexTool.java.

If you haven't already, make sure that the java compiler is in your PATH. If not, you can add it by putting the following in your ~/.bash_profile:
- PATH=$PATH:/usr/local/j2sdk1.4.2_08/bin/
- export PATH

Then run:
- source ~/.bash_profile
to make the change take effect for your current session. (The ~ is an alias for your home directory, so you can run this command from anywhere.)

PruneRegexTool uses code from Nutch and two other libraries that are not included with the Nutch package: Lucene and ORO. Download the source code for these libraries into your nutch-0.7.1 directory with the following commands (run from within the nutch directory):
- wget http://apache.mirrors.pair.com/jakarta/lucene/source/lucene-1.4.3-src.tar.gz
- wget http://apache.mirrors.versehost.com/jakarta/oro/source/jakarta-oro-2.0.8.tar.gz

And then extract them with:
- tar xzvf lucene-1.4.3-src.tar.gz
- tar xzvf jakarta-oro-2.0.8.tar.gz

Now that we've got these libraries set up, we need to tell the java compiler where they (and Nutch's own source code) live. This is done by setting the CLASSPATH environment variable. Add the following to your ~/.bash_profile, replacing USERNAME with your username:

CLASSPATH=/home/USERNAME/nutch-0.7.1/jakarta-oro-2.0.8/src/java:/home/USERNAME/nutch-0.7.1/lucene-1.4.3/src/java:/home/USERNAME/nutch-0.7.1/src/java:/home/USERNAME/nutch-0.7.1/src/plugin/urlfilter-regex/src/java:.
export CLASSPATH

Next, copy PruneRegexTool.java and PruneRegexTool.class to your nutch directory.
You should then be able to compile the tool with the command:

javac PruneRegexTool.java

*Stage 2:* Set Up conf/regex-prune.txt

The file conf/regex-prune.txt is the default file that PruneRegexTool reads to determine which pages to keep and which to discard, based on their urls. It uses the same format as conf/regex-urlfilter.txt above, with + indicating that a url will be pruned and - indicating that it will not be pruned. A custom file can be specified when running the tool with the flag -regexfile filename.

*Stage 3:* Run The Tool

The full set of command line arguments that can be given to the tool is as follows:
- PruneRegexTool indexDir | segmentsDir [-regexfile filename] [-dryrun] [-force] [-output filename]
- NOTE: exactly one of indexDir or segmentsDir must be provided
- -regexfile: specify the file containing the regexes to be used during pruning; defaults to conf/regex-prune.txt
- -dryrun: don't do anything, just show what would be done
- -force: force index unlock, if locked. Use with caution!
- -output: store pruned URLs in a text file

An example run of the tool (from your nutch-0.7.1 directory):
- bin/nutch PruneRegexTool crawl/segments -dryrun -output prunedurls.log

End of Instructions

Again, if anyone who uses nutch 0.8.1 thinks this tool would be useful, I would love for you to update it so it works with nutch 0.8.1. For those of you using
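Since the tool's files travelled as attachments and aren't reproduced here, the following is a hypothetical conf/regex-prune.txt illustrating the format described in Stage 2 (+ marks urls to prune, - marks urls to keep; the site and patterns are invented examples, not from the actual tool):

# prune everything under a temp directory on the example site
+^http://www\.example\.com/tmp/
# prune printer-friendly duplicates
+^http://www\.example\.com/.*[?&]action=print
# keep everything else
-.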
Can PruneIndexTool still be used in Nutch 0.8.1?
Hi,

When using 0.7.x I often used the PruneIndexTool, but I noticed that calling bin/nutch Prune no longer works and Prune is not included in the 0.8 command line options section of the nutch wiki. Furthermore, when I call the command locate PruneIndexTool, all the returned files start with nutch-0.7.1/docs/api/org/apache/nutch/tools/ and nothing comes up from my nutch-0.8.1 directory.

Can the PruneIndexTool still be used with nutch 0.8.1? If so, is the usage the same as it was under nutch 0.7.x, and where can the source files be found?

Thanks for any help anyone can provide!

-Bryan
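A more direct check than locate (whose database may be stale) is to search the 0.8.1 tree itself; standard shell commands, with paths assuming a checkout in the home directory:

$ find ~/nutch-0.8.1 -name 'PruneIndexTool*'
$ grep -rl PruneIndexTool ~/nutch-0.8.1/src

If both come back empty, the class simply isn't present in that tree.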
Does nutch 0.8.x have a command like bin/nutch fetchlist -dumpurls
Hi,

When I was using nutch 0.7, I found the bin/nutch fetchlist -dumpurls command to be very useful. However, I have not been able to find an equivalent command in nutch 0.8.x. Essentially, all I want to do is dump all urls stored in a certain segment (or group of segments) into a text file. In nutch 0.7.x I would call a command like this:

$ bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls $s1 foo.txt

Any suggestions for how this can be accomplished in nutch 0.8.x are very much appreciated.

Thanks, Bryan
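In 0.8.x the segment tooling moved to org.apache.nutch.segment.SegmentReader, exposed as bin/nutch readseg. Assuming your build supports the -dump form and its skip flags (run bin/nutch readseg with no arguments to confirm the exact usage), something along these lines should produce a URL list:

$ bin/nutch readseg -dump crawl/segments/20060613200226 dump_out \
    -nocontent -nofetch -nogenerate -noparse -noparsedata -noparsetext
# the dump labels each record's URL; the exact prefix and output filename
# may differ by version, so inspect dump_out before relying on this grep:
$ grep 'URL::' dump_out/dump | sed 's/URL:: //' > foo.txt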
Two Errors in Nutch 0.8 Tutorial?
I am certainly far from a nutch expert, but it appears to me that there are two errors in the current Nutch 0.8 tutorial. First off, here is the version of Nutch 0.8 that I am using, in case changes made in a newer version invalidate my comments:

-bash-2.05b$ svn info
Path: .
URL: http://svn.apache.org/repos/asf/lucene/nutch/trunk
Repository Root: http://svn.apache.org/repos/asf
Repository UUID: 13f79535-47bb-0310-9956-ffa450edef68
Revision: 414318
Node Kind: directory
Schedule: normal
Last Changed Author: siren
Last Changed Rev: 414306
Last Changed Date: 2006-06-14 11:08:28 -0500 (Wed, 14 Jun 2006)
Properties Last Updated: 2006-06-14 12:00:57 -0500 (Wed, 14 Jun 2006)

Error #1: Towards the end of the tutorial, the following command is found:

bin/nutch invertlinks crawl/linkdb crawl/segments

When I call this command verbatim, I get the following error:

2006-07-25 08:44:40,503 WARN mapred.LocalJobRunner (LocalJobRunner.java:run(119)) - job_8ly5hf
java.io.IOException: No input directories specified in: Configuration: defaults: hadoop-default.xml, mapred-default.xml, /home/bryan/nutch-8d/hadoop/mapred/local/localRunner/job_8ly5hf.xml final: hadoop-site.xml
        at org.apache.hadoop.mapred.InputFormatBase.listPaths(InputFormatBase.java:96)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listPaths(SequenceFileInputFormat.java:37)
        at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:106)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)

I think the correct syntax for the command should be:

bin/nutch invertlinks crawl/linkdb crawl/segments/*

(with the /* added to the end).

Error #2: The tutorial says that to index, the following command should be called:

bin/nutch index indexes crawl/linkdb crawl/segments/*

However, when I call that command I get the following error:

Usage: <index> <crawldb> <linkdb> <segment> ...

I believe the correct syntax should be:

bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*

If these are indeed errors in the tutorial, perhaps someone with the authority to do so would be kind enough to make the necessary changes.

My two cents,
Bryan
Dissecting the Nutch Search Page (Please Help!)
I am trying to modify the standard nutch search page (for nutch 0.8-dev) and have several questions:

1. Do most people modify the search.html file directly, or is it better to modify the files that are used to automatically generate the search.html page? If the latter is the case, are there any files besides these that are involved in the creation of the search page:
- ../nutch-8d/src/web/jsp/search.jsp
- ../nutch-8d/src/web/include/style.html
- ../nutch-8d/src/web/include/footer.html
- ../nutch-8d/src/web/include/en/header.xml
- ../nutch-8d/src/web/pages/en/search.xml

2. I have looked at the source of the search.html page that comes up when you open :8080, and it appears that this page is mostly generated from search.jsp and certain other html pages it includes (listed above). However, I cannot figure out where the menu is being imported from. This is the section of code that follows the imported style sheet but precedes the input box and button used to search. Where does this code come from?? Also, I cannot figure out where the nutch_logo.gif image is coming from (that file name doesn't even appear in the source for search.html).

Any help is much appreciated.

Thanks, Bryan
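When chasing down where a fragment or image reference originates, grepping the web sources directly is often faster than reading each include by hand; for example (paths follow the layout listed above):

$ grep -rn "nutch_logo" ~/nutch-8d/src/web/
$ grep -rni "menu" ~/nutch-8d/src/web/include/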
Re: Problems switching over from nutch 0.7.1 to nutch 0.8 (dev) -- zero search results problem with invertlinks
Kuro,

Thanks for the tip. I made the changes you suggested and took a look at the debug output, which allowed me to realize that my difficulties were actually occurring before I got the Job Failed exception. Specifically, when I call the bin/nutch inject command at the beginning of my whole-web crawl, I get the following error, which I haven't been able to figure out (any insights are much appreciated):

2006-06-14 18:02:32,401 DEBUG conf.Configuration (Configuration.java:init(67)) - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.init(Configuration.java:67)
        at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
        at org.apache.nutch.crawl.Injector.main(Injector.java:148)

This problem seemed like a different issue, so I posted separately about it yesterday:
http://mail-archives.apache.org/mod_mbox/lucene-nutch-user/200606.mbox/[EMAIL PROTECTED]

Thanks, Bryan

On 6/16/06, Teruhiko Kurosaka [EMAIL PROTECTED] wrote:

Bryan,
Some recent changes in the logging code changed the default logging behavior; nutch doesn't output anything to the console. (It supposedly sends the logging output to a file described as ${nutch.log.dir}/${nutch.log.file}, but I don't know what the default values of these variables are.) You can change conf/log4j.properties to change the logging behavior of the nutch command line. (There is another logging properties file for the search GUI.) I changed conf/log4j.properties as outlined below, to enable full debug logging. (Only changed lines are shown.)

#log4j.rootLogger=INFO,DRFA
log4j.rootLogger=DEBUG, stdout
#log4j.logger.org.apache.nutch=INFO
#log4j.logger.org.apache.hadoop=WARN

I hope this helps.
-kuro

From: Bryan Woliner [mailto:[EMAIL PROTECTED]]
Sent: 2006-6-15 18:21

$ bin/nutch crawl test -dir crawl3 -depth 2 -topN 50

It seemed like everything worked correctly (although unlike nutch 0.7.1, no output was generated)
Error when calling bin/nutch inject -- java.io.IOException: config()
On June 13th, I downloaded the trunk version of nutch-0.8-dev and then built it using ant. I then created a valid urls file and put it in the urlsdir subdirectory of my nutch directory. I also made sure that my conf/regex-urlfilter.txt file was valid. At that point, I tried to do my first whole-web crawl using 0.8, so I called the command:

bin/nutch inject testcrawl/crawldb urlsdir

However, when calling this command, the logged output included the following error (more logged output included below):

2006-06-14 18:02:32,401 DEBUG conf.Configuration (Configuration.java:init(67)) - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.init(Configuration.java:67)
        at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
        at org.apache.nutch.crawl.Injector.main(Injector.java:148)

What am I doing wrong? Are there any configuration files or environment variables that I need to modify? Thanks for any helpful suggestions anyone can provide!

-Bryan

LOGGED OUTPUT:

2006-06-19 18:02:32,401 DEBUG conf.Configuration (Configuration.java:init(67)) - java.io.IOException: config()
        at org.apache.hadoop.conf.Configuration.init(Configuration.java:67)
        at org.apache.nutch.util.NutchConfiguration.create(NutchConfiguration.java:50)
        at org.apache.nutch.crawl.Injector.main(Injector.java:148)
2006-06-19 18:02:32,407 INFO crawl.Injector (Injector.java:inject(110)) - Injector: starting
2006-06-19 18:02:32,410 INFO crawl.Injector (Injector.java:inject(111)) - Injector: crawlDb: crawl8/crawldb
2006-06-19 18:02:32,410 INFO crawl.Injector (Injector.java:inject(112)) - Injector: urlDir: urlsdir
2006-06-19 18:02:32,696 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing jar:file:/home/bryan/nutch-8d/lib/hadoop-0.3.2.jar!/hadoop$
2006-06-19 18:02:32,816 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/nutch-default.xml
2006-06-19 18:02:32,861 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/nutch-site.xml
2006-06-19 18:02:32,872 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/hadoop-site.xml
2006-06-19 18:02:32,873 INFO crawl.Injector (Injector.java:inject(120)) - Injector: Converting injected urls to crawl db entries.
2006-06-19 18:02:32,876 DEBUG conf.Configuration (Configuration.java:init(76)) - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.init(Configuration.java:76)
        at org.apache.hadoop.mapred.JobConf.init(JobConf.java:86)
        at org.apache.hadoop.mapred.JobConf.init(JobConf.java:97)
        at org.apache.nutch.util.NutchJob.init(NutchJob.java:26)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:121)
        at org.apache.nutch.crawl.Injector.main(Injector.java:155)
2006-06-19 18:02:32,889 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing jar:file:/home/bryan/nutch-8d/lib/hadoop-0.3.2.jar!/hadoop$
2006-06-19 18:02:32,914 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/nutch-default.xml
2006-06-19 18:02:32,930 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing jar:file:/home/bryan/nutch-8d/lib/hadoop-0.3.2.jar!/mapred$
2006-06-19 18:02:32,933 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing jar:file:/home/bryan/nutch-8d/lib/hadoop-0.3.2.jar!/mapred$
2006-06-19 18:02:32,937 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/nutch-site.xml
2006-06-19 18:02:32,940 INFO conf.Configuration (Configuration.java:loadResource(397)) - parsing file:/home/bryan/nutch-8d/conf/hadoop-site.xml
2006-06-19 18:02:33,503 DEBUG conf.Configuration (Configuration.java:init(76)) - java.io.IOException: config(config)
        at org.apache.hadoop.conf.Configuration.init(Configuration.java:76)
        at org.apache.hadoop.mapred.JobConf.init(JobConf.java:86)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.init(LocalJobRunner.java:57)
        at org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:181)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:277)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:312)
        at org.apache.nutch.crawl.Injector.inject(Injector.java:131)
        at org.apache.nutch.crawl.Injector.main(Injector.java:155)
Problems switching over from nutch 0.7.1 to nutch 0.8 (dev) -- zero search results problem with invertlinks
Hi All,

I have been using Nutch 0.7.1 for some time (although I am certainly not an expert) and am now in the process of switching over to Nutch 0.8. However, I have run into a couple of problems along the way and am hoping that those of you who have been using nutch 0.8 for a while will take a quick look at what I have done and see if you can figure out why I am running into these problems. Thanks ahead of time for any help you can offer!!

The two problems I am having are essentially as follows (more detail provided below):

1. So far I have been able to run a test crawl using bin/nutch crawl, but when I go to my nutch search page (:8080) and try a search, I always get zero results returned, even though I am able to open the index using Luke and verify that there are approximately 200 documents and approximately 40,000 search terms in my index, and there are no errors in the Tomcat logs.

2. I am unable to get through the whole-web crawl in the nutch-0.8 tutorial. Specifically, I get stuck on the bin/nutch invertlinks step, where I get the message:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:342)
        at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:203)
        at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:305)

** Details **

These are the steps I took to install nutch 0.8:

1. Downloaded Nutch 0.8 (dev)

I was previously using the release copy of nutch 0.7.1, so this was the first time I had to build a release of nutch using ant. I downloaded ant and then installed the current trunk of nutch 0.8 (thinking it would be more stable than the nightly build). To do this I did the following from my home directory:

$ svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk
$ mv trunk nutch-8d
$ export ANT_HOME=/usr/local/ant/apache-ant-1.6.5
$ export PATH=${PATH}:${ANT_HOME}/bin
$ cd nutch-8d
$ ant

2. Compiled the Nutch 0.8 war file and then replaced the ROOT Tomcat directory

I then did the following from my nutch-8d directory:

$ ant war
$ mv /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT_nutch-0.7/
$ cp build/nutch-0.8-dev.war /usr/local/jakarta-tomcat-4.1.31/webapps/ROOT.war

3. Tried my first Nutch 0.8 crawl using the CrawlTool

I first created an urls file at ../nutch-8d/test/urls and then set the crawl-urlfilter.txt file to allow essentially all URLs. I then did a round of fetching using the following call:

$ bin/nutch crawl test -dir crawl3 -depth 2 -topN 50

It seemed like everything worked correctly (although unlike nutch 0.7.1, no output was generated). I then did the following:

$ cd crawl3
$ /usr/local/jakarta-tomcat-4.1.31/bin/catalina.sh stop
Using CATALINA_BASE:   /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_HOME:   /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_TMPDIR: /usr/local/jakarta-tomcat-4.1.31/temp
Using JAVA_HOME:       /usr/local/j2sdk1.4.2_08
$ /usr/local/jakarta-tomcat-4.1.31/bin/catalina.sh start
Using CATALINA_BASE:   /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_HOME:   /usr/local/jakarta-tomcat-4.1.31
Using CATALINA_TMPDIR: /usr/local/jakarta-tomcat-4.1.31/temp
Using JAVA_HOME:       /usr/local/j2sdk1.4.2_08

Everything seemed to be working correctly, but when I went to my nutch search page (i.e. :8080), no matter what search term I enter, I get zero results returned. I then did the following to troubleshoot the situation:

1. Reviewed the tomcat logs (no error messages of any sort).
2. Looked at the following segment stats:

$ bin/nutch segread -list -dir crawl3/segments
NAME            GENERATED  FETCHER START        FETCHER END          FETCHED  PARSED
20060613200213  3          2006-06-13T20:02:20  2006-06-13T20:02:22  3        3
20060613200226  214        2006-06-13T20:02:32  2006-06-13T20:04:48  217      181

3. Opened the index I am trying to search using Luke, which allowed me to verify that there are approximately 200 documents and approximately 40,000 search terms in my index (including search terms that were returning zero results when I was searching for them). I HAVE NO IDEA WHY ZERO SEARCH RESULTS ARE ALWAYS BEING RETURNED -- PLEASE HELP.

4. Trying a Whole-Web Crawl

After I couldn't figure out why I was always getting zero search results, I tried to follow the instructions for a whole-web crawl, just for the hell of it. Things seemed to be going fine until I got to the invertlinks step, at which point I always get an error message. Below are the command calls that I made (and the error message). Please let me know what I am doing wrong.

I first made sure that the test/urls file and regex-urlfilter.txt files had valid entries, which they do.

-bash-2.05b$ bin/nutch inject testcrawl/crawldb test
-bash-2.05b$ bin/nutch generate testcrawl/crawldb testcrawl/segments
-bash-2.05b$ s1=`ls -d testcrawl/segments/2* | tail -1`
-bash-2.05b$ echo
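For the zero-results symptom, one setting worth checking is searcher.dir (the same property discussed in the "which files/directories are needed" thread below): the webapp only finds the data if this points at the crawl directory. A sketch of the entry, typically placed in the deployed webapp's WEB-INF/classes/nutch-site.xml; the path shown is illustrative:

<property>
  <name>searcher.dir</name>
  <!-- illustrative: the crawl directory holding index/ (or indexes/) and segments/ -->
  <value>/home/bryan/nutch-8d/crawl3</value>
</property>

Tomcat needs a restart after the change; starting tomcat from inside the crawl directory, as done above, is the other common way to make the searcher find the data.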
What are valid names and location(s) for segments
I am using nutch 0.7.1 and have a couple of questions about valid segment names and locations:

I can get nutch to work fine when I store my segments, with their original nutch-assigned names, in the folder /usr/local/nutch-0.7.1/live/segments/ and then start tomcat from the /usr/local/nutch-0.7.1/live/ directory. However, if I change the names of any of the segments, then I get either zero search results or a blank screen when I try to search. Additionally, if I do not change the names but move the segments to sub-directories of the /live/segments/ folder (i.e. /live/segments/site1/), then I always get zero search results.

Question: What is the easiest way to get nutch to recognize segments with modified names, or those that are stored in a sub-directory of the segments folder?

In General: The larger problem that I am trying to solve is that my nutch search engine currently crawls and indexes a couple dozen sites, and I want to update (i.e. re-crawl) these sites independently and at different time intervals. My current plan is to have a ../live/segments/ folder and store an updated (and indexed) segment for each site in that folder. With this in mind, I'm sure you can understand why it would be difficult to keep this folder organized without being able to rename segments and/or store them in sub-directories. If anyone has any ideas about how to organize these segments without renaming them or storing them in sub-directories, I'm all ears.

Thanks ahead of time for any suggestions, Bryan
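One way to keep the folder organized without renaming segments or nesting them is a plain-text manifest built from each segment's own fetchlist. A sketch using the 0.7.x FetchListEntry dump that appears in the "Using FetchListEntry" thread below (the manifest filename is just an example):

for s in live/segments/2*; do
  # dump the segment's URLs, then record the first URL next to the segment name
  bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls "$s" /tmp/seg-urls.txt
  echo "$s $(head -1 /tmp/seg-urls.txt)" >> live/segments.manifest
done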
Re: Do not index seed page?
I have a similar issue and have begun working on a tool that would prune an index using a file of regexes. When I get it working I will be happy to make it publicly available. -Bryan

On 1/23/06, Stefan Groschupf [EMAIL PROTECTED] wrote:

Blocking a page in a url filter means the page will also not be fetched, so that doesn't solve your problem. You can remove the page manually from the index, e.g. by using PruneIndexTool. However, I have something here that can also solve the problem, but I need some more time to prepare a patch. Stefan

On 21.01.2006 at 16:54, Franz Werfel wrote:

Yes, that is an option we are certainly considering, but we would rather have a start page and forget about it. Cheers, Fr

On 1/20/06, Neal Whitley [EMAIL PROTECTED] wrote:

Franz, Someone else will need to confirm this... FYI... why not simply inject the urls directly into Nutch?

./nutch inject db/ -urlfile seeds.txt

At 03:49 PM 1/20/2006, you wrote:

Thank you, but if I do that, will the page be read for urls? Cheers, Frank

On 1/20/06, Neal Whitley [EMAIL PROTECTED] wrote:

Franz, I 'think' you could use the regex url filter to not index this page (regex-urlfilter.txt). Something like:

-^http://([a-z0-9]*\.)*tripod.com/

I am new to Nutch so I make no guarantee... :-) Neal

At 05:23 AM 1/20/2006, you wrote:

Hello, We are trying to implement Nutch on an intranet and have set up a special page which has links to all the other pages of the site, since many are not linked together. We will start with this special page and then go from there to all the other pages, but we would like to not index the first page (so that it doesn't show up in search results), just use it for its links. Is it possible easily? Thank you.

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Common Lucene Queries for PruneIndexTool -- GROUPS of files or folders
OK, I have spent a fair amount of time trying to figure out how to create the correct Lucene queries to use with the PruneIndexTool. I have read the wiki page for bin/nutch Prune, looked at the Lucene Query Parser Syntax page, and browsed past mailing list discussions on the subject. Accordingly, I have used bin/nutch org.apache.nutch.searcher.Query to create queries for a specific URL or a specific directory: I enter the URL or directory at the Query prompt and then copy the +(url:*) section of the output into my queries.txt file.

However, I am still at a loss for how to create the proper lucene queries for GROUPS of files and folders. Here are some of the most common groupings of files and/or directories I am trying to prune from my index. It would be great if anyone could suggest the correct lucene query to use and/or how to figure out these types of queries.

1. I want to prune the URL http://www.testsite.com/testdir/, but I don't want to prune any other files in the /testdir/ directory.

2. I want to prune URLs in the range http://www.testsite.com/[20-40]/, meaning the following URLs would be pruned:
http://www.testsite.com/20/
http://www.testsite.com/21/
...
http://www.testsite.com/39/
http://www.testsite.com/40/
I would even settle for the following URLs being pruned: http://www.testsite.com/??/

3. I want to prune the URLs http://www.testsite.com/*.php, either just in this directory or recursively through all sub-directories (ideally I would like to know how to do both).

Any help is much appreciated! -Bryan
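For reference, the prompt-based workflow described above looks roughly like this (the printed output shape is from memory and may differ by version; the final invocation mirrors the one in the "Number of URLs" thread below):

$ bin/nutch org.apache.nutch.searcher.Query
Query: http://www.testsite.com/testdir/
# the parser echoes the parsed query; copy its +(url:...) clause into
# queries.txt, one query per line, then dry-run the prune:
$ bin/nutch org.apache.nutch.tools.PruneIndexTool segments -dryrun -queries queries.txt -showfields url,title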
How can no URLs be fetched until the 11th round of fetching?
I am using Nutch 0.7.1 (no mapreduce) and did a whole-web crawl with 14 rounds of fetching and an urls file with one URL in it. No urls were fetched during the first 10 rounds, but then in the 11th round one URL was fetched, and increasingly more URLs were fetched in rounds 12-14. I am basing the numbers of URLs fetched on the output from calling bin/nutch segread (included below).

I don't understand how this can happen. If a URL is not fetched during a round, are its outlinks still added to the database for the next round of fetching? Why would I have 10 rounds of fetching with no URLs fetched and then suddenly have one fetched successfully in the 11th round? Any suggestions are appreciated. -Bryan

Here is the output when I call: bin/nutch segread -list -dir segments

run java in /usr/local/j2sdk1.4.2_08
060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-default.xml
060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-site.xml
060115 205601 No FS indicated, using default:local
060115 205601 PARSED? STARTED           FINISHED          COUNT DIR NAME
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173409
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173413
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173417
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173421
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173424
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173428
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173432
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173436
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173440
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173443
060115 205602 true    20060115-17:34:51 20060115-17:34:51 1     ../segments/20060115173447
060115 205602 true    20060115-17:34:57 20060115-17:41:07 42    ../segments/20060115173454
060115 205602 true    20060115-17:41:16 20060115-18:12:28 234   ../segments/20060115174113
060115 205602 true    20060115-18:12:37 20060115-19:51:07 738   ../segments/20060115181234
060115 205602 TOTAL: 1015 entries in 14 segments.
Re: How can no URLs be fetched until the 11th round of fetching?
I don't think that I was completely clear in my first post. What you are saying makes sense if I was doing a one-round fetch on a number of different occasions. However, I am doing 14 rounds of fetching, each called by one script, in the pattern outlined in the nutch tutorial, where my script does 14 loops of the following:

--
bin/nutch generate db segments
s[$i]=`ls -d segments/2* | tail -1`
bin/nutch fetch ${s[$i]}
bin/nutch updatedb db ${s[$i]}
--

Do you think the possibilities you suggested make sense in light of the fact that I am doing each of these rounds of fetching within seconds of each other, each being called by the same script?

I also have a couple of related questions:

(1) In the first round of fetching, the fetchlist is generated from the database, which was injected with the one URL that comprises my urls file. If, in the first round of fetching, the one URL in the fetch list can't be fetched and/or parsed, I am assuming that subsequent rounds of fetching just use the same one-URL fetchlist until this URL is successfully fetched and its outlinks added to the database. Is that correct?

(2) When I call the following command, the resulting file has no output for the rounds where no URLs were fetched. This leads me to believe that the fact that no URLs were fetched is not a result of a fetching or parsing error (since such errors usually show up in the output of this command). Does this make sense? If it does, then what caused no URLs to be fetched?

Thanks for any helpful suggestions, Bryan

On 1/15/06, Fuad Efendi [EMAIL PROTECTED] wrote:

Many things could happen.
Sample 1: website was unavailable during first 10 fetches
Sample 2: 11th fetch used different IP, DNS-to-IP mapping changed (or maybe finally resolved!)
Sample 3: Something changed on the site, redirect added/changed, etc.
Sample 4: web-master modified robots.txt
Sample 5: big first HTML file, network errors during first 10 fetch attempts, etc.
It should be very uncommon behaviour, but it may happen...

-----Original Message-----
From: Bryan Woliner

I am using Nutch 0.7.1 (no mapreduce) and did a whole-web crawl with 14 rounds of fetching and an urls file with one URL in it. No urls were fetched during the first 10 rounds, but then in the 11th round one URL was fetched, and increasingly more URLs were fetched in rounds 12-14. I am basing the numbers of URLs fetched on the output from calling bin/nutch segread (included below). I don't understand how this can happen. If a URL is not fetched during a round, are its outlinks still added to the database for the next round of fetching? Why would I have 10 rounds of fetching with no URLs fetched and then suddenly have one fetched successfully in the 11th round? Any suggestions are appreciated. -Bryan

Here is the output when I call: bin/nutch segread -list -dir segments

run java in /usr/local/j2sdk1.4.2_08
060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-default.xml
060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-site.xml
060115 205601 No FS indicated, using default:local
060115 205601 PARSED? STARTED           FINISHED          COUNT DIR NAME
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173409
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173413
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173417
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173421
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173424
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173428
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173432
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173436
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173440
060115 205601 true    19691231-18:00:00 19691231-18:00:00 0     ../segments/20060115173443
060115 205602 true    20060115-17:34:51 20060115-17:34:51 1     ../segments/20060115173447
060115 205602 true    20060115-17:34:57 20060115-17:41:07 42    ../segments/20060115173454
060115 205602 true    20060115-17:41:16 20060115-18:12:28 234   ../segments/20060115174113
060115 205602 true    20060115-18:12:37 20060115-19:51:07 738   ../segments/20060115181234
060115 205602 TOTAL: 1015 entries in 14 segments.
Re: port :8080 no longer brings up Nutch search page!
Nevermind, I was able to fix it by renaming the tomcat/webapps/ROOT/ directory and then restarting tomcat, which recreated the root directory from the ROOT.war file. I must have messed up some of the permissions in the ROOT folder.

On 1/4/06, Bryan Woliner [EMAIL PROTECTED] wrote:

When I originally installed nutch and tomcat on my machine, I needed to change the ownership and permissions of certain files in subdirectories of the ../jakarta-tomcat-4.1.31/ folder in order to be able to use tomcat and nutch together. [remainder of the original post, including the directory listing and the 404 error, snipped; it appears in full below]
port :8080 no longer brings up Nutch search page!
When I originally installed nutch and tomcat on my machine, I needed to change the ownership and permissions of certain files in subdirectories of the ../jakarta-tomcat-4.1.31/ folder in order to be able to use tomcat and nutch together. I've had no problems with tomcat for some time; however, I am currently in the process of setting up my server so several other users can test nutch, and so I did the following:

I originally had nutch installed in /home/user1/nutch-0.7.1/. I copied the whole nutch folder to /home/user2/nutch-0.7.1/. I got nutch to run fine from user2's account and the /home/user2/nutch-0.7.1/ folder; however, I was getting some permission errors when trying to start tomcat from user2's account and nutch folder. Therefore, I did the following:

All of the files in the /jakarta-tomcat-4.1.31/logs/ folder and the /jakarta-tomcat-4.1.31/webapps/ROOT folder, as well as the /jakarta-tomcat-4.1.31/webapps/ROOT.war file, had user1 as the user and group owner and had file permissions of 655. To enable user2 to access these files, I changed ownership to webadmin:webadmin (a group that user1 and user2 both belong to) and changed the permissions on all of these files to 665.

PROBLEM: I am now able to start tomcat from either user1's or user2's account, BUT when I go to my :8080 port I no longer get the Nutch search page -- instead I get a listing of my /jakarta-tomcat-4.1.31/webapps/ROOT/ directory:

Directory Listing For /

anchors.jsp/  ca/  cached.jsp/  cluster.jsp/  de/  en/  es/  explain.jsp/  fi/  fr/  hu/  img/  include/  index.jsp/  jp/  more.jsp/  ms/  nl/  pl/  pt/  refine-query-init.jsp/  refine-query.jsp/  search.jsp/  sv/  text.jsp/  th/  zh/
(Size and Last Modified are shown as null for every entry.)

Apache Tomcat/4.1.31

When I try to click on search.jsp/ or en/ I get the following error, even though I know these files/folders are in my ../webapps/ROOT/ directory!!

HTTP Status 404 - /search.jsp/

type: Status report
message: /search.jsp/
description: The requested resource (/search.jsp/) is not available.

Apache Tomcat/4.1.31

PLEASE HELP!
Re: which files/directories are needed after a segment or index merge
Thanks Stefan! I guess I should have looked at the searcher.dir entry in nutch-site.xml to start with. For the record, I was able to search the index of the merged segment successfully after I created a /nutch-0.7.1/Live/segments/ folder, put my segments in that directory, and started tomcat from the /Live/ directory. It was not even necessary to modify the searcher.dir entry in my nutch-site.xml file.

Thanks again, Bryan

On 12/22/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

OK, sorry - my fault - in case you use 0.7: the index itself is stored in the segments. So you need to copy the segments that include the indexes into a folder, maybe called finalSegments. In nutch-default.xml your search folder then should be /home/finalSegments or so. Sorry! From my point of view the sources are usable.

> Is there an estimated release date for 0.8?

Not yet.

Stefan
which files/directories are needed after a segment or index merge
I am using nutch 0.7.1 (non-mapred) and am a little confused about how to move the contents of several test crawls into a single live directory. Any suggestions are very much appreciated!

I want to have a Live directory that contains all the indexes that are ready to be searched. The first index I want to add to the Live directory comes from a crawl with 10 rounds of fetching, whose db and segments are stored in the following directories:

/crawlA/db/
/crawlA/segments/

I can merge all of the segments in the segments directory (using bin/nutch mergesegs), which results in the following (11th) segment directory:

/crawlA/segments/20051219000754/

I can then index this 11th (i.e. merged) segment. However, I have the following questions about which files and directories should be moved to the Live directory:

1. If I copy /crawlA/db/ to /Live/db/ and copy /crawlA/segments/20051219000754/ to /Live/segments/20051219000754/, then I can start tomcat from /Live/ and I'm able to search the index fine. However, I'm not sure if that can be duplicated for my crawlB directory. I can't copy /crawlB/db/ to the Live directory because there is already a db directory there. What are the correct files and directories to copy from each crawl into the Live directory?

2. On a side note: am I even taking the correct approach in merging the 10 segments in the crawlA/segments/ directory before I index, or should I index each segment first and then merge the 10 indexes? If I were to take the latter approach (merging indexes instead of segments), which files from the /crawlA/ directory would I need to move to the Live directory?

Thanks ahead of time for any helpful suggestions,
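Since in 0.7.x each segment carries its own index (see Stefan's reply above), one workable approach is to skip a shared /Live/db/ entirely and just collect the merged, indexed segment from each crawl under one segments folder; a sketch, with the crawlB segment name being hypothetical:

$ mkdir -p Live/segments
$ cp -r crawlA/segments/20051219000754 Live/segments/
$ cp -r crawlB/segments/20051230114501 Live/segments/

Starting tomcat from Live/ then lets the searcher pick up every indexed segment under ./segments, which is the arrangement reported to work in the follow-up above. As far as searching goes, the webdb shouldn't be needed; it is used for generating and updating crawls, so keep each crawl's db with its own crawl directory.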
Re: which files/directories are needed after a segment or index merge
Stefan,

Thanks so much for the speedy reply! I have a couple of comments:

1. I am currently NOT using NDFS or mapreduce because the number of sites I am looking to fetch and index is relatively small (currently less than 1 million). Accordingly, I am using the 0.7.1 version of nutch available from the Nutch website. Does this seem like the correct choice?

2. I currently do use a script which basically looks like this (I have another version that includes indexing):

mkdir $DB_DIR
mkdir $SEG_DIR
bin/nutch admin $DB_DIR -create
bin/nutch inject $DB_DIR -urlfile $URL_FILE
i=$FETCH_ROUNDS
while [ $i -gt 0 ]
do
  bin/nutch generate $DB_DIR $SEG_DIR $TOP_N $MAX_SITE
  s[$i]=`ls -d $SEG_DIR/2* | tail -1`
  bin/nutch fetch ${s[$i]}
  bin/nutch updatedb $DB_DIR ${s[$i]}
  i=`expr $i - 1`
done

3. It is my understanding that Nutch 0.7.1 (no NDFS or mapred) only has a webdb and not the linkdb/crawldb structure. If that is correct, then if I'm trying to add two merged segments (and the index of each) to my live folder, do I also need the webdb of each (and if so, do I need to merge them)?

Thanks again for the help, Bryan

On 12/21/05, Stefan Groschupf [EMAIL PROTECTED] wrote:

In general I suggest using a shell script and doing the commands manually instead of using the crawl command, maybe something like:

NUTCH_HOME=$HOME/nutch-0.8-dev
while [ 1 ]   # or maybe just 10 rounds
do
  DATE=$(date +%d%B%Y_%H%M%S)
  $NUTCH_HOME/bin/nutch generate /user/nutchUser/crawldb /user/nutchUser/segments -topN 500
  s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 1-38`
  $NUTCH_HOME/bin/nutch fetch $s
  $NUTCH_HOME/bin/nutch updatedb /user/nutchUser/crawldb $s
  # only when indexing
  $NUTCH_HOME/bin/nutch invertlinks /user/nutchUser/linkdb /user/nutchUser/segments
  # what to index, maybe the merged segment from the 10 rounds
  s=`$NUTCH_HOME/bin/nutch ndfs -ls /user/nutchUser/segments | tail -1 | cut -c 24-38`
  # index
  $NUTCH_HOME/bin/nutch index /user/nutchUser/indexes/$s /user/nutchUser/crawldb /user/nutchUser/linkdb /user/nutchUser/segments/$s
done

This prevents you from having to merge crawl dbs. Then you only need the merged segment, the linkdb, and the index from the merged segment. The 10 segments used to build the merged segment can be removed.

Hope this helps; you may just want to change the script to do a 10-round loop to create your 10 segments, and note that the merging command is also not in the script.

Stefan

On 21.12.2005 at 18:28, Bryan Woliner wrote:

[original post snipped; it appears in full above]

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
Re: Luke and Indexes
Thank you very much for the helpful answers. Most of the pages that didn't make it into the index were indeed due to protocol errors (mostly exceeding http.max.delay).

One quick side note: when I was looking at the Nutch wiki page for bin/nutch segread, I noticed an error on the page and wasn't sure how to go about fixing it, or alerting someone who can. The page currently reads:

... -nocontent ignore content data -noparsedata ignore parse_data data -nocontent ignore parse_text data ...

The 2nd -nocontent should probably be -noparsetext, right?

Thanks again for the help, Bryan

On 12/8/05, Andrzej Bialecki [EMAIL PROTECTED] wrote:

Bryan Woliner wrote:
> I have a couple of very basic questions about Luke and indexes in general. Answers to any of these questions are much appreciated:
> 1. In the Luke overview tab, what does Index version refer to?

It's the time (as in System.currentTimeMillis()) when the index was last modified.

> 2. Also in the overview tab, if Has Deletions? is equal to yes, what are the possible sources of deletions? Dedup? Manual deletions through luke?

Either. Both.

> 3. Is there any way (w/ Luke or otherwise) to get a file listing all of the docs in an index? Basically, is there an index equivalent of this command (which outputs all the URLs in a segment): bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

You can browse through documents on the Document tab. But there is no option to dump all documents to a file. Besides, some fields which are not stored are no longer accessible, so you cannot retrieve them from the index (you may be able to reconstruct them, but it's a lossy operation).

> 4. Finally, my last question is the one I'm most perplexed by: I called bin/nutch segread -list -dir for a particular segments directory and found out that one directory had 93 entries. BUT, when I opened up the index of that segment in Luke, there were only 23 documents (and 3 deletions)! Where did the rest of the URLs go??

Do a segread -dump and check what the protocol status and parse status are for the pages that didn't make it into the index. Most likely you encountered either protocol errors or parsing errors, so there was nothing to index from these entries. In addition, if you ran the deduplication, some of the entries in your index may have been deleted because they were considered duplicates.

--
Best regards,
Andrzej Bialecki
http://www.sigram.com
Contact: info at sigram dot com
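For question 4, the dump-and-grep workflow suggested in the last answer might look something like this; the -dump form and its output layout are assumptions here, so check bin/nutch segread's usage message first:

$ bin/nutch segread -dump segments/20051107233629 > segdump.txt
# tally the status lines to see why entries were skipped (the pattern is a
# guess; inspect segdump.txt to find the real field names):
$ grep -i 'status' segdump.txt | sort | uniq -c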
Luke and Indexes
I have a couple of very basic questions about Luke and indexes in general. Answers to any of these questions are much appreciated:

1. In the Luke overview tab, what does Index version refer to?

2. Also in the overview tab, if Has Deletions? is equal to yes, what are the possible sources of deletions? Dedup? Manual deletions through luke?

3. Is there any way (w/ Luke or otherwise) to get a file listing all of the docs in an index? Basically, is there an index equivalent of this command (which outputs all the URLs in a segment):

bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls segmentsDir

4. Finally, my last question is the one I'm most perplexed by: I called bin/nutch segread -list -dir for a particular segments directory and found out that one directory had 93 entries. BUT, when I opened up the index of that segment in Luke, there were only 23 documents (and 3 deletions)! Where did the rest of the URLs go??

Thanks ahead of time for any helpful suggestions, Bryan
Number of URLs in segment fetchlist vs. Number of URLs in index
How is the number of URLs in a group of segments' fetchlists related to the number of urls in an index? Specifically, when I call the following command on the segments2 directory, I find out that there are 166 entries in 15 segments:

$ bin/nutch segread -list -dir segments2

However, when I tried to prune the index of the same segments2 directory, using the following command, it tells me that 15 of 45 documents have been deleted:

$ bin/nutch org.apache.nutch.tools.PruneIndexTool segments2 -dryrun -queries queries.txt -showfields url,title

What I don't understand is how the number went from 166 in the fetchlists for this folder of segments to only 45 in the indexes. I'm positive that there were not 121 duplicate URLs (or anywhere near that amount).

Thanks, Bryan
Re: RegexURLFilter / testing regex-urlfilter.txt
Sorry if the answer to this question should be obvious, but where in the bin/nutch script do you need to add the following line to be able to test your regex-urlfilter.txt file from the command line?

CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar

On 11/29/05, Thomas Delnoij [EMAIL PROTECTED] wrote:

For the sake of the archives, I will answer my own question here: I had to add the following line to the bin/nutch script to be able to run org.apache.nutch.net.RegexURLFilter from the command line:

CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar

The nutch script overrides the classpath environment variable, so adding the jar there didn't help.

Rgrds, Thomas Delnoij

On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote:

All. The problem is actually a bit different. I was a bit in a hurry when I posted the previous message, apologies. I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath. When I run java org.apache.nutch.net.RegexURLFilter, I am getting:

051005 221040 parsing jar:file:/C:/Personal/vvdb/Nutch/nutch-0.7.1/nutch-0.7.1.jar!/nutch-default.xml
051005 221040 parsing jar:file:/C:/Personal/vvdb/Nutch/nutch-0.7.1/nutch-0.7.1.jar!/nutch-site.xml
051005 221040 Plugins: directory not found: plugins
Exception in thread "main" java.lang.ExceptionInInitializerError
Caused by: java.lang.NullPointerException
        at org.apache.nutch.net.RegexURLFilter.<clinit>(RegexURLFilter.java:64)

When I run nutch org.apache.nutch.net.RegexURLFilter, I am getting:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/net/RegexURLFilter

I know I am missing something obvious, but your help is really appreciated. Kind regards, Thomas Delnoij

On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote:

I was a bit in a hurry when I posted this message, apologies. The problem is actually a bit different. I added both urlfilter-regex.jar and nutch-0.7.1.jar to my classpath. When I run java org.apache.nutch.net.RegexURLFilter,

On 10/5/05, Thomas Delnoij [EMAIL PROTECTED] wrote:

All. I want to run the RegexURLFilter's main() method for testing the regex-urlfilter.txt. I set up NUTCH_HOME and NUTCH_CONF_DIR, so I think I set up my environment correctly. When I run nutch org.apache.nutch.net.RegexURLFilter I get:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/nutch/net/RegexURLFilter

Assuming this was a classpath issue, I added NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar to my classpath. This did not solve the problem, as I am still getting the NoClassDefFoundError. So my first question is how to set up my environment correctly for testing the regex-urlfilter.

Secondly, I want to tune my regex-urlfilter for maximum relevancy of the crawl results. By now, I have around 50 entries. My second question is if I can expect any performance impact?

Your help is greatly appreciated. Kind regards, Thomas Delnoij.
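One placement that should work is near the bottom of bin/nutch, after the script has finished assembling its own CLASSPATH and just before it launches java (the exact spot varies by version, so treat the location as an assumption). The filter's main() reads URLs from stdin and echoes each back prefixed with + (accepted) or - (rejected), so a quick test afterwards looks like:

# added near the end of bin/nutch, before the java invocation:
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/plugins/urlfilter-regex/urlfilter-regex.jar

# then, from the nutch directory:
$ echo "http://www.example.com/" | bin/nutch org.apache.nutch.net.RegexURLFilter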
Has anyone gotten the date query to function properly?
If people have gotten the date query to work properly, it would be great to know the steps they used to get it working. I added the following property entry to my nutch-site.xml file and used the search phrase: url:http date:19000101-20051231 (which returned zero results).

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
  <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
</property>

Thanks, Bryan
Re: Using FetchListEntry -dumpurls
Thanks for the tip. It turns out that the command worked fine when I replaced bin/nutch net.nutch... with bin/nutch org.apache.nutch... Accordingly, the correct command call is:

bin/nutch org.apache.nutch.pagedb.FetchListEntry -dumpurls $s1 foo.txt

Thanks, Bryan

On 11/13/05, Piotr Kosiorowski [EMAIL PROTECTED] wrote:

Hi, I think this is the reason:

Exception in thread "main" java.lang.NoClassDefFoundError: net/nutch/pagedb/FetchListEntry

In the 0.7 branch all classes were moved to the org.apache.nutch package structure and the scripts were updated, so you are probably using an old script with a new release.

Regards, Piotr

Bryan Woliner wrote:

Hi, I am trying to dump all of the URLS from a segment to a text file. I was able to do this successfully under Nutch 0.6 but am not able to do so under 0.7.1. Please take a look at the lines below and let me know if you can figure out why I'm getting an error. Perhaps it is due to a change from version 0.6 to 0.7.1, or maybe I just have the wrong syntax. Note: the segments/20051107233629 directory is a valid segments directory, as is evidenced by the ls statement below.

-bash-2.05b$ bin/nutch net.nutch.pagedb.FetchListEntry -dumpurls segments/20051107233629 foo.txt
Exception in thread "main" java.lang.NoClassDefFoundError: net/nutch/pagedb/FetchListEntry
-bash-2.05b$ ls -la segments/20051107233629
total 8
drwxr-xr-x 8 bryan bryan 1024 Nov 7 23:36 .
drwxr-xr-x 3 bryan bryan 1024 Nov 7 23:36 ..
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 content
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetcher
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetchlist
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 index
-rw-r--r-- 1 bryan bryan    0 Nov 7 23:36 index.done
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_data
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_text
A couple of questions about the date: query
OK, I believe that I correctly included the more indexing and more query plugins, which should allow searches using the date: query field. However, I am currently unable to search by date ranges. I tried the search string that Doug Cutting suggested in an e-mail to the list on 9/12/2005: "If you want to see all documents in a date range, then perhaps try something like url:http date:19000101-20051231." However, when I try the suggested search string, zero pages are returned.

I believe I have included the more plugin, as evidenced by the contents of my nutch-site.xml file and by part of the output Nutch generates during a whole-web crawl, both of which are listed below. Any suggestions are much appreciated.

On a separate, but related, note, I am unable to interpret the number that apparently represents the date a page was last modified. When I click on the (explain) link next to a search result link, there is a field like this:

lastModified = 1131904304000

What date does this represent?

Here are the contents of my nutch-site.xml file:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
  <property>
    <name>plugin.includes</name>
    <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-(basic|more)|query-(basic|site|url|more)</value>
    <description>Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. In any case you need at least include the nutch-extensionpoints plugin. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins.</description>
  </property>
</nutch-conf>

Also, here is part of the output I get when I do a whole-web crawl:

051113 232339 parsing: /usr/local/nutch-0.7.1/plugins/query-basic/plugin.xml
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
051113 232339 parsing: /usr/local/nutch-0.7.1/plugins/query-more/plugin.xml
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.TypeQueryFilter
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.more.DateQueryFilter
051113 232339 parsing: /usr/local/nutch-0.7.1/plugins/query-site/plugin.xml
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
051113 232339 parsing: /usr/local/nutch-0.7.1/plugins/query-url/plugin.xml
051113 232339 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter

Thanks for any suggestions, Bryan
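On the lastModified question: the value looks like a standard Java timestamp, i.e. milliseconds since midnight UTC on January 1, 1970 (the form java.util.Date expects). Dropping the last three digits gives seconds, which GNU date can decode directly (a quick sketch, assuming a Linux box with GNU date):

- date -u -d @1131904304

which prints Sun Nov 13 17:51:44 UTC 2005 — so that page was reported as last modified on November 13, 2005.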
Using FetchListEntry -dumpurls
Hi, I am trying to dump all of the URLs from a segment to a text file. I was able to do this successfully under Nutch 0.6 but am not able to do so under 0.7.1. Please take a look at the lines below and let me know if you can figure out why I'm getting an error. Perhaps it is due to a change from version 0.6 to 0.7.1, or maybe I just have the wrong syntax. Note: the segments/20051107233629 directory is a valid segments directory, as is evidenced by the ls output below.

-bash-2.05b$ bin/nutch net.nutch.pagedb.FetchListEntry -dumpurls segments/20051107233629 foo.txt
Exception in thread main java.lang.NoClassDefFoundError: net/nutch/pagedb/FetchListEntry
-bash-2.05b$ ls -la segments/20051107233629
total 8
drwxr-xr-x 8 bryan bryan 1024 Nov 7 23:36 .
drwxr-xr-x 3 bryan bryan 1024 Nov 7 23:36 ..
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 content
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetcher
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 fetchlist
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 index
-rw-r--r-- 1 bryan bryan 0 Nov 7 23:36 index.done
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_data
drwxr-xr-x 2 bryan bryan 1024 Nov 7 23:36 parse_text
Re: Collections.
The regular expressions that you use in your regex-urlfilter.txt file allow you to specify that Nutch should only crawl certain parts of a domain. For example, you could limit your crawl to URLs that start with news.domain.com or www.domain.com/news. If you search the mailing list archive or the Nutch wiki you should be able to find more info on what type of regular expressions the regex-urlfilter.txt file uses. -Bryan

On 10/25/05, XIN LING [EMAIL PROTECTED] wrote:

No, what I mean is a set of URLs in a collection. For example, a finance web site might divide its web pages into 2 collections, news and analysis. This way, if I am only interested in news, I can refine my search to that collection, without bothering with the analysis part. I know other search engines can do this: google, htdig, etc. Thanks.

-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, October 25, 2005 1:38 PM
To: nutch-user@lucene.apache.org
Subject: Re: Collections.

What do you mean by collections? java.lang.Collections?

On 25.10.2005, at 20:27, XIN LING wrote:

Hi, does anyone know if Nutch supports collections? How do you set up collections in Nutch? Thanks.

---
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
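To make that concrete, here is a minimal regex-urlfilter.txt sketch that restricts a crawl to the news areas of a site (the host names are hypothetical; the file is evaluated top to bottom, a leading + accepts a matching URL and a leading - rejects it):

  # accept only the news sections
  +^http://news\.domain\.com/
  +^http://www\.domain\.com/news
  # reject everything else
  -.

Anything the + lines don't match falls through to the final -. and is skipped, which effectively gives you a single-collection crawl.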
Where are indexes stored and where to store indexes
I know that this is a really basic question, but once you index segment(s), where is the index stored?

On a related note, I read in numerous emails to the list that you can search more than one index at the same time if they are in the same location when you start Tomcat. Where is the correct location (or type of location) to store these indexes? Based on the fact that you need to create new db and segment directories each time you do a crawl, it seems like you would have to move indexes after they are created if you want multiple indexes in the same location. Thanks for the help, Bryan
Re: Where are indexes stored and where to store indexes
An update to my question:

1. I found where the index is located, so never mind on that one.

2. In terms of using bin/nutch merge: the wiki indicates that the correct syntax is something like this:

- bin/nutch merge index segments/*

However, that seems to suggest that all of your indexes need to be in the same segments directory in order to merge them. What is the best practice for merging indexes that are in different segment directories? Do you have to copy all of your segments to the same segments directory first? Do you need to use mergesegs before you call merge? (It doesn't seem likely that this is the case.)

Thanks, Bryan

On 8/24/05, Bryan Woliner [EMAIL PROTECTED] wrote:

I know that this is a really basic question, but once you index segment(s), where is the index stored? On a related note, I read in numerous emails to the list that you can search more than one index at the same time if they are in the same location when you start Tomcat. Where is the correct location (or type of location) to store these indexes? Based on the fact that you need to create new db and segment directories each time you do a crawl, it seems like you would have to move indexes after they are created if you want multiple indexes in the same location. Thanks for the help, Bryan
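For the archives: since the wiki syntax is just bin/nutch merge <outputIndex> <segment> <segment> ..., the segments don't obviously have to share a parent directory — the shell glob is only a convenience. A sketch of merging segments from two different crawl directories into one output index (the paths are hypothetical, and this assumes the wiki's syntax holds on 0.7.1):

- bin/nutch merge merged-index crawl-site1/segments/* crawl-site2/segments/*

As I understand it, merge works on the per-segment indexes rather than the raw segment data, so each listed segment needs to have been indexed already.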
Adding small batches of fetched URLs to a larger aggregate segment/index
Hi, I have a number of sites that I want to crawl, then merge their segments and create a single index. One of the main reasons I want to do this is that I want some of the sites in my index to be crawled on a daily basis, others on a weekly basis, etc. Each time I re-crawl a site, I want to add the fetched URLs to a single aggregate segment/index. I have a couple of questions about doing this:

1. Is it possible to use a different regex-urlfilter.txt file for each site that I am crawling? If so, how would I do this? (See the sketch below this list.)

2. If I have a very large segment that is indexed (my aggregate index) and I want to add another (much smaller) set of fetched URLs to this index, what is the best way to do this? It seems like merging the small and large segments and then re-indexing the whole thing would be very time consuming, especially if I wanted to add new small sets of fetched URLs frequently.

Thanks for any suggestions you have to offer, Bryan
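On question 1, I don't know of a built-in per-site filter switch, but one workaround (a sketch only; the file names and crawl layout are hypothetical) is to keep one filter file per site and copy the right one into place before each crawl:

- cp conf/regex-urlfilter.site1.txt conf/regex-urlfilter.txt
- bin/nutch crawl urls/site1 -dir crawl-site1 -depth 3

Clumsy, but it keeps each site's include/exclude rules in a separate file that can be maintained independently.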
Two Questions: Refetching and searching the archive of this list
Two questions:

1. Is there a way to search all archived messages from this mailing list?

2. Is there a way to configure the fetcher to refetch only those pages that either: (i) didn't exist during the last fetch; or (ii) have been modified since the last fetch? I know people have asked questions similar to this one before (hence my first question), but I couldn't find the relevant thread(s).

Thanks, Bryan
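On question 2, only a partial answer, and from memory, so treat the details as assumptions to verify against nutch-default.xml: Nutch's fetch scheduling is driven by a per-page refetch interval rather than by Last-Modified comparisons, and the default interval can be overridden in nutch-site.xml, e.g. to refetch weekly:

  <property>
    <name>db.default.fetch.interval</name>
    <value>7</value>
    <description>Default number of days between refetches of a page (the stock default is 30).</description>
  </property>

Newly discovered pages get picked up on the next generate/fetch cycle anyway, which covers case (i); true if-modified-since refetching, case (ii), I haven't seen a switch for.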