problem with skipped URLs
Hi, I'm trying to run Nutch at our university hospital computing center, and I have a small problem. We have a few intranet servers, and I want Nutch to skip a few directories, for example: http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/

I put these URLs in crawl-urlfilter.txt, for example:

-^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus

But nothing happens; Nutch doesn't skip these URLs, and I don't know why. :( Can anyone help me? I'm crawling with this command:

bin/nutch crawl urls -dir crawl060621 -depth 15 crawl060621.log

I'm using release 0.7.1.

Greets,
David

==
David Wojciechowski
Universitätsklinikum Freiburg
Klinikrechenzentrum
Agnesenstrasse 6-8
D-79106 Freiburg
Telefon: 0761 / 270 - 1842
Fax: 0761 / 270 - 2276
E-Mail: [EMAIL PROTECTED]
==
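Before re-running a long crawl, it can help to sanity-check the crawl-urlfilter.txt pattern against sample URLs. A minimal sketch in Python (the real filter uses Java regexes, but this particular pattern behaves the same in both; the sample URLs are made up):

```python
import re

# David's pattern from crawl-urlfilter.txt, minus the leading "-" exclude
# marker. Note the unescaped dots in the hostname: they match any character,
# which is usually harmless here but worth knowing about.
pattern = re.compile(r"^http://([a-z0-9]*\.)*sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus")

urls = [
    "http://sapdoku.ukl.uni-freiburg.de/abteilung/pvs/dokus/index.html",
    "http://sapdoku.ukl.uni-freiburg.de/abteilung/other/page.html",
]

for url in urls:
    verdict = "EXCLUDED" if pattern.match(url) else "kept"
    print(url, "->", verdict)
```

If the pattern matches as expected here, the filter rule itself is fine, and the problem is more likely rule ordering or plugin loading: Nutch's regex URL filter applies rules top to bottom and the first match wins, so a catch-all `+.` accept rule above the `-` rule will swallow it.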
Re: stemming
Thanks!

to Jerome:
> Check that these words are not in the stopword list of your analyzer.

Those words aren't in the stopword list. It couldn't find them at all. When I disable stemming (the index is the same), it can find those words (of course it finds only the exact form of the words present in the queries).

> No: it only highlights the main form in the summaries. It is a known problem.

For me the main issue isn't the highlighting. Only documents with the main form of the words appear in the summaries. For example, without stemming I find about 450 documents with the different forms of the word (counting only simple forms, where only one or two letters change at the end of the root of the word). With stemming enabled, it finds about 120 documents, which contain only the main forms of the word.

About analysis-xx: should I make any changes to the trunk version or not (I mean in the code, as described on the MultiLingual support page in the wiki)?

---
Regards,
Alexey
Re: problem with skipped URLs
You can also stop Nutch from crawling those pages by modifying robots.txt, if you have set Nutch to respect those rules; by default it will. If you haven't modified the http.robots.agents setting in nutch-default.xml/nutch-site.xml, the following robots.txt rule should work:

User-agent: NutchCVS
Disallow: /abteilung/pvs/dokus/

Cheers,
Jayant

On 6/21/06, Stefan Neufeind [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> [original question about skipping /abteilung/pvs/dokus/ via crawl-urlfilter.txt trimmed]
>
> Hi David,
>
> do you have regex-urlfilter in your crawler-site config file or nutch-site config file? I suspect that the plugin might not be loaded yet. Also, do you perhaps have another "allow all URLs" line above the one you mentioned? I don't think the ([a-z0-9]*\.)* should cause problems (it is * and not +, so that should be fine). But if your URL does not have anything in front of sapdoku, maybe try dropping that part.
>
> Good luck,
> Stefan

--
www.jkg.in | http://www.jkg.in/contact-me/
Jayant Kr. Gandhi | +91-9871412929
M.Tech. Computer Tech. Class of 2007,
D-38, Aravali Hostel, IIT Delhi,
Hauz Khas, Delhi-110016
Re: Does Nutch allow an advanced search?
The index-more plugin indexes each document's last-modified date, which is searchable via a range like:

date:20060521-20060621

Note that a date search does not work by itself; at least one keyword or phrase is required.

Scott

John john wrote:
> Hello, I'm new to the Nutch world and I'm wondering whether it's possible to search with a date range, or to specify a date and have Nutch retrieve pages updated after this date? Thanks
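Conceptually, the date:YYYYMMDD-YYYYMMDD syntax acts as a range filter over the last-modified field stored by the index-more plugin. A toy sketch of what such a filter does (the document data is made up; this is not Nutch code):

```python
# Hypothetical documents with last-modified dates in YYYYMMDD form,
# mimicking what the index-more plugin stores per document.
docs = [
    {"url": "http://example.org/a", "date": "20060530"},
    {"url": "http://example.org/b", "date": "20060110"},
]

def in_range(doc, lo="20060521", hi="20060621"):
    # Fixed-width YYYYMMDD strings compare correctly as plain strings,
    # which is why this date format is convenient for range filtering.
    return lo <= doc["date"] <= hi

hits = [d["url"] for d in docs if in_range(d)]
print(hits)
```

One plausible reading of the keyword requirement discussed below is that the date range behaves as a pure filter clause rather than a scored query, so something else has to produce the result set to filter; whether that limitation could be lifted is exactly Stefan's question.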
Re: Does Nutch allow an advanced search?
Scott McCammon wrote:
> The index-more plugin indexes each document's last-modified date, which is searchable via a range like: date:20060521-20060621. Note that a date search does not work by itself; at least one keyword or phrase is required.

Hi Scott,

requiring a keyword/phrase has been mentioned in several places before. Is there a technical reason for it, or could that limitation perhaps be removed (and should we file a JIRA issue for that)?

Regards,
Stefan

John john wrote:
> Hello, I'm new to the Nutch world and I'm wondering whether it's possible to search with a date range, or to specify a date and have Nutch retrieve pages updated after this date? Thanks
NEWBIE help: java.lang.IllegalAccessError
Hi folks. I installed Nutch in the past days, and I kept getting blank results pages at http://localhost:8080/search.jsp. Tomcat was properly installed, I think, so I looked further and found that every time I run a crawl, I get this toward the end:

060621 125637 indexing segment: crawl.test/segments/20060620163405
Exception in thread "main" java.lang.IllegalAccessError: tried to access field org.apache.lucene.index.IndexWriter.mergeFactor from class org.apache.nutch.indexer.IndexSegment
        at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:102)
        at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:263)

It's crawling OK, but not indexing the segment. I copied the lucene-core-2.0.0.jar file to the current directory from which I'm running bin/nutch (just in case it wasn't finding the lucene.index.IndexWriter stuff), but no luck. Has anyone seen this before, or have any ideas? I hope so.

Cheers and thanks,
Mike
Re: NEWBIE help: java.lang.IllegalAccessError
Mike Blackstock wrote:
> Exception in thread "main" java.lang.IllegalAccessError: tried to access field org.apache.lucene.index.IndexWriter.mergeFactor from class org.apache.nutch.indexer.IndexSegment

I'm using Nutch 0.7.2; I spent the better part of yesterday searching the net for possible answers before subscribing here, but no luck.

Cheers,
Mike
using a test web site
Hi, I am trying to develop a web crawler for my master's thesis. I need a dummy test site to test my crawler. If I had the tree structure of the test web site, I could compare the list of pages with the output of my crawler after running it. Can anybody help me?

Thanks,
Nildem
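One way to get such a test site is to generate it: build a tree of HTML pages where each page links to its children, and keep the list of created pages as the expected crawl set to diff against the crawler's output. A small sketch (directory name, depth, and fanout are arbitrary choices):

```python
import os

def make_site(root, depth=2, fanout=2):
    """Create a tree of HTML pages where each page links to its children.
    Returns the sorted list of page paths, which doubles as the expected
    set of URLs a complete crawl should discover."""
    pages = []

    def build(dirpath, name, level):
        os.makedirs(dirpath, exist_ok=True)
        path = os.path.join(dirpath, name + ".html")
        children = []
        if level < depth:
            subdir = os.path.join(dirpath, name)
            children = [build(subdir, "page%d" % i, level + 1)
                        for i in range(fanout)]
        # Link each page to its children with relative hrefs.
        links = "".join('<a href="%s">link</a>' % os.path.relpath(c, dirpath)
                        for c in children)
        with open(path, "w") as f:
            f.write("<html><body>%s</body></html>" % links)
        pages.append(path)
        return path

    build(root, "index", 0)
    return sorted(pages)

expected = make_site("testsite")
print(len(expected))  # 1 + 2 + 4 = 7 pages for depth=2, fanout=2
```

After crawling the generated tree (served via any local web server), comparing the crawler's fetched URL list against `expected` shows exactly which pages were missed.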
Re: stemming
to Jerome:
> Check that these words are not in the stopword list of your analyzer.

Actually, I could not find the stopwords file. Could you help me with this? I am sure that words such as mission, sea, ocean, building, electricity, etc. couldn't be in a stopwords file. (In my previous question I meant the Carrot stopword file, because I can't find Lucene's stopword files.)

---
Regards,
Alexey
Re: stemming
> When I disable stemming (the index is the same), it can find those words (of course it finds only the exact form of the words present in the queries).

Just a silly question: do you build your index with the analyzers turned on? (Was the document's language correctly guessed and the corresponding analyzer called?)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: Re: stemming
> Actually, I could not find the stopwords file. Could you help me with this?

If you have simply wrapped a Lucene analyzer (like the fr and de analyzers), the default stop word list is inside the analyzer code (take a look at the analyzer source).

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
RE: stemming
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Actually, I could not find the stopwords file. Could you help me with this? I am sure that words such as mission, sea, ocean, building, electricity, etc. couldn't be in a stopwords file. (In my previous question I meant the Carrot stopword file, because I can't find Lucene's stopword files.)

The current implementations of the language analyzers use the default constructors of the analyzers of the same name in the Lucene package. When instantiated this way, the analyzers use the hard-coded stop word lists. For German, the stop words are:

private String[] GERMAN_STOP_WORDS = {
    "einer", "eine", "eines", "einem", "einen",
    "der", "die", "das", "dass", "daß",
    "du", "er", "sie", "es",
    "was", "wer", "wie", "wir",
    "und", "oder", "ohne", "mit",
    "am", "im", "in", "aus", "auf",
    "ist", "sein", "war", "wird",
    "ihr", "ihre", "ihres",
    "als", "für", "von", "mit",
    "dich", "dir", "mich", "mir",
    "mein", "sein", "kein",
    "durch", "wegen", "wird"
};
// From src/java/org/apache/lucene/analysis/de/GermanAnalyzer.java
// of the Lucene 1.4.3 distribution. This could be slightly out of date.

You'd have to modify the source code in src/plugin/analysis-de/src/java/org/apache/nutch/analysis/de/GermanAnalyzer.java to use the constructor that takes the word list, or the file name of the word list, I think.

> When I use the trunk version, should I change some code as shown on the MultiLingual support page in the wiki? Because, as I understand it, everything in the trunk version has been done so that stemming plugins integrate without code changes.

I believe Jérôme has implemented these code changes in the trunk.

-kuro
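The distinction kuro draws, a hard-coded default stop list versus a caller-supplied one, can be illustrated with a toy analyzer. This is not Lucene code, just the idea (the word lists here are abbreviated and illustrative):

```python
# Abbreviated stand-in for a hard-coded default stop word list, analogous
# to GERMAN_STOP_WORDS above. Not the actual Lucene list.
DEFAULT_STOPS = {"der", "die", "das", "und", "oder", "mit", "von"}

def analyze(text, stop_words=DEFAULT_STOPS):
    """Lowercase, tokenize on whitespace, and drop stop words -- the gist
    of what an analyzer's stop filter does. Passing a custom set replaces
    the baked-in default, like using the non-default constructor."""
    return [t for t in text.lower().split() if t not in stop_words]

print(analyze("die See und der Ozean"))                  # default list applies
print(analyze("die See und der Ozean", stop_words={"see"}))  # custom list
```

With the default list, "die", "und", and "der" vanish; with the custom list, only "see" does. This is why Alexey's content words (sea, ocean, ...) disappearing cannot be blamed on any sensible stop list, and why the stemming path is the more likely culprit.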
Re: Add Wyona to the wiki support page?
Renaud Richardet wrote:
> Hello Nutch, my name is Renaud Richardet and I am the COO of Wyona LLC. We are offering Nutch and Lucene support (http://wyona.com/lucene.html), and I was wondering if I could add our company to http://wiki.apache.org/nutch/Support. That would be great.

Certainly, you can add a short note about your company on the support page. It's a wiki, so you can just create an account, log in, and edit this page (please use the preview button to check the changes before saving).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
Re: Add Wyona to the wiki support page?
The funny thing about that wiki page (and some others in that area) is that they apparently use the nofollow tags. Given the topic of that wiki, isn't that a bit odd?

I personally dislike the nofollow tag and think it should be used only in extreme circumstances (i.e., "here's a link to a site you absolutely don't want to visit"). I believe in this case, however, it's simply being used so that listed sites don't get any pagerank/weight/whatever passed to them from an authority site. A really bizarre policy for a search-related site, IMO.

Swinging back on topic: does Nutch obey the nofollow tags?

g.

Andrzej Bialecki wrote:
> Renaud Richardet wrote:
>> [request to add Wyona to http://wiki.apache.org/nutch/Support trimmed]
>
> Certainly, you can add a short note about your company on the support page. It's a wiki, so you can just create an account, log in, and edit this page (please use the preview button to check the changes before saving).
Re: Add Wyona to the wiki support page?
Insurance Squared Inc. wrote:
> The funny thing about that wiki page (and some others in that area) is that they apparently use the nofollow tags. Given the topic of that wiki, isn't that a bit odd? I personally dislike the nofollow tag and think it should be used only in extreme circumstances. I believe in this case, however, it's simply being used so that listed sites don't get any pagerank/weight/whatever passed to them from an authority site. A really bizarre policy for a search-related site, IMO.

I think it's a default setting for the wiki, which nobody bothered to change...

> Swinging back on topic: does Nutch obey the nofollow tags?

Yes. Please see the HtmlParser and HTMLMetaTags classes for details.

--
Best regards,
Andrzej Bialecki
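The two nofollow mechanisms a parser has to honor are the page-level <meta name="robots" content="nofollow"> tag and the per-link rel="nofollow" attribute. A minimal sketch of that logic using Python's standard html.parser (this is an illustration of the behavior, not Nutch's HtmlParser code):

```python
from html.parser import HTMLParser

class NofollowChecker(HTMLParser):
    """Collects outlinks, skipping those a polite crawler would drop:
    every link when a page-level robots nofollow meta tag is present,
    and individual links carrying rel="nofollow"."""
    def __init__(self):
        super().__init__()
        self.page_nofollow = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            if "nofollow" in (a.get("content") or "").lower():
                self.page_nofollow = True
        elif tag == "a" and "href" in a:
            if "nofollow" not in (a.get("rel") or "").lower():
                self.links.append(a["href"])

    def outlinks(self):
        # Page-level nofollow suppresses all outlinks, wherever the
        # meta tag appeared in the document.
        return [] if self.page_nofollow else self.links

page = '<a href="/keep">k</a><a rel="nofollow" href="/drop">d</a>'
p = NofollowChecker()
p.feed(page)
print(p.outlinks())  # only /keep survives
```

Checking the flag only at the end (in `outlinks()`) handles the case where the meta tag appears after some links in the document.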
Re: Add Wyona to the wiki support page?
Well, so much for knee-jerk suspicions as to intent. No need to look for conspiracy theories when default settings are the more likely cause. That should probably be a corollary to Occam's razor or something. :)

Andrzej Bialecki wrote:
> Insurance Squared Inc. wrote:
>> [discussion of nofollow on the wiki support page trimmed]
>
> I think it's a default setting for the wiki, which nobody bothered to change...
>
>> Swinging back on topic: does Nutch obey the nofollow tags?
>
> Yes. Please see the HtmlParser and HTMLMetaTags classes for details.