Re: How to run a complete crawl?
Vincent155 wrote:
> I have a virtual machine running (VMware 1.0.7). Both host and guest run on
> Fedora 10. In the virtual machine, I have Nutch installed. I can index
> directories on my host as if they were websites. Now I want to compare Nutch
> with another search engine. For that, I want to index some 2,500 files in a
> directory. But when I execute a command like "crawl urls -dir crawl.test
> -depth 3 -topN 2500", or leave out the topN parameter, only some 50 to 75
> files are indexed.

Check the value of db.max.outlinks.per.page in your nutch-site.xml; the default is 100. When crawling filesystems, each file in a directory is treated as an outlink, and this limit is then applied.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
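For reference, that limit can be raised (or disabled entirely with -1) by overriding the property in conf/nutch-site.xml. A minimal override might look like this; the value 5000 is just an assumption large enough for a 2,500-file directory:

```xml
<property>
  <name>db.max.outlinks.per.page</name>
  <value>5000</value>
  <description>Maximum number of outlinks processed per page;
  set to -1 for no limit.</description>
</property>
```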
Re: How to run a complete crawl?
@Dennis: Thanks for clarifying the difference between deep indexing and whole-web crawling. I think I have the text document with the URLs in the urlDir all right. I have been able to run a crawl, but it only fetches some 50 documents.

@Paul: .htaccess file, Options +Indexes, IndexOptions +SuppressColumnSorting? Yes, I am using Apache (and I have to apologize for not mentioning that I am using Nutch 0.9). However, this looks a bit intimidating to me; I don't have experience with programming in Java and the like. I already felt very clever for using a virtual machine to crawl my local file system.

In the same spirit, I have found a workaround: I split my 2,500 documents into some 50 directories of 50 documents each, nested inside one another. Directory 1 contains 50 documents plus directory 2, directory 2 contains 50 documents plus directory 3, and so on. Not the most elegant solution, but it fits my purpose (running a test to compare two search engines) for the moment. This way, I have been able to index some 2,100 documents. I could try to figure out why it stopped there, but for the moment, I am satisfied.

--
View this message in context: http://www.nabble.com/How-to-run-a-complete-crawl--tp25919860p25936033.html
Sent from the Nutch - User mailing list archive at Nabble.com.
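For anyone wanting to reproduce this workaround, here is a small sketch of the nesting step. The function name, chunk size, and directory naming are illustrative assumptions, not something from the thread:

```python
import os
import shutil

def nest_documents(files, dest_root, chunk_size=50):
    """Copy `files` into a chain of nested directories, `chunk_size`
    documents per level: dir_1 holds the first 50 docs plus dir_2,
    dir_2 holds the next 50 plus dir_3, and so on.  Returns the
    deepest directory created."""
    current = dest_root
    for level, start in enumerate(range(0, len(files), chunk_size), 1):
        current = os.path.join(current, "dir_%d" % level)
        os.makedirs(current, exist_ok=True)
        for src in files[start:start + chunk_size]:
            shutil.copy(src, current)
    return current
```

With a crawl depth of 3 and the default outlink limit of 100, each fetch round can then reach one more level of 50 documents plus the link to the next directory.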
Re: How to run a complete crawl?
On Fri, Oct 16, 2009 at 10:19 AM, Dennis Kubes wrote:
> Because you are crawling the local files, you would either need the URLs in
> the initial urlDir text file, or the documents you are crawling would need
> to point to the other URLs.

Another way to do this is to put the following in the document directory's .htaccess file, and point the urlDir text file at the directory itself:

  Options +Indexes
  IndexOptions +SuppressColumnSorting

(Assuming you're using Apache, and assuming that the site configuration doesn't have "AllowOverride None" set.)

--
http://www.linkedin.com/in/paultomblin
Re: How to run a complete crawl?
Whole-web crawling is about indexing the entire web, versus deep indexing of a single site. The urls parameter is the urlDir, a directory that should hold one or more text files listing the URLs to be fetched. The dir parameter is the output directory for the crawls. Because you are crawling local files, you would either need the URLs in the initial urlDir text file, or the documents you are crawling would need to point to the other URLs. Is that how you have it set up?

Dennis

Vincent155 wrote:
> I have a virtual machine running (VMware 1.0.7). Both host and guest run on
> Fedora 10. In the virtual machine, I have Nutch installed. I can index
> directories on my host as if they were websites. Now I want to compare Nutch
> with another search engine. For that, I want to index some 2,500 files in a
> directory. But when I execute a command like "crawl urls -dir crawl.test
> -depth 3 -topN 2500", or leave out the topN parameter, only some 50 to 75
> files are indexed. I have searched for more information, but it is not clear
> to me whether (for example) "whole web crawling" is about a complete crawl of
> one site, or about crawling and updating groups of sites. My question is
> about a complete crawl (indexing) of one "site".
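Concretely, the layout Dennis describes can be set up as below. The seed-file name and the localhost URL are assumptions for illustration; the crawl command itself is the one from the original post:

```shell
# urlDir: a directory holding plain-text files, one seed URL per line
mkdir -p urls
echo "http://localhost/docs/" > urls/seed.txt

# -dir is the output directory Nutch writes the crawl data into.
# Run from the Nutch installation directory, e.g.:
#   bin/nutch crawl urls -dir crawl.test -depth 3 -topN 2500
```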
How to run a complete crawl?
I have a virtual machine running (VMware 1.0.7). Both host and guest run on Fedora 10. In the virtual machine, I have Nutch installed. I can index directories on my host as if they were websites. Now I want to compare Nutch with another search engine. For that, I want to index some 2,500 files in a directory. But when I execute a command like "crawl urls -dir crawl.test -depth 3 -topN 2500", or leave out the topN parameter, only some 50 to 75 files are indexed.

I have searched for more information, but it is not clear to me whether (for example) "whole web crawling" is about a complete crawl of one site, or about crawling and updating groups of sites. My question is about a complete crawl (indexing) of one "site".

--
View this message in context: http://www.nabble.com/How-to-run-a-complete-crawl--tp25919860p25919860.html
Sent from the Nutch - User mailing list archive at Nabble.com.