Re: How to run a complete crawl?

2009-10-17 Thread Andrzej Bialecki

Vincent155 wrote:

I have a virtual machine running (VMware 1.0.7). Both host and guest run on
Fedora 10. In the virtual machine, I have Nutch installed. I can index
directories on my host as if they are websites.

Now I want to compare Nutch with another search engine. For that, I want to
index some 2,500 files in a directory. But when I execute a command like
"crawl urls -dir crawl.test -depth 3 -topN 2500", or leave out the topN
parameter, only some 50 to 75 files are indexed.


Check the value of db.max.outlinks.per.page in your nutch-site.xml; the 
default is 100. When crawling filesystems, each file in a directory is 
treated as an outlink, and this limit is then applied.
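
For example (untested, adjust the value to your situation), an override 
along these lines in conf/nutch-site.xml should raise the limit; a value 
of -1 removes it entirely:

<property>
  <name>db.max.outlinks.per.page</name>
  <value>5000</value>
  <description>Maximum number of outlinks processed per page (-1 = no limit).</description>
</property>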


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to run a complete crawl?

2009-10-17 Thread Vincent155

@ Dennis: Thanks for clarifying the difference between deep indexing and
whole-web crawling. I think I have the text file with the URL in the
urlDir set up correctly. I have been able to run a crawl, but it only fetches
some 50 documents.

@ Paul: .htaccess file, Options +Indexes, IndexOptions
+SuppressColumnSorting? Yes, I am using Apache (and I have to apologize for
not mentioning that I am using Nutch 0.9). However, this looks a bit scary
to me - I don't have experience with programming in Java and the like. I
already considered myself very clever for using a virtual machine to crawl
my local file system.

In the meantime I have found a workaround. I have split my 2,500 documents
into about 50 directories of 50 documents each and nested them: directory 1
contains 50 documents plus directory 2, directory 2 contains 50 documents
plus directory 3, and so on. Not the most elegant solution, but it serves my
purpose (running a test to compare two search engines) for the moment.

This way, I have been able to index some 2,100 documents. I could still
figure out why it stopped there, but for the moment, I am satisfied.



Re: How to run a complete crawl?

2009-10-16 Thread Paul Tomblin
On Fri, Oct 16, 2009 at 10:19 AM, Dennis Kubes wrote:
> Because you are crawling local files, you would either need the URLs in the
> initial urlDir text file, or the documents you are crawling would need to
> point to the other URLs.
>

Another way to do this is to put the following in the document
directory's .htaccess file, and point the urlDir text file to the
directory itself:

Options +Indexes
IndexOptions +SuppressColumnSorting

(Assuming you're using Apache, and assuming that the site
configuration doesn't have "AllowOverride None" set.)
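
Then the seed text file in your urlDir only needs the single URL of that
directory listing, e.g. (hypothetical host and path):

http://yourhost/docs/

Nutch fetches the generated index page like any other HTML page and follows
each listed file as an outlink.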




-- 
http://www.linkedin.com/in/paultomblin


Re: How to run a complete crawl?

2009-10-16 Thread Dennis Kubes
Whole web crawling is about indexing the entire web versus deep indexing 
of a single site.


The urls parameter is the urlDir, a directory which should hold one or 
more text files with listings of urls to be fetched.  The dir parameter 
is the output directory for the crawls.
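
For example (hypothetical paths, assuming the documents are served over HTTP
at http://yourhost/docs/), a minimal setup would look like:

mkdir urls
echo "http://yourhost/docs/" > urls/seed.txt
bin/nutch crawl urls -dir crawl.test -depth 3 -topN 2500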


Because you are crawling local files, you would either need the URLs in 
the initial urlDir text file, or the documents you are crawling would 
need to point to the other URLs.


Is that how you have it set up?

Dennis


Vincent155 wrote:

I have a virtual machine running (VMware 1.0.7). Both host and guest run on
Fedora 10. In the virtual machine, I have Nutch installed. I can index
directories on my host as if they are websites.

Now I want to compare Nutch with another search engine. For that, I want to
index some 2,500 files in a directory. But when I execute a command like
"crawl urls -dir crawl.test -depth 3 -topN 2500", or leave out the topN
parameter, only some 50 to 75 files are indexed.

I have searched for more information, but it is not clear to me if (for
example) "whole web crawling" is about a complete crawl of one site, or
about crawling and updating groups of sites. My question is about a complete
crawl (indexing) of one "site".


How to run a complete crawl?

2009-10-15 Thread Vincent155

I have a virtual machine running (VMware 1.0.7). Both host and guest run on
Fedora 10. In the virtual machine, I have Nutch installed. I can index
directories on my host as if they are websites.

Now I want to compare Nutch with another search engine. For that, I want to
index some 2,500 files in a directory. But when I execute a command like
"crawl urls -dir crawl.test -depth 3 -topN 2500", or leave out the topN
parameter, only some 50 to 75 files are indexed.

I have searched for more information, but it is not clear to me if (for
example) "whole web crawling" is about a complete crawl of one site, or
about crawling and updating groups of sites. My question is about a complete
crawl (indexing) of one "site".