Re: How to run a complete crawl?

2009-10-17 Thread Vincent155

@ Dennis: Thanks for clarifying the difference between deep indexing and
whole-web crawling. I think I have the text document with the URL in the
urlDir all right. I have been able to run a crawl, but it only fetches some
50 documents.

@ Paul: .htaccess file, Options +Indexes, IndexOptions
+SuppressColumnSorting? Yes, I am using Apache (and I have to apologize for
not mentioning that I am using Nutch 0.9). However, this looks a bit scary
to me - I don't have experience with programming in Java and such. I
already felt quite clever for using a virtual machine in order to
crawl my local file system.
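
For reference, Paul's suggestion amounts to enabling Apache's automatic
directory listings so the crawler can follow them. A minimal sketch of such
an .htaccess in the crawled directory (assuming Apache allows overrides
there):

  # enable auto-generated directory listings (mod_autoindex)
  Options +Indexes
  # drop the column-sorting links, so the crawler does not fetch
  # the same listing again under different query strings
  IndexOptions +SuppressColumnSorting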

In the same spirit, I have found a workaround. I have split my 2,500
documents into some 50 directories of 50 documents each, and nested them
inside one another: directory 1 contains 50 documents plus directory 2,
directory 2 contains 50 documents plus directory 3, and so on. Not the most
beautiful solution, but it fits my purpose (running a test to compare two
search engines) for the moment.
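
A rough shell sketch of that layout (directory and document names are made
up here):

  # dir01 holds documents 1-50 plus dir02; dir02 holds 51-100 plus dir03, ...
  mkdir -p dir01/dir02/dir03    # and so on, down to dir50
  # with this nesting, each extra -depth level of the crawl
  # reaches one more batch of 50 documents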

This way, I have been able to index some 2,100 documents. I could try to
figure out why it stopped there, but for the moment, I am satisfied.



Re: Nutch Enterprise

2009-10-17 Thread Andrzej Bialecki

Dennis Kubes wrote:
Depending on what you are wanting to do, Solr may be a better choice as
an Enterprise search server. If you need crawling, you can use
Nutch or attach a different crawler to Solr. If you want to do
more full-web-type search, then Nutch is a better option. What are your
requirements?

Dennis

fredericoagent wrote:
Does anybody have any information on using Nutch as an Enterprise search,
and what would I need? Is it just a case of the current Nutch package, or
do you need other add-ons?

And how does that compare against Google Enterprise?

thanks


I agree with Dennis - use Nutch if you need to do larger-scale
discovery, such as when you crawl the web; but if you already know all
the target pages in advance, then Solr will be a much better (and much
easier to handle) platform.
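
To illustrate the second case: when the documents are already known, they
can be posted straight into Solr's update handler, with no crawler
involved. A minimal sketch (the Solr URL and field names are assumptions,
not taken from this thread):

  curl 'http://localhost:8983/solr/update?commit=true' \
    -H 'Content-Type: text/xml' --data-binary \
    '<add><doc>
       <field name="id">doc-1</field>
       <field name="url">http://example.com/page1</field>
       <field name="content">page text goes here</field>
     </doc></add>'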




--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



nutch for many pages

2009-10-17 Thread Oto Brglez

Hi you all!

I'm a beginner with crawlers. I want to use Nutch as a system for crawling
~500 online sites.


Can I somehow configure Nutch so that it can read targets from a
database or some other source?
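
Out of the box, Nutch 0.9 reads its seed URLs from flat text files in a
seed directory, so one simple pattern is to export the targets from the
database into such a file before each crawl. A rough sketch, assuming a
MySQL database mydb with a table sites holding a url column (all three
names are made up):

  # export seed URLs from the database into Nutch's seed directory
  mysql -N -e "SELECT url FROM sites" mydb > urls/seeds.txt
  # then crawl as usual
  bin/nutch crawl urls -dir crawl -depth 2 -topN 1000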


Is Nutch the right software for this kind of job? I was hoping to use Nutch
because of Solr and Lucene. Please recommend an alternative if you know
some other crawler-like software that can be used for this kind of
task.


Have a nice day, thank you for the answers and for making this great software!

- Oto


Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException

2009-10-17 Thread Andrzej Bialecki

Jesse Hires wrote:

Does anyone have any insight into the following error I am seeing in the
hadoop logs? Is this something I should be concerned with, or is it expected
that this shows up in the logs from time to time? If it is not expected,
where can I look for more information on what is going on?

2009-10-16 17:02:43,061 ERROR datanode.DataNode -
DatanodeRegistration(192.168.1.7:50010,
storageID=DS-1226842861-192.168.1.7-50010-1254609174303,
infoPort=50075, ipcPort=50020):DataXceiver
org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException:
Block blk_90983736382565_3277 is valid, and cannot be written to.


Are you sure you are running a single datanode process per machine?
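
A quick way to check (assuming a standard Hadoop setup with the JDK's jps
tool on the PATH):

  # list Java processes on the node; exactly one DataNode should appear
  jps | grep DataNode
  # or, without jps (the brackets keep grep from matching itself):
  ps aux | grep '[D]ataNode'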


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: How to run a complete crawl?

2009-10-17 Thread Andrzej Bialecki

Vincent155 wrote:

I have a virtual machine running (VMware 1.0.7). Both host and guest run
Fedora 10. In the virtual machine, I have Nutch installed. I can index
directories on my host as if they were websites.

Now I want to compare Nutch with another search engine. For that, I want to
index some 2,500 files in a directory. But when I execute a command like
crawl urls -dir crawl.test -depth 3 -topN 2500, or leave out the
topN option, still only some 50 to 75 files get indexed.


Check in your nutch-site.xml what the value of
db.max.outlinks.per.page is; the default is 100. When crawling filesystems,
each file in a directory listing is treated as an outlink, and this limit
is then applied.
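
If that is the cause, overriding the property in conf/nutch-site.xml should
let the whole directory through. A sketch (the value 5000 here is an
arbitrary illustration, not a recommendation from this thread):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>5000</value>
    <description>Maximum number of outlinks processed per page;
    raised from the default 100 so that a listing of ~2,500 files
    is followed completely.</description>
  </property>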


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com