Re: How to run a complete crawl?
@Dennis: Thanks for clarifying the difference between deep indexing and whole-web crawling. I think I have the text document with the URL in the urlDir all right. I have been able to run a crawl, but it only fetches some 50 documents.

@Paul: .htaccess file, Options +Indexes, IndexOptions +SuppressColumnSorting? Yes, I am using Apache (and I have to apologize for not mentioning that I am using Nutch 0.9). However, this looks a bit scary to me; I don't have experience with programming in Java and the like. I already felt quite clever for using a virtual machine in order to crawl my local file system.

In the same spirit, I have found a workaround. I split my 2,500 documents into some 50 directories of 50 documents each and nested them: directory 1 contains 50 documents plus directory 2, directory 2 contains 50 documents plus directory 3, and so on. Not the most beautiful solution, but it fits my purpose (running a test to compare two search engines) for the moment. This way I have been able to index some 2,100 documents. I could still figure out why it stopped there, but for the moment I am satisfied.
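Paul's .htaccess suggestion refers to serving the documents through Apache's auto-generated directory listings, so that Nutch sees every file in a directory as a link it can follow. A minimal sketch of such a .htaccess file (assuming your Apache config has AllowOverride set to permit these directives):

```apache
# Enable auto-generated directory listings so a crawler can follow
# every file in the directory as a link.
Options +Indexes
# Keep the listing simple: no column-sorting links for the crawler to chase.
IndexOptions +SuppressColumnSorting
```

With this in place there is no need for the nested-directory workaround; each directory listing page simply exposes all of its files as outlinks.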
Re: Nutch Enterprise
Dennis Kubes wrote:
Depending on what you want to do, Solr may be a better choice as an Enterprise search server. If you need crawling, you can use Nutch or attach a different crawler to Solr. If you want to do more full-web search, then Nutch is a better option. What are your requirements? Dennis

fredericoagent wrote:
Does anybody have any information on using Nutch as Enterprise search, and what would I need? Is it just a case of the current Nutch package, or do you need other add-ons? And how does that compare against Google Enterprise? Thanks

I agree with Dennis: use Nutch if you need to do larger-scale discovery, such as when you crawl the web, but if you already know all target pages in advance then Solr will be a much better (and much easier to handle) platform.

--
Best regards,
Andrzej Bialecki
___. ___ ___ ___ _ _ __
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
nutch for many pages
Hi all! I'm a beginner with crawlers. I want to use Nutch as a system for crawling ~500 online sites. Can I somehow configure Nutch so that it reads targets from a database or some other source? Is Nutch the right software for this kind of job? I was hoping to use Nutch because of Solr and Lucene. Please recommend an alternative if you know some other crawler-like software that can be used for this kind of task. Have a nice day, and thank you for your answers and for making this great software! - Oto
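Nutch itself reads seed URLs from plain-text files in a seed directory rather than from a database, so one common workaround is to export the targets into that file before each crawl. A minimal sketch (the `sites` table, the in-memory database, and the file paths are hypothetical stand-ins for your own setup):

```python
# Export seed URLs from a database into a Nutch seed directory.
# Here an in-memory SQLite table stands in for the real target list.
import os
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (url TEXT)")
conn.executemany(
    "INSERT INTO sites VALUES (?)",
    [("http://example.org/",), ("http://example.com/",)],
)

# Nutch expects one URL per line in a text file inside the seed directory.
os.makedirs("urls", exist_ok=True)
with open("urls/seed.txt", "w") as f:
    for (url,) in conn.execute("SELECT url FROM sites"):
        f.write(url + "\n")

# Then crawl as usual, e.g.:
#   bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
```

Rerunning the export before each crawl keeps the seed list in sync with the database, at the cost of an extra step outside Nutch itself.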
Re: ERROR datanode.DataNode - DatanodeRegistration ... BlockAlreadyExistsException
Jesse Hires wrote:
Does anyone have any insight into the following error I am seeing in the hadoop logs? Is this something I should be concerned with, or is it expected that this shows up in the logs from time to time? If it is not expected, where can I look for more information on what is going on?

2009-10-16 17:02:43,061 ERROR datanode.DataNode - DatanodeRegistration(192.168.1.7:50010, storageID=DS-1226842861-192.168.1.7-50010-1254609174303, infoPort=50075, ipcPort=50020):DataXceiver org.apache.hadoop.hdfs.server.datanode.BlockAlreadyExistsException: Block blk_90983736382565_3277 is valid, and cannot be written to.

Are you sure you are running a single datanode process per machine?

--
Best regards,
Andrzej Bialecki
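A quick way to answer that question is to count DataNode JVMs on the host; more than one suggests a duplicate process was started. A sketch using ps (the JDK's jps tool works just as well):

```shell
# Count running DataNode processes on this host.
# The [D] trick keeps the grep process itself out of the match.
ps aux | grep '[D]ataNode' | wc -l
```

A count above 1 would explain two datanodes contending for the same block directory.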
Re: How to run a complete crawl?
Vincent155 wrote:
I have a virtual machine running (VMware 1.0.7). Both host and guest run Fedora 10. In the virtual machine, I have Nutch installed. I can index directories on my host as if they were websites. Now I want to compare Nutch with another search engine. For that, I want to index some 2,500 files in a directory. But when I execute a command like crawl urls -dir crawl.test -depth 3 -topN 2500, or leave out the -topN option, still only some 50 to 75 files are indexed.

Check in your nutch-site.xml what the value of db.max.outlinks.per.page is; the default is 100. When crawling filesystems, each file in a directory is treated as an outlink, and this limit is then applied.

--
Best regards,
Andrzej Bialecki
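A sketch of the corresponding override in nutch-site.xml (a value of -1 removes the limit; alternatively, set any number larger than your biggest directory):

```xml
<!-- nutch-site.xml: raise the per-page outlink limit. The default of 100
     caps how many files per directory get queued when crawling a
     filesystem; -1 means unlimited. -->
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
```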