Problem crawling a Windows shared folder using Nutch's SMB protocol plugin
Hi,

I downloaded the SMB protocol plugin from the following location: http://issues.apache.org/jira/browse/NUTCH-427 and configured it with Nutch as described in read.txt. But when I try to crawl, nothing gets crawled and I get the following exception in the Hadoop log:

2009-12-21 16:25:04,728 FATAL smb.SMB - Could not read content of protocol: smb://10.88.45.140/shared_folder/
jcifs.smb.SmbException: jcifs.util.transport.TransportException
java.net.SocketException: Invalid argument or cannot assign requested address
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:333)
    at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:195)
    at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
    at java.net.Socket.connect(Socket.java:525)
    at java.net.Socket.connect(Socket.java:475)
    at java.net.Socket.<init>(Socket.java:372)
    at java.net.Socket.<init>(Socket.java:246)
    at jcifs.smb.SmbTransport.negotiate(SmbTransport.java:244)
    at jcifs.smb.SmbTransport.doConnect(SmbTransport.java:299)
    at jcifs.util.transport.Transport.run(Transport.java:240)
    at java.lang.Thread.run(Thread.java:619)
    at jcifs.util.transport.Transport.run(Transport.java:256)
    at java.lang.Thread.run(Thread.java:619)
    at jcifs.smb.SmbTransport.connect(SmbTransport.java:289)
    at jcifs.smb.SmbTree.treeConnect(SmbTree.java:139)
    at jcifs.smb.SmbFile.connect(SmbFile.java:798)
    at jcifs.smb.SmbFile.connect0(SmbFile.java:768)
    at jcifs.smb.SmbFile.exists(SmbFile.java:1275)
    at org.apache.nutch.protocol.smb.SMBResponse.<init>(SMBResponse.java:74)
    at org.apache.nutch.protocol.smb.SMB.getProtocolOutput(SMB.java:62)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:535)

Has anyone used the SMB protocol plugin before?

Thanks,
Rupesh
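For what it's worth, a SocketException like this from jcifs often has nothing to do with Nutch itself. It can help to check whether plain jcifs can reach the share from the same machine (and, on Linux, whether forcing IPv4 with -Djava.net.preferIPv4Stack=true changes anything). Below is a minimal standalone check; the credentials are placeholders to be replaced with real values:

    import jcifs.smb.NtlmPasswordAuthentication;
    import jcifs.smb.SmbFile;

    // Standalone jcifs connectivity check, independent of Nutch.
    // Replace the domain, user and password with values for your share.
    public class SmbCheck {
        public static void main(String[] args) throws Exception {
            NtlmPasswordAuthentication auth =
                new NtlmPasswordAuthentication("WORKGROUP", "user", "password");
            SmbFile share = new SmbFile("smb://10.88.45.140/shared_folder/", auth);

            // exists() triggers the same connect shown in the stack trace above
            // (SMBResponse calls SmbFile.exists()), so it should reproduce the
            // SocketException if the problem is environmental.
            System.out.println("exists: " + share.exists());
            for (SmbFile f : share.listFiles()) {
                System.out.println(f.getName());
            }
        }
    }

If this check fails with the same exception, the cause is probably in the JVM or network setup (IPv6 vs. IPv4 socket behaviour is a common suspect for "cannot assign requested address") rather than in the plugin configuration.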
Large files - nutch failing to fetch
Hi,

Nutch is throwing errors while fetching large files (files larger than 100 MB). I have a website whose pages point to large files (sizes range from 10 MB to 500 MB), and there are several such files on the site. I want to fetch all of them with Nutch, but Nutch throws an OutOfMemoryError for the large ones. I have set the heap size to 2500m; with that heap, files up to about 250 MB are retrieved, but anything larger fails, and Nutch then takes a long time after printing:

-activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
-activeThreads=0

If there are three files of 100 MB each at the same depth (still with heap size 2500m), the fetch also fails. I have set http.content.limit to -1. Is there a way to fetch several large files using Nutch?

I am using Nutch only as a web crawler, not for indexing. I want to download web resources and scan them for viruses using ClamAV.

--
Thanks & Regards,
Sundara Kaku
Re: Large files - nutch failing to fetch
On 2009-12-21 17:15, Sundara Kaku wrote:
> [...] is there way to fetch several large files using nutch..

Probably Nutch is not the right tool for you - you should probably use wget. Nutch was designed to fetch many pages of limited size: as a temporary step it caches the downloaded content in memory before flushing it out to disk. (I had to solve this limitation once for a specific case - the solution was to implement a variant of the protocol and of Content that stored data into separate HDFS files without buffering in memory - but it was a brittle hack that only worked for that particular scenario.)

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
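To illustrate the approach Andrzej describes (this is not Nutch's actual code, and all class and path names are invented for the sketch): stream the fetched bytes straight into an HDFS file in fixed-size chunks instead of buffering the whole body in a Content object in memory.

    import java.io.InputStream;
    import java.net.URL;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Sketch only: copy a large resource to HDFS in 64 KB chunks so the JVM
    // heap never has to hold the whole file at once.
    public class StreamToHdfs {
        public static void main(String[] args) throws Exception {
            String src = args[0];   // e.g. a URL pointing at a large file
            String dst = args[1];   // e.g. an HDFS path for the raw bytes
            FileSystem fs = FileSystem.get(new Configuration());

            InputStream in = new URL(src).openStream();
            FSDataOutputStream out = fs.create(new Path(dst));
            try {
                // 'false' keeps the streams open so we can close them in finally.
                IOUtils.copyBytes(in, out, 64 * 1024, false);
            } finally {
                in.close();
                out.close();
            }
        }
    }

The copying itself is the easy part; wiring such out-of-band content into the rest of the Nutch pipeline (parsing, CrawlDb updates) is presumably what made the hack brittle.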
domain crawl using bin/nutch
Hi,

I found the db.ignore.external.links property. How do I further limit the crawl by also excluding certain links within the same domain? Thanks
unicode 2029 paragraph separator
http://www.fileformat.info/info/unicode/char/2029/index.htm

I have found that this Unicode character breaks JSON deserialization when using Solr with AJAX. It comes from text extracted from a PDF. Where is the right place to filter out or replace this character - the PDF parser/text extractor, or the Solr indexer?

Regards,
Reinhard
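One possible place to handle this is a small cleanup step applied to the extracted text before it is sent to Solr (for example in a custom indexing filter). A minimal sketch in plain Java, not tied to any particular Nutch or Solr API:

    // U+2028 (LINE SEPARATOR) and U+2029 (PARAGRAPH SEPARATOR) are legal inside
    // JSON strings, but they are line terminators in JavaScript, so eval()-based
    // AJAX deserializers tend to choke on them.
    public class ParagraphSeparatorCleaner {
        public static String clean(String text) {
            if (text == null) {
                return null;
            }
            return text.replace('\u2028', ' ').replace('\u2029', '\n');
        }

        public static void main(String[] args) {
            String extracted = "first paragraph\u2029second paragraph";
            // Prints two ordinary lines with no U+2029 left in the output.
            System.out.println(clean(extracted));
        }
    }

Handling it at indexing time has the advantage of covering every parser, not just the PDF extractor.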
Re: domain crawl using bin/nutch
You should be able to do this using one of the variations of the *-urlfilter.txt files. Instead of putting + in front of a regex, you can tell it to exclude URLs that match by prefixing the regex with -.

Just a guess, I haven't actually tried it, but you could probably use something like the following (I'm sure you would have to fiddle with it to get it to work correctly):

+^http://([a-z0-9]*\.)*mydomain.com/
-.*/(pagename1\.php|pagename2\.php)

Jesse

int GetRandomNumber() {
    return 4; // Chosen by fair roll of dice
              // Guaranteed to be random
} // xkcd.com

On Mon, Dec 21, 2009 at 2:14 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Hi, I found db.ignore.external.links property. How do I limit the crawl
> by also excluding links within the same domain as well ? Thanks
RE: domain crawl using bin/nutch
But how could we tell Nutch to crawl this way every time? I do not want to edit the *-urlfilter.txt file for every crawl.

Thanks,
Jun

-----Original Message-----
From: Jesse Hires [mailto:jhi...@gmail.com]
Sent: December 22, 2009 9:23
To: nutch-user@lucene.apache.org
Subject: Re: domain crawl using bin/nutch

> You should be able to do this using one of the variations of the
> *-urlfilter.txt files. [...]
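If editing the filter file for every crawl is the problem, one alternative (just a sketch, not something that ships with Nutch) is a small custom URL filter plugin that reads its exclusion pattern from an ordinary configuration property, so each crawl can carry its own nutch-site.xml instead of a hand-edited *-urlfilter.txt. A rough sketch against the Nutch 1.x URLFilter extension point; the property name "urlfilter.exclude.pattern" is invented for this example:

    import java.util.regex.Pattern;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilter;

    // Hypothetical plugin: drops URLs matching a regex taken from the crawl's
    // configuration (the property name below is made up for this sketch).
    public class ConfigurableExcludeFilter implements URLFilter {

        private Configuration conf;
        private Pattern exclude;

        public String filter(String urlString) {
            // Returning null tells Nutch to drop the URL; returning it keeps it.
            if (exclude != null && exclude.matcher(urlString).find()) {
                return null;
            }
            return urlString;
        }

        public void setConf(Configuration conf) {
            this.conf = conf;
            String p = conf.get("urlfilter.exclude.pattern");
            this.exclude = (p == null || p.length() == 0) ? null : Pattern.compile(p);
        }

        public Configuration getConf() {
            return conf;
        }
    }

It would still need the usual plugin.xml and build glue to be registered under the URLFilter extension point, so whether this is less work than maintaining the filter file depends on how often the exclusions change.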