Hello everyone,

I have tried crawling an FTP site with Nutch-1.4 using the "protocol-ftp" plug-in. I ran into two major issues with this plug-in, described below.
1. The class org.apache.nutch.protocol.ftp.Ftp has two member objects, client and parser, which the class org.apache.nutch.protocol.ftp.FtpResponse uses to establish the connection with the configured FTP server and to parse the data. I configured 30 fetcher threads in nutch-site.xml, so 30 threads were fetching data from the FTP site simultaneously, and I configured the FTP server to allow an unlimited number of connections. The Ftp object is created only once in the life cycle (it is stored in the ObjectCache and used by all the fetcher threads), so client and parser are shared among all the worker threads. This sharing causes the following exceptions:

a. NullPointerException - when one thread is accessing the client while another thread sets it to null at the end of its run.
b. IOException (stream closed) - when one thread is accessing the client while another thread disconnects it at the end of its run.

Resolution: Making these variables local to org.apache.nutch.protocol.ftp.FtpResponse would resolve this.

2. Suppose there is a document on the FTP server with a space in its name (e.g. "test document.pdf"). In this case the Fetcher passes the URL to be fetched in encoded form (e.g. ftp://192.168.1.20/pub/test%20document.pdf). The first FTP command, ls /pub/test%20document.pdf, is then executed with the encoded path (in org.apache.nutch.protocol.ftp.Client, method retrieveList()). Because no file with that literal name exists, retrieveList() returns an empty list, on which getFileAsHttpResponse() in org.apache.nutch.protocol.ftp.FtpResponse calls list.get(0). This causes an IndexOutOfBoundsException.

Resolution: Decoding the URL path before passing it to the ls command.

Any pointers regarding this would help a lot.

Thanks,
Rutvij Vyas
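
To illustrate the resolution proposed for issue 1, here is a minimal sketch of confining the client to a local variable so that each fetch owns its own connection. StubClient, fetch() and runConcurrently() are hypothetical names made up for this example; they are not Nutch classes or methods, and the real change in FtpResponse may look different.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class LocalClientSketch {

    /** Hypothetical stand-in for an FTP client; NOT the Nutch Client class. */
    static class StubClient {
        private boolean connected;

        void connect() { connected = true; }

        String retrieve(String path) {
            if (!connected) {
                throw new IllegalStateException("stream closed");
            }
            return "content of " + path;
        }

        void disconnect() { connected = false; }
    }

    /** Each call creates and owns its client, so no other thread can disconnect it. */
    static String fetch(String path) {
        StubClient client = new StubClient(); // local variable, not a shared field
        client.connect();
        try {
            return client.retrieve(path);
        } finally {
            client.disconnect();
        }
    }

    /** Calls fetch() from several threads and returns how many calls failed. */
    static int runConcurrently(int threads, int fetchesPerThread) {
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < fetchesPerThread; i++) {
                    try {
                        fetch("/pub/file" + i);
                    } catch (RuntimeException e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + runConcurrently(30, 1000));
    }
}
```

Because no client state is shared, the 30 threads cannot close or null each other's connection. The trade-off is one connection per fetch; keeping one client per thread (for example in a ThreadLocal) would be an alternative if reconnecting turns out to be too expensive.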

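
For issue 2, the following sketch shows one way to decode the path before it is handed to the listing command, plus a guard against an empty listing. It uses java.net.URI, whose getPath() returns the decoded path (getRawPath() keeps the escapes). The class and method names are hypothetical and only for illustration; how the plug-in should actually be patched is an assumption on my side.

```java
import java.net.URI;
import java.util.Collections;
import java.util.List;

public class FtpPathSketch {

    /** Returns the decoded path of an FTP URL, e.g. "%20" becomes a space. */
    static String decodedPath(String url) {
        // URI.getPath() decodes percent-escapes; getRawPath() would keep "%20".
        return URI.create(url).getPath();
    }

    /** Hypothetical guard: first entry of a listing, or null if the listing is empty. */
    static <T> T firstOrNull(List<T> list) {
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    public static void main(String[] args) {
        String path = decodedPath("ftp://192.168.1.20/pub/test%20document.pdf");
        System.out.println(path); // prints "/pub/test document.pdf"

        // An empty listing no longer triggers IndexOutOfBoundsException.
        String first = firstOrNull(Collections.<String>emptyList());
        System.out.println("first entry: " + first); // prints "first entry: null"
    }
}
```

The guard is not a substitute for decoding; it only turns a missing file into a case the caller can handle (for example by reporting "not found") instead of an exception from list.get(0).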
