Hello everyone,

I tried crawling an FTP site with Nutch 1.4 using the "protocol-ftp" plug-in
and ran into two major issues, described below.


1.  The class org.apache.nutch.protocol.ftp.Ftp has two member objects, client
and parser, which the class org.apache.nutch.protocol.ftp.FtpResponse uses to
connect to the configured FTP server and to parse the data.
I configured 30 fetcher threads in nutch-site.xml, so 30 threads were fetching
data from the FTP site simultaneously. The FTP site was configured to allow an
unlimited number of connections.
The org.apache.nutch.protocol.ftp.Ftp object is created only once in the
life cycle (it is stored in the object cache and used by all the fetcher
threads), so client and parser are shared among all the worker threads.
This sharing caused the following exceptions:

a.  NullPointerException - when one thread is accessing the client while
another thread sets it to null at the end of its run.

b.  IOException (stream closed) - when one thread is accessing the client
while another thread disconnects it at the end of its run.

Resolution: making these variables local in
org.apache.nutch.protocol.ftp.FtpResponse would resolve this.
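
To illustrate the resolution for issue 1, here is a minimal sketch of the
"local variable" approach. It is not the Nutch code: StubClient is a
hypothetical stand-in for the FTP client, and fetch() and runConcurrently()
are names made up for this example. The point is only that a client owned by
a single call cannot be nulled or disconnected by another thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PerRequestClientSketch {

    // Hypothetical stand-in for the FTP client (not the Nutch or commons-net class).
    static class StubClient {
        private boolean connected;

        void connect() { connected = true; }

        String list(String path) {
            if (!connected) {
                throw new IllegalStateException("stream closed");
            }
            return "listing of " + path;
        }

        void disconnect() { connected = false; }
    }

    // The client is a local variable, so each call (and therefore each
    // thread) owns its own instance; nothing is shared between threads.
    static String fetch(String path) {
        StubClient client = new StubClient();
        client.connect();
        try {
            return client.list(path);
        } finally {
            client.disconnect();
        }
    }

    // Runs fetch() from several threads and returns the number of failed
    // calls, or -1 if the run was interrupted or did not finish in time.
    static int runConcurrently(int threads, int requestsPerThread) {
        AtomicInteger failures = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (int i = 0; i < requestsPerThread; i++) {
                    try {
                        fetch("/pub/file" + i);
                    } catch (RuntimeException e) {
                        failures.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                return -1;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        return failures.get();
    }

    public static void main(String[] args) {
        System.out.println("failures: " + runConcurrently(30, 1000));
    }
}
```

The trade-off is one connection per request instead of a reused one. That
seemed acceptable here because the server allows unlimited connections, but
it is an assumption of this sketch rather than a measured result.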



2.  If a document on the FTP server has a space in its name (e.g. "test
document.pdf"), the Fetcher passes the URL to be fetched in encoded form
(e.g. ftp://192.168.1.20/pub/test%20document.pdf). As a result, the first FTP
command executed in org.apache.nutch.protocol.ftp.Client.retrieveList() is
ls /pub/test%20document.pdf.
The call to retrieveList() returns an empty list, on which
org.apache.nutch.protocol.ftp.FtpResponse.getFileAsHttpResponse() calls
list.get(0). This causes an IndexOutOfBoundsException.

Resolution: decoding the URL before passing it to the ls command.
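
A rough sketch of the decoding step for issue 2, using only the JDK. The
class and method names (FtpPathDecodeSketch, decodedPath, firstOrNull) are
made up for illustration and are not part of Nutch. It relies on
java.net.URI.getPath() returning the path with percent-escapes decoded, and
it also shows a guard for the empty-list case, so that a missing file does
not end in an IndexOutOfBoundsException.

```java
import java.net.URI;
import java.util.List;

public class FtpPathDecodeSketch {

    // Returns the decoded path component of an FTP URL, i.e. the string
    // that would be handed to the FTP list command.
    // URI.create throws IllegalArgumentException for a malformed URL.
    static String decodedPath(String url) {
        return URI.create(url).getPath();
    }

    // Hypothetical guard: returns the first entry, or null when the
    // listing is null or empty, instead of calling list.get(0) blindly.
    static <T> T firstOrNull(List<T> list) {
        return (list == null || list.isEmpty()) ? null : list.get(0);
    }

    public static void main(String[] args) {
        String url = "ftp://192.168.1.20/pub/test%20document.pdf";
        System.out.println(decodedPath(url));
    }
}
```

URI.getPath() is used here rather than java.net.URLDecoder because
URLDecoder follows form-encoding rules and turns "+" into a space, which
would corrupt file names that really contain a plus sign.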


Any pointers on this would help a lot.

Thanks,
Rutvij Vyas