Hi Guys,

I'm sure that these issues should be logged in our Jira, as they not
only sound serious but also come with reasonable-sounding possible
solutions.

If any of you feel like opening a ticket (or tickets), that would be great.
Patches are always welcome.

Lewis

On Sat, Oct 13, 2012 at 12:14 AM, Tejas Patil <[email protected]> wrote:
> About a year back I was working on the FTP plugin and I faced this same
> issue.
>
> *For #1: "Localizing the variables in
> org.apache.nutch.protocol.ftp.FtpResponse would resolve this."*
> The FTP response object itself is being shared, so instead of localizing
> those variables you can create a new FtpResponse object for each request
> to keep things clean. Ideally you would reuse the same FTP object for
> requests to the same host, but things get messy because the connection
> may have timed out by the time you send a new request with the same
> object.
> (Off topic: you got just those 2 exceptions? I am surprised, because I
> used to get 3-4, and in a different order every time I ran it. Lucky
> you :P)
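
To make the per-host reuse idea concrete, here is a rough sketch, not Nutch
code. HostClient and HostClientCache are hypothetical names, and the idle
timeout is an assumed stand-in for whatever the FTP server enforces. The
cache hands back the same object for a host until it has sat idle too long,
then replaces it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a connected FTP client.
class HostClient {
    final String host;

    HostClient(String host) {
        this.host = host;
    }
}

class HostClientCache {
    private static final class Entry {
        final HostClient client;
        long lastUsedMillis;

        Entry(HostClient client, long lastUsedMillis) {
            this.client = client;
            this.lastUsedMillis = lastUsedMillis;
        }
    }

    private final Map<String, Entry> byHost = new HashMap<>();
    private final long idleTimeoutMillis; // assumed server-side idle limit

    HostClientCache(long idleTimeoutMillis) {
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    // Returns the cached client for the host, unless it has been idle for
    // the timeout or longer; in that case a fresh client replaces it.
    synchronized HostClient get(String host, long nowMillis) {
        Entry e = byHost.get(host);
        if (e == null || nowMillis - e.lastUsedMillis >= idleTimeoutMillis) {
            e = new Entry(new HostClient(host), nowMillis);
            byHost.put(host, e);
        }
        e.lastUsedMillis = nowMillis;
        return e.client;
    }

    public static void main(String[] args) {
        HostClientCache cache = new HostClientCache(1000);
        HostClient a = cache.get("192.168.1.20", 0);
        HostClient b = cache.get("192.168.1.20", 500);
        HostClient c = cache.get("192.168.1.20", 1600);
        System.out.println(a == b); // true: reused within the idle window
        System.out.println(a == c); // false: idle too long, replaced
    }
}
```

The synchronized method only protects the map. Two threads fetching from
the same host would still end up sharing one client, which is why a new
object per request remains the simpler option.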
>
> *For #2: "Resolution : Decoding the URL and then passing to ls command."*
> You can decode the URL, but the trap is that the decoder may also modify
> other parts of the URL that you did not expect it to touch. So in the
> quest to make one correction, you might end up with an incorrect
> modification elsewhere.
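
One concrete case of that trap, as a small sketch rather than a proposed
patch: java.net.URLDecoder does form decoding, so besides %XX escapes it
also turns '+' into a space, while java.net.URI.getPath() decodes only the
%XX escapes. The class name and the file name "a+b%20c.pdf" are made up for
illustration.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

class FtpPathDecodeDemo {
    // Form decoding: decodes %XX escapes but also rewrites '+' as a space.
    static String formDecode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // URI path decoding: decodes %XX escapes and leaves '+' untouched.
    static String pathOf(String url) {
        try {
            return new URI(url).getPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The URL from the original report decodes as hoped.
        System.out.println(pathOf("ftp://192.168.1.20/pub/test%20document.pdf"));
        // prints: /pub/test document.pdf

        // A hypothetical file name containing a literal '+'.
        System.out.println(pathOf("ftp://192.168.1.20/pub/a+b%20c.pdf"));
        // prints: /pub/a+b c.pdf
        System.out.println(formDecode("/pub/a+b%20c.pdf"));
        // prints: /pub/a b c.pdf  (the '+' was lost)
    }
}
```

So if the URL does get decoded before the ls command, it matters which
decoder is used.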
>
> Thanks,
> Tejas Patil
>
> On Fri, Oct 12, 2012 at 3:19 AM, Rutvij Vyas
> <[email protected]> wrote:
>
>> Hello everyone,
>>
>> I have tried crawling an FTP site with Nutch 1.4 using the "protocol-ftp"
>> plug-in. I ran into two major issues with this plug-in, described below.
>>
>>
>> 1.  The class org.apache.nutch.protocol.ftp.Ftp has two member objects,
>> client and parser, which the class
>> org.apache.nutch.protocol.ftp.FtpResponse uses to establish the
>> connection to the configured FTP server and to parse the data.
>> I configured 30 fetcher threads in nutch-site.xml so that 30 threads were
>> fetching data from the FTP site simultaneously, and I configured the FTP
>> site to allow an unlimited number of connections.
>> An object of class org.apache.nutch.protocol.ftp.Ftp is created only once
>> in the life cycle (it is stored in the object cache and used by all the
>> fetcher threads), so client and parser are shared among all the
>> worker threads.
>> This behavior caused the exceptions below:
>>
>> a.  NullPointerException - when one thread is trying to access the
>> client while another thread is nullifying it at the end of its run.
>>
>> b.  IOException (Stream Closed Exception) - when one thread is trying
>> to access the client while another thread is disconnecting it at the end
>> of its run.
>> Resolution : Localizing the variables in
>> org.apache.nutch.protocol.ftp.FtpResponse would resolve this.
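
For whoever picks this up, a minimal sketch of what "localizing" could look
like. This is not the actual Nutch code: FakeClient and fetch() are
hypothetical stand-ins for the real client and for the per-request work in
the response class. The point is only that the client lives in a local
variable, so a thread that disconnects its client cannot pull it out from
under another thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in for the FTP client; it only tracks connection state.
class FakeClient {
    private boolean connected;

    void connect() {
        connected = true;
    }

    String retrieve(String path) {
        if (!connected) {
            throw new IllegalStateException("stream closed");
        }
        return "content of " + path;
    }

    void disconnect() {
        connected = false;
    }
}

class LocalClientDemo {
    // The client is a local variable, so each request owns its connection.
    static String fetch(String path) {
        FakeClient client = new FakeClient();
        client.connect();
        try {
            return client.retrieve(path);
        } finally {
            client.disconnect(); // closes only this request's connection
        }
    }

    // Runs one fetch per thread concurrently and counts the successes.
    static int fetchConcurrently(int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> results = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                final String path = "/pub/file" + i + ".txt";
                Callable<String> task = () -> fetch(path);
                results.add(pool.submit(task));
            }
            int ok = 0;
            for (Future<String> f : results) {
                if (f.get().startsWith("content of ")) {
                    ok++;
                }
            }
            return ok;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(fetchConcurrently(30) + " fetches succeeded");
        // prints: 30 fetches succeeded
    }
}
```

The cost is one connection per request, which fits the setup described
above, where the server allows unlimited connections.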
>>
>>
>>
>> 2.  Suppose there is a document on the FTP server with a space in its name
>> (e.g. "test document.pdf"). The Fetcher passes the URL to be fetched in
>> encoded form (e.g. ftp://192.168.1.20/pub/test%20document.pdf), so the
>> first FTP command executed is ls /pub/test%20document.pdf, in the method
>> retrieveList() of org.apache.nutch.protocol.ftp.Client.
>> The call to retrieveList() returns an empty list, on which the method
>> getFileAsHttpResponse() of org.apache.nutch.protocol.ftp.FtpResponse
>> calls list.get(0). This causes an IndexOutOfBoundsException.
>> Resolution : Decoding the URL and then passing to ls command.
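
Whatever happens with the decoding, the empty-list case probably deserves
a guard of its own. A rough sketch, using plain strings instead of the
plug-in's real listing entries; firstEntry() and statusFor() are
hypothetical helpers, and mapping "nothing listed" to 404 is my assumption,
not something I have checked against the plug-in.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

class EmptyListingGuardDemo {
    // Returns the first listing entry, or empty when nothing was listed.
    static Optional<String> firstEntry(List<String> listing) {
        if (listing == null || listing.isEmpty()) {
            return Optional.empty();
        }
        return Optional.of(listing.get(0));
    }

    // Assumed mapping to an HTTP-like status: 200 if found, 404 if not.
    static int statusFor(List<String> listing) {
        return firstEntry(listing).isPresent() ? 200 : 404;
    }

    public static void main(String[] args) {
        List<String> found = Arrays.asList("test document.pdf");
        List<String> missing = Collections.emptyList();
        System.out.println(statusFor(found));   // prints: 200
        System.out.println(statusFor(missing)); // prints: 404
    }
}
```

With a check like this, a name the server cannot find yields a "not found"
result instead of an IndexOutOfBoundsException in the fetcher thread.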
>>
>>
>> Any pointers regarding this would help a lot.
>>
>> Thanks,
>> Rutvij Vyas
>>


