Hi Hiran,

When it comes to plugins (which is where I think your question stems from, 
given your recent protocol-smb plugin contribution), the CrawlDatum parameter 
is currently used in the following ways:
 
- protocol-file: used 'under the hood' when making a request to the file 
system. Specifically, it is used when checking the _last_ modified time of the 
file. This is done with a less-than-or-equal comparison between the file's 
last modified time and the CrawlDatum record. See 
https://github.com/apache/nutch/blob/d6f55b8ea6f5809cef5a31239e5760be23742c00/src/plugin/protocol-file/src/java/org/apache/nutch/protocol/file/FileResponse.java#L172C13-L172C25

- protocol-ftp: essentially it is used in the same capacity as I've described 
above, with the addition that FTP files _and_ directories can each have their 
own CrawlDatum object associated with the FTP URL. Again, however, the 
CrawlDatum is used in the _last_ modified check. See 
https://github.com/apache/nutch/blob/d6f55b8ea6f5809cef5a31239e5760be23742c00/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/FtpResponse.java#L264-L266

- lib-http: for all plugins which import and inherit functionality from 
lib-http (e.g. protocol-htmlunit, protocol-http, protocol-httpclient, 
protocol-interactiveselenium, protocol-okhttp and protocol-selenium) the 
CrawlDatum is used for a few more things. For example, in protocol-okhttp the 
CrawlDatum is used to establish an IF_MODIFIED_SINCE HTTP header. It is also 
used to associate and fetch any cookies for a particular URL, and to store a 
server response time for a given URL. Finally, it is used to store the HTTP 
response code for the given URL, i.e., PROTOCOL_STATUS_CODE_KEY.
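
To make the pattern a bit more concrete, here is a rough, self-contained 
sketch of the three uses described above: the last modified comparison, 
building a conditional request header value, and recording a status code in 
the metadata. It is NOT the Nutch code. DatumStandIn is a hypothetical 
stand-in for CrawlDatum, the method names statusFor, ifModifiedSince and 
recordStatus are invented for illustration, and STATUS_KEY is a placeholder 
string rather than the real value behind PROTOCOL_STATUS_CODE_KEY.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class CrawlDatumUsageSketch {

  // Placeholder metadata key, NOT the real value of PROTOCOL_STATUS_CODE_KEY.
  static final String STATUS_KEY = "example.protocol.code";

  // HTTP-date layout, always rendered in GMT.
  private static final DateTimeFormatter HTTP_DATE = DateTimeFormatter
      .ofPattern("EEE, dd MMM yyyy HH:mm:ss 'GMT'", Locale.ENGLISH)
      .withZone(ZoneOffset.UTC);

  /** Hypothetical stand-in for CrawlDatum: a modified time plus metadata. */
  static class DatumStandIn {
    private final long modifiedTime; // epoch millis known from an earlier fetch
    private final Map<String, String> metaData = new HashMap<>();

    DatumStandIn(long modifiedTime) {
      this.modifiedTime = modifiedTime;
    }

    long getModifiedTime() {
      return modifiedTime;
    }

    Map<String, String> getMetaData() {
      return metaData;
    }
  }

  /** File/FTP style check: 304 when the resource is not newer than the datum. */
  static int statusFor(long resourceLastModified, DatumStandIn datum) {
    return resourceLastModified <= datum.getModifiedTime() ? 304 : 200;
  }

  /** HTTP style: value for an If-Modified-Since header, or null if unknown. */
  static String ifModifiedSince(DatumStandIn datum) {
    if (datum.getModifiedTime() <= 0) {
      return null; // no earlier modified time, so send an unconditional request
    }
    return HTTP_DATE.format(Instant.ofEpochMilli(datum.getModifiedTime()));
  }

  /** Writes the response code back into the datum's metadata. */
  static void recordStatus(DatumStandIn datum, int code) {
    datum.getMetaData().put(STATUS_KEY, Integer.toString(code));
  }

  public static void main(String[] args) {
    DatumStandIn datum = new DatumStandIn(1_000L);
    int code = statusFor(500L, datum); // resource older than the datum's time
    recordStatus(datum, code);
    System.out.println(code + " | " + ifModifiedSince(datum)
        + " | " + datum.getMetaData());
  }
}
```

The point for a new protocol plugin such as protocol-smb is that the 
CrawlDatum works in both directions: the plugin reads from it to decide 
whether a fetch is needed, and may write per-URL results back into its 
metadata.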

I hope this gives you an idea as to how it is used. 

If you have further questions about anything other than the use of CrawlDatum 
objects in plugins, please feel free to follow up here. Hopefully I gauged your 
question correctly. 

lewismc

On 2024/10/01 21:27:59 Hiran Chaudhuri wrote:
> Looking at the interface for protocol plugin, I notice the
> getProtocolOutput function has two parameters.
> 
> https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/protocol/Protocol.java#L40
> 
> What is the CrawlDatum parameter used for?
> 
> 
