Hello, I've got a question concerning a slight modification of squid (for the purposes of another GNU project). I don't think I could get an answer anywhere else, so I hope my message won't be filtered by the moderator :)
We want to use squid as a proxy for a crawler that visits sites suspected of being malicious, and we'd like the proxy to check the downloaded pages with the clamav engine. There is an existing solution that combines squid and clamav, but the clamav scanner works there as a redirector, which means every page is downloaded twice. That approach is unacceptable for our project, because some sites behave differently depending on whether or not it is the first connection from a given IP address.

I've looked into squid's code and have an idea of how to do this. The best place to scan a downloaded page seems to be the storeSwapOutFileClosed function in store_swapout.cc: after the swap file is closed, clamav could scan it and log whether the page is malicious (a rough sketch of the kind of helper I have in mind is attached below). The only glitch is that not all pages are cached, but that doesn't look difficult to solve; it needs tracking the places where the decision is made whether a page is cacheable.

The second thing is that I never want squid to return a cached page. Even if a page has already been cached, it should be downloaded again. I haven't investigated this yet, but I don't think it should be very complicated to change either.

So, my question is: is the modification I've described a good idea, and will it lead to the desired results? Or maybe you have other suggestions? Thank you for any help.

Kind regards,
Cezary Rzewuski
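
P.S. To make the idea more concrete, here is a minimal sketch of the helper I imagine calling from storeSwapOutFileClosed() once the swap file is closed. It is only a sketch under my own assumptions: it talks to a running clamd daemon over its documented socket protocol (the "SCAN <path>" command) rather than linking libclamav, the socket path /var/run/clamav/clamd.ctl is just the common default, and the helper name and the availability of the swap file's full path at that point in squid are my assumptions, not anything squid currently provides.

// Sketch only: ask a running clamd to scan the file squid just swapped out.
// Uses the clamd socket protocol ("SCAN <path>"); the socket path and the
// idea of calling this from storeSwapOutFileClosed() are assumptions.

#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Returns true if clamd reports the file as infected, false otherwise
// (connection or protocol errors are treated as "not scanned").
static bool
clamdReportsInfected(const std::string &path,
                     const char *clamdSocket = "/var/run/clamav/clamd.ctl")
{
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return false;

    sockaddr_un addr;
    memset(&addr, 0, sizeof(addr));
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, clamdSocket, sizeof(addr.sun_path) - 1);

    if (connect(fd, reinterpret_cast<sockaddr *>(&addr), sizeof(addr)) < 0) {
        close(fd);
        return false;
    }

    // clamd protocol: send "SCAN <absolute path>\n";
    // the reply line ends in "OK" or "<signature> FOUND".
    const std::string cmd = "SCAN " + path + "\n";
    if (write(fd, cmd.c_str(), cmd.size()) != static_cast<ssize_t>(cmd.size())) {
        close(fd);
        return false;
    }

    char reply[512];
    const ssize_t n = read(fd, reply, sizeof(reply) - 1);
    close(fd);
    if (n <= 0)
        return false;
    reply[n] = '\0';

    return strstr(reply, "FOUND") != nullptr;
}

Going through clamd instead of linking libclamav would keep squid's build unchanged, but it does mean clamd needs read access to squid's cache directory, and I still need to confirm that the swap file's full path is actually available inside storeSwapOutFileClosed().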
