On 16/08/11 20:33, Thiago Moraes wrote:
Hello everyone,

I currently have a server which stores many terabytes of rather static
files, each one having tenths of gigabytes. Right now, these files are
only accessed through a local connection, but in some time this is
going to change. One option to make the access acceptable is to deploy
new servers on the places that will most access these files. The new
server would keep a copy of the most accessed ones so that only a LAN
connection is needed, instead of wasting bandwidth to external access.

I'm considering almost any solution for these new hosts, and one of them
is just using a cache tool like squid to make the downloads faster,
but as I didn't see someone caching files this big, I would like to
know which problems I may find if I adopt this kind of solution.


You did mean "tenths", right, as in 100-900 MB files? That seems slightly larger than most traffic, but not huge. Even old Squid installs limited to 32-bit file sizes should have no problem handling that as traffic.


Most Squid installs won't cache them close to the clients, though. The default object size limit is 4MB, which is enough to cache the bulk of web page traffic while keeping rarer large objects like yours from pushing much else out of the cache. Most of the bumping up mentioned around here is for YouTube and similar video content. Those installs only raise the limit to tens or hundreds of MB and stop there, for the same caching reasons as the 4MB default.
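
As a rough sketch of what bumping that limit looks like in squid.conf: the 1024 MB ceiling, the cache path and the 200000 MB store size below are all assumed values for illustration, not recommendations, so size them to your own files and disks.

```
# Assumed values for illustration only.
# Listed before cache_dir; some Squid versions apply the limit
# only to the cache_dir lines that follow it.
maximum_object_size 1024 MB

# aufs disk cache at an assumed path, 200000 MB in size,
# with 16 first-level and 256 second-level directories.
cache_dir aufs /var/spool/squid 200000 16 256
```

Anything larger than maximum_object_size is still relayed to the client, it just is not kept in the cache afterwards.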

Occasionally we hear from an ISP or CDN bumping it enough to cache CDs or DVDs, and from OS distribution mirrors, although those tend to be caching smaller packages, mostly objects of tens of MB.

The CERN Frontier network admins are pushing multiple TB around via Squid. It sounds like they are a scale above what you want to do, but if you want operational experience with big data they could be the best people to talk to.



The alternatives I've considered so far include using a distributed
file system such as Hadoop, deploying a private cloud storage system
to communicate between the servers or even using bittorrent to share
the files among servers. Any comments on these alternatives too?

No opinion on them as such. AFAIK these don't seem to be really in the same type of service area as Squid.

If you are after distributed _storage_, then Squid is definitely not the right solution.

Squid's design is more about fast delivery of the data than storage. Caches acting as distributed stores is a side effect of that model being very efficient for delivery, rather than of any effort to spread the locations of things. Cache storage is fundamentally a giant /tmp directory: persistent, but liable to erasure at any second. A chunk of it is often found only in volatile RAM too. Bittorrent is perhaps the closest, in being delivery oriented rather than storage oriented, with one authority source and a hierarchy of intermediaries doing the delivery. That's where the similarities end as well.


If what you are after is a scalable delivery mechanism that can minimize bandwidth consumption, Squid is definitely an option there.

You can layer a whole distributed background set of storage servers behind a gateway layer of Squid, using the various peering algorithms and ACL rules for source selection.
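
A gateway Squid of that kind could be sketched in squid.conf roughly as follows. Every name here (files.example.com, store1, store2, the /set-a/ and /set-b/ paths) is hypothetical, and the http_access rules you would also need are left out.

```
# Hypothetical hostnames and paths throughout.
# Accept requests for the public site name in accelerator mode.
http_port 80 accel defaultsite=files.example.com

# Each background storage server is an origin-server parent;
# no-query means no ICP queries are sent to it.
cache_peer store1.example.com parent 80 0 no-query originserver name=store1
cache_peer store2.example.com parent 80 0 no-query originserver name=store2

# ACL rules select the source by URL path.
acl set_a urlpath_regex ^/set-a/
acl set_b urlpath_regex ^/set-b/

cache_peer_access store1 allow set_a
cache_peer_access store1 deny all
cache_peer_access store2 allow set_b
cache_peer_access store2 deny all
```

If every background server holds the same files, the per-peer ACLs can be dropped and a selection option such as round-robin or carp on the cache_peer lines should spread the requests instead.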

Those background servers can in turn use any of the storage-oriented methods you mention to actually store the content, if they still need scale, with web services providing the files as HTTP objects from each location to the Squid layer. Wikimedia have some nice CDN network diagrams published if you want to see what I mean: http://meta.wikimedia.org/wiki/Wikimedia_servers

Sorry, talked you round in a circle there, but I hope it's of some help, at least on where and whether Squid can fit into things for you.

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.14
  Beta testers wanted for 3.2.0.10
