Re: [squid-users] How does squid behave when caching really large files (GBs)

Amos Jeffries Thu, 18 Aug 2011 07:47:32 -0700

On 16/08/11 20:33, Thiago Moraes wrote:

Hello everyone,


I currently have a server which stores many terabytes of rather static
files, each one having tenths of gigabytes. Right now, these files are
only accessed through a local connection, but in some time this is
going to change. One option to make the access acceptable is to deploy
new servers on the places that will most access these files. The new
server would keep a copy of the most accessed ones so that only a LAN
connection is needed, instead of wasting bandwidth to external access.

I'm considering almost any solution to these new hosts and one of them
is just using a cache tool like squid to make the downloads faster,
but as I didn't see someone caching files this big, I would like to
know which problems I may find if I adopt this kind of solution.

You did mean "tenths" right, as in 100-900 MB files? That seems slightly
larger than most traffic, but not huge. Even old Squid installs limited
to 32-bit files should have no problem with handling that as traffic.

Most Squid installs won't store them locally to the clients though. The
default limit is 4MB, to cache the bulk of web page traffic and keep
rarer large objects like yours from pushing much out of cache.

Most of the bumping up mentioned around here is for YouTube and
similar video media content. That usually means increasing it only to
tens/hundreds of MB, then stopping there for the same caching reasons
as the 4MB limit.
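
For illustration, raising the limit might look like the squid.conf
fragment below. This is only a sketch: the 1024 MB ceiling, the
cache_dir path and the 500000 MB cache size are assumptions picked for
the example, not recommendations, and need sizing to your own disks
and file sizes.

```
# squid.conf fragment (illustrative values, assumptions only)

# Allow objects up to 1 GB into the cache (the default is 4 MB).
# Listed before cache_dir so the limit is already set when the
# cache_dir line is parsed.
maximum_object_size 1024 MB

# Hypothetical ~500 GB disk cache; the size argument is in MB.
cache_dir ufs /var/spool/squid 500000 16 256
```

With objects of 100-900 MB, a cache of that size holds somewhere
between several hundred and a few thousand files, so how well it works
depends heavily on how concentrated the accesses are.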

Occasionally we hear from an ISP or CDN bumping it enough to cache CDs
or DVDs, and from OS distribution mirrors, although those tend to have
smaller package caches. Mostly tens of MB objects.

The CERN Frontier network admins are pushing multiple-TB around via
Squids. It sounds like they are a scale above what you want to do, but
if you want operational experience with big data they could be the best
people to talk to.


The alternatives I've considered so far include using a distributed
file system such as Hadoop, deploying a private cloud storage system
to communicate between the servers or even using bittorrent to share
the files among servers. Any comments on these alternatives too?

No opinion on them as such. AFAIK these don't seem to be really in the
same type of service area as Squid.

If you are after distributed _storage_, Squid is definitely not the
right solution.

Squid's design is more about fast delivery of the data than storage.
Caches being distributed stores is a side effect of that model being
very efficient for delivery, rather than any effort to spread the
locations of things. Cache storage is fundamentally a giant /tmp
directory: persistent, but liable to erasure at any given second. A
chunk of it is often found only in volatile RAM too.

Bittorrent is perhaps the closest, in being delivery oriented rather
than storage oriented, with one authority source and a hierarchy of
intermediaries doing the delivery. That's where the similarities end as
well.

If what you are after is a scalable delivery mechanism that can minimize
the bandwidth consumption, Squid is definitely an option there.

You can layer a whole distributed background set of storage servers
behind a gateway layer of Squid, using the various peering algorithms
and ACL rules for source selection.
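
As a rough sketch of that layering, a gateway Squid could pick a
backend per URL path along the lines below. The hostnames, the peer
names (store1, store2), the ACL names and the /dataset-a/ and
/dataset-b/ paths are all made up for the example; only the directives
themselves are real squid.conf ones.

```
# squid.conf fragment for a gateway Squid (hypothetical hosts/paths)

# Accept client requests as a reverse proxy (accelerator).
http_port 80 accel defaultsite=files.example.com

# Two background storage servers that publish the files over HTTP.
cache_peer store1.example.net parent 80 0 no-query originserver name=store1
cache_peer store2.example.net parent 80 0 no-query originserver name=store2

# ACLs describing which URL paths live on which backend.
acl dataset_a urlpath_regex ^/dataset-a/
acl dataset_b urlpath_regex ^/dataset-b/

# Source selection: send each request only to the peer holding it.
cache_peer_access store1 allow dataset_a
cache_peer_access store1 deny all
cache_peer_access store2 allow dataset_b
cache_peer_access store2 deny all
```

Selecting by path is just one option. The peers could equally be
chosen by one of the peer selection algorithms if every backend holds
the same content.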

Those background layer servers can in turn use any of the
storage-oriented methods you mention to actually store the content, if
they still need scale, with web services providing the files as HTTP
objects from each location to the Squid layer.

Wikimedia have some nice CDN network diagrams published if you want to
see what I mean: http://meta.wikimedia.org/wiki/Wikimedia_servers

Sorry, talked you round in a circle there. But I hope it's of some help,
at least on where and whether Squid can fit into things for you.


Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.14
  Beta testers wanted for 3.2.0.10
