[squid-users] How does squid behave when caching really large files (GBs)

2011-08-19 Thread Thiago Moraes
I meant files ranging from 100MB to 30GB, but mostly above the 10GB
mark, so that's the size of my problem. I saw the CERN case on Squid's
homepage, but their files were at most 150MB, as stated in the paper.
I'll try to learn a little more from their case, though.


They really are not in the same area as Squid. The question is how to
make downloading huge files less painful while avoiding the WAN. Having
a server inside the LAN makes more sense to me, and I have few
constraints, since the project is new and entirely in my hands. I can
either develop something in a layer above my system (which would run on
my main server), such as Squid, or deploy my system at every site. In
the latter case, I would need a way to share files between multiple
instances of the same program, and a distributed file system made more
sense to me. (I don't know if I made myself clear here; English is not
my first language, so if this is a little messy, don't hesitate to ask
me again.)

The problem with the architecture of multiple instances of my system
sharing files (which could even be done via rsync or similar) is that
the main database holds more than 40TB of data. Its copies may not have
that much space available, so I would need a way to decide which files
reside on each server (and how that changes over time). To me, this
seems like exactly the kind of problem a cache server can solve, and it
would save a lot of effort. Is this viable?

I hope I have made my problem a little clearer now. Do you have any
more thoughts to share? Thanks for your time, Amos; it has helped me and
I appreciate it.

Thiago Moraes - EnC 07 - UFSCar


Re: [squid-users] How does squid behave when caching really large files (GBs)

2011-08-19 Thread Amos Jeffries

On 20/08/11 03:59, Thiago Moraes wrote:

I meant files ranging from 100MB to 30GB, but mostly above the 10GB
mark, so that's the size of my problem. I saw the CERN case on Squid's
homepage, but their files were at most 150MB, as stated in the paper.
I'll try to learn a little more from their case, though.



Oh dear. Files above 2GB each can run into some problems with those 
older installs of Squid: the cache accounting screws up a bit, with 
various side effects. The admins of those installs will hopefully have 
worked around this already by limiting their cache sizes, so the 
problems noticed should be small. But nobody can guarantee that.
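
As a minimal illustration of the kind of tuning being discussed, 
assuming a current Squid release built with large-file support (e.g. 
./configure --with-large-files); the path and the sizes below are 
placeholders, not recommendations:

  # allow individual objects of up to 32 GB into the cache
  maximum_object_size 32 GB

  # keep only small objects in RAM; the big ones go straight to disk
  cache_mem 512 MB
  maximum_object_size_in_memory 1 MB

  # a bounded on-disk cache, ~500 GB total (the size is in MB),
  # so a handful of 10-30 GB objects cannot evict everything else
  cache_dir ufs /var/spool/squid 500000 16 256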




They really are not in the same area as Squid. The question is how to
make downloading huge files less painful while avoiding the WAN. Having
a server inside the LAN makes more sense to me, and I have few
constraints, since the project is new and entirely in my hands. I can
either develop something in a layer above my system (which would run on
my main server), such as Squid, or deploy my system at every site. In
the latter case, I would need a way to share files between multiple
instances of the same program, and a distributed file system made more
sense to me. (I don't know if I made myself clear here; English is not
my first language, so if this is a little messy, don't hesitate to ask
me again.)

The problem with the architecture of multiple instances of my system
sharing files (which could even be done via rsync or similar) is that
the main database holds more than 40TB of data. Its copies may not have
that much space available, so I would need a way to decide which files
reside on each server (and how that changes over time). To me, this
seems like exactly the kind of problem a cache server can solve, and it
would save a lot of effort. Is this viable?


Squid certainly should be able to solve the problem of selecting the 
best source when something is needed. It will depend on how hot your 
objects are, i.e. how much repeat traffic you get for each one: the more 
repeat traffic, the better Squid works.
  You can measure this from your existing logs to get a rough idea of 
whether Squid would be useful.




I hope I have made my problem a little clearer now. Do you have any
more thoughts to share? Thanks for your time, Amos; it has helped me and
I appreciate it.


You are welcome. Big data projects are few and far between. Always kind 
of interesting to hear and think about :)


Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.14
  Beta testers wanted for 3.2.0.10


Re: [squid-users] How does squid behave when caching really large files (GBs)

2011-08-18 Thread Amos Jeffries

On 16/08/11 20:33, Thiago Moraes wrote:

Hello everyone,

I currently have a server which stores many terabytes of rather static
files, each one having tenths of gigabytes. Right now, these files are
only accessed over a local connection, but this is going to change
before long. One option to keep access acceptable is to deploy new
servers at the places that will access these files the most. Each new
server would keep a copy of the most accessed files, so that only a LAN
connection is needed instead of spending bandwidth on external access.

I'm considering almost any solution for these new hosts, and one of them
is simply using a caching tool like Squid to make the downloads faster,
but as I haven't seen anyone caching files this big, I would like to
know which problems I may run into if I adopt this kind of solution.



You did mean tenths, right? As in 100-900 MB files? That seems slightly 
larger than most traffic, but not huge. Even old Squid installs limited 
to 32-bit file sizes should have no problem handling that as traffic.



Most Squid installs won't store them locally to the clients, though. 
The default limit is 4MB, to cache the bulk of web page traffic and keep 
rarer large objects like yours from pushing much out of the cache.
 Most of the bumping up mentioned around here is for YouTube and similar 
video media content, and only raises the limit to tens or hundreds of 
MB, then stops there for the same caching reasons as the 4MB limit.


 Occasionally we hear from an ISP or CDN bumping it enough to cache CDs 
or DVDs, and from OS distribution mirrors, although those also tend to 
have smaller package caches: mostly tens-of-MB objects.
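
For reference, the bump being described is a one-line change; the value 
here is only an example, sized for a DVD image:

  # default is 4 MB; large enough here to hold a DVD image
  maximum_object_size 5 GB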


 The CERN Frontier network admins are pushing multiple-TB around via 
Squids. It sounds like they are a scale above what you want to do, but 
if you want operational experience with big data they could be the best 
people to talk to.





The alternatives I've considered so far include using a distributed
file system such as Hadoop, deploying a private cloud storage system to
communicate between the servers, or even using BitTorrent to share the
files among the servers. Any comments on these alternatives too?


No opinion on them as such. AFAIK these don't really seem to be in the 
same type of service area as Squid.


If you are after distributed _storage_, then Squid is definitely not the 
right solution.


 Squid's design is more about fast delivery of the data than about 
storage. Caches acting as distributed stores is a side effect of that 
model being very efficient for delivery, rather than any deliberate 
effort to spread the locations of things. Cache storage is fundamentally 
a giant /tmp directory: persistent, but liable to be erased at any given 
second. A chunk of it is often found only in volatile RAM too.
 BitTorrent is perhaps the closest, in being delivery-oriented rather 
than storage-oriented, with one authority source and a hierarchy of 
intermediaries doing the delivery. That's where the similarities end, as 
well.



If what you are after is a scalable delivery mechanism that can 
minimize bandwidth consumption, Squid is definitely an option there.


  You can layer a whole distributed background set of storage servers 
behind a gateway layer of Squid, using the various peering algorithms 
and ACL rules for source selection.
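
A rough sketch of what that gateway layer could look like in squid.conf; 
the hostnames, port and domain are made up for illustration, and the 
peering options shown are just one possible combination:

  # accept client requests as an accelerator / gateway
  http_port 80 accel defaultsite=files.example.internal

  # two hypothetical background storage servers exporting files over HTTP
  cache_peer store1.example.internal parent 80 0 no-query originserver round-robin name=store1
  cache_peer store2.example.internal parent 80 0 no-query originserver round-robin name=store2

  # only forward requests for our own file service, and only to those peers
  acl our_files dstdomain files.example.internal
  cache_peer_access store1 allow our_files
  cache_peer_access store2 allow our_files
  http_access allow our_files
  http_access deny all

With something like that in front, repeat requests for a hot file are 
served from the gateway's own cache, and only misses travel back to the 
storage layer.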


 Those background-layer servers can in turn use any of the 
storage-oriented methods you mention to actually store the content, if 
they still need to scale, with web services providing the files as HTTP 
objects from each location to the Squid layer.
 Wikimedia have some nice CDN network diagrams published if you want to 
see what I mean: http://meta.wikimedia.org/wiki/Wikimedia_servers


Sorry, I talked you round in a circle there, but I hope it's of some 
help, at least as to where and whether Squid can fit into things for you.


Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.14
  Beta testers wanted for 3.2.0.10


[squid-users] How does squid behave when caching really large files (GBs)

2011-08-16 Thread Thiago Moraes
Hello everyone,

I currently have a server which stores many terabytes of rather static
files, each one having tenths of gigabytes. Right now, these files are
only accessed over a local connection, but this is going to change
before long. One option to keep access acceptable is to deploy new
servers at the places that will access these files the most. Each new
server would keep a copy of the most accessed files, so that only a LAN
connection is needed instead of spending bandwidth on external access.

I'm considering almost any solution for these new hosts, and one of them
is simply using a caching tool like Squid to make the downloads faster,
but as I haven't seen anyone caching files this big, I would like to
know which problems I may run into if I adopt this kind of solution.

The alternatives I've considered so far include using a distributed
file system such as Hadoop, deploying a private cloud storage system to
communicate between the servers, or even using BitTorrent to share the
files among the servers. Any comments on these alternatives too?

thank you all,

Thiago Moraes - EnC 07 - UFSCar