On 20/08/11 03:59, Thiago Moraes wrote:
I meant files ranging from 100MB to 30GB, but mostly above the
10GB mark, so that's the size of my problem. I saw the CERN case
on Squid's homepage, but their files were at most 150MB, according
to the paper. I'll try to learn a little more from their case, though.


Oh dear. Files above 2GB each can be expected to cause some problems with those older installs of Squid. The cache accounting goes wrong for objects that large, with various side effects. The other admin will hopefully have worked around this by limiting their cache sizes already, so any problems noticed should be small. But nobody can guarantee that.


They really are not in the same area of Squid. The question is that I
have to make it less painful to download huge files and try to avoid
using a WAN. Having a server inside a LAN connection makes more sense in
my head, but I don't have limitations as the project is fresh and is
entirely in my hands. I can develop something in a layer above my
system (which would run on my "main" server) such as Squid, or I can
make every place have its own system deployed. In the latter case, I
would need a way to share files between multiple running instances of
the same program, and a distributed file system made more sense to me.
(I don't know if I made myself clear here. English is not my first
language, so if it's a little messy, don't hesitate to ask me again.)

The problem with the architecture of multiple instances of my system
sharing files (which could even be done via rsync or something else) is
that the main database has more than 40TB of data. Its copies may not
have all this space available, and I would need to find a way to choose
which files will reside on each server (and how that changes over
time). To me, this seems to be the kind of problem a cache server is
able to solve, and it would save a lot of effort. Is this viable?

Squid certainly should be able to solve the problem of selecting the best source when something is needed. How well it does will depend on how "hot" your objects are, i.e. how much repeat traffic you get for each one. The more repeat traffic, the better Squid works. You can measure this from your existing logs to get a rough idea of whether Squid would be useful.
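
As a rough illustration of that measurement, here is a minimal Python sketch. It assumes a simplified, hypothetical log format of one request per line, "<bytes> <url>", which is not the format of any particular server, so the parsing would need adapting to whatever your existing logs contain. The function name repeat_traffic_stats is made up for this example. It reports what fraction of requests, and of bytes, were for objects that had already been requested, which is roughly the upper bound on what a cache could have served.

```python
from collections import Counter


def repeat_traffic_stats(lines):
    """Summarise repeat traffic from lines of the form '<bytes> <url>'.

    The log format is an assumption made for this sketch; adapt the
    parsing to the real logs.
    """
    hits = Counter()   # number of requests seen per URL
    sizes = {}         # last seen size per URL, in bytes
    total_bytes = 0

    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue   # skip lines that do not match the assumed format
        try:
            size = int(parts[0])
        except ValueError:
            continue   # skip lines whose size field is not a number
        url = parts[1]
        hits[url] += 1
        sizes[url] = size
        total_bytes += size

    requests = sum(hits.values())
    unique_bytes = sum(sizes.values())
    return {
        "requests": requests,
        "unique_objects": len(hits),
        # Share of requests that asked for an already-seen object.
        "request_repeat_ratio":
            (requests - len(hits)) / requests if requests else 0.0,
        # Share of transferred bytes that a perfect cache could have saved.
        "byte_repeat_ratio":
            (total_bytes - unique_bytes) / total_bytes if total_bytes else 0.0,
    }


if __name__ == "__main__":
    sample = ["100 /a", "100 /a", "50 /b", "100 /a"]
    print(repeat_traffic_stats(sample))
```

With files this large the byte ratio matters more than the request ratio: a handful of repeated 10GB downloads can dominate the WAN traffic even if most requests are for objects fetched only once.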


I hope I have made my problem a little clearer now. Do you have any
more thoughts to share? And thanks for your time, Amos, it helped me
and I appreciate your help.

You are welcome. Big data projects are few and far between. Always kind of interesting to hear and think about :)

Amos
--
Please be using
  Current Stable Squid 2.7.STABLE9 or 3.1.14
  Beta testers wanted for 3.2.0.10
