Re: [squid-users] New StoreID helper: squid_dedup

2016-05-16 Thread Hans-Peter Jansen
Hi Eliezer,

Thanks for your feedback, much appreciated, /especially/ from you.

The most important part is in dedup.py. I've kept an eye on efficiency without 
sacrificing readability (much) or extensibility:

https://github.com/frispete/squid_dedup/blob/master/squid_dedup/dedup.py
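For readers unfamiliar with the StoreID interface: squid writes one request per line to the helper's stdin (with concurrency enabled, each line is prefixed with a channel-ID) and expects an `OK store-id=...` or `ERR` reply. A minimal, hypothetical sketch of such a helper loop - not the actual dedup.py code, and with a made-up rewrite rule - might look like:

```python
import re
import sys

# Hypothetical mirror pattern -> canonical store-id (not the real config)
RULES = [(re.compile(r'http://[a-z0-9]+\.opensuse\.org/(.*)'),
          r'http://download.opensuse.org.squid.internal/\1')]

def rewrite(line):
    # With concurrency enabled, squid prefixes each line with a channel-ID
    channel, url = line.split()[:2]
    for pat, repl in RULES:
        if pat.match(url):
            return f'{channel} OK store-id={pat.sub(repl, url)}'
    return f'{channel} ERR'

def main():
    for line in sys.stdin:
        sys.stdout.write(rewrite(line.strip()) + '\n')
        sys.stdout.flush()  # squid reads replies unbuffered

if __name__ == '__main__':
    main()
```

The flush after every reply matters: squid waits for each answer, so a buffered helper would stall the proxy.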

A big part of the rest is related to configuration management, which tries to 
maximize convenience (as many config files as desired, an automatic reload 
option on changes, etc.).
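The automatic-reload behavior could be approximated by polling config file mtimes; a simplified, hypothetical sketch (not the actual implementation):

```python
import os

class ConfigWatcher:
    """Tracks a set of config files and reports when any of them changed."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.mtimes = {p: self._mtime(p) for p in self.paths}

    @staticmethod
    def _mtime(path):
        try:
            return os.stat(path).st_mtime
        except FileNotFoundError:
            return None  # a vanished file also counts as a change

    def changed(self):
        # Returns True (and refreshes the snapshot) if any file changed
        current = {p: self._mtime(p) for p in self.paths}
        if current != self.mtimes:
            self.mtimes = current
            return True
        return False
```

A main loop would call `changed()` periodically and re-read all config files when it returns True.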

Depending on public interest, it would be cool to create a public CDN 
collection that is shared among users, or even distributed automatically.

Pete

On Monday, 16 May 2016 03:44:29 Eliezer Croitoru wrote:
> Thanks for sharing!
> 
> I didn't have enough time to understand the tool structure, since I am not
> a Python expert, but
> this is the first squid helper I have seen which is based on Python and
> implements concurrency.
> 
> Thanks!!
> Eliezer Croitoru
> 
> On 10/05/2016 00:56, Hans-Peter Jansen wrote:
> > Hi,
> > 
> > I'm pleased to announce the availability of squid_dedup, a helper for
> > deduplicating CDN accesses, implementing the squid 3 StoreID protocol.
> > 
> > It is a multi-threaded tool, written in python3, with no further
> > dependencies, hosted at: https://github.com/frispete/squid_dedup
> > available at: https://pypi.python.org/pypi/squid-dedup
> > 
> > For openSUSE users, a ready-made RPM package is available here:
> > https://build.opensuse.org/package/show/home:frispete:python3/squid_dedup
> > 
> > Any feedback is greatly appreciated.
> > 
> > Cheers,
> > Pete

___
squid-users mailing list
squid-users@lists.squid-cache.org
http://lists.squid-cache.org/listinfo/squid-users


Re: [squid-users] Getting the full file content on a range request, but not on EVERY get ...

2016-05-12 Thread Hans-Peter Jansen
On Friday, 13 May 2016 01:09:39 Yuri Voinov wrote:
> I suggest it is a very bad idea to transform a caching proxy into a Linux
> distro archive or something similar.

Yuri, if I wanted an archive, I would mirror all the stuff and use local repos. 
I went that route for a long time - it's a lot of work to keep up everywhere, 
and it generates an awful amount of traffic (and I did it the sanest way 
possible - with a custom script that was using rsync..).

> As Amos said, "Squid is a cache, not an archive".

Yes, updating 20 similar machines makes a significant difference with squid as 
a deduplicating cache - with no recurring work at all.

Pete


Re: [squid-users] Getting the full file content on a range request, but not on EVERY get ...

2016-05-12 Thread Hans-Peter Jansen
Hi Heiler,

On Thursday, 12 May 2016 13:28:00 Heiler Bemerguy wrote:
> Hi Pete, thanks for replying... let me see if I got it right..
> 
> Will I need to specify every url/domain I want it to act on? I want
> squid to do it for every range-request download that should/would be
> cached (based on other rules, refresh_pattern rules, etc.)

Yup, that's right. At least, that's the common approach to dealing with CDNs.
I think that disallowing range requests is too drastic to work well in the 
long run, but let us know if you reach a satisfactory solution this way.

> It doesn't need to delay any downloads as long as it isn't a dupe of
> what's already being downloaded.

You can set the delay to zero, of course.

This is only one side of the issues with CDNs. The other, more problematic 
side is that many servers with different URLs provide the same files.
Every new address will result in a new download of otherwise identical 
content.
 
Here's an example of openSUSE:

#
# this file was generated by gen_openSUSE_dedups
# from http://mirrors.opensuse.org/list/all.html
# with timestamp Thu, 12 May 2016 05:30:18 +0200
#
[openSUSE]
match:
# openSUSE Headquarter
http\:\/\/[a-z0-9]+\.opensuse\.org\/(.*)
# South Africa (za)
http\:\/\/ftp\.up\.ac\.za\/mirrors\/opensuse\/opensuse\/(.*)
# Bangladesh (bd)
http\:\/\/mirror\.dhakacom\.com\/opensuse\/(.*)
http\:\/\/mirrors\.ispros\.com\.bd\/opensuse\/(.*)
# China (cn)
http\:\/\/mirror\.bjtu\.edu\.cn\/opensuse\/(.*)
http\:\/\/fundawang\.lcuc\.org\.cn\/opensuse\/(.*)
http\:\/\/mirrors\.tuna\.tsinghua\.edu\.cn\/opensuse\/(.*)
http\:\/\/mirrors\.skyshe\.cn\/opensuse\/(.*)
http\:\/\/mirrors\.hust\.edu\.cn\/opensuse\/(.*)
http\:\/\/c\.mirrors\.lanunion\.org\/opensuse\/(.*)
http\:\/\/mirrors\.hustunique\.com\/opensuse\/(.*)
http\:\/\/mirrors\.sohu\.com\/opensuse\/(.*)
http\:\/\/mirrors\.ustc\.edu\.cn\/opensuse\/(.*)
# Hong Kong (hk)
http\:\/\/mirror\.rackspace\.hk\/openSUSE\/(.*)
# Indonesia (id)
http\:\/\/mirror\.linux\.or\.id\/linux\/opensuse\/(.*)
http\:\/\/buaya\.klas\.or\.id\/opensuse\/(.*)
http\:\/\/kartolo\.sby\.datautama\.net\.id\/openSUSE\/(.*)
http\:\/\/opensuse\.idrepo\.or\.id\/opensuse\/(.*)
http\:\/\/mirror\.unej\.ac\.id\/opensuse\/(.*)
http\:\/\/download\.opensuse\.or\.id\/(.*)
http\:\/\/repo\.ugm\.ac\.id\/opensuse\/(.*)
http\:\/\/dl2\.foss\-id\.web\.id\/opensuse\/(.*)
# Israel (il)
http\:\/\/mirror\.isoc\.org\.il\/pub\/opensuse\/(.*)

[...] -> this list contains about 180 entries

replace: http://download.opensuse.org.%(intdomain)s/\1
# fetch all redirected objects explicitly
fetch: true
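Conceptually, each match pattern maps a mirror URL onto the single replace template, so all mirrors collapse to one cache key. A rough Python illustration of that mapping, using two of the patterns above and a hypothetical intdomain value (the real helper reads both from its config files):

```python
import re

# Hypothetical internal-domain setting; the real value comes from the config
INTDOMAIN = 'squid.internal'

# Two of the ~180 mirror patterns from the example above
PATTERNS = [
    r'http://mirrors\.tuna\.tsinghua\.edu\.cn/opensuse/(.*)',
    r'http://mirror\.rackspace\.hk/openSUSE/(.*)',
]
REPLACE = rf'http://download.opensuse.org.{INTDOMAIN}/\1'

def store_id(url):
    # First matching pattern wins; all mirrors yield the same store-id
    for pat in PATTERNS:
        if re.match(pat, url):
            return re.sub(pat, REPLACE, url)
    return None  # unknown URL: squid caches it under its original key

# Two different mirrors of the same file map to one cache key:
a = store_id('http://mirrors.tuna.tsinghua.edu.cn/opensuse/repo/oss/x.rpm')
b = store_id('http://mirror.rackspace.hk/openSUSE/repo/oss/x.rpm')
# a == b == 'http://download.opensuse.org.squid.internal/repo/oss/x.rpm'
```

The `.squid.internal` suffix keeps the synthetic store-id out of the real DNS namespace, so it can never collide with a genuine origin server.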


This is how CDNs work, but it's a nightmare for caching proxies.
In such scenarios, squid_dedup comes to the rescue.

Cheers,
Pete


Re: [squid-users] Getting the full file content on a range request, but not on EVERY get ...

2016-05-12 Thread Hans-Peter Jansen
On Wednesday, 11 May 2016 21:37:17 Heiler Bemerguy wrote:
> Hey guys,
> 
> First take a look at the log:
> 
> root@proxy:/var/log/squid# tail -f access.log |grep
> http://download.cdn.mozilla.net/pub/firefox/releases/45.0.1/update/win32/pt-BR/firefox-45.0.1.complete.mar
> 1463011781.572   8776 10.1.3.236 TCP_MISS/206 300520 GET
[...] 
> Now think: A user is just doing a segmented/ranged download, right?
> Squid won't cache the file because it is a range-download, not a full
> file download.
> But I WANT squid to cache it. So I decide to use "range_offset_limit
> -1", but then on every GET squid will re-download the file from the
> beginning, opening LOTs of simultaneous connections and using too much
> bandwidth, doing just the OPPOSITE it's meant to!
> 
> Is there a smart way to allow squid to download it from the beginning to
> the end (to actually cache it), but only on the FIRST request/get? Even
> if it makes the user wait for the full download, or cancel it
> temporarily, or.. whatever!! Anything!!

Well, this is exactly what my squid_dedup helper was created for!

See my announcement: 

Subject: [squid-users] New StoreID helper: squid_dedup
Date: Mon, 09 May 2016 23:56:45 +0200

My openSUSE environment is fetching _all_ updates with byte ranges from many 
servers. Therefore, I created squid_dedup.

Your specific config could look like this:

/etc/squid/dedup/mozilla.conf:
[mozilla]
match: http\:\/\/download\.cdn\.mozilla\.net/(.*)
replace: http://download.cdn.mozilla.net.%(intdomain)s/\1
fetch: true
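On the squid side, a StoreID helper is wired in via the StoreID directives in squid.conf. A plausible minimal setup - the helper path and child counts here are assumptions, adjust them to your installation:

```
# squid.conf excerpt (path and child counts are assumptions)
store_id_program /usr/local/bin/squid_dedup
store_id_children 5 startup=1 idle=1 concurrency=10
```

Since squid_dedup implements concurrency, a nonzero concurrency value lets a single helper process serve many lookups in flight.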

The fetch parameter is unique among StoreID helpers (AFAIK): after a certain 
delay, it fetches the object with a pool of fetcher threads.

The idea is: after the first access to an object, wait a bit (global setting, 
default: 15 secs), and then fetch the whole thing once. It won't solve 
anything for the first client, but it will for all subsequent accesses. 

The fetcher avoids fetching anything more than once by checking the HTTP 
headers.
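The delayed-fetch idea - wait a bit, then retrieve the object exactly once - can be sketched roughly like this. The queue layout and the deduplication-by-set are assumptions about the general approach, not the actual squid_dedup code:

```python
import queue
import threading
import time

FETCH_DELAY = 15  # global default delay in seconds, per the description above

class Fetcher:
    def __init__(self, workers=4, delay=FETCH_DELAY):
        self.delay = delay
        self.queue = queue.Queue()
        self.seen = set()            # URLs already queued or fetched
        self.lock = threading.Lock()
        for _ in range(workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def schedule(self, url):
        # Only the first request for a URL triggers a full fetch
        with self.lock:
            if url in self.seen:
                return False
            self.seen.add(url)
        self.queue.put((time.monotonic() + self.delay, url))
        return True

    def _worker(self):
        while True:
            due, url = self.queue.get()
            wait = due - time.monotonic()
            if wait > 0:
                time.sleep(wait)  # give the first client's range request a head start
            self._fetch(url)

    def _fetch(self, url):
        # Real code would issue a full GET through squid (checking the HTTP
        # headers first) so the complete object lands in the cache; stubbed here.
        pass
```

The delay gives the client's initial range request time to finish before the full-object fetch competes for bandwidth.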

This is a pretty new project, but rest assured that the basic functions are 
working fine, and I will do my best to solve any upcoming issues. It is 
implemented in Python 3 and prepared to support additional features easily, 
while keeping a good part of an eye on efficiency.

Let me know if you're going to try it.

Pete


[squid-users] New StoreID helper: squid_dedup

2016-05-09 Thread Hans-Peter Jansen
Hi,

I'm pleased to announce the availability of squid_dedup, a helper for 
deduplicating CDN accesses, implementing the squid 3 StoreID protocol.

It is a multi-threaded tool, written in python3, with no further dependencies,
hosted at: https://github.com/frispete/squid_dedup
available at: https://pypi.python.org/pypi/squid-dedup

For openSUSE users, a ready-made RPM package is available here:
https://build.opensuse.org/package/show/home:frispete:python3/squid_dedup

Any feedback is greatly appreciated.

Cheers,
Pete