Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-09-25 Thread tevador
On Tue, Sep 22, 2020 at 2:10 PM George Kadianakis  wrote:
>
> if you have a GPU-enabled box, so that we can get some benchmarks
> from GPUs as well
>

Someone would have to write a GPU benchmark for that. My code is CPU-only.

>
>   tevador do you have the graphing code somewhere so that I can run the
>   experiments again and see how the graphs are influenced?
>

I've uploaded the gnuplot script I used to generate the graphs here:
https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.plt

You will need to modify the script with your path to the simulation output file.

On Thu, Sep 24, 2020 at 6:54 PM Jim Newsome  wrote:
>
> I stumbled across some weird artifacts when using more threads than
> processors: the benchmark reports solutions/sec continuing to increase
> linearly with #threads. The wall-clock time for the benchmark itself
> (measured with `time`) show the expected trend though of linear scaling
> only up to 4 (the number of physical cores), a little bump at 8 (using
> the hyperthreaded virtual cores), and no improvement past that.
>

Good catch. There was a bug in the time measurement code in the
benchmark. Should be fixed now in the master branch.


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-09-24 Thread Jim Newsome

On 9/22/20 07:10, George Kadianakis wrote:
> George Kadianakis  writes:
>
>> tevador  writes:
>>
>>> Hi all,
>>>
> Hello,
>
> I have pushed another update to the PoW proposal here:
>   https://github.com/asn-d6/torspec/tree/pow-over-intro
> I also (finally) merged it upstream to torspec as proposal #327:
>   https://github.com/torproject/torspec/blob/master/proposals/327-pow-over-intro.txt
>
> The most important improvements are:
> - Add tevador as an author.
> - Update PoW algorithms based on tevador's Equix feedback.
> - Update effort estimation algorithm based on tevador's simulation.
> - Include hybrid attack section.
> - Remove a bunch of blocker tags.
>
> Two things I'd like to work more on:
>
> - I'd like people to take tevador's Equix PoW function and run it on
>   their boxes and post back benchmarks of how it performed.

I shared some results privately with George and he suggested including
the list. Results below.

> Particularly
>   so if you have a GPU-enabled box, so that we can get some benchmarks
>   from GPUs as well. That will help us tune the proposal even more.

For anyone else following along or also contributing benchmarks, George
clarified for me that the equix benchmark isn't capable of utilizing the
GPU.

My results:

First results are on my w530, i7, 4 core (hyperthreaded to 8) laptop
(with moderate activity in the background).

I stumbled across some weird artifacts when using more threads than
processors: the benchmark reports solutions/sec continuing to increase
linearly with #threads. The wall-clock time for the benchmark itself
(measured with `time`) show the expected trend though of linear scaling
only up to 4 (the number of physical cores), a little bump at 8 (using
the hyperthreaded virtual cores), and no improvement past that.

Further below are results on my pinephone.
   
$ time ./equix-bench --threads 1
    Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 1) ...
    1.91 solutions/nonce
    227.714446 solutions/sec. (1 thread)
    20301.439170 verifications/sec. (1 thread)

    real    0m4.242s
    user    0m4.230s
    sys    0m0.012s

$ time ./equix-bench --threads 2
    Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 2) ... 
    1.91 solutions/nonce
    450.100153 solutions/sec. (2 threads)
    17925.519934 verifications/sec. (1 thread)

    real    0m2.184s
    user    0m4.294s
    sys    0m0.004s

$ time ./equix-bench --threads 4
    Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 4) ... 
    1.91 solutions/nonce
    876.343564 solutions/sec. (4 threads)
    18863.079719 verifications/sec. (1 thread)

    real    0m1.154s
    user    0m4.400s
    sys    0m0.012s

$ time ./equix-bench --threads 8
    Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 8) ... 
    1.91 solutions/nonce
    1089.198671 solutions/sec. (8 threads)
    17808.857809 verifications/sec. (1 thread)

    real    0m0.981s
    user    0m7.019s
    sys    0m0.052s

$ time ./equix-bench --threads 16
    Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 16) ... 
    1.91 solutions/nonce
    2183.232035 solutions/sec. (16 threads)
    18936.014118 verifications/sec. (1 thread)

    real    0m1.025s
    user    0m7.021s
    sys    0m0.032s

$ time ./equix-bench --threads 32
    Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 32) ... 
    1.91 solutions/nonce
    4397.259598 solutions/sec. (32 threads)
    17754.229411 verifications/sec. (1 thread)

    real    0m1.026s
    user    0m6.961s
    sys    0m0.049s

$ cat /proc/cpuinfo
    
    processor    : 7
    vendor_id    : GenuineIntel
    cpu family    : 6
    model    : 58
    model name    : Intel(R) Core(TM) i7-3740QM CPU @ 2.70GHz
    stepping    : 9
    microcode    : 0x21
    cpu MHz    : 1856.366
    cache size    : 6144 KB
    physical id    : 0
    siblings    : 8
    core id    : 3
    cpu cores    : 4
    apicid    : 7
    initial apicid    : 7
    fpu    : yes
    fpu_exception    : yes
    cpuid level    : 13
    wp    : yes
    flags    : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe
syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl
xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic
popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault
epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid
fsgsbase smep erms xsaveopt dtherm ida arat pln pts md_clear flush_l1d
    bugs    : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass
l1tf mds swapgs itlb_multihit srbds
    bogomips    : 5387.48
    clflush size    : 64
    cache_alignment    : 64
    address sizes    : 36 bits physical, 48 bits virtual
    power management:

Similar behavior on the (4-core aarch64) pinephone:
   
$ time ./equix-bench --threads 1
    

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-09-22 Thread George Kadianakis
George Kadianakis  writes:

> tevador  writes:
>
>> Hi all,
>>

Hello,

I have pushed another update to the PoW proposal here:
  https://github.com/asn-d6/torspec/tree/pow-over-intro
I also (finally) merged it upstream to torspec as proposal #327:
  https://github.com/torproject/torspec/blob/master/proposals/327-pow-over-intro.txt

The most important improvements are:
- Add tevador as an author.
- Update PoW algorithms based on tevador's Equix feedback.
- Update effort estimation algorithm based on tevador's simulation.
- Include hybrid attack section.
- Remove a bunch of blocker tags.

Two things I'd like to work more on:

- I'd like people to take tevador's Equix PoW function and run it on
  their boxes and post back benchmarks of how it performed. Particularly
  so if you have a GPU-enabled box, so that we can get some benchmarks
  from GPUs as well. That will help us tune the proposal even more.

  For my laptop (with an Intel CPU i7-8550U CPU @ 1.80GHz) I got pretty
  accurate benchmarks (compared to 
https://github.com/tevador/equix#performance):
  $ ./equix-bench 
 Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 1) ...
 1.91 solutions/nonce
 283.829505 solutions/sec. (1 thread)
 22810.327943 verifications/sec. (1 thread)
  $ ./equix-bench --threads 16
 Solving nonces 0-499 (interpret: 0, hugepages: 0, threads: 16) ...
 1.91 solutions/nonce
 2296.585708 solutions/sec. (16 threads)
 20223.196324 verifications/sec. (1 thread)

  See how to do this here: https://github.com/tevador/equix#build

- I'd like to improve the effort estimation algorithm by dynamically adjusting
  SVC_BOTTOM_CAPACITY instead of having it as a static value. Otherwise, I
  would like to reduce the currently suggested SVC_BOTTOM_CAPACITY because I
  feel that 180 is too big. I would like to put it to 100 which is much more
  conservative.  I tried to do so while updating tevador's simulation
  accordingly, but I found out that the simulation code does not do the graphs
  itself, so I didn't make much progress here.

  tevador do you have the graphing code somewhere so that I can run the
  experiments again and see how the graphs are influenced?

Apart from that, I think the proposal is really solid. I have hence merged it
as proposal #327 to torspec and further revisions can be done on top of that
from now on.

Thanks for all the work here and I'm looking forward to further feedback!


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-08-26 Thread George Kadianakis
tevador  writes:

> Hi all,
>

Hello tevador,

thanks so much for your work here and for the great simulation. Also for
the hybrid attack which was definitely missing from the puzzle.

I've been working on a further revision of the proposal based on your
comments. I have just one small question I would like your feedback on.

>> 3.4.3. PoW effort estimation [EFFORT_ESTIMATION]
>> {XXX: BLOCKER: Figure out if this system makes sense}
>
> I wrote a simple simulation in Python to test different ways of
> adjusting the suggested effort. The results are here:
> https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.md
>
> In summary, I suggest using MIN_EFFORT = 1000 and the following
> algorithm to calculate the suggested effort:
>
> 1. Sum the effort of all valid requests that have been received since the
>last HS descriptor update. This includes all handled requests, trimmed
>requests and requests still in the queue.
> 2. Divide the sum by the max. number of requests that the service could have
>handled during that time (SVC_BOTTOM_CAPACITY * HS_UPDATE_PERIOD).
> 3. Suggested effort = max(MIN_EFFORT, result)
>
> This algorithm can both increase and reduce the suggested effort.
>

I like the above logic, but I'm wondering how we can get the real
SVC_BOTTOM_CAPACITY for every scenario. In particular, the
SVC_BOTTOM_CAPACITY=180 value from 6.2.2 might have been true for
David's testing but it will not be true for every computer and every
network.

I wonder if we can adapt the above effort estimation algorithm to use an
initial SVC_BOTTOM_CAPACITY magic value for the first run (let's say
180), but then derive the real SVC_BOTTOM_CAPACITY of the host in
runtime and use that for subsequent runs of the algorithm.
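
Something like an exponential moving average of the observed bottom-half
throughput, updated only while the queue is non-empty (i.e. while the service
is actually saturated), is the shape I have in mind. A rough Python sketch,
with purely illustrative parameters:

    class CapacityEstimator:
        """EMA of observed bottom-half throughput, seeded with the static
        magic value and updated only while the service is saturated."""
        def __init__(self, initial=180.0, alpha=0.1):
            self.capacity = initial   # start from the magic value
            self.alpha = alpha        # smoothing factor, a pure guess

        def observe(self, handled_last_second, queue_was_nonempty):
            # only saturated seconds tell us anything about true capacity
            if queue_was_nonempty:
                self.capacity += self.alpha * (handled_last_second - self.capacity)
            return self.capacity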

Do you think this is possible?


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-07-01 Thread David Goulet
On 22 Jun (17:52:44), George Kadianakis wrote:
> Hello there,
> 
> here is another round of PoW revisions:
>  https://github.com/asn-d6/torspec/tree/pow-over-intro
> I'm inlining the full proposal in the end of this email.
> 
> Here is a changelog:
> - Actually used tevador's EquiX scheme as our PoW scheme for now. This is 
> still
>   tentative, but I needed some ingredients to cook with so I went for it.
> - Fold in David's performance measurements and use them to get some
>   guesstimates on the default PoW difficulty etc.
> - Enable overlapping seed system.
> - Enrich the attack section of the proposal some more.
> - Attempt to fix an effort estimation attack pointed out by tevador.
> - Added a bunch of "BLOCKER" tags around the proposal for things that we need
>   to figure out or at least have some good intuition if we want to have
>   guarantees that the proposal can work before we start implementing.
> 
> Here is what needs to happen next:
> 
> - David's performance measurements have been really useful, but they open a
>   bunch of questions on auxiliary overheads. We are now performing more
>   experiments to confirm the performance numbers we got and make sure we are
>   not overshooting. I noted these issues down as BLOCKER in the proposal.
>   While doing so we also found a pretty serious bug with our scheduler that we are
>   trying to fix:
>  https://gitlab.torproject.org/tpo/core/tor/-/issues/40006

[snip]

(For the record)

Ok now that this bug has been fixed here are the new numbers. The time per
INTRO2 cell, on average, is the same as in the proposal.

Big difference is that Tor is now handling on average ~15 cells per mainloop
round during heavy DDoS. It is 15 and not 32 (theoretical limit) because the
service also handles a lot of DESTROY cells due to the rendezvous circuit
failing but also due to some seconds where no cells are processed because tor
is busy doing other things.

We've also confirmed that the theoretical value of 180 requests per second in
the proposal is valid. During heavy DDoS, we've observed 165 cells per second
on average (after removing a few outliers, since tor has other events that
sometimes prevent cell processing for 1-3 seconds).

We've observed peak rates of 185 cells/second, so the 180 number holds here imo.

Cheers!
David



Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-07-01 Thread tevador
Hi all,

On Mon, Jun 22, 2020 at 4:52 PM George Kadianakis  wrote:
> Also, tevador let me know if you'd like me to add you as a co-author on the
> proposal based on all your great feedback so far.

Thanks for the update. Yes, you can add me as a co-author.

> During our performance measurements in [TOR_MEASUREMENTS] we learned that
> the "top half" takes about 0.26 msecs in average, without doing any sort
> of PoW verification.

Interesting. This confirms the need for fast PoW verification. Equi-X
takes about 0.05 ms to verify, so each top-half pass costs about
0.26 + 0.05 = 0.31 ms and the top-half throughput should still be over
3000 requests per second.

> Our scheduler is pretty limited by the fact that Tor has a single-threaded
> design.

If both the top and bottom halves are processed by the same thread,
this opens up the possibility of a "hybrid" attack. Given the
performance figures for the top half (0.31 ms/req.) and the bottom
half (5.5 ms/req.), the attacker can optimally deny service by
submitting 91 high-effort requests and 1520 invalid requests per
second. This will completely saturate the main loop because:

0.31 ms * (1520 + 91) ~ 0.5 sec.
5.5 ms * 91 ~ 0.5 sec.

This attack only has half the bandwidth requirement of
[ATTACK_TOP_HALF] and half the compute requirement of
[ATTACK_BOTTOM_HALF].

Alternatively, the attacker can adjust the ratio between invalid and
high-effort requests depending on their bandwidth and compute
capabilities.
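
As a quick sanity check of these figures, here is a small Python sketch that
reproduces the mix above from the two per-request costs, with one second of
single-threaded main loop time as the budget:

    TOP_HALF_MS = 0.31     # per request, including Equi-X verification
    BOTTOM_HALF_MS = 5.5   # per valid high-effort request

    def saturating_mix(split=0.5):
        # valid high-effort requests filling `split` of each second
        # in the bottom half
        high = split * 1000 / BOTTOM_HALF_MS
        # every request (valid or not) also costs top-half time, so
        # invalid requests fill the remaining (1 - split) of the second
        invalid = (1 - split) * 1000 / TOP_HALF_MS - high
        return round(high), round(invalid)

    print(saturating_mix())   # (91, 1522), essentially the mix above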

> {TODO: BLOCKER: What's the max size of the queue? Do we trim it, or do
> we just stop adding new requests when it reaches max size? Can we use WRED?
> Trimming is currently used EFFORT_ESTIMATION, so if we don't do it we need
> to find a different way to estimate effort. See tevador's [REF_TEVADOR_2]
> email.}

The simplest approach is to have a "soft" maximum size of the queue.
All requests with valid PoW can be added to the queue, but once per
second, the queue is trimmed to the "soft" max size by removing
timed-out and low-effort requests. I used this approach in my
simulations (see below).
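
A sketch of that trimming pass, assuming each queued entry carries its effort
and arrival time (the cap and timeout values are illustrative):

    import time

    SOFT_MAX = 5000          # illustrative soft cap
    REQUEST_TIMEOUT = 30.0   # illustrative client patience, in seconds

    def trim_queue(queue, now=None):
        """Run once per second: drop timed-out entries, then keep only
        the SOFT_MAX highest-effort ones. Entries are tuples of
        (effort, arrival_time, request)."""
        now = time.time() if now is None else now
        alive = [e for e in queue if now - e[1] < REQUEST_TIMEOUT]
        alive.sort(key=lambda e: e[0], reverse=True)  # highest effort first
        return alive[:SOFT_MAX]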

> The EXT_FIELD content format is:
> POW_VERSION[1 byte]
> POW_NONCE  [16 bytes]
> POW_SEED   [4 bytes]

If we want to use Equi-X, you also need to add the solution (16 bytes)
and I also recommend adding the client's target effort.

POW_VERSION[1 byte]
POW_NONCE  [16 bytes]
POW_EFFORT [4 bytes]
POW_SEED   [4 bytes]
POW_SOLUTION   [16 bytes]

43 bytes total including the extension type and length.

The client's algorithm in section 3.2 should be modified to:

  a) Client selects a target effort E.
  b) Client generates a random 16-byte nonce N.
  c) Client derives seed C by decoding 'seed-b64'.
  d) Client calculates S = equix_solve(C || N || E)
  e) Client calculates R = blake2b(C || N || E || S)
  f) Client checks if R * E <= UINT32_MAX.
f1) If yes, success! The client can submit N, E, the first 4 bytes of C
and S.
f2) If no, fail! The client interprets N as a 16-byte little-endian
integer, increments it by 1 and goes back to step d).
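
For illustration, a minimal Python sketch of this client loop. Here
equix_solve is a hypothetical binding to the Equi-X library, R is assumed to
be the first 4 bytes of the blake2b output read as a little-endian integer
(consistent with the R * E <= UINT32_MAX check), and E is assumed to be
encoded as 4 little-endian bytes:

    import hashlib
    import os
    import struct

    UINT32_MAX = 2**32 - 1

    def client_solve(seed: bytes, effort: int):
        # `seed` is the full decoded seed C
        e_bytes = struct.pack('<I', effort)       # E, assumed 4-byte LE
        nonce = int.from_bytes(os.urandom(16), 'little')   # random N
        while True:
            n_bytes = nonce.to_bytes(16, 'little')
            challenge = seed + n_bytes + e_bytes  # C || N || E
            solution = equix_solve(challenge)     # hypothetical binding
            r = int.from_bytes(
                hashlib.blake2b(challenge + solution).digest()[:4], 'little')
            if r * effort <= UINT32_MAX:
                # success: submit N, E, the first 4 bytes of C, and S
                return n_bytes, effort, seed[:4], solution
            # fail: increment N as a 16-byte little-endian integer, retry
            nonce = (nonce + 1) % (1 << 128)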

We could also add the server algorithm in 3.4:

  a) Find a valid seed C that starts with POW_SEED. Fail if no such seed
 exists.
  b) Fail if E = POW_EFFORT is lower than the minimum effort.
  c) Fail if N = POW_NONCE is present in the replay cache.
  d) Calculate R = blake2b(C || N || E || S)
  e) Fail if R * E > UINT32_MAX
  f) Fail if equix_verify(C || N || E, S) != EQUIX_OK
  g) Put the request in the queue with a priority of E
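
The service side, continuing the sketch above (equix_verify and EQUIX_OK are
again hypothetical bindings; active_seeds and replay_cache are stand-in data
structures):

    def service_verify(pow_seed, nonce, effort, solution,
                       active_seeds, replay_cache, min_effort):
        # a) find the active seed whose first 4 bytes match POW_SEED
        seed = next((c for c in active_seeds if c[:4] == pow_seed), None)
        if seed is None:
            return False
        if effort < min_effort:      # b) minimum effort check
            return False
        if nonce in replay_cache:    # c) replay check
            return False
        challenge = seed + nonce + struct.pack('<I', effort)
        r = int.from_bytes(
            hashlib.blake2b(challenge + solution).digest()[:4], 'little')
        if r * effort > UINT32_MAX:  # d), e) cheap check before Equi-X
            return False
        if equix_verify(challenge, solution) != EQUIX_OK:   # f)
            return False
        replay_cache.add(nonce)
        return True   # g) caller enqueues the request with priority E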

> 3.4.3. PoW effort estimation [EFFORT_ESTIMATION]
> {XXX: BLOCKER: Figure out if this system makes sense}

I wrote a simple simulation in Python to test different ways of
adjusting the suggested effort. The results are here:
https://github.com/tevador/scratchpad/blob/master/tor-pow/effort_sim.md

In summary, I suggest using MIN_EFFORT = 1000 and the following
algorithm to calculate the suggested effort:

1. Sum the effort of all valid requests that have been received since the
   last HS descriptor update. This includes all handled requests, trimmed
   requests and requests still in the queue.
2. Divide the sum by the max. number of requests that the service could have
   handled during that time (SVC_BOTTOM_CAPACITY * HS_UPDATE_PERIOD).
3. Suggested effort = max(MIN_EFFORT, result)

This algorithm can both increase and reduce the suggested effort.
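
In code, the whole update amounts to something like this (a sketch; efforts
collects the effort of every valid request seen since the last update,
whether handled, trimmed or still queued, and hs_update_period is in
seconds):

    MIN_EFFORT = 1000

    def suggested_effort(efforts, svc_bottom_capacity, hs_update_period):
        total = sum(efforts)                                   # step 1
        max_handled = svc_bottom_capacity * hs_update_period   # step 2
        return max(MIN_EFFORT, total // max_handled)           # step 3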

My simulations also show that bottom-half attacks are not feasible
(even "The large botnet" cannot completely deny access to the
service). I think further research and testing should focus on
top-half attacks (or hybrid attacks).

> {XXX: Does this mean that this system can auto-enable and auto-disable the
> DoS subsystem with reasonable accuracy?}

I tried to make the effort estimation algorithm disable PoW
automatically (by setting suggested effort to 0), but it led to
oscillating behavior in the case of a sustained attack (i.e. once the
suggested effort is set to 0, clients cannot connect anymore because
the attacker can 

[tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-07-01 Thread johnymnx

Hello, list!
Some ideas about this proposal. These are somewhat surface-level ideas,
since I'm not currently digging deep into the internals of Tor core.


The proposal divides users into different classes ("The standard web
user", "The motivated user", "The mobile user"). I would add a
"Registered user": a user registered within the service. The service
keeps a fast-access DB (e.g. Redis) with a unique ID for each registered
user, which is handed out to that user. When the user wants to connect
to the HS, they enter this unique ID via TBB (similar to the stealth
auth introduced recently).


About "The mobile user", mobile user can be offered two options:
1. Go to Desktop
2. Compute series of reCAPTCHA
3. If mobile user is "The registered user", enter unique-id saved in 
some encrypted vault on the phone


Having registered and unregistered classes of users introduces another
issue. When the HS starts operating, no registered users exist yet; if a
DDoS starts as soon as the HS becomes available to the world,
unregistered/guest users will have no chance to register on the HS.
Thus, a hybrid solution is required, as described in 5.2 (Future
directions [FUTURE_WORK]): "PoW + Third-party anonymous credentials".
The HS must have a third-party human-puzzle service behind an anti-DDoS
operator (Cloudflare, etc.), where unregistered users can solve AI-proof
puzzles and get an access token to connect to the HS.


In the long run, when the HS's user base is saturated and the number of
new user registrations decays, the PoW for unregistered users can be
significantly increased, protecting the HS from huge DDoS attacks; it is
usually when an HS is popular, with a big and saturated user base, that
it becomes a target for huge DDoS attacks.


Verification of PoW
I haven't dug deep into the HS implementation, but it seems reasonable
for the decision about PoW verification to be made on the node closest
to the client (ClosestNode).


1. HS sends an encrypted PoW task to the client
2. Client computes the PoW
3. Client sends the PoW to ClosestNode
4. ClosestNode verifies the PoW (verifying a PoW is always much faster
than producing one)
5. Verification succeeded: approve the connection to the HS
6. Verification failed: disconnect this client

Finally, our proposal has a big benefit over the blockchain use cases: it's
much more agile. We can deploy changes to the PoW algorithm without having
to hard-fork, and we can do this even through the consensus for maximum
agility. This means that we should try to use this agility to our advantage.


A change of PoW algorithms can be made in the form of PoW CAP
(capability) flags: some version of a client supports PoW_CAP1,
PoW_CAP2, PoW_CAP3, while the HS supports PoW_CAP1, PoW_CAP2, so the two
can only interact over the intersection of their PoW_CAP sets. This is
similar to the crypto-capability exchange between an OpenVPN client and
server.
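
A toy sketch of that intersection, using integer capability ids for
illustration:

    def negotiate_pow(client_caps: set, service_caps: set):
        # both sides can only use PoW versions in the intersection of
        # their capability sets; pick the highest mutually supported one
        common = client_caps & service_caps
        return max(common) if common else None

    # client supports PoW_CAP1-3, HS supports PoW_CAP1-2: they agree on 2
    assert negotiate_pow({1, 2, 3}, {1, 2}) == 2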






Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-06-22 Thread George Kadianakis
Hello there,

here is another round of PoW revisions:
 https://github.com/asn-d6/torspec/tree/pow-over-intro
I'm inlining the full proposal in the end of this email.

Here is a changelog:
- Actually used tevador's EquiX scheme as our PoW scheme for now. This is still
  tentative, but I needed some ingredients to cook with so I went for it.
- Fold in David's performance measurements and use them to get some
  guesstimates on the default PoW difficulty etc.
- Enable overlapping seed system.
- Enrich the attack section of the proposal some more.
- Attempt to fix an effort estimation attack pointed out by tevador.
- Added a bunch of "BLOCKER" tags around the proposal for things that we need
  to figure out or at least have some good intuition if we want to have
  guarantees that the proposal can work before we start implementing.

Here is what needs to happen next:

- David's performance measurements have been really useful, but they open a
  bunch of questions on auxiliary overheads. We are now performing more
  experiments to confirm the performance numbers we got and make sure we are
  not overshooting. I noted these issues down as BLOCKER in the proposal.
  While doing so we also found a pretty serious bug with our scheduler that we are
  trying to fix:
 https://gitlab.torproject.org/tpo/core/tor/-/issues/40006
- Did not have time to think about the priority queue's max size. I added a
  BLOCKER about this in the [HANDLE_QUEUE] section.
- Did not have time to think about a minimum effort feature on the queue. I
  guess this also depends on the scheduler.
- Need to think more about the effort estimation logic and make sure that it
  can't backfire big time.
- Need to kill all the XXXs, TODOs and BLOCKERs.

Also, tevador let me know if you'd like me to add you as a co-author on the
proposal based on all your great feedback so far.

This is looking more and more plausible but let's wait for more data before we
seal the deal.

Thanks for all the feedback and looking forward to more!

---

Filename: xxx-pow-over-intro-v1
Title: A First Take at PoW Over Introduction Circuits
Author: George Kadianakis, Mike Perry, David Goulet
Created: 2 April 2020
Status: Draft

0. Abstract

  This proposal aims to thwart introduction flooding DoS attacks by introducing
  a dynamic Proof-Of-Work protocol that occurs over introduction circuits.

1. Motivation

  So far our attempts at limiting the impact of introduction flooding DoS
  attacks on onion services have focused on horizontal scaling with
  Onionbalance, optimizing the CPU usage of Tor and applying congestion control
  using rate limiting. While these measures move the goalpost forward, a core
  problem with onion service DoS is that building rendezvous circuits is a
  costly procedure both for the service and for the network. For more
  information on the limitations of rate-limiting when defending against DDoS,
  see [REF_TLS_1].

  If we ever hope to have truly reachable global onion services, we need to
  make it harder for attackers to overload the service with introduction
  requests. This proposal achieves this by allowing onion services to specify
  an optional dynamic proof-of-work scheme that its clients need to participate
  in if they want to get served.

  With the right parameters, this proof-of-work scheme acts as a gatekeeper to
  block amplification attacks by attackers while letting legitimate clients
  through.

1.1. Related work

  For a similar concept, see the three internet drafts that have been proposed
  for defending against TLS-based DDoS attacks using client puzzles [REF_TLS].

1.2. Threat model [THREAT_MODEL]

1.2.1. Attacker profiles [ATTACKER_MODEL]

  This proposal is written to thwart specific attackers. A simple PoW proposal
  cannot defend against each and every DoS attack on the Internet, but there
  are adversary models we can defend against.

  Let's start with some adversary profiles:

  "The script-kiddie"

The script-kiddie has a single computer and pushes it to its
limits. Perhaps they also have a VPS and a pwned server. We are talking about
an attacker with total access to 10 GHz of CPU and 10 GB of RAM. We
consider the total cost for this attacker to be zero dollars.

  "The small botnet"

The small botnet is a bunch of computers lined up to do an introduction
flooding attack. Assuming 500 medium-range computers, we are talking about
an attacker with total access to 10 THz of CPU and 10 TB of RAM. We consider
the upfront cost for this attacker to be about $400.

  "The large botnet"

The large botnet is a serious operation with many thousands of computers
organized to do this attack. Assuming 100k medium-range computers, we are
talking about an attacker with total access to 200 THz of CPU and 200 TB of
RAM. The upfront cost for this attacker is about $36k.

  We hope that this proposal can help us defend against the script-kiddie
  attacker and small botnets. To defend against a 

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-06-10 Thread tevador
Hi George,

Thanks for the update.

On Wed, Jun 10, 2020 at 2:05 PM George Kadianakis  wrote:
>
>   Tevador, thanks a lot for your tailored work on equix. This is fantastic.
>   I have a question that I don't see addressed in your very well written
>   README. In your initial email, we discuss how Equihash does not have good
>   GPU resistance:
>   https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html
>
>   Since equix is using Equihash isn't this gonna be a problem here too?
>   I'm not too worried about ASIC resistance since I doubt someone is gonna
>   build ASICs for this problem just yet, but script kiddies with their CS:GO
>   graphics cards attacking equix is something I'm concerned about. I bet you
>   have thought of this, so I'm wondering what's your take here.
>

Equihash runs much faster on GPUs only if the memory requirements
exceed the size of the CPU cache. This is the case for most Equihash
variants that are in use by cryptocurrencies (e.g. 200,9 and 144,5),
but doesn't apply to Equi-X, which uses only 2 MB.

The GPU resistance of Equi-X is based on 2 facts:
1) Each solution attempt uses a different hash function. GPUs cannot
compile new kernels fast enough (it would require >1000 kernels per
second), so they have to run in emulator mode, which is much slower.
GPUs are also impacted by thread divergence.
2) The entire sorting phase fits into CPU cache, so CPUs can benefit
from memory bandwidth comparable to GPUs (~500 GB/s).

> In tevador's initial mail, they point how the cell should include
> POW_EFFORT and that we should specify a "minimum effort" value instead
> of just inserting any effort in the pqueue. I can understand how this can
> have benefits (like  the June discussion between tevador and yoehoduv) but
> I'm also concerned that this can make us more vulnerable to
> [ATTACK_BOTTOM_HALF] types of attacks, by completely dropping introduction
> requests instead of queueing them for an abstract future. I wouldn't be
> surprised if my concerns are invalid and harmful here. Does anyone have
> intuition?
>

I don't see any downsides to including the PoW effort in the cell and
specifying the minimum effort. The minimum effort is needed to reduce
the verification overhead and to ensure that the queue doesn't grow
indefinitely. If the effort of the introduction request is lower than
the minimum, the service can simply treat it like a request without
any PoW (saving a verification call). The exact outcome would
depend on the circumstances; normally the request would go to the back
of the queue.

I suggest a minimum effort equivalent to ~1 second of CPU time,
adjusted upward if the queue is full. This is more efficient than
trimming.
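
One naive shape for that upward adjustment (a sketch only; the multiplicative
step and the decay factor are pure guesses):

    BASE_MIN_EFFORT = 1000   # illustrative floor

    def adjust_min_effort(min_effort, queue_len, soft_max):
        if queue_len >= soft_max:
            return min_effort * 2   # queue full: raise the bar
        # otherwise decay gently back toward the floor
        return max(BASE_MIN_EFFORT, int(min_effort * 0.9))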

The size of the queue should be enough to handle short attack bursts
without dropping any requests. I'd say that a reasonable maximum queue
size is {bottom half throughput} * {number of seconds the client will
wait before retrying}. There is no point in queueing up more requests
than this because the client will have already given up on the request
by the earliest time it could be serviced.
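
For example, taking the ~180 requests/second bottom-half figure measured
earlier and a hypothetical 30-second client retry window, the maximum queue
size would come out around 180 * 30 = 5400 entries.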

> - tevador suggests we use two seeds, and always accept introductions with
> the previous seed. I agree this is a good idea, and it's not as complex as
> I originally thought (I have trauma from the v3 design where we try to
> support multiple time periods at the same time). However, because this
> doubles the vefication time, I decided to wait for dgoulet's scheduler
> numbers and until the PoW function is finalized to understand if we can
> afford the verification overhead.
>

There is no verification overhead if the seed is included in the
request. If additional 32 bytes are too much, the request can include
e.g. the first 4 bytes of the seed. This is enough for the service to
select the correct seed from the two active ones. The chance that two
subsequent seeds have the same first 32 bits is negligible (and can be
even avoided completely).

> - Solar Designer suggested we do Ethash's anti-DDoS trick to avoid instances
> of [ATTACK_TOP_HALF]. This involves wrapping the final PoW token in a fast
> hash with a really low difficulty, and having the verifier check that fast
> hash POW first. This means that a target trying to flood us with invalid PoW
> would need to do some work for every PoW instead of it being free. This is a
> decision we should take at the end after we do some number crunching and see
> where we are at in terms of verification time and attack models.
>

This trick is mostly relevant to slow-to-verify algorithms, but can be
also applied to Equi-X by reordering the server algorithm steps:

Input: C, N, E, S
1) Check that C is a valid seed (there may be multiple seeds active at a time).
2) Check that E is above the minimum effort
3) Check that N hasn't been used before with C
4) Calculate R = blake2b(C || N || E || S)
5) Check R * E <= UINT32_MAX
6) Check equix_verify(C || N || E, S) == EQUIX_OK
7) Put the request in the queue with a priority of E.

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-06-10 Thread George Kadianakis
Hello,

after reading all the excellent feedback on this thread, I did another
revision on this proposal:
  https://github.com/asn-d6/torspec/tree/pow-over-intro
I'm inlining the full proposal in the end of this email.

Here is a changelog:
- Improve attack vector section
- Shrink nonce size on cells to 16 bytes
- Change effort definition to linear

Here is a few things I did not do and might need some help with:

- I did not decide on the PoW function. I think to do this we miss the
  scheduler number crunching from dgoulet, and also I need to understand the
  possible options a bit more. I removed most references to argon2 and replaced
  them with XXX_POW.

  Tevador, thanks a lot for your tailored work on equix. This is fantastic.  I
  have a question that I don't see addressed in your very well written
  README. In your initial email, we discuss how Equihash does not have good GPU
  resistance:
  https://lists.torproject.org/pipermail/tor-dev/2020-May/014268.html

  Since equix is using Equihash isn't this gonna be a problem here too? I'm not
  too worried about ASIC resistance since I doubt someone is gonna build ASICs
  for this problem just yet, but script kiddies with their CS:GO graphics cards
  attacking equix is something I'm concerned about. I bet you have thought of
  this, so I'm wondering what's your take here.

  Right now I think the possible options are equix or the reduced Randomx
  (again thanks tevador) or yespower. In theory we could do all three of them
  and just support different versions; but that means more engineering.

  In any case, we are also waiting for some Tor-specific numbers from dgoulet,
  so we need those before we proceed here.

- In their initial mail, tevador points out an attack where the adversary games
  the effort estimation logic, by pausing an attack a minute before descriptor
  upload, so that the final descriptor has a very small target effort. They
  suggest using the median effort over a long period of time to fix this. Mike,
  can you check that out and see how we can adapt our logic to fix this?

- In tevador's initial mail, they point how the cell should include POW_EFFORT
  and that we should specify a "minimum effort" value instead of just inserting
  any effort in the pqueue. I can understand how this can have benefits (like
  the June discussion between tevador and yoehoduv) but I'm also concerned that
  this can make us more vulnerable to [ATTACK_BOTTOM_HALF] types of attacks, by
  completely dropping introduction requests instead of queueing them for an
  abstract future. I wouldn't be surprised if my concerns are invalid and
  harmful here. Does anyone have intuition?

- tevador suggests we use two seeds, and always accept introductions with the
  previous seed. I agree this is a good idea, and it's not as complex as I
  originally thought (I have trauma from the v3 design where we try to support
  multiple time periods at the same time). However, because this doubles the
  verification time, I decided to wait for dgoulet's scheduler numbers and until
  the PoW function is finalized to understand if we can afford the verification
  overhead.

- Solar Designer suggested we do Ethash's anti-DDoS trick to avoid instances of
  [ATTACK_TOP_HALF]. This involves wrapping the final PoW token in a fast hash
  with a really low difficulty, and having the verifier check that fast hash
  POW first. This means that a target trying to flood us with invalid PoW would
  need to do some work for every PoW instead of it being free. This is a
  decision we should take at the end after we do some number crunching and see
  where we are at in terms of verification time and attack models.

Thanks a lot! :)

---

Filename: xxx-pow-over-intro-v1
Title: A First Take at PoW Over Introduction Circuits
Author: George Kadianakis, Mike Perry, David Goulet
Created: 2 April 2020
Status: Draft

0. Abstract

  This proposal aims to thwart introduction flooding DoS attacks by introducing
  a dynamic Proof-Of-Work protocol that occurs over introduction circuits.

1. Motivation

  So far our attempts at limiting the impact of introduction flooding DoS
  attacks on onion services have focused on horizontal scaling with
  Onionbalance, optimizing the CPU usage of Tor and applying congestion control
  using rate limiting. While these measures move the goalpost forward, a core
  problem with onion service DoS is that building rendezvous circuits is a
  costly procedure both for the service and for the network. For more
  information on the limitations of rate-limiting when defending against DDoS,
  see [REF_TLS_1].

  If we ever hope to have truly reachable global onion services, we need to
  make it harder for attackers to overload the service with introduction
  requests. This proposal achieves this by allowing onion services to specify
  an optional dynamic proof-of-work scheme that its clients need to participate
  in if they want to get served.

  With the right parameters, this proof-of-work scheme acts as a gatekeeper to
  block amplification attacks by attackers while letting legitimate clients
  through.

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-06-08 Thread tevador
On Mon, Jun 8, 2020 at 3:09 AM  wrote:
>
> It looks like all the 40320 permutations of the 16-bit words in S are
> equix solutions.  Are steps 3 to 5 supposed to be repeated for all the
> permutations?
>

Each unique S provides only one attempt. The indices in S must be
ordered in a certain way, otherwise the solution is invalid.
See: https://github.com/tevador/equix/blob/master/src/equix.c#L14-L23


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-06-07 Thread yoehoduv
> The client's algorithm:
> Input: C
>
> 1.  Select N, E
>
> 2.  Calculate S = equix_solve(C || N || E)
>
> 3.  Calculate R = blake2b(C || N || E || S)
>
> 4.  if R * E > UINT32_MAX, go back to step 1)
>
> 5.  Submit C, N, E, S (68 bytes total)

It looks like all the 40320 permutations of the 16-bit words in S are
equix solutions.  Are steps 3 to 5 supposed to be repeated for all the
permutations?


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-06-07 Thread yoehoduv
> Sending many requests is not the optimal strategy. One high-effort
> request would have exactly the same chance of being selected as many
> low-effort requests with the same total effort. The difference is that
> with many requests, the client wouldn't know which rendezvous would be
> selected, so he'd have to waste resources on opening many circuits.

Is solving for a high-effort solution different from solving for a
lower-effort solution?  If not, many lower-effort solutions will be
found while working for a high-effort solution.  Then nothing stops a
client from making requests with those lower-effort solutions as well.
I can expect to find 2 solutions of effort E/2 for every solution of
effort E.  If I make 3 requests with those solutions, my chance of
succeeding doubles, at the cost of 3 times the server's verification
effort.

One way is to include the target effort in requests, and include both
the server-provided nonce and the target effort as the x in H(x).  Then
only check that the real effort comes out no less than the target
effort, but use the target effort for everything else.

> Anyways, I suggest using a priority queue first and see how it works
> in practice. To allow efficient insertion and trimming, the queue can
> be implemented as a red-black tree.
>
> > As of trimming the priority queue, we don't have to use a heap. We can
> > compress the effort into maybe 7 bits, and then store the requests in
> > 128 arrays. Then trimming it is freeing an array. The compression can
> > be something like floating point.
> > ~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1
> > That is, take the number of leading zeros and the one bit on the right of
> > the leftmost 1 bit, then complement the first part to preserve order.
> > We can expect the number of leading zeros to be less than 64, so this
> > will take 7 bits. A decrement of this value means about 1.3 - 1.5 times
> > more work, which should be finely enough grained.
>
> What you are describing is called a hashtable. The question is: what
> happens when one of the arrays gets filled up? You would have to
> discard all additional requests coming into that bucket. With your
> construction, it's very likely that most requests would end up in just
> a few buckets and the rest would remain empty (e.g. all buckets for
> more than 40 leading zeroes).

I was thinking of dynamically sized arrays.  Anyway, my point was we
don't need to compare 128-bit solutions.


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-06-06 Thread tevador
On Mon, May 18, 2020 at 1:01 PM  wrote:
>
> When a cell with a small effort in the queue has any chance of getting
> selected, the optimal strategy for a legitimate client would be to
> compute nonces and send as many nonces as possible until it causes
> congestion on his network.  Instead when only the cell with the highest
> effort is processed, sending more than one nonce per connection does no
> good for a client.  We want each legitimate client to send only one
> nonce per connection.
>

Sending many requests is not the optimal strategy. One high-effort
request would have exactly the same chance of being selected as many
low-effort requests with the same total effort. The difference is that
with many requests, the client wouldn't know which rendezvous would be
selected, so he'd have to waste resources on opening many circuits.

Anyways, I suggest using a priority queue first and see how it works
in practice. To allow efficient insertion and trimming, the queue can
be implemented as a red-black tree.
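
For a prototype, a sorted container gives the same O(log n) insertion and
cheap trimming as a red-black tree. A sketch using the third-party Python
package sortedcontainers (entry layout is illustrative; arrival_time doubles
as a tiebreaker):

    from sortedcontainers import SortedList  # stand-in for an RB tree

    queue = SortedList()   # entries: (effort, arrival_time, request_id)

    def enqueue(effort, arrival, request_id):
        queue.add((effort, arrival, request_id))   # O(log n) insertion

    def pop_best():
        return queue.pop(-1)                       # highest-effort entry

    def trim_to(soft_max):
        # drop the lowest-effort entries with one slice deletion
        del queue[:max(0, len(queue) - soft_max)]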

> As of trimming the priority queue, we don't have to use a heap.  We can
> compress the effort into maybe 7 bits, and then store the requests in
> 128 arrays.  Then trimming it is freeing an array.  The compression can
> be something like floating point.
>
>~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1
>
> That is, take the number of leading zeros and by one bit on the right of
> the leftmost 1 bit, then complement the first part to preserve order.
> We can expect the number of leading zeros to be less than 64, so this
> will take 7 bits.  A decrement of this value means about 1.3 - 1.5 times
> more work, which should be finely enough grained.

What you are describing is called a hashtable. The question is: what
happens when one of the arrays gets filled up? You would have to
discard all additional requests coming into that bucket. With your
construction, it's very likely that most requests would end up in just
a few buckets and the rest would remain empty (e.g. all buckets for
more than 40 leading zeroes).

T.


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-06-06 Thread tevador
I've been working on a custom PoW algorithm specifically for this
proposal. It is 10x faster to verify than RandomX-Tor and doesn't
require any memory for verification.

Full write-up is here: https://github.com/tevador/equix/blob/master/devlog.md

Especially the comparison table in the Appendix may be of interest to
this discussion.

T.

On Sat, May 9, 2020 at 9:38 PM tevador  wrote:
> I have implemented this in the tor-pow branch of the RandomX repository:
>
> https://github.com/tevador/RandomX/tree/tor-pow
>


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-05-18 Thread yoehoduv
> What is the benefit of this approach rather than discarding low
> priority requests right away in the top-half handler?
> 

> Note that a priority queue is typically implemented as a heap, which
> does not support efficient trimming.

Correct me if I'm wrong.

When a cell with a small effort in the queue has any chance of getting
selected, the optimal strategy for a legitimate client would be to
compute nonces and send as many nonces as possible until it causes
congestion on his network.  Instead, when only the cell with the highest
effort is processed, sending more than one nonce per connection does no
good for a client.  We want each legitimate client to send only one
nonce per connection.

As of trimming the priority queue, we don't have to use a heap.  We can
compress the effort into maybe 7 bits, and then store the requests in
128 arrays.  Then trimming it is freeing an array.  The compression can
be something like floating point.

   ~clz(POW_NONCE) << 1 | (POW_NONCE >> (127 - clz(POW_NONCE))) & 1

That is, take the number of leading zeros and the one bit on the right of
the leftmost 1 bit, then complement the first part to preserve order.
We can expect the number of leading zeros to be less than 64, so this
will take 7 bits.  A decrement of this value means about 1.3 - 1.5 times
more work, which should be finely grained enough.
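
Rendered directly in Python (following the prose, the shift is by 126 - clz
so that it extracts the bit just right of the leading 1):

    def bucket(pow_result: int) -> int:
        """7-bit compression of a 128-bit PoW result: 6 complemented
        leading-zero bits plus the bit just right of the leading 1,
        so smaller results (more work) land in smaller buckets."""
        lz = 128 - pow_result.bit_length()   # leading zeros, expected < 64
        next_bit = (pow_result >> (126 - lz)) & 1 if lz <= 126 else 0
        return ((~lz & 0x3F) << 1) | next_bit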



Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-05-13 Thread tevador
Hi Mike,

My apologies. I thought this later email from 14th April had the
latest version of the proposal:
https://lists.torproject.org/pipermail/tor-dev/2020-April/014225.html

> In short, we let the queue grow at a faster rate than we serve, and we
> trim it occasionally.

What is the benefit of this approach rather than discarding low
priority requests right away in the top-half handler?

Note that a priority queue is typically implemented as a heap, which
does not support efficient trimming.

> However, it still has the following potential issues:
>  A) AES will bottleneck us at ~100Mbit-300Mbit at #2 in top-half above
>  B) Extra mainloop() iterations for INTRO2s may be expensive (or not?)

I don't think AES is the main concern here. Each introduction request
is 512 bytes (AFAIK), so with a PoW verification performance of 2000
requests per second, the top-half handler will bottleneck at ~1 MB/s.

> Are there other reasons to do stochastic sampling over a priority queue,
> given this top-half and bottom-half design?

After thinking about it more, I would recommend starting with a simple
priority queue as proposed. More complex solutions can be implemented
later if field testing finds issues.

T.


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-05-13 Thread Mike Perry
Hi tevador,

On 5/8/20 2:53 PM, tevador wrote:
> Including the effort has 2 benefits:
> 
> 1. In case the Intro Priority Queue is full, the service doesn't need to
>waste time verifying PoW solutions that have effort lower than the last
>element in the queue. While an attacker can always lie about the actual
>effort of their nonce value, I think this field can still save some CPU
>cycles when the service is under heavy load.
> 
> 2. The service can conveniently verify the reported effort with
>the following inequality:
> 
>POW_EFFORT * pow_function(POW_NONCE, SEED) <= MAX_RESULT
> 
>where MAX_RESULT is the highest possible output from the pow_function.
>In the case of Example A, that would be:
> 
>165 * 6317 = 1042305 <= 1048576
> 
> 
> 
>>  Similar to how our cell scheduler works, the onion service subsystem will
>>  poll the priority queue every 100ms tick and process the first 20 cells from
>>  the priority queue (if they exist). The service will perform the rendezvous
>>  and the rest of the onion service protocol as normal.
> 
> I suggest to use a different selection method rather than always taking
> the first 20 requests.
> 
> Selecting cells from the front of the queue actually minimizes the effort that
> an attacker needs to expend to completely block access to the service.
> The attacker can always submit the minimum effort required to occupy the first
> 20 slots in each 100 ms window. This can be done by submitting just 200 
> requests
> per second, observing how many circuits are successfully opened, and adjusting
> the effort until no other request can go through. This is the "Total overwhelm
> strategy" described in § 5.1.1.

Hrm, you seem to have read the original proposal and missed some of the
follow-on threads.

We moved away from a 100ms tick-based system into a top-half and
bottom-half handler design, which updates the difficulty target as well.
I tried to describe this in the top-half and bottom-half handler steps
in my reply:
https://lists.torproject.org/pipermail/tor-dev/2020-April/014219.html

In short, we let the queue grow at a faster rate than we serve, and we
trim it occasionally. Those steps set the descriptor difficulty a
property of based on what the service can actually serve from the queue
and based on the queue trim point. We allow clients to pick a higher
difficulty arbitrarily to jump to the head of the queue, if they notice
they are still not getting service based on the descriptor difficulty.

This also eliminates the need for a "default" difficulty.

So in order for the attacker to "total overwhelm" that system, don't
they have to submit not just 200 requests per second, but *continuously*
send requests with higher difficulty than anyone else in the queue, in
order to fully deny service?

Are there other reasons to do stochastic sampling over a priority queue,
given this top-half and bottom-half design?

-- 
Mike Perry





Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-05-10 Thread tevador
Hi teor,

On Sun, May 10, 2020 at 6:36 AM teor  wrote:
>
> There are two possible issues with this design:
>
> Division is expensive on some platforms, including ARM-based devices.
> But there might be a way to calculate an approximate value without division.
> (For example, bit shifts, or multiplying by an inverse.) Or we could
> calculate the maximum value once, and then re-use it.
>
> Is it still possible to express the full range of difficulties? Is that
> expression reasonably compact?
>
> Some advantages of this exponential distribution are:
> * spurious results can be filtered using a single instruction (a bit mask),
> * the expected effort is quick and easy to calculate,
> * the effort can be expressed in a compact way.
>
> Maybe we don't need some of these properties, and a linear design would
> be fine.
>
> But if we do, we could change the exponent to the square or cube root of
> two. There would be a smoother distribution, but a wider range, and the
> checks would still be reasonably fast.
>
> T
>

You only need 1 or 2 divisions per introduction request, so it doesn't matter
even if division is expensive, because it will take a minuscule amount of time
compared to the actual hashing effort.

There are 2 scenarios:

1) User wants to wait for X seconds and then submit the best result they could
   find.

2) User wants to wait as long as it takes to submit a result with an effort
   of at least E.

In case 1), the client will simply take the smallest result R found during the
X seconds and calculate ACTUAL_EFFORT = MAX_RESULT / R at the end.

In case 2), the client will calculate TARGET = MAX_RESULT / E at the beginning
and keep hashing until they find a result R <= TARGET. Then they can calculate
ACTUAL_EFFORT = MAX_RESULT / R at the end, which implies ACTUAL_EFFORT >= E.

Case 1) takes 1 division instruction, case 2) takes 2 division instructions.
When hashing, the client can filter results with a single instruction
(comparison).

Examples:

1) After X seconds, the client finds results 660445, 6317, 599102 ... 111847.
   The smallest result is 6317, so:

  ACTUAL_EFFORT = 1048575 / 6317 = 165

2) The client wants to find a result with an effort of at least E = 165, so
   they calculate TARGET = 1048575 / 165 = 6355. Then they can keep hashing
   until they find R <= 6355, e.g. 6317. The actual effort is calculated
   as above.
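
Both cases in a few lines of Python (MAX_RESULT here is 2^20 - 1 = 1048575,
matching the examples):

    MAX_RESULT = 2**20 - 1

    def effort_after_deadline(results):
        # case 1: hash for X seconds, report the best effort achieved
        return MAX_RESULT // min(results)

    def target_for_effort(e):
        # case 2: keep hashing until some result R <= TARGET turns up
        return MAX_RESULT // e

    assert effort_after_deadline([660445, 6317, 599102, 111847]) == 165
    assert target_for_effort(165) == 6355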

So the only advantage of the exponential notation is:

>
> * the effort can be expressed in a compact way.
>

This can save a few characters in the HS descriptor, at the cost of a coarse
effort classification, e.g. clients who spent 60 seconds hashing will be, on
average, classified the same as those who spent 100 seconds.


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-05-09 Thread teor
Hi tevador,

> On 9 May 2020, at 06:43, tevador  wrote:
>> For our dynamic PoW system to work, we will need to be able to compare PoW
>> tokens with each other. To do so we define a function:
>>unsigned effort(uint8_t *token)
>> which takes as its argument a hash output token, and returns the number of
>> leading zero bits on it.
> 
> This definition makes the effort exponential, i.e. the computational resources
> required to reach one notch higher effort increase by a factor of 2 each time.
> 
> I suggest using a linear effort, defined as the quotient of dividing a bitstring
> of 1s by the hash value:
> 
> == Example A:
> 
>  effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101
> 
> or in decimal:
> 
>  effort(6317) = 1048575 / 6317 = 165.
> 
> This definition of effort has the advantage of directly expressing the 
> expected
> number of hashes that the client had to calculate to reach the effort.
> 
> With the exponential definition, we could have an equivalent linear effort of
> either 128 (7 leading zeroes) or 256 (8 leading zeroes), while the linear
> definition provides smoother classification of PoW results.

There are two possible issues with this design:

Division is expensive on some platforms, including ARM-based devices.
But there might be a way to calculate an approximate value without division.
(For example, bit shifts, or multiplying by an inverse.) Or we could calculate
the maximum value once, and then re-use it.

Is it still possible to express the full range of difficulties? Is that 
expression
reasonably compact?

Some advantages of this exponential distribution are:
* spurious results can be filtered using a single instruction (a bit mask),
* the expected effort is quick and easy to calculate,
* the effort can be expressed in a compact way.

Maybe we don't need some of these properties, and a linear design would
be fine.

But if we do, we could change the exponent to the square or cube root of
two. There would be a smoother distribution, but a wider range, and the
checks would still be reasonably fast.

T




Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-05-09 Thread tevador
On 08 May, 21:53, tevador  wrote:
> In particular, the following parameters should be set differently from
> Monero:
>
> RANDOMX_ARGON_SALT = "RandomX-TOR-v1"
>
> The unique RandomX salt means we do not need to use a separate salt as PoW
> input as specified in § 3.2.
>
> RANDOMX_ARGON_ITERATIONS = 1
> RANDOMX_CACHE_ACCESSES = 4
> RANDOMX_DATASET_BASE_SIZE = 1073741824
> RANDOMX_DATASET_EXTRA_SIZE = 16777216
>
> These 4 changes reduce the RandomX Dataset size to ~1 GiB, which allows
> the number of iterations to be reduced from 8 to 4. The combined effect of
> this is that Dataset initialization becomes 4 times faster, which is needed
> due to more frequent updates of the seed (Monero updates once per ~3 days).
>
> RANDOMX_PROGRAM_COUNT = 2
> RANDOMX_SCRATCHPAD_L3 = 1048576
>
> Additionally, reducing the number of programs from 8 to 2 makes the hash
> calculation about 4 times faster, while still providing resistance against
> program filtering strategies (see [REF_RANDOMX_PROGRAMS]). Since there are
> 4 times fewer writes, we also have to reduce the scratchpad size. I suggest
> to use a 1 MiB scratchpad size as a compromise between scratchpad write
> density and memory hardness. Most x86 CPUs will perform roughly the same
> with a 512 KiB and 1024 KiB scratchpad, while the larger size provides
> higher resistance against specialized hardware, at the cost of possible
> time-memory tradeoffs (see [REF_RANDOMX_TMTO] for details).
>
> Lastly, we reduce the output of RandomX to just 8 bytes:
>
>RANDOMX_HASH_SIZE = 8
>
> 64-bit preimage security is more than sufficient for proof-of-work and it
> allows the result to be treated as a little-endian encoded unsigned integer
> for easy effort calculation.

I have implemented this in the tor-pow branch of the RandomX repository:

https://github.com/tevador/RandomX/tree/tor-pow

Namely, I have changed the API to return the hash value as a uint64_t and
made corresponding changes in the benchmark.
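For reference, a minimal single-hash example against the upstream RandomX
API (light mode, no dataset); the tor-pow branch differs in that the result
comes back directly as a uint64_t, which the 8-byte copy below only
approximates:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include "randomx.h"

    int main(void)
    {
        const char seed[] = "1234";
        const char input[] = "some-nonce";
        uint8_t hash[RANDOMX_HASH_SIZE];

        /* Light mode: cache only, no 1 GiB dataset. */
        randomx_flags flags = randomx_get_flags();
        randomx_cache *cache = randomx_alloc_cache(flags);
        if (!cache)
            return 1;
        randomx_init_cache(cache, seed, sizeof(seed) - 1);
        randomx_vm *vm = randomx_create_vm(flags, cache, NULL);
        if (!vm)
            return 1;

        randomx_calculate_hash(vm, input, sizeof(input) - 1, hash);

        /* On the tor-pow branch RANDOMX_HASH_SIZE is 8, so the result
         * is simply a little-endian 64-bit integer. */
        uint64_t result = 0;
        memcpy(&result, hash, 8);
        printf("effort: %llu\n",
               (unsigned long long)(UINT64_MAX / (result | 1)));

        randomx_destroy_vm(vm);
        randomx_release_cache(cache);
        return 0;
    }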

Benchmark example:

./randomx-benchmark --mine \
--avx2 \
--jit  \
--largePages \
--nonces 1 \
--seed 1234 \
--init 1 \
--threads 1 \
--batch
RandomX-TOR-v1 benchmark
 - Argon2 implementation: AVX2
 - full memory mode (1040 MiB)
 - JIT compiled mode
 - hardware AES mode
 - large pages mode
 - batch mode
Initializing (1 thread) ...
Memory initialized in 5.32855 s
Initializing 1 virtual machine(s) ...
Running benchmark (1 nonces) ...
Performance: 2535.43 hashes per second
Best result:
  Nonce: 8bc3ded34d2dcdeed900f95cd20c
  Result: d947ceff08750300
  Effort: 18956
  Valid: 1

At the end, it prints out the nonce that gives the highest effort value and
validates it.

For the actual implementation in Tor, the RandomX validator should run in
a separate thread that doesn't do anything else apart from validation and
moving valid requests into the Intro Queue. This way we can reach the maximum
performance of ~2000 processed requests per second.
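A rough sketch of that arrangement (the queue helpers and the
randomx_verify() wrapper are hypothetical names, not tor or RandomX API):

    #include <pthread.h>
    #include <stddef.h>

    typedef struct request_t request_t;
    request_t *queue_pop(void *q);          /* blocking, thread-safe pop */
    void queue_push(void *q, request_t *r); /* thread-safe push */
    int randomx_verify(const request_t *r); /* one hash + compare */
    void request_free(request_t *r);
    extern void *incoming_queue, *intro_queue;

    static void *validator_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            request_t *req = queue_pop(incoming_queue);
            if (randomx_verify(req))
                queue_push(intro_queue, req); /* ordered by effort */
            else
                request_free(req);            /* invalid: drop it */
        }
        return NULL;
    }

    /* started once at service startup:
     *   pthread_t tid;
     *   pthread_create(&tid, NULL, validator_thread, NULL); */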

Finally, here are some disadvantages of RandomX-TOR:

 1) Fast verification requires ~1 GiB of memory. If we decide to use two
    overlapping seed epochs, each service will need to allocate >2 GiB of RAM
    just to verify the PoW. Alternatively, it is possible to use the slow
    mode, which requires only 256 MiB per seed, but runs 4x slower.
 2) The fast mode needs about 5 seconds to initialize every time the seed
    is changed (can be reduced to under 1 second using multiple threads).
    The slow mode needs about 0.1 seconds to initialize.
 3) RandomX includes a JIT compiler for maximum performance. The iOS operating
    system doesn't support JIT compilation, so RandomX runs about 10x slower
    there.
 4) The JIT compiler in RandomX is currently implemented only for x86-64 and
    ARM64 CPU architectures. Other architectures will run very slowly
    (especially 32-bit systems). However, the two supported architectures
    cover the vast majority of devices, so this should not be an issue.
___
tor-dev mailing list
tor-dev@lists.torproject.org
https://lists.torproject.org/cgi-bin/mailman/listinfo/tor-dev


Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-05-08 Thread tevador
Hi all,

I was asked by George to submit my comments about the proposal and to suggest
suitable RandomX parameters for this PoW scheme.

> For our dynamic PoW system to work, we will need to be able to compare PoW
>  tokens with each other. To do so we define a function:
> unsigned effort(uint8_t *token)
>  which takes as its argument a hash output token, and returns the number of
>  leading zero bits on it.

This definition makes the effort exponential, i.e. the computational resources
required to reach one notch higher effort increase by a factor of 2 each time.

I suggest using linear effort, defined as the quotient of dividing a bitstring
of 1s by the hash value:

== Example A:

  effort(00000001100010101101) = 11111111111111111111 / 00000001100010101101

or in decimal:

  effort(6317) = 1048575 / 6317 = 165.

This definition of effort has the advantage of directly expressing the expected
number of hashes that the client had to calculate to reach the effort.

With the exponential definition, we could have an equivalent linear effort of
either 128 (7 leading zeroes) or 256 (8 leading zeroes), while the linear
definition provides smoother classification of PoW results.



>   The EXT_FIELD content format is:
>
>  POW_VERSION[1 byte]
>  POW_NONCE  [32 bytes]
>

I suggest using a 16-byte nonce value, which is more than sufficient given the
target security level and has the benefit of reducing the replay cache size to
half.

Since we expect the seed to be valid for around 3 hours (as proposed), even
if the service receives 1 million proofs per second and each proof has an
effort of 1 million, the number of submitted nonces from clients will only reach
about 10^10. With a 108-bit solution space (subtracting 20 bits as the search
space per client), the probability that two clients accidentally submit the same
nonce is roughly 10^-13 (see [REF_BIRTHDAY]).
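For reference, the estimate follows from the standard birthday
approximation, with n nonces drawn uniformly from N possible values:

    \[
      P(\text{collision}) \approx \frac{n^2}{2N}
        = \frac{(10^{10})^2}{2 \cdot 2^{108}}
        \approx \frac{10^{20}}{6.5 \times 10^{32}}
        \approx 1.5 \times 10^{-13}
    \]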

Additionally, I suggest adding the client's effort for convenience:

   The updated EXT_FIELD content format would be:

  POW_VERSION[1 byte]
  POW_EFFORT [4 bytes]
  POW_NONCE  [16 bytes]

Including the effort has 2 benefits:

1. In case the Intro Priority Queue is full, the service doesn't need to
   waste time verifying PoW solutions that have effort lower than the last
   element in the queue. While an attacker can always lie about the actual
   effort of their nonce value, I think this field can still save some CPU
   cycles when the service is under heavy load.

2. The service can conveniently verify the reported effort with
   the following inequality:

   POW_EFFORT * pow_function(POW_NONCE, SEED) <= MAX_RESULT

   where MAX_RESULT is the highest possible output from the pow_function.
   In the case of Example A, that would be:

   165 * 6317 = 1042305 <= 1048576
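A sketch of that check in C, assuming the 8-byte hash result is treated as
a little-endian integer so that MAX_RESULT is 2^64 - 1 (the 128-bit product
is a GCC/Clang extension used here to avoid overflow; the names are
illustrative):

    #include <stdint.h>

    /* Accept the cell only if the claimed effort is consistent with
     * the PoW result: effort <= MAX_RESULT / hash, checked without
     * division as effort * hash <= MAX_RESULT. */
    static int effort_is_valid(uint32_t claimed_effort, uint64_t hash)
    {
        return (unsigned __int128)claimed_effort * hash <= UINT64_MAX;
    }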



>  Similar to how our cell scheduler works, the onion service subsystem will
>  poll the priority queue every 100ms tick and process the first 20 cells from
>  the priority queue (if they exist). The service will perform the rendezvous
>  and the rest of the onion service protocol as normal.

I suggest using a different selection method rather than always taking
the first 20 requests.

Selecting cells from the front of the queue actually minimizes the effort that
an attacker needs to expend to completely block access to the service.
The attacker can always submit the minimum effort required to occupy the first
20 slots in each 100 ms window. This can be done by submitting just 200 requests
per second, observing how many circuits are successfully opened, and adjusting
the effort until no other request can go through. This is the "Total overwhelm
strategy" described in § 5.1.1.

See the following examples to show how selecting the first N cells from
the queue is unfair. I will use N = 4 for clarity. E denotes some value of
effort.

== Example B:

              ATTACKER                          LEGITIMATE CLIENTS
 ___________________________________   _____________________________________
/                                   \ /                                     \
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
|   2E   |   2E   |   2E   |   2E   |  E |  E |  E |  E |  E |  E |  E |  E |
+--------+--------+--------+--------+----+----+----+----+----+----+----+----+
    ^        ^        ^        ^
 selected selected selected selected

Here the attacker and the legitimate clients have each expended a combined
effort of 8E. All of the attacker's cells get selected, while none of the
legitimate clients get through.

Instead, I suggest using Stochastic universal sampling [REF_SUS], where
the probability that a cell gets selected is proportional to its effort.

== Example C:

Here the total effort in the queue is 16E, so we first select a random value in
the interval [0, 4E) and then select 4 evenly spaced cells:

[diagram and remainder of the message truncated in the archive]
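Since the diagram is cut short, here is a rough sketch of the sampling step
it describes, over a plain array of requests (the struct and the RNG are
placeholders, not the proposal's code):

    #include <stdlib.h>
    #include <stdint.h>
    #include <stddef.h>

    struct intro_req { uint64_t effort; int selected; };

    /* Stochastic universal sampling: num_picks evenly spaced pointers
     * over the cumulative effort, starting at a random offset, so a
     * request's chance of selection is proportional to its effort. */
    static void sus_select(struct intro_req *q, size_t n,
                           unsigned num_picks)
    {
        uint64_t total = 0;
        for (size_t i = 0; i < n; i++)
            total += q[i].effort;
        if (total == 0 || num_picks == 0)
            return;

        double step = (double)total / num_picks;
        double start = ((double)rand() / ((double)RAND_MAX + 1.0)) * step;

        uint64_t cum = 0;
        size_t i = 0;
        for (unsigned k = 0; k < num_picks && i < n; k++) {
            double target = start + k * step;
            /* advance to the request whose effort interval covers
             * this pointer */
            while (i < n && (double)(cum + q[i].effort) <= target)
                cum += q[i++].effort;
            if (i < n)
                q[i].selected = 1;
        }
    }

Note that a single request whose effort spans several pointers would be
picked more than once; in the actual intro queue it would simply be dequeued
on its first selection.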

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-04-14 Thread George Kadianakis
> Hello list,
>
> hope everyone is safe and doing well!
>
> I present you an initial draft of a proposal on PoW-based defences for
> onion services under DoS.
>

Hello again,

many thanks for all the thoughtful feedback!

In the end of this email I inline a new version of the proposal
addressing various issues discussed over IRC and on this thread.
Here is a rough changelog:

- Specifying some features we might want from "v1.5".
- Adding suggested-effort to the descriptor.
- Specifying the effort() function.
- Specifying the format of the expiration time.
- Adding a protocol-specific label to the PoW computation.
- Removing the seed and output values from the INTRODUCE1 cell.
- Specifying what happens when a client does not send a PoW token when PoW is 
enabled.
- Revamping the UX section.
- Added Mike and David in the authors list.

I'm also pushing the spec to my git repo so that you can see a diff:
https://github.com/asn-d6/torspec/tree/pow-over-intro

Now, before going into the proposal, here are the three big topics currently
under discussion in the thread:

== How the scheduler should work ==

   I'm not gonna touch on this, since David is writing an initial draft of a
   scheduler design soon, so let's wait for that email before we discuss this
   further.

== Should there be a target difficulty on the descriptor? ==

   I have made changes in the proposal to this effect. See sections
   [EFFORT_ESTIMATION] and [CLIENT_TIMEOUT] for more information.

   While there is no hard-target difficulty, the descriptor now contains
   a suggested difficulty that clients should aim at. The service will
   still add requests with lower effort than the suggested one in the
   priority queue. That's to make the system more resilient to attacks
   in cases where the client cannot get the latest descriptor (and hence
   latest suggested effort) due to the descriptor upload/fetch
   rate-limiting restrictions in place.

== Which PoW function should we use? ==

   The proposal suggests argon2, and Mike has been looking at RandomX. However,
   after further consideration and speaking with some people (props to Alex
   Biryukov), it seems like those two functions are not well suited for this
   purpose, since they are memory-hard both for the client and the service.
   Since we are trying to minimize the verification overhead so that the
   service can do hundreds of verifications per second, they don't seem like
   good fits.

   In particular, slimming down argon2 to the point that services can do
   hundreds of those verifications per second results in an argon2 without any
   memory-hardness. And RandomX is even heavier, since it uses argon2 under the
   hood and also does extra work. In particular, from some preliminary
   measurements, it seems like the top-half of the cell processing takes about
   2ms, whereas RandomX takes at least 17ms on my computer, which means that it
   adds an almost 1000% overhead to the top-half processing of a single
   introduction.

   This means that asymmetric PoW schemes like Equihash and its family are what
   we should be looking at next. These schemes aim to have small proof sizes,
   and to be memory-hard for the prover but lightweight for the verifier. They
   are currently used by Zcash, so there is quite some literature and there are
   improvements.

   In particular, Equihash has two important parameters (n,k). These parameters
   together control the proof size (so for example, Equihash<144,5> has a 100B
   proof, and Equihash<200,9> has a 1344B proof), and the 'k' parameter
   controls the verification speed (the verifier has to do 2^k hash invocations
   to do the verification). Also see this for more details:
  https://forum.bitcoingold.org/t/our-new-equihash-equihash-btg/1512
  https://www.cryptolux.org/images/b/b9/Equihash.pdf
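   For reference, those proof sizes follow from the Equihash solution format:
   a solution to Equihash<n,k> is a list of 2^k indices, each n/(k+1) + 1 bits
   wide, so

       \[
         \text{size} = \frac{2^k \left( \frac{n}{k+1} + 1 \right)}{8}
         \ \text{bytes}
       \]

   For Equihash<144,5> this gives 32 * 25 = 800 bits = 100 bytes, and for
   Equihash<200,9> it gives 512 * 21 = 10752 bits = 1344 bytes.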

   The good thing here is that these parameters look good and offer good
   security. Furthermore, Equihash is used by big and friendly projects like
   Zcash.

   The negative thing is that because Equihash is widely used there is already
   ASIC hardware for it, so we would need to look at the parameters we pick and
   how ASIC-friendly they are. Furthermore, an attacker who buys Equihash ASIC
   can also use it for coin mining which makes it an easier investment.

   IMO, we should look more into Equihash and other asymmetric types of PoW, as
   well as speak with people who know Equihash well.

   Finally, our proposal has a big benefit over the blockchain use cases: it's
   much more agile. We can deploy changes to the PoW algorithm without having
   to hard-fork, and we can do this even through the consensus for maximum
   agility. This means that we should try to use this agility to our advantage.

Looking forward to more feedback!

=

And here comes the updated proposal:

Filename: xxx-pow-over-intro-v1
Title: A First Take at PoW Over Introduction Circuits
Author: George Kadianakis, Mike Perry, David Goulet
Created: 2 April 

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-04-07 Thread David Goulet
On 06 Apr (17:08:12), Mike Perry wrote:

[snip]

> > I think you'll be bound by the amount of data a connection inbuf can take
> > which has an upper bound of 32 cells each read event.
> > 
> > Then tor will have to empty at once the inbuf, queue all INTRODUCE2 cells
> > (at most 32) in that priority queue and once done, we would process it
> > until we return to handling the connection inbuf.
> > 
> > In other words, the queue size, with tor's architecture, is bound to the
> > number of cells upper bound you can get when doing a recv() pass which is 32
> > cells.
> > 
> > Nevertheless, that limit is weirdly hardcoded in tor so you should
> > definitely think of a way to upper bound the queue and just drop the rest.
> > A good starting point would be that 32 cells number?
> 
> dgoulet and I think the following will work better than always doing 32
> cells at a time. Basically, the idea is to split our INTRO2 handling
> into "top-half" and "bottom-half" handlers.
> 
> Top-half handling:
>  1) read 32 cells off inbuf as usual
>  2) do AES relay cell decryption as usual
>  3) Parse relay headers, handle all cells as usual, except:
> a) in hs_service_receive_introduce2(), add to pqueue
>and return without further processing of those
>  4) Return to rest of mainloop

Agree with the above. It is trivial to do this today so very low engineering
cost.

> 
> Then, separately, also in mainloop, do the bottom half. (TODO: What
> group priority?)
> 
> Bottom-half handling:
>   I) pop a single intro2 off of pqueue (ie: max difficulty in queue)
>  II) Compare this difficulty to desc difficulty. If lower, lower
>  desc difficulty
> III) Parse it and launch RP circuit (as per current bottom half of
>  hs_service_receive_introduce2())
>  IV) trim pqueue elements, if queue "too big" (TODO: how to trim?)
>   V) Compare trim point difficulty to descriptor difficulty, if trim
>  point was higher than descriptor value, raise desc difficulty
>  VI) return to libevent/mainloop again

I would maybe try to convince you that we could dequeue more than 1 cell here,
because this behavior changes the current state of HS quite a bit.

Right now, we would get 32 cells out of the inbuf and, one at a time, process
each, then go back to the mainloop.

This new algorithm means that we would process 1 single cell at each mainloop
event instead of 32. This is quite a decrease. OK, it is not exactly that
ratio, because dequeuing the inbuf without processing the decrypted INTRO2
cells may be fast, but a full mainloop run per cell is still clearly slower
than what we do now.

We need some sort of performance measurements here to make an informed
decision, but my gut feeling tells me that we might want to process, say, 5
or 10 cells instead of 1 per mainloop round.
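A rough sketch of what that batched bottom-half could look like (the queue
helpers, types, and batch size are illustrative, not tor's real API):

    #define INTRO2_BATCH_SIZE 8

    typedef struct intro2_cell_t intro2_cell_t;
    intro2_cell_t *pqueue_pop_max(void *q); /* highest effort first */
    int pqueue_is_empty(void *q);
    void handle_intro2(intro2_cell_t *cell); /* parse + launch RP circuit */
    void reschedule_bottom_half(void);
    extern void *intro_queue;

    static void
    bottom_half_cb(void)
    {
      for (int i = 0; i < INTRO2_BATCH_SIZE; i++) {
        intro2_cell_t *cell = pqueue_pop_max(intro_queue);
        if (!cell)
          break; /* queue drained; nothing left this round */
        handle_intro2(cell);
      }
      /* If work remains, re-arm the event so we yield back to the
       * mainloop between batches instead of starving other events. */
      if (!pqueue_is_empty(intro_queue))
        reschedule_bottom_half();
    }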

We _should_ run timing measurements here to see how much delaying INTRO2
processing to another mainloop event affects the overall rate of introduction.

But let's say a full mainloop run takes 100 msec: we would process 50
introductions per second... that looks quite low? But it could already be what
we do now; unknown.

> 
> 
> The astute reader will note that even without PoW, the above can provide
> almost the exact same functionality as the naive rate limiting currently
> done at intropoints - just cut the queue arbitrarily. Basically PoW and
> the pqueue just gives us a smarter way to decide who to reply to.
> 
> However, it still has the following potential issues:
>   A) AES will bottleneck us at ~100Mbit-300Mbit at #2 in top-half above
>   B) Extra mainloop() iterations for INTRO2s may be expensive (or not?)

Possibly, from the above, some analysis should happen. I can easily do that
once we get the tracing API upstream.

> 
> For A, onionbalance will still help by adding new back-end instances
> providing intropoints via separate back-end Tor daemons, either on the
> same box or different boxes.
> 
> But it will only help up to a point. A HSDesc maxes out at 30k, so at
> some point we'll run out of space to list more intropoints in a single
> descriptor. At that point, we can still list different intropoints at
> each HSDir position, but after that, we are screwed.

Small correction: HSDesc maxes out at 50k for v3 and 20k for v2. But let's just
consider v3 for the foreseeable future :D.

[snip]

> > I would _love_ to but could be too early for that if we consider that we are
> > still unsure that this defense will be useful or not (according to Mike as a
> > discussion on IRC).
> 
> As described above, multithreading still provides a multiplier in the
> AES bottleneck case, even over onionbalance.
> 
> But, there may be more bottlenecks than just AES crypto, so this is a
> further argument for not jumping the gun just yet, and trying v1 first
> (or even a simple prototype without pow, that just cuts the queue
> arbitrarily), and getting some more profiling data.

As an initial step, I agree. Onionbalance provides an easy way for service to
outsource client 

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-04-06 Thread Mike Perry
Trimming to stuff that I just want to reply to; I otherwise agree.

Note: in a couple places I replied directly to asn's OP, because I
noticed some more questions that I could answer.

On 4/2/20 2:30 PM, David Goulet wrote:
>
> On 02 Apr (18:54:59), George Kadianakis wrote:
>> 2.2.2. Dynamic PoW
>>
>>   DoS is a dynamic problem where the attacker's capabilities constantly 
>> change,
>>   and hence we want our proof-of-work system to be dynamic and not stuck 
>> with a
>>   static difficulty setting. Hence, instead of forcing clients to go below a
>>   static target like in Bitcoin to be successful, we ask clients to "bid" 
>> using
>>   their PoW effort. Effectively, a client gets higher priority the higher
>>   effort they put into their proof-of-work. This is similar to how
>>   proof-of-stake works but instead of staking coins, you stake work.
> 
> So this means that desktop users will be prioritized over mobile users
> basically unless I make my phone use X% of battery?

Yes. We should be clear that this is not meant to be on all the time,
and that yes, it is likely to sacrifice access by mobile users,
depending on the current attack volume/difficulty level.

The Tor Browser Android UI could inform users the service is under
attack and direct them try on their desktop instead.

>>   If a client receives a descriptor with "pow-params", it should assume that
>>   the service is expecting a PoW input as part of the introduction protocol.
> 
> What happens with clients _without_ PoW support? They basically won't be able
> to connect I suppose? Or be put in the prio queue at the service at the very
> end with work done = 0?

The work_done=0 will be better. Fake work with actual work_done=0 is
just as easy to create as omitting work, for an attacker.

>> 3.4.1. PoW verification
>>
>>To verify the client's proof-of-work the service extracts (hash_output,
>>seed, nonce) from the INTRODUCE1 cell and MUST do the following steps:
>>
>>1) Make sure that the client's seed is identical to the active seed.
>>2) Check the client's nonce for replays (see [REPLAY_PROTECTION] section).
>>3) Verify that 'hash_output =?= argon2(seed, nonce)
> 
> So wait, the service also has to do the PoW for each client by computing the
> Argon2 hash for each cell? Or am I mis-understanding?

Yes, but the service side only has to run the hash once, which should be
fast.

The client/attacker is hashing many times, to search for a nonce that
will satisfy the target hash value comparison.

But oops! We forgot to list that directly above, which might have caused
the confusion:

0) Check that hash_output <= target_level
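Putting the four checks together, a sketch of the verification order (cheap
checks first, the single expensive hash last; the helper functions are
hypothetical names, not tor's API):

    #include <stdint.h>
    #include <string.h>

    int seed_is_active(const uint8_t *seed);
    int replay_cache_contains(const uint8_t *nonce);
    void replay_cache_add(const uint8_t *nonce);
    int argon2_pow(const uint8_t *seed, const uint8_t *nonce,
                   uint8_t *out32);

    static int pow_verify(const uint8_t hash_output[32],
                          const uint8_t target_level[32],
                          const uint8_t *seed, const uint8_t *nonce)
    {
        uint8_t computed[32];
        /* 0) cheap target comparison (big-endian byte order) */
        if (memcmp(hash_output, target_level, 32) > 0)
            return 0;
        /* 1) seed must be the service's active seed */
        if (!seed_is_active(seed))
            return 0;
        /* 2) reject replayed nonces */
        if (replay_cache_contains(nonce))
            return 0;
        /* 3) the one expensive step: recompute the hash once */
        if (argon2_pow(seed, nonce, computed) != 0 ||
            memcmp(computed, hash_output, 32) != 0)
            return 0;
        replay_cache_add(nonce);
        return 1;
    }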

>> 3.4.2. The Introduction Queue  [INTRO_QUEUE]
>>
>> 3.4.2.1. Adding introductions to the introduction queue
>>
>>   When PoW is enabled and a verified introduction comes through, the service
>>   instead of jumping straight into rendezvous, queues it and prioritizes it
>>   based on how much effort was devoted by the client to PoW. This means that
>>   introduction requests with high effort should be prioritized over those
>>   with low effort.
>>
>>   To do so, the service maintains an "introduction priority queue" data
>>   structure. Each element in that priority queue is an introduction request,
>>   and its priority is the effort put into its PoW:
>>
>>   When a verified introduction comes through, the service interprets the PoW
>>   hash as a 32-byte big-endian integer 'hash_int' and based on that integer
>>   it inserts it into the right position of the priority_queue: The smallest
>>   'hash_int' goes forward in the queue. If two elements have the same value,
>>   the older one has priority over the newer one.
>>   {XXX: Is this operation with 32-byte integers expensive? How to make it
>>   cheaper?}

One option: either subtract the target or AND it off, so we're comparing
difficulty relative to the target, i.e. a much smaller integer space.

In the blockchain world, the difficulty target is typically expressed as a
bitmask anyway, probably for reasons like this.
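That said, the comparison may already be cheap as-is: fixed-length
big-endian values order correctly under byte-wise comparison, so a plain
memcmp works (a sketch, not tor's code):

    #include <string.h>
    #include <stdint.h>

    /* Compare two 32-byte big-endian hash values; the smaller hash goes
     * forward in the priority queue. No bignum arithmetic needed. */
    static int hash_cmp(const uint8_t a[32], const uint8_t b[32])
    {
        return memcmp(a, b, 32);
    }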

>>   {TODO: PARAM_TUNING: If the priority queue is only ordered based on the
>>   effort, what attacks can happen in various scenarios? Do we want to order
>>   on time+effort? Which scenarios and attackers should we examine here?}
>>
>>   {TODO: PARAM_TUNING: What's the max size of the queue? How do we trim it?
>>   Can we use WRED usefully?}
> 
> I think you'll be bound by the amount of data a connection inbuf can take
> which has an upper bound of 32 cells each read event.
> 
> Then tor will have to empty at once the inbuf, queue all INTRODUCE2 cells (at
> most 32) in that priority queue and once done, we would process it until we
> return to handling the connection inbuf.
> 
> In other words, the queue size, with tor's architecture, is bound to the
> number of cells upper bound you can get when doing a recv() pass which is 32
> cells.
> 
> Nevertheless, that limit is weirdly hardcoded in tor so you should definitely
> 

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-04-02 Thread Mike Perry
Phew! This is long but excellent! Comments in-line!

On 4/2/20 10:54 AM, George Kadianakis wrote:
> Hello list,
> 
> hope everyone is safe and doing well!
> 
> I present you an initial draft of a proposal on PoW-based defences for
> onion services under DoS.
> 
> The proposal is not finished yet and it needs tuning and fixing. There
> are many places marked with XXX and TODO around the proposal that should
> be addressed.
> 
> The important part is that looking at the numbers it does seem like this
> proposal can work as a concept and serve its intended purpose. The most
> handwavey parts of the proposal right now are [INTRO_QUEUE] and
> [POW_SECURITY] and if this thing fails in the end, it's probably gonna
> be something that slipped over there. Hence, we should polish these
> sections before we proceed with any sort of engineering here.
> 
> In any case, I decided to send it to the list even in premature form, so
> that it can serve as a stable point of reference in subsequent
> discussions. It can also be found in my git repo:
> https://github.com/asn-d6/torspec/tree/pow-over-intro
> 
> Cheers and stay safe!
> 
> ---
> 
> Filename: xxx-pow-over-intro-v1
> Title: A First Take at PoW Over Introduction Circuits
> Author: George Kadianakis
> Created: 2 April 2020
> Status: Draft
> 
> 0. Abstract
> 
>   This proposal aims to thwart introduction flooding DoS attacks by
>   introducing a dynamic Proof-Of-Work protocol that occurs over introduction
>   circuits.
> 
> 1. Motivation
> 
>   So far our attempts at limiting the impact of introduction flooding DoS
>   attacks on onion services has been focused on horizontal scaling with
>   Onionbalance, optimizing the CPU usage of Tor and applying congestion 
> control
>   using rate limiting. While these measures move the goalpost forward, a core
>   problem with onion service DoS is that building rendezvous circuits is a
>   costly procedure both for the service and for the network. If we ever hope 
> to
>   have truly reachable global onion services, we need to make it harder for
>   attackers to overload the service with introduction requests.
> 
>   This proposal achieves this by allowing onion services to specify an
>   optional dynamic proof-of-work scheme that its clients need to participate
>   in if they
>   want to get served.
> 
>   With the right parameters, this proof-of-work scheme acts as a gatekeeper to
>   block amplification attacks by attackers while letting legitimate clients
>   through.
> 
> 1.1. Threat model [THREAT_MODEL]
> 
> 1.1.1. Attacker profiles [ATTACKER_MODEL]
> 
>   This proposal is written to thwart specific attackers. A simple PoW proposal
>   cannot defend against each and every DoS attack on the Internet, but there
>   are adversary models we can defend against.
> 
>   Let's start with some adversary profiles:
> 
>   "The script-kiddie"
> 
> The script-kiddie has a single computer and pushes it to its
> limits. Perhaps it also has a VPS and a pwned server. We are talking about
> an attacker with total access to 10 GHz of CPU and 10 GB of RAM. We
> consider the total cost for this attacker to be zero $.
> 
>   "The small botnet"
> 
> The small botnet is a bunch of computers lined up to do an introduction
> flooding attack. Assuming 500 medium-range computers, we are talking about
> an attacker with total access to 10 THz of CPU and 10 TB of RAM. We consider
> the upfront cost for this attacker to be about $400.
> 
>   "The large botnet"
> 
> The large botnet is a serious operation with many thousands of computers
> organized to do this attack. Assuming 100k medium-range computers, we are
> talking about an attacker with total access to 200 THz of CPU and 200 TB of
> RAM. The upfront cost for this attacker is about $36k.
>
> 1.1.2. User profiles [USER_MODEL]
> 
>   We have attackers and we have users. Here are a few user profiles:
> 
>   "The standard web user"
> 
> This is a standard laptop/desktop user who is trying to browse the
> web. They don't know how these defences work and they don't care to
> configure or tweak them. They are gonna use the default values and if the
> site doesn't load, they are gonna close their browser and be sad at Tor.
> They run a 2 GHz computer with 4 GB of RAM.
> 
>   "The motivated user"
> 
> This is a user that really wants to reach their destination. They don't
> care about the journey; they just want to get there. They know what's going
> on; they are willing to tweak the default values and make their computer do
> expensive multi-minute PoW computations to get where they want to be.
> 
>   "The mobile user"
> 
> This is a motivated user on a mobile phone. Even though they want to read the
> news article, they don't have much leeway on stressing their machine to do
> more computation.
> 
>   We hope that this proposal will allow the motivated user to always connect
>   where they 

Re: [tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-04-02 Thread David Goulet
On 02 Apr (18:54:59), George Kadianakis wrote:
> Hello list,
> 
> hope everyone is safe and doing well!
> 
> I present you an initial draft of a proposal on PoW-based defences for
> onion services under DoS.
> 
> The proposal is not finished yet and it needs tuning and fixing. There
> are many places marked with XXX and TODO around the proposal that should
> be addressed.
> 
> The important part is that looking at the numbers it does seem like this
> proposal can work as a concept and serve its intended purpose. The most
> handwavey parts of the proposal right now are [INTRO_QUEUE] and
> [POW_SECURITY] and if this thing fails in the end, it's probably gonna
> be something that slipped over there. Hence, we should polish these
> sections before we proceed with any sort of engineering here.
> 
> In any case, I decided to send it to the list even in premature form, so
> that it can serve as a stable point of reference in subsequent
> discussions. It can also be found in my git repo:
> https://github.com/asn-d6/torspec/tree/pow-over-intro
> 
> Cheers and stay safe!
> 
> ---
> 
> Filename: xxx-pow-over-intro-v1
> Title: A First Take at PoW Over Introduction Circuits
> Author: George Kadianakis
> Created: 2 April 2020
> Status: Draft
> 
> 0. Abstract
> 
>   This proposal aims to thwart introduction flooding DoS attacks by
>   introducing a dynamic Proof-Of-Work protocol that occurs over introduction
>   circuits.
> 
> 1. Motivation
> 
>   So far our attempts at limiting the impact of introduction flooding DoS
>   attacks on onion services have been focused on horizontal scaling with
>   Onionbalance, optimizing the CPU usage of Tor and applying congestion
>   control using rate limiting. While these measures move the goalpost
>   forward, a core
>   problem with onion service DoS is that building rendezvous circuits is a
>   costly procedure both for the service and for the network. If we ever hope
>   to have truly reachable global onion services, we need to make it harder
>   for
>   attackers to overload the service with introduction requests.
> 
>   This proposal achieves this by allowing onion services to specify an
>   optional dynamic proof-of-work scheme that its clients need to participate
>   in if they
>   want to get served.
> 
>   With the right parameters, this proof-of-work scheme acts as a gatekeeper to
>   block amplification attacks by attackers while letting legitimate clients
>   through.
> 
> 1.1. Threat model [THREAT_MODEL]
> 
> 1.1.1. Attacker profiles [ATTACKER_MODEL]
> 
>   This proposal is written to thwart specific attackers. A simple PoW proposal
>   cannot defend against each and every DoS attack on the Internet, but there
>   are adversary models we can defend against.
> 
>   Let's start with some adversary profiles:
> 
>   "The script-kiddie"
> 
> The script-kiddie has a single computer and pushes it to its
> limits. Perhaps it also has a VPS and a pwned server. We are talking about
> an attacker with total access to 10 GHz of CPU and 10 GB of RAM. We
> consider the total cost for this attacker to be zero $.
> 
>   "The small botnet"
> 
> The small botnet is a bunch of computers lined up to do an introduction
> flooding attack. Assuming 500 medium-range computers, we are talking about
> an attacker with total access to 10 THz of CPU and 10 TB of RAM. We consider
> the upfront cost for this attacker to be about $400.
> 
>   "The large botnet"
> 
> The large botnet is a serious operation with many thousands of computers
> organized to do this attack. Assuming 100k medium-range computers, we are
> talking about an attacker with total access to 200 THz of CPU and 200 TB of
> RAM. The upfront cost for this attacker is about $36k.
> 
>   We hope that this proposal can help us defend against the script-kiddie
>   attacker and small botnets. To defend against a large botnet we would need
>   more tools in our disposal (see [FUTURE_WORK]).
> 
>   {XXX: Do the above make sense? What other attackers do we care about? What
> other metrics do we care about? Network speed? I got the botnet costs
> from here [REF_BOTNET] Back up our claims of defence.}
> 
> 1.1.2. User profiles [USER_MODEL]
> 
>   We have attackers and we have users. Here are a few user profiles:
> 
>   "The standard web user"
> 
> This is a standard laptop/desktop user who is trying to browse the
> web. They don't know how these defences work and they don't care to
> configure or tweak them. They are gonna use the default values and if the
> site doesn't load, they are gonna close their browser and be sad at Tor.
> They run a 2 GHz computer with 4 GB of RAM.
> 
>   "The motivated user"
> 
> This is a user that really wants to reach their destination. They don't
> care about the journey; they just want to get there. They know what's going
> on; they are willing to tweak the default values and make their computer do

[tor-dev] [RFC] Proposal: A First Take at PoW Over Introduction Circuits

2020-04-02 Thread George Kadianakis
Hello list,

hope everyone is safe and doing well!

I present you an initial draft of a proposal on PoW-based defences for
onion services under DoS.

The proposal is not finished yet and it needs tuning and fixing. There
are many places marked with XXX and TODO around the proposal that should
be addressed.

The important part is that looking at the numbers it does seem like this
proposal can work as a concept and serve its intended purpose. The most
handwavey parts of the proposal right now are [INTRO_QUEUE] and
[POW_SECURITY] and if this thing fails in the end, it's probably gonna
be something that slipped over there. Hence, we should polish these
sections before we proceed with any sort of engineering here.

In any case, I decided to send it to the list even in premature form, so
that it can serve as a stable point of reference in subsequent
discussions. It can also be found in my git repo:
https://github.com/asn-d6/torspec/tree/pow-over-intro

Cheers and stay safe!

---

Filename: xxx-pow-over-intro-v1
Title: A First Take at PoW Over Introduction Circuits
Author: George Kadianakis
Created: 2 April 2020
Status: Draft

0. Abstract

  This proposal aims to thwart introduction flooding DoS attacks by introducing
  a dynamic Proof-Of-Work protocol that occurs over introduction circuits.

1. Motivation

  So far our attempts at limiting the impact of introduction flooding DoS
  attacks on onion services have been focused on horizontal scaling with
  Onionbalance, optimizing the CPU usage of Tor and applying congestion control
  using rate limiting. While these measures move the goalpost forward, a core
  problem with onion service DoS is that building rendezvous circuits is a
  costly procedure both for the service and for the network. If we ever hope to
  have truly reachable global onion services, we need to make it harder for
  attackers to overload the service with introduction requests.

  This proposal achieves this by allowing onion services to specify an optional
  dynamic proof-of-work scheme that its clients need to participate in if they
  want to get served.

  With the right parameters, this proof-of-work scheme acts as a gatekeeper to
  block amplification attacks by attackers while letting legitimate clients
  through.

1.1. Threat model [THREAT_MODEL]

1.1.1. Attacker profiles [ATTACKER_MODEL]

  This proposal is written to thwart specific attackers. A simple PoW proposal
  cannot defend against each and every DoS attack on the Internet, but there
  are adversary models we can defend against.

  Let's start with some adversary profiles:

  "The script-kiddie"

The script-kiddie has a single computer and pushes it to its
limits. Perhaps it also has a VPS and a pwned server. We are talking about
an attacker with total access to 10 GHz of CPU and 10 GB of RAM. We
consider the total cost for this attacker to be zero $.

  "The small botnet"

The small botnet is a bunch of computers lined up to do an introduction
flooding attack. Assuming 500 medium-range computers, we are talking about
an attacker with total access to 10 THz of CPU and 10 TB of RAM. We consider
the upfront cost for this attacker to be about $400.

  "The large botnet"

The large botnet is a serious operation with many thousands of computers
organized to do this attack. Assuming 100k medium-range computers, we are
talking about an attacker with total access to 200 THz of CPU and 200 TB of
RAM. The upfront cost for this attacker is about $36k.

  We hope that this proposal can help us defend against the script-kiddie
  attacker and small botnets. To defend against a large botnet we would need
  more tools in our disposal (see [FUTURE_WORK]).

  {XXX: Do the above make sense? What other attackers do we care about? What
other metrics do we care about? Network speed? I got the botnet costs
from here [REF_BOTNET] Back up our claims of defence.}

1.1.2. User profiles [USER_MODEL]

  We have attackers and we have users. Here are a few user profiles:

  "The standard web user"

This is a standard laptop/desktop user who is trying to browse the
web. They don't know how these defences work and they don't care to
configure or tweak them. They are gonna use the default values and if the
site doesn't load, they are gonna close their browser and be sad at Tor.
They run a 2 GHz computer with 4 GB of RAM.

  "The motivated user"

This is a user that really wants to reach their destination. They don't
care about the journey; they just want to get there. They know what's going
on; they are willing to tweak the default values and make their computer do
expensive multi-minute PoW computations to get where they want to be.

  "The mobile user"

This is a motivated user on a mobile phone. Even though they want to read the
news article, they don't have much leeway on stressing their machine to do
more computation.

  We hope that this proposal