Closer inspection of HC-128 reveals that it uses secret-dependent array indices[*], so for that reason alone I don't think we should adopt it. It also has a very large state, which is going to hurt the cache on big systems and hurt the stack on little systems.
So, could you please split your changes into three parts, along the following lines? That will make it easier to discuss and measure the parts separately. (1) Isolate the algorithm from the cprng_fast bookkeeping (currently the cprng_fast API is separate, but the bookkeeping is embedded in arc4random.c). (2) Convert the bookkeeping from one global state to per-CPU states. (3) Switch algorithms. [*] In hc128_init, key flows into w which flows into p/q which flow into round_expression which does h(qp, pq[...]) which does qp[pq[...] & 0xff].