To answer your question on stackexchange: the way a compressor would guess
that a binary string is made up of 2-bit tokens with different frequencies
is to train on a context that includes both the previous bit and the bit
position mod 2. PAQ has models like this at the byte level for lengths 2, 3,
and higher.
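A minimal sketch of the idea (this is not PAQ itself, just an illustration I
am assuming for this thread): predict each bit from the context
(previous bit, bit position mod 2) with simple counts, and measure the ideal
arithmetic-code cost. On a string of skewed 2-bit tokens, the cost comes out
well under 1 bit per bit:

```python
import math
import random

def code_length(bits):
    # counts[context] = [count of 0s + 1, count of 1s + 1] (Laplace smoothing)
    counts = {}
    total = 0.0
    prev = 0
    for i, b in enumerate(bits):
        ctx = (prev, i % 2)                # previous bit and position mod 2
        c = counts.setdefault(ctx, [1, 1])
        p = c[b] / (c[0] + c[1])           # predicted probability of the bit seen
        total += -math.log2(p)             # ideal arithmetic-code cost in bits
        c[b] += 1
        prev = b
    return total

# A string of 2-bit tokens with different frequencies: "01" common, "10" rare.
random.seed(0)
tokens = random.choices(["01", "10"], weights=[9, 1], k=5000)
bits = [int(ch) for tok in tokens for ch in tok]
print(code_length(bits), "coded bits for", len(bits), "input bits")
```

The second bit of each token is determined by the first, so the model drives
its cost toward the token entropy (about 0.47 bits per token here) once the
counts adapt.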
https://twitter.com/jabowery/status/1760015755792294174
https://youtu.be/zduSFxRajkE
On Tue, Nov 21, 2023 at 7:20 PM Matt Mahoney wrote:
I started the large text benchmark in 2006
(https://mattmahoney.net/dc/text.html) with the claim that all you
need to pass the Turing test is text prediction, which you can measure
with compression. Both the benchmark and the Hutter Prize use the same
1 GB text file (enwik9), with the goal of