Re: Compression of random binary data

2018-04-12 Thread Ned Batchelder

On 4/11/18 9:29 PM, cuddlycave...@gmail.com wrote:

I’m replying to your post on January 28th.
Nice carefully chosen non-random numbers, Steven D'Aprano.
Was just doing what you asked, but you don’t remember.


Best practice is to include a quote of the thing you are replying to.  
It makes it much easier for people to follow the thread of the 
discussion, especially when there are large gaps in the timeline.


--Ned.
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2018-04-11 Thread cuddlycaveman
I’m replying to your post on January 28th.
Nice carefully chosen non-random numbers, Steven D'Aprano.
Was just doing what you asked, but you don’t remember.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2018-04-11 Thread Steven D'Aprano
On Tue, 10 Apr 2018 23:36:27 -0700, cuddlycaveman wrote:

[snip a number of carefully chosen, non-random numbers shown in binary]

> Don’t know if that helps


Helps what?

With no context, we don't know who you are replying to, what they asked, 
or why you think this is helpful.

According to my archives, such as they are, you appear to be responding 
to a post made in either July 2016 or October 2017. 


-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2018-04-11 Thread cuddlycaveman
387420479
00110011 00111000 00110111 00110100 00110010 00110000 00110100 00110111 00111001
72 bits

Equal to 
(9^9)-10
00101000 00111001 01011110 00111001 00101001 00101101 00110001 00110000
64 bits 

387420499
00110011 00111000 00110111 00110100 00110010 00110000 00110100 00111001 00111001
72 bits

Equal to 
(9^9)+10
00101000 00111001 01011110 00111001 00101001 00101011 00110001 00110000
64 bits

Or 
387,420,479 387,420,499
00110011 00111000 00110111 00101100 00110100 00110010 00110000 00101100 
00110100 00110111 00111001 00100000 00110011 00111000 00110111 00101100 
00110100 00110010 00110000 00101100 00110100 00111001 00111001
184 bits

Equal to
((9^9)-10) ((9^9)+10)
00101000 00101000 00111001 01011110 00111001 00101001 00101101 00110001 
00110000 00101001 00100000 00101000 00101000 00111001 01011110 00111001 
00101001 00101011 00110001 00110000 00101001
168 bits

Or
387420479
387420499
00110011 00111000 00110111 00110100 00110010 00110000 00110100 00110111 
00111001 00001010 00110011 00111000 00110111 00110100 00110010 00110000 
00110100 00111001 00111001
152 bits

Equal to
(9^9)-10
(9^9)+10
00101000 00111001 01011110 00111001 00101001 00101101 00110001 00110000 00001010 
00101000 00111001 01011110 00111001 00101001 00101011 00110001 00110000
136 bits

Don’t know if that helps


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2018-01-28 Thread Peter Pearson
On Sat, 27 Jan 2018 21:26:06 -0800 (PST), pendrysamm...@gmail.com wrote:
> If it is then show him this
>
> 387,420,489
> =
> 00110011 00111000 00110111 00101100 00110100 00110010 00110000 00101100 ...

To save the casual reader a moment of disorientation, the
above binary string is just the ASCII representation of the
text string "387,420,489".
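
A quick sketch in Python reproduces the observation; the 08b format
pads each character's code point to the 8-bit groups quoted above:

    text = "387,420,489"
    print(' '.join(f'{ord(c):08b}' for c in text))  # 00110011 00111000 ...
    print(len(text) * 8)                            # 88 bits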

> 9^9 = ⬇️ (^ = to the power of)
> = 387,420,489
>
> But
>
> 9^9
> =
> 00111001 01011110 00111001

Similarly, this is the ASCII representation of "9^9".  Our
self-confessedly intermittently sober correspondent appears to have
discovered the backside of the principle that a short text expression
can generate a large number.

-- 
To email me, substitute nowhere->runbox, invalid->com.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data (Posting On Python-List Prohibited)

2018-01-28 Thread Grant Edwards
On 2018-01-28, pendrysamm...@gmail.com  wrote:

> I have it in my head, just need someone to write the program for me,
> I know nothing about data compression or binary data other than 1s
> and 0s and that you can not take 2 numbers without a possible value
> more or less than themselves and compress them, I have been working
> for 1 1/2 years on a solution, just need help with programming. 
>
> If someone could get ahold of me when I am sober I could probably
> explain it a lot better, but I said fuck it I’ll show everyone a
> possibility instead of trying to do it on my own.

Wow...

Just... wow.

-- 
Grant





-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2018-01-28 Thread Steven D'Aprano
On Sat, 27 Jan 2018 21:50:24 -0800, pendrysammuel wrote:

> 387,420,489 is a number with only 2 repeating binary sequences

Okay. Now try these two numbers:

387420479
387420499



-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2018-01-28 Thread Steven D'Aprano
On Sat, 27 Jan 2018 22:14:46 -0800, pendrysammuel wrote:

> I have it in my head, just need someone to write the program for me, 

Sure, my rate is $150 an hour.


> I
> know nothing about data compression or binary data other than 1s and 0s
> and that you can not take 2 numbers without a possible value more or less
> than themselves and compress them, I have been working for 1 1/2 years
> on a solution, just need help with programming.

Sounds like you should have spent two hours learning a bit about data 
compression (start with Wikipedia) instead of wasting 1.5 years trying to 
guess a solution for something you know nothing about.


> If someone could get ahold of me when I am sober

O_o

Make that $550 an hour. Payable in advance.



-- 
Steve

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data (Posting On Python-List Prohibited)

2018-01-27 Thread pendrysammuel
Lawrence D’Oliveiro

In other words yes, I just need to be sober first.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data (Posting On Python-List Prohibited)

2018-01-27 Thread pendrysammuel
I have it in my head, just need someone to write the program for me, I know 
nothing about data compression or binary data other than 1s and 0s and that you 
can not take 2 numbers without a possible value more or less than themselves 
and compress them, I have been working for 1 1/2 years on a solution, just need 
help with programming. 

If someone could get ahold of me when I am sober I could probably explain it a 
lot better, but I said fuck it I’ll show everyone a possibility instead of 
trying to do it on my own.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2018-01-27 Thread pendrysammuel
387,420,489 is a number with only 2 repeating binary sequences

In binary 387,420,489 is expressed as 00110011 00111000 00110111 00101100 
00110100 00110010 00110000 00101100 00110100 00111000 00111001

387,420,489 can be simplified to 9^9, or nine to the power of nine

In binary 9^9 is represented by 00111001 01011110 00111001

9^9 and 387,420,489 are the same thing; they are both a representation of 
numbers or data, yet in binary 9^9 is 64 bits shorter than 387,420,489 

Not a solution just help.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2018-01-27 Thread Chris Angelico
On Sun, Jan 28, 2018 at 4:26 PM,   wrote:
> If it is then show him this
>
> 387,420,489
> =
> 00110011 00111000 00110111 00101100 00110100 00110010 00110000 00101100 
> 00110100 00111000 00111001
>
> 9^9 = ⬇️ (^ = to the power of)
> = 387,420,489
>
> But
>
> 9^9
> =
> 00111001 01011110 00111001

I have no idea what you're responding to or how this has anything to
do with the thread you're replying to, but I would advise you to look
into the difference between exponentiation and exclusive-or. Even so,
I don't know what result you got.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Compression of random binary data

2018-01-27 Thread pendrysammuel
If it is then show him this

387,420,489
=
00110011 00111000 00110111 00101100 00110100 00110010 00110000 00101100 
00110100 00111000 00111001

9^9 = ⬇️ (^ = to the power of)
= 387,420,489

But

9^9
=
00111001 01011110 00111001
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-29 Thread Ben Bacarisse
Gregory Ewing  writes:

> Ben Bacarisse wrote:
>> But that has to be about the process that gives rise to the data, not
>> the data themselves.
>
>> If I say: "here is some random data..." you can't tell if it is or is
>> not from a random source.  I can, as a parlour trick, compress and
>> recover this "random data" because I chose it.
>
> Indeed. Another way to say it is that you can't conclude
> anything about the source from a sample size of one.
>
> If you have a large enough sample, then you can estimate
> a probability distribution, and calculate an entropy.
>
>> I think the argument that you can't compress arbitrary data is simpler
>> ...  it's obvious that it includes the results of previous
>> compressions.
>
> What? I don't see how "results of previous compressions" comes
> into it. The source has an entropy even if you're not doing
> compression at all.

Maybe we are talking at cross purposes.  A claim to be able to compress
arbitrary data leads immediately to the problem that iterating the
compression will yield zero-size results.  That, to me, is a simpler
argument than talking about data from a random source.
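
The iteration argument is easy to demonstrate with a real compressor.
A sketch using Python's standard zlib module (nothing below is
specific to zlib; any lossless compressor behaves this way on
already-incompressible input):

    import os
    import zlib

    data = os.urandom(1 << 16)   # 64 KiB from an OS randomness source
    for i in range(4):
        data = zlib.compress(data)
        print(i, len(data))      # the "compressed" data grows a little each pass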

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-29 Thread Steve D'Aprano
On Sun, 29 Oct 2017 01:56 pm, Stefan Ram wrote:

>   If the entropy of an individual message is not defined,
>   then it is still available to be defined. I define it
>   to be log2(1/p), where p is the probability of this
>   message. I also choose a unit for it, which I call "bit".

That is exactly the definition of self-information:

https://en.wikipedia.org/wiki/Self-information

See also:

https://en.wikipedia.org/wiki/Entropy_(information_theory)

which lists several forms of related measures of information:

- the self-information of an individual message or symbol taken from a 
  given probability distribution;

- the entropy of a given probability distribution of messages or symbols;

- the entropy rate of a stochastic process.


It also suggests a connection between information entropy and thermodynamic
entropy, namely that the information entropy of a system is the amount of
additional information needed to determine the microstate of a system (the
states of all its particles), given the macrostate (identified by the bulk
thermodynamic parameters such as temperature, volume, energy).

More here: 

https://physics.stackexchange.com/questions/263197/is-information-entropy-the-same-as-thermodynamic-entropy




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-29 Thread Steve D'Aprano
On Sun, 29 Oct 2017 06:03 pm, Chris Angelico wrote:

> On Sun, Oct 29, 2017 at 6:00 PM, Ian Kelly  wrote:
>> On Oct 28, 2017 5:53 PM, "Chris Angelico"  wrote:
>>> One bit. It might send the message, or it might NOT send the message.
>>
>> Not sending the message is equivalent to having a second possible message.
> 
> Okay, now we're getting seriously existential. Is a non-message a message?

Is zero a number?

Is bald a hair colour?

Is atheism a religion?

Aristotle seriously argued that the smallest whole number was two. Zero was
clearly nothing at all, and one wasn't a *number*, it was *unity*. A number
is, by definition, a multitude of units.

https://philosophy.stackexchange.com/questions/19533/why-does-aristotle-suggest-one-is-not-a-number

http://classics.mit.edu/Aristotle/metaphysics.14.xiv.html

No message can be a message. British nuclear submarines during the Cold War
had orders that if they failed to receive any transmissions from London
within a certain time period, they were to crack open the secret orders from
the Prime Minister. Nobody knows what is in the orders, as they were
destroyed when the PM left office, but they are popularly supposed to have
included:

- fire your missiles at Russia;

- surrender to whoever won the war;

- try to flee to Australia or New Zealand and join up with whatever remnants
  of the British Empire still exist;

- do whatever you want.

It makes a good, but ultimately futile, exercise to try to guess what the
various PMs would have said.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-29 Thread Steve D'Aprano
On Sun, 29 Oct 2017 02:31 pm, Gregory Ewing wrote:

> Steve D'Aprano wrote:
>> I don't think that's right. The entropy of a single message is a
>> well-defined quantity, formally called the self-information.
>> 
>> https://en.wikipedia.org/wiki/Self-information
> 
> True, but it still depends on knowing (or assuming) the
> probability of getting that particular message out of
> the set of all possible messages.

Indeed.


> This is *not* what danceswithnumbers did when he
> calculated the "entropy" of his example bit sequences.
> He didn't define the set they were drawn from or
> what their probabilities were.

I'm not defending or supporting Danceswithnumbers in his confusion. I'm just
pointing out that an entropy-like measure of information for individual
messages does exist, and people do often call it "entropy".



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-29 Thread Gregory Ewing

Chris Angelico wrote:

One bit. It might send the message, or it might NOT send the message.


The entropy formula assumes that you are definitely
going to send one of the possible messages. If not
sending a message is a possibility, then you need
to include an empty message in the set of messages.

Another way to think about it: The receiver can be
in one of N possible states, and you want to ensure
that it's in a particular state. To do that, there
must be N possible messages that you can send.
If N = 1, the receiver is already in the right
state, so you don't need to send it a message at
all.

Example: If the message is "I love you", and your
SO already knows that, you don't need to tell
them. (Not according to information theory, at
least.)

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-29 Thread Chris Angelico
On Sun, Oct 29, 2017 at 6:00 PM, Ian Kelly  wrote:
> On Oct 28, 2017 5:53 PM, "Chris Angelico"  wrote:
>> One bit. It might send the message, or it might NOT send the message.
>
> Not sending the message is equivalent to having a second possible message.

Okay, now we're getting seriously existential. Is a non-message a message?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-29 Thread Ian Kelly
On Oct 28, 2017 5:53 PM, "Chris Angelico"  wrote:
> One bit. It might send the message, or it might NOT send the message.

Not sending the message is equivalent to having a second possible message.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Chris Angelico
On Sun, Oct 29, 2017 at 2:08 PM, Gregory Ewing
 wrote:
> Stefan Ram wrote:
>>
>>   Well, then one can ask about the entropy of a data source
>>   that only is emitting this message.
>
>
> You can, but it's still the *source* that has the entropy,
> not the message.
>
> (And the answer in that case is that the entropy is zero.
> If there's only one possible message you can ever send, you
> don't need to send it at all!)

One bit. It might send the message, or it might NOT send the message.

And I have known situations in which that is exactly the system used.
Back before mobile phones were common, you could sometimes use a
payphone to cause someone's phone to ring, but you couldn't actually
speak on it. So you had one message you could send: "bring
bring". A pre-arranged meaning for that message might be "I'm at
the railway station, please come and pick me up"... but there's still
*some* information in the mere fact of the call.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Gregory Ewing

Steve D'Aprano wrote:

I don't think that's right. The entropy of a single message is a well-defined
quantity, formally called the self-information. 


https://en.wikipedia.org/wiki/Self-information


True, but it still depends on knowing (or assuming) the
probability of getting that particular message out of
the set of all possible messages.

This is *not* what danceswithnumbers did when he
calculated the "entropy" of his example bit sequences.
He didn't define the set they were drawn from or
what their probabilities were.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Gregory Ewing

Stefan Ram wrote:

  Well, then one can ask about the entropy of a data source
  that only is emitting this message.


You can, but it's still the *source* that has the entropy,
not the message.

(And the answer in that case is that the entropy is zero.
If there's only one possible message you can ever send, you
don't need to send it at all!)

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Chris Angelico
On Sun, Oct 29, 2017 at 1:32 PM, Chris Angelico  wrote:
> On Sun, Oct 29, 2017 at 1:18 PM, Gregory Ewing
>  wrote:
>> You're missing something fundamental about what
>> entropy is in information theory.
>>
>> It's meaningless to talk about the entropy of a single
>> message. Entropy is a function of the probability
>> distribution of *all* the messages you might want to
>> send.
>
> Which is where a lot of "password strength" confusion comes from. How
> much entropy is there in the password "U3ZINVp3PT0="? Strong password
> or weak? What about "dark-next-sure-secret"?
> "with-about-want-really-going"? They were generated by, respectively:
> double-MIME-encoding four bytes from /dev/random (32 bits of entropy),
> picking four words from the top 1024 (40 bits), and picking 5 words
> from the top 64 (30 bits). But just by looking at the passwords
> themselves, you can't tell that.

To clarify: The "top 1024 words" concept is XKCD 936 style password
generation, using my tabletop gaming room's resident parrot. So it's
based on the frequency of words used by D&D players. YMMV if you use a
different corpus :)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Chris Angelico
On Sun, Oct 29, 2017 at 1:18 PM, Gregory Ewing
 wrote:
> You're missing something fundamental about what
> entropy is in information theory.
>
> It's meaningless to talk about the entropy of a single
> message. Entropy is a function of the probability
> distribution of *all* the messages you might want to
> send.

Which is where a lot of "password strength" confusion comes from. How
much entropy is there in the password "U3ZINVp3PT0="? Strong password
or weak? What about "dark-next-sure-secret"?
"with-about-want-really-going"? They were generated by, respectively:
double-MIME-encoding four bytes from /dev/random (32 bits of entropy),
picking four words from the top 1024 (40 bits), and picking 5 words
from the top 64 (30 bits). But just by looking at the passwords
themselves, you can't tell that.
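
A sketch of the three generators; the word list is a hypothetical
stand-in, where the real one would hold the 1024 most frequent words
from your corpus:

    import base64, math, secrets

    raw = secrets.token_bytes(4)                # 32 bits of process entropy
    pw1 = base64.b64encode(base64.b64encode(raw)).decode()

    words = [f'word{i}' for i in range(1024)]   # hypothetical word list
    pw2 = '-'.join(secrets.choice(words) for _ in range(4))
    pw3 = '-'.join(secrets.choice(words[:64]) for _ in range(5))

    print(math.log2(1024) * 4)   # 40.0 bits
    print(math.log2(64) * 5)     # 30.0 bits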

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Gregory Ewing

Steve D'Aprano wrote:


Random data = any set of data generated by "a source of random".


Any set of data generated by Grant Thompson?

https://www.youtube.com/user/01032010814

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Gregory Ewing

danceswithnumb...@gmail.com wrote:

1111000010101011
This equals
61611
This can be represented using
0-6 log2(7)*5= 14.0367746103 bits


1101010100001111
This equals 
54543

This can be represented using
0-5 log2(6)*5= 12.9248125036 bits


You're missing something fundamental about what
entropy is in information theory.

It's meaningless to talk about the entropy of a single
message. Entropy is a function of the probability
distribution of *all* the messages you might want to
send.

What you calculated in your first example relates to
this situation: You want to send someone messages
consisting of 5 symbols drawn from the set {0, 1,
2, 3, 4, 5, 6}, where all such messages have equal
probability. In that case, you need an average of
about 14.03 bits for each message.

Note that this has essentially nothing to do with
the particular sequence of bits you started with.

Your second calculation was for a similar situation,
except that the set of symbols is just {0, 1, 2,
3, 4, 5}. There are fewer messages of length 5 that
can be constructed from that set, so the number of
bits needed is smaller.
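
Both figures come straight from the size of the message space:

    import math
    print(math.log2(7) * 5)   # ~14.04 bits: 5 symbols drawn from 0-6
    print(math.log2(6) * 5)   # ~12.92 bits: 5 symbols drawn from 0-5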


In reality you can express 54543 with 10 bits.


Again, this statement is meaningless. You can't
say *anything* about the number of bits needed to
represent that particular number, without knowing
what *other* numbers you might want to represent.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Gregory Ewing

Ben Bacarisse wrote:

But that has to be about the process that gives rise to the data, not
the data themselves.



If I say: "here is some random data..." you can't tell if it is or is
not from a random source.  I can, as a parlour trick, compress and
recover this "random data" because I chose it.


Indeed. Another way to say it is that you can't conclude
anything about the source from a sample size of one.

If you have a large enough sample, then you can estimate
a probability distribution, and calculate an entropy.


I think the argument that you can't compress arbitrary data is simpler
...  it's obvious that it includes the results of previous
compressions.


What? I don't see how "results of previous compressions" comes
into it. The source has an entropy even if you're not doing
compression at all.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Ian Kelly
On Oct 28, 2017 10:30 AM, "Stefan Ram"  wrote:
> Well, then one can ask about the entropy of a data source
> that only is emitting this message. (If it needs to be endless:
> that only is emitting this message repeatedly.)

If there is only one possible message then the entropy is zero.

-1.0 * log2(1.0) == 0
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Steve D'Aprano
On Sun, 29 Oct 2017 07:03 am, Peter Pearson wrote:

> On Thu, 26 Oct 2017 19:26:11 -0600, Ian Kelly  wrote:
>>
>> . . . Shannon entropy is correctly calculated for a data source,
>> not an individual message . . .
> 
> Thank you; I was about to make the same observation.  When
> people talk about the entropy of a particular message, you
> can bet they're headed for confusion.

I don't think that's right. The entropy of a single message is a well-defined
quantity, formally called the self-information. The entropy of a data source
is the expected value of the self-information of all possible messages coming
from that data source.

https://en.wikipedia.org/wiki/Self-information

We can consider the entropy of a data source as analogous to the population
mean, and the entropy of a single message as the sample mean. A single
message is like a sample taken from the set of all possible messages.

Self-information is sometimes also called "surprisal" as it is a measure of
how surprising an event is. If your data source is a coin toss, then actually
receiving a Heads has a self-information ("entropy") of 1 bit. That's not
very surprising. If your data source is a pair of fair, independent dice,
then the self-information of receiving a two and a four is 5.170 bits. It's a
logarithmic scale, not linear: if the probability of a message or event is p,
the self-information of that event is log_2 (1/p) bits.
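
A sketch of that measure in Python:

    import math

    def self_information(p):
        """Surprisal of an event with probability p, in bits."""
        return math.log2(1 / p)

    print(self_information(1/2))    # heads on a fair coin: 1.0 bit
    print(self_information(1/36))   # a given roll of two dice: ~5.170 bits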



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Peter Pearson
On Thu, 26 Oct 2017 19:26:11 -0600, Ian Kelly  wrote:
>
> . . . Shannon entropy is correctly calculated for a data source,
> not an individual message . . .

Thank you; I was about to make the same observation.  When
people talk about the entropy of a particular message, you
can bet they're headed for confusion.

-- 
To email me, substitute nowhere->runbox, invalid->com.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Ben Bacarisse
Steve D'Aprano  writes:

> On Fri, 27 Oct 2017 09:53 am, Ben Bacarisse wrote:
>
>> A source of random can be defined but "random data" is much more
>> elusive.
>
> Random data = any set of data generated by "a source of random".

(I had an editing error there; it should be "a source of random data".)

Yes, that's a fine definition, but it has the disadvantage of not being
a verifiable property of the thing defined -- you can't know, from the
data themselves, if they constitute random data.  You would not care
about a compression program that worked on some data that looks random,
you'd want to present your own data for compression (and then you can
use a random source with confidence because the data are yours).  That's
the big win (for me) of talking about "arbitrary data".

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-28 Thread Steve D'Aprano
On Fri, 27 Oct 2017 09:53 am, Ben Bacarisse wrote:

> A source of random can be defined but "random data" is much more
> elusive.

Random data = any set of data generated by "a source of random".





-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-27 Thread Ian Kelly
On Thu, Oct 26, 2017 at 8:48 PM,   wrote:
> Shouldn't that be?
>
> py> 16 * (-7/16 * math.log2(7/16) - 6/16 * math.log2(6/16)) =

No, that's failing to account for 3/16 of the probability space.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-27 Thread Ian Kelly
On Thu, Oct 26, 2017 at 8:19 PM,   wrote:
> It looks like that averages my two examples.

I don't know how you can look at two numbers and then look at a third
number that is larger than both of them and conclude it is the
average.

> Hmmm by the way that equation is really cool... why does it return a high 
> bit count when compared to >>>dec to bin?

I don't understand what you're asking. The binary representation of
either of those numbers is 16 bits, which is larger than my 15.8.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-27 Thread Ben Bacarisse
Marko Rauhamaa  writes:

> Ben Bacarisse :
>
>>> In this context, "random data" really means "uniformly distributed
>>> data", i.e. any bit sequence is equally likely to be presented as
>>> input. *That's* what information theory says can't be compressed.
>>
>> But that has to be about the process that gives rise to the data, not
>> the data themselves.  No finite collection of bits has the property you
>> describe.
>
> Correct. Randomness is meaningful only in reference to future events.
> Once the events take place, they cease to be random.
>
> A finite, randomly generated collection of bits can be tested against a
> probabilistic hypothesis using statistical methods.

But beware of parlour tricks.  You can't reliably test for random
looking data that are, in fact, carefully crafted.  If the claim is
about compressing arbitrary data, then the claimant won't mind testing
inputs chosen by me!  The only reason this matters is that such people
usually won't reveal the algorithm.

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-26 Thread Marko Rauhamaa
Ben Bacarisse :

>> In this context, "random data" really means "uniformly distributed
>> data", i.e. any bit sequence is equally likely to be presented as
>> input. *That's* what information theory says can't be compressed.
>
> But that has to be about the process that gives rise to the data, not
> the data themselves.  No finite collection of bits has the property you
> describe.

Correct. Randomness is meaningful only in reference to future events.
Once the events take place, they cease to be random.

A finite, randomly generated collection of bits can be tested against a
probabilistic hypothesis using statistical methods.
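
For instance, a sketch of the simplest such test, the "monobit"
frequency test from NIST SP 800-22, which asks whether the balance of
1s and 0s is plausible for a fair source (a crafted string like
"1010...10" still passes this one test, of course):

    import math

    def monobit_p_value(bits: str) -> float:
        n = len(bits)
        s = abs(sum(1 if b == '1' else -1 for b in bits))
        return math.erfc(s / math.sqrt(2 * n))   # tiny p => reject "random"

    print(monobit_p_value('1' * 100))   # ~0.0: clearly biased
    print(monobit_p_value('10' * 50))   # 1.0: perfectly balanced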


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-26 Thread danceswithnumbers
Shouldn't that be?

py> 16 * (-7/16 * math.log2(7/16) - 6/16 * math.log2(6/16)) =
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-26 Thread danceswithnumbers
It looks like that averages my two examples. Hmmm by the way that equation is 
really cool... why does it return a high bit count when compared to >>>dec to 
bin?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-26 Thread Ian Kelly
On Thu, Oct 26, 2017 at 2:38 PM,   wrote:
>
> Thomas Jollans
>
> On 2017-10-25 23:22, danceswi...@gmail.com wrote:
>> With every transform the entropy changes,
>
> That's only true if the "transform" loses or adds information.
>
> If it loses information, that's lossy compression, which is only useful
> in very specific (but also extremely common) circumstances.
>
> If it adds information, that's poetry, not compression.
>
>
>
> Not true! You can transform a stream losslessly, and change its entropy without 
> adding to or taking away. These two 16-bit streams are the same but one is 
> reversed.
>
> 1111000010101011
> This equals
> 61611
> This can be represented using
> 0-6 log2(7)*5= 14.0367746103 bits
>
>
> 1101010100001111
> This equals
> 54543
> This can be represented using
> 0-5 log2(6)*5= 12.9248125036 bits
>
> In reality you can express 54543 with 10 bits.

I don't know what you're calculating here but it isn't Shannon
entropy. Shannon entropy is correctly calculated for a data source,
not an individual message, but if we assume the two numbers above to
be the result of a Bernoulli process with probabilities matching the
frequencies of the bits in the numbers, then the total entropy of 16
events is:

py> 16 * (-9/16 * math.log2(9/16) - 7/16 * math.log2(7/16))
15.81919053261596

Approximately 15.8 bits. This is the same no matter what order the
events occur in.
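
A sketch of the same computation as a function; only the bit
frequencies enter into it, so the reversed stream gives the identical
value:

    import math
    from collections import Counter

    def total_entropy(bits: str) -> float:
        n = len(bits)
        return n * -sum(c/n * math.log2(c/n) for c in Counter(bits).values())

    print(total_entropy('1111000010101011'))   # ~15.82 bits, as above
    print(total_entropy('1101010100001111'))   # same value for the reverse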
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-26 Thread Ben Bacarisse
Gregory Ewing  writes:

> Ben Bacarisse wrote:
>> The trouble is a pedagogic one.  Saying "you can't compress random data"
>> inevitably leads (though, again, this is just my experience) to endless
>> attempts to define random data.
>
> It's more about using terms without making sure everyone agrees
> on the definitions being used.
>
> In this context, "random data" really means "uniformly distributed
> data", i.e. any bit sequence is equally likely to be presented as
> input. *That's* what information theory says can't be compressed.

But that has to be about the process that gives rise to the data, not
the data themselves.  No finite collection of bits has the property you
describe.

If I say: "here is some random data..." you can't tell if it is or is
not from a random source.  I can, as a parlour trick, compress and
recover this "random data" because I chose it.

A source of random can be defined but "random data" is much more
elusive.

>> I think "arbitrary data" (thereby including the results of compression
>> by said algorithm) is the best way to make progress.
>
> I'm not sure that's much better, because it doesn't home in
> on the most important thing, which is the probability
> distribution.

I think the argument that you can't compress arbitrary data is simpler
to make.  You don't have to define it (except in the very simplest
terms) and it's obvious that it includes the results of previous
compressions.

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-26 Thread danceswithnumbers

Thomas Jollans

On 2017-10-25 23:22, danceswi...@gmail.com wrote: 
> With every transform the entropy changes, 

That's only true if the "transform" loses or adds information. 

If it loses information, that's lossy compression, which is only useful 
in very specific (but also extremely common) circumstances. 

If it adds information, that's poetry, not compression. 



Not true! You can transform a stream losslessly, and change its entropy without 
adding to or taking away. These two 16-bit streams are the same but one is 
reversed.

1111000010101011
This equals
61611
This can be represented using
0-6 log2(7)*5= 14.0367746103 bits


1101010100001111
This equals 
54543
This can be represented using
0-5 log2(6)*5= 12.9248125036 bits

In reality you can express 54543 with 10 bits.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-26 Thread Peter J. Holzer
On 2017-10-24 22:30, Steve D'Aprano  wrote:
> On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:
>
>> On 2017-10-23 04:21, Steve D'Aprano  wrote:
>>> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:

>>> If the probability of certain codes (either single codes, or sequences of
>>> codes) are non-equal, then you can take advantage of that by encoding the
>>> common cases into a short representation, and the uncommon and rare cases
>>> into a longer representation. As you say:
>>>
>>>
>>>>   Otherwise, if ( 0, 0 ) is much more frequent,
>>>>   we can encode ( 0, 0 ) by "0" and
>>>> 
>>>> ( 0, 1 ) by "101",
>>>> ( 1, 0 ) by "110", and
>>>> ( 1, 1 ) by "111".
>>>> 
>>>>   And we could then use /less/ than two bits on the
>>>>   average.
>>>
>>> That's incorrect. On average you use 2.5 bits.
>>>
>>> (1*1 bit + 3*3 bits divide by four possible outcomes, makes 2.5 bits.)
>> 
>> I disagree. If the distribution is not equal, then the average needs to
>> take the different probabilities into account.
>
> I think I would call that the *weighted* average rather than the average.

If there are 4 guys who are 180 cm tall and one who is 2 metres tall,
would you say their average height is 184 cm ((180 * 4 + 200 * 1)/(4 + 1))
or 190 cm ((180 + 200)/2)? For me that's the same situation.


> Regardless of what we call it, of course both you and Stefan are right in how
> to calculate it, and such a variable-length scheme can be used to compress
> the data.
>
> E.g. given the sequence 00000011 which would take 8 bits in the obvious
> encoding, we can encode it as "000111" which takes only 6 bits.
>
> But the cost of this encoding scheme is that *some* bit sequences expand, e.g.
> the 8 bit sequence 11111100 is encoded as "1111111110" which requires 10
> bits.

Right. This is most obvious in Huffman encoding, where each symbol is
replaced by a sequence of bits which is directly related to the
frequency of that symbol. So the letter 'e' might be encoded in 3 or 4
bits (in a corpus of text I happen to have lying around, about 1 in 11
characters is an e and ld(11) = 3.43) while the tilde (about 1 character
in 21000) would be encoded in 14 or 15 bits.

That's basically how all lossless compression algorithms work: Replacing
more frequent bit sequences with shorter sequences and replacing less
frequent sequences with longer ones. (Although most operate on variable
length byte sequences not individual bytes. I find the LZW algorithm (as
used in Unix compress and GIF images) a particularly instructive
example).
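
A minimal LZW compressor sketch (decompression omitted): recurring
byte sequences are replaced by dictionary codes that grow as the input
repeats itself.

    def lzw_compress(data: bytes) -> list:
        table = {bytes([i]): i for i in range(256)}
        w, out = b'', []
        for byte in data:
            wc = w + bytes([byte])
            if wc in table:
                w = wc
            else:
                out.append(table[w])
                table[wc] = len(table)   # assign the next free code
                w = bytes([byte])
        if w:
            out.append(table[w])
        return out

    print(len(lzw_compress(b'abcabcabcabcabcabc')))   # 8 codes for 18 bytes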


> The end result is that averaged over all possible bit sequences (of a certain
> size), this encoding scheme requires MORE space than the obvious 0/1 bits.
>
> But in practice we don't care much, because the data sequences we care about
> are usually not "all possible bit sequences", but a heavily restricted subset
> where there are lots of 00 pairs and fewer 01, 10, and 11 pairs.

Right.

hp


-- 
   _  | Peter J. Holzer| The curse of electronic word processing:
|_|_) || you keep reworking your text until the
| |   | h...@hjp.at | parts of the sentence no longer fits.
__/   | http://www.hjp.at/ | -- Ralph Babel
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-26 Thread Thomas Jollans
On 2017-10-25 23:22, danceswithnumb...@gmail.com wrote:
> With every transform the entropy changes,

That's only true if the "transform" loses or adds information.

If it loses information, that's lossy compression, which is only useful
in very specific (but also extremely common) circumstances.

If it adds information, that's poetry, not compression.


-- 
Thomas Jollans
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-25 Thread Steve D'Aprano
On Thu, 26 Oct 2017 08:22 am, danceswithnumb...@gmail.com wrote:

> with each pass you can compress until the entropy is so random it can no
> longer be compressed.


Which is another way of saying that you cannot compress random binary data.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-25 Thread danceswithnumbers
So if the theoretical min compression limit (log2(n)*(x)) has a 3% margin, but 
your transform has at most a less-than-3% inflate rate, then there is room for 
the transform to compress below the theoretical min. With every transform the 
entropy changes, the potential for greater compression also changes; with each 
pass you can compress until the entropy is so random it can no longer be 
compressed.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-25 Thread danceswithnumbers
Whatever you do, you'll find that *on average* you 
will need *at least* 34 bits to be able to represent 
all possible 10-digit decimal numbers. Some might 
be shorter, but then others will be longer, and 
the average won't be less than 34. 


The theoretical limit for arbitrary numbers 0 - 9 must be viewed as an average, 
not as an impossibility, in some cases, to compress to or under 34. 
This is obvious by the decimal to binary function.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-25 Thread Ian Kelly
On 10/24/17, Richard Damon  wrote:
> My understanding of the 'Random Data Compressibility' challenge is
> that it requires that the compression take ANY/ALL strings of up to N
> bits, and generate an output stream no longer than the input stream, and
> sometimes less.

That's incorrect, at least of the challenge being discussed. Here's the link:

http://marknelson.us/2012/10/09/the-random-compression-challenge-turns-ten/

The challenge is just to take a single known file of a million random
digits and make it smaller, including the size of the decompressor and
without hiding data. So far in 15 years nobody has succeeded even at
this, and nobody knows whether it's impossible. For instance it may be
the case that the data in the file happens to be the nth prime, in
which case it could simply be compressed to the number n with a
decompressor that calculates primes.

> It admits that given non-uniformly distributed data, it is
> possible to compress some patterns, the common ones, and expand others,
> the uncommon ones, to lower the net average length. What it says can't
> be done is to have a compression method that compresses EVERY input
> pattern. That is where the 'Pigeon Hole' principle comes into play which
> the people who claim to be able to compress random data like to ignore
> or just attempt to say doesn't apply.

There is a second challenge on that page that is as you state, but the
page admits that this second challenge is a bit of a troll since this
is provably impossible.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-25 Thread Gregory Ewing

Lele Gaifax wrote:

That's simple enough: of course one empty file would be
"music.mp3.zip.zip.zip", while the other would be
"movie.avi.zip.zip.zip.zip.zip"... some sort of
https://en.wikipedia.org/wiki/Water_memory applied to file system entries :-)


If you're allowed to alternate between two compression
methods, then the way you decompress
music.mp3.zip.zip.tgz.zip...tgz.zip.tgz
is to output 0 each time zip was applied and 1 each
time tar/gz was applied.

You may be able to take some shortcuts in some
cases, e.g. anything beginning with "movie.flv"
almost certainly contains a cute kitten video.
(Unless it's found inside an encrypted disk
partition, in which case it contains porn.)

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-25 Thread Gregory Ewing

Ben Bacarisse wrote:

The trouble is a pedagogic one.  Saying "you can't compress random data"
inevitably leads (though, again, this is just my experience) to endless
attempts to define random data.


It's more about using terms without making sure everyone agrees
on the definitions being used.

In this context, "random data" really means "uniformly distributed
data", i.e. any bit sequence is equally likely to be presented as
input. *That's* what information theory says can't be compressed.


I think "arbitrary data" (thereby including the results of compression
by said algorithm) is the best way to make progress.


I'm not sure that's much better, because it doesn't home in
on the most important thing, which is the probability
distribution.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-25 Thread Gregory Ewing

Steve D'Aprano wrote:

- Encrypted data looks very much like random noise.


There's actually a practical use for that idea. If you can feed
the output of an encryption algorithm through a compressor and
make it smaller, it means there is a cryptographic weakness
in the algorithm that could potentially be exploited. Good
encryption algorithms produce output that looks almost completely
random to anyone who doesn't know how to decrypt it.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Richard Damon

On 10/24/17 6:30 PM, Steve D'Aprano wrote:

On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:


On 2017-10-23 04:21, Steve D'Aprano  wrote:

On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:



If the probability of certain codes (either single codes, or sequences of
codes) are non-equal, then you can take advantage of that by encoding the
common cases into a short representation, and the uncommon and rare cases
into a longer representation. As you say:



   Otherwise, if ( 0, 0 ) is much more frequent,
   we can encode ( 0, 0 ) by "0" and

( 0, 1 ) by "101",
( 1, 0 ) by "110", and
( 1, 1 ) by "111".

   And we could then use /less/ than two bits on the
   average.


That's incorrect. On average you use 2.5 bits.

(1*1 bit + 3*3 bits divide by four possible outcomes, makes 2.5 bits.)


I disagree. If the distribution is not equal, then the average needs to
take the different probabilities into account.


I think I would call that the *weighted* average rather than the average.

Regardless of what we call it, of course both you and Stefan are right in how
to calculate it, and such a variable-length scheme can be used to compress
the data.

E.g. given the sequence 00000011 which would take 8 bits in the obvious
encoding, we can encode it as "000111" which takes only 6 bits.

But the cost of this encoding scheme is that *some* bit sequences expand, e.g.
the 8 bit sequence 11111100 is encoded as "1111111110" which requires 10 
bits.

The end result is that averaged over all possible bit sequences (of a certain
size), this encoding scheme requires MORE space than the obvious 0/1 bits.

But in practice we don't care much, because the data sequences we care about
are usually not "all possible bit sequences", but a heavily restricted subset
where there are lots of 00 pairs and fewer 01, 10, and 11 pairs.



My understanding of the 'Random Data Compressibility' challenge is 
that it requires that the compression take ANY/ALL strings of up to N 
bits, and generate an output stream no longer than the input stream, and 
sometimes less. It admits that given non-uniformly distributed data, it is 
possible to compress some patterns, the common ones, and expand others, 
the uncommon ones, to lower the net average length. What it says can't 
be done is to have a compression method that compresses EVERY input 
pattern. That is where the 'Pigeon Hole' principle comes into play which 
the people who claim to be able to compress random data like to ignore 
or just attempt to say doesn't apply.
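
The pigeon-hole count itself is a two-liner:

    n = 16
    print(2 ** n)                          # 65536 inputs of exactly n bits
    print(sum(2 ** k for k in range(n)))   # 65535 outputs shorter than n bits

With fewer short outputs than inputs, at least one n-bit input cannot
shrink.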


--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Chris Angelico
On Wed, Oct 25, 2017 at 9:11 AM, Steve D'Aprano
 wrote:
> On Wed, 25 Oct 2017 02:40 am, Lele Gaifax wrote:
>
>> Steve D'Aprano  writes:
>>
>>> But given an empty file, how do you distinguish the empty file you get
>>> from 'music.mp3' and the identical empty file you get from 'movie.avi'?
>>
>> That's simple enough: of course one empty file would be
>> "music.mp3.zip.zip.zip", while the other would be
>> "movie.avi.zip.zip.zip.zip.zip"... some sort of
>> https://en.wikipedia.org/wiki/Water_memory applied to file system entries
>> :-)
>
>
> Does that mean if I name an empty file
>
> serenity2-by-joss-whedon.avi.zip.zip.zip.zip.zip
>
> Danceswithnumbers' magic algorithm will recreate the movie from some
> alternative universe where it actually exists?
>
> Awesome.

Yes, but then you'd get
dmca-takedown-request.pdf.zip.zip.zip.zip.zip.zip.zip which would also
be empty.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Steve D'Aprano
On Wed, 25 Oct 2017 07:09 am, Peter J. Holzer wrote:

> On 2017-10-23 04:21, Steve D'Aprano  wrote:
>> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>>
>> If the probability of certain codes (either single codes, or sequences of
>> codes) are non-equal, then you can take advantage of that by encoding the
>> common cases into a short representation, and the uncommon and rare cases
>> into a longer representation. As you say:
>>
>>
>>>   Otherwise, if ( 0, 0 ) is much more frequent,
>>>   we can encode ( 0, 0 ) by "0" and
>>> 
>>> ( 0, 1 ) by "101",
>>> ( 1, 0 ) by "110", and
>>> ( 1, 1 ) by "111".
>>> 
>>>   And we could then use /less/ than two bits on the
>>>   average.
>>
>> That's incorrect. On average you use 2.5 bits.
>>
>> (1*1 bit + 3*3 bits divide by four possible outcomes, makes 2.5 bits.)
> 
> I disagree. If the distribution is not equal, then the average needs to
> take the different probabilities into account.

I think I would call that the *weighted* average rather than the average.

Regardless of what we call it, of course both you and Stefan are right in how
to calculate it, and such a variable-length scheme can be used to compress
the data.

E.g. given the sequence 00000011 which would take 8 bits in the obvious
encoding, we can encode it as "000111" which takes only 6 bits.

But the cost of this encoding scheme is that *some* bit sequences expand, e.g.
the 8 bit sequence 11111100 is encoded as "1111111110" which requires 10
bits.

The end result is that averaged over all possible bit sequences (of a certain
size), this encoding scheme requires MORE space than the obvious 0/1 bits.

But in practice we don't care much, because the data sequences we care about
are usually not "all possible bit sequences", but a heavily restricted subset
where there are lots of 00 pairs and fewer 01, 10, and 11 pairs.
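
A sketch of the scheme for anyone who wants to play with it:

    CODE = {'00': '0', '01': '101', '10': '110', '11': '111'}

    def encode(bits):
        return ''.join(CODE[bits[i:i+2]] for i in range(0, len(bits), 2))

    print(encode('00000011'))   # 000111     -> 8 bits shrink to 6
    print(encode('11111100'))   # 1111111110 -> 8 bits expand to 10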



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Steve D'Aprano
On Wed, 25 Oct 2017 02:40 am, Lele Gaifax wrote:

> Steve D'Aprano  writes:
> 
>> But given an empty file, how do you distinguish the empty file you get
>> from 'music.mp3' and the identical empty file you get from 'movie.avi'?
> 
> That's simple enough: of course one empty file would be
> "music.mp3.zip.zip.zip", while the other would be
> "movie.avi.zip.zip.zip.zip.zip"... some sort of
> https://en.wikipedia.org/wiki/Water_memory applied to file system entries
> :-)


Does that mean if I name an empty file 

serenity2-by-joss-whedon.avi.zip.zip.zip.zip.zip

Danceswithnumbers' magic algorithm will recreate the movie from some
alternative universe where it actually exists?

Awesome.


-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Ian Kelly
On Tue, Oct 24, 2017 at 12:20 AM, Gregory Ewing
 wrote:
> danceswithnumb...@gmail.com wrote:
>>
>> I did that quite a while ago. 352,954 kb.
>
>
> Are you sure? Does that include the size of all the
> code, lookup tables, etc. needed to decompress it?

My bet is that danceswithnumbers does indeed have a file of that size
which is in some way derived from the million random digits, but
without any means of losslessly "decompressing" it (thus making it
junk data).
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Peter Pearson
On Tue, 24 Oct 2017 14:51:37 +1100, Steve D'Aprano wrote:
  On Tue, 24 Oct 2017 01:27 pm, danceswithnumb...@gmail.com wrote:
  > Yes! Decode reverse is easy..sorry so excited i could shout.

  Then this should be easy for you:

  http://marknelson.us/2012/10/09/the-random-compression-challenge-turns-ten/

  All you need to do is compress this file:

  
http://marknelson.us/attachments/million-digit-challenge/AMillionRandomDigits.bin

  to less than 415241 bytes, and you can win $100.

Then, on Mon, 23 Oct 2017 21:13:00 -0700 (PDT), danceswithnumbers wrote:
> I did that quite a while ago. 


But 352,954 kb > 415241 bytes, by several orders of magnitude; so
you didn't "do that".  (Or are we using the European decimal point?)

If you're claiming 352,954 *bytes*, not kb, I invite you to explain
why you have not collected Mark Nelson's $100 prize, and untold fame
and glory; failing which, your credibility will evaporate.

-- 
To email me, substitute nowhere->runbox, invalid->com.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Peter J. Holzer
On 2017-10-23 04:21, Steve D'Aprano  wrote:
> On Mon, 23 Oct 2017 02:29 pm, Stefan Ram wrote:
>>
> If the probability of certain codes (either single codes, or sequences of
> codes) are non-equal, then you can take advantage of that by encoding the
> common cases into a short representation, and the uncommon and rare cases
> into a longer representation. As you say:
>
>
>>   Otherwise, if ( 0, 0 ) is much more frequent,
>>   we can encode ( 0, 0 ) by "0" and
>> 
>> ( 0, 1 ) by "101",
>> ( 1, 0 ) by "110", and
>> ( 1, 1 ) by "111".
>> 
>>   And we could then use /less/ than two bits on the
>>   average. 
>
> That's incorrect. On average you use 2.5 bits.
>
> (1*1 bit + 3*3 bits divide by four possible outcomes, makes 2.5 bits.)

I disagree. If the distribution is not equal, then the average needs to
take the different probabilities into account. 

Let's assume that (0, 0) has a probability of 90 %, (0, 1) a probability
of 10 % and (1, 0) and (1, 1) a probability of 5 % each. 

Then the average length is 

0.9 * 1 bit + 0.1 * 3 bits + 0.05 * 3 bits + 0.05 * 3 bits = 1.5 bits.

hp


-- 
   _  | Peter J. Holzer| The curse of electronic word processing:
|_|_) || you keep reworking your text until the
| |   | h...@hjp.at | parts of the sentence no longer
__/   | http://www.hjp.at/ | fits. -- Ralph Babel
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Tim Golden

On 24/10/2017 16:40, Lele Gaifax wrote:

Steve D'Aprano  writes:


But given an empty file, how do you distinguish the empty file you get
from 'music.mp3' and the identical empty file you get from 'movie.avi'?


That's simple enough: of course one empty file would be
"music.mp3.zip.zip.zip", while the other would be


I swear this looks like the lyrics of something or another...

"Music MP3 - zip - zip - zip"

TJG
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Lele Gaifax
Steve D'Aprano  writes:

> But given an empty file, how do you distinguish the empty file you get
> from 'music.mp3' and the identical empty file you get from 'movie.avi'?

That's simple enough: of course one empty file would be
"music.mp3.zip.zip.zip", while the other would be
"movie.avi.zip.zip.zip.zip.zip"... some sort of
https://en.wikipedia.org/wiki/Water_memory applied to file system entries :-)

ciao, lele.
-- 
nickname: Lele Gaifax | When I live off what I thought yesterday
real: Emanuele Gaifas | I will begin to fear those who copy me.
l...@metapensiero.it  | -- Fortunato Depero, 1929.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Ben Bacarisse
Steve D'Aprano  writes:

> On Tue, 24 Oct 2017 06:46 pm, danceswithnumb...@gmail.com wrote:
>
>> Greg, you're  very smart, but you are missing a big key. I'm not padding,
>> you are still thinking inside the box, and will never solve this by doing
>> so. Yes! At least you see my accomplishment, this will compress any random
>> file.
>
> Talk is cheap.

But highly prized.  Most Usenet cranks only want to be talked to (they
see it as being taken seriously, no matter how rude the respondents are)
so for the cost of something cheap (a little babbling) they get an
endless stream of highly prized attention.

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Ben Bacarisse
Steve D'Aprano  writes:

> On Tue, 24 Oct 2017 09:23 pm, Ben Bacarisse wrote:
>
>> Forget random data.  For one thing it's hard to define, 
>
> That bit is true.
>
>> but more importantly no one cares about it.
>
> But that's wrong.

All generalisations are false.  I was being hyperbolic.

> For instance:
>
> - Encrypted data looks very much like random noise. With more and more data
>   traversing the internet in encrypted form, the ability to compress random
>   noise would be worth billions.
>
> - Compressed data looks somewhat like random noise (with a bit of structure).
>   The more it is compressed, the more random it looks. If you could compress
>   random noise, you could take already compressed data, and compress it again,
>   saving even more space.
>
> - Many multimedia formats (images, sound, video) are compressed using
>   dedicated encoders. The better the encoder, the more it compresses the data
>   (whether lossy or not) the harder it is to compress it further. If you could
>   compress random noise, you could compress JPGs, MP3s, h265-encoded MKVs,
>   etc, saving even more storage and transmission costs.

But these are not random data.  We care about these because they are
highly structured, non-random data.

> And most importantly:
>
> - Random data is a superset of the arbitrary structured data you mention
>   below. If we could compress random data, then we could compress any data
>   at all, no matter how much or little structure it contained.

Yes, that's part of my point.  Arbitrary data includes random data but
it avoids arguments about what random means.

> This is why the ability to compress random data (if it were possible, which it
> is not) is interesting. Its not because people want to be able to compress
> last night's lottery numbers, or tables of random digits.

The trouble is a pedagogic one.  Saying "you can't compress random data"
inevitably leads (though, again, this is just my experience) to endless
attempts to define random data.  My preferred way out of that is to talk
about algorithmic complexity but for your average "I've got a perfect
compression algorithm" poster, that is step too far.

I think "arbitrary data" (thereby including the results of compression
by said algorithm) is the best way to make progress.


>> Then you publish in a major journal.  Post the link to the journal
>> article when you are done.
>
> These days there are plenty of predatory journals which will be happy to take
> Dancerswithnumber's money in return for publishing it in a junk
> journal.

Sure, but you usually get a huge advantage -- a text to criticise.  Your
average Usenet crank will keep changing what they say to avoid being
pinned down.  Plus you get to note the fact that the journal is junk.

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Steve D'Aprano
On Tue, 24 Oct 2017 06:46 pm, danceswithnumb...@gmail.com wrote:

> Greg, you're  very smart, but you are missing a big key. I'm not padding,
> you are still thinking inside the box, and will never solve this by doing
> so. Yes! At least you see my accomplishment, this will compress any random
> file.

Talk is cheap.



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Steve D'Aprano
On Tue, 24 Oct 2017 09:23 pm, Ben Bacarisse wrote:

> Forget random data.  For one thing it's hard to define, 

That bit is true.

> but more importantly no one cares about it.

But that's wrong.

For instance:

- Encrypted data looks very much like random noise. With more and more data
  traversing the internet in encrypted form, the ability to compress random
  noise would be worth billions.

- Compressed data looks somewhat like random noise (with a bit of structure).
  The more it is compressed, the more random it looks. If you could compress
  random noise, you could take already compressed data, and compress it again,
  saving even more space.

- Many multimedia formats (images, sound, video) are compressed using
  dedicated encoders. The better the encoder, the more it compresses the data
  (whether lossy or not) the harder it is to compress it further. If you could
  compress random noise, you could compress JPGs, MP3s, h265-encoded MKVs,
  etc, saving even more storage and transmission costs.

And most importantly:

- Random data is a superset of the arbitrary structured data you mention
  below. If we could compress random data, then we could compress any data
  at all, no matter how much or little structure it contained.

This is why the ability to compress random data (if it were possible, which it
is not) is interesting. Its not because people want to be able to compress
last night's lottery numbers, or tables of random digits.


> By its very nature, random data is 
> not interesting.  What people want is a reversible compression algorithm
> that works on *arbitrary data* -- i.e. on *any* file at all, no matter
> how structured and *non-random* it is.

In a sense you are right. Compressing randomly generated data would be a
parlour trick and not especially useful. But if you had such an
algorithm, that would change the face of the world.

It would be as revolutionary and paradigm breaking as a perpetual motion
machine, or discovery of a new continent the size of China in the middle of
the Atlantic, or that π actually does equal 22/7 exactly.

And just as impossible.


> For example, run the complete works of Shakespeare through your program.
> The result is very much not random data, but that's the sort of data
> people want to compress.  If you can compress the output of your
> compressor you have made a good start.  Of course what you really want
> to be able to do is to compress the output that results from compressing
> your compressed out.  And, of course, you should not stop there. Since 
> you can compress *any* data (not just the boring random stuff) you can
> keep going -- compressing the compressed output again and again until
> you end up with a zero-length file.

Indeed.

That proof by contradiction is yet another reason we know we can't compress
random data -- that is to say, *arbitrary* data. If we had a compression
program which could guarantee to ALWAYS shrink ANY file by at least one bit,
then we could just apply it over and over again, shrinking the compressed
file again and again, until we were left with a zero-byte file:

original.dat = 110 MB
original.dat.zip.zip.zip.zip.zip.zip.zip = 0 MB

And then reverse the process, to turn an empty file back into the original.

But given an empty file, how do you distinguish the empty file you get
from 'music.mp3' and the identical empty file you get from 'movie.avi'?

Obviously you cannot. So we know that the only way to *guarantee* to shrink
every possible file is if the compression is lossy.
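
The counting argument behind that is small enough to check mechanically
(a sketch in Python; the choice of n is arbitrary):

    n = 8
    shorter = sum(2**k for k in range(n))  # bit strings shorter than n bits
    exactly = 2**n                         # bit strings of exactly n bits
    print(shorter, exactly)                # 255 256: more pigeons than holes

With 2**n files of n bits but only 2**n - 1 shorter files, any
always-shrinking scheme must send two different inputs to the same output,
so it cannot be reversed.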


> Then you publish in a major journal.  Post the link to the journal
> article when you are done.

These days there are plenty of predatory journals which will be happy to take
Dancerswithnumber's money in return for publishing it in a junk journal.

https://en.wikipedia.org/wiki/Predatory_open_access_publishing



-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Paul Moore
On 24 October 2017 at 12:04, Ben Bacarisse  wrote:
> Paul Moore  writes:
>
>> On 24 October 2017 at 11:23, Ben Bacarisse  wrote:
>>> For example, run the complete works of Shakespeare through your program.
>>> The result is very much not random data, but that's the sort of data
>>> people want to compress.  If you can compress the output of your
>>> compressor you have made a good start.  Of course what you really want
>>> to be able to do is to compress the output that results from compressing
>>> your compressed out.  And, of course, you should not stop there.  Since
>>> you can compress *any* data (not just the boring random stuff) you can
>>> keep going -- compressing the compressed output again and again until
>>> you end up with a zero-length file.
>>
>> Oh, and just for fun, if you are able to guarantee compressing
>> arbitrary data, then
>
> It's a small point, but you are replying to a post of mine and saying
> "you".  That could make people think that /I/ am claiming to have a perfect
> compression algorithm.

Sorry. I intended the meaning "If one is able to..." but I was unclear. My bad.

>> 1. Take a document you want to compress.
>> 2. Compress it using your magic algorithm. The result is smaller.
>> 3. Compress the compressed data. The result is still smaller.
>> 4. Repeat until you hit 0 bytes.
>
> Isn't this just repeating what I said?  I must not have written it
> clearly enough.

More accurately, I didn't read it carefully enough. Again sorry.

However, I guess it serves as an example of a compression algorithm -
we can trivially compress the content of our two posts into a single
post with just as much information content, by deleting my post :-)

Paul
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Ben Bacarisse
Paul Moore  writes:

> On 24 October 2017 at 11:23, Ben Bacarisse  wrote:
>> For example, run the complete works of Shakespeare through your program.
>> The result is very much not random data, but that's the sort of data
>> people want to compress.  If you can compress the output of your
>> compressor you have made a good start.  Of course what you really want
>> to be able to do is to compress the output that results from compressing
>> your compressed out.  And, of course, you should not stop there.  Since
>> you can compress *any* data (not just the boring random stuff) you can
>> keep going -- compressing the compressed output again and again until
>> you end up with a zero-length file.
>
> Oh, and just for fun, if you are able to guarantee compressing
> arbitrary data, then

It's a small point, but you are replying to a post of mine and saying
"you".  That could make people think that /I/ am claiming to have a perfect
compression algorithm.

> 1. Take a document you want to compress.
> 2. Compress it using your magic algorithm. The result is smaller.
> 3. Compress the compressed data. The result is still smaller.
> 4. Repeat until you hit 0 bytes.

Isn't this just repeating what I said?  I must not have written it
clearly enough.


-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Paul Moore
On 24 October 2017 at 11:23, Ben Bacarisse  wrote:
> For example, run the complete works of Shakespeare through your program.
> The result is very much not random data, but that's the sort of data
> people want to compress.  If you can compress the output of your
> compressor you have made a good start.  Of course what you really want
> to be able to do is to compress the output that results from compressing
> your compressed out.  And, of course, you should not stop there.  Since
> you can compress *any* data (not just the boring random stuff) you can
> keep going -- compressing the compressed output again and again until
> you end up with a zero-length file.

Oh, and just for fun, if you are able to guarantee compressing
arbitrary data, then

1. Take a document you want to compress.
2. Compress it using your magic algorithm. The result is smaller.
3. Compress the compressed data. The result is still smaller.
4. Repeat until you hit 0 bytes.

Congratulations - apparently you have a reversible algorithm that
compresses every data set to an empty file. (Caveat - there's actually
"hidden data" here, as you need to know how many compressions it takes
to hit 0 bytes. Because you decrease the size every time, though, that
number must be no greater than the size of the original file).
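
You can watch that failure in practice with any real compressor; here is a
minimal sketch using the standard zlib module on stand-in data:

    import zlib

    data = bytes(range(256)) * 400   # ~100 kB of highly repetitive input
    for i in range(5):
        out = zlib.compress(data, 9)
        print(i, len(data), '->', len(out))
        data = out

The first pass shrinks the repetitive input dramatically; every later pass
sees near-random compressed bytes and actually grows them slightly.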

Paul
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Steve D'Aprano
On Tue, 24 Oct 2017 05:20 pm, Gregory Ewing wrote:

> danceswithnumb...@gmail.com wrote:
>> I did that quite a while ago. 352,954 kb.
> 
> Are you sure? Does that include the size of all the
> code, lookup tables, etc. needed to decompress it?
> 
> But even if you have, you haven't disproved the theorem about
> compressing random data. All you have is a program that
> compresses *that particular* sequence of a million digits.
> 
> To disprove the theorem, you would need to exhibit an
> algorithm that can compress *any* sequence of a million
> digits to less than 415,241 bytes.

Indeed -- but let's give Dancerswithnumbers his due. *IF* he is right (a very
big "if" indeed) about being able to compress the Rand Corporation "Million
Random Digits" in binary form, as given, that alone would be an impressive
trick.

Compressing the digits in text form is not impressive in the least. As Ben
Bacarisse pointed out, most of us will probably already have half a dozen
programs that do that.

 

-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Ben Bacarisse
danceswithnumb...@gmail.com writes:

> Finally figured out how to turn this into a random binary compression
> program. Since my transform can compress more than dec to binary. Then
> I took a random binary stream,

Forget random data.  For one thing it's hard to define, but more
importantly no one cares about it.  By its very nature, random data is
not interesting.  What people want is a reversible compression algorithm
that works on *arbitrary data* -- i.e. on *any* file at all, no matter
how structured and *non-random* it is.

For example, run the complete works of Shakespeare through your program.
The result is very much not random data, but that's the sort of data
people want to compress.  If you can compress the output of your
compressor you have made a good start.  Of course what you really want
to be able to do is to compress the output that results from compressing
your compressed out.  And, of course, you should not stop there.  Since
you can compress *any* data (not just the boring random stuff) you can
keep going -- compressing the compressed output again and again until
you end up with a zero-length file.

Then you publish in a major journal.  Post the link to the journal
article when you are done.


-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Paul Moore
On 24 October 2017 at 09:43, Gregory Ewing  wrote:
> Paul Moore wrote:
>>
>> But that's not "compression", that's simply using a better encoding.
>> In the technical sense, "compression" is about looking at redundancies
>> that go beyond the case of how effectively you pack data into the
>> bytes available.
>
>
> There may be a difference in the way the terms are used, but
> I don't think there's any fundamental difference. Compression
> is about finding clever ways to make the encoding better.

Agreed - I was trying (probably futilely, given the way this thread
has gone...) to make a distinction between purely local properties
that are typically considered in "how you encode the data" and the
detection of more global patterns, which is where what are typically
referred to as "compression" algorithms get their power. But sadly, I
don't think the OP is actually interested in understanding the
background, so the distinction wasn't really worth making :-(

> Either way, the information-theoretic limits on the number
> of bits needed are the same.

Precisely.
Paul
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Gregory Ewing

Paul Moore wrote:

But that's not "compression", that's simply using a better encoding.
In the technical sense, "compression" is about looking at redundancies
that go beyond the case of how effectively you pack data into the
bytes available.


There may be a difference in the way the terms are used, but
I don't think there's any fundamental difference. Compression
is about finding clever ways to make the encoding better.

Either way, the information-theoretic limits on the number
of bits needed are the same.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Gregory Ewing

danceswithnumb...@gmail.com wrote:


My 8 year old can decode this back into base 10,


Keep in mind that your 8 year old has more information
than just the 32 bits you wrote down -- he can also
see that there *are* 32 bits and no more. That's
hidden information that you're not counting.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread danceswithnumbers
No leading zeroes are being dropped off... wish this board had an edit button.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Christian Gollwitzer

Am 23.10.17 um 12:13 schrieb Marko Rauhamaa:

Thomas Jollans :


On 2017-10-23 11:32, danceswithnumb...@gmail.com wrote:

According to this website. This is an uncompressable stream.

https://en.m.wikipedia.org/wiki/Incompressible_string

1234999988884321


No, it's not. According to that article, that string is incompressible
by a particular algorithm. I can see no more general claims.


Here's a compression algorithm that manages to compress that string into
a 0-bit string:

  * If the original string is 1234999988884321 (whatever that means),
return the empty bit string.

  * Otherwise, prepend a don't-care bit to the original string and return
the result of the concatenation.



...and that's why there is the "Kolmogorov complexity". You need to 
append the decompression program to the data to show how much you really 
saved, which will turn out to be nothing compared to the "trivial 
decompressor"


print "12344321"

Christian
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread danceswithnumbers
Greg, you're  very smart, but you are missing a big key. I'm not padding, you 
are still thinking inside the box, and will never solve this by doing so. Yes! 
At least you see my accomplishment, this will compress any random file.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Gregory Ewing

danceswithnumb...@gmail.com wrote:


Compress this:

4135124325

Bin to dec...still very large
11110110
01111000
11111101
01100101


Wait right there! You're cheating by dropping off leading
0 bits. The maximum value of a 10 digit decimal number is
9999999999, which in hex is 2540be3ff. That's 34 bits.
That's in line with the theoretical number of bits needed:

log2(10) * 10 = 33.219

So the binary version of your number above is really

00
  11110110
  01111000
  11111101
  01100101

You may think you can get away without storing or
transmitting those leading 0 bits, because the decoder
can always pad out the data as needed.

But to do that, the decoder needs to know *how many*
bits to pad out to. That information somehow needs to
be part of the encoded data.

You need to imagine you're sending the data to someone
over a wire. The only thing you can send along the wire
are ones and zeroes. You can't convey any information
by timing, or turning off the power, or anything like
that. How is the receiver going to know when he's got
the whole message?

There are only two possibilities. Either you decide
in advance that all messages will be exactly the same
length, which in this case means always sending
exactly 34 bits. Or you add some extra bits to the
message -- prepend the length in binary, or add an
end-of-message code at the end, or something like
that.

Whatever you do, you'll find that *on average* you
will need *at least* 34 bits to be able to represent
all possible 10-digit decimal numbers. Some might
be shorter, but then others will be longer, and
the average won't be less than 34.
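
A minimal sketch of the fixed-length option in Python (the '034b' width is
just the 34-bit bound computed above):

    def encode(n):
        assert 0 <= n < 10**10
        return format(n, '034b')   # always exactly 34 bits, zero-padded

    def decode(bits):
        assert len(bits) == 34
        return int(bits, 2)

    assert decode(encode(4135124325)) == 4135124325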


New compression method:

11000101
11000111
0100

A full byte less than bin.


You need to be *very* careful about what you're claiming here.
Are you saying that your algorithm compresses *all possible*
sequences of 10 decimal digits to 3 bytes or less? Or can
some of them come out longer?

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Marko Rauhamaa
Gregory Ewing :

> What you *can't* do is compress 16 random decimal digits to less than
> 6.64 bytes.

More precisely:

   Regardless of the compression scheme, the probability of shortening
   the next bit sequence is less than 0.5 if the bits are distributed
   evenly, randomly and independently.

Often the distribution is not even, or there is interdependence between
the bits. Then, a compression scheme can do miracles.
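
The first case is easy to check empirically; a sketch with the standard
library, using zlib to stand in for "any compression scheme":

    import os, zlib

    trials = 1000
    wins = sum(len(zlib.compress(os.urandom(128))) < 128
               for _ in range(trials))
    print(wins, 'of', trials)   # 0 in practice: even, independent random
                                # bytes essentially never come out shorter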


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Gregory Ewing

danceswithnumb...@gmail.com wrote:

I did that quite a while ago. 352,954 kb.


Are you sure? Does that include the size of all the
code, lookup tables, etc. needed to decompress it?

But even if you have, you haven't disproved the theorem about
compressing random data. All you have is a program that
compresses *that particular* sequence of a million digits.

To disprove the theorem, you would need to exhibit an
algorithm that can compress *any* sequence of a million
digits to less than 415,241 bytes.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-24 Thread Gregory Ewing

danceswithnumb...@gmail.com wrote:

1234999988884321

It only takes seven 8 bit bytes to represent this


This is not surprising. The theoretical minimum size
for 16 arbitrary decimal digits is:

log2(10) * 16 = 53.15 bits = 6.64 bytes

I think you misunderstand what is meant by the phrase
"random data cannot be compressed".

A string of decimal digits represented in ASCII is
*not* completely random. There are 256 possible values
for each byte, and you're only using 10 of them.

What you *can't* do is compress 16 random decimal
digits to less than 6.64 bytes.

If you have something that you think can do better
than that, we would like to see it.
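
For reference, the seven-byte packing itself is a two-liner (a sketch; the
decoder's knowledge that there are exactly 16 digits is the hidden
information):

    n = int("1234999988884321")   # the 16-digit string above
    b = n.to_bytes(7, 'big')      # fits, since 10**16 <= 2**56
    assert int.from_bytes(b, 'big') == n and len(b) == 7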

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Chris Angelico
On Tue, Oct 24, 2017 at 2:28 AM, Paul Moore  wrote:
> Hope this helps put the subject into context. Compression is a very
> technical subject, to "do it right". Special cases can be worked out,
> sure, but the "hidden assumptions" in a method are what make the
> difference between a "compression algorithm" and a "way of storing my
> particular data more efficiently".

And to put *this* into context and perspective, both of the above are
extremely useful tools/algorithms. Generalized compression algos are
used all the time, special-purpose lossy compression algos too, and
"way[s] of storing my particular data more efficiently" are utterly
crucial to many applications. But they're not universal compression
schemes.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Steve D'Aprano
On Tue, 24 Oct 2017 03:13 pm, danceswithnumb...@gmail.com wrote:

> I did that quite a while ago. 352,954 kb.

Sure you did. Let's see the code you used.




-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread danceswithnumbers
I did that quite a while ago. 352,954 kb. 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Steve D'Aprano
On Tue, 24 Oct 2017 01:27 pm, danceswithnumb...@gmail.com wrote:

> Finally figured out how to turn this into a random binary compression
> program. Since my transform can compress more than dec to binary. Then I
> took a random binary stream, changed it to a decimal stream 0-9, transformed
> it into a compressed/encrypted binary stream 23.7% smaller.

Smaller than the original binary stream? I don't think so.

There's nothing clever about "compressing" a random binary stream if first you
expand it, then remove the extra space you added and claim victory because
the result is smaller than the expanded version.


> Yes! Decode reverse is easy... sorry, so excited I could shout.

Then this should be easy for you:

http://marknelson.us/2012/10/09/the-random-compression-challenge-turns-ten/

All you need to do is compress this file:

http://marknelson.us/attachments/million-digit-challenge/AMillionRandomDigits.bin

to less than 415241 bytes, and you can win $100.
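
For context, that target sits right at the information-theoretic floor;
a quick back-of-envelope in Python:

    from math import ceil, log2

    bits = 10**6 * log2(10)   # information in a million random decimal digits
    print(ceil(bits / 8))     # 415242 -- within a byte of the challenge file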





-- 
Steve
“Cheer up,” they said, “things could be worse.” So I cheered up, and sure
enough, things got worse.

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread danceswithnumbers
Finally figured out how to turn this into a random binary compression program.
Since my transform can compress more than dec to binary. Then I took a random
binary stream, changed it to a decimal stream 0-9, transformed it into a
compressed/encrypted binary stream 23.7% smaller. Yes! Decode reverse is
easy... sorry, so excited I could shout.

Jon Hutton
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Ian Kelly
On Mon, Oct 23, 2017 at 1:42 PM,   wrote:
> Wow, do programmers actually use ZSCII? That is huge. So much wasted space.

Not really. ZSCII is only relevant if you're writing Z-code or a
Z-code interpreter. Those in turn are only relevant if you're writing
Infocom games.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread danceswithnumbers
Wow, do programmers actually use ZSCII? That is huge. So much wasted space.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Neil Cerutti
On 2017-10-23, alister via Python-list  wrote:
>> 1234999988884321
>>
>> It only takes seven 8 bit bytes to represent this
>
> Would you care to provide the seven 8-bit bytes you propose to use?
> Paul

 I would suspect he is using BCD & storing 2 values in each
 byte that is not what is meant by you can't compress random
 data. His compression is simply removing redundant space
 from an inefficient coding
>>>
>>> I suspect he is using ASCII and storing one value in each byte.
>> 
>> There's also ZSCII, which stores roughly 3 characters every 2
>> bytes. Since all the digits are in A2, this sequence would
>> take up 7 bytes in ZSCII as well.
>> 
>> http://inform-fiction.org/zmachine/standards/z1point0/sect03.html
>
> not sure how 16 characters can be represented by either ascii
> or zscii in only 8 bytes

Oops! I hastily counted completely wrong. It's 12 bytes in ZSCII
version 2, using a shift-lock.

           * 16 bits *
12  -> 0 00101 01001 01010
349 -> 0 01011 01100 10001
999 -> 0 10001 10001 10001
888 -> 0 10000 10000 10000
843 -> 0 10000 01100 01011
21  -> 1 01010 01001 00101
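
A sketch of that packing in Python, assuming (per the Z-machine spec linked
earlier) that digit d is Z-character 8 + d in alphabet A2 and that 5 is the
shift character, used here both for the lock and as padding:

    def zscii_words(zchars):
        # pack 5-bit Z-characters three per 16-bit word;
        # the top bit marks the final word
        words = []
        for i in range(0, len(zchars), 3):
            a, b, c = (zchars[i:i+3] + [5, 5])[:3]   # pad a short last group
            w = (a << 10) | (b << 5) | c
            if i + 3 >= len(zchars):
                w |= 0x8000
            words.append(w)
        return words

    zchars = [5] + [8 + int(d) for d in "1234999988884321"]
    for w in zscii_words(zchars):
        print(format(w, '016b'))   # matches the six rows above, bit for bit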

-- 
Neil Cerutti

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread danceswithnumbers
Good point

I hope it has a use, other than a cute toy... I don't see it yet.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Thomas Jollans
On 2017-10-23 17:39, danceswithnumb...@gmail.com wrote:
> Thanks Paul...blunt to the point.
> 
> My 8 year old can decode this back into base 10, I still have to help him a
> bit going from base 10 to 8 bit bytes... it's incredibly simple to decode. No
> dictionary, can easily be done with pencil and paper, does not rely on
> redundancies.

It is, is it? Well, you know the rules:

"Code or it didn't happen."


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Thomas Jollans
On 2017-10-23 07:39, Steve D'Aprano wrote:
> By the way: here is a very clever trick for hiding information in the file
> system:
> 
> http://www.patrickcraig.co.uk/other/compression.php
> 
> 
> but as people point out, the information in the file, plus the information in
> the file system, ends up being *larger* than the original. Well done to
> Patrick Craig for finding a clever loophole to "win" the challenge, but he
> did so without actually compressing the original data.

Thanks Steve that was very entertaining.


-- 
Thomas Jollans
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread danceswithnumbers
Thanks Paul...blunt to the point.

My 8 year old can decode this back into base 10, I still have to help him a bit
going from base 10 to 8 bit bytes... it's incredibly simple to decode. No
dictionary, can easily be done with pencil and paper, does not rely on
redundancies.

Jon Hutton 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread danceswithnumbers
Just trying to find a practical application for this algorithm. Not really
useful as it stands now.

Jon Hutton 
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Paul Moore
On 23 October 2017 at 15:29,   wrote:
> I'm really not trolling, and even though some are sarcastic I am learning
> from your comments.

I'm willing to believe that, but if you're trying to claim you have
"compressed" data (in a way that satisfies the technical,
information-theoretic meaning of the word), you need to be *really*
careful not to include hidden information in your assumptions.

For example, if I have a string made up of only the numbers 0-7, then
I can trivially (octal) store that data in 3 bits per digit. But
that's not compression, as there's "hidden information" in the
knowledge that you're using a restricted character set. Factor in that
information and your string only contains 3 bits of information per
digit.
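
A sketch of that 3-bits-per-digit packing; note that the decoder needs the
digit count, which is exactly the hidden information:

    def pack(digits):                        # digits drawn from '0'-'7' only
        n = int(digits, 8)                   # read the string as a base-8 numeral
        nbytes = (3 * len(digits) + 7) // 8  # 3 bits per digit, rounded up
        return n.to_bytes(nbytes, 'big')

    def unpack(data, ndigits):               # ndigits must travel separately
        return format(int.from_bytes(data, 'big'), 'o').zfill(ndigits)

    assert unpack(pack('01234567'), 8) == '01234567'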

Using bytes (characters, if you assume a 1-byte encoding) to hold just
the digits 0-9 is inefficient (there's 256 bytes and you're only using
10 of them), and "of course" you can hold that data more efficiently.
But that's not "compression", that's simply using a better encoding.
In the technical sense, "compression" is about looking at redundancies
that go beyond the case of how effectively you pack data into the
bytes available.

> Dec to bin is not bad at removing wasted space

Yep, you're not talking about what people usually refer to as
compression, but rather about optimising an encoding.

>, but there is a better way. Here is an example. How would you compress these
> numbers?

Each of the 10 digits carries log2(10) bits of information, so getting down
to 4 bits per digit is about encoding. You can go further by using a variable
length encoding and "extra knowledge" about which digits come up most commonly
to give the common digits shorter representations. That's called Huffman coding.
You can use the fact that repeated copies of the same digit come up
together a lot to replace them by digit + count. That's run-length
encoding. There are other more complex approaches. But what *all* of
these have in common is that if you provide random input (within the
limits of what you support - digit strings here) then you'll *always*
get at least one input that encodes to a longer output than your
input.
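
A toy run-length encoder makes that trade-off visible (a sketch; on input
with no runs the output is longer than the input, which is the pigeonhole
at work):

    def rle(s):
        out = []
        for ch in s:
            if out and out[-1][0] == ch:
                out[-1][1] += 1      # extend the current run
            else:
                out.append([ch, 1])  # start a new run
        return out

    print(rle('9999888'))     # [['9', 4], ['8', 3]] -- runs shrink
    print(rle('4135124325'))  # ten runs of length 1 -- longer than the input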

> Compress this:
>
> 4135124325

10 digits x log2(10) bits per digit = a bit under 34 bits of information

>
> Bin to dec...still very large
> 11110110
> 01111000
> 11111101
> 01100101

4x8 = 32 bits, but there's probably a couple of leading zeros needed
if you want to encode all 10-digit numbers.


> New compression method:
>
> 11000101
> 11000111
> 0100
>
> A full byte less than bin.

You'll need to explain how to decode this, in a way that can be used
to decode the encodings of arbitrary 10-digit numbers, and with any
leading zeroes that are needed to cover the whole space of 10-digit
numbers, before you can claim you've compressed 10-digit numbers using
only 24 bits. And if you do that, you've basically overturned most of
information theory, so I'd start by assuming there's a flaw in your
argument - sorry about that... ;-)

Hope this helps put the subject into context. Compression is a very
technical subject, to "do it right". Special cases can be worked out,
sure, but the "hidden assumptions" in a method are what make the
difference between a "compression algorithm" and a "way of storing my
particular data more efficiently".

> I know many are skeptical... that's okay... this has taken 6 years, I'm
> not going to insult your intelligence with something so juvenile as dec
> to bin. I'm really trying to find an application for this since it only
> deals with digits 0-9 or 0-20 or other strange combinations. Wait, did
> you just give it away in that small example... no, because it also
> encrypts in a way that has never been done. The compression is better
> than log(256)÷log(10)... wait, isn't that impossible, binary is
> minimalistic? I agree that binary is minimalistic, but the architecture
> is not, making all arguments conjecture... not laws.

No-one is going to accept a claim that an algorithm you're not willing
to publish is valid. This is about maths/science, not "proprietary
algorithms" or anything like that. If you don't publish your methods,
people will simply point at information theoretic proofs and say
"either you're missing something, or your approach doesn't work in
cases that I care about, so thanks but no thanks".

Paul
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread danceswithnumbers
I'm really not trolling, and even though some are sarcastic I am learning from
your comments. Dec to bin is not bad at removing wasted space, but there is a
better way. Here is an example. How would you compress these numbers? If you
look for redundancy and then code to a bulky dictionary or change it to binary,
one would inflate, the other would only get it down to...

Compress this:

4135124325

Bin to dec...still very large
11110110
01111000
11111101
01100101

New compression method:

11000101
11000111
0100

A full byte less than bin.

I know many are skeptical... that's okay... this has taken 6 years, I'm not
going to insult your intelligence with something so juvenile as dec to bin. I'm
really trying to find an application for this since it only deals with digits
0-9 or 0-20 or other strange combinations. Wait, did you just give it away in
that small example... no, because it also encrypts in a way that has never been
done. The compression is better than log(256)÷log(10)... wait, isn't that
impossible, binary is minimalistic? I agree that binary is minimalistic, but
the architecture is not, making all arguments conjecture... not laws.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread alister via Python-list
On Mon, 23 Oct 2017 13:40:59 +, Neil Cerutti wrote:

> On 2017-10-23, Chris Angelico  wrote:
>> On Mon, Oct 23, 2017 at 11:18 PM, alister via Python-list
>> wrote:
>>> On Mon, 23 Oct 2017 10:41:55 +0100, Paul Moore wrote:
>>>
 On 23 October 2017 at 10:32,  
 wrote:
> According to this website. This is an uncompressable stream.
>
> https://en.m.wikipedia.org/wiki/Incompressible_string
>
> 1234999988884321
>
> It only takes seven 8 bit bytes to represent this

 Would you care to provide the seven 8-bit bytes you propose to use?
 Paul
>>>
>>> I would suspect he is using BCD & storing 2 values in each byte that
>>> is not what is meant by you can't compress random data. His compression
>>> is simply removing redundant space from an inefficient coding
>>
>> I suspect he is using ASCII and storing one value in each byte.
> 
> There's also ZSCII, which stores roughly 3 characters every 2 bytes.
> Since all the digits are in A2, this sequence would take up 7 bytes in
> ZSCII as well.
> 
> http://inform-fiction.org/zmachine/standards/z1point0/sect03.html

not sure how 16 characters can be represented by either ascii or zscii in 
only 8 bytes



-- 
I fear explanations explanatory of things explained.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Neil Cerutti
On 2017-10-23, Chris Angelico  wrote:
> On Mon, Oct 23, 2017 at 11:18 PM, alister via Python-list
> wrote:
>> On Mon, 23 Oct 2017 10:41:55 +0100, Paul Moore wrote:
>>
>>> On 23 October 2017 at 10:32,  
>>> wrote:
 According to this website. This is an uncompressable stream.

 https://en.m.wikipedia.org/wiki/Incompressible_string

 1234999988884321

 It only takes seven 8 bit bytes to represent this
>>>
>>> Would you care to provide the seven 8-bit bytes you propose
>>> to use?
>>> Paul
>>
>> I would suspect he is using BCD & storing 2 values in each
>> byte that is not what is meant by you can't compress random
>> data. His compression is simply removing redundant space from
>> an inefficient coding
>
> I suspect he is using ASCII and storing one value in each byte.

There's also ZSCII, which stores roughly 3 characters every 2
bytes. Since all the digits are in A2, this sequence would take
up 7 bytes in ZSCII as well.

http://inform-fiction.org/zmachine/standards/z1point0/sect03.html

-- 
Neil Cerutti

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Chris Angelico
On Mon, Oct 23, 2017 at 11:18 PM, alister via Python-list
 wrote:
> On Mon, 23 Oct 2017 10:41:55 +0100, Paul Moore wrote:
>
>> On 23 October 2017 at 10:32,   wrote:
>>> According to this website. This is an uncompressable stream.
>>>
>>> https://en.m.wikipedia.org/wiki/Incompressible_string
>>>
>>> 1234999988884321
>>>
>>> It only takes seven 8 bit bytes to represent this
>>
>> Would you care to provide the seven 8-bit bytes you propose to use?
>> Paul
>
> I would suspect he is using BCD & storing 2 values in each byte
> that is not what is meant by you can't compress random data.
> His compression is simply removing redundant space from an inefficient
> coding

I suspect he is using ASCII and storing one value in each byte.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread alister via Python-list
On Mon, 23 Oct 2017 10:41:55 +0100, Paul Moore wrote:

> On 23 October 2017 at 10:32,   wrote:
>> According to this website. This is an uncompressable stream.
>>
>> https://en.m.wikipedia.org/wiki/Incompressible_string
>>
>> 1234999988884321
>>
>> It only takes seven 8 bit bytes to represent this
> 
> Would you care to provide the seven 8-bit bytes you propose to use?
> Paul

I would suspect he is using BCD & storing 2 values in each byte
that is not what is meant by you can't compress random data.
His compression is simply removing redundant space from an inefficient
coding

Can you compress that sequence on paper when you only have the values 0-9
to work with (forget computer representation & removing unused bits)?


-- 
Old age and treachery will overcome youth and skill.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Ben Bacarisse
danceswithnumb...@gmail.com writes:

> ... First let me clarify before you lump this in with
> perpetual motion, or cold fusion. It is a mapping solution to compress
> ANY, I repeat, ANY random file with numbers of only 0 - 9 such as are in
> the million rand numbers page. Entirely possible.

Of course it is.  I have half a dozen programs on this machine that do
it.  I don't need yours.

It's possible you want to claim something beyond what you stated, but no
one is going to get excited until you actually claim such a thing.


-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Compression of random binary data

2017-10-23 Thread Marko Rauhamaa
Thomas Jollans :

> On 2017-10-23 11:32, danceswithnumb...@gmail.com wrote:
>> According to this website. This is an uncompressable stream.
>> 
>> https://en.m.wikipedia.org/wiki/Incompressible_string
>> 
>> 1234999988884321
>
> No, it's not. According to that article, that string is incompressible
> by a particular algorithm. I can see no more general claims.

Here's a compression algorithm that manages to compress that string into
a 0-bit string:

 * If the original string is 1234999988884321 (whatever that means),
   return the empty bit string.

 * Otherwise, prepend a don't-care bit to the original string and return
   the result of the concatenation.
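
In code, that algorithm really is this small (a sketch; every input other
than the special string grows by one bit, which is where the pigeonhole
gets paid):

    SPECIAL = "1234999988884321"

    def compress(s):
        return "" if s == SPECIAL else "0" + s   # prepend a don't-care bit

    def decompress(bits):
        return SPECIAL if bits == "" else bits[1:]

    assert decompress(compress(SPECIAL)) == SPECIAL
    assert decompress(compress("10101")) == "10101"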

>> It only takes seven 8 bit bytes to represent this
>
> You amaze me.
>
> >>> bin(1234999988884321)
> '0b100011000110011100111010111101000101001001101100001'
> >>> len(bin(1234999988884321)[2:])
> 51
> >>> 7 * 8
> 56

So danceswithnumbers made a correct statement!

Anyway, a fun troll.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

