Theory: Compression algorithm that makes some files smaller but none bigger?




I came across this question:



"A lossless compression algorithm claims to guarantee to make some files smaller and no files larger.

Is this:



a) Impossible



b) Possible but may run for an indeterminate amount of time,



c) Possible for compression factor 2 or less,



d) Possible for any compression factor?"



I'm leaning towards (a), but couldn't give a solid explanation as to why. (I'll list the thoughts a friend and I came up with as a possible answer)




6 Answers



By the pigeon-hole principle: given strings of 10 bits, you have 1024 possible inputs, and they would need to map to strings of 9 bits or fewer, of which there are only 1023 (counting the empty string), so there are fewer possible outputs than inputs.



This guarantees that either the algorithm has collisions (i.e. it is lossy) or at some point it chooses to return the unmodified input as output.



In the latter case, you cannot determine how to decompress an arbitrary string of bits. (It could be an unmodified input, or a compressed output from a larger bit string).



-> Impossible.
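A quick sanity check of that counting, as a Python sketch (it just enumerates the sizes):

    # Pigeonhole check: 10-bit inputs vs. outputs of 9 bits or fewer.
    inputs = 2 ** 10                          # 1024 strings of exactly 10 bits
    outputs = sum(2 ** n for n in range(10))  # 1023 strings of 0..9 bits
    print(inputs, outputs)                    # 1024 1023
    assert outputs < inputs  # any such mapping must therefore collide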





This is just speculation. I'm hardly an expert on this, just wanted to see what others thought of my answer. Thanks!
– RJFalconer
Oct 3 '09 at 11:59





@BlueNovember: don't worry: every user who comes across an archiver eventually asks himself whether such compression is possible :-)
– P Shved
Oct 3 '09 at 12:02





Hmm. I didn't, Pavel.
– spender
Oct 3 '09 at 12:04





+1 Pigeon-hole principle. Consider that you either compress, or keep the original, and that you need at least one marker bit appended (or prefixed) to the original, to mark it as such.
– u0b34a0f6ae
Oct 3 '09 at 12:04





More formally: There does not exist an injective function on finite sets whose codomain is smaller than its domain.
– LBushkin
Oct 4 '09 at 2:14



Just a slight clarification of RJFalconer's post...



You only have to have some files becoming smaller, so the claim that every string of 10 bits has to map to 9 bits or fewer isn't quite right. In particular, if someone proposed such a compression mechanism it could map every string of 10 bits or fewer to itself (i.e. the identity transformation).



However, we are told that there is at least one file which does become smaller. Without loss of generality, say it starts as x bits and ends up as y bits, where y is strictly less than x.



Now consider the domain of "files with y bits or fewer", which contains 2^(y+1) - 1 bit strings (including the empty one). In order for none of those to result in a bigger file, each has to map to a bit string in the same domain, i.e. one of those 2^(y+1) - 1 compressed files. However, we already know that the initial string of length x bits compresses to one of those values, leaving only 2^(y+1) - 2 possible values.



At this point the pigeonhole principle comes in: you clearly can't map 2^(y+1) - 1 inputs to 2^(y+1) - 2 outputs without repeating an output, which violates the reversibility of compression.
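To make the counting concrete, here is a small Python sketch that enumerates the domain for a small y (y = 3 is an arbitrary choice):

    from itertools import product

    y = 3
    # All bit strings of y bits or fewer, including the empty string.
    domain = [bits for n in range(y + 1) for bits in product("01", repeat=n)]
    print(len(domain))                     # 2^(y+1) - 1 = 15
    assert len(domain) == 2 ** (y + 1) - 1
    # The compressed x-bit file already occupies one of these outputs,
    # so only 14 outputs remain for all 15 inputs: pigeonhole.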





When talking about strings, I totally agree. But since we're talking about files isn't there one more variable to consider: the filename? Deciding which file to decompress and which to leave could be based on the file extension. Or am I missing something?
– Yannick Motton
Oct 3 '09 at 12:27





@Yannick: If you're allowed to change the filename, then you can trivially output an empty file with a filename which contains all the data. If you can't change the filename, then it depends on whether there's a correlation between filename and data. If you know that any file with an extension of "000" is all zeroes, then you could indeed compress the data... but I'd suggest that's cheating, and that you should be able to store arbitrary data with arbitrary filenames. At that point, it becomes irrelevant.
– Jon Skeet
Oct 3 '09 at 12:41



a) impossible



If you have a file that cannot be compressed further, you still have to store the information about whether or not it has been compressed, so in that case the file would have to grow.
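A minimal Python sketch of that flag-byte scheme, using zlib as the underlying compressor (the one-byte header is exactly the unavoidable growth being described):

    import os
    import zlib

    def compress(data: bytes) -> bytes:
        """Prefix a flag byte: 0x01 = deflated payload, 0x00 = stored as-is."""
        packed = zlib.compress(data)
        if len(packed) < len(data):
            return b"\x01" + packed
        return b"\x00" + data              # incompressible: grows by one byte

    def decompress(blob: bytes) -> bytes:
        flag, payload = blob[:1], blob[1:]
        return zlib.decompress(payload) if flag == b"\x01" else payload

    raw = os.urandom(1000)                 # random bytes won't deflate smaller
    assert decompress(compress(raw)) == raw
    assert len(compress(raw)) == len(raw) + 1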





Why the downvote? If you don't say anything at all about what you don't like, it's rather pointless.
– Guffa
Dec 26 '09 at 18:28





Why the downvote? If you don't explain what it is that you think is wrong, it can't improve the answer.
– Guffa
Feb 26 '15 at 9:25



I know that I'm kinda late, but I found this via Google and someone else could do the same, so I'll post my answer: the obvious solution is a) impossible, as Jon Skeet pointed out well (and by the way there are a lot of proofs all over the internet). I'm not questioning the impossibility of compressing random data, just to be clear from the beginning; I understand the theory that lies behind it and, if you ask me, I trust the math. :D


a) impossible



But, if we're allowed to think laterally, we could definitely take advantage of the fact that the question is not well-defined, meaning that it does not give a strict definition of "compression algorithm" or of the properties it should have (other than reducing some files without expanding any others).



Also, it doesn't put any condition whatsoever on the files to be compressed; the only thing it's interested in is "to make some files smaller and no files larger".



That said, we now have at least two ways to show that such an algorithm does, in fact, exist:



We can exploit the name of the file to store some of the file's information (or even the entire file, if the file system allows it, thus reducing every file to 0 bits).
Trivially, we could simply decide to leave every file but one untouched, reducing that one to 0 bits and renaming it to a predefined name.
I agree that this could be considered cheating, but then again, there are no restrictions in the initial question, and this algorithm would effectively achieve the purpose (as long as no one renames the file, which is why this would be a very poor design choice besides being pointless). A sketch of this filename trick follows after the list.



We can limit the set of files to be compressed, say, to the ones that are at least X bits long. Once again, a trivial solution would be to leave every file untouched except one, which we can reduce to a file smaller than X bits.
Now we do have an algorithm which, quoting verbatim, makes some files smaller and no files larger; however, it restricts its possible inputs (i.e. it cannot process all files).
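For what it's worth, a deliberately cheaty Python sketch of the filename trick (the base64 naming convention is my own assumption, it only handles non-empty files, and real file systems cap name lengths, so this breaks for anything but tiny files):

    import base64
    from pathlib import Path

    def cheat_compress(path: Path) -> Path:
        """Move the file's bytes into its name, then truncate it to 0 bytes."""
        name = base64.urlsafe_b64encode(path.read_bytes()).decode("ascii")
        target = path.rename(path.with_name(name))
        target.write_bytes(b"")            # every "compressed" file is 0 bytes
        return target

    def cheat_decompress(path: Path) -> bytes:
        return base64.urlsafe_b64decode(path.name.encode("ascii"))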





To those who argue that this wouldn't have any practical use, I say that I agree with you... but hey, this is theory, and this was just a theoretical dissertation. ;)



Obviously, if I were taking a test and facing this question, I'd put a bold X on (a), and then just go on without thinking too much about it.





Nevertheless, since natural language is intrinsically ambiguous and the question is not formally expressed, it is perfectly possible to show that each of the other possible answers is not necessarily wrong: with the right conditions in place, and with a more precise specification of what is meant by certain concepts, we may legitimately be able to meet the goal of any of the other listed options by some sort of trickery, forcing the program to achieve the desired behavior.



e) Possible



...with some restrictions.



I recently came across Shoco, a string compression library for small strings. I was reminded of this question when reading this claim:



...the most remarkable property of shoco is that the compressed size will never exceed the size of your input string, provided it is plain ASCII.



If you are sure that the input data is plain ASCII, your output buffer only needs to be as large as the input string



http://ed-von-schleck.github.io/shoco/#how-it-works
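This doesn't contradict the pigeonhole argument, because the input domain is restricted (plain ASCII) while the output may use all 256 byte values. Here is a toy Python sketch in the same spirit, not shoco's actual format (the bigram table below is made up):

    # Toy ASCII-only compressor: output bytes with the high bit clear are
    # literal characters; bytes with the high bit set encode one of 128
    # common bigrams. Output is never longer than the ASCII input.
    PAIRS = [a + b for a in "etaonisrhl" for b in " etaonisrhldcu"][:128]
    CODE = {p: 0x80 + i for i, p in enumerate(PAIRS)}

    def compress(text: str) -> bytes:
        out, i = bytearray(), 0
        while i < len(text):
            if text[i:i + 2] in CODE:
                out.append(CODE[text[i:i + 2]])  # 2 chars -> 1 byte
                i += 2
            else:
                out.append(ord(text[i]))         # 1 char  -> 1 byte
                i += 1
        return bytes(out)

    def decompress(blob: bytes) -> str:
        return "".join(PAIRS[b - 0x80] if b >= 0x80 else chr(b) for b in blob)

    s = "the rain in spain"
    assert decompress(compress(s)) == s
    assert len(compress(s)) <= len(s)            # never larger, often smaller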



possible



to make some files smaller and no files larger





If said compression algorithm makes the file larger, just have it return the original file.





Then how do you uncompress? You'd need another bit of information to indicate that the file had/had not been compressed.
– RJFalconer
21 hours ago





yeah............
– Carson Graham
15 hours ago






