This was quite an amusing one to read about. On one hand, the results kind of surprised me, but then, on the other...
What exactly did I expect?
Anyway! How does one compress files in a POSIX-compliant system?
By the power of Hinchliffe's rule, I say: you don't. Wait, what kind of tutorial is this?
The standard way
POSIX defines three utilities for compression, but let's focus on two of them: compress and uncompress. They have quite descriptive names and are incredibly simple to use. Just give them the names of the files to process and they will do the work. The result will be stored in a *.Z file:
$ ls
archive.tar
$ compress archive.tar
$ ls
archive.tar.Z
$ uncompress archive.tar.Z
$ ls
archive.tar
By default, the input file is replaced by the output. This can be avoided with the -c option, which redirects the output to the standard output:
$ compress -c archive.tar >archive.tar.Z
$ uncompress -c archive.tar.Z >archive.tar.bak
$ ls
archive.tar archive.tar.bak archive.tar.Z
And of course, standard input can be used as well, with the special filename: -.
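For example, to compress a tar stream on the fly without an intermediate file (a quick sketch: src is just a placeholder directory, and the explicit - operand assumes your implementation accepts it as described; omitting the operand and relying on standard input should work just as well):
$ tar -cf - src | compress -c - >src.tar.Z
$ uncompress -c src.tar.Z | tar -tf -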
So far it all looks good. That's because we're discussing an imaginary implementation of these utilities, exactly the way they are described by the POSIX standard. However, that's not the real world.
The actual way
If you, like me, come from more of a Linux background, then prepare for disappointment. If you are coming from BSD, then I have great news for you: you can stop here, because your system actually implements the standard.
Instead of compress, most Linux distributions come with gzip(1), usually GNU Gzip. The reason for that is, of course, legal work and patent issues. The full reasoning is covered in No GIF Files. However, this is all in the past, because the LZW patents have already expired.
Let's put the story and reasons aside. What we have is an inability to conform to a standard due to legal reasons, and this inability itself became a standard. And so on Linux systems you will end up using gzip, or xz(1), or bzip2(1), or really anything else:
$ ls
archive.tar
$ gzip archive.tar
$ ls
archive.tar.gz
$ gzip -d archive.tar.gz
$ ls
archive.tar
You can replace gzip with any of the mentioned utilities - they have very similar interfaces. Not only that, they are also partially compatible with the interface for compress defined by the POSIX standard. Each of them has an additional un-prefixed command (e.g., gunzip, unxz) that can be used instead of the -d option. If you feel adventurous you could try symlinking them (especially gzip, since it can also decompress the LZW-compressed *.Z format).
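As a rough sketch of that symlink trick (the ~/bin location is just an example; GNU gzip chooses its mode based on the name it was invoked under, so a link called uncompress decompresses - just keep in mind that gzip cannot produce *.Z output, so the compressing half of the trick won't give you real compress files):
$ mkdir -p ~/bin
$ ln -s "$(command -v gzip)" ~/bin/uncompress
$ ~/bin/uncompress archive.tar.Z
$ ls
archive.tar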
This raises a question about format compatibility, but it's a comparison big enough to have its own article.
In the end, if you want to use POSIX compress in GNU/Linux - you don't. Unless...
The other way
Unless you use ncompress, which provides both compress(1) and uncompress(1). Even more, it descends directly from the original implementation. But there is one thing you need to know about it.
It's bad. Yes, a detailed comparison of compression algorithms is yet another huge and interesting topic, but this particular case really can be summed up as: it's bad. It's OK with text. At least it implements the POSIX standard and is most likely available in your distribution's repository.
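For example, on Debian-based systems the package is literally called ncompress; the name may differ elsewhere, so check your own package manager:
$ sudo apt install ncompress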
Here are the results of compressing an arbitrary tarball containing mostly source code and some resources, all done with default options:
Source    | 22M
bzip2     | 5.4M
gzip      | 5.9M
xz        | 2.6M
ncompress | 9.1M
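If you want to reproduce a comparison like this on your own tarball (archive.tar is a placeholder; each tool runs at its default level and the size of the compressed stream is printed in bytes):
$ for tool in gzip bzip2 xz compress; do
>   printf '%s\t' "$tool"
>   $tool -c archive.tar | wc -c
> done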
In other words, it's not that bad, but it lags behind (more) modern programs. This could also be an additional reason why it is not used, or even installed by default, in most Linux distributions. I didn't check BSD's implementation, but I expect rather good results.
The main takeaway from this article is that if you plan to write anything portable across POSIX-compliant or semi-compliant systems, then you need to give compression slightly more attention.
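For instance, instead of hard-coding one tool, a portable shell script could probe for whichever compressor is actually available. A minimal sketch (the function name and the fallback order are my own choice):

compress_file() {
    # Prefer the POSIX utility when it exists, otherwise fall back to gzip.
    if command -v compress >/dev/null 2>&1; then
        compress "$1"
    elif command -v gzip >/dev/null 2>&1; then
        gzip "$1"
    else
        echo "compress_file: no compressor found" >&2
        return 1
    fi
}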