
<!doctype html>
<html lang="en">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="aki">
<meta name="tags" content="POSIX, compression, archiving, tutorial, guide, howto">
<meta name="published-on" content="2021-08-14T19:48:00+02:00">
<link rel="icon" type="image/png" href="favicon.png">
<link rel="stylesheet" href="style.css">

<title>How to Compress Files in POSIX</title>

<header>
<nav><a href="https://ignore.pl">ignore.pl</a></nav>
<time>14 August 2021</time>
<h1>How to Compress Files in POSIX</h1>
</header>

<article>
<p>This was quite an amusing one to read about. On one hand, the results kind of surprised me, but then, on the other...
What exactly did I expect?
<p>Anyway! How does one compress files in a POSIX-compliant system?
<p>By the power of Hinchliffe's rule, I say: you don't. <i>Wait, what kind of tutorial is this</i>?</p>


<h2>The standard way</h2>
<img src="how_to_compress_files_in_posix-1.png" alt="the way">
<p>POSIX defines three utilities for compression but let's focus on two of them:
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/compress.html">compress</a> and
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uncompress.html">uncompress</a>. They have quite
descriptive names and are incredibly simple to use. Just give them the names of the files to process and they will do
the work. The result is stored in a <code>*.Z</code> file:

<pre>
$ ls
archive.tar
$ <mark>compress archive.tar</mark>
$ ls
archive.tar.Z
$ <mark>uncompress archive.tar.Z</mark>
$ ls
archive.tar
</pre>

<p>By default, the input file is replaced by the output. This can be avoided with the <code>-c</code> option, which
writes the result to the standard output instead:

<pre>
$ compress <mark>-c</mark> archive.tar <mark>&gt;archive.tar.Z</mark>
$ uncompress <mark>-c</mark> archive.tar.Z <mark>&gt;archive.tar.bak</mark>
$ ls
archive.tar archive.tar.bak archive.tar.Z
</pre>

<p>And of course standard input can be used as well, via the special filename <code>-</code>.
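<p>A minimal sketch of the standard-input form, assuming a POSIX-style <b>compress</b> and <b>uncompress</b> pair is
installed. The file name <code>greeting.Z</code> and the sample text are made up for illustration, and the guard is only
there so the snippet is safe to paste on systems that lack the utilities:

```shell
# Assumption: compress and uncompress behave as the POSIX standard describes.
# greeting.Z is a hypothetical file name used only for this example.
if command -v compress >/dev/null 2>&1 && command -v uncompress >/dev/null 2>&1
then
    # "-" makes compress read standard input; -c writes to standard output.
    # -f forces output even if such a tiny input does not shrink.
    printf 'hello hello hello hello hello hello\n' | compress -f -c - >greeting.Z
    # Decompress to standard output; greeting.Z itself is left in place.
    uncompress -c greeting.Z
else
    echo 'compress/uncompress not found; skipping' >&2
fi
```

<p>If both utilities are present, the last command should print the original line back.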

<p>So far it all looks good. That's because up to this point we have been discussing an imaginary implementation that
behaves exactly the way the POSIX standard describes. The real world is different.


<h2>The actual way</h2>
<p>If you are just like me and come from a more Linux-centric background, then prepare for disappointment. If you are
coming from BSD, then I have great news for you: you can stop here, because your system actually implements the standard.</p>

<img src="how_to_compress_files_in_posix-2.png" alt="the other way">

<p>Instead of <b>compress</b>, most Linux distributions come with <b>gzip</b>(1). Usually, it is
<a href="https://www.gnu.org/software/gzip/">GNU Gzip</a>. The reason for that is, of course, legal: the LZW algorithm
used by <b>compress</b> was patented. The full reasoning is covered in
<a href="https://www.gnu.org/philosophy/gif.html">No GIF Files</a>. However, this is all in the past because the LZW
patents have already expired.
<p>Let's put the story and reasons aside. What we have is an inability to conform to a standard for legal reasons, and
that inability became a standard of its own. And so on Linux systems you will end up using <b>gzip</b>, <b>xz</b>(1),
<b>bzip2</b>(1), or really anything else:

<pre>
$ ls
archive.tar
$ <mark>gzip archive.tar</mark>
$ ls
archive.tar.gz
$ <mark>gzip -d archive.tar.gz</mark>
$ ls
archive.tar
</pre>

<p>You can replace <code>gzip</code> with any of the mentioned utilities, since they have very similar interfaces. Not
only that, they are also partially compatible with the interface for <b>compress</b> defined by the POSIX standard. Each
of them has an additional <i>un</i> command (e.g., <code>gunzip</code>, <code>unxz</code>) that can be used instead of
the <code>-d</code> option. If you feel adventurous you could try symlinking them (especially <b>gzip</b>, since it can
also decompress LZW-based <code>.Z</code> files, even though its own format is built on LZ77).
<p>This raises the question of format compatibility, but that comparison is big enough to deserve its own article.
<p>In the end, if you want to use POSIX <b>compress</b> in GNU/Linux - you don't. Unless...


<h2>The other way</h2>
<p>Unless you use <a href="https://github.com/vapier/ncompress">ncompress</a>, which has both <b>compress</b>(1) and
<b>uncompress</b>(1). What's more, it descends directly from the original implementation. But there is one thing you
need to know about it.
<p>It's bad. Yes, a detailed comparison of compression algorithms is yet another huge and interesting topic, but this
particular case really can be summed up as: it's bad. It's OK with text. At least it implements the POSIX standard and
is most likely available in your distribution's repository.
<p>Here are the results of compressing an arbitrary tarball that contains mostly source code and some resources, all
done with default options:

<table>
<tr><th>Program<th>Size
<tr><td>Source (uncompressed)<td>22M
<tr><td><b>bzip2</b><td>5.4M
<tr><td><b>gzip</b><td>5.9M
<tr><td><b>xz</b><td>2.6M
<tr><td><b>ncompress</b><td>9.1M
</table>

<p>In other words, it's not bad, but it lags behind more modern programs. This could also be another reason why it is
not used, or even installed by default, in most Linux distributions. I didn't check BSD's implementation, but I expect
rather good results.
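<p>If you want to run a comparison like the one in the table on your own files, something along these lines should do.
It is only a sketch: <code>compare_sizes</code> is a name I made up, the tool list simply mirrors the table, and any
tool that is not installed is skipped:

```shell
# Print the size, in bytes, of one input file and of the output that each
# installed tool produces for it with default options.
# compare_sizes is a hypothetical helper name, not a standard utility.
compare_sizes() {
    input=$1
    # tr strips the padding that some wc implementations add.
    printf '%s\t%s\n' source "$(wc -c <"$input" | tr -d ' ')"
    for tool in bzip2 gzip xz compress
    do
        if command -v "$tool" >/dev/null 2>&1
        then
            # -c writes to standard output, so the input file stays untouched.
            size=$("$tool" -c <"$input" | wc -c | tr -d ' ')
            printf '%s\t%s\n' "$tool" "$size"
        fi
    done
}
```

<p>Called as <code>compare_sizes archive.tar</code>, it prints one tab-separated line per tool. The numbers will of
course depend on the input.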
<p>The main takeaway from this article is that if you plan to write anything that is portable across POSIX-compliant or
semi-compliant systems, then you need to give compression slightly more attention.
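<p>One way to give it that attention is to check at run time which compressor is actually present instead of assuming
one. A minimal sketch, where both the helper name <code>pick_compressor</code> and the order of preference are my own
assumptions rather than anything the standard prescribes:

```shell
# Print the name of the first compressor found on this system.
# pick_compressor and the preference order are assumptions made for this sketch.
pick_compressor() {
    for tool in compress gzip bzip2 xz
    do
        # command -v is the POSIX way to ask whether a utility can be found.
        if command -v "$tool" >/dev/null 2>&1
        then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    echo 'no known compressor found' >&2
    return 1
}
```

<p>A script could then run <code>tool=$(pick_compressor) &amp;&amp; "$tool" -c archive.tar &gt;archive.tar.out</code>.
Keep in mind that each tool writes its own format, so the receiving side needs the matching decompressor.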
</article>
<script src="https://stats.ignore.pl/track.js"></script>