<!doctype html>
<html lang="en">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="author" content="aki">
<meta name="tags" content="posix, linux, tutorial, archiving, tar, cpio, pax">
<meta name="published-on" content="2020-07-22T22:30:00+02:00">
<meta name="last-modified-on" content="2022-01-26T19:10:00+01:00">
<link rel="icon" type="image/png" href="favicon.png">
<link rel="stylesheet" type="text/css" href="style.css">

<title>How to Archive With POSIX tar, cpio and pax</title>

<header>
<nav><a href="https://ignore.pl">ignore.pl</a></nav>
<time>22 July 2020</time>
<h1>How to Archive With POSIX tar, cpio and pax</h1>
</header>

<article>
<p>The usual answer to archiving anything is <a href="https://www.gnu.org/software/tar/">tar</a>. As you may see, I
intentionally linked to GNU Tar. If you are a *BSD user, then you use some other implementation. Both of them follow
and extend the POSIX standard for the tar utility. Or so you would think.
<p>Right now there is no POSIX tar utility. It was marked as legacy
<a href="https://pubs.opengroup.org/onlinepubs/007908799/xcu/tar.html">already in 1997</a> and disappeared from the
standard soon after. Its place was taken by a behemoth called
<a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/pax.html">pax</a>. The name gets even funnier when
you consider the rationale and the size of this thing. But pax didn't come from just tar. There was one more influence
here, called <a href="https://pubs.opengroup.org/onlinepubs/007908799/xcu/cpio.html">cpio</a>. You may know this one
if you have ever tinkered with RPM packages or initramfs.
<p>In other words, we have three utilities on today's table: tar, cpio and pax. According to
<a href="https://popcon.debian.org/by_inst">Debian's popularity contest</a>, the frequency of each being installed is in
the exact same order, with tar being at 8th place overall, cpio at 52nd, and pax at 6089th. I can't just talk about the
least popular one, so I'll briefly explain how to use each of them in your usual Linux distribution while keeping in
mind what POSIX had to tell us back in the day.

<h2>How to Archive With POSIX tar utility</h2>
<p>As I've already mentioned, tarballs are the most popular. Not only that, they are commonly described as the easiest
to use, although the interface is something that you can find jokes about. All operations on tarballs are handled via
a single tar utility.</p>
<img src="how_to_archive_with_posix_tar_cpio_and_pax-1.png" alt="box">
<p>Let's go through three basic operations: create an archive, list out the content, and extract it. Tar expects its
first argument to match this regular expression: <code>[rxtuc][vwfblmo]*</code>. The first part is the <em>function</em>,
and the second is any number of <em>modifiers</em>. I'll focus only on those necessary to accomplish the aforementioned tasks.
<p>To create an archive, run:</p>
<pre>
$ tar cf ../archive.tar a_file a_directory
</pre>
<p>This will create an archive that will be located in the parent directory of the current working directory, and will contain
<code>a_file</code> and, recursively, <code>a_directory</code>. Let's map every part of the command for clarity:</p>
<dl>
	<dt><code>tar</code><dd>Call tar
	<dt><code>c</code><dd>Create an archive
	<dt><code>f</code><dd>Use the first argument after <code>cf</code> as the path to the archive
	<dt><code>../archive.tar</code><dd>Path to the archive (without <code>f</code> it would be treated as another file to
	include in the archive)
	<dt><code>a_file a_directory</code><dd>Files to include in the archive
</dl>
<p>Now that you have an archive, you can see its content:</p>
<pre>
$ tar tf ../archive.tar
a_file
a_directory/
a_directory/another_file
</pre>
<p>As you have probably guessed, the <code>t</code> function is used to write the names of files that are in the archive.
<code>f</code> works exactly the same way: the first argument after <code>tf</code> is meant to point to the archive file.
<p>To extract everything from the archive, run:</p>
<pre>
$ tar xf ../archive.tar
</pre>
<p>Or add more arguments to extract selected files:</p>
<pre>
$ tar xf ../archive.tar a_file
</pre>
<p>This one will extract only <code>a_file</code> from the archive.
<p>It's worth noting that you can pass <code>-</code> as the argument to the <code>f</code> modifier. This way the archive will be read from
standard input or written to standard output, depending on the operation. A good chunk of implementations assume this as
the default behaviour if no archive file is provided at all.
<p>That's pretty much it about tar. There are two more functions: <code>r</code>, which appends new files to an existing archive,
and <code>u</code>, which appends a file only if it is not in the archive yet or has been modified since it was
archived. Note that the usual compression options are not available in POSIX; they are an extension.
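<p>Since compression is not part of the POSIX interface, one portable-ish approach is to let tar write to standard output
with <code>-</code> and pipe the result through a separate compressor. The sketch below assumes that <code>gzip</code>
and <code>mktemp</code> are installed; the scratch directory and file contents are made up for illustration:</p>

```shell
# Work in a throwaway directory so nothing real gets overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
mkdir a_directory
echo hello >a_file
echo world >a_directory/another_file

# "f -" with the c function writes the archive to standard output;
# the compression is done by gzip, not by tar itself.
tar cf - a_file a_directory | gzip >archive.tar.gz

# "f -" with the t function reads the archive from standard input.
listing=$(gzip -dc archive.tar.gz | tar tf -)
echo "$listing"
```

<p>The same pipeline shape should work with any other compressor that reads standard input and writes standard output.</p>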

<h2>How to Archive With POSIX cpio utility</h2>
<p>Heading off from the usual routes we encounter cpio. It's a more frequent sight than pax, but it still is quite niche
compared to tar's omnipresence. Frankly, I like this one the most because of the way it handles input of file lists.
Sadly, this also makes it slightly bothersome to use.
<p>Now, now, cpio operates in three modes: <em>copy-out</em>, <em>copy-in</em> and <em>pass-through</em>. Our goals are
still the same: to create an archive, list the files inside, and extract it somewhere else. For that we'll only need the
first two modes.
<p>To create an archive, use the copy-out mode, as in: <em>copy</em> to the standard <em>out</em>put:</p>
<pre>
$ find a_file a_directory | cpio -o &gt;../archive.cpio
</pre>
<p>You have probably noticed right away that cpio doesn't accept files as arguments. In copy-out mode it expects a list of
files on standard input, and it writes the formatted archive to standard output. Here is a more or less step-by-step
explanation:</p>
<dl>
	<dt><code>find a_file a_directory |</code><dd>List files, directories and their content from arguments and pipe the
	output to the next command
	<dt><code>cpio</code><dd>Call cpio (duh!)
	<dt><code>-o</code><dd>Use copy-out mode
	<dt><code>&gt;../archive.cpio</code><dd>Redirect standard output of cpio to a file
</dl>
<p>You now have an archive file called <code>archive.cpio</code> in the parent directory. To see its content, type in:</p>
<pre>
$ cpio -it &lt;../archive.cpio
a_file
a_directory
a_directory/another_file
1 block
</pre>
<p>Nice! What's left is extraction. You do it with copy-in mode like this:</p>
<pre>
$ cpio -i &lt;../archive.cpio
1 block
</pre>
<p>Huh? What's that? Listing files and extracting both use copy-in mode? That's right. Just as "copy-out" means "copy to
standard output", "copy-in" can be understood as "copy from standard input". The <code>t</code> option prevents any
files from being written or created by cpio; nonetheless, the archive is read from standard input and then translated to a list of
files on standard output. Some extended implementations let you use <code>t</code> directly as the sole option and imply the
copy-in mode.
<p>You can also use patterns when extracting to select files:</p>
<pre>
$ cpio -i a_file &lt;../archive.cpio
1 block
</pre>
<p>You can copy nested files if you use the <code>d</code> option:</p>
<pre>
$ cpio -id a_directory/another_file &lt;../archive.cpio
1 block
</pre>
<p>This option tells cpio that it's allowed to create directories whenever it is necessary.</p>
<img src="how_to_archive_with_posix_tar_cpio_and_pax-2.png" alt="pass-through">
<p>Bonus! Pass-through mode can be used to copy files listed on standard input to a specified directory. It doesn't create
an archive at all.</p>
<pre>
$ ls ../destination
$ ls
a_directory  a_file
$ find a_file a_directory | cpio -p ../destination
0 blocks
$ ls ../destination
a_directory  a_file
</pre>

<h2>How to Archive With POSIX pax utility</h2>
<p>Finally, at the destination! This one lives up to the name of this post as it's still part of POSIX. The fun part is
that you probably don't even have it installed, but don't worry, I didn't have it until like two days ago. It truly
feels like a compromise forced on you and your siblings by your parents. Jokes aside, I actually started to like it,
bulky but kind of cute.
<p>Anyway, let's see what this coffee machine can do for us; same goals as previously. This will be confusing, because
this utility is a compromise, and so it supports both usage styles: tar-like and cpio-like.
<p>To create an archive you can use either:</p>

<pre>
$ pax -wf ../archive.pax a_directory a_file
$ find a_file a_directory | pax -wd &gt;../archive.pax
$ find a_file a_directory | pax -wdf ../archive.pax
</pre>

<p>They are equivalent. You can mix the styles as much as you want; as long as it doesn't become a mess, it's quite handy.
As for which option does what:</p>

<dl>
	<dt><code>-w</code><dd>Indicates that pax will act in write mode (tar's <code>c</code> and cpio's <code>-o</code>)
	<dt><code>f ../archive.pax</code><dd>Argument after <code>f</code> is the path to the archive; note that it behaves
	slightly differently compared to tar: it always takes the next argument instead of the first path that appears after
	the flags. It means you can't put any options between <code>-f</code> and the path.
	<dt><code>a_directory a_file</code>
	<dt><code>find a_file a_directory |</code><dd>Both of these accomplish the same goal of letting <code>pax</code> know
	which files should be in the archive. They are mutually exclusive! If there is at least one argument pointing to a file,
	then standard input is not supposed to be read.
	<dt><code>d</code><dd>This one is used to prevent recursively adding files that are in a directory, so that the
	behaviour is the same as in cpio:
<pre>
$ find a_directory a_file | pax -wvf ../archive.pax
a_directory
<span style="color: red">a_directory/another_file
a_directory/another_file</span>
a_file
pax: ustar vol 1, 4 files, 0 bytes read, 10240 bytes written.
$ find a_directory a_file | pax -wv<span style="color: green">d</span>f ../archive.pax
a_directory
<span style="color: green">a_directory/another_file</span>
a_file
pax: ustar vol 1, 3 files, 0 bytes read, 10240 bytes written.
</pre>
</dl>

<p>The <code>v</code> option is used to increase verbosity of the "error" output. You can find similar functionality in
most command line utilities, including tar and cpio.
<p>To list the files that are in the archive, you can again use either style:</p>
<pre>
$ pax &lt;../archive.pax
a_directory
a_directory/another_file
a_file
$ pax -f ../archive.pax
a_directory
a_directory/another_file
a_file
</pre>
<p>Yes, listing is the default behaviour of pax, so you don't need to specify any argument (in the case of the cpio-like style).
Sweet, isn't it?
<p>To extract the archive, use one of:</p>
<pre>
$ pax -r &lt;../archive.pax
$ pax -rf ../archive.pax
</pre>
<p>For selecting files to extract use the usual patterns:</p>
<pre>
$ pax -r a_file -f ../archive.pax
$ pax -r a_directory/another_file &lt;../archive.pax
</pre>
<p>That covers the most basic use cases. There's more; for instance, pax supports a mode similar to the pass-through mode
we already know from cpio. But there is something more important to mention about pax: it's supposed to easily
support various different formats.
<p>POSIX says that pax should support the pax, cpio and ustar formats. I installed GNU pax and it seems to support: ar,
bcpio, cpio, sv4cpio, sv4crc, tar and ustar. The default format for my installation is ustar, as you have probably
noticed in the verbose output in one of the examples above. The pax format is an extension of ustar; that's most likely
the reason it's usually omitted.
<p>You can select the format with the <code>-x</code> option; for supported formats, please refer to your manual. Also note that
explicitly specifying the format should only be needed when writing an archive. When reading, pax can identify the archive's
format on its own:</p>
<pre>
$ find a_file a_directory | cpio -o &gt;../archive.cpio
$ pax -vf ../archive.cpio
-rw-rw-r--  1 ignore   ignore    0 Jul 22 22:30 a_file
drwxrwxr-x  2 ignore   ignore    0 Jul 22 22:30 a_directory
-rw-rw-r--  1 ignore   ignore    0 Jul 22 22:30 a_directory/another_file
pax: bcpio vol 1, 3 files, 512 bytes read, 0 bytes written.
</pre>

<h2>Final thoughts</h2>
<p>Now then, it's time to finally wrap it all up. There is nothing left to say except: remember to always check your manual.
All of those utilities have various implementations that are compliant with POSIX to varying degrees. Don't be naive and
don't get tricked by them. I find pax the most reliable of them, as its "novelty" and an interface that was quite
"modern" from the start resulted in decently compliant implementations. Moreover, it includes nice things one may know
from both cpio and tar. Find a moment to check it out!
<p>Now, it's time for <a href="https://pubs.opengroup.org/onlinepubs/9699919799/utilities/ar.html">ar</a>. And it so
happens that a year or so later <a href="how_to_archive_with_posix_ar.html">I wrote about it</a>. Enjoy.</p>
<img src="how_to_archive_with_posix_tar_cpio_and_pax-3.png" alt="boo!">
</article>
<script src="https://stats.ignore.pl/track.js"></script>