Archiving With POSIX Utilities
Published on 2020-07-22 22:30:00+02:00
When it comes to archiving, the usual answer is tar. As you may have noticed, I intentionally linked to GNU Tar.
If you are a *BSD user, then you use some other implementation. Both of them follow and extend the POSIX standard
for the tar utility. Or so you would think.
Right now there is no POSIX tar utility. It was marked as legacy as early as 1997 and disappeared from the
standard soon after. Its place was taken by a behemoth called pax. The name gets even funnier when
you consider the rationale and the size of this thing. But pax didn't come from tar alone. There was one more
influence here, called cpio. You may know this one if you ever tinkered with RPM packages or initramfs.
In other words, we have three utilities on today's table: tar, cpio and pax. According to
Debian's popularity contest, the frequency of each being installed follows
the exact same order, with tar in 8th place overall, cpio in 52nd, and pax in 6089th. I can't just talk about the
least popular one, so I'll briefly explain how to use each of them in your usual Linux distribution while keeping in
mind what POSIX had to tell us back in the day.
tar
As I've already mentioned, tarballs are the most popular. Not only that, they are commonly described as the easiest
to use, although the interface is something that you can find jokes about. All operations on tarballs are handled via
a single tar utility.
Let's go through three basic operations: create an archive, list its content, and extract it. Tar expects its
first argument to match this regular expression: [rxtuc][vwfblmo]*. The first part is the function,
and the second is a set of modifiers. I'll focus only on those necessary to accomplish the aforementioned tasks.
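To get a feel for which key strings that expression accepts, here is a small sketch that tests a few candidates with grep -E. The helper name is_posix_tar_key is made up for this illustration, and the check only covers the shape of the key, not whether a particular tar implementation accepts it.

```shell
#!/bin/sh
# Hypothetical helper: succeeds when its argument looks like a POSIX-style
# tar key, i.e. one function letter followed by any number of modifiers.
is_posix_tar_key() {
    printf '%s\n' "$1" | grep -Eq '^[rxtuc][vwfblmo]*$'
}

for key in cf tf xvf zcf f; do
    if is_posix_tar_key "$key"; then
        echo "$key: matches"
    else
        echo "$key: does not match"
    fi
done
```

Here zcf is rejected because z is an extension rather than one of the listed functions, and f alone is rejected because a function letter has to come first.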
To create an archive you:
$ tar cf ../archive.tar a_file a_directory
This will create an archive located in the parent directory of the current working directory, containing
a_file and, recursively, a_directory. Let's map every part of the command for clarity:
tar - Call tar
c - Create an archive
f - Use the first argument after cf as the path to the archive
../archive.tar - Path to the archive (without f it would be treated as another file to include in the archive)
a_file a_directory - Files to include in the archive
Now that you have an archive, you can see its content:
$ tar tf ../archive.tar
a_file
a_directory/
a_directory/another_file
As you have probably guessed, the t function is used to write the names of the files that are in the archive.
f works exactly the same way: the first argument after tf is meant to point to the archive file.
To extract everything from the archive you:
$ tar xf ../archive.tar
Or add more arguments to extract selected files:
$ tar xf ../archive.tar a_file
This one will extract only a_file from the archive.
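The three operations can be tried end to end in a throwaway directory. This is only a sketch: the file contents and the src/out directory names are invented for the demo, and mktemp is assumed to be available even though it is not part of POSIX.

```shell
#!/bin/sh
# Scratch area so nothing real is touched (mktemp -d is common, not POSIX).
work=$(mktemp -d) || exit 1
mkdir -p "$work/src/a_directory" "$work/out"
echo 'hello' >"$work/src/a_file"
echo 'world' >"$work/src/a_directory/another_file"

cd "$work/src" || exit 1
tar cf ../archive.tar a_file a_directory   # c: create, f: archive path follows

listing=$(tar tf ../archive.tar)           # t: list member names
printf '%s\n' "$listing"

cd "$work/out" || exit 1
tar xf ../archive.tar a_file               # x: extract, limited to a_file
extracted=$(ls)
printf 'extracted: %s\n' "$extracted"

cd / && rm -rf "$work"
```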
That's pretty much it about tar. There are two more functions: r, which appends new files to an existing archive,
and u, which appends a file only if it is not in the archive yet or has been modified since it was last written
there. Note that the usual compression options are not available in POSIX; they are an extension.
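If you would rather not depend on those extension flags, one common workaround is to let tar write the archive to standard output and pipe it through a separate compressor. The sketch below assumes gzip is installed (it is not a POSIX utility either) and that your tar treats - as standard output or input, which the common implementations do.

```shell
#!/bin/sh
work=$(mktemp -d) || exit 1
cd "$work" || exit 1
echo 'one' >first
echo 'two' >second

# "-" as the archive name: the archive goes to stdout, gzip compresses it.
tar cf - first second | gzip >archive.tar.gz

# Reading works the same way in reverse: decompress, then list from stdin.
gz_members=$(gzip -dc archive.tar.gz | tar tf -)
printf '%s\n' "$gz_members"

cd / && rm -rf "$work"
```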
cpio
Heading off from the usual routes we encounter cpio. It's a more frequent sight than pax, but it still is quite niche
compared to tar's omnipresence. Frankly, I like this one the most because of the way it handles input of file lists.
Sadly, this also makes it slightly bothersome to use.
Now, now, cpio operates in three modes: copy-out, copy-in and pass-through. Our goals are
still the same: to create an archive, list the files inside, and extract it somewhere else. For that we'll only need
the first two modes.
To create an archive, use the copy-out mode, as in "copy to the standard output":
$ find a_file a_directory | cpio -o >../archive.cpio
You probably noticed right away that cpio doesn't accept files as arguments. In copy-out mode it expects a list of
files on standard input, and it writes the formatted archive to standard output. Here is a somewhat step-by-step
explanation:
find a_file a_directory | - List files, directories and their content from arguments and pipe the output to the next command
cpio - Call cpio (duh!)
-o - Use copy-out mode
>../archive.cpio - Redirect standard output of cpio to a file
You now have an archive file called archive.cpio in the parent directory. To see its content, type in:
$ cpio -it <../archive.cpio
a_file
a_directory
a_directory/another_file
1 block
Nice! What's left is extraction. You do it with copy-in mode like this:
$ cpio -i <../archive.cpio
1 block
Huh? What's that? Listing files and extracting both use copy-in mode? That's right. Just like "copy-out" means "copy
to standard output", "copy-in" can be understood as "copy from standard input". The t option prevents cpio from
writing or creating any files; nonetheless, the archive is read from standard input and then translated to a list of
files on standard output. Some extended implementations let you use t directly as the sole option and imply the
copy-in mode.
You can also use patterns when extracting to select files:
$ cpio -i a_file <../archive.cpio
1 block
You can extract nested files if you use the d option:
$ cpio -id a_directory/another_file <../archive.cpio
1 block
This option tells cpio that it's allowed to create directories whenever it is necessary.
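Put together, a copy-out followed by a selective copy-in might look like the sketch below. The directory layout is invented for the demo, and the script simply skips itself when cpio is not installed. The "1 block" style summaries you see while it runs are written to standard error.

```shell
#!/bin/sh
if ! command -v cpio >/dev/null 2>&1; then
    echo 'cpio not found, skipping the demo'
    restored=skipped
else
    work=$(mktemp -d) || exit 1
    mkdir -p "$work/src/a_directory" "$work/out"
    echo 'hello' >"$work/src/a_file"
    echo 'world' >"$work/src/a_directory/another_file"

    # Copy-out: file names come in on stdin, the archive goes out on stdout.
    cd "$work/src" || exit 1
    find a_file a_directory | cpio -o >../archive.cpio

    # Copy-in: -t only lists, -d lets cpio create a_directory when extracting.
    cd "$work/out" || exit 1
    cpio -it <../archive.cpio
    cpio -id a_directory/another_file <../archive.cpio

    restored=$(find . -type f)
    printf 'restored: %s\n' "$restored"

    cd / && rm -rf "$work"
fi
```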
Bonus! Pass-through mode can be used to copy files listed on standard input to a specified directory. It doesn't
create an archive at all.
$ ls ../destination
$ ls
a_directory a_file
$ find a_file a_directory | cpio -p ../destination
0 blocks
$ ls ../destination
a_directory a_file
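Because the file list arrives on standard input, any filter that find offers carries over to cpio for free. As a rough sketch, with invented file names and an assumed -d so that missing directories get created in the destination, copying only the .txt files could look like this:

```shell
#!/bin/sh
if ! command -v cpio >/dev/null 2>&1; then
    echo 'cpio not found, skipping the demo'
    copied=skipped
else
    work=$(mktemp -d) || exit 1
    mkdir -p "$work/src/notes" "$work/destination"
    echo 'keep' >"$work/src/notes/a.txt"
    echo 'skip' >"$work/src/notes/b.log"
    echo 'keep' >"$work/src/c.txt"

    cd "$work/src" || exit 1
    # find decides what gets copied; -d creates notes/ in the destination.
    find . -type f -name '*.txt' | cpio -pd "$work/destination"

    cd "$work/destination" || exit 1
    copied=$(find . -type f | LC_ALL=C sort)
    printf '%s\n' "$copied"

    cd / && rm -rf "$work"
fi
```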
pax
Finally, at the destination! This one lives up to the name of this post as it's still part of POSIX. The fun part is
that you probably don't even have it installed, but don't worry, I didn't have it until like two days ago. It truly
feels like a compromise forced on you and your siblings by your parents. Jokes aside, I actually started to like it,
bulky but kind of cute.
Anyway, let's see what this coffee machine can do for us; same goals as previously. This will be confusing, because
this utility is a compromise, and so it supports both usage styles: tar-like and cpio-like.
To create an archive you can use either:
$ pax -wf ../archive.pax a_directory a_file
$ find a_file a_directory | pax -wd >../archive.pax
$ find a_file a_directory | pax -wdf ../archive.pax
They are equivalent. You can mix the styles as much as you want; as long as it doesn't become a mess, it's quite handy.
As for what each option does:
-w - Indicates that pax will act in write mode (tar's c and cpio's -o)
-f ../archive.pax - The argument after -f is the path to the archive; note that it behaves slightly differently compared to tar: it always takes the next argument instead of the first path that appears after the flags. It means you can't put any options between -f and the path.
a_directory a_file, find a_file a_directory | - Both of these accomplish the same goal of letting pax know what files should be in the archive. They are mutually exclusive! If there is at least one argument pointing to a file, then standard input is not supposed to be read.
-d - This one is used to prevent recursively adding files that are in a directory, so that the behaviour is the same as in cpio:
$ find a_directory a_file | pax -wvf ../archive.pax
a_directory
a_directory/another_file
a_directory/another_file
a_file
pax: ustar vol 1, 4 files, 0 bytes read, 10240 bytes written.
$ find a_directory a_file | pax -wvdf ../archive.pax
a_directory
a_directory/another_file
a_file
pax: ustar vol 1, 3 files, 0 bytes read, 10240 bytes written.
The -v option is used to increase verbosity of the "error" output. You can find similar functionality in
most command line utilities, including tar and cpio.
To list the files that are in the archive you can also use both styles:
$ pax <../archive.pax
a_directory
a_directory/another_file
a_file
$ pax -f ../archive.pax
a_directory
a_directory/another_file
a_file
Yes, that's the default behaviour of pax, and you don't need to specify any argument (in the case of the cpio-like
style). Sweet, isn't it?
To extract the archive use one of:
$ pax -r <../archive.pax
$ pax -rf ../archive.pax
For selecting files to extract use the usual patterns:
$ pax -rf ../archive.pax a_file
$ pax -r a_directory/another_file <../archive.pax
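For completeness, here is a sketch of the whole pax round trip in a scratch directory, mixing the tar-like and cpio-like styles. The layout is invented for the demo, and since pax is often missing, the script skips itself when it can't find it.

```shell
#!/bin/sh
if ! command -v pax >/dev/null 2>&1; then
    echo 'pax not found, skipping the demo'
    pax_listing=skipped
    pax_restored=skipped
else
    work=$(mktemp -d) || exit 1
    mkdir -p "$work/src/a_directory" "$work/out"
    echo 'hello' >"$work/src/a_file"
    echo 'world' >"$work/src/a_directory/another_file"

    cd "$work/src" || exit 1
    pax -wf ../archive.pax a_directory a_file   # tar-like write
    pax_listing=$(pax <../archive.pax)          # cpio-like list (default mode)
    printf '%s\n' "$pax_listing"

    cd "$work/out" || exit 1
    pax -rf ../archive.pax a_file               # read mode, limited to a_file
    pax_restored=$(ls)
    printf 'restored: %s\n' "$pax_restored"

    cd / && rm -rf "$work"
fi
```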
That covers the most basic use cases. There's more: for instance, pax supports a mode similar to the pass-through mode
we already know from cpio. But there is something more important to mention about pax. It's supposed to easily
support various formats.
POSIX says that pax should support the pax, cpio and ustar formats. I installed GNU pax and it seems to support: ar,
bcpio, cpio, sv4cpio, sv4crc, tar and ustar. The default format for my installation is ustar, as you have probably
noticed in the verbose output in one of the examples above. The pax format is an extension of ustar, which is most
likely the reason it's usually omitted.
You can select the format with the -x option; for supported formats please refer to your manual. Also note that
explicitly specifying the format should only be needed when writing an archive. When reading, pax can identify the
archive's format automatically:
$ find a_file a_directory | cpio -o >../archive.cpio
$ pax -vf ../archive.cpio
-rw-rw-r-- 1 ignore ignore 0 Jul 22 22:30 a_file
drwxrwxr-x 2 ignore ignore 0 Jul 22 22:30 a_directory
-rw-rw-r-- 1 ignore ignore 0 Jul 22 22:30 a_directory/another_file
pax: bcpio vol 1, 3 files, 512 bytes read, 0 bytes written.
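The same idea works in the other direction: write with an explicit -x and read back without one. This sketch sticks to cpio and ustar, the two format names POSIX asks for besides pax itself. The file name is invented, and the script skips itself when pax is missing.

```shell
#!/bin/sh
if ! command -v pax >/dev/null 2>&1; then
    echo 'pax not found, skipping the demo'
    formats_ok=skipped
else
    work=$(mktemp -d) || exit 1
    cd "$work" || exit 1
    echo 'hello' >a_file

    # Writing: the format has to be chosen explicitly (or left at the default).
    pax -w -x cpio -f archive.cpio a_file
    pax -w -x ustar -f archive.tar a_file

    # Reading: no -x given, pax is expected to detect the format by itself.
    formats_ok=yes
    for archive in archive.cpio archive.tar; do
        [ "$(pax -f "$archive")" = "a_file" ] || formats_ok=no
    done
    echo "both archives list a_file: $formats_ok"

    cd / && rm -rf "$work"
fi
```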
Final thoughts
Now then, it's time to finally wrap it all up. There is nothing left to say but to remind you to always check your
manual: all of those utilities have various implementations that are compliant with POSIX to varying degrees. Don't be
naive and don't get tricked by them. I find pax the most reliable of them, as its "novelty" and an interface that was
quite "modern" from the start resulted in decently compliant implementations. Moreover, it includes nice things one may
know from both cpio and tar. Find a moment to check it out!
Let's pretend that ar doesn't exist.
Thank you.