One of my favourite things about Unix-like operating systems is package management. It has many technical advantages, such as:
- ease of installation - just search, install,
- automatic dependency resolution,
- reduction in storage usage by utilizing shared libraries (no DLL hell),
- clean filesystem.
It also has many practical advantages, such as listing installed packages, showing package information (version, URL), and finding which package a file belongs to. Unfortunately, the implementation of this concept isn’t always perfect. For instance, the learning curve for Debian packaging is quite steep, and RPM has often been criticised for dependency hell.
When I first looked at Archlinux’s package management system, I was
pleasantly surprised at how simple the PKGBUILD format is. It is a shell
script in which variables define metadata and a build() function
specifies the steps required to build the package.
The Archlinux User Repository allows users to contribute PKGBUILDs without a lengthy review process. The code behind it wasn’t in the greatest shape, so I decided to rewrite it in Python + Django. During development I ran into the problem of parsing the PKGBUILD format. I had to get the metadata into the database somehow. Since a PKGBUILD is just a shell script, I thought sourcing it and outputting Python, then eval-ing it, would be the easiest way of doing it. Unfortunately this has many security problems, including the execution of malicious code, or infinite loops (the web server would hang). I was forced to write a minimal shell parser to extract the metadata. While it removes the security concerns, it raises others, such as inaccuracy and maintenance problems (the specification and parser code are tightly coupled). It turns out the shell grammar isn’t exactly the simplest one.
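To give a feel for the static approach, here is a minimal Python sketch, not the parser used in the rewrite. It assumes metadata only ever appears as unindented `name=value` or `name=(...)` assignments, and the sample PKGBUILD, its URL and its checksums are placeholders invented for illustration.

```python
import re
import shlex

# Matches unindented assignments such as  pkgname=foo  or  source=(a b c).
# Arrays may span several lines. Nothing inside functions or conditionals is
# understood, which is the accuracy problem mentioned above.
ASSIGNMENT = re.compile(r"^(\w+)=(?:\(([^)]*)\)|(.*))$", re.MULTILINE)


def extract_metadata(text):
    """Collect top-level assignments without executing any shell code."""
    metadata = {}
    for match in ASSIGNMENT.finditer(text):
        name, array_body, scalar = match.groups()
        if array_body is not None:
            # shlex handles the quoting; variables such as $pkgver stay literal.
            metadata[name] = shlex.split(array_body)
        else:
            words = shlex.split(scalar)
            metadata[name] = words[0] if words else ""
    return metadata


# Placeholder PKGBUILD used only to exercise the function.
EXAMPLE = '''\
pkgname=foo
pkgver=1.0
pkgdesc="An example package"
source=("http://example.com/foo-1.0.tar.gz"
        "fix-build.patch")
md5sums=('d41d8cd98f00b204e9800998ecf8427e'
         '9e107d9d372bb6826bd81d3542a419d6')

build() {
  cd "$srcdir/foo-1.0"
  make
}
'''

metadata = extract_metadata(EXAMPLE)
```

Because nothing is executed, a malicious or looping script cannot harm the server, but anything beyond plain assignments is simply invisible to this kind of extractor.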
The PKGBUILD format has some warts, which are mostly due to the lack of data
structures in bash, such as hash tables. For instance, there are two arrays,
source and md5sums. Each element in the md5sums array maps to the
source element at the same position. As you can imagine, this can easily
result in human error, although makepkg (the package creation utility) is
able to create this field automatically.
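The positional coupling between the two arrays can be sketched in a few lines of Python. The helper name `pair_sources` and the file names and checksums are made up for illustration. The only error the function can detect is a length mismatch; a reordering slips through silently.

```python
def pair_sources(sources, md5sums):
    """Pair each source with the checksum at the same index."""
    if len(sources) != len(md5sums):
        raise ValueError(
            f"{len(sources)} sources but {len(md5sums)} checksums"
        )
    return list(zip(sources, md5sums))


# Placeholder values, not taken from a real package.
sources = ["foo-1.0.tar.gz", "fix-build.patch"]
md5sums = [
    "d41d8cd98f00b204e9800998ecf8427e",
    "9e107d9d372bb6826bd81d3542a419d6",
]

pairs = pair_sources(sources, md5sums)

# Reordering only one array still "works", but each checksum is now
# attached to the wrong file and nothing reports it.
swapped = pair_sources(list(reversed(sources)), md5sums)
```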
Another problem with sources and checksums concerns binary
sources. A binary source is typically different for each supported
architecture. There is no way of specifying this easily in the PKGBUILD
format. A common workaround is to assign the sources and checksums inside a
conditional that tests the target architecture.
With this workaround, the checksum generation feature of makepkg no longer
works properly, and in order to parse this metadata, an interpreter is now
required, not just a parser.
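To see why a parser alone is not enough, consider the following Python sketch. The PKGBUILD fragment embedded in it is hypothetical (the URLs and checksums are placeholders), and `array_assignments` is a helper invented for this example. A static scan finds two competing values for source and has no way of choosing between them without evaluating the test on `$CARCH`.

```python
import re
import shlex

# Hypothetical PKGBUILD fragment; URLs and checksums are placeholders.
CONDITIONAL = '''\
if [ "$CARCH" = "x86_64" ]; then
  source=("http://example.com/foo-1.0-x86_64.tar.gz")
  md5sums=('d41d8cd98f00b204e9800998ecf8427e')
else
  source=("http://example.com/foo-1.0-i686.tar.gz")
  md5sums=('9e107d9d372bb6826bd81d3542a419d6')
fi
'''

# Array assignments at any indentation level.
ARRAY = re.compile(r"^\s*(\w+)=\(([^)]*)\)", re.MULTILINE)


def array_assignments(text, name):
    """Return every value assigned to the array `name`, in file order."""
    return [
        shlex.split(body)
        for key, body in ARRAY.findall(text)
        if key == name
    ]


# Two candidates come back; picking one requires interpreting the `if`.
candidates = array_assignments(CONDITIONAL, "source")
```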
An Archlinux user started defining an alternate PKGBUILD specification. It addresses some problems with the shell format, such as extendability (to a degree) and ease of parsing. Unfortunately, it is a completely new data format, and thus requires a parser of its own.
Lately I’ve been toying with the idea of creating a universal package
specification. The idea of this specification is to provide a portable
way of defining package metadata, while keeping it simple and extendable.
Ideally any package manager would be able to use this format and have enough
metadata to do what it needs to. It is extendable with an extensions
field, which allows package managers to get any data they require that is
not already included in the specification. If there is enough demand for an
extension, it should be added to the next revision of the specification.
A common data serialization format is YAML. It is simple, easy to parse, and
very versatile. For these reasons it was my first choice for the
specification. There are already many different parsers in many different
languages. Thus, the format should be easily accessible in most languages.
Unfortunately it does not seem that bash is one of them, so parts of
makepkg would have to be rewritten.
Most of the fields are analogous to those in the current PKGBUILD format.
The major difference is with architectures. The keys of this hash table
denote supported architectures. The any architecture has two sources, each
with a URI and checksums defined. A conditional is no longer required: the
appropriate architecture is simply retrieved and its sources are used. The
source URLs and checksums now have a one-to-one mapping, reducing human
error. There is still one problem with this, however. If the sources do not
differ for multiple architectures, there will be duplication. To rectify this
to some degree, YAML anchors can be used.
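As a rough illustration of how a package manager might consume such data, here is a Python sketch that works on the already-parsed structure. Everything except the `architectures` and `uri` names is an assumption: the `md5` key, the fallback to `any`, and the placeholder values are invented for this example. Two architectures sharing one list stand in for what an anchor and its alias would express in the YAML source.

```python
# Hypothetical parsed form of the specification. One list is shared by two
# architectures, much as an anchor and alias would share one YAML node.
shared_sources = [
    {
        "uri": "http://example.com/foo-1.0.tar.gz",
        "md5": "d41d8cd98f00b204e9800998ecf8427e",  # placeholder checksum
    },
]

package = {
    "architectures": {
        "i686": shared_sources,
        "x86_64": shared_sources,
    },
}


def sources_for(package, arch):
    """Return the sources for `arch`, falling back to `any` (an assumption)."""
    architectures = package["architectures"]
    if arch in architectures:
        return architectures[arch]
    if "any" in architectures:
        return architectures["any"]
    raise KeyError(f"unsupported architecture: {arch}")


x86_64_sources = sources_for(package, "x86_64")
```

Each source carries its own checksum, so there is no second array to keep in step, and no conditional has to be evaluated to find the right entry.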
In some cases it might even be useful to exploit hash merges to specify a common subset of sources and add architecture-specific ones.
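The effect of such a merge can be approximated in Python with dictionary unpacking. This sketch assumes sources are stored as a mapping from URI to checksum, since merging applies to mappings rather than to lists; that layout, like the URLs and checksums, is an assumption made for illustration.

```python
# Sources common to every architecture (placeholder values).
common = {
    "http://example.com/foo-1.0.tar.gz": "d41d8cd98f00b204e9800998ecf8427e",
}

# Extra source needed on one architecture only (placeholder values).
x86_64_only = {
    "http://example.com/foo-1.0-x86_64.patch": "9e107d9d372bb6826bd81d3542a419d6",
}

architectures = {
    "i686": {**common},
    # Common subset first, architecture-specific entries layered on top.
    "x86_64": {**common, **x86_64_only},
}
```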
Another form of data re-use was common in PKGBUILDs: using variables to reference the package name and version in sources, for example a source URL containing $pkgname-$pkgver.
The uri values in the new format use a similar syntax. This is not something that YAML supports; it would have to be parsed separately. I have not decided on this format yet, and perhaps YAML does indeed have an appropriate feature. For now these values can either be hardcoded or parsed on a second pass over the data.
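A second pass of this kind could look like the following Python sketch. It assumes shell-like `$name` and `${name}` placeholders, which happen to be what `string.Template` from the standard library understands; the field names `name` and `version` and the URL are hypothetical, since the final syntax is undecided.

```python
from string import Template


def expand(node, fields):
    """Recursively substitute $name / ${name} placeholders in every string."""
    if isinstance(node, str):
        # safe_substitute leaves unknown placeholders untouched.
        return Template(node).safe_substitute(fields)
    if isinstance(node, dict):
        return {key: expand(value, fields) for key, value in node.items()}
    if isinstance(node, list):
        return [expand(item, fields) for item in node]
    return node


# Hypothetical data as it might look after the first (YAML) pass.
package = {
    "name": "foo",
    "version": "1.0",
    "architectures": {
        "any": [{"uri": "http://example.com/$name-$version.tar.gz"}],
    },
}

# Second pass: only top-level string fields are offered as substitutions.
fields = {key: value for key, value in package.items() if isinstance(value, str)}
resolved = expand(package, fields)
```

Keeping the substitution separate from the YAML parsing means any YAML library can be used unchanged, at the cost of defining the placeholder rules in the specification itself.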
The extensions field is the interesting part. If the specification doesn’t
have some required data, such as the options field in the PKGBUILD
specification, it can be added here.
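From a package manager's point of view, reading an extension might be as simple as the Python sketch below. The helper `extension` and the sample values are hypothetical; the point is only that a tool asks for the keys it knows about and ignores the rest.

```python
def extension(package, name, default=None):
    """Look up an optional extension, tolerating a missing extensions field."""
    extensions = package.get("extensions") or {}
    return extensions.get(name, default)


# Hypothetical parsed package carrying makepkg-style options as an extension.
package = {
    "name": "foo",
    "extensions": {"options": ["!strip", "docs"]},
}

options = extension(package, "options", default=[])
unknown = extension(package, "something-else", default=[])
```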