How to Accelerate Docker Builds and Optimize Caching With "COPY --link"

Publish date: 2024-07-16

COPY --link is a new BuildKit feature which could substantially accelerate your Docker image builds. It works by copying files into independent image layers that don't rely on the presence of their predecessors. You can add new content to images without the base image even existing on your system.

This capability was added as part of Buildx v0.8 in March 2022. It's included in version 20.10.14 of the Docker CLI so you should already have access if you're running the latest release.

In this article, we'll show what --link does and explain how it works. We'll also look at some of the situations in which it shouldn't be used.

--link is a new optional argument for the existing Dockerfile COPY instruction. It changes the way copies work by creating a new snapshot layer each time you use it.

Regular COPY statements add files to the layer that precedes them in the Dockerfile. The contents of that layer need to exist on your disk so the new content can be merged in:

FROM alpine  

COPY my-file /my-file

COPY another-file /another-file

The Dockerfile above copies my-file into the layer produced by the previous command. After the FROM instruction, the image consists of Alpine's content:

bin/  

dev/

etc/

...

The first COPY instruction produces an image that includes everything from Alpine, as well as the my-file file:

my-file  

bin/

dev/

etc/

...

And the second COPY instruction adds another-file on top of this image:

another-file  

my-file

bin/

dev/

etc/

...

The layer produced by each instruction includes everything that came before it, as well as anything newly added. At the end of the build, Docker uses a diffing process to work out the changes within each layer. The final image blob contains just the files that were added in each snapshot stage but this isn't reflected in the assembly process during the build.

"--link" modifies COPY to create a new standalone filesystem each time it's used. Instead of copying the new files on top of the previous layer, they're sent to a completely different location to become an independent layer. Layers are subsequently linked together to produce the final image.

Let's change the example Dockerfile to use --link:

FROM alpine  

COPY --link my-file /my-file

COPY --link another-file /another-file

The result of the FROM instruction is unchanged - it yields the Alpine layer, with all that image's content:

bin/  

dev/

etc/

...

The first COPY instruction has a noticeably different effect. This time another independent layer is created. It's a new filesystem containing only my-file:

my-file

Then the second COPY instruction creates another new snapshot with only another-file:

another-file

When the build completes, Docker stores these independent snapshots as new layer archives (tarballs). The tarballs are linked back into the chain of preceding layers, building up the final image. This consists of all three snapshots merged together, resulting in a filesystem that matches the original one when containers are created:

my-file  

another-file

bin/

dev/

etc/

...

This image from the BuildKit project illustrates the differences between the two approaches.
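The two assembly strategies described above can be imitated with ordinary directories. The sketch below is only a toy model: it assumes each layer can be represented by a directory, and the names base, step1, step2, link1, link2, and merged are made up for illustration. It is not how BuildKit actually stores snapshots.

```shell
#!/bin/sh
# Toy model of the two assembly strategies using plain directories.
# Directory names are hypothetical; BuildKit's snapshot storage differs.
work=$(mktemp -d)
cd "$work" || exit 1

# Stand-ins for the Alpine base layer and the files in the build context.
mkdir -p base/bin base/dev base/etc
echo one > my-file
echo two > another-file

# Regular COPY: every step starts from a full copy of the previous result.
cp -R base step1 && cp my-file step1/my-file
cp -R step1 step2 && cp another-file step2/another-file

# COPY --link: every step writes into an empty directory of its own.
mkdir link1 link2
cp my-file link1/my-file
cp another-file link2/another-file

# Linking: stack the base and the independent layers only when a
# complete root filesystem is actually needed.
mkdir merged
for layer in base link1 link2; do
    cp -R "$layer"/. merged/
done

# Compare the final filesystems and inspect one independent layer.
regular=$(cd step2 && find . | sort)
linked=$(cd merged && find . | sort)
layer1=$(ls link1)

[ "$regular" = "$linked" ] && echo "final filesystems match"
echo "link1 contains only: $layer1"
```

In this model the regular steps cannot begin until base exists on disk, whereas link1 and link2 are created without ever reading it.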

COPY --link is only available when you're using BuildKit to build your images. Either run your build with docker buildx build or use docker build with the DOCKER_BUILDKIT=1 environment variable set.

You must also opt in to the Dockerfile v1.4 syntax using a comment at the top of your file:

# syntax=docker/dockerfile:1.4  

FROM alpine:latest

COPY --link my-file /my-file

COPY --link another-file /another-file

Now you can build your image with support for linked copies:

DOCKER_BUILDKIT=1 docker build -t my-image:latest .
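Before running the build, it can be worth checking that the installed Docker CLI is at least version 20.10.14, the release mentioned earlier. The following is a minimal sketch rather than an official check: version_ge is a hypothetical helper, and it assumes your sort supports the -V (version sort) option, which GNU sort does.

```shell
#!/bin/sh
# Minimum Docker CLI release that includes COPY --link, per the article.
required="20.10.14"

# version_ge ACTUAL REQUIRED: succeeds when ACTUAL >= REQUIRED.
# Hypothetical helper; assumes `sort -V` (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v docker >/dev/null 2>&1; then
    # `docker --version` prints a line like "Docker version 24.0.7, build afdd53b".
    actual=$(docker --version | awk '{print $3}' | tr -d ',')
    if version_ge "$actual" "$required"; then
        echo "Docker $actual should support COPY --link"
    else
        echo "Docker $actual is older than $required; upgrade before using COPY --link"
    fi
else
    echo "docker not found on PATH"
fi
```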

Images built from Dockerfiles using COPY --link can be used like any other. You can start containers from them with docker run and push them straight to registries. The --link flag only affects how content is added to the image layers during the build.

Why Linked Copies Matter

Using the --link flag allows build caches to be reused even when the content you COPY in changes. In addition, builds may be able to complete without their base image even existing on your machine.

Returning to the example from above, standard COPY behavior requires the alpine image to exist on your Docker host before the new content can be added. The image will be downloaded automatically during the build if you've not previously pulled it.

With linked copies, Docker doesn't need the alpine image's content. It pulls the alpine manifest, creates new independent layers for the copied files, then creates a revised manifest that links the layers into those provided by alpine. The content of the alpine image - its layer blobs - will only be downloaded if you start a container from your new image or export it to a tar archive. When you push the image to a registry, that registry will store its new layers and remotely acquire the alpine ones.

This functionality facilitates efficient image rebases too. Perhaps you're currently distributing a Docker image using the latest Ubuntu 20.04 LTS release:

FROM golang AS build  

...

RUN go build -o /app .

FROM ubuntu:20.04

COPY --link --from=build /app /bin/app

ENTRYPOINT ["/bin/app"]

You can build the image with caching enabled using BuildKit's --cache-to flag. The inline cache stores build cache data inside the output image, where it can be reused in subsequent builds:

docker buildx build --cache-to type=inline -t example-image:20.04 .

Now let's say you'd like to provide an image that's based on the next LTS release, Ubuntu 22.04:

FROM golang AS build  

...

RUN go build -o /app .

FROM ubuntu:22.04

COPY --link --from=build /app /bin/app

ENTRYPOINT ["/bin/app"]

Rebuild the image using the cache data embedded in the original version:

docker buildx build --cache-from example-image:20.04 -t example-image:22.04 .

The build will complete almost instantly. Using the cached data from the existing image, Docker can verify the files needed to build /app haven't changed. This means the cache for the independent layer created by the COPY instruction remains valid. As this layer doesn't depend on any other, the ubuntu:22.04 image won't be pulled either. Docker merely links the snapshot layer containing /bin/app into a new manifest within the ubuntu:22.04 layer chain. The snapshot layer is effectively "rebased" onto a new parent image, without any filesystem operations occurring.
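One way to see why the cache survives the rebase is to think of each layer as having a cache key. The sketch below is a deliberately simplified model, not BuildKit's real key computation: the key function and the app_content string are hypothetical, and cksum stands in for a proper content hash.

```shell
#!/bin/sh
# Toy cache keys; BuildKit's real cache key computation is more involved.
# key STRING: prints a checksum of STRING (hypothetical helper).
key() {
    printf '%s' "$1" | cksum | cut -d ' ' -f 1
}

# Hypothetical stand-in for the compiled binary produced by the build stage.
app_content="contents of /app"

# Regular COPY: the layer is written on top of its parent,
# so the parent image is part of the key.
regular_2004=$(key "ubuntu:20.04|$app_content")
regular_2204=$(key "ubuntu:22.04|$app_content")

# COPY --link: the layer is built on an empty filesystem,
# so only the copied content contributes to the key.
linked_2004=$(key "$app_content")
linked_2204=$(key "$app_content")

if [ "$regular_2004" = "$regular_2204" ]; then
    echo "regular: cache hit"
else
    echo "regular: cache miss"
fi

if [ "$linked_2004" = "$linked_2204" ]; then
    echo "linked: cache hit"
else
    echo "linked: cache miss"
fi
```

In this model, swapping the base image changes the key of the regular layer but leaves the key of the linked layer untouched, which mirrors the behavior described above.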

The model also optimizes multi-stage builds where changes can occur between any of the stages:

FROM golang AS build  

RUN go build -o /app .

FROM config-builder AS config

RUN generate-config --out /config.yaml

FROM ubuntu:latest

COPY --link --from=config /config.yaml build.conf

COPY --link --from=build /app /bin/app

Without --link, any change to the generated config.yaml causes ubuntu:latest to be pulled and the file to be copied in. The copy of the binary then has to be repeated as well, because the layer it's written on top of has changed. With linked copies, a change to config.yaml allows the build to continue without pulling ubuntu:latest or repeating the copy of the binary. The snapshot layer with build.conf inside is simply replaced by a new version that's independent of all the other layers.

When Not To Use It

There are some situations where the --link flag won't work correctly. Because it copies files into a new layer, instead of adding them on top of the previous one, you can't use ambiguous references as your destination path:

COPY --link my-file /data

With a regular COPY instruction, my-file will be copied to /data/my-file if /data already exists as a directory in the image. With --link, the target layer's filesystem will always be empty, so my-file gets written to /data.
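The difference can be modeled with a small shell function. This is a rough sketch of the rule described above, not Docker's implementation: resolve_dest and the directory names are hypothetical. The trailing-slash case is not covered in this article; it comes from Docker's COPY reference, which treats a destination ending in / as a directory.

```shell
#!/bin/sh
# resolve_dest ROOTFS SRC DEST: prints where SRC would land inside ROOTFS.
# Simplified rule: a trailing slash or an existing directory makes DEST a
# directory; otherwise DEST is taken as the path of the file itself.
resolve_dest() {
    rootfs=$1
    src=$2
    dest=$3
    case "$dest" in
        */) echo "${dest}$(basename "$src")"; return ;;
    esac
    if [ -d "${rootfs}${dest}" ]; then
        echo "${dest}/$(basename "$src")"
    else
        echo "$dest"
    fi
}

work=$(mktemp -d)
mkdir -p "$work/previous/data"   # earlier layers already contain /data
mkdir -p "$work/empty"           # what a linked copy sees: an empty filesystem

regular=$(resolve_dest "$work/previous" my-file /data)
linked=$(resolve_dest "$work/empty" my-file /data)
explicit=$(resolve_dest "$work/empty" my-file /data/)

echo "regular:  $regular"    # prints /data/my-file
echo "linked:   $linked"     # prints /data
echo "explicit: $explicit"   # prints /data/my-file
```

Writing the destination explicitly, either as /data/ or as /data/my-file, removes the dependence on what earlier layers contain.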

The same consideration applies to symlink resolution. Standard COPY automatically resolves destination paths that are symlinks in the image. When you're using --link, this behavior isn't supported as the symlink won't exist in the copy's independent layer.

It's recommended you start using --link wherever these limitations don't apply. Adopting this feature will speed up your builds and make caching more powerful. If you can't immediately remove ambiguous or symlinked destination paths, you can keep using the existing COPY instruction. These backwards-incompatible changes are the reason --link is an optional flag instead of the new default.

Summary

BuildKit's COPY --link is a new Dockerfile feature which can make builds quicker and more efficient. Images using linked copies don't need to pull previous layers just so files can be copied into them. Docker creates a new independent layer for each COPY instead, then links those layers back into the chain.

You can start using linked copies now if you're building images with BuildKit and the latest version of the Buildx or Docker CLI. Adopting --link is a new best practice for Docker builds, provided you're not affected by the changes to destination path resolution that it necessitates.
