Linkerd production runbook
Going to production
In this section, we cover our basic recommendations for taking Linkerd to production, including preparing your environment and configuring Linkerd appropriately.
Preparing your environment

Before you can deploy Linkerd to production, you need to configure your Kubernetes environment. The good news is that much of this preparation can be performed automatically.
Run the automated environment verification

Linkerd automates as much as possible. This includes verifying the pre-installation and post-installation environments. The linkerd check command will automatically validate most aspects of the cluster against Linkerd’s requirements, including operating system, Kubernetes deployment, and network requirements.
To run the automated verification, follow these steps:
- Validate that you have the linkerd CLI binary installed locally and that its version matches the version of Linkerd you want to install, by running linkerd version and checking the “Client version” section.
- Ensure that you are able to specify the correct cluster for the linkerd CLI. By default, Linkerd’s CLI follows the same rules as kubectl for deciding which cluster to talk to, including with respect to the $KUBECONFIG environment variable, the .kube/config file in your home directory, and the current context. (Note that these can be overridden with the --context and --kubeconfig flags.)
- Run linkerd check --pre.
- Correct any issues in your environment highlighted in the output. Each failing check should contain a URL pointing to a more detailed explanation with possible solutions.
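For example, a minimal pre-flight sequence looks like this (the kubectl context check is just one way to confirm which cluster the CLI will target):

```bash
# Confirm the local CLI version matches the Linkerd release you plan to install.
linkerd version --client

# Confirm the CLI is pointed at the intended cluster (same rules as kubectl).
kubectl config current-context

# Run the pre-installation checks and fix anything that fails.
linkerd check --pre
```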
Once linkerd check --pre is passing, you can start to plan the control plane installation using the information in the following sections.
Validate certain exceptional conditions

There are some conditions that may prevent Linkerd from operating properly that linkerd check --pre cannot currently validate. The most common such conditions are those that prevent the Kubernetes API server from contacting Linkerd’s tap and proxy-injector control plane components.
Environments that can exhibit these conditions include:
- Private GKE or AKS clusters
- EKS clusters with custom CNI plugins
If you are in one of these environments, you may wish to do a “dry run” installation first to flush out any issues. Remediations include updating internal firewall rules to allow certain ports (see the example in the Cluster Configuration documentation), or, for EKS clusters, switching to the AWS CNI plugin.
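As a hedged sketch for a private GKE cluster, the fix is usually to extend the automatically created master-to-node firewall rule so the API server can reach Linkerd’s webhook ports. The rule name and port list below are placeholders; confirm the exact ports for your Linkerd version against the Cluster Configuration documentation.

```bash
# Find the master-to-node firewall rule GKE created for this cluster.
gcloud compute firewall-rules list --filter="name~gke-.*-master"

# Allow the API server to reach the Linkerd webhook ports (placeholder port list).
gcloud compute firewall-rules update <master-to-node-rule-name> \
  --allow tcp:10250,tcp:443,tcp:8443
```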
Mirror the Linkerd images

Production deployments should not pull from Linkerd’s open source image repositories. These repositories, while hosted by GitHub, are occasionally unreachable, and your production deployments should not have a runtime dependency on GitHub.
Many cloud providers offer private image registries. Determine where these images will be hosted and mirror them locally.
Linkerd Enterprise customers will be provided with specific instructions for how to access and mirror enterprise images.
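A minimal mirroring sketch, assuming a Docker-compatible toolchain and a private registry at registry.example.com (both placeholders); derive the authoritative image list and tag from the chart or release notes for the version you are installing:

```bash
REGISTRY=registry.example.com/linkerd   # your private registry (placeholder)
TAG=stable-2.x.x                        # the release you are installing (placeholder)

# Pull each image from Linkerd's public registry, retag it, and push it to the mirror.
for img in proxy proxy-init controller policy-controller; do
  docker pull "cr.l5d.io/linkerd/${img}:${TAG}"
  docker tag  "cr.l5d.io/linkerd/${img}:${TAG}" "${REGISTRY}/${img}:${TAG}"
  docker push "${REGISTRY}/${img}:${TAG}"
done
# Then point your installation (e.g. the Helm image values) at ${REGISTRY}.
```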
Ensure sufficient system resources for Linkerd

Production users should use the high availability, or “HA”, installation path. We’ll get into all the details about HA mode later, in Configuring Linkerd for Production Use. In this section, we’ll describe our recommendations for resource allocation to Linkerd installed in HA mode. These values represent best practices, but you may need to tune them based on the specifics of your traffic and workload.
Control plane resources
The control plane of an HA Linkerd installation requires three separate nodes for replication. As a rule of thumb, we suggest planning for 512 MB of memory and 0.5 CPU cores of consumption on each such node.
Of all control plane components, the linkerd-destination component, which handles service discovery information, is the one whose memory consumption scales with the size of the mesh. Monitor this component's consumption and adjust its resource requests and limits as appropriate.
If you are running the linkerd-viz plugin, the bundled Prometheus instance is worth some extra attention: it stores metrics from all proxies, and thus can have highly variable resource requirements. As a rough starting point, expect 5 MB of memory per meshed pod, but this can vary widely based on traffic patterns. We suggest planning for a 512 MB minimum for this instance, taking an empirical approach, and monitoring carefully.
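A simple way to gather that empirical data is to watch actual consumption as the mesh grows (this assumes metrics-server is installed so kubectl top works):

```bash
# Spot-check control plane resource usage, including linkerd-destination.
kubectl -n linkerd top pod

# Spot-check the linkerd-viz components, including the bundled Prometheus.
kubectl -n linkerd-viz top pod
```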
Data plane resources
Generally speaking, for Linkerd’s data plane proxies, resource requirements are a function of network throughput. Our conservative rule of thumb is that, for each 1,000 RPS of traffic expected to an individual proxy (i.e. to a single pod), you should ensure that you have 0.25 CPU cores and 20 MB of memory. Very high throughput pods (>5k RPS per individual pod), such as ingress controller pods, may require setting custom proxy limits/requests (e.g. via the --proxy-cpu-* configuration flags).
In practice, proxy resource consumption is affected by the nature of the traffic, including payload sizes, level of concurrency, protocol, etc. Our guidance here, again, is to take an empirical approach and monitor carefully.
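Proxy resources can also be tuned per workload with Linkerd’s proxy configuration annotations, set on the pod template so the injector picks them up. A hedged example, with an illustrative deployment name and values:

```bash
# Give a hypothetical high-throughput ingress deployment larger proxy resources.
kubectl -n ingress patch deployment ingress-nginx --type merge -p '{
  "spec": {"template": {"metadata": {"annotations": {
    "config.linkerd.io/proxy-cpu-request": "1",
    "config.linkerd.io/proxy-cpu-limit": "2",
    "config.linkerd.io/proxy-memory-request": "128Mi",
    "config.linkerd.io/proxy-memory-limit": "256Mi"
  }}}}
}'
```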
Restrict NET_ADMIN permissions if necessary

By default, the act of creating pods with the Linkerd data plane proxies injected requires NET_ADMIN capabilities. This is because, at pod injection time, Linkerd uses a Kubernetes InitContainer to automatically reroute all traffic to the pod through the pod’s Linkerd data plane proxy. This rerouting is done via iptables, which requires the NET_ADMIN capability.
In some environments this is undesirable, as NET_ADMIN privileges grant access to all network traffic on the host. As an alternative, Linkerd provides a CNI plugin which allows Linkerd to run iptables commands within the CNI chain (which already has elevated privileges) rather than in InitContainers.
The Kubernetes CNI system is very fragile, and operations such as dynamically changing the size of the cluster become significantly more complex when Linkerd’s CNI plugin is used. We recommend avoiding the Linkerd CNI plugin in favor of the default init-container approach unless absolutely necessary. See the CNI documentation for more.
Ensure that time is synchronized across nodes in the cluster

Clock drift is surprisingly common in cloud environments. When the nodes in a cluster have different clock times, timestamps won’t line up across log lines, metrics, and other output. Clock skew can also break Linkerd’s ability to validate mutual TLS certificates, which can cause a critical outage of Linkerd’s ability to service requests.
Ensuring clock synchronization in networked computers is outside the scope of this document, but we will note that cloud providers often provide their own dedicated services on top of NTP to help with this problem, e.g. AWS’s Amazon Time Sync Service. A production environment should use those services. Similarly, if you are running on a private cloud, we suggest ensuring that all the servers use the same NTP source.
Ensure the linkerd namespace is not eligible for auto-injection

Linkerd’s control plane, which runs in the linkerd namespace, provides its own data plane proxies and should not be eligible for proxy auto-injection. In Helm and CLI installations, we automatically annotate this namespace with the linkerd.io/inject: disabled annotation. However, if you create this namespace yourself, you may need to set that annotation explicitly.
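If you do create the namespace yourself, a quick way to set this up (using the annotation named above) is:

```bash
# Create the control plane namespace and mark it ineligible for auto-injection.
kubectl create namespace linkerd
kubectl annotate namespace linkerd linkerd.io/inject=disabled
```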
Configuring Linkerd for production use

Choose your deployment tool

Open source Linkerd supports two basic installation flows: a Helm chart or the CLI. For a production deployment, we recommend using Helm charts, which better allow for a repeatable, automated approach. See the Helm docs and the CLI installation docs for more details.
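As a sketch, a Helm-based install for recent Linkerd versions (2.12 and later split the install into a CRDs chart and a control plane chart; adjust for your version, and see the mTLS section below for where the certificate files come from) looks roughly like:

```bash
helm repo add linkerd https://helm.linkerd.io/stable
helm repo update

# Install the CRDs, then the control plane, supplying your own mTLS certificates.
helm install linkerd-crds linkerd/linkerd-crds -n linkerd --create-namespace
helm install linkerd-control-plane linkerd/linkerd-control-plane -n linkerd \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key
```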
Decide on your metrics pipeline

Every data plane proxy provides a wealth of metrics data for the traffic it has seen, exposing it in a Prometheus-compatible format on a dedicated port. How you consume that data is an important choice.
Generally speaking, you have several options for aggregating and reporting these metrics. (Note that these options can be combined.)
- Ignore it. Linkerd’s deep telemetry and monitoring may simply be unimportant for your use case.
- Aggregate it on-cluster with the linkerd-viz extension. This extension contains an on-cluster Prometheus instance configured to scrape all proxy metrics, as well as a set of CLI commands and an on-cluster dashboard that expose those metrics for short-term use. By default, this Prometheus instance holds only 6 hours of data and will lose metrics data if restarted. (If you are relying on this data for operational reasons, this may be insufficient.)
- Aggregate it off-cluster with Prometheus. This can be done either by Prometheus federation from the linkerd-viz on-cluster Prometheus (see the sketch after this list), or simply by using an off-cluster Prometheus to scrape the proxies directly, aka “Bring your own Prometheus”.
- Use a third-party metrics provider such as Buoyant Cloud. In this option, the metrics pipeline is offloaded entirely. Buoyant Cloud will work out of the box for Linkerd data; other metrics providers should also be able to easily consume data from Prometheus or from the proxies.
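If you take the off-cluster federation route, a hedged sketch of the scrape configuration (job names follow the Linkerd docs; the target address is a placeholder for the linkerd-viz Prometheus service) looks like:

```bash
# Append a federation job to an off-cluster Prometheus configuration (sketch only).
cat <<'EOF' >> prometheus.yml
scrape_configs:
  - job_name: 'linkerd-federate'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="linkerd-proxy"}'
        - '{job="linkerd-controller"}'
    static_configs:
      - targets: ['prometheus.linkerd-viz.svc.cluster.local:9090']
EOF
```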

Enable HA mode

Linkerd Enterprise adopters should follow the provided installation instructions with the production profile. This will install Linkerd in a highly-available configuration along with lifecycle automation for automated upgrades and installs.
Production-grade open source deployments of Linkerd should use the high availability, or HA, configuration. This mode enables several production-grade behaviors of Linkerd, including:
- Running three replicas of critical control plane components.
- Setting production-ready CPU and memory resource requests on control plane components.
- Setting production-ready CPU and memory resource requests on data plane proxies.
- Requiring that the proxy auto-injector be functional for any pods to be scheduled.
- Setting anti-affinity policies on critical control plane components so that they are scheduled on separate nodes and in separate zones.
For open source deployments, this mode can be enabled by using the values-ha.yaml Helm file. See the HA docs for more.
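Concretely, HA mode can be enabled either with the CLI flag or by layering the chart’s values-ha.yaml into a Helm install. A hedged sketch of both routes (certificate flags as in the Helm example above):

```bash
# CLI route: render an HA control plane and apply it.
# (On recent versions, run `linkerd install --crds | kubectl apply -f -` first.)
linkerd install --ha | kubectl apply -f -

# Helm route: fetch the chart to obtain values-ha.yaml, then install with it.
helm fetch --untar linkerd/linkerd-control-plane
helm install linkerd-control-plane linkerd/linkerd-control-plane -n linkerd \
  -f linkerd-control-plane/values-ha.yaml \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key
```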
Set up your mTLS certificate rotation and alerting

For mutual TLS, Linkerd uses three basic sets of certificates: the trust anchor, which is shared across clusters; the issuer certificate, which is set per cluster; and the proxy certificates, which are issued per pod. Each certificate also has a corresponding private key. (See our Kubernetes engineer’s guide to mTLS for more on this topic.)
Proxy certificates are automatically rotated every 24 hours without any intervention on your part. However, Linkerd does not rotate the trust anchor or the issuer certificate. Additionally, multi-cluster communication requires that the trust anchor be the same across clusters.
By default, the Linkerd CLI will generate one-year self-signed certificates for both the trust anchor and the issuer certificate, and will discard the trust anchor key. This allows for an easy installation, but is likely not the configuration you want in production.
If either the trust anchor or the issuer certificate expires without rotation, no Linkerd proxy will be able to communicate with another Linkerd proxy. This can be catastrophic. Thus, for production environments, we recommend the following approach:
- Create a trust anchor with a 10-year expiration period, and store the key in a safe location. (Trust anchor rotation is possible, but is very complex, and in our opinion best avoided unless you have very specific requirements.)
- Set up automatic rotation for issuer certificates with cert-manager (see the sketch after this list).
- Set up certificate monitoring with Buoyant Cloud. Buoyant Cloud will automatically alert you if your certificates are close to expiring.
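As a hedged sketch of the first two steps, assuming the step CLI for certificate generation and a cert-manager Issuer named linkerd-trust-anchor backed by the trust anchor (names and durations are illustrative; follow the mTLS documentation for the full procedure):

```bash
# Create a 10-year trust anchor; store ca.key somewhere safe and offline.
step certificate create root.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca --no-password --insecure --not-after 87600h

# Let cert-manager rotate the issuer certificate automatically.
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-identity-issuer
  namespace: linkerd
spec:
  secretName: linkerd-identity-issuer
  duration: 48h
  renewBefore: 25h
  issuerRef:
    name: linkerd-trust-anchor   # a cert-manager Issuer backed by the trust anchor
    kind: Issuer
  commonName: identity.linkerd.cluster.local
  isCA: true
  privateKey:
    algorithm: ECDSA
  usages:
    - cert sign
    - crl sign
    - server auth
    - client auth
EOF
```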
Certificate management can be subtle. Be sure to read through Linkerd’s full mTLS documentation, including the sections on rotating certificates.
Set up automatic rotation of webhook credentials

Linkerd relies on a set of TLS credentials for webhooks. (These credentials are independent of the ones used for mTLS.) These credentials are used when Kubernetes calls the webhook endpoints of Linkerd’s control plane, which are secured with TLS.
As above, by default these credentials expire after one year, at which point Linkerd becomes inoperable. To avoid last-minute scrambles, we recommend using cert-manager to automatically rotate the webhook TLS credentials for each cluster. See the Webhook TLS documentation.
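A hedged sketch for one of the webhooks, assuming cert-manager and an Issuer you create for webhook certificates (the secret and DNS names follow recent versions of the Webhook TLS documentation, but confirm them for your version, and repeat for the other webhooks):

```bash
kubectl apply -f - <<'EOF'
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: linkerd-proxy-injector
  namespace: linkerd
spec:
  secretName: linkerd-proxy-injector-k8s-tls
  duration: 24h
  renewBefore: 1h
  issuerRef:
    name: webhook-issuer   # a cert-manager Issuer you create for webhook certs
    kind: Issuer
  commonName: linkerd-proxy-injector.linkerd.svc
  dnsNames:
    - linkerd-proxy-injector.linkerd.svc
  privateKey:
    algorithm: ECDSA
  usages:
    - server auth
EOF
```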
Configure handling of certain protocols

Linkerd proxies perform protocol detection to automatically identify the protocol used by applications. However, there are some protocols which Linkerd cannot automatically detect, for example, because the client does not send the initial bytes of the connection. If any of these protocols are in use, you need to configure Linkerd to handle them. Otherwise, a connection that cannot be properly handled by protocol detection will incur a 10-second connect timeout, then be proxied as raw TCP.
The exact configuration necessary depends not just on the protocol but on whether the default ports are used, and whether the destination service is in the cluster or outside the cluster. For example, any of the following protocols, if the destination is outside the cluster, or if the destination is inside the cluster but on a non-standard port, will require configuration:
- SMTP
- MySQL
- PostgreSQL
- Redis
- ElasticSearch
- Memcache
See the Protocol Detection documentation for guidance on how to configure this, if necessary.
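For example, for an in-cluster database on a non-standard port, you can mark the port as opaque so Linkerd skips protocol detection and proxies it as raw TCP; for off-cluster destinations, the proxy can be bypassed for that port entirely. The namespace and port numbers below are illustrative:

```bash
# In-cluster MySQL on a non-standard port: skip protocol detection, proxy as TCP.
kubectl annotate namespace mysql config.linkerd.io/opaque-ports=4406

# Off-cluster destination: skip the outbound proxy for that port on the client
# workload by setting this annotation on its pod template:
#   config.linkerd.io/skip-outbound-ports: "3306"
```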
Secure your tap command if necessary

Linkerd’s tap command (part of the linkerd-viz extension) allows users to view live request metadata, including request and response gRPC and HTTP headers (but not bodies). In some organizations, this data may contain sensitive information that should not be available to operators.
If this applies to you, follow the instructions in the Securing linkerd tap documentation to restrict or remove access to tap.
Validate your installation

After installing the control plane, we recommend running linkerd check to ensure everything is set up correctly. Correct any issues in your environment highlighted in the output. Each failing check should contain a URL pointing to a more detailed explanation with possible solutions.
Once linkerd check passes, congratulations! You have successfully installed Linkerd.
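A typical post-install validation pass, assuming the viz extension is installed and at least one workload has been meshed, might look like:

```bash
# Validate the control plane.
linkerd check

# Validate the viz extension (only if installed).
linkerd viz check

# Validate the data plane proxies on meshed workloads.
linkerd check --proxy
```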