Going to production

In this section, we cover the basic recommendations for preparing Linkerd for production use, including preparing your environment and configuring Linkerd appropriately.

Before you can deploy Linkerd to production, you need to configure your Kubernetes environment. The good news is that much of this preparation can be performed automatically.

Linkerd automates as much of this as possible, including verification of both the pre-installation and post-installation environments. The linkerd check command will automatically validate most aspects of the cluster against Linkerd’s requirements, including the operating system, the Kubernetes deployment, and the network.

To run the automated verification, follow these steps:

  • Validate that you have the linkerd CLI binary installed locally and that its version matches the version of Linkerd you want to install, by running linkerd version and checking the “Client version” section.
  • Ensure that you are able to specify the correct cluster for the linkerd CLI. By default, Linkerd’s CLI will follow the same rules for deciding which cluster to talk to as kubectl does, including with respect to the $KUBECONFIG environment variable, the .kube/config file in your home directory, and the current context. (Note that these can be overridden with the --context and --kubeconfig flags.)
  • Run linkerd check --pre. (A condensed version of this flow is sketched after this list.)
  • Correct any issues in your environment highlighted in the output. Each failing check should contain a URL pointing to a more detailed explanation with possible solutions.
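
For example, a condensed version of this flow from a workstation might look like the following sketch; the context name is a placeholder for your own cluster.

```bash
# Confirm the CLI is installed and matches the Linkerd version you plan to install.
linkerd version --client

# Point the CLI at the target cluster (the CLI follows the same rules as kubectl).
kubectl config use-context my-prod-cluster   # placeholder context name

# Validate the cluster against Linkerd's requirements.
linkerd check --pre
```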

Once linkerd check --pre is passing, you can start to plan the control plane installation using the information in the following sections.

There are some conditions that may prevent Linkerd from operating properly that linkerd check --pre cannot currently validate. The most common such conditions are those that prevent the Kubernetes API server from contacting Linkerd’s tap and proxy-injector control plane components.

Environments that can exhibit these conditions include:

  • Private GKE or AKS clusters
  • EKS clusters with custom CNI plugins

If you are in one of these environments, you may wish to do a “dry run” installation first to flush out any issues. Remediations include updating internal firewall rules to allow certain ports (see the example in the Cluster Configuration documentation), or, for EKS clusters, switching to the AWS CNI plugin.
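
As an illustration of the firewall case, on a private GKE cluster the fix is typically a rule that lets the control plane (master) CIDR reach the webhook ports on the nodes. The sketch below uses placeholder names, network, source range, tags, and ports; consult the Cluster Configuration documentation for the exact ports your Linkerd version requires.

```bash
# Allow the GKE control plane to reach Linkerd's webhook ports on the nodes.
# The network, source range, target tags, and ports below are all placeholders.
gcloud compute firewall-rules create linkerd-webhooks \
  --network=my-vpc \
  --source-ranges=172.16.0.0/28 \
  --target-tags=my-gke-node-tag \
  --allow=tcp:8443,tcp:8089
```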

Production deployments should not pull from Linkerd’s open source image repositories. These repositories, while hosted by GitHub, are occasionally unreachable, and your production deployments should not have a runtime dependency on GitHub.

Many cloud providers offer private image registries. Determine where Linkerd’s images will be hosted in your environment and mirror them to that registry.
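
A rough sketch of mirroring with Docker follows; the image list, tag, and private registry name are placeholders, so mirror whichever images your chosen Linkerd version actually ships.

```bash
# Mirror Linkerd images into a private registry.
# The image names, tag, and registry below are placeholders.
TAG="stable-2.x.y"
REGISTRY="registry.example.com/linkerd"
for img in proxy controller policy-controller; do
  docker pull "cr.l5d.io/linkerd/${img}:${TAG}"
  docker tag  "cr.l5d.io/linkerd/${img}:${TAG}" "${REGISTRY}/${img}:${TAG}"
  docker push "${REGISTRY}/${img}:${TAG}"
done
```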

Linkerd Enterprise customers will be provided with specific instructions for how to access and mirror enterprise images.

Production users should use the high availability, or “HA”, installation path. We’ll get into all the details about HA mode later, in Configuring Linkerd for Production Use. In this section, we’ll describe our recommendations for resource allocation to Linkerd installed in HA mode. These values represent best practices, but you may need to tune them based on the specifics of your traffic and workload.

Control plane resources

The control plane of an HA Linkerd installation requires three separate nodes for replication. As a rule of thumb, we suggest planning for 512 MB of memory and 0.5 CPU cores of consumption on each such node.

Of all the control plane components, linkerd-destination, which handles service discovery information, is the one whose memory consumption scales with the size of the mesh. Monitor its consumption and adjust its resource requests and limits as appropriate.

If you are running the linkerd-viz extension, the bundled Prometheus instance deserves extra attention: it stores metrics from every proxy in the mesh, so its resource requirements can vary widely with traffic patterns. As a rough starting point, expect about 5 MB of memory per meshed pod, plan for at least 512 MB for this instance, and take an empirical approach: monitor carefully and adjust.
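
In practice, this means observing actual consumption and adjusting over time. A minimal sketch, assuming a Helm-based installation (covered below); the Helm value keys shown are illustrative and should be checked against your chart version’s values.yaml.

```bash
# Observe actual per-container consumption in the control plane.
kubectl top pod -n linkerd --containers

# Adjust resources via a Helm values override. The key names below are
# illustrative; confirm them against the chart's values.yaml before use.
cat > linkerd-resources.yaml <<'EOF'
destinationResources:
  memory:
    request: 256Mi
    limit: 1Gi
EOF
helm upgrade linkerd-control-plane linkerd/linkerd-control-plane \
  -n linkerd -f linkerd-resources.yaml --reuse-values
```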

Data plane resources

Generally speaking, for Linkerd’s data plane proxies, resource requirements are a function of network throughput. Our conservative rule of thumb is that, for each 1,000 RPS of traffic expected at an individual proxy (i.e. at a single pod), you should ensure that you have 0.25 CPU cores and 20 MB of memory available. Very high throughput pods (>5k RPS per individual pod), such as ingress controller pods, may require custom proxy limits/requests (e.g. via the --proxy-cpu-* configuration flags).

In practice, proxy resource consumption is affected by the nature of the traffic, including payload sizes, level of concurrency, protocol, etc. Our guidance here, again, is to take an empirical approach and monitor carefully.
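
Per-workload overrides can also be applied with Linkerd’s proxy configuration annotations on the pod template. A sketch for a hypothetical high-throughput ingress Deployment; the workload name, namespace, and values are illustrative.

```bash
# Override the proxy's resources for one high-throughput workload.
# The Deployment name, namespace, and resource values are illustrative.
kubectl -n ingress patch deployment ingress-nginx -p '
spec:
  template:
    metadata:
      annotations:
        config.linkerd.io/proxy-cpu-request: "1"
        config.linkerd.io/proxy-cpu-limit: "2"
        config.linkerd.io/proxy-memory-request: "128Mi"
        config.linkerd.io/proxy-memory-limit: "512Mi"
'
```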

By default, creating pods with the Linkerd data plane proxy injected requires the NET_ADMIN capability. This is because, at pod injection time, Linkerd uses a Kubernetes InitContainer to automatically reroute all traffic to the pod through the pod’s Linkerd data plane proxy. This rerouting is done via iptables, which requires the NET_ADMIN capability.

In some environments this is undesirable, as NET_ADMIN privileges grant access to all network traffic on the host. As an alternative, Linkerd provides a CNI plugin which allows Linkerd to run iptables commands within the CNI chain (which already has elevated privileges) rather than in InitContainers.

The Kubernetes CNI system is very fragile, and operations such as dynamically changing the size of the cluster become significantly more complex when Linkerd’s CNI plugin is used. We recommend avoiding the Linkerd CNI plugin in favor of the default init-container approach unless absolutely necessary. See the CNI documentation for more.
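
If you determine that the CNI plugin is truly necessary despite these caveats, the rough flow is to install the plugin before the control plane and then disable the init-container step at install time. A sketch; flags and chart names vary by Linkerd version.

```bash
# Install the Linkerd CNI plugin first (it must run on every node).
linkerd install-cni | kubectl apply -f -

# Then install Linkerd with the init-container step disabled.
# (With Helm, the equivalent is typically a cniEnabled-style value; check
# your chart's values.yaml.)
linkerd install --crds | kubectl apply -f -
linkerd install --linkerd-cni-enabled | kubectl apply -f -
```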

Clock drift is surprisingly common in cloud environments. When the nodes in a cluster disagree about the current time, timestamps won’t line up across log lines, metrics, and so on. Clock skew can also break Linkerd’s ability to validate mutual TLS certificates, which can result in a critical outage where Linkerd is unable to service requests.

Ensuring clock synchronization in networked computers is outside the scope of this document, but we will note that cloud providers often offer dedicated services built on top of NTP to help with this problem, e.g. AWS’s Amazon Time Sync Service. A production environment should use those services. Similarly, if you are running on a private cloud, we suggest ensuring that all servers use the same NTP source.

Linkerd’s control plane, which runs in the linkerd namespace, runs its own data plane proxies and should not be subject to proxy auto-injection. In Helm and CLI installations, we automatically annotate this namespace with the linkerd.io/inject: disabled annotation. However, if you create this namespace yourself, you may need to set that annotation explicitly.
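
If you manage this namespace yourself (for example, via GitOps), a minimal sketch of setting the annotation explicitly:

```bash
# Create the control plane namespace (if it doesn't already exist) and
# exclude it from proxy auto-injection.
kubectl create namespace linkerd --dry-run=client -o yaml | kubectl apply -f -
kubectl annotate namespace linkerd linkerd.io/inject=disabled --overwrite
```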

Open source Linkerd supports two basic installation flows: a Helm chart or the CLI. For a production deployment, we recommend using Helm charts, which better allow for a repeatable, automated approach. See the Helm docs and the CLI installation docs for more details.
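
A minimal sketch of starting the Helm flow follows; chart names and required values vary by Linkerd version, and the control plane chart itself is installed with the HA values file shown later in this section.

```bash
# Add the Linkerd Helm repository and install the CRDs chart.
helm repo add linkerd https://helm.linkerd.io/stable
helm repo update
helm install linkerd-crds linkerd/linkerd-crds -n linkerd --create-namespace
```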

Every data plane proxy provides a wealth of metrics data for the traffic it has seen, by exposing it in a Prometheus-compatible format on a port. How you consume that data is an important choice.

Generally speaking, you have several options for aggregating and reporting these metrics. (Note that these options can be combined.)

  • Ignore it. Linkerd’s deep telemetry and monitoring may simply be unimportant for your use case.
  • Aggregate it on-cluster with the linkerd-viz extension. This extension contains an on-cluster Prometheus instance configured to scrape all proxy metrics, as well as a set of CLI commands and an on-cluster dashboard that expose those metrics for short-term use. By default, this Prometheus instance holds only 6 hours of data and will lose metrics data if restarted. (If you are relying on this data for operational reasons, this may be insufficient.)
  • Aggregate it off-cluster with Prometheus. This can be done either by Prometheus federation from the linkerd-viz on-cluster Prometheus (a federation scrape job is sketched below), or simply by using an off-cluster Prometheus to scrape the proxies directly, aka “Bring your own Prometheus”.
  • Use a third-party metrics provider such as Buoyant Cloud. In this option, the metrics pipeline is offloaded entirely. Buoyant Cloud will work out of the box for Linkerd data; other metrics providers should also be able to easily consume data from Prometheus or from the proxies.
Figure: Buoyant Cloud’s hosted Linkerd metrics (screenshot)
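
For the federation approach, the rough shape is a scrape job on the off-cluster Prometheus that pulls from the linkerd-viz Prometheus’s /federate endpoint. A sketch follows; the job name and match[] expressions are illustrative and may need adjusting for your setup.

```bash
# A scrape job to add under scrape_configs: in the off-cluster Prometheus
# configuration. The job name and match[] expressions are illustrative.
cat > linkerd-federate-job.yml <<'EOF'
- job_name: 'linkerd-federate'
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names: ['linkerd-viz']
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_container_name]
      action: keep
      regex: ^prometheus$
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="linkerd-proxy"}'
      - '{job="linkerd-controller"}'
EOF
```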

Linkerd Enterprise adopters should follow the provided installation instructions with the production profile. This will install Linkerd in a highly available configuration, along with lifecycle automation for upgrades and installs.

Production-grade open source deployments of Linkerd should use the high availability, or HA, configuration. This mode enables several production-oriented behaviors, including:

  • Running three replicas of critical control plane components.
  • Setting production-ready CPU and memory resource requests on control plane components.
  • Setting production-ready CPU and memory resource requests on data plane proxies.
  • Requiring that the proxy auto-injector be functional for any pods to be scheduled.
  • Setting anti-affinity policies on critical control plane components so that they are scheduled on separate nodes and in separate zones.

For open source deployments, this mode can be enabled by using the values-ha.yaml Helm file. See the HA docs for more.
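
A sketch of the Helm install with HA values follows; chart names, value keys, and the location of values-ha.yaml depend on your Linkerd version, and the certificate files are placeholders generated as described in the next section.

```bash
# Pull the chart locally to obtain its bundled values-ha.yaml ...
helm pull --untar linkerd/linkerd-control-plane

# ... then install the control plane with it. The certificate files are
# placeholders; see the next section for how they are generated.
helm install linkerd-control-plane linkerd/linkerd-control-plane -n linkerd \
  -f linkerd-control-plane/values-ha.yaml \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key
```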

For mutual TLS, Linkerd uses three basic sets of certificates: the trust anchor, which is shared across clusters; the issuer certificate, which is set per cluster; and the proxy certificates, which are issued per pod. Each certificate also has a corresponding private key. (See our Kubernetes engineer’s guide to mTLS for more on this topic.)

Proxy certificates are automatically rotated every 24 hours without any intervention on your part. However, Linkerd does not rotate the trust anchor or the issuer certificate. Additionally, multi-cluster communication requires that the trust anchor be the same across clusters.

By default, the Linkerd CLI will generate one-year self-signed certificates for both the trust anchor and the issuer certificate, and will discard the trust anchor key. This allows for an easy installation, but is likely not the configuration you want in production.

If either the trust anchor or the issuer certificate expires without rotation, no Linkerd proxy will be able to communicate with any other Linkerd proxy. This can be catastrophic. Thus, for production environments, we recommend the following approach:

  • Create a trust anchor with a 10-year expiration period and store its key in a safe location; a sketch of generating it follows this list. (Trust anchor rotation is possible, but it is very complex and, in our opinion, best avoided unless you have very specific requirements.)
  • Set up automatic rotation for issuer certificates with cert-manager.
  • Set up certificate monitoring with Buoyant Cloud. Buoyant Cloud will automatically alert you if your certificates are close to expiring.
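
A sketch of generating a long-lived trust anchor, and an issuer certificate signed by it, with the step CLI, following the pattern in Linkerd’s mTLS documentation; filenames and lifetimes are illustrative.

```bash
# Trust anchor: 10-year lifetime. Store ca.key somewhere safe and offline.
step certificate create root.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca --no-password --insecure --not-after 87600h

# Issuer certificate: shorter-lived, signed by the trust anchor; rotate it
# automatically per cluster (e.g. with cert-manager).
step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca --not-after 8760h --no-password --insecure \
  --ca ca.crt --ca-key ca.key
```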

Certificate management can be subtle. Be sure to read through Linkerd’s full mTLS documentation, including the sections on rotating certificates.

Linkerd relies on a set of TLS credentials for webhooks. (These credentials are independent of the ones used for mTLS.) These credentials are used when Kubernetes calls the webhook endpoints of Linkerd’s control plane, which are secured with TLS.

As above, by default these credentials expire after one year, at which point Linkerd becomes inoperable. To avoid last-minute scrambles, we recommend using cert-manager to automatically rotate the webhook TLS credentials for each cluster. See the Webhook TLS documentation.

Linkerd proxies perform protocol detection to automatically identify the protocol used by applications. However, there are some protocols which Linkerd cannot automatically detect, for example, because the client does not send the initial bytes of the connection. If any of these protocols are in use, you need to configure Linkerd to handle them. Otherwise, a connection that cannot be properly handled by protocol detection will incur a 10-second connect timeout, then be proxied as raw TCP.

The exact configuration necessary depends not just on the protocol but on whether the default ports are used, and whether the destination service is in the cluster or outside the cluster. For example, any of the following protocols, if the destination is outside the cluster, or if the destination is inside the cluster but on a non-standard port, will require configuration:

  • SMTP
  • MySQL
  • PostgreSQL
  • Redis
  • Elasticsearch
  • Memcache

See the Protocol Detection documentation for guidance on how to configure this, if necessary.
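
As one common example, an in-cluster database listening on a non-standard port can be marked as opaque so that Linkerd skips protocol detection and proxies the connection as raw TCP; the namespace and port below are illustrative.

```bash
# Mark a non-standard MySQL port as opaque for everything in this namespace.
# The namespace name and port number are illustrative.
kubectl annotate namespace databases config.linkerd.io/opaque-ports=13306 --overwrite

# For destinations outside the cluster, a common alternative is to bypass the
# proxy entirely by annotating the client workload's pod template with:
#   config.linkerd.io/skip-outbound-ports: "3306"
```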

Linkerd’s tap command (part of the linkerd-viz extension) allows users to view live request metadata, including request and response gRPC and HTTP headers (but not bodies). In some organizations, this data may contain sensitive information that should not be available to operators.

If this applies to you, follow the instructions in the Securing linkerd tap documentation to restrict or remove access to tap.

After installing the control plane, we recommend running linkerd check to ensure everything is set up correctly. Correct any issues in your environment highlighted in the output. Each failing check should contain a URL pointing to a more detailed explanation with possible solutions.

Once linkerd check passes, congratulations! You have successfully installed Linkerd.