Version: 6.0.0

Find a Memory Leak

Memory Leaks in Node.js and how to diagnose and solve them using N|Solid.

What is a Memory Leak?

note

A Memory Leak occurs when a piece of software retains memory it will never need again, typically resulting in growth over time that eventually causes the service to degrade or fail.

JavaScript's memory is managed by the Garbage Collector, but it needs help to know what can and cannot be "collected", or freed back to the system as no longer in use. As JS developers, we don't have to manage most memory lifetime issues, but we cannot assume the Garbage Collector always knows what we intend, or that it knows the future. There are many ways to get in our own way--for example, if we only ever append to a list, then as long as the list itself is reachable there is no way for the GC to know which entries will never be needed again. To avoid memory issues such as leaks, we must choose GC-friendly structures and consider how the graph structure of memory determines what can and cannot be accessed again.
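As a minimal sketch of that append-only list problem (the names here are illustrative, not taken from any real application):

// Every call pushes into a module-level array. Because `seenRequests`
// itself stays reachable, nothing pushed into it can ever be collected,
// even though these entries are never read again.
const seenRequests = []

function handleRequest(req) {
  seenRequests.push({ url: req.url, receivedAt: Date.now() })
}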

The JS heap is structured as a graph of containers (graph nodes) and links (graph edges) between them. Each memory container (such as an Object or string) is a node that can be referenced by or contain references to other nodes. A reference can imply either an ownership relationship, such as an object property or list element, or a context relationship, such as a closure referencing another object from an outer scope. This is a simplified and incomplete view, but it reflects the concepts inside the .heapsnapshot files we'll discuss later, which are a serialized representation of the JavaScript Heap.

Simplified this way, you can imagine the Garbage Collector traversing the graph, counting the references to and from each node, and pruning entire branches of the ownership tree IF those branches are well encapsulated from other nodes. If a long-lived branch of memory creates context links to objects that shouldn't be long-lived, those objects might in turn keep other branches that could otherwise be freed from being collected.

At their most benign, memory leaks mean your application is using resources needlessly, and typical over-provisioning may prevent minor leaks from ever becoming an issue. Larger or unattended leaks create cascading issues, and for many teams result in scheduled restarts to avoid crashes. Garbage Collector activity increases with Heap size and graph complexity, so the longer a leak has been active in a process, the more performance GC will steal from your application.

Clearly, memory leaks are to be avoided. Let's look at some simplified leaks and then discuss how to do some analysis.

A quick way to create a simple leak is to naively link data together for convenience in places or in ways it shouldn't be linked.

const cache = new Map()

app.get('/data/:id', async (req, res) => {
  const id = req.params.id

  if (cache.has(id)) {
    res.send(cache.get(id).data)
  } else {
    const result = await queryDatabase(id)

    // This permanently retains the original web request for each cache entry
    // because it is now linked to the cache entry
    cache.set(id, { ...result, request: req })

    res.send(result.data)
  }
})

In this example, a poorly designed results cache inadvertently keeps incoming web requests alive indefinitely, because each cache entry holds a reference to the request that populated it. Many memory leaks result from this sort of activity: intentionally retaining one thing without considering what else that thing prevents from being freed.

Unreachable Data/Code

In JavaScript, it is possible to intentionally enclose functionality inside of a scope--this is one of the ways we can accomplish language features such as private methods or variables. It also provides a difficult-to-debug space to leak memory into, because each scope's private members can reference each other, retaining them indefinitely.

const leakyBucket = {}

function store(key) {
  let entry = leakyBucket[key]
  // this `unused` variable is created in the scope of `store`
  var unused = function () {
    // and it references another variable in the scope of `store`: `entry`
    if (entry) console.log("never called")
  }
  leakyBucket[key] = {
    longStr: new Array(1000000).join("*"),
    someMethod: function () {
      console.log("some message")
    },
  }
}
// `unused` is never called and `entry` will never be needed again, but GC
// has no way of knowing that: the closure scope they share with `someMethod`
// is retained each time `store()` is called.

store.len = () => {
  return Object.keys(leakyBucket).length
}

With the flexibility JavaScript gives us to create closures, this example shows how it is possible to create islands of scope that the Garbage Collector cannot untangle, because of how they relate to each other and how they could potentially be accessed in the future. This is often what people think of when they hear the term memory leak, but it is only one of many examples we run across.

Missing Cleanup

Often a leak is as simple as missing cleanup on an error logic branch.

const cache = new Map()

function start(id, data) {
  console.log(`Work started for ${id}`)
  cache.set(id, data)
}

function finish(id) {
  console.log(`Finished ${id}`)
  cache.delete(id)
}

function error(id) {
  // If this doesn't delete from the cache, it will accumulate (leak) cache
  // entries related to errors forever
  console.error(`Error while doing work on ${id}`)
}

work.forEach(item => {
  start(item.id, item.data)
  try {
    // ... some work
    finish(item.id)
  } catch (e) {
    error(item.id)
  }
})

Generally these are coding errors of omission--the "happy path" is well cared for, but the rare or error pathways lack proper resource cleanup and start to accumulate objects with similar characteristics; in the case above, those that had errors.
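One hedged way to close that gap is to move the cleanup into a finally block so it runs on both the success and error paths; a sketch reworking the loop above:

work.forEach(item => {
  start(item.id, item.data)
  try {
    // ... some work
    finish(item.id)
  } catch (e) {
    error(item.id)
  } finally {
    // Runs whether the work succeeded or threw, so the cache entry
    // is always removed and cannot accumulate on the error path
    cache.delete(item.id)
  }
})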

Memory spaces in Node.js

Node.js has two primary memory spaces, the v8 JavaScript Heap, and the Node.js unmanaged C++ memory space (sometimes referred to as ‘RSS’ though typically better thought of as ‘RSS minus Heap’ space) where the contents of containers such as Buffers and TypedArrays are stored. Leaks in the Heap space are typically application code errors or logic errors and v8 APIs can be leveraged by tools such as N|Solid to extract and analyze the Heap space. Leaks in the unmanaged C++ memory space are often caused by native libraries, misusing Node internals, or application code errors, but have much more limited tooling available for analysis. Looking at application metrics over time should show us what memory space is of concern–one of these two primary candidates, or one of the rarer memory spaces to leak memory into. This report will only discuss JS Heap and ‘RSS minus Heap’ as areas of concern.

Different memory spaces have different implications for failure cases due to memory pressure. The Heap space is capped at application startup, and v8 will terminate the process if an allocation exceeds the maximum heap size and garbage collection cannot clear enough space for it. RSS can grow without bound, which can result in the application going into swap and degrading performance, or in the operating system or container orchestration system killing the application if it tries to allocate more than is possible.
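To get a coarse view of how a process is split across these spaces, Node's built-in process.memoryUsage() reports rss, heapTotal, heapUsed, and external byte counts; a small sketch:

// Coarse breakdown of the memory spaces discussed above.
// "rss - heapTotal" roughly approximates the unmanaged/native side.
const { rss, heapTotal, heapUsed, external } = process.memoryUsage()
const toMB = bytes => (bytes / 1024 / 1024).toFixed(1)

console.log(`rss:        ${toMB(rss)} MB`)
console.log(`heap used:  ${toMB(heapUsed)} MB of ${toMB(heapTotal)} MB`)
console.log(`external:   ${toMB(external)} MB (Buffer/ArrayBuffer contents, etc.)`)
console.log(`rss - heap: ${toMB(rss - heapTotal)} MB (approx. unmanaged space)`)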

The JavaScript heap is further broken into separate spaces for storing things with different security purposes, such as the instructions of the application itself (executable space) versus the data the application works with. Our focus will be on leaks in the data space, though we will try to point out where to check this assumption later.

Finding Memory Leaks

It can be extremely difficult to locate memory leaks in large applications. Some tools may help you find the function with the leak, but often it's a matter of narrowing things down to a few suspects and then looking at source code. Let's discuss some ways to narrow down where we have to look.

Here's our simplified workflow for narrowing down memory leak locations:

  1. Is it a leak at all?
  2. RSS or Heap?
  3. Where in code? (cpu profile, metrics, snapshots/profiles/load testing)
  4. How to solve?

Step 1: Is it a leak at all?

It can be difficult at times to tell if something is a memory leak or just expected behavior. Sometimes it will be clear, such as routinely crashing servers with OOM messages in the logs, but sometimes what looks like a leak is simply a cache that is still filling planned and provisioned headroom.

Crashes where the process was near memory limits or system or application logs that mention "OOM" or "Out of Memory" might be a smoking gun, but often we proactively restart servers as they reach predefined limits, meaning OOM is prevented but a leak persists.

We must first isolate the leak, either by finding active leaking processes, or a means of reproducing them. If we are lucky this might be testable in a local environment or from a load test scenario, but often we can only see it happening in production. Luckily, N|Solid provides some tools for collecting the necessary information from production with minimal impact.

Looking through metrics and considering the memory behavior of the application is our first step here. Is it possible that there's no memory pressure and a Major GC event hasn't run yet? Is it related to daily business activity? These sorts of questions are things to keep in mind when looking at the memory metrics suggested below.

Step 2: RSS or JavaScript Heap?

It's ideal to look here first because the tooling required for analysis in these memory spaces is quite different. Most of what we'll discuss and have direct tooling for are for leaks into the JavaScript heap space. Leaks in the RSS space follow the same workflow, but none of the tools specific to heap analysis will be as helpful.

Step 3: Where in the code is it leaking?

Here's where we pull out the rest of the toolbox.

Helpful Memory Analysis Metrics

One of the first things to try to identify is if the leak is related to use patterns, or if it is constant over time. If a particular activity triggers the leak, load testing–especially if targeted–can help narrow down the location and provide a framework for evaluating the solution. If the growth is only correlated to time, it is likely from a timer or interval inside the application or its libraries, or an external service such as a health check or other probe with its own timer. Start the process by looking at application metrics over time, and then start to match growth with application behavior.

tip

The Metrics in Detail page has a full list of the N|Solid Runtime metrics.

**Suggested Primary Metrics:**

  • rss: Normally a fleet of processes will stabilize at a steady-state range of rss values. Unbounded growth over a long period of time is a sign of a potential leak.
  • heapUsed: Normally a fleet of processes will stabilize at a similar steady-state range of values. If this value only increases over long periods even if GC is happening, you likely have a leak.

Both of these respond to various types of memory pressure or lack thereof, meaning that they may increase for a long time before they receive any pressure that might signal memory release. For RSS this is typically other applications allocating large blocks of memory. For heapUsed it requires a Major GC event to free memory again.

warning

Collecting a Heap Snapshot is one way to force a Major GC event and see if GC pressure fixes what looks like a leak, but it will at least temporarily inflate the process RSS significantly, which can be an issue if you are nearing hardware limitations. Also, the time the process is blocked for Garbage Collection increases with the size and complexity of the JS Heap.
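Outside production, a lighter-weight way to answer the same question ("does a major GC reclaim this memory?") is to start Node with the standard --expose-gc flag and trigger a collection manually; a minimal sketch (the filename is illustrative):

// Run with: node --expose-gc check-retention.js
const before = process.memoryUsage().heapUsed
global.gc() // forces a major collection; only available with --expose-gc
const after = process.memoryUsage().heapUsed

console.log(`freed:    ${((before - after) / 1024 / 1024).toFixed(1)} MB`)
console.log(`retained: ${(after / 1024 / 1024).toFixed(1)} MB still in use after GC`)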

**Potential Alert Metrics:**

  • totalAvailableSize below a safety threshold.
  • rss is nearing hardware limitations.

Metrics useful to diagnose, locate, or characterize leaks:

  • totalHeapSizeExecutable: When diagnosing a heapUsed memory leak, an increase in this space narrows the leak to code generation (eval or similar), misuse of the module loader, JIT confusion issues, or other code-instruction-related growth.
  • externalMem: Helpful in narrowing down where an rss leak is. Growth here suggests ArrayBuffer or Buffer data.
  • numberOfNativeContexts: Look for (mis)use of the vm module.
  • IO Primitives can narrow leaks down to code dealing with specific types of IO. The full list is at Metrics in Detail. Here are some examples:
    • activeHandles: These represent long-term IO primitives and thus it should not change often. Unbounded growth is a definite antipattern.
    • activeRequests: These represent short-term IO primitives and can change often. Unbounded growth is a definite antipattern.
    • promiseCreatedCount: If Promises are being tracked, this can let you see if Promises are a leak source.
  • Garbage Collection Stats:
    • gcCount: An increase in slope of this graph without decreases in the used heap size might indicate the Garbage Collector struggling with increased graph complexity due to a leak.
    • gcMajorCount: Compare processes with suspected leaks to those without to look for radically increased GC behavior. If GC behavior remains similar, it may be a lack of memory pressure rather than a resource leak.
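If you want to sanity-check some of these heap-level numbers locally, Node's built-in v8 module exposes comparable values; a minimal sketch (these are the underlying v8 statistics, not the N|Solid metric names themselves):

const v8 = require('v8')

// Whole-heap numbers, including how much room is left before the cap
const heap = v8.getHeapStatistics()
console.log('used_heap_size:      ', heap.used_heap_size)
console.log('total_available_size:', heap.total_available_size)
console.log('native contexts:     ', heap.number_of_native_contexts)

// Per-space breakdown; growth in code_space points at executable/code growth
for (const space of v8.getHeapSpaceStatistics()) {
  console.log(space.space_name, space.space_used_size)
}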

CPU Profiling

The CPU Profiler is a v8 API used by N|Solid that wakes up every few microseconds, samples the function the application is currently executing, and records the function stack in a time-based log format. The resulting .cpuprofile file can be analyzed in N|Solid or downloaded and analyzed in Chrome Dev Tools or other tools such as VS Code. While primarily a tool for performance analysis, it also works well to record what activity is actually present in order to narrow your focus. It does not record any memory statistics, just code execution time. Long-term profiles can be used to filter large codebases down to the branches that are active while the leak is being triggered.

When collecting CPU Profiles, the primary concern is a slight performance degradation as the sampler records timed measurements during execution. The overhead is small and it can be considered safe for production use, but avoid constant profiling or simultaneously collecting profiles on every instance.

CPU Profiles can be analyzed natively in the N|Solid UI and can be summarized aptly by our Copilot.
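If you need to produce the same kind of .cpuprofile file outside N|Solid (for example, during a local load test), one option is Node's built-in inspector module; a sketch, where the output filename and the 10-second window are illustrative:

const inspector = require('inspector')
const fs = require('fs')

const session = new inspector.Session()
session.connect()

session.post('Profiler.enable', () => {
  session.post('Profiler.start', () => {
    // ... exercise the code paths suspected of triggering the leak ...
    setTimeout(() => {
      session.post('Profiler.stop', (err, result) => {
        if (!err) fs.writeFileSync('local.cpuprofile', JSON.stringify(result.profile))
        session.disconnect()
      })
    }, 10000)
  })
})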

Heap Snapshots

Heap Snapshots are full graph representations of the v8 JavaScript Heap space of your application, with all orphaned nodes pre-pruned. The resulting .heapsnapshot file can be downloaded from N|Solid and uploaded to Chrome Dev Tools for analysis. They can be quite large, usually at least the size of the JavaScript Heap, as they contain most of the contents of the JS Heap plus the graph structure of the memory.

When collecting Heap Snapshots, there are multiple major events that have implications for production. First, there is a major "stop the world" garbage collection which pauses execution for cleanup and then records the freshly cleaned JavaScript Heap. The size of this representation in memory can be quite large and hard to predict from the heap size of your application, as it depends on both the contents and the memory graph complexity of the heap space. The longer it takes to garbage collect and record, the longer your application will be paused. If the server is active, any in-flight asynchronous calls will have to wait, potentially meaning a poor user experience. If the application heap is near the maximum heap setting for your v8 instance, the garbage collection could cause the process to crash with an Out Of Memory (OOM) error. If the container or operating system kills processes that attempt to allocate more than a certain amount of memory, taking a large snapshot could result in the kernel terminating the process. Obviously, a crash is the most disruptive thing you can do to a user with an in-flight request, so take care.

Being a full graph representation of the entire heap, they can be quite large on disk, especially if you are dealing with memory leaks. All data, their containing objects, and how those relate to each other are captured in a JSON file which usually ends in .heapsnapshot. When Heap Snapshots are performed, a full Garbage Collection event will be initiated, and the memory required to perform the snapshot may at least double the RSS footprint of that process for some time. This can have performance or stability implications for processes near hardware limits.

There are various tools that can be used to analyze Heap Snapshots, including Chrome Dev Tools, which is built into every Chrome browser. If you open it up and go to the Memory tab, you can upload .heapsnapshot files and use it to browse the graph.
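For local or staging experiments, Node's built-in v8 module can write the same .heapsnapshot format directly, with the same full-GC pause described above; a minimal sketch:

const v8 = require('v8')

// Blocks the event loop while a full GC runs and the snapshot is serialized,
// then returns the generated .heapsnapshot filename
const filename = v8.writeHeapSnapshot()
console.log(`Heap snapshot written to ${filename}`)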

important

Test the impact of the tools you use in staging and QA prior to using them in production! For example: collecting a heap snapshot without enough memory overhead for the serialization of the snapshot might cause Docker to terminate the process.

Allocation Profiling

Heap Allocation Profiling is a time-based mechanism to generate Heap Snapshots with additional code context annotated. This is designed to let you better link constructors and allocation patterns to actual code paths. The resulting files are also .heapsnapshot files and use the same structure, but have additional information visible. Performance overhead is at least as high as that of Heap Snapshots.

There are various tools that can be used to analyze Heap Profiles, including Chrome Dev Tools, which is built into every Chrome browser. If you open it up and go to the Memory tab, you can upload .heapsnapshot files and use it to browse the graph. More on analysis below.

Sampling Heap Profile

Heap Sampling takes a time-based, sampled approach with negligible overhead: it does not require Garbage Collection or try to represent the entire memory graph--instead, it samples memory allocations and deallocations performed by code, along with their captured stack traces. This results in a CPU Profile-like structure where, instead of execution time on the x axis, memory growth is shown, letting you see which functions allocated the most net memory during the sampling period.

There are various tools that can be used to analyze Heap Samples, including Chrome Dev Tools, which is built into every Chrome browser. If you open it up and go to the Memory tab, you can load .heapprofile files and browse the sampled allocations. More on analysis below.
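As a local approximation of the same data, the built-in inspector module can record a sampling profile in the .heapprofile format; a sketch, with the output filename and sampling window as illustrative choices:

const inspector = require('inspector')
const fs = require('fs')

const session = new inspector.Session()
session.connect()

// Sample allocations with stack traces while the suspected leak is triggered
session.post('HeapProfiler.startSampling', () => {
  setTimeout(() => {
    session.post('HeapProfiler.stopSampling', (err, result) => {
      if (!err) fs.writeFileSync('local.heapprofile', JSON.stringify(result.profile))
      session.disconnect()
    })
  }, 10000)
})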

Load Testing

Often we already have tests that we can leverage to trigger leaks that happen from specific codepaths. If a test is identified, these tools not only become significantly more powerful, but we can use them without the potential risks of degrading production. In general, if you can reproduce the issue outside of production, perform the analysis there first and perform lighter confirmation tests in production.

Refactoring

Sometimes refactoring will produce a solution along the way, but one thing that makes the above assets significantly more powerful is named constructors. In Heap Snapshots, for example, generic objects show up as type object or Object, so anything created via {} or new Object() gets lumped together. Using named constructors pulls specific object types into their own categories for analysis--useful when the generic object buckets could contain hundreds of millions of entries.
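As a sketch of what named constructors buy you (CacheEntry is an illustrative name): instances of a plain literal land in the generic Object bucket of a Heap Snapshot, while instances of a named class are grouped under their own constructor name.

const cache = new Map()

// Lumped into the generic "Object" bucket in a Heap Snapshot:
cache.set('a', { data: 'payload', createdAt: Date.now() })

// Grouped under its own "CacheEntry" constructor name, so growth of this
// specific object type stands out during analysis:
class CacheEntry {
  constructor(data) {
    this.data = data
    this.createdAt = Date.now()
  }
}
cache.set('b', new CacheEntry('payload'))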

Step 4: Fixing The Leak

Generally, fixing a leak means refactoring the code to be friendlier to the GC algorithms: simpler code, simpler objects, cleaner encapsulation, and proper cleanup of resources that must be cleaned up manually.

Ideal Solution: Refactor

Often leaks are the result of one of these patterns, each of which can be refactored away (a small cleanup sketch for the timer and listener cases follows this list):

  • Misuse of global variable spaces
  • Missing Timer or Interval cleanup
  • Missing Event Listener cleanup
  • Closures with unnecessary references
  • Improper cache cleanup
  • Missing error handling
  • Retaining transient objects in long-lived Objects or Arrays
  • Unfinished Promises
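For the timer and listener items above, the usual fix is to pair every registration with a matching cleanup call; a minimal sketch with illustrative names:

// `source` is assumed to be an EventEmitter-like object (e.g. a stream)
function watch(source) {
  const onData = chunk => process.stdout.write(chunk)
  source.on('data', onData)

  const heartbeat = setInterval(() => {
    // ... periodic work tied to this source ...
  }, 30000)

  // Return a disposer; forgetting to call it leaks the listener, the
  // interval, and everything they reference
  return function stop() {
    source.off('data', onData)
    clearInterval(heartbeat)
  }
}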

Potential Solution: WeakMaps

Occasionally, a leak that would otherwise be difficult to avoid can benefit from one of the weak-reference types in JavaScript, such as WeakMap. A WeakMap lets you attach data to a resource without that entry keeping the resource alive. This means you could cache results keyed by something like a Request in a WeakMap, and those cached results would not affect the lifetime of the Request. For example:

const metadataMap = new WeakMap()

app.post('/start_work/:id', (req, res) => {
  const id = req.params.id
  metadataMap.set(req, { id, start: Date.now() })
  // ... other potentially asynchronous work
})

In this example, because metadataMap is a WeakMap instead of a Map, its entry does not keep the req object alive: once the request is no longer reachable elsewhere, GC can collect it, along with the associated metadata. Routing the relationship "through" req this way makes us less prone to leaving stale entries behind after requests are completed. Attaching metadata or methods directly to Request objects is another way of doing this, but it is significantly more prone to creating circular lifetime references or deoptimizing code that handles Request objects.