PyTorch Model Performance Analysis and Optimization — Part 3 | by Chaim Rand | Aug, 2023

How to reduce “Cuda Memcpy Async” events and why you should beware of boolean mask operations

Chaim Rand
Towards Data Science

This is the third part of a series of posts on the topic of analyzing and optimizing PyTorch models using PyTorch Profiler and TensorBoard. Our intention has been to highlight the benefits of performance profiling and optimization of GPU-based training workloads and their potential impact on the speed and cost of training. In particular, we wish to demonstrate the accessibility of profiling tools such as PyTorch Profiler and TensorBoard to all ML developers. You do not need to be a CUDA expert in order to derive meaningful performance gains from applying the techniques we discuss in our posts.

In our first post we demonstrated how the different views of the PyTorch Profiler TensorBoard plugin can be used to identify performance issues and reviewed a few popular techniques for accelerating training. In the second post we showed how the TensorBoard plugin Trace View can be used to identify when tensors are being copied from the CPU to the GPU, and back. Such movement of data — which can cause points of synchronization and slow the speed of training considerably — is often unintentional and can sometimes be easily avoided. The topic of this post will be situations in which we encounter points of synchronization between the GPU and CPU that are not associated with tensor copies. As in the case of tensor copies, these can cause stagnation in your training step and slow the overall time of your training considerably. We will demonstrate the existence of such occurrences, how they can be identified using PyTorch Profiler and the PyTorch Profiler TensorBoard plugin Trace View, and the potential performance benefits of building your model in a way that minimizes such synchronization events.

As in our previous posts, we will define a toy PyTorch model and then iteratively profile its performance, identify bottlenecks, and attempt to fix them. We will run our experiments on an Amazon EC2 g5.2xlarge instance (containing an NVIDIA A10G GPU and 8 vCPUs) using the official AWS PyTorch 2.0 Docker image. Keep in mind that some of the behaviors we describe may vary between versions of PyTorch.

In the following blocks we introduce a toy PyTorch model that performs semantic segmentation on a 256×256 input image, i.e., it takes a 256×256 RGB image and outputs a 256×256 map of “per-pixel” labels drawn from a set of ten semantic categories.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim
import torch.profiler
from torch import Tensor

class Net(nn.Module):
    def __init__(self, num_hidden=10, num_classes=10):
        # nn.Module must be initialized before submodules are assigned
        super().__init__()
        self.conv_in = nn.Conv2d(3, 10, 3, padding='same')
        hidden = []
        for i in range(num_hidden):
            hidden.append(nn.Conv2d(10, 10, 3, padding='same'))

        self.hidden = nn.Sequential(*hidden)
        self.conv_out = nn.Conv2d(10, num_classes, 3, padding='same')

    def forward(self, x):
        x = F.relu(self.conv_in(x))
        x = self.hidden(x)
        x = self.conv_out(x)
        return x

To train our model we will use the standard cross-entropy loss with a few modifications:

  1. We will assume that the target labels include an ignore value indicating pixels that we want to exclude from the loss calculation.
  2. We will assume that one of the semantic labels identifies certain pixels as belonging to the “background” of the image. We define our loss function to treat these as ignore labels.
  3. We will update our model weights only when we encounter batches with target tensors that include at least two unique values.

While we have chosen these modifications for the purposes of our demonstration, these types of operations are not uncommon and can be found in many “standard” PyTorch models. Since we are already “experts” at performance profiling, we have already gone ahead and wrapped each of the operations in our loss function with a torch.profiler.record_function context manager (as described in our second post).

class MaskedLoss(nn.Module):
    def __init__(self, ignore_val=-1, num_classes=10):
        super().__init__()
        self.ignore_val = ignore_val
        self.num_classes = num_classes
        self.loss = torch.nn.CrossEntropyLoss()

    def cross_entropy(self, pred: Tensor, target: Tensor) -> Tensor:

        # create a boolean mask of valid labels
        with torch.profiler.record_function('create mask'):
            mask = target != self.ignore_val

        # permute the logits in preparation for masking
        with torch.profiler.record_function('permute'):
            permuted_pred = torch.permute(pred, [0, 2, 3, 1])

        # apply the boolean mask to the targets and logits
        with torch.profiler.record_function('mask'):
            masked_target = target[mask]
            # expand the mask over the class dimension before indexing
            masked_pred = permuted_pred[mask.unsqueeze(-1).expand(-1, -1, -1,
                                                         self.num_classes)]
            masked_pred = masked_pred.reshape(-1, self.num_classes)

        # calculate the cross-entropy loss
        with torch.profiler.record_function('calc loss'):
            loss = self.loss(masked_pred, masked_target)
        return loss

    def ignore_background(self, target: Tensor) -> Tensor:

        # discover all indices where target label is "background"
        with torch.profiler.record_function('non_zero'):
            inds = torch.nonzero(target == self.num_classes - 1, as_tuple=True)

        # reset all "background" labels to the ignore index
        with torch.profiler.record_function('index assignment'):
            target[inds] = self.ignore_val
        return target

    def forward(self, pred: Tensor, target: Tensor) -> Tensor:

        # ignore background labels
        target = self.ignore_background(target)

        # retrieve a list of unique elements in target
        with torch.profiler.record_function('unique'):
            unique = torch.unique(target)

        # check if the number of unique items pass the threshold
        with torch.profiler.record_function('numel'):
            ignore_loss = torch.numel(unique) < 2

        # calculate the cross-entropy loss
        loss = self.cross_entropy(pred, target)

        # zero the loss in the case that the number of unique elements
        # is below the threshold
        if ignore_loss:
            loss = 0. * loss

        return loss

Our loss function seems innocent enough, right? Wrong! As we will see below, the loss function includes a number of operations that trigger host-device synchronization events that slow the speed of training considerably — none of which involve copying tensors into or out of the GPU. As in our previous post, we challenge you to try to identify three opportunities for performance optimization before reading on.

For the purposes of our demo, we use randomly generated images and per-pixel label maps, as defined below.

from torch.utils.data import Dataset, DataLoader

# A dataset with random images and label maps
class FakeDataset(Dataset):
    def __init__(self, num_classes=10):
        self.num_classes = num_classes
        self.img_size = [256, 256]

    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        rand_image = torch.randn([3]+self.img_size, dtype=torch.float32)
        # labels span the ignore value (-1) and the class ids 0..num_classes-1
        rand_label = torch.randint(low=-1, high=self.num_classes,
                                   size=self.img_size)
        return rand_image, rand_label

train_set = FakeDataset()
train_loader = DataLoader(train_set, batch_size=256,
                          shuffle=True, num_workers=8, pin_memory=True)

Last, we define our training step with the PyTorch Profiler configured to our desire:

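device = torch.device("cuda:0")
model = Net().cuda(device)
criterion = MaskedLoss().cuda(device)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# training loop wrapped with profiler object
with torch.profiler.profile(
    schedule=torch.profiler.schedule(wait=1, warmup=4, active=3, repeat=1),
    # write traces for the TensorBoard plugin ('./log' is a placeholder path)
    on_trace_ready=torch.profiler.tensorboard_trace_handler('./log'),
) as prof:
    for step, data in enumerate(train_loader):
        inputs = data[0].to(device=device, non_blocking=True)
        labels = data[1].to(device=device, non_blocking=True)
        # stop once the profiler schedule (wait + warmup + active) is done
        if step >= (1 + 4 + 3) * 1:
            break
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()
        # signal the end of the step to the profiler schedule
        prof.step()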

If you were to naively run this training script, you would probably see high (~90%) GPU utilization and not know that there was anything wrong with it. It is only through profiling that we are able to identify the underlying performance bottlenecks and potential opportunities for training acceleration. So, without further ado, let’s see how our model performs.

In this post we will focus on the Trace View of the PyTorch Profiler TensorBoard plugin. Please see our previous posts for tips on how to use some of the other views supported by the plugin.

In the image below we show the Trace View of a single training step of our toy model.

Trace View of Baseline Model (Captured by Author)

We can clearly see that our 1.3 second long training step is completely dominated by the torch.nonzero operator in the first line of our loss function. All the other operations appear bunched together on either side of the huge cudaMemcpyAsync event. What is going on??!! Why would such a seemingly innocent operation cause such a huge eyesore?

Perhaps we should not be so surprised, as the torch.nonzero documentation does include the following note: “When input is on CUDA, torch.nonzero() causes host-device synchronization.” The need for synchronization arises from the fact that, contrary to other common PyTorch ops, the size of the tensor that is returned by torch.nonzero is not pre-determined. The CPU does not know how many non-zero elements there are in the input tensor ahead of time. It needs to wait for the sync event from the GPU in order to perform the appropriate GPU memory allocation and appropriately prepare the subsequent PyTorch ops.
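The point about output size can be illustrated with a minimal sketch (the tensors `a` and `b` are made-up examples, not taken from the model above). Two inputs of identical shape produce torch.nonzero results of different shapes, whereas the three-argument form of torch.where always returns a tensor shaped like its input:

```python
import torch

# falls back to the CPU so the shape comparison can be run anywhere
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# two inputs of identical shape but different content
a = torch.tensor([0, 1, 0, 2, 3], device=device)
b = torch.tensor([0, 0, 0, 0, 7], device=device)

# torch.nonzero: the output shape depends on the values, so on CUDA the
# host must wait for the device to report the count before continuing
nz_a = torch.nonzero(a)
nz_b = torch.nonzero(b)
print(nz_a.shape, nz_b.shape)   # torch.Size([3, 1]) torch.Size([1, 1])

# three-argument torch.where: the output shape always matches the input
w_a = torch.where(a != 0, a, torch.full_like(a, -1))
w_b = torch.where(b != 0, b, torch.full_like(b, -1))
print(w_a.shape, w_b.shape)     # torch.Size([5]) torch.Size([5])
```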

Note that the length of cudaMemcpyAsync is not indicative of the complexity of the torch.nonzero op, but rather reflects the amount of time that the CPU needs to wait for the GPU to finish all of the previous kernels that the CPU launched. For example, were we to make an additional torch.nonzero call immediately after our first one, our second cudaMemcpyAsync event would appear significantly shorter than the first since the CPU and GPU are already more or less “in sync”. (Keep in mind that this explanation is coming from a non-CUDA expert, so make of it what you will…)
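One rough way to observe this for yourself is sketched below. The helper `time_two_nonzero_calls` and its parameters are hypothetical names introduced here for illustration. It queues a backlog of kernels and then measures the host-side wall time of two back-to-back torch.nonzero calls. On a GPU we would expect the first call to absorb the wait for the backlog and the second to return much sooner; on a CPU there is no asynchronous queue, so the two timings should be similar.

```python
import time
import torch

def time_two_nonzero_calls(device, size=2048, backlog=20):
    """Host-side wall time (seconds) of two consecutive torch.nonzero calls."""
    x = torch.randn(size, size, device=device)
    y = x
    # queue a backlog of kernels; on CUDA these launches return immediately
    for _ in range(backlog):
        y = torch.tanh(y @ x)
    mask = y > 0
    t0 = time.perf_counter()
    torch.nonzero(mask)   # on CUDA the host waits here for all queued work
    t1 = time.perf_counter()
    torch.nonzero(mask)   # the queue has already been drained
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1

use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
# a smaller problem size keeps the CPU fallback quick
first, second = time_two_nonzero_calls(device, size=2048 if use_cuda else 256)
print(f"first: {first * 1e3:.3f} ms, second: {second * 1e3:.3f} ms")
```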

Now that we understand the source of the bottleneck, the challenge becomes finding an alternative sequence of operations that performs the same logic but that does not trigger a host-device synchronization event. In the case of our loss function, we can easily accomplish this using the torch.where operator as shown in the code block below:

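    def ignore_background(self, target: Tensor) -> Tensor:
        with torch.profiler.record_function('update background'):
            # fixed-size output: replace "background" labels with the ignore value
            target = torch.where(target==self.num_classes-1,
                                 torch.full_like(target, self.ignore_val),
                                 target)
        return target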
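Before relying on such a swap, it is worth confirming that both versions produce the same labels. The check below is a minimal sketch that runs on the CPU. The module-level constants and the two standalone functions are hypothetical stand-ins for the corresponding methods of the loss class, and the torch.nonzero variant clones its input because the original method modifies the target in place.

```python
import torch

# hypothetical constants mirroring the defaults of the loss class
IGNORE_VAL = -1
NUM_CLASSES = 10

def ignore_background_nonzero(target):
    # original approach: data-dependent index list plus index assignment
    target = target.clone()  # avoid mutating the caller's tensor in this check
    inds = torch.nonzero(target == NUM_CLASSES - 1, as_tuple=True)
    target[inds] = IGNORE_VAL
    return target

def ignore_background_where(target):
    # alternative: elementwise select with a fixed-size output
    return torch.where(target == NUM_CLASSES - 1,
                       torch.full_like(target, IGNORE_VAL),
                       target)

torch.manual_seed(0)
target = torch.randint(low=-1, high=NUM_CLASSES, size=(4, 16, 16))
same = torch.equal(ignore_background_nonzero(target),
                   ignore_background_where(target))
print(same)  # True
```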

In the image below we show the Trace View following this change.

Trace View Following Optimization #1 (Captured by Author)

While we have succeeded in removing the cudaMemcpyAsync coming from the torch.nonzero op, it has been immediately replaced with one coming from the torch.unique op, and our step time has not budged. Here the PyTorch documentation is less kind, but based on our previous experience we can assume that, once again, we are suffering from a host-device synchronization event due to our use of tensors with undetermined size.

Replacing the torch.unique operator with an equivalent alternative is not always possible. However, in our case we don’t actually need to know the values of the unique labels, we need to know only the number of unique labels. This can be calculated by applying the torch.sort op on the flattened target tensor and counting the number of steps in the resultant step function.

    def forward(self, pred: Tensor, target: Tensor) -> Tensor:

        # ignore background labels
        target = self.ignore_background(target)

        # sort the list of labels
        with torch.profiler.record_function('sort'):
            sorted_target, _ = torch.sort(target.flatten())

        # identify the steps of the resultant step function
        with torch.profiler.record_function('deriv'):
            deriv = sorted_target[1:] - sorted_target[:-1]

        # count the number of steps
        with torch.profiler.record_function('count_nonzero'):
            num_unique = torch.count_nonzero(deriv) + 1

        # calculate the cross-entropy loss
        loss = self.cross_entropy(pred, target)

        # zero the loss in the case that the number of unique elements
        # is below the threshold
        with torch.profiler.record_function('where'):
            loss = torch.where(num_unique < 2, 0. * loss, loss)

        return loss
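The sort-and-count trick can be checked against torch.unique on a few random inputs. This is a minimal sketch: `count_unique_sorted` is a hypothetical standalone version of the logic above, and it assumes a non-empty input (for an empty tensor it would report one unique value). The `.item()` calls appear only in the check and would themselves force a synchronization on a GPU.

```python
import torch

def count_unique_sorted(target):
    # sort, then count the positions where neighbouring values differ
    sorted_vals, _ = torch.sort(target.flatten())
    steps = sorted_vals[1:] != sorted_vals[:-1]
    # number of unique values = number of steps + 1 (returned as a 0-dim tensor)
    return torch.count_nonzero(steps) + 1

torch.manual_seed(0)
for high in (1, 2, 5, 10):
    t = torch.randint(low=-1, high=high, size=(8, 32, 32))
    assert count_unique_sorted(t).item() == torch.unique(t).numel()
```

Note that the result stays on the device as a tensor, which is why the loss is zeroed with torch.where rather than with a Python `if` statement.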

In the image below we capture the Trace View following our second optimization:

Trace View Following Optimization #2 (Captured by Author)

Once again, we have solved one bottleneck only to be faced with a new one, this time coming from the boolean mask routine.

Boolean masking is a routine we commonly use in order to reduce the overall number of machine operations that are required. In our case, our intention was to reduce the amount of computation by removing the “ignore” pixels and limiting the cross-entropy calculation to the pixels of interest. Clearly, this has backfired. As before, applying a boolean mask results in a tensor of undetermined size, and the cudaMemcpyAsync that it triggers greatly overshadows any of the savings from excluding the “ignore” pixels.

In our case, fixing this issue is rather simple as the PyTorch CrossEntropyLoss has a built-in option for setting an ignore_index.

class MaskedLoss(nn.Module):
    def __init__(self, ignore_val=-1, num_classes=10):
        super().__init__()
        self.ignore_val = ignore_val
        self.num_classes = num_classes
        # let the loss skip the ignored pixels instead of masking them out
        self.loss = torch.nn.CrossEntropyLoss(ignore_index=ignore_val)

    def cross_entropy(self, pred: Tensor, target: Tensor) -> Tensor:
        with torch.profiler.record_function('calc loss'):
            loss = self.loss(pred, target)
        return loss
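As a sanity check, the sketch below compares the boolean-mask formulation with the ignore_index formulation on random data, under the default mean reduction. The shapes and variable names are illustrative assumptions. With the default reduction both versions average over the non-ignored pixels only, so the two values should agree up to floating-point error.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, ignore_val = 10, -1
pred = torch.randn(2, num_classes, 8, 8)   # logits: [batch, class, H, W]
target = torch.randint(low=-1, high=num_classes, size=(2, 8, 8))

# boolean-mask version: the output size depends on the mask content
mask = target != ignore_val
masked_target = target[mask]
masked_pred = pred.permute(0, 2, 3, 1)[mask]   # [num_valid, num_classes]
loss_masked = F.cross_entropy(masked_pred, masked_target)

# ignore_index version: operates on the full, fixed-size tensors
loss_ignore = F.cross_entropy(pred, target, ignore_index=ignore_val)

print(torch.allclose(loss_masked, loss_ignore, atol=1e-6))
```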

In the image below we show the resultant Trace View:

Final Trace View (Captured by Author)

Holy cow!! Our step time has dropped all the way down to 5.4 milliseconds. That’s 240 (!!) times faster than what we started with. By simply changing around a few function calls and without any modification to the loss function logic, we were able to optimize the performance of the training step dramatically.

Important Note: In the toy example we have chosen, the steps that we took to reduce the number of cudaMemcpyAsync events had a clear impact on the training step time. However, there may be situations where the same types of changes will harm performance rather than improve it. For example, in the case of boolean masking, if our mask is extremely sparse and the original tensors extremely large, the savings in computation from applying the mask might outweigh the price of the host-device synchronization. Importantly, the impact of each optimization should be evaluated on a case-by-case basis.
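A simple way to make such a case-by-case comparison is to time both variants directly. The sketch below is one possible harness, not the method used for the measurements in this post. `mean_time_ms` is a hypothetical helper, and the tensor sizes are arbitrary. Timing the loss in isolation also understates the cost of a synchronization in a real training step, where the host has to wait for everything the model has already queued, so the full step should be profiled as well.

```python
import time
import torch
import torch.nn.functional as F

def mean_time_ms(fn, device, iters=20, warmup=5):
    """Average wall time of fn() in milliseconds (hypothetical helper)."""
    for _ in range(warmup):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # drain queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if device.type == "cuda":
        torch.cuda.synchronize(device)  # wait for the timed work to finish
    return (time.perf_counter() - start) * 1e3 / iters

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
pred = torch.randn(8, 10, 64, 64, device=device)
target = torch.randint(low=-1, high=10, size=(8, 64, 64), device=device)

def masked_loss():
    # boolean-mask variant: data-dependent output size
    mask = target != -1
    return F.cross_entropy(pred.permute(0, 2, 3, 1)[mask], target[mask])

def ignore_index_loss():
    # fixed-size variant
    return F.cross_entropy(pred, target, ignore_index=-1)

t_masked = mean_time_ms(masked_loss, device)
t_ignore = mean_time_ms(ignore_index_loss, device)
print(f"masked: {t_masked:.3f} ms, ignore_index: {t_ignore:.3f} ms")
```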

In this post we have focused on performance issues in training applications that are caused by host-device synchronization events. We saw several examples of PyTorch operators that trigger such events; the common property of all of them is that the size of the tensors they output depends on the content of the input. You might also encounter synchronization events from other operators, not covered in this post. We demonstrated how performance analyzers such as PyTorch Profiler and its associated TensorBoard plugin can be used to identify these kinds of events.

In the case of our toy example, we were able to find equivalent alternatives to the problematic operators that use fixed-size tensors and avoid the need for synchronization events. These led to a significant improvement in training time. However, in practice you might find it much harder, or even impossible, to solve these kinds of bottlenecks. Sometimes, overcoming them might require redesigning parts of your model.

This post originally appeared on TechToday.