8 Deep Learning / Computer Vision Bugs And How I Could Have Avoided Them

Arseny Kravchenko
Oct 3, 2019 · 11 min read

People are not perfect; we often introduce bugs into our software. Sometimes these bugs are easy to find: your code just doesn’t work at all, your app crashes, and so on. But some bugs are hidden, which makes them even more dangerous.

Working on deep learning problems, one can easily introduce bugs of this hidden type because of the inherent uncertainty: it’s easy to see whether a web app endpoint routes requests properly, and not that easy to check whether your gradient descent step was correct. However, a lot of bugs in a DL practitioner’s routine could have been avoided.

I’d like to share some of my experience with bugs I’ve seen or made during my last two years of working on computer vision. I’ve spoken on this topic at a conference, and a lot of folks told me at the afterparty: “yeah dude, I’ve had plenty of those bugs as well”. I hope my article helps you avoid at least some of these issues.

1. Flip image and keypoints.

Assume one works on the problem of keypoint detection. The data looks like a pair of an image and a sequence of keypoint tuples, e.g. [(0, 1), (2, 2)], where each keypoint is a pair of y and x coordinates.

Let’s code a basic augmentation on this data:

import numpy as np
from typing import Sequence

def flip_img_and_keypoints(img: np.ndarray, kpts: Sequence[Sequence[int]]):
    img = np.fliplr(img)
    h, w, *_ = img.shape
    kpts = [(y, w - x) for y, x in kpts]
    return img, kpts

Looks correct, huh? Well, let’s visualize it.

import matplotlib.pyplot as plt

image = np.ones((10, 10), dtype=np.float32)
kpts = [(0, 1), (2, 2)]
image_flipped, kpts_flipped = flip_img_and_keypoints(image, kpts)
img1 = image.copy()
for y, x in kpts:
    img1[y, x] = 0
img2 = image_flipped.copy()
for y, x in kpts_flipped:
    img2[y, x] = 0

_ = plt.imshow(np.hstack((img1, img2)))

The asymmetry looks weird! What if we check extreme values?

image = np.ones((10, 10), dtype=np.float32)
kpts = [(0, 0), (1, 1)]
image_flipped, kpts_flipped = flip_img_and_keypoints(image, kpts)
img1 = image.copy()
for y, x in kpts:
    img1[y, x] = 0
img2 = image_flipped.copy()
for y, x in kpts_flipped:
    img2[y, x] = 0
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-5-997162463eae> in <module>
8 img2 = image_flipped.copy()
9 for y, x in kpts_flipped:
---> 10 img2[y, x] = 0
IndexError: index 10 is out of bounds for axis 1 with size 10

Not good! This is a classic off-by-one error. The correct code looks like this:

def flip_img_and_keypoints(img: np.ndarray, kpts: Sequence[Sequence[int]]):
    img = np.fliplr(img)
    h, w, *_ = img.shape
    kpts = [(y, w - x - 1) for y, x in kpts]
    return img, kpts

We detected this issue through visualization, but a unit test with an x = 0 point would have helped as well (see the sketch below). A funny fact: on one team, three folks (including myself) made almost the same mistake independently.
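Here is a minimal sketch of such a unit test, covering the boundary case x = 0 (the test name is mine, not from any existing test suite):

def test_flip_keeps_keypoints_in_bounds():
    img = np.ones((10, 10), dtype=np.float32)
    kpts = [(0, 0), (1, 1)]  # includes the extreme x = 0 point
    flipped_img, flipped_kpts = flip_img_and_keypoints(img, kpts)
    h, w, *_ = flipped_img.shape
    # every flipped keypoint must still index into the image
    for y, x in flipped_kpts:
        assert 0 <= y < h and 0 <= x < w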

2. Continuing with keypoints

Even after the function above has been fixed, there is still a danger. It’s more about semantics now, not just a piece of code.

Assume one needs to augment an image with two palms in it. Looks safe: hands will still be hands after a left-right flip.

But wait! We know nothing about the semantics of the keypoints we have. What if the keypoints really mean something like this:

kpts = [
    (20, 20),   # left pinky
    (20, 200),  # right pinky
    ...
]

It means that the augmentation actually changes the semantics: left becomes right and right becomes left, yet we don’t swap the keypoint indices within the array. This brings a huge amount of noise into training and much worse metrics.
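A minimal sketch of a semantics-aware flip, assuming a hypothetical SYMMETRIC_PAIRS mapping of left/right keypoint indices (this is my illustration, not the article’s code):

# Hypothetical mapping: index of a "left" keypoint -> index of its "right" twin.
SYMMETRIC_PAIRS = [(0, 1)]  # e.g. (left pinky, right pinky)

def flip_img_and_keypoints_semantic(img: np.ndarray, kpts: Sequence[Sequence[int]]):
    img = np.fliplr(img)
    h, w, *_ = img.shape
    kpts = [(y, w - x - 1) for y, x in kpts]
    # the geometric flip turns left into right, so swap the paired indices too
    for left, right in SYMMETRIC_PAIRS:
        kpts[left], kpts[right] = kpts[right], kpts[left]
    return img, kpts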

A lesson should be learned:

  • know and think about the data structure and its semantics before applying augmentations or other fancy features;
  • keep your experiments atomic: add a small change (e.g. a new transform), check how it goes, and merge it only if the score has improved.

3. Coding a custom loss function

Those who are familiar with the semantic segmentation problem probably know the IoU (intersection over union) metric. Unfortunately, one can’t optimize it directly with SGD, so a common trick is to approximate it with a differentiable loss function. Let’s code one!

def iou_continuous_loss(y_pred, y_true):
    eps = 1e-6
    def _sum(x):
        return x.sum(-1).sum(-1)
    numerator = (_sum(y_true * y_pred) + eps)
    denominator = (_sum(y_true ** 2) + _sum(y_pred ** 2)
                   - _sum(y_true * y_pred) + eps)
    return (numerator / denominator).mean()

Looks good. Let’s do a small check before moving on:

In [3]: ones = np.ones((1, 3, 10, 10))
   ...: x1 = iou_continuous_loss(ones * 0.01, ones)
   ...: x2 = iou_continuous_loss(ones * 0.99, ones)

In [4]: x1, x2
Out[4]: (0.010099999897990103, 0.9998990001020204)

For x1 we calculated this loss for a prediction totally different from the ground truth, while x2 is the result for a prediction very close to the ground truth. We expect x1 to be huge, as the prediction is bad, and x2 to be close to zero. What's wrong?

Well, the function above is a good approximation of the metric. But a metric is not a loss: it’s usually (including in this case) the higher the better. Since we minimize the loss with SGD, we should have used something inverted:

def iou_continuous(y_pred, y_true):
    eps = 1e-6
    def _sum(x):
        return x.sum(-1).sum(-1)
    numerator = (_sum(y_true * y_pred) + eps)
    denominator = (_sum(y_true ** 2) + _sum(y_pred ** 2)
                   - _sum(y_true * y_pred) + eps)
    return (numerator / denominator).mean()

def iou_continuous_loss(y_pred, y_true):
    return 1 - iou_continuous(y_pred, y_true)

Such issues can be identified in two ways:

  • write a unit test checking the direction of the loss: formalize the expectation that something closer to the ground truth should yield a lower loss (see the sketch below);
  • run a sanity check: try to overfit a single batch with your model.
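Such a direction test could look like the following minimal sketch, assuming the fixed iou_continuous_loss above (the test name is mine):

def test_loss_direction():
    ones = np.ones((1, 3, 10, 10))
    bad_loss = iou_continuous_loss(ones * 0.01, ones)   # far from ground truth
    good_loss = iou_continuous_loss(ones * 0.99, ones)  # close to ground truth
    # a better prediction must produce a lower loss,
    # and a near-perfect one should be close to zero
    assert good_loss < bad_loss
    assert good_loss < 0.01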

4. When we meet PyTorch

Assume one has a pretrained model, and it’s inference time. Let’s code a Predictor class based on the ceevee API.

import torch
from ceevee.base import AbstractPredictor

class MySuperPredictor(AbstractPredictor):
    def __init__(self,
                 weights_path: str,
                 ):
        super().__init__()
        self.model = self._load_model(weights_path=weights_path)

    def process(self, x, *kw):
        with torch.no_grad():
            res = self.model(x)
        return res

    @staticmethod
    def _load_model(weights_path):
        model = ModelClass()
        weights = torch.load(weights_path, map_location='cpu')
        model.load_state_dict(weights)
        return model

Is this code correct? Maybe! It is indeed correct for some models, e.g. when the model has no dropout or norm layers such as torch.nn.BatchNorm2d, or when the model needs to use the actual norm statistics for each image (many pix2pix-based architectures require that).

But for most computer vision applications the code misses something important: switching to evaluation mode.
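A minimal sketch of the fix, written as a standalone helper with the same logic as _load_model above (ModelClass is the same placeholder):

def load_model_for_inference(weights_path: str):
    model = ModelClass()
    weights = torch.load(weights_path, map_location='cpu')
    model.load_state_dict(weights)
    model.eval()  # disables dropout, makes norm layers use running statistics
    return model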

This issue is easy to identify if one tries to convert the dynamic PyTorch graph into a static one. There is the torch.jit module for such a conversion.

In [3]: model = nn.Sequential(
   ...:     nn.Linear(10, 10),
   ...:     nn.Dropout(.5)
   ...: )
   ...:
   ...: traced_model = torch.jit.trace(model, torch.rand(10))
/Users/Arseny/.pyenv/versions/3.6.6/lib/python3.6/site-packages/torch/jit/__init__.py:914: TracerWarning: Trace had nondeterministic nodes. Did you forget call .eval() on your model? Nodes:
%12 : Float(10) = aten::dropout(%input, %10, %11), scope: Sequential/Dropout[1] # /Users/Arseny/.pyenv/versions/3.6.6/lib/python3.6/site-packages/torch/nn/functional.py:806:0
This may cause errors in trace checking. To disable trace checking, pass check_trace=False to torch.jit.trace()
check_tolerance, _force_outplace, True, _module_class)
/Users/Arseny/.pyenv/versions/3.6.6/lib/python3.6/site-packages/torch/jit/__init__.py:914: TracerWarning: Output nr 1. of the traced function does not match the corresponding output of the Python function. Detailed error:
Not within tolerance rtol=1e-05 atol=1e-05 at input[5] (0.0 vs. 0.5454154014587402) and 5 other locations (60.00%)
check_tolerance, _force_outplace, True, _module_class)

A simple fix:

In [4]: model = nn.Sequential(
   ...:     nn.Linear(10, 10),
   ...:     nn.Dropout(.5)
   ...: )
   ...:
   ...: traced_model = torch.jit.trace(model.eval(), torch.rand(10))
# No more warnings!

Under the hood, torch.jit.trace runs the model several times and compares the results. The difference here is suspicious: with dropout still active, the outputs are nondeterministic.

However, torch.jit.trace is not a panacea here. It's the kind of PyTorch nuance one should know and remember.
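One more cheap safeguard I'd suggest (my own assumption, not from the article): assert inside the predictor that the model really is in eval mode before running inference. A sketch of a drop-in replacement for the process method of the class above:

def process(self, x, *kw):
    # fail loudly if someone forgot to switch the model to eval mode
    assert not self.model.training, 'model must be in eval mode at inference time'
    with torch.no_grad():
        return self.model(x)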

5. The copy-paste problem

A lot of things exist in pairs: train and validation, width and height, latitude and longitude… If you read carefully, you can easily find a bug caused by copy-pasting code from one member of a pair to the other:

from multiprocessing import cpu_count
from torch.utils.data import DataLoader

def make_dataloaders(train_cfg, val_cfg, batch_size):
    train = Dataset.from_config(train_cfg)
    val = Dataset.from_config(val_cfg)
    shared_params = {'batch_size': batch_size, 'shuffle': True, 'num_workers': cpu_count()}
    train = DataLoader(train, **shared_params)
    val = DataLoader(train, **shared_params)
    return train, val

It’s not only me making silly mistakes. For example, there was a similar one in the popular albumentations library.

# https://github.com/albu/albumentations/blob/0.3.0/albumentations/augmentations/transforms.py
def apply_to_keypoint(self, keypoint, crop_height=0, crop_width=0, h_start=0, w_start=0, rows=0, cols=0, **params):
    keypoint = F.keypoint_random_crop(keypoint, crop_height, crop_width, h_start, w_start, rows, cols)
    scale_x = self.width / crop_height
    scale_y = self.height / crop_height
    keypoint = F.keypoint_scale(keypoint, scale_x, scale_y)
    return keypoint

Don’t worry, it’s already fixed.

How to avoid it? Don’t copy and paste code; try to write code in a way that doesn’t require copy-pasting.

👎

datasets = []
data_a = get_dataset(MyDataset(config['dataset_a']), config['shared_param'], param_a)
datasets.append(data_a)
data_b = get_dataset(MyDataset(config['dataset_b']), config['shared_param'], param_b)
datasets.append(data_b)

👍

datasets = []
for name, param in zip(('dataset_a', 'dataset_b'),
                       (param_a, param_b),
                       ):
    datasets.append(get_dataset(MyDataset(config[name]), config['shared_param'], param))
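The same idea applies to the make_dataloaders example above. Here is a minimal sketch of my own (not the article’s code) that builds both loaders in one place, so the train/val pair can’t silently get mixed up:

def make_dataloaders(train_cfg, val_cfg, batch_size):
    shared_params = {'batch_size': batch_size, 'num_workers': cpu_count()}
    loaders = {}
    for name, cfg in (('train', train_cfg), ('val', val_cfg)):
        dataset = Dataset.from_config(cfg)
        # shuffle only the training data
        loaders[name] = DataLoader(dataset, shuffle=(name == 'train'), **shared_params)
    return loaders['train'], loaders['val']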

6. Proper data types

Let’s code a new augmentation:

def add_noise(img: np.ndarray) -> np.ndarray:
    mask = np.random.rand(*img.shape) + .5
    img = img.astype('float32') * mask
    return img.astype('uint8')

The image has been altered. Is it what we expected? Hmm, maybe it’s altered too much.

There is a dangerous operation here: casting float32 to uint8. It can cause an overflow:

def add_noise(img: np.ndarray) -> np.ndarray:
    mask = np.random.rand(*img.shape) + .5
    img = img.astype('float32') * mask
    return np.clip(img, 0, 255).astype('uint8')

img = add_noise(cv2.imread('two_hands.jpg')[:, :, ::-1])
_ = plt.imshow(img)

Looks much better, huh?

By the way, there is one more way to avoid this issue: don’t reinvent the wheel. Don’t code the augmentation from scratch; use an existing one, e.g. albumentations.augmentations.transforms.GaussNoise.
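A minimal usage sketch (the parameters here are illustrative; check the albumentations docs for the exact signature in your version):

from albumentations import GaussNoise

aug = GaussNoise(p=1.0)          # always apply, for demonstration purposes
noisy = aug(image=img)['image']  # albumentations transforms return a dict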

I made another bug of the same origin once.

raw_mask = cv2.imread('mask_small.png')
mask = raw_mask.astype('float32') / 255
mask = cv2.resize(mask, (64, 64), interpolation=cv2.INTER_LINEAR)
mask = cv2.resize(mask, (128, 128), interpolation=cv2.INTER_CUBIC)
mask = (mask * 255).astype('uint8')
_ = plt.imshow(np.hstack((raw_mask, mask)))

What is wrong here? First of all, it is a bad idea to resize a mask using cubic interpolation. And there is the same issue with casting float32 to uint8: cubic interpolation can output values bigger than the input, which leads to overflow.

I found this issue while doing visualization. It is also a good idea to have assertions here and there in your training loop.
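A minimal sketch of a safer variant of the snippet above (my assumption: linear interpolation is good enough for masks here):

raw_mask = cv2.imread('mask_small.png')
mask = raw_mask.astype('float32') / 255
mask = cv2.resize(mask, (64, 64), interpolation=cv2.INTER_LINEAR)
mask = cv2.resize(mask, (128, 128), interpolation=cv2.INTER_LINEAR)
# clip before casting back, so any interpolation overshoot cannot overflow
mask = (np.clip(mask, 0, 1) * 255).astype('uint8')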

7. Typos happen

Assume one needs to run inference for a fully convolutional network (e.g. for a semantic segmentation problem) on a huge image. The image is so huge that there is no chance to fit it into your GPU memory; for example, it can be a medical or satellite image.

In such cases, one may slice the image into a grid, run inference on each piece independently, and finally merge the results. Also, some overlap between predictions can be useful to smooth out artifacts near the tile borders.

Let’s code it!

from typing import Optional
from tqdm import tqdm

class GridPredictor:
    """
    This class can be used to predict a segmentation mask for the big image
    when you have GPU memory limitation
    """
    def __init__(self, predictor: AbstractPredictor, size: int, stride: Optional[int] = None):
        self.predictor = predictor
        self.size = size
        self.stride = stride if stride is not None else size // 2

    def __call__(self, x: np.ndarray):
        h, w, _ = x.shape
        mask = np.zeros((h, w, 1), dtype='float32')
        weights = mask.copy()
        for i in tqdm(range(0, h - 1, self.stride)):
            for j in range(0, w - 1, self.stride):
                a, b, c, d = i, min(h, i + self.size), j, min(w, j + self.size)
                patch = x[a:b, c:d, :]
                mask[a:b, c:d, :] += np.expand_dims(self.predictor(patch), -1)
                weights[a:b, c:d, :] = 1
        return mask / weights

There is a one-symbol typo here, and the snippet is big enough that it’s hard to spot. I doubt one could rapidly identify it just by reading through the code. But it’s easy to check whether the code is correct:

class Model(nn.Module):
    def forward(self, x):
        return x.mean(axis=-1)

model = Model()
grid_predictor = GridPredictor(model, size=128, stride=64)
simple_pred = np.expand_dims(model(img), -1)
grid_pred = grid_predictor(img)
np.testing.assert_allclose(simple_pred, grid_pred, atol=.001)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-24-a72034c717e9> in <module>
9 grid_pred = grid_predictor(img)
10
---> 11 np.testing.assert_allclose(simple_pred, grid_pred, atol=.001)
~/.pyenv/versions/3.6.6/lib/python3.6/site-packages/numpy/testing/_private/utils.py in assert_allclose(actual, desired, rtol, atol, equal_nan, err_msg, verbose)
1513 header = 'Not equal to tolerance rtol=%g, atol=%g' % (rtol, atol)
1514 assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
-> 1515 verbose=verbose, header=header, equal_nan=equal_nan)
1516
1517
~/.pyenv/versions/3.6.6/lib/python3.6/site-packages/numpy/testing/_private/utils.py in assert_array_compare(comparison, x, y, err_msg, verbose, header, precision, equal_nan, equal_inf)
839 verbose=verbose, header=header,
840 names=('x', 'y'), precision=precision)
--> 841 raise AssertionError(msg)
842 except ValueError:
843 import traceback
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0.001
Mismatch: 99.6%
Max absolute difference: 765.
Max relative difference: 0.75000001
x: array([[[215.333333],
[192.666667],
[250. ],...
y: array([[[ 215.33333],
[ 192.66667],
[ 250. ],...

The correct version of the __call__ method is below:

def __call__(self, x: np.ndarray):
    h, w, _ = x.shape
    mask = np.zeros((h, w, 1), dtype='float32')
    weights = mask.copy()
    for i in tqdm(range(0, h - 1, self.stride)):
        for j in range(0, w - 1, self.stride):
            a, b, c, d = i, min(h, i + self.size), j, min(w, j + self.size)
            patch = x[a:b, c:d, :]
            mask[a:b, c:d, :] += np.expand_dims(self.predictor(patch), -1)
            weights[a:b, c:d, :] += 1
    return mask / weights

Pay attention to the line weights[a:b, c:d, :] += 1 if you still haven’t spotted the issue: the buggy version assigned 1 instead of accumulating, so the overlapping tiles were not averaged correctly by the final mask / weights division.

8. ImageNet normalization

When one needs to do transfer learning, it’s often a good idea to normalize your images the same way it was done during ImageNet training.

Let’s do it with the albumentations library we’re already familiar with.

import torch
import torch.nn.functional as F
from albumentations import Normalize

norm = Normalize()

img = cv2.imread('img_small.jpg')
mask = cv2.imread('mask_small.png', cv2.IMREAD_GRAYSCALE)
mask = np.expand_dims(mask, -1)  # shape (64, 64) -> shape (64, 64, 1)
normed = norm(image=img, mask=mask)
img, mask = [normed[x] for x in ['image', 'mask']]

def img_to_batch(x):
    x = np.transpose(x, (2, 0, 1)).astype('float32')
    return torch.from_numpy(np.expand_dims(x, 0))

img, mask = map(img_to_batch, (img, mask))
criterion = F.binary_cross_entropy

And it’s time to train a network and overfit on a single image; as I’ve mentioned, it’s a nice debugging technique:

model_a = UNet(3, 1)
optimizer = torch.optim.Adam(model_a.parameters(), lr=1e-3)
losses = []
for t in tqdm(range(20)):
    loss = criterion(model_a(img), mask)
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

_ = plt.plot(losses)

The curve looks great, but a loss value of -300 is not something you expect from cross-entropy. What’s the problem?

The normalization worked fine for the image but not for the mask: one needs to scale it to [0, 1] manually.

model_b = UNet(3, 1)
optimizer = torch.optim.Adam(model_b.parameters(), lr=1e-3)
losses = []
for t in tqdm(range(20)):
    loss = criterion(model_b(img), mask / 255.)
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

_ = plt.plot(losses)

A simple runtime assertion in the training loop (e.g. assert mask.max() <= 1) would detect the problem pretty fast. Again, it could be a unit test as well.
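A minimal sketch of how such an assertion could sit inside the loop above (my own addition, not the article’s code):

for t in tqdm(range(20)):
    # catch an unscaled mask before it silently breaks the loss
    assert mask.min() >= 0 and mask.max() <= 1, 'mask must be scaled to [0, 1]'
    loss = criterion(model_b(img), mask)
    losses.append(loss.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()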

TL;DR by Captain Obvious

  • tests matter;
  • runtime assertions are OK for a training pipeline;
  • visualization is a blessing;
  • copy-paste is a curse;
  • nothing is a silver bullet; a machine learning engineer always has to be careful (or just suffer).

I thank people from the ODS.ai community for their useful feedback on this topic.
