Six degrees of freedom: a quick intro to 3D object detection
A simple guide on how to build a computer vision pipeline useful for AR purposes
In computer vision, one usually works with two-dimensional images and much less often with 3D objects. As a result, many ML engineers feel insecure in this area: there is a lot of unfamiliar vocabulary, and it is not obvious how to apply old friends like ResNet and U-Net. That is why today I would like to talk a little about 3D using six degrees of freedom estimation as an example, a task that is in some sense synonymous with 3D object detection. I will review one of the relatively recent works on this subject, with a few digressions along the way.
Let's first define what 6 DoF (six degrees of freedom) means. Imagine a rigid object (i.e. one whose points all stay at the same distance from each other during any transformation) in the three-dimensional world. To describe its position relative to an observer you need six numbers: three for the rotation around the axes and three for the translation along them. Having these six numbers, we can tell how the object is positioned relative to some reference frame (for example, the point from which the photo was taken). This task is classic for robotics (where is the object the manipulator should grab?), augmented reality (where should MSQRD draw a mask, Snapchat draw ears, or Wanna Kicks draw sneakers?), self-driving vehicles, and other domains.
We will review the paper MobilePose: Real-Time Pose Estimation for Unseen Objects with Weak Shape Supervision (Hou et al., 2020). Written by authors from Google Research, it offers a reliable and, importantly, fast pipeline for solving the problem, and it is convenient to go through it piece by piece.
The pipeline contains three main pieces:
- The backbone is quite classic: the hourglass architecture should be familiar to anyone who has ever trained a U-Net (a minimal sketch of such a backbone with its heads follows this list).
- The outputs of the network are not innovative either: detection, regression, all familiar words! The shape output may raise questions, but let's set them aside for a while.
- Postprocessing may seem mysterious to those who are out of the loop. What is EPnP, and why does it turn 2D points into a 3D bounding box?
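To make the first two bullets concrete, here is a minimal PyTorch sketch of an hourglass-style (encoder-decoder) backbone with detection, regression, and shape heads. The channel counts, resolutions, and head shapes are my own placeholder choices, not the ones from the paper:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # 3x3 conv + BN + ReLU, the usual building block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class TinyHourglass(nn.Module):
    """Toy encoder-decoder with detection / regression / shape heads.
    Channel counts and output layouts are illustrative placeholders."""
    def __init__(self, n_vertices=8):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = conv_block(64, 128)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec2 = conv_block(128 + 64, 64)
        self.dec1 = conv_block(64 + 32, 32)
        # heads: object-center heatmap, per-vertex 2D offsets, shape (mask or coordinate map)
        self.detection = nn.Conv2d(32, 1, 1)
        self.regression = nn.Conv2d(32, 2 * n_vertices, 1)
        self.shape = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up(b), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up(d2), e1], dim=1))
        return self.detection(d1), self.regression(d1), self.shape(d1)

center_heatmap, vertex_offsets, shape_map = TinyHourglass()(torch.randn(1, 3, 256, 256))
```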
3D for Dummies
And here we need to make an important digression that will help us answer all these questions. Let's take a high-level look at some of the mathematics of the 3D world. Suppose we have a set of 3D points X, a matrix of size (n, 3), where n is the number of points. As we remember, the six degrees of freedom are three rotations and three translations, i.e. a rigid transformation. If we denote the rotation matrix by R and the translation vector by t, the following equation holds:

X' = X @ R.T + t
R and t are precisely what we want to find in this task: they describe how our rigid object was translated and rotated so that it ends up where we see it now.
But X' still contains 3D coordinates. So we also need a projection matrix P. This matrix describes how we project an object onto a two-dimensional plane, i.e., loosely speaking, how we "render" it. It depends on the image size, focal length, and distortions, but within our task it can be considered constant. Given such a matrix, the 2D coordinates of the points are obtained by simply multiplying it with X':

x = X' @ P
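A minimal NumPy sketch of these two formulas with a toy pinhole camera; the rotation angles, translation, and focal length are arbitrary illustration values. Note that a real projection also involves dividing by depth, which the compact "@ P" notation hides, so the sketch writes it out explicitly:

```python
import numpy as np
from scipy.spatial.transform import Rotation

# toy object: 8 vertices of a unit cube centered at the origin, shape (n, 3)
X = np.array([[x, y, z] for x in (-.5, .5) for y in (-.5, .5) for z in (-.5, .5)])

# the six degrees of freedom: three rotation angles and three translations
R = Rotation.from_euler("xyz", [10, 30, 5], degrees=True).as_matrix()
t = np.array([0.1, -0.2, 4.0])  # push the object in front of the camera (z > 0)

X_prime = X @ R.T + t  # rigid transformation, still 3D points

# toy pinhole projection: perspective divide by depth, then scale by focal length
f = 500.0  # focal length in pixels, an arbitrary value
x_2d = f * X_prime[:, :2] / X_prime[:, 2:3]  # shape (n, 2), pixel coordinates
```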
As simply as possible: we are looking for a way to rotate and translate a certain object so that its points project to where we see them in the photo. I have oversimplified this obscenely, so anyone who wants real enlightenment should go look at CS231a.
The subtask of finding R and t, given X, x, and P, is called Perspective-n-Point (PnP). In other words, we know what our object looks like in 3D (that is X), how it is projected onto the image (P), and where its points are in that image (x). Looks like an optimization problem! There is a whole family of algorithms that solve it, and some are already implemented in OpenCV.
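For instance, a minimal sketch of solving PnP with OpenCV's EPnP solver. The 2D observations are fabricated here by projecting with a known pose; in practice they come from your keypoint detector, and the camera matrix comes from calibration:

```python
import cv2
import numpy as np

# model points X: 8 vertices of a unit cube, shape (n, 3)
object_points = np.array([[x, y, z] for x in (-.5, .5) for y in (-.5, .5) for z in (-.5, .5)],
                         dtype=np.float32)

# pinhole camera matrix; focal length and principal point are illustrative values
camera_matrix = np.array([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
dist_coeffs = np.zeros(5)  # assume no lens distortion

# fabricate 2D observations x by projecting with a known pose
true_rvec = np.array([0.1, 0.4, -0.2])
true_tvec = np.array([0.2, -0.1, 5.0])
image_points, _ = cv2.projectPoints(object_points, true_rvec, true_tvec,
                                    camera_matrix, dist_coeffs)

# recover the pose from X, x, and the camera matrix
ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs,
                              flags=cv2.SOLVEPNP_EPNP)
R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 rotation matrix; tvec is the translation
```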
More links on the topic:
- Monocular Model-Based 3D Tracking of Rigid Objects: A Survey (Lepetit et al., 2005) — a classic review;
- EPnP: An Accurate O(n) Solution to the PnP Problem (Lepetit et al., 2008) — a strong baseline;
- PnP-Net: A hybrid Perspective-n-Point Network (Sheffer and Wiesel, 2020) — for those who want to add a bit of deep learning to PnP classics.
By the way, this problem is also approached from the other side. Deep learning adepts can find many papers that use a special projection layer to convert between 2D and 3D points. Such layers are usually trained on synthetic data, because obtaining real-world 3D coordinates is expensive and difficult. One of the papers on this topic is Single Image 3D Interpreter Network by Wu et al.
Where do I get 3D points?
So, we need X — let me remind you, this is a set of 3D points. Where do we get it from?
The easiest option is to assume that the very same object appears in every image. We take a 3D CAD model (a ready-made one, one drawn from scratch, a scan of the real object made with a special scanner…) and use it (more precisely, some of its points) as X. In other words, we simply assume that "exactly the same object is in the photo". At first glance this looks brazen, but practice shows it is enough for estimating 6 DoF.
A more complex approach is so-called parameterized models. The Basel Face Model is a good example. Researchers from the University of Basel scanned many faces and used PCA to build a model with a useful property: changing a small number of its parameters generates different plausible 3D faces. So you can twist a handful of knobs (the principal components) and get quite different models.
A parameterized model can be much simpler, though. For example, if we are looking for a 3D bounding box in a photo, we can use a cuboid as the base model and parameterize it by its length-width-height ratio.
If our 3D model is parameterized, its parameters can be fitted with various iterative methods, and we pick the ones that yield the smallest reprojection error. That is, we take some model X, solve PnP, get R and t, and choose the X for which the difference between x and (X @ R.T + t) @ P is minimal; see, for example, Procrustes analysis.
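A toy sketch of this search for the cuboid case: enumerate a few candidate aspect ratios, solve PnP for each, and keep the candidate with the smallest reprojection error. The candidate ratios are made up for illustration, and x, camera_matrix, dist_coeffs are assumed to come from the detector and calibration as above:

```python
import itertools
import cv2
import numpy as np

def cuboid_vertices(w, h, d):
    # 8 vertices of a box with the given side ratios, centered at the origin;
    # only the ratios matter, overall scale gets absorbed into the translation
    return np.array([[sx * w, sy * h, sz * d]
                     for sx in (-.5, .5) for sy in (-.5, .5) for sz in (-.5, .5)],
                    dtype=np.float32)

def reprojection_error(X, x, camera_matrix, dist_coeffs):
    # solve PnP for this candidate model and measure how well it reprojects onto x
    ok, rvec, tvec = cv2.solvePnP(X, x, camera_matrix, dist_coeffs, flags=cv2.SOLVEPNP_EPNP)
    projected, _ = cv2.projectPoints(X, rvec, tvec, camera_matrix, dist_coeffs)
    return np.linalg.norm(projected.reshape(-1, 2) - x.reshape(-1, 2), axis=1).mean()

def fit_cuboid(x, camera_matrix, dist_coeffs):
    # x: (8, 2) detected 2D vertices; the candidate ratios below are arbitrary
    candidates = itertools.product([0.5, 1.0, 1.5], repeat=3)
    return min((cuboid_vertices(*whd) for whd in candidates),
               key=lambda X: reprojection_error(X, x, camera_matrix, dist_coeffs))
```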
True deep learners go further and, in one way or another, learn either the 3D model or its parameters. A good example is the famous DensePose work from Facebook Research, which popularized learning UV-map coordinates: the model predicts, for each pixel, its relative location on the 3D model. You can then match these predictions to the model and get, for each pixel, an approximation of its 3D coordinates.
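A rough sketch of that matching step, assuming we already have a per-pixel UV prediction and a precomputed lookup table from UV coordinates to 3D points on the model surface (both inputs are hypothetical, not part of any particular library):

```python
import numpy as np

def pixels_to_3d(uv_map, uv_to_xyz):
    """uv_map: (H, W, 2) predicted UV coordinates in [0, 1];
    uv_to_xyz: (U, V, 3) precomputed grid mapping UV cells to 3D points on the model."""
    U, V, _ = uv_to_xyz.shape
    u_idx = np.clip((uv_map[..., 0] * (U - 1)).round().astype(int), 0, U - 1)
    v_idx = np.clip((uv_map[..., 1] * (V - 1)).round().astype(int), 0, V - 1)
    return uv_to_xyz[u_idx, v_idx]  # (H, W, 3): an approximate 3D coordinate for every pixel
```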
The paper under review has an output named shape. Depending on what ground truth is available (more on that a bit later), the authors train it to predict either a segmentation mask of the object (so-called weak supervision, which improves convergence) or an actual map of the object's coordinates.
Another question may arise: exactly which points' coordinates do we want to find? The answer is simple: it does not really matter. A 3D model usually consists of thousands of vertices, and we can choose any subset we like. The only more or less important criterion is that the points be roughly equidistant from each other; if the sampling is bad, the PnP solution becomes unstable.
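One common way to get such a roughly equidistant subset is farthest point sampling; a minimal NumPy version (the mesh vertices are whatever your 3D model provides):

```python
import numpy as np

def farthest_point_sampling(vertices, k, seed=0):
    """Greedily pick k vertices that are spread out over the model.
    vertices: (n, 3) array of mesh vertices."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(vertices)))]  # start from a random vertex
    dist = np.full(len(vertices), np.inf)
    for _ in range(k - 1):
        # distance from every vertex to the closest already-chosen one
        dist = np.minimum(dist, np.linalg.norm(vertices - vertices[chosen[-1]], axis=1))
        chosen.append(int(dist.argmax()))        # take the vertex farthest from the chosen set
    return vertices[chosen]
```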
Where do we get 2D points?
So, the 3D side has been explained to some extent; let's move to an area more familiar to most CV engineers, i.e. return to the 2D domain and think about where we get the coordinates on the image plane.
Two basic approaches are popular for obtaining 2D point coordinates (the task is usually called keypoint detection): direct regression (the last layer of the network outputs x, y for each point) and heatmap regression (the last layer outputs a heatmap with a blob around each point). The first approach can be faster, since it does not require a full encoder-decoder architecture; the second is usually more accurate and quite easy to train (it is almost a semantic segmentation problem).
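For reference, decoding a heatmap back into coordinates can be as simple as an argmax per channel; a tiny NumPy sketch, assuming a channels-first layout with one channel per keypoint:

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """heatmaps: (K, H, W), one channel per keypoint. Returns a (K, 2) array of (x, y)."""
    K, H, W = heatmaps.shape
    flat_idx = heatmaps.reshape(K, -1).argmax(axis=1)  # peak location in each channel
    ys, xs = np.unravel_index(flat_idx, (H, W))
    return np.stack([xs, ys], axis=1)
```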
The authors of MobilePose followed neither of these paths and came up with an interesting hybrid, inspired by modern anchor-free detection architectures such as CenterNet. Remember the Detection and Regression heads in the diagram? The Detection head predicts the center of the object, and the Regression head predicts where the object's vertices are relative to that center.
This approach has an elegant trick. I wrote a lot above about how to choose a 3D model, and here everything is deliberately simplified: a plain cuboid, a.k.a. the 3D bounding box, is used as the model! That is, X is the set of box vertices and x is the projection of those vertices. So it is enough to know the aspect ratios of this box (which we get from the architecture of the detector itself), and no other magic is needed.
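A sketch of how such outputs might be decoded into 2D vertices (the tensor layouts and names are my assumptions, not the paper's code): find the peak of the center heatmap, then read the per-vertex offsets predicted at that location; the resulting 2D points, together with the cuboid vertices from above, go into EPnP.

```python
import numpy as np

def decode_box_vertices(center_heatmap, offsets):
    """center_heatmap: (H, W); offsets: (2 * n_vertices, H, W) with (dx, dy) per vertex,
    both assumed to be network outputs at the same resolution, offsets in pixels."""
    cy, cx = np.unravel_index(center_heatmap.argmax(), center_heatmap.shape)
    d = offsets[:, cy, cx].reshape(-1, 2)  # predicted displacements at the peak
    return np.array([cx, cy]) + d          # (n_vertices, 2) 2D vertex coordinates
```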
Data preparation nuances
In all 3D-adjacent tasks, the data issue is even more painful than usual. Annotation for segmentation or detection also takes a lot of time, but there we already have many tools, the process is more or less clear and debugged, and plenty of datasets exist. Three-dimensional data is a bigger problem, so it is tempting to use synthetic data and render an endless dataset. The trouble is that models trained on synthetic data without extra tricks usually perform noticeably worse than models trained on real data.
The authors of the reviewed paper used one such trick: instead of thoughtlessly rendering the whole world, they combined real and synthetic data. To do so, they took videos from AR applications (it's good to be Google and have a lot of ARCore data). In such videos, the planes, the 6 DoF pose obtained from visual odometry, and the illumination estimate are already known. This makes it possible to render artificial objects not just anywhere but on planar surfaces, with matching illumination, which significantly reduces the reality gap between synthetic and real data. Unfortunately, this trick seems quite hard to repeat at home.
Yay, we have galloped through all the key concepts of the pipeline! This should be enough for the reader to build, from open-source components, for example an application that draws a mask on a face (you do not even have to train the models yourself; there are plenty of ready-made face alignment networks).
Of course, it will only be a prototype. Bringing such an application to production raises more questions, for example:
- How to achieve consistency between frames?
- What to do with non-rigid objects?
- What to do if the object is partially invisible?
- What to do if the frame contains many objects?
That’s where the real adventure begins.
—
This article was originally written in Russian and posted at Habr.com. Also, if you've liked the post and speak Russian, please join my Telegram channel partially unsupervised — my short notes on machine learning and software engineering.