šŸ™ˆ Face recognition šŸ™‰

2021, Feb 15    

Hello reader! šŸ‘‹ Welcome to another of my blog posts. How're you doing? I hope everything's good with you. If not, may God bless you. In this article, I'm going to share with you the core ideas and building blocks of face recognition. So hold tight, because we're going for a roller-coaster ride. Are you ready? Well, then let me be your driver.

	
                                           o
                                         o |
                                         |
      .       .           ._._.    _                     .===.
      |`      |`        ..'\ /`.. |H|        .--.      .:'   `:.
     //\-...-/|\         |- o -|  |H|`.     /||||\     ||     ||
 ._.'//////,'|||`._.    '`./|\.'` |\\||:. .'||||||`.   `:.   .:'
 ||||||||||||[ ]||||      /_T_\   |:`:.--'||||||||||`--..`=:='...  `

Introduction

Face verification, face detection and face recognition often come up together in discussion because they have something in common. So how are they different? Let us look at each of them.

  1. Face detection
    • It is the first step towards many face-related technologies such as face verification and face recognition. It is used to detect faces in an image, identify key facial features, and get the contours of the detected faces. To put it simply, it returns either yes or no for the presence of a face in the image.
    • Input: An image
    • Output: \(\left\{\begin{matrix} 1 & if\; face \\ 0 & if\; no\; face \end{matrix}\right.\)
  2. Face Verification
    • Face verification follows face detection. Once we're sure there is a face in the image, we verify whether it is of the claimed person.
    • It is a \(1:1\) type of problem where an input image of a person is compared against that particular person's image in the database and verified.
    • Input: An image, Name/ID
    • Output: Whether the image is that of a claimed person or not.
  3. Face recognition
    • Face recognition also follows face detection, and it is more than just face detection.
    • Unlike face verification, it is a \(1:K\) type of problem where an input image of a person is compared against the images of \(K\) persons in the database.
    • Input: An image
    • Output: Return the ID if the image is of any of the \(K\) persons, otherwise \(not\; recognized\)

Understanding Face Recognition

How is a system able to recognize the image of a person it has not seen before? Strange, right? In a face recognition application, we must be able to recognize a person given just a single image, or just one example, of that person's face. Consider a scenario where you have 5 friends and want your system to recognize each friend's face. One thing you could do is train a neural network on their images. To get a better intuition about face recognition, you can check out a video of Baidu employees entering the office without needing to identify themselves.

$$ Image \rightarrow ConvNet \rightarrow \bigcirc \; (softmax\; unit\; with\; 6\; outputs, \\ one\; output\; being\; 'None\; of\; the\; above') $$
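To see why this doesn't scale, here is a toy sketch (assuming Keras; the layer sizes are made up) of such a classifier. Notice that the output layer is hard-wired to 6 classes, so every new friend means changing the architecture and re-training:

from tensorflow.keras import layers, models

# Toy sketch (assumption): a naive ConvNet classifier for 5 friends
# plus one 'None of the above' class, i.e. 6 softmax outputs in total
model = models.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(96, 96, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(6, activation='softmax'),  # fixed at 6; doesn't scale to new friends
])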

But what if you just met a new person and became friends? Will your system be able to recognize him/her? The answer is simply no. But you can still add his/her images and re-train the system. But what if you keep meeting new friends every day? Will you keep re-training the system with new images every day? šŸ˜‚šŸ˜‚šŸ˜‚ No way. So this isn't a good approach.

One-shot learning

Instead, you'd want to opt for 'one-shot learning'. In this approach, we want our neural network to learn a 'similarity function', i.e., a measure of the degree of similarity/difference between images. We input two images, and the function returns the degree of difference between them: \(d(img1\;, img2) = degree\; of\; difference\; between\; images\). At recognition time,

\(if\; d(img1\;, img2) \leq \tau\;(threshold)\; \rightarrow predict\; 'same' \\ if\; d(img1\;, img2) > \tau\;(threshold)\; \rightarrow predict\; 'different'\).

If you recall, you'll recognize that this is a verification problem. So for this case, you can use the above \(d\) function to compare a new image against the images of your friends (which are already in the database) and check the degree of difference. If \(d\) outputs a large number, it's probably because the face isn't in the database; otherwise \(d\) should output a smaller number. But you might be wondering: how is the value of \(d\) calculated? That is where the concept of the Siamese network comes in.

Siamese Network

This network is used to calculate the value of \(d\), but how? Okay, consider the following picture.

Fig1: Encoding Deadpool's image

Here, instead of passing the final layer through a sigmoid/softmax unit to produce a classification, we are only interested in the vector of 128 numbers the network computes, which we take as the output \(f(deadpool)\). Again, we feed this network Reynolds's image with the same parameters, and this time too we take only the encoded vector of 128 numbers as the output \(f(reynold)\).

Fig2: Encoding Reynolds's image

Now \(d\) is calculated as: \(d(deadpool, reynold) = \left \| f(deadpool) - f(reynold) \right \| ^ 2\). Therefore,

  • if \(deadpool\) and \(reynold\) are the same person, \(\left \| f(deadpool) - f(reynold) \right \| ^ 2\) is small.
  • if \(deadpool\) and \(reynold\) are different people, \(\left \| f(deadpool) - f(reynold) \right \| ^ 2\) is large.

So, the main idea is running two identical convolutional neural networks on two different inputs and then comparing their outputs; this is sometimes called a Siamese network architecture.
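As a quick illustration of this distance computation (the encodings below are random stand-ins for what the two identical networks would produce):

import numpy as np

# Stand-in 128-d encodings; in practice these come from the network
f_deadpool = np.random.rand(128)
f_reynold = np.random.rand(128)

# d(img1, img2) = ||f(img1) - f(img2)||^2, the squared L2 distance
d = np.sum(np.square(f_deadpool - f_reynold))
print(d)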

We want the distance (the value of \(d\)) to be as small as possible and the encodings to be similar for images of the same person; for images of different people, we want the distance to be larger and the encodings to be different. This gives rise to the triplet loss.

Triplet loss

For this, we take three images (a triplet) at a time: two of the same person and one of a different person. In order to achieve smaller differences between images of the same person and larger differences between images of different people, we always want:
\(\left \| f(A) - f(P) \right \| ^ 2 \leq \left \| f(A) - f(N) \right \| ^ 2\)
\(or, \left \| f(A) - f(P) \right \| ^ 2 - \left \| f(A) - f(N) \right \| ^ 2 \leq 0\)
where,
\(A\) = Anchor(main image)
\(P\) = Positive image(one which matches with anchor image)
\(N\) = Negative image(one which does not match with anchor image)
But we would like to prevent the neural network from setting all the encodings equal to each other and trivially outputting 0. So we modify the constraint so that the difference does not just need to be less than or equal to 0; rather, we say it needs to be less than or equal to \(- \alpha\) (negative alpha), which prevents the trivial solution. So we now have,

$$\left \| f(A) - f(P) \right \| ^ 2 - \left \| f(A) - f(N) \right \| ^ 2 \leq - \alpha$$ $$or, \left \| f(A) - f(P) \right \| ^ 2 - \left \| f(A) - f(N) \right \| ^ 2 + \alpha \leq 0$$

Here, \(\alpha\) is also called a margin. Also we know,
\(d(A, P) = \left \| f(A) - f(P) \right \| ^ 2\) and \(d(A, N) = \left \| f(A) - f(N) \right \| ^ 2\).
Therefore, above inequality can be generalized as:
\(d(A, P) + \alpha \leq d(A, N)\)
For example, let's say \(d(A, P) = 0.5\), \(\alpha = 0.2\) and \(d(A, N) = 0.51\). Even though \(d(A, N)\) is \(0.01\) larger than \(d(A, P)\), this isn't enough. Rather, we want \(d(A, N)\) to be bigger by at least the margin, i.e., at least 0.7 (\(0.5 + 0.2 \leq 0.7\)).
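In code, the margin check on those numbers looks like this:

d_AP, d_AN, alpha = 0.5, 0.51, 0.2

# The constraint d(A, P) + alpha <= d(A, N) from above
print(d_AP + alpha <= d_AN)  # False: 0.7 > 0.51, so the margin is still violated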

Therefore, given 3 images: Anchor(A), Positive image(P), Negative image(N), loss function is formulated as:

$$Loss(A, P, N) = max(\left \| f(A) - f(P) \right \| ^ 2 - \left \| f(A) - f(N) \right \| ^ 2 + \alpha,\; 0)$$ $$or, J = \sum_{i=1}^{m} Loss(A^{(i)}, P^{(i)}, N^{(i)})$$

Alternatively, face verification can also be posed as a binary classification problem: the pair of encodings is fed to a logistic regression unit with weights \(w_{k}\) and bias \(b\), and the output \(\hat{y}\) is calculated as:
\(\hat{y} = \sigma \left( \sum_{k=1}^{128} w_{k} \left | f(x^{(i)})_{k} - f(x^{(j)})_{k} \right | + b \right)\)
One nice property: since we don't need to store raw images, we can pre-compute the encodings of everyone in the database. Then, for a new image, we only compute one fresh encoding and compare it, instead of re-computing encodings for the whole database every single time. This pre-computation saves a significant amount of computation.
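As a minimal sketch of that binary-classification variant (assuming Keras; the wiring is illustrative, not the pre-trained pipeline used later in this post):

import tensorflow as tf
from tensorflow.keras import layers, Model

# Two pre-computed 128-d encodings go in; a single logistic unit acts on
# the absolute differences of their components, as in the formula above
enc_i = layers.Input(shape=(128,))
enc_j = layers.Input(shape=(128,))
abs_diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([enc_i, enc_j])
y_hat = layers.Dense(1, activation='sigmoid')(abs_diff)  # sigma(sum_k w_k |.| + b)
pair_model = Model(inputs=[enc_i, enc_j], outputs=y_hat)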

Now the question is, how do we choose the triplets \(A, P, N\)? If they are chosen randomly, \(d(A, P) + \alpha \leq d(A, N)\) is easily satisfied and the network learns little, so we should choose triplets that are hard to train on. But what about the training set size? For a total of 10k pictures of 1k persons, there would be about 10 pictures per person on average. After training, the system can be applied in a one-shot setting where we may have only one picture of a person. But if we had just 1 picture of each person at training time, we couldn't form any anchor-positive pairs and couldn't actually train the system.

Implementation

šŸ˜“šŸ˜“ Bored of reading such lengthy text? Or are you lost or confused? šŸ˜µšŸ˜µ Don't be, I'm here for you. šŸ˜…šŸ˜šŸ˜…šŸ˜ Take a ā˜•ā˜• coffee break and come back with a fresh mind. You ready?? Alright, let's start implementing it.

As discussed, face recognition builds on face verification. That means we will first write a function to verify a face and then use it iteratively to recognize a face. We will be using a pre-trained model for this tutorial which represents ConvNet activations using a "channels first" convention. Unlike in the post 'Car Detection with YOLO', a batch of images will be of shape \((m, n_{C}, n_{H}, n_{W})\) instead of \((m, n_{H}, n_{W}, n_{C})\). Alright, so let us get started step by step. Talking about the model: the FaceNet model takes a lot of data and a long time to train, which is why we will use an Inception model and load weights that someone else has already trained. The key things we need to know are:

  1. This network uses \(96\times 96\) dimensional RGB images as its input. Specifically, it inputs a face image (or a batch of \(m\) face images) as a tensor of shape \((m,n_{C},n_{H},n_{W})\) = \((m,3,96,96)\)
  2. It outputs a matrix of shape \((m,128)\) that encodes each input face image into a 128-dimensional vector.

Importing packages
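The original import cell isn't shown here, so below is a representative sketch of what it would contain, assuming the helper files fr_utils.py and inception_blocks_v2.py from the FaceNet assignment this post follows (they provide the img_to_encoding, load_weights_from_FaceNet and faceRecoModel functions used below):

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

# Match the "channels first" (m, n_C, n_H, n_W) convention used by the model
K.set_image_data_format('channels_first')

# Assumed helper modules shipped with the FaceNet assignment
from fr_utils import img_to_encoding, load_weights_from_FaceNet
from inception_blocks_v2 import faceRecoModel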

Encoding face images

The first thing we're going to do is encode the images (of which you already know why and how). This encoded image is a 128-dimensional vector. We'll use it to calculate the triplet loss, which is then further used in face verification. Let us first initialize the model with an input shape of \((3, 96, 96)\).

FRmodel = faceRecoModel(input_shape=(3, 96, 96))
print("Total Params:", FRmodel.count_params())

You can use the print statement to see the total number of parameters.

Computing triplet loss

Below are steps to compute the triplet loss:

  1. Compute the distance between the encodings of ā€œanchorā€ and ā€œpositiveā€: \(\left \| f(A^{(i)}) - f(P^{(i)}) \right \| ^ 2\)
  2. Compute the distance between the encodings of ā€œanchorā€ and ā€œnegativeā€: \(\left \| f(A^{(i)}) - f(N^{(i)}) \right \| ^ 2\)
  3. Compute the formula per training example: \(\left \| f(A^{(i)}) - f(P^{(i)}) \right \| ^2 - \left \| f(A^{(i)}) - f(N^{(i)}) \right \| ^ 2 + \alpha\)
  4. Compute the full formula by taking the max with zero and summing over the training examples: \(J = \sum_{i=1}^{m} max(\left \| f(A^{(i)}) - f(P^{(i)}) \right \| ^2 - \left \| f(A^{(i)}) - f(N^{(i)}) \right \| ^ 2 + \alpha,\; 0)\)
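Putting the four steps together, here is a minimal sketch of the triplet loss in TensorFlow (the exact signature is an assumption; y_pred is taken to carry the three encodings):

import tensorflow as tf

def triplet_loss(y_true, y_pred, alpha=0.2):
    # y_pred is assumed to be a list of three encodings: anchor, positive, negative
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    # Step 1: squared distance between anchor and positive
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    # Step 2: squared distance between anchor and negative
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # Step 3: the per-example formula
    basic_loss = pos_dist - neg_dist + alpha
    # Step 4: take the max with zero and sum over the training examples
    return tf.reduce_sum(tf.maximum(basic_loss, 0.0))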

Loading pre-trained model

As mentioned already, FaceNet is trained by minimizing the triplet loss. But that requires a lot of training data and computation, so we won't implement the training from scratch here. Instead, let's load a pre-trained model.

FRmodel.compile(optimizer = 'adam', loss = triplet_loss, metrics = ['accuracy'])
load_weights_from_FaceNet(FRmodel)

Face Verification

Before applying face verification, let us first create a database dictionary where we'll store the corresponding image encoding of each person.

database = {}
database["danielle"] = img_to_encoding("images/danielle.png", FRmodel)
database["younes"] = img_to_encoding("images/younes.jpg", FRmodel)
database["tian"] = img_to_encoding("images/tian.jpg", FRmodel)
database["andrew"] = img_to_encoding("images/andrew.jpg", FRmodel)
database["kian"] = img_to_encoding("images/kian.jpg", FRmodel)
database["dan"] = img_to_encoding("images/dan.jpg", FRmodel)
database["sebastiano"] = img_to_encoding("images/sebastiano.jpg", FRmodel)
database["bertrand"] = img_to_encoding("images/bertrand.jpg", FRmodel)
database["kevin"] = img_to_encoding("images/kevin.jpg", FRmodel)
database["felix"] = img_to_encoding("images/felix.jpg", FRmodel)
database["benoit"] = img_to_encoding("images/benoit.jpg", FRmodel)
database["arnaud"] = img_to_encoding("images/arnaud.jpg", FRmodel)

The img_to_encoding function takes two parameters, \(image\) and \(model\), and runs a forward propagation of the model on the specified image. You can find the implementation of this function here.
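With the database in place, verification boils down to encoding the new image and comparing it to the stored encoding; recognition then loops that comparison over everyone in the database. Here is a sketch of both (the 0.7 threshold \(\tau\) is an assumption that works reasonably for this model):

import numpy as np

def verify(image_path, identity, database, model):
    # 1:1 -- is the person in image_path the claimed identity?
    encoding = img_to_encoding(image_path, model)
    dist = np.linalg.norm(encoding - database[identity])
    door_open = dist < 0.7  # assumed threshold tau
    if door_open:
        print("It's " + str(identity) + ", welcome!")
    else:
        print("It's not " + str(identity) + ", please go away")
    return dist, door_open

def who_is_it(image_path, database, model):
    # 1:K -- compare against every person in the database and keep the closest
    encoding = img_to_encoding(image_path, model)
    min_dist, identity = 100.0, None
    for name, db_enc in database.items():
        dist = np.linalg.norm(encoding - db_enc)
        if dist < min_dist:
            min_dist, identity = dist, name
    if min_dist > 0.7:
        print("Not recognized")
    else:
        print("It's " + str(identity) + ", the distance is " + str(min_dist))
    return min_dist, identity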
