AI For Filmmaking

Recognising Shot Types with ResNets


Rahul Somani, 10th August 2019


Analysing cinema is a time-consuming process. In the cinematography domain alone, there are many factors to consider, such as shot scale, shot composition, camera movement, color, lighting, etc. Whatever you shoot is in some way influenced by what you've watched. There's only so much one can watch, and even less that one can analyse thoroughly.

This is where neural networks offer ample promise. They can recognise patterns in images in ways that weren't possible until less than a decade ago, offering an enormous speed-up in analysing cinema. I've developed a neural network that focuses on one fundamental element of visual grammar: shot types. It's capable of recognising 6 unique shot types and is ~91% accurate. The pretrained model, the validation dataset (the set of images used to determine its accuracy), the code used to train the network, and some more code to classify your own images are all freely available.



What is Visual Language, and Why Does it Matter?


When you're writing something, whether an email, an essay, a report, or a paper, you're using the rules of grammar to put forth your point. Your choice of words, the way you construct the sentence, correct use of punctuation, and, most importantly, what you have to say all contribute towards the effectiveness of your message.

Cinema is about how ideas and emotions are expressed through a visual form. It's a visual language, and just like any written language, your choice of words (what you put in the shot/frame), the way you construct the sentence (the sequence of shots), correct use of punctuation (editing & continuity) and what you have to say (the story) are key factors in creating effective cinema. The comparison doesn't apply rigidly, but it's a good starting point for thinking about cinema as a language.


The most basic element of this language is a shot. There are many factors to consider while filming a shot: how big should the subject be? Should the camera be placed above or below the subject? How long should the shot be? Should the camera remain still or move with the subject? And if it moves, how should it move: should it follow the subject or observe it from a fixed point while turning left/right or up/down, and should the movement be smooth or jerky? There are other major visual factors, such as color and lighting, but we'll restrict our scope to the factors above. A filmmaker chooses how to construct a shot based on what he/she wants to convey, and then juxtaposes shots effectively to drive home the message.


Let's consider this scene from Interstellar. To give you some context, a crew of two researchers and a pilot land on a mysterious planet to collect crucial data from the debris of a previous mission. This planet is very different from Earth — it is covered in an endless ocean, and its gravity is 130% of Earth's.


This scene consists of 89 shots, and the average length of each shot is 2.66 seconds.

For almost all the shots showing Cooper (Matthew McConaughey) inside the spacecraft, Nolan uses a Medium Close Up, showing Cooper from the chest up. This allows us to see his facial expressions clearly, as well as a bit of the spacecraft he's in and his upper body movements. Notice how the camera isn't 100% stable. The camera moves slightly according to Cooper's movements, making us feel more involved in this scene.




In a Wide Shot, the emphasis is on the space that the characters are in. We're given our first glimpse of the planet with a wide shot. The entire scene has 10 wide shots, which are used when the director wants to emphasise the environment. These shots are also longer than the others, allowing the viewer to absorb the detail.



In a Medium Shot, the character is typically shown from the waist up, while still including some of the surrounding area. It lets you showcase the body language of the character while also allowing you to see some nuances in facial expression.




A Long Shot shows the character in their entirety along with a large portion of the surrounding area. In this scene, they're used when showing characters moving around the space.




An Extreme Wide Shot puts the location of the scene in perspective. Characters occupy almost no space, and the emphasis is purely on the location. Note that this shot is also much longer than the others, allowing the grandness of the location to soak in.


These are the main kinds of shots used in this scene. There are a few more types of cinematic shots that will be covered later.

Let's move on to camera movement. Throughout this scene, the camera is almost never stationary. Whether it's the spaceship getting hit by the wave, a slow walk across the ocean, or desperate running from one point to another, the camera moves in near perfect sync with the characters. This is what really makes you feel the tension like you're in the scene.



A pan is when the camera stays at the same point and turns from left to right, or right to left. With the camera fixed at one point, you experience the shot as though you were standing there and watching CASE (the robot) move back and forth. It's difficult to articulate why, but this seems more appropriate than, say, if the camera weren't watching CASE move but instead moving along with it.



A tilt up is used appropriately to reveal the wave from Dr. Brand's (Anne Hathaway) perspective. As the camera moves up to reveal the height of the wave, the gravity of the situation builds up (no pun intended). This is one of the longest shots in the scene, clocking in at 7.5 seconds.
Do you think it would be as impactful if the camera were stationary and placed such that you could see the wave in its entirety?


Every element of a shot, from shot scale (Long Shot, Wide Shot, etc.) and camera movement to camera angle, shot length, composition, color, and lighting, is chosen based on the message that the filmmaker wishes to convey. These shots are then juxtaposed meaningfully to tell a coherent visual story.

The analysis done above is far from comprehensive, but (hopefully) shed some light on how intricate creating effective cinema can be. The curious reader may want to dig deeper and look at how other factors such as composition, editing and color impact visual storytelling.

Breaking down this scene took a few hours of work. At the risk of being repetitive, this is where neural networks offer ample promise. With algorithms finding patterns like the ones shown above in a matter of seconds, your frame of reference would no longer be restricted to what you or your colleagues have watched; it could extend to all of cinema itself.


What follows is a gentle introduction to neural networks, followed by a description of the dataset, methodology, and results. If you're not interested in reading any of the (non-mathematical) technical material, click here.


Neural Networks 101


'AI' is most often a buzzword for deep learning, the field that uses neural networks to learn from data.

The key idea is that instead of explicitly specifying patterns to look for, you specify the rules for the neural network to autonomously detect patterns from data. The data could be something structured, like a database of customers' purchasing decisions, or something unstructured, like images, audio clips, medical scans, or video. Neural networks are good at tasks like predicting which products a customer wants, telling an image of a dog from one of a cat, the mating calls of dolphins from those of whales, a video of a goal being scored from one of the goalkeeper saving the day, or a benign tumor from a malignant one.

With a large enough labelled dataset (say 1,000 images of dogs and cats stored separately), you could use a neural network to learn patterns from these images. The network puts the image through a pile of computation, and spits out two probabilities: P(cat) and P(dog). You calculate how wrong the network was using a loss function, then use calculus (the chain rule) to tweak this pile of computation to produce a lower loss (a more correct output). Neural networks are nothing but a sophisticated mechanism for optimising this function.
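To make the "pile of computation" concrete, here's a minimal sketch in PyTorch (purely illustrative, not this project's code): a toy network maps an image to two scores, softmax turns them into P(cat) and P(dog), and a loss function measures how wrong the prediction was.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny stand-in for a real network: flatten the image, map it to 2 scores.
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 2),         # 2 outputs: "cat" and "dog"
)

image = torch.rand(1, 3, 224, 224)       # a fake 224x224 RGB image
label = torch.tensor([0])                # 0 = cat, 1 = dog (our own convention)

logits = net(image)                      # the "pile of computation"
probs = F.softmax(logits, dim=1)         # P(cat), P(dog)
loss = F.cross_entropy(logits, label)    # how wrong was the network?
loss.backward()                          # chain rule: a gradient for every weight
print(probs, loss.item())
```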

If the network's output is far off from the truth, the loss is larger, and so the tweak made is also larger. Tweaks that are too large are bad, so you multiply the tweaking factor by a tiny number known as the learning rate. One pass through the entire dataset is known as an epoch. You'd probably run through many epochs to reach a good solution; it's also a good idea to alter the images non-invasively (such as flipping them horizontally), so that the network sees different numbers for the same image and learns to detect patterns more robustly. This is known as data augmentation.
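Continuing the toy example above (the train_loader here is a hypothetical DataLoader of labelled images), a hedged sketch of how the learning rate, epochs, and a horizontal-flip augmentation fit together:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Horizontal flips: the same image as different numbers, a simple augmentation.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

learning_rate = 1e-3                     # scales down every tweak
optimizer = torch.optim.SGD(net.parameters(), lr=learning_rate)

for epoch in range(10):                  # one epoch = one pass over the dataset
    for images, labels in train_loader:  # train_loader: hypothetical DataLoader
        optimizer.zero_grad()
        loss = F.cross_entropy(net(images), labels)
        loss.backward()                  # compute the tweaks (gradients)
        optimizer.step()                 # apply them, scaled by the learning rate
```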

Neural networks can transfer knowledge from one project to another. It's very common to take a network that's been trained on ImageNet (millions of images spanning a thousand categories of common objects), and then tweak it to adapt to your project. It works because the network has already learnt basic visual concepts like curves, edges, textures, eyes, etc., which come in handy for any visual task. This process is known as transfer learning.
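A sketch of what transfer learning looks like with torchvision, assuming the standard ImageNet-pretrained ResNet-50 and our two toy classes:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)        # weights learnt on ImageNet
for param in model.parameters():
    param.requires_grad = False                 # freeze the pretrained "body"
model.fc = nn.Linear(model.fc.in_features, 2)   # fresh "head" for cat vs. dog
# Train only the new head first; later, optionally unfreeze everything and
# fine-tune the whole network with a small learning rate.
```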

Rinse and repeat this process carefully, and you have in your hands an 'AI' solution to your problem.*

If that piqued your interest, I suggest you watch this (~19 mins) for a fairly detailed explanation of how a neural network works. If you're bursting with excitement, follow through with this course.

* Of course, this isn't all that deep learning is about. What I've described here is supervised learning.

There are several other sub-fields of deep learning that are more nuanced and cutting-edge, such as unsupervised learning, reinforcement learning, and generative modelling, to name a few.


The Dataset


There is no public dataset that categorises cinematic shots into shot types. There is one prior research project that classified shot types using neural networks, but the dataset, although massive (~400,000 images), only had 3 output classes (shot types) from 120 films.

Thus, the dataset for this project was constructed from scratch. It is diverse, consisting of samples from over 800 movies, collected from various sources. Each image has been looked over 4-5 times to ensure that it has been categorised correctly. Since shot types have room for subjectivity, it's important to note that there is no Full Shot as described here. Instead, there is a Long Shot which is essentially a Full Shot with some leeway for wideness, as described here and evident in the samples shown below.

In total, the dataset consists of 6,105 images (5,505 training + 600 validation), split into 6 shot types. The training images are distributed as follows:

Extreme Wide Shot ⟶   631 images
Long Shot ⟶   881 images
Medium Shot ⟶   1,237 images
Medium Close Up ⟶   945 images
Close Up ⟶   1,027 images
Extreme Close Up ⟶   784 images
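The training data isn't public, but a dataset organised this way (one folder per shot type, split into training and validation sets) would typically be loaded with fastai v1 along these lines. The folder names below are assumptions for illustration, not the project's actual layout:

```python
from fastai.vision import *

# Assumed layout: data/train/<shot_type>/*.jpg and data/valid/<shot_type>/*.jpg
data = ImageDataBunch.from_folder(
    'data', train='train', valid='valid',
    ds_tfms=get_transforms(),            # fastai's default augmentations
    size=224, bs=32,
).normalize(imagenet_stats)

print(data.classes)                      # the 6 shot types
```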

In a future version, I will be adding two additional shot types: Wide Shot and Medium Long Shot.


Data Sources

Shot Types

Note: All images are cropped to the same aspect ratio for a smooth viewing experience.

Though you're now familiar with shot types, it's useful to look at how Long Shots and Wide Shots are defined here, since, unlike shot types such as Extreme Close Ups, they are more subjective.

Extreme Wide Shot

An Extreme Wide Shot (EWS) emphasises the vastness of the location.

When there is a subject, it usually occupies a very small part of the frame.

Long Shot

A Long Shot (LS) includes characters in their entirety, and a large portion of the surrounding area.

Medium Shot

A Medium Shot (MS) shows the character from the waist up.

It allows one to see nuances of the character's body language, and to some degree the facial expressions.

Medium Close Up

A Medium Close Up (MCU) shows the character from the chest/shoulders up.

It allows one to see nuances of the character's facial expressions, and some upper-body language.

Close Up

A Close Up (CU) shows the face of the character, sometimes including the neck/shoulders.

Emphasises the facial expressions of the character.

Extreme Close Up

An Extreme Close Up (ECU) zooms in tightly on a single feature of the subject to draw attention to that feature specifically.


More Shot Types (Coming Soon)

Wide Shot

A Wide Shot falls between an Extreme Wide Shot and Long Shot.

The emphasis is on the physical space that the characters are in.

Medium Long Shot

A Medium Long Shot falls between a Long Shot and a Medium Shot.

Characters are usually shown from the knees up, or from the waist up with a large portion of the background visible.




Methodology


The training process followed fastai's methodology (transfer learning, data augmentation, the one-cycle policy, the learning rate finder, etc.). This is well documented in this Jupyter notebook.

While doing this project, I thought up and implemented a new kind of data augmentation called rgb_randomize that is now part of the fastai library.
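See the fastai docs for the transform's exact signature and behaviour; as a rough standalone sketch of the idea (not the library's implementation), it picks one of the RGB channels and replaces its values with random noise capped by a threshold:

```python
import torch

def rgb_randomize_sketch(img: torch.Tensor, channel: int = None,
                         thresh: float = 0.3) -> torch.Tensor:
    "Rough sketch of the idea: img is a (3, H, W) tensor with values in [0, 1]."
    c = torch.randint(0, 3, (1,)).item() if channel is None else channel
    out = img.clone()
    out[c] = torch.rand_like(out[c]) * thresh    # random values in [0, thresh)
    return out
```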

The model architecture is a ResNet-50 pretrained on ImageNet. The only part of the training process that I haven't seen used extensively, though mentioned in this lecture, was that of progressive image resizing (perhaps cyclical transfer learning is a better term)*. This is a process where you take a pretrained model (almost always a ResNet-xx pretrained on ImageNet), fine-tune it on your dataset, and then repeat the process with a larger image size. Here's a visual explanation:
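Since the training data isn't public, the snippet below is only an illustrative sketch of a progressive-resizing loop in fastai v1; the sizes, batch size, and epoch counts are made up rather than the schedule actually used:

```python
from fastai.vision import *

learn = None
for size in (128, 224, 352):                     # train small, then larger
    data = ImageDataBunch.from_folder(
        'data', ds_tfms=get_transforms(), size=size, bs=32,
    ).normalize(imagenet_stats)
    if learn is None:
        learn = cnn_learner(data, models.resnet50, metrics=accuracy)
    else:
        learn.data = data                        # same weights, larger images
    learn.fit_one_cycle(5)
```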

* You cannot replicate the training process as the training data has not been made public yet. However, you can use my pretrained model to predict shot types on your own images, or to re-evaluate the validation set.
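With the exported model, predicting the shot type of your own frame looks roughly like this in fastai v1 (the filenames below are placeholders; check the repository for the actual ones):

```python
from fastai.vision import *

learn = load_learner('.', 'shot-type-classifier.pkl')   # placeholder filename
img = open_image('my_frame.jpg')                        # placeholder image path
pred_class, pred_idx, probs = learn.predict(img)
print(pred_class, probs)
```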

This diagram of a ResNet-50 isn't 100% accurate — it doesn't include skip connections. In a future post, I will explain how a ResNet works with a more complete version of these diagrams.





At each stage, the network is first trained frozen with a learning rate of ~1e-3, and then unfrozen and fine-tuned with learning rates of ~slice(1e-6, 1e-3).
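In fastai v1 terms, each stage looks roughly like this (epoch counts are illustrative, and learn is the learner from the sketch above):

```python
learn.freeze()                                     # train only the head
learn.fit_one_cycle(5, max_lr=1e-3)

learn.unfreeze()                                   # fine-tune the whole network
learn.fit_one_cycle(5, max_lr=slice(1e-6, 1e-3))   # smaller lr for early layers
```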

Results

Training Performance

When training the network, the dataset is split into the training set (5,505 images) and the validation set (600 images). The network learns from the training set, and evaluates its accuracy on the validation set. As explained in Neural Networks 101 earlier, the network learns by optimising a loss function. That loss function is the training loss in this table.

You could theoretically reach a training loss of 0, but this would mean that the network hasn't learnt patterns from your dataset, but memorised it instead. To ensure optimal performance, the metric to keep an eye on is the validation loss. The network has never seen the images in the validation set, so as long as the validation loss keeps dropping, you're going in the right direction. The accuracy is simply the percentage of correctly predicted images.

Training Loss: Lower is better as long as validation loss keeps dropping

Validation Loss: Lower is always better

Epoch    Training Loss    Validation Loss    Accuracy (%)
  1          0.364             0.271             90.3
  2          0.357             0.269             90.5
  3          0.362             0.267             89.8
  4          0.333             0.250             90.8
  5          0.316             0.255             90.7
  6          0.303             0.255             91.0
  7          0.297             0.274             90.0
  8          0.291             0.258             90.7
  9          0.281             0.250             91.0
 10          0.278             0.255             91.0

The network peaks at an accuracy of 91.0% on the validation set, which is a great result. The accuracy would probably be higher on a larger dataset, as each wrongly classified image would account for a smaller drop in accuracy; gains from a lower validation loss would also be more palpable, the opposite of which can be seen in this table due to the small size of the validation set.

This table is extracted from the methodology notebook and shows the last stage of training - Stage 3.2.


Confusion Matrix

The first row of the confusion matrix below says that of 100 Extreme Wide Shots (EWS), 94 were predicted accurately, and of the 6 that weren't, 3 were predicted as Long Shots (LS), 1 as a Medium Shot (MS), 1 as a Close Up (CU), and 1 as an Extreme Close Up (ECU).


It's clear that the model performs best for Extreme Wide Shots (EWS), Medium Close Ups (MCU), Close Ups (CU), and Extreme Close Ups (ECU). It struggles a bit when detecting Long Shots (LS), often confusing them with Medium Shots (MS).
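For reference, fastai v1 can produce a confusion matrix like this in a couple of lines, again assuming the learn object from the training sketch earlier:

```python
from fastai.vision import *

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(6, 6))
interp.most_confused(min_val=3)          # e.g. pairs like ('LS', 'MS', count)
```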

Overall, the model is 91% accurate.



Heatmaps


Heatmaps represent the activations of the neural network. In layman's terms, they show the parts of the image that caused the network to detect its predicted shot type.


Due to a software quirk, the sliders on the images below work best if you click on the rightmost/leftmost sides of the images rather than drag the slider.
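Heatmaps like these are usually produced with Grad-CAM-style techniques; one quick way to get similar visualisations in fastai v1, at least for the images with the highest loss, is shown below (assuming the interp object from the confusion-matrix sketch):

```python
# Overlay activation heatmaps on the images the model got most wrong.
interp.plot_top_losses(9, figsize=(15, 11), heatmap=True)
```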

Extreme Wide Shot

The network has learnt to recognise patterns on both the ground and the sky.

The foreground is an important factor, as is the human figure when detected.

It's interesting to see the network's behavior in low lighting conditions.

Long Shot

Presence of the entire leg and/or parts of the human body is a determining factor.

The network can detect objects in the background.

Predictions remain stable in low-lighting and are resilient to different camera angles and human pose.

Medium Shot

The cleanest predictions are based on the presence of a human face and its spatial relationship with the rest of the frame.

In lower lighting conditions, it uses other parts of the body and/or elements in the background. It isn't restricted to the presence of a face.

Medium Close Up

The face is clearly the defining factor. The network focuses on a more specific part of the face than it does for a Medium Shot.

When the typical human face isn't in the image, it uses other elements in the frame.

Close Up

The face is even more clearly the defining factor. The network focuses on an even more specific part of the face than it does for a Medium Close Up.

It uses other elements in the frame when the face is obscured, when the camera angle is unusual, or in low lighting conditions.

Extreme Close Up

This is harder to interpret due to the diversity of these shots.

The network does a good job regardless; it scored 95/100 on ECUs.



Non Standard Framing
Is this a Close Up?
The model is ~80% sure it's an MCU and ~20% sure that it's a CU

Robustness

It's important to take these results with a pinch of salt. The validation set, while diverse, is not comprehensive. For all shot types besides the Extreme Wide Shot and the Extreme Close Up, most of the training samples were images that had humans in them. Though this is usually the case in movies, it isn't always so.

I've also refrained from including non-standard compositions in the training and validation sets for two reasons: they represent a minority of shots, and they are difficult to classify. That being said, the model performs well on the most commonly found shot types across most cinema.

The model could be biased against "foreign" (non-Hollywood) cinema, as it was trained on mostly Hollywood movie stills. However, the heatmaps above show that it is able to detect fundamental features that are seen across all cinema. The model hasn't been tested thoroughly enough to make a definitive statement regarding this.

Non Human Framing
Is this a Long Shot?
The model is baffled. It predicts ~18-22% for each shot class.



Neural networks burst into popularity in the past decade with the development of large datasets and the ability to leverage GPUs (graphics cards) for the heavy computation they demand.

Rapid advances on the technical side open up opportunities to solve novel problems like the one presented in this post. Shot scale recognition is one of many possible applications to film. It's possible to recognise camera movement with 3D CNNs (convolutional neural networks); the only missing piece is the dataset. Camera angles could be detected with the same methodology as this project, but that dataset doesn't exist either. Cut detection — the transition from one shot to the next — has been worked on extensively, and can be adapted to film*.

*Frederic Brodbeck's Cinemetrics project does something similar, and is worth looking at.

Barry Salt's Database also does something similar, but at a higher level (summary statistics for the entire movie); a filmmaker would find individual shot analysis more useful.

#    Length (s)    Shot Type    Camera Angle    Camera Movement
1       10.42      LS, MS       Eye Level       Zoom
2        2.25      LS           Hip Level       ~Static
3        0.71      MCU          Low Angle       Static
4        3.10      LS, MS       ~Eye Level      Tracking, Tilt
5        4.12      EWS          High Angle      Drone, Pan
6        1.30      LS           Low Angle       ~Static

When created on a large scale, datasets like this can give us invaluable insights for creating effective shot sequences. As stated earlier, they'd also immensely broaden our scope of reference. What's most exciting about this is that it's all software-driven, making it widely accessible and easy to scale up.

The code and the model are publicly available here. I hope you enjoyed reading this as much as I did writing it. If you'd like to share your thoughts, you can leave a comment below, or reach me via email.



References


Visual Language




Previous Work


  • Shot Scale Analysis in Movies by Convolutional Neural Networks. Mattia Savardi, Alberto Signoroni, Pierangelo Migliorati, Sergio Benini. 25th IEEE International Conference on Image Processing (ICIP), 2018
    This paper was the first to classify shot types using neural networks. The authors built a massive dataset (~400,000 images) to classify 3 shot types: Long Shots (EWS, WS, LS), Medium Shots (MLS, MS, MCU), and Close Shots (CU, ECU), achieving an overall accuracy of ~94%.


Technical




Other Related Work


  • Barry Salt's Database
    One of the earliest quantitative databases on film data, built manually. Consists of three databases: Camera movement, shot scale and average shot length.
  • Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh. arXiv 2018
    An excellent paper that extended the ResNet architecture to the video domain. They released multiple models, and the best ones come close to or exceed the state of the art, depending on the dataset. The code is available here.
  • Cinemetrics
    Frederic Brodbeck's graduation project at the Royal Academy of Arts in The Hague. He created a visual "fingerprint" for a film by analysing color, speech, editing structure and motion.
  • Fast Video Shot Transition Localization with Deep Structured Models. Shitao Tang, Litong Feng, Zhanghui Kuang, Yimin Chen, Wei Zhang. arXiv 2017
    A more recent paper on SBD (shot boundary detection) that introduced a complicated yet effective model and a more sophisticated dataset — ClipShots — for cut detection. Detects hard cuts and fades. The code is available here.
  • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Joao Carreira, Andrew Zisserman. arXiv 2017
    This paper introduced the Kinetics dataset, a landmark dataset in the video domain, along with a unique model (a combination of networks, to be precise) that achieved state of the art performance on existing video datasets. The code is available here.
  • Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks. Michael Gygli. arXiv 2017
    One of the first papers that used convolutional neural networks to detect when a cut happened in a video, and also the kind of cut — hard cut or fade.