AI For Filmmaking

Recognising Shot Types with ResNets


Rahul Somani, 10th August 2019


Analysing cinema is a time-consuming process. In the cinematography domain alone, there are many factors to consider, such as shot scale, shot composition, camera movement, color, lighting, etc. Whatever you shoot is in some way influenced by what you've watched. There's only so much one can watch, and even less that one can analyse thoroughly.

This is where neural networks offer ample promise. They can recognise patterns in images in ways that weren't possible until less than a decade ago, offering an enormous speed-up in analysing cinema. I've developed a neural network that focuses on one fundamental element of visual grammar: shot types. It's capable of recognising 6 unique shot types and is ~91% accurate. The pretrained model, the validation dataset (the set of images used to determine its accuracy), the code used to train the network, and some more code to classify your own images are all freely available.



What is Visual Language, and Why Does it Matter?


When you're writing something, whether an email, an essay, a report, or a paper, you're using the rules of grammar to put forth your point. Your choice of words, the way you construct the sentence, correct use of punctuation, and, most importantly, what you have to say all contribute towards the effectiveness of your message.

Cinema is about how ideas and emotions are expressed through a visual form. It's a visual language, and just like any written language, your choice of words (what you put in the shot/frame), the way you construct the sentence (the sequence of shots), correct use of punctuation (editing & continuity) and what you have to say (the story) are key factors in creating effective cinema. The comparison doesn't apply rigidly, but it's a good starting point for thinking about cinema as a language.


The most basic element of this language is a shot. There are many factors to consider while filming a shot: how big should the subject be? Should the camera be placed above or below the subject? How long should the shot be? Should the camera remain still or move with the subject? And if it moves, how should it move: should it follow the subject or observe it from a fixed point while turning left/right or up/down, and should the movement be smooth or jerky? There are other major visual factors, such as color and lighting, but we'll restrict our scope to the factors above. A filmmaker chooses how to construct a shot based on what he/she wants to convey, and then juxtaposes shots effectively to drive home the message.


Let's consider this scene from Interstellar. To give you some context, a crew of two researchers and a pilot land on a mysterious planet to collect crucial data from the debris of a previous mission. This planet is very different from Earth — it is covered in an endless ocean, and its gravity is 130% of Earth's.


This scene consists of 89 shots, and the average length of each shot is 2.66 seconds.

For almost all the shots showing Cooper (Matthew McConaughey) inside the spacecraft, Nolan uses a Medium Close Up, showing Cooper from the chest up. This allows us to see his facial expressions clearly, as well as a bit of the spacecraft he's in and his upper body movements. Notice how the camera isn't 100% stable. The camera moves slightly according to Cooper's movements, making us feel more involved in this scene.




In a Wide Shot, the emphasis is on the space that the characters are in. We're given our first glimpse of the planet with a wide shot. The entire scene has 10 wide shots, which are used when the director wants to emphasise the environment. These shots are also longer than the others, allowing the viewer to absorb the detail.



In a Medium Shot, the character is typically shown from the waist up, while still including some of the surrounding area. It lets you showcase the body language of the character while also allowing you to see some nuances in facial expression.




A Long Shot shows the character in their entirety along with a large portion of the surrounding area. In this scene, they're used when showing characters moving around the space.




An Extreme Wide Shot puts the location of the scene in perspective. Characters occupy almost no space, and the emphasis is purely on the location. Note that this shot is also much longer than the others, allowing the grandness of the location to soak in.


These are the main kinds of shots used in this scene. There are a few more types of cinematic shots that will be covered later.

Let's move on to camera movement. Throughout this scene, the camera is almost never stationary. Whether it's the spaceship getting hit by the wave, a slow walk across the ocean, or desperate running from one point to another, the camera moves in near perfect sync with the characters. This is what really makes you feel the tension like you're in the scene.



A pan is when the camera stays at the same point and turns from left to right, or right to left. With the camera fixed at one point, you experience the shot as though you were standing there and watching CASE (the robot) move back and forth. It's difficult to articulate why, but this seems more appropriate than, say, if the camera weren't watching CASE move but instead moving along with it.



A tilt up is used appropriately to reveal the wave from Dr. Brand's (Anne Hathaway) perspective. As the camera moves up to reveal the height of the wave, the gravity of the situation builds up (no pun intended). This is one of the longest shots in the scene, clocking in at 7.5 seconds.
Do you think it would be as impactful if the camera were stationary and placed such that you could see the wave in its entirety?


Every element of a shot, from shot scale (Long Shot, Wide Shot, etc.) and camera movement to camera angle, shot length, composition, color, and lighting, is chosen based on the message that the filmmaker wishes to convey. These shots are then juxtaposed meaningfully to tell a coherent visual story.

The analysis done above is far from comprehensive, but (hopefully) shed some light on how intricate creating effective cinema can be. The curious reader may want to dig deeper and look at how other factors such as composition, editing and color impact visual storytelling.

Breaking down this scene took a few hours of work. At the risk of being repetitive, this is where neural networks offer ample promise. With algorithms finding patterns like the ones shown above in a matter of seconds, your frame of reference would no longer be restricted to what you or your colleagues have watched; it could extend to all of cinema itself.


What follows is a gentle introduction to neural networks, followed by a description of the dataset, methodology, and results. If you're not interested in reading any of the (non-mathematical) technical material, click here.


Neural Networks 101


'AI' is most often a buzzword for deep learning, the field that uses neural networks to learn from data.

The key idea is that instead of explicitly specifying patterns to look for, you specify the rules for the neural network to autonomously detect patterns from data. The data could be something structured, like a database of customers' purchasing decisions, or something unstructured, like images, audio clips, medical scans, or video. Neural networks are good at tasks like predicting which products a customer wants, telling an image of a dog from one of a cat, the mating calls of dolphins from those of whales, a video of a goal being scored from one of the goalkeeper saving the day, or a benign tumor from a malignant one.

With a large enough labelled dataset (say 1,000 images of dogs and cats stored separately), you could use a neural network to learn patterns from these images. The network puts the image through a pile of computation, and spits out two probabilities: P(cat) and P(dog). You calculate how wrong the network was using a loss function, then use calculus (the chain rule) to tweak this pile of computation to produce a lower loss (a more correct output). Neural networks are nothing but a sophisticated mechanism for optimising this function.
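To make the "pile of computation" concrete, here's a minimal sketch in PyTorch (purely illustrative, not this project's code): a toy network maps an image to two scores, softmax turns them into P(cat) and P(dog), and a loss function measures how wrong the prediction was.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny stand-in for a real network: flatten the image, map it to 2 scores.
net = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 224 * 224, 2),         # 2 outputs: "cat" and "dog"
)

image = torch.rand(1, 3, 224, 224)       # a fake 224x224 RGB image
label = torch.tensor([0])                # 0 = cat, 1 = dog (our own convention)

logits = net(image)                      # the "pile of computation"
probs = F.softmax(logits, dim=1)         # P(cat), P(dog)
loss = F.cross_entropy(logits, label)    # how wrong was the network?
loss.backward()                          # chain rule: a gradient for every weight
print(probs, loss.item())
```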

If the network's output is far off from the truth, the loss is larger, and so the tweak made is also larger. Tweaks that are too large are bad, so you multiply the tweaking factor by a tiny number known as the learning rate. One pass through the entire dataset is known as an epoch. You'd probably run through many epochs to reach a good solution; it's also a good idea to alter the images non-invasively (such as flipping them horizontally), so that the network sees different numbers for the same image and learns to detect patterns more robustly. This is known as data augmentation.
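Continuing the toy example above (the train_loader here is a hypothetical DataLoader of labelled images), a hedged sketch of how the learning rate, epochs, and a horizontal-flip augmentation fit together:

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

# Horizontal flips: the same image as different numbers, a simple augmentation.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

learning_rate = 1e-3                     # scales down every tweak
optimizer = torch.optim.SGD(net.parameters(), lr=learning_rate)

for epoch in range(10):                  # one epoch = one pass over the dataset
    for images, labels in train_loader:  # train_loader: hypothetical DataLoader
        optimizer.zero_grad()
        loss = F.cross_entropy(net(images), labels)
        loss.backward()                  # compute the tweaks (gradients)
        optimizer.step()                 # apply them, scaled by the learning rate
```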

Neural networks can transfer knowledge from one project to another. It's very common to take a network that's been trained on ImageNet (millions of images spanning a thousand categories of common objects), and then tweak it to adapt to your project. It works because the network has already learnt basic visual concepts like curves, edges, textures, eyes, etc., which come in handy for any visual task. This process is known as transfer learning.
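A sketch of what transfer learning looks like with torchvision, assuming the standard ImageNet-pretrained ResNet-50 and our two toy classes:

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(pretrained=True)        # weights learnt on ImageNet
for param in model.parameters():
    param.requires_grad = False                 # freeze the pretrained "body"
model.fc = nn.Linear(model.fc.in_features, 2)   # fresh "head" for cat vs. dog
# Train only the new head first; later, optionally unfreeze everything and
# fine-tune the whole network with a small learning rate.
```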

Rinse and repeat this process carefully, and you have in your hands an 'AI' solution to your problem.*

If that piqued your interest, I suggest you watch this (~19 mins) for a fairly detailed explanation of how a neural network works. If you're bursting with excitement, follow through with this course.

* Of course, this isn't all that deep learning is about. What I've described here is supervised learning.

There are several other sub-fields of deep learning that are more nuanced and cutting-edge, such as unsupervised learning, reinforcement learning, and generative modelling, to name a few.


The Dataset


There is no public dataset that categorises cinematic shots into shot types. There is one prior research project that classified shot types using neural networks, but the dataset, although massive (~400,000 images), only had 3 output classes (shot types) from 120 films.

Thus, the dataset for this project was constructed from scratch. It is diverse, consisting of samples from over 800 movies, collected from various sources. Each image has been looked over 4-5 times to ensure that it has been categorised correctly. Since shot types have room for subjectivity, it's important to note that there is no Full Shot as described here. Instead, there is a Long Shot which is essentially a Full Shot with some leeway for wideness, as described here and evident in the samples shown below.

In total, the dataset consists of 6,105 images (5,505 training + 600 validation), split into 6 shot types. The training images are distributed as follows:

Extreme Wide Shot ⟶   631 images
Long Shot ⟶   881 images
Medium Shot ⟶   1,237 images
Medium Close Up ⟶   945 images
Close Up ⟶   1,027 images
Extreme Close Up ⟶   784 images
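The training data isn't public, but a dataset organised this way (one folder per shot type, split into training and validation sets) would typically be loaded with fastai v1 along these lines. The folder names below are assumptions for illustration, not the project's actual layout:

```python
from fastai.vision import *

# Assumed layout: data/train/<shot_type>/*.jpg and data/valid/<shot_type>/*.jpg
data = ImageDataBunch.from_folder(
    'data', train='train', valid='valid',
    ds_tfms=get_transforms(),            # fastai's default augmentations
    size=224, bs=32,
).normalize(imagenet_stats)

print(data.classes)                      # the 6 shot types
```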

In a future version, I will be adding two additional shot types: Wide Shot and Medium Long Shot.


Data Sources

Shot Types

Note: All images are cropped to the same aspect ratio for a smooth viewing experience.

Though you're now familiar with shot types, it's useful to look at how Long Shots and Wide Shots are defined here, since, unlike shot types such as Extreme Close Ups, they are more subjective.

Extreme Wide Shot

An Extreme Wide Shot (EWS) emphasises the vastness of the location.

When there is a subject, it usually occupies a very small part of the frame.

Long Shot

A Long Shot (LS) includes characters in their entirety, and a large portion of the surrounding area.

Medium Shot

A Medium Shot (MS) shows the character from the waist up.

It allows one to see nuances of the character's body language, and to some degree the facial expressions.

Medium Close Up

A Medium Close Up (MCU) shows the character from the chest/shoulders up.

It allows one to see nuances of the character's facial expressions, and some upper-body language.

Close Up

A Close Up (CU) shows the face of the character, sometimes including the neck/shoulders.

Emphasises the facial expressions of the character.

Extreme Close Up

An Extreme Close Up (ECU) zooms in tightly on a single feature of the subject to draw attention to that feature specifically.


More Shot Types (Coming Soon)

Wide Shot

A Wide Shot falls between an Extreme Wide Shot and Long Shot.

The emphasis is on the physical space that the characters are in.

Medium Long Shot

A Medium Long Shot falls between a Long Shot and a Medium Shot.

Characters are usually shown from the knees up, or from the waist up with a large portion of the background visible.




Methodology


The training process followed fastai's methodology (transfer learning, data augmentation, the one-cycle policy, the learning rate finder, etc.). This is well documented in this Jupyter notebook.

While doing this project, I thought up and implemented a new kind of data augmentation called rgb_randomize that is now part of the fastai library.
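See the fastai docs for the transform's exact signature and behaviour; as a rough standalone sketch of the idea (not the library's implementation), it picks one of the RGB channels and replaces its values with random noise capped by a threshold:

```python
import torch

def rgb_randomize_sketch(img: torch.Tensor, channel: int = None,
                         thresh: float = 0.3) -> torch.Tensor:
    "Rough sketch of the idea: img is a (3, H, W) tensor with values in [0, 1]."
    c = torch.randint(0, 3, (1,)).item() if channel is None else channel
    out = img.clone()
    out[c] = torch.rand_like(out[c]) * thresh    # random values in [0, thresh)
    return out
```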

The model architecture is a ResNet-50 pretrained on ImageNet. The only part of the training process that I haven't seen used extensively, though mentioned in this lecture, was that of progressive image resizing (perhaps cyclical transfer learning is a better term)*. This is a process where you take a pretrained model (almost always a ResNet-xx pretrained on ImageNet), fine-tune it on your dataset, and then repeat the process with a larger image size. Here's a visual explanation:
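Since the training data isn't public, the snippet below is only an illustrative sketch of a progressive-resizing loop in fastai v1; the sizes, batch size, and epoch counts are made up rather than the schedule actually used:

```python
from fastai.vision import *

learn = None
for size in (128, 224, 352):                     # train small, then larger
    data = ImageDataBunch.from_folder(
        'data', ds_tfms=get_transforms(), size=size, bs=32,
    ).normalize(imagenet_stats)
    if learn is None:
        learn = cnn_learner(data, models.resnet50, metrics=accuracy)
    else:
        learn.data = data                        # same weights, larger images
    learn.fit_one_cycle(5)
```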

* You cannot replicate the training process as the training data has not been made public yet. However, you can use my pretrained model to predict shot types on your own images, or to re-evaluate the validation set.
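With the exported model, predicting the shot type of your own frame looks roughly like this in fastai v1 (the filenames below are placeholders; check the repository for the actual ones):

```python
from fastai.vision import *

learn = load_learner('.', 'shot-type-classifier.pkl')   # placeholder filename
img = open_image('my_frame.jpg')                        # placeholder image path
pred_class, pred_idx, probs = learn.predict(img)
print(pred_class, probs)
```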

This diagram of a ResNet-50 isn't 100% accurate — it doesn't include skip connections. In a future post, I will explain how a ResNet works with a more complete version of these diagrams.





At each stage, the network is first trained frozen with a learning rate of ~1e-3, and then unfrozen and fine-tuned with learning rates of ~slice(1e-6, 1e-3).
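In fastai v1 terms, each stage looks roughly like this (epoch counts are illustrative, and learn is the learner from the sketch above):

```python
learn.freeze()                                     # train only the head
learn.fit_one_cycle(5, max_lr=1e-3)

learn.unfreeze()                                   # fine-tune the whole network
learn.fit_one_cycle(5, max_lr=slice(1e-6, 1e-3))   # smaller lr for early layers
```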

Results

Training Performance

When training the network, the dataset is split into the training set (5,505 images) and the validation set (600 images). The network learns from the training set, and evaluates its accuracy on the validation set. As explained in Neural Networks 101 earlier, the network learns by optimising a loss function. That loss function is the training loss in this table.

You could theoretically reach a training loss of 0, but this would mean that the network hasn't learnt patterns from your dataset, but memorised it instead. To ensure optimal performance, the metric to keep an eye on is the validation loss. The network has never seen the images in the validation set, so as long as the validation loss keeps dropping, you're going in the right direction. The accuracy is simply the percentage of correctly predicted images.

Training Loss: Lower is better as long as validation loss keeps dropping

Validation Loss: Lower is always better

Epoch    Training Loss    Validation Loss    Accuracy (%)
  1          0.364             0.271             90.3
  2          0.357             0.269             90.5
  3          0.362             0.267             89.8
  4          0.333             0.250             90.8
  5          0.316             0.255             90.7
  6          0.303             0.255             91.0
  7          0.297             0.274             90.0
  8          0.291             0.258             90.7
  9          0.281             0.250             91.0
 10          0.278             0.255             91.0

The network peaks at an accuracy of 91.0% on the validation set, which is a great result. The accuracy would probably be higher on a larger dataset, as each wrongly classified image would account for a smaller drop in accuracy; gains from a lower validation loss would also be more palpable, the opposite of which can be seen in this table due to the small size of the validation set.

This table is extracted from the methodology notebook and shows the last stage of training - Stage 3.2.


Confusion Matrix

The first row of the confusion matrix below says that of 100 Extreme Wide Shots (EWS), 94 were predicted accurately, and of the 6 that weren't, 3 were predicted as Long Shots (LS), 1 as a Medium Shot (MS), 1 as a Close Up (CU), and 1 as an Extreme Close Up (ECU).


It's clear that the model performs best for Extreme Wide Shots (EWS), Medium Close Ups (MCU), Close Ups (CU), and Extreme Close Ups (ECU). It struggles a bit when detecting Long Shots (LS), often confusing them with Medium Shots (MS).
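For reference, fastai v1 can produce a confusion matrix like this in a couple of lines, again assuming the learn object from the training sketch earlier:

```python
from fastai.vision import *

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(6, 6))
interp.most_confused(min_val=3)          # e.g. pairs like ('LS', 'MS', count)
```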

Overall, the model is 91% accurate.



Heatmaps


Heatmaps represent the activations of the neural network. In layman's terms, they show the parts of the image that caused the network to detect its predicted shot type.


Due to a software quirk, the sliders on the images below work best if you click on the rightmost/leftmost sides of the images rather than drag the slider.
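Heatmaps like these are usually produced with Grad-CAM-style techniques; one quick way to get similar visualisations in fastai v1, at least for the images with the highest loss, is shown below (assuming the interp object from the confusion-matrix sketch):

```python
# Overlay activation heatmaps on the images the model got most wrong.
interp.plot_top_losses(9, figsize=(15, 11), heatmap=True)
```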

Extreme Wide Shot

The network has learnt to recognise patterns on both the ground and the sky.

The foreground is an important factor, as is the human figure when detected.

It's interesting to see the network's behavior in low lighting conditions.

Long Shot

Presence of the entire leg and/or parts of the human body is a determining factor.

The network can detect objects in the background.

Predictions remain stable in low-lighting and are resilient to different camera angles and human pose.

Medium Shot

The cleanest predictions are based on the presence of a human face and its spatial relationship with the rest of the frame.

In lower lighting conditions, it uses other parts of the body and/or elements in the background. It isn't restricted to the presence of a face.

Medium Close Up

The face is clearly the defining factor. The network focuses on a more specific part of the face than it does for a Medium Shot.

When the typical human face isn't in the image, it uses other elements in the frame.

Close Up

The face is even more clearly the defining factor. The network focuses on an even more specific part of the face than it does for a Medium Close Up.

It uses other elements in the frame when the face is obscured, when the camera angle is unusual, or in low lighting conditions.

Extreme Close Up

This is harder to interpret due to the diversity of these shots.

The network does a good job regardless; it scored 95/100 on ECUs.



Non Standard Framing
Is this a Close Up?
The model is ~80% sure it's an MCU and ~20% sure that it's a CU

Robustness

It's important to take these results with a pinch of salt. The validation set, while diverse, is not comprehensive. For all shot types besides the Extreme Wide Shot and the Extreme Close Up, most of the training samples were images that had humans in them. Though this is usually the case in movies, it isn't always so.

I've also refrained from including non-standard compositions in the training and validation sets for two reasons: they represent a minority of shots, and they are difficult to classify. That being said, the model performs well on the most commonly found shot types across most cinema.

The model could be biased against "foreign" (non-Hollywood) cinema, as it was trained on mostly Hollywood movie stills. However, the heatmaps above show that it is able to detect fundamental features that are seen across all cinema. The model hasn't been tested thoroughly enough to make a definitive statement regarding this.

Non Human Framing
Is this a Long Shot?
The model is baffled. It predicts ~18-22% for each shot class.



Neural networks burst into popularity in the past decade with the development of large datasets and the ability to leverage GPUs (graphics cards) for the heavy computation they demand.

Rapid advances on the technical side open up opportunities to solve novel problems like the one presented in this post. Shot scale recognition is one of many possible applications to film. It's possible to recognise camera movement with 3D CNNs (convolutional neural networks); the only missing piece is the dataset. Camera angles could be detected with the same methodology as this project, but that dataset doesn't exist either. Cut detection — the transition from one shot to the next — has been worked on extensively, and can be adapted to film*.

*Frederic Brodbeck's Cinemetrics project does something similar, and is worth looking at.

Barry Salt's Database also does something similar, but at a higher level (summary statistics for the entire movie); a filmmaker would find individual shot analysis more useful.

#    Length (s)    Shot Type    Camera Angle    Camera Movement
1       10.42      LS, MS       Eye Level       Zoom
2        2.25      LS           Hip Level       ~Static
3        0.71      MCU          Low Angle       Static
4        3.10      LS, MS       ~Eye Level      Tracking, Tilt
5        4.12      EWS          High Angle      Drone, Pan
6        1.30      LS           Low Angle       ~Static

When created on a large scale, datasets like this can give us invaluable insights for creating effective shot sequences. As stated earlier, they'd also immensely broaden our scope of reference. What's most exciting about this is that it's all software-driven, making it widely accessible and easy to scale up.

The code and the model are publicly available here. I hope you enjoyed reading this as much as I did writing it. If you'd like to share your thoughts, you can leave a comment below, or reach me via email.



References


Visual Language




Previous Work


  • Shot Scale Analysis in Movies by Convolutional Neural Networks. Mattia Savardi, Alberto Signoroni, Pierangelo Migliorati, Sergio Benini. 25th IEEE International Conference on Image Processing (ICIP), 2018
    This paper was the first to classify shot types using neural networks. The authors built a massive dataset (~400,000 images) to classify 3 shot types: Long Shots (EWS, WS, LS), Medium Shots (MLS, MS, MCU), and Close Shots (CU, ECU), achieving an overall accuracy of ~94%.


Technical




Other Related Work


  • Barry Salt's Database
    One of the earliest quantitative databases on film data, built manually. Consists of three databases: Camera movement, shot scale and average shot length.
  • Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? Kensho Hara, Hirokatsu Kataoka, Yutaka Satoh. arXiv 2018
    An excellent paper that extended the ResNet architecture to the video domain. They released multiple models, and the best ones come close to or exceed the state of the art, depending on the dataset. The code is available here.
  • Cinemetrics
    Frederic Brodbeck's graduation project at the Royal Academy of Arts in The Hague. He created a visual "fingerprint" for a film by analysing color, speech, editing structure and motion.
  • Fast Video Shot Transition Localization with Deep Structured Models. Shitao Tang, Litong Feng, Zhanghui Kuang, Yimin Chen, Wei Zhang. arXiv 2017
    A more recent paper on SBD (shot boundary detection) that introduced a complicated yet effective model and a more sophisticated dataset — ClipShots — for cut detection. Detects hard cuts and fades. The code is available here.
  • Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. Joao Carreira, Andrew Zisserman. arXiv 2017
    This paper introduced the Kinetics dataset, a landmark dataset in the video domain, along with a unique model (a combination of networks, to be precise) that achieved state of the art performance on existing video datasets. The code is available here.
  • Ridiculously Fast Shot Boundary Detection with Fully Convolutional Neural Networks. Michael Gygli. arXiv 2017
    One of the first papers that used convolutional neural networks to detect when a cut happened in a video, and also the kind of cut — hard cut or fade.