Fake Anime Characters Using Deep Convolutional GANs
Implementing a Deep Convolutional Generative Adversarial Network (DCGAN) to generate high-quality 128x128 images of fake anime cartoon characters.

Project Overview
With this project, I wanted to create some clean-looking cartoon characters. Obviously, using the latest StyleGAN might have been a better choice, but I decided to stick to a very simple paper for now and got pretty neat results. I've implemented the paper called DCGAN to generate fake anime characters. Such a simple model is actually able to produce surprisingly good results when trained long enough. The paper I'm implementing came out back in 2015, just one year after Ian Goodfellow introduced GANs to the world, and it was one of the first papers to give image generation tasks a serious boost.
Introducing The Paper
There had been many discussions that, even though we have plenty of objective functions for improving adversarial training for image generation, what affects the generator's final outputs the most is the actual architecture of the discriminator and the generator.
The authors of this paper, titled Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, essentially try to address this problem of finding stable architectures for the generator and the discriminator: they did extensive model exploration and finally arrived at several guidelines to follow when implementing GANs for image generation.

Before this paper came out, everyone was used to training GANs built on top of fully connected layers. For the most part they are okay, but generating high-resolution images with them was a nightmare. First of all, the generator's outputs were very bad and the models were very difficult to train. Secondly, suppose you're trying to generate images of size 3x1024x1024: with fully connected layers you'd be computing a dot product to produce over 3 million output units, which is just horrible.
And this paper solved these issues with charm. Okay then, let's have a look.
An interesting fact: one of the authors of this paper, Soumith Chintala, is also a co-creator of PyTorch and wrote some of the very first lines of code in its development :)
The Nub
We propose and evaluate a set of constraints on the architectural topology of Convolutional GANs that make them stable to train in most settings. We name this class of architectures Deep Convolutional GANs (DCGAN)
We visualize the filters learnt by GANs and empirically show that specific filters have learned to draw specific objects
After extensive model exploration we identified a family of architectures that resulted in stable training across a range of datasets and allowed for training higher resolution and deeper generative models
— Authors
In summary, they're trying to stabilize adversarial training by applying a few topological constraints on both the generator and the discriminator. They also monitored the filters learned by the discriminator and concluded that, when trained under their guidelines, the discriminator can learn interpretable features.
The Working
Cool then, now let’s have a look at their guidelines:
- Don't use any pooling layers: Instead, use strided convolutions in the discriminator and strided transposed convolutions in the generator. The idea behind this point is to let both models learn their own downsampling and upsampling procedures.
- Remove any fully connected layers at the end of the discriminator: They suggest using a convolutional layer to produce the discriminator's logits instead, which can be done by setting the kernel's height and width to those of its input feature map and the number of output channels to 1.
- Use batch normalization in both models: They found it highly effective for stabilizing the training of deeper GANs. It deals with poor initialization and helps keep the gradient flow stable. They also found batch normalization to be one of the most critical tools for preventing mode collapse, the situation where the generator's outputs become largely independent of its inputs and it produces nearly identical-looking images.
One important note:
Directly applying batch norm to all layers, however, resulted in sample oscillation and model instability. This was avoided by not applying batch norm to the generator output layer and the discriminator input layer
— Authors
- Use ReLU activations in the generator and Leaky ReLU in the discriminator: There's no deep theory here; through experimentation the authors observed that a bounded activation (the tanh at the generator's output) allowed the model to learn more quickly to saturate and cover the color space of the training distribution, with ReLU used everywhere else in the generator. For the discriminator they simply found Leaky ReLU to work well, especially for producing higher-resolution images.
Understanding Transposed Convolutions
This operation will be used in the generator, so we need to make sure we understand it. Okay, let's have a look.
A vanilla CNN layer is downsampling in nature, meaning the size of its outputs is smaller than that of its inputs. In a nutshell, a convolution compresses its inputs.
A transposed convolution layer is upsampling in nature, so the size of its outputs is larger than that of its inputs; we could say it decompresses its inputs (though the word decompress is a slight misuse here).
In mathematics, the idea of any transpose operation is to perform a kind of switch (for instance, transposing a matrix means switching rows with columns). Following that, we can say a transposed convolution switches the (large) input dimensions and the (small) output dimensions of a vanilla convolution, and in that sense it is decompressing.
Let's take a simple example to understand the transposed convolution operation with stride 1 and padding 0.
Assume we have an input of shape 2x2 and a kernel of shape 2x2 as follows:
=== Input ===
|0 1|
|2 3|

=== Kernel ===
|0 1|
|2 3|
Now multiply each input element with the kernel. This gives 4 matrices, since we have 4 elements in the input matrix:
=== 0 * Kernel ===
|0 0 -|
|0 0 -|
|- - -|

=== 1 * Kernel ===
|- 0 1|
|- 2 3|
|- - -|

=== 2 * Kernel ===
|- - -|
|0 2 -|
|4 6 -|

=== 3 * Kernel ===
|- - -|
|- 0 3|
|- 6 9|
You can consider the "-" symbol to be 0. So we have:
=== 0 * Kernel ===
|0 0 0|
|0 0 0|
|0 0 0|

=== 1 * Kernel ===
|0 0 1|
|0 2 3|
|0 0 0|

=== 2 * Kernel ===
|0 0 0|
|0 2 0|
|4 6 0|

=== 3 * Kernel ===
|0 0 0|
|0 0 3|
|0 6 9|
Now adding all these matrices will give us the final output:
=== Outputs ===
|0 0 1|
|0 4 6|
|4 12 9|
That’s it :)
A from-scratch implementation of the transposed convolution in NumPy, for stride 1 and padding 0, is given below. But first, let's compare it with a vanilla convolution.
Pseudo-code for the receptive field of a CNN layer in Python can be written as:
output[i, j] = (image[i: i + h, j: j + w] * kernel).sum()
The above formulation basically summarizes input values through the kernel. On the other hand, the transposed convolution broadcasts input values through the kernel, hence gives a larger output shape.
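Here's a minimal from-scratch NumPy sketch of that broadcasting idea, for stride 1 and padding 0 (the function and argument names here are mine):

import numpy as np

def conv_transpose2d(image, kernel):
    # Transposed convolution with stride 1 and padding 0:
    # every input element is broadcast through the kernel and the
    # shifted copies are summed into the (larger) output.
    ih, iw = image.shape
    kh, kw = kernel.shape
    output = np.zeros((ih + kh - 1, iw + kw - 1))
    for i in range(ih):
        for j in range(iw):
            output[i: i + kh, j: j + kw] += image[i, j] * kernel
    return output

image = np.array([[0., 1.], [2., 3.]])
kernel = np.array([[0., 1.], [2., 3.]])
print(conv_transpose2d(image, kernel))
# [[ 0.  0.  1.]
#  [ 0.  4.  6.]
#  [ 4. 12.  9.]]

The output matches the worked example above.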
Padding in Transposed Convolutions
In CNNs we apply padding to the input, but in transposed convolutions padding is applied to the output. For instance, 1x1 padding means we first compute the transposed convolution output as normal and then remove the first and last rows and columns from it, decreasing the output's height and width by 2.
Below is an implementation from scratch in NumPy:
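A minimal sketch building on the conv_transpose2d function above: compute the stride-1 result first, then crop padding rows and columns from every border (the function name is again mine):

def conv_transpose2d_padded(image, kernel, padding=0):
    # Compute the plain stride-1 transposed convolution first,
    # then crop `padding` rows/columns from every border of the output.
    full = conv_transpose2d(image, kernel)
    if padding == 0:
        return full
    return full[padding:-padding, padding:-padding]

# With 1x1 padding only the centre of the 3x3 output survives:
print(conv_transpose2d_padded(image, kernel, padding=1))
# [[4.]]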
Strides in Transposed Convolutions
In CNNs the stride affects how many steps the receptive field moves before computing the next multiplication, but in transposed convolutions the stride spreads the elements of the input apart. For instance, if we apply a 2x2 stride to the input used in the example above, it becomes:
=== Input ===
|0 0 1|
|0 0 0|
|2 0 3|
So the stride in a transposed convolution determines the spacing between input elements; here, with stride 2, neighbouring elements of the input (for instance 2 and 3) are placed 2 steps apart.
Below is an implementation from scratch in NumPy:
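A sketch that generalizes the earlier function with stride and padding arguments (still assuming the same image and kernel from the worked example):

def conv_transpose2d_strided(image, kernel, stride=1, padding=0):
    # Stride spreads the input elements apart before broadcasting
    # each of them through the kernel; padding then crops the borders.
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh = (ih - 1) * stride + kh
    ow = (iw - 1) * stride + kw
    output = np.zeros((oh, ow))
    for i in range(ih):
        for j in range(iw):
            output[i * stride: i * stride + kh,
                   j * stride: j * stride + kw] += image[i, j] * kernel
    if padding > 0:
        output = output[padding:-padding, padding:-padding]
    return output

print(conv_transpose2d_strided(image, kernel, stride=2))
# [[0. 0. 0. 1.]
#  [0. 0. 2. 3.]
#  [0. 2. 0. 3.]
#  [4. 6. 6. 9.]]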
And finally, below is a worked example from Stanford :)

Note: there's one more idea, dilation, that exists in both CNNs and transposed convolutions, but I don't want to overwhelm you with too many details, and dilation isn't needed to follow this project anyway.
Transposed Convolution is not Deconvolution
I used to make this mistake all the time. Let me state it out loud: deconvolution is not transposed convolution. I know why these terms get used incorrectly, since deconvolution sounds like the opposite of convolution, so why not call a transposed convolution a deconvolution, right? Wrong.
Actually, deconvolution isn't even a core part of deep learning. This is also why, in DL frameworks, you'll only see layers with names like Conv2DTranspose or ConvTranspose2d, never Deconvolution.
Deconvolution is the process of filtering a signal to compensate for an undesired convolution. The goal of deconvolution is to recreate the signal as it existed before the convolution took place — Source
The basic idea behind deconvolution is to recover the original input from its transformed version (transformed by a convolution operation).
A transposed convolution can't recover the original input, since its output depends on the values of the kernel; those values could in principle be trained to approximate such an inversion, but you get the point.
Model Initialization
One more thing before continuing to the project implementation: the authors initialized the CNN and transposed CNN layers from a normal distribution with mean 0.0 and standard deviation 0.02, and for the batch normalization layers the gamma parameters are initialized the same way, while the beta parameters are initialized to a constant 0.0.
Continuing To The Project
I found this dataset on Kaggle; it has over 60,000 anime faces. Quite a few of the faces are blurred, which could make our generator produce blurred images as well, but the rest of the faces are pretty neat, so not bad overall.

Creating The DataLoader
I tried a few more RGB data augmentations, like brightness, hue, and saturation, but surprisingly those augmentations were keeping the models from converging. My guess is that I set their magnitudes too high, which made my real data vary too much, so the discriminator's loss on real images wasn't converging, and thus neither was the generator's.
To be on the safe side, I only applied random horizontal flipping (lol). Though I'd suggest you try out those RGB augmentations too, at least with smaller magnitudes.
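For reference, here's a rough sketch of what the transform and dataloader setup can look like (the dataset path and batch size here are illustrative):

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(128),
    transforms.CenterCrop(128),
    transforms.RandomHorizontalFlip(p=0.5),  # the only augmentation kept
    transforms.ToTensor(),
    # scale pixels to [-1, 1] to match the generator's tanh output
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# ImageFolder expects the images inside a subfolder, e.g. data/anime/faces/*.png
dataset = datasets.ImageFolder(root="data/anime", transform=transform)
loader = DataLoader(dataset, batch_size=128, shuffle=True, num_workers=2, drop_last=True)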
Creating The Discriminator
I'm following the discriminator architecture from the paper: a sequence of blocks, where each block contains a convolution followed by batch normalization and Leaky ReLU.
Also, remember not to apply batch normalization in the very first block, as described in the paper.
And no fully connected layers at the end; instead, a convolutional layer that outputs a tensor of shape (-1, 1, 1, 1).
So here's what the discriminator looks like:
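Here's a sketch that matches the summary below (class and helper names are mine; the strided convolutions use kernel size 4, stride 2, padding 1):

import torch.nn as nn

def disc_block(in_ch, out_ch):
    # strided convolution -> batch norm -> Leaky ReLU
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.main = nn.Sequential(
            # first block: no batch norm, as the paper suggests
            nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),  # 3x128x128 -> 64x64x64
            nn.LeakyReLU(0.2, inplace=True),
            disc_block(64, 128),    # -> 128x32x32
            disc_block(128, 256),   # -> 256x16x16
            disc_block(256, 512),   # -> 512x8x8
            disc_block(512, 1024),  # -> 1024x4x4
            # a convolution instead of a fully connected layer: 1024x4x4 -> 1x1x1
            nn.Conv2d(1024, 1, kernel_size=4, stride=1, padding=0),
            nn.Flatten(),           # logits of shape (-1, 1)
        )

    def forward(self, x):
        return self.main(x)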
And here is its summary:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
Conv2d-1 [-1, 64, 64, 64] 3,136
LeakyReLU-2 [-1, 64, 64, 64] 0
Conv2d-3 [-1, 128, 32, 32] 131,072
BatchNorm2d-4 [-1, 128, 32, 32] 256
LeakyReLU-5 [-1, 128, 32, 32] 0
Conv2d-6 [-1, 256, 16, 16] 524,288
BatchNorm2d-7 [-1, 256, 16, 16] 512
LeakyReLU-8 [-1, 256, 16, 16] 0
Conv2d-9 [-1, 512, 8, 8] 2,097,152
BatchNorm2d-10 [-1, 512, 8, 8] 1,024
LeakyReLU-11 [-1, 512, 8, 8] 0
Conv2d-12 [-1, 1024, 4, 4] 8,388,608
BatchNorm2d-13 [-1, 1024, 4, 4] 2,048
LeakyReLU-14 [-1, 1024, 4, 4] 0
Conv2d-15 [-1, 1, 1, 1] 16,385
Flatten-16 [-1, 1] 0
================================================================
Total params: 11,164,481
Trainable params: 11,164,481
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.19
Forward/backward pass size (MB): 9.63
Params size (MB): 42.59
Estimated Total Size (MB): 52.40
----------------------------------------------------------------
Note: I'm not applying a sigmoid activation at the discriminator's output layer because I'll be using binary cross-entropy loss with logits, which is much more stable and less prone to producing NaNs, which are a nightmare of course.
Creating The Generator
And as I said, we'll be using transposed convolutions to upsample the input until the generator finally produces outputs with a height and width of 128.
In the paper they also used a tanh activation on the generator's final outputs, versus the sigmoid everyone used to use before this paper. This is why we normalized our real data with mean 0.5 and standard deviation 0.5, so that it ranges from -1 to 1, which is also the range of the tanh activation.
Also, make sure to use ReLU here instead of Leaky ReLU, as described in the paper.
The generator's input noise has shape (-1, 128), which is reshaped to (-1, 128, 1, 1) so that we can apply transposed convolutions on top of it.
And also, no batch normalization in the final block.
So here's what the generator looks like:
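Again, a sketch that matches the summary below (names are mine; all strided transposed convolutions use kernel size 4, stride 2, padding 1):

import torch.nn as nn

def gen_block(in_ch, out_ch):
    # strided transposed convolution -> batch norm -> ReLU
    return nn.Sequential(
        nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    def __init__(self, noise_dim=128):
        super().__init__()
        self.noise_dim = noise_dim
        self.main = nn.Sequential(
            # project the (-1, 128, 1, 1) noise to a 1024x4x4 feature map
            nn.ConvTranspose2d(noise_dim, 1024, kernel_size=4, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(1024),
            nn.ReLU(inplace=True),
            gen_block(1024, 512),  # -> 512x8x8
            gen_block(512, 256),   # -> 256x16x16
            gen_block(256, 128),   # -> 128x32x32
            gen_block(128, 64),    # -> 64x64x64
            # final block: no batch norm, tanh output in [-1, 1]
            nn.ConvTranspose2d(64, 3, kernel_size=4, stride=2, padding=1),
            nn.Tanh(),             # -> 3x128x128
        )

    def forward(self, noise):
        # reshape (-1, 128) noise to (-1, 128, 1, 1) before the transposed convolutions
        return self.main(noise.view(-1, self.noise_dim, 1, 1))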
And here is its summary:
----------------------------------------------------------------
Layer (type) Output Shape Param #
================================================================
ConvTranspose2d-1 [-1, 1024, 4, 4] 2,097,152
BatchNorm2d-2 [-1, 1024, 4, 4] 2,048
ReLU-3 [-1, 1024, 4, 4] 0
ConvTranspose2d-4 [-1, 512, 8, 8] 8,388,608
BatchNorm2d-5 [-1, 512, 8, 8] 1,024
ReLU-6 [-1, 512, 8, 8] 0
ConvTranspose2d-7 [-1, 256, 16, 16] 2,097,152
BatchNorm2d-8 [-1, 256, 16, 16] 512
ReLU-9 [-1, 256, 16, 16] 0
ConvTranspose2d-10 [-1, 128, 32, 32] 524,288
BatchNorm2d-11 [-1, 128, 32, 32] 256
ReLU-12 [-1, 128, 32, 32] 0
ConvTranspose2d-13 [-1, 64, 64, 64] 131,072
BatchNorm2d-14 [-1, 64, 64, 64] 128
ReLU-15 [-1, 64, 64, 64] 0
ConvTranspose2d-16 [-1, 3, 128, 128] 3,075
Tanh-17 [-1, 3, 128, 128] 0
================================================================
Total params: 13,245,315
Trainable params: 13,245,315
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 12.38
Params size (MB): 50.53
Estimated Total Size (MB): 62.90
----------------------------------------------------------------
Initializing Parameters
Now we'll initialize our parameters as described in the paper, which can be done as follows:
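A sketch of one way to do it, using the classes from the earlier snippets (the init_params name is mine). It follows the description above literally; note that many DCGAN implementations instead draw the batch-norm gamma from N(1.0, 0.02):

import torch.nn as nn

def init_params(model):
    for module in model.modules():
        if isinstance(module, (nn.Conv2d, nn.ConvTranspose2d)):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
        elif isinstance(module, nn.BatchNorm2d):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)  # gamma
            nn.init.zeros_(module.bias)                         # beta

disc = Discriminator()
gen = Generator()
init_params(disc)
init_params(gen)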
Defining Loss Functions
As I said earlier, I’m using binary cross-entropy with logits because of its stability.
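In PyTorch this is a single line; BCEWithLogitsLoss fuses the sigmoid with the binary cross-entropy in a numerically stable way, so the discriminator can output raw logits:

import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()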
Defining Optimizers
If you've ever wondered why we use 0.5 as the beta1 value while training DCGANs, well, this is the paper that proposed it. And from my experience, you can set it as low as 0.4 or so, especially when your dataset is huge.
So below we are initializing the optimizers for both of our models.
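A sketch assuming the paper's defaults (learning rate 0.0002, beta1 = 0.5) and the disc/gen objects created above:

import torch

lr = 2e-4  # learning rate suggested in the paper
disc_optimizer = torch.optim.Adam(disc.parameters(), lr=lr, betas=(0.5, 0.999))
gen_optimizer = torch.optim.Adam(gen.parameters(), lr=lr, betas=(0.5, 0.999))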
Training The Model
Now, since we have 11 million parameters in the discriminator and 13 million in the generator, you have to be patient enough to train the models to their full potential. I'm using Google Colab, and it took me well over 12 hours.
Of course, you can't run Google Colab for that long continuously, so I wrote a helper script that saves the parameters to Google Drive so they can be reloaded when resuming training; make sure to check the code at the end of the article if you're interested.
Okay, so here’s the main part of the training:
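Here's a simplified sketch of the training loop using the pieces defined above (device handling kept minimal; logging and the Drive checkpointing are omitted, and num_epochs is illustrative):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
disc, gen = disc.to(device), gen.to(device)
num_epochs = 100  # illustrative; train as long as your patience allows

for epoch in range(num_epochs):
    for real_images, _ in loader:
        real_images = real_images.to(device)
        batch_size = real_images.size(0)
        real_labels = torch.ones(batch_size, 1, device=device)
        fake_labels = torch.zeros(batch_size, 1, device=device)

        # ---- train the discriminator ----
        noise = torch.randn(batch_size, 128, device=device)
        false_images = gen(noise)
        disc_loss = criterion(disc(real_images), real_labels) + \
                    criterion(disc(false_images.detach()), fake_labels)
        disc_optimizer.zero_grad()
        disc_loss.backward()
        disc_optimizer.step()

        # ---- train the generator ----
        # disc(false_images) is recomputed here, so the generator competes
        # against the freshly updated discriminator (no retain_graph=True needed)
        gen_loss = criterion(disc(false_images), real_labels)
        gen_optimizer.zero_grad()
        gen_loss.backward()
        gen_optimizer.step()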
Note: unlike other GAN implementations you'll find on the internet, I'm not using retain_graph=True when training the discriminator, because after training the disc (discriminator) first, d(disc_output)/d(false_images), the partial derivative of the disc's output with respect to the generated images, is now different, since the disc's parameters have changed.
So by computing disc(false_images) again, the gen (generator) competes against the current disc (with the latest updated parameters), not the previous one.
Using retain_graph=True is usually okay when the disc's learning rate is small and the dataset isn't very complicated, so that d(disc_output)/d(input_image) doesn't change much (when the dataset is simple, the disc easily figures out which input images are real, so after a few training iterations d(disc_output)/d(input_image) becomes stable and barely changes).
Final Outputs
Okay so after all that hassle, let’s see what we have achieved:

Observations from final outputs:
- The generator has at least successfully learned the features that make an anime face an anime face, like the shine in the hair and eyes, the pointy chin, etc. (based on the dataset).
- From my observations, there is no significant sign of mode collapse, which is awesome considering I didn't use any of the techniques recommended for preventing it.
- Sometimes the generator does produce very blurry faces, because the dataset I used contains blurry faces too.
There are a lot of techniques that could have made this project way better, for instance better normalization techniques like pixel normalization, layers like minibatch discrimination in the discriminator, or a gradient penalty and the WGAN loss for training, all of which stabilize training and help the generator produce better images. But I wanted to see how far a simple architecture could go. And now I know :)
Though as I was training the models, I continuously observed the generator's outputs and decreased the learning rate for both the discriminator and the generator whenever I saw the generated outputs becoming unstable.
Noise Interpolation
This technique is used in almost every GAN paper to check whether the generator is cheating or not.
The idea is to generate images for a series of noise vectors, where each series is created by interpolating between two noise vectors. You can think of the interpolation as a transition from one phase to another.
Pseudo-code for interpolating between two noise vectors is below, where k is the number of interpolation steps we want to create.
[((1.0 - (i/k)) * noise1) + ((i/k) * noise2) for i in range(k + 1)]
The reason we want to generate outputs for such inputs is to check whether the transition from one noise vector to another is smooth. If the generated outputs are fuzzy and indistinct, it means your generator is simply cheating and hasn't properly mapped the noise to the outputs.
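A small sketch of how this can be done with the generator and device from the training snippet above (k and the rescaling step are illustrative):

import torch

k = 8  # number of interpolation steps
noise1, noise2 = torch.randn(128), torch.randn(128)

# linear interpolation between the two noise vectors, k + 1 points in total
interpolated = torch.stack(
    [((1.0 - (i / k)) * noise1) + ((i / k) * noise2) for i in range(k + 1)]
)

with torch.no_grad():
    images = gen(interpolated.to(device))  # shape: (k + 1, 3, 128, 128), values in [-1, 1]
    images = (images + 1) / 2              # rescale to [0, 1] for visualization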
A good generator will always (most of the time if not always) provide good quality interpolated images. And below are interpolation outputs from my generator:

And as you can see above, in the 2nd and 6th rows the generated images in between are not as clear as the leftmost and rightmost ones, which means there's still room for improvement.
Conclusion
This paper provides us with a set of guidelines in the form of constraints on the models' topology:
- Don't use any pooling layers.
- Remove any fully connected layers at the end of the discriminator.
- Use batch normalization in both models.
- Use ReLU activations in the generator and Leaky ReLU in the discriminator.
And don't forget to train the models long enough, otherwise you simply won't get the best of what they can offer.
Here’s the GitHub link to my project.
Okay then, I hope this project implementation has been helpful for ya.
And thanks for reading.