Deleting unused Django media files

Handling files in Django is pretty easy: you can add them to a model with a single line (for a brush-up on Django models, you can check out our article on handling data web frameworks), and the framework will handle everything for you – validation, uploading, type checking. Even serving them takes very little effort.

However, there is one thing that Django no longer does starting with version 1.3: automatically deleting files from a model when the instance is deleted.
This decision was made for good reasons: in certain cases (such as rolled-back transactions, or a file being referenced from multiple models) the old behaviour was prone to data loss.
Nowadays, almost everyone uses AWS S3, Google Cloud Storage, MS Azure, or one of the many other cloud-based solutions for storing media files, without the hassle and without having to worry about one day running out of space. So why even care that Django doesn’t delete files that are no longer used? First off, not everyone uses “the cloud” to store their files (maybe because of security concerns, or maybe just because they don’t want to). Secondly, those who do use cloud-based storage know that, even though there is theoretically no size limit, the costs can grow quite large if unused files are never deleted.
So let’s dive right in and look at the possible solutions for removing those nasty unused files.

1. Creating a custom management command


This first solution is the one suggested in the Django documentation (the 1.3 release notes mention it). It involves writing a custom management command which goes through the media files tree and checks, for each file, whether it is still referenced from the database. Once everything has been written and tested, you can schedule the command to run on a regular basis, using cron or Celery.
The algorithm is quite simple and consists of four steps:

  1. We search for references to media files in the database — these will be stored in a set.
  2. We recursively create another set which comprises all physical files in the MEDIA_ROOT directory.
  3. The difference between these sets represents files that are physically present, but are not referenced from the database — these are the files we will delete.
  4. In order for our cleanup to be complete, we traverse once again recursively and delete all empty directories.

Now let’s see the code in action:

import os

from django.apps import apps
from django.conf import settings
from django.core.management.base import BaseCommand
from django.db.models import FileField, Q


class Command(BaseCommand):
    help = "Deletes all files from the MEDIA_ROOT directory which are no longer referenced by any model of the installed apps"

    def handle(self, *args, **options):
        physical_files = set()
        db_files = set()

        # Get all files referenced from the database
        for model in apps.get_models():
            for f_ in model._meta.fields:
                if not isinstance(f_, FileField):
                    continue
                is_null = Q(**{'{}__isnull'.format(f_.name): True})
                is_empty = Q(**{'{}__exact'.format(f_.name): ''})
                # query each field separately: values_list(flat=True) only accepts a single field;
                # _base_manager is used so that custom managers cannot hide rows from us
                files = model._base_manager.exclude(is_null | is_empty).values_list(f_.name, flat=True).distinct()
                db_files.update(files)

        # Get all files from the MEDIA_ROOT, recursively
        media_root = getattr(settings, 'MEDIA_ROOT', None)
        if media_root:
            for root, dirs, files in os.walk(media_root):
                for file_ in files:
                    # Compute the file path relative to the media directory, so it can be compared to the values from the db
                    # (joining first avoids a leading './' for files placed directly in MEDIA_ROOT)
                    relative_file = os.path.relpath(os.path.join(root, file_), media_root)
                    # the database stores names with forward slashes, whatever the OS
                    physical_files.add(relative_file.replace(os.sep, '/'))

        # Compute the difference and delete those files
        deletables = physical_files - db_files
        if deletables:
            for file_ in deletables:
                os.remove(os.path.join(media_root, file_))

            # Bottom-up - delete all empty folders
            for root, dirs, files in os.walk(media_root, topdown=False):
                for dir_ in dirs:
                    dir_path = os.path.join(root, dir_)
                    if not os.listdir(dir_path):
                        os.rmdir(dir_path)
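
Since this command deletes files for good, it can be reassuring to try the filesystem half (steps 2 to 4) on its own first. The following is only a minimal sketch under a few assumptions: `find_orphans`, `remove_orphans` and the `dry_run` flag are hypothetical names that are not part of the command above, and `referenced` stands for the set of file names collected from the database.

```python
import os


def find_orphans(media_root, referenced):
    """Return the relative paths found under media_root that are not in `referenced`."""
    physical = set()
    for root, _dirs, files in os.walk(media_root):
        for name in files:
            relative = os.path.relpath(os.path.join(root, name), media_root)
            # normalise to forward slashes so the paths compare equal to database values
            physical.add(relative.replace(os.sep, "/"))
    return physical - set(referenced)


def remove_orphans(media_root, referenced, dry_run=True):
    """Return the sorted orphans; delete them (and empty folders) only if dry_run is False."""
    orphans = sorted(find_orphans(media_root, referenced))
    if not dry_run:
        for relative in orphans:
            os.remove(os.path.join(media_root, relative))
        # bottom-up walk, so that nested folders which became empty are removed as well
        for root, dirs, _files in os.walk(media_root, topdown=False):
            for name in dirs:
                path = os.path.join(root, name)
                if not os.listdir(path):
                    os.rmdir(path)
    return orphans
```

With `dry_run` left at its default, the function only reports what it would delete, which makes it easy to review the list before running the real cleanup.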

2. Using signals


This is my favourite way of doing it, because it provides more control than the previous solution. We have used Django signals before and written about them on this blog. The comparison between the two solutions, with advantages and disadvantages, is left for the end of this article.
There are two cases in which we will want to delete a file:

  1. When the model instance to which the file belongs is deleted – here we can simply use the post_delete signal, which will ensure that the instance has already been deleted from the database successfully. The code for this part is pretty straightforward:
    from django.db.models import FileField
    from django.db.models.signals import post_delete, post_save, pre_save
    from django.dispatch import receiver
    LOCAL_APPS = [
        'my_app1',
        'my_app2',
        '...'
    ]
    def delete_files(files_list):
        for file_ in files_list:
            if file_ and hasattr(file_, 'storage') and hasattr(file_, 'name'):
                # delete through the storage API, using the name relative to the storage root:
                # unlike `path`, this also works with remote storages (e.g. when using django-storages)
                file_.storage.delete(file_.name)
    @receiver(post_delete)
    def handle_files_on_delete(sender, instance, **kwargs):
        # presumably you want this behavior only for your apps, in which case you will have to specify them
        is_valid_app = sender._meta.app_label in LOCAL_APPS
        if is_valid_app:
            delete_files([getattr(instance, field_.name, None) for field_ in sender._meta.fields if isinstance(field_, FileField)])
  2. When a file is being replaced – in this case we must delete the old file and keep the new one, but only if everything is successful. The simplest way to do it would be in the pre_save signal, where we can still recover the value of the old file from the database. However, if any error appears during the instance save, the old file will be lost forever even though the database still references it. So we have to do it in the post_save signal, once we know that the instance was successfully saved in the database. This has a big caveat of its own: in the post_save signal we no longer have access to the previous values of the file fields, meaning we no longer know which file(s) to delete. The final solution is to use the pre_save signal to memorise the old values, and to perform the actual deletion in the post_save signal. We will use a temporary cache on the model instance to keep the old values:
@receiver(pre_save)
def set_instance_cache(sender, instance, **kwargs):
    # prevent errors when loading files from fixtures
    from_fixture = kwargs.get('raw', False)
    is_valid_app = sender._meta.app_label in LOCAL_APPS
    if is_valid_app and not from_fixture:
        # retrieve the stored instance from the database to get the old file values
        # (do not call refresh_from_db on `instance` itself: it would overwrite the new values)
        old_instance = sender.objects.filter(pk=instance.pk).first()
        if old_instance is not None:
            # for each FileField, we will keep the original value inside an ephemeral `cache`
            instance.files_cache = {
                field_.name: getattr(old_instance, field_.name, None)
                for field_ in sender._meta.fields if isinstance(field_, FileField)
            }


@receiver(post_save)
def handle_files_on_update(sender, instance, **kwargs):
    # the cache only exists for instances of LOCAL_APPS which were already in the database
    if getattr(instance, 'files_cache', None):
        deletables = []
        for field_name, old_file_value in instance.files_cache.items():
            new_file_value = getattr(instance, field_name, None)
            # only delete the files that have changed
            if old_file_value and old_file_value != new_file_value:
                deletables.append(old_file_value)
        delete_files(deletables)
        # refresh the cache, so that a second save of the same instance compares against the current files
        instance.files_cache = {field_name: getattr(instance, field_name, None) for field_name in instance.files_cache}

In case you are wondering “Why hasn’t anyone made a library out of this?”, they actually did. In fact, you can find several solutions which delete a file once it is no longer used, such as django-cleanup. It is up to you to decide what is best for your project.

Comparison


There are other ways to delete orphan files with Django which are not presented in this article. For example, if you know for sure that you will only have a few file fields in your project, you may want to choose a more tailored, per-field approach. Or, perhaps, you want to soften the effects of deleting these files by moving them to a temporary storage before permanently deleting them.
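
As an illustration of that last idea, here is a minimal sketch for files stored on the local disk. Everything in it is an assumption made for the example: the helper names `move_to_trash` and `purge_trash`, the separate `trash_root` folder and the age-based purge are hypothetical and do not come from Django.

```python
import os
import shutil
import time


def move_to_trash(media_root, relative_path, trash_root):
    """Move a file from media_root to trash_root, keeping its relative path."""
    source = os.path.join(media_root, relative_path)
    target = os.path.join(trash_root, relative_path)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    shutil.move(source, target)
    # reset the modification time, so the file's age counts from the moment it was trashed
    os.utime(target, None)
    return target


def purge_trash(trash_root, max_age_seconds, now=None):
    """Permanently delete trashed files older than max_age_seconds; return their paths."""
    now = time.time() if now is None else now
    purged = []
    for root, _dirs, files in os.walk(trash_root):
        for name in files:
            path = os.path.join(root, name)
            if now - os.path.getmtime(path) > max_age_seconds:
                os.remove(path)
                purged.append(path)
    return purged
```

A cleanup routine could call `move_to_trash` where it would otherwise delete the file, while `purge_trash` runs on a schedule with a grace period of, say, a few days.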
For now, let’s see how these two methods compare, by listing their pluses and minuses:

Management command

+ A custom management command only runs every so often, and it can do so asynchronously. Hence, this solution can result in better overall performance, since it doesn’t intervene in the request-response cycle.
+ If executed manually (that is, not inside a cron job), this can help prevent the loss of files caused by migrations or transactions.
+ By checking everything in the database, we make sure that no file is deleted if at least one reference to it exists (this takes care of the problem with multiple references).
- Customizing management commands is slightly more difficult, and passing those arguments to a cron job is rather ugly.
- If the command is not executed often enough, you could still run into storage size problems.
- This does not take care of different storage backends (at least not in the current implementation).
- The running time of the command depends on the database size and on the media folder size, which can quickly become problematic once they increase.
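
The storage limitation could probably be lifted by going through Django's storage API instead of `os.walk`. The sketch below relies only on the documented `listdir(path)` method (which returns a tuple of directories and files) and on `delete(name)`; the helper names `iter_storage_files` and `delete_unreferenced` are hypothetical, and not every backend is guaranteed to implement `listdir`, so treat this as an assumption to verify against the storage you use.

```python
import posixpath


def iter_storage_files(storage, path=""):
    """Yield the name of every file below `path`, using only storage.listdir()."""
    directories, files = storage.listdir(path)
    for name in files:
        yield posixpath.join(path, name)
    for directory in directories:
        # recurse into sub-directories, building names relative to the storage root
        yield from iter_storage_files(storage, posixpath.join(path, directory))


def delete_unreferenced(storage, referenced):
    """Delete every stored file whose name is not in `referenced`; return the deleted names."""
    orphans = sorted(set(iter_storage_files(storage)) - set(referenced))
    for name in orphans:
        storage.delete(name)
    return orphans
```

Inside the command, `storage` would typically be `django.core.files.storage.default_storage` and `referenced` the set of names read from the database.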

Signals

+ Having everything implemented through signals allows for a high degree of control over the files being replaced: you can easily add extra logic which helps decide whether a file should be deleted or not (e.g. based on user account type).
+ This solution integrates nicely in the application flow and it doesn’t take too long to run, since everything is done on the spot. At the same time, it is much easier to implement for most programmers, who are already accustomed to using Django signals.
- Even though we took care to handle file replacement in the post_save signal, an error can still occur after this signal is handled (for example, later in the same transaction), which would result in an ‘unsuccessful’ save.
- It does not account for multiple references to the same file (even though, unless you are directly modifying your database, this should never happen).
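
One possible way to soften the first drawback is to postpone the deletion until the surrounding database transaction has been committed. The sketch below is an assumption rather than part of the solution above: `schedule_file_deletion` is a hypothetical helper, and the `on_commit` argument is expected to behave like Django's `django.db.transaction.on_commit`, which takes a callable and runs it after a successful commit.

```python
from functools import partial


def schedule_file_deletion(files, on_commit):
    """Register one deletion callback per non-empty file instead of deleting right away."""
    scheduled = []
    for file_ in files:
        if file_ and hasattr(file_, "storage") and hasattr(file_, "name"):
            # partial binds this storage/name pair now; the delete itself runs later
            on_commit(partial(file_.storage.delete, file_.name))
            scheduled.append(file_.name)
    return scheduled
```

In the signal handlers, `delete_files(deletables)` would then become `schedule_file_deletion(deletables, transaction.on_commit)`. If the transaction is rolled back, the callbacks are discarded and the old files stay in place; when no transaction is open, Django runs the callback immediately.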

Conclusion


Before trying to implement a mechanism for deleting unused media files, you should always:

  • consider why the Django team decided to remove this feature in the first place
  • check that the solution you choose doesn’t introduce more problems than it solves
  • ensure no unwanted data-loss is possible (test your code!)

It is up to you to decide whether or not you need this behaviour and what the best way to implement it is. In this article, we have presented some possible solutions, and hopefully they will be of help to some of our readers. In the meantime, we would love to hear your opinions, questions and suggestions, so don’t hesitate to contact us using the comment section. Feel free to check out our other Python articles, including our guidelines for solving Django migration conflicts.
