Renaming Files to Their Hashes With Bash

Author's Note 2022-12-29

As of the date of this edit, I'm using a new solution written in fish, which can be found in my dotfiles repo on GitLab. I'm going to leave this here as there's some cool bash scripting knowledge that I'll probably want in the future.

Preface

The way I organize my images is by throwing them all in a single folder and assigning metadata tags to them. Because I use metadata for organization, the names aren't relevant, and I usually leave them as is. However, sometimes a name contains a common word that shows up when I'm searching for other documents, or, in rare circumstances, two files end up with the same name. My solution is to rename every file in my pictures folder to the sha1sum of its contents, which makes the filename unique as long as the contents differ.

The Script

#!/usr/bin/env bash
set -euo pipefail

for i in "$1"/*; do
    full_filename=$i
    filename=${full_filename##*/}
    no_extension=${filename%%.*}
    num_chars=${#no_extension}

    if [[ ( -f "$i" ) && (${num_chars} != 40) ]]; then
        sum=$(shasum "$i")
        echo "$i" "$1/${sum%% *}.${i##*.}"

        if [[ $2 == true ]]; then
            mv "$i" "$1/${sum%% *}.${i##*.}"
        fi
    fi
done

Usage

This script accepts two arguments: the directory whose files should be renamed, and a value that determines whether the mv commands are actually executed. Both are required, because set -u makes bash abort on an unset $2 as soon as it reaches a file that needs renaming. It doesn't matter whether you include the "/" after the directory; Linux doesn't seem to care, and I assume macOS won't either.

Breakdown

#!/usr/bin/env bash
set -euo pipefail

If you've seen executable scripts before, you'll recognize the first line as the shebang line, which tells the OS what program the script should run with, in this case bash, the Bourne Again SHell.

The second line enables a "strict" mode in bash. It causes bash to stop with an error on failed commands, unset variables, and failed pipeline stages, which catches many subtle bugs early, so I would strongly recommend doing this. Here's a more complete explanation: Strict Mode
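As a rough illustration of what each of the three options changes, here is a small sketch that runs each case in a child bash so a failure doesn't take the demo down with it. The variable names are made up for this example.

```shell
#!/usr/bin/env bash
# Each case runs in a child bash so its failure cannot stop this demo.

# -u: expanding an unset variable is an error instead of an empty string.
unset_status=0
bash -c 'set -u; echo "value: $not_set"' 2>/dev/null || unset_status=$?

# -e: the script exits at the first failing command, so "unreachable" never prints.
e_output=$(bash -c 'set -e; false; echo unreachable') || true

# pipefail: a pipeline fails if any stage fails, not only the last one.
without_pipefail=0
bash -c 'false | true' || without_pipefail=$?
with_pipefail=0
bash -c 'set -o pipefail; false | true' || with_pipefail=$?

echo "unset variable exit status: $unset_status"
echo "output after failing command: '$e_output'"
echo "false | true: $without_pipefail (default), $with_pipefail (pipefail)"
```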


for i in "$1"/*; do

This is the start of a for loop in bash. In plain English, this is saying for each thing in the directory the user supplied to me, do something. for i declares the variable i which will be used to reference what file is being used in each iteration of the loop. "$1" expands into the directory supplied by the user on the command line. The /* at the end is called a glob, and causes the whole expression to expand into every file path inside the user supplied directory.
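To see the glob in action, here is a minimal sketch using throwaway directories from mktemp and made-up file names. It also shows one thing that's easy to miss: in an empty directory bash leaves the pattern as a literal string unless nullglob is set. The script above gets away without nullglob because the -f check skips that literal.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway directory with made-up file names, one containing a space.
dir=$(mktemp -d)
touch "$dir/cat.jpg" "$dir/holiday photo.png"

count=0
for i in "$dir"/*; do
    # Each $i is a full path; quoting keeps the space intact.
    echo "found: $i"
    count=$((count + 1))
done

# With no matches, bash keeps the pattern as a literal string by default,
# so the loop body runs once with a path that does not exist.
empty=$(mktemp -d)
literal_count=0
for i in "$empty"/*; do
    literal_count=$((literal_count + 1))
done

# nullglob makes a non-matching pattern expand to nothing instead.
shopt -s nullglob
none_count=0
for i in "$empty"/*; do
    none_count=$((none_count + 1))
done
shopt -u nullglob

echo "files: $count, empty dir: $literal_count, with nullglob: $none_count"
rm -r "$dir" "$empty"
```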


    full_filename=$i
    filename=${full_filename##*/}
    no_extension=${filename%%.*}
    num_chars=${#no_extension}

This is a roundabout way to figure out the number of characters in the name of a file, ignoring the rest of the path to get to the file, as well as any extensions it may have at the end. It's done with POSIX parameter expansions: ## strips the longest matching prefix, %% strips the longest matching suffix, and a leading # gives the length, so each variable name describes what its line produces. The reason for doing this is to know whether a file was already renamed.
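Here are the same four lines applied to a hypothetical path, chosen so the name has more than one dot. The last line is an extra for comparison and isn't part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical path with several dots in the file name.
full_filename="/home/user/Pictures/holiday.2019.tar.gz"

filename=${full_filename##*/}   # strip longest prefix matching */ -> holiday.2019.tar.gz
no_extension=${filename%%.*}    # strip longest suffix matching .* -> holiday
num_chars=${#no_extension}      # length of the string             -> 7

echo "$filename | $no_extension | $num_chars"

# For comparison, a single % strips the shortest matching suffix.
shortest=${filename%.*}         # -> holiday.2019.tar
echo "$shortest"
```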

A message from the future:

This is not a perfect system: if a filename happens to contain the same number of characters as a sha1sum, it won't be renamed. The new version of this script calculates the hash no matter what, then compares it to the current filename. While slower, it'll actually be correct, which is more important considering this script isn't run often.


    if [[ ( -f "$i" ) && (${num_chars} != 40) ]]; then

This is a conditional statement in bash, where [[ ]] denotes the start of a conditional of some kind, and && is the and operator.

The first expression is asking whether the path we're currently on in the loop is a regular file rather than a directory. There shouldn't ever be a directory in my pictures folder, but just in case one sneaks in there, it won't have anything done to it.

The second expression is checking that the filename, minus its extension, is not 40 characters long, using the num_chars value computed above. That length comes from a # prefixing the variable name in an expansion. 40 characters is used as that is how long a sha1sum is (the default for the shasum command used later), and I don't want to calculate the hash if a file has already been renamed.
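A small sketch of both tests on throwaway paths. The looks_like_sha1 helper is hypothetical and not part of the script; it's included only to show how a stricter check (exactly 40 lowercase hex characters) behaves differently from a pure length check on a made-up 40-character name.

```shell
#!/usr/bin/env bash
set -euo pipefail

dir=$(mktemp -d)
mkdir "$dir/subdir"
touch "$dir/cat.jpg"

# -f is true only for regular files, so directories fall through.
if [[ -f "$dir/cat.jpg" ]]; then file_is_file=yes; else file_is_file=no; fi
if [[ -f "$dir/subdir" ]]; then dir_is_file=yes; else dir_is_file=no; fi
echo "cat.jpg: $file_is_file, subdir: $dir_is_file"

# Hypothetical stricter test: exactly 40 lowercase hex characters.
looks_like_sha1() {
    [[ $1 =~ ^[0-9a-f]{40}$ ]]
}

digest="da39a3ee5e6b4b0d3255bfef95601890afd80709"   # SHA-1 of empty input
decoy="my_summer_holiday_photo_from_2019_edited"    # also 40 characters

for name in "$digest" "$decoy"; do
    if looks_like_sha1 "$name"; then verdict="hash-like"; else verdict="not a hash"; fi
    echo "${#name} chars, $verdict: $name"
done

if looks_like_sha1 "$digest"; then digest_ok=yes; else digest_ok=no; fi
if looks_like_sha1 "$decoy"; then decoy_ok=yes; else decoy_ok=no; fi
rm -r "$dir"
```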


        sum=$(shasum "$i")
        echo "$i" "$1/${sum%% *}.${i##*.}"

Sets the variable sum equal to the output of shasum for the file we are on in the iteration, which is the hash followed by the file name. Echo will print whatever comes after it out to the terminal, which in this case is some absolute wizardry I stole from somebody on the internet. The output will be the original file name, and then the location and name of the correctly renamed file, preserving its original extension.
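The wizardry is two more parameter expansions glued together. Here is a sketch that pulls them apart, using a hard-coded stand-in for shasum's output so it doesn't need real files. The paths are made up, and the last part points out an edge case (a path with no dot) that the script doesn't guard against.

```shell
#!/usr/bin/env bash
# Hard-coded stand-ins so the example does not need real files.
dir="/home/user/Pictures"
i="$dir/cat.jpg"
# shasum prints "<digest>  <path>"; this digest is the SHA-1 of empty input.
sum="da39a3ee5e6b4b0d3255bfef95601890afd80709  $i"

digest=${sum%% *}     # drop everything from the first space onward
extension=${i##*.}    # drop everything up to and including the last dot
target="$dir/$digest.$extension"
echo "$i -> $target"

# Edge case: with no dot anywhere in the path, ##*. matches nothing,
# so the "extension" is the whole path.
no_ext="Pictures/README"
no_ext_result=${no_ext##*.}
echo "$no_ext_result"
```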


        if [[ $2 == true ]]; then
            mv "$i" "$1/${sum%% *}.${i##*.}"
        fi
    fi
done

This checks to see whether the second parameter passed to the script is the word true, and if so, it will execute the move action as shown from the previous echo command. The idea is to run it with something random the first time to sanity check the output, then run it with the word true to actually rename all the files.
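One possible variation, sketched here as a standalone function rather than a change to the script: giving the flag a default with ${3:-false} makes the dry run happen even when the argument is left out, instead of tripping set -u. The function name rename_or_preview is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical wrapper around the rename step; not the original script.
rename_or_preview() {
    local from=$1 to=$2
    local execute=${3:-false}   # default to a dry run when the flag is omitted
    echo "$from -> $to"
    if [[ $execute == true ]]; then
        mv -- "$from" "$to"
    fi
}

dir=$(mktemp -d)
touch "$dir/a.txt"

# First pass: preview only, the file stays where it is.
rename_or_preview "$dir/a.txt" "$dir/b.txt"
if [[ -f "$dir/a.txt" ]]; then after_preview=unchanged; else after_preview=moved; fi

# Second pass: the word true triggers the actual move.
rename_or_preview "$dir/a.txt" "$dir/b.txt" true
if [[ -f "$dir/b.txt" ]]; then after_execute=moved; else after_execute=unchanged; fi

echo "after preview: $after_preview, after execute: $after_execute"
rm -r "$dir"
```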

It's worth noting this script will not recursively enter directories, and will actually ignore them for renaming entirely.

Rust: iter() vs into_iter()

TLDR

  • The iterator returned by into_iter() can yield a T, &T, or &mut T, depending on the context, normally T unless there's some other circumstance.
  • The iterator returned by iter() will yield a &T by convention.
  • The iterator returned by iter_mut() will yield &mut T by convention.

WTF is into_iter()?

into_iter() comes from the IntoIterator trait, which you implement when you want to specify how a particular type gets converted into an iterator. Notably, if you want a type to be usable in a for loop, it must implement IntoIterator.

As an example, Vec<T> implements IntoIterator three times:

impl<T> IntoIterator for Vec<T>
impl<'a, T> IntoIterator for &'a Vec<T>
impl<'a, T> IntoIterator for &'a mut Vec<T>

Each of these is slightly different. The first one consumes the Vec and yields its T values directly.

The other two take the Vec by reference and yield immutable references (&T) and mutable references (&mut T) respectively.

Yeah okay cool, so what's the difference though?

into_iter() is a generic method to obtain an iterator, and what this iterator yields (values, immutable references, or mutable references) depends on the type it is called on (for example Vec<T>, &Vec<T>, or &mut Vec<T>), and can sometimes be something you aren't expecting.

iter() and iter_mut() have return types independent of the context, and conventionally return immutable and mutable references respectively.

This is best shown with examples, so code blocks incoming:

#[test]
fn iter_demo() {
    let v1 = vec![1, 2, 3];
    let mut v1_iter = v1.iter();

    // iter() returns an iterator over references to the values
    assert_eq!(v1_iter.next(), Some(&1));
    assert_eq!(v1_iter.next(), Some(&2));
    assert_eq!(v1_iter.next(), Some(&3));
    assert_eq!(v1_iter.next(), None);
}

#[test]
fn into_iter_demo() {
    let v1 = vec![1, 2, 3];
    let mut v1_iter = v1.into_iter();

    // into_iter() returns an iterator over owned values in this particular case
    assert_eq!(v1_iter.next(), Some(1));
    assert_eq!(v1_iter.next(), Some(2));
    assert_eq!(v1_iter.next(), Some(3));
    assert_eq!(v1_iter.next(), None);
}

#[test]
fn iter_mut_demo() {
    let mut v1 = vec![1, 2, 3];
    let mut v1_iter = v1.iter_mut();

    // iter_mut() returns an iterator over mutable references to the values
    assert_eq!(v1_iter.next(), Some(&mut 1));
    assert_eq!(v1_iter.next(), Some(&mut 2));
    assert_eq!(v1_iter.next(), Some(&mut 3));
    assert_eq!(v1_iter.next(), None);
}