Understanding Hard Links

First signs of a problem

I copied my Movies directory from my old Mac to my new Mac, using target disk mode. On my old Mac, the Movies directory took up 84G. But on my new Mac, it took up 149G. What was going on?

My Movies directory contained hard links, which I wrote about last time.

The Movies directory contained 65G of files that were hard-linked to other files also within the Movies directory. When I copied them the usual way (by drag and drop, or with cp), each hard-linked file was copied once for each of its links. The result was a lot of duplication and a lot of wasted space.
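To see the effect on a small scale, here is a minimal sketch that builds a throwaway directory with one file and one hard link to it, then copies it with a plain cp -R. The directory and file names are made up for illustration, and it assumes a filesystem that supports hard links.

```shell
# Build a tiny tree with one file and one hard link to it (hypothetical names).
demo=$(mktemp -d)
mkdir "$demo/src"
printf 'some video data\n' > "$demo/src/clip.dv"
ln "$demo/src/clip.dv" "$demo/src/clip-link.dv"

# Both names point at the same inode, so there is one distinct inode here.
src_inodes=$(ls -i "$demo/src" | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
echo "distinct inodes in source: $src_inodes"

# A plain recursive copy writes a separate file for each name.
cp -R "$demo/src" "$demo/copy"
copy_inodes=$(ls -i "$demo/copy" | awk '{print $1}' | sort -u | wc -l | tr -d ' ')
echo "distinct inodes in copy: $copy_inodes"

rm -rf "$demo"
```

The source has two names for one inode, while the copy has two independent files. That is the same doubling I saw, just at a much smaller size.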

Investigating hard links

The command du, which calculates disk usage of a directory, is useful for understanding what hard links you have and where.

If I look at the disk usage for the Movies directory, I see:

$ du -sh Movies
 84G    Movies

Here’s what that means:

  • du – display disk usage statistics
  • -s – provide the total disk usage for each file/directory specified on the command line
  • -h – provide it in human readable form (as GB, rather than 84123456789 bytes)
  • Movies – tell me about the Movies directory

I get consistent information if I ask du about the top-level directories inside Movies:

$ du -sh Movies/*
 80G    Movies/iMovie Library.imovielibrary/
4.0G    Movies/iMovie Events.localized/
301M    Movies/iMovie Projects.localized/
8.0K    Movies/iMovie Theater.theater/

But if I look at the top-level directories individually, I see something different:

$ du -sh "Movies/iMovie Library.imovielibrary/"
 80G    Movies/iMovie Library.imovielibrary/

$ du -sh "Movies/iMovie Events.localized/"
 69G    Movies/iMovie Events.localized/

80G + 69G = 149G. This is much bigger than the 84G that du claims the Movies directory contains.

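It turns out there are hard links between “iMovie Library.imovielibrary” and “iMovie Events.localized”. Within a single run, du keeps track of the inode of every hard-linked file it comes across, and counts each inode only once. That is why the combined listing above charged the shared files to “iMovie Library.imovielibrary”, which it visited first, and showed only 4.0G for “iMovie Events.localized”. See this discussion on Stack Exchange.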

So “iMovie Events.localized” contains 65G of files that are hard-linked with files in “iMovie Library.imovielibrary”, and only 4G that belong to it independently.
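The same behavior can be reproduced with a throwaway tree. This is only a sketch: the directory names library and events are stand-ins for my real ones, and the exact kilobyte counts will vary by filesystem.

```shell
# Two directories that share one 1 MB file through a hard link.
demo=$(mktemp -d)
mkdir "$demo/library" "$demo/events"
dd if=/dev/urandom of="$demo/library/clip.dv" bs=1024 count=1024 2>/dev/null
ln "$demo/library/clip.dv" "$demo/events/clip.dv"

# One du run over the parent counts the shared inode once.
total_kb=$(du -sk "$demo" | awk '{print $1}')

# Separate du runs each count the shared inode in full.
library_kb=$(du -sk "$demo/library" | awk '{print $1}')
events_kb=$(du -sk "$demo/events" | awk '{print $1}')

echo "parent: ${total_kb}K  library: ${library_kb}K  events: ${events_kb}K"
echo "sum of separate runs: $((library_kb + events_kb))K"

rm -rf "$demo"
```

The sum of the separate runs should come out at roughly twice the parent total, which mirrors the 149G versus 84G discrepancy.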

I wanted to make sure that when I copied the Movies directory, I did not duplicate any hard linked files. So first, I needed to identify where all the hard links went.

The magical find and hard links

find is the Swiss army knife of Unix.

$ find Movies -type f -links +1 -ls | wc
     908   16037  160068

The arguments mean:

  • find – walk a file hierarchy
  • Movies – begin in the Movies directory
  • -type f – only look at files, not directories.
  • -links +1 – tell us about files that have more than 1 link (that is, 2 or more links)
  • -ls – give the long listing for each matching file
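  • | wc – pipe the listing to wc, which prints the number of lines, words, and bytes; each matching file is one line, so the first number is the file count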

So the output of that command tells us that there are 908 files in the Movies directory that are hard-linked to something.

If we then ask how many files have more than 2 links:

$ find Movies -type f -links +2 -ls | wc
       0       0       0

We find that no file has more than 2 links.

Alternately, we could have asked how many files have exactly 2 links:

$ find Movies -type f -links 2 -ls | wc
     908   16037  160068

There are 908 files with exactly 2 links, the same count as the files with more than 1 link, which confirms that none has more than 2 links.
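If you want to convince yourself how -links behaves before running it on real data, here is a small sketch on a scratch directory. The file names are invented for the example.

```shell
# One ordinary file, plus one file that has a second hard link.
demo=$(mktemp -d)
printf 'a\n' > "$demo/alone.txt"
printf 'b\n' > "$demo/shared.txt"
ln "$demo/shared.txt" "$demo/shared-link.txt"

# Count the names whose inode has more than 1, exactly 2, and more than 2 links.
multi=$(find "$demo" -type f -links +1 | wc -l | tr -d ' ')
exactly_two=$(find "$demo" -type f -links 2 | wc -l | tr -d ' ')
more_than_two=$(find "$demo" -type f -links +2 | wc -l | tr -d ' ')

echo "more than 1 link: $multi"
echo "exactly 2 links:  $exactly_two"
echo "more than 2 links: $more_than_two"

rm -rf "$demo"
```

Note that find reports every name, so a pair of hard-linked names shows up as 2 matches even though there is only one inode behind them.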

Using find to get related hard links

Now let’s check where those files are linked to. Let’s look at the first hard linked file:

$ find Movies -type f -links 2 -ls | head -1
3373844 172976 -rw-r--r-- 2 sasha staff 88560000 Jun 22  2008 Movies/iMovie Events.localized/raspberries/clip.dv

The first column is the inode number, and the last column is one of the filenames pointing to that inode.

This one has inode 3373844, and filename Movies/iMovie Events.localized/raspberries/clip.dv.

We can check where the other hard linked file is by searching by inode explicitly (-inum):

$ find Movies -type f -inum 3373844 -print

Or searching for other files linked to the same inode as a particular filename (-samefile):

$ find Movies -type f -samefile "Movies/iMovie Events.localized/raspberries/clip.dv" -print

In either case, I get the following result:

Movies/iMovie Events.localized/raspberries/clip.dv
Movies/iMovie Library.imovielibrary/raspberries/Original Media/clip.dv

So I know that some files in iMovie Events.localized are linked to files in iMovie Library.imovielibrary.

Note that find only searches the directories it is told to search. In this case, it is searching the Movies directory. If a file is hard linked to a file outside the Movies directory, find will report that it has hard links, because the reference count is greater than one. But if the other hard link is outside the Movies directory, then find will not be able to locate it.

Depending on where the hard links are, you may need to back up to a higher-level directory to find them:

$ find ~ -type f -samefile "Movies/iMovie Events.localized/raspberries/clip.dv" -print
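Here is a sketch that exercises both -inum and -samefile on a scratch tree, and also shows what happens when the search starts too low in the hierarchy. The names are hypothetical, and it assumes a find that supports -samefile (the GNU and BSD versions do).

```shell
# Two directories sharing one file, plus an unrelated file.
demo=$(mktemp -d)
mkdir "$demo/events" "$demo/library"
printf 'clip\n' > "$demo/events/clip.dv"
ln "$demo/events/clip.dv" "$demo/library/clip.dv"
printf 'clip\n' > "$demo/library/unrelated.dv"

# Look up the inode number, then search by inode and by filename.
inode=$(ls -i "$demo/events/clip.dv" | awk '{print $1}')
by_inum=$(find "$demo" -type f -inum "$inode" | sort)
by_samefile=$(find "$demo" -type f -samefile "$demo/events/clip.dv" | sort)
matches=$(printf '%s\n' "$by_samefile" | wc -l | tr -d ' ')
echo "names found from the top of the tree: $matches"

# Searching only one subdirectory misses the link that lives in the other one.
events_only=$(find "$demo/events" -type f -samefile "$demo/events/clip.dv" | wc -l | tr -d ' ')
echo "names found inside events only: $events_only"

rm -rf "$demo"
```

Both searches return the same two names when started from the top, and only one name when started from inside events.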

Using awk to get total bytes

You can use awk to calculate the total number of bytes that are hard-linked within various directories. This lets you see if you’ve found all of the corresponding hard-linked files. In the find listing, column 7 contains the size in bytes of each matching file.

$ find "Movies/iMovie Events.localized" -type f -links 2 -ls | awk '{sum+= $7} END {print sum}'
69739632133

$ find "Movies/iMovie Library.imovielibrary" -type f -links 2 -ls | awk '{sum+= $7} END {print sum}'
69740948475

$ find "Movies/iMovie Projects.localized" -type f -links 2 -ls | awk '{sum+= $7} END {print sum}'
1316342

Then, you can use expr to check that the numbers match up:

$ expr 69740948475 - 69739632133 - 1316342
0

For my case, the number of hard-linked bytes in “iMovie Library.imovielibrary” is exactly the sum of the hard-linked bytes in “iMovie Events.localized” and “iMovie Projects.localized”. This is a pretty good indication that all my hard links are contained entirely within the Movies directory.
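The same bookkeeping can be wrapped in a small helper. This is a sketch on made-up files with known sizes. The function name sum_links is my own invention, and it assumes a find whose -ls output puts the size in column 7, as the GNU and BSD versions do.

```shell
# Sum the sizes (column 7 of find -ls) of files with exactly 2 links under a directory.
sum_links() {
    find "$1" -type f -links 2 -ls | awk '{sum += $7} END {print sum + 0}'
}

demo=$(mktemp -d)
mkdir "$demo/events" "$demo/library"
printf '1234567890' > "$demo/events/a.dv"    # 10 bytes
printf '12345' > "$demo/events/b.dv"         # 5 bytes
ln "$demo/events/a.dv" "$demo/library/a.dv"
ln "$demo/events/b.dv" "$demo/library/b.dv"
printf 'xyz' > "$demo/library/unlinked.dv"   # 1 link only, so it is ignored

events_bytes=$(sum_links "$demo/events")
library_bytes=$(sum_links "$demo/library")
diff_bytes=$((library_bytes - events_bytes))

echo "events: $events_bytes  library: $library_bytes  difference: $diff_bytes"

rm -rf "$demo"
```

A difference of zero means every hard-linked byte on one side is accounted for on the other. Shell arithmetic with $(( )) does the same job as expr here.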

Copying hard linked files

After you figure out where all your hard links are, you need to copy the top-level directory recursively, and maintain the hard links over the course of that copy. Both cp and drag and drop will double-copy every hard-linked file. You need rsync.

Here is the command I used:

$ rsync -vaEH --protect-args --progress "/Volumes/Macintosh HD 1/Users/sasha/Movies/" /Users/sasha/Movies

This means:

  • rsync – remote synchronize
  • -v – verbose
  • -a – archive mode – recurse into directories, and preserve symlinks, permissions, timestamps, owners, groups, devices, and special files
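  • -E – preserve extended attributes (necessary for Macs); this is the meaning in Apple's bundled rsync 2.6.9, whereas in rsync 3.x -E means --executability and extended attributes are preserved with -X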
  • -H – preserve hard links
  • --protect-args – properly handle filenames with spaces
  • --progress – show progress during transfer
  • "/Volumes/Macintosh HD 1/Users/sasha/Movies/" – directory to copy from
  • /Users/sasha/Movies – directory to copy to
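Before trusting this on 84G of video, it may be worth trying -H on a scratch directory and checking the result with find. The sketch below uses invented names and only the -a and -H options, so it should behave the same on old and new rsync versions. It skips the copy if rsync is not installed.

```shell
# A source tree with one file and one hard link to it.
demo=$(mktemp -d)
mkdir "$demo/src"
printf 'clip\n' > "$demo/src/clip.dv"
ln "$demo/src/clip.dv" "$demo/src/clip-link.dv"

if command -v rsync >/dev/null 2>&1; then
    # The trailing slash copies the contents of src into dest.
    rsync -aH "$demo/src/" "$demo/dest"
    # If hard links survived, both names in the copy still have 2 links.
    linked=$(find "$demo/dest" -type f -links 2 | wc -l | tr -d ' ')
    echo "files with 2 links in the copy: $linked"
else
    linked="rsync not installed"
    echo "$linked"
fi

rm -rf "$demo"
```

The same find check, run against the real destination after the copy, is a quick way to confirm that the link count there matches the source.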

Here is the last tricky/annoying piece in this process. Mac OS X 10.11 ships with rsync 2.6.9, which predates the --protect-args option and so does not protect filenames with spaces. This means that rsync will copy some files properly, and will fail on others with a pretty useless error message:

rsync: link_stat /Users/sasha/Movies/<blah> failed: No such file or directory (2)

If you search online, you will find suggestions to use complicated sets of backslashes and nested single and double quotes to get around this. This is a fragile solution. It may work, depending on where in the directory hierarchy the spaces are, but it may not.

The better solution is to install the latest version of rsync from Homebrew (currently rsync 3.1.2), and then use the --protect-args option, which protects spaces. To get rsync, you need the homebrew/dupes tap, as described on Stack Overflow.

One thought on “Understanding Hard Links”

  1. Hi
    I notice a lot of duplicated files between iMovie and Photos.app. i.e. two files, with the same md5 checksum in both places. Can I safely replace one instance of the file, with a hard link, without corrupting the library?
