v6/site/posts/2021-04-08-image-caching.md
2022-12-10 13:15:32 -05:00

15 KiB
Raw Blame History

title = "The Intricate Image Caching Architecture of Tusker"
tags = ["swift"]
date = "2021-04-08 18:25:42 -0400"
short_desc = "The three hardest problems in computer science are naming, caching, and off-by-one errors."
slug = "image-caching"

A fairly important part of Tusker (my iOS Mastodon app) is displaying images. And a bunch of varieties of images: user avatars, post attachments, custom emojis, user profile headers, as well as a few other types of rarely-shown images that get lumped in with attachments. And displaying lots of images in a performant way means caching. Lots of caching.

In the beginning, there was nothing. Then I started displaying images and almost immediately realized there would need to be some amount of caching. Otherwise, just scrolling down a little bit and then back up would result in images being re-loaded that were present mere seconds ago.

The very first implementation was super simple. It was basically just a dictionary of image URLs to the Data for the image at that URL. This fulfilled the primary goals of being 1) super easy to build and 2) mostly working for the simplest of use cases. But, the blog post doesn't end here, so clearly there are some issues remaining.

Now, back to the implementation. The first strategy has an obvious issue: memory usage will grow indefinitely. Luckily, there's a builtin solution for this: NSCache. NSCache is essentially a dictionary that works with the OS to automatically remove its contents when the system needs memory for something else.

This worked well enough for a little while, but there's another fairly low-hanging optimization. Because URLs aren't reused, images can be cached for a very long time. Even across app launches, if the cache were persisted to disk. Enter the Cache library. It provides memory- and disk-based caches (the memory one just wraps NSCache). While needing to load things from disk is relatively rare (because once an object is loaded from the on-disk cache, it will be kept in the in-memory cache), it's still a nice improvement during app launch and for the eventuality that Tusker is asked by the system to give back some memory.

This setup served me fairly well, and (aside from bugfixes) the image caching architecture went untouched for a while. Until I started worked on improving the app's behavior in degraded network conditions.

When running with the Network Link Conditioner in a super low-bandwidth preset, I launched the app to see what would happen. After a few API requests, all the posts loaded. But none of the images yet (I had purged the on-disk cache in order to test this scenario). Then the user avatars started loading in, one by one. Even for the same user.

The next optimization, then, is obvious. Why many request when few do trick? So, whenever something needs to load an image, instead of only checking if the URL already exists in the cache, I can also check whether there are any in-flight requests for that URL. If there are, then instead of starting a new request, the completion handler just gets tacked on to the existing request. With this in place, when you launch the app under poor network conditions, every instance of a specific user's avatar will load in simultaneously with the net outcome being that the app overall is finished loading sooner.

The network request batching mechanism also has one more feature. When something calls it to either kickoff a network request or add a completion handler to one that's already running, it receives back an object (called Request in my code, because that's what they are from the API consumer's point-of-view) which can be used to cancel the request. This is so that, if, say, a table view cell is reused, the requests for the cell's old data can be cancelled. But because the actual network requests are batched together, calling the cancel method on the request object doesn't necessarily cancel the underlying request (what I call a RequestGroup). The individual completion handler for the "cancelled" request will be removed, but the actual URL request won't be cancelled if there are still other active handlers.

There's also one more feature of the batching system. In some cases (primarily table view prefetching) it's useful to pre-warm the cache. This can either be by just loading something from disk, or by starting a network request for the image (either the request will finish by the time the data is needed, in which case the image will be in the in-memory cache or it will still be in-progress, in which case the completion handler that actually wants the data will be added to the request group). For this, there are also completion handler-less requests. They are also part of the RequestGroup and contribute to keeping the underlying network request alive. Cancelling a callback-less request is trivial because, without the completion handler, each one that belongs to the same URL is identical.

And this was how caching worked in Tusker for almost a year and a half. But, of course, this couldn't last forever. A few months ago, I was doing a bunch of profiling and optimizing to try to improve scroll view performance and reduce animation hitches.

The first thing I noticed was that while I was just scrolling through the timeline, there was a lot of time being spent in syscalls in the main thread. The syscalls were open, stat, and fstat and they were being called from NSURL's initFileURLWithPath: initializer. This method was being called with the cache key (which in my case was the URL to the remote image) in order to check if the key string has a file extension so that the extension can be used for the locally cached file. It was being called very frequently because in order to check if an image exists in the disk cache, it needs to check if there's a file on-disk at the path derived from the cache key, which includes the potential file extension of the key.

Another thing the initFileURLWithPath: initializer does is, if the path does not end with a slash, determine if it represents a directory by querying the filesystem. Since that initializer was also used to construct the final path to the cached file on-disk, it was doing even more pointless work. Because the cache is the only thing writing to that directory and all it's writing are files, it should never need to ask the filesystem.

There were a couple super low-hanging optimizations here:

First was using NSString's pathExtension property instead of turning the cache key into an NSURL to get the same property. The NSString property merely interprets the string as a file path, rather than hitting the disk, so it can be much faster.

The second thing was, as the documentation suggests, using the initFileURLWithPath:isDirectory: initializer instead. It allows you to specify yourself whether the path is to a directory or not, bypassing the filesystem query.

I sent both of these improvements upstream, because they were super simple and resulted in a nice performance improvement for free. But, while I was waiting for my changes to be merged, I came up with another optimization. This one was complex enough (though still not very) that I didn't feel like sending it upstream, so I finally decided to just write my own copy of the library2 so I could make whatever changes I wanted.

To avoid having to do disk I/O just to check if something is cached, I added a dictionary of cache keys to file states. The file state is an enum with three cases: exists, does not exist, and unknown. When the disk cache is created, the file state dictionary is empty, so the state for every key is effectively unknown. With this, when the disk cache is asked whether there is an object for a certain key, it can first consult its internal dictionary. If file state is exists or does not exist, then no filesystem query takes place. If the state is unknown, it asks the OS whether the file exists and saves the result to the dictionary, so the request can be avoided next time. The methods for adding to/removing from the cache can then also update the dictionary to avoid potential future filesystem queries.

Combined with the improvements I'd sent to the upstream library, this eliminated almost all of the syscalls from the scrolling hot path. Sadly though, scrolling performance, while better, still wasn't what I had hoped.

The next thing I realized was that I was being incredibly ineffecient with how images were decoded from raw data.

This WWDC session from 2018 explains that although UIImage looks like a fairly simple model object, there's more going on under the covers that can work to our advantage, if we let it.

The UIImage instance itself is what owns the decoded bitmap of the image. So when a UIImage is used multiple times, the PNG/JPEG/etc. only needs to be decoded once.

But, in both the memory and disk caches, I was only storing the data that came back from the network request. This meant that every time something needed to display an image, it would have to re-decode it from the original format into a bitmap the system could display directly. This showed up in profiles of the app as a bunch of time being spent in the ImageIO functions being called by internal UIKit code.

To fix this, I changed the in-memory cache to store only UIImage objects3, which only decode the original data once and share a single bitmap across every usage. The first time an image is retrieved from the network (or loaded from disk), a UIImage is constructed for it and stored in the memory cache. This resulted in a significant performance improvement. When running on an iPhone 6s (the device I use for performance testing), scrolling felt noticeably smoother. Additionally, this has the very nice added benefit of reducing memory consumption by a good deal.

We can still go one step farther with caching image objects, though. The aforementioned WWDC talk also mentions that the size of the bitmap stored internally by each UIImage is proportional to the dimensions of the input image, not to the size of the view it's being displayed in. This is because if the same image is shown in multiple views of different sizes, it wants to retain as much information as possible so the image looks as good as it can. Another key effect of using larger-than-necessary images is that the render server needs to do more work to scale down those images to the actual display size. By doing that ourselves, ahead of time, we can keep it from repeatedly doing extra work.

This strategy is a reasonable default, but we, the app developer, know better. Depending on the category of image, it may only be shown at one particular size. In my case, user avatars are almost always shown at a resolution no larger than 50pt × 50pt. So, instead of keeping a bunch of full size bitmaps around, when creating the image that's going to be cached in-memory, we can use CoreGraphics to downscale the input image to a maximum dimension of 50 points4. And, because the original image data is still cached on disk, if the user goes to a screen in the app where user avatars are displayed larger than usual, we can just load the original data. This is relatively uncommon compared to just scrolling through the timeline, so the slight performance hit here is a worthwhile tradeoff for the improvement in the more common case.

Before we reach the end, there's one final bit of image caching Tusker does. Some time last year, I added an accessibility/digital wellness preference which changes the app to only display images in grayscale. I use the CoreImage framework to actually do this conversion5. CoreImage is GPU-accelerated and so is reasonably speedy, but it still adds a not-insignificant amount of time, which can be felt on slower devices. To try and mitigate this, the images are also cached post-grayscale conversion.

And that finally brings us to how image caching in Tusker works today. It started out very simple, and the underlying concepts largely haven't changed, there's just been a steady series of improvements. As with most things related to caching, what seemed initially to be a simple problem got progressively more and more complex. And, though there are a lot of moving parts, the system overall works quite well. Images are no longer the bottleneck in scrolling performance, except in the rarest of cases (like using grayscale images on the oldest supported devices. Either of those individually are fine, but together they're just too much). And, memory usage overall is substantially reduced making the app a better platform citizen.


  1. Unfortunately, this property is not true of Honk; the avatar URL for a Honk user looks like https://example.com/a?a=<USER ID>. Though at the time I was first building this image caching system, Honk didn't even exist. And even today, it doesn't really cause a problem. Honk users almost always have avatars that are procedurally generated by the software. Therefore my assumption is still largely true. ↩︎

  2. Don't worry, it's under the MIT license. ↩︎

  3. Mostly. Unlike other categories of images, post attachments are not cached on disk, only in memory. This is because, generally speaking, users won't see the same attachment often enough that it's worth caching them across app launches. It would just be throwing away disk space for no benefit.
    But the original need does need to be available, because constructing a UIImage from an animated GIF throws away all but the first frame. So, for attachments specifically, the original data is kept in memory. (Another obvious optimization here would be to only store the original data for GIFs in memory, and discard it for other attachments. I intend to do this eventually, I just haven't gotten around to it as of the time I'm writing this.) ↩︎

  4. CoreGraphics technically wants a pixel size, so we multiply 50 by UIScreen.main.scale and use that as the max pixel dimension. This could become a minor problem on Catalyst, where screens with different scales are possible (though I don't know how macOS display scales map to the Catalyst version of UIKit…), or if Apple added proper multi-display support to iPadOS. ↩︎

  5. In an ideal world, this could be done with something like a fragment shader at render-time, but I couldn't find any reasonable way of doing that. Oh well. ↩︎