forked from shadowfacts/shadowfacts.net
Add Parsing ID3 Metadata in Elixir
parent
a3edf19134
commit
e1457b8ecb
|
@ -0,0 +1,490 @@
|
||||||
|
```
|
||||||
|
metadata.title = "Parsing ID3 Metadata in Elixir"
|
||||||
|
metadata.tags = ["elixir"]
|
||||||
|
metadata.date = "2020-12-07 21:26:42 -0400"
|
||||||
|
metadata.shortDesc = "Extracting metadata stored in ID3 tags from MP3 files with Elixir."
|
||||||
|
metadata.slug = "parsing-id3-tags"
|
||||||
|
```
|
||||||
|
|
||||||
|
On and off for the past year and a half or so, I've been working on a small side project to catalog and organize my music library, which is predominantly composed of MP3 files[^1]. There are existing pieces of software out there that will do this (such as Beets and Airsonic), but, as many a programmer will attest to, sometimes it's just more fun to build your own. The choice of language was easy. For a couple years now, Elixir has been my favorite for any back-end web dev. I also had an inkling that its powerful pattern matching facilities could work on arbitrary binary data—perfect for parsing file formats.
|
||||||
|
|
||||||
|
I knew that MP3 files carried some embedded metadata, if only because most tracks show album artwork and track information when viewed in Finder. Cursory googling led me to the [ID3 spec](https://id3.org/).
|
||||||
|
|
||||||
|
[^1]: Actual, DRM-free files because music streaming services by and large don't pay artists fairly[^2]. MP3s specifically because they Just Work everywhere, and I cannot for the life of me hear the difference between a 320kbps MP3 and an \<insert audiophile format of choice> file.
|
||||||
|
|
||||||
|
[^2]: Spotify pays artists 0.38¢ per play and Apple Music pays 0.783¢ per play ([source](https://help.songtrust.com/knowledge/what-is-the-pay-rate-for-spotify-streams)). For an album of 12 songs that costs $10 (assuming wherever you buy it from takes a 30% cut, leaving the artist $7), you would have to listen all the way through it roughly 75 to 155 times for the artist to receive as much money as if you had just purchased the album outright. That's hardly fair and is not sustainable for all but the largest of musicians.
|
||||||
|
|
||||||
|
<!-- excerpt-end -->
|
||||||
|
|
||||||
|
Initially, I found a package on Hex for parsing ID3 tags from Elixir. It wasn't implemented directly in Elixir, though; instead, it used a NIF: a Native Implemented Function. NIFs are pieces of C code that implement functions accessible to Erlang/Elixir, which is useful if very high performance is needed. But the NIF wouldn't compile on my machine, so rather than trying to fix it, I decided to try to parse the ID3 data myself. Fortunately, Elixir is a very nice language for parsing binary file formats.
|
||||||
|
|
||||||
|
## Binary Pattern Matching
|
||||||
|
|
||||||
|
What makes Elixir so nice for parsing file formats? Pattern matching. Specifically bitstring pattern matching, but let's start with ordinary pattern matching. (If you already know this, [skip ahead](#parsing-tags).)
|
||||||
|
|
||||||
|
Pattern matching in code is, generally speaking, describing what you expect the shape of a piece of data to be and pulling specific parts of it out. Let's say you have a tuple of two elements, and you want to make sure that the first element is a string containing a certain value and you want to bind the second value to a variable. You could do this in two statements, or you could pattern match on it.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
{"foo", my_variable} = {"foo", 1234}
|
||||||
|
IO.inspect(my_variable) # => 1234
|
||||||
|
```
|
||||||
|
|
||||||
|
This becomes even more powerful once you learn that you can use pattern matching in function parameters, so you can provide two completely different implementations based on some aspect of a parameter:
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def is_foo("foo") do
|
||||||
|
true
|
||||||
|
end
|
||||||
|
def is_foo(value) do
|
||||||
|
false
|
||||||
|
end
|
||||||
|
is_foo("foo") # => true
|
||||||
|
is_foo(42) # => false
|
||||||
|
```
|
||||||
|
|
||||||
|
Next: pattern matching bitstrings. A bitstring in Elixir/Erlang is just a big ol' sequence of bits. Additionally, a binary is a special case of a bitstring, where the number of bits is evenly divisible by 8, making it a sequence of bytes. Bitstrings and binaries are written in between double angle brackets, with individual values separated by commas. Unless you specify otherwise, each value is the size of a single byte.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
<<first, rest::binary>> = <<1, 2, 3>>
|
||||||
|
first # => 1
|
||||||
|
rest # => <<2, 3>>
|
||||||
|
```
|
||||||
|
|
||||||
|
Here, we're binding the first byte of the binary to the variable `first`, and any remaining bytes to `rest`. When binding variables (or specifying fixed values, as we'll see shortly) in bitstring pattern matching, the name is followed by two colons and then the type and, optionally, the size of the value. The size determines how much of the bitstring each value will match. This information is critical, since without it there may be many ways for a certain pattern to match a bitstring, leaving the programmer's intention ambiguous. By default, each element in a bitstring match is an integer with a size of a single byte. Anything else must be explicitly specified. Here, `rest` is specified to be a binary, meaning it will be a sequence of bytes.
|
||||||
|
|
||||||
|
One key thing to note is that the last element of a match is special. Unlike the preceding elements, its type can be given as just `bitstring` or `binary`, without specifying the size. This lets you get all the remaining data out of the bitstring without knowing how long it is, similar to getting the tail of a list.
|
||||||
|
|
||||||
|
You can also match specific sizes, including across multiple bytes and specific numbers of bits:
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
<<first::size(4), rest::bitstring>> = <<0xFF>>
|
||||||
|
first # => 15
|
||||||
|
rest # => <<15::size(4)>>
|
||||||
|
```
|
||||||
|
|
||||||
|
This in particular is very useful for parsing binary formats, since it lets you easily unpack bytes which contain multiple bit flags without having to do any bitwise math yourself.
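To make the bit-flag point concrete, here's a minimal sketch that splits a single byte into four 1-bit flags and a 4-bit remainder. The module name, function name, and flag names are placeholders I made up for illustration; they don't come from any spec.

```elixir
defmodule FlagsSketch do
  # Splits one byte into four 1-bit flags followed by a 4-bit value.
  # The names a/b/c/d/low are hypothetical, chosen only for this example.
  def unpack(<<a::size(1), b::size(1), c::size(1), d::size(1), low::size(4)>>) do
    %{a: a, b: b, c: c, d: d, low: low}
  end
end

FlagsSketch.unpack(<<0b10100011>>)
# => %{a: 1, b: 0, c: 1, d: 0, low: 3}
```

No shifting or masking is involved: the sizes in the pattern add up to eight bits, so the match consumes exactly one byte.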
|
||||||
|
|
||||||
|
## Parsing Tags
|
||||||
|
|
||||||
|
Ok, now with that under our belts can we finally parse some ID3 tags? Actually, not quite yet. First off, I'm only looking at ID3v2 tags in this, since none of the music I've been testing this against uses v1. Second, there are two slightly different versions of the v2 spec that matter: Version [2.4.0](https://id3.org/id3v2.4.0-structure) and version [2.3.0](https://id3.org/id3v2.3.0). At first glance, they're similar, but there are a number of differences lurking beneath the surface that tripped me up as I was building this. Alright, now let's really get started.
|
||||||
|
|
||||||
|
Through the magic of pattern matching, we can define a function that takes a single binary as an argument. There will be two implementations: one that's used if the data begins with an ID3 tag and one for if it doesn't. The fallback implementation accepts anything as its parameter and returns an empty map, indicating that there was no ID3 data. The other implementation will match against the contents of the binary, expecting it to match the defined format of an ID3v2 tag.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def parse_tag(<<
|
||||||
|
"ID3",
|
||||||
|
major_version::integer,
|
||||||
|
_revision::integer,
|
||||||
|
_unsynchronized::size(1),
|
||||||
|
extended_header::size(1),
|
||||||
|
_experimental::size(1),
|
||||||
|
_footer::size(1),
|
||||||
|
0::size(4),
|
||||||
|
tag_size_synchsafe::binary-size(4),
|
||||||
|
rest::binary
|
||||||
|
>>) do
|
||||||
|
end
|
||||||
|
|
||||||
|
def parse_tag(_) do
|
||||||
|
%{}
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
ID3v2 specifies that a file with an ID3v2 tag should begin with the ASCII byte sequence representing "ID3". We can match this directly in the pattern, because strings in Elixir/Erlang are just binaries. The magic string is followed by an integer representing the major version (4 for ID3 version 2.4 and 3 for 2.3) and another integer representing the least significant component of the ID3 spec version used by the file (we don't care about it because no minor revisions of the spec have been released). It's followed by a series of four 1-bit flags and then four unused bitflags, which should always be zero. After that is a 32-bit number that contains the total size of the ID3 tag. The variable name for this is `tag_size_synchsafe` because it's encoded with ID3v2's special synchsafe scheme, which we'll get to shortly. The remainder of the file is bound as a binary to the `rest` variable.
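To see the header pattern in action, here's a small self-contained sketch that runs the same match against a hand-built binary. `HeaderSketch`, `describe`, and the sample bytes are assumptions made for this example, not part of the real parser; the size field is returned as its raw four bytes because decoding it is a separate step.

```elixir
defmodule HeaderSketch do
  # Matches the 10-byte ID3v2 header and reports a few of its fields.
  def describe(<<
        "ID3",
        major_version::integer,
        _revision::integer,
        unsynchronized::size(1),
        extended_header::size(1),
        _experimental::size(1),
        _footer::size(1),
        0::size(4),
        raw_size::binary-size(4),
        _rest::binary
      >>) do
    {:ok,
     %{
       major_version: major_version,
       unsynchronized: unsynchronized,
       extended_header: extended_header,
       raw_size: raw_size
     }}
  end

  # Anything that does not start with a well-formed header.
  def describe(_other), do: :no_tag
end

# Hand-built sample: version 2.4.0 with only the extended-header flag set.
sample = <<"ID3", 4, 0, 0b01000000, 0, 0, 2, 1, "frame data">>

HeaderSketch.describe(sample)
# => {:ok, %{extended_header: 1, major_version: 4, raw_size: <<0, 0, 2, 1>>, unsynchronized: 0}}

HeaderSketch.describe("RIFF....")
# => :no_tag
```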
|
||||||
|
|
||||||
|
To understand why ID3 encodes many numbers as "synchsafe", it's important to remember that ID3 was designed to carry metadata specifically for MP3 files. As such, it had to take special care to not interfere with existing MP3-handling software that may not have understood ID3 tags. The MPEG audio format encodes an audio stream into individual frames of audio data, each of which starts with a byte composed of all 1s (i.e., 0xFF). This is done to let audio players easily seek through a bytestream and locate a point where valid audio data starts (called synchronizing), without having to parse the entire file leading up to that point. Because the 0xFF byte is the sync marker, its presence in an ID3 tag (which is, of course, not valid audio data) would cause players to start trying to play back nonsense data. So, within an ID3 tag, all numbers need to be encoded in such a way that an all-1s byte never occurs. This is where the synchsafe scheme comes in (so called because it's safe from causing false syncs).
|
||||||
|
|
||||||
|
Synchsafe integers work by only using the seven lower bits of each byte and leaving the highest-order bit as 0, thereby preventing any 0xFF bytes, and false syncs, from occurring. The number 255 would be encoded as 0b101111111 (decimal 383). This means that, for every byte of space used, a synchsafe integer stores only seven bits of information. Decoding these integers is pretty easy. For the simplest case of a single byte, nothing needs to be done: the value of the byte is the value of the entire number. Decoding multi-byte synchsafe integers is slightly more complicated, but not terribly so. We need to go through each byte, building up an integer. We shift each byte to the left by seven times the index of the byte in the multi-byte synchsafe integer.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
import Bitwise # `use Bitwise` is deprecated in newer Elixir releases
|
||||||
|
|
||||||
|
def decode_synchsafe_integer(<<b>>) do
|
||||||
|
b
|
||||||
|
end
|
||||||
|
|
||||||
|
def decode_synchsafe_integer(binary) do
|
||||||
|
binary
|
||||||
|
|> :binary.bin_to_list()
|
||||||
|
|> Enum.reverse()
|
||||||
|
|> Enum.with_index()
|
||||||
|
|> Enum.reduce(0, fn {el, index}, acc ->
|
||||||
|
acc ||| (el <<< (index * 7))
|
||||||
|
end)
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
The `bin_to_list` function takes a binary, which is just a sequence of bytes, and converts it into an actual list that we can use with the functions in the `Enum` module. This list is reversed and its elements are converted into tuples which include the index. The list needs to be reversed first because the bytes in synchsafe integers are big-endian, meaning the most significant comes first. We want the indices to match this, with the highest index being with the highest-order byte. So, we reverse the list so both the bytes and indices are going from least significant and smallest to most significant and greatest. From there, each byte is shifted left by seven times its index, which eliminates the gaps between the actual information-carrying bits, and the results are OR'd together.
|
||||||
|
|
||||||
|
Here's what decoding the synchsafe number represented as 0b10101001000011010 (0x1521A) would look like:
|
||||||
|
|
||||||
|
```
|
||||||
|
00000001 01010010 00011010
|
||||||
|
reversed: 00011010 01010010 00000001
|
||||||
|
Index 0 Index 1 Index 2
|
||||||
|
|
||||||
|
00011010 <<< 0 = 00011010
|
||||||
|
01010010 <<< 7 = 00101001 00000000
|
||||||
|
00000001 <<< 14 = 01000000 00000000
|
||||||
|
|
||||||
|
00000000 00011010
|
||||||
|
00101001 00000000
|
||||||
|
||| 01000000 00000000
|
||||||
|
---------------------
|
||||||
|
01101001 00011010 = 26906
|
||||||
|
```
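As a cross-check on the worked example above, here's an alternative sketch that handles only the fixed 4-byte case and lets bitstring pattern matching do the work instead of shifting. `SynchsafeSketch`, `decode4`, and `encode4` are hypothetical names for illustration; this is not the decoder used in the rest of the post.

```elixir
defmodule SynchsafeSketch do
  # Decodes a 4-byte synchsafe integer by matching the 7 payload bits of
  # each byte and requiring the top bit of every byte to be 0.
  def decode4(<<0::size(1), a::size(7), 0::size(1), b::size(7),
                0::size(1), c::size(7), 0::size(1), d::size(7)>>) do
    # Re-pack the four 7-bit groups into one 28-bit integer.
    <<value::size(28)>> = <<a::size(7), b::size(7), c::size(7), d::size(7)>>
    value
  end

  # Encodes a non-negative integer below 2^28 as 4 synchsafe bytes.
  def encode4(value) when value >= 0 and value < 0x10000000 do
    <<a::size(7), b::size(7), c::size(7), d::size(7)>> = <<value::size(28)>>

    <<0::size(1), a::size(7), 0::size(1), b::size(7),
      0::size(1), c::size(7), 0::size(1), d::size(7)>>
  end
end

# The worked example, padded to four bytes with a leading zero byte.
SynchsafeSketch.decode4(<<0x00, 0x01, 0x52, 0x1A>>)
# => 26906

SynchsafeSketch.encode4(255)
# => <<0, 0, 1, 127>>
```

The trade-off is that this version only accepts exactly four bytes, whereas the reduce-based function works for any length.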
|
||||||
|
|
||||||
|
You may have noticed the `unsynchronized` flag in the tag header. The ID3 unsynchronization scheme is another way of preventing false syncs in longer blocks of data within the tag (such as image data in the frame used for album artwork). I elected not to handle this flag for now, since none of the tracks in my library have the flag set. The ID3v2.4 spec says the unsynchronization scheme is primarily intended to prevent old software which isn't aware of ID3 tags from incorrectly trying to sync onto data in the ID3 tag. Since the ID3v2 spec is over 20 years old, pieces of software which aren't aware of it are few and far between, so I guess the unsynchronization scheme has fallen out of favor.
|
||||||
|
|
||||||
|
So, since we've gotten the 4-byte binary that contains the tag size out of the header, we can use the `decode_synchsafe_integer` function to decode it.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def parse_tag(...) do
|
||||||
|
tag_size = decode_synchsafe_integer(tag_size_synchsafe)
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
We'll use the tag size when we start decoding ID3 frames, to ensure that we don't go over the end of the ID3 tag into the actual MP3 data. But before we can start parsing frames, we need to take care of the extended header, if it's present. The extended header contains optional extra data that doesn't fit into the regular header and isn't part of any frame. This is where the differences between v2.3 and v2.4 come into play. In version 2.3 of the spec, the size of the extended header is fixed at either 6 or 10 bytes, depending on whether a checksum is present. In version 2.4, the size is variable and depends on various flags in the extended header.
|
||||||
|
|
||||||
|
So, when parsing the extended header, we'll use a function that takes the major version specified in the tag header, and parses it differently depending on the version. Since, for my purposes, I don't care about any of the data that it would provide, I just discard it and return the binary containing all the bytes after the extended header. The `skip_extended_header` function also returns the length of the extended header, so that we can take it into account when calculating how long the remainder of the tag is (since the tag size in the ID3 header includes the extended header's length).
|
||||||
|
|
||||||
|
For the implementation for version 2.3, the extended header begins with the size of the extended header (including the fixed length of six bytes). This is not encoded as synchsafe (though it is in version 2.4), so we can subtract the six bytes the pattern match has already skipped, and skip any remaining bytes. The length of the extended header is defined by the spec to always be 6 or 10 bytes, so we can safely subtract 6 without breaking anything (matching a binary of size 0 in the pattern match works perfectly fine).
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def skip_extended_header(3, <<
|
||||||
|
ext_header_size::size(32),
|
||||||
|
_flags::size(16),
|
||||||
|
_padding_size::size(32),
|
||||||
|
rest::binary
|
||||||
|
>>) do
|
||||||
|
remaining_ext_header_size = ext_header_size - 6
|
||||||
|
<<_::binary-size(remaining_ext_header_size), rest::binary>> = rest
|
||||||
|
{rest, ext_header_size}
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
For the version 2.4 implementation, it's a little bit more complicated. The extended header still starts with four bytes giving the size of the extended header (though this time encoded as a synchsafe number), then has a 0x01 byte, and then a byte for the flags. After those six bytes, there may be more data depending on what flags were set (an additional six bytes for CRC and 2 bytes for tag restrictions). This time, we need to decode the size from synchsafe, and then we can subtract the fixed length and skip over the remainder as before.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def skip_extended_header(4, <<
|
||||||
|
ext_header_size_synchsafe::binary-size(4), # kept as a binary so it can be passed to decode_synchsafe_integer
|
||||||
|
1::size(8),
|
||||||
|
_flags::size(8),
|
||||||
|
rest::binary
|
||||||
|
>>) do
|
||||||
|
ext_header_size = decode_synchsafe_integer(ext_header_size_synchsafe)
|
||||||
|
remaining_ext_header_size = ext_header_size - 6
|
||||||
|
<<_::binary-size(remaining_ext_header_size), rest::binary>> = rest
|
||||||
|
{rest, ext_header_size}
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
In the main `parse_tag` function, we can skip over the extended header if the corresponding flag bit is set. If it isn't, no data needs to be skipped and the length of the extended header is zero. With that, we can finally proceed to parsing the ID3 frames to get out the actually useful data.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def parse_tag(...) do
|
||||||
|
tag_size = decode_synchsafe_integer(tag_size_synchsafe)
|
||||||
|
|
||||||
|
{rest, ext_header_size} =
|
||||||
|
if extended_header == 1 do
|
||||||
|
skip_extended_header(major_version, rest)
|
||||||
|
else
|
||||||
|
{rest, 0}
|
||||||
|
end
|
||||||
|
|
||||||
|
parse_frames(major_version, rest, tag_size - ext_header_size)
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
When passing the remaining tag length to the `parse_frames` function, we need to subtract the extended header length from the tag size that we got from the ID3 header, because it already includes the extended header that we've already skipped over. While parsing frames, we'll use it as a confirmation that we haven't overrun the end of the tag.
|
||||||
|
|
||||||
|
The `parse_frames` function also receives the major version of the tag, the actual binary data, and a list of accumulated frames so far (which defaults to an empty list). It will be called in a recursive loop, advancing through the binary and accumulating frames.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def parse_frames(major_version, data, tag_length_remaining, frames \\ [])
|
||||||
|
```
|
||||||
|
|
||||||
|
The first case of the function is for when it has reached the total length of the tag, in which case it will just convert the accumulated frames into a map, and return the data that's left (we want to return whatever data's left after the end of the ID3 tag so that it can be used by other parts of the code, say, an MP3 parser...). We can just directly convert the list of frames into a map because, as you'll see shortly, each frame is a tuple of the name of the frame and its data, in whatever form that may be.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def parse_frames(_, data, tag_length_remaining, frames)
|
||||||
|
when tag_length_remaining <= 0 do
|
||||||
|
{Map.new(frames), data}
|
||||||
|
end
|
||||||
|
|
||||||
|
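# Fallback clause: in the module this must be defined after the frame-matching clause shown below,
# otherwise it would match first and no frames would ever be parsed.
def parse_frames(_, data, _, frames) do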
|
||||||
|
{Map.new(frames), data}
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
The fallback case of `parse_frames`, which will be called if we can't find a valid ID3 frame header in the data stream, does the same thing.
|
||||||
|
|
||||||
|
The bulk of the work is done by the next implementation of `parse_frames`. It starts by matching a valid frame header from the beginning of the binary. After that, it first needs to get the actual size of the frame. In version 2.4 of ID3, the frame size is encoded with the synchsafe scheme, but in 2.3, it's just a plain 32-bit integer. After that, it calculates the entire size of the frame by adding 10 bytes (since the value in the header does not include the size of the header) and then subtracting that from the total tag length remaining, to figure out how much of the tag will be left after the current frame is parsed.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def parse_frames(
|
||||||
|
major_version,
|
||||||
|
<<
|
||||||
|
frame_id::binary-size(4),
|
||||||
|
frame_size_maybe_synchsafe::binary-size(4),
|
||||||
|
0::size(1),
|
||||||
|
_tag_alter_preservation::size(1),
|
||||||
|
_file_alter_preservation::size(1),
|
||||||
|
_read_only::size(1),
|
||||||
|
0::size(4),
|
||||||
|
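0::size(1), # leading unused bit of the second flags byte, so the flags total 16 bits (ID3v2.4 layout %0h00kmnp)
_grouping_identity::size(1),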
|
||||||
|
0::size(2),
|
||||||
|
_compression::size(1),
|
||||||
|
_encryption::size(1),
|
||||||
|
_unsynchronized::size(1),
|
||||||
|
_has_data_length_indicator::size(1),
|
||||||
|
rest::binary
|
||||||
|
>>,
|
||||||
|
tag_length_remaining,
|
||||||
|
frames
|
||||||
|
) do
|
||||||
|
|
||||||
|
frame_size =
|
||||||
|
case major_version do
|
||||||
|
4 ->
|
||||||
|
decode_synchsafe_integer(frame_size_maybe_synchsafe)
|
||||||
|
3 ->
|
||||||
|
<<size::size(32)>> = frame_size_maybe_synchsafe
|
||||||
|
size
|
||||||
|
end
|
||||||
|
|
||||||
|
total_frame_size = frame_size + 10
|
||||||
|
next_tag_length_remaining = tag_length_remaining - total_frame_size
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
After that, it hands the ID of the frame (a four-character ASCII string), along with the size of the frame (not counting the header) and the remainder of the binary, over to the `decode_frame` function, which will use the ID of the specific frame to decide how to decode the data.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def parse_frames(...) do
|
||||||
|
# ...
|
||||||
|
|
||||||
|
result = decode_frame(frame_id, frame_size, rest)
|
||||||
|
end
|
||||||
|
```
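The overall shape of `parse_frames` (a guard clause for "nothing left", a clause that matches one record and recurses, and a fallback) can be hard to see when it's spread across several snippets. Here's a minimal sketch of the same accumulate-and-recurse pattern over a made-up toy format, where each record is an ID byte, a length byte, and that many payload bytes. The format and the `RecordLoopSketch` module are assumptions for illustration only; they are not ID3.

```elixir
defmodule RecordLoopSketch do
  # Toy format (not ID3): <<id, len, payload::binary-size(len)>> repeated.
  def parse(data, remaining, acc \\ [])

  # Reached the declared end of the record area: return the map and leftover data.
  def parse(data, remaining, acc) when remaining <= 0 do
    {Map.new(acc), data}
  end

  # One well-formed record: accumulate it and recurse on what follows.
  def parse(<<id, len, payload::binary-size(len), rest::binary>>, remaining, acc) do
    parse(rest, remaining - (len + 2), [{id, payload} | acc])
  end

  # Fallback: no valid record at this position, so stop with what we have.
  def parse(data, _remaining, acc) do
    {Map.new(acc), data}
  end
end

# Two records (4 + 3 = 7 bytes) followed by unrelated trailing data.
RecordLoopSketch.parse(<<1, 2, "hi", 2, 1, "x", "audio">>, 7)
# => {%{1 => "hi", 2 => "x"}, "audio"}
```

Clause order matters here in the same way it does for `parse_frames`: the fallback has to come last, or it would swallow every call.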
|
||||||
|
|
||||||
|
## Decoding Frames
|
||||||
|
|
||||||
|
The `decode_frame` function pattern matches on the ID of the frame and the binary and does different things, as defined by the spec, depending on the frame. There are a few specific frames I'm interested in, so those are the ones that I'll go into here, but the others could be easily parsed with the same techniques.
|
||||||
|
|
||||||
|
First off, the TXXX frame. This frame contains custom, user-defined key/value pairs of strings. It starts off with a number that defines the text encoding scheme.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def decode_frame("TXXX", frame_size, <<text_encoding::size(8), rest::binary>>) do
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
The text encoding is a single byte containing 0, 1, 2, or 3. First off, the simple encodings: 0 represents an ISO-8859-1 string (byte-compatible with UTF-8 for plain ASCII text, which is good enough for our purposes) terminated by a null byte and 3 represents a UTF-8 string, also terminated by a null byte. These are very easy to handle, because strings in Elixir are just binaries, so we can return the data directly.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def convert_string(encoding, str) when encoding in [0, 3] do
|
||||||
|
str
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
Next, encoding 1 is text encoded as UTF-16 starting with the [byte order mark](https://en.wikipedia.org/wiki/Byte_order_mark), which lets programs detect the endianness of the text. Strings in this encoding are defined by ID3 to end with two null characters. The `bom_to_encoding` function in OTP checks if the given binary starts with the byte order mark, and, if so, returns the detected text encoding and endianness as well as the length of the BOM. This length lets us drop the BOM from the beginning of the data. We can then use another function in the `:unicode` module to convert the binary data to a regular UTF-8 string.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def convert_string(1, data) do
|
||||||
|
{encoding, bom_length} = :unicode.bom_to_encoding(data)
|
||||||
|
# Split by byte count; String.split_at/2 counts graphemes and the data is not UTF-8 yet.
<<_bom::binary-size(bom_length), string_data::binary>> = data
|
||||||
|
:unicode.characters_to_binary(string_data, encoding)
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
Encoding 2 is also UTF-16, but always big-endian and without the byte order mark. It's also terminated by two null bytes.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def convert_string(2, data) do
|
||||||
|
:unicode.characters_to_binary(data, {:utf16, :big})
|
||||||
|
end
|
||||||
|
```
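To show what the two `:unicode` calls return, here's a short sketch run against hand-built UTF-16 data. The sample bytes and variable names are assumptions for illustration; in the real parser, the data comes out of the frame.

```elixir
# Hand-built sample: "hi" in UTF-16 little-endian, prefixed with a BOM.
with_bom = <<0xFF, 0xFE, ?h, 0, ?i, 0>>

{encoding, bom_length} = :unicode.bom_to_encoding(with_bom)
# encoding => {:utf16, :little}, bom_length => 2

# Drop the BOM by byte count, then transcode the remainder to UTF-8.
<<_bom::binary-size(bom_length), utf16_text::binary>> = with_bom
from_bom = :unicode.characters_to_binary(utf16_text, encoding)
# from_bom => "hi"

# Encoding 2 has no BOM and is always big-endian.
from_big_endian = :unicode.characters_to_binary(<<0, ?h, 0, ?i>>, {:utf16, :big})
# from_big_endian => "hi"
```

Note that when no BOM is present, `bom_to_encoding` reports `{:latin1, 0}`, so malformed encoding-1 strings without a BOM would be read as Latin-1 rather than UTF-16.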
|
||||||
|
|
||||||
|
The `convert_string` will take a piece of string data and convert it to something we can actually use as a string, but we still need to figure out where the string ends. When decoding a frame, we need to find all the data up to one or two null characters, depending on the encoding.
|
||||||
|
|
||||||
|
Unfortunately, a number of tracks in my library have text frames which specify a UTF-16 encoding but are actually malformed and don't end with two null characters (they just run up to the end of the frame). So, the main decoding function is also going to take the maximum length of the frame so that we don't accidentally run over the end.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def decode_string(encoding, max_byte_size, data) when encoding in [1, 2] do
|
||||||
|
{str_data, rest} = get_double_null_terminated(data, max_byte_size)
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
We'll write a function that scans through a binary looking for a sequential pair of null characters, or until it reaches the maximum length specified. In either of those cases, it will return the binary up until the null characters as well as the remainder of the binary. If neither of those conditions is met, it will skip two bytes from the beginning of the binary, prepend them to the accumulator, and recursively call itself. We can safely skip two bytes at a time when the double null sequence isn't found because UTF-16 text is made up of two-byte code units, so the double null terminator won't straddle the boundary between two of them.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def get_double_null_terminated(data, max_byte_size, acc \\ [])
|
||||||
|
|
||||||
|
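# `<= 0` rather than `0` so an odd maximum length still terminates the scan
def get_double_null_terminated(rest, max_byte_size, acc) when max_byte_size <= 0 do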
|
||||||
|
{acc |> Enum.reverse() |> :binary.list_to_bin(), rest}
|
||||||
|
end
|
||||||
|
|
||||||
|
def get_double_null_terminated(<<0, 0, rest::binary>>, _, acc) do
|
||||||
|
{acc |> Enum.reverse() |> :binary.list_to_bin(), rest}
|
||||||
|
end
|
||||||
|
|
||||||
|
def get_double_null_terminated(<<a::size(8), b::size(8), rest::binary>>, max_byte_size, acc) do
|
||||||
|
next_max_byte_size = max_byte_size - 2
|
||||||
|
get_double_null_terminated(rest, next_max_byte_size, [b, a | acc])
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
The decode function can then convert the string data to a regular, UTF-8 string and return it, along with how many bytes were consumed (so that the frame decoding can know how much of its length is remaining), and the remaining binary data.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def decode_string(encoding, max_byte_size, data) when encoding in [1, 2] do
|
||||||
|
{str_data, rest} = get_double_null_terminated(data, max_byte_size)
|
||||||
|
|
||||||
|
{convert_string(encoding, str_data), byte_size(str_data) + 2, rest}
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
When decoding a UTF-8 string, we split the data at the first occurrence of the null byte and ensure that the size of whatever came before it is no greater than the max. If the size of the data up to the first null byte is greater than the max size, we just split the data at the maximum byte size, considering that to be the string. And once again, the function returns the string itself, the number of bytes consumed, and the remaining data.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def decode_string(encoding, max_byte_size, data) when encoding in [0, 3] do
|
||||||
|
case :binary.split(data, <<0>>) do
|
||||||
|
[str, rest] when byte_size(str) + 1 <= max_byte_size ->
|
||||||
|
{str, byte_size(str) + 1, rest}
|
||||||
|
|
||||||
|
_ ->
|
||||||
|
{str, rest} = :erlang.split_binary(data, max_byte_size)
|
||||||
|
{str, max_byte_size, rest}
|
||||||
|
end
|
||||||
|
end
|
||||||
|
```
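The two branches of that `case` follow directly from how `:binary.split/2` behaves, which this small sketch demonstrates on hand-built data (the sample values are assumptions for illustration):

```elixir
# With a terminator present, split/2 cuts at the first null byte only.
terminated = <<"Title", 0, "Artist", 0>>
with_null = :binary.split(terminated, <<0>>)
# with_null => ["Title", <<"Artist", 0>>]

# Without a terminator, the result is a one-element list,
# which is why the catch-all branch is needed.
without_null = :binary.split("Title", <<0>>)
# without_null => ["Title"]
```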
|
||||||
|
|
||||||
|
Now, back to the TXXX frame. The description is decoded by calling our decode function with the text encoding from the frame, a maximum length of 1 less than the frame size (to account for the text encoding byte), and the data. It gives us back the value for the description string, how many bytes were consumed and the rest of the data. Then, from the `decode_frame` function we return a tuple of two elements: a tuple that represents the frame and the remaining binary data. The tuple that represents the frame is also two elements, the first of which is the frame ID and the second of which is the frame's value (in this case, yet another tuple of the description and value strings) so that we can produce a nice map of all the frame data at the end.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def decode_frame("TXXX", frame_size, <<text_encoding::size(8), rest::binary>>) do
|
||||||
|
{description, desc_size, rest} = decode_string(text_encoding, frame_size - 1, rest)
|
||||||
|
{value, _, rest} = decode_string(text_encoding, frame_size - 1 - desc_size, rest)
|
||||||
|
{{"TXXX", {description, value}}, rest}
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
Next is the COMM frame, for user specified comments. It starts with a byte for the text encoding, then three bytes specifying a language code, then a string for a short description of the comment, then another for the full value of the comment.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def decode_frame("COMM", frame_size, <<text_encoding::size(8), language::binary-size(3), rest::binary>>) do
|
||||||
|
{short_desc, desc_size, rest} = decode_string(text_encoding, frame_size - 4, rest)
|
||||||
|
{value, _, rest} = decode_string(text_encoding, frame_size - 4 - desc_size, rest)
|
||||||
|
{{"COMM", {language, short_desc, value}}, rest}
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
The other somewhat complex frame to decode is the APIC frame. It contains an image for the album artwork of the track. It starts with a byte for the text encoding of the description. Then there's a null-terminated ASCII string which contains the MIME type of the image. It's followed by a byte for the picture type (one of 15 spec-defined values which indicates what the picture represents). Then a string for the image description, and lastly the image data itself.
|
||||||
|
|
||||||
|
The only thing that's different about how the APIC frame is decoded compared to the other frames we've seen is that there's nothing in-band to signal the end of the picture data. It starts after the description ends and just keeps going until the end of the frame. So, we need to calculate the image's size (i.e., the frame size minus the size of all the data that comes before the image) and then just take that many bytes from the stream.
|
||||||
|
|
||||||
|
```elixir
|
||||||
|
def decode_frame("APIC", frame_size, <<text_encoding::size(8), rest::binary>>) do
|
||||||
|
{mime_type, mime_len, rest} = decode_string(0, frame_size - 1, rest)
|
||||||
|
|
||||||
|
<<picture_type::size(8), rest::binary>> = rest
|
||||||
|
|
||||||
|
{description, desc_len, rest} = decode_string(text_encoding, frame_size - 1 - mime_len - 1, rest)
|
||||||
|
|
||||||
|
image_data_size = frame_size - 1 - mime_len - 1 - desc_len
|
||||||
|
{image_data, rest} = :erlang.split_binary(rest, image_data_size)
|
||||||
|
|
||||||
|
{{"APIC", {mime_type, picture_type, description, image_data}}, rest}
|
||||||
|
end
|
||||||
|
```
|
||||||
|
|
||||||
|
Now, for the default case of the `decode_frame` function. There are a couple things this first needs to handle. First is text information frames. These are a class of frames whose IDs start with the letter T and are followed by three alpha-numeric (uppercase) characters. Each of these frames follows the same format and contains just a text encoding indicator and one or more strings.
|
||||||
|
|
||||||
|
<aside>
|
||||||
|
|
||||||
|
Originally, I tried to handle text information frames the same way I had with other frame types, with just a `when` condition for the frame ID parameter on a case of the function. It turns out function guards that check if a parameter is an element of a list work by generating a version of the function that matches each specific value. So, trying to check if a param was a valid text information frame ID meant checking whether it was an element of a 36^3 = 46,656 element long list. 46,656 individual function cases were generated, and the Elixir compiler took almost 10 minutes to compile just that file. And it crashed inside BEAM when I actually tried to run it.
|
||||||
|
|
||||||
|
</aside>
|
||||||
|
|
||||||
|
There are also the many, many frames which we have not specifically handled. Even if we don't do anything with them, the decoder still needs to be aware of their existence, because if it encounters a frame that it can't do anything with, it needs to skip over it. Otherwise, the decoder would halt upon encountering the first frame of an unhandled type, potentially missing subsequent frames that we do care about. To handle this, I took the list of all declared frames from the [ID3 native frames](https://id3.org/id3v2.4.0-frames) spec and copied it into a constant list that I can then check potential IDs against.
|
||||||
|
|
||||||
|
```elixir
def decode_frame(id, frame_size, rest) do
  cond do
    # Text information frame: "T" followed by exactly three uppercase alphanumerics.
    Regex.match?(~r/^T[0-9A-Z]{3}$/, id) ->
      decode_text_frame(id, frame_size, rest)

    # A declared frame we don't handle: skip its data and keep going.
    id in @declared_frame_ids ->
      <<_frame_data::binary-size(frame_size), rest::binary>> = rest
      {nil, rest, :cont}

    # Not a known frame ID at all: tell the caller to stop.
    true ->
      {nil, rest, :halt}
  end
end
```
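For comparison, the "T plus three uppercase alphanumerics" check can also be written as a binary pattern match with guards, without a regex and without enumerating every possible ID. This is only a sketch of an alternative, not what the parser in this post does; `TextFrameIdSketch`, `is_id_char/1`, and `text_frame_id?/1` are hypothetical names. It relies on `in` with a range expanding to a pair of comparisons inside a guard, so nothing proportional to 36^3 gets generated.

```elixir
defmodule TextFrameIdSketch do
  # Hypothetical guard: true for the bytes of "0".."9" and "A".."Z".
  defguardp is_id_char(c) when c in ?0..?9 or c in ?A..?Z

  # Matches only four-byte binaries that start with "T".
  def text_frame_id?(<<"T", a, b, c>>)
      when is_id_char(a) and is_id_char(b) and is_id_char(c),
      do: true

  # Anything else (wrong length, wrong prefix, lowercase, ...) is rejected.
  def text_frame_id?(_other), do: false
end
```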
If `decode_frame` encounters a text information frame, it delegates to another function which handles pulling out the actual value. Text information frames also have a slight difference: their values can be a list of multiple null-terminated strings. So, `decode_text_frame` calls `decode_string_sequence`, which decodes as many null-terminated strings as it can until it reaches the end of the frame.
```elixir
def decode_text_frame(id, frame_size, <<text_encoding::size(8), rest::binary>>) do
  {strs, rest} = decode_string_sequence(text_encoding, frame_size - 1, rest)
  {{id, strs}, rest}
end

def decode_string_sequence(encoding, max_byte_size, data, acc \\ [])

def decode_string_sequence(_, max_byte_size, data, acc) when max_byte_size <= 0 do
  {Enum.reverse(acc), data}
end

def decode_string_sequence(encoding, max_byte_size, data, acc) do
  {str, str_size, rest} = decode_string(encoding, max_byte_size, data)
  decode_string_sequence(encoding, max_byte_size - str_size, rest, [str | acc])
end
```
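Here is a small, self-contained sketch of that recursion running on hand-built data. The `StringSequenceSketch` module and its simplified `decode_string/3` are assumptions: this stand-in ignores the encoding argument, only understands single-byte null-terminated strings, and counts the terminator in the returned size. `decode_string_sequence` itself is copied from above.

```elixir
defmodule StringSequenceSketch do
  # Hypothetical single-byte stand-in for decode_string/3. It looks at no
  # more than max_byte_size bytes, and returns the string, the number of
  # bytes consumed (terminator included, if there was one), and the rest.
  def decode_string(_encoding, max_byte_size, data) do
    <<chunk::binary-size(max_byte_size), after_frame::binary>> = data

    case :binary.split(chunk, <<0>>) do
      [str, tail] -> {str, byte_size(str) + 1, tail <> after_frame}
      [str] -> {str, byte_size(str), after_frame}
    end
  end

  def decode_string_sequence(encoding, max_byte_size, data, acc \\ [])

  def decode_string_sequence(_, max_byte_size, data, acc) when max_byte_size <= 0 do
    {Enum.reverse(acc), data}
  end

  def decode_string_sequence(encoding, max_byte_size, data, acc) do
    {str, str_size, rest} = decode_string(encoding, max_byte_size, data)
    decode_string_sequence(encoding, max_byte_size - str_size, rest, [str | acc])
  end

  def demo do
    # Two terminated strings (8 + 7 = 15 bytes), then unrelated trailing bytes.
    data = <<"Justice", 0, "Simian", 0>> <> "NEXT"
    decode_string_sequence(0, 15, data)
  end
end
```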
If the frame ID is valid but wasn't specifically handled, `decode_frame` simply skips over the data for that frame. It returns `nil` for the frame itself and adds a third element to the returned tuple. This is an atom, either `:cont` or `:halt`, which signals to the main `parse_frames` loop whether it should keep going or stop when no frame is found.
```elixir
def parse_frames(...) do
  # ...

  result = decode_frame(frame_id, frame_size, rest)

  case result do
    {nil, rest, :halt} ->
      {Map.new(frames), rest}

    {nil, rest, :cont} ->
      parse_frames(major_version, rest, next_tag_length_remaining, frames)

    {new_frame, rest} ->
      parse_frames(major_version, rest, next_tag_length_remaining, [new_frame | frames])
  end
end
```
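The same `:cont`/`:halt` protocol can be tried end to end with a stripped-down loop. Everything in this sketch is an assumption made for illustration: the `FrameLoopSketch` module, a plain ID3v2.3-style frame header (4-byte ID, 32-bit size, 2 flag bytes) with no tag-length tracking, a four-entry stand-in for `@declared_frame_ids`, and text frames treated as one raw string with no encoding handling.

```elixir
defmodule FrameLoopSketch do
  # Illustrative subset only; the real list comes from the ID3 frames spec.
  @declared_frame_ids ~w(APIC COMM PCNT USLT)

  def parse_frames(data, frames \\ [])

  def parse_frames(<<id::binary-size(4), size::size(32), _flags::size(16), rest::binary>>, frames) do
    case decode_frame(id, size, rest) do
      {nil, rest, :halt} -> {Map.new(frames), rest}
      {nil, rest, :cont} -> parse_frames(rest, frames)
      {frame, rest} -> parse_frames(rest, [frame | frames])
    end
  end

  # Not enough bytes left for a frame header: stop.
  def parse_frames(rest, frames), do: {Map.new(frames), rest}

  # Simplified text frame: drop the encoding byte, keep the raw bytes.
  defp decode_frame(<<"T", _::binary-size(3)>> = id, size, data) do
    text_size = size - 1
    <<_encoding::size(8), text::binary-size(text_size), rest::binary>> = data
    {{id, text}, rest}
  end

  # Declared but unhandled: skip the frame's data and continue.
  defp decode_frame(id, size, data) when id in @declared_frame_ids do
    <<_skipped::binary-size(size), rest::binary>> = data
    {nil, rest, :cont}
  end

  # Unknown ID (padding, for example): halt.
  defp decode_frame(_id, _size, data), do: {nil, data, :halt}

  def demo do
    data =
      <<"TIT2", 5::size(32), 0::size(16), 0, "Song">> <>
        <<"PCNT", 4::size(32), 0::size(16), 0, 0, 0, 7>> <>
        <<"TALB", 6::size(32), 0::size(16), 0, "Album">> <>
        <<0::size(96)>>

    parse_frames(data)
  end
end
```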
After attempting to decode a frame, the `parse_frames` function matches on the result of the decode attempt. If the frame ID was not valid and it's instructed to halt, it creates a map from the list of frames and returns it along with the remainder of the data. If there was a valid frame, but nothing was decoded from it, it recurses, calling `parse_frames` again with whatever data came after the skipped frame. If a frame was decoded, it adds it to the list and again recurses.

And with that, we finally have enough to parse the ID3 data from a real live MP3 file and get out, maybe not all the information, but the parts I care about:
```elixir
iex> data = File.read!("test.mp3")
iex> ID3.parse_tag(data)
{
  %{
    "APIC" => {"image/jpeg", 3, "cover", <<...>>},
    "COMM" => {"eng", "", "Visit http://justiceelectro.bandcamp.com"},
    "TALB" => "Woman Worldwide",
    "TIT2" => "Safe and Sound (WWW)",
    "TPE1" => "Justice",
    "TPE2" => "Justice",
    "TRCK" => "1",
    "TYER" => "2018"
  },
  <<...>>
}
```
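Since the result is an ordinary map keyed by frame ID, pulling out the interesting parts is just a matter of map lookups. The sketch below is hypothetical (`TagSummarySketch` and its field names are made up) and assumes the frames map has the shape shown in the output above, with each text frame's value being a single string. Frames that aren't in the tag come back as `nil`; the TLEN frame, which the spec defines as the track length in milliseconds, is parsed into an integer when it is present.

```elixir
defmodule TagSummarySketch do
  # `frames` is assumed to look like the map in the iex output above.
  def summarize(frames) do
    %{
      title: Map.get(frames, "TIT2"),
      artist: Map.get(frames, "TPE1"),
      album: Map.get(frames, "TALB"),
      duration_ms: duration_ms(frames)
    }
  end

  # TLEN holds the length in milliseconds as a numeric string.
  defp duration_ms(%{"TLEN" => tlen}) when is_binary(tlen) do
    case Integer.parse(tlen) do
      {ms, _rest} -> ms
      :error -> nil
    end
  end

  # No TLEN frame (or an unexpected shape): duration unknown.
  defp duration_ms(_frames), do: nil
end
```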
One of the pieces of information I was hoping I could get from the ID3 tags was the durations of the MP3s in my library. But alas, none of the tracks I have use the TLEN frame, so it looks like I'll have to try and pull that data out of the MP3 myself. But that's a post for another time...