```
metadata.title = "Minecraft Mod Statistics"
metadata.tags = ["minecraft"]
metadata.date = "2021-03-17 18:06:42 -0400"
metadata.shortDesc = "Going over some usage statistics about my Minecraft mods."
metadata.slug = "minecraft-mod-statistics"
```

About a year ago, I was poking around the CloudFlare dashboard for my website, specifically the Security section of the Analytics page. To my surprise, it reported that it had blocked 78 thousand "bad browser" threats in the last 24 hours (almost one per second). Now, I don't have very much on my website. My blog doesn't get much traffic, and the only other thing that does is my fediverse instance. And that volume of inbound traffic is nowhere near what I would expect for my small instance, which probably doesn't federate with more than a couple hundred others.

I found the Firewall section of the dashboard, which shows the details of individual blocked requests. To my surprise, almost all of the blocked requests were to a subdomain I previously used as an update server for my Minecraft mods.

Forge, a Minecraft mod loader, provides a mechanism by which mods can specify a URL that Forge can use to fetch a JSON object describing the latest versions of the mod, in order to notify the player if an update is available. A few years ago, I built a [small tool](https://github.com/shadowfacts/github-update-json/) to generate JSON files in Forge's update format using Git repo tags from the GitHub API. This was running on my server, but at some point in the couple of years since I stopped actively building mods for Minecraft, I shut it down. And in the time since then, CloudFlare has decided that all the traffic to the update server is a threat and should therefore be blocked.

CloudFlare keeps a little bit of information about each blocked request going back quite a while, so this provides a surprising amount of information about the usage of my mods.
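For context, the update JSON that Forge fetches from that URL looks roughly like this (the version numbers and changelog strings here are made up for illustration):

```json
{
	"homepage": "https://example.com/mymod",
	"1.12.2": {
		"1.0.0": "Initial release.",
		"1.1.0": "Added widgets."
	},
	"promos": {
		"1.12.2-latest": "1.1.0",
		"1.12.2-recommended": "1.1.0"
	}
}
```

Each top-level Minecraft version maps mod versions to changelog entries, and the `promos` object tells Forge which versions to treat as the latest and recommended ones when deciding whether to notify the player.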
If you're not interested in the process, and just want to see the data: [jump ahead](#the-results).

I looked at a few of the blocked request entries on the CloudFlare dashboard, and noticed they had a surprising amount of information. Because Forge requires a single URL per mod, and I used the same update server for all my mods, the path contained the name of the mod Forge was requesting version data for. CloudFlare also stores the origin IP of the request, from which the country the mod was launched in can be roughly derived^[I'm also making the assumption that the vast majority of people aren't using a VPN or any other sort of proxy to run Minecraft, which I think is reasonable.]. Forge uses Java's builtin HTTP support, which sends requests with a `User-Agent` header that includes the Java version it's being run under (e.g., `Java/1.8.0_252`). And, of course, it also stores the date and time of the request.

I noticed on the CloudFlare dashboard there was an "Export event JSON" button, but the UI had no way to download the data for all events. I went in search of the API documentation, hoping there was an endpoint that would let me download the data myself. Luckily, there was. But unfortunately, the API was being deprecated and went away completely in October of 2020^[It's been replaced with some GraphQL API that looks a lot more complicated than I'm interested in learning for this tiny use case]. So, my script sadly no longer works. But last spring, when I collected the data, it was still available.

A few details about how the API endpoint used to work: in addition to the zone identifier, there are a few useful query parameters. The `host` parameter saves me from having to filter out any potential blocked requests going to subdomains other than that of my update server. I also set the `limit` parameter to its maximum value of 1000 results, to minimize the time it would take to download all the data.
Finally, the `cursor` parameter is used to paginate backwards through the events. Providing no value for it simply returned the most recent events.

Armed with this knowledge, I wrote a simple Node.js script (because JavaScript makes dealing with JSON slightly easier). I used the [`node-fetch`](https://www.npmjs.com/package/node-fetch) package instead of the builtin `http.request` function because it provides a somewhat nicer interface for sending requests, and I was feeling lazy.

```js
const fetch = require("node-fetch");
const fs = require("fs").promises;
const path = require("path");

const ZONE_ID = "";
const API_TOKEN = "";
const HOST = "";
const TIMEOUT = 30;

async function getLogs(index, cursor) {
	const url = new URL(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/security/events`);
	url.searchParams.set("host", HOST);
	url.searchParams.set("limit", 1000);
	if (cursor) {
		url.searchParams.set("cursor", cursor);
	}
	console.log(`Request #${index}: ${url.href}`);
	const result = await fetch(url.href, {
		headers: {
			"Authorization": `Bearer ${API_TOKEN}`
		}
	});
	const json = await result.json();
	// response handling continues below
}
```

This function sets up a request and gets the result object from CloudFlare, nothing too interesting. The JSON from a single request looked like this (with the actual results elided):

```json
{
	"result": [...],
	"result_info": {
		"cursors": {
			"after": "cnlBvAlcpOkKXDrAS2z-clbjEgZZmomS4HxkMdN3Vswxccy66MTSDHsa1XFRetbfapnaxYhGJn7Skir9znE",
			"before": "-9Xf1eykd8chYK8A6S2mr2OPR1mTcYEejlgYJHC_HZYVHLhAkKlZSQOJRVUUU5SgFjH3zx0585ZUDtRKkiU3"
		},
		"scanned_range": {
			"since": "2020-07-07 01:48:51",
			"until": "2020-07-07 02:09:07"
		}
	},
	"success": true,
	"errors": [],
	"messages": []
}
```

Next, there are a couple possibilities when handling the response: if the request failed (only as reported by CloudFlare; I didn't bother with actual HTTP error handling), the script waits 30 seconds, in the hope that the issue will have resolved itself by then, before retrying the same request.
If the request was successful, it dumps the JSON it received to disk as-is (further processing, like getting rid of all the extraneous data that comes with each request, waits until a later step). Then it recurses, calling the `getLogs` function again with the next index and the value of the `before` cursor returned by the current request. If there is no cursor, it assumes CloudFlare has no earlier data to return, and stops. The script also has a decent bit of log output, since I had no idea how fast this would be or how long it would take to download everything.

```js
if (json.success) {
	const text = JSON.stringify(json);
	try {
		await fs.writeFile(path.join(__dirname, "output", `${index}.json`), text);
		if (json.result_info.cursors.before) {
			console.log(`Got results from ${json.result_info.scanned_range.since} until ${json.result_info.scanned_range.until}`);
			getLogs(index + 1, json.result_info.cursors.before);
		} else {
			console.log("'before' cursor not present. Done.");
		}
	} catch (err) {
		console.error("Error writing output. Stopping.");
		console.error(err);
	}
} else {
	console.warn(`Request ${index} failed:`, json.errors);
	console.warn(`Retrying in ${TIMEOUT} seconds...`);
	// setTimeout takes the callback first, then the delay
	setTimeout(() => {
		getLogs(index, cursor);
	}, TIMEOUT * 1000);
}
```

To actually start, if there are arguments given, the script uses them as the starting index and cursor (in case it failed, and I had to manually restart it). Otherwise, it just starts at index 0 with no cursor (meaning CloudFlare will return the most recent results).

```js
if (process.argv.length == 2) {
	getLogs(0);
} else if (process.argv.length == 4) {
	getLogs(parseInt(process.argv[2]), process.argv[3]);
} else {
	console.error("Expected 0 or 2 arguments.");
}
```

I kicked off the script late one evening, with no idea how long it was going to take. It was moving at a pretty reasonable clip, sending about one request every second or two.
I was initially not sure how fast it would be, and I knew that CloudFlare's API has a rate limit of 1200 requests per 5 minutes, which is why I added what little error-handling code there is: hopefully the script would continue moving along if it got rate limited. About one request per second is only roughly 300 requests in five minutes, though, nowhere near the rate limit. I'm not entirely sure why it was that slow: sending an HTTP request isn't that expensive, and an individual request was only returning about 600KB of data, so bandwidth shouldn't have been a problem. I suspect the bottleneck may have been parsing JSON, but it wasn't slow enough that I actually bothered trying to profile or optimize anything.

Anyway, I only expected it to end up downloading about a month's worth of data, and going a thousand events at a time, that wouldn't take too long. Given there were only about 78 thousand "threats" stopped that day, a month of data would be roughly 2.34 million events, so I expected it to stop after request 2,340 or so. But it hit that mark and just kept going, downloading data from thousands and thousands more events. It continued running for about 30 minutes, showing no sign of slowing down.

By this point, it was late enough that I wanted to go to sleep soon: I had work the next day and didn't want to babysit this script all night. So, I did some back-of-the-napkin math: it had been running for about 30 minutes, and had downloaded about 1.8 gigabytes of data. Assuming it continued at that rate (actually, slightly faster, just to be on the safe side), if it ran for the next 9 hours, it would produce at most 36 gigabytes of data. I was reasonably confident in this, as there was no way for it to speed up appreciably (at least, if the bottleneck was JSON parsing, as I suspected). I actually stayed up another half an hour or so, by which point it had downloaded 3.3 gigabytes of data, for about 5 million requests.
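Neither estimate is anything fancier than arithmetic, but for completeness, the two back-of-the-napkin figures above work out like so:

```javascript
// ~78,000 blocked events per day, 1000 events per API request
const expectedRequests = 78000 * 30 / 1000; // 2340 requests for a month of data

// ~1.8 GB downloaded in the first half hour of the run
const observedRate = 1.8 / 0.5; // 3.6 GB/hour
const paddedRate = 4;           // rounded up, "to be on the safe side"
const upperBound = paddedRate * 9; // at most 36 GB if it ran 9 more hours

console.log(expectedRequests, upperBound);
```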
So, I went to sleep, knowing that even if it downloaded vastly more data than I expected, I wasn't going to run out of disk space.

As it turned out, it didn't actually run that much longer. After another 30 minutes or so, having downloaded 5.67 gigabytes of data covering a total of 9.54 million requests, it finally emitted the `'before' cursor not present. Done.` message and stopped. Which is how I found it the next morning, alongside a nearly 5.7 gigabyte folder containing 9,551 JSON files.

This is a great deal more than the one month of data I expected it to output. It actually pulled data for requests going back to 00:00:00 UTC on April 1, 2020. That's more than three months' worth of data, despite the CloudFlare dashboard showing information going back at most one month. Somewhat interestingly, the CF API continued returning cursors for earlier data, but when requests were made with those cursors, no data came back. The API sent back empty responses with earlier and earlier cursors, ultimately going back to August 1, 2019.

So, I was left with a folder of 9,543 JSON files containing actual data. Since this is not a format that's at all conducive to analysis, I wrote another small script the next day to take this folder full of JSON files and turn it into a single data set.
```js
const fs = require("fs");
const path = require("path");

const MAX = 9542;

const stream = fs.createWriteStream(path.join(__dirname, "results.json"));
stream.on("finish", () => {
	console.log("Finished writing.");
});

stream.write("[\n", "utf-8");
for (let i = 0; i <= MAX; i++) {
	const buffer = fs.readFileSync(path.join(__dirname, "output", `${i}.json`));
	const json = JSON.parse(buffer);
	console.log(`Writing results for index ${i}`);
	for (const entry of json.result) {
		stream.write(JSON.stringify(entry) + ",\n", "utf-8");
	}
}
// end() flushes the remaining data and fires the "finish" event above
stream.end("]\n", "utf-8");
```

For each JSON file output by the API-consuming script, I parsed its contents, and then wrote each individual firewall event returned from the API on its own line of a new combined results JSON file. Each event is kept on its own line to make analysis easier, because I'm able to open a read stream into the file and get individual events just by reading lines one by one. This is a lot easier than trying to load the entire resulting 3.7 gigabyte file into memory and parse it all in one go^[In theory, you should be able to just mmap the file and then parse it, but a cursory search didn't reveal any obvious/easy ways of doing this in Node.js.].

Now, to actually do something with the data. Each of the entries looks something like this:

```json
{
	"ray_id": "5aee04ff8ac7f8ab",
	"kind": "firewall",
	"source": "bic",
	"action": "drop",
	"rule_id": "bic",
	"ip": "222.150.231.47",
	"ip_class": "noRecord",
	"country": "JP",
	"colo": "NRT",
	"host": "update.shadowfacts.net",
	"method": "GET",
	"proto": "HTTP/1.1",
	"scheme": "https",
	"ua": "Java/1.8.0_51",
	"uri": "/shadowmc",
	"matches": [
		{
			"rule_id": "bic",
			"source": "bic",
			"action": "drop"
		}
	],
	"occurred_at": "2020-07-07T02:08:46Z"
}
```

There's a bunch of interesting information in there.
To start with, I knew I wanted to count the unique user agents (i.e., Java versions) and paths (the mods being used), as well as the countries and IP addresses the requests originated from.

To start off, there are a bunch of [`Map`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map) objects (which are a better key-value store than plain JS objects) which store the aggregated statistics for all the entries. There's also a helper function that either increments an existing value in a map or sets it to 1.

```js
const fs = require("fs");
const path = require("path");
const readline = require("readline");

const userAgents = new Map();
const paths = new Map();
const countries = new Map();
const ips = new Map();

function incrementStat(map, key) {
	if (map.has(key)) {
		map.set(key, map.get(key) + 1);
	} else {
		map.set(key, 1);
	}
}
```

To read the data, I just create a read stream and go through it line by line. If a line doesn't start with an opening curly brace, it's either the first or the last line and can therefore be skipped. Additionally, after parsing the line as JSON (minus the trailing comma), if the user agent string doesn't start with "Java", the item is skipped. There aren't many, but there are a few spurious requests, likely from bots scraping every domain they find, testing paths like `/wp-admin` and `/wp-config.php.old` in the hope of finding a vulnerable installation. For requests that do have a Java user agent, the `incrementStat` function above is called for each of the various tracked statistics.
```js
(async () => {
	const stream = fs.createReadStream(path.join(__dirname, "results.json"));
	const rl = readline.createInterface({ input: stream });
	for await (const line of rl) {
		if (!line.startsWith("{")) continue;
		const withoutComma = line.substring(0, line.length - 1);
		const item = JSON.parse(withoutComma);
		if (!item.ua.startsWith("Java")) continue;

		incrementStat(userAgents, item.ua);
		incrementStat(paths, item.uri);
		incrementStat(countries, item.country);
		incrementStat(ips, item.ip);
	}
})();
```

And with that, I can dump the individual stats to separate JSON files, as well as calculate some more things based on the aggregated information. So, without further ado:

# The Results

Before you look at the numbers, take everything here with a hefty helping of salt. While the data is mostly in line with what I'd expect, some of it was vastly different from anything I would have imagined.

First off: breaking down the sessions by the IP addresses the requests came from. There were requests made from 1.25 million unique IP addresses, and there were a total of 9.5 million requests made, making for an average of 7.7 mod launches per IP address. A little bit low, but not far from what I'd expect. The median number of mod launches per IP address is 2, which indicates that the vast majority of the IP addresses were responsible for very few sessions each, with fewer IP addresses accounting for far more game launches.
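As a sketch of how those two summary numbers can be pulled out of the `ips` map built above (the `launchStats` helper is mine, not part of the original script):

```javascript
// Hypothetical helper: summarize launches-per-IP from a Map of ip -> count.
function launchStats(ips) {
	// All per-IP request counts, sorted ascending so the median can be read off
	const counts = [...ips.values()].sort((a, b) => a - b);
	const total = counts.reduce((sum, n) => sum + n, 0);
	const average = total / counts.length;
	const mid = Math.floor(counts.length / 2);
	const median = counts.length % 2 === 0
		? (counts[mid - 1] + counts[mid]) / 2
		: counts[mid];
	return { average, median };
}

// e.g. three IPs with 1, 2, and 9 launches: average 4, median 2
console.log(launchStats(new Map([["a", 1], ["b", 2], ["c", 9]])));
```

A mean that sits well above the median like this is the usual signature of a long-tailed distribution: a small number of heavy players drag the average up.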