```
metadata.title = "Minecraft Mod Statistics"
metadata.tags = ["minecraft"]
metadata.date = "2021-03-17 18:06:42 -0400"
metadata.shortDesc = "Going over some usage statistics about my Minecraft mods."
metadata.slug = "minecraft-mod-statistics"
```
About a year ago, I was poking around the CloudFlare dashboard for my website, specifically the Security section of the Analytics page. To my surprise, it reported that it had blocked 78 thousand "bad browser" threats in the last 24 hours (almost one a second). Now, I don't have very much on my website. My blog doesn't get much traffic, and the only other thing that does is my fediverse instance. And that volume of inbound traffic is nowhere near what I would expect for my small instance, which probably doesn't federate with more than a couple hundred others. I found the Firewall section of the dashboard, which shows the details of individual blocked requests. To my surprise, almost all of the blocked requests were to a subdomain I previously used as an update server for my Minecraft mods. Forge, a Minecraft mod loader, provides a mechanism by which mods can specify a URL that Forge can use to get a JSON object describing the latest versions of the mod, in order to notify the player if an update is available. A few years ago, I built a [small tool](https://github.com/shadowfacts/github-update-json/) to generate JSON files in Forge's update format using Git repo tags from the GitHub API. This was running on my server, but some time in the couple years since I've stopped actively building mods for Minecraft, I shut it down. And in the time since then, CloudFlare has decided that all the traffic to the update server is a threat and should therefore be blocked. CloudFlare keeps a little bit of information about each blocked request going back quite a while, so this provides a surprising amount of information about the usage of my mods.
<!-- excerpt-end -->
<style type="text/css"><%- include("chart.css") %></style>
If you're not interested in the process, and just want to see the data: [jump ahead](#the-results).
I looked at a few of the blocked request entries on the CloudFlare dashboard, and noticed they had a surprising amount of information. Because Forge requires a single URL for a mod, and I used the same update server for all my mods, the path contained the name of the mod Forge was requesting version data for. CloudFlare also stores the origin IP of the request, from which the country where the mod was launched can be roughly derived ^[I'm also making the assumption that the vast majority of people aren't using a VPN or any other sort of proxy to run Minecraft, which I think is reasonable.]. Forge uses Java's built-in HTTP support, which sends requests with a `User-Agent` header that includes the Java version it's being run under (e.g., `Java/1.8.0_252`). And, of course, CloudFlare also stores the date and time of the request.
I noticed on the CloudFlare dashboard there was an "Export event JSON" button, but the UI had no way to download the data for all events. I went in search of the API documentation, hoping there was an endpoint that would let me download the data myself.
Luckily, there was. But unfortunately, the API was deprecated and went away completely in October of 2020^[It's been replaced with some GraphQL API that looks a lot more complicated than I'm interested in learning for this tiny use case]. So, my script sadly no longer works. But last spring, when I collected the data, it was still available.
A few details about how the API endpoint used to work: In addition to the zone identifier, there were a few useful query parameters. The `host` parameter saved me from having to filter out any potential blocked requests going to subdomains other than that of my update server. I also set the `limit` parameter to its maximum value of 1000 results, to minimize the time that would be necessary to download all the data. Finally, the `cursor` parameter was used to paginate backwards through the events. Providing no value for it simply returned the most recent events.
Armed with this knowledge, I wrote a simple Node.js script (because JavaScript makes dealing with JSON slightly easier). I used the [`node-fetch`](https://www.npmjs.com/package/node-fetch) package instead of the builtin `http.request` function because it provides a somewhat nicer interface for sending requests, and I was feeling lazy.
```js
const fetch = require("node-fetch");
const fs = require("fs").promises;
const path = require("path");

const ZONE_ID = "";
const API_TOKEN = "";
const HOST = "";
const TIMEOUT = 30; // seconds to wait before retrying a failed request

async function getLogs(index, cursor) {
    const url = new URL(`https://api.cloudflare.com/client/v4/zones/${ZONE_ID}/security/events`);
    url.searchParams.set("host", HOST);
    url.searchParams.set("limit", 1000);
    if (cursor) {
        url.searchParams.set("cursor", cursor);
    }
    console.log(`Request #${index}: ${url.href}`);
    const result = await fetch(url.href, {
        headers: {
            "Authorization": `Bearer ${API_TOKEN}`
        }
    });
    const json = await result.json();
}
```
This function sets up a request and gets the result object from CloudFlare, nothing too interesting. The JSON from a single request looked like this (with the actual results elided):
```json
{
    "result": [...],
    "result_info": {
        "cursors": {
            "after": "cnlBvAlcpOkKXDrAS2z-clbjEgZZmomS4HxkMdN3Vswxccy66MTSDHsa1XFRetbfapnaxYhGJn7Skir9znE",
            "before": "-9Xf1eykd8chYK8A6S2mr2OPR1mTcYEejlgYJHC_HZYVHLhAkKlZSQOJRVUUU5SgFjH3zx0585ZUDtRKkiU3"
        },
        "scanned_range": {
            "since": "2020-07-07 01:48:51",
            "until": "2020-07-07 02:09:07"
        }
    },
    "success": true,
    "errors": [],
    "messages": []
}
```
Next, there are a couple of possibilities when handling the response. If the request failed (only as reported by CF; I didn't bother with actual HTTP error handling), the script waits 30 seconds, in the hope that the issue will have resolved itself by then, before retrying the same request. If the request was successful, it dumps the JSON it received to disk as-is (further processing, like getting rid of all the extraneous data that comes with each response, waits until a later step). Then it recurses, calling the `getLogs` function again with the next index and the value of the `before` cursor returned in the current response. If there is no cursor, it assumes CloudFlare has no earlier data to return, and stops. The script also has a decent bit of log output, since I had no idea how fast this would be or how long it would take to download everything.
```js
// (continues inside getLogs, after `json` has been parsed)
if (json.success) {
    const text = JSON.stringify(json);
    try {
        await fs.writeFile(path.join(__dirname, "output", `${index}.json`), text);
        if (json.result_info.cursors.before) {
            console.log(`Got results from ${json.result_info.scanned_range.since} until ${json.result_info.scanned_range.until}`);
            getLogs(index + 1, json.result_info.cursors.before);
        } else {
            console.log("'before' cursor not present. Done.");
        }
    } catch (err) {
        console.error("Error writing output. Stopping.");
        console.error(err);
    }
} else {
    console.warn(`Request ${index} failed:`, json.errors);
    console.warn(`Retrying in ${TIMEOUT} seconds...`);
    // setTimeout takes the callback first and the delay (in milliseconds) second.
    setTimeout(() => {
        getLogs(index, cursor);
    }, TIMEOUT * 1000);
}
```
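The same backwards pagination can also be written as a loop instead of recursion. The following is only a sketch under stated assumptions, not the script that was actually run: `downloadAll`, `fetchPage`, `handlePage`, and the `maxRetries` option are hypothetical names, and the HTTP call is passed in as a callback so the logic can be exercised without the (now removed) endpoint.

```js
// Sketch only. `fetchPage(cursor)` and `handlePage(json, index)` are
// hypothetical callbacks: the first performs the HTTP request and resolves to
// the parsed response body, the second stores one page (e.g. writes it to disk).
async function downloadAll(fetchPage, handlePage, { retryDelayMs = 30000, maxRetries = 5 } = {}) {
    let cursor; // undefined means "start from the most recent events"
    let index = 0;
    let failures = 0;
    while (true) {
        const json = await fetchPage(cursor);
        if (!json.success) {
            failures += 1;
            if (failures > maxRetries) {
                throw new Error(`Request ${index} failed ${failures} times in a row`);
            }
            // Wait, then retry with the same cursor.
            await new Promise((resolve) => setTimeout(resolve, retryDelayMs));
            continue;
        }
        failures = 0;
        await handlePage(json, index);
        index += 1;
        cursor = json.result_info.cursors.before;
        if (!cursor) {
            return index; // number of pages downloaded
        }
    }
}
```

Unlike the recursive version, this gives up after a fixed number of consecutive failures rather than retrying forever, which is a design choice of the sketch and not something the original script did.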
To actually start, if there are arguments given, it uses them as the starting index and cursor (in case it failed, and I had to manually restart it). Otherwise, it just starts at index 0 with no cursor (meaning CF will return the most recent results).
```js
if (process.argv.length == 2) {
    // No arguments: start from the most recent events.
    getLogs(0);
} else if (process.argv.length == 4) {
    // Resume from a given index and cursor.
    getLogs(parseInt(process.argv[2], 10), process.argv[3]);
} else {
    console.error("Expected 0 or 2 arguments.");
}
```
I kicked off the script late one evening, with no idea how long it was going to take. It was moving at a pretty reasonable clip, sending about 1 request every second or two. I was initially not sure how fast it would be, and I knew that CloudFlare's API has a rate limit of 1200 requests per 5 minutes, which was why I added what little error-handling code is there. Hopefully it would continue moving along if it was rate limited. About 1 request per second is roughly 300 requests in five minutes, though: nowhere near the rate limit. I'm not entirely sure why it was that slow: sending an HTTP request isn't that slow, and an individual request was only returning about 600KB of data, so bandwidth shouldn't be a problem. I suspect the bottleneck may have been parsing JSON, but it wasn't slow enough that I actually bothered trying to profile or optimize anything. Anyway, I only expected it to end up downloading about a month's worth of data, and going a thousand events at a time, it wouldn't take too long.
Given there were only about 78 thousand "threats" stopped that day, I expected it to stop after request 2,340 or so (30 days at 78 thousand events a day, with 1,000 events per request). But it hit that mark, and just kept going, downloading data from thousands and thousands more events. It continued running for about 30 minutes, showing no sign of slowing down. By this point, it was late enough that I wanted to go to sleep soon: I had work the next day and didn't want to babysit this script all night. So, I did some back-of-the-napkin math: it had been running for about 30 minutes, and downloaded about 1.8 gigabytes of data. Assuming it continued at that rate (actually, slightly faster, just to be on the safe side) for the next 9 hours, it would produce at most 36 gigabytes of data. I was reasonably confident in this, as there was no way for it to speed up appreciably (at least, if the bottleneck was JSON parsing as I suspected). I actually stayed up another half an hour or so, by which point it had downloaded 3.3 gigabytes of data, for about 5 million blocked requests. So, I went to sleep, knowing that even if it downloaded vastly more data than I expected, I wasn't going to run out of disk space.
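The napkin math can be written out explicitly. This is just a worked restatement of the figures in the text; the 4 GB per hour padded rate is an assumption inferred from the 36 gigabyte ceiling (36 divided by 9 hours), not a number measured at the time.

```js
// Expected number of API requests for one month of events.
const eventsPerDay = 78000;    // "bad browser" threats per day, from the dashboard
const daysExpected = 30;       // the dashboard only showed about a month
const eventsPerRequest = 1000; // maximum value of the `limit` parameter
const expectedRequests = (eventsPerDay * daysExpected) / eventsPerRequest;
console.log(expectedRequests); // 2340

// Worst-case disk usage overnight.
const observedGBPerHour = 1.8 / 0.5; // 1.8 GB in the first half hour
const paddedGBPerHour = 4;           // assumed: rounded up to be on the safe side
const hoursAsleep = 9;
console.log((observedGBPerHour * hoursAsleep).toFixed(1)); // 32.4
console.log(paddedGBPerHour * hoursAsleep);                // 36
```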
As it turned out, it didn't actually run that much longer. After another 30 minutes or so, having downloaded 5.67 gigabytes of data covering a total of 9.54 million blocked requests, it finally emitted the `'before' cursor not present. Done.` message and stopped. Which is how I found it the next morning, alongside a nearly 5.7 gigabyte folder containing 9,543 JSON files.
This is a great deal more than the one month of data I expected it to output. It actually pulled data for requests going back to 00:00:00 UTC on April 1, 2020. That's more than three months' worth of data, despite the CloudFlare dashboard showing only information going back 1 month at most. Somewhat interestingly, the CF API continued returning cursors for earlier data. But when requests were made with those cursors, no data was returned. The API sent back empty responses with no data but earlier and earlier cursors, ultimately going back to August 1, 2019.
So, I was left with a folder of 9,543 JSON files. Since this is not a format that's at all conducive to analysis, I wrote another small script the next day to take this folder full of JSON files and turn it into a single data set.
```js
const fs = require("fs");
const path = require("path");

const MAX = 9542; // index of the last downloaded file

const stream = fs.createWriteStream(path.join(__dirname, "results.json"));
stream.write("[\n", "utf-8");
for (let i = 0; i <= MAX; i++) {
    const buffer = fs.readFileSync(path.join(__dirname, "output", `${i}.json`));
    const json = JSON.parse(buffer);
    console.log(`Writing results for index ${i}`);
    // One event per line, each followed by a comma.
    for (const entry of json.result) {
        stream.write(JSON.stringify(entry) + ",\n", "utf-8");
    }
}
// "finish" is only emitted after end() has been called, so register the
// handler first and then end the stream with the closing bracket.
stream.on("finish", () => {
    console.log("Finished writing.");
});
stream.end("]\n", "utf-8");
```
For each JSON file output by the API consuming script, I parsed its contents, and then output each individual firewall event returned from the API on its own line of a new combined results JSON file. Each event is kept on its own line to make analyzing it easier, because I'm able to open a read stream into the file and get individual events just by reading lines one-by-one. This is a lot easier than trying to load the entire 3.7 gigabyte resulting file into memory and parse it all in one go^[In theory, you should be able to just mmap the file and then parse it, but a cursory search didn't reveal any obvious/easy ways of doing this in Node.js.].
<aside>
I made the mistake of trying to use `jq` to pull out some basic stats. It did not go well, even with the simplest of commands. It pegged one of my CPU cores for 4 minutes and the memory usage peaked at 30 gigabytes.
```bash
$ jq "length" results.json
parse error: Expected another array element at line 9542062, column 1
```
And, of course, it still failed. Because I appended a comma after every event in the file, even the last one.
</aside>
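For comparison, one way to keep one event per line and still produce a valid JSON array is to write the separator before every entry except the first. This is a sketch of that idea rather than what the combining script did; `jsonArrayLines` is a hypothetical helper name.

```js
// Sketch only: yields the chunks of a JSON array with one entry per line.
// The ",\n" separator is emitted *before* every entry except the first,
// so the last entry is never followed by a comma.
function* jsonArrayLines(entries) {
    yield "[\n";
    let first = true;
    for (const entry of entries) {
        yield (first ? "" : ",\n") + JSON.stringify(entry);
        first = false;
    }
    yield "\n]\n";
}

const text = [...jsonArrayLines([{ uri: "/shadowmc" }, { uri: "/shadowmc" }])].join("");
console.log(JSON.parse(text).length); // 2
```

The trade-off is that the final event line no longer ends with a comma, so any line-by-line reader that strips the last character would have to do so only when the line actually ends with one.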
Now, to actually do something with the data. Each of the entries looks something like this:
```json
{
    "ray_id": "5aee04ff8ac7f8ab",
    "kind": "firewall",
    "source": "bic",
    "action": "drop",
    "rule_id": "bic",
    "ip": "222.150.231.47",
    "ip_class": "noRecord",
    "country": "JP",
    "colo": "NRT",
    "host": "update.shadowfacts.net",
    "method": "GET",
    "proto": "HTTP/1.1",
    "scheme": "https",
    "ua": "Java/1.8.0_51",
    "uri": "/shadowmc",
    "matches": [
        {
            "rule_id": "bic",
            "source": "bic",
            "action": "drop"
        }
    ],
    "occurred_at": "2020-07-07T02:08:46Z"
}
```
There's a bunch of interesting information in there. To start with, I knew I wanted to count the unique user agents (i.e., Java versions) and paths (the mods being used), as well as the countries and IP addresses the requests originated from.
<aside>
If you're wondering what the `matches` array is about, it tells you the cause of the security event and what action CloudFlare took in response. In this case, all of the events were triggered by `bic`, short for Browser Integrity Check, presumably triggered because of the unusual user agent. If you've ever visited a website that was served through CloudFlare and been greeted by a CloudFlare icon and a loading spinner with some text saying something about checking your browser before letting you proceed, that's what this is. Additionally, all of the requests bound for the update server were dropped by CF.
</aside>
On to actually doing something with the data. To start off, there are a bunch of [`Map`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map) objects (which are a better key-value store than plain JS objects) which will store the aggregated statistics for all the entries. There's also a helper function that either increments an existing value in a map or sets it to 1.
```js
const fs = require("fs");
const path = require("path");
const readline = require("readline");

const userAgents = new Map();
const paths = new Map();
const countries = new Map();
const ips = new Map();

function incrementStat(map, key) {
    if (map.has(key)) {
        map.set(key, map.get(key) + 1);
    } else {
        map.set(key, 1);
    }
}
```
To read the data, I just create a read stream and read through it line by line. If the line doesn't start with an opening curly brace, it's either the first or the last line and can therefore be skipped. Additionally, after parsing the line as JSON (minus the trailing comma), if the user agent string doesn't start with "Java", the item is skipped. There aren't many, but there are a few spurious requests, likely from bots scraping every domain they find, testing paths like `/wp-admin` and `/wp-config.php.old`, hoping for a vulnerable installation. For requests that do have a Java user agent, the `incrementStat` function above is called for each of the various tracked statistics.
```js
(async () => {
    const stream = fs.createReadStream(path.join(__dirname, "results.json"));
    const rl = readline.createInterface({
        input: stream
    });
    for await (const line of rl) {
        // Skip the "[" and "]" lines.
        if (!line.startsWith("{")) continue;
        // Every event line ends with a comma; drop it before parsing.
        const withoutComma = line.substring(0, line.length - 1);
        const item = JSON.parse(withoutComma);
        if (!item.ua.startsWith("Java")) continue;
        incrementStat(userAgents, item.ua);
        incrementStat(paths, item.uri);
        incrementStat(countries, item.country);
        incrementStat(ips, item.ip);
    }
})();
```
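Once the loop has finished, the maps hold the totals. Writing them out takes one extra step, because `JSON.stringify` serializes a `Map` as an empty object. A minimal sketch of one way to handle that follows; `sortedEntries` is a hypothetical helper and the counts are made up for illustration.

```js
// Sketch only: `sortedEntries` is a hypothetical helper, and the counts below are made up.
function sortedEntries(map) {
    // Copy the [key, count] pairs and order them from most to least frequent.
    return [...map.entries()].sort((a, b) => b[1] - a[1]);
}

const exampleUserAgents = new Map([
    ["Java/1.8.0_252", 2],
    ["Java/1.8.0_51", 5],
]);

console.log(JSON.stringify(exampleUserAgents));                // {}
console.log(JSON.stringify(sortedEntries(exampleUserAgents))); // [["Java/1.8.0_51",5],["Java/1.8.0_252",2]]
```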
And with that, I can dump the individual stats to separate JSON files as well as calculate some more things based on the aggregated information. So, without further ado:
# The Results
Before you look at the numbers, take everything here with a hefty helping of salt. While the data is mostly in line with what I'd expect, some results were vastly different from anything I would have imagined.
First off: breaking down the sessions by the IP address that made the requests. There were requests made from 1.25 million unique IP addresses, and there were a total of 9.5 million requests made, making for an average of 7.7 mod launches per IP address. A little bit low, but not far from what I expect. The median number of mod launches per IP address is 2, which indicates that the vast majority of the IP addresses were responsible for very few sessions each, with fewer IP addresses accounting for far more game launches.
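The gap between the mean and the median is what signals the skew. As a small illustration with made-up counts (not the real data), a handful of heavy hitters pulls the mean well above the median; the `mean` and `median` helpers here are hypothetical and would take something like `[...ips.values()]` as input.

```js
// Sketch only: helper names and sample counts are illustrative.
function mean(counts) {
    return counts.reduce((sum, count) => sum + count, 0) / counts.length;
}

function median(counts) {
    // Sort a copy numerically; the default sort compares as strings.
    const sorted = [...counts].sort((a, b) => a - b);
    const mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

const launchesPerIp = [1, 1, 2, 2, 3, 39]; // made-up sample with one outlier
console.log(mean(launchesPerIp));   // 8
console.log(median(launchesPerIp)); // 2
```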
<div class="article-content-wide">
<div id="ip-session-chart-container" class="chart-container">
<%- include("ip-sessions.html.ejs") %>
</div>
<p class="chart-caption container">Number of unique IP addresses (y-axis) with a given mod launch count (x-axis, 0 through 500).</p>
</div>
This is one of the most surprising results. Most individual IP addresses only made a request for a single one of my mods. This is not at all what I was expecting. Each of my mods depends on ShadowMC, a library mod I wrote. By themselves, the other mods can't even function—the game won't launch without ShadowMC. But just ShadowMC by itself doesn't actually affect the gameplay in any way.
Another surprise was that a fair few individual IP addresses generated an astronomical number of requests. There were 11 IP addresses that were each responsible for over 10,000 mod launches in the past three months. The greatest of these was 27,267 mod launches. Even assuming all four mods were used, that's about 6,800 game launches. Over a period of about 100 days, that's 68 game launches per day. Initially, my only guess was that these were game launches coming from a huge number of people behind [CGNAT](https://en.wikipedia.org/wiki/Carrier-grade_NAT). But when I looked up the [ASNs](https://en.wikipedia.org/wiki/Autonomous_system_%28Internet%29) for the worst offenders, things became slightly clearer.
- `OVH`: 95,161 requests total
- `HETZNER-AS`: 27,267 requests
- `COMCAST-7922`: 25,339 requests
- `ZONENETWORKS-AU ZONENETWORKS.COM.AU - Hosting Provider AUSTRALIA, AU`: 23,306 requests
- `TWC-10796-MIDWEST`: 13,967 requests
- `WOW-INTERNET`: 10,986 requests
Hetzner, OVH, and Zone Networks are all server hosting providers, so a huge chunk of the requests presumably came from Minecraft servers running in their data centers (though I'm surprised Forge runs version checks on dedicated servers, given that there's no user interface for presenting the results to the player). The remaining three ASNs all belong to ISPs, so my best guess for their unusually high level of traffic is that they're using CGNAT.
Broken down by mods, the traffic is unsurprising. ShadowMC, being a library mod that all of my others depend on, was the most launched at 9.5 million hits (far and away the vast majority of the requests). From there, Ye Olde Tanks had 25.8 thousand launches and Underwater Utilities had 15.7 thousand, which is roughly in line with their relative popularity. Finally, Crafting Slabs came in with a whopping 23 launches over the past three months. This wasn't all that surprising, as Crafting Slabs was never very popular and it was only updated through Minecraft 1.11, whereas the rest were updated to 1.12.
I had a number of other mods with over a million downloads that aren't represented here because they never used the update JSON mechanism. This likely accounts for the vast discrepancy between the request count for ShadowMC and the total request count for the other mods.
Next up: Java versions. Every single request was made with Java 8, which is unsurprising because Minecraft 1.12 (the last version for which I updated my mods, and the only version for which I ever enabled the update server) requires at least Java 8, and Forge for Minecraft 1.12 did not support Java versions newer than 8 (due to [Project Jigsaw](http://openjdk.java.net/projects/jigsaw/)).
<div class="article-content-wide">
<div id="java-version-chart-container" class="chart-container">
<%- include("java-versions.html.ejs") %>
</div>
<p class="chart-caption container">The number of requests made with each Java version, along with the percentage of the total requests that version accounted for.</p>
</div>
Far and away the most popular version was Java 8 update 51. I'm not certain, but I believe this may have been the version of Java that shipped with the Minecraft launcher. This chart is limited to only versions that account for 1% or more of the total requests, so it's not visible, but 6,056 of the requests (0.063%) were made with versions of Java that identify themselves as being OpenJDK, instead of the regular Oracle JDK/JRE. Additionally, a whole 37 requests (0.00039%) were made with versions of Java that included RedHat in the version string.
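Grouping by update number means pulling the version apart from the user agent string. A rough sketch of how that could be done is below, assuming the legacy `1.<major>.0_<update>` scheme seen in examples like `Java/1.8.0_51`; `parseJavaUserAgent` is a hypothetical helper, and vendor-specific variants of the string are not handled beyond ignoring any trailing suffix.

```js
// Sketch only: `parseJavaUserAgent` is a hypothetical helper. It assumes the
// legacy "Java/1.<major>.<minor>_<update>" form and returns null for anything else.
function parseJavaUserAgent(ua) {
    const match = /^Java\/1\.(\d+)\.\d+(?:_(\d+))?/.exec(ua);
    if (!match) return null;
    return {
        major: Number(match[1]),
        // The update number is optional in this pattern.
        update: match[2] === undefined ? null : Number(match[2]),
    };
}

console.log(parseJavaUserAgent("Java/1.8.0_51")); // { major: 8, update: 51 }
console.log(parseJavaUserAgent("Mozilla/5.0"));   // null
```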
Next, broken down by country. This isn't perfectly accurate, since IP addresses aren't terribly reliable for determining location. But at only country granularity, it's acceptable.
<div class="article-content-wide">
<figure>
<%- include("countries.svg") %>
<figcaption class="container">
Countries from which more requests originated are darker. Gray areas have no data. Hover over countries to see the exact number of requests from them.
<br>
This map is derived from a <a href="https://commons.wikimedia.org/wiki/File:BlankMap-World.svg" data-no-link-decoration>public domain map</a> on Wikimedia Commons.
</figcaption>
</figure>
</div>
I wasn't surprised that the US was the most active country, at over 3.3 million requests, but I was surprised by just how steep the drop-off is. The next most popular country was Germany, with just 900 thousand requests over the roughly three month period. Next come the UK, Canada, and France, each with 500k to 600k requests. After that, things drop off quickly. Surprisingly though, almost every country is accounted for, even if there were just a few requests^[Sadly, there were no requests from the continent of Antarctica].
---
It's been about a year since I first started on this. To my surprise, when I came back to it, the CloudFlare statistics were virtually unchanged. It still blocks about 75 thousand "bad browser" threats every day, though I can't easily get more detailed information because the API endpoint I used has since been removed.
I expected there to be a more significant drop off. After all, Minecraft 1.12 was released in June of 2017, close to four years ago. Overall, I don't really know what to make of these numbers. Based on this, the active player count seems to be enormous compared to what I would expect, having not touched these mods in years.