Estimating and Projecting Future Data-Use at a Basic Level


Recently I was in a game-development group, and the topic of storing “chat” messages came up. At one point, someone mentioned that “if we store them all to the database it will get huge”. Setting aside whether storing chat messages in the database “forever” is a good idea, what I want to discuss in this post is how we can figure out what “huge” actually is, and whether it will be a real cause for concern or not.

Poorly-justified data-store techniques

I have seen the same pattern over and over in my career: folks don’t understand what data they are storing, and as a result grossly under-estimate or, in most cases, over-estimate how much storage they will actually need. This is a boring topic, but an extremely useful one, especially for those of us who are “architects” and have to work out the technical details and make these decisions on a daily basis.

First and foremost: if you don’t have statistically-relevant data about the information you actually need to store, you cannot truly estimate or predict anything. That said, statistically-relevant data on what you want to store is not always required; sometimes a basic understanding of your users is enough.

A lot of the time I see folks make a blanket, unfounded assumption about how big their data will be, without any justified knowledge behind it. Today we’re going to talk about where domain-knowledge of the subject can come into play in deciding how we want to store our data.

Domain-knowledge is almost as good as statistical data

One of the highly underestimated advantages of domain-knowledge in a subject is how it can be leveraged to make decisions like “how much space would storing this data really take?” With a little knowledge of our users, we can start to predict the size and scope of what we need.

In this post, we’re going to talk about Microsoft SQL Server, and we’re going to use two examples to show how we can estimate space-requirements. We’ll also look at ways we can make predictions given the variables in our understanding.

Statically-sized data

We’re going to start with a simpler example that uses statically-sized data. That is: no matter what is stored, it always takes the same amount of space. This means we can make much more accurate predictions, and it also lets us see how other variables can have drastic effects on our results.

Imagine we’re building blogging software, and one of the features we want is to track views on all the posts. We’re trying to evaluate the potential storage requirements so we can determine whether we need to make any changes.

In our evaluation, we determine we need the following information:

  1. Post ID (integer);
  2. Viewed At (Date/Time, always UTC);
  3. Source IP (IPv6 or IPv4-to-6 map);
  4. User Agent ID (long, FK to User Agent table);

This is actually rather simple: knowing what our data is, we can now determine how much space is needed per row in our Microsoft SQL Server database.

  1. Post ID: 4 bytes;
  2. Viewed At: 8 bytes;
  3. Source IP: 16 bytes;
  4. User Agent ID: 8 bytes;

The total space we need is 36 bytes per row.
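To make that arithmetic concrete, here’s a minimal sketch in Python (the field names are hypothetical, and the byte counts mirror the list above; note this ignores SQL Server’s per-row overhead, such as the row header and null bitmap, so treat it as a floor rather than an exact figure):

    # Fixed-size fields for one blog-post-view row (sizes in bytes).
    VIEW_ROW_FIELDS = {
        "post_id": 4,        # int
        "viewed_at": 8,      # date/time, always UTC
        "source_ip": 16,     # IPv6 or IPv4-to-6 mapped address
        "user_agent_id": 8,  # long, FK to the User Agent table
    }

    VIEW_ROW_BYTES = sum(VIEW_ROW_FIELDS.values())
    print(VIEW_ROW_BYTES)  # 36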

Now we hit our first variable: how much space will be needed to store all our blog post views for, say, 1 year?

Without a statistical understanding of what our user-interactions are, we can’t truly predict how much space would be taken up storing all these views. However, knowing what we know, and having a rough idea of how many users we have, we can make some assumptions.

For example, according to Google Analytics, my blog (this blog) sees up to 500 views a month. (Small-time, I know.) With that, if we assume, at worst, we see 1000 views a month, and that all those views are stored in our database (remember: we’re only storing blog post views, not views of the homepage), we would know the following:

1000 views / month * 36 bytes / view = 36,000 bytes (36kB) / month

So, in a given year, we could calculate:

36kB / month * 12 months / year = 432kB / year

That is such a small amount of data it’s not even worth thinking about.

But, what if we had more traffic? Say, 1000 views a day?

Well, we again do the following:

1000 views / day * 36 bytes / view * 30 days / month * 12 months / year = 12,960,000 bytes (12.96MB)

Again, still nothing to even think about. But it begins to show how understanding our users and usage is critical to understanding our data-requirements. Having 1000 views a day vs. 1000 views a month is a 30x multiplier for our data needs.
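If you want to play with these numbers yourself, the projection fits in a couple of lines of Python (a sketch under the same assumptions: a 36-byte row, 30-day months, and decimal kilobytes/megabytes; the function name is mine):

    def yearly_view_bytes(views_per_month: float, row_bytes: int = 36) -> float:
        """Estimated bytes per year for the blog-view table."""
        return views_per_month * row_bytes * 12

    print(yearly_view_bytes(1_000))       # 432,000 bytes ~= 432kB / year
    print(yearly_view_bytes(30 * 1_000))  # 12,960,000 bytes ~= 12.96MB / year (1000 views/day)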

The video game chat example

We went through that first example to set up this one. In this next example, storing chat messages for a video game, we’re going to look at how much space would be required given different usage-criteria. In this particular community, the games built generally peak at 10-20 players and don’t see much use beyond that, so we can leverage that knowledge to inform our prediction.

Before we can predict, we need to know the “what”: we need all the fields we have to store for our chat messages. In this case, we have the following:

  • Message ID (long);
  • Source User ID (int);
  • Message Type (int: Global|Local|Guild|Group|User);
  • Target (nullable int);
  • Sent At (Date/Time, always UTC);
  • Message (string, UTF-16);

Given how Microsoft SQL Server stores data, we can predict that we will need the following space requirements:

  • Message ID: 8 bytes;
  • Source User ID: 4 bytes;
  • Message Type: 4 bytes;
  • Target: 5 bytes;
  • Sent At: 8 bytes;
  • Message: 2 bytes + 2 * chars bytes;

Our total is 31 bytes + 2 * message size bytes.
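As a sketch, the per-row cost is now a function of message length instead of a constant. In Python (the helper name is mine; the 31 bytes of fixed overhead come straight from the list above):

    CHAT_ROW_OVERHEAD_BYTES = 31  # 8 + 4 + 4 + 5 + 8 + 2 from the list above

    def chat_row_bytes(message_chars: int) -> int:
        """Estimated bytes for one chat-message row (UTF-16: 2 bytes per character)."""
        return CHAT_ROW_OVERHEAD_BYTES + 2 * message_chars

    print(chat_row_bytes(30))  # 91 bytes for an average 30-character message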

Here’s where things get interesting: we now have two variables:

  1. The volume of messages (which was the volume of views in the previous example);
  2. The size of a message;

Again, we’ll have to make some assumptions about our users. But let’s assume the following:

  1. We see 10-20 active users at a time;
  2. Of those 10-20, 10% play on their own and don’t send any chat messages;
  3. Of the remainder, each sends 100-200 messages a day;

If we wanted to low-ball our estimate, we would use the lower end of each range, but in our example I’m going to “worst case scenario” it, based on our “assumptions” about our users. (Our estimation is only as good as the assumptions, which is why statistical knowledge of our usage is so important.)

If we have 20 users, and 90% send 200 messages a day, we’re looking at 3600 messages a day or so. This gives us a good starting point, but now we have to determine how “big” messages are.

In my past experience, most messages in these types of games are short-and-sweet. Things like “where are you?” “kraken” “ok” “brb” “omg” “ok back”…quick interactions between users to do real-time planning.

With that known, let’s assume that messages are on-average 30 characters. This is a generally realistic estimate in my past experience, and puts us in a decent spot for data estimation:

(31 + 2 * 30) bytes / message * 3600 messages / day = 327,600 bytes (327.6kB) / day

Now we notice that this is starting to get bigger than our entire previous scenario. That’s because of what we’re doing: the volume is larger, and the per-item size is larger. But it’s still not bad. Bring it out to a 30-day month:

327.6kB / day * 30 days / month = 9828kB (9.828MB) / month

And lastly, bring that out to per-year:

9.828MB / month * 12 months / year = 117.936MB / year

We’re not even in the 1-gigabyte range yet.
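Putting both variables together, the whole worst-case projection is a few lines of Python (reusing the chat_row_bytes formula from the sketch above; every input below is one of the assumptions we just listed):

    # Worst-case assumptions about our users.
    active_users = 20
    chatty_fraction = 0.9             # 10% play alone and never chat
    messages_per_user_per_day = 200
    avg_message_chars = 30

    messages_per_day = active_users * chatty_fraction * messages_per_user_per_day
    row_bytes = 31 + 2 * avg_message_chars  # fixed overhead + UTF-16 payload

    daily_bytes = messages_per_day * row_bytes
    print(daily_bytes)            # 327,600 bytes ~= 327.6kB / day
    print(daily_bytes * 30)       # 9,828,000 bytes ~= 9.828MB / month
    print(daily_bytes * 30 * 12)  # 117,936,000 bytes ~= 117.936MB / year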

Why does this matter?

So here’s the deal: I want to explain why having even a rough guesstimate is important. Imagine you’re an independent game developer (in this example) on a budget, and the hosting platform you use has, say, a 5GB storage tier. Knowing this information about your data means you don’t need to worry about it blowing up to unreasonable proportions, at least not initially. It also means that if you set yourself up to prune messages older than, say, 30 days, you’re looking at tiny data-storage.
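To put that 5GB tier in perspective, here’s a quick back-of-the-envelope continuation of the sketch above (using decimal gigabytes, and assuming the worst-case chat volume never grows):

    tier_bytes = 5 * 1_000_000_000       # a hypothetical 5GB storage tier
    days_to_fill = tier_bytes / 327_600  # worst-case chat bytes per day
    print(days_to_fill / 365)            # ~41.8 years to fill the tier with chat alone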

