So I've done some small-scale database work for websites. Nothing of any significance; keeping to standard normalization rules hasn't raised my eyebrows any.
But I've always wondered how a serious database handles a large number of users and a lot of data. Magic: The Gathering Online, for example.
You have many thousands of users, each of whom may have tens of thousands of cards. So how is card ownership stored? It doesn't seem like a single relational table is a valid option here.
Am I correct in assuming they would make a database of tables, one for each user, and store cardID/quantity there? I was checking around, and it seems the maximum number of tables in a DB is high enough that even something as enormous as WoW would fall well below the cap. And if not, how do they do it?
Simplified, you would have a table of users, a table of cards, and a table that holds the relations between the two. Making a table for each user is the absolute wrong thing to do.
Yeah, you do this. It's a many-to-many relationship.
Basically, you work on the assumption that unless you're doing terabytes of data a second (like taking data from a linear particle accelerator or something), relational tables work fine.
What I do is make the underlying relational schema the actual logical arrangement, ignoring optimization worries. Then you can use materialized views (queries that store their results in cached tables) to get the speed you require. Obviously, that won't save you from everything, but it would work more than well enough for something like Magic. Genuinely real-time stuff, like WoW, might require a different thought process.
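For concreteness, here's a minimal sketch of that three-table arrangement plus a materialized-view-style summary. All table and column names are made up for illustration and are not MTGO's actual schema; the materialized view uses Oracle/PostgreSQL syntax, and plain MySQL would need an ordinary summary table you refresh yourself:

    -- Illustrative schema only; not MTGO's real tables.
    CREATE TABLE users (
        user_id   INT PRIMARY KEY,
        username  VARCHAR(64) NOT NULL
    );

    CREATE TABLE cards (
        card_id   INT PRIMARY KEY,
        card_name VARCHAR(128) NOT NULL,
        card_set  VARCHAR(32)  NOT NULL
    );

    -- The many-to-many "junction" table: one row per (user, card) pair owned.
    CREATE TABLE collection (
        user_id  INT NOT NULL REFERENCES users(user_id),
        card_id  INT NOT NULL REFERENCES cards(card_id),
        quantity INT NOT NULL DEFAULT 1,
        PRIMARY KEY (user_id, card_id)
    );

    -- "Everything this user owns" is then a plain join.
    SELECT c.card_name, col.quantity
      FROM collection col
      JOIN cards c ON c.card_id = col.card_id
     WHERE col.user_id = 42;

    -- Materialized-view-style summary, cached and reusable.
    CREATE MATERIALIZED VIEW collection_sizes AS
    SELECT user_id, SUM(quantity) AS total_cards
      FROM collection
     GROUP BY user_id;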
Hm. Simpler than I figured. I was looking at some stuff on Google, and some results had people talking about performance issues when the relational table gets overly large. A few million rows sounded like a lot to me, but then I've yet to cross 10k rows in any of mine.
Is there a rough number of rows one needs to start pondering scaling at?
Depends on the database, the hardware constraints, the data types, yada yada yada.
There's no hard and fast figure. You start thinking about optimization when you start noticing queries taking too long. I'm not a fan of people who tell you to start out with an optimized schema: it's a lot of extra work, it's less flexible than what you had before, and it might give you negligible gains. I don't like advice like that at all.
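If you're wondering how you actually spot a query that's taking too long beyond watching the clock, most databases will show you their query plan. A quick sketch, reusing the made-up collection table from above; EXPLAIN works in both MySQL and PostgreSQL, and PostgreSQL's EXPLAIN ANALYZE goes further and runs the query with real timings:

    -- Ask the planner how it intends to execute the query.
    -- A full scan over a huge junction table is the usual red flag;
    -- a lookup on the primary key or an index is what you want to see.
    EXPLAIN
    SELECT card_id, quantity
      FROM collection
     WHERE user_id = 42;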
You'd be surprised. I actually work at a particle accelerator... and they only write tens of megabytes every second.
The raw data coming off the accelerator is much higher, but it gets filtered down by banks of processing machines before the "interesting" data gets saved. And it never goes to a database anyway; it goes to tape.
More on topic, "serious" databases don't do anything mysterious. Once you've figured out how to normalize the data and do the whole many-to-many thing, you're doing the same thing the big guys are doing. The hard part is querying the data without causing bottlenecks, which means writing queries takes a deft hand.
Most serious installations will have some kind of replication going on as well... that is, there are several copies of the database online that can be queried individually. The database propagates changes around the cluster, ensuring things stay in sync.
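As a rough idea of what "several copies that stay in sync" can look like in practice, here's a sketch of pointing a MySQL replica at a primary server. Every hostname, credential, and log coordinate below is a placeholder, and other databases (Oracle, PostgreSQL, SQL Server) each have their own replication machinery:

    -- Run on the replica. The primary needs binary logging enabled
    -- (log-bin in my.cnf) and each machine needs a unique server-id.
    CHANGE MASTER TO
        MASTER_HOST     = 'db-primary.example.com',
        MASTER_USER     = 'repl',
        MASTER_PASSWORD = 'placeholder-password',
        MASTER_LOG_FILE = 'mysql-bin.000001',
        MASTER_LOG_POS  = 4;

    START SLAVE;

    -- Confirms the replica is applying changes and shows how far behind it is.
    SHOW SLAVE STATUS;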
I suppose I'm just sort of unsure about how big a relational table can be.
For example, using (probably) inflated user numbers, let's say MTGO had 100k accounts. Now let's say there were 20 sets of cards online and 300 cards per set, so 6,000 unique cards an account could have. Now let's say every account had one of every card.
That's 600 million rows. Nobody bats an eye at that kind of number? I guess the answer is always going to be "if your queries are fast, it's fine," but suppose it's choking on that kind of number. What would be the approach to scaling it? Would breaking the one table into a few tables, each holding a subset of users split alphabetically by account name, be a valid way?
Normalization is fine and all, but I'm mystified by handling gigantic numbers.
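Splitting one huge table into pieces keyed on the user is basically horizontal partitioning (or sharding, once the pieces live on separate servers), and modern databases can do a version of it for you. A sketch of a partitioned variant of the made-up collection table, using MySQL 5.1-style hash partitioning; hashing the user id spreads rows more evenly than an alphabetical split on account names would:

    -- One logical table, physically stored as 16 pieces chosen by a
    -- hash of user_id; a query that filters on user_id only touches
    -- the relevant piece.
    CREATE TABLE collection (
        user_id  INT NOT NULL,
        card_id  INT NOT NULL,
        quantity INT NOT NULL DEFAULT 1,
        PRIMARY KEY (user_id, card_id)
    )
    PARTITION BY HASH(user_id)
    PARTITIONS 16;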
600 million rows is a lot, but not uncommon. Look up some info on MySpace; they had some interviews floating around a couple of years ago detailing how their infrastructure worked. If I recall correctly, every single MySpace page is stored in a MySQL database somewhere, and they only had a couple of databases.
I've never worked on such a project myself, so I don't know what kind of considerations are required for the schema. But I do know that 600 million rows is manageable, though you gotta know your shit to pull it off.
At 600 million rows you really start to appreciate how slowly n lg n grows.
This is why companies like Oracle make megabucks on what is essentially a single product: they worry about this stuff. Indexing, stored queries, and the like can really bring query times down.
Like xzzy said, the bigger problem is user error on the query-writing side. The thing to remember is that you aren't searching 600 million rows every time. If you really did do something like that (as in, type "select * from BIG_TABLE") without a program that implicitly limits the row count, you'll be waiting a really long time. Similarly, if you do "select * from BIG_TABLE where NON_INDEXED_ROW = 'blah'", you've got to scan everything again.
It's easiest to think about it like a dictionary. A dictionary is as difficult a problem for a human as querying a 600-million-row table is for a computer. You don't read the dictionary from start to finish looking for the right word; it's been indexed for you alphabetically. The query engine benefits from the same thing.
[edit] Whoops, that should be lg n, not n lg n.
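The dictionary analogy is exactly what an index gives you, and the lg n point is why it scales: a binary search over 600 million sorted keys takes roughly 30 comparisons, and a B-tree index, with its large fanout, gets there in even fewer page reads. A sketch on the made-up tables from earlier; the card name is just an example value:

    -- Without an index on card_name, this is the "read the whole
    -- dictionary" case: the database scans every row of cards.
    SELECT card_id FROM cards WHERE card_name = 'Lightning Bolt';

    -- Build the alphabetical lookup once...
    CREATE INDEX idx_cards_name ON cards (card_name);

    -- ...and the same query now walks the index instead of the table.
    SELECT card_id FROM cards WHERE card_name = 'Lightning Bolt';

    -- The same trick answers "who owns this card?" on the big junction
    -- table without touching all 600 million rows.
    CREATE INDEX idx_collection_card ON collection (card_id);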