Hey all. I am currently developing a project that could potentially need to support a MySQL database which would be hit with millions of queries per minute.
Realistically, the load will likely never get that big, and if it did, I would first need to have a server capable of handling thousands.
Right now I have a shared host, which obviously won't cut it. I don't know that much about server management. I contacted a buddy of mine who once ran his own webhost out of a rack in Florida. I am waiting to hear back from him.
Every few seconds a user will be hitting the server with two queries and then updating a row. I can likely do some query caching to help alleviate the load, but the nature of the project means that most queries will be unique.
Will a dedicated server handle a few hundred users doing this at once? What about a few thousand? Obviously beyond that, if millions of folks are using it, I'd likely have to hire someone to manage the servers and run it ourselves, correct?
Or do companies exist which handle this kind of scaling from minimal use all the way up to millions of queries at once?
Posts
I see you already mentioned it, but you'll probably want to look into caching to greatly reduce the number of queries. Memcached is the standard on the back end for storing frequently used objects. It's high performance for what it is, and multiple servers can share the same cache, which they check before hitting the DB. On the front end you can use something like Varnish for caching entire rendered web pages, or pay a CDN like Akamai, which basically does this for you. I recommend taking a serious look to see how much you really can or can't cache; sometimes it makes a bigger difference than you'd expect.
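For what it's worth, the "check the cache before hitting the DB" pattern looks roughly like the sketch below. It's in Python just to keep it short, and it's a minimal illustration, not anything from your project: `FakeCache` is an in-memory stand-in for a Memcached client, and `fetch_user_from_db`, the `user:<id>` key format, and the 60-second TTL are all made up.

```python
import time


class FakeCache:
    """In-memory stand-in for a Memcached client (an assumption for this sketch)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry is past its TTL, so treat it as a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value, ttl):
        self._store[key] = (value, time.monotonic() + ttl)

    def delete(self, key):
        self._store.pop(key, None)


# Counts how often the (pretend) database is actually queried.
stats = {"db_hits": 0}


def fetch_user_from_db(user_id):
    """Hypothetical expensive query; here it only counts calls and returns a row."""
    stats["db_hits"] += 1
    return {"id": user_id, "name": f"user{user_id}"}


def get_user(cache, user_id, ttl=60):
    """Read path: try the cache first, fall back to the DB on a miss."""
    key = f"user:{user_id}"
    row = cache.get(key)
    if row is None:
        row = fetch_user_from_db(user_id)
        cache.set(key, row, ttl)
    return row


def update_user(cache, user_id):
    """Write path: after the (omitted) UPDATE, drop the stale cached copy."""
    cache.delete(f"user:{user_id}")
```

Calling `get_user` twice for the same id only touches the database once; calling `update_user` in between forces the next read back to the database. How much this buys you depends entirely on how often the same rows get read again, which is the thing to measure.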
I've never done anything on the scale of millions of queries per minute. My platform at work is currently running MySQL on Debian 5 in VMware with 6GB RAM and 4 CPUs. The actual hardware has dual Intel Xeon E5530 CPUs, 40GB RAM, and 15k rpm SAS drives, and is also running a few other VMs. The most I have seen it handle was around 1000 qps, but it wasn't breaking a sweat.
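To put the numbers from the first post next to that 1000 qps figure, here's a back-of-the-envelope calculation in Python. The three statements per cycle are the two queries plus the row update described in the first post; the 3-second cycle is only a guess at what "every few seconds" means, so plug in your own value.

```python
def statements_per_second(users, statements_per_cycle=3, seconds_per_cycle=3.0):
    """Rough steady-state load if every user repeats the cycle continuously."""
    return users * statements_per_cycle / seconds_per_cycle


# User counts taken from the question: a few hundred, a few thousand, lots.
for users in (300, 3000, 1_000_000):
    per_second = statements_per_second(users)
    print(f"{users:>9,} users -> {per_second:>12,.0f} statements/sec "
          f"({per_second * 60:,.0f} per minute)")
```

Under those assumptions the load in statements per second comes out equal to the number of active users, so a few hundred users sits below the 1000 qps mark and a few thousand sits a few times above it. This ignores caching, bursts, and how heavy each query is, so treat it as a starting estimate only.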
You've got a number of options for doing this. If you're sure it's not going to go crazy right away, you could run it all on a single dedicated server you've got somewhere. How much it can handle depends on the server's specs, what else it's doing, etc. You can get dedicated servers ranging from little 2GHz boxes with 512MB RAM on up through stuff similar to what I'm running my VMware machines on. You can also buy your own servers, rent rack space at a datacenter, and pay someone to manage them as you mentioned - if you don't need someone full time, you can pay a 3rd party company to manage the hardware, etc. You can also use something like EC2 or whatever Rackspace's cloud offering is, which limits you in server configurations but lets you very quickly add more servers as needed if things go really well for you.
How does Amazon work? They have a price per hour. Do they track your use and then charge you based on the hours the server was working on your shit? So in a day, you'd range anywhere from 0-24 depending on how hard the server was hit?
Well, almost all the interaction with the data will be reading or creating. A small subset will be updating, but even that may be eliminated or drastically limited. Reading or creating wouldn't require locks, correct? To be honest, I've never developed anything with this potential scale, so I have a bit to learn. Researching potential hosting solutions for the present and future is just one part. Also, I believe both Facebook and Twitter use MySQL. Though I suppose just because MySQL CAN do it doesn't mean it's the best choice.
"millions of requests per minute" is a pretty hard-to-believe statement from someone who doesn't know the basics of database administration.
The amount of juice you would need to support that requires significant amounts of money, even with a cloud based solution.
Looking into places like Amazon is a decent place to start, but really this entire line of questioning is moot. If you actually get that much traffic, not only will Amazon call you up and start asking questions, they are also going to ask for money.
I think you need to supply more information about what you are doing to get a sophisticated answer.
Well, the sophisticated answer isn't something I need right now. So far the info on this thread has been pretty helpful, giving me a place to start. Realistically, if I ever got a load beginning to get anywhere close to what I predict as the potential peak, I would need to hire someone who knows a shit ton more about this stuff than I do.
I mean, if the project had that kind of traffic, I'd likely both need to and be able to afford to pay people a lot smarter than me to redesign it from the ground up.
I just wanted to have some information when I began so I wasn't going in blind. And to find a good host that would scale well with me from "nobody is using it" to "holy shit, that's actually a good amount of requests!". Realistically it won't ever get to "OMG! Millions per second!" But it is possible in theory. And in that case I'd likely have hired folks long before that.
My experience so far has been designing PHP applications for a web design company and intranet systems for a regional business. I developed a few minor social networks, which never really got off the ground, when I worked with said web design company, both from scratch and using various open source applications. But I have never had to really think about how to handle a system that would have thousands of users making multiple requests every few seconds. I'm reading up on a lot and asking a lot of questions to help prepare me for that.
I realize this project is potentially way out of my league in regards to handling the load if it got massively popular. But I also realize it likely won't ever get there, so mainly it's a learning experience.
That's why cloud services like Amazon are popular, because you can make a junky PHP site, and if you grow from 100 users to 100,000 overnight, your site will not completely shit itself.
In general, your first line of defense when dealing with scale on the web is SOFTWARE FIRST, hardware second. Throwing more chips at a scale problem is the last answer.
Given that, a lot of your planning has to do with general topics in database design and administration. That's why what you are building is important.
How you query things, when you query things, why, and for who, are all important questions you have to ask when designing a system that may realistically get a gajillion users.
In the subject of large scale web apps, you will inevitably get into topics such as:
Horizontal segmentation
Database denormalization
Master/slave SQL servers
HTTP load balancers
and so on
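To make the first and third items a bit more concrete, here's a rough Python sketch of how an application might decide which database host to talk to: pick a shard from the user id (horizontal segmentation), then send writes to that shard's master and spread reads across its slaves. Everything in it is hypothetical, including the `ShardRouter` name, the host names, the two-shard layout, and hashing on user id; real setups vary a lot.

```python
import hashlib
import itertools

# Hypothetical layout: two shards, each with one master and some read slaves.
SHARDS = [
    {"master": "db0-master.example", "slaves": ["db0-slave1.example", "db0-slave2.example"]},
    {"master": "db1-master.example", "slaves": ["db1-slave1.example"]},
]


class ShardRouter:
    def __init__(self, shards):
        self._shards = shards
        # One round-robin iterator per shard to rotate reads across slaves.
        self._slave_cycles = [itertools.cycle(s["slaves"]) for s in shards]

    def shard_for(self, user_id):
        """Map a user id to a shard index with a stable hash."""
        digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self._shards)

    def host_for(self, user_id, write=False):
        """Writes go to the shard's master; reads rotate across its slaves."""
        index = self.shard_for(user_id)
        shard = self._shards[index]
        if write or not shard["slaves"]:
            return shard["master"]
        return next(self._slave_cycles[index])


router = ShardRouter(SHARDS)
print(router.host_for(42, write=True))   # master of user 42's shard
print(router.host_for(42))               # one of that shard's slaves
```

Two caveats with a sketch this simple: plain modulo hashing means adding a shard later reshuffles where most users live, and reads from a slave can lag slightly behind the master, so a user may not immediately see a row they just wrote.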
I would recommend trying to dig up some of the PowerPoint slides and lecture material from Cal Henderson, who is the lead engineer for Flickr and an old veteran of the PHP/MySQL era who knows about a lot of this stuff.
The general point is, the vast majority of database scale is NOT about hardware. That's why details about your project are relevant if you are serious about preparing for the future. You buy services like Amazon as insurance, not as a solution.
Is the PHP/MySQL era over?
Not being sarcastic or anything, just curious about what the current tech is? The school I am going to still teaches PHP and MySQL.
Cal Henderson actually has an O'Reilly book called "Building Scalable Web Sites" which goes over everything you need to know about making web apps that scale. I recommend the OP read it; it's a light read, only about 320 pages. First book I've read on this topic.
PHP/MySQL aren't going away any time soon.
Some people may say "Ruby on Rails", but Ruby on Rails runs slower than PHP and you can do anything in PHP that you could do in Rails. People just like Rails because it enforces MVC ideologies, but that's not exclusive to Rails (you can do that in PHP too).
That's not true anymore. That's all I meant.
Also we're talking about a language that only got namespacing like a year ago.
OP, depending on what type of site you're doing, you might want to look into NoSQL databases (CouchDB, MongoDB) as they scale horizontally very well (in comparison to traditional RDBMSs).
Also look into things like Memcache, eAccelerator, etc, if you're really worried about performance/optimisation.
Otherwise, I think the previous suggestions are good - get your software working first, optimise later (unless you KNOW you'll get x requests/min, i.e. you're rewriting an existing site).