Making really fast, efficient programs.

Silicon Dioxide · February 2009

So I'm doing an assignment where speed is a very critical issue. What general tips are there to speed up a program? Of course I will be coding in C++/C. I know that C is faster in some regards, but in what ways? Things I'm considering:
1) When possible, do iteration instead of recursion. In fact, should I just avoid function calls in general?
2) When possible, flatten out for loops by just copy and pasting the same code.
3) Should I use custom classes instead of the STL classes? For example, STL vector is a template, and I've heard that has overhead. Should I instead just use a pseudo C class where I have a data structure that's accessed by functions and all memory manipulation is done manually by pointers?

What other general tips are there? For this assignment, style is of no concern; all that matters is raw speed.

LoneIgadzra · February 2009

Do it first without worrying about speed, then go back and change things while making precise speed measurements.

You'll be surprised about what optimizations are actually worthless since the compiler is already doing them for you. Also, templates don't have any runtime overhead, just compilation. STL containers are extremely efficient. C++ is not inherently slower than C, as a large part of how bitchy it is was based on not forcing the programmer to accept any performance detriment from the language itself. I have heard that you may get a boost from disabling exceptions, and C++ exceptions are terrible anyway and there is no reason to use them. (Forget how to do this though.)

I would recommend the book "Code Complete" by Steve McConnell. It has several chapters confronting just these problems, and it will be extraordinarily helpful in other ways besides. Optimization is not my strong suit, and I'm not going to reproduce McConnell's chapter on "code tuning", but suffice to say there are other things to do that are more significant than just avoiding function calls.

An example: arrange branching control structures so that the most common condition is the one that is tested first.

The only thing you want to worry about during the initial writing is choosing the right containers and algorithms that have the best time complexity for your current problem. Shit like how many function calls you make just doesn't matter any more, and you can do that later once you've gotten the program working under sane conditions.

Doc · February 2009

A vector has to reallocate and copy memory when you increase its size, so avoid doing that if you can. "flattening out" loops won't help, if you are doing the same amount of work. Recursion vs. iteration is more a memory thing than speed. Function calls don't add anywhere near enough overhead to care about unless you go way overboard with them. For god's sake, the most efficient sort algorithms out there are all recursive.

You sound kind of new to software dev, I'd worry about getting it done at all first, then figuring out optimizations later.

ASimPerson · February 2009

If you really care about speed, then you need assembly.

Otherwise, as Lone said, write as you normally would and then figure out how to pare it down.

ecco the dolphin · February 2009

ASimPerson wrote: »

If you really care about speed, then you need assembly.

Haha, you're right - he could, but I think he might be shooting himself in the foot if he tried. Assuming he's writing for modern x86 platforms, writing code in assembly that outperforms compiler generated code is only half the battle.

The other half is knowing about the implementation of the x86 architecture itself (e.g. stuff like pipelines, how the CISC instruction set breaks down into its micro-operation (RISC?) equivalents and can be potentially executed out of order, branch predictions, cache prefetching - etc...) to fully realise which order he should shuffle registers around/make memory accesses, etc...

And even then, I've read it's a bit of a black art involving tons of profiling to see if reordering two instructions really does make a difference.

But yeah. To echo what was said:

LoneIgadzra wrote: »

Do it first without worrying about speed, then go back and change things while making precise speed measurements.

Write your programme with what you think are the best containers/algorithms.

Then get a profiler. Find out where you're spending most of your time, and why. Then optimise that part if it's not fast enough.

Why do you need it to be fast anyway? For example, is it a real time system?

theSquid · February 2009

Write your algorithm out, then analyse it with a profiler and also on paper with O notation.

Try to find the most efficient algorithms in most cases. I'm guessing they're making you sort stuff and prodding you to use Merge Sort.

The most important part as Doc said is to make a correct program, above all other things. Make it correct, and make it neat. Also, the STL algorithms are designed to be extremely efficient, so feel free to use them. Templates don't create overhead because all that business is dealt with at compile time.

The compiler does most of the petty optimisations you mentioned already if it deems it necessary. Focus on correctness and efficient *algorithms* (sorting, searching) and you'll be fine.

Meister · February 2009

Well, essentially the reason I need this program to be really fast is because the fastest program (in terms of average CPU time) gets extra credit. Everyone in the class already knows the algorithm and data structures to solve the problem (they're in our textbook). I guess technically there is more than one algorithm, but they all have the same asymptotic bound. Of course the first thing I'm going to check is which one of those algorithms actually produces the best results in practice.

In the end though, probably basically everyone is going to be using the exact same algorithm with the same data structures for the assignment, so clever optimizations will determine the victor. I don't want any hints specific to the problem, because that's sort of cheating, so don't tell me any problem specific tips if by some sort of voodoo magic you've figured out which problem I'm solving. Any general optimizations I can make/pointing me to good resources like that book are much appreciated, though.

Grobian · February 2009

Silicon Dioxide wrote: »

2) When possible, flatten out for loops by just copy and pasting the same code.

Your compiler should be able to do this by himself. Check the optimization settings.

Oh, I see theSquid already said that. His last paragraph is spot on.

Tossrock · February 2009

ASimPerson wrote: »

If you really care about speed, then you need assembly.

Otherwise, as Lone said, write as you normally would and then figure out how to pare it down.

Pshhh. If you really care about speed, you write a single-clock Verilog implementation and then get a board custom fabbed in a Taiwanese plant.

Really though, the compiler is probably a much better optimizer than you are, especially in the realm of stuff like loop unrolling. As has been mentioned, make sure you're using an efficient algorithm and you'll be okay.

Shadowrunner · February 2009

Optimizing code is a lot like performing a science experiment. You first have to make sure the code is correct, then run it with a profiler and see which parts are consuming the most CPU time. Do what you can to speed up the slowest part, retest for correctness, and run it through the profiler again. I wouldn't bother trying to optimize anything until you have hard data on which parts of the code are slow, otherwise, you'll just be fumbling around in the dark. Also, when you profile it, make sure you use the exact same input data each time. Differences in the input will make a big difference in the run time.

I've only ever used gprof to profile a program... I'm sure there's something similar for Windows, but I have no idea what kinds of tools are out there.

LoneIgadzra · February 2009

One of the basics of optimizing, by the way, once you get to that point, is that the shit that needs to go fast is going to be inside loops. By and large other things can be ignored unless it is an extremely large memory allocation or something.

Sebbie · February 2009

“Premature optimization is the root of all evil” – C.A.R. Hoare

Make it and then make it better

I don't think I can add anything that hasn't been mentioned already but I thought the quote was appropriate.

ASimPerson · February 2009

eeccccoo tthhee ddoollpphhiinn wrote: »

ASimPerson wrote: »

If you really care about speed, then you need assembly.

Haha, you're right - he could, but I think he might be shooting himself in the foot if he tried.

Yeah, that's sort of the point I meant to make. The only way to squeeze every last clock cycle out of your algorithm is to write hand optimized assembly, but the odds of completing the assignment on time and correctly would probably very low unless it's trivial.

Of course, there is one other option:

Tossrock wrote:

Pshhh. If you really care about speed, you write a single-clock Verilog implementation and then get a board custom fabbed in a Taiwanese plant.

lizard eats flies · February 2009

recursion is bad because every function call has to allocate space on the stack etc. Basically make sure, if possible, you dont allocate any memory inside a loop, as memory allocation is a speed killer. Avoid any sort of 'branch' statements when possible inside the loops, as branch statements kill the processor pipeline. There are a few clever ways of getting around if/then statements. You can play with the register keyword for variables. Inline functions that are called a lot.

Also the obvious one is making sure you compile with the highest optimization numbers. This will unroll the loops for you generally and do some really nice things.

TheGreat2nd · February 2009

I hate writing in assembly, with the fire of a thousand suns.
And yeah, compiling with the highest optimization possible.
Just make some tight code and the rest will work itself out.

Marty81 · February 2009

How is your prof is going to be able to decide whose algorithm runs fastest? Is he going to use really large inputs and a stopwatch? If so, keep an eye on how you do your memory allocation. Is he going to base it on how many boolean comparisons your function uses? If so, keep an eye on those. Etc.

Doc · February 2009

Yeah, basically allocate all your memory at once, if you can. That is, don't use vector insertion if you know that there is an upper bound on the number of elements that are going to be in the array - just allocate the maximum size, provided that it's a reasonable number.

CelestialBadger · February 2009

I've been working on programs where speed really matters for some years now. There are always all sorts of programmer rumours, such as whether counting backwards in a for loop is faster. But I have found that most of them are superstition that have no bearing on the actual speed of the program once the compiler has done its magic.

If I were you, I'd just write it, and if you have time at the end, profile it and optimise it. Don't sacrifice quality for speed, as a working but slow program will be marked much better than a blisteringly fast but buggy program.

Your questions:
1) Iteration is faster than recursion. But don't avoid function calls, as this will be a nightmare to program.
2) Nah. Sounds like a recipe for bugs, and compilers often magically do this sort of thing.
3) Nah. The STL is already optimised, probably better than you could do. Just find out which the faster containers are, and use them in preference over the slower.

BEAST! · February 2009

Do it all in Perl!!!!!
[/perl_troll]

I agree with people saying to write it first, benchmark it, and then try things out to see how it affects it the speed, you'll probably find that there won't be much you can do since the compiler has done a lot of it for you.

Chlupululuu · February 2009

do you have a good understanding of algorithm efficiency? Your first step should be to make sure that your algorithm does not run in cubic time or something, not worrying about loops.

DrFrylock · February 2009

Chlupululuu wrote: »

do you have a good understanding of algorithm efficiency? Your first step should be to make sure that your algorithm does not run in cubic time or something, not worrying about loops.

This is the best advice on the thread. All the crap about loop unrolling and getting rid of functions and iteration vs. recursion is stuff that was (mostly) true in the very early days of computing, but too much has changed since then. By buying into these memes, you're experiencing cargo-cult programming at its finest.

Some advice for you, not knowing anything about your domain or your algorithm:

Your biggest gains will come from implementing a better algorithm and using more appropriate data structures. Time to break out that copy of CLR (well, CLRS now).
Your second biggest gains will be from coarse-grained techniques like memoization, making memory/time tradeoffs, and so forth. For example, consider the problem of finding the square roots of one million integers whose values will be between 0 and 512. You can just call the sqrt() function a million times, or you can pre-calculate all the square roots betwen 0 and 512, put those in a table, and then use table lookup. I guarantee you the table lookup will be faster in that specific circumstance.
You can also get substantial gains from exploiting your hardware architecture. Algorithms that manipulate data in a local area rather than streaming through it multiple times are going to make better use of your CPU caches. Also, this is probably the point to talk about multicore: parallel algorithms that can take advantage of both of those cores in your laptop can run almost twice as fast as ones that are single-threaded, but parallelizing an algorithm is not necessarily easy and there are a number of subtle mistakes you can make that will burn up any theoretical speed gains from parallelization.
As others have said, in any effort, you need to establish and maintain a solid profiling strategy, to figure out if the changes you're making are working or not. It's nigh impossible for a single process to get exclusive access to a machine, so you need to run many trials with large data sets and average them to smooth out the noise. Also, don't do stupid shit like profiling while WinAMP is playing an MP3 in the background or your virus scanner is running, since those will both screw up your results more than most optimizations will. If you are Really Serious about profiling, you want to buy a tool like VTune or something like that.
The people who said that you should even consider assembly language are generally full of shit. This is more cargo-cult mentality. Up until the 486 processor, the processors you were using were likely in-order execution: each instruction got executed in order, as they came in. Then the Pentium did a little more smart scheduling: if it saw two instructions that didn't conflict with each other, it would execute both in parallel. Modern processors get their huge speed gains not through more megahertz, but through increasingly long pipelines and smarter execution scheduling. Unless your email address ends in @intel.com or @amd.com, I guarantee you that you have no fucking idea what is going on in your processor. You have no idea how the pipeline works, what's getting speculatively executed and what's not, etc. Guys who write compilers, on the other hand, still don't really understand it, but I guarantee they understand it about a hundred times better than you do.

There are a small number of cases where looking at the assembly level can help, and in these cases usually you want to take what an optimizing compiler already did and disassemble that first. But this is so rarely worth the effort that it's a step of last resort.
On that note, profile different compilers. Try GCC, ICC (Intel's compiler) and maybe the compilers from The Portland Group. Because these people really DO understand processor and cache architectures, the right compiler can get you an extra 0-20% of performance sometimes. Optimizing compilers can't fix everything, but trying to unroll your own loops and such can often actually result in the compiler's optimizer working less well, because it's looking for patterns it knows how to optimize and you're fighting it.
The only thing in the OP that is remotely worth considering is the stuff about the STL, and the STL really isn't the point here. The point is to know your libraries. Most highly-used libraries are optimized quite nicely, but allocating your vectors with enough space (if you know the space requirements in advance) and allocating your hashtables with the appropriate load factor can definitely help performance. Working on the heap a lot? Depending on your application, you may want to replace your malloc() with a slightly different one. Just remember to do this all in the context of a strict profiling regime.

zilo · February 2009

lol @ cargo-cult programming

It sounds like the edge you want is going to be in using STL (or whatever) in a smart way. Don't do stuff like resize vectors, use the right kind of container for the job, and work in fixed memory. Trying to outsmart the compiler is probably going to get you in trouble.

I guess if you're really trying to trim cycles you can do stuff like ++i instead of i++, but that's unlikely to get you any noticeable gains unless you're doing some serious number-crunching.

Smug Duckling · February 2009

BEAST! wrote: »

Do it all in Perl!!!!!
[/perl_troll]

No, that's a horrible idea. Perl is terribly slow.

Believe me, I tried to write something slightly computationally intensive in Perl and it was ridiculously slow. Switched it to Java and it was a reasonable speed.

yes I know you're just trolling

Meister · February 2009

Hmmm, what do you guys mean by profiling? Are there programs that do this sort of performance analysis for you? Specifically, are there free ones? I have access to Mac, PC, and Linux, but Mac is preferred.

Also, like I said, the algorithms and the optimal data structures are given to us. Of course the first priority will be getting the algorithm working correctly, however that looks to be pretty easy. I will not be able to improve on the asymptotic time.

Weretaco · February 2009

One of my instructors in school suggested just writing clean and readable code. The guys who write compilers are looking for things to optimize so if you do things in a simple way its easier for the compiler to handle optimizing itself. It's when you start doing nasty tricky things that it runs into problems.

DrFrylock · February 2009

Unsurprisingly, there's a whole Wikipedia article on profiling. gprof is almost universally available, but the interface sucks. Profilers like VTune can give you an incredible amount of data, but are not free. You can get an eval copy, though.

admanb · February 2009

Smug Duckling wrote: »

BEAST! wrote: »

Do it all in Perl!!!!!
[/perl_troll]

No, that's a horrible idea. Perl is terribly slow.

Believe me, I tried to write something slightly computationally intensive in Perl and it was ridiculously slow. Switched it to Java and it was a reasonable speed.
yes I know you're just trolling

And Java is not exactly the speed beast itself...

shadydentist · February 2009

Something about not being fully compiled will do that.

Also, assembly for a RISC processor like MIPS is hard enough. Add in crazy things like non-standard instruction lengths, and I wouldn't touch x86 assembly with a ten-foot pole.

ASimPerson · February 2009

DrFrylock wrote: »

The people who said that you should even consider assembly language are generally full of shit.

Okay, just to be clear about the assembly thing:

I was kidding. Mainly because of all the reasons DrFrylock stated.

Shadowrunner · February 2009

Meister wrote: »

Hmmm, what do you guys mean by profiling? Are there programs that do this sort of performance analysis for you? Specifically, are there free ones? I have access to Mac, PC, and Linux, but Mac is preferred.

A profiler is a tool that will run an executable and collect run-time statistics on how much raw cpu time was spent in each function, how many times each function was called, how much time was spent waiting on I/O, etc. It will collect more information than you know what to do with, really. You use the compiler to add extra instrumentation to your code at compile time, and use the profiler at run-time to collect the statistics. On Linux (or other Unix clones), the go-to tool is gprof (part of the gcc toolchain). Google around and you can find some information on how to use it.

There's definitely a learning curve, but if you're a CS student and plan on doing any sort of software development for a living, learning how to use a profiler, debugger, etc are really good skills to learn.

taeric · February 2009

Out of curiosity, what sort of program are you writing? Are we talking math problems, or massive amounts of information processing?

Monolithic_Dome · February 2009

What Frylock said.

EDIT: But to add something constructive, remember that your biggest efficiency gains will come from improving the slowest part of your program. Seek the bottleneck, improve it till something else is the bottleneck, and repeat.

Penny Arcade

Quick Links

Making really fast, efficient programs.

Posts