Here’s a stupid bit of x86 assembly code, using AT&T style:
divideBySeven: pushl %ebp movl %esp, %ebp movl 8(%ebp), %edx movl %edx, %eax sarl $31, %edx movl $7, %ebx idivl %ebx popl %ebp ret
What does this do? Well, the label should give you a pretty good idea. Converting it into C, we get:
int divideBySeven(int num) { return num / 7; }
Natch.
This is not a valid C program, but we can still compile it into assembly with gcc using the -S flag. Compiled with gcc 4.1.2, targeting i486-linux-gnu, we get something quite a bit different from the original assembly. I’ve commented the output shown below with a pseudo-C translation for the less assembly-inclined:
divideBySeven: pushl %ebp # this and the next two lines movl %esp, %ebp # create a temporary variable subl $4, %esp # that I'll just call temp movl 8(%ebp), %ecx # %ecx = num movl $-1840700269, -4(%ebp) # temp = -1840700269 movl -4(%ebp), %eax # %eax = temp imull %ecx # %edx:%eax = %ecx * %eax (see below) leal (%edx,%ecx), %eax # %eax = %edx + %ecx movl %eax, %edx # %edx = %eax sarl $2, %edx # %edx = %edx >> 2 movl %ecx, %eax # %eax = %ecx sarl $31, %eax # %eax = %eax >> 31 movl %edx, %ecx # %ecx = %edx subl %eax, %ecx # %ecx = %ecx - %eax movl %ecx, %eax # %eax = %ecx leave ret # return %eax
The proper reaction to this monstrosity is “wait what.” Some specific instructions that I think could use more explanation:
movl $-1840700269, -4(%ebp)
-1840700269 = -015555555555 in octal (indicated by the leading zero). I’ll be using the octal representation because it looks cooler.
imull %ecx
This multiplies %ecx
and %eax
. Both of these registers contain a 32-bit number, so this multiplication could possibly result in a 64-bit number. This can’t fit into one 32-bit register, so the result is split across two: the high 32 bits of the product get put into %edx
, and the low 32 get put into %eax
.
leal (%edx,%ecx), %eax
This adds %edx
and %ecx
and puts the result into %eax
. lea
‘s ostensible purpose is for address calculations, and it would be more clear to write this as two instructions: an add
and a mov
, but that would take two clock cycles to execute, whereas this takes just one.
Also note that this instruction uses the high 32 bits of the multiplication from the previous instruction (stored in %edx
) and then overwrites the low 32 bits in %eax
, so only the high bits from the multiplication are ever used.
sarl $2, %edx # %edx = %edx >> 2
Technically, whether or not sar
(arithmetic right shift) is equivalent to the >>
operator is implementation-defined. gcc guarantees that the operator is an arithmetic shift for signed numbers (“Signed `>>’ acts on negative numbers by sign extension”), and since I’ve already used gcc once, let’s just assume I’m using it for the rest of this post (because I am).
sarl $31, %eax
%eax
is a 32-bit register, so it’ll be operating on integers in the range [-231, 231 – 1]. This produces something interesting: this calculation only has two possible results. If the number is greater than or equal to 0, the shift will reduce the number to 0 no matter what. If the number is less than 0, the result will be -1.
Here’s a pretty direct rewrite of this assembly back into C, with some integer-width paranoia just to be on the safe side, since a few of these steps are dependent on integers being exactly 32 bits wide:
int32_t divideBySeven(int32_t num) { int32_t eax, ecx, edx, temp; // push %ebp / movl %esp, %ebp / subl $4, %esp ecx = num; // movl 8(%ebp), %ecx temp = -015555555555; // movl $-1840700269, -4(%ebp) eax = temp; // movl -4(%ebp), %eax // imull %ecx - int64_t casts to avoid overflow edx = ((int64_t)ecx * eax) >> 32; // high 32 bits eax = (int64_t)ecx * eax; // low 32 bits eax = edx + ecx; // leal (%edx,%ecx), %eax edx = eax; // movl %eax, %edx edx = edx >> 2; // sarl $2, %edx eax = ecx; // movl %ecx, %eax eax = eax >> 31; // sarl $31, %eax ecx = edx; // movl %edx, %ecx ecx = ecx - eax; // subl %eax, %ecx eax = ecx; // movl %ecx, %eax return eax; // leave / ret }
Now there’s clearly a whole bunch of inefficient stuff here: unnecessary local variables, a bunch of unnecessary variable swapping, and eax = (int64_t)ecx * eax1;
is not needed at all (I just included it for completion’s sake). So let’s clean that up a bit. This next listing just has the most of the cruft eliminated, with the corresponding assembly above each block:
int32_t divideBySeven(int32_t num) { // pushl %ebp // movl %esp, %ebp // subl $4, %esp // movl 8(%ebp), %ecx // movl $-1840700269, -4(%ebp) // movl -4(%ebp), %eax int32_t eax, edx; eax = -015555555555; // imull %ecx edx = ((int64_t)num * eax) >> 32; // leal (%edx,%ecx), %eax // movl %eax, %edx // sarl $2, %edx edx = edx + num; edx = edx >> 2; // movl %ecx, %eax // sarl $31, %eax eax = num >> 31; // movl %edx, %ecx // subl %eax, %ecx // movl %ecx, %eax // leave // ret eax = edx - eax; return eax; }
And the final version:
int32_t divideBySeven(int32_t num) { int32_t temp = ((int64_t)num * -015555555555) >> 32; temp = (temp + num) >> 2; return (temp - (num >> 31)); }
I still have yet to answer the obvious question, “why would they do that?” And the answer is, of course, speed. The integer division instruction used in the very first listing, idiv
, takes a whopping 43 clock cycles to execute. But the divisionless method that gcc produces has quite a few more instructions, so is it really faster overall? This is why we have the benchmark.
int main(int argc, char *argv[]) { int i = INT_MIN; do { divideBySeven(i); i++; } while (i != INT_MIN); return 0; }
Loop over every single possible integer? Sure! I ran the test five times for both implementations and timed it with time. The user CPU times for gcc were 45.9, 45.89, 45.9, 45.99, and 46.11 seconds, while the times for my assembly using the idiv
instruction were 62.34, 62.32, 62.44, 62.3, and 62.29 seconds, meaning the naive implementation ran about 36% slower on average. Yeow.
Compiler optimizations are a beautiful thing.
Post a Comment