Textual Pornography : The voodoo of gcc, Part I

Here’s a stupid bit of x86 assembly code, using AT&T style:

divideBySeven:
        pushl   %ebp
        movl    %esp, %ebp
        movl    8(%ebp), %edx
        movl    %edx, %eax
        sarl    $31, %edx
        movl    $7, %ebx
        idivl   %ebx
        popl    %ebp
        ret

What does this do? Well, the label should give you a pretty good idea. Converting it into C, we get:

int divideBySeven(int num) {
    return num / 7;
}

Natch.

This is not a valid C program, but we can still compile it into assembly with gcc using the -S flag. Compiled with gcc 4.1.2, targeting i486-linux-gnu, we get something quite a bit different from the original assembly. I’ve commented the output shown below with a pseudo-C translation for the less assembly-inclined:

divideBySeven:
        pushl   %ebp                   # this and the next two lines
        movl    %esp, %ebp             # create a temporary variable
        subl    $4, %esp               # that I'll just call temp
        movl    8(%ebp), %ecx          # %ecx = num
        movl    $-1840700269, -4(%ebp) # temp = -1840700269
        movl    -4(%ebp), %eax         # %eax = temp
        imull   %ecx                   # %edx:%eax = %ecx * %eax (see below)
        leal    (%edx,%ecx), %eax      # %eax = %edx + %ecx
        movl    %eax, %edx             # %edx = %eax
        sarl    $2, %edx               # %edx = %edx >> 2
        movl    %ecx, %eax             # %eax = %ecx
        sarl    $31, %eax              # %eax = %eax >> 31
        movl    %edx, %ecx             # %ecx = %edx
        subl    %eax, %ecx             # %ecx = %ecx - %eax
        movl    %ecx, %eax             # %eax = %ecx
        leave
        ret                            # return %eax

The proper reaction to this monstrosity is “wait what.” Some specific instructions that I think could use more explanation:

movl $-1840700269, -4(%ebp)

-1840700269 = -015555555555 in octal (indicated by the leading zero). I’ll be using the octal representation because it looks cooler.

imull %ecx

This multiplies %ecx and %eax. Both of these registers contain a 32-bit number, so this multiplication could possibly result in a 64-bit number. This can’t fit into one 32-bit register, so the result is split across two: the high 32 bits of the product get put into %edx, and the low 32 get put into %eax.

leal (%edx,%ecx), %eax

This adds %edx and %ecx and puts the result into %eax. lea‘s ostensible purpose is for address calculations, and it would be more clear to write this as two instructions: an add and a mov, but that would take two clock cycles to execute, whereas this takes just one.

Also note that this instruction uses the high 32 bits of the multiplication from the previous instruction (stored in %edx) and then overwrites the low 32 bits in %eax, so only the high bits from the multiplication are ever used.

sarl $2, %edx   # %edx = %edx >> 2

Technically, whether or not sar (arithmetic right shift) is equivalent to the >> operator is implementation-defined. gcc guarantees that the operator is an arithmetic shift for signed numbers (“Signed `>>’ acts on negative numbers by sign extension”), and since I’ve already used gcc once, let’s just assume I’m using it for the rest of this post (because I am).

sarl $31, %eax

%eax is a 32-bit register, so it’ll be operating on integers in the range [-2³¹, 2³¹ – 1]. This produces something interesting: this calculation only has two possible results. If the number is greater than or equal to 0, the shift will reduce the number to 0 no matter what. If the number is less than 0, the result will be -1.

Here’s a pretty direct rewrite of this assembly back into C, with some integer-width paranoia just to be on the safe side, since a few of these steps are dependent on integers being exactly 32 bits wide:

int32_t divideBySeven(int32_t num) {
    int32_t eax, ecx, edx, temp; // push %ebp / movl %esp, %ebp / subl $4, %esp
    ecx = num; // movl 8(%ebp), %ecx
    temp = -015555555555; // movl $-1840700269, -4(%ebp)
    eax = temp; // movl -4(%ebp), %eax

    // imull %ecx - int64_t casts to avoid overflow
    edx = ((int64_t)ecx * eax) >> 32; // high 32 bits
    eax = (int64_t)ecx * eax; // low 32 bits

    eax = edx + ecx; // leal (%edx,%ecx), %eax
    edx = eax; // movl %eax, %edx
    edx = edx >> 2; // sarl $2, %edx

    eax = ecx; // movl %ecx, %eax
    eax = eax >> 31; // sarl $31, %eax

    ecx = edx; // movl %edx, %ecx
    ecx = ecx - eax; // subl %eax, %ecx
    eax = ecx; // movl %ecx, %eax
    return eax; // leave / ret
}

Now there’s clearly a whole bunch of inefficient stuff here: unnecessary local variables, a bunch of unnecessary variable swapping, and eax = (int64_t)ecx * eax1; is not needed at all (I just included it for completion’s sake). So let’s clean that up a bit. This next listing just has the most of the cruft eliminated, with the corresponding assembly above each block:

int32_t divideBySeven(int32_t num) {
    // pushl %ebp
    // movl %esp, %ebp
    // subl $4, %esp
    // movl 8(%ebp), %ecx
    // movl $-1840700269, -4(%ebp)
    // movl -4(%ebp), %eax
    int32_t eax, edx;
    eax = -015555555555;

    // imull %ecx
    edx = ((int64_t)num * eax) >> 32;

    // leal (%edx,%ecx), %eax
    // movl %eax, %edx
    // sarl $2, %edx
    edx = edx + num;
    edx = edx >> 2;

    // movl %ecx, %eax
    // sarl $31, %eax
    eax = num >> 31;

    // movl %edx, %ecx
    // subl %eax, %ecx
    // movl %ecx, %eax
    // leave
    // ret
    eax = edx - eax;
    return eax;
}

And the final version:

int32_t divideBySeven(int32_t num) {
    int32_t temp = ((int64_t)num * -015555555555) >> 32;
    temp = (temp + num) >> 2;
    return (temp - (num >> 31));
}

I still have yet to answer the obvious question, “why would they do that?” And the answer is, of course, speed. The integer division instruction used in the very first listing, idiv, takes a whopping 43 clock cycles to execute. But the divisionless method that gcc produces has quite a few more instructions, so is it really faster overall? This is why we have the benchmark.

int main(int argc, char *argv[]) {
    int i = INT_MIN;
    do {
        divideBySeven(i);
        i++;
    } while (i != INT_MIN);

    return 0;
}

Loop over every single possible integer? Sure! I ran the test five times for both implementations and timed it with time. The user CPU times for gcc were 45.9, 45.89, 45.9, 45.99, and 46.11 seconds, while the times for my assembly using the idiv instruction were 62.34, 62.32, 62.44, 62.3, and 62.29 seconds, meaning the naive implementation ran about 36% slower on average. Yeow.

Compiler optimizations are a beautiful thing.

Textual Pornography

The voodoo of gcc, Part I

Post a Comment

Links

Meta

Categories

Archives