AMD64 TLB invalidation performance

The AMD64 optimization manual specifies that the latency of the INVLPG instruction is 101 cycles in 32-bit mode and 80 cycles in 64-bit mode. Considering that the TLB is so closely tied to the MMU and the CPU, and that no accesses to RAM are needed, I wonder why it is so slow. Even more interesting is the large (about 20%) difference between 32- and 64-bit mode.
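
To put those two figures in perspective, here is a small sketch of the arithmetic. The cycle counts are the ones quoted from the manual; the 2 GHz clock is purely an assumption for illustration, and the helper names are made up for this example.

```python
# Latencies quoted from the AMD64 optimization manual, in cycles.
LATENCY_32BIT = 101
LATENCY_64BIT = 80

# Hypothetical core clock, chosen only to turn cycles into wall time.
ASSUMED_CLOCK_GHZ = 2.0


def relative_saving(slow, fast):
    """Return the fraction by which `fast` is shorter than `slow`."""
    return (slow - fast) / slow


def cycles_to_ns(cycles, clock_ghz):
    """Convert a cycle count to nanoseconds at the given clock (GHz)."""
    return cycles / clock_ghz


saving = relative_saving(LATENCY_32BIT, LATENCY_64BIT)
print(f"64-bit INVLPG takes {saving:.1%} fewer cycles than 32-bit")
print(f"32-bit: {cycles_to_ns(LATENCY_32BIT, ASSUMED_CLOCK_GHZ):.1f} ns")
print(f"64-bit: {cycles_to_ns(LATENCY_64BIT, ASSUMED_CLOCK_GHZ):.1f} ns")
```

At the assumed clock, a single invalidation costs tens of nanoseconds, which is in the same ballpark as a trip to main memory even though no RAM access is involved.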

How fast or slow is it on Intel CPUs? No idea. Their optimization manual gives instruction latencies for only a relatively small subset of instructions, and INVLPG is not among them.

