AVX2/AVX-512 Inefficient Codegen For Varying Popcnt

Let's dig into the inefficient code generation observed for the popcnt function on varying integer types (int and int64) with AVX2 and AVX-512 targets. The scalar fallback can be a real bottleneck, and fixing the codegen promises significant performance gains.

The Issue: Scalar Operations Instead of Vectorized Power

The core problem lies in how the compiler handles popcnt on varying int and int64 values. Instead of leveraging the vectorized instructions available on AVX2 and AVX-512, it extracts each lane individually, runs a scalar popcnt instruction on it, and inserts the result back into the vector. This per-lane approach is far from ideal when much faster vectorized alternatives are available.
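Conceptually, the fallback the compiler currently emits behaves like the following C sketch (hypothetical, for illustration only): each lane is pulled out of the vector, counted with a scalar instruction, and stored back.

```c
#include <stdint.h>

/* Sketch of the per-lane scalar fallback: each of the four 64-bit
 * lanes is extracted, counted individually, and written back. A
 * vectorized implementation would count all lanes at once. */
void popcnt_lanes_scalar(const uint64_t in[4], uint64_t out[4]) {
    for (int i = 0; i < 4; i++) {
        /* __builtin_popcountll maps to a single scalar POPCNT
         * instruction on x86-64 when the CPU supports it. */
        out[i] = (uint64_t)__builtin_popcountll(in[i]);
    }
}
```

With a 4-wide gang this costs four extract/popcnt/insert round-trips per call instead of one vector operation.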

Because popcnt tends to sit in hot loops, this scalar fallback penalizes every call site that counts bits on varying data, so it is worth fixing promptly.

AVX2: Unleashing the Power of Bitwise Operations

For AVX2 and later instruction sets, there are more efficient ways to compute population counts. Specifically, a bytewise pshufb (byte shuffle) nibble lookup combined with psadbw (sum of absolute differences) for the horizontal add can speed things up considerably. Both LLVM (via the @llvm.ctpop intrinsic) and GCC lower vector popcounts this way.

These implementations use bit-manipulation tricks to count the set bits of all lanes in parallel, which is much faster than processing each element individually.
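The pshufb-based algorithm splits each byte into two nibbles and uses a 16-entry table as the shuffle source; psadbw then sums the per-byte counts horizontally. A scalar C model of the same idea (illustrative only, not the actual LLVM/GCC code):

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar model of the pshufb nibble-LUT popcount: the 16-entry table
 * plays the role of the PSHUFB source register, and the running sum
 * stands in for PSADBW's horizontal add. */
static const uint8_t nibble_popcnt[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

uint64_t popcnt_lut(const uint8_t *bytes, size_t n) {
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += nibble_popcnt[bytes[i] & 0x0F];   /* low nibble  */
        total += nibble_popcnt[bytes[i] >> 4];     /* high nibble */
    }
    return total;
}
```

In the real vector version, a single pshufb performs 32 of these table lookups at once on AVX2 (one per byte of the register).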

On wide vectors, the parallel approach can be several times faster than the per-lane scalar sequence, so it is well worth adopting here.

AVX-512 (Ice Lake+): The VPOPCNTDQ Advantage

When we move to AVX-512 on Ice Lake+ processors (which include the VPOPCNTDQ instruction), the solution becomes even more straightforward. We can directly use the vpopcntd and vpopcntq instructions, specifically designed for vectorized population counting of 32-bit and 64-bit integers, respectively.

This approach bypasses manual bitwise tricks entirely and leverages the hardware's dedicated instructions, resulting in optimal performance.
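A compile-time-guarded sketch of what this looks like with intrinsics (the `_mm512_popcnt_epi64` path requires AVX-512 VPOPCNTDQ; the fallback has identical semantics on any target):

```c
#include <stdint.h>
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Counts set bits in each 64-bit lane of an 8-lane vector. */
void popcnt_epi64(const uint64_t in[8], uint64_t out[8]) {
#if defined(__AVX512VPOPCNTDQ__) && defined(__AVX512F__)
    /* One vpopcntq instruction counts all eight 64-bit lanes. */
    __m512i v = _mm512_loadu_si512((const void *)in);
    _mm512_storeu_si512((void *)out, _mm512_popcnt_epi64(v));
#else
    /* Portable fallback with identical per-lane semantics. */
    for (int lane = 0; lane < 8; lane++)
        out[lane] = (uint64_t)__builtin_popcountll(in[lane]);
#endif
}
```

The 32-bit variant is the same shape with `_mm512_popcnt_epi32` (vpopcntd).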

On targets that expose VPOPCNTDQ, codegen should therefore prefer these instructions unconditionally.

The Proposed Fix: Embrace the LLVM Intrinsic

The suggested solution is to use LLVM's ctpop intrinsic, mirroring how count_trailing_zeros is already handled. The current implementation in the ISPC standard library (stdlib.ispc) uses scalar loops, which is where the inefficiency stems from.

Switching to the ctpop intrinsic would likely resolve the issue and let LLVM generate the appropriate vectorized code for each target architecture, leaving the compiler to do what it does best: optimize.
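To see the intrinsic route in action outside ISPC: Clang lowers `__builtin_popcount` to `@llvm.ctpop`, so a plain loop like the one below can be auto-vectorized, with LLVM selecting pshufb+psadbw on AVX2 and vpopcntd on VPOPCNTDQ-capable targets (a minimal sketch, not the ISPC stdlib code):

```c
#include <stdint.h>
#include <stddef.h>

/* Clang turns __builtin_popcount into the @llvm.ctpop intrinsic,
 * letting the auto-vectorizer emit the best per-target sequence
 * instead of a per-element scalar popcnt. */
void popcnt_array(const uint32_t *in, uint32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = (uint32_t)__builtin_popcount(in[i]);
}
```

This is the same division of labor being proposed for stdlib.ispc: express the intent with ctpop and let the backend choose the instructions.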

However, a question arises: should LLVM be smart enough to recognize this pattern during an optimization pass and automatically optimize it? Or is there a specific reason why the scalar loops are currently employed in the standard library? These are important considerations as we move forward.

Broader Implications: Are Other Targets Affected?

While the focus here is on AVX2 and AVX-512, it's worth investigating whether other architectures, such as NEON, are similarly impacted. It's possible that the inefficient codegen for popcnt extends beyond these specific instruction sets. A thorough examination of different targets is crucial to ensure optimal performance across the board.

Additional Inefficiencies: A Glimpse into Further Optimizations

During the investigation using Compiler Explorer (Godbolt link), some other inefficiencies in the generated code were spotted. For instance, the standard library's popcnt(varying int64) function returns a varying int instead of a varying int64. This leads to extra instructions to convert from 4x64 down to 4x32, which might be unnecessary.

It's worth exploring whether simply casting the return value back to a wider width would allow the compiler to optimize this conversion away. Small tweaks like these can often have a noticeable impact on performance.
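The width question can be sketched in plain C: a 64-bit popcount fits in 7 bits, so returning a 64-bit value loses nothing, while a 32-bit return forces vector code to pack 4x64 lanes down to 4x32 (hypothetical illustration of the signature tweak, not the actual stdlib change):

```c
#include <stdint.h>

/* Returning the count at the input's width avoids the 4x64 -> 4x32
 * narrowing that a 32-bit return type forces on vectorized callers.
 * The count of a 64-bit value is at most 64, so nothing is lost. */
int64_t popcnt64_wide(uint64_t x) {
    return (int64_t)__builtin_popcountll(x);
}
```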

These are the inefficiencies identified so far; more may surface as the generated code is examined further.

Conclusion: A Call to Action for Optimization

In conclusion, the inefficient code generation for popcnt on varying integer types on AVX2 and AVX-512 targets presents a clear optimization opportunity. By leveraging vectorized instructions and LLVM intrinsics, performance can improve substantially, so these issues are well worth addressing.

This issue also highlights the importance of understanding how compilers translate high-level code into machine instructions. By identifying and addressing such inefficiencies, we can unlock the full potential of modern hardware.