Bug Report: Potential Out-of-Bounds Accesses In MoE Kernels

Nov 1, 2025 by Admin

Hey everyone,

I've been diving deep into the CUDA kernels within the vllm-project, and I've stumbled upon some potential out-of-bounds access issues that I wanted to bring to your attention. Specifically, these issues seem to be lurking in the moe_wna16_gemm and moe_wna16_marlin_gemm functions. Let's break it down so we can squash these bugs together!

Current Environment

Before we get into the nitty-gritty, here’s the output of python collect_env.py to give you a snapshot of my setup:

Your output of `python collect_env.py` here

This should give you some context about the environment where these potential bugs were identified.

🐛 Describe the bug

During static analysis on the CUDA kernels, I've flagged several potential out-of-bounds accesses in both moe_wna16_gemm and moe_wna16_marlin_gemm. Let's dive into the specifics:

1. `moe_wna16_gemm`

This function seems to have a few spots where we might be stepping outside the bounds of our memory. Let's take a closer look at each one.

(1) `expert_ids[blockIdx.x]`

Location: https://github.com/vllm-project/vllm/blob/a00d6254e998be472d8df9dc590784d6facf8d85/csrc/moe/moe_wna16.cu#L39-L42
Issue: The access expert_ids[blockIdx.x] can read out of bounds when blockIdx.x is greater than or equal to the number of elements in expert_ids. Unless the block index is checked against the array length, the read lands in memory the kernel does not own, which is undefined behavior and can crash the application or silently produce incorrect results.
Example Scenario:
```
blockIdx.x: 4 
expert_ids.shape: [4] 
BLOCK_SIZE_M: 16 
top_k: 4 
batch_size: 2 
seq_len: 1 
sorted_token_ids.shape: [128]
```
In this scenario, if blockIdx.x is 4 while expert_ids only has a shape of [4], we're in trouble because we're trying to access an element that doesn't exist. This is a classic case of an out-of-bounds access, which can lead to crashes or unpredictable behavior. We need to ensure that blockIdx.x is always less than the size of expert_ids to prevent this issue.
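Here's a minimal sketch of the kind of guard I have in mind, written as host-side C++ so it can be compiled and checked without a GPU. It is not the kernel's actual code: the helper name `expert_id_or_skip` and the `-1` return value are hypothetical, and in the kernel the equivalent would be an early return placed before the read.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns the expert id for a block, or -1 when the
// block index falls outside expert_ids (the caller would skip that block).
inline int32_t expert_id_or_skip(const std::vector<int32_t>& expert_ids,
                                 std::size_t block_idx_x) {
  if (block_idx_x >= expert_ids.size()) {
    return -1;  // out of range: never read expert_ids[block_idx_x]
  }
  return expert_ids[block_idx_x];
}
```

With four entries in expert_ids, block index 3 returns the stored id, while block index 4 (the scenario above) is rejected instead of being read.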

(2) `sorted_token_ids[offset_m]`

Location: https://github.com/vllm-project/vllm/blob/a00d6254e998be472d8df9dc590784d6facf8d85/csrc/moe/moe_wna16.cu#L50-L52
Issue: Here, sorted_token_ids[offset_m] may also cause an out-of-bounds access. The index is calculated as blockIdx.x * BLOCK_SIZE_M + m, and it goes out of bounds whenever that value is greater than or equal to the length of sorted_token_ids. This is most likely in the last block when the array length is not a multiple of BLOCK_SIZE_M. The read then touches memory the kernel does not own, which can crash or produce undefined behavior.
Example Scenario:
```
sorted_token_ids.shape: [17] 
blockIdx.x: 1 
BLOCK_SIZE_M: 16 
m: 1
```
With sorted_token_ids having a shape of [17], blockIdx.x being 1, BLOCK_SIZE_M as 16, and m as 1, the index becomes 1 * 16 + 1 = 17. Boom! Out-of-bounds. Accessing sorted_token_ids[17] when the valid range is 0-16 will cause issues. It's super important to keep those indices in check, ensuring they don't stray beyond the array's boundaries.
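The arithmetic above can be checked mechanically. This is a small host-side C++ sketch, not code from the kernel: `offset_m` mirrors the index formula from the report, and `offset_m_in_bounds` is a hypothetical guard name.

```cpp
#include <cstdint>

// Index used to read sorted_token_ids, as described in the report.
constexpr int64_t offset_m(int64_t block_idx_x, int64_t block_size_m,
                           int64_t m) {
  return block_idx_x * block_size_m + m;
}

// Hypothetical guard: true when the read stays inside an array of
// num_sorted_token_ids elements.
constexpr bool offset_m_in_bounds(int64_t block_idx_x, int64_t block_size_m,
                                  int64_t m, int64_t num_sorted_token_ids) {
  return offset_m(block_idx_x, block_size_m, m) < num_sorted_token_ids;
}

// Scenario from the report: shape [17], blockIdx.x = 1, BLOCK_SIZE_M = 16, m = 1.
static_assert(offset_m(1, 16, 1) == 17, "index is one past the last element");
static_assert(!offset_m_in_bounds(1, 16, 1, 17), "read would be out of bounds");
static_assert(offset_m_in_bounds(1, 16, 0, 17), "m = 0 is the last valid read");
```

In the kernel, a check like this would sit inside the loop over m, so that only the trailing rows of the last block are skipped.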

(3) `reinterpret_cast<const float*>(expert_scales)[scales_offset_tmp]`

Location: https://github.com/vllm-project/vllm/blob/a00d6254e998be472d8df9dc590784d6facf8d85/csrc/moe/moe_wna16.cu#L112-L116
Issue: The expression reinterpret_cast<const float*>(expert_scales)[scales_offset_tmp] might result in an out-of-bounds access. Two values determine the address: the base pointer expert_scales, which depends on the expert id through expert_offset, and scales_offset_tmp, which is derived from offset_n, offset_k, size_k, group_size, and GROUPS. If either pushes the address past the end of scales, the kernel reads outside the allocation, leading to potential crashes or incorrect results. The relevant definitions and the sizes used in the scenario are:
```
expert_scales = scales + expert_offset / group_size;
scales_offset_tmp = (offset_n * size_k + offset_k) / group_size / GROUPS;
scales.shape: [60, 2816, 16]
GROUPS=2
group_size=128
size_n=2816
size_k=2048
```
Example Scenario:
```
blockIdx.x: 0 
blockIdx.y: 21
blockIdx.z: 0
threadIdx.x: 0
BLOCK_SIZE_N: 128
BLOCK_SIZE_K: 256
expert_ids[0]: 60
```
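In this scenario expert_ids[0] is 60, but scales has only 60 entries along its first dimension, so the valid expert indices are 0 to 59. Assuming expert_offset grows linearly with the expert id, expert_scales already points at the end of scales, and any read through scales_offset_tmp is out of bounds. Validating both the expert id and scales_offset_tmp is key to preventing memory errors.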
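To make the numbers concrete, here's a host-side C++ sketch that reproduces the offset arithmetic for this scenario. It rests on several assumptions that are not stated in the report: expert_offset = expert_id * size_n * size_k, scales holds 16-bit elements (so one float read covers GROUPS = 2 of them), offset_n = blockIdx.y * BLOCK_SIZE_N + threadIdx.x = 2688, and offset_k = blockIdx.z * BLOCK_SIZE_K = 0. The struct and function names are made up for illustration.

```cpp
#include <cstdint>

// Sizes from the report; the struct itself is hypothetical.
struct ScalesCase {
  int64_t num_experts, size_n, size_k, group_size, groups;
};

// Number of 16-bit elements in scales: [num_experts, size_n, size_k / group_size].
constexpr int64_t scales_numel(const ScalesCase& c) {
  return c.num_experts * c.size_n * (c.size_k / c.group_size);
}

// Position (in 16-bit elements from the start of scales) of the first element
// touched by the wide read.
// ASSUMPTION: expert_offset = expert_id * size_n * size_k, and each wide read
// covers `groups` consecutive 16-bit scales.
constexpr int64_t first_scale_read(const ScalesCase& c, int64_t expert_id,
                                   int64_t offset_n, int64_t offset_k) {
  return expert_id * c.size_n * c.size_k / c.group_size +
         ((offset_n * c.size_k + offset_k) / c.group_size / c.groups) * c.groups;
}

// True when the whole wide read stays inside scales.
constexpr bool scale_read_in_bounds(const ScalesCase& c, int64_t expert_id,
                                    int64_t offset_n, int64_t offset_k) {
  return first_scale_read(c, expert_id, offset_n, offset_k) + c.groups <=
         scales_numel(c);
}
```

Under these assumptions, expert id 60 with offset_n = 2688 starts reading at element 2,746,368 of an array with 2,703,360 elements, while the same offsets with expert id 59 stay in bounds. That points at the expert id, not scales_offset_tmp, as the culprit here.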

(4) `topk_weights[token_index]`

Location: https://github.com/vllm-project/vllm/blob/a00d6254e998be472d8df9dc590784d6facf8d85/csrc/moe/moe_wna16.cu#L210-L214
Issue: There is a potential out-of-bounds access with topk_weights[token_index]. The shape of topk_weights is [seq_len*batch_size, 4], and top_k is 8, so token_index must stay below seq_len * batch_size * 4 for the read to be valid.
Example Scenario:
```
blockIdx.x: 0
BLOCK_SIZE_M: 16
m: 1
sorted_token_ids[1]: 16 
batch_size: 2
seq_len: 2
```
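In this example, topk_weights holds seq_len * batch_size * 4 = 2 * 2 * 4 = 16 elements, so the valid indices are 0 to 15. token_index is sorted_token_ids[0 * 16 + 1] = 16, which is one past the end, so the read is out of bounds. Proper index validation is vital.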

(5) `output[token_index * size_n + offset_n]`

Location: https://github.com/vllm-project/vllm/blob/a00d6254e998be472d8df9dc590784d6facf8d85/csrc/moe/moe_wna16.cu#L216-L217
Issue: Accessing output[token_index * size_n + offset_n] might lead to memory access violations. If the computed index is greater than or equal to the number of elements in output, the kernel writes to memory it does not own, which can corrupt other data or crash. Since token_index comes from sorted_token_ids, it needs to be validated against the number of rows in output before the write.
Example Scenario:
```
blockIdx.x: 0
blockIdx.y: 0
threadIdx.x: 0
m: 0
batch_size: 2
seq_len: 9
sorted_token_ids[0]: 73
```
Here, output has a shape of [seq_len * batch_size, 4, 2816], and size_n is 2816. The index calculation is token_index * size_n + offset_n, where token_index is sorted_token_ids[blockIdx.x * BLOCK_SIZE_M + m]. If this computed index exceeds the valid range for output, we'll have an out-of-bounds write, which is not good.
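Plugging the scenario's numbers into that formula shows how far past the end the write lands. This is a host-side C++ sketch with made-up function names, and it assumes offset_n is 0 for blockIdx.y = 0 and threadIdx.x = 0 (the report does not spell out how offset_n is derived).

```cpp
#include <cstdint>

// Number of elements in output, given shape
// [seq_len * batch_size, top_k_dim, size_n].
constexpr int64_t output_numel(int64_t batch_size, int64_t seq_len,
                               int64_t top_k_dim, int64_t size_n) {
  return seq_len * batch_size * top_k_dim * size_n;
}

// Flat index written by the kernel, per the report.
constexpr int64_t output_index(int64_t token_index, int64_t size_n,
                               int64_t offset_n) {
  return token_index * size_n + offset_n;
}

// Scenario: batch_size = 2, seq_len = 9, shape [18, 4, 2816], token_index = 73.
static_assert(output_numel(2, 9, 4, 2816) == 202752, "72 rows of 2816");
static_assert(output_index(73, 2816, 0) == 205568, "row 73 starts here");
static_assert(output_index(73, 2816, 0) >= output_numel(2, 9, 4, 2816),
              "write would be out of bounds");
```

Because output has only 18 * 4 = 72 rows, any token_index of 72 or more is out of bounds regardless of offset_n.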

2. `moe_wna16_marlin_gemm`

Now, let's shift our focus to moe_wna16_marlin_gemm and see what potential issues we can uncover.

Location: https://github.com/vllm-project/vllm/blob/a00d6254e998be472d8df9dc590784d6facf8d85/csrc/moe/marlin_moe_wna16/marlin_template.h#L526-L528
Issue: The access expert_ids_ptr[block_id] may cause an out-of-bounds access. The index block_id is calculated based on slice_col_par and n_tiles, and the shape of expert_ids is [seq_len*batch_size+112]. If block_id is greater than or equal to that length, the kernel reads memory it does not own, which can crash the application or produce incorrect results. block_id should be validated against the size of expert_ids before the read.
Example Scenario:
```
seq_len: 2
batch_size: 1
blockIdx.x: 313
sms: 80
num_tokens_past_padded_ptr[0]: 928
```
With these parameters, expert_ids has seq_len * batch_size + 112 = 2 * 1 + 112 = 114 entries (valid indices 0 to 113), and the block_id computed for this block might exceed that range, leading to a crash. Double-checking index calculations is super important.

Before submitting a new issue...

[x] I made sure I searched for relevant issues, and I even chatted with the chatbot on the documentation page. It's always good to cover your bases!

I hope this detailed breakdown helps in addressing these potential issues. Let me know your thoughts, and let's work together to make vllm even more robust! 🚀
