Tensor[[grid_t*grid_h*grid_w, 1176], bf16, cuda:0]
image_grid_thw: [[grid_t, grid_h, grid_w]] Tensor[[1, 3], u32, cuda:0]
PatchEmbed
2D_RoPE
WindowAttention
hidden_state: (grid_t*grid_h*grid_w/2/2, 2048)
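
The shape arithmetic above can be checked with a small sketch. It only does bookkeeping on shapes and touches no model or tensor library. The factorization 1176 = 3 * 2 * 14 * 14 (channels, temporal patch size, patch height, patch width) and the reading of "/2/2" as a 2x2 spatial merge are assumptions, not something the notes state; the constant and function names are hypothetical. The 1176 and 2048 widths are taken from the shapes listed above.

```python
# Shape bookkeeping only: no weights, no tensors.
# ASSUMPTION: 1176 factors as channels * temporal_patch_size * patch_size**2.
CHANNELS = 3
TEMPORAL_PATCH_SIZE = 2
PATCH_SIZE = 14
# ASSUMPTION: the "/2/2" in the output shape is a 2x2 merge over grid_h and grid_w.
SPATIAL_MERGE = 2
OUT_DIM = 2048  # width of hidden_state, as listed above


def patch_dim() -> int:
    """Flattened size of one input patch."""
    return CHANNELS * TEMPORAL_PATCH_SIZE * PATCH_SIZE * PATCH_SIZE


def input_shape(grid_t: int, grid_h: int, grid_w: int) -> tuple[int, int]:
    """Shape of the flattened patch tensor fed to PatchEmbed."""
    return (grid_t * grid_h * grid_w, patch_dim())


def output_shape(grid_t: int, grid_h: int, grid_w: int) -> tuple[int, int]:
    """Shape of hidden_state after the 2x2 reduction."""
    if grid_h % SPATIAL_MERGE or grid_w % SPATIAL_MERGE:
        raise ValueError("grid_h and grid_w must be divisible by the merge size")
    tokens = grid_t * grid_h * grid_w // (SPATIAL_MERGE * SPATIAL_MERGE)
    return (tokens, OUT_DIM)


if __name__ == "__main__":
    # Example grid: 1 temporal step, 16 x 24 spatial patches.
    print(input_shape(1, 16, 24))   # (384, 1176)
    print(output_shape(1, 16, 24))  # (96, 2048)
```

For a grid of [1, 16, 24] this gives 384 input rows of width 1176 and 96 output rows of width 2048, matching the grid_t*grid_h*grid_w and grid_t*grid_h*grid_w/2/2 expressions above.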