I am learning CuTe's thread-value (TV) layout, following Lei Mao's blog post: https://leimao.github.io/blog/CuTe-Thread-Value-Layout/
I want to figure out which thread reads which part of the data in a matrix.
Take SM80_16x8x16_F16F16F16F16_TN as an example. When you print this MMA atom, you get this info:

    MMA_Atom
      ThrID:      _32:_1
      Shape_MNK:  (_16,_8,_16)
      LayoutA_TV: ((_4,_8),(_2,_2,_2)):((_32,_1),(_16,_8,_128))
      LayoutB_TV: ((_4,_8),(_2,_2)):((_16,_1),(_8,_64))
      LayoutC_TV: ((_4,_8),(_2,_2)):((_32,_1),(_16,_8))

When you call print_latex(mma), you get this LaTeX picture:

Matrix A (16x16) is shown in the left part of the picture.
So I printed LayoutA_TV on its own using this code:

    auto s2xh4 = cute::make_layout(
        cute::make_shape (cute::make_shape (4, 8),  cute::make_shape (2, 2, 2)),
        cute::make_stride(cute::make_stride(32, 1), cute::make_stride(16, 8, 128)));
    cute::print_latex(s2xh4);

and I got this layout:
That means thread 0 reads these 8 values:
0, 16, 8, 24, 128, 144, 136, 152
If A is column-major in memory, these 8 indices map to coordinates as follows: (0,0), (0,1), (8,0), (8,1), (0,8), (0,9), (8,8), (8,9).
This is exactly what the MMA picture shows: T0V0 at (0,0), T0V1 at (0,1), ..., T0V4 at (0,8), T0V5 at (0,9), ...
But according to the definition of SM80_16x8x16_F16F16F16F16_TN:

    struct SM80_16x8x16_F16F16F16F16_TN
    {
      using DRegisters = uint32_t[2];
      using ARegisters = uint32_t[4];
      using BRegisters = uint32_t[2];
      using CRegisters = uint32_t[2];

      CUTE_HOST_DEVICE static void
      fma(uint32_t      & d0, uint32_t      & d1,
          uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
          uint32_t const& b0, uint32_t const& b1,
          uint32_t const& c0, uint32_t const& c1)
      {
    #if defined(CUTE_ARCH_MMA_SM80_ENABLED)
        asm volatile(
          "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
          "{%0, %1},"
          "{%2, %3, %4, %5},"
          "{%6, %7},"
          "{%8, %9};\n"
          : "=r"(d0), "=r"(d1)
          :  "r"(a0),  "r"(a1),  "r"(a2),  "r"(a3),
             "r"(b0),  "r"(b1),
             "r"(c0),  "r"(c1));
    #else
        CUTE_INVALID_CONTROL_PATH("Attempting to use SM80_16x8x16_F16F16F16F16_TN without CUTE_ARCH_MMA_SM80_ENABLED");
    #endif
      }
    };

Its PTX instruction is mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16.
This means matrix A is row-major.
If so, my column-major calculation above is totally wrong.
I know there must be something wrong with my understanding.
So, where does my understanding go wrong?

