What exactly is the layout of CuTe's thread-value layout?


I am learning about CuTe's thread-value (TV) layout by following Lei Mao's blog post: https://leimao.github.io/blog/CuTe-Thread-Value-Layout/

I want to figure out which thread reads which part of the data in a matrix.

Take SM80_16x8x16_F16F16F16F16_TN as an example. When you print this MMA atom, you get the following:

MMA_Atom
  ThrID:      _32:_1
  Shape_MNK:  (_16,_8,_16)
  LayoutA_TV: ((_4,_8),(_2,_2,_2)):((_32,_1),(_16,_8,_128))
  LayoutB_TV: ((_4,_8),(_2,_2)):((_16,_1),(_8,_64))
  LayoutC_TV: ((_4,_8),(_2,_2)):((_32,_1),(_16,_8))

When you call print_latex(mma), you get this LaTeX picture:

Matrix A (16x16) is displayed on the left.

So I printed LayoutA_TV on its own using this code:

    auto s2xh4 = cute::make_layout(
        cute::make_shape (cute::make_shape (4, 8),  cute::make_shape (2, 2, 2)),
        cute::make_stride(cute::make_stride(32, 1), cute::make_stride(16, 8, 128)));
    cute::print_latex(s2xh4);

and I got this layout:

That means thread 0 reads these 8 values:

0,16,8,24,128,144,136,152
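The eight values above can be reproduced without CuTe by evaluating the layout by hand. This is a minimal sketch, assuming flat thread and value indices are unpacked with the leftmost mode varying fastest (which is how CuTe linearizes coordinates, as far as I understand it). The names `layout_a_tv` and `thread_offsets` are hypothetical helpers, not part of CuTe.

```cpp
#include <array>
#include <cassert>

// Offset produced by LayoutA_TV = ((4,8),(2,2,2)):((32,1),(16,8,128))
// for a flat thread index (0..31) and a flat value index (0..7).
// Assumption: flat indices are split with the leftmost mode fastest.
inline int layout_a_tv(int thread, int value) {
    assert(thread >= 0 && thread < 32 && value >= 0 && value < 8);
    int t0 = thread % 4;        // thread mode, shape (4,8)
    int t1 = thread / 4;
    int v0 = value % 2;         // value mode, shape (2,2,2)
    int v1 = (value / 2) % 2;
    int v2 = value / 4;
    // Inner product of the coordinate with strides ((32,1),(16,8,128)).
    return t0 * 32 + t1 * 1 + v0 * 16 + v1 * 8 + v2 * 128;
}

// The eight offsets owned by one thread, in value order V0..V7.
inline std::array<int, 8> thread_offsets(int thread) {
    std::array<int, 8> out{};
    for (int v = 0; v < 8; ++v) {
        out[v] = layout_a_tv(thread, v);
    }
    return out;
}
```

Calling `thread_offsets(0)` yields 0, 16, 8, 24, 128, 144, 136, 152, the same list as above.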

If A is column-major in memory, these 8 indices map to the following coordinates: (0,0), (0,1), (8,0), (8,1), (0,8), (0,9), (8,8), (8,9)

This is exactly what the MMA picture shows: T0V0 at (0,0), T0V1 at (0,1), ..., T0V4 at (0,8), T0V5 at (0,9), ...

But according to the definition of SM80_16x8x16_F16F16F16F16_TN:
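The index-to-coordinate step can be written down explicitly. The sketch below decodes an index of the 16x16 A tile under the column-major assumption used here (index = row + 16 * col) and, for contrast, under a row-major reading (index = 16 * row + col). Both function names are hypothetical.

```cpp
#include <cassert>
#include <utility>

// Decode a linear index of a 16x16 tile into (row, col), assuming the
// index was formed column-major: index = row + 16 * col.
inline std::pair<int, int> decode_col_major(int index) {
    assert(index >= 0 && index < 16 * 16);
    return {index % 16, index / 16};
}

// The same index read as row-major (index = 16 * row + col), for contrast.
inline std::pair<int, int> decode_row_major(int index) {
    assert(index >= 0 && index < 16 * 16);
    return {index / 16, index % 16};
}
```

For index 16 (T0V1), `decode_col_major` gives (0,1), which matches the picture, while `decode_row_major` gives (1,0), which does not.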

    struct SM80_16x8x16_F16F16F16F16_TN
    {
      using DRegisters = uint32_t[2];
      using ARegisters = uint32_t[4];
      using BRegisters = uint32_t[2];
      using CRegisters = uint32_t[2];

      CUTE_HOST_DEVICE static void
      fma(uint32_t      & d0, uint32_t      & d1,
          uint32_t const& a0, uint32_t const& a1, uint32_t const& a2, uint32_t const& a3,
          uint32_t const& b0, uint32_t const& b1,
          uint32_t const& c0, uint32_t const& c1)
      {
    #if defined(CUTE_ARCH_MMA_SM80_ENABLED)
        asm volatile(
          "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
          "{%0, %1},"
          "{%2, %3, %4, %5},"
          "{%6, %7},"
          "{%8, %9};\n"
          : "=r"(d0), "=r"(d1)
          :  "r"(a0),  "r"(a1),  "r"(a2),  "r"(a3),
             "r"(b0),  "r"(b1),
             "r"(c0),  "r"(c1));
    #else
        CUTE_INVALID_CONTROL_PATH("Attempting to use SM80_16x8x16_F16F16F16F16_TN without CUTE_ARCH_MMA_SM80_ENABLED");
    #endif
      }
    };

Its PTX instruction is mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16.

The .row qualifier means matrix A is row-major.

If so, my column-major reading above is totally wrong.
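As a cross-check, the per-lane ownership of A elements can be computed from the instruction side instead. The sketch below follows my reading of the fragment description for the .f16 A operand of mma.m16n8k16 in the PTX ISA documentation; treat the formula as an assumption to verify against that document. `Coord` and `ptx_a_fragment_coord` are hypothetical names.

```cpp
#include <cassert>

struct Coord {
    int row;
    int col;
};

// (row, col) in the 16x16 A tile of half element a_i (i in 0..7) held by a
// lane (0..31), as I read the m16n8k16 .f16 fragment layout in the PTX ISA.
inline Coord ptx_a_fragment_coord(int lane, int i) {
    assert(lane >= 0 && lane < 32 && i >= 0 && i < 8);
    int group_id = lane >> 2;          // selects the row within an 8-row half
    int thread_in_group = lane % 4;    // selects a pair of adjacent columns
    bool upper_rows = (i >= 2 && i < 4) || i >= 6;  // a2, a3, a6, a7
    int row = group_id + (upper_rows ? 8 : 0);
    int col = thread_in_group * 2 + (i & 1) + (i >= 4 ? 8 : 0);
    return {row, col};
}
```

For lane 0 and i = 0..7 this gives (0,0), (0,1), (8,0), (8,1), (0,8), (0,9), (8,8), (8,9), the same coordinates I derived from LayoutA_TV with the column-major decoding. So the coordinates themselves appear consistent with the instruction, and what remains unclear to me is how that squares with the .row qualifier.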

I know there must be something wrong with my understanding.

So where is my understanding wrong?
