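I am trying to optimize a GEMM-like kernel using the MLIR Affine dialect. Inside my innermost loop (over %arg5), every iteration does an affine.load and an affine.store on %alloc at the same address, [%arg3, 0, %arg4], which does not depend on %arg5, so both look redundant.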
Despite running the -affine-scalrep (Scalar Replacement) pass, these memory operations are not being promoted to registers/SSA values. Is there a specific reason why scalrep might ignore this, or is there another pass I should use to eliminate this overhead?
module {
  func.func @main(%arg0: memref<8x16xf32>, %arg1: memref<16x64xf32>, %arg2: memref<1x64xf32>) -> memref<8x64xf32> attributes {llvm.emit_c_interface} {
    %cst = arith.constant 0.000000e+00 : f32
    %alloc = memref.alloc() {alignment = 64 : i64} : memref<8x1x64xf32>
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<8x64xf32>
    affine.for %arg3 = 0 to 8 {
      affine.for %arg4 = 0 to 64 {
        %0 = affine.load %arg2[0, %arg4] : memref<1x64xf32>
        affine.store %0, %alloc[%arg3, 0, %arg4] : memref<8x1x64xf32>
        affine.for %arg5 = 0 to 16 {
          %3 = affine.load %arg0[%arg3, %arg5] : memref<8x16xf32>
          %4 = affine.load %arg1[%arg5, %arg4] : memref<16x64xf32>
          %5 = affine.load %alloc[%arg3, 0, %arg4] : memref<8x1x64xf32>
          %6 = arith.mulf %3, %4 : f32
          %7 = arith.addf %5, %6 : f32
          affine.store %7, %alloc[%arg3, 0, %arg4] : memref<8x1x64xf32>
        }
        %1 = affine.load %alloc[%arg3, 0, %arg4] : memref<8x1x64xf32>
        %2 = arith.maximumf %1, %cst : f32
        affine.store %2, %alloc_0[%arg3, %arg4] : memref<8x64xf32>
      }
    }
    memref.dealloc %alloc : memref<8x1x64xf32>
    return %alloc_0 : memref<8x64xf32>
  }
}
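
To make "promoted to registers/SSA values" concrete, here is a hand-written sketch of the form the kernel above could take if the accumulator were carried through the innermost loop as an SSA value. It is not the output of any pass. It assumes %alloc is only ever used as a per-element accumulator, which is why the buffer is dropped entirely, and the names %acc and %sum are illustrative and do not appear in the original IR.

```mlir
module {
  func.func @main(%arg0: memref<8x16xf32>, %arg1: memref<16x64xf32>, %arg2: memref<1x64xf32>) -> memref<8x64xf32> attributes {llvm.emit_c_interface} {
    %cst = arith.constant 0.000000e+00 : f32
    %alloc_0 = memref.alloc() {alignment = 64 : i64} : memref<8x64xf32>
    affine.for %arg3 = 0 to 8 {
      affine.for %arg4 = 0 to 64 {
        // Initial accumulator value (previously stored into %alloc).
        %0 = affine.load %arg2[0, %arg4] : memref<1x64xf32>
        // %sum (hypothetical name) carries the running total across
        // iterations instead of a load/store pair on %alloc.
        %acc = affine.for %arg5 = 0 to 16 iter_args(%sum = %0) -> (f32) {
          %3 = affine.load %arg0[%arg3, %arg5] : memref<8x16xf32>
          %4 = affine.load %arg1[%arg5, %arg4] : memref<16x64xf32>
          %6 = arith.mulf %3, %4 : f32
          %7 = arith.addf %sum, %6 : f32
          affine.yield %7 : f32
        }
        // %acc (hypothetical name) replaces the reload of %alloc after the loop.
        %2 = arith.maximumf %acc, %cst : f32
        affine.store %2, %alloc_0[%arg3, %arg4] : memref<8x64xf32>
      }
    }
    return %alloc_0 : memref<8x64xf32>
  }
}
```

The additions happen in the same order as in the original loop, so the floating-point result should be unchanged. The only differences are that the running sum lives in iter_args and the temporary memref<8x1x64xf32> buffer is gone.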