Sparse-dense matrix multiplication under OpenCL performs terribly compared to the CPU and CUDA backends
Description
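While debugging an issue where consecutive matmulNT calls under CUDA take more and more time (up to 10 times longer), I ran into a far more severe issue when comparing against the OpenCL backend.
Compared to the other backends, the matmul operation under OpenCL is extremely slow, and matmulTN even more so. In the timings below (averaged over 10 runs), OpenCL is roughly 670x (matmulNN) and 3900x (matmulTN) slower than CUDA, and about 16x and 90x slower than the CPU backend: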
OpenCL :
matmulNN : 1510.16ms
matmulTN : 15950.30ms
CUDA :
matmulNN : 2.25ms
matmulTN : 4.07ms
CPU :
matmulNN : 95.73ms
matmulTN : 177.07ms
Reproducible Code
#include <arrayfire.h>
#include <iostream>

int main(int argc, char *argv[]) {
    int run_count = 10;          // Number of timed iterations
    int M = 1280*1280;           // Rows of the sparse matrix (and number of nonzeros)
    int N = 1280*512*20;         // Columns of the sparse matrix
    double res[] = {0, 0};       // Accumulated time in seconds: {matmulNN, matmulTN}

    try {
        // CSR components: one nonzero per row, at a random column index in [0, M)
        af::array A = af::floor(af::randu(M)*M).as(af::dtype::s32);  // column indices
        af::array B = af::range(af::dim4(M+1), 0, af::dtype::s32);   // row offsets 0..M
        af::array W = af::randu(M);                                  // nonzero values

        af::array P = af::randu(N);  // dense vector for S * P
        af::array L = af::randu(M);  // dense vector for S^T * L

        af::array S = af::sparse(M, N, W, B, A, af::storage::AF_STORAGE_CSR);

        // Warm-up calls, not timed
        af::array res1 = af::eval(af::matmul(S, P));
        af::array res2 = af::eval(af::matmulTN(S, L));

        for (int i = 0; i < run_count; ++i) {
            af::timer::start();
            res1 = af::eval(af::matmul(S, P));
            af::sync();
            res[0] += af::timer::stop();

            af::timer::start();
            res2 = af::eval(af::matmulTN(S, L));
            af::sync();
            res[1] += af::timer::stop();
        }
    } catch (af::exception& e) {
        std::cout << e.what() << std::endl;
        return -1;
    }

    // Report the average time per call in milliseconds
    std::cout << "Results : " << std::endl;
    std::cout << "\tmatmulNN : " << 1000*res[0]/run_count << "ms" << std::endl;
    std::cout << "\tmatmulTN : " << 1000*res[1]/run_count << "ms" << std::endl;
    return 0;
}
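For reference, here is a minimal CPU sketch of what the two timed operations compute on a CSR matrix. This is not ArrayFire's implementation: the `Csr` struct and the `spmv` / `spmv_transposed` names are hypothetical and exist only for illustration. It shows that the non-transposed product reads the dense vector row by row, while the transposed product on the same CSR data writes to scattered output positions.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical minimal CSR container (illustration only, not an ArrayFire type).
struct Csr {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> values;  // nonzero values, stored row by row
    std::vector<int> row_ptr;   // rows + 1 offsets into values / col_idx
    std::vector<int> col_idx;   // column index of each nonzero
};

// y = S * x, the product that matmul(S, P) computes.
// Each output element is accumulated independently from one row.
std::vector<float> spmv(const Csr& s, const std::vector<float>& x) {
    std::vector<float> y(s.rows, 0.0f);
    for (std::size_t r = 0; r < s.rows; ++r) {
        for (int k = s.row_ptr[r]; k < s.row_ptr[r + 1]; ++k) {
            y[r] += s.values[k] * x[s.col_idx[k]];
        }
    }
    return y;
}

// y = S^T * x, the product that matmulTN(S, L) computes,
// done without building the transpose: each nonzero is scattered
// into the output element selected by its column index.
std::vector<float> spmv_transposed(const Csr& s, const std::vector<float>& x) {
    std::vector<float> y(s.cols, 0.0f);
    for (std::size_t r = 0; r < s.rows; ++r) {
        for (int k = s.row_ptr[r]; k < s.row_ptr[r + 1]; ++k) {
            y[s.col_idx[k]] += s.values[k] * x[r];
        }
    }
    return y;
}
```

In the reproduction code above, `B` plays the role of `row_ptr` (offsets 0..M, so one nonzero per row), `A` is `col_idx`, and `W` is `values`. A parallel version of the scattered form usually needs atomic updates or an explicit transpose, which may be part of why matmulTN is slower than matmulNN on every backend, but I have not verified that this explains the OpenCL numbers.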
System Information
ArrayFire Version: 3.7.1
Device: i7-9750H, 16 GB RAM, GTX 1650 4 GB
Operating System: Windows 10
Driver version: 442.19