[Perf] OpenCL performance issue with sparse matmul

Sparse-dense matrix multiplication under OpenCL performs terribly compared to the CPU and CUDA backends.


Description
===========

While debugging an issue where consecutive matmulNT calls under CUDA take more and more time (up to 10 times longer), I ran into a far more severe issue when comparing against the OpenCL backend.
Compared to the other backends, the matmul operation under OpenCL is extremely slow, and matmulTN even more so (times are averages per call over 10 runs):

OpenCL : 
        matmulNN : 1510.16ms
        matmulTN : 15950.30ms

CUDA : 
        matmulNN : 2.25ms
        matmulTN : 4.07ms

CPU : 
        matmulNN : 95.73ms
        matmulTN : 177.07ms
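
To put the gap in perspective, the timings above can be turned into slowdown factors. The snippet below is only a small arithmetic sketch: the `Timing` struct, the constants, and the `slowdown` helper are names made up for this illustration, and the numbers are copied from the results listed above.

```cpp
#include <cstdio>

// Timings reported above, in milliseconds (hypothetical container for this sketch).
struct Timing {
    double nn_ms;  // matmulNN
    double tn_ms;  // matmulTN
};

const Timing kOpenCL = {1510.16, 15950.30};
const Timing kCUDA   = {2.25, 4.07};
const Timing kCPU    = {95.73, 177.07};

// How many times longer `backend_ms` is than `baseline_ms`.
double slowdown(double backend_ms, double baseline_ms) {
    return backend_ms / baseline_ms;
}

// Prints the OpenCL slowdown relative to the CUDA and CPU backends.
void print_slowdowns() {
    std::printf("matmulNN: OpenCL vs CUDA %.0fx, OpenCL vs CPU %.1fx\n",
                slowdown(kOpenCL.nn_ms, kCUDA.nn_ms),
                slowdown(kOpenCL.nn_ms, kCPU.nn_ms));
    std::printf("matmulTN: OpenCL vs CUDA %.0fx, OpenCL vs CPU %.1fx\n",
                slowdown(kOpenCL.tn_ms, kCUDA.tn_ms),
                slowdown(kOpenCL.tn_ms, kCPU.tn_ms));
}
```

Calling `print_slowdowns()` shows that OpenCL is hundreds to thousands of times slower than CUDA on the same GPU, and also well behind the CPU backend.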

Reproducible Code
-----------------
```cpp
#include <arrayfire.h>
#include <iostream>

int main(int argc, char *argv[]) {
    int run_count = 10;
    int M = 1280*1280;    // Rows of the sparse matrix S (1,638,400)
    int N = 1280*512*20;  // Columns of the sparse matrix S (13,107,200)
    double res[] = {0, 0};
    try{
        af::array A = af::floor(af::randu(M)*M).as(af::dtype::s32);
        af::array B = af::range(af::dim4(M+1), 0, af::dtype::s32);
        af::array W = af::randu(M);

        af::array P = af::randu(N);
        af::array L = af::randu(M);

        af::array S = af::sparse(M, N, W, B, A, af::storage::AF_STORAGE_CSR);

        af::array res1 = af::eval(af::matmul(S, P));
        af::array res2 = af::eval(af::matmulTN(S, L));

        for(int i = 0; i < run_count; ++i) {
            af::timer::start();
            res1 = af::eval(af::matmul(S, P));
            af::sync();
            res[0] += af::timer::stop();

            af::timer::start();
            res2 = af::eval(af::matmulTN(S, L));
            af::sync();
            res[1] += af::timer::stop();
        }
    } catch (af::exception& e) {
        std::cout << e.what() << std::endl;
        return -1;
    }

    std::cout << "Results : " << std::endl;
    std::cout << "\tmatmulNN : " << 1000*res[0]/run_count << "ms" << std::endl;
    std::cout << "\tmatmulTN : " << 1000*res[1]/run_count << "ms" << std::endl;
    return 0;
}
```
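
For reference, here is a plain CPU sketch of what the two timed operations compute on a CSR matrix. It is not ArrayFire's implementation: the `Csr` struct and both function names are hypothetical and exist only for this illustration. It does show one structural difference: `S * x` reads each row independently, while `S^T * x` scatters into output positions given by the column indices, which on a parallel device typically needs atomics or an explicit transpose. Whether that accounts for the OpenCL numbers above is an open question, not something this sketch establishes.

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR container (hypothetical, for illustration only).
struct Csr {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> values;  // non-zero values, stored row by row
    std::vector<int> row_ptr;   // rows + 1 offsets into values/col_idx
    std::vector<int> col_idx;   // column index of each non-zero
};

// y = S * x: each output element gathers from x, rows are independent.
std::vector<float> spmv(const Csr& s, const std::vector<float>& x) {
    std::vector<float> y(s.rows, 0.0f);
    for (std::size_t r = 0; r < s.rows; ++r) {
        for (int k = s.row_ptr[r]; k < s.row_ptr[r + 1]; ++k) {
            y[r] += s.values[k] * x[s.col_idx[k]];
        }
    }
    return y;
}

// y = S^T * x: each non-zero scatters into y at its column index,
// so different rows may write to the same output element.
std::vector<float> spmv_transposed(const Csr& s, const std::vector<float>& x) {
    std::vector<float> y(s.cols, 0.0f);
    for (std::size_t r = 0; r < s.rows; ++r) {
        for (int k = s.row_ptr[r]; k < s.row_ptr[r + 1]; ++k) {
            y[s.col_idx[k]] += s.values[k] * x[r];
        }
    }
    return y;
}
```

In the reproducer, `af::matmul(S, P)` corresponds to `spmv` and `af::matmulTN(S, L)` to `spmv_transposed`, with one non-zero per row.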

System Information
------------------
ArrayFire Version: 3.7.1
Device: i7-9750H, 16 GB RAM, GTX 1650 4 GB
Operating System: Windows 10
Driver version: 442.19

