[Perf] OpenCL performance issue with sparse matmul #2937

Description

@ebarjou

Sparse-dense matrix multiplication under OpenCL performs far worse than on the CPU and CUDA backends.

While debugging an issue where consecutive matmulNT calls under CUDA take more and more time (up to 10 times longer), I ran into a far more severe issue when comparing against the OpenCL backend.
Compared to the other backends, the matmul operation under OpenCL is extremely slow, and matmulTN even more so:

OpenCL :
matmulNN : 1510.16ms
matmulTN : 15950.30ms

CUDA :
matmulNN : 2.25ms
matmulTN : 4.07ms

CPU :
matmulNN : 95.73ms
matmulTN : 177.07ms
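
To see why the transposed product can behave so differently from the plain one, it helps to look at what each computes on CSR storage. The following is a minimal CPU reference sketch in plain C++, not ArrayFire's implementation; the `Csr` struct and the `spmv_nn` / `spmv_tn` names are hypothetical. In the NN case each output element gathers from its own row, while in the TN case each nonzero scatters into an output element chosen by its column index, so several rows can target the same output. On a GPU that pattern typically needs atomics or an explicit transpose, which may contribute to the gap above (an assumption, not something measured here).

```cpp
#include <cstddef>
#include <vector>

// Minimal CSR container (hypothetical helper, not an ArrayFire type).
struct Csr {
    std::size_t rows, cols;
    std::vector<float> values;  // nonzero values
    std::vector<int> row_ptr;   // size rows + 1
    std::vector<int> col_idx;   // one column index per nonzero
};

// y = S * x: every output row reads (gathers) from x independently.
std::vector<float> spmv_nn(const Csr& s, const std::vector<float>& x) {
    std::vector<float> y(s.rows, 0.0f);
    for (std::size_t r = 0; r < s.rows; ++r)
        for (int k = s.row_ptr[r]; k < s.row_ptr[r + 1]; ++k)
            y[r] += s.values[k] * x[s.col_idx[k]];
    return y;
}

// y = S^T * x: every nonzero writes (scatters) into y[col_idx[k]];
// different rows may update the same output element.
std::vector<float> spmv_tn(const Csr& s, const std::vector<float>& x) {
    std::vector<float> y(s.cols, 0.0f);
    for (std::size_t r = 0; r < s.rows; ++r)
        for (int k = s.row_ptr[r]; k < s.row_ptr[r + 1]; ++k)
            y[s.col_idx[k]] += s.values[k] * x[r];
    return y;
}
```

A reference like this can also be used to check the results of each backend on a small matrix before comparing timings.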

Reproducible Code

#include <arrayfire.h>
#include <iostream>

int main(int argc, char *argv[]) {
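    int run_count = 10;     // number of timed iterations to average over
    int M = 1280*1280;      // rows of S (1,638,400)
    int N = 1280*512*20;    // columns of S (13,107,200)
    double res[] = {0, 0};  // accumulated seconds: {matmulNN, matmulTN}
    try{
        // CSR components: column indices (A), row pointers 0..M (B), values (W).
        af::array A = af::floor(af::randu(M)*M).as(af::dtype::s32);
        af::array B = af::range(af::dim4(M+1), 0, af::dtype::s32);
        af::array W = af::randu(M);

        af::array P = af::randu(N); // dense operand for S * P
        af::array L = af::randu(M); // dense operand for S^T * L

        af::array S = af::sparse(M, N, W, B, A, af::storage::AF_STORAGE_CSR);

        // Warm-up runs; sync so pending work is not counted in the first timed iteration.
        af::array res1 = af::eval(af::matmul(S, P));
        af::array res2 = af::eval(af::matmulTN(S, L));
        af::sync();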

        for(int i = 0; i < run_count; ++i) {
            af::timer::start();
            res1 = af::eval(af::matmul(S, P));
            af::sync();
            res[0] += af::timer::stop();

            af::timer::start();
            res2 = af::eval(af::matmulTN(S, L));
            af::sync();
            res[1] += af::timer::stop();
        }
    } catch (af::exception& e) {
        std::cout << e.what() << std::endl;
        return -1;
    }

    std::cout << "Results : " << std::endl;
    std::cout << "\tmatmulNN : " << 1000*res[0]/run_count << "ms" << std::endl;
    std::cout << "\tmatmulTN : " << 1000*res[1]/run_count << "ms" << std::endl;
    return 0;
}
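
The sparsity pattern built by the reproducer is worth spelling out: the row-pointer array is `0, 1, ..., M`, so each row holds exactly one nonzero, and the column indices come from `floor(randu(M)*M)`, so they only land in the first `M` of the `N` columns and can repeat across rows. Below is a small stand-alone sketch of the same construction on the CPU. `CsrPattern` and `make_one_per_row` are hypothetical names, and `std::mt19937` stands in for ArrayFire's random generator, so the actual values will differ.

```cpp
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// CSR arrays with the same layout as the reproducer (hypothetical helper type).
struct CsrPattern {
    std::vector<int> row_ptr;   // size m + 1, plays the role of B
    std::vector<int> col_idx;   // size m, plays the role of A
    std::vector<float> values;  // size m, plays the role of W
};

// Build an m-row pattern with exactly one nonzero per row.
CsrPattern make_one_per_row(int m, unsigned seed) {
    CsrPattern p;
    p.row_ptr.resize(static_cast<std::size_t>(m) + 1);
    std::iota(p.row_ptr.begin(), p.row_ptr.end(), 0);  // 0, 1, ..., m

    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> col(0, m - 1);  // indices stay below m
    std::uniform_real_distribution<float> val(0.0f, 1.0f);
    for (int r = 0; r < m; ++r) {
        p.col_idx.push_back(col(gen));  // columns may repeat across rows
        p.values.push_back(val(gen));
    }
    return p;
}
```

Because column indices can repeat, the transposed product on this matrix will generally have several rows contributing to the same output element.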

System Information

ArrayFire Version: 3.7.1
Device: i7-9750H, 16 GB RAM, GTX 1650 4 GB
Operating System: Windows 10
Driver version: 442.19
