Add OpenCL implementation

Sadly, it is bottlenecked by VRAM latency because global memory is uncached on my Nvidia system, so its performance is only on par with rust-safe.
