Almost bare metal Part II

In my previous post I mentioned I have been working on moving a routine that handles many hashing tasks at once to the GPU. The preliminary results are encouraging, although I cannot say they are apples-to-apples to the actual task at hand.

To get some simple numbers, I simplified the task. There is a stream of bytes that must be hashed using the SHA256 algorithm. The stream of bytes includes a part that is somewhat static and a part that changes. The part that changes could be one of about 13,000 combinations. The task is to hash those 13,000 byte streams and check which (if any) have the value below a certain threshold, similar to the process used by Bitcoin to publish blocks.

Because we are now dealing with moving blocks of memory across a bus, the actual process was adjusted to do this in more of a “batch” mode. The 13,000 byte streams are easily created on the CPU. Each byte stream is put in an array (also on the CPU).

Now for the fun part. The entire array is hashed one element at a time. The results are placed in another array. This is done once on the CPU and once on the GPU, with the timing results compared.

The initial results were not good:

Building CPU IDs: 3345344ns total, average : 261ns.
Building GPU IDs: 142561855ns total, average : 11137ns.

But running it a second time yielded:

Building CPU IDs: 3297128ns total, average : 257ns.
Building GPU IDs: 795204ns total, average : 62ns.

And a third run yielded:

Building CPU IDs: 3304968ns total, average : 258ns.
Building GPU IDs: 788838ns total, average : 61ns.

I would guess that the first run had to set up some things on the GPU that cost time. Subsequent runs are much more efficient. To support that thesis, note that the time it took the CPU was fairly consistent across all three runs, and the GPU runs after the initial run were also very consistent.

Runs beyond the first 3 show consistent times on both CPU and GPU.

The synopsis is: Bulk hashing on a NVidia GeForce 2060 (notebook version) using SYCL provides about a 4X improvement in hashing. Moving to CUDA provides similar results, so it seems the language does not seem to matter here.

More questions to be answered:

  • Comparing the hashes to a “difficulty” level must also be done. Would the GPU be more efficient here as well? Should it be done in the same process (memory efficient) or in a separate process (perhaps pipe efficient)?
  • What would happen if we ran this on an AMD GPU with similar specifications?
  • What resources were used on the GPU? Are we only using a fraction of memory/processing abilities, or are we almost maxed out?
  • At what point is the arrays that are passed in or retrieved from the card too big to handle?

Leave a Reply

Your email address will not be published. Required fields are marked *