SIMD Support in .NET Framework

Abstract

Version 4.6 of the .NET Framework was released a few weeks ago with a lot of new features. One feature that caught my attention is support for Single Instruction, Multiple Data (SIMD) vector instructions in the new 64-bit JIT compiler, RyuJIT. I wanted to see how easy it is to leverage this SIMD support and what performance improvements we can expect for code using it.

Introduction to SIMD

SIMD instructions are, as the name suggests, instructions that apply a single operation to multiple data elements at once (e.g. adding two vectors element by element). They work with different element types (e.g. int, float, double), and the level of parallelism (the number of elements per vector) depends on the width of the register.
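To make the relation between register width and parallelism concrete, here is a small sketch of the arithmetic. It is written in Python purely for illustration (it is not .NET code), and the element sizes assume 32-bit int and float and 64-bit double, as in C#.

```python
# Element sizes in bits (assumed: C#-style int/float = 32 bits, double = 64 bits).
ELEMENT_BITS = {"int": 32, "float": 32, "double": 64}


def lanes(register_bits: int, element_bits: int) -> int:
    """Number of elements that fit into one vector register."""
    return register_bits // element_bits


# Register widths discussed in this article: 128, 256 and 512 bits.
for register_bits in (128, 256, 512):
    counts = {name: lanes(register_bits, bits) for name, bits in ELEMENT_BITS.items()}
    print(register_bits, counts)
```

So a 128-bit register holds four floats and a 256-bit register holds eight, which is why one vector instruction can replace several scalar ones.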

There are two families of vector instructions available in x86-compatible processors. The older one came with the SSE2 instruction set, which operates on 128-bit registers. The newer instructions come with the various AVX instruction sets; depending on which one the processor supports (AVX, AVX2, AVX-512), registers are up to 512 bits wide. RyuJIT tries to use the best one available on the CPU on which the code is going to execute.

SIMD support was added only to the new RyuJIT compiler, which at the moment works only for x64 programs. There have been claims that RyuJIT can be extended to other platforms as well, but for now there is no improvement for x86 programs. It is also worth mentioning that although the .NET team aims for code using the SIMD wrapper classes to run on par with sequential code when SIMD cannot be used, they are not there yet.

RyuJIT and the .NET Framework add support for SIMD by attaching JitIntrinsicAttribute to a class or method. The JIT then knows that it can ignore the normal bytecode of the method and replace it with special handling (e.g. a SIMD instruction). A few classes are marked with JitIntrinsicAttribute, so they should benefit from SIMD optimisation (given proper JIT support):

  • Vector2 – Fixed vector of two floating point numbers (e.g. to represent point in 2D space)
  • Vector3 – Fixed vector of three floating point numbers (e.g. to represent point in 3D space)
  • Vector4 – Fixed vector of four floating point numbers
  • Vector<T> – Variable length vector of type T (128 bit – 256 bit depending on supported instruction set)

The first three fixed-length vector types are part of the BCL in .NET Framework 4.6, in the System.Numerics namespace. The variable-length vector type is available in the NuGet package System.Numerics.Vectors, version 4.1.0. For .NET Framework 4.5, both the fixed and variable-length types are in the same NuGet package, version 4.0.0.

Performance test

I created a simple test that multiplies two vectors in a loop. The resulting number of operations is approximately equal to multiplying two matrices of size 100k x 100k. More elaborate benchmarks are available, but since I wanted to compare how the same code runs on different platforms and versions of the .NET Framework, I used just this simple test.
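The general shape of such a test can be sketched as follows. This is a Python illustration of the control flow only, not the C# code from the repository: the function names and the chunk width of 4 are assumptions, and Python will not emit SIMD instructions for it. It shows how a vectorised loop walks the data in whole chunks and finishes the leftover elements one by one.

```python
def multiply_scalar(a, b):
    """Element-wise product, one element per iteration."""
    return [x * y for x, y in zip(a, b)]


def multiply_chunked(a, b, width=4):
    """Element-wise product processed `width` elements at a time.

    Mimics how a vectorised loop walks the data; `width` stands in for
    the number of elements in one vector (hypothetical value).
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    result = [0.0] * len(a)
    i = 0
    # Main loop: whole chunks of `width` elements (one "vector" each).
    while i + width <= len(a):
        result[i:i + width] = [x * y for x, y in zip(a[i:i + width], b[i:i + width])]
        i += width
    # Tail loop: leftover elements that do not fill a whole chunk.
    while i < len(a):
        result[i] = a[i] * b[i]
        i += 1
    return result


a = [float(n) for n in range(10)]
b = [2.0] * 10
print(multiply_chunked(a, b) == multiply_scalar(a, b))
```

Both functions return the same result; in a real SIMD implementation only the main loop changes, with each chunk handled by one vector multiplication.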

The test exercises all currently supported vector classes, with a scalar implementation of vector multiplication as the baseline. I compiled the code for both .NET Framework 4.5.2 and 4.6, and for both the x86 and x64 platforms. I ran the tests on a laptop with an Intel Core i7-2640M and on a desktop with an Intel Core i7-3770. Both support SSE2 and AVX but not AVX2, so Vector<T> should be limited to 128 bits on them.

Unfortunately, I don’t have access to any CPU that supports AVX2 (256 bit) or AVX-512 (512 bit). If somebody has a CPU that supports these instruction sets, please let me know the results of running the test application. The source code for the application I used is available on GitHub (taynes13/SimdTest).

The following table contains the test results. I averaged five test runs to smooth out random fluctuations. I normalised the results for each CPU so that they can be compared; the value 100% corresponds to the scalar implementation on x64 running on .NET Framework 4.6.

.NET Version  Platform  Method   Avg. Time  Avg. Time  Perf. Imp.  Perf. Imp.
                                 i7-2640M   i7-3770    i7-2640M    i7-3770
.NET 4.6      x64       Scalar   00:09.3    00:08.2    100.00%     100.00%
.NET 4.6      x86       Scalar   00:09.4    00:08.2     98.90%     100.16%
.NET 4.5.2    x64       Scalar   00:09.5    00:08.3     97.35%      99.44%
.NET 4.5.2    x86       Scalar   00:09.3    00:08.2    100.00%     100.35%
.NET 4.6      x64       VectorT  00:05.6    00:03.5    167.02%     237.01%
.NET 4.6      x86       VectorT  01:11.7    01:02.6     12.94%      13.16%
.NET 4.6      x64       Vector2  00:23.0    00:15.2     40.43%      54.10%
.NET 4.6      x86       Vector2  00:21.5    00:17.7     43.18%      46.56%
.NET 4.5.2    x64       Vector2  00:23.3    00:15.1     39.89%      54.40%
.NET 4.5.2    x86       Vector2  00:21.2    00:17.7     43.79%      46.52%
.NET 4.6      x64       Vector3  00:19.4    00:13.8     47.90%      59.83%
.NET 4.6      x86       Vector3  00:20.0    00:16.9     46.31%      48.72%
.NET 4.5.2    x64       Vector3  00:19.4    00:13.9     47.93%      59.15%
.NET 4.5.2    x86       Vector3  00:20.4    00:17.8     45.59%      46.27%
.NET 4.6      x64       Vector4  00:17.6    00:13.1     52.88%      62.66%
.NET 4.6      x86       Vector4  00:25.9    00:23.3     35.88%      35.34%
.NET 4.5.2    x64       Vector4  00:17.7    00:13.1     52.48%      62.86%
.NET 4.5.2    x86       Vector4  00:26.0    00:23.5     35.74%      35.00%
Performance improvements of using Vector classes in different .NET Framework configurations
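The normalisation behind the percentage columns can be sketched like this (Python, for the arithmetic only; the function name is an assumption). The inputs are the rounded times from the table, so the output is only close to the published percentages, which were presumably computed from unrounded times.

```python
def relative_performance(baseline_seconds: float, measured_seconds: float) -> float:
    """Performance relative to the baseline, in percent.

    100% means "as fast as the baseline"; a higher value means faster.
    """
    return baseline_seconds / measured_seconds * 100.0


# i7-2640M, rounded times from the table (seconds).
baseline = 9.3   # Scalar, x64, .NET 4.6
vector_t_x64 = 5.6
vector_t_x86 = 71.7  # 01:11.7

print(f"VectorT x64: {relative_performance(baseline, vector_t_x64):.2f}%")
print(f"VectorT x86: {relative_performance(baseline, vector_t_x86):.2f}%")
```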

There are a few interesting observations:

  • Vector<T> running on the new RyuJIT (x64 .NET 4.6) with SIMD support made the multiplication about 2.4x faster on the desktop CPU (and a more moderate 1.7x faster on the laptop CPU).
  • The same Vector<T> running on x86 .NET 4.6 was about 7.6x slower than scalar multiplication and, on the desktop CPU, almost 18x slower than the x64 RyuJIT using SIMD (about 13x on the laptop).
  • The Vector2, Vector3 and Vector4 classes had very similar performance on both .NET 4.5.2 and .NET 4.6.
  • There was no performance improvement for x64 RyuJIT-compiled code for the Vector2, Vector3 and Vector4 classes. In fact, performance degraded: for all the fixed vector types, the code runs on average roughly 2x slower than the scalar implementation.
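The factors quoted in these observations can be re-derived from the rounded average times in the table. The following Python sketch only does that division; the helper name is an assumption, and because the inputs are rounded the factors are approximate.

```python
def times_slower(slow_seconds: float, fast_seconds: float) -> float:
    """How many times longer the slow configuration takes than the fast one."""
    return slow_seconds / fast_seconds


# i7-3770 (desktop), rounded times from the table, in seconds.
scalar_x64 = 8.2
scalar_x86 = 8.2
vector_t_x64 = 3.5
vector_t_x86 = 62.6  # 01:02.6

print(f"Scalar x64 vs VectorT x64: {times_slower(scalar_x64, vector_t_x64):.1f}x")
print(f"VectorT x86 vs scalar x86: {times_slower(vector_t_x86, scalar_x86):.1f}x")
print(f"VectorT x86 vs VectorT x64: {times_slower(vector_t_x86, vector_t_x64):.1f}x")
```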

Summary

In my tests, it is beneficial to use the Vector<T> type if and only if we can ensure the application is going to run as a 64-bit (x64) process on .NET Framework 4.6. If we are not sure whether the program is going to run on x86 or x64, it is better to avoid Vector<T>, because on x86 it is almost an order of magnitude slower than the scalar implementation. As I mentioned earlier, the .NET team aims to let programs using the Vector classes run on x86 without a performance hit, but we will have to wait for that.

The second thing that surprised me in my tests is that I haven’t seen performance improvements for any of the fixed-length vectors (compared to x86 or even .NET Framework 4.5.2). This suggests that SIMD is disabled for the fixed Vector classes in the BCL of .NET Framework 4.6. I have seen many articles on the internet about how RyuJIT adds support for SIMD, but nothing about this support being switched off for the fixed Vector classes. If somebody has an explanation, please let me know.

To summarise my tests in one sentence: SIMD support in RyuJIT is definitely promising, but one has to be careful about where and how the code is going to be compiled and run (at least for now).

3 thoughts on “SIMD Support in .NET Framework”

  1. I have an i7-4930MX 3.2 with AVX2 – Ran the test for you with the following results. Quite interesting, and shocking that the x86 with VectorT is so bad.

    http://tinyurl.com/horgg4t

    Raw data if you wanted it:

    .NET Version Platform Method Vector Size Run 1 Run 2 Run 3 Run 4 Run 5 Avg.
    .NET 4.6 x64 Scalar 1 00:00:08.5049397 00:00:08.4819244 00:00:08.5584601 00:00:08.4050128 00:00:08.4366062 00:00:08.4770000
    .NET 4.6 x64 Vector2 2 00:00:13.6215636 00:00:13.3960491 00:00:13.5197581 00:00:13.3955260 00:00:13.3687324 00:00:13.4600000
    .NET 4.6 x64 Vector3 3 00:00:10.3234736 00:00:10.3202046 00:00:10.3194595 00:00:10.3545650 00:00:10.3605798 00:00:10.3360000
    .NET 4.6 x64 Vector4 4 00:00:10.5843957 00:00:10.5634758 00:00:10.6222717 00:00:10.5624154 00:00:10.5703913 00:00:10.5810000
    .NET 4.6 x64 VectorT 8 00:00:02.5532343 00:00:02.5554063 00:00:02.5630223 00:00:02.5333693 00:00:02.5368763 00:00:02.5480000
    .NET 4.6 x86 Scalar 1 00:00:08.4481765 00:00:08.4572349 00:00:08.4310199 00:00:08.4591279 00:00:08.4492501 00:00:08.4490000
    .NET 4.6 x86 Vector2 2 00:00:16.7626290 00:00:16.6315340 00:00:16.5439537 00:00:16.5393941 00:00:16.6067281 00:00:16.6170000
    .NET 4.6 x86 Vector3 3 00:00:15.5384915 00:00:15.5584736 00:00:15.8426223 00:00:15.6175084 00:00:15.6436134 00:00:15.6400000
    .NET 4.6 x86 Vector4 4 00:00:22.5261376 00:00:22.4223628 00:00:22.3620322 00:00:22.3585104 00:00:22.2713580 00:00:22.3880000
    .NET 4.6 x86 VectorT 4 00:01:04.4077779 00:01:04.2106209 00:01:03.8761238 00:01:04.9481329 00:01:04.2900224 00:01:04.3470000
    .NET 4.5.2 x64 Scalar 1 00:00:08.6521460 00:00:08.6709647 00:00:08.6565594 00:00:08.5789361 00:00:08.4419998 00:00:08.6000000
    .NET 4.5.2 x64 Vector2 2 00:00:13.5340759 00:00:14.1369116 00:00:13.8007251 00:00:13.4668997 00:00:13.5698341 00:00:13.7020000
    .NET 4.5.2 x64 Vector3 3 00:00:10.4013105 00:00:10.3375081 00:00:10.3366736 00:00:10.3556058 00:00:10.4285204 00:00:10.3720000
    .NET 4.5.2 x64 Vector4 4 00:00:10.8847687 00:00:10.7361043 00:00:10.5679475 00:00:10.5874129 00:00:10.5548098 00:00:10.6660000
    .NET 4.5.2 x86 Scalar 1 00:00:08.4426836 00:00:08.4186493 00:00:08.4267675 00:00:08.4278898 00:00:08.5290538 00:00:08.4490000
    .NET 4.5.2 x86 Vector2 2 00:00:17.0080636 00:00:17.1072095 00:00:17.0781409 00:00:17.1174036 00:00:17.2852116 00:00:17.1190000
    .NET 4.5.2 x86 Vector3 3 00:00:15.9839203 00:00:15.8536269 00:00:15.7049486 00:00:16.0717018 00:00:15.5033222 00:00:15.8240000
    .NET 4.5.2 x86 Vector4 4 00:00:22.2254217 00:00:22.2032240 00:00:22.3975726 00:00:22.3754702 00:00:22.7833382 00:00:22.3970000


  2. *not sure if my last post made it — but here we go again

    I have an Intel i7-4930MX with AVX2 on it (extreme mobile processor). Not sure it made much of a difference (not sure how clock rate impacts the tests vs your results though). x86 VectorT is crazy slow on these tests. That was a bit shocking.

    Anyhow — Image + Raw Data:

    http://tinyurl.com/zza9rxj



  3. Hia, I did some more current Benchmarks using your Project on an i5-4570. To run these I added some files to test the current 4.1.1-Beta-23516. As you can see the weird x86-behaviour isn’t really fixed. If you want this added to your project I can send you a pull request. My Fork: https://github.com/fkorak/SimdTest

    .NET Version Platform Method Vector Size Run 1 Run 2 Run 3 Run 4 Run 5 Avg. Name Avg. Sec
    .NET 4.6 x64 Scalar 1 00:00:08.8310981 00:00:08.8476409 00:00:08.8423057 00:00:08.8288529 00:00:09.1133718 00:00:08.8930000 Scalar (x64.NET 4.6) 8,893
    .NET 4.6 x64 Vector2 2 00:00:14.4336708 00:00:14.2089888 00:00:14.5736022 00:00:14.1299720 00:00:14.1129340 00:00:14.2920000 Vector2 (x64.NET 4.6) 14,292
    .NET 4.6 x64 Vector3 3 00:00:11.0964922 00:00:10.7632482 00:00:10.7673652 00:00:10.7411797 00:00:10.9045927 00:00:10.8550000 Vector3 (x64.NET 4.6) 10,855
    .NET 4.6 x64 Vector4 4 00:00:11.1638694 00:00:11.0022448 00:00:10.9695284 00:00:10.9765500 00:00:11.1933056 00:00:11.0610000 Vector4 (x64.NET 4.6) 11,061
    .NET 4.6 x64 VectorT 8 00:00:02.6388782 00:00:02.8325761 00:00:02.8147927 00:00:02.7299669 00:00:02.7096824 00:00:02.7450000 VectorT (x64.NET 4.6) 2,745
    .NET 4.6 x86 Scalar 1 00:00:08.7998042 00:00:08.7940767 00:00:08.8407341 00:00:08.8407893 00:00:08.8031540 00:00:08.8160000 Scalar (x86.NET 4.6) 8,816
    .NET 4.6 x86 Vector2 2 00:00:17.5244379 00:00:17.4085829 00:00:17.8129272 00:00:17.4762015 00:00:17.3588548 00:00:17.5160000 Vector2 (x86.NET 4.6) 17,516
    .NET 4.6 x86 Vector3 3 00:00:16.0142924 00:00:16.0048361 00:00:15.9908666 00:00:16.0033180 00:00:16.0022891 00:00:16.0030000 Vector3 (x86.NET 4.6) 16,003
    .NET 4.6 x86 Vector4 4 00:00:21.9292435 00:00:21.8969004 00:00:21.9230125 00:00:21.9023665 00:00:21.9142861 00:00:21.9130000 Vector4 (x86.NET 4.6) 21,913
    .NET 4.6 x86 VectorT 4 00:01:06.0549311 00:01:06.0204423 00:01:06.0268106 00:01:06.0423430 00:01:06.1238523 00:01:06.0540000 VectorT (x86.NET 4.6) 66,054
    .NET 4.6.1 x64 Scalar 1 00:00:08.8202504 00:00:08.9103994 00:00:08.7608850 00:00:08.7671349 00:00:08.8873055 00:00:08.8290000 Scalar (x64.NET 4.6.1) 8,829
    .NET 4.6.1 x64 Vector2 2 00:00:14.1674652 00:00:14.1650773 00:00:14.1696866 00:00:14.1618609 00:00:14.1655725 00:00:14.1660000 Vector2 (x64.NET 4.6.1) 14,166
    .NET 4.6.1 x64 Vector3 3 00:00:10.6976656 00:00:10.7099634 00:00:10.7008447 00:00:10.7000044 00:00:10.7064484 00:00:10.7030000 Vector3 (x64.NET 4.6.1) 10,703
    .NET 4.6.1 x64 Vector4 4 00:00:10.9422110 00:00:11.1417749 00:00:11.2010473 00:00:11.0551300 00:00:11.1679489 00:00:11.1020000 Vector4 (x64.NET 4.6.1) 11,102
    .NET 4.6.1 x64 VectorT 8 00:00:02.6351351 00:00:02.6379169 00:00:02.6337329 00:00:02.6322254 00:00:02.6336293 00:00:02.6350000 VectorT (x64.NET 4.6.1) 2,635
    .NET 4.6.1 x86 Scalar 1 00:00:08.9413716 00:00:08.8268178 00:00:08.7603372 00:00:08.7478376 00:00:08.7554058 00:00:08.8060000 Scalar (x86.NET 4.6.1) 8,806
    .NET 4.6.1 x86 Vector2 2 00:00:17.4232631 00:00:17.4158521 00:00:17.4043995 00:00:17.4171903 00:00:17.4146907 00:00:17.4150000 Vector2 (x86.NET 4.6.1) 17,415
    .NET 4.6.1 x86 Vector3 3 00:00:16.1810248 00:00:16.2940445 00:00:16.0190926 00:00:16.0169995 00:00:16.0210597 00:00:16.1060000 Vector3 (x86.NET 4.6.1) 16,106
    .NET 4.6.1 x86 Vector4 4 00:00:21.9160040 00:00:22.1450465 00:00:22.0117079 00:00:22.0152364 00:00:21.9989405 00:00:22.0170000 Vector4 (x86.NET 4.6.1) 22,017
    .NET 4.6.1 x86 VectorT 4 00:01:03.0638409 00:01:03.0923153 00:01:03.0844844 00:01:03.1192981 00:01:03.1025411 00:01:03.0920000 VectorT (x86.NET 4.6.1) 63,092
    .NET 4.5.2 x64 Scalar 1 00:00:08.7859825 00:00:08.7805165 00:00:08.7745604 00:00:08.7906714 00:00:08.7780997 00:00:08.7820000 Scalar (x64.NET 4.5.2) 8,782
    .NET 4.5.2 x64 Vector2 2 00:00:14.1676897 00:00:14.1650378 00:00:14.1759603 00:00:14.1706386 00:00:14.2037511 00:00:14.1770000 Vector2 (x64.NET 4.5.2) 14,177
    .NET 4.5.2 x64 Vector3 3 00:00:10.9810166 00:00:11.1235431 00:00:10.9438497 00:00:10.7762753 00:00:10.9744707 00:00:10.9600000 Vector3 (x64.NET 4.5.2) 10,96
    .NET 4.5.2 x64 Vector4 4 00:00:11.1842328 00:00:11.1141547 00:00:11.1809655 00:00:11.2701885 00:00:11.2078867 00:00:11.1910000 Vector4 (x64.NET 4.5.2) 11,191
    .NET 4.5.2 x86 Scalar 1 00:00:09.0268725 00:00:08.7907940 00:00:08.7995062 00:00:09.0473459 00:00:08.7900912 00:00:08.8910000 Scalar (x86.NET 4.5.2) 8,891
    .NET 4.5.2 x86 Vector2 2 00:00:17.8890622 00:00:17.5699339 00:00:17.4467435 00:00:17.4492520 00:00:17.4463750 00:00:17.5600000 Vector2 (x86.NET 4.5.2) 17,56
    .NET 4.5.2 x86 Vector3 3 00:00:16.1859854 00:00:16.3398638 00:00:16.1851460 00:00:16.3594321 00:00:16.1800927 00:00:16.2500000 Vector3 (x86.NET 4.5.2) 16,25
    .NET 4.5.2 x86 Vector4 4 00:00:22.0253589 00:00:22.0014359 00:00:22.0041298 00:00:22.0614630 00:00:22.0513629 00:00:22.0290000 Vector4 (x86.NET 4.5.2) 22,029

