SSE intrinsics let the CPU operate on four floats or two doubles at the same time, which is very useful for improving performance in areas such as image processing and matrix calculation. But even with SSE there are many different ways to achieve the same goal, and each one performs differently. In this chapter I will show several ways to implement a Matrix * Vector function, a very typical and common operation that you will run into often.

Prerequisite

Vect4D_SIMD is a vector class that represents a vector of four floats: x, y, z, w.
Matrix_SIMD is a matrix class with four Vect4D_SIMD members: v0, v1, v2, v3.
In this application we use column-major order to implement the multiplication.
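
To make the later listings easier to follow, here is a minimal sketch of what the two classes might look like, together with a scalar reference multiply that computes the same four dot products as the SSE versions below. Only the names Vect4D_SIMD, Matrix_SIMD, _m and v0..v3 come from this article; the constructors, the get() accessor and the dotReference/mulReference helpers are assumptions made for illustration.

```cpp
#include <xmmintrin.h>  // SSE: __m128, _mm_set_ps, _mm_setzero_ps, _mm_storeu_ps

// Hypothetical stand-in for the article's Vect4D_SIMD: four floats in one __m128.
struct Vect4D_SIMD {
    __m128 _m;

    Vect4D_SIMD() : _m(_mm_setzero_ps()) {}
    explicit Vect4D_SIMD(__m128 m) : _m(m) {}
    // _mm_set_ps takes its arguments from the highest lane down,
    // so passing (w, z, y, x) puts x in lane 0.
    Vect4D_SIMD(float x, float y, float z, float w) : _m(_mm_set_ps(w, z, y, x)) {}

    // Assumed accessor: copies the register to memory and returns one lane (0..3).
    float get(int lane) const {
        float tmp[4];
        _mm_storeu_ps(tmp, _m);
        return tmp[lane];
    }
};

// Hypothetical stand-in for the article's Matrix_SIMD: four vectors v0..v3.
struct Matrix_SIMD {
    Vect4D_SIMD v0, v1, v2, v3;
};

// Scalar dot product, used only as a reference.
float dotReference(const Vect4D_SIMD& a, const Vect4D_SIMD& b) {
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i) {
        sum += a.get(i) * b.get(i);
    }
    return sum;
}

// Reference result: component i is dot(v_i, v), as in the listings below.
Vect4D_SIMD mulReference(const Matrix_SIMD& M, const Vect4D_SIMD& v) {
    return Vect4D_SIMD(dotReference(M.v0, v), dotReference(M.v1, v),
                       dotReference(M.v2, v), dotReference(M.v3, v));
}
```

A reference like this is handy for checking that each SSE variant returns the same vector before comparing their timings.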

Rough code for the stress test

for (int i = 0; i < 1000; i++)
{
	vout = M * A;
	vout = M * B;
	vout = M * C;
}
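
The article does not show how the loop was timed, so the following is only a sketch of one way to do it with std::chrono::steady_clock. The helper name timeStressTest and its signature are hypothetical. When measuring in release mode it is worth reading the results somewhere, because an optimizing compiler is allowed to drop work whose result is never used.

```cpp
#include <chrono>

// Hypothetical helper: calls body(i) for i in [0, iterations) and
// returns the elapsed wall-clock time in seconds.
template <typename Body>
double timeStressTest(int iterations, Body body) {
    typedef std::chrono::steady_clock Clock;
    const Clock::time_point start = Clock::now();
    for (int i = 0; i < iterations; ++i) {
        body(i);
    }
    const std::chrono::duration<double> elapsed = Clock::now() - start;
    return elapsed.count();
}
```

With the classes from this article, the loop body above would go inside a lambda passed as the second argument, for example one that assigns M * A, M * B and M * C to vout.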

First way using _mm_dp_ps

Vect4D_SIMD Matrix_SIMD::operator * (const Vect4D_SIMD &v) const
{
	// using _mm_dp_ps: mask 0xFF multiplies all four lanes and
	// broadcasts the sum to every lane of the result
	__m128 lv0 = _mm_dp_ps(v0._m, v._m, 0xFF);
	__m128 lv1 = _mm_dp_ps(v1._m, v._m, 0xFF);
	__m128 lv2 = _mm_dp_ps(v2._m, v._m, 0xFF);
	__m128 lv3 = _mm_dp_ps(v3._m, v._m, 0xFF);
	// read lane 0 of each result (m128_f32 is an MSVC-specific member)
	return Vect4D_SIMD(lv0.m128_f32[0], lv1.m128_f32[0], lv2.m128_f32[0], lv3.m128_f32[0]);
}

Running time of the stress test in release mode. Matrix*Vect_SIMD: 1.657408s
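
The third argument of _mm_dp_ps is an 8-bit mask: the high four bits choose which lane products are summed, and the low four bits choose which result lanes receive the sum (the others are set to zero). The sketch below shows this outside the matrix class. The helper names dot4 and dot3 are hypothetical, the target attribute is only there so the code builds with GCC or Clang without extra flags, and _mm_cvtss_f32 is used as a portable way to read lane 0.

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_dp_ps

// Assumption: on GCC/Clang this enables SSE4.1 for the functions below
// without a command-line flag; other compilers get an empty macro.
#if defined(__GNUC__) || defined(__clang__)
#define DP_SKETCH_TARGET __attribute__((target("sse4.1")))
#else
#define DP_SKETCH_TARGET
#endif

// Hypothetical helper: dot product of two arrays of four floats.
DP_SKETCH_TARGET float dot4(const float* a, const float* b) {
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    // 0xFF: sum all four products, write the sum to all four lanes.
    const __m128 d = _mm_dp_ps(va, vb, 0xFF);
    return _mm_cvtss_f32(d);  // lane 0
}

// Hypothetical helper: dot product of the first three components only.
DP_SKETCH_TARGET float dot3(const float* a, const float* b) {
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    // 0x71: sum the products of lanes 0..2, write the sum to lane 0 only.
    const __m128 d = _mm_dp_ps(va, vb, 0x71);
    return _mm_cvtss_f32(d);
}
```

For a = (1, 2, 3, 4) and b = (5, 6, 7, 8), dot4 gives 70 and dot3 gives 38.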

Second way using _mm_dp_ps and _mm_shuffle_ps

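Vect4D_SIMD Matrix_SIMD::operator * (const Vect4D_SIMD &v) const
{
	// Each _mm_dp_ps(..., 0xFF) broadcasts one dot product to all four lanes.
	// The inner shuffles build (d0, d0, d1, d1) and (d2, d2, d3, d3);
	// the outer shuffle takes lanes 0 and 2 of each, giving (d0, d1, d2, d3).
	return Vect4D_SIMD(_mm_shuffle_ps(
		_mm_shuffle_ps(_mm_dp_ps(v0._m, v._m, 0xFF), _mm_dp_ps(v1._m, v._m, 0xFF), _MM_SHUFFLE(0, 1, 0, 0)),
		_mm_shuffle_ps(_mm_dp_ps(v2._m, v._m, 0xFF), _mm_dp_ps(v3._m, v._m, 0xFF), _MM_SHUFFLE(0, 3, 0, 2)),
		_MM_SHUFFLE(2, 0, 2, 0)));
}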

Running time of the stress test in release mode. Matrix*Vect_SIMD: 1.618239s
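
The shuffle step can be checked on its own. _mm_shuffle_ps(a, b, _MM_SHUFFLE(z, y, x, w)) returns (a[w], a[x], b[y], b[z]): the two low lanes always come from the first operand and the two high lanes from the second. The sketch below uses the same three selectors as the listing above on broadcast inputs; the helper name packBroadcast is hypothetical, and _mm_set1_ps stands in for the broadcast dot products.

```cpp
#include <xmmintrin.h>  // SSE: _mm_shuffle_ps, _MM_SHUFFLE, _mm_set1_ps

// Hypothetical helper: packs four scalars, each broadcast across a register,
// into one register (d0, d1, d2, d3) and stores it to out[0..3].
void packBroadcast(float d0, float d1, float d2, float d3, float* out) {
    const __m128 b0 = _mm_set1_ps(d0);  // (d0, d0, d0, d0)
    const __m128 b1 = _mm_set1_ps(d1);
    const __m128 b2 = _mm_set1_ps(d2);
    const __m128 b3 = _mm_set1_ps(d3);

    // Lanes 0-1 come from the first operand, lanes 2-3 from the second.
    // Because every lane of an input holds the same value, the exact
    // selectors in the inner shuffles do not change the result.
    const __m128 lo = _mm_shuffle_ps(b0, b1, _MM_SHUFFLE(0, 1, 0, 0));  // (d0, d0, d1, d1)
    const __m128 hi = _mm_shuffle_ps(b2, b3, _MM_SHUFFLE(0, 3, 0, 2));  // (d2, d2, d3, d3)

    // Take lanes 0 and 2 of lo, then lanes 0 and 2 of hi.
    const __m128 packed = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2, 0, 2, 0));
    _mm_storeu_ps(out, packed);
}
```

Calling packBroadcast(1, 2, 3, 4, out) leaves out holding 1, 2, 3, 4 in that order.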

Third way using _mm_dp_ps combined with _mm_add_ps

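Vect4D_SIMD Matrix_SIMD::operator * (const Vect4D_SIMD &v) const
{
	// The low nibble of each mask (1, 2, 4, 8) writes the dot product to
	// lane 0, 1, 2 or 3 and zeroes the other lanes, so the adds merge
	// the four results into (d0, d1, d2, d3).
	__m128 result = _mm_dp_ps(v0._m, v._m, 0xF1);
	result = _mm_add_ps(result, _mm_dp_ps(v1._m, v._m, 0xF2));
	result = _mm_add_ps(result, _mm_dp_ps(v2._m, v._m, 0xF4));
	result = _mm_add_ps(result, _mm_dp_ps(v3._m, v._m, 0xF8));

	return Vect4D_SIMD(result);
}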

Running time of the stress test in release mode. Matrix*Vect_SIMD: 1.586541s
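
The reason the additions work is that _mm_dp_ps zeroes every lane that the low four mask bits do not select. A small sketch of a single call with mask 0xF4 makes this visible; the helper name dotIntoLane2 is hypothetical, and the target attribute is again only an assumption to let GCC or Clang build it without extra flags.

```cpp
#include <smmintrin.h>  // SSE4.1: _mm_dp_ps

// Assumption: enables SSE4.1 per function on GCC/Clang; empty elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define LANE_SKETCH_TARGET __attribute__((target("sse4.1")))
#else
#define LANE_SKETCH_TARGET
#endif

// Hypothetical helper: stores (0, 0, dot(a, b), 0) to out[0..3].
LANE_SKETCH_TARGET void dotIntoLane2(const float* a, const float* b, float* out) {
    const __m128 va = _mm_loadu_ps(a);
    const __m128 vb = _mm_loadu_ps(b);
    // High nibble 0xF: sum all four products.
    // Low nibble 0x4: write the sum to lane 2, zero lanes 0, 1 and 3.
    const __m128 d = _mm_dp_ps(va, vb, 0xF4);
    _mm_storeu_ps(out, d);
}
```

Since each of the four masks selects a different lane and zeroes the other three, adding the four results never mixes two dot products in the same lane.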

Fourth way using _mm_hadd_ps and _mm_mul_ps

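Vect4D_SIMD Matrix_SIMD::operator * (const Vect4D_SIMD &v) const
{
	// _mm_hadd_ps(a, b) = (a0+a1, a2+a3, b0+b1, b2+b3), so two levels of
	// hadd reduce the four element-wise products to (d0, d1, d2, d3).
	return Vect4D_SIMD(_mm_hadd_ps(
		_mm_hadd_ps(_mm_mul_ps(v0._m, v._m), _mm_mul_ps(v1._m, v._m)),
		_mm_hadd_ps(_mm_mul_ps(v2._m, v._m), _mm_mul_ps(v3._m, v._m))));
}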
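
The two-level horizontal add can be tried in isolation. In the sketch below the four inputs play the role of the element-wise products v0*v, v1*v, v2*v and v3*v from the listing above. The helper name horizontalSums is hypothetical, and the target attribute is an assumption that lets GCC or Clang build the SSE3 intrinsic without extra flags.

```cpp
#include <pmmintrin.h>  // SSE3: _mm_hadd_ps

// Assumption: enables SSE3 per function on GCC/Clang; empty elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define HADD_SKETCH_TARGET __attribute__((target("sse3")))
#else
#define HADD_SKETCH_TARGET
#endif

// Hypothetical helper: out[i] receives the sum of the four floats in p_i.
HADD_SKETCH_TARGET void horizontalSums(const float* p0, const float* p1,
                                       const float* p2, const float* p3,
                                       float* out) {
    // First level: adjacent pairs within each input.
    // s01 = (p0[0]+p0[1], p0[2]+p0[3], p1[0]+p1[1], p1[2]+p1[3])
    const __m128 s01 = _mm_hadd_ps(_mm_loadu_ps(p0), _mm_loadu_ps(p1));
    const __m128 s23 = _mm_hadd_ps(_mm_loadu_ps(p2), _mm_loadu_ps(p3));
    // Second level: adjacent pairs again, which completes each sum.
    // Result = (sum of p0, sum of p1, sum of p2, sum of p3)
    _mm_storeu_ps(out, _mm_hadd_ps(s01, s23));
}
```

With p0 = (1, 2, 3, 4), p1 = (5, 6, 7, 8), p2 = (9, 10, 11, 12) and p3 = (13, 14, 15, 16), out ends up as 10, 26, 42, 58.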

Four different ways to implement Matrix * Vector using SSE intrinsics