2.6 为Eigen库使能向量化

NOTE:此示例代码可以在 https://github.com/dev-cafe/cmake-cookbook/tree/v1.0/chapter-02/recipe-06 中找到，包含一个C++示例。该示例在CMake 3.5版(或更高版本)中是有效的，并且已经在GNU/Linux、macOS和Windows上进行过测试。

处理器的向量功能，可以提高代码的性能。对于某些类型的运算来说尤为甚之，例如：线性代数。本示例将展示如何使能矢量化，以便使用线性代数的Eigen C++库加速可执行文件。

准备工作

我们用Eigen C++模板库，用来进行线性代数计算，并展示如何设置编译器标志来启用向量化。这个示例的源代码linear-algebra.cpp文件:

#include <chrono>
#include <iostream>

#include <Eigen/Dense>

EIGEN_DONT_INLINE
double simple_function(Eigen::VectorXd &va, Eigen::VectorXd &vb)
{
  // this simple function computes the dot product of two vectors
  // of course it could be expressed more compactly
  double d = va.dot(vb);
  return d;
}

int main()
{
  int len = 1000000;
  int num_repetitions = 100;

  // generate two random vectors
  Eigen::VectorXd va = Eigen::VectorXd::Random(len);
  Eigen::VectorXd vb = Eigen::VectorXd::Random(len);

  double result;
  auto start = std::chrono::system_clock::now();
  for (auto i = 0; i < num_repetitions; i++)
  {
    result = simple_function(va, vb);
  }
  auto end = std::chrono::system_clock::now();
  auto elapsed_seconds = end - start;

  std::cout << "result: " << result << std::endl;
  std::cout << "elapsed seconds: " << elapsed_seconds.count() << std::endl;
}

我们期望向量化可以加快simple_function中的点积操作。

如何实施

根据Eigen库的文档，设置适当的编译器标志就足以生成向量化的代码。让我们看看CMakeLists.txt:

声明一个C++11项目:

cmake_minimum_required(VERSION 3.5 FATAL_ERROR)

project(recipe-06 LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 11)
set(CMAKE_CXX_EXTENSIONS OFF)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

使用Eigen库，我们需要在系统上找到它的头文件:
```
find_package(Eigen3 3.3 REQUIRED CONFIG)
```
CheckCXXCompilerFlag.cmake标准模块文件:
```
include(CheckCXXCompilerFlag)
```

检查-march=native编译器标志是否工作:

check_cxx_compiler_flag("-march=native" _march_native_works)

另一个选项-xHost编译器标志也开启:
```
check_cxx_compiler_flag("-xHost" _xhost_works)
```

设置了一个空变量_CXX_FLAGS，来保存刚才检查的两个编译器中找到的编译器标志。如果看到_march_native_works，我们将_CXX_FLAGS设置为-march=native。如果看到_xhost_works，我们将_CXX_FLAGS设置为-xHost。如果它们都不起作用，_CXX_FLAGS将为空，并禁用矢量化:

set(_CXX_FLAGS)
if(_march_native_works)
    message(STATUS "Using processor's vector instructions (-march=native compiler flag set)")
    set(_CXX_FLAGS "-march=native")
elseif(_xhost_works)
    message(STATUS "Using processor's vector instructions (-xHost compiler flag set)")
    set(_CXX_FLAGS "-xHost")
else()
    message(STATUS "No suitable compiler flag found for vectorization")
endif()

为了便于比较，我们还为未优化的版本定义了一个可执行目标，不使用优化标志:

add_executable(linear-algebra-unoptimized linear-algebra.cpp)

target_link_libraries(linear-algebra-unoptimized
  PRIVATE
      Eigen3::Eigen
  )

此外，我们定义了一个优化版本:

add_executable(linear-algebra linear-algebra.cpp)

target_compile_options(linear-algebra
  PRIVATE
      ${_CXX_FLAGS}
  )

target_link_libraries(linear-algebra
  PRIVATE
      Eigen3::Eigen
  )

让我们比较一下这两个可执行文件——首先我们配置(在本例中，-march=native_works):

$ mkdir -p build
$ cd build
$ cmake ..

...
-- Performing Test _march_native_works
-- Performing Test _march_native_works - Success
-- Performing Test _xhost_works
-- Performing Test _xhost_works - Failed
-- Using processor's vector instructions (-march=native compiler flag set)
...

最后，让我们编译可执行文件，并比较运行时间:

$ cmake --build .
$ ./linear-algebra-unoptimized

result: -261.505
elapsed seconds: 1.97964

$ ./linear-algebra

result: -261.505
elapsed seconds: 1.05048

工作原理

大多数处理器提供向量指令集，代码可以利用这些特性，获得更高的性能。由于线性代数运算可以从Eigen库中获得很好的加速，所以在使用Eigen库时，就要考虑向量化。我们所要做的就是，指示编译器为我们检查处理器，并为当前体系结构生成本机指令。不同的编译器供应商会使用不同的标志来实现这一点：GNU编译器使用-march=native标志来实现这一点，而Intel编译器使用-xHost标志。使用CheckCXXCompilerFlag.cmake模块提供的check_cxx_compiler_flag函数进行编译器标志的检查:

check_cxx_compiler_flag("-march=native" _march_native_works)

这个函数接受两个参数:

第一个是要检查的编译器标志。
第二个是用来存储检查结果(true或false)的变量。如果检查为真，我们将工作标志添加到_CXX_FLAGS变量中，该变量将用于为可执行目标设置编译器标志。

准备工作

如何实施

工作原理

更多信息