3.2 The OpenCL Platform Model

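An OpenCL platform consists of a host processor and one or more OpenCL devices. The platform model defines the roles of the host and the devices, and provides an abstract hardware model for devices. A device is divided into one or more compute units, and each compute unit is in turn divided into one or more processing elements. Figure 3.1 shows these relationships.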

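Figure 3.1 An OpenCL platform with multiple compute devices. Each compute device contains one or more compute units, and each compute unit is made up of one or more processing elements (PEs). A system can have several platforms present at the same time; for example, a single system can have both an AMD platform and an Intel platform.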

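The platform model is central to application development because it is what makes OpenCL code portable across OpenCL-capable systems. Even a single system can have several different OpenCL platforms, and different applications can use different ones. The platform model's API lets an OpenCL application discover the available platforms and compute devices, choose among them, and run on the ones it selects.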

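An application uses the OpenCL runtime API to select the platform supplied by a particular vendor. A platform, however, can only target and interact with the devices that its vendor's implementation exposes. For example, if you select company A's platform, you cannot use it to reach company B's GPU. That said, the hardware a platform targets does not have to be manufactured by the platform's vendor: the AMD and Intel implementations, for example, can both use another company's x86 CPU as a device.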

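When programmers write OpenCL C code, the device architecture is abstracted into the platform model. Vendors then only need to map this abstract architecture onto their physical hardware. The platform model defines a device as a set of compute units, each of which operates independently. Each compute unit is further divided into processing elements. Figure 3.1 illustrates this hierarchy. As an example, the AMD Radeon R9 290X graphics card (the device) contains 44 vector processors (compute units). Each compute unit consists of four 16-lane SIMD engines, for a total of 64 SIMD lanes (processing elements) per compute unit. Each SIMD lane on the Radeon R9 290X executes a scalar instruction, so the whole GPU can execute 44 × 16 × 4 = 2816 instructions at a time.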
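The Radeon R9 290X arithmetic above can be checked with a few lines of C. This is only a sketch of the calculation: the constant and function names are illustrative and are not part of the OpenCL API; the three figures are the ones quoted in the text.

```c
/* Hierarchy figures quoted in the text for the AMD Radeon R9 290X.
 * The names below are illustrative, not OpenCL identifiers. */
enum {
    COMPUTE_UNITS  = 44, /* compute units on the device            */
    SIMD_PER_CU    = 4,  /* SIMD engines in each compute unit      */
    LANES_PER_SIMD = 16  /* lanes (processing elements) per engine */
};

/* Processing elements contained in one compute unit. */
unsigned pes_per_compute_unit(void)
{
    return SIMD_PER_CU * LANES_PER_SIMD;
}

/* Processing elements on the whole device. */
unsigned pes_per_device(void)
{
    return COMPUTE_UNITS * pes_per_compute_unit();
}
```

Calling pes_per_compute_unit() gives the 64 lanes per compute unit, and pes_per_device() gives the 2816 lanes for the whole GPU.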

3.2.1 Platforms and Devices

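The clGetPlatformIDs() API discovers the set of OpenCL platforms available on a given system. An OpenCL program typically calls it twice to query and then retrieve the platforms. The first call passes a pointer to an unsigned integer as the num_platforms argument and NULL as the platforms argument, which returns the number of platforms available on the system. The programmer can then allocate space for that many platform objects (of type cl_platform_id), pointed to by platforms. The second call passes platforms in to retrieve that number of platform objects. Once the platforms have been discovered, clGetPlatformInfo() can be used to query each one (for example, its vendor) and decide which platform to run the OpenCL program on. clGetPlatformIDs() is called before the other OpenCL APIs; the vector addition source code in Section 3.6 shows it in context.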

cl_int
clGetPlatformIDs(
  cl_uint num_entries,
  cl_platform_id *platforms,
  cl_uint *num_platforms)

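Once a platform has been chosen, the next step is to query the devices available on it. The clGetDeviceIDs() API does this, and it is used in much the same way as clGetPlatformIDs(). It takes a platform object and a device type as additional arguments, but obtaining devices still follows the same three simple steps: first, query the number of devices; second, allocate space for that many device objects; third, retrieve the desired devices. The device_type argument restricts the query to GPUs only (CL_DEVICE_TYPE_GPU), CPUs only (CL_DEVICE_TYPE_CPU), or all devices (CL_DEVICE_TYPE_ALL), among other options. The same platform and device_type should be passed on every call to clGetDeviceIDs() so that the count and the retrieved list match. As with platforms, the clGetDeviceInfo() API can be used to query each device's name, type, and vendor.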

cl_int
clGetDeviceIDs(
  cl_platform_id platform,
  cl_device_type device_type,
  cl_uint num_entries,
  cl_device_id *devices,
  cl_uint *num_devices)

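The AMD Accelerated Parallel Processing Software Development Kit (APP SDK) includes a program called clinfo, which uses clGetPlatformInfo() and clGetDeviceInfo() to report the platforms and devices in a system. It can also report hardware details such as memory size and bus bandwidth. Before moving on to other OpenCL features, we pause briefly to look at the output of clinfo, shown in Figure 3.2.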

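Translator's note: the clinfo output from the translator's machine, shown first, indicates that the translator and the original book used different versions of the AMD APP SDK. Judging from the output, the original book appears to have omitted some of the hardware details.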

Number of platforms:                             3
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 1.2 CUDA 8.0.0
  Platform Name:                                 NVIDIA CUDA
  Platform Vendor:                               NVIDIA Corporation
  Platform Extensions:                           
   cl_khr_global_int32_base_atomics 
   cl_khr_global_int32_extended_atomics 
   cl_khr_local_int32_base_atomics 
   cl_khr_local_int32_extended_atomics 
   cl_khr_fp64 
   cl_khr_byte_addressable_store 
   cl_khr_icd 
   cl_khr_gl_sharing 
   cl_nv_compiler_options 
   cl_nv_device_attribute_query 
   cl_nv_pragma_unroll 
   cl_nv_d3d10_sharing 
   cl_khr_d3d10_sharing 
   cl_nv_d3d11_sharing 
   cl_nv_copy_opts

  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 1.2
  Platform Name:                                 Intel(R) OpenCL
  Platform Vendor:                               Intel(R) Corporation
  Platform Extensions:                           
   cl_intel_dx9_media_sharing 
   cl_khr_3d_image_writes 
   cl_khr_byte_addressable_store 
   cl_khr_d3d11_sharing 
   cl_khr_depth_images 
   cl_khr_dx9_media_sharing 
   cl_khr_gl_sharing 
   cl_khr_global_int32_base_atomics
   cl_khr_global_int32_extended_atomics 
   cl_khr_icd 
   cl_khr_local_int32_base_atomics 
   cl_khr_local_int32_extended_atomics 
   cl_khr_spir
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.0 AMD-APP (1800.8)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           
   cl_khr_icd 
   cl_khr_d3d10_sharing 
   cl_khr_d3d11_sharing 
   cl_khr_dx9_media_sharing 
   cl_amd_event_callback 
   cl_amd_offline_devices

  Platform Name:                                 NVIDIA CUDA
Number of devices:                               1
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     10deh
  Max compute units:                             4
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           64
  Max work group size:                           1024
  Preferred vector width char:                   1
  Preferred vector width short:                  1
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      1
  Native vector width short:                     1
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    1
  Max clock frequency:                           862Mhz
  Address bits:                                  64
  Max memory allocation:                         536870912
  Image support:                                 Yes
  Max number of images read arguments:           256
  Max number of images write arguments:          16
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            4096
  Max image 3D height:                           4096
  Max image 3D depth:                            4096
  Max samplers within kernel:                    32
  Max size of kernel argument:                   4352
  Alignment (bits) of base address:              4096
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               128
  Cache size:                                    65536
  Global memory size:                            2147483648
  Constant buffer size:                          65536
  Max number of constant args:                   9
  Local memory type:                             Scratchpad
  Local memory size:                             49152
  Kernel Preferred work group size multiple:     32
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1000
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Platform ID:                                   000002D3A374DC10
  Name:                                          GeForce GTX 765M
  Vendor:                                        NVIDIA Corporation
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                375.95
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 CUDA
  Extensions:                                    
   cl_khr_global_int32_base_atomics 
   cl_khr_global_int32_extended_atomics 
   cl_khr_local_int32_base_atomics 
   cl_khr_local_int32_extended_atomics 
   cl_khr_fp64 
   cl_khr_byte_addressable_store 
   cl_khr_icd 
   cl_khr_gl_sharing 
   cl_nv_compiler_options 
   cl_nv_device_attribute_query 
   cl_nv_pragma_unroll 
   cl_nv_d3d10_sharing 
   cl_khr_d3d10_sharing 
   cl_nv_d3d11_sharing 
   cl_nv_copy_opts

  Platform Name:                                 Intel(R) OpenCL
Number of devices:                               2
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Vendor ID:                                     8086h
  Max compute units:                             20
  Max work items dimensions:                     3
    Max work items[0]:                           512
    Max work items[1]:                           512
    Max work items[2]:                           512
  Max work group size:                           512
  Preferred vector width char:                   1
  Preferred vector width short:                  1
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 0
  Native vector width char:                      1
  Native vector width short:                     1
  Native vector width int:                       1
  Native vector width long:                      1
  Native vector width float:                     1
  Native vector width double:                    0
  Max clock frequency:                           1150Mhz
  Address bits:                                  64
  Max memory allocation:                         427189862
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          128
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    16
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             No
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    262144
  Global memory size:                            1708759450
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             65536
  Kernel Preferred work group size multiple:     32
  Error correction support:                      0
  Unified memory for Host and Device:            1
  Profiling timer resolution:                    80
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue on Host properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   000002D3A374C760
  Name:                                          Intel(R) HD Graphics 4600
  Vendor:                                        Intel(R) Corporation
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                20.19.15.4531
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2
  Extensions:                                    
   cl_intel_accelerator 
   cl_intel_advanced_motion_estimation 
   cl_intel_ctz 
   cl_intel_d3d11_nv12_media_sharing 
   cl_intel_dx9_media_sharing 
   cl_intel_motion_estimation
   cl_intel_simultaneous_sharing
   cl_intel_subgroups 
   cl_khr_3d_image_writes 
   cl_khr_byte_addressable_store 
   cl_khr_d3d10_sharing 
   cl_khr_d3d11_sharing
   cl_khr_depth_images
   cl_khr_dx9_media_sharing 
   cl_khr_gl_depth_images
   cl_khr_gl_event 
   cl_khr_gl_msaa_sharing
   cl_khr_global_int32_base_atomics 
   cl_khr_global_int32_extended_atomics 
   cl_khr_gl_sharing 
   cl_khr_icd 
   cl_khr_image2d_from_buffer 
   cl_khr_local_int32_base_atomics 
   cl_khr_local_int32_extended_atomics 
   cl_khr_spir

  Device Type:                                   CL_DEVICE_TYPE_CPU
  Vendor ID:                                     8086h
  Max compute units:                             8
  Max work items dimensions:                     3
    Max work items[0]:                           8192
    Max work items[1]:                           8192
    Max work items[2]:                           8192
  Max work group size:                           8192
  Preferred vector width char:                   1
  Preferred vector width short:                  1
  Preferred vector width int:                    1
  Preferred vector width long:                   1
  Preferred vector width float:                  1
  Preferred vector width double:                 1
  Native vector width char:                      32
  Native vector width short:                     16
  Native vector width int:                       8
  Native vector width long:                      4
  Native vector width float:                     8
  Native vector width double:                    4
  Max clock frequency:                           2400Mhz
  Address bits:                                  64
  Max memory allocation:                         2126515200
  Image support:                                 Yes
  Max number of images read arguments:           480
  Max number of images write arguments:          480
  Max image 2D width:                            16384
  Max image 2D height:                           16384
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    480
  Max size of kernel argument:                   3840
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               No
    Round to +ve and infinity:                   No
    IEEE754-2008 fused multiply-add:             No
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    262144
  Global memory size:                            8506060800
  Constant buffer size:                          131072
  Max number of constant args:                   480
  Local memory type:                             Global
  Local memory size:                             32768
  Kernel Preferred work group size multiple:     128
  Error correction support:                      0
  Unified memory for Host and Device:            1
  Profiling timer resolution:                    427
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     Yes
  Queue on Host properties:
    Out-of-Order:                                Yes
    Profiling :                                  Yes
  Platform ID:                                   000002D3A374C760
  Name:                                          Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  Vendor:                                        Intel(R) Corporation
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                5.2.0.10094
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 (Build 10094)
  Extensions:                                    
   cl_khr_icd 
   cl_khr_global_int32_base_atomics 
   cl_khr_global_int32_extended_atomics 
   cl_khr_local_int32_base_atomics 
   cl_khr_local_int32_extended_atomics 
   cl_khr_byte_addressable_store 
   cl_khr_depth_images 
   cl_khr_3d_image_writes 
   cl_intel_exec_by_local_thread 
   cl_khr_spir 
   cl_khr_dx9_media_sharing 
   cl_intel_dx9_media_sharing 
   cl_khr_d3d11_sharing 
   cl_khr_gl_sharing 
   cl_khr_fp64

  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               1
  Device Type:                                   CL_DEVICE_TYPE_CPU
  Vendor ID:                                     1002h
  Board name:
  Max compute units:                             8
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           1024
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  8
  Preferred vector width double:                 4
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     8
  Native vector width double:                    4
  Max clock frequency:                           2394Mhz
  Address bits:                                  64
  Max memory allocation:                         2147483648
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          64
  Max image 2D width:                            8192
  Max image 2D height:                           8192
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    16
  Max size of kernel argument:                   4096
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    32768
  Global memory size:                            8506060800
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Global
  Local memory size:                             32768
  Max pipe arguments:                            16
  Max pipe active reservations:                  16
  Max pipe packet size:                          2147483648
  Max global variable size:                      1879048192
  Max global variable preferred total size:      1879048192
  Max read/write image args:                     64
  Max on device events:                          0
  Queue on device max size:                      0
  Max on device queues:                          0
  Queue on device preferred size:                0
  SVM capabilities:
    Coarse grain buffer:                         No
    Fine grain buffer:                           No
    Fine grain system:                           No
    Atomics:                                     No
  Preferred platform atomic alignment:           0
  Preferred global atomic alignment:             0
  Preferred local atomic alignment:              0
  Kernel Preferred work group size multiple:     1
  Error correction support:                      0
  Unified memory for Host and Device:            1
  Profiling timer resolution:                    427
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     Yes
  Queue on Host properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Queue on Device properties:
    Out-of-Order:                                No
    Profiling :                                  No
  Platform ID:                                   00007FFB80F36D30
  Name:                                          Intel(R) Core(TM) i7-4700MQ CPU @ 2.40GHz
  Vendor:                                        GenuineIntel
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                1800.8 (sse2,avx)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 AMD-APP (1800.8)
  Extensions:                                    
   cl_khr_fp64 
   cl_amd_fp64 
   cl_khr_global_int32_base_atomics 
   cl_khr_global_int32_extended_atomics 
   cl_khr_local_int32_base_atomics 
   cl_khr_local_int32_extended_atomics 
   cl_khr_int64_base_atomics 
   cl_khr_int64_extended_atomics 
   cl_khr_3d_image_writes 
   cl_khr_byte_addressable_store 
   cl_khr_gl_sharing 
   cl_ext_device_fission 
   cl_amd_device_attribute_query 
   cl_amd_vec3 
   cl_amd_printf 
   cl_amd_media_ops 
   cl_amd_media_ops2 
   cl_amd_popcnt 
   cl_khr_d3d10_sharing 
   cl_khr_spir 
   cl_khr_gl_event

clinfo output from the original book

Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 2.0 AMD-APP (1642.5)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           
   cl_khr_icd 
   cl_khr_d3d10_sharing 
   cl_khr_icd 
   cl_amd_event_callback 
   cl_amd_offline_devices
  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               2
  Vendor ID:                                     1002h
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Board name:                                    AMD Radeon R9 200 Series
  Device Topology:                               PCI[B#1, D#0, F#0]
  Max compute units:                             40
  Max work group size:                           256
  Native vector width int:                       1
  Max clock frequency:                           1000Mhz
  Max memory allocation:                         2505572352
  Image support:                                 Yes
  Max image 3D width:                            2048
  Cache line size:                               64
  Global memory size:                            3901751296
  Platform ID:                                   0x7f54fb22cfd0
  Name:                                          Hawaii
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 2.0
  Driver version:                                1642.5(VM)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 2.0 AMD-APP (1642.5)
  Extensions:                                    
   cl_khr_fp64 
   cl_amd_fp64 
   cl_khr_global_int32_base_atomics 
   cl_khr_global_int32_extended_atomics 
   cl_khr_local_int32_base_atomics

  Device Type:                                   CL_DEVICE_TYPE_CPU
  Vendor ID:                                     1002h
  Board name:
  Max compute units:                             8
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
  Name:                                          AMD FX(tm)-8120 Eight-Core Processor
  Vendor:                                        AuthenticAMD
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                1642.5(sse2, avx, fma4)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 (Build 10094)

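Figure 3.2 Some of the OpenCL platform and device information printed by the clinfo program. The original book's output shows that the AMD platform has two devices (a GPU and a CPU). All of this information can be queried through the platform API.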
