Installing CUDA on Windows 10 and Running the Sample Programs

Installing CUDA

Download the installer from the CUDA download page.

Once the download finishes, run the installer.

Downloading the required packages takes a while.

Verifying the installation

After the installation completes, open a command prompt and confirm that the nvcc command runs.

>nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Mon_Jan__9_17:32:33_CST_2017
Cuda compilation tools, release 8.0, V8.0.60
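
If you want to check the installed release from a script rather than by eye, the version line can be pulled out of the `nvcc -V` text. The following is a minimal sketch: the function name `parse_nvcc_version` is my own, and the regular expression assumes the "release X.Y, VX.Y.Z" layout seen in the output above, which other toolkit versions may not follow exactly.

```python
import re

# Sample text copied from the `nvcc -V` output shown above.
NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Mon_Jan__9_17:32:33_CST_2017
Cuda compilation tools, release 8.0, V8.0.60
"""


def parse_nvcc_version(text):
    """Return (release, full_version) found in `nvcc -V` text, or None.

    Hypothetical helper: assumes the 'release X.Y, VX.Y.Z' layout above.
    """
    match = re.search(r"release\s+(\d+\.\d+),\s+V(\d+\.\d+\.\d+)", text)
    if match is None:
        return None
    return match.group(1), match.group(2)


release, full_version = parse_nvcc_version(NVCC_OUTPUT)
print(release, full_version)
```

To run this against a live installation, the text could instead come from `subprocess.run(["nvcc", "-V"], capture_output=True, text=True).stdout`, assuming nvcc is on the PATH.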

Next, build and run deviceQuery to confirm that CUDA is installed and configured correctly.

1. In Explorer, go to the following folder:
   C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\1_Utilities\deviceQuery
2. Double-click deviceQuery_vs2010.sln.
3. Once Visual Studio starts, select Build->Build Solution to build the sample. On success, the Output pane shows:

   ========== Build: 1 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

4. deviceQuery.exe should now exist in the following folder:
   C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\bin\win64\Debug
   Running it from the command prompt prints information about the GPU devices that CUDA recognizes.

> deviceQuery.exe
 deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GPU"
  CUDA Driver Version / Runtime Version          8.0 / 8.0
  CUDA Capability Major/Minor version number:    5.0
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            993 MHz (0.99 GHz)
  Memory Clock rate:                             2505 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GPU
Result = PASS
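
A few of the figures in this output can be cross-checked against each other: the total core count should equal multiprocessors times cores per multiprocessor, and the MBytes figure should match the byte count. The sketch below does that arithmetic on lines copied from the output above. The function name `check_device_query` and the regular expressions are my own assumptions about the line layout, not part of the CUDA Samples.

```python
import re

# Lines copied from the deviceQuery output above.
DEVICE_QUERY_LINES = """\
  Total amount of global memory:                 1024 MBytes (1073741824 bytes)
  ( 3) Multiprocessors, (128) CUDA Cores/MP:     384 CUDA Cores
Result = PASS
"""


def check_device_query(text):
    """Cross-check memory and core figures in deviceQuery text (hypothetical helper)."""
    mem = re.search(r"(\d+) MBytes \((\d+) bytes\)", text)
    cores = re.search(
        r"\(\s*(\d+)\) Multiprocessors, \(\s*(\d+)\) CUDA Cores/MP:\s+(\d+) CUDA Cores",
        text,
    )
    if mem is None or cores is None:
        raise ValueError("expected deviceQuery lines not found")

    mbytes, total_bytes = int(mem.group(1)), int(mem.group(2))
    multiprocessors, cores_per_mp, total_cores = (int(g) for g in cores.groups())

    return {
        # deviceQuery reports MBytes as bytes divided by 2**20.
        "memory_consistent": total_bytes == mbytes * 1024 * 1024,
        "cores_consistent": multiprocessors * cores_per_mp == total_cores,
        "passed": "Result = PASS" in text,
    }


print(check_device_query(DEVICE_QUERY_LINES))
```

For the output above, 3 multiprocessors with 128 cores each gives the reported 384 cores, and 1024 x 2^20 gives the reported 1073741824 bytes.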

Running bandwidthTest

The CUDA Samples include many samples besides deviceQuery. Here I run bandwidthTest, one of the utilities.

Before running bandwidthTest, a note on how data flows when a GPU is used. A GPU board carries its own memory, separate from the host's memory. Data used in GPU computation is first transferred from host memory to GPU memory.

bandwidthTest measures three transfer rates:

① Host to Device Bandwidth (transfer rate from host memory to GPU memory)
② Device to Host Bandwidth (transfer rate from GPU memory to host memory)
③ Device to Device Bandwidth (transfer rate from GPU memory to GPU memory)

The --memory option selects either "pageable" or "pinned". With pageable, the host-side buffer used for the transfers is allocated as ordinary pageable memory in main memory. With pinned, the host-side buffer is allocated as page-locked memory, which is never paged out.

To run it, build the source with Visual Studio as with deviceQuery, then run bandwidthTest.exe from the command prompt.

The results are as follows.

With --memory=pageable

C:\ProgramData\NVIDIA Corporation\CUDA Samples\v8.0\bin\win64\Debug>bandwidthTest.exe --memory=pageable
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GPU
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1312.4

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1353.4

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     34166.0

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

With --memory=pinned

>bandwidthTest.exe --memory=pinned
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: GeForce GPU
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1526.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1611.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     34167.1

Result = PASS

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
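
To compare the two runs side by side, the bandwidth figures can be extracted from the text and divided. This is a rough sketch working on the numbers copied from the two outputs above. `parse_bandwidth` is a hypothetical helper of mine, and its regular expressions assume the Quick Mode layout shown here.

```python
import re

# Measurement sections copied from the two bandwidthTest runs above.
PAGEABLE_OUTPUT = """\
 Host to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1312.4

 Device to Host Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1353.4

 Device to Device Bandwidth, 1 Device(s)
 PAGEABLE Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     34166.0
"""

PINNED_OUTPUT = """\
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1526.8

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     1611.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     34167.1
"""


def parse_bandwidth(text):
    """Map each transfer direction to its reported bandwidth in MB/s."""
    results = {}
    direction = None
    for line in text.splitlines():
        header = re.match(r"\s*(.+) Bandwidth, \d+ Device\(s\)", line)
        if header:
            direction = header.group(1)
            continue
        # A data row is "<transfer size> <bandwidth>" and nothing else.
        row = re.match(r"\s*(\d+)\s+([\d.]+)\s*$", line)
        if row and direction is not None:
            results[direction] = float(row.group(2))
            direction = None
    return results


pageable = parse_bandwidth(PAGEABLE_OUTPUT)
pinned = parse_bandwidth(PINNED_OUTPUT)
for name in pageable:
    ratio = pinned[name] / pageable[name]
    print(f"{name}: pinned is {ratio:.3f}x pageable")
```

On these figures, pinned memory comes out roughly 1.16x faster for Host to Device and 1.19x faster for Device to Host, while Device to Device is essentially unchanged.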

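About the bandwidthTest results

If data transfer is slow, the transfer overhead can cancel out the speedup gained from using the GPU. I do not have benchmark figures from machines with other specs, so I cannot say whether these results on the Surface Book are fast or slow. Still, the Surface Book places the CPU and memory on the display side and the GPU on the keyboard side, physically apart, and as I found in my earlier post about buying a Surface Book to play with GPGPU and running into a high barrier, its GPU is not a high-spec one. Given both points, I expect the transfer rate is not high either.
With pageable host memory, transfers are somewhat slower than with page-locked memory. The most likely cause is that the driver has to stage pageable data through a page-locked buffer before the copy, which adds an extra host-side copy; actual swap-out and swap-in may contribute but is not needed to explain the gap. The Device to Device rate is practically identical in both runs, as expected, since that transfer does not touch host memory.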
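
To put the overhead concern into numbers, the sketch below estimates how long a transfer takes at the measured rates and whether offloading still pays off. Everything except the measured bandwidths is an assumption for illustration: the CPU and GPU compute times are made-up values, the helper names are mine, and I assume that the "MB" in bandwidthTest's MB/s means 2^20 bytes.

```python
# Assumption: bandwidthTest's "MB" is 2**20 bytes.
BYTES_PER_MB = 1 << 20

# Measured pageable rates from the run above, in MB/s.
H2D_PAGEABLE = 1312.4
D2H_PAGEABLE = 1353.4


def transfer_ms(num_bytes, bandwidth_mb_per_s):
    """Estimated time in milliseconds to move num_bytes at the given rate."""
    return num_bytes / (bandwidth_mb_per_s * BYTES_PER_MB) * 1000.0


def gpu_total_ms(gpu_compute_ms, bytes_in, bytes_out, h2d, d2h):
    """Upload + compute + download, assuming no overlap between the three."""
    return transfer_ms(bytes_in, h2d) + gpu_compute_ms + transfer_ms(bytes_out, d2h)


size = 33554432  # the 32 MiB transfer size used by Quick Mode above

upload = transfer_ms(size, H2D_PAGEABLE)
download = transfer_ms(size, D2H_PAGEABLE)
print(f"upload {upload:.2f} ms, download {download:.2f} ms")

# Hypothetical workload: 10 ms on the GPU versus 100 ms or 50 ms on the CPU.
total = gpu_total_ms(10.0, size, size, H2D_PAGEABLE, D2H_PAGEABLE)
for cpu_ms in (100.0, 50.0):
    verdict = "GPU wins" if total < cpu_ms else "CPU wins"
    print(f"CPU {cpu_ms:.0f} ms vs GPU total {total:.2f} ms: {verdict}")
```

Under these assumptions the two transfers alone cost about 48 ms, so a kernel that is ten times faster than a 50 ms CPU routine would still lose once the copies are included.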

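Closing

I installed CUDA and used the sample applications bundled with it to check the device information and the data transfer rates. Now that CUDA is working, next up is machine learning on the GPU with TensorFlow!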

焼肉が食べたい (I Want to Eat Yakiniku)

Just a diary. I plan to write down technical things I learn as well, but these are notes for myself. My profile is here: https://chie8842.github.io/aboutme/