Sunday, March 13, 2016

Using TensorFlow 0.7 from Python on a laptop running Manjaro Linux.

The GPU version of TensorFlow was tested on a laptop running the Manjaro Linux distribution.

Checking the Nvidia driver installation:

First, check that the GPU is present:


$ lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
        Subsystem: CLEVO/KAPOK Computer Device 5000
        Kernel driver in use: i915
--
01:00.0 3D controller: NVIDIA Corporation GK208M [GeForce GT 740M] (rev ff)
        Kernel modules: nouveau, nvidia
03:00.0 Network controller: Intel Corporation Wireless 7260 (rev 73)
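
The same check can be scripted. The following is only a minimal sketch: the function names list_graphics_controllers and has_nvidia_gpu are illustrative, and the sample text is copied from the lspci output above rather than read from the live system.

```python
import re

# Sample text copied from the lspci output above.
LSPCI_OUTPUT = """\
00:02.0 VGA compatible controller: Intel Corporation 4th Gen Core Processor Integrated Graphics Controller (rev 06)
        Subsystem: CLEVO/KAPOK Computer Device 5000
        Kernel driver in use: i915
--
01:00.0 3D controller: NVIDIA Corporation GK208M [GeForce GT 740M] (rev ff)
        Kernel modules: nouveau, nvidia
"""


def list_graphics_controllers(text):
    """Return (pci_address, kind, description) for each VGA/3D controller line."""
    pattern = re.compile(r"^(\S+) (VGA compatible controller|3D controller): (.+)$")
    found = []
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            found.append(match.groups())
    return found


def has_nvidia_gpu(text):
    """True if any listed graphics controller is an NVIDIA device."""
    return any("NVIDIA" in desc for _, _, desc in list_graphics_controllers(text))


controllers = list_graphics_controllers(LSPCI_OUTPUT)
for address, kind, description in controllers:
    print("{0}  {1}: {2}".format(address, kind, description))
print(has_nvidia_gpu(LSPCI_OUTPUT))
```

On this laptop the sketch lists two controllers: the integrated Intel one and the NVIDIA GT 740M.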

Check if the GPU can be harnessed

For example, using glxgears. Without the NVIDIA GPU, glxgears yields:


$ glxgears
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
453 frames in 5.0 seconds = 90.581 FPS
302 frames in 5.0 seconds = 60.288 FPS
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
      after 3146 requests (3146 known processed) with 0 events remaining.

With optirun, which runs the program on the NVIDIA GPU:



$ optirun glxgears
9891 frames in 5.0 seconds = 1977.995 FPS
10186 frames in 5.0 seconds = 2037.131 FPS
9953 frames in 5.0 seconds = 1990.527 FPS
9908 frames in 5.0 seconds = 1981.540 FPS
10129 frames in 5.0 seconds = 2025.626 FPS

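With optirun, glxgears renders about 2000 FPS instead of 60 to 90 FPS, a factor of at least 20. The first run is synchronized to the vertical refresh, as its own output states, so the comparison mainly shows that the NVIDIA GPU can be reached through optirun.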
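
The ratio between the two runs can be estimated by averaging the FPS lines of each. This is a rough sketch only: the helper name mean_fps is illustrative, and the sample text is copied from the two glxgears runs above. Since the first run is capped by the vertical refresh, the resulting ratio is indicative at best.

```python
import re

# FPS lines copied from the plain glxgears run above.
INTEL_RUN = """\
453 frames in 5.0 seconds = 90.581 FPS
302 frames in 5.0 seconds = 60.288 FPS
"""

# FPS lines copied from the optirun glxgears run above.
OPTIRUN_RUN = """\
9891 frames in 5.0 seconds = 1977.995 FPS
10186 frames in 5.0 seconds = 2037.131 FPS
9953 frames in 5.0 seconds = 1990.527 FPS
9908 frames in 5.0 seconds = 1981.540 FPS
10129 frames in 5.0 seconds = 2025.626 FPS
"""


def mean_fps(text):
    """Average the values found in 'N frames in T seconds = X FPS' lines."""
    values = [float(v) for v in re.findall(r"= ([0-9.]+) FPS", text)]
    return sum(values) / len(values)


intel = mean_fps(INTEL_RUN)
nvidia = mean_fps(OPTIRUN_RUN)
print("without optirun: {0:.1f} FPS".format(intel))
print("with optirun:    {0:.1f} FPS".format(nvidia))
print("ratio:           {0:.1f}".format(nvidia / intel))
```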

Installing CUDA 7.5 and cuDNN

CUDA 7.5 was installed from the Manjaro repository using the Octopi GUI. cuDNN was also installed using Octopi, but the library first has to be downloaded from the NVIDIA developer website.

In Manjaro Linux, CUDA is installed in /opt/cuda/. To check the CUDA installation, the sample files were built in the user's home directory, as explained in the NVIDIA documentation.

Copy the directory /opt/cuda/samples into your home directory, for example:

cp -r /opt/cuda/samples/ ~jeanpat/

Build the test applications in the copied samples directory with make (or make -j4 on a 4-core CPU):

$ ls -l
total 144
drwxr-xr-x 48 jeanpat users  4096 11 mai   12:10 0_Simple
drwxr-xr-x  7 jeanpat users  4096 11 mai   12:10 1_Utilities
drwxr-xr-x 12 jeanpat users  4096 11 mai   12:10 2_Graphics
drwxr-xr-x 21 jeanpat users  4096 11 mai   12:10 3_Imaging
drwxr-xr-x 10 jeanpat users  4096 11 mai   12:10 4_Finance
drwxr-xr-x 10 jeanpat users  4096 11 mai   12:10 5_Simulations
drwxr-xr-x 31 jeanpat users  4096 11 mai   12:10 6_Advanced
drwxr-xr-x 37 jeanpat users  4096 11 mai   12:10 7_CUDALibraries
drwxr-xr-x  6 jeanpat users  4096 11 mai   12:10 common
-rw-r--r--  1 jeanpat users 96407 11 mai   12:10 EULA.txt
-rw-r--r--  1 jeanpat users  2652 11 mai   12:10 Makefile
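
The number of parallel make jobs can be derived from the CPU core count instead of being hard-coded. The sketch below is an assumption-laden illustration: make_command is a hypothetical helper, and SAMPLES_DIR is a guess at where the samples were copied, to be adjusted to your own home directory. The actual build call is left as a comment so that nothing runs by accident.

```python
import multiprocessing
import os


def make_command(jobs=None):
    """Build the make invocation; default to one job per CPU core."""
    if jobs is None:
        jobs = multiprocessing.cpu_count()
    return ["make", "-j{0}".format(jobs)]


# Hypothetical location of the copied samples; adjust to your own setup.
SAMPLES_DIR = os.path.join(os.path.expanduser("~"), "samples")

# On a 4-core CPU this gives the command mentioned above.
print(" ".join(make_command(4)))

# To actually build (requires "import subprocess"):
# subprocess.check_call(make_command(), cwd=SAMPLES_DIR)
```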



The deviceQuery binary can then be launched using optirun:

$ optirun ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)


Detected 1 CUDA Capable device(s)

Device 0: "GeForce GT 740M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 2048 MBytes (2147352576 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
  GPU Max Clock rate:                            1032 MHz (1.03 GHz)
  Memory Clock rate:                             900 Mhz
  Memory Bus Width:                              64-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = GeForce GT 740M
Result = PASS
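
The key facts in this report (device name, compute capability, number of CUDA cores, final result) can be extracted with a few regular expressions. This is a minimal sketch, assuming the deviceQuery output format shown above; the function name summarize and the returned dictionary keys are illustrative, and the sample text is an excerpt of the report.

```python
import re

# Excerpt copied from the deviceQuery report above.
DEVICE_QUERY_EXCERPT = """\
Device 0: "GeForce GT 740M"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 2048 MBytes (2147352576 bytes)
  ( 2) Multiprocessors, (192) CUDA Cores/MP:     384 CUDA Cores
Result = PASS
"""


def summarize(text):
    """Extract a few fields from a deviceQuery report."""
    name = re.search(r'Device \d+: "([^"]+)"', text).group(1)
    capability = re.search(
        r"Capability Major/Minor version number:\s+(\d+)\.(\d+)", text)
    cores = re.search(
        r"\(\s*(\d+)\) Multiprocessors, \(\s*(\d+)\) CUDA Cores/MP", text)
    return {
        "name": name,
        "capability": (int(capability.group(1)), int(capability.group(2))),
        # Total cores = multiprocessors x cores per multiprocessor.
        "cuda_cores": int(cores.group(1)) * int(cores.group(2)),
        "passed": "Result = PASS" in text,
    }


summary = summarize(DEVICE_QUERY_EXCERPT)
for key in ("name", "capability", "cuda_cores", "passed"):
    print("{0}: {1}".format(key, summary[key]))
```

The core count computed from the two factors (2 multiprocessors of 192 cores each) agrees with the 384 CUDA cores that deviceQuery reports.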

Running TensorFlow 0.7 from Python

The GPU-accelerated version of TensorFlow was installed in a virtual environment, following the documentation. The latest version can then be installed using:

pip install --upgrade tensorflow

According to the documentation, the TensorFlow installation can be checked as follows:
$ python
...
>>> import tensorflow as tf
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
>>> print(sess.run(hello))
Hello, TensorFlow!
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> print(sess.run(a + b))
42
>>>


However, for TensorFlow to reach the NVIDIA GPU, the Python interpreter must be launched with optirun:

$ optirun python
Python 2.7.11 (default, Mar  3 2016, 11:00:04)
[GCC 5.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:105] successfully opened CUDA library libcurand.so locally
>>> hello = tf.constant('Hello, TensorFlow!')
>>> sess = tf.Session()
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_init.cc:102] Found device 0 with properties:
name: GeForce GT 740M
major: 3 minor: 5 memoryClockRate (GHz) 1.0325
pciBusID 0000:01:00.0
Total memory: 2.00GiB
Free memory: 1.97GiB
I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:717] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GT 740M, pci bus id: 0000:01:00.0)
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 1.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 2.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 4.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 8.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 16.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 32.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 64.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 128.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 256.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 512.0KiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 1.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 2.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 4.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 8.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 16.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 32.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 64.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 128.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 256.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 512.00MiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 1.00GiB
I tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:51] Creating bin of max chunk size 2.00GiB
>>> print sess.run(hello)
Hello, TensorFlow!
>>> a = tf.constant(10)
>>> b = tf.constant(32)
>>> print sess.run(a+b)
42
>>> 
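
The line that matters in this log is the one announcing the creation of the TensorFlow device. When the log is captured to a file, it can be checked automatically. The following is a minimal sketch, assuming the log format shown above; parse_gpu_devices and the dictionary keys are illustrative names, not part of TensorFlow.

```python
import re

# Line copied from the session log above.
LOG_TEXT = (
    "I tensorflow/core/common_runtime/gpu/gpu_device.cc:717] "
    "Creating TensorFlow device (/gpu:0) -> "
    "(device: 0, name: GeForce GT 740M, pci bus id: 0000:01:00.0)\n"
)

DEVICE_PATTERN = re.compile(
    r"Creating TensorFlow device \((/gpu:\d+)\) -> "
    r"\(device: (\d+), name: ([^,]+), pci bus id: ([0-9a-fA-F:.]+)\)"
)


def parse_gpu_devices(log_text):
    """Return one dictionary per GPU device announced in the log."""
    devices = []
    for match in DEVICE_PATTERN.finditer(log_text):
        devices.append({
            "tf_name": match.group(1),
            "index": int(match.group(2)),
            "name": match.group(3),
            "pci_bus_id": match.group(4),
        })
    return devices


devices = parse_gpu_devices(LOG_TEXT)
if devices:
    for dev in devices:
        print("{0}: {1} at {2}".format(
            dev["tf_name"], dev["name"], dev["pci_bus_id"]))
else:
    print("no GPU device found in the log")
```

An empty result would suggest that the interpreter was started without optirun or that the CUDA libraries were not loaded.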


Conclusion

At first sight, TensorFlow 0.7 resolves the issue of using CUDA 7.5 with TensorFlow: the CUDA libraries load, and the GeForce GT 740M is registered as /gpu:0.