Chapter 10 Best Practices

10.1 Booking a server

Before booking a server, look at the server information sheet to verify that the resource can meet your needs.

Book the server with our booking system, giving your name, a description of your tasks and, when possible, the resources you expect to use (e.g. GPU, CPU threads, RAM).

It is better to have no more than one person per server to avoid overloading the machine. However, inexpensive tasks can be run simultaneously. Always check on the booking calendar which tasks are planned, and communicate with each other.

10.2 Docker Considerations

  • Always check which images already exist before pulling a new one:

    docker image ls
  • To check what is running and which ports are occupied, always run:

    docker ps 
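When the same check has to be scripted, the container list can be read from Python. The following is a minimal sketch, assuming the docker command-line client is installed and the user may talk to the Docker daemon; the helper names parse_docker_ps and running_containers are hypothetical and only serve as illustration.

```python
import subprocess

# Go-template format string understood by `docker ps --format`.
PS_FORMAT = "{{.Names}}\t{{.Image}}\t{{.Ports}}"


def parse_docker_ps(output):
    """Turn tab-separated `docker ps` output into a list of dicts."""
    containers = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Pad with empty fields: a container without published ports
        # produces an empty Ports column.
        name, image, ports = (line.split("\t") + ["", ""])[:3]
        containers.append({"name": name, "image": image, "ports": ports})
    return containers


def running_containers():
    """Query the local Docker daemon (hypothetical helper, needs the docker CLI)."""
    result = subprocess.run(
        ["docker", "ps", "--format", PS_FORMAT],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_ps(result.stdout)
```

Calling running_containers() returns one dictionary per running container, so occupied ports can be inspected before a new container is started.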

10.3 System Monitoring

It is highly recommended that you monitor the system's processes both before and while running your own. This can be done with htop, an interactive system monitor, process viewer and process manager, which is started by typing htop into the terminal and pressing Enter:

htop

Things to keep an eye on:

  • How many threads are currently being used?
  • How much memory is available?

If your process is taking longer than expected, the cause may be either too many processes running in parallel or insufficient available memory. The latter can result in swap memory being used, which will significantly slow down your processes.
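The two questions above can also be answered from a script. The sketch below is one possible approach, assuming a Linux server where /proc/meminfo is available; the function names parse_meminfo and summarize are hypothetical. It reports the available memory, the swap currently in use, and the 1-minute load average next to the number of logical CPUs.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def summarize(info):
    """Return available memory and swap in use, both in GiB."""
    kib_per_gib = 1024 * 1024
    swap_used = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "available_gib": info.get("MemAvailable", 0) / kib_per_gib,
        "swap_used_gib": swap_used / kib_per_gib,
    }


if __name__ == "__main__":
    with open("/proc/meminfo") as handle:  # Linux only
        print(summarize(parse_meminfo(handle.read())))
    # 1-minute load average compared with the number of logical CPUs
    print(os.getloadavg()[0], "of", os.cpu_count(), "CPUs")
```

A load average close to or above the CPU count, or a non-zero and growing amount of swap in use, suggests that the server is already busy.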

10.3.1 Memory Leaks

There is also the potential for “memory leaks” within the code that you are using. Memory leaks can occur when:

  • large objects are not released when they are no longer required;
  • reference cycles exist within the code you are running.

Before running your code on shared resources, run smaller tests (when possible) on your local system while monitoring the memory usage over time. If the amount of memory remains more or less constant, then you should be safe to run your application on a shared resource. Note, however, that memory leaks may be difficult to notice at first, and may only become apparent after a program has been running for hours or days. See the following guide for discovering memory leaks in Python.
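A small local test of this kind can be sketched with Python's standard tracemalloc module. The example below is illustrative only: measure_growth, leaky_step and clean_step are hypothetical names, and the deliberately leaky function simply keeps every buffer in a module-level list that is never cleared.

```python
import tracemalloc


def measure_growth(func, iterations=5):
    """Call func repeatedly and return the traced memory (bytes) after each call."""
    tracemalloc.start()
    readings = []
    try:
        for _ in range(iterations):
            func()
            current, _peak = tracemalloc.get_traced_memory()
            readings.append(current)
    finally:
        tracemalloc.stop()
    return readings


_cache = []  # hypothetical module-level list that is never cleared


def leaky_step():
    """Keeps a reference to every buffer, so memory is never released."""
    _cache.append(bytearray(1_000_000))


def clean_step():
    """Allocates the same buffer but lets it go out of scope."""
    data = bytearray(1_000_000)
    return len(data)


if __name__ == "__main__":
    print("leaky:", measure_growth(leaky_step))
    print("clean:", measure_growth(clean_step))
```

Readings that keep climbing from one iteration to the next, as with leaky_step, are a sign that objects are being retained; readings that stay roughly flat, as with clean_step, are what a well-behaved step should show.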

10.4 GPU Considerations

10.4.1 Monitoring

GPUs are another resource that requires monitoring. A GPU can be monitored using the NVIDIA System Management Interface (nvidia-smi):

watch -n 0.1 nvidia-smi

This interface allows you to monitor memory usage, volatile GPU utilization, temperature and fan speed.
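The same figures can be collected programmatically through the query mode of nvidia-smi. The following is a minimal sketch, assuming nvidia-smi is on the PATH; parse_gpu_stats and query_gpu_stats are hypothetical helper names. With the nounits option, memory is reported in MiB, utilization in percent and temperature in degrees Celsius.

```python
import subprocess

# Field names accepted by `nvidia-smi --query-gpu`.
FIELDS = ["index", "memory.used", "memory.total",
          "utilization.gpu", "temperature.gpu"]


def _to_int(value):
    """Convert a CSV field to int; unsupported fields such as "[N/A]" become None."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_gpu_stats(csv_text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        values = [_to_int(v.strip()) for v in line.split(",")]
        rows.append(dict(zip(FIELDS, values)))
    return rows


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Each returned dictionary describes one GPU, which makes it straightforward to log memory usage or temperature over the lifetime of a job.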

10.4.2 Memory Growth

If no GPU memory is available, it is worth asking the other person using the GPU whether they are using TensorFlow and have enabled memory growth. If memory growth has not been enabled, TensorFlow will by default allocate all available GPU memory to a task. See the following discussion for more information. Memory growth can be enabled within TensorFlow as follows:

import tensorflow as tf

# TensorFlow 1.x API; under 2.x use tf.compat.v1.ConfigProto and tf.compat.v1.Session
config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
sess = tf.Session(config=config)
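The snippet above uses the TensorFlow 1.x session API. For TensorFlow 2.x, a rough equivalent can be sketched as follows, assuming TensorFlow 2.1 or later is installed; enable_memory_growth is a hypothetical helper name.

```python
def enable_memory_growth():
    """Ask TensorFlow 2.x to allocate GPU memory on demand.

    Must be called before any GPU has been initialized, otherwise
    TensorFlow raises a RuntimeError.
    """
    # Imported inside the function so this module loads without TensorFlow.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        tf.config.experimental.set_memory_growth(gpu, True)
    # Number of GPUs for which memory growth was requested
    return len(gpus)
```

The function should be called at the very start of a program, before any tensors or models are created.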

10.4.3 Temperature Monitoring

When servers are located in a room without air conditioning, it is also worth keeping an eye on the GPU temperature, in particular when the server houses multiple GPUs.
