Role Name
An Ansible role that configures a RHEL 9.6 image in the Microsoft Azure cloud for HPC workloads.
Requirements
None
Variables for Controlling Packages to Install
These variables control which packages the role installs. By default,
the role installs all of the packages. Set a variable to false to skip
the corresponding package.
hpc_update_kernel
Whether to update the kernel to the latest version.
Default: true
Type: bool
hpc_update_all_packages
Whether to update all packages on the system to the latest version.
Keeping the system fully updated is good practice, but because updating
every package is a significant change to the user's environment, this
variable is set to false by default.
Default: false
Type: bool
hpc_install_cuda_driver
Whether to install the CUDA Driver package.
Default: true
Type: bool
hpc_install_cuda_toolkit
Whether to install the CUDA Toolkit package.
Note that this package is required for building OpenMPI with NVIDIA GPU support (hpc_build_openmpi_w_nvidia_gpu_support).
Default: true
Type: bool
hpc_install_hpc_nvidia_nccl
Whether to install the NVIDIA Collective Communications Library (NCCL) package.
Note that this package is required for building OpenMPI with NVIDIA GPU support (hpc_build_openmpi_w_nvidia_gpu_support).
Default: true
Type: bool
hpc_install_nvidia_fabric_manager
Whether to install the NVIDIA Fabric Manager package and enable the nvidia-fabricmanager service.
Default: true
Type: bool
hpc_install_rdma
Whether to install the NVIDIA RDMA package.
Default: true
Type: bool
hpc_install_system_openmpi
Whether to install the OpenMPI package from the AppStream repositories, which does not have NVIDIA GPU support.
Install the system openmpi package to support MPI applications that do not require CUDA support or GPU acceleration. It can safely coexist with other installed OpenMPI packages, so if in doubt, install it.
To select this OpenMPI, load its Lmod environment module with the
following command:
module load mpi/openmpi-x86_64
Default: true
Type: bool
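As a quick check that the system OpenMPI is selectable, the module can be loaded from a playbook. This is a minimal sketch, not part of the role: it assumes that the `module` command is available in login shells on the managed host and that `mpirun` is on the PATH once the module is loaded. The inventory group `hpc_nodes` and the variable `mpirun_version` are hypothetical names.

```yaml
- name: Check that the system OpenMPI module loads
  hosts: hpc_nodes  # hypothetical inventory group
  tasks:
    - name: Load the module in a login shell and print the mpirun version
      ansible.builtin.command:
        argv:
          - bash
          - -lc  # login shell, so that the module function is defined (assumption)
          - "module load mpi/openmpi-x86_64 && mpirun --version"
      register: mpirun_version  # hypothetical variable name
      changed_when: false  # read-only check, never reports a change

    - name: Show the reported version
      ansible.builtin.debug:
        var: mpirun_version.stdout_lines
```

The same pattern should apply to any other environment module by changing the module name in the command.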
hpc_build_openmpi_w_nvidia_gpu_support
Whether to build OpenMPI with NVIDIA GPU support.
Currently, the role builds OpenMPI from source. Before building OpenMPI, it builds its requirements: GDRCopy, HPCX, and PMIx.
The Microsoft-supplied PMIx library RPM is versioned so that it replaces the system (AppStream) PMIx package (v4.2.9 replaces v3.2.3). However, the library it installs as libpmix.so.2 is incorrectly versioned: v4.2.9 implements a newer PMIx API that is not backward compatible with applications linked against older versions of libpmix.so.2.
Because OpenMPI v5.x requires PMIx >= 4.2.0, the role has to build PMIx from source so that both versions can be installed on the system at the same time. This also requires a pmix-4.2.9 environment module that adds the PMIx installation to the relevant paths.
To select this OpenMPI, load its Lmod environment module with the
following command:
module load mpi/openmpi-5.0.8
Note that building OpenMPI requires the following variables to be set
to true, which is their default value:
hpc_install_cuda_toolkit: true
hpc_install_hpc_nvidia_nccl: true
Default: true
Type: bool
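For a virtual machine without NVIDIA GPUs, the package variables above could be combined as in the following sketch. It is an illustration rather than a recommended configuration: it assumes that skipping the GPU-related packages is appropriate for your VM size, and it leaves hpc_install_rdma at its default.

```yaml
- name: Configure a CPU-only virtual machine for HPC
  hosts: localhost
  vars:
    # Skip the GPU-related packages
    hpc_install_cuda_driver: false
    hpc_install_cuda_toolkit: false
    hpc_install_hpc_nvidia_nccl: false
    hpc_install_nvidia_fabric_manager: false
    # The GPU-enabled OpenMPI build needs the CUDA Toolkit and NCCL, so skip it too
    hpc_build_openmpi_w_nvidia_gpu_support: false
    # Keep the AppStream OpenMPI for MPI applications that do not use GPUs
    hpc_install_system_openmpi: true
  roles:
    - redhat.rhel_system_roles.hpc
```

Disabling the OpenMPI build together with the CUDA Toolkit and NCCL keeps the variables consistent with the requirement stated above.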
Variables for Configuring Tuning for HPC Workloads
hpc_tuning
Whether to apply tuning for HPC workloads.
The role applies the following tuning configurations:
- Remove user memory limits to ensure applications aren't restricted by creating a file /etc/security/limits.d/90-hpc-limits.conf with memlock, nofile, and stack configuration.
- Configure system by creating a file /etc/sysctl.d/90-hpc-sysctl.conf. This file applies the following configuration:
  - Enable zone reclaim mode
  - Increase the size of the IP neighbour cache
  - Increase the number of NFS RPCs per transport to have in flight at once
- Load a sunrpc kernel module with sunrpc.tcp_max_slot_table_entries=128.
- Boost read performance for newly mounted NFS network shares by adding a file /etc/udev/rules.d/90-nfs-readahead.rules. This configuration increases the data pre-fetching buffer to 15,380 KB to help overcome network latency.
Default: true
Type: bool
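After a run with hpc_tuning enabled, the files listed above can be checked with a short playbook. This is a minimal sketch that only confirms the files exist and does not inspect their contents. The inventory group `hpc_nodes` and the variable `tuning_file_stats` are hypothetical names.

```yaml
- name: Check the files created by HPC tuning
  hosts: hpc_nodes  # hypothetical inventory group
  tasks:
    - name: Collect information about the tuning files
      ansible.builtin.stat:
        path: "{{ item }}"
      loop:
        - /etc/security/limits.d/90-hpc-limits.conf
        - /etc/sysctl.d/90-hpc-sysctl.conf
        - /etc/udev/rules.d/90-nfs-readahead.rules
      register: tuning_file_stats  # hypothetical variable name

    - name: Fail if any tuning file is missing
      ansible.builtin.assert:
        that:
          - item.stat.exists
        fail_msg: "{{ item.item }} is missing"
      loop: "{{ tuning_file_stats.results }}"
      loop_control:
        label: "{{ item.item }}"
```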
Variables for Configuring How Role Reboots Managed Nodes
hpc_reboot_ok
If true, the role reboots the managed host when it detects a change
that requires a reboot to take effect.
If false, it is up to you to determine when to reboot
the managed host.
The role returns the variable hpc_reboot_needed with a value of
true to indicate that a change has occurred which needs
a reboot to take effect.
Default: false
Type: bool
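When hpc_reboot_ok is left at false, the reboot can be handled in your own playbook based on hpc_reboot_needed. The following is a minimal sketch under the assumption that the role's returned variable is visible to later tasks in the same play. The inventory group `hpc_nodes` is a hypothetical name; a remote host is used because a playbook cannot reboot the machine it runs from.

```yaml
- name: Configure HPC and reboot at a time of my choosing
  hosts: hpc_nodes  # hypothetical inventory group of remote hosts
  vars:
    hpc_reboot_ok: false
  tasks:
    - name: Run the hpc role without letting it reboot the host
      ansible.builtin.include_role:
        name: redhat.rhel_system_roles.hpc

    - name: Reboot only when the role reports that a reboot is needed
      ansible.builtin.reboot:
      when: hpc_reboot_needed | default(false) | bool
```

Any tasks that must finish before the reboot, such as draining a job scheduler, could be placed between the two tasks.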
Example Playbook for Configuring Packages
- name: Configure my virtual machine for HPC
  hosts: localhost
  vars:
    hpc_install_cuda_driver: true
    hpc_install_cuda_toolkit: true
    hpc_install_hpc_nvidia_nccl: true
    hpc_install_nvidia_fabric_manager: true
    hpc_install_rdma: true
    hpc_install_system_openmpi: true
    hpc_build_openmpi_w_nvidia_gpu_support: true
  roles:
    - redhat.rhel_system_roles.hpc

Variables for Configuring Firewall
hpc_manage_firewall
Whether to run the linux-system-roles.firewall role to manage the firewall.
Setting this variable to true does the following:
- Enable and start the firewall service.
- Configure the default firewall zone to be trusted.
The trusted zone allows all connections. This is common practice for HPC workloads, where network security is handled by the cloud provider.
Because this setting has security implications, users must explicitly
approve it by setting this variable to true.
Default: false
Type: bool
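A playbook that opts in to firewall management could look like the following sketch, which changes only this one variable and leaves everything else at its default:

```yaml
- name: Configure my virtual machine for HPC with a managed firewall
  hosts: localhost
  vars:
    # Explicit opt-in: the default zone becomes trusted, which allows all connections
    hpc_manage_firewall: true
  roles:
    - redhat.rhel_system_roles.hpc
```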
Variables for Configuring Storage
By default, the role ensures that the rootlv and
usrlv logical volumes in Azure have enough storage for the packages to be
installed. You can use the variables described in this section to control
the exact sizes and paths.
hpc_manage_storage
Whether to configure the volume group hpc_rootvg_name to have the logical volumes hpc_rootlv_name and hpc_usrlv_name with the indicated sizes, mounted at the indicated mount points.
Note that the role does not set an exact size. It ensures that each size is at least as indicated, i.e. the role won't shrink logical volumes.
Default: true
Type: bool
hpc_rootvg_name
Name of the root volume group to use. The role extends the logical volumes hpc_rootlv_name and hpc_usrlv_name in this volume group to the size required to install HPC packages.
Default: rootvg
Type: string
hpc_rootlv_name
Name of the root logical volume to use.
Default: rootlv
Type: string
hpc_rootlv_size
The size of the hpc_rootlv_name logical volume to configure.
Note that the role does not set an exact size. It ensures that the size is at least as indicated, i.e. the role won't shrink the logical volume if its current size is larger than the value of this variable.
Default: 10G
Type: string
hpc_rootlv_mount
Mount point of the hpc_rootlv_name logical volume to configure.
Default: /
Type: string
hpc_usrlv_name
Name of the usr logical volume to use.
Default: usrlv
Type: string
hpc_usrlv_size
The size of the hpc_usrlv_name logical volume to configure.
Note that the role does not set an exact size. It ensures that the size is at least as indicated, i.e. the role won't shrink the logical volume if its current size is larger than the value of this variable.
Default: 20G
Type: string
hpc_usrlv_mount
Mount point of the hpc_usrlv_name logical volume to configure.
Default: /usr
Type: string
Example Playbook for Configuring Storage
- name: Configure my virtual machine for HPC
  hosts: localhost
  vars:
    hpc_manage_storage: true
    hpc_rootvg_name: rootvg
    hpc_rootlv_name: rootlv
    hpc_rootlv_size: 10G
    hpc_rootlv_mount: /
    hpc_usrlv_name: usrlv
    hpc_usrlv_size: 20G
    hpc_usrlv_mount: /usr
  roles:
    - redhat.rhel_system_roles.hpc

Variables Exported by the Role
hpc_reboot_needed
Default: false. If true, a reboot is needed to apply
the changes made by the role.
Example Playbooks
Run the role to configure storage, install all packages, and reboot if needed.
- name: Configure my virtual machine for HPC
  hosts: localhost
  vars:
    hpc_manage_storage: true
    hpc_rootvg_name: rootvg
    hpc_rootlv_name: rootlv
    hpc_rootlv_size: 10G
    hpc_rootlv_mount: /
    hpc_usrlv_name: usrlv
    hpc_usrlv_size: 20G
    hpc_usrlv_mount: /usr
    hpc_install_cuda_driver: true
    hpc_install_cuda_toolkit: true
    hpc_install_hpc_nvidia_nccl: true
    hpc_install_nvidia_fabric_manager: true
    hpc_install_rdma: true
    hpc_install_system_openmpi: true
    hpc_build_openmpi_w_nvidia_gpu_support: true
    hpc_reboot_ok: true
  roles:
    - redhat.rhel_system_roles.hpc

rpm-ostree
See README-ostree.md
License
MIT