Cluster operation

Acceptable use policy

Users must run their programs and code responsibly, without interfering with or damaging the code and programs of other users. This rule must be followed at all times; to comply with it, you need to be familiar with parallel computing and with the software and hardware configuration of the deep-thought cluster.

All users must use the cluster's queuing system, which runs on the master node and manages the use of the compute nodes. Logging into the nodes directly to run jobs there is not allowed.

It is forbidden to use the master node to run jobs without going through the queue manager.

Users will be notified by email about any issue related to the deep-thought cluster, for example scheduled updates, scheduled maintenance, and so on. Active communication is required for the effective use of this shared resource.

The deep-thought cluster is a compute resource, not a storage resource. Data should therefore be used at the time of processing; avoid using the deep-thought cluster as a long-term storage server.

If users violate the rules above or affect other users' ability to use the deep-thought cluster, the system administrator is authorized to terminate their jobs and/or suspend their accounts. An attempt will be made to notify the user first, but if the activity is impacting the community's use, the jobs and account will be suspended first, for the benefit of the community.

Access to resources

All users must access the cluster via secure shell (ssh).

Users must use the queue management system (qsub) to reserve the nodes their jobs need. Logging into the nodes to run jobs directly is not allowed.

Job management (Grid Engine) is based on a first-come, first-served policy. Job priorities are adjusted by factors such as the type of user and the resources requested by the job.

It is the user's responsibility to keep their usage of the deep-thought cluster at an appropriate level. What counts as appropriate varies with the community's use of the resources, so please adjust your usage based on the cluster monitoring (ganglia).

A long-running job is a job that lasts more than 24 hours. If you plan to submit a long-running job, first check that the usage patterns of the deep-thought cluster allow it.
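
If you do submit one, it helps to declare the expected run time. A minimal sketch, assuming the cluster exposes the standard SGE run-time limit h_rt (a hard limit; the job is killed when it is exceeded):

user@deep-thought:~$ qsub -l h_rt=48:00:00 trabajo.sh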

It is recommended to estimate the run time and the resources needed (processors, RAM, and disk space) using a fraction of the workload when possible (benchmarking).
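
One simple way to benchmark a reduced run, assuming GNU time is installed and using a hypothetical program and input file, is to measure its wall-clock time and peak memory:

user@deep-thought:~$ /usr/bin/time -v ./my_program small_sample.dat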

The rules above do not apply to the system administrator or to users selected to run tests, debugging, and benchmarking. This activity is expected to be significant only at the beginning of the cluster's configuration, after a system reset, or after a scheduled update.

To log in to the system, connect to the master node: ssh user@111.111.111.111, replacing 111 with the cluster's IP address (provided to you by email). The cluster has one master node and several slave nodes; the master node's role is to distribute and schedule jobs based on the resources requested. An example of how to run a job is shown below:

user@deep-thought:~$ qsub -V -b n -cwd trabajo.sh
Your job 1 ("trabajo.sh") has been submitted

The -V option exports the environment variables of the shell executing qsub.

The -b option tells qsub whether the command to be executed is a single binary executable or a script. It takes a yes (y) or no (n) argument indicating whether the command to execute is a binary or not.

The -cwd option specifies that the job uses the same directory from which qsub was called (the current working directory).

trabajo.sh is a bash script and can be configured with the resources the job will need. The options declared on the command line above (-V, -cwd, etc.) can also be declared inside this script.
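
As an illustration, a minimal sketch of what trabajo.sh could look like with the options embedded as SGE directives (lines starting with #$ are read by qsub; the program name and the memory value are only examples):

#!/bin/bash
#$ -V              # export the submitting shell's environment variables
#$ -cwd            # run in the directory where qsub was called
#$ -N trabajo      # job name
#$ -l h_vmem=4G    # per-slot memory limit (illustrative value)

./my_program input.dat   # hypothetical executable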

Specifying resources:

There are two kinds of resources to define: CPUs (each processor is a slot) and memory.

Memory can be requested with h_vmem, which is the maximum total memory a slot may use; if the memory total per slot is exceeded, your job is killed automatically. Another available memory option is the mem_free parameter.

This is how these resources can be requested:

user@deep-thought:~$ qsub -l h_vmem=4G trabajo.sh
Your job 1 ("trabajo.sh") has been submitted

This command reserves a machine with 4 GB of free RAM.
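
A similar sketch using mem_free instead: mem_free only checks that a node has that much memory free when the job is scheduled and, in the default configuration, does not enforce a limit afterwards, unlike h_vmem:

user@deep-thought:~$ qsub -l mem_free=4G trabajo.sh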

You can also request additional CPU resources with the -pe option. This option requests parallel environment resources, and each cluster has its own parallel environment configuration. On this cluster you can request a maximum of 4 CPUs per node. The following command starts a job using 4 slots (CPUs) in a parallel environment called completenode, with 4 GB of RAM:

user@deep-thought:~$ qsub -pe completenode 4 -l h_vmem=4G trabajo.sh
Your job 1 ("trabajo.sh") has been submitted

In the SGE system, work is organized into resources and queues:

Queues

  • This is where submitted jobs run.
  • Queues span multiple hosts (machines), which we call compute nodes here.
  • There can be more than one queue in the system.
  • Queues provide resources on the nodes, generally 1 slot/CPU per request.
  • You can explicitly select which node to use (see the sketch below).
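
For example, a minimal sketch of targeting a specific node; the queue name all.q and the node name node001 are assumptions, so check the actual queue and host names on your cluster (for instance with qconf -sql and qhost) first:

user@deep-thought:~$ qsub -q all.q@node001 trabajo.sh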

 

SGE administration manuals:

https://star.mit.edu/cluster/docs/0.93.3/guides/sge.html <- the guide below is taken from here:

 

Sun Grid Engine (SGE) QuickStart

The Sun Grid Engine queuing system is useful when you have a lot of tasks to execute and want to distribute the tasks over a cluster of machines. For example, you might need to run hundreds of simulations/experiments with varying parameters or need to convert 300 videos from one format to another. Using a queuing system in these situations has the following advantages:

  • Scheduling – allows you to schedule a virtually unlimited amount of work to be performed when resources become available. This means you can simply submit as many tasks (or jobs) as you like and let the queuing system handle executing them all.
  • Load Balancing – automatically distributes tasks across the cluster such that any one node doesn’t get overloaded compared to the rest.
  • Monitoring/Accounting – ability to monitor all submitted jobs and query which cluster nodes they’re running on, whether they’re finished, encountered an error, etc. Also allows querying job history to see which tasks were executed on a given date, by a given user, etc.

Of course, just because a queuing system is installed doesn’t mean you have to use it at all. You can run your tasks across the cluster in any way you see fit and the queuing system should not interfere. However, you will most likely end up needing to implement the above features in some fashion in order to optimally utilize the cluster.

Submitting Jobs

A job in SGE represents a task to be performed on a node in the cluster and contains the command line used to start the task. A job may have specific resource requirements but in general should be agnostic to which node in the cluster it runs on as long as its resource requirements are met.

All jobs require at least one available slot on a node in the cluster to run.

Submitting jobs is done using the qsub command. Let’s try submitting a simple job that runs the hostname command on a given cluster node:

sgeadmin@master:~$ qsub -V -b y -cwd hostname
Your job 1 ("hostname") has been submitted
  • The -V option to qsub states that the job should have the same environment variables as the shell executing qsub (recommended)
  • The -b option to qsub states that the command being executed could be a single binary executable or a bash script. In this case the command hostname is a single binary. This option takes a y or n argument indicating either yes the command is a binary or no it is not a binary.
  • The -cwd option to qsub tells Sun Grid Engine that the job should be executed in the same directory that qsub was called.
  • The last argument to qsub is the command to be executed (hostname in this case)

Notice that the qsub command, when successful, will print the job number to stdout. You can use the job number to monitor the job’s status and progress within the queue as we’ll see in the next section.
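
Besides the queue overview shown in the next section, you can also ask for a single job's full details with qstat -j followed by the job number; this prints the job's resource requests, submission time, and any scheduler messages:

sgeadmin@master:~$ qstat -j 1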

Monitoring Jobs in the Queue

Now that our job has been submitted, let’s take a look at the job’s status in the queue using the command qstat:

sgeadmin@master:~$ qstat
job-ID  prior   name       user         state submit/start at     queue  slots ja-task-ID
-----------------------------------------------------------------------------------------
1 0.00000 hostname   sgeadmin     qw    09/09/2009 14:58:00                1
sgeadmin@master:~$

From this output, we can see that the job is in the qw state which stands for queued and waiting. After a few seconds, the job will transition into a r, or running, state at which point the job will begin executing:

sgeadmin@master:~$ qstat
job-ID  prior   name       user         state submit/start at     queue  slots ja-task-ID
-----------------------------------------------------------------------------------------
1 0.00000 hostname   sgeadmin     r     09/09/2009 14:58:14                1
sgeadmin@master:~$

Once the job has finished, the job will be removed from the queue and will no longer appear in the output of qstat:

sgeadmin@master:~$ qstat
sgeadmin@master:~$

Now that the job has finished let’s move on to the next section to see how we view a job’s output.

Viewing a Job’s Output

Sun Grid Engine creates stdout and stderr files in the job’s working directory for each job executed. If any additional files are created during a job’s execution, they will also be located in the job’s working directory unless explicitly saved elsewhere.

The job’s stdout and stderr files are named after the job with the extension ending in the job’s number.

For the simple job submitted above we have:

sgeadmin@master:~$ ls hostname.*
hostname.e1 hostname.o1
sgeadmin@master:~$ cat hostname.o1
node001
sgeadmin@master:~$ cat hostname.e1
sgeadmin@master:~$

Notice that Sun Grid Engine automatically named the job hostname and created two output files: hostname.e1 and hostname.o1. The e stands for stderr and the o for stdout. The 1 at the end of the files’ extension is the job number. So if the job had been named my_new_job and was job #23 submitted, the output files would look like:

my_new_job.e23 my_new_job.o23
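
If you prefer to choose where these files are written, qsub accepts the -o and -e options for the stdout and stderr paths; the logs/ directory below is only an illustration and must already exist:

sgeadmin@master:~$ qsub -V -b y -cwd -o logs/ -e logs/ hostname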

Monitoring Cluster Usage

After a while you may be curious to view the load on Sun Grid Engine. To do this, we use the qhost command:

sgeadmin@master:~$ qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
master                  lx24-x86        1  0.00    1.7G   62.7M  896.0M     0.0
node001                 lx24-x86        1  0.00    1.7G   47.8M  896.0M     0.0

The output shows the architecture (ARCH), number of CPUs (NCPU), the current load (LOAD), total memory (MEMTOT), currently used memory (MEMUSE), total swap space (SWAPTO), and used swap (SWAPUS) for each node.

You can also view the average load (load_avg) per node using the ‘-f’ option to qstat:

sgeadmin@master:~$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
all.q@master.c                 BIP   0/0/1          0.00     lx24-x86
---------------------------------------------------------------------------------
all.q@node001.c                BIP   0/0/1          0.00     lx24-x86

Creating a Job Script

In the ‘Submitting a Job’ section we submitted a single command hostname. This is useful for simple jobs but for more complex jobs where we need to incorporate some logic we can use a so-called job script. A job script is essentially a bash script that contains some logic and executes any number of external programs/scripts:

#!/bin/bash
echo "hello from job script!"
echo "the date is" `date`
echo "here's /etc/hosts contents:"
cat /etc/hosts
echo "finishing job :D"

As you can see, this script simply executes a few commands (such as echo, date, cat, etc.) and exits. Anything printed to the screen will be put in the job’s stdout file by Sun Grid Engine.

Since this is just a bash script, you can put any form of logic necessary in the job script (i.e. if statements, while loops, for loops, etc.) and you may call any number of external programs needed to complete the job.
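
For instance, a short sketch of a job script that loops over a few input files (the file names and the program are hypothetical):

#!/bin/bash
# process several inputs in a single job
for f in input1.dat input2.dat input3.dat; do
    echo "processing $f"
    ./my_program "$f" > "result_$f"
done
echo "all inputs processed"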

Let’s see how you run this new job script. Save the script above to /home/sgeadmin/jobscript.sh on your StarCluster and execute the following as the sgeadmin user:

sgeadmin@master:~$ qsub -V jobscript.sh
Your job 6 ("jobscript.sh") has been submitted

Now that the job has been submitted, let’s call qstat periodically until the job has finished since this job should only take a second to run once it’s executed:

sgeadmin@master:~$ qstat
job-ID  prior   name       user         state submit/start at     queue           slots ja-task-ID
---------------------------------------------------------------------------------------------------
      6 0.00000 jobscript. sgeadmin     qw    09/09/2009 16:18:43                     1

sgeadmin@master:~$ qstat
job-ID  prior   name       user         state submit/start at     queue           slots ja-task-ID
---------------------------------------------------------------------------------------------------
      6 0.00000 jobscript. sgeadmin     qw    09/09/2009 16:18:43                     1

sgeadmin@master:~$ qstat
job-ID  prior   name       user         state submit/start at     queue           slots ja-task-ID
---------------------------------------------------------------------------------------------------
      6 0.00000 jobscript. sgeadmin     qw    09/09/2009 16:18:43                     1

sgeadmin@master:~$ qstat
job-ID  prior   name       user         state submit/start at     queue           slots ja-task-ID
---------------------------------------------------------------------------------------------------
      6 0.00000 jobscript. sgeadmin     qw    09/09/2009 16:18:43                     1

sgeadmin@master:~$ qstat
job-ID  prior   name       user         state submit/start at     queue           slots ja-task-ID
---------------------------------------------------------------------------------------------------
      6 0.55500 jobscript. sgeadmin     r     09/09/2009 16:18:57 all.q@node001.c     1

sgeadmin@master:~$ qstat
sgeadmin@master:~$

Now that the job is finished, let’s take a look at the output files:

sgeadmin@master:~$ ls jobscript.sh*
jobscript.sh jobscript.sh.e6 jobscript.sh.o6
sgeadmin@master:~$ cat jobscript.sh.o6
hello from job script!
the date is Wed Sep 9 16:18:57 UTC 2009
here's /etc/hosts contents:
# Do not remove the following line or programs that require network functionality will fail
127.0.0.1 localhost.localdomain localhost
10.252.167.143 master
10.252.165.173 node001
finishing job :D
sgeadmin@master:~$ cat jobscript.sh.e6
sgeadmin@master:~$

We see from looking at the output that the stdout file contains the output of the echo, date, and cat statements in the job script and that the stderr file is blank meaning there were no errors during the job’s execution. Had something failed, such as a command not found error for example, these errors would have appeared in the stderr file.

Deleting a Job from the Queue

What if a job is stuck in the queue, is taking too long to run, or was simply started with incorrect parameters? You can delete a job from the queue using the qdel command in Sun Grid Engine. Below we launch a simple ‘sleep’ job that sleeps for 10 seconds so that we can kill it using qdel:

sgeadmin@master:~$ qsub -b y -cwd sleep 10
Your job 3 ("sleep") has been submitted
sgeadmin@master:~$ qdel 3
sgeadmin has registered the job 3 for deletion

After running qdel you’ll notice the job is gone from the queue:

sgeadmin@master:~$ qstat
sgeadmin@master:~$
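
If you need to remove several jobs at once, qdel also accepts a list of job IDs, or all jobs belonging to a user via the -u option:

sgeadmin@master:~$ qdel 4 5 6
sgeadmin@master:~$ qdel -u sgeadmin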

OpenMPI and Sun Grid Engine

OpenMPI must be compiled with SGE support (--with-sge) to make use of the tight-integration between OpenMPI and SGE as documented in this section. This is the case on all of StarCluster’s public AMIs.
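
A quick way to check whether your OpenMPI build includes this support is to look for the gridengine components in the output of ompi_info:

$ ompi_info | grep gridengine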

OpenMPI supports tight integration with Sun Grid Engine. This integration allows Sun Grid Engine to handle assigning hosts to parallel jobs and to properly account for parallel jobs.

OpenMPI Parallel Environment

StarCluster by default sets up a parallel environment, called “orte”, that has been configured for OpenMPI integration within SGE and has a number of slots equal to the total number of processors in the cluster. You can inspect the SGE parallel environment by running:

sgeadmin@ip-10-194-13-219:~$ qconf -sp orte
pe_name            orte
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $round_robin
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE

This is the default configuration for a two-node, c1.xlarge cluster (16 virtual cores).

Round Robin vs Fill Up Modes

Notice the allocation_rule setting in the output of the qconf command in the previous section. This defines how to assign slots to a job. By default StarCluster configures round_robin allocation. This means that if a job requests 8 slots for example, it will go to the first machine, grab a single slot if available, move to the next machine and grab a single slot if available, and so on wrapping around the cluster again if necessary to allocate 8 slots to the job.

You can also configure the parallel environment to try and localize slots as much as possible using the fill_up allocation rule. With this rule, if a user requests 8 slots and a single machine has 8 slots available, that job will run entirely on one machine. If 5 slots are available on one host and 3 on another, it will take all 5 on that host, and all 3 on the other host. In other words, this rule will greedily take all slots on a given node until the slot requirement for the job is met.

You can switch between round_robin and fill_up modes using the following command:

$ qconf -mp orte

This will open up vi (or any editor defined in EDITOR env variable) and let you edit the parallel environment settings. To change from round_robin to fill_up in the above example, change the allocation_rule line from:

allocation_rule    $round_robin

to:

allocation_rule    $fill_up

After making the change and saving the file you can verify your settings using:

sgeadmin@ip-10-194-13-219:~$ qconf -sp orte
pe_name            orte
slots              16
user_lists         NONE
xuser_lists        NONE
start_proc_args    /bin/true
stop_proc_args     /bin/true
allocation_rule    $fill_up
control_slaves     TRUE
job_is_first_task  FALSE
urgency_slots      min
accounting_summary FALSE
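
If you would rather avoid the interactive editor, one possible approach is to dump the parallel environment, rewrite the allocation rule with sed, and load it back with qconf -Mp; this is only a sketch, so review the temporary file before applying it:

$ qconf -sp orte | sed 's/round_robin/fill_up/' > /tmp/orte.pe
$ qconf -Mp /tmp/orte.pe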

Submitting OpenMPI Jobs using a Parallel Environment

The general workflow for running MPI code is:

  1. Compile the code using mpicc, mpicxx, mpif77, mpif90, etc.
  2. Copy the resulting executable to the same path on all nodes or to an NFS-shared location on the master node

It is important that the path to the executable is identical on all nodes for mpirun to correctly launch your parallel code. The easiest approach is to copy the executable somewhere under /home on the master node since /home is NFS-shared across all nodes in the cluster.

  3. Run the code on X number of machines using:

    $ mpirun -np X -hostfile myhostfile ./mpi-executable arg1 arg2 [...]

where the hostfile looks something like:

$ cat /path/to/hostfile
master  slots=2
node001 slots=2
node002 slots=2
node003 slots=2

However, when using an SGE parallel environment with OpenMPI you no longer have to specify the -np, -hostfile, -host, etc. options to mpirun. This is because SGE will automatically assign hosts and processors to be used by OpenMPI for your job. You also do not need to pass the --byslot and --bynode options to mpirun given that these mechanisms are now handled by the fill_up and round_robin modes specified in the SGE parallel environment.

Instead of using the above formulation, create a simple job script that contains a very simplified mpirun call:

$ cat myjobscript.sh
mpirun /path/to/mpi-executable arg1 arg2 [...]

Then submit the job using the qsub command and the orte parallel environment automatically configured for you by StarCluster:

$ qsub -pe orte 24 ./myjobscript.sh

The -pe option specifies which parallel environment to use and how many slots to request. The above example requests 24 slots (or processors) using the orte parallel environment. The parallel environment automatically takes care of distributing the MPI job amongst the SGE nodes using the allocation_rule defined in the environment’s settings.

You can also do this without a job script like so:

$ cd /path/to/executable
$ qsub -b y -cwd -pe orte 24 mpirun ./mpi-executable arg1 arg2 [...]
