---+ Resource Limits in Slurm

---++ Very short summary

The resources that we put limits on are: CPU, memory, GPU. And runtime, sort of. Whom do we limit this to? Not to you, as an individual, but to all the people in your HPC group (your "account") together. Your job can be pending because your colleague is using all your group's resources.

There are two types of limits: just a number (your jobs cannot use more than X CPUs simultaneously, no more than Y GPUs and no more than Z gigabytes of memory); and a number times the requested runtime. That last one is always tricky to explain properly; we'll get to that.

---++ The boring stuff that's good to know

A word of terminology: in the commands to come, you will often see the terms "TRES" and "GRES". A TRES is a "trackable resource": something that you can request and that we can put limits on. To reiterate, we limit the use of CPUs, memory and GPUs. A GRES (generic resource) is just a TRES that doesn't have its own category yet (GPU is actually a GRES). For our purposes, GRES and TRES are just the same thing.

Another thing that may be good to point out again (copy/pasted from the HowToS page...): Most of our compute nodes have 2 physical CPUs. These are the items you can hold in your hand and install in a motherboard socket. These physical CPUs consist of multiple CPU "cores". These are mostly independent units, equivalent to what (in the old days...) you would actually call "a CPU". Each CPU core presents itself to the operating system as two, so that it can run two software threads at a time. This is called hyperthreading. Unfortunately (in my opinion), these "hyperthreads" are what Slurm actually calls "CPUs". If you specify "srun --cpus-per-task=2", you will get 2 hyperthreads, which is just 1 CPU core. In addition, if you request an odd number of "CPUs", you will get an even number, rounded up. So "--cpus-per-task=3" will get you 4 hyperthreads (2 CPU cores).

So, whenever you see a "cpu" limit below, remember that it's actually a "hyperthread", and you can't actually request 1 "cpu": you will always get an even number.

On to the good stuff.

---++ How to see your group limits

The full (but slightly unreadable) resource limit configuration can be seen with the command =scontrol show assoc_mgr=. The full (and equally unreadable) resource requests for queued and running jobs can be seen with the command =scontrol show jobid NNNN=. Relating the information from both commands can be quite a bit of work. Fortunately, some kind soul (https://github.com/OleHolmNielsen/Slurm_tools) has written some tools that make this a whole lot easier. The most useful of these are available on the submit hosts, in =/usr/local/bin=.

To see your group limits, use the command =showuserlimits=. Without arguments, it shows the limits for your default "account". You can also specify =-A someotheraccount= or =-u someotheruser=.

Let's see the output of =showuserlimits -u mmarinus -A bofh= (my limits are very low, because I don't actually pay to use the cluster; I get paid so that you can use it...). I'll add some comments inline.
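Before we look at that output, here is a minimal sketch of the request side: a job script whose resource requests count against exactly these group limits. The script and program name are made up for illustration; the =#SBATCH= options themselves are standard Slurm, and =bofh= is the example account used below.

<pre>
#!/bin/bash
# Which group's ("account"'s) limits apply to this job:
#SBATCH --account=bofh
# Asking for 3 "CPUs" gets rounded up to 4 hyperthreads (= 2 CPU cores);
# those 4 count against the group's GrpTRES "cpu" limit:
#SBATCH --cpus-per-task=3
# Counts against the group's GrpTRES "mem" limit (listed in megabytes):
#SBATCH --mem=10G
# Counts against the group's GrpTRES "gres/gpu" limit:
#SBATCH --gres=gpu:1
# 120 minutes of requested runtime; while the job runs,
# 4 cpus * 120 min = 480 cpu-minutes and 1 gpu * 120 min = 120 gpu-minutes
# count against the group's GrpTRESRunMins limits:
#SBATCH --time=02:00:00

./my_program   # placeholder for your actual workload
</pre>

If the group is already at one of its limits, a job like this simply stays pending until enough of your colleagues' jobs have finished.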
<pre>
[mmarinus@hpcs03 ~]$ showuserlimits -u mmarinus -A bofh
Association (Parent account):
    ClusterName = udac        # this is the same for everyone
    Account = bofh            # this is my "account" (the group that has all the actual limits)
    UserName =                # no username, this applies to every member of my account group
    Partition =               # no partition, this applies to all partitions
    Priority = 0
    ID = 3
    SharesRaw/Norm/Level/Factor = 8/0.00/18909/0.00    # Let's discuss this another time :-)
    UsageRaw/Norm/Efctv = 1.99/0.00/0.00
    ParentAccount = root, current value = 1
    Lft = 1694
    DefAssoc = No
    GrpJobs =                 # This line (and the next 4) are limits that could have been set, but are not
    GrpJobsAccrue =
    GrpSubmitJobs =
    GrpWall =
    GrpTRES =
        cpu: Limit = 1882, current value = 0        # this is the first actual limit: I can use no more than 1882 CPUs simultaneously (one job using 1882, or 188 jobs using 10, etc). If I submit more, they stay "pending".
        mem: Limit = 7000000, current value = 0     # My running jobs cannot request more than 7 TB of memory in total. If I request more, those jobs stay pending.
        gres/gpu: Limit = 8, current value = 0      # I cannot use more than 8 GPUs simultaneously
    GrpTRESMins =             # this would set limits on total resource consumption, including past jobs. We don't do this, we only limit current resource usage.
    GrpTRESRunMins =          # this limits "requested_runtime * specified_resource" for running jobs. Time is in minutes.
        cpu: Limit = 17818, current value = 0       # I can have 1 job with 2 CPUs requesting 8909 minutes, or 10 jobs with 10 CPUs requesting 178.18 minutes, etc. Additional jobs stay pending.
        gres/gpu: Limit = 20160, current value = 0  # I can have 1 job with 1 GPU requesting 20160 minutes, or 4 jobs with 2 GPUs requesting 2520 minutes, etc.
    MaxJobs =                 # That is all we limit on.
    MaxJobsAccrue =
    MaxSubmitJobs =
    MaxWallPJ =
    MaxTRESPJ =
    MaxTRESPN =
    MaxTRESMinsPJ =
    MinPrioThresh =
Association (User):           # This would show any limits that are applied to me individually, rather than to my group. Nothing here, we only limit groups.
    ClusterName = udac
    Account = bofh
    UserName = mmarinus, UID=10307
    Partition =
    Priority = 0
    ID = 4
    SharesRaw/Norm/Level/Factor = 10/0.00/50/0.00
    UsageRaw/Norm/Efctv = 1.99/0.00/1.00
    ParentAccount =
    Lft = 1705
    DefAssoc = Yes
    GrpJobs =
    GrpJobsAccrue =
    GrpSubmitJobs =
    GrpWall =
    GrpTRES =
    GrpTRESMins =
    GrpTRESRunMins =
    MaxJobs =
    MaxJobsAccrue =
    MaxSubmitJobs =
    MaxWallPJ =
    MaxTRESPJ =
    MaxTRESPN =
    MaxTRESMinsPJ =
    MinPrioThresh =
</pre>

-- %USERSIG{MartinMarinus - 2020-05-26}%

---++ Comments

%COMMENT%
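A closing tip: when the group is bumping into these limits, it helps to see which jobs are currently counting against them, and why pending jobs are waiting. This is a minimal sketch using standard =squeue= options (the exact format fields accepted by =squeue -O= depend on the Slurm version; see =man squeue= on the submit hosts):

<pre>
# List running and pending jobs charged to the account "bofh",
# with their CPU, memory and time requests and the reason pending jobs wait:
squeue -A bofh -O "jobid,username,statecompact,numcpus,minmemory,timelimit,reason"
</pre>

Jobs that are held back by the group limits above typically show a reason starting with =AssocGrp...=, for example =AssocGrpCpuLimit=.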