HTCondor

Condor is a High Throughput Computing (HTC) environment that combines batch job scheduling with resource scavenging: compute-intensive jobs submitted to a central manager run on machines whose resources would otherwise sit unallocated.

The Condor Project’s User Manual has an excellent introduction and tutorial on getting started with Condor. The site-specific information below supplements that documentation for working with Condor resources within ECE.

ECE Condor environment

Condor is best suited to running many jobs that differ only in their data. It is less well suited to a single very long-running job, unless you are able to checkpoint that job. Condor itself supports checkpointing; see the Condor manual pages about checkpointing.

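condor.ece.cmu.edu is the main collector for our Condor flocks. All of our machines have AFS, but Condor itself cannot authenticate to AFS, so a job started by Condor does not carry your AFS tokens. Any files the job reads or writes in AFS must be accessible through their ACLs. See the AFS ACLs page for more information.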

Job Management

Below is a very brief reference to the commands you need to get started with Condor in the ECE computing environment.

Submitting Jobs

The Condor Project’s Manual describes job submission well. condor_submit or condor_submit_dag parses your submission file into batch job clusters. The jobs in each cluster are spread across the pool as resources become available.

You will want to submit jobs from: condor-submit.ece.local.cmu.edu
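
As a minimal sketch of what a submission looks like, the script below writes a small submit description file and passes it to condor_submit. The file names (example.sub, the output, error, and log names) are hypothetical and chosen only for illustration; executable.sh and the two processes mirror the condor_q listing further down this page. The script only calls condor_submit when the tool is installed, so it is safe to try elsewhere.

```shell
#!/bin/sh
# Sketch: write a minimal submit description file, then submit it.
# example.sub and the job.* file names are hypothetical examples.

cat > example.sub <<'EOF'
# Run executable.sh twice; each run receives its process number (0, 1)
# as its only argument.
universe   = vanilla
executable = executable.sh
arguments  = $(Process)
output     = job.$(Cluster).$(Process).out
error      = job.$(Cluster).$(Process).err
log        = job.$(Cluster).log
queue 2
EOF

# Submit only where the Condor tools are installed (e.g. the submit host).
if command -v condor_submit >/dev/null 2>&1; then
    condor_submit example.sub
else
    echo "condor_submit not found; wrote example.sub only"
fi
```

Since Condor cannot authenticate to AFS, the output, error, and log paths probably need to point somewhere the job is allowed to write.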

Managing Jobs

Again, see the Condor Project’s Manual for full details.

Checking Status of a job

condor_q lists the jobs submitted from the host you run it on, along with their status. If you submitted from a different host, use the -pool or -global option.

$ condor_q

-- Submitter: fatbox.campus.ece.cmu.local : <172.19.136.27:49717> : fatbox.campus.ece.cmu.local
 ID      OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD
 367.0   mbeckler  10/5  12:39  0+00:00:00 R  0   0.0  executable.sh 0
 367.1   mbeckler  10/5  12:39  0+00:00:00 R  0   0.0  executable.sh 1

2 jobs; 0 idle, 2 running, 0 held
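
If you want to check from a script whether any of your jobs have been held, one option is to read the summary line that condor_q prints. The helper below is only a sketch: held_count is a hypothetical name, and it assumes the summary contains "<N> held" as in the listing above. With -global there is one summary line per scheduler, so the counts are added up.

```shell
#!/bin/sh
# held_count: read condor_q output on stdin and print the total number of
# held jobs, summed over every summary line found in the input.
# Assumes each summary line contains "<N> held".
held_count() {
    awk '{
        for (i = 2; i <= NF; i++)
            if ($i ~ /^held,?$/) total += $(i - 1)
    } END { print total + 0 }'
}

# Query every scheduler in the pool where the Condor tools are available.
if command -v condor_q >/dev/null 2>&1; then
    condor_q -global | held_count
fi
```

A non-zero result is a hint to look at the Holding/Releasing section below.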

Holding/Releasing a job

At times you may need to hold or release a job. The Condor scheduler will also hold a job on its own if the job repeatedly fails to start. This is typically the sign of an error in your submission file or, more likely, a permission error on files listed within your submission file.

When this occurs, condor_hold and condor_release are the commands you will need. They both follow the standard Condor conventions for selecting jobs: a cluster, a cluster.process, a user name, or -all. I would strongly recommend releasing a single cluster.process at first, confirming that it runs, and then going back and releasing the rest of the cluster.

$ condor_hold 367
$ condor_release 367.1
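
The "release one, then the rest" advice can be sketched as two small shell functions. The function names and the DRY_RUN switch are hypothetical and exist only for this example; picking process 0 as the test job is likewise an assumption. condor_q -hold is used first because it lists held jobs together with the reason they were held.

```shell
#!/bin/sh
# Sketch of the "release one job, then the rest" workflow.
# DRY_RUN is a hypothetical switch for this example: when set to 1,
# commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# release_first CLUSTER: show hold reasons, then release only process 0.
release_first() {
    run condor_q -hold "$1"
    run condor_release "$1.0"
}

# release_rest CLUSTER: release every job in the cluster that is still held.
release_rest() {
    run condor_release "$1"
}
```

For example, after fixing the underlying problem you might run release_first 367, watch 367.0 with condor_q until it is running, and only then run release_rest 367. With DRY_RUN=1 set, the functions print the condor commands they would run without touching the queue.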

Removing a job

Sometimes you need to remove a job early, and condor_rm will do that for you.

$ condor_rm 367
Job cluster 367 removed, 2 jobs removed.

Additional Resources