2.1 A Taxonomy of Parallel Computing

병렬 μ»΄ν“¨νŒ…μ˜ λΆ„λ₯˜

Beowulf-class systems: ensembles of PCs (e.g., Intel Pentium 4) integrated with commercial off-the-shelf (COTS) local area networks (e.g., Fast Ethernet) or system area networks (e.g., Myrinet), running widely available low-cost or no-cost software for managing system resources and coordinating parallel execution. Such systems exhibit exceptional price/performance for many applications.

μƒμš© COTS(κΈ°μ„±) LAN (예 : 고속 이더넷) λ˜λŠ” μ‹œμŠ€ν…œ μ˜μ—­ λ„€νŠΈμ›Œν¬ (예 : Myrinet)와 톡합 된 PC 앙상블 (예 : Intel Pentium 4) 및 μ‹œμŠ€ν…œ λ¦¬μ†ŒμŠ€ 관리 및 쑰정을 μœ„ν•΄ 널리 μ‚¬μš©λ˜λŠ” μ €λΉ„μš© λ˜λŠ” 무료 μ†Œν”„νŠΈμ›¨μ–΄ μ‹€ν–‰ 병렬 μ‹€ν–‰. μ΄λŸ¬ν•œ μ‹œμŠ€ν…œμ€ λ§Žμ€ μ• ν”Œλ¦¬μΌ€μ΄μ…˜μ—μ„œ νƒμ›”ν•œ κ°€μ„±λΉ„λ₯Ό λ³΄μ—¬μ€λ‹ˆλ‹€.

Superclusters: clusters of clusters, still within a local area such as a shared machine room or in separate buildings on the same industrial or academic campus, usually integrated by the institution's wide area infrastructure backbone network. Although usually within the same internet domain, the clusters may be under separate ownership and administrative responsibility. Nonetheless, organizations are striving to find ways to exploit the potential of partnering multiple local clusters to realize very large scale computing, at least part of the time.

3.3.2.6 Distribute Job Tasks Across Allocated Resources

With the resources selected, Maui then maps job tasks to the actual resources.
This distribution of tasks is typically based on simple algorithms such as round-robin or max blocking, but it can also incorporate placement patterns specific to a parallel programming library (e.g., MPI or PVM) in order to minimize interprocess communication overhead.
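As a hedged illustration of the simplest of these mappings (not Maui's actual implementation), round-robin placement just assigns task i to allocated node i mod N:

    /* Minimal round-robin task placement sketch (illustrative only):
     * task i runs on allocated node i mod N. */
    #include <stdio.h>

    int main(void)
    {
        const char *nodes[] = { "node01", "node02", "node03" }; /* hypothetical node list */
        const int num_nodes = sizeof nodes / sizeof nodes[0];
        const int num_tasks = 8;

        for (int task = 0; task < num_tasks; task++)
            printf("task %d -> %s\n", task, nodes[task % num_nodes]);
        return 0;
    }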

8.3 Node Set Overview

While backfill improves the scheduler's performance, this is only half the battle.
The efficiency of a cluster, in terms of actual work accomplished, is a function of both scheduling performance and individual job efficiency.
In many clusters, job efficiency can vary from node to node as well as with the node mix allocated.
Most parallel jobs written with popular message-passing libraries such as MPI or PVM do not internally load balance their workload and thus run only as fast as the slowest node allocated.
Consequently, these jobs run most effectively on homogeneous sets of nodes. However, while many clusters start out as homogeneous, they quickly evolve as new generations of compute nodes are integrated into the system.
Research has shown that this integration, while improving scheduling performance due to increased scheduler selection, can actually decrease average job efficiency.
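A simple model makes the slowest-node effect explicit. Assuming a job divides work W evenly across N allocated nodes, where node i runs at speed s_i, the completion time is set by the slowest node:

    T_{\text{job}} = \max_i \frac{W/N}{s_i} = \frac{W}{N \, \min_i s_i}

Under this (admittedly simplified, statically partitioned) model, a single slow node in an otherwise fast allocation drags the entire job down to that node's speed, which is why homogeneous node sets matter.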

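Maui's node set facility addresses this by steering each job onto one homogeneous group of nodes. A minimal maui.cfg sketch, assuming nodes have been tagged with features such as fast and slow (the feature names are hypothetical; the NODESET* parameters are from the Maui configuration language):

    # allocate all of a job's nodes from a single matching set
    NODESETPOLICY     ONEOF
    # group nodes into sets by their configured feature
    NODESETATTRIBUTE  FEATURE
    # the features that define the candidate sets
    NODESETLIST       fast slow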

The Maui Scheduler can be thought of as a policy engine which allows sites control over when, where, and how resources such as processors, memory, and disk are allocated to jobs.
In addition to this control, it also provides mechanisms which help to intelligently optimize the use of these resources, monitor system performance, help diagnose problems, and generally manage the system.
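A few maui.cfg lines suffice to express such policies; the following sketch uses parameters from the Maui admin manual with illustrative values:

    # when: allow lower-priority jobs to backfill into idle gaps
    BACKFILLPOLICY        FIRSTFIT
    # where: pack jobs onto nodes with the fewest spare resources
    NODEALLOCATIONPOLICY  MINRESOURCE
    # how priority is weighted: favor long-queued jobs
    QUEUETIMEWEIGHT       1
    XFACTORWEIGHT         10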

Running multi-site MPI jobs with Maui and MPICH

Two things need to happen in order to run multi-site MPI jobs:

Nodes must be reserved and jobs must be run in a coordinated manner.
Jobs must be started such that they are set up to communicate with each other using MPI calls.
The meta scheduling interface to the Maui Scheduler can be used to reserve nodes and start jobs across distributed sites.
MPICH can be used to enable separate jobs to communicate with each other using MPI.

Flow:

A job is submitted to the meta scheduler.
The meta scheduler communicates with separate Maui schedulers to determine node availability.
The meta scheduler starts an individual job at each site. Each job consists solely of an MPICH ch_p4 server process running on each of the job’s nodes.
The meta scheduler creates an MPICH proc group file with hostname and executable information for each site (a sketch appears after this list). An MPICH job is started using the proc group file.
One MPICH process runs on the submitting host.
This process communicates through ch_p4 servers to start MPICH processes on all nodes specified.
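A minimal sketch of such a proc group file for two sites (the hostnames, path, and login are hypothetical):

    local 0
    nodeA1.site-a.example.edu 1 /home/jsmith/a.out jsmith
    nodeB1.site-b.example.edu 1 /home/jsmith/a.out jsmith

The first line places the master process on the submitting host; each remaining line gives a host, the number of additional processes to start there, the executable path, and the remote login. The job would then be launched with something like mpirun -p4pg jobs.pg /home/jsmith/a.out, where jobs.pg is the generated file.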

Setup:

A user account must be created at each site. The job executable and data must be staged at each site.
MPICH must be installed at each site and on the submitting host, and configured to use the ch_p4 device. The executable path must be added to the ~/.server_apps file for the user at each site.
The submitting host and user must be added to the .rhosts file for each site.
A meta job specification, detailing the username and executable name for each site, should be created.
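As a concrete illustration (user, hostnames, and paths are hypothetical, and the ~/.server_apps format is assumed to be one permitted executable path per line), the per-site files would contain entries such as the following.

~/.rhosts, granting the submitting host and user access:

    submit.example.edu jsmith

~/.server_apps, listing executables the ch_p4 server may launch:

    /home/jsmith/a.out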

TORQUE for job submission

Maui for job scheduling

1 Overview
1.1 Queue Structure
1.2 Node-Queue Matrix
2 Job Submission
2.1 Example Interactive Job
2.2 Example Script
3 Job Control
4 Examples
4.1 Serpent2
4.2 Nuclear data
4.3 MCNPX
4.4 MCNP5
4.4.1 MPI Only
4.4.2 OpenMP Only
4.4.3 MPI and OpenMP
4.5 MCNP6.1
4.6 MCNP6.2
4.7 MCNP: Delete unneeded runtapes
4.8 Scale
4.9 Advantg
5 FAQ
5.1 How can I set up a unique temporary directory for my job?
5.2 I'm not getting error/output files!
5.3 How can I request different CPU counts on different nodes/How can I use multiple queues?
5.4 How can I submit a job to a specific node?
5.5 How can I ensure that I have enough local disk space (in /tmp)?
5.6 I messed up my node allocation request! How do I fix it?
5.7 Where are my jobs?
5.8 Admin stuff
