HU Credits:
3
Degree/Cycle:
2nd degree (Master)
Responsible Department:
Computer Science
Semester:
1st Semester
Teaching Languages:
Hebrew
Campus:
E. Safra
Course/Module Coordinator:
Dr. Oded Schwartz
Coordinator Office Hours:
Wednesday, 14:00-15:00
Teaching Staff:
Dr. Oded Schwartz
Course/Module description:
Supercomputers are composed of hundreds of thousands to millions of CPUs. They play a major role in many areas: many internet services rely on this technology, and supercomputers are also used for scientific computing and simulations in academia and industry. A major problem at the frontier of computer science is that existing algorithms may not be suited to supercomputers, due to two main challenges:
1) How to achieve high performance? In particular, how to deal with the major bottleneck: the cost of communication between processors and within the memory hierarchy. Communication costs orders of magnitude more time and energy than arithmetic computation, and judging by hardware trends, its share of the total cost is likely to increase further. Hence the need for communication-minimizing algorithms.
2) How to ensure fault tolerance? Fault tolerance is required to handle computation errors: as machines grow larger and operating voltages decrease, soft errors (bit flips) and hard errors (component failures) are expected to become more common. One has to make sure that if a CPU or GPU breaks down, the computed results remain trustworthy. This requires algorithms with fault detection and fault correction capabilities.
We will survey these challenges from theoretical and practical perspectives, as they apply to various computations, such as matrix multiplication and linear algebra, FFT, sorting, graph algorithms, and dynamic programming.
Graduate students and third-year undergraduates are welcome.
Prerequisite: Algorithms (67504).
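To make the communication bottleneck concrete, here is a minimal illustrative sketch (not part of the official course material): classical matrix multiplication written once in the naive loop order and once in a blocked (tiled) order. Both perform the same arithmetic; the blocked version reuses each tile while it resides in fast memory, so it moves far fewer words between cache and main memory. The matrix size N, the tile size B, and the all-ones inputs are arbitrary illustrative choices.

#include <stdio.h>

#define N 512
#define B 64   /* assumed tile size: three B x B tiles of doubles fit in cache */

static double A[N][N], Bmat[N][N], C[N][N];

/* Naive i-j-k loop order: when the matrices do not fit in cache, roughly
   O(N^3) words move between main memory and cache. */
void matmul_naive(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * Bmat[k][j];
}

/* Blocked (tiled) version: the same 2*N^3 flops, but each B x B tile is
   reused while resident in cache, so only about O(N^3 / B) words move,
   i.e. O(N^3 / sqrt(M)) for a fast memory of size M, which matches the
   communication lower bound for classical matrix multiplication. */
void matmul_blocked(void) {
    for (int ii = 0; ii < N; ii += B)
        for (int jj = 0; jj < N; jj += B)
            for (int kk = 0; kk < N; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int j = jj; j < jj + B; j++)
                        for (int k = kk; k < kk + B; k++)
                            C[i][j] += A[i][k] * Bmat[k][j];
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            Bmat[i][j] = 1.0;
        }
    matmul_blocked();                      /* swap in matmul_naive() to compare timings */
    printf("C[0][0] = %.0f\n", C[0][0]);   /* expected: 512 */
    return 0;
}

Since N is a multiple of B, no remainder handling is needed; a production kernel would also tune B to the actual cache sizes.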
Course/Module aims:
Learning outcomes - On successful completion of this module, students should be able to:
Use various algorithmic techniques for obtaining high performance on parallel and sequential machines; prove lower bounds on the communication costs of several classes of computations; implement sequential and parallel algorithms.
Attendance requirements (%):
Teaching arrangement and method of instruction:
Course/Module Content:
1. Models of computation.
a. Hardware trends and communication costs.
b. Models: sequential (two-level, hierarchical), distributed (homogeneous, BSP-like with a local memory bound, PRAM), shared memory.
c. Communication minimizing algorithms: basic examples.
2. Communication-minimizing algorithms
a. Sequential classical matrix multiplication: blocked, recursive
b. Parallel classical matrix multiplication: 2D (Cannon, SUMMA), 3D, 2.5D, BFS-DFS
c. Dense linear algebra: sequential and parallel LU, Cholesky, QR and other decompositions.
d. Sparse numerical linear algebra
e. Weak scaling, strong scaling, perfect strong scaling.
f. Parallel graph algorithms
g. Dynamic programming
h. N-Body problem
3. Communication cost lower bounds: sequential and parallel.
a. Reductions
b. Geometric surface-to-volume bounds: the Loomis–Whitney, Brascamp–Lieb, and Hölder inequalities and their applications to matrix multiplication, direct numerical linear algebra, and nested loops.
c. Graph analyses: dominator sets, edge expansion, path routing, and their applications to the Cooley–Tukey FFT, Strassen's matrix multiplication, and other algorithms.
4. Obliviousness to resources
a. Tuning, auto-tuning and obliviousness
b. Cache-oblivious algorithms; cache eviction policies.
c. Data layouts: row and column major, blocked, recursive.
d. Distributed data layouts, dynamic data layouts
e. Data structures for cache obliviousness
f. Network and processor obliviousness
5. Networks
a. Common networks: in chips, clusters, and supercomputers
b. Universal networks
c. 2.5D algorithms revisited
d. Topology-dependent lower bounds.
6. Resilient computing
a. Hardware trends
b. Fault tolerance
c. Fault tolerance by error-correcting codes
7. Non-volatile memories
a. Read-write asymmetry in costs and durability
b. Write spreading algorithms
c. Write lower bounds
d. Write minimizing algorithms
8. Sorting Algorithms
a. Parallelizations of sequential algorithms
b. Funnel-sort
9. Bilinear algorithms
a. Practical fast matrix multiplication, exact and approximate: Strassen-Winograd, Bini, Pan, Hopcroft-Kerr, Smirnov
b. Impractical fast matrix multiplication
c. Yuster-Zwick fast sparse matrix multiplication
10. Fast Fourier Transform
a. Cooley–Tukey
b. Lower bounds
c. Sequential and Parallel implementations
11. Hands-on (a subset of the following):
a. Distributed memory machines: MPI (see the minimal sketch after this content outline)
b. Shared memory machines: OpenMP and Threads
c. GPU: CUDA, CULA, OpenCL
d. Debugging and optimization tools.
e. The roofline model
f. LAPACK and ScaLAPACK
g. FFTW
12. Functions of matrices: Newton's method, inverse, root, and sign.
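As a taste of the hands-on part (item 11a above), the following is a minimal MPI sketch; it is an assumed warm-up example, not an actual course exercise. Each process computes a partial dot product of its local slice of two vectors, and a single collective (MPI_Allreduce) combines the partial sums, so the only communication is one reduction of a single word per process. The vector length n and the placeholder values are arbitrary.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process owns a slice of n elements; only local data is touched here. */
    const int n = 1 << 20;
    double local = 0.0;
    for (int i = 0; i < n; i++) {
        double x = 1.0, y = 2.0;   /* placeholder vector entries */
        local += x * y;
    }

    /* One collective reduction: O(log p) rounds, one word per process. */
    double global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("dot product over %d processes: %g\n", nprocs, global);

    MPI_Finalize();
    return 0;
}

Compile with mpicc and launch with mpirun; because the communication pattern is a single collective, its cost is easy to account for in the models listed under item 1.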
Required Reading:
Will be detailed in class.
Additional Reading Material:
Grading Scheme:
Additional information: