Main Authors: Arezoo Jahani (University of Tabriz), Marco Lattuada (Politecnico di Milano)

Additional Authors: Michele Ciavotta (Università di Milano Bicocca), Danilo Ardagna and Edoardo Amaldi (Politecnico di Milano), Li Zhang (IBM Research)

Focus Area: 

Cloud Computing

Who stands to benefit and how: 

Cloud end users accessing GPUs on demand services

Position Paper: 

Recently, deep learning (DL) methods have gained popularity in many sophisticated medical applications such as diagnostics and tumor detection. Among this class of methods, the most promising are Convolutional and Recurrent Neural Networks (CNNs, RNNs), which achieve near-human accuracy in many tasks. However, training such applications is a very compute-intensive task, so exploiting GPUs yields a 5 to 40x performance gain compared to CPUs.
Despite all these advantages [1], the cost of GPU-based systems is usually high [2]. High-end GPU-based servers like the NVIDIA DGX-2 cost up to 500k USD [3], while in public clouds the per-time-unit cost of GPU-based virtual machines (VMs) is 5-8x higher than that of high-end CPU-only VMs [4]. The efficient use of GPUs, and in particular the online joint problem of capacity planning for on-demand VMs and scheduling of DL training jobs, must therefore be addressed.
To solve this problem, we propose several Mixed Integer Linear Programming (MILP) formulations. Our solutions optimize operation costs by (i) right-sizing the VM capacity on each node, (ii) partitioning the set of GPUs among multiple concurrent jobs sharing the same VM, and (iii) determining a deadline-aware job schedule.
The joint capacity allocation and job scheduling problem is solved at each new job submission and at each job termination. In the former case, running jobs may be pushed back into a waiting queue and resumed only in the following time interval, or may be restarted with a different number of GPUs or on a different VM type. In the latter case, the released GPUs become free and the system must reassign them and manage all resources efficiently. Job execution times across different VM types and with different numbers of GPUs can be estimated by relying on our previous work [5, 6].
The proposed MILP formulations differ in the objective function used to reduce the total operation cost while scheduling jobs with minimum tardiness. To reach this goal, our first MILP model focuses on the first job that ends on each node. The second model focuses on earliness, allocating resources to jobs so that their end times are as close as possible to their deadlines. Finally, the third formulation is a mix of the other two and tries to identify the right balance between a selfish assignment favoring the jobs that end first and the upper-bound cost identified by the earliness formulation.
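The cost terms these formulations trade off can be illustrated with a small sketch: VM time cost plus a penalty for tardiness (end time past the deadline). The job data, cost rate, and weight below are hypothetical examples, not values from the paper, and the function only evaluates a fixed candidate schedule rather than solving the MILP.

```python
# Toy illustration of the cost terms traded off by the objectives:
# VM time cost plus weighted tardiness (end time past the deadline).
# All job data, rates, and weights are hypothetical.

def schedule_cost(jobs, vm_cost_per_hour, tardiness_weight):
    """Cost of a candidate schedule: VM time cost plus weighted tardiness."""
    total = 0.0
    for job in jobs:
        end = job["start"] + job["duration"]
        tardiness = max(0.0, end - job["deadline"])
        total += job["duration"] * vm_cost_per_hour + tardiness_weight * tardiness
    return total

jobs = [
    {"start": 0.0, "duration": 2.0, "deadline": 3.0},  # ends at 2h, on time
    {"start": 1.0, "duration": 4.0, "deadline": 4.0},  # ends at 5h, 1h tardy
]
print(schedule_cost(jobs, vm_cost_per_hour=3.0, tardiness_weight=10.0))  # 28.0
```

An actual MILP solver would search over start times, VM types, and GPU assignments to minimize such a cost; this sketch only shows how a single candidate is scored.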
We demonstrate the effectiveness of our approach by applying our formulations to simulated scenarios and by testing them in a real prototype environment. A scalability analysis is also performed, showing that our model scales up to 5 nodes and 50 active jobs with solution times below 5 minutes.
The results show that our models always produce solutions with lower costs and lower tardiness than baseline methods such as First-In-First-Out, Earliest Deadline First, and Priority scheduling. Savings range on average between 40 and 70% and increase for larger systems and higher loads.
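The Earliest Deadline First baseline used in the comparison can be sketched as follows, simplified to a single node running one job at a time; the job data are hypothetical.

```python
# Minimal Earliest Deadline First (EDF) baseline: run jobs sequentially
# on a single node, always picking the pending job with the closest deadline.
# Job data are hypothetical.

def edf_schedule(jobs):
    """Return (job_id, start, end) tuples in EDF order on one node."""
    timeline, clock = [], 0.0
    for job in sorted(jobs, key=lambda j: j["deadline"]):
        start, end = clock, clock + job["duration"]
        timeline.append((job["id"], start, end))
        clock = end
    return timeline

jobs = [
    {"id": "a", "duration": 2.0, "deadline": 6.0},
    {"id": "b", "duration": 1.0, "deadline": 2.0},
]
print(edf_schedule(jobs))  # job "b" runs first: [('b', 0.0, 1.0), ('a', 1.0, 3.0)]
```

Unlike the MILP formulations, such a heuristic fixes the job order by a single rule and cannot trade VM cost against tardiness, which is where the reported savings come from.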

[1] J. R. Cheng and M. Gen, “Accelerating genetic algorithms with GPU computing: A selective overview,” Computers & Industrial Engineering, vol. 128, pp. 514–525, 2019.
[2] M. Cao, W. Jia, S. Li, Y. Li, L. Zheng, and X. Liu, “GPU-accelerated feature tracking for 3D reconstruction,” Optics & Laser Technology, vol. 110, pp. 165–175, 2019.
[3] “NVIDIA Tesla GPU servers (GPX),” online (visited on 05/03/2019), 2019.
[4] “NVIDIA virtual GPU technology,” online: us/design-visualization/technologies/virtual-gpu/ (visited on 01/03/2019), 2019.
[5] E. Gianniti, L. Zhang, and D. Ardagna, “Performance prediction of GPU-based deep learning applications,” in 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 167–170, IEEE, 2018.
[6] E. Gianniti, L. Zhang, and D. Ardagna, “Performance prediction of GPU-based deep learning applications,” in Proceedings of CLOSER 2019, pp. 279–286, 2019.