D-STACK: high throughput DNN inference by effective multiplexing and spatio-temporal scheduling of GPUs

dc.contributor.author	Dhakal, Aditya
dc.contributor.author	Kulkarni, Sameer G.
dc.contributor.author	Ramakrishnan, K. K.
dc.coverage.spatial	United States of America
dc.date.accessioned	2023-05-17T08:16:05Z
dc.date.available	2023-05-17T08:16:05Z
dc.date.issued	2023-03
dc.identifier.citation	Dhakal, Aditya; Kulkarni, Sameer G. and Ramakrishnan, K. K., "D-STACK: high throughput DNN inference by effective multiplexing and spatio-temporal scheduling of GPUs", arXiv, Cornell University Library, DOI: arXiv:2304.13541v1, Mar. 2023.
dc.identifier.uri	https://arxiv.org/abs/2304.13541c
dc.identifier.uri	https://repository.iitgn.ac.in/handle/123456789/8812
dc.description.abstract	Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNN). However, due to the inherent limits to the parallelism they can exploit, DNNs often under-utilize the capacity of today's high-end accelerators. Although spatial multiplexing of the GPU leads to higher GPU utilization and higher inference throughput, there remain a number of challenges. Finding the GPU percentage for right-sizing the GPU for each DNN through profiling, determining an optimal batching of requests to balance throughput improvement while meeting application-specific deadlines and service level objectives (SLOs), and maximizing throughput by appropriately scheduling DNNs are still significant challenges. This paper introduces a dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs to run in the GPU concurrently. To help allocate the appropriate GPU percentage (we call it the "Knee"), we develop and validate a model that estimates the parallelism each DNN can utilize. We also develop a lightweight optimization formulation to find an efficient batch size for each DNN operating with D-STACK. We bring together our optimizations and our spatio-temporal scheduler to provide a holistic inference framework. We demonstrate its ability to provide high throughput while meeting application SLOs. We compare D-STACK with an ideal scheduler that can allocate the right GPU percentage for every DNN kernel. D-STACK achieves more than 90 percent of the throughput and GPU utilization of the ideal scheduler. We also compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus), using popular DNN models. Our controlled experiments with multiplexing several popular DNN models achieve up to 1.6X improvement in GPU utilization and up to 4X improvement in inference throughput.
dc.description.statementofresponsibility	by Aditya Dhakal, Sameer G. Kulkarni and K. K. Ramakrishnan
dc.language.iso	en_US
dc.publisher	Cornell University Library
dc.subject	D-STACK
dc.subject	DNN inference
dc.subject	Multiplexing
dc.subject	Spatio-temporal scheduling
dc.subject	GPUs
dc.title	D-STACK: high throughput DNN inference by effective multiplexing and spatio-temporal scheduling of GPUs
dc.type	Article
dc.relation.journal	arXiv

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following Collection(s)

E-print Articles [183]
