D-STACK: high throughput DNN inference by effective multiplexing and spatio-temporal scheduling of GPUs


dc.contributor.author Dhakal, Aditya
dc.contributor.author Kulkarni, Sameer G.
dc.contributor.author Ramakrishnan, K. K.
dc.coverage.spatial United States of America
dc.date.accessioned 2024-10-30T10:20:31Z
dc.date.available 2024-10-30T10:20:31Z
dc.date.issued 2024-10
dc.identifier.citation Dhakal, Aditya; Kulkarni, Sameer G. and Ramakrishnan, K. K., "D-STACK: high throughput DNN inference by effective multiplexing and spatio-temporal scheduling of GPUs", IEEE Transactions on Cloud Computing, DOI: 10.1109/TCC.2024.3476210, vol. 12, no. 04, pp. 1344-1358, Oct. 2024.
dc.identifier.issn 2168-7161
dc.identifier.uri https://doi.org/10.1109/TCC.2024.3476210
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/10653
dc.description.abstract Hardware accelerators such as GPUs are required for real-time, low-latency inference with Deep Neural Networks (DNNs). Providing inference services in the cloud can be resource intensive, and effectively utilizing accelerators in the cloud is important. Spatial multiplexing of the GPU, while limiting the GPU resources (GPU%) allotted to each DNN to the right amount, leads to higher GPU utilization and higher inference throughput. Right-sizing the GPU allocation for each DNN, optimally batching requests to balance throughput and service level objectives (SLOs), and maximizing throughput by appropriately scheduling DNNs are still significant challenges. This paper introduces a dynamic and fair spatio-temporal scheduler (D-STACK) that enables multiple DNNs to run in the GPU concurrently. We develop and validate a model that estimates the parallelism each DNN can utilize, and a lightweight optimization formulation to find an efficient batch size for each DNN. Our holistic inference framework provides high throughput while meeting application SLOs. We compare D-STACK with other GPU multiplexing and scheduling methods (e.g., NVIDIA Triton, Clipper, Nexus) using popular DNN models. Our controlled experiments with multiplexing several popular DNN models achieve up to 1.6× improvement in GPU utilization and up to 4× improvement in inference throughput.
dc.description.statementofresponsibility by Aditya Dhakal, Sameer G. Kulkarni and K. K. Ramakrishnan
dc.format.extent vol. 12, no. 04, pp. 1344-1358
dc.language.iso en_US
dc.publisher Institute of Electrical and Electronics Engineers (IEEE)
dc.subject Datasets
dc.subject Neural networks
dc.subject Gaze detection
dc.subject Text tagging
dc.title D-STACK: high throughput DNN inference by effective multiplexing and spatio-temporal scheduling of GPUs
dc.type Article
dc.relation.journal IEEE Transactions on Cloud Computing

