COMPRIZE: assessing the fusion of quantization and compression on DNN hardware accelerators


dc.contributor.author Patel, Vrajesh
dc.contributor.author Shah, Neel
dc.contributor.author Krishna, Aravind
dc.contributor.author Issac, Tom Glint
dc.contributor.author Ronak, Abdul
dc.contributor.author Mekie, Joycee
dc.contributor.other 37th International Conference on VLSI Design and 23rd International Conference on Embedded Systems (VLSID 2024)
dc.coverage.spatial India
dc.date.accessioned 2024-04-25T14:47:03Z
dc.date.available 2024-04-25T14:47:03Z
dc.date.issued 2024-01-06
dc.identifier.citation Patel, Vrajesh; Shah, Neel; Krishna, Aravind; Issac, Tom Glint; Ronak, Abdul and Mekie, Joycee, "COMPRIZE: assessing the fusion of quantization and compression on DNN hardware accelerators", in the 37th International Conference on VLSI Design and 23rd International Conference on Embedded Systems (VLSID 2024), Kolkata, IN, Jan. 06-10, 2024.
dc.identifier.uri https://ieeexplore.ieee.org/document/10483375
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/9991
dc.description.abstract The rapid advancement of complex Deep Neural Network (DNN) models, along with the availability of large amounts of training data, has created a huge demand for computational resources. When these novel workloads are offloaded to general-purpose computing cores, significant challenges arise in memory utilization and power consumption. Consequently, a diverse array of methodologies has been investigated to address these issues, among which DNN accelerator architectures have emerged as a prominent solution. In this work, we propose performing data-aware computation when inferencing various workloads, leading to a highly optimized DNN accelerator. Recent research advocates using an Approximate Fixed Point Posit (quantized) representation (ApproxPOS) for inferencing workloads to enhance system performance. We perform compression along with quantization for inferencing AlexNet, ResNet, and LeNet, and show system-level benefits in terms of latency and energy. We selectively compress and decompress the inputs and outputs of each layer of the workload based on its sparsity. The model used for inferencing is trained in a quantization-aware manner by modifying the PyTorch framework. Our system-level analysis shows that, when performing data-aware computation under a fixed area budget, the proposed implementation on average consumes ∼15.4×, ∼11.6×, and ∼3.5× less energy for AlexNet, ResNet, and LeNet, respectively, and achieves on average ∼2× speedup compared to the FP32 baseline. The area overhead of the additional circuitry for compression and decompression is negligible (within 0.5%), since it requires only an additional register and a counter. We demonstrate our work on the Simba architecture, and the approach is extendable to any other accelerator.
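
The abstract states that compression and decompression need only an additional register and a counter, which is consistent with a simple zero run-length encoding (ZRLE) applied selectively per layer. The Python sketch below is a minimal software model of such a scheme, assuming ZRLE as the compression method; the scheme choice, the SPARSITY_THRESHOLD value, and all function names are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

# Illustrative threshold (assumption, not from the paper): compress a
# tensor only if it is sparse enough that ZRLE actually shrinks it.
SPARSITY_THRESHOLD = 0.5

def sparsity(tensor: np.ndarray) -> float:
    """Fraction of zero elements in the tensor."""
    return float(np.count_nonzero(tensor == 0)) / tensor.size

def zrle_compress(flat):
    """Zero run-length encode: emit (zero_run_length, value) pairs.
    In hardware this maps onto a counter (the run length) and a
    register (the pending nonzero value)."""
    out, run = [], 0
    for v in flat:
        if v == 0:
            run += 1
        else:
            out.append((run, v))
            run = 0
    if run:
        out.append((run, 0))  # trailing run of zeros, sentinel value 0
    return out

def zrle_decompress(pairs, size):
    """Invert zrle_compress back to a flat list of `size` elements."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        if len(out) < size:
            out.append(v)
    return out[:size]

def maybe_compress(tensor: np.ndarray):
    """Selectively compress a layer's inputs/outputs based on sparsity."""
    if sparsity(tensor) >= SPARSITY_THRESHOLD:
        return ("zrle", zrle_compress(tensor.ravel().tolist()))
    return ("raw", tensor)  # dense layers are cheaper left uncompressed
```

A quick round-trip check on this sketch: zrle_decompress(zrle_compress(x), len(x)) == x for any flat list x, while maybe_compress leaves dense tensors untouched, so the compression overhead is paid only where a layer's sparsity makes it worthwhile.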
dc.description.statementofresponsibility by Vrajesh Patel, Neel Shah, Aravind Krishna, Tom Glint Issac, Abdul Ronak and Joycee Mekie
dc.language.iso en_US
dc.publisher Institute of Electrical and Electronics Engineers (IEEE)
dc.subject DNN accelerators
dc.subject Neural networks
dc.subject Quantization
dc.subject Compression
dc.title COMPRIZE: assessing the fusion of quantization and compression on DNN hardware accelerators
dc.type Conference Paper

