Structural Pruning via Latency-Saliency Knapsack

News

[Apr. 2023] Our paper "Hardware-Aware Latency Pruning for Read-Time 3D Object Detection" has been accepted to IEEE IV 2023.
[Nov. 2022] Code has been released.
[Oct. 2022] Code will be released later.
[Sep. 2022] Our paper has been accepted to NeurIPS 2022.

Abstract

We propose Hardware-Aware Latency Pruning (HALP) that formulates structural pruning as a global resource allocation optimization problem, aiming at maximizing the accuracy while constraining latency under a predefined budget on targeting device. For filter importance ranking, HALP leverages latency lookup table to track latency reduction potential and global saliency score to gauge accuracy drop. Both metrics can be evaluated very efficiently during pruning, allowing us to reformulate global structural pruning under a reward maximization problem given target constraint. This makes the problem solvable via our augmented knapsack solver, enabling HALP to surpass prior work in pruning efficacy and accuracy-efficiency trade-off. It consistently outperforms prior art by large margins (see NVIDIA Titan V speedups above, and more hardware results next and in paper).

Key Approach

The proposed hardware-aware latency pruning (HALP) paradigm. Considering both performance and latency contributions, HALP formulates global structural pruning as a global resource allocation problem (Section 3.1), solvable using our augmented Knapsack algorithm (Section 3.2). Pruned architectures surpass prior work across varying latency constraints given changing network architectures for both classification and detection tasks (Section 4).

Our HALP paradigm on latency-saliency Knapsack.

More Results

Our approach is fast and scalable across a wide range of target platforms for measured latency improvements. As an example, we show pruning results of ResNet50 on the ImageNet dataset with NVIDIA Jetson TX2 (left), Intel CPU Xeon E5 (middle) and NVIDIA Xavier (right). The latency on Jetson TX2 and CPU is measured using PyTorch; on Xavier is measured using TensorRT FP32. Please refer to the paper for more details.

Application on 3D object detection and algorithm improvement

We further show the efficacy of the method on 3D object detection task and improve the algorithm in the following works:

Hardware-Aware Latency Pruning for Read-Time 3D Object Detection (IEEE IV 2023).

Soft Masking for Cost-Constrained Channel Pruning (ECCV 2022), where we improve the pruning performance especially for the large prune ratios.

Adaptive Sharpness-Aware Pruning for Robust Sparse Networks, where we aim at improving the robustness of compact models.

Hardware-Aware Latency Pruning for Read-Time 3D Object Detection

We present experiments on pruning for multi-camera 3D object detection and show the effectiveness of train large, then compress in the context of 3D object detection. For comparison, we use three different architectures (compact architecture search by NAS, ResNet18 and ResNet50) for the backbones and the encoder. We observe that:

1. Light pruning these architectures all yield a significant speedup with a similar accuracy compared to their dense counterpart;

2. Pruning a larger model to match the latency of a smaller one leads to better performance;

3. The performance of pruning a much larger model might be constrained by the model depth. Channel pruning should be considered to benefit the latency.

NeurIPS 2022

Hardware-Aware Latency Pruning for Read-Time 3D Object Detection