Currently, inference deployment on resource-constrained devices faces several challenges [1]:
The floating-point value distribution of intermediate-layer feature maps [2] in mainstream networks is as follows: tensors that already contain zeros are sparse and their computation can be optimized directly, while zero-free matrices can first be sparsified and then optimized (a sketch of this routing follows).
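A minimal sketch of that routing, assuming PyTorch tensors; the 0.5 sparsity threshold and the magnitude-based pruning ratio are illustrative assumptions, not values from the cited papers:

```python
import torch

def route_feature_map(x: torch.Tensor, sparsity_threshold: float = 0.5,
                      prune_ratio: float = 0.3) -> torch.Tensor:
    """Send a feature map down the sparse path, sparsifying it first if dense."""
    sparsity = (x == 0).float().mean().item()   # fraction of exact zeros
    if sparsity < sparsity_threshold:
        # Dense case: zero out the smallest-magnitude values first.
        k = max(1, int(x.numel() * prune_ratio))
        cutoff = x.abs().flatten().kthvalue(k).values
        x = torch.where(x.abs() <= cutoff, torch.zeros_like(x), x)
    # A sparse layout lets downstream kernels skip the zeros.
    return x.to_sparse()
```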
For image information:
In summary, different data call for different inference strategies.
By decision level, these strategies fall into the following four categories:
The main optimization decisions include Early Exit, Skip, Select, and Quantization (Quant); an early-exit example is sketched below.
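As an illustration of one such decision, a minimal early-exit sketch, assuming a PyTorch backbone split into two stages with an auxiliary exit head; the stage split, the per-sample decision, and the 0.9 confidence threshold are illustrative assumptions:

```python
import torch
import torch.nn as nn

class EarlyExitNet(nn.Module):
    """Backbone split into two stages, with an exit head after stage 1."""
    def __init__(self, stage1: nn.Module, stage2: nn.Module,
                 exit_head: nn.Module, final_head: nn.Module,
                 threshold: float = 0.9):
        super().__init__()
        self.stage1, self.stage2 = stage1, stage2
        self.exit_head, self.final_head = exit_head, final_head
        self.threshold = threshold  # confidence needed to stop early

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Per-sample decision; this sketch assumes batch size 1.
        h = self.stage1(x)
        early = self.exit_head(h)
        if early.softmax(dim=-1).max().item() >= self.threshold:
            return early                         # confident: exit early
        return self.final_head(self.stage2(h))   # otherwise run the full net
```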
Sources of perception data:
Note: collecting good data is the prerequisite for dynamic decision-making.
DHN is a non-intrusive, lightweight network structure for inference engines that assists the main network in making dynamic optimization decisions; its lifecycle is as follows:
Case 1: with ResNet-18 as the backbone network and LC-Net [6] as the hook network (used to skip residual blocks), the added parameters relative to the original model (one FC layer) are negligible, accuracy improves by 0.65%, and FLOPs drop by more than 3× (the gating idea is sketched below).
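A minimal sketch of that gating scheme in PyTorch; the single-FC gate matches the one-FC-layer overhead described above, but the global-average-pooling summary, the 0.5 threshold, and the whole-batch decision are illustrative assumptions (LC-Net's exact gate design differs):

```python
import torch
import torch.nn as nn

class GatedResidualBlock(nn.Module):
    """Wrap a residual block with a one-FC-layer gate that may skip its body."""
    def __init__(self, block: nn.Module, in_channels: int):
        super().__init__()
        self.block = block                      # the residual branch
        self.gate = nn.Linear(in_channels, 1)   # the single extra FC layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Cheap summary of the input: global average pooling over H and W.
        summary = x.mean(dim=(2, 3))            # (N, C)
        # One decision for the whole batch here; per-sample in the paper.
        keep = torch.sigmoid(self.gate(summary)).mean() > 0.5
        if not keep:
            return x                            # skip: identity shortcut only
        return x + self.block(x)                # keep: full residual compute
```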
Case 2: DQNet uses a Bit-Controller [11] as the hook network, applying dynamic quantization to control the bit-width of each layer's parameters (sketched below):
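A minimal sketch of per-layer dynamic bit-width selection; the candidate bit-widths, the pooled-feature controller input, and the fake-quantization step are illustrative assumptions (a real deployment would dispatch to genuinely quantized kernels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BIT_CHOICES = (4, 8)  # candidate bit-widths (illustrative)

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fake-quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

class BitController(nn.Module):
    """Pick a bit-width per layer from a cheap summary of that layer's input."""
    def __init__(self, in_channels: int):
        super().__init__()
        self.fc = nn.Linear(in_channels, len(BIT_CHOICES))

    def forward(self, x: torch.Tensor) -> int:
        summary = x.mean(dim=(0, 2, 3))          # pooled input statistics (C,)
        return BIT_CHOICES[int(self.fc(summary).argmax())]

def quantized_conv(conv: nn.Conv2d, ctrl: BitController,
                   x: torch.Tensor) -> torch.Tensor:
    bits = ctrl(x)                               # instance-aware decision
    w = fake_quantize(conv.weight, bits)
    return F.conv2d(x, w, conv.bias, conv.stride,
                    conv.padding, conv.dilation, conv.groups)
```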
Most existing work modifies the model structure and trains the main and auxiliary networks jointly; implementing a dynamic hook network inside the inference engine instead raises the following difficulties (an engine-side hook point is sketched below):
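Independent of how those difficulties are resolved, a minimal sketch of what an engine-side hook point could look like, assuming a simple sequential executor; the `Hook` protocol, action strings, and node representation are hypothetical, not libonnx API:

```python
from typing import Any, Callable, Protocol

class Hook(Protocol):
    """A hook inspects a node's input and returns an action for the engine."""
    def decide(self, node_name: str, tensor: Any) -> str:
        ...  # one of "run", "skip", "exit"

def run_graph(nodes: list[tuple[str, Callable]], x: Any, hook: Hook) -> Any:
    """Sequential executor that consults the hook before every node.
    The main graph is untouched; the hook is attached from outside."""
    for name, op in nodes:
        action = hook.decide(name, x)
        if action == "exit":
            break          # early exit: stop and return the current tensor
        if action == "skip":
            continue       # skip this node entirely
        x = op(x)          # default: execute the node
    return x
```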
Mainstream ONNX operators and some networks are already supported; the next step is to continue building on libonnx and darknet operators and to design a lightweight model format with reference to .tflite (see the format sketch after this list):
libonnx
darknet
.tflite
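A minimal sketch of what a lightweight, .tflite-inspired flat model container might look like; the magic string, field layout, and little-endian length-prefixed encoding are assumptions for illustration, not a finalized format:

```python
import struct

MAGIC = b"DHN0"  # hypothetical 4-byte magic for the format

def pack_model(tensors: dict[str, bytes]) -> bytes:
    """Serialize named weight blobs into one flat buffer:
    [magic][count][per tensor: name_len, name, data_len, data]."""
    out = [MAGIC, struct.pack("<I", len(tensors))]
    for name, data in tensors.items():
        encoded = name.encode("utf-8")
        out.append(struct.pack("<I", len(encoded)))
        out.append(encoded)
        out.append(struct.pack("<I", len(data)))
        out.append(data)
    return b"".join(out)

def unpack_model(buf: bytes) -> dict[str, bytes]:
    """Inverse of pack_model; blobs can be memory-mapped in a real engine."""
    assert buf[:4] == MAGIC, "not a model file"
    count, = struct.unpack_from("<I", buf, 4)
    offset, tensors = 8, {}
    for _ in range(count):
        name_len, = struct.unpack_from("<I", buf, offset); offset += 4
        name = buf[offset:offset + name_len].decode("utf-8"); offset += name_len
        data_len, = struct.unpack_from("<I", buf, offset); offset += 4
        tensors[name] = buf[offset:offset + data_len]; offset += data_len
    return tensors
```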
[1] Adapting Neural Networks at Runtime: Current Trends in At-Runtime Optimizations for Deep Learning
[2] Dynamic Runtime Feature Map Pruning
[3] Convolutional Networks with Adaptive Inference Graphs
[4] DAQ: Channel-Wise Distribution-Aware Quantization for Deep Image Super-Resolution Networks
[5] PAME: Precision-Aware Multi-Exit DNN Serving for Reducing Latencies of Batched Inferences
[6] Fully Dynamic Inference with Deep Neural Networks
[7] ZeroBN: Learning Compact Neural Networks for Latency-Critical Edge Systems
[8] FalCon: Fine-Grained Feature Map Sparsity Computing with Decomposed Convolutions for Inference Optimization
[9] Dynamic Convolutions: Exploiting Spatial Sparsity for Faster Inference
[10] CAMixerSR: Only Details Need More "Attention"
[11] Instance-Aware Dynamic Neural Network Quantization
[12] nn-Meter: Towards Accurate Latency Prediction of Deep-Learning Model Inference on Diverse Edge Devices
[13] HAQ: Hardware-Aware Automated Quantization with Mixed Precision
[14] ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware