Why GPU Clusters Don't Need to Go Brrr: Leveraging Compound Sparsity to Achieve the Fastest Inference Performance on CPUs
Date:
Tuesday, October 4, 2022
Summary:
In this session, the power of compound sparsity for model compression and inference speedup will be demonstrated on NLP (HuggingFace BERT) and CV (YOLOv5) applications. The open-source library SparseML will be used to apply compound sparsity to dense models, combining techniques including structured and unstructured pruning (to 90%+ sparsity), quantization, and knowledge distillation. After sparsification, these models will be run on the DeepSparse engine, which is optimized to execute sparse graphs on CPU hardware at GPU-class speeds. Session participants will learn how to apply compound sparsity so they can run inference an order of magnitude faster than with the original dense models, without a noticeable drop in accuracy.
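For a feel of the workflow, here is a minimal sketch using the SparseML PyTorch integration (ScheduledModifierManager) to apply a pruning recipe during training and the DeepSparse Pipeline API to run a sparse-quantized BERT on CPU. The recipe values, the toy model, and the SparseZoo stub are illustrative placeholders, not the exact models or recipes used in the session, and argument names may differ between library versions.

```python
# Sketch of the SparseML -> DeepSparse workflow (placeholder recipe, toy model,
# and an example SparseZoo stub; see the SparseML/DeepSparse docs for real ones).
from pathlib import Path

import torch
from sparseml.pytorch.optim import ScheduledModifierManager
from deepsparse import Pipeline

# 1) Sparsify a dense PyTorch model with a gradual magnitude-pruning recipe.
recipe = """
modifiers:
  - !GMPruningModifier
    init_sparsity: 0.05
    final_sparsity: 0.90
    start_epoch: 0.0
    end_epoch: 2.0
    update_frequency: 1.0
    params: __ALL_PRUNABLE__
"""
Path("recipe.yaml").write_text(recipe)

# Toy stand-in for a real dense model such as BERT or YOLOv5.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# The manager wraps the optimizer so pruning steps fire during training.
manager = ScheduledModifierManager.from_yaml("recipe.yaml")
optimizer = manager.modify(model, optimizer, steps_per_epoch=10)

for step in range(30):  # stand-in training loop with random data
    loss = model(torch.randn(8, 128)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
manager.finalize(model)

# 2) Run a sparse-quantized BERT on CPU with the DeepSparse engine.
#    The "zoo:" stub is an example path; any compatible ONNX model also works.
qa = Pipeline.create(
    task="question-answering",
    model_path=(
        "zoo:nlp/question_answering/bert-base/pytorch/huggingface/"
        "squad/pruned_quant-aggressive_95"
    ),
)
print(qa(
    question="What does compound sparsity combine?",
    context="Compound sparsity combines pruning, quantization, and distillation.",
))
```

In practice the recipe also carries quantization and distillation modifiers, and the trained model is exported to ONNX before being handed to DeepSparse; the session walks through those steps in full.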