Quantizable Model and Quantized Inference

One of the main challenges in deploying our Neural Machine Translation model to our interactive production translation service is that it is computationally intensive at inference, which makes low-latency translation difficult and high-volume deployment computationally expensive. Quantized inference using reduced-precision arithmetic is one technique that can significantly reduce the cost of inference for these models, often providing efficiency improvements on the same computational devices. For example, in [43], it is demonstrated that a convolutional neural network model can be sped up by a factor of 4-6 with minimal loss of classification accuracy on the ILSVRC-12 benchmark. In [27], it is demonstrated that neural network model weights can be quantized to only three states: -1, 0, and +1.
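
To make the three-state idea concrete, the following is a minimal sketch of threshold-based weight ternarization. It is an illustration under stated assumptions, not the procedure of [27] or of the translation model discussed here: the function name `ternarize`, the `threshold_ratio` default of 0.7, and the sample weights are all hypothetical choices made for this example.

```python
def ternarize(weights, threshold_ratio=0.7):
    """Map each weight to -1, 0, or +1 and return (ternary, scale).

    Assumption: the threshold is a fixed fraction of the mean absolute
    weight. This is one simple heuristic, chosen for illustration only.
    """
    if not weights:
        return [], 0.0
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    threshold = threshold_ratio * mean_abs
    # Weights with small magnitude collapse to 0; the rest keep only their sign.
    ternary = [0 if abs(w) <= threshold else (1 if w > 0 else -1)
               for w in weights]
    # A single scale factor (mean magnitude of the surviving weights)
    # lets scale * ternary approximate the original weights.
    kept = [abs(w) for w, t in zip(weights, ternary) if t != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return ternary, scale


weights = [0.9, -0.05, 0.4, -0.8, 0.02]   # hypothetical sample weights
ternary, scale = ternarize(weights)       # ternary == [1, 0, 1, -1, 0]
approx = [scale * t for t in ternary]     # coarse reconstruction of weights
```

Each stored weight then needs only two bits plus one shared scale factor, which suggests why such representations can reduce memory traffic and arithmetic cost at inference time.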


Maruyama Lecture Materials Collection (2014 -- 2018)