Optimizing models for Larq Compute Engine¶
Larq Compute Engine is capable of running any Larq model, but some model architectures will perform better than others. These are some important guidelines to follow when designing models to be run on LCE, so as to ensure that inference is as fast as possible.
This guide assumes that you are deploying Larq Compute Engine on an 64-bit ARM device, such as an Android phone or a Raspberry Pi 4 board (with 64-bit OS).
Use multiples of 32 input channels
The LCE binary ops (binary convolution and binary maxpool) operate on multiples of 32 input channels. Before a binary op, the input channels are padded out to a multiple of 32, which means that a binary convolution applied to an input with 16 channels and a binary convolution applied to an input with 32 channels will run equally fast.
Use multiples of 8 output channels
The LCE optimised binary convolution op computes the result of 8 output channels in parallel, so if the number of output channels is not a multiple of 8 there will be wasted computation.
BatchNormalizationto take advantage of op fusing instead of
This is not the convention used in TensorFlow Lite.
Op fusing is an optimisation performed by the converter, whereby a sequence of stand-alone ops can be fused into an equivalent single op that can be executed more quickly than the original sequence.
In the TensorFlow Lite converter, the layer sequence
ReLUgets fused together into a single op. This makes sense, because the multiplier and addition of the batch-norm can be fused into the weights and bias of the convolution without any additional computation, and then the activation function can be applied at the end.
However, in a binary convolution it is not possible to fuse the (full-precision) batch-norm multiplier into the (binary) convolution weights, and so there's no speed advantage to performing the batch-norm before the ReLU. Conversely, there is a possible advantage to performing the ReLU before the batch-norm for a sequence of binary convolutions: if the output of a ReLU were fed directly into a subsequent binary convolution, the input would be quantised to constant ones and all information would be destroyed. As a result, the LCE converter supports op fusing of the layer sequence
Note that a batch-norm can still be fused into a
QuantConv2Dlayer without a ReLU between them; however, a subsequent ReLU would not also be fused.
Avoid using Larq
QuantConv2Dlayers with 'same-zero' padding
Conv2Dlayer supports either "valid" padding (no input padding) or "same" padding (padding the input with zeroes so that the output has the same spatial resolution). The Larq
QuantConv2Dlayer also supports "valid" and "same" padding, but additionally has an attribute
pad_valuesthat controls the padding constant that is used in the "same" padding case. We refer to the combination of
pad_values=0.0(the default, to match the
Conv2Dlayer) as 'same-zero' padding, and the combination with
pad_values=1.0as 'same-one' padding.
It is strongly recommended to use 'same-one' padding instead of 'same-zero' padding; for an example of models using 'same-one' padding, see the QuickNet models in Larq Zoo.
With 'same-zero' padding, because zero cannot be naturally represented in binary data (which uses a single bit to represent either +1 or -1) a zero-padding correction has to be added to values at the edge of the output tensor in order to give the correct result. This extra computation can be avoided by using 'same-one' padding.
Additionally, 'same-zero' padding prevents the converter op fusing optimisations described above. These optimisations are not affected by 'same-one' padding.
QuantDenselayers with 1×1
LCE does not yet include an optimised binary dense op, which means that Larq
QuantDenselayers are not accelerated in any way. The recommended workaround for this is to replace all
QuantDenselayers with an equivalent sequence of reshapes and a 1×1
# Replace this... x = lq.layers.QuantDense(num_dense_outputs, ...)(x) # ...with this x = tf.keras.layers.Reshape((1, 1, -1))(x) x = lq.layers.QuantConv2D(num_dense_outputs, kernel_size=(1, 1), ...)(x) x = tf.keras.layers.Reshape((-1,))(x)
input_channelsis a multiple of 128
Inside the optimised binary convolution op, an
im2colprocedure is applied to the input tensor, which reduces the convolution operation to a (binary) matrix multiplication. The
im2col-transformed input has a depth of size
input_channels, which is padded to a multiple of 128. If this value is already a multiple of 128, no padding is necessary, and so there is no unnecessary extra computation.
This constraint is less important than the others in this guide; it is also hard to satisfy exactly, especially with common odd kernel sizes such as 1×1 or 3×3 combined with low input channel counts. Instead, you can minimise inefficiency here by aiming for
input_channelsto be as close as possible to the next greatest multiple of 128.