# Optimizing models for Larq Compute Engine
Larq Compute Engine is capable of running any Larq model, but some model architectures will perform better than others. The following are important guidelines to follow when designing models to run on LCE, to ensure that inference is as fast as possible.
**Important:** This guide assumes that you are deploying Larq Compute Engine on a 64-bit ARM device, such as an Android phone or a Raspberry Pi 4 board (with a 64-bit OS).
-   **Use multiples of 32 input channels**

    The LCE binary ops (binary convolution and binary maxpool) operate on multiples of 32 input channels. Before a binary op, the input channels are padded out to a multiple of 32, which means that a binary convolution applied to an input with 16 channels and a binary convolution applied to an input with 32 channels will run equally fast.
-   **Use multiples of 8 output channels**

    The LCE optimised binary convolution op computes the result of 8 output channels in parallel, so if the number of output channels is not a multiple of 8 there will be wasted computation.
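
    As a rough illustration of these two channel-count rules, the sketch below estimates the fraction of binary compute spent on padding. The helper is not part of LCE; it simply mirrors the pad-to-a-multiple behaviour described above.

    ```python
    import math


    def wasted_fraction(channels: int, multiple: int) -> float:
        """Fraction of compute spent on padding when `channels` is padded up
        to the next multiple of `multiple` (illustrative only, not an LCE API)."""
        padded = math.ceil(channels / multiple) * multiple
        return 1 - channels / padded


    print(wasted_fraction(16, 32))  # 0.5   -> half the input-channel work is padding
    print(wasted_fraction(32, 32))  # 0.0   -> no padding needed
    print(wasted_fraction(68, 8))   # ~0.06 -> output channels padded from 68 to 72
    ```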
-   **Use `QuantConv2D` → `ReLU` → `BatchNormalization` instead of `QuantConv2D` → `BatchNormalization` → `ReLU` to take advantage of op fusing**
    **Warning:** This is not the convention used in TensorFlow Lite.
    Op fusing is an optimisation performed by the converter, whereby a sequence of stand-alone ops can be fused into an equivalent single op that can be executed more quickly than the original sequence.

    In the TensorFlow Lite converter, the layer sequence `Conv2D` → `BatchNormalization` → `ReLU` gets fused together into a single op. This makes sense, because the multiplier and addition of the batch-norm can be fused into the weights and bias of the convolution without any additional computation, and then the activation function can be applied at the end.

    However, in a binary convolution it is not possible to fuse the (full-precision) batch-norm multiplier into the (binary) convolution weights, and so there's no speed advantage to performing the batch-norm before the ReLU. Conversely, there is a possible advantage to performing the ReLU before the batch-norm for a sequence of binary convolutions: if the output of a ReLU were fed directly into a subsequent binary convolution, the input would be quantised to constant ones and all information would be destroyed, whereas a batch-norm placed between the ReLU and the next convolution re-centres the activations so that this collapse does not happen. As a result, the LCE converter supports op fusing of the layer sequence `QuantConv2D` → `ReLU` → `BatchNormalization`.
    Note that a batch-norm can still be fused into a `QuantConv2D` layer without a ReLU between them; however, a subsequent ReLU would not also be fused.
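
    A minimal sketch of the recommended ordering is shown below. The layer arguments (quantizer choices, `use_bias=False`, the filter count, and `pad_values=1.0`, which anticipates the padding guideline below) are illustrative rather than prescriptive.

    ```python
    import larq as lq
    import tensorflow as tf


    def fusing_friendly_block(x, filters=64):
        """QuantConv2D -> ReLU -> BatchNormalization: the ordering that the
        LCE converter can fuse into a single op."""
        x = lq.layers.QuantConv2D(
            filters,
            kernel_size=3,
            padding="same",
            pad_values=1.0,  # 'same-one' padding, see the next guideline
            input_quantizer="ste_sign",
            kernel_quantizer="ste_sign",
            kernel_constraint="weight_clip",
            use_bias=False,
        )(x)
        x = tf.keras.layers.Activation("relu")(x)    # ReLU first...
        x = tf.keras.layers.BatchNormalization()(x)  # ...then batch-norm
        return x
    ```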
-   **Avoid using Larq `QuantConv2D` layers with 'same-zero' padding**

    The Keras `Conv2D` layer supports either `"valid"` padding (no input padding) or `"same"` padding (padding the input with zeroes so that the output has the same spatial resolution). The Larq `QuantConv2D` layer also supports `"valid"` and `"same"` padding, but additionally has an attribute `pad_values` that controls the padding constant used in the `"same"` case. We refer to the combination of `padding="same"` with `pad_values=0.0` (the default, to match the `Conv2D` layer) as 'same-zero' padding, and the combination with `pad_values=1.0` as 'same-one' padding.

    It is strongly recommended to use 'same-one' padding instead of 'same-zero' padding; for an example of models using 'same-one' padding, see the QuickNet models in Larq Zoo.
    With 'same-zero' padding, because zero cannot be naturally represented in binary data (which uses a single bit to represent either +1 or -1), a zero-padding correction has to be added to values at the edge of the output tensor in order to give the correct result. This extra computation can be avoided by using 'same-one' padding.

    Additionally, 'same-zero' padding prevents the converter op fusing optimisations described above. These optimisations are not affected by 'same-one' padding.
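
    For illustration, the only difference between the two modes is the `pad_values` argument (other layer arguments, such as the quantizers, are omitted here for brevity):

    ```python
    import larq as lq

    # 'same-zero' (the default): requires a zero-padding correction at the
    # edges of the output and prevents the op fusing described above.
    conv_same_zero = lq.layers.QuantConv2D(64, 3, padding="same")

    # 'same-one' (recommended): pads with +1, which binary data represents
    # natively, so no correction is needed and op fusing still applies.
    conv_same_one = lq.layers.QuantConv2D(64, 3, padding="same", pad_values=1.0)
    ```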
-   **Replace `QuantDense` layers with 1×1 `QuantConv2D`s**

    LCE does not yet include an optimised binary dense op, which means that Larq `QuantDense` layers are not accelerated in any way. The recommended workaround for this is to replace all `QuantDense` layers with an equivalent sequence of reshapes and a 1×1 `QuantConv2D` layer:

    ```python
    # Replace this...
    x = lq.layers.QuantDense(num_dense_outputs, ...)(x)

    # ...with this
    x = tf.keras.layers.Reshape((1, 1, -1))(x)
    x = lq.layers.QuantConv2D(num_dense_outputs, kernel_size=(1, 1), ...)(x)
    x = tf.keras.layers.Reshape((-1,))(x)
    ```
-   **Ensure that `kernel_height` × `kernel_width` × `input_channels` is a multiple of 128**

    Inside the optimised binary convolution op, an `im2col` procedure is applied to the input tensor, which reduces the convolution operation to a (binary) matrix multiplication. The `im2col`-transformed input has a depth of size `kernel_height` × `kernel_width` × `input_channels`, which is padded to a multiple of 128. If this value is already a multiple of 128, no padding is necessary, and so there is no unnecessary extra computation.
    **Note:** This constraint is less important than the others in this guide; it is also hard to satisfy exactly, especially with common odd kernel sizes such as 1×1 or 3×3 combined with low input channel counts. Instead, you can minimise inefficiency here by aiming for `kernel_height` × `kernel_width` × `input_channels` to be as close as possible to the next greatest multiple of 128.
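
    To get a feel for how much of the `im2col` depth ends up as padding for a given layer, the hypothetical helper below (not part of LCE) applies the pad-to-128 rule described above:

    ```python
    import math


    def im2col_padding(kernel_height, kernel_width, input_channels):
        """Return (depth, padded depth, wasted fraction) for the im2col depth
        kernel_height * kernel_width * input_channels, padded to a multiple of 128."""
        depth = kernel_height * kernel_width * input_channels
        padded = math.ceil(depth / 128) * 128
        return depth, padded, 1 - depth / padded


    # 3x3 kernel with 128 input channels: 1152 is already a multiple of 128.
    print(im2col_padding(3, 3, 128))  # (1152, 1152, 0.0)

    # 3x3 kernel with 64 input channels: 576 is padded to 640, 10% wasted.
    print(im2col_padding(3, 3, 64))   # (576, 640, 0.1)
    ```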