Binary-LoRAX: Low-Latency Runtime Adaptable XNOR Classifier for Semi-Autonomous Grasping with Prosthetic Hands

Nael Fasfous¹, Manoj-Rohit Vemparala², Alexander Frickenstein², Mohamed Badawy¹, Felix Hundhausen³, Julian Höfer³, Naveen-Shankar Nagaraja², Christian Unger², Hans-Jörg Vögel², Jürgen Becker³, Tamim Asfour³, Walter Stechele¹

¹ Technical University of Munich ² BMW Group ³ Karlsruhe Institute of Technology

Abstract—Intelligent, semi-autonomous prostheses take advantage of combining autonomous functions and traditional myoelectric control. With the help of visual and environment sensors, intelligent prostheses achieve a level of autonomy which relieves the user from generating elaborate electromyographic (EMG) signals for grasp type and trajectory. To achieve the desired functionality, the semi-autonomous prosthesis must efficiently process the incoming environmental data at a high rate, with low power and high accuracy. In this paper, we propose Binary-LoRAX, a low-latency runtime adaptable classifier for the semi-autonomous grasping task of prosthetic hands. We offload the classification task to an efficient binary neural network accelerator which performs high-throughput XNOR operations on digital signal processing (DSP) blocks. To tailor the classifier’s performance to the current application scenario, we propose a frequency scaling approach which dynamically switches between two modes of operation, high-performance and power-saving. In high-performance mode, classifications are performed with a low latency of 0.45 ms and a high throughput of 4999 FPS, consuming <1% of the optimal controller delay [12] and achieving a 99.7% reduction in latency compared to existing work [11].

II. RELATED WORK

A. Efficient Intelligent Prosthetics

For semi-autonomous control of a prosthetic hand, vision-based perception is an important component which has gained a lot of attention in research lately [14], [15], [16], [17], [18], [19], [20], [21]. In prosthetic hands, the implementation of semi-autonomous functions is enabled through in-hand visual perception, which requires efficient embedded processing to avoid insecure, high-latency external compute services. The complete control algorithms, including image recognition, must be executed on in-hand embedded processing hardware.

Such contradictory objectives of maximizing performance while minimizing power and resource utilization assign a decisive role to the optimization of deep learning algorithms. One option is to exploit the representational redundancy of neural networks through quantization and binarization [7], [8], [9]. In this work, we propose Binary-LoRAX, an efficient runtime adaptable binary neural network (BNN) classifier for the semi-autonomous grasping task of prosthetic hands. We tackle the challenges arising from the prosthetic hand’s application constraints through the following contributions:

- Applying BNNs to the graspable object classification task, enabling the efficient deployment of neural networks on intelligent prostheses with a task-related accuracy of 99.82% on a 25 class problem from the YCB object dataset [10], adding 12 classes compared to existing work [11].

- Achieving low-latency classifications of 0.45 ms, consuming <1% of the optimal controller delay [12] and achieving a 99.7% reduction in latency compared to existing work [11].

- Efficiently executing XNOR operations on an FPGA’s digital signal processing (DSP) blocks in a vectorized manner, freeing more look-up table (LUT) resources and allowing larger BNNs to fit onto embedded FPGAs.

- Dynamically adapting the frequency of the accelerator, offering high-performance and low-power modes to target different application scenarios (dangerous/delicate objects, batch processing, object localization, low battery, prosthetic movement), improving the classifier’s battery-life by up to 19% compared to [13].

The implementation of semi-autonomous hand functions reduces the complexity of the user-control commands typically generated from electromyographic (EMG) signals. Compared to completely manual hand control, the set of required commands can be smaller and invoked with lower cognitive effort, which lowers the cognitive burden on the user [17], [22]. Moreover, the reduced complexity of commands relaxes the need for the complex electrode setups otherwise required for accurate and long-term stable EMG pattern recognition.

The semi-autonomous KIT Prosthetic Hand proposed in [18] employs an in-hand camera to decrease the cognitive burden on the user by automating parts of the grasping process with the help of visual environmental information. For grasping an object with the support of semi-autonomous functions, the first step is to obtain visual object information, such as object class or object dimensions, e.g. width or height. This information is then used to select a suitable grasp from a database, where parameters can include finger trajectories or forces. In [11], a two-step classification system for the KIT Prosthetic Hand is proposed, where an object classification algorithm and an acknowledgment from the user trigger a second segmentation network.

In the first version of the KIT Prosthetic Hand, an ARM Cortex M7-based microprocessor was used. The currently developed design includes Zynq Z7010-based processing hardware. A photo of the hardware is shown in Fig. 1.

B. Binary Neural Networks

Parameter quantization has a direct impact on a neural network’s memory footprint and the complexity of its arithmetic operations on hardware. Binarization is the most aggressive form of quantization, where network weights and activations are constrained to $\{-1, 1\}$ [9]. Theoretically, this leads to a parameter compression of $\times 32$ compared to a float-32 CNN, and allows for an implementation of multiply-accumulate (MAC) operations as simple $\text{XNOR}$ and $\text{popcount}$ on inference hardware [23]. As a trade-off, low bitwidth representations have a lower information capacity, losing the precision of the finely adjusted weights achieved by the gradient propagation during training. To address these challenges, specialized training schemes are applied (normalization, straight-through-estimators (STEs), scaling/shifting factors) [24], [25], [23]. Different schemes for binarization have been proposed [9], [26], [27], [28], [23]. Courbariaux et al. [26] introduced the concept of training neural networks with binary weights during the forward pass and maintaining latent full-precision values during back-propagation to allow fine adjustments through the gradients. The authors later augmented this approach with binarized activations [9]. Rastegari et al. [23] introduced XNOR-Net, where the convolutions were approximated by a combination of $\text{XNOR}$ operations and $\text{popcount}$ operations, followed by a multiplication with scaling factors. The introduction of scaling factors not only increases the number of trainable parameters for each layer, but also adds to the computational complexity of XNOR-Net at deployment time. Although [23], [28], [29] and [30] have focused on adding algorithmic or structural complexity to BNNs to achieve classification performance close to full-precision CNNs on complex tasks, simpler tasks with lower scene complexity can be handled with more efficient forms of BNNs [9], [31].
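The replacement of MAC operations by XNOR and popcount can be illustrated with a small sketch. It assumes the usual encoding in which a bit value of 1 stands for $+1$ and 0 stands for $-1$; the helper names `pack` and `binary_dot` are hypothetical and not taken from any of the cited frameworks. The rescaling $2p - n$ (with popcount $p$ over $n$ bits) is the standard identity that maps the popcount back to the signed dot product.

```python
def pack(values):
    """Pack a list of {-1, +1} values into an integer (bit i = 1 means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot(h_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1
    xnor = ~(h_bits ^ b_bits) & mask      # 1 wherever both operands agree
    popcount = bin(xnor).count("1")       # number of agreeing positions
    return 2 * popcount - n               # agreements minus disagreements


h = [1, -1, -1, 1, 1, -1, 1, 1]
b = [1, 1, -1, -1, 1, -1, -1, 1]
reference = sum(x * y for x, y in zip(h, b))
print(binary_dot(pack(h), pack(b), len(h)), reference)
```

Both printed values agree, which is the property that lets the hardware avoid multipliers for the binarized layers.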

In the context of semi-autonomous prosthetic hands, the camera input at the instance before the grasp operation takes place is expected to have one central object in the field-of-view. In that regard, the task’s complexity resembles that of popular datasets, such as the German Traffic Sign Recognition Benchmark (GTSRB) [32], Street View House Numbers (SVHN) [33] or CIFAR-10 [34], all of which have the object of interest in the forefront of the scene, with minimal, random background complexity when compared to autonomous driving scenes such as Cityscapes [35]. It is important to note that BNNs have shown high accuracy and good generalization on the mentioned datasets [9], [13], making them a worthy candidate for the graspable object classification problem.

C. BNN Hardware Accelerators

Several accelerators have been designed to exploit the benefits of BNNs [36], [37], [13], [38], [39]. FINN [13] is a popular framework for accelerating BNNs on FPGAs. Although the framework is designed for BNNs presented in [9], it also supports 2-bit weights and/or activations. FINN compiles HLS code from a BNN description to create a hardware design for the network. The generated streaming architecture consists of a pipeline of individual hardware components instantiated for each layer of the BNN. OrthrusPE [40] investigates the effectiveness of deploying binary operations onto DSPs as SIMD binary Hadamard product processing units. The authors reconfigure the DSP at runtime to perform either fixed-precision operations or SIMD binary operations, enabling BNNs with scaling factors and multiple bases [28], [23]. In this work, we infuse the FINN architecture with DSPs, by switching them statically to a binary operation mode. For LUT constrained devices such as the Z7010, this allows larger and/or faster accelerator designs, by spreading out computations to DSPs. We further enable runtime frequency scaling to achieve different modes of operation, at different latency requirements and power consumption rates, based on the current application scenario.
Fig. 2: Overview of Binary-LoRAX: BNN tensor slices are fed into DSPs which perform high-throughput XNOR operations. DSP results are forwarded to the PEs of an MVTU. A single MVTU of the pipeline is shown for compactness. Runtime frequency scaling allows high-performance functions, or power-saving mode.

III. METHOD

A. Training and Inference of BNNs

For efficient approximation of weights and activations to single-bit precision, the BNN method by Courbariaux et al. [9] is used. At training time, the network parameters are represented by full-precision latent weights $W$ allowing for a smoother convergence of the model [24]. It is important to note that the input and output layers in this implementation are not binarized, to avoid a drop in classification accuracy. Without loss of generality, the activation feature map $A^{l-1} \in \mathbb{R}^{X_i \times Y_i \times C_i}$ is considered as the input to a convolutional layer $l \in [1, \ldots, L]$, where $X_i$, $Y_i$ and $C_i$ describe the dimensions of width, height and input channels. $A^0$ and $A^L$ are the input image and the prediction of the BNN, respectively. The latent weight matrix $W \in \mathbb{R}^{K \times K \times C_i \times C_o}$ is composed of the trainable parameters of the individual 2D-convolutional layers, where $K$ and $C_o$ are the kernel dimensions and the number of output channels. During the forward-pass for loss calculation or deployment, the weights $w \in W$ are transformed into the binary domain $b \in B \in \mathbb{B}^{K \times K \times C_i \times C_o}$, where $\mathbb{B} = \{-1, 1\}$. In the hardware implementation, the $-1$ is represented as 0 to perform multiplications as XNOR logic operations. The weights and input feature maps are binarized by the $\text{sign()}$ function

$$b = \text{sign}(w) = \begin{cases} 1 & \text{if } w \geq 0, \\ -1 & \text{otherwise}. \end{cases} \quad (1)$$

The $\text{sign()}$ function blocks the flow of gradients during training due to its derivative, which is zero almost everywhere. To overcome the gradient flow problem, the $\text{sign()}$ function is approximated during back-propagation by the straight-through estimator (STE) [24]. In the simplest case, the estimated gradient $g_0$ could be obtained by replacing the derivative of $\text{sign()}$ with the hard $\text{tanh}$, which is equivalent to the condition $g_0 = g_t$ when $|w| \leq 1$ [9].
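A minimal sketch of the forward binarization of Eq. (1) and the hard-tanh straight-through estimator described above is given below. It operates on plain Python lists for clarity; the function names are hypothetical, and a training framework would apply the same rule inside its automatic differentiation.

```python
def sign_forward(w):
    """Eq. (1): map latent weights to {-1, +1}, with sign(0) = +1."""
    return [1.0 if x >= 0 else -1.0 for x in w]


def sign_backward_ste(w, grad_out):
    """Hard-tanh STE: pass the incoming gradient where |w| <= 1, else block it."""
    return [g if abs(x) <= 1.0 else 0.0 for x, g in zip(w, grad_out)]


latent = [-1.5, -0.3, 0.0, 0.8, 2.0]
incoming = [1.0] * len(latent)
print(sign_forward(latent))
print(sign_backward_ste(latent, incoming))
```

Latent weights that drift outside $[-1, 1]$ receive no gradient, which keeps them from growing without affecting the binarized value.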

Particularly for BNNs, it is of crucial importance to adjust the input elements $a^{l-1} \in A^{l-1}$ by means of batch normalization before their approximation into the binary representation $h^{l-1} \in H^{l-1} \in \mathbb{B}^{X_i \times Y_i \times C_i}$. An advantage of BNNs is that the result of the batch-norm operation is always followed by $\text{sign()}$ (see Fig. 2). Since the result after applying both functions always lies in $\{-1, 1\}$, the precise calculation of the batch-norm is wasteful on embedded hardware. Based on the batch-norm statistics collected at training time, a threshold point $\tau$ is defined, wherein an activation value $a^{l-1} \geq \tau$ results in 1, otherwise -1 [13]. This allows the implementation of the typically costly batch-norm operation as a simple magnitude comparison operation on hardware. Next, the binary convolution follows as

$$A^l = \text{BinConv}(H^{l-1}, B^l) = \text{PopCnt}(\text{XNOR}(H^{l-1}, B^l)), \quad (2)$$

which results in the output feature map $A^l \in \mathbb{R}^{X_o \times Y_o \times C_o}$.
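The threshold point $\tau$ mentioned earlier in this subsection can be derived by solving the batch-norm expression for the activation at which its output crosses zero. The sketch below assumes the common batch-norm form $\gamma (a - \mu) / \sqrt{\sigma^2 + \epsilon} + \beta$ with $\gamma \neq 0$; the parameter names, the example statistics and the handling of negative $\gamma$ by flipping the comparison are illustrative assumptions, not details taken from [13].

```python
import math


def bn_sign(a, gamma, beta, mean, var, eps=1e-5):
    """Reference path: full batch-norm followed by sign()."""
    y = gamma * (a - mean) / math.sqrt(var + eps) + beta
    return 1 if y >= 0 else -1


def fold_threshold(gamma, beta, mean, var, eps=1e-5):
    """Return (tau, flip); output is +1 iff a >= tau, or a <= tau if flip."""
    tau = mean - beta * math.sqrt(var + eps) / gamma
    return tau, gamma < 0


def threshold_sign(a, tau, flip):
    """Hardware-friendly path: a single magnitude comparison."""
    hit = (a <= tau) if flip else (a >= tau)
    return 1 if hit else -1


# Hypothetical statistics; activations are integer popcount results.
tau, flip = fold_threshold(gamma=0.7, beta=-0.2, mean=30.0, var=16.0)
same = all(
    bn_sign(a, 0.7, -0.2, 30.0, 16.0) == threshold_sign(a, tau, flip)
    for a in range(65)
)
print(tau, same)
```

Because the threshold is computed once offline, no division or square root is needed at inference time.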

B. Hardware Architecture

The baseline hardware architecture is provided by the Xilinx FINN framework [13]. The hardware design space has many degrees of freedom for compute resources, pipeline structure, number of processing elements (PEs) and single-instruction-multiple-data (SIMD)-lanes, among other parameters. The streaming architecture is composed of a series of matrix-vector-threshold units (MVTUs) to perform the XNOR, popcount and threshold operations mentioned in Sec. III-A. In Fig. 2, a single MVTU is shown in detail, containing two PEs with 32 SIMD-lanes each. A detailed view of a single PE is also provided (bottom-right). For convolutional layers, a sliding-window unit (SWU) reshapes the binarized activation maps $H^{l-1} \in \mathbb{B}^{X_i \times Y_i \times C_i}$ into interleaved channels of $h^{l-1} \in H^{l-1}$, to create a single wide input feature map memory that can be accessed efficiently by the subsequent MVTU and operated upon in parallel. Max-pool layers are implemented as Boolean OR operations, since a single binary “1” value suffices to make the entire pool window output equal to 1.
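The equivalence between max-pooling and a Boolean OR on binarized maps can be checked with a short sketch. It assumes a single-channel map stored as nested lists of bits (1 for $+1$, 0 for $-1$) and non-overlapping $k \times k$ windows; the function name is hypothetical.

```python
def maxpool_binary_or(fmap, k=2):
    """k x k non-overlapping max-pool over a 2D bit map, computed with OR."""
    rows, cols = len(fmap), len(fmap[0])
    out = []
    for r in range(0, rows - k + 1, k):
        out_row = []
        for c in range(0, cols - k + 1, k):
            bit = 0
            for dr in range(k):
                for dc in range(k):
                    bit |= fmap[r + dr][c + dc]   # any 1 saturates the window
            out_row.append(bit)
        out.append(out_row)
    return out


fmap = [
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
print(maxpool_binary_or(fmap))
```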

A single MVTU is solely responsible for a single layer in the BNN, and is composed of single or multiple PEs, each having their own SIMD lanes. The SIMD lanes determine the throughput of each PE for the XNOR operation. The choice of PEs and SIMD lanes determines the latency and hardware resource utilization of each layer (i.e. MVTU) on the hardware architecture. A layer’s poorly dimensioned MVTU can result in an inefficient pipeline, leading to poor overall throughput. Throughput in a streaming architecture is heavily influenced by the slowest MVTU of the accelerator, as it throttles the rate at which results are produced when the pipeline is full. On the other hand, latency is dependent on the time taken by all the MVTUs of the architecture as well as the intermediate components between them (e.g. SWU, pooling unit, etc.).
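The difference between throughput and latency in such a pipeline can be made concrete with a deliberately simplified model. It assumes each MVTU needs a fixed number of clock cycles per image and ignores the intermediate components (SWU, pooling) and any overlap between stages; the cycle counts below are hypothetical and chosen only to show the effect of a poorly dimensioned layer.

```python
def pipeline_metrics(stage_cycles, f_clk_hz):
    """Return (throughput in frames/s, latency in s) for a full pipeline."""
    bottleneck = max(stage_cycles)               # slowest stage throttles output
    throughput_fps = f_clk_hz / bottleneck
    latency_s = sum(stage_cycles) / f_clk_hz     # every stage adds to latency
    return throughput_fps, latency_s


balanced = pipeline_metrics([1000, 1000, 1000, 1000], 100e6)
unbalanced = pipeline_metrics([250, 250, 250, 3250], 100e6)
print(balanced)
print(unbalanced)
```

In this model both designs have the same total cycle count and hence the same latency, but the unbalanced design delivers less than a third of the throughput.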

Choosing the correct number of PEs and SIMD lanes for each layer becomes a design problem of balancing the FPGA’s resources, the pipeline’s efficiency (throughput and latency), and potentially the choice of layers in the BNN (i.e. task-related accuracy). The number of resources on the FPGA is limited, especially in the context of low-power prosthetics, making these aspects important in planning the deployment with a HW-BNN codesign approach.

C. Runtime Dynamic Frequency Scaling

In the previous section, the importance of defining the number of layers (BNN design) and PE/SIMD lanes per MVTU (HW design) was outlined. To enable efficient performance of the semi-autonomous prostheses, a further aspect must be considered next to resource utilization and latency, namely the power consumption of the classifier. Prosthetic devices are meant to be used on a day-to-day basis, making high power consumption a prohibitive aspect to their practicality. Here, we further extend the classifier with the ability to change its operating frequency dynamically at runtime. The purpose in this case is not to have the classifier continually run at its full capacity, but rather to scale down its performance (in terms of latency) for more efficient use of the available energy supply. Dynamic power in CMOS scales roughly linearly with frequency, following $P_{dyn} \approx \alpha f \cdot CV_{dd}^2$, where $\alpha$ is the switching activity, $f$ is the frequency, $C$ the effective capacitance and $V_{dd}$ the supply voltage.
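The linear dependence of dynamic power on frequency can be evaluated directly from the expression above. The switching activity, capacitance and voltage in this sketch are placeholder values, not measurements; only the ratio between two frequencies at otherwise fixed parameters is meaningful here.

```python
def dynamic_power(alpha, f_hz, c_farad, vdd):
    """P_dyn ~ alpha * f * C * Vdd^2."""
    return alpha * f_hz * c_farad * vdd ** 2


# Placeholder electrical parameters (assumptions for illustration only).
ALPHA, C_EFF, VDD = 0.1, 1e-9, 1.0

p_high = dynamic_power(ALPHA, 125e6, C_EFF, VDD)   # high-performance clock
p_low = dynamic_power(ALPHA, 2e6, C_EFF, VDD)      # power-saving clock
print(p_high / p_low)
```

The ratio applies to the dynamic power of the programmable logic only; static power and the share consumed by the processing system and board do not scale with the accelerator clock, so the saving measured at the board supply is much smaller.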

In the case of our target Xilinx Zynq System-on-Chip boards, the programmable logic (PL) on which the hardware acceleration is implemented is clocked through phase-locked loops (PLLs) controlled by a CPU-based processing system (PS). The PS can manipulate the PL’s clock by writing into special registers, whose values act as frequency dividers to the PLLs. As an example, the motion of the prosthetic hand can be captured through simple sensors which are monitored by the PS. Based on this motion, the PS can drive up the frequency of the classifier and prepare for a low-latency, high-accuracy classification (based on a mean classification of a batch of frames). In the case of a fragile or perilous object, the lower risk of a false classification can reduce the chances of an improper grasp. The PS can also trigger the object localization task by splitting the view into multiple small images and classifying them with high throughput. This is elaborated in Sec. IV-C. These high-performance features may extend the use of Binary-LoRAX to other semi-autonomous prostheses and/or applications. Conversely, the PS may monitor the remaining battery power or system temperature and switch the classifier to low-power mode.
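How divider values translate into a PL clock can be sketched as a small search. The sketch assumes two cascaded 6-bit integer dividers applied to a single source PLL, which is the arrangement used for the PL fabric clocks on Zynq-7000 devices, and a 1 GHz source PLL, which is purely an assumption for illustration. The function name is hypothetical and no register access is performed.

```python
def best_dividers(pll_hz, target_hz, max_div=63):
    """Find (d0, d1) so that pll_hz / (d0 * d1) is closest to target_hz."""
    best = None
    for d0 in range(1, max_div + 1):
        for d1 in range(1, max_div + 1):
            f = pll_hz / (d0 * d1)
            err = abs(f - target_hz)
            if best is None or err < best[0]:
                best = (err, d0, d1, f)
    _, d0, d1, f = best
    return d0, d1, f


PLL_HZ = 1_000_000_000  # assumed source PLL frequency

for target in (125e6, 2e6):
    d0, d1, f = best_dividers(PLL_HZ, target)
    print(target, d0, d1, f)
```

Under these assumptions both the high-performance and the power-saving clock reported later in the paper are reachable with integer dividers, so a mode switch reduces to rewriting two small fields.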

D. SIMD Binary Products on DSP-blocks

In resource-constrained platforms, the available hardware must be used effectively. Smaller FPGAs that have only a few thousand look-up tables (LUTs) can easily run into synthesis issues, even with small network architectures. Since digital signal processing (DSP) blocks are not heavily utilized when synthesizing our BNN accelerator designs, they present a good alternative to LUT resources for executing the parallel XNOR operations of the accelerator.

The DSP48E1 slice is presented in Fig. 3. We exploit the internal concatenation of signals $A$ and $B$ to fit part of the tensor slice $h^{l-1}$ and select it through the $X$ multiplexer, forming a 48-bit wide signal. Signals $A$ and $B$ are asymmetric, having 30 bits and 18 bits respectively. These signals are aligned before entering the DSP, such that their concatenated value $A : B$ (top blue line in Fig. 3) represents the input to one (or multiple) of the MVTU’s PEs and the respective SIMD lanes (Fig. 2). Note that the top blue path in Fig. 3 skips over the “MULT” inside the DSP, allowing us to clock-gate the multiplier for further power savings. The tensor slice $b^l$ of binarized weights is wired to input $C$ of the DSP, and made internally accessible at multiplexers $Y$ and $Z$ (as well as $W$ on DSP48E2). The order in which the elements of $h^{l-1}$ are aligned in signals $A : B$ must match that of their element-wise multiplicands in $b^l$ in signal $C$ to get the correct 48-bit wide output as signal $P$. By setting the ALU_MODE signal and activating the correct multiplexers through the OPER_MODE signal (marked red in Fig. 3), the DSP is transformed into a SIMD binary product module. This low-level FPGA primitive reprogramming is not possible through High Level Synthesis (HLS), which is used to describe the overall accelerator. A script was developed to parse through the Hardware Description Language (HDL) files generated by HLS, to find all the signals corresponding to XNORs in the accelerator. The connections between the operand signals and the output registers are removed, then primitive DSP modules are instantiated with the correct wiring to operate in the binary mode described earlier. The operand signals of $h^{l-1}$ and $b^l$ are arranged into the aforementioned $A : B$ and $C$ signals and connected into the DSP(s). The wide output $P$ signal is then split and passed back into the next stages of the PE.
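The data path described above can be modelled in software as a 48-bit wide bitwise XNOR between the concatenation $A : B$ and the $C$ input. The sketch below is a behavioural model only, assuming $A$ occupies the upper 30 bits and $B$ the lower 18 bits of the concatenated word; the function names and the 16-bit lane width used when splitting $P$ are hypothetical choices for illustration, and no mode-signal encodings are modelled.

```python
A_BITS, B_BITS, P_BITS = 30, 18, 48
MASK48 = (1 << P_BITS) - 1


def dsp_simd_xnor(a, b, c):
    """Behavioural model of the binary mode: P = XNOR(A:B, C) over 48 bits."""
    assert 0 <= a < (1 << A_BITS) and 0 <= b < (1 << B_BITS) and 0 <= c <= MASK48
    ab = (a << B_BITS) | b            # A in the upper 30 bits, B in the lower 18
    return ~(ab ^ c) & MASK48


def split_lanes(p, lane_width):
    """Split the wide P output into equal lanes, lowest lane first."""
    lane_mask = (1 << lane_width) - 1
    return [(p >> i) & lane_mask for i in range(0, P_BITS, lane_width)]


h_slice = 0xABCDEF012345                     # 48 activation bits
a_in, b_in = h_slice >> B_BITS, h_slice & ((1 << B_BITS) - 1)
weights = h_slice ^ 0b111                    # weights differing in 3 positions
p_out = dsp_simd_xnor(a_in, b_in, weights)
print(bin(p_out).count("1"), split_lanes(p_out, 16))
```

The popcount over $P$ equals the number of matching activation/weight pairs, which is what the subsequent PE stages accumulate.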

IV. RESULTS AND DESIGN SPACE EXPLORATION

A. Experimental Setup

We evaluate Binary-LoRAX on 25 objects from the YCB dataset [10], improving upon previous work by 12 objects [11]. The dataset is augmented through scale, crop, flip, rotate and contrast operations. The masks provided with the dataset are used to augment the background with random Gaussian noise. The dataset is expanded to 105K images for the 25 classes. The images are resized to 32 x 32 pixels, similar to the CIFAR-10 [34] dataset. The BNNs are trained for up to 300 epochs, unless learning saturates earlier.

Fig. 3: The DSP48E1 Slice [41]. Appended blue path indicates the operands inside the DSP, red path indicates signals that are needed to achieve the desired binary mode.

Footnote: The DSP binary mode can also be applied to all 7-series, Ultrascale and Ultrascale+ FPGAs (DSP48E1 and DSP48E2), as well as the Versal DSP58.

The target platforms are the Z7020 for the \( v \)-CNV and \( m \)-CNV prototypes, and the XC7Z010 (Z7010) for \( \mu \)-CNV. All prototypes are finally deployed on the Z7020 SoC. Power, latency and throughput measurements are taken directly on a running system. The power is measured at the power supply of the board (includes both PS and PL). Latency measurements are performed end-to-end on the accelerator, covering the classifier’s total time for an inference, while throughput is the accelerator’s classification rate.

C. Runtime Dynamic Frequency Scaling

Prosthetic devices used on a daily basis must offer high performance for safe and convenient use, while minimizing power dissipation to increase the continuous usage time before charging. Referring back to Tab. II, we report two values (\( \uparrow \)) for power, latency and throughput per Binary-LoRAX prototype, for high-performance and power-saving modes. At 2 MHz, Binary-LoRAX’s \( v \)-CNV achieves a reduction of up to 16% in power consumption with runtime frequency scaling compared to standard CNV [13]. This translates to an improvement in battery-life of up to 19%. In high-performance mode, the \( m \)-CNV network achieves a latency of only 0.45 ms at 125 MHz. This reduces latency by 99.7% compared to the work in [11].
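The step from a 16% power reduction to a 19% battery-life improvement follows from a fixed energy budget: runtime is energy divided by power, so reducing power by a fraction $r$ extends runtime by $1/(1-r) - 1$. The short sketch below checks this arithmetic; it assumes a constant average power draw and an ideal battery, which is a simplification.

```python
def battery_life_gain(power_reduction):
    """Relative runtime gain for a fixed energy budget and reduced power."""
    return 1.0 / (1.0 - power_reduction) - 1.0


gain = battery_life_gain(0.16)
print(f"{gain:.1%}")
```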

TABLE I: Network architectures and hardware dimensioning.

<table>
<thead>
<tr><th>Network</th><th>$v$-CNV</th><th>$m$-CNV</th><th>$\mu$-CNV</th></tr>
</thead>
<tbody>
<tr><td>Arch.</td><td>Conv_1_L</td><td>Conv_1_L</td><td>Conv_1_L</td></tr>
<tr><td></td><td>[3, 64]</td><td>[3, 32]</td><td>[3, 16]</td></tr>
<tr><td></td><td>Conv_2_L</td><td>Conv_2_L</td><td>Conv_2_L</td></tr>
<tr><td></td><td>[64, 64]</td><td>[32, 32]</td><td>[16, 16]</td></tr>
<tr><td></td><td>Conv_3_L</td><td>Conv_3_L</td><td>Conv_3_L</td></tr>
<tr><td></td><td>[128, 128]</td><td>[64, 64]</td><td>[32, 32]</td></tr>
<tr><td></td><td>Conv_4_L</td><td>Conv_4_L</td><td>Conv_4_L</td></tr>
<tr><td></td><td>[256, 256]</td><td>[128, 128]</td><td>[64, 64]</td></tr>
<tr><td></td><td>Conv_5_L</td><td>Conv_5_L</td><td>Conv_5_L</td></tr>
<tr><td></td><td>[512, 512]</td><td>[256, 256]</td><td>[128, 128]</td></tr>
<tr><td></td><td>FC_1_L</td><td>FC_1_L</td><td>FC_1_L</td></tr>
<tr><td></td><td>[512]</td><td>[256]</td><td>[128]</td></tr>
<tr><td></td><td>FC_2_L</td><td>FC_2_L</td><td>FC_2_L</td></tr>
<tr><td></td><td>[512]</td><td>[256]</td><td>[128]</td></tr>
<tr><td></td><td>FC_3_L</td><td>FC_3_L</td><td>FC_3_L</td></tr>
<tr><td></td><td>[25]</td><td>[25]</td><td>[25]</td></tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>PE Count</th>
<th>SIMD lanes</th>
<th>YCB-Objects</th>
</tr>
</thead>
<tbody>
<tr>
<td>16, 32, 16, 4, 1, 1, 4</td>
<td>3, 32, 32, 32, 32, 4, 1, 1, 4</td>
<td>brush, banana, toy airplane, chips can, tomato soup can, windscreen, apple, scissors, sugar box, master chef, cup, mustard bottle, orange, pudding box, lemon, plate, pitcher, bowl, potted meat can, mini granite jar, gelatin jar, large clamp, power drill, tennis ball, cracker box, adjustable wrench, knife</td>
</tr>
</tbody>
</table>

TABLE II: Hardware results of design space exploration.

Power is averaged over a period of 100 seconds of operation.

<table>
<thead>
<tr>
<th>Configuration</th>
<th>$v$-CNV</th>
<th>$m$-CNV</th>
<th>$\mu$-CNV</th>
</tr>
</thead>
<tbody>
<tr>
<td>$v$-CNV</td>
<td>[3718, 140, 32]</td>
<td>[40278, 131, 5, 26]</td>
<td>[2217, 163]</td>
</tr>
<tr>
<td>$m$-CNV</td>
<td>[4060, 124, 24]</td>
<td>[2212, 158]</td>
<td>[3049]</td>
</tr>
<tr>
<td>$\mu$-CNV</td>
<td>[3578, 111]</td>
<td>[2172, 142]</td>
<td>[3049]</td>
</tr>
</tbody>
</table>

B. Design Space Exploration

Considering two embedded SoC platforms, the Z7020 and the more constrained Z7010, three Binary-LoRAX prototypes were investigated: \( v \)-CNV, \( m \)-CNV and \( \mu \)-CNV. The CNV network is based on the architecture in [13] inspired by VGG-16 [42] and BinaryNet [9]. \( m \)-CNV and \( \mu \)-CNV have a similar architecture, with fewer channels, for faster inference and to fit the Z7010 respectively. For the prosthetic hand, latency is more critical than throughput. On the Z7010, the number of PEs and SIMD lanes were chosen to minimize end-to-end latency accordingly.

In Tab. II, we report the details of the CNV network with (1,2) and (2,2) bits for weights and activations respectively. The fully binarized CNV (1,1) network achieved a comparable accuracy of 99.82% on the YCB graspable object dataset, showing the effectiveness of BNNs for this task, and the potential to add more classes in future work.

In the bottom half of Tab. II, the hardware utilization for the Binary-LoRAX prototypes is provided. For the \( v \)-CNV network, a reduction of 2386 (9%) LUTs can be observed from the regular CNV [13]. For the constrained Z7010, such reductions can make a previously non-synthesizable design realizable after moving XNOR operations to DSPs. The increase in DSP usage can be justified as they are not the bottleneck for synthesizable designs in our case. It is important to note that \( \mu \)-CNV was synthesizable on the Z7010 only after moving the XNOR operations to the DSPs, as proposed in Sec. III-D.

Considering the performance/Watt efficiency metric, Binary-LoRAX’s $m$-CNV achieves 2318 frames/Watt compared to 20 frames/Watt in [11]. With an optimal controller delay for myoelectric prostheses of 125 ms [12], all our classifiers consume $<1\%$ of the total time in high-performance mode, leaving more slack for post-processing, actuators and other parts of the system. In power-saving mode, the Binary-LoRAX prototypes run at 0.7-2 MHz and achieve a latency of $\sim$80 ms, still leaving roughly 36% of the allocated delay for the controller. It is important to note that in all the reported power measurements, roughly 1.65 W of power is consumed by the Z7020’s ARM Cortex-A9 processor (PS) and the board. This leaves the isolated accelerator’s power at roughly 0.2 W in power-saving mode for all configurations, making it very energy efficient. However, we report the overall power since the accelerator is still dependent on processor calls and preprocessing. In future work, the PS power consumption can also be optimized to further reduce the classifier’s overall power requirement.
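The share of the controller delay consumed in each mode can be verified with a few lines of arithmetic. The sketch uses only the figures quoted above (125 ms optimal delay, 0.45 ms and roughly 80 ms latency); the function name is hypothetical.

```python
CONTROLLER_DELAY_MS = 125.0  # optimal controller delay from [12]


def budget_share(latency_ms):
    """Fraction of the controller delay consumed by one classification."""
    return latency_ms / CONTROLLER_DELAY_MS


high_perf = budget_share(0.45)
power_save = budget_share(80.0)
print(f"high-performance: {high_perf:.2%} used")
print(f"power-saving: {1 - power_save:.0%} left")
```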

In addition to the low latency of the high-performance mode, the high throughput of up to 4999 FPS can be used to improve the quality of the application. Instead of providing a single classification, the accelerator can pipeline the inference of many images (potentially from different sensors) and perform batch classification. The batch classification result represents the top class over all classifications, which in practice comprise slightly different angles, lighting and distances to the object, improving the chances of a correct classification. Multi-camera prosthetics proposed in [43] can benefit from the high throughput, as more data is gathered through the multiple camera setup.
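One simple way to aggregate a batch of per-frame results is a majority vote, sketched below. The aggregation rule is an assumption made for illustration; the paper does not specify whether the batch result is formed by voting, by averaging scores, or by another rule. The labels are taken from the object list in Tab. I.

```python
from collections import Counter


def batch_classify(frame_predictions):
    """Return the label predicted most often across a batch of frames."""
    counts = Counter(frame_predictions)
    label, _ = counts.most_common(1)[0]
    return label


# Hypothetical per-frame outputs for slightly different views of one object.
frames = ["banana", "banana", "lemon", "banana", "orange"]
print(batch_classify(frames))
```

A single misclassified frame, for example due to an unfavourable angle, is outvoted as long as most frames in the batch agree.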

Another use of the high-performance mode is object localization in multi-object scenes. A large input image can be sliced into several smaller images, which are then classified individually [13]. The image can be reconstructed with bounded high-confidence classifications. Fig. 4 demonstrates the described function on Binary-LoRAX. This can help the prosthesis predetermine the location of different objects in a far scene, when the hand is not yet close to the graspable object. The approach also fits our training scheme, as the BNNs are trained on up-close images of the object (shortly before the grasp), while far scenes with no central object would be unrecognizable to the BNN. The individual slices of a far scene are similar to the up-close training images.

In Fig. 5, we perform a frequency sweep on the $v$-CNV prototype, identifying different points of operation for different application requirements. The low-power region is considered to be below 1.90 W, while localization would require classification rates of above 2250 FPS for an input resolution of $320 \times 240$. Batch classification can be triggered in critical scenarios where a latency of $<10$ ms is needed.
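The classification rate quoted for localization can be related to the tile count of the input image. The sketch below assumes non-overlapping $32 \times 32$ tiles counted by area and a camera rate of 30 frames per second; the camera rate is an assumption, since the text does not state how the 2250 FPS figure was derived.

```python
def tiles_per_frame(width, height, tile=32):
    """Number of tile-sized classifications per image, counted by area."""
    return (width * height) / (tile * tile)


def required_fps(width, height, camera_fps, tile=32):
    """Classifier rate needed to process every tile of every camera frame."""
    return tiles_per_frame(width, height, tile) * camera_fps


print(tiles_per_frame(320, 240))
print(required_fps(320, 240, camera_fps=30))
```

Under these assumptions a $320 \times 240$ frame corresponds to 75 tile classifications, which at 30 camera frames per second matches the stated requirement.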

We demonstrate the application of runtime frequency scaling in Fig. 6. The total power of the chip is measured for a duration of 80 seconds. At time $t=15$, we introduce a stimulus representing a dangerous object or similarly a signal from a motion sensor on the hand. The event triggers the classifier to high-performance mode for an observation period of 35 seconds. If no further event occurs, the classifier winds down to low-power mode at $t=50$. Naturally, the intermediate frequencies shown in Fig. 5 can all be triggered for other scenarios or operating modes.
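The mode switching in this experiment can be expressed as a small timer-based controller. The sketch below is a simulation with one-second resolution that reproduces the timeline described above (event at $t=15$, 35-second observation period); the function name and the rule that a new event restarts the observation period are assumptions for illustration.

```python
def mode_timeline(events, duration_s, hold_s=35):
    """Return the operating mode ('high' or 'low') for every second."""
    timeline = []
    high_until = -1                      # no observation period active yet
    for t in range(duration_s):
        if t in events:
            high_until = t + hold_s      # (re)start the observation period
        timeline.append("high" if t < high_until else "low")
    return timeline


timeline = mode_timeline(events={15}, duration_s=80)
print(timeline[14], timeline[15], timeline[49], timeline[50])
```

On the real system each transition would correspond to the PS rewriting the clock dividers of the PL, as described in Sec. III-C.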

V. CONCLUSION

A daily-used device, such as a prosthetic hand, must operate in different modes to suit daily application scenarios. In this paper, we present a low-latency runtime adaptable XNOR classifier for semi-autonomous prosthetic hands. We enable high-performance features and power-saving modes through runtime adaptable frequency scaling. Our Binary-LoRAX prototypes achieved over 99% accuracy on a 25 class problem from the YCB dataset, with a maximum throughput of 4999 FPS and a latency of 0.45 ms. The low-power mode can potentially improve the battery-life of the classifier by 19% compared to an equivalent accelerator running continuously at full power. This work demonstrates that BNNs have the potential to bring cutting-edge classification performance to the field of semi-autonomous prosthetics.
REFERENCES


