The invention relates to machine learning models such as deep neural networks. This is an interesting decision, as it involves the interplay between demonstrating a technical effect and the requirement of sufficient disclosure.
The Examining Division considered that a technical effect of training the distilled model could not be acknowledged. The applicant argued that the features provided classification results using fewer resources. The Board disagreed with the applicant.
While the distilled model reduces the storage and computational requirements of the machine learning model, such a reduction is insufficient, by itself, to establish a technical effect: one also has to consider the performance of the “reduced” learning model. The Board found it not credible in general that any model with fewer parameters can be as accurate as the more complex one it is meant to replace. While it appeared possible to argue that the smaller model represents a good trade-off between resource requirements and accuracy, the application lacked any information in that regard. Thus, the subject-matter was considered not inventive.
Here are the practical takeaways from the decision T 1425/21 (Distilled Machine Learning Models/Google) from February 07, 2024, of the Technical Board of Appeal 3.5.06.
Key takeaways

A reduction in the storage or computational requirements of a machine learning model is insufficient, by itself, to establish a technical effect; the performance of the “reduced” model must also be considered.
The invention
The Board summarized the invention as follows:
1. The application relates to machine learning models such as deep neural networks. It proposes to approximate “cumbersome” machine learning models with “distilled” machine learning models which require less computation and/or memory when deployed. For instance the distilled model may be a neural network with fewer layers or fewer parameters. The cumbersome model may be an ensemble classifier, possibly combining full classifiers with specialist classifiers. The distilled model is trained on a “plurality of training inputs” and the associated outputs of the cumbersome model, so as to “generate outputs that are not significantly less accurate than outputs generated by the cumbersome machine learning model” (page 1).
1.1 The training procedure aims at minimizing the differences between the “soft outputs” of the distilled model and those of the cumbersome model on the given training inputs, e.g. through backpropagation (page 6).
1.2 The soft outputs represent a class probability obtained according to a form of the softmax equation using a “temperature” parameter T, which is set higher during training than during subsequent use (see description page 4, see claim 1 of all requests).
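For context, the “soft output” referred to above corresponds to the temperature-scaled softmax familiar from the knowledge-distillation literature; the notation below is ours, for illustration, and is not quoted from the application:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Here $z_i$ is the model’s logit for class $i$ and $T$ is the temperature. At $T = 1$ this is the ordinary softmax; higher values of $T$ flatten the distribution, exposing the relative probabilities the model assigns to the non-top classes, which is what the distilled model is trained to reproduce.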

Claim 1
Is it patentable?
In the decision under appeal, the Examining Division agreed that claim 1 had two distinguishing features over the prior art D1:
(a) the second machine learning model (distilled machine learning model) has fewer parameters than the first machine learning model (cumbersome machine learning model) such that generating output from the second machine learning model requires less memory than generating output from the first machine learning model; and
(b) the second machine learning model is trained based upon a soft score satisfying a particular form as set out in claim 1.
However, the Examining Division was of the opinion that these two differences did not provide a technical effect.
4.1 The training procedure based on a particular form of softmax was “not based on technical considerations as regards the functioning of the one or more computers” and did not serve a technical purpose (14.3).
4.2 Although it accepted that the distilled model required less storage, it argued that a technical effect could not be acknowledged on that basis because
(a) the claim allowed for both learning machines to co-exist on the same computer (14.4 and 14.5),
(b) no technical details of the device, on which the second machine learning model may run, were specified (14.4, but also 18.2).
The applicant disagreed and argued as follows in the appeal:
7. The goal of the claimed invention was to provide classification results using fewer resources: the smaller “distilled” model should classify as well as the larger “cumbersome” model. The claim itself did not cover just any pair of learning models; it defined structural and functional limitations on them.
7.1 The learning models were limited structurally, as their outputs were obtained in a specific manner, namely as “soft outputs”. These did not single out one class but represented a set of probability values for the different classes.
7.2 The learning models were also limited functionally, as the smaller model was claimed to output soft scores that “matched” those of the larger one.
8. The process of training the smaller model on the basis of the outputs of the larger one transferred knowledge from the larger to the smaller model. That knowledge was the classification capability of the larger model, not its specific way of fitting the data. Therefore, the second, smaller model did not need to be as complex as the first one, as it only needed to be able to produce equivalent output scores.
8.1 This transfer was enabled by the way the temperature parameter of the soft score outputs was set. It was larger during training than in use, so as to allow for learning while still producing sharp results in use. This concept worked in principle for any type of model.
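To make the Appellant’s argument (point 8.1) concrete, here is a minimal, hypothetical sketch in Python/NumPy of how the temperature affects the soft scores and how a distillation loss might be formed. It illustrates the general concept only, not the method as claimed; the function names and example logits are our own:

```python
import numpy as np

def soft_outputs(logits, T):
    """Temperature-scaled softmax: higher T yields a softer distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T):
    """Cross-entropy of the student's soft outputs against the teacher's
    soft targets, both computed at the same (high) training temperature."""
    p_teacher = soft_outputs(teacher_logits, T)
    p_student = soft_outputs(student_logits, T)
    return -(p_teacher * np.log(p_student + 1e-12)).sum()

# Hypothetical logits from a "cumbersome" teacher model for one input
teacher_logits = np.array([6.0, 2.0, 1.0])

print(soft_outputs(teacher_logits, T=1.0))   # sharp, as in use:     ~[0.98, 0.02, 0.01]
print(soft_outputs(teacher_logits, T=10.0))  # soft, as in training: ~[0.44, 0.29, 0.27]
```

Training would minimize such a loss over the smaller model’s parameters (e.g. by backpropagation, as the application mentions), while at inference time the smaller model is queried at a lower temperature to produce sharp classifications.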
In its preliminary opinion, the Board raised an objection of lack of sufficiency of disclosure, which was also relevant to the interpretation of the claim features. The applicant argued as follows:
9. The Appellant also stated that the invention was made in an extremely fast-moving field in which specific details of the models and their parameters would quickly be out of date. Accordingly, indicating such details and parameters was not useful and should not be required in a patent application.
9.1 The skilled person in this field was highly skilled, had an extensive knowledge of available architectures and types of networks, and would be able to select the most suitable ones for the task at hand. The skilled person would also know, in view of the task at hand, which precision was required, i.e. when the outputs could be said to “match”.
9.2 The Appellant acknowledged that the skilled person had to carry out some, but not an unreasonable number of experiments to select appropriate models. According to T 312/88, reasons 3.3, a small number of routine experiments was not an undue burden on the skilled person and not detrimental for patentability (albeit, in that case, with respect to sufficiency of disclosure).
10. The Appellant also argued that pairs of models which were unsuitable for the task were not within the scope of the claim, because the claims were limited to models with matching outputs. Accordingly, all models which were covered by the claim provided the technical effect.
11. The Appellant stated that, according to established case law of the boards of appeal, for a method of the type claimed, i.e. a computer implemented mathematical method, a technical effect could be acknowledged either because the method was applied to solve a specific technical problem, or because its implementation had to be considered technical. The latter was the case here. The claimed method provided for reduced memory use with “matching” (i.e. the same or equivalent) classification results, so that a technical effect was present.
12. The invention was defined by the specific form of the output – soft scores with temperature parameter – and the way the temperature parameter was used to obtain a reduced model with equivalent classification abilities. It was therefore justified to seek protection without a limitation to any specific learning models or application contexts.
The Board disagreed with the applicant and considered that the distinguishing features did not provide a technical effect:
16. The Board notes that the features differentiating the invention from D1, or even the entire set of features defining the distilled model and its training, as a difference to a known cumbersome learning model, are mathematical methods which cannot, under the established case law of the boards of appeal (the “COMVIK” approach), be taken into account for inventive step unless they contribute in a causal manner to a technical effect.
17. It accepts that the distilled model has reduced memory requirements when compared to the cumbersome model; after all this is expressly claimed. However, a reduction in storage or computational requirements of a machine learning model is insufficient, by itself, to establish a technical effect. One also has to consider the performance of the “reduced” learning model (see decision T 702/20, reasons 14.1, from this same Board).
18. It is not credible in general that any model with fewer parameters can be as accurate as the more complex one it is meant to replace. For example, the complexity or architecture of the reduced model may be insufficient or inadequate for the given problem.
19. The Board disagrees with the Appellant’s counter-argument that the invention (by “knowledge transfer” see point 8 above) reliably ensures that any given smaller network can provide the same accuracy as the given larger one. The input and output complexity is the same for both networks. Hence, also the smaller network must be complex enough to be able to model the input-output relationship (see e.g. D1, section 4.3, for a discussion on accuracy and complexity of approximating classifiers of a single type).
19.1 The Board also does not see that the temperature-based training process ensures that the smaller model has an equivalent accuracy. It is not clear how exactly the temperature must be first set (for both models), and then varied, and what accuracy may be expected. The application simply does not discuss this.
19.2 Since, in the Board’s view, the claim does not imply a step of selecting or obtaining a smaller model, but simply defines one as a given, the Appellant’s arguments relating to trial and error are not pertinent (and even if they were, they would not succeed, see below).
20. The Board concludes therefore that the technical effect advanced by the Appellant (see point 11 above) cannot be acknowledged over the whole scope of the claim, i.e. for all sets of smaller and larger models. The second model may use fewer resources, but it cannot be said to produce the same results and many smaller models will, in fact, be considerably worse.
20.1 In principle, it appears possible to argue that the smaller model represents a “good” trade-off between resource requirements and accuracy, i.e. that the smaller model may be less accurate but have (predictably) smaller resource requirements. However, the application lacks any information in that regard.
20.2 Since no technical effect can be acknowledged, claim 1 of the main request lacks an inventive step.
Finally, the Board noted that the applicant’s arguments could not be accepted because, if they were, the application would not be sufficiently disclosed for the skilled person to carry out the invention over the full scope of the claim:
21. As discussed with the Appellant during the oral proceedings, the Appellant’s interpretation of the claim would give rise to an objection under Article 83 EPC. This objection, on which this decision does not rely, is presented here for the sake of argument.
22. Were the Board to adopt the interpretation of the Appellant, and assume that the claim implies a step of selecting a suitable smaller model, the Board disagrees with the argument that the skilled person would be able to provide smaller networks with reduced memory needs and equivalent accuracy with only “few routine tests” for all classification tasks.
23. The application itself does not guide the skilled person towards an understanding as to which distilled models might replace which cumbersome models, and how accurate they might be in comparison. No examples of pairs of cumbersome and distilled learning models are provided, nor are any results showing the performance of the distilled models.
23.1 While the skilled person might be aware of the various architectures and types of networks available from common general knowledge, the number of these possibilities is quite large. For each of them, downsizing can be done in different ways, by reducing the number of layers, of neurons, of weights etc., and each of these in various ways.
23.2 The trial-and-error process would also have to keep an eye on the desired trade-off between size and accuracy as already discussed above, which is not a simple endeavour. The performance of machine learning models is not easily predictable and can vary considerably according to the task at hand (see e.g. again D1).
24. Further, the Board does not see that the temperature-based training process as claimed simplifies the trial-and-error process. As already stated, the application simply does not discuss how exactly the temperature must be set and varied and what accuracy gains, if any, may be expected.
25. The Board further notes that the person skilled in the art must be able to carry out the invention over the whole scope of the claim, i.e. in principle for any classification task and given larger model. The claims cannot, a priori, be construed as excluding instances which (e.g. after trial and error) turn out not to work (see T 748/19, reasons 13 to 13.2). Thus the argument of the Appellant that non-working embodiments do not fall under the scope of the claim cannot succeed.
25.1 The Board is therefore of the opinion that the application does not sufficiently teach how to carry out in practice the claimed invention. It rather only sketches out an idea whose implementation over the full scope of the claim requires a separate research program.
Therefore, the Board dismissed the appeal.
More information
You can read the full decision here: T 1425/21 (Distilled Machine Learning Models/Google) from February 07, 2024, of the Technical Board of Appeal 3.5.06.