This decision concerns an European patent application relating to managing an archive for approximate string matching. Here are the practical takeaways from the decision T 1867/18 of the Technical Board of Appeal 3.5.06.
The invention relates to approximate string matching (also called fuzzy string matching or searching), i.e. finding strings that match a given pattern string within some tolerance according to a similarity metric, such as the edit distance. The strings being searched may be strings contained in records of a database. Approximate string matching may be used in database operations like join or rollup that group (“consolidate”) records into sets based on matching keys, in order to take into account that the exact spelling of words may differ within a dataset or between data sources and that words may be misspelled, e.g. COMPNY instead of COMPANY.
Figure 1 of EP2235621
Here is how the invention was defined in claim 1:
Claim 1 (sole request)
a computer-implemented method for managing an archive for determining approximate matches associated with strings occurring in data records of a dataset, the method including:
pre-processing, by a pre-execution module (110), data records to determine a set of string representations that correspond to strings occurring in the data records;
generating, for each of at least some of the string representations in the set, a plurality of close representations that are each generated from at least some of the same characters in the string, the close representations comprising deletion variants of the corresponding strings;
calculating a frequency of occurrence in the data records for each of the at least some of the strings represented in the set of string representations;
comparing generated close representations of a first string to generated close representations of a second string, and identifying whether any of the close representations of the first string correspond to any of the close representations of the second string such that the first and second string are a potential approximate match;
storing entries in an archive that each represent a potential approximate match between at least two strings based on their respective close representations;
renormalizing the frequency of at least one string by summing the counts of the strings that are potential approximate matches of the at least one string and, based on the renormalizing, generating a significance value for the one or more strings that can be used for identifying further potential approximate matches, the generated significance value for the at least one string being stored in association with the at least one string; and
executing, by an execution module (112), a computation graph wherein a component of the computation graph accesses the archive to determine whether given data records should be processed based on whether strings in the given data records are a potential approximate match, and wherein the component of the computation graph consolidates the given data records having strings that are a potential approximate match.
Is it patentable?
The examining division refused the European patent application on the basis that claim 1 did not fulfil the requirement of inventive step, Article 56 EPC, starting from a notorious general-purpose computer system or, alternatively, from prior art document
D1:|T. Bocek et al., Fast similarity search in large dictionaries, Technical Report No. ifi-2007.02, Department of Informatics, University of Zurich, April 2007.[XP002679634]|.
The Board assessed inventive step for claim 1 first in view of document D1, and the Board found claim 1 has the following differential features:
16.6 However, the method of D1 does not involve any pre-computation of potential approximate matches between strings occurring in the data records. In D1, the index stores the k-deletion neighborhoods (i.e. the sets of close representations) of the strings in the dictionary.
In the invention according to claim 1, the archive stores instead potential approximate matches between such strings.
These differences may be labelled difference (1).
16.7 D1 does also not disclose a calculation of frequencies of occurrence of strings, a renormalization of such frequencies and a generation of a “significance value” for at least one string based on the renormalization and its storage in the archive in association with the string, as recited in claim 1.
These differences may be labelled difference (2).
16.8 D1 does also not disclose “executing, by an execution module (112), a computation graph wherein a component of the computation graph accesses the archive to determine whether given data records should be processed based on whether strings in the given data records are a potential approximate match, and wherein the component of the computation graph consolidates the given data records having strings that are a potential approximate match”, as recited in claim 1.
These differences may be labelled difference (3).
16.9 The method of claim 1 thus differs from the method disclosed in D1 in differences (1) to (3).
First, the Board assessed whether those differences are obvious:
17. Obviousness of differences (1) and (3)
The board considers that differences (1) and (3) would have been obvious to a skilled person starting from D1 in view of common general knowledge.
17.1 D1 discloses the use of the FastSS algorithm for finding words similar to a query string.
It is known to a skilled person that approximate string matching has further uses such as in the context of approximate join operations (see D4: section 1, paragraphs 1 and 2, and section 2, paragraph 1). It would thus have been obvious to consider how the FastSS algorithm disclosed in D1 could be applied to efficiently implement an approximate join operation.
17.2 The skilled person knows that pre-computation always requires a trade-off to be made between storage requirements and computation speed at run-time based on an identification of which calculations are expected to be frequently required at run-time.
An approximate join operation involves the merging of the data records of two datasets based on some key field. The required approximate string matching calculations concern exclusively pairs of strings occurring in these records (unlike the query search application primarily considered in D1, where the query string is unknown before run-time).
Hence, it would have been obvious to the skilled person that in such an application context, the pre-processing phase may go further and include not only the generation of the k-deletion neighborhood of all strings occurring in the datasets but also their potential approximate matches determined on the basis of the generated k-deletion neighborhood. This results in difference (1).
17.3 An approximate join operation is an operation that “consolidates” data records having strings in key fields that are an approximate match.
In the implementation at which the skilled person would have arrived starting from D1, as explained in the preceding point, the determination of whether two strings occurring in key fields of the data records are an approximate match would be made by looking up in the archive whether they are a potential approximate match and, if so, based on their edit distance.
This results in difference (3), except for the feature contained therein that the consolidation operation is realised as a “component of [a] computation graph”.
17.4 However, whether the approximate join operation is to be executed as part of graph-based computations or not is, at least in the context of claim 1, a technically arbitrary choice. No aspect of the approach to approximate string matching used in the method of claim 1 is specifically adapted to be used in the context of graph-based computations, nor has this been argued by the appellant.
17.5 Hence, starting from D1, the skilled person would have arrived to differences (1) and (3) without any inventive activity. It may thus be left open to which extent they contribute to the technical character of the claimed invention.
Then, the Board discussed whether the features make technical contribution:
18. No technical contribution by difference (2)
The steps of calculating the frequency of occurrence of a string in the data records, renormalizing the frequency by taking into account the potential approximate matches of the string, generating a “significance value” for the string from the renormalized frequency and storing this value in the archive in association with the string – as specified in difference (2) – make no technical contribution to the method of claim 1 (beyond their implicit, not further defined computer-implementation).
18.1 Claim 1 is silent as to what is actually measured by the “significance value” generated for a given string. This can also not be derived from claim 1 as claim 1 does not specify how the significance value is generated from the renormalized frequency.
In the description, where this value is called “significance score”, it is described as representing the inverse of the renormalized frequency of the string and thus “the relative importance of a word [i.e. string] to a phrase containing the word for the purpose of phrase comparison” (see page 11, lines 15-20, and page 28, lines 21 to 27).
The strings occurring in data records are abstract data, with no technical character. Determining by mathematical calculations their frequency, simple or renormalized, and their “significance” in the above sense is thus also – at least in itself – not technical.
18.2 The generated significance value also does not contribute to producing a technical effect in the context of the method of claim 1.
18.2.1 It is not derivable from claim 1 that the generated significance value is actually used in the context of the claimed method.
Claim 1 specifies, in the step of generating the significance value, that that value “can be used for identifying further potential approximate matches” (emphasis by the board) but claim 1 does not include any step in which it is actually used for that or any other purpose in the context of the claimed method.
The final step of the method of claim 1 specifies that a component of a computation graph “accesses the archive to determine whether given data records should be processed based on whether strings in the given data records are a potential approximate match” and that it “consolidates the given data records having strings that are a potential approximate match”. This wording does not clearly require the significance value stored in the archive to be used in the consolidation operation. It could well be that only the potential approximate matches stored in the archive are used for that purpose, as only they are explicitly mentioned in relation to the consolidation operation.
18.2.2 It is also not apparent from the description how the significance value could be used for identifying further potential approximate matches, i.e. potential approximate matches not identified in the preceding step of “comparing generated close representations […] and identifying whether any of the close representations […] are a potential approximate match”.
The described uses of the “significance score” (as the significance value is named in the description) appear to be confined to the identification of “false positives” when matching phrases or records, i.e. that a potential approximate match identified in the preceding step is not to be considered an actual approximate match. This is in particular the case in all the passages cited by the appellant as basis for the feature concerning the significance value, i.e. page 8, lines 7 to 9, page 11, lines 16 to 20, page 15, lines 2 to 5, and original claim 11.
The board notes that the significance value or score is distinct from the “fuzzy match score” (see page 11, lines 11-20).
18.2.3 It follows that the step of storing the significance value in association to the corresponding string – as specified in difference (2) – does also not make any technical contribution (beyond the implicit, not further defined computer-implementation of that step).
Anyway, using a significance value to identify false positives in potential approximate matches is not by itself a technical use given the abstract nature of approximate string matching. Hence, even if this potential use were considered to be implied by claim 1 (it is not), it would not be an implied technical use in the sense of G 1/19.
18.2.5 As to the other alleged technical effects put forward by the appellant (see point 13 above), in particular increased computation speed, they have not been specifically linked to difference (2) but to differences (1) and (3), and they cannot anyway be relied on for difference (2) as the significance value is not used in the context of the method of claim 1.
18.3 Difference (2) does thus not contribute to the technical character of the method of claim 1 (beyond its implicit, not further defined and thus obvious computer-implementation). Consequently, it cannot support the presence of an inventive step.
Finally, the Board refused the European patent application on the ground that the method of claim 1 does not involve an inventive step within the meaning of Article 56 EPC over D1 and common general knowledge.
You can read the whole decision here: decision decision T 1867/18 of the Technical Board of Appeal 3.5.06.