FEDERAL: A Framework for Distance-Aware Privacy-Preserving Record Linkage

In privacy-preserving record linkage, a number of data custodians encode their records and submit them to a trusted third party, which is responsible for identifying the records that refer to the same real-world entity. In this paper, we propose FEDERAL, a novel record linkage framework that implements methods for anonymizing both the string and the numerical data values that are typically present in data records. These methods rely on a strong theoretical foundation for rigorously specifying the dimensionality of the anonymization space, into which the original values are embedded, in order to provide accuracy and privacy guarantees under various models of privacy attacks. A key component of the embedding process is the threshold required by the distance computations, which we prove can be formally specified to guarantee accurate results. We evaluate our framework using three real-world data sets with varying characteristics. Our experimental findings show that FEDERAL offers a complete and effective solution for accurately identifying matching anonymized record pairs (with recall rates consistently above 93%) in large-scale privacy-preserving record linkage tasks.

Existing System:

Record linkage is a two-step process. The goal of the first step, known as blocking, is to form as many matching pairs as possible while keeping the number of non-matching pairs as small as possible. In the second step, termed matching, the distances between the pairs formed during the blocking step are calculated. Approximate matching lies at the core of record linkage, since values contained in records that are owned by different data custodians, but refer to the same real-world entity, usually exhibit variations, errors, misspellings, and typos. Applying exact matching to record pairs would therefore typically produce low-quality results.
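The two steps can be illustrated with a minimal sketch on plaintext values. It is not part of FEDERAL and performs no anonymization; the blocking key (first letter), the edit-distance threshold, and all function names are assumptions chosen only to show how blocking limits the candidate pairs and how approximate matching tolerates typos.

```python
from collections import defaultdict
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def block(records, key):
    """Group records by a blocking key (hypothetical choice of key)."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    return blocks


def link(records_a, records_b, key, threshold):
    """Step 1 (blocking): pair only records that share a block.
    Step 2 (matching): keep pairs whose distance is within the threshold."""
    blocks_a, blocks_b = block(records_a, key), block(records_b, key)
    matches = []
    for k in blocks_a.keys() & blocks_b.keys():
        for ra, rb in product(blocks_a[k], blocks_b[k]):
            if edit_distance(ra, rb) <= threshold:
                matches.append((ra, rb))
    return sorted(matches)


custodian_a = ["jonathan", "maria", "peter"]
custodian_b = ["jonathon", "marie", "paula"]
matches = link(custodian_a, custodian_b, key=lambda s: s[0], threshold=1)
print(matches)
```

With exact matching (a threshold of 0) none of these pairs would be linked, which is the low-quality outcome the paragraph above describes.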

Proposed System:

In this paper, we introduce a novel framework for privacy-preserving record linkage, which incorporates distance-aware encoding methods for anonymizing sensitive values in data records. The key components of our framework are two methods, called LCBF (Low-Cost Bloom Filters) and BV (Bit Vectors), which are used to anonymize string values and numerical values (including timestamps), respectively. The anonymization space into which we embed the original data values is the binary Hamming space, which allows for effective approximate matching.
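To make "distance-aware" concrete, the sketch below embeds integers into the binary Hamming space with a plain unary (thermometer) code. This is not the BV method, whose construction is not given here, and it offers no privacy; it is an assumed stand-in that shows only the property of interest: the Hamming distance between two codes equals the absolute difference of the original values. The names `thermometer` and `hamming` are illustrative.

```python
def thermometer(value: int, size: int) -> list:
    """Unary code: the first `value` bits are 1, the remaining bits are 0."""
    if not 0 <= value <= size:
        raise ValueError("value must lie in [0, size]")
    return [1] * value + [0] * (size - value)


def hamming(u: list, v: list) -> int:
    """Number of positions at which two equal-length bit vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have equal length")
    return sum(a != b for a, b in zip(u, v))


# Hypothetical numerical attribute (e.g. an age) held by three records.
ages = {"a": 34, "b": 36, "c": 71}
codes = {name: thermometer(age, size=120) for name, age in ages.items()}

d_ab = hamming(codes["a"], codes["b"])  # equals |34 - 36|
d_ac = hamming(codes["a"], codes["c"])  # equals |34 - 71|
```

Because closeness survives the embedding, a distance threshold applied in the Hamming space corresponds to a tolerance on the original values.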

Both LCBF and BV rest on a strong theoretical foundation that provides accuracy guarantees for embedding records from the original space into the anonymization space. Specifically, LCBF uses Bloom filters, whose size is specified according to the tolerable degree of perturbation between similar strings.
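The general idea of encoding strings as Bloom filters can be sketched as follows. This is a generic bigram-based Bloom filter, not the LCBF construction itself; the filter size of 128 bits, the two hash functions derived from SHA-256, and the padding character are all assumptions made for illustration.

```python
import hashlib


def bigrams(s: str) -> set:
    """Character bigrams of a padded, lower-cased string."""
    padded = f"_{s.lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(s: str, size: int = 128, num_hashes: int = 2) -> int:
    """Set `num_hashes` bit positions per bigram; return the filter as an int."""
    bits = 0
    for gram in bigrams(s):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            position = int.from_bytes(digest[:8], "big") % size
            bits |= 1 << position
    return bits


def hamming(x: int, y: int) -> int:
    """Hamming distance between two filters stored as integers."""
    return bin(x ^ y).count("1")


similar = hamming(bloom_encode("jonathan"), bloom_encode("jonathon"))
dissimilar = hamming(bloom_encode("jonathan"), bloom_encode("elizabeth"))
print(similar, dissimilar)
```

In this sketch, "jonathan" and "jonathon" differ in three bigrams, so their filters can differ in at most 3 × 2 = 6 bit positions, whatever the hash values are. A bound of this kind, relating the number of differing bigrams to a maximum Hamming distance, is what lets a matching threshold be chosen for a tolerated amount of perturbation.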
