pseudorandom/a_closer_look_at_fuzzy_thre...

177 lines
22 KiB
HTML
Raw Permalink Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

<!DOCTYPE html>
<html lang=en>
<head>
<meta charset="utf-8">
<title>A Closer Look at Fuzzy Threshold PSI (ftPSI-AD) | pseudorandom</title>
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:site" content="@sarahjamielewis" />
<meta name="twitter:creator" content="@sarahjamielewis" />
<meta property="og:url" content="https://pseudorandom.resistant.tech/a_closer_look_at_fuzzy_threshold_psi.html" />
<meta property="og:description" content="Apple recently released a detailed cryptographic paper describing a new pri" />
<meta property="og:title" content="A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)" />
<meta name="twitter:image" content="https://pseudorandom.resistant.tech/a_closer_look_at_fuzzy_threshold_psi.png">
<link rel="alternate" type="application/atom+xml" href="/feed.xml" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" type="text/css" href="styles.css">
<link rel="stylesheet" href="/katex/katex.min.css" integrity="sha384-RZU/ijkSsFbcmivfdRBQDtwuwVqK7GMOw6IMvKyeWL2K5UAlyp6WonmB8m7Jd0Hn" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="/katex//katex.min.js" integrity="sha384-pK1WpvzWVBQiP0/GjnvRxV4mOb0oxFuyRxJlk6vVw146n3egcN5C925NCP7a7BY8" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="/katex/auto-render.min.js" integrity="sha384-vZTG03m+2yp6N6BNi5iM4rW4oIwk5DfcNdFfxkk9ZWpDriOkXX8voJBFrAO7MpVl" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
</head>
<body>
<header>
<nav>
<strong>pseudorandom</strong>
<a href="./index.html">home</a>
<a href="mailto:sarah@openprivacy.ca">email</a>
<a href="cwtch:icyt7rvdsdci42h6si2ibtwucdmjrlcb2ezkecuagtquiiflbkxf2cqd">cwtch</a>
<a href="/feed.xml">atom</a>
</nav>
</header>
<article>
<h1 id="a-closer-look-at-fuzzy-threshold-psi-ftpsi-ad">A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)</h1>
<p>Apple recently released a detailed cryptographic paper describing a new private set intersection protocol which they named Fuzzy Threshold PSI with Associated Data, or ftPSI-AS for short<em class="footnotelabel"></em>.</p>
<p class="sidenote">
<a href="https://www.apple.com/child-safety/pdf/Apple_PSI_System_Security_Protocol_and_Analysis.pdf">The Apple PSI System</a>
</p>
<p>In my last article I sketched out a probabilistic analysis of their proposed approach under the assumption that the ftPSI-AS was secure<em class="footnotelabel"></em>. I now want to take a closer look at the actual proposed protocol from a cryptographic perspective. This article will be mostly technical summary from my own notes, with some analysis at the end - you may want to <a href="#cute-algorithm-meets-world">skip</a> to the analysis part.</p>
<p class="sidenote">
<a href="/obfuscated_apples.html">Obfuscated Apples</a>
</p>
<h2 id="a-note-on-hash-functions">A note on Hash Functions</h2>
<p>There are at least 3 different sets of hash functions used in this system:</p>
<ol type="1">
<li>Mapping the image into the NeuralHash space - a perceptual hash used to detect matching images. Random collisions are very likely to happen.</li>
<li>Mapping the NeuralHash space into the blinded hash space - a cuckoo hash intended to prevent the client from learning about the actual hashes being compared against. Random collisions are cryptographically unlikely to happen here.</li>
<li>Mapping the PRFd image identifier space into the DHF space - intended to allow the server to distinguish real hashes from synthetic matches, as above.</li>
</ol>
<h2 id="a-protocol-summary">A Protocol Summary</h2>
<p>The server has a set of hash values <span class="math inline"><em>X</em></span>.</p>
<p>The client has an ordered list of triples (<span class="math inline"><em>Y</em></span>) containing a hash value, an identifier and some associated data i.e. an image, a random identifier and the hash of the image.</p>
<p>Different images (with different identifiers and/or associated data) can have the same NeuralHash. NeuralHash has been engineered such that similar images result in the same hash. It is also likely suspectable to false positives like all perceptual hashes<em class="footnotelabel"></em>.</p>
<p class="sidenote">
again see <a href="/obfuscated_apples.html">Obfuscated Apples</a> for more analysis on this part of the system.
</p>
<p>The identifiers <strong>are not secret</strong>. The intent is for the server to learn the entire list of identifiers.</p>
<p>The ultimate goal is to construct a system where the client streams triples to the server over a period of time, the client, in addition to some indistinguishable from random public data, learns only <span class="math inline"><em>t</em>,|<em>X</em>|</span>, i.e. the threshold of matches needed to trigger a report, and size of the servers database, and the server only learns the set of random client identifiers <strong>and</strong> the associated data of the intersection of triples with hashes that overlap with entries in the database, and the number of such intersecting matches are greater than <span class="math inline"><em>t</em></span>.</p>
<p>In addition, when the number of matches is below the threshold, it is desirable that the server have “some uncertainty” <em class="footnotelabel"></em><span class="sidenote">that this is never quantified says a lot about how much thought went into this element of the system</span> as to the exact size of the intersection and its contents.</p>
<p>To do this the client is also given a small secret set of identifiers <span class="math inline"><em>S</em></span> which are designated <strong>synthetic matches</strong>. These relate to actual identifiers on the client (i.e. in set <span class="math inline"><em>Y</em></span>).</p>
<p>The paper omits how values in <span class="math inline"><em>S</em></span> are actually chosen:</p>
<blockquote>
“We refer to the technical summary document [App21] for a further discussion of the purpose of synthetic matches and how the client selects the set <span class="math inline"><em>S</em></span>” - Apple PSI System
</blockquote>
<p><em class="footnotelabel"></em><span class="sidenote">[App21] does not detail how the client selects set <span class="math inline"><em>S</em></span>)</span></p>
<p>Then, when an image arrives on the device it is assigned a new triple <span class="math inline">(<code>hash</code>,<code>id</code>,<code>ad</code>)</span>in <span class="math inline"><em>Y</em></span> and the client generates a voucher for the triple.</p>
<p>There are 2 different ways of constructing a voucher depending on whether or not <span class="math inline"><code>id</code> ∈ <em>S</em></span></p>
<p>If <span class="math inline"><code>id</code> ∋ <em>S</em></span>:</p>
<ul>
<li>The image data is encrypted with an account specific associated data key.</li>
<li>The id of the image is permuted with a pseudorandom function, to produce an x-coordinate of the secret shard, and an input element for a detectable hash function<em class="footnotelabel"></em><span class="sidenote">more on the DHF later</span>. <span class="math inline"><em>r</em>=<em>D</em><em>H</em><em>F</em>(<code>hkey</code>,<em>x</em>)</span> where <span class="math inline"><code>hkey</code></span> is the clients static key for the hash function.</li>
<li>A new threshold share is generated such that <span class="math inline"><em>t</em></span> shares are needed to reconstruct the associated data key.</li>
<li>A random key is chosen and used to encrypt the output from the detectable hash <span class="math inline"><em>r</em></span>, encrypted image data and the secret share. This ciphertext is called (<span class="math inline"><code>rct</code></span>)</li>
<li>The hash is used to lookup a value in the public distributed data (using a set of cuckoo hash functions and cuckoo table containing public curve elements that are either random or specially constructed if the hash appears in <span class="math inline"><em>X</em></span>), and the value there is used to generate a specific set of curve points <span class="math inline">(<em>Q</em>,<em>S</em>)</span> from which a set of keys is derived and then encrypted with the random key. This happens twice with different hash functions and random values such that at least one of the calculated <span class="math inline">(<em>Q</em>,<em>S</em>)</span> are related (<span class="math inline"><em>S</em><sub><em>i</em></sub>=<em>α</em> ⋅ <em>Q</em><sub><em>i</em></sub></span>) only if the hash has an entry in the cuckoo table, or randomly independent otherwise<em class="footnotelabel"></em>.</li>
<li>Finally, a voucher is constructed containing the random id of the image and modified ciphertexts in a random order, and <span class="math inline"><code>rct</code></span> and sent to the server.</li>
</ul>
<p class="sidenote">
This is a simplification.
</p>
<p>If <span class="math inline"><code>id</code> ∈ <em>S</em></span>:</p>
<ul>
<li>A random key is used to encrypt all zeros instead of the image data, this is never decrypted</li>
<li>The id of the image is permuted with a pseudorandom function. Instead of using the derived x-coordinate in the shard, a random value is used instead. And, instead of hashing the derived value with DHF, a random element in the output space of DHF is used instead.</li>
<li>A random key is generated and used to encrypt the dummy data (<span class="math inline"><code>rct</code></span>)</li>
<li>Choose a random <span class="math inline">(<em>Q</em>,<em>S</em>)</span> to satisfy the relation <span class="math inline"><em>S</em><sub>1</sub>=<em>α</em> ⋅ <em>Q</em><sub>1</sub></span>, <span class="math inline">(<em>Q</em><sub>2</sub>,<em>S</em><sub>2</sub>)</span> to be independently random, derive keys and encrypt with the random key.</li>
<li>Finally, a voucher is constructed containing the id of the image, and these modified ciphertexts in a random order, and <span class="math inline"><code>rct</code></span> and sent to the server.</li>
</ul>
<h3 id="on-the-server">On the Server</h3>
<p>For each voucher that arrives at the server:</p>
<ul>
<li>Store the id,</li>
<li>For each of the 2 ciphertexts in the voucher:
<ul>
<li>derive <span class="math inline"><em>S</em><sub><em>i</em></sub>=<em>α</em> ⋅ <em>Q</em><sub><em>j</em></sub></span></li>
<li>derive the random key by decrypting the ciphertext with a key derived from <span class="math inline"><em>S</em><sub><em>j</em></sub></span></li>
<li>attempt to decrypt <span class="math inline"><code>rct</code></span> with the derived random key.</li>
<li>If this fails then mark this attempt as false.</li>
</ul></li>
<li>If both attempts are marked as false then this is a non-match</li>
<li>If either attempt is successful then the decrypted tuple is added to the set of known shares.</li>
<li>(If both are successful then th client is being malicious, and the server is instructed to ignore it)</li>
</ul>
<p>At this point some shares are “real”, and some shares are “synthetic”. The server does not ostensibly know which is which<em class="footnotelabel"></em>.</p>
<p class="sidenote">
See <a href="/obfuscated_apples.html">Obfuscated Apples</a> for a probabilistic analysis which disputes this claim.
</p>
<p>If the size of this set exceeds the threshold then the server can use the detectable hash function to distinguish them.</p>
<h2 id="the-detectable-hash-function">The Detectable Hash Function</h2>
<p>Not much has been written about the new, proposed Detectable Hash Function, but it is the pin on which the privacy of this system ultimately rests (under the assumption that the rest of the crypto is sound).</p>
<p>A Detectable Hash Function (or (s-t)DHF) is defined a keyed hash function that takes in a key and an element and outputs a value in some distribution <span class="math inline"><em>R</em><sup><em>t</em></sup></span> where <span class="math inline"><em>t</em></span> is the threshold of the system. It is designed to be used in systems that deliberately mix genuinely hashed values with randomly selected elements from the hash space (<span class="math inline"><em>R</em></span>).</p>
<p>There exists a <strong>detection algorithm</strong> which is deterministic and invoked as <span class="math inline"><em>D</em>(<em>v</em>)</span> where v is a vector of elements in <span class="math inline"><em>R</em></span>. The algorithm outputs either a vector of true elements, or fails.</p>
<p>When at least <span class="math inline"><em>t</em>+1</span> entries in <span class="math inline"><em>v</em></span> are generated by a DHF and at most <span class="math inline"><em>s</em></span> are random then the algorithm should identify all generated elements i.e. it can distinguish random elements from hashed elements.</p>
<p>The paper constructs a hash function by defining the keys to be a sequence of polynomials of degree at most <span class="math inline"><em>t</em>1</span> arranged into an <span class="math inline"><em>s</em><em></em></span> matrix.</p>
<p><br /><span class="math display">$$ DHF(k, x_0) := (x_0, p_1(x_0),...p_s(x_0)) \in \mathbb{F}_{l}^{s+1} \vphantom{+ \frac{1}{1}}$$</span><br /></p>
<p>Where, <span class="math inline"><em>l</em></span> is some large 64bit number. This hash is treated as a column vector of size <span class="math inline"><em>s</em>+1</span>.</p>
<p>For random elements the distribution <span class="math inline">(<em>D</em><em>H</em><em>F</em>(<em>k</em>,<em>x</em><sub>1</sub>)...<em>D</em><em>H</em><em>F</em>((<em>k</em>,<em>x</em><sub><em>t</em></sub>)))</span> is random in <span class="math inline">𝔽<sub><em>l</em></sub><sup><em>t</em></sup></span></p>
<p>i.e. for every input <span class="math inline"><em>X</em></span> we multiply it by some sequence of polynomials and derive a sequence of elements <span class="math inline">𝔽<sub><em>l</em></sub><sup><em>s</em>+1</sup></span>.</p>
<p>The detection algorithm is given a vector elements of <span class="math inline"><em>R</em></span> which are then arranged and expanded into a larger vector, by taking the first element, <span class="math inline"><em>x</em><sub><em>o</em></sub></span>, to incrementing powers up <span class="math inline"><em>t</em>1</span>, followed by the rest of the values from the hash output.</p>
<p>After the expansion, every vector is arranged into a matrix <span class="math inline"><em>M</em></span> such that every row of <span class="math inline"><em>M</em></span> contains at least <span class="math inline"><em>t</em>+1</span> evaluations of some polynomial at most <span class="math inline"><em>t</em>1</span>. e.g.</p>
<p><br /><span class="math display">$$
\begin{bmatrix}
1 &amp; 1 &amp; \dots &amp; 1 \\
x_0 &amp; r &amp; \dots &amp; x_{n} \\
x_0^2 &amp; r^2 &amp; \dots &amp; x_{n}^2 \\
\vdots &amp; \vdots &amp; \ddots &amp; \vdots &amp; \\
x_0^{t-1} &amp; r^{t-1} &amp; \dots &amp; x_{n}^{t-1} \\
p_0(x_0) &amp; r_1 &amp; \dots &amp; p_0(x_n) \\
\vdots &amp; \vdots &amp; \ddots &amp; \vdots &amp; \\
p_s(x_0) &amp; r_s &amp; \dots &amp; p_s(x_n)
\end{bmatrix}
$$</span><br /></p>
<p>The kernel of this matrix is calculated and, the row indexes of <span class="math inline"><em>M</em></span> that do not map to the nullspace are the indexes of “real” matches i.e. the values of a vector <span class="math inline"><em>w</em></span> such that <span class="math inline"><em>M</em> ⋅ <em>w</em>=0</span> can only be non-zero at positions that correspond to DHF columns.</p>
<p>Or rather, there is a relation between real hashes (they all contain evaluations of the key polynomial) that the synthetic hashes do not have. This allows them to be distinguished.</p>
<p>Recall that for non-synthetic vouchers clients derives a hash of <span class="math inline"><code>id</code></span> (or rather some random mapping of id into the input space of DHF).</p>
<p>Because of this, the detection algorithm allows the server to recover the identifiers of the actual secret shards to use to reconstruct the associated data key. (Recall that the decrypted <span class="math inline"><code>rct</code></span> contains <span class="math inline"><em>r</em>=<em>D</em><em>H</em><em>F</em>(<code>hkey</code>,<em>x</em>)</span>)</p>
<p>After this point if the number of “real” matches is over the threshold then the server can decrypt all associated data.</p>
<h2 id="cute-algorithm-meets-world">Cute Algorithm meets World</h2>
<p>This is a very cute algorithm, exploiting some nice linear algebra to recover the indices of real matches, and the rest is standard cryptography to do the actual threshold encryption.</p>
<p>The paper does document a couple of <strong>real world considerations</strong> which I will briefly revisit and then provide a few of my own.</p>
<p><strong>Multiple Client Devices</strong> in order to properly track matches across devices every device needs to share the same set of keys.</p>
<p>If a user stores <strong>duplicate images</strong> that have the same hash but different random identifiers and the hash is matched in the database, then the 2 images will count twice towards the threshold. There is no mechanism in this protocol to prevent this, but the paper says it will be dealt with outside the protocol<em class="footnotelabel"></em><span class="sidenote">(presumably by the photo app, or the operating system, checking that duplicate images dont get assigned different identifiers…)</span>.</p>
<p>I have already documented a few issues with the overall probabilistic model in <a href="/obfuscated_apples.html">Obfuscated Apples</a> - the analysis there doesnt rely on any of the specific details in the protocol and instead focusing on how <span class="math inline"><em>t</em></span> is chosen in the first place.</p>
<p>However, now that we dived deeper it is clear to see a number of other issues:</p>
<p>First, the public cuckoo table that is distributed to clients is refreshed periodically under the assumption that vouchers generated with the old set of public data can be combined with vouchers generated from the new set. This would likely happen when new images (or rather their resulting hashes) are added to the backend database.</p>
<p>There is a small probability that an image that matches against one cuckoo table does not match against another. That means that if a client ends up uploading the same image identifier across the updates, then it could have different outcomes<em class="footnotelabel"></em>.</p>
<p>From a general perspective, this doesnt change the security argument much, and the server is <strong>supposed</strong> to learn about matching files.</p>
<p><strong>However</strong>, this also presents additional metadata to the system regarding the <em>use</em> of the files and <strong>allows the server to distinguish a synthetic match from a real match in the case where one fails, and the other succeeds<em class="footnotelabel"></em></strong>. <span class="sidenote"> Recall: Identifiers which trigger synthetic matches always succeed regardless of the set of images</span>.</p>
<hr/>
<p>This is not the only problems with parameter updates, the system requires a trusted third party to verify that the cuckoo table is continually computed correctly. This trusted third party needs access to both the original database of images, the hashing function <em>and</em> the cuckoo table setup<em class="footnotelabel"></em>. Without this trusted third party, Apple could fill that table to match arbitrary hashes without anyone being able to verify.</p>
<p>Given that, a malicious server can learn whatever it wants about the clients dataset bounded only by some external-to-the-protocol third party which both has to police what images are in the database and ensure that the server never distributes bad public data to clients. This is the real policy question of the system and one that I think has already covered extensively elsewhere<em class="footnotelabel"></em><span class="sidenote">See: <a href="https://christopher-parsons.com/the-problems-and-complications-of-apple-monitoring-for-child-sexual-abuse-material-in-icloud-photos/">The Problems and Complications of Apple Monitoring for Child Sexual Abuse Material in iCloud Photos by Christopher Parsons.</a></span>.</p>
<hr/>
<p>Back to the technical, it is interesting that none of the analysis considers a malicious client violating the correctness of the protocol. The reason given is “because there are many ways in which a malicious client can refuse to participate.”</p>
<p>This is funny in and of itself because the sole intention of this system is to catch malicious people.</p>
<p>Attacks from malicious clients are an interesting academic consideration, the main one documented in system is the client generating too many synthetic matches which drastically slows down the detection algorithm.</p>
<p>However, I also think it is worth considering a DoS attack from the client attempting to generated “matches” that ultimately decrypt to nonsense. This can be done by submitting vouchers for randomly generated hashes (as shown in <a href="/obfuscated_apples.html">Obfuscated Apples</a> it doesnt take that many random photos to trigger false positives in most perceptual algorithms given a large enough database that is being checked against) - this attack could likely be conducted using the phone hardware itself, maybe even through malware. There do not appear to be any defenses to this, and even through it is a blind attack, the literature on adversarial exploitation of perceptual hashing algorithms is not on the side of Apple here.</p>
<p>After these deep dives I remain thoroughly unconvinced of the technical soundness of this system, I cant see how it can uphold its privacy properties under normal operation, I think there are fundamental questions about how Apple is choosing parameters (like <span class="math inline"><em>t</em></span> and <span class="math inline"><em>S</em></span>) that significantly change the properties of the system, and I think there are malicious avenues for exploitation of this system even beyond the policy discussions circling in the media.</p>
Time will tell.
</article>
<hr/>
<h2>
Recent Articles
</h2>
<p><em>2021-08-16</em> <a href="ftpsi-parameters.html">Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI</a><br><em>2021-08-12</em> <a href="a_closer_look_at_fuzzy_threshold_psi.html">A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)</a><br><em>2021-08-10</em> <a href="obfuscated_apples.html">Obfuscated Apples</a><br></p>
<footer>
Sarah Jamie Lewis
</footer>
</body>
</html>