This commit is contained in:
Sarah Jamie Lewis 2021-08-16 11:06:50 -07:00
parent 34770915e8
commit 736954c505
13 changed files with 643 additions and 19 deletions

<!DOCTYPE html>
<html lang=en>
<head>
<meta charset="utf-8">
<title>A Closer Look at Fuzzy Threshold PSI (ftPSI-AD) | pseudorandom</title>
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:site" content="@sarahjamielewis" />
<meta name="twitter:creator" content="@sarahjamielewis" />
<meta property="og:url" content="https://pseudorandom.resistant.tech/a_closer_look_at_fuzzy_threshold_psi.html" />
<meta property="og:description" content="Apple recently released a detailed cryptographic paper describing a new pri" />
<meta property="og:title" content="A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)" />
<meta name="twitter:image" content="https://pseudorandom.resistant.tech/a_closer_look_at_fuzzy_threshold_psi.png">
<link rel="alternate" type="application/atom+xml" href="/feed.xml" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" type="text/css" href="styles.css">
<link rel="stylesheet" href="/katex/katex.min.css" integrity="sha384-RZU/ijkSsFbcmivfdRBQDtwuwVqK7GMOw6IMvKyeWL2K5UAlyp6WonmB8m7Jd0Hn" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="/katex//katex.min.js" integrity="sha384-pK1WpvzWVBQiP0/GjnvRxV4mOb0oxFuyRxJlk6vVw146n3egcN5C925NCP7a7BY8" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="/katex/auto-render.min.js" integrity="sha384-vZTG03m+2yp6N6BNi5iM4rW4oIwk5DfcNdFfxkk9ZWpDriOkXX8voJBFrAO7MpVl" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
</head>
<body>
<header>
<nav>
<strong>pseudorandom</strong>
<a href="./index.html">home</a>
<a href="mailto:sarah@openprivacy.ca">email</a>
<a href="cwtch:icyt7rvdsdci42h6si2ibtwucdmjrlcb2ezkecuagtquiiflbkxf2cqd">cwtch</a>
<a href="/feed.xml">atom</a>
</nav>
</header>
<article>
<h1 id="a-closer-look-at-fuzzy-threshold-psi-ftpsi-ad">A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)</h1>
<p>Apple recently released a detailed cryptographic paper describing a new private set intersection protocol which they named Fuzzy Threshold PSI with Associated Data, or ftPSI-AD for short<em class="footnotelabel"></em>.</p>
<p class="sidenote">
<a href="https://www.apple.com/child-safety/pdf/Apple_PSI_System_Security_Protocol_and_Analysis.pdf">The Apple PSI System</a>
</p>
<p>In my last article I sketched out a probabilistic analysis of their proposed approach under the assumption that ftPSI-AD was secure<em class="footnotelabel"></em>. I now want to take a closer look at the actual proposed protocol from a cryptographic perspective. This article is mostly a technical summary from my own notes, with some analysis at the end - you may want to <a href="#cute-algorithm-meets-world">skip</a> to the analysis part.</p>
<p class="sidenote">
<a href="/obfuscated_apples.html">Obfuscated Apples</a>
</p>
<h2 id="a-note-on-hash-functions">A note on Hash Functions</h2>
<p>There are at least 3 different sets of hash functions used in this system:</p>
<ol type="1">
<li>Mapping the image into the NeuralHash space - a perceptual hash used to detect matching images. Random collisions are very likely to happen.</li>
<li>Mapping the NeuralHash space into the blinded hash space - a cuckoo hash intended to prevent the client from learning about the actual hashes being compared against. Random collisions are cryptographically unlikely to happen here.</li>
<li>Mapping the PRF'd image identifier space into the DHF space - intended to allow the server to distinguish real matches from synthetic ones. As above, random collisions are cryptographically unlikely to happen here.</li>
</ol>
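<p>As a rough illustration of layer 2, here is a minimal cuckoo-hash sketch: two hash functions map a value into one of two buckets, and inserts evict an occupant on collision. The table size, hash construction and eviction policy here are illustrative stand-ins, not Apple's actual construction.</p>

```python
# Toy cuckoo hash table: every value has exactly two candidate slots,
# given by two domain-separated hash functions. Parameters are made up.
import hashlib

TABLE_SIZE = 16

def h(i, value):
    # i-th hash function: domain-separated SHA-256, reduced mod table size
    digest = hashlib.sha256(b"%d:%s" % (i, value)).digest()
    return int.from_bytes(digest, "big") % TABLE_SIZE

def insert(table, value, max_kicks=32):
    for _ in range(max_kicks):
        for i in (0, 1):
            slot = h(i, value)
            if table[slot] is None:
                table[slot] = value
                return True
        # both slots occupied: evict the occupant of the first slot and
        # try to re-place it on the next iteration
        slot = h(0, value)
        table[slot], value = value, table[slot]
    return False  # table too full; a real build would rehash

def lookup(table, value):
    # membership requires checking only two slots
    return any(table[h(i, value)] == value for i in (0, 1))

table = [None] * TABLE_SIZE
for v in [b"hash-a", b"hash-b", b"hash-c"]:
    assert insert(table, v)
assert lookup(table, b"hash-a") and not lookup(table, b"hash-z")
```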
<h2 id="a-protocol-summary">A Protocol Summary</h2>
<p>The server has a set of hash values <span class="math inline"><em>X</em></span>.</p>
<p>The client has an ordered list of triples (<span class="math inline"><em>Y</em></span>) each containing a hash value, an identifier and some associated data - i.e. the hash of an image, a random identifier, and data derived from the image.</p>
<p>Different images (with different identifiers and/or associated data) can have the same NeuralHash. NeuralHash has been engineered such that similar images result in the same hash. It is also, like all perceptual hashes, likely susceptible to false positives<em class="footnotelabel"></em>.</p>
<p class="sidenote">
again see <a href="/obfuscated_apples.html">Obfuscated Apples</a> for more analysis on this part of the system.
</p>
<p>The identifiers <strong>are not secret</strong>. The intent is for the server to learn the entire list of identifiers.</p>
<p>The ultimate goal is to construct a system in which the client streams triples to the server over a period of time. The client, in addition to some indistinguishable-from-random public data, learns only <span class="math inline"><em>t</em>,|<em>X</em>|</span>, i.e. the threshold of matches needed to trigger a report and the size of the server's database. The server learns only the set of random client identifiers, <strong>and</strong>, if the number of intersecting matches is greater than <span class="math inline"><em>t</em></span>, the associated data of the triples whose hashes overlap with entries in the database.</p>
<p>In addition, when the number of matches is below the threshold, it is desirable that the server have “some uncertainty” <em class="footnotelabel"></em><span class="sidenote">that this is never quantified says a lot about how much thought went into this element of the system</span> as to the exact size of the intersection and its contents.</p>
<p>To do this the client is also given a small secret set of identifiers <span class="math inline"><em>S</em></span> which are designated <strong>synthetic matches</strong>. These relate to actual identifiers on the client (i.e. in set <span class="math inline"><em>Y</em></span>).</p>
<p>The paper omits how values in <span class="math inline"><em>S</em></span> are actually chosen:</p>
<blockquote>
“We refer to the technical summary document [App21] for a further discussion of the purpose of synthetic matches and how the client selects the set <span class="math inline"><em>S</em></span>” - Apple PSI System
</blockquote>
<p><em class="footnotelabel"></em><span class="sidenote">[App21] does not detail how the client selects set <span class="math inline"><em>S</em></span></span></p>
<p>Then, when an image arrives on the device it is assigned a new triple <span class="math inline">(<code>hash</code>,<code>id</code>,<code>ad</code>)</span> in <span class="math inline"><em>Y</em></span> and the client generates a voucher for the triple.</p>
<p>There are 2 different ways of constructing a voucher depending on whether or not <span class="math inline"><code>id</code> ∈ <em>S</em></span>.</p>
<p>If <span class="math inline"><code>id</code> ∉ <em>S</em></span>:</p>
<ul>
<li>The image data is encrypted with an account specific associated data key.</li>
<li>The id of the image is permuted with a pseudorandom function to produce an x-coordinate for the secret share, and an input element for a detectable hash function<em class="footnotelabel"></em><span class="sidenote">more on the DHF later</span>. <span class="math inline"><em>r</em>=<em>D</em><em>H</em><em>F</em>(<code>hkey</code>,<em>x</em>)</span> where <span class="math inline"><code>hkey</code></span> is the client's static key for the hash function.</li>
<li>A new threshold share is generated such that <span class="math inline"><em>t</em></span> shares are needed to reconstruct the associated data key.</li>
<li>A random key is chosen and used to encrypt the output from the detectable hash <span class="math inline"><em>r</em></span>, encrypted image data and the secret share. This ciphertext is called (<span class="math inline"><code>rct</code></span>)</li>
<li>The hash is used to look up a value in the public distributed data (using a set of cuckoo hash functions and a cuckoo table containing public curve elements that are either random, or specially constructed if the hash appears in <span class="math inline"><em>X</em></span>). The value there is used to generate a specific set of curve points <span class="math inline">(<em>Q</em>,<em>S</em>)</span> from which a set of keys is derived and then encrypted with the random key. This happens twice, with different hash functions and random values, such that at least one of the calculated <span class="math inline">(<em>Q</em>,<em>S</em>)</span> pairs is related (<span class="math inline"><em>S</em><sub><em>i</em></sub>=<em>α</em> ⋅ <em>Q</em><sub><em>i</em></sub></span>) only if the hash has an entry in the cuckoo table, and independently random otherwise<em class="footnotelabel"></em>.</li>
<li>Finally, a voucher is constructed containing the random id of the image and modified ciphertexts in a random order, and <span class="math inline"><code>rct</code></span> and sent to the server.</li>
</ul>
<p class="sidenote">
This is a simplification.
</p>
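<p>The threshold-share step above can be sketched with textbook Shamir secret sharing: the associated data key is the constant term of a random polynomial of degree <span class="math inline"><em>t</em> − 1</span> over a prime field, each voucher carries one evaluation, and any <span class="math inline"><em>t</em></span> shares reconstruct the key. The field and encoding below are toy choices, not the paper's.</p>

```python
# Toy Shamir secret sharing: any T of the N shares reconstruct the key.
# Modulus and parameters are illustrative, not the protocol's.
import random

P = 2**127 - 1  # a Mersenne prime standing in for the field modulus
T = 3           # threshold: any T shares reconstruct the key

def make_shares(secret, n):
    # random polynomial with the secret as its constant term
    coeffs = [secret] + [random.randrange(P) for _ in range(T - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod P
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    # Lagrange interpolation at x = 0 recovers the constant term
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % P
                den = (den * (xi - xj)) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

key = random.randrange(P)
shares = make_shares(key, 10)
assert reconstruct(random.sample(shares, T)) == key
```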
<p>If <span class="math inline"><code>id</code> ∈ <em>S</em></span>:</p>
<ul>
<li>A random key is used to encrypt all zeros instead of the image data; this ciphertext is never decrypted.</li>
<li>The id of the image is permuted with a pseudorandom function. However, a random value is used in place of the derived x-coordinate for the share, and a random element of the DHF's output space is used in place of the DHF hash of the derived value.</li>
<li>A random key is generated and used to encrypt the dummy data (<span class="math inline"><code>rct</code></span>)</li>
<li>A random <span class="math inline">(<em>Q</em><sub>1</sub>,<em>S</em><sub>1</sub>)</span> is chosen to satisfy the relation <span class="math inline"><em>S</em><sub>1</sub>=<em>α</em> ⋅ <em>Q</em><sub>1</sub></span>, and <span class="math inline">(<em>Q</em><sub>2</sub>,<em>S</em><sub>2</sub>)</span> is chosen to be independently random; keys are derived and encrypted with the random key.</li>
<li>Finally, a voucher is constructed containing the id of the image, and these modified ciphertexts in a random order, and <span class="math inline"><code>rct</code></span> and sent to the server.</li>
</ul>
<h3 id="on-the-server">On the Server</h3>
<p>For each voucher that arrives at the server:</p>
<ul>
<li>Store the id,</li>
<li>For each of the 2 ciphertexts in the voucher:
<ul>
<li>derive <span class="math inline"><em>S</em><sub><em>j</em></sub>=<em>α</em> ⋅ <em>Q</em><sub><em>j</em></sub></span></li>
<li>derive the random key by decrypting the ciphertext with a key derived from <span class="math inline"><em>S</em><sub><em>j</em></sub></span></li>
<li>attempt to decrypt <span class="math inline"><code>rct</code></span> with the derived random key.</li>
<li>If this fails then mark this attempt as false.</li>
</ul></li>
<li>If both attempts are marked as false then this is a non-match</li>
<li>If either attempt is successful then the decrypted tuple is added to the set of known shares.</li>
<li>(If both are successful then the client is being malicious, and the server is instructed to ignore it)</li>
</ul>
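<p>The server-side loop above amounts to trial decryption with an authenticated cipher, where a wrong key is detected as a failure. A minimal sketch, using a SHA-256 keystream with an HMAC tag as stand-in crypto (the protocol's actual AEAD is not specified here):</p>

```python
# Toy authenticated encryption: trial decryption with a wrong key fails
# the tag check, so the server can mark the attempt as false.
import hashlib, hmac, os

def encrypt(key, plaintext):
    nonce = os.urandom(16)
    # keystream is a single SHA-256 block; fine for short toy plaintexts
    stream = hashlib.sha256(key + nonce).digest()[: len(plaintext)]
    ct = bytes(a ^ b for a, b in zip(plaintext, stream))
    tag = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    return nonce + ct + tag

def try_decrypt(key, blob):
    nonce, ct, tag = blob[:16], blob[16:-32], blob[-32:]
    expected = hmac.new(key, nonce + ct, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, tag):
        return None  # wrong key: mark this attempt as false
    stream = hashlib.sha256(key + nonce).digest()[: len(ct)]
    return bytes(a ^ b for a, b in zip(ct, stream))

rkey = os.urandom(32)
rct = encrypt(rkey, b"share||dhf-output")
# the server derives one candidate key per ciphertext in the voucher:
attempts = [try_decrypt(k, rct) for k in (os.urandom(32), rkey)]
assert attempts[0] is None                  # non-match: decryption fails
assert attempts[1] == b"share||dhf-output"  # match: share recovered
```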
<p>At this point some shares are “real”, and some shares are “synthetic”. The server ostensibly does not know which is which<em class="footnotelabel"></em>.</p>
<p class="sidenote">
See <a href="/obfuscated_apples.html">Obfuscated Apples</a> for a probabilistic analysis which disputes this claim.
</p>
<p>If the size of this set exceeds the threshold then the server can use the detectable hash function to distinguish the real shares from the synthetic ones.</p>
<h2 id="the-detectable-hash-function">The Detectable Hash Function</h2>
<p>Not much has been written about the new, proposed Detectable Hash Function, but it is the pin on which the privacy of this system ultimately rests (under the assumption that the rest of the crypto is sound).</p>
<p>A Detectable Hash Function (or (s,t)-DHF) is defined as a keyed hash function that takes in a key and an element and outputs a value in some distribution <span class="math inline"><em>R</em><sup><em>t</em></sup></span> where <span class="math inline"><em>t</em></span> is the threshold of the system. It is designed to be used in systems that deliberately mix genuinely hashed values with randomly selected elements from the hash space (<span class="math inline"><em>R</em></span>).</p>
<p>There exists a <strong>detection algorithm</strong> which is deterministic and invoked as <span class="math inline"><em>D</em>(<em>v</em>)</span> where v is a vector of elements in <span class="math inline"><em>R</em></span>. The algorithm outputs either a vector of true elements, or fails.</p>
<p>When at least <span class="math inline"><em>t</em>+1</span> entries in <span class="math inline"><em>v</em></span> are generated by a DHF and at most <span class="math inline"><em>s</em></span> are random then the algorithm should identify all generated elements i.e. it can distinguish random elements from hashed elements.</p>
<p>The paper constructs a hash function by defining the key to be a sequence of <span class="math inline"><em>s</em></span> polynomials of degree at most <span class="math inline"><em>t</em> − 1</span>, which can be arranged into an <span class="math inline"><em>s</em> × <em>t</em></span> matrix of coefficients.</p>
<p><br /><span class="math display">$$ DHF(k, x_0) := (x_0, p_1(x_0),...p_s(x_0)) \in \mathbb{F}_{l}^{s+1} \vphantom{+ \frac{1}{1}}$$</span><br /></p>
<p>where <span class="math inline"><em>l</em></span> is some large 64-bit number. This hash output is treated as a column vector of size <span class="math inline"><em>s</em>+1</span>.</p>
<p>For random elements the distribution <span class="math inline">(<em>D</em><em>H</em><em>F</em>(<em>k</em>,<em>x</em><sub>1</sub>)...<em>D</em><em>H</em><em>F</em>(<em>k</em>,<em>x</em><sub><em>t</em></sub>))</span> is random in <span class="math inline">𝔽<sub><em>l</em></sub><sup><em>t</em></sup></span>.</p>
<p>i.e. for every input <span class="math inline"><em>x</em></span> we evaluate the sequence of key polynomials at it and derive a vector of elements in <span class="math inline">𝔽<sub><em>l</em></sub><sup><em>s</em>+1</sup></span>.</p>
<p>The detection algorithm is given a vector of elements of <span class="math inline"><em>R</em></span> which are each expanded into a larger vector by raising the first element, <span class="math inline"><em>x</em><sub>0</sub></span>, to increasing powers up to <span class="math inline"><em>t</em> − 1</span>, followed by the rest of the values from the hash output.</p>
<p>After the expansion, the vectors are arranged into a matrix <span class="math inline"><em>M</em></span> such that every row of <span class="math inline"><em>M</em></span> contains at least <span class="math inline"><em>t</em>+1</span> evaluations of some polynomial of degree at most <span class="math inline"><em>t</em> − 1</span>, e.g.</p>
<p><br /><span class="math display">$$
\begin{bmatrix}
1 &amp; 1 &amp; \dots &amp; 1 \\
x_0 &amp; r &amp; \dots &amp; x_{n} \\
x_0^2 &amp; r^2 &amp; \dots &amp; x_{n}^2 \\
\vdots &amp; \vdots &amp; \ddots &amp; \vdots &amp; \\
x_0^{t-1} &amp; r^{t-1} &amp; \dots &amp; x_{n}^{t-1} \\
p_0(x_0) &amp; r_1 &amp; \dots &amp; p_0(x_n) \\
\vdots &amp; \vdots &amp; \ddots &amp; \vdots &amp; \\
p_s(x_0) &amp; r_s &amp; \dots &amp; p_s(x_n)
\end{bmatrix}
$$</span><br /></p>
<p>The kernel of this matrix is calculated, and the column indexes of <span class="math inline"><em>M</em></span> at which the kernel vectors are non-zero are the indexes of “real” matches; i.e. any vector <span class="math inline"><em>w</em></span> such that <span class="math inline"><em>M</em> ⋅ <em>w</em>=0</span> can only be non-zero at positions that correspond to DHF columns.</p>
<p>Or rather, there is a relation between real hashes (they all contain evaluations of the key polynomial) that the synthetic hashes do not have. This allows them to be distinguished.</p>
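<p>This detection step can be demonstrated end-to-end on toy parameters (here <span class="math inline"><em>t</em>=3</span>, <span class="math inline"><em>s</em>=2</span> and a 31-bit prime; Apple's field and parameters differ): build the matrix <span class="math inline"><em>M</em></span> from <span class="math inline"><em>t</em>+1</span> real DHF outputs plus <span class="math inline"><em>s</em></span> random ones, compute its right kernel modulo the prime, and read the real indexes off the support of the kernel vectors.</p>

```python
# Toy (s,t)-DHF detection: kernel vectors of M are supported only on the
# columns that came from genuine DHF evaluations. Parameters are made up.
import random

P = 2**31 - 1          # a prime standing in for the field F_l
T = 3                  # key polynomials have degree at most T-1
S = 2                  # maximum number of synthetic (random) entries

def keygen():
    return [[random.randrange(P) for _ in range(T)] for _ in range(S)]

def poly_eval(coeffs, x):
    acc = 0
    for c in reversed(coeffs):
        acc = (acc * x + c) % P
    return acc

def dhf(key, x):
    # DHF(k, x) = (x, p_1(x), ..., p_s(x))
    return (x, [poly_eval(p, x) for p in key])

def expand(entry):
    # column: (1, x, x^2, ..., x^{T-1}, p_1(x), ..., p_s(x))
    x, tail = entry
    return [pow(x, i, P) for i in range(T)] + list(tail)

def right_kernel_mod_p(M):
    # Gaussian elimination mod P; returns a basis of {w : M w = 0}
    rows, cols = len(M), len(M[0])
    A = [row[:] for row in M]
    pivot_cols, r = [], 0
    for c in range(cols):
        piv = next((i for i in range(r, rows) if A[i][c] % P), None)
        if piv is None:
            continue
        A[r], A[piv] = A[piv], A[r]
        inv = pow(A[r][c], P - 2, P)
        A[r] = [(v * inv) % P for v in A[r]]
        for i in range(rows):
            if i != r and A[i][c]:
                f = A[i][c]
                A[i] = [(A[i][j] - f * A[r][j]) % P for j in range(cols)]
        pivot_cols.append(c)
        r += 1
        if r == rows:
            break
    basis = []
    for free in (c for c in range(cols) if c not in pivot_cols):
        w = [0] * cols
        w[free] = 1
        for i, pc in enumerate(pivot_cols):
            w[pc] = (-A[i][free]) % P
        basis.append(w)
    return basis

key = keygen()
real_xs = random.sample(range(2, P), T + 1)     # t+1 genuine hashes
entries = [("real", dhf(key, x)) for x in real_xs]
entries += [("synthetic",
             (random.randrange(P), [random.randrange(P) for _ in range(S)]))
            for _ in range(S)]
random.shuffle(entries)

cols = [expand(e) for _, e in entries]
M = [[cols[j][i] for j in range(len(cols))] for i in range(T + S)]
kernel = right_kernel_mod_p(M)
detected = sorted({j for w in kernel for j, v in enumerate(w) if v})
assert all(entries[j][0] == "real" for j in detected)
print("detected real indexes:", detected)
```

The real columns all live in a <span class="math inline"><em>t</em></span>-dimensional subspace (each polynomial row is a linear combination of the power rows), so <span class="math inline"><em>t</em>+1</span> of them must be linearly dependent, while up to <span class="math inline"><em>s</em></span> random columns generically are not: the kernel's support is exactly the real columns.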
<p>Recall that for non-synthetic vouchers the client derives a hash of <span class="math inline"><code>id</code></span> (or rather, of some random mapping of id into the input space of the DHF).</p>
<p>Because of this, the detection algorithm allows the server to recover the identifiers of the actual secret shares and use them to reconstruct the associated data key. (Recall that the decrypted <span class="math inline"><code>rct</code></span> contains <span class="math inline"><em>r</em>=<em>D</em><em>H</em><em>F</em>(<code>hkey</code>,<em>x</em>)</span>.)</p>
<p>After this point if the number of “real” matches is over the threshold then the server can decrypt all associated data.</p>
<h2 id="cute-algorithm-meets-world">Cute Algorithm meets World</h2>
<p>This is a very cute algorithm, exploiting some nice linear algebra to recover the indices of real matches, and the rest is standard cryptography to do the actual threshold encryption.</p>
<p>The paper does document a couple of <strong>real world considerations</strong> which I will briefly revisit and then provide a few of my own.</p>
<p><strong>Multiple Client Devices</strong>: in order to properly track matches across devices, every device needs to share the same set of keys.</p>
<p>If a user stores <strong>duplicate images</strong> that have the same hash but different random identifiers and the hash is matched in the database, then the 2 images will count twice towards the threshold. There is no mechanism in this protocol to prevent this, but the paper says it will be dealt with outside the protocol<em class="footnotelabel"></em><span class="sidenote">(presumably by the photo app, or the operating system, checking that duplicate images don't get assigned different identifiers…)</span>.</p>
<p>I have already documented a few issues with the overall probabilistic model in <a href="/obfuscated_apples.html">Obfuscated Apples</a> - the analysis there doesn't rely on any of the specific details of the protocol, focusing instead on how <span class="math inline"><em>t</em></span> is chosen in the first place.</p>
<p>However, now that we have dived deeper, a number of other issues become clear:</p>
<p>First, the public cuckoo table that is distributed to clients is refreshed periodically under the assumption that vouchers generated with the old set of public data can be combined with vouchers generated from the new set. This would likely happen when new images (or rather their resulting hashes) are added to the backend database.</p>
<p>There is a small probability that an image that matches against one cuckoo table does not match against another. That means that if a client ends up uploading the same image identifier across the updates, then it could have different outcomes<em class="footnotelabel"></em>.</p>
<p>From a general perspective this doesn't change the security argument much; the server is, after all, <strong>supposed</strong> to learn about matching files.</p>
<p><strong>However</strong>, this also presents additional metadata to the system regarding the <em>use</em> of the files and <strong>allows the server to distinguish a synthetic match from a real match in the case where one fails, and the other succeeds<em class="footnotelabel"></em></strong>. <span class="sidenote"> Recall: Identifiers which trigger synthetic matches always succeed regardless of the set of images</span>.</p>
<hr/>
<p>This is not the only problem with parameter updates: the system requires a trusted third party to verify that the cuckoo table is continually computed correctly. This trusted third party needs access to the original database of images, the hashing function <em>and</em> the cuckoo table setup<em class="footnotelabel"></em>. Without this trusted third party, Apple could fill the table to match arbitrary hashes without anyone being able to verify it.</p>
<p>Given that, a malicious server can learn whatever it wants about the clients dataset, bounded only by some external-to-the-protocol third party which both has to police what images are in the database and ensure that the server never distributes bad public data to clients. This is the real policy question of the system, and one that I think has already been covered extensively elsewhere<em class="footnotelabel"></em><span class="sidenote">See: <a href="https://christopher-parsons.com/the-problems-and-complications-of-apple-monitoring-for-child-sexual-abuse-material-in-icloud-photos/">The Problems and Complications of Apple Monitoring for Child Sexual Abuse Material in iCloud Photos by Christopher Parsons.</a></span>.</p>
<hr/>
<p>Back to the technical, it is interesting that none of the analysis considers a malicious client violating the correctness of the protocol. The reason given is “because there are many ways in which a malicious client can refuse to participate.”</p>
<p>This is funny in and of itself because the sole intention of this system is to catch malicious people.</p>
<p>Attacks from malicious clients are an interesting academic consideration; the main one documented in the paper is a client generating too many synthetic matches, which drastically slows down the detection algorithm.</p>
<p>However, I also think it is worth considering a DoS attack in which the client attempts to generate “matches” that ultimately decrypt to nonsense. This can be done by submitting vouchers for randomly generated hashes (as shown in <a href="/obfuscated_apples.html">Obfuscated Apples</a>, it doesn't take that many random photos to trigger false positives in most perceptual algorithms, given a large enough database being checked against) - this attack could likely be conducted using the phone hardware itself, maybe even through malware. There do not appear to be any defenses to this, and even though it is a blind attack, the literature on adversarial exploitation of perceptual hashing algorithms is not on Apple's side here.</p>
<p>After these deep dives I remain thoroughly unconvinced of the technical soundness of this system. I can't see how it can uphold its privacy properties under normal operation; there are fundamental questions about how Apple is choosing parameters (like <span class="math inline"><em>t</em></span> and <span class="math inline"><em>S</em></span>) that significantly change the properties of the system; and there are malicious avenues for exploiting this system even beyond the policy discussions circling in the media.</p>
<p>Time will tell.</p>
</article>
<hr/>
<h2>
Recent Articles
</h2>
<p><em>2021-08-16</em> <a href="ftpsi-parameters.html">Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI</a><br><em>2021-08-12</em> <a href="a_closer_look_at_fuzzy_threshold_psi.html">A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)</a><br><em>2021-08-10</em> <a href="obfuscated_apples.html">Obfuscated Apples</a><br></p>
<footer>
Sarah Jamie Lewis
</footer>
</body>
</html>

<name>Sarah Jamie Lewis</name>
</author>
<id>urn:uuid:699b0ba2-2fbf-4f9d-b5ac-4a7e044be3c6</id>
<entry>
<id>urn:uuid:3258032e-ec03-4847-9c54-30464284bb52</id>
<title>Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI</title>
<link href="https://pseudorandom.resistant.tech/ftpsi-parameters.html"/>
<updated>2021-08-16T11:05:00Z</updated>
<summary>Last week, Apple published additional information regarding the parameterization of their new Fuzzy Threshold
PSI system in the form of a Security Threat Model...</summary>
</entry>
<entry>
<id>urn:uuid:a7b71725-1cb6-4d8f-8e19-e92580d5b316</id>
<title>A Closer Look At Fuzzy Threshold PSI</title>
<link href="https://pseudorandom.resistant.tech/a_closer_look_at_fuzzy_threshold_psi.html"/>
<updated>2021-08-12T14:30:00Z</updated>
<summary>Apple recently released a detailed cryptographic paper describing a new private
set intersection protocol which they named Fuzzy Threshold PSI with Associated Data, or ftPSI-AD for short...</summary>
</entry>
<entry>
<id>urn:uuid:fc37f259-004e-406b-addb-85cda6107e7b</id>
<title>Obfuscated Apples</title>

ftpsi-parameters.html
<!DOCTYPE html>
<html lang=en>
<head>
<meta charset="utf-8">
<title>Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI | pseudorandom</title>
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:site" content="@sarahjamielewis" />
<meta name="twitter:creator" content="@sarahjamielewis" />
<meta property="og:url" content="https://pseudorandom.resistant.tech/ftpsi-parameters.html" />
<meta property="og:description" content="Last week, Apple published more additional information regarding the parameterization of their new Fuzzy Thres" />
<meta property="og:title" content="Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI" />
<meta name="twitter:image" content="https://pseudorandom.resistant.tech/ftpsi-parameters.png">
<link rel="alternate" type="application/atom+xml" href="/feed.xml" />
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" type="text/css" href="styles.css">
<link rel="stylesheet" href="/katex/katex.min.css" integrity="sha384-RZU/ijkSsFbcmivfdRBQDtwuwVqK7GMOw6IMvKyeWL2K5UAlyp6WonmB8m7Jd0Hn" crossorigin="anonymous">
<!-- The loading of KaTeX is deferred to speed up page rendering -->
<script defer src="/katex//katex.min.js" integrity="sha384-pK1WpvzWVBQiP0/GjnvRxV4mOb0oxFuyRxJlk6vVw146n3egcN5C925NCP7a7BY8" crossorigin="anonymous"></script>
<!-- To automatically render math in text elements, include the auto-render extension: -->
<script defer src="/katex/auto-render.min.js" integrity="sha384-vZTG03m+2yp6N6BNi5iM4rW4oIwk5DfcNdFfxkk9ZWpDriOkXX8voJBFrAO7MpVl" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
</head>
<body>
<header>
<nav>
<strong>pseudorandom</strong>
<a href="./index.html">home</a>
<a href="mailto:sarah@openprivacy.ca">email</a>
<a href="cwtch:icyt7rvdsdci42h6si2ibtwucdmjrlcb2ezkecuagtquiiflbkxf2cqd">cwtch</a>
<a href="/feed.xml">atom</a>
</nav>
</header>
<article>
<h1 id="revisiting-first-impressions-apple-parameters-and-fuzzy-threshold-psi">Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI</h1>
<p>Last week, Apple published additional information regarding the parameterization of their new Fuzzy Threshold PSI system in the form of a Security Threat Model<em class="footnotelabel"></em>.</p>
<p class="sidenote">
<a href="https://www.apple.com/child-safety/pdf/Security_Threat_Model_Review_of_Apple_Child_Safety_Features.pdf">Security Threat Model Review of Apple's Child Safety Features</a>
</p>
<p>Contained in the document are various answers to questions that the privacy community had been asking since the initial announcement. It also contained information which answered several of my own questions, and in turn invalidated a few of the assumptions I had made in a previous article<em class="footnotelabel"></em>.</p>
<p class="sidenote">
<a href="/obfuscated_apples.html">Obfuscated Apples</a>
</p>
<p>In particular, Apple have now stated the following:</p>
<ul>
<li>they claim the false acceptance rate of NeuralHash is 3 in 100M, but are assuming it is 1 in 1M. They have conducted tests on both a dataset of 100M photos and on a dataset of 500K pornographic photos.</li>
<li>the threshold <span class="math inline"><em>t</em></span> they are choosing for the system is <strong>30</strong>, with a future option to lower it. They claim this is based on taking the assumed false positive rate of NeuralHash and applying it to an assumed dataset the size of the largest iCloud photo library, to obtain a probability of false reporting of 1 in a trillion.</li>
</ul>
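<p>The stated arithmetic can be sanity-checked in a few lines. Assuming the 1-in-1M false positive rate and a hypothetical library of 4M photos (Apple does not publish the library size they used; that figure is my assumption), the number of false matches is approximately Poisson, and the tail above <span class="math inline"><em>t</em>=30</span> is indeed far below 1 in a trillion:</p>

```python
# Sanity check of the threshold arithmetic. The per-image false positive
# rate follows Apple's worst-case assumption; the library size is an
# illustrative guess, not a published figure.
import math

FPR = 1e-6                # assumed per-image false positive rate
N = 4_000_000             # hypothetical library size (my assumption)
T = 30                    # Apple's stated initial threshold
lam = N * FPR             # expected false matches, Poisson approximation

# P(X >= T) for X ~ Poisson(lam): sum the tail directly, starting from
# the k = T term, to avoid catastrophic cancellation near 1.0
term = math.exp(-lam + T * math.log(lam) - math.lgamma(T + 1))
tail, k = 0.0, T
while term > 1e-300:
    tail += term
    k += 1
    term *= lam / k
print(f"P(>= {T} false matches in a library of {N}): {tail:.2e}")
```

Note how sensitive the result is to the assumptions: the probability collapses only because the expected number of false matches (here 4) sits far below the threshold.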
<p>One might ask: if the false acceptance rate of NeuralHash is so low, then why take such precautions when estimating <span class="math inline"><em>t</em></span>?</p>
<p>I will give Apple the benefit of the doubt here under the assumption that they really are attempting to only catch prolific offenders.</p>
<p>Even still, I believe the most recent information by Apple still leaves several unanswered questions, and raises several more.</p>
<h2 id="on-neuralhash">On NeuralHash</h2>
<p>To put it as straightforwardly as possible, 100.5M photos isn't that large a sample to compare a perceptual hashing algorithm against, and the performance is directly related to the size of the comparison database (which we don't know).</p>
<p>Back in 2017 WhatsApp estimated that they were seeing 4.5 billion photos being uploaded to the platform per day<em class="footnotelabel"></em>. While we don't have figures for iCloud, we can imagine, given Apple's significant customer base, that it is on a similar order of magnitude.</p>
<p class="sidenote">
<a href="https://blog.whatsapp.com/connecting-one-billion-users-every-day">Connecting One Billion Users Every Day - Whatsapp Blog</a>
</p>
<p>The types of the photos being compared also matter. We know nothing about the 100.5M photos that Apple tested against, only that a small 500K sample was pornographic in nature. While NeuralHash seems to have been designed as a generic image comparison algorithm, that doesn't mean that it acts on all images uniformly.</p>
<h2 id="on-the-thresholds">On the Thresholds</h2>
<blockquote>
<p>Since this initial threshold contains a drastic safety margin reflecting a worst-case assumption about real-world performance, we may change the threshold after continued empirical evaluation of NeuralHash false positive rates but the match threshold will never be lower than what is required to produce a one-in-one-trillion false positive rate for any given account - Security Threat Model Review of Apple's Child Safety Features</p>
</blockquote>
<p>Apple's initial value of <span class="math inline"><em>t</em>=30</span> was chosen to include a <strong>drastic safety margin</strong>. The threat model gives them the explicit ability to change it in the future, though they promise the floor is still 1 in a trillion for “any given account”.</p>
<p>We still know very little about how <span class="math inline"><em>s</em></span> will be chosen. We can assume it will be of the same order of magnitude as <span class="math inline"><em>t</em></span>, and that as such the number of synthetics for each user will be relatively low compared to the total size of their image base.</p>
<p>Also, given that <span class="math inline"><em>t</em></span> is fixed across all accounts, we can be relatively sure that <span class="math inline"><em>s</em></span> will also be fixed across all accounts, with only the probability of choosing a synthetic match being varied by some unknown function.</p>
<p>Note that, if the probability of synthetic matches is too high, then the detection algorithm<em class="footnotelabel"></em> fails with high probability, requiring more matches and an extended detection procedure.</p>
<p class="sidenote">
As an aside, if you are interested in playing with the Detectable Hash Function yourself <a href="https://git.openprivacy.ca/sarah/fuzzyhash">I wrote a toy version of it</a>
</p>
<h2 id="threat-model-expansions">Threat Model Expansions</h2>
<p>The new threat model includes new jurisdictional protections for the database that were not present in the original description - namely, that the <strong>intersection</strong> of two ostensibly independent databases managed by different agencies in different national jurisdictions will be used instead of a single database<em class="footnotelabel"></em> <span class="sidenote">(such as the one run by NCMEC)</span>.</p>
<p>Additionally, Apple have now stated they will publish a “Knowledge Base” containing root hashes of the encrypted database such that it can be confirmed that every device is comparing images to the same database. It is worth noting that this claim is only as good as security researchers having access to proprietary Apple code.</p>
<p>That such significant changes were made to the threat model a week after the initial publication is perhaps the best testament to the idea that, as Matthew Green put it:</p>
<blockquote>
<p>“But this illustrates something important: in building this system, the <em>only limiting principle</em> is how much heat Apple can tolerate before it changes its policies.” - <a href="https://twitter.com/matthew_d_green/status/1426312939015901185">Matthew Green</a></p>
</blockquote>
<h2 id="revisiting-first-impressions">Revisiting First Impressions</h2>
<p>I think the most important question I can ask of myself right now is this: if Apple had put out all these documents on day one, would they have been enough to quell the voice inside my head?</p>
<p>Assuming that Apple also verified the false acceptance rate of NeuralHash in a way more verifiable than “we tested it on some images, it’s all good, trust us!” then I think many of my technical objections to this system would have been answered.</p>
<p>Not all of them though. I still, for example, think that the obfuscation in this system is fundamentally flawed from a practical perspective. And, I still think that the threat model as applied to malicious clients undermines the rest of the system<em class="footnotelabel"></em></p>
<p class="sidenote">
See: <a href="/a_closer_look_at_fuzzy_threshold_psi.html">A Closer Look at Fuzzy Threshold PSI</a> for more details.
</p>
<h2 id="its-about-the-principles">It’s About the Principles</h2>
<p>And, of course, none of that quells my moral objections to such a system.</p>
<p>You can wrap that surveillance in any number of layers of cryptography to try and make it palatable, the end result is the same.</p>
<p>Everyone on Apple’s platform is treated as a potential criminal, subject to continual algorithmic surveillance without warrant or cause.</p>
<p>If Apple are successful in introducing this, how long do you think it will be before the same is expected of other providers? Before walled gardens prohibit apps that don’t do it? Before it is enshrined in law?<em class="footnotelabel"></em> <span class="sidenote"><a href="https://twitter.com/SarahJamieLewis/status/1423403656733290496">Tweet</a></span></p>
<p>How long do you think it will be before the database is expanded to include “terrorist” content? “harmful-but-legal” content? state-specific censorship?</p>
<p>This is not a slippery slope argument. For decades, we have seen governments and corporations push for ever more surveillance. It is obvious how this system will be abused. It is obvious that Apple will not be in control of how it will be abused for very long.</p>
<p>Accepting client side scanning onto personal devices <strong>is</strong> a Rubicon moment: it signals a sea-change in how corporations relate to their customers. Your personal device is no longer “yours” in theory, nor in practice. It can, and will, be used against you.</p>
<p>It is also abundantly clear that this is going to happen. While Apple has come under pressure, it has responded by painting critics as “confused” (which, if there is any truth in that claim, is due to their own lack of technical specifications).</p>
<p>The media have likewise mostly followed Apple’s PR lead. While I am thankful that we have answers to some questions that were asked, and that we seem to have caused Apple to “clarify”<em class="footnotelabel"></em> <span class="sidenote">(or, less subtly, change)</span> their own threat model, we have not seen the outpouring of objection that would have been necessary to shut this down before it spread further.</p>
The future of privacy on consumer devices is now forever changed. The impact might not be felt today or tomorrow, but in the coming months please watch for the politicians (and sadly, the cryptographers) who argue that what can be done for CSAM can be done for the next harm, and the next harm. Watch the EU and the UK, among others, declare such scanning mandatory, and watch as your devices cease to work for you.
</article>
<hr/>
<h2>
Recent Articles
</h2>
<p><em>2021-08-16</em> <a href="ftpsi-parameters.html">Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI</a><br><em>2021-08-12</em> <a href="a_closer_look_at_fuzzy_threshold_psi.html">A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)</a><br><em>2021-08-10</em> <a href="obfuscated_apples.html">Obfuscated Apples</a><br></p>
<footer>
Sarah Jamie Lewis
</footer>
</body>
</html>

BIN
ftpsi-parameters.png Normal file


View File

@ -46,7 +46,7 @@
<h2>
Recent Articles
</h2>
<p>2021-08-11 <a href="obfuscated_apples.html">Obfuscated Apples</a><br></p>
<p><em>2021-08-16</em> <a href="ftpsi-parameters.html">Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI</a><br><em>2021-08-12</em> <a href="a_closer_look_at_fuzzy_threshold_psi.html">A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)</a><br><em>2021-08-10</em> <a href="obfuscated_apples.html">Obfuscated Apples</a><br></p>
<footer>
Sarah Jamie Lewis
</footer>


View File

@ -73,7 +73,7 @@ i.e. even if we treat the people who design and build this system as honest adv
“The threshold is selected to provide an extremely low (1 in 1 trillion) probability of incorrectly flagging a given account.” - Apple Technical Summary
</blockquote>
<p>We can actually work backwards from that number to derive <span class="math inline"><em>P</em>(<code>falsepositive</code>)</span>:</p>
<p><br /><span class="math display">$$P(\texttt{flag}) = \sum_{\substack{x = t}}^T {T \choose x} \cdot P(\texttt{falsepositive})^x \cdot P(\texttt{falsepositive})^{T - x} \approx 1\mathrm{e}^{-12}$$</span><br /></p>
<p><br /><span class="math display">$$P(\texttt{flag}) = \sum_{\substack{x = t}}^T {T \choose x} \cdot P(\texttt{falsepositive})^x \cdot (1-P(\texttt{falsepositive}))^{T - x} \approx 1\mathrm{e}^{-12}$$</span><br /></p>
<p>In order to finalize this we only need to make educated guesses about 2 parameters: the threshold value, <span class="math inline"><em>t</em></span>, and the total number of photos checked per year, <span class="math inline"><em>T</em></span>. Apple throws out the number <span class="math inline"><em>t</em>=10</span> in their technical summary, which seems like a good place to start.</p>
<p>Assuming an average account generates 3-4 pictures a day to be checked then <span class="math inline"><em>T</em> ≈ 1278</span> over a year. Plugging in those numbers, we get <span class="math inline"><em>P</em>(<code>falsepositive</code>) ≈ 0.00035</span> or <strong>1 in 2858</strong>.</p>
<p>Does that number have any relation to reality? There is evidence<em class="footnotelabel"></em> to suggest <span class="sidenote"><a href="https://arxiv.org/abs/2106.09820">Adversarial Detection Avoidance Attacks: Evaluating the robustness of perceptual hashing-based client-side scanning.</a> Shubham Jain, Ana-Maria Cretu, Yves-Alexandre de Montjoye</span> that the false acceptance rate for common perceptual hashing algorithms is between 0.001-0.01 for a database size of 500K.</p>
@ -142,7 +142,7 @@ And, again, if Apple can define <span class="math inline"><em>P</em>(<code>synth
<h2>
Recent Articles
</h2>
<p>2021-08-11 <a href="obfuscated_apples.html">Obfuscated Apples</a><br></p>
<p><em>2021-08-16</em> <a href="ftpsi-parameters.html">Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI</a><br><em>2021-08-12</em> <a href="a_closer_look_at_fuzzy_threshold_psi.html">A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)</a><br><em>2021-08-10</em> <a href="obfuscated_apples.html">Obfuscated Apples</a><br></p>
<footer>
Sarah Jamie Lewis
</footer>

View File

@ -0,0 +1,197 @@
# A Closer Look at Fuzzy Threshold PSI (ftPSI-AD)
Apple recently released a detailed cryptographic paper describing a new private
set intersection protocol which they named Fuzzy Threshold PSI with Associated Data, or ftPSI-AD for short@@^.
<p class="sidenote"><a href="https://www.apple.com/child-safety/pdf/Apple_PSI_System_Security_Protocol_and_Analysis.pdf">The Apple PSI System</a></p>
In my last article I sketched out a probabilistic analysis of their proposed
approach under the assumption that ftPSI-AD was secure@@^. I now want to take a closer look at the actual proposed protocol from a cryptographic perspective. This article will be mostly technical summary from my own notes, with some analysis
at the end - you may want to [skip](#cute-algorithm-meets-world) to the analysis part.
<p class="sidenote"><a href="/obfuscated_apples.html">Obfuscated Apples</a></p>
## A note on Hash Functions
There are at least 3 different sets of hash functions used in this system:
1. Mapping the image into the NeuralHash space - a perceptual hash used to detect matching images. Random collisions are very likely to happen.
2. Mapping the NeuralHash space into the blinded hash space - a cuckoo hash intended to prevent the client from learning about the actual hashes being compared against. Random collisions are cryptographically unlikely to happen here.
3. Mapping the PRF'd image identifier space into the DHF space - intended to allow the server to distinguish real hashes from synthetic matches, as above.
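The second mapping above can be illustrated with a minimal cuckoo-table sketch. This is plain Python with made-up hash functions (`h1`, `h2` and the helper names are mine); Apple's actual table stores blinded curve elements rather than raw items, so treat this only as a sketch of the data structure:

```python
import random

def cuckoo_insert(table, hashes, item, max_kicks=32):
    """Place item in one of its candidate slots, evicting occupants as needed."""
    for _ in range(max_kicks):
        for h in hashes:
            i = h(item)
            if table[i] is None:
                table[i] = item
                return True
        # both candidate slots occupied: evict one occupant and re-place it
        i = random.choice(hashes)(item)
        table[i], item = item, table[i]
    return False  # would normally trigger a rebuild with fresh hash functions

def cuckoo_lookup(table, hashes, item):
    # membership needs only len(hashes) probes, independent of table size
    return any(table[h(item)] == item for h in hashes)

N = 32
table = [None] * N
h1 = lambda x: x % N
h2 = lambda x: (x * 2654435761) >> 27 & (N - 1)  # Knuth-style multiplicative hash
ok = all(cuckoo_insert(table, [h1, h2], x) for x in range(1, 13))
```

The point of the structure is that a lookup probes exactly two slots, which is why a client can check a NeuralHash against the (blinded) table without scanning the server's whole database.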
## A Protocol Summary
The server has a set of hash values $X$.
The client has an ordered list of triples ($Y$) containing a hash value, an identifier and some associated data i.e. an image, a random identifier and the hash of the image.
Different images (with different identifiers and/or associated data) can have the same NeuralHash. NeuralHash has been engineered such that similar images result in the same hash. It is also likely susceptible to false positives, like all perceptual hashes@@^.
<p class="sidenote">again see <a href="/obfuscated_apples.html">Obfuscated Apples</a> for more analysis on this part of the system.</p>
The identifiers **are not secret**. The intent is for the server to learn the entire list of identifiers.
The ultimate goal is to construct a system in which the client streams triples to the server over a period of time. The client, in addition to some indistinguishable-from-random public data, learns only $t$ and $|X|$, i.e. the threshold of matches needed to trigger a report and the size of the server's database. The server learns only the set of random client identifiers **and**, once the number of intersecting matches exceeds $t$, the associated data of the triples whose hashes overlap with entries in the database.
In addition, when the number of matches is below the threshold, it is desirable that the server have "some uncertainty" @@^<span class="sidenote">that this is never quantified says a lot about how much thought went into this element of the system</span> as to the exact size of the intersection and its contents.
To do this the client is also given a small secret set of identifiers $S$ which are designated **synthetic matches**. These relate to actual identifiers on the client (i.e. in set $Y$).
The paper omits how values in $S$ are actually chosen:
<blockquote>
"We refer to the technical summary document [App21] for a further discussion of the purpose of synthetic matches and how the client selects the set $S$" - Apple PSI System </blockquote>@@^<span class="sidenote">[App21] does not detail how the client selects set $S$)</span>
Then, when an image arrives on the device it is assigned a new triple $(\texttt{hash}, \texttt{id}, \texttt{ad})$ in $Y$ and the client generates a voucher for the triple.
There are 2 different ways of constructing a voucher depending on whether or not $\texttt{id} \in S$
If $\texttt{id} \notin S$:
- The image data is encrypted with an account specific associated data key.
- The id of the image is permuted with a pseudorandom function, to produce an x-coordinate of the secret shard, and an input element for a detectable hash function@@^<span class="sidenote">more on the DHF later</span>. $r = DHF(\texttt{hkey}, x\prime)$ where $\texttt{hkey}$ is the client's static key for the hash function.
- A new threshold share is generated such that $t$ shares are needed to reconstruct the associated data key.
- A random key is chosen and used to encrypt the output from the detectable hash $r$, encrypted image data and the secret share. This ciphertext is called ($\texttt{rct}$)
- The hash is used to look up a value in the public distributed data (using a set of cuckoo hash functions and a cuckoo table containing public curve elements that are either random or specially constructed if the hash appears in $X$), and the value there is used to generate a specific set of curve points $(Q,S)$ from which a set of keys is derived and then encrypted with the random key. This happens twice with different hash functions and random values, such that at least one of the calculated pairs $(Q_i,S_i)$ is related ($S_i = \alpha \cdot Q_i$) only if the hash has an entry in the cuckoo table, and both are randomly independent otherwise@@^.
- Finally, a voucher is constructed containing the random id of the image and modified ciphertexts in a random order, and $\texttt{rct}$ and sent to the server.
<p class="sidenote">This is a simplification.</p>
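The "threshold share" step above is standard $t$-out-of-$n$ secret sharing. Here is a minimal Shamir sketch over a prime field in plain Python (the names `share_key`/`reconstruct` and the parameters are illustrative; the paper does not specify Apple's concrete share encoding, and $t=10$ below is purely an example value):

```python
import random

P = 2**127 - 1  # a Mersenne prime; the field just needs to be larger than the key

def share_key(secret, t, n):
    """Split secret so that any t of the n shares reconstruct it."""
    coeffs = [secret] + [random.randrange(P) for _ in range(t - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod P
            acc = (acc * x + c) % P
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange-interpolate the degree t-1 polynomial at x = 0."""
    secret = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = num * -xj % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

key = random.randrange(P)            # stand-in for the associated data key
shares = share_key(key, t=10, n=40)  # hypothetically, one share per voucher
```

Until the server has decrypted $t$ real shares, the key (and hence every encrypted image payload) remains information-theoretically hidden; with any $t$ of them, reconstruction is exact.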
If $\texttt{id} \in S$:
- A random key is used to encrypt all zeros instead of the image data; this is never decrypted
- The id of the image is permuted with a pseudorandom function. Instead of using the derived x-coordinate in the shard, a random value is used instead. And, instead of hashing the derived value with DHF, a random element in the output space of DHF is used instead.
- A random key is generated and used to encrypt the dummy data ($\texttt{rct}$)
- Choose $(Q_1,S_1)$ to satisfy the relation $S_1 = \alpha \cdot Q_1$ and $(Q_2,S_2)$ to be independently random, derive keys and encrypt with the random key.
- Finally, a voucher is constructed containing the id of the image, and these modified ciphertexts in a random order, and $\texttt{rct}$ and sent to the server.
### On the Server
For each voucher that arrives at the server:
- Store the id,
- For each of the 2 ciphertexts in the voucher:
- derive $S_j = \alpha \cdot Q_j$
- derive the random key by decrypting the ciphertext with a key derived from $S_j$
- attempt to decrypt $\texttt{rct}$ with the derived random key.
- If this fails then mark this attempt as false.
- If both attempts are marked as false then this is a non-match
- If either attempt is successful then the decrypted tuple is added to the set of known shares.
- (If both are successful then the client is being malicious, and the server is instructed to ignore it)
At this point some shares are "real", and some shares are "synthetic". The server does not ostensibly know which is which@@^.
<p class="sidenote">See <a href="/obfuscated_apples.html">Obfuscated Apples</a> for a probabilistic analysis which disputes this claim.</p>
If the size of this set exceeds the threshold then the server can use the detectable hash function to distinguish them.
## The Detectable Hash Function
Not much has been written about the new, proposed Detectable Hash Function, but it is the pin on which the privacy of this system ultimately rests (under the assumption that the rest of the crypto is sound).
A Detectable Hash Function (or (s,t)-DHF) is defined as a keyed hash function that takes in
a key and an element and outputs a value in some output space $R$, where $t$ is the threshold of the system. It is designed
to be used in systems that deliberately mix genuinely hashed values with randomly selected elements from the hash space ($R$).
There exists a **detection algorithm** which is deterministic and invoked as $D(v)$, where $v$ is a vector of elements in $R$. The algorithm outputs either a vector of true elements, or fails.
When at least $t+1$ entries in $v$ are generated by a DHF and at most $s$ are random then the algorithm should identify all generated elements i.e. it can distinguish random elements from hashed elements.
The paper constructs a hash function by defining the keys to be a sequence of polynomials of degree at most $t-1$ arranged
into an $s \times t$ matrix.
$$ DHF(k, x_0) := (x_0, p_1(x_0), \dots, p_s(x_0)) \in \mathbb{F}_{l}^{s+1} $$
Where $l$ is some large 64-bit number. This hash output is treated as a column vector of size $s+1$.
For random elements the distribution $(DHF(k,x_1), \dots, DHF(k,x_t))$ is random in $\mathbb{F}_{l}^{t}$
i.e. for every input $x$ we evaluate the sequence of key polynomials at it and derive a sequence of elements in $\mathbb{F}^{s+1}_{l}$.
The detection algorithm is given a vector of elements of $R$ which are then arranged and expanded into a larger vector, by taking the first element, $x_0$, to incrementing powers up to $t-1$, followed by the rest of the values from the hash output.
After the expansion, every vector is arranged into a matrix $M$ such that every row of $M$ contains at least $t+1$ evaluations of some polynomial of degree at most $t-1$. e.g.
$$
\begin{bmatrix}
1 & 1 & \dots & 1 \\
x_0 & r & \dots & x_{n} \\
x_0^2 & r^2 & \dots & x_{n}^2 \\
\vdots & \vdots & \ddots & \vdots & \\
x_0^{t-1} & r^{t-1} & \dots & x_{n}^{t-1} \\
p_0(x_0) & r_1 & \dots & p_0(x_n) \\
\vdots & \vdots & \ddots & \vdots & \\
p_s(x_0) & r_s & \dots & p_s(x_n)
\end{bmatrix}
$$
The kernel of this matrix is calculated, and the column indexes of $M$ that appear in the support of the kernel are the indexes of "real" matches, i.e. a vector $w$ such that $M \cdot w = 0$ can only be non-zero at positions that correspond to DHF columns.
Or rather, there is a relation between real hashes (they all contain evaluations of the key polynomial) that the synthetic
hashes do not have. This allows them to be distinguished.
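The whole detection pipeline is small enough to sketch end-to-end. The toy below is my own plain-Python sketch of the construction described above, not Apple's code: the parameters are illustrative and a 61-bit Mersenne prime stands in for the paper's 64-bit field:

```python
import random

L = (1 << 61) - 1  # a Mersenne prime; stand-in for the paper's 64-bit field

def keygen(s, t):
    # DHF key: s random polynomials of degree <= t-1 over F_L
    return [[random.randrange(L) for _ in range(t)] for _ in range(s)]

def poly(cs, x):
    acc = 0
    for c in reversed(cs):  # Horner evaluation mod L
        acc = (acc * x + c) % L
    return acc

def dhf(key, x):
    # DHF(k, x) = (x, p_1(x), ..., p_s(x)) in F_L^{s+1}
    return [x] + [poly(p, x) for p in key]

def expand(col, t):
    # (x, h_1, ..., h_s) -> (1, x, ..., x^{t-1}, h_1, ..., h_s)
    return [pow(col[0], i, L) for i in range(t)] + col[1:]

def kernel_support(M):
    # Union of supports of a right-kernel basis of M over F_L,
    # found via Gauss-Jordan elimination with mod-L inverses.
    rows, cols = len(M), len(M[0])
    A = [row[:] for row in M]
    pivots, r = [], 0
    for c in range(cols):
        piv = next((i for i in range(r, rows) if A[i][c]), None)
        if piv is None:
            continue
        A[r], A[piv] = A[piv], A[r]
        inv = pow(A[r][c], L - 2, L)
        A[r] = [v * inv % L for v in A[r]]
        for i in range(rows):
            if i != r and A[i][c]:
                fac = A[i][c]
                A[i] = [(v - fac * w) % L for v, w in zip(A[i], A[r])]
        pivots.append(c)
        r += 1
        if r == rows:
            break
    support = set()
    for f in (c for c in range(cols) if c not in pivots):
        support.add(f)  # each free column yields a kernel vector
        support |= {pc for i, pc in enumerate(pivots) if A[i][f]}
    return support

# Toy run: t+2 real hashes mixed with s synthetic (random) elements.
random.seed(7)
s, t = 3, 4
key = keygen(s, t)
cols = [(dhf(key, random.randrange(L)), True) for _ in range(t + 2)]
cols += [([random.randrange(L) for _ in range(s + 1)], False) for _ in range(s)]
random.shuffle(cols)
M = [[expand(c, t)[r] for c, _ in cols] for r in range(t + s)]
detected = kernel_support(M)
real = {i for i, (_, is_real) in enumerate(cols) if is_real}
```

With at least $t+1$ real hashes and at most $s$ random elements, the kernel of $M$ is (with overwhelming probability over the field) supported exactly on the real columns, so `detected` recovers precisely the real indices.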
Recall that for non-synthetic vouchers the client derives a hash of $\texttt{id}$ (or rather some random mapping of id into the input space of DHF).
Because of this, the detection algorithm allows the server to recover the identifiers of the actual secret shards used to reconstruct the associated data key. (Recall that the decrypted $\texttt{rct}$ contains $r = DHF(\texttt{hkey}, x\prime)$)
After this point if the number of "real" matches is over the threshold then the server can decrypt all associated data.
## Cute Algorithm meets World
This is a very cute algorithm, exploiting some nice linear algebra to recover the indices of real matches, and the rest is standard cryptography to do the actual threshold encryption.
The paper does document a couple of **real world considerations** which I will briefly revisit and then provide a few of my own.
**Multiple Client Devices**: in order to properly track matches across devices, every device needs to share the same set of keys.
If a user stores **duplicate images** that have the same hash but different random identifiers and the hash is matched in the database, then the 2 images will count twice towards the threshold. There is no mechanism in this protocol to prevent this, but the paper says it will be dealt with outside the protocol@@^<span class="sidenote">(presumably by the photo app, or the operating system, checking that duplicate images don't get assigned different identifiers...)</span>.
I have already documented a few issues with the overall probabilistic model in
[Obfuscated Apples](/obfuscated_apples.html) - the analysis there doesn't rely on any of the specific details of the protocol and instead focuses on how $t$ is chosen in the first place.
However, now that we have dived deeper, a number of other issues become clear:
First, the public cuckoo table that is distributed to clients is refreshed periodically under the assumption that vouchers generated with the old set of public data can be combined with vouchers generated from the new set. This would likely happen when new images (or rather their resulting hashes) are added to the backend database.
There is a small probability that an image that matches against one cuckoo table
does not match against another. That means that if a client ends up uploading
the same image identifier across the updates, then it could have different outcomes@@^.
From a general perspective, this doesn't change the security argument much, and the server is **supposed** to learn about matching files.
**However**, this also presents additional metadata to the system regarding the *use* of the files and **allows the server to distinguish a synthetic match from a real match in the case where one fails, and the other succeeds@@^**. <span class="sidenote"> Recall: Identifiers which trigger synthetic matches always succeed regardless of the set of images</span>.
<hr/>
This is not the only problem with parameter updates; the system also requires a trusted third party to verify that the cuckoo table is continually computed correctly. This trusted third party needs access to the original database of images, the hashing function *and* the cuckoo table setup@@^. Without this trusted third party, Apple could fill that table to match arbitrary hashes without anyone being able to verify.
Given that, a malicious server can learn whatever it wants about the client's dataset bounded only by some external-to-the-protocol
third party which both has to police what images are in the database and ensure that the server never distributes bad
public data to clients. This is the real policy question of the system and one that I think has already been covered extensively
elsewhere@@^<span class="sidenote">See: <a href="https://christopher-parsons.com/the-problems-and-complications-of-apple-monitoring-for-child-sexual-abuse-material-in-icloud-photos/">The Problems and Complications of Apple Monitoring for Child Sexual Abuse Material in iCloud Photos by Christopher Parsons.</a></span>.
<hr/>
Back to the technical, it is interesting that none of the analysis considers a malicious client violating the correctness
of the protocol. The reason given is "because there are many ways in which a malicious client can refuse to participate."
This is funny in and of itself because the sole intention of this system is to catch malicious people.
Attacks from malicious clients are an interesting academic consideration; the main one documented in the paper is the client generating too many synthetic matches, which drastically slows down the detection algorithm.
However, I also think it is worth considering a DoS attack in which the client attempts to generate "matches" that ultimately
decrypt to nonsense. This can be done by submitting vouchers for randomly generated hashes (as shown in [Obfuscated Apples](/obfuscated_apples.html), it doesn't take that many random photos to trigger false positives in most perceptual algorithms
given a large enough database to check against) - this attack could likely be conducted using the phone
hardware itself, maybe even through malware. There do not appear to be any defenses to this, and even though it is
a blind attack, the literature on adversarial exploitation of perceptual hashing algorithms is not on Apple's side
here.
After these deep dives I remain thoroughly unconvinced of the technical soundness of this system. I can't see how it
can uphold its privacy properties under normal operation; I think there are fundamental questions about how Apple is
choosing parameters (like $t$ and $S$) that significantly change the properties of the system; and I think there are
malicious avenues for exploiting this system even beyond the policy discussions circling in the media.
Time will tell.

125
posts/ftpsi-parameters.md Normal file
View File

@ -0,0 +1,125 @@
# Revisiting First Impressions: Apple, Parameters and Fuzzy Threshold PSI
Last week, Apple published additional information regarding the parameterization of their new Fuzzy Threshold
PSI system in the form of a Security Threat Model@@^.
<p class="sidenote"><a href="https://www.apple.com/child-safety/pdf/Security_Threat_Model_Review_of_Apple_Child_Safety_Features.pdf">Security Threat Model Review of Apples Child Safety Features</a></p>
Contained in the document are various answers to questions that the privacy community had been asking since the initial
announcement. It also contained information which answered several of my own questions, and in turn invalidated
a few of the assumptions I had made in a previous article@@^.
<p class="sidenote"><a href="/obfuscated_apples.html">Obfuscated Apples</a></p>
In particular, Apple have now stated the following:
* they claim the false acceptance rate of NeuralHash is 3 in 100M, but are assuming it is 1 in 1M. They have conducted
tests on both a dataset of 100M photos and on a dataset of 500K pornographic photos.
* the threshold $t$ they are choosing for the system is **30**, with a future option to lower it. They claim this is based
on taking the assumed false positive rate of NeuralHash and applying it to an assumed dataset the size of the largest iCloud photo library to obtain a probability of false reporting of 1 in a trillion.
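Apple's stated numbers can be sanity-checked with a few lines. Below is a binomial-tail calculation under their stated assumptions (false positive rate $10^{-6}$, $t = 30$); the function name `p_flag` and the library sizes are my own hypothetical stand-ins, since the size of the largest iCloud photo library is not public:

```python
from math import comb

def p_flag(p_fp, t, T, terms=200):
    """P(at least t false positives among T checked photos), X ~ Bin(T, p_fp).

    Summed term-by-term via successive ratios to avoid overflowing floats;
    `terms` truncates the tail, which is fine when T * p_fp << t.
    """
    if p_fp == 0.0:
        return 0.0
    term = comb(T, t) * p_fp**t * (1 - p_fp) ** (T - t)
    total = 0.0
    for x in range(t, min(t + terms, T)):
        total += term
        term *= (T - x) / (x + 1) * p_fp / (1 - p_fp)
    return total

# A hypothetical 5M-photo library stays well under 1-in-a-trillion at t = 30.
safe = p_flag(1e-6, 30, 5_000_000)
```

Under these assumptions a 5M-photo library flags with probability around $3 \times 10^{-14}$ in this sketch, while pushing the library size an order of magnitude higher crosses the 1-in-a-trillion line - presumably the gap the "drastic safety margin" language is meant to cover.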
One might ask: if the false acceptance rate of NeuralHash is so low, then why take such precautions when estimating
$t$?
I will give Apple the benefit of the doubt here under the assumption that they really are attempting to only catch
prolific offenders.
Even still, I believe the most recent information by Apple still leaves several unanswered questions, and raises several
more.
## On NeuralHash
To put it as straightforwardly as possible, 100.5M photos isn't that large a sample to compare a perceptual hashing
algorithm against, and the performance is directly related to the size of the comparison database (which we don't know).
Back in 2017 WhatsApp estimated that they were seeing 4.5 billion photos being uploaded to the platform per day@@^, while
we don't have figures for iCloud, we can imagine, given Apple's significant customer base, that it is on a similar order
of magnitude.
<p class="sidenote"><a href="https://blog.whatsapp.com/connecting-one-billion-users-every-day">Connecting One Billion Users Every Day - WhatsApp Blog</a></p>
The types of the photos being compared also matter. We know nothing about the 100.5M photos that Apple tested against,
and only that a small 500K sample was pornographic in nature. While NeuralHash seems to have been designed as a generic
image comparison algorithm, that doesn't mean that it acts on all images uniformly.
## On the Thresholds
> Since this initial threshold contains a drastic safety margin reflecting a worst-case assumption about real-world performance, we may change the threshold after continued empirical evaluation of NeuralHash false positive rates but the match threshold will never be lower than what is required to produce a one-in-one-trillion false positive rate for any given account - Security Threat Model Review of Apples Child Safety Features
Apple's initial value of $t = 30$ was chosen to include a **drastic safety margin**, and the threat model gives them
the explicit ability to change it in the future, though they promise the floor is still 1 in a trillion for "any given
account".
We still know very little about how $s$ will be chosen. We can assume it will be of the same order of magnitude as $t$ and that,
as such, the number of synthetics for each user will be relatively low compared to the total size of their image base.
Also, given that $t$ is fixed across all accounts, we can be relatively sure that $s$ will also be fixed across all accounts,
with only the probability of choosing a synthetic match being varied by some unknown function.
Note that, if the probability of synthetic matches is too high, then the detection algorithm@@^ fails with high probability,
requiring more matches and an extended detection procedure.
<p class="sidenote">As an aside, if you are interested in playing with the Detectable Hash Function yourself [I wrote a toy version of it](https://git.openprivacy.ca/sarah/fuzzyhash)</p>
## Threat Model Expansions
The new threat model includes new jurisdictional protections for the database that were not present in the original
description - namely that the **intersection** of two ostensibly independent databases managed by different agencies
in different national jurisdictions will be used instead of a single database@@^ <span class="sidenote">(such as the one
run by NCMEC)</span>.
Additionally, Apple have now stated they will publish a "Knowledge Base" containing root hashes of the encrypted
database such that it can be confirmed that every device is comparing images to the same database. It is worth
noting that this claim is only as good as security researchers having access to proprietary Apple code.
That such significant changes were made to the threat model a week after the initial publication is perhaps the
best testament to the idea, as Matthew Green put it:
> "But this illustrates something important: in building this system, the *only limiting principle* is how much heat Apple can tolerate before it changes its policies." - [Matthew Green](https://twitter.com/matthew_d_green/status/1426312939015901185)
## Revisiting First Impressions
I think the most important question I can ask of myself right now is this: if Apple had put out all these documents on
day one, would they have been enough to quell the voice inside my head?
Assuming that Apple also verified the false acceptance rate of NeuralHash in a way more verifiable than "we tested
it on some images, it's all good, trust us!" then I think many of my technical objections to this system would have been
answered.
Not all of them though. I still, for example, think that the obfuscation in this system is fundamentally flawed from a practical perspective. And, I still think that the threat model as applied to malicious clients undermines the rest of the system@@^
<p class="sidenote">See: [A Closer Look at Fuzzy Threshold PSI](/a_closer_look_at_fuzzy_threshold_psi.html) for more details.</p>
## It's About the Principles
And, of course, none of that quells my moral objections to such a system.
You can wrap that surveillance in any number of layers of cryptography to try and make it palatable, the end result is the same.
Everyone on Apple's platform is treated as a potential criminal, subject to continual algorithmic surveillance without warrant or cause.
If Apple are successful in introducing this, how long do you think it will be before the same is expected of other providers? Before walled gardens prohibit apps that don't do it? Before it is enshrined in law?@@^ <span class="sidenote"><a href="https://twitter.com/SarahJamieLewis/status/1423403656733290496">Tweet</a></span>
How long do you think it will be before the database is expanded to include "terrorist" content? "harmful-but-legal" content? state-specific censorship?
This is not a slippery slope argument. For decades, we have seen governments and corporations push for ever more surveillance.
It is obvious how this system will be abused. It is obvious that Apple will not be in control of how it will be
abused for very long.
Accepting client side scanning onto personal devices **is** a Rubicon moment: it signals a sea-change in how corporations
relate to their customers. Your personal device is no longer "yours" in theory, nor in practice. It can, and will, be used
against you.
It is also abundantly clear that this is going to happen. While Apple has come under pressure, it has responded
by painting critics as "confused" (which, if there is any truth in that claim, is due to their own lack of technical
specifications).
The media have likewise mostly followed Apple's PR lead. While I am thankful that we have answers to some
questions that were asked, and that we seem to have caused Apple to "clarify"@@^ <span class="sidenote">(or, less subtly, change)</span> their own threat model, we have not seen the outpouring of objection that would have been necessary to
shut this down before it spread further.
The future of privacy on consumer devices is now forever changed. The impact might not be felt today or tomorrow, but in
the coming months please watch for the politicians (and sadly, the cryptographers) who argue that what can be done for
CSAM can be done for the next harm, and the next harm. Watch the EU and the UK, among others, declare such scanning mandatory,
and watch as your devices cease to work for you.


@ -64,7 +64,7 @@ We also know that Apple has constructed these parameters such that the probabili
We can actually work backwards from that number to derive $P(\texttt{falsepositive})$:
$$P(\texttt{flag}) = \sum_{\substack{x = t}}^T {T \choose x} \cdot P(\texttt{falsepositive})^x \cdot P(\texttt{falsepositive})^{T - x} \approx 1\mathrm{e}^{-12}$$
$$P(\texttt{flag}) = \sum_{\substack{x = t}}^T {T \choose x} \cdot P(\texttt{falsepositive})^x \cdot (1-P(\texttt{falsepositive}))^{T - x} \approx 1\mathrm{e}^{-12}$$
In order to finalize this we only need to make educated guesses about 2 parameters: the threshold value, $t$, and the total
number of photos checked per year, $T$. Apple throws out the number $t = 10$ in their technical summary, which seems

ssb

@ -51,7 +51,8 @@ function get_posts
DEPTH_LIMITER="-maxdepth 1"
fi
find $POSTS_DIR $DEPTH_LIMITER -type f -name "*.md"
find $POSTS_DIR $DEPTH_LIMITER -type f -name "*.md" -printf "%T@\t%p\n" | sort -nr | cut -c23-
}
@ -71,9 +72,9 @@ function append_posts_list
posts_list="</article><hr/><h2>Recent Articles</h2>"
for post in $@; do
file_base=`basename $post .md`
date=`get_mod_date "$post"`
post_title=`grep -m 1 "^# .*" $post | cut -c 3-`
post_link="$date [$post_title]($file_base.html)<br>\n"
date=`get_mod_date "$post"`
post_link="<em>$date</em> [$post_title]($file_base.html)<br>\n"
posts_list="$posts_list$post_link"
done
echo $posts_list | sort -r


@ -34,25 +34,19 @@ article {
.footnotelabel::after {
content: '[' counter(footnotelabel) ']'; /* 1 */
vertical-align: super; /* 2 */
font-size: 0.5em; /* 3 */
font-size: 0.7em; /* 3 */
margin-left: 2px; /* 4 */
color: #aaa; /* 5 */
color: #ddd; /* 5 */
}
.sidenote {
content: '[' counter(footnotes) ']'; /* 1 */
vertical-align: super; /* 2 */
font-size: 0.5em; /* 3 */
margin-left: 2px; /* 4 */
color: #aaa; /* 5 */
}
.sidenote::before {
content: '[' counter(footnotes) ']'; /* 1 */
content: counter(footnotes) ; /* 1 */
vertical-align: super; /* 2 */
font-size: 0.5em; /* 3 */
font-size: 0.7em; /* 3 */
margin-left: 2px; /* 4 */
color: #aaa; /* 5 */
color: #fff; /* 5 */
}
.sidenote {