Initial Commit

trunk
Sarah Jamie Lewis 2 years ago
commit 3e6db50163

1
.gitignore vendored

@ -0,0 +1 @@
book

@ -0,0 +1,11 @@
[book]
authors = ["Sarah Jamie Lewis"]
language = "en"
multilingual = false
src = "src"
title = "FuzzyTags: The Book"
[output.html]
git-repository-url = "https://git.openprivacy.ca/openprivacy/fuzzytags"
[preprocessor.katex]

@ -0,0 +1,9 @@
# Summary
- [Introduction](./chapter_1.md)
- [Terminology](./terminology.md)
- [Deploying Fuzzytags Securely](./deploying-fuzzytags.md)
- [Entangled Tags](./entangled.md)
- [Simulating fuzzytags](./simulations.md)
- [Email EU Core Dataset Simulations](./simulation-eu-core-email.md)
- [College IM Dataset Simulations](./simulation-college-im.md)

@ -0,0 +1,57 @@
In this short book we, Open Privacy, will document our investigations into [Fuzzy Message Detection](https://eprint.iacr.org/2021/089),
including extensions to the proposed scheme and simulation results when modelling realistic deployments.
# Introduction to Fuzzytags
Anonymous messaging systems (and other privacy-preserving applications) often require a mechanism for one party
to learn that another party has messaged them.
Many schemes rely on a bandwidth-intensive "download everything and attempt-decryption" approach. Others rely on a trusted
third party, or on non-collusion assumptions, to provide a "private" service.
It would be awesome if we could get an **untrusted**, **adversarial** server to do the work for us without compromising metadata-resistance!
**fuzzytags** is an experimental probabilistic cryptographic tagging structure to do just that!
Specifically **fuzzytags** provides the following properties:
* Correctness: Valid tags constructed for a specific public key will always validate when tested using a derived detection key.
* Fuzziness: Tags will produce false positives with probability _p_ related to the security property (_γ_) when tested against detection keys they
were not intended for.
* Security: An adversarial server with access to the detection key **is unable to distinguish false positives from true positives**. (Detection Ambiguity)
## Formally
For an in-depth analysis see the original paper, [Fuzzy Message Detection](https://eprint.iacr.org/2021/089), by Gabrielle Beck, Julia Len, Ian Miers, and Matthew Green.
Note that the paper uses multiplicative notation; throughout this book we use additive notation.
All parties in the fuzzy tag system derive an ordered set of $\gamma$ secret keys $\vec{x}$ and a public key $\vec{X} = \{x_i\mathbf{G} : \forall x_i \in \vec{x}\}$
A fuzzytag consists of:
* a ciphertext, $\vec{c}$, of length $\gamma$,
* a group element: $\mathbf{U} = r\mathbf{G}$, where $r$ is a random scalar.
* and a scalar: $y = r^{-1} \times (z-m)$, where $z$ is another random scalar and $m$ is a scalar derived from the
output of a hash function over $\mathbf{U}$ and $\vec{c}$.
The ciphertext is obtained by constructing a key $\vec{k} = \{k_i = H(\mathbf{U} || r\mathbf{X_i} || w) : \forall \mathbf{X_i} \in \vec{X}\}$ and
calculating $\vec{c} = \vec{k} \oplus \vec{1}$.
Checking a tag is done over a detection key $\vec{d}$ of length $n$, where $\vec{d} = \vec{x}_{0 \ldots n}$. This detection
key can be given to an adversarial server to perform filtering.
$\vec{k}$ is recalculated by first deriving $w$ using $\mathbf{U}$ and $y$, and then calculating
$\{k_i = H(\mathbf{U} || x_i\mathbf{U} || w) : \forall x_i \in \vec{d}\}$. The plaintext is then recovered as $\vec{p} = \vec{k} \oplus \vec{c}$ and compared to $\vec{1}$.
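To see why a valid tag always passes this check, the following short derivation may help. It assumes, following the construction in the original paper, that the sender sets $w = z\mathbf{G}$ and that the verifier recomputes $w$ as $m\mathbf{G} + y\mathbf{U}$, where $m$ is the hash of $\mathbf{U}$ and $\vec{c}$; treat both formulas as assumptions taken from the paper rather than statements about any particular implementation.

$$
\begin{aligned}
m\mathbf{G} + y\mathbf{U} &= m\mathbf{G} + r^{-1}(z - m)\,r\mathbf{G} \\
&= m\mathbf{G} + (z - m)\mathbf{G} \\
&= z\mathbf{G} = w
\end{aligned}
$$

Similarly, $x_i\mathbf{U} = x_i r\mathbf{G} = r\mathbf{X_i}$, so the verifier hashes exactly the same inputs as the sender and recovers the same $k_i$ for every position covered by the detection key. The recovered plaintext is then $k_i \oplus c_i = k_i \oplus k_i \oplus 1 = 1$ at each of those positions.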
## Additional Checks
We perform the following additional checks when verifying a tag.
* Discard tags that would validate for every single public key:
  * We assert that $\mathbf{U}$ is not the identity element.
  * We assert that $y \neq 0$.
* Reject tags intended for different setups:
  * We include $\gamma$ in the input to both the $H$ and $G$ hash functions.
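The construction and the checks above can be sketched end to end as follows. This is a toy illustration under stated assumptions, not the fuzzytags implementation: the group is a tiny subgroup of the integers modulo 2039 (chosen only so the example runs with the Python 3.8+ standard library, and far too small to be secure), `smul` and `padd` are hypothetical helpers that mimic the additive notation, SHA-256 stands in for both $H$ and $G$, and the formulas $w = z\mathbf{G}$ (sender) and $w = m\mathbf{G} + y\mathbf{U}$ (verifier) are taken from the original paper.

```python
import hashlib
import secrets

# Toy parameters (assumption): the order-Q subgroup of Z_P^*, with P = 2Q + 1.
# Not secure; a real deployment would use an elliptic-curve group.
P, Q, G = 2039, 1019, 4


def smul(k, point):
    """Scalar multiplication k*Point in additive notation (pow mod P here)."""
    return pow(point, k, P)


def padd(a, b):
    """Group addition A + B (multiplication mod P in this toy group)."""
    return (a * b) % P


def _digest(label, *parts):
    data = label + b"|" + b"|".join(str(p).encode() for p in parts)
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def hash_bit(gamma, U, shared, w):
    """Stand-in for H: one key bit. gamma is hashed to bind the tag to its setup."""
    return _digest(b"H", gamma, U, shared, w) & 1


def hash_scalar(gamma, U, c):
    """Stand-in for G: the scalar m derived from U and the ciphertext."""
    return _digest(b"G", gamma, U, *c) % Q


def keygen(gamma):
    """Return (secret scalars x, public elements X = x_i * G)."""
    x = [secrets.randbelow(Q - 1) + 1 for _ in range(gamma)]
    return x, [smul(xi, G) for xi in x]


def make_tag(X):
    gamma = len(X)
    while True:
        r = secrets.randbelow(Q - 1) + 1
        z = secrets.randbelow(Q - 1) + 1
        U, w = smul(r, G), smul(z, G)  # w = z*G is an assumption from the paper
        c = [hash_bit(gamma, U, smul(r, Xi), w) ^ 1 for Xi in X]
        m = hash_scalar(gamma, U, c)
        y = (pow(r, -1, Q) * (z - m)) % Q
        if y != 0:  # never emit a tag that check_tag would discard
            return U, y, c


def check_tag(d, tag, gamma):
    """Test a tag against detection key d (the first n secret scalars)."""
    U, y, c = tag
    # Additional checks: U must not be the identity (1 here) and y must be non-zero.
    if U == 1 or y == 0 or len(c) != gamma:
        return False
    m = hash_scalar(gamma, U, c)
    w = padd(smul(m, G), smul(y, U))
    return all((hash_bit(gamma, U, smul(xi, U), w) ^ ci) == 1 for xi, ci in zip(d, c))


root_secret, tagging_key = keygen(gamma=8)
tag = make_tag(tagging_key)
print(check_tag(root_secret[:3], tag, gamma=8))  # detection key of length 3
```

Because the detection key is a prefix of the secret scalars, a shorter prefix tests fewer ciphertext positions and therefore matches more unrelated tags.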

@ -0,0 +1,86 @@
## Deploying FuzzyTags Securely
The properties provided by this system are highly dependent on selecting a false positive rate _p_. In the following
sections we will cover a number of considerations you should take into account when integrating fuzzytags into a larger
privacy preserving application.
### How bad is it to let people select their own false-positive rates?
The short answer is "it depends".
The longer answer:
When different parties have different false positive rates the server can calculate the skew between a party's ideal
false positive rate and observed false positive rate.
That skew leaks information, especially given certain message distributions. Specifically, it exposes parties
who receive a larger proportion of system messages than their ideal false positive rate would predict.
That is, for a receiver with a low false positive rate and a high message volume, the adversarial server
can calculate a skew that reveals the recipient of individual messages, breaking privacy for that receiver.
It *also* removes those messages from the pool of messages that an adversarial server needs to consider for other receivers,
effectively reducing the anonymity set for everyone else.
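A back-of-the-envelope model of that skew is sketched below. It assumes that false positives are independent, that every genuine message always matches, and that the server knows each party's ideal rate _p_; the function names are illustrative only.

```python
def expected_matches(total_messages, true_messages, p):
    """Matches a detection key sees: every true message plus false positives."""
    return true_messages + p * (total_messages - true_messages)


def estimate_true_messages(observed_matches, total_messages, p):
    """Invert the expectation above to estimate how many matches were genuine."""
    if not 0 <= p < 1:
        raise ValueError("p must be in [0, 1)")
    return (observed_matches - p * total_messages) / (1 - p)


# A receiver with p = 0.5 who truly receives 100 of 1000 messages.
matches = expected_matches(1000, 100, 0.5)
observed_rate = matches / 1000  # compare this against the ideal rate p
print(matches, observed_rate, estimate_true_messages(matches, 1000, 0.5))
```

The smaller _p_ is relative to the receiver's true share of traffic, the larger the gap between the observed and ideal rates, and the more confidently the server can attribute individual matches.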
Which brings us onto:
### Differential Attacks
Any kind of differential attack breaks this scheme, even for a small number of messages. If you learn (through
any means) that a specific set of messages is likely intended for one party, you can diff them against all other parties' keys and
very quickly isolate the intended recipient. In simulations of 100-1000 parties it can take as few as 3 messages, even
with everyone selecting fairly high false positive rates.
The corollary is that, under differential attacks, your anonymity set is basically the number of users
who download all messages, since you can't diff them. This has an interesting side effect: the more parties who
download everything, the more the system can safely tolerate parties with small false-positive rates.
To what extent you can actually account for this in your application is an open question.
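The intersection step behind a differential attack can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the fuzzytags simulator: no real tags are generated, every party uses the same rate `p`, and false positives are modelled as independent coin flips.

```python
import random


def differential_attack(n_parties, recipient, p, n_messages, seed=0):
    """Intersect the match sets of messages known to share one recipient.

    Returns the surviving candidates and the candidate count after each message.
    """
    rng = random.Random(seed)
    candidates = set(range(n_parties))
    history = []
    for _ in range(n_messages):
        # The true recipient always matches; everyone else matches with probability p.
        matched = {
            party for party in candidates
            if party == recipient or rng.random() < p
        }
        candidates &= matched
        history.append(len(candidates))
    return candidates, history


candidates, history = differential_attack(n_parties=1000, recipient=42, p=0.5, n_messages=40)
print(history)
```

Each linked message removes roughly a fraction `1 - p` of the remaining wrong candidates, which is why only a handful of messages are needed when rates are low. Parties who download everything never appear in a match set at all, so they cannot be eliminated this way.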
### Statistical Attacks
Using some basic binomial probability we can use the false positive rate of each receiver's detection key to calculate
the probability of it matching at least X tags in a given period. Using this we can find statistically
unlikely matches, e.g. a low false positive key matching many tags in a given period.
This can be used to find receivers who likely received messages in a given period.
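The binomial calculation can be sketched as below (Python 3.8+ for `math.comb`). The function names are illustrative, and the default cutoff of 0.0001 is an assumption borrowed from the threshold quoted in the simulation chapters; it may not be applied there in exactly this way.

```python
from math import comb


def prob_at_least(n_tags, k_matches, p):
    """P[X >= k] for X ~ Binomial(n_tags, p): chance of k or more false positives."""
    return sum(
        comb(n_tags, i) * p**i * (1 - p) ** (n_tags - i)
        for i in range(k_matches, n_tags + 1)
    )


def is_suspicious(n_tags, k_matches, p, threshold=0.0001):
    """Flag a detection key whose match count is very unlikely to be pure chance."""
    return prob_at_least(n_tags, k_matches, p) < threshold


# A key with p = 2^-7 matching 20 of 100 tags is far from what chance predicts.
print(is_suspicious(100, 20, 2**-7))
```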
If it is possible to group tags by sender then we can perform a slightly better attack and ultimately learn the
underlying social graph with fairly low false positive rates (in simulations we can learn 5-10% of the underlying
connections with between 5-12% false positive rates.)
For more information on statistical attacks please check out our [fuzzytags simulator](https://git.openprivacy.ca/openprivacy/fuzzytags-sim).
### Should Senders use an anonymous communication network?
If statistical and differential attacks are likely (e.g. few parties download everything and
multiple messages are expected to originate from a sender to a receiver), or if there
is other information that might otherwise link a set of messages to a sender or receiver, then you may want to consider how
to remove that context.
One potential way of removing context is by having senders send their messages to the server through some kind of anonymous
communication network, e.g. a mixnet or Tor.
Be warned: This may not eliminate all the context!
### How bad is it to select a poor choice of _p_?
Consider a _Pareto distribution_ where most users only receive a few messages, and a small subset of users
receive a large number of messages. In that setting it seems that increasing the number of parties is
generally more important to the overall anonymity of the system than any individual selection of _p_.
Under a certain threshold of parties, trivial breaks (i.e. tags that only match a single party) are a bigger concern.
Assuming we have a large number of parties (_N_), the following heuristic emerges:
* Parties who only expect to receive a small number of messages can safely choose smaller false positive rates, up
to a threshold _θ_, where _θ > 2^-N_. The lower the value of _θ_ the greater the possibility of random trivial breaks for
the party.
* Parties who expect a large number of messages should choose to receive **all** messages for 2 reasons:
  1. Even high false positive rates for power users result in information leaks to the server (due to the large
     skew), i.e. a server can trivially learn which users are power users.
  2. By choosing to receive all messages, power users don't sacrifice much in terms of bandwidth, but will provide
     cover for parties who receive a small number of messages and who want a lower false-positive rate.

(We consider a Pareto distribution here because we expect many applications to have parties that can be
modelled as such, especially over short time horizons.)
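To get a feel for why the number of parties matters so much for trivial breaks, here is a minimal sketch. It assumes every other party uses the same false positive rate `p` and that matches are independent, which real deployments will not satisfy exactly; the function name is illustrative.

```python
def prob_trivial_break(n_parties, p):
    """Chance that a tag matches no detection key other than the recipient's."""
    return (1 - p) ** (n_parties - 1)


# The same rate gives very different protection as the system grows.
for n in (10, 100, 1000):
    print(n, prob_trivial_break(n, 2**-7))
```

Under this model the probability of a trivial break decays exponentially in the number of parties, so adding parties helps every receiver at once, while raising one party's _p_ only helps in proportion to that party's share of matches.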

@ -0,0 +1,47 @@
# Entangled Tags
It is possible to generate fuzzytags for a given $\gamma$ that are valid for 2 or more parties
at the same time.
This is done through $\texttt{FlagEntangled}$, a function that takes in a vector of tagging keys and runs the
$\texttt{Flag}$ function, as documented in the original paper, for each of them (with the same $r$ and $z$) until it
finds a $z$ that will generate the same tag for every tagging key.
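A minimal sketch of this search is shown below. It is an illustration under assumptions rather than the library's $\texttt{FlagEntangled}$: it uses a toy, insecure group (a small subgroup of the integers modulo 2039), SHA-256 as a stand-in for both hash functions, the paper's formulas $w = z\mathbf{G}$ and $w = m\mathbf{G} + y\mathbf{U}$, and it resamples both $r$ and $z$ on every attempt for simplicity. The `check_len` parameter is hypothetical; limiting the comparison to a prefix corresponds to the forging variant described later in this chapter.

```python
import hashlib
import secrets

# Toy group (assumption): order-Q subgroup of Z_P^*, P = 2Q + 1. Not secure.
P, Q, G = 2039, 1019, 4


def smul(k, point):
    """Scalar multiplication in additive notation (pow mod P in this toy group)."""
    return pow(point, k, P)


def _digest(label, *parts):
    data = label + b"|" + b"|".join(str(p).encode() for p in parts)
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


def key_bit(gamma, U, shared, w):
    return _digest(b"H", gamma, U, shared, w) & 1


def keygen(gamma):
    x = [secrets.randbelow(Q - 1) + 1 for _ in range(gamma)]
    return x, [smul(xi, G) for xi in x]


def flag_entangled(tagging_keys, check_len=None):
    """Search for one tag whose key bits agree across all tagging keys."""
    gamma = len(tagging_keys[0])
    check_len = gamma if check_len is None else check_len
    first, others = tagging_keys[0], tagging_keys[1:]
    while True:
        r = secrets.randbelow(Q - 1) + 1
        z = secrets.randbelow(Q - 1) + 1
        U, w = smul(r, G), smul(z, G)
        k = [key_bit(gamma, U, smul(r, Xi), w) for Xi in first]
        entangled = all(
            key_bit(gamma, U, smul(r, X[i]), w) == k[i]
            for X in others
            for i in range(check_len)
        )
        if not entangled:
            continue  # some party disagrees on a checked bit; try new randomness
        c = [ki ^ 1 for ki in k]
        m = _digest(b"G", gamma, U, *c) % Q
        y = (pow(r, -1, Q) * (z - m)) % Q
        if y != 0:
            return U, y, c


def check_tag(d, tag, gamma):
    U, y, c = tag
    if U == 1 or y == 0:
        return False
    m = _digest(b"G", gamma, U, *c) % Q
    w = (smul(m, G) * smul(y, U)) % P
    return all((key_bit(gamma, U, smul(xi, U), w) ^ ci) == 1 for xi, ci in zip(d, c))


bob_x, bob_X = keygen(6)
carol_x, carol_X = keygen(6)
tag = flag_entangled([bob_X, carol_X])
print(check_tag(bob_x, tag, 6), check_tag(carol_x, tag, 6))
```

A small $\gamma$ is used here because the search cost grows exponentially with the number of bits that must agree.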
## Multiparty Broadcast
Alice wants to send a message to Bob and Carol. She constructs a single tag that will validate against detection keys generated by both of them.
When an adversarial server matches the tag against all the keys it knows about it will discover that the tag matches both Bob and Carol (in addition to some number of false positives depending on the false positive rates of all the other parties using the server).
To construct such a tag Alice runs $\texttt{FlagEntangled(}\{X_{\text{bob}}, X_{\text{carol}}\}\texttt{)}$.
The adversarial server will match the tag to the detection keys of both Bob and Carol. The server has no way of determining if the match is a broadcast to both parties, a unique message to one of Bob or Carol or a false positive for both.
## Deniable Sending
Alice wants to send a message to Carol, but is concerned that Carol may have a detection key with too low a false positive rate. Alice knows of a set of parties (and their public keys) who also use the adversarial server to send private messages. Alice searches for a tag that will validate against detection keys generated not only by Carol but also by a randomly selected party, e.g. Eve.
When an adversarial server matches the tag against all the keys it knows about it will discover that the tag matches both Carol and Eve (in addition to some number of false positives depending on the false positive rates of all the other parties using the server).
Even if the server was to isolate this specific message as originating from Alice, they would not be able to derive the recipient through any kind of differential attack (as all attacks would also implicate Eve).
To construct such a tag Alice runs $\texttt{FlagEntangled(}\{X_{\text{carol}}, X_{\text{eve}}\}\texttt{)}$.
Alice could choose to entangle all of her messages to Carol in this way, fully implicating Eve in her message sending regardless of Eve's false positive rate. If Eve attempted to decrypt the message she would not be able to and might assume that the tag was an unlikely false positive - as such too many of these messages might cause Eve to be suspicious. However, Eve might be a well known service or bot integrated with the privacy preserving application - allowing Alice cover without worrying about triggering suspicion.
Alice could also choose to entangle each message with a different random party.
While this strategy is, by itself, vulnerable to intersection attacks, it increases the number of potential relationships any adversary needs to rule out in order to derive the resulting metadata from the communication.
When combined with a large number of parties downloading all messages (or even downloading with high false positive rates) this strategy has the effect of increasing the anonymity set of the entire system.
It is worth noting at this point that strategies can be combined, and their effects compound. When given a tag and a set of matches, an adversarial server cannot distinguish between a true and false positive, an entangled deniable send or a group broadcast - or a combination!
## Forging False Positives
Alice wants to send a message to Carol, but also wants to implicate Eve to the server. However, Alice doesn't have enough time or computing power to generate a tag that will match against Eve's full $\gamma$-length key.
Instead Alice forges an entangled tag by running $\texttt{FlagEntangled(}\{X_{\text{carol}}, X_{\text{eve}}\}\texttt{)}$. However, instead of checking all parts of the key at line $11$, she only checks up to a length $l$ that she believes is greater than or equal to the length of Eve's detection key (the length that determines Eve's false positive rate).
To the server the tag would match the detection keys of both Carol and Eve, but when fetched Eve would assume it was a false positive (a much more likely one than in our previous example).
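The saving from checking only $l$ positions can be estimated with a short calculation. The sketch below assumes the hash behaves like independent fair coin flips, so each additional party agrees with the first on a given bit with probability 1/2; the function name and the values 24 and 8 are illustrative, not parameters taken from the text.

```python
def expected_attempts(n_parties, bits_checked):
    """Expected number of candidates to try before all parties agree on the checked bits."""
    return 2 ** (bits_checked * (n_parties - 1))


full = expected_attempts(n_parties=2, bits_checked=24)   # match a full 24-bit key
forged = expected_attempts(n_parties=2, bits_checked=8)  # match only an 8-bit prefix
print(full, forged, full // forged)
```

Under this model every bit Alice skips halves her expected work for each extra entangled party, at the cost of the tag failing against the skipped positions of Eve's full key.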

Binary file not shown (image, 6.8 MiB).

Binary file not shown (image, 34 MiB).

@ -0,0 +1,29 @@
In this section we will document simulations performed on the College Msg Core dataset (details below). In particular, we assess the worst-case scenario of a server with access to a sender-oracle (i.e. able to attribute tags to a
particular sender) to understand how much information is leaked by fuzzytags without [appropriate deployment mitigations.](./deploying-fuzzytags.md)
# College IM Dataset Simulations
Nodes: 1899
Temporal Edges: 59835
Time span: 193 days
Citation: Pietro Panzarasa, Tore Opsahl, and Kathleen M. Carley. "Patterns and dynamics of users' behavior and interaction: Network analysis of an online community." Journal of the American Society for Information Science and Technology 60.5 (2009): 911-932.
## Scenario 1
Setup: 20k events (7330 links). False positive rates: [0.007812, 0.5]. No entangling.
Result: Server can identify ~4.3% of original graph (313 links) with a 12% false positive rate at threshold: 0.0001.
## Scenario 2
Setup: 20k events (7330 links). False positive rates: [0.007812, 0.5]. Every tag entangled to one random node (as before).
Result: Server can identify ~3.95% of original graph (290 links) with a ~15% false positive rate.
# Discussion
This is a very similar result to our observations on the EU Core email dataset: entangled tags increase the false positive
rate, but overall it requires non-naive entangling strategies to push the false positive rate of the derived graph
to a place where it would not be useful for an adversary.

@ -0,0 +1,29 @@
In this section we will document simulations performed on the Email EU Core dataset (details below). In particular,
we assess the worst-case scenario of a server with access to a sender-oracle (i.e. able to attribute tags to a
particular sender) to understand how much information is leaked by fuzzytags without [appropriate deployment mitigations.](./deploying-fuzzytags.md)
# Email EU Core Dataset Simulations
Nodes: 1004
Temporal Edges: 332334
Time span: 803 days
Citation: Ashwin Paranjape, Austin R. Benson, and Jure Leskovec. "Motifs in Temporal Networks." In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, 2017.
## Scenario 1
Setup: 1 month of email events between 1004 nodes, 20k events (5148 links). False positive rates: \[0.007812, 0.5\]. No entangling.
Result: An adversarial server can identify ~7% of original graph (393 links) with a 6% false positive rate. Threshold: 0.0001
## Scenario 2
Setup: The same month of emails, 20k events (5148 links). Same false positive rates. Every tag is entangled with 1 random node.
Result: Server can identify ~6.6% of original graph with a 6.8% false positive rate.
# Discussion
Entanglement seems to have some impact on the server's ability to relearn the social graph; in particular,
it increases the false positive rate of the derived graph. However, in the observed simulation this impact is not
large enough to stop the server from recovering a similar share of the graph.

Binary file not shown (image, 553 KiB).

@ -0,0 +1,7 @@
# Simulating fuzzytags
We also provide simulation software to experiment with various deployments of
fuzzytags given certain system parameters like the number of parties, a model of
message sending, and the rate of entangled tags.
![](./simulation.png)

Binary file not shown (image, 6.8 MiB).

Binary file not shown (image, 34 MiB).

@ -0,0 +1,28 @@
# Terminology
Throughout this book, the fuzzytags API, and its documentation we use the following terms, some of which
deviate from and/or extend definitions in the original paper.
## **Tag**
A probabilistic structure that can be attached to a message in order to identify the recipient. The focus of the
fuzzytags system.
## **Root Secret**
A privately generated set of secret scalars. Randomly generated.
Not to be confused with "secret/private key" in larger system integrations.
## **Tagging Key**
A publicly distributed set of group elements used to construct a tag for a given party. Derived from the Root Secret.
Not to be confused with "public key" in larger system integrations.
## **Detection Key**
A semi-public subvector of the Root Secret that is provided to an adversarial server in order to outsource identification
of tags (with some pre-determined false positive rate).
Not to be confused with "verification" key in larger system integrations.
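The relationship between these three terms can be sketched in a few lines. This assumes, as in the original paper, that a detection key made of the first $n$ scalars of the root secret yields a false positive rate of $2^{-n}$ (consistent with the rates 0.5 and 0.007812 used in the simulation chapters); the function names are illustrative and not the fuzzytags API.

```python
def detection_key(root_secret, n):
    """Subvector handed to the server: the first n scalars of the root secret."""
    if not 0 <= n <= len(root_secret):
        raise ValueError("n must be between 0 and the length of the root secret")
    return root_secret[:n]


def false_positive_rate(n):
    """Assumed rate for an n-scalar detection key: each scalar halves the rate."""
    return 2.0 ** -n


root_secret = [11, 22, 33, 44, 55, 66, 77, 88]  # placeholder scalars, not real key material
for n in (1, 3, 7):
    print(len(detection_key(root_secret, n)), false_positive_rate(n))
```

A longer detection key gives the server more of the root secret and filters more precisely, while a shorter one reveals less and produces more cover traffic.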