fuzzytags/ANONYMITY.md

4.9 KiB

Integrating FuzzyTags

The properties provided by this system are highly dependent on selecting a false positive rate p. In the following sections we will cover a number of considerations you should take into account when integrating fuzzytags into a larger privacy preserving application.

How bad is it to let people select their own false-positive rates?

The short answer is "it depends".

The longer answer:

When different parties have different false positive rates the server can calculate the skew between a party's ideal false positive rate and observed false positive rate.

That skew leaks information, especially given certain message distributions. Specifically it leaks parties who receive a larger proportion of system messages than their ideal false positive rate.

i.e. for low false positive rates and high message volume for a specific receiver, the adversarial server can calculate a skew that leaks the recipient of individual messages - breaking privacy for that receiver.

It also removes those messages from the pool of messages that an adversarial server needs to consider for other receivers. Effectively reducing the anonymity set for everyone else.

Which brings us onto:

Differential Attacks

Any kind of differential attacks break this scheme, even for a small number of messages i.e. if you learn (through any means) that a specific set of messages are all likely for 1 party, you can diff them against all other parties keys and very quickly isolate the intended recipient - in simulations of 100-1000 parties it can take as little as 3 messages - even with everyone selecting fairly high false positive rates.

The corollary of the above being that in differential attacks your anonymity set is basically the number of users who download all messages - since you can't diff them. This has the interesting side effect: the more parties who download everything, the more the system can safely tolerate parties with small false-positive rates.

To what extent you can actually account for this in your application is an open question.

Statistical Attacks

Using some basic binomial probability we can use the false positive rate of reach receiver tag to calculate the probability of matching on at least X tags given the false positive rate. Using this we can find statistically unlikely matches e.g. a low-false positive key matching many tags in a given period.

This can be used to find receivers who likely received messages in a given period.

If it is possible to group tags by sender then we can perform a slightly better attack and ultimately learn the underlying social graph with fairly low false positive rates (in simulations we can learn 5-10% of the underlying connections with between 5-12% false positive rates.)

For more information on statistical attacks please check out our fuzzytags simulator.

Should Senders use an anonymous communication network?

If statistical & differential attacks are likely e.g. few parties download everything and multiple messages are expected to originate from a sender to a receiver or there is other information that might otherwise link a set of messages to a sender or receiver then you may want to consider how to remove that context.

One potential way of removing context is by having senders send their message to the server through some kind of anonymous communication network e.g. a mixnet or tor.

Be warned: This may not eliminate all the context!

How bad is it to select a poor choice of p?

Consider a pareto distribution where most users only receive a few messages, and small subset of users receive a large number of messages it seems that increasing the number of parties is generally more important to overall anonymity of the system than any individual selection of p.

Under a certain threshold of parties, trivial breaks (i.e. tags that only match to a single party) are a bigger concern.

Assuming we have large number of parties (N), the following heuristic emerges:

  • Parties who only expect to receive a small number of messages can safely choose smaller false positive rates, up to a threshold θ, where θ > 2^-N. The lower the value of θ the greater the possibility of random trivial breaks for the party.
  • Parties who expect a large number of messages should choose to receive all messages for 2 reasons:
    1. Even high false positive rates for power users result in information leaks to the server (due to the large skew) i.e. a server can trivially learn what users are power users.
    2. By choosing to receive all messages, power users don't sacrifice much in terms of bandwidth, but will provide cover for parties who receive a small number of messages and who want a lower false-positive rate.

(We consider a pareto distribution here because we expect many applications to have parties that can be modelled as such - especially over short-time horizons)