Principles of User Privacy (PUP)

Privacy is an essentially contested concept [[?PRIVACY-CONTESTED]]. Its debated meaning renders its support problematic in the context of a standards-setting process grounded in consensus [[?W3C-PROCESS]], in seeking out technical solutions grounded in shared requirements, and in addressing the needs of a worldwide constituency. This document provides definitions for privacy and related concepts that are suitable for a global audience, that can provide building blocks for privacy threat modelling, and that can guide the development of the Web as a trustworthy platform. In the spirit of building a much-needed bridge between technology and policy, this document is written under the expectation that it can apply to both.

Definitions

This section provides a number of elementary building blocks from which to establish a shared understanding of privacy. Some of the definitions below build atop the work in Tracking Preference Expression (DNT) [[tracking-dnt]].

People & Data

A user (also person or data subject) is any natural person.

We define personal data as any information relating to a [=person=] such that:

this [=person=] is identified, directly or indirectly, by reference to an identifier such as a name, email address, an arbitrary identifier or identification number, an online identifier such as an IP address or any identifier attached to a device this [=person=] may be using, phone number, location data, or factors specific to the physical, physiological, genetic, mental, economic, cultural, or social identity of that [=person=], as well as identifiers derived from such data, for instance through hashing; or
this [=person=] could reasonably be reidentified from a conjunction of this data with other data; or
the data pertains to a group of people such that a [=person=] may find themselves to be the subject of a treatment related to this group, even if the entity carrying out the treatment has no way to identify that [=person=].
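
As an illustration of why identifiers derived through hashing remain [=personal data=], the following is a minimal sketch in Python. The email address, the helper name `hashed_identifier`, and the two record sets are hypothetical. The sketch only shows that hashing is deterministic, so two independently held data sets keyed by the same hash can still be joined.

```python
import hashlib


def hashed_identifier(email: str) -> str:
    """Derive an identifier from an email address by hashing it (hypothetical helper)."""
    normalised = email.strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


# Two hypothetical data sets, each keyed only by the hashed email address.
site_a_records = {hashed_identifier("user@example.com"): {"city": "Lyon"}}
site_b_records = {hashed_identifier(" User@Example.com"): {"interest": "hiking"}}

# The same input always produces the same hash, so the keys match and the
# records can be combined: the hash still identifies the person indirectly.
joined = {
    key: {**site_a_records[key], **site_b_records[key]}
    for key in site_a_records.keys() & site_b_records.keys()
}
```

Here `joined` ends up holding one combined record, even though neither data set stores the email address itself.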

Data is permanently de-identified when there exists a high level of confidence that no human subject of the data can be identified, directly or indirectly (e.g., via association with an identifier, user agent, or device), by that data alone or in combination with other retained or available information, including as being part of a group. Note that further considerations relating to groups are covered in the Collective Issues in Privacy section.

Data is pseudonymous when:

the identifiers used in the data are under the direct and exclusive control of the [=first party=]; and
when these identifiers are shared with a [=third party=], they are made unique to that [=third party=] such that if they are shared with more than one [=third party=] these cannot then match them up with one another; and
there is a strong level of confidence that no [=third party=] can match them to any data other than that obtained through interactions with the [=first party=]; and
any [=third party=] receiving such identifiers is barred (e.g., based on legal terms) from sharing them or the related data further; and
technical measures exist to prevent re-identification or the joining of different data sets involving these identifiers, notably against timing or k-anonymity attacks; and
there exist contractual terms between the [=first party=] and [=third party=] describing the limited [=purpose=] for which the data is being shared.

This can ensure that [=pseudonymous data=] is used in a manner that provides a minimum degree of governance such that technical and procedural means to guarantee the maintenance of pseudonymity are preserved. Note that [=pseudonymity=], on its own, is not sufficient to render [=data processing=] [=appropriate=].
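
One possible way to make identifiers unique to each [=third party=] is to derive them with a keyed hash. The sketch below assumes a secret key held only by the [=first party=]; the function name `pairwise_identifier`, the key, and the recipient names are hypothetical. It addresses only the requirement that recipients cannot match identifiers with one another, and does nothing for the contractual and procedural conditions listed above.

```python
import hashlib
import hmac


def pairwise_identifier(secret_key: bytes, user_id: str, third_party: str) -> str:
    """Derive an identifier for `user_id` that is specific to one third party."""
    # A NUL separator keeps ("ab", "c") and ("a", "bc") from colliding.
    message = f"{third_party}\x00{user_id}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()


# Hypothetical secret, assumed to be under the exclusive control of the first party.
FIRST_PARTY_KEY = b"example-key-known-only-to-the-first-party"

id_for_a = pairwise_identifier(FIRST_PARTY_KEY, "user-123", "analytics.example")
id_for_b = pairwise_identifier(FIRST_PARTY_KEY, "user-123", "ads.example")
```

The two recipients receive different values for the same [=user=], and without the key neither can compute the other's identifier.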

A vulnerable person is a [=person=] who, at least in the [=context=] of the [=processing=] being discussed, is unable to exercise sufficient self-determination for any consent they may provide to be receivable. This includes for example children, employees with respect to their employers, people in some situations of intellectual or psychological impairment, or refugees.

The Parties

A party is a [=person=], a legal entity, or a set of legal entities that share common owners, common controllers, and a group identity that is readily evident to the [=user=] without them needing to consult additional material, typically through common branding.

The first party is a [=party=] with which the [=user=] intends to interact. Merely hovering over, muting, pausing, or closing a given piece of content does not constitute a [=user=]'s intent to interact with another party, nor does the simple fact of loading a [=party=] embedded in the one with which the user intends to interact. In cases of clear and conspicuous joint branding, there can be multiple [=first parties=]. The [=first party=] is necessarily a [=data controller=] of the data processing that takes place as a consequence of a [=user=] interacting with it.

A third party is any [=party=] other than the [=user=], the [=first party=], or a [=service provider=] acting on behalf of either the [=user=] or the [=first party=].

A service provider or data processor is considered to be the same [=party=] as the entity contracting it to perform the relevant [=processing=] if it:

is processing the data on behalf of that [=party=];
ensures that the data is only retained, accessed, and used as directed by that [=party=] and solely for the list of explicitly-specified [=purposes=] detailed by the directing [=party=] or [=data controller=];
may determine implementation details of the data processing in question but does not determine the [=purpose=] for which the data is being [=processed=] nor the overarching [=means=] through which the [=purpose=] is carried out;
has no independent right to use the data other than in a [=permanently de-identified=] form (e.g., for monitoring service integrity, load balancing, capacity planning, or billing); and,
has a contract in place with the [=party=] which is consistent with the above limitations.

A data controller is a [=party=] that determines the [=means=] and [=purposes=] of data processing. Any [=party=] that is not a [=service provider=] is a [=data controller=].
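
The definitions above can be read as a small decision procedure. The following sketch is one hedged interpretation, not a normative algorithm: the `Relationship` fields and the `classify` function are hypothetical names, and each boolean stands in for a judgement that in practice requires human assessment. It also simplifies by not distinguishing whether a [=service provider=] acts for the [=user=] or for the [=first party=].

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    FIRST_PARTY = "first party"
    SERVICE_PROVIDER = "service provider"
    THIRD_PARTY = "third party"


@dataclass(frozen=True)
class Relationship:
    """Hypothetical summary of how a party relates to a user interaction."""
    user_intends_to_interact: bool
    processes_on_behalf_of_party: bool = False
    follows_directed_purposes_only: bool = False
    determines_purpose_or_means: bool = False
    independent_use_only_deidentified: bool = False
    has_consistent_contract: bool = False


def classify(rel: Relationship) -> Role:
    """Classify a party using the conditions listed in this section."""
    if rel.user_intends_to_interact:
        return Role.FIRST_PARTY
    # All service provider conditions must hold at once.
    is_service_provider = (
        rel.processes_on_behalf_of_party
        and rel.follows_directed_purposes_only
        and not rel.determines_purpose_or_means
        and rel.independent_use_only_deidentified
        and rel.has_consistent_contract
    )
    return Role.SERVICE_PROVIDER if is_service_provider else Role.THIRD_PARTY


def is_data_controller(rel: Relationship) -> bool:
    """Any party that is not a service provider is a data controller."""
    return classify(rel) is not Role.SERVICE_PROVIDER


embedded_widget = Relationship(user_intends_to_interact=False)
hosting_vendor = Relationship(
    user_intends_to_interact=False,
    processes_on_behalf_of_party=True,
    follows_directed_purposes_only=True,
    independent_use_only_deidentified=True,
    has_consistent_contract=True,
)
```

Under these assumptions, `embedded_widget` is classified as a [=third party=] and `hosting_vendor` as a [=service provider=]. A vendor that met every condition except the contract would fall back to being a [=third party=] and therefore a [=data controller=].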

The Vegas Rule is a simple implementation of privacy in which "what happens with the [=first party=] stays with the [=first party=]." Put differently, it describes a situation in which the [=first party=] is the only [=data controller=]. Note that, while enforcing the [=Vegas Rule=] provides a rule of thumb describing a necessary baseline for [=appropriate=] [=data processing=], it is not always sufficient to guarantee [=appropriate=] [=processing=] since the [=first party=] can [=process=] data [=inappropriately=].

Acting on Data

A [=party=] processes data if it carries out operations on [=personal data=], whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, [=sharing=], dissemination or otherwise making available, [=selling=], alignment or combination, restriction, erasure or destruction.

A [=party=] shares data if it provides it to any other [=party=]. Note that, under this definition, a [=party=] that provides data to its own [=service providers=] is not [=sharing=] it.

A [=party=] sells data when it [=shares=] it in exchange for consideration, monetary or otherwise.

Contexts and Privacy

The purpose of a given [=processing=] of data is an anticipated, intended, or planned outcome of this [=processing=] which is achieved or aimed for within a given [=context=]. A [=purpose=], when described, should be specific enough to be actionable by someone familiar with the relevant [=context=] (i.e., they could independently determine [=means=] that reasonably correspond to an implementation of the [=purpose=]).

The means are the general method of [=data processing=] through which a given [=purpose=] is implemented, in a given [=context=], considered at a relatively abstract level and not necessarily all the way down to implementation details. Example: the user will have their preferences restored (purpose) by looking up their identifier in a preferences store (means).

A context is a physical or digital environment that a [=person=] interacts with for a purpose of their own (that they typically share with other [=people=] who interact with the same environment).

A [=context=] can be further described through:

Its actors, which comprise the subject (a [=person=]) as well as the sender and recipient of the data (which are [=parties=]).
Its attributes, which are the types of [=personal data=] being [=processed=] in the [=context=].
Its transmission principles, which are the constraints (typically technical or legal) being placed upon the [=data processing=].

A [=context=] carries context-relative informational norms that determine whether a given [=data processing=] is appropriate (if the norms are adhered to) or inappropriate (when the norms are violated). A norm violation can be for instance the exfiltration of [=personal data=] from a context or the lack of respect for [=transmission principles=]. When [=norms=] are respected in a given [=context=], we can say that contextual integrity is maintained; otherwise that it is violated ([[?PRIVACY-IN-CONTEXT]], [[?PRIVACY-AS-CI]]).

We define privacy as a right to [=appropriate=] [=data processing=]. A privacy violation is, correspondingly, [=inappropriate=] [=data processing=] [[?PRIVACY-IN-CONTEXT]].
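
For threat modelling it can help to write a [=context=] down as data. The sketch below is a deliberately simplified model, assuming that the [=norms=] of a [=context=] can be approximated as a set of allowed flows; real norms are richer than this. The class names `Flow` and `Context`, and the example sites, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Flow:
    """One data flow: actors, attribute, and transmission principle."""
    subject: str
    sender: str
    recipient: str
    attribute: str
    transmission_principle: str


@dataclass
class Context:
    """A context whose norms are approximated by a set of allowed flows."""
    name: str
    allowed_flows: set = field(default_factory=set)

    def is_appropriate(self, flow: Flow) -> bool:
        # Processing is appropriate only when it adheres to the context's norms.
        return flow in self.allowed_flows


reading = Context(name="browsing bookshop.example")
recommend = Flow(
    subject="reader",
    sender="reader",
    recipient="bookshop.example",
    attribute="reading list",
    transmission_principle="used only to recommend books",
)
reading.allowed_flows.add(recommend)

# Exfiltrating the same attribute to another party is not covered by the norms.
exfiltrate = Flow(
    subject="reader",
    sender="bookshop.example",
    recipient="broker.example",
    attribute="reading list",
    transmission_principle="none",
)
```

With this model, `reading.is_appropriate(recommend)` holds while `reading.is_appropriate(exfiltrate)` does not, which corresponds to a violation of [=contextual integrity=].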

Note that a [=first party=] can be comprised of multiple [=contexts=] if it is large enough that [=people=] would interact with it for more than one [=purpose=]. [=Sharing=] [=personal data=] across [=contexts=] is, in the overwhelming majority of cases, [=inappropriate=].

Your cute little pup uses Poodle Naps to find comfortable places to snooze, and Poodle Fetch to locate the best sticks. Napping and fetching are different [=contexts=] with different norms, and sharing data between these contexts is a [=privacy violation=] despite the shared ownership of Naps and Fetch by the Poodle conglomerate.

Colloquially, tracking is understood to be any kind of [=inappropriate=] data collection.

Additionally, privacy labour is the practice of having a [=person=] carry out the work of ensuring [=data processing=] of which they are the subject is [=appropriate=], instead of having the [=parties=] be responsible for that work as is more respectable.

User Agents

The user agent acts as an intermediary between a [=user=] and the web. The [=user agent=] is not a [=context=] in that it is expected to coincide with the [=subject=] and operate exclusively in the [=subject=]'s interest. It is not the [=first party=]. The [=user agent=] serves the [=user=] in a relationship of fiduciary agency: it always puts the [=user=]'s interest first, up to and including, on occasion, protecting the [=user=] from themselves by preventing them from carrying out a harmful decision, or at the very least by speed-bumping it [[?FIDUCIARY-UA]]. For example, the [=user agent=] will make it difficult for the [=user=] to connect to a site the authenticity of which is hard to ascertain, will double-check that the user really intends to expose a sensitive device capability, or will prevent the [=user=] from consenting to permanent monitoring of their behaviour. Its fiduciary duties include [[?TAKING-TRUST-SERIOUSLY]]:

Duty of Protection: Protection requires [=user agents=] to affirmatively protect a [=user=]'s data, beyond simple security measures. It is insufficient simply to encrypt at rest and in transit; the [=user agent=] must further limit retention, ensure that only the strictly necessary data is collected, or require guarantees from those with whom the data is shared.
Duty of Discretion: Discretion requires the [=user agent=] to make best efforts to enforce [=context-relative informational norms=] by placing contextual limits on the flow and [=processing=] of [=personal data=]. Discretion is not confidentiality and may place limits on nondisclosure: trust can be preserved even when the [=user agent=] shares the [=personal data=], so long as it is done in an [=appropriately=] discreet manner.
Duty of Honesty: Honesty requires that the [=user agent=] make sure that the [=user=] is proactively provided with information that is relevant to them and that will enhance the [=user=]'s autonomy, to the extent possible in a manner that they will comprehend and at the right moment, which is almost never when the [=user=] is trying to do something else such as read a page or activate a feature. The duty of honesty goes well beyond that of transparency that dominates legacy privacy regimes. Unlike with transparency, honesty cannot get away with hiding relevant information in complex out-of-band legal notices any more than it can rely on overly cursory information provided in a consent dialog.
Duty of Loyalty: Because of the special [=fiduciary relationship=] that obtains between [=user=] and [=user agent=], the latter is held to be loyal to the former in all situations, up to and including in preference to the [=user agent=]'s implementer. When a [=user agent=] carries out [=processing=] that is not directly in the [=user=]'s interest but rather benefits another entity such as its implementer, including by piggybacking on [=processing=] that may be in the user's interest, that behaviour is known as self-dealing. [=Self-dealing=] is always [=inappropriate=]. Loyalty is the avoidance of [=self-dealing=].

These duties ensure the [=user agent=] will care for the user. It is important to note that there is a subtle difference between care and data paternalism which is that the latter purports to help in part by removing agency ("don't worry about it, so long as your data is with us it's safe, you don't need to know what we do with it, it's all good because we're good people") whereas care aims to support people by enhancing their agency and sovereignty.

Privacy Threat Model Building Blocks

Threats Are Context-Relative

Privacy threat models for the Web exist ([[?TRACKING-PREVENTION-POLICY]], [[?ANTI-TRACKING-POLICY]], [[?PRIVACY-THREAT]]). While good within the scope of what they intend to do, they tend to be very specific to cross-context tracking as seen by the browser, which is but one problem. Privacy threat modelling could benefit from having more of a toolbox to reuse in different situations.

This document provides building blocks for the creation of privacy threat models on the Web. Note that privacy threat models have an important difference with security threat models in that all [=parties=] are potential threats, even when they are not rogue. In fact the [=first party=] is typically considered to be the primary threat, the one against which the bulk of mitigating techniques are to be directed.

The most important building blocks for a privacy threat model are those that define a [=context=]: [=actors=], [=attributes=], and [=transmission principles=], as well as the [=context-relative informational norms=]. Collectively, these define the expectations that obtain in a given [=context=]. Based on these expectations, it becomes possible to ask questions about the ways in which they could fail and how to prevent such failures. What happens if the [=subject=] is a [=vulnerable person=]? How well does the context fare if expectations of consent are undermined by manipulative techniques ([[?DIGITAL-MARKET-MANIPULATION]], [[?PRIVACY-BEHAVIOR]])? If the [=transmission principles=] are based on sharing data with parties while telling them that they cannot use it, what confidence do we have that they are following the rules?

When assessing privacy threats, it is not necessary to establish harms since breaking the [=context=]'s [=norms=] is [=inappropriate=] in all cases. However, consideration of harms can inform which issues to prioritise. Where individual privacy harms are to be defined, the definitive source is Privacy Harms [[?PRIVACY-HARMS]].

Identity on the Web

A [=person=]'s identity is the set of characteristics that define them. In computer systems, [=identity=] is typically attached to a means of denotation that makes it easier for an automated system to recognise the user, an identifier of some type.

A [=person=]'s characteristics, and therefore their [=identity=], are entirely [=context=]-dependent. Recognising a [=person=] in distinct [=contexts=] can at times be [=appropriate=] but this relies on an understanding of applicable [=norms=] and will likely require compartmentalisation. (For example, if you meet your therapist at a cocktail party, you expect them to have rather different discussion topics with you than they usually would, and possibly even to pretend they do not know you.) As a result, automating the recognition of a [=person=]'s identity across different [=contexts=] is [=inappropriate=]. This is particularly true for [=vulnerable=] people as recognising them in different [=contexts=] may force their vulnerability into the open.

[=Identity=] being by nature fragmented, [=user agents=] must work in support of [=users=] having different identities in different [=contexts=] with respect to all [=parties=] (including the user agent vendor) and to prevent their recognition through other means, where possible.

A keystone principle of the Web is trust [[RFC8890]]. An important part of trust is to ensure that [=data=] collected for a [=purpose=] that matches a clearly delineated [=user=] feature is not then used for additional secondary [=purposes=]. Email is often used for login and communication [=purposes=], including essential transactional interactions. Using emails for cross-context recognition [=purposes=] is therefore not only [=inappropriate=] but also threatens key messaging infrastructure. Future iterations of the Web platform should ensure that users can log into sites and receive communication without relying on email at all.

User Control and Autonomy

A [=person=]'s autonomy is their ability to make decisions of their own volition, without undue influence from other parties. People have limited intellectual resources and time with which to weigh decisions, and by necessity rely on shortcuts when making decisions. This makes their privacy preferences malleable [[?PRIVACY-BEHAVIOR]] and susceptible to manipulation [[?DIGITAL-MARKET-MANIPULATION]]. A [=person=]'s [=autonomy=] is enhanced by a system or device when that system offers a shortcut that aligns more with what that [=person=] would have decided given arbitrary amounts of time and relatively unfettered intellectual ability; and [=autonomy=] is decreased when a similar shortcut goes against decisions made under ideal conditions.

Affordances and interactions that decrease [=autonomy=] are known as dark patterns. A [=dark pattern=] does not have to be intentional; the deceptive effect is sufficient to define it ([[?DARK-PATTERNS]], [[?DARK-PATTERN-DARK]]).

Because we are all subject to motivated reasoning, the design of defaults and affordances that may impact [=user=] [=autonomy=] should be the subject of independent scrutiny. Implementers are enjoined to be particularly cautious to avoid slipping into [=data paternalism=].

Given the sheer volume of potential [=data=]-related decisions in today's data economy, complete informational self-determination is impossible. This fact, however, should not be confused with the contention that privacy is dead. Careful design of our technological infrastructure can ensure that [=users=]' [=autonomy=] as pertaining to their own [=data=] is enhanced through [=appropriate=] defaults and choice architectures.

In the 1970s, the Fair Information Practices or FIPs were elaborated in support of individual [=autonomy=] in the face of growing concerns with databases. The [=FIPs=] assume that there is sufficiently little [=data processing=] taking place that any [=person=] will be able to carry out sufficient diligence to enable [=autonomy=] in their decision-making. Since they entirely offload the [=privacy labour=] to [=users=] and assume perfect, unfettered [=autonomy=], the [=FIPs=] do not forbid specific types of [=data processing=] but only place them under different procedural requirements. Such an approach is [=appropriate=] for [=parties=] that are processing data in the 1970s.

One notable issue with procedural approaches to privacy is that they tend to have the same requirements in situations where the [=user=] finds themselves in a significant asymmetry of power with a [=party=] (for instance the [=user=] of an essential service provided by a monopolistic platform) and those where [=user=] and [=parties=] are very much on equal footing, or even where the [=user=] may have greater power, as is the case with small businesses operating in a competitive environment. Such approaches further do not consider cases in which one [=party=] may coerce other [=parties=] into facilitating its [=inappropriate=] practices, as is often the case with dominant players in advertising [[?CONSENT-LACKEYS]] or in content aggregation [[?CAT]].

Reference to the [=FIPs=] survives to this day. They are often referenced as transparency and choice, which, in today's digital environment, is often a strong indication that [=inappropriate=] [=processing=] is being described.

Figure (image of Agnes from WandaVision winking): 'Transparency and choice', a method of privacy regulation which promises honesty and autonomy but delivers neither [[?CONFIDING]].

Opt-in, Consent, Opt-out, Global Controls

Different procedural mechanisms exist to enable [=people=] to control the [=processing=] done to their [=data=]. Mechanisms that increase the number of [=purposes=] for which their [=data=] is being [=processed=] are referred to as opt-in or consent; mechanisms that decrease this number of [=purposes=] are known as opt-out.

When deployed thoughtfully, these mechanisms can enhance [=people=]'s [=autonomy=]. Often, however, they are used as a way to avoid putting in the difficult work of deciding which types of [=processing=] are [=appropriate=] and which are not, offloading [=privacy labour=] to the [=user=].

Privacy regulatory regimes are often anchored at extremes: either they default to allowing only very few strictly essential [=purposes=] such that many [=parties=] will have to resort to [=consent=], habituating [=people=] to ignore legal prompts and incentivising [=dark patterns=], or, conversely, they default to forbidding only very few, particularly egregious [=purposes=], such that [=people=] will have to perform the [=privacy labour=] to [=opt out=] in every [=context=] in order to produce [=appropriate=] [=processing=].

An approach that is more aligned with the expectation that the Web should provide a trustworthy, [=person=]-centric environment is to establish a regime consisting of three privacy tiers:

Default Privacy Tier: This is the set of [=purposes=] that are deemed [=appropriate=] in a given [=context=] and the [=processing=] that a [=person=] can expect without having triggered any mechanisms to change their preferences. The exact details for this tier require clear definition, including aspects such as data retention, but [=data=] would have to be systematically siloed by [=context=] (not by [=party=], making it stricter than the [=Vegas Rule=] and more in line with respecting [=privacy=]). This default tier should also be defined differently for certain kinds of [=contexts=]. The legitimate [=processing=] that can take place in this tier derives its legitimacy from matching the expectations and interests of both the [=user=] and the [=first party=] in their relationship, as guided by the applicable [=norms=]. This tier is more [=appropriate=] the more the [=first party=] acts in accordance with [=fiduciary duties=].
Opt-out Privacy Tier: [=People=] who, either from personal preference or because they are [=vulnerable=], require greater obscurity in some or even most [=contexts=], would be able to transition to this tier through an [=opt-out=] mechanism. This tier would only permit strictly necessary [=purposes=]. This tier should be [=appropriate=] for [=vulnerable=] [=people=].
Opt-in Privacy Tier: In rare and highly specific cases, [=people=] should be able to [=consent=] to more sensitive [=purposes=], such as having their [=identity=] recognised across contexts or their reading history shared with a company. The burden of proof on ensuring that informed [=consent=] has been obtained needs in this case to be very high (much higher than what prevails for instance in [[?GDPR]] jurisdictions as currently practised). [=Consent=] is comparable to the general problem of permissions on the Web platform. In the same way that it should be clear when a given device capability is in use (e.g., you are providing geolocation or camera access), sharing data under this tier should be set up in such a way that it requires deliberate, specific action from the [=user=] (e.g., triggering a form control) and if that [=consent=] is persistent, there should be an indicator that data is being transmitted shown at all times, in such a way that the user can easily switch it off. In general, providing [=consent=] should be rare, difficult, highly intentional, and temporary.
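
The three tiers can be sketched as a function from a [=person=]'s tier to the set of permitted [=purposes=]. This is only an illustration under stated assumptions: a [=person=] is in exactly one tier at a time, [=consent=] is recorded per [=purpose=] with an expiry time (reflecting that it should be temporary), and the purpose names and the `permitted_purposes` function are hypothetical.

```python
from enum import Enum


class Tier(Enum):
    OPT_OUT = "opt-out"
    DEFAULT = "default"
    OPT_IN = "opt-in"


def permitted_purposes(tier, strictly_necessary, default_appropriate, consents, now):
    """Return the set of purposes permitted for a person in `tier` at time `now`.

    `consents` maps a purpose to the time at which that consent expires.
    """
    allowed = set(strictly_necessary)
    if tier is Tier.OPT_OUT:
        # The opt-out tier only permits strictly necessary purposes.
        return allowed
    allowed |= set(default_appropriate)
    if tier is Tier.OPT_IN:
        # Consent is temporary: expired consents no longer add purposes.
        allowed |= {p for p, expires_at in consents.items() if now < expires_at}
    return allowed


# Hypothetical purposes for one context.
NECESSARY = {"deliver page"}
DEFAULT_OK = {"restore preferences"}
CONSENTS = {"share reading history": 100}

during_consent = permitted_purposes(Tier.OPT_IN, NECESSARY, DEFAULT_OK, CONSENTS, now=50)
after_expiry = permitted_purposes(Tier.OPT_IN, NECESSARY, DEFAULT_OK, CONSENTS, now=150)
```

In this sketch the consented purpose is present in `during_consent` and absent from `after_expiry`, at which point the [=person=] is effectively back in the default tier.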

When an [=opt-out=] mechanism exists, it should preferably be complemented by a global opt-out mechanism. The function of a [=global opt-out=] mechanism is to rectify the automation asymmetry whereby service providers can automate [=data processing=] but [=people=] have to take manual action. A good example of a [=global opt-out=] mechanism is the Global Privacy Control [[?GPC]].

Conceptually, a [=global opt-out=] mechanism is an automaton operating as part of the [=user agent=], which is to say that it is equivalent to a robot that would carry out the [=user=]'s bidding by pressing an [=opt-out=] button with every interaction that the [=user=] has with a site, or more generally conveys an expression of the [=user=]'s rights in a relevant jurisdiction. (For instance, under [[?GDPR]], the [=user=] may be conveying objections to [=processing=] based on legitimate interest or the withdrawal of [=consent=] to specific [=purposes=].) It should be noted that, since a [=global opt-out=] signal is reaffirmed automatically with every [=user=] interaction, it will take precedence in terms of specificity over any manner of blanket [=consent=] that a site may obtain, unless that [=consent=] is directly attached to an interaction (e.g., terms specified on a form upon submission).
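
The precedence described above could be sketched on the server side as follows. The sketch assumes the signal arrives as the `Sec-GPC: 1` request header defined by Global Privacy Control, and that the site can tell blanket [=consent=] apart from [=consent=] attached to the current interaction; the function names and the two boolean parameters are hypothetical, and the decision concerns a [=purpose=] that requires [=consent=].

```python
def has_global_opt_out(headers: dict) -> bool:
    """Return True when the request carries a Global Privacy Control signal."""
    # Header names are case-insensitive, so normalise before looking up.
    normalised = {name.lower(): value.strip() for name, value in headers.items()}
    return normalised.get("sec-gpc") == "1"


def may_process_optional_purpose(
    headers: dict, blanket_consent: bool, interaction_consent: bool
) -> bool:
    """Decide whether a consent-based purpose may proceed for this request."""
    if interaction_consent:
        # Consent attached directly to this interaction is the most specific.
        return True
    if has_global_opt_out(headers):
        # The signal is reaffirmed with every request and overrides blanket consent.
        return False
    return blanket_consent


request_headers = {"Sec-GPC": "1", "Accept": "text/html"}
blocked = may_process_optional_purpose(
    request_headers, blanket_consent=True, interaction_consent=False
)
allowed = may_process_optional_purpose(
    request_headers, blanket_consent=True, interaction_consent=True
)
```

Here `blocked` is `False` because the signal overrides the earlier blanket [=consent=], while `allowed` is `True` because the [=consent=] is tied to the interaction itself.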

Collective Issues in Privacy

When designing Web technology, we naturally pay attention to potential impacts on the [=person=] using the Web through their [=user agent=]. In addition to potential individual harms we also pay heed to collective effects that emerge from the accumulation of individual actions as influenced by entities and the structure of technology.

Note that in evaluating impact, we deliberately ignore what implementers or specifiers may have intended and only focus on outcomes. This framing is known as POSIWID, or "the Purpose Of a System Is What It Does".

The collective problem of privacy is known as legibility. [=Legibility=] concerns population-level [=data processing=] that may impact populations or individuals, including in ways that [=people=] could not control even under the optimistic assumptions of the [=FIPs=]. For example, based on population-level analysis, a company may know that site.example is predominantly visited by [=people=] of a given race or gender, and decide not to run its job ads there. Visitors to that page are implicitly having their [=data=] processed in [=inappropriate=] ways, with no way to discover the discrimination or seek relief [[?DEMOCRATIC-DATA]].

What we consider is therefore not just the relation between the people who expose themselves and the entities that invite that disclosure [[?RELATIONAL-TURN]], but also between the people who expose themselves and those who do not but may find themselves recognised as such indirectly anyway. One key understanding here is that such relations may persist even when data is [=permanently de-identified=].

[=Legibility=] practices can be legitimate or illegitimate depending on the [=context=] and on the [=norms=] that apply in that [=context=]. Typically, a [=legibility=] practice may be [=legitimate=] if it is managed through an acceptable process of collective [=governance=]. For example, it is often considered [=legitimate=] for a government, under the control of its citizens, to maintain a database of license plates for the [=purpose=] of enforcing the rules of the road. It would be [=illegitimate=] to observe the same license plates near places of worship to build a database of religious identity.

[=Legibility=] is often used to order information about the world. This can notably create problems of [=reflexivity=] and of [=autonomy=].

Problems of reflexivity occur when the ordering of information about the world used to produce [=legibility=] finds itself changing the way in which the world operates. This can produce self-reinforcing loops that can have deleterious effects both individual and collective [[?SEEING-LIKE-A-STATE]].

Issues of [=autonomy=] occur depending on the manner in which [=legibility=] is implemented. When [=legibility=] is used to order the world following rules set by the [=user=] or following methods subject to public scrutiny and [=governance=] models with strong checks and balances (such as a newspaper's editorial decisions), then it will enhance [=user=] [=autonomy=] and tend to be [=legitimate=]. When it is done in the [=user=]'s stead and without [=governance=], it decreases [=user=] [=autonomy=] and tends to be [=illegitimate=].

Data governance refers to the rules and processes for how [=data=] is [=processed=] in any given [=context=]. How data is governed describes who has power to make decisions over [=data=] and how [[?DATA-FUTURES-GLOSSARY]].

In general, collective issues in [=data=] require collective solutions. The proper goal of [=data governance=] at the standards-setting level is the development of structural controls in [=user agents=] and the provision of institutions that can handle population-level problems in [=data=]. [=Governance=] will often struggle to achieve its goals if it works primarily by increasing individual control over [=data=]. A collective approach reduces the cost of control.

Collecting data at large scales can have significant pro-social outcomes. Problems tend to emerge when entities take part in dual-use collection in which [=data=] is [=processed=] for collective benefit but also for [=self-dealing=] [=purposes=] that may degrade welfare. The [=self-dealing=] [=purposes=] will be justified as bankrolling the pro-social outcomes, which, absent collective oversight, cannot be considered to support claims to [=legitimacy=] for such [=legibility=]. It is vital for standards-setting organisations to establish not just purely technical devices but techno-social systems that can govern data at scale.
