Definition

Personally Identifiable Information (PII) is any data that identifies, relates to, or could reasonably be linked to a specific individual. The term originates from US government usage (NIST SP 800-122), but the concept is universal across privacy frameworks: GDPR uses “personal data,” CCPA uses “personal information,” and FADP uses “Personendaten.” The definitional boundaries shift by jurisdiction, but the core principle is stable: PII is any information that makes a person distinguishable from all others.

PII exists in two categories. Direct identifiers can identify a person on their own: full name, Social Security number, passport number, biometric data, email address, phone number. Quasi-identifiers (or indirect identifiers) cannot identify a person alone but can do so in combination: date of birth, ZIP code, gender, occupation, IP address. Research by Latanya Sweeney at Carnegie Mellon demonstrated that 87% of the US population can be uniquely identified by the combination of ZIP code, birthdate, and gender alone.
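The re-identification power of quasi-identifiers can be illustrated with a short sketch. The code below uses a toy, invented dataset (the values are not real people) and counts how many records share each (ZIP code, birthdate, gender) combination; any combination held by exactly one record uniquely identifies that person.

```python
from collections import Counter

# Toy records: (ZIP code, birthdate, gender) quasi-identifier tuples.
# All values are illustrative, not real data.
records = [
    ("02139", "1984-07-01", "F"),
    ("02139", "1984-07-01", "F"),
    ("02139", "1990-03-12", "M"),
    ("60614", "1975-11-30", "F"),
    ("60614", "1984-07-01", "M"),
]

# Count how many records share each quasi-identifier combination.
combo_counts = Counter(records)

# A combination held by exactly one record uniquely identifies that person.
unique = [combo for combo, n in combo_counts.items() if n == 1]
print(f"{len(unique)} of {len(combo_counts)} combinations are unique")
```

Even in this five-record toy set, three of the four combinations are unique. At population scale, the same arithmetic produces Sweeney's 87% figure.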

Why It Matters

The 2024 Identity Theft Resource Center Annual Data Breach Report documented 3,205 data breaches in the United States, exposing approximately 1.1 billion individual records. Globally, the volume of personal data breached has grown by an average of 26% year-over-year since 2020. The cost follows: IBM’s 2024 research places the per-record cost of a PII breach at $169, with breaches involving high-sensitivity PII (financial, health, biometric) averaging $201 per record.

The regulatory landscape treats PII as the atomic unit of privacy law. GDPR Article 4 defines personal data broadly as “any information relating to an identified or identifiable natural person.” CCPA Section 1798.140(v) extends the definition to include household-level data and probabilistic identifiers. HIPAA defines 18 specific categories of Protected Health Information. Every data protection regulation, in every jurisdiction, begins with the same question: does this data identify a person?

For AI applications, PII creates compound risk. When users type personal information into LLM prompts, that data enters a processing pipeline controlled by third parties. Model memorization can encode PII into model weights. Prompt logs can persist on provider infrastructure. Metadata can correlate sessions to identities. The risk is not hypothetical—it is the default state of every AI interaction that does not strip PII at the source.

How It Works

PII management operates across detection, classification, and protection:

  1. Detection: Named entity recognition (NER) models detect persons, organizations, locations, and identifiers in free text. Pattern matching catches structured formats: Social Security numbers, credit card numbers, email addresses.

  2. Classification: Direct identifiers (SSN, passport) receive the highest sensitivity rating. Quasi-identifiers (ZIP code, age) are classified by re-identification risk in context. Health and biometric data receive enhanced regulatory protection.

  3. Protection: Classified PII is secured through tokenization, encryption, anonymization, pseudonymization, or deletion. AI inference through third-party models demands full removal or tokenization—pseudonymization alone is insufficient.

  4. Lifecycle management: GDPR's storage limitation principle (Article 5(1)(e)) requires that PII be retained no longer than necessary. Right-to-erasure requests require locating and destroying all copies across the entire infrastructure.
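The detection step above can be sketched with simple pattern matching. The regexes below are deliberately simplified illustrations; production detectors also validate check digits, handle formatting variants, and rely on NER models for unstructured identifiers such as names.

```python
import re

# Simplified regex sketches for common structured identifiers.
# Real detectors are stricter (e.g. Luhn validation for card numbers).
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "CREDIT_CARD": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}

def detect_pii(text: str) -> list[tuple[str, str]]:
    """Return (category, match) pairs for every structured identifier found."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.group()))
    return hits

prompt = "Contact jane.doe@example.com, SSN 123-45-6789."
print(detect_pii(prompt))
```

Each hit would then flow into the classification step, where the category determines its sensitivity rating and the protection mechanism applied.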

Stealth Cloud Relevance

Stealth Cloud treats PII as toxic material that must be neutralized before it enters any processing pipeline. The PII stripping engine—a WebAssembly module running client-side in the browser—scans every outbound prompt, detects personal identifiers using NER models and pattern matching, and replaces each element with a non-reversible token. The token map exists only in browser memory.

This is not data minimization as an afterthought. It is data elimination as a protocol. The LLM provider receives a prompt such as “[PERSON_1] needs to review the contract with [ORG_1] by [DATE_1]”—semantically complete, personally vacant. When the model’s response arrives containing token placeholders, Ghost Chat’s client-side de-tokenization restores the original values for display. The user sees their data. No one else ever did.
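The tokenize/de-tokenize round trip can be sketched as follows. This is a minimal illustration, not the production engine: it assumes detection has already run (the `entities` mapping stands in for NER output), and the token map lives only in local memory, mirroring the browser-memory constraint described above.

```python
def tokenize(prompt: str, entities: dict[str, str]) -> tuple[str, dict[str, str]]:
    """Replace detected PII spans with placeholder tokens.

    `entities` maps raw PII text to a category (e.g. "PERSON", "ORG");
    detection (NER / pattern matching) is assumed to have run already.
    Returns the stripped prompt and the local-only token map.
    """
    token_map: dict[str, str] = {}
    counters: dict[str, int] = {}
    for value, category in entities.items():
        counters[category] = counters.get(category, 0) + 1
        token = f"[{category}_{counters[category]}]"
        token_map[token] = value
        prompt = prompt.replace(value, token)
    return prompt, token_map

def detokenize(response: str, token_map: dict[str, str]) -> str:
    """Restore original values in the model's response for local display."""
    for token, value in token_map.items():
        response = response.replace(token, value)
    return response

# Hypothetical example values.
entities = {"Alice Chen": "PERSON", "Acme Corp": "ORG"}
stripped, tmap = tokenize("Alice Chen signed with Acme Corp.", entities)
print(stripped)    # [PERSON_1] signed with [ORG_1].
restored = detokenize("Remind [PERSON_1] about [ORG_1].", tmap)
print(restored)    # Remind Alice Chen about Acme Corp.
```

The provider only ever sees the stripped form; discarding `tmap` at session end makes the substitution permanent, since nothing outside local memory can map the tokens back.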

The zero-knowledge architecture ensures that PII never exists on any server, in any log, in any form. At session end, cryptographic shredding destroys the token map, and the substitution becomes permanent. There is no PII to breach, no PII to subpoena, and no PII to memorize.

The Stealth Cloud Perspective

PII is the liability that every privacy regulation attempts to govern and every data breach exploits. Stealth Cloud takes a different approach: rather than governing PII, strip it. Rather than securing it, shred it. The safest personal data is personal data that never left the client.