In 2025, venture capital firms invested over $97 billion globally in AI companies, according to PitchBook data. That figure represented nearly 40% of all venture funding for the year. Yet fewer than 15% of VC firms reported having a formal AI data practices assessment framework in their due diligence process, according to a survey by the National Venture Capital Association. The industry is deploying capital at unprecedented scale into companies whose data handling practices – the single greatest source of regulatory, legal, and reputational risk in AI – are evaluated with less rigor than a seed-stage company’s cap table.

This is not due to indifference. It is due to a knowledge gap. Most venture investors understand product-market fit, unit economics, and competitive dynamics. Fewer understand the technical and regulatory nuances of AI data practices: how training data is sourced, how user data flows through inference pipelines, how model weights encode information about the data they were trained on, and how regulatory frameworks from GDPR to HIPAA to the EU AI Act create liability exposure that can materialize years after the investment.

The cost of this gap is becoming visible. AI companies have faced regulatory enforcement actions, training data lawsuits, and user data scandals that destroyed significant enterprise value. The Italian ChatGPT ban halted OpenAI’s European growth for weeks. The New York Times lawsuit created existential legal risk for every company training on copyrighted material. The Samsung incident – employees pasting confidential source code into ChatGPT – demonstrated how a single data handling failure can trigger corporate bans across entire industries.

What follows are the 10 questions that should be standard in every AI-focused due diligence process, the red flags that indicate systemic data practice risk, and a framework for the data supply chain audit that most investors skip entirely.

The 10 Questions

1. Where Does Your Training Data Come From?

The foundational question. The answer reveals the company’s legal exposure, ethical posture, and competitive durability.

What to look for: Clear documentation of training data sources, licenses for proprietary datasets, evidence of consent mechanisms for user-contributed data, and a coherent explanation of how publicly available data was collected (web scraping, API access, data partnerships).

Red flags: Inability to enumerate training data sources. Vague references to “publicly available data” without specifics. No data licensing agreements. Training data sourced from a single scrape of the internet without legal review. Use of datasets known to contain copyrighted material without a fair use analysis.

Why it matters: The New York Times v. OpenAI lawsuit and similar actions (Getty Images v. Stability AI, authors’ class actions against Meta and OpenAI) are establishing that training data provenance is a liability question. A company that cannot account for its training data is sitting on an unquantified legal liability.
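A diligence team can ask to see something like a training-data manifest: a record in which every source carries its license and legal review status. The sketch below is hypothetical (the source names and fields are illustrative, not from any real dataset), but it shows the artifact a well-run company should be able to produce on request.

```python
# Hypothetical training-data manifest: every source carries a documented
# license and a legal-review flag. Entries and names are illustrative.
MANIFEST = [
    {"source": "licensed-news-corpus", "license": "commercial-license", "reviewed": True},
    {"source": "common-crawl-subset", "license": "web-scrape", "reviewed": False},
]

def unreviewed_sources(manifest):
    """Sources with no documented legal review: unquantified legal exposure."""
    return [entry["source"] for entry in manifest if not entry["reviewed"]]

# unreviewed_sources(MANIFEST) flags "common-crawl-subset" for follow-up
```

A company that cannot produce the equivalent of this manifest cannot answer Question 1.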

2. Do You Train on User Data? Under What Conditions?

The distinction between a company that trains on user data by default, one that trains on opt-in data only, and one that never trains on user data is the single most important architectural decision an AI company makes.

What to look for: A clear, documented training data policy. Technical architecture that enforces the policy (not just contractual language). Separate data pipelines for inference and training. Audit mechanisms that verify the separation.

Red flags: Training on user data by default with opt-out mechanisms. No technical separation between inference data and training data. Terms of service that grant broad rights to use input data for “service improvement” (a term that can encompass training). The OpenAI data practices analysis illustrates the complexity of these distinctions in practice.

3. What Is Your Data Retention Policy?

How long does the company retain user prompts, model outputs, conversation logs, and associated metadata? The answer determines exposure under every privacy regulation.

What to look for: Defined retention periods for different data categories. Technical enforcement of retention limits (automated deletion, not manual processes). Distinction between content retention and metadata retention. Clear policies for data retained for abuse monitoring, safety, and legal compliance.

Red flags: Indefinite retention (“we retain data as long as necessary for the purposes described in our privacy policy”). No distinction between content and metadata retention. Retention periods longer than any stated purpose justifies. No automated deletion mechanisms. The concept of zero-persistence architecture represents the gold standard.
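"Technical enforcement of retention limits" means a scheduled sweep that deletes by category, not a runbook someone is supposed to follow. A minimal sketch, assuming per-category limits (the day counts here are placeholders; real values come from policy and counsel):

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-category retention limits, distinguishing content from metadata.
RETENTION = {
    "content": timedelta(days=30),
    "metadata": timedelta(days=90),
    "abuse_logs": timedelta(days=180),
}

def sweep(records: list[dict], now: datetime) -> list[dict]:
    """Keep only records younger than their category's retention limit."""
    return [
        r for r in records
        if now - r["created_at"] <= RETENTION[r["category"]]
    ]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
records = [
    {"category": "content", "created_at": now - timedelta(days=45)},   # expired
    {"category": "metadata", "created_at": now - timedelta(days=45)},  # kept
]
kept = sweep(records, now)
# the 45-day-old content record is dropped; the metadata record survives
```

A diligence team should ask to see the production equivalent of this sweep and the logs proving it runs.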

4. How Do You Handle Data Subject Access and Deletion Requests?

Under GDPR, CCPA, and similar regulations, users have the right to access their data and request its deletion. For AI companies, deletion is uniquely complex because data may be encoded in model weights through training.

What to look for: A documented process for handling DSARs with defined response timelines. Technical capability to delete user data from all systems (including backups and derived data). A coherent explanation of how the company handles the “right to be forgotten” with respect to trained models. Understanding of the difference between deleting data from databases and removing its influence from model weights.

Red flags: No DSAR process. Claims that model training is “irreversible” used to justify non-deletion. Response timelines that exceed regulatory deadlines (one month under GDPR, extendable by two further months for complex requests). No technical mechanism for data deletion from training pipelines.
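The "technical capability to delete from all systems" breaks down into two parts: propagating the deletion to every store, and producing a per-system verification report, which is the audit artifact regulators and diligence teams ask for. A minimal sketch with hypothetical store interfaces; a real system would also cover backups, caches, vector indexes, and training-data staging areas:

```python
# Hypothetical store interface; names and structure are illustrative.
class Store:
    def __init__(self, name: str):
        self.name = name
        self.rows = {}  # user_id -> data

    def delete_user(self, user_id: str) -> None:
        self.rows.pop(user_id, None)

    def has_user(self, user_id: str) -> bool:
        return user_id in self.rows

def handle_deletion_request(user_id: str, stores: list[Store]) -> dict[str, bool]:
    """Delete a user's data from every store, then verify and report per system."""
    for s in stores:
        s.delete_user(user_id)
    # Post-deletion verification produces the evidence trail.
    return {s.name: not s.has_user(user_id) for s in stores}

primary, backup = Store("primary_db"), Store("backup")
primary.rows["u1"] = "prompt history"
backup.rows["u1"] = "prompt history"
report = handle_deletion_request("u1", [primary, backup])
# report maps each system name to True once deletion is verified
```

Note what this sketch does not solve: removing a user's influence from trained model weights, which is the genuinely hard part the question probes.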

5. What Is Your Data Processing Architecture?

Where does data go, who has access to it, and what processing occurs at each stage? This question maps the company’s entire data flow.

What to look for: A data flow diagram showing every system that touches user data, from ingestion through processing, storage, and deletion. Clear identification of which systems are controlled by the company and which are third-party services. Encryption standards for data at rest and in transit. Access controls with documented need-to-know restrictions.

Red flags: Inability to produce a data flow diagram. Data processing in jurisdictions with weak privacy protections without user notification. Unencrypted data storage. Overly broad employee access to user data. Use of third-party services (logging, analytics, monitoring) that receive user data without being disclosed in the privacy policy.
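A data flow diagram can also be checked mechanically: model the systems as a graph, trace where user data can reach, and compare the third-party recipients against what the privacy policy discloses. A minimal sketch with an entirely hypothetical flow map (node names are illustrative):

```python
# Hypothetical data-flow map: keys are systems, values are where data flows next.
FLOWS = {
    "user_input": ["api_gateway"],
    "api_gateway": ["inference_server", "logging_service"],
    "inference_server": ["response", "analytics_vendor"],
    "logging_service": [],
    "analytics_vendor": [],
    "response": [],
}

THIRD_PARTIES = {"analytics_vendor", "logging_service"}
DISCLOSED_THIRD_PARTIES = {"analytics_vendor"}  # per the privacy policy

def undisclosed_recipients() -> set[str]:
    """Third parties reachable by user data but absent from the privacy policy."""
    reachable, stack = set(), ["user_input"]
    while stack:
        for dest in FLOWS[stack.pop()]:
            if dest not in reachable:
                reachable.add(dest)
                stack.append(dest)
    return (reachable & THIRD_PARTIES) - DISCLOSED_THIRD_PARTIES

# here the logging service receives user data but is not disclosed: a red flag
```

The same traversal, run over a real data map, surfaces exactly the undisclosed-recipient red flag described above.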

6. What Third-Party AI Providers Do You Use?

Many AI startups are wrappers around foundation model APIs. Understanding the full supply chain of AI providers is essential.

What to look for: Complete list of AI provider relationships, including model providers, embedding services, vector databases, and any other AI-related third-party services. Data processing agreements with each provider. Understanding of each provider’s own data practices (since the startup’s data handling is only as strong as its weakest provider). The AI provider privacy scorecard provides comparative analysis.

Red flags: Multiple AI providers without consistent data processing agreements. Use of consumer-tier AI services for production workloads. Inability to explain what data each provider receives. No monitoring of provider data practice changes.

7. How Do You Comply with International Data Transfer Requirements?

For any company with users outside the United States, international data transfers trigger regulatory requirements under GDPR and similar frameworks.

What to look for: Identified legal mechanisms for international data transfers (Standard Contractual Clauses, adequacy decisions, binding corporate rules). Transfer Impact Assessments documenting the analysis. Understanding of the Schrems II implications for transfers to the U.S. Data localization options for users in jurisdictions that require them.

Red flags: No data transfer mechanisms in place. Processing all global user data in the U.S. without transfer safeguards. Ignorance of Schrems II and its implications. No data localization capability. The full scope of this challenge is analyzed in the GDPR problem for AI.

8. What Security Measures Protect User Data?

Standard security due diligence, adapted for AI-specific threats.

What to look for: SOC 2 Type II certification (or progress toward it). Penetration testing reports. Bug bounty program. Encryption standards (AES-256 for data at rest, TLS 1.3 for data in transit). Access controls and audit logging. Incident response plan. Specific consideration of AI-related security threats: model extraction, training data extraction, prompt injection, and adversarial attacks.

Red flags: No third-party security assessments. Encryption below industry standards. No incident response plan. No consideration of AI-specific attack vectors. History of security incidents without documented remediation. The model memorization problem represents an AI-specific security risk that many companies have not assessed.

9. What Is Your Regulatory Compliance Posture?

Which regulations apply to the company, and how is compliance maintained?

What to look for: Identification of all applicable regulatory frameworks (GDPR, CCPA/CPRA, HIPAA if handling health data, GLBA if handling consumer financial data, COPPA if minors could be users, EU AI Act if operating in Europe). A compliance roadmap with specific milestones. Legal counsel with AI and privacy expertise. Data Protection Officer (required under GDPR when core activities involve large-scale monitoring of individuals or large-scale processing of sensitive data).

Red flags: Uncertainty about which regulations apply. No privacy counsel. No DPO when one is required. Compliance described as a “future priority” rather than a current practice. No monitoring of regulatory developments (particularly the EU AI Act, which imposes significant new requirements).

10. How Do You Handle Model Governance?

How are models developed, tested, deployed, and monitored? Model governance is the AI-specific extension of data governance.

What to look for: Model development documentation (training data, architecture decisions, evaluation metrics). Bias and fairness testing results. Model versioning and rollback capabilities. Monitoring for model drift and performance degradation. Clear ownership of model governance within the organization.

Red flags: No model documentation. No bias testing. No model versioning. No monitoring in production. Model governance treated as a research function rather than a compliance function.
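"Model versioning and rollback capabilities" can be made concrete: every deployed model version should carry its training-data reference and evaluation results, and reverting to the prior version should be one operation, not an incident-response scramble. A minimal in-memory sketch (a registry service would back this in production; all names are illustrative):

```python
class ModelRegistry:
    """Minimal versioned model registry with rollback: a sketch, not a product."""

    def __init__(self):
        self.models = {}    # version -> metadata (training data ref, eval results)
        self.history = []   # promotion history, newest last

    def register(self, version: str, metadata: dict) -> None:
        self.models[version] = metadata

    def promote(self, version: str) -> None:
        if version not in self.models:
            raise KeyError(f"unknown model version: {version}")
        self.history.append(version)

    def active(self):
        return self.history[-1] if self.history else None

    def rollback(self) -> str:
        """Revert to the previously promoted version."""
        if len(self.history) < 2:
            raise RuntimeError("no previous version to roll back to")
        self.history.pop()
        return self.active()

reg = ModelRegistry()
reg.register("v1", {"training_data": "corpus-2024-q4", "bias_audit": "passed"})
reg.register("v2", {"training_data": "corpus-2025-q1", "bias_audit": "passed"})
reg.promote("v1")
reg.promote("v2")
reg.rollback()
# the active model is back to v1, with its documentation intact
```

A company without even this much structure cannot answer where its production model came from, which is precisely the "no model documentation" red flag.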

The Data Supply Chain Audit

Beyond the 10 questions, sophisticated investors conduct a data supply chain audit that traces the complete lifecycle of data through the company’s systems and their third-party dependencies.

The audit maps:

  1. Data ingestion points: Every location where data enters the system (user inputs, API integrations, web scraping, data partnerships, third-party datasets)
  2. Processing nodes: Every system that processes, transforms, or stores data (inference servers, training pipelines, analytics platforms, logging systems, backup systems)
  3. Third-party dependencies: Every external service that receives data (AI providers, cloud infrastructure, monitoring tools, analytics services, payment processors)
  4. Data outputs: Every location where data leaves the system (API responses, reports, exports, regulatory disclosures, breach notifications)
  5. Retention and deletion: The lifecycle of data from creation to deletion at each node in the supply chain

The audit should produce a comprehensive data map that the investor can evaluate against the company’s stated privacy policies, contractual commitments, and regulatory requirements. Discrepancies between the map and the stated practices are the highest-signal finding in AI due diligence.
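The discrepancy check itself is simple once the data map exists: compare what each node actually does with data against what the stated policy permits. A minimal sketch using retention as the example dimension (all entries and limits below are hypothetical):

```python
# Hypothetical audit input: (node, data_category, observed_retention_days),
# as produced by the data supply chain mapping exercise.
DATA_MAP = [
    ("inference_server", "prompts", 1),
    ("logging_service", "prompts", 365),
    ("analytics_vendor", "metadata", 30),
]

# Stated policy: maximum retention per data category, per the privacy policy.
POLICY = {"prompts": 30, "metadata": 90}

def discrepancies(data_map, policy):
    """Nodes whose observed retention exceeds the stated policy limit."""
    return [
        (node, category, observed)
        for node, category, observed in data_map
        if observed > policy[category]
    ]

# here the logging service keeps prompts for a year against a 30-day policy
```

Each hit from a check like this is a map-versus-policy discrepancy, the highest-signal finding described above. The same comparison generalizes to encryption standards, disclosed recipients, and jurisdictions.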

Red Flags That Should Kill a Deal

Certain findings during AI due diligence should be deal-breakers:

  • Training on user data without consent mechanisms: Regulatory liability is near-certain under GDPR and likely under emerging U.S. state privacy laws
  • No data processing agreements with AI providers: The company’s data is being processed under consumer terms of service, with no contractual protection
  • Copyrighted training data without legal analysis: The wave of training data lawsuits creates existential litigation risk
  • No security assessment or SOC 2 progress: Data breach risk is unquantified
  • Processing children’s data without COPPA compliance: Strict liability regulatory regime with FTC enforcement
  • Operating in EU without GDPR compliance infrastructure: The fines are real and growing
  • No model documentation: If the company cannot explain how its model works, it cannot comply with the EU AI Act’s transparency requirements

The AI investment boom will produce extraordinary returns for investors who back the right companies. But it will also produce spectacular failures – companies destroyed not by product failure but by data practice failures that were discoverable at the due diligence stage. The 10 questions above are not a guarantee against that outcome, but they are the minimum standard for responsible AI investment. The companies that can answer them clearly – the ones building on architectures that make privacy a structural guarantee rather than a policy aspiration – are the ones worth backing.

The Stealth Cloud Perspective

Due diligence that examines only a company’s privacy policy is due diligence that misses the point. Architecture is policy. The question is not what an AI company promises to do with your data – it is what the system makes technically possible. Investors who understand the difference between contractual data protection and structural data protection will be the ones who avoid the next wave of AI privacy catastrophes.