AI Security

Before You Ship That AI Feature: A Practical Pre-Launch Security Review

Shipping an LLM-powered feature without a security review is not a calculated risk — it is an undiscovered incident waiting to happen. Here is how to run the review that actually catches what matters.

HarmonyX Team May 4, 2026 · 10 min read

Before You Ship That AI Feature: A Practical Pre-Launch Security Review

On this page

In plain terms: an AI security review is a structured check of everything that can go wrong when a large language model (LLM) is embedded in a customer or financial workflow — before the feature is live. It covers how attackers can manipulate the model through crafted input, how sensitive data can leak through model output, how a compromised retrieval pipeline can poison what the model believes, and how the blast radius of a tool-using agent can extend far beyond what the product team intended. Done before launch, it is a relatively contained engineering task. Done after the first incident, it is a crisis response.

The teams that skip it are not reckless — they are usually fast-moving and under real delivery pressure. The problem is that LLM-specific risk does not behave like traditional application security risk. A penetration test that checks your API for SQL injection, authentication bypass, and OWASP Top 10 web vulnerabilities will not find a prompt injection path, a jailbreak surface, or a data exfiltration route through a retrieval pipeline. The threat model genuinely changed when the feature started generating language, and the review process needs to reflect that.

What changes in the threat model when a feature uses an LLM

Traditional application security assumes that the code you ship does what you wrote. An LLM-powered feature does not — it generates behaviour at inference time based on inputs you cannot fully enumerate in advance. That shift creates four categories of risk that do not exist in conventional software. First, prompt injection: an attacker supplies text that overrides or extends the system instructions controlling the model, bypassing controls that look airtight in code review. Second, jailbreak: the attacker iteratively constructs an input that causes the model to violate its own operational constraints — producing harmful output, disclosing internal configuration, or impersonating another identity. Third, sensitive output disclosure: the model surfaces data from its training corpus, from prior conversation turns in a shared context window, or from retrieved documents — sometimes in response to entirely benign queries. Fourth, supply-chain risk at the model and runtime layer: if your feature depends on a third-party model endpoint, a managed inference provider, or an open-weight model you downloaded and host yourself, you inherit the security posture of that dependency in ways that are not yet standardised across the industry.

The OWASP LLM Top 10 as a working framework

The OWASP LLM Top 10 — a community-maintained list of the ten most critical security risks specific to LLM-based applications — is the clearest starting framework available for a pre-launch review. It is not a compliance checklist; it is an enumeration of the attack classes that are consistently exploited in production. The five categories that surface most often in real engagements, and that your review must address explicitly, are as follows.

LLM01 — Prompt Injection: the attacker embeds adversarial instructions in user-controlled input that override the system prompt. Direct injection comes through the user turn. Indirect injection arrives through external content the model processes — documents, web pages, or database records fetched by a RAG (Retrieval-Augmented Generation) pipeline.
LLM02 — Insecure Output Handling: model output flows into downstream components — rendered HTML, an SQL query, a shell command, an API call — without sanitisation. A model that generates plausible-looking SQL or JavaScript is only safe if that output is treated as untrusted input to the consuming system.
LLM06 — Sensitive Information Disclosure: the model reveals PII, internal system details, credentials, or proprietary content from training data or in-context documents. This is not always a deliberate attack — benign queries sometimes elicit disclosure through completion patterns the model learned during training.
LLM08 — Excessive Agency: the model has access to tools or APIs that exceed what the task strictly requires. When a model can read email, write to a database, and trigger webhooks in a single session, a successful injection in any one input can chain those capabilities in ways the designer never intended.
LLM03 — Training Data Poisoning / LLM09 — Misinformation: for teams using fine-tuning or RAG, the integrity of the data the model learned from — or is currently retrieving — is a security boundary. Poisoned training data or poisoned retrieval corpora produce systematically wrong outputs that can mislead users in consequential workflows.

Indirect prompt injection: the pattern that catches teams off-guard

Direct prompt injection is well-understood and most teams have at least some mitigation in place for it — input filtering, system-prompt hardening, output monitoring. Indirect prompt injection is the pattern that consistently catches engineering teams off-guard, and it is significantly harder to defend against because the malicious instruction does not come from the user — it comes from content your application fetches on the user's behalf. A RAG pipeline retrieves documents from a knowledge base, an email inbox, a CRM, or a web search result. Any of those documents can contain adversarial text that, when included in the model's context window, instructs the model to perform an action the user did not request — exfiltrating data to an external endpoint, summarising a different document than the one requested, or generating a response that social-engineers the next human in the workflow. The mitigation is not to stop using RAG. It is to treat all retrieved content as untrusted, to implement retrieval-source allowlisting, and to design output pipelines that verify the model's claimed actions against the original user request before executing them.

Tool-using agents and the principle of least privilege

The blast radius problem scales directly with tool access. An agent that has read-only access to a single knowledge base has a contained blast radius. An agent that can read files, call internal APIs, send emails, and execute database writes has a blast radius that spans your entire organisation's writable surface. The principle of least privilege applies to model permissions exactly as it applies to user accounts and service roles — the model should have exactly the access required for the specific task it is asked to perform, and nothing more. Before launch, map every tool your agent can invoke, the maximum data scope it can access through each tool, and the worst-case action chain a successfully injected instruction could trigger. That map is both your security design artefact and the input to your adversarial test cases.

Human-in-the-Loop checkpoints are the complement to least-privilege scoping. For high-consequence actions — writing to production data, sending external communications, executing financial transactions — require an explicit human confirmation step that cannot be bypassed by model output alone. The model proposes; the human authorises. This is not a UX concession; it is a security control.

Eval-in-CI: the operational outcome of the review

A one-time security review before launch is better than nothing. It is not good enough. The threat model for an LLM feature evolves every time the system prompt changes, the model version is updated, a new retrieval source is added, or a new tool is connected. The findings from your pre-launch review should translate directly into adversarial test cases — specific payloads, edge-case inputs, and injection attempts — that run automatically in your CI pipeline on every relevant change. This is Eval-in-CI: treating adversarial evaluation as a gating test, not a periodic audit. A failing eval blocks the merge. A passing eval gives you a documented basis for the claim that the change did not regress the security properties of the feature.

The teams that catch regressions in CI ship model updates weekly without incident — the teams that do not discover them through customer complaints.

Governance: ISO 27001 ISMS and ISO/IEC 42001 as the AI extension

Security review findings need to live somewhere auditable. ISO 27001 — the standard that defines an Information Security Management System (ISMS), the governance framework for identifying, treating, and tracking information security risks across an organisation — provides the structure. Findings from an AI security review are risk records. They require an owner, a treatment decision, evidence of remediation, and a re-assessment date. An organisation with a functioning ISMS processes these exactly like any other information security finding: into the risk register, treated under the risk-treatment plan, evidenced for audit. ISO/IEC 42001 is the AI-specific extension of that framework — it adds controls for AI risk management, AI lifecycle governance, and the documentation obligations that regulators are beginning to require. If your organisation is working toward ISO/IEC 42001 alignment, the pre-launch review is the natural entry point: the Threat Model and adversarial test results are core artefacts for the Technical File that both ISO/IEC 42001 and EU AI Act High-Risk requirements expect to see.

Regulatory context: GDPR, EU AI Act extraterritorial reach, and regional data-protection regimes

For teams building in Thailand, Singapore, Indonesia, Malaysia, and Vietnam, the compliance picture layered on top of the security review is concrete. PDPA in Thailand imposes breach notification within 72 hours — a window that is impossible to meet without pre-built Incident Response documentation and an auditable log of when a compromise was first detected. Singapore MAS TRM requires documented continuous monitoring for financial services deployments. The EU AI Act's extraterritorial reach applies to any system that serves users in the EU or is used to make decisions affecting EU residents — regardless of where the model runs. For High-Risk applications under Annex III of the Act, a Technical File with documented risk assessment, test results, and monitoring procedures is a mandatory pre-market requirement, not a post-launch audit item. MITRE ATLAS — the adversarial threat landscape framework specific to AI and machine learning systems, maintained as a complement to MITRE ATT&CK — provides the taxonomy for documenting attack techniques in a format that regulators and enterprise security teams can cross-reference.

A practical checklist for the week before launch

This is not a comprehensive security programme — it is the minimum that teams shipping AI features into production workflows should be able to confirm before the switch goes on.

Threat model exists and covers: prompt injection (direct and indirect), sensitive output, tool-use blast radius, retrieval-pipeline integrity, and model supply chain. It is a written document, not a mental model.
System prompt and output-handling logic has been reviewed for injection paths. Retrieved content is treated as untrusted. Output that flows into code execution, database writes, or external calls is sanitised and validated.
Agent tool permissions are scoped to task-minimum. High-consequence actions require Human-in-the-Loop confirmation. Tool definitions have been reviewed against the OWASP LLM08 (Excessive Agency) criteria.
Adversarial test cases exist and are running in CI. At minimum: direct injection attempts against the system prompt, indirect injection through the retrieval pipeline, and known jailbreak patterns relevant to the model and task. Failure blocks merge.
Findings are documented and in the risk register. The treatment decision — accept, mitigate, or transfer — is recorded with an owner and a target date. If your organisation has an ISO 27001 ISMS, this is a standard risk record. If it does not, this is the beginning of one.

Do this before you ship — not after the first incident

The practical recommendation for engineering teams is simple and has nothing to do with compliance timelines: a pre-launch AI security review takes days. A post-incident response — with notification obligations under PDPA, evidence requests from sector regulators, and the reputational damage to a customer-facing AI feature — takes months. The cost differential is not close. Teams that have shipped LLM-powered features into financial and healthcare workflows consistently report that the review surfaces issues that neither the engineering team nor the product team had anticipated, because the failure modes are not intuitive to people who have spent their careers reasoning about conventional software. The threat model is genuinely different, and the only way to find out what it looks like for your specific feature is to look.

How HarmonyX can help

The HarmonyX AI Security & LLM Risk Review service runs the full scope described in this post against your actual application — Threat Model against OWASP LLM Top 10 and MITRE ATLAS, adversarial red-team testing with reproducible payloads, Eval-in-CI suite handoff, and findings documented under our ISO 27001 ISMS for audit readiness. If your team is in the final weeks before an AI feature launch, or has already shipped and wants to close the gaps, the place to start the conversation is harmonyx.co/services/ai-security.

#AI Security#LLM Risk#OWASP LLM Top 10#MITRE ATLAS#ISO/IEC 42001