AST vs regex secret scanning: what each approach catches and misses

Regex matches known key formats and entropy scoring catches high-randomness strings, but both miss generic passwords and raise false positives on ordinary code. Parser-based scanning reads code structure and catches what patterns miss, while regex stays the better choice for fixed-format keys.


A secret scanner reads source code and decides which strings are credentials. Some, such as a provider API key, have fixed formats; others, like a password or a custom token, don't. Regex and parser-based scanning differ in what they catch and when to use each.

How do secret scanners work?

There are three methods for detecting secrets: regular expressions that match known credential formats, entropy that measures how random a string is, and parsing that reads the code's structure. Each finds secrets the others miss.

The volume of leaked secrets is large. GitGuardian counted over 28 million new hardcoded secrets in public GitHub commits during 2025, up 34% on the year before, with leaks of AI-service keys rising by 81%. [1] A leaked credential cannot be made secret again: on public repositories it's found and used within minutes, and deleting it from a later commit doesn't remove it from the git history. The job of a scanner is to catch the secret before it ships, which means it has to be both thorough and precise: thorough enough to catch the secrets that don't look like keys, and precise enough that a real finding isn't buried among false positives.

No single method is both thorough and precise everywhere. Regex is precise on credentials with a fixed format and catches nothing else. Entropy catches a little more but can't tell a secret from any other random-looking string. Parsing uses context that neither method can use, at the cost of a parser per language. The rest of this article works through each method, and the practical answer is to layer them rather than pick one. Trestle, for example, runs a fixed-pattern layer over the raw bytes of every file and a per-language parser over the same files, so it keeps the strengths of patterns and adds the strengths of structure.

What is regex-based secret scanning?

Regex (regular expression) secret scanning matches a file's text against fixed patterns for known credential formats. It's fast, deterministic, and works on any file as plain text. For credentials with a distinctive prefix it's close to exact, which is why it's the foundation of almost every scanner.

Many providers deliberately give their tokens a fixed, recognizable form. A GitHub token starts with ghp_, gho_, or ghs_; an AWS access key ID starts with AKIA or ASIA; a Stripe secret key starts with sk_live_. When GitHub redesigned its token formats in 2021 it chose the underscore precisely because it isn't a base64 character, so a random string can't collide with the prefix, and it added a checksum so a scanner can validate a token offline. GitHub expected the prefix alone to cut secret scanning false positives to around 0.5 percent. [2] This is regex working at its best, and no parser improves on it.

Patterns are also the industry default. GitHub's secret scanning partner program asks each provider to supply a regular expression that matches its token format. [3] Gitleaks uses keyword-prefiltered regular expressions with optional per-rule entropy. [4] detect-secrets and TruffleHog both start from patterns as well. If every secret looked like ghp_, there would be little more to say.

# a regex matches the distinctive prefix, whatever follows it
GITHUB_TOKEN=ghp_xhBlNwqeA9vCWZsP73y7C5a3Qn602UgGlBSX
AWS_ACCESS_KEY_ID=AKIA0YKBVOONEQRF8Y5T
STRIPE_SECRET_KEY=sk_live_tcGo9oJCh5jQG3OHI6cB9Aw9
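A pattern layer of this kind can be sketched in a few lines of Python. The three expressions below are simplified illustrations written for this article, not the rules any particular scanner ships; the lengths are assumptions chosen to fit the sample values above.

```python
import re

# Simplified, illustrative patterns. Real rule sets are longer and stricter,
# and the lengths here are assumptions that fit the sample values.
PATTERNS = {
    "github_token": re.compile(r"\bgh[pos]_[A-Za-z0-9]{36}\b"),
    "aws_access_key_id": re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b"),
    "stripe_secret_key": re.compile(r"\bsk_live_[A-Za-z0-9]{24,}\b"),
}


def scan_text(text: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_text) for every pattern hit in the text."""
    findings = []
    for rule_name, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((rule_name, match.group(0)))
    return findings


sample = """
GITHUB_TOKEN=ghp_xhBlNwqeA9vCWZsP73y7C5a3Qn602UgGlBSX
AWS_ACCESS_KEY_ID=AKIA0YKBVOONEQRF8Y5T
STRIPE_SECRET_KEY=sk_live_tcGo9oJCh5jQG3OHI6cB9Aw9
password = "s3cr3tPaSSw0rd"
"""

for rule_name, value in scan_text(sample):
    print(rule_name, value)
```

The three prefixed values are reported and the password line is not, because nothing about its text matches a known format.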

What are the limitations of regex-based secret scanning?

Regex fails where the secret has no recognizable form to match. A custom token or an ordinary password carries no prefix and no fixed length, so there's no pattern to write. Worse, a regex matches a flat run of characters and not the meaning of the line. And as the number of credential formats grows, the rule set has to grow with it and be kept current. The problem isn't regex itself. The problem is regex on its own: it reads the text but not the surrounding code.

What is entropy-based secret detection?

Entropy-based detection flags strings whose characters look random. Scanners measure Shannon entropy, a number for how unpredictable a string is, and report anything above a threshold on the premise that generated keys look random and ordinary text doesn't. detect-secrets, for example, defaults to a base64 entropy limit of 4.5 and a hex limit of 3.0.

Entropy exists to catch the secrets that no pattern can match. Many real keys are long random strings, and a randomness threshold catches them without a provider-specific rule. But randomness is also common in code that holds no secrets. A UUID, a Subresource Integrity hash, a Git commit SHA, and any base64-encoded asset are all high-entropy, and none of them is a credential. detect-secrets' own source notes that on all-digit input its false positives greatly exceed its true positives, which is why the tool is built to run against a reviewed baseline rather than on its own. [5]

The effect is visible in a four-scanner comparison we ran on four codebases with default settings. On one repository of data-science notebooks, the entropy-based tool returned over 7,000 findings, roughly 850 times the other scanners, nearly all of them long random-looking notebook output. On a clean WordPress release it raised 180 false positives where Trestle reported nothing. Those numbers come from the no-baseline default, but they show what a randomness threshold does when it runs on a codebase full of random-looking non-secrets.

# high entropy, but not a secret
checksum = "d8e8fca2dc0f896fd7cb4cb0031ba249"
integrity = "sha256-RFWPLDbv2BY+rCkDzsM+0NXmKy79TStK/6E7Vh9Y0="

Can entropy detection find passwords and low-entropy secrets?

No. Entropy detection systematically misses the secrets that aren't random. A password such as s3cr3tPaSSw0rd isn't random enough for an entropy check to flag, even though it could be a real credential. Entropy is the one property these credentials lack, so the method that depends on it can't detect them.

The two failure directions are shown together below. The first value is a real secret that both regex and entropy miss, because it has no prefix and no randomness. The second is a harmless value that an entropy check flags, because it's random even though it's only a checksum.

# not random enough for entropy, but a real password
password = "s3cr3tPaSSw0rd"

# high entropy, but harmless; an entropy check flags it
checksum = "d8e8fca2dc0f896fd7cb4cb0031ba249"
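Both directions can be reproduced with a short Shannon entropy calculation. This is a minimal sketch: the two limits are the detect-secrets defaults quoted earlier, while the character-set test that picks between them is a simplification assumed for illustration, not that tool's actual logic.

```python
import math
from collections import Counter

# Limits quoted earlier in the article as detect-secrets' defaults.
HEX_LIMIT = 3.0
BASE64_LIMIT = 4.5
HEX_CHARS = set("0123456789abcdefABCDEF")


def shannon_entropy(value: str) -> float:
    """Shannon entropy of the string, in bits per character."""
    if not value:
        return 0.0
    counts = Counter(value)
    total = len(value)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def entropy_flags(value: str) -> bool:
    """Assumed simplification: use the hex limit only for all-hex strings."""
    limit = HEX_LIMIT if set(value) <= HEX_CHARS else BASE64_LIMIT
    return shannon_entropy(value) > limit


for value in ("s3cr3tPaSSw0rd", "d8e8fca2dc0f896fd7cb4cb0031ba249"):
    print(value, round(shannon_entropy(value), 2), entropy_flags(value))
```

The password scores roughly 3.4 bits per character and stays under the 4.5 limit, so it is not flagged. The checksum scores roughly 3.8, above the 3.0 hex limit, so it is.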

In our four-scanner comparison, passwords and committed password hashes were caught by the parser-based scanner and largely missed by the pattern and entropy tools; the parser-based scanner was the only one of the four to catch the hashes. Randomness is a property of some secrets, not what defines a secret. To find the rest you have to consider where the string appears in the code.

What is AST (parser-based, language-aware) secret detection?

AST-based secret detection parses code into an abstract syntax tree (AST), the structured representation of variables, assignments, calls, and string literals that a compiler builds, and then classifies each value by its role in that structure. It's also called parser-based or language-aware detection. Instead of matching bytes, it asks what a string is in the program: the value assigned to a name, the argument passed to a call, the field exported from a module.

To make structure concrete, take one line of code:

db_password = "summer2024"

A parser turns it into a tree that records what the line is, not only the characters in it:

Assignment
  target: Identifier "db_password"
  value:  StringLiteral "summer2024"

Now the scanner has explicit structure to work with: the string summer2024 is the value assigned to a name called db_password. A regex never has this. It matches ten characters and a pair of quotes. A parser recovers that structure from the source. Tree-sitter, for instance, turns code into a syntax tree across a wide range of languages, [6] and the scanner reads the tree instead of the raw text.

Trestle is built this way. It uses a real parser for each language, plus schema-aware readers for configuration formats, and each language has its own analyzer that walks the tree and classifies the values it finds. The analysis is deterministic and rule-based: there's no model and no training data, so every finding can be explained by the rule that produced it.

Reading structure is what lets a scanner distinguish a credential from a reference to one. A literal assigned to a field named api_key is a candidate; process.env.API_KEY is a reference and is not. A value under the password key of a connection string is a password; the same characters in a file path aren't. These are distinctions about syntax, and only a parser can make them.
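The literal-versus-reference distinction can be sketched with Python's built-in ast module. This is an illustration of the idea only, not how Trestle or any other scanner is implemented; the SENSITIVE_WORDS list and the find_candidates helper are hypothetical names introduced for the example.

```python
import ast

# Hypothetical word list for the example; a real scanner's list would be longer.
SENSITIVE_WORDS = ("password", "passwd", "secret", "token", "api_key")


def find_candidates(source: str) -> list[tuple[str, str]]:
    """Return (name, literal) for string literals assigned to sensitive names."""
    candidates = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assign):
            continue
        value = node.value
        # Only a string literal counts as a value. A call, attribute, or
        # subscript such as os.environ["API_KEY"] is a reference, so skip it.
        if not (isinstance(value, ast.Constant) and isinstance(value.value, str)):
            continue
        for target in node.targets:
            if isinstance(target, ast.Name) and any(
                word in target.id.lower() for word in SENSITIVE_WORDS
            ):
                candidates.append((target.id, value.value))
    return candidates


source = '''
import os
db_password = "summer2024"
api_key = os.environ["API_KEY"]
log_path = "/var/log/password-reset.log"
'''

print(find_candidates(source))  # [('db_password', 'summer2024')]
```

Only db_password is reported. The api_key line is skipped because its value is a reference, and the log_path line is skipped because the word password appears in the value, not in the name the value is assigned to.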

AST vs regex secret scanning: a side-by-side comparison

Regex is stronger on prefixed tokens and on files with no grammar. AST is stronger on generic secrets, on context, and on suppressing false positives. Entropy is a narrow refinement, useful as a guarded check and weak as a strategy on its own. The table below lays out the trade-offs. No single method is best everywhere, which is why the strongest approach combines pattern matching and parsing.

| | Regex and entropy | AST |
| --- | --- | --- |
| Prefixed provider tokens (ghp_, AKIA, sk_live_) | Caught. Its core strength. | Caught. The pattern layer runs alongside the parser. |
| Generic or custom tokens with no fixed format | Missed. There's no pattern to match. | Caught by the name it's assigned to and the value format. |
| Plain-word and low-entropy passwords | Mostly missed. | Caught by the name around the value. |
| Hashed passwords in config (bcrypt, Argon2, etc.) | Usually missed by default. | Caught by the hash format. |
| High-entropy non-secrets (UUIDs, integrity hashes, Git SHAs) | False positives once entropy is enabled. | Cleared by recognizing the format and the name. |
| References and placeholders (${process.env.X}, your-api-key-here) | Read as flat text, not suppressed. | Suppressed. |
| Credentials inside a connection-string URL | Usually missed. | Caught by the password's position in the URL. |
| Secrets exposed to the browser bundle | Missed. | Caught for the languages it tracks data flow in. |
| Works on any file as plain text | Yes, even on files with no grammar. | Only where a parser or pattern layer applies. |
| Deterministic and explainable | Yes. | Yes, rule-based. |
| Speed | Fast. | Comparable in a well-built scanner. |

Why does my secret scanner flag so many false positives?

Most false positives come from looking at a string's characters alone. High-entropy non-secrets such as UUIDs, integrity hashes, base64-encoded data, and Git SHAs trip a randomness check; placeholders, example values, and publishable keys can match a credential pattern. A text-level filter can't distinguish any of these from a real credential.

The cost isn't only wasted review time. Alert fatigue from a high false-positive rate is a documented, peer-reviewed problem: a high volume of false positives trains people to dismiss findings, and the one real finding is dismissed with the rest. [7] A scanner that reports too many false positives is uninstalled, allow-listed until it reports nothing, or skipped. Precision is not optional for this kind of tool. Without it, the tool isn't used at all.

Reading code structure suppresses each of these causes. A parser-based scanner can recognize a placeholder by its value, and it can recognize the format of a non-secret, such as an integrity hash, a semantic version, an email address, or a cloud resource name, and rule it out. Trestle does all of these, and it gates its entropy check behind both an absolute and an alphabet-normalized threshold while rejecting natural-language strings and keyboard walks like qwertyuiop, so the randomness check runs far less often. In the four-scanner comparison, on a clean WordPress 7.0 release, Trestle reported zero findings, the correct result, while the entropy-based tool raised 180.

# both correctly suppressed: a placeholder and a public key
API_KEY=your-api-key-here
STRIPE_PUBLISHABLE_KEY=pk_live_4eC39HqLyjWDarjtT1zdp7dc
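A value-level guard for these two cases might look like the sketch below. The hint list and the classify_value helper are hypothetical and deliberately small; they illustrate the idea of ruling values out before reporting and are not taken from any scanner's rule set.

```python
# Hypothetical hint lists for the example; real lists would be much longer.
PLACEHOLDER_HINTS = ("your-", "your_", "example", "changeme", "placeholder")
PUBLIC_PREFIXES = ("pk_live_", "pk_test_")  # Stripe publishable key prefixes


def classify_value(value: str) -> str:
    """Return 'placeholder', 'public', or 'candidate' for a single value."""
    lowered = value.lower()
    if any(hint in lowered for hint in PLACEHOLDER_HINTS):
        return "placeholder"
    if value.startswith(PUBLIC_PREFIXES):
        return "public"
    # Anything else still has to go through the real credential checks.
    return "candidate"


for value in (
    "your-api-key-here",
    "pk_live_4eC39HqLyjWDarjtT1zdp7dc",
    "sk_live_tcGo9oJCh5jQG3OHI6cB9Aw9",
):
    print(value, "->", classify_value(value))
```

The first two values are suppressed and only the secret key remains a candidate, even though all three would look similar to a filter that reads characters alone.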

Which secrets does regex miss that an AST scanner catches?

The misses cluster in a few patterns: generic and custom tokens with no prefix, plain-word and hashed passwords, credentials buried in a connection-string URL, secrets split across joined string literals, and config values that end up in the browser. None of these has a fixed format for a pattern to find, and the strongest signal for each is structural.

A credential inside a URL is identified by position, not by randomness or a keyword. URL grammar says the characters between the colon and the at-sign in the userinfo component are a password, whatever they are. A scanner that parses the connection string flags it; one reading characters has no structural signal to use.

# the password is in the URL userinfo, not a password field
DATABASE_URL: postgres://app:s3cr3tPaSSw0rd@db.internal:5432/prod
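Reading the password by position needs nothing more than a URL parser. The sketch below uses the standard library's urlsplit; the password_in_url helper is a hypothetical name, and the check is a minimal illustration that assumes the value has already been extracted from the config file.

```python
from typing import Optional
from urllib.parse import urlsplit


def password_in_url(value: str) -> Optional[str]:
    """Return the userinfo password if the value is a URL that carries one."""
    try:
        parts = urlsplit(value)
    except ValueError:
        return None  # not parseable as a URL
    if not parts.scheme or not parts.netloc:
        return None  # a bare path or word, not a connection string
    return parts.password or None


print(password_in_url("postgres://app:s3cr3tPaSSw0rd@db.internal:5432/prod"))
print(password_in_url("postgres://db.internal:5432/prod"))
print(password_in_url("/var/backups/app:s3cr3tPaSSw0rd@snapshot"))
```

Only the first call returns a password. The other two contain no userinfo password or are not URLs at all, so the same characters in a file path are left alone.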

A secret split across two literals never appears as one run of text, so a contiguous pattern can't match it. A scanner that folds adjacent and joined literals before classifying reconstructs the value the program would build.

# split across two literals: no single run of text to match
const key = "sk_live_" + "tcGo9oJCh5jQG3OHI6cB9Aw9";
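Folding can be sketched on a syntax tree in a few lines. The example uses Python's ast module on the Python equivalent of the JavaScript line above, since the tree shape (a + operation over two string literals) is the same in both languages; fold_string is a hypothetical helper, not a function from any scanner.

```python
import ast
from typing import Optional


def fold_string(node: ast.expr) -> Optional[str]:
    """Return the string an expression builds if it is only literals joined by +."""
    if isinstance(node, ast.Constant) and isinstance(node.value, str):
        return node.value
    if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
        left = fold_string(node.left)
        right = fold_string(node.right)
        if left is not None and right is not None:
            return left + right
    return None  # a variable, call, or other non-literal part is involved


source = 'key = "sk_live_" + "tcGo9oJCh5jQG3OHI6cB9Aw9"'
assignment = ast.parse(source).body[0]
folded = fold_string(assignment.value)
print(folded)
```

Once folded, the value is a single string that the ordinary pattern layer can match, so the split no longer hides the prefix.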

The clearest structure-only case is a secret that reaches the browser. A config value isn't dangerous because of how it looks but because of where it goes, and only a scanner that follows the value into a client-rendered output, or recognizes a framework variable like NEXT_PUBLIC_ or VITE_ that's bundled into the page, can determine it's exposed.
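The framework-prefix half of that check can be approximated on an env file. This sketch is a name-only heuristic under stated assumptions: the SENSITIVE_WORDS list and the exposed_names helper are hypothetical, and it does not follow data flow into rendered output the way the paragraph above describes.

```python
# Prefixes that Next.js and Vite expose to client-side code.
CLIENT_PREFIXES = ("NEXT_PUBLIC_", "VITE_")
# Hypothetical word list for the example.
SENSITIVE_WORDS = ("SECRET", "PASSWORD", "TOKEN", "PRIVATE")


def exposed_names(env_text: str) -> list[str]:
    """Return env var names that are client-exposed and sound sensitive."""
    exposed = []
    for raw_line in env_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        if (
            name.startswith(CLIENT_PREFIXES)
            and value.strip()
            and any(word in name for word in SENSITIVE_WORDS)
        ):
            exposed.append(name)
    return exposed


env_text = """
NEXT_PUBLIC_API_URL=https://api.example.com
NEXT_PUBLIC_STRIPE_SECRET_KEY=sk_live_tcGo9oJCh5jQG3OHI6cB9Aw9
STRIPE_SECRET_KEY=sk_live_tcGo9oJCh5jQG3OHI6cB9Aw9
VITE_DB_PASSWORD=summer2024
"""

print(exposed_names(env_text))
```

The same Stripe key is reported under the NEXT_PUBLIC_ name and not under the server-only name, which is the point: the value is identical, and only where it goes differs.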

Is AST-based scanning slower than regex?

Not necessarily. Parsing is more work per file than matching text, so in principle it costs more. In practice a well-built parser-based scanner can be as fast as, or faster than, the pattern tools.

The four-scanner comparison shows this. Trestle was the fastest of the four scanners on three of the four codebases and close to the fastest on the smallest. The entropy-based tool was the slowest, taking tens of seconds on the largest codebase. Speed depends on the implementation, but parsing doesn't inherently make a scanner slower.

When is regex enough, and when do you need a parser?

Regex, with entropy underneath it, is enough when the secrets you care about are prefixed provider tokens. That's when the format is what identifies the secret, a pattern matches it every time, and there's no structure to read. If that's your whole threat model, a pattern tool is simple, fast, and a reasonable choice.

You need a parser once secrets are the kind without a fixed format: a custom token, a plain or hashed password, a credential in a connection string, a config value like an API key that might reach the browser. In real application code, those are most of the secrets present, and structure is what distinguishes real secrets from false positives. The strongest architecture isn't a parser instead of patterns, but a parser combined with patterns. Trestle runs its fixed-pattern layer across the raw bytes of every file, so a known token flattened onto one line is still caught, and runs a per-language parser over the same files, so everything else is read in context. This keeps what patterns are best at and adds what only structure can identify.

How does structural classification cut false positives?

Structural classification cuts false positives by sorting each candidate value by what it is and where it's used, then suppressing the categories that aren't credentials: public keys, placeholders, and random-looking non-secrets. Trestle, for example, classifies every candidate value into one of four categories and reports at one of two severities. The categories make the report readable, and the severities say how the value was identified.

  • Secret: a credential identified by its own value, such as a recognized provider token, a password hash, or a card number that passes the Luhn check.
  • Possible secret: a value that looks like a credential by name and format but can't be determined with certainty, such as a high-entropy value assigned to a field called token.
  • Public: a value meant to be public, such as a Stripe publishable key or a Sentry DSN. Recognized and suppressed.
  • Placeholder: an example or filler value, such as your-api-key-here or AKIAIOSFODNN7EXAMPLE. Ruled out rather than reported.

Critical findings are identified by content alone, such as a private key format, a provider token signature, or a password hash. Warning findings are identified by context, such as a value under a sensitive name, a credential inside a URL, a recovery phrase, or a weak password. The split is a tendency rather than a strict rule: a card number, for example, is Critical when it sits in a named field and a Warning when it is found loose in file content.

The precision comes from structure used in several ways at once: public credentials and placeholders are recognized and suppressed; non-secret formats like integrity hashes and version strings are ruled out; the entropy check is performed on unqualified names and is paired with word-list, keyboard-walk, and natural-language guards; references and template expressions are read as references, not values; and CI and config formats, including GitHub Actions, GitLab CI, docker-compose, and package.json, are parsed for what each field means before the generic scan runs.

Detecting more strings is easy. Detecting the correct ones is the harder task. A scanner that returns too many findings is ignored, and the real finding is missed among the false positives. Reading code structure doesn't report more strings; it reports the ones that matter and leaves out the rest. The four-scanner comparison shows what that looks like in numbers.

Frequently asked questions

What is the difference between AST and regex secret scanning?

Regex scanning matches text against fixed patterns for known credential formats. AST scanning parses code into an abstract syntax tree and classifies a string by its role, meaning what it's assigned to, passed into, or exported as, rather than its characters alone. Regex is best for prefixed tokens; AST catches generic and context-dependent secrets.

Why does my secret scanner flag false positives?

Most false positives come from looking at a string's characters alone. UUIDs, integrity hashes, Git SHAs, and base64 strings are random-looking but harmless, and placeholders or publishable keys look like real secrets to a pattern or entropy filter. A scanner that reads the surrounding code structure can rule these out.

Can entropy detection find passwords?

Only random ones. Entropy detection flags strings that look random, so a machine-generated password is caught, but passwords like s3cr3tPaSSw0rd or correct-horse have low entropy and stay under the usual thresholds. Randomness is a property of some secrets, not all of them, so an entropy check systematically misses ordinary passwords.

Can a secret scanner detect low-entropy secrets?

Yes, but not with entropy. A scanner that reads code structure can flag a low-entropy value based on where it's used, for example a literal assigned to a field named password or api_key, or a credential inside a connection-string URL, even if its characters aren't random.

What secrets do regex-based scanners miss?

Generic and custom tokens with no fixed prefix, plain-word and hashed passwords, credentials buried in connection-string URLs, secrets split across joined string literals, and config values that end up bundled into the browser. None of these has a distinctive format for a pattern to match.

Is AST-based secret scanning slower than regex?

Not necessarily. Parsing adds work per file, but parsing is fast on modern hardware and parsers can be reused across files. In a four-tool comparison, the parser-based scanner was the fastest on three of four codebases and close to fastest on the smallest.

What does language-aware or parser-based secret detection mean?

It means the scanner parses each file with a real parser for that language and analyzes the resulting syntax tree. It can tell an assignment target from a comment, a literal from a variable reference, and a server value from one that reaches the browser, which a flat text search can't do.

Do AST-based scanners need machine learning to reduce false positives?

No. AST analysis is deterministic and rule-based. It reduces false positives by reading code structure, such as assignment names, references, schemas, and value formats, with no model and no training data. That also makes every finding explainable rather than a probability score.

How do secret scanners tell a real key from a UUID or hash?

Pattern and entropy filters often can't, because a UUID, an integrity hash, and a Git SHA are all random-looking. A structural scanner recognizes those formats and the names around them, such as integrity, checksum, and id, and clears them, while still flagging a value assigned to a credential field.

Which is more accurate: regex, entropy, or AST-based detection?

It depends on the threat model, and the strongest tools combine them. Regex is precise on prefixed provider tokens, entropy is a narrow refinement, and AST analysis catches generic and context-dependent secrets while suppressing random-looking non-secrets. Layering a parser over patterns gives the best precision.

Sources

  1. GitGuardian, The State of Secrets Sprawl 2026.
  2. GitHub, Behind GitHub's new authentication token formats, April 2021.
  3. GitHub Docs, Secret scanning partner program.
  4. Gitleaks, rule configuration format and default ruleset.
  5. Yelp detect-secrets, high-entropy strings plugin.
  6. Tree-sitter, an incremental parsing system for programming tools.
  7. ACM Computing Surveys, Alert Fatigue in Security Operations Centres, 2025.