Inside the 20-Signal Detection Engine

How SpamAxe classifies spam backlinks using 20 independent signals across four analysis phases — with zero external API dependencies.

Zero External Dependencies

Most SEO tools depend on third-party APIs for their data. When those APIs change their terms, raise prices, or get shut down by lawsuits, the tools break. SpamAxe's detection engine uses 20 signals that are entirely self-owned. Every signal uses infrastructure SpamAxe fully controls: DNS queries, HTTP requests, public routing tables, and certificate transparency logs.

No API keys. No rate limits. No terms of service changes can break the product.

Phase 1: Pattern Analysis (Signals 1–4)

The first four signals require no network activity at all. They analyze the CSV data you uploaded and classify obvious spam instantly.

DGA Detection identifies algorithmically generated domain names. Real businesses don't name their websites xjk4rm2p.garden. SpamAxe's DGA detector measures character entropy, vowel-to-consonant ratios, and n-gram frequency to flag machine-generated strings.
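To make two of these measurements concrete, here is a minimal sketch of per-character entropy and vowel ratio. The function names and returned fields are illustrative assumptions rather than SpamAxe's actual code, and n-gram frequency is left out for brevity.

```python
import math
from collections import Counter

VOWELS = set("aeiou")

def shannon_entropy(label: str) -> float:
    """Bits of entropy per character; 0.0 for an empty label."""
    if not label:
        return 0.0
    total = len(label)
    counts = Counter(label)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def vowel_ratio(label: str) -> float:
    """Share of alphabetic characters that are vowels."""
    letters = [c for c in label if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c in VOWELS for c in letters) / len(letters)

def dga_features(domain: str) -> dict:
    """Hypothetical feature extractor: looks only at the leftmost label."""
    label = domain.lower().split(".")[0]
    return {
        "entropy": shannon_entropy(label),
        "vowel_ratio": vowel_ratio(label),
        "digit_ratio": sum(c.isdigit() for c in label) / max(len(label), 1),
    }
```

A label like xjk4rm2p has no vowels and no repeated characters, so it scores high on entropy and zero on vowel ratio. A dictionary-style name tends to land lower on the first and higher on the second.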

TLD Risk Scoring maintains a weighted list of TLDs that are disproportionately associated with spam. A link from a .com domain gets a neutral score. A link from .tattoo, .sbs, or .garden starts with a penalty.
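A weighted TLD list can be as simple as a lookup table. In the sketch below, the penalty values are made-up placeholders, not SpamAxe's real weights.

```python
# Hypothetical penalty points per TLD; anything unlisted is neutral.
TLD_PENALTY = {"tattoo": 15, "sbs": 15, "garden": 10}

def tld_penalty(domain: str) -> int:
    """Return the starting penalty for a domain based on its TLD."""
    tld = domain.lower().rstrip(".").rsplit(".", 1)[-1]
    return TLD_PENALTY.get(tld, 0)
```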

Raw IP Detection flags any source that's an IP address instead of a domain name. Legitimate backlinks come from websites with domain names. A link from 54.231.108.42 is almost certainly automated infrastructure.
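One way to sketch this check is with Python's standard ipaddress module. The function name is an assumption, and the input is assumed to be either a bare host or a full URL taken from the CSV.

```python
import ipaddress
from urllib.parse import urlsplit

def is_raw_ip(source: str) -> bool:
    """True if the source's host is an IP literal rather than a domain name."""
    # Prefix "//" so urlsplit treats a bare host as a network location.
    host = urlsplit(source if "//" in source else "//" + source).hostname or ""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False
```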

Link Ratio Fingerprinting analyzes the relationship between linking pages and target pages. Attackers tend to have consistent patterns — for example, always linking from exactly 2 pages to exactly 1 target page across hundreds of domains.
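The idea can be illustrated by counting how many domains share the same (linking pages, target pages) pair. The row layout and the min_domains cutoff below are assumptions chosen for the example.

```python
from collections import Counter

def ratio_fingerprints(rows, min_domains=3):
    """rows: iterable of (domain, linking_pages, target_pages) tuples.

    Returns the (linking_pages, target_pages) pairs shared by at least
    min_domains domains, with the number of domains showing each pair.
    """
    counts = Counter((linking, target) for _, linking, target in rows)
    return {pair: n for pair, n in counts.items() if n >= min_domains}
```

A pair that repeats across many unrelated-looking domains is the pattern described above. A pair that appears once or twice carries little information on its own.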

If pattern analysis alone scores a domain above 85% confidence, SpamAxe skips the remaining phases entirely. Why probe a domain that's already obviously spam?

Phase 2: DNS Intelligence (Signals 5–10)

Six signals probe the domain's DNS infrastructure using standard queries that are free and unlimited.

DNS Resolution checks whether the domain even resolves to an IP address. Dead domains that don't resolve are almost certainly spam — they were registered, used for the attack, and abandoned.

MX Records check for mail servers. Legitimate businesses almost always have email configured. A domain with no MX records is likely a throwaway.

Nameserver Clustering identifies shared infrastructure. When 200 spam domains all use the same nameserver, that's not a coincidence — it's the same attacker.
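Assuming the NS records have already been looked up, clustering reduces to grouping domains by their nameserver set. The function and parameter names here are illustrative. A production system would presumably also need to discount large shared DNS providers, which this sketch does not attempt.

```python
from collections import defaultdict

def cluster_by_nameserver(ns_records, min_size=2):
    """ns_records: dict mapping domain -> list of nameserver hostnames.

    Groups domains whose nameserver sets match after normalizing case
    and trailing dots, keeping only groups of at least min_size.
    """
    clusters = defaultdict(set)
    for domain, nameservers in ns_records.items():
        key = frozenset(ns.lower().rstrip(".") for ns in nameservers)
        clusters[key].add(domain)
    return {k: v for k, v in clusters.items() if len(v) >= min_size}
```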

Email Authentication checks for SPF, DKIM, and DMARC records. These take effort to configure. Their absence on a site that claims to be a business is a red flag.
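As a rough sketch, SPF and DMARC presence can be read from TXT records that have already been fetched: SPF lives at the domain itself and DMARC at _dmarc.<domain>. DKIM is omitted here because its records sit under <selector>._domainkey.<domain>, so checking it requires knowing or guessing a selector. The function name and input shape are assumptions.

```python
def email_auth_summary(apex_txt, dmarc_txt):
    """apex_txt: TXT strings found at the domain itself.
    dmarc_txt: TXT strings found at _dmarc.<domain>.
    """
    def has_version(records, tag):
        # A policy record begins with its version tag, e.g. "v=spf1".
        return any(r.strip().lower().startswith(tag) for r in records)

    return {
        "spf": has_version(apex_txt, "v=spf1"),
        "dmarc": has_version(dmarc_txt, "v=dmarc1"),
    }
```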

Reverse DNS Verification checks PTR records on IP addresses. Legitimate servers have proper reverse DNS configured. Bot farm instances often don't.

Community Threat Database cross-references every domain and IP against SpamAxe's shared intelligence database. If other users have already confirmed a source as spam, new users get near-instant classification.

Phase 3: HTTP Probing (Signals 11–16)

Six signals make lightweight HTTP requests to examine the domain's web presence.

HTTP Status Probe sends a HEAD request. A 200 response means the site is live. A 403 (access refused), a 500 (server error), or a connection timeout each points to a different state of the infrastructure behind the domain.

Server Header Analysis examines the server software. When 50 "different" domains all return nginx/1.18.0 (Ubuntu) with identical headers, they're running on the same template.
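One possible way to compare header sets across domains is to hash the headers that stay stable between requests. The list of volatile headers and the choice of a truncated SHA-256 digest are assumptions made for this sketch.

```python
import hashlib

# Headers that change per request or per page, so they are excluded (assumed list).
VOLATILE = {"date", "content-length", "etag", "last-modified",
            "set-cookie", "age", "expires"}

def header_fingerprint(headers: dict) -> str:
    """Short, order-independent fingerprint of the stable response headers."""
    stable = sorted(
        (name.lower(), value.strip())
        for name, value in headers.items()
        if name.lower() not in VOLATILE
    )
    blob = "\n".join(f"{name}:{value}" for name, value in stable)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:16]
```

Domains that return the same fingerprint can then be grouped the same way as any other shared-infrastructure signal.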

Redirect Chain Detection follows any redirects. Spam sites frequently redirect through multiple intermediary domains before landing on the final target.

Content Inspection checks the Content-Type and Content-Length of the response. Real websites serve substantial HTML. Spam doorway pages are often tiny — under 1KB of content.
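A minimal sketch of this check, working only from response headers, might look like the following. The function name is hypothetical, and the 1024-byte default mirrors the 1KB figure above. Responses that omit Content-Length are treated as inconclusive rather than flagged.

```python
def looks_like_doorway(headers: dict, max_bytes: int = 1024) -> bool:
    """True if the response is HTML and declares a body smaller than max_bytes."""
    h = {name.lower(): value for name, value in headers.items()}
    content_type = h.get("content-type", "").split(";")[0].strip().lower()
    length = h.get("content-length", "").strip()
    if content_type != "text/html" or not length.isdigit():
        return False  # not HTML, or no usable length: inconclusive
    return int(length) < max_bytes
```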

TLS Certificate Validation examines the site's TLS certificate. A brand-new Let's Encrypt certificate on a brand-new domain with brand-new backlinks is a strong indicator of purpose-built attack infrastructure.

Self-Signed Certificate Detection flags certificates that aren't issued by a recognized authority. Self-signed certs on public-facing sites that claim to be legitimate businesses are extremely rare and highly suspicious.

Phase 4: Network Intelligence (Signals 17–20)

The final four signals use public network data to map attack infrastructure.

ASN Reputation Mapping identifies the network owner via public BGP routing tables. Hosting providers known to be used heavily for spam get flagged, and IP addresses that fall within the same /24 block are clustered together.
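The /24 clustering step can be sketched with the standard ipaddress module. This example assumes IPv4 addresses and leaves out the ASN lookup itself. The function name is illustrative.

```python
import ipaddress
from collections import defaultdict

def cluster_by_slash24(ips):
    """Group IPv4 address strings by the /24 network that contains them."""
    clusters = defaultdict(list)
    for ip in ips:
        # strict=False lets a host address stand in for its enclosing network.
        network = ipaddress.ip_network(f"{ip}/24", strict=False)
        clusters[str(network)].append(ip)
    return dict(clusters)
```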

Favicon Hash Clustering downloads and hashes the site's favicon. When dozens of "different" websites share the same favicon, they're running from the same template — likely operated by the same attacker.
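Assuming the favicon bytes have already been downloaded, the clustering step could look like this. SHA-256 is an assumption, since the hash function is not specified here, and an exact-match hash only catches byte-identical icons.

```python
import hashlib
from collections import defaultdict

def favicon_clusters(favicons, min_size=2):
    """favicons: dict mapping domain -> raw favicon bytes.

    Returns sorted lists of domains that serve byte-identical favicons.
    """
    by_hash = defaultdict(list)
    for domain, data in favicons.items():
        by_hash[hashlib.sha256(data).hexdigest()].append(domain)
    return [sorted(domains) for domains in by_hash.values() if len(domains) >= min_size]
```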

Link Velocity Analysis examines when backlinks appeared relative to the domain's registration date. When a domain was registered 3 days ago and already has 50 backlinks pointed at your site, that's not organic growth.
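As a simple illustration, velocity can be expressed as backlinks per day of domain age. The function name and the one-day floor are assumptions, and how the registration date is obtained is out of scope for this sketch.

```python
from datetime import date

def links_per_day(registered: date, observed: date, backlinks: int) -> float:
    """Backlinks accumulated per day since the domain was registered."""
    # Floor the age at one day so a same-day domain does not divide by zero.
    age_days = max((observed - registered).days, 1)
    return backlinks / age_days
```

For the example above, a domain registered 3 days before observation with 50 backlinks works out to roughly 16.7 links per day.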

Shared Threat Intelligence aggregates data from all SpamAxe users. This signal gets stronger over time — early adopters do the heavy lifting, and later users benefit from near-instant classification of known attack infrastructure.

Scoring and Classification

Each signal contributes a weighted score. The weights reflect how reliably each signal indicates spam. DGA detection and community threat data carry the highest weights. Server header analysis and email authentication carry lower weights.

The final confidence score determines classification: spam (55+), suspicious (30–54), or clean (under 30). Suspicious entries go into a manual review queue where the user makes the final call.
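The thresholds above translate directly into code. In this sketch the weight values and the "weighted sum capped at 100" combination rule are assumptions for illustration. Only the 55 and 30 cutoffs come from the description above.

```python
SPAM_THRESHOLD = 55
SUSPICIOUS_THRESHOLD = 30

# Hypothetical weights: higher for DGA and community data, lower for
# server headers and email authentication, as described above.
WEIGHTS = {"dga": 30, "community_threat": 30, "server_header": 5, "email_auth": 5}

def confidence(signal_values: dict, weights: dict = WEIGHTS) -> float:
    """signal_values: signal name -> strength in [0, 1]. Returns a 0-100 score."""
    total = sum(weights[name] * value for name, value in signal_values.items())
    return min(100.0, float(total))

def classify(score: float) -> str:
    """Map a confidence score to spam, suspicious, or clean."""
    if score >= SPAM_THRESHOLD:
        return "spam"
    if score >= SUSPICIOUS_THRESHOLD:
        return "suspicious"
    return "clean"
```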

Current accuracy on test data: 92.2%.
