How ATS Systems Actually Work (2026) — A Technical Breakdown

Updated May 2026 · ~13 min read · by hugounoclaw

"Applicant Tracking System" is one of the most misunderstood pieces of software in the job market. Career advice treats it like a black-box robot that shreds 75% of resumes on sight; vendors describe it in glossy abstractions about "AI-powered talent intelligence." The truth sits in between. An ATS is, at its core, a document pipeline plus a search engine: it turns your uploaded file into structured data, stores it, and helps a recruiter find and rank candidates. This article walks through that pipeline stage by stage, names the real techniques involved, and is honest about which vendor details are documented versus inferred — because for most proprietary systems, the internals simply aren't public.

TL;DR

An ATS processes your resume in five stages: Ingest → Parse → Normalize → Match → Rank. Most "ATS rejections" are really failures in the first two stages — your resume never became clean, searchable data.
Parsing uses text extraction, layout analysis, section detection, and named-entity recognition (NER) to label your title, employers, dates, and skills. Image-based PDFs force an unreliable OCR fallback.
Matching ranges from literal keyword/Boolean search (older systems like Taleo) to semantic embeddings and skill-graph inference that understand synonyms (Workday, Ashby, newer Greenhouse/iCIMS features).
Keyword presence in the right section beats keyword density. Exact terms from the job description still matter because not every system expands synonyms.
The single highest-leverage thing you can do is make sure your resume parses cleanly. Use our free ATS checker to see how yours imports.

See your resume the way the parser does

Before the theory — run your resume through the free, in-browser checker. It shows how cleanly you parse, which sections were detected, and where you lose keyword matches. Nothing is uploaded.

Section 1 — The five stages of ATS processing

Almost every ATS, from a 20-year-old enterprise install to a 2024-vintage startup product, moves your application through the same conceptual pipeline. The labels differ between vendors, but the stages are remarkably consistent:

Ingest: receive the file and form fields; detect the file type
Parse: extract text, detect sections, label entities
Normalize: clean, standardize dates, map to a schema
Match: compare to the job using keywords or embeddings
Rank: score, sort, surface to the recruiter

Stage 1: Ingest

When you hit submit, the system receives two things: the structured form data you typed into the portal (name, email, the answers to screening questions) and your uploaded file (.docx, .pdf, sometimes .txt or .rtf). The form data is already clean and goes straight into the database. The file is the hard part. The system detects the file type and routes it to the right extractor. This is also where explicit knockout questions live — "Are you authorized to work in this country?" or "Do you have 5+ years of X?" A recruiter configures these, and a disqualifying answer can filter you out before any parsing happens. This is the closest thing to the mythical "automatic rejection," and it's driven by the questions you answer, not by an algorithm judging your resume.
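A knockout check of the kind described above can be sketched in a few lines. This is a minimal illustration, not any vendor's implementation: the `KnockoutRule` class, the `work_authorization` question id, and the `first_knockout` helper are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class KnockoutRule:
    question_id: str           # hypothetical id of a screening question
    disqualifying_answer: str  # the answer the recruiter configured as a knockout


def first_knockout(answers: Dict[str, str], rules: List[KnockoutRule]) -> Optional[str]:
    """Return the id of the first question answered with a disqualifying value, else None."""
    for rule in rules:
        given = answers.get(rule.question_id, "").strip().lower()
        if given == rule.disqualifying_answer.lower():
            return rule.question_id
    return None


rules = [KnockoutRule("work_authorization", "no")]
print(first_knockout({"work_authorization": "No"}, rules))   # work_authorization
print(first_knockout({"work_authorization": "Yes"}, rules))  # None
```

Note that nothing here reads the resume: the decision depends only on the form answers, which is the point of the paragraph above.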

Stage 2: Parse

Parsing converts your document into structured fields. This is where the most damage happens, because a resume is a visual artifact and the parser needs to recover its logical structure from layout. We break this down in detail in Section 2. The output is a candidate record with fields like full_name, email, work_history[], education[], and skills[] — plus the full raw text, kept for keyword search.

Stage 3: Normalize

Raw parsed values are messy. Normalization standardizes them so they're comparable: dates in "Mar 2023", "03/2023", and "March 2023" all become a single internal format; job titles may be mapped to a canonical taxonomy ("Sr. SWE" → "Senior Software Engineer"); skills are de-duplicated and sometimes mapped to a controlled vocabulary. Systems with a skills taxonomy do a lot of work here — this is the step that later lets "JS" match "JavaScript."
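As a rough sketch of what normalization might look like, the snippet below collapses the three date spellings from the paragraph above into one `YYYY-MM` string and maps a title alias to a canonical form. The format list, the one-entry `TITLE_ALIASES` table, and both function names are assumptions made for illustration; a real taxonomy would be far larger.

```python
from datetime import datetime

# Formats for "Mar 2023", "03/2023", "March 2023" (English month names assumed)
DATE_FORMATS = ("%b %Y", "%m/%Y", "%B %Y")

# Tiny stand-in for a title taxonomy; hypothetical, for illustration only
TITLE_ALIASES = {"sr. swe": "Senior Software Engineer"}


def normalize_date(raw: str) -> str:
    """Try each known format in turn and return a single internal YYYY-MM form."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")


def normalize_title(raw: str) -> str:
    """Map a known alias to its canonical title; leave unknown titles unchanged."""
    return TITLE_ALIASES.get(raw.strip().lower(), raw.strip())


for raw in ("Mar 2023", "03/2023", "March 2023"):
    print(raw, "->", normalize_date(raw))   # each prints 2023-03
print(normalize_title("Sr. SWE"))           # Senior Software Engineer
```

A date in a format outside the list raises an error here; in a real parser that is the moment a date silently goes missing from your profile, which is why consistent date formats matter.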

Stage 4: Match

Now the system compares your normalized profile to a specific job. The job description is processed into a set of required and preferred terms (or its own embedding). Matching ranges from literal keyword overlap to semantic similarity — covered in Section 3. Importantly, matching often happens at search time: a recruiter runs a query and the system scores everyone in the pool against it, rather than scoring you once at submission.

Stage 5: Rank

Finally, candidates are sorted. In a keyword system, that's "how many of my search terms appear, and how recently/prominently." In an ML system, it's a learned relevance score, sometimes shown as a grade or tier. The recruiter sees a ranked list and a snippet showing why you matched. A human still chooses who advances — the ranking influences who they look at first, and in a pool of 800 applicants, "first" is a real advantage.
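The keyword flavor of ranking can be illustrated with a deliberately crude sketch: count how many of the recruiter's search terms appear in each resume, then sort. The candidate names, resume snippets, and function names are made up, and the sketch ignores the recency and prominence signals mentioned above.

```python
from typing import Dict, List, Tuple


def keyword_score(resume_text: str, terms: List[str]) -> int:
    """Count how many search terms appear at least once (case-insensitive substring match)."""
    text = resume_text.lower()
    return sum(1 for term in terms if term.lower() in text)


def rank(pool: Dict[str, str], terms: List[str]) -> List[Tuple[str, int]]:
    """Score every candidate in the pool against the query, highest score first."""
    scored = [(name, keyword_score(text, terms)) for name, text in pool.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


pool = {
    "ana": "Python services on Kubernetes and AWS",
    "ben": "Python scripting for data cleanup",
    "cy": "Retail store management",
}
print(rank(pool, ["python", "kubernetes", "aws"]))
# [('ana', 3), ('ben', 1), ('cy', 0)]
```

Because scoring runs against whatever query the recruiter types, the same pool can reorder completely from one search to the next.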

Section 2 — How parsing actually works

Parsing is the stage job seekers most need to understand, because a parsing failure is silent. You won't get an error. Your resume just becomes garbled data that ranks poorly for every search. Here's what's happening under the hood.

Text extraction

First the parser pulls the raw character stream out of the file. For a .docx this is relatively clean — it's structured XML, and text comes out in a fairly reliable order. For a text-based PDF, the extractor reads the embedded text objects, but PDFs store text as positioned glyphs, not logical paragraphs, so word and line order has to be reconstructed from x/y coordinates. This is why a two-column layout can interleave: the extractor may read across both columns on the same visual line.
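The column-interleaving problem is easy to reproduce with toy data. The sketch below assumes each word comes with an x/y position (y growing downward, which is a simplification; PDF coordinates natively start at the bottom-left) and compares a naive row-by-row ordering with one that knows where the column gutter is. The `Word` type, the coordinates, and the `split_x` parameter are all invented for the example.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # horizontal position
    y: float  # vertical position, larger means lower on the page (assumed)


# A two-column page: left column at x=50, right column at x=350
words = [
    Word("Experience", 50, 100), Word("Skills", 350, 100),
    Word("Acme", 50, 120),       Word("Python", 350, 120),
    Word("Engineer", 50, 140),   Word("SQL", 350, 140),
]


def naive_order(ws: List[Word]) -> List[str]:
    """Read row by row, left to right, ignoring columns entirely."""
    return [w.text for w in sorted(ws, key=lambda w: (w.y, w.x))]


def column_aware_order(ws: List[Word], split_x: float) -> List[str]:
    """Read the whole left column first, then the whole right column."""
    left = [w for w in ws if w.x < split_x]
    right = [w for w in ws if w.x >= split_x]
    return naive_order(left) + naive_order(right)


print(naive_order(words))
# ['Experience', 'Skills', 'Acme', 'Python', 'Engineer', 'SQL']
print(column_aware_order(words, split_x=200))
# ['Experience', 'Acme', 'Engineer', 'Skills', 'Python', 'SQL']
```

The naive output is the garbled stream a weak extractor hands to every later stage. A single-column layout sidesteps the question of whether the parser can find the gutter at all.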

OCR fallback for image-based PDFs

If the file has no embedded text — a scan, a photo, or a Canva-style design exported as a flat image — there's nothing to extract. The system must fall back to Optical Character Recognition: running a vision model over the pixels to guess the characters. OCR is far less reliable than reading embedded text; it misreads characters, drops formatting cues used for section detection, and many ATS parsers either skip OCR entirely or do it poorly. The practical takeaway is blunt: if your resume's text can't be highlighted with a cursor, assume the parser sees little or nothing.

Layout analysis & section detection

With text in hand, the parser segments the document into sections. It looks for heading-like lines — short lines, bold or larger font, matching a dictionary of known headings ("Experience", "Work History", "Education", "Skills", "Certifications"). Everything between two headings is assigned to the first. This is why creative headings break things: "Where I've Made an Impact" doesn't match the dictionary, so the content under it may be misfiled or dropped. It's also why layout matters so much — if column order scrambled the text in extraction, section boundaries land in the wrong place.
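Here is a small runnable sketch of dictionary-based section detection, to show why a creative heading gets misfiled. It works on plain text lines only (no font or layout cues), and the `KNOWN_HEADINGS` table and `split_sections` function are illustrative assumptions rather than any real parser's rules.

```python
from typing import Dict, List

# Hypothetical heading dictionary: recognized spelling -> canonical section name
KNOWN_HEADINGS = {
    "experience": "experience",
    "work history": "experience",
    "education": "education",
    "skills": "skills",
    "certifications": "certifications",
}


def split_sections(lines: List[str]) -> Dict[str, List[str]]:
    """Assign each non-empty line to the most recent recognized heading."""
    sections: Dict[str, List[str]] = {"header": []}
    current = "header"
    for line in lines:
        key = line.strip().rstrip(":").lower()
        if key in KNOWN_HEADINGS:
            current = KNOWN_HEADINGS[key]
            sections.setdefault(current, [])
        elif line.strip():
            sections[current].append(line.strip())
    return sections


resume = [
    "Jane Doe",
    "Where I've Made an Impact",    # creative heading, not in the dictionary
    "Led migration to Kubernetes",
    "Skills",
    "Python, SQL",
]
parsed = split_sections(resume)
print(sorted(parsed))        # ['header', 'skills']
print(parsed["header"])      # name, the unrecognized heading, and the bullet beneath it
```

No "experience" section exists in the result: the work history bullet was filed under the header block, so any logic that reads job history from the experience section finds nothing.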

Named-entity recognition (NER)

Within sections, the parser labels spans of text as entities: this is a person name, that's an organization, that's a date, that's a job title, that's a skill. This is named-entity recognition, a standard NLP technique. Classic parsers used rule-based patterns and gazetteers (lists of known company names, skills, schools); modern ones use statistical or neural models fine-tuned on resume data. NER is how the system knows "Google" in your experience section is an employer, while "Google Ads" in your skills section is a tool.
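The classic gazetteer approach can be sketched as a longest-match lookup that also takes the section into account. Everything below is a toy: the two gazetteers, the `label_entities` function, and the rule "prefer ORG inside experience, SKILL elsewhere" are assumptions chosen to mirror the Google / Google Ads example, not a description of a production NER model.

```python
import re
from typing import List, Tuple

# Hypothetical gazetteers; "oracle" is deliberately in both to show section context
ORG_GAZETTEER = {"google", "acme corp", "oracle"}
SKILL_GAZETTEER = {"google ads", "python", "sql", "oracle"}


def label_entities(section: str, text: str) -> List[Tuple[str, str]]:
    """Greedy longest-match lookup; the section decides which gazetteer wins a tie."""
    tokens = re.findall(r"[A-Za-z0-9+#.]+", text)
    order = [("SKILL", SKILL_GAZETTEER), ("ORG", ORG_GAZETTEER)]
    if section == "experience":
        order.reverse()  # inside work history, an ambiguous name is more likely an employer
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens):
        for n in (3, 2, 1):  # try the longest phrase first
            if i + n > len(tokens):
                continue
            phrase = " ".join(tokens[i:i + n])
            label = next((lab for lab, gaz in order if phrase.lower() in gaz), None)
            if label:
                found.append((phrase, label))
                i += n
                break
        else:
            i += 1  # no phrase starting here matched; move on
    return found


print(label_entities("experience", "Software Engineer at Google"))
# [('Google', 'ORG')]
print(label_entities("skills", "Google Ads, Python, SQL"))
# [('Google Ads', 'SKILL'), ('Python', 'SKILL'), ('SQL', 'SKILL')]
```

Longest-match is what keeps "Google Ads" from being split into an employer plus a stray word, and the section preference is what lets "Oracle" be an employer in one place and a tool in another. It also shows why a misdetected section hurts: the same text gets different labels.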

Here's simplified pseudocode showing the shape of a parse routine. Real parsers are far more elaborate, but the skeleton is recognizable:

def parse_resume(file):
    # 1. Get raw text — OCR only if there's no embedded text layer
    if file.has_text_layer():
        text, spans = extract_text_with_positions(file)
    else:
        text, spans = ocr(file)          # lossy fallback — avoid relying on this

    # 2. Reconstruct reading order from x/y coords (columns break this)
    lines = order_by_layout(spans)

    # 3. Detect sections by matching heading dictionary
    sections = {}
    current = "header"
    for line in lines:
        if looks_like_heading(line) and line.text in KNOWN_HEADINGS:
            current = normalize_heading(line.text)   # "Work History" -> "experience"
        else:
            sections.setdefault(current, []).append(line)

    # 4. Run NER over each section to label entities
    profile = {"skills": [], "work_history": []}
    for name, body in sections.items():
        for ent in ner(body):
            if ent.label == "TITLE":
                profile["work_history"].append({"title": ent.text})
            elif ent.label == "ORG":
                attach_employer(profile, ent.text)
            elif ent.label == "DATE":
                attach_dates(profile, normalize_date(ent.text))
            elif ent.label == "SKILL":
                profile["skills"].append(ent.text)

    profile["raw_text"] = text   # kept for full-text keyword search later
    return profile

Notice that raw_text is retained. Even when entity extraction misses something, the full text is usually indexed for keyword search — which is why a skill buried in a weird layout might still be findable, just not properly attributed. For the formatting rules that keep this routine happy, see our ATS resume format guide.

Section 3 — How matching works

Once you're structured data, the system has to decide how well you fit a job. There are two broad families of techniques, and in 2026 most real products use some blend.

Keyword matching and TF-IDF

The classic approach is lexical: does the resume contain the terms the job is looking for? The simplest version is exact substring search ("does 'Kubernetes' appear?"). A more sophisticated version weights terms using TF-IDF (term frequency–inverse document frequency). The idea: a term matters more if it appears often in this document (term frequency) but is rare across all documents (inverse document frequency). So "Kubernetes" is a high-signal match because it's distinctive, while "responsible" or "team" is near-worthless because everyone uses it. TF-IDF is why generic filler does nothing for you and specific, role-defining nouns do a lot.
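A worked example makes the weighting concrete. The sketch below uses the plain textbook formula, tf × log(N / df), on three tiny made-up "resumes"; production libraries typically use smoothed or otherwise adjusted variants, so treat the numbers as illustrative only.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Unsmoothed textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return weights


docs = [
    ["kubernetes", "team", "responsible"],
    ["team", "responsible", "sales"],
    ["team", "responsible", "python"],
]
first = tf_idf(docs)[0]
print(round(first["kubernetes"], 3))  # 0.366
print(first["team"])                  # 0.0
```

"team" appears in every document, so its inverse document frequency is log(3/3) = 0 and it contributes nothing, while "kubernetes" appears in only one and carries all the weight.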

Keyword systems also support Boolean search — recruiters type queries like ("project manager" OR "program manager") AND (agile OR scrum) NOT junior with wildcards and quotes. In these systems you only exist if your literal text matches the query. There is no understanding that "led cross-functional initiatives" is relevant to "program manager."
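The example query above can be expressed as composed predicates to show how literal this kind of matching is. This sketch skips query parsing and wildcards and uses simple case-insensitive substring tests; the helper names (`has`, `any_of`, `all_of`, `negate`) are invented for the illustration.

```python
from typing import Callable

Predicate = Callable[[str], bool]


def has(term: str) -> Predicate:
    """True when the literal term appears in the text (case-insensitive substring)."""
    return lambda text: term.lower() in text.lower()


def any_of(*preds: Predicate) -> Predicate:   # OR
    return lambda text: any(p(text) for p in preds)


def all_of(*preds: Predicate) -> Predicate:   # AND
    return lambda text: all(p(text) for p in preds)


def negate(pred: Predicate) -> Predicate:     # NOT
    return lambda text: not pred(text)


# ("project manager" OR "program manager") AND (agile OR scrum) NOT junior
query = all_of(
    any_of(has("project manager"), has("program manager")),
    any_of(has("agile"), has("scrum")),
    negate(has("junior")),
)

print(query("Program Manager running Scrum ceremonies"))          # True
print(query("Led cross-functional initiatives with agile teams")) # False
print(query("Junior project manager, agile certified"))           # False
```

The second resume describes relevant work but never uses either title phrase, so it does not exist as far as this search is concerned.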

Why keyword "density" is the wrong target

A persistent myth says you should hit some magic keyword density percentage. Neither TF-IDF nor a semantic model scores against a target ratio, and stuffing backfires: recruiters read the resume and see the spam. What actually helps is presence and placement: the exact term appearing once in a labelled Skills block and once in real context (a bullet that shows you using it). To pull the right terms out of a posting, our JD keyword extractor ranks them by importance so you target the high-signal ones.

Semantic embeddings & synonym expansion

Modern systems go beyond literal text using semantic embeddings. Both your resume and the job description are converted into dense numeric vectors with a language model (the BERT family of models is the textbook example), such that texts with similar meaning land near each other in vector space. Now "managed a team of engineers" can score as similar to "engineering leadership" even with no shared keywords — the model captures the concept, not the characters. Cosine similarity between the two vectors becomes a match score.
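The scoring step can be shown with plain arithmetic. The three-dimensional vectors below are hand-made stand-ins, not the output of any real language model (real embeddings have hundreds of dimensions); only the cosine formula itself is standard.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up vectors for illustration only
resume_managed_team = [0.9, 0.8, 0.1]   # "managed a team of engineers"
jd_eng_leadership = [0.8, 0.9, 0.2]     # "engineering leadership"
jd_accounting = [0.1, 0.2, 0.9]         # an unrelated posting

print(round(cosine(resume_managed_team, jd_eng_leadership), 2))  # about 0.99
print(round(cosine(resume_managed_team, jd_accounting), 2))      # about 0.30
```

The two leadership phrases share no words, yet their vectors point in nearly the same direction, so the score is high. That is the whole trick of semantic matching.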

Synonym expansion can also be explicit, via a skills taxonomy: the system knows "JS", "JavaScript", and "ECMAScript" are the same node, and "RN" maps to "Registered Nurse." This is the normalization step paying off at match time. The practical implication: in a semantic or taxonomy-backed system you get some credit for related phrasing — but you can't know which system a given employer runs, so leading with the exact terms from the posting is still the safe play, because it wins in both the old and new worlds.
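Explicit synonym expansion amounts to mapping every alias to one canonical node before comparing. The alias table and function names below are a hypothetical miniature of a skills taxonomy, using the examples from the paragraph above.

```python
from typing import Iterable, Set

# Hypothetical alias table: every spelling points at one canonical skill
SKILL_ALIASES = {
    "js": "javascript",
    "javascript": "javascript",
    "ecmascript": "javascript",
    "rn": "registered nurse",
    "registered nurse": "registered nurse",
}


def canonical(term: str) -> str:
    """Return the canonical skill for a known alias, else the lower-cased term itself."""
    key = term.strip().lower()
    return SKILL_ALIASES.get(key, key)


def literal_match(resume_skills: Iterable[str], required: Iterable[str]) -> Set[str]:
    """Keyword-only behavior: strings must be identical (ignoring case)."""
    return {s.lower() for s in resume_skills} & {r.lower() for r in required}


def taxonomy_match(resume_skills: Iterable[str], required: Iterable[str]) -> Set[str]:
    """Taxonomy-backed behavior: compare canonical nodes instead of raw strings."""
    return {canonical(s) for s in resume_skills} & {canonical(r) for r in required}


resume_skills = ["JS", "RN"]
required = ["JavaScript", "Registered Nurse"]
print(literal_match(resume_skills, required))           # set()
print(sorted(taxonomy_match(resume_skills, required)))  # ['javascript', 'registered nurse']
```

The same resume scores zero in the literal system and a full match in the taxonomy-backed one, which is the case for writing both the acronym and the spelled-out term.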

Section 4 — What the major ATS vendors actually do

This is where honesty matters most. The internal parsing and matching architecture of proprietary ATS platforms is, for the most part, not public. Vendors publish marketing pages and occasional engineering blogs, but they don't release their parsers or scoring models. Below, claims backed by a vendor's own documentation or engineering posts are marked documented; claims that are reasonable inferences from third-party testing, recruiter reports, or general industry knowledge are marked estimate. Treat the estimates as informed, not authoritative.

Vendor	Parsing approach	Matching method	Market segment
Workday	Proprietary parser into a candidate profile (estimate)	ML "Skills Cloud" with skill inference; LLM-based inference per their engineering blog (documented)	Large enterprise
Greenhouse	Text/NER extraction; full-text index (estimate)	Keyword / Boolean recruiter search, exact-string (documented); AI ranking added more recently	Mid-market / tech
Lever	Profile-based parsing (estimate)	AI-assisted ranking and summaries, vendor-marketed (estimate)	Mid-market / tech
Taleo (Oracle)	Strict proprietary parser; exact headers/dates (estimate)	Keyword + Boolean operators (documented); no synonym understanding (estimate)	Large enterprise (legacy)
iCIMS	Parsed profile is the primary record (estimate)	Keyword search + AI scoring features, vendor-marketed (estimate)	Large enterprise
BambooHR	Standard resume parsing (estimate)	Primarily keyword-based filtering (estimate)	SMB
Ashby	Automated extraction; PII redacted before AI (documented)	Semantic AI application review with citations (documented)	Startups / scale-ups

Workday

Workday's Skills Cloud is genuinely ML-driven — documented in Workday's own materials, including a 2018 announcement of a "machine-learning-powered Skills Cloud" and a Workday Engineering blog describing an LLM-based skill inference service that maps natural-language phrases (job titles, certifications) to a normalized set of ~55,000 verified skills. So skill inference and a skills graph: documented. The popular specific claim that "Workday uses BERT embeddings since 2023" is an industry estimate — BERT-family models are the standard tool for this kind of semantic matching, so it's a reasonable guess, but I found no Workday documentation stating they use BERT, on that date, in candidate matching. Treat the mechanism (ML skill inference) as solid and the specific model/date as inference.

Greenhouse

Greenhouse's recruiter-facing keyword search is documented in their own support material — recruiters search resume text for terms, and it works on essentially exact strings (case-insensitive). Greenhouse has more recently added AI ranking features. The detailed description of its parser as a "PDF/DOCX extraction plus NER pipeline" circulates widely in third-party guides but isn't something I'd cite to Greenhouse directly — that specific internal architecture is an estimate.

Lever

Lever markets AI-assisted candidate ranking against configurable criteria plus automated summaries. The features are real, but the internal method — embeddings vs. rules vs. an LLM — isn't publicly specified, so the "how" is an estimate. Lever is frequently grouped with Greenhouse as a usability-first, tech-company ATS.

Taleo

Taleo (now Oracle) is the legacy enterprise workhorse and the most literal of the bunch. Oracle's own documentation covers keyword and Boolean search (AND/OR/NOT, wildcards, quotes) — documented. It's widely reported (estimate, but very consistent across sources and testing) that Taleo does not understand synonyms: "managed projects" won't match a search for "project management." Its parser is also notoriously strict about standard headings and date formats. If you're applying to a large legacy enterprise, assume Taleo-style literal matching and use exact terms.

iCIMS

iCIMS Talent Cloud is a large-enterprise platform where the parsed profile is the record a recruiter reads, so parse quality directly drives what they see. iCIMS markets AI ranking/scoring, and it's commonly counted among platforms with active AI candidate ranking. As with most enterprise vendors, the specific matching internals are an estimate.

BambooHR

BambooHR is an SMB-focused HR suite with recruiting bolted on. It's best understood as primarily keyword-based filtering with straightforward pipeline stages — appropriate for smaller teams, not a heavy AI-matching engine. This characterization is an estimate based on its positioning and feature set rather than a published architecture.

Ashby

Ashby is the clearest modern example and, helpfully, the most transparent. Its own AI page documents AI-assisted application review that analyzes the semantics of work history, years of experience, and skills (explicitly "instead of just keywords"), assigns tiers, provides citations for AI outputs so recruiters can verify them, redacts PII before sending resumes to AI models, and does not train models on customer data — all documented. Ashby is popular with startups and scale-ups that want analytics-heavy, AI-assisted screening.

The honest summary: a handful of mechanisms are documented (Workday's ML skill inference, Taleo's Boolean search, Ashby's semantic review), but no major vendor publishes the parser or scoring model that decides your fate. Anyone claiming to know the exact algorithm of a closed enterprise ATS is guessing — usefully or not.

Section 5 — What this means for your resume

Translate the pipeline into actions. Each of these maps directly to a stage above:

Win the parse (stages 2–3). Single-column layout, standard headings, real text (not images), consistent date formats, and contact info in the document body rather than a header or footer. This is the highest-leverage work because a parse failure poisons every later stage. The full checklist is in our ATS resume format guide.
Feed the matcher the right terms (stage 4). Pull the exact skills and titles from each job description and include them naturally — once in a labelled Skills section, once in context. Lead with the literal phrasing from the posting so you win in keyword systems and semantic ones. Our JD keyword extractor ranks a posting's terms by importance.
Spell out acronyms both ways. Write "SQL (Structured Query Language)" and "Registered Nurse (RN)" so you match whichever variant a recruiter searches — this hedges against systems that don't expand synonyms.
Make impact searchable, not just decorative. "Cut deployment time 40% by introducing CI/CD" contains role-defining nouns a TF-IDF or semantic matcher rewards, and reads well to the human at the end.
Don't chase density. Presence and placement beat repetition. Stuffing the same term ten times helps nothing and hurts with the recruiter.

Pressure-test all of this on your actual resume

Use our free ATS checker to see which sections were detected, how you parse, and which job-description keywords you're missing — then fix the highest-impact items first.

Section 6 — Common myths, debunked

Myth 1: "An ATS auto-rejects 75% of resumes."

The pipeline shows why this is mostly false. The ATS ranks and stores; a recruiter decides. The real attrition comes from (a) explicit knockout questions you answer, and (b) ranking low and never getting looked at in a deep pool. There's no algorithm sitting in judgment auto-trashing three of every four resumes.

Myth 2: "Hidden white-text keywords trick the ATS."

Don't. The parser extracts all the text regardless of color, so white-on-white keywords are read — and then a human opens the document, sees the spam (or selects the text), and you're done. It's a known trick that recruiters screen for. High risk, no durable upside.

Myth 3: "You need exactly the right keyword density."

Covered above: TF-IDF and semantic models don't reward a density ratio, and stuffing reads badly to humans. Optimize for presence in the right place, not a percentage.

Myth 4: "PDFs always break ATS / never use a PDF."

A text-selectable PDF parses fine in most modern systems. The thing that breaks is an image-based PDF (a scan or flat export) that forces OCR. .docx is still the safest universal choice for portals, but "never PDF" is outdated.

Myth 5: "ATS and AI understand everything I mean."

Also false, in the other direction. Plenty of deployed systems — especially legacy enterprise installs like Taleo — do pure keyword matching with no synonym understanding at all. Assuming the machine will infer that "ran the books" means "accounting" is risky. Use the explicit term.

The bottom line

An ATS isn't a gatekeeper robot and it isn't a mind-reader. It's a pipeline that turns your document into searchable, scoreable data and helps a human triage a pile of applicants. You can't control which vendor an employer runs or see your score — but you can control whether you parse cleanly and whether you carry the exact language of the job. Get those two right and you'll surface in front of the human who actually makes the call. Start by running your resume through the free checker, then tighten your format and keywords.

FAQ

Does an ATS automatically reject resumes?

Rarely on its own. Most ATS platforms are a database and search tool — they parse and store resumes and rank them, but a recruiter decides who to advance. The exception is explicit knockout questions (e.g. work authorization) that a recruiter configures, which can auto-disqualify. The bigger risk is invisibility: if your resume parses badly, you never surface in the recruiter's search.

Do ATS use AI to read resumes in 2026?

Some do, some don't. Older systems like Taleo rely on literal keyword and Boolean matching with no synonym understanding. Modern platforms like Workday, Ashby, and newer Greenhouse and iCIMS features add machine-learning skill inference and semantic matching that can recognize related terms. The exact internals are mostly proprietary and not published.

What is resume parsing?

Parsing is the step where the ATS converts your document into structured data. It extracts the raw text, detects sections like Experience and Education, and uses named-entity recognition to label pieces as a job title, company, date, or skill. The output is a candidate profile the system can search and score.

Does keyword density matter for ATS?

Presence matters far more than density. A skill that appears once in a clearly labelled Skills section and once in context beats the same word stuffed ten times. Keyword stuffing can also be flagged by recruiters reading the resume. Aim for natural inclusion of the exact terms from the job description.

Will a PDF parse correctly in an ATS?

A text-selectable PDF usually parses fine. An image-based or scanned PDF does not: the system has to fall back to OCR, which is error-prone, and many parsers extract little or nothing. The test: open the PDF and try to highlight the text. If you can select it, the file has an embedded text layer for the parser to read.

How does synonym expansion work in modern ATS?

Newer systems map related terms to a shared concept — so "JS" and "JavaScript", or "managed a team" and "team leadership", can match even without identical wording. This is done with a skills taxonomy or with semantic embeddings that place similar phrases near each other in vector space. Older keyword-only systems do not do this, so exact terms still matter.

Can I see the score an ATS gives me?

No. Any internal match score or ranking is visible to the recruiter, not the candidate. Third-party checkers (including this one) estimate how well your resume parses and matches a job description, but they are simulations — no public ATS exposes its real score to applicants.
