OpenAI Tax AI Codex 2026: Self-Improving Agent Hits 97% Accuracy
A self-improving AI system trained on accountants' live corrections processed 7,000 tax returns, cut preparation time by a third, and reached 97% draft accuracy in some document categories, offering a reusable blueprint for autonomous professional-services agents in other regulated industries.
Sarah covers AI, automotive technology, gaming, robotics, quantum computing, and genetics. Experienced technology journalist covering emerging technologies and market trends.
DATELINE: SAN FRANCISCO / LONDON, 1 JUNE 2026 — It is the kind of claim that sounds like marketing copy until you look at the receipts. Over the course of a single tax season, a jointly built artificial intelligence system processed 7,000 complex U.S. tax returns across more than 30 accounting firms, reduced preparation time by roughly a third, and in some document categories reached 97 percent draft accuracy — all while rewriting its own underlying code to fix mistakes it had made the week before.
The system — dubbed Tax AI — was built over six months by a joint team drawn from OpenAI and Thrive Holdings, a professional services holding company. It was deployed operationally through Crete Professional Alliance, a network of mid-market accounting firms. On May 27, 2026, OpenAI published a detailed technical post-mortem explaining the architecture. The document is significant not for its headline statistics but for what those statistics reveal about a new class of enterprise AI: agents that improve themselves in production, using their own mistakes as engineering inputs.
In the language of Silicon Valley, this is a flywheel. In the language of the accounting firms that lived through it, it is something rather more surprising: technology that got measurably better every week they used it. The deployment represents one of the most consequential agentic AI case studies published to date, sitting alongside Salesforce Agentforce's $1.2B ARR milestone as evidence that the agentic enterprise AI era is no longer a forecast but a present reality.
The Partnership Nobody Saw Coming
The relationship between OpenAI and Thrive Holdings is not a typical enterprise licensing deal. OpenAI acquired an equity stake in Thrive Holdings in December 2025, embedding the AI company as an investor and technical partner rather than a vendor. More striking still, the intellectual property and software products that emerged from the collaboration remain owned by Thrive, not OpenAI — a structural inversion of the standard arrangement between hyperscaler AI providers and their enterprise customers.
The rationale, as OpenAI explained in its technical post, is that Thrive's model as both owner and operator of professional-services businesses creates something that most AI projects lack: embedded, real-time access to practitioners and production data. Most enterprise AI development proceeds at arm's length — models trained on historical datasets, fine-tuned in sandbox environments, and deployed to workflows that look nothing like the conditions that shaped them. Thrive's structure eliminates that gap entirely.
The result was that failures surfaced in hours, not quarters, and the feedback that drove improvements came not from user surveys or error logs but from practitioners making hands-on corrections to live filings — the richest possible signal that the system had got something wrong.
"One senior accountant who spent 180 hours on tax prep last year spent only 15 hours this year. She put that time toward calling every one of her clients and walking them through their returns." — OpenAI, May 2026
The Problem With Tax: Why AI Has Always Struggled Here
Accounting firms have lived with tax software for decades. Intuit's TurboTax, Thomson Reuters ONESOURCE, and Wolters Kluwer CCH Axcess are sophisticated platforms, each with decades of compliance-rule encoding behind them. What they cannot do is think. They apply rules; they cannot reason across document types, reconcile handwritten notes with prior-year spreadsheets, or infer that a figure on Schedule E should match a number buried in a K-1 partnership statement six pages deep.
It is precisely this class of problem — messy, multi-source, judgment-intensive, high-stakes — that makes tax returns genuinely hard for AI. W-2 and 1099 extraction is relatively straightforward. The complexity arrives with K-1 partnership documents, rental property schedules, handwritten marginal notes, prior-year carryover values, and figures that must match across multiple documents without any machine-readable cross-reference.
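The cross-document matching problem described above can be pictured as a reconciliation check over extracted values. The following is a minimal sketch, not Tax AI's method: the function name `find_mismatches`, the document and field labels, and the 50-cent tolerance are all assumptions made for illustration.

```python
from decimal import Decimal


def find_mismatches(values, links, tolerance=Decimal("0.50")):
    """Return linked pairs whose amounts are missing or differ by more than `tolerance`.

    values: {(document, field): Decimal amount}
    links:  [((document, field), (document, field))] pairs expected to agree
    """
    mismatches = []
    for left, right in links:
        if left not in values or right not in values:
            mismatches.append((left, right, "missing"))
        elif abs(values[left] - values[right]) > tolerance:
            mismatches.append((left, right, values[left] - values[right]))
    return mismatches


# Hypothetical extracted values; labels are illustrative, not real form fields.
values = {
    ("k1_partnership.pdf", "rental_income"): Decimal("12500.00"),
    ("schedule_e", "partnership_rental_income"): Decimal("12050.00"),
    ("prior_year_return", "loss_carryover"): Decimal("-3100.00"),
    ("current_return", "prior_loss_carryover"): Decimal("-3100.00"),
}
links = [
    (("k1_partnership.pdf", "rental_income"), ("schedule_e", "partnership_rental_income")),
    (("prior_year_return", "loss_carryover"), ("current_return", "prior_loss_carryover")),
]

mismatches = find_mismatches(values, links)
for left, right, difference in mismatches:
    print(left, right, difference)
```

The hard part in practice is building the `links` list, since the article notes that no machine-readable cross-reference exists between the documents; the sketch simply assumes those links are already known.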
According to the OpenAI post, data entry alone on a complex return can consume eight hours of staff time. The accounting profession bills the majority of those hours at senior-staff rates. The economic case for automation is not subtle: firms with advanced AI integration are already reporting 21 percent higher billable hours per staff member, according to CPA Trendlines, because freed-up bandwidth flows directly into advisory work.
Inside the Self-Improvement Loop: What Codex Actually Does
The technical architecture at the heart of Tax AI is best understood as three interlocking systems: a data extraction and mapping layer, a production trace recorder, and a Codex-driven evaluation and repair loop. Each feeds the next, and together they constitute something genuinely new in enterprise software — a system whose primary engineering resource is its own mistake history.
The extraction layer ingests source documents — PDFs, spreadsheets, handwritten notes, prior-year returns — and maps individual values to specific fields in the target tax return. Every action the system takes is logged in full: source file, extracted value, citation, tax-engine mapping, and, crucially, any subsequent correction made by an accountant. That correction is the signal. When the system maps an expense figure to the wrong line on Schedule E and an accountant fixes it, that correction enters the trace recorder as a structured failure event.
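The logged fields listed above suggest a record shaped roughly like the one below. This is a sketch under stated assumptions: the summary of OpenAI's post does not give a schema, so `TraceEvent`, `record_correction`, and every field name here are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TraceEvent:
    """One logged extraction step. All field names are illustrative assumptions."""
    source_file: str                        # document the value was read from
    extracted_value: str                    # value as the system read it
    citation: str                           # where in the source it was found
    target_field: str                       # tax-engine field it was mapped to
    corrected_field: Optional[str] = None   # set when an accountant remaps it
    corrected_value: Optional[str] = None   # set when an accountant edits it

    @property
    def is_failure(self) -> bool:
        # A trace becomes a failure event once any correction is attached.
        return self.corrected_field is not None or self.corrected_value is not None


def record_correction(event, field=None, value=None):
    """Return a copy of the event with the accountant's correction attached."""
    return replace(event, corrected_field=field, corrected_value=value)


event = TraceEvent(
    source_file="rental_ledger.pdf",
    extracted_value="4200.00",
    citation="page 3, row 12",
    target_field="schedule_e.repairs",
)
fixed = record_correction(event, field="schedule_e.cleaning_and_maintenance")
print(event.is_failure, fixed.is_failure)
```

Keeping the original extraction and the correction in the same record is what lets a later step compare what the system did with what the accountant wanted.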
Repeated corrections of the same type become failure signals — structured inputs to OpenAI Codex. Codex receives a bounded task: here is the source document, here is the extraction trace, here is the tax-engine mapping, here are the existing tests, here is the correct answer. Fix the code that produced the wrong answer, and ensure that the fix passes the regression suite. This is not a prompt telling an AI to improve itself in some vague sense. It is a precisely scoped software engineering task with measurable success criteria.
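The five inputs named above can be read as a checklist that a repair task must satisfy before it is handed over. The sketch below illustrates that idea only; `RepairTask` and its field names are assumptions, and nothing here calls Codex or reflects its actual interface.

```python
from dataclasses import dataclass


@dataclass
class RepairTask:
    """Inputs for one bounded repair. Field names are assumptions, not a published schema."""
    source_document: str
    extraction_trace: list
    tax_engine_mapping: dict
    existing_tests: list
    expected_answer: dict

    def missing_inputs(self):
        """List the inputs that are empty, so an under-specified task can be rejected."""
        return [name for name, value in vars(self).items() if not value]

    def is_bounded(self):
        # A task counts as bounded only when every input is present.
        return not self.missing_inputs()


task = RepairTask(
    source_document="rental_ledger.pdf",
    extraction_trace=["read page 3, row 12", "mapped 4200.00 to repairs"],
    tax_engine_mapping={"4200.00": "schedule_e.repairs"},
    existing_tests=["test_schedule_e_mapping"],
    expected_answer={"4200.00": "schedule_e.cleaning_and_maintenance"},
)
incomplete = RepairTask("rental_ledger.pdf", ["read page 3"], {"4200.00": "x"}, [], {"4200.00": "y"})

print(task.is_bounded(), incomplete.missing_inputs())
```

A task with no existing tests is rejected here because, without them, there would be no way to measure whether a proposed fix broke something else.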
The engineers were careful about which errors triggered automatic repair and which required human review. Failure signals that were clear-cut — the same field wrong in the same way across a statistically significant number of returns — became bounded evaluation targets. Ambiguous cases were routed back to engineers before any code change was proposed.
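One way to picture that routing is a classifier that counts repeated corrections and separates consistent patterns from ambiguous ones. This is a sketch, not the deployed logic: the article gives no threshold, so `MIN_REPEATS` is an assumed stand-in for "statistically significant", and the field names are invented.

```python
from collections import Counter

MIN_REPEATS = 5  # assumed threshold; the article does not give a number


def classify_failures(corrections, min_repeats=MIN_REPEATS):
    """Split correction patterns into auto-repair targets and human-review cases.

    Each correction is a (field_chosen_by_system, field_chosen_by_accountant) pair.
    """
    counts = Counter(corrections)
    # How many different fixes accountants applied to each wrongly mapped field.
    distinct_fixes = Counter(wrong for wrong, _ in counts)
    auto_repair, human_review = [], []
    for (wrong, right), n in counts.items():
        # Clear-cut: always corrected the same way, and often enough to trust.
        if distinct_fixes[wrong] == 1 and n >= min_repeats:
            auto_repair.append((wrong, right, n))
        else:
            human_review.append((wrong, right, n))
    return auto_repair, human_review


corrections = (
    [("repairs", "improvements")] * 6      # consistent and frequent
    + [("utilities", "supplies")] * 3      # same field fixed two different ways
    + [("utilities", "repairs")] * 6
    + [("insurance", "taxes")] * 2         # consistent but rare
)
auto_repair, human_review = classify_failures(corrections)
print(auto_repair)
print(human_review)
```

Note that the frequent `utilities` correction still goes to review: accountants disagreed about the right destination, which is the kind of ambiguity the article says was routed back to engineers.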
This architecture reflects what OpenAI calls a harness engineering approach — an emerging discipline in agentic AI deployment that prioritises the infrastructure around the model as much as the model itself. As NVIDIA and SAP's OpenShell deployment demonstrated, the agentic AI systems achieving real enterprise traction in 2026 are defined not by model capability alone but by the governance infrastructure that surrounds them — eval harnesses, trace recorders, failure classifiers, and regression suites.
Table 1: Tax AI Performance — Baseline vs Post-Deployment Results
| Metric | Pre-Deployment Baseline | Post-Deployment Result |
|---|---|---|
| Returns reaching 75% correct field completion | 25% at launch | 86% after six weeks |
| Overall draft accuracy (best-in-class schedules) | ~40%–50% (manual) | Up to 97% |
| Tax preparation time per complex return | Full manual (avg. 8 hrs) | Approx. one-third reduction |
| Throughput increase across pilot firms | Baseline | +50% |
| Returns processed in pilot | 0 (pre-launch) | 7,000 (1040 & 1041 filings) |
| Accountant hours on prep (sample senior accountant) | 180 hrs per season | 15 hrs per season |
| Number of accounting firms covered | 0 | 30+ (Crete Professional Alliance) |

Source: OpenAI (May 2026), CPA Trendlines, VantaInsights.
The Rental Property Problem: A Case Study in Bounded Failure
The OpenAI post uses rental property schedules, specifically Schedule E filings, as the worked example that best illustrates the improvement loop in action. Rental properties are deceptively complicated. The tax rules governing depreciation, mortgage interest deductions, passive activity losses, and the treatment of repairs versus capital improvements span multiple IRS publications and vary depending on whether the taxpayer qualifies as a real estate professional under Section 469(c)(7).
In early deployments, Tax AI struggled with this class of return. Fields were mapped to the wrong lines; depreciation figures did not reconcile with prior-year balances; passive loss carryforwards were occasionally omitted. Accountants corrected these errors, unaware that every correction was being logged and classified by the system's trace recorder.
Within a small number of weeks, the pattern of corrections on rental schedules had generated enough signal to trigger a Codex repair cycle. The next cohort of rental-property returns was materially more accurate. The improvement was not the result of retraining the underlying model on new data. It was the result of targeted code repair, generated from production evidence, tested against a regression harness, and shipped to the live system.
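The "tested against a regression harness" step amounts to a gate: a candidate fix ships only if it corrects the reported failure and leaves every previously passing case intact. The sketch below shows that gate in miniature, with toy mapping functions standing in for real extraction code; all names and expense labels are hypothetical.

```python
def apply_if_safe(current_fn, candidate_fn, failing_case, regression_cases):
    """Return the candidate only if it fixes the failure and passes every regression case."""
    failing_input, expected = failing_case
    if candidate_fn(failing_input) != expected:
        return current_fn  # the fix does not fix the reported failure
    for case_input, case_expected in regression_cases:
        if candidate_fn(case_input) != case_expected:
            return current_fn  # the fix breaks behaviour that used to be correct
    return candidate_fn


def current_mapper(label):
    # Live behaviour: treats every kind of interest as mortgage interest.
    return "mortgage_interest" if "interest" in label else "other_expenses"


def good_candidate(label):
    if "mortgage" in label:
        return "mortgage_interest"
    return "other_interest" if "interest" in label else "other_expenses"


def bad_candidate(label):
    # Fixes the reported case but misroutes genuine mortgage interest.
    return "other_interest" if "interest" in label else "other_expenses"


failing = ("credit card interest", "other_interest")
regression = [
    ("mortgage interest", "mortgage_interest"),
    ("pest control", "other_expenses"),
]

shipped_after_bad = apply_if_safe(current_mapper, bad_candidate, failing, regression)
shipped_after_good = apply_if_safe(current_mapper, good_candidate, failing, regression)
print(shipped_after_bad.__name__, shipped_after_good.__name__)
```

The rejected candidate is the instructive one: it resolves the accountant's correction but would have introduced a new error, which only the regression cases catch.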
As OpenAI described it, the loop is self-sustaining: "Each shipped improvement creates new production evidence for the next cycle." The agent is not converging toward a fixed optimum. It is continuing to discover the edges of its own capability — cycle by cycle, correction by correction.
Six Weeks That Changed the Launch Metrics
Perhaps the most compelling data point in the OpenAI disclosure is the six-week trajectory of a single metric: the percentage of returns reaching 75 percent correct field completion. At launch, only 25 percent of returns met that threshold. Six weeks later — after three or four improvement cycles — that number had climbed to 86 percent. No new model was trained. No additional data was labelled. No engineering sprint was staged. The system improved by doing what it had been designed to do: converting its own errors into its own fixes.
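The headline metric is a share of returns, not an average accuracy, and the distinction is easy to miss. A small sketch makes the calculation concrete, assuming "75 percent correct field completion" means at least 75 percent of a return's fields were filled correctly; the function name and sample numbers are invented.

```python
def share_meeting_threshold(returns, threshold=0.75):
    """Fraction of returns whose correct-field ratio is at least `threshold`.

    Each return is a (correct_fields, total_fields) pair.
    """
    if not returns:
        raise ValueError("no returns to score")
    passing = sum(
        1 for correct, total in returns
        if total and correct / total >= threshold
    )
    return passing / len(returns)


# Four hypothetical returns of 40 fields each.
sample = [(30, 40), (10, 40), (39, 40), (20, 40)]
print(share_meeting_threshold(sample))  # 0.5
```

Two of the four sample returns clear the bar, so the metric reads 50 percent even though the average field accuracy across the sample is higher than that.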
This trajectory reframes the standard enterprise AI evaluation framework. Enterprise software buyers typically assess AI tools at a fixed point in time: a vendor demo, a proof-of-concept, a pilot review. Tax AI suggests that the relevant evaluation criterion is not "how good is this system today" but "how fast does this system get better over time." A system that starts at 40 percent accuracy and gains 10 percentage points per month draws level with a system fixed at 80 percent after four months and, with accuracy capped at 100 percent, hits that ceiling after six. Over twelve months it is the more valuable system.
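The comparison between an improving system and a static one can be checked with a few lines of arithmetic. This sketch assumes a steady gain in percentage points each month and a hard ceiling of 100 percent; both function names are illustrative.

```python
def level_after(start, gain_per_month, months, cap=100.0):
    """Accuracy after `months` of steady gains, never exceeding `cap`."""
    return min(cap, start + gain_per_month * months)


def months_to_overtake(start, gain_per_month, static_level, cap=100.0):
    """Months until the improving system strictly exceeds the static one; None if never."""
    if start > static_level:
        return 0
    if gain_per_month <= 0 or static_level >= cap:
        return None
    level, months = start, 0
    while level <= static_level:
        level = min(cap, level + gain_per_month)
        months += 1
    return months


print(level_after(40, 10, 4))           # 80: level with the static system
print(months_to_overtake(40, 10, 80))   # 5: first month strictly ahead
print(level_after(40, 10, 12))          # 100.0: capped, not 160
```

The cap matters: without it, twelve months of 10-point gains from a 40 percent start would imply an impossible 160 percent.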
That insight is reshaping how frontier agentic systems are being deployed across the industry. Claude Opus 4.8's enterprise deployment strategy similarly bets on trust accumulation over time rather than headline benchmark performance — a convergence toward the same underlying principle: self-improving reliability compounds more durably than point-in-time accuracy.
A $37 Billion Market Facing Its Inflection Point
The global tax preparation services market is valued at approximately $34.9 billion in 2025 and is projected to reach $36.9 billion in 2026, according to Research and Markets, with a compound annual growth rate of 5.8 percent expected to accelerate to 7.7 percent through 2030. The driver of that acceleration is AI adoption — the very technology whose first major deployment at scale this article documents.
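The growth rates quoted here can be sanity-checked against the market sizes using the standard compound annual growth rate formula. The sketch below uses the $34.9bn and $36.9bn figures from this paragraph and the $49.7bn 2030 projection from Table 2; the function name is illustrative.

```python
def cagr(start_value, end_value, years):
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


print(f"{cagr(34.9, 36.9, 1):.1%}")  # 2025 to 2026: 5.7%
print(f"{cagr(36.9, 49.7, 4):.1%}")  # 2026 to 2030: 7.7%
```

The 2026 to 2030 rate matches the 7.7 percent cited. The single-year figure comes out at 5.7 percent rather than 5.8, a gap that may simply reflect rounding in the published market sizes.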
The AI tax technology sub-segment is growing faster still. Cloud-based AI tax solutions are projected to expand at 28 percent annually, reaching $2.3 billion by 2026. The incumbents, Wolters Kluwer, Avalara, and Thomson Reuters, collectively hold approximately 48 percent market share in the compliance-automation segment. What they do not have, as yet, is a deployed self-improving agent that learns from practitioner corrections in real time.
The broader U.S. accounting and tax preparation industry is projected to generate approximately $252 billion in revenue in 2026, employing more than 1.29 million workers across 120,000-plus firms. The dominant structural trends reshaping this market are AI compression of compliance timelines, a shift from compliance to advisory revenue, and a severe talent pipeline shortage, each of which the Tax AI deployment directly addresses.
Table 2: AI and Tax Preparation — Key Market Metrics (2025–2030)
| Segment / Metric | 2025 Value | 2026 Estimate | 2030 Projection |
|---|---|---|---|
| Global tax preparation services market | $34.9bn | $36.9bn | $49.7bn |
| AI tax technology sub-market (cloud-based) | ~$1.8bn | ~$2.3bn | ~$7bn+ |
| US accounting & tax industry revenue (NAICS 54121) | ~$240bn | ~$252bn | N/A |
| Cloud-based AI tax solution growth rate (CAGR) | N/A | 28% yr/yr | Sustained 20%+ |
| Accountants reporting AI major impact on profession | N/A | 46% | Est. 70%+ |
| WEF estimate of fiscal tasks automatable by 2030 | N/A | ~50% | ~50% (baseline) |
| Firms with advanced AI integration: extra billable hrs | N/A | +21% per staff | Rising |

Sources: Research and Markets (2026), Intel Market Research (2026), VantaInsights (2026), CPA Trendlines (2026), World Economic Forum (2023).
The Advisory Pivot: What Happens to the Accountant?
The most politically charged question in the Tax AI story is the most human one: what happens to the people whose jobs involve the tasks that the system now performs? The honest answer, based on the evidence produced by this deployment, is more nuanced than either the optimistic or pessimistic narratives suggest.
The senior accountant cited in the OpenAI post — who moved from 180 hours of tax preparation to 15 hours in a single season — is the data point that the AI industry will use as its headline. But the detail that follows deserves equal attention: she used the freed-up time to call every client and walk them through their return. That is advisory work — the highest-margin, least automatable, and most client-retention-relevant activity an accounting professional performs. She did not lose work; she was promoted by the technology into a more valuable category of it.
This aligns with the broader professional consensus. In the Thomson Reuters 2025 Future of Professionals Report, respondents estimated that AI would save them an average of five hours per week in the coming year. The Accounting Today survey found that 46 percent of accountants believe AI has already had a major impact on the profession, with a further 30 percent expecting it within two years.
The more uncomfortable counterpoint comes from labour market modelling. The World Economic Forum has estimated that approximately 50 percent of current fiscal and accounting tasks may be automatable by 2030. If the self-improving agent architecture demonstrated by Tax AI proves as scalable as OpenAI's post implies, that timeline may compress. Automation forecasts tend to draw a line between routine data entry and judgment tasks, and K-1 reconciliation, carryover logic, and cross-schedule consistency checks sit on the judgment side. Tax AI is crossing that line.
The Deal Structure: A New Template for AI Partnerships
Among the financial community, the detail of the partnership structure will attract as much attention as the technical results. OpenAI acquired an equity stake in Thrive Holdings in December 2025 and then co-developed a production AI system with the company's engineering team. The resulting intellectual property — the Tax AI product, its codebase, its trained eval harness — belongs to Thrive Holdings, not to OpenAI.
This is a significant departure from the terms that typically govern enterprise AI partnerships. When a cloud AI provider licenses models to a corporate customer, the model, the training improvements derived from usage, and in many cases the synthetic data generated during deployment accrue to the provider. The Thrive arrangement inverts this: OpenAI's engineers worked as embedded collaborators building IP for someone else's balance sheet, in exchange for equity exposure to the upside that IP creates.
Whether this template can scale — whether OpenAI intends to replicate it across other verticals — is a question the company has not yet answered publicly. But the signal is clear: for domain-specific deployments where practitioner access and production data are the scarcest resources, equity-aligned co-development may yield higher-quality agents than the arm's-length API model that has characterised most enterprise AI rollouts to date. The Google Universal Cart agentic commerce deployment similarly relied on deeply embedded operator relationships rather than API-only integration to achieve production-grade accuracy.
A Blueprint, Not a Monopoly: What Competitors Will Do Next
OpenAI's decision to publish the full technical architecture of Tax AI's self-improvement loop is not philanthropy. It is a calculated move in the competitive dynamics of enterprise AI adoption. By demonstrating that a self-improving agent can be built on Codex in six months, the company is making an argument to every CFO, managing partner, and chief technology officer evaluating agentic AI platforms: this is what frontier infrastructure looks like when it is used seriously.
The response from incumbents will be rapid. Intuit has been building AI-assisted preparation features into TurboTax and ProConnect for several years, but its architecture does not, as of the date of writing, include a production self-improvement loop of the kind Tax AI describes. Thomson Reuters' Checkpoint Edge deploys AI for tax research and compliance guidance but remains a retrieval-augmented tool rather than an agentic one.
The more interesting competitive pressure may come from below. Emerging challengers such as TaxBit and Fonoa are building from clean-slate architectures that do not carry the legacy engineering debt of a twenty-year-old compliance suite. For them, the self-improvement pattern described in the OpenAI post is not a retrofit problem. It is a design choice they can make from the first line of code.
Beyond Tax: The Industries Now in Scope
Tax preparation is one of the more technically defensible sectors for this class of AI deployment precisely because its rules are codified, its failure modes are measurable, and its practitioners are highly motivated to correct errors immediately. But those properties — clear ground truth, motivated expert feedback, bounded task scope — are not unique to tax.
Legal document review, medical coding, insurance underwriting, regulatory compliance, and mortgage origination are all characterised by the same combination of structured rules, expert correction signals, and high cost of manual processing. OpenAI's post is explicit about the generalisability of the pattern: rental properties, it says, are "emblematic of a broader reusable pattern: using production artifacts and traces to improve an agent's capabilities." The framework is a template for any domain where expert practitioners make measurable, correctable judgments on AI-generated outputs.
For investors in professional services technology, this is the most important sentence in the OpenAI post. It is a declaration that the company intends to industrialise this architecture. The total addressable market for self-improving professional-services agents, counting the U.S. professional services industry alone, runs to hundreds of billions of dollars annually. Those tracking the agentic AI ecosystem should note that the leading agentic AI conferences in 2026 have already moved from theoretical architectural debate to case-study-driven deployment sessions, a sign that the market has shifted from exploring to executing.
Why This Matters for Industry Stakeholders
For enterprise software buyers, Tax AI resets the evaluation criteria for agentic systems. The relevant metric is no longer point-in-time accuracy but improvement velocity — how quickly the system compounds its own performance from production evidence. A self-improving agent that starts weaker but compounds faster will outperform a static system within months.
For investors, the Thrive partnership structure introduces a new equity-aligned co-development model that captures upside through ownership rather than licensing. If OpenAI replicates this across verticals — legal, medical, insurance, compliance — it creates a portfolio of domain-specific agentic platforms with compounding IP value rather than a single horizontal API business.
For regulators, the governance questions raised by self-modifying production software in a regulated industry are live and unresolved. If a system is continuously rewriting its own code in production, who owns the change management process? Who approves the regression tests? What constitutes a material change requiring regulatory disclosure? These questions will define the compliance frameworks for agentic AI across every regulated professional domain in the years ahead.
Forward Outlook
Disclosure: The following section contains forward-looking statements based on current market analysis. Actual outcomes may differ materially from projections.
Tax AI is not the end of accountants, nor is it the end of tax software incumbents. It is the moment at which the floor of what enterprise AI can credibly claim to deliver moved to a materially higher level. Before this deployment, a self-improving agent that codes its own fixes from production errors was a plausible architectural vision that had not been demonstrated at scale in a regulated professional-services environment. After it, the question is not whether such systems can be built — it is how quickly every other professional domain will see the same architecture applied to its own practitioners, its own documents, its own correction loops.
The accounting firms of Crete Professional Alliance did not sign up for an AI experiment. They signed up for a tool that would help their staff prepare returns faster. What they got, without fully knowing it, was a front-row seat at the first industrial deployment of a software system that improves itself. That, in the end, is the appropriate frame for thinking about what self-improving agents mean for professional work: not displacement, but elevation. The firms that grasp this earliest will compound their advantage in exactly the same way that Tax AI compounded its accuracy — cycle by cycle, correction by correction, season by season.
Reuters, AP, Bloomberg, and the Financial Times are expected to follow this deployment with independent analysis as the 2026 tax season data becomes available for third-party review through Crete Professional Alliance's published outcomes report, expected later this year.
Sources and Further Reading
[1] OpenAI — "Building Self-Improving Tax Agents with Codex" (May 27, 2026)
[2] OpenAI — Codex Product Page
[3] Research and Markets — "Tax Preparation Services Market Size & Forecast to 2030" (2026)
[4] World Economic Forum — "The Future of Jobs Report 2023"
[5] Thomson Reuters — "The Impact of AI on the Tax and Accounting Profession" (2025)
[6] Intuit — TurboTax & ProConnect AI Features
[7] Thomson Reuters — Checkpoint Edge Product
[8] Wolters Kluwer — CCH Axcess Tax Platform
[9] VantaInsights — "Accounting Industry Trends: Data & Market Analysis 2026"
[10] Avalara — AI Compliance Automation Platform
Sources include company disclosures, regulatory filings, analyst reports, and industry briefings.
About the Author
Sarah Chen
AI & Automotive Technology Editor
Frequently Asked Questions
What is OpenAI Tax AI and how does it use Codex to improve itself?
Tax AI is a self-improving AI system co-built by OpenAI and Thrive Holdings, deployed across 30+ accounting firms via Crete Professional Alliance during the 2026 tax season. It uses a three-layer architecture: a document extraction and mapping layer, a production trace recorder that logs every action and accountant correction, and an OpenAI Codex-driven repair loop. When accountants correct errors, those corrections become structured failure signals. Codex then receives a bounded task — fix the code that produced the wrong answer using the production trace as context — and ships targeted code repairs that pass a regression test suite. The system does not retrain its underlying model; it rewrites its own extraction and mapping code from real-world errors.
What accuracy and efficiency results did Tax AI achieve in the 2026 deployment?
Over a single tax season, Tax AI processed 7,000 complex U.S. tax returns (1040 and 1041 filings) across 30+ accounting firms. Key results: returns reaching 75% correct field completion rose from 25% at launch to 86% after six weeks of self-improvement cycles; draft accuracy reached up to 97% in best-in-class document categories; tax preparation time was reduced by approximately one-third; throughput across pilot firms increased by 50%; one senior accountant reduced her annual preparation workload from 180 hours to 15 hours. No additional model training or data labelling was required — improvements came solely from the self-repair loop.
Who owns the Tax AI intellectual property — OpenAI or Thrive Holdings?
Thrive Holdings owns the Tax AI intellectual property, including the product codebase and trained evaluation harness. This is a structural inversion of the standard AI partnership model, in which the cloud AI provider typically retains training improvements and data generated during deployment. OpenAI acquired an equity stake in Thrive Holdings in December 2025 and contributed engineers as embedded collaborators, building IP for Thrive's balance sheet in exchange for equity upside in the business the IP creates. OpenAI has not yet stated publicly whether it intends to replicate this equity-aligned co-development structure in other verticals.
What is harness engineering and why does OpenAI consider it central to agentic AI deployment?
Harness engineering is an emerging discipline in agentic AI deployment that treats the infrastructure surrounding the model — eval harnesses, production trace recorders, failure classifiers, regression test suites — as equally important as the model itself. In Tax AI's architecture, the eval harness is what converts OpenAI Codex from a powerful but passive tool into an autonomous engineering resource. Without the trace recorder and failure classifier, Codex has no structured input and cannot scope its repair tasks. With them, it can receive precisely bounded software engineering tasks with measurable success criteria. OpenAI's post frames harness engineering as the key discipline distinguishing production agentic deployments from prototype demonstrations.
Which industries beyond tax are in scope for the self-improving agent architecture?
OpenAI's post explicitly frames the Tax AI self-improvement pattern as a reusable template applicable to any domain characterised by codified rules, expert correction signals, and high cost of manual processing. Sectors directly in scope include legal document review, medical coding, insurance underwriting, regulatory compliance, and mortgage origination. The prerequisite conditions — clear ground truth for failure detection, practitioners motivated to correct AI errors immediately, and bounded task scope for code repair — are present in all of these markets. OpenAI describes rental property schedule handling as 'emblematic of a broader reusable pattern,' signalling that the company views the Tax AI architecture as a platform rather than a product.