tsma – Regression Defense Line for Legacy Code

If you want to refactor legacy code with no tests using AI, if your LLM writes half the tests and stops, if you want to track coverage mechanically while controlling the agent – tsma builds that defense line.

How Do You Refactor Code with No Tests?

You inherited a 100,000-line legacy codebase. There are no tests. You want to refactor, but touching anything might break something. Writing tests requires understanding the code, and understanding the code requires documentation – which doesn’t exist either.

Nobody touches it. It rots further.

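Many legacy codebases are stuck in this deadlock. By commonly cited estimates, 60-80% of Fortune 500 IT budgets go to legacy maintenance. In Stripe’s 2018 Developer Coefficient survey, developers reported spending roughly 42% of their time dealing with technical debt and bad code.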

What if an LLM could write the tests for you?


The Problems When You Hand Tests to an LLM

Ask an LLM to “write tests for this function” and it produces something. The problems are threefold.

First, it doesn’t know where to start. When there are 527 functions, do you go in order from #1? Start with the most critical? There’s no criterion.

Second, you can’t verify test quality. The LLM’s tests pass. But are they actually verifying the function’s behavior, or are they empty shells that just call the function with no assert? You’d have to read each one manually to know.

Third, without feedback, coverage from LLM-written tests plateaus at around 60-70%. In the empirical study by Schäfer et al. (2023), LLM-generated tests reached a median of 70.2% statement coverage and 52.8% branch coverage. Just saying “test this function” won’t reach 100% branch coverage. You need to tell the LLM which branches are missing so it can fill the gaps.

It’s not that LLMs can’t write tests. The problem is the absence of a structure that tells the LLM what to write and how well it wrote it.


tsma: A Test Rail Driven by One Command

tsma is a CLI tool that indexes every function in a project, detects test presence, measures coverage, and gives precise feedback to LLM agents.

The agent needs to know exactly one command:

$ tsma next

This single command drives the entire loop:

$ tsma next          # Shows the next untested function
  → Write a test
$ tsma next          # Detects the new test, runs it, measures coverage
  → 100%? PASS, move to the next function
  → <100%? Reports uncovered branches with line numbers
$ tsma next          # Re-measures the revised test
  → Whether improved or not, marks DONE and moves on

Repeat until “All functions complete!” appears.
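
The loop above can be read as a small per-function state machine. The sketch below is an illustration only, not tsma's actual implementation: the `Status` values reuse the labels from this article, while the `RETRY` state, the `Step` function, and its parameters are hypothetical names. The loop description says the second measurement marks DONE either way; this sketch assumes that a retry reaching 100% is recorded as PASS, which may differ from the real tool.

```go
package main

import "fmt"

// Status is a hypothetical per-function state using the article's labels.
type Status string

const (
	Todo  Status = "TODO"
	Retry Status = "RETRY" // hypothetical: feedback was given, one revision allowed
	Pass  Status = "PASS"
	Done  Status = "DONE"
)

// Step advances one function by one `tsma next` call.
// testExists says whether a test was detected; coverage is a percentage.
func Step(s Status, testExists bool, coverage float64) Status {
	switch s {
	case Todo:
		if !testExists {
			return Todo // nothing to measure yet, keep showing the function
		}
		if coverage >= 100 {
			return Pass
		}
		return Retry // report uncovered branches, allow one more attempt
	case Retry:
		if coverage >= 100 {
			return Pass // assumption: a successful retry counts as PASS
		}
		return Done // best effort accepted, move on
	}
	return s // PASS and DONE are terminal
}

func main() {
	s := Todo
	s = Step(s, true, 65) // first measurement is below 100%
	fmt.Println(s)
	s = Step(s, true, 65) // the revised test did not improve coverage
	fmt.Println(s)
}
```

Capping the number of retries is what guarantees the loop terminates: every function ends in PASS or DONE after at most two measurements.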


Validated on 527 Functions

tsma was applied to a real Go project with 527 functions.

Result                        Count   Ratio
PASS (100% branch coverage)   246     46.7%
DONE (best-effort)            281     53.3%
TODO (unprocessed)            0       0%

246 functions reached 100% branch coverage. The remaining 281 did not reach 100%, but tests were written to the extent possible.

Why can’t some functions reach 100%?


Functions That Reach 100% and Those That Don’t

Whether a function can reach 100% branch coverage depends on how it receives its dependencies.

Interface (mockable) – 100% achievable:

type Handler struct {
    svc AuthSvc              // interface -- replaceable with a mock
}

Inject a mock in tests and you can control every path:

svc := mocks.NewMockAuthSvc(ctrl)
svc.EXPECT().Login(...).Return(result, nil)   // success path
svc.EXPECT().Login(...).Return(nil, err)      // failure path
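
To make the idea concrete without a mocking library, here is a minimal self-contained sketch that uses a hand-written fake instead of the generated mock shown above. The `Login` signature, the `HandleLogin` method, and the status codes are assumptions made for illustration; only `Handler`, `svc`, and `AuthSvc` come from the snippet above.

```go
package main

import (
	"errors"
	"fmt"
)

// AuthSvc mirrors the interface named in the article; this Login
// signature is assumed for the sketch.
type AuthSvc interface {
	Login(user, pass string) (string, error)
}

type Handler struct {
	svc AuthSvc // interface, so tests can substitute a fake
}

// HandleLogin (hypothetical) has three paths that depend on what svc returns.
func (h *Handler) HandleLogin(user, pass string) int {
	token, err := h.svc.Login(user, pass)
	if err != nil {
		return 401 // failure path
	}
	if token == "" {
		return 500 // defensive path: success without a token
	}
	return 200 // success path
}

// fakeAuth returns whatever it was configured with.
type fakeAuth struct {
	token string
	err   error
}

func (f fakeAuth) Login(_, _ string) (string, error) { return f.token, f.err }

var errDenied = errors.New("denied")

func main() {
	ok := &Handler{svc: fakeAuth{token: "t"}}
	bad := &Handler{svc: fakeAuth{err: errDenied}}
	fmt.Println(ok.HandleLogin("u", "p"), bad.HandleLogin("u", "p"))
}
```

Because the fake decides both the return value and the error, each of the three paths in `HandleLogin` can be forced from a test.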

Concrete type (not mockable) – 100% impossible:

type Handler struct {
    svc *service.SMSImportService    // struct pointer -- not replaceable
}

The real implementation runs with internal dependencies on databases, external APIs, etc. You can’t force specific errors or specific return values. Branches that depend on those results are unreachable by unit tests.

tsma’s response: after uncovered-branch feedback, the agent gets one more attempt. If the branches are still unreachable, tsma accepts DONE. This isn’t a tool limitation; it reflects the code’s testability.

This is the legacy code dilemma systematized by Feathers (2004): to change the code safely you need tests, but to add tests you often need to change the code. The way out is to break dependencies, for example by introducing interfaces and injecting them (DI). Introducing interfaces would make 100% possible, but that means modifying the original code.
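
As a rough illustration of what "introducing interfaces" means for the concrete-type case, the sketch below extracts a one-method interface so the handler no longer depends on the struct pointer. Everything except the `Handler` and `SMSImportService` names is hypothetical: the `Import` method, the `Importer` interface, and `Run` are invented for this example and are not taken from the project tsma was run on.

```go
package main

import (
	"errors"
	"fmt"
)

// SMSImportService stands in for the concrete service.
// Its Import method is a hypothetical placeholder.
type SMSImportService struct{}

func (s *SMSImportService) Import(path string) (int, error) {
	// Imagine database and external API calls here.
	return 0, errors.New("no database in this sketch")
}

// Importer is the extracted seam: only the method the handler calls.
type Importer interface {
	Import(path string) (int, error)
}

type Handler struct {
	svc Importer // was: svc *service.SMSImportService
}

// Run (hypothetical) branches on the result of the dependency.
func (h *Handler) Run(path string) string {
	n, err := h.svc.Import(path)
	if err != nil {
		return "error"
	}
	return fmt.Sprintf("imported %d", n)
}

// stubImporter lets a test choose the result and the error.
type stubImporter struct {
	n   int
	err error
}

func (s stubImporter) Import(string) (int, error) { return s.n, s.err }

var errStub = errors.New("stub failure")

func main() {
	prod := &Handler{svc: &SMSImportService{}} // the real type still satisfies the seam
	test := &Handler{svc: stubImporter{n: 3}}
	fmt.Println(prod.Run("a.csv"), "|", test.Run("a.csv"))
}
```

The production wiring barely changes because `*SMSImportService` already satisfies `Importer`. The cost is the edit to `Handler` itself, which is exactly the change a test-first rule forbids before tests exist.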


Feedback Dramatically Transforms LLM Tests

tsma’s core value isn’t indexing or coverage measurement. It’s telling the LLM exactly which branches are uncovered, by line number.

Without feedback:

"Write tests for the ListContracts function"
→ LLM tests only the happy path
→ Coverage 60-70%

With feedback:

"Write tests for the ListContracts function"
→ Coverage 65% (11/17)
→ UNCOVERED:
    line 41 -- if params.Status != nil
    line 44 -- if params.BuildingId != nil
    line 70 -- if err != nil (CountSummary)
→ LLM adds tests covering exactly those branches
→ Coverage 100%

Same LLM. The only difference is the presence of feedback. Three uncovered-branch line numbers separate 65% from 100%. CoverUp (Pizzorno & Berger, 2024) demonstrated the same principle empirically. By repeatedly inserting coverage analysis results into prompts and focusing the LLM’s attention on uncovered lines, it achieved a median module-level line coverage of 81%, 19pp above the 62% of its comparison baseline, CodaMosa.
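
For Go, one plausible way to produce this kind of line-number feedback is to read the profile written by `go test -coverprofile` and list the blocks with a zero hit count. The sketch below is an assumption about how such feedback could be derived, not a description of tsma's internals. `ParseUncovered` and `Uncovered` are hypothetical names. Note that a Go cover profile records basic blocks of statements, so it only approximates branch coverage.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// Uncovered describes one profile block that was never executed.
type Uncovered struct {
	File      string
	StartLine int
	EndLine   int
}

// ParseUncovered reads cover profile text. After the "mode:" header,
// each line has the form
//   file.go:startLine.startCol,endLine.endCol numStatements hitCount
func ParseUncovered(profile string) ([]Uncovered, error) {
	var out []Uncovered
	sc := bufio.NewScanner(strings.NewReader(profile))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "mode:") {
			continue
		}
		// Split at the last colon so paths that contain ':' still parse.
		i := strings.LastIndex(line, ":")
		if i < 0 {
			return nil, fmt.Errorf("malformed profile line: %q", line)
		}
		u := Uncovered{File: line[:i]}
		var startCol, endCol, stmts, count int
		if _, err := fmt.Sscanf(line[i+1:], "%d.%d,%d.%d %d %d",
			&u.StartLine, &startCol, &u.EndLine, &endCol, &stmts, &count); err != nil {
			return nil, fmt.Errorf("malformed profile line %q: %w", line, err)
		}
		if count == 0 {
			out = append(out, u)
		}
	}
	return out, sc.Err()
}

func main() {
	// Made-up profile data for illustration.
	profile := "mode: set\n" +
		"svc/list.go:41.2,43.3 1 0\n" +
		"svc/list.go:44.2,46.3 1 1\n"
	blocks, err := ParseUncovered(profile)
	if err != nil {
		panic(err)
	}
	for _, b := range blocks {
		fmt.Printf("UNCOVERED %s lines %d-%d\n", b.File, b.StartLine, b.EndLine)
	}
}
```

Pairing each reported line range with the source text of its first line would give output close to the UNCOVERED listing shown earlier.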


Progress Survives Even When the Agent Dies

LLM agents crash. Token limits, network errors, session drops. You can’t process 527 functions in a single session.

tsma persists progress to .tsma/session.json.

$ tsma status

527 functions
PASS:  246 (46.7%)
DONE:  281 (53.3%)
TODO:    0 (0.0%)

If the agent dies at function #200? A new agent runs tsma next and picks up from #201. session.json is the checkpoint.

Multiple agents can take turns without conflicts. Each function is atomic.
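
A checkpoint file like this can be sketched in a few lines. The schema below is hypothetical (the article does not document the real layout of session.json), as are the `Session`, `Entry`, `Save`, `Load`, and `NextTodo` names. The write-then-rename step is one common way to keep a crash mid-write from corrupting the previous checkpoint; whether tsma does this is not stated.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// Entry and Session are a hypothetical schema, not tsma's real format.
type Entry struct {
	Func   string `json:"func"`
	Status string `json:"status"` // "TODO", "PASS", or "DONE"
}

type Session struct {
	Entries []Entry `json:"entries"`
}

// Save writes to a temp file and renames it over the target, so an
// interrupted write leaves the old checkpoint in place.
func Save(path string, s Session) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// Load reads the checkpoint back.
func Load(path string) (Session, error) {
	var s Session
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(data, &s)
	return s, err
}

// NextTodo returns the index of the first TODO entry, or -1 if none remain.
func NextTodo(s Session) int {
	for i, e := range s.Entries {
		if e.Status == "TODO" {
			return i
		}
	}
	return -1
}

func main() {
	dir, err := os.MkdirTemp("", "tsma-sketch")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, ".tsma", "session.json")
	before := Session{Entries: []Entry{
		{Func: "A", Status: "PASS"},
		{Func: "B", Status: "DONE"},
		{Func: "C", Status: "TODO"},
	}}
	if err := Save(path, before); err != nil {
		panic(err)
	}

	// A fresh agent starts here with no memory of the previous session.
	resumed, err := Load(path)
	if err != nil {
		panic(err)
	}
	fmt.Println("resume at index", NextTodo(resumed))
}
```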


The Session Is a Cache; Source Files Are the Truth

One of tsma’s design principles: the session is a cache, and source files are the source of truth.

If you delete a test file, even if session.json records it as PASS, that function reverts to TODO. The session never drifts from reality.

Principle:
  Even if session.json says "PASS"
  If the test file is missing → TODO
  If the source file changed → re-measurement target
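
The principle can be expressed as a reconciliation step that runs before any cached status is trusted. This is a sketch under assumptions: how tsma detects a changed source file is not described here, so the content hash, the `Record` fields, and the `REMEASURE` label are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Record is a hypothetical cached entry for one function.
type Record struct {
	Status     string // cached "PASS", "DONE", or "TODO"
	SourceHash string // hash of the function source when it was measured
}

// Hash fingerprints source text so changes can be detected later.
func Hash(src string) string {
	sum := sha256.Sum256([]byte(src))
	return hex.EncodeToString(sum[:])
}

// Reconcile returns the status to act on, preferring what is on disk
// over what the cache claims.
func Reconcile(rec Record, testFileExists bool, currentSource string) string {
	if !testFileExists {
		return "TODO" // the cache says otherwise, but the test is gone
	}
	if Hash(currentSource) != rec.SourceHash {
		return "REMEASURE" // hypothetical label for a re-measurement target
	}
	return rec.Status
}

func main() {
	src := "func Add(a, b int) int { return a + b }"
	rec := Record{Status: "PASS", SourceHash: Hash(src)}

	fmt.Println(Reconcile(rec, true, src))              // cache still valid
	fmt.Println(Reconcile(rec, false, src))             // test file deleted
	fmt.Println(Reconcile(rec, true, src+" // edited")) // source changed
}
```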

Instructions for the LLM Agent

The agent needs exactly 6 lines of instructions:

1. Run tsma next
2. If TODO -- read the function and write a test
3. If the test fails -- read the error and fix the test
4. If uncovered branches are shown -- add tests covering those branches
5. If PASS/DONE -- the next function is shown automatically
6. Repeat until "All functions complete!" appears

The only command the agent needs to know is tsma next. The CLI constrains the rest.


Trains and Tracks

Vibe coding is a train. It’s fast. But without tracks, it derails.

Every AI coding tool is focused on making the train faster. Bigger models, smarter agents, better prompts. But the faster the train goes, the worse the derailment.

tsma is the track. The LLM generates tests (Neural), and the CLI defines “this far and no further” (Symbolic Constraint). The LLM’s creativity stays intact, but the quality of results is enforced by the machine.

Aspect              Before                          tsma
Test writing        Human (slow) or LLM (chaotic)   LLM writes, CLI verifies
Where to start?     Human decides                   CLI determines order
Quality check       Human reviews                   CLI measures coverage
Feedback            None                            Uncovered branch line numbers
Progress tracking   None                            session.json automatic

The LLM generates freely. But it runs only on the track called tsma next.


Language Support

Language     Indexer   Test Runner             Coverage
Go           go/ast    go test                 go test -coverprofile
TypeScript   regex     npx vitest / npx jest   c8 / istanbul
Python       regex     pytest                  coverage.py

Go uses an AST parser for precise function extraction. TypeScript and Python use regex-based extraction.
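
To show why an AST gives more precise results than a regex, here is a minimal sketch of function extraction with the standard `go/parser` and `go/ast` packages. It is not tsma's indexer: `IndexFuncs`, `FuncInfo`, and the `Type.Method` naming scheme are hypothetical choices for this example.

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
)

// FuncInfo is a hypothetical index entry.
type FuncInfo struct {
	Name string // "Func" or "Type.Method"
	Line int    // line of the func keyword
}

// IndexFuncs parses Go source text and lists every function and
// method that has a body.
func IndexFuncs(filename, src string) ([]FuncInfo, error) {
	fset := token.NewFileSet()
	file, err := parser.ParseFile(fset, filename, src, parser.SkipObjectResolution)
	if err != nil {
		return nil, err
	}
	var out []FuncInfo
	for _, decl := range file.Decls {
		fn, ok := decl.(*ast.FuncDecl)
		if !ok || fn.Body == nil {
			continue // not a function, or declared without a body
		}
		name := fn.Name.Name
		if fn.Recv != nil && len(fn.Recv.List) > 0 {
			name = recvType(fn.Recv.List[0].Type) + "." + name
		}
		out = append(out, FuncInfo{Name: name, Line: fset.Position(fn.Pos()).Line})
	}
	return out, nil
}

// recvType unwraps pointer receivers; other receiver forms are not
// handled in this sketch.
func recvType(e ast.Expr) string {
	switch t := e.(type) {
	case *ast.StarExpr:
		return recvType(t.X)
	case *ast.Ident:
		return t.Name
	}
	return "?"
}

func main() {
	src := "package p\n\nfunc A() {}\n\ntype T struct{}\n\nfunc (t *T) B() {}\n"
	funcs, err := IndexFuncs("p.go", src)
	if err != nil {
		panic(err)
	}
	for _, f := range funcs {
		fmt.Printf("%s (line %d)\n", f.Name, f.Line)
	}
}
```

The parser ignores comments and string literals that merely contain the word `func`, and it distinguishes methods from functions, both of which are easy to get wrong with pattern matching.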

Generated files (*_gen.go, *.pb.go), test files, and vendor/node_modules are automatically excluded from indexing.
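
An exclusion rule of this kind can be approximated with glob patterns on the file name plus a check on directory segments. The sketch below covers only the Go patterns named above, with `*_test.go` assumed as the Go test-file pattern. The `Excluded` function is a hypothetical name, and the real tool's matching rules may differ.

```go
package main

import (
	"fmt"
	"path"
	"path/filepath"
	"strings"
)

// skipDirs and skipFiles follow the exclusions listed in the article;
// "*_test.go" is an assumption about what counts as a test file.
var (
	skipDirs  = []string{"vendor", "node_modules"}
	skipFiles = []string{"*_gen.go", "*.pb.go", "*_test.go"}
)

// Excluded reports whether a file should be left out of the index.
func Excluded(p string) bool {
	slash := filepath.ToSlash(p) // normalize separators before splitting
	for _, seg := range strings.Split(slash, "/") {
		for _, d := range skipDirs {
			if seg == d {
				return true
			}
		}
	}
	base := path.Base(slash)
	for _, pat := range skipFiles {
		// The patterns above are well formed, so the error is ignored.
		if ok, _ := path.Match(pat, base); ok {
			return true
		}
	}
	return false
}

func main() {
	for _, p := range []string{
		"svc/list.go",
		"svc/list_test.go",
		"api/user.pb.go",
		"vendor/x/y.go",
	} {
		fmt.Println(p, Excluded(p))
	}
}
```

Matching whole path segments rather than substrings keeps a file such as svc/vendor.go in the index while still skipping everything under a vendor directory.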


Installation and Usage

make install
cd your-legacy-project
tsma next

That’s all.

MIT License. github.com/park-jun-woo/tsma


References

  • Schäfer, M., Nadi, S., Eghbali, A., & Tip, F. (2023). An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation. IEEE Transactions on Software Engineering, 50(1), 85–105. arXiv:2302.06527
  • Pizzorno, J. A., & Berger, E. D. (2024). CoverUp: Coverage-Guided LLM-Based Test Generation. arXiv:2403.16218
  • Ryan, G., Jain, S., Shang, M., Wang, S., Ma, X., Ramanathan, M. K., & Ray, B. (2024). Code-Aware Prompting: A Study of Coverage-Guided Test Generation in Regression Setting using LLM. Proceedings of the ACM on Software Engineering (FSE 2024), 1(FSE), 951–971.
  • Lemieux, C., Inala, J. P., Lahiri, S. K., & Sen, S. (2023). CodaMOSA: Escaping Coverage Plateaus in Test Generation with Pre-trained Large Language Models. ICSE 2023, 919–931.
  • Feathers, M. C. (2004). Working Effectively with Legacy Code. Prentice Hall.
  • Besker, T., Martini, A., & Bosch, J. (2018). Technical Debt Cripples Software Developer Productivity. TechDebt 2018, 105–114.
  • Stripe. (2018). The Developer Coefficient.
  • U.S. Government Accountability Office. (2019). Information Technology: Agencies Need to Develop Modernization Plans for Critical Legacy Systems. GAO-19-471.
  • Tornhill, A., & Borg, M. (2022). Code Red: The Business Impact of Code Quality. TechDebt 2022, 11–20. arXiv:2203.04374
  • Peng, S., Kalliamvakou, E., Cihon, P., & Demirer, M. (2023). The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. arXiv:2302.06590

Related: Ratchet Pattern – How to Make an Agent Finish the Job – The pattern behind tsma. Why mechanical verification beats LLM judgment.

Related: Model IQ Matters Less Than Feedback Topology – Why feedback structure determines results more than model performance.

Changelog

  • 2026-05-14: Initial release