Debugging with ChatGPT: Strategies and Examples

Debugging used to feel like spelunking in the dark with a headlamp and a dog-eared stack of printouts. You still need your instincts and your tools, but you now have a new partner that answers instantly, remembers context, and never gets tired of combing logs. ChatGPT won't replace your test suite or your profiler, but it will shorten the path from symptom to root cause when used with discipline. The skill lies in how you structure the conversation, what you share, and how you validate the output.

This is a field guide drawn from real engineering work. The aim is not to paste a stack trace and hope for magic. The goal is to interrogate the problem with a thinking partner, build hypotheses, run small experiments, and keep moving.

Who benefits and where it shines

If you write or review code, you can offload parts of the diagnostic loop to ChatGPT. It is strongest in a few situations. It recognizes common error signatures across languages and frameworks. It can sketch minimal reproductions so you can isolate causes. It gives quick references to APIs, configuration defaults, and idioms without making you tab away. And it can act as a second pair of eyes that notices an off-by-one or a missed await that you no longer see after staring at the same file for an hour.

The weak spots are just as important to know. It has no direct access to your runtime. It can hallucinate library behaviors if you ask vague questions. It is poor at debugging hidden state without logs or code. And it cannot replace real observability. You still need logging, metrics, traces, tests, a profiler, and a way to run the code locally.

Good prompts look like bug reports

I model prompts after the bug reports I wish I always received: observed behavior, expected behavior, minimal snippet or stack trace, environment, and what I've tried. This reduces irrelevant guesses and makes the model's diagnosis checkable.

Here is a pattern that usually works. Start with two or three paragraphs. The first states the problem and the context. The second carries the exact error and any relevant code. The third outlines constraints or avenues you have ruled out. Then ask for two things: a prioritized list of hypotheses, and the smallest code or configuration change that would test the top hypothesis.

That last bit matters. You are not asking for a rewrite. You are asking for the smallest experiment that shifts the probability.
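
A condensed example of such a prompt, with every detail invented for illustration:

    Observed: POST /orders returns 500 for about 2 percent of requests
    since Tuesday's deploy. Expected: 201 with the order ID.
    Error: TypeError: Cannot read properties of undefined (reading 'id')
    at createOrder (orders.ts:88). Full trace pasted below.
    Environment: Node 18, Express 4, TypeScript 5, feature flag
    new-checkout enabled.
    Tried: reverting the flag (no change); cannot reproduce locally.
    Please give a prioritized list of hypotheses and the smallest code or
    config change that would test the top one.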

Example: a Node service leaks memory after a refactor

A team I worked with migrated a Node service from callbacks to async functions. A week later, memory usage climbed steadily under load and pods restarted every few hours.

We started the ChatGPT session with a crisp summary:

    Observed: memory usage grows by about 50 MB per hour under steady
    traffic. Garbage collection runs, but heap after GC trends upward.
    Expected: stable memory with a sawtooth GC pattern, no upward trend.
    Environment: Node 18, Express 4, TypeScript 5, pino logger, axios for
    HTTP calls.
    Change window: two weeks ago we replaced callback patterns with
    async/await and introduced a request-scoped context object.

We pasted a simplified route handler and a stripped heap snapshot summary. The handler created a context map per request and attached it to res.locals. The snapshot showed many retained AsyncResource and Map instances.

We asked for candidate causes ranked by likelihood, and for a minimal experiment.

The reply focused on two candidates. First, a closure capturing a long-lived object that prevented the Maps from being collected. Second, unawaited promises that left pending async resources. The model suggested a small experiment: add a finalization registry to observe the request-scoped maps, and run the service with --trace-gc and async_hooks to see whether AsyncResources persist after the response ends. It also proposed a code change to keep the context creation within the request scope and to avoid capturing external references.
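
A minimal sketch of that probe, assuming an Express service; the header name and log text are illustrative:

    import express from 'express';

    const app = express();

    // Fires only when a registered object has actually been garbage
    // collected, so silence under load means the maps are being retained.
    const collected = new FinalizationRegistry((requestId: string) => {
      console.log(`context map for request ${requestId} was collected`);
    });

    app.use((req, res, next) => {
      const ctx = new Map<string, unknown>();
      collected.register(ctx, String(req.headers['x-request-id'] ?? 'unknown'));
      res.locals.ctx = ctx;
      next();
    });

Run it under load with node --trace-gc: if heap after GC climbs while the registry stays silent, the maps are the leak.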

We ran the experiment. The registry reported that the maps were not being collected. The async hooks output showed active resources tied to the pino child logger instance we created per request and stashed in the context. Moving the child logger creation into a function that returned a plain object of bound methods, rather than holding the whole logger instance, broke the reference chain. Under the same load, heap after GC stabilized. The fix was three lines, guided by two specific observations.

The key was the structure of the conversation. We did not ask for a generic memory leak checklist. We asked for a ranked set of hypotheses and the smallest high-signal probe.

Using ChatGPT to design minimal reproductions

A minimal reproduction is the quickest way to turn speculation into facts. ChatGPT can draft the skeleton faster than you can search through docs. Give it the framework version, the exact failing behavior, and a tight constraint on dependencies. Ask for a one-file example that reproduces the problem with fake data, plus instructions to run it.

For a React hydration mismatch we chased last year, we asked for a Next.js 13 example that renders a server component with a timestamp and a client component that consumes it. The mismatch only appeared when locale-specific formatting was involved. The model produced a plain app where server-rendered dates used toLocaleString without a fixed locale. Hydration failed on browsers with non-English settings. That concrete reproduction made the fix obvious: format dates deterministically on the server or pass preformatted strings.
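
A one-file sketch of the failing shape, assuming a Next.js 13 client component with an invented name:

    'use client';

    // The timestamp is formatted once during SSR with the server's locale
    // and again in the browser with the user's locale; any difference in
    // the two strings triggers the hydration mismatch.
    export default function Timestamp({ iso }: { iso: string }) {
      const stamp = new Date(iso).toLocaleString(); // no fixed locale
      return <time dateTime={iso}>{stamp}</time>;
    }

Pinning the format, for example toLocaleString('en-US', { timeZone: 'UTC' }), or passing a preformatted string from the server makes both renders agree.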

A caution here. If the model proposes a reproduction that does not fail, say so and paste your output. Ask for the next simplest variant. You are jointly narrowing the search space. When the reproduction does fail, freeze it in a repository or a gist. It becomes a permanent test.

Turn stack traces into checklists

Stack traces are stories if you read them correctly. The line that throws is rarely the line where the bug lives. Ask ChatGPT to walk the trace from the bottom up, mapping each frame to a code location and reasoning about data flow between frames. This works best when you paste the relevant functions, not entire files, and annotate arguments with real values.

Here is a pattern I use when Python throws a KeyError inside a chain of dict accesses:

I paste the trace and the three functions above the failing line, each with a comment showing the runtime types and any logged values. Then I ask: where is the earliest point the missing key could have been introduced, and what single log statement would confirm it? The model often identifies an upstream conditional that silently skips defaulting. It suggests logging the keys of the payload at the boundary. The resulting log either confirms the missing key at ingress or points to a mutation mid-pipeline. Two messages later, we have a fix or a failing unit test for the edge case.
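
The probe itself is one line at whatever boundary the payload crosses. The same trick works in any stack; here it is sketched for an Express service, with the route name invented:

    import express from 'express';

    const app = express();
    app.use(express.json());

    // Log the payload's keys at ingress so a later missing-key failure can
    // be pinned down: either the key was absent on arrival, or a mutation
    // mid-pipeline dropped it.
    app.post('/ingest', (req, res, next) => {
      console.log('ingress keys:', Object.keys(req.body ?? {}).sort().join(','));
      next();
    });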

Make the model write probes, not patches

It is tempting to ask for a fix immediately. Resist that until you have narrowed the field. Better to ask for probes: a short snippet to log an invariant, a one-line assertion, a config toggle that changes behavior. Probes move you from guesses to facts.

On a Kafka consumer with sporadic duplication, we asked the model for a probe to validate idempotency assumptions. It recommended logging the partition and offset alongside our deduplication key, then restarting the consumer to see whether any offsets rewind during rebalancing. That single log line showed offsets jumping backwards during a particular rebalance pattern. We adjusted the commit strategy to commit synchronously before processing batches. No patch from the model, just a probe that exposed the wrong assumption.
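
Assuming the kafkajs client, the probe might look like this; topic, group, and broker names are invented:

    import { Kafka } from 'kafkajs';

    // Logging partition and offset beside the dedup key makes an offset
    // rewind during a rebalance visible as repeated (partition, offset)
    // pairs in the output.
    const kafka = new Kafka({ clientId: 'dedup-probe', brokers: ['localhost:9092'] });
    const consumer = kafka.consumer({ groupId: 'orders' });

    async function main() {
      await consumer.connect();
      await consumer.subscribe({ topic: 'orders' });
      await consumer.run({
        eachMessage: async ({ topic, partition, message }) => {
          console.log(
            `probe topic=${topic} partition=${partition} ` +
              `offset=${message.offset} dedupKey=${message.key?.toString()}`,
          );
          // ...existing processing stays unchanged
        },
      });
    }

    main().catch(console.error);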

The safety net: test first, isolate the change

For production-affecting bugs, I push the model to help sketch a test that fails before any code changes. This enforces discipline. Ask for a unit test or an integration test that captures the exact regression. Provide the current test layout and libraries. If the test is hard to isolate, ask for a determinism strategy: seeding randomness, mocking time, or intercepting network calls.

In a Rails app returning the wrong cache variant, we asked for an RSpec example that hits the endpoint with different Accept headers and asserts distinct cache keys. The model proposed a helper that sets the header and inspects Rails.cache with a custom instrumenter. The first test failed, which gave us a red bar and a clear success criterion. Only then did we consider code changes.
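
The same shape in a Jest codebase is a few lines; keyFor here is a hypothetical helper standing in for whatever computes the endpoint's cache key:

    // A rough Jest analog of that RSpec test, under the assumptions above.
    declare function keyFor(req: { accept: string }): Promise<string>;

    test('cache key varies by Accept header', async () => {
      const jsonKey = await keyFor({ accept: 'application/json' });
      const htmlKey = await keyFor({ accept: 'text/html' });
      expect(jsonKey).not.toEqual(htmlKey); // red bar first, then the fix
    });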

When the model is wrong, make it prove itself

Every model sometimes speaks with unwarranted confidence. Your job is to separate fluent nonsense from useful direction. Two habits help. First, ask it to cite the specific line of code or documentation it relies on, by quoting the line. Second, ask for a counterexample. If it claims that a Go http.Client reuses connections automatically, ask it to show client code that defeats reuse and explain why.

If it cannot ground the claim in your code or in a direct quote from the standard library, treat the answer as a hypothesis, not a certainty. Proceed only after an experiment supports it.

Working with logs and traces

ChatGPT can help parse messy logs, but only when you provide enough structure. Pasting a 500-line log dump rarely helps. Curate a slice that covers one request or one minute around the event. Add a one-line glossary for fields that are domain-specific. The model can then cluster events, reconstruct timelines, and point out anomalies such as jitter in response times or a recurring null field.

With traces, export a single trace with spans, start and end times, and attributes. Ask the model to find critical-path spans and to suggest a timing probe. On a gRPC service with P95 blowing up from 120 ms to 450 ms, we pasted three traces for fast, median, and slow requests. The model noticed a particular span for a Redis call with high variance and suggested checking connection pool saturation. We added one metric, redis_client_pool_available, and saw it drop to zero during spikes. The fix was not to raise the pool size blindly, but to reduce per-request pipeline length and add backpressure. The model did not "solve" it, but its pattern matching narrowed the search in minutes that might have taken us an hour.

Refactoring and regression risk

Sometimes the bug appears after a refactor and the blame surface is wide. Use the model to plan a bisection strategy and a list of invariants to verify after each step. If you can run git bisect, ask the model to suggest a quick harness to script each step and an oracle to decide pass or fail. If bisect is impractical, ask it to list the top three risky areas introduced by the refactor, and for each risk, the cheapest runtime check.

In a service where we replaced a homegrown retry with a library, we asked for a runtime invariant: the number of retries per request should not exceed three, and jitter should stay within 0 to 200 ms. The model drafted a tiny middleware that recorded retry counts and jitter and emitted a histogram. We deployed it behind a flag and learned that our new retry policy misread the library's default of exponential backoff with a maximum of five attempts. The fix was a two-line config change, but the invariant made us confident we were done.
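
A minimal sketch of that kind of probe, with invented names and a crude in-memory histogram:

    // Record every retry's attempt number and jitter, and log loudly when
    // the policy bounds (three attempts, 0 to 200 ms jitter) are exceeded.
    const attemptHistogram: number[] = new Array(10).fill(0);

    function recordRetry(requestId: string, attempt: number, jitterMs: number) {
      attemptHistogram[Math.min(attempt, attemptHistogram.length - 1)]++;
      if (attempt > 3 || jitterMs < 0 || jitterMs > 200) {
        console.error(
          `retry invariant violated: request=${requestId} ` +
            `attempt=${attempt} jitter=${jitterMs}ms`,
        );
      }
    }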

Asking for cross-language translations of errors

Many teams now straddle languages. A Java service calls a Python batch job that triggers a Go lambda. When a serialization error pops up in one place, the cause sits three services away. ChatGPT can translate error semantics across ecosystems. Provide the producer and consumer schemas, the exact error string, and a sample payload. Ask for the minimal schema difference that would cause the error, and for a backward-compatible change.

With Avro, we handed the model a producer schema with a required field and a consumer schema that made the field optional with a default. It pointed out that the change was backward but not forward compatible, and that the error likely came from a consumer that had not deployed the new schema. It suggested bumping the producer's schema with a defaulted field and coordinating the upgrade path. Simple, but teams often miss this under pressure.
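
The safe shape of the change, sketched with an invented record: the field ships with a default, so readers on either schema version keep decoding.

    {
      "type": "record",
      "name": "Order",
      "fields": [
        { "name": "id", "type": "string" },
        { "name": "channel", "type": ["null", "string"], "default": null }
      ]
    }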

Two lists worth keeping near your terminal

    A disciplined prompt skeleton:
    - Observed vs. expected behavior, with numbers and dates.
    - Minimal snippet or stack trace with exact values.
    - Environment: versions, OS, config flags.
    - What you have tried and what changed recently.
    - A request for hypotheses ranked by likelihood, and the smallest probe
      to test the top one.

    A compact decision tree for when an answer looks plausible:
    - Ask for the exact line or doc quote supporting the claim.
    - Request a counterexample that would falsify it.
    - Run the smallest test or log probe that could confirm it.
    - If confirmed, ask for the least risky fix and a test that would have
      caught it.
    - If refuted, move to the next hypothesis and repeat.

These are the only lists in this article for a reason. Most of the work lives in narrative, not checkboxes.

Guardrails when sharing code and data

Be careful with proprietary material. If you cannot paste the code, you can still describe behavior precisely. Replace secrets and identifiers with placeholders. If you must share logs, redact credentials and user details, and consider synthesizing payloads that preserve structure but not content. Ask for refactoring suggestions in terms of patterns rather than exact code. The quality of guidance drops a little, but the risk of leakage drops to near zero.

On regulated workloads, I keep the model at arm's length. I use it to draft test harnesses, review open source library usage, or sketch performance experiments, not to inspect customer data.

The performance angle: profiling with a conversational partner

For performance bugs, pair the model with real profiles. Export a CPU profile, heap profile, or flamegraph, and paste the hottest stacks and their percentages. Ask the model what knobs are available in your runtime, what contention patterns match the shape you see, and what microbenchmarks might reveal the truth.

On a Go service with a mysterious 15 to 20 percent CPU increase after a minor release, we pasted the top stacks. The flamegraph showed mutex contention in JSON encoding and a sudden rise in allocations in a hot path. The model suggested a quick A/B: replace encoding/json with a precomputed encoder for the hot struct, and cache a bytes.Buffer per worker to reduce allocations. It also reminded us of GOMAXPROCS settings that had changed on the node pool. Ten minutes later we had a microbenchmark and could see that the allocator churn, not the mutex, was to blame. We kept the buffer pool and reverted an unnecessary JSON tag that forced reflection. CPU usage fell back to baseline.

The point is not that the model knew your codebase. It knew patterns and trade-offs, and it made you faster at testing them.

Teaching junior developers to debug with ChatGPT

Early-career engineers often cargo-cult fixes from Stack Overflow or Slack, patching symptoms without understanding causes. ChatGPT can be used to teach, not to shortcut. When pairing, have the junior engineer write the first prompt. Ask them to predict the top two hypotheses before reading the reply. Then compare. When the model proposes a fix, ask it to explain why the bug manifests only under certain conditions. Ask for a failing test. Make the loop explicit: hypothesis, prediction, test, result.


In a bootcamp session, we used this approach on a flaky Jest test that passed locally and failed in CI. The model proposed three lines of attack. First, time-dependent logic and fake timers. Second, reliance on file system case sensitivity. Third, a race with unawaited async cleanup. The student guessed time issues. We added a fixed Date.now mock, and the test still failed in CI. The model then suggested checking the CI image's default locale and case sensitivity. The repository contained both login.test.ts and Login.test.ts. macOS did not care, Linux did. Renaming the file ended the flake. The lesson stuck.
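
The probe for the first hypothesis was roughly two lines:

    // Pin the clock so time-dependent logic cannot be the source of the
    // flake; the specific date is arbitrary.
    const fixedNow = new Date('2023-06-01T12:00:00Z').getTime();
    jest.spyOn(Date, 'now').mockReturnValue(fixedNow);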

Advanced moves: constraint prompts and invariants

When you need rigor, constrain the model. Tell it not to suggest code changes unless it provides a falsifiable hypothesis and a single test. Ask it to offer two alternative causes that would produce the same symptom but require different probes. This forces it to branch and helps you avoid confirmation bias.

You can also ask it to state an invariant in plain language, then as an assertion or property-based test. For example, for a pagination API: for any page size N and any two consecutive page tokens T1 then T2, the sets of returned item IDs must be disjoint, and the union across K pages must equal the first N times K results in order. With that invariant, the model can help write a property test using generated data. Bugs surface quickly once you move beyond canned examples.
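
A sketch of the disjointness half of that property, assuming the fast-check library and a hypothetical fetchPage client:

    import fc from 'fast-check';

    // For generated page sizes, fetch two consecutive pages and check
    // that they never share an item ID.
    declare function fetchPage(args: {
      size: number;
      token?: string;
    }): Promise<{ items: { id: string }[]; nextToken?: string }>;

    test('consecutive pages share no item IDs', async () => {
      await fc.assert(
        fc.asyncProperty(fc.integer({ min: 1, max: 50 }), async (size) => {
          const first = await fetchPage({ size });
          const second = await fetchPage({ size, token: first.nextToken });
          const seen = new Set(first.items.map((item) => item.id));
          return second.items.every((item) => !seen.has(item.id));
        }),
      );
    });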

Common traps and how to avoid them

There are pitfalls. One is overfitting prompts to get the answer you want. If you lead the witness, the model will agree. State facts, not theories, and ask for possibilities. Another is asking for large refactors during an outage. Keep fixes minimal until the system is stable. A third is trusting code samples that compile but don't integrate. When the model offers code, ask it to annotate the import paths and library versions it assumes. This keeps you from pulling in incompatible snippets.

Lastly, avoid turning the chat into a log of failed experiments with no structure. Every ten messages, summarize what you have learned and what remains uncertain. Ask the model to restate your understanding as a set of confirmed statements and open questions. This keeps drift in check.

A brief casebook of live bugs

A few more snapshots show the breadth of problems where the model helps.

A TypeScript type error erupting after upgrading a library. We provided the error, the type definitions before and after, and the generic constraints in our code. The model spotted a breaking change in which a type parameter lost its default, making a previously inferred type now required. The fix was to pass the type explicitly at two call sites. The model also suggested a tsconfig setting to catch this earlier.
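
Reduced to a sketch with invented names, the break looked like this:

    // In v1 the type parameter had a default, so call sites could rely
    // on inference:
    //   declare function parse<T = unknown>(raw: string): T;
    // In v2 the default is gone, so the call site names the type itself:
    declare function parse<T>(raw: string): T;

    interface OrderEvent { id: string; total: number }

    const event = parse<OrderEvent>('{"id":"a1","total":42}');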

A Postgres deadlock between two transactions that rarely collided. We pasted the deadlock graph from pg_stat_activity and the two SQL statements. The model diagnosed a lock order inversion and proposed a consistent order for updates, plus a timeout and retry strategy. It also suggested adding SKIP LOCKED to a background worker that scanned tasks. Implementing a strict ordering resolved the deadlock without lowering throughput.

A CSS layout bug only in Safari on iOS 16. We shared a minimal HTML and CSS snippet and a screenshot. The model recalled a particular flexbox min-height quirk in WebKit and suggested adding min-height: 0 to the flex child. Five minutes later, the layout stabilized across devices.

A Kubernetes liveness probe that kept killing a healthy pod. We pasted the deployment YAML, the probe config, and application logs. The model saw that the probe hit an endpoint after TLS termination assumptions changed. The health endpoint redirected to HTTPS, and the curl in the probe did not follow redirects. Changing httpGet to a direct path without a redirect fixed the crash loop.

In each case the model accelerated reasoning, but we verified with real checks.

Working the practice into your workflow

Chat tools fit naturally into distinct stages. During triage, they help you shorten the list of suspects and design clean probes. During remediation, they help you create failing tests and work out the minimal safe change. During postmortem, they help draft timelines and extract lessons that survive beyond the fix. The habits that make it work are plain. Write prompts like bug reports. Ask for hypotheses and probes. Demand grounding in code and docs. Keep a test-first mindset. Summarize and reset often.

Used this way, ChatGPT becomes a partner that nudges you toward more disciplined debugging. It keeps you honest about what you know, it suggests probes you might skip when tired, and it gives you a fresh set of eyes on a stack trace at 2 a.m. You still do the thinking. You just do it faster, with a little less spelunking in the dark.