Are AI Psychological and Psychiatric Reports Accurate?
What the Research Actually Says
Many mental health clinicians are embracing AI for the time it gives back. Over the last two years I’ve watched professionals move from cautious curiosity to everyday use, and the enthusiasm is hard to overstate. Psychologists are using AI to draft cognitive assessment reports, ADHD and autism formulations, and child psychoeducational write-ups. Psychiatrists are leaning on AI to produce the longer consultation letters required under Medicare items, such as the Item 291 psychiatrist assessment and management plan. The pitch is simple: fewer hours at the keyboard, more time with patients.
Feedback from NovoNote users reflects that story. For example, Andrea Beres, a psychologist who uses AI for neurodevelopmental assessments, is typical of the sentiment among users of AI in assessments.
So why the enthusiasm? Well, it’s worth putting dollar figures on this, because for many practitioners the time saved is not theoretical.
Take Medicare Item 291, the initial psychiatrist assessment and management plan. The item requires a comprehensive written report to the referring GP on top of the 45-minute consultation, and in my conversations with psychiatrists the write-up itself routinely takes 60–90 minutes of non-client-facing time. Most private psychiatrists charge about $500. On those numbers, an AI tool that shaves 45 minutes off an Item 291 report is worth roughly $400 of reclaimed clinician time per assessment. For a psychiatrist completing three Item 291s a week, that’s on the order of $60,000 a year of time either freed up for additional billable work or simply returned to the clinician and their family.
Similar arithmetic applies to psychologists writing cognitive and neurodevelopmental assessment reports, where Lockwood et al. (2025) found human reports averaged 2.5 hours to produce. These are back-of-envelope figures — individual practices will vary — but the order of magnitude explains why uptake has been so rapid.
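To make the arithmetic above concrete, here is a minimal Python sketch of the calculation. It assumes the $500 figure is an hourly rate and that a working year is 48 weeks. Neither is stated above, so treat both as placeholders to swap for your own numbers.

```python
# Back-of-envelope value of report-writing time saved (illustrative only).
HOURLY_RATE = 500.0      # assumption: the "$500" figure read as an hourly rate
MINUTES_SAVED = 45       # from the article: time shaved off one Item 291 report
REPORTS_PER_WEEK = 3     # from the article
WORKING_WEEKS = 48       # assumption: not stated in the article


def value_per_report(hourly_rate: float, minutes_saved: float) -> float:
    """Dollar value of the clinician time saved on one report."""
    return hourly_rate * minutes_saved / 60


def annual_value(per_report: float, reports_per_week: int, weeks: int) -> float:
    """Dollar value of the time saved across a working year."""
    return per_report * reports_per_week * weeks


per_report = value_per_report(HOURLY_RATE, MINUTES_SAVED)
per_year = annual_value(per_report, REPORTS_PER_WEEK, WORKING_WEEKS)
print(f"Per report: ${per_report:,.0f}")
print(f"Per year:   ${per_year:,.0f}")
```

Under these assumptions the script prints $375 per report and $54,000 per year, slightly under the rounded figures of roughly $400 and around $60,000 quoted above, but the same order of magnitude.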
But a thoughtful question sits underneath the enthusiasm:
“Are the reports actually accurate?”
Of course money talks, but that financial incentive creates significant risks and a moral hazard for practitioners. If AI is drafting content that a qualified practitioner then signs, we need to go in with eyes wide open about where the evidence currently sits. I’ve spent some time reading the literature, and this article is my attempt to summarise it fairly.
Reports Written by Psychologists versus AI
The below clip is taken from a recent webinar I ran that explored the future of AI in behavioral health.
Research Evidence
The most direct empirical test we have comes from Adam Lockwood and colleagues at Kent State University. Their paper, Human vs. Machine: Comparing AI-Generated and Human-Written Psychological Reports, is, as far as I can tell, the first peer-review-bound study to put AI-written psychological reports directly up against human-written ones and ask licensed psychologists to evaluate them.
The design is solid. They recruited 249 licensed psychologists (98.8% doctoral trained, mean 18 years of practice) and asked each participant to blindly rate one human-written and one AI-written psychological report drawn from four mock child cases designed to meet DSM-5 criteria for ADHD, Intellectual Disability, Major Depressive Disorder, and Generalised Anxiety Disorder. Both humans and GPT-4 were given identical mock assessment data (including WISC-V and rating scale results) and the same prompt. Reports were rated on readability, writing style, organisation, summary quality, recommendation quality, and overall quality using 5-point Likert scales, plus willingness to sign off.
The results were more interesting than either “AI wins” or “humans win” tells us.
On most dimensions, the differences were small and clinically modest. Readability ratings were effectively identical — humans 3.77 vs AI 3.75 on a 1–5 scale (p = 0.842). Sentence length and jargon use were also statistically indistinguishable. Writing style favoured humans but with a small effect size (r = .17), and organisational structure was only marginally significant (r = .14).
Where the differences mattered, they ran in both directions:
- Summaries: Humans were rated significantly better. Only 52% of psychologists rated AI summaries as “Average or higher” versus 65% for human summaries. At the point this study was published, synthesis was a genuine AI weakness.
- Recommendations: AI was rated significantly better. About 73% of psychologists rated AI recommendations as average or higher, compared with 55% for humans. The authors note this is consistent with long-standing critiques that psychologists’ recommendations “often lack meaning and specificity” (Baum et al., 2018).
- Overall quality: A small but significant edge for humans.
- Willingness to sign off: 49.6% of psychologists were comfortable signing a human-authored report, versus 43.1% for AI-authored — a small effect (r = .14). Notably, the majority weren’t fully comfortable signing off on either.
- Time: Humans averaged 2.5 hours per report. GPT-4 averaged 91.5 seconds.
When forced to pick, 49% preferred the human report, 36% preferred AI, and 15% preferred neither. A clear but not overwhelming human preference.
Caveats about the Lockwood study
Before we get too comfortable with any conclusion, the study has limitations the authors themselves flag. Two are worth naming:
The AI was “raw” in the sense that matters most to me. There was no human-in-the-loop editing. No clinician reviewed the AI report, corrected factual errors, adjusted the formulation, or rewrote clunky sections. The reports went to raters exactly as the model produced them. That’s not how NovoNote or any responsible AI-assisted workflow operates in the real world.
The model is already old. This was GPT-4 (accessed November 2023). The AI report-writing field has moved through GPT-4o, Claude 3.5 and 4, Gemini 1.5 and 2, and a wave of domain-specialised models since. As the authors put it, the findings “may quickly become outdated (and likely are).” In the psychometric language I prefer, we’re looking at a point-in-time snapshot with a very steep obsolescence curve. NovoNote’s current technology is well ahead of the model used in this study.
What about AI-augmented reports — with a clinician reviewing?
This is the question I think matters most, because it’s how AI is actually being deployed in practice. The answer is that direct empirical research on clinician-reviewed AI-assisted psychological reports is sparse. It is one of the clearest gaps in our current evidence base.
The adjacent literature is, however, encouraging. In a cross-sectional study published in the Australian Journal of General Practice (April 2026), Foo and colleagues compared four commercially available AI scribes with human documentation in simulated GP consultations, rated blind by three experienced GPs on a modified Physician Documentation Quality Instrument (PDQI-9). AI scribes performed comparably or better than humans on overall quality, and significantly better on accuracy (p = 0.022), thoroughness (p < 0.001), succinctness (p < 0.001) and freedom from hallucination (p = 0.025). The authors made a point I found sobering: the assumption that human-generated documentation is the “gold standard” doesn’t fully hold up — both humans and AI produced errors, and the humans actually produced more of them.
In another study directly relevant to the “can clinicians tell the difference” question, Hatch and colleagues ran a preregistered Turing-style comparison (N = 830) of ChatGPT-4 versus licensed therapist responses to clinical vignettes, published in PLOS Mental Health (Hatch et al., 2025). Participants could barely distinguish the two, and ChatGPT responses were rated more favourably on several core psychotherapy principles. Fraile Navarro et al. (2025, Scientific Reports) found that expert clinicians evaluated LLM-generated clinical dialogue summaries as high-quality on most dimensions.
None of this is the same as a well-designed RCT of AI-augmented, clinician-reviewed psychological assessment reports. That study still needs to be done. But the adjacent evidence converges on a defensible position: when AI output is reviewed by a skilled clinician, the quality is likely to be at least as good as a report written unassisted — and often better on dimensions like thoroughness and specificity.
The ethical bottom line
Every AI-assisted psychological or psychiatric report must be reviewed, corrected, and signed off by a qualified practitioner. This is not a polite suggestion; it is an ethical obligation. The American Psychological Association, the APS in Australia, and the RACGP have all emphasised clinician responsibility for the final document, and the Lockwood findings reinforce why: AI is strong at structure, recommendations, and readability, but still weaker at nuanced case synthesis — precisely the step where clinical judgement is most needed.
What this means practically:
- Never sign a report you haven’t read carefully.
- Check factual claims — names, dates, test scores, DSM criteria — against source material.
- Re-write the formulation and summary sections if the AI’s version is generic or thin.
- Document your review process. Your signature represents your judgement, not the AI’s.
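The score-checking step in the list above lends itself to a simple automated first pass. The sketch below is a hypothetical helper, not a NovoNote feature: `check_scores`, the sample scores, and the sample draft are all invented for illustration. It looks for each measure's name in the draft and compares the first number that follows it with the source record.

```python
import re


def check_scores(draft: str, source_scores: dict[str, int]) -> dict[str, str]:
    """Compare each score quoted in a draft report against the source record.

    Returns one status per measure: "ok", "mismatch (draft says N)",
    or "not found" when the measure is never followed by a number.
    """
    results = {}
    for label, expected in source_scores.items():
        # Measure name, then up to 60 non-digit characters, then the score.
        pattern = re.escape(label) + r"\D{0,60}?(\d+)"
        match = re.search(pattern, draft, flags=re.IGNORECASE)
        if match is None:
            results[label] = "not found"
        elif int(match.group(1)) == expected:
            results[label] = "ok"
        else:
            results[label] = f"mismatch (draft says {match.group(1)})"
    return results


# Made-up source data and draft text, for illustration only.
source = {"Full Scale IQ": 92, "Verbal Comprehension": 98, "Working Memory": 85}
draft = (
    "On the WISC-V, her Full Scale IQ was 92. "
    "Verbal Comprehension was a relative strength (index score of 89), "
    "while Processing Speed was 80."
)

for label, status in check_scores(draft, source).items():
    print(f"{label}: {status}")
```

Here the Full Scale IQ matches, the Verbal Comprehension score is flagged as a mismatch, and Working Memory is reported as missing from the draft. A check like this only catches transcription errors in numbers. It supplements a careful read of the report and does not replace one.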
Addressing AI objectivity and accuracy
One of the strongest avenues to enhance an AI-assisted report workflow is to use more objective inputs such as psychometric assessments. When a large language model is asked to interpret a vague description of a patient (“she seems quite anxious”) or a raw transcript, it has wide latitude to invent: to generate plausible-sounding but unverifiable clinical claims.
That latitude collapses when the AI is working from NovoPsych psychometric output. Integrating psychometric results means the report is anchored in actual, verified numerical data, such as DASS-21 severity bands, PCL-5 scores compared against the established cutoff, and personality assessment results. Beyond that, quantitative methods such as reliable change indices, which quantify whether a shift in scores exceeds measurement error, help ground the AI in empirical, evidence-based reality.
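As one worked example of that kind of grounding, a reliable change index can be computed in a few lines. The sketch below uses the widely cited Jacobson and Truax (1991) formulation, which is an assumption about the variant intended here. The standard deviation, reliability, and scores are made-up illustrative values, not norms for any real measure.

```python
import math


def reliable_change_index(pre: float, post: float, sd: float, reliability: float) -> float:
    """Jacobson-Truax reliable change index for one client's score change.

    sd is the standard deviation of the measure in a reference sample and
    reliability is its reliability coefficient (0 <= reliability < 1).
    """
    if sd <= 0 or not 0 <= reliability < 1:
        raise ValueError("sd must be positive and reliability in [0, 1)")
    sem = sd * math.sqrt(1 - reliability)   # standard error of measurement
    s_diff = math.sqrt(2) * sem             # standard error of the difference
    return (post - pre) / s_diff


def is_reliable_change(pre: float, post: float, sd: float, reliability: float) -> bool:
    """True when the change exceeds what measurement error alone would explain (p < .05)."""
    return abs(reliable_change_index(pre, post, sd, reliability)) > 1.96


# Illustrative values only: SD = 10, reliability = 0.90.
print(round(reliable_change_index(24, 12, 10, 0.90), 2))
print(is_reliable_change(24, 12, 10, 0.90))
print(is_reliable_change(24, 18, 10, 0.90))
```

With these inputs a 12-point drop gives an index of about -2.68, which exceeds the 1.96 threshold, while a 6-point drop does not. The point is that "improved" becomes a computed statement about the data and not an impression for the model to fill in.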
Concepts like validity coefficients, internal consistency, and established sensitivity and specificity in clinical samples all enhance AI’s capacity to be accurate. Rather than the AI guessing at severity or improvement, it is reporting what the data say, in the language of normative comparison rather than impression.
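To illustrate what reporting the data instead of an impression can look like, here is a minimal sketch that turns DASS-21 Depression item responses into a severity band. The cut-offs are the conventionally cited ones for doubled DASS-21 subscale scores and should be verified against the DASS manual before any clinical use. The function name and the sample responses are hypothetical.

```python
# Upper bound of each band for the doubled DASS-21 Depression score.
# Assumption: conventionally cited cut-offs; verify against the DASS manual.
DASS21_DEPRESSION_BANDS = [
    (9, "Normal"),
    (13, "Mild"),
    (20, "Moderate"),
    (27, "Severe"),
    (42, "Extremely severe"),
]


def dass21_depression_band(item_scores: list[int]) -> tuple[int, str]:
    """Return the doubled subscale score and its severity band.

    Expects the seven Depression item responses, each rated 0 to 3.
    """
    if len(item_scores) != 7 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("expected seven item responses, each between 0 and 3")
    score = 2 * sum(item_scores)  # DASS-21 subscale sums are doubled
    for upper, label in DASS21_DEPRESSION_BANDS:
        if score <= upper:
            return score, label
    raise AssertionError("unreachable: the maximum doubled score is 42")


# Made-up responses for illustration.
score, band = dass21_depression_band([2, 1, 2, 1, 2, 1, 1])
print(f"DASS-21 Depression score of {score} falls in the {band} range.")
```

The sentence that reaches the report is then generated from the score and a fixed lookup, so the language model is structuring a fact, not estimating severity.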
We don’t want opinions the AI generated; we want empirical facts that the AI layer has simply structured into prose. For a profession whose credibility rests on accuracy, that distinction is not cosmetic — it is the entire game, and it is precisely the kind of grounding that turns AI-augmented reports from a clever shortcut into a defensible piece of clinical evidence.
As they say, junk in, junk out. So it’s NovoPsych’s mission to improve the quality of the input, so that clinicians using our tools can trust the output.
Where I land
So, are AI psychological reports accurate? The best current answer, grounded in the data, is: in the right workflow, yes. And improving quickly. 2023-era reports without any human editing already approached human quality on most dimensions in Lockwood’s study. NovoNote’s own data shows 2026 reports are considerably better.
The actual published research is still early. The strongest single study (Lockwood) used a two-year-old model with no clinician review. The study we really need — a head-to-head comparison of clinician-reviewed AI-assisted reports versus unassisted clinician reports, rated on clinical utility by independent experts — hasn’t been done yet. Until it is, we should hold our claims lightly and keep the clinician firmly in the loop.
But the direction of the evidence is clear enough. AI is no longer a novelty in psychological report writing. It’s a serious clinical tool, and — used responsibly — it appears to help us produce reports that are more thorough, more actionable, and produced in a fraction of the time. That’s a meaningful gain for patients and for a profession carrying a heavy documentation burden.
As always, I’ll keep watching the literature and updating my view as better studies arrive.
References
Farmer, R., Lockwood, A., Goforth, A., & Thomas, C. (2024). Artificial intelligence in practice: Opportunities, challenges, and ethical considerations. Professional Psychology: Research and Practice. Advance online publication.
Foo, D., Tan, J., Stevens, S., Hansra, A., & Wilcox, H. (2026). A comparative analysis of AI scribes versus human documentation in simulated general practice consultations. Australian Journal of General Practice, 55(4). https://doi.org/10.31128/AJGP-04-25-7645
Fraile Navarro, D., Coiera, E., Hambly, T. W., et al. (2025). Expert evaluation of large language models for clinical dialogue summarization. Scientific Reports, 15, 1195. https://doi.org/10.1038/s41598-024-84850-x
Hatch, S. G., Goodman, Z. T., Vowels, L. M., Hatch, D., Brown, A. L., Guttman, S., … Braithwaite, S. R. (2025). When ELIZA meets therapists: A Turing test for the heart and mind. PLOS Mental Health. https://doi.org/10.1371/journal.pmen.0000145
Lockwood, A., Farmer, R., Shergill, G., Benson, N., & Gilbert, K. (2025). Human vs. Machine: Comparing AI-Generated and Human-Written Psychological Reports [Preprint, January 23, 2025].
Tierney, A. A., Gayre, G., Hoberman, B., et al. (2024). Ambient artificial intelligence scribes to alleviate the burden of clinical documentation. NEJM Catalyst, 5(3), CAT.23.0404. https://doi.org/10.1056/CAT.23.0404
