Emerging Litigation Podcast

Agentic AI on Trial: You Be The Judge Part 1 - Medical Diagnostics

Tom Hagy Season 1 Episode 112

In this three-part series, our guests reprise "You Be The Judge," their panel discussion at the Executive Women’s Forum DSG Global conference, during which they explored scenarios involving harms potentially caused by Agentic AI.

In Episode 1 they discuss an Agentic AI mammography triage system designed to flag positives for a radiologist, auto-send “all clear” letters for negatives, and operate with minimal human oversight. 

They tackle these difficult questions: When the machine gets it wrong, who is accountable? Developers, hospitals, clinicians, and/or data providers? What role do contracts, warnings, and intended-use labels play in establishing liability? What safeguards would balance speed and safety? Random audits? Documentation? Will a new standard of care develop for machine decision-making? 

I take a back seat in this series as the panelists moderate the discussion themselves. They are:

Galina Datskovsky, PhD, CRM, FAI
Board of Directors, FIT and OpenAxes
Information Governance and AI expert

Marina Kaganovich
AMERS Financial Services Executive Trust Lead
Office of the CISO, Google Cloud 

Hon. Lisa Walsh
Florida Circuit Judge
11th Judicial Circuit, Miami-Dade County

Special thanks to Kathryn M. Rattigan, Partner, Data Privacy + Cybersecurity with Robinson+Cole, for bringing this team to the Emerging Litigation Podcast. 

If you work in health tech, compliance, or hospital operations -- or you advise these professionals -- this conversation offers a clear-eyed guide to deploying autonomous agents responsibly, without sleepwalking into preventable harm. If you like what you hear, watch for Episodes 2 and 3. 

______________________________________

Thanks for listening!

If you like what you hear, please give us a rating. You'd be amazed at how much that helps.

If you have questions for Tom or would like to participate, you can reach him at Editor@LitigationConferences.com.

Ask him about creating this kind of content for your firm -- podcasts, webinars, blogs, articles, papers, and more.

Tom Hagy:

Hello and welcome to the Emerging Litigation Podcast. I'm your host, Tom Hagy. This is part one of a series that we are calling Agentic AI on Trial. AI, of course, is transforming a lot of businesses and changing quite a bit; everybody knows this, it's hardly a headline. There's a lot of legitimately scary stuff out there in the use of AI, I would say. Just watch your social media feed. But it's also powerful and does so many good things. Take insurance: AI is no longer optional for insurance companies. It's imperative to remain competitive. AI now drives claims processing, fraud detection, and even coverage decisions. Intelligent document processing and natural language processing allow insurers to scan and interpret claims in minutes. Predictive analytics tailor policies to customer behavior. But speed comes with controversy. Algorithms at one very large insurance company reportedly denied 300,000 claims in two months. That's a lot of denials. Lawsuits and new state laws are pushing for human review as a requirement. I think that's a given, certainly in my business. You have to have a human looking it over. It's powerful, but you want to keep your eyeballs on things. Today we're going to explore efficiency versus ethics, transparency, and liability in automated denials. Those are issues that come up in almost every aspect of the use of artificial intelligence. Our panel comprises Galina Datskovsky, PhD, a business strategy advisor. With her is Marina Kaganovich, an attorney and compliance advisor at Google. And then we have an actual judge, Judge Lisa Walsh of Florida's 11th Judicial Circuit. I want to thank Kathryn Rattigan, who's a partner with Robinson+Cole in Providence, Rhode Island. I've had the pleasure of working with her on legal issues involving emerging technologies like drones and other things, so it's a pleasure to work with her again. She pulled this podcast series together for us, and she's also going to give a more detailed introduction of our guests, which is necessary. So, with that, here is episode one of Agentic AI on Trial: AI liability and legal risks. I should also say that the opinions expressed here are those of the presenters, not their clients, and have nothing to do with any cases that may be before them. They are strictly their opinions. So, with that disclaimer, I hope you enjoy it.

Kathryn Rattigan:

We have three leaders in legal, compliance, and technology. We first have Galina Datskovsky. She's an internationally recognized authority in compliance, information governance, AI, and data analytics. She has a PhD in computer science and really deep expertise in AI. She advises a lot of organizations on business strategy, she's on a lot of boards, and it's really interesting to hear her background; she has a lot of great knowledge to share with this community. She's joined by Marina Kaganovich, who's an attorney and compliance advisor at Google. She also specializes in AI governance, cybersecurity, risk management, and data privacy. She works with a lot of executive leadership teams on how to secure cloud migration in a compliant way in an evolving regulatory environment, and she's really knowledgeable in that. She runs global cross-functional programs for a lot of different organizations as an advisor, and she's part of the ARMA International Board. She's really a thought leader in this space as well. And then we're also joined by Judge Lisa Walsh, who is a circuit court judge in the 11th Judicial Circuit of Miami-Dade County, Florida, where she's an administrative judge of the appellate division, and she participates in an international arbitration division as well. An interesting fact: she's done over a hundred jury trials, both criminal and civil. She has a lot of appellate decisions out there, lots of experience, and she brings her judicial experience and insight into this discussion. So they're really a great group, and it's kind of interesting because their perspectives converge on law, technology, governance, and justice. All of their perspectives are really interesting to put together in this You Be the Judge podcast.

Galina Datskovsky:

Okay. So I'd like to introduce our session and discuss a little bit of the evolution of AI, which is rapidly progressing beyond assistants and co-pilots toward autonomous agents and, ultimately, interconnected ecosystems of specialized agents operating without direct human intervention. In this session, we will explore these evolving risks from a legal perspective, focusing on key areas including ecosystem risks, user reliance and trust, legal liability, security, privacy, accountability, and other aspects. In the format of You Be the Judge, we'll present hypothetical cases to our judge and cover the issues as if they were actual court cases. But before we do that, I would like to define agentic AI. For those of you who know, this may be a little repetitive, but agentic AI refers specifically to artificial intelligence systems that possess agency, which is really the ability to act independently, make decisions, and pursue goals with minimal human oversight. To dig into that a little: these agents set and pursue goals autonomously. They adapt to changing environments by themselves. They make decisions based on context and feedback. They might also use external tools to execute complex tasks. And finally, if I had to summarize it, agentic AI is proactive, strategic, and action-oriented. Now, it's kind of funny to compare traditional AI and agentic AI, because what's traditional AI? We haven't been doing AI for that many years. But at any rate, it's worth noting that AI has been in use for quite a while, and we've had what we would call traditional AI: machine learning, maybe task-specific, but maybe not, static and rule-based. And where have we seen that? For example, in various robotic systems. We've had those for quite a while. We've had expert systems in specific areas. So we've had quite a bit of experience with AI in the traditional sense. Now, what has shaken AI up is generative AI models. And what does that mean? Generative AI models respond to prompts. They can actually be creative, they can write their own things. They cover not just text or just vision or specific areas; they actually work with text, images, code generation, et cetera. So generative AI really pushed the field of AI forward, beyond specific vertical uses and into the mainstream. Now, with agentic AI, which we already defined, we're taking it a step further and making AI actually autonomous, acting independently. And this is a very important thing to keep in mind for the rest of this podcast and for the cases we're going to be looking at: the AI agent is acting independently, potentially making decisions, taking actions, et cetera. Keep that in mind. I'm sure the judge will be commenting quite a bit on that. And with that, let me turn it over to my colleague Marina for the next segment and our first scenario.

Marina Kaganovich:

Thanks so much, Galina. Hi, everyone. Building on Galina's introduction, this is really an exciting opportunity for us to talk about the evolution of AI. And as she mentioned, machine learning has been around for quite some time and has been used in the medical field, particularly in the radiology context, quite extensively up till now. So one of the scenarios that we wanted to explore a little further is how this changes with the use and integration of AI agents into a workflow. Let's look at an example, or scenario, where we have an autonomous AI agent that screens mammography scans and then ranks them as either positive or negative based on its training. The agent is programmed to only send, let's say, positive screens for secondary review to a physician, a radiologist in this case. Those patients who do have a positive screen would also receive a letter from the agent indicating that a follow-up with the physician is needed, and the agent would then proceed to schedule the follow-up appointment. The patients, on the other hand, who receive negative results from the AI agent would just receive a letter saying, hey, you're all clear, your scans have been reviewed, no issues have been identified, and then no further action is taken. This type of scenario is quite realistic, because we already have AI quite heavily involved in mammography screenings and assessments, but we're taking it one step further and integrating an autonomous agent into the mix. So that, at a very high level, sets out the scenario. Now let's assume that we have a situation where this AI agent, for whatever reason, misreads a patient result, and the result is indicated as being negative and no further action is taken, when in reality the patient's scan is positive. So no follow-up is scheduled. And according to the workflow that I set out a moment ago, the patient's scans are not sent to a radiologist for review, because when the AI identifies a scan as negative, it determines that no further action is required, and the patient is essentially advised of that in the form of a letter. So in this particular scenario, the patient receives the all-clear letter and determines not to take any further action. And then at some point later, the patient discovers that they have cancer, but it's identified only at a much later stage, and at that point it's metastasized. So we find ourselves in a situation where, because of the use of AI in this workflow and the lack of physician interaction in this case, the patient files suit. We wanted to break it down and look at the scenario from a few different perspectives, and of course hear from Lisa in terms of the legal aspects of how this would be assessed at trial. So the first question I would pose is, keeping in mind that these scenarios are intentionally crafted at a high level and the outcomes will of course be very dependent on facts and circumstances, but for the purposes of our discussion, let's assume that the AI agent missed identifying the positive result because the screening approach that was used was undercalibrated. Had the sensitivity been set up to over-screen, it would have caught negative screens as false positives, and so there would be more noise in the system, but on the flip side, positive results would also have been less likely to be missed. 
So with the patient in this scenario pursuing legal action, who do we think they would go after, and who would be liable? Lisa, what are your thoughts?
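
To picture the triage workflow Marina describes, here is a minimal Python sketch. The model score, the 0.80 threshold, and the action messages are hypothetical assumptions for illustration; this is a sketch of the scenario's logic, not any real product's behavior.

# Illustrative sketch only: a hypothetical, simplified version of the triage
# workflow in the scenario. The score, threshold, and actions are assumptions,
# not a real system's behavior.
from dataclasses import dataclass

@dataclass
class Scan:
    patient_id: str
    malignancy_score: float  # model's estimated probability of malignancy, 0.0 to 1.0

# Calibration knob: scans at or above this score count as "positive".
# In the hypothetical, it is set too high ("undercalibrated"), so borderline
# cases fall through as negatives.
POSITIVE_THRESHOLD = 0.80

def triage(scan: Scan) -> str:
    if scan.malignancy_score >= POSITIVE_THRESHOLD:
        # Positive path: radiologist review, notification letter, follow-up appointment.
        return f"{scan.patient_id}: routed to radiologist; follow-up scheduled"
    # Negative path: no human review; auto-send the all-clear letter.
    return f"{scan.patient_id}: all-clear letter sent; no further action"

# A borderline scan (score 0.65) is auto-cleared at the 0.80 threshold,
# but would have reached a radiologist at a more sensitive 0.50 threshold.
print(triage(Scan("patient-001", 0.65)))

The point of the sketch is the calibration knob: the same borderline scan is auto-cleared at a high threshold but would reach a radiologist at a more sensitive one.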

Judge Walsh:

So, first of all, it's a pleasure to be together with both of you again. We've had past live interactions on this, and it's a fun exercise to ask non-legal practitioners, non-judges, to think about this stuff -- to think about it as development is happening, as products are being rolled out, and as decisions are being made. So who might be liable? And the options are: could the developer who developed the system be liable? Could the person responsible for deciding how to calibrate the system be liable? What about the doctor who buys the system and implements it into their practice? Or maybe the hospital itself, or the health network that the patient is ultimately part of? And finally, the provider who actually trains the healthcare system or the doctor or the practice on using the system. So which of any or all of these potential actors might be liable? For starters, the whole concept of agentic AI is very interesting from a legal point of view, because the idea of an agent, or agency, from the perspective of the developer of an AI product may mean something very, very different than it means in the context of the legal system. The word agent has a very particular meaning if you're talking about the law of agency. Typically, the focus is on something called the principal. The principal in this case is whoever holds an agent out as having the authority to do something, and then the agent who does something may bind the principal. So let's talk about the developer. The developer who created the system is not holding the agent out to the patient for any action of the developer itself. So the traditional notions of agency law, I think, are a little muddled when it comes to the developer. Here, it isn't necessarily the system that is wrong from beginning to end, as the prompt you just explained indicates. You can calibrate the system many different ways. You could calibrate it to be so sensitive that you might capture false positives and you will have very, very few false negatives. So calibration is an option here. I would imagine that if a system like this went awry, it is probably likely that there will be an attempt to bring the developer into a lawsuit over whether they are liable. I think it may depend on: is there anything about the system itself that was created in a reckless way? Is there anything about how the agent acts autonomously that is prone to error in itself, without any interaction of the user, meaning in this case the medical practice or the hospital or the doctor? So you always look to who is the principal when you're asking a question about agents' liability under traditional agency law. If you talk about the person responsible for setting calibration at the hospital, then it gets interesting, because that person may have made a decision that we want to have as few medical appointments with the doctor as possible -- we only want to focus on positives, and we want to be sure that we don't capture any false positives -- and maybe they're not terribly sensitive to false negatives. So depending upon the acts of the individual or entity responsible for calibrating the system, they may very well be pulled into a lawsuit, depending upon what they did or didn't do and the choices they made. And what Galina and Marina have taught me over the years is to ask: what are the licensing provisions in the product? 
You know, is the user abiding by the licensing agreement when they bought that particular system? Perhaps in the agreement or the contract under which the system was bought, there is a very clear warning that says: be advised that wherever you calibrate this system, here are the risks involved. So the person who made a decision on calibration, if they were not risk sensitive, that may land them in some hot water. As for the doctor who uses the innovation, I imagine a doctor is likely to be pulled in as well, because ultimately it's the radiologist who diagnoses. I think it depends on how the system interacts. What is the entity that screens? In this case, for a mammography, you'd be going to your gynecologist or your general practitioner. Might that doctor be responsible, when it's ultimately the radiologist who isn't the one who caught it? All of those facts, all of those nuances, I think, would go into that. And finally, the hospital or health network and the provider of training data. For the hospital or health network, it depends on their nexus to the decision-making: purchasing the system, calibrating the system, rolling it out, and following up. And as far as the provider of training data, if the data is murky or unable to be understood and didn't adequately advise the purchaser of the potential risks, maybe there's an issue there as well. But I'm gonna bounce it back to Marina and Galina for any further insight. Yeah, I think -- sorry, Marina. Sorry, Galina.

Galina Datskovsky:

So I do want to raise one question, Lisa, if you don't mind, and that is about the training data, because a lot of times these kinds of systems are trained on data from previous scans and reads. So could a claim be brought saying that the data used to train the agent to recognize problems in the first place wasn't sufficient, wasn't diverse enough, et cetera? And could that change your opinion of the liability?

Judge Walsh:

It very well might, if the data the system was trained on is out of date, isn't sensitive enough, or doesn't enable the system to recognize a malignancy or an abnormality, especially if it lulls the purchaser into a sense of security that this system is so much more sensitive than human eyes when doing an initial screen of a mammogram. So there very well might be problems if that's a fundamental fly in the ointment, a fundamental flaw in the system. This is real stuff right now. I mean, the most recent medical scan that I had was read by an AI system. I think this is growing to be universal across the board. But what we believe -- the urban myth that we patients believe -- is that it's better, that it's more sensitive, that it's not subject to fatigue and boredom, that it's not going to overlook anything, because it will be just as fresh when it reads my screen as when it read the prior thousand. But everything is only as good as the components that go into it. And if the training to recognize is not sufficient, that in and of itself might be a subset of potential liability, because it's a flawed system.

Marina Kaganovich:

I'm wondering if we can actually add in another wrinkle here and talk about standard of care. We've all had these conversations before, right, in a different context, but certainly I think it's quite relevant in this one. In the scenario we set out, we have a referring physician and we have a radiologist. There's also a radiology tech, right, who actually helps take the scans. So what role does standard of care play if this type of scenario were being litigated in front of a jury? What are your thoughts?

Judge Walsh:

This is interesting, because agency law you look at from the point of view of the individual or entity that is holding out the agent as representative of what it's allowed to do. Medical malpractice law introduces another layer: a specialized standard of care. In other words, a medical malpractice claim, whether it's against a hospital for its nursing care or its techs, or against a physician for the standard of care in medical practice, involves a very specialized definition of standard of care. But this is machine learning. This is not a physician. So there is a question in my mind as to whether there will be a development of an entirely new body of law for how you evaluate standard of care, one where you would have to look back to the creation, the development, the input, the training data, all of that, and then to how it operates in the field. Are there any hiccups or glitches in the way that it actually carries out its function? It is medical care, but it's not the standard of professional care of a physician -- a human physician or medical technician. So I am not altogether -- I mean, please don't quote me on this, because I'm just thinking out loud here, and these issues might come before me someday. I certainly don't want to be accused of prejudging them. But thinking out loud about it, it may or may not be the appropriate lens to look at this stuff, because it's not a trained professional. And one other wrinkle on the issue of standard of care. The standard of care of a human cytologist, for example, who evaluates -- they screen, they look at slides and determine, are there any abnormal cells that I'm seeing on these slides -- the standard of care is not perfection. It is never perfection. It is: what is a reasonable standard of care in this particular profession? And missing the occasional positive is part of a reasonable standard of care. It happens. But I'm wondering whether we are going to have an even higher, stricter expectation of the way a system will work when it's not subject to the limitations of a human being. So this may go in different directions, perhaps even toward greater scrutiny of a system that is supposed to never miss.

Galina Datskovsky:

Yeah, this is definitely a fascinating new area to explore. I think we have more questions, right? Yeah.

Marina Kaganovich:

Sorry, to your standard of care point: machine learning has been used here for quite some time, right? And I think what's interesting is that in our workflow -- well, typically, the way it's used today, machine learning applied to scans still goes out to a radiologist to ultimately review and sign off, right? And our agent here takes some of that away, because we're saying that there's a presumably large subset of scans that are never seen by the radiologists, in favor of efficiency, but also as a nod toward the efficacy of the way the agent and the machine learning function -- there's obviously testing that's done to make sure that certain percentage thresholds are met. So I'm wondering, for both of you really, what's gone wrong in this case, and in terms of the developer and deployer implementation, how could this have been avoided? Because I keep thinking that we could keep sending all scans to radiologists for review, but that kind of negates some of the benefit of implementing an agent in a workflow where there's a very high level of confidence. So what do you think? What are some of the other mitigants that should have been considered?

Judge Walsh:

It seems like the fulcrum here was that if the system reads it as negative, there's no follow-up. So there are two things. One is: what do you mean the system read it as negative? Does that mean the system saw zero abnormality? How many points of comparison were there? How many areas did it look at? I mean, there could be many things fed in to ensure that negative means negative. There is a gray area in every medical scan which is questionable. Is it a cyst? Is it a mole? Is it something benign? Is it potentially a flaw in how the tech was positioning the patient to take this particular scan? So there's the initial threshold of negative means negative -- in other words, are you looking at enough points of data that you can clearly clear something out? To have those screens reviewed by a physician obviates any efficiency whatsoever, but perhaps to have just the report reviewed, for efficiency, might be one stopgap. And I think that's what happens now with patients: at least the report that's generated by the system will give you all of those points of data, what it observed, and that could be reviewed. That would sort of be splitting the baby here. And then the other thought is about automatically sending the letter that you're clear -- whether there's anything else that can go to the patient, or maybe the patient can opt in and say, I want my doctor to review it, or maybe have a tech review of some sort, although I don't know what insurance would cover. So in that workflow, I think it really comes down to: negative must mean negative. And is there sufficient data to truly screen someone out without overlooking anything? Can you ever get that good? What are your thoughts?

Galina Datskovsky:

Yeah, I wanted to add a little bit to that and maybe take a slightly different tack, because at the end of the day, if the review goes to a human being -- to a doctor, for example -- they could miss exactly the same things that the system missed, to your point before. So when you ask what it means that negative is negative, you come back to that idea of training the agent to understand what human beings consider negative, on some variety of scans that were deemed negative. And by the way, that's where flaws could come in: they could have been deemed negative incorrectly. So your data could be coming from the very same hospital where some things were read as negative but actually weren't, right? Unless those were screened out of the data pile, which is not necessarily the case, that's where your flaws could come in. Now, in my view, if you have good provenance of where the data came from and how you trained it on what's negative, the whole point of agency is that it should be able to do that: send the letters and act autonomously. Of course, it should continue to learn, so you want to give it as much of a feedback loop as possible. Then if it does miss something, it will learn what it missed and how it missed it, which it probably will never forget, unlike a human, who could overlook or forget or be tired, to your point. So there are all these questions about it. Ultimately, the idea is that, done right, it should be more efficient than a human being. And of course, if there's a true gray area -- whatever that means, a true gray area, meaning the system really can't decide -- you could set a threshold of certainty. Maybe if it's 87% certain, it would say it's negative; if it's below that, it goes to a human being; and if it's below something else, it would be considered positive, right? Whatever the numbers are. So you could set up different thresholds, and that would be another question I would ask: okay, so what were the thresholds? Where were those decisions made? Because, like you said, nothing is necessarily a hundred percent.
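
As a rough illustration of the thresholding Galina describes, here is a small Python sketch; the specific cutoffs (0.87 and 0.50) and the idea of scoring confidence that a scan is negative are hypothetical stand-ins for whatever a real deployment would calibrate.

# Illustrative sketch of threshold-based routing; the cutoffs are hypothetical.
def route_by_confidence(negative_confidence: float) -> str:
    """negative_confidence: the model's certainty (0.0 to 1.0) that the scan is negative."""
    if negative_confidence >= 0.87:
        return "auto-clear: send the all-clear letter"
    if negative_confidence >= 0.50:
        return "gray zone: send to a radiologist for secondary review"
    return "treat as positive: radiologist review and follow-up scheduled"

for confidence in (0.95, 0.70, 0.30):
    print(confidence, "->", route_by_confidence(confidence))

In a dispute, the questions Galina raises -- what were the thresholds, and who decided them -- map directly onto those two numbers.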

Judge Walsh:

So what about quality control? Even after the system is rolled out, you need to know what the real results are -- how is it actually working for me? So, for example, I'm going to pull, you know, eight of a hundred negatives and have them independently reviewed. There has to be some manner of determining that, to ensure that it's doing what it says it should be doing.

Galina Datskovsky:

Yep, that's one way to do it. The other way: take all your gray areas -- maybe you set your threshold low, so anything that's between 50% and, whatever, 90% still has to go through a secondary screening, and you keep feeding that back in. So your system will get better and better. There are different things you could do to mitigate and train your agents, which are not atypical. All of that will depend. And I'm sure in a court case, if there were such a court case, you would be asking exactly what those protocols were.
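
Here is a brief Python sketch of the two quality-control ideas just discussed, randomly auditing a slice of auto-cleared negatives and feeding secondary-review outcomes back for retraining; the 8% audit rate and the record fields are assumptions for illustration.

# Illustrative sketch: random audits of auto-cleared negatives plus a
# feedback loop. The audit rate and record fields are hypothetical.
import random

AUDIT_RATE = 0.08  # roughly "eight of a hundred negatives"

def select_for_audit(auto_cleared_ids: list[str], seed: int = 0) -> list[str]:
    """Pick a random sample of auto-cleared scans for independent human review."""
    rng = random.Random(seed)
    sample_size = max(1, round(len(auto_cleared_ids) * AUDIT_RATE))
    return rng.sample(auto_cleared_ids, sample_size)

def feedback_records(audit_outcomes: dict[str, bool]) -> list[dict]:
    """Turn audit results (scan id -> human-confirmed positive?) into labeled
    examples for the next training or recalibration cycle."""
    return [{"scan_id": sid, "label": "positive" if positive else "negative"}
            for sid, positive in audit_outcomes.items()]

cleared = [f"scan-{i:03d}" for i in range(100)]
audit_batch = select_for_audit(cleared)
print(len(audit_batch), "of", len(cleared), "auto-cleared negatives pulled for independent review")
print(feedback_records({audit_batch[0]: False})[0])

Either control leaves a documented protocol behind, which is exactly what the panel expects to be probed in litigation.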

Judge Walsh:

So my question, to kind of bounce off that, is: from the point of view of the creator or the developer themselves, what could they implement at the front end, before rolling it out to end users, to avoid any of this from happening?

Galina Datskovsky:

So from the developer's perspective, it's again about using the best possible training data, using the best workflows, and allowing users to set thresholds and other variable conditions, so each given user -- each hospital, each health system, each insurance provider -- can calibrate it the way that's most appropriate for them. Giving that kind of flexibility is where the developer comes in, along with making sure that it's not vulnerable to hallucinations or forgetfulness, right? Those are the things the developer has to ensure. Given all of those, the rest is really on the installer and user, in my view. I don't know if you agree with that, Marina.
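
As a sketch of the deployer-adjustable calibration Galina mentions, here is a hypothetical configuration object a developer might expose; the field names, defaults, and validation rules are illustrative assumptions, not a description of any real system.

# Illustrative sketch: hypothetical deployer-facing calibration settings
# with developer-enforced guardrails.
from dataclasses import dataclass

@dataclass
class TriageConfig:
    auto_clear_threshold: float = 0.90  # confidence-negative required to auto-clear
    human_review_floor: float = 0.50    # below this, treat the scan as positive
    audit_rate: float = 0.08            # share of auto-cleared scans re-reviewed
    feedback_loop_enabled: bool = True  # feed review outcomes back for retraining

    def validate(self) -> None:
        # Guardrails so a deployer cannot silently calibrate into an unsafe state.
        if not (0.0 <= self.human_review_floor <= self.auto_clear_threshold <= 1.0):
            raise ValueError("thresholds must satisfy 0 <= floor <= auto-clear <= 1")
        if self.audit_rate <= 0.0:
            raise ValueError("audit rate must be positive")

config = TriageConfig(auto_clear_threshold=0.87)
config.validate()
print(config)

Exposing the knobs while validating them is one way a developer shifts calibration choices, and the documentation of those choices, to the installer and user.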

Marina Kaganovich:

I think you have a good point there, Galina. And I think a lot of this would also be covered in something like a vendor onboarding review, right? Because the hospital or the healthcare provider -- whoever is onboarding the use of this tool -- would presumably, or should, be asking these questions.

Judge Walsh:

So -- and I'm the layperson here -- my understanding of hallucinations, again maybe an urban myth, is that they will ordinarily occur because the system's guess is attempting to please the prompt. Is there a way to craft a prompt so that you're looking for positives? You're not looking to rule out cancer, you're actually looking for positives.

Galina Datskovsky:

Yeah, in this case, though, I would take it slightly differently, because you're not engineering a prompt. When I say eliminate hallucinations, I mean you would use the right models -- whether they created their own model, used some open-source model, or however they designed the system -- models that will be very precise on the data and won't be looking to get creative around the data. And different models, as we look at them, offer different strengths. So a lot of times developers will use multiple models for multiple pieces of their systems, short of building and training their own, which is expensive. I know many software products that use multiple models for different pieces of their product to minimize the impact, depending on what the product is meant to do. So yes, there are a lot of things a developer can do. But at the end of the day, even if the developer did everything right, we could still end up with the results I just described.