The authors of a recent journal article see problems with the FDA’s approach to premarket review of artificial intelligence (AI) algorithms, including an undue reliance on single-site algorithm development. Regulatory attorney Brad Thompson told BioWorld, however, that hospital administrators want algorithms that maximize accuracy for their own patient populations and are not averse to developing such algorithms in house, creating tension between what hospitals want and what the FDA expects.

An analysis recently published in Nature Medicine, authored by a team from several U.S. academic medical centers, acknowledged that the FDA has recently called for improvements in test-data quality and in the transparency provided to users of these algorithms. The analysis entailed development of a database of algorithms along with a case study of pneumothorax triage devices, and that case study suggested that development of a deep-learning model at a single site “can mask weaknesses in the model and lead to worse performance across sites.”

The authors aggregated all AI devices cleared or approved between 2015 and 2020 and applied a four-tier risk score to each algorithm. That approach runs counter to the FDA’s three-tier premarket risk classification system; it is based instead on the agency’s October 2017 guidance for changes to software cleared under the 510(k) program, rather than on the approach taken for the product’s initial 510(k) filing.

Retrospective studies the norm, not the exception

Of the 130 algorithms thus analyzed, 126 relied solely on retrospective studies to train and validate the algorithm, and the authors said 54 of those algorithms fell into the two higher-risk categories. None of the 54 was evaluated via a prospective study, an omission the authors said is critical because retrospective studies often did not offer a concurrent comparison of algorithm and clinician interpretation.

Another sore spot for the authors was that 93 of the algorithms “did not have publicly reported multi-site assessments as part of the evaluation study.” Four of these algorithms were evaluated at a single site while another eight were evaluated at only two clinical sites, suggesting a limited demographic data set. While half of the 126 algorithms were cleared or approved in the past year, the proportion of algorithms evaluated at a single site “has remained stagnant” over the study period.

The authors added that only 17 of the algorithms entailed a clinical study that explicitly considered demographic subgroup outcomes. They also said their own case study, in which a model was trained and validated at a single site, demonstrated that an algorithm developed this way might perform poorly when used at another site. One example of this is the disparity in algorithm performance between white and Black patients.

Thompson, an attorney in the D.C. office of Epstein Becker & Green P.C., said U.S.-based developers know most of the world employs a four-tier risk system, but noted that the authors have a good reason for using the four tiers of the software 510(k) changes guidance. “An algorithm is most likely to be improved two weeks or so after the first iteration,” Thompson said, which complicates matters for developers because they may have to file a new 510(k) for that algorithm after only one or two months on the market.

Thompson also confirmed that “all the issues raised [in the paper] are issues the FDA is well aware of, and in fact is working on.” Consequently, the impact of the paper is more likely to fall on the users of these algorithms than on the FDA. Administrators at hospitals and other clinical sites need to be aware of the limitations of these products, but those administrators, and the doctors at the clinical site, want a high degree of sensitivity and specificity for the patients seen at their own institution, not at a clinical site on the other side of the country.

The problem for the FDA is that this dynamic creates an incentive to develop a location-specific product while the agency’s regulatory framework is unavoidably a national framework. Thompson said that while he does not have a window into what hospital systems are doing in terms of hires for in-house IT staff, he is under the impression that at least some hospital systems are indeed staffed to handle development of such an algorithm.

Resources still an issue where inspections concerned

“It’s a conundrum for the FDA,” Thompson said, because the agency is not staffed to conduct inspections of hospital IT departments. He also pointed out that the FDA’s medical device data systems guidance exempts these in-house developments if they are for in-house use only. One of the associated problems is that the line between an exempt and non-exempt system is not always clear, but Thompson said the FDA currently seems to lack the appetite to dive back into the question of hospital IT regulation.

The FDA has been making the point about using multiple sites to validate an algorithm for some time, but Thompson said the agency has been a little more vocal about this of late. Some of these products are marketed internationally, so at times there is the question of not just the number of sites in total, but the number of sites located in the U.S.

However, Thompson said the FDA has not offered a particularly helpful set of guidelines as to what constitutes a sufficient level of demographic diversity in these training and validation sets, even though the question is front and center for the agency. “It’s tough to know when you’ve achieved enough diversity to satisfy the FDA,” he said, adding that developers believe it is time for the FDA to offer written guidance on the question.

FDA spokesperson Lauren-Jei McCarthy told BioWorld that the agency’s action plan for artificial intelligence, developed by the Digital Health Center of Excellence, “takes a holistic approach towards [the agency’s] next steps in furthering oversight that provides a reasonable assurance of safety and effectiveness” in these products. McCarthy said the action plan reinforces that the FDA “is committed to promoting a patient-centered approach including transparency to all users,” along with the use of good machine learning practices, development of regulatory science methodologies to assess medical device algorithm bias and robustness, and “the increased use of real-world performance monitoring of medical devices.”

The agency’s Center for Devices and Radiological Health “is committed to the advancement of digital health efforts. Even during the COVID-19 public health crisis, CDRH continues to advance efforts to regulate medical devices and emerging digital health technologies,” McCarthy said.