In the machine learning subdomain of natural language processing (NLP), robustness testing is the exception rather than the norm. That's particularly problematic in light of work showing that many NLP models exploit spurious cues that undermine their performance outside of specific benchmarks. One report found that 60% to 70% of answers given by NLP models appeared somewhere in the benchmark training sets, indicating that the models had often simply memorized the answers. Another study, a meta-analysis of over 3,000 AI papers, found that the metrics used to compare AI and machine learning models tended to be inconsistent, irregularly tracked, and not particularly informative.
This motivated Nazneen Rajani, a principal researcher at Salesforce who heads the company's NLP group, to create an ecosystem for robustness evaluation of machine learning models. Together with Stanford associate professor of computer science Christopher Ré and the University of North Carolina at Chapel Hill's Mohit Bansal, Rajani and team developed Robustness Gym, which aims to unify the patchwork of existing robustness libraries to accelerate the development of novel NLP model testing strategies.
“While existing robustness tools implement specific strategies such as adversarial attacks or model-based augmentations, Robustness Gym provides a one-stop shop to run and compare a broad range of evaluation strategies,” Rajani told VentureBeat via email. “We hope that Robustness Gym will make robustness testing a standard component of the machine learning pipeline.”
Robustness Gym offers practitioners guidance on how key variables (their task, evaluation needs, and resource constraints) can help prioritize which evaluations to run. These include the influence a given task exerts through its known structure and prior evaluations; needs such as testing generalization, fairness, or safety; and constraints such as expertise, access to compute, and human resources.
Robustness Gym groups robustness tests into four evaluation “idioms”: subpopulations, transformations, evaluation sets, and adversarial attacks. Practitioners can create what are called slices, where each slice defines a collection of evaluation examples built using one idiom or a combination of them. Users are scaffolded into a simple two-stage workflow that separates the storage of structured side information about individual examples from the programmatic construction of slices using that information.
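The slice idea can be illustrated with a short sketch in plain Python. Note this is not Robustness Gym's actual API; all names and data here are invented for illustration of the two-step workflow described above.

```python
# Step 1: cache structured side information for each example.
# Step 2: programmatically build "slices" (subsets for evaluation) from it.
# Plain-Python sketch of the concept, NOT Robustness Gym's actual API.

examples = [
    {"text": "The movie was great", "label": 1},
    {"text": "Terrible plot, wooden acting, and a score that drones on forever", "label": 0},
    {"text": "Fine", "label": 1},
]

# Step 1: annotate each example with side information (here: token count).
for ex in examples:
    ex["side_info"] = {"num_tokens": len(ex["text"].split())}

# Step 2: a subpopulation slice is just a named filter over that information.
def build_slice(name, predicate, data):
    return {"name": name, "examples": [ex for ex in data if predicate(ex)]}

short_inputs = build_slice("short (<5 tokens)",
                           lambda ex: ex["side_info"]["num_tokens"] < 5, examples)
long_inputs = build_slice("long (>=10 tokens)",
                          lambda ex: ex["side_info"]["num_tokens"] >= 10, examples)

# A model would then be scored on each slice separately, exposing
# performance gaps that a single aggregate metric would hide.
print(len(short_inputs["examples"]), len(long_inputs["examples"]))  # -> 2 1
```

Because the side information is cached once in step one, many different slices can be composed from it cheaply in step two.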
Robustness Gym also consolidates slices and results to support prototyping, iteration, and collaboration. Practitioners can organize slices into a test bench that can be versioned and shared, allowing a community of users to collaboratively build benchmarks and track progress. For reporting, Robustness Gym provides standard and custom robustness reports that can be generated automatically from test benches and included in paper appendices.
In one case study, Rajani and her coauthors asked a sentiment modeling team at a “big tech company” to measure their model's bias using subpopulations and transformations. After testing the system on 172 slices spanning three evaluation idioms, the modeling team found performance degradations of up to 18% on 16 slices.
In a more revealing test, Rajani and her team used Robustness Gym to compare the commercial NLP APIs of Microsoft (Text Analytics API), Google (Cloud Natural Language API), and Amazon (Comprehend API) with the open source systems Bootleg, WAT, and REL on two benchmark datasets for named entity linking. (Named entity linking involves identifying and disambiguating key elements of a text, such as the names of people, places, brands, monetary values, and more.) They found that the commercial systems struggled to link rare, less popular entities, were sensitive to entity capitalization, and often ignored contextual cues when making predictions. Microsoft outperformed the other commercial systems, but Bootleg beat the rest in consistency.
“Google and Microsoft perform better on certain topics, for example Google on ‘alpine sports’ and Microsoft on ‘skating’ … [but] commercial systems sidestep the difficult problem of disambiguating ambiguous entities in favor of returning the most popular answer,” Rajani and her coauthors wrote in the paper describing their work. “Overall, our results suggest that leading academic systems far outperform commercial APIs for named entity linking.”
In a final experiment, Rajani's team implemented five subpopulations that capture abstractiveness, content distillation, positional bias, information dispersion, and information reordering. After comparing seven NLP models, including Google's T5 and Pegasus, on an open source summarization dataset across these subpopulations, the researchers found that the models struggled to perform well on examples that were heavily distilled, required higher degrees of abstraction, or contained more references to entities. Surprisingly, models with different prediction mechanisms appeared to make “highly correlated” errors, suggesting that existing metrics fail to capture meaningful performance differences.
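One way to picture a “content distillation” subpopulation is to bucket examples by how much the reference summary compresses the source. The length-ratio proxy and thresholds below are illustrative assumptions, not necessarily the exact definitions used in the Robustness Gym paper.

```python
# Sketch of one summarization subpopulation of the kind described above:
# bucket examples by how heavily the reference summary "distills" the
# source, using a simple word-count compression ratio as a proxy.

def compression_ratio(source, summary):
    """Fraction of source length retained by the summary."""
    return len(summary.split()) / max(len(source.split()), 1)

examples = [
    {"source": "word " * 100, "summary": "word " * 10},  # ratio 0.10
    {"source": "word " * 100, "summary": "word " * 60},  # ratio 0.60
]

# The "heavily distilled" subpopulation: summaries keeping <20% of the source.
heavily_distilled = [ex for ex in examples
                     if compression_ratio(ex["source"], ex["summary"]) < 0.2]

# Scoring a summarizer only on this bucket can surface the degradation on
# heavily distilled examples that aggregate benchmark scores hide.
print(len(heavily_distilled))  # -> 1
```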
“By using Robustness Gym, we demonstrate that robustness remains a challenge, even for corporate giants such as Google and Amazon,” Rajani said. “Specifically, we show that these companies' public APIs perform significantly worse than simple string-matching algorithms on the task of entity disambiguation when evaluated on infrequent (tail) entities.”
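To see why a string-matching baseline can win on tail entities, consider a toy sketch: link a mention to the knowledge-base entity whose title matches it exactly, rather than defaulting to the most popular candidate. The tiny knowledge base here is invented for illustration and is not the baseline's actual implementation.

```python
# Toy string-matching baseline for entity disambiguation, in the spirit
# of the comparison above: exact (case-insensitive) title match against a
# knowledge base, abstaining when nothing matches.

KNOWLEDGE_BASE = [
    "Abraham Lincoln",
    "Lincoln, Nebraska",   # an infrequent ("tail") entity
    "Amazon (company)",
    "Amazon River",
]

TITLE_INDEX = {title.lower(): title for title in KNOWLEDGE_BASE}

def link_by_string_match(mention):
    """Return the entity whose title exactly matches the mention, if any."""
    return TITLE_INDEX.get(mention.lower().strip())

# A popularity-biased system might map any "Lincoln"-like mention to the
# most famous entity; exact matching recovers the tail entity instead.
print(link_by_string_match("lincoln, nebraska"))  # -> Lincoln, Nebraska
print(link_by_string_match("Lincoln"))            # -> None (ambiguous, abstain)
```

The trade-off is abstention on ambiguous mentions, which is exactly where the paper says commercial APIs fall back on popularity.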
The paper and the source code for Robustness Gym are available as of today.