Today, we are joined by Mor Geva, a postdoctoral researcher, now at Google and previously at the Allen Institute for AI (AI2). Her research focuses on debugging the inner workings of black-box NLP models in order to increase their transparency, control their operation, and improve their reasoning abilities. Mor is a previous guest on the show: last time, she spoke about annotator bias in crowdsourced datasets and how it affects the robustness of NLP models. Today, she follows up on that study, investigating where bias starts.
She started by discussing a bias pattern she observed in datasets annotated by crowdsourced workers, which she traced largely to the instructions researchers gave the annotators. She then detailed how researchers can frame questions and instructions to avoid propagating bias when hiring crowd workers.
Mor spoke about the StrategyQA dataset, a question-answering benchmark for testing the ability of models to perform implicit reasoning. She discussed how the data was gathered and the steps taken to ensure it was diverse in terms of topics and reasoning types.
StrategyQA is one of the challenging tasks in BIG-bench, a collaborative benchmark for measuring the capabilities of large language models. The construction of BIG-bench was led by Google and involved contributions from over 400 researchers in the NLP community. She highlighted possible reasons the top-ranking models on the leaderboard performed well.
Mor then discussed the role of benchmarks like BIG-bench in advancing language models. In closing, she gave her take on whether the current trajectory of language models will lead to AGI, highlighting some limitations of large language models. You can follow Mor on Twitter @megamor2 or on her webpage.