In today’s world, people turn to their phones and computers in search of information, expecting relevant results at the click of a button. Computers rely on their ability to process human language to return personalized results in seconds. Two grants awarded in January will allow computer science researchers to improve the capabilities of computers and search engines to retrieve and understand human language.
With support from grants from the Defense Advanced Research Projects Agency and the Intelligence Advanced Research Projects Activity, the latter a six-million-dollar award that is the largest grant received by the University’s computer science department to date, Assistant Professor of Computer Science Ellie Pavlick and her collaborators aim to improve how machines retrieve and understand human language through two separate projects.
The project “Better Extraction from Text Toward Enhanced Retrieval” aims to make searching large document collections more efficient, while the “Grounded Artificial Intelligence Language Acquisition” project seeks to teach computers to learn language the way humans do.
Project Better Extraction from Text Toward Enhanced Retrieval
The BETTER project is about information retrieval, said Carsten Eickhoff, assistant professor of medical science and computer science. Information retrieval involves searching through large amounts of data, he said. “It’s a needle in a haystack problem. You want to find the relevant webpages that correspond to your query in the available billions of pages,” he added.
Traditionally, relevant results appear based on the terminology typed into the search engine. The degree of overlap between the terms in the search query and the terms in a document, including synonyms and related terms, determines how prominently the document or link appears in the results, Eickhoff said. But two words can mean nearly the same thing without being recognized as synonyms by computers, a shortcoming of this search method.
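To make the term-overlap idea concrete, here is a minimal sketch in Python; the tiny synonym table and example documents are hypothetical illustrations, not part of either project:

```python
from collections import Counter

# Hypothetical toy synonym table; a real engine would draw on a thesaurus
# or learned word representations.
SYNONYMS = {"car": {"automobile"}, "buy": {"purchase"}}

def expand(terms):
    """Add known synonyms to a set of query terms."""
    expanded = set(terms)
    for term in terms:
        expanded |= SYNONYMS.get(term, set())
    return expanded

def overlap_score(query, document):
    """Count how often (expanded) query terms occur in the document."""
    query_terms = expand(query.lower().split())
    doc_terms = Counter(document.lower().split())
    return sum(doc_terms[term] for term in query_terms)

docs = ["I want to buy a car", "An automobile was purchased yesterday"]
print(sorted(docs, key=lambda d: overlap_score("buy car", d), reverse=True))
```

Note that “purchased” never matches the expanded term “purchase”: exact term matching misses near-synonyms and word forms, the kind of shortcoming Eickhoff describes.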
To overcome such shortcomings, the team is using neural network technologies to pursue cross-lingual information retrieval along with more personalized search results. Say “you trained a search engine to find news articles in English, and you want to retrieve technical documents in Arabic. We try to find whether there is a good representation of language that will allow you to make that transition without new training,” Pavlick said.
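The core of that transfer, assuming a shared multilingual representation, can be sketched with stubbed-in vectors; the encoder output below is invented for illustration and is not the team’s model:

```python
import numpy as np

# Stub of a hypothetical multilingual encoder. In the real approach, a neural
# network is trained so that text with the same meaning lands near the same
# point in a shared vector space, regardless of language.
EMBEDDINGS = {
    "economic sanctions announced": np.array([0.90, 0.10, 0.20]),
    "إعلان عقوبات اقتصادية": np.array([0.88, 0.12, 0.18]),  # Arabic, same meaning
    "recipe for lentil soup": np.array([0.10, 0.90, 0.30]),
}

def cosine(a, b):
    """Similarity of two vectors; values near 1.0 mean nearly identical direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = EMBEDDINGS["economic sanctions announced"]  # English query
for text in ("إعلان عقوبات اقتصادية", "recipe for lentil soup"):
    print(round(cosine(query, EMBEDDINGS[text]), 3), text)
```

Because the English query and the Arabic document land close together in the shared space, a ranking model trained only on English text can score the Arabic document highly without new training.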
This type of cross-lingual information retrieval could potentially allow users to search through Chinese documents without knowing a word of Chinese, Eickhoff said.
The team is also trying to personalize searches based on the individual user without using much, if any, private data. “Commercial search engines would keep track of your search behavior to learn effective ranking models for you and for everyone,” Eickhoff said. This type of personalization becomes very difficult if the person does not want to “pay the price of privacy,” he added. A possible approach would be to personalize the search results for groups of people rather than for individuals.
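A rough sketch of that group-level idea, with entirely hypothetical click logs, would aggregate behavior per cohort so that no individual’s history needs to be stored:

```python
from collections import Counter

# Hypothetical click counts aggregated per interest group rather than per user.
GROUP_CLICKS = {
    "medical-researchers": Counter({"pubmed.gov": 40, "news.example": 5}),
    "sports-fans": Counter({"scores.example": 30, "news.example": 25}),
}

def rerank(results, group):
    """Boost results that the user's group clicks often; individuals stay anonymous."""
    clicks = GROUP_CLICKS.get(group, Counter())
    return sorted(results, key=lambda url: clicks[url], reverse=True)

print(rerank(["news.example", "pubmed.gov"], "medical-researchers"))
```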
Although the project presents potential to be applied in many different domains, for now the team hopes to use it to target news-related searches on topics of public interest.
Project Grounded Artificial Intelligence Language Acquisition
The GAILA project aims to teach computers language in a way that differs from the common approach, known as distributional semantics, in which a word is represented by the contexts in which it appears in text. “If a word is often being mentioned in a context, and if another word is also used in a similar context, then they are similar words … An eagle or hawk will take flight, but a bus will never do that,” Eickhoff said.
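A bare-bones version of distributional semantics can be built from nothing but co-occurrence counts; the toy corpus below simply echoes Eickhoff’s example:

```python
from collections import defaultdict
import math

# Toy corpus: eagles and hawks appear in similar contexts, buses do not.
corpus = [
    "the eagle took flight over the cliff",
    "the hawk took flight at dawn",
    "the eagle circled above the nest",
    "the hawk circled above the field",
    "the bus stopped at the station",
    "the bus idled at the terminal",
]

# Count the words that appear within two positions of each word.
contexts = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - 2), min(len(words), i + 3)):
            if j != i:
                contexts[word][words[j]] += 1

def cosine(u, v):
    """Similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = lambda vec: math.sqrt(sum(c * c for c in vec.values()))
    return dot / (norm(u) * norm(v))

print(cosine(contexts["eagle"], contexts["hawk"]))  # high: shared contexts
print(cosine(contexts["eagle"], contexts["bus"]))   # lower: different contexts
```

On this corpus, “eagle” and “hawk” share contexts like “took flight” and “circled above,” so their similarity is high, while “bus” scores much lower.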
While this approach is effective, it also has its drawbacks. Computers can more easily learn to understand nouns than verbs, but they still “don’t really understand what (words) actually mean. They can’t answer fine-grain questions, … reason hypothetical concepts or stitch several things together,” Pavlick said.
The team had to consider how language is naturally understood and determine “the right representations to use for language so that computers can reason about what is meant by what is said,” Pavlick said.
A solution presented itself in the form of grounded language acquisition. The idea is to “let a little virtual agent loose in a VR setting where it interacts with humans, very similar to how children learn language. A child observes and has questions, and we want to create such interactions. We have kitchens and living rooms, both in 2D and 3D. Right now, it’s at the stage where the robot just observes,” Eickhoff said. The next step would be to have the robot interact with the researchers and ask questions in order to create a feedback loop, he added.
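GAILA’s agent itself is not public code, but one classic mechanism behind this style of learning, cross-situational word learning, fits in a few lines; the observations below are invented for illustration and are not the project’s data:

```python
from collections import defaultdict

# Each hypothetical observation pairs the words the agent hears with the
# objects it sees, the way a child hears speech while watching a scene.
observations = [
    ({"look", "a", "cup"}, {"cup", "table"}),
    ({"the", "cup", "fell"}, {"cup", "floor"}),
    ({"a", "red", "ball"}, {"ball", "table"}),
]

# A word that keeps co-occurring with the same object becomes associated with it.
assoc = defaultdict(lambda: defaultdict(int))
for words, objects in observations:
    for word in words:
        for obj in objects:
            assoc[word][obj] += 1

print(max(assoc["cup"], key=assoc["cup"].get))  # -> "cup", seen twice with the word
```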
After the tests are run, the researchers compare the quality of the robot’s inferences to what would be expected of a human, Pavlick said.
Roman Feiman, assistant professor of cognitive, linguistic and psychological sciences and one of the co-principal investigators of the GAILA project, is tasked with providing “input, ideas and design constraints informed by what we know about human language learning,” he wrote in an email to The Herald.
One of the main goals of the project is to “build systems that are more useful to people” and to make human-computer interactions easier, which can be accomplished by creating robots that understand natural human language. “Language is a really powerful interface to use to interact with a computer. It is easier to talk than code,” Pavlick said. “If you say, ‘Can you grab that’, (the computer) has to know what ‘that’ is.”
Teaching computers to better interact with language is also essential for access to knowledge. “Everything we know is documented somewhere in language, and the computers should be able to access those,” she added.
Studying human interactions is one way to understand the efficacy of language, Pavlick said. But constructing models like GAILA that interpret and represent language also provides another perspective. “It gives us the power to assess what representations work,” she said.
“This is an exciting grant not only because of its record funding level, but also, and more importantly, because of the multi-disciplinary research and progress it will enable on how machines can understand and use natural language, a core challenge to be addressed towards human-level machine intelligence,” wrote Ugur Cetintemel, professor of computer science and chair of the department, in an email to The Herald.