Imagine being able to reach into a photograph and change it to your liking.
That’s the thinking that led Brown researcher Rahul Sajnani GS, a PhD candidate in the Department of Computer Science, to build GeoDiffuser. The geometry-based image editing model optimizes how objects within photos are translated, rotated or removed.
GeoDiffuser, developed in partnership with Amazon Robotics, earned Sajnani and his co-authors the Best Student Paper award at the Institute of Electrical and Electronics Engineers/Computer Vision Foundation Winter Conference on Applications of Computer Vision.
“Initially, our research was focused on novel view synthesis, taking an image and trying to generate what an object in it would look like from a new angle,” Sajnani said in an interview with The Herald. “But we realized that the idea of applying geometric transformations could extend far beyond that.”
GeoDiffuser works differently from traditional image editing models, which often require fine-tuning on large visual datasets or retraining for each new task. Instead, Sajnani’s method optimizes a geometric transformation, an instruction for how to rotate or move an object in 3D space, and applies that change inside the generative model’s attention layers, the components that determine which parts of the input the model focuses on. The result is a training-free technique that keeps the edited image faithful to the qualities of the object being transformed.
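For readers curious what such a transformation looks like in practice, it is typically expressed as a 4-by-4 matrix combining a rotation and a translation. Here is a minimal numpy sketch, with an angle and offset chosen purely for illustration rather than taken from the paper:

```python
import numpy as np

def rigid_transform(yaw_deg: float, translation: np.ndarray) -> np.ndarray:
    """Build a 4x4 matrix that rotates about the vertical axis, then translates."""
    theta = np.deg2rad(yaw_deg)
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, 0, s],
                          [0, 1, 0],
                          [-s, 0, c]])  # rotation about the y-axis
    T[:3, 3] = translation               # 3D translation
    return T

# Illustrative edit: rotate an object 45 degrees and shift it 0.5 units right.
T = rigid_transform(45.0, np.array([0.5, 0.0, 0.0]))
point = np.array([1.0, 0.0, 2.0, 1.0])   # a 3D point in homogeneous coordinates
print(T @ point)                          # the point's new location
```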
Sajnani compared the model to the work of a cameraman on a movie set. “The cameraman doesn’t have a say in the scene, but the cameraman just views it as a passive,” he said. “So you can think of GeoDiffuser as this passive observer.”
GeoDiffuser builds on existing diffusion models, such as the ones powering DALL-E or Stable Diffusion.
The model runs two versions of the image generation in parallel: One recreates the original image, and the other makes the desired edit. The “shared attention” mechanism links the two, so that the model can keep the background and unedited parts consistent between the original and edited image.
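That description matches a common attention-sharing pattern in diffusion-based editing, in which the edited generation borrows attention keys and values from the reference generation so unedited regions stay anchored to the original. The PyTorch sketch below is a loose illustration of that pattern, not GeoDiffuser’s actual code, and the tensor shapes are invented:

```python
import torch
import torch.nn.functional as F

def shared_attention(q_edit, k_ref, v_ref):
    """Let the edit branch attend to the reference branch's keys and values,
    so unedited regions stay consistent with the original image."""
    scale = q_edit.shape[-1] ** -0.5
    weights = F.softmax(q_edit @ k_ref.transpose(-2, -1) * scale, dim=-1)
    return weights @ v_ref

# Toy tensors: batch of 1, 64 spatial tokens, 32 channels per head.
q_edit = torch.randn(1, 64, 32)   # queries from the edited generation
k_ref = torch.randn(1, 64, 32)    # keys from the reference generation
v_ref = torch.randn(1, 64, 32)    # values from the reference generation
out = shared_attention(q_edit, k_ref, v_ref)
print(out.shape)  # torch.Size([1, 64, 32])
```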
That link comes from a method Sajnani described as “injecting geometry.” GeoDiffuser uses depth maps, images that encode the distance of each pixel in a scene from a fixed reference point, along with transformation matrices that tell the model where to send each pixel, Sajnani explained. These transformations are slipped into the model’s attention layers and guide edits without changing the model’s core weights, the parameters learned during training.
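Telling the model “where to send each pixel” with a depth map and a transformation matrix amounts to unprojecting each pixel into 3D, applying the transform and projecting it back. A minimal numpy sketch, assuming pinhole-camera intrinsics K that are illustrative rather than drawn from the paper:

```python
import numpy as np

K = np.array([[500.0, 0.0, 128.0],    # assumed pinhole intrinsics:
              [0.0, 500.0, 128.0],    # focal length of 500 pixels,
              [0.0, 0.0, 1.0]])       # principal point at (128, 128)

def move_pixel(u, v, depth, transform):
    """Unproject pixel (u, v) at the given depth, apply a 4x4 rigid
    transform, and project the result back to pixel coordinates."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    point = np.append(ray * depth, 1.0)        # 3D point, homogeneous
    moved = transform @ point                  # apply the edit's geometry
    proj = K @ moved[:3]
    return proj[:2] / proj[2]                  # new pixel location

# Shift the geometry 0.2 units along x; see where pixel (100, 120) lands.
T = np.eye(4)
T[0, 3] = 0.2
print(move_pixel(100, 120, depth=3.0, transform=T))
```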
First, the model is given an instruction, such as “move this car to the right” or “rotate this dog to face left,” from which it computes the corresponding geometric transformation and feeds that information into the diffusion process. The model then iteratively refines the image, using loss functions to guide each denoising step.
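That refinement process resembles a standard pattern in optimization-guided diffusion editing: taking small gradient steps on the diffusion latent to shrink a loss at each iteration. The loss and feature extractor in the sketch below are stand-ins, since GeoDiffuser’s actual objectives are defined over its attention layers:

```python
import torch

def refine_latent(latent, reference_features, feature_fn, lr=0.05, steps=10):
    """Nudge the diffusion latent so its features match the reference,
    one small gradient step at a time."""
    latent = latent.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # Placeholder loss: distance between edited and reference features.
        loss = torch.nn.functional.mse_loss(feature_fn(latent), reference_features)
        loss.backward()
        optimizer.step()
    return latent.detach()

# Toy run with an identity "feature extractor" standing in for the model.
latent = torch.randn(1, 4, 64, 64)
target = torch.zeros(1, 4, 64, 64)
refined = refine_latent(latent, target, feature_fn=lambda z: z)
print(refined.abs().mean())  # smaller than the initial latent's mean
```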
GeoDiffuser’s design enables the model to efficiently preserve the identity of an object. In one experiment, GeoDiffuser removed a boat sitting on a lake and erased its shadow and reflection on the water automatically.
For Sajnani, the most rewarding part of the project wasn’t just optimizing a tool: It was understanding these systems and how they operate.
Sajnani hopes to develop new models that could aid perception and training data generation for robotic movement. But, according to Sajnani, the real challenge isn’t moving pixels around; it’s figuring out where and how to apply transformations within the labyrinth-like layers of a generative model.
“We don’t yet completely understand what these models are doing, right?” Sajnani said, adding that “most of these models don’t have geometry” baked in. “How can I inject geometry, even when (I) know this model was not trained with a geometry condition?”
According to Sajnani, the model can still be improved. GeoDiffuser works best with modest changes to objects, like 45- to 60-degree rotations, translations and removals. Extreme rotations, like turning a human 180 degrees, remain an open challenge, he explained.
Sajnani added that while today’s image editing software might let you nudge a lamp or spin a parked car, tomorrow’s could edit entire 3D environments on the fly.
Correction: A previous version of this article misstated quotes from Rahul Sajnani. The article has been updated to accurately reflect Sajnani’s statements on novel view synthesis, the similarity between models and cameramen and the challenges with developing models. The article was also updated to remove quotes from Sajnani that could not be verified. The Herald regrets the error.