I am a third-year PhD student in CS at Princeton, advised by Olga Russakovsky.
I received my B.S. in EECS from Berkeley in 2022 and my M.S. in 2023, advised by Jitendra Malik.
I am a recipient of the Princeton President's Fellowship.
I am broadly interested in creating computer vision systems which can learn from and interpret visual data as humans do.
While at Berkeley, I had the great fortune to work with many wonderful people, including Karttikeya Mangalam, Alvin Wan, and Dan Hendrycks.
I was also heavily involved in teaching and outreach, serving on the CS 70 course staff multiple times and previously leading Machine Learning @ Berkeley. You can find out more on my main website here.
If you are interested in collaborating, or would like to chat about research or advice, feel free to reach out at [first][last][at]cs[dot]princeton[dot]edu.
CV / Google Scholar / Twitter / Github
News
- [Nov 2025] Preprint on video-text alignment is available on arXiv!
- [Jun 2025] I'm at Google DeepMind, London as a Student Researcher working with Viorica Pătrăucean on video interpretability!
- [Jan 2025] Preprint on a multi-encoder representation of videos, MERV, is available on arXiv! (May 2025: accepted to ICML!)
- [Mar 2024] Preprint on large image modeling, xT, is available on arXiv (May 2024: accepted to ICML)! Also co-organizing the Transformers for Vision Workshop @ CVPR 2024 after the great experience I had attending last year.
- [May 2023] Paper on fast reversible transformers, PaReprop, was accepted as a spotlight at the Transformers for Vision Workshop @ CVPR 2023!
- [Apr 2023] I am starting my PhD at Princeton in Fall 2023, advised by Professor Olga Russakovsky!
Research
I'm broadly interested in computer vision, with the goal of creating visual systems that can effectively reason about and interact with the real world.
Currently, I am most interested in the intersection of video and language.
I believe that understanding how to promote videos as a fundamental unit of vision can be key to unlocking the next generation of visual systems.
Real-world interaction is also governed by language, so I am interested in bridging the gap between the two modalities.
This includes understanding how to learn from videos efficiently, how to reason about them, and how to represent them.
I am also interested in large models, particularly vision models, and how to make them more efficient and broadly useful.
Much of my previous work has been on general-purpose, memory-efficient techniques to make this possible.
Extending the Platonic Representation Hypothesis to dynamic inputs like videos and multi-caption text dramatically improves their alignment.
This suggests that models are more aligned than previously thought; their inputs were simply impoverished.
However, these representations are not yet temporally sensitive.
We also analyze alignment as a zero-shot probe of video representations on downstream tasks.
Learning Dynamics of Multitask Training Data in Vision Language Models
Tyler Zhu,
Koome Murungi,
Polina Kirichenko,
Olga Russakovsky
Workshop on Space in Vision, Language, and Embodied AI (SpaVLE) @ NeurIPS 2025
poster
VLMs achieve strong performance on a wide array of benchmarks, yet the source of this generalization is poorly understood.
How do we disentangle memorization from generalization?
We evaluate train and validation accuracy in a one-epoch setting (on seen and unseen examples) and break examples into task-specific visual categories for analysis.
This reveals inefficiencies in question formatting (Bounding) and misleading task accuracies (OCR).
We propose a framework that uses many visual encoders, covering broad visual categories like action recognition and spatial understanding, as a unified visual encoder for video LLMs.
This trend is exciting because it could allow our model to scale visual processing with the number of GPUs, running all encoders in parallel (sharding one expert per device) while retaining runtimes similar to a single encoder.
A simple yet effective framework for adapting vision models trained on small 224x224 images to larger images with larger contexts, using an LLM-style encoder to integrate context over larger regions than otherwise possible.
We also propose a set of benchmarks that effectively reflect these improvements on larger images.
We overcome the recomputation overhead of reversible transformers by parallelizing the backward pass using CUDA streams.
This speeds up training for models in both vision and language, making them nearly as fast as the base models, with incredible memory savings to boot.
The Many Faces of Robustness: A Critical Analysis of Out-of-Distribution Generalization
Dan Hendrycks,
Steven Basart,
Norman Mu,
Saurav Kadavath,
Frank Wang,
Evan Dorundo,
Rahul Desai,
Tyler Zhu,
Samyak Parajuli,
Mike Guo,
Dawn Song,
Jacob Steinhardt,
Justin Gilmer
ICCV 2021
code / arXiv
Four new datasets measuring real-world distribution shifts, the most well-known of which is ImageNet-R(enditions), as well as a new state-of-the-art data augmentation method that outperforms models pretrained with 1000x more labeled data.
In this work, we present an in-depth analysis of reversible transformers and demonstrate that they can be more accurate, more memory-efficient, and faster than their vanilla counterparts. We introduce a new method of reversible backpropagation that is faster and scales better with memory than previous techniques, and we demonstrate new results showing that reversible transformers transfer better to downstream visual tasks.
We propose a framework for automating prompt tuning to learn preferences in text-to-image synthesis using reinforcement learning from human feedback.
Guided Resource and Education Program: High School Workshop Initiative
Advisor (behind the scenes)
Machine Learning at Berkeley, Fall 2023
We piloted a free two-day workshop teaching the basics of machine learning to local Bay Area high school students with little access to coding resources.
Our goal was to be inclusive and representative of all backgrounds and experiences, and we were able to reach over 40 students evenly split between male and female participants.
Broadening Research Collaborations Workshop
Co-organizer
NeurIPS 2022
We organized a workshop at NeurIPS 2022 to bring together researchers from different backgrounds and experiences to discuss the challenges and opportunities in non-traditional collaborations beyond the standard academic and industry models.