CV Arxiv Daily

Updated on 2025.08.27


VLM

Publish Date	Title	Authors	PDF	Code
2025-07-23	BetterCheck: Towards Safeguarding VLMs for Automotive Perception Systems	Malsha Ashani Mahawatta Dona et.al.	2507.17722	null
2025-07-23	InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation	Shuai Yang et.al.	2507.17520	null
2025-07-23	Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection	Francesco Tonini et.al.	2507.17456	null
2025-07-23	VLM-Guided Visual Place Recognition for Planet-Scale Geo-Localization	Sania Waheed et.al.	2507.17455	null
2025-07-23	Dynamic-DINO: Fine-Grained Mixture of Experts Tuning for Real-time Open-Vocabulary Object Detection	Yehao Lu et.al.	2507.17436	null
2025-07-23	Language-Conditioned Open-Vocabulary Mobile Manipulation with Pretrained Models	Shen Tan et.al.	2507.17379	null
2025-07-23	RoadBench: A Vision-Language Foundation Model and Benchmark for Road Damage Understanding	Xi Xiao et.al.	2507.17353	null
2025-07-23	VisionTrap: Unanswerable Questions On Visual Data	Asir Saadat et.al.	2507.17262	null
2025-07-23	HySafe-AI: Hybrid Safety Architectural Analysis Framework for AI Systems: A Case Study	Mandar Pitale et.al.	2507.17118	null
2025-07-23	FedVLM: Scalable Personalized Vision-Language Models through Federated Learning	Arkajyoti Mitra et.al.	2507.17088	null
2025-07-22	Semi-off-Policy Reinforcement Learning for Vision-Language Slow-thinking Reasoning	Junhao Shen et.al.	2507.16814	null
2025-07-22	Cooling Matters: Benchmarking Large Language Models and Vision-Language Models on Liquid-Cooled Versus Air-Cooled H100 GPU Systems	Imran Latif et.al.	2507.16781	null
2025-07-22	Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning	Ang Li et.al.	2507.16746	null
2025-07-22	Enhancing Remote Sensing Vision-Language Models Through MLLM and LLM-Based High-Quality Image-Text Dataset Generation	Yiguo He et.al.	2507.16716	null
2025-07-22	Experience is the Best Teacher: Grounding VLMs for Robotics through Self-Generated Memory	Guowei Lan et.al.	2507.16713	null
2025-07-22	Spatial 3D-LLM: Exploring Spatial Awareness in 3D Vision-Language Models	Xiaoyan Wang et.al.	2507.16524	null
2025-07-22	SceneLoom: Communicating Data with Scene Context	Lin Gao et.al.	2507.16466	null
2025-07-22	Quality Text, Robust Vision: The Role of Language in Enhancing Visual Robustness of Vision-Language Models	Futa Waseda et.al.	2507.16257	null
2025-07-22	SeC: Advancing Complex Video Object Segmentation via Progressive Concept Construction	Zhixiong Zhang et.al.	2507.15852	null
2025-07-21	Can Your Model Separate Yolks with a Water Bottle? Benchmarking Physical Commonsense Understanding in Video Generation Models	Enes Sanli et.al.	2507.15824	null
2025-07-22	Visual-Language Model Knowledge Distillation Method for Image Quality Assessment	Yongkang Hou et.al.	2507.15680	null
2025-07-21	Smart Eyes for Silent Threats: VLMs and In-Context Learning for THz Imaging	Nicolas Poggi et.al.	2507.15576	null
2025-07-21	HOLa: Zero-Shot HOI Detection with Low-Rank Decomposed VLM Feature Adaptation	Qinqian Lei et.al.	2507.15542	null
2025-07-21	Chart-R1: Chain-of-Thought Supervision and Reinforcement for Advanced Chart Reasoner	Lei Chen et.al.	2507.15509	null
2025-07-21	One Last Attention for Your Vision-Language Model	Liang Chen et.al.	2507.15480	null
2025-07-21	EgoPrune: Efficient Token Pruning for Egomotion Video Reasoning in Embodied Agent	Jiaao Li et.al.	2507.15428	null
2025-07-21	In-context Learning of Vision Language Models for Detection of Physical and Digital Attacks against Face Recognition Systems	Lazaro Janier Gonzalez-Soler et.al.	2507.15285	null
2025-07-21	VLM-UDMC: VLM-Enhanced Unified Decision-Making and Motion Control for Urban Autonomous Driving	Haichao Liu et.al.	2507.15266	null
2025-07-18	VLA-Mark: A cross modal watermark for large vision-language alignment model	Shuliang Liu et.al.	2507.14067	null
2025-07-18	EdgeVLA: Efficient Vision-Language-Action Models	Paweł Budzianowski et.al.	2507.14049	null
2025-07-18	Moodifier: MLLM-Enhanced Emotion-Driven Image Editing	Jiarong Ye et.al.	2507.14024	null
2025-07-18	When Seeing Overrides Knowing: Disentangling Knowledge Conflicts in Vision-Language Models	Francesco Ortu et.al.	2507.13868	null
2025-07-18	Team of One: Cracking Complex Video QA with Model Synergy	Jun Xie et.al.	2507.13820	null
2025-07-18	Teaching Vision-Language Models to Ask: Resolving Ambiguity in Visual Questions	Pu Jian et.al.	2507.13773	null
2025-07-18	Innocence in the Crossfire: Roles of Skip Connections in Jailbreaking Visual Language Models	Palash Nandi et.al.	2507.13761	null
2025-07-17	LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning	Kaihong Wang et.al.	2507.13568	null
2025-07-17	VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning	Senqiao Yang et.al.	2507.13348	null
2025-07-17	Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It	Yulu Qin et.al.	2507.13328	null
2025-07-17	Leveraging Language Prior for Infrared Small Target Detection	Pranav Singh et.al.	2507.13113	null
2025-07-17	GLAD: Generalizable Tuning for Vision-Language Models	Yuqi Peng et.al.	2507.13089	null
2025-07-17	Unmodulated Visible Light Positioning: A Deep Dive into Techniques, Studies, and Future Prospects	Morteza Alijani et.al.	2507.13080	null
2025-07-17	Advancing Complex Wide-Area Scene Understanding with Hierarchical Coresets Selection	Jingyao Wang et.al.	2507.13061	null
2025-07-17	LaViPlan: Language-Guided Visual Path Planning with RLVR	Hayeon Oh et.al.	2507.12911	null
2025-07-17	MCoT-RE: Multi-Faceted Chain-of-Thought and Re-Ranking for Training-Free Zero-Shot Composed Image Retrieval	Jeong-Woo Park et.al.	2507.12819	null
2025-07-17	City-VLM: Towards Multidomain Perception Scene Understanding via Multimodal Incomplete Learning	Penglei Sun et.al.	2507.12795	null
2025-07-16	VLMgineer: Vision Language Models as Robotic Toolsmiths	George Jiayuan Gao et.al.	2507.12644	null
2025-07-17	PhysX: Physical-Grounded 3D Asset Generation	Ziang Cao et.al.	2507.12465	null
2025-07-16	Describe Anything Model for Visual Question Answering on Text-rich Images	Yen-Linh Vu et.al.	2507.12441	null
2025-07-16	AutoVDC: Automated Vision Data Cleaning Using Vision-Language Models	Santosh Vasa et.al.	2507.12414	null
2025-07-16	Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models	Felix Nützel et.al.	2507.12236	null
2025-07-16	MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM	Tao Chen et.al.	2507.12232	null
2025-07-16	InstructFLIP: Exploring Unified Vision-Language Model for Face Anti-spoofing	Kun-Hsiang Lin et.al.	2507.12060	null
2025-07-16	GS-Bias: Global-Spatial Bias Learner for Single-Image Test-Time Adaptation of Vision-Language Models	Zhaohong Huang et.al.	2507.11969	null
2025-07-16	POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering	Yichen Xu et.al.	2507.11939	null
2025-07-15	Seeing the Signs: A Survey of Edge-Deployable OCR Models for Billboard Visibility Analysis	Maciej Szankin et.al.	2507.11730	null
2025-07-15	General Modular Harness for LLM Agents in Multi-Turn Gaming Environments	Yuxuan Zhang et.al.	2507.11633	null
2025-07-15	CogDDN: A Cognitive Demand-Driven Navigation with Decision Optimization and Dual-Process Thinking	Yuehao Huang et.al.	2507.11334	null
2025-07-15	How Far Have Medical Vision-Language Models Come? A Comprehensive Benchmarking Study	Che Liu et.al.	2507.11200	null
2025-07-15	Bridging the Gap in Vision Language Models in Identifying Unsafe Concepts Across Modalities	Yiting Qu et.al.	2507.11155	null
2025-07-15	Assessing Color Vision Test in Large Vision-language Models	Hongfei Ye et.al.	2507.11153	null
2025-07-15	MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models	Seif Ahmed et.al.	2507.11114	null
2025-07-15	Tactical Decision for Multi-UGV Confrontation with a Vision-Language Model-Based Commander	Li Wang et.al.	2507.11079	null
2025-07-15	Bridge Feature Matching and Cross-Modal Alignment with Mutual-filtering for Zero-shot Anomaly Detection	Yuhu Bai et.al.	2507.11003	null
2025-07-14	EmbRACE-3K: Embodied Reasoning and Action in Complex Environments	Mingxian Lin et.al.	2507.10548	null
2025-07-14	CoralVQA: A Large-Scale Visual Question Answering Dataset for Coral Reef Image Understanding	Hongyong Han et.al.	2507.10449	null
2025-07-14	Beyond Graph Model: Reliable VLM Fine-Tuning via Random Graph Adapter	Bo Jiang et.al.	2507.10355	null
2025-07-14	Synthesizing Near-Boundary OOD Samples for Out-of-Distribution Detection	Jinglun Li et.al.	2507.10225	null
2025-07-14	BlueGlass: A Framework for Composite AI Safety	Harshal Nandigramwar et.al.	2507.10106	null
2025-07-14	Foundation Model Driven Robotics: A Comprehensive Review	Muhammad Tayyab Khan et.al.	2507.10087	null
2025-07-14	LayLens: Improving Deepfake Understanding through Simplified Explanations	Abhijeet Narang et.al.	2507.10066	null
2025-07-14	CoSMo: A Multimodal Transformer for Page Stream Segmentation in Comic Books	Marc Serra Ortega et.al.	2507.10053	null
2025-07-14	Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect	Tom Kouwenhoven et.al.	2507.10013	null
2025-07-14	Text-Driven Causal Representation Learning for Source-Free Domain Generalization	Lihua Zhou et.al.	2507.09961	null
2025-07-11	ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way	Rajarshi Roy et.al.	2507.08679	null
2025-07-11	Adaptive Framework for Ambient Intelligence in Rehabilitation Assistance	Gábor Baranyi et.al.	2507.08624	null
2025-07-11	Emergent Natural Language with Communication Games for Improving Image Captioning Capabilities without Additional Data	Parag Dutta et.al.	2507.08610	null
2025-07-11	BayesTTA: Continual-Temporal Test-Time Adaptation for Vision-Language Models via Gaussian Discriminant Analysis	Shuang Cui et.al.	2507.08607	null
2025-07-14	Efficient Deployment of Vision-Language Models on Mobile Devices: A Case Study on OnePlus 13R	Pablo Robin Guerrero et.al.	2507.08505	null
2025-07-11	LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning	Shibo Sun et.al.	2507.08496	null
2025-07-11	Multi-modal Mutual-Guidance Conditional Prompt Learning for Vision-Language Models	Shijun Yang et.al.	2507.08410	null
2025-07-11	Making VLMs More Robot-Friendly: Self-Critical Distillation of Low-Level Procedural Reasoning	Chan Young Park et.al.	2507.08224	null
2025-07-10	CLIP Won’t Learn Object-Attribute Binding from Natural Data and Here is Why	Bijay Gurung et.al.	2507.07985	null
2025-07-10	Scaling RL to Long Videos	Yukang Chen et.al.	2507.07966	null
2025-07-10	SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment	Guoxin Zang et.al.	2507.07939	null
2025-07-10	MoSE: Skill-by-Skill Mixture-of-Expert Learning for Autonomous Driving	Lu Xu et.al.	2507.07818	null
2025-07-10	Energy-Guided Decoding for Object Hallucination Mitigation	Xixi Liu et.al.	2507.07731	null
2025-07-10	One Object, Multiple Lies: A Benchmark for Cross-task Adversarial Attack on Unified Vision-Language Models	Jiale Zhao et.al.	2507.07709	null
2025-07-10	Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought	Shin’ya Yamaguchi et.al.	2507.07685	null
2025-07-11	ViLU: Learning Vision-Language Uncertainties for Failure Prediction	Marc Lafon et.al.	2507.07620	null
2025-07-10	LOSC: LiDAR Open-voc Segmentation Consolidator	Nermin Samet et.al.	2507.07605	null
2025-07-10	Beyond the Linear Separability Ceiling	Enrico Vompa et.al.	2507.07574	null
2025-07-09	4KAgent: Agentic Any Image to 4K Super-Resolution	Yushen Zuo et.al.	2507.07105	null
2025-07-09	Vision-Language-Vision Auto-Encoder: Scalable Knowledge Distillation from Diffusion Models	Tiezheng Zhang et.al.	2507.07104	null
2025-07-09	Evaluating Attribute Confusion in Fashion Text-to-Image Generation	Ziyue Liu et.al.	2507.07079	null
2025-07-09	Free on the Fly: Enhancing Flexibility in Test-Time Adaptation with Online EM	Qiyuan Dai et.al.	2507.06973	null
2025-07-09	CheXPO: Preference Optimization for Chest X-ray VLMs with Counterfactual Rationale	Xiao Liang et.al.	2507.06959	null
2025-07-09	VisualTrap: A Stealthy Backdoor Attack on GUI Agents via Visual Grounding Manipulation	Ziang Ye et.al.	2507.06899	null
2025-07-09	HVI-CIDNet+: Beyond Extreme Darkness for Low-Light Image Enhancement	Qingsen Yan et.al.	2507.06814	null
2025-07-09	Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu	Yan Hon Michael Chung et.al.	2507.06761	null
2025-07-09	Text-promptable Object Counting via Quantity Awareness Enhancement	Miaojing Shi et.al.	2507.06679	null
2025-07-09	Cross-Modal Dual-Causal Learning for Long-Term Action Recognition	Xu Shaowu et.al.	2507.06603	null
2025-07-08	CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions	Yuchen Huang et.al.	2507.06210	null
2025-07-08	Enhancing Scientific Visual Question Answering through Multimodal Reasoning and Ensemble Modeling	Prahitha Movva et.al.	2507.06183	null
2025-07-09	Skywork-R1V3 Technical Report	Wei Shen et.al.	2507.06167	null
2025-07-08	LangMamba: A Language-driven Mamba Framework for Low-dose CT Denoising with Vision-language Models	Zhihao Chen et.al.	2507.06140	null
2025-07-08	GeoMag: A Vision-Language Model for Pixel-level Fine-Grained Remote Sensing Image Parsing	Xianzhi Ma et.al.	2507.05887	null
2025-07-08	Bridging Perception and Language: A Systematic Benchmark for LVLMs’ Understanding of Amodal Completion Reports	Amane Watahiki et.al.	2507.05799	null
2025-07-08	SPADE: Spatial-Aware Denoising Network for Open-vocabulary Panoptic Scene Graph Generation with Long- and Local-range Context Reasoning	Xin Hu et.al.	2507.05798	null
2025-07-08	A Satellite-Ground Synergistic Large Vision-Language Model System for Earth Observation	Yuxin Zhang et.al.	2507.05731	null
2025-07-09	Integrated Structural Prompt Learning for Vision-Language Models	Jiahui Wang et.al.	2507.05677	null
2025-07-08	R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding	Joonhyung Park et.al.	2507.05673	null
2025-07-07	NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving	Qucheng Peng et.al.	2507.05227	null
2025-07-07	All in One: Visual-Description-Guided Unified Point Cloud Segmentation	Zongyan Han et.al.	2507.05211	null
2025-07-07	Differential Attention for Multimodal Crisis Event Analysis	Nusrat Munia et.al.	2507.05165	null
2025-07-07	INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling	Xin Dong et.al.	2507.05056	null
2025-07-07	Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision	Soham Walimbe et.al.	2507.05020	null
2025-07-07	From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach	Mihai Masala et.al.	2507.04815	null
2025-07-07	Training-free Generation of Temporally Consistent Rewards from VLMs	Yinuo Zhao et.al.	2507.04789	null
2025-07-07	Vision-Language Models Can’t See the Obvious	Yasser Dahou et.al.	2507.04741	null
2025-07-07	An analysis of vision-language models for fabric retrieval	Francesco Giuliari et.al.	2507.04735	null
2025-07-07	A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets	Zexi Jia et.al.	2507.04699	null
2025-07-03	DexVLG: Dexterous Vision-Language-Grasp Model at Scale	Jiawei He et.al.	2507.02747	null
2025-07-03	ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects	Qiaojun Yu et.al.	2507.02600	null
2025-07-02	cVLA: Towards Efficient Camera-Space VLAs	Max Argus et.al.	2507.02190	null
2025-07-02	Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges	Sanjeda Akter et.al.	2507.02074	null
2025-07-02	CI-VID: A Coherent Interleaved Text-Video Dataset	Yiming Ju et.al.	2507.01938	null
2025-07-02	How Do Vision-Language Models Process Conflicting Information Across Modalities?	Tianze Hua et.al.	2507.01790	null
2025-07-02	Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition	Muzammil Behzad et.al.	2507.01673	null
2025-07-02	MARVIS: Modality Adaptive Reasoning over VISualizations	Benjamin Feuer et.al.	2507.01544	null
2025-07-02	Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence	Robert Aufschläger et.al.	2507.01504	null
2025-07-02	BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments	Yibo Qiu et.al.	2507.01485	null
2025-07-03	TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control	Zhenyang Liu et.al.	2507.01424	null
2025-07-02	CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning	Kuniaki Saito et.al.	2507.01409	null
2025-07-02	Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model	Chaoxiang Cai et.al.	2507.01351	null
2025-07-02	VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process	Cristian Gariboldi et.al.	2507.01284	null
2025-06-30	DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World	Xiangtai Li et.al.	2506.24102	null
2025-06-30	EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations	Hyunjong Kim et.al.	2506.24016	null
2025-06-30	The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models	Lijun Sheng et.al.	2506.24000	null
2025-06-30	StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving	Ruiyang Hao et.al.	2506.23982	null
2025-06-30	GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models	Hamza Rasaee et.al.	2506.23903	null
2025-06-30	A Closer Look at Conditional Prompt Tuning for Vision-Language Models	Ji Zhang et.al.	2506.23856	null
2025-06-30	Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model	Shiming Chen et.al.	2506.23822	null
2025-06-30	Visual Textualization for Image Prompted Object Detection	Yongjian Wu et.al.	2506.23785	null
2025-06-30	PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?	Atharva Gundawar et.al.	2506.23725	null
2025-06-30	On the Domain Robustness of Contrastive Vision-Language Models	Mario Koddenbrock et.al.	2506.23663	null
2025-06-27	MiCo: Multi-image Contrast for Reinforcement Visual Reasoning	Xi Chen et.al.	2506.22434	null
2025-06-27	Test-Time Consistency in Vision Language Models	Shih-Han Chou et.al.	2506.22395	null
2025-06-27	Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation	Tiankai Chen et.al.	2506.22375	null
2025-06-27	Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment	Rui Xu et.al.	2506.22283	null
2025-06-27	COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication	Filippo Merlo et.al.	2506.22274	null
2025-06-27	Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs	Amirmohammad Izadi et.al.	2506.22146	null
2025-06-27	Universal Retrieval for Multimodal Trajectory Modeling	Xuan Zhang et.al.	2506.22056	null
2025-06-27	Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation	Jialei Chen et.al.	2506.22032	null
2025-06-27	SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding	Zhao Jin et.al.	2506.21924	null
2025-06-27	SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation	Adam Goodge et.al.	2506.21892	null
2025-06-26	Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration	Jiahe Chen et.al.	2506.21509	null
2025-06-26	Global and Local Entailment Learning for Natural World Imagery	Srikumar Sastry et.al.	2506.21476	null
2025-06-26	Spatial Mental Modeling from Limited Views	Baiqiao Yin et.al.	2506.21458	null
2025-06-27	ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models	Hongbo Liu et.al.	2506.21356	null
2025-06-26	LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning	Dewen Zhang et.al.	2506.21317	null
2025-06-26	DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images	Badri Vishal Kasuba et.al.	2506.21316	null
2025-06-26	World-aware Planning Narratives Enhance Large Vision-Language Model Planner	Junhao Shi et.al.	2506.21230	null
2025-06-26	Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion	Yuguang Zhang et.al.	2506.21144	null
2025-06-26	V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling	Junwei You et.al.	2506.21041	null
2025-06-26	Multimodal Prompt Alignment for Facial Expression Recognition	Fuyan Ma et.al.	2506.21017	null
2025-06-25	Shape2Animal: Creative Animal Generation from Natural Silhouettes	Quoc-Duy Tran et.al.	2506.20616	null
2025-06-25	HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction	Zhonghao Shi et.al.	2506.20566	null
2025-06-25	Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation	Changlu Guo et.al.	2506.20449	null
2025-06-25	CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition	Joerg Deigmoeller et.al.	2506.20373	null
2025-06-25	Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards	Jihao Gu et.al.	2506.20332	null
2025-06-25	Towards Efficient Exemplar Based Image Editing with Multimodal VLMs	Avadhoot Jadhav et.al.	2506.20155	null
2025-06-25	MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations	Vardhan Dongre et.al.	2506.20100	null
2025-06-24	Unified Vision-Language-Action Model	Yuqi Wang et.al.	2506.19850	null
2025-06-24	Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models	Johannes Rückert et.al.	2506.19825	null
2025-06-24	CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation	Hao Li et.al.	2506.19816	null
2025-06-24	UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation	Yue Zhou et.al.	2506.19694	null
2025-06-24	PEVLM: Parallel Encoding for Vision-Language Models	Letian Kang et.al.	2506.19651	null
2025-06-24	V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis	Yuan Wang et.al.	2506.19610	null
2025-06-24	ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP	Zhiyuan Wang et.al.	2506.19608	null
2025-06-24	Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects	Federico Tavella et.al.	2506.19579	null
2025-06-24	Visual hallucination detection in large vision-language models via evidential conflict	Tao Huang et.al.	2506.19513	null
2025-06-24	T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models	Yiteng Chen et.al.	2506.19498	null
2025-06-23	VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning	Xuanyu Zhang et.al.	2506.18564	null
2025-06-23	Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey	Xinyao Li et.al.	2506.18504	null
2025-06-23	InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models	Nianchen Deng et.al.	2506.18385	null
2025-06-23	Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review	Haoneng Lin et.al.	2506.18378	null
2025-06-23	Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations?	Yiwei Yang et.al.	2506.18322	null
2025-06-24	Referring Expression Instance Retrieval and A Strong End-to-End Baseline	Xiangzhao Hao et.al.	2506.18246	null
2025-06-23	Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning	Yue Li et.al.	2506.18234	null
2025-06-22	See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis	Ruinan Jin et.al.	2506.18140	null
2025-06-22	CLGRPO: Reasoning Ability Enhancement for Small VLMs	Fanyi Wang et.al.	2506.18048	null
2025-06-22	Adapting Vision-Language Models for Evaluating World Models	Mariya Hendriksen et.al.	2506.17967	null
2025-06-20	VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning	Zhangyang Qi et.al.	2506.17221	null
2025-06-20	Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens	Zeyuan Yang et.al.	2506.17218	link
2025-06-20	Do We Need Large VLMs for Spotting Soccer Actions?	Ritabrata Chakraborty et.al.	2506.17144	null
2025-06-20	Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments	Yasir Ali Farrukh et.al.	2506.16994	null
2025-06-20	AnyTraverse: An off-road traversability framework with VLM and human operator in the loop	Sattwik Sahu et.al.	2506.16826	null
2025-06-20	FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation	Fan Yang et.al.	2506.16806	null
2025-06-20	Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes	Chao Chen et.al.	2506.16805	null
2025-06-20	Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models	Lei Jiang et.al.	2506.16760	null
2025-06-20	Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly	Lance Ying et.al.	2506.16755	null
2025-06-20	TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion	Mingrui Zhu et.al.	2506.16730	null
2025-06-18	GenRecal: Generation after Recalibration from Large to Small Vision-Language Models	Byung-Kwan Lee et.al.	2506.15681	null
2025-06-18	Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning	Ankan Deria et.al.	2506.15649	null
2025-06-18	FindingDory: A Benchmark to Evaluate Memory in Embodied Agents	Karmesh Yadav et.al.	2506.15635	null
2025-06-18	WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts	Negar Foroutan et.al.	2506.15594	link
2025-06-18	DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement	Shaoqing Lin et.al.	2506.15583	link
2025-06-18	Context-Informed Grounding Supervision	Hyunji Lee et.al.	2506.15480	link
2025-06-19	OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models	Lanfeng Zhong et.al.	2506.15318	null
2025-06-18	MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering	Xinqi Fan et.al.	2506.15298	null
2025-06-18	Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models	Xuelin Shen et.al.	2506.15201	link
2025-06-18	ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections	Ziling Huang et.al.	2506.15180	null
2025-06-17	RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills	Chunru Lin et.al.	2506.14763	null
2025-06-17	Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models	Huihan Liu et.al.	2506.14727	null
2025-06-17	AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions	Aishan Liu et.al.	2506.14697	null
2025-06-17	Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models	Ling Li et.al.	2506.14674	null
2025-06-17	StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery	Jina Kim et.al.	2506.14670	null
2025-06-17	SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks	Zijian Song et.al.	2506.14512	null
2025-06-17	Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation?	Nitesh Subedi et.al.	2506.14507	link
2025-06-17	Adapting Lightweight Vision Language Models for Radiological Visual Question Answering	Aditya Shourya et.al.	2506.14451	link
2025-06-17	Causally Steered Diffusion for Automated Video Counterfactual Generation	Nikos Spyrou et.al.	2506.14404	link
2025-06-18	ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies	Jinyan Yuan et.al.	2506.14315	null
2025-06-16	Touch begins where vision ends: Generalizable policies for contact-rich manipulation	Zifan Zhao et.al.	2506.13762	null
2025-06-16	Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins	Chuanruo Ning et.al.	2506.13761	null
2025-06-16	OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning	Qiyu Xu et.al.	2506.13723	null
2025-06-16	ROSA: Harnessing Robot States for Vision-Language and Action Alignment	Yuqing Wen et.al.	2506.13679	null
2025-06-16	DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models	Zhiyi Shi et.al.	2506.13638	null
2025-06-16	VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation	Jiaming Chen et.al.	2506.13428	null
2025-06-16	Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation	Utkarsh Bajpai et.al.	2506.13367	null
2025-06-16	Anomaly Object Segmentation with Vision-Language Models for Steel Scrap Recycling	Daichi Tanaka et.al.	2506.13282	null
2025-06-16	AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining	Hongyuan Dong et.al.	2506.13274	null
2025-06-16	Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments	Xuan Wang et.al.	2506.13205	null
2025-06-13	Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale	Junha Lee et.al.	2506.12009	null
2025-06-13	How Visual Representations Map to Language Feature Space in Multimodal LLMs	Constantin Venhoff et.al.	2506.11976	null
2025-06-13	Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation	Xintong Wang et.al.	2506.11820	null
2025-06-13	MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space	Anshul Singh et.al.	2506.11684	null
2025-06-13	VLM@school – Evaluation of AI image understanding on German middle school knowledge	René Peinl et.al.	2506.11604	null
2025-06-13	EasyARC: Evaluating Vision Language Models on True Visual Reasoning	Mert Unsal et.al.	2506.11595	null
2025-06-13	Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis	Yuan Gao et.al.	2506.11526	link
2025-06-13	Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs	Xiao Xu et.al.	2506.11515	null
2025-06-13	Taming Stable Diffusion for Computed Tomography Blind Super-Resolution	Chunlei Li et.al.	2506.11496	null
2025-06-13	On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving	Pedram MohajerAnsari et.al.	2506.11472	null
2025-06-12	AIR: Zero-shot Generative Model Adaptation with Iterative Refinement	Guimeng Liu et.al.	2506.10895	link
2025-06-12	VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos	Jiashuo Yu et.al.	2506.10857	null
2025-06-13	RationalVLA: A Rational Vision-Language-Action Model with Dual System	Wenxuan Song et.al.	2506.10826	null
2025-06-12	Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding	Yuhang Zhang et.al.	2506.10756	null
2025-06-13	IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain	Hong Huang et.al.	2506.10730	link
2025-06-12	GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning	Xiaoyi Bao et.al.	2506.10639	null
2025-06-12	Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning	Chun-Mei Feng et.al.	2506.10575	null
2025-06-12	LLMs Are Not Yet Ready for Deepfake Image Detection	Shahroz Tariq et.al.	2506.10474	null
2025-06-12	UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models	Jun Yin et.al.	2506.10342	null
2025-06-12	Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions	Deliang Wang et.al.	2506.10334	null
2025-06-11	Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing	Junfei Wu et.al.	2506.09965	link
2025-06-11	From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models	Irving Fang et.al.	2506.09930	null
2025-06-11	3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation	Seonho Lee et.al.	2506.09883	link
2025-06-11	Adding simple structure at inference improves Vision-Language Compositionality	Imanol Miranda et.al.	2506.09691	link
2025-06-11	HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios	Kunyu Peng et.al.	2506.09650	link
2025-06-11	FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models	Weiying Zheng et.al.	2506.09638	null
2025-06-11	Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs	Beomsik Cho et.al.	2506.09522	link
2025-06-11	Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning	Cheng Chen et.al.	2506.09473	null
2025-06-11	TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision	Ayush Gupta et.al.	2506.09445	null
2025-06-11	DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt	Yitong Zhang et.al.	2506.09353	link
2025-06-10	VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning	Li Kang et.al.	2506.09049	null
2025-06-11	Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs	Yaniv Nikankin et.al.	2506.09047	link
2025-06-10	Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better	Dianyi Wang et.al.	2506.09040	link
2025-06-10	Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models	Chenyu Lian et.al.	2506.08990	link
2025-06-10	Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions	David Acuna et.al.	2506.08927	null
2025-06-11	Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought	Shuyi Zhang et.al.	2506.08817	null
2025-06-10	Multimodal Representation Alignment for Cross-modal Information Retrieval	Fan Xu et.al.	2506.08774	null
2025-06-10	PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly	Liang Ma et.al.	2506.08708	null
2025-06-10	VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism	Congzhi Zhang et.al.	2506.08691	link
2025-06-10	ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction	Juan Yeo et.al.	2506.08678	null
2025-06-09	Vision Transformers Don’t Need Trained Registers	Nick Jiang et.al.	2506.08010	link
2025-06-09	Hidden in plain sight: VLMs overlook their visual representations	Stephanie Fu et.al.	2506.08008	null
2025-06-09	BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models	Peiyan Li et.al.	2506.07961	null
2025-06-09	Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations	Yizhen Li et.al.	2506.07943	null
2025-06-09	Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models	Chengyue Huang et.al.	2506.07936	null
2025-06-09	SAM2Auto: Auto Annotation Using FLASH	Arash Rocky et.al.	2506.07850	null
2025-06-09	Image Reconstruction as a Tool for Feature Analysis	Eduard Allakhverdov et.al.	2506.07803	null
2025-06-09	Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger	Qi Yang et.al.	2506.07785	null
2025-06-09	Language-Vision Planner and Executor for Text-to-Visual Reasoning	Yichang Xu et.al.	2506.07778	null
2025-06-09	ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models	Jing Zhong et.al.	2506.07739	null
2025-06-06	CoMemo: LVLMs Need Image Context with Image Memory	Shi Liu et.al.	2506.06279	null
2025-06-06	ExAct: A Video-Language Benchmark for Expert Action Analysis	Han Yi et.al.	2506.06277	null
2025-06-06	Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding	Emmanouil Zaranis et.al.	2506.06275	null
2025-06-06	Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study	Leon Mayer et.al.	2506.06232	null
2025-06-06	GenIR: Generative Visual Feedback for Mental Image Retrieval	Diji Yang et.al.	2506.06220	null
2025-06-06	STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving	Christian Fruhwirth-Reisinger et.al.	2506.06218	link
2025-06-06	WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management	Bowen Yuan et.al.	2506.06084	null
2025-06-06	Full Conformal Adaptation of Medical Vision-Language Models	Julio Silva-Rodríguez et.al.	2506.06076	null
2025-06-06	BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning	Hongyi Zhou et.al.	2506.06072	null
2025-06-06	MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks	Zonglin Wu et.al.	2506.05982	link
2025-06-06	Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs	Haoyuan Li et.al.	2506.05318	null
2025-06-05	MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm	Zhang Li et.al.	2506.05218	link
2025-06-05	Quantifying Cross-Modality Memorization in Vision-Language Models	Yuxin Wen et.al.	2506.05198	null
2025-06-05	CIVET: Systematic Evaluation of Understanding in VLMs	Massimo Rizzoli et.al.	2506.05146	null
2025-06-05	PixCell: A generative foundation model for digital histopathology images	Srikar Yellapragada et.al.	2506.05127	null
2025-06-06	SeedEdit 3.0: Fast and High-Quality Generative Image Editing	Peng Wang et.al.	2506.05083	null
2025-06-05	A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions	Anh Le et.al.	2506.05061	null
2025-06-05	Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System	Haokun Liu et.al.	2506.05020	null
2025-06-05	ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT	Mikołaj Pokrywka et.al.	2506.04929	null
2025-06-05	Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms	Nurul Aisyah et.al.	2506.04822	null
2025-06-04	OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis	Junting Chen et.al.	2506.04217	link
2025-06-04	Image Editing As Programs with Diffusion Models	Yujia Hu et.al.	2506.04158	null
2025-06-04	Recent Advances in Medical Image Classification	Loan Dao et.al.	2506.04129	null
2025-06-04	LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward	Yi Zhao et.al.	2506.04070	link
2025-06-04	Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization	Jiulong Wu et.al.	2506.04039	null
2025-06-04	Vocabulary-free few-shot learning for Vision-Language Models	Maxime Zanella et.al.	2506.04005	null
2025-06-04	DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models	Jia Fu et.al.	2506.03933	null
2025-06-04	Zero-Shot Temporal Interaction Localization for Egocentric Videos	Erhang Zhang et.al.	2506.03662	link
2025-06-04	Spatial Understanding from Videos: Structured Prompts Meet Simulation Data	Haoyu Zhang et.al.	2506.03642	null
2025-06-04	VLMs Can Aggregate Scattered Training Patches	Zhanhui Zhou et.al.	2506.03614	link
2025-06-03	GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents	Qianhui Wu et.al.	2506.03143	null
2025-06-03	OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models	Mengdi Jia et.al.	2506.03135	null
2025-06-03	EgoVLM: Policy Optimization for Egocentric Video Understanding	Ashwin Vinod et.al.	2506.03097	link
2025-06-03	DPO Learning with LLMs-Judge Signal for Computer Use Agents	Man Luo et.al.	2506.03095	null
2025-06-03	From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit	Valérie Costa et.al.	2506.03093	null
2025-06-03	Text-guided Generation of Efficient Personalized Inspection Plans	Xingpeng Sun et.al.	2506.02917	null
2025-06-04	FlySearch: Exploring how vision-language models explore	Adam Pardyl et.al.	2506.02896	link
2025-06-03	Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights	Mathieu Andreux et.al.	2506.02865	null
2025-06-03	SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking	Sifan Li et.al.	2506.02803	null
2025-06-03	FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts	Tongyuan Bai et.al.	2506.02781	null
2025-05-30	ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL	Yu Zhang et.al.	2505.24875	null
2025-05-30	ProxyThinker: Test-Time Guidance through Small Visual Reasoners	Zilin Xiao et.al.	2505.24872	link
2025-05-30	GenSpace: Benchmarking Spatially-Aware Image Generation	Zehan Wang et.al.	2505.24870	null
2025-05-30	Time Blindness: Why Video-Language Models Can’t See What Humans Can?	Ujjwal Upadhyay et.al.	2505.24867	null
2025-05-30	Conformal Prediction for Zero-Shot Models	Julio Silva-Rodríguez et.al.	2505.24693	link
2025-05-30	BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models	Huu-Thien Tran et.al.	2505.24649	null
2025-05-30	SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition	Omer Nacar et.al.	2505.24600	null
2025-05-30	AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders	Yuqi Zhang et.al.	2505.24519	null
2025-05-30	un$^2$CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP	Yinqi Li et.al.	2505.24517	link
2025-05-30	CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation	Emilio Villa-Cueva et.al.	2505.24456	null
2025-05-29	ZeroGUI: Automating Online GUI Learning at Zero Human Cost	Chenyu Yang et.al.	2505.23762	link
2025-05-29	Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint	Heekyung Lee et.al.	2505.23759	link
2025-05-29	Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models	Haohan Chi et.al.	2505.23757	link
2025-05-29	To Trust Or Not To Trust Your Vision-Language Model’s Prediction	Hao Dong et.al.	2505.23745	link
2025-05-29	LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization	Ronghuan Wu et.al.	2505.23740	null
2025-05-29	Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better	Danny Driess et.al.	2505.23705	null
2025-05-29	CLDTracker: A Comprehensive Language Description for Visual Tracking	Mohamad Alansari et.al.	2505.23704	link
2025-05-29	Grounded Reinforcement Learning for Visual Reasoning	Gabriel Sarch et.al.	2505.23678	link
2025-05-29	Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition	Yu Li et.al.	2505.23566	link
2025-05-30	Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information	Xu Chu et.al.	2505.23558	link
2025-05-28	Zero-Shot Vision Encoder Grafting via LLM Surrogates	Kaiyu Yue et.al.	2505.22664	link
2025-05-28	Training Free Stylized Abstraction	Aimon Rahman et.al.	2505.22663	null
2025-05-28	VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models	Ce Zhang et.al.	2505.22654	null
2025-05-28	Sherlock: Self-Correcting Reasoning in Vision-Language Models	Yi Ding et.al.	2505.22651	null
2025-05-28	DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers	Navve Wasserman et.al.	2505.22584	null
2025-05-28	Hypothesis Testing in Imaging Inverse Problems	Yiming Xi et.al.	2505.22481	null
2025-05-28	Zero-Shot 3D Visual Grounding from Vision-Language Models	Rong Li et.al.	2505.22429	null
2025-05-28	IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth	Md Touhidul Islam et.al.	2505.22305	null
2025-05-28	Investigating Mechanisms for In-Context Vision Language Binding	Darshana Saravanan et.al.	2505.22200	null
2025-05-29	Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging	Runze Xia et.al.	2505.22150	null
2025-05-27	ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models	Dingming Li et.al.	2505.21500	null
2025-05-27	AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery	Haowei Wang et.al.	2505.21499	link
2025-05-27	Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers	Wei Pang et.al.	2505.21497	link
2025-05-27	Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration	Mehrdad Fazli et.al.	2505.21472	null
2025-05-27	ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models	Bozhou Li et.al.	2505.21465	null
2025-05-27	LazyVLM: Neuro-Symbolic Approach to Video Analytics	Xiangru Jian et.al.	2505.21459	null
2025-05-27	DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models	Nastaran Saadati et.al.	2505.21382	null
2025-05-27	XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration	Shaoqing Zhang et.al.	2505.21279	null
2025-05-27	CROP: Contextual Region-Oriented Visual Token Pruning	Jiawei Guo et.al.	2505.21233	null
2025-05-27	Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation	Zhengyang Ji et.al.	2505.21106	null
2025-05-26	Agentic 3D Scene Generation with Spatially Contextualized VLMs	Xinhang Liu et.al.	2505.20129	null
2025-05-26	MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models	Anh Thai et.al.	2505.20122	null
2025-05-27	EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition	Christoph Schuhmann et.al.	2505.20033	null
2025-05-26	ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers	Fotios Lygerakis et.al.	2505.20032	null
2025-05-26	Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models	Hyunsik Chae et.al.	2505.20021	null
2025-05-26	Can Visual Encoder Learn to See Arrows?	Naoyuki Terashita et.al.	2505.19944	null
2025-05-26	Attention! Your Vision Language Model Could Be Maliciously Manipulated	Xiaosen Wang et.al.	2505.19911	null
2025-05-26	EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM	Shuang Ao et.al.	2505.19905	null
2025-05-26	Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement	Afrah Shaahid et.al.	2505.19895	null
2025-05-26	Two Causally Related Needles in a Video Haystack	Miaoyu Li et.al.	2505.19853	null
2025-05-23	VideoGameBench: Can Vision-Language Models complete popular video games?	Alex L. Zhang et.al.	2505.18134	null
2025-05-23	One RL to See Them All: Visual Triple Unified Reinforcement Learning	Yan Ma et.al.	2505.18129	null
2025-05-23	CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays	Hyungyung Lee et.al.	2505.18087	link
2025-05-23	FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation	Zherui Zhang et.al.	2505.18053	null
2025-05-23	Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation	Li Zhong et.al.	2505.18039	null
2025-05-23	Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling	Bryan Wong et.al.	2505.17982	null
2025-05-23	Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities	Ziwei Zhou et.al.	2505.17862	link
2025-05-23	VLM Models and Automated Grading of Atopic Dermatitis	Marc Lalonde et.al.	2505.17835	null
2025-05-23	Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations	Boxu Chen et.al.	2505.17812	null
2025-05-23	U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding	Anjie Le et.al.	2505.17779	null
2025-05-23	Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models	Jiaqi Wang et.al.	2505.16854	link
2025-05-23	LaViDa: A Large Diffusion Language Model for Multimodal Understanding	Shufan Li et.al.	2505.16839	link
2025-05-22	From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization	Haonian Ji et.al.	2505.16832	link
2025-05-22	Perceptual Quality Assessment for Embodied AI	Chunyi Li et.al.	2505.16815	link
2025-05-22	SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving	Xuesong Chen et.al.	2505.16805	null
2025-05-22	REOBench: Benchmarking Robustness of Earth Observation Foundation Models	Xiang Li et.al.	2505.16793	link
2025-05-22	Single Domain Generalization for Few-Shot Counting via Universal Representation Matching	Xianing Chen et.al.	2505.16778	link
2025-05-22	IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models	Yiming Gao et.al.	2505.16774	link
2025-05-22	Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation	Hongji Yang et.al.	2505.16763	null
2025-05-22	Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval	Hailong Ning et.al.	2505.16756	null
2025-05-21	InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition	Yijie Zheng et.al.	2505.15818	link
2025-05-21	From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems	Xiuchao Sui et.al.	2505.15685	link
2025-05-21	FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models	Zhen Sun et.al.	2505.15644	link
2025-05-21	Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models	Xin Huang et.al.	2505.15576	link
2025-05-21	TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving	Hossein Hassani et.al.	2505.15564	null
2025-05-21	Clapper: Compact Learning and Video Representation in VLMs	Lingyu Kong et.al.	2505.15529	null
2025-05-21	Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets	Kaiyuan Chen et.al.	2505.15517	null
2025-05-21	Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought	Zihui Cheng et.al.	2505.15510	null
2025-05-21	Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts	Debarshi Brahma et.al.	2505.15506	link
2025-05-21	Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification	Conghao Xiong et.al.	2505.15504	null
2025-05-20	Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning	Jiaer Xia et.al.	2505.14677	null
2025-05-20	UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens	Ruichuan An et.al.	2505.14671	link
2025-05-20	EmoGist: Efficient In-Context Learning for Visual Emotion Understanding	Ronald Seoh et.al.	2505.14660	null
2025-05-20	CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation	Anna C. Doris et.al.	2505.14646	link
2025-05-20	Debating for Better Reasoning: An Unsupervised Multimodal Approach	Ashutosh Adhikari et.al.	2505.14627	null
2025-05-20	KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation	Jiajun Shi et.al.	2505.14552	link
2025-05-20	PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models	He Zhu et.al.	2505.14481	null
2025-05-20	RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding	Jiaang Li et.al.	2505.14462	link
2025-05-20	SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation	Yuyang Dong et.al.	2505.14381	null
2025-05-20	Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds	Joel Currie et.al.	2505.14366	null
2025-05-19	ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models	Liyan Tang et.al.	2505.13444	null
2025-05-19	G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning	Liang Chen et.al.	2505.13426	link
2025-05-19	Seeing, Saying, Solving: An LLM-to-TL Framework for Cooperative Robots	Dan BW Choe et.al.	2505.13376	null
2025-05-20	Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning	Song-Lin Li et.al.	2505.13317	null
2025-05-19	I’ll believe it when I see it: Images increase misinformation sharing in Vision-Language Models	Alice Plebe et.al.	2505.13302	link
2025-05-19	Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts	Zekun Wang et.al.	2505.13281	null
2025-05-19	From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection	Lincan Cai et.al.	2505.13233	link
2025-05-19	ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models	Matteo Merler et.al.	2505.13180	link
2025-05-19	Just Dance with $\pi$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection	Snehashis Majhi et.al.	2505.13123	null
2025-05-19	Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model	Yong Ren et.al.	2505.13062	null
2025-05-16	Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner	Wenchuan Zhang et.al.	2505.11404	link
2025-05-16	Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation	Zihan Wang et.al.	2505.11383	link
2025-05-16	Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild	Derek Ming Siang Tan et.al.	2505.11350	null
2025-05-16	Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models	Keunwoo Peter Yu et.al.	2505.11326	link
2025-05-16	Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation	Donghoon Lee et.al.	2505.11221	link
2025-05-19	FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining	Myunsoo Kim et.al.	2505.11192	null
2025-05-16	Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing	Mathis Jürgen Adler et.al.	2505.11121	null
2025-05-16	CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs	David Méndez et.al.	2505.11060	link
2025-05-16	GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning	Yue Liu et.al.	2505.11049	link
2025-05-16	Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere	Li Ju et.al.	2505.11029	null
2025-05-15	MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models	Mugilan Ganesan et.al.	2505.10526	null
2025-05-15	AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges	Ranjan Sapkota et.al.	2505.10468	null
2025-05-15	Vision language models have difficulty recognizing virtual objects	Tyler Tran et.al.	2505.10453	null
2025-05-15	Coherent Language Reconstruction from Brain Recordings with Flexible Multi-Modal Input Stimuli	Chunyu Ye et.al.	2505.10356	null
2025-05-15	MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models	Yuncheng Guo et.al.	2505.10088	link
2025-05-15	AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection	Bin-Bin Gao et.al.	2505.09926	link
2025-05-14	Unfettered Forceful Skill Acquisition with Physical Reasoning and Coordinate Frame Labeling	William Xie et.al.	2505.09731	null
2025-05-14	ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation	Enyu Zhao et.al.	2505.09698	null
2025-05-14	Variational Visual Question Answering	Tobias Jan Wieczorek et.al.	2505.09591	null
2025-05-14	VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation	Chaofan Zhang et.al.	2505.09577	null
2025-05-14	Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput	Bo Zhang et.al.	2505.09498	null
2025-05-14	MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment	Siyuan Yan et.al.	2505.09372	link
2025-05-14	Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition	Muzammil Behzad et.al.	2505.09336	null
2025-05-14	MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning	Bin-Bin Gao et.al.	2505.09265	null
2025-05-14	Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models	Lucas Choi et.al.	2505.09139	null
2025-05-14	Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning	Dayong Liang et.al.	2505.09118	null
2025-05-14	OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions	Yuhang Wang et.al.	2505.09092	link
2025-05-13	Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training	Yangyi Chen et.al.	2505.08971	link
2025-05-13	Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving	Zongchuang Zhao et.al.	2505.08725	link
2025-05-13	OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning	Zhaochen Su et.al.	2505.08617	link
2025-05-13	From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation	Yifu Yuan et.al.	2505.08548	link
2025-05-13	Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning?	Md Tahmid Rahman Laskar et.al.	2505.08468	link
2025-05-13	MA-ROESL: Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos	Xianghui Wang et.al.	2505.08367	null
2025-05-13	Removing Watermarks with Partial Regeneration using Semantic Information	Krti Tallam et.al.	2505.08234	link
2025-05-13	CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding	Wenxuan Ma et.al.	2505.08194	null
2025-05-13	DSADF: Thinking Fast and Slow for Decision Making	Alex Zhihao Dou et.al.	2505.08189	null
2025-05-12	Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models	Seungjae Lee et.al.	2505.07815	null
2025-05-12	Reproducibility, Replicability, and Insights into Visual Document Retrieval with Late Interaction	Jingfen Qiao et.al.	2505.07730	link
2025-05-12	Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images	Elisei Rykov et.al.	2505.07704	null
2025-05-12	Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models	Songlin Dong et.al.	2505.07690	null
2025-05-12	Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization	Seongjae Kang et.al.	2505.07675	link
2025-05-12	Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding	Oriol Barbany et.al.	2505.07600	null
2025-05-12	Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning	Bohan Wang et.al.	2505.07538	null
2025-05-12	QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads	Khurram Mazher et.al.	2505.07531	null
2025-05-12	AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography	Jiewen Yang et.al.	2505.07347	null
2025-05-12	Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning	Xiaokun Wang et.al.	2505.07263	null
2025-05-09	MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks	Wenqi Zeng et.al.	2505.06152	link
2025-05-09	TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations	Shuaiyi Huang et.al.	2505.06079	null
2025-05-09	Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI	Benjamin Raphael Ernhofer et.al.	2505.05895	link
2025-05-09	Describe Anything in Medical Images	Xi Xiao et.al.	2505.05804	null
2025-05-09	3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks	Vineet Bhat et.al.	2505.05800	null
2025-05-08	Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos	Giulio Cesare Mastrocinque Santo et.al.	2505.05681	null
2025-05-08	VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models	Faizan Farooq Khan et.al.	2505.05635	null
2025-05-08	Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments	Pranav Guruprasad et.al.	2505.05540	link
2025-05-08	Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging	Shiqi Chen et.al.	2505.05464	link
2025-05-08	SITE: towards Spatial Intelligence Thorough Evaluation	Wenqi Wang et.al.	2505.05456	null
2025-05-08	DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning	Wenru Liu et.al.	2505.05360	null
2025-05-08	Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization	Sooyoung Park et.al.	2505.05343	link
2025-05-08	Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects	Agnese Chiatti et.al.	2505.05318	null
2025-05-08	Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models	Wei Peng et.al.	2505.05189	link
2025-05-08	OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning	Cong Hua et.al.	2505.05180	link
2025-05-08	Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models	Aishwarya Venkataramanan et.al.	2505.05163	link
2025-05-08	CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models	Mengjun Yi et.al.	2505.05130	null
2025-05-08	X-Driver: Explainable Autonomous Driving with Vision-Language Models	Wei Liu et.al.	2505.05098	null
2025-05-07	“I Can See Forever!”: Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments	Ziyi Zhang et.al.	2505.04488	null
2025-05-07	DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception	Junjie Wang et.al.	2505.04410	link
2025-05-07	CM1 – A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models	Fabian Wolf et.al.	2505.04214	null
2025-05-07	R^3-VQA: “Read the Room” by Video Social Reasoning	Lixing Niu et.al.	2505.04147	null
2025-05-06	X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains	Qianchu Liu et.al.	2505.03981	null
2025-05-06	Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning	François Role et.al.	2505.03703	null
2025-05-06	Distribution-Conditional Generation: From Class Distribution to Creative Generation	Fu Feng et.al.	2505.03667	null
2025-05-06	Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images	Fangling Jiang et.al.	2505.03611	null
2025-05-06	Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection	Fangling Jiang et.al.	2505.03610	null
2025-05-06	LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs	Xinyuan Zhang et.al.	2505.03460	null
2025-05-06	Mitigating Image Captioning Hallucinations in Vision-Language Models	Fei Zhao et.al.	2505.03420	null
2025-05-07	Enhancing Target-unspecific Tasks through a Features Matrix	Fangming Cui et.al.	2505.03414	null
2025-05-06	Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models	Abram Schonfeldt et.al.	2505.03374	null
2025-05-06	A Vision-Language Model for Focal Liver Lesion Classification	Song Jian et.al.	2505.03350	null
2025-05-06	From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection	Guoting Wei et.al.	2505.03334	null
2025-05-05	LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery	Jerome Quenum et.al.	2505.02829	null
2025-05-05	HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction	Muhammad Haris Khan et.al.	2505.02569	null
2025-05-05	Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality	Xueguang Ma et.al.	2505.02466	link
2025-05-05	Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey	Chaohua Li et.al.	2505.02448	null
2025-05-05	SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing	Ming Li et.al.	2505.02370	link
2025-05-05	TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment	Zhichuan Wang et.al.	2505.02325	link
2025-05-04	Compositional Image-Text Matching and Retrieval by Grounding Entities	Madhukar Reddy Vongala et.al.	2505.02278	null
2025-05-04	Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin	Yuchen Wang et.al.	2505.02056	null
2025-05-04	A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models	Liqiang Jing et.al.	2505.01958	null
2025-05-03	PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications	Trisanth Srinivasan et.al.	2505.01881	null
2025-05-02	Dynamic Robot Tool Use with Vision Language Models	Noah Trupin et.al.	2505.01399	null
2025-05-02	Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages	Marco Salmè et.al.	2505.01096	null
2025-05-02	Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation	Daniele Molino et.al.	2505.01091	null
2025-05-02	Transferable Adversarial Attacks on Black-Box Vision-Language Models	Kai Hu et.al.	2505.01050	null
2025-05-01	Robotic Visual Instruction	Yanbang Li et.al.	2505.00693	null
2025-05-01	Visual Test-time Scaling for GUI Agent Grounding	Tiange Luo et.al.	2505.00684	link
2025-05-01	DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation	Zixuan Chen et.al.	2505.00527	null
2025-05-01	LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving	Zhijie Qiao et.al.	2505.00284	link
2025-05-01	AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care	Md Asaduzzaman Jabin et.al.	2505.00275	link
2025-05-02	Empowering Agentic Video Analytics Systems with Video Language Models	Yuxuan Yan et.al.	2505.00254	null
2025-04-30	V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving	Jannik Lübberstedt et.al.	2505.00156	null
2025-04-30	Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models	Minh-Hao Van et.al.	2505.00150	null
2025-04-30	Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design	Vasudev Sharma et.al.	2505.00134	null
2025-04-30	Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization	Anas Anwarul Haq Khan et.al.	2504.21831	null
2025-04-30	Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models	Sangmin Woo et.al.	2504.21559	null
2025-04-30	RoboGround: Robotic Manipulation with Grounded Vision-Language Priors	Haifeng Huang et.al.	2504.21530	null
2025-04-30	Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection	Luoting Zhuang et.al.	2504.21344	null
2025-04-30	An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images	Modesto Castrillón-Santana et.al.	2504.21309	null
2025-04-30	Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA	Xuanzhao Dong et.al.	2504.21252	link
2025-04-29	MemeBLIP2: A novel lightweight multimodal system to detect harmful memes	Jiaqi Liu et.al.	2504.21226	null
2025-04-29	GLIP-OOD: Zero-Shot Graph OOD Detection with Foundation Model	Haoyan Xu et.al.	2504.21186	null
2025-04-29	Real-Time Wayfinding Assistant for Blind and Low-Vision Users	Dabbrata Das et.al.	2504.20976	null
2025-04-29	FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models	Mainak Singha et.al.	2504.20860	null
2025-04-29	In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer	Zechuan Zhang et.al.	2504.20690	null
2025-04-29	SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data	Michael Ogezi et.al.	2504.20648	null
2025-04-29	PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations	Haowen Sun et.al.	2504.20520	null
2025-04-29	Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception	Yuanchen Wu et.al.	2504.20468	link
2025-04-29	Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks	Konstantinos I. Roumeliotis et.al.	2504.20419	null
2025-04-29	FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding	Yanan Guo et.al.	2504.20384	null
2025-04-28	mrCAD: Multimodal Refinement of Computer-aided Designs	William P. McCarthy et.al.	2504.20294	link
2025-04-28	A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports	Henning Schäfer et.al.	2504.20220	null
2025-04-28	SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning	Wufei Ma et.al.	2504.20024	null
2025-04-29	If Concept Bottlenecks are the Question, are Foundation Models the Answer?	Nicola Debole et.al.	2504.19774	link
2025-04-28	EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia	Valerie Zermatten et.al.	2504.19742	null
2025-04-28	Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model	Muzammil Behzad et.al.	2504.19739	null
2025-04-28	VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning	Run Luo et.al.	2504.19627	null
2025-04-28	LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning	Peijian Zeng et.al.	2504.19524	null
2025-04-27	DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning	Jialang Lu et.al.	2504.19127	null
2025-04-27	Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction	Xiaoran Xu et.al.	2504.19086	null
2025-04-26	Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation	Shahad Albastaki et.al.	2504.18856	link
2025-04-26	Video CLIP Model for Multi-View Echocardiography Interpretation	Ryo Takizawa et.al.	2504.18800	null
2025-04-25	Generalization Capability for Imitation Learning	Yixiao Wang et.al.	2504.18538	null
2025-04-25	Fast-Slow Thinking for Large Vision-Language Model Reasoning	Wenyi Xiao et.al.	2504.18458	null
2025-04-25	Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation	Peiyuan Jing et.al.	2504.18453	null
2025-04-25	HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?	Yusen Zhang et.al.	2504.18406	null
2025-04-25	Revisiting Data Auditing in Large Vision-Language Models	Hongyu Zhu et.al.	2504.18349	null
2025-04-25	A Large Vision-Language Model based Environment Perception System for Visually Impaired People	Zezhou Chen et.al.	2504.18027	null
2025-04-24	CAMU: Context Augmentation for Meme Understanding	Girish A. Koushik et.al.	2504.17902	null
2025-04-25	Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction	Yuanchang Ye et.al.	2504.17671	null
2025-04-24	SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting	Yiming Zhao et.al.	2504.17395	null
2025-04-24	M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction	Chengguang Gan et.al.	2504.17353	null
2025-04-24	DIMT25@ICDAR2025: HW-TSC’s End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model	Zhanglin Wu et.al.	2504.17315	null
2025-04-24	Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning	Lynn Cherif et.al.	2504.17282	null
2025-04-24	Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation	Phillip Y. Lee et.al.	2504.17207	null
2025-04-23	Distilling semantically aware orders for autoregressive image generation	Rishav Pramanik et.al.	2504.17069	null
2025-04-23	DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs	Zhenhailong Wang et.al.	2504.17040	null
2025-04-24	V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations	Zhiyuan Fan et.al.	2504.16727	null
2025-04-23	Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes	Joan Perez et.al.	2504.16538	null
2025-04-23	TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance	Meng Chu et.al.	2504.16505	null
2025-04-23	FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing	Hariseetharam Gunduboina et.al.	2504.16433	null
2025-04-22	CLIP-IT: CLIP-based Pairing for Histology Images Classification	Banafsheh Karimian et.al.	2504.16181	link
2025-04-22	MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention	Yucheng Li et.al.	2504.16083	null
2025-04-22	MR. Video: “MapReduce” is the Principle for Long Video Understanding	Ziqi Pang et.al.	2504.16082	null
2025-04-22	Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation	Zhiyuan Hu et.al.	2504.16073	null
2025-04-22	Describe Anything: Detailed Localized Image and Video Captioning	Long Lian et.al.	2504.16072	null
2025-04-22	Vision language models are unreliable at trivial spatial cognition	Sangeet Khemlani et.al.	2504.16061	null
2025-04-22	Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation	Ziqiao Ma et.al.	2504.16060	link
2025-04-22	Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis	Frank Li et.al.	2504.16047	null
2025-04-22	LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale	Joya Chen et.al.	2504.16030	null
2025-04-22	Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models	Saban Ozturk et.al.	2504.15929	null
2025-04-22	VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining	Zizhi Chen et.al.	2504.15545	null
2025-04-21	Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models	Guo Chen et.al.	2504.15271	null
2025-04-21	KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking	Juyeon Kim et.al.	2504.15135	link
2025-04-21	Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation	Hong-Tao Yu et.al.	2504.14988	link
2025-04-21	VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform	Xingyu Lu et.al.	2504.14904	null
2025-04-21	Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation	Yunpu Zhao et.al.	2504.14848	null
2025-04-20	OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding	Songtao Jiang et.al.	2504.14692	null
2025-04-20	NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation	Junyuan Fang et.al.	2504.14638	null
2025-04-20	LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation	Jiachen Li et.al.	2504.14467	null
2025-04-20	Neglected Risks: The Disturbing Reality of Children’s Images in Datasets and the Urgent Call for Accountability	Carlos Caetano et.al.	2504.14446	null
2025-04-19	Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models	Chung-En et.al.	2504.14395	null
2025-04-18	Analysing the Robustness of Vision-Language-Models to Common Corruptions	Muhammad Usama et.al.	2504.13690	null
2025-04-18	EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model	Sijing Li et.al.	2504.13650	link
2025-04-18	PV-VLM: A Multimodal Vision-Language Approach Incorporating Sky Images for Intra-Hour Photovoltaic Power Forecasting	Huapeng Lin et.al.	2504.13624	null
2025-04-18	Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization	Hongwei Ji et.al.	2504.13460	null
2025-04-18	Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety	Shashank Shriram et.al.	2504.13399	link
2025-04-17	VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture	Long Li et.al.	2504.13365	null
2025-04-17	Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models	Chen Wang et.al.	2504.13351	null
2025-04-17	PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding	Jang Hyun Cho et.al.	2504.13180	link
2025-04-17	Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling	Tsung-Han Wu et.al.	2504.13169	link
2025-04-17	$\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark	Siwei Yang et.al.	2504.13143	null
2025-04-17	Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training	Xinsong Zhang et.al.	2504.13123	null
2025-04-17	Probing and Inducing Combinational Creativity in Vision-Language Models	Yongqian Peng et.al.	2504.13120	null
2025-04-17	Object-Driven Narrative in AR: A Scenario-Metaphor Framework with VLM Integration	Yusi Sun et.al.	2504.13119	null
2025-04-17	Early Accessibility: Automating Alt-Text Generation for UI Icons During App Development	Sabrina Haque et.al.	2504.13069	null
2025-04-17	NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation	Xiangyan Liu et.al.	2504.13055	null
2025-04-17	Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning	Baining Zhao et.al.	2504.12680	link
2025-04-17	VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization	Menglan Chen et.al.	2504.12661	null
2025-04-16	FLIP Reasoning Challenge	Andreas Plesner et.al.	2504.12256	link
2025-04-16	Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models -	Laura Fieback et.al.	2504.12137	null
2025-04-17	Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions	Yifei Dong et.al.	2504.11967	null
2025-04-16	Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning	Hairui Ren et.al.	2504.11930	link
2025-04-16	AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection	Yuhao Chao et.al.	2504.11914	null
2025-04-16	A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification	Bianca Lamm et.al.	2504.11838	null
2025-04-17	DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment	Li Yu et.al.	2504.11733	null
2025-04-16	Interpreting the Linear Structure of Vision-language Model Embedding Spaces	Isabel Papadimitriou et.al.	2504.11695	null
2025-04-16	VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps	Biniam Fisseha Demissie et.al.	2504.11675	null
2025-04-15	Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation	Amirhossein Dadashzadeh et.al.	2504.11669	null
2025-04-15	MINDS: The very low-mass star and brown dwarf sample – Hidden water in carbon-dominated protoplanetary disks	Aditya M. Arabhavi et.al.	2504.11425	null
2025-04-15	From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation	Jingkun Chen et.al.	2504.11368	null
2025-04-16	Seedream 3.0 Technical Report	Yu Gao et.al.	2504.11346	null
2025-04-16	UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis	Xinyi Liu et.al.	2504.11257	null
2025-04-15	R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning	Lijun Sheng et.al.	2504.11195	link
2025-04-15	Benchmarking Vision Language Models on German Factual Data	René Peinl et.al.	2504.11108	null
2025-04-16	Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR	Yulong Zhang et.al.	2504.11101	null
2025-04-15	QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models	Yudong Zhang et.al.	2504.11038	link
2025-04-15	MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique	Shuhang Liu et.al.	2504.11009	null
2025-04-15	Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles	Tonko E. W. Bossen et.al.	2504.10873	null
2025-04-14	Art3D: Training-Free 3D Generation from Flat-Colored Illustration	Xiaoyan Cong et.al.	2504.10466	null
2025-04-14	Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding	Tao Zhang et.al.	2504.10465	link
2025-04-15	GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents	Xiaobo Xia et.al.	2504.10458	null
2025-04-15	Efficient Prompt Tuning for Hierarchical Ingredient Recognition	Yinxuan Gui et.al.	2504.10322	null
2025-04-14	SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model	Zongcan Ding et.al.	2504.10320	null
2025-04-15	Breaking the Data Barrier – Building GUI Agents Through Task Generalization	Junlei Zhang et.al.	2504.10127	link
2025-04-14	AGO: Adaptive Grounding for Open World 3D Occupancy Prediction	Peizheng Li et.al.	2504.10117	null
2025-04-14	CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography	I-Sheng Fang et.al.	2504.10090	null
2025-04-14	Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure	Théo Gigant et.al.	2504.10049	null
2025-04-14	Aligning Anime Video Generation with Human Feedback	Bingwen Zhu et.al.	2504.10044	null
2025-04-11	Generating Fine Details of Entity Interactions	Xinyi Gu et.al.	2504.08714	null
2025-04-11	AstroLLaVA: towards the unification of astronomical data and natural language	Sharaf Zaman et.al.	2504.08583	null
2025-04-11	FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations	Cheng-Yu Hsieh et.al.	2504.08368	null
2025-04-11	DSM: Building A Diverse Semantic Map for 3D Visual Grounding	Qinghongbing Xie et.al.	2504.08307	null
2025-04-11	EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models	Minjae Seo et.al.	2504.08205	null
2025-04-10	Investigating Vision-Language Model for Point Cloud-based Vehicle Classification	Yiqiao Li et.al.	2504.08154	null
2025-04-10	The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search	Yutaro Yamada et.al.	2504.08066	link
2025-04-10	VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning	Yukun Qi et.al.	2504.07956	null
2025-04-10	SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement	Xiyao Wang et.al.	2504.07934	link
2025-04-10	SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos	Joshua Li et.al.	2504.07867	null
2025-04-10	CollEX – A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections	Florian Schneider et.al.	2504.07643	null
2025-04-10	VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model	Haozhan Shen et.al.	2504.07615	link
2025-04-10	TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs	Zijian Zhang et.al.	2504.07556	null
2025-04-10	Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models	Yuxiang Lin et.al.	2504.07521	link
2025-04-10	Kimi-VL Technical Report	Kimi Team et.al.	2504.07491	link
2025-04-09	Perception in Reflection	Yana Wei et.al.	2504.07165	null
2025-04-09	Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation	Israfel Salazar et.al.	2504.07072	null
2025-04-09	Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition	Sergio Romero-Tapiador et.al.	2504.06925	null
2025-04-09	MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking	Chang Nie et.al.	2504.06863	null
2025-04-09	ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models	Seonghwan Park et.al.	2504.06838	null
2025-04-09	LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding	Ziyi Wang et.al.	2504.06835	null
2025-04-08	PromptHMR: Promptable Human Mesh Recovery	Yufu Wang et.al.	2504.06397	null
2025-04-08	SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation	Hritam Basak et.al.	2504.06389	null
2025-04-08	OmniSVG: A Unified Scalable Vector Graphics Generation Model	Yiying Yang et.al.	2504.06263	null
2025-04-08	Latent Multimodal Reconstruction for Misinformation Detection	Stefanos-Iordanis Papadopoulos et.al.	2504.06010	link
2025-04-08	econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians	Can Zhang et.al.	2504.06003	null
2025-04-08	Measuring Déjà vu Memorization Efficiently	Narine Kokhlikyan et.al.	2504.05651	null
2025-04-08	A Lightweight Large Vision-language Model for Multimodal Medical Images	Belal Alsinglawi et.al.	2504.05575	null
2025-04-07	ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering	Ahmed Masry et.al.	2504.05506	link
2025-04-07	Trust Through Transparency: Explainable Social Navigation for Autonomous Mobile Robots via Vision-Language Models	Oluwadamilola Sotomi et.al.	2504.05477	null
2025-04-07	REVEAL: Relation-based Video Representation Learning for Video-Question-Answering	Sofian Chaybouti et.al.	2504.05463	null
2025-04-07	Taxonomy-Aware Evaluation of Vision-Language Models	Vésteinn Snæbjarnarson et.al.	2504.05457	null
2025-04-07	Probing the Visualization Literacy of Vision Language Models: the Good, the Bad, and the Ugly	Lianghan Dong et.al.	2504.05445	null
2025-04-07	InteractVLM: 3D Interaction Reasoning from 2D Foundational Models	Sai Kumar Dwivedi et.al.	2504.05303	link
2025-04-07	SmolVLM: Redefining small and efficient multimodal models	Andrés Marafioti et.al.	2504.05299	null
2025-04-07	A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text?	Julio Silva-Rodríguez et.al.	2504.05227	null
2025-04-07	Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation	Jiaming Chen et.al.	2504.05225	link
2025-04-07	EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively	Bingyang Wang et.al.	2504.05141	null
2025-04-07	RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model	Congcong Wen et.al.	2504.04988	null
2025-04-08	A Taxonomy of Self-Handover	Naoki Wake et.al.	2504.04939	null
2025-04-07	SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models	Justus Westerhoff et.al.	2504.04893	link
2025-04-07	Don’t Lag, RAG: Training-Free Adversarial Detection Using RAG	Roie Kazoom et.al.	2504.04858	null
2025-04-07	OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance	Chaoyi Wang et.al.	2504.04781	null
2025-04-04	Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models	Mirko Borszukovszki et.al.	2504.03440	null
2025-04-04	SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding	Yimin Wei et.al.	2504.03254	link
2025-04-04	Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators	Linfeng Zhao et.al.	2504.03245	null
2025-04-04	Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation	Xin Zhang et.al.	2504.03193	link
2025-04-04	REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval	Shabnam Choudhury et.al.	2504.03169	null
2025-04-07	NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving	Kexin Tian et.al.	2504.03164	null
2025-04-04	TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference	Junshan Hu et.al.	2504.03154	null
2025-04-04	MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories	Natalie Tirabassi et.al.	2504.03153	null
2025-04-03	QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding	Binh M. Le et.al.	2504.02971	null
2025-04-03	STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection	Divya Velayudhan et.al.	2504.02823	null
2025-04-03	Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models	Mateusz Pach et.al.	2504.02821	link
2025-04-03	Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence	Anita Rau et.al.	2504.02799	null
2025-04-03	Robot-Led Vision Language Model Wellbeing Assessment of Children	Nida Itrat Abbasi et.al.	2504.02765	null
2025-04-04	Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme	Yan Ma et.al.	2504.02587	link
2025-04-03	Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision	Xiaofeng Han et.al.	2504.02477	null
2025-04-03	Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation	Chuanqi Cheng et.al.	2504.02438	link
2025-04-03	ReuseDroid: A VLM-empowered Android UI Test Migrator Boosted by Active Feedback	Xiaolei Li et.al.	2504.02357	null
2025-04-03	Large (Vision) Language Models are Unsupervised In-Context Learners	Artyom Gadetsky et.al.	2504.02349	link
2025-04-03	Re-thinking Temporal Search for Long-Form Video Understanding	Jinhui Ye et.al.	2504.02259	link
2025-04-02	FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs	Mothilal Asokan et.al.	2504.01916	link
2025-04-02	Is Temporal Prompting All We Need For Limited Labeled Action Recognition?	Shreyank N Gowda et.al.	2504.01890	null
2025-04-02	Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images	Nusrat Munia et.al.	2504.01838	link
2025-04-02	BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing	Yunqi Gu et.al.	2504.01786	link
2025-04-02	AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization	Chaohu Liu et.al.	2504.01735	null
2025-04-02	Reasoning LLMs for User-Aware Multimodal Conversational Agents	Hamed Rahimi et.al.	2504.01700	null
2025-04-02	CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition	Sarah Alyami et.al.	2504.01666	link
2025-04-02	BioAtt: Anatomical Prior Driven Low-Dose CT Denoising	Namhun Kim et.al.	2504.01662	null
2025-04-02	Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models	Zhaochen Wang et.al.	2504.01589	null
2025-04-02	STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation	Dandan Shan et.al.	2504.01561	link
2025-03-31	SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation	Ngoc Dung Huynh et.al.	2503.24164	null
2025-03-31	AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference	Kai Huang et.al.	2503.23956	null
2025-03-31	HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment	Zhichao Liao et.al.	2503.23907	null
2025-03-31	Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation	Yongle Li et.al.	2503.23869	null
2025-04-01	Evaluating small vision-language models as AI assistants for radio astronomical source analysis tasks	S. Riggi et.al.	2503.23859	link
2025-03-31	Texture or Semantics? Vision-Language Models Get Lost in Font Recognition	Zhecheng Li et.al.	2503.23768	null
2025-03-31	KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language	Yoonshik Kim et.al.	2503.23730	link
2025-03-31	From Geometry to Culture: An Iterative VLM Layout Framework for Placing Objects in Complex 3D Scene Contexts	Yuto Asano et.al.	2503.23707	null
2025-03-30	Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging	Amar Kumar et.al.	2503.23618	null
2025-03-30	DASH: Detection and Assessment of Systematic Hallucinations of VLMs	Maximilian Augustin et.al.	2503.23573	link
2025-03-28	Understanding Co-speech Gestures in-the-wild	Sindhu B Hegde et.al.	2503.22668	null
2025-03-28	Unicorn: Text-Only Data Synthesis for Vision Language Model Training	Xiaomin Yu et.al.	2503.22655	link
2025-03-28	Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization	Iñigo Pikabea et.al.	2503.22577	null
2025-03-28	VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection	Bin Zhang et.al.	2503.22291	null
2025-03-28	FLIP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning	Dongping Liao et.al.	2503.22263	link
2025-03-28	Multi-modal Knowledge Distillation-based Human Trajectory Forecasting	Jaewoo Jeong et.al.	2503.22201	link
2025-03-28	REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation	Puzhen Yuan et.al.	2503.22122	null
2025-03-28	How Well Can Vision-Language Models Understand Humans’ Intention? An Open-ended Theory of Mind Question Evaluation Benchmark	Ximing Wen et.al.	2503.22093	null
2025-03-28	A Survey on Remote Sensing Foundation Models: From Vision to Multimodality	Ziyue Huang et.al.	2503.22081	link
2025-03-27	CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models	Qingqing Zhao et.al.	2503.22020	null
2025-03-27	Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck	Adrian Bulat et.al.	2503.21757	null
2025-03-27	VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness	Dian Zheng et.al.	2503.21755	link
2025-03-27	Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving	Yue Li et.al.	2503.21505	link
2025-03-27	BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding	Shuming Liu et.al.	2503.21483	link
2025-03-27	Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models	Ruizhou Li et.al.	2503.21435	null
2025-03-27	Cultivating Game Sense for Yourself: Making VLMs Gaming Experts	Wenxuan Lu et.al.	2503.21263	null
2025-03-27	LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models	Hengyuan Zhao et.al.	2503.21227	null
2025-03-27	VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation	Alan Dao et.al.	2503.21214	null
2025-03-27	FakeReasoning: Towards Generalizable Forgery Detection and Reasoning	Yueying Gao et.al.	2503.21210	null
2025-03-27	VideoMix: Aggregating How-To Videos for Task-Oriented Learning	Saelyne Yang et.al.	2503.21130	null
2025-03-27	Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning	Huajie Tan et.al.	2503.20752	null
2025-03-26	MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion	Saron Samuel et.al.	2503.20698	null
2025-03-26	AutoRad-Lung: A Radiomic-Guided Prompting Autoregressive Vision-Language Model for Lung Nodule Malignancy Prediction	Sadaf Khademi et.al.	2503.20662	null
2025-03-26	IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting	Hao Fu et.al.	2503.20612	link
2025-03-26	Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models	Fanhu Zeng et.al.	2503.20492	null
2025-03-26	Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding	Joao Pereira et.al.	2503.20362	null
2025-03-26	sudo rm -rf agentic_security	Sejin Lee et.al.	2503.20279	link
2025-03-26	LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions	Yejin Kwon et.al.	2503.20252	null
2025-03-26	Learning Adaptive Dexterous Grasping from Single Demonstrations	Liangzhi Shi et.al.	2503.20208	null
2025-03-26	Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector	Xiao Guo et.al.	2503.20188	link
2025-03-25	CoLLM: A Large Language Model for Composed Image Retrieval	Chuong Huynh et.al.	2503.19910	link
2025-03-25	CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning	Hao Yu et.al.	2503.19900	link
2025-03-25	Towards Online Multi-Modal Social Interaction Understanding	Xinpeng Li et.al.	2503.19851	link
2025-03-25	FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs	Carlos Plou et.al.	2503.19850	null
2025-03-25	FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model	Jun Zhou et.al.	2503.19839	null
2025-03-25	LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation	Vladan Stojnić et.al.	2503.19777	link
2025-03-25	ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation	Haoyu Fu et.al.	2503.19755	null
2025-03-25	Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models	Ilias Stogiannidis et.al.	2503.19707	link
2025-03-25	fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models	Saurav Sharma et.al.	2503.19670	null
2025-03-25	RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models	Mehdi Moshtaghi et.al.	2503.19654	null
2025-03-25	MC-LLaVA: Multi-Concept Personalized Vision-Language Model	Ruichuan An et.al.	2503.18854	link
2025-03-24	Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations	Jeonghyeon Kim et.al.	2503.18817	link
2025-03-24	SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection	Shrikant Malviya et.al.	2503.18812	link
2025-03-24	ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models	Danrui Li et.al.	2503.18680	link
2025-03-25	Feature Calibration enhanced Parameter Synthesis for CLIP-based Class-incremental Learning	Juncen Guo et.al.	2503.18672	null
2025-03-24	Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks	Nina Shvetsova et.al.	2503.18637	null
2025-03-24	Training-Free Personalization via Retrieval and Reasoning on Fingerprints	Deepayan Das et.al.	2503.18623	null
2025-03-24	Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding	Tianyu Chen et.al.	2503.18578	null
2025-03-24	Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models	Bin Li et.al.	2503.18556	null
2025-03-24	MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning	Dawei Yan et.al.	2503.18533	null
2025-03-21	OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement	Yihe Deng et.al.	2503.17352	link
2025-03-21	Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models	Jianing Qi et.al.	2503.17349	null
2025-03-21	Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology	Devavrat Tomar et.al.	2503.17238	link
2025-03-21	Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models	Davide Berasi et.al.	2503.17142	link
2025-03-21	Beyond Accuracy: What Matters in Designing Well-Behaved Models?	Robin Hesse et.al.	2503.17110	null
2025-03-21	ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail	Chandan Yeshwanth et.al.	2503.17044	null
2025-03-21	When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making	Zhe Hu et.al.	2503.16965	null
2025-03-21	PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition	Ibtissam Saadi et.al.	2503.16945	null
2025-03-21	Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks	Haijin Zeng et.al.	2503.16930	null
2025-03-21	Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification	Dongseob Kim et.al.	2503.16873	null
2025-03-20	M3: 3D-Spatial MultiModal Memory	Xueyan Zou et.al.	2503.16413	link
2025-03-20	Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them	Guanyu Chen et.al.	2503.16401	null
2025-03-20	JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse	Muyao Li et.al.	2503.16365	null
2025-03-20	Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model	Zhaochong An et.al.	2503.16282	link
2025-03-20	Probabilistic Prompt Distribution Learning for Animal Pose Estimation	Jiyong Rao et.al.	2503.16120	link
2025-03-20	MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures	Lucas Morin et.al.	2503.16096	link
2025-03-20	STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding	Zichen Liu et.al.	2503.15973	link
2025-03-20	Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation	Clive Tinashe Marimo et.al.	2503.15969	null
2025-03-20	CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention	Yaxiong Chen et.al.	2503.15949	link
2025-03-20	Don’t Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts	Elisei Rykov et.al.	2503.15948	link
2025-03-19	TULIP: Towards Unified Language-Image Pretraining	Zineng Tang et.al.	2503.15485	null
2025-03-19	Visual Persona: Foundation Model for Full-Body Human Customization	Jisu Nam et.al.	2503.15406	null
2025-03-19	EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models	Yinan Liang et.al.	2503.15369	null
2025-03-19	SemEval-2025 Task 1: AdMIRe – Advancing Multimodal Idiomaticity Representation	Thomas Pickard et.al.	2503.15358	null
2025-03-19	TruthLens: A Training-Free Paradigm for DeepFake Detection	Ritabrata Chakraborty et.al.	2503.15342	null
2025-03-19	Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport	Hao Tan et.al.	2503.15337	link
2025-03-19	Sig2text, a Vision-language model for Non-cooperative Radar Signal Parsing	Hancong Feng KaiLI Jiang Bin tang et.al.	2503.15213	link
2025-03-19	Context-Aware Vision Language Foundation Models for Ocular Disease Screening in Retinal Images	Lucie Berger et.al.	2503.15212	null
2025-03-19	A Unified Framework for Real-Time Failure Handling in Robotics Using Vision-Language Models, Reactive Planner and Behavior Trees	Faseeh Ahmad et.al.	2503.15202	null
2025-03-19	Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU	Àlex Pujol Vidal et.al.	2503.15166	null
2025-03-18	ExDDV: A New Dataset for Explainable Deepfake Detection in Video	Vlad Hondru et.al.	2503.14421	link
2025-03-18	CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models	Yiqi Zhu et.al.	2503.14161	link
2025-03-18	Growing a Twig to Accelerate Large Vision-Language Models	Zhenwei Shao et.al.	2503.14075	null
2025-03-18	MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding	Siwei Han et.al.	2503.13964	link
2025-03-18	Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models	Yuxiang Lai et.al.	2503.13939	null
2025-03-18	ChatBEV: A Visual Language Model that Understands BEV Maps	Qingyao Xu et.al.	2503.13938	null
2025-03-18	Where do Large Vision-Language Models Look at when Answering Questions?	Xiaoying Xing et.al.	2503.13891	link
2025-03-18	RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving	Yujin Wang et.al.	2503.13861	null
2025-03-18	VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences	Anukriti Singh et.al.	2503.13817	null
2025-03-18	Identifying and Mitigating Position Bias of Multi-image Vision-Language Models	Xinyu Tian et.al.	2503.13792	null
2025-03-17	DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models	Haoyang Li et.al.	2503.13443	link
2025-03-18	MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling	Yingyue Li et.al.	2503.13440	link
2025-03-17	Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions	Wan Ju Kang et.al.	2503.13369	null
2025-03-17	Agents Play Thousands of 3D Video Games	Zhongwen Xu et.al.	2503.13356	null
2025-03-17	HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning	Wensheng Wang et.al.	2503.13171	null
2025-03-17	Free-form language-based robotic reasoning and grasping	Runyu Jiao et.al.	2503.13082	null
2025-03-17	Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari	Harshal Kausadikar et.al.	2503.13060	null
2025-03-17	InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving	Ruiqi Song et.al.	2503.13047	null
2025-03-17	Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization	Ruichuan An et.al.	2503.12999	null
2025-03-17	How Good is my Histopathology Vision-Language Foundation Model? A Holistic Benchmark	Roba Al Majzoub et.al.	2503.12990	link
2025-03-14	Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense	Shuyang Hao et.al.	2503.11619	null
2025-03-14	Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages	Matteo Farina et.al.	2503.11609	link
2025-03-14	SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion	Ahmed Nassar et.al.	2503.11576	null
2025-03-14	Similarity-Aware Token Pruning: Your VLM but Faster	Ahmadreza Jeddi et.al.	2503.11549	link
2025-03-14	Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models	Hao Cheng et.al.	2503.11519	null
2025-03-14	PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture	Xiaokang Wei et.al.	2503.11368	null
2025-03-14	PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models	Mayank Nautiyal et.al.	2503.11360	null
2025-03-14	Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset	Yibing Weng et.al.	2503.11342	null
2025-03-14	DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models	Xirui Zhou et.al.	2503.11265	null
2025-03-14	Compound Expression Recognition via Large Vision-Language Models	Jun Yu et.al.	2503.11241	null
2025-03-13	A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1	Zhaoyi Li et.al.	2503.10635	link
2025-03-13	HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model	Jiaming Liu et.al.	2503.10631	null
2025-03-13	CoSTA$\ast$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing	Advait Gupta et.al.	2503.10613	link
2025-03-13	TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention	Jinhao Duan et.al.	2503.10602	link
2025-03-13	GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding	Rui Hu et.al.	2503.10596	link
2025-03-13	VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search	Yiming Jia et.al.	2503.10582	null
2025-03-13	KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation	Zixian Liu et.al.	2503.10546	null
2025-03-13	Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction	Yuhan Wang et.al.	2503.10508	null
2025-03-13	World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning	Siyin Wang et.al.	2503.10480	null
2025-03-13	VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan	Zhi Rui Tam et.al.	2503.10427	link
2025-03-12	SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment	Katrin Renz et.al.	2503.09594	null
2025-03-12	MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions	Zhe Xu et.al.	2503.09499	link
2025-03-12	BAMBI: Developing Baby Language Models for Italian	Alice Suozzi et.al.	2503.09481	null
2025-03-12	SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery	Jiayuan Huang et.al.	2503.09474	null
2025-03-12	Astrea: A MOE-based Visual Understanding Model with Progressive Alignment	Xiaoda Yang et.al.	2503.09445	null
2025-03-12	Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models	Julian Spravil et.al.	2503.09443	null
2025-03-12	OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment	Qi Liu et.al.	2503.09416	null
2025-03-12	Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models	Xiaozhen Qiao et.al.	2503.09394	null
2025-03-12	VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers	Ruanjun Li et.al.	2503.09387	null
2025-03-13	2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos	Marvin Heidinger et.al.	2503.09320	null
2025-03-11	Generating Robot Constitutions & Benchmarks for Semantic Safety	Pierre Sermanet et.al.	2503.08663	null
2025-03-11	TLA: Tactile-Language-Action Model for Contact-Rich Manipulation	Peng Hao et.al.	2503.08548	null
2025-03-11	GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training	Tong Wei et.al.	2503.08525	null
2025-03-11	External Knowledge Injection for CLIP-Based Class-Incremental Learning	Da-Wei Zhou et.al.	2503.08510	link
2025-03-11	MMRL: Multi-Modal Representation Learning for Vision-Language Models	Yuncheng Guo et.al.	2503.08497	link
2025-03-11	SuperCap: Multi-resolution Superpixel-based Image Captioning	Henry Senior et.al.	2503.08496	null
2025-03-11	TT-GaussOcc: Test-Time Compute for Self-Supervised Occupancy Prediction via Spatio-Temporal Gaussian Splatting	Fengyi Zhang et.al.	2503.08485	null
2025-03-11	PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability	Weijie Zhou et.al.	2503.08481	link
2025-03-11	Debiased Prompt Tuning in Vision-Language Model without Annotations	Chaoquan Jiang et.al.	2503.08368	null
2025-03-11	Trinity: A Modular Humanoid Robot AI System	Jingkai Sun et.al.	2503.08338	null
2025-03-10	Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction	Zongzheng Zhang et.al.	2503.07485	link
2025-03-10	REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding	Yan Tai et.al.	2503.07413	link
2025-03-11	FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates	Sangwoo Park et.al.	2503.07216	null
2025-03-10	Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation	Sihao Lin et.al.	2503.07125	null
2025-03-11	PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM	Alan Dao et.al.	2503.07111	null
2025-03-10	XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition	Chuanming Wang et.al.	2503.07075	null
2025-03-10	Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning	Huilin Deng et.al.	2503.07065	link
2025-03-10	CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model	Shihao Hou et.al.	2503.06993	null
2025-03-10	A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis	Xiang Liu et.al.	2503.06973	link
2025-03-10	Handle Object Navigation as Weighted Traveling Repairman Problem	Ruimeng Liu et.al.	2503.06937	link
2025-03-07	VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method	Weiyu Ma et.al.	2503.05383	link
2025-03-07	Data-Efficient Generalization for Zero-shot Composed Image Retrieval	Zining Chen et.al.	2503.05204	null
2025-03-07	Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions	Chan hur et.al.	2503.05186	null
2025-03-07	Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models	Messi H. J. Lee et.al.	2503.05093	null
2025-03-07	Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation	Qingxuan Jia et.al.	2503.05064	null
2025-03-06	LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression	Souvik Kundu et.al.	2503.04982	null
2025-03-06	Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach	Soumyadeep Ro et.al.	2503.04918	null
2025-03-06	LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM	Sambal Shikhar et.al.	2503.04724	null
2025-03-06	A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning	Qing Zhou et.al.	2503.04592	null
2025-03-06	AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM	Sunghyun Ahn et.al.	2503.04504	link
2025-03-06	TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction	Chao Wang et.al.	2503.04457	null
2025-03-06	From Idea to CAD: A Language Model-Driven Multi-Agent System for Collaborative Design	Felix Ocker et.al.	2503.04417	null
2025-03-07	Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models	Niccolò Turcato et.al.	2503.04280	null
2025-03-06	An Egocentric Vision-Language Model based Portable Real-time Smart Assistant	Yifei Huang et.al.	2503.04250	link
2025-03-06	Synthetic Data is an Elegant GIFT for Continual Vision-Language Models	Bin Wu et.al.	2503.04229	null
2025-03-06	EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models	Haiyang Yu et.al.	2503.04058	null
2025-03-06	RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models	Wenhui Zhu et.al.	2503.03987	null
2025-03-05	CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning	Yuqi Zhou et.al.	2503.03743	link
2025-03-05	OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction	Huang Huang et.al.	2503.03734	null
2025-03-05	Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions	Jun Li et.al.	2503.03278	null
2025-03-05	Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings	Sneh Pillai et.al.	2503.03202	null
2025-03-05	SpiritSight Agent: Advanced GUI Agent with One Look	Zhiyuan Huang et.al.	2503.03196	null
2025-03-04	LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications	Danqing Zhang et.al.	2503.02950	link
2025-03-05	FairSense-AI: Responsible AI Meets Sustainability	Shaina Raza et.al.	2503.02865	null
2025-03-04	“What If Smart Homes Could See Our Homes?”: Exploring DIY Smart Home Building Experiences with VLM-Based Camera Sensors	Sojeong Yun et.al.	2503.02816	null
2025-03-04	Bridging VLM and KMP: Enabling Fine-grained robotic manipulation via Semantic Keypoints Representation	Junjie Zhu et.al.	2503.02748	null
2025-03-04	ImpedanceGPT: VLM-driven Impedance Control of Swarm of Mini-drones for Intelligent Navigation in Dynamic Environment	Faryal Batool et.al.	2503.02723	null
2025-03-04	MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments	Ege Özsoy et.al.	2503.02579	link
2025-03-04	UAV-VLRR: Vision-Language Informed NMPC for Rapid Response in UAV Search and Rescue	Yasheerah Yaqoot et.al.	2503.02465	null
2025-03-04	UAV-VLPA*: A Vision-Language-Path-Action System for Optimal Route Generation on a Large Scales	Oleg Sautenkov et.al.	2503.02454	null
2025-03-04	Vision-Language Model IP Protection via Prompt-based Learning	Lianyu Wang et.al.	2503.02393	null
2025-03-04	Are Large Vision Language Models Good Game Players?	Xinyu Wang et.al.	2503.02358	null
2025-03-04	BiasICL: In-Context Learning and Demographic Biases of Vision Language Models	Sonnet Xu et.al.	2503.02334	null
2025-02-28	Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos	Zhiyu Tan et.al.	2502.21314	null
2025-02-28	PET Image Denoising via Text-Guided Diffusion: Integrating Anatomical Priors through Text Prompts	Boxiao Yu et.al.	2502.21260	null
2025-02-28	FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts	Ziyi Zhang et.al.	2502.21059	null
2025-02-28	DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping	Yifan Zhong et.al.	2502.20900	null
2025-02-28	VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation	Anh Tien Nguyen et.al.	2502.20850	null
2025-02-28	CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval	Zelong Sun et.al.	2502.20826	null
2025-02-28	MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models	Qiao Yan et.al.	2502.20780	link
2025-02-28	Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow	Jiaqi Bai et.al.	2502.20750	link
2025-02-28	T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting	Yifei Qian et.al.	2502.20625	null
2025-02-27	Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Inference	Mingyuan Wu et.al.	2502.20587	null
2025-02-27	Visual Adaptive Prompting for Compositional Zero-Shot Learning	Kyle Stein et.al.	2502.20292	null
2025-02-27	Explainable, Multi-modal Wound Infection Classification from Images Augmented with Generated Captions	Palawat Busaranuvong et.al.	2502.20277	null
2025-02-27	VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers	Ziang Guo et.al.	2502.20108	null
2025-02-27	Vision-Encoders (Already) Know What They See: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore	Hongseok Oh et.al.	2502.20034	link
2025-02-27	ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models	Ke Niu et.al.	2502.19958	null
2025-02-27	Space Rotation with Basis Transformation for Training-free Test-Time Adaptation	Chenhao Ding et.al.	2502.19946	null
2025-02-27	ProAPO: Progressively Automatic Prompt Optimization for Visual Classification	Xiangyan Qu et.al.	2502.19844	link
2025-02-27	Analyzing CLIP’s Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study	Reza Abbasi et.al.	2502.19828	null
2025-02-27	Mixtera: A Data Plane for Foundation Model Training	Maximilian Böther et.al.	2502.19790	link
2025-02-27	Open-Vocabulary Semantic Part Segmentation of 3D Human	Keito Suzuki et.al.	2502.19782	null
2025-02-26	Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models	Lucy Xiaoyang Shi et.al.	2502.19417	null
2025-02-26	Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions	Ruben T. Lucassen et.al.	2502.19293	null
2025-02-26	On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation	Ruben T. Lucassen et.al.	2502.19285	null
2025-02-26	Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models	Jiawei Kong et.al.	2502.19269	null
2025-02-26	FungalZSL: Zero-Shot Fungal Classification with Image Captioning Using a Synthetic Data Approach	Anju Rani et.al.	2502.19038	null
2025-02-26	VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model	Jiani Zheng et.al.	2502.18906	null
2025-02-26	Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP	Chenyang Zhao et.al.	2502.18816	null
2025-02-26	QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries	Nicolas Harvey Chapman et.al.	2502.18735	null
2025-02-25	olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models	Jake Poznanski et.al.	2502.18443	link
2025-02-25	WebGames: Challenging General-Purpose Web-Browsing AI Agents	George Thomas et.al.	2502.18356	link
2025-02-25	Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models	Zhaoyi Liu et.al.	2502.18290	null
2025-02-25	NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts	Muhammad Farid Adilazuarda et.al.	2502.18148	null
2025-02-25	Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models	Cao Yuxuan et.al.	2502.18101	link
2025-02-25	VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion	Pei Liu et.al.	2502.18042	null
2025-02-25	UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting	Haoyuan Li et.al.	2502.17860	null
2025-02-25	Can Multimodal LLMs Perform Time Series Anomaly Detection?	Xiongxiao Xu et.al.	2502.17812	link
2025-02-24	Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures	Akhila Yerukola et.al.	2502.17710	link
2025-02-24	METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling	Bingxuan Li et.al.	2502.17651	null
2025-02-24	DIS-CO: Discovering Copyrighted Content in VLMs Training Data	André V. Duarte et.al.	2502.17358	link
2025-02-24	Applications of Large Models in Medicine	YunHe Su et.al.	2502.17132	null
2025-02-24	Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI	Syed Abdul Gaffar Shakhadri et.al.	2502.17092	null
2025-02-24	Evolution 6.0: Evolving Robotic Capabilities Through Generative Design	Muhammad Haris Khan et.al.	2502.17034	null
2025-02-24	All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark	Davide Testa et.al.	2502.16989	null
2025-02-24	SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models	Kevin Miller et.al.	2502.16911	null
2025-02-24	Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models	Yaqi Sun et.al.	2502.16842	null
2025-02-24	CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification	Liping Lu et.al.	2502.16815	null
2025-02-23	Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation	Yunhai Feng et.al.	2502.16707	null
2025-02-23	Can Large Vision-Language Models Detect Images Copyright Infringement from GenAI?	Qipan Xu et.al.	2502.16618	null
2025-02-21	ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval	Guanqi Zhan et.al.	2502.15682	null
2025-02-21	Testing the limits of fine-tuning to improve reasoning in vision language models	Luca M. Schulze Buschoff et.al.	2502.15678	null
2025-02-21	Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation	Tim Rädsch et.al.	2502.15563	null
2025-02-21	FaultGPT: Industrial Fault Diagnosis Question Answering System by Vision Language Models	Jiao Chen et.al.	2502.15481	null
2025-02-21	Chitrarth: Bridging Vision and Language for a Billion People	Shaharukh Khan et.al.	2502.15392	null
2025-02-21	The Role of Background Information in Reducing Object Hallucination in Vision-Language Models: Insights from Cutoff API Prompting	Masayo Tomita et.al.	2502.15389	null
2025-02-21	CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models	Shunchang Liu et.al.	2502.15278	null
2025-02-21	The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning	Sheila Schoepp et.al.	2502.15214	null
2025-02-21	PairBench: A Systematic Framework for Selecting Reliable Judge VLMs	Aarash Feizi et.al.	2502.15210	null
2025-02-21	CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models	Zihao Sheng et.al.	2502.15119	null
2025-02-20	Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation	Yue Yang et.al.	2502.14846	null
2025-02-20	LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models	Shangqing Tu et.al.	2502.14834	link
2025-02-20	SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features	Michael Tschannen et.al.	2502.14786	link
2025-02-20	ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting	Abhijit Mishra et.al.	2502.14780	null
2025-02-21	HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States	Yilei Jiang et.al.	2502.14744	link
2025-02-20	NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization	Zheyuan Zhang et.al.	2502.14638	link
2025-02-20	Noisy Test-Time Adaptation in Vision-Language Models	Chentao Cao et.al.	2502.14604	link
2025-02-20	PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models	Yu Meng et.al.	2502.14504	null
2025-02-20	How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation	Zhuohang Long et.al.	2502.14486	null
2025-02-20	Evaluating Precise Geolocation Inference Capabilities of Vision Language Models	Neel Jay et.al.	2502.14412	link
2025-02-19	Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images	Shengguang Wu et.al.	2502.13928	null
2025-02-19	NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants	Yiran Qin et.al.	2502.13894	null
2025-02-19	SPEX: Scaling Feature Interaction Explanations for LLMs	Justin Singh Kang et.al.	2502.13870	link
2025-02-19	Quantifying Memorization and Retriever Performance in Retrieval-Augmented Vision-Language Models	Peter Carragher et.al.	2502.13836	null
2025-02-19	GIMMICK – Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking	Florian Schneider et.al.	2502.13766	null
2025-02-19	VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation	Wei Zhao et.al.	2502.13508	link
2025-02-19	Transferring Textual Preferences to Vision-Language Understanding through Model Merging	Chen-An Li et.al.	2502.13487	null
2025-02-19	Estimating Commonsense Plausibility through Semantic Shifts	Wanqing Cui et.al.	2502.13464	null
2025-02-19	MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation	Lingfeng Zhang et.al.	2502.13451	null
2025-02-19	Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning	Hao Ma et.al.	2502.13430	null
2025-02-18	Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization	Shuo Xing et.al.	2502.13146	link
2025-02-18	Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation	Bencheng Liao et.al.	2502.13145	link
2025-02-18	SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation	Zekun Qi et.al.	2502.13143	null
2025-02-18	Understanding and Rectifying Safety Perception Distortion in VLMs	Xiaohan Zou et.al.	2502.13095	null
2025-02-18	MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching	Fabian David Schmidt et.al.	2502.12852	null
2025-02-18	Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum	Yihong Liu et.al.	2502.12599	null
2025-02-18	CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base	Cong-Duy Nguyen et.al.	2502.12591	null
2025-02-18	Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning	Mengshi Qi et.al.	2502.12425	link
2025-02-17	LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models	Zongyu Wu et.al.	2502.12359	null
2025-02-17	Discriminative-Generative Custom Tokens for Vision-Language Models	Pramuditha Perera et.al.	2502.12095	null
2025-02-17	VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues	Jianshu Zhang et.al.	2502.12084	null
2025-02-17	Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning	Jiayang Zhang et.al.	2502.12049	link
2025-02-17	Learning Generalizable Prompt for CLIP with Class Similarity Knowledge	Sehun Jung et.al.	2502.11969	null
2025-02-17	VLP: Vision-Language Preference Learning for Embodied Manipulation	Runze Liu et.al.	2502.11918	null
2025-02-17	From Open-Vocabulary to Vocabulary-Free Semantic Segmentation	Klara Reichard et.al.	2502.11891	null
2025-02-17	VAQUUM: Are Vague Quantifiers Grounded in Visual Data?	Hugh Mee Wong et.al.	2502.11874	null
2025-02-17	Defining and Evaluating Visual Language Models’ Basic Spatial Abilities: A Perspective from Psychometrics	Wenrui Xu et.al.	2502.11859	null
2025-02-17	ILIAS: Instance-Level Image retrieval At Scale	Giorgos Kordopatis-Zilos et.al.	2502.11748	null
2025-02-17	Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering	Md Zarif Ul Alam et.al.	2502.11747	null
2025-02-14	Probing Perceptual Constancy in Large Vision Language Models	Haoran Sun et.al.	2502.10273	null
2025-02-14	VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models	Gokul Karthik Kumar et.al.	2502.10250	null
2025-02-14	Cooperative Multi-Agent Planning with Adaptive Skill Synthesis	Zhiyuan Li et.al.	2502.10148	null
2025-02-14	Image Embedding Sampling Method for Diverse Captioning	Sania Waheed et.al.	2502.10118	null
2025-02-14	Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models	Chenrui Tie et.al.	2502.10090	link
2025-02-14	ViRAC: A Vision-Reasoning Agent Head Movement Control Framework in Arbitrary Virtual Environments	Juyeong Hwang et.al.	2502.10046	null
2025-02-14	Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation	Shichao Fan et.al.	2502.10040	null
2025-02-14	HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation	Tianwei Lin et.al.	2502.09838	link
2025-02-13	On the robustness of multimodal language model towards distractions	Ming Liu et.al.	2502.09818	null
2025-02-13	GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis	Angelos Zavras et.al.	2502.09598	link
2025-02-13	When and How Does CLIP Enable Domain and Compositional Generalization?	Elias Kempf et.al.	2502.09507	link
2025-02-13	OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics	Junhui Wang et.al.	2502.09238	null
2025-02-13	LimSim Series: An Autonomous Driving Simulation Platform for Validation and Enhancement	Daocheng Fu et.al.	2502.09170	link
2025-02-13	The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs)	Ye-eun Cho et.al.	2502.09120	null
2025-02-13	Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model	Shiryu Ueno et.al.	2502.09057	link
2025-02-13	AIDE: Agentically Improve Visual Language Model with Domain Experts	Ming-Chang Chiu et.al.	2502.09051	null
2025-02-13	3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning	Guoqin Tang et.al.	2502.08903	null
2025-02-13	Harnessing Vision Models for Time Series Analysis: A Survey	Jingchao Ni et.al.	2502.08869	link
2025-02-12	ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots	Yuxuan Zhang et.al.	2502.08791	null
2025-02-12	A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards	Shivansh Patel et.al.	2502.08643	null
2025-02-12	SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation	Ellie Arar et.al.	2502.08642	null
2025-02-12	Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities?	Jiahe Jin et.al.	2502.08503	link
2025-02-12	ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification	Jiangbo Shi et.al.	2502.08391	link
2025-02-12	Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting	Jiarui Wu et.al.	2502.08317	null
2025-02-12	TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents	Kunal Singh et.al.	2502.08226	null
2025-02-12	Typographic Attacks in a Multi-Image Setting	Xiaomeng Wang et.al.	2502.08193	link
2025-02-12	SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation	Zhiming Ma et.al.	2502.08168	link
2025-02-12	MAA: Meticulous Adversarial Attack against Vision-Language Pre-trained Models	Peng-Fei Zhang et.al.	2502.08079	null
2025-02-11	Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders	Kshitish Ghate et.al.	2502.07957	link
2025-02-11	Scaling Pre-training to One Hundred Billion Data for Vision Language Models	Xiao Wang et.al.	2502.07617	null
2025-02-11	MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification	Anh-Tien Nguyen et.al.	2502.07409	link
2025-02-11	TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation	Navid Rajabi et.al.	2502.07306	null
2025-02-11	Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired	ByungOk Han et.al.	2502.07183	link
2025-02-10	EVEv2: Improved Baselines for Encoder-Free Vision-Language Models	Haiwen Diao et.al.	2502.06788	link
2025-02-10	Conformal Predictions for Human Action Recognition with Vision-Language Models	Bary Tim et.al.	2502.06631	null
2025-02-10	Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments	Sankalp Nagaonkar et.al.	2502.06445	link
2025-02-10	AppVLM: A Lightweight Vision Language Model for Online App Control	Georgios Papoudakis et.al.	2502.06395	null
2025-02-11	When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs	Aobotao Dai et.al.	2502.06390	link
2025-02-10	Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models	Ce Zhang et.al.	2502.06130	link
2025-02-10	Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models	Peiran Wang et.al.	2502.06094	null
2025-02-09	Noise is an Efficient Learner for Zero-Shot Vision-Language Models	Raza Imam et.al.	2502.06019	null
2025-02-09	Uni-Retrieval: A Multi-Style Retrieval Framework for STEM’s Education	Yanhao Jia et.al.	2502.05863	null
2025-02-09	DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control	Junjie Wen et.al.	2502.05855	null
2025-02-07	Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy	Yunhang Shen et.al.	2502.05177	link
2025-02-07	DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions	Gorkem Can Ates et.al.	2502.05091	null
2025-02-07	OccGS: Zero-shot 3D Occupancy Reconstruction with Semantic and Geometric-Aware Gaussian Splatting	Xiaoyu Zhou et.al.	2502.04981	null
2025-02-07	Training-free Task-oriented Grasp Generation	Jiaming Wang et.al.	2502.04873	null
2025-02-07	ELITE: Enhanced Language-Image Toxicity Evaluation for Safety	Wonjun Lee et.al.	2502.04757	null
2025-02-06	AnyPlace: Learning Generalized Object Placement for Robot Manipulation	Yuchi Zhao et.al.	2502.04531	null
2025-02-06	Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment	Zuyan Liu et.al.	2502.04328	link
2025-02-06	Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion	Marco Mistretta et.al.	2502.04263	link
2025-02-06	Éclair – Extracting Content and Layout with Integrated Reading Order for Documents	Ilia Karmanov et.al.	2502.04223	null
2025-02-06	HD-EPIC: A Highly-Detailed Egocentric Video Dataset	Toby Perrett et.al.	2502.04144	null
2025-02-06	Efficient Few-Shot Continual Learning in Vision-Language Models	Aristeidis Panos et.al.	2502.04098	null
2025-02-06	CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing	Yu Yuan et.al.	2502.03997	null
2025-02-06	Adapting Human Mesh Recovery with Vision-Language Feedback	Chongyang Xu et.al.	2502.03836	null
2025-02-05	The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering	Zhuowei Li et.al.	2502.03628	link
2025-02-05	DynVFX: Augmenting Real Videos with Dynamic Content	Danah Yatim et.al.	2502.03621	null
2025-02-05	EnVisionVR: A Scene Interpretation Tool for Visual Accessibility in Virtual Reality	Junlong Chen et.al.	2502.03564	null
2025-02-05	SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living	Arkaprava Sinha et.al.	2502.03459	null
2025-02-05	RadVLM: A Multitask Conversational Vision-Language Model for Radiology	Nicolas Deperrois et.al.	2502.03333	null
2025-02-05	Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search	Jiayi He et.al.	2502.03230	null
2025-02-05	iVISPAR – An Interactive Visual-Spatial Reasoning Benchmark for VLMs	Julius Mayer et.al.	2502.03214	link
2025-02-05	Disentangling CLIP Features for Enhanced Localized Understanding	Samyak Rawelekar et.al.	2502.02977	null
2025-02-04	Vision-Language Model Dialog Games for Self-Improvement	Ksenia Konyushkova et.al.	2502.02740	null
2025-02-04	Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling	Xiaowen Qiu et.al.	2502.02590	null
2025-02-04	COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation	Xueqing Deng et.al.	2502.02589	null
2025-02-04	Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation	Junha Lee et.al.	2502.02548	null
2025-02-04	Personalization Toolkit: Training Free Personalization of Large Vision Language Models	Soroush Seifi et.al.	2502.02452	null
2025-02-04	LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models	Yuto Kojima et.al.	2502.02069	null
2025-02-04	AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs	Hongxin Li et.al.	2502.01977	null
2025-02-04	Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration	Younan Zhu et.al.	2502.01969	null
2025-02-03	From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment	Yilin Wu et.al.	2502.01828	null
2025-02-03	AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis	Basit Alawode et.al.	2502.01785	null
2025-02-03	Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning	Udita Ghosh et.al.	2502.01616	null
2025-01-31	Homogeneity Bias as Differential Sampling Uncertainty in Language Models	Messi H. J. Lee et.al.	2501.19337	null
2025-01-31	Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search	Yuta Oshima et.al.	2501.19252	null
2025-01-31	A Survey on Class-Agnostic Counting: Advancements from Reference-Based to Open-World Text-Guided Approaches	Luca Ciampi et.al.	2501.19184	null
2025-01-31	Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs	Kejia Zhang et.al.	2501.19164	null
2025-01-31	Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification	Xiangyu Sun et.al.	2501.19086	null
2025-02-03	Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment	Song-Lin Lv et.al.	2501.19060	null
2025-01-31	RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception	Joshua R. Waite et.al.	2501.18880	null
2025-02-03	UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent	Jianke Zhang et.al.	2501.18867	null
2025-01-31	Test-time Loss Landscape Adaptation for Zero-Shot Generalization in Vision-Language Models	Aodi Li et.al.	2501.18864	null
2025-01-30	INT: Instance-Specific Negative Mining for Task-Generic Promptable Segmentation	Jian Hu et.al.	2501.18753	null
2025-01-30	Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models	Yi Ding et.al.	2501.18533	null
2025-01-30	A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models	Shiho Noda et.al.	2501.18463	link
2025-01-30	Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks	Hao-Zhe Tan et.al.	2501.18271	null
2025-01-29	Planning with Vision-Language Models and a Use Case in Robot-Assisted Teaching	Xuzhe Dang et.al.	2501.17665	null
2025-01-29	Exploring Vision Language Models for Multimodal and Multilingual Stance Detection	Jake Vasilakes et.al.	2501.17654	null
2025-01-29	Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation	Lin Chen et.al.	2501.17642	null
2025-01-30	Boosting Weak Positives for Text Based Person Search	Akshay Modi et.al.	2501.17586	null
2025-01-29	Learning Free Token Reduction for Multi-Modal LLM	Zihui Zhao et.al.	2501.17391	null
2025-01-30	Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding	Yun-Shiuan Chuang et.al.	2501.17310	null
2025-01-28	Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection	Xiangyu Gao et.al.	2501.16981	null
2025-01-28	Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet?	Sania Waheed et.al.	2501.16947	null
2025-01-29	TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models	Makoto Shing et.al.	2501.16937	null
2025-01-28	Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks	Shengyao Zhuang et.al.	2501.16902	link
2025-01-28	Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models	Muhammad Atta ur Rahman et.al.	2501.16769	null
2025-01-28	One Head Eight Arms: Block Matrix based Low Rank Adaptation for CLIP-based Few-Shot Learning	Chunpeng Zhou et.al.	2501.16720	null
2025-01-28	Improving Vision-Language-Action Model with Online Reinforcement Learning	Yanjiang Guo et.al.	2501.16664	null
2025-01-27	BiFold: Bimanual Cloth Folding with Language Guidance	Oriol Barbany et.al.	2501.16458	null
2025-01-27	PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding	Wei Chow et.al.	2501.16411	null
2025-01-27	CLISC: Bridging clip and sam by enhanced cam for unsupervised brain tumor segmentation	Xiaochuan Ma et.al.	2501.16246	null
2025-01-27	AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants	Pascal J. Sager et.al.	2501.16150	null
2025-01-27	CILP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification	Huazhong Zhao et.al.	2501.16065	link
2025-01-27	Addressing Out-of-Label Hazard Detection in Dashcam Videos: Insights from the COOOL Challenge	Anh-Kiet Duong et.al.	2501.16037	link
2025-01-27	SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model	Delin Qu et.al.	2501.15830	null
2025-01-27	MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining	Ruiqi Wu et.al.	2501.15798	link
2025-01-26	Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts	Haodi Ma et.al.	2501.15688	null
2025-01-26	Ocean-OCR: Towards General OCR Application via a Vision-Language Model	Song Chen et.al.	2501.15558	link
2025-01-26	Domain Adaptation from Generated Multi-Weather Images for Unsupervised Maritime Object Classification	Dan Song et.al.	2501.15503	null
2025-01-26	Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection	Han Wang et.al.	2501.15438	link
2025-01-24	Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding	Zhongyi Shui et.al.	2501.14548	link
2025-01-24	PAID: A Framework of Product-Centric Advertising Image Design	Hongyu Chen et.al.	2501.14316	null
2025-01-24	Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models	Yuxuan Liang et.al.	2501.14276	null
2025-01-24	PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction	Hammad Ayyubi et.al.	2501.14210	null
2025-01-24	Dynamic Token Reduction during Generation for Vision Language Models	Xiaoyu Liang et.al.	2501.14204	null
2025-01-24	Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models	Saaduddin Mahmud et.al.	2501.14189	null
2025-01-24	SelfPrompt: Confidence-Aware Semi-Supervised Tuning for Robust Vision-Language Model Adaptation	Shuvendu Roy et.al.	2501.14148	null
2025-01-23	Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models	Linh Tran et.al.	2501.13904	null
2025-01-23	Dual-Modal Prototype Joint Learning for Compositional Zero-Shot Learning	Shiyu Zhang et.al.	2501.13859	null
2025-01-23	Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes	Shiling Deng et.al.	2501.13851	link
2025-01-23	Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models	Chaolei Han et.al.	2501.13795	link
2025-01-23	Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak	Erjia Xiao et.al.	2501.13772	null
2025-01-23	EventVL: Understand Event Streams via Multimodal Large Language Model	Pengteng Li et.al.	2501.13707	null
2025-01-23	Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task	Mohit Vaishnav et.al.	2501.13620	null
2025-01-23	Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving	Lu Wang et.al.	2501.13563	null
2025-01-23	Text-driven Online Action Detection	Manuel Benavent-Lledo et.al.	2501.13518	link
2025-01-23	Iterative Shaping of Multi-Particle Aggregates based on Action Trees and VLM	Hoi-Yin Lee et.al.	2501.13507	null
2025-01-22	Patent Figure Classification using Large Vision-language Models	Sushil Awale et.al.	2501.12751	link
2025-01-22	TeD-Loc: Text Distillation for Weakly Supervised Object Localization	Shakeeb Murtaza et.al.	2501.12632	link
2025-01-22	ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality	Yanming Xiu et.al.	2501.12553	link
2025-01-21	Owls are wise and foxes are unfaithful: Uncovering animal stereotypes in vision-language models	Tabinda Aman et.al.	2501.12433	null
2025-01-20	ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models	Jingwei Yi et.al.	2501.12418	link
2025-01-21	InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model	Yuhang Zang et.al.	2501.12368	link
2025-01-21	Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2	Md. Rakibul Islam et.al.	2501.12356	null
2025-01-21	CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification	Cristiano Patrício et.al.	2501.12266	null
2025-01-21	Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model	Kazi Hasan Ibn Arif et.al.	2501.12206	link
2025-01-20	Human-AI Collaborative Game Testing with Vision Language Models	Boran Zhang et.al.	2501.11782	null
2025-01-20	SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models	Shu Zou et.al.	2501.11485	link
2025-01-20	Verifying Cross-modal Entity Consistency in News using Vision-language Models	Sahar Tahmasebi et.al.	2501.11403	link
2025-01-20	KPL: Training-Free Medical Knowledge Mining of Vision-Language Models	Jiaxiang Liu et.al.	2501.11231	link
2025-01-19	ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models	Yassir Bendou et.al.	2501.11175	null
2025-01-19	Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding	Zhanpeng Chen et.al.	2501.10967	link
2025-01-17	HiMix: Reducing Computational Complexity in Large Vision-Language Models	Xuange Zhang et.al.	2501.10318	null
2025-01-17	SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning	Yuecheng Liu et.al.	2501.10074	null
2025-01-17	CLIP-PCQA: Exploring Subjective-Aligned Vision-Language Modeling for Point Cloud Quality Assessment	Yating Liu et.al.	2501.10071	link
2025-01-17	MSTS: A Multimodal Safety Test Suite for Vision-Language Models	Paul Röttger et.al.	2501.10057	link
2025-01-17	Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions	Zhijie Tan et.al.	2501.10011	null
2025-01-17	Explainable artificial intelligence (XAI): from inherent explainability to large language models	Fuseini Mumuni et.al.	2501.09967	null
2025-01-16	Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key	Zhihe Yang et.al.	2501.09695	link
2025-01-16	Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark	Alexis Roger et.al.	2501.09672	null
2025-01-16	AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture	Jiayi Han et.al.	2501.09532	null
2025-01-16	Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness	Zeyu Wang et.al.	2501.09446	null
2025-01-16	Vision-Language Models Do Not Understand Negation	Kumail Alhamoud et.al.	2501.09425	null
2025-01-16	YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks	Saptarashmi Bandyopadhyay et.al.	2501.09355	null
2025-01-16	RoboReflect: Robotic Reflective Reasoning for Grasping Ambiguous-Condition Objects	Zhen Luo et.al.	2501.09307	null
2025-01-16	Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning	Harrison Fuller et.al.	2501.09294	null
2025-01-16	Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites	Abdalwhab Abdalwhab et.al.	2501.09267	null
2025-01-16	Exploring the Capabilities of Vision-Language Models to Detect Visual Bugs in HTML5	Finlay Macklon et.al.	2501.09236	null
2025-01-15	CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation	Qi Ma et.al.	2501.08982	null
2025-01-15	MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents	Kuicai Dong et.al.	2501.08828	null
2025-01-15	Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning	Julian Perry et.al.	2501.08597	null
2025-01-14	MiniMax-01: Scaling Foundation Models with Lightning Attention	MiniMax et.al.	2501.08313	null
2025-01-14	Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding	Liping Yuan et.al.	2501.07888	link
2025-01-14	Visual Language Models as Operator Agents in the Space Domain	Alejandro Carrasco et.al.	2501.07802	null
2025-01-14	BMIP: Bi-directional Modality Interaction Prompt Learning for VLM	Song-Lin Lv et.al.	2501.07769	null
2025-01-13	SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing	Varun Biyyala et.al.	2501.07554	link
2025-01-13	RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment	Difei Gu et.al.	2501.07525	link
2025-01-13	Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models	Yasiru Ranasinghe et.al.	2501.07396	null
2025-01-14	GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction	Oleg Kobzarev et.al.	2501.07295	null
2025-01-13	Can Vision-Language Models Evaluate Handwritten Math?	Oikantik Nath et.al.	2501.07244	link
2025-01-13	TimeLogic: A Temporal Logic Benchmark for Video QA	Sirnam Swetha et.al.	2501.07214	null
2025-01-13	BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature	Alejandro Lozano et.al.	2501.07171	link
2025-01-13	Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning	Zhong Peng et.al.	2501.07114	null
2025-01-12	MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis	Sadia Kamal et.al.	2501.06887	null
2025-01-12	Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving	Haoxiang Gao et.al.	2501.06680	null
2025-01-10	VideoAuteur: Towards Long Narrative Video Generation	Junfei Xiao et.al.	2501.06173	null
2025-01-10	CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion Planning for Future Autonomous Mobility on Demand Systems	Haichao Liu et.al.	2501.06132	link
2025-01-10	Generate, Transduct, Adapt: Iterative Transduction with VLMs	Oindrila Saha et.al.	2501.06031	null
2025-01-10	Scalable Vision Language Model Training via High Quality Data Curation	Hongyuan Dong et.al.	2501.05952	null
2025-01-10	Valley2: Exploring Multimodal Models with Scalable Vision-Language Design	Ziheng Wu et.al.	2501.05901	link
2025-01-10	Super-class guided Transformer for Zero-Shot Attribute Classification	Sehyung Kim et.al.	2501.05728	link
2025-01-10	From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities	Dominick Reilly et.al.	2501.05711	link
2025-01-09	Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding	Mohammed Elhenawy et.al.	2501.05566	null
2025-01-09	Infecting Generative AI With Viruses	David Noever et.al.	2501.05542	null
2025-01-09	Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation	Darius Petermann et.al.	2501.05413	null
2025-01-09	Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection	Pei-Kang Lee et.al.	2501.05228	null
2025-01-09	Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model	Gregor Geigle et.al.	2501.05122	null
2025-01-09	DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving	Xuran Zheng et.al.	2501.05081	null
2025-01-09	Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning	Huabin Liu et.al.	2501.05069	null
2025-01-09	ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark	Ronghao Dang et.al.	2501.05031	link
2025-01-09	UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation	Oleg Sautenkov et.al.	2501.05014	link
2025-01-09	Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments	Yifan Xu et.al.	2501.04947	null
2025-01-08	Re-ranking the Context for Multimodal Retrieval Augmented Generation	Matin Mortaheb et.al.	2501.04695	null
2025-01-08	Enhancing Financial VQA in Vision Language Models using Intermediate Structured Representations	Archita Srivastava et.al.	2501.04675	null
2025-01-08	DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests	Charles Corbière et.al.	2501.04671	null
2025-01-08	A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI	Kazusato Oko et.al.	2501.04641	link
2025-01-08	Supervision-free Vision-Language Alignment	Giorgio Giannone et.al.	2501.04568	null
2025-01-08	Online Gaussian Test-Time Adaptation of Vision-Language Models	Clément Fuchs et.al.	2501.04352	link
2025-01-08	Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs	Zeyi Huang et.al.	2501.04336	null
2025-01-08	Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts	Miao Rang et.al.	2501.04322	link
2025-01-08	Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation	Senwei Xie et.al.	2501.04268	null
2025-01-07	MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation	Siddharth Joshi et.al.	2501.04155	link
2025-01-07	Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives	Shaoyuan Xie et.al.	2501.04003	link
2025-01-07	Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos	Haobo Yuan et.al.	2501.04001	link
2025-01-07	RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance	Matin Mortaheb et.al.	2501.03995	null
2025-01-07	VLM-driven Behavior Tree for Context-aware Task Planning	Naoki Wake et.al.	2501.03968	link
2025-01-07	Vision Language Models as Values Detectors	Giulio Antonio Abbo et.al.	2501.03957	null
2025-01-07	OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints	Mingjie Pan et.al.	2501.03841	null
2025-01-07	KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration	Chengyuan Li et.al.	2501.03786	null
2025-01-07	Realistic Test-Time Adaptation of Vision-Language Models	Maxime Zanella et.al.	2501.03729	link
2025-01-07	Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein	Xiaotong Guo et.al.	2501.03722	null
2025-01-07	SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning	Andrew Li et.al.	2501.03675	link
2025-01-06	Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation	Yuhui Zhang et.al.	2501.03225	link
2025-01-06	Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches	Alhassan Mumuni et.al.	2501.03151	null
2025-01-06	Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment	Pegah Khayatan et.al.	2501.03012	link
2025-01-06	MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models	Wenyi Hong et.al.	2501.02955	null
2025-01-06	Label-free Concept Based Multiple Instance Learning for Gigapixel Histopathology	Susu Sun et.al.	2501.02922	null
2025-01-06	Large Language Models for Video Surveillance Applications	Ulindu De Silva et.al.	2501.02850	null
2025-01-06	Visual Large Language Models for Generalized and Specialized Applications	Yifan Li et.al.	2501.02765	link
2025-01-05	Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs?	Simon Park et.al.	2501.02669	link
2025-01-05	Efficient Architectures for High Resolution Vision-Language Models	Miguel Carvalho et.al.	2501.02584	link
2025-01-05	FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models	Hui Lin et.al.	2501.02461	null
2025-01-03	Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding	Jiaming Li et.al.	2501.01926	link
2025-01-03	MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning	Pu Yang et.al.	2501.01834	null
2025-01-03	LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction	Er Jin et.al.	2501.01767	null
2025-01-03	MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders	Jiajun Cao et.al.	2501.01709	null
2025-01-03	GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models	Zhangyang Qi et.al.	2501.01428	link
2025-01-02	Training Medical Large Vision-Language Models with Abnormal-Aware Feedback	Yucheng Zhou et.al.	2501.01377	null
2025-01-02	CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering	Ben Vardi et.al.	2501.01371	null
2025-01-02	Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability	Dong Shu et.al.	2501.01346	null
2025-01-02	CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries	Shudong Liu et.al.	2501.01282	null
2025-01-03	2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining	Wenqi Zhang et.al.	2501.00958	link
2025-01-01	Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models	Emily Johnson et.al.	2501.00917	null
2025-01-01	FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation	Bingyu Li et.al.	2501.00877	link
2025-01-01	IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models	Yiming Zhang et.al.	2501.00848	null
2024-12-31	ICONS: Influence Consensus for Vision-Language Data Selection	Xindi Wu et.al.	2501.00654	null
2024-12-30	Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model	Yifei Huang et.al.	2412.21080	link
2024-12-30	Plancraft: an evaluation dataset for planning with LLM agents	Gautier Dagan et.al.	2412.21033	link
2024-12-30	UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI	Fangwei Zhong et.al.	2412.20977	null
2024-12-30	Low-Light Image Enhancement via Generative Perceptual Priors	Han Zhou et.al.	2412.20916	link
2024-12-30	WalkVLM: Aid Visually Impaired People Walking by Vision Language Model	Zhiqiang Yuan et.al.	2412.20903	null
2024-12-30	Towards Compatible Fine-tuning for Vision-Language Model Updates	Zhengbo Wang et.al.	2412.20895	null
2024-12-30	ReStory: VLM-augmentation of Social Human-Robot Interaction Datasets	Fanjun Bu et.al.	2412.20826	null
2024-12-30	Are Vision-Language Models Truly Understanding Multi-vision Sensor?	Sangyun Chung et.al.	2412.20750	link
2024-12-30	UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models	Yujie Li et.al.	2412.20742	link
2024-12-30	M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs	Bei Yan et.al.	2412.20718	link
2024-12-27	MVTamperBench: Evaluating Robustness of Vision-Language Models	Amit Agarwal et.al.	2412.19794	null
2024-12-27	OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis	Qiushi Sun et.al.	2412.19723	null
2024-12-27	Is Your Text-to-Image Model Robust to Caption Noise?	Weichen Yu et.al.	2412.19531	null
2024-12-27	MBQ: Modality-Balanced Quantization for Large Vision-Language Models	Shiyao Li et.al.	2412.19509	link
2024-12-27	Multi-P$^2$A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models	Jie Zhang et.al.	2412.19496	link
2024-12-27	Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation	Chengyang Ye et.al.	2412.19492	link
2024-12-26	CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models	Kiet A. Nguyen et.al.	2412.19331	null
2024-12-26	Sketch-MoMa: Teleoperation for Mobile Manipulator via Interpretation of Hand-Drawn Sketches	Kosei Tanada et.al.	2412.19153	null
2024-12-26	MoPD: Mixture-of-Prompts Distillation for Vision-Language Models	Yang Chen et.al.	2412.19087	null
2024-12-26	Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation	Tao Liu et.al.	2412.19021	null
2024-12-24	Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models	Tahira Kazimi et.al.	2412.18604	null
2024-12-24	The Key of Understanding Vision Tasks: Explanatory Instructions	Yang Shen et.al.	2412.18525	link
2024-12-24	LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating	Chao Deng et.al.	2412.18424	link
2024-12-24	Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model	Tenghui Li et.al.	2412.18387	link
2024-12-24	Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model	Yushu Li et.al.	2412.18303	null
2024-12-24	Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight	Xi Ding et.al.	2412.18298	link
2024-12-24	Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration	Zhixuan Shen et.al.	2412.18292	link
2024-12-24	Sampling Bag of Views for Open-Vocabulary Object Detection	Hojun Choi et.al.	2412.18273	null
2024-12-24	VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks	Shiduo Zhang et.al.	2412.18194	null
2024-12-24	EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation	Shuhao Han et.al.	2412.18150	link
2024-12-23	ChatGarment: Garment Estimation, Generation and Editing via Large Language Models	Siyuan Bian et.al.	2412.17811	null
2024-12-23	Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection	Yitong Chen et.al.	2412.17800	link
2024-12-23	Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective	Xinmiao Yu et.al.	2412.17787	null
2024-12-23	Reasoning to Attend: Try to Understand How Token Works	Rui Qian et.al.	2412.17741	link
2024-12-23	Kernel-Aware Graph Prompt Learning for Few-Shot Anomaly Detection	Fenfang Tao et.al.	2412.17619	link
2024-12-23	Personalized Large Vision-Language Models	Chau Pham et.al.	2412.17610	null
2024-12-23	Retention Score: Quantifying Jailbreak Risks for Vision Language Models	Zaitang Li et.al.	2412.17544	null
2024-12-23	Feature Based Methods Domain Adaptation for Object Detection: A Review Paper	Helia Mohamadi et.al.	2412.17325	null
2024-12-23	On the Feasibility of Vision-Language Models for Time-Series Classification	Vinay Prithyani et.al.	2412.17304	link
2024-12-23	GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning	Teja Krishna Cherukuri et.al.	2412.17251	null
2024-12-20	HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding	Chenxin Tao et.al.	2412.16158	null
2024-12-20	Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training	Mingliang Liang et.al.	2412.16148	link
2024-12-20	Demystifying the Potential of ChatGPT-4 Vision for Construction Progress Monitoring	Ahmet Bahaddin Ersoz et.al.	2412.16108	null
2024-12-20	VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models	Dexter Neo et.al.	2412.15739	null
2024-12-20	Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage	Zhi Gao et.al.	2412.15606	null
2024-12-20	VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving	Zilin Huang et.al.	2412.15544	null
2024-12-20	PolySmart @ TRECVid 2024 Video-To-Text	Jiaxin Wu et.al.	2412.15509	null
2024-12-19	TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models	Ammar N. Abbas et.al.	2412.15462	null
2024-12-19	PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation	Muntasir Wahed et.al.	2412.15209	null
2024-12-19	AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving	Shuo Xing et.al.	2412.15206	link
2024-12-19	EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues	Sagar Soni et.al.	2412.15190	null
2024-12-19	LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation	Weijia Shi et.al.	2412.15188	null
2024-12-19	ObjVariantEnsemble: Advancing Point Cloud LLM Evaluation in Challenging Scenes with Subtly Distinguished Objects	Qihang Cao et.al.	2412.14837	null
2024-12-19	A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space	Yonghao He et.al.	2412.14680	link
2024-12-19	FiVL: A Framework for Improved Vision-Language Alignment	Estelle Aflalo et.al.	2412.14672	null
2024-12-19	HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model	Masanari Ohi et.al.	2412.14613	null
2024-12-19	Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation	Jihao Gu et.al.	2412.14487	null
2024-12-19	GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering	Saumya Saxena et.al.	2412.14480	null
2024-12-18	Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation	Jianyu Zhang et.al.	2412.14145	null
2024-12-18	Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models	Ido Cohen et.al.	2412.14133	link
2024-12-18	Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models	Xinghang Li et.al.	2412.14058	null
2024-12-18	Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence	Jinghan He et.al.	2412.13949	null
2024-12-18	Real Classification by Description: Extending CLIP’s Limits of Part Attributes Recognition	Ethan Baron et.al.	2412.13947	null
2024-12-18	Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection	Le Yang et.al.	2412.13817	link
2024-12-18	Towards Automatic Evaluation for Image Transcreation	Simran Khanuja et.al.	2412.13717	null
2024-12-18	RelationField: Relate Anything in Radiance Fields	Sebastian Koch et.al.	2412.13652	link
2024-12-18	Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation	Changsun Lee et.al.	2412.13558	null
2024-12-18	Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning	Yingjie Zhu et.al.	2412.13540	link
2024-12-17	Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents	Yifei Zhou et.al.	2412.13194	null
2024-12-17	HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction	Chen Bao et.al.	2412.13187	null
2024-12-17	Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration	Mark Endo et.al.	2412.13180	null
2024-12-17	Improving Fine-grained Visual Understanding in VLMs through Text-Only Training	Dasol Choi et.al.	2412.12940	null
2024-12-17	CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models	Zihui Cheng et.al.	2412.12932	link
2024-12-17	An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions	Shreeyash Gowaikar et.al.	2412.12898	null
2024-12-17	ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation	Shiqi Huang et.al.	2412.12798	link
2024-12-17	CRoF: CLIP-based Robust Few-shot Learning on Noisy Labels	Shizhuo Deng et.al.	2412.12793	null
2024-12-17	Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference	Siyuan Wang et.al.	2412.12785	null
2024-12-17	GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models	Mukai Li et.al.	2412.12735	link
2024-12-16	Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning	Qi Sun et.al.	2412.11974	link
2024-12-16	Explainable Procedural Mistake Detection	Shane Storks et.al.	2412.11927	null
2024-12-16	Does VLM Classification Benefit from LLM Description Semantics?	Pingchuan Ma et.al.	2412.11917	link
2024-12-16	From 2D CAD Drawings to 3D Parametric Models: A Vision-Language Approach	Xilin Wang et.al.	2412.11892	null
2024-12-16	LMM-Regularized CLIP Embeddings for Image Classification	Maria Tzelepi et.al.	2412.11663	null
2024-12-16	Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves	Shihan Wu et.al.	2412.11509	link
2024-12-16	Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents	Wonje Choi et.al.	2412.11484	null
2024-12-16	OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference	Wei Chen et.al.	2412.11475	null
2024-12-16	MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation	Quan-Sheng Zeng et.al.	2412.11464	link
2024-12-16	Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes	Antonio Carlos Rivera et.al.	2412.11396	null
2024-12-13	UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities	Muhammad Uzair Khattak et.al.	2412.10372	link
2024-12-13	A dual contrastive framework	Yuan Sun et.al.	2412.10348	null
2024-12-13	DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding	Zhiyu Wu et.al.	2412.10302	link
2024-12-13	A model-based approach for transforming InSAR-derived vertical land motion from a local to a global reference frame	Mahmoud Reshadati et.al.	2412.10282	null
2024-12-13	VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation	Hyeonseok Lim et.al.	2412.10151	null
2024-12-13	Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples	Yeyuan Wang et.al.	2412.10029	null
2024-12-13	WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model	Songyan Zhang et.al.	2412.09951	link
2024-12-13	CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models	Dongyu Yao et.al.	2412.09936	link
2024-12-13	Selective State Space Memory for Large Vision-Language Models	Chee Ng et.al.	2412.09875	null
2024-12-12	BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation	Pablo Morales-Álvarez et.al.	2412.09718	null
2024-12-13	V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding	Junqi Ge et.al.	2412.09616	link
2024-12-12	PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models	Chenyu Yang et.al.	2412.09613	null
2024-12-12	AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials	Yiheng Xu et.al.	2412.09605	null
2024-12-12	OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation	Jitesh Jain et.al.	2412.09585	link
2024-12-12	Neptune: The Long Orbit to Benchmarking Long Video Understanding	Arsha Nagrani et.al.	2412.09582	link
2024-12-12	Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM	Han Wang et.al.	2412.09530	link
2024-12-12	Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Clinical Pathology Analysis	Shengxuming Zhang et.al.	2412.09521	null
2024-12-12	ATPrompt: Textual Prompt Learning with Embedded Attributes	Zheng Li et.al.	2412.09442	null
2024-12-12	Causal Graphical Models for Vision-Language Compositional Understanding	Fiorenzo Parascandolo et.al.	2412.09353	link
2024-12-12	Learning Novel Skills from Language-Generated Demonstrations	Ao-Qun Jin et.al.	2412.09286	null
2024-12-11	Synthetic Vision: Training Vision-Language Models to Understand Physics	Vahid Balazadeh et.al.	2412.08619	null
2024-12-12	Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning	Fan Lu et.al.	2412.08614	link
2024-12-11	SenCLIP: Enhancing zero-shot land-use mapping for Sentinel-2 with ground-level prompting	Pallavi Jain et.al.	2412.08536	link
2024-12-11	POINTS1.5: Building a Vision-Language Model towards Real World Applications	Yuan Liu et.al.	2412.08443	null
2024-12-11	LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba	Yubo Cui et.al.	2412.08388	null
2024-12-11	HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models	Shiding Zhu et.al.	2412.08378	null
2024-12-11	Position-aware Guided Point Cloud Completion with CLIP Model	Feng Zhou et.al.	2412.08271	null
2024-12-11	TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning	Jingjing Xie et.al.	2412.08176	link
2024-12-11	Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models	Quang-Hung Le et.al.	2412.08125	link
2024-12-11	Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models	Sri Harsha Dumpala et.al.	2412.08111	null
2024-12-10	RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models	Greg Heinrich et.al.	2412.07679	link
2024-12-10	DRUM: Learning Demonstration Retriever for Large MUlti-modal Models	Ellen Yi-Ge et.al.	2412.07619	null
2024-12-10	Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios	Jiaqi Fan et.al.	2412.07518	link
2024-12-10	SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World	Jiaqi Zhang et.al.	2412.07472	link
2024-12-10	MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models	Sayak Chakrabarty et.al.	2412.07148	link
2024-12-10	Maya: An Instruction Finetuned Multilingual Multimodal Model	Nahid Alam et.al.	2412.07112	link
2024-12-10	Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling	Donggeun Kim et.al.	2412.07077	null
2024-12-09	Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models	Yi-Lun Lee et.al.	2412.06775	link
2024-12-09	Visual Lexicon: Rich Image Features in Language Space	XuDong Wang et.al.	2412.06774	null
2024-12-09	Ranking-aware adapter for text-driven image ordering with CLIP	Wei-Hsiang Yu et.al.	2412.06760	link
2024-12-09	ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities	Adhiraj Ghosh et.al.	2412.06745	null
2024-12-09	The Narrow Gate: Localized Image-Text Communication in Vision-Language Models	Alessandro Serra et.al.	2412.06646	null
2024-12-09	From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding	Yixiong Fang et.al.	2412.06474	link
2024-12-09	Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models	Wei Suo et.al.	2412.06458	null
2024-12-09	No Annotations for Object Detection in Art through Stable Diffusion	Patrick Ramos et.al.	2412.06286	link
2024-12-09	iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models	Lianyu Hu et.al.	2412.06263	link
2024-12-09	DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction	Yunheng Li et.al.	2412.06244	null
2024-12-06	Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies	Recep Firat Cekinel et.al.	2412.05155	link
2024-12-06	Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora	Michael Y. Hu et.al.	2412.05149	null
2024-12-06	$S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization of Vision-Language Models	Xiaojie Yin et.al.	2412.04925	null
2024-12-06	Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model	Keunwoo Peter Yu et.al.	2412.04729	null
2024-12-05	Cross-Self KV Cache Pruning for Efficient Vision-Language Inference	Xiaohuan Pei et.al.	2412.04652	link
2024-12-05	NVILA: Efficient Frontier Visual Language Models	Zhijian Liu et.al.	2412.04468	null
2024-12-05	VisionZip: Longer is Better but Not Necessary in Vision Language Models	Senqiao Yang et.al.	2412.04467	link
2024-12-05	Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection	Enshen Zhou et.al.	2412.04455	null
2024-12-05	Grounding Descriptions in Images informs Zero-Shot Visual Recognition	Shaunak Halbe et.al.	2412.04429	link
2024-12-05	Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion	Jiuhai Chen et.al.	2412.04424	link
2024-12-05	SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding	Rong Li et.al.	2412.04383	null
2024-12-05	Discriminative Fine-tuning of LVLMs	Yassine Ouali et.al.	2412.04378	null
2024-12-05	3D Part Segmentation via Geometric Aggregation of 2D Visual Features	Marco Garosi et.al.	2412.04247	link
2024-12-06	VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction	Jiahao Zhang et.al.	2412.04237	null
2024-12-05	Unified Framework for Open-World Compositional Zero-shot Learning	Hirunima Jayasekara et.al.	2412.04083	link
2024-12-04	FLAIR: VLM with Fine-grained Language-informed Image Representations	Rui Xiao et.al.	2412.03561	link
2024-12-04	Best-of-N Jailbreaking	John Hughes et.al.	2412.03556	link
2024-12-04	PaliGemma 2: A Family of Versatile VLMs for Transfer	Andreas Steiner et.al.	2412.03555	null
2024-12-04	PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation	Ao Wang et.al.	2412.03409	link
2024-12-04	A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs	Wangbo Zhao et.al.	2412.03324	link
2024-12-04	Composed Image Retrieval for Training-Free Domain Conversion	Nikos Efthymiadis et.al.	2412.03297	link
2024-12-04	Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation	Gianni Franchi et.al.	2412.03178	null
2024-12-04	AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?	Shouwei Ruan et.al.	2412.03002	null
2024-12-04	Progressive Vision-Language Prompt for Multi-Organ Multi-Class Cell Semantic Segmentation with Single Branch	Qing Zhang et.al.	2412.02978	null
2024-12-04	Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis	Po-Hsuan Huang et.al.	2412.02946	null
2024-12-03	Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback	Hiroki Furuta et.al.	2412.02617	null
2024-12-03	CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs	Abhas Kumar et.al.	2412.02602	null
2024-12-03	OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation	Junyuan Zhang et.al.	2412.02592	link
2024-12-03	Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey	Chenyang Liu et.al.	2412.02573	link
2024-12-03	SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection	Joongwon Chae et.al.	2412.02565	link
2024-12-03	Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks	Jinjin Cai et.al.	2412.02531	null
2024-12-03	OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations	Caixin Kang et.al.	2412.02479	null
2024-12-03	BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding	Chenguang Huang et.al.	2412.02449	null
2024-12-03	Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation	Sepand Dyanatkar et.al.	2412.02262	null
2024-12-03	LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models	Fan-Yun Sun et.al.	2412.02193	null
2024-11-29	SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks	Kim-Celine Kahl et.al.	2411.19688	link
2024-11-29	CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation	Qixiu Li et.al.	2411.19650	null
2024-11-29	Interleaved-Modal Chain-of-Thought	Jun Gao et.al.	2411.19488	null
2024-11-29	Effective Fine-Tuning of Vision-Language Models for Accurate Galaxy Morphology Analysis	Ruoqi Wang et.al.	2411.19475	null
2024-11-28	Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation	Luca Barsellotti et.al.	2411.19331	link
2024-11-28	GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks	Muhammad Sohail Danish et.al.	2411.19325	link
2024-11-28	GRAPE: Generalizing Robot Policy via Preference Alignment	Zijian Zhang et.al.	2411.19309	null
2024-11-28	Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs	Anirudh Phukan et.al.	2411.19187	null
2024-11-28	VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models	Jeongho Ju et.al.	2411.19103	null
2024-11-27	ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics	Letian Chen et.al.	2411.18825	null
2024-11-27	Visual Adversarial Attack on Vision-Language Models for Autonomous Driving	Tianyuan Zhang et.al.	2411.18275	null
2024-11-27	SCoTT: Wireless-Aware Path Planning with Vision Language Models and Strategic Chains-of-Thought	Aladin Djuhera et.al.	2411.18212	null
2024-11-27	From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects	Zizhao Li et.al.	2411.18207	link
2024-11-27	Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning	Di Zhang et.al.	2411.18203	null
2024-11-27	DistinctAD: Distinctive Audio Description Generation in Contexts	Bo Fang et.al.	2411.18180	null
2024-11-27	COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models	Xiao An et.al.	2411.18145	link
2024-11-27	When Large Vision-Language Models Meet Person Re-Identification	Qizao Wang et.al.	2411.18111	null
2024-11-27	Aligning Knowledge Concepts to Whole Slide Images for Precise Histopathology Image Analysis	Weiqin Zhao et.al.	2411.18101	link
2024-11-27	VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis	Donggoo Kang et.al.	2411.18038	null
2024-11-28	Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models	Shuyang Hao et.al.	2411.18000	null
2024-11-26	What’s in the Image? A Deep-Dive into the Vision of Vision Language Models	Omri Kaduri et.al.	2411.17491	null
2024-11-26	VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models	Lei Li et.al.	2411.17451	null
2024-11-26	CoA: Chain-of-Action for Generative Semantic Labels	Meng Wei et.al.	2411.17406	link
2024-11-26	Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment	Dongping Chen et.al.	2411.17188	null
2024-11-26	Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation	Chanyoung Kim et.al.	2411.17150	null
2024-11-26	Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models	Jaemin Kim et.al.	2411.17041	null
2024-11-26	Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation	Shambhavi Mishra et.al.	2411.17002	link
2024-11-25	Probing the limitations of multimodal language models for chemistry and materials research	Nawaf Alampara et.al.	2411.16955	link
2024-11-25	Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge	Yaqi Zhao et.al.	2411.16824	null
2024-11-25	Generating Out-Of-Distribution Scenarios Using Language Models	Erfan Aasi et.al.	2411.16554	null
2024-11-25	RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics	Chan Hee Song et.al.	2411.16537	null
2024-11-25	Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis	Boming Miao et.al.	2411.16503	null
2024-11-25	A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models	Manuel Schwonberg et.al.	2411.16407	null
2024-11-25	CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain	Jingchao Peng et.al.	2411.16327	null
2024-11-25	Open-Vocabulary Octree-Graph for 3D Scene Understanding	Zhigang Wang et.al.	2411.16253	null
2024-11-25	From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events	Yan Miao et.al.	2411.16027	null
2024-11-25	Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models	Niloufar Alipour Talemi et.al.	2411.16018	null
2024-11-24	Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation	Sule Bai et.al.	2411.15869	link
2024-11-24	ResCLIP: Residual Attention for Training-free Dense Vision-language Inference	Yuhang Yang et.al.	2411.15851	link
2024-11-22	Context-Aware Multimodal Pretraining	Karsten Roth et.al.	2411.15099	null
2024-11-22	Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning	Junjie Shan et.al.	2411.14937	link
2024-11-22	ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos	Tanveer Hannan et.al.	2411.14901	link
2024-11-22	VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models	Camilo Chacón Sartori et.al.	2411.14832	null
2024-11-22	Continual SFT Matches Multimodal RLHF with Negative Supervision	Ke Zhu et.al.	2411.14797	null
2024-11-22	VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection	Songhao Han et.al.	2411.14794	link
2024-11-22	Effective SAM Combination for Open-Vocabulary Semantic Segmentation	Minhyeok Lee et.al.	2411.14723	null
2024-11-21	GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI	Tianbin Li et.al.	2411.14522	link
2024-11-21	Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance	Haozhe Zhao et.al.	2411.14279	null
2024-11-21	Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body	Zeqing Wang et.al.	2411.14205	null
2024-11-21	FoPru: Focal Pruning for Efficient Large Vision-Language Models	Lei Jiang et.al.	2411.14164	null
2024-11-21	Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset	Heejeong Nam et.al.	2411.14137	link
2024-11-20	BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games	Davide Paglieri et.al.	2411.13543	null
2024-11-20	Teaching VLMs to Localize Specific Objects from In-context Examples	Sivan Doveh et.al.	2411.13317	link
2024-11-20	XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation	Ziyi Wang et.al.	2411.13243	link
2024-11-21	ViSTa Dataset: Do vision-language models understand sequential tasks?	Evžen Wybitul et.al.	2411.13211	link
2024-11-20	TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models	Xin Wang et.al.	2411.13136	null
2024-11-20	LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement	Siwen Jiao et.al.	2411.12980	null
2024-11-19	VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge	Vishwesh Nath et.al.	2411.12915	null
2024-11-19	CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs	Zhehan Kan et.al.	2411.12713	null
2024-11-19	Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes	Rahul Garg et.al.	2411.12174	null
2024-11-18	Vision Language Models Are Few-Shot Audio Spectrogram Classifiers	Satvik Dixit et.al.	2411.12058	null
2024-11-18	ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements	M. Arda Aydın et.al.	2411.12044	link
2024-11-18	MC-LLaVA: Multi-Concept Personalized Vision-Language Model	Ruichuan An et.al.	2411.11706	link
2024-11-18	TrojanRobot: Backdoor Attacks Against Robotic Manipulation in the Physical World	Xianlong Wang et.al.	2411.11683	null
2024-11-18	VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation	Bangguo Yu et.al.	2411.11609	null
2024-11-18	Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment	Zhendong Liu et.al.	2411.11543	null
2024-11-19	Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models	Chenhang Cui et.al.	2411.11496	link
2024-11-18	Exploring Emerging Trends and Research Opportunities in Visual Place Recognition	Antonios Gasteratos et.al.	2411.11481	null
2024-11-18	Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts	Jingxuan Li et.al.	2411.11479	null
2024-11-18	GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts	Junwen He et.al.	2411.11435	null
2024-11-18	Efficient Transfer Learning for Video-language Foundation Models	Haoxing Chen et.al.	2411.11223	link
2024-11-17	Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection	Wentao Bao et.al.	2411.10922	link
2024-11-15	VeriGraph: Scene Graphs for Execution Verifiable Robot Planning	Daniel Ekpo et.al.	2411.10446	null
2024-11-15	LLaVA-o1: Let Vision Language Models Reason Step-by-Step	Guowei Xu et.al.	2411.10440	link
2024-11-15	SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning	Zewen Chen et.al.	2411.10161	link
2024-11-15	Federated Domain Generalization via Prompt Learning and Aggregation	Shuai Gong et.al.	2411.10063	link
2024-11-15	Free Lunch in Pathology Foundation Model: Task-specific Model Adaptation with Concept-Guided Feature Enhancement	Yanyan Huang et.al.	2411.09894	link
2024-11-14	LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution	Chenyang Wang et.al.	2411.09293	null
2024-11-13	ClevrSkills: Compositional Language and Visual Reasoning in Robotics	Sanjay Haresh et.al.	2411.09052	link
2024-11-13	DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models	Yongdong Wang et.al.	2411.09022	link
2024-11-13	Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions	Moran Yanuka et.al.	2411.09018	link
2024-11-13	The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models	Daniel P. Jeong et.al.	2411.08870	link
2024-11-13	Sharingan: Extract User Action Sequence from Desktop Recordings	Yanting Chen et.al.	2411.08768	null
2024-11-13	Voxeland: Probabilistic Instance-Aware Semantic Mapping with Evidence-based Uncertainty Quantification	Jose-Luis Matez-Bandera et.al.	2411.08727	link
2024-11-13	LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation	Pengwei Yin et.al.	2411.08606	null
2024-11-13	NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation	Youzhi Liu et.al.	2411.08579	null
2024-11-13	Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints	Nishanth Kumar et.al.	2411.08253	null
2024-11-12	JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation	Yiyang Ma et.al.	2411.07975	link
2024-11-12	Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer’s Disease	Francesco Chiumento et.al.	2411.07871	null
2024-11-12	BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions	Anas Awadalla et.al.	2411.07461	null
2024-11-11	SAMPart3D: Segment Any Part in 3D Objects	Yunhan Yang et.al.	2411.07184	link
2024-11-11	StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification	Yichen He et.al.	2411.07076	link
2024-11-11	UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models	Jiachen Liang et.al.	2411.06921	link
2024-11-11	Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning	Hongsheng Zhang et.al.	2411.06764	null
2024-11-11	Learning from Feedback: Semantic Enhancement for Object SLAM Using Foundation Models	Jungseok Hong et.al.	2411.06752	null
2024-11-11	Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs	Jia Syuen Lim et.al.	2411.06702	null
2024-11-11	Renaissance: Investigating the Pretraining of Vision-Language Encoders	Clayton Fields et.al.	2411.06657	link
2024-11-09	Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models	Arshia Hemmat et.al.	2411.06287	link
2024-11-09	Sampling-Based Model Predictive Control for Dexterous Manipulation on a Biomimetic Tendon-Driven Hand	Adrian Hess et.al.	2411.06183	null
2024-11-09	Aquila-plus: Prompt-Driven Visual-Language Models for Pixel-Level Remote Sensing Image Understanding	Kaixuan Lu et.al.	2411.06142	null
2024-11-08	End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering	Dylan Goetting et.al.	2411.05755	link
2024-11-08	Poze: Sports Technique Feedback under Data Constraints	Agamdeep Singh et.al.	2411.05734	null
2024-11-08	A Two-Step Concept-Based Approach for Enhanced Interpretability and Trust in Skin Lesion Diagnosis	Cristiano Patrício et.al.	2411.05609	link
2024-11-08	Enhancing Visual Classification using Comparative Descriptors	Hankyeol Lee et.al.	2411.05357	link
2024-11-08	Real-World Offline Reinforcement Learning from Vision Language Model Feedback	Sreyas Venkataraman et.al.	2411.05273	null
2024-11-07	RT-Grasp: Reasoning Tuning Robotic Grasping via Multi-modal Large Language Model	Jinxuan Xu et.al.	2411.05212	null
2024-11-07	On Erroneous Agreements of CLIP Image Embeddings	Siting Li et.al.	2411.05195	null
2024-11-07	Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning	Joey Hong et.al.	2411.05193	null
2024-11-07	Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model	Sheng Cheng et.al.	2411.05079	link
2024-11-07	DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation	Peiqi Liu et.al.	2411.04999	link
2024-11-07	A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model	Panwen Hu et.al.	2411.04942	null
2024-11-07	In the Era of Prompt Learning with Vision-Language Models	Ankit Jha et.al.	2411.04892	null
2024-11-07	TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models	Jonathan Fhima et.al.	2411.04642	null
2024-11-07	Vision Language Models are In-Context Value Learners	Yecheng Jason Ma et.al.	2411.04549	null
2024-11-07	BendVLM: Test-Time Debiasing of Vision-Language Embeddings	Walter Gerych et.al.	2411.04420	link
2024-11-06	Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models	Saketh Bachu et.al.	2411.04291	null
2024-11-06	Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress?	Daniel P. Jeong et.al.	2411.04118	link
2024-11-06	RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models	Maya Varma et.al.	2411.04097	link
2024-11-06	H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models	Nhi Pham et.al.	2411.04077	null
2024-11-06	Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval	Davide Buoso et.al.	2411.04006	null
2024-11-06	Multi3Hate: Multimodal, Multilingual, and Multicultural Hate Speech Detection with Vision-Language Models	Minh Duc Bui et.al.	2411.03888	link
2024-11-06	DesignMinds: Enhancing Video-Based Design Ideation with Vision-Language Model and Context-Injected Large Language Model	Tianhao He et.al.	2411.03827	null
2024-11-06	Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction	Muhammad Tayyab Khan et.al.	2411.03707	null
2024-11-05	Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset	Yingzi Ma et.al.	2411.03554	link
2024-11-05	VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation	Haochen Zhang et.al.	2411.03540	link
2024-11-05	Personalized Video Summarization by Multimodal Video Understanding	Brian Chen et.al.	2411.03531	null
2024-11-05	Inference Optimal VLMs Need Only One Visual Token but Larger Models	Kevin Y. Li et.al.	2411.03312	link
2024-11-05	HumanVLM: Foundation for Human-Scene Vision-Language Model	Dawei Dai et.al.	2411.03034	null
2024-11-05	Precise Drive with VLM: First Prize Solution for PRCV 2024 Drive LM challenge	Bin Huang et.al.	2411.02999	null
2024-11-05	Multi-Modal 3D Scene Graph Updater for Shared and Dynamic Environments	Emilio Olivastri et.al.	2411.02938	null
2024-11-05	Membership Inference Attacks against Large Vision-Language Models	Zhan Li et.al.	2411.02902	link
2024-11-05	Leveraging Vision-Language Models for Manufacturing Feature Recognition in CAD Designs	Muhammad Tayyab Khan et.al.	2411.02810	null
2024-11-05	Label Critic: Design Data Before Models	Pedro R. A. S. Bassi et.al.	2411.02753	link
2024-11-05	DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark	Haodong Li et.al.	2411.02733	link
2024-11-05	V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization	Yuxi Xie et.al.	2411.02712	link
2024-11-04	Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack	Xiaojun Jia et.al.	2411.02669	link
2024-11-04	Attacking Vision-Language Computer Agents via Pop-ups	Yanzhe Zhang et.al.	2411.02391	link
2024-11-04	One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering	Deepayan Das et.al.	2411.02210	null
2024-11-04	GraphVL: Graph-Enhanced Semantic Modeling via Vision-Language Models for Generalized Class Discovery	Bhupendra Solanki et.al.	2411.02074	null
2024-11-03	RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering	Hui Lin et.al.	2411.01595	null
2024-11-03	Addressing Failures in Robotics using Vision-Based Language Models (VLMs) and Behavior Trees (BT)	Faseeh Ahmad et.al.	2411.01568	null
2024-11-03	Integration of Large Vision Language Models for Efficient Post-disaster Damage Assessment and Reporting	Zhaohui Chen et.al.	2411.01511	null
2024-11-03	A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning	Fei Wang et.al.	2411.01445	null
2024-11-01	Identifying Implicit Social Biases in Vision-Language Models	Kimia Hamidieh et.al.	2411.00997	null
2024-11-01	Retrieval-enriched zero-shot image classification in low-resource domains	Nicola Dall’Asen et.al.	2411.00988	null
2024-11-01	Does GenAI Make Usability Testing Obsolete?	Ali Ebrahimi Pourasad et.al.	2411.00634	null
2024-10-31	$π_0$: A Vision-Language-Action Flow Model for General Robot Control	Kevin Black et.al.	2410.24164	null
2024-10-31	Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age	Nouar AlDahoul et.al.	2410.24148	null
2024-10-31	Bayesian-guided Label Mapping for Visual Reprogramming	Chengyi Cai et.al.	2410.24018	link
2024-10-31	EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection	Qinqian Lei et.al.	2410.23904	link
2024-10-31	Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP	Chen Huang et.al.	2410.23698	null
2024-10-31	Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey	Chiyu Zhang et.al.	2410.23687	null
2024-10-31	SuctionPrompt: Visual-assisted Robotic Picking with a Suction Cup Using Vision-Language Models and Facile Hardware Design	Tomohiro Motoda et.al.	2410.23640	null
2024-10-30	Keypoint Abstraction using Large Models for Object-Relative Imitation Learning	Xiaolin Fang et.al.	2410.23254	null
2024-10-31	Aligning Audio-Visual Joint Representations with an Agentic Workflow	Shentong Mo et.al.	2410.23230	null
2024-10-30	OS-ATLAS: A Foundation Action Model for Generalist GUI Agents	Zhiyong Wu et.al.	2410.23218	link
2024-10-30	VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning	Yichao Liang et.al.	2410.23156	null
2024-10-30	Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models	Junjie Wu et.al.	2410.23114	link
2024-10-30	An Individual Identity-Driven Framework for Animal Re-Identification	Yihao Wu et.al.	2410.22927	link
2024-10-30	Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector	Youcheng Huang et.al.	2410.22888	link
2024-10-30	Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model	Keito Sasagawa et.al.	2410.22736	null
2024-10-30	Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization	Kento Kawaharazuka et.al.	2410.22707	null
2024-10-30	SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset	Ngoc Dung Huynh et.al.	2410.22648	null
2024-10-29	Task Vectors are Cross-Modal	Grace Luo et.al.	2410.22330	null
2024-10-29	Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier	Kai Wang et.al.	2410.22317	link
2024-10-29	Natural Language Inference Improves Compositionality in Vision-Language Models	Paola Cascante-Bonilla et.al.	2410.22315	null
2024-10-29	Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving	Bo Jiang et.al.	2410.22313	link
2024-10-29	ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising	Ashutosh Chaubey et.al.	2410.22233	link
2024-10-29	Active Learning for Vision-Language Models	Bardia Safaei et.al.	2410.22187	null
2024-10-29	Are VLMs Really Blind	Ayush Singh et.al.	2410.22029	link
2024-10-29	Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation	Halil Utku Unlu et.al.	2410.21926	null
2024-10-30	Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models	Lu Yu et.al.	2410.21802	link
2024-10-29	PerSRV: Personalized Sticker Retrieval with Vision-Language Model	Heng Er Metilda Chee et.al.	2410.21801	link
2024-10-29	AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?	Han Bao et.al.	2410.21259	link
2024-10-28	Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce	Zhantao Yang et.al.	2410.21237	null
2024-10-28	Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines	Zhixin Zhang et.al.	2410.21220	link
2024-10-29	Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction	Qintong Zhang et.al.	2410.21169	null
2024-10-28	Zero-Shot Action Recognition in Surveillance Videos	Joao Pereira et.al.	2410.21113	null
2024-10-28	BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks	Yunhan Zhao et.al.	2410.20971	null
2024-10-29	VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions	Guanyan Chen et.al.	2410.20927	null
2024-10-28	Improving Generalization in Visual Reasoning via Self-Ensemble	Tien-Huy Nguyen et.al.	2410.20883	null
2024-10-28	Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments	Sangmim Song et.al.	2410.20666	null
2024-10-27	MatViX: Multimodal Information Extraction from Visually Rich Articles	Ghazal Khalighinejad et.al.	2410.20494	null
2024-10-25	Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models	Yucheng Zhou et.al.	2410.19732	null
2024-10-25	GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing	Hosam Elgendy et.al.	2410.19552	link
2024-10-25	Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad?	Antonia Wüst et.al.	2410.19546	link
2024-10-25	EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data	Xuetian Chen et.al.	2410.19461	null
2024-10-25	COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training	Haocheng Xi et.al.	2410.19313	link
2024-10-25	Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting	Xingyu Zhu et.al.	2410.19294	null
2024-10-24	Probabilistic Language-Image Pre-Training	Sanghyuk Chun et.al.	2410.18857	link
2024-10-24	Zero-shot Object Navigation with Vision-Language Models Reasoning	Congcong Wen et.al.	2410.18570	null
2024-10-24	Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data	Shuhao Gu et.al.	2410.18558	null
2024-10-24	Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics	Jinghao Hu et.al.	2410.18537	null
2024-10-23	Lightweight Neural App Control	Filippos Christianos et.al.	2410.17883	null
2024-10-23	ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting	Shaofei Cai et.al.	2410.17856	link
2024-10-23	RE-tune: Incremental Fine Tuning of Biomedical Vision-Language Models for Multi-label Chest X-ray Classification	Marco Mistretta et.al.	2410.17827	null
2024-10-23	An Intelligent Agentic System for Complex Image Restoration Problems	Kaiwen Zhu et.al.	2410.17809	link
2024-10-23	MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models	Ziyu Liu et.al.	2410.17637	link
2024-10-22	AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents	Chejian Xu et.al.	2410.17401	null
2024-10-22	Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities	Zheyuan Zhang et.al.	2410.17385	link
2024-10-22	PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction	Long Xing et.al.	2410.17247	link
2024-10-22	MPDS: A Movie Posters Dataset for Image Generation with Diffusion Model	Meng Xu et.al.	2410.16840	null
2024-10-22	Combining Ontological Knowledge and Large Language Model for User-Friendly Service Robots	Haru Nakajima et.al.	2410.16804	null
2024-10-22	Foundation Models for Remote Sensing and Earth Observation: A Survey	Aoran Xiao et.al.	2410.16602	link
2024-10-21	Integrating Reinforcement Learning with Foundation Models for Autonomous Robotics: Methods and Perspectives	Angelo Moroncelli et.al.	2410.16411	link
2024-10-21	VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use	Zhehao Zhang et.al.	2410.16400	null
2024-10-21	xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs	Michael S. Ryoo et.al.	2410.16267	null
2024-10-21	Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping	Ryan Li et.al.	2410.16232	null
2024-10-21	Improve Vision Language Model Chain-of-thought Reasoning	Ruohong Zhang et.al.	2410.16198	link
2024-10-21	Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning	Yihong Tang et.al.	2410.16162	null
2024-10-21	Mitigating Object Hallucination via Concentric Causal Attention	Yun Xing et.al.	2410.15926	link
2024-10-21	MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images	Pablo Meseguer et.al.	2410.15881	null
2024-10-21	Task-oriented Robotic Manipulation with Vision Language Models	Nurhan Bulus Guran et.al.	2410.15863	null
2024-10-21	An Efficient System for Automatic Map Storytelling – A Case Study on Historical Maps	Ziyi Liu et.al.	2410.15780	link
2024-10-22	Reducing Hallucinations in Vision-Language Models via Latent Space Steering	Sheng Liu et.al.	2410.15778	link
2024-10-21	AutoTrain: No-code training for state-of-the-art models	Abhishek Thakur et.al.	2410.15735	link
2024-10-18	NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples	Baiqi Li et.al.	2410.14669	null
2024-10-18	Neuro-Symbolic Traders: Assessing the Wisdom of AI Crowds in Markets	Namid R. Stillman et.al.	2410.14587	null
2024-10-18	CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection	Andrea Appiani et.al.	2410.14509	null
2024-10-18	Zero-shot Action Localization via the Confidence of Large Vision-Language Models	Josiah Aklilu et.al.	2410.14340	null
2024-10-18	Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models	Olga Loginova et.al.	2410.14248	null
2024-10-18	E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model	Haoran Lai et.al.	2410.14200	null
2024-10-18	LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs	Yujun Zhou et.al.	2410.14182	null
2024-10-18	MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems	Zifeng Zhu et.al.	2410.14179	link
2024-10-18	ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom	Jingqi Zhou et.al.	2410.14138	null
2024-10-17	Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers	Yuxin Wen et.al.	2410.14072	null
2024-10-17	VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding	Runsen Xu et.al.	2410.13860	link
2024-10-17	Differentiable Robot Rendering	Ruoshi Liu et.al.	2410.13851	null
2024-10-17	Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning	Xiaodan Xing et.al.	2410.13823	link
2024-10-17	Improving Multi-modal Large Language Model through Boosting Vision Capabilities	Yanpeng Sun et.al.	2410.13733	null
2024-10-17	VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks	Shailaja Keyur Sampat et.al.	2410.13666	link
2024-10-17	H2OVL-Mississippi Vision Language Models Technical Report	Shaikat Galib et.al.	2410.13611	null
2024-10-17	GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models	Aditya Sharma et.al.	2410.13510	null
2024-10-17	Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding	Kyungmin Min et.al.	2410.13321	null
2024-10-17	Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead	Kuleen Sasse et.al.	2410.13146	link
2024-10-17	Trust but Verify: Programmatic VLM Evaluation in the Wild	Viraj Prabhu et.al.	2410.13121	null
2024-10-16	Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models	Ce Zhang et.al.	2410.12790	link
2024-10-16	Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions	Zhenyu Jiang et.al.	2410.12773	null
2024-10-16	WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation	João Matos et.al.	2410.12722	link
2024-10-16	WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines	Genta Indra Winata et.al.	2410.12705	link
2024-10-16	VividMed: Vision Language Model with Versatile Visual Grounding for Medicine	Lingxiao Luo et.al.	2410.12694	link
2024-10-16	Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models	Shicheng Xu et.al.	2410.12662	null
2024-10-16	CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training	Zhiyuan Ma et.al.	2410.12595	null
2024-10-16	FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion	Jiacheng Ruan et.al.	2410.12564	link
2024-10-16	Evaluating Cascaded Methods of Vision-Language Models for Zero-Shot Detection and Association of Hardhats for Increased Construction Safety	Lucas Choi et.al.	2410.12225	null
2024-10-16	Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution	Timothy Wei et.al.	2410.12165	null
2024-10-15	A Survey of Low-shot Vision-Language Model Adaptation via Representer Theorem	Kun Ding et.al.	2410.11686	null
2024-10-15	VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models	Zihang Li et.al.	2410.11665	null
2024-10-16	Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models	Fan Yang et.al.	2410.11639	null
2024-10-15	MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval	Reno Kriz et.al.	2410.11619	null
2024-10-15	PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model	Shang-Ching Liu et.al.	2410.11564	null
2024-10-15	LargePiG: Your Large Language Model is Secretly a Pointer Generator	Zhongxiang Sun et.al.	2410.11366	null
2024-10-15	Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs	Shuo Li et.al.	2410.11302	null
2024-10-15	CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification	Huazhong Zhao et.al.	2410.11255	null
2024-10-15	Tree of Attributes Prompt Learning for Vision-Language Models	Tong Ding et.al.	2410.11201	link
2024-10-14	Locality Alignment Improves Vision-Language Models	Ian Covert et.al.	2410.11087	null
2024-10-14	Towards Foundation Models for 3D Vision: How Close Are We?	Yiming Zuo et.al.	2410.10799	link
2024-10-14	VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents	Shi Yu et.al.	2410.10594	link
2024-10-14	Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification	Jiaxiang Gou et.al.	2410.10573	link
2024-10-14	MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks	Jiacheng Chen et.al.	2410.10563	link
2024-10-14	Learning to Ground VLMs without Forgetting	Aritra Bhowmik et.al.	2410.10491	null
2024-10-14	LG-CAV: Train Any Concept Activation Vector with Language Guidance	Qihan Huang et.al.	2410.10308	null
2024-10-14	Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection	Jiawen Zhu et.al.	2410.10289	link
2024-10-14	LOBG:Less Overfitting for Better Generalization in Vision-Language Model	Chenhao Ding et.al.	2410.10247	null
2024-10-14	MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models	Peng Xia et.al.	2410.10139	link
2024-10-14	Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models	Jun Luo et.al.	2410.10114	link
2024-10-11	Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models	Qin Liu et.al.	2410.09047	null
2024-10-11	The Impact of Visual Information in Chinese Characters: Evaluating Large Models’ Ability to Recognize and Utilize Radicals	Xiaofeng Wu et.al.	2410.09013	null
2024-10-11	DA-Ada: Learning Domain-Aware Adapter for Domain Adaptive Object Detection	Haochen Li et.al.	2410.09004	link
2024-10-11	SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation	Haosheng Li et.al.	2410.08901	null
2024-10-11	Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation	Kun Ding et.al.	2410.08895	null
2024-10-11	RoRA-VLM: Robust Retrieval-Augmented Vision Language Models	Jingyuan Qi et.al.	2410.08876	null
2024-10-11	Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies	Yingqiang Gao et.al.	2410.08860	null
2024-10-11	VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model	Beichen Wang et.al.	2410.08792	null
2024-10-11	Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models	Reza Abbasi et.al.	2410.08791	link
2024-10-11	Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping	Yue Yang et.al.	2410.08695	link
2024-10-10	LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts	Anh-Quan Cao et.al.	2410.08211	null
2024-10-10	MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models	Wenbo Hu et.al.	2410.08182	null
2024-10-10	On the Evaluation of Generative Robotic Simulations	Feng Chen et.al.	2410.08172	null
2024-10-10	Q-VLM: Post-training Quantization for Large Vision-Language Models	Changyuan Wang et.al.	2410.08119	link
2024-10-10	Unsupervised Data Validation Methods for Efficient Model Training	Yurii Paniv et.al.	2410.07880	null
2024-10-10	HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter	Yumiao Zhao et.al.	2410.07854	null
2024-10-10	FLIER: Few-shot Language Image Models Embedded with Latent Representations	Zhinuo Zhou et.al.	2410.07648	null
2024-10-10	A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks	Hoin Jung et.al.	2410.07593	link
2024-10-10	3D Vision-Language Gaussian Splatting	Qucheng Peng et.al.	2410.07577	null
2024-10-10	How Does Vision-Language Adaptation Impact the Safety of Vision Language Models?	Seongyun Lee et.al.	2410.07571	null
2024-10-09	Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate	Qidong Huang et.al.	2410.07167	link
2024-10-09	Towards Interpreting Visual Information Processing in Vision-Language Models	Clement Neo et.al.	2410.07149	link
2024-10-10	EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models	Rui Zhao et.al.	2410.07133	link
2024-10-09	VHELM: A Holistic Evaluation of Vision Language Models	Tony Lee et.al.	2410.07112	link
2024-10-09	Pixtral 12B	Pravesh Agrawal et.al.	2410.07073	link
2024-10-09	Clean Evaluations on Contaminated Visual Language Models	Hongyuan Lu et.al.	2410.07030	null
2024-10-09	Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback	Dennis Hein et.al.	2410.07025	null
2024-10-09	$\texttt{ModSCAN}$: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities	Yukun Jiang et.al.	2410.06967	link
2024-10-09	Compositional Entailment Learning for Hyperbolic Vision-Language Models	Avik Pal et.al.	2410.06912	link
2024-10-09	From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models	Yuying Shang et.al.	2410.06795	null
2024-10-07	Fine-Tuning CLIP’s Last Visual Projector: A Few-Shot Cornucopia	Mohammad Fahes et.al.	2410.05270	link
2024-10-07	TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens	Ya-Qi Yu et.al.	2410.05261	null
2024-10-08	TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models	Rabin Adhikari et.al.	2410.05239	link
2024-10-07	Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality	Youngtaek Oh et.al.	2410.05210	link
2024-10-07	LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation	Zhijie Wang et.al.	2410.05191	null
2024-10-07	VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks	Ziyan Jiang et.al.	2410.05160	null
2024-10-07	HyperINF: Unleashing the HyperPower of the Schulz’s Method for Data Influence Estimation	Xinyu Zhou et.al.	2410.05090	link
2024-10-07	HE-Drive: Human-Like End-to-End Driving with Vision Language Models	Junming Wang et.al.	2410.05051	null
2024-10-07	Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models	Dehong Kong et.al.	2410.04884	null
2024-10-07	TLDR: Token-Level Detective Reward Model for Large Vision Language Models	Deqing Fu et.al.	2410.04734	null
2024-10-06	Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress	Christopher Agia et.al.	2410.04640	null

Video-Text Retrieval

Publish Date	Title	Authors	PDF	Code
2025-07-23	HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning	Li Jun et.al.	2507.17402	null
2025-07-17	FFI-VTR: Lightweight and Robust Visual Teach and Repeat Navigation based on Feature Flow Indicator and Probabilistic Motion Planning	Jikai Wang et.al.	2507.12800	null
2025-07-15	Double Duty: FPGA Architecture to Enable Concurrent LUT and Adder Chain Usage	Junius Pun et.al.	2507.11709	null
2025-07-15	Mapping Fusion: Improving FPGA Technology Mapping with ASIC Mapper	Cunxi Yu et.al.	2507.10912	null
2025-06-09	Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval	CH Cho et.al.	2506.07471	null
2025-06-04	MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval	Xinru Ying et.al.	2506.03473	null
2025-06-03	Uneven Event Modeling for Partially Relevant Video Retrieval	Sa Zhu et.al.	2506.00891	null
2025-04-28	Exploiting Inter-Sample Correlation and Intra-Sample Redundancy for Partially Relevant Video Retrieval	Junlong Ren et.al.	2504.19637	null
2025-04-17	Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval	WonJun Moon et.al.	2504.13035	null
2025-04-15	Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering	Peipei Song et.al.	2504.10920	link
2025-03-20	DSLUT: An Asymmetric LUT and its Automatic Design Flow Based on Practical Functions	Moucheng Yang et.al.	2503.16109	null
2025-02-26	Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation	Humza Sami et.al.	2502.19091	link
2025-01-12	Monolithic 3D FPGAs Utilizing Back-End-of-Line Configuration Memories	Faaiq Waqar et.al.	2501.06921	null
2024-12-16	View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection	Qi Zhang et.al.	2412.11428	link
2024-07-10	EA-VTR: Event-Aware Video-Text Retrieval	Zongyang Ma et.al.	2407.07478	null
2024-05-22	GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval	Yuting Wang et.al.	2405.13824	link
2024-04-08	The role of non-scientific factors vis-a-vis the quality of publications in determining their scholarly impact	Giovanni Abramo et.al.	2404.05345	null
2024-04-06	VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA	Sachini Wickramasinghe et.al.	2404.04527	null
2024-02-21	Event-aware Video Corpus Moment Retrieval	Danyang Hou et.al.	2402.13566	null
2024-01-27	Sigma-lognormal modeling of speech	C. Carmona-Duarte et.al.	2401.17320	null
2024-03-12	Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval	Taichi Nishimura et.al.	2312.00414	null
2024-07-14	DAG-aware Synthesis Orchestration	Yingjie Li et.al.	2310.07846	null
2024-01-03	GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval	Yuting Wang et.al.	2310.05195	link
2023-09-25	Multi-event Video-Text Retrieval	Gengyuan Zhang et.al.	2308.11551	link
2022-08-26	Partially Relevant Video Retrieval	Jianfeng Dong et.al.	2208.12510	link

Human-Centric

Publish Date	Title	Authors	PDF	Code
2025-07-18	When Person Re-Identification Meets Event Camera: A Benchmark Dataset and An Attribute-guided Re-Identification Framework	Xiao Wang et.al.	2507.13659	null
2025-07-17	Weakly Supervised Visible-Infrared Person Re-Identification via Heterogeneous Expert Collaborative Consistency Learning	Yafei Zhang et.al.	2507.12942	null
2025-07-09	Colors See Colors Ignore: Clothes Changing ReID with Color Disentanglement	Priyank Pathak et.al.	2507.07230	null
2025-06-14	Domain Generalization for Person Re-identification: A Survey Towards Domain-Agnostic Person Matching	Hyeonseo Lee et.al.	2506.12413	link
2025-06-11	ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model	Jialong Zuo et.al.	2506.09385	link
2025-05-23	DART$^3$: Leveraging Distance for Test Time Adaptation in Person Re-Identification	Rajarshi Bhattacharya et.al.	2505.18337	null
2025-05-07	DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition	Kailash A. Hambarde et.al.	2505.04793	null
2025-04-29	Adept: Annotation-Denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining	Weizhen He et.al.	2504.20800	null
2025-06-03	Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data	Prasanna Reddy Pulakurthi et.al.	2504.13077	link
2025-05-24	LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification	Yiding Lu et.al.	2504.10174	link
2025-03-27	FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation	Jincheng Yan et.al.	2503.21595	null
2025-03-24	Multi-modal Multi-platform Person Re-Identification: Benchmark and Method	Ruiyang Ha et.al.	2503.17096	null
2025-03-11	AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification	Huy Nguyen et.al.	2503.08121	link
2025-02-10	Group-CLIP Uncertainty Modeling for Group Re-Identification	Qingxin Zhang et.al.	2502.06460	null
2025-01-13	Event-based Video Person Re-identification via Cross-Modality and Temporal Collaboration	Renkai Li et.al.	2501.07296	null
2024-12-23	Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification	Yuhao Wang et.al.	2412.17239	link
2024-12-06	Mix-Modality Person Re-Identification: A New and Practical Paradigm	Wei Liu et.al.	2412.04719	null
2024-12-02	Cerberus: Attribute-based person re-identification using semantic IDs	Chanho Eom et.al.	2412.01048	null
2024-11-27	When Large Vision-Language Models Meet Person Re-Identification	Qizao Wang et.al.	2411.18111	null
2025-04-11	Anti-Forgetting Adaptation for Unsupervised Person Re-identification	Hao Chen et.al.	2411.14695	null
2025-01-15	Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification	Shengxun Wei et.al.	2411.00330	null
2024-10-11	Aligned Divergent Pathways for Omni-Domain Generalized Person Re-Identification	Eugene P. W. Ang et.al.	2410.08466	null
2024-10-11	Diverse Deep Feature Ensemble Learning for Omni-Domain Generalized Person Re-identification	Eugene P. W. Ang et.al.	2410.08460	null
2024-10-11	A Unified Deep Semantic Expansion Framework for Domain-Generalized Person Re-identification	Eugene P. W. Ang et.al.	2410.08456	null
2024-10-25	Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification	Chenyue Li et.al.	2410.06977	link
2024-09-03	Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach	Shizhou Zhang et.al.	2408.07500	link
2024-08-20	PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification	Bin Hu et.al.	2408.05398	link
2024-05-30	DiffPhysBA: Diffusion-based Physical Backdoor Attack against Person Re-Identification in Real-World	Wenli Sun et.al.	2405.19990	null
2024-05-28	Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification	Weizhen He et.al.	2405.17790	link
2024-05-15	Color Space Learning for Cross-Color Person Re-Identification	Jiahao Nie et.al.	2405.09487	link
2024-07-01	Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID	Wentao Tan et.al.	2405.04940	link
2024-04-19	Rethinking Clothes Changing Person ReID: Conflicts, Synthesis, and Optimization	Junjie Li et.al.	2404.12611	null
2024-05-01	Part-Attention Based Model Make Occluded Person Re-Identification Stronger	Zhihao Chen et.al.	2404.03443	null
2024-03-16	Bidirectional Multi-Step Domain Generalization for Visible-Infrared Person Re-Identification	Mahdi Alehdaghi et.al.	2403.10782	null
2024-03-13	Data Augmentation in Human-Centric Vision	Wentao Jiang et.al.	2403.08650	null