Updated on 2025.07.09

Usage instructions: here

VLM

Publish Date Title Authors PDF Code
2025-07-07 NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving Qucheng Peng et.al. 2507.05227 null
2025-07-07 All in One: Visual-Description-Guided Unified Point Cloud Segmentation Zongyan Han et.al. 2507.05211 null
2025-07-07 Differential Attention for Multimodal Crisis Event Analysis Nusrat Munia et.al. 2507.05165 null
2025-07-07 INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling Xin Dong et.al. 2507.05056 null
2025-07-07 Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision Soham Walimbe et.al. 2507.05020 null
2025-07-07 From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach Mihai Masala et.al. 2507.04815 null
2025-07-07 Training-free Generation of Temporally Consistent Rewards from VLMs Yinuo Zhao et.al. 2507.04789 null
2025-07-07 Vision-Language Models Can’t See the Obvious Yasser Dahou et.al. 2507.04741 null
2025-07-07 An analysis of vision-language models for fabric retrieval Francesco Giuliari et.al. 2507.04735 null
2025-07-07 A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets Zexi Jia et.al. 2507.04699 null
2025-07-03 DexVLG: Dexterous Vision-Language-Grasp Model at Scale Jiawei He et.al. 2507.02747 null
2025-07-03 ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects Qiaojun Yu et.al. 2507.02600 null
2025-07-02 cVLA: Towards Efficient Camera-Space VLAs Max Argus et.al. 2507.02190 null
2025-07-02 Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges Sanjeda Akter et.al. 2507.02074 null
2025-07-02 CI-VID: A Coherent Interleaved Text-Video Dataset Yiming Ju et.al. 2507.01938 null
2025-07-02 How Do Vision-Language Models Process Conflicting Information Across Modalities? Tianze Hua et.al. 2507.01790 null
2025-07-02 Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition Muzammil Behzad et.al. 2507.01673 null
2025-07-02 MARVIS: Modality Adaptive Reasoning over VISualizations Benjamin Feuer et.al. 2507.01544 null
2025-07-02 Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence Robert Aufschläger et.al. 2507.01504 null
2025-07-02 BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments Yibo Qiu et.al. 2507.01485 null
2025-07-03 TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control Zhenyang Liu et.al. 2507.01424 null
2025-07-02 CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning Kuniaki Saito et.al. 2507.01409 null
2025-07-02 Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model Chaoxiang Cai et.al. 2507.01351 null
2025-07-02 VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process Cristian Gariboldi et.al. 2507.01284 null
2025-06-30 DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World Xiangtai Li et.al. 2506.24102 null
2025-06-30 EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations Hyunjong Kim et.al. 2506.24016 null
2025-06-30 The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models Lijun Sheng et.al. 2506.24000 null
2025-06-30 StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving Ruiyang Hao et.al. 2506.23982 null
2025-06-30 GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models Hamza Rasaee et.al. 2506.23903 null
2025-06-30 A Closer Look at Conditional Prompt Tuning for Vision-Language Models Ji Zhang et.al. 2506.23856 null
2025-06-30 Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model Shiming Chen et.al. 2506.23822 null
2025-06-30 Visual Textualization for Image Prompted Object Detection Yongjian Wu et.al. 2506.23785 null
2025-06-30 PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? Atharva Gundawar et.al. 2506.23725 null
2025-06-30 On the Domain Robustness of Contrastive Vision-Language Models Mario Koddenbrock et.al. 2506.23663 null
2025-06-27 MiCo: Multi-image Contrast for Reinforcement Visual Reasoning Xi Chen et.al. 2506.22434 null
2025-06-27 Test-Time Consistency in Vision Language Models Shih-Han Chou et.al. 2506.22395 null
2025-06-27 Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation Tiankai Chen et.al. 2506.22375 null
2025-06-27 Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment Rui Xu et.al. 2506.22283 null
2025-06-27 COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication Filippo Merlo et.al. 2506.22274 null
2025-06-27 Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs Amirmohammad Izadi et.al. 2506.22146 null
2025-06-27 Universal Retrieval for Multimodal Trajectory Modeling Xuan Zhang et.al. 2506.22056 null
2025-06-27 Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation Jialei Chen et.al. 2506.22032 null
2025-06-27 SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding Zhao Jin et.al. 2506.21924 null
2025-06-27 SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation Adam Goodge et.al. 2506.21892 null
2025-06-26 Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration Jiahe Chen et.al. 2506.21509 null
2025-06-26 Global and Local Entailment Learning for Natural World Imagery Srikumar Sastry et.al. 2506.21476 null
2025-06-26 Spatial Mental Modeling from Limited Views Baiqiao Yin et.al. 2506.21458 null
2025-06-27 ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models Hongbo Liu et.al. 2506.21356 null
2025-06-26 LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning Dewen Zhang et.al. 2506.21317 null
2025-06-26 DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images Badri Vishal Kasuba et.al. 2506.21316 null
2025-06-26 World-aware Planning Narratives Enhance Large Vision-Language Model Planner Junhao Shi et.al. 2506.21230 null
2025-06-26 Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion Yuguang Zhang et.al. 2506.21144 null
2025-06-26 V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling Junwei You et.al. 2506.21041 null
2025-06-26 Multimodal Prompt Alignment for Facial Expression Recognition Fuyan Ma et.al. 2506.21017 null
2025-06-25 Shape2Animal: Creative Animal Generation from Natural Silhouettes Quoc-Duy Tran et.al. 2506.20616 null
2025-06-25 HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction Zhonghao Shi et.al. 2506.20566 null
2025-06-25 Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation Changlu Guo et.al. 2506.20449 null
2025-06-25 CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition Joerg Deigmoeller et.al. 2506.20373 null
2025-06-25 Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards Jihao Gu et.al. 2506.20332 null
2025-06-25 Towards Efficient Exemplar Based Image Editing with Multimodal VLMs Avadhoot Jadhav et.al. 2506.20155 null
2025-06-25 MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations Vardhan Dongre et.al. 2506.20100 null
2025-06-24 Unified Vision-Language-Action Model Yuqi Wang et.al. 2506.19850 null
2025-06-24 Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models Johannes Rückert et.al. 2506.19825 null
2025-06-24 CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation Hao Li et.al. 2506.19816 null
2025-06-24 UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation Yue Zhou et.al. 2506.19694 null
2025-06-24 PEVLM: Parallel Encoding for Vision-Language Models Letian Kang et.al. 2506.19651 null
2025-06-24 V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis Yuan Wang et.al. 2506.19610 null
2025-06-24 ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP Zhiyuan Wang et.al. 2506.19608 null
2025-06-24 Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects Federico Tavella et.al. 2506.19579 null
2025-06-24 Visual hallucination detection in large vision-language models via evidential conflict Tao Huang et.al. 2506.19513 null
2025-06-24 T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models Yiteng Chen et.al. 2506.19498 null
2025-06-23 VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning Xuanyu Zhang et.al. 2506.18564 null
2025-06-23 Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey Xinyao Li et.al. 2506.18504 null
2025-06-23 InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models Nianchen Deng et.al. 2506.18385 null
2025-06-23 Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review Haoneng Lin et.al. 2506.18378 null
2025-06-23 Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations? Yiwei Yang et.al. 2506.18322 null
2025-06-24 Referring Expression Instance Retrieval and A Strong End-to-End Baseline Xiangzhao Hao et.al. 2506.18246 null
2025-06-23 Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning Yue Li et.al. 2506.18234 null
2025-06-22 See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis Ruinan Jin et.al. 2506.18140 null
2025-06-22 CLGRPO: Reasoning Ability Enhancement for Small VLMs Fanyi Wang et.al. 2506.18048 null
2025-06-22 Adapting Vision-Language Models for Evaluating World Models Mariya Hendriksen et.al. 2506.17967 null
2025-06-20 VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning Zhangyang Qi et.al. 2506.17221 null
2025-06-20 Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens Zeyuan Yang et.al. 2506.17218 link
2025-06-20 Do We Need Large VLMs for Spotting Soccer Actions? Ritabrata Chakraborty et.al. 2506.17144 null
2025-06-20 Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments Yasir Ali Farrukh et.al. 2506.16994 null
2025-06-20 AnyTraverse: An off-road traversability framework with VLM and human operator in the loop Sattwik Sahu et.al. 2506.16826 null
2025-06-20 FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation Fan Yang et.al. 2506.16806 null
2025-06-20 Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes Chao Chen et.al. 2506.16805 null
2025-06-20 Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models Lei Jiang et.al. 2506.16760 null
2025-06-20 Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly Lance Ying et.al. 2506.16755 null
2025-06-20 TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion Mingrui Zhu et.al. 2506.16730 null
2025-06-18 GenRecal: Generation after Recalibration from Large to Small Vision-Language Models Byung-Kwan Lee et.al. 2506.15681 null
2025-06-18 Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning Ankan Deria et.al. 2506.15649 null
2025-06-18 FindingDory: A Benchmark to Evaluate Memory in Embodied Agents Karmesh Yadav et.al. 2506.15635 null
2025-06-18 WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts Negar Foroutan et.al. 2506.15594 link
2025-06-18 DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement Shaoqing Lin et.al. 2506.15583 link
2025-06-18 Context-Informed Grounding Supervision Hyunji Lee et.al. 2506.15480 link
2025-06-19 OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models Lanfeng Zhong et.al. 2506.15318 null
2025-06-18 MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering Xinqi Fan et.al. 2506.15298 null
2025-06-18 Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models Xuelin Shen et.al. 2506.15201 link
2025-06-18 ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections Ziling Huang et.al. 2506.15180 null
2025-06-17 RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills Chunru Lin et.al. 2506.14763 null
2025-06-17 Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models Huihan Liu et.al. 2506.14727 null
2025-06-17 AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions Aishan Liu et.al. 2506.14697 null
2025-06-17 Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models Ling Li et.al. 2506.14674 null
2025-06-17 StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery Jina Kim et.al. 2506.14670 null
2025-06-17 SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks Zijian Song et.al. 2506.14512 null
2025-06-17 Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation? Nitesh Subedi et.al. 2506.14507 link
2025-06-17 Adapting Lightweight Vision Language Models for Radiological Visual Question Answering Aditya Shourya et.al. 2506.14451 link
2025-06-17 Causally Steered Diffusion for Automated Video Counterfactual Generation Nikos Spyrou et.al. 2506.14404 link
2025-06-18 ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies Jinyan Yuan et.al. 2506.14315 null
2025-06-16 Touch begins where vision ends: Generalizable policies for contact-rich manipulation Zifan Zhao et.al. 2506.13762 null
2025-06-16 Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins Chuanruo Ning et.al. 2506.13761 null
2025-06-16 OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning Qiyu Xu et.al. 2506.13723 null
2025-06-16 ROSA: Harnessing Robot States for Vision-Language and Action Alignment Yuqing Wen et.al. 2506.13679 null
2025-06-16 DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models Zhiyi Shi et.al. 2506.13638 null
2025-06-16 VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation Jiaming Chen et.al. 2506.13428 null
2025-06-16 Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation Utkarsh Bajpai et.al. 2506.13367 null
2025-06-16 Anomaly Object Segmentation with Vision-Language Models for Steel Scrap Recycling Daichi Tanaka et.al. 2506.13282 null
2025-06-16 AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining Hongyuan Dong et.al. 2506.13274 null
2025-06-16 Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments Xuan Wang et.al. 2506.13205 null
2025-06-13 Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale Junha Lee et.al. 2506.12009 null
2025-06-13 How Visual Representations Map to Language Feature Space in Multimodal LLMs Constantin Venhoff et.al. 2506.11976 null
2025-06-13 Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation Xintong Wang et.al. 2506.11820 null
2025-06-13 MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space Anshul Singh et.al. 2506.11684 null
2025-06-13 VLM@school – Evaluation of AI image understanding on German middle school knowledge René Peinl et.al. 2506.11604 null
2025-06-13 EasyARC: Evaluating Vision Language Models on True Visual Reasoning Mert Unsal et.al. 2506.11595 null
2025-06-13 Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis Yuan Gao et.al. 2506.11526 link
2025-06-13 Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs Xiao Xu et.al. 2506.11515 null
2025-06-13 Taming Stable Diffusion for Computed Tomography Blind Super-Resolution Chunlei Li et.al. 2506.11496 null
2025-06-13 On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving Pedram MohajerAnsari et.al. 2506.11472 null
2025-06-12 AIR: Zero-shot Generative Model Adaptation with Iterative Refinement Guimeng Liu et.al. 2506.10895 link
2025-06-12 VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos Jiashuo Yu et.al. 2506.10857 null
2025-06-13 RationalVLA: A Rational Vision-Language-Action Model with Dual System Wenxuan Song et.al. 2506.10826 null
2025-06-12 Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding Yuhang Zhang et.al. 2506.10756 null
2025-06-13 IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain Hong Huang et.al. 2506.10730 link
2025-06-12 GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning Xiaoyi Bao et.al. 2506.10639 null
2025-06-12 Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning Chun-Mei Feng et.al. 2506.10575 null
2025-06-12 LLMs Are Not Yet Ready for Deepfake Image Detection Shahroz Tariq et.al. 2506.10474 null
2025-06-12 UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models Jun Yin et.al. 2506.10342 null
2025-06-12 Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions Deliang Wang et.al. 2506.10334 null
2025-06-11 Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing Junfei Wu et.al. 2506.09965 link
2025-06-11 From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models Irving Fang et.al. 2506.09930 null
2025-06-11 3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation Seonho Lee et.al. 2506.09883 link
2025-06-11 Adding simple structure at inference improves Vision-Language Compositionality Imanol Miranda et.al. 2506.09691 link
2025-06-11 HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios Kunyu Peng et.al. 2506.09650 link
2025-06-11 FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models Weiying Zheng et.al. 2506.09638 null
2025-06-11 Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs Beomsik Cho et.al. 2506.09522 link
2025-06-11 Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning Cheng Chen et.al. 2506.09473 null
2025-06-11 TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision Ayush Gupta et.al. 2506.09445 null
2025-06-11 DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt Yitong Zhang et.al. 2506.09353 link
2025-06-10 VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning Li Kang et.al. 2506.09049 null
2025-06-11 Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs Yaniv Nikankin et.al. 2506.09047 link
2025-06-10 Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better Dianyi Wang et.al. 2506.09040 link
2025-06-10 Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models Chenyu Lian et.al. 2506.08990 link
2025-06-10 Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions David Acuna et.al. 2506.08927 null
2025-06-11 Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought Shuyi Zhang et.al. 2506.08817 null
2025-06-10 Multimodal Representation Alignment for Cross-modal Information Retrieval Fan Xu et.al. 2506.08774 null
2025-06-10 PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly Liang Ma et.al. 2506.08708 null
2025-06-10 VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism Congzhi Zhang et.al. 2506.08691 link
2025-06-10 ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction Juan Yeo et.al. 2506.08678 null
2025-06-09 Vision Transformers Don’t Need Trained Registers Nick Jiang et.al. 2506.08010 link
2025-06-09 Hidden in plain sight: VLMs overlook their visual representations Stephanie Fu et.al. 2506.08008 null
2025-06-09 BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models Peiyan Li et.al. 2506.07961 null
2025-06-09 Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations Yizhen Li et.al. 2506.07943 null
2025-06-09 Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models Chengyue Huang et.al. 2506.07936 null
2025-06-09 SAM2Auto: Auto Annotation Using FLASH Arash Rocky et.al. 2506.07850 null
2025-06-09 Image Reconstruction as a Tool for Feature Analysis Eduard Allakhverdov et.al. 2506.07803 null
2025-06-09 Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger Qi Yang et.al. 2506.07785 null
2025-06-09 Language-Vision Planner and Executor for Text-to-Visual Reasoning Yichang Xu et.al. 2506.07778 null
2025-06-09 ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models Jing Zhong et.al. 2506.07739 null
2025-06-06 CoMemo: LVLMs Need Image Context with Image Memory Shi Liu et.al. 2506.06279 null
2025-06-06 ExAct: A Video-Language Benchmark for Expert Action Analysis Han Yi et.al. 2506.06277 null
2025-06-06 Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding Emmanouil Zaranis et.al. 2506.06275 null
2025-06-06 Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study Leon Mayer et.al. 2506.06232 null
2025-06-06 GenIR: Generative Visual Feedback for Mental Image Retrieval Diji Yang et.al. 2506.06220 null
2025-06-06 STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving Christian Fruhwirth-Reisinger et.al. 2506.06218 link
2025-06-06 WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management Bowen Yuan et.al. 2506.06084 null
2025-06-06 Full Conformal Adaptation of Medical Vision-Language Models Julio Silva-Rodríguez et.al. 2506.06076 null
2025-06-06 BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning Hongyi Zhou et.al. 2506.06072 null
2025-06-06 MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks Zonglin Wu et.al. 2506.05982 link
2025-06-06 Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs Haoyuan Li et.al. 2506.05318 null
2025-06-05 MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm Zhang Li et.al. 2506.05218 link
2025-06-05 Quantifying Cross-Modality Memorization in Vision-Language Models Yuxin Wen et.al. 2506.05198 null
2025-06-05 CIVET: Systematic Evaluation of Understanding in VLMs Massimo Rizzoli et.al. 2506.05146 null
2025-06-05 PixCell: A generative foundation model for digital histopathology images Srikar Yellapragada et.al. 2506.05127 null
2025-06-06 SeedEdit 3.0: Fast and High-Quality Generative Image Editing Peng Wang et.al. 2506.05083 null
2025-06-05 A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions Anh Le et.al. 2506.05061 null
2025-06-05 Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System Haokun Liu et.al. 2506.05020 null
2025-06-05 ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT Mikołaj Pokrywka et.al. 2506.04929 null
2025-06-05 Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms Nurul Aisyah et.al. 2506.04822 null
2025-06-04 OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis Junting Chen et.al. 2506.04217 link
2025-06-04 Image Editing As Programs with Diffusion Models Yujia Hu et.al. 2506.04158 null
2025-06-04 Recent Advances in Medical Image Classification Loan Dao et.al. 2506.04129 null
2025-06-04 LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward Yi Zhao et.al. 2506.04070 link
2025-06-04 Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization Jiulong Wu et.al. 2506.04039 null
2025-06-04 Vocabulary-free few-shot learning for Vision-Language Models Maxime Zanella et.al. 2506.04005 null
2025-06-04 DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models Jia Fu et.al. 2506.03933 null
2025-06-04 Zero-Shot Temporal Interaction Localization for Egocentric Videos Erhang Zhang et.al. 2506.03662 link
2025-06-04 Spatial Understanding from Videos: Structured Prompts Meet Simulation Data Haoyu Zhang et.al. 2506.03642 null
2025-06-04 VLMs Can Aggregate Scattered Training Patches Zhanhui Zhou et.al. 2506.03614 link
2025-06-03 GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents Qianhui Wu et.al. 2506.03143 null
2025-06-03 OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models Mengdi Jia et.al. 2506.03135 null
2025-06-03 EgoVLM: Policy Optimization for Egocentric Video Understanding Ashwin Vinod et.al. 2506.03097 link
2025-06-03 DPO Learning with LLMs-Judge Signal for Computer Use Agents Man Luo et.al. 2506.03095 null
2025-06-03 From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit Valérie Costa et.al. 2506.03093 null
2025-06-03 Text-guided Generation of Efficient Personalized Inspection Plans Xingpeng Sun et.al. 2506.02917 null
2025-06-04 FlySearch: Exploring how vision-language models explore Adam Pardyl et.al. 2506.02896 link
2025-06-03 Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights Mathieu Andreux et.al. 2506.02865 null
2025-06-03 SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking Sifan Li et.al. 2506.02803 null
2025-06-03 FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts Tongyuan Bai et.al. 2506.02781 null
2025-05-30 ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL Yu Zhang et.al. 2505.24875 null
2025-05-30 ProxyThinker: Test-Time Guidance through Small Visual Reasoners Zilin Xiao et.al. 2505.24872 link
2025-05-30 GenSpace: Benchmarking Spatially-Aware Image Generation Zehan Wang et.al. 2505.24870 null
2025-05-30 Time Blindness: Why Video-Language Models Can’t See What Humans Can? Ujjwal Upadhyay et.al. 2505.24867 null
2025-05-30 Conformal Prediction for Zero-Shot Models Julio Silva-Rodríguez et.al. 2505.24693 link
2025-05-30 BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models Huu-Thien Tran et.al. 2505.24649 null
2025-05-30 SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition Omer Nacar et.al. 2505.24600 null
2025-05-30 AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders Yuqi Zhang et.al. 2505.24519 null
2025-05-30 un$^2$CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP Yinqi Li et.al. 2505.24517 link
2025-05-30 CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation Emilio Villa-Cueva et.al. 2505.24456 null
2025-05-29 ZeroGUI: Automating Online GUI Learning at Zero Human Cost Chenyu Yang et.al. 2505.23762 link
2025-05-29 Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint Heekyung Lee et.al. 2505.23759 link
2025-05-29 Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models Haohan Chi et.al. 2505.23757 link
2025-05-29 To Trust Or Not To Trust Your Vision-Language Model’s Prediction Hao Dong et.al. 2505.23745 link
2025-05-29 LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization Ronghuan Wu et.al. 2505.23740 null
2025-05-29 Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better Danny Driess et.al. 2505.23705 null
2025-05-29 CLDTracker: A Comprehensive Language Description for Visual Tracking Mohamad Alansari et.al. 2505.23704 link
2025-05-29 Grounded Reinforcement Learning for Visual Reasoning Gabriel Sarch et.al. 2505.23678 link
2025-05-29 Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition Yu Li et.al. 2505.23566 link
2025-05-30 Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information Xu Chu et.al. 2505.23558 link
2025-05-28 Zero-Shot Vision Encoder Grafting via LLM Surrogates Kaiyu Yue et.al. 2505.22664 link
2025-05-28 Training Free Stylized Abstraction Aimon Rahman et.al. 2505.22663 null
2025-05-28 VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models Ce Zhang et.al. 2505.22654 null
2025-05-28 Sherlock: Self-Correcting Reasoning in Vision-Language Models Yi Ding et.al. 2505.22651 null
2025-05-28 DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers Navve Wasserman et.al. 2505.22584 null
2025-05-28 Hypothesis Testing in Imaging Inverse Problems Yiming Xi et.al. 2505.22481 null
2025-05-28 Zero-Shot 3D Visual Grounding from Vision-Language Models Rong Li et.al. 2505.22429 null
2025-05-28 IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth Md Touhidul Islam et.al. 2505.22305 null
2025-05-28 Investigating Mechanisms for In-Context Vision Language Binding Darshana Saravanan et.al. 2505.22200 null
2025-05-29 Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging Runze Xia et.al. 2505.22150 null
2025-05-27 ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models Dingming Li et.al. 2505.21500 null
2025-05-27 AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery Haowei Wang et.al. 2505.21499 link
2025-05-27 Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Wei Pang et.al. 2505.21497 link
2025-05-27 Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration Mehrdad Fazli et.al. 2505.21472 null
2025-05-27 ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models Bozhou Li et.al. 2505.21465 null
2025-05-27 LazyVLM: Neuro-Symbolic Approach to Video Analytics Xiangru Jian et.al. 2505.21459 null
2025-05-27 DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models Nastaran Saadati et.al. 2505.21382 null
2025-05-27 XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration Shaoqing Zhang et.al. 2505.21279 null
2025-05-27 CROP: Contextual Region-Oriented Visual Token Pruning Jiawei Guo et.al. 2505.21233 null
2025-05-27 Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation Zhengyang Ji et.al. 2505.21106 null
2025-05-26 Agentic 3D Scene Generation with Spatially Contextualized VLMs Xinhang Liu et.al. 2505.20129 null
2025-05-26 MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models Anh Thai et.al. 2505.20122 null
2025-05-27 EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition Christoph Schuhmann et.al. 2505.20033 null
2025-05-26 ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers Fotios Lygerakis et.al. 2505.20032 null
2025-05-26 Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models Hyunsik Chae et.al. 2505.20021 null
2025-05-26 Can Visual Encoder Learn to See Arrows? Naoyuki Terashita et.al. 2505.19944 null
2025-05-26 Attention! You Vision Language Model Could Be Maliciously Manipulated Xiaosen Wang et.al. 2505.19911 null
2025-05-26 EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM Shuang Ao et.al. 2505.19905 null
2025-05-26 Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement Afrah Shaahid et.al. 2505.19895 null
2025-05-26 Two Causally Related Needles in a Video Haystack Miaoyu Li et.al. 2505.19853 null
2025-05-23 VideoGameBench: Can Vision-Language Models complete popular video games? Alex L. Zhang et.al. 2505.18134 null
2025-05-23 One RL to See Them All: Visual Triple Unified Reinforcement Learning Yan Ma et.al. 2505.18129 null
2025-05-23 CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays Hyungyung Lee et.al. 2505.18087 link
2025-05-23 FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation Zherui Zhang et.al. 2505.18053 null
2025-05-23 Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation Li Zhong et.al. 2505.18039 null
2025-05-23 Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling Bryan Wong et.al. 2505.17982 null
2025-05-23 Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities Ziwei Zhou et.al. 2505.17862 link
2025-05-23 VLM Models and Automated Grading of Atopic Dermatitis Marc Lalonde et.al. 2505.17835 null
2025-05-23 Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations Boxu Chen et.al. 2505.17812 null
2025-05-23 U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding Anjie Le et.al. 2505.17779 null
2025-05-23 Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models Jiaqi Wang et.al. 2505.16854 link
2025-05-23 LaViDa: A Large Diffusion Language Model for Multimodal Understanding Shufan Li et.al. 2505.16839 link
2025-05-22 From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization Haonian Ji et.al. 2505.16832 link
2025-05-22 Perceptual Quality Assessment for Embodied AI Chunyi Li et.al. 2505.16815 link
2025-05-22 SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving Xuesong Chen et.al. 2505.16805 null
2025-05-22 REOBench: Benchmarking Robustness of Earth Observation Foundation Models Xiang Li et.al. 2505.16793 link
2025-05-22 Single Domain Generalization for Few-Shot Counting via Universal Representation Matching Xianing Chen et.al. 2505.16778 link
2025-05-22 IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models Yiming Gao et.al. 2505.16774 link
2025-05-22 Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation Hongji Yang et.al. 2505.16763 null
2025-05-22 Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval Hailong Ning et.al. 2505.16756 null
2025-05-21 InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition Yijie Zheng et.al. 2505.15818 link
2025-05-21 From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems Xiuchao Sui et.al. 2505.15685 link
2025-05-21 FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models Zhen Sun et.al. 2505.15644 link
2025-05-21 Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models Xin Huang et.al. 2505.15576 link
2025-05-21 TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving Hossein Hassani et.al. 2505.15564 null
2025-05-21 Clapper: Compact Learning and Video Representation in VLMs Lingyu Kong et.al. 2505.15529 null
2025-05-21 Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets Kaiyuan Chen et.al. 2505.15517 null
2025-05-21 Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought Zihui Cheng et.al. 2505.15510 null
2025-05-21 Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts Debarshi Brahma et.al. 2505.15506 link
2025-05-21 Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification Conghao Xiong et.al. 2505.15504 null
2025-05-20 Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning Jiaer Xia et.al. 2505.14677 null
2025-05-20 UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens Ruichuan An et.al. 2505.14671 link
2025-05-20 EmoGist: Efficient In-Context Learning for Visual Emotion Understanding Ronald Seoh et.al. 2505.14660 null
2025-05-20 CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation Anna C. Doris et.al. 2505.14646 link
2025-05-20 Debating for Better Reasoning: An Unsupervised Multimodal Approach Ashutosh Adhikari et.al. 2505.14627 null
2025-05-20 KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation Jiajun Shi et.al. 2505.14552 link
2025-05-20 PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models He Zhu et.al. 2505.14481 null
2025-05-20 RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding Jiaang Li et.al. 2505.14462 link
2025-05-20 SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation Yuyang Dong et.al. 2505.14381 null
2025-05-20 Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds Joel Currie et.al. 2505.14366 null
2025-05-19 ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models Liyan Tang et.al. 2505.13444 null
2025-05-19 G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning Liang Chen et.al. 2505.13426 link
2025-05-19 Seeing, Saying, Solving: An LLM-to-TL Framework for Cooperative Robots Dan BW Choe et.al. 2505.13376 null
2025-05-20 Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning Song-Lin Li et.al. 2505.13317 null
2025-05-19 I’ll believe it when I see it: Images increase misinformation sharing in Vision-Language Models Alice Plebe et.al. 2505.13302 link
2025-05-19 Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts Zekun Wang et.al. 2505.13281 null
2025-05-19 From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection Lincan Cai et.al. 2505.13233 link
2025-05-19 ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models Matteo Merler et.al. 2505.13180 link
2025-05-19 Just Dance with $\pi$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection Snehashis Majhi et.al. 2505.13123 null
2025-05-19 Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model Yong Ren et.al. 2505.13062 null
2025-05-16 Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner Wenchuan Zhang et.al. 2505.11404 link
2025-05-16 Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation Zihan Wang et.al. 2505.11383 link
2025-05-16 Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild Derek Ming Siang Tan et.al. 2505.11350 null
2025-05-16 Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models Keunwoo Peter Yu et.al. 2505.11326 link
2025-05-16 Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation Donghoon Lee et.al. 2505.11221 link
2025-05-19 FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining Myunsoo Kim et.al. 2505.11192 null
2025-05-16 Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing Mathis Jürgen Adler et.al. 2505.11121 null
2025-05-16 CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs David Méndez et.al. 2505.11060 link
2025-05-16 GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning Yue Liu et.al. 2505.11049 link
2025-05-16 Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere Li Ju et.al. 2505.11029 null
2025-05-15 MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models Mugilan Ganesan et.al. 2505.10526 null
2025-05-15 AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenge Ranjan Sapkota et.al. 2505.10468 null
2025-05-15 Vision language models have difficulty recognizing virtual objects Tyler Tran et.al. 2505.10453 null
2025-05-15 Coherent Language Reconstruction from Brain Recordings with Flexible Multi-Modal Input Stimuli Chunyu Ye et.al. 2505.10356 null
2025-05-15 MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models Yuncheng Guo et.al. 2505.10088 link
2025-05-15 AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection Bin-Bin Gao et.al. 2505.09926 link
2025-05-14 Unfettered Forceful Skill Acquisition with Physical Reasoning and Coordinate Frame Labeling William Xie et.al. 2505.09731 null
2025-05-14 ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation Enyu Zhao et.al. 2505.09698 null
2025-05-14 Variational Visual Question Answering Tobias Jan Wieczorek et.al. 2505.09591 null
2025-05-14 VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation Chaofan Zhang et.al. 2505.09577 null
2025-05-14 Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput Bo Zhang et.al. 2505.09498 null
2025-05-14 MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment Siyuan Yan et.al. 2505.09372 link
2025-05-14 Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition Muzammil Behzad et.al. 2505.09336 null
2025-05-14 MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning Bin-Bin Gao et.al. 2505.09265 null
2025-05-14 Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models Lucas Choi et.al. 2505.09139 null
2025-05-14 Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning Dayong Liang et.al. 2505.09118 null
2025-05-14 OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions Yuhang Wang et.al. 2505.09092 link
2025-05-13 Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training Yangyi Chen et.al. 2505.08971 link
2025-05-13 Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving Zongchuang Zhao et.al. 2505.08725 link
2025-05-13 OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning Zhaochen Su et.al. 2505.08617 link
2025-05-13 From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation Yifu Yuan et.al. 2505.08548 link
2025-05-13 Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? Md Tahmid Rahman Laskar et.al. 2505.08468 link
2025-05-13 MA-ROESL: Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos Xianghui Wang et.al. 2505.08367 null
2025-05-13 Removing Watermarks with Partial Regeneration using Semantic Information Krti Tallam et.al. 2505.08234 link
2025-05-13 CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding Wenxuan Ma et.al. 2505.08194 null
2025-05-13 DSADF: Thinking Fast and Slow for Decision Making Alex Zhihao Dou et.al. 2505.08189 null
2025-05-12 Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models Seungjae Lee et.al. 2505.07815 null
2025-05-12 Reproducibility, Replicability, and Insights into Visual Document Retrieval with Late Interaction Jingfen Qiao et.al. 2505.07730 link
2025-05-12 Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images Elisei Rykov et.al. 2505.07704 null
2025-05-12 Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models Songlin Dong et.al. 2505.07690 null
2025-05-12 Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization Seongjae Kang et.al. 2505.07675 link
2025-05-12 Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding Oriol Barbany et.al. 2505.07600 null
2025-05-12 Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning Bohan Wang et.al. 2505.07538 null
2025-05-12 QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads Khurram Mazher et.al. 2505.07531 null
2025-05-12 AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography Jiewen Yang et.al. 2505.07347 null
2025-05-12 Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning Xiaokun Wang et.al. 2505.07263 null
2025-05-09 MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks Wenqi Zeng et.al. 2505.06152 link
2025-05-09 TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations Shuaiyi Huang et.al. 2505.06079 null
2025-05-09 Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI Benjamin Raphael Ernhofer et.al. 2505.05895 link
2025-05-09 Describe Anything in Medical Images Xi Xiao et.al. 2505.05804 null
2025-05-09 3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks Vineet Bhat et.al. 2505.05800 null
2025-05-08 Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos Giulio Cesare Mastrocinque Santo et.al. 2505.05681 null
2025-05-08 VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models Faizan Farooq Khan et.al. 2505.05635 null
2025-05-08 Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments Pranav Guruprasad et.al. 2505.05540 link
2025-05-08 Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging Shiqi Chen et.al. 2505.05464 link
2025-05-08 SITE: towards Spatial Intelligence Thorough Evaluation Wenqi Wang et.al. 2505.05456 null
2025-05-08 DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning Wenru Liu et.al. 2505.05360 null
2025-05-08 Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization Sooyoung Park et.al. 2505.05343 link
2025-05-08 Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects Agnese Chiatti et.al. 2505.05318 null
2025-05-08 Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models Wei Peng et.al. 2505.05189 link
2025-05-08 OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning Cong Hua et.al. 2505.05180 link
2025-05-08 Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models Aishwarya Venkataramanan et.al. 2505.05163 link
2025-05-08 CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models Mengjun Yi et.al. 2505.05130 null
2025-05-08 X-Driver: Explainable Autonomous Driving with Vision-Language Models Wei Liu et.al. 2505.05098 null
2025-05-07 “I Can See Forever!”: Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments Ziyi Zhang et.al. 2505.04488 null
2025-05-07 DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception Junjie Wang et.al. 2505.04410 link
2025-05-07 CM1 – A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models Fabian Wolf et.al. 2505.04214 null
2025-05-07 R$^3$-VQA: “Read the Room” by Video Social Reasoning Lixing Niu et.al. 2505.04147 null
2025-05-06 X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains Qianchu Liu et.al. 2505.03981 null
2025-05-06 Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning François Role et.al. 2505.03703 null
2025-05-06 Distribution-Conditional Generation: From Class Distribution to Creative Generation Fu Feng et.al. 2505.03667 null
2025-05-06 Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images Fangling Jiang et.al. 2505.03611 null
2025-05-06 Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection Fangling Jiang et.al. 2505.03610 null
2025-05-06 LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs Xinyuan Zhang et.al. 2505.03460 null
2025-05-06 Mitigating Image Captioning Hallucinations in Vision-Language Models Fei Zhao et.al. 2505.03420 null
2025-05-07 Enhancing Target-unspecific Tasks through a Features Matrix Fangming Cui et.al. 2505.03414 null
2025-05-06 Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models Abram Schonfeldt et.al. 2505.03374 null
2025-05-06 A Vision-Language Model for Focal Liver Lesion Classification Song Jian et.al. 2505.03350 null
2025-05-06 From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection Guoting Wei et.al. 2505.03334 null
2025-05-05 LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery Jerome Quenum et.al. 2505.02829 null
2025-05-05 HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction Muhammad Haris Khan et.al. 2505.02569 null
2025-05-05 Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality Xueguang Ma et.al. 2505.02466 link
2025-05-05 Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey Chaohua Li et.al. 2505.02448 null
2025-05-05 SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing Ming Li et.al. 2505.02370 link
2025-05-05 TeDA: Boosting Vision-Lanuage Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment Zhichuan Wang et.al. 2505.02325 link
2025-05-04 Compositional Image-Text Matching and Retrieval by Grounding Entities Madhukar Reddy Vongala et.al. 2505.02278 null
2025-05-04 Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin Yuchen Wang et.al. 2505.02056 null
2025-05-04 A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models Liqiang Jing et.al. 2505.01958 null
2025-05-03 PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications Trisanth Srinivasan et.al. 2505.01881 null
2025-05-02 Dynamic Robot Tool Use with Vision Language Models Noah Trupin et.al. 2505.01399 null
2025-05-02 Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages Marco Salmè et.al. 2505.01096 null
2025-05-02 Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation Daniele Molino et.al. 2505.01091 null
2025-05-02 Transferable Adversarial Attacks on Black-Box Vision-Language Models Kai Hu et.al. 2505.01050 null
2025-05-01 Robotic Visual Instruction Yanbang Li et.al. 2505.00693 null
2025-05-01 Visual Test-time Scaling for GUI Agent Grounding Tiange Luo et.al. 2505.00684 link
2025-05-01 DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation Zixuan Chen et.al. 2505.00527 null
2025-05-01 LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving Zhijie Qiao et.al. 2505.00284 link
2025-05-01 AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care Md Asaduzzaman Jabin et.al. 2505.00275 link
2025-05-02 Empowering Agentic Video Analytics Systems with Video Language Models Yuxuan Yan et.al. 2505.00254 null
2025-04-30 V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving Jannik Lübberstedt et.al. 2505.00156 null
2025-04-30 Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models Minh-Hao Van et.al. 2505.00150 null
2025-04-30 Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design Vasudev Sharma et.al. 2505.00134 null
2025-04-30 Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization Anas Anwarul Haq Khan et.al. 2504.21831 null
2025-04-30 Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models Sangmin Woo et.al. 2504.21559 null
2025-04-30 RoboGround: Robotic Manipulation with Grounded Vision-Language Priors Haifeng Huang et.al. 2504.21530 null
2025-04-30 Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection Luoting Zhuang et.al. 2504.21344 null
2025-04-30 An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images Modesto Castrillón-Santana et.al. 2504.21309 null
2025-04-30 Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA Xuanzhao Dong et.al. 2504.21252 link
2025-04-29 MemeBLIP2: A novel lightweight multimodal system to detect harmful memes Jiaqi Liu et.al. 2504.21226 null
2025-04-29 GLIP-OOD: Zero-Shot Graph OOD Detection with Foundation Model Haoyan Xu et.al. 2504.21186 null
2025-04-29 Real-Time Wayfinding Assistant for Blind and Low-Vision Users Dabbrata Das et.al. 2504.20976 null
2025-04-29 FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models Mainak Singha et.al. 2504.20860 null
2025-04-29 In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer Zechuan Zhang et.al. 2504.20690 null
2025-04-29 SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data Michael Ogezi et.al. 2504.20648 null
2025-04-29 PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations Haowen Sun et.al. 2504.20520 null
2025-04-29 Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception Yuanchen Wu et.al. 2504.20468 link
2025-04-29 Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks Konstantinos I. Roumeliotis et.al. 2504.20419 null
2025-04-29 FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding Yanan Guo et.al. 2504.20384 null
2025-04-28 mrCAD: Multimodal Refinement of Computer-aided Designs William P. McCarthy et.al. 2504.20294 link
2025-04-28 A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports Henning Schäfer et.al. 2504.20220 null
2025-04-28 SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning Wufei Ma et.al. 2504.20024 null
2025-04-29 If Concept Bottlenecks are the Question, are Foundation Models the Answer? Nicola Debole et.al. 2504.19774 link
2025-04-28 EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia Valerie Zermatten et.al. 2504.19742 null
2025-04-28 Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model Muzammil Behzad et.al. 2504.19739 null
2025-04-28 VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning Run Luo et.al. 2504.19627 null
2025-04-28 LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning Peijian Zeng et.al. 2504.19524 null
2025-04-27 DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning Jialang Lu et.al. 2504.19127 null
2025-04-27 Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction Xiaoran Xu et.al. 2504.19086 null
2025-04-26 Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation Shahad Albastaki et.al. 2504.18856 link
2025-04-26 Video CLIP Model for Multi-View Echocardiography Interpretation Ryo Takizawa et.al. 2504.18800 null
2025-04-25 Generalization Capability for Imitation Learning Yixiao Wang et.al. 2504.18538 null
2025-04-25 Fast-Slow Thinking for Large Vision-Language Model Reasoning Wenyi Xiao et.al. 2504.18458 null
2025-04-25 Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation Peiyuan Jing et.al. 2504.18453 null
2025-04-25 HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding? Yusen Zhang et.al. 2504.18406 null
2025-04-25 Revisiting Data Auditing in Large Vision-Language Models Hongyu Zhu et.al. 2504.18349 null
2025-04-25 A Large Vision-Language Model based Environment Perception System for Visually Impaired People Zezhou Chen et.al. 2504.18027 null
2025-04-24 CAMU: Context Augmentation for Meme Understanding Girish A. Koushik et.al. 2504.17902 null
2025-04-25 Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction Yuanchang Ye et.al. 2504.17671 null
2025-04-24 SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting Yiming Zhao et.al. 2504.17395 null
2025-04-24 M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction Chengguang Gan et.al. 2504.17353 null
2025-04-24 DIMT25@ICDAR2025: HW-TSC’s End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model Zhanglin Wu et.al. 2504.17315 null
2025-04-24 Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning Lynn Cherif et.al. 2504.17282 null
2025-04-24 Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation Phillip Y. Lee et.al. 2504.17207 null
2025-04-23 Distilling semantically aware orders for autoregressive image generation Rishav Pramanik et.al. 2504.17069 null
2025-04-23 DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs Zhenhailong Wang et.al. 2504.17040 null
2025-04-24 V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations Zhiyuan Fan et.al. 2504.16727 null
2025-04-23 Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes Joan Perez et.al. 2504.16538 null
2025-04-23 TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance Meng Chu et.al. 2504.16505 null
2025-04-23 FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing Hariseetharam Gunduboina et.al. 2504.16433 null
2025-04-22 CLIP-IT: CLIP-based Pairing for Histology Images Classification Banafsheh Karimian et.al. 2504.16181 link
2025-04-22 MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention Yucheng Li et.al. 2504.16083 null
2025-04-22 MR. Video: “MapReduce” is the Principle for Long Video Understanding Ziqi Pang et.al. 2504.16082 null
2025-04-22 Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation Zhiyuan Hu et.al. 2504.16073 null
2025-04-22 Describe Anything: Detailed Localized Image and Video Captioning Long Lian et.al. 2504.16072 null
2025-04-22 Vision language models are unreliable at trivial spatial cognition Sangeet Khemlani et.al. 2504.16061 null
2025-04-22 Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation Ziqiao Ma et.al. 2504.16060 link
2025-04-22 Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis Frank Li et.al. 2504.16047 null
2025-04-22 LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale Joya Chen et.al. 2504.16030 null
2025-04-22 Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models Saban Ozturk et.al. 2504.15929 null
2025-04-22 VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining Zizhi Chen et.al. 2504.15545 null
2025-04-21 Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models Guo Chen et.al. 2504.15271 null
2025-04-21 KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking Juyeon Kim et.al. 2504.15135 link
2025-04-21 Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation Hong-Tao Yu et.al. 2504.14988 link
2025-04-21 VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform Xingyu Lu et.al. 2504.14904 null
2025-04-21 Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation Yunpu Zhao et.al. 2504.14848 null
2025-04-20 OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding Songtao Jiang et.al. 2504.14692 null
2025-04-20 NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation Junyuan Fang et.al. 2504.14638 null
2025-04-20 LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation Jiachen Li et.al. 2504.14467 null
2025-04-20 Neglected Risks: The Disturbing Reality of Children’s Images in Datasets and the Urgent Call for Accountability Carlos Caetano et.al. 2504.14446 null
2025-04-19 Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models Chung-En et.al. 2504.14395 null
2025-04-18 Analysing the Robustness of Vision-Language-Models to Common Corruptions Muhammad Usama et.al. 2504.13690 null
2025-04-18 EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model Sijing Li et.al. 2504.13650 link
2025-04-18 PV-VLM: A Multimodal Vision-Language Approach Incorporating Sky Images for Intra-Hour Photovoltaic Power Forecasting Huapeng Lin et.al. 2504.13624 null
2025-04-18 Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization Hongwei Ji et.al. 2504.13460 null
2025-04-18 Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety Shashank Shriram et.al. 2504.13399 link
2025-04-17 VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture Long Li et.al. 2504.13365 null
2025-04-17 Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models Chen Wang et.al. 2504.13351 null
2025-04-17 PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding Jang Hyun Cho et.al. 2504.13180 link
2025-04-17 Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling Tsung-Han Wu et.al. 2504.13169 link
2025-04-17 Complex-Edit: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark Siwei Yang et.al. 2504.13143 null
2025-04-17 Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training Xinsong Zhang et.al. 2504.13123 null
2025-04-17 Probing and Inducing Combinational Creativity in Vision-Language Models Yongqian Peng et.al. 2504.13120 null
2025-04-17 Object-Driven Narrative in AR: A Scenario-Metaphor Framework with VLM Integration Yusi Sun et.al. 2504.13119 null
2025-04-17 Early Accessibility: Automating Alt-Text Generation for UI Icons During App Development Sabrina Haque et.al. 2504.13069 null
2025-04-17 NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation Xiangyan Liu et.al. 2504.13055 null
2025-04-17 Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning Baining Zhao et.al. 2504.12680 link
2025-04-17 VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization Menglan Chen et.al. 2504.12661 null
2025-04-16 FLIP Reasoning Challenge Andreas Plesner et.al. 2504.12256 link
2025-04-16 Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models - Laura Fieback et.al. 2504.12137 null
2025-04-17 Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions Yifei Dong et.al. 2504.11967 null
2025-04-16 Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning Hairui Ren et.al. 2504.11930 link
2025-04-16 AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection Yuhao Chao et.al. 2504.11914 null
2025-04-16 A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification Bianca Lamm et.al. 2504.11838 null
2025-04-17 DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment Li Yu et.al. 2504.11733 null
2025-04-16 Interpreting the Linear Structure of Vision-language Model Embedding Spaces Isabel Papadimitriou et.al. 2504.11695 null
2025-04-16 VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps Biniam Fisseha Demissie et.al. 2504.11675 null
2025-04-15 Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation Amirhossein Dadashzadeh et.al. 2504.11669 null
2025-04-15 MINDS: The very low-mass star and brown dwarf sample – Hidden water in carbon-dominated protoplanetary disks Aditya M. Arabhavi et.al. 2504.11425 null
2025-04-15 From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation Jingkun Chen et.al. 2504.11368 null
2025-04-16 Seedream 3.0 Technical Report Yu Gao et.al. 2504.11346 null
2025-04-16 UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis Xinyi Liu et.al. 2504.11257 null
2025-04-15 R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning Lijun Sheng et.al. 2504.11195 link
2025-04-15 Benchmarking Vision Language Models on German Factual Data René Peinl et.al. 2504.11108 null
2025-04-16 Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR Yulong Zhang et.al. 2504.11101 null
2025-04-15 QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models Yudong Zhang et.al. 2504.11038 link
2025-04-15 MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique Shuhang Liu et.al. 2504.11009 null
2025-04-15 Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles Tonko E. W. Bossen et.al. 2504.10873 null
2025-04-14 Art3D: Training-Free 3D Generation from Flat-Colored Illustration Xiaoyan Cong et.al. 2504.10466 null
2025-04-14 Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding Tao Zhang et.al. 2504.10465 link
2025-04-15 GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents Xiaobo Xia et.al. 2504.10458 null
2025-04-15 Efficient Prompt Tuning for Hierarchical Ingredient Recognition Yinxuan Gui et.al. 2504.10322 null
2025-04-14 SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model Zongcan Ding et.al. 2504.10320 null
2025-04-15 Breaking the Data Barrier – Building GUI Agents Through Task Generalization Junlei Zhang et.al. 2504.10127 link
2025-04-14 AGO: Adaptive Grounding for Open World 3D Occupancy Prediction Peizheng Li et.al. 2504.10117 null
2025-04-14 CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography I-Sheng Fang et.al. 2504.10090 null
2025-04-14 Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure Théo Gigant et.al. 2504.10049 null
2025-04-14 Aligning Anime Video Generation with Human Feedback Bingwen Zhu et.al. 2504.10044 null
2025-04-11 Generating Fine Details of Entity Interactions Xinyi Gu et.al. 2504.08714 null
2025-04-11 AstroLLaVA: towards the unification of astronomical data and natural language Sharaf Zaman et.al. 2504.08583 null
2025-04-11 FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations Cheng-Yu Hsieh et.al. 2504.08368 null
2025-04-11 DSM: Building A Diverse Semantic Map for 3D Visual Grounding Qinghongbing Xie et.al. 2504.08307 null
2025-04-11 EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models Minjae Seo et.al. 2504.08205 null
2025-04-10 Investigating Vision-Language Model for Point Cloud-based Vehicle Classification Yiqiao Li et.al. 2504.08154 null
2025-04-10 The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search Yutaro Yamada et.al. 2504.08066 link
2025-04-10 VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning Yukun Qi et.al. 2504.07956 null
2025-04-10 SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement Xiyao Wang et.al. 2504.07934 link
2025-04-10 SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos Joshua Li et.al. 2504.07867 null
2025-04-10 CollEX – A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections Florian Schneider et.al. 2504.07643 null
2025-04-10 VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model Haozhan Shen et.al. 2504.07615 link
2025-04-10 TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs Zijian Zhang et.al. 2504.07556 null
2025-04-10 Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models Yuxiang Lin et.al. 2504.07521 link
2025-04-10 Kimi-VL Technical Report Kimi Team et.al. 2504.07491 link
2025-04-09 Perception in Reflection Yana Wei et.al. 2504.07165 null
2025-04-09 Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation Israfel Salazar et.al. 2504.07072 null
2025-04-09 Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition Sergio Romero-Tapiador et.al. 2504.06925 null
2025-04-09 MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking Chang Nie et.al. 2504.06863 null
2025-04-09 ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models Seonghwan Park et.al. 2504.06838 null
2025-04-09 LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding Ziyi Wang et.al. 2504.06835 null
2025-04-08 PromptHMR: Promptable Human Mesh Recovery Yufu Wang et.al. 2504.06397 null
2025-04-08 SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation Hritam Basak et.al. 2504.06389 null
2025-04-08 OmniSVG: A Unified Scalable Vector Graphics Generation Model Yiying Yang et.al. 2504.06263 null
2025-04-08 Latent Multimodal Reconstruction for Misinformation Detection Stefanos-Iordanis Papadopoulos et.al. 2504.06010 link
2025-04-08 econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians Can Zhang et.al. 2504.06003 null
2025-04-08 Measuring Déjà vu Memorization Efficiently Narine Kokhlikyan et.al. 2504.05651 null
2025-04-08 A Lightweight Large Vision-language Model for Multimodal Medical Images Belal Alsinglawi et.al. 2504.05575 null
2025-04-07 ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering Ahmed Masry et.al. 2504.05506 link
2025-04-07 Trust Through Transparency: Explainable Social Navigation for Autonomous Mobile Robots via Vision-Language Models Oluwadamilola Sotomi et.al. 2504.05477 null
2025-04-07 REVEAL: Relation-based Video Representation Learning for Video-Question-Answering Sofian Chaybouti et.al. 2504.05463 null
2025-04-07 Taxonomy-Aware Evaluation of Vision-Language Models Vésteinn Snæbjarnarson et.al. 2504.05457 null
2025-04-07 Probing the Visualization Literacy of Vision Language Models: the Good, the Bad, and the Ugly Lianghan Dong et.al. 2504.05445 null
2025-04-07 InteractVLM: 3D Interaction Reasoning from 2D Foundational Models Sai Kumar Dwivedi et.al. 2504.05303 link
2025-04-07 SmolVLM: Redefining small and efficient multimodal models Andrés Marafioti et.al. 2504.05299 null
2025-04-07 A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text? Julio Silva-Rodríguez et.al. 2504.05227 null
2025-04-07 Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation Jiaming Chen et.al. 2504.05225 link
2025-04-07 EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively Bingyang Wang et.al. 2504.05141 null
2025-04-07 RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model Congcong Wen et.al. 2504.04988 null
2025-04-08 A Taxonomy of Self-Handover Naoki Wake et.al. 2504.04939 null
2025-04-07 SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models Justus Westerhoff et.al. 2504.04893 link
2025-04-07 Don’t Lag, RAG: Training-Free Adversarial Detection Using RAG Roie Kazoom et.al. 2504.04858 null
2025-04-07 OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance Chaoyi Wang et.al. 2504.04781 null
2025-04-04 Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models Mirko Borszukovszki et.al. 2504.03440 null
2025-04-04 SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding Yimin Wei et.al. 2504.03254 link
2025-04-04 Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators Linfeng Zhao et.al. 2504.03245 null
2025-04-04 Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation Xin Zhang et.al. 2504.03193 link
2025-04-04 REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval Shabnam Choudhury et.al. 2504.03169 null
2025-04-07 NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving Kexin Tian et.al. 2504.03164 null
2025-04-04 TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference Junshan Hu et.al. 2504.03154 null
2025-04-04 MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories Natalie Tirabassi et.al. 2504.03153 null
2025-04-03 QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding Binh M. Le et.al. 2504.02971 null
2025-04-03 STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection Divya Velayudhan et.al. 2504.02823 null
2025-04-03 Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models Mateusz Pach et.al. 2504.02821 link
2025-04-03 Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence Anita Rau et.al. 2504.02799 null
2025-04-03 Robot-Led Vision Language Model Wellbeing Assessment of Children Nida Itrat Abbasi et.al. 2504.02765 null
2025-04-04 Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme Yan Ma et.al. 2504.02587 link
2025-04-03 Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision Xiaofeng Han et.al. 2504.02477 null
2025-04-03 Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation Chuanqi Cheng et.al. 2504.02438 link
2025-04-03 ReuseDroid: A VLM-empowered Android UI Test Migrator Boosted by Active Feedback Xiaolei Li et.al. 2504.02357 null
2025-04-03 Large (Vision) Language Models are Unsupervised In-Context Learners Artyom Gadetsky et.al. 2504.02349 link
2025-04-03 Re-thinking Temporal Search for Long-Form Video Understanding Jinhui Ye et.al. 2504.02259 link
2025-04-02 FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs Mothilal Asokan et.al. 2504.01916 link
2025-04-02 Is Temporal Prompting All We Need For Limited Labeled Action Recognition? Shreyank N Gowda et.al. 2504.01890 null
2025-04-02 Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images Nusrat Munia et.al. 2504.01838 link
2025-04-02 BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing Yunqi Gu et.al. 2504.01786 link
2025-04-02 AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization Chaohu Liu et.al. 2504.01735 null
2025-04-02 Reasoning LLMs for User-Aware Multimodal Conversational Agents Hamed Rahimi et.al. 2504.01700 null
2025-04-02 CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition Sarah Alyami et.al. 2504.01666 link
2025-04-02 BioAtt: Anatomical Prior Driven Low-Dose CT Denoising Namhun Kim et.al. 2504.01662 null
2025-04-02 Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models Zhaochen Wang et.al. 2504.01589 null
2025-04-02 STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation Dandan Shan et.al. 2504.01561 link
2025-03-31 SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation Ngoc Dung Huynh et.al. 2503.24164 null
2025-03-31 AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference Kai Huang et.al. 2503.23956 null
2025-03-31 HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment Zhichao Liao et.al. 2503.23907 null
2025-03-31 Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation Yongle Li et.al. 2503.23869 null
2025-04-01 Evaluating small vision-language models as AI assistants for radio astronomical source analysis tasks S. Riggi et.al. 2503.23859 link
2025-03-31 Texture or Semantics? Vision-Language Models Get Lost in Font Recognition Zhecheng Li et.al. 2503.23768 null
2025-03-31 KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language Yoonshik Kim et.al. 2503.23730 link
2025-03-31 From Geometry to Culture: An Iterative VLM Layout Framework for Placing Objects in Complex 3D Scene Contexts Yuto Asano et.al. 2503.23707 null
2025-03-30 Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging Amar Kumar et.al. 2503.23618 null
2025-03-30 DASH: Detection and Assessment of Systematic Hallucinations of VLMs Maximilian Augustin et.al. 2503.23573 link
2025-03-28 Understanding Co-speech Gestures in-the-wild Sindhu B Hegde et.al. 2503.22668 null
2025-03-28 Unicorn: Text-Only Data Synthesis for Vision Language Model Training Xiaomin Yu et.al. 2503.22655 link
2025-03-28 Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization Iñigo Pikabea et.al. 2503.22577 null
2025-03-28 VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection Bin Zhang et.al. 2503.22291 null
2025-03-28 FLIP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning Dongping Liao et.al. 2503.22263 link
2025-03-28 Multi-modal Knowledge Distillation-based Human Trajectory Forecasting Jaewoo Jeong et.al. 2503.22201 link
2025-03-28 REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation Puzhen Yuan et.al. 2503.22122 null
2025-03-28 How Well Can Vision-Language Models Understand Humans’ Intention? An Open-ended Theory of Mind Question Evaluation Benchmark Ximing Wen et.al. 2503.22093 null
2025-03-28 A Survey on Remote Sensing Foundation Models: From Vision to Multimodality Ziyue Huang et.al. 2503.22081 link
2025-03-27 CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models Qingqing Zhao et.al. 2503.22020 null
2025-03-27 Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck Adrian Bulat et.al. 2503.21757 null
2025-03-27 VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness Dian Zheng et.al. 2503.21755 link
2025-03-27 Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving Yue Li et.al. 2503.21505 link
2025-03-27 BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding Shuming Liu et.al. 2503.21483 link
2025-03-27 Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models Ruizhou Li et.al. 2503.21435 null
2025-03-27 Cultivating Game Sense for Yourself: Making VLMs Gaming Experts Wenxuan Lu et.al. 2503.21263 null
2025-03-27 LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models Hengyuan Zhao et.al. 2503.21227 null
2025-03-27 VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation Alan Dao et.al. 2503.21214 null
2025-03-27 FakeReasoning: Towards Generalizable Forgery Detection and Reasoning Yueying Gao et.al. 2503.21210 null
2025-03-27 VideoMix: Aggregating How-To Videos for Task-Oriented Learning Saelyne Yang et.al. 2503.21130 null
2025-03-27 Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning Huajie Tan et.al. 2503.20752 null
2025-03-26 MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion Saron Samuel et.al. 2503.20698 null
2025-03-26 AutoRad-Lung: A Radiomic-Guided Prompting Autoregressive Vision-Language Model for Lung Nodule Malignancy Prediction Sadaf Khademi et.al. 2503.20662 null
2025-03-26 IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting Hao Fu et.al. 2503.20612 link
2025-03-26 Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models Fanhu Zeng et.al. 2503.20492 null
2025-03-26 Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding Joao Pereira et.al. 2503.20362 null
2025-03-26 sudo rm -rf agentic_security Sejin Lee et.al. 2503.20279 link
2025-03-26 LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions Yejin Kwon et.al. 2503.20252 null
2025-03-26 Learning Adaptive Dexterous Grasping from Single Demonstrations Liangzhi Shi et.al. 2503.20208 null
2025-03-26 Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector Xiao Guo et.al. 2503.20188 link
2025-03-25 CoLLM: A Large Language Model for Composed Image Retrieval Chuong Huynh et.al. 2503.19910 link
2025-03-25 CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning Hao Yu et.al. 2503.19900 link
2025-03-25 Towards Online Multi-Modal Social Interaction Understanding Xinpeng Li et.al. 2503.19851 link
2025-03-25 FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs Carlos Plou et.al. 2503.19850 null
2025-03-25 FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model Jun Zhou et.al. 2503.19839 null
2025-03-25 LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation Vladan Stojnić et.al. 2503.19777 link
2025-03-25 ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation Haoyu Fu et.al. 2503.19755 null
2025-03-25 Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models Ilias Stogiannidis et.al. 2503.19707 link
2025-03-25 fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models Saurav Sharma et.al. 2503.19670 null
2025-03-25 RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models Mehdi Moshtaghi et.al. 2503.19654 null
2025-03-25 MC-LLaVA: Multi-Concept Personalized Vision-Language Model Ruichuan An et.al. 2503.18854 link
2025-03-24 Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations Jeonghyeon Kim et.al. 2503.18817 link
2025-03-24 SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection Shrikant Malviya et.al. 2503.18812 link
2025-03-24 ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models Danrui Li et.al. 2503.18680 link
2025-03-25 Feature Calibration enhanced Parameter Synthesis for CLIP-based Class-incremental Learning Juncen Guo et.al. 2503.18672 null
2025-03-24 Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks Nina Shvetsova et.al. 2503.18637 null
2025-03-24 Training-Free Personalization via Retrieval and Reasoning on Fingerprints Deepayan Das et.al. 2503.18623 null
2025-03-24 Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding Tianyu Chen et.al. 2503.18578 null
2025-03-24 Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models Bin Li et.al. 2503.18556 null
2025-03-24 MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning Dawei Yan et.al. 2503.18533 null
2025-03-21 OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement Yihe Deng et.al. 2503.17352 link
2025-03-21 Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models Jianing Qi et.al. 2503.17349 null
2025-03-21 Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology Devavrat Tomar et.al. 2503.17238 link
2025-03-21 Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models Davide Berasi et.al. 2503.17142 link
2025-03-21 Beyond Accuracy: What Matters in Designing Well-Behaved Models? Robin Hesse et.al. 2503.17110 null
2025-03-21 ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail Chandan Yeshwanth et.al. 2503.17044 null
2025-03-21 When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making Zhe Hu et.al. 2503.16965 null
2025-03-21 PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition Ibtissam Saadi et.al. 2503.16945 null
2025-03-21 Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks Haijin Zeng et.al. 2503.16930 null
2025-03-21 Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification Dongseob Kim et.al. 2503.16873 null
2025-03-20 M3: 3D-Spatial MultiModal Memory Xueyan Zou et.al. 2503.16413 link
2025-03-20 Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them Guanyu Chen et.al. 2503.16401 null
2025-03-20 JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse Muyao Li et.al. 2503.16365 null
2025-03-20 Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model Zhaochong An et.al. 2503.16282 link
2025-03-20 Probabilistic Prompt Distribution Learning for Animal Pose Estimation Jiyong Rao et.al. 2503.16120 link
2025-03-20 MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures Lucas Morin et.al. 2503.16096 link
2025-03-20 STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding Zichen Liu et.al. 2503.15973 link
2025-03-20 Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation Clive Tinashe Marimo et.al. 2503.15969 null
2025-03-20 CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention Yaxiong Chen et.al. 2503.15949 link
2025-03-20 Don’t Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts Elisei Rykov et.al. 2503.15948 link
2025-03-19 TULIP: Towards Unified Language-Image Pretraining Zineng Tang et.al. 2503.15485 null
2025-03-19 Visual Persona: Foundation Model for Full-Body Human Customization Jisu Nam et.al. 2503.15406 null
2025-03-19 EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models Yinan Liang et.al. 2503.15369 null
2025-03-19 SemEval-2025 Task 1: AdMIRe – Advancing Multimodal Idiomaticity Representation Thomas Pickard et.al. 2503.15358 null
2025-03-19 TruthLens: A Training-Free Paradigm for DeepFake Detection Ritabrata Chakraborty et.al. 2503.15342 null
2025-03-19 Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport Hao Tan et.al. 2503.15337 link
2025-03-19 Sig2text, a Vision-language model for Non-cooperative Radar Signal Parsing Hancong Feng et.al. 2503.15213 link
2025-03-19 Context-Aware Vision Language Foundation Models for Ocular Disease Screening in Retinal Images Lucie Berger et.al. 2503.15212 null
2025-03-19 A Unified Framework for Real-Time Failure Handling in Robotics Using Vision-Language Models, Reactive Planner and Behavior Trees Faseeh Ahmad et.al. 2503.15202 null
2025-03-19 Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU Àlex Pujol Vidal et.al. 2503.15166 null
2025-03-18 ExDDV: A New Dataset for Explainable Deepfake Detection in Video Vlad Hondru et.al. 2503.14421 link
2025-03-18 CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models Yiqi Zhu et.al. 2503.14161 link
2025-03-18 Growing a Twig to Accelerate Large Vision-Language Models Zhenwei Shao et.al. 2503.14075 null
2025-03-18 MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding Siwei Han et.al. 2503.13964 link
2025-03-18 Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models Yuxiang Lai et.al. 2503.13939 null
2025-03-18 ChatBEV: A Visual Language Model that Understands BEV Maps Qingyao Xu et.al. 2503.13938 null
2025-03-18 Where do Large Vision-Language Models Look at when Answering Questions? Xiaoying Xing et.al. 2503.13891 link
2025-03-18 RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving Yujin Wang et.al. 2503.13861 null
2025-03-18 VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences Anukriti Singh et.al. 2503.13817 null
2025-03-18 Identifying and Mitigating Position Bias of Multi-image Vision-Language Models Xinyu Tian et.al. 2503.13792 null
2025-03-17 DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models Haoyang Li et.al. 2503.13443 link
2025-03-18 MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling Yingyue Li et.al. 2503.13440 link
2025-03-17 Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions Wan Ju Kang et.al. 2503.13369 null
2025-03-17 Agents Play Thousands of 3D Video Games Zhongwen Xu et.al. 2503.13356 null
2025-03-17 HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning Wensheng Wang et.al. 2503.13171 null
2025-03-17 Free-form language-based robotic reasoning and grasping Runyu Jiao et.al. 2503.13082 null
2025-03-17 Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari Harshal Kausadikar et.al. 2503.13060 null
2025-03-17 InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving Ruiqi Song et.al. 2503.13047 null
2025-03-17 Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization Ruichuan An et.al. 2503.12999 null
2025-03-17 How Good is my Histopathology Vision-Language Foundation Model? A Holistic Benchmark Roba Al Majzoub et.al. 2503.12990 link
2025-03-14 Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense Shuyang Hao et.al. 2503.11619 null
2025-03-14 Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages Matteo Farina et.al. 2503.11609 link
2025-03-14 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion Ahmed Nassar et.al. 2503.11576 null
2025-03-14 Similarity-Aware Token Pruning: Your VLM but Faster Ahmadreza Jeddi et.al. 2503.11549 link
2025-03-14 Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models Hao Cheng et.al. 2503.11519 null
2025-03-14 PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture Xiaokang Wei et.al. 2503.11368 null
2025-03-14 PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vision Language Models Mayank Nautiyal et.al. 2503.11360 null
2025-03-14 Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset Yibing Weng et.al. 2503.11342 null
2025-03-14 DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models Xirui Zhou et.al. 2503.11265 null
2025-03-14 Compound Expression Recognition via Large Vision-Language Models Jun Yu et.al. 2503.11241 null
2025-03-13 A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1 Zhaoyi Li et.al. 2503.10635 link
2025-03-13 HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model Jiaming Liu et.al. 2503.10631 null
2025-03-13 CoSTA*: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing Advait Gupta et.al. 2503.10613 link
2025-03-13 TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention Jinhao Duan et.al. 2503.10602 link
2025-03-13 GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding Rui Hu et.al. 2503.10596 link
2025-03-13 VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search Yiming Jia et.al. 2503.10582 null
2025-03-13 KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation Zixian Liu et.al. 2503.10546 null
2025-03-13 Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction Yuhan Wang et.al. 2503.10508 null
2025-03-13 World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning Siyin Wang et.al. 2503.10480 null
2025-03-13 VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan Zhi Rui Tam et.al. 2503.10427 link
2025-03-12 SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment Katrin Renz et.al. 2503.09594 null
2025-03-12 MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions Zhe Xu et.al. 2503.09499 link
2025-03-12 BAMBI: Developing Baby Language Models for Italian Alice Suozzi et.al. 2503.09481 null
2025-03-12 SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery Jiayuan Huang et.al. 2503.09474 null
2025-03-12 Astrea: A MOE-based Visual Understanding Model with Progressive Alignment Xiaoda Yang et.al. 2503.09445 null
2025-03-12 Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models Julian Spravil et.al. 2503.09443 null
2025-03-12 OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment Qi Liu et.al. 2503.09416 null
2025-03-12 Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models Xiaozhen Qiao et.al. 2503.09394 null
2025-03-12 VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers Ruanjun Li et.al. 2503.09387 null
2025-03-13 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos Marvin Heidinger et.al. 2503.09320 null
2025-03-11 Generating Robot Constitutions & Benchmarks for Semantic Safety Pierre Sermanet et.al. 2503.08663 null
2025-03-11 TLA: Tactile-Language-Action Model for Contact-Rich Manipulation Peng Hao et.al. 2503.08548 null
2025-03-11 GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training Tong Wei et.al. 2503.08525 null
2025-03-11 External Knowledge Injection for CLIP-Based Class-Incremental Learning Da-Wei Zhou et.al. 2503.08510 link
2025-03-11 MMRL: Multi-Modal Representation Learning for Vision-Language Models Yuncheng Guo et.al. 2503.08497 link
2025-03-11 SuperCap: Multi-resolution Superpixel-based Image Captioning Henry Senior et.al. 2503.08496 null
2025-03-11 TT-GaussOcc: Test-Time Compute for Self-Supervised Occupancy Prediction via Spatio-Temporal Gaussian Splatting Fengyi Zhang et.al. 2503.08485 null
2025-03-11 PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability Weijie Zhou et.al. 2503.08481 link
2025-03-11 Debiased Prompt Tuning in Vision-Language Model without Annotations Chaoquan Jiang et.al. 2503.08368 null
2025-03-11 Trinity: A Modular Humanoid Robot AI System Jingkai Sun et.al. 2503.08338 null
2025-03-10 Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction Zongzheng Zhang et.al. 2503.07485 link
2025-03-10 REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding Yan Tai et.al. 2503.07413 link
2025-03-11 FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates Sangwoo Park et.al. 2503.07216 null
2025-03-10 Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation Sihao Lin et.al. 2503.07125 null
2025-03-11 PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM Alan Dao et.al. 2503.07111 null
2025-03-10 XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition Chuanming Wang et.al. 2503.07075 null
2025-03-10 Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning Huilin Deng et.al. 2503.07065 link
2025-03-10 CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model Shihao Hou et.al. 2503.06993 null
2025-03-10 A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis Xiang Liu et.al. 2503.06973 link
2025-03-10 Handle Object Navigation as Weighted Traveling Repairman Problem Ruimeng Liu et.al. 2503.06937 link
2025-03-07 VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method Weiyu Ma et.al. 2503.05383 link
2025-03-07 Data-Efficient Generalization for Zero-shot Composed Image Retrieval Zining Chen et.al. 2503.05204 null
2025-03-07 Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions Chan hur et.al. 2503.05186 null
2025-03-07 Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models Messi H. J. Lee et.al. 2503.05093 null
2025-03-07 Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation Qingxuan Jia et.al. 2503.05064 null
2025-03-06 LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression Souvik Kundu et.al. 2503.04982 null
2025-03-06 Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach Soumyadeep Ro et.al. 2503.04918 null
2025-03-06 LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM Sambal Shikhar et.al. 2503.04724 null
2025-03-06 A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning Qing Zhou et.al. 2503.04592 null
2025-03-06 AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM Sunghyun Ahn et.al. 2503.04504 link
2025-03-06 TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction Chao Wang et.al. 2503.04457 null
2025-03-06 From Idea to CAD: A Language Model-Driven Multi-Agent System for Collaborative Design Felix Ocker et.al. 2503.04417 null
2025-03-07 Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models Niccolò Turcato et.al. 2503.04280 null
2025-03-06 An Egocentric Vision-Language Model based Portable Real-time Smart Assistant Yifei Huang et.al. 2503.04250 link
2025-03-06 Synthetic Data is an Elegant GIFT for Continual Vision-Language Models Bin Wu et.al. 2503.04229 null
2025-03-06 EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models Haiyang Yu et.al. 2503.04058 null
2025-03-06 RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models Wenhui Zhu et.al. 2503.03987 null
2025-03-05 CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning Yuqi Zhou et.al. 2503.03743 link
2025-03-05 OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction Huang Huang et.al. 2503.03734 null
2025-03-05 Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions Jun Li et.al. 2503.03278 null
2025-03-05 Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings Sneh Pillai et.al. 2503.03202 null
2025-03-05 SpiritSight Agent: Advanced GUI Agent with One Look Zhiyuan Huang et.al. 2503.03196 null
2025-03-04 LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications Danqing Zhang et.al. 2503.02950 link
2025-03-05 FairSense-AI: Responsible AI Meets Sustainability Shaina Raza et.al. 2503.02865 null
2025-03-04 “What If Smart Homes Could See Our Homes?”: Exploring DIY Smart Home Building Experiences with VLM-Based Camera Sensors Sojeong Yun et.al. 2503.02816 null
2025-03-04 Bridging VLM and KMP: Enabling Fine-grained robotic manipulation via Semantic Keypoints Representation Junjie Zhu et.al. 2503.02748 null
2025-03-04 ImpedanceGPT: VLM-driven Impedance Control of Swarm of Mini-drones for Intelligent Navigation in Dynamic Environment Faryal Batool et.al. 2503.02723 null
2025-03-04 MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments Ege Özsoy et.al. 2503.02579 link
2025-03-04 UAV-VLRR: Vision-Language Informed NMPC for Rapid Response in UAV Search and Rescue Yasheerah Yaqoot et.al. 2503.02465 null
2025-03-04 UAV-VLPA*: A Vision-Language-Path-Action System for Optimal Route Generation on a Large Scale Oleg Sautenkov et.al. 2503.02454 null
2025-03-04 Vision-Language Model IP Protection via Prompt-based Learning Lianyu Wang et.al. 2503.02393 null
2025-03-04 Are Large Vision Language Models Good Game Players? Xinyu Wang et.al. 2503.02358 null
2025-03-04 BiasICL: In-Context Learning and Demographic Biases of Vision Language Models Sonnet Xu et.al. 2503.02334 null
2025-02-28 Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos Zhiyu Tan et.al. 2502.21314 null
2025-02-28 PET Image Denoising via Text-Guided Diffusion: Integrating Anatomical Priors through Text Prompts Boxiao Yu et.al. 2502.21260 null
2025-02-28 FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts Ziyi Zhang et.al. 2502.21059 null
2025-02-28 DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping Yifan Zhong et.al. 2502.20900 null
2025-02-28 VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation Anh Tien Nguyen et.al. 2502.20850 null
2025-02-28 CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval Zelong Sun et.al. 2502.20826 null
2025-02-28 MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models Qiao Yan et.al. 2502.20780 link
2025-02-28 Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow Jiaqi Bai et.al. 2502.20750 link
2025-02-28 T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting Yifei Qian et.al. 2502.20625 null
2025-02-27 Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Inference Mingyuan Wu et.al. 2502.20587 null
2025-02-27 Visual Adaptive Prompting for Compositional Zero-Shot Learning Kyle Stein et.al. 2502.20292 null
2025-02-27 Explainable, Multi-modal Wound Infection Classification from Images Augmented with Generated Captions Palawat Busaranuvong et.al. 2502.20277 null
2025-02-27 VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers Ziang Guo et.al. 2502.20108 null
2025-02-27 Vision-Encoders (Already) Know What They See: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore Hongseok Oh et.al. 2502.20034 link
2025-02-27 ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models Ke Niu et.al. 2502.19958 null
2025-02-27 Space Rotation with Basis Transformation for Training-free Test-Time Adaptation Chenhao Ding et.al. 2502.19946 null
2025-02-27 ProAPO: Progressively Automatic Prompt Optimization for Visual Classification Xiangyan Qu et.al. 2502.19844 link
2025-02-27 Analyzing CLIP’s Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study Reza Abbasi et.al. 2502.19828 null
2025-02-27 Mixtera: A Data Plane for Foundation Model Training Maximilian Böther et.al. 2502.19790 link
2025-02-27 Open-Vocabulary Semantic Part Segmentation of 3D Human Keito Suzuki et.al. 2502.19782 null
2025-02-26 Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models Lucy Xiaoyang Shi et.al. 2502.19417 null
2025-02-26 Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions Ruben T. Lucassen et.al. 2502.19293 null
2025-02-26 On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation Ruben T. Lucassen et.al. 2502.19285 null
2025-02-26 Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models Jiawei Kong et.al. 2502.19269 null
2025-02-26 FungalZSL: Zero-Shot Fungal Classification with Image Captioning Using a Synthetic Data Approach Anju Rani et.al. 2502.19038 null
2025-02-26 VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model Jiani Zheng et.al. 2502.18906 null
2025-02-26 Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP Chenyang Zhao et.al. 2502.18816 null
2025-02-26 QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries Nicolas Harvey Chapman et.al. 2502.18735 null
2025-02-25 olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models Jake Poznanski et.al. 2502.18443 link
2025-02-25 WebGames: Challenging General-Purpose Web-Browsing AI Agents George Thomas et.al. 2502.18356 link
2025-02-25 Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models Zhaoyi Liu et.al. 2502.18290 null
2025-02-25 NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts Muhammad Farid Adilazuarda et.al. 2502.18148 null
2025-02-25 Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models Cao Yuxuan et.al. 2502.18101 link
2025-02-25 VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion Pei Liu et.al. 2502.18042 null
2025-02-25 UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting Haoyuan Li et.al. 2502.17860 null
2025-02-25 Can Multimodal LLMs Perform Time Series Anomaly Detection? Xiongxiao Xu et.al. 2502.17812 link
2025-02-24 Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures Akhila Yerukola et.al. 2502.17710 link
2025-02-24 METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling Bingxuan Li et.al. 2502.17651 null
2025-02-24 DIS-CO: Discovering Copyrighted Content in VLMs Training Data André V. Duarte et.al. 2502.17358 link
2025-02-24 Applications of Large Models in Medicine YunHe Su et.al. 2502.17132 null
2025-02-24 Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI Syed Abdul Gaffar Shakhadri et.al. 2502.17092 null
2025-02-24 Evolution 6.0: Evolving Robotic Capabilities Through Generative Design Muhammad Haris Khan et.al. 2502.17034 null
2025-02-24 All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark Davide Testa et.al. 2502.16989 null
2025-02-24 SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models Kevin Miller et.al. 2502.16911 null
2025-02-24 Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models Yaqi Sun et.al. 2502.16842 null
2025-02-24 CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification Liping Lu et.al. 2502.16815 null
2025-02-23 Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation Yunhai Feng et.al. 2502.16707 null
2025-02-23 Can Large Vision-Language Models Detect Images Copyright Infringement from GenAI? Qipan Xu et.al. 2502.16618 null
2025-02-21 ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval Guanqi Zhan et.al. 2502.15682 null
2025-02-21 Testing the limits of fine-tuning to improve reasoning in vision language models Luca M. Schulze Buschoff et.al. 2502.15678 null
2025-02-21 Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation Tim Rädsch et.al. 2502.15563 null
2025-02-21 FaultGPT: Industrial Fault Diagnosis Question Answering System by Vision Language Models Jiao Chen et.al. 2502.15481 null
2025-02-21 Chitrarth: Bridging Vision and Language for a Billion People Shaharukh Khan et.al. 2502.15392 null
2025-02-21 The Role of Background Information in Reducing Object Hallucination in Vision-Language Models: Insights from Cutoff API Prompting Masayo Tomita et.al. 2502.15389 null
2025-02-21 CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models Shunchang Liu et.al. 2502.15278 null
2025-02-21 The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning Sheila Schoepp et.al. 2502.15214 null
2025-02-21 PairBench: A Systematic Framework for Selecting Reliable Judge VLMs Aarash Feizi et.al. 2502.15210 null
2025-02-21 CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models Zihao Sheng et.al. 2502.15119 null
2025-02-20 Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation Yue Yang et.al. 2502.14846 null
2025-02-20 LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models Shangqing Tu et.al. 2502.14834 link
2025-02-20 SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features Michael Tschannen et.al. 2502.14786 link
2025-02-20 ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting Abhijit Mishra et.al. 2502.14780 null
2025-02-21 HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States Yilei Jiang et.al. 2502.14744 link
2025-02-20 NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization Zheyuan Zhang et.al. 2502.14638 link
2025-02-20 Noisy Test-Time Adaptation in Vision-Language Models Chentao Cao et.al. 2502.14604 link
2025-02-20 PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models Yu Meng et.al. 2502.14504 null
2025-02-20 How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation Zhuohang Long et.al. 2502.14486 null
2025-02-20 Evaluating Precise Geolocation Inference Capabilities of Vision Language Models Neel Jay et.al. 2502.14412 link
2025-02-19 Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images Shengguang Wu et.al. 2502.13928 null
2025-02-19 NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants Yiran Qin et.al. 2502.13894 null
2025-02-19 SPEX: Scaling Feature Interaction Explanations for LLMs Justin Singh Kang et.al. 2502.13870 link
2025-02-19 Quantifying Memorization and Retriever Performance in Retrieval-Augmented Vision-Language Models Peter Carragher et.al. 2502.13836 null
2025-02-19 GIMMICK – Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking Florian Schneider et.al. 2502.13766 null
2025-02-19 VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation Wei Zhao et.al. 2502.13508 link
2025-02-19 Transferring Textual Preferences to Vision-Language Understanding through Model Merging Chen-An Li et.al. 2502.13487 null
2025-02-19 Estimating Commonsense Plausibility through Semantic Shifts Wanqing Cui et.al. 2502.13464 null
2025-02-19 MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation Lingfeng Zhang et.al. 2502.13451 null
2025-02-19 Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning Hao Ma et.al. 2502.13430 null
2025-02-18 Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization Shuo Xing et.al. 2502.13146 link
2025-02-18 Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation Bencheng Liao et.al. 2502.13145 link
2025-02-18 SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation Zekun Qi et.al. 2502.13143 null
2025-02-18 Understanding and Rectifying Safety Perception Distortion in VLMs Xiaohan Zou et.al. 2502.13095 null
2025-02-18 MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching Fabian David Schmidt et.al. 2502.12852 null
2025-02-18 Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum Yihong Liu et.al. 2502.12599 null
2025-02-18 CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base Cong-Duy Nguyen et.al. 2502.12591 null
2025-02-18 Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning Mengshi Qi et.al. 2502.12425 link
2025-02-17 LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models Zongyu Wu et.al. 2502.12359 null
2025-02-17 Discriminative-Generative Custom Tokens for Vision-Language Models Pramuditha Perera et.al. 2502.12095 null
2025-02-17 VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues Jianshu Zhang et.al. 2502.12084 null
2025-02-17 Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning Jiayang Zhang et.al. 2502.12049 link
2025-02-17 Learning Generalizable Prompt for CLIP with Class Similarity Knowledge Sehun Jung et.al. 2502.11969 null
2025-02-17 VLP: Vision-Language Preference Learning for Embodied Manipulation Runze Liu et.al. 2502.11918 null
2025-02-17 From Open-Vocabulary to Vocabulary-Free Semantic Segmentation Klara Reichard et.al. 2502.11891 null
2025-02-17 VAQUUM: Are Vague Quantifiers Grounded in Visual Data? Hugh Mee Wong et.al. 2502.11874 null
2025-02-17 Defining and Evaluating Visual Language Models’ Basic Spatial Abilities: A Perspective from Psychometrics Wenrui Xu et.al. 2502.11859 null
2025-02-17 ILIAS: Instance-Level Image retrieval At Scale Giorgos Kordopatis-Zilos et.al. 2502.11748 null
2025-02-17 Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering Md Zarif Ul Alam et.al. 2502.11747 null
2025-02-14 Probing Perceptual Constancy in Large Vision Language Models Haoran Sun et.al. 2502.10273 null
2025-02-14 VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models Gokul Karthik Kumar et.al. 2502.10250 null
2025-02-14 Cooperative Multi-Agent Planning with Adaptive Skill Synthesis Zhiyuan Li et.al. 2502.10148 null
2025-02-14 Image Embedding Sampling Method for Diverse Captioning Sania Waheed et.al. 2502.10118 null
2025-02-14 Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models Chenrui Tie et.al. 2502.10090 link
2025-02-14 ViRAC: A Vision-Reasoning Agent Head Movement Control Framework in Arbitrary Virtual Environments Juyeong Hwang et.al. 2502.10046 null
2025-02-14 Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation Shichao Fan et.al. 2502.10040 null
2025-02-14 HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation Tianwei Lin et.al. 2502.09838 link
2025-02-13 On the robustness of multimodal language model towards distractions Ming Liu et.al. 2502.09818 null
2025-02-13 GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis Angelos Zavras et.al. 2502.09598 link
2025-02-13 When and How Does CLIP Enable Domain and Compositional Generalization? Elias Kempf et.al. 2502.09507 link
2025-02-13 OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics Junhui Wang et.al. 2502.09238 null
2025-02-13 LimSim Series: An Autonomous Driving Simulation Platform for Validation and Enhancement Daocheng Fu et.al. 2502.09170 link
2025-02-13 The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs) Ye-eun Cho et.al. 2502.09120 null
2025-02-13 Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model Shiryu Ueno et.al. 2502.09057 link
2025-02-13 AIDE: Agentically Improve Visual Language Model with Domain Experts Ming-Chang Chiu et.al. 2502.09051 null
2025-02-13 3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning Guoqin Tang et.al. 2502.08903 null
2025-02-13 Harnessing Vision Models for Time Series Analysis: A Survey Jingchao Ni et.al. 2502.08869 link
2025-02-12 ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots Yuxuan Zhang et.al. 2502.08791 null
2025-02-12 A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards Shivansh Patel et.al. 2502.08643 null
2025-02-12 SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation Ellie Arar et.al. 2502.08642 null
2025-02-12 Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities? Jiahe Jin et.al. 2502.08503 link
2025-02-12 ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification Jiangbo Shi et.al. 2502.08391 link
2025-02-12 Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting Jiarui Wu et.al. 2502.08317 null
2025-02-12 TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents Kunal Singh et.al. 2502.08226 null
2025-02-12 Typographic Attacks in a Multi-Image Setting Xiaomeng Wang et.al. 2502.08193 link
2025-02-12 SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation Zhiming Ma et.al. 2502.08168 link
2025-02-12 MAA: Meticulous Adversarial Attack against Vision-Language Pre-trained Models Peng-Fei Zhang et.al. 2502.08079 null
2025-02-11 Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders Kshitish Ghate et.al. 2502.07957 link
2025-02-11 Scaling Pre-training to One Hundred Billion Data for Vision Language Models Xiao Wang et.al. 2502.07617 null
2025-02-11 MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification Anh-Tien Nguyen et.al. 2502.07409 link
2025-02-11 TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation Navid Rajabi et.al. 2502.07306 null
2025-02-11 Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired ByungOk Han et.al. 2502.07183 link
2025-02-10 EVEv2: Improved Baselines for Encoder-Free Vision-Language Models Haiwen Diao et.al. 2502.06788 link
2025-02-10 Conformal Predictions for Human Action Recognition with Vision-Language Models Bary Tim et.al. 2502.06631 null
2025-02-10 Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments Sankalp Nagaonkar et.al. 2502.06445 link
2025-02-10 AppVLM: A Lightweight Vision Language Model for Online App Control Georgios Papoudakis et.al. 2502.06395 null
2025-02-11 When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs Aobotao Dai et.al. 2502.06390 link
2025-02-10 Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models Ce Zhang et.al. 2502.06130 link
2025-02-10 Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models Peiran Wang et.al. 2502.06094 null
2025-02-09 Noise is an Efficient Learner for Zero-Shot Vision-Language Models Raza Imam et.al. 2502.06019 null
2025-02-09 Uni-Retrieval: A Multi-Style Retrieval Framework for STEM’s Education Yanhao Jia et.al. 2502.05863 null
2025-02-09 DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control Junjie Wen et.al. 2502.05855 null
2025-02-07 Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy Yunhang Shen et.al. 2502.05177 link
2025-02-07 DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions Gorkem Can Ates et.al. 2502.05091 null
2025-02-07 OccGS: Zero-shot 3D Occupancy Reconstruction with Semantic and Geometric-Aware Gaussian Splatting Xiaoyu Zhou et.al. 2502.04981 null
2025-02-07 Training-free Task-oriented Grasp Generation Jiaming Wang et.al. 2502.04873 null
2025-02-07 ELITE: Enhanced Language-Image Toxicity Evaluation for Safety Wonjun Lee et.al. 2502.04757 null
2025-02-06 AnyPlace: Learning Generalized Object Placement for Robot Manipulation Yuchi Zhao et.al. 2502.04531 null
2025-02-06 Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment Zuyan Liu et.al. 2502.04328 link
2025-02-06 Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion Marco Mistretta et.al. 2502.04263 link
2025-02-06 Éclair – Extracting Content and Layout with Integrated Reading Order for Documents Ilia Karmanov et.al. 2502.04223 null
2025-02-06 HD-EPIC: A Highly-Detailed Egocentric Video Dataset Toby Perrett et.al. 2502.04144 null
2025-02-06 Efficient Few-Shot Continual Learning in Vision-Language Models Aristeidis Panos et.al. 2502.04098 null
2025-02-06 CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing Yu Yuan et.al. 2502.03997 null
2025-02-06 Adapting Human Mesh Recovery with Vision-Language Feedback Chongyang Xu et.al. 2502.03836 null
2025-02-05 The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering Zhuowei Li et.al. 2502.03628 link
2025-02-05 DynVFX: Augmenting Real Videos with Dynamic Content Danah Yatim et.al. 2502.03621 null
2025-02-05 EnVisionVR: A Scene Interpretation Tool for Visual Accessibility in Virtual Reality Junlong Chen et.al. 2502.03564 null
2025-02-05 SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living Arkaprava Sinha et.al. 2502.03459 null
2025-02-05 RadVLM: A Multitask Conversational Vision-Language Model for Radiology Nicolas Deperrois et.al. 2502.03333 null
2025-02-05 Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search Jiayi He et.al. 2502.03230 null
2025-02-05 iVISPAR – An Interactive Visual-Spatial Reasoning Benchmark for VLMs Julius Mayer et.al. 2502.03214 link
2025-02-05 Disentangling CLIP Features for Enhanced Localized Understanding Samyak Rawelekar et.al. 2502.02977 null
2025-02-04 Vision-Language Model Dialog Games for Self-Improvement Ksenia Konyushkova et.al. 2502.02740 null
2025-02-04 Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling Xiaowen Qiu et.al. 2502.02590 null
2025-02-04 COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation Xueqing Deng et.al. 2502.02589 null
2025-02-04 Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation Junha Lee et.al. 2502.02548 null
2025-02-04 Personalization Toolkit: Training Free Personalization of Large Vision Language Models Soroush Seifi et.al. 2502.02452 null
2025-02-04 LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models Yuto Kojima et.al. 2502.02069 null
2025-02-04 AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs Hongxin Li et.al. 2502.01977 null
2025-02-04 Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration Younan Zhu et.al. 2502.01969 null
2025-02-03 From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment Yilin Wu et.al. 2502.01828 null
2025-02-03 AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis Basit Alawode et.al. 2502.01785 null
2025-02-03 Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning Udita Ghosh et.al. 2502.01616 null
2025-01-31 Homogeneity Bias as Differential Sampling Uncertainty in Language Models Messi H. J. Lee et.al. 2501.19337 null
2025-01-31 Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search Yuta Oshima et.al. 2501.19252 null
2025-01-31 A Survey on Class-Agnostic Counting: Advancements from Reference-Based to Open-World Text-Guided Approaches Luca Ciampi et.al. 2501.19184 null
2025-01-31 Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs Kejia Zhang et.al. 2501.19164 null
2025-01-31 Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification Xiangyu Sun et.al. 2501.19086 null
2025-02-03 Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment Song-Lin Lv et.al. 2501.19060 null
2025-01-31 RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception Joshua R. Waite et.al. 2501.18880 null
2025-02-03 UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent Jianke Zhang et.al. 2501.18867 null
2025-01-31 Test-time Loss Landscape Adaptation for Zero-Shot Generalization in Vision-Language Models Aodi Li et.al. 2501.18864 null
2025-01-30 INT: Instance-Specific Negative Mining for Task-Generic Promptable Segmentation Jian Hu et.al. 2501.18753 null
2025-01-30 Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models Yi Ding et.al. 2501.18533 null
2025-01-30 A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models Shiho Noda et.al. 2501.18463 link
2025-01-30 Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks Hao-Zhe Tan et.al. 2501.18271 null
2025-01-29 Planning with Vision-Language Models and a Use Case in Robot-Assisted Teaching Xuzhe Dang et.al. 2501.17665 null
2025-01-29 Exploring Vision Language Models for Multimodal and Multilingual Stance Detection Jake Vasilakes et.al. 2501.17654 null
2025-01-29 Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation Lin Chen et.al. 2501.17642 null
2025-01-30 Boosting Weak Positives for Text Based Person Search Akshay Modi et.al. 2501.17586 null
2025-01-29 Learning Free Token Reduction for Multi-Modal LLM Zihui Zhao et.al. 2501.17391 null
2025-01-30 Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding Yun-Shiuan Chuang et.al. 2501.17310 null
2025-01-28 Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection Xiangyu Gao et.al. 2501.16981 null
2025-01-28 Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet? Sania Waheed et.al. 2501.16947 null
2025-01-29 TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models Makoto Shing et.al. 2501.16937 null
2025-01-28 Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks Shengyao Zhuang et.al. 2501.16902 link
2025-01-28 Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models Muhammad Atta ur Rahman et.al. 2501.16769 null
2025-01-28 One Head Eight Arms: Block Matrix based Low Rank Adaptation for CLIP-based Few-Shot Learning Chunpeng Zhou et.al. 2501.16720 null
2025-01-28 Improving Vision-Language-Action Model with Online Reinforcement Learning Yanjiang Guo et.al. 2501.16664 null
2025-01-27 BiFold: Bimanual Cloth Folding with Language Guidance Oriol Barbany et.al. 2501.16458 null
2025-01-27 PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding Wei Chow et.al. 2501.16411 null
2025-01-27 CLISC: Bridging clip and sam by enhanced cam for unsupervised brain tumor segmentation Xiaochuan Ma et.al. 2501.16246 null
2025-01-27 AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants Pascal J. Sager et.al. 2501.16150 null
2025-01-27 CILP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification Huazhong Zhao et.al. 2501.16065 link
2025-01-27 Addressing Out-of-Label Hazard Detection in Dashcam Videos: Insights from the COOOL Challenge Anh-Kiet Duong et.al. 2501.16037 link
2025-01-27 SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model Delin Qu et.al. 2501.15830 null
2025-01-27 MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining Ruiqi Wu et.al. 2501.15798 link
2025-01-26 Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts Haodi Ma et.al. 2501.15688 null
2025-01-26 Ocean-OCR: Towards General OCR Application via a Vision-Language Model Song Chen et.al. 2501.15558 link
2025-01-26 Domain Adaptation from Generated Multi-Weather Images for Unsupervised Maritime Object Classification Dan Song et.al. 2501.15503 null
2025-01-26 Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection Han Wang et.al. 2501.15438 link
2025-01-24 Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding Zhongyi Shui et.al. 2501.14548 link
2025-01-24 PAID: A Framework of Product-Centric Advertising Image Design Hongyu Chen et.al. 2501.14316 null
2025-01-24 Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models Yuxuan Liang et.al. 2501.14276 null
2025-01-24 PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction Hammad Ayyubi et.al. 2501.14210 null
2025-01-24 Dynamic Token Reduction during Generation for Vision Language Models Xiaoyu Liang et.al. 2501.14204 null
2025-01-24 Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models Saaduddin Mahmud et.al. 2501.14189 null
2025-01-24 SelfPrompt: Confidence-Aware Semi-Supervised Tuning for Robust Vision-Language Model Adaptation Shuvendu Roy et.al. 2501.14148 null
2025-01-23 Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models Linh Tran et.al. 2501.13904 null
2025-01-23 Dual-Modal Prototype Joint Learning for Compositional Zero-Shot Learning Shiyu Zhang et.al. 2501.13859 null
2025-01-23 Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes Shiling Deng et.al. 2501.13851 link
2025-01-23 Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models Chaolei Han et.al. 2501.13795 link
2025-01-23 Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak Erjia Xiao et.al. 2501.13772 null
2025-01-23 EventVL: Understand Event Streams via Multimodal Large Language Model Pengteng Li et.al. 2501.13707 null
2025-01-23 Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task Mohit Vaishnav et.al. 2501.13620 null
2025-01-23 Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving Lu Wang et.al. 2501.13563 null
2025-01-23 Text-driven Online Action Detection Manuel Benavent-Lledo et.al. 2501.13518 link
2025-01-23 Iterative Shaping of Multi-Particle Aggregates based on Action Trees and VLM Hoi-Yin Lee et.al. 2501.13507 null
2025-01-22 Patent Figure Classification using Large Vision-language Models Sushil Awale et.al. 2501.12751 link
2025-01-22 TeD-Loc: Text Distillation for Weakly Supervised Object Localization Shakeeb Murtaza et.al. 2501.12632 link
2025-01-22 ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality Yanming Xiu et.al. 2501.12553 link
2025-01-21 Owls are wise and foxes are unfaithful: Uncovering animal stereotypes in vision-language models Tabinda Aman et.al. 2501.12433 null
2025-01-20 ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models Jingwei Yi et.al. 2501.12418 link
2025-01-21 InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model Yuhang Zang et.al. 2501.12368 link
2025-01-21 Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2 Md. Rakibul Islam et.al. 2501.12356 null
2025-01-21 CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification Cristiano Patrício et.al. 2501.12266 null
2025-01-21 Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model Kazi Hasan Ibn Arif et.al. 2501.12206 link
2025-01-20 Human-AI Collaborative Game Testing with Vision Language Models Boran Zhang et.al. 2501.11782 null
2025-01-20 SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models Shu Zou et.al. 2501.11485 link
2025-01-20 Verifying Cross-modal Entity Consistency in News using Vision-language Models Sahar Tahmasebi et.al. 2501.11403 link
2025-01-20 KPL: Training-Free Medical Knowledge Mining of Vision-Language Models Jiaxiang Liu et.al. 2501.11231 link
2025-01-19 ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models Yassir Bendou et.al. 2501.11175 null
2025-01-19 Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding Zhanpeng Chen et.al. 2501.10967 link
2025-01-17 HiMix: Reducing Computational Complexity in Large Vision-Language Models Xuange Zhang et.al. 2501.10318 null
2025-01-17 SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning Yuecheng Liu et.al. 2501.10074 null
2025-01-17 CLIP-PCQA: Exploring Subjective-Aligned Vision-Language Modeling for Point Cloud Quality Assessment Yating Liu et.al. 2501.10071 link
2025-01-17 MSTS: A Multimodal Safety Test Suite for Vision-Language Models Paul Röttger et.al. 2501.10057 link
2025-01-17 Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions Zhijie Tan et.al. 2501.10011 null
2025-01-17 Explainable artificial intelligence (XAI): from inherent explainability to large language models Fuseini Mumuni et.al. 2501.09967 null
2025-01-16 Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key Zhihe Yang et.al. 2501.09695 link
2025-01-16 Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark Alexis Roger et.al. 2501.09672 null
2025-01-16 AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture Jiayi Han et.al. 2501.09532 null
2025-01-16 Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness Zeyu Wang et.al. 2501.09446 null
2025-01-16 Vision-Language Models Do Not Understand Negation Kumail Alhamoud et.al. 2501.09425 null
2025-01-16 YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks Saptarashmi Bandyopadhyay et.al. 2501.09355 null
2025-01-16 RoboReflect: Robotic Reflective Reasoning for Grasping Ambiguous-Condition Objects Zhen Luo et.al. 2501.09307 null
2025-01-16 Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning Harrison Fuller et.al. 2501.09294 null
2025-01-16 Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites Abdalwhab Abdalwhab et.al. 2501.09267 null
2025-01-16 Exploring the Capabilities of Vision-Language Models to Detect Visual Bugs in HTML5 Applications Finlay Macklon et.al. 2501.09236 null
2025-01-15 CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation Qi Ma et.al. 2501.08982 null
2025-01-15 MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents Kuicai Dong et.al. 2501.08828 null
2025-01-15 Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning Julian Perry et.al. 2501.08597 null
2025-01-14 MiniMax-01: Scaling Foundation Models with Lightning Attention MiniMax et.al. 2501.08313 null
2025-01-14 Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding Liping Yuan et.al. 2501.07888 link
2025-01-14 Visual Language Models as Operator Agents in the Space Domain Alejandro Carrasco et.al. 2501.07802 null
2025-01-14 BMIP: Bi-directional Modality Interaction Prompt Learning for VLM Song-Lin Lv et.al. 2501.07769 null
2025-01-13 SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing Varun Biyyala et.al. 2501.07554 link
2025-01-13 RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment Difei Gu et.al. 2501.07525 link
2025-01-13 Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models Yasiru Ranasinghe et.al. 2501.07396 null
2025-01-14 GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction Oleg Kobzarev et.al. 2501.07295 null
2025-01-13 Can Vision-Language Models Evaluate Handwritten Math? Oikantik Nath et.al. 2501.07244 link
2025-01-13 TimeLogic: A Temporal Logic Benchmark for Video QA Sirnam Swetha et.al. 2501.07214 null
2025-01-13 BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature Alejandro Lozano et.al. 2501.07171 link
2025-01-13 Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning Zhong Peng et.al. 2501.07114 null
2025-01-12 MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis Sadia Kamal et.al. 2501.06887 null
2025-01-12 Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving Haoxiang Gao et.al. 2501.06680 null
2025-01-10 VideoAuteur: Towards Long Narrative Video Generation Junfei Xiao et.al. 2501.06173 null
2025-01-10 CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion Planning for Future Autonomous Mobility on Demand Systems Haichao Liu et.al. 2501.06132 link
2025-01-10 Generate, Transduct, Adapt: Iterative Transduction with VLMs Oindrila Saha et.al. 2501.06031 null
2025-01-10 Scalable Vision Language Model Training via High Quality Data Curation Hongyuan Dong et.al. 2501.05952 null
2025-01-10 Valley2: Exploring Multimodal Models with Scalable Vision-Language Design Ziheng Wu et.al. 2501.05901 link
2025-01-10 Super-class guided Transformer for Zero-Shot Attribute Classification Sehyung Kim et.al. 2501.05728 link
2025-01-10 From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities Dominick Reilly et.al. 2501.05711 link
2025-01-09 Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding Mohammed Elhenawy et.al. 2501.05566 null
2025-01-09 Infecting Generative AI With Viruses David Noever et.al. 2501.05542 null
2025-01-09 Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation Darius Petermann et.al. 2501.05413 null
2025-01-09 Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection Pei-Kang Lee et.al. 2501.05228 null
2025-01-09 Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model Gregor Geigle et.al. 2501.05122 null
2025-01-09 DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving Xuran Zheng et.al. 2501.05081 null
2025-01-09 Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning Huabin Liu et.al. 2501.05069 null
2025-01-09 ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark Ronghao Dang et.al. 2501.05031 link
2025-01-09 UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation Oleg Sautenkov et.al. 2501.05014 link
2025-01-09 Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments Yifan Xu et.al. 2501.04947 null
2025-01-08 Re-ranking the Context for Multimodal Retrieval Augmented Generation Matin Mortaheb et.al. 2501.04695 null
2025-01-08 Enhancing Financial VQA in Vision Language Models using Intermediate Structured Representations Archita Srivastava et.al. 2501.04675 null
2025-01-08 DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests Charles Corbière et.al. 2501.04671 null
2025-01-08 A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI Kazusato Oko et.al. 2501.04641 link
2025-01-08 Supervision-free Vision-Language Alignment Giorgio Giannone et.al. 2501.04568 null
2025-01-08 Online Gaussian Test-Time Adaptation of Vision-Language Models Clément Fuchs et.al. 2501.04352 link
2025-01-08 Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs Zeyi Huang et.al. 2501.04336 null
2025-01-08 Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts Miao Rang et.al. 2501.04322 link
2025-01-08 Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation Senwei Xie et.al. 2501.04268 null
2025-01-07 MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation Siddharth Joshi et.al. 2501.04155 link
2025-01-07 Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives Shaoyuan Xie et.al. 2501.04003 link
2025-01-07 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos Haobo Yuan et.al. 2501.04001 link
2025-01-07 RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance Matin Mortaheb et.al. 2501.03995 null
2025-01-07 VLM-driven Behavior Tree for Context-aware Task Planning Naoki Wake et.al. 2501.03968 link
2025-01-07 Vision Language Models as Values Detectors Giulio Antonio Abbo et.al. 2501.03957 null
2025-01-07 OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints Mingjie Pan et.al. 2501.03841 null
2025-01-07 KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration Chengyuan Li et.al. 2501.03786 null
2025-01-07 Realistic Test-Time Adaptation of Vision-Language Models Maxime Zanella et.al. 2501.03729 link
2025-01-07 Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein Xiaotong Guo et.al. 2501.03722 null
2025-01-07 SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning Andrew Li et.al. 2501.03675 link
2025-01-06 Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation Yuhui Zhang et.al. 2501.03225 link
2025-01-06 Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches Alhassan Mumuni et.al. 2501.03151 null
2025-01-06 Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment Pegah Khayatan et.al. 2501.03012 link
2025-01-06 MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models Wenyi Hong et.al. 2501.02955 null
2025-01-06 Label-free Concept Based Multiple Instance Learning for Gigapixel Histopathology Susu Sun et.al. 2501.02922 null
2025-01-06 Large Language Models for Video Surveillance Applications Ulindu De Silva et.al. 2501.02850 null
2025-01-06 Visual Large Language Models for Generalized and Specialized Applications Yifan Li et.al. 2501.02765 link
2025-01-05 Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? Simon Park et.al. 2501.02669 link
2025-01-05 Efficient Architectures for High Resolution Vision-Language Models Miguel Carvalho et.al. 2501.02584 link
2025-01-05 FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models Hui Lin et.al. 2501.02461 null
2025-01-03 Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding Jiaming Li et.al. 2501.01926 link
2025-01-03 MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning Pu Yang et.al. 2501.01834 null
2025-01-03 LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction Er Jin et.al. 2501.01767 null
2025-01-03 MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders Jiajun Cao et.al. 2501.01709 null
2025-01-03 GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models Zhangyang Qi et.al. 2501.01428 link
2025-01-02 Training Medical Large Vision-Language Models with Abnormal-Aware Feedback Yucheng Zhou et.al. 2501.01377 null
2025-01-02 CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering Ben Vardi et.al. 2501.01371 null
2025-01-02 Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability Dong Shu et.al. 2501.01346 null
2025-01-02 CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries Shudong Liu et.al. 2501.01282 null
2025-01-03 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining Wenqi Zhang et.al. 2501.00958 link
2025-01-01 Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models Emily Johnson et.al. 2501.00917 null
2025-01-01 FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation Bingyu Li et.al. 2501.00877 link
2025-01-01 IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models Yiming Zhang et.al. 2501.00848 null
2024-12-31 ICONS: Influence Consensus for Vision-Language Data Selection Xindi Wu et.al. 2501.00654 null
2024-12-30 Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model Yifei Huang et.al. 2412.21080 link
2024-12-30 Plancraft: an evaluation dataset for planning with LLM agents Gautier Dagan et.al. 2412.21033 link
2024-12-30 UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI Fangwei Zhong et.al. 2412.20977 null
2024-12-30 Low-Light Image Enhancement via Generative Perceptual Priors Han Zhou et.al. 2412.20916 link
2024-12-30 WalkVLM:Aid Visually Impaired People Walking by Vision Language Model Zhiqiang Yuan et.al. 2412.20903 null
2024-12-30 Towards Compatible Fine-tuning for Vision-Language Model Updates Zhengbo Wang et.al. 2412.20895 null
2024-12-30 ReStory: VLM-augmentation of Social Human-Robot Interaction Datasets Fanjun Bu et.al. 2412.20826 null
2024-12-30 Are Vision-Language Models Truly Understanding Multi-vision Sensor? Sangyun Chung et.al. 2412.20750 link
2024-12-30 UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models Yujie Li et.al. 2412.20742 link
2024-12-30 M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs Bei Yan et.al. 2412.20718 link
2024-12-27 MVTamperBench: Evaluating Robustness of Vision-Language Models Amit Agarwal et.al. 2412.19794 null
2024-12-27 OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis Qiushi Sun et.al. 2412.19723 null
2024-12-27 Is Your Text-to-Image Model Robust to Caption Noise? Weichen Yu et.al. 2412.19531 null
2024-12-27 MBQ: Modality-Balanced Quantization for Large Vision-Language Models Shiyao Li et.al. 2412.19509 link
2024-12-27 Multi-P$^2$A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models Jie Zhang et.al. 2412.19496 link
2024-12-27 Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation Chengyang Ye et.al. 2412.19492 link
2024-12-26 CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models Kiet A. Nguyen et.al. 2412.19331 null
2024-12-26 Sketch-MoMa: Teleoperation for Mobile Manipulator via Interpretation of Hand-Drawn Sketches Kosei Tanada et.al. 2412.19153 null
2024-12-26 MoPD: Mixture-of-Prompts Distillation for Vision-Language Models Yang Chen et.al. 2412.19087 null
2024-12-26 Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation Tao Liu et.al. 2412.19021 null
2024-12-24 Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models Tahira Kazimi et.al. 2412.18604 null
2024-12-24 The Key of Understanding Vision Tasks: Explanatory Instructions Yang Shen et.al. 2412.18525 link
2024-12-24 LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating Chao Deng et.al. 2412.18424 link
2024-12-24 Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model Tenghui Li et.al. 2412.18387 link
2024-12-24 Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model Yushu Li et.al. 2412.18303 null
2024-12-24 Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight Xi Ding et.al. 2412.18298 link
2024-12-24 Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration Zhixuan Shen et.al. 2412.18292 link
2024-12-24 Sampling Bag of Views for Open-Vocabulary Object Detection Hojun Choi et.al. 2412.18273 null
2024-12-24 VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks Shiduo Zhang et.al. 2412.18194 null
2024-12-24 EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation Shuhao Han et.al. 2412.18150 link
2024-12-23 ChatGarment: Garment Estimation, Generation and Editing via Large Language Models Siyuan Bian et.al. 2412.17811 null
2024-12-23 Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection Yitong Chen et.al. 2412.17800 link
2024-12-23 Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective Xinmiao Yu et.al. 2412.17787 null
2024-12-23 Reasoning to Attend: Try to Understand How Token Works Rui Qian et.al. 2412.17741 link
2024-12-23 Kernel-Aware Graph Prompt Learning for Few-Shot Anomaly Detection Fenfang Tao et.al. 2412.17619 link
2024-12-23 Personalized Large Vision-Language Models Chau Pham et.al. 2412.17610 null
2024-12-23 Retention Score: Quantifying Jailbreak Risks for Vision Language Models Zaitang Li et.al. 2412.17544 null
2024-12-23 Feature Based Methods Domain Adaptation for Object Detection: A Review Paper Helia Mohamadi et.al. 2412.17325 null
2024-12-23 On the Feasibility of Vision-Language Models for Time-Series Classification Vinay Prithyani et.al. 2412.17304 link
2024-12-23 GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning Teja Krishna Cherukuri et.al. 2412.17251 null
2024-12-20 HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding Chenxin Tao et.al. 2412.16158 null
2024-12-20 Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training Mingliang Liang et.al. 2412.16148 link
2024-12-20 Demystifying the Potential of ChatGPT-4 Vision for Construction Progress Monitoring Ahmet Bahaddin Ersoz et.al. 2412.16108 null
2024-12-20 VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models Dexter Neo et.al. 2412.15739 null
2024-12-20 Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage Zhi Gao et.al. 2412.15606 null
2024-12-20 VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving Zilin Huang et.al. 2412.15544 null
2024-12-20 PolySmart @ TRECVid 2024 Video-To-Text Jiaxin Wu et.al. 2412.15509 null
2024-12-19 TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models Ammar N. Abbas et.al. 2412.15462 null
2024-12-19 PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation Muntasir Wahed et.al. 2412.15209 null
2024-12-19 AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving Shuo Xing et.al. 2412.15206 link
2024-12-19 EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues Sagar Soni et.al. 2412.15190 null
2024-12-19 LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation Weijia Shi et.al. 2412.15188 null
2024-12-19 ObjVariantEnsemble: Advancing Point Cloud LLM Evaluation in Challenging Scenes with Subtly Distinguished Objects Qihang Cao et.al. 2412.14837 null
2024-12-19 A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space Yonghao He et.al. 2412.14680 link
2024-12-19 FiVL: A Framework for Improved Vision-Language Alignment Estelle Aflalo et.al. 2412.14672 null
2024-12-19 HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model Masanari Ohi et.al. 2412.14613 null
2024-12-19 Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation Jihao Gu et.al. 2412.14487 null
2024-12-19 GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering Saumya Saxena et.al. 2412.14480 null
2024-12-18 Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation Jianyu Zhang et.al. 2412.14145 null
2024-12-18 Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models Ido Cohen et.al. 2412.14133 link
2024-12-18 Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models Xinghang Li et.al. 2412.14058 null
2024-12-18 Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence Jinghan He et.al. 2412.13949 null
2024-12-18 Real Classification by Description: Extending CLIP’s Limits of Part Attributes Recognition Ethan Baron et.al. 2412.13947 null
2024-12-18 Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection Le Yang et.al. 2412.13817 link
2024-12-18 Towards Automatic Evaluation for Image Transcreation Simran Khanuja et.al. 2412.13717 null
2024-12-18 RelationField: Relate Anything in Radiance Fields Sebastian Koch et.al. 2412.13652 link
2024-12-18 Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation Changsun Lee et.al. 2412.13558 null
2024-12-18 Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning Yingjie Zhu et.al. 2412.13540 link
2024-12-17 Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents Yifei Zhou et.al. 2412.13194 null
2024-12-17 HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction Chen Bao et.al. 2412.13187 null
2024-12-17 Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration Mark Endo et.al. 2412.13180 null
2024-12-17 Improving Fine-grained Visual Understanding in VLMs through Text-Only Training Dasol Choi et.al. 2412.12940 null
2024-12-17 CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models Zihui Cheng et.al. 2412.12932 link
2024-12-17 An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions Shreeyash Gowaikar et.al. 2412.12898 null
2024-12-17 ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation Shiqi Huang et.al. 2412.12798 link
2024-12-17 CRoF: CLIP-based Robust Few-shot Learning on Noisy Labels Shizhuo Deng et.al. 2412.12793 null
2024-12-17 Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference Siyuan Wang et.al. 2412.12785 null
2024-12-17 GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models Mukai Li et.al. 2412.12735 link
2024-12-16 Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning Qi Sun et.al. 2412.11974 link
2024-12-16 Explainable Procedural Mistake Detection Shane Storks et.al. 2412.11927 null
2024-12-16 Does VLM Classification Benefit from LLM Description Semantics? Pingchuan Ma et.al. 2412.11917 link
2024-12-16 From 2D CAD Drawings to 3D Parametric Models: A Vision-Language Approach Xilin Wang et.al. 2412.11892 null
2024-12-16 LMM-Regularized CLIP Embeddings for Image Classification Maria Tzelepi et.al. 2412.11663 null
2024-12-16 Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves Shihan Wu et.al. 2412.11509 link
2024-12-16 Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents Wonje Choi et.al. 2412.11484 null
2024-12-16 OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference Wei Chen et.al. 2412.11475 null
2024-12-16 MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation Quan-Sheng Zeng et.al. 2412.11464 link
2024-12-16 Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes Antonio Carlos Rivera et.al. 2412.11396 null
2024-12-13 UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities Muhammad Uzair Khattak et.al. 2412.10372 link
2024-12-13 A dual contrastive framework Yuan Sun et.al. 2412.10348 null
2024-12-13 DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding Zhiyu Wu et.al. 2412.10302 link
2024-12-13 A model-based approach for transforming InSAR-derived vertical land motion from a local to a global reference frame Mahmoud Reshadati et.al. 2412.10282 null
2024-12-13 VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation Hyeonseok Lim et.al. 2412.10151 null
2024-12-13 Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples Yeyuan Wang et.al. 2412.10029 null
2024-12-13 WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model Songyan Zhang et.al. 2412.09951 link
2024-12-13 CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models Dongyu Yao et.al. 2412.09936 link
2024-12-13 Selective State Space Memory for Large Vision-Language Models Chee Ng et.al. 2412.09875 null
2024-12-12 BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation Pablo Morales-Álvarez et.al. 2412.09718 null
2024-12-13 V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding Junqi Ge et.al. 2412.09616 link
2024-12-12 PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models Chenyu Yang et.al. 2412.09613 null
2024-12-12 AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials Yiheng Xu et.al. 2412.09605 null
2024-12-12 OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation Jitesh Jain et.al. 2412.09585 link
2024-12-12 Neptune: The Long Orbit to Benchmarking Long Video Understanding Arsha Nagrani et.al. 2412.09582 link
2024-12-12 Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM Han Wang et.al. 2412.09530 link
2024-12-12 Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Clinical Pathology Analysis Shengxuming Zhang et.al. 2412.09521 null
2024-12-12 ATPrompt: Textual Prompt Learning with Embedded Attributes Zheng Li et.al. 2412.09442 null
2024-12-12 Causal Graphical Models for Vision-Language Compositional Understanding Fiorenzo Parascandolo et.al. 2412.09353 link
2024-12-12 Learning Novel Skills from Language-Generated Demonstrations Ao-Qun Jin et.al. 2412.09286 null
2024-12-11 Synthetic Vision: Training Vision-Language Models to Understand Physics Vahid Balazadeh et.al. 2412.08619 null
2024-12-12 Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning Fan Lu et.al. 2412.08614 link
2024-12-11 SenCLIP: Enhancing zero-shot land-use mapping for Sentinel-2 with ground-level prompting Pallavi Jain et.al. 2412.08536 link
2024-12-11 POINTS1.5: Building a Vision-Language Model towards Real World Applications Yuan Liu et.al. 2412.08443 null
2024-12-11 LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba Yubo Cui et.al. 2412.08388 null
2024-12-11 HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models Shiding Zhu et.al. 2412.08378 null
2024-12-11 Position-aware Guided Point Cloud Completion with CLIP Model Feng Zhou et.al. 2412.08271 null
2024-12-11 TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning Jingjing Xie et.al. 2412.08176 link
2024-12-11 Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models Quang-Hung Le et.al. 2412.08125 link
2024-12-11 Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models Sri Harsha Dumpala et.al. 2412.08111 null
2024-12-10 RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models Greg Heinrich et.al. 2412.07679 link
2024-12-10 DRUM: Learning Demonstration Retriever for Large MUlti-modal Models Ellen Yi-Ge et.al. 2412.07619 null
2024-12-10 Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios Jiaqi Fan et.al. 2412.07518 link
2024-12-10 SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World Jiaqi Zhang et.al. 2412.07472 link
2024-12-10 MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models Sayak Chakrabarty et.al. 2412.07148 link
2024-12-10 Maya: An Instruction Finetuned Multilingual Multimodal Model Nahid Alam et.al. 2412.07112 link
2024-12-10 Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling Donggeun Kim et.al. 2412.07077 null
2024-12-09 Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models Yi-Lun Lee et.al. 2412.06775 link
2024-12-09 Visual Lexicon: Rich Image Features in Language Space XuDong Wang et.al. 2412.06774 null
2024-12-09 Ranking-aware adapter for text-driven image ordering with CLIP Wei-Hsiang Yu et.al. 2412.06760 link
2024-12-09 ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities Adhiraj Ghosh et.al. 2412.06745 null
2024-12-09 The Narrow Gate: Localized Image-Text Communication in Vision-Language Models Alessandro Serra et.al. 2412.06646 null
2024-12-09 From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding Yixiong Fang et.al. 2412.06474 link
2024-12-09 Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models Wei Suo et.al. 2412.06458 null
2024-12-09 No Annotations for Object Detection in Art through Stable Diffusion Patrick Ramos et.al. 2412.06286 link
2024-12-09 iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models Lianyu Hu et.al. 2412.06263 link
2024-12-09 DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction Yunheng Li et.al. 2412.06244 null
2024-12-06 Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies Recep Firat Cekinel et.al. 2412.05155 link
2024-12-06 Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora Michael Y. Hu et.al. 2412.05149 null
2024-12-06 $S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization of Vision-Language Models Xiaojie Yin et.al. 2412.04925 null
2024-12-06 Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model Keunwoo Peter Yu et.al. 2412.04729 null
2024-12-05 Cross-Self KV Cache Pruning for Efficient Vision-Language Inference Xiaohuan Pei et.al. 2412.04652 link
2024-12-05 NVILA: Efficient Frontier Visual Language Models Zhijian Liu et.al. 2412.04468 null
2024-12-05 VisionZip: Longer is Better but Not Necessary in Vision Language Models Senqiao Yang et.al. 2412.04467 link
2024-12-05 Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection Enshen Zhou et.al. 2412.04455 null
2024-12-05 Grounding Descriptions in Images informs Zero-Shot Visual Recognition Shaunak Halbe et.al. 2412.04429 link
2024-12-05 Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion Jiuhai Chen et.al. 2412.04424 link
2024-12-05 SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding Rong Li et.al. 2412.04383 null
2024-12-05 Discriminative Fine-tuning of LVLMs Yassine Ouali et.al. 2412.04378 null
2024-12-05 3D Part Segmentation via Geometric Aggregation of 2D Visual Features Marco Garosi et.al. 2412.04247 link
2024-12-06 VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction Jiahao Zhang et.al. 2412.04237 null
2024-12-05 Unified Framework for Open-World Compositional Zero-shot Learning Hirunima Jayasekara et.al. 2412.04083 link
2024-12-04 FLAIR: VLM with Fine-grained Language-informed Image Representations Rui Xiao et.al. 2412.03561 link
2024-12-04 Best-of-N Jailbreaking John Hughes et.al. 2412.03556 link
2024-12-04 PaliGemma 2: A Family of Versatile VLMs for Transfer Andreas Steiner et.al. 2412.03555 null
2024-12-04 PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation Ao Wang et.al. 2412.03409 link
2024-12-04 A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs Wangbo Zhao et.al. 2412.03324 link
2024-12-04 Composed Image Retrieval for Training-Free Domain Conversion Nikos Efthymiadis et.al. 2412.03297 link
2024-12-04 Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation Gianni Franchi et.al. 2412.03178 null
2024-12-04 AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations? Shouwei Ruan et.al. 2412.03002 null
2024-12-04 Progressive Vision-Language Prompt for Multi-Organ Multi-Class Cell Semantic Segmentation with Single Branch Qing Zhang et.al. 2412.02978 null
2024-12-04 Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis Po-Hsuan Huang et.al. 2412.02946 null
2024-12-03 Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback Hiroki Furuta et.al. 2412.02617 null
2024-12-03 CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs Abhas Kumar et.al. 2412.02602 null
2024-12-03 OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation Junyuan Zhang et.al. 2412.02592 link
2024-12-03 Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey Chenyang Liu et.al. 2412.02573 link
2024-12-03 SJTU: Spatial judgments in multimodal models towards unified segmentation through coordinate detection Joongwon Chae et.al. 2412.02565 link
2024-12-03 Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks Jinjin Cai et.al. 2412.02531 null
2024-12-03 OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations Caixin Kang et.al. 2412.02479 null
2024-12-03 BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding Chenguang Huang et.al. 2412.02449 null
2024-12-03 Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation Sepand Dyanatkar et.al. 2412.02262 null
2024-12-03 LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models Fan-Yun Sun et.al. 2412.02193 null
2024-11-29 SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks Kim-Celine Kahl et.al. 2411.19688 link
2024-11-29 CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation Qixiu Li et.al. 2411.19650 null
2024-11-29 Interleaved-Modal Chain-of-Thought Jun Gao et.al. 2411.19488 null
2024-11-29 Effective Fine-Tuning of Vision-Language Models for Accurate Galaxy Morphology Analysis Ruoqi Wang et.al. 2411.19475 null
2024-11-28 Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation Luca Barsellotti et.al. 2411.19331 link
2024-11-28 GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks Muhammad Sohail Danish et.al. 2411.19325 link
2024-11-28 GRAPE: Generalizing Robot Policy via Preference Alignment Zijian Zhang et.al. 2411.19309 null
2024-11-28 Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs Anirudh Phukan et.al. 2411.19187 null
2024-11-28 VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models Jeongho Ju et.al. 2411.19103 null
2024-11-27 ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics Letian Chen et.al. 2411.18825 null
2024-11-27 Visual Adversarial Attack on Vision-Language Models for Autonomous Driving Tianyuan Zhang et.al. 2411.18275 null
2024-11-27 SCoTT: Wireless-Aware Path Planning with Vision Language Models and Strategic Chains-of-Thought Aladin Djuhera et.al. 2411.18212 null
2024-11-27 From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects Zizhao Li et.al. 2411.18207 link
2024-11-27 Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning Di Zhang et.al. 2411.18203 null
2024-11-27 DistinctAD: Distinctive Audio Description Generation in Contexts Bo Fang et.al. 2411.18180 null
2024-11-27 COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models Xiao An et.al. 2411.18145 link
2024-11-27 When Large Vision-Language Models Meet Person Re-Identification Qizao Wang et.al. 2411.18111 null
2024-11-27 Aligning Knowledge Concepts to Whole Slide Images for Precise Histopathology Image Analysis Weiqin Zhao et.al. 2411.18101 link
2024-11-27 VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis Donggoo Kang et.al. 2411.18038 null
2024-11-28 Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models Shuyang Hao et.al. 2411.18000 null
2024-11-26 What’s in the Image? A Deep-Dive into the Vision of Vision Language Models Omri Kaduri et.al. 2411.17491 null
2024-11-26 VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models Lei Li et.al. 2411.17451 null
2024-11-26 CoA: Chain-of-Action for Generative Semantic Labels Meng Wei et.al. 2411.17406 link
2024-11-26 Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment Dongping Chen et.al. 2411.17188 null
2024-11-26 Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation Chanyoung Kim et.al. 2411.17150 null
2024-11-26 Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models Jaemin Kim et.al. 2411.17041 null
2024-11-26 Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation Shambhavi Mishra et.al. 2411.17002 link
2024-11-25 Probing the limitations of multimodal language models for chemistry and materials research Nawaf Alampara et.al. 2411.16955 link
2024-11-25 Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge Yaqi Zhao et.al. 2411.16824 null
2024-11-25 Generating Out-Of-Distribution Scenarios Using Language Models Erfan Aasi et.al. 2411.16554 null
2024-11-25 RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics Chan Hee Song et.al. 2411.16537 null
2024-11-25 Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis Boming Miao et.al. 2411.16503 null
2024-11-25 A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models Manuel Schwonberg et.al. 2411.16407 null
2024-11-25 CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain Jingchao Peng et.al. 2411.16327 null
2024-11-25 Open-Vocabulary Octree-Graph for 3D Scene Understanding Zhigang Wang et.al. 2411.16253 null
2024-11-25 From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events Yan Miao et.al. 2411.16027 null
2024-11-25 Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models Niloufar Alipour Talemi et.al. 2411.16018 null
2024-11-24 Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation Sule Bai et.al. 2411.15869 link
2024-11-24 ResCLIP: Residual Attention for Training-free Dense Vision-language Inference Yuhang Yang et.al. 2411.15851 link
2024-11-22 Context-Aware Multimodal Pretraining Karsten Roth et.al. 2411.15099 null
2024-11-22 Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning Junjie Shan et.al. 2411.14937 link
2024-11-22 ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos Tanveer Hannan et.al. 2411.14901 link
2024-11-22 VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models Camilo Chacón Sartori et.al. 2411.14832 null
2024-11-22 Continual SFT Matches Multimodal RLHF with Negative Supervision Ke Zhu et.al. 2411.14797 null
2024-11-22 VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection Songhao Han et.al. 2411.14794 link
2024-11-22 Effective SAM Combination for Open-Vocabulary Semantic Segmentation Minhyeok Lee et.al. 2411.14723 null
2024-11-21 GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI Tianbin Li et.al. 2411.14522 link
2024-11-21 Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance Haozhe Zhao et.al. 2411.14279 null
2024-11-21 Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body Zeqing Wang et.al. 2411.14205 null
2024-11-21 FoPru: Focal Pruning for Efficient Large Vision-Language Models Lei Jiang et.al. 2411.14164 null
2024-11-21 Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset Heejeong Nam et.al. 2411.14137 link
2024-11-20 BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games Davide Paglieri et.al. 2411.13543 null
2024-11-20 Teaching VLMs to Localize Specific Objects from In-context Examples Sivan Doveh et.al. 2411.13317 link
2024-11-20 XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation Ziyi Wang et.al. 2411.13243 link
2024-11-21 ViSTa Dataset: Do vision-language models understand sequential tasks? Evžen Wybitul et.al. 2411.13211 link
2024-11-20 TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models Xin Wang et.al. 2411.13136 null
2024-11-20 LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement Siwen Jiao et.al. 2411.12980 null
2024-11-19 VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge Vishwesh Nath et.al. 2411.12915 null
2024-11-19 CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs Zhehan Kan et.al. 2411.12713 null
2024-11-19 Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes Rahul Garg et.al. 2411.12174 null
2024-11-18 Vision Language Models Are Few-Shot Audio Spectrogram Classifiers Satvik Dixit et.al. 2411.12058 null
2024-11-18 ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements M. Arda Aydın et.al. 2411.12044 link
2024-11-18 MC-LLaVA: Multi-Concept Personalized Vision-Language Model Ruichuan An et.al. 2411.11706 link
2024-11-18 TrojanRobot: Backdoor Attacks Against Robotic Manipulation in the Physical World Xianlong Wang et.al. 2411.11683 null
2024-11-18 VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation Bangguo Yu et.al. 2411.11609 null
2024-11-18 Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment Zhendong Liu et.al. 2411.11543 null
2024-11-19 Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models Chenhang Cui et.al. 2411.11496 link
2024-11-18 Exploring Emerging Trends and Research Opportunities in Visual Place Recognition Antonios Gasteratos et.al. 2411.11481 null
2024-11-18 Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts Jingxuan Li et.al. 2411.11479 null
2024-11-18 GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts Junwen He et.al. 2411.11435 null
2024-11-18 Efficient Transfer Learning for Video-language Foundation Models Haoxing Chen et.al. 2411.11223 link
2024-11-17 Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection Wentao Bao et.al. 2411.10922 link
2024-11-15 VeriGraph: Scene Graphs for Execution Verifiable Robot Planning Daniel Ekpo et.al. 2411.10446 null
2024-11-15 LLaVA-o1: Let Vision Language Models Reason Step-by-Step Guowei Xu et.al. 2411.10440 link
2024-11-15 SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning Zewen Chen et.al. 2411.10161 link
2024-11-15 Federated Domain Generalization via Prompt Learning and Aggregation Shuai Gong et.al. 2411.10063 link
2024-11-15 Free Lunch in Pathology Foundation Model: Task-specific Model Adaptation with Concept-Guided Feature Enhancement Yanyan Huang et.al. 2411.09894 link
2024-11-14 LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution Chenyang Wang et.al. 2411.09293 null
2024-11-13 ClevrSkills: Compositional Language and Visual Reasoning in Robotics Sanjay Haresh et.al. 2411.09052 link
2024-11-13 DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models Yongdong Wang et.al. 2411.09022 link
2024-11-13 Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions Moran Yanuka et.al. 2411.09018 link
2024-11-13 The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models Daniel P. Jeong et.al. 2411.08870 link
2024-11-13 Sharingan: Extract User Action Sequence from Desktop Recordings Yanting Chen et.al. 2411.08768 null
2024-11-13 Voxeland: Probabilistic Instance-Aware Semantic Mapping with Evidence-based Uncertainty Quantification Jose-Luis Matez-Bandera et.al. 2411.08727 link
2024-11-13 LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation Pengwei Yin et.al. 2411.08606 null
2024-11-13 NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation Youzhi Liu et.al. 2411.08579 null
2024-11-13 Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints Nishanth Kumar et.al. 2411.08253 null
2024-11-12 JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation Yiyang Ma et.al. 2411.07975 link
2024-11-12 Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer’s Disease Francesco Chiumento et.al. 2411.07871 null
2024-11-12 BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions Anas Awadalla et.al. 2411.07461 null
2024-11-11 SAMPart3D: Segment Any Part in 3D Objects Yunhan Yang et.al. 2411.07184 link
2024-11-11 StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification Yichen He et.al. 2411.07076 link
2024-11-11 UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models Jiachen Liang et.al. 2411.06921 link
2024-11-11 Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning Hongsheng Zhang et.al. 2411.06764 null
2024-11-11 Learning from Feedback: Semantic Enhancement for Object SLAM Using Foundation Models Jungseok Hong et.al. 2411.06752 null
2024-11-11 Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs Jia Syuen Lim et.al. 2411.06702 null
2024-11-11 Renaissance: Investigating the Pretraining of Vision-Language Encoders Clayton Fields et.al. 2411.06657 link
2024-11-09 Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models Arshia Hemmat et.al. 2411.06287 link
2024-11-09 Sampling-Based Model Predictive Control for Dexterous Manipulation on a Biomimetic Tendon-Driven Hand Adrian Hess et.al. 2411.06183 null
2024-11-09 Aquila-plus: Prompt-Driven Visual-Language Models for Pixel-Level Remote Sensing Image Understanding Kaixuan Lu et.al. 2411.06142 null
2024-11-08 End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering Dylan Goetting et.al. 2411.05755 link
2024-11-08 Poze: Sports Technique Feedback under Data Constraints Agamdeep Singh et.al. 2411.05734 null
2024-11-08 A Two-Step Concept-Based Approach for Enhanced Interpretability and Trust in Skin Lesion Diagnosis Cristiano Patrício et.al. 2411.05609 link
2024-11-08 Enhancing Visual Classification using Comparative Descriptors Hankyeol Lee et.al. 2411.05357 link
2024-11-08 Real-World Offline Reinforcement Learning from Vision Language Model Feedback Sreyas Venkataraman et.al. 2411.05273 null
2024-11-07 RT-Grasp: Reasoning Tuning Robotic Grasping via Multi-modal Large Language Model Jinxuan Xu et.al. 2411.05212 null
2024-11-07 On Erroneous Agreements of CLIP Image Embeddings Siting Li et.al. 2411.05195 null
2024-11-07 Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning Joey Hong et.al. 2411.05193 null
2024-11-07 Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model Sheng Cheng et.al. 2411.05079 link
2024-11-07 DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation Peiqi Liu et.al. 2411.04999 link
2024-11-07 A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model Panwen Hu et.al. 2411.04942 null
2024-11-07 In the Era of Prompt Learning with Vision-Language Models Ankit Jha et.al. 2411.04892 null
2024-11-07 TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models Jonathan Fhima et.al. 2411.04642 null
2024-11-07 Vision Language Models are In-Context Value Learners Yecheng Jason Ma et.al. 2411.04549 null
2024-11-07 BendVLM: Test-Time Debiasing of Vision-Language Embeddings Walter Gerych et.al. 2411.04420 link
2024-11-06 Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models Saketh Bachu et.al. 2411.04291 null
2024-11-06 Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? Daniel P. Jeong et.al. 2411.04118 link
2024-11-06 RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models Maya Varma et.al. 2411.04097 link
2024-11-06 H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models Nhi Pham et.al. 2411.04077 null
2024-11-06 Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval Davide Buoso et.al. 2411.04006 null
2024-11-06 Multi3Hate: Multimodal, Multilingual, and Multicultural Hate Speech Detection with Vision-Language Models Minh Duc Bui et.al. 2411.03888 link
2024-11-06 DesignMinds: Enhancing Video-Based Design Ideation with Vision-Language Model and Context-Injected Large Language Model Tianhao He et.al. 2411.03827 null
2024-11-06 Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction Muhammad Tayyab Khan et.al. 2411.03707 null
2024-11-05 Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset Yingzi Ma et.al. 2411.03554 link
2024-11-05 VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation Haochen Zhang et.al. 2411.03540 link
2024-11-05 Personalized Video Summarization by Multimodal Video Understanding Brian Chen et.al. 2411.03531 null
2024-11-05 Inference Optimal VLMs Need Only One Visual Token but Larger Models Kevin Y. Li et.al. 2411.03312 link
2024-11-05 HumanVLM: Foundation for Human-Scene Vision-Language Model Dawei Dai et.al. 2411.03034 null
2024-11-05 Precise Drive with VLM: First Prize Solution for PRCV 2024 Drive LM challenge Bin Huang et.al. 2411.02999 null
2024-11-05 Multi-Modal 3D Scene Graph Updater for Shared and Dynamic Environments Emilio Olivastri et.al. 2411.02938 null
2024-11-05 Membership Inference Attacks against Large Vision-Language Models Zhan Li et.al. 2411.02902 link
2024-11-05 Leveraging Vision-Language Models for Manufacturing Feature Recognition in CAD Designs Muhammad Tayyab Khan et.al. 2411.02810 null
2024-11-05 Label Critic: Design Data Before Models Pedro R. A. S. Bassi et.al. 2411.02753 link
2024-11-05 DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark Haodong Li et.al. 2411.02733 link
2024-11-05 V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization Yuxi Xie et.al. 2411.02712 link
2024-11-04 Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack Xiaojun Jia et.al. 2411.02669 link
2024-11-04 Attacking Vision-Language Computer Agents via Pop-ups Yanzhe Zhang et.al. 2411.02391 link
2024-11-04 One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering Deepayan Das et.al. 2411.02210 null
2024-11-04 GraphVL: Graph-Enhanced Semantic Modeling via Vision-Language Models for Generalized Class Discovery Bhupendra Solanki et.al. 2411.02074 null
2024-11-03 RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering Hui Lin et.al. 2411.01595 null
2024-11-03 Addressing Failures in Robotics using Vision-Based Language Models (VLMs) and Behavior Trees (BT) Faseeh Ahmad et.al. 2411.01568 null
2024-11-03 Integration of Large Vision Language Models for Efficient Post-disaster Damage Assessment and Reporting Zhaohui Chen et.al. 2411.01511 null
2024-11-03 A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning Fei Wang et.al. 2411.01445 null
2024-11-01 Identifying Implicit Social Biases in Vision-Language Models Kimia Hamidieh et.al. 2411.00997 null
2024-11-01 Retrieval-enriched zero-shot image classification in low-resource domains Nicola Dall’Asen et.al. 2411.00988 null
2024-11-01 Does GenAI Make Usability Testing Obsolete? Ali Ebrahimi Pourasad et.al. 2411.00634 null
2024-10-31 $π_0$: A Vision-Language-Action Flow Model for General Robot Control Kevin Black et.al. 2410.24164 null
2024-10-31 Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age Nouar AlDahoul et.al. 2410.24148 null
2024-10-31 Bayesian-guided Label Mapping for Visual Reprogramming Chengyi Cai et.al. 2410.24018 link
2024-10-31 EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection Qinqian Lei et.al. 2410.23904 link
2024-10-31 Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP Chen Huang et.al. 2410.23698 null
2024-10-31 Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey Chiyu Zhang et.al. 2410.23687 null
2024-10-31 SuctionPrompt: Visual-assisted Robotic Picking with a Suction Cup Using Vision-Language Models and Facile Hardware Design Tomohiro Motoda et.al. 2410.23640 null
2024-10-30 Keypoint Abstraction using Large Models for Object-Relative Imitation Learning Xiaolin Fang et.al. 2410.23254 null
2024-10-31 Aligning Audio-Visual Joint Representations with an Agentic Workflow Shentong Mo et.al. 2410.23230 null
2024-10-30 OS-ATLAS: A Foundation Action Model for Generalist GUI Agents Zhiyong Wu et.al. 2410.23218 link
2024-10-30 VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning Yichao Liang et.al. 2410.23156 null
2024-10-30 Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models Junjie Wu et.al. 2410.23114 link
2024-10-30 An Individual Identity-Driven Framework for Animal Re-Identification Yihao Wu et.al. 2410.22927 link
2024-10-30 Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector Youcheng Huang et.al. 2410.22888 link
2024-10-30 Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model Keito Sasagawa et.al. 2410.22736 null
2024-10-30 Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization Kento Kawaharazuka et.al. 2410.22707 null
2024-10-30 SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset Ngoc Dung Huynh et.al. 2410.22648 null
2024-10-29 Task Vectors are Cross-Modal Grace Luo et.al. 2410.22330 null
2024-10-29 Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier Kai Wang et.al. 2410.22317 link
2024-10-29 Natural Language Inference Improves Compositionality in Vision-Language Models Paola Cascante-Bonilla et.al. 2410.22315 null
2024-10-29 Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving Bo Jiang et.al. 2410.22313 link
2024-10-29 ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising Ashutosh Chaubey et.al. 2410.22233 link
2024-10-29 Active Learning for Vision-Language Models Bardia Safaei et.al. 2410.22187 null
2024-10-29 Are VLMs Really Blind Ayush Singh et.al. 2410.22029 link
2024-10-29 Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation Halil Utku Unlu et.al. 2410.21926 null
2024-10-30 Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models Lu Yu et.al. 2410.21802 link
2024-10-29 PerSRV: Personalized Sticker Retrieval with Vision-Language Model Heng Er Metilda Chee et.al. 2410.21801 link
2024-10-29 AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? Han Bao et.al. 2410.21259 link
2024-10-28 Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce Zhantao Yang et.al. 2410.21237 null
2024-10-28 Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines Zhixin Zhang et.al. 2410.21220 link
2024-10-29 Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction Qintong Zhang et.al. 2410.21169 null
2024-10-28 Zero-Shot Action Recognition in Surveillance Videos Joao Pereira et.al. 2410.21113 null
2024-10-28 BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks Yunhan Zhao et.al. 2410.20971 null
2024-10-29 VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions Guanyan Chen et.al. 2410.20927 null
2024-10-28 Improving Generalization in Visual Reasoning via Self-Ensemble Tien-Huy Nguyen et.al. 2410.20883 null
2024-10-28 Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments Sangmim Song et.al. 2410.20666 null
2024-10-27 MatViX: Multimodal Information Extraction from Visually Rich Articles Ghazal Khalighinejad et.al. 2410.20494 null
2024-10-25 Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models Yucheng Zhou et.al. 2410.19732 null
2024-10-25 GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing Hosam Elgendy et.al. 2410.19552 link
2024-10-25 Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad? Antonia Wüst et.al. 2410.19546 link
2024-10-25 EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data Xuetian Chen et.al. 2410.19461 null
2024-10-25 COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training Haocheng Xi et.al. 2410.19313 link
2024-10-25 Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting Xingyu Zhu et.al. 2410.19294 null
2024-10-24 Probabilistic Language-Image Pre-Training Sanghyuk Chun et.al. 2410.18857 link
2024-10-24 Zero-shot Object Navigation with Vision-Language Models Reasoning Congcong Wen et.al. 2410.18570 null
2024-10-24 Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data Shuhao Gu et.al. 2410.18558 null
2024-10-24 Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics Jinghao Hu et.al. 2410.18537 null
2024-10-23 Lightweight Neural App Control Filippos Christianos et.al. 2410.17883 null
2024-10-23 ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting Shaofei Cai et.al. 2410.17856 link
2024-10-23 RE-tune: Incremental Fine Tuning of Biomedical Vision-Language Models for Multi-label Chest X-ray Classification Marco Mistretta et.al. 2410.17827 null
2024-10-23 An Intelligent Agentic System for Complex Image Restoration Problems Kaiwen Zhu et.al. 2410.17809 link
2024-10-23 MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models Ziyu Liu et.al. 2410.17637 link
2024-10-22 AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents Chejian Xu et.al. 2410.17401 null
2024-10-22 Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities Zheyuan Zhang et.al. 2410.17385 link
2024-10-22 PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction Long Xing et.al. 2410.17247 link
2024-10-22 MPDS: A Movie Posters Dataset for Image Generation with Diffusion Model Meng Xu et.al. 2410.16840 null
2024-10-22 Combining Ontological Knowledge and Large Language Model for User-Friendly Service Robots Haru Nakajima et.al. 2410.16804 null
2024-10-22 Foundation Models for Remote Sensing and Earth Observation: A Survey Aoran Xiao et.al. 2410.16602 link
2024-10-21 Integrating Reinforcement Learning with Foundation Models for Autonomous Robotics: Methods and Perspectives Angelo Moroncelli et.al. 2410.16411 link
2024-10-21 VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use Zhehao Zhang et.al. 2410.16400 null
2024-10-21 xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs Michael S. Ryoo et.al. 2410.16267 null
2024-10-21 Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping Ryan Li et.al. 2410.16232 null
2024-10-21 Improve Vision Language Model Chain-of-thought Reasoning Ruohong Zhang et.al. 2410.16198 link
2024-10-21 Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning Yihong Tang et.al. 2410.16162 null
2024-10-21 Mitigating Object Hallucination via Concentric Causal Attention Yun Xing et.al. 2410.15926 link
2024-10-21 MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images Pablo Meseguer et.al. 2410.15881 null
2024-10-21 Task-oriented Robotic Manipulation with Vision Language Models Nurhan Bulus Guran et.al. 2410.15863 null
2024-10-21 An Efficient System for Automatic Map Storytelling – A Case Study on Historical Maps Ziyi Liu et.al. 2410.15780 link
2024-10-22 Reducing Hallucinations in Vision-Language Models via Latent Space Steering Sheng Liu et.al. 2410.15778 link
2024-10-21 AutoTrain: No-code training for state-of-the-art models Abhishek Thakur et.al. 2410.15735 link
2024-10-18 NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples Baiqi Li et.al. 2410.14669 null
2024-10-18 Neuro-Symbolic Traders: Assessing the Wisdom of AI Crowds in Markets Namid R. Stillman et.al. 2410.14587 null
2024-10-18 CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection Andrea Appiani et.al. 2410.14509 null
2024-10-18 Zero-shot Action Localization via the Confidence of Large Vision-Language Models Josiah Aklilu et.al. 2410.14340 null
2024-10-18 Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models Olga Loginova et.al. 2410.14248 null
2024-10-18 E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model Haoran Lai et.al. 2410.14200 null
2024-10-18 LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs Yujun Zhou et.al. 2410.14182 null
2024-10-18 MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems Zifeng Zhu et.al. 2410.14179 link
2024-10-18 ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom Jingqi Zhou et.al. 2410.14138 null
2024-10-17 Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers Yuxin Wen et.al. 2410.14072 null
2024-10-17 VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding Runsen Xu et.al. 2410.13860 link
2024-10-17 Differentiable Robot Rendering Ruoshi Liu et.al. 2410.13851 null
2024-10-17 Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning Xiaodan Xing et.al. 2410.13823 link
2024-10-17 Improving Multi-modal Large Language Model through Boosting Vision Capabilities Yanpeng Sun et.al. 2410.13733 null
2024-10-17 VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks Shailaja Keyur Sampat et.al. 2410.13666 link
2024-10-17 H2OVL-Mississippi Vision Language Models Technical Report Shaikat Galib et.al. 2410.13611 null
2024-10-17 GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models Aditya Sharma et.al. 2410.13510 null
2024-10-17 Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding Kyungmin Min et.al. 2410.13321 null
2024-10-17 Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead Kuleen Sasse et.al. 2410.13146 link
2024-10-17 Trust but Verify: Programmatic VLM Evaluation in the Wild Viraj Prabhu et.al. 2410.13121 null
2024-10-16 Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models Ce Zhang et.al. 2410.12790 link
2024-10-16 Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions Zhenyu Jiang et.al. 2410.12773 null
2024-10-16 WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation João Matos et.al. 2410.12722 link
2024-10-16 WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines Genta Indra Winata et.al. 2410.12705 link
2024-10-16 VividMed: Vision Language Model with Versatile Visual Grounding for Medicine Lingxiao Luo et.al. 2410.12694 link
2024-10-16 Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models Shicheng Xu et.al. 2410.12662 null
2024-10-16 CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training Zhiyuan Ma et.al. 2410.12595 null
2024-10-16 FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion Jiacheng Ruan et.al. 2410.12564 link
2024-10-16 Evaluating Cascaded Methods of Vision-Language Models for Zero-Shot Detection and Association of Hardhats for Increased Construction Safety Lucas Choi et.al. 2410.12225 null
2024-10-16 Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution Timothy Wei et.al. 2410.12165 null
2024-10-15 A Survey of Low-shot Vision-Language Model Adaptation via Representer Theorem Kun Ding et.al. 2410.11686 null
2024-10-15 VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models Zihang Li et.al. 2410.11665 null
2024-10-16 Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models Fan Yang et.al. 2410.11639 null
2024-10-15 MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval Reno Kriz et.al. 2410.11619 null
2024-10-15 PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model Shang-Ching Liu et.al. 2410.11564 null
2024-10-15 LargePiG: Your Large Language Model is Secretly a Pointer Generator Zhongxiang Sun et.al. 2410.11366 null
2024-10-15 Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs Shuo Li et.al. 2410.11302 null
2024-10-15 CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification Huazhong Zhao et.al. 2410.11255 null
2024-10-15 Tree of Attributes Prompt Learning for Vision-Language Models Tong Ding et.al. 2410.11201 link
2024-10-14 Locality Alignment Improves Vision-Language Models Ian Covert et.al. 2410.11087 null
2024-10-14 Towards Foundation Models for 3D Vision: How Close Are We? Yiming Zuo et.al. 2410.10799 link
2024-10-14 VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents Shi Yu et.al. 2410.10594 link
2024-10-14 Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification Jiaxiang Gou et.al. 2410.10573 link
2024-10-14 MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks Jiacheng Chen et.al. 2410.10563 link
2024-10-14 Learning to Ground VLMs without Forgetting Aritra Bhowmik et.al. 2410.10491 null
2024-10-14 LG-CAV: Train Any Concept Activation Vector with Language Guidance Qihan Huang et.al. 2410.10308 null
2024-10-14 Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection Jiawen Zhu et.al. 2410.10289 link
2024-10-14 LOBG: Less Overfitting for Better Generalization in Vision-Language Model Chenhao Ding et.al. 2410.10247 null
2024-10-14 MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models Peng Xia et.al. 2410.10139 link
2024-10-14 Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models Jun Luo et.al. 2410.10114 link
2024-10-11 Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models Qin Liu et.al. 2410.09047 null
2024-10-11 The Impact of Visual Information in Chinese Characters: Evaluating Large Models’ Ability to Recognize and Utilize Radicals Xiaofeng Wu et.al. 2410.09013 null
2024-10-11 DA-Ada: Learning Domain-Aware Adapter for Domain Adaptive Object Detection Haochen Li et.al. 2410.09004 link
2024-10-11 SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation Haosheng Li et.al. 2410.08901 null
2024-10-11 Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation Kun Ding et.al. 2410.08895 null
2024-10-11 RoRA-VLM: Robust Retrieval-Augmented Vision Language Models Jingyuan Qi et.al. 2410.08876 null
2024-10-11 Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies Yingqiang Gao et.al. 2410.08860 null
2024-10-11 VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model Beichen Wang et.al. 2410.08792 null
2024-10-11 Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models Reza Abbasi et.al. 2410.08791 link
2024-10-11 Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping Yue Yang et.al. 2410.08695 link
2024-10-10 LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts Anh-Quan Cao et.al. 2410.08211 null
2024-10-10 MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models Wenbo Hu et.al. 2410.08182 null
2024-10-10 On the Evaluation of Generative Robotic Simulations Feng Chen et.al. 2410.08172 null
2024-10-10 Q-VLM: Post-training Quantization for Large Vision-Language Models Changyuan Wang et.al. 2410.08119 link
2024-10-10 Unsupervised Data Validation Methods for Efficient Model Training Yurii Paniv et.al. 2410.07880 null
2024-10-10 HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter Yumiao Zhao et.al. 2410.07854 null
2024-10-10 FLIER: Few-shot Language Image Models Embedded with Latent Representations Zhinuo Zhou et.al. 2410.07648 null
2024-10-10 A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks Hoin Jung et.al. 2410.07593 link
2024-10-10 3D Vision-Language Gaussian Splatting Qucheng Peng et.al. 2410.07577 null
2024-10-10 How Does Vision-Language Adaptation Impact the Safety of Vision Language Models? Seongyun Lee et.al. 2410.07571 null
2024-10-09 Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate Qidong Huang et.al. 2410.07167 link
2024-10-09 Towards Interpreting Visual Information Processing in Vision-Language Models Clement Neo et.al. 2410.07149 link
2024-10-10 EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models Rui Zhao et.al. 2410.07133 link
2024-10-09 VHELM: A Holistic Evaluation of Vision Language Models Tony Lee et.al. 2410.07112 link
2024-10-09 Pixtral 12B Pravesh Agrawal et.al. 2410.07073 link
2024-10-09 Clean Evaluations on Contaminated Visual Language Models Hongyuan Lu et.al. 2410.07030 null
2024-10-09 Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback Dennis Hein et.al. 2410.07025 null
2024-10-09 ModSCAN: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities Yukun Jiang et.al. 2410.06967 link
2024-10-09 Compositional Entailment Learning for Hyperbolic Vision-Language Models Avik Pal et.al. 2410.06912 link
2024-10-09 From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models Yuying Shang et.al. 2410.06795 null
2024-10-07 Fine-Tuning CLIP’s Last Visual Projector: A Few-Shot Cornucopia Mohammad Fahes et.al. 2410.05270 link
2024-10-07 TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens Ya-Qi Yu et.al. 2410.05261 null
2024-10-08 TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models Rabin Adhikari et.al. 2410.05239 link
2024-10-07 Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality Youngtaek Oh et.al. 2410.05210 link
2024-10-07 LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation Zhijie Wang et.al. 2410.05191 null
2024-10-07 VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks Ziyan Jiang et.al. 2410.05160 null
2024-10-07 HyperINF: Unleashing the HyperPower of the Schulz’s Method for Data Influence Estimation Xinyu Zhou et.al. 2410.05090 link
2024-10-07 HE-Drive: Human-Like End-to-End Driving with Vision Language Models Junming Wang et.al. 2410.05051 null
2024-10-07 Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models Dehong Kong et.al. 2410.04884 null
2024-10-07 TLDR: Token-Level Detective Reward Model for Large Vision Language Models Deqing Fu et.al. 2410.04734 null
2024-10-06 Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress Christopher Agia et.al. 2410.04640 null

Video-Text Retrieval

Publish Date Title Authors PDF Code
2025-06-09 Ambiguity-Restrained Text-Video Representation Learning for Partially Relevant Video Retrieval CH Cho et.al. 2506.07471 null
2025-06-04 MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval Xinru Ying et.al. 2506.03473 null
2025-06-03 Uneven Event Modeling for Partially Relevant Video Retrieval Sa Zhu et.al. 2506.00891 null
2025-04-28 Exploiting Inter-Sample Correlation and Intra-Sample Redundancy for Partially Relevant Video Retrieval Junlong Ren et.al. 2504.19637 null
2025-04-17 Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval WonJun Moon et.al. 2504.13035 null
2025-04-15 Towards Efficient Partially Relevant Video Retrieval with Active Moment Discovering Peipei Song et.al. 2504.10920 link
2025-03-20 DSLUT: An Asymmetric LUT and its Automatic Design Flow Based on Practical Functions Moucheng Yang et.al. 2503.16109 null
2025-02-26 Nexus: A Lightweight and Scalable Multi-Agent Framework for Complex Tasks Automation Humza Sami et.al. 2502.19091 link
2025-01-12 Monolithic 3D FPGAs Utilizing Back-End-of-Line Configuration Memories Faaiq Waqar et.al. 2501.06921 null
2024-12-16 View Transformation Robustness for Multi-View 3D Object Reconstruction with Reconstruction Error-Guided View Selection Qi Zhang et.al. 2412.11428 link
2024-07-10 EA-VTR: Event-Aware Video-Text Retrieval Zongyang Ma et.al. 2407.07478 null
2024-05-22 GMMFormer v2: An Uncertainty-aware Framework for Partially Relevant Video Retrieval Yuting Wang et.al. 2405.13824 link
2024-04-08 The role of non-scientific factors vis-a-vis the quality of publications in determining their scholarly impact Giovanni Abramo et.al. 2404.05345 null
2024-04-06 VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA Sachini Wickramasinghe et.al. 2404.04527 null
2024-02-21 Event-aware Video Corpus Moment Retrieval Danyang Hou et.al. 2402.13566 null
2024-01-27 Sigma-lognormal modeling of speech C. Carmona-Duarte et.al. 2401.17320 null
2024-03-12 Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval Taichi Nishimura et.al. 2312.00414 null
2024-07-14 DAG-aware Synthesis Orchestration Yingjie Li et.al. 2310.07846 null
2024-01-03 GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval Yuting Wang et.al. 2310.05195 link
2023-09-25 Multi-event Video-Text Retrieval Gengyuan Zhang et.al. 2308.11551 link
2022-08-26 Partially Relevant Video Retrieval Jianfeng Dong et.al. 2208.12510 link

Human-Centric

Publish Date Title Authors PDF Code
2025-06-14 Domain Generalization for Person Re-identification: A Survey Towards Domain-Agnostic Person Matching Hyeonseo Lee et.al. 2506.12413 link
2025-06-11 ReID5o: Achieving Omni Multi-modal Person Re-identification in a Single Model Jialong Zuo et.al. 2506.09385 link
2025-05-23 DART$^3$: Leveraging Distance for Test Time Adaptation in Person Re-Identification Rajarshi Bhattacharya et.al. 2505.18337 null
2025-05-07 DetReIDX: A Stress-Test Dataset for Real-World UAV-Based Person Recognition Kailash A. Hambarde et.al. 2505.04793 null
2025-04-29 Adept: Annotation-Denoising Auxiliary Tasks with Discrete Cosine Transform Map and Keypoint for Human-Centric Pretraining Weizhen He et.al. 2504.20800 null
2025-06-03 Effective Dual-Region Augmentation for Reduced Reliance on Large Amounts of Labeled Data Prasanna Reddy Pulakurthi et.al. 2504.13077 link
2025-05-24 LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification Yiding Lu et.al. 2504.10174 link
2025-03-27 FusionSegReID: Advancing Person Re-Identification with Multimodal Retrieval and Precise Segmentation Jincheng Yan et.al. 2503.21595 null
2025-03-24 Multi-modal Multi-platform Person Re-Identification: Benchmark and Method Ruiyang Ha et.al. 2503.17096 null
2025-03-11 AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification Huy Nguyen et.al. 2503.08121 link
2025-02-10 Group-CLIP Uncertainty Modeling for Group Re-Identification Qingxin Zhang et.al. 2502.06460 null
2025-01-13 Event-based Video Person Re-identification via Cross-Modality and Temporal Collaboration Renkai Li et.al. 2501.07296 null
2024-12-23 Unity is Strength: Unifying Convolutional and Transformeral Features for Better Person Re-Identification Yuhao Wang et.al. 2412.17239 link
2024-12-06 Mix-Modality Person Re-Identification: A New and Practical Paradigm Wei Liu et.al. 2412.04719 null
2024-12-02 Cerberus: Attribute-based person re-identification using semantic IDs Chanho Eom et.al. 2412.01048 null
2024-11-27 When Large Vision-Language Models Meet Person Re-Identification Qizao Wang et.al. 2411.18111 null
2025-04-11 Anti-Forgetting Adaptation for Unsupervised Person Re-identification Hao Chen et.al. 2411.14695 null
2025-01-15 Multiple Information Prompt Learning for Cloth-Changing Person Re-Identification Shengxun Wei et.al. 2411.00330 null
2024-10-11 Aligned Divergent Pathways for Omni-Domain Generalized Person Re-Identification Eugene P. W. Ang et.al. 2410.08466 null
2024-10-11 Diverse Deep Feature Ensemble Learning for Omni-Domain Generalized Person Re-identification Eugene P. W. Ang et.al. 2410.08460 null
2024-10-11 A Unified Deep Semantic Expansion Framework for Domain-Generalized Person Re-identification Eugene P. W. Ang et.al. 2410.08456 null
2024-10-25 Adaptive High-Frequency Transformer for Diverse Wildlife Re-Identification Chenyue Li et.al. 2410.06977 link
2024-09-03 Cross-Platform Video Person ReID: A New Benchmark Dataset and Adaptation Approach Shizhou Zhang et.al. 2408.07500 link
2024-08-20 PersonViT: Large-scale Self-supervised Vision Transformer for Person Re-Identification Bin Hu et.al. 2408.05398 link
2024-05-30 DiffPhysBA: Diffusion-based Physical Backdoor Attack against Person Re-Identification in Real-World Wenli Sun et.al. 2405.19990 null
2024-05-28 Instruct-ReID++: Towards Universal Purpose Instruction-Guided Person Re-identification Weizhen He et.al. 2405.17790 link
2024-05-15 Color Space Learning for Cross-Color Person Re-Identification Jiahao Nie et.al. 2405.09487 link
2024-07-01 Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID Wentao Tan et.al. 2405.04940 link
2024-04-19 Rethinking Clothes Changing Person ReID: Conflicts, Synthesis, and Optimization Junjie Li et.al. 2404.12611 null
2024-05-01 Part-Attention Based Model Make Occluded Person Re-Identification Stronger Zhihao Chen et.al. 2404.03443 null
2024-03-16 Bidirectional Multi-Step Domain Generalization for Visible-Infrared Person Re-Identification Mahdi Alehdaghi et.al. 2403.10782 null
2024-03-13 Data Augmentation in Human-Centric Vision Wentao Jiang et.al. 2403.08650 null