2025-07-07 |
NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving |
Qucheng Peng et.al. |
2507.05227 |
null |
2025-07-07 |
All in One: Visual-Description-Guided Unified Point Cloud Segmentation |
Zongyan Han et.al. |
2507.05211 |
null |
2025-07-07 |
Differential Attention for Multimodal Crisis Event Analysis |
Nusrat Munia et.al. |
2507.05165 |
null |
2025-07-07 |
INTER: Mitigating Hallucination in Large Vision-Language Models by Interaction Guidance Sampling |
Xin Dong et.al. |
2507.05056 |
null |
2025-07-07 |
Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision |
Soham Walimbe et.al. |
2507.05020 |
null |
2025-07-07 |
From Vision To Language through Graph of Events in Space and Time: An Explainable Self-supervised Approach |
Mihai Masala et.al. |
2507.04815 |
null |
2025-07-07 |
Training-free Generation of Temporally Consistent Rewards from VLMs |
Yinuo Zhao et.al. |
2507.04789 |
null |
2025-07-07 |
Vision-Language Models Can’t See the Obvious |
Yasser Dahou et.al. |
2507.04741 |
null |
2025-07-07 |
An analysis of vision-language models for fabric retrieval |
Francesco Giuliari et.al. |
2507.04735 |
null |
2025-07-07 |
A Visual Leap in CLIP Compositionality Reasoning through Generation of Counterfactual Sets |
Zexi Jia et.al. |
2507.04699 |
null |
2025-07-03 |
DexVLG: Dexterous Vision-Language-Grasp Model at Scale |
Jiawei He et.al. |
2507.02747 |
null |
2025-07-03 |
ArtGS: 3D Gaussian Splatting for Interactive Visual-Physical Modeling and Manipulation of Articulated Objects |
Qiaojun Yu et.al. |
2507.02600 |
null |
2025-07-02 |
cVLA: Towards Efficient Camera-Space VLAs |
Max Argus et.al. |
2507.02190 |
null |
2025-07-02 |
Large Language Models for Crash Detection in Video: A Survey of Methods, Datasets, and Challenges |
Sanjeda Akter et.al. |
2507.02074 |
null |
2025-07-02 |
CI-VID: A Coherent Interleaved Text-Video Dataset |
Yiming Ju et.al. |
2507.01938 |
null |
2025-07-02 |
How Do Vision-Language Models Process Conflicting Information Across Modalities? |
Tianze Hua et.al. |
2507.01790 |
null |
2025-07-02 |
Facial Emotion Learning with Text-Guided Multiview Fusion via Vision-Language Model for 3D/4D Facial Expression Recognition |
Muzammil Behzad et.al. |
2507.01673 |
null |
2025-07-02 |
MARVIS: Modality Adaptive Reasoning over VISualizations |
Benjamin Feuer et.al. |
2507.01544 |
null |
2025-07-02 |
Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence |
Robert Aufschläger et.al. |
2507.01504 |
null |
2025-07-02 |
BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments |
Yibo Qiu et.al. |
2507.01485 |
null |
2025-07-03 |
TriVLA: A Triple-System-Based Unified Vision-Language-Action Model for General Robot Control |
Zhenyang Liu et.al. |
2507.01424 |
null |
2025-07-02 |
CaptionSmiths: Flexibly Controlling Language Pattern in Image Captioning |
Kuniaki Saito et.al. |
2507.01409 |
null |
2025-07-02 |
Long-Tailed Distribution-Aware Router For Mixture-of-Experts in Large Vision-Language Model |
Chaoxiang Cai et.al. |
2507.01351 |
null |
2025-07-02 |
VLAD: A VLM-Augmented Autonomous Driving Framework with Hierarchical Planning and Interpretable Decision Process |
Cristian Gariboldi et.al. |
2507.01284 |
null |
2025-06-30 |
DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World |
Xiangtai Li et.al. |
2506.24102 |
null |
2025-06-30 |
EXPERT: An Explainable Image Captioning Evaluation Metric with Structured Explanations |
Hyunjong Kim et.al. |
2506.24016 |
null |
2025-06-30 |
The Illusion of Progress? A Critical Look at Test-Time Adaptation for Vision-Language Models |
Lijun Sheng et.al. |
2506.24000 |
null |
2025-06-30 |
StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving |
Ruiyang Hao et.al. |
2506.23982 |
null |
2025-06-30 |
GroundingDINO-US-SAM: Text-Prompted Multi-Organ Segmentation in Ultrasound with LoRA-Tuned Vision-Language Models |
Hamza Rasaee et.al. |
2506.23903 |
null |
2025-06-30 |
A Closer Look at Conditional Prompt Tuning for Vision-Language Models |
Ji Zhang et.al. |
2506.23856 |
null |
2025-06-30 |
Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model |
Shiming Chen et.al. |
2506.23822 |
null |
2025-06-30 |
Visual Textualization for Image Prompted Object Detection |
Yongjian Wu et.al. |
2506.23785 |
null |
2025-06-30 |
PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies? |
Atharva Gundawar et.al. |
2506.23725 |
null |
2025-06-30 |
On the Domain Robustness of Contrastive Vision-Language Models |
Mario Koddenbrock et.al. |
2506.23663 |
null |
2025-06-27 |
MiCo: Multi-image Contrast for Reinforcement Visual Reasoning |
Xi Chen et.al. |
2506.22434 |
null |
2025-06-27 |
Test-Time Consistency in Vision Language Models |
Shih-Han Chou et.al. |
2506.22395 |
null |
2025-06-27 |
Exploiting Vision Language Model for Training-Free 3D Point Cloud OOD Detection via Graph Score Propagation |
Tiankai Chen et.al. |
2506.22375 |
null |
2025-06-27 |
Rethinking Visual Token Reduction in LVLMs under Cross-modal Misalignment |
Rui Xu et.al. |
2506.22283 |
null |
2025-06-27 |
COOCO – Common Objects Out-of-Context – Semantic Violation in Scenes: Investigating Multimodal Context in Referential Communication |
Filippo Merlo et.al. |
2506.22274 |
null |
2025-06-27 |
Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs |
Amirmohammad Izadi et.al. |
2506.22146 |
null |
2025-06-27 |
Universal Retrieval for Multimodal Trajectory Modeling |
Xuan Zhang et.al. |
2506.22056 |
null |
2025-06-27 |
Partial CLIP is Enough: Chimera-Seg for Zero-shot Semantic Segmentation |
Jialei Chen et.al. |
2506.22032 |
null |
2025-06-27 |
SPAZER: Spatial-Semantic Progressive Reasoning Agent for Zero-shot 3D Visual Grounding |
Zhao Jin et.al. |
2506.21924 |
null |
2025-06-27 |
SODA: Out-of-Distribution Detection in Domain-Shifted Point Clouds via Neighborhood Propagation |
Adam Goodge et.al. |
2506.21892 |
null |
2025-06-26 |
Mitigating Hallucination of Large Vision-Language Models via Dynamic Logits Calibration |
Jiahe Chen et.al. |
2506.21509 |
null |
2025-06-26 |
Global and Local Entailment Learning for Natural World Imagery |
Srikumar Sastry et.al. |
2506.21476 |
null |
2025-06-26 |
Spatial Mental Modeling from Limited Views |
Baiqiao Yin et.al. |
2506.21458 |
null |
2025-06-27 |
ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models |
Hongbo Liu et.al. |
2506.21356 |
null |
2025-06-26 |
LLaVA-Pose: Enhancing Human Pose and Action Understanding via Keypoint-Integrated Instruction Tuning |
Dewen Zhang et.al. |
2506.21317 |
null |
2025-06-26 |
DrishtiKon: Multi-Granular Visual Grounding for Text-Rich Document Images |
Badri Vishal Kasuba et.al. |
2506.21316 |
null |
2025-06-26 |
World-aware Planning Narratives Enhance Large Vision-Language Model Planner |
Junhao Shi et.al. |
2506.21230 |
null |
2025-06-26 |
Personalized Federated Learning via Dual-Prompt Optimization and Cross Fusion |
Yuguang Zhang et.al. |
2506.21144 |
null |
2025-06-26 |
V2X-REALM: Vision-Language Model-Based Robust End-to-End Cooperative Autonomous Driving with Adaptive Long-Tail Modeling |
Junwei You et.al. |
2506.21041 |
null |
2025-06-26 |
Multimodal Prompt Alignment for Facial Expression Recognition |
Fuyan Ma et.al. |
2506.21017 |
null |
2025-06-25 |
Shape2Animal: Creative Animal Generation from Natural Silhouettes |
Quoc-Duy Tran et.al. |
2506.20616 |
null |
2025-06-25 |
HRIBench: Benchmarking Vision-Language Models for Real-Time Human Perception in Human-Robot Interaction |
Zhonghao Shi et.al. |
2506.20566 |
null |
2025-06-25 |
Med-Art: Diffusion Transformer for 2D Medical Text-to-Image Generation |
Changlu Guo et.al. |
2506.20449 |
null |
2025-06-25 |
CARMA: Context-Aware Situational Grounding of Human-Robot Group Interactions by Combining Vision-Language Models with Object and Action Recognition |
Joerg Deigmoeller et.al. |
2506.20373 |
null |
2025-06-25 |
Mobile-R1: Towards Interactive Reinforcement Learning for VLM-Based Mobile Agent via Task-Level Rewards |
Jihao Gu et.al. |
2506.20332 |
null |
2025-06-25 |
Towards Efficient Exemplar Based Image Editing with Multimodal VLMs |
Avadhoot Jadhav et.al. |
2506.20155 |
null |
2025-06-25 |
MIRAGE: A Benchmark for Multimodal Information-Seeking and Reasoning in Agricultural Expert-Guided Conversations |
Vardhan Dongre et.al. |
2506.20100 |
null |
2025-06-24 |
Unified Vision-Language-Action Model |
Yuqi Wang et.al. |
2506.19850 |
null |
2025-06-24 |
Evaluating Compliance with Visualization Guidelines in Diagrams for Scientific Publications Using Large Vision Language Models |
Johannes Rückert et.al. |
2506.19825 |
null |
2025-06-24 |
CronusVLA: Transferring Latent Motion Across Time for Multi-Frame Prediction in Manipulation |
Hao Li et.al. |
2506.19816 |
null |
2025-06-24 |
UltraAD: Fine-Grained Ultrasound Anomaly Classification via Few-Shot CLIP Adaptation |
Yue Zhou et.al. |
2506.19694 |
null |
2025-06-24 |
PEVLM: Parallel Encoding for Vision-Language Models |
Letian Kang et.al. |
2506.19651 |
null |
2025-06-24 |
V2T-CoT: From Vision to Text Chain-of-Thought for Medical Reasoning and Diagnosis |
Yuan Wang et.al. |
2506.19610 |
null |
2025-06-24 |
ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP |
Zhiyuan Wang et.al. |
2506.19608 |
null |
2025-06-24 |
Fake or Real, Can Robots Tell? Evaluating Embodied Vision-Language Models on Real and 3D-Printed Objects |
Federico Tavella et.al. |
2506.19579 |
null |
2025-06-24 |
Visual hallucination detection in large vision-language models via evidential conflict |
Tao Huang et.al. |
2506.19513 |
null |
2025-06-24 |
T-Rex: Task-Adaptive Spatial Representation Extraction for Robotic Manipulation with Vision-Language Models |
Yiteng Chen et.al. |
2506.19498 |
null |
2025-06-23 |
VQ-Insight: Teaching VLMs for AI-Generated Video Quality Understanding via Progressive Visual Reinforcement Learning |
Xuanyu Zhang et.al. |
2506.18564 |
null |
2025-06-23 |
Generalizing Vision-Language Models to Novel Domains: A Comprehensive Survey |
Xinyao Li et.al. |
2506.18504 |
null |
2025-06-23 |
InternSpatial: A Comprehensive Dataset for Spatial Reasoning in Vision-Language Models |
Nianchen Deng et.al. |
2506.18385 |
null |
2025-06-23 |
Taming Vision-Language Models for Medical Image Analysis: A Comprehensive Review |
Haoneng Lin et.al. |
2506.18378 |
null |
2025-06-23 |
Escaping the SpuriVerse: Can Large Vision-Language Models Generalize Beyond Seen Spurious Correlations? |
Yiwei Yang et.al. |
2506.18322 |
null |
2025-06-24 |
Referring Expression Instance Retrieval and A Strong End-to-End Baseline |
Xiangzhao Hao et.al. |
2506.18246 |
null |
2025-06-23 |
Drive-R1: Bridging Reasoning and Planning in VLMs for Autonomous Driving with Reinforcement Learning |
Yue Li et.al. |
2506.18234 |
null |
2025-06-22 |
See-in-Pairs: Reference Image-Guided Comparative Vision-Language Models for Medical Diagnosis |
Ruinan Jin et.al. |
2506.18140 |
null |
2025-06-22 |
CLGRPO: Reasoning Ability Enhancement for Small VLMs |
Fanyi Wang et.al. |
2506.18048 |
null |
2025-06-22 |
Adapting Vision-Language Models for Evaluating World Models |
Mariya Hendriksen et.al. |
2506.17967 |
null |
2025-06-20 |
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning |
Zhangyang Qi et.al. |
2506.17221 |
null |
2025-06-20 |
Machine Mental Imagery: Empower Multimodal Reasoning with Latent Visual Tokens |
Zeyuan Yang et.al. |
2506.17218 |
link |
2025-06-20 |
Do We Need Large VLMs for Spotting Soccer Actions? |
Ritabrata Chakraborty et.al. |
2506.17144 |
null |
2025-06-20 |
Prmpt2Adpt: Prompt-Based Zero-Shot Domain Adaptation for Resource-Constrained Environments |
Yasir Ali Farrukh et.al. |
2506.16994 |
null |
2025-06-20 |
AnyTraverse: An off-road traversability framework with VLM and human operator in the loop |
Sattwik Sahu et.al. |
2506.16826 |
null |
2025-06-20 |
FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation |
Fan Yang et.al. |
2506.16806 |
null |
2025-06-20 |
Co-VisiON: Co-Visibility ReasONing on Sparse Image Sets of Indoor Scenes |
Chao Chen et.al. |
2506.16805 |
null |
2025-06-20 |
Cross-Modal Obfuscation for Jailbreak Attacks on Large Vision-Language Models |
Lei Jiang et.al. |
2506.16760 |
null |
2025-06-20 |
Language-Informed Synthesis of Rational Agent Models for Grounded Theory-of-Mind Reasoning On-The-Fly |
Lance Ying et.al. |
2506.16755 |
null |
2025-06-20 |
TeSG: Textual Semantic Guidance for Infrared and Visible Image Fusion |
Mingrui Zhu et.al. |
2506.16730 |
null |
2025-06-18 |
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models |
Byung-Kwan Lee et.al. |
2506.15681 |
null |
2025-06-18 |
Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning |
Ankan Deria et.al. |
2506.15649 |
null |
2025-06-18 |
FindingDory: A Benchmark to Evaluate Memory in Embodied Agents |
Karmesh Yadav et.al. |
2506.15635 |
null |
2025-06-18 |
WikiMixQA: A Multimodal Benchmark for Question Answering over Tables and Charts |
Negar Foroutan et.al. |
2506.15594 |
link |
2025-06-18 |
DiscoSG: Towards Discourse-Level Text Scene Graph Parsing through Iterative Graph Refinement |
Shaoqing Lin et.al. |
2506.15583 |
link |
2025-06-18 |
Context-Informed Grounding Supervision |
Hyunji Lee et.al. |
2506.15480 |
link |
2025-06-19 |
OpenPath: Open-Set Active Learning for Pathology Image Classification via Pre-trained Vision-Language Models |
Lanfeng Zhong et.al. |
2506.15318 |
null |
2025-06-18 |
MEGC2025: Micro-Expression Grand Challenge on Spot Then Recognize and Visual Question Answering |
Xinqi Fan et.al. |
2506.15298 |
null |
2025-06-18 |
Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models |
Xuelin Shen et.al. |
2506.15201 |
link |
2025-06-18 |
ReSeDis: A Dataset for Referring-based Object Search across Large-Scale Image Collections |
Ziling Huang et.al. |
2506.15180 |
null |
2025-06-17 |
RobotSmith: Generative Robotic Tool Design for Acquisition of Complex Manipulation Skills |
Chunru Lin et.al. |
2506.14763 |
null |
2025-06-17 |
Casper: Inferring Diverse Intents for Assistive Teleoperation with Vision Language Models |
Huihan Liu et.al. |
2506.14727 |
null |
2025-06-17 |
AGENTSAFE: Benchmarking the Safety of Embodied Agents on Hazardous Instructions |
Aishan Liu et.al. |
2506.14697 |
null |
2025-06-17 |
Recognition through Reasoning: Reinforcing Image Geo-localization with Large Vision-Language Models |
Ling Li et.al. |
2506.14674 |
null |
2025-06-17 |
StreetLens: Enabling Human-Centered AI Agents for Neighborhood Assessment from Street View Imagery |
Jina Kim et.al. |
2506.14670 |
null |
2025-06-17 |
SIRI-Bench: Challenging VLMs’ Spatial Intelligence through Complex Reasoning Tasks |
Zijian Song et.al. |
2506.14512 |
null |
2025-06-17 |
Can Pretrained Vision-Language Embeddings Alone Guide Robot Navigation? |
Nitesh Subedi et.al. |
2506.14507 |
link |
2025-06-17 |
Adapting Lightweight Vision Language Models for Radiological Visual Question Answering |
Aditya Shourya et.al. |
2506.14451 |
link |
2025-06-17 |
Causally Steered Diffusion for Automated Video Counterfactual Generation |
Nikos Spyrou et.al. |
2506.14404 |
link |
2025-06-18 |
ImmerseGen: Agent-Guided Immersive World Generation with Alpha-Textured Proxies |
Jinyan Yuan et.al. |
2506.14315 |
null |
2025-06-16 |
Touch begins where vision ends: Generalizable policies for contact-rich manipulation |
Zifan Zhao et.al. |
2506.13762 |
null |
2025-06-16 |
Prompting with the Future: Open-World Model Predictive Control with Interactive Digital Twins |
Chuanruo Ning et.al. |
2506.13761 |
null |
2025-06-16 |
OTFusion: Bridging Vision-only and Vision-Language Models via Optimal Transport for Transductive Zero-Shot Learning |
Qiyu Xu et.al. |
2506.13723 |
null |
2025-06-16 |
ROSA: Harnessing Robot States for Vision-Language and Action Alignment |
Yuqing Wen et.al. |
2506.13679 |
null |
2025-06-16 |
DualEdit: Dual Editing for Knowledge Updating in Vision-Language Models |
Zhiyi Shi et.al. |
2506.13638 |
null |
2025-06-16 |
VLM-SFD: VLM-Assisted Siamese Flow Diffusion Framework for Dual-Arm Cooperative Manipulation |
Jiaming Chen et.al. |
2506.13428 |
null |
2025-06-16 |
Uncertainty-Informed Active Perception for Open Vocabulary Object Goal Navigation |
Utkarsh Bajpai et.al. |
2506.13367 |
null |
2025-06-16 |
Anomaly Object Segmentation with Vision-Language Models for Steel Scrap Recycling |
Daichi Tanaka et.al. |
2506.13282 |
null |
2025-06-16 |
AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining |
Hongyuan Dong et.al. |
2506.13274 |
null |
2025-06-16 |
Screen Hijack: Visual Poisoning of VLM Agents in Mobile Environments |
Xuan Wang et.al. |
2506.13205 |
null |
2025-06-13 |
Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale |
Junha Lee et.al. |
2506.12009 |
null |
2025-06-13 |
How Visual Representations Map to Language Feature Space in Multimodal LLMs |
Constantin Venhoff et.al. |
2506.11976 |
null |
2025-06-13 |
Rethinking Multilingual Vision-Language Translation: Dataset, Evaluation, and Adaptation |
Xintong Wang et.al. |
2506.11820 |
null |
2025-06-13 |
MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space |
Anshul Singh et.al. |
2506.11684 |
null |
2025-06-13 |
VLM@school – Evaluation of AI image understanding on German middle school knowledge |
René Peinl et.al. |
2506.11604 |
null |
2025-06-13 |
EasyARC: Evaluating Vision Language Models on True Visual Reasoning |
Mert Unsal et.al. |
2506.11595 |
null |
2025-06-13 |
Foundation Models in Autonomous Driving: A Survey on Scenario Generation and Scenario Analysis |
Yuan Gao et.al. |
2506.11526 |
link |
2025-06-13 |
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs |
Xiao Xu et.al. |
2506.11515 |
null |
2025-06-13 |
Taming Stable Diffusion for Computed Tomography Blind Super-Resolution |
Chunlei Li et.al. |
2506.11496 |
null |
2025-06-13 |
On the Natural Robustness of Vision-Language Models Against Visual Perception Attacks in Autonomous Driving |
Pedram MohajerAnsari et.al. |
2506.11472 |
null |
2025-06-12 |
AIR: Zero-shot Generative Model Adaptation with Iterative Refinement |
Guimeng Liu et.al. |
2506.10895 |
link |
2025-06-12 |
VRBench: A Benchmark for Multi-Step Reasoning in Long Narrative Videos |
Jiashuo Yu et.al. |
2506.10857 |
null |
2025-06-13 |
RationalVLA: A Rational Vision-Language-Action Model with Dual System |
Wenxuan Song et.al. |
2506.10826 |
null |
2025-06-12 |
Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding |
Yuhang Zhang et.al. |
2506.10756 |
null |
2025-06-13 |
IQE-CLIP: Instance-aware Query Embedding for Zero-/Few-shot Anomaly Detection in Medical Domain |
Hong Huang et.al. |
2506.10730 |
link |
2025-06-12 |
GigaVideo-1: Advancing Video Generation via Automatic Feedback with 4 GPU-Hours Fine-Tuning |
Xiaoyi Bao et.al. |
2506.10639 |
null |
2025-06-12 |
Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning |
Chun-Mei Feng et.al. |
2506.10575 |
null |
2025-06-12 |
LLMs Are Not Yet Ready for Deepfake Image Detection |
Shahroz Tariq et.al. |
2506.10474 |
null |
2025-06-12 |
UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models |
Jun Yin et.al. |
2506.10342 |
null |
2025-06-12 |
Using Vision Language Models to Detect Students’ Academic Emotion through Facial Expressions |
Deliang Wang et.al. |
2506.10334 |
null |
2025-06-11 |
Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing |
Junfei Wu et.al. |
2506.09965 |
link |
2025-06-11 |
From Intention to Execution: Probing the Generalization Boundaries of Vision-Language-Action Models |
Irving Fang et.al. |
2506.09930 |
null |
2025-06-11 |
3D-Aware Vision-Language Models Fine-Tuning with Geometric Distillation |
Seonho Lee et.al. |
2506.09883 |
link |
2025-06-11 |
Adding simple structure at inference improves Vision-Language Compositionality |
Imanol Miranda et.al. |
2506.09691 |
link |
2025-06-11 |
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios |
Kunyu Peng et.al. |
2506.09650 |
link |
2025-06-11 |
FedVLMBench: Benchmarking Federated Fine-Tuning of Vision-Language Models |
Weiying Zheng et.al. |
2506.09638 |
null |
2025-06-11 |
Revisit What You See: Disclose Language Prior in Vision Tokens for Efficient Guided Decoding of LVLMs |
Beomsik Cho et.al. |
2506.09522 |
link |
2025-06-11 |
Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning |
Cheng Chen et.al. |
2506.09473 |
null |
2025-06-11 |
TOGA: Temporally Grounded Open-Ended Video QA with Weak Supervision |
Ayush Gupta et.al. |
2506.09445 |
null |
2025-06-11 |
DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt |
Yitong Zhang et.al. |
2506.09353 |
link |
2025-06-10 |
VIKI-R: Coordinating Embodied Multi-Agent Cooperation via Reinforcement Learning |
Li Kang et.al. |
2506.09049 |
null |
2025-06-11 |
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs |
Yaniv Nikankin et.al. |
2506.09047 |
link |
2025-06-10 |
Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better |
Dianyi Wang et.al. |
2506.09040 |
link |
2025-06-10 |
Efficient Medical Vision-Language Alignment Through Adapting Masked Vision Models |
Chenyu Lian et.al. |
2506.08990 |
link |
2025-06-10 |
Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions |
David Acuna et.al. |
2506.08927 |
null |
2025-06-11 |
Video-CoT: A Comprehensive Dataset for Spatiotemporal Understanding of Videos Based on Chain-of-Thought |
Shuyi Zhang et.al. |
2506.08817 |
null |
2025-06-10 |
Multimodal Representation Alignment for Cross-modal Information Retrieval |
Fan Xu et.al. |
2506.08774 |
null |
2025-06-10 |
PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly |
Liang Ma et.al. |
2506.08708 |
null |
2025-06-10 |
VReST: Enhancing Reasoning in Large Vision-Language Models through Tree Search and Self-Reward Mechanism |
Congzhi Zhang et.al. |
2506.08691 |
link |
2025-06-10 |
ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction |
Juan Yeo et.al. |
2506.08678 |
null |
2025-06-09 |
Vision Transformers Don’t Need Trained Registers |
Nick Jiang et.al. |
2506.08010 |
link |
2025-06-09 |
Hidden in plain sight: VLMs overlook their visual representations |
Stephanie Fu et.al. |
2506.08008 |
null |
2025-06-09 |
BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models |
Peiyan Li et.al. |
2506.07961 |
null |
2025-06-09 |
Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations |
Yizhen Li et.al. |
2506.07943 |
null |
2025-06-09 |
Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models |
Chengyue Huang et.al. |
2506.07936 |
null |
2025-06-09 |
SAM2Auto: Auto Annotation Using FLASH |
Arash Rocky et.al. |
2506.07850 |
null |
2025-06-09 |
Image Reconstruction as a Tool for Feature Analysis |
Eduard Allakhverdov et.al. |
2506.07803 |
null |
2025-06-09 |
Re-ranking Reasoning Context with Tree Search Makes Large Vision-Language Models Stronger |
Qi Yang et.al. |
2506.07785 |
null |
2025-06-09 |
Language-Vision Planner and Executor for Text-to-Visual Reasoning |
Yichang Xu et.al. |
2506.07778 |
null |
2025-06-09 |
ArchiLense: A Framework for Quantitative Analysis of Architectural Styles Based on Vision Large Language Models |
Jing Zhong et.al. |
2506.07739 |
null |
2025-06-06 |
CoMemo: LVLMs Need Image Context with Image Memory |
Shi Liu et.al. |
2506.06279 |
null |
2025-06-06 |
ExAct: A Video-Language Benchmark for Expert Action Analysis |
Han Yi et.al. |
2506.06277 |
null |
2025-06-06 |
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding |
Emmanouil Zaranis et.al. |
2506.06275 |
null |
2025-06-06 |
Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study |
Leon Mayer et.al. |
2506.06232 |
null |
2025-06-06 |
GenIR: Generative Visual Feedback for Mental Image Retrieval |
Diji Yang et.al. |
2506.06220 |
null |
2025-06-06 |
STSBench: A Spatio-temporal Scenario Benchmark for Multi-modal Large Language Models in Autonomous Driving |
Christian Fruhwirth-Reisinger et.al. |
2506.06218 |
link |
2025-06-06 |
WisWheat: A Three-Tiered Vision-Language Dataset for Wheat Management |
Bowen Yuan et.al. |
2506.06084 |
null |
2025-06-06 |
Full Conformal Adaptation of Medical Vision-Language Models |
Julio Silva-Rodríguez et.al. |
2506.06076 |
null |
2025-06-06 |
BEAST: Efficient Tokenization of B-Splines Encoded Action Sequences for Imitation Learning |
Hongyi Zhou et.al. |
2506.06072 |
null |
2025-06-06 |
MCA-Bench: A Multimodal Benchmark for Evaluating CAPTCHA Robustness Against VLM-based Attacks |
Zonglin Wu et.al. |
2506.05982 |
link |
2025-06-06 |
Does Your 3D Encoder Really Work? When Pretrain-SFT from 2D VLMs Meets 3D VLMs |
Haoyuan Li et.al. |
2506.05318 |
null |
2025-06-05 |
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm |
Zhang Li et.al. |
2506.05218 |
link |
2025-06-05 |
Quantifying Cross-Modality Memorization in Vision-Language Models |
Yuxin Wen et.al. |
2506.05198 |
null |
2025-06-05 |
CIVET: Systematic Evaluation of Understanding in VLMs |
Massimo Rizzoli et.al. |
2506.05146 |
null |
2025-06-05 |
PixCell: A generative foundation model for digital histopathology images |
Srikar Yellapragada et.al. |
2506.05127 |
null |
2025-06-06 |
SeedEdit 3.0: Fast and High-Quality Generative Image Editing |
Peng Wang et.al. |
2506.05083 |
null |
2025-06-05 |
A Survey on Vietnamese Document Analysis and Recognition: Challenges and Future Directions |
Anh Le et.al. |
2506.05061 |
null |
2025-06-05 |
Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System |
Haokun Liu et.al. |
2506.05020 |
null |
2025-06-05 |
ConECT Dataset: Overcoming Data Scarcity in Context-Aware E-Commerce MT |
Mikołaj Pokrywka et.al. |
2506.04929 |
null |
2025-06-05 |
Evaluating Vision-Language and Large Language Models for Automated Student Assessment in Indonesian Classrooms |
Nurul Aisyah et.al. |
2506.04822 |
null |
2025-06-04 |
OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis |
Junting Chen et.al. |
2506.04217 |
link |
2025-06-04 |
Image Editing As Programs with Diffusion Models |
Yujia Hu et.al. |
2506.04158 |
null |
2025-06-04 |
Recent Advances in Medical Image Classification |
Loan Dao et.al. |
2506.04129 |
null |
2025-06-04 |
LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward |
Yi Zhao et.al. |
2506.04070 |
link |
2025-06-04 |
Mitigating Hallucinations in Large Vision-Language Models via Entity-Centric Multimodal Preference Optimization |
Jiulong Wu et.al. |
2506.04039 |
null |
2025-06-04 |
Vocabulary-free few-shot learning for Vision-Language Models |
Maxime Zanella et.al. |
2506.04005 |
null |
2025-06-04 |
DiffCAP: Diffusion-based Cumulative Adversarial Purification for Vision Language Models |
Jia Fu et.al. |
2506.03933 |
null |
2025-06-04 |
Zero-Shot Temporal Interaction Localization for Egocentric Videos |
Erhang Zhang et.al. |
2506.03662 |
link |
2025-06-04 |
Spatial Understanding from Videos: Structured Prompts Meet Simulation Data |
Haoyu Zhang et.al. |
2506.03642 |
null |
2025-06-04 |
VLMs Can Aggregate Scattered Training Patches |
Zhanhui Zhou et.al. |
2506.03614 |
link |
2025-06-03 |
GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents |
Qianhui Wu et.al. |
2506.03143 |
null |
2025-06-03 |
OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models |
Mengdi Jia et.al. |
2506.03135 |
null |
2025-06-03 |
EgoVLM: Policy Optimization for Egocentric Video Understanding |
Ashwin Vinod et.al. |
2506.03097 |
link |
2025-06-03 |
DPO Learning with LLMs-Judge Signal for Computer Use Agents |
Man Luo et.al. |
2506.03095 |
null |
2025-06-03 |
From Flat to Hierarchical: Extracting Sparse Representations with Matching Pursuit |
Valérie Costa et.al. |
2506.03093 |
null |
2025-06-03 |
Text-guided Generation of Efficient Personalized Inspection Plans |
Xingpeng Sun et.al. |
2506.02917 |
null |
2025-06-04 |
FlySearch: Exploring how vision-language models explore |
Adam Pardyl et.al. |
2506.02896 |
link |
2025-06-03 |
Surfer-H Meets Holo1: Cost-Efficient Web Agent Powered by Open Weights |
Mathieu Andreux et.al. |
2506.02865 |
null |
2025-06-03 |
SemVink: Advancing VLMs’ Semantic Understanding of Optical Illusions via Visual Global Thinking |
Sifan Li et.al. |
2506.02803 |
null |
2025-06-03 |
FreeScene: Mixed Graph Diffusion for 3D Scene Synthesis from Free Prompts |
Tongyuan Bai et.al. |
2506.02781 |
null |
2025-05-30 |
ReasonGen-R1: CoT for Autoregressive Image generation models through SFT and RL |
Yu Zhang et.al. |
2505.24875 |
null |
2025-05-30 |
ProxyThinker: Test-Time Guidance through Small Visual Reasoners |
Zilin Xiao et.al. |
2505.24872 |
link |
2025-05-30 |
GenSpace: Benchmarking Spatially-Aware Image Generation |
Zehan Wang et.al. |
2505.24870 |
null |
2025-05-30 |
Time Blindness: Why Video-Language Models Can’t See What Humans Can? |
Ujjwal Upadhyay et.al. |
2505.24867 |
null |
2025-05-30 |
Conformal Prediction for Zero-Shot Models |
Julio Silva-Rodríguez et.al. |
2505.24693 |
link |
2025-05-30 |
BIMA: Bijective Maximum Likelihood Learning Approach to Hallucination Prediction and Mitigation in Large Vision-Language Models |
Huu-Thien Tran et.al. |
2505.24649 |
null |
2025-05-30 |
SARD: A Large-Scale Synthetic Arabic OCR Dataset for Book-Style Text Recognition |
Omer Nacar et.al. |
2505.24600 |
null |
2025-05-30 |
AMIA: Automatic Masking and Joint Intention Analysis Makes LVLMs Robust Jailbreak Defenders |
Yuqi Zhang et.al. |
2505.24519 |
null |
2025-05-30 |
un$^2$CLIP: Improving CLIP’s Visual Detail Capturing Ability via Inverting unCLIP |
Yinqi Li et.al. |
2505.24517 |
link |
2025-05-30 |
CaMMT: Benchmarking Culturally Aware Multimodal Machine Translation |
Emilio Villa-Cueva et.al. |
2505.24456 |
null |
2025-05-29 |
ZeroGUI: Automating Online GUI Learning at Zero Human Cost |
Chenyu Yang et.al. |
2505.23762 |
link |
2025-05-29 |
Puzzled by Puzzles: When Vision-Language Models Can’t Take a Hint |
Heekyung Lee et.al. |
2505.23759 |
link |
2025-05-29 |
Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models |
Haohan Chi et.al. |
2505.23757 |
link |
2025-05-29 |
To Trust Or Not To Trust Your Vision-Language Model’s Prediction |
Hao Dong et.al. |
2505.23745 |
link |
2025-05-29 |
LayerPeeler: Autoregressive Peeling for Layer-wise Image Vectorization |
Ronghuan Wu et.al. |
2505.23740 |
null |
2025-05-29 |
Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better |
Danny Driess et.al. |
2505.23705 |
null |
2025-05-29 |
CLDTracker: A Comprehensive Language Description for Visual Tracking |
Mohamad Alansari et.al. |
2505.23704 |
link |
2025-05-29 |
Grounded Reinforcement Learning for Visual Reasoning |
Gabriel Sarch et.al. |
2505.23678 |
link |
2025-05-29 |
Uni-MuMER: Unified Multi-Task Fine-Tuning of Vision-Language Model for Handwritten Mathematical Expression Recognition |
Yu Li et.al. |
2505.23566 |
link |
2025-05-30 |
Qwen Look Again: Guiding Vision-Language Reasoning Models to Re-attention Visual Information |
Xu Chu et.al. |
2505.23558 |
link |
2025-05-28 |
Zero-Shot Vision Encoder Grafting via LLM Surrogates |
Kaiyu Yue et.al. |
2505.22664 |
link |
2025-05-28 |
Training Free Stylized Abstraction |
Aimon Rahman et.al. |
2505.22663 |
null |
2025-05-28 |
VScan: Rethinking Visual Token Reduction for Efficient Large Vision-Language Models |
Ce Zhang et.al. |
2505.22654 |
null |
2025-05-28 |
Sherlock: Self-Correcting Reasoning in Vision-Language Models |
Yi Ding et.al. |
2505.22651 |
null |
2025-05-28 |
DocReRank: Single-Page Hard Negative Query Generation for Training Multi-Modal RAG Rerankers |
Navve Wasserman et.al. |
2505.22584 |
null |
2025-05-28 |
Hypothesis Testing in Imaging Inverse Problems |
Yiming Xi et.al. |
2505.22481 |
null |
2025-05-28 |
Zero-Shot 3D Visual Grounding from Vision-Language Models |
Rong Li et.al. |
2505.22429 |
null |
2025-05-28 |
IKIWISI: An Interactive Visual Pattern Generator for Evaluating the Reliability of Vision-Language Models Without Ground Truth |
Md Touhidul Islam et.al. |
2505.22305 |
null |
2025-05-28 |
Investigating Mechanisms for In-Context Vision Language Binding |
Darshana Saravanan et.al. |
2505.22200 |
null |
2025-05-29 |
Improving Brain-to-Image Reconstruction via Fine-Grained Text Bridging |
Runze Xia et.al. |
2505.22150 |
null |
2025-05-27 |
ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-Language Models |
Dingming Li et.al. |
2505.21500 |
null |
2025-05-27 |
AdInject: Real-World Black-Box Attacks on Web Agents via Advertising Delivery |
Haowei Wang et.al. |
2505.21499 |
link |
2025-05-27 |
Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers |
Wei Pang et.al. |
2505.21497 |
link |
2025-05-27 |
Mitigating Hallucination in Large Vision-Language Models via Adaptive Attention Calibration |
Mehrdad Fazli et.al. |
2505.21472 |
null |
2025-05-27 |
ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models |
Bozhou Li et.al. |
2505.21465 |
null |
2025-05-27 |
LazyVLM: Neuro-Symbolic Approach to Video Analytics |
Xiangru Jian et.al. |
2505.21459 |
null |
2025-05-27 |
DeCAF: Decentralized Consensus-And-Factorization for Low-Rank Adaptation of Foundation Models |
Nastaran Saadati et.al. |
2505.21382 |
null |
2025-05-27 |
XBOUND: Exploring the Capability Boundaries of Device-Control Agents through Trajectory Tree Exploration |
Shaoqing Zhang et.al. |
2505.21279 |
null |
2025-05-27 |
CROP: Contextual Region-Oriented Visual Token Pruning |
Jiawei Guo et.al. |
2505.21233 |
null |
2025-05-27 |
Interpreting Social Bias in LVLMs via Information Flow Analysis and Multi-Round Dialogue Evaluation |
Zhengyang Ji et.al. |
2505.21106 |
null |
2025-05-26 |
Agentic 3D Scene Generation with Spatially Contextualized VLMs |
Xinhang Liu et.al. |
2505.20129 |
null |
2025-05-26 |
MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models |
Anh Thai et.al. |
2505.20122 |
null |
2025-05-27 |
EmoNet-Face: An Expert-Annotated Benchmark for Synthetic Emotion Recognition |
Christoph Schuhmann et.al. |
2505.20033 |
null |
2025-05-26 |
ViTaPEs: Visuotactile Position Encodings for Cross-Modal Alignment in Multimodal Transformers |
Fotios Lygerakis et.al. |
2505.20032 |
null |
2025-05-26 |
Decomposing Complex Visual Comprehension into Atomic Visual Skills for Vision Language Models |
Hyunsik Chae et.al. |
2505.20021 |
null |
2025-05-26 |
Can Visual Encoder Learn to See Arrows? |
Naoyuki Terashita et.al. |
2505.19944 |
null |
2025-05-26 |
Attention! Your Vision Language Model Could Be Maliciously Manipulated |
Xiaosen Wang et.al. |
2505.19911 |
null |
2025-05-26 |
EMAC+: Embodied Multimodal Agent for Collaborative Planning with VLM+LLM |
Shuang Ao et.al. |
2505.19905 |
null |
2025-05-26 |
Underwater Diffusion Attention Network with Contrastive Language-Image Joint Learning for Underwater Image Enhancement |
Afrah Shaahid et.al. |
2505.19895 |
null |
2025-05-26 |
Two Causally Related Needles in a Video Haystack |
Miaoyu Li et.al. |
2505.19853 |
null |
2025-05-23 |
VideoGameBench: Can Vision-Language Models complete popular video games? |
Alex L. Zhang et.al. |
2505.18134 |
null |
2025-05-23 |
One RL to See Them All: Visual Triple Unified Reinforcement Learning |
Yan Ma et.al. |
2505.18129 |
null |
2025-05-23 |
CXReasonBench: A Benchmark for Evaluating Structured Diagnostic Reasoning in Chest X-rays |
Hyungyung Lee et.al. |
2505.18087 |
link |
2025-05-23 |
FDBPL: Faster Distillation-Based Prompt Learning for Region-Aware Vision-Language Models Adaptation |
Zherui Zhang et.al. |
2505.18053 |
null |
2025-05-23 |
Clip4Retrofit: Enabling Real-Time Image Labeling on Edge Devices via Cross-Architecture CLIP Distillation |
Li Zhong et.al. |
2505.18039 |
null |
2025-05-23 |
Few-Shot Learning from Gigapixel Images via Hierarchical Vision-Language Alignment and Modeling |
Bryan Wong et.al. |
2505.17982 |
null |
2025-05-23 |
Daily-Omni: Towards Audio-Visual Reasoning with Temporal Alignment across Modalities |
Ziwei Zhou et.al. |
2505.17862 |
link |
2025-05-23 |
VLM Models and Automated Grading of Atopic Dermatitis |
Marc Lalonde et.al. |
2505.17835 |
null |
2025-05-23 |
Seeing It or Not? Interpretable Vision-aware Latent Steering to Mitigate Object Hallucinations |
Boxu Chen et.al. |
2505.17812 |
null |
2025-05-23 |
U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding |
Anjie Le et.al. |
2505.17779 |
null |
2025-05-23 |
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models |
Jiaqi Wang et.al. |
2505.16854 |
link |
2025-05-23 |
LaViDa: A Large Diffusion Language Model for Multimodal Understanding |
Shufan Li et.al. |
2505.16839 |
link |
2025-05-22 |
From EduVisBench to EduVisAgent: A Benchmark and Multi-Agent Framework for Pedagogical Visualization |
Haonian Ji et.al. |
2505.16832 |
link |
2025-05-22 |
Perceptual Quality Assessment for Embodied AI |
Chunyi Li et.al. |
2505.16815 |
link |
2025-05-22 |
SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving |
Xuesong Chen et.al. |
2505.16805 |
null |
2025-05-22 |
REOBench: Benchmarking Robustness of Earth Observation Foundation Models |
Xiang Li et.al. |
2505.16793 |
link |
2025-05-22 |
Single Domain Generalization for Few-Shot Counting via Universal Representation Matching |
Xianing Chen et.al. |
2505.16778 |
link |
2025-05-22 |
IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models |
Yiming Gao et.al. |
2505.16774 |
link |
2025-05-22 |
Self-Rewarding Large Vision-Language Models for Optimizing Prompts in Text-to-Image Generation |
Hongji Yang et.al. |
2505.16763 |
null |
2025-05-22 |
Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval |
Hailong Ning et.al. |
2505.16756 |
null |
2025-05-21 |
InstructSAM: A Training-Free Framework for Instruction-Oriented Remote Sensing Object Recognition |
Yijie Zheng et.al. |
2505.15818 |
link |
2025-05-21 |
From Grounding to Manipulation: Case Studies of Foundation Model Integration in Embodied Robotic Systems |
Xiuchao Sui et.al. |
2505.15685 |
link |
2025-05-21 |
FragFake: A Dataset for Fine-Grained Detection of Edited Images with Vision Language Models |
Zhen Sun et.al. |
2505.15644 |
link |
2025-05-21 |
Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models |
Xin Huang et.al. |
2505.15576 |
link |
2025-05-21 |
TinyDrive: Multiscale Visual Question Answering with Selective Token Routing for Autonomous Driving |
Hossein Hassani et.al. |
2505.15564 |
null |
2025-05-21 |
Clapper: Compact Learning and Video Representation in VLMs |
Lingyu Kong et.al. |
2505.15529 |
null |
2025-05-21 |
Robo2VLM: Visual Question Answering from Large-Scale In-the-Wild Robot Manipulation Datasets |
Kaiyuan Chen et.al. |
2505.15517 |
null |
2025-05-21 |
Visual Thoughts: A Unified Perspective of Understanding Multimodal Chain-of-Thought |
Zihui Cheng et.al. |
2505.15510 |
null |
2025-05-21 |
Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts |
Debarshi Brahma et.al. |
2505.15506 |
link |
2025-05-21 |
Beyond Linearity: Squeeze-and-Recalibrate Blocks for Few-Shot Whole Slide Image Classification |
Conghao Xiong et.al. |
2505.15504 |
null |
2025-05-20 |
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning |
Jiaer Xia et.al. |
2505.14677 |
null |
2025-05-20 |
UniCTokens: Boosting Personalized Understanding and Generation via Unified Concept Tokens |
Ruichuan An et.al. |
2505.14671 |
link |
2025-05-20 |
EmoGist: Efficient In-Context Learning for Visual Emotion Understanding |
Ronald Seoh et.al. |
2505.14660 |
null |
2025-05-20 |
CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation |
Anna C. Doris et.al. |
2505.14646 |
link |
2025-05-20 |
Debating for Better Reasoning: An Unsupervised Multimodal Approach |
Ashutosh Adhikari et.al. |
2505.14627 |
null |
2025-05-20 |
KORGym: A Dynamic Game Platform for LLM Reasoning Evaluation |
Jiajun Shi et.al. |
2505.14552 |
link |
2025-05-20 |
PlanGPT-VL: Enhancing Urban Planning with Domain-Specific Vision-Language Models |
He Zhu et.al. |
2505.14481 |
null |
2025-05-20 |
RAVENEA: A Benchmark for Multimodal Retrieval-Augmented Visual Culture Understanding |
Jiaang Li et.al. |
2505.14462 |
link |
2025-05-20 |
SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation |
Yuyang Dong et.al. |
2505.14381 |
null |
2025-05-20 |
Towards Embodied Cognition in Robots via Spatially Grounded Synthetic Worlds |
Joel Currie et.al. |
2505.14366 |
null |
2025-05-19 |
ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models |
Liyan Tang et.al. |
2505.13444 |
null |
2025-05-19 |
G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning |
Liang Chen et.al. |
2505.13426 |
link |
2025-05-19 |
Seeing, Saying, Solving: An LLM-to-TL Framework for Cooperative Robots |
Dan BW Choe et.al. |
2505.13376 |
null |
2025-05-20 |
Unlabeled Data or Pre-trained Model: Rethinking Semi-Supervised Learning and Pretrain-Finetuning |
Song-Lin Li et.al. |
2505.13317 |
null |
2025-05-19 |
I’ll believe it when I see it: Images increase misinformation sharing in Vision-Language Models |
Alice Plebe et.al. |
2505.13302 |
link |
2025-05-19 |
Computer Vision Models Show Human-Like Sensitivity to Geometric and Topological Concepts |
Zekun Wang et.al. |
2505.13281 |
null |
2025-05-19 |
From Local Details to Global Context: Advancing Vision-Language Models with Attention-Based Selection |
Lincan Cai et.al. |
2505.13233 |
link |
2025-05-19 |
ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models |
Matteo Merler et.al. |
2505.13180 |
link |
2025-05-19 |
Just Dance with $π$! A Poly-modal Inductor for Weakly-supervised Video Anomaly Detection |
Snehashis Majhi et.al. |
2505.13123 |
null |
2025-05-19 |
Hearing from Silence: Reasoning Audio Descriptions from Silent Videos via Vision-Language Model |
Yong Ren et.al. |
2505.13062 |
null |
2025-05-16 |
Patho-R1: A Multimodal Reinforcement Learning-Based Pathology Expert Reasoner |
Wenchuan Zhang et.al. |
2505.11404 |
link |
2025-05-16 |
Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation |
Zihan Wang et.al. |
2505.11383 |
link |
2025-05-16 |
Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild |
Derek Ming Siang Tan et.al. |
2505.11350 |
null |
2025-05-16 |
Temporally-Grounded Language Generation: A Benchmark for Real-Time Vision-Language Models |
Keunwoo Peter Yu et.al. |
2505.11326 |
link |
2025-05-16 |
Sample Efficient Reinforcement Learning via Large Vision Language Model Distillation |
Donghoon Lee et.al. |
2505.11221 |
link |
2025-05-19 |
FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining |
Myunsoo Kim et.al. |
2505.11192 |
null |
2025-05-16 |
Redundancy-Aware Pretraining of Vision-Language Foundation Models in Remote Sensing |
Mathis Jürgen Adler et.al. |
2505.11121 |
null |
2025-05-16 |
CUBIC: Concept Embeddings for Unsupervised Bias Identification using VLMs |
David Méndez et.al. |
2505.11060 |
link |
2025-05-16 |
GuardReasoner-VL: Safeguarding VLMs via Reinforced Reasoning |
Yue Liu et.al. |
2505.11049 |
link |
2025-05-16 |
Exploiting the Asymmetric Uncertainty Structure of Pre-trained VLMs on the Unit Hypersphere |
Li Ju et.al. |
2505.11029 |
null |
2025-05-15 |
MASSV: Multimodal Adaptation and Self-Data Distillation for Speculative Decoding of Vision-Language Models |
Mugilan Ganesan et.al. |
2505.10526 |
null |
2025-05-15 |
AI Agents vs. Agentic AI: A Conceptual Taxonomy, Applications and Challenges |
Ranjan Sapkota et.al. |
2505.10468 |
null |
2025-05-15 |
Vision language models have difficulty recognizing virtual objects |
Tyler Tran et.al. |
2505.10453 |
null |
2025-05-15 |
Coherent Language Reconstruction from Brain Recordings with Flexible Multi-Modal Input Stimuli |
Chunyu Ye et.al. |
2505.10356 |
null |
2025-05-15 |
MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models |
Yuncheng Guo et.al. |
2505.10088 |
link |
2025-05-15 |
AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection |
Bin-Bin Gao et.al. |
2505.09926 |
link |
2025-05-14 |
Unfettered Forceful Skill Acquisition with Physical Reasoning and Coordinate Frame Labeling |
William Xie et.al. |
2505.09731 |
null |
2025-05-14 |
ManipBench: Benchmarking Vision-Language Models for Low-Level Robot Manipulation |
Enyu Zhao et.al. |
2505.09698 |
null |
2025-05-14 |
Variational Visual Question Answering |
Tobias Jan Wieczorek et.al. |
2505.09591 |
null |
2025-05-14 |
VTLA: Vision-Tactile-Language-Action Model with Preference Learning for Insertion Manipulation |
Chaofan Zhang et.al. |
2505.09577 |
null |
2025-05-14 |
Flash-VL 2B: Optimizing Vision-Language Model Performance for Ultra-Low Latency and High Throughput |
Bo Zhang et.al. |
2505.09498 |
null |
2025-05-14 |
MAKE: Multi-Aspect Knowledge-Enhanced Vision-Language Pretraining for Zero-shot Dermatological Assessment |
Siyuan Yan et.al. |
2505.09372 |
link |
2025-05-14 |
Unsupervised Multiview Contrastive Language-Image Joint Learning with Pseudo-Labeled Prompts Via Vision-Language Model for 3D/4D Facial Expression Recognition |
Muzammil Behzad et.al. |
2505.09336 |
null |
2025-05-14 |
MetaUAS: Universal Anomaly Segmentation with One-Prompt Meta-Learning |
Bin-Bin Gao et.al. |
2505.09265 |
null |
2025-05-14 |
Beyond General Prompts: Automated Prompt Refinement using Contrastive Class Alignment Scores for Disambiguating Objects in Vision-Language Models |
Lucas Choi et.al. |
2505.09139 |
null |
2025-05-14 |
Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning |
Dayong Liang et.al. |
2505.09118 |
null |
2025-05-14 |
OpenLKA: An Open Dataset of Lane Keeping Assist from Recent Car Models under Real-world Driving Conditions |
Yuhang Wang et.al. |
2505.09092 |
link |
2025-05-13 |
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training |
Yangyi Chen et.al. |
2505.08971 |
link |
2025-05-13 |
Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving |
Zongchuang Zhao et.al. |
2505.08725 |
link |
2025-05-13 |
OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning |
Zhaochen Su et.al. |
2505.08617 |
link |
2025-05-13 |
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation |
Yifu Yuan et.al. |
2505.08548 |
link |
2025-05-13 |
Judging the Judges: Can Large Vision-Language Models Fairly Evaluate Chart Comprehension and Reasoning? |
Md Tahmid Rahman Laskar et.al. |
2505.08468 |
link |
2025-05-13 |
MA-ROESL: Motion-aware Rapid Reward Optimization for Efficient Robot Skill Learning from Single Videos |
Xianghui Wang et.al. |
2505.08367 |
null |
2025-05-13 |
Removing Watermarks with Partial Regeneration using Semantic Information |
Krti Tallam et.al. |
2505.08234 |
link |
2025-05-13 |
CLTP: Contrastive Language-Tactile Pre-training for 3D Contact Geometry Understanding |
Wenxuan Ma et.al. |
2505.08194 |
null |
2025-05-13 |
DSADF: Thinking Fast and Slow for Decision Making |
Alex Zhihao Dou et.al. |
2505.08189 |
null |
2025-05-12 |
Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models |
Seungjae Lee et.al. |
2505.07815 |
null |
2025-05-12 |
Reproducibility, Replicability, and Insights into Visual Document Retrieval with Late Interaction |
Jingfen Qiao et.al. |
2505.07730 |
link |
2025-05-12 |
Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images |
Elisei Rykov et.al. |
2505.07704 |
null |
2025-05-12 |
Beyond CLIP Generalization: Against Forward&Backward Forgetting Adapter for Continual Learning of Vision-Language Models |
Songlin Dong et.al. |
2505.07690 |
null |
2025-05-12 |
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization |
Seongjae Kang et.al. |
2505.07675 |
link |
2025-05-12 |
Beyond Static Perception: Integrating Temporal Context into VLMs for Cloth Folding |
Oriol Barbany et.al. |
2505.07600 |
null |
2025-05-12 |
Discrete Visual Tokens of Autoregression, by Diffusion, and for Reasoning |
Bohan Wang et.al. |
2505.07538 |
null |
2025-05-12 |
QuantX: A Framework for Hardware-Aware Quantization of Generative AI Workloads |
Khurram Mazher et.al. |
2505.07531 |
null |
2025-05-12 |
AI-Enabled Accurate Non-Invasive Assessment of Pulmonary Hypertension Progression via Multi-Modal Echocardiography |
Jiewen Yang et.al. |
2505.07347 |
null |
2025-05-12 |
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning |
Xiaokun Wang et.al. |
2505.07263 |
null |
2025-05-09 |
MM-Skin: Enhancing Dermatology Vision-Language Model with an Image-Text Dataset Derived from Textbooks |
Wenqi Zeng et.al. |
2505.06152 |
link |
2025-05-09 |
TREND: Tri-teaching for Robust Preference-based Reinforcement Learning with Demonstrations |
Shuaiyi Huang et.al. |
2505.06079 |
null |
2025-05-09 |
Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI |
Benjamin Raphael Ernhofer et.al. |
2505.05895 |
link |
2025-05-09 |
Describe Anything in Medical Images |
Xi Xiao et.al. |
2505.05804 |
null |
2025-05-09 |
3D CAVLA: Leveraging Depth and 3D Context to Generalize Vision Language Action Models for Unseen Tasks |
Vineet Bhat et.al. |
2505.05800 |
null |
2025-05-08 |
Fine-Tuning Video-Text Contrastive Model for Primate Behavior Retrieval from Unlabeled Raw Videos |
Giulio Cesare Mastrocinque Santo et.al. |
2505.05681 |
null |
2025-05-08 |
VR-RAG: Open-vocabulary Species Recognition with RAG-Assisted Large Multi-Modal Models |
Faizan Farooq Khan et.al. |
2505.05635 |
null |
2025-05-08 |
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments |
Pranav Guruprasad et.al. |
2505.05540 |
link |
2025-05-08 |
Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging |
Shiqi Chen et.al. |
2505.05464 |
link |
2025-05-08 |
SITE: towards Spatial Intelligence Thorough Evaluation |
Wenqi Wang et.al. |
2505.05456 |
null |
2025-05-08 |
DSDrive: Distilling Large Language Model for Lightweight End-to-End Autonomous Driving with Unified Reasoning and Planning |
Wenru Liu et.al. |
2505.05360 |
null |
2025-05-08 |
Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization |
Sooyoung Park et.al. |
2505.05343 |
link |
2025-05-08 |
Mapping User Trust in Vision Language Models: Research Landscape, Challenges, and Prospects |
Agnese Chiatti et.al. |
2505.05318 |
null |
2025-05-08 |
Biomed-DPT: Dual Modality Prompt Tuning for Biomedical Vision-Language Models |
Wei Peng et.al. |
2505.05189 |
link |
2025-05-08 |
OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning |
Cong Hua et.al. |
2505.05180 |
link |
2025-05-08 |
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models |
Aishwarya Venkataramanan et.al. |
2505.05163 |
link |
2025-05-08 |
CacheFL: Efficient Federated Cache Model Fine-Tuning for Vision-Language Models |
Mengjun Yi et.al. |
2505.05130 |
null |
2025-05-08 |
X-Driver: Explainable Autonomous Driving with Vision-Language Models |
Wei Liu et.al. |
2505.05098 |
null |
2025-05-07 |
“I Can See Forever!”: Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments |
Ziyi Zhang et.al. |
2505.04488 |
null |
2025-05-07 |
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception |
Junjie Wang et.al. |
2505.04410 |
link |
2025-05-07 |
CM1 – A Dataset for Evaluating Few-Shot Information Extraction with Large Vision Language Models |
Fabian Wolf et.al. |
2505.04214 |
null |
2025-05-07 |
R$^3$-VQA: “Read the Room” by Video Social Reasoning |
Lixing Niu et.al. |
2505.04147 |
null |
2025-05-06 |
X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains |
Qianchu Liu et.al. |
2505.03981 |
null |
2025-05-06 |
Fill the Gap: Quantifying and Reducing the Modality Gap in Image-Text Representation Learning |
François Role et.al. |
2505.03703 |
null |
2025-05-06 |
Distribution-Conditional Generation: From Class Distribution to Creative Generation |
Fu Feng et.al. |
2505.03667 |
null |
2025-05-06 |
Learning Unknown Spoof Prompts for Generalized Face Anti-Spoofing Using Only Real Face Images |
Fangling Jiang et.al. |
2505.03611 |
null |
2025-05-06 |
Learning Knowledge-based Prompts for Robust 3D Mask Presentation Attack Detection |
Fangling Jiang et.al. |
2505.03610 |
null |
2025-05-06 |
LogisticsVLN: Vision-Language Navigation For Low-Altitude Terminal Delivery Based on Agentic UAVs |
Xinyuan Zhang et.al. |
2505.03460 |
null |
2025-05-06 |
Mitigating Image Captioning Hallucinations in Vision-Language Models |
Fei Zhao et.al. |
2505.03420 |
null |
2025-05-07 |
Enhancing Target-unspecific Tasks through a Features Matrix |
Fangming Cui et.al. |
2505.03414 |
null |
2025-05-06 |
Reducing Annotation Burden in Physical Activity Research Using Vision-Language Models |
Abram Schonfeldt et.al. |
2505.03374 |
null |
2025-05-06 |
A Vision-Language Model for Focal Liver Lesion Classification |
Song Jian et.al. |
2505.03350 |
null |
2025-05-06 |
From Word to Sentence: A Large-Scale Multi-Instance Dataset for Open-Set Aerial Detection |
Guoting Wei et.al. |
2505.03334 |
null |
2025-05-05 |
LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery |
Jerome Quenum et.al. |
2505.02829 |
null |
2025-05-05 |
HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction |
Muhammad Haris Khan et.al. |
2505.02569 |
null |
2025-05-05 |
Tevatron 2.0: Unified Document Retrieval Toolkit across Scale, Language, and Modality |
Xueguang Ma et.al. |
2505.02466 |
link |
2025-05-05 |
Recent Advances in Out-of-Distribution Detection with CLIP-Like Models: A Survey |
Chaohua Li et.al. |
2505.02448 |
null |
2025-05-05 |
SuperEdit: Rectifying and Facilitating Supervision for Instruction-Based Image Editing |
Ming Li et.al. |
2505.02370 |
link |
2025-05-05 |
TeDA: Boosting Vision-Language Models for Zero-Shot 3D Object Retrieval via Testing-time Distribution Alignment |
Zhichuan Wang et.al. |
2505.02325 |
link |
2025-05-04 |
Compositional Image-Text Matching and Retrieval by Grounding Entities |
Madhukar Reddy Vongala et.al. |
2505.02278 |
null |
2025-05-04 |
Handling Imbalanced Pseudolabels for Vision-Language Models with Concept Alignment and Confusion-Aware Calibrated Margin |
Yuchen Wang et.al. |
2505.02056 |
null |
2025-05-04 |
A Comprehensive Analysis for Visual Object Hallucination in Large Vision-Language Models |
Liqiang Jing et.al. |
2505.01958 |
null |
2025-05-03 |
PhysNav-DG: A Novel Adaptive Framework for Robust VLM-Sensor Fusion in Navigation Applications |
Trisanth Srinivasan et.al. |
2505.01881 |
null |
2025-05-02 |
Dynamic Robot Tool Use with Vision Language Models |
Noah Trupin et.al. |
2505.01399 |
null |
2025-05-02 |
Evaluating Vision Language Model Adaptations for Radiology Report Generation in Low-Resource Languages |
Marco Salmè et.al. |
2505.01096 |
null |
2025-05-02 |
Any-to-Any Vision-Language Model for Multimodal X-ray Imaging and Radiological Report Generation |
Daniele Molino et.al. |
2505.01091 |
null |
2025-05-02 |
Transferable Adversarial Attacks on Black-Box Vision-Language Models |
Kai Hu et.al. |
2505.01050 |
null |
2025-05-01 |
Robotic Visual Instruction |
Yanbang Li et.al. |
2505.00693 |
null |
2025-05-01 |
Visual Test-time Scaling for GUI Agent Grounding |
Tiange Luo et.al. |
2505.00684 |
link |
2025-05-01 |
DeCo: Task Decomposition and Skill Composition for Zero-Shot Generalization in Long-Horizon 3D Manipulation |
Zixuan Chen et.al. |
2505.00527 |
null |
2025-05-01 |
LightEMMA: Lightweight End-to-End Multimodal Model for Autonomous Driving |
Zhijie Qiao et.al. |
2505.00284 |
link |
2025-05-01 |
AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care |
Md Asaduzzaman Jabin et.al. |
2505.00275 |
link |
2025-05-02 |
Empowering Agentic Video Analytics Systems with Video Language Models |
Yuxuan Yan et.al. |
2505.00254 |
null |
2025-04-30 |
V3LMA: Visual 3D-enhanced Language Model for Autonomous Driving |
Jannik Lübberstedt et.al. |
2505.00156 |
null |
2025-04-30 |
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models |
Minh-Hao Van et.al. |
2505.00150 |
null |
2025-04-30 |
Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design |
Vasudev Sharma et.al. |
2505.00134 |
null |
2025-04-30 |
Early Exit and Multi Stage Knowledge Distillation in VLMs for Video Summarization |
Anas Anwarul Haq Khan et.al. |
2504.21831 |
null |
2025-04-30 |
Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models |
Sangmin Woo et.al. |
2504.21559 |
null |
2025-04-30 |
RoboGround: Robotic Manipulation with Grounded Vision-Language Priors |
Haifeng Huang et.al. |
2504.21530 |
null |
2025-04-30 |
Vision-Language Model-Based Semantic-Guided Imaging Biomarker for Early Lung Cancer Detection |
Luoting Zhuang et.al. |
2504.21344 |
null |
2025-04-30 |
An Evaluation of a Visual Question Answering Strategy for Zero-shot Facial Expression Recognition in Still Images |
Modesto Castrillón-Santana et.al. |
2504.21309 |
null |
2025-04-30 |
Talk Before You Retrieve: Agent-Led Discussions for Better RAG in Medical QA |
Xuanzhao Dong et.al. |
2504.21252 |
link |
2025-04-29 |
MemeBLIP2: A novel lightweight multimodal system to detect harmful memes |
Jiaqi Liu et.al. |
2504.21226 |
null |
2025-04-29 |
GLIP-OOD: Zero-Shot Graph OOD Detection with Foundation Model |
Haoyan Xu et.al. |
2504.21186 |
null |
2025-04-29 |
Real-Time Wayfinding Assistant for Blind and Low-Vision Users |
Dabbrata Das et.al. |
2504.20976 |
null |
2025-04-29 |
FedMVP: Federated Multi-modal Visual Prompt Tuning for Vision-Language Models |
Mainak Singha et.al. |
2504.20860 |
null |
2025-04-29 |
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer |
Zechuan Zhang et.al. |
2504.20690 |
null |
2025-04-29 |
SpaRE: Enhancing Spatial Reasoning in Vision-Language Models with Synthetic Data |
Michael Ogezi et.al. |
2504.20648 |
null |
2025-04-29 |
PRISM: Projection-based Reward Integration for Scene-Aware Real-to-Sim-to-Real Transfer with Few Demonstrations |
Haowen Sun et.al. |
2504.20520 |
null |
2025-04-29 |
Antidote: A Unified Framework for Mitigating LVLM Hallucinations in Counterfactual Presupposition and Object Perception |
Yuanchen Wu et.al. |
2504.20468 |
link |
2025-04-29 |
Plant Disease Detection through Multimodal Large Language Models and Convolutional Neural Networks |
Konstantinos I. Roumeliotis et.al. |
2504.20419 |
null |
2025-04-29 |
FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding |
Yanan Guo et.al. |
2504.20384 |
null |
2025-04-28 |
mrCAD: Multimodal Refinement of Computer-aided Designs |
William P. McCarthy et.al. |
2504.20294 |
link |
2025-04-28 |
A Multimodal Pipeline for Clinical Data Extraction: Applying Vision-Language Models to Scans of Transfusion Reaction Reports |
Henning Schäfer et.al. |
2504.20220 |
null |
2025-04-28 |
SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning |
Wufei Ma et.al. |
2504.20024 |
null |
2025-04-29 |
If Concept Bottlenecks are the Question, are Foundation Models the Answer? |
Nicola Debole et.al. |
2504.19774 |
link |
2025-04-28 |
EcoWikiRS: Learning Ecological Representation of Satellite Images from Weak Supervision with Species Observations and Wikipedia |
Valerie Zermatten et.al. |
2504.19742 |
null |
2025-04-28 |
Contrastive Language-Image Learning with Augmented Textual Prompts for 3D/4D FER Using Vision-Language Model |
Muzammil Behzad et.al. |
2504.19739 |
null |
2025-04-28 |
VCM: Vision Concept Modeling Based on Implicit Contrastive Learning with Vision-Language Instruction Fine-Tuning |
Run Luo et.al. |
2504.19627 |
null |
2025-04-28 |
LR-IAD: Mask-Free Industrial Anomaly Detection with Logical Reasoning |
Peijian Zeng et.al. |
2504.19524 |
null |
2025-04-27 |
DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning |
Jialang Lu et.al. |
2504.19127 |
null |
2025-04-27 |
Boosting Single-domain Generalized Object Detection via Vision-Language Knowledge Interaction |
Xiaoran Xu et.al. |
2504.19086 |
null |
2025-04-26 |
Multi-Resolution Pathology-Language Pre-training Model with Text-Guided Visual Representation |
Shahad Albastaki et.al. |
2504.18856 |
link |
2025-04-26 |
Video CLIP Model for Multi-View Echocardiography Interpretation |
Ryo Takizawa et.al. |
2504.18800 |
null |
2025-04-25 |
Generalization Capability for Imitation Learning |
Yixiao Wang et.al. |
2504.18538 |
null |
2025-04-25 |
Fast-Slow Thinking for Large Vision-Language Model Reasoning |
Wenyi Xiao et.al. |
2504.18458 |
null |
2025-04-25 |
Reason Like a Radiologist: Chain-of-Thought and Reinforcement Learning for Verifiable Report Generation |
Peiyuan Jing et.al. |
2504.18453 |
null |
2025-04-25 |
HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding? |
Yusen Zhang et.al. |
2504.18406 |
null |
2025-04-25 |
Revisiting Data Auditing in Large Vision-Language Models |
Hongyu Zhu et.al. |
2504.18349 |
null |
2025-04-25 |
A Large Vision-Language Model based Environment Perception System for Visually Impaired People |
Zezhou Chen et.al. |
2504.18027 |
null |
2025-04-24 |
CAMU: Context Augmentation for Meme Understanding |
Girish A. Koushik et.al. |
2504.17902 |
null |
2025-04-25 |
Data-Driven Calibration of Prediction Sets in Large Vision-Language Models Based on Inductive Conformal Prediction |
Yuanchang Ye et.al. |
2504.17671 |
null |
2025-04-24 |
SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting |
Yiming Zhao et.al. |
2504.17395 |
null |
2025-04-24 |
M-MRE: Extending the Mutual Reinforcement Effect to Multimodal Information Extraction |
Chengguang Gan et.al. |
2504.17353 |
null |
2025-04-24 |
DIMT25@ICDAR2025: HW-TSC’s End-to-End Document Image Machine Translation System Leveraging Large Vision-Language Model |
Zhanglin Wu et.al. |
2504.17315 |
null |
2025-04-24 |
Cracking the Code of Action: a Generative Approach to Affordances for Reinforcement Learning |
Lynn Cherif et.al. |
2504.17282 |
null |
2025-04-24 |
Perspective-Aware Reasoning in Vision-Language Models via Mental Imagery Simulation |
Phillip Y. Lee et.al. |
2504.17207 |
null |
2025-04-23 |
Distilling semantically aware orders for autoregressive image generation |
Rishav Pramanik et.al. |
2504.17069 |
null |
2025-04-23 |
DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs |
Zhenhailong Wang et.al. |
2504.17040 |
null |
2025-04-24 |
V$^2$R-Bench: Holistically Evaluating LVLM Robustness to Fundamental Visual Variations |
Zhiyuan Fan et.al. |
2504.16727 |
null |
2025-04-23 |
Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes |
Joan Perez et.al. |
2504.16538 |
null |
2025-04-23 |
TraveLLaMA: Facilitating Multi-modal Large Language Models to Understand Urban Scenes and Provide Travel Assistance |
Meng Chu et.al. |
2504.16505 |
null |
2025-04-23 |
FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing |
Hariseetharam Gunduboina et.al. |
2504.16433 |
null |
2025-04-22 |
CLIP-IT: CLIP-based Pairing for Histology Images Classification |
Banafsheh Karimian et.al. |
2504.16181 |
link |
2025-04-22 |
MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention |
Yucheng Li et.al. |
2504.16083 |
null |
2025-04-22 |
MR. Video: “MapReduce” is the Principle for Long Video Understanding |
Ziqi Pang et.al. |
2504.16082 |
null |
2025-04-22 |
Guiding VLM Agents with Process Rewards at Inference Time for GUI Navigation |
Zhiyuan Hu et.al. |
2504.16073 |
null |
2025-04-22 |
Describe Anything: Detailed Localized Image and Video Captioning |
Long Lian et.al. |
2504.16072 |
null |
2025-04-22 |
Vision language models are unreliable at trivial spatial cognition |
Sangeet Khemlani et.al. |
2504.16061 |
null |
2025-04-22 |
Vision-Language Models Are Not Pragmatically Competent in Referring Expression Generation |
Ziqiao Ma et.al. |
2504.16060 |
link |
2025-04-22 |
Evaluating Vision Language Models (VLMs) for Radiology: A Comprehensive Analysis |
Frank Li et.al. |
2504.16047 |
null |
2025-04-22 |
LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale |
Joya Chen et.al. |
2504.16030 |
null |
2025-04-22 |
Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models |
Saban Ozturk et.al. |
2504.15929 |
null |
2025-04-22 |
VLM-based Prompts as the Optimal Assistant for Unpaired Histopathology Virtual Staining |
Zizhi Chen et.al. |
2504.15545 |
null |
2025-04-21 |
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models |
Guo Chen et.al. |
2504.15271 |
null |
2025-04-21 |
KGMEL: Knowledge Graph-Enhanced Multimodal Entity Linking |
Juyeon Kim et.al. |
2504.15135 |
link |
2025-04-21 |
Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation |
Hong-Tao Yu et.al. |
2504.14988 |
link |
2025-04-21 |
VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform |
Xingyu Lu et.al. |
2504.14904 |
null |
2025-04-21 |
Object-Level Verbalized Confidence Calibration in Vision-Language Models via Semantic Perturbation |
Yunpu Zhao et.al. |
2504.14848 |
null |
2025-04-20 |
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding |
Songtao Jiang et.al. |
2504.14692 |
null |
2025-04-20 |
NVSMask3D: Hard Visual Prompting with Camera Pose Interpolation for 3D Open Vocabulary Instance Segmentation |
Junyuan Fang et.al. |
2504.14638 |
null |
2025-04-20 |
LGD: Leveraging Generative Descriptions for Zero-Shot Referring Image Segmentation |
Jiachen Li et.al. |
2504.14467 |
null |
2025-04-20 |
Neglected Risks: The Disturbing Reality of Children’s Images in Datasets and the Urgent Call for Accountability |
Carlos Caetano et.al. |
2504.14446 |
null |
2025-04-19 |
Hydra: An Agentic Reasoning Approach for Enhancing Adversarial Robustness and Mitigating Hallucinations in Vision-Language Models |
Chung-En et.al. |
2504.14395 |
null |
2025-04-18 |
Analysing the Robustness of Vision-Language-Models to Common Corruptions |
Muhammad Usama et.al. |
2504.13690 |
null |
2025-04-18 |
EyecareGPT: Boosting Comprehensive Ophthalmology Understanding with Tailored Dataset, Benchmark and Model |
Sijing Li et.al. |
2504.13650 |
link |
2025-04-18 |
PV-VLM: A Multimodal Vision-Language Approach Incorporating Sky Images for Intra-Hour Photovoltaic Power Forecasting |
Huapeng Lin et.al. |
2504.13624 |
null |
2025-04-18 |
Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization |
Hongwei Ji et.al. |
2504.13460 |
null |
2025-04-18 |
Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety |
Shashank Shriram et.al. |
2504.13399 |
link |
2025-04-17 |
VLLFL: A Vision-Language Model Based Lightweight Federated Learning Framework for Smart Agriculture |
Long Li et.al. |
2504.13365 |
null |
2025-04-17 |
Chain-of-Modality: Learning Manipulation Programs from Multimodal Human Videos with Vision-Language-Models |
Chen Wang et.al. |
2504.13351 |
null |
2025-04-17 |
PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding |
Jang Hyun Cho et.al. |
2504.13180 |
link |
2025-04-17 |
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling |
Tsung-Han Wu et.al. |
2504.13169 |
link |
2025-04-17 |
$\texttt{Complex-Edit}$: CoT-Like Instruction Generation for Complexity-Controllable Image Editing Benchmark |
Siwei Yang et.al. |
2504.13143 |
null |
2025-04-17 |
Low-hallucination Synthetic Captions for Large-Scale Vision-Language Model Pre-training |
Xinsong Zhang et.al. |
2504.13123 |
null |
2025-04-17 |
Probing and Inducing Combinational Creativity in Vision-Language Models |
Yongqian Peng et.al. |
2504.13120 |
null |
2025-04-17 |
Object-Driven Narrative in AR: A Scenario-Metaphor Framework with VLM Integration |
Yusi Sun et.al. |
2504.13119 |
null |
2025-04-17 |
Early Accessibility: Automating Alt-Text Generation for UI Icons During App Development |
Sabrina Haque et.al. |
2504.13069 |
null |
2025-04-17 |
NoisyRollout: Reinforcing Visual Reasoning with Data Augmentation |
Xiangyan Liu et.al. |
2504.13055 |
null |
2025-04-17 |
Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning |
Baining Zhao et.al. |
2504.12680 |
link |
2025-04-17 |
VLMGuard-R1: Proactive Safety Alignment for VLMs via Reasoning-Driven Prompt Optimization |
Menglan Chen et.al. |
2504.12661 |
null |
2025-04-16 |
FLIP Reasoning Challenge |
Andreas Plesner et.al. |
2504.12256 |
link |
2025-04-16 |
Efficient Contrastive Decoding with Probabilistic Hallucination Detection - Mitigating Hallucinations in Large Vision Language Models - |
Laura Fieback et.al. |
2504.12137 |
null |
2025-04-17 |
Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions |
Yifei Dong et.al. |
2504.11967 |
null |
2025-04-16 |
Beyond Words: Augmenting Discriminative Richness via Diffusions in Unsupervised Prompt Learning |
Hairui Ren et.al. |
2504.11930 |
link |
2025-04-16 |
AnomalyR1: A GRPO-based End-to-end MLLM for Industrial Anomaly Detection |
Yuhao Chao et.al. |
2504.11914 |
null |
2025-04-16 |
A Visual RAG Pipeline for Few-Shot Fine-Grained Product Classification |
Bianca Lamm et.al. |
2504.11838 |
null |
2025-04-17 |
DVLTA-VQA: Decoupled Vision-Language Modeling with Text-Guided Adaptation for Blind Video Quality Assessment |
Li Yu et.al. |
2504.11733 |
null |
2025-04-16 |
Interpreting the Linear Structure of Vision-language Model Embedding Spaces |
Isabel Papadimitriou et.al. |
2504.11695 |
null |
2025-04-16 |
VLM-Fuzz: Vision Language Model Assisted Recursive Depth-first Search Exploration for Effective UI Testing of Android Apps |
Biniam Fisseha Demissie et.al. |
2504.11675 |
null |
2025-04-15 |
Co-STAR: Collaborative Curriculum Self-Training with Adaptive Regularization for Source-Free Video Domain Adaptation |
Amirhossein Dadashzadeh et.al. |
2504.11669 |
null |
2025-04-15 |
MINDS: The very low-mass star and brown dwarf sample – Hidden water in carbon-dominated protoplanetary disks |
Aditya M. Arabhavi et.al. |
2504.11425 |
null |
2025-04-15 |
From Gaze to Insight: Bridging Human Visual Attention and Vision Language Model Explanation for Weakly-Supervised Medical Image Segmentation |
Jingkun Chen et.al. |
2504.11368 |
null |
2025-04-16 |
Seedream 3.0 Technical Report |
Yu Gao et.al. |
2504.11346 |
null |
2025-04-16 |
UI-E2I-Synth: Advancing GUI Grounding with Large-Scale Instruction Synthesis |
Xinyi Liu et.al. |
2504.11257 |
null |
2025-04-15 |
R-TPT: Improving Adversarial Robustness of Vision-Language Models through Test-Time Prompt Tuning |
Lijun Sheng et.al. |
2504.11195 |
link |
2025-04-15 |
Benchmarking Vision Language Models on German Factual Data |
René Peinl et.al. |
2504.11108 |
null |
2025-04-16 |
Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR |
Yulong Zhang et.al. |
2504.11101 |
null |
2025-04-15 |
QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models |
Yudong Zhang et.al. |
2504.11038 |
link |
2025-04-15 |
MMC: Iterative Refinement of VLM Reasoning via MCTS-based Multimodal Critique |
Shuhang Liu et.al. |
2504.11009 |
null |
2025-04-15 |
Can Vision-Language Models Understand and Interpret Dynamic Gestures from Pedestrians? Pilot Datasets and Exploration Towards Instructive Nonverbal Commands for Cooperative Autonomous Vehicles |
Tonko E. W. Bossen et.al. |
2504.10873 |
null |
2025-04-14 |
Art3D: Training-Free 3D Generation from Flat-Colored Illustration |
Xiaoyan Cong et.al. |
2504.10466 |
null |
2025-04-14 |
Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding |
Tao Zhang et.al. |
2504.10465 |
link |
2025-04-15 |
GUI-R1: A Generalist R1-Style Vision-Language Action Model For GUI Agents |
Xiaobo Xia et.al. |
2504.10458 |
null |
2025-04-15 |
Efficient Prompt Tuning for Hierarchical Ingredient Recognition |
Yinxuan Gui et.al. |
2504.10322 |
null |
2025-04-14 |
SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model |
Zongcan Ding et.al. |
2504.10320 |
null |
2025-04-15 |
Breaking the Data Barrier – Building GUI Agents Through Task Generalization |
Junlei Zhang et.al. |
2504.10127 |
link |
2025-04-14 |
AGO: Adaptive Grounding for Open World 3D Occupancy Prediction |
Peizheng Li et.al. |
2504.10117 |
null |
2025-04-14 |
CameraBench: Benchmarking Visual Reasoning in MLLMs via Photography |
I-Sheng Fang et.al. |
2504.10090 |
null |
2025-04-14 |
Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure |
Théo Gigant et.al. |
2504.10049 |
null |
2025-04-14 |
Aligning Anime Video Generation with Human Feedback |
Bingwen Zhu et.al. |
2504.10044 |
null |
2025-04-11 |
Generating Fine Details of Entity Interactions |
Xinyi Gu et.al. |
2504.08714 |
null |
2025-04-11 |
AstroLLaVA: towards the unification of astronomical data and natural language |
Sharaf Zaman et.al. |
2504.08583 |
null |
2025-04-11 |
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations |
Cheng-Yu Hsieh et.al. |
2504.08368 |
null |
2025-04-11 |
DSM: Building A Diverse Semantic Map for 3D Visual Grounding |
Qinghongbing Xie et.al. |
2504.08307 |
null |
2025-04-11 |
EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models |
Minjae Seo et.al. |
2504.08205 |
null |
2025-04-10 |
Investigating Vision-Language Model for Point Cloud-based Vehicle Classification |
Yiqiao Li et.al. |
2504.08154 |
null |
2025-04-10 |
The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search |
Yutaro Yamada et.al. |
2504.08066 |
link |
2025-04-10 |
VCR-Bench: A Comprehensive Evaluation Framework for Video Chain-of-Thought Reasoning |
Yukun Qi et.al. |
2504.07956 |
null |
2025-04-10 |
SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement |
Xiyao Wang et.al. |
2504.07934 |
link |
2025-04-10 |
SAMJAM: Zero-Shot Video Scene Graph Generation for Egocentric Kitchen Videos |
Joshua Li et.al. |
2504.07867 |
null |
2025-04-10 |
CollEX – A Multimodal Agentic RAG System Enabling Interactive Exploration of Scientific Collections |
Florian Schneider et.al. |
2504.07643 |
null |
2025-04-10 |
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model |
Haozhan Shen et.al. |
2504.07615 |
link |
2025-04-10 |
TokenFocus-VQA: Enhancing Text-to-Image Alignment with Position-Aware Focus and Multi-Perspective Aggregations on LVLMs |
Zijian Zhang et.al. |
2504.07556 |
null |
2025-04-10 |
Why We Feel: Breaking Boundaries in Emotional Reasoning with Multimodal Large Language Models |
Yuxiang Lin et.al. |
2504.07521 |
link |
2025-04-10 |
Kimi-VL Technical Report |
Kimi Team et.al. |
2504.07491 |
link |
2025-04-09 |
Perception in Reflection |
Yana Wei et.al. |
2504.07165 |
null |
2025-04-09 |
Kaleidoscope: In-language Exams for Massively Multilingual Vision Evaluation |
Israfel Salazar et.al. |
2504.07072 |
null |
2025-04-09 |
Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition |
Sergio Romero-Tapiador et.al. |
2504.06925 |
null |
2025-04-09 |
MovSAM: A Single-image Moving Object Segmentation Framework Based on Deep Thinking |
Chang Nie et.al. |
2504.06863 |
null |
2025-04-09 |
ZIP: An Efficient Zeroth-order Prompt Tuning for Black-box Vision-Language Models |
Seonghwan Park et.al. |
2504.06838 |
null |
2025-04-09 |
LVC: A Lightweight Compression Framework for Enhancing VLMs in Long Video Understanding |
Ziyi Wang et.al. |
2504.06835 |
null |
2025-04-08 |
PromptHMR: Promptable Human Mesh Recovery |
Yufu Wang et.al. |
2504.06397 |
null |
2025-04-08 |
SemiDAViL: Semi-supervised Domain Adaptation with Vision-Language Guidance for Semantic Segmentation |
Hritam Basak et.al. |
2504.06389 |
null |
2025-04-08 |
OmniSVG: A Unified Scalable Vector Graphics Generation Model |
Yiying Yang et.al. |
2504.06263 |
null |
2025-04-08 |
Latent Multimodal Reconstruction for Misinformation Detection |
Stefanos-Iordanis Papadopoulos et.al. |
2504.06010 |
link |
2025-04-08 |
econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians |
Can Zhang et.al. |
2504.06003 |
null |
2025-04-08 |
Measuring Déjà vu Memorization Efficiently |
Narine Kokhlikyan et.al. |
2504.05651 |
null |
2025-04-08 |
A Lightweight Large Vision-language Model for Multimodal Medical Images |
Belal Alsinglawi et.al. |
2504.05575 |
null |
2025-04-07 |
ChartQAPro: A More Diverse and Challenging Benchmark for Chart Question Answering |
Ahmed Masry et.al. |
2504.05506 |
link |
2025-04-07 |
Trust Through Transparency: Explainable Social Navigation for Autonomous Mobile Robots via Vision-Language Models |
Oluwadamilola Sotomi et.al. |
2504.05477 |
null |
2025-04-07 |
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering |
Sofian Chaybouti et.al. |
2504.05463 |
null |
2025-04-07 |
Taxonomy-Aware Evaluation of Vision-Language Models |
Vésteinn Snæbjarnarson et.al. |
2504.05457 |
null |
2025-04-07 |
Probing the Visualization Literacy of Vision Language Models: the Good, the Bad, and the Ugly |
Lianghan Dong et.al. |
2504.05445 |
null |
2025-04-07 |
InteractVLM: 3D Interaction Reasoning from 2D Foundational Models |
Sai Kumar Dwivedi et.al. |
2504.05303 |
link |
2025-04-07 |
SmolVLM: Redefining small and efficient multimodal models |
Andrés Marafioti et.al. |
2504.05299 |
null |
2025-04-07 |
A Reality Check of Vision-Language Pre-training in Radiology: Have We Progressed Using Text? |
Julio Silva-Rodríguez et.al. |
2504.05227 |
null |
2025-04-07 |
Vision-Language Model Predictive Control for Manipulation Planning and Trajectory Generation |
Jiaming Chen et.al. |
2504.05225 |
link |
2025-04-07 |
EffOWT: Transfer Visual Language Models to Open-World Tracking Efficiently and Effectively |
Bingyang Wang et.al. |
2504.05141 |
null |
2025-04-07 |
RS-RAG: Bridging Remote Sensing Imagery and Comprehensive Knowledge with a Multi-Modal Dataset and Retrieval-Augmented Generation Model |
Congcong Wen et.al. |
2504.04988 |
null |
2025-04-08 |
A Taxonomy of Self-Handover |
Naoki Wake et.al. |
2504.04939 |
null |
2025-04-07 |
SCAM: A Real-World Typographic Robustness Evaluation for Multimodal Foundation Models |
Justus Westerhoff et.al. |
2504.04893 |
link |
2025-04-07 |
Don’t Lag, RAG: Training-Free Adversarial Detection Using RAG |
Roie Kazoom et.al. |
2504.04858 |
null |
2025-04-07 |
OCC-MLLM-CoT-Alpha: Towards Multi-stage Occlusion Recognition Based on Large Language Models via 3D-Aware Supervision and Chain-of-Thoughts Guidance |
Chaoyi Wang et.al. |
2504.04781 |
null |
2025-04-04 |
Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models |
Mirko Borszukovszki et.al. |
2504.03440 |
null |
2025-04-04 |
SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding |
Yimin Wei et.al. |
2504.03254 |
link |
2025-04-04 |
Seeing is Believing: Belief-Space Planning with Foundation Models as Uncertainty Estimators |
Linfeng Zhao et.al. |
2504.03245 |
null |
2025-04-04 |
Mamba as a Bridge: Where Vision Foundation Models Meet Vision Language Models for Domain-Generalized Semantic Segmentation |
Xin Zhang et.al. |
2504.03193 |
link |
2025-04-04 |
REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval |
Shabnam Choudhury et.al. |
2504.03169 |
null |
2025-04-07 |
NuScenes-SpatialQA: A Spatial Understanding and Reasoning Benchmark for Vision-Language Models in Autonomous Driving |
Kexin Tian et.al. |
2504.03164 |
null |
2025-04-04 |
TokenFLEX: Unified VLM Training for Flexible Visual Tokens Inference |
Junshan Hu et.al. |
2504.03154 |
null |
2025-04-04 |
MORAL: A Multimodal Reinforcement Learning Framework for Decision Making in Autonomous Laboratories |
Natalie Tirabassi et.al. |
2504.03153 |
null |
2025-04-03 |
QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding |
Binh M. Le et.al. |
2504.02971 |
null |
2025-04-03 |
STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection |
Divya Velayudhan et.al. |
2504.02823 |
null |
2025-04-03 |
Sparse Autoencoders Learn Monosemantic Features in Vision-Language Models |
Mateusz Pach et.al. |
2504.02821 |
link |
2025-04-03 |
Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence |
Anita Rau et.al. |
2504.02799 |
null |
2025-04-03 |
Robot-Led Vision Language Model Wellbeing Assessment of Children |
Nida Itrat Abbasi et.al. |
2504.02765 |
null |
2025-04-04 |
Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme |
Yan Ma et.al. |
2504.02587 |
link |
2025-04-03 |
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision |
Xiaofeng Han et.al. |
2504.02477 |
null |
2025-04-03 |
Scaling Video-Language Models to 10K Frames via Hierarchical Differential Distillation |
Chuanqi Cheng et.al. |
2504.02438 |
link |
2025-04-03 |
ReuseDroid: A VLM-empowered Android UI Test Migrator Boosted by Active Feedback |
Xiaolei Li et.al. |
2504.02357 |
null |
2025-04-03 |
Large (Vision) Language Models are Unsupervised In-Context Learners |
Artyom Gadetsky et.al. |
2504.02349 |
link |
2025-04-03 |
Re-thinking Temporal Search for Long-Form Video Understanding |
Jinhui Ye et.al. |
2504.02259 |
link |
2025-04-02 |
FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs |
Mothilal Asokan et.al. |
2504.01916 |
link |
2025-04-02 |
Is Temporal Prompting All We Need For Limited Labeled Action Recognition? |
Shreyank N Gowda et.al. |
2504.01890 |
null |
2025-04-02 |
Prompting Medical Vision-Language Models to Mitigate Diagnosis Bias by Generating Realistic Dermoscopic Images |
Nusrat Munia et.al. |
2504.01838 |
link |
2025-04-02 |
BlenderGym: Benchmarking Foundational Model Systems for Graphics Editing |
Yunqi Gu et.al. |
2504.01786 |
link |
2025-04-02 |
AdPO: Enhancing the Adversarial Robustness of Large Vision-Language Models with Preference Optimization |
Chaohu Liu et.al. |
2504.01735 |
null |
2025-04-02 |
Reasoning LLMs for User-Aware Multimodal Conversational Agents |
Hamed Rahimi et.al. |
2504.01700 |
null |
2025-04-02 |
CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition |
Sarah Alyami et.al. |
2504.01666 |
link |
2025-04-02 |
BioAtt: Anatomical Prior Driven Low-Dose CT Denoising |
Namhun Kim et.al. |
2504.01662 |
null |
2025-04-02 |
Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models |
Zhaochen Wang et.al. |
2504.01589 |
null |
2025-04-02 |
STPNet: Scale-aware Text Prompt Network for Medical Image Segmentation |
Dandan Shan et.al. |
2504.01561 |
link |
2025-03-31 |
SVLA: A Unified Speech-Vision-Language Assistant with Multimodal Reasoning and Speech Generation |
Ngoc Dung Huynh et.al. |
2503.24164 |
null |
2025-03-31 |
AirCache: Activating Inter-modal Relevancy KV Cache Compression for Efficient Large Vision-Language Model Inference |
Kai Huang et.al. |
2503.23956 |
null |
2025-03-31 |
HumanAesExpert: Advancing a Multi-Modality Foundation Model for Human Image Aesthetic Assessment |
Zhichao Liao et.al. |
2503.23907 |
null |
2025-03-31 |
Communication-Efficient and Personalized Federated Foundation Model Fine-Tuning via Tri-Matrix Adaptation |
Yongle Li et.al. |
2503.23869 |
null |
2025-04-01 |
Evaluating small vision-language models as AI assistants for radio astronomical source analysis tasks |
S. Riggi et.al. |
2503.23859 |
link |
2025-03-31 |
Texture or Semantics? Vision-Language Models Get Lost in Font Recognition |
Zhecheng Li et.al. |
2503.23768 |
null |
2025-03-31 |
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language |
Yoonshik Kim et.al. |
2503.23730 |
link |
2025-03-31 |
From Geometry to Culture: An Iterative VLM Layout Framework for Placing Objects in Complex 3D Scene Contexts |
Yuto Asano et.al. |
2503.23707 |
null |
2025-03-30 |
Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging |
Amar Kumar et.al. |
2503.23618 |
null |
2025-03-30 |
DASH: Detection and Assessment of Systematic Hallucinations of VLMs |
Maximilian Augustin et.al. |
2503.23573 |
link |
2025-03-28 |
Understanding Co-speech Gestures in-the-wild |
Sindhu B Hegde et.al. |
2503.22668 |
null |
2025-03-28 |
Unicorn: Text-Only Data Synthesis for Vision Language Model Training |
Xiaomin Yu et.al. |
2503.22655 |
link |
2025-03-28 |
Breaking Language Barriers in Visual Language Models via Multilingual Textual Regularization |
Iñigo Pikabea et.al. |
2503.22577 |
null |
2025-03-28 |
VisTa: Visual-contextual and Text-augmented Zero-shot Object-level OOD Detection |
Bin Zhang et.al. |
2503.22291 |
null |
2025-03-28 |
FLIP: Towards Comprehensive and Reliable Evaluation of Federated Prompt Learning |
Dongping Liao et.al. |
2503.22263 |
link |
2025-03-28 |
Multi-modal Knowledge Distillation-based Human Trajectory Forecasting |
Jaewoo Jeong et.al. |
2503.22201 |
link |
2025-03-28 |
REMAC: Self-Reflective and Self-Evolving Multi-Agent Collaboration for Long-Horizon Robot Manipulation |
Puzhen Yuan et.al. |
2503.22122 |
null |
2025-03-28 |
How Well Can Vision-Language Models Understand Humans’ Intention? An Open-ended Theory of Mind Question Evaluation Benchmark |
Ximing Wen et.al. |
2503.22093 |
null |
2025-03-28 |
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality |
Ziyue Huang et.al. |
2503.22081 |
link |
2025-03-27 |
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models |
Qingqing Zhao et.al. |
2503.22020 |
null |
2025-03-27 |
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck |
Adrian Bulat et.al. |
2503.21757 |
null |
2025-03-27 |
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness |
Dian Zheng et.al. |
2503.21755 |
link |
2025-03-27 |
Fine-Grained Evaluation of Large Vision-Language Models in Autonomous Driving |
Yue Li et.al. |
2503.21505 |
link |
2025-03-27 |
BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding |
Shuming Liu et.al. |
2503.21483 |
link |
2025-03-27 |
Graph-to-Vision: Multi-graph Understanding and Reasoning using Vision-Language Models |
Ruizhou Li et.al. |
2503.21435 |
null |
2025-03-27 |
Cultivating Game Sense for Yourself: Making VLMs Gaming Experts |
Wenxuan Lu et.al. |
2503.21263 |
null |
2025-03-27 |
LLaVA-CMoE: Towards Continual Mixture of Experts for Large Vision-Language Models |
Hengyuan Zhao et.al. |
2503.21227 |
null |
2025-03-27 |
VoxRep: Enhancing 3D Spatial Understanding in 2D Vision-Language Models via Voxel Representation |
Alan Dao et.al. |
2503.21214 |
null |
2025-03-27 |
FakeReasoning: Towards Generalizable Forgery Detection and Reasoning |
Yueying Gao et.al. |
2503.21210 |
null |
2025-03-27 |
VideoMix: Aggregating How-To Videos for Task-Oriented Learning |
Saelyne Yang et.al. |
2503.21130 |
null |
2025-03-27 |
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning |
Huajie Tan et.al. |
2503.20752 |
null |
2025-03-26 |
MMMORRF: Multimodal Multilingual Modularized Reciprocal Rank Fusion |
Saron Samuel et.al. |
2503.20698 |
null |
2025-03-26 |
AutoRad-Lung: A Radiomic-Guided Prompting Autoregressive Vision-Language Model for Lung Nodule Malignancy Prediction |
Sadaf Khademi et.al. |
2503.20662 |
null |
2025-03-26 |
IAP: Improving Continual Learning of Vision-Language Models via Instance-Aware Prompting |
Hao Fu et.al. |
2503.20612 |
link |
2025-03-26 |
Towards Efficient and General-Purpose Few-Shot Misclassification Detection for Vision-Language Models |
Fanhu Zeng et.al. |
2503.20492 |
null |
2025-03-26 |
Self-ReS: Self-Reflection in Large Vision-Language Models for Long Video Understanding |
Joao Pereira et.al. |
2503.20362 |
null |
2025-03-26 |
sudo rm -rf agentic_security |
Sejin Lee et.al. |
2503.20279 |
link |
2025-03-26 |
LogicQA: Logical Anomaly Detection with Vision Language Model Generated Questions |
Yejin Kwon et.al. |
2503.20252 |
null |
2025-03-26 |
Learning Adaptive Dexterous Grasping from Single Demonstrations |
Liangzhi Shi et.al. |
2503.20208 |
null |
2025-03-26 |
Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector |
Xiao Guo et.al. |
2503.20188 |
link |
2025-03-25 |
CoLLM: A Large Language Model for Composed Image Retrieval |
Chuong Huynh et.al. |
2503.19910 |
link |
2025-03-25 |
CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning |
Hao Yu et.al. |
2503.19900 |
link |
2025-03-25 |
Towards Online Multi-Modal Social Interaction Understanding |
Xinpeng Li et.al. |
2503.19851 |
link |
2025-03-25 |
FALCONEye: Finding Answers and Localizing Content in ONE-hour-long videos with multi-modal LLMs |
Carlos Plou et.al. |
2503.19850 |
null |
2025-03-25 |
FireEdit: Fine-grained Instruction-based Image Editing via Region-aware Vision Language Model |
Jun Zhou et.al. |
2503.19839 |
null |
2025-03-25 |
LPOSS: Label Propagation Over Patches and Pixels for Open-vocabulary Semantic Segmentation |
Vladan Stojnić et.al. |
2503.19777 |
link |
2025-03-25 |
ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation |
Haoyu Fu et.al. |
2503.19755 |
null |
2025-03-25 |
Mind the Gap: Benchmarking Spatial Reasoning in Vision-Language Models |
Ilias Stogiannidis et.al. |
2503.19707 |
link |
2025-03-25 |
fine-CLIP: Enhancing Zero-Shot Fine-Grained Surgical Action Recognition with Vision-Language Models |
Saurav Sharma et.al. |
2503.19670 |
null |
2025-03-25 |
RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models |
Mehdi Moshtaghi et.al. |
2503.19654 |
null |
2025-03-25 |
MC-LLaVA: Multi-Concept Personalized Vision-Language Model |
Ruichuan An et.al. |
2503.18854 |
link |
2025-03-24 |
Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations |
Jeonghyeon Kim et.al. |
2503.18817 |
link |
2025-03-24 |
SKDU at De-Factify 4.0: Vision Transformer with Data Augmentation for AI-Generated Image Detection |
Shrikant Malviya et.al. |
2503.18812 |
link |
2025-03-24 |
ArchSeek: Retrieving Architectural Case Studies Using Vision-Language Models |
Danrui Li et.al. |
2503.18680 |
link |
2025-03-25 |
Feature Calibration enhanced Parameter Synthesis for CLIP-based Class-incremental Learning |
Juncen Guo et.al. |
2503.18672 |
null |
2025-03-24 |
Unbiasing through Textual Descriptions: Mitigating Representation Bias in Video Benchmarks |
Nina Shvetsova et.al. |
2503.18637 |
null |
2025-03-24 |
Training-Free Personalization via Retrieval and Reasoning on Fingerprints |
Deepayan Das et.al. |
2503.18623 |
null |
2025-03-24 |
Galaxy Walker: Geometry-aware VLMs For Galaxy-scale Understanding |
Tianyu Chen et.al. |
2503.18578 |
null |
2025-03-24 |
Instruction-Aligned Visual Attention for Mitigating Hallucinations in Large Vision-Language Models |
Bin Li et.al. |
2503.18556 |
null |
2025-03-24 |
MMCR: Advancing Visual Language Model in Multimodal Multi-Turn Contextual Reasoning |
Dawei Yan et.al. |
2503.18533 |
null |
2025-03-21 |
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement |
Yihe Deng et.al. |
2503.17352 |
link |
2025-03-21 |
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models |
Jianing Qi et.al. |
2503.17349 |
null |
2025-03-21 |
Slide-Level Prompt Learning with Vision Language Models for Few-Shot Multiple Instance Learning in Histopathology |
Devavrat Tomar et.al. |
2503.17238 |
link |
2025-03-21 |
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models |
Davide Berasi et.al. |
2503.17142 |
link |
2025-03-21 |
Beyond Accuracy: What Matters in Designing Well-Behaved Models? |
Robin Hesse et.al. |
2503.17110 |
null |
2025-03-21 |
ExCap3D: Expressive 3D Scene Understanding via Object Captioning with Varying Detail |
Chandan Yeshwanth et.al. |
2503.17044 |
null |
2025-03-21 |
When Words Outperform Vision: VLMs Can Self-Improve Via Text-Only Training For Human-Centered Decision Making |
Zhe Hu et.al. |
2503.16965 |
null |
2025-03-21 |
PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition |
Ibtissam Saadi et.al. |
2503.16945 |
null |
2025-03-21 |
Vision-Language Gradient Descent-driven All-in-One Deep Unfolding Networks |
Haijin Zeng et.al. |
2503.16930 |
null |
2025-03-21 |
Classifier-guided CLIP Distillation for Unsupervised Multi-label Classification |
Dongseob Kim et.al. |
2503.16873 |
null |
2025-03-20 |
M3: 3D-Spatial MultiModal Memory |
Xueyan Zou et.al. |
2503.16413 |
link |
2025-03-20 |
Exploring the Hidden Reasoning Process of Large Language Models by Misleading Them |
Guanyu Chen et.al. |
2503.16401 |
null |
2025-03-20 |
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse |
Muyao Li et.al. |
2503.16365 |
null |
2025-03-20 |
Generalized Few-shot 3D Point Cloud Segmentation with Vision-Language Model |
Zhaochong An et.al. |
2503.16282 |
link |
2025-03-20 |
Probabilistic Prompt Distribution Learning for Animal Pose Estimation |
Jiyong Rao et.al. |
2503.16120 |
link |
2025-03-20 |
MarkushGrapher: Joint Visual and Textual Recognition of Markush Structures |
Lucas Morin et.al. |
2503.16096 |
link |
2025-03-20 |
STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding |
Zichen Liu et.al. |
2503.15973 |
link |
2025-03-20 |
Beyond the Visible: Multispectral Vision-Language Learning for Earth Observation |
Clive Tinashe Marimo et.al. |
2503.15969 |
null |
2025-03-20 |
CausalCLIPSeg: Unlocking CLIP’s Potential in Referring Medical Image Segmentation with Causal Intervention |
Yaxiong Chen et.al. |
2503.15949 |
link |
2025-03-20 |
Don’t Fight Hallucinations, Use Them: Estimating Image Realism using NLI over Atomic Facts |
Elisei Rykov et.al. |
2503.15948 |
link |
2025-03-19 |
TULIP: Towards Unified Language-Image Pretraining |
Zineng Tang et.al. |
2503.15485 |
null |
2025-03-19 |
Visual Persona: Foundation Model for Full-Body Human Customization |
Jisu Nam et.al. |
2503.15406 |
null |
2025-03-19 |
EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models |
Yinan Liang et.al. |
2503.15369 |
null |
2025-03-19 |
SemEval-2025 Task 1: AdMIRe – Advancing Multimodal Idiomaticity Representation |
Thomas Pickard et.al. |
2503.15358 |
null |
2025-03-19 |
TruthLens: A Training-Free Paradigm for DeepFake Detection |
Ritabrata Chakraborty et.al. |
2503.15342 |
null |
2025-03-19 |
Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport |
Hao Tan et.al. |
2503.15337 |
link |
2025-03-19 |
Sig2text, a Vision-language model for Non-cooperative Radar Signal Parsing |
Hancong Feng et.al. |
2503.15213 |
link |
2025-03-19 |
Context-Aware Vision Language Foundation Models for Ocular Disease Screening in Retinal Images |
Lucie Berger et.al. |
2503.15212 |
null |
2025-03-19 |
A Unified Framework for Real-Time Failure Handling in Robotics Using Vision-Language Models, Reactive Planner and Behavior Trees |
Faseeh Ahmad et.al. |
2503.15202 |
null |
2025-03-19 |
Machine Unlearning in Hyperbolic vs. Euclidean Multimodal Contrastive Learning: Adapting Alignment Calibration to MERU |
Àlex Pujol Vidal et.al. |
2503.15166 |
null |
2025-03-18 |
ExDDV: A New Dataset for Explainable Deepfake Detection in Video |
Vlad Hondru et.al. |
2503.14421 |
link |
2025-03-18 |
CoSpace: Benchmarking Continuous Space Perception Ability for Vision-Language Models |
Yiqi Zhu et.al. |
2503.14161 |
link |
2025-03-18 |
Growing a Twig to Accelerate Large Vision-Language Models |
Zhenwei Shao et.al. |
2503.14075 |
null |
2025-03-18 |
MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding |
Siwei Han et.al. |
2503.13964 |
link |
2025-03-18 |
Med-R1: Reinforcement Learning for Generalizable Medical Reasoning in Vision-Language Models |
Yuxiang Lai et.al. |
2503.13939 |
null |
2025-03-18 |
ChatBEV: A Visual Language Model that Understands BEV Maps |
Qingyao Xu et.al. |
2503.13938 |
null |
2025-03-18 |
Where do Large Vision-Language Models Look at when Answering Questions? |
Xiaoying Xing et.al. |
2503.13891 |
link |
2025-03-18 |
RAD: Retrieval-Augmented Decision-Making of Meta-Actions with Vision-Language Models in Autonomous Driving |
Yujin Wang et.al. |
2503.13861 |
null |
2025-03-18 |
VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences |
Anukriti Singh et.al. |
2503.13817 |
null |
2025-03-18 |
Identifying and Mitigating Position Bias of Multi-image Vision-Language Models |
Xinyu Tian et.al. |
2503.13792 |
null |
2025-03-17 |
DPC: Dual-Prompt Collaboration for Tuning Vision-Language Models |
Haoyang Li et.al. |
2503.13443 |
link |
2025-03-18 |
MaTVLM: Hybrid Mamba-Transformer for Efficient Vision-Language Modeling |
Yingyue Li et.al. |
2503.13440 |
link |
2025-03-17 |
Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions |
Wan Ju Kang et.al. |
2503.13369 |
null |
2025-03-17 |
Agents Play Thousands of 3D Video Games |
Zhongwen Xu et.al. |
2503.13356 |
null |
2025-03-17 |
HybridGen: VLM-Guided Hybrid Planning for Scalable Data Generation of Imitation Learning |
Wensheng Wang et.al. |
2503.13171 |
null |
2025-03-17 |
Free-form language-based robotic reasoning and grasping |
Runyu Jiao et.al. |
2503.13082 |
null |
2025-03-17 |
Historic Scripts to Modern Vision: A Novel Dataset and A VLM Framework for Transliteration of Modi Script to Devanagari |
Harshal Kausadikar et.al. |
2503.13060 |
null |
2025-03-17 |
InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving |
Ruiqi Song et.al. |
2503.13047 |
null |
2025-03-17 |
Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization |
Ruichuan An et.al. |
2503.12999 |
null |
2025-03-17 |
How Good is my Histopathology Vision-Language Foundation Model? A Holistic Benchmark |
Roba Al Majzoub et.al. |
2503.12990 |
link |
2025-03-14 |
Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense |
Shuyang Hao et.al. |
2503.11619 |
null |
2025-03-14 |
Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages |
Matteo Farina et.al. |
2503.11609 |
link |
2025-03-14 |
SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion |
Ahmed Nassar et.al. |
2503.11576 |
null |
2025-03-14 |
Similarity-Aware Token Pruning: Your VLM but Faster |
Ahmadreza Jeddi et.al. |
2503.11549 |
link |
2025-03-14 |
Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models |
Hao Cheng et.al. |
2503.11519 |
null |
2025-03-14 |
PBR3DGen: A VLM-guided Mesh Generation with High-quality PBR Texture |
Xiaokang Wei et.al. |
2503.11368 |
null |
2025-03-14 |
PARIC: Probabilistic Attention Regularization for Language Guided Image Classification from Pre-trained Vison Language Models |
Mayank Nautiyal et.al. |
2503.11360 |
null |
2025-03-14 |
Road Rage Reasoning with Vision-language Models (VLMs): Task Definition and Evaluation Dataset |
Yibing Weng et.al. |
2503.11342 |
null |
2025-03-14 |
DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models |
Xirui Zhou et.al. |
2503.11265 |
null |
2025-03-14 |
Compound Expression Recognition via Large Vision-Language Models |
Jun Yu et.al. |
2503.11241 |
null |
2025-03-13 |
A Frustratingly Simple Yet Highly Effective Attack Baseline: Over 90% Success Rate Against the Strong Black-box Models of GPT-4.5/4o/o1 |
Zhaoyi Li et.al. |
2503.10635 |
link |
2025-03-13 |
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model |
Jiaming Liu et.al. |
2503.10631 |
null |
2025-03-13 |
CoSTA$\ast$: Cost-Sensitive Toolpath Agent for Multi-turn Image Editing |
Advait Gupta et.al. |
2503.10613 |
link |
2025-03-13 |
TruthPrInt: Mitigating LVLM Object Hallucination Via Latent Truthful-Guided Pre-Intervention |
Jinhao Duan et.al. |
2503.10602 |
link |
2025-03-13 |
GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding |
Rui Hu et.al. |
2503.10596 |
link |
2025-03-13 |
VisualWebInstruct: Scaling up Multimodal Instruction Data through Web Search |
Yiming Jia et.al. |
2503.10582 |
null |
2025-03-13 |
KUDA: Keypoints to Unify Dynamics Learning and Visual Prompting for Open-Vocabulary Robotic Manipulation |
Zixian Liu et.al. |
2503.10546 |
null |
2025-03-13 |
Hoi2Anomaly: An Explainable Anomaly Detection Approach Guided by Human-Object Interaction |
Yuhan Wang et.al. |
2503.10508 |
null |
2025-03-13 |
World Modeling Makes a Better Planner: Dual Preference Optimization for Embodied Task Planning |
Siyin Wang et.al. |
2503.10480 |
null |
2025-03-13 |
VisTai: Benchmarking Vision-Language Models for Traditional Chinese in Taiwan |
Zhi Rui Tam et.al. |
2503.10427 |
link |
2025-03-12 |
SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment |
Katrin Renz et.al. |
2503.09594 |
null |
2025-03-12 |
MindGYM: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions |
Zhe Xu et.al. |
2503.09499 |
link |
2025-03-12 |
BAMBI: Developing Baby Language Models for Italian |
Alice Suozzi et.al. |
2503.09481 |
null |
2025-03-12 |
SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery |
Jiayuan Huang et.al. |
2503.09474 |
null |
2025-03-12 |
Astrea: A MOE-based Visual Understanding Model with Progressive Alignment |
Xiaoda Yang et.al. |
2503.09445 |
null |
2025-03-12 |
Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models |
Julian Spravil et.al. |
2503.09443 |
null |
2025-03-12 |
OpenVidVRD: Open-Vocabulary Video Visual Relation Detection via Prompt-Driven Semantic Space Alignment |
Qi Liu et.al. |
2503.09416 |
null |
2025-03-12 |
Bidirectional Prototype-Reward co-Evolution for Test-Time Adaptation of Vision-Language Models |
Xiaozhen Qiao et.al. |
2503.09394 |
null |
2025-03-12 |
VideoScan: Enabling Efficient Streaming Video Understanding via Frame-level Semantic Carriers |
Ruanjun Li et.al. |
2503.09387 |
null |
2025-03-13 |
2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos |
Marvin Heidinger et.al. |
2503.09320 |
null |
2025-03-11 |
Generating Robot Constitutions & Benchmarks for Semantic Safety |
Pierre Sermanet et.al. |
2503.08663 |
null |
2025-03-11 |
TLA: Tactile-Language-Action Model for Contact-Rich Manipulation |
Peng Hao et.al. |
2503.08548 |
null |
2025-03-11 |
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training |
Tong Wei et.al. |
2503.08525 |
null |
2025-03-11 |
External Knowledge Injection for CLIP-Based Class-Incremental Learning |
Da-Wei Zhou et.al. |
2503.08510 |
link |
2025-03-11 |
MMRL: Multi-Modal Representation Learning for Vision-Language Models |
Yuncheng Guo et.al. |
2503.08497 |
link |
2025-03-11 |
SuperCap: Multi-resolution Superpixel-based Image Captioning |
Henry Senior et.al. |
2503.08496 |
null |
2025-03-11 |
TT-GaussOcc: Test-Time Compute for Self-Supervised Occupancy Prediction via Spatio-Temporal Gaussian Splatting |
Fengyi Zhang et.al. |
2503.08485 |
null |
2025-03-11 |
PhysVLM: Enabling Visual Language Models to Understand Robotic Physical Reachability |
Weijie Zhou et.al. |
2503.08481 |
link |
2025-03-11 |
Debiased Prompt Tuning in Vision-Language Model without Annotations |
Chaoquan Jiang et.al. |
2503.08368 |
null |
2025-03-11 |
Trinity: A Modular Humanoid Robot AI System |
Jingkai Sun et.al. |
2503.08338 |
null |
2025-03-10 |
Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction |
Zongzheng Zhang et.al. |
2503.07485 |
link |
2025-03-10 |
REF-VLM: Triplet-Based Referring Paradigm for Unified Visual Decoding |
Yan Tai et.al. |
2503.07413 |
link |
2025-03-11 |
FedRand: Enhancing Privacy in Federated Learning with Randomized LoRA Subparameter Updates |
Sangwoo Park et.al. |
2503.07216 |
null |
2025-03-10 |
Learning A Zero-shot Occupancy Network from Vision Foundation Models via Self-supervised Adaptation |
Sihao Lin et.al. |
2503.07125 |
null |
2025-03-11 |
PoseLess: Depth-Free Vision-to-Joint Control via Direct Image Mapping with VLM |
Alan Dao et.al. |
2503.07111 |
null |
2025-03-10 |
XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition |
Chuanming Wang et.al. |
2503.07075 |
null |
2025-03-10 |
Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning |
Huilin Deng et.al. |
2503.07065 |
link |
2025-03-10 |
CAPT: Class-Aware Prompt Tuning for Federated Long-Tailed Learning with Vision-Language Model |
Shihao Hou et.al. |
2503.06993 |
null |
2025-03-10 |
A Multimodal Benchmark Dataset and Model for Crop Disease Diagnosis |
Xiang Liu et.al. |
2503.06973 |
link |
2025-03-10 |
Handle Object Navigation as Weighted Traveling Repairman Problem |
Ruimeng Liu et.al. |
2503.06937 |
link |
2025-03-07 |
VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method |
Weiyu Ma et.al. |
2503.05383 |
link |
2025-03-07 |
Data-Efficient Generalization for Zero-shot Composed Image Retrieval |
Zining Chen et.al. |
2503.05204 |
null |
2025-03-07 |
Narrating the Video: Boosting Text-Video Retrieval via Comprehensive Utilization of Frame-Level Captions |
Chan Hur et.al. |
2503.05186 |
null |
2025-03-07 |
Visual Cues of Gender and Race are Associated with Stereotyping in Vision-Language Models |
Messi H. J. Lee et.al. |
2503.05093 |
null |
2025-03-07 |
Perceiving, Reasoning, Adapting: A Dual-Layer Framework for VLM-Guided Precision Robotic Manipulation |
Qingxuan Jia et.al. |
2503.05064 |
null |
2025-03-06 |
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression |
Souvik Kundu et.al. |
2503.04982 |
null |
2025-03-06 |
Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach |
Soumyadeep Ro et.al. |
2503.04918 |
null |
2025-03-06 |
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM |
Sambal Shikhar et.al. |
2503.04724 |
null |
2025-03-06 |
A Benchmark for Multi-Lingual Vision-Language Learning in Remote Sensing Image Captioning |
Qing Zhou et.al. |
2503.04592 |
null |
2025-03-06 |
AnyAnomaly: Zero-Shot Customizable Video Anomaly Detection with LVLM |
Sunghyun Ahn et.al. |
2503.04504 |
link |
2025-03-06 |
TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction |
Chao Wang et.al. |
2503.04457 |
null |
2025-03-06 |
From Idea to CAD: A Language Model-Driven Multi-Agent System for Collaborative Design |
Felix Ocker et.al. |
2503.04417 |
null |
2025-03-07 |
Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models |
Niccolò Turcato et.al. |
2503.04280 |
null |
2025-03-06 |
An Egocentric Vision-Language Model based Portable Real-time Smart Assistant |
Yifei Huang et.al. |
2503.04250 |
link |
2025-03-06 |
Synthetic Data is an Elegant GIFT for Continual Vision-Language Models |
Bin Wu et.al. |
2503.04229 |
null |
2025-03-06 |
EVE: Towards End-to-End Video Subtitle Extraction with Vision-Language Models |
Haiyang Yu et.al. |
2503.04058 |
null |
2025-03-06 |
RetinalGPT: A Retinal Clinical Preference Conversational Assistant Powered by Large Vision-Language Models |
Wenhui Zhu et.al. |
2503.03987 |
null |
2025-03-05 |
CHOP: Mobile Operating Assistant with Constrained High-frequency Optimized Subtask Planning |
Yuqi Zhou et.al. |
2503.03743 |
link |
2025-03-05 |
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction |
Huang Huang et.al. |
2503.03734 |
null |
2025-03-05 |
Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions |
Jun Li et.al. |
2503.03278 |
null |
2025-03-05 |
Variance-Aware Loss Scheduling for Multimodal Alignment in Low-Data Settings |
Sneh Pillai et.al. |
2503.03202 |
null |
2025-03-05 |
SpiritSight Agent: Advanced GUI Agent with One Look |
Zhiyuan Huang et.al. |
2503.03196 |
null |
2025-03-04 |
LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications |
Danqing Zhang et.al. |
2503.02950 |
link |
2025-03-05 |
FairSense-AI: Responsible AI Meets Sustainability |
Shaina Raza et.al. |
2503.02865 |
null |
2025-03-04 |
“What If Smart Homes Could See Our Homes?”: Exploring DIY Smart Home Building Experiences with VLM-Based Camera Sensors |
Sojeong Yun et.al. |
2503.02816 |
null |
2025-03-04 |
Bridging VLM and KMP: Enabling Fine-grained robotic manipulation via Semantic Keypoints Representation |
Junjie Zhu et.al. |
2503.02748 |
null |
2025-03-04 |
ImpedanceGPT: VLM-driven Impedance Control of Swarm of Mini-drones for Intelligent Navigation in Dynamic Environment |
Faryal Batool et.al. |
2503.02723 |
null |
2025-03-04 |
MM-OR: A Large Multimodal Operating Room Dataset for Semantic Understanding of High-Intensity Surgical Environments |
Ege Özsoy et.al. |
2503.02579 |
link |
2025-03-04 |
UAV-VLRR: Vision-Language Informed NMPC for Rapid Response in UAV Search and Rescue |
Yasheerah Yaqoot et.al. |
2503.02465 |
null |
2025-03-04 |
UAV-VLPA*: A Vision-Language-Path-Action System for Optimal Route Generation on a Large Scales |
Oleg Sautenkov et.al. |
2503.02454 |
null |
2025-03-04 |
Vision-Language Model IP Protection via Prompt-based Learning |
Lianyu Wang et.al. |
2503.02393 |
null |
2025-03-04 |
Are Large Vision Language Models Good Game Players? |
Xinyu Wang et.al. |
2503.02358 |
null |
2025-03-04 |
BiasICL: In-Context Learning and Demographic Biases of Vision Language Models |
Sonnet Xu et.al. |
2503.02334 |
null |
2025-02-28 |
Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos |
Zhiyu Tan et.al. |
2502.21314 |
null |
2025-02-28 |
PET Image Denoising via Text-Guided Diffusion: Integrating Anatomical Priors through Text Prompts |
Boxiao Yu et.al. |
2502.21260 |
null |
2025-02-28 |
FC-Attack: Jailbreaking Large Vision-Language Models via Auto-Generated Flowcharts |
Ziyi Zhang et.al. |
2502.21059 |
null |
2025-02-28 |
DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping |
Yifan Zhong et.al. |
2502.20900 |
null |
2025-02-28 |
VLEER: Vision and Language Embeddings for Explainable Whole Slide Image Representation |
Anh Tien Nguyen et.al. |
2502.20850 |
null |
2025-02-28 |
CoTMR: Chain-of-Thought Multi-Scale Reasoning for Training-Free Zero-Shot Composed Image Retrieval |
Zelong Sun et.al. |
2502.20826 |
null |
2025-02-28 |
MedHallTune: An Instruction-Tuning Benchmark for Mitigating Medical Hallucination in Vision-Language Models |
Qiao Yan et.al. |
2502.20780 |
link |
2025-02-28 |
Mitigating Hallucinations in Large Vision-Language Models by Adaptively Constraining Information Flow |
Jiaqi Bai et.al. |
2502.20750 |
link |
2025-02-28 |
T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting |
Yifei Qian et.al. |
2502.20625 |
null |
2025-02-27 |
Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Inference |
Mingyuan Wu et.al. |
2502.20587 |
null |
2025-02-27 |
Visual Adaptive Prompting for Compositional Zero-Shot Learning |
Kyle Stein et.al. |
2502.20292 |
null |
2025-02-27 |
Explainable, Multi-modal Wound Infection Classification from Images Augmented with Generated Captions |
Palawat Busaranuvong et.al. |
2502.20277 |
null |
2025-02-27 |
VDT-Auto: End-to-end Autonomous Driving with VLM-Guided Diffusion Transformers |
Ziang Guo et.al. |
2502.20108 |
null |
2025-02-27 |
Vision-Encoders (Already) Know What They See: Mitigating Object Hallucination via Simple Fine-Grained CLIPScore |
Hongseok Oh et.al. |
2502.20034 |
link |
2025-02-27 |
ChatReID: Open-ended Interactive Person Retrieval via Hierarchical Progressive Tuning for Vision Language Models |
Ke Niu et.al. |
2502.19958 |
null |
2025-02-27 |
Space Rotation with Basis Transformation for Training-free Test-Time Adaptation |
Chenhao Ding et.al. |
2502.19946 |
null |
2025-02-27 |
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification |
Xiangyan Qu et.al. |
2502.19844 |
link |
2025-02-27 |
Analyzing CLIP’s Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study |
Reza Abbasi et.al. |
2502.19828 |
null |
2025-02-27 |
Mixtera: A Data Plane for Foundation Model Training |
Maximilian Böther et.al. |
2502.19790 |
link |
2025-02-27 |
Open-Vocabulary Semantic Part Segmentation of 3D Human |
Keito Suzuki et.al. |
2502.19782 |
null |
2025-02-26 |
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models |
Lucy Xiaoyang Shi et.al. |
2502.19417 |
null |
2025-02-26 |
Pathology Report Generation and Multimodal Representation Learning for Cutaneous Melanocytic Lesions |
Ruben T. Lucassen et.al. |
2502.19293 |
null |
2025-02-26 |
On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation |
Ruben T. Lucassen et.al. |
2502.19285 |
null |
2025-02-26 |
Neural Antidote: Class-Wise Prompt Tuning for Purifying Backdoors in Pre-trained Vision-Language Models |
Jiawei Kong et.al. |
2502.19269 |
null |
2025-02-26 |
FungalZSL: Zero-Shot Fungal Classification with Image Captioning Using a Synthetic Data Approach |
Anju Rani et.al. |
2502.19038 |
null |
2025-02-26 |
VEM: Environment-Free Exploration for Training GUI Agent with Value Environment Model |
Jiani Zheng et.al. |
2502.18906 |
null |
2025-02-26 |
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP |
Chenyang Zhao et.al. |
2502.18816 |
null |
2025-02-26 |
QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries |
Nicolas Harvey Chapman et.al. |
2502.18735 |
null |
2025-02-25 |
olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models |
Jake Poznanski et.al. |
2502.18443 |
link |
2025-02-25 |
WebGames: Challenging General-Purpose Web-Browsing AI Agents |
George Thomas et.al. |
2502.18356 |
link |
2025-02-25 |
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models |
Zhaoyi Liu et.al. |
2502.18290 |
null |
2025-02-25 |
NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts |
Muhammad Farid Adilazuarda et.al. |
2502.18148 |
null |
2025-02-25 |
Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models |
Cao Yuxuan et.al. |
2502.18101 |
link |
2025-02-25 |
VLM-E2E: Enhancing End-to-End Autonomous Driving with Multimodal Driver Attention Fusion |
Pei Liu et.al. |
2502.18042 |
null |
2025-02-25 |
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting |
Haoyuan Li et.al. |
2502.17860 |
null |
2025-02-25 |
Can Multimodal LLMs Perform Time Series Anomaly Detection? |
Xiongxiao Xu et.al. |
2502.17812 |
link |
2025-02-24 |
Mind the Gesture: Evaluating AI Sensitivity to Culturally Offensive Non-Verbal Gestures |
Akhila Yerukola et.al. |
2502.17710 |
link |
2025-02-24 |
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling |
Bingxuan Li et.al. |
2502.17651 |
null |
2025-02-24 |
DIS-CO: Discovering Copyrighted Content in VLMs Training Data |
André V. Duarte et.al. |
2502.17358 |
link |
2025-02-24 |
Applications of Large Models in Medicine |
YunHe Su et.al. |
2502.17132 |
null |
2025-02-24 |
Shakti-VLMs: Scalable Vision-Language Models for Enterprise AI |
Syed Abdul Gaffar Shakhadri et.al. |
2502.17092 |
null |
2025-02-24 |
Evolution 6.0: Evolving Robotic Capabilities Through Generative Design |
Muhammad Haris Khan et.al. |
2502.17034 |
null |
2025-02-24 |
All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark |
Davide Testa et.al. |
2502.16989 |
null |
2025-02-24 |
SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models |
Kevin Miller et.al. |
2502.16911 |
null |
2025-02-24 |
Exploring Causes and Mitigation of Hallucinations in Large Vision Language Models |
Yaqi Sun et.al. |
2502.16842 |
null |
2025-02-24 |
CLIP-SENet: CLIP-based Semantic Enhancement Network for Vehicle Re-identification |
Liping Lu et.al. |
2502.16815 |
null |
2025-02-23 |
Reflective Planning: Vision-Language Models for Multi-Stage Long-Horizon Robotic Manipulation |
Yunhai Feng et.al. |
2502.16707 |
null |
2025-02-23 |
Can Large Vision-Language Models Detect Images Copyright Infringement from GenAI? |
Qipan Xu et.al. |
2502.16618 |
null |
2025-02-21 |
ELIP: Enhanced Visual-Language Foundation Models for Image Retrieval |
Guanqi Zhan et.al. |
2502.15682 |
null |
2025-02-21 |
Testing the limits of fine-tuning to improve reasoning in vision language models |
Luca M. Schulze Buschoff et.al. |
2502.15678 |
null |
2025-02-21 |
Bridging vision language model (VLM) evaluation gaps with a framework for scalable and cost-effective benchmark generation |
Tim Rädsch et.al. |
2502.15563 |
null |
2025-02-21 |
FaultGPT: Industrial Fault Diagnosis Question Answering System by Vision Language Models |
Jiao Chen et.al. |
2502.15481 |
null |
2025-02-21 |
Chitrarth: Bridging Vision and Language for a Billion People |
Shaharukh Khan et.al. |
2502.15392 |
null |
2025-02-21 |
The Role of Background Information in Reducing Object Hallucination in Vision-Language Models: Insights from Cutoff API Prompting |
Masayo Tomita et.al. |
2502.15389 |
null |
2025-02-21 |
CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models |
Shunchang Liu et.al. |
2502.15278 |
null |
2025-02-21 |
The Evolving Landscape of LLM- and VLM-Integrated Reinforcement Learning |
Sheila Schoepp et.al. |
2502.15214 |
null |
2025-02-21 |
PairBench: A Systematic Framework for Selecting Reliable Judge VLMs |
Aarash Feizi et.al. |
2502.15210 |
null |
2025-02-21 |
CurricuVLM: Towards Safe Autonomous Driving via Personalized Safety-Critical Curriculum Learning with Vision-Language Models |
Zihao Sheng et.al. |
2502.15119 |
null |
2025-02-20 |
Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation |
Yue Yang et.al. |
2502.14846 |
null |
2025-02-20 |
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models |
Shangqing Tu et.al. |
2502.14834 |
link |
2025-02-20 |
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features |
Michael Tschannen et.al. |
2502.14786 |
link |
2025-02-20 |
ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting |
Abhijit Mishra et.al. |
2502.14780 |
null |
2025-02-21 |
HiddenDetect: Detecting Jailbreak Attacks against Large Vision-Language Models via Monitoring Hidden States |
Yilei Jiang et.al. |
2502.14744 |
link |
2025-02-20 |
NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization |
Zheyuan Zhang et.al. |
2502.14638 |
link |
2025-02-20 |
Noisy Test-Time Adaptation in Vision-Language Models |
Chentao Cao et.al. |
2502.14604 |
link |
2025-02-20 |
PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models |
Yu Meng et.al. |
2502.14504 |
null |
2025-02-20 |
How Jailbreak Defenses Work and Ensemble? A Mechanistic Investigation |
Zhuohang Long et.al. |
2502.14486 |
null |
2025-02-20 |
Evaluating Precise Geolocation Inference Capabilities of Vision Language Models |
Neel Jay et.al. |
2502.14412 |
link |
2025-02-19 |
Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images |
Shengguang Wu et.al. |
2502.13928 |
null |
2025-02-19 |
NavigateDiff: Visual Predictors are Zero-Shot Navigation Assistants |
Yiran Qin et.al. |
2502.13894 |
null |
2025-02-19 |
SPEX: Scaling Feature Interaction Explanations for LLMs |
Justin Singh Kang et.al. |
2502.13870 |
link |
2025-02-19 |
Quantifying Memorization and Retriever Performance in Retrieval-Augmented Vision-Language Models |
Peter Carragher et.al. |
2502.13836 |
null |
2025-02-19 |
GIMMICK – Globally Inclusive Multimodal Multitask Cultural Knowledge Benchmarking |
Florian Schneider et.al. |
2502.13766 |
null |
2025-02-19 |
VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation |
Wei Zhao et.al. |
2502.13508 |
link |
2025-02-19 |
Transferring Textual Preferences to Vision-Language Understanding through Model Merging |
Chen-An Li et.al. |
2502.13487 |
null |
2025-02-19 |
Estimating Commonsense Plausibility through Semantic Shifts |
Wanqing Cui et.al. |
2502.13464 |
null |
2025-02-19 |
MapNav: A Novel Memory Representation via Annotated Semantic Maps for VLM-based Vision-and-Language Navigation |
Lingfeng Zhang et.al. |
2502.13451 |
null |
2025-02-19 |
Vision-Based Generic Potential Function for Policy Alignment in Multi-Agent Reinforcement Learning |
Hao Ma et.al. |
2502.13430 |
null |
2025-02-18 |
Re-Align: Aligning Vision Language Models via Retrieval-Augmented Direct Preference Optimization |
Shuo Xing et.al. |
2502.13146 |
link |
2025-02-18 |
Multimodal Mamba: Decoder-only Multimodal State Space Model via Quadratic to Linear Distillation |
Bencheng Liao et.al. |
2502.13145 |
link |
2025-02-18 |
SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation |
Zekun Qi et.al. |
2502.13143 |
null |
2025-02-18 |
Understanding and Rectifying Safety Perception Distortion in VLMs |
Xiaohan Zou et.al. |
2502.13095 |
null |
2025-02-18 |
MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching |
Fabian David Schmidt et.al. |
2502.12852 |
null |
2025-02-18 |
Learning a High-quality Robotic Wiping Policy Using Systematic Reward Analysis and Visual-Language Model Based Curriculum |
Yihong Liu et.al. |
2502.12599 |
null |
2025-02-18 |
CutPaste&Find: Efficient Multimodal Hallucination Detector with Visual-aid Knowledge Base |
Cong-Duy Nguyen et.al. |
2502.12591 |
null |
2025-02-18 |
Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning |
Mengshi Qi et.al. |
2502.12425 |
link |
2025-02-17 |
LanP: Rethinking the Impact of Language Priors in Large Vision-Language Models |
Zongyu Wu et.al. |
2502.12359 |
null |
2025-02-17 |
Descriminative-Generative Custom Tokens for Vision-Language Models |
Pramuditha Perera et.al. |
2502.12095 |
null |
2025-02-17 |
VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues |
Jianshu Zhang et.al. |
2502.12084 |
null |
2025-02-17 |
Classifying the Stoichiometry of Virus-like Particles with Interpretable Machine Learning |
Jiayang Zhang et.al. |
2502.12049 |
link |
2025-02-17 |
Learning Generalizable Prompt for CLIP with Class Similarity Knowledge |
Sehun Jung et.al. |
2502.11969 |
null |
2025-02-17 |
VLP: Vision-Language Preference Learning for Embodied Manipulation |
Runze Liu et.al. |
2502.11918 |
null |
2025-02-17 |
From Open-Vocabulary to Vocabulary-Free Semantic Segmentation |
Klara Reichard et.al. |
2502.11891 |
null |
2025-02-17 |
VAQUUM: Are Vague Quantifiers Grounded in Visual Data? |
Hugh Mee Wong et.al. |
2502.11874 |
null |
2025-02-17 |
Defining and Evaluating Visual Language Models’ Basic Spatial Abilities: A Perspective from Psychometrics |
Wenrui Xu et.al. |
2502.11859 |
null |
2025-02-17 |
ILIAS: Instance-Level Image retrieval At Scale |
Giorgos Kordopatis-Zilos et.al. |
2502.11748 |
null |
2025-02-17 |
Multi-Modal Retrieval Augmentation for Open-Ended and Knowledge-Intensive Video Question Answering |
Md Zarif Ul Alam et.al. |
2502.11747 |
null |
2025-02-14 |
Probing Perceptual Constancy in Large Vision Language Models |
Haoran Sun et.al. |
2502.10273 |
null |
2025-02-14 |
VisCon-100K: Leveraging Contextual Web Data for Fine-tuning Vision Language Models |
Gokul Karthik Kumar et.al. |
2502.10250 |
null |
2025-02-14 |
Cooperative Multi-Agent Planning with Adaptive Skill Synthesis |
Zhiyuan Li et.al. |
2502.10148 |
null |
2025-02-14 |
Image Embedding Sampling Method for Diverse Captioning |
Sania Waheed et.al. |
2502.10118 |
null |
2025-02-14 |
Manual2Skill: Learning to Read Manuals and Acquire Robotic Skills for Furniture Assembly Using Vision-Language Models |
Chenrui Tie et.al. |
2502.10090 |
link |
2025-02-14 |
ViRAC: A Vision-Reasoning Agent Head Movement Control Framework in Arbitrary Virtual Environments |
Juyeong Hwang et.al. |
2502.10046 |
null |
2025-02-14 |
Diffusion Trajectory-guided Policy for Long-horizon Robot Manipulation |
Shichao Fan et.al. |
2502.10040 |
null |
2025-02-14 |
HealthGPT: A Medical Large Vision-Language Model for Unifying Comprehension and Generation via Heterogeneous Knowledge Adaptation |
Tianwei Lin et.al. |
2502.09838 |
link |
2025-02-13 |
On the robustness of multimodal language model towards distractions |
Ming Liu et.al. |
2502.09818 |
null |
2025-02-13 |
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis |
Angelos Zavras et.al. |
2502.09598 |
link |
2025-02-13 |
When and How Does CLIP Enable Domain and Compositional Generalization? |
Elias Kempf et.al. |
2502.09507 |
link |
2025-02-13 |
OpenBench: A New Benchmark and Baseline for Semantic Navigation in Smart Logistics |
Junhui Wang et.al. |
2502.09238 |
null |
2025-02-13 |
LimSim Series: An Autonomous Driving Simulation Platform for Validation and Enhancement |
Daocheng Fu et.al. |
2502.09170 |
link |
2025-02-13 |
The influence of visual and linguistic cues on ignorance inference in Vision-Language Models (VLMs) |
Ye-eun Cho et.al. |
2502.09120 |
null |
2025-02-13 |
Vision-Language In-Context Learning Driven Few-Shot Visual Inspection Model |
Shiryu Ueno et.al. |
2502.09057 |
link |
2025-02-13 |
AIDE: Agentically Improve Visual Language Model with Domain Experts |
Ming-Chang Chiu et.al. |
2502.09051 |
null |
2025-02-13 |
3D-Grounded Vision-Language Framework for Robotic Task Planning: Automated Prompt Synthesis and Supervised Reasoning |
Guoqin Tang et.al. |
2502.08903 |
null |
2025-02-13 |
Harnessing Vision Models for Time Series Analysis: A Survey |
Jingchao Ni et.al. |
2502.08869 |
link |
2025-02-12 |
ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots |
Yuxuan Zhang et.al. |
2502.08791 |
null |
2025-02-12 |
A Real-to-Sim-to-Real Approach to Robotic Manipulation with VLM-Generated Iterative Keypoint Rewards |
Shivansh Patel et.al. |
2502.08643 |
null |
2025-02-12 |
SwiftSketch: A Diffusion Model for Image-to-Vector Sketch Generation |
Ellie Arar et.al. |
2502.08642 |
null |
2025-02-12 |
Revisiting 3D LLM Benchmarks: Are We Really Testing 3D Capabilities? |
Jiahe Jin et.al. |
2502.08503 |
link |
2025-02-12 |
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification |
Jiangbo Shi et.al. |
2502.08391 |
link |
2025-02-12 |
Mitigating Hallucinations in Multimodal Spatial Relations through Constraint-Aware Prompting |
Jiarui Wu et.al. |
2502.08317 |
null |
2025-02-12 |
TRISHUL: Towards Region Identification and Screen Hierarchy Understanding for Large VLM based GUI Agents |
Kunal Singh et.al. |
2502.08226 |
null |
2025-02-12 |
Typographic Attacks in a Multi-Image Setting |
Xiaomeng Wang et.al. |
2502.08193 |
link |
2025-02-12 |
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation |
Zhiming Ma et.al. |
2502.08168 |
link |
2025-02-12 |
MAA: Meticulous Adversarial Attack against Vision-Language Pre-trained Models |
Peng-Fei Zhang et.al. |
2502.08079 |
null |
2025-02-11 |
Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders |
Kshitish Ghate et.al. |
2502.07957 |
link |
2025-02-11 |
Scaling Pre-training to One Hundred Billion Data for Vision Language Models |
Xiao Wang et.al. |
2502.07617 |
null |
2025-02-11 |
MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification |
Anh-Tien Nguyen et.al. |
2502.07409 |
link |
2025-02-11 |
TRAVEL: Training-Free Retrieval and Alignment for Vision-and-Language Navigation |
Navid Rajabi et.al. |
2502.07306 |
null |
2025-02-11 |
Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired |
ByungOk Han et.al. |
2502.07183 |
link |
2025-02-10 |
EVEv2: Improved Baselines for Encoder-Free Vision-Language Models |
Haiwen Diao et.al. |
2502.06788 |
link |
2025-02-10 |
Conformal Predictions for Human Action Recognition with Vision-Language Models |
Bary Tim et.al. |
2502.06631 |
null |
2025-02-10 |
Benchmarking Vision-Language Models on Optical Character Recognition in Dynamic Video Environments |
Sankalp Nagaonkar et.al. |
2502.06445 |
link |
2025-02-10 |
AppVLM: A Lightweight Vision Language Model for Online App Control |
Georgios Papoudakis et.al. |
2502.06395 |
null |
2025-02-11 |
When Data Manipulation Meets Attack Goals: An In-depth Survey of Attacks for VLMs |
Aobotao Dai et.al. |
2502.06390 |
link |
2025-02-10 |
Self-Correcting Decoding with Generative Feedback for Mitigating Hallucinations in Large Vision-Language Models |
Ce Zhang et.al. |
2502.06130 |
link |
2025-02-10 |
Fair-MoE: Fairness-Oriented Mixture of Experts in Vision-Language Models |
Peiran Wang et.al. |
2502.06094 |
null |
2025-02-09 |
Noise is an Efficient Learner for Zero-Shot Vision-Language Models |
Raza Imam et.al. |
2502.06019 |
null |
2025-02-09 |
Uni-Retrieval: A Multi-Style Retrieval Framework for STEM’s Education |
Yanhao Jia et.al. |
2502.05863 |
null |
2025-02-09 |
DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control |
Junjie Wen et.al. |
2502.05855 |
null |
2025-02-07 |
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy |
Yunhang Shen et.al. |
2502.05177 |
link |
2025-02-07 |
DCFormer: Efficient 3D Vision-Language Modeling with Decomposed Convolutions |
Gorkem Can Ates et.al. |
2502.05091 |
null |
2025-02-07 |
OccGS: Zero-shot 3D Occupancy Reconstruction with Semantic and Geometric-Aware Gaussian Splatting |
Xiaoyu Zhou et.al. |
2502.04981 |
null |
2025-02-07 |
Training-free Task-oriented Grasp Generation |
Jiaming Wang et.al. |
2502.04873 |
null |
2025-02-07 |
ELITE: Enhanced Language-Image Toxicity Evaluation for Safety |
Wonjun Lee et.al. |
2502.04757 |
null |
2025-02-06 |
AnyPlace: Learning Generalized Object Placement for Robot Manipulation |
Yuchi Zhao et.al. |
2502.04531 |
null |
2025-02-06 |
Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment |
Zuyan Liu et.al. |
2502.04328 |
link |
2025-02-06 |
Cross the Gap: Exposing the Intra-modal Misalignment in CLIP via Modality Inversion |
Marco Mistretta et.al. |
2502.04263 |
link |
2025-02-06 |
Éclair – Extracting Content and Layout with Integrated Reading Order for Documents |
Ilia Karmanov et.al. |
2502.04223 |
null |
2025-02-06 |
HD-EPIC: A Highly-Detailed Egocentric Video Dataset |
Toby Perrett et.al. |
2502.04144 |
null |
2025-02-06 |
Efficient Few-Shot Continual Learning in Vision-Language Models |
Aristeidis Panos et.al. |
2502.04098 |
null |
2025-02-06 |
CAD-Editor: A Locate-then-Infill Framework with Automated Training Data Synthesis for Text-Based CAD Editing |
Yu Yuan et.al. |
2502.03997 |
null |
2025-02-06 |
Adapting Human Mesh Recovery with Vision-Language Feedback |
Chongyang Xu et.al. |
2502.03836 |
null |
2025-02-05 |
The Hidden Life of Tokens: Reducing Hallucination of Large Vision-Language Models via Visual Information Steering |
Zhuowei Li et.al. |
2502.03628 |
link |
2025-02-05 |
DynVFX: Augmenting Real Videos with Dynamic Content |
Danah Yatim et.al. |
2502.03621 |
null |
2025-02-05 |
EnVisionVR: A Scene Interpretation Tool for Visual Accessibility in Virtual Reality |
Junlong Chen et.al. |
2502.03564 |
null |
2025-02-05 |
SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living |
Arkaprava Sinha et.al. |
2502.03459 |
null |
2025-02-05 |
RadVLM: A Multitask Conversational Vision-Language Model for Radiology |
Nicolas Deperrois et.al. |
2502.03333 |
null |
2025-02-05 |
Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search |
Jiayi He et.al. |
2502.03230 |
null |
2025-02-05 |
iVISPAR – An Interactive Visual-Spatial Reasoning Benchmark for VLMs |
Julius Mayer et.al. |
2502.03214 |
link |
2025-02-05 |
Disentangling CLIP Features for Enhanced Localized Understanding |
Samyak Rawelekar et.al. |
2502.02977 |
null |
2025-02-04 |
Vision-Language Model Dialog Games for Self-Improvement |
Ksenia Konyushkova et.al. |
2502.02740 |
null |
2025-02-04 |
Articulate AnyMesh: Open-Vocabulary 3D Articulated Objects Modeling |
Xiaowen Qiu et.al. |
2502.02590 |
null |
2025-02-04 |
COCONut-PanCap: Joint Panoptic Segmentation and Grounded Captions for Fine-Grained Understanding and Generation |
Xueqing Deng et.al. |
2502.02589 |
null |
2025-02-04 |
Mosaic3D: Foundation Dataset and Model for Open-Vocabulary 3D Segmentation |
Junha Lee et.al. |
2502.02548 |
null |
2025-02-04 |
Personalization Toolkit: Training Free Personalization of Large Vision Language Models |
Soroush Seifi et.al. |
2502.02452 |
null |
2025-02-04 |
LoRA-TTT: Low-Rank Test-Time Training for Vision-Language Models |
Yuto Kojima et.al. |
2502.02069 |
null |
2025-02-04 |
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs |
Hongxin Li et.al. |
2502.01977 |
null |
2025-02-04 |
Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration |
Younan Zhu et.al. |
2502.01969 |
null |
2025-02-03 |
From Foresight to Forethought: VLM-In-the-Loop Policy Steering via Latent Alignment |
Yilin Wu et.al. |
2502.01828 |
null |
2025-02-03 |
AquaticCLIP: A Vision-Language Foundation Model for Underwater Scene Analysis |
Basit Alawode et.al. |
2502.01785 |
null |
2025-02-03 |
Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning |
Udita Ghosh et.al. |
2502.01616 |
null |
2025-01-31 |
Homogeneity Bias as Differential Sampling Uncertainty in Language Models |
Messi H. J. Lee et.al. |
2501.19337 |
null |
2025-01-31 |
Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search |
Yuta Oshima et.al. |
2501.19252 |
null |
2025-01-31 |
A Survey on Class-Agnostic Counting: Advancements from Reference-Based to Open-World Text-Guided Approaches |
Luca Ciampi et.al. |
2501.19184 |
null |
2025-01-31 |
Poison as Cure: Visual Noise for Mitigating Object Hallucinations in LVMs |
Kejia Zhang et.al. |
2501.19164 |
null |
2025-01-31 |
Fairness Analysis of CLIP-Based Foundation Models for X-Ray Image Classification |
Xiangyu Sun et.al. |
2501.19086 |
null |
2025-02-03 |
Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment |
Song-Lin Lv et.al. |
2501.19060 |
null |
2025-01-31 |
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception |
Joshua R. Waite et.al. |
2501.18880 |
null |
2025-02-03 |
UP-VLA: A Unified Understanding and Prediction Model for Embodied Agent |
Jianke Zhang et.al. |
2501.18867 |
null |
2025-01-31 |
Test-time Loss Landscape Adaptation for Zero-Shot Generalization in Vision-Language Models |
Aodi Li et.al. |
2501.18864 |
null |
2025-01-30 |
INT: Instance-Specific Negative Mining for Task-Generic Promptable Segmentation |
Jian Hu et.al. |
2501.18753 |
null |
2025-01-30 |
Rethinking Bottlenecks in Safety Fine-Tuning of Vision Language Models |
Yi Ding et.al. |
2501.18533 |
null |
2025-01-30 |
A Benchmark and Evaluation for Real-World Out-of-Distribution Detection Using Vision-Language Models |
Shiho Noda et.al. |
2501.18463 |
link |
2025-01-30 |
Pre-Trained Vision-Language Model Selection and Reuse for Downstream Tasks |
Hao-Zhe Tan et.al. |
2501.18271 |
null |
2025-01-29 |
Planning with Vision-Language Models and a Use Case in Robot-Assisted Teaching |
Xuzhe Dang et.al. |
2501.17665 |
null |
2025-01-29 |
Exploring Vision Language Models for Multimodal and Multilingual Stance Detection |
Jake Vasilakes et.al. |
2501.17654 |
null |
2025-01-29 |
Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation |
Lin Chen et.al. |
2501.17642 |
null |
2025-01-30 |
Boosting Weak Positives for Text Based Person Search |
Akshay Modi et.al. |
2501.17586 |
null |
2025-01-29 |
Learning Free Token Reduction for Multi-Modal LLM |
Zihui Zhao et.al. |
2501.17391 |
null |
2025-01-30 |
Probing LLM World Models: Enhancing Guesstimation with Wisdom of Crowds Decoding |
Yun-Shiuan Chuang et.al. |
2501.17310 |
null |
2025-01-28 |
Modulating CNN Features with Pre-Trained ViT Representations for Open-Vocabulary Object Detection |
Xiangyu Gao et.al. |
2501.16981 |
null |
2025-01-28 |
Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet? |
Sania Waheed et.al. |
2501.16947 |
null |
2025-01-29 |
TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models |
Makoto Shing et.al. |
2501.16937 |
null |
2025-01-28 |
Document Screenshot Retrievers are Vulnerable to Pixel Poisoning Attacks |
Shengyao Zhuang et.al. |
2501.16902 |
link |
2025-01-28 |
Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models |
Muhammad Atta ur Rahman et.al. |
2501.16769 |
null |
2025-01-28 |
One Head Eight Arms: Block Matrix based Low Rank Adaptation for CLIP-based Few-Shot Learning |
Chunpeng Zhou et.al. |
2501.16720 |
null |
2025-01-28 |
Improving Vision-Language-Action Model with Online Reinforcement Learning |
Yanjiang Guo et.al. |
2501.16664 |
null |
2025-01-27 |
BiFold: Bimanual Cloth Folding with Language Guidance |
Oriol Barbany et.al. |
2501.16458 |
null |
2025-01-27 |
PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding |
Wei Chow et.al. |
2501.16411 |
null |
2025-01-27 |
CLISC: Bridging CLIP and SAM by enhanced CAM for unsupervised brain tumor segmentation |
Xiaochuan Ma et.al. |
2501.16246 |
null |
2025-01-27 |
AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants |
Pascal J. Sager et.al. |
2501.16150 |
null |
2025-01-27 |
CILP-FGDI: Exploiting Vision-Language Model for Generalizable Person Re-Identification |
Huazhong Zhao et.al. |
2501.16065 |
link |
2025-01-27 |
Addressing Out-of-Label Hazard Detection in Dashcam Videos: Insights from the COOOL Challenge |
Anh-Kiet Duong et.al. |
2501.16037 |
link |
2025-01-27 |
SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model |
Delin Qu et.al. |
2501.15830 |
null |
2025-01-27 |
MM-Retinal V2: Transfer an Elite Knowledge Spark into Fundus Vision-Language Pretraining |
Ruiqi Wu et.al. |
2501.15798 |
link |
2025-01-26 |
Transformer-Based Multimodal Knowledge Graph Completion with Link-Aware Contexts |
Haodi Ma et.al. |
2501.15688 |
null |
2025-01-26 |
Ocean-OCR: Towards General OCR Application via a Vision-Language Model |
Song Chen et.al. |
2501.15558 |
link |
2025-01-26 |
Domain Adaptation from Generated Multi-Weather Images for Unsupervised Maritime Object Classification |
Dan Song et.al. |
2501.15503 |
null |
2025-01-26 |
Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection |
Han Wang et.al. |
2501.15438 |
link |
2025-01-24 |
Large-scale and Fine-grained Vision-language Pre-training for Enhanced CT Image Understanding |
Zhongyi Shui et.al. |
2501.14548 |
link |
2025-01-24 |
PAID: A Framework of Product-Centric Advertising Image Design |
Hongyu Chen et.al. |
2501.14316 |
null |
2025-01-24 |
Global Semantic-Guided Sub-image Feature Weight Allocation in High-Resolution Large Vision-Language Models |
Yuxuan Liang et.al. |
2501.14276 |
null |
2025-01-24 |
PuzzleGPT: Emulating Human Puzzle-Solving Ability for Time and Location Prediction |
Hammad Ayyubi et.al. |
2501.14210 |
null |
2025-01-24 |
Dynamic Token Reduction during Generation for Vision Language Models |
Xiaoyu Liang et.al. |
2501.14204 |
null |
2025-01-24 |
Distributed Multi-Agent Coordination Using Multi-Modal Foundation Models |
Saaduddin Mahmud et.al. |
2501.14189 |
null |
2025-01-24 |
SelfPrompt: Confidence-Aware Semi-Supervised Tuning for Robust Vision-Language Model Adaptation |
Shuvendu Roy et.al. |
2501.14148 |
null |
2025-01-23 |
Privacy-Preserving Personalized Federated Prompt Learning for Multimodal Large Language Models |
Linh Tran et.al. |
2501.13904 |
null |
2025-01-23 |
Dual-Modal Prototype Joint Learning for Compositional Zero-Shot Learning |
Shiyu Zhang et.al. |
2501.13859 |
null |
2025-01-23 |
Large Vision-Language Models for Knowledge-Grounded Data Annotation of Memes |
Shiling Deng et.al. |
2501.13851 |
link |
2025-01-23 |
Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models |
Chaolei Han et.al. |
2501.13795 |
link |
2025-01-23 |
Tune In, Act Up: Exploring the Impact of Audio Modality-Specific Edits on Large Audio Language Models in Jailbreak |
Erjia Xiao et.al. |
2501.13772 |
null |
2025-01-23 |
EventVL: Understand Event Streams via Multimodal Large Language Model |
Pengteng Li et.al. |
2501.13707 |
null |
2025-01-23 |
Cognitive Paradigms for Evaluating VLMs on Visual Reasoning Task |
Mohit Vaishnav et.al. |
2501.13620 |
null |
2025-01-23 |
Black-Box Adversarial Attack on Vision Language Models for Autonomous Driving |
Lu Wang et.al. |
2501.13563 |
null |
2025-01-23 |
Text-driven Online Action Detection |
Manuel Benavent-Lledo et.al. |
2501.13518 |
link |
2025-01-23 |
Iterative Shaping of Multi-Particle Aggregates based on Action Trees and VLM |
Hoi-Yin Lee et.al. |
2501.13507 |
null |
2025-01-22 |
Patent Figure Classification using Large Vision-language Models |
Sushil Awale et.al. |
2501.12751 |
link |
2025-01-22 |
TeD-Loc: Text Distillation for Weakly Supervised Object Localization |
Shakeeb Murtaza et.al. |
2501.12632 |
link |
2025-01-22 |
ViDDAR: Vision Language Model-Based Task-Detrimental Content Detection for Augmented Reality |
Yanming Xiu et.al. |
2501.12553 |
link |
2025-01-21 |
Owls are wise and foxes are unfaithful: Uncovering animal stereotypes in vision-language models |
Tabinda Aman et.al. |
2501.12433 |
null |
2025-01-20 |
ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models |
Jingwei Yi et.al. |
2501.12418 |
link |
2025-01-21 |
InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model |
Yuhang Zang et.al. |
2501.12368 |
link |
2025-01-21 |
Vision-Language Models for Automated Chest X-ray Interpretation: Leveraging ViT and GPT-2 |
Md. Rakibul Islam et.al. |
2501.12356 |
null |
2025-01-21 |
CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification |
Cristiano Patrício et.al. |
2501.12266 |
null |
2025-01-21 |
Fixing Imbalanced Attention to Mitigate In-Context Hallucination of Large Vision-Language Model |
Kazi Hasan Ibn Arif et.al. |
2501.12206 |
link |
2025-01-20 |
Human-AI Collaborative Game Testing with Vision Language Models |
Boran Zhang et.al. |
2501.11782 |
null |
2025-01-20 |
SimLabel: Consistency-Guided OOD Detection with Pretrained Vision-Language Models |
Shu Zou et.al. |
2501.11485 |
link |
2025-01-20 |
Verifying Cross-modal Entity Consistency in News using Vision-language Models |
Sahar Tahmasebi et.al. |
2501.11403 |
link |
2025-01-20 |
KPL: Training-Free Medical Knowledge Mining of Vision-Language Models |
Jiaxiang Liu et.al. |
2501.11231 |
link |
2025-01-19 |
ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models |
Yassir Bendou et.al. |
2501.11175 |
null |
2025-01-19 |
Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding |
Zhanpeng Chen et.al. |
2501.10967 |
link |
2025-01-17 |
HiMix: Reducing Computational Complexity in Large Vision-Language Models |
Xuange Zhang et.al. |
2501.10318 |
null |
2025-01-17 |
SpatialCoT: Advancing Spatial Reasoning through Coordinate Alignment and Chain-of-Thought for Embodied Task Planning |
Yuecheng Liu et.al. |
2501.10074 |
null |
2025-01-17 |
CLIP-PCQA: Exploring Subjective-Aligned Vision-Language Modeling for Point Cloud Quality Assessment |
Yating Liu et.al. |
2501.10071 |
link |
2025-01-17 |
MSTS: A Multimodal Safety Test Suite for Vision-Language Models |
Paul Röttger et.al. |
2501.10057 |
link |
2025-01-17 |
Mitigating Hallucinations on Object Attributes using Multiview Images and Negative Instructions |
Zhijie Tan et.al. |
2501.10011 |
null |
2025-01-17 |
Explainable artificial intelligence (XAI): from inherent explainability to large language models |
Fuseini Mumuni et.al. |
2501.09967 |
null |
2025-01-16 |
Mitigating Hallucinations in Large Vision-Language Models via DPO: On-Policy Data Hold the Key |
Zhihe Yang et.al. |
2501.09695 |
link |
2025-01-16 |
Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark |
Alexis Roger et.al. |
2501.09672 |
null |
2025-01-16 |
AdaFV: Accelerating VLMs with Self-Adaptive Cross-Modality Attention Mixture |
Jiayi Han et.al. |
2501.09532 |
null |
2025-01-16 |
Double Visual Defense: Adversarial Pre-training and Instruction Tuning for Improving Vision-Language Model Robustness |
Zeyu Wang et.al. |
2501.09446 |
null |
2025-01-16 |
Vision-Language Models Do Not Understand Negation |
Kumail Alhamoud et.al. |
2501.09425 |
null |
2025-01-16 |
YETI (YET to Intervene) Proactive Interventions by Multimodal AI Agents in Augmented Reality Tasks |
Saptarashmi Bandyopadhyay et.al. |
2501.09355 |
null |
2025-01-16 |
RoboReflect: Robotic Reflective Reasoning for Grasping Ambiguous-Condition Objects |
Zhen Luo et.al. |
2501.09307 |
null |
2025-01-16 |
Efficient Few-Shot Medical Image Analysis via Hierarchical Contrastive Vision-Language Learning |
Harrison Fuller et.al. |
2501.09294 |
null |
2025-01-16 |
Are Open-Vocabulary Models Ready for Detection of MEP Elements on Construction Sites |
Abdalwhab Abdalwhab et.al. |
2501.09267 |
null |
2025-01-16 |
Exploring the Capabilities of Vision-Language Models to Detect Visual Bugs in HTML5 |
Finlay Macklon et.al. |
2501.09236 |
null |
2025-01-15 |
CityLoc: 6 DoF Localization of Text Descriptions in Large-Scale Scenes with Gaussian Representation |
Qi Ma et.al. |
2501.08982 |
null |
2025-01-15 |
MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents |
Kuicai Dong et.al. |
2501.08828 |
null |
2025-01-15 |
Dynamic Knowledge Integration for Enhanced Vision-Language Reasoning |
Julian Perry et.al. |
2501.08597 |
null |
2025-01-14 |
MiniMax-01: Scaling Foundation Models with Lightning Attention |
MiniMax et.al. |
2501.08313 |
null |
2025-01-14 |
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding |
Liping Yuan et.al. |
2501.07888 |
link |
2025-01-14 |
Visual Language Models as Operator Agents in the Space Domain |
Alejandro Carrasco et.al. |
2501.07802 |
null |
2025-01-14 |
BMIP: Bi-directional Modality Interaction Prompt Learning for VLM |
Song-Lin Lv et.al. |
2501.07769 |
null |
2025-01-13 |
SST-EM: Advanced Metrics for Evaluating Semantic, Spatial and Temporal Aspects in Video Editing |
Varun Biyyala et.al. |
2501.07554 |
link |
2025-01-13 |
RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment |
Difei Gu et.al. |
2501.07525 |
link |
2025-01-13 |
Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models |
Yasiru Ranasinghe et.al. |
2501.07396 |
null |
2025-01-14 |
GestLLM: Advanced Hand Gesture Interpretation via Large Language Models for Human-Robot Interaction |
Oleg Kobzarev et.al. |
2501.07295 |
null |
2025-01-13 |
Can Vision-Language Models Evaluate Handwritten Math? |
Oikantik Nath et.al. |
2501.07244 |
link |
2025-01-13 |
TimeLogic: A Temporal Logic Benchmark for Video QA |
Sirnam Swetha et.al. |
2501.07214 |
null |
2025-01-13 |
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature |
Alejandro Lozano et.al. |
2501.07171 |
link |
2025-01-13 |
Duplex: Dual Prototype Learning for Compositional Zero-Shot Learning |
Zhong Peng et.al. |
2501.07114 |
null |
2025-01-12 |
MedGrad E-CLIP: Enhancing Trust and Transparency in AI-Driven Skin Lesion Diagnosis |
Sadia Kamal et.al. |
2501.06887 |
null |
2025-01-12 |
Application of Vision-Language Model to Pedestrians Behavior and Scene Understanding in Autonomous Driving |
Haoxiang Gao et.al. |
2501.06680 |
null |
2025-01-10 |
VideoAuteur: Towards Long Narrative Video Generation |
Junfei Xiao et.al. |
2501.06173 |
null |
2025-01-10 |
CoDriveVLM: VLM-Enhanced Urban Cooperative Dispatching and Motion Planning for Future Autonomous Mobility on Demand Systems |
Haichao Liu et.al. |
2501.06132 |
link |
2025-01-10 |
Generate, Transduct, Adapt: Iterative Transduction with VLMs |
Oindrila Saha et.al. |
2501.06031 |
null |
2025-01-10 |
Scalable Vision Language Model Training via High Quality Data Curation |
Hongyuan Dong et.al. |
2501.05952 |
null |
2025-01-10 |
Valley2: Exploring Multimodal Models with Scalable Vision-Language Design |
Ziheng Wu et.al. |
2501.05901 |
link |
2025-01-10 |
Super-class guided Transformer for Zero-Shot Attribute Classification |
Sehyung Kim et.al. |
2501.05728 |
link |
2025-01-10 |
From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities |
Dominick Reilly et.al. |
2501.05711 |
link |
2025-01-09 |
Vision-Language Models for Autonomous Driving: CLIP-Based Dynamic Scene Understanding |
Mohammed Elhenawy et.al. |
2501.05566 |
null |
2025-01-09 |
Infecting Generative AI With Viruses |
David Noever et.al. |
2501.05542 |
null |
2025-01-09 |
Seeing Sound: Assembling Sounds from Visuals for Audio-to-Image Generation |
Darius Petermann et.al. |
2501.05413 |
null |
2025-01-09 |
Harnessing Large Language and Vision-Language Models for Robust Out-of-Distribution Detection |
Pei-Kang Lee et.al. |
2501.05228 |
null |
2025-01-09 |
Centurio: On Drivers of Multilingual Ability of Large Vision-Language Model |
Gregor Geigle et.al. |
2501.05122 |
null |
2025-01-09 |
DriVLM: Domain Adaptation of Vision-Language Models in Autonomous Driving |
Xuran Zheng et.al. |
2501.05081 |
null |
2025-01-09 |
Commonsense Video Question Answering through Video-Grounded Entailment Tree Reasoning |
Huabin Liu et.al. |
2501.05069 |
null |
2025-01-09 |
ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark |
Ronghao Dang et.al. |
2501.05031 |
link |
2025-01-09 |
UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation |
Oleg Sautenkov et.al. |
2501.05014 |
link |
2025-01-09 |
Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments |
Yifan Xu et.al. |
2501.04947 |
null |
2025-01-08 |
Re-ranking the Context for Multimodal Retrieval Augmented Generation |
Matin Mortaheb et.al. |
2501.04695 |
null |
2025-01-08 |
Enhancing Financial VQA in Vision Language Models using Intermediate Structured Representations |
Archita Srivastava et.al. |
2501.04675 |
null |
2025-01-08 |
DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests |
Charles Corbière et.al. |
2501.04671 |
null |
2025-01-08 |
A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI |
Kazusato Oko et.al. |
2501.04641 |
link |
2025-01-08 |
Supervision-free Vision-Language Alignment |
Giorgio Giannone et.al. |
2501.04568 |
null |
2025-01-08 |
Online Gaussian Test-Time Adaptation of Vision-Language Models |
Clément Fuchs et.al. |
2501.04352 |
link |
2025-01-08 |
Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs |
Zeyi Huang et.al. |
2501.04336 |
null |
2025-01-08 |
Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts |
Miao Rang et.al. |
2501.04322 |
link |
2025-01-08 |
Robotic Programmer: Video Instructed Policy Code Generation for Robotic Manipulation |
Senwei Xie et.al. |
2501.04268 |
null |
2025-01-07 |
MM-GEN: Enhancing Task Performance Through Targeted Multimodal Data Curation |
Siddharth Joshi et.al. |
2501.04155 |
link |
2025-01-07 |
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives |
Shaoyuan Xie et.al. |
2501.04003 |
link |
2025-01-07 |
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos |
Haobo Yuan et.al. |
2501.04001 |
link |
2025-01-07 |
RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance |
Matin Mortaheb et.al. |
2501.03995 |
null |
2025-01-07 |
VLM-driven Behavior Tree for Context-aware Task Planning |
Naoki Wake et.al. |
2501.03968 |
link |
2025-01-07 |
Vision Language Models as Values Detectors |
Giulio Antonio Abbo et.al. |
2501.03957 |
null |
2025-01-07 |
OmniManip: Towards General Robotic Manipulation via Object-Centric Interaction Primitives as Spatial Constraints |
Mingjie Pan et.al. |
2501.03841 |
null |
2025-01-07 |
KAnoCLIP: Zero-Shot Anomaly Detection through Knowledge-Driven Prompt Learning and Enhanced Cross-Modal Integration |
Chengyuan Li et.al. |
2501.03786 |
null |
2025-01-07 |
Realistic Test-Time Adaptation of Vision-Language Models |
Maxime Zanella et.al. |
2501.03729 |
link |
2025-01-07 |
Self-adaptive vision-language model for 3D segmentation of pulmonary artery and vein |
Xiaotong Guo et.al. |
2501.03722 |
null |
2025-01-07 |
SMIR: Efficient Synthetic Data Pipeline To Improve Multi-Image Reasoning |
Andrew Li et.al. |
2501.03675 |
link |
2025-01-06 |
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation |
Yuhui Zhang et.al. |
2501.03225 |
link |
2025-01-06 |
Large language models for artificial general intelligence (AGI): A survey of foundational principles and approaches |
Alhassan Mumuni et.al. |
2501.03151 |
null |
2025-01-06 |
Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment |
Pegah Khayatan et.al. |
2501.03012 |
link |
2025-01-06 |
MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models |
Wenyi Hong et.al. |
2501.02955 |
null |
2025-01-06 |
Label-free Concept Based Multiple Instance Learning for Gigapixel Histopathology |
Susu Sun et.al. |
2501.02922 |
null |
2025-01-06 |
Large Language Models for Video Surveillance Applications |
Ulindu De Silva et.al. |
2501.02850 |
null |
2025-01-06 |
Visual Large Language Models for Generalized and Specialized Applications |
Yifan Li et.al. |
2501.02765 |
link |
2025-01-05 |
Generalizing from SIMPLE to HARD Visual Reasoning: Can We Mitigate Modality Imbalance in VLMs? |
Simon Park et.al. |
2501.02669 |
link |
2025-01-05 |
Efficient Architectures for High Resolution Vision-Language Models |
Miguel Carvalho et.al. |
2501.02584 |
link |
2025-01-05 |
FedRSClip: Federated Learning for Remote Sensing Scene Classification Using Vision-Language Models |
Hui Lin et.al. |
2501.02461 |
null |
2025-01-03 |
Mitigating Hallucination for Large Vision Language Model by Inter-Modality Correlation Calibration Decoding |
Jiaming Li et.al. |
2501.01926 |
link |
2025-01-03 |
MoColl: Agent-Based Specific and General Model Collaboration for Image Captioning |
Pu Yang et.al. |
2501.01834 |
null |
2025-01-03 |
LogicAD: Explainable Anomaly Detection via VLM-based Text Feature Extraction |
Er Jin et.al. |
2501.01767 |
null |
2025-01-03 |
MoVE-KD: Knowledge Distillation for VLMs with Mixture of Visual Encoders |
Jiajun Cao et.al. |
2501.01709 |
null |
2025-01-03 |
GPT4Scene: Understand 3D Scenes from Videos with Vision-Language Models |
Zhangyang Qi et.al. |
2501.01428 |
link |
2025-01-02 |
Training Medical Large Vision-Language Models with Abnormal-Aware Feedback |
Yucheng Zhou et.al. |
2501.01377 |
null |
2025-01-02 |
CLIP-UP: CLIP-Based Unanswerable Problem Detection for Visual Question Answering |
Ben Vardi et.al. |
2501.01371 |
null |
2025-01-02 |
Large Vision-Language Model Alignment and Misalignment: A Survey Through the Lens of Explainability |
Dong Shu et.al. |
2501.01346 |
null |
2025-01-02 |
CultureVLM: Characterizing and Improving Cultural Understanding of Vision-Language Models for over 100 Countries |
Shudong Liu et.al. |
2501.01282 |
null |
2025-01-03 |
2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining |
Wenqi Zhang et.al. |
2501.00958 |
link |
2025-01-01 |
Hierarchical Vision-Language Alignment for Text-to-Image Generation via Diffusion Models |
Emily Johnson et.al. |
2501.00917 |
null |
2025-01-01 |
FGAseg: Fine-Grained Pixel-Text Alignment for Open-Vocabulary Semantic Segmentation |
Bingyu Li et.al. |
2501.00877 |
link |
2025-01-01 |
IllusionBench: A Large-scale and Comprehensive Benchmark for Visual Illusion Understanding in Vision-Language Models |
Yiming Zhang et.al. |
2501.00848 |
null |
2024-12-31 |
ICONS: Influence Consensus for Vision-Language Data Selection |
Xindi Wu et.al. |
2501.00654 |
null |
2024-12-30 |
Vinci: A Real-time Embodied Smart Assistant based on Egocentric Vision-Language Model |
Yifei Huang et.al. |
2412.21080 |
link |
2024-12-30 |
Plancraft: an evaluation dataset for planning with LLM agents |
Gautier Dagan et.al. |
2412.21033 |
link |
2024-12-30 |
UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI |
Fangwei Zhong et.al. |
2412.20977 |
null |
2024-12-30 |
Low-Light Image Enhancement via Generative Perceptual Priors |
Han Zhou et.al. |
2412.20916 |
link |
2024-12-30 |
WalkVLM: Aid Visually Impaired People Walking by Vision Language Model |
Zhiqiang Yuan et.al. |
2412.20903 |
null |
2024-12-30 |
Towards Compatible Fine-tuning for Vision-Language Model Updates |
Zhengbo Wang et.al. |
2412.20895 |
null |
2024-12-30 |
ReStory: VLM-augmentation of Social Human-Robot Interaction Datasets |
Fanjun Bu et.al. |
2412.20826 |
null |
2024-12-30 |
Are Vision-Language Models Truly Understanding Multi-vision Sensor? |
Sangyun Chung et.al. |
2412.20750 |
link |
2024-12-30 |
UniRS: Unifying Multi-temporal Remote Sensing Tasks through Vision Language Models |
Yujie Li et.al. |
2412.20742 |
link |
2024-12-30 |
M$^3$oralBench: A MultiModal Moral Benchmark for LVLMs |
Bei Yan et.al. |
2412.20718 |
link |
2024-12-27 |
MVTamperBench: Evaluating Robustness of Vision-Language Models |
Amit Agarwal et.al. |
2412.19794 |
null |
2024-12-27 |
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis |
Qiushi Sun et.al. |
2412.19723 |
null |
2024-12-27 |
Is Your Text-to-Image Model Robust to Caption Noise? |
Weichen Yu et.al. |
2412.19531 |
null |
2024-12-27 |
MBQ: Modality-Balanced Quantization for Large Vision-Language Models |
Shiyao Li et.al. |
2412.19509 |
link |
2024-12-27 |
Multi-P$^2$A: A Multi-perspective Benchmark on Privacy Assessment for Large Vision-Language Models |
Jie Zhang et.al. |
2412.19496 |
link |
2024-12-27 |
Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation |
Chengyang Ye et.al. |
2412.19492 |
link |
2024-12-26 |
CALICO: Part-Focused Semantic Co-Segmentation with Large Vision-Language Models |
Kiet A. Nguyen et.al. |
2412.19331 |
null |
2024-12-26 |
Sketch-MoMa: Teleoperation for Mobile Manipulator via Interpretation of Hand-Drawn Sketches |
Kosei Tanada et.al. |
2412.19153 |
null |
2024-12-26 |
MoPD: Mixture-of-Prompts Distillation for Vision-Language Models |
Yang Chen et.al. |
2412.19087 |
null |
2024-12-26 |
Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation |
Tao Liu et.al. |
2412.19021 |
null |
2024-12-24 |
Explaining in Diffusion: Explaining a Classifier Through Hierarchical Semantics with Text-to-Image Diffusion Models |
Tahira Kazimi et.al. |
2412.18604 |
null |
2024-12-24 |
The Key of Understanding Vision Tasks: Explanatory Instructions |
Yang Shen et.al. |
2412.18525 |
link |
2024-12-24 |
LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating |
Chao Deng et.al. |
2412.18424 |
link |
2024-12-24 |
Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model |
Tenghui Li et.al. |
2412.18387 |
link |
2024-12-24 |
Efficient and Context-Aware Label Propagation for Zero-/Few-Shot Training-Free Adaptation of Vision-Language Model |
Yushu Li et.al. |
2412.18303 |
null |
2024-12-24 |
Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight |
Xi Ding et.al. |
2412.18298 |
link |
2024-12-24 |
Enhancing Multi-Robot Semantic Navigation Through Multimodal Chain-of-Thought Score Collaboration |
Zhixuan Shen et.al. |
2412.18292 |
link |
2024-12-24 |
Sampling Bag of Views for Open-Vocabulary Object Detection |
Hojun Choi et.al. |
2412.18273 |
null |
2024-12-24 |
VLABench: A Large-Scale Benchmark for Language-Conditioned Robotics Manipulation with Long-Horizon Reasoning Tasks |
Shiduo Zhang et.al. |
2412.18194 |
null |
2024-12-24 |
EvalMuse-40K: A Reliable and Fine-Grained Benchmark with Comprehensive Human Annotations for Text-to-Image Generation Model Evaluation |
Shuhao Han et.al. |
2412.18150 |
link |
2024-12-23 |
ChatGarment: Garment Estimation, Generation and Editing via Large Language Models |
Siyuan Bian et.al. |
2412.17811 |
null |
2024-12-23 |
Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection |
Yitong Chen et.al. |
2412.17800 |
link |
2024-12-23 |
Cross-Lingual Text-Rich Visual Comprehension: An Information Theory Perspective |
Xinmiao Yu et.al. |
2412.17787 |
null |
2024-12-23 |
Reasoning to Attend: Try to Understand How Token Works |
Rui Qian et.al. |
2412.17741 |
link |
2024-12-23 |
Kernel-Aware Graph Prompt Learning for Few-Shot Anomaly Detection |
Fenfang Tao et.al. |
2412.17619 |
link |
2024-12-23 |
Personalized Large Vision-Language Models |
Chau Pham et.al. |
2412.17610 |
null |
2024-12-23 |
Retention Score: Quantifying Jailbreak Risks for Vision Language Models |
Zaitang Li et.al. |
2412.17544 |
null |
2024-12-23 |
Feature Based Methods Domain Adaptation for Object Detection: A Review Paper |
Helia Mohamadi et.al. |
2412.17325 |
null |
2024-12-23 |
On the Feasibility of Vision-Language Models for Time-Series Classification |
Vinay Prithyani et.al. |
2412.17304 |
link |
2024-12-23 |
GCS-M3VLT: Guided Context Self-Attention based Multi-modal Medical Vision Language Transformer for Retinal Image Captioning |
Teja Krishna Cherukuri et.al. |
2412.17251 |
null |
2024-12-20 |
HoVLE: Unleashing the Power of Monolithic Vision-Language Models with Holistic Vision-Language Embedding |
Chenxin Tao et.al. |
2412.16158 |
null |
2024-12-20 |
Frequency Is What You Need: Word-frequency Masking Benefits Vision-Language Model Pre-training |
Mingliang Liang et.al. |
2412.16148 |
link |
2024-12-20 |
Demystifying the Potential of ChatGPT-4 Vision for Construction Progress Monitoring |
Ahmet Bahaddin Ersoz et.al. |
2412.16108 |
null |
2024-12-20 |
VORD: Visual Ordinal Calibration for Mitigating Object Hallucinations in Large Vision-Language Models |
Dexter Neo et.al. |
2412.15739 |
null |
2024-12-20 |
Multi-modal Agent Tuning: Building a VLM-Driven Agent for Efficient Tool Usage |
Zhi Gao et.al. |
2412.15606 |
null |
2024-12-20 |
VLM-RL: A Unified Vision Language Models and Reinforcement Learning Framework for Safe Autonomous Driving |
Zilin Huang et.al. |
2412.15544 |
null |
2024-12-20 |
PolySmart @ TRECVid 2024 Video-To-Text |
Jiaxin Wu et.al. |
2412.15509 |
null |
2024-12-19 |
TalkWithMachines: Enhancing Human-Robot Interaction for Interpretable Industrial Robotics Through Large/Vision Language Models |
Ammar N. Abbas et.al. |
2412.15462 |
null |
2024-12-19 |
PRIMA: Multi-Image Vision-Language Models for Reasoning Segmentation |
Muntasir Wahed et.al. |
2412.15209 |
null |
2024-12-19 |
AutoTrust: Benchmarking Trustworthiness in Large Vision Language Models for Autonomous Driving |
Shuo Xing et.al. |
2412.15206 |
link |
2024-12-19 |
EarthDial: Turning Multi-sensory Earth Observations to Interactive Dialogues |
Sagar Soni et.al. |
2412.15190 |
null |
2024-12-19 |
LlamaFusion: Adapting Pretrained Language Models for Multimodal Generation |
Weijia Shi et.al. |
2412.15188 |
null |
2024-12-19 |
ObjVariantEnsemble: Advancing Point Cloud LLM Evaluation in Challenging Scenes with Subtly Distinguished Objects |
Qihang Cao et.al. |
2412.14837 |
null |
2024-12-19 |
A Light-Weight Framework for Open-Set Object Detection with Decoupled Feature Alignment in Joint Space |
Yonghao He et.al. |
2412.14680 |
link |
2024-12-19 |
FiVL: A Framework for Improved Vision-Language Alignment |
Estelle Aflalo et.al. |
2412.14672 |
null |
2024-12-19 |
HarmonicEval: Multi-modal, Multi-task, Multi-criteria Automatic Evaluation Using a Vision Language Model |
Masanari Ohi et.al. |
2412.14613 |
null |
2024-12-19 |
Token Preference Optimization with Self-Calibrated Visual-Anchored Rewards for Hallucination Mitigation |
Jihao Gu et.al. |
2412.14487 |
null |
2024-12-19 |
GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering |
Saumya Saxena et.al. |
2412.14480 |
null |
2024-12-18 |
Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation |
Jianyu Zhang et.al. |
2412.14145 |
null |
2024-12-18 |
Performance Gap in Entity Knowledge Extraction Across Modalities in Vision Language Models |
Ido Cohen et.al. |
2412.14133 |
link |
2024-12-18 |
Towards Generalist Robot Policies: What Matters in Building Vision-Language-Action Models |
Xinghang Li et.al. |
2412.14058 |
null |
2024-12-18 |
Cracking the Code of Hallucination in LVLMs with Vision-aware Head Divergence |
Jinghan He et.al. |
2412.13949 |
null |
2024-12-18 |
Real Classification by Description: Extending CLIP’s Limits of Part Attributes Recognition |
Ethan Baron et.al. |
2412.13947 |
null |
2024-12-18 |
Nullu: Mitigating Object Hallucinations in Large Vision-Language Models via HalluSpace Projection |
Le Yang et.al. |
2412.13817 |
link |
2024-12-18 |
Towards Automatic Evaluation for Image Transcreation |
Simran Khanuja et.al. |
2412.13717 |
null |
2024-12-18 |
RelationField: Relate Anything in Radiance Fields |
Sebastian Koch et.al. |
2412.13652 |
link |
2024-12-18 |
Read Like a Radiologist: Efficient Vision-Language Model for 3D Medical Imaging Interpretation |
Changsun Lee et.al. |
2412.13558 |
null |
2024-12-18 |
Benchmarking and Improving Large Vision-Language Models for Fundamental Visual Graph Understanding and Reasoning |
Yingjie Zhu et.al. |
2412.13540 |
link |
2024-12-17 |
Proposer-Agent-Evaluator(PAE): Autonomous Skill Discovery For Foundation Model Internet Agents |
Yifei Zhou et.al. |
2412.13194 |
null |
2024-12-17 |
HandsOnVLM: Vision-Language Models for Hand-Object Interaction Prediction |
Chen Bao et.al. |
2412.13187 |
null |
2024-12-17 |
Feather the Throttle: Revisiting Visual Token Pruning for Vision-Language Model Acceleration |
Mark Endo et.al. |
2412.13180 |
null |
2024-12-17 |
Improving Fine-grained Visual Understanding in VLMs through Text-Only Training |
Dasol Choi et.al. |
2412.12940 |
null |
2024-12-17 |
CoMT: A Novel Benchmark for Chain of Multi-modal Thought on Large Vision-Language Models |
Zihui Cheng et.al. |
2412.12932 |
link |
2024-12-17 |
An Agentic Approach to Automatic Creation of P&ID Diagrams from Natural Language Descriptions |
Shreeyash Gowaikar et.al. |
2412.12898 |
null |
2024-12-17 |
ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation |
Shiqi Huang et.al. |
2412.12798 |
link |
2024-12-17 |
CRoF: CLIP-based Robust Few-shot Learning on Noisy Labels |
Shizhuo Deng et.al. |
2412.12793 |
null |
2024-12-17 |
Activating Distributed Visual Region within LLMs for Efficient and Effective Vision-Language Training and Inference |
Siyuan Wang et.al. |
2412.12785 |
null |
2024-12-17 |
GIRAFFE: Design Choices for Extending the Context Length of Visual Language Models |
Mukai Li et.al. |
2412.12735 |
link |
2024-12-16 |
Emma-X: An Embodied Multimodal Action Model with Grounded Chain of Thought and Look-ahead Spatial Reasoning |
Qi Sun et.al. |
2412.11974 |
link |
2024-12-16 |
Explainable Procedural Mistake Detection |
Shane Storks et.al. |
2412.11927 |
null |
2024-12-16 |
Does VLM Classification Benefit from LLM Description Semantics? |
Pingchuan Ma et.al. |
2412.11917 |
link |
2024-12-16 |
From 2D CAD Drawings to 3D Parametric Models: A Vision-Language Approach |
Xilin Wang et.al. |
2412.11892 |
null |
2024-12-16 |
LMM-Regularized CLIP Embeddings for Image Classification |
Maria Tzelepi et.al. |
2412.11663 |
null |
2024-12-16 |
Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves |
Shihan Wu et.al. |
2412.11509 |
link |
2024-12-16 |
Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents |
Wonje Choi et.al. |
2412.11484 |
null |
2024-12-16 |
OmniVLM: A Token-Compressed, Sub-Billion-Parameter Vision-Language Model for Efficient On-Device Inference |
Wei Chen et.al. |
2412.11475 |
null |
2024-12-16 |
MaskCLIP++: A Mask-Based CLIP Fine-tuning Framework for Open-Vocabulary Image Segmentation |
Quan-Sheng Zeng et.al. |
2412.11464 |
link |
2024-12-16 |
Leveraging Retrieval-Augmented Tags for Large Vision-Language Understanding in Complex Scenes |
Antonio Carlos Rivera et.al. |
2412.11396 |
null |
2024-12-13 |
UniMed-CLIP: Towards a Unified Image-Text Pretraining Paradigm for Diverse Medical Imaging Modalities |
Muhammad Uzair Khattak et.al. |
2412.10372 |
link |
2024-12-13 |
A dual contrastive framework |
Yuan Sun et.al. |
2412.10348 |
null |
2024-12-13 |
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding |
Zhiyu Wu et.al. |
2412.10302 |
link |
2024-12-13 |
A model-based approach for transforming InSAR-derived vertical land motion from a local to a global reference frame |
Mahmoud Reshadati et.al. |
2412.10282 |
null |
2024-12-13 |
VLR-Bench: Multilingual Benchmark Dataset for Vision-Language Retrieval Augmented Generation |
Hyeonseok Lim et.al. |
2412.10151 |
null |
2024-12-13 |
Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples |
Yeyuan Wang et.al. |
2412.10029 |
null |
2024-12-13 |
WiseAD: Knowledge Augmented End-to-End Autonomous Driving with Vision-Language Model |
Songyan Zhang et.al. |
2412.09951 |
link |
2024-12-13 |
CaLoRAify: Calorie Estimation with Visual-Text Pairing and LoRA-Driven Visual Language Models |
Dongyu Yao et.al. |
2412.09936 |
link |
2024-12-13 |
Selective State Space Memory for Large Vision-Language Models |
Chee Ng et.al. |
2412.09875 |
null |
2024-12-12 |
BayesAdapter: enhanced uncertainty estimation in CLIP few-shot adaptation |
Pablo Morales-Álvarez et.al. |
2412.09718 |
null |
2024-12-13 |
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding |
Junqi Ge et.al. |
2412.09616 |
link |
2024-12-12 |
PVC: Progressive Visual Token Compression for Unified Image and Video Processing in Large Vision-Language Models |
Chenyu Yang et.al. |
2412.09613 |
null |
2024-12-12 |
AgentTrek: Agent Trajectory Synthesis via Guiding Replay with Web Tutorials |
Yiheng Xu et.al. |
2412.09605 |
null |
2024-12-12 |
OLA-VLM: Elevating Visual Perception in Multimodal LLMs with Auxiliary Embedding Distillation |
Jitesh Jain et.al. |
2412.09585 |
link |
2024-12-12 |
Neptune: The Long Orbit to Benchmarking Long Video Understanding |
Arsha Nagrani et.al. |
2412.09582 |
link |
2024-12-12 |
Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM |
Han Wang et.al. |
2412.09530 |
link |
2024-12-12 |
Efficient and Comprehensive Feature Extraction in Large Vision-Language Model for Clinical Pathology Analysis |
Shengxuming Zhang et.al. |
2412.09521 |
null |
2024-12-12 |
ATPrompt: Textual Prompt Learning with Embedded Attributes |
Zheng Li et.al. |
2412.09442 |
null |
2024-12-12 |
Causal Graphical Models for Vision-Language Compositional Understanding |
Fiorenzo Parascandolo et.al. |
2412.09353 |
link |
2024-12-12 |
Learning Novel Skills from Language-Generated Demonstrations |
Ao-Qun Jin et.al. |
2412.09286 |
null |
2024-12-11 |
Synthetic Vision: Training Vision-Language Models to Understand Physics |
Vahid Balazadeh et.al. |
2412.08619 |
null |
2024-12-12 |
Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning |
Fan Lu et.al. |
2412.08614 |
link |
2024-12-11 |
SenCLIP: Enhancing zero-shot land-use mapping for Sentinel-2 with ground-level prompting |
Pallavi Jain et.al. |
2412.08536 |
link |
2024-12-11 |
POINTS1.5: Building a Vision-Language Model towards Real World Applications |
Yuan Liu et.al. |
2412.08443 |
null |
2024-12-11 |
LOMA: Language-assisted Semantic Occupancy Network via Triplane Mamba |
Yubo Cui et.al. |
2412.08388 |
null |
2024-12-11 |
HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models |
Shiding Zhu et.al. |
2412.08378 |
null |
2024-12-11 |
Position-aware Guided Point Cloud Completion with CLIP Model |
Feng Zhou et.al. |
2412.08271 |
null |
2024-12-11 |
TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning |
Jingjing Xie et.al. |
2412.08176 |
link |
2024-12-11 |
Progressive Multi-granular Alignments for Grounded Reasoning in Large Vision-Language Models |
Quang-Hung Le et.al. |
2412.08125 |
link |
2024-12-11 |
Seeing Syntax: Uncovering Syntactic Learning Limitations in Vision-Language Models |
Sri Harsha Dumpala et.al. |
2412.08111 |
null |
2024-12-10 |
RADIO Amplified: Improved Baselines for Agglomerative Vision Foundation Models |
Greg Heinrich et.al. |
2412.07679 |
link |
2024-12-10 |
DRUM: Learning Demonstration Retriever for Large MUlti-modal Models |
Ellen Yi-Ge et.al. |
2412.07619 |
null |
2024-12-10 |
Hallucination Elimination and Semantic Enhancement Framework for Vision-Language Models in Traffic Scenarios |
Jiaqi Fan et.al. |
2412.07518 |
link |
2024-12-10 |
SmartAgent: Chain-of-User-Thought for Embodied Personalized Agent in Cyber World |
Jiaqi Zhang et.al. |
2412.07472 |
link |
2024-12-10 |
MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models |
Sayak Chakrabarty et.al. |
2412.07148 |
link |
2024-12-10 |
Maya: An Instruction Finetuned Multilingual Multimodal Model |
Nahid Alam et.al. |
2412.07112 |
link |
2024-12-10 |
Retaining and Enhancing Pre-trained Knowledge in Vision-Language Models with Prompt Ensembling |
Donggeun Kim et.al. |
2412.07077 |
null |
2024-12-09 |
Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models |
Yi-Lun Lee et.al. |
2412.06775 |
link |
2024-12-09 |
Visual Lexicon: Rich Image Features in Language Space |
XuDong Wang et.al. |
2412.06774 |
null |
2024-12-09 |
Ranking-aware adapter for text-driven image ordering with CLIP |
Wei-Hsiang Yu et.al. |
2412.06760 |
link |
2024-12-09 |
ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities |
Adhiraj Ghosh et.al. |
2412.06745 |
null |
2024-12-09 |
The Narrow Gate: Localized Image-Text Communication in Vision-Language Models |
Alessandro Serra et.al. |
2412.06646 |
null |
2024-12-09 |
From Uncertainty to Trust: Enhancing Reliability in Vision-Language Models with Uncertainty-Guided Dropout Decoding |
Yixiong Fang et.al. |
2412.06474 |
link |
2024-12-09 |
Pruning All-Rounder: Rethinking and Improving Inference Efficiency for Large Vision Language Models |
Wei Suo et.al. |
2412.06458 |
null |
2024-12-09 |
No Annotations for Object Detection in Art through Stable Diffusion |
Patrick Ramos et.al. |
2412.06286 |
link |
2024-12-09 |
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models |
Lianyu Hu et.al. |
2412.06263 |
link |
2024-12-09 |
DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction |
Yunheng Li et.al. |
2412.06244 |
null |
2024-12-06 |
Multimodal Fact-Checking with Vision Language Models: A Probing Classifier based Solution with Embedding Strategies |
Recep Firat Cekinel et.al. |
2412.05155 |
link |
2024-12-06 |
Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora |
Michael Y. Hu et.al. |
2412.05149 |
null |
2024-12-06 |
$S^3$: Synonymous Semantic Space for Improving Zero-Shot Generalization of Vision-Language Models |
Xiaojie Yin et.al. |
2412.04925 |
null |
2024-12-06 |
Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model |
Keunwoo Peter Yu et.al. |
2412.04729 |
null |
2024-12-05 |
Cross-Self KV Cache Pruning for Efficient Vision-Language Inference |
Xiaohuan Pei et.al. |
2412.04652 |
link |
2024-12-05 |
NVILA: Efficient Frontier Visual Language Models |
Zhijian Liu et.al. |
2412.04468 |
null |
2024-12-05 |
VisionZip: Longer is Better but Not Necessary in Vision Language Models |
Senqiao Yang et.al. |
2412.04467 |
link |
2024-12-05 |
Code-as-Monitor: Constraint-aware Visual Programming for Reactive and Proactive Robotic Failure Detection |
Enshen Zhou et.al. |
2412.04455 |
null |
2024-12-05 |
Grounding Descriptions in Images informs Zero-Shot Visual Recognition |
Shaunak Halbe et.al. |
2412.04429 |
link |
2024-12-05 |
Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion |
Jiuhai Chen et.al. |
2412.04424 |
link |
2024-12-05 |
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding |
Rong Li et.al. |
2412.04383 |
null |
2024-12-05 |
Discriminative Fine-tuning of LVLMs |
Yassine Ouali et.al. |
2412.04378 |
null |
2024-12-05 |
3D Part Segmentation via Geometric Aggregation of 2D Visual Features |
Marco Garosi et.al. |
2412.04247 |
link |
2024-12-06 |
VASCAR: Content-Aware Layout Generation via Visual-Aware Self-Correction |
Jiahao Zhang et.al. |
2412.04237 |
null |
2024-12-05 |
Unified Framework for Open-World Compositional Zero-shot Learning |
Hirunima Jayasekara et.al. |
2412.04083 |
link |
2024-12-04 |
FLAIR: VLM with Fine-grained Language-informed Image Representations |
Rui Xiao et.al. |
2412.03561 |
link |
2024-12-04 |
Best-of-N Jailbreaking |
John Hughes et.al. |
2412.03556 |
link |
2024-12-04 |
PaliGemma 2: A Family of Versatile VLMs for Transfer |
Andreas Steiner et.al. |
2412.03555 |
null |
2024-12-04 |
PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation |
Ao Wang et.al. |
2412.03409 |
link |
2024-12-04 |
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for accelerating Large VLMs |
Wangbo Zhao et.al. |
2412.03324 |
link |
2024-12-04 |
Composed Image Retrieval for Training-Free Domain Conversion |
Nikos Efthymiadis et.al. |
2412.03297 |
link |
2024-12-04 |
Towards Understanding and Quantifying Uncertainty for Text-to-Image Generation |
Gianni Franchi et.al. |
2412.03178 |
null |
2024-12-04 |
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations? |
Shouwei Ruan et.al. |
2412.03002 |
null |
2024-12-04 |
Progressive Vision-Language Prompt for Multi-Organ Multi-Class Cell Semantic Segmentation with Single Branch |
Qing Zhang et.al. |
2412.02978 |
null |
2024-12-04 |
Who Brings the Frisbee: Probing Hidden Hallucination Factors in Large Vision-Language Model via Causality Analysis |
Po-Hsuan Huang et.al. |
2412.02946 |
null |
2024-12-03 |
Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback |
Hiroki Furuta et.al. |
2412.02617 |
null |
2024-12-03 |
CEGI: Measuring the trade-off between efficiency and carbon emissions for SLMs and VLMs |
Abhas Kumar et.al. |
2412.02602 |
null |
2024-12-03 |
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation |
Junyuan Zhang et.al. |
2412.02592 |
link |
2024-12-03 |
Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey |
Chenyang Liu et.al. |
2412.02573 |
link |
2024-12-03 |
SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection |
Joongwon Chae et.al. |
2412.02565 |
link |
2024-12-03 |
Multimodal Remote Sensing Scene Classification Using VLMs and Dual-Cross Attention Networks |
Jinjin Cai et.al. |
2412.02531 |
null |
2024-12-03 |
OODFace: Benchmarking Robustness of Face Recognition under Common Corruptions and Appearance Variations |
Caixin Kang et.al. |
2412.02479 |
null |
2024-12-03 |
BYE: Build Your Encoder with One Sequence of Exploration Data for Long-Term Dynamic Scene Understanding |
Chenguang Huang et.al. |
2412.02449 |
null |
2024-12-03 |
Composing Open-domain Vision with RAG for Ocean Monitoring and Conservation |
Sepand Dyanatkar et.al. |
2412.02262 |
null |
2024-12-03 |
LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models |
Fan-Yun Sun et.al. |
2412.02193 |
null |
2024-11-29 |
SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks |
Kim-Celine Kahl et.al. |
2411.19688 |
link |
2024-11-29 |
CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation |
Qixiu Li et.al. |
2411.19650 |
null |
2024-11-29 |
Interleaved-Modal Chain-of-Thought |
Jun Gao et.al. |
2411.19488 |
null |
2024-11-29 |
Effective Fine-Tuning of Vision-Language Models for Accurate Galaxy Morphology Analysis |
Ruoqi Wang et.al. |
2411.19475 |
null |
2024-11-28 |
Talking to DINO: Bridging Self-Supervised Vision Backbones with Language for Open-Vocabulary Segmentation |
Luca Barsellotti et.al. |
2411.19331 |
link |
2024-11-28 |
GEOBench-VLM: Benchmarking Vision-Language Models for Geospatial Tasks |
Muhammad Sohail Danish et.al. |
2411.19325 |
link |
2024-11-28 |
GRAPE: Generalizing Robot Policy via Preference Alignment |
Zijian Zhang et.al. |
2411.19309 |
null |
2024-11-28 |
Beyond Logit Lens: Contextual Embeddings for Robust Hallucination Detection & Grounding in VLMs |
Anirudh Phukan et.al. |
2411.19187 |
null |
2024-11-28 |
VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models |
Jeongho Ju et.al. |
2411.19103 |
null |
2024-11-27 |
ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics |
Letian Chen et.al. |
2411.18825 |
null |
2024-11-27 |
Visual Adversarial Attack on Vision-Language Models for Autonomous Driving |
Tianyuan Zhang et.al. |
2411.18275 |
null |
2024-11-27 |
SCoTT: Wireless-Aware Path Planning with Vision Language Models and Strategic Chains-of-Thought |
Aladin Djuhera et.al. |
2411.18212 |
null |
2024-11-27 |
From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects |
Zizhao Li et.al. |
2411.18207 |
link |
2024-11-27 |
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning |
Di Zhang et.al. |
2411.18203 |
null |
2024-11-27 |
DistinctAD: Distinctive Audio Description Generation in Contexts |
Bo Fang et.al. |
2411.18180 |
null |
2024-11-27 |
COREval: A Comprehensive and Objective Benchmark for Evaluating the Remote Sensing Capabilities of Large Vision-Language Models |
Xiao An et.al. |
2411.18145 |
link |
2024-11-27 |
When Large Vision-Language Models Meet Person Re-Identification |
Qizao Wang et.al. |
2411.18111 |
null |
2024-11-27 |
Aligning Knowledge Concepts to Whole Slide Images for Precise Histopathology Image Analysis |
Weiqin Zhao et.al. |
2411.18101 |
link |
2024-11-27 |
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis |
Donggoo Kang et.al. |
2411.18038 |
null |
2024-11-28 |
Exploring Visual Vulnerabilities via Multi-Loss Adversarial Search for Jailbreaking Vision-Language Models |
Shuyang Hao et.al. |
2411.18000 |
null |
2024-11-26 |
What’s in the Image? A Deep-Dive into the Vision of Vision Language Models |
Omri Kaduri et.al. |
2411.17491 |
null |
2024-11-26 |
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models |
Lei Li et.al. |
2411.17451 |
null |
2024-11-26 |
CoA: Chain-of-Action for Generative Semantic Labels |
Meng Wei et.al. |
2411.17406 |
link |
2024-11-26 |
Interleaved Scene Graph for Interleaved Text-and-Image Generation Assessment |
Dongping Chen et.al. |
2411.17188 |
null |
2024-11-26 |
Distilling Spectral Graph for Object-Context Aware Open-Vocabulary Semantic Segmentation |
Chanyoung Kim et.al. |
2411.17150 |
null |
2024-11-26 |
Free$^2$Guide: Gradient-Free Path Integral Control for Enhancing Text-to-Video Generation with Large Vision-Language Models |
Jaemin Kim et.al. |
2411.17041 |
null |
2024-11-26 |
Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation |
Shambhavi Mishra et.al. |
2411.17002 |
link |
2024-11-25 |
Probing the limitations of multimodal language models for chemistry and materials research |
Nawaf Alampara et.al. |
2411.16955 |
link |
2024-11-25 |
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge |
Yaqi Zhao et.al. |
2411.16824 |
null |
2024-11-25 |
Generating Out-Of-Distribution Scenarios Using Language Models |
Erfan Aasi et.al. |
2411.16554 |
null |
2024-11-25 |
RoboSpatial: Teaching Spatial Understanding to 2D and 3D Vision-Language Models for Robotics |
Chan Hee Song et.al. |
2411.16537 |
null |
2024-11-25 |
Noise Diffusion for Enhancing Semantic Faithfulness in Text-to-Image Synthesis |
Boming Miao et.al. |
2411.16503 |
null |
2024-11-25 |
A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models |
Manuel Schwonberg et.al. |
2411.16407 |
null |
2024-11-25 |
CapHDR2IR: Caption-Driven Transfer from Visible Light to Infrared Domain |
Jingchao Peng et.al. |
2411.16327 |
null |
2024-11-25 |
Open-Vocabulary Octree-Graph for 3D Scene Understanding |
Zhigang Wang et.al. |
2411.16253 |
null |
2024-11-25 |
From Dashcam Videos to Driving Simulations: Stress Testing Automated Vehicles against Rare Events |
Yan Miao et.al. |
2411.16027 |
null |
2024-11-25 |
Style-Pro: Style-Guided Prompt Learning for Generalizable Vision-Language Models |
Niloufar Alipour Talemi et.al. |
2411.16018 |
null |
2024-11-24 |
Self-Calibrated CLIP for Training-Free Open-Vocabulary Segmentation |
Sule Bai et.al. |
2411.15869 |
link |
2024-11-24 |
ResCLIP: Residual Attention for Training-free Dense Vision-language Inference |
Yuhang Yang et.al. |
2411.15851 |
link |
2024-11-22 |
Context-Aware Multimodal Pretraining |
Karsten Roth et.al. |
2411.15099 |
null |
2024-11-22 |
Geminio: Language-Guided Gradient Inversion Attacks in Federated Learning |
Junjie Shan et.al. |
2411.14937 |
link |
2024-11-22 |
ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos |
Tanveer Hannan et.al. |
2411.14901 |
link |
2024-11-22 |
VisGraphVar: A Benchmark Generator for Assessing Variability in Graph Analysis Using Large Vision-Language Models |
Camilo Chacón Sartori et.al. |
2411.14832 |
null |
2024-11-22 |
Continual SFT Matches Multimodal RLHF with Negative Supervision |
Ke Zhu et.al. |
2411.14797 |
null |
2024-11-22 |
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection |
Songhao Han et.al. |
2411.14794 |
link |
2024-11-22 |
Effective SAM Combination for Open-Vocabulary Semantic Segmentation |
Minhyeok Lee et.al. |
2411.14723 |
null |
2024-11-21 |
GMAI-VL & GMAI-VL-5.5M: A Large Vision-Language Model and A Comprehensive Multimodal Dataset Towards General Medical AI |
Tianbin Li et.al. |
2411.14522 |
link |
2024-11-21 |
Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance |
Haozhe Zhao et.al. |
2411.14279 |
null |
2024-11-21 |
Is this Generated Person Existed in Real-world? Fine-grained Detecting and Calibrating Abnormal Human-body |
Zeqing Wang et.al. |
2411.14205 |
null |
2024-11-21 |
FoPru: Focal Pruning for Efficient Large Vision-Language Models |
Lei Jiang et.al. |
2411.14164 |
null |
2024-11-21 |
Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset |
Heejeong Nam et.al. |
2411.14137 |
link |
2024-11-20 |
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games |
Davide Paglieri et.al. |
2411.13543 |
null |
2024-11-20 |
Teaching VLMs to Localize Specific Objects from In-context Examples |
Sivan Doveh et.al. |
2411.13317 |
link |
2024-11-20 |
XMask3D: Cross-modal Mask Reasoning for Open Vocabulary 3D Semantic Segmentation |
Ziyi Wang et.al. |
2411.13243 |
link |
2024-11-21 |
ViSTa Dataset: Do vision-language models understand sequential tasks? |
Evžen Wybitul et.al. |
2411.13211 |
link |
2024-11-20 |
TAPT: Test-Time Adversarial Prompt Tuning for Robust Inference in Vision-Language Models |
Xin Wang et.al. |
2411.13136 |
null |
2024-11-20 |
LaVida Drive: Vision-Text Interaction VLM for Autonomous Driving with Token Selection, Recovery and Enhancement |
Siwen Jiao et.al. |
2411.12980 |
null |
2024-11-19 |
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge |
Vishwesh Nath et.al. |
2411.12915 |
null |
2024-11-19 |
CATCH: Complementary Adaptive Token-level Contrastive Decoding to Mitigate Hallucinations in LVLMs |
Zhehan Kan et.al. |
2411.12713 |
null |
2024-11-19 |
Just KIDDIN: Knowledge Infusion and Distillation for Detection of INdecent Memes |
Rahul Garg et.al. |
2411.12174 |
null |
2024-11-18 |
Vision Language Models Are Few-Shot Audio Spectrogram Classifiers |
Satvik Dixit et.al. |
2411.12058 |
null |
2024-11-18 |
ITACLIP: Boosting Training-Free Semantic Segmentation with Image, Text, and Architectural Enhancements |
M. Arda Aydın et.al. |
2411.12044 |
link |
2024-11-18 |
MC-LLaVA: Multi-Concept Personalized Vision-Language Model |
Ruichuan An et.al. |
2411.11706 |
link |
2024-11-18 |
TrojanRobot: Backdoor Attacks Against Robotic Manipulation in the Physical World |
Xianlong Wang et.al. |
2411.11683 |
null |
2024-11-18 |
VLN-Game: Vision-Language Equilibrium Search for Zero-Shot Semantic Navigation |
Bangguo Yu et.al. |
2411.11609 |
null |
2024-11-18 |
Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment |
Zhendong Liu et.al. |
2411.11543 |
null |
2024-11-19 |
Safe + Safe = Unsafe? Exploring How Safe Images Can Be Exploited to Jailbreak Large Vision-Language Models |
Chenhang Cui et.al. |
2411.11496 |
link |
2024-11-18 |
Exploring Emerging Trends and Research Opportunities in Visual Place Recognition |
Antonios Gasteratos et.al. |
2411.11481 |
null |
2024-11-18 |
Quantifying Preferences of Vision-Language Models via Value Decomposition in Social Media Contexts |
Jingxuan Li et.al. |
2411.11479 |
null |
2024-11-18 |
GLDesigner: Leveraging Multi-Modal LLMs as Designer for Enhanced Aesthetic Text Glyph Layouts |
Junwen He et.al. |
2411.11435 |
null |
2024-11-18 |
Efficient Transfer Learning for Video-language Foundation Models |
Haoxing Chen et.al. |
2411.11223 |
link |
2024-11-17 |
Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection |
Wentao Bao et.al. |
2411.10922 |
link |
2024-11-15 |
VeriGraph: Scene Graphs for Execution Verifiable Robot Planning |
Daniel Ekpo et.al. |
2411.10446 |
null |
2024-11-15 |
LLaVA-o1: Let Vision Language Models Reason Step-by-Step |
Guowei Xu et.al. |
2411.10440 |
link |
2024-11-15 |
SEAGULL: No-reference Image Quality Assessment for Regions of Interest via Vision-Language Instruction Tuning |
Zewen Chen et.al. |
2411.10161 |
link |
2024-11-15 |
Federated Domain Generalization via Prompt Learning and Aggregation |
Shuai Gong et.al. |
2411.10063 |
link |
2024-11-15 |
Free Lunch in Pathology Foundation Model: Task-specific Model Adaptation with Concept-Guided Feature Enhancement |
Yanyan Huang et.al. |
2411.09894 |
link |
2024-11-14 |
LLV-FSR: Exploiting Large Language-Vision Prior for Face Super-resolution |
Chenyang Wang et.al. |
2411.09293 |
null |
2024-11-13 |
ClevrSkills: Compositional Language and Visual Reasoning in Robotics |
Sanjay Haresh et.al. |
2411.09052 |
link |
2024-11-13 |
DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models |
Yongdong Wang et.al. |
2411.09022 |
link |
2024-11-13 |
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions |
Moran Yanuka et.al. |
2411.09018 |
link |
2024-11-13 |
The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models |
Daniel P. Jeong et.al. |
2411.08870 |
link |
2024-11-13 |
Sharingan: Extract User Action Sequence from Desktop Recordings |
Yanting Chen et.al. |
2411.08768 |
null |
2024-11-13 |
Voxeland: Probabilistic Instance-Aware Semantic Mapping with Evidence-based Uncertainty Quantification |
Jose-Luis Matez-Bandera et.al. |
2411.08727 |
link |
2024-11-13 |
LG-Gaze: Learning Geometry-aware Continuous Prompts for Language-Guided Gaze Estimation |
Pengwei Yin et.al. |
2411.08606 |
null |
2024-11-13 |
NavAgent: Multi-scale Urban Street View Fusion For UAV Embodied Vision-and-Language Navigation |
Youzhi Liu et.al. |
2411.08579 |
null |
2024-11-13 |
Open-World Task and Motion Planning via Vision-Language Model Inferred Constraints |
Nishanth Kumar et.al. |
2411.08253 |
null |
2024-11-12 |
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation |
Yiyang Ma et.al. |
2411.07975 |
link |
2024-11-12 |
Leveraging Multimodal Models for Enhanced Neuroimaging Diagnostics in Alzheimer’s Disease |
Francesco Chiumento et.al. |
2411.07871 |
null |
2024-11-12 |
BLIP3-KALE: Knowledge Augmented Large-Scale Dense Captions |
Anas Awadalla et.al. |
2411.07461 |
null |
2024-11-11 |
SAMPart3D: Segment Any Part in 3D Objects |
Yunhan Yang et.al. |
2411.07184 |
link |
2024-11-11 |
StoryTeller: Improving Long Video Description through Global Audio-Visual Character Identification |
Yichen He et.al. |
2411.07076 |
link |
2024-11-11 |
UMFC: Unsupervised Multi-Domain Feature Calibration for Vision-Language Models |
Jiachen Liang et.al. |
2411.06921 |
link |
2024-11-11 |
Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning |
Hongsheng Zhang et.al. |
2411.06764 |
null |
2024-11-11 |
Learning from Feedback: Semantic Enhancement for Object SLAM Using Foundation Models |
Jungseok Hong et.al. |
2411.06752 |
null |
2024-11-11 |
Track Any Peppers: Weakly Supervised Sweet Pepper Tracking Using VLMs |
Jia Syuen Lim et.al. |
2411.06702 |
null |
2024-11-11 |
Renaissance: Investigating the Pretraining of Vision-Language Encoders |
Clayton Fields et.al. |
2411.06657 |
link |
2024-11-09 |
Hidden in Plain Sight: Evaluating Abstract Shape Recognition in Vision-Language Models |
Arshia Hemmat et.al. |
2411.06287 |
link |
2024-11-09 |
Sampling-Based Model Predictive Control for Dexterous Manipulation on a Biomimetic Tendon-Driven Hand |
Adrian Hess et.al. |
2411.06183 |
null |
2024-11-09 |
Aquila-plus: Prompt-Driven Visual-Language Models for Pixel-Level Remote Sensing Image Understanding |
Kaixuan Lu et.al. |
2411.06142 |
null |
2024-11-08 |
End-to-End Navigation with Vision Language Models: Transforming Spatial Reasoning into Question-Answering |
Dylan Goetting et.al. |
2411.05755 |
link |
2024-11-08 |
Poze: Sports Technique Feedback under Data Constraints |
Agamdeep Singh et.al. |
2411.05734 |
null |
2024-11-08 |
A Two-Step Concept-Based Approach for Enhanced Interpretability and Trust in Skin Lesion Diagnosis |
Cristiano Patrício et.al. |
2411.05609 |
link |
2024-11-08 |
Enhancing Visual Classification using Comparative Descriptors |
Hankyeol Lee et.al. |
2411.05357 |
link |
2024-11-08 |
Real-World Offline Reinforcement Learning from Vision Language Model Feedback |
Sreyas Venkataraman et.al. |
2411.05273 |
null |
2024-11-07 |
RT-Grasp: Reasoning Tuning Robotic Grasping via Multi-modal Large Language Model |
Jinxuan Xu et.al. |
2411.05212 |
null |
2024-11-07 |
On Erroneous Agreements of CLIP Image Embeddings |
Siting Li et.al. |
2411.05195 |
null |
2024-11-07 |
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning |
Joey Hong et.al. |
2411.05193 |
null |
2024-11-07 |
Precision or Recall? An Analysis of Image Captions for Training Text-to-Image Generation Model |
Sheng Cheng et.al. |
2411.05079 |
link |
2024-11-07 |
DynaMem: Online Dynamic Spatio-Semantic Memory for Open World Mobile Manipulation |
Peiqi Liu et.al. |
2411.04999 |
link |
2024-11-07 |
A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model |
Panwen Hu et.al. |
2411.04942 |
null |
2024-11-07 |
In the Era of Prompt Learning with Vision-Language Models |
Ankit Jha et.al. |
2411.04892 |
null |
2024-11-07 |
TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models |
Jonathan Fhima et.al. |
2411.04642 |
null |
2024-11-07 |
Vision Language Models are In-Context Value Learners |
Yecheng Jason Ma et.al. |
2411.04549 |
null |
2024-11-07 |
BendVLM: Test-Time Debiasing of Vision-Language Embeddings |
Walter Gerych et.al. |
2411.04420 |
link |
2024-11-06 |
Unfair Alignment: Examining Safety Alignment Across Vision Encoder Layers in Vision-Language Models |
Saketh Bachu et.al. |
2411.04291 |
null |
2024-11-06 |
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? |
Daniel P. Jeong et.al. |
2411.04118 |
link |
2024-11-06 |
RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models |
Maya Varma et.al. |
2411.04097 |
link |
2024-11-06 |
H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models |
Nhi Pham et.al. |
2411.04077 |
null |
2024-11-06 |
Select2Plan: Training-Free ICL-Based Planning through VQA and Memory Retrieval |
Davide Buoso et.al. |
2411.04006 |
null |
2024-11-06 |
Multi3Hate: Multimodal, Multilingual, and Multicultural Hate Speech Detection with Vision-Language Models |
Minh Duc Bui et.al. |
2411.03888 |
link |
2024-11-06 |
DesignMinds: Enhancing Video-Based Design Ideation with Vision-Language Model and Context-Injected Large Language Model |
Tianhao He et.al. |
2411.03827 |
null |
2024-11-06 |
Fine-Tuning Vision-Language Model for Automated Engineering Drawing Information Extraction |
Muhammad Tayyab Khan et.al. |
2411.03707 |
null |
2024-11-05 |
Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset |
Yingzi Ma et.al. |
2411.03554 |
link |
2024-11-05 |
VLA-3D: A Dataset for 3D Semantic Scene Understanding and Navigation |
Haochen Zhang et.al. |
2411.03540 |
link |
2024-11-05 |
Personalized Video Summarization by Multimodal Video Understanding |
Brian Chen et.al. |
2411.03531 |
null |
2024-11-05 |
Inference Optimal VLMs Need Only One Visual Token but Larger Models |
Kevin Y. Li et.al. |
2411.03312 |
link |
2024-11-05 |
HumanVLM: Foundation for Human-Scene Vision-Language Model |
Dawei Dai et.al. |
2411.03034 |
null |
2024-11-05 |
Precise Drive with VLM: First Prize Solution for PRCV 2024 Drive LM challenge |
Bin Huang et.al. |
2411.02999 |
null |
2024-11-05 |
Multi-Modal 3D Scene Graph Updater for Shared and Dynamic Environments |
Emilio Olivastri et.al. |
2411.02938 |
null |
2024-11-05 |
Membership Inference Attacks against Large Vision-Language Models |
Zhan Li et.al. |
2411.02902 |
link |
2024-11-05 |
Leveraging Vision-Language Models for Manufacturing Feature Recognition in CAD Designs |
Muhammad Tayyab Khan et.al. |
2411.02810 |
null |
2024-11-05 |
Label Critic: Design Data Before Models |
Pedro R. A. S. Bassi et.al. |
2411.02753 |
link |
2024-11-05 |
DDFAV: Remote Sensing Large Vision Language Models Dataset and Evaluation Benchmark |
Haodong Li et.al. |
2411.02733 |
link |
2024-11-05 |
V-DPO: Mitigating Hallucination in Large Vision Language Models via Vision-Guided Direct Preference Optimization |
Yuxi Xie et.al. |
2411.02712 |
link |
2024-11-04 |
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack |
Xiaojun Jia et.al. |
2411.02669 |
link |
2024-11-04 |
Attacking Vision-Language Computer Agents via Pop-ups |
Yanzhe Zhang et.al. |
2411.02391 |
link |
2024-11-04 |
One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering |
Deepayan Das et.al. |
2411.02210 |
null |
2024-11-04 |
GraphVL: Graph-Enhanced Semantic Modeling via Vision-Language Models for Generalized Class Discovery |
Bhupendra Solanki et.al. |
2411.02074 |
null |
2024-11-03 |
RS-MoE: Mixture of Experts for Remote Sensing Image Captioning and Visual Question Answering |
Hui Lin et.al. |
2411.01595 |
null |
2024-11-03 |
Addressing Failures in Robotics using Vision-Based Language Models (VLMs) and Behavior Trees (BT) |
Faseeh Ahmad et.al. |
2411.01568 |
null |
2024-11-03 |
Integration of Large Vision Language Models for Efficient Post-disaster Damage Assessment and Reporting |
Zhaohui Chen et.al. |
2411.01511 |
null |
2024-11-03 |
A Visual Question Answering Method for SAR Ship: Breaking the Requirement for Multimodal Dataset Construction and Model Fine-Tuning |
Fei Wang et.al. |
2411.01445 |
null |
2024-11-01 |
Identifying Implicit Social Biases in Vision-Language Models |
Kimia Hamidieh et.al. |
2411.00997 |
null |
2024-11-01 |
Retrieval-enriched zero-shot image classification in low-resource domains |
Nicola Dall’Asen et.al. |
2411.00988 |
null |
2024-11-01 |
Does GenAI Make Usability Testing Obsolete? |
Ali Ebrahimi Pourasad et.al. |
2411.00634 |
null |
2024-10-31 |
$π_0$: A Vision-Language-Action Flow Model for General Robot Control |
Kevin Black et.al. |
2410.24164 |
null |
2024-10-31 |
Exploring Vision Language Models for Facial Attribute Recognition: Emotion, Race, Gender, and Age |
Nouar AlDahoul et.al. |
2410.24148 |
null |
2024-10-31 |
Bayesian-guided Label Mapping for Visual Reprogramming |
Chengyi Cai et.al. |
2410.24018 |
link |
2024-10-31 |
EZ-HOI: VLM Adaptation via Guided Prompt Learning for Zero-Shot HOI Detection |
Qinqian Lei et.al. |
2410.23904 |
link |
2024-10-31 |
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP |
Chen Huang et.al. |
2410.23698 |
null |
2024-10-31 |
Adversarial Attacks of Vision Tasks in the Past 10 Years: A Survey |
Chiyu Zhang et.al. |
2410.23687 |
null |
2024-10-31 |
SuctionPrompt: Visual-assisted Robotic Picking with a Suction Cup Using Vision-Language Models and Facile Hardware Design |
Tomohiro Motoda et.al. |
2410.23640 |
null |
2024-10-30 |
Keypoint Abstraction using Large Models for Object-Relative Imitation Learning |
Xiaolin Fang et.al. |
2410.23254 |
null |
2024-10-31 |
Aligning Audio-Visual Joint Representations with an Agentic Workflow |
Shentong Mo et.al. |
2410.23230 |
null |
2024-10-30 |
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents |
Zhiyong Wu et.al. |
2410.23218 |
link |
2024-10-30 |
VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning |
Yichao Liang et.al. |
2410.23156 |
null |
2024-10-30 |
Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models |
Junjie Wu et.al. |
2410.23114 |
link |
2024-10-30 |
An Individual Identity-Driven Framework for Animal Re-Identification |
Yihao Wu et.al. |
2410.22927 |
link |
2024-10-30 |
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector |
Youcheng Huang et.al. |
2410.22888 |
link |
2024-10-30 |
Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model |
Keito Sasagawa et.al. |
2410.22736 |
null |
2024-10-30 |
Robotic State Recognition with Image-to-Text Retrieval Task of Pre-Trained Vision-Language Model and Black-Box Optimization |
Kento Kawaharazuka et.al. |
2410.22707 |
null |
2024-10-30 |
SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset |
Ngoc Dung Huynh et.al. |
2410.22648 |
null |
2024-10-29 |
Task Vectors are Cross-Modal |
Grace Luo et.al. |
2410.22330 |
null |
2024-10-29 |
Multi-Class Textual-Inversion Secretly Yields a Semantic-Agnostic Classifier |
Kai Wang et.al. |
2410.22317 |
link |
2024-10-29 |
Natural Language Inference Improves Compositionality in Vision-Language Models |
Paola Cascante-Bonilla et.al. |
2410.22315 |
null |
2024-10-29 |
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving |
Bo Jiang et.al. |
2410.22313 |
link |
2024-10-29 |
ContextIQ: A Multimodal Expert-Based Video Retrieval System for Contextual Advertising |
Ashutosh Chaubey et.al. |
2410.22233 |
link |
2024-10-29 |
Active Learning for Vision-Language Models |
Bardia Safaei et.al. |
2410.22187 |
null |
2024-10-29 |
Are VLMs Really Blind |
Ayush Singh et.al. |
2410.22029 |
link |
2024-10-29 |
Reliable Semantic Understanding for Real World Zero-shot Object Goal Navigation |
Halil Utku Unlu et.al. |
2410.21926 |
null |
2024-10-30 |
Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models |
Lu Yu et.al. |
2410.21802 |
link |
2024-10-29 |
PerSRV: Personalized Sticker Retrieval with Vision-Language Model |
Heng Er Metilda Chee et.al. |
2410.21801 |
link |
2024-10-29 |
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves? |
Han Bao et.al. |
2410.21259 |
link |
2024-10-28 |
Hierarchical Knowledge Graph Construction from Images for Scalable E-Commerce |
Zhantao Yang et.al. |
2410.21237 |
null |
2024-10-28 |
Vision Search Assistant: Empower Vision-Language Models as Multimodal Search Engines |
Zhixin Zhang et.al. |
2410.21220 |
link |
2024-10-29 |
Document Parsing Unveiled: Techniques, Challenges, and Prospects for Structured Information Extraction |
Qintong Zhang et.al. |
2410.21169 |
null |
2024-10-28 |
Zero-Shot Action Recognition in Surveillance Videos |
Joao Pereira et.al. |
2410.21113 |
null |
2024-10-28 |
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks |
Yunhan Zhao et.al. |
2410.20971 |
null |
2024-10-29 |
VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions |
Guanyan Chen et.al. |
2410.20927 |
null |
2024-10-28 |
Improving Generalization in Visual Reasoning via Self-Ensemble |
Tien-Huy Nguyen et.al. |
2410.20883 |
null |
2024-10-28 |
Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments |
Sangmim Song et.al. |
2410.20666 |
null |
2024-10-27 |
MatViX: Multimodal Information Extraction from Visually Rich Articles |
Ghazal Khalighinejad et.al. |
2410.20494 |
null |
2024-10-25 |
Rethinking Visual Dependency in Long-Context Reasoning for Large Vision-Language Models |
Yucheng Zhou et.al. |
2410.19732 |
null |
2024-10-25 |
GeoLLaVA: Efficient Fine-Tuned Vision-Language Models for Temporal Change Detection in Remote Sensing |
Hosam Elgendy et.al. |
2410.19552 |
link |
2024-10-25 |
Bongard in Wonderland: Visual Puzzles that Still Make AI Go Mad? |
Antonia Wüst et.al. |
2410.19546 |
link |
2024-10-25 |
EDGE: Enhanced Grounded GUI Understanding with Enriched Multi-Granularity Synthetic Data |
Xuetian Chen et.al. |
2410.19461 |
null |
2024-10-25 |
COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training |
Haocheng Xi et.al. |
2410.19313 |
link |
2024-10-25 |
Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting |
Xingyu Zhu et.al. |
2410.19294 |
null |
2024-10-24 |
Probabilistic Language-Image Pre-Training |
Sanghyuk Chun et.al. |
2410.18857 |
link |
2024-10-24 |
Zero-shot Object Navigation with Vision-Language Models Reasoning |
Congcong Wen et.al. |
2410.18570 |
null |
2024-10-24 |
Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data |
Shuhao Gu et.al. |
2410.18558 |
null |
2024-10-24 |
Beyond Color and Lines: Zero-Shot Style-Specific Image Variations with Coordinated Semantics |
Jinghao Hu et.al. |
2410.18537 |
null |
2024-10-23 |
Lightweight Neural App Control |
Filippos Christianos et.al. |
2410.17883 |
null |
2024-10-23 |
ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting |
Shaofei Cai et.al. |
2410.17856 |
link |
2024-10-23 |
RE-tune: Incremental Fine Tuning of Biomedical Vision-Language Models for Multi-label Chest X-ray Classification |
Marco Mistretta et.al. |
2410.17827 |
null |
2024-10-23 |
An Intelligent Agentic System for Complex Image Restoration Problems |
Kaiwen Zhu et.al. |
2410.17809 |
link |
2024-10-23 |
MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models |
Ziyu Liu et.al. |
2410.17637 |
link |
2024-10-22 |
AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents |
Chejian Xu et.al. |
2410.17401 |
null |
2024-10-22 |
Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities |
Zheyuan Zhang et.al. |
2410.17385 |
link |
2024-10-22 |
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction |
Long Xing et.al. |
2410.17247 |
link |
2024-10-22 |
MPDS: A Movie Posters Dataset for Image Generation with Diffusion Model |
Meng Xu et.al. |
2410.16840 |
null |
2024-10-22 |
Combining Ontological Knowledge and Large Language Model for User-Friendly Service Robots |
Haru Nakajima et.al. |
2410.16804 |
null |
2024-10-22 |
Foundation Models for Remote Sensing and Earth Observation: A Survey |
Aoran Xiao et.al. |
2410.16602 |
link |
2024-10-21 |
Integrating Reinforcement Learning with Foundation Models for Autonomous Robotics: Methods and Perspectives |
Angelo Moroncelli et.al. |
2410.16411 |
link |
2024-10-21 |
VipAct: Visual-Perception Enhancement via Specialized VLM Agent Collaboration and Tool-use |
Zhehao Zhang et.al. |
2410.16400 |
null |
2024-10-21 |
xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs |
Michael S. Ryoo et.al. |
2410.16267 |
null |
2024-10-21 |
Sketch2Code: Evaluating Vision-Language Models for Interactive Web Design Prototyping |
Ryan Li et.al. |
2410.16232 |
null |
2024-10-21 |
Improve Vision Language Model Chain-of-thought Reasoning |
Ruohong Zhang et.al. |
2410.16198 |
link |
2024-10-21 |
Sparkle: Mastering Basic Spatial Capabilities in Vision Language Models Elicits Generalization to Composite Spatial Reasoning |
Yihong Tang et.al. |
2410.16162 |
null |
2024-10-21 |
Mitigating Object Hallucination via Concentric Causal Attention |
Yun Xing et.al. |
2410.15926 |
link |
2024-10-21 |
MI-VisionShot: Few-shot adaptation of vision-language models for slide-level classification of histopathological images |
Pablo Meseguer et.al. |
2410.15881 |
null |
2024-10-21 |
Task-oriented Robotic Manipulation with Vision Language Models |
Nurhan Bulus Guran et.al. |
2410.15863 |
null |
2024-10-21 |
An Efficient System for Automatic Map Storytelling – A Case Study on Historical Maps |
Ziyi Liu et.al. |
2410.15780 |
link |
2024-10-22 |
Reducing Hallucinations in Vision-Language Models via Latent Space Steering |
Sheng Liu et.al. |
2410.15778 |
link |
2024-10-21 |
AutoTrain: No-code training for state-of-the-art models |
Abhishek Thakur et.al. |
2410.15735 |
link |
2024-10-18 |
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples |
Baiqi Li et.al. |
2410.14669 |
null |
2024-10-18 |
Neuro-Symbolic Traders: Assessing the Wisdom of AI Crowds in Markets |
Namid R. Stillman et.al. |
2410.14587 |
null |
2024-10-18 |
CLIP-VAD: Exploiting Vision-Language Models for Voice Activity Detection |
Andrea Appiani et.al. |
2410.14509 |
null |
2024-10-18 |
Zero-shot Action Localization via the Confidence of Large Vision-Language Models |
Josiah Aklilu et.al. |
2410.14340 |
null |
2024-10-18 |
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models |
Olga Loginova et.al. |
2410.14248 |
null |
2024-10-18 |
E3D-GPT: Enhanced 3D Visual Foundation for Medical Vision-Language Model |
Haoran Lai et.al. |
2410.14200 |
null |
2024-10-18 |
LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs |
Yujun Zhou et.al. |
2410.14182 |
null |
2024-10-18 |
MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems |
Zifeng Zhu et.al. |
2410.14179 |
link |
2024-10-18 |
ProReason: Multi-Modal Proactive Reasoning with Decoupled Eyesight and Wisdom |
Jingqi Zhou et.al. |
2410.14138 |
null |
2024-10-17 |
Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers |
Yuxin Wen et.al. |
2410.14072 |
null |
2024-10-17 |
VLM-Grounder: A VLM Agent for Zero-Shot 3D Visual Grounding |
Runsen Xu et.al. |
2410.13860 |
link |
2024-10-17 |
Differentiable Robot Rendering |
Ruoshi Liu et.al. |
2410.13851 |
null |
2024-10-17 |
Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning |
Xiaodan Xing et.al. |
2410.13823 |
link |
2024-10-17 |
Improving Multi-modal Large Language Model through Boosting Vision Capabilities |
Yanpeng Sun et.al. |
2410.13733 |
null |
2024-10-17 |
VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks |
Shailaja Keyur Sampat et.al. |
2410.13666 |
link |
2024-10-17 |
H2OVL-Mississippi Vision Language Models Technical Report |
Shaikat Galib et.al. |
2410.13611 |
null |
2024-10-17 |
GeoCoder: Solving Geometry Problems by Generating Modular Code through Vision-Language Models |
Aditya Sharma et.al. |
2410.13510 |
null |
2024-10-17 |
Mitigating Hallucinations in Large Vision-Language Models via Summary-Guided Decoding |
Kyungmin Min et.al. |
2410.13321 |
null |
2024-10-17 |
Mapping Bias in Vision Language Models: Signposts, Pitfalls, and the Road Ahead |
Kuleen Sasse et.al. |
2410.13146 |
link |
2024-10-17 |
Trust but Verify: Programmatic VLM Evaluation in the Wild |
Viraj Prabhu et.al. |
2410.13121 |
null |
2024-10-16 |
Dual Prototype Evolving for Test-Time Generalization of Vision-Language Models |
Ce Zhang et.al. |
2410.12790 |
link |
2024-10-16 |
Harmon: Whole-Body Motion Generation of Humanoid Robots from Language Descriptions |
Zhenyu Jiang et.al. |
2410.12773 |
null |
2024-10-16 |
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation |
João Matos et.al. |
2410.12722 |
link |
2024-10-16 |
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines |
Genta Indra Winata et.al. |
2410.12705 |
link |
2024-10-16 |
VividMed: Vision Language Model with Versatile Visual Grounding for Medicine |
Lingxiao Luo et.al. |
2410.12694 |
link |
2024-10-16 |
Cross-Modal Safety Mechanism Transfer in Large Vision-Language Models |
Shicheng Xu et.al. |
2410.12662 |
null |
2024-10-16 |
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training |
Zhiyuan Ma et.al. |
2410.12595 |
null |
2024-10-16 |
FTII-Bench: A Comprehensive Multimodal Benchmark for Flow Text with Image Insertion |
Jiacheng Ruan et.al. |
2410.12564 |
link |
2024-10-16 |
Evaluating Cascaded Methods of Vision-Language Models for Zero-Shot Detection and Association of Hardhats for Increased Construction Safety |
Lucas Choi et.al. |
2410.12225 |
null |
2024-10-16 |
Dual-Model Distillation for Efficient Action Classification with Hybrid Edge-Cloud Solution |
Timothy Wei et.al. |
2410.12165 |
null |
2024-10-15 |
A Survey of Low-shot Vision-Language Model Adaptation via Representer Theorem |
Kun Ding et.al. |
2410.11686 |
null |
2024-10-15 |
VisualRWKV-HD and UHD: Advancing High-Resolution Processing for Visual Language Models |
Zihang Li et.al. |
2410.11665 |
null |
2024-10-16 |
Efficient and Effective Universal Adversarial Attack against Vision-Language Pre-training Models |
Fan Yang et.al. |
2410.11639 |
null |
2024-10-15 |
MultiVENT 2.0: A Massive Multilingual Benchmark for Event-Centric Video Retrieval |
Reno Kriz et.al. |
2410.11619 |
null |
2024-10-15 |
PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model |
Shang-Ching Liu et.al. |
2410.11564 |
null |
2024-10-15 |
LargePiG: Your Large Language Model is Secretly a Pointer Generator |
Zhongxiang Sun et.al. |
2410.11366 |
null |
2024-10-15 |
Have the VLMs Lost Confidence? A Study of Sycophancy in VLMs |
Shuo Li et.al. |
2410.11302 |
null |
2024-10-15 |
CLIP-DFGS: A Hard Sample Mining Method for CLIP in Generalizable Person Re-Identification |
Huazhong Zhao et.al. |
2410.11255 |
null |
2024-10-15 |
Tree of Attributes Prompt Learning for Vision-Language Models |
Tong Ding et.al. |
2410.11201 |
link |
2024-10-14 |
Locality Alignment Improves Vision-Language Models |
Ian Covert et.al. |
2410.11087 |
null |
2024-10-14 |
Towards Foundation Models for 3D Vision: How Close Are We? |
Yiming Zuo et.al. |
2410.10799 |
link |
2024-10-14 |
VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents |
Shi Yu et.al. |
2410.10594 |
link |
2024-10-14 |
Queryable Prototype Multiple Instance Learning with Vision-Language Models for Incremental Whole Slide Image Classification |
Jiaxiang Gou et.al. |
2410.10573 |
link |
2024-10-14 |
MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks |
Jiacheng Chen et.al. |
2410.10563 |
link |
2024-10-14 |
Learning to Ground VLMs without Forgetting |
Aritra Bhowmik et.al. |
2410.10491 |
null |
2024-10-14 |
LG-CAV: Train Any Concept Activation Vector with Language Guidance |
Qihan Huang et.al. |
2410.10308 |
null |
2024-10-14 |
Fine-grained Abnormality Prompt Learning for Zero-shot Anomaly Detection |
Jiawen Zhu et.al. |
2410.10289 |
link |
2024-10-14 |
LOBG: Less Overfitting for Better Generalization in Vision-Language Model |
Chenhao Ding et.al. |
2410.10247 |
null |
2024-10-14 |
MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models |
Peng Xia et.al. |
2410.10139 |
link |
2024-10-14 |
Mixture of Experts Made Personalized: Federated Prompt Learning for Vision-Language Models |
Jun Luo et.al. |
2410.10114 |
link |
2024-10-11 |
Unraveling and Mitigating Safety Alignment Degradation of Vision-Language Models |
Qin Liu et.al. |
2410.09047 |
null |
2024-10-11 |
The Impact of Visual Information in Chinese Characters: Evaluating Large Models’ Ability to Recognize and Utilize Radicals |
Xiaofeng Wu et.al. |
2410.09013 |
null |
2024-10-11 |
DA-Ada: Learning Domain-Aware Adapter for Domain Adaptive Object Detection |
Haochen Li et.al. |
2410.09004 |
link |
2024-10-11 |
SegGrasp: Zero-Shot Task-Oriented Grasping via Semantic and Geometric Guided Segmentation |
Haosheng Li et.al. |
2410.08901 |
null |
2024-10-11 |
Calibrated Cache Model for Few-Shot Vision-Language Model Adaptation |
Kun Ding et.al. |
2410.08895 |
null |
2024-10-11 |
RoRA-VLM: Robust Retrieval-Augmented Vision Language Models |
Jingyuan Qi et.al. |
2410.08876 |
null |
2024-10-11 |
Audio Description Generation in the Era of LLMs and VLMs: A Review of Transferable Generative AI Technologies |
Yingqiang Gao et.al. |
2410.08860 |
null |
2024-10-11 |
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model |
Beichen Wang et.al. |
2410.08792 |
null |
2024-10-11 |
Superpipeline: A Universal Approach for Reducing GPU Memory Usage in Large Models |
Reza Abbasi et.al. |
2410.08791 |
link |
2024-10-11 |
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping |
Yue Yang et.al. |
2410.08695 |
link |
2024-10-10 |
LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts |
Anh-Quan Cao et.al. |
2410.08211 |
null |
2024-10-10 |
MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models |
Wenbo Hu et.al. |
2410.08182 |
null |
2024-10-10 |
On the Evaluation of Generative Robotic Simulations |
Feng Chen et.al. |
2410.08172 |
null |
2024-10-10 |
Q-VLM: Post-training Quantization for Large Vision-Language Models |
Changyuan Wang et.al. |
2410.08119 |
link |
2024-10-10 |
Unsupervised Data Validation Methods for Efficient Model Training |
Yurii Paniv et.al. |
2410.07880 |
null |
2024-10-10 |
HeGraphAdapter: Tuning Multi-Modal Vision-Language Models with Heterogeneous Graph Adapter |
Yumiao Zhao et.al. |
2410.07854 |
null |
2024-10-10 |
FLIER: Few-shot Language Image Models Embedded with Latent Representations |
Zhinuo Zhou et.al. |
2410.07648 |
null |
2024-10-10 |
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks |
Hoin Jung et.al. |
2410.07593 |
link |
2024-10-10 |
3D Vision-Language Gaussian Splatting |
Qucheng Peng et.al. |
2410.07577 |
null |
2024-10-10 |
How Does Vision-Language Adaptation Impact the Safety of Vision Language Models? |
Seongyun Lee et.al. |
2410.07571 |
null |
2024-10-09 |
Deciphering Cross-Modal Alignment in Large Vision-Language Models with Modality Integration Rate |
Qidong Huang et.al. |
2410.07167 |
link |
2024-10-09 |
Towards Interpreting Visual Information Processing in Vision-Language Models |
Clement Neo et.al. |
2410.07149 |
link |
2024-10-10 |
EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models |
Rui Zhao et.al. |
2410.07133 |
link |
2024-10-09 |
VHELM: A Holistic Evaluation of Vision Language Models |
Tony Lee et.al. |
2410.07112 |
link |
2024-10-09 |
Pixtral 12B |
Pravesh Agrawal et.al. |
2410.07073 |
link |
2024-10-09 |
Clean Evaluations on Contaminated Visual Language Models |
Hongyuan Lu et.al. |
2410.07030 |
null |
2024-10-09 |
Preference Fine-Tuning for Factuality in Chest X-Ray Interpretation Models Without Human Feedback |
Dennis Hein et.al. |
2410.07025 |
null |
2024-10-09 |
$\texttt{ModSCAN}$: Measuring Stereotypical Bias in Large Vision-Language Models from Vision and Language Modalities |
Yukun Jiang et.al. |
2410.06967 |
link |
2024-10-09 |
Compositional Entailment Learning for Hyperbolic Vision-Language Models |
Avik Pal et.al. |
2410.06912 |
link |
2024-10-09 |
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models |
Yuying Shang et.al. |
2410.06795 |
null |
2024-10-07 |
Fine-Tuning CLIP’s Last Visual Projector: A Few-Shot Cornucopia |
Mohammad Fahes et.al. |
2410.05270 |
link |
2024-10-07 |
TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens |
Ya-Qi Yu et.al. |
2410.05261 |
null |
2024-10-08 |
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models |
Rabin Adhikari et.al. |
2410.05239 |
link |
2024-10-07 |
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality |
Youngtaek Oh et.al. |
2410.05210 |
link |
2024-10-07 |
LADEV: A Language-Driven Testing and Evaluation Platform for Vision-Language-Action Models in Robotic Manipulation |
Zhijie Wang et.al. |
2410.05191 |
null |
2024-10-07 |
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks |
Ziyan Jiang et.al. |
2410.05160 |
null |
2024-10-07 |
HyperINF: Unleashing the HyperPower of the Schulz’s Method for Data Influence Estimation |
Xinyu Zhou et.al. |
2410.05090 |
link |
2024-10-07 |
HE-Drive: Human-Like End-to-End Driving with Vision Language Models |
Junming Wang et.al. |
2410.05051 |
null |
2024-10-07 |
Patch is Enough: Naturalistic Adversarial Patch against Vision-Language Pre-training Models |
Dehong Kong et.al. |
2410.04884 |
null |
2024-10-07 |
TLDR: Token-Level Detective Reward Model for Large Vision Language Models |
Deqing Fu et.al. |
2410.04734 |
null |
2024-10-06 |
Unpacking Failure Modes of Generative Policies: Runtime Monitoring of Consistency and Progress |
Christopher Agia et.al. |
2410.04640 |
null |