2026
1.

Ranjan Sapkota; Manoj Karkee
Object detection with multimodal large vision-language models: An in-depth review Journal Article
In: Information Fusion, vol. 126, pp. 103575, 2026, ISSN: 1566-2535.
@article{sapkota_object_2026,
title = {Object detection with multimodal large vision-language models: An in-depth review},
author = {Ranjan Sapkota and Manoj Karkee},
url = {https://www.sciencedirect.com/science/article/pii/S1566253525006475},
doi = {10.1016/j.inffus.2025.103575},
issn = {1566-2535},
year = {2026},
date = {2026-02-01},
urldate = {2026-02-01},
journal = {Information Fusion},
volume = {126},
pages = {103575},
abstract = {The fusion of language and vision in large vision-language models (LVLMs) has revolutionized deep learning-based object detection by enhancing adaptability, contextual reasoning, and generalization beyond traditional architectures. This in-depth review presents a structured exploration of the state-of-the-art in LVLMs, systematically organized through a three-step research review process. First, we discuss the functioning of vision language models (VLMs) for object detection, describing how these models harness natural language processing (NLP) and computer vision (CV) techniques to revolutionize object detection and localization. We then explain the architectural innovations, training paradigms, and output flexibility of recent LVLMs for object detection, highlighting how they achieve advanced contextual understanding for object detection. The review thoroughly examines the approaches used to integrate visual and textual information, demonstrating the progress made in object detection using VLMs that facilitate more sophisticated object detection and localization strategies. Furthermore, this review presents comprehensive visualizations demonstrating LVLMs’ effectiveness in diverse scenarios including localization and segmentation, and then compares their real-time performance, adaptability, and complexity to traditional deep learning systems. Based on the review analysis, it is expected that LVLMs will soon meet or surpass the performance of conventional methods in object detection. However, because of the unique and complementary characteristics of traditional deep learning approaches and LVLMs, it is anticipated that hybrid approaches integrating both types of object detection models will be utilized in the future to maximize the speed, reliability, and robustness of the systems.
Moreover, the review also identifies a few major limitations of current LVLM models, proposes solutions to address those challenges, and presents a clear roadmap for future advancement in this field. We conclude, based on this study, that recent advancements in LVLMs have made, and will continue to make, a transformative impact on object detection and automated applications in the future.},
keywords = {Information fusion, Language and vision fusion, Large language models, Object detection, Vision-language models},
pubstate = {published},
tppubtype = {article}
}
