CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update

CVPR 2024

1School of Intelligence Science and Technology, Peking University 2State Key Laboratory of General Artificial Intelligence, BIGAI 3Beijing Jiaotong University 4Department of Automation, Tsinghua University

TL;DR: We propose CLOVA, a general visual assistant that updates both LLMs and visual models via inference, reflection, and learning in a closed-loop framework.


Figure 1. Several examples of CLOVA

Abstract

Utilizing large language models (LLMs) to compose off-the-shelf visual tools represents a promising avenue of research for developing robust visual assistants capable of addressing diverse visual tasks. However, these methods often overlook the potential for continual learning, typically freezing the tools they use and thus limiting their adaptation to environments requiring new knowledge. To tackle this challenge, we propose CLOVA, a Closed-Loop Visual Assistant, which operates within a framework encompassing inference, reflection, and learning phases. During the inference phase, LLMs generate programs and execute corresponding tools to complete assigned tasks. In the reflection phase, a multimodal global-local reflection scheme analyzes human feedback to determine which tools require updating. Lastly, the learning phase employs three flexible approaches to automatically gather training data and introduces a novel prompt tuning scheme to update the tools, allowing CLOVA to efficiently acquire new knowledge. Experimental findings demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image editing. These results underscore the significance of the continual learning capability in general visual assistants.


Figure 2. Framework of CLOVA. CLOVA is a general visual assistant that updates both LLMs and visual models via inference, reflection, and learning in a closed-loop framework. During inference, CLOVA uses LLMs to integrate visual tools to accomplish given tasks. In reflection, CLOVA identifies models that require updating based on environmental feedback. Finally, in learning, CLOVA collects data and updates models accordingly.

Method

CLOVA has three phases: inference, reflection, and learning, as shown in Figure 2. In the inference phase, CLOVA uses LLMs to generate programs and executes the corresponding tools to solve the task. The reflection phase introduces a multimodal global-local reflection scheme that uses LLMs to generate critiques, identifying which tool needs to be updated. During learning, we employ three approaches to collect training data and use a training-validation prompt tuning scheme to update the tools.
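To make the closed loop concrete, the following is a minimal runnable sketch of how the three phases chain together. Every component here is a toy stand-in (the class, function, and tool names are illustrative, not CLOVA's actual API): the real system prompts an LLM to write programs, executes visual tools, and tunes their prompts on collected data.

```python
# Toy sketch of the inference-reflection-learning loop; all names are
# illustrative stand-ins, not CLOVA's real interface.

class ToyTool:
    """Stand-in for a visual tool whose prompts can be updated."""
    def __init__(self, name):
        self.name = name
        self.updated = False

    def update(self, data):
        # Real system: collect training data, then training-validation
        # prompt tuning; here we just flip a flag.
        self.updated = True

def run_closed_loop(task, tool, max_rounds=3):
    result = None
    for round_idx in range(max_rounds):
        program = f"ANSWER={tool.name}(image,'{task}')"   # inference: LLM writes a program
        result = "right" if tool.updated else "wrong"     # execute visual tools
        if result == "right":                             # environment / human feedback
            return result, round_idx
        # reflection: critiques point at the faulty tool; learning: update it
        tool.update(data=[task])
    return result, max_rounds

result, rounds = run_closed_loop("What is in the image?", ToyTool("VQA"))
```

The loop illustrates the key design point: failure does not end the episode; it triggers reflection and a tool update, after which inference is retried.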

Inference

Our inference phase builds on VISPROG; the difference is that CLOVA first uses LLMs to generate plans and then produces programs from those plans, instead of generating programs directly. Plans can be seen as intermediate reasoning chains that benefit both the inference and reflection phases. Given a task, CLOVA selects in-context examples from a demonstration pool, including correct examples and incorrect examples with error critiques. These examples are used to create prompts that are then sent to LLMs for plan and program generation. Finally, the program is parsed to execute visual tools.
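A sketch of how such a prompt might be assembled is shown below. The demonstration pool entries, critiques, and VISPROG-style module names (`LOC`, `CLASSIFY`) are toy examples of our own, and the LLM call itself is omitted.

```python
# Illustrative prompt construction for plan-then-program generation.
# Pool entries and module names are made-up examples, not real data.

demo_pool = [
    {"task": "Count red cars",
     "plan": "1) locate cars 2) filter red ones 3) count",
     "program": "BOXES=LOC(image,'car')",
     "critique": None},
    {"task": "Tag the dog breed",
     "plan": "1) locate the dog 2) classify its breed",
     "program": "CLASS=CLASSIFY(image,'dog breeds')",
     "critique": "CLASSIFY used the wrong label set"},
]

def build_prompt(task, pool):
    """Assemble in-context examples (correct and incorrect, with critiques)."""
    lines = []
    for demo in pool:
        lines.append(f"Task: {demo['task']}\nPlan: {demo['plan']}")
        if demo["critique"]:  # incorrect example carries its error critique
            lines.append(f"Critique: {demo['critique']}")
        lines.append(f"Program: {demo['program']}")
    lines.append(f"Task: {task}\nPlan:")  # the LLM completes plan, then program
    return "\n".join(lines)

prompt = build_prompt("How many people wear hats?", demo_pool)
```

Ending the prompt at `Plan:` makes the LLM emit the plan before the program, which is what provides the intermediate reasoning chain used later by reflection.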


Figure 3. Illustration of the inference phase in CLOVA.

Reflection

If a task is not solved correctly in the inference phase, the multimodal global-local reflection scheme uses LLMs to generate critiques that identify which tool needs to be updated. We first convert visual results into textual form. CLOVA then applies global reflection, generating critiques over the whole program in a single pass. If CLOVA still fails after the tools are updated via global reflection and the learning phase, meaning the tools that actually caused the faulty response have yet to be found, we resort to local reflection, which analyzes each step of the program.
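The global-then-local ordering can be sketched as follows. The LLM critique calls are replaced here by toy rules, and the tool names (`VQA`, `SEG`) are illustrative placeholders.

```python
# Sketch of global-then-local reflection ordering; critique logic is stubbed.

def global_reflection(task, program, feedback):
    """One-pass critique over the whole program: name the likely faulty tool."""
    return ["VQA"]  # a real critique comes from prompting the LLM once

def local_reflection(task, program_steps, feedback):
    """Step-by-step critique: inspect each program step to locate the fault."""
    suspects = []
    for step in program_steps:
        if "SEG" in step:  # toy rule; really an LLM critique per step
            suspects.append("SEG")
    return suspects

def reflect(task, program_steps, feedback, global_already_tried):
    # Cheap global reflection first; fall back to local reflection when the
    # tools updated after global reflection still produce a wrong answer.
    if not global_already_tried:
        return global_reflection(task, " ".join(program_steps), feedback)
    return local_reflection(task, program_steps, feedback)
```

The design trade-off is cost: one global LLM call is cheaper, while per-step local critiques are more precise, so local reflection is reserved for cases the global pass fails to fix.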


Figure 4. Illustration of the reflection phase in CLOVA.

Learning

After obtaining the tools that need to be updated from the reflection phase, CLOVA moves to the learning phase, which collects training data and performs training-validation prompt tuning to update them. Since the tools that need updating can differ considerably, we explore three approaches to collect data in real time. Given the collected data, we invoke training-validation prompt tuning to update the tools. During inference, we use prompt ensemble to retrieve and utilize prompts from the prompt pool.
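The two learning mechanics can be sketched as below. Prompts are plain float vectors standing in for learnable prompt tokens, gradients and validation scores are supplied by the caller, and the function names are our own illustrative shorthand, not the paper's implementation.

```python
# Toy sketch of training-validation prompt tuning and prompt ensemble.
# Prompts are float vectors standing in for learnable prompt tokens.

def tune_prompt(loss_grad, val_score, dim=4, lr=0.1, steps=50):
    """Train a prompt by gradient descent; keep it only if validation improves."""
    prompt = [0.0] * dim
    for _ in range(steps):
        grads = loss_grad(prompt)
        prompt = [p - lr * g for p, g in zip(prompt, grads)]
    # training-validation: discard the prompt if it does not beat the baseline
    return prompt if val_score(prompt) > val_score([0.0] * dim) else None

def ensemble(pool, query_key, top_k=2):
    """Retrieve the top-k prompts closest to the query key and average them."""
    ranked = sorted(pool, key=lambda entry: abs(entry["key"] - query_key))[:top_k]
    dim = len(ranked[0]["prompt"])
    return [sum(e["prompt"][i] for e in ranked) / len(ranked) for i in range(dim)]
```

The validation gate is the point of the training-validation scheme: a newly tuned prompt enters the pool only when it measurably improves held-out performance, which guards against overfitting to the few samples collected in real time.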


Figure 5. Illustration of the learning phase in CLOVA.

Video Demo

BibTeX

@inproceedings{gao2024clova,
    title = {CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update},
    author = {Gao, Zhi and Du, Yuntao and Zhang, Xintong and Ma, Xiaojian and Han, Wenjuan and Zhu, Song-Chun and Li, Qing},
    booktitle = {Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2024}
}