LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

https://llava-vl.github.io/blog/2024-01-30-llava-next/

LLaVA-NeXT: Improved reasoning, OCR, and world knowledge

LLaVA team presents LLaVA-NeXT, with improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.

https://llava-vl.github.io/blog/2024-04-30-llava-next-video/

LLaVA-NeXT: A Strong Zero-shot Video Understanding Model

On January 30, 2024, we released LLaVA-NeXT, an open-source Large Multimodal Model (LMM) that has been trained exclusively on text-image data. With the proposed AnyRes technique, it boosts capabilities in reasoning, OCR, and world knowledge, demonstrating

LLaVA-NeXT is not a formally published paper; think of it as roughly a sidegrade of LLaVA 1.5.
A paper does exist for LLaVA-NeXT-Interleave, an extended version of LLaVA-NeXT.

(https://arxiv.org/abs/2407.07895)

Method

1. Zero-shot video representation capabilities with AnyRes

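The AnyRes technique splits a high-resolution image into several smaller images so that a pre-trained ViT can process them.
(The image is resized and split into multiple patches so the ViT can "digest" it.)
The resulting images are fed in as one contiguous sequence, which extends naturally to handling the multiple frames of a video.
As a result, a model trained only on images shows zero-shot transfer to video.
According to the blog, this is the first time such strong modality transfer has been observed in LMMs (Large Multimodal Models).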
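
To make the splitting step concrete, here is a minimal sketch of how an image could be laid out as one global view followed by fixed-size tiles. This is not the official implementation: the function names are hypothetical, the 336-pixel tile size is an assumption, and the sketch simply rounds up to a grid (assuming the image is padded to fit) rather than reproducing however the real model picks its grid.

```python
import math

def anyres_tiles(width, height, tile=336):
    """Return crop boxes (left, top, right, bottom) on a tile grid.

    Assumption: the image is padded so that the grid covers it fully,
    which is why boxes may extend past the original width/height.
    """
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            boxes.append((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
    return boxes

def anyres_sequence(width, height, tile=336):
    """Order the views as one sequence: a resized global view, then the tiles."""
    views = [("global", (0, 0, width, height))]
    views += [("tile", box) for box in anyres_tiles(width, height, tile)]
    return views

# A 672x672 image becomes 1 global view + 4 tiles.
for kind, box in anyres_sequence(672, 672):
    print(kind, box)
```

Seen this way, a video is just another ordered list of views: each frame takes the place of a tile, which is one way to read why the image-only model transfers to video.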

2. Inference with length generalization improves on longer videos

Long-video inference through length generalization.
A linear scaling technique gives the model the ability to generalize over input length.
This lets it handle long videos effectively, beyond the LLM's "max_token_length" limit.
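
The post only names "linear scaling", so the sketch below rests on an assumption: that it refers to linearly scaling rotary position embedding (RoPE) indices, where every position is divided by a factor so a longer sequence stays inside the position range seen during training. The function names, the 4096-token trained length, and the tiny embedding dimension are all illustrative, not taken from the model.

```python
def required_factor(num_tokens, trained_len=4096):
    """Smallest scaling factor that maps num_tokens into the trained range."""
    return max(1.0, num_tokens / trained_len)

def rope_angles(position, dim=8, base=10000.0, scaling_factor=1.0):
    """Rotary angles for one position.

    Linear scaling divides the position index by scaling_factor before
    the usual per-frequency angle computation.
    """
    scaled = position / scaling_factor
    return [scaled / (base ** (2 * i / dim)) for i in range(dim // 2)]

# A video that needs 8192 tokens against an assumed 4096-token limit.
factor = required_factor(8192)
last_position = 8192 - 1
print(factor, last_position / factor)
```

With a factor of 2, position 8 receives the same angles as position 4 would without scaling, so the model never sees a position index larger than the ones it was trained on.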

3. Strong video understanding ability

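LLaVA-NeXT-Image: By combining AnyRes with length generalization, it achieves better zero-shot video performance than existing open-source LMMs.
LLaVA-NeXT-Video: Additional supervised fine-tuning (SFT) on video data further strengthens video understanding.
LLaVA-NeXT-Video-DPO: Applying DPO (Direct Preference Optimization) to align responses with AI feedback yields a marked performance gain.
※ DPO: a training method that directly adjusts a model's responses toward the preferences of humans or of another model.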
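
For readers new to DPO, the standard per-pair loss can be sketched in a few lines. This is the generic DPO objective, not code from LLaVA-NeXT-Video-DPO; the function name, the `beta=0.1` default, and the example log-probabilities are assumptions for illustration.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed log-probability of a full response under
    the policy being trained or under the frozen reference model.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference learned yet.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Policy favors the chosen response more than the reference does: lower loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss falls as the policy raises the preferred response's probability relative to the reference and lowers the rejected one's, which is how the preference signal (here, AI feedback) shapes the model without a separate reward model.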

4. Efficient deployment and inference with SGLang

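Optimized inference with SGLang makes video tasks 5x faster than before.
This makes it possible to offer a scalable system suited to large-scale workloads, such as recaptioning millions of videos.
※ SGLang: a system for efficiently executing complex language model programs.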

Experiments

*The video input is represented as only one frame.

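P.S. Since the blog post titled "LLaVA-NeXT: A Strong Zero-shot Video Understanding Model" came out later, it appears to be the official title, while "LLaVA-NeXT: Improved reasoning, OCR, and world knowledge" looks more like an archive of the earlier work.

=> The introduction of the LLaVA-OneVision paper confirms that the blog posts were written under different titles depending on their purpose.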
