
Multimodal

[Paper Review] LLaVA-Video: Video Instruction Tuning With Synthetic Data
https://arxiv.org/abs/2410.02713
"The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset…"
[Paper Review] LLaVA-OneVision: Easy Visual Task Transfer
https://arxiv.org/abs/2408.03326
"We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series…"
[Paper Review] LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
https://llava-vl.github.io/blog/2024-01-30-llava-next/
https://llava-vl.github.io/blog/2024-04-30-llava-next-video/
"The LLaVA team presents LLaVA-NeXT, with improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks."
[Paper Review] LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
https://arxiv.org/abs/2310.03744
"Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient…"
[Paper Review] LLaVA: Visual Instruction Tuning
https://arxiv.org/abs/2304.08485
"Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field…"