
Deep Learning

[Paper Review] LLaVA-Video: Video Instruction Tuning With Synthetic Data
https://arxiv.org/abs/2410.02713
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset…

[Paper Review] LLaVA-OneVision: Easy Visual Task Transfer
https://arxiv.org/abs/2408.03326
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the…

[Paper Review] LLaVA-NeXT: A Strong Zero-shot Video Understanding Model
https://llava-vl.github.io/blog/2024-01-30-llava-next/
https://llava-vl.github.io/blog/2024-04-30-llava-next-video/
The LLaVA team presents LLaVA-NeXT, with improved reasoning, OCR, and world knowledge. LLaVA-NeXT even exceeds Gemini Pro on several benchmarks.

[Paper Review] LLaVA 1.5: Improved Baselines with Visual Instruction Tuning
https://arxiv.org/abs/2310.03744
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient…

[Paper Review] LLaVA: Visual Instruction Tuning
https://arxiv.org/abs/2304.08485
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use…
Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning.

[Paper Review] STEAD: Spatio-Temporal Efficient Anomaly Detection for Time and Compute Sensitive Applications
https://arxiv.org/abs/2503.07942
This paper presents a new method for anomaly detection in automated systems with time and compute sensitive requirements, such as autonomous driving, with unparalleled efficiency…

[Paper Review] JOSENet: A Joint Stream Embedding Network for Violence Detection in Surveillance Videos
https://arxiv.org/abs/2405.02961
The increasing proliferation of video surveillance cameras and the escalating demand for crime prevention have intensified interest in the task of violence detection within the research community…

[Paper Review] VadCLIP: Adapting Vision-Language Models for Weakly Supervised Video Anomaly Detection
https://ojs.aaai.org/index.php/AAAI/article/view/28423
Wu, P., Zhou, X., Pang, G., Zhou, L., Yan, Q., Wang, P., & Zhang, Y. (2024, March). VadCLIP: Adapting vision-language models for weakly supervised video anomaly detection. Proceedings of the AAAI Conference on Artificial Intelligence…