Open Domain Action Recognition Based on Domain Context Assistance

  • Abstract: Effectively transferring knowledge from pre-trained models to downstream video understanding tasks is a key topic in computer vision research. Knowledge transfer becomes even more challenging in open-domain scenarios due to adverse data conditions. Inspired by natural language processing, many recent multi-modal pre-training models perform transfer learning by designing text prompts. This paper leverages the ability of large language models to understand open domains and proposes a domain-context-assisted method for open-domain action recognition, improving model understanding in open-domain scenarios. By using a large language model to enrich text labels with contextual information, the method aligns visual representations with multi-level descriptions of human actions for robust classification. Extensive action recognition experiments were conducted in open-domain scenarios: under the fully supervised setting, the method obtains a Top-1 accuracy of 71.86% on the ARID dataset and a mean average precision of 80.93% on the Tiny-VARIT dataset. Furthermore, it achieves a Top-1 accuracy of 48.63% under source-free video domain adaptation and 54.36% under multi-source video domain adaptation. The experimental results demonstrate the effectiveness of domain context assistance in a variety of open-domain environments.

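To make the alignment mechanism described in the abstract concrete, the sketch below shows what classification over LLM-enriched, multi-level action descriptions could look like. It is a minimal illustration, not the authors' implementation: the encoder stubs, the hand-written description levels, the embedding size, and the score-averaging scheme are all assumptions introduced for exposition.

```python
import torch
import torch.nn.functional as F

# Hypothetical multi-level descriptions per action label, of the kind an LLM
# might produce when asked to enrich a bare label with domain context
# (e.g. for a dark-video dataset such as ARID). The text is purely illustrative.
DESCRIPTIONS = {
    "drinking": [
        "a person drinking",                                   # label level
        "a person raising a cup or bottle to their mouth",     # action level
        "in a dimly lit room, a person tilts a container and swallows liquid",  # context level
    ],
    "waving": [
        "a person waving",
        "a person moving one raised hand from side to side",
        "in low light, a person greets someone by swinging an open hand overhead",
    ],
}

EMB_DIM = 512  # assumed shared embedding size for both modalities


def encode_text(sentences: list[str]) -> torch.Tensor:
    """Stand-in for a pretrained text encoder (e.g. a CLIP-style one); returns
    one L2-normalized embedding per sentence. Random weights here, for shape only."""
    return F.normalize(torch.randn(len(sentences), EMB_DIM), dim=-1)


def encode_video(clip: torch.Tensor) -> torch.Tensor:
    """Stand-in for a pretrained video encoder; maps a clip of shape
    (frames, channels, height, width) to one normalized embedding."""
    return F.normalize(torch.randn(EMB_DIM), dim=0)


def classify(clip: torch.Tensor) -> str:
    """Score each label by the mean cosine similarity between the video
    embedding and that label's multi-level descriptions, then pick the best."""
    v = encode_video(clip)
    scores = {}
    for label, descs in DESCRIPTIONS.items():
        t = encode_text(descs)                 # (levels, EMB_DIM)
        scores[label] = (t @ v).mean().item()  # average over description levels
    return max(scores, key=scores.get)


if __name__ == "__main__":
    dummy_clip = torch.randn(16, 3, 224, 224)  # 16 frames of 224x224 RGB
    print(classify(dummy_clip))  # arbitrary with stub encoders; shows the pipeline shape
```

With real pretrained encoders in place of the stubs, averaging similarities over the description levels is one plausible way to combine label-, action-, and context-level text, which is the general flavor of matching the abstract describes.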