    Details
    Presenter(s)
    Display Name
    Edwin Arkel Rios
    Affiliation
    National Yang Ming Chiao Tung University
    Author(s)
    Display Name
    Edwin Arkel Rios
    Affiliation
    National Yang Ming Chiao Tung University
    Display Name
    Min-Chun Hu
    Affiliation
    National Tsing Hua University
    Display Name
    Bo-Cheng Lai
    Affiliation
    National Yang Ming Chiao Tung University
    Abstract

    We study the task of anime character recognition. We propose a novel Intermediate Features Aggregation classification head for this task, which helps smooth the optimization landscape of Vision Transformers (ViTs) by adding skip connections between intermediate layers and the classification head, improving classification accuracy by up to 28% in relative terms. We conduct extensive experiments with a variety of classification models, and we also adapt Vision-Language Transformers (ViLT) to incorporate external tag data for classification without additional multimodal pre-training. Our results offer new insights into how hyperparameters such as input sequence length and mini-batch size, as well as variations in the architecture, affect this task.
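
The idea of connecting intermediate layers directly to the classification head can be illustrated with a minimal, dependency-free sketch. The abstract does not specify how the intermediate features are aggregated, so everything below is an assumption made for illustration: the class token of every layer is kept and concatenated before a single linear classifier, and `toy_encoder_layer` is a hypothetical stand-in for a real transformer block (it performs no attention). The names `ifa_head`, `forward`, and `linear` are likewise hypothetical, not the authors' implementation.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single vector x (plain lists of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def toy_encoder_layer(tokens, scale):
    """Hypothetical stand-in for a transformer block (no real attention):
    each token receives a residual update from the mean of all tokens."""
    mean = [sum(col) / len(tokens) for col in zip(*tokens)]
    return [[t + scale * m for t, m in zip(tok, mean)] for tok in tokens]


def ifa_head(cls_per_layer, weight, bias):
    """Aggregate one class token per layer, then classify.
    ASSUMPTION: aggregation is plain concatenation; the abstract
    does not state the rule actually used."""
    aggregated = [v for cls in cls_per_layer for v in cls]
    return linear(aggregated, weight, bias)


def forward(tokens, num_layers, weight, bias):
    """Run the toy encoder and feed every layer's class token to the head."""
    cls_per_layer = []
    for i in range(num_layers):
        tokens = toy_encoder_layer(tokens, scale=0.1 * (i + 1))
        # Skip connection: keep this layer's class token (index 0)
        # so the head sees intermediate features, not only the last layer.
        cls_per_layer.append(tokens[0])
    return ifa_head(cls_per_layer, weight, bias)


random.seed(0)
dim, num_layers, num_classes, seq_len = 4, 3, 5, 6
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
# The head's input width grows with the number of aggregated layers.
weight = [[random.gauss(0, 0.02) for _ in range(dim * num_layers)]
          for _ in range(num_classes)]
bias = [0.0] * num_classes

logits = forward(tokens, num_layers, weight, bias)
print(len(logits))  # one logit per class
```

In a standard ViT head only the final layer's class token reaches the classifier; here the gradient of the loss also flows straight into each intermediate layer through its stored class token, which is the kind of shortcut the abstract credits with smoothing optimization.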

    Slides
    • Anime Character Recognition Using Intermediate Features Aggregation (application/pdf)