Details
Presenter(s)
![Edwin Arkel Rios Headshot](https://confcats-catavault.s3.amazonaws.com/CATAVault/ieeecass/master/files/styles/cc_user_photo/s3/user-pictures/18052_0.jpg?h=04d92ac6&itok=nAaFfSOk)
Display Name
Edwin Arkel Rios
- Affiliation: National Yang Ming Chiao Tung University
- Country
Abstract
We study the task of anime character recognition. We propose a novel Intermediate Features Aggregation classification head for this task, which helps smooth the optimization landscape of Vision Transformers (ViTs) by adding skip connections between intermediate layers and the classification head, thereby improving relative classification accuracy by up to 28%. We conduct extensive experiments using a variety of classification models and also adapt Vision-Language Transformers (ViLT) to incorporate external tag data for classification, without additional multimodal pre-training. Our results present new insights into how hyperparameters such as input sequence length and mini-batch size, as well as variations on the architecture, affect this task.
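To make the idea of skip connections from intermediate layers to the classification head more concrete, here is a minimal sketch in plain Python. It is only one possible reading of the abstract and not the authors' implementation: the aggregation rule (averaging the class-token vector taken from each layer), the function names `aggregate_cls_tokens` and `ifa_head_sketch`, and the toy dimensions are all assumptions made for illustration.

```python
import random

def linear(weights, bias, x):
    """Dense layer: weights has shape (out_dim, in_dim), x has length in_dim."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def aggregate_cls_tokens(cls_tokens):
    """Average the per-layer class-token vectors (assumed aggregation rule)."""
    n_layers = len(cls_tokens)
    dim = len(cls_tokens[0])
    return [sum(tok[i] for tok in cls_tokens) / n_layers for i in range(dim)]

def ifa_head_sketch(cls_tokens, weights, bias):
    """Classify from features gathered across layers, not only the last one."""
    pooled = aggregate_cls_tokens(cls_tokens)
    return linear(weights, bias, pooled)

# Toy stand-in for a ViT: one class-token vector per encoder layer (hypothetical sizes).
rng = random.Random(0)
num_layers, dim, num_classes = 4, 8, 3
cls_tokens = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_layers)]
weights = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(num_classes)]
bias = [0.0] * num_classes

logits = ifa_head_sketch(cls_tokens, weights, bias)
predicted_class = max(range(num_classes), key=lambda c: logits[c])
```

Because every layer's class token feeds the head directly, the classification loss has a short path to each intermediate layer, which is the intuition behind the smoother optimization described in the abstract. A learned aggregation, such as concatenation followed by a projection, would be an equally plausible variant.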