NeurIPS 2021 — Curated papers — Part 1
UniDoc: Unified Pretraining Framework for Document Understanding

The authors propose a self-supervised framework for document understanding from a multi-modal point of view. Language pre-training with transformers has become extremely popular; in this work, the authors show how to do SSL with transformers by taking inputs from different modalities, namely image and text. UniDoc has four main steps:

1. Feature extraction: Given a document image I and the locations of its document elements, sentences and their corresponding bounding boxes are extracted using OCR.
2. Feature embedding: For each bounding box, features are extracted through a CNN backbone + RoIAlign and quantized using Gumbel-softmax (similar to Wav2Vec2); sentence embeddings are obtained from a pre-trained hierarchical transformer.
3. Gated cross-attention: One of the main ingredients of the work, where cross-modal interaction takes place between the text and visual embeddings through typical cross-attention ...
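The Gumbel-softmax quantization mentioned in the feature-embedding step can be sketched roughly as below. This is a minimal NumPy sketch, not the authors' implementation: the codebook size, temperature, and feature dimensions are illustrative assumptions, and the projection/codebook matrices would be learned in practice.

```python
import numpy as np

def gumbel_softmax(logits, tau=1.0, rng=None):
    """Soft (differentiable) sample from a categorical distribution.

    Adds Gumbel(0, 1) noise to the logits and applies a
    temperature-scaled softmax, as in Wav2Vec2-style quantization.
    """
    rng = rng or np.random.default_rng(0)
    # Gumbel(0, 1) noise: -log(-log(U)), U ~ Uniform(0, 1)
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)      # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)

def quantize(features, proj, codebook, tau=0.5):
    """Map continuous region features to codebook entries.

    features: (n, d_in) region features from CNN backbone + RoIAlign
    proj:     (d_in, K) projection to codebook logits (hypothetical)
    codebook: (K, d_code) code vectors (hypothetical)
    """
    logits = features @ proj                    # (n, K)
    probs = gumbel_softmax(logits, tau=tau)     # soft one-hot, (n, K)
    return probs @ codebook                     # convex mix of code vectors

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))   # 4 regions, 16-dim features (made up)
proj = rng.normal(size=(16, 8))    # K = 8 codebook entries (made up)
codes = rng.normal(size=(8, 32))   # 32-dim code vectors (made up)
q = quantize(feats, proj, codes)
print(q.shape)                     # (4, 32)
```

At low temperature tau the soft one-hot sharpens toward a hard codebook selection while remaining differentiable, which is the point of the Gumbel-softmax trick.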
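The gated cross-attention step can be sketched as a single-head cross-attention from text queries to visual keys/values, with a learned sigmoid gate controlling how much visual context is mixed back into the text stream. This is a minimal NumPy sketch under assumed dimensions; the per-token scalar gate used here is an assumption and may differ from the paper's exact gating mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_cross_attention(text, visual, Wq, Wk, Wv, Wg):
    """Cross-modal interaction: text attends over visual embeddings.

    text:   (n_t, d) text embeddings (queries)
    visual: (n_v, d) visual embeddings (keys and values)
    Wg:     (d, 1) gate projection (hypothetical)
    """
    q, k, v = text @ Wq, visual @ Wk, visual @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_t, n_v) weights
    context = attn @ v                              # visual context per token
    gate = sigmoid(text @ Wg)                       # (n_t, 1) per-token gate
    return text + gate * context                    # gated residual fusion

rng = np.random.default_rng(1)
d = 16
text = rng.normal(size=(5, d))     # 5 text tokens (illustrative)
visual = rng.normal(size=(3, d))   # 3 visual regions (illustrative)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Wg = rng.normal(size=(d, 1))
out = gated_cross_attention(text, visual, Wq, Wk, Wv, Wg)
print(out.shape)                   # (5, 16)
```

The gate lets the model suppress the visual context for tokens where it is uninformative, rather than always adding the full attention output as in plain cross-attention.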