NeurIPS 2021 — Curated papers — Part 1

UniDoc: Unified Pretraining Framework for Document Understanding

  1. Feature Extraction: Given a document image I and the locations of its document elements, sentences and their corresponding bounding boxes are extracted using OCR.
  2. Feature Embedding: For each bounding box, visual features are extracted with a CNN backbone + RoIAlign and quantized using Gumbel-softmax (similar to Wav2Vec2); sentence embeddings are extracted from pre-trained hierarchical transformers.
  3. Gated cross-attention: This is one of the main ingredients of the work. Cross-modal interaction takes place between the text and visual embeddings through a typical cross-attention mechanism, and gating is then used to combine the representations from both modalities. (The gate is simply a learned parameter alpha, between 0 and 1, that determines how the embeddings are combined.)
  4. Objective function: The objective has three main parts: a) masked sentence modelling (sentences are masked, rather than words as in BERT), b) contrastive learning over masked RoIs, c) vision-language alignment.
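
The gating in step 3 can be sketched as follows. This is a minimal illustration, not UniDoc's implementation: it assumes the gate is a single scalar alpha obtained by passing a learned logit through a sigmoid, and that the two embeddings are mixed as a convex combination. The names `sigmoid` and `gated_fuse` and the sample vectors are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Squash a real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse(text_vec, visual_vec, gate_logit):
    """Blend two same-length embeddings with a scalar gate alpha.

    Assumption: alpha weights the text embedding and (1 - alpha)
    weights the visual embedding. In a real model gate_logit would be
    a trained parameter; here it is passed in by hand.
    """
    if len(text_vec) != len(visual_vec):
        raise ValueError("embeddings must have the same dimension")
    alpha = sigmoid(gate_logit)
    return [alpha * t + (1.0 - alpha) * v for t, v in zip(text_vec, visual_vec)]


# Toy embeddings; a logit of 0 gives alpha = 0.5, i.e. an even mix.
text = [1.0, 0.0, 2.0]
visual = [0.0, 4.0, 2.0]
fused = gated_fuse(text, visual, gate_logit=0.0)
```

A large positive logit pushes alpha towards 1, so the fused vector follows the text embedding; a large negative logit makes it follow the visual embedding.
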
  1. Freeze all the layers of the language model.
  2. Train a vision encoder that takes image I. The output of its pooling layer, which has D * K channels, is fed as a sequence of K embeddings to the pre-trained language transformer as a prefix.
  3. Since the transformer layers are frozen, the gradients that flow back through them update only the vision encoder, with training done in an auto-regressive way.
  4. The image and the first part of the caption are given as input, and the label is the remaining part of the caption.
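
Steps 2 and 4 can be illustrated with a small sketch of how the inputs might be assembled. It assumes the pooled output is a flat vector of length D * K that is split into K consecutive chunks of size D, and that the caption is split at a fixed token position. `to_prefix`, `split_caption`, and the toy values are hypothetical and not taken from the paper.

```python
def to_prefix(pooled, k, d):
    """Split a flat vector of length d * k into k embeddings of size d."""
    if len(pooled) != d * k:
        raise ValueError("pooled vector must have length d * k")
    return [pooled[i * d:(i + 1) * d] for i in range(k)]


def split_caption(tokens, n_given):
    """Return (tokens fed as input, tokens used as the label)."""
    return tokens[:n_given], tokens[n_given:]


# Toy values: D = 3, K = 2, so the pooled vector has 6 entries.
pooled = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
prefix = to_prefix(pooled, k=2, d=3)

# The prefix embeddings would be placed before the embeddings of `given`,
# and the frozen language model would be asked to predict `label`.
caption = ["a", "small", "red", "boat"]
given, label = split_caption(caption, n_given=2)
```
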
 

 

 https://papers.nips.cc/paper/2021/file/01b7575c38dac42f3cfb7d500438b875-Paper.pdf

  1. The third term of the equation reduces the difference in counterfactuals between the protected and non-protected groups on the original data.
  2. The second term pushes the recourse cost for the non-protected group on perturbed input.
    Referring to the above example, a small change in the input for the non-protected group (men's age) should not result in a different explanation, which brings fairness to the recourse cost of the protected group.
