Multi-Modal Information Extraction from Academic Resumes

May 10, 2023 · 1 min read

This project addresses the challenge of extracting structured information from academic resumes, which often span multiple pages and contain complex, domain-specific content. Our approach combines document layout analysis with sequence tagging: layout analysis segments each resume into sections, and sequence tagging extracts the key information from each section.

Key aspects of this research include:

  • Utilizing Document-Image-Transformer (DiT) for title detection and resume sectioning
  • Implementing BERT-based sequence tagging models for information extraction from specific sections (education, employment, publications)
  • Creating a labeled dataset of 30+ academic resumes (250+ pages) for model training and evaluation
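
To make the two-stage idea concrete, here is a minimal sketch of the post-processing around the models, not the project's actual code. It assumes the title detector has already returned the indices of the lines it classified as section titles, and that the tagger emits one BIO tag per token. The function names (`split_into_sections`, `decode_bio`) and the label names (`DEGREE`, `INSTITUTION`, `YEAR`) are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def split_into_sections(lines: List[str], title_indices: List[int]) -> Dict[str, List[str]]:
    """Group text lines under the detected section title that precedes them.

    `title_indices` stands in for the output of a title detector (assumed input).
    If the same title appears twice, the later section overwrites the earlier one.
    """
    sections: Dict[str, List[str]] = {}
    ordered = sorted(title_indices)
    for pos, start in enumerate(ordered):
        # A section runs until the next title, or to the end of the document.
        end = ordered[pos + 1] if pos + 1 < len(ordered) else len(lines)
        title = lines[start].strip().lower()
        sections[title] = lines[start + 1:end]
    return sections


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse per-token BIO tags into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label = None
    buffer: List[str] = []

    def flush() -> None:
        # Emit the span collected so far, if any.
        if label is not None:
            spans.append((label, " ".join(buffer)))

    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_new:
            flush()
            label, buffer = tag[2:], [token]
        elif tag.startswith("I-"):
            buffer.append(token)
        else:  # "O" or any unrecognised tag closes the current span
            flush()
            label, buffer = None, []
    flush()
    return spans


if __name__ == "__main__":
    lines = ["Education", "PhD, Example University, 2020", "Employment", "Postdoc, Example Lab"]
    print(split_into_sections(lines, title_indices=[0, 2]))
    print(decode_bio(
        ["PhD", "Example", "University", "2020"],
        ["B-DEGREE", "B-INSTITUTION", "I-INSTITUTION", "B-YEAR"],
    ))
```

Splitting by section first lets each section (education, employment, publications) be handled by a tagger with its own label set, which is the motivation for the section-specific models listed above.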