Multi-Modal Information Extraction from Academic Resumes
May 10, 2023
This project addresses the challenge of extracting structured information from academic resumes, which often span multiple pages and contain complex, domain-specific content. We developed a novel approach combining document layout analysis and sequence tagging to accurately segment and extract key information from various resume sections.
Key aspects of this research include:
- Using the Document Image Transformer (DiT) for title detection and resume sectioning
- Implementing BERT-based sequence tagging models for information extraction from specific sections (education, employment, publications)
- Creating a labeled dataset of 30+ academic resumes (250+ pages) for model training and evaluation
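
The two stages listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's actual code: `split_into_sections` and `decode_bio` are hypothetical helper names, the title positions stand in for what a layout model such as DiT might return, and the BIO tagging scheme and label names (`DEGREE`, `FIELD`, `ORG`) are assumed, since the write-up does not specify them. The sample resume lines are invented.

```python
from typing import Dict, List, Optional, Tuple


def split_into_sections(lines: List[str], title_indices: List[int]) -> Dict[str, List[str]]:
    """Group resume lines under the most recently seen section title.

    `title_indices` is a stand-in for the output of a layout model:
    positions in `lines` that were flagged as section titles (assumption).
    """
    titles = set(title_indices)
    sections: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for i, line in enumerate(lines):
        if i in titles:
            current = line.strip().lower()
            sections[current] = []
        elif current is not None:
            # Lines before the first title (e.g. the name header) are skipped.
            sections[current].append(line)
    return sections


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collapse per-token BIO tags into (label, text) spans."""
    spans: List[Tuple[str, str]] = []
    label: Optional[str] = None
    words: List[str] = []
    for token, tag in zip(tokens, tags):
        starts_span = tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != label)
        if starts_span:
            # Close any open span; a stray I- tag is treated leniently as a new span.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-"):
            words.append(token)
        else:
            # An "O" tag ends the current span, if there is one.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Hypothetical example input; in practice the title positions and the tags
# would come from the layout model and the sequence tagger respectively.
resume_lines = [
    "Jane Doe",
    "Education",
    "PhD in Computer Science, Example University",
    "Employment",
    "Professor, Example University",
]
sections = split_into_sections(resume_lines, title_indices=[1, 3])

tokens = ["PhD", "in", "Computer", "Science", ",", "Example", "University"]
tags = ["B-DEGREE", "O", "B-FIELD", "I-FIELD", "O", "B-ORG", "I-ORG"]
fields = decode_bio(tokens, tags)
```

Splitting by section first means each tagger only sees text from the section it was trained on, which is one plausible reason for pairing layout analysis with section-specific sequence tagging.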