Multi-Modal Information Extraction from Academic Resumes

This project addresses the challenge of extracting structured information from academic resumes, which often span multiple pages and contain complex, domain-specific content. Our approach combines document layout analysis, which segments a resume into its sections, with sequence tagging, which extracts key fields from each section.

Key aspects of this research include:

Using Document Image Transformer (DiT) to detect section titles and split resumes into sections
Implementing BERT-based sequence tagging models for information extraction from specific sections (education, employment, publications)
Creating a labeled dataset of 30+ academic resumes (250+ pages) for model training and evaluation
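
The two stages listed above can be illustrated with a minimal Python sketch. It is not the project's actual code: it assumes a layout detector such as DiT has already flagged which text lines are section titles, and that a BERT-based tagger has already produced one BIO tag per token. The `Line` class, the `"header"` section name, the tag labels (`DEGREE`, `FIELD`, `INSTITUTION`, `YEAR`), and the sample resume content are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Line:
    page: int        # zero-based page index
    y: float         # vertical position of the line on its page
    text: str
    is_title: bool   # assumed to come from a layout detector such as DiT


def split_into_sections(lines):
    """Group lines under the most recent detected title, in reading order."""
    sections = {}
    current = "header"  # hypothetical name for text before the first title
    for line in sorted(lines, key=lambda l: (l.page, l.y)):
        if line.is_title:
            current = line.text.strip().lower()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line.text)
    return sections


def decode_bio(tokens, tags):
    """Collapse per-token BIO tags into (label, text) spans."""
    spans = []
    label, words = None, []
    for token, tag in zip(tokens, tags):
        starts_span = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_span:
            # Close any open span, then open a new one.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = tag[2:], [token]
        elif tag.startswith("I-"):
            words.append(token)
        else:
            # An "O" tag closes any open span.
            if label is not None:
                spans.append((label, " ".join(words)))
            label, words = None, []
    if label is not None:
        spans.append((label, " ".join(words)))
    return spans


# Made-up sample input standing in for detector and tagger output.
sample_lines = [
    Line(0, 10.0, "Jane Doe", False),
    Line(0, 50.0, "Education", True),
    Line(0, 70.0, "PhD in Computer Science, Example University, 2020", False),
    Line(1, 15.0, "Publications", True),
    Line(1, 30.0, "A paper title. Some Venue, 2021.", False),
]
sections = split_into_sections(sample_lines)

tokens = "PhD in Computer Science , Example University , 2020".split()
tags = ["B-DEGREE", "O", "B-FIELD", "I-FIELD", "O",
        "B-INSTITUTION", "I-INSTITUTION", "O", "B-YEAR"]
entities = decode_bio(tokens, tags)
```

Sorting by page and vertical position is what lets a section continue across a page break, which matters for multi-page resumes. A real pipeline would also need to handle multi-column layouts and map subword tokens back to words, both of which this sketch leaves out.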

Last updated on Oct 7, 2024