
This Python script automates the extraction of candidate names and email addresses from resume files (.pdf, .docx, .doc) stored in a Google Drive folder. It leverages the Google Gemini Vision API (gemini-1.5-flash-latest) for analysis by converting the first page of each resume into an image and then processing it with a structured prompt. The extracted information is saved into a CSV file.
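The script's actual prompt and response handling are not reproduced in this README. As a rough sketch, assuming the prompt asks the model for a JSON object with `name` and `email` keys (hypothetical key names), the reply could be parsed as below. The helper names `parse_llm_reply` and `ask_gemini_about_image` are illustrative and not taken from the script.

```python
import json
import re


def parse_llm_reply(reply_text):
    """Pull 'name' and 'email' out of a model reply that should contain JSON.

    Assumption: the prompt asked for a JSON object with "name" and "email"
    keys. Models sometimes wrap JSON in Markdown fences or add extra text,
    so we take the span from the first '{' to the last '}' and parse that.
    """
    empty = {"name": None, "email": None}
    match = re.search(r"\{.*\}", reply_text, flags=re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    return {"name": data.get("name"), "email": data.get("email")}


def ask_gemini_about_image(image_path, api_key, prompt):
    """Send one page image plus a prompt to Gemini and parse the reply."""
    # Third-party imports stay local so parse_llm_reply works without them.
    import google.generativeai as genai
    from PIL import Image

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash-latest")
    response = model.generate_content([prompt, Image.open(image_path)])
    return parse_llm_reply(response.text)
```

Keeping the parser separate from the API call makes it possible to check the parsing logic against saved replies without spending API quota.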
Features
Core Function Overview
convert_pdf_to_image_files(): Converts the first page of a PDF into a PNG image.
convert_docx_to_pdf(): Converts .docx files to PDFs using pandoc.
convert_doc_to_pdf(): Converts .doc files to PDFs using LibreOffice.
ask_llm_for_details_gemini_vision(): Sends the image to Gemini Vision API and extracts name and email.
extract_resume_info(): Manages a single resume's end-to-end processing.
process_resume_directory(): The controller function that processes the entire directory.
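The routing between these converters can be pictured as a choice by file extension. The sketch below is an illustration only: `build_conversion_command` is a hypothetical helper, and it assumes the tools are invoked directly from the command line, whereas the script itself may call pandoc through pypandoc.

```python
from pathlib import Path


def build_conversion_command(input_path, output_dir):
    """Return the external command (as an argument list) that would turn
    input_path into a PDF, or None when the file is already a PDF.

    Raises ValueError for unsupported extensions.
    """
    path = Path(input_path)
    out_dir = Path(output_dir)
    suffix = path.suffix.lower()

    if suffix == ".pdf":
        return None  # already a PDF, nothing to convert
    if suffix == ".docx":
        # pandoc infers PDF output from the .pdf extension; it also needs
        # a PDF engine (for example a LaTeX installation) to be available.
        return ["pandoc", str(path), "-o", str(out_dir / (path.stem + ".pdf"))]
    if suffix == ".doc":
        # LibreOffice headless conversion writes <stem>.pdf into --outdir.
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", str(out_dir), str(path)]
    raise ValueError(f"Unsupported resume format: {suffix or '(none)'}")
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command separately from running it keeps the routing logic easy to inspect.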
Prerequisites
Python 3.x
Google Account (for Google Drive and Colab access).
Google AI Gemini API Key (Get it here).
Google Colab environment (recommended).
External Tools:
poppler-utils for PDF conversion.
pandoc for DOCX to PDF.
libreoffice for DOC to PDF.
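Before processing a folder, it can be useful to confirm that these tools are actually on the PATH. The following is a minimal sketch. The executable names are assumptions: poppler-utils provides `pdftoppm`, and LibreOffice is usually started as `soffice`. `find_missing_tools` is a hypothetical helper, not part of the script.

```python
import shutil

# Package name -> executable expected on PATH (assumed names, see above).
REQUIRED_TOOLS = {
    "poppler-utils": "pdftoppm",
    "pandoc": "pandoc",
    "libreoffice": "soffice",
}


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the package names whose executable is not found on PATH."""
    return [package for package, exe in tools.items()
            if shutil.which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing external tools:", ", ".join(missing))
    else:
        print("All external tools found.")
```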
Installation (Google Colab)
Install Python dependencies:
!pip install google-generativeai PyMuPDF pdf2image pypandoc Pillow python-docx
Install system dependencies:
!apt-get update
!apt-get install -y poppler-utils pandoc libreoffice
Mount Google Drive:
from google.colab import drive
drive.mount('/content/drive')
Configuration
GEMINI_API_KEY: Replace "YOUR_GEMINI_API_KEY_HERE" with your real API key.
resume_directory: Set to the path of your resume folder in Google Drive.
output_csv_file: Define where you want to store the extracted data CSV.
IMAGE_OUTPUT_DIR: Optional. Default is resume_page_images.
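A configuration block along these lines might look as follows. This is a sketch under assumptions: the Drive paths are hypothetical examples, reading the key from an environment variable is a substitute for editing the placeholder in the script, and `check_config` is an illustrative helper that does not exist in the script.

```python
import os

# Reading the key from the environment avoids storing it in the notebook.
# The fallback is the placeholder string that must otherwise be replaced.
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "YOUR_GEMINI_API_KEY_HERE")

# Hypothetical paths; adjust them to your own Drive layout.
resume_directory = "/content/drive/MyDrive/resumes"
output_csv_file = "/content/drive/MyDrive/resume_results.csv"
IMAGE_OUTPUT_DIR = "resume_page_images"  # default named in this README


def check_config(api_key, resume_dir):
    """Return a list of human-readable configuration problems (empty if none)."""
    problems = []
    if not api_key or api_key == "YOUR_GEMINI_API_KEY_HERE":
        problems.append("GEMINI_API_KEY is still the placeholder")
    if not os.path.isdir(resume_dir):
        problems.append(f"resume_directory does not exist: {resume_dir}")
    return problems
```

Calling `check_config(GEMINI_API_KEY, resume_directory)` before the main run surfaces a missing key or an unmounted Drive early, rather than partway through a batch.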
Usage Steps
Open your script in Google Colab.
Install dependencies.
Configure your API key, resume_directory, and output_csv_file.
Mount Google Drive.
Run the script and monitor the logs.
Review the output CSV for results.
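Before running over a full Drive folder, it can help to see which files would be picked up. The sketch below lists files with the three supported extensions. It assumes a non-recursive scan of the top-level folder, which may differ from what process_resume_directory() does, and `list_resume_files` is a hypothetical helper.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc"}


def list_resume_files(directory):
    """Return supported resume files directly inside directory, sorted by path."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

For example, `for path in list_resume_files(resume_directory): print(path.name)` prints the candidates without calling the API.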
Output CSV Format
Filename | Extracted Name | Extracted Email | Raw LLM Response | Processing Error |
---|---|---|---|---|
Example.pdf | John Doe | john@example.com | {JSON} | None |
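A file with these columns can be written and inspected with Python's csv module. The following is a minimal sketch, not the script's own code: `write_results` and `rows_with_errors` are hypothetical helpers, and it assumes that a literal "None" or an empty cell in the Processing Error column means success.

```python
import csv
import io

CSV_COLUMNS = ["Filename", "Extracted Name", "Extracted Email",
               "Raw LLM Response", "Processing Error"]


def write_results(rows, stream):
    """Write result dicts to an open text stream, one CSV row per resume."""
    writer = csv.DictWriter(stream, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    for row in rows:
        # Missing keys become empty cells so every row has all columns.
        writer.writerow({col: row.get(col, "") for col in CSV_COLUMNS})


def rows_with_errors(stream):
    """Return rows whose 'Processing Error' cell is neither empty nor 'None'."""
    return [row for row in csv.DictReader(stream)
            if row["Processing Error"] not in ("", "None")]


if __name__ == "__main__":
    buffer = io.StringIO()
    write_results([{"Filename": "Example.pdf", "Extracted Name": "John Doe",
                    "Extracted Email": "john@example.com",
                    "Raw LLM Response": "{}", "Processing Error": "None"}],
                  buffer)
    print(buffer.getvalue())
```

When writing to a real file, open it with `open(output_csv_file, "w", newline="", encoding="utf-8")` so the csv module controls line endings.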