Mayvel

This Python script automates the extraction of candidate names and email addresses from resume files (.pdf, .docx, .doc) stored in a Google Drive folder. It converts the first page of each resume into an image, sends that image to the Google Gemini Vision API (gemini-1.5-flash-latest) with a structured prompt, and saves the extracted information to a CSV file.

Features

Processes Multiple Formats

Supports PDF, DOCX, and DOC resume files.

Google Drive Integration

Reads resumes directly from a specified Google Drive directory (ideal for Google Colab).

Automated Conversion

Converts .docx files to PDF using pandoc.
Converts .doc files to PDF using LibreOffice (soffice).
Renders the first page of each PDF (original or converted) as a PNG image.
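
The conversion steps above could be sketched roughly as follows. This is an illustration, not the script's actual code: the helper names build_conversion_command and first_page_to_png are hypothetical, and the pandoc route assumes a PDF engine (such as a LaTeX distribution) is installed alongside pandoc.

```python
from pathlib import Path
from typing import List, Optional


def build_conversion_command(source: Path, out_dir: Path) -> Optional[List[str]]:
    """Return the external command that converts `source` to PDF, or None for PDFs."""
    suffix = source.suffix.lower()
    target = out_dir / (source.stem + ".pdf")
    if suffix == ".pdf":
        return None  # already a PDF, nothing to convert
    if suffix == ".docx":
        # pandoc needs a PDF engine available to produce PDF output
        return ["pandoc", str(source), "-o", str(target)]
    if suffix == ".doc":
        # LibreOffice writes <stem>.pdf into the --outdir directory
        return ["soffice", "--headless", "--convert-to", "pdf",
                "--outdir", str(out_dir), str(source)]
    raise ValueError(f"Unsupported resume format: {suffix}")


def first_page_to_png(pdf_path: Path, png_path: Path, dpi: int = 200) -> Path:
    """Render only page 1 of a PDF to PNG with pdf2image (requires poppler-utils)."""
    from pdf2image import convert_from_path  # third-party, imported lazily

    pages = convert_from_path(str(pdf_path), dpi=dpi, first_page=1, last_page=1)
    pages[0].save(png_path, "PNG")
    return png_path
```

The returned command list can be passed to subprocess.run(cmd, check=True) so that a failed conversion raises an exception instead of passing silently.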

AI-Powered Extraction

Uses gemini-1.5-flash-latest model via the Google Generative AI API to extract the name and email from the image.
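
A minimal sketch of what the extraction call might look like with the google-generativeai package. The prompt wording, the JSON keys name and email, and the helper names are assumptions made for illustration; the script's real prompt is not shown in this document.

```python
import json
import re
from typing import Dict, Optional

# Hypothetical prompt; the actual structured prompt used by the script may differ.
PROMPT = (
    "Extract the candidate's full name and email address from this resume image. "
    'Reply with JSON only, for example {"name": "...", "email": "..."}.'
)


def parse_llm_json(raw: str) -> Dict[str, Optional[str]]:
    """Pull the first JSON object out of a model reply, tolerating extra text."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if not match:
        return {"name": None, "email": None}
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {"name": None, "email": None}
    return {"name": data.get("name"), "email": data.get("email")}


def ask_gemini(image_path: str, api_key: str) -> Dict[str, Optional[str]]:
    """Send one page image to Gemini and return the parsed name and email."""
    import google.generativeai as genai  # third-party, imported lazily
    from PIL import Image

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel("gemini-1.5-flash-latest")
    response = model.generate_content([PROMPT, Image.open(image_path)])
    return parse_llm_json(response.text)
```

Keeping the parsing separate from the API call makes it possible to check the parser without network access or an API key.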

Structured Output

Stores extracted data into a CSV file.

Error Handling

Logs issues during processing and API calls.

Temporary File Cleanup

Cleans up intermediate files after processing.
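
One way to get this kind of cleanup is to do the intermediate work inside a temporary directory that is deleted automatically. The sketch below is an assumption about the approach, not the script's confirmed implementation, and with_scratch_dir is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def with_scratch_dir(work):
    """Run `work(scratch_dir)` and delete the scratch directory afterwards.

    The directory and everything in it (converted PDFs, page images)
    is removed even if `work` raises an exception.
    """
    with tempfile.TemporaryDirectory(prefix="resume_tmp_") as tmp:
        return work(Path(tmp))
```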

Core Function Overview

convert_pdf_to_image_files(): Converts the first page of a PDF into a PNG image.
convert_docx_to_pdf(): Converts .docx files to PDFs using pandoc.
convert_doc_to_pdf(): Converts .doc files to PDFs using LibreOffice.
ask_llm_for_details_gemini_vision(): Sends the image to Gemini Vision API and extracts name and email.
extract_resume_info(): Manages a single resume's end-to-end processing.
process_resume_directory(): The controller function that processes the entire directory.
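
To illustrate how a controller function might tie these pieces together, here is a simplified sketch. It is not the script's actual process_resume_directory(): the name process_directory and the extract_one callback are hypothetical stand-ins, and the column names are taken from the Output CSV Format section.

```python
import logging
from pathlib import Path
from typing import Callable, Dict, List

logger = logging.getLogger("resume_extractor")
SUPPORTED = {".pdf", ".docx", ".doc"}


def process_directory(
    directory: Path,
    extract_one: Callable[[Path], Dict[str, str]],
) -> List[Dict[str, str]]:
    """Run `extract_one` on every supported resume and collect one row per file."""
    rows = []
    for path in sorted(directory.iterdir()):
        if path.suffix.lower() not in SUPPORTED:
            continue  # skip files that are not resumes
        row = {"Filename": path.name, "Processing Error": "None"}
        try:
            row.update(extract_one(path))
        except Exception as exc:
            # Log the failure and record it, then continue with the next file
            logger.exception("Failed to process %s", path.name)
            row["Processing Error"] = str(exc)
        rows.append(row)
    return rows
```

Catching errors per file means one corrupt resume does not stop the rest of the directory from being processed.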

Prerequisites

Python 3.x
Google Account (for Google Drive and Colab access).
Google AI Gemini API key (available from Google AI Studio).
Google Colab environment (recommended).
External Tools:
1. poppler-utils for PDF conversion.
2. pandoc for DOCX to PDF.
3. libreoffice for DOC to PDF.
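
Before a run, it can help to confirm that the external tools are on the PATH. The check below is a hedged sketch: missing_tools is a hypothetical helper, and the executable names (pdftoppm from poppler-utils, soffice from LibreOffice) are the ones these packages normally install.

```python
import shutil
from typing import Dict, List

# Executable to look for -> package that provides it
REQUIRED_TOOLS: Dict[str, str] = {
    "pdftoppm": "poppler-utils",
    "pandoc": "pandoc",
    "soffice": "libreoffice",
}


def missing_tools(which=shutil.which) -> List[str]:
    """Return the packages whose executables cannot be found on the PATH."""
    return [pkg for exe, pkg in REQUIRED_TOOLS.items() if which(exe) is None]
```

Calling missing_tools() with no arguments checks the real PATH; an empty list means all three tools were found.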

Installation (Google Colab)

Install Python dependencies:
1. !pip install google-generativeai PyMuPDF pdf2image pypandoc Pillow python-docx

Install system dependencies:
1. !apt-get update
2. !apt-get install -y poppler-utils pandoc libreoffice

Mount Google Drive:
1. from google.colab import drive
2. drive.mount('/content/drive')

Configuration

GEMINI_API_KEY: Replace "YOUR_GEMINI_API_KEY_HERE" with your real API key.
resume_directory: Set to the path of your resume folder in Google Drive.
output_csv_file: Define where you want to store the extracted data CSV.
IMAGE_OUTPUT_DIR: Optional. Default is resume_page_images.
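
As an example, the configuration block might look something like this. The Drive paths are hypothetical placeholders, and reading the key from an environment variable and the check_config helper are suggestions rather than something the script is known to do.

```python
import os

# Prefer an environment variable so the key is not stored in the notebook
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", "YOUR_GEMINI_API_KEY_HERE")

# Hypothetical paths; adjust them to your own Google Drive layout
resume_directory = "/content/drive/MyDrive/resumes"
output_csv_file = "/content/drive/MyDrive/resume_extraction_results.csv"

# Optional; this is the documented default
IMAGE_OUTPUT_DIR = "resume_page_images"


def check_config(api_key: str) -> None:
    """Fail early if the placeholder API key was never replaced."""
    if not api_key or api_key == "YOUR_GEMINI_API_KEY_HERE":
        raise ValueError("Set GEMINI_API_KEY before running the script.")
```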

Usage Steps

Open your script in Google Colab.
Install dependencies.
Configure your API key, resume_directory, and output_csv_file.
Mount Google Drive.
Run the script and monitor the logs.
Review the output CSV for results.

Output CSV Format

Each row contains the following columns (example values shown):

Filename: Example.pdf
Extracted Name: John Doe
Extracted Email: john@example.com
Raw LLM Response: {JSON}
Processing Error: None
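
A CSV with these columns could be written with the standard csv module, as in the sketch below. The column names come from this section; the write_results helper is hypothetical and not necessarily how the script writes its output.

```python
import csv
from typing import Dict, Iterable

FIELDNAMES = [
    "Filename",
    "Extracted Name",
    "Extracted Email",
    "Raw LLM Response",
    "Processing Error",
]


def write_results(csv_path: str, rows: Iterable[Dict[str, str]]) -> None:
    """Write one CSV row per resume; missing columns are left empty."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, restval="")
        writer.writeheader()
        writer.writerows(rows)
```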
