Resume Information Extractor using Gemini Vision Model

This Python script automates the extraction of candidate names and email addresses from resume files (.pdf, .docx, .doc) stored in a Google Drive folder. It converts the first page of each resume into an image, sends that image to the Google Gemini Vision API (gemini-1.5-flash-latest) with a structured prompt, and saves the extracted information to a CSV file.

Features

Processes Multiple Formats

Supports PDF, DOCX, and DOC resume files.

Google Drive Integration

Reads resumes directly from a specified Google Drive directory (ideal for Google Colab).

Automated Conversion
  1. .docx files to PDF using pandoc.

  2. .doc files to PDF using LibreOffice (soffice).

  3. First page of PDFs (original or converted) into PNG images.
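
The three conversion steps above could look roughly like the sketch below. The helper names, the DPI, and the timeout are assumptions for illustration, and the actual script may call pypandoc rather than the pandoc command line. Note that pandoc needs a PDF engine (for example a LaTeX distribution) to write PDF output.

```python
import subprocess
from pathlib import Path


def build_docx_to_pdf_command(docx_path, pdf_path):
    """Command line for pandoc: read the .docx file, write a PDF."""
    return ["pandoc", str(docx_path), "-o", str(pdf_path)]


def build_doc_to_pdf_command(doc_path, output_dir):
    """Command line for headless LibreOffice; the PDF is written to output_dir."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(output_dir), str(doc_path)]


def run_conversion(command, timeout_seconds=180):
    """Run a converter and return True on success instead of raising."""
    try:
        result = subprocess.run(command, capture_output=True, text=True,
                                timeout=timeout_seconds)
    except (OSError, subprocess.TimeoutExpired):
        # Tool missing, not executable, or it hung past the timeout.
        return False
    return result.returncode == 0


def convert_first_page_to_png(pdf_path, image_dir, dpi=200):
    """Render only page 1 of the PDF and save it as a PNG."""
    from pdf2image import convert_from_path  # third-party; needs poppler-utils

    Path(image_dir).mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(str(pdf_path), dpi=dpi, first_page=1, last_page=1)
    if not pages:
        return None
    image_path = Path(image_dir) / f"{Path(pdf_path).stem}_page1.png"
    pages[0].save(image_path, "PNG")
    return image_path
```

Building the command separately from running it makes the conversion easy to log and to check without invoking the external tools.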

AI-Powered Extraction

Uses the gemini-1.5-flash-latest model via the Google Generative AI API to extract the candidate's name and email address from the page image.
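
As a rough illustration of this step, the sketch below sends a page image and a prompt to the model through the google-generativeai package and parses the reply. The prompt wording, the function names, and the JSON keys "name" and "email" are assumptions; the script's real prompt is not reproduced in this README.

```python
import json
import re

# Hypothetical prompt; the script's actual wording may differ.
EXTRACTION_PROMPT = (
    "You are given an image of the first page of a resume. "
    'Return only a JSON object with the keys "name" and "email". '
    "Use null for any value you cannot find."
)


def parse_name_and_email(raw_text):
    """Extract the JSON object (first '{' to last '}') from the model reply."""
    empty = {"name": None, "email": None}
    match = re.search(r"\{.*\}", raw_text or "", re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(0))
    except json.JSONDecodeError:
        return empty
    return {"name": data.get("name"), "email": data.get("email")}


def ask_gemini_for_details(image_path, api_key,
                           model_name="gemini-1.5-flash-latest"):
    """Send the page image plus the prompt to Gemini and parse the reply."""
    import google.generativeai as genai  # third-party: google-generativeai
    from PIL import Image                # third-party: Pillow

    genai.configure(api_key=api_key)
    model = genai.GenerativeModel(model_name)
    with Image.open(image_path) as page_image:
        response = model.generate_content([EXTRACTION_PROMPT, page_image])
    raw_text = response.text
    return parse_name_and_email(raw_text), raw_text
```

Keeping the parser separate from the API call means a reply wrapped in extra text or a code fence still yields a result, and a malformed reply produces empty fields rather than an exception.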

Structured Output

Stores extracted data into a CSV file.

Error Handling

Logs issues during processing and API calls.

Temporary File Cleanup

Cleans up intermediate files after processing.
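
The script's own cleanup code is not shown here. One common way to get this behavior, assuming intermediate files can live in a scratch directory, is to let tempfile manage their lifetime. The function name below is hypothetical.

```python
import tempfile
from pathlib import Path


def run_with_scratch_dir(step):
    """Run step(scratch_dir) and delete the directory afterwards.

    TemporaryDirectory removes the folder and everything in it when the
    with-block exits, including when step raises an exception.
    """
    with tempfile.TemporaryDirectory(prefix="resume_tmp_") as scratch:
        return step(Path(scratch))
```

A caller would write converted PDFs and page images into the directory passed to `step`, and nothing is left behind once the resume has been processed.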

Core Function Overview

  1. convert_pdf_to_image_files(): Converts the first page of a PDF into a PNG image.

  2. convert_docx_to_pdf(): Converts .docx files to PDFs using pandoc.

  3. convert_doc_to_pdf(): Converts .doc files to PDFs using LibreOffice.

  4. ask_llm_for_details_gemini_vision(): Sends the image to Gemini Vision API and extracts name and email.

  5. extract_resume_info(): Manages a single resume's end-to-end processing.

  6. process_resume_directory(): The controller function that processes the entire directory.
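
To show how the per-file and directory-level functions might fit together, here is a minimal sketch of the control flow. It is not the script's code: `process_resume_files` and `extract_one` are hypothetical names, and `extract_one` is assumed to return a pair of (details dict, raw response text) for one file.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".doc"}


def process_resume_files(paths, extract_one):
    """Run extract_one on each supported file and collect one row per file."""
    rows = []
    for path in paths:
        path = Path(path)
        if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue  # skip anything that is not a resume format
        row = {"Filename": path.name, "Extracted Name": None,
               "Extracted Email": None, "Raw LLM Response": None,
               "Processing Error": None}
        try:
            details, raw_text = extract_one(path)
            row["Extracted Name"] = details.get("name")
            row["Extracted Email"] = details.get("email")
            row["Raw LLM Response"] = raw_text
        except Exception as error:
            # One bad file should not stop the batch; record it and move on.
            logger.exception("Failed to process %s", path.name)
            row["Processing Error"] = str(error)
        rows.append(row)
    return rows
```

For a Drive folder, the paths could come from `sorted(Path(resume_directory).iterdir())`.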

Prerequisites

  1. Python 3.x

  2. Google Account (for Google Drive and Colab access).

  3. Google AI Gemini API key (created in Google AI Studio).

  4. Google Colab environment (recommended).

  5. External Tools:

    1. poppler-utils for PDF-to-image conversion.

    2. pandoc for DOCX-to-PDF conversion.

    3. libreoffice for DOC-to-PDF conversion.

Installation (Google Colab)

  1. Install Python dependencies:

    !pip install google-generativeai PyMuPDF pdf2image pypandoc Pillow python-docx

  2. Install system dependencies:

    !apt-get update
    !apt-get install -y poppler-utils pandoc libreoffice

  3. Mount Google Drive:

    from google.colab import drive
    drive.mount('/content/drive')

Configuration

  1. GEMINI_API_KEY: Replace "YOUR_GEMINI_API_KEY_HERE" with your real API key.

  2. resume_directory: Set to the path of your resume folder in Google Drive.

  3. output_csv_file: Define where you want to store the extracted data CSV.

  4. IMAGE_OUTPUT_DIR: Optional. Default is resume_page_images.
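
A configuration cell along these lines is one possibility. The variable names come from the list above; the Drive paths, the environment-variable lookup, and the `find_config_problems` helper are assumptions added for illustration.

```python
import os
from pathlib import Path

PLACEHOLDER_KEY = "YOUR_GEMINI_API_KEY_HERE"

# Reading the key from the environment avoids saving it in the notebook.
GEMINI_API_KEY = os.environ.get("GEMINI_API_KEY", PLACEHOLDER_KEY)
resume_directory = "/content/drive/MyDrive/resumes"            # hypothetical path
output_csv_file = "/content/drive/MyDrive/resume_results.csv"  # hypothetical path
IMAGE_OUTPUT_DIR = "resume_page_images"                        # documented default


def find_config_problems(api_key, resume_dir):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if not api_key or api_key == PLACEHOLDER_KEY:
        problems.append("GEMINI_API_KEY is not set")
    if not Path(resume_dir).is_dir():
        problems.append(f"resume_directory does not exist: {resume_dir}")
    return problems
```

Calling `find_config_problems(GEMINI_API_KEY, resume_directory)` before processing catches a forgotten key or an unmounted Drive early, before any API calls are made.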

Usage Steps

  1. Open your script in Google Colab.

  2. Install dependencies.

  3. Configure your API key, resume_directory, and output_csv_file.

  4. Mount Google Drive.

  5. Run the script and monitor the logs.

  6. Review the output CSV for results.

Output CSV Format

Column             Example
Filename           Example.pdf
Extracted Name     John Doe
Extracted Email    john@example.com
Raw LLM Response   {JSON}
Processing Error   None
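
A file with these columns can be produced with the standard csv module. The sketch below is an assumption about how the rows might be written, not the script's actual code; the function names and the example row are illustrative.

```python
import csv
import io

CSV_COLUMNS = ["Filename", "Extracted Name", "Extracted Email",
               "Raw LLM Response", "Processing Error"]


def write_results(rows, stream):
    """Write the header and one line per resume to an open text stream."""
    writer = csv.DictWriter(stream, fieldnames=CSV_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def save_results(rows, output_csv_file):
    """Write the rows to a CSV file on disk."""
    # newline="" lets the csv module control line endings itself.
    with open(output_csv_file, "w", newline="", encoding="utf-8") as handle:
        write_results(rows, handle)


# Illustrative row; None is written as an empty cell.
example_row = {
    "Filename": "Example.pdf",
    "Extracted Name": "John Doe",
    "Extracted Email": "john@example.com",
    "Raw LLM Response": '{"name": "John Doe", "email": "john@example.com"}',
    "Processing Error": None,
}
buffer = io.StringIO()
write_results([example_row], buffer)
print(buffer.getvalue())
```

Because the raw response contains commas and quotes, the csv module quotes that cell automatically, so the file still opens cleanly in a spreadsheet.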
