AI-Powered Data Cleaning Pipeline
✦ Project Overview
An automated system for the tedious work of data cleaning. It combines traditional data engineering techniques with Large Language Models (LLMs) such as Google Gemini, orchestrated through LangGraph, to perform context-aware cleaning, intelligent imputation, and logical consistency checks.
✦ Key Features
- Hybrid cleaning approach: Standard + AI-Agentic cleaning
- Context-aware data imputation using LLMs
- Streamlit UI for easy file upload and cleaning visualization
- Automated logical consistency checks tailored to data context
✦ Methodology
This project employs a hybrid Neuro-Symbolic approach to data cleaning:
Initial Profiling
The system first runs standard statistical analysis (Pandas) to identify missing values, outliers, and data type mismatches.
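As a rough illustration of this profiling step, the sketch below counts missing values per column, flags numeric outliers, and counts text values that parse as numbers as a hint of a type mismatch. The function name `profile`, the 1.5 × IQR outlier rule, and the sample data are assumptions made for this example, not the project's actual implementation.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> dict:
    """Build a per-column summary: dtype, missing count, and simple quality hints."""
    report = {}
    for col in df.columns:
        s = df[col]
        info = {"dtype": str(s.dtype), "missing": int(s.isna().sum())}
        if pd.api.types.is_numeric_dtype(s):
            # Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
            q1, q3 = s.quantile(0.25), s.quantile(0.75)
            iqr = q3 - q1
            mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
            info["outliers"] = int(mask.sum())
        else:
            # Text column: count values that parse as numbers, which can
            # indicate a numeric column stored as strings.
            parsed = pd.to_numeric(s, errors="coerce")
            info["numeric_like"] = int(parsed.notna().sum())
        report[col] = info
    return report


# Hypothetical sample data for illustration only.
df = pd.DataFrame(
    {
        "age": [25, 31, None, 29, 240],
        "zip": ["10001", "94105", None, "60601", "x"],
    }
)
report = profile(df)
print(report)
```

A summary like this could then be handed to the reasoning stage as compact context, rather than passing the full dataset.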
Agentic Reasoning
LangGraph agents analyze column semantics to propose cleaning strategies. For example, an agent may infer that 'age' values greater than 120 are likely errors.
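One way to picture the output of this step is as a declarative rule that is then applied deterministically to the data. The sketch below assumes such a representation; `RangeRule` and `find_violations` are hypothetical names, and the rule is written by hand here rather than produced by a LangGraph agent.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RangeRule:
    """Hypothetical shape of a cleaning strategy proposed for one column."""
    column: str
    min_value: Optional[float]
    max_value: Optional[float]
    reason: str


def find_violations(rows: List[Dict], rule: RangeRule) -> List[int]:
    """Return the indices of rows whose value lies outside the rule's range."""
    bad = []
    for i, row in enumerate(rows):
        value = row.get(rule.column)
        if value is None:
            continue  # missing values are left to the imputation step
        too_low = rule.min_value is not None and value < rule.min_value
        too_high = rule.max_value is not None and value > rule.max_value
        if too_low or too_high:
            bad.append(i)
    return bad


# Hand-written stand-in for an agent's proposal about the 'age' column.
rule = RangeRule("age", 0, 120, "human ages above 120 are implausible")
rows = [{"age": 34}, {"age": 150}, {"age": None}, {"age": -2}]
violations = find_violations(rows, rule)
print(violations)
```

Keeping the proposal separate from its application means a suggested rule can be reviewed or logged before any value is changed.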
Context-Aware Imputation
Instead of simple mean/mode filling, the LLM looks at row context to predict missing values (e.g., inferring 'City' from 'Zip Code').
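A minimal sketch of row-level imputation follows. It assumes the model is reachable through a plain callable that maps a prompt string to a text answer; `build_prompt`, `impute`, and the stubbed `fake_llm` are hypothetical names for this example, and no real Gemini or LangGraph call is made.

```python
from typing import Callable, Dict, Optional


def build_prompt(row: Dict[str, Optional[str]], target: str) -> str:
    """Describe the known fields of a row and ask for the missing one."""
    known = ", ".join(
        f"{key}={value}"
        for key, value in row.items()
        if key != target and value is not None
    )
    return f"Given a record with {known}, what is the most likely '{target}'?"


def impute(
    row: Dict[str, Optional[str]],
    target: str,
    llm: Callable[[str], str],
) -> Dict[str, Optional[str]]:
    """Return a copy of the row with the target field filled in if it was missing."""
    filled = dict(row)
    if filled.get(target) is not None:
        return filled  # nothing to impute
    answer = llm(build_prompt(row, target)).strip()
    # An empty answer is treated as "unknown" and the field stays missing.
    filled[target] = answer or None
    return filled


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call, for illustration only."""
    return "New York" if "10001" in prompt else ""


result = impute({"Zip Code": "10001", "City": None}, "City", fake_llm)
print(result)
```

Passing the model in as a callable keeps the imputation logic testable without network access and lets the stub be swapped for a real client later.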