AI-Powered Data Cleaning Pipeline

Python · Pandas · Google Gemini API · LangGraph · Streamlit

✦ Project Overview

An intelligent, automated system that handles the tedious work of data cleaning. By combining traditional data engineering techniques with a Large Language Model (Google Gemini) orchestrated through LangGraph agents, this pipeline performs context-aware cleaning, intelligent imputation, and logical consistency checks.

✦ Key Features

♥ Hybrid cleaning approach: Standard + AI-Agentic cleaning
♥ Context-aware data imputation using LLMs
♥ Streamlit UI for easy file upload and cleaning visualization
♥ Automated logical consistency checks tailored to data context

✦ Methodology

This project employs a hybrid Neuro-Symbolic approach to data cleaning:

01. Initial Profiling

The system first runs standard statistical analysis (Pandas) to identify missing values, outliers, and data type mismatches.
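
A minimal sketch of what such a profiling pass might look like, assuming the data fits in memory as a single DataFrame. The function name `profile`, the 1.5 × IQR outlier rule, and the "mostly numeric" heuristic for type mismatches are illustrative assumptions, not the project's actual implementation.

```python
import pandas as pd


def profile(df: pd.DataFrame) -> dict:
    """Summarize missing values, IQR outliers, and likely type mismatches per column."""
    summary = {}
    for col in df.columns:
        series = df[col]
        info = {"missing": int(series.isna().sum()), "outliers": 0, "type_mismatches": 0}
        is_number = (
            pd.api.types.is_numeric_dtype(series)
            and not pd.api.types.is_bool_dtype(series)
        )
        if is_number:
            # Assumed rule: flag values more than 1.5 * IQR outside the quartiles.
            q1, q3 = series.quantile(0.25), series.quantile(0.75)
            iqr = q3 - q1
            outside = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
            info["outliers"] = int(outside.sum())
        else:
            # Assumed heuristic: if most non-null values parse as numbers,
            # the ones that do not are counted as type mismatches.
            non_null = series.dropna()
            parsed = pd.to_numeric(non_null, errors="coerce")
            n_numeric = int(parsed.notna().sum())
            if n_numeric > len(non_null) / 2:
                info["type_mismatches"] = len(non_null) - n_numeric
        summary[col] = info
    return summary


# Made-up sample data for illustration only.
sample = pd.DataFrame(
    {
        "age": [25, 31, 29, 240, None, 27],
        "income": ["52000", "61000", "n/a", "48000", "57000", "50000"],
    }
)
report = profile(sample)
```

On this sample, `report["age"]` records one missing value and one outlier (240), and `report["income"]` records one type mismatch ("n/a"). A summary like this is the kind of input the later reasoning step could work from.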

02. Agentic Reasoning

LangGraph agents analyze column semantics to propose cleaning strategies, for example inferring that 'age' values > 120 are likely errors.
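
One way to picture a proposed strategy is as a small structured rule that ordinary Pandas code can then apply. The sketch below is an assumption about that shape: `RangeRule`, `propose_rules`, and `flag_violations` are hypothetical names, and `propose_rules` is a hard-coded stand-in for the LangGraph agent and LLM call, which is not reproduced here.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class RangeRule:
    """A proposed plausibility range for one column (hypothetical structure)."""
    column: str
    low: float
    high: float
    reason: str


def propose_rules(columns):
    """Stand-in for the agent: map column names to rules via a fixed lookup."""
    semantic_ranges = {"age": (0, 120, "human ages above 120 are likely errors")}
    rules = []
    for name in columns:
        hit = semantic_ranges.get(name.strip().lower())
        if hit is not None:
            rules.append(RangeRule(name, *hit))
    return rules


def flag_violations(df: pd.DataFrame, rules) -> dict:
    """Return, per column, the row labels whose values fall outside the rule's range."""
    flagged = {}
    for rule in rules:
        values = df[rule.column]
        # Missing values are left for the imputation step, not flagged here.
        bad = values.notna() & ~values.between(rule.low, rule.high)
        flagged[rule.column] = df.index[bad.to_numpy()].tolist()
    return flagged


# Made-up sample data for illustration only.
people = pd.DataFrame({"Age": [34, 150, None, -2], "Name": ["Ann", "Bob", "Cy", "Di"]})
rules = propose_rules(people.columns)
flagged = flag_violations(people, rules)
```

Here `flagged` is `{"Age": [1, 3]}`: the rows with ages 150 and -2. Keeping the proposal as data rather than free text means each rule can be reviewed or rejected before anything in the dataset changes.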

03. Context-Aware Imputation

Instead of simple mean/mode filling, the LLM uses the other fields in the same row as context to predict missing values (e.g., inferring 'City' from 'Zip Code').
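
The row-level flow can be sketched as follows, with the model call abstracted behind a plain function. `impute_from_context` and `lookup_city` are hypothetical names, and `lookup_city` is a tiny hard-coded table standing in for the Gemini call so the example runs offline; a real predictor would build a prompt from the same row context.

```python
from typing import Callable, Optional

import pandas as pd


def impute_from_context(
    df: pd.DataFrame,
    target: str,
    predict: Callable[[str, dict], Optional[str]],
) -> pd.DataFrame:
    """Fill missing values in `target` using the other non-null fields of each row."""
    out = df.copy()
    missing_rows = out.index[out[target].isna().to_numpy()]
    for idx in missing_rows:
        # Row context: every other column in this row that has a value.
        context = out.loc[idx].drop(labels=[target]).dropna().to_dict()
        guess = predict(target, context)
        if guess is not None:  # Leave the cell empty when the predictor is unsure.
            out.at[idx, target] = guess
    return out


def lookup_city(target: str, context: dict) -> Optional[str]:
    """Stand-in for the LLM: a small illustrative zip-to-city table."""
    table = {"10001": "New York", "94105": "San Francisco", "60601": "Chicago"}
    return table.get(str(context.get("Zip Code", "")))


# Made-up sample data for illustration only.
addresses = pd.DataFrame(
    {"Zip Code": ["10001", "94105", "00000"], "City": ["New York", None, None]}
)
cleaned = impute_from_context(addresses, "City", lookup_city)
```

In this run the second row's city is filled in as "San Francisco", while the third row stays empty because the stand-in has no answer for "00000". Returning `None` instead of guessing keeps unsupported values from being written into the data.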