14.5 AI-Assisted Data Analysis
When Product Managers No Longer Need to "Please Help Me Check the Data"
Imagine this scenario: The product manager @mentions you in the group: "Hey, can you check last week's new user retention rate, grouped by channel, excluding test accounts." You stare at your computer screen, take a deep breath, open the SQL client, and start writing that similar query you've already written eight times this month...
Wait! It's 2026—why are we still being human SQL translation machines? AI can directly convert "last week's new user retention rate" into SQL queries, and even draw charts for you. Today, let's talk about how AI can make data analysis as simple as chatting.
Text-to-SQL: Let AI Be Your SQL Translator
The core idea of Text-to-SQL is simple: users ask questions in natural language, AI automatically generates corresponding SQL queries. Sounds sci-fi? Current large models can already do this quite well.
How It Works
The entire process goes something like this:
- User Question: "How many customers had sales exceeding 100,000 last month?"
- AI Understands Intent: Identifies tables to query (customers, orders), time range (last month), aggregation conditions (sales > 100k)
- Generate SQL: Generate corresponding query based on database schema
- Execute and Return Results: Can directly display data or summarize in natural language
The key: AI needs to know your database structure (schema). Just as you can't expect someone who's never seen your codebase to write correct queries, AI also needs to "know" your tables and fields.
Hands-On Code: Build Text-to-SQL with OpenAI
Let's write a simple but practical Text-to-SQL function:
import openai
import sqlite3
from typing import Dict, Any
class TextToSQLAgent:
def __init__(self, db_path: str, openai_api_key: str):
self.db_path = db_path
self.conn = sqlite3.connect(db_path)
openai.api_key = openai_api_key
self.schema = self._extract_schema()
def _extract_schema(self) -> str:
"""Extract database schema for AI reference"""
cursor = self.conn.cursor()
cursor.execute("SELECT sql FROM sqlite_master WHERE type='table'")
tables = cursor.fetchall()
schema_description = "Database structure:\n\n"
for table in tables:
schema_description += f"{table[0]}\n\n"
return schema_description
def query(self, natural_language_question: str) -> Dict[str, Any]:
"""
Convert natural language question to SQL and execute
Args:
natural_language_question: User's natural language question
Returns:
Dictionary containing SQL, results, and explanation
"""
# Build prompt
prompt = f"""You are a SQL expert. Based on the following database structure, convert the user's question into a SQL query.
{self.schema}
User question: {natural_language_question}
Please return a JSON object with the following fields:
- sql: The generated SQL query
- explanation: Brief explanation in English of what this query does
Return only JSON, nothing else."""
# Call OpenAI API
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[
{"role": "system", "content": "You are a database query assistant, proficient in SQL."},
{"role": "user", "content": prompt}
],
temperature=0.2 # Lower temperature for more stable generation
)
# Parse response
import json
result = json.loads(response.choices[0].message.content)
# Execute SQL
try:
cursor = self.conn.cursor()
cursor.execute(result['sql'])
data = cursor.fetchall()
columns = [description[0] for description in cursor.description]
return {
"sql": result['sql'],
"explanation": result['explanation'],
"data": data,
"columns": columns,
"success": True
}
except Exception as e:
return {
"sql": result['sql'],
"explanation": result['explanation'],
"error": str(e),
"success": False
}
def __del__(self):
self.conn.close()
# Usage example
agent = TextToSQLAgent(
db_path="company.db",
openai_api_key="your-api-key"
)
# Natural language query
result = agent.query("Of users registered in the past 30 days, how many have completed at least one purchase?")
if result['success']:
print(f"SQL: {result['sql']}\n")
print(f"Explanation: {result['explanation']}\n")
print(f"Results:")
print(f"{result['columns']}")
for row in result['data']:
print(row)
else:
print(f"Query failed: {result['error']}")Advanced Techniques
The above code is a basic version. In actual production environments, you also need to consider:
1. Schema Optimization
Don't stuff the entire database structure into the AI—that'll blow up your tokens. You can:
- Include only relevant tables (pre-filter by keyword matching)
- Add field comments and example values
- Mark relationships between tables
def _extract_schema_smart(self, question: str) -> str:
"""Intelligently filter relevant tables based on question"""
# Simple keyword matching
keywords = question.lower().split()
relevant_tables = []
cursor = self.conn.cursor()
cursor.execute("SELECT name, sql FROM sqlite_master WHERE type='table'")
for name, sql in cursor.fetchall():
# If table name or structure contains question keywords, include it
if any(keyword in name.lower() or keyword in sql.lower() for keyword in keywords):
relevant_tables.append((name, sql))
return "\n\n".join([f"Table: {name}\n{sql}" for name, sql in relevant_tables])2. SQL Safety Check
AI might generate dangerous SQL (like DROP TABLE), must add whitelist validation:
def _is_safe_sql(self, sql: str) -> bool:
"""Check if SQL is safe (only allow SELECT)"""
sql_upper = sql.strip().upper()
dangerous_keywords = ['DROP', 'DELETE', 'UPDATE', 'INSERT', 'ALTER', 'CREATE']
if not sql_upper.startswith('SELECT'):
return False
for keyword in dangerous_keywords:
if keyword in sql_upper:
return False
return True3. Result Caching
Same questions don't need to call AI and database repeatedly:
import hashlib
from functools import lru_cache
@lru_cache(maxsize=100)
def query_cached(self, question: str):
return self.query(question)AI-Powered BI: Data Analysis Without Writing Code
If you don't want to build your own Text-to-SQL system, there are many ready-made AI-powered BI tools on the market:
Julius AI
Julius AI is an AI assistant designed for data analysis. You can:
- Upload CSV/Excel files
- Ask questions in natural language: "Show sales trends by month"
- Auto-generate charts and statistical reports
Especially suitable for product managers or operations folks who aren't very SQL-savvy.
ChatGPT Code Interpreter (Advanced Data Analysis)
OpenAI's Code Interpreter feature, now renamed "Advanced Data Analysis", can:
- Upload data files
- Describe analysis needs in natural language
- AI automatically writes Python code for analysis, generates visualizations
I've tested it—handles simple data cleaning and visualization very quickly, much faster than writing pandas code yourself.
Tableau Pulse / Power BI Copilot
Traditional BI tools are also embracing AI:
- Tableau Pulse: Automatically discovers data insights, explains trends in natural language
- Power BI Copilot: Natural language generates DAX queries and visualizations
If your company already uses these tools, just enable AI features directly—no need to rebuild from scratch.
Build an Internal "Ask the Database" Bot
The most practical scenario is building a Slack/Enterprise WeChat bot for your team, letting anyone query data with natural language.
Architecture Design
[User] -> [Chat Platform Bot] -> [Text-to-SQL Service] -> [Database]
|
v
[Permission Check & Result Formatting]
|
v
[Return Results to User]Core Features
- Permission Control: Different users can query different tables
- Result Limiting: Auto-add LIMIT to prevent excessive queries
- Query Auditing: Log who queried what, for tracking
- Smart Suggestions: Recommend common questions based on query history
Quick Prototype: Slack Bot Example
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler
app = App(token="xoxb-your-token")
# Initialize Text-to-SQL agent
sql_agent = TextToSQLAgent(db_path="company.db", openai_api_key="your-key")
@app.message("query:")
def handle_query(message, say):
"""Handle messages starting with 'query:'"""
question = message['text'].replace("query:", "").strip()
user = message['user']
# Permission check (simplified version)
if not is_user_allowed(user):
say("Sorry, you don't have query permissions. Please contact administrator.")
return
# Execute query
say(f"Querying: {question}...")
result = sql_agent.query(question)
if result['success']:
# Format results
response = f"*SQL:*\n```{result['sql']}```\n\n*Results:*\n"
# Simple table format
response += " | ".join(result['columns']) + "\n"
response += "-" * 50 + "\n"
for row in result['data'][:10]: # Only show first 10 rows
response += " | ".join(str(cell) for cell in row) + "\n"
if len(result['data']) > 10:
response += f"\n_(Total {len(result['data'])} results, showing first 10 only)_"
say(response)
else:
say(f"Query failed: {result['error']}")
def is_user_allowed(user_id: str) -> bool:
"""Check if user has query permissions"""
# Real projects should query permissions database
allowed_users = ["U01234567", "U07654321"]
return user_id in allowed_users
# Start bot
if __name__ == "__main__":
handler = SocketModeHandler(app, "xapp-your-token")
handler.start()Complete Example: Production-Grade Text-to-SQL Function
Integrating all the best practices above:
import openai
import sqlite3
import hashlib
import json
from typing import Dict, Any, List, Optional
from datetime import datetime
class ProductionTextToSQL:
def __init__(
self,
db_path: str,
openai_api_key: str,
allowed_tables: Optional[List[str]] = None
):
self.db_path = db_path
self.conn = sqlite3.connect(db_path)
openai.api_key = openai_api_key
self.allowed_tables = allowed_tables
self.query_log = []
def _get_relevant_schema(self, question: str) -> str:
"""Intelligently select relevant table schemas based on question"""
cursor = self.conn.cursor()
if self.allowed_tables:
placeholders = ','.join('?' * len(self.allowed_tables))
cursor.execute(
f"SELECT name, sql FROM sqlite_master WHERE type='table' AND name IN ({placeholders})",
self.allowed_tables
)
else:
cursor.execute("SELECT name, sql FROM sqlite_master WHERE type='table'")
all_tables = cursor.fetchall()
# Simple relevance scoring
keywords = set(question.lower().split())
scored_tables = []
for name, sql in all_tables:
score = sum(1 for keyword in keywords if keyword in name.lower() or keyword in sql.lower())
if score > 0:
scored_tables.append((score, name, sql))
# Return top 5 most relevant tables
scored_tables.sort(reverse=True)
schema = "\n\n".join([
f"Table name: {name}\nDDL: {sql}"
for _, name, sql in scored_tables[:5]
])
return schema if schema else "No relevant tables found"
def _is_safe_query(self, sql: str) -> tuple[bool, str]:
"""Check if SQL is safe"""
sql_upper = sql.strip().upper()
# Must be SELECT
if not sql_upper.startswith('SELECT'):
return False, "Only SELECT queries allowed"
# Forbidden keywords
forbidden = ['DROP', 'DELETE', 'UPDATE', 'INSERT', 'ALTER', 'CREATE', 'TRUNCATE']
for keyword in forbidden:
if keyword in sql_upper:
return False, f"Use of {keyword} operation prohibited"
# Must have LIMIT
if 'LIMIT' not in sql_upper:
sql += " LIMIT 100" # Auto-add LIMIT
return True, sql
def query(
self,
question: str,
user_id: Optional[str] = None,
explain: bool = True
) -> Dict[str, Any]:
"""
Execute natural language query
Args:
question: Natural language question
user_id: User ID (for auditing)
explain: Whether to return AI's explanation
Returns:
Query result dictionary
"""
start_time = datetime.now()
# Get relevant schema
schema = self._get_relevant_schema(question)
# Build prompt
prompt = f"""You are a SQL expert. Convert the question into SQL based on the database structure.
{schema}
Question: {question}
Return JSON:
\{\{
"sql": "generated SQL query",
"explanation": "brief explanation in English"
\}\}
Notes:
1. Generate only SELECT statements
2. If aggregating multiple records, use LIMIT
3. For date-related queries, use date() function"""
try:
# Call AI
response = openai.ChatCompletion.create(
model="gpt-4",
messages=[
{"role": "system", "content": "You are a database query assistant."},
{"role": "user", "content": prompt}
],
temperature=0.1
)
ai_result = json.loads(response.choices[0].message.content)
sql = ai_result['sql']
explanation = ai_result.get('explanation', '')
# Safety check
is_safe, safe_sql = self._is_safe_query(sql)
if not is_safe:
return {
"success": False,
"error": safe_sql,
"question": question
}
# Execute SQL
cursor = self.conn.cursor()
cursor.execute(safe_sql)
data = cursor.fetchall()
columns = [desc[0] for desc in cursor.description]
# Log query
duration = (datetime.now() - start_time).total_seconds()
self._log_query(user_id, question, safe_sql, True, duration)
return {
"success": True,
"sql": safe_sql,
"explanation": explanation if explain else None,
"columns": columns,
"data": data,
"row_count": len(data),
"duration_seconds": duration
}
except json.JSONDecodeError:
return {
"success": False,
"error": "AI return format error",
"question": question
}
except sqlite3.Error as e:
self._log_query(user_id, question, sql, False, 0)
return {
"success": False,
"error": f"SQL execution error: {str(e)}",
"sql": sql,
"question": question
}
def _log_query(
self,
user_id: Optional[str],
question: str,
sql: str,
success: bool,
duration: float
):
"""Log query"""
log_entry = {
"timestamp": datetime.now().isoformat(),
"user_id": user_id,
"question": question,
"sql": sql,
"success": success,
"duration": duration
}
self.query_log.append(log_entry)
# Real projects should write to persistent storage
# Simplified here as in-memory list
def get_query_stats(self) -> Dict[str, Any]:
"""Get query statistics"""
if not self.query_log:
return {"total_queries": 0}
total = len(self.query_log)
success = sum(1 for log in self.query_log if log['success'])
avg_duration = sum(log['duration'] for log in self.query_log) / total
return {
"total_queries": total,
"success_rate": success / total,
"avg_duration_seconds": avg_duration,
"recent_queries": self.query_log[-5:]
}
# Usage example
if __name__ == "__main__":
agent = ProductionTextToSQL(
db_path="company.db",
openai_api_key="your-api-key",
allowed_tables=["users", "orders", "products"] # Whitelist
)
# Execute query
result = agent.query(
question="Number of orders per day for the past 7 days",
user_id="user_123"
)
if result['success']:
print(f"✓ Query successful ({result['duration_seconds']:.2f}s)")
print(f"\nSQL:\n{result['sql']}\n")
print(f"Explanation: {result['explanation']}\n")
print(f"Results ({result['row_count']} rows):")
print(result['columns'])
for row in result['data']:
print(row)
else:
print(f"✗ Query failed: {result['error']}")
# View statistics
stats = agent.get_query_stats()
print(f"\nStatistics: {stats}")Notes and Best Practices
1. Cost Control
Every GPT-4 call costs money. Ways to control costs:
- Use GPT-3.5-turbo for simple queries
- Cache SQL for common questions
- Compress schema info, only pass essentials
2. Accuracy Optimization
AI-generated SQL isn't 100% accurate. You can:
- Provide few-shot examples (few-shot learning)
- Add detailed comments for fields and tables
- Have AI generate multiple candidate SQLs, use heuristic rules to select best
3. User Experience
A good data query bot should:
- Respond quickly (< 5 seconds)
- Support follow-ups ("group by region too")
- Auto-generate visualizations (simple bar/line charts)
- Provide query history for easy reuse
One-Sentence Summary
AI-driven data analysis makes "asking data" as simple as chatting—Text-to-SQL technology converts natural language directly into SQL queries. With permission control and safety checks, you can build a production-grade "ask the database" bot in under 200 lines of code, letting anyone on your team self-serve data analysis without waiting in line for data analysts.
Next Steps
Now you know how to use AI to query data, but how do queried data become documentation and knowledge? In the next section, we'll discuss AI-Driven Documentation and Knowledge Management, looking at how AI can auto-generate API docs, meeting minutes, and even detect stale documentation.