Pandas for SQL People

Overview & Setup

What is Pandas and how does it map to your SQL brain

🐼 What is Pandas?

Pandas is Python's SQL-like library for data manipulation. Think of it as a DataFrame = a Table. You do everything SQL does — select, filter, group, join — just in Python code.

SQL Concept	Pandas Equivalent
Table	`DataFrame (df)`
Column	`Series (df['col'])`
Row	Row (accessed via index)
Primary Key / Index	`df.index`
NULL	`NaN / None`
Schema / data types	`df.dtypes`

⚙️ Setup & Imports

python

import pandas as pd
import numpy as np          # for NaN, numeric ops

# Check version (interviewers sometimes ask)
pd.__version__

🔎 First look at a DataFrame — like DESCRIBE + SELECT *

python

# Like:  SELECT * FROM table LIMIT 5
df.head(5)
df.tail(5)

# Like:  DESCRIBE table  (column names + dtypes)
df.info()
df.dtypes

# Like:  SELECT COUNT(*), MIN(), MAX(), AVG() for all cols
df.describe()        # for numeric cols
df.describe(include='all')   # for all cols

# Shape: (rows, columns)
df.shape
len(df)             # row count

🔍

SELECT / Filter

Column selection, row filtering, boolean masks

🗂️

GROUP BY

groupby(), agg(), transform()

🔗

JOINs

merge(), join() — inner, left, right, outer

🪟

Window Functions

rank(), cumsum(), shift(), rolling()

🕳️

NULL Handling

isna(), fillna(), dropna()

🎯

Interview Cheatsheet

Top 20 patterns asked in interviews

Data Structures

DataFrame = Table, Series = Column — that's 95% of it

DataFrame — The Table

2D labeled data structure. Rows + Columns. This is your primary object.

python

# Create from dict (like an INSERT with hardcoded values)
df = pd.DataFrame({
    'patient_id': [1, 2, 3],
    'name':       ['Alice', 'Bob', 'Charlie'],
    'age':        [30, 45, 28]
})

# Output:
#    patient_id     name  age
# 0           1    Alice   30
# 1           2      Bob   45
# 2           3  Charlie   28
# ^--- This 0,1,2 is the INDEX (like rowid)

Series — The Column

1D array with labels. When you pick one column from a DataFrame, you get a Series.

python

# Get a column → Series
ages = df['age']       # Series
type(df['age'])        # pandas.core.series.Series
type(df[['age']])      # pandas.core.frame.DataFrame  ← double brackets!

# Series has index + values
ages.values             # numpy array of values
ages.index              # RangeIndex(start=0, stop=3)

⚠️ Interview trap: df['col'] returns Series, df[['col']] returns DataFrame. This difference causes many bugs!

Index — The Row Label

Every DataFrame has an index (like a row identifier). Default is 0,1,2,... You can set any column as index.

python

# Set a column as index (like a PK)
df = df.set_index('patient_id')

# Reset back to default 0,1,2...
df = df.reset_index()

# Access row by index label
df.loc[1]              # row where index = 1

# Access row by position (0-based)
df.iloc[0]             # first row

Reading Data

Like connecting to a table — CSV, Excel, JSON, SQL

📂 Read files

python

# CSV — most common
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', sep='|')        # pipe delimited
df = pd.read_csv('data.csv', dtype=str)       # all cols as string
df = pd.read_csv('data.csv', usecols=['a','b']) # select cols
df = pd.read_csv('data.csv', nrows=1000)      # LIMIT 1000

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON
df = pd.read_json('data.json')

# From SQL query — very useful for DE interviews!
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host/db')
df = pd.read_sql("SELECT * FROM patients", engine)

💾 Write / Export data

python

df.to_csv('output.csv', index=False)   # index=False → don't write row numbers
df.to_excel('output.xlsx', index=False)
df.to_json('output.json', orient='records')
df.to_sql('table_name', engine, if_exists='replace')

SELECT / Filtering Columns

Picking columns and rows — like SELECT col1, col2 FROM table

Select Columns SQL Pandas

SQL

SELECT name, age
FROM patients;

SELECT *
FROM patients;

Pandas

# Select specific columns
df[['name', 'age']]

# Select all (just use df)
df

Row Access: loc vs iloc

loc = label-based (use column/index names). iloc = position-based (use numbers like array indexing).

python

# loc[row_label, col_label]
df.loc[0, 'name']           # single cell
df.loc[0:2, ['name', 'age']]  # rows 0-2, 2 cols (INCLUSIVE end!)
df.loc[:, 'age']             # all rows, one col

# iloc[row_position, col_position]
df.iloc[0, 1]                # row 0, col index 1
df.iloc[0:3, 0:2]            # rows 0-2, cols 0-1 (EXCLUSIVE end!)
df.iloc[-1]                  # last row

⚠️ Interview trap: loc slicing is INCLUSIVE on both ends. iloc slicing is EXCLUSIVE at the end. loc[0:3] → rows 0,1,2,3. iloc[0:3] → rows 0,1,2.

Add / Rename / Drop columns

python

# Add column (like ALTER TABLE + UPDATE)
df['full_name'] = df['first'] + ' ' + df['last']
df['age_plus_10'] = df['age'] + 10

# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Drop columns (like DROP COLUMN)
df.drop(columns=['col1', 'col2'], inplace=True)

# Reorder columns
df = df[['col3', 'col1', 'col2']]

WHERE Conditions

Filtering rows — the bread and butter of data work

Basic WHERE SQL Pandas

SQL

SELECT *
FROM patients
WHERE age > 30;

WHERE city = 'Delhi'
  AND age > 30;

WHERE city = 'Delhi'
   OR city = 'Mumbai';

Pandas

# Boolean mask
df[df['age'] > 30]

# AND → &
df[(df['city']=='Delhi') & (df['age']>30)]

# OR → |
df[(df['city']=='Delhi') | (df['city']=='Mumbai')]

⚠️ Use & not and, and | not or. And ALWAYS wrap each condition in parentheses!

IN, NOT IN, BETWEEN, LIKE

SQL

WHERE city IN ('Delhi', 'Mumbai')

WHERE city NOT IN ('Delhi')

WHERE age BETWEEN 20 AND 40

WHERE name LIKE '%Kumar%'

Pandas

df[df['city'].isin(['Delhi', 'Mumbai'])]

df[~df['city'].isin(['Delhi'])]

df[df['age'].between(20, 40)]

df[df['name'].str.contains('Kumar')]

query() — write SQL-like strings!

If you love SQL syntax, query() lets you filter in a readable way:

python

# Almost like SQL WHERE clause
df.query("age > 30 and city == 'Delhi'")
df.query("age between 20 and 40")

# Use @ to reference Python variables
min_age = 25
df.query("age > @min_age")

NULL checks

SQL

WHERE col IS NULL

WHERE col IS NOT NULL

Pandas

df[df['col'].isna()]

df[df['col'].notna()]

GROUP BY & Aggregation

The most tested topic in Data Engineering interviews

Basic GROUP BY + Aggregation

SQL

SELECT city,
       COUNT(*) AS cnt,
       AVG(age)  AS avg_age,
       MAX(age)  AS max_age,
       SUM(salary) AS total
FROM employees
GROUP BY city;

Pandas

df.groupby('city').agg(
  cnt       = ('age', 'count'),
  avg_age   = ('age', 'mean'),
  max_age   = ('age', 'max'),
  total_sal = ('salary', 'sum')
).reset_index()

GROUP BY multiple columns + HAVING

SQL

SELECT city, dept,
       COUNT(*) AS cnt
FROM employees
GROUP BY city, dept
HAVING COUNT(*) > 5;

Pandas

result = df.groupby(['city', 'dept']).agg(
    cnt=('id', 'count')
).reset_index()

# HAVING → filter after groupby
result[result['cnt'] > 5]

All Aggregation Functions

SQL Function	Pandas agg string	Direct method
COUNT(*)	'count'	df['col'].count()
SUM(col)	'sum'	df['col'].sum()
AVG(col)	'mean'	df['col'].mean()
MIN(col)	'min'	df['col'].min()
MAX(col)	'max'	df['col'].max()
COUNT(DISTINCT col)	'nunique'	df['col'].nunique()
STDDEV()	'std'	df['col'].std()
MEDIAN()	'median'	df['col'].median()
PERCENTILE_CONT(0.25)	lambda x: x.quantile(0.25)	df['col'].quantile(0.25)

value_counts() — Quick GROUP BY COUNT

python

# Like: SELECT city, COUNT(*) FROM t GROUP BY city ORDER BY 2 DESC
df['city'].value_counts()

# With normalize=True → percentages
df['city'].value_counts(normalize=True)

# Include NULLs
df['city'].value_counts(dropna=False)

asked very frequently

combine with head(10)

JOINs

merge() is your JOIN in Pandas — master this!

All JOIN Types with merge()

SQL

SELECT * FROM a
INNER JOIN b ON a.id = b.id;

LEFT JOIN b ON a.id = b.id;

RIGHT JOIN b ON a.id = b.id;

FULL OUTER JOIN b ON a.id = b.id;

Pandas

pd.merge(a, b, on='id', how='inner') pd.merge(a, b, on='id', how='left') pd.merge(a, b, on='id', how='right') pd.merge(a, b, on='id', how='outer')

Join on different column names

SQL

SELECT * FROM orders o
JOIN patients p
  ON o.patient_id = p.id;

Pandas

pd.merge(
  orders, patients,
  left_on='patient_id',
  right_on='id',
  how='inner'
)

Handling duplicate column names after merge

python

# If both tables have a column 'name', pandas adds _x and _y suffixes
merged = pd.merge(df1, df2, on='id')
# → name_x (from df1), name_y (from df2)

# Control suffixes
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))

UNION ALL / UNION → concat()

SQL

SELECT * FROM table_2024
UNION ALL
SELECT * FROM table_2025;

UNION  -- removes duplicates

Pandas

pd.concat([df_2024, df_2025], ignore_index=True) pd.concat([...]).drop_duplicates()

concat axis=0 → stack rows

concat axis=1 → stack cols

ORDER BY / LIMIT / DISTINCT

Sorting, deduplication, and limiting output

ORDER BY, LIMIT, DISTINCT

SQL

SELECT * FROM t
ORDER BY age DESC, name ASC
LIMIT 10;

SELECT DISTINCT city FROM t;

SELECT COUNT(DISTINCT city)
FROM t;

Pandas

df.sort_values(
  by=['age', 'name'],
  ascending=[False, True]
).head(10)

df['city'].unique()
df.drop_duplicates(subset=['city'])

df['city'].nunique()

TOP N per group — Classic Interview Question!

SQL

-- Top 2 salaries per dept
SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER(
      PARTITION BY dept
      ORDER BY salary DESC
    ) AS rn
  FROM employees
) t WHERE rn <= 2;

Pandas

# Top 2 salaries per dept df.sort_values('salary', ascending=False)\ .groupby('dept')\ .head(2) # OR using rank() df['rn'] = df.groupby('dept')['salary']\ .rank(method='first', ascending=False) df[df['rn'] <= 2]

very commonly asked

Window Functions

ROW_NUMBER, RANK, LAG, LEAD, running totals — all in Pandas

ROW_NUMBER / RANK / DENSE_RANK

SQL

ROW_NUMBER() OVER(
  PARTITION BY dept
  ORDER BY salary DESC)

RANK() OVER(...)

DENSE_RANK() OVER(...)

Pandas

df['rn'] = df.groupby('dept')['salary']\ .rank(method='first', ascending=False) df['rnk'] = df.groupby('dept')['salary']\ .rank(method='min', ascending=False) df['drnk'] = df.groupby('dept')['salary']\ .rank(method='dense', ascending=False)

rank() method	SQL Equivalent	Behavior on ties
'first'	ROW_NUMBER()	No ties, sequential
'min'	RANK()	Ties get same rank, gap after
'dense'	DENSE_RANK()	Ties get same rank, no gap
'average'	—	Average of tied ranks

LAG / LEAD

SQL

LAG(salary, 1) OVER(
  PARTITION BY dept
  ORDER BY hire_date)

LEAD(salary, 1) OVER(...)

Pandas

df['prev_salary'] = df.groupby('dept')['salary']\ .shift(1) # LAG(1) df['next_salary'] = df.groupby('dept')['salary']\ .shift(-1) # LEAD(1)

Running Totals — Cumulative functions

SQL

SUM(amount) OVER(
  ORDER BY date
  ROWS UNBOUNDED PRECEDING)

AVG(amount) OVER(
  ORDER BY date
  ROWS 6 PRECEDING)

Pandas

df['running_sum'] = df['amount'].cumsum() df['running_max'] = df['amount'].cummax() df['running_min'] = df['amount'].cummin() # Rolling window (last 7 rows) df['7day_avg'] = df['amount'].rolling(7).mean()

transform() — Window without collapsing rows

Use when you want the GROUP BY result attached back to every original row (like a window function).

SQL

-- dept avg salary on each row
SELECT *,
  AVG(salary) OVER(
    PARTITION BY dept) AS dept_avg
FROM employees;

Pandas

df['dept_avg'] = df.groupby('dept')['salary']\ .transform('mean') # Now each row has its dept avg!

transform vs agg — must know!

NULL Handling

NaN in Pandas = NULL in SQL

Detect, Drop, Fill NULLs

python

# Detect nulls
df.isna()                    # bool DataFrame
df.isna().sum()              # count nulls per column
df.isna().sum().sum()        # total null count
df['col'].isna()            # null mask for one col

# Drop rows with nulls
df.dropna()                  # drop if ANY col is null
df.dropna(subset=['col1', 'col2'])  # only if these cols null
df.dropna(thresh=3)          # keep rows with at least 3 non-null

# Fill nulls (like COALESCE / ISNULL)
df.fillna(0)                  # fill all NaN with 0
df['age'].fillna(df['age'].mean())  # fill with mean
df.fillna({'age': 0, 'name': 'Unknown'})  # col-specific fills
df['col'].ffill()             # forward fill (carry last value)
df['col'].bfill()             # backward fill

COALESCE equivalent

SQL

COALESCE(col1, col2, 0)

NULLIF(col, '')

Pandas

df['col1'].combine_first(df['col2']) df['col'].replace('', np.nan)

String Operations

All string functions live under .str accessor

String methods with .str

python

# Access with .str accessor (like SQL string functions)
df['name'].str.upper()          # UPPER()
df['name'].str.lower()          # LOWER()
df['name'].str.strip()          # TRIM()
df['name'].str.lstrip()         # LTRIM()
df['name'].str.len()             # LEN() / LENGTH()
df['name'].str.replace('a', 'b') # REPLACE()
df['name'].str[0:3]             # SUBSTRING(1,3) / LEFT(3)
df['name'].str.contains('Kumar')  # LIKE '%Kumar%'
df['name'].str.startswith('A')   # LIKE 'A%'
df['name'].str.split('_')        # STRING_SPLIT()
df['name'].str.split('_', expand=True)  # split into columns

# Regex
df['col'].str.extract(r'(\d+)')    # extract first number
df['col'].str.findall(r'\d+')     # find all numbers
df['col'].str.replace(r'[^a-z]', '', regex=True)  # regex replace

Date & Time

Date parsing, extraction, and arithmetic — critical for DE

Parse and extract date parts

python

# Parse string to datetime (always do this first!)
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Extract parts via .dt accessor
df['year']    = df['date'].dt.year      # YEAR()
df['month']   = df['date'].dt.month     # MONTH()
df['day']     = df['date'].dt.day       # DAY()
df['weekday'] = df['date'].dt.dayofweek # 0=Mon, 6=Sun
df['quarter'] = df['date'].dt.quarter

# Date arithmetic
df['date'] + pd.Timedelta(days=7)
(df['end_date'] - df['start_date']).dt.days  # DATEDIFF

# Floor / truncate date
df['month_start'] = df['date'].dt.to_period('M')  # DATE_TRUNC month

Apply & Transform

Custom logic per row or column — like CASE WHEN on steroids

CASE WHEN → np.where / map / apply

SQL

CASE WHEN age < 18 THEN 'minor'
     WHEN age < 60 THEN 'adult'
     ELSE 'senior'
END AS age_group

Pandas

# np.select for multiple conditions conditions = [ df['age'] < 18, df['age'] < 60 ] choices = ['minor', 'adult'] df['age_group'] = np.select(conditions, choices, default='senior')

apply() — Custom function per row

python

# Apply to a column (fast)
df['age_squared'] = df['age'].apply(lambda x: x**2)

# Apply to each row (axis=1) — slower but flexible
df['label'] = df.apply(
    lambda row: f"{row['name']}_{row['dept']}",
    axis=1
)

# Custom function
def risk_score(row):
    if row['age'] > 60 and row['smoking'] == 'Y':
        return 'high'
    return 'low'

df['risk'] = df.apply(risk_score, axis=1)

map() — Replace values like a lookup table

SQL

LEFT JOIN code_lookup cl
  ON t.code = cl.code

Pandas

lookup = {'M': 'Male', 'F': 'Female'} df['gender_full'] = df['gender'].map(lookup)

Pivot / Melt

Reshape data — wide to long and long to wide

pivot_table() — like Excel Pivot / PIVOT in SQL

python
# Wide format: one row per dept, cols = each city's avg salary
df.pivot_table(
    index='dept',
    columns='city',
    values='salary',
    aggfunc='mean',
    fill_value=0
)
          

melt() — Wide to Long (UNPIVOT)

SQL

-- UNPIVOT months to rows SELECT id, month, sales FROM t UNPIVOT (sales FOR month IN (jan, feb, mar));

Pandas

df.melt( id_vars=['id'], value_vars=['jan', 'feb', 'mar'], var_name='month', value_name='sales' )

Performance Tips

What separates a junior from a senior in interviews

🚀 Speed Hierarchy (fastest → slowest)

Method	When to use	Speed
Vectorized ops (df['a'] + df['b'])	Arithmetic, comparisons	⚡⚡⚡ Fastest
NumPy functions (np.where)	CASE WHEN logic	⚡⚡⚡
.str methods	String operations	⚡⚡
groupby + agg	Aggregations	⚡⚡
apply() on column	Custom logic per value	⚡
apply(axis=1)	Custom logic per row	🐌 Slow
iterrows() / itertuples()	Avoid in production	🐌🐌 Slowest

💾 Memory Optimization

python

# Check memory usage
df.memory_usage(deep=True).sum() / 1e6  # in MB

# Downcast numeric types
df['age'] = pd.to_numeric(df['age'], downcast='integer')

# Convert low-cardinality strings to category (like an index/enum)
df['city'] = df['city'].astype('category')

# Read only needed columns
df = pd.read_csv('big.csv', usecols=['id', 'amount'])

🔑 Key Best Practices

python

# 1. Copy to avoid SettingWithCopyWarning
filtered = df[df['age'] > 30].copy()

# 2. Always reset_index after groupby
result = df.groupby('dept').agg(...).reset_index()

# 3. Use .assign() for chaining
df = (df
    .assign(full_name=lambda x: x['first']+' '+x['last'])
    .assign(age_group=lambda x: np.where(x['age']>60, 'senior', 'adult'))
    .dropna(subset=['id'])
)

🎯 Interview Cheatsheet

Top patterns asked in Data Engineering & Analyst interviews

Top 20 Must-Know Operations

python — top 20 interview patterns

# 1. Find duplicate rows
df[df.duplicated()]                    # show duplicates
df.drop_duplicates()                    # remove all dups
df.drop_duplicates(subset=['id'], keep='last')  # keep latest

# 2. Top N rows per group
df.sort_values('salary', ascending=False).groupby('dept').head(2)

# 3. Percent of total per group
df['pct'] = df.groupby('dept')['sales'].transform(lambda x: x / x.sum() * 100)

# 4. Find second highest salary
df['salary'].nlargest(2).iloc[-1]
df['salary'].drop_duplicates().nlargest(2).min()

# 5. Count nulls per column
df.isna().sum()

# 6. Correlation between columns
df[['age', 'salary']].corr()

# 7. Running total within group
df['cumsum'] = df.groupby('dept')['sales'].cumsum()

# 8. Month-over-month change
df.sort_values('date', inplace=True)
df['mom_change'] = df['revenue'] - df['revenue'].shift(1)
df['mom_pct']    = df['revenue'].pct_change() * 100

# 9. Flatten a column with lists
df.explode('tags_list')

# 10. Cross join
df1['key'] = 1; df2['key'] = 1
pd.merge(df1, df2, on='key').drop('key', axis=1)

# 11. Pivot (wide format)
df.pivot_table(index='dept', columns='month', values='sales', aggfunc='sum')

# 12. Anti-join (rows in A not in B)
merged = pd.merge(a, b, on='id', how='left', indicator=True)
anti = merged[merged['_merge'] == 'left_only']

# 13. Conditional column update
df.loc[df['age'] > 60, 'category'] = 'senior'

# 14. Find rows where col A > col B
df[df['actual'] > df['target']]

# 15. String split into multiple columns
df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True)

# 16. Normalize / standardize a column
df['norm'] = (df['val'] - df['val'].min()) / (df['val'].max() - df['val'].min())

# 17. Multiple conditions → new column
df['flag'] = np.where((df['age']>50) & (df['smoker']=='Y'), 1, 0)

# 18. Resample time series (like DATE_TRUNC + GROUP BY)
df.set_index('date').resample('M')['sales'].sum()

# 19. Stack / Unstack MultiIndex
pivot.stack()         # wide → long
long.unstack()        # long → wide

# 20. Assign multiple columns at once
df = df.assign(
    full_name   = df['first'] + ' ' + df['last'],
    age_bucket  = pd.cut(df['age'], bins=[0,18,60,100], labels=['minor','adult','senior'])
)

Common Interview Traps ⚠️

Trap	Wrong	Correct
Single vs double brackets	df['col'] → Series	df[['col']] → DataFrame
AND/OR in filters	df[a > 1 and b > 2]	df[(a > 1) & (b > 2)]
loc vs iloc end	iloc[0:3] → 0,1,2	loc[0:3] → 0,1,2,3
inplace forgotten	df.sort_values('a') (no effect)	df.sort_values('a', inplace=True)
groupby index	result.dept (KeyError)	result.reset_index().dept
chained assignment	df[mask]['col'] = val → warning	df.loc[mask, 'col'] = val
Comparing to NaN	df[df['a'] == None]	df[df['a'].isna()]

Pandas forSQL People

🐼 What is Pandas?

⚙️ Setup & Imports

🔎 First look at a DataFrame — like DESCRIBE + SELECT *

DataFrame — The Table

Series — The Column

Index — The Row Label

📂 Read files

💾 Write / Export data

Select Columns SQL Pandas

Row Access: loc vs iloc

Add / Rename / Drop columns

Basic WHERE SQL Pandas

IN, NOT IN, BETWEEN, LIKE

query() — write SQL-like strings!

NULL checks

Basic GROUP BY + Aggregation

GROUP BY multiple columns + HAVING

All Aggregation Functions

value_counts() — Quick GROUP BY COUNT

All JOIN Types with merge()

Join on different column names

Handling duplicate column names after merge

UNION ALL / UNION → concat()

ORDER BY, LIMIT, DISTINCT

TOP N per group — Classic Interview Question!

ROW_NUMBER / RANK / DENSE_RANK

LAG / LEAD

Running Totals — Cumulative functions

transform() — Window without collapsing rows

Detect, Drop, Fill NULLs

COALESCE equivalent

String methods with .str

Parse and extract date parts

CASE WHEN → np.where / map / apply

apply() — Custom function per row

map() — Replace values like a lookup table

pivot_table() — like Excel Pivot / PIVOT in SQL

melt() — Wide to Long (UNPIVOT)

🚀 Speed Hierarchy (fastest → slowest)

💾 Memory Optimization

🔑 Key Best Practices

Top 20 Must-Know Operations

Common Interview Traps ⚠️

Pandas for
SQL People