interview prep ยท sql background

Pandas for
SQL People

Every concept explained as its SQL equivalent. You already know the what โ€” just learn the Pandas syntax.

SQL equivalent shown
Pandas syntax
Interview traps
Pro tips
Overview & Setup
What is Pandas and how does it map to your SQL brain

๐Ÿผ What is Pandas?

Pandas is Python's SQL-like library for data manipulation. Think of it as a DataFrame = a Table. You do everything SQL does โ€” select, filter, group, join โ€” just in Python code.

SQL ConceptPandas Equivalent
TableDataFrame (df)
ColumnSeries (df['col'])
RowRow (accessed via index)
Primary Key / Indexdf.index
NULLNaN / None
Schema / data typesdf.dtypes

โš™๏ธ Setup & Imports

python
import pandas as pd
import numpy as np          # for NaN, numeric ops

# Check version (interviewers sometimes ask)
pd.__version__

๐Ÿ”Ž First look at a DataFrame โ€” like DESCRIBE + SELECT *

python
# Like:  SELECT * FROM table LIMIT 5
df.head(5)
df.tail(5)

# Like:  DESCRIBE table  (column names + dtypes)
df.info()
df.dtypes

# Like:  SELECT COUNT(*), MIN(), MAX(), AVG() for all cols
df.describe()        # for numeric cols
df.describe(include='all')   # for all cols

# Shape: (rows, columns)
df.shape
len(df)             # row count
๐Ÿ”
SELECT / Filter
Column selection, row filtering, boolean masks
๐Ÿ—‚๏ธ
GROUP BY
groupby(), agg(), transform()
๐Ÿ”—
JOINs
merge(), join() โ€” inner, left, right, outer
๐ŸชŸ
Window Functions
rank(), cumsum(), shift(), rolling()
๐Ÿ•ณ๏ธ
NULL Handling
isna(), fillna(), dropna()
๐ŸŽฏ
Interview Cheatsheet
Top 20 patterns asked in interviews
Data Structures
DataFrame = Table, Series = Column โ€” that's 95% of it

DataFrame โ€” The Table

2D labeled data structure. Rows + Columns. This is your primary object.

python
# Create from dict (like an INSERT with hardcoded values)
df = pd.DataFrame({
    'patient_id': [1, 2, 3],
    'name':       ['Alice', 'Bob', 'Charlie'],
    'age':        [30, 45, 28]
})

# Output:
#    patient_id     name  age
# 0           1    Alice   30
# 1           2      Bob   45
# 2           3  Charlie   28
# ^--- This 0,1,2 is the INDEX (like rowid)

Series โ€” The Column

1D array with labels. When you pick one column from a DataFrame, you get a Series.

python
# Get a column โ†’ Series
ages = df['age']       # Series
type(df['age'])        # pandas.core.series.Series
type(df[['age']])      # pandas.core.frame.DataFrame  โ† double brackets!

# Series has index + values
ages.values             # numpy array of values
ages.index              # RangeIndex(start=0, stop=3)
โš ๏ธ Interview trap: df['col'] returns Series, df[['col']] returns DataFrame. This difference causes many bugs!

Index โ€” The Row Label

Every DataFrame has an index (like a row identifier). Default is 0,1,2,... You can set any column as index.

python
# Set a column as index (like a PK)
df = df.set_index('patient_id')

# Reset back to default 0,1,2...
df = df.reset_index()

# Access row by index label
df.loc[1]              # row where index = 1

# Access row by position (0-based)
df.iloc[0]             # first row
Reading Data
Like connecting to a table โ€” CSV, Excel, JSON, SQL

๐Ÿ“‚ Read files

python
# CSV โ€” most common
df = pd.read_csv('data.csv')
df = pd.read_csv('data.csv', sep='|')        # pipe delimited
df = pd.read_csv('data.csv', dtype=str)       # all cols as string
df = pd.read_csv('data.csv', usecols=['a','b']) # select cols
df = pd.read_csv('data.csv', nrows=1000)      # LIMIT 1000

# Excel
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# JSON
df = pd.read_json('data.json')

# From SQL query โ€” very useful for DE interviews!
from sqlalchemy import create_engine
engine = create_engine('postgresql://user:pass@host/db')
df = pd.read_sql("SELECT * FROM patients", engine)

๐Ÿ’พ Write / Export data

python
df.to_csv('output.csv', index=False)   # index=False โ†’ don't write row numbers
df.to_excel('output.xlsx', index=False)
df.to_json('output.json', orient='records')
df.to_sql('table_name', engine, if_exists='replace')
SELECT / Filtering Columns
Picking columns and rows โ€” like SELECT col1, col2 FROM table

Select Columns SQL Pandas

SQL
SELECT name, age
FROM patients;

SELECT *
FROM patients;
Pandas
# Select specific columns
df[['name', 'age']]

# Select all (just use df)
df

Row Access: loc vs iloc

loc = label-based (use column/index names). iloc = position-based (use numbers like array indexing).

python
# loc[row_label, col_label]
df.loc[0, 'name']           # single cell
df.loc[0:2, ['name', 'age']]  # rows 0-2, 2 cols (INCLUSIVE end!)
df.loc[:, 'age']             # all rows, one col

# iloc[row_position, col_position]
df.iloc[0, 1]                # row 0, col index 1
df.iloc[0:3, 0:2]            # rows 0-2, cols 0-1 (EXCLUSIVE end!)
df.iloc[-1]                  # last row
โš ๏ธ Interview trap: loc slicing is INCLUSIVE on both ends. iloc slicing is EXCLUSIVE at the end. loc[0:3] โ†’ rows 0,1,2,3. iloc[0:3] โ†’ rows 0,1,2.

Add / Rename / Drop columns

python
# Add column (like ALTER TABLE + UPDATE)
df['full_name'] = df['first'] + ' ' + df['last']
df['age_plus_10'] = df['age'] + 10

# Rename columns
df.rename(columns={'old_name': 'new_name'}, inplace=True)

# Drop columns (like DROP COLUMN)
df.drop(columns=['col1', 'col2'], inplace=True)

# Reorder columns
df = df[['col3', 'col1', 'col2']]
๐Ÿ’ก inplace=True modifies the df directly. Without it, pandas returns a new df and the original is unchanged. Many bugs come from forgetting this!
WHERE Conditions
Filtering rows โ€” the bread and butter of data work

Basic WHERE SQL Pandas

SQL
SELECT *
FROM patients
WHERE age > 30;

WHERE city = 'Delhi'
  AND age > 30;

WHERE city = 'Delhi'
   OR city = 'Mumbai';
Pandas
# Boolean mask
df[df['age'] > 30]

# AND โ†’ &
df[(df['city']=='Delhi') & (df['age']>30)]

# OR โ†’ |
df[(df['city']=='Delhi') | (df['city']=='Mumbai')]
โš ๏ธ Use & not and, and | not or. And ALWAYS wrap each condition in parentheses!

IN, NOT IN, BETWEEN, LIKE

SQL
WHERE city IN ('Delhi', 'Mumbai')

WHERE city NOT IN ('Delhi')

WHERE age BETWEEN 20 AND 40

WHERE name LIKE '%Kumar%'
Pandas
df[df['city'].isin(['Delhi', 'Mumbai'])]

df[~df['city'].isin(['Delhi'])]

df[df['age'].between(20, 40)]

df[df['name'].str.contains('Kumar')]

query() โ€” write SQL-like strings!

If you love SQL syntax, query() lets you filter in a readable way:

python
# Almost like SQL WHERE clause
df.query("age > 30 and city == 'Delhi'")
df.query("age between 20 and 40")

# Use @ to reference Python variables
min_age = 25
df.query("age > @min_age")
๐Ÿ’ก query() is very readable and faster on large DataFrames. Great to show in interviews!

NULL checks

SQL
WHERE col IS NULL

WHERE col IS NOT NULL
Pandas
df[df['col'].isna()]

df[df['col'].notna()]
GROUP BY & Aggregation
The most tested topic in Data Engineering interviews

Basic GROUP BY + Aggregation

SQL
SELECT city,
       COUNT(*) AS cnt,
       AVG(age)  AS avg_age,
       MAX(age)  AS max_age,
       SUM(salary) AS total
FROM employees
GROUP BY city;
Pandas
df.groupby('city').agg(
  cnt       = ('age', 'count'),
  avg_age   = ('age', 'mean'),
  max_age   = ('age', 'max'),
  total_sal = ('salary', 'sum')
).reset_index()

GROUP BY multiple columns + HAVING

SQL
SELECT city, dept,
       COUNT(*) AS cnt
FROM employees
GROUP BY city, dept
HAVING COUNT(*) > 5;
Pandas
result = df.groupby(['city', 'dept']).agg(
    cnt=('id', 'count')
).reset_index()

# HAVING โ†’ filter after groupby
result[result['cnt'] > 5]

All Aggregation Functions

SQL FunctionPandas agg stringDirect method
COUNT(*)'count'df['col'].count()
SUM(col)'sum'df['col'].sum()
AVG(col)'mean'df['col'].mean()
MIN(col)'min'df['col'].min()
MAX(col)'max'df['col'].max()
COUNT(DISTINCT col)'nunique'df['col'].nunique()
STDDEV()'std'df['col'].std()
MEDIAN()'median'df['col'].median()
PERCENTILE_CONT(0.25)lambda x: x.quantile(0.25)df['col'].quantile(0.25)

value_counts() โ€” Quick GROUP BY COUNT

python
# Like: SELECT city, COUNT(*) FROM t GROUP BY city ORDER BY 2 DESC
df['city'].value_counts()

# With normalize=True โ†’ percentages
df['city'].value_counts(normalize=True)

# Include NULLs
df['city'].value_counts(dropna=False)
asked very frequently
combine with head(10)
JOINs
merge() is your JOIN in Pandas โ€” master this!

All JOIN Types with merge()

SQL
SELECT * FROM a
INNER JOIN b ON a.id = b.id;

LEFT JOIN b ON a.id = b.id;

RIGHT JOIN b ON a.id = b.id;

FULL OUTER JOIN b ON a.id = b.id;
Pandas
pd.merge(a, b, on='id', how='inner') pd.merge(a, b, on='id', how='left') pd.merge(a, b, on='id', how='right') pd.merge(a, b, on='id', how='outer')

Join on different column names

SQL
SELECT * FROM orders o
JOIN patients p
  ON o.patient_id = p.id;
Pandas
pd.merge(
  orders, patients,
  left_on='patient_id',
  right_on='id',
  how='inner'
)

Handling duplicate column names after merge

python
# If both tables have a column 'name', pandas adds _x and _y suffixes
merged = pd.merge(df1, df2, on='id')
# โ†’ name_x (from df1), name_y (from df2)

# Control suffixes
merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))

UNION ALL / UNION โ†’ concat()

SQL
SELECT * FROM table_2024
UNION ALL
SELECT * FROM table_2025;

UNION  -- removes duplicates
Pandas
pd.concat([df_2024, df_2025], ignore_index=True) pd.concat([...]).drop_duplicates()
concat axis=0 โ†’ stack rows
concat axis=1 โ†’ stack cols
ORDER BY / LIMIT / DISTINCT
Sorting, deduplication, and limiting output

ORDER BY, LIMIT, DISTINCT

SQL
SELECT * FROM t
ORDER BY age DESC, name ASC
LIMIT 10;

SELECT DISTINCT city FROM t;

SELECT COUNT(DISTINCT city)
FROM t;
Pandas
df.sort_values(
  by=['age', 'name'],
  ascending=[False, True]
).head(10)

df['city'].unique()
df.drop_duplicates(subset=['city'])

df['city'].nunique()

TOP N per group โ€” Classic Interview Question!

SQL
-- Top 2 salaries per dept
SELECT * FROM (
  SELECT *,
    ROW_NUMBER() OVER(
      PARTITION BY dept
      ORDER BY salary DESC
    ) AS rn
  FROM employees
) t WHERE rn <= 2;
Pandas
# Top 2 salaries per dept df.sort_values('salary', ascending=False)\ .groupby('dept')\ .head(2) # OR using rank() df['rn'] = df.groupby('dept')['salary']\ .rank(method='first', ascending=False) df[df['rn'] <= 2]
very commonly asked
Window Functions
ROW_NUMBER, RANK, LAG, LEAD, running totals โ€” all in Pandas

ROW_NUMBER / RANK / DENSE_RANK

SQL
ROW_NUMBER() OVER(
  PARTITION BY dept
  ORDER BY salary DESC)

RANK() OVER(...)

DENSE_RANK() OVER(...)
Pandas
df['rn'] = df.groupby('dept')['salary']\ .rank(method='first', ascending=False) df['rnk'] = df.groupby('dept')['salary']\ .rank(method='min', ascending=False) df['drnk'] = df.groupby('dept')['salary']\ .rank(method='dense', ascending=False)
rank() methodSQL EquivalentBehavior on ties
'first'ROW_NUMBER()No ties, sequential
'min'RANK()Ties get same rank, gap after
'dense'DENSE_RANK()Ties get same rank, no gap
'average'โ€”Average of tied ranks

LAG / LEAD

SQL
LAG(salary, 1) OVER(
  PARTITION BY dept
  ORDER BY hire_date)

LEAD(salary, 1) OVER(...)
Pandas
df['prev_salary'] = df.groupby('dept')['salary']\ .shift(1) # LAG(1) df['next_salary'] = df.groupby('dept')['salary']\ .shift(-1) # LEAD(1)

Running Totals โ€” Cumulative functions

SQL
SUM(amount) OVER(
  ORDER BY date
  ROWS UNBOUNDED PRECEDING)

AVG(amount) OVER(
  ORDER BY date
  ROWS 6 PRECEDING)
Pandas
df['running_sum'] = df['amount'].cumsum() df['running_max'] = df['amount'].cummax() df['running_min'] = df['amount'].cummin() # Rolling window (last 7 rows) df['7day_avg'] = df['amount'].rolling(7).mean()

transform() โ€” Window without collapsing rows

Use when you want the GROUP BY result attached back to every original row (like a window function).

SQL
-- dept avg salary on each row
SELECT *,
  AVG(salary) OVER(
    PARTITION BY dept) AS dept_avg
FROM employees;
Pandas
df['dept_avg'] = df.groupby('dept')['salary']\ .transform('mean') # Now each row has its dept avg!
transform vs agg โ€” must know!
NULL Handling
NaN in Pandas = NULL in SQL

Detect, Drop, Fill NULLs

python
# Detect nulls
df.isna()                    # bool DataFrame
df.isna().sum()              # count nulls per column
df.isna().sum().sum()        # total null count
df['col'].isna()            # null mask for one col

# Drop rows with nulls
df.dropna()                  # drop if ANY col is null
df.dropna(subset=['col1', 'col2'])  # only if these cols null
df.dropna(thresh=3)          # keep rows with at least 3 non-null

# Fill nulls (like COALESCE / ISNULL)
df.fillna(0)                  # fill all NaN with 0
df['age'].fillna(df['age'].mean())  # fill with mean
df.fillna({'age': 0, 'name': 'Unknown'})  # col-specific fills
df['col'].ffill()             # forward fill (carry last value)
df['col'].bfill()             # backward fill

COALESCE equivalent

SQL
COALESCE(col1, col2, 0)

NULLIF(col, '')
Pandas
df['col1'].combine_first(df['col2']) df['col'].replace('', np.nan)
String Operations
All string functions live under .str accessor

String methods with .str

python
# Access with .str accessor (like SQL string functions)
df['name'].str.upper()          # UPPER()
df['name'].str.lower()          # LOWER()
df['name'].str.strip()          # TRIM()
df['name'].str.lstrip()         # LTRIM()
df['name'].str.len()             # LEN() / LENGTH()
df['name'].str.replace('a', 'b') # REPLACE()
df['name'].str[0:3]             # SUBSTRING(1,3) / LEFT(3)
df['name'].str.contains('Kumar')  # LIKE '%Kumar%'
df['name'].str.startswith('A')   # LIKE 'A%'
df['name'].str.split('_')        # STRING_SPLIT()
df['name'].str.split('_', expand=True)  # split into columns

# Regex
df['col'].str.extract(r'(\d+)')    # extract first number
df['col'].str.findall(r'\d+')     # find all numbers
df['col'].str.replace(r'[^a-z]', '', regex=True)  # regex replace
Date & Time
Date parsing, extraction, and arithmetic โ€” critical for DE

Parse and extract date parts

python
# Parse string to datetime (always do this first!)
df['date'] = pd.to_datetime(df['date'])
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')

# Extract parts via .dt accessor
df['year']    = df['date'].dt.year      # YEAR()
df['month']   = df['date'].dt.month     # MONTH()
df['day']     = df['date'].dt.day       # DAY()
df['weekday'] = df['date'].dt.dayofweek # 0=Mon, 6=Sun
df['quarter'] = df['date'].dt.quarter

# Date arithmetic
df['date'] + pd.Timedelta(days=7)
(df['end_date'] - df['start_date']).dt.days  # DATEDIFF

# Floor / truncate date
df['month_start'] = df['date'].dt.to_period('M')  # DATE_TRUNC month
Apply & Transform
Custom logic per row or column โ€” like CASE WHEN on steroids

CASE WHEN โ†’ np.where / map / apply

SQL
CASE WHEN age < 18 THEN 'minor'
     WHEN age < 60 THEN 'adult'
     ELSE 'senior'
END AS age_group
Pandas
# np.select for multiple conditions conditions = [ df['age'] < 18, df['age'] < 60 ] choices = ['minor', 'adult'] df['age_group'] = np.select(conditions, choices, default='senior')

apply() โ€” Custom function per row

python
# Apply to a column (fast)
df['age_squared'] = df['age'].apply(lambda x: x**2)

# Apply to each row (axis=1) โ€” slower but flexible
df['label'] = df.apply(
    lambda row: f"{row['name']}_{row['dept']}",
    axis=1
)

# Custom function
def risk_score(row):
    if row['age'] > 60 and row['smoking'] == 'Y':
        return 'high'
    return 'low'

df['risk'] = df.apply(risk_score, axis=1)
โš ๏ธ apply() with axis=1 is slow on large data. Prefer vectorized operations (np.where, .str methods, arithmetic) when possible.

map() โ€” Replace values like a lookup table

SQL
LEFT JOIN code_lookup cl
  ON t.code = cl.code
Pandas
lookup = {'M': 'Male', 'F': 'Female'} df['gender_full'] = df['gender'].map(lookup)
Pivot / Melt
Reshape data โ€” wide to long and long to wide

pivot_table() โ€” like Excel Pivot / PIVOT in SQL

python
# Wide format: one row per dept, cols = each city's avg salary df.pivot_table( index='dept', columns='city', values='salary', aggfunc='mean', fill_value=0 )

melt() โ€” Wide to Long (UNPIVOT)

SQL
-- UNPIVOT months to rows SELECT id, month, sales FROM t UNPIVOT (sales FOR month IN (jan, feb, mar));
Pandas
df.melt( id_vars=['id'], value_vars=['jan', 'feb', 'mar'], var_name='month', value_name='sales' )
Performance Tips
What separates a junior from a senior in interviews

๐Ÿš€ Speed Hierarchy (fastest โ†’ slowest)

MethodWhen to useSpeed
Vectorized ops (df['a'] + df['b'])Arithmetic, comparisonsโšกโšกโšก Fastest
NumPy functions (np.where)CASE WHEN logicโšกโšกโšก
.str methodsString operationsโšกโšก
groupby + aggAggregationsโšกโšก
apply() on columnCustom logic per valueโšก
apply(axis=1)Custom logic per row๐ŸŒ Slow
iterrows() / itertuples()Avoid in production๐ŸŒ๐ŸŒ Slowest

๐Ÿ’พ Memory Optimization

python
# Check memory usage
df.memory_usage(deep=True).sum() / 1e6  # in MB

# Downcast numeric types
df['age'] = pd.to_numeric(df['age'], downcast='integer')

# Convert low-cardinality strings to category (like an index/enum)
df['city'] = df['city'].astype('category')

# Read only needed columns
df = pd.read_csv('big.csv', usecols=['id', 'amount'])

๐Ÿ”‘ Key Best Practices

python
# 1. Copy to avoid SettingWithCopyWarning
filtered = df[df['age'] > 30].copy()

# 2. Always reset_index after groupby
result = df.groupby('dept').agg(...).reset_index()

# 3. Use .assign() for chaining
df = (df
    .assign(full_name=lambda x: x['first']+' '+x['last'])
    .assign(age_group=lambda x: np.where(x['age']>60, 'senior', 'adult'))
    .dropna(subset=['id'])
)
๐ŸŽฏ Interview Cheatsheet
Top patterns asked in Data Engineering & Analyst interviews

Top 20 Must-Know Operations

python โ€” top 20 interview patterns
# 1. Find duplicate rows
df[df.duplicated()]                    # show duplicates
df.drop_duplicates()                    # remove all dups
df.drop_duplicates(subset=['id'], keep='last')  # keep latest

# 2. Top N rows per group
df.sort_values('salary', ascending=False).groupby('dept').head(2)

# 3. Percent of total per group
df['pct'] = df.groupby('dept')['sales'].transform(lambda x: x / x.sum() * 100)

# 4. Find second highest salary
df['salary'].nlargest(2).iloc[-1]
df['salary'].drop_duplicates().nlargest(2).min()

# 5. Count nulls per column
df.isna().sum()

# 6. Correlation between columns
df[['age', 'salary']].corr()

# 7. Running total within group
df['cumsum'] = df.groupby('dept')['sales'].cumsum()

# 8. Month-over-month change
df.sort_values('date', inplace=True)
df['mom_change'] = df['revenue'] - df['revenue'].shift(1)
df['mom_pct']    = df['revenue'].pct_change() * 100

# 9. Flatten a column with lists
df.explode('tags_list')

# 10. Cross join
df1['key'] = 1; df2['key'] = 1
pd.merge(df1, df2, on='key').drop('key', axis=1)

# 11. Pivot (wide format)
df.pivot_table(index='dept', columns='month', values='sales', aggfunc='sum')

# 12. Anti-join (rows in A not in B)
merged = pd.merge(a, b, on='id', how='left', indicator=True)
anti = merged[merged['_merge'] == 'left_only']

# 13. Conditional column update
df.loc[df['age'] > 60, 'category'] = 'senior'

# 14. Find rows where col A > col B
df[df['actual'] > df['target']]

# 15. String split into multiple columns
df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True)

# 16. Normalize / standardize a column
df['norm'] = (df['val'] - df['val'].min()) / (df['val'].max() - df['val'].min())

# 17. Multiple conditions โ†’ new column
df['flag'] = np.where((df['age']>50) & (df['smoker']=='Y'), 1, 0)

# 18. Resample time series (like DATE_TRUNC + GROUP BY)
df.set_index('date').resample('M')['sales'].sum()

# 19. Stack / Unstack MultiIndex
pivot.stack()         # wide โ†’ long
long.unstack()        # long โ†’ wide

# 20. Assign multiple columns at once
df = df.assign(
    full_name   = df['first'] + ' ' + df['last'],
    age_bucket  = pd.cut(df['age'], bins=[0,18,60,100], labels=['minor','adult','senior'])
)

Common Interview Traps โš ๏ธ

TrapWrongCorrect
Single vs double bracketsdf['col'] โ†’ Seriesdf[['col']] โ†’ DataFrame
AND/OR in filtersdf[a > 1 and b > 2]df[(a > 1) & (b > 2)]
loc vs iloc endiloc[0:3] โ†’ 0,1,2loc[0:3] โ†’ 0,1,2,3
inplace forgottendf.sort_values('a') (no effect)df.sort_values('a', inplace=True)
groupby indexresult.dept (KeyError)result.reset_index().dept
chained assignmentdf[mask]['col'] = val โ†’ warningdf.loc[mask, 'col'] = val
Comparing to NaNdf[df['a'] == None]df[df['a'].isna()]