๐ผ What is Pandas?
Pandas is Python's SQL-like library for data manipulation. Think of it as a DataFrame = a Table. You do everything SQL does โ select, filter, group, join โ just in Python code.
| SQL Concept | Pandas Equivalent |
|---|---|
| Table | DataFrame (df) |
| Column | Series (df['col']) |
| Row | Row (accessed via index) |
| Primary Key / Index | df.index |
| NULL | NaN / None |
| Schema / data types | df.dtypes |
โ๏ธ Setup & Imports
import pandas as pd import numpy as np # for NaN, numeric ops # Check version (interviewers sometimes ask) pd.__version__
๐ First look at a DataFrame โ like DESCRIBE + SELECT *
# Like: SELECT * FROM table LIMIT 5 df.head(5) df.tail(5) # Like: DESCRIBE table (column names + dtypes) df.info() df.dtypes # Like: SELECT COUNT(*), MIN(), MAX(), AVG() for all cols df.describe() # for numeric cols df.describe(include='all') # for all cols # Shape: (rows, columns) df.shape len(df) # row count
DataFrame โ The Table
2D labeled data structure. Rows + Columns. This is your primary object.
# Create from dict (like an INSERT with hardcoded values) df = pd.DataFrame({ 'patient_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie'], 'age': [30, 45, 28] }) # Output: # patient_id name age # 0 1 Alice 30 # 1 2 Bob 45 # 2 3 Charlie 28 # ^--- This 0,1,2 is the INDEX (like rowid)
Series โ The Column
1D array with labels. When you pick one column from a DataFrame, you get a Series.
# Get a column โ Series ages = df['age'] # Series type(df['age']) # pandas.core.series.Series type(df[['age']]) # pandas.core.frame.DataFrame โ double brackets! # Series has index + values ages.values # numpy array of values ages.index # RangeIndex(start=0, stop=3)
Index โ The Row Label
Every DataFrame has an index (like a row identifier). Default is 0,1,2,... You can set any column as index.
# Set a column as index (like a PK) df = df.set_index('patient_id') # Reset back to default 0,1,2... df = df.reset_index() # Access row by index label df.loc[1] # row where index = 1 # Access row by position (0-based) df.iloc[0] # first row
๐ Read files
# CSV โ most common df = pd.read_csv('data.csv') df = pd.read_csv('data.csv', sep='|') # pipe delimited df = pd.read_csv('data.csv', dtype=str) # all cols as string df = pd.read_csv('data.csv', usecols=['a','b']) # select cols df = pd.read_csv('data.csv', nrows=1000) # LIMIT 1000 # Excel df = pd.read_excel('data.xlsx', sheet_name='Sheet1') # JSON df = pd.read_json('data.json') # From SQL query โ very useful for DE interviews! from sqlalchemy import create_engine engine = create_engine('postgresql://user:pass@host/db') df = pd.read_sql("SELECT * FROM patients", engine)
๐พ Write / Export data
df.to_csv('output.csv', index=False) # index=False โ don't write row numbers df.to_excel('output.xlsx', index=False) df.to_json('output.json', orient='records') df.to_sql('table_name', engine, if_exists='replace')
Select Columns SQL Pandas
SELECT name, age FROM patients; SELECT * FROM patients;
# Select specific columns df[['name', 'age']] # Select all (just use df) df
Row Access: loc vs iloc
loc = label-based (use column/index names). iloc = position-based (use numbers like array indexing).
# loc[row_label, col_label] df.loc[0, 'name'] # single cell df.loc[0:2, ['name', 'age']] # rows 0-2, 2 cols (INCLUSIVE end!) df.loc[:, 'age'] # all rows, one col # iloc[row_position, col_position] df.iloc[0, 1] # row 0, col index 1 df.iloc[0:3, 0:2] # rows 0-2, cols 0-1 (EXCLUSIVE end!) df.iloc[-1] # last row
Add / Rename / Drop columns
# Add column (like ALTER TABLE + UPDATE) df['full_name'] = df['first'] + ' ' + df['last'] df['age_plus_10'] = df['age'] + 10 # Rename columns df.rename(columns={'old_name': 'new_name'}, inplace=True) # Drop columns (like DROP COLUMN) df.drop(columns=['col1', 'col2'], inplace=True) # Reorder columns df = df[['col3', 'col1', 'col2']]
Basic WHERE SQL Pandas
SELECT * FROM patients WHERE age > 30; WHERE city = 'Delhi' AND age > 30; WHERE city = 'Delhi' OR city = 'Mumbai';
# Boolean mask df[df['age'] > 30] # AND โ & df[(df['city']=='Delhi') & (df['age']>30)] # OR โ | df[(df['city']=='Delhi') | (df['city']=='Mumbai')]
IN, NOT IN, BETWEEN, LIKE
WHERE city IN ('Delhi', 'Mumbai') WHERE city NOT IN ('Delhi') WHERE age BETWEEN 20 AND 40 WHERE name LIKE '%Kumar%'
df[df['city'].isin(['Delhi', 'Mumbai'])] df[~df['city'].isin(['Delhi'])] df[df['age'].between(20, 40)] df[df['name'].str.contains('Kumar')]
query() โ write SQL-like strings!
If you love SQL syntax, query() lets you filter in a readable way:
# Almost like SQL WHERE clause df.query("age > 30 and city == 'Delhi'") df.query("age between 20 and 40") # Use @ to reference Python variables min_age = 25 df.query("age > @min_age")
NULL checks
WHERE col IS NULL WHERE col IS NOT NULL
df[df['col'].isna()] df[df['col'].notna()]
Basic GROUP BY + Aggregation
SELECT city, COUNT(*) AS cnt, AVG(age) AS avg_age, MAX(age) AS max_age, SUM(salary) AS total FROM employees GROUP BY city;
df.groupby('city').agg( cnt = ('age', 'count'), avg_age = ('age', 'mean'), max_age = ('age', 'max'), total_sal = ('salary', 'sum') ).reset_index()
GROUP BY multiple columns + HAVING
SELECT city, dept, COUNT(*) AS cnt FROM employees GROUP BY city, dept HAVING COUNT(*) > 5;
result = df.groupby(['city', 'dept']).agg( cnt=('id', 'count') ).reset_index() # HAVING โ filter after groupby result[result['cnt'] > 5]
All Aggregation Functions
| SQL Function | Pandas agg string | Direct method |
|---|---|---|
| COUNT(*) | 'count' | df['col'].count() |
| SUM(col) | 'sum' | df['col'].sum() |
| AVG(col) | 'mean' | df['col'].mean() |
| MIN(col) | 'min' | df['col'].min() |
| MAX(col) | 'max' | df['col'].max() |
| COUNT(DISTINCT col) | 'nunique' | df['col'].nunique() |
| STDDEV() | 'std' | df['col'].std() |
| MEDIAN() | 'median' | df['col'].median() |
| PERCENTILE_CONT(0.25) | lambda x: x.quantile(0.25) | df['col'].quantile(0.25) |
value_counts() โ Quick GROUP BY COUNT
# Like: SELECT city, COUNT(*) FROM t GROUP BY city ORDER BY 2 DESC df['city'].value_counts() # With normalize=True โ percentages df['city'].value_counts(normalize=True) # Include NULLs df['city'].value_counts(dropna=False)
All JOIN Types with merge()
SELECT * FROM a INNER JOIN b ON a.id = b.id; LEFT JOIN b ON a.id = b.id; RIGHT JOIN b ON a.id = b.id; FULL OUTER JOIN b ON a.id = b.id;
pd.merge(a, b, on='id', how='inner') pd.merge(a, b, on='id', how='left') pd.merge(a, b, on='id', how='right') pd.merge(a, b, on='id', how='outer')
Join on different column names
SELECT * FROM orders o JOIN patients p ON o.patient_id = p.id;
pd.merge( orders, patients, left_on='patient_id', right_on='id', how='inner' )
Handling duplicate column names after merge
# If both tables have a column 'name', pandas adds _x and _y suffixes merged = pd.merge(df1, df2, on='id') # โ name_x (from df1), name_y (from df2) # Control suffixes merged = pd.merge(df1, df2, on='id', suffixes=('_left', '_right'))
UNION ALL / UNION โ concat()
SELECT * FROM table_2024 UNION ALL SELECT * FROM table_2025; UNION -- removes duplicates
pd.concat([df_2024, df_2025], ignore_index=True) pd.concat([...]).drop_duplicates()
ORDER BY, LIMIT, DISTINCT
SELECT * FROM t ORDER BY age DESC, name ASC LIMIT 10; SELECT DISTINCT city FROM t; SELECT COUNT(DISTINCT city) FROM t;
df.sort_values( by=['age', 'name'], ascending=[False, True] ).head(10) df['city'].unique() df.drop_duplicates(subset=['city']) df['city'].nunique()
TOP N per group โ Classic Interview Question!
-- Top 2 salaries per dept SELECT * FROM ( SELECT *, ROW_NUMBER() OVER( PARTITION BY dept ORDER BY salary DESC ) AS rn FROM employees ) t WHERE rn <= 2;
# Top 2 salaries per dept df.sort_values('salary', ascending=False)\ .groupby('dept')\ .head(2) # OR using rank() df['rn'] = df.groupby('dept')['salary']\ .rank(method='first', ascending=False) df[df['rn'] <= 2]
ROW_NUMBER / RANK / DENSE_RANK
ROW_NUMBER() OVER( PARTITION BY dept ORDER BY salary DESC) RANK() OVER(...) DENSE_RANK() OVER(...)
df['rn'] = df.groupby('dept')['salary']\ .rank(method='first', ascending=False) df['rnk'] = df.groupby('dept')['salary']\ .rank(method='min', ascending=False) df['drnk'] = df.groupby('dept')['salary']\ .rank(method='dense', ascending=False)
| rank() method | SQL Equivalent | Behavior on ties |
|---|---|---|
| 'first' | ROW_NUMBER() | No ties, sequential |
| 'min' | RANK() | Ties get same rank, gap after |
| 'dense' | DENSE_RANK() | Ties get same rank, no gap |
| 'average' | โ | Average of tied ranks |
LAG / LEAD
LAG(salary, 1) OVER( PARTITION BY dept ORDER BY hire_date) LEAD(salary, 1) OVER(...)
df['prev_salary'] = df.groupby('dept')['salary']\ .shift(1) # LAG(1) df['next_salary'] = df.groupby('dept')['salary']\ .shift(-1) # LEAD(1)
Running Totals โ Cumulative functions
SUM(amount) OVER( ORDER BY date ROWS UNBOUNDED PRECEDING) AVG(amount) OVER( ORDER BY date ROWS 6 PRECEDING)
df['running_sum'] = df['amount'].cumsum() df['running_max'] = df['amount'].cummax() df['running_min'] = df['amount'].cummin() # Rolling window (last 7 rows) df['7day_avg'] = df['amount'].rolling(7).mean()
transform() โ Window without collapsing rows
Use when you want the GROUP BY result attached back to every original row (like a window function).
-- dept avg salary on each row SELECT *, AVG(salary) OVER( PARTITION BY dept) AS dept_avg FROM employees;
df['dept_avg'] = df.groupby('dept')['salary']\ .transform('mean') # Now each row has its dept avg!
Detect, Drop, Fill NULLs
# Detect nulls df.isna() # bool DataFrame df.isna().sum() # count nulls per column df.isna().sum().sum() # total null count df['col'].isna() # null mask for one col # Drop rows with nulls df.dropna() # drop if ANY col is null df.dropna(subset=['col1', 'col2']) # only if these cols null df.dropna(thresh=3) # keep rows with at least 3 non-null # Fill nulls (like COALESCE / ISNULL) df.fillna(0) # fill all NaN with 0 df['age'].fillna(df['age'].mean()) # fill with mean df.fillna({'age': 0, 'name': 'Unknown'}) # col-specific fills df['col'].ffill() # forward fill (carry last value) df['col'].bfill() # backward fill
COALESCE equivalent
COALESCE(col1, col2, 0) NULLIF(col, '')
df['col1'].combine_first(df['col2']) df['col'].replace('', np.nan)
String methods with .str
# Access with .str accessor (like SQL string functions) df['name'].str.upper() # UPPER() df['name'].str.lower() # LOWER() df['name'].str.strip() # TRIM() df['name'].str.lstrip() # LTRIM() df['name'].str.len() # LEN() / LENGTH() df['name'].str.replace('a', 'b') # REPLACE() df['name'].str[0:3] # SUBSTRING(1,3) / LEFT(3) df['name'].str.contains('Kumar') # LIKE '%Kumar%' df['name'].str.startswith('A') # LIKE 'A%' df['name'].str.split('_') # STRING_SPLIT() df['name'].str.split('_', expand=True) # split into columns # Regex df['col'].str.extract(r'(\d+)') # extract first number df['col'].str.findall(r'\d+') # find all numbers df['col'].str.replace(r'[^a-z]', '', regex=True) # regex replace
Parse and extract date parts
# Parse string to datetime (always do this first!) df['date'] = pd.to_datetime(df['date']) df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d') # Extract parts via .dt accessor df['year'] = df['date'].dt.year # YEAR() df['month'] = df['date'].dt.month # MONTH() df['day'] = df['date'].dt.day # DAY() df['weekday'] = df['date'].dt.dayofweek # 0=Mon, 6=Sun df['quarter'] = df['date'].dt.quarter # Date arithmetic df['date'] + pd.Timedelta(days=7) (df['end_date'] - df['start_date']).dt.days # DATEDIFF # Floor / truncate date df['month_start'] = df['date'].dt.to_period('M') # DATE_TRUNC month
CASE WHEN โ np.where / map / apply
CASE WHEN age < 18 THEN 'minor' WHEN age < 60 THEN 'adult' ELSE 'senior' END AS age_group
# np.select for multiple conditions conditions = [ df['age'] < 18, df['age'] < 60 ] choices = ['minor', 'adult'] df['age_group'] = np.select(conditions, choices, default='senior')
apply() โ Custom function per row
# Apply to a column (fast) df['age_squared'] = df['age'].apply(lambda x: x**2) # Apply to each row (axis=1) โ slower but flexible df['label'] = df.apply( lambda row: f"{row['name']}_{row['dept']}", axis=1 ) # Custom function def risk_score(row): if row['age'] > 60 and row['smoking'] == 'Y': return 'high' return 'low' df['risk'] = df.apply(risk_score, axis=1)
map() โ Replace values like a lookup table
LEFT JOIN code_lookup cl ON t.code = cl.code
lookup = {'M': 'Male', 'F': 'Female'}
df['gender_full'] = df['gender'].map(lookup)
pivot_table() โ like Excel Pivot / PIVOT in SQL
# Wide format: one row per dept, cols = each city's avg salary df.pivot_table( index='dept', columns='city', values='salary', aggfunc='mean', fill_value=0 )
melt() โ Wide to Long (UNPIVOT)
-- UNPIVOT months to rows SELECT id, month, sales FROM t UNPIVOT (sales FOR month IN (jan, feb, mar));
df.melt( id_vars=['id'], value_vars=['jan', 'feb', 'mar'], var_name='month', value_name='sales' )
๐ Speed Hierarchy (fastest โ slowest)
| Method | When to use | Speed |
|---|---|---|
| Vectorized ops (df['a'] + df['b']) | Arithmetic, comparisons | โกโกโก Fastest |
| NumPy functions (np.where) | CASE WHEN logic | โกโกโก |
| .str methods | String operations | โกโก |
| groupby + agg | Aggregations | โกโก |
| apply() on column | Custom logic per value | โก |
| apply(axis=1) | Custom logic per row | ๐ Slow |
| iterrows() / itertuples() | Avoid in production | ๐๐ Slowest |
๐พ Memory Optimization
# Check memory usage df.memory_usage(deep=True).sum() / 1e6 # in MB # Downcast numeric types df['age'] = pd.to_numeric(df['age'], downcast='integer') # Convert low-cardinality strings to category (like an index/enum) df['city'] = df['city'].astype('category') # Read only needed columns df = pd.read_csv('big.csv', usecols=['id', 'amount'])
๐ Key Best Practices
# 1. Copy to avoid SettingWithCopyWarning filtered = df[df['age'] > 30].copy() # 2. Always reset_index after groupby result = df.groupby('dept').agg(...).reset_index() # 3. Use .assign() for chaining df = (df .assign(full_name=lambda x: x['first']+' '+x['last']) .assign(age_group=lambda x: np.where(x['age']>60, 'senior', 'adult')) .dropna(subset=['id']) )
Top 20 Must-Know Operations
# 1. Find duplicate rows df[df.duplicated()] # show duplicates df.drop_duplicates() # remove all dups df.drop_duplicates(subset=['id'], keep='last') # keep latest # 2. Top N rows per group df.sort_values('salary', ascending=False).groupby('dept').head(2) # 3. Percent of total per group df['pct'] = df.groupby('dept')['sales'].transform(lambda x: x / x.sum() * 100) # 4. Find second highest salary df['salary'].nlargest(2).iloc[-1] df['salary'].drop_duplicates().nlargest(2).min() # 5. Count nulls per column df.isna().sum() # 6. Correlation between columns df[['age', 'salary']].corr() # 7. Running total within group df['cumsum'] = df.groupby('dept')['sales'].cumsum() # 8. Month-over-month change df.sort_values('date', inplace=True) df['mom_change'] = df['revenue'] - df['revenue'].shift(1) df['mom_pct'] = df['revenue'].pct_change() * 100 # 9. Flatten a column with lists df.explode('tags_list') # 10. Cross join df1['key'] = 1; df2['key'] = 1 pd.merge(df1, df2, on='key').drop('key', axis=1) # 11. Pivot (wide format) df.pivot_table(index='dept', columns='month', values='sales', aggfunc='sum') # 12. Anti-join (rows in A not in B) merged = pd.merge(a, b, on='id', how='left', indicator=True) anti = merged[merged['_merge'] == 'left_only'] # 13. Conditional column update df.loc[df['age'] > 60, 'category'] = 'senior' # 14. Find rows where col A > col B df[df['actual'] > df['target']] # 15. String split into multiple columns df[['first', 'last']] = df['full_name'].str.split(' ', n=1, expand=True) # 16. Normalize / standardize a column df['norm'] = (df['val'] - df['val'].min()) / (df['val'].max() - df['val'].min()) # 17. Multiple conditions โ new column df['flag'] = np.where((df['age']>50) & (df['smoker']=='Y'), 1, 0) # 18. Resample time series (like DATE_TRUNC + GROUP BY) df.set_index('date').resample('M')['sales'].sum() # 19. Stack / Unstack MultiIndex pivot.stack() # wide โ long long.unstack() # long โ wide # 20. Assign multiple columns at once df = df.assign( full_name = df['first'] + ' ' + df['last'], age_bucket = pd.cut(df['age'], bins=[0,18,60,100], labels=['minor','adult','senior']) )
Common Interview Traps โ ๏ธ
| Trap | Wrong | Correct |
|---|---|---|
| Single vs double brackets | df['col'] โ Series | df[['col']] โ DataFrame |
| AND/OR in filters | df[a > 1 and b > 2] | df[(a > 1) & (b > 2)] |
| loc vs iloc end | iloc[0:3] โ 0,1,2 | loc[0:3] โ 0,1,2,3 |
| inplace forgotten | df.sort_values('a') (no effect) | df.sort_values('a', inplace=True) |
| groupby index | result.dept (KeyError) | result.reset_index().dept |
| chained assignment | df[mask]['col'] = val โ warning | df.loc[mask, 'col'] = val |
| Comparing to NaN | df[df['a'] == None] | df[df['a'].isna()] |