The Data Analysis Workflow
Professional data scientists follow a repeatable workflow for every project:
1. Ask a question: what do you want to find out?
2. Collect data: CSV files, APIs, manual entry
3. Clean data: fix missing values, convert types
4. Analyse: statistics, filtering, grouping
5. Visualise: charts that reveal patterns
6. Communicate findings: tell a clear story
Step 1 Is Most Important
A clear question guides every other step. "What is the average temperature?" is a clear question. "Tell me something about this data" is not.
Cleaning & Filtering Data
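Real data is messy. Common issues include missing values (empty strings, ""), wrong types (numbers stored as strings), and outliers. List comprehensions handle filtering and conversion. Check for missing values before converting, because float("") raises a ValueError:
# Remove empty entries
clean = [v for v in data if v != ""]
# Convert column to numbers, skipping rows with a missing temp
temps = [float(row["temp"]) for row in rows if row["temp"] != ""]
# Filter: only rows where temp > 70 (rows with a missing temp are excluded)
hot_days = [row for row in rows if row["temp"] != "" and float(row["temp"]) > 70]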
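The comprehensions above only guard against empty strings, and the outliers mentioned earlier are not handled at all. A minimal sketch of a more defensive version follows, assuming rows are dictionaries like the ones csv.DictReader produces. The helper names to_float and drop_outliers, the sample rows, and the 1.5 × IQR cutoff are illustrative assumptions, not part of the lesson's data. statistics.quantiles needs Python 3.8 or later.

```python
import statistics

def to_float(value):
    """Return value as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def drop_outliers(values, k=1.5):
    """Keep values within k * IQR of the quartiles (a common rule of thumb)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]

# Hypothetical rows, made up for illustration
rows = [
    {"temp": "68"}, {"temp": "71.5"}, {"temp": ""}, {"temp": "n/a"},
    {"temp": "70"}, {"temp": "69"}, {"temp": "72"}, {"temp": "67"},
    {"temp": "73"}, {"temp": "700"},
]

# Convert every value, then drop the ones that could not be parsed
parsed = [to_float(row["temp"]) for row in rows]
temps = [t for t in parsed if t is not None]

# Remove values far outside the typical range (here, the 700 reading)
typical = drop_outliers(temps)
print("Parsed:", temps)
print("Without outliers:", typical)
```

Whether a value like 700 is a typo or a real reading depends on the data, so it is usually worth looking at the dropped values before discarding them.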
Putting It All Together
A complete analysis loads data, cleans it, computes statistics, and creates charts — all in one Python script:
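import csv
import statistics
import matplotlib.pyplot as plt

# Load the temp_f column, skipping rows where the value is missing
temps = []
with open("weather.csv", newline="") as f:
    for row in csv.DictReader(f):
        if row["temp_f"] != "":
            temps.append(float(row["temp_f"]))

# Summary statistics
print("Mean:", statistics.mean(temps))
print("Median:", statistics.median(temps))

# Histogram of the temperature distribution
plt.hist(temps, bins=8, color="steelblue")
plt.title("Temperature Distribution")
plt.show()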
Tell a Story
The best data analyses end with a clear sentence: "Cities in the south averaged 15°F warmer than cities in the north." Numbers alone don't communicate — interpretation does.
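A finding like the one quoted above usually comes out of a grouping step. Here is a rough sketch, assuming each row carries a region label and a temperature. The column names region and temp_f, the helper mean_by_group, and the sample rows are all hypothetical, and the values are made up for illustration rather than taken from real measurements.

```python
import statistics
from collections import defaultdict

def mean_by_group(rows, key, value):
    """Group rows by row[key] and return {group: mean of row[value]}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row[value]))
    return {name: statistics.mean(vals) for name, vals in groups.items()}

# Made-up sample rows, shaped like csv.DictReader output
rows = [
    {"region": "south", "temp_f": "80"},
    {"region": "south", "temp_f": "76"},
    {"region": "north", "temp_f": "62"},
    {"region": "north", "temp_f": "64"},
]

means = mean_by_group(rows, "region", "temp_f")
diff = means["south"] - means["north"]

# Turn the numbers into a one-sentence finding
finding = f"Cities in the south averaged {diff:.0f}°F warmer than cities in the north."
print(finding)
```

The sentence template assumes the south is the warmer group. With real data, check the sign of the difference before choosing the wording.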