200+ Best Data Science Projects for High School Students (By Domain and Skill Level)
At the high school level, a good data science project teaches you something a classroom lesson never can: how to handle a question where you don't already know the answer. Most students start with the Titanic dataset because it's the obvious first step, and that's fine as a learning exercise, but the projects that actually build skill, and the ones that stand out in a portfolio or college application, ask a real question about real data.
This post covers 200+ current data science project ideas for high school students, organized by domain and difficulty, from your first exploratory analysis to more advanced machine learning and NLP work. If you want structured guidance in turning any of these into a polished, defensible project rather than a tutorial clone, a mentored program like Veritas AI pairs you directly with a data science or AI researcher to build something original. More on that at the end!
What makes a data science project worth doing at the high school level?
Before the list, a quick framing question that will save you a lot of wasted effort: are you trying to answer something you don't already know, or are you trying to demonstrate a technique? Both are valid reasons to do a project, but they should be approached differently.
If you're learning a technique, pick a clean, well-documented dataset and focus on getting the mechanics right. If you're trying to answer a real question, expect the data to be messier, expect your first approach to need revision, and expect the most interesting part of the project to be explaining what you found and why it matters. The second kind of project is what makes a portfolio, application, or competition submission memorable.
Exploratory Data Analysis (EDA) Projects
EDA is the foundation of every data science project and the skill beginners most often undervalue. These projects focus on finding patterns, generating questions, and visualizing data clearly.
Exploring global life expectancy trends and their correlation with healthcare spending using World Bank data
Analyzing decades of Billboard Hot 100 data to find patterns in genre, tempo, and song length over time
Investigating the relationship between screen time and reported sleep quality using a public survey dataset
Exploring NBA player statistics to identify which metrics best predict All-Star selection
Analyzing global CO2 emissions data to compare trends across developed and developing economies
Investigating crime statistics in a major city to find seasonal or geographic patterns
Exploring Spotify's audio feature dataset to characterize what makes a song "danceable" versus "energetic"
Analyzing global coffee production and consumption data to map supply chains and trends
Investigating Olympic medal counts over time and their correlation with GDP and population
Exploring restaurant inspection data from a major city's open data portal to find patterns in violation types
Analyzing airline on-time performance data to identify which airports or routes are most prone to delay
Investigating global internet adoption rates and their relationship to economic development indicators
Exploring a decade of box office data to understand what genre and budget combinations perform best
Analyzing housing price trends across U.S. metro areas using Zillow's public datasets
Investigating global plastic waste generation and mismanagement using Our World in Data datasets
Data Cleaning Projects
These projects focus on the unglamorous but essential skill of taking messy, real-world data and making it usable. Deliberately choose datasets that are imperfect.
Cleaning and standardizing a scraped dataset of job postings with inconsistent salary and location formatting
Reconciling multiple government datasets on the same topic (e.g., unemployment) that use different geographic boundaries
Cleaning a multi-year dataset of restaurant reviews with inconsistent date formats, duplicate entries, and missing ratings
Standardizing a dataset of global city names and country codes pulled from several different sources
Cleaning a public health dataset with significant missing data and comparing imputation strategies
Reconciling currency and unit inconsistencies across an international economic dataset
Cleaning a social media dataset with duplicate accounts, bot-like activity, and inconsistent timestamp formats
Standardizing inconsistent school district names across multiple years of public education data
Cleaning a dataset of historical weather observations with sensor errors and missing readings
Merging and reconciling product data from multiple e-commerce sources with different naming conventions
Regression and Predictive Modeling Projects
Predicting house prices using property features, with a focus on explaining which features matter most and why
Building a model to predict a city's air quality index from traffic, weather, and industrial activity data
Predicting student performance from attendance, study habits, and demographic survey data
Modeling the relationship between marketing spend and sales for a small business dataset
Predicting NBA player salaries from performance statistics and comparing to actual market value
Building a regression model to predict crop yield from weather and soil data
Predicting restaurant ratings from menu pricing, location, and review sentiment features
Modeling energy consumption in buildings based on size, occupancy, and weather data
Predicting flight delay duration (not just likelihood) using historical airline performance data
Building a model to predict a country's life expectancy from healthcare and economic indicators
Predicting used car prices from mileage, age, brand, and condition features
Modeling the relationship between social media engagement metrics and follower growth rate
Classification Projects
Building a classifier to predict loan default risk from financial and demographic features
Classifying news headlines as clickbait or legitimate using text features
Predicting customer churn for a subscription service using behavioral data
Classifying wine quality (good, average, poor) from chemical composition features
Building a spam email classifier and comparing Naive Bayes, logistic regression, and random forest performance
Classifying handwritten digits using the MNIST dataset and comparing model architectures
Predicting which patients are at risk for diabetes using clinical features from a public health dataset
Classifying mushroom species as edible or poisonous from physical characteristics
Building a model to predict whether a job applicant will be hired based on resume features (and auditing it for bias)
Classifying customer reviews by sentiment and comparing accuracy across different feature representations
Predicting whether a startup will receive follow-on funding using Crunchbase data
Building a classifier to detect fraudulent credit card transactions in an imbalanced dataset
Clustering and Unsupervised Learning Projects
Segmenting customers into behavioral groups using purchase history data and K-Means clustering
Clustering countries by socioeconomic indicators to identify natural groupings beyond standard regional categories
Using clustering to identify distinct genres or styles within a large music dataset based on audio features
Segmenting news articles into topic clusters using unsupervised text clustering
Clustering NBA players by playing style using performance statistics rather than position labels
Using anomaly detection to identify unusual patterns in network traffic or transaction data
Clustering U.S. counties by demographic and economic profile to find unexpected similarities
Applying dimensionality reduction (PCA, t-SNE, UMAP) to visualize high-dimensional genetic or survey data
Clustering customer support tickets to identify common issue categories without predefined labels
Using clustering to group similar recipes based on ingredient profiles
Time Series and Forecasting Projects
Forecasting a city's monthly electricity demand using historical consumption and weather data
Predicting stock price movement direction (not exact price) using technical indicators, with honest discussion of limitations
Forecasting retail sales for a seasonal product category using historical transaction data
Modeling and forecasting COVID-19 case trends using an SIR-style model fit to historical data
Forecasting traffic congestion patterns at a specific intersection using historical sensor data
Predicting a river's water level using historical flow and rainfall data for flood risk assessment
Forecasting airline ticket prices over time to identify the best time to book
Modeling seasonal unemployment trends and comparing forecasts to actual subsequent data
Forecasting social media engagement trends for a brand or public figure using historical post data
Predicting future temperature anomalies using historical climate data and comparing to published climate models
Natural Language Processing (NLP) Projects
Building a sentiment classifier for product reviews and analyzing which words drive the strongest sentiment signal
Creating a text summarizer for news articles using extractive summarization techniques
Building a fake news classifier and analyzing which linguistic features are most predictive
Analyzing the sentiment of presidential debate transcripts over time and across candidates
Building a chatbot that answers questions from a specific knowledge base using simple retrieval methods
Classifying movies by genre based on plot summary text alone
Building a resume keyword matcher that scores resumes against job description requirements
Analyzing the readability level of news articles across different publications using standard readability metrics
Building a topic model (LDA) to identify themes in a large collection of customer feedback
Detecting toxic or harassing language in online comments and evaluating model fairness across demographic proxies
Building a simple language translator using sequence-to-sequence modeling on a small parallel corpus
Analyzing how language use in song lyrics has changed across decades for a specific genre
Computer Vision Projects
Building an image classifier to distinguish between similar-looking species (birds, dog breeds, plant types)
Building a model to detect whether a chest X-ray shows signs of pneumonia, using a public medical imaging dataset
Creating a handwritten digit recognizer and testing its robustness to different handwriting styles
Building a model to classify food images by cuisine type
Detecting and counting objects in images (cars in a parking lot, people in a crowd) using object detection
Building a facial emotion recognition model and evaluating its accuracy across different demographic groups
Creating a model that classifies satellite images by land use type (urban, agricultural, forest)
Building a model to detect plant disease from leaf images
Creating an image-based style classifier for art (impressionist, cubist, realist) using a museum's open dataset
Building a model to detect deepfake images and analyzing what visual artifacts it relies on
Sports Analytics Projects
Building a model to predict NBA game outcomes using team and player statistics
Analyzing whether "clutch" performance in basketball is statistically distinguishable from random variation
Predicting NFL draft success from college performance statistics
Building an expected goals (xG) model for soccer using shot location and situation data
Analyzing whether home field advantage has changed over time across different sports
Predicting tennis match outcomes using player ranking, surface type, and historical head-to-head data
Building a fantasy football points predictor using weekly player performance data
Analyzing pitch sequencing patterns in baseball to identify pitcher tendencies
Building a model to evaluate which basketball shot locations produce the best expected value
Healthcare and Public Health Projects
Analyzing the relationship between food access (food deserts) and obesity rates by county
Building a model to predict hospital readmission risk from patient intake data
Analyzing vaccination rate trends and their correlation with disease incidence at the county level
Investigating disparities in healthcare access using publicly available insurance coverage data
Building a model to predict diabetes risk from lifestyle and demographic survey data
Analyzing mental health crisis trends among adolescents using CDC survey data
Investigating the relationship between air pollution exposure and asthma hospitalization rates
Building a model to estimate the spread of an infectious disease through a simulated contact network
Analyzing maternal health outcome disparities across different U.S. regions
Investigating whether telehealth adoption correlates with improved patient outcomes using available data
Finance and Economics Projects
Building a portfolio optimization model using historical stock return and volatility data
Analyzing the relationship between interest rate changes and stock market sector performance
Building a credit risk model and comparing its decisions to actual loan outcomes
Analyzing cryptocurrency price volatility and its correlation with trading volume and news sentiment
Investigating whether ESG (environmental, social, governance) scores correlate with financial performance
Building a simple algorithmic trading strategy backtester and evaluating its risk-adjusted returns honestly
Analyzing the relationship between minimum wage changes and small business employment at the state level
Investigating wealth inequality trends using historical income distribution data
Building a model to predict small business loan default risk
Analyzing how inflation has affected purchasing power across different income brackets over time
Environmental and Climate Data Science Projects
Analyzing global temperature anomaly data and comparing trends to climate model predictions
Building a model to predict wildfire risk from weather, vegetation, and historical fire data
Investigating deforestation trends using satellite imagery and land use change data
Analyzing the relationship between urban tree coverage and local temperature using city-level data
Building a model to forecast renewable energy output (solar, wind) from weather data
Investigating ocean temperature trends and their correlation with coral bleaching events
Analyzing air quality trends in major cities before and after specific policy changes
Building a model to predict water scarcity risk by region using rainfall and population data
Investigating the relationship between EV adoption rates and charging infrastructure availability by state
Analyzing global biodiversity loss trends using the Living Planet Index and related datasets
Social Science and Survey Data Projects
Analyzing General Social Survey data to track changes in public opinion on a specific issue over decades
Investigating the relationship between social media use and self-reported life satisfaction using survey data
Analyzing voter turnout patterns across demographic groups using public election data
Building a model to predict political affiliation from publicly available, non-invasive survey responses
Investigating gender pay gap persistence after controlling for industry, role, and experience
Analyzing changes in family structure and household composition using Census data over time
Investigating the relationship between education spending and standardized test outcomes by state
Analyzing public trust in institutions over time using Pew Research survey data
Investigating residential segregation patterns using Census tract-level demographic data
Analyzing the relationship between commute time and self-reported wellbeing using survey data
Recommendation Systems Projects
Building a movie recommendation system using collaborative filtering on the MovieLens dataset
Creating a book recommendation engine based on plot similarity and reader rating patterns
Building a music recommendation system based on audio feature similarity rather than collaborative filtering
Creating a recipe recommender that suggests meals based on available ingredients and dietary restrictions
Building a course recommendation system for students based on past enrollment and performance patterns
Creating a news article recommender and evaluating it for filter bubble effects
Building a product recommendation system using purchase history and item similarity
A/B Testing and Experimental Design Projects
Designing and running an A/B test on a simple website change and analyzing statistical significance correctly
Analyzing a public A/B test dataset (many companies publish case studies) and critiquing the experimental design
Designing a survey experiment to test whether question wording affects response patterns
Running a controlled study on whether background music affects task performance, with proper statistical analysis
Analyzing the statistical power of a hypothetical experiment and determining the sample size needed for a reliable result
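For the power and sample-size idea above, here is a minimal sketch of how the calculation might look, assuming a two-sided test comparing two conversion rates with the standard normal approximation. The function name `sample_size_per_group` and the example rates are made up for illustration, and a real project should check the result against a dedicated statistics library.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-proportion z-test."""
    if p_control == p_variant:
        raise ValueError("The two rates must differ to define an effect size.")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_power = NormalDist().inv_cdf(power)          # quantile matching the desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Hypothetical example: detecting a lift from a 10% to a 12% sign-up rate.
n_small_lift = sample_size_per_group(0.10, 0.12)
# A larger lift (10% to 15%) needs far fewer participants per group.
n_large_lift = sample_size_per_group(0.10, 0.15)
print(n_small_lift, n_large_lift)
```

The useful takeaway is the shape of the relationship: the required sample grows with the inverse square of the effect you want to detect, which is why small website tweaks often need thousands of visitors before a result is reliable.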
Network Analysis Projects
Analyzing a social network dataset to identify the most influential nodes using centrality measures
Building a co-authorship network from academic papers in a specific field and identifying key collaboration hubs
Analyzing airline route networks to identify hub airports and network vulnerability to disruption
Building a network model of disease spread through a simulated contact network and testing intervention strategies
Analyzing a citation network to trace the influence of a foundational academic paper over time
Audio and Signal Processing Projects
Building a music genre classifier using audio features extracted with a library like Librosa
Creating a speech emotion recognition model using vocal feature extraction
Building a simple speaker identification system that distinguishes between a small set of known voices
Analyzing heart rate variability data from wearable devices to detect patterns related to stress or activity
Building a model to classify environmental sounds (sirens, alarms, traffic) for accessibility applications
Data Science x Specific Industries
Analyzing customer support ticket data to identify the most common and most costly issue categories for a business
Building a churn prediction model for a subscription business and estimating the revenue impact of reducing churn by a target percentage
Analyzing supply chain data to identify bottlenecks in a simulated logistics network
Building a demand forecasting model for a retail business to optimize inventory decisions
Analyzing employee survey data to identify the strongest predictors of job satisfaction and retention
Building a pricing elasticity model for a consumer product using historical sales and price data
Analyzing real estate listing data to identify which features most influence time-on-market
Building a model to predict which marketing channel drives the highest-value customers
Open-Ended and Original Research Projects
These are the most ambitious project types and the ones most likely to result in a genuinely novel contribution, suitable for science fairs, research journals, or a flagship portfolio piece.
Using public datasets to test whether a claim made in a news article or popular book actually holds up statistically
Replicating a well-known published study's analysis using publicly available data and checking whether the conclusions still hold
Building a novel composite index (similar to the Human Development Index) for a specific question you care about, and validating it against known outcomes
Investigating a question specific to your own school or community using data you collect yourself, with proper survey design
Combining two unrelated public datasets to investigate a question neither dataset could answer alone (e.g., weather data and crime data, social media data and stock prices)
Building and validating a predictive model for an outcome that genuinely matters to a community you're part of (school, sports team, local nonprofit)
Conducting a meta-analysis of multiple published studies on a topic and synthesizing their findings statistically
How should you choose the right data science project in high school?
With this many options, the hardest part is often deciding where to start. A few questions worth asking yourself:
Does this project use a skill I haven't tried yet, or does it deepen a skill I'm still shaky on?
Is there a real dataset available, and is it large and clean enough (or appropriately messy) for what I want to do?
Can I clearly state, in one sentence, the question I'm trying to answer?
If you can't answer that last question clearly, the project needs more thinking before you open a notebook. The strongest data science work, at any level, starts with a sharp question.
How can you get feedback on your work?
The single biggest difference between a project that looks like a tutorial exercise and one that looks like real research is feedback from someone who knows what they're looking at. A mentor can tell you when your model's high accuracy is actually a sign of data leakage, when your visualization is technically correct but misleading, or when your interesting result needs one more robustness check before you can trust it.
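To make the data leakage point concrete, here is a small sketch using only Python's standard library. It assumes one common form of leakage, duplicate records that land on both sides of a train/test split, and uses synthetic data with a toy nearest-neighbor model. All names and numbers are illustrative.

```python
import random


def nearest_neighbor_label(train_rows, x):
    """Return the label of the training row whose feature is closest to x."""
    return min(train_rows, key=lambda row: abs(row[0] - x))[1]


def accuracy(train_rows, test_rows):
    """Fraction of test rows whose nearest training neighbor has the same label."""
    correct = sum(nearest_neighbor_label(train_rows, x) == y for x, y in test_rows)
    return correct / len(test_rows)


rng = random.Random(0)

# Synthetic data: one random feature and a coin-flip label.
# The label is unrelated to the feature, so an honest model should score near 50%.
rows = [(rng.random(), rng.randint(0, 1)) for _ in range(2000)]

# Leaky setup: every row appears twice (think duplicate records), then we split.
duplicated = rows * 2
rng.shuffle(duplicated)
leaky_train, leaky_test = duplicated[:3200], duplicated[3200:]

# Clean setup: split the unique rows, so no test row also sits in the training set.
clean_train, clean_test = rows[:1600], rows[1600:]

leaky_accuracy = accuracy(leaky_train, leaky_test)
clean_accuracy = accuracy(clean_train, clean_test)
print(f"with leakage: {leaky_accuracy:.2f}   without: {clean_accuracy:.2f}")
```

Because the labels are pure noise, any score far above 50% can only come from the model having already seen the test rows. That is the kind of too-good-to-be-true result worth questioning before you report it.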
A mentor can be a teacher at school, an online community such as Reddit, or even a senior classmate, but mentorship combined with a structured program is often the most reliable way to see a project through.
Veritas AI offers two structured paths depending on where you are. The AI Scholars program is a 10-week, mentor-led bootcamp where you build a real project in a small group, covering Python, machine learning fundamentals, and data analysis from the ground up. If you're ready to go deeper, the AI Fellowship pairs you one-on-one with a mentor from a top university for 12 to 15 weeks of original research. Past projects have spanned healthcare, finance, and computer vision, and Veritas AI's publication team provides direct support to help you submit your work to a high school research journal.
If one of the ideas on this list is the one you keep coming back to, that's usually a sign it's worth doing properly, with someone who can help you do it right.
Explore Veritas AI's programs here!
Frequently Asked Questions
What are good data science project ideas for high school students?
The strongest projects combine an accessible dataset with a genuinely open question, not just a demonstration of a technique. EDA and regression projects are the most accessible starting points. NLP, computer vision, and time series projects offer more advanced challenges once you have the fundamentals down. The best topic is usually one connected to something you're already interested in, whether that's sports, healthcare, music, or your own community.
Where can I find datasets for data science projects?
Kaggle, Google's Dataset Search, Data.gov, the UCI Machine Learning Repository, FiveThirtyEight's data portal, and World Bank Open Data are all strong starting points. For more original projects, government open data portals (city and state level) and public APIs from organizations like NOAA, CDC, and the Bureau of Labor Statistics offer less commonly used data that can lead to more original findings.
How long should a data science project take?
A solid EDA or regression project can be completed in one to two weeks with focused effort. Machine learning projects with proper evaluation and tuning typically take two to four weeks. The most ambitious original research projects, the kind suitable for a science fair or journal submission, often take two to three months including iteration and write-up.
Do I need to know advanced math to do data science projects?
A working understanding of statistics (distributions, hypothesis testing, correlation versus causation) is essential for almost every project on this list. Linear algebra and calculus become more important as you move into deep learning and more advanced model architectures, but plenty of genuinely strong projects, especially in EDA, regression, and classification, are accessible with a solid statistics foundation and working Python skills.
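As an illustration of the correlation-versus-causation point, the sketch below builds made-up data in which temperature drives both ice cream sales and pool visits, while neither affects the other. The variable names, coefficients, and noise levels are all assumptions chosen for the example, not real measurements.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(42)

# Made-up daily data: temperature drives both quantities; neither affects the other.
temperature = [rng.uniform(10, 35) for _ in range(3650)]
ice_cream_sales = [20 * t + rng.gauss(0, 60) for t in temperature]
pool_visits = [8 * t + rng.gauss(0, 30) for t in temperature]

r_all_days = pearson(ice_cream_sales, pool_visits)

# Hold the confounder roughly fixed by keeping only days in a narrow temperature band.
band = [(s, v) for t, s, v in zip(temperature, ice_cream_sales, pool_visits) if 22 <= t < 24]
r_similar_days = pearson([s for s, _ in band], [v for _, v in band])

print(f"all days: r = {r_all_days:.2f}   similar-temperature days: r = {r_similar_days:.2f}")
```

Across all days the two series look strongly related, but among days with similar temperatures the relationship largely disappears. Checking whether a correlation survives once a plausible third factor is held fixed is a habit worth building into any project on this list.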
Can a high school data science project be published?
Yes. Journals specifically for high school research, such as the Journal of Emerging Investigators, accept rigorous data science research. Original projects with a clear research question, honest methodology, and a defensible conclusion have a real shot, particularly when developed with mentorship from someone who can help you meet the bar for publication.
P.S. Once you've picked a data science project, our guide to building a data science portfolio covers how to structure, host, and present your work. We've also put together a list of machine learning projects for high school students if you want to go deeper on the ML side specifically, and a roundup of data science programs for high school students in California if you're looking for structured, in-person options at the biggest tech hub in the world!
