Dotun Opasina


Building JapaAdvisorAI for Agentic Immigration Assistance with LangGraph

March 22, 2025 by Oladotun Opasina

In my latest hobbyist project, JapaAdvisorAI, I engineered an agent-driven AI assistant that leverages stateful orchestration, large language models (LLMs), and real-time information retrieval to provide dynamic, high-fidelity immigration guidance.

This project is not just a chatbot—it embodies AI as an autonomous agent, making real-time decisions, adapting its strategy based on user interactions, and leveraging external tools for knowledge augmentation. Below, I outline the technical stack, architectural principles, and agentic AI design that underpin JapaAdvisorAI’s intelligence.

Tech Stack for JapaAdvisorAI: From Frontend to Backend

Building an AI-enabled product involved a whole host of learnings and the use of various tools and products. I relied on AI question-and-answering tools for code generation and UI generation, but still spent a good amount of time getting things to work. Below are the different categories of tools used:

  1. Frontend & API: Javascript/Html/Css (UI), FastAPI (backend API).

  2. AI Orchestration: LangChain (prompt management), LangGraph (stateful conversation flow).

  3. LLM Integration: ChatOpenAI (response generation, query refinement).

  4. State & Storage: PostgreSQL (data storage).

  5. Knowledge Retrieval: TavilySearchResults (real-time search), content extraction for summaries.

  6. Cloud & Deployment: Digital Ocean Serverless tier, Docker.

  7. Code generation: GitHub Copilot, Galileo UI, ChatGPT, and DeepSeek.

Architecting an Agentic AI System with LangGraph

Unlike conventional rule-based chatbots, JapaAdvisorAI employs LangGraph to model a multi-agent system where conversational nodes represent discrete agents responsible for distinct cognitive functions. This modular approach enhances adaptability, ensures context retention, and enables autonomous decision-making within the AI system.

The key components include:

  • Multi-Agent Workflow with LangGraph’s StateGraph

At the core of JapaAdvisorAI is LangGraph’s StateGraph, which orchestrates the AI’s decision-making process by defining functional agents as graph nodes. This ensures:

  1. Dynamic conversation routing based on user queries

  2. Adaptive state transitions based on contextual understanding

  3. Parallel execution of reasoning and retrieval tasks

Each node in the graph represents a specialized AI function, forming a decomposed agentic system rather than a monolithic model.
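To make the orchestration concrete, here is a minimal sketch of how such a StateGraph could be assembled. The state schema, node names, and node bodies below are illustrative placeholders, not JapaAdvisorAI's actual implementation.

```python
from typing import Annotated, TypedDict

from langchain_core.messages import BaseMessage
from langgraph.graph import END, START, StateGraph
from langgraph.graph.message import add_messages

# Illustrative state: the running conversation plus the refined query.
class AdvisorState(TypedDict):
    messages: Annotated[list[BaseMessage], add_messages]
    refined_query: str

def preprocess_node(state: AdvisorState) -> dict:
    # Capture intent and inject prior context (placeholder logic).
    return {}

def refine_node(state: AdvisorState) -> dict:
    # Restructure the latest user question before it reaches the LLM.
    return {"refined_query": state["messages"][-1].content.strip()}

def respond_node(state: AdvisorState) -> dict:
    # The real system would call ChatOpenAI here with the refined query.
    return {"messages": [("ai", f"Answering: {state['refined_query']}")]}

builder = StateGraph(AdvisorState)
builder.add_node("preprocess", preprocess_node)
builder.add_node("refine", refine_node)
builder.add_node("respond", respond_node)
builder.add_edge(START, "preprocess")
builder.add_edge("preprocess", "refine")
builder.add_edge("refine", "respond")
builder.add_edge("respond", END)

app = builder.compile()
result = app.invoke({"messages": [("user", "What documents do I need for a US visa?")],
                     "refined_query": ""})
```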

  • Defining Agent Roles and Responsibilities

The AI system consists of distinct functional agents that operate in a coordinated manner:

Preprocessing Agent (Initialization & Context Injection)

    1. preprocess_node: Captures user intent, extracts contextual details, and initializes the state with prior interactions.

    2. Agent Role: Ensures that every user session retains contextual continuity, mitigating information loss between queries.

Query Refinement Agent (Enhancing User Input)

    1. refine_tool: Utilizes prompt engineering to restructure and disambiguate user queries before passing them to the core LLM.

    2. Agent Role: Acts as a pre-LLM filter, ensuring that user questions are structured optimally for precision in AI responses.

Conversational AI Agent (LLM Integration & Response Generation)

    1. LLM backbone: ChatOpenAI API processes refined queries and generates contextual, multi-turn responses.

    2. Agent Role: Dynamically adapts responses based on conversation history and prior decisions in the state graph.

Information Retrieval Agent (Real-Time Search & Augmented Responses)

    1. generate_content_from_link: Extracts and summarizes retrieved documents for inclusion in responses. Uses TavilySearchResults to fetch relevant immigration information from trusted sources.

    2. Agent Role: Functions as an external knowledge augmenter, ensuring that responses remain factually accurate and policy-compliant.

Intent Extraction Agent (Conditional Routing & Adaptive Transitions)

    1. Intent extraction: Implements conditional_edge logic to route user queries dynamically based on complexity and knowledge gaps.

    2. Agent Role: Determines whether to proceed with response generation, follow-up questioning, or external retrieval, ensuring a multi-step reasoning framework.

Each agent operates independently yet collaboratively, forming a distributed AI system where decisions emerge from collective reasoning rather than a single-pass LLM call. A sketch of how this conditional routing can be wired follows.
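Below is a rough sketch of how the intent-extraction step could route between these agents with LangGraph's conditional edges. The routing criteria and node bodies are hypothetical stand-ins for the production logic.

```python
from typing import TypedDict

from langgraph.graph import END, START, StateGraph

class RouterState(TypedDict):
    refined_query: str

def route_after_intent(state: RouterState) -> str:
    # Hypothetical criteria: knowledge-gap queries go to retrieval,
    # very short ones trigger a follow-up question, the rest go to the LLM.
    question = state["refined_query"].lower()
    if "latest" in question or "current policy" in question:
        return "retrieve"
    if len(question.split()) < 4:
        return "follow_up"
    return "respond"

builder = StateGraph(RouterState)
builder.add_node("extract_intent", lambda s: {})  # placeholder node bodies
builder.add_node("retrieve", lambda s: {})
builder.add_node("follow_up", lambda s: {})
builder.add_node("respond", lambda s: {})

builder.add_edge(START, "extract_intent")
builder.add_conditional_edges(
    "extract_intent",
    route_after_intent,
    {"retrieve": "retrieve", "follow_up": "follow_up", "respond": "respond"},
)
builder.add_edge("retrieve", "respond")
builder.add_edge("follow_up", END)
builder.add_edge("respond", END)
graph = builder.compile()
```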

Key Innovations in the Agentic AI Architecture

  1. Stateful Orchestration with Graph-Based AI: By leveraging LangGraph’s graph-based AI architecture, JapaAdvisorAI transitions from traditional conversational models to a true agentic AI system. StateGraph ensures a structured, adaptive flow where conversations evolve dynamically rather than following static rules. Agents execute concurrently, enabling parallel processing of follow-ups, search queries, and response generation. AI decisions are no longer strictly sequential; they are emergent behaviors resulting from interactions between different nodes in the system.

  2. Multi-Agent Collaboration for Enhanced Decision-Making: JapaAdvisorAI’s agents operate as a distributed decision-making system where each node specializes in a particular task.

    1. Autonomous Follow-Up Generation: The Follow-Up Agent analyzes gaps in user input and triggers clarifying questions automatically.

    2. Search-Augmented Reasoning: The AI seamlessly determines when it lacks sufficient knowledge and invokes the Retrieval Agent for real-time updates.

    3. Adaptive Response Refinement: The Query Refinement Agent continuously optimizes user inputs to maximize response accuracy.

      This approach mirrors real-world expert consultations, where multiple specialists contribute insights to refine answers progressively.

  3. Hybrid AI: LLM Augmentation with External Knowledge

    One of the major challenges in AI-driven advisory systems is maintaining factual accuracy in rapidly changing domains like immigration law. To address this, JapaAdvisorAI employs a hybrid AI approach:
     - LLM-powered natural language processing for reasoning and conversation
     - External search integration for real-time fact-checking
     - Decision logic to determine when search augmentation is needed

    This ensures that the AI remains accurate, up-to-date, and grounded in real-world data rather than relying on outdated model knowledge.
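As a rough illustration of the search-augmentation half of this hybrid, the snippet below pairs TavilySearchResults with ChatOpenAI. The model name, prompt text, and the heuristic deciding when to search are placeholders rather than JapaAdvisorAI's actual decision logic.

```python
from langchain_community.tools.tavily_search import TavilySearchResults
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model choice is illustrative
search = TavilySearchResults(max_results=3)           # requires a TAVILY_API_KEY

def answer(question: str, needs_fresh_facts: bool) -> str:
    """Ground the LLM answer in live search results when the topic is time-sensitive."""
    context = ""
    if needs_fresh_facts:
        results = search.invoke(question)  # list of {"url": ..., "content": ...} dicts
        context = "\n".join(r["content"] for r in results)
    prompt = (
        "You are an immigration advisor. Use the context below if it is provided.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.invoke(prompt).content

print(answer("What documents are required for a US visitor visa?", needs_fresh_facts=True))
```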

Lessons Learned: Building the Next Generation of AI Advisors

1. Agentic AI Is Great For Conversational Systems: Traditional chatbots struggle with context retention, adaptability, and complex decision-making. By employing graph-based AI orchestration, JapaAdvisorAI demonstrates that:

  • AI assistants can dynamically adjust their strategies based on real-time inputs.

  • Autonomous agents enhance AI reasoning by specializing in distinct tasks.

  • Multi-agent systems enable emergent behaviors, where responses evolve based on collective agent decisions.

2. The Best AI Advisors Combine Intelligence with Knowledge Augmentation: LLMs are not sufficient on their own—they require:

  • Structured query refinement to ensure clarity

  • Automated knowledge retrieval for real-time updates

  • Graph-based decision-making to route user queries intelligently

  • This hybrid AI paradigm significantly outperforms traditional chatbot architectures in knowledge-sensitive domains.

3. Modular AI Design Increases Scalability & Maintainability: By decoupling AI functions into agent-based nodes, I ensured that:

  • New capabilities (e.g., new search APIs, expanded Q&A models) can be added without overhauling the system.

  • The AI can scale horizontally, distributing different functions across specialized modules.

  • The logic remains explainable and auditable, improving trust and compliance in high-stakes AI applications.

 

JapaAdvisorAI Demo

Here is a video demonstration of the JapaAdvisorAI agent deployed on DigitalOcean, responding to a simple query about the documents required for a U.S. visa from Nigeria.

Final Thoughts: Building AI for Complex Decision-Making

JapaAdvisorAI represents a paradigm shift in AI advisory systems, proving that agentic architectures, knowledge augmentation, and adaptive workflows are the key to next-gen AI assistants. If you want to use JapaAdvisorAI, please contact me.

As a Principal AI Expert, I specialize in:

  • Designing multi-agent AI systems using LangGraph

  • Developing LLM-powered advisory tools with real-world augmentation

  • Architecting AI that blends reasoning, search, and stateful decision-making

If your organization is exploring agentic AI solutions or needs a leader to drive AI innovation, let’s connect!


Using Machine Learning to Identify Patients Who Are No-Shows

March 01, 2020 by Oladotun Opasina

Here is a brief introduction to the project.

Please check out the blog post: https://www.dotunopasina.com/datascience/noshowappointments

Introduction

In this project, we will be utilizing machine learning algorithms to perform feature selection on patient appointment data. The goal is to understand which characteristics of a patient make them likely to miss their appointment.

Dataset

The dataset for this project was obtained from Kaggle and consists of 14 columns and 110,527 rows of data.

The data consists of the following columns:

  1. Patient Id

    • Identification of a patient

  2. Appointment ID

    • Identification of each appointment

  3. Gender

    • Male or Female. Females make up the greater proportion; women tend to take more care of their health than men.

  4. AppointmentDate

    • The day of the actual appointment, when they have to visit the doctor.

  5. Scheduled Date

    • The day someone called or registered the appointment; this is, of course, before the appointment date.

  6. Age

    • The age of the patient.

  7. Neighborhood

    • Where the appointment takes place.

  8. Scholarship

    • True or False. Indicates whether the patient is enrolled in the Bolsa Família welfare program; for background, see https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia

  9. Hypertension

    • True or False

  10. Diabetes

    • True or False

  11. Alcoholism

    • True or False

  12. Handicap

    • True or False

  13. SMS_received

    • 1 or more messages sent to the patient.

  14. No-show

    • True or False.

Machine Learning Process

The steps taken to accomplish our results include the following:

  1. Data preprocessing.

  2. Create a waiting-time field (days between the scheduled date and the appointment date); a sketch of this step follows the list.

  3. Exploratory data analysis.

  4. Pass the data through the machine learning algorithm

  5. Select the 10 features that most increase the likelihood of a missed appointment and the 10 that most decrease it.
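A minimal sketch of the waiting-time step (step 2), assuming the Kaggle file uses the ScheduledDay and AppointmentDay timestamp columns; the file and column names may differ slightly in your copy:

```python
import pandas as pd

# Kaggle no-show appointments file; name and columns are the standard ones.
df = pd.read_csv("KaggleV2-May-2016.csv")
df["ScheduledDay"] = pd.to_datetime(df["ScheduledDay"])
df["AppointmentDay"] = pd.to_datetime(df["AppointmentDay"])

# Waiting time: whole days between booking the appointment and the appointment itself.
df["AwaitingTime"] = (df["AppointmentDay"].dt.normalize()
                      - df["ScheduledDay"].dt.normalize()).dt.days
```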

The code for the project can be found on my GitHub.

Exploratory Data Analysis

The below pie chart shows the number of Yes (shows up to appointment) as 85,299 and No (misses appointment) as 21,677. This implies we have an imbalanced data set and we need to keep that in mind as we move along.

Number of Yes and No to appointments

Machine Learning Model

The machine learning model used here was a logistic regression with lasso (L1) regularization. Regularization penalizes the model’s cost function to ensure that the model does not overfit. In this case, the coefficients of unimportant features are driven to zero, which lets us select the important ones.
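The sketch below shows roughly how an L1-penalized (lasso) logistic regression can surface the most influential features with scikit-learn. The preprocessing is simplified, and the column names and regularization strength are illustrative rather than the project's exact pipeline.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("KaggleV2-May-2016.csv")  # Kaggle no-show file; columns may vary
y = (df["No-show"] == "Yes").astype(int)   # 1 = patient missed the appointment

# Simplified feature set: drop identifiers and dates, one-hot encode categoricals.
features = pd.get_dummies(
    df.drop(columns=["PatientId", "AppointmentID", "ScheduledDay", "AppointmentDay", "No-show"]),
    columns=["Gender", "Neighbourhood"],
)
X = StandardScaler().fit_transform(features)

# The L1 (lasso) penalty drives the coefficients of unimportant features to exactly zero.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
model.fit(X, y)

coefs = pd.Series(model.coef_[0], index=features.columns).sort_values()
print(coefs.tail(10))  # features most associated with missing an appointment
print(coefs.head(10))  # features most associated with showing up
```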

Results and Insights

The model selected the most important features that affect patients missing their appointment as seen in the figure below.

Feature selections of Appointment No Shows

From the image above, we can split the features into two groups: more likely to miss an appointment and less likely to miss an appointment.

More Likely to Miss Appointment

  • Patients who had a large difference between their scheduled and appointment date missed their appointment the most

  • Interestingly, patients who received an SMS reminder still missed their appointments

  • Patients in the Itararé and Santos Dumont neighborhoods were more likely to miss their appointments

  • Patients between the ages of 13 and 14 were more likely to miss their appointments

Less Likely to Miss Appointment

  • Patients who were aged 64 and 69

  • Patients who lived in Santa Martha, Jardim da Penha, and Jardim Camburi

  • Patients who had Hypertension were less likely to miss their appointments


Credit Card Fraud Detection using Logistic Regression, Naive Bayes, and Random Forest Classifiers

February 23, 2020 by Oladotun Opasina

Introduction

The goal of this project is to utilize machine learning algorithms to classify a transaction as fraudulent or not based on multiple inputs.

Datasets

The dataset for this project was obtained from Kaggle. It contains transactions made with credit cards in September 2013 by European cardholders. The dataset covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions; it is highly unbalanced, as the positive class (frauds) accounts for 0.172% of all transactions. The inputs consist of the dimension-reduced features V1…V28, the transaction amount, and a class label indicating whether the transaction is fraudulent or not.

Machine Learning Algorithm Process

The machine learning algorithm process highlights the steps taken to get the models up and running from start to end. It also describes the data preprocessing and cleaning stage of the problem. The machine learning algorithm process includes:

  1. Download data sets from Kaggle.

  2. Load data into Jupyter notebook and perform exploratory analysis.

  3. Split the data into input and output columns.

  4. Standard-scale the data so that all features are on a comparable scale.

  5. Pass the data into grid-searched logistic regression, naive Bayes, support vector, and random forest classifiers.

  6. Calculate the performance metrics of the models. Note: since we have imbalanced data, we use a confusion matrix and the F1 score to evaluate the models.

The code for the project can be found on my GitHub.
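As a rough sketch of steps 3 to 6, the snippet below scales the features, fits the three tractable models, and reports F1 scores and confusion matrices. The grid search and tuned hyperparameters are omitted for brevity; the file name is the standard Kaggle one.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("creditcard.csv")              # Kaggle credit card fraud dataset
X, y = df.drop(columns=["Class"]), df["Class"]  # Class = 1 marks a fraudulent transaction

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in models.items():
    preds = model.fit(X_train, y_train).predict(X_test)
    # F1 balances precision and recall, which matters on this highly imbalanced dataset.
    print(name, f1_score(y_test, preds))
    print(confusion_matrix(y_test, preds))
```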

Results

The results of the 4 machine learning models evaluated are as follows:

  1. Naive Bayes performed worst, with an F1 score of 11.31%.

  2. Logistic regression scored 72.9%.

  3. Random forest scored 87.4%.

  4. Support vector machines took so long to run that I had to stop the process.

Conclusion

In general, the random forest classifier performed best, as it is a combination of decision trees and is protected from overfitting by its ensembling method. This project was an interesting one to learn from, and the outcomes were used to further my knowledge of machine learning. Please find the code used for the project on my GitHub page.


Predicting House Prices in the Massachusetts area using Linear Regression, Support Vector Regression and Random Forest Regression

February 10, 2020 by Oladotun Opasina

Introduction

For this project, I scraped housing data from Zillow and used features such as the size of a house in square feet and the number of beds and baths to predict the price of the house. The goal of this project is to understand multiple regression algorithms and see how they function on a real-world problem.

The Machine Learning Process

The data process included:

  1. Data collection and cleaning: I scraped the data from Zillow for particular cities in Massachusetts and did some data cleaning by removing rows with empty fields and those without bedrooms and baths in their columns. The price data was log-transformed to reduce skewness.

  2. Machine Learning Algorithm: The cleaned data was then passed into multiple machine learning algorithms, such as linear regression, support vector regression, and random forest regression, to predict the price of the houses.

  3. Evaluate Machine Learning Algorithms: The algorithms were evaluated to see how they performed. Spoiler alert: random forest regression did the best among the three, while linear regression performed worst.

  4. Display Result: The predictor was then served through a Flask app where users can try multiple inputs to the models and see how the predicted price changes.

 

Machine Learning Introductions

Linear Regression

Linear regression is a supervised learning algorithm that fits a line through the data to predict a continuous variable. The algorithm minimizes the residuals (the differences between the predicted and actual values) as it calculates its predictions.


Support Vector Regression

Support vector regression is a supervised learning algorithm that tries to fit the data within a threshold margin around the regression line, with the boundary lines defined by the support vectors. A tolerance is chosen that allows some points to fall outside the margin, and kernel functions can transform the data into a space where a better fit is possible.


Random Forest Regression

Random forest regression is a supervised learning algorithm that utilizes a combination of decision trees to produce a continuous output. A combination of models is called an ensemble model. Random forest uses the bagging ensemble method (bootstrapping and aggregation) to produce its output: bootstrapping is row sampling with replacement, each tree is trained on a bootstrap sample, and the individual tree outputs are then aggregated (averaged, for regression) to give the final prediction.


Evaluation Metric

The models were evaluated using the R-squared metric, which measures the goodness of fit of a model: R² = 1 − (sum of squared residuals) / (total sum of squares), i.e. the proportion of the variance in the target that the model explains.
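Below is a minimal sketch of how the three regressors might be fit and compared on R², assuming a cleaned DataFrame with square footage, beds, baths, and price columns; the file and column names are placeholders for the scraped Zillow data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

df = pd.read_csv("zillow_ma_listings.csv")  # hypothetical cleaned scrape
X = df[["sqft", "beds", "baths"]]           # placeholder feature names
y = np.log(df["price"])                     # log-transform the target to reduce skew

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

models = {
    "linear_regression": LinearRegression(),
    "support_vector_regressor": SVR(kernel="rbf"),
    "random_forest_regressor": RandomForestRegressor(n_estimators=200, random_state=42),
}
for name, model in models.items():
    # score() returns R-squared, the evaluation metric described above.
    print(name, model.fit(X_train, y_train).score(X_test, y_test))
```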

Here are the results of the models:

  1. Linear regression: 65.7%

  2. Support Vector Regressor: 71.2%

  3. Random Forest Regressor: 80.4%


Project Demo

A screenshot of the actual website of the project can be found below. Users can enter their inputs and get results from the different models in the application. Something to note is that we are predicting per city and some cities have more observations than others.


Conclusion and Future Works

Prediction of house prices in the Massachusetts area has been achieved. I need to collect more data for better predictions and further develop the Flask web UI.

A link to the project can be found on my GitHub, along with my contact information.


Metis Final Project: Predicting Kickstarter Project Successes using Neural Networks

September 22, 2019 by Oladotun Opasina

We just completed our final project at Metis in Seattle, Washington, USA. These past 12 weeks went by quickly and were filled with many lessons. It is going to take a while to process them and I am glad for the period of growth.

For this project, we were each tasked to select a passion project and apply our newly learned machine learning skills to it.

After much discussion with our very fine instructors, I decided to focus on predicting the success of Kickstarter campaigns using neural networks and logistic regression.

Tools Used:

  • MongoDB for storing data

  • Python for coding

  • TensorFlow and Keras for neural networks

  • Tableau for data Visualization

  • Flask app for displaying the website

The code and data for this project can be found on GitHub.

Challenge:

“Why did Football Heroes, a mobile game company using Kickstarter, achieve its goal of 12,000 while CowBell Hero, another company that used Kickstarter, did not? The goal of this project is to help campaigns succeed as they raise their funds.”

This is a Tough Problem

This is a tough problem because Kickstarter data contains both text data, such as the project title and description, and tabular data, such as the goal and duration of the project. Hence, my approach needed to account for both types of data.


Data:

The data was collected from a Kickstarter web scraper called webroot.io.

  1. Kickstarter data

Steps:

  1. Preprocess data using MongoDB

  2. Split data into text data (title, description ) and tabular data.

  3. Pass the text data into an LSTM neural network and the tabular data into a regularized logistic regression (a sketch of the text model follows this list).

  4. Combine all the models in an ensemble model. The accuracy score of the combined model was 78.3%.

  5. Build a website using a Flask app to allow users to interact with the output.
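Here is a minimal sketch of the text half of step 3: an LSTM classifier over campaign text built with Keras. The two example records, vocabulary size, sequence length, and layer sizes are all illustrative placeholders.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder records: campaign text plus a 1/0 label for funded vs. not funded.
texts = np.array([
    "Football Heroes: a retro football mobile game",
    "CowBell Hero: the cowbell rhythm game",
])
labels = np.array([1, 0])

# Map raw text to fixed-length integer sequences.
vectorizer = layers.TextVectorization(max_tokens=10_000, output_sequence_length=100)
vectorizer.adapt(texts)
X = vectorizer(texts)

model = tf.keras.Sequential([
    layers.Embedding(input_dim=10_000, output_dim=64),
    layers.LSTM(32),
    layers.Dense(1, activation="sigmoid"),  # probability the campaign reaches its goal
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, epochs=3, verbose=0)
```

In the full project, the probability from this text model would be combined with the regularized logistic regression over the tabular features to form the ensemble in step 4.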

Steps for Kickstarter Prediction

Video Demo

Future Work:

Presently, I have a working website that predicts campaign successes and failures. Future work includes improving the models and the website.

Thank You:

I want to use this medium to thank my family, friends, and colleagues at Metis for their lessons, patience, and the opportunity to grow. All of my successes at Metis are because of the positive impact they had on me and my projects. So thank you.




PowerPoint Slides:


Metis Project 3: Why Are My Customers Leaving? Using Logistic Regression to Interpret Churn Data

August 12, 2019 by Oladotun Opasina in DataScience, Churn, Marketing

We just finished week 6 at Metis in Seattle, Washington, USA. These past weeks have gone by quickly; we are halfway through the program, and the skillsets learned are amazing.

On my last project, I worked on predicting NBA player salaries, and the feedback I received was extremely useful. Thank you.

For this project, we utilized classification methods discussed in class to solve a business problem. This project was done individually. I decided to focus on a company’s churn data, to figure out what sort of customers are leaving, and used a logistic regression algorithm. I used Python for coding and Tableau for data visualization. The code and data for this project can be found on GitHub.

My initial plan was to utilize data from the Economist to cluster countries and figure out what style of leadership is important for their economic growth. This was based on a discussion with my fellow Schwarzman Scholar, Lorem Aminathia, on the model of leadership needed to ensure Africa’s growth. Unfortunately, there were not enough data features to properly evaluate this problem.

Challenge:

“We were consulted by Infinity, a hypothetical internet service provider, to figure out their churn (which customers are leaving) and where their Growth Team can focus.”

Data:

The data was the IBM Telco customer churn dataset on Kaggle.

  1. IBM Telco Churn data

Approach:

The Minimum Viable Product (MVP) for our client was to address the following points:

  1. Figure out the number of customers churning.

  2. Find the most frequent types of customers who churn.

  3. Provide recommendations on next steps for the growth team.

Steps:

The following steps were taken to produce results; these are general data science steps toward a solution and are usually iterative.

  1. Data gathering from our data sources.

  2. Data cleaning

  3. Feature Extractions and Cleaning

  4. Data Insights

  5. Client Recommendations

Insights and Reasons

After downloading, cleaning, and aggregating the datasets, the following were noticed:

  1. About 26% of customers are churning: out of roughly 7,000 customers in the churn data, close to 2,000 are churning.


2. Logistic regression (accuracy score of 80%) identified the features of the customers most and least likely to churn.

The image below shows the features that either lead to customer churn or not. Something that surprised me was that fiber optic users were more likely to churn than digital subscriber line (DSL) users, a different type of internet service customer. It is surprising because fiber optic internet service usually connects to the internet faster than DSL. On the other hand, fiber optic service is usually more expensive than DSL, and users may be getting tired of paying the premium for it. A rough sketch of this modeling step follows.
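As a sketch, the snippet below fits a logistic regression to the Telco churn data and inspects the coefficients that push customers toward or away from churning. The file name and columns are the standard Kaggle ones, but the preprocessing is simplified compared with the actual project.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0)

y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]), drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
scaler = StandardScaler().fit(X_train)
model = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
print("accuracy:", model.score(scaler.transform(X_test), y_test))

# Positive coefficients push a customer toward churning, negative ones toward staying.
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs.tail(5))  # profiles most likely to churn (e.g. fiber optic internet service)
print(coefs.head(5))  # profiles least likely to churn
```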



Recommendation

An immediate next step for the growth team is to offer fiber optic customers who are about to leave the option to switch to DSL service.

Infinity's Customers Leaving! Stop That Churn. Dotun Opasina


Metis Project 2: Predicting NBA Player Salaries using Linear Regression

July 21, 2019 by Oladotun Opasina in DataScience, NBA

We just finished our third week at Metis in Seattle, Washington, USA. These past weeks were a roller coaster of learning amazing material in statistics, Python, and linear algebra.

On our first project, I worked with other students to provide recommendations for Women in Technology and that experience was amazing.

For our second project, we worked individually and utilized linear regression to predict or interpret data on a topic of our choosing. I decided to focus on the NBA because of my rekindled love for the game after watching last season’s tumultuous finals between the Toronto Raptors and the Golden State Warriors.

Even though I worked on this project alone, my Metis classmate Fatima Loumaini and my instructors helped me in understanding the theory.

A big shoutout goes to my ex-managers at Goldman Sachs who gave me feedback on my model and on how to properly create compelling visualizations. Thank you Rose Chen, David Chan, and Samanth Muppidi (inside joke).

Goal

The goal of this project is to predict NBA players’ salaries per season based on their statistics using linear regression. This project can be used by both players and team managers to evaluate the impact a particular player is making on a team and to decide whether to increase the player’s salary or trade the player.

Notes:

I am taking the non-traditional approach of explaining my results first; anyone who is interested in the technicalities of the entire project can read the remainder of the blog and view the code and presentations.

Results and Insights:

Growing Salaries and Injury Impact: Predicting Victor Oladipo’s Salaries

The model was tested on Victor Oladipo’s per-season stats from 2017 to 2019. Victor was the Most Improved Player in 2018. Using a selection algorithm, the most important stats for a player were selected to predict his salary.

From the charts below, we can see that the ratio of Victor’s actual salary to his stats increased from 2017 to 2018 and stayed fixed in 2019, while my model predicted his salary should have increased from 2017 to 2018 (but not as much as his actual salary increase) and decreased slightly in 2019. We can also see that Victor’s stats increased from 2017 to 2018 and decreased slightly in 2019.

Observations

In the real world, Victor made a huge impact on his team (the Indiana Pacers) from 2017 to 2019 but sustained an injury that knocked him out for the 2019 season. This injury affected the impact he made on his team, hence the decrease in his stats. A reason why we do not see a change in his salary is that he is currently on a multi-year contract, which is usually guaranteed despite injuries.

Plots of Actual Vs. Predicted Player’s Salaries and Players’ Individual Stats Sum for 2017-2019.

Growing Salaries, Growing Impact: Predicting Giannis Antetokounmpo’s Salaries

The model was tested on the stats of Giannis, who was the Most Improved Player in 2017, and the results were used to create the charts below.

From the charts, we can see that the ratio of Giannis’s actual salaries to his stats increased from 2017 to 2019, and my model predicted his salary should have increased over the same period. Giannis’s stats from 2017 to 2019 saw a steady increase as well. Something worth noting is that my model says Giannis should have been making more than his actual salaries from 2017 to 2019.

Observations

Comparing this to reality, Giannis improved greatly in 2017 and signed a multi-year contract that season; thus we see an increase in his salaries. My model predicted that, because of Giannis’s impact on his team, he should be earning more money. But Giannis cares more about building the Milwaukee Bucks franchise, and he is willing to grow with the organization.

Plots of Actual Vs. Predicted Player’s Salaries and Players’ Individual Stats Sum for 2017-2019.

Growing Salaries, Declining Impact: Predicting Jimmy Butler’s Salaries

Finally, the model was evaluated on the stats of Jimmy Butler, who was the Most Improved Player in 2015, to generate the charts below.

The charts show that the ratio of Jimmy’s actual salaries to his stats increased from 2017 to 2019, while my model predicted his salary should have decreased over that time period.

We can see that Jimmy’s stats from 2017 to 2019 slightly decreased. Something worth noting is that my model says Jimmy should be making less money than his actual salaries, based on his stats.

In actuality, Jimmy’s stats saw a steady decrease from 2017 to 2019 as he switched from the Chicago Bulls to the Minnesota Timberwolves in 2018 and then to the Philadelphia 76ers in 2019. In explaining this phenomenon of increasing salaries against decreasing stats, it is general knowledge that a player’s brand also adds to his value, and in switching teams, a player needs time to adjust to the style of play of that particular team. So it is not surprising that Jimmy’s stats decreased over time.

Plots of Actual Vs. Predicted Player’s Salaries and Players’ Individual Stats Sum for 2017-2019.

If you made it this far, then you are interested in the technicality of things. Kindly enjoy the read below, and I welcome any constructive feedback.

NBA Introduction:

The National Basketball Association is a men's professional basketball league in North America, composed of 30 teams. It is one of the four major professional sports leagues in the United States and Canada, and is widely considered to be the premier men's professional basketball league in the world.

Find the major stats for the NBA in 2019 below:

Major NBA Stats in 2019

Approach:

The approach for this project was to utilize specific player stats to predict salaries using linear regression. I utilized the Lasso algorithm to select the most important player statistics that affect a player’s salary.

Steps:

The following steps were taken in achieving my goals for this project.

  1. Data scraping and cleaning.

  2. Data and feature engineering.

  3. Model validation and selection.

  4. Model prediction and evaluation.

Data Scraping and Cleaning:

The data for this project was scraped from:

  1. Basketball Reference: a website that contains basketball player stats.

  2. I selected basketball player stats and salaries from 2017 - 2019 for this project.

  3. I chose around 20 unique stats per player.

The Python script used to scrape the data can be found on my GitHub page.

Data and Feature Engineering:

After applying the Lasso algorithm for feature selection, I was able to narrow the 20 unique stats down to the 5 that most affected a player’s salary (a sketch of the selection step follows this list). They are:

  1. The player’s age

  2. The minutes played per game

  3. The defensive rebounds per game.

  4. The personal fouls per game.

  5. The average points made per game.
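A minimal sketch of the Lasso selection step, assuming a DataFrame with one row per player-season, a salary column, and the roughly 20 scraped per-game stats; the file name, column names, and penalty strength are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("nba_player_stats_2017_2019.csv")        # hypothetical scraped file
stats = df.drop(columns=["player", "season", "salary"])   # the ~20 per-game stats
X = StandardScaler().fit_transform(stats)
y = np.log(df["salary"])  # salaries are log-transformed, as noted for the heatmap below

lasso = LassoCV(cv=5).fit(X, y)  # cross-validation picks the penalty strength
selected = pd.Series(lasso.coef_, index=stats.columns)
print(selected[selected != 0].sort_values(ascending=False))  # non-zero coefficients = selected stats
```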

The image below shows a heatmap of the selected NBA stats against salary. Notice that the salaries are log-transformed to scale properly with the features, and that all the stats are positively correlated with salary, which suggests the problem is a reasonable fit for linear regression.

HeatMap displaying positive correlation of my different stats to Salary.

Model Validation and Selection:

I split my data into train and validation sets before fitting my model on the training data. I got an R-squared of 42%, which indicates how much of the variability in salaries the model explains.

Model Prediction and Evaluation:

After training my model with my train set, I got the predicted salaries for each player from 2017-2019. The insights of this project can be found in the Results and Insights section.

Conclusions:

Players and team managers can better work together by using the NBA prediction model when creating contracts, giving them a standardized way to evaluate impact.

Future Works:

  1. Collect more NBA data from 2008 - 2019.

  2. Include features on out-of-season Injuries, beginning of contracts for players, and brand value of a player etc.

  3. Figure out ways for players to improve specific stats.

Below is my presentation for the project at Metis. Looking forward to your feedback.

Women in Technology (source: https://www.we-worldwide.com/blog/posts/black-women-in-tech)

Metis Project 1: Analysis for WomenTechWomenYes Summer Gala

July 10, 2019 by Oladotun Opasina

We just finished our first week at Metis in Seattle, Washington, USA. The past few days have been a whirlwind of both review and new materials. As our first project, we leveraged the Python modules Pandas and Seaborn to perform rudimentary EDA and data visualization. In this article, I’ll take you through our approach to framing the problem, designing the data pipeline, and ultimately implementing the code. You can find the project on GitHub, along with links to all the data.

I worked on this project with Aisulu Omar from Kazakhstan, and with Alex Lou and Dr. Jeremy Lehner, both from America.

I am Nigerian, and you can find me on LinkedIn.

The Challenge:

WomenTechWomenYes (WTWY), an (imaginary) non-profit organization in New York City, is raising money for its annual summer gala. For marketing purposes, they place street teams at entrances to subway stations to collect email addresses. Those who sign up are sent free tickets to the gala. Our goal is to use MTA subway data and other external data sources to help optimize the placement of the street teams, so that they can gather the most signatures from people who will attend and contribute to WTWY’s cause.

Our Data:

We used three main data sources for this project.

  1. MTA Subway data

  2. Yelp Fusion API, used to figure out the zip code of each station.

  3. University of Michigan median and mean income data by zip code.

Our Approach:

For our approach, we discussed as a team what our Minimum Viable Product (MVP) for the client would be, and we came up with three goals.

  1. Find the busiest stations by traffic to easily deploy the street teams.

  2. Find the busiest days of the week at the train stations.

  3. Find and join income data to the busiest stations to figure out who will donate to our cause.

Our Steps:

We took the following steps to get our results; these are general data science steps toward a solution and are usually iterative.

  1. Data gathering from our data sources.

  2. Data cleaning

  3. Data aggregating (a sketch of this step follows the list)

  4. Data insights

  5. Client Recommendations
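As a rough sketch of the aggregation step, the snippet below totals entries per station from an MTA turnstile file, assuming the standard layout where ENTRIES is a cumulative counter per turnstile; the file name is illustrative and quirks such as whitespace in headers are glossed over.

```python
import pandas as pd

# One week of MTA turnstile data; ENTRIES is a cumulative counter per turnstile audit.
df = pd.read_csv("turnstile_190629.txt")
df.columns = df.columns.str.strip()  # some headers carry trailing whitespace
df["DATETIME"] = pd.to_datetime(df["DATE"] + " " + df["TIME"])

# Convert the cumulative counter into per-interval entries for each physical turnstile.
turnstile = ["C/A", "UNIT", "SCP", "STATION"]
df = df.sort_values(turnstile + ["DATETIME"])
df["ENTRY_DIFF"] = df.groupby(turnstile)["ENTRIES"].diff().clip(lower=0, upper=100_000)

# Busiest stations and busiest day of the week by total entries.
print(df.groupby("STATION")["ENTRY_DIFF"].sum().nlargest(5))
print(df.groupby(df["DATETIME"].dt.day_name())["ENTRY_DIFF"].sum().sort_values(ascending=False))
```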

Our Insights and Reasons

After downloading, cleaning, and aggregating the datasets, we noticed the following:

  • Wednesdays are the busiest days

Busiest Day of the Week is Wednesday by Number of Entries

  • The top 5 Busiest stations by traffic are:

    • 34th St - Penn Station

    • 42nd St - Grand Central 

    • 34th St - Herald Square

    • 14th St - Union Sq

    • 42nd St - Times Sq

Top 5 Busiest NYC Stations in the Summer

Why?

  • This is because the top 5 stations are located near the Midtown area of New York City, which is commonly busy during the summer.

  • Major restaurants, landmarks, colleges, and technology companies are situated in this area.

Google Map route of the Top 5 Stations in Proximity to one another

  • After joining income data to the busiest stations and filtering for those who made $70,000 and above, we found:

    • Grand Central - 42 Street to be the station with the highest income.

Busiest Stations by Median Household Income

Our Recommendation:

From our analysis, we recommend that WomenTechWomenYes deploy street teams on Wednesdays during peak hours to 34th St - Penn Station and 42nd St - Grand Central to best target their appropriate audience.

Conclusion:

I would like to thank the Metis team and my classmates for their thoughtful questions and feedback. As we continue with future projects, I hope to incorporate those lessons into them. Our slides can be found below.

Analysis for WomenTechWomenYes Annual Gala Aisulu, Alex, Dotun and Jeremy Metis 2019
