Dotun Opasina

Open-Source vs Closed-Source LLMs: What Leaders Need to Consider

December 16, 2025 by Oladotun Opasina

When it comes to deploying LLMs, leaders often default to whatever's easiest to spin up, usually a closed-source API. But that quick start can become a long-term constraint. Publicis Sapient found that 42% of enterprises abandoned most of their AI initiatives last year, and a big reason is infrastructure choices that don't match what the organization actually needs.

The Real Trade-offs

Closed-source models (GPT-4, Claude, Gemini) get you to a working solution fast. API integration takes hours, not months. You get support, predictable SLAs, and best-in-class reasoning.

Open-source models (Llama, Mistral, Falcon, DeepSeek) give you complete control, and complete responsibility. You decide where data lives and how the model behaves. But you're on the hook for infrastructure, security updates, and performance optimization.

The question isn't which is better. It's which trade-off your organization can actually execute on.

The Cost Reality

Closed-source pricing scales fast. A chatbot handling 100,000 monthly interactions might run $2,000-$5,000/month. Scale to 1 million and you're at $20,000-$50,000/month. For high-volume customer service or personalization engines, those API costs add up quickly.

Open-source eliminates API fees but needs GPU infrastructure at $3,000-$10,000/month minimum. The break-even typically hits between 500K-1M monthly interactions. Where open-source often wins: processing proprietary data you can't send externally anyway—production configurations, supply chain optimizations, equipment patterns.
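
A rough back-of-the-envelope sketch of that break-even, in Python. Every figure is an illustrative assumption drawn from the ranges above plus an assumed engineering overhead, not a vendor quote; plug in your own numbers.

# Break-even sketch: closed-source API cost vs. self-hosted cost per month.
# All numbers are illustrative assumptions, not vendor quotes.

def api_cost(monthly_interactions, cost_per_interaction=0.03):
    """Closed-source: cost scales roughly linearly with usage."""
    return monthly_interactions * cost_per_interaction

def self_hosted_cost(monthly_interactions, gpu_infra=8_000, engineering=15_000):
    """Open-source: mostly fixed GPU plus engineering cost, roughly flat with usage."""
    return gpu_infra + engineering

for volume in (100_000, 500_000, 1_000_000, 2_000_000):
    closed, self_hosted = api_cost(volume), self_hosted_cost(volume)
    winner = "open-source" if self_hosted < closed else "closed-source"
    print(f"{volume:>9,} interactions/mo: API ${closed:,.0f} vs self-hosted ${self_hosted:,.0f} -> {winner}")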

When Regulation Drives the Decision

Processing protected health information? Closed-source APIs without proper Business Associate Agreements create immediate HIPAA violations.

Open-source keeps data on-premises but you need the technical chops to back it up. Many organizations go hybrid: closed-source for general communication, self-hosted for regulated data.

The Decision Framework

Choose closed-source when:

  • You need to ship in weeks, not months

  • You lack ML/AI engineering resources in-house

  • You're handling <500K monthly interactions

Choose open-source when:

  • You're in a regulated industry with data residency requirements

  • You have proprietary data you legally cannot share with vendors

  • You're at scale (>1M monthly interactions) where TCO favors it

Go hybrid when:

  • You have both public-facing and sensitive internal use cases

  • Different departments have different compliance requirements

Before committing, ask: Who on your team has deployed production ML infrastructure? Have you negotiated proper data processing agreements? At what volume do API costs become a problem?

Conclusion

The companies winning with AI aren't the ones with the fanciest models. They're the ones who've matched their technical choices to what they can actually execute.

Sources:

  • a16z State of Enterprise AI: 41% plan to increase open-source LLM usage

  • Publicis Sapient Guide to Next 2026: 42% abandoned most AI initiatives in 2024

  • Astera: Llama models downloaded 400 million times in 2024


The Real Cost of AI Systems: What Leaders Need to Know Before Scaling

December 09, 2025 by Oladotun Opasina

Most AI strategy conversations start with capability questions. What can the model do? How accurate is it? What problems can it solve?

But there's a more fundamental question that often gets answered too late: What does this actually cost at the scale we're planning to operate?

The Gap Between Theory and Practice

Your marketing team needs to analyze customer feedback from the past quarter and create a summary report. Someone asks GPT to do it: "Here's our customer feedback data. Summarize the key themes and create recommendations." A clean request: 50 tokens, measured with the OpenAI Tokenizer.

Tokens are the units used to measure LLM inputs and outputs; one token is roughly 3-4 characters, or about three-quarters of a word in English.
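
If you want to measure rather than estimate, here is a small sketch using the tiktoken library (pip install tiktoken). The encoding name is an assumption; match it to the model you actually use.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = ("Here's our customer feedback data. "
          "Summarize the key themes and create recommendations.")
tokens = enc.encode(prompt)
print(f"{len(tokens)} tokens for a {len(prompt)}-character prompt")
# Roughly 4 characters (about three-quarters of an English word) per token.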

The first response comes back with general themes and basic structure. Helpful, but it's missing specific metrics. It hasn't connected feedback to the product roadmap. The tone isn't right for executives.

So they clarify. "Include sentiment scores and tie themes to our Q3 product launches." Better, but now it needs reformatting for slides. Another exchange. Then it needs to match last quarter's framework for comparison.

Final tally: 92 tokens of input across multiple prompts, 3,428 tokens of output.

This pattern isn't unique. Research shows roughly two-thirds of ChatGPT users need 2+ prompts per task, with about a third requiring 5+ exchanges to get usable output. The cost isn't in the initial query; it's in the iteration required to get from "plausible answer" to "working solution."

Running the Numbers

Using the OpenAI GPT API Pricing Calculator:

  • 100 API calls: ~$3

  • 1 million API calls: ~$30,000

That's for GPT-5.1. But for this feedback analysis, the team needed solid insights, not perfect prose or deep statistical analysis. Switching to Gemini-2.5 could drop that million-call cost to ~$1,000.

The strategic question: Are you paying for capability you don't actually need?

The Tradeoff Matrix Leaders Should Map

  1. Accuracy vs. Cost: State-of-the-art models cost significantly more. For customer-facing responses or critical decisions, that premium might be worth it. For internal summaries or draft content, probably not.

  2. Speed vs. Power: Faster models sacrifice some capability. Real-time customer queries need speed. Monthly report generation doesn't.

  3. Context Length vs. Efficiency: Models handling longer inputs cost more per call. Entire quarterly reports as context? You pay for that. Working section-by-section? You save considerably.

Most organizations pick a model tier—usually the flagship—and apply it uniformly. That's leaving money on the table. The strategy should be matching model capability to specific requirements.

Three Planning Questions

  1. What's your realistic usage multiplier?

    Don't build your cost model on single exchanges. If your team averages 3-5 prompts per task, multiply your projected costs by that factor. Tools like the OpenAI Tokenizer let you measure actual token usage across real workflows.

  2. Have you mapped requirements to model tiers?

    Create a matrix of your use cases and their requirements. Legal contract review might need GPT-4's accuracy. Meeting summaries might work fine with GPT-3.5 or Claude Haiku. Customer feedback analysis could use mid-tier models. The goal is precision matching, not blanket deployment.

  3. What's your cost scaling curve?

    Model your expected costs at 10x, 100x, and 1000x your initial volume. If the curve breaks your economics, you need to know that before you scale, not after.
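
A minimal sketch of that kind of scaling model. The prices, token counts, and prompts-per-task multiplier below are illustrative assumptions; replace them with your measured values and current pricing.

def monthly_cost(tasks_per_month,
                 prompts_per_task=4.0,       # measured usage multiplier (question 1)
                 input_tokens=100,            # per prompt
                 output_tokens=1000,          # per prompt
                 price_in_per_1k=0.00125,     # assumed $ per 1K input tokens
                 price_out_per_1k=0.01):      # assumed $ per 1K output tokens
    per_prompt = (input_tokens / 1000) * price_in_per_1k + \
                 (output_tokens / 1000) * price_out_per_1k
    return tasks_per_month * prompts_per_task * per_prompt

base = 1_000  # initial monthly task volume
for multiplier in (1, 10, 100, 1000):
    volume = base * multiplier
    print(f"{volume:>9,} tasks/mo -> ${monthly_cost(volume):,.2f}")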

Why This Matters for Strategy

AI systems scale differently than traditional software. With SaaS, you pay per seat regardless of usage intensity. With AI, cost rises directly with usage.

Success means more expense. If your AI implementation works exactly as planned, with high engagement and widespread usage, your costs accelerate. Unless you've structured your model strategy correctly, that success becomes unsustainable.

The organizations getting this right aren't necessarily the ones with the biggest AI budgets. They're the ones doing the systematic work upfront: measuring real prompt patterns, testing model tiers against specific requirements, understanding their cost curve before they're locked into it.

Starting Point

  1. Run a pilot with actual tracking.

  2. Use real tasks from your team.

  3. Measure the prompt-to-solution ratio.

  4. Calculate costs across different model tiers using the pricing calculators.

  5. Map your findings to projected scale.

The exercise won't be glamorous. But it's the difference between an AI strategy built on actual economics and one built on capability demos and vendor promises.

How are you approaching AI cost modeling in your organization? Let's discuss.

Reference

  1. Fishkin, Rand. “We Analyzed Millions of CHATGPT User Sessions: Visits Are down 29% since May, Programming Assistance Is 30% of Use.” SparkToro, 30 Aug. 2023, sparktoro.com/blog/we-analyzed-millions-of-chatgpt-user-sessions-visits-are-down-29-since-may-programming-assistance-is-30-of-use/





How Do You Make Workers Choose AI Over Their Old Methods?

November 25, 2025 by Oladotun Opasina

There is a massive challenge with current AI adoption in the workplace. While 37.4% of workers report using AI at work, they spend only 5.7% of their work hours actually using it. Publicis Sapient research reveals 60% of CEOs believe AI will revolutionize operations, but only 24% of executives managing those functions agree. The gap between "we have AI" and "AI transforms how we work" is enormous, leaving a real opportunity for leaders who focus on AI adoption.

But AI is different from past technologies. When companies adopted email and intranets in the 1990s, workers had no alternative—you either used the company system or couldn't communicate. Adoption was fast because it was mandatory infrastructure. AI is optional. Workers can always fall back to their old methods. There's no forcing function.

Unlike becoming "an internet company" (which required rebuilding business models), AI adoption is about workers choosing to change how they personally work. That's harder. Electricity took 40 years to boost productivity because factories had to redesign layouts. AI requires workers to redesign their individual workflows—and they can simply refuse.

Historical adoption required clear forcing functions: "read the Bible or face damnation" for literacy, "track your money or go bankrupt" for bookkeeping. For AI, "be more productive" is too abstract to drive behavior change.

What Leaders Must Do Today

  1. Make AI Mandatory for One Task

    Stop making AI optional. Pick one concrete task where AI demonstrably saves time.

    Action: For example, starting Monday, require employees to attempt an AI answer before asking colleagues any question. They must share: (1) their prompt, (2) AI's response, (3) why it's insufficient. This creates both immediate friction reduction and a database of real use cases. Track questions saved per week.

  2. Create Your First Worked Example

    For AI to be adopted, you need worked examples—showing exactly how someone uses AI for a specific task, like Manzoni's 1540 bookkeeping manual that taught through real merchant ledgers.

    Action: Screen-record your best performer using AI for one task start-to-finish (prompts, iterations, output). Make it a 3-minute video showing exactly how Sarah reconciles invoices in 8 minutes instead of 40.

  3. Hire Junior People as AI Champions

    Research shows 93% of Gen Z use two or more AI tools weekly versus 79% of millennials. Gen Z workers lead with 82% adoption compared to 52% of Baby Boomers. Yet many organizations are cutting junior roles, assuming AI replaces entry-level work—exactly backward.

    Action: Hire junior employees specifically as AI champions. Train them from day one on AI tools. Have them spend 25% of their time teaching older colleagues. The strategy of eliminating junior people because "AI will do their work" misses that junior people are your fastest AI adopters and best teachers.

  4. Measure Time Spent, Not "Adoption"

    Workers spend only 5.7% of work hours using AI despite 37% reporting they "use" it. Adoption theater wastes resources.

    Action: Track daily for one month: (1) tasks using AI, (2) minutes spent, (3) time saved. Publish internally weekly. Celebrate highest time-saved, not most "engaged."

The Bottom Line

Successful cognitive technologies required 200-400 years for mass adoption. You don't have that long. Winners won't have "AI strategies"—they'll have rebuilt specific work processes.

Start with one process. Make it work. Document exactly how. Teach person-to-person. Make it mandatory. Measure actual time saved. Then move to the next one.

Stop strategizing. Start rebuilding.

Sources:

  • Publicis Sapient, "AI and Digital Business Transformation," 2025

  • St. Louis Federal Reserve, "The State of Generative AI Adoption in 2025," November 2025

  • Google Survey on Gen Z AI adoption, November 2024

  • London School of Economics & Protiviti, "Bridging the Generational AI Gap," 2025


Chat With Your Documents: Build a RAG AI Agent with No Code in 20 minutes

November 21, 2025 by Oladotun Opasina

In my previous tutorials, you built AI agents for web scraping and conversations. Today: teach an AI to answer questions about YOUR documents.

What you're building: A simple chatbot that reads your documents and answers questions with accurate information—no hallucinations.

What you need: N8N account, OpenAI API key, a document, 20 minutes.

Why This Matters

ChatGPT doesn't know your company policies, resume, or product docs. RAG (Retrieval-Augmented Generation) fixes this by letting AI pull from YOUR documents before answering.

Without RAG: "Tell me about my work experience" → Generic advice
With RAG: Specific details from your resume about Johns Hopkins, Morgan State, your M.Sc.

This is how companies build internal assistants, support bots, and knowledge bases.

How It Works

Two workflows that share data:

Workflow 1 - Upload: Document → Breaks into chunks → Converts to vectors → Stores in database

Workflow 2 - Chat: Chat → AI Agent → Chat Model → Searches database → AI reads chunks

Vectors are coordinates on a mathematical map. Similar meanings = close together. When you ask "What university?", it finds education chunks, not hobby chunks.
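
For the curious, here is what that search is doing conceptually. This is not part of the N8N workflow; the three-dimensional vectors are toy examples, while real embeddings have hundreds of dimensions.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

chunks = {
    "education": np.array([0.9, 0.1, 0.0]),  # toy embedding of an education chunk
    "hobbies":   np.array([0.1, 0.9, 0.2]),  # toy embedding of a hobbies chunk
}
question = np.array([0.8, 0.2, 0.1])         # toy embedding of "What university?"

best = max(chunks, key=lambda name: cosine_similarity(question, chunks[name]))
print(best)  # -> "education": the closest chunk gets retrieved and handed to the LLM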

Build Part 1: Upload Workflow

Step 1: Create workflow, add Form Trigger

  • Add form field: Type = File, Name = "document", Accept = .pdf,.txt

  • Form Title: "Upload Knowledge Document"

Step 2: Add Simple Vector Store (not connected yet)

  • Mode: Insert Documents

  • Memory Key: my-knowledge-base (critical—you'll reuse this)

Step 3: Click Vector Store's "Document Loader" port → Add Default Data Loader

  • Text Splitter: Recursive Character Text Splitter

  • Chunk Size: 1000, Overlap: 200

Step 4: Click Vector Store's "Embedding" port → Add Embeddings OpenAI

  • Model: text-embedding-3-small

  • Add your API key

Step 5: Execute workflow, upload test document, verify chunks created.

In my case, I created a .txt file with the content of my resume to use for testing.

Build Part 2: Chat Workflow

Step 1: New workflow, add When chat message received

  • Public Chat: ON

  • Initial Message: "Ask me anything about uploaded documents"

Step 2: Add AI Agent, connect to Chat Trigger

  • Prompt:

You answer questions using only the knowledge base.
If no relevant info found, say "I don't have that information."
Be specific. Cite details.



Step 3: Click Agent's "Model" port → Add OpenAI Chat Model

  • Model: gpt-4o-mini

  • Temperature: 0.2 (factual, not creative)

Step 4: Click Agent's "Tools" port → Add Simple Vector Store

  • Mode: Retrieve Documents (as Tool)

  • Memory Key: my-knowledge-base (MUST MATCH upload workflow)

  • Description: "Search knowledge base for facts, dates, names, details"

  • Limit: 4

Step 5: Click this Vector Store's "Embedding" port → Add Embeddings OpenAI

  • Model: text-embedding-3-small (MUST MATCH upload workflow)

Step 6: Execute, get chat URL, ask questions about your document.

Real Example

I uploaded my resume. Asked: "What university did the candidate go to?"

Response: "Johns Hopkins University (M.Sc. Engineering Management, May 2016) and Morgan State University (B.Sc. Electrical Engineering, May 2013)."

No hallucination. Just facts from my document.

Common Issues

"I don't have that information" but it's clearly there:

  • Both workflows must use identical Memory Key

  • Increase Limit to 6 chunks

  • Decrease chunk size to 500-700

AI uses general knowledge instead of documents:

  • Add to prompt: "NEVER use training data. ONLY use knowledge base tool."

 

Production Upgrades

Simple Vector Store works for learning but data clears on restart.

For production:

  • Pinecone/Qdrant: Persistent storage

  • Authentication: Add password/API key protection

  • Metadata: Tag documents by department/date for filtering

  • Cost monitoring: Track OpenAI API usage

What You Built

Document → Chunks → Vectors → Storage → Search → Answer

This architecture powers:

  • HR bots ("What's our vacation policy?")

  • Support bots ("How do I reset password?")

  • Research assistants (50-paper summaries)

  • Documentation bots ("How do I use this API?")

Subscribe for the next tutorial. Tag me on LinkedIn if you build something.

The insight: Companies winning with AI use better systems, not better models. You just built one.

Resources: N8N RAG Docs | RAG Template


Build Your Web Scraper AI Agent in 15 Minutes with No Code

November 18, 2025 by Oladotun Opasina

In my previous post, you built a basic chatbot. Today, we're leveling up: an AI agent that scrapes websites and extracts structured data through conversation.

What you're building: A web scraper that takes natural language commands and returns clean JSON data.

What you need: N8N account, OpenAI API key, 15 minutes.

Why AI Agents Change Web Scraping

Traditional scrapers break when websites update their HTML. This agent adapts. You tell it what you want in plain English—it figures out how to extract it.

Step 1: Build the Foundation

Create a new workflow. Add two nodes:

  1. Manual Chat Trigger (search "chat")

  2. AI Agent (connect it to Chat Trigger, select "AI Agent" in dropdown)

Step 2: Configure the Model

Inside the AI Agent:

  1. Chat Model → Add → OpenAI Chat Model

  2. Select gpt-4o, set Temperature: 0.2

  3. Add your API credentials

Step 3: Add Tools

In "Tools" section:

  1. Add → HTTP Request Tool

  2. Method: GET, Response: "Include Full Response"

  3. Leave URL empty (agent fills dynamically)

Key concept: The agent decides when to use this tool through reasoning, not predefined logic.

Step 4: Write Agent Instructions

Click on the AI Agent to configure its system message: click Add Option and select “System Message”.

In "System Message", paste:

You are a professional web scraping AI agent. Follow this protocol:

SCRAPING WORKFLOW:
1. Validate the URL format before scraping
2. Use HTTP Request tool to fetch the page
3. Check HTTP status code:
   - 200: Success, proceed with extraction
   - 404: Page not found, inform user
   - 403/401: Access denied, suggest authentication needed
   - 429: Rate limited, inform user to retry later
   - 5xx: Server error, suggest retry

EXTRACTION RULES:
- Extract ONLY requested data in valid JSON format
- Strip all HTML tags unless specifically requested
- Handle missing data gracefully with null values
- Always include: {"status": "success/error", "data": {...}, "source_url": "..."}

ANTI-PATTERNS TO AVOID:
- Don't scrape if robots.txt disallows (inform user)
- Don't make multiple requests to same URL in one session
- Don't hallucinate data if extraction fails

ERROR RESPONSE FORMAT:
{
  "status": "error",
  "error_type": "http_error/parsing_error/access_denied",
  "message": "Clear explanation",
  "suggestion": "What user should do next"
}

This prompt defines behavior. The agent follows these instructions rather than relying on its training data.

Step 5: Add Memory

In AI Agent's "Memory" section:

  1. Add "Window Buffer Memory"

  2. Set Context Window: 10

Now you can say "scrape that same site but get prices instead" and it remembers context.

Step 6: Test Your Scraper

Save and click "Test workflow". Try:

Scrape https://example.com and extract the main heading
Scrape https://news.ycombinator.com - Extract top 3 story titles as JSON
Scrape https://wikipedia.org and return page title, first paragraph (no HTML), and number of languages

You should get clean JSON. If it fails, the agent explains why.

What Just Happened (Strategic View)

The agent isn't following predefined rules. It's reasoning: "I need data from a URL → I have an HTTP tool → I'll use it → Now I'll parse the HTML → Format as requested."

This is fundamentally different from traditional automation.

Making It Production-Ready

Scale: Replace Chat Trigger with Schedule (scrape hourly), store in Google Sheets/Postgres, add Slack alerts.

Monitor: Insert IF node after AI Agent to catch errors and route to notifications.

Rate limits: Add Wait nodes (2-3 seconds), check robots.txt, don't hammer servers.

Limitations

Works for static HTML. Doesn't work for JavaScript-heavy sites, login walls, or CAPTCHAs. Use ScrapingBee or Browserless for those.

The Real Insight

You just built a scraper that adapts to any page structure without writing parsing logic. Compare this to traditional scraping: writing CSS selectors, handling layout changes, managing error states.

The skill isn't just building this. It's knowing when AI agents are the right tool versus when you need traditional scraping's precision.

Subscribe below to follow along as I build and share more on AI, Strategy & Automation.


Build Your First AI Agent in 10 Minutes with No Code

November 15, 2025 by Oladotun Opasina

Build your first AI agent in 10 minutes with N8N


Exploring New Horizons With N8N AI Workflow Automation

November 11, 2025 by Oladotun Opasina

An exciting way to interact with AI Agents and tools


GreenBookAI: Custom Itinerary Generation with Agentic AI

July 12, 2025 by Oladotun Opasina

In a world where travel can feel exciting for some and unsafe or isolating for others, especially for people from underrepresented backgrounds, personalized and culturally aware trip planning isn’t a luxury; it’s a necessity. Yet most travel platforms today still offer one-size-fits-all recommendations, often overlooking safety, cultural fit, or inclusion.

GreenBookAI was built to change that.

It’s an AI-powered travel planning tool that not only curates personalized itineraries but does so with an emphasis on trust, safety, and cultural relevance. This post provides a high-level technical look into how GreenBookAI works under the hood, and how we’re using agentic artificial intelligence to build a better travel experience.

If you're curious, I invite you to sign up and test GreenBookAI. It’s still in active development, but the core functionality is already live and your feedback would be incredibly valuable.

Backend Architecture: The Core Intelligence

At the heart of GreenBookAI is a Python-based backend that orchestrates every aspect of the itinerary generation process. This is where most of the intelligence lives.

Multi-Agent System

A standout feature is our use of agentic AI, powered by OpenAI Agents. We’ve created a dedicated Activities Search Agent that performs structured searches and returns detailed JSON responses for businesses.

These results are processed by a central orchestrator that allocates hobbies, restaurants, and content across a multi-day plan based on the user’s travel window and preferences.

Concurrent Processing

To ensure speed, we’ve implemented concurrent async processing using asyncio and httpx.AsyncClient. This allows us to search for multiple types of businesses at once, validate multiple URLs simultaneously, and respond quickly, even when generating multi-day plans.

Rate Limiting

To stay compliant with third-party APIs (like OpenAI’s), we use rate limiting via asyncio.Semaphore to throttle concurrent requests and prevent overload.
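
For readers who want to see the pattern, here is a minimal sketch of the concurrency and rate limiting described above. The URL and search categories are placeholders, not GreenBookAI's real endpoints.

import asyncio
import httpx

MAX_CONCURRENT_REQUESTS = 5
semaphore = asyncio.Semaphore(MAX_CONCURRENT_REQUESTS)  # simple rate limiting

async def search_category(client: httpx.AsyncClient, category: str) -> dict:
    async with semaphore:  # at most 5 requests in flight at once
        resp = await client.get("https://api.example.com/search",  # placeholder URL
                                params={"q": category})
        resp.raise_for_status()
        return {"category": category, "results": resp.json()}

async def search_all(categories: list[str]) -> list[dict]:
    async with httpx.AsyncClient(timeout=10.0) as client:
        tasks = [search_category(client, c) for c in categories]
        return await asyncio.gather(*tasks)  # run all category searches concurrently

if __name__ == "__main__":
    results = asyncio.run(search_all(["restaurants", "museums", "live music"]))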

Frontend Experience: Built in React.js

The user interface is built with React.js & Vite.js, designed to keep things simple but dynamic. The frontend handles:

  • Collecting user inputs (dates, hobbies, food preferences, etc.)

  • Sending that data to the backend

  • Displaying the structured itinerary, complete with location info and image assets from Pexels for added context

It’s a lightweight but functional interface that makes exploring your travel plans feel personal.

Real-World Sharing: What’s Next?

I recently demoed GreenBookAI during a panel at Schneider Electric, where we discussed how AI can be used to make tools more inclusive and context-aware.


The reaction reinforced the need for products like this: tools that don’t just optimize for efficiency, but for empathy and safety too.

There’s still a lot of work ahead, but the core engine is live and I’d love for you to give it a try.

Whether you're someone who wants culturally relevant travel suggestions, a builder curious about multi-agent AI systems, or just someone looking for a fresh take on travel planning, I welcome your thoughts, ideas, and feedback.


Building JapaAdvisorAI for Agentic Immigration Assistance In LangGraph

March 22, 2025 by Oladotun Opasina

In my latest hobbyist project, JapaAdvisorAI, I engineered an agent-driven AI assistant that leverages stateful orchestration, large language models (LLMs), and real-time information retrieval to provide dynamic, high-fidelity immigration guidance.

This project is not just a chatbot—it embodies AI as an autonomous agent, making real-time decisions, adapting its strategy based on user interactions, and leveraging external tools for knowledge augmentation. Below, I outline the technical stack, architectural principles, and agentic AI design that underpin JapaAdvisorAI’s intelligence.

Tech Stack for JapaAdvisorAI: From Frontend to Backend

Building an AI-enabled product involved a whole host of learnings and a variety of tools and products. I relied on AI question-and-answering tools for code generation and UI generation, but still spent a good amount of time getting things to work. Below are the different categories of tools used:

  1. Frontend & API: Javascript/Html/Css (UI), FastAPI (backend API).

  2. AI Orchestration: LangChain (prompt management), LangGraph (stateful conversation flow).

  3. LLM Integration: ChatOpenAI (response generation, query refinement).

  4. State & Storage: PostgreSQL (data storage).

  5. Knowledge Retrieval: TavilySearchResults (real-time search), content extraction for summaries.

  6. Cloud & Deployment: Digital Ocean Serverless tier, Docker.

  7. Code generation: GitHub Copilot, Galileo UI, ChatGPT and DeepSeek.

Architecting an Agentic AI System with LangGraph

Unlike conventional rule-based chatbots, JapaAdvisorAI employs LangGraph to model a multi-agent system where conversational nodes represent discrete agents responsible for distinct cognitive functions. This modular approach enhances adaptability, ensures context retention, and enables autonomous decision-making within the AI system.

The key components include:

  • Multi-Agent Workflow with LangGraph’s StateGraph

At the core of JapaAdvisorAI is LangGraph’s StateGraph, which orchestrates the AI’s decision-making process by defining functional agents as graph nodes. This ensures:

  1. Dynamic conversation routing based on user queries

  2. Adaptive state transitions based on contextual understanding

  3. Parallel execution of reasoning and retrieval tasks

Each node in the graph represents a specialized AI function, forming a decomposed agentic system rather than a monolithic model.
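
To make this concrete, here is a simplified sketch of how such a graph might be wired with LangGraph. The node functions are stubs and the routing is reduced to a single condition; the real implementations call the LLM, Tavily search, and the database.

from typing import TypedDict
from langgraph.graph import StateGraph, END

class AdvisorState(TypedDict):
    question: str
    refined_question: str
    needs_search: bool
    answer: str

def preprocess_node(state: AdvisorState) -> dict:
    return {"question": state["question"].strip()}           # stub: context injection

def refine_node(state: AdvisorState) -> dict:
    return {"refined_question": state["question"]}           # stub for refine_tool

def route_intent(state: AdvisorState) -> str:
    return "retrieve" if state.get("needs_search") else "respond"

def retrieve_node(state: AdvisorState) -> dict:
    return {"answer": "facts pulled from search results"}    # stub for Tavily retrieval

def respond_node(state: AdvisorState) -> dict:
    return {"answer": state.get("answer", "") or "LLM-generated answer"}

builder = StateGraph(AdvisorState)
builder.add_node("preprocess", preprocess_node)
builder.add_node("refine", refine_node)
builder.add_node("retrieve", retrieve_node)
builder.add_node("respond", respond_node)

builder.set_entry_point("preprocess")
builder.add_edge("preprocess", "refine")
builder.add_conditional_edges("refine", route_intent,
                              {"retrieve": "retrieve", "respond": "respond"})
builder.add_edge("retrieve", "respond")
builder.add_edge("respond", END)

graph = builder.compile()
result = graph.invoke({"question": "What documents do I need for a US visa?",
                       "needs_search": True})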

  • Defining Agent Roles and Responsibilities

The AI system consists of distinct functional agents that operate in a coordinated manner:

Preprocessing Agent (Initialization & Context Injection)

    1. preprocess_node: Captures user intent, extracts contextual details, and initializes the state with prior interactions.

    2. Agent Role: Ensures that every user session retains contextual continuity, mitigating information loss between queries.

Query Refinement Agent (Enhancing User Input)

    1. refine_tool: Utilizes prompt engineering to restructure and disambiguate user queries before passing them to the core LLM.

    2. Agent Role: Acts as a pre-LLM filter, ensuring that user questions are structured optimally for precision in AI responses.

Conversational AI Agent (LLM Integration & Response Generation)

    1. LLM backbone: ChatOpenAI API processes refined queries and generates contextual, multi-turn responses.

    2. Agent Role: Dynamically adapts responses based on conversation history and prior decisions in the state graph.

Information Retrieval Agent (Real-Time Search & Augmented Responses)

    1. generate_content_from_link: Extracts and summarizes retrieved documents for inclusion in responses. Uses TavilySearchResults to fetch relevant immigration information from trusted sources.

    2. Agent Role: Functions as an external knowledge augmenter, ensuring that responses remain factually accurate and policy-compliant.

Intent Extraction Agent (Conditional Routing & Adaptive Transitions)

    1. Intent extraction: Implements conditional_edge logic to route user queries dynamically based on complexity and knowledge gaps.

    2. Agent Role: Determines whether to proceed with response generation, follow-up questioning, or external retrieval, ensuring a multi-step reasoning framework.

      • Each agent operates independently yet collaboratively, forming a distributed AI system where decisions emerge from collective reasoning rather than a single-pass LLM call.

Key Innovations in the Agentic AI Architecture

  1. Stateful Orchestration with Graph-Based AI: By leveraging LangGraph’s graph-based AI architecture, JapaAdvisorAI transitions from traditional conversational models to a true agentic AI system. StateGraph ensures a structured, adaptive flow where conversations evolve dynamically rather than following static rules. Agents execute concurrently, enabling parallel processing of follow-ups, search queries, and response generation. AI decisions are no longer sequential; they are emergent behaviors resulting from interactions between different nodes in the system.

  2. Multi-Agent Collaboration for Enhanced Decision-Making: JapaAdvisorAI’s agents operate as a distributed decision-making system where each node specializes in a particular task.

    1. Autonomous Follow-Up Generation: The Follow-Up Agent analyzes gaps in user input and triggers clarifying questions automatically.

    2. Search-Augmented Reasoning: The AI seamlessly determines when it lacks sufficient knowledge and invokes the Retrieval Agent for real-time updates.

    3. Adaptive Response Refinement: The Query Refinement Agent continuously optimizes user inputs to maximize response accuracy.

      This approach mirrors real-world expert consultations, where multiple specialists contribute insights to refine answers progressively.

  3. Hybrid AI: LLM Augmentation with External Knowledge

    One of the major challenges in AI-driven advisory systems is maintaining factual accuracy in rapidly changing domains like immigration law. To address this, JapaAdvisorAI employs a hybrid AI approach:
     - LLM-powered natural language processing for reasoning and conversation
     - External search integration for real-time fact-checking
     - Decision logic to determine when search augmentation is needed

    This ensures that the AI remains accurate, up-to-date, and grounded in real-world data rather than relying on outdated model knowledge.

Lessons Learned: Building the Next Generation of AI Advisors

1. Agentic AI Is Great For Conversational Systems: Traditional chatbots struggle with context retention, adaptability, and complex decision-making. By employing graph-based AI orchestration, JapaAdvisorAI demonstrates that:

  • AI assistants can dynamically adjust their strategies based on real-time inputs.

  • Autonomous agents enhance AI reasoning by specializing in distinct tasks.

  • Multi-agent systems enable emergent behaviors, where responses evolve based on collective agent decisions.

2. The Best AI Advisors Combine Intelligence with Knowledge Augmentation: LLMs are not sufficient on their own—they require:

  • Structured query refinement to ensure clarity

  • Automated knowledge retrieval for real-time updates

  • Graph-based decision-making to route user queries intelligently

  • This hybrid AI paradigm significantly outperforms traditional chatbot architectures in knowledge-sensitive domains.

3. Modular AI Design Increases Scalability & Maintainability: By decoupling AI functions into agent-based nodes, I ensured that:

  • New capabilities (e.g., new search APIs, expanded Q&A models) can be added without overhauling the system.

  • The AI can scale horizontally, distributing different functions across specialized modules.

  • The logic remains explainable and auditable, improving trust and compliance in high-stakes AI applications.

 

JapaAdvisorAI Demo

Here is a video demonstration of the JapaAdvisorAI agent, deployed on Digital Ocean, responding to a simple query about the documents required for a USA visa from Nigeria.

Final Thoughts: Building AI for Complex Decision-Making

JapaAdvisorAI represents a paradigm shift in AI advisory systems, proving that agentic architectures, knowledge augmentation, and adaptive workflows are the key to next-gen AI assistants. If you want to use JapaAdvisorAI, please contact me.

As a Principal AI Expert, I specialize in:

  • Designing multi-agent AI systems using LangGraph

  • Developing LLM-powered advisory tools with real-world augmentation

  • Architecting AI that blends reasoning, search, and stateful decision-making

If your organization is exploring agentic AI solutions or needs a leader to drive AI innovation, let’s connect!


Using Machine Learning to Identify Patients Who Are No-Shows

March 01, 2020 by Oladotun Opasina

Here is a brief introduction to the project.

Please check out the blog post: https://www.dotunopasina.com/datascience/noshowappointments

Introduction

In this project, we will be utilizing machine learning algorithms to perform feature selection on patient appointment data. The goal is to understand what characteristics of a patient make them miss their appointment.

Dataset

The dataset for this project was obtained from Kaggle and consists of 14 columns and 110,527 rows of data.

The data consists of the following columns:

  1. Patient Id

    • Identification of a patient

  2. Appointment ID

    • Identification of each appointment

  3. Gender

    • Male or Female. Females make up the greater proportion of the data; women tend to take more care of their health than men.

  4. AppointmentDate

    • The day of the actual appointment, when they have to visit the doctor.

  5. Scheduled Date

    • The day someone called or registered the appointment, which is of course before the appointment date.

  6. Age

    • The age of the patient.

  7. Neighborhood

    • Where the appointment takes place.

  8. Scholarship

    • True or False. This refers to a welfare program; for background, see https://en.wikipedia.org/wiki/Bolsa_Fam%C3%ADlia

  9. Hypertension

    • True or False

  10. Diabetes

    • True or False

  11. Alcoholism

    • True or False

  12. Handicap

    • True or False

  13. SMS_received

    • 1 or more messages sent to the patient.

  14. No-show

    • True or False.

Machine Learning Process

The steps taken to accomplish our results include the following:

  1. Data preprocessing.

  2. Create a waiting-time field (days between the scheduled and appointment dates).

  3. Exploratory data analysis.

  4. Pass the data through the machine learning algorithm

  5. Select the 10 features most associated with missing an appointment and the 10 features most associated with keeping it.

The code of the project can be found on my github.

Exploratory Data Analysis

The pie chart below shows the number of Yes (shows up to the appointment) as 85,299 and No (misses the appointment) as 21,677. This implies we have an imbalanced dataset, and we need to keep that in mind as we move along.

Number of Yes and No to appointments

Machine Learning Model

The machine learning model used here was a logistic regression with lasso (L1) regularization. Regularization is a way of penalizing the model’s cost function to ensure that the model does not overfit. In this case, the coefficients of unimportant features are driven to zero, which lets us select the important features.
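
A minimal sketch of this approach with scikit-learn. The column names follow the Kaggle dataset (they may differ slightly by version) and the regularization strength is an illustrative choice; the full notebook is on my GitHub.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("noshowappointments.csv")                      # Kaggle dataset
df["waiting_days"] = (pd.to_datetime(df["AppointmentDay"]) -
                      pd.to_datetime(df["ScheduledDay"])).dt.days

features = pd.get_dummies(
    df[["waiting_days", "Age", "SMS_received", "Hipertension",
        "Diabetes", "Alcoholism", "Neighbourhood", "Gender"]],
    drop_first=True)
target = (df["No-show"] == "Yes").astype(int)                   # 1 = missed appointment

X = StandardScaler().fit_transform(features)
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)  # lasso penalty
model.fit(X, target)

coefs = pd.Series(model.coef_[0], index=features.columns)
print("Most associated with missing:", coefs.nlargest(10), sep="\n")
print("Most associated with showing up:", coefs.nsmallest(10), sep="\n")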

Results and Insights

The model selected the most important features that affect patients missing their appointment as seen in the figure below.

Feature selections of Appointment No Shows

From the chart above, we can split the features into two groups: those that make a patient more likely to miss an appointment and those that make them less likely to miss it.

More Likely to Miss Appointment

  • Patients who had a large difference between their scheduled and appointment date missed their appointment the most

  • Interestingly patients who received an SMS message still missed their appointment

  • Patients in the Itarare and Santos Dumont neighborhoods were more likely to miss their appointments

  • Patients between the ages of 13 and 14 were more likely to miss their appointments

Less Likely to Miss Appointment

  • Patients who were ages 64 and 69

  • Patients who lived in Santa Martha, Jardim da Penha and Jardim Camburi

  • Patients who had Hypertension were less likely to miss their appointments


Credit Card Fraud Detection using Logistic Regression, Naive Bayes and Random Forest Classifiers

February 23, 2020 by Oladotun Opasina

Introduction

The goal of this project is to utilize machine learning algorithms to classify a transaction as fraudulent or not based on multiple inputs.

Datasets

The dataset for this project was obtained from Kaggle. It contains transactions made by European cardholders over two days in September 2013, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. The features are the dimension-reduced observations V1…V28, the transaction amount, and a class label indicating whether the transaction is fraudulent or not.

Machine Learning Algorithm Process

The machine learning process highlights the steps taken to get the models up and running from start to end, including the data preprocessing and cleaning stages. The process includes:

  1. Download data sets from Kaggle.

  2. Load data into Jupyter notebook and perform exploratory analysis.

  3. Split the data into input and output columns.

  4. Standard-scale the data to normalize the feature ranges in the dataset.

  5. Pass the data into grid-searched logistic regression, naive Bayes, support vector, and random forest classifiers.

  6. Calculate the performance metrics of the models. Note that since we have imbalanced data, we use a confusion matrix and the F1 score to evaluate the models.
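
A condensed sketch of steps 3 through 6 (the support vector classifier is omitted here; as noted in the results below, it was too slow to finish). The file name and hyperparameter grid are illustrative; the full notebook is on my GitHub.

import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix

df = pd.read_csv("creditcard.csv")                       # Kaggle credit card fraud data
X, y = df.drop(columns=["Class"]), df["Class"]
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

models = {
    "logistic": GridSearchCV(LogisticRegression(max_iter=1000),
                             {"C": [0.01, 0.1, 1, 10]}, scoring="f1"),
    "naive_bayes": GaussianNB(),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(name, "F1:", round(f1_score(y_test, preds), 3))
    print(confusion_matrix(y_test, preds))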

The code for the project can be found on my GitHub.

Results

The results of the 4 machine learning models evaluated are as follows:

  1. Naive Bayes performed worst, with an F1 score of 11.31%.

  2. Logistic regression scored 72.9%.

  3. Random forest scored 87.4%.

  4. Support vector machines took so long to run that I had to stop the process.

Conclusion

In general, the random forest classifier performed best: it is a combination of decision trees and is protected from overfitting by its ensembling method. This project was an interesting one to learn from, and the outcomes were used to further my knowledge of machine learning. Please find the code used for the project on my GitHub page.


Predicting House Prices in the Massachusetts area using Linear Regression, Support Vector Regression and Random Forest Regression

February 10, 2020 by Oladotun Opasina

Introduction

For this project, I scraped housing data from Zillow and used features such as the size of a house in square feet and the number of beds and baths to predict the price of the house. The goal of this project is to understand multiple regression algorithms and see how they function on a real-world problem.

The Machine Learning Process

The data process included :

  1. Data collection and cleaning: I scraped the data from Zillow for particular cities in Massachusetts and did some data cleaning by removing rows with empty fields and those without bedrooms and baths in their columns. The data was log-transformed to reduce skewness.

  2. Machine Learning Algorithm: The clean data was then passed into multiple machine learning algorithms such as linear regression, support vector regressor and random forest regressor to predict the price of the houses.

  3. Evaluate Machine Learning Algorithms: The algorithms were evaluated to see how they performed. Spoiler alert: random forest regression did the best amongst the three, while linear regression performed worst.

  4. Display Result: The predictors were then served from a Flask app where users can try multiple inputs to the models and see how the predicted price changes.


Machine Learning Introductions

Linear Regression

Linear regression is a supervised learning algorithm that fits a line through the data to predict a continuous variable. The algorithm tries to minimize the residuals (the differences between the predicted and actual values) as it calculates its prediction.


Support Vector Regression

Support vector regression is a supervised learning algorithm that tries to fit a function so that most errors fall within a margin (a threshold boundary) around the prediction. The points on or outside that boundary are the support vectors. A tolerance is chosen that allows some points to fall outside the margin, and kernels can be used to transform the data into a space where a good fit is possible.


Random Forest Regression

Random forest regression is a supervised learning algorithm that utilizes a combination of decision trees to produce a continuous output. A combination of models is called an ensemble model. Random forest uses a bagging ensemble method (bootstrapping and aggregation) to come up with its output: bootstrapping is row sampling with replacement, each tree is trained on a bootstrap sample, and the results are aggregated (averaged for regression) to get the final output.


Evaluation Metric

The models were evaluated using the R-squared metric, which measures goodness of fit: R-squared = 1 - (sum of squared residuals) / (total sum of squares), i.e. the proportion of the variance in the target that the model explains.

Here are the results of the models.

  1. Linear regression: 65.7%

  2. Support Vector Regressor: 71.2%

  3. Random Forest Regressor: 80.4%
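
For reference, a minimal sketch of how the three regressors and their R-squared scores can be compared. The file and feature names are illustrative stand-ins for the scraped Zillow data.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

df = pd.read_csv("zillow_ma_listings.csv")          # scraped data (illustrative name)
X = df[["sqft", "beds", "baths"]]
y = np.log(df["price"])                              # log-transform to reduce skew

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
models = {
    "linear regression": LinearRegression(),
    "support vector regressor": SVR(),
    "random forest regressor": RandomForestRegressor(n_estimators=200,
                                                     random_state=42),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: R^2 = {model.score(X_test, y_test):.3f}")  # .score() is R-squared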


Project Demo

A screenshot of the actual website of the project can be found below. Users can enter their inputs and get results from the different models in the application. Something to note is that we are predicting per city and some cities have more observations than others.


Conclusion and Future Works

House price prediction for the Massachusetts area has been achieved. Next, I need to collect more data for better predictions and develop the Flask web UI further.

A link to the project can be found on GitHub, along with my contact information.


Metis Final Project: Predicting Kickstarter Projects Successes using Neural Networks

September 22, 2019 by Oladotun Opasina

We just completed our final project at Metis in Seattle, Washington, USA. These past 12 weeks went by quickly and were filled with many lessons. It is going to take a while to process them and I am glad for the period of growth.

For this project, we were tasked to individually select a passion project and use our newly learned machine learning algorithm skills on them.

After many discussions with our very fine instructors, I decided to focus on predicting the success of Kickstarter campaigns using neural networks and logistic regression.

Tools Used:

  • MongoDB for storing data

  • Python for coding

  • TensorFlow and Keras for neural networks

  • Tableau for data Visualization

  • Flask app for displaying the website

The code and data for this project can be found on Github.

Challenge:

“Why did Football Heroes, a mobile game company using Kickstarter, achieve its goal of 12,000 while CowBell Hero, another company that used Kickstarter, did not? The goal of this project is to help campaigns succeed as they raise their funds“

This is a Tough Problem

This is a tough problem because Kickstarter data contains both text data, such as the project title and description, and tabular data, such as the goal and duration of the project. Hence my model needed to account for both.


Data:

The data was collected from a Kickstarter web scraper called webroot.io.

  1. Kickstarter data

Steps:

  1. Preprocess data using MongoDB

  2. Split data into text data (title, description ) and tabular data.

  3. Pass the text data into an LSTM neural network and the tabular data into a regularized logistic regression.

  4. Combine all the models together in an ensemble model; the accuracy score of my result was 78.3% (a code sketch of steps 3-4 follows this list).

  5. Build a website using a Flask app to allow users to interact with the output.
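
As referenced above, here is a condensed sketch of steps 3 and 4: an LSTM over the text, a regularized logistic regression over the tabular features, and a simple probability average as the ensemble. It is a simplified stand-in for the original architecture; data loading and preprocessing are assumed to be done already.

import numpy as np
import tensorflow as tf
from sklearn.linear_model import LogisticRegression

# text_sequences: integer-encoded, padded title+description sequences
# tabular: scaled goal, duration, etc.; labels: 1 = funded, 0 = not funded
def build_text_model(vocab_size=20000):
    return tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 64),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])

text_model = build_text_model()
text_model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# text_model.fit(text_sequences, labels, epochs=5, validation_split=0.2)

tabular_model = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
# tabular_model.fit(tabular, labels)

def ensemble_predict(text_sequences, tabular):
    p_text = text_model.predict(text_sequences).ravel()
    p_tab = tabular_model.predict_proba(tabular)[:, 1]
    return ((p_text + p_tab) / 2 > 0.5).astype(int)   # average the two probabilities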

Steps for Kickstarter Prediction

Video Demo

Future Work:

Presently, I have a working website that predicts campaign successes and failures. Future work includes improving the models and the website.

Thank You:

I want to use this medium to thank my family, friends and colleagues at Metis for their lessons, patience and the opportunity to grow. All of my successes at Metis are because of the positive impact they had on me and my projects. So thank you.




PowerPoint Slides:


Metis Project 3: Why are My Customers Leaving? - Using Logistic Regression To Interpret Churn Data

August 12, 2019 by Oladotun Opasina in DataScience, Churn, Marketing

We just finished week 6 at Metis in Seattle, Washington, USA. These past weeks have gone by quickly; we are halfway through the program and the skill sets learned are amazing.

On my last project, I worked on predicting NBA player salaries, and the feedback I received was extremely useful. Thank you.

For this project, we utilized clustering methods discussed in class to solve a business problem. This project was done individually. I decided to focus on a company’s churn data to figure out what sort of customers are leaving and used a logistic regression algorithm. I used Python for coding and Tableau for data visualization. The code and data for this project can be found on Github.

My initial plan was to utilize data from the Economist to cluster and figure out what style of leadership is important for the economic growth of countries. This was based on a discussion with my fellow Schwarzman scholar, Lorem Aminathia, on the model of leadership needed to ensure Africa’s growth. Unfortunately, there were not enough data features to properly evaluate this problem.

Challenge:

“We were consulted by Infinity - a hypothetical internet service provider - to figure out their churn (which customers are leaving) and where their Growth Team can focus“

Data:

The data was the IBM Telco service churn dataset on Kaggle.

  1. IBM Telco Churn data

Approach:

The Minimum Viable Product (MVP) for our client was to address the following point:

  1. Figure out the number of customers churning.

  2. Find out the most frequent types of customer churning.

  3. Provide recommendation of next steps to take for the program.

Steps:

The following steps were taken to produce results; these are general data science steps toward a solution and are usually iterative.

  1. Data gathering from our data sources.

  2. Data cleaning

  3. Feature Extractions and Cleaning

  4. Data Insights

  5. Client Recommendations

Insights and Reasons

After downloading, cleaning, and aggregating the datasets, the following was noticed:

  1. About 26% of customers are churning: out of roughly 7,000 customer records, close to 2,000 churned.


2. Logistic regression (accuracy score of 80%) surfaced the features of the customers most and least likely to churn.

The image shows the features associated with customers churning or staying. Something that surprised me was that fiber optic users were more likely to churn than digital subscriber line (DSL) users, a different type of internet service customer. It is surprising because fiber optic internet service is usually faster than DSL. On the other hand, fiber is usually more expensive than DSL, and users may be getting tired of paying the premium for the service.
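
A minimal sketch of how the logistic regression coefficients can be read as churn drivers. The file name is the dataset as distributed on Kaggle (it may differ for you), and the preprocessing is simplified; the full notebook is on my Github.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce").fillna(0)
y = (df["Churn"] == "Yes").astype(int)
X = pd.get_dummies(df.drop(columns=["customerID", "Churn"]), drop_first=True)

model = LogisticRegression(max_iter=1000)
model.fit(StandardScaler().fit_transform(X), y)

# Positive coefficients push toward churn; negative ones push toward staying.
drivers = pd.Series(model.coef_[0], index=X.columns).sort_values()
print("Most associated with staying:", drivers.head(5), sep="\n")
print("Most associated with churning:", drivers.tail(5), sep="\n")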



Recommendation

An immediate next step for the growth team is to provide an option for fiber optics customers that are about to leave to switch to DSL service.

Infinity's Customers Leaving! Stop That Churn. Dotun Opasina


Metis Project 2: Predicting NBA Player Salaries using Linear Regression

July 21, 2019 by Oladotun Opasina in DataScience, NBA

We just finished our third week at Metis in Seattle, Washington, USA. These past weeks were a roller coaster of learning amazing materials in statistics, python and linear algebra.

On our first project, I worked with other students to provide recommendations for Women in Technology and that experience was amazing.

For our second project, we worked individually and utilized linear regression to predict or interpret data on a topic of our choosing. I decided to focus on the NBA because of my rekindled love for the game after watching last season's tumultuous finals between the Toronto Raptors and the Golden State Warriors.

Even though I worked on this project alone, my Metis classmate Fatima Loumaini and my instructors helped me understand the theory.

A big shoutout goes to my ex-managers at Goldman Sachs, who gave me feedback on my model and on how to properly create compelling visualizations. Thank you Rose Chen, David Chan and Samanth Muppidi (inside joke).

Goal

The goal of this project is to predict NBA players’ salaries per season based on their statistics using linear regression. This project can be used by both players and team managers to evaluate the impact a particular player is making on a team and to decide whether to increase the player’s salary or trade the player.

Notes:

I am taking the non-traditional approach of explaining my results first; anyone who is interested in the technicalities of the entire project can read the remainder of the blog and view the code / presentations.

Results and Insights:

Growing Salaries and Injury Impacts. Predicting Victor Oladipo’s Salaries:

The model was tested on Victor Oladipo’s per-season stats from 2017 - 2019. Victor was the Most Improved Player in 2018. Using a selection algorithm, the most important stats for a player were selected to predict his salary.

From the charts below, we can see that the ratio of Victor’s actual salaries to his stats increased from 2017 to 2018 and stayed fixed in 2019, while my model predicted his salary should have increased from 2017 to 2018 (but not as much as his actual increase) and decreased slightly in 2019. We can also see that Victor’s stats increased from 2017 to 2018 and decreased slightly in 2019.

Observations

In the real world, Victor made a huge impact on his team, the Indiana Pacers, from 2017 to 2019, but got an injury that knocked him out for the season in 2019. This injury affected the impact he made on his team, hence the decrease in his stats. The reason we do not see a change in his salary is that he is currently on a multi-year contract, which is usually guaranteed despite injuries.

Plots of Actual Vs. Predicted Player’s Salaries and Players’ Individual Stats Sum for 2017-2019.

Growing Salaries, Growing Impacts. Predicting Giannis Antetokounmpo Salaries:

The model was tested on Giannis’s stats who was the Most Improved Player in 2017 and the results were used to create the charts below.

From the charts, we can see that the ratio of Giannis’s actual salaries to his stats increased from 2017 to 2019 while my model predicted his salary should have increased from the year 2017 to 2019. Giannis’s stats from 2017 to 2019 saw a steady increase as well. Something worth noting is that my model says that Giannis needs to be making more than his actual salaries from 2017-2019.

Observations

Comparing this to reality, Giannis improved greatly in 2017 and signed a multi-year contract that season, so we see an increase in his salary. My model predicted that, because of Giannis’ impact on his team, he should be earning more money. But Giannis cares more about building the Milwaukee Bucks franchise and is willing to grow with the organization.

Plots of Actual Vs. Predicted Player’s Salaries and Players’ Individual Stats Sum for 2017-2019.

Growing Salaries, Declining Impact. Predicting Jimmy Butler Salaries:

Finally, the model was evaluated on Jimmy Butler's stats (he was the Most Improved Player in 2015) to generate the charts below.

The charts show the ratio of Jimmy’s actual salaries to his stats increased from the year 2017 to 2019 and my model predicted his salary should have decreased over that time period.

We can see that Jimmy’s stats from 2017 to 2019 slightly decreased. Something worth noting is that my model says Jimmy should be making less money than his actual salaries based on his stats.

In actuality, Jimmy’s stats saw a steady decrease from 2017 to 2019 as he switched from the Chicago Bulls to the Minnesota Timberwolves in 2018 and to the Philadelphia 76ers in 2019. In explaining this phenomenon of increasing salaries alongside decreasing stats, it is general knowledge that a player’s brand also adds to his value, and when switching teams a player needs time to adjust to that team’s style of play. So it is not surprising that Jimmy’s stats decreased over time.

Plots of Actual Vs. Predicted Player’s Salaries and Players’ Individual Stats Sum for 2017-2019.

If you made it this far, then you are interested in the technicality of things. Kindly enjoy the read below; I welcome any constructive feedback.

NBA Introduction:

The National Basketball Association is a men's professional basketball league in North America, composed of 30 teams. It is one of the four major professional sports leagues in the United States and Canada, and is widely considered to be the premier men's professional basketball league in the world.

Find the major stats for the NBA in 2019 below:

Major NBA Stats in 2019

Approach:

The approach for this project was to utilize specific player stats to predict their salaries using linear regression. I utilized the Lasso algorithm to select the most important player statistics that affected a player's salary.
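
A minimal sketch of that Lasso-based selection. The file and stat column names are illustrative stand-ins for the roughly 20 scraped stats per player.

import numpy as np
import pandas as pd
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("nba_stats_salaries_2017_2019.csv")   # scraped data (illustrative name)
stats = df.drop(columns=["player", "season", "salary"])
X = StandardScaler().fit_transform(stats)
y = np.log(df["salary"])                               # salaries are log-transformed

lasso = LassoCV(cv=5).fit(X, y)
coefs = pd.Series(lasso.coef_, index=stats.columns)
selected = coefs[coefs != 0].sort_values(key=abs, ascending=False)
print("Stats kept by the Lasso:", selected, sep="\n")   # e.g. age, minutes, rebounds...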

Steps:

The following steps were taken in achieving my goals for this project.

  1. Data scraping and cleaning.

  2. Data and feature engineering.

  3. Model validation and selection.

  4. Model prediction and evaluation.

Data Scraping and Cleaning:

The data for this project was scraped from:

  1. Basketball Reference: a website that contains basketball player stats.

  2. I selected basketball player stats and salaries from 2017 - 2019 for this project.

  3. I chose around 20 unique stats per player.

The python script that was used to scrape the data can be found on my github page.

Data and Feature Engineering:

After performing Lasso feature selection, I was able to select the 5 specific stats, out of the 20 unique stats, that most affected a player's salary. They are namely:

  1. The player’s age

  2. The minutes played per game

  3. The defensive rebounds per game.

  4. The personal fouls per game.

  5. The average points made per game.

The image below shows a heatmap of the selected NBA stats against the salaries. Notice that the salaries are log-transformed to properly scale with the features, and all the stats are positively correlated with the salaries, which suggests that this problem is well suited to linear regression.

HeatMap displaying positive correlation of my different stats to Salary.

Model Validation and Selection:

I split my data into train and validation sets before fitting my model on the train data. I got an R-squared score of 42%, which reflects how much of the variability in the data the model explains.

Model Prediction and Evaluation:

After training my model with my train set, I got the predicted salaries for each player from 2017-2019. The insights of this project can be found in the Results and Insights section.

Conclusions:

Players and Team managers can better work together using the NBA prediction model when creating contracts and have a standardized way to evaluate impact.

Future Works:

  1. Collect more NBA data from 2008 - 2019.

  2. Include features on out-of-season injuries, the start of players' contracts, a player's brand value, etc.

  3. Figure out ways for players to improve specific stats.

Below is my presentation for the project at Metis. Looking forward to your feedback.

Women in Technology source:https://www.we-worldwide.com/blog/posts/black-women-in-tech

Metis Project 1 : Analysis for WomenTechWomenYes Summer Gala

July 10, 2019 by Oladotun Opasina

We just finished our first week at Metis in Seattle, Washington, USA. The past few days have been a whirlwind of both review and new materials. As our first project, we leveraged the Python modules Pandas and Seaborn to perform rudimentary EDA and data visualization. In this article, I’ll take you through our approach to framing the problem, designing the data pipeline, and ultimately implementing the code. You can find the project on GitHub, along with links to all the data.

I worked on this project with Aisulu Omar from Kazakhstan, and with Alex Lou and Dr. Jeremy Lehner, both from America.

I am Nigerian and you can find me on LinkedIn.

The Challenge:

WomenTechWomenYes (WTWY), an (imaginary) non-profit organization in New York City, is raising money for their annual summer gala. For marketing purposes, they place street teams at entrances to subway stations to collect email addresses. Those who sign up are sent free tickets to the gala. Our goal is to use MTA subway data and other external data sources to help optimize the placement of the teams, so that they can gather the most signatures from people who will attend and contribute to WTWY’s cause.

Our Data:

We used three main data sources for this project.

  1. MTA Subway data

  2. Yelp Fusion API to figure out the zip code of each station.

  3. University of Michigan median and mean income data by zip code.

Our Approach:

For our approach, we discussed as a team what our Minimum Viable Product (MVP) for the client would be, and we came up with three goals.

  1. Find the busiest stations by traffic to easily deploy the street teams.

  2. Find the busiest days of the week at the train stations.

  3. Find and join income data to the busiest stations to figure out who will donate to our cause.

Our Steps:

We took the following steps to get our results; these are general data science steps toward a solution and are usually iterative (a pandas sketch of the aggregation step follows the list).

  1. Data gathering from our data sources.

  2. Data cleaning

  3. Data aggregating

  4. Data insights

  5. Client Recommendations
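
Here is the pandas sketch of the aggregation step mentioned above. The combined file name is illustrative; the column names follow the MTA turnstile files, which report cumulative entries per turnstile, so we difference within each turnstile before summing.

import pandas as pd

df = pd.read_csv("mta_turnstile_summer_2019.csv")       # combined weekly files (illustrative name)
df.columns = df.columns.str.strip()
df["DATETIME"] = pd.to_datetime(df["DATE"] + " " + df["TIME"])

turnstile = ["C/A", "UNIT", "SCP", "STATION"]
df = df.sort_values(turnstile + ["DATETIME"])
df["ENTRY_DIFF"] = df.groupby(turnstile)["ENTRIES"].diff().clip(lower=0, upper=10000)

busiest_stations = (df.groupby("STATION")["ENTRY_DIFF"].sum()
                      .sort_values(ascending=False).head(5))
busiest_days = (df.assign(DAY=df["DATETIME"].dt.day_name())
                  .groupby("DAY")["ENTRY_DIFF"].sum()
                  .sort_values(ascending=False))
print(busiest_stations, busiest_days, sep="\n\n")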

Our Insights and Reasons

After downloading, cleaning, and aggregating the datasets, we noticed the following:

  • Wednesdays are the busiest days


Busiest Day of the Week is Wednesday by Number of Entries

  • The top 5 Busiest stations by traffic are:

    • 34th St - Penn Station

    • 42nd St - Grand Central 

    • 34th St - Herald Square

    • 14th St - Union Sq

    • 42nd St - Times Sq


Top 5 Busiest NYC Stations in the Summer

Why?

  • This is because the top 5 stations are located near the Midtown area of New York City, which is especially busy during the summer.

  • Major restaurants, landmarks, colleges and technology companies are situated around this area.


Google Map route of the Top 5 Stations in Proximity to one another

  • After joining income data to the busiest stations and filtering for those who made $70,000 and above, we found:

    • Grand Central - 42 Street to be the station with the highest income.

Busiest Stations by Median Household Income

Our Recommendation:

From our analysis we recommend that WomenTechWomenYes deploy street teams on Wednesdays during peak hours to 34 ST Penn Station and 42nd Grand Central to best target their appropriate audience.

Conclusion:

I would like to thank the Metis team and my classmates for their thoughtful questions and feedback. As we continue with future projects, I hope to incorporate those lessons. Our slides can be found below.

Analysis for WomenTechWomenYes Annual Gala Aisulu, Alex, Dotun and Jeremy Metis 2019

