Starbucks Capstone Challenge

Patrick Bloomingdale
Aug 21, 2021 · 8 min read

Introduction

This blog post is my Capstone Challenge for Udacity’s Data Scientist Nanodegree. The detailed analysis, with all required code and data, can be found in my Github repository.

We were given three data sets that contain simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free).

Data Sets

The data is contained in three files:

  • portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) — offer id
  • offer_type (string) — type of offer, i.e., BOGO, discount, or informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings) — channels the offer was sent through (web, email, mobile, social)

profile.json

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript.json

  • event (str) — record description (i.e., transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since the start of the test; the data begins at time t=0
  • value (dict of strings) — either an offer id or a transaction amount, depending on the record

Project Overview

For this project, the task was to combine transaction, demographic, and offer data to determine which demographic groups respond best to which offer type. I used data science techniques, including machine learning, to analyze the simulated data.

Metrics

A machine learning model will be built that predicts whether or not someone will respond to an offer, using binary output variables for age range, offer type, and channel. The model reports the precision, recall, f1-score, and support for each output variable (category) of the data set. I selected these metrics because the classes are imbalanced, so precision and recall are more informative than accuracy alone. Below is a brief description of the three metrics selected.

Precision measures how accurate the model is on its positive predictions: of everything predicted positive, how many are actually positive?

Recall measures how many of the actual positives the model captures by labeling them positive (true positives).

F1-Score is the harmonic mean of precision and recall. It seeks a balance between the two and is especially useful when there is an uneven class distribution (a large number of actual negatives).
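
In terms of true positives (TP), false positives (FP), and false negatives (FN), the three metrics are:

precision = TP / (TP + FP)
recall = TP / (TP + FN)
F1 = 2 · (precision · recall) / (precision + recall)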

Cleaning the Data Sets

The data sets were cleaned and additional columns were added before the EDA and machine learning steps.

Clean Portfolio Data Set

  • rename the columns duration to days_duration and id to id_offer
  • use one-hot encoding to create dummy columns for channels and offer_type
  • drop the original channels column
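
A minimal pandas sketch of these steps (the file path is an assumption; the actual code is in the Github repository):

import pandas as pd

# Load the offer metadata (path is an assumption)
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)

# Rename columns
portfolio = portfolio.rename(columns={'duration': 'days_duration', 'id': 'id_offer'})

# One-hot encode the channels list into one dummy column per channel
for channel in ['web', 'email', 'mobile', 'social']:
    portfolio[channel] = portfolio['channels'].apply(lambda x: int(channel in x))

# One-hot encode offer_type, then drop the original channels column
portfolio = portfolio.join(pd.get_dummies(portfolio['offer_type']))
portfolio = portfolio.drop(columns=['channels'])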

Clean Profile Data Set

  • rename column id to id_person
  • create dt_membership as type date, then drop became_member_on
  • create yr_membership column as type int
  • drop records where age = 118
  • drop records where gender = ‘O’
  • create gender_binary column
  • drop records with missing income
  • create age range column (age_range)
  • create dummy variables from age_range column
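
A sketch of the profile cleaning, with the same caveat (the age_range bin edges and the gender_binary encoding are my assumptions; the repository has the exact choices):

profile = pd.read_json('data/profile.json', orient='records', lines=True)
profile = profile.rename(columns={'id': 'id_person'})

# Parse became_member_on (an int like 20170823) into a date, then drop it
profile['dt_membership'] = pd.to_datetime(profile['became_member_on'], format='%Y%m%d')
profile['yr_membership'] = profile['dt_membership'].dt.year
profile = profile.drop(columns=['became_member_on'])

# Drop the placeholder age, 'O' genders, and records with missing income
profile = profile[(profile['age'] != 118) & (profile['gender'] != 'O')]
profile = profile.dropna(subset=['income'])

# Binary gender flag (assumed mapping) and age ranges matching the EDA buckets
profile['gender_binary'] = (profile['gender'] == 'F').astype(int)
profile['age_range'] = pd.cut(profile['age'], bins=[17, 24, 44, 64, 84, 120],
                              labels=['18-24', '25-44', '45-64', '65-84', '85+'])
profile = profile.join(pd.get_dummies(profile['age_range']))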

Clean Transcript Data Set

  • rename column person to id_person
  • create new column days_time (time converted from hours to days)
  • drop records whose customer id is not part of the profile data set
  • create column id_offer (extracted from the value dict) to join on later
  • reorder columns so that id_person and id_offer are the first two columns
  • create dummy columns from event
  • create a separate DataFrame with the transaction data
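
And a sketch of the transcript cleaning (the keys inside the value dict, 'offer id', 'offer_id', and 'amount', follow the data set's conventions):

transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
transcript = transcript.rename(columns={'person': 'id_person'})

# Convert the hour counter to days
transcript['days_time'] = transcript['time'] / 24

# Keep only customers that survived the profile cleaning
transcript = transcript[transcript['id_person'].isin(profile['id_person'])]

# Pull the offer id out of the value dict for later joins
transcript['id_offer'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))

# Put id_person and id_offer first
transcript = transcript[['id_person', 'id_offer'] +
                        [c for c in transcript.columns if c not in ('id_person', 'id_offer')]]

# Dummy columns from event, plus a separate transactions DataFrame
transcript = transcript.join(pd.get_dummies(transcript['event']))
transactions = transcript[transcript['event'] == 'transaction'].copy()
transactions['amount'] = transactions['value'].apply(lambda v: v.get('amount'))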

Exploratory Data Analysis (EDA)

What was the average age of customer by gender?

Starbucks customers by age and gender

Female (57 years old) customers had a slightly higher average age than male (52 years old) customers.

What was the average age of customer by the year they signed up for the Starbucks app?

Starbucks Customers by Age and Membership Year

The year with the highest average age, 56, was 2016, followed by:

  • 2015 and 2017 with an average age of 54,
  • 2018 average age of 53,
  • 2013 average age of 52, and
  • 2014 average age of 51

What were the number of customers by age range?

Number of people by age range

The age range with the highest number of people was 45–64, followed by:

  • 65–84,
  • 25–44,
  • 18–24, and
  • 85+

What was the total number of offers of each type by age range?

Total Amount of Offer Types by Age Range

The age range that received the most offers was 25–44, followed by:

  • 18–24,
  • 65–84,
  • 85+, and
  • 45–64

Machine Learning

Machine Learning Pipeline Preparation

  1. Load data and define feature and target variables X and y (a minimal sketch follows the variable lists below)

X Variable

  • Event (Offer Received, Offer Viewed, Offer Completed)

y Variables

  • 18–24 (1 or 0)
  • 25–44 (1 or 0)
  • 45–64 (1 or 0)
  • 65–84 (1 or 0)
  • 85+ (1 or 0)
  • bogo (1 or 0)
  • discount (1 or 0)
  • informational (1 or 0)
  • email (1 or 0)
  • mobile (1 or 0)
  • social (1 or 0)
  • web (1 or 0)
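
A minimal sketch of step 1, assuming the cleaned and merged data lives in a DataFrame called df carrying the dummy columns above (both names are assumptions for illustration):

# X is the raw event text; y is the matrix of binary indicator columns
X = df['event'].values
y_cols = ['18-24', '25-44', '45-64', '65-84', '85+',
          'bogo', 'discount', 'informational',
          'email', 'mobile', 'social', 'web']
y = df[y_cols].values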

2a. Write a tokenization function to process your text data
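
One reasonable version of the tokenize function the pipeline refers to, assuming an NLTK-based approach (the actual implementation is in the notebook):

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download(['punkt', 'wordnet'], quiet=True)

def tokenize(text):
    # Lowercase, strip punctuation, tokenize, and lemmatize
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    return [WordNetLemmatizer().lemmatize(tok) for tok in word_tokenize(text)]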

2b. Additional functions
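
One of these is the custom scorer, f1_scorer_eval, that the grid search in step 6 refers to. A minimal version, assuming the outputs are binary indicator columns:

from sklearn.metrics import f1_score

def f1_scorer_eval(y_true, y_pred):
    # Average the F1 score across all output categories
    return f1_score(y_true, y_pred, average='macro')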

3. Build a machine learning pipeline
This machine learning pipeline takes the event column as input and outputs results for the categories in the data set.

Three estimators were used to create the pipeline:

  • CountVectorizer
  • TfidfTransformer
  • RandomForestClassifier (wrapped in a MultiOutputClassifier to handle the multiple binary outputs)
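
In code, the pipeline looks like this (the step names vect, tfidf, and clf, and the MultiOutputClassifier wrapper, match the parameter listing below):

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(estimator=RandomForestClassifier())),
])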

The pipeline parameters included:

dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf',
'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype',
'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df',
'vect__max_features', 'vect__min_df', 'vect__ngram_range',
'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents',
'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary',
'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf',
'clf__estimator__bootstrap', 'clf__estimator__ccp_alpha',
'clf__estimator__class_weight', 'clf__estimator__criterion',
'clf__estimator__max_depth', 'clf__estimator__max_features',
'clf__estimator__max_leaf_nodes', 'clf__estimator__max_samples',
'clf__estimator__min_impurity_decrease', 'clf__estimator__min_impurity_split',
'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split',
'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators',
'clf__estimator__n_jobs', 'clf__estimator__oob_score',
'clf__estimator__random_state', 'clf__estimator__verbose',
'clf__estimator__warm_start', 'clf__estimator', 'clf__n_jobs'])

4. Train pipeline

  • Split data into train and test sets
  • Train pipeline
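
In sketch form (the split proportions are sklearn's defaults; the notebook may use different ones):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipeline.fit(X_train, y_train)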

5. Test model

Report the precision, recall, f1-score, and support for each output category of the dataset.
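
A sketch of the per-category report, reusing the y_cols list from step 1:

from sklearn.metrics import classification_report

y_pred = pipeline.predict(X_test)
for i, col in enumerate(y_cols):
    print(col)
    print(classification_report(y_test[:, i], y_pred[:, i]))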

Model Evaluation, Validation, and Justification

For the age range indicators, the highest F1 score was 85+ (.67), followed by:

  • 18–24 (.66)
  • 25–44 (.56)
  • 65–84 (.53)
  • 45–64 (.41)

For the offer type indicators, the highest F1 score was informational (.60), followed by:

  • discount (.55)
  • bogo (.41)

For the channel indicators, the highest F1 score was email (1.00), followed by:

  • mobile (.64)
  • web (.57)
  • social (.48)

Email's perfect score is unsurprising: every offer in the portfolio is delivered through the email channel, so it is trivial to predict.

6. Improved model

Use grid search to find better parameters.

Using GridSearchCV, the parameters included:

'cv': None,
'error_score': nan,
'estimator__memory': None,
'estimator__steps': [('vect',
CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>)),
('tfidf', TfidfTransformer()),
('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
'estimator__verbose': False,
'estimator__vect': CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>),
'estimator__tfidf': TfidfTransformer(),
'estimator__clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
'estimator__vect__analyzer': 'word',
'estimator__vect__binary': False,
'estimator__vect__decode_error': 'strict',
'estimator__vect__dtype': numpy.int64,
'estimator__vect__encoding': 'utf-8',
'estimator__vect__input': 'content',
'estimator__vect__lowercase': True,
'estimator__vect__max_df': 1.0,
'estimator__vect__max_features': None,
'estimator__vect__min_df': 1,
'estimator__vect__ngram_range': (1, 1),
'estimator__vect__preprocessor': None,
'estimator__vect__stop_words': None,
'estimator__vect__strip_accents': None,
'estimator__vect__token_pattern': '(?u)\\b\\w\\w+\\b',
'estimator__vect__tokenizer': <function __main__.tokenize(text)>,
'estimator__vect__vocabulary': None,
'estimator__tfidf__norm': 'l2',
'estimator__tfidf__smooth_idf': True,
'estimator__tfidf__sublinear_tf': False,
'estimator__tfidf__use_idf': True,
'estimator__clf__estimator__bootstrap': True,
'estimator__clf__estimator__ccp_alpha': 0.0,
'estimator__clf__estimator__class_weight': None,
'estimator__clf__estimator__criterion': 'gini',
'estimator__clf__estimator__max_depth': None,
'estimator__clf__estimator__max_features': 'auto',
'estimator__clf__estimator__max_leaf_nodes': None,
'estimator__clf__estimator__max_samples': None,
'estimator__clf__estimator__min_impurity_decrease': 0.0,
'estimator__clf__estimator__min_impurity_split': None,
'estimator__clf__estimator__min_samples_leaf': 1,
'estimator__clf__estimator__min_samples_split': 2,
'estimator__clf__estimator__min_weight_fraction_leaf': 0.0,
'estimator__clf__estimator__n_estimators': 100,
'estimator__clf__estimator__n_jobs': None,
'estimator__clf__estimator__oob_score': False,
'estimator__clf__estimator__random_state': None,
'estimator__clf__estimator__verbose': 0,
'estimator__clf__estimator__warm_start': False,
'estimator__clf__estimator': RandomForestClassifier(),
'estimator__clf__n_jobs': None,
'estimator': Pipeline(steps=[('vect',
CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>)),
('tfidf', TfidfTransformer()),
('clf',
MultiOutputClassifier(estimator=RandomForestClassifier()))]),
'n_jobs': None,
'param_grid': {'vect__max_df': (0.75, 1.0),
'clf__estimator__n_estimators': [10, 20],
'clf__estimator__min_samples_split': [2, 5]},
'pre_dispatch': '2*n_jobs',
'refit': True,
'return_train_score': False,
'scoring': make_scorer(f1_scorer_eval),
'verbose': 7
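
Pulling the relevant pieces out of that listing, the grid search was built roughly like this (param_grid, scoring, and verbose are taken directly from the dump above):

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

parameters = {
    'vect__max_df': (0.75, 1.0),
    'clf__estimator__n_estimators': [10, 20],
    'clf__estimator__min_samples_split': [2, 5],
}
cv = GridSearchCV(pipeline, param_grid=parameters,
                  scoring=make_scorer(f1_scorer_eval), verbose=7)
cv.fit(X_train, y_train)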

7. Test improved model

Show the precision, recall, f1-score, and support of the tuned model.

Model Evaluation, Validation, and Justification

There wasn’t much of a difference between the first machine learning model and the improved one; both performed reasonably well, even after using GridSearchCV to search for better parameters (see the prior section for the full parameter listing).

K-Fold Cross Validation

My Udacity reviewer suggested using k-fold cross validation to demonstrate that my optimized model is robust. I ran 5-fold cross validation on the same data used for my machine learning model, and the validation performance was stable. This shows that the model is robust against small perturbations in the training data. Below is the Python code used to implement the k-fold cross validation, adapted from an online article at https://www.askpython.com/python/examples/k-fold-cross-validation.
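
A sketch of that implementation (the exact version, with its printed fold scores, is in the notebook):

from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer

# 5-fold cross validation on the same data used for the model
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv.best_estimator_, X, y,
                         scoring=make_scorer(f1_scorer_eval), cv=kf)
print('Fold F1 scores:', scores)
print('Mean F1: {:.3f}'.format(scores.mean()))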

Reflection and Improvement

As mentioned above, there wasn’t a noticeable difference between the two machine learning models. There are more options to try when creating and improving a model, such as adding TF-IDF variations or a StartingVerbExtractor feature to the pipeline. Example code for improving on the first two machine learning models is located at the end of the Jupyter Notebook in my Github repository.

Summary

  • Performed exploratory data analysis (EDA) on data that mimics customer behavior on the Starbucks rewards mobile app.
  • Used preprocessing techniques learned through Udacity’s Data Scientist Nanodegree program to prepare the data for machine learning models.
  • Created multiple machine learning models and used k-fold cross validation to demonstrate that the optimized model was robust.
