Starbucks Capstone Challenge

Introduction
This blog post is the Capstone Challenge for Udacity’s Data Scientist Nanodegree. The detailed analysis, with all required code and data, can be found in my GitHub repository.
We were given three data sets that contain simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free).
Data Sets
The data is contained in three files:
- portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
- profile.json — demographic data for each customer
- transcript.json — records for transactions, offers received, offers viewed, and offers completed
Here is the schema and explanation of each variable in the files:
portfolio.json
- id (string) — offer id
- offer_type (string) — type of offer, i.e., BOGO, discount, or informational
- difficulty (int) — minimum required spend to complete an offer
- reward (int) — reward given for completing an offer
- duration (int) — time for offer to be open, in days
- channels (list of strings)
profile.json
- age (int) — age of the customer
- became_member_on (int) — date when customer created an app account
- gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
- id (str) — customer id
- income (float) — customer’s income
transcript.json
- event (str) — record description (i.e., transaction, offer received, offer viewed, etc.)
- person (str) — customer id
- time (int) — time in hours since start of test. The data begins at time t=0
- value (dict of strings) — either an offer id or transaction amount, depending on the record
Project Overview
For this project, the task was to combine transaction, demographic, and offer data to determine which demographic groups responded best to which offer type. I used data science techniques, including machine learning, to analyze the simulated data.
Metrics
A machine learning model will be built that predicts whether or not someone will respond to an offer, using binary output variables for age range, offer type, and channel. The model will report the precision, recall, F1-score, and support for each output variable (category) of the data set. I selected these metrics because they are well suited to data where the classes are imbalanced. Below is a brief description of the three metrics selected.
Precision measures, out of all instances the model predicted as positive, how many are actually positive.
Recall measures how many of the actual positives the model captures by labeling them as positive (true positives).
F1-score is the harmonic mean of precision and recall. It seeks a balance between the two and is useful when there is an uneven class distribution (e.g., a large number of actual negatives).
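As a quick illustration of how these three metrics behave, here is a small sketch on a toy set of labels (the values are made up purely for illustration):

```python
# Illustrative only: toy labels to show how the three metrics relate.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# precision = TP / (TP + FP), recall = TP / (TP + FN),
# F1 = 2 * precision * recall / (precision + recall)
print(precision_score(y_true, y_pred))  # 0.75 -> 3 of the 4 predicted positives are correct
print(recall_score(y_true, y_pred))     # 0.75 -> 3 of the 4 actual positives were found
print(f1_score(y_true, y_pred))         # 0.75 -> harmonic mean of the two
```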
Cleaning the Data Sets
The data sets were cleaned and additional columns were added before the EDA and machine learning took place.
Clean Portfolio Data Set
- rename columns duration to days_duration, and id to id_offer
- use one-hot encoding to create dummy columns for channels and offer_type
- drop channels column
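A minimal pandas sketch of these steps (the file path and read options are assumptions, not taken from the notebook):

```python
import pandas as pd

# Read the raw portfolio data (path/read options assumed)
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)

# Rename columns so they are unambiguous when joining later
portfolio = portfolio.rename(columns={'duration': 'days_duration', 'id': 'id_offer'})

# One-hot encode the channels list and the offer_type column
channel_dummies = pd.get_dummies(portfolio['channels'].explode()).groupby(level=0).max()
offer_dummies = pd.get_dummies(portfolio['offer_type'])
portfolio = pd.concat([portfolio, channel_dummies, offer_dummies], axis=1)

# The raw channels column is no longer needed
portfolio = portfolio.drop(columns=['channels'])
```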
Clean Profile Data Set
- rename column id to id_person
- create dt_membership as type date, then drop became_member_on
- create yr_membership column as type int
- drop records where age = 118
- drop records where gender = ‘O’
- create gender_binary column
- drop records with missing income
- create age range column (age_range)
- create dummy variables from age_range column
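A rough sketch of these steps in pandas (the age-range bins follow the buckets used in the EDA below; the gender_binary mapping and file path are assumptions):

```python
import pandas as pd

profile = pd.read_json('data/profile.json', orient='records', lines=True)
profile = profile.rename(columns={'id': 'id_person'})

# Membership date and year, then drop the raw integer column
profile['dt_membership'] = pd.to_datetime(profile['became_member_on'].astype(str), format='%Y%m%d')
profile['yr_membership'] = profile['dt_membership'].dt.year
profile = profile.drop(columns=['became_member_on'])

# Drop placeholder ages, 'O' genders, and records with missing income
profile = profile[(profile['age'] != 118) & (profile['gender'] != 'O')]
profile = profile.dropna(subset=['income'])

# Binary gender flag (mapping assumed) and age ranges matching the EDA buckets
profile['gender_binary'] = (profile['gender'] == 'F').astype(int)
bins = [17, 24, 44, 64, 84, 200]
labels = ['18-24', '25-44', '45-64', '65-84', '85+']
profile['age_range'] = pd.cut(profile['age'], bins=bins, labels=labels)
profile = pd.concat([profile, pd.get_dummies(profile['age_range'])], axis=1)
```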
Clean Transcript Data Set
- rename column person to id_person
- create new column days_time
- remove an id if it is not part of the profile data set
- create column id_offer to join later
- reorder columns to have id_person and id_offer the first two columns
- create dummy columns from event
- create a DataFrame with the transaction data
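And a sketch of the transcript cleaning; it relies on the cleaned profile frame from the previous step, and the key names inside the value dict ('offer id', 'offer_id', 'amount') are assumed from the raw data:

```python
import pandas as pd

transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
transcript = transcript.rename(columns={'person': 'id_person'})

# Convert hours to days
transcript['days_time'] = transcript['time'] / 24

# Keep only customers that survived the profile cleaning
transcript = transcript[transcript['id_person'].isin(profile['id_person'])]

# Pull the offer id out of the value dict so the tables can be joined on id_offer
transcript['id_offer'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))

# Put id_person and id_offer first, then add dummy columns for the event type
cols = ['id_person', 'id_offer']
transcript = transcript[cols + [c for c in transcript.columns if c not in cols]]
transcript = pd.concat([transcript, pd.get_dummies(transcript['event'])], axis=1)

# Separate DataFrame holding just the transaction records and their amounts
transactions = transcript[transcript['event'] == 'transaction'].copy()
transactions['amount'] = transactions['value'].apply(lambda v: v.get('amount'))
```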
Exploratory Data Analysis (EDA)
What was the average age of customers by gender?

Female customers had a higher average age (57 years old) than male customers (52 years old).
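For reference, an average like this comes from a simple groupby on the cleaned profile frame (a sketch, assuming the frame built above):

```python
# Average customer age by gender, rounded to whole years
print(profile.groupby('gender')['age'].mean().round(0))
```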
What was the average age of customers by the year they signed up for the Starbucks app?

The year with the highest average age (56) was 2016, followed by:
- 2015 and 2017 with an average age of 54,
- 2018 average age of 53,
- 2013 average age of 52, and
- 2014 average age of 51
What was the number of customers in each age range?

The age range with the highest number of customers was 45–64, followed by:
- 65–84,
- 25–44,
- 18–24, and
- 85+
What was the total number of offers, by type, for each age range?

The age range that received the most offers was 25–44, followed by:
- 18–24,
- 65–84,
- 85+, and
- 45–64
Machine Learning
Machine Learning Pipeline Preparation
- Load data and define feature and target variables X and y
X Variable
- Event (Offer Received, Offer Viewed, Offer Completed)
y Variables
- 18–24 (1 or 0)
- 25–44 (1 or 0)
- 45–64 (1 or 0)
- 65–84 (1 or 0)
- 85+ (1 or 0)
- bogo (1 or 0)
- discount (1 or 0)
- informational (1 or 0)
- email (1 or 0)
- mobile (1 or 0)
- social (1 or 0)
- web (1 or 0)
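A sketch of how X and y can be assembled, assuming the cleaned frames above have already been merged into a single DataFrame named df containing the dummy columns listed (the frame name is illustrative):

```python
# The event text is the single input feature; the twelve dummy columns form the
# multi-output target.
X = df['event']
y = df[['18-24', '25-44', '45-64', '65-84', '85+',
        'bogo', 'discount', 'informational',
        'email', 'mobile', 'social', 'web']]
```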
2a. Write a tokenization function to process your text data
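One possible implementation of the tokenize() function, in the spirit of the course projects (the exact function used in the notebook may differ):

```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download(['punkt', 'wordnet'], quiet=True)

def tokenize(text):
    """Normalize, tokenize, and lemmatize a piece of text."""
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok).strip() for tok in tokens]
```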

2b. Additional functions
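One of the helper functions defined here, f1_scorer_eval, is referenced later by GridSearchCV. Its exact implementation is not shown in this post, so the version below is only a plausible sketch that aggregates F1 across the output columns:

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_scorer_eval(y_true, y_pred):
    """Aggregate F1 across the multi-output target columns (implementation assumed)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = [f1_score(y_true[:, i], y_pred[:, i]) for i in range(y_true.shape[1])]
    return np.median(scores)
```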


3. Build a machine learning pipeline
This machine learning pipeline takes the event column as input and outputs results for each of the target categories in the data set.
Three estimators were used to create the pipeline:
- CountVectorizer
- TfidfTransformer
- RandomForestClassifier (wrapped in a MultiOutputClassifier so that one model predicts all of the target columns)
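A sketch of how the pipeline can be built; its structure follows the parameter keys printed below:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
])

# Listing the tunable parameters produces the dict_keys output shown below
print(pipeline.get_params().keys())
```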

The pipeline parameters included:
dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__bootstrap', 'clf__estimator__ccp_alpha', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__max_samples', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_impurity_split', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators', 'clf__estimator__n_jobs', 'clf__estimator__oob_score', 'clf__estimator__random_state', 'clf__estimator__verbose', 'clf__estimator__warm_start', 'clf__estimator', 'clf__n_jobs'])
4. Train pipeline
- Split data into train and test sets
- Train pipeline
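A sketch of this step (the split ratio and random state are assumptions):

```python
from sklearn.model_selection import train_test_split

# Hold out a test set, then fit the pipeline on the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
```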

5. Test model
Report the precision, recall, f1-score, and support for each output category of the dataset.
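For example, the per-category report can be produced with scikit-learn's classification_report (a sketch, assuming the objects defined above):

```python
from sklearn.metrics import classification_report

y_pred = pipeline.predict(X_test)
for i, col in enumerate(y_test.columns):
    print(col)
    print(classification_report(y_test[col], y_pred[:, i]))
```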

Model Evaluation, Validation, and Justification
For the age range indicators, the highest F1-score was 85+ (.67), followed by:
- 18–24 (.66)
- 25–44 (.56)
- 65–84 (.53)
- 45–64 (.41)
For the offer type indicators, the highest F1-score was informational (.60), followed by:
- discount (.55)
- bogo (.41)
For the channel indicators, the highest F1-score was email (1.00), followed by:
- mobile (.64)
- web (.57)
- social (.48)
6. Improved model
Use grid search to find better parameters.
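A sketch of the grid search; the parameter grid, scorer, and verbosity below are taken from the parameter dump that follows:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

parameters = {
    'vect__max_df': (0.75, 1.0),
    'clf__estimator__n_estimators': [10, 20],
    'clf__estimator__min_samples_split': [2, 5],
}

cv = GridSearchCV(pipeline, param_grid=parameters,
                  scoring=make_scorer(f1_scorer_eval), verbose=7)
cv.fit(X_train, y_train)
```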

Using GridSearchCV the parameters included:
'cv': None,
'error_score': nan,
'estimator__memory': None,
'estimator__steps': [('vect',
CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>)),
('tfidf', TfidfTransformer()),
('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
'estimator__verbose': False,
'estimator__vect': CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>),
'estimator__tfidf': TfidfTransformer(),
'estimator__clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
'estimator__vect__analyzer': 'word',
'estimator__vect__binary': False,
'estimator__vect__decode_error': 'strict',
'estimator__vect__dtype': numpy.int64,
'estimator__vect__encoding': 'utf-8',
'estimator__vect__input': 'content',
'estimator__vect__lowercase': True,
'estimator__vect__max_df': 1.0,
'estimator__vect__max_features': None,
'estimator__vect__min_df': 1,
'estimator__vect__ngram_range': (1, 1),
'estimator__vect__preprocessor': None,
'estimator__vect__stop_words': None,
'estimator__vect__strip_accents': None,
'estimator__vect__token_pattern': '(?u)\\b\\w\\w+\\b',
'estimator__vect__tokenizer': <function __main__.tokenize(text)>,
'estimator__vect__vocabulary': None,
'estimator__tfidf__norm': 'l2',
'estimator__tfidf__smooth_idf': True,
'estimator__tfidf__sublinear_tf': False,
'estimator__tfidf__use_idf': True,
'estimator__clf__estimator__bootstrap': True,
'estimator__clf__estimator__ccp_alpha': 0.0,
'estimator__clf__estimator__class_weight': None,
'estimator__clf__estimator__criterion': 'gini',
'estimator__clf__estimator__max_depth': None,
'estimator__clf__estimator__max_features': 'auto',
'estimator__clf__estimator__max_leaf_nodes': None,
'estimator__clf__estimator__max_samples': None,
'estimator__clf__estimator__min_impurity_decrease': 0.0,
'estimator__clf__estimator__min_impurity_split': None,
'estimator__clf__estimator__min_samples_leaf': 1,
'estimator__clf__estimator__min_samples_split': 2,
'estimator__clf__estimator__min_weight_fraction_leaf': 0.0,
'estimator__clf__estimator__n_estimators': 100,
'estimator__clf__estimator__n_jobs': None,
'estimator__clf__estimator__oob_score': False,
'estimator__clf__estimator__random_state': None,
'estimator__clf__estimator__verbose': 0,
'estimator__clf__estimator__warm_start': False,
'estimator__clf__estimator': RandomForestClassifier(),
'estimator__clf__n_jobs': None,
'estimator': Pipeline(steps=[('vect',
CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>)),
('tfidf', TfidfTransformer()),
('clf',
MultiOutputClassifier(estimator=RandomForestClassifier()))]),
'n_jobs': None,
'param_grid': {'vect__max_df': (0.75, 1.0),
'clf__estimator__n_estimators': [10, 20],
'clf__estimator__min_samples_split': [2, 5]},
'pre_dispatch': '2*n_jobs',
'refit': True,
'return_train_score': False,
'scoring': make_scorer(f1_scorer_eval),
'verbose': 7

7. Test improved model
Show the precision, recall, f1-score, and support of the tuned model.

Model Evaluation, Validation, and Justification
There wasn’t much of a difference between the first machine learning model and the improved one. Both models performed reasonably well, even after using GridSearchCV to search for better parameters (see the prior section for the parameter grid).

K-Fold Cross Validation
My Udacity reviewer suggested using k-fold cross-validation to demonstrate that my optimized model is robust. I implemented 5-fold cross-validation on the same data used for my machine learning model, and the validation performance was stable across folds. This shows that the model is robust to small perturbations in the training data. Below is the Python code and output used to implement the k-fold cross-validation; the code was adapted from an online article at https://www.askpython.com/python/examples/k-fold-cross-validation.
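A sketch along those lines, using scikit-learn's KFold and cross_val_score with the tuned pipeline from the grid search (the exact code and fold settings in the notebook may differ):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer

# Score the tuned pipeline on 5 folds of the full data set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv.best_estimator_, X, y,
                         cv=kf, scoring=make_scorer(f1_scorer_eval))
print('Fold scores:', np.round(scores, 3))
print('Mean score :', scores.mean())
```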




Reflection and Improvement
As mentioned above, there wasn’t a noticeable difference between the two machine learning models. There are more options to try when creating and improving a machine learning model (e.g., the use of TF-IDF together with a custom StartingVerbExtractor transformer). An example of the code to use in the future to improve on the first two machine learning models is located at the end of the Jupyter Notebook in my GitHub repository.
Summary
- Performed Exploratory Data Analysis on data that mimics customer behavior on the Starbucks rewards mobile app.
- Used preprocessing techniques learned through Udacity’s Data Scientist Nanodegree program to create machine learning models.
- Created multiple machine learning models and used k-fold cross validation to demonstrate my optimized model was robust.