Starbucks Capstone Challenge

Introduction
This blog post is the Capstone Challenge for Udacity’s Data Scientist Nanodegree. The detailed analysis, with all required code and data, can be found in my GitHub repository.
We were given three data sets that contain simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free).
Data Sets
The data is contained in three files:
- portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
- profile.json — demographic data for each customer
- transcript.json — records for transactions, offers received, offers viewed, and offers completed
Here is the schema and explanation of each variable in the files:
portfolio.json
- id (string) — offer id
- offer_type (string) — type of offer, i.e., BOGO, discount, or informational
- difficulty (int) — minimum required spend to complete an offer
- reward (int) — reward given for completing an offer
- duration (int) — time for offer to be open, in days
- channels (list of strings)
profile.json
- age (int) — age of the customer
- became_member_on (int) — date when customer created an app account
- gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
- id (str) — customer id
- income (float) — customer’s income
transcript.json
- event (str) — record description (i.e., transaction, offer received, offer viewed, etc.)
- person (str) — customer id
- time (int) — time in hours since start of test. The data begins at time t=0
- value (dict of strings) — either an offer id or transaction amount, depending on the record
Project Overview
For this project, the task was to combine transaction, demographic, and offer data to determine which demographic groups responded best to which offer type. I used data science techniques, including machine learning, to analyze the simulated data.
Metrics
A machine learning model will be built that predicts whether or not someone will respond to an offer, using binary output variables for age range, offer type, and channel. The model will report the precision, recall, F1-score, and support for each output variable (category) of the data set. I selected these metrics because they are well suited to data where the classes are imbalanced. Below is a brief description of the three metrics selected.
Precision measures, out of all instances the model predicted as positive, how many are actually positive.
Recall measures how many of the actual positives the model captures by labeling them as positive (true positives).
F1-score is the harmonic mean of precision and recall. It seeks a balance between the two and is useful when there is an uneven class distribution (e.g., a large number of actual negatives).
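As a quick illustration of how these three metrics behave, here is a small sketch on a toy set of labels (the values are made up purely for illustration):

```python
# Illustrative only: toy labels to show how the three metrics relate.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# precision = TP / (TP + FP), recall = TP / (TP + FN),
# F1 = 2 * precision * recall / (precision + recall)
print(precision_score(y_true, y_pred))  # 0.75 -> 3 of the 4 predicted positives are correct
print(recall_score(y_true, y_pred))     # 0.75 -> 3 of the 4 actual positives were found
print(f1_score(y_true, y_pred))         # 0.75 -> harmonic mean of the two
```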
Cleaning the Data Sets
The data sets were cleaned and additional columns were added before the EDA and machine learning took place.
Clean Portfolio Data Set
- rename columns duration to days_duration, and id to id_offer
- use one-hot encoding to create dummy columns for channels and offer_type
- drop channels column
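A minimal pandas sketch of these steps (the file path and read options are assumptions, not taken from the notebook):

```python
import pandas as pd

# Read the raw portfolio data (path/read options assumed)
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)

# Rename columns so they are unambiguous when joining later
portfolio = portfolio.rename(columns={'duration': 'days_duration', 'id': 'id_offer'})

# One-hot encode the channels list and the offer_type column
channel_dummies = pd.get_dummies(portfolio['channels'].explode()).groupby(level=0).max()
offer_dummies = pd.get_dummies(portfolio['offer_type'])
portfolio = pd.concat([portfolio, channel_dummies, offer_dummies], axis=1)

# The raw channels column is no longer needed
portfolio = portfolio.drop(columns=['channels'])
```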
Clean Profile Data Set
- rename column id to id_person
- create dt_membership as type date, then drop became_member_on
- create yr_membership column as type int
- drop records where age = 118
- drop records where gender = ‘O’
- create gender_binary column
- drop records with missing income
- create age range column (age_range)
- create dummy variables from age_range column
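A rough sketch of these steps in pandas (the age-range bins follow the buckets used in the EDA below; the gender_binary mapping and file path are assumptions):

```python
import pandas as pd

profile = pd.read_json('data/profile.json', orient='records', lines=True)
profile = profile.rename(columns={'id': 'id_person'})

# Membership date and year, then drop the raw integer column
profile['dt_membership'] = pd.to_datetime(profile['became_member_on'].astype(str), format='%Y%m%d')
profile['yr_membership'] = profile['dt_membership'].dt.year
profile = profile.drop(columns=['became_member_on'])

# Drop placeholder ages, 'O' genders, and records with missing income
profile = profile[(profile['age'] != 118) & (profile['gender'] != 'O')]
profile = profile.dropna(subset=['income'])

# Binary gender flag (mapping assumed) and age ranges matching the EDA buckets
profile['gender_binary'] = (profile['gender'] == 'F').astype(int)
bins = [17, 24, 44, 64, 84, 200]
labels = ['18-24', '25-44', '45-64', '65-84', '85+']
profile['age_range'] = pd.cut(profile['age'], bins=bins, labels=labels)
profile = pd.concat([profile, pd.get_dummies(profile['age_range'])], axis=1)
```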
Clean Transcript Data Set
- rename column person to id_person
- create new column days_time
- remove an id if it is not part of the profile data set
- create column id_offer to join later
- reorder columns to have id_person and id_offer the first two columns
- create dummy columns from event
- create a DataFrame with the transaction data
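And a sketch of the transcript cleaning; it relies on the cleaned profile frame from the previous step, and the key names inside the value dict ('offer id', 'offer_id', 'amount') are assumed from the raw data:

```python
import pandas as pd

transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
transcript = transcript.rename(columns={'person': 'id_person'})

# Convert hours to days
transcript['days_time'] = transcript['time'] / 24

# Keep only customers that survived the profile cleaning
transcript = transcript[transcript['id_person'].isin(profile['id_person'])]

# Pull the offer id out of the value dict so the tables can be joined on id_offer
transcript['id_offer'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))

# Put id_person and id_offer first, then add dummy columns for the event type
cols = ['id_person', 'id_offer']
transcript = transcript[cols + [c for c in transcript.columns if c not in cols]]
transcript = pd.concat([transcript, pd.get_dummies(transcript['event'])], axis=1)

# Separate DataFrame holding just the transaction records and their amounts
transactions = transcript[transcript['event'] == 'transaction'].copy()
transactions['amount'] = transactions['value'].apply(lambda v: v.get('amount'))
```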
Exploratory Data Analysis (EDA)
What was the average age of customers by gender?

Female customers had a higher average age (57 years old) than male customers (52 years old).
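For reference, an average like this comes from a simple groupby on the cleaned profile frame (a sketch, assuming the frame built above):

```python
# Average customer age by gender, rounded to whole years
print(profile.groupby('gender')['age'].mean().round(0))
```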
What was the average age of customers by the year they signed up for the Starbucks app?

The year with the highest average age (56) was 2016, followed by:
- 2015 and 2017 with an average age of 54,
- 2018 average age of 53,
- 2013 average age of 52, and
- 2014 average age of 51
What was the number of customers in each age range?

The age range with the highest number of customers was 45–64, followed by:
- 65–84,
- 25–44,
- 18–24, and
- 85+
What was the total number of offers, by type, for each age range?

The age range that received the most offers was 25–44, followed by:
- 18–24,
- 65–84,
- 85+, and
- 45–64
Machine Learning
Machine Learning Pipeline Preparation
- Load data and define feature and target variables X and y
X Variable
- Event (Offer Received, Offer Viewed, Offer Completed)
y Variables
- 18–24 (1 or 0)
- 25–44 (1 or 0)
- 45–64 (1 or 0)
- 65–84 (1 or 0)
- 85+ (1 or 0)
- bogo (1 or 0)
- discount (1 or 0)
- informational (1 or 0)
- email (1 or 0)
- mobile (1 or 0)
- social (1 or 0)
- web (1 or 0)
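A sketch of how X and y can be assembled, assuming the cleaned frames above have already been merged into a single DataFrame named df containing the dummy columns listed (the frame name is illustrative):

```python
# The event text is the single input feature; the twelve dummy columns form the
# multi-output target.
X = df['event']
y = df[['18-24', '25-44', '45-64', '65-84', '85+',
        'bogo', 'discount', 'informational',
        'email', 'mobile', 'social', 'web']]
```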
2a. Write a tokenization function to process your text data
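One possible implementation of the tokenize() function, in the spirit of the course projects (the exact function used in the notebook may differ):

```python
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

nltk.download(['punkt', 'wordnet'], quiet=True)

def tokenize(text):
    """Normalize, tokenize, and lemmatize a piece of text."""
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok).strip() for tok in tokens]
```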

2b. Additional functions
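One of the helper functions defined here, f1_scorer_eval, is referenced later by GridSearchCV. Its exact implementation is not shown in this post, so the version below is only a plausible sketch that aggregates F1 across the output columns:

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_scorer_eval(y_true, y_pred):
    """Aggregate F1 across the multi-output target columns (implementation assumed)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = [f1_score(y_true[:, i], y_pred[:, i]) for i in range(y_true.shape[1])]
    return np.median(scores)
```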


3. Build a machine learning pipeline
This machine learning pipeline takes the event column as input and outputs results for each of the target categories in the data set.
Three estimators were used to create the pipeline:
- CountVectorizer
- TfidfTransformer
- RandomForestClassifier (wrapped in a MultiOutputClassifier so that one model predicts all of the target columns)
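A sketch of how the pipeline can be built; its structure follows the parameter keys printed below:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))
])

# Listing the tunable parameters produces the dict_keys output shown below
print(pipeline.get_params().keys())
```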

The pipeline parameters included:
dict_keys(['memory', 'steps', 'verbose', 'vect', 'tfidf', 'clf', 'vect__analyzer', 'vect__binary', 'vect__decode_error', 'vect__dtype', 'vect__encoding', 'vect__input', 'vect__lowercase', 'vect__max_df', 'vect__max_features', 'vect__min_df', 'vect__ngram_range', 'vect__preprocessor', 'vect__stop_words', 'vect__strip_accents', 'vect__token_pattern', 'vect__tokenizer', 'vect__vocabulary', 'tfidf__norm', 'tfidf__smooth_idf', 'tfidf__sublinear_tf', 'tfidf__use_idf', 'clf__estimator__bootstrap', 'clf__estimator__ccp_alpha', 'clf__estimator__class_weight', 'clf__estimator__criterion', 'clf__estimator__max_depth', 'clf__estimator__max_features', 'clf__estimator__max_leaf_nodes', 'clf__estimator__max_samples', 'clf__estimator__min_impurity_decrease', 'clf__estimator__min_impurity_split', 'clf__estimator__min_samples_leaf', 'clf__estimator__min_samples_split', 'clf__estimator__min_weight_fraction_leaf', 'clf__estimator__n_estimators', 'clf__estimator__n_jobs', 'clf__estimator__oob_score', 'clf__estimator__random_state', 'clf__estimator__verbose', 'clf__estimator__warm_start', 'clf__estimator', 'clf__n_jobs'])
4. Train pipeline
- Split data into train and test sets
- Train pipeline
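A sketch of this step (the split ratio and random state are assumptions):

```python
from sklearn.model_selection import train_test_split

# Hold out a test set, then fit the pipeline on the training portion
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
```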

5. Test model
Report the precision, recall, f1-score, and support for each output category of the dataset.
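For example, the per-category report can be produced with scikit-learn's classification_report (a sketch, assuming the objects defined above):

```python
from sklearn.metrics import classification_report

y_pred = pipeline.predict(X_test)
for i, col in enumerate(y_test.columns):
    print(col)
    print(classification_report(y_test[col], y_pred[:, i]))
```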

Model Evaluation, Validation, and Justification
For the age range indicators, the highest F1-score was 85+ (.67), followed by:
- 18–24 (.66)
- 25–44 (.56)
- 65–84 (.53)
- 45–64 (.41)
For the offer type indicators, the highest F1-score was informational (.60), followed by:
- discount (.55)
- bogo (.41)
For the channel indicators, the highest F1-score was email (1.00), followed by:
- mobile (.64)
- web (.57)
- social (.48)
6. Improved model
Use grid search to find better parameters.
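A sketch of the grid search; the parameter grid, scorer, and verbosity below are taken from the parameter dump that follows:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

parameters = {
    'vect__max_df': (0.75, 1.0),
    'clf__estimator__n_estimators': [10, 20],
    'clf__estimator__min_samples_split': [2, 5],
}

cv = GridSearchCV(pipeline, param_grid=parameters,
                  scoring=make_scorer(f1_scorer_eval), verbose=7)
cv.fit(X_train, y_train)
```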

Using GridSearchCV the parameters included:
'cv': None,
'error_score': nan,
'estimator__memory': None,
'estimator__steps': [('vect',
CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>)),
('tfidf', TfidfTransformer()),
('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
'estimator__verbose': False,
'estimator__vect': CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>),
'estimator__tfidf': TfidfTransformer(),
'estimator__clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
'estimator__vect__analyzer': 'word',
'estimator__vect__binary': False,
'estimator__vect__decode_error': 'strict',
'estimator__vect__dtype': numpy.int64,
'estimator__vect__encoding': 'utf-8',
'estimator__vect__input': 'content',
'estimator__vect__lowercase': True,
'estimator__vect__max_df': 1.0,
'estimator__vect__max_features': None,
'estimator__vect__min_df': 1,
'estimator__vect__ngram_range': (1, 1),
'estimator__vect__preprocessor': None,
'estimator__vect__stop_words': None,
'estimator__vect__strip_accents': None,
'estimator__vect__token_pattern': '(?u)\\b\\w\\w+\\b',
'estimator__vect__tokenizer': <function __main__.tokenize(text)>,
'estimator__vect__vocabulary': None,
'estimator__tfidf__norm': 'l2',
'estimator__tfidf__smooth_idf': True,
'estimator__tfidf__sublinear_tf': False,
'estimator__tfidf__use_idf': True,
'estimator__clf__estimator__bootstrap': True,
'estimator__clf__estimator__ccp_alpha': 0.0,
'estimator__clf__estimator__class_weight': None,
'estimator__clf__estimator__criterion': 'gini',
'estimator__clf__estimator__max_depth': None,
'estimator__clf__estimator__max_features': 'auto',
'estimator__clf__estimator__max_leaf_nodes': None,
'estimator__clf__estimator__max_samples': None,
'estimator__clf__estimator__min_impurity_decrease': 0.0,
'estimator__clf__estimator__min_impurity_split': None,
'estimator__clf__estimator__min_samples_leaf': 1,
'estimator__clf__estimator__min_samples_split': 2,
'estimator__clf__estimator__min_weight_fraction_leaf': 0.0,
'estimator__clf__estimator__n_estimators': 100,
'estimator__clf__estimator__n_jobs': None,
'estimator__clf__estimator__oob_score': False,
'estimator__clf__estimator__random_state': None,
'estimator__clf__estimator__verbose': 0,
'estimator__clf__estimator__warm_start': False,
'estimator__clf__estimator': RandomForestClassifier(),
'estimator__clf__n_jobs': None,
'estimator': Pipeline(steps=[('vect',
CountVectorizer(tokenizer=<function tokenize at 0x000001FA7E8BD820>)),
('tfidf', TfidfTransformer()),
('clf',
MultiOutputClassifier(estimator=RandomForestClassifier()))]),
'n_jobs': None,
'param_grid': {'vect__max_df': (0.75, 1.0),
'clf__estimator__n_estimators': [10, 20],
'clf__estimator__min_samples_split': [2, 5]},
'pre_dispatch': '2*n_jobs',
'refit': True,
'return_train_score': False,
'scoring': make_scorer(f1_scorer_eval),
'verbose': 7

7. Test improved model
Show the precision, recall, f1-score, and support of the tuned model.

Model Evaluation, Validation, and Justification
There wasn’t much of a difference between the first machine learning model and the improved one. Both models performed reasonably well, even after using GridSearchCV to search for better parameters (see the prior section for the parameter grid).

K-Fold Cross Validation
My Udacity reviewer suggested using k-fold cross-validation to demonstrate that my optimized model is robust. I implemented 5-fold cross-validation on the same data used for my machine learning model, and the validation performance was stable across folds. This shows that the model is robust to small perturbations in the training data. Below is the Python code and output used to implement the k-fold cross-validation; the code was adapted from an online article at https://www.askpython.com/python/examples/k-fold-cross-validation.
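A sketch along those lines, using scikit-learn's KFold and cross_val_score with the tuned pipeline from the grid search (the exact code and fold settings in the notebook may differ):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import make_scorer

# Score the tuned pipeline on 5 folds of the full data set
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv.best_estimator_, X, y,
                         cv=kf, scoring=make_scorer(f1_scorer_eval))
print('Fold scores:', np.round(scores, 3))
print('Mean score :', scores.mean())
```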




Reflection and Improvement
As mentioned above, there wasn’t a noticeable difference between the two machine learning models. There are more options to try when creating and improving a machine learning model (e.g., the use of TF-IDF together with a custom StartingVerbExtractor transformer). An example of the code to use in the future to improve on the first two machine learning models is located at the end of the Jupyter Notebook in my GitHub repository.
Summary
- Performed Exploratory Data Analysis on data that mimics customer behavior on the Starbucks rewards mobile app.
- Used preprocessing techniques learned through Udacity’s Data Scientist Nanodegree program to create machine learning models.
- Created multiple machine learning models and used k-fold cross validation to demonstrate my optimized model was robust.