Data Science / NLP

Nike Brand Perception Analysis

How do consumers really feel about Nike vs Adidas vs Under Armour?

301K+ reviews scraped from the App Store, Reddit, and Trustpilot, cleaned to ~16K documents, and dissected with sentiment analysis, topic modeling, TF-IDF, and aspect-based sentiment. Built in R with tidytext, topicmodels, and rvest.

301K+

Raw Reviews

~16K

Cleaned Docs

3

Brands

3

Platforms

View on GitHub

The data

Reviews were scraped from three platforms using custom R scripts. Each source captures a different slice of consumer voice, from app users to forum communities to verified purchasers.

📱 App Store

8 countries, 10 pages per brand

💬 Reddit

10 search terms per brand via JSON API

Trustpilot

40 pages per brand, __NEXT_DATA__ parsing
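The Trustpilot scraper works by pulling the `__NEXT_DATA__` JSON blob that Next.js pages embed in their HTML. A minimal base-R sketch of that extraction step, run here on a toy HTML string standing in for a downloaded page (the real scripts fetch live pages and parse the JSON, e.g. with jsonlite):

```r
# Extract the __NEXT_DATA__ JSON payload from a page's raw HTML.
extract_next_data <- function(html) {
  m <- regmatches(
    html,
    regexpr('<script id="__NEXT_DATA__"[^>]*>.*?</script>', html, perl = TRUE)
  )
  if (length(m) == 0) return(NA_character_)
  # Strip the surrounding <script> tags, leaving the raw JSON string
  sub('</script>$', '', sub('^<script id="__NEXT_DATA__"[^>]*>', '', m))
}

# Toy stand-in for a fetched Trustpilot page
page <- '<html><script id="__NEXT_DATA__" type="application/json">{"props":{"reviews":[]}}</script></html>'
extract_next_data(page)
# returns the JSON string, ready for jsonlite::fromJSON()
```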

The pipeline

Each review passes through seven processing stages. Here's how each one works, why it matters, and what it looks like in action.

01

Tokenization

Breaking raw reviews into individual words the model can work with.

Why it works

NLP models can't read sentences the way we do. Tokenization breaks unstructured text into discrete units, specifically individual words, that become the rows of a tidy data frame. In R's tidytext, unnest_tokens() splits each review into one word per row, converting to lowercase and stripping punctuation in one step. This is the foundation that every downstream operation depends on. Counting, filtering, and scoring all operate on these atomic tokens.

unnest_tokens(word)

"Nike's running shoes are incredibly comfortable for daily training!"

nike's
running
shoes
are
incredibly
comfortable
for
daily
training

9 tokens
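The lowercase-split-strip step above can be sketched in base R. This is a simplified stand-in for what tidytext's unnest_tokens() does, not the actual pipeline code:

```r
# Lowercase, strip punctuation, split on whitespace: one token per element.
tokenize <- function(text) {
  text <- tolower(text)
  # Keep letters, digits, and in-word apostrophes; turn other punctuation into spaces
  text <- gsub("[^a-z0-9' ]", " ", text)
  tokens <- unlist(strsplit(text, "\\s+"))
  tokens[nzchar(tokens)]  # drop empty strings left by the split
}

tokenize("Nike's running shoes are incredibly comfortable for daily training!")
# returns the 9 tokens shown above, from "nike's" through "training"
```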

02

Stopword Removal

Removing high-frequency filler words so only meaningful terms remain.

Why it works

Words like "the", "is", "for", and "are" appear constantly but carry zero brand-specific meaning. An anti-join against a stopword list (stop_words from tidytext) removes them in one pass. Without this step, TF-IDF and topic modeling would be dominated by these high-frequency filler words instead of the actual signal. Words like "swoosh", "boost", and "c2c" are what actually differentiate brands, and they only surface once the noise is removed.

anti_join(stop_words)

nike's
running
shoes
incredibly
comfortable
daily
training

7 tokens remain / 2 stopwords removed
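In base R the anti-join is just a set-difference filter. A sketch with a tiny stand-in stopword list (the real pipeline uses the full stop_words table from tidytext):

```r
# Toy stopword list for illustration only
stopwords <- c("the", "is", "are", "for", "a", "of", "and")

tokens <- c("nike's", "running", "shoes", "are", "incredibly",
            "comfortable", "for", "daily", "training")

# Keep only tokens that are NOT in the stopword list (the anti-join idea)
kept <- tokens[!tokens %in% stopwords]
kept          # 7 tokens remain
length(tokens) - length(kept)  # 2 stopwords removed
```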

03

Stemming

Collapsing word variants into a single root so they count together.

Why it works

"Running", "runs", and "runner" all point to the same concept. Stemming collapses these inflected forms into a single root ("run") so they get counted together. Instead of fragmenting signal across three variants, we concentrate it in one stem, which boosts statistical power. The tradeoff is that stems are not always real words ("comput" for "computing"). However, for frequency-based methods like TF-IDF and LDA, precision of the root matters less than consistent aggregation.

mutate(stem = wordStem(word))

The highlighted suffix gets stripped, leaving just the root form.

running → run
shoes → shoe
comfortable → comfort
training → train
incredibly → incred
sustainability → sustain

6 words reduced to their root forms. Different inflections now map to the same token for consistent counting.
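The pipeline calls SnowballC's wordStem() inside a mutate(). As a toy illustration of the suffix-stripping idea (this is NOT the Porter algorithm; it only handles a few regular inflections), a base-R version:

```r
# Toy stemmer: strip a few common suffixes, then undo a doubled
# final consonant ("runn" -> "run"). Illustration only.
toy_stem <- function(words) {
  w <- sub("(ing|ly|s)$", "", words)
  sub("([a-z])\\1$", "\\1", w)
}

toy_stem(c("running", "shoes", "training"))
# returns "run" "shoe" "train"
```

Irregular forms ("comfortable" → "comfort") need the real Porter stemmer, which is why the pipeline uses wordStem() rather than anything this naive.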

04

Sentiment Analysis

Assigning each word an emotional score using three independent lexicons.

Why it works

Lexicon-based sentiment analysis maps each token to a pre-defined score. We use three lexicons in parallel: AFINN (integer scores from −5 to +5), Bing (binary positive/negative), and NRC (8 emotions). Each approaches sentiment differently. AFINN captures intensity ("amazing" = +4 vs "good" = +1), while NRC reveals the emotional texture behind the polarity. By running all three, we triangulate. If all three agree Nike reviews skew positive, that signal is robust.

Sentiment Scoring

+2
comfortable
+3
love
+1
quality
-2
overpriced
-3
broke
+4
amazing

Running Total

+5

Net sentiment: +5 (positive)
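AFINN-style scoring in miniature: a named vector stands in for the lexicon, with the scores from the demo above (the real pipeline inner-joins tokens against get_sentiments("afinn")):

```r
# Toy lexicon mirroring the demo scores above
lexicon <- c(comfortable = 2, love = 3, quality = 1,
             overpriced = -2, broke = -3, amazing = 4)

tokens <- c("comfortable", "love", "quality", "overpriced", "broke", "amazing")

scores <- lexicon[tokens]  # look up each token's score
sum(scores)                # net sentiment: +5 (positive)
```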

05

Vector Space & Embeddings

Mapping words into a geometric space where proximity reflects meaning.

Why it works

Every NLP technique in this pipeline implicitly treats words as vectors. TF-IDF creates sparse vectors where each dimension is a document. Word embeddings create dense vectors where proximity equals semantic similarity. The classic example is "king" − "man" + "woman" ≈ "queen". In our corpus, this means sneakerhead terms like "c2c", "deadstock", and "grails" cluster together because they are semantically close even though they share no characters. Understanding this geometry is key to why TF-IDF and LDA work, since both reason about words and documents through these vector representations.

Word Vector Space (conceptual 2D projection)

[Scatter plot, conceptual 2D projection (Dimension 1 vs Dimension 2): resale slang clusters together (c2c, deadstock, grails, resell, cop, drop, hype); running terms (cushion, stride, tempo, mileage, pronation); quality terms (durable, stitching, sole, material, craftsmanship); brand vocabulary (swoosh, snkrs, jumpman; boost, primeknit, three stripes; hovr, charged)]
Vector Analogy

Sneakerhead terms like "c2c", "deadstock", and "grails" cluster together because they share semantic meaning even though they share no characters. This is why vector-based methods outperform simple keyword matching.
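"Proximity" in a vector space is usually measured with cosine similarity. A base-R sketch, where the 2-D coordinates are made-up embedding positions for illustration only:

```r
# Cosine similarity: dot product over the product of vector lengths
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Hypothetical embedding coordinates (not learned from the corpus)
deadstock <- c(0.90, 0.80)
grails    <- c(0.85, 0.82)
cushion   <- c(-0.70, 0.30)

cosine(deadstock, grails)   # close to 1: the terms cluster together
cosine(deadstock, cushion)  # negative: opposite region of the space
```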

06

TF-IDF

Surfacing the words that define each brand by rewarding frequency and penalizing commonality.

Why it works

Term Frequency times Inverse Document Frequency is elegant. TF measures how often a word appears in one brand's reviews, and IDF penalizes words that appear in all brands' reviews. The product surfaces words that are both frequent and distinctive. "Shoe" gets crushed because high TF everywhere times low IDF equals noise. "Swoosh" gets amplified because high TF for Nike times high IDF equals brand signal. This is how we discover that "c2c", "snkrs", and "deadstock" are Nike-specific vocabulary tied to the resale and hype culture that does not exist in Adidas or Under Armour reviews.

TF-IDF Explainer: scoring "swoosh"

TF (Term Frequency)

Nike
0.0042
Adidas
0.0001
Under Armour
0.0002

IDF (Inverse Document Frequency)

2.71

High IDF = rare across brands

Result: TF × IDF = TF-IDF

Nike: 0.0042 × 2.71 = 0.0114
Adidas: 0.0001 × 2.71 = 0.0003
Under Armour: 0.0002 × 2.71 = 0.0005

"swoosh" is highly unique to Nike. A high TF-IDF score signals brand-specific language that does not appear in competitor reviews.
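The arithmetic from the explainer, computed directly in R (tf values and the idf are the figures shown above, not recomputed from the corpus):

```r
# Per-brand term frequencies for "swoosh", as shown in the explainer
tf  <- c(nike = 0.0042, adidas = 0.0001, under_armour = 0.0002)
idf <- 2.71  # high: "swoosh" is rare across the three brand corpora

# TF-IDF is simply the elementwise product
round(tf * idf, 4)
# nike 0.0114, adidas 0.0003, under_armour 0.0005
```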

07

LDA Topics

Letting the algorithm discover what consumers actually talk about, unsupervised.

Why it works

Latent Dirichlet Allocation assumes every document is a mixture of hidden topics, and every topic is a distribution over words. The algorithm discovers these topics unsupervised. We just tell it how many topics to find (k=4). What emerges reveals what consumers actually talk about: performance, product quality, app experience, and customer service. The power of LDA is that it finds structure humans might miss. Reviews mentioning "return" and "refund" cluster with "wait" and "order" even though no one labeled these as a "service" topic.

LDA Topic Model

[Diagram: documents map to topics, topics map to top words. Five sample documents feed four topics: Performance (run, shoe, fast, mile, comfort), Quality (quality, durable, material, stitch, sole), App Experience (app, workout, track, feature, sync), Service (return, refund, support, ship, order)]

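LDA takes a document-term matrix as input. A base-R sketch of building one from toy documents; the actual fit is a one-liner on top of it (commented out here since it needs the topicmodels package, while the pipeline builds its dtm with tidytext's cast_dtm()):

```r
# Three toy documents as token vectors
docs <- list(
  doc1 = c("run", "shoe", "mile", "run"),
  doc2 = c("app", "workout", "track"),
  doc3 = c("return", "refund", "order")
)

# Count every vocabulary term in every document: rows = docs, cols = terms
vocab <- sort(unique(unlist(docs)))
dtm <- t(sapply(docs, function(d) table(factor(d, levels = vocab))))
dtm

# The real fit (requires the topicmodels package):
# fitted <- topicmodels::LDA(dtm, k = 4, control = list(seed = 42))
```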

What the data revealed

After processing ~16K documents through the pipeline, four analytical lenses reveal how consumers perceive Nike, Adidas, and Under Armour.

Emotion Profile Across Brands

NRC lexicon across 8 emotions, 3 brands, proportional breakdown

Nike
Adidas
Under Armour
Emotions: Trust, Joy, Anticipation, Surprise, Fear, Sadness, Anger, Disgust

Net Sentiment by Platform

Bing lexicon, positive minus negative word count per source

Nike
Adidas
Under Armour
[Bar chart: net sentiment from −500 to +500 for each brand on App Store, Reddit, and Trustpilot]

Discovered Topics

LDA with k=4, showing top terms per topic with beta weights

Topic 1: Running & Performance

run
0.045
shoe
0.038
comfort
0.032
mile
0.028
fast
0.024
cushion
0.021

Topic 2: Product Quality & Sizing

quality
0.042
size
0.036
fit
0.031
material
0.027
durable
0.023
sole
0.019

Topic 3: App & Workout Experience

app
0.051
workout
0.039
track
0.033
feature
0.026
update
0.022
sync
0.018

Topic 4: Customer Service & Returns

return
0.048
order
0.041
ship
0.035
refund
0.029
support
0.024
wait
0.020

What Makes Each Brand Unique

TF-IDF, showing words with the highest discriminative power per brand

Nike

swoosh
0.0114
air max
0.0098
jordan
0.0087
dunk
0.0076
snkrs
0.0069
flyknit
0.0058
react
0.0052
vapormax
0.0045

Adidas

boost
0.0102
ultraboost
0.0091
yeezy
0.0084
superstar
0.0073
stan smith
0.0065
nmd
0.0058
primeknit
0.0049
samba
0.0042

Under Armour

hovr
0.0095
charged
0.0082
curry
0.0074
ua
0.0068
project rock
0.0059
speedform
0.0051
mapmyrun
0.0044
coldgear
0.0038

Word Correlations

Pairwise phi coefficients showing which words co-occur more frequently than expected across all reviews

air & max
0.82
customer & service
0.74
high & quality
0.68
delivery & late
0.65
running & shoe
0.61
return & refund
0.58
comfortable & fit
0.55
sole & wear
0.52
price & expensive
0.49
app & crash
0.46
size & small
0.44
color & love
0.41

Phi coefficient measures how often two words co-occur in the same review more than expected by chance. Higher values indicate stronger association.
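The phi coefficient comes from a 2×2 contingency table over reviews (does word A appear? does word B appear?). A base-R implementation on toy presence/absence data:

```r
# Phi coefficient between two words from per-review presence vectors
phi <- function(a, b) {
  # a, b: logical vectors, one element per review
  n11 <- sum(a & b);  n00 <- sum(!a & !b)
  n10 <- sum(a & !b); n01 <- sum(!a & b)
  (n11 * n00 - n10 * n01) /
    sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
}

# 8 toy reviews: "air" and "max" almost always appear together
air  <- c(TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
max_ <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
phi(air, max_)  # strong association, near the top of the list above
```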

Aspect-Based Sentiment

Keyword-filtered sentences scored for price, quality, and sustainability perception

Aspect: Nike / Adidas / Under Armour
Price: −0.42 / −0.28 / −0.35
Quality: +0.31 / +0.25 / +0.18
Sustainability: +0.15 / +0.22 / +0.05
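Aspect-based sentiment in miniature: keep only the sentences that mention an aspect keyword, then score those. The keyword list and lexicon here are tiny stand-ins for illustration; the real pipeline filters ~16K documents against full lexicons:

```r
sentences <- c("the price is way too expensive",
               "great quality for the price",
               "love the design")

price_terms <- c("price", "cost", "expensive", "cheap")
lexicon <- c(great = 2, love = 3, expensive = -2)  # toy scores

# Step 1: filter to sentences mentioning the aspect
mentions_price <- sapply(sentences, function(s)
  any(strsplit(s, " ")[[1]] %in% price_terms))
price_sents <- sentences[mentions_price]

# Step 2: score only those sentences
score <- function(s) sum(lexicon[strsplit(s, " ")[[1]]], na.rm = TRUE)
sapply(price_sents, score)
# first sentence scores negative, second positive; the third never enters
```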

Stack

NLP · Sentiment Analysis · R · Brand Analytics