Nike Brand Perception Analysis
How do consumers really feel about Nike vs Adidas vs Under Armour?
301K+ reviews scraped from the App Store, Reddit, and Trustpilot, cleaned to ~16K documents, and dissected with sentiment analysis, topic modeling, TF-IDF, and aspect-based sentiment. Built in R with tidytext, topicmodels, and rvest.
301K+
Raw Reviews
~16K
Cleaned Docs
3
Brands
3
Platforms
The data
Reviews were scraped from three platforms using custom R scripts. Each source captures a different slice of consumer voice, from app users to forum communities to verified purchasers.
8 countries, 10 pages per brand
10 search terms per brand via JSON API
40 pages per brand, __NEXT_DATA__ parsing
The pipeline
Each review passes through seven processing stages. Here's how each one works, why it matters, and what it looks like in action.
Tokenization
Breaking raw reviews into individual words the model can work with.
Why it works
NLP models can't read sentences the way we do. Tokenization breaks unstructured text into discrete units, specifically individual words, that become the rows of a tidy data frame. In R's tidytext, unnest_tokens() splits each review into one word per row, converting to lowercase and stripping punctuation in one step. This is the foundation that every downstream operation depends on: counting, filtering, and scoring all operate on these atomic tokens.
unnest_tokens(word)
"Nike's running shoes are incredibly comfortable for daily training!"
9 tokens
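The pipeline does this in R with tidytext's unnest_tokens(); as a rough, dependency-free sketch of the same idea in Python (the regex keeps intra-word apostrophes, roughly as tidytext's word tokenizer does):

```python
import re

def tokenize(review: str) -> list[str]:
    """Lowercase and split into word tokens, keeping intra-word
    apostrophes -- a rough stand-in for tidytext's unnest_tokens()."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", review.lower())

review = "Nike's running shoes are incredibly comfortable for daily training!"
tokens = tokenize(review)
print(tokens)       # punctuation stripped, all lowercase
print(len(tokens))  # 9
```

Each token becomes one row of the tidy data frame; everything downstream is a filter, join, or count over this column.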
Stopword Removal
Removing high-frequency filler words so only meaningful terms remain.
Why it works
Words like "the", "is", "for", and "are" appear constantly but carry zero brand-specific meaning. An anti-join against a stopword list (stop_words from tidytext) removes them in one pass. Without this step, TF-IDF and topic modeling would be dominated by these high-frequency filler words instead of the actual signal. Words like "swoosh", "boost", and "c2c" are what actually differentiate brands, and they only surface once the noise is removed.
anti_join(stop_words)
7 tokens remain / 2 stopwords removed
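The anti-join itself is a one-liner. A minimal Python sketch, using a tiny hypothetical stopword set as a stand-in for tidytext's much larger stop_words lexicon:

```python
# Toy stopword list -- a tiny stand-in for tidytext's stop_words lexicon.
STOP_WORDS = {"the", "is", "for", "are", "a", "an", "and", "of", "to"}

tokens = ["nike's", "running", "shoes", "are", "incredibly",
          "comfortable", "for", "daily", "training"]

# The anti-join: keep only tokens that do NOT appear in the stopword list.
kept = [t for t in tokens if t not in STOP_WORDS]
print(kept)                     # 7 tokens remain
print(len(tokens) - len(kept))  # 2 stopwords removed
```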
Stemming
Collapsing word variants into a single root so they count together.
Why it works
"Running", "runs", and "runner" all point to the same concept. Stemming collapses these inflected forms into a single root ("run") so they get counted together. Instead of fragmenting signal across several variants, we concentrate it in one stem, which boosts statistical power. The tradeoff is that stems are not always real words ("comput" for "computing"), but for frequency-based methods like TF-IDF and LDA, precision of the root matters less than consistent aggregation.
mutate(stem = wordStem(word))
The highlighted suffix gets stripped, leaving just the root form.
6 words reduced to their root forms. Different inflections now map to the same token for consistent counting.
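The real pipeline uses wordStem() (SnowballC's Porter stemmer) in R. Purely to illustrate the mechanics, here is a deliberately tiny suffix-stripper that handles only the demo words above; it is not a real stemming algorithm:

```python
def toy_stem(word: str) -> str:
    """A deliberately tiny suffix-stripper, NOT a real Porter stemmer --
    the pipeline itself calls wordStem() from SnowballC in R."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:  # "runn" -> "run"
            word = word[:-1]
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word

for w in ["running", "runs", "run"]:
    print(w, "->", toy_stem(w))  # all three map to "run"
```

The point is the aggregation: after this step, every inflection increments the same counter.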
Sentiment Analysis
Assigning each word an emotional score using three independent lexicons.
Why it works
Lexicon-based sentiment analysis maps each token to a predefined score. We use three lexicons in parallel: AFINN (integer scores from −5 to +5), Bing (binary positive/negative), and NRC (8 emotions). Each approaches sentiment differently. AFINN captures intensity ("amazing" = +4 vs "good" = +1), while NRC reveals the emotional texture behind the polarity. By running all three, we triangulate. If all three agree Nike reviews skew positive, that signal is robust.
Sentiment Scoring
Running Total
Net sentiment: ... (positive)
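An AFINN-style join is just a dictionary lookup plus a sum. A minimal Python sketch with a toy inline lexicon (scores are illustrative values in AFINN's −5..+5 range; the real lexicon has roughly 2,500 scored terms):

```python
# Toy AFINN-style lexicon: integer scores in [-5, +5].
# Illustrative values only -- not the real AFINN entries.
AFINN_TOY = {"comfortable": 2, "amazing": 4, "good": 1,
             "uncomfortable": -2, "terrible": -3}

def net_sentiment(tokens: list[str]) -> int:
    """Sum each token's lexicon score; unknown words score 0."""
    return sum(AFINN_TOY.get(t, 0) for t in tokens)

review = ["nike's", "running", "shoes", "amazing", "comfortable"]
print(net_sentiment(review))  # 4 + 2 = 6 -> net positive
```

Bing works the same way with scores of ±1; NRC replaces the single score with membership in eight emotion categories.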
Vector Space & Embeddings
Mapping words into a geometric space where proximity reflects meaning.
Why it works
Every NLP technique in this pipeline implicitly treats words as vectors. TF-IDF creates sparse vectors where each dimension is a document. Word embeddings create dense vectors where proximity equals semantic similarity. The classic example is "king" minus "man" plus "woman" approximately equals "queen". In our corpus, this means sneakerhead terms like "c2c", "deadstock", and "grails" cluster together because they are semantically close even though they share no characters. Understanding this geometry is key to why TF-IDF and LDA work: both treat text as positions and weights in a high-dimensional space rather than as raw strings.
Word Vector Space (conceptual 2D projection)
Sneakerhead terms like "c2c", "deadstock", and "grails" cluster together because they share semantic meaning even though they share no characters. This is why vector-based methods outperform simple keyword matching.
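"Proximity reflects meaning" can be made concrete with cosine similarity. The coordinates below are hypothetical 2-D positions like those in the conceptual projection above, not learned embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = unrelated, < 0 = opposed."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical 2-D coordinates, purely illustrative:
# sneakerhead slang clusters together; an unrelated word sits elsewhere.
vecs = {"deadstock": (0.9, 0.8), "grails": (0.85, 0.9), "refund": (-0.7, 0.2)}

print(cosine(vecs["deadstock"], vecs["grails"]))  # close to 1: similar
print(cosine(vecs["deadstock"], vecs["refund"]))  # negative: unrelated
```

Keyword matching would score "deadstock" and "grails" as totally different strings; in vector space they are near neighbors.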
TF-IDF
Surfacing the words that define each brand by rewarding frequency and penalizing commonality.
Why it works
Term Frequency times Inverse Document Frequency is elegant. TF measures how often a word appears in one brand's reviews, and IDF penalizes words that appear in all brands' reviews. The product surfaces words that are both frequent and distinctive. "Shoe" gets crushed because high TF everywhere times low IDF equals noise. "Swoosh" gets amplified because high TF for Nike times high IDF equals brand signal. This is how we discover that "c2c", "snkrs", and "deadstock" are Nike-specific vocabulary tied to the resale and hype culture that does not exist in Adidas or Under Armour reviews.
TF-IDF Explainer
TF (Term Frequency)
IDF (Inverse Document Frequency)
High IDF = rare across brands
Result: TF × IDF = TF-IDF
"swoosh" is highly unique to Nike. A high TF-IDF score signals brand-specific language that does not appear in competitor reviews.
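The whole computation fits in a few lines. A Python sketch over a toy three-brand corpus (tokens are illustrative), using the same definitions as tidytext's bind_tf_idf: tf = count / document length, idf = ln(number of documents / documents containing the term):

```python
import math
from collections import Counter

# Toy corpus: one bag of words per brand (illustrative tokens only).
docs = {
    "nike":        ["shoe", "shoe", "swoosh", "snkrs", "running"],
    "adidas":      ["shoe", "boost", "boost", "running"],
    "underarmour": ["shoe", "hovr", "workout"],
}

def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many brands' reviews does each word appear?
    df = Counter(w for words in docs.values() for w in set(words))
    scores = {}
    for brand, words in docs.items():
        counts, total = Counter(words), len(words)
        scores[brand] = {w: (c / total) * math.log(n / df[w])
                         for w, c in counts.items()}
    return scores

s = tf_idf(docs)
print(s["nike"]["shoe"])    # 0.0 -- appears for every brand, idf = ln(1)
print(s["nike"]["swoosh"])  # > 0 -- frequent for Nike, absent elsewhere
```

"shoe" is zeroed out despite being the most common token, while the brand-exclusive term gets the highest score: exactly the crushing and amplifying described above.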
LDA Topics
Letting the algorithm discover what consumers actually talk about, unsupervised.
Why it works
Latent Dirichlet Allocation assumes every document is a mixture of hidden topics, and every topic is a distribution over words. The algorithm discovers these topics unsupervised. We just tell it how many topics to find (k=4). What emerges reveals what consumers actually talk about: performance, product quality, app experience, and customer service. The power of LDA is that it finds structure humans might miss. Reviews mentioning "return" and "refund" cluster with "wait" and "order" even though no one labeled these as a "service" topic.
LDA Topic Model
Click a topic to highlight its words
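The project fits its model with topicmodels::LDA (k=4) in R. To show the mechanics, here is a minimal collapsed Gibbs sampler for LDA in pure Python, run on a tiny invented corpus with k=2; it is a teaching sketch, not the production fitting procedure:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=42):
    """Minimal collapsed Gibbs sampler for LDA (toy illustration --
    the pipeline itself uses topicmodels::LDA with k = 4 in R)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    n_dk = [[0] * k for _ in docs]               # doc-topic counts
    n_kw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    n_k = [0] * k                                # tokens per topic
    z = []                                       # topic assignment per token
    for d, doc in enumerate(docs):               # random initialisation
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
        z.append(zs)
    for _ in range(iters):                       # resample every token
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_dk[d][t] -= 1; n_kw[t][w] -= 1; n_k[t] -= 1
                # p(topic) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + V * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
    return n_kw  # unnormalised topic-word counts (the "beta" weights)

docs = [["run", "pace", "marathon"], ["run", "pace", "run"],
        ["refund", "order", "return"], ["return", "refund", "wait"]]
topics = lda_gibbs(docs, k=2)
for t, counts in enumerate(topics):
    print(t, sorted(counts, key=counts.get, reverse=True)[:3])
```

With enough iterations the co-occurring words ("run"/"pace" vs "refund"/"return") tend to concentrate in separate topics, with no labels supplied, which is the unsupervised structure-finding described above.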
What the data revealed
After processing ~16K documents through the pipeline, four analytical lenses reveal how consumers perceive Nike, Adidas, and Under Armour.
Emotion Profile Across Brands
NRC lexicon across 8 emotions, 3 brands, proportional breakdown
Net Sentiment by Platform
Bing lexicon, positive minus negative word count per source
Discovered Topics
LDA with k=4, showing top terms per topic with beta weights
Topic 1: Running & Performance
Topic 2: Product Quality & Sizing
Topic 3: App & Workout Experience
Topic 4: Customer Service & Returns
What Makes Each Brand Unique
TF-IDF, showing words with the highest discriminative power per brand
Nike
Adidas
Under Armour
Word Correlations
Pairwise phi coefficients showing which words co-occur more frequently than expected across all reviews
Phi coefficient measures how often two words co-occur in the same review more than expected by chance. Higher values indicate stronger association.
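The phi coefficient comes from a 2×2 contingency table over reviews. A self-contained Python sketch on toy reviews (token sets are illustrative):

```python
import math

def phi(word_a, word_b, reviews):
    """Pairwise phi coefficient: do two words co-occur in the same
    review more often than chance would predict?"""
    n = len(reviews)
    n11 = sum(1 for r in reviews if word_a in r and word_b in r)
    n10 = sum(1 for r in reviews if word_a in r and word_b not in r)
    n01 = sum(1 for r in reviews if word_b in r and word_a not in r)
    n00 = n - n11 - n10 - n01
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Toy reviews as token sets (illustrative only).
reviews = [{"return", "refund", "wait"}, {"refund", "order", "return"},
           {"run", "pace"}, {"run", "shoe"}, {"shoe", "comfortable"}]
print(phi("return", "refund", reviews))  # 1.0 -- always co-occur here
print(phi("return", "run", reviews))     # negative -- never co-occur
```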
Aspect-Based Sentiment
Keyword-filtered sentences scored for price, quality, and sustainability perception
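The core move is a keyword filter before the sentiment join: only sentences mentioning an aspect contribute to that aspect's score. A minimal Python sketch; the aspect keyword sets and lexicon scores below are hypothetical stand-ins, not the project's actual lists:

```python
# Hypothetical aspect keywords and a toy sentiment lexicon -- illustrative.
ASPECTS = {"price": {"price", "cost", "expensive", "cheap"},
           "quality": {"quality", "durable", "stitching"},
           "sustainability": {"recycled", "sustainable", "eco"}}
LEXICON = {"great": 2, "durable": 1, "expensive": -1, "terrible": -3, "love": 3}

def aspect_sentiment(sentences):
    """Score only the sentences that mention each aspect's keywords."""
    scores = {a: 0 for a in ASPECTS}
    for s in sentences:
        words = set(s.lower().split())
        for aspect, keys in ASPECTS.items():
            if words & keys:  # sentence mentions this aspect
                scores[aspect] += sum(LEXICON.get(w, 0) for w in words)
    return scores

sents = ["love the recycled materials", "quality is great and durable",
         "too expensive for the price"]
print(aspect_sentiment(sents))
```

This is what lets one review be simultaneously positive on quality and negative on price, which a single document-level score would average away.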