Nike Brand Perception Analysis
How do consumers really feel about Nike vs Adidas vs Under Armour?
301K+ reviews scraped from the App Store, Reddit, and Trustpilot, cleaned to ~16K documents, and dissected with sentiment analysis, topic modeling, TF-IDF, and aspect-based sentiment. Built in R with tidytext, topicmodels, and rvest.
301K+
Raw Reviews
~16K
Cleaned Docs
3
Brands
3
Platforms
The data
Reviews were scraped from three platforms using custom R scripts. Each source captures a different slice of consumer voice, from app users to forum communities to verified purchasers.
8 countries, 10 pages per brand
10 search terms per brand via JSON API
40 pages per brand, __NEXT_DATA__ parsing
The pipeline
Each review passes through seven processing stages. Here's how each one works, why it matters, and what it looks like in action.
Tokenization
Breaking raw reviews into individual words the model can work with.
Why it works
NLP models can't read sentences the way we do. Tokenization breaks unstructured text into discrete units, specifically individual words, that become the rows of a tidy data frame. In R's tidytext, unnest_tokens() splits each review into one word per row, converting to lowercase and stripping punctuation in one step. This is the foundation that every downstream operation depends on: counting, filtering, and scoring all operate on these atomic tokens.
unnest_tokens(word)
"Nike's running shoes are incredibly comfortable for daily training!"
9 tokens
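The pipeline does this in R with tidytext's unnest_tokens(); as a rough, dependency-free sketch of the same idea in Python (the regex keeps intra-word apostrophes, roughly as tidytext's word tokenizer does):

```python
import re

def tokenize(review: str) -> list[str]:
    """Lowercase and split into word tokens, keeping intra-word
    apostrophes -- a rough stand-in for tidytext's unnest_tokens()."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", review.lower())

review = "Nike's running shoes are incredibly comfortable for daily training!"
tokens = tokenize(review)
print(tokens)       # punctuation stripped, all lowercase
print(len(tokens))  # 9
```

Each token becomes one row of the tidy data frame; everything downstream is a filter, join, or count over this column.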
Stopword Removal
Removing high-frequency filler words so only meaningful terms remain.
Why it works
Words like "the", "is", "for", and "are" appear constantly but carry zero brand-specific meaning. An anti-join against a stopword list (stop_words from tidytext) removes them in one pass. Without this step, TF-IDF and topic modeling would be dominated by these high-frequency filler words instead of the actual signal. Words like "swoosh", "boost", and "c2c" are what actually differentiate brands, and they only surface once the noise is removed.
anti_join(stop_words)
7 tokens remain / 2 stopwords removed
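The anti-join itself is a one-liner. A minimal Python sketch, using a tiny hypothetical stopword set as a stand-in for tidytext's much larger stop_words lexicon:

```python
# Toy stopword list -- a tiny stand-in for tidytext's stop_words lexicon.
STOP_WORDS = {"the", "is", "for", "are", "a", "an", "and", "of", "to"}

tokens = ["nike's", "running", "shoes", "are", "incredibly",
          "comfortable", "for", "daily", "training"]

# The anti-join: keep only tokens that do NOT appear in the stopword list.
kept = [t for t in tokens if t not in STOP_WORDS]
print(kept)                     # 7 tokens remain
print(len(tokens) - len(kept))  # 2 stopwords removed
```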
Stemming
Collapsing word variants into a single root so they count together.
Why it works
"Running", "runs", and "runner" all point to the same concept. Stemming collapses these inflected forms into a single root ("run") so they get counted together. Instead of fragmenting signal across several variants, we concentrate it in one stem, which boosts statistical power. The tradeoff is that stems are not always real words ("comput" for "computing"), but for frequency-based methods like TF-IDF and LDA, precision of the root matters less than consistent aggregation.
mutate(stem = wordStem(word))
The highlighted suffix gets stripped, leaving just the root form.
6 words reduced to their root forms. Different inflections now map to the same token for consistent counting.
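The real pipeline uses wordStem() (SnowballC's Porter stemmer) in R. Purely to illustrate the mechanics, here is a deliberately tiny suffix-stripper that handles only the demo words above; it is not a real stemming algorithm:

```python
def toy_stem(word: str) -> str:
    """A deliberately tiny suffix-stripper, NOT a real Porter stemmer --
    the pipeline itself calls wordStem() from SnowballC in R."""
    if word.endswith("ing") and len(word) > 5:
        word = word[:-3]
        if len(word) > 2 and word[-1] == word[-2]:  # "runn" -> "run"
            word = word[:-1]
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word

for w in ["running", "runs", "run"]:
    print(w, "->", toy_stem(w))  # all three map to "run"
```

The point is the aggregation: after this step, every inflection increments the same counter.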
Sentiment Analysis
Assigning each word an emotional score using three independent lexicons.
Why it works
Lexicon-based sentiment analysis maps each token to a predefined score. We use three lexicons in parallel: AFINN (integer scores from −5 to +5), Bing (binary positive/negative), and NRC (8 emotions). Each approaches sentiment differently. AFINN captures intensity ("amazing" = +4 vs "good" = +1), while NRC reveals the emotional texture behind the polarity. By running all three, we triangulate. If all three agree Nike reviews skew positive, that signal is robust.
Sentiment Scoring
Running Total
Net sentiment: ... (positive)
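An AFINN-style join is just a dictionary lookup plus a sum. A minimal Python sketch with a toy inline lexicon (scores are illustrative values in AFINN's −5..+5 range; the real lexicon has roughly 2,500 scored terms):

```python
# Toy AFINN-style lexicon: integer scores in [-5, +5].
# Illustrative values only -- not the real AFINN entries.
AFINN_TOY = {"comfortable": 2, "amazing": 4, "good": 1,
             "uncomfortable": -2, "terrible": -3}

def net_sentiment(tokens: list[str]) -> int:
    """Sum each token's lexicon score; unknown words score 0."""
    return sum(AFINN_TOY.get(t, 0) for t in tokens)

review = ["nike's", "running", "shoes", "amazing", "comfortable"]
print(net_sentiment(review))  # 4 + 2 = 6 -> net positive
```

Bing works the same way with scores of ±1; NRC replaces the single score with membership in eight emotion categories.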
Vector Space & Embeddings
Mapping words into a geometric space where proximity reflects meaning.
Why it works
Every NLP technique in this pipeline implicitly treats words as vectors. TF-IDF creates sparse vectors where each dimension is a document. Word embeddings create dense vectors where proximity equals semantic similarity. The classic example is "king" minus "man" plus "woman" approximately equals "queen". In our corpus, this means sneakerhead terms like "c2c", "deadstock", and "grails" cluster together because they are semantically close even though they share no characters. Understanding this geometry is key to why TF-IDF and LDA work: both treat text as positions and weights in a high-dimensional space rather than as raw strings.
Word Vector Space (conceptual 2D projection)
Sneakerhead terms like "c2c", "deadstock", and "grails" cluster together because they share semantic meaning even though they share no characters. This is why vector-based methods outperform simple keyword matching.
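"Proximity reflects meaning" can be made concrete with cosine similarity. The coordinates below are hypothetical 2-D positions like those in the conceptual projection above, not learned embeddings:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = unrelated, < 0 = opposed."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical 2-D coordinates, purely illustrative:
# sneakerhead slang clusters together; an unrelated word sits elsewhere.
vecs = {"deadstock": (0.9, 0.8), "grails": (0.85, 0.9), "refund": (-0.7, 0.2)}

print(cosine(vecs["deadstock"], vecs["grails"]))  # close to 1: similar
print(cosine(vecs["deadstock"], vecs["refund"]))  # negative: unrelated
```

Keyword matching would score "deadstock" and "grails" as totally different strings; in vector space they are near neighbors.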
TF-IDF
Surfacing the words that define each brand by rewarding frequency and penalizing commonality.
Why it works
Term Frequency times Inverse Document Frequency is elegant. TF measures how often a word appears in one brand's reviews, and IDF penalizes words that appear in all brands' reviews. The product surfaces words that are both frequent and distinctive. "Shoe" gets crushed because high TF everywhere times low IDF equals noise. "Swoosh" gets amplified because high TF for Nike times high IDF equals brand signal. This is how we discover that "c2c", "snkrs", and "deadstock" are Nike-specific vocabulary tied to the resale and hype culture that does not exist in Adidas or Under Armour reviews.
TF-IDF Explainer
TF (Term Frequency)
IDF (Inverse Document Frequency)
High IDF = rare across brands
Result: TF × IDF = TF-IDF
"swoosh" is highly unique to Nike. A high TF-IDF score signals brand-specific language that does not appear in competitor reviews.
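The whole computation fits in a few lines. A Python sketch over a toy three-brand corpus (tokens are illustrative), using the same definitions as tidytext's bind_tf_idf: tf = count / document length, idf = ln(number of documents / documents containing the term):

```python
import math
from collections import Counter

# Toy corpus: one bag of words per brand (illustrative tokens only).
docs = {
    "nike":        ["shoe", "shoe", "swoosh", "snkrs", "running"],
    "adidas":      ["shoe", "boost", "boost", "running"],
    "underarmour": ["shoe", "hovr", "workout"],
}

def tf_idf(docs):
    n = len(docs)
    # Document frequency: in how many brands' reviews does each word appear?
    df = Counter(w for words in docs.values() for w in set(words))
    scores = {}
    for brand, words in docs.items():
        counts, total = Counter(words), len(words)
        scores[brand] = {w: (c / total) * math.log(n / df[w])
                         for w, c in counts.items()}
    return scores

s = tf_idf(docs)
print(s["nike"]["shoe"])    # 0.0 -- appears for every brand, idf = ln(1)
print(s["nike"]["swoosh"])  # > 0 -- frequent for Nike, absent elsewhere
```

"shoe" is zeroed out despite being the most common token, while the brand-exclusive term gets the highest score: exactly the crushing and amplifying described above.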
LDA Topics
Letting the algorithm discover what consumers actually talk about, unsupervised.
Why it works
Latent Dirichlet Allocation assumes every document is a mixture of hidden topics, and every topic is a distribution over words. The algorithm discovers these topics unsupervised. We just tell it how many topics to find (k=4). What emerges reveals what consumers actually talk about: performance, product quality, app experience, and customer service. The power of LDA is that it finds structure humans might miss. Reviews mentioning "return" and "refund" cluster with "wait" and "order" even though no one labeled these as a "service" topic.
LDA Topic Model
Click a topic to highlight its words
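The project fits its model with topicmodels::LDA (k=4) in R. To show the mechanics, here is a minimal collapsed Gibbs sampler for LDA in pure Python, run on a tiny invented corpus with k=2; it is a teaching sketch, not the production fitting procedure:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, k, iters=200, alpha=0.1, beta=0.01, seed=42):
    """Minimal collapsed Gibbs sampler for LDA (toy illustration --
    the pipeline itself uses topicmodels::LDA with k = 4 in R)."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    n_dk = [[0] * k for _ in docs]               # doc-topic counts
    n_kw = [defaultdict(int) for _ in range(k)]  # topic-word counts
    n_k = [0] * k                                # tokens per topic
    z = []                                       # topic assignment per token
    for d, doc in enumerate(docs):               # random initialisation
        zs = []
        for w in doc:
            t = rng.randrange(k)
            zs.append(t)
            n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
        z.append(zs)
    for _ in range(iters):                       # resample every token
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                n_dk[d][t] -= 1; n_kw[t][w] -= 1; n_k[t] -= 1
                # p(topic) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [(n_dk[d][j] + alpha) * (n_kw[j][w] + beta)
                           / (n_k[j] + V * beta) for j in range(k)]
                t = rng.choices(range(k), weights=weights)[0]
                z[d][i] = t
                n_dk[d][t] += 1; n_kw[t][w] += 1; n_k[t] += 1
    return n_kw  # unnormalised topic-word counts (the "beta" weights)

docs = [["run", "pace", "marathon"], ["run", "pace", "run"],
        ["refund", "order", "return"], ["return", "refund", "wait"]]
topics = lda_gibbs(docs, k=2)
for t, counts in enumerate(topics):
    print(t, sorted(counts, key=counts.get, reverse=True)[:3])
```

With enough iterations the co-occurring words ("run"/"pace" vs "refund"/"return") tend to concentrate in separate topics, with no labels supplied, which is the unsupervised structure-finding described above.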
What the data revealed
After processing ~16K documents through the pipeline, four analytical lenses reveal how consumers perceive Nike, Adidas, and Under Armour.
Emotion Profile Across Brands
NRC lexicon across 8 emotions, 3 brands, proportional breakdown
Net Sentiment by Platform
Bing lexicon, positive minus negative word count per source
Discovered Topics
LDA with k=4, showing top terms per topic with beta weights
Topic 1: Running & Performance
Topic 2: Product Quality & Sizing
Topic 3: App & Workout Experience
Topic 4: Customer Service & Returns
What Makes Each Brand Unique
TF-IDF, showing words with the highest discriminative power per brand
Nike
Adidas
Under Armour
Word Correlations
Pairwise phi coefficients showing which words co-occur more frequently than expected across all reviews
Phi coefficient measures how often two words co-occur in the same review more than expected by chance. Higher values indicate stronger association.
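The phi coefficient comes from a 2×2 contingency table over reviews. A self-contained Python sketch on toy reviews (token sets are illustrative):

```python
import math

def phi(word_a, word_b, reviews):
    """Pairwise phi coefficient: do two words co-occur in the same
    review more often than chance would predict?"""
    n = len(reviews)
    n11 = sum(1 for r in reviews if word_a in r and word_b in r)
    n10 = sum(1 for r in reviews if word_a in r and word_b not in r)
    n01 = sum(1 for r in reviews if word_b in r and word_a not in r)
    n00 = n - n11 - n10 - n01
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

# Toy reviews as token sets (illustrative only).
reviews = [{"return", "refund", "wait"}, {"refund", "order", "return"},
           {"run", "pace"}, {"run", "shoe"}, {"shoe", "comfortable"}]
print(phi("return", "refund", reviews))  # 1.0 -- always co-occur here
print(phi("return", "run", reviews))     # negative -- never co-occur
```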
Aspect-Based Sentiment
Keyword-filtered sentences scored for price, quality, and sustainability perception
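The core move is a keyword filter before the sentiment join: only sentences mentioning an aspect contribute to that aspect's score. A minimal Python sketch; the aspect keyword sets and lexicon scores below are hypothetical stand-ins, not the project's actual lists:

```python
# Hypothetical aspect keywords and a toy sentiment lexicon -- illustrative.
ASPECTS = {"price": {"price", "cost", "expensive", "cheap"},
           "quality": {"quality", "durable", "stitching"},
           "sustainability": {"recycled", "sustainable", "eco"}}
LEXICON = {"great": 2, "durable": 1, "expensive": -1, "terrible": -3, "love": 3}

def aspect_sentiment(sentences):
    """Score only the sentences that mention each aspect's keywords."""
    scores = {a: 0 for a in ASPECTS}
    for s in sentences:
        words = set(s.lower().split())
        for aspect, keys in ASPECTS.items():
            if words & keys:  # sentence mentions this aspect
                scores[aspect] += sum(LEXICON.get(w, 0) for w in words)
    return scores

sents = ["love the recycled materials", "quality is great and durable",
         "too expensive for the price"]
print(aspect_sentiment(sents))
```

This is what lets one review be simultaneously positive on quality and negative on price, which a single document-level score would average away.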