Jeanne has been building language models since before it was cool.

With nearly nine years of experience in AI, multilingual NLP, and data science across both industry and research, her focus has always been on multilingual and low-resource settings, where data is scarce, noisy, and rarely benchmark-ready. Her work has ranged from misinformation detection on social media to real-world language understanding in underrepresented languages, with privacy and data protection as a consistent consideration throughout. More recently, her interests have extended into finance — specifically, how language models can be used to extract signal from social media and news to model and anticipate market behaviour.

Her research has been published at ACL, including work on automating multilingual healthcare question answering in low-resource African languages. She approaches problems at the intersection of language, people, and systems — with a particular interest in making AI work in contexts it was never designed for.

Outside of work, she reads widely across behavioural economics, climate, and misinformation and writes occasionally when something is worth saying.

Originally from Cape Town, Jeanne now lives in London with her husband and two Bengal cats, Eira and Kinzy.

Blog

Generative AI, Technology, Machine Learning, Vibe Coding Jeannie Daniel

Vibe-Coding: A Double-Edged Sword

Vibe coding has collapsed the distance between having an idea and having a working app. For prototyping, it's genuinely transformative. But the gap between a convincing demo and production-ready software hasn't closed — and the judgment required to bridge that gap hasn't gotten any cheaper to acquire.

3 March 2026

Andrej Karpathy didn't invent AI-assisted coding, but he did give it a name that stuck. In February 2025, he described a new way of working: fully giving in to the AI, not really writing code so much as directing it. Describing what you want, accepting what it produces, nudging it when it goes wrong. He called it vibe coding. Within weeks, the term was everywhere.

It's easy to see the appeal. The friction between having an idea and having a working implementation has collapsed. What used to take a week of scaffolding, boilerplate, and Stack Overflow archaeology can now be roughed out in an afternoon. For prototyping in particular, this is genuinely transformative.

But vibe coding is a double-edged sword, and which edge you get depends almost entirely on how much you already know.

Solving Problems Like a Pragmatic Programmer

In The Pragmatic Programmer, Andrew Hunt and David Thomas introduce the concept of tracer bullets — a metaphor borrowed from military ammunition that emits a visible trail, letting you see exactly where your shots are landing.

The idea applied to software is this: rather than building each component in isolation and hoping it all fits together at the end, you fire a thin, end-to-end slice through the entire system first. It touches every layer (frontend, API, database, whatever your stack demands) but does very little. It just has to work. Once it does, you fill in the blanks, solve for increasingly complex parts of the problem, and handle the edge cases last.
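As a sketch of what that thin slice might look like (all names here are illustrative, not taken from any real project), a toy "notes" service can run end to end through an API function, a sliver of logic, and storage in a few lines:

```python
import sqlite3

# The thinnest possible slice: one write path and one read path that
# together touch every layer. Each layer does the bare minimum.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")

def create_note(body: str) -> int:
    """'API' layer: accept input, delegate to storage, return an id."""
    cur = db.execute("INSERT INTO notes (body) VALUES (?)", (body,))
    db.commit()
    return cur.lastrowid

def get_note(note_id: int) -> str:
    """Read path: prove the round trip works end to end."""
    row = db.execute("SELECT body FROM notes WHERE id = ?", (note_id,)).fetchone()
    return row[0]

# The tracer bullet itself: one write, one read, through every layer.
note_id = create_note("hello, tracer bullet")
assert get_note(note_id) == "hello, tracer bullet"
```

Once this round trip works, you swap the in-memory database for a real one, grow the API surface, and only then chase the edge cases.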

This is the right way to build software. It validates your architecture early, surfaces integration problems before they compound, and gives you something tangible to iterate on. It is also, not coincidentally, exactly the kind of thing vibe coding is very good at.

Ask an AI to scaffold a working end-to-end prototype of a web app, a data pipeline, or an API and it will do so, quickly and coherently. The tracer bullet is arguably the strongest use case for vibe coding in existence today. The trouble starts when people mistake the tracer bullet for the finished product.

The Illusion of Completeness

AI-generated code has a particular quality that sets it apart from the half-finished scripts most of us used to prototype with: it looks done.

It's formatted correctly, it has docstrings, it compiles and runs without complaint. To the untrained eye, and sometimes even to the trained one under time pressure, it presents as production-ready code. But AI models are trained on the average, known case. They are optimistic about inputs and generous with assumptions — and real software has to survive contact with reality, where the edge cases are exactly the ones that never appear in local testing. Think of it like a house designed by AI with no staircase. Stunning render, perfect floor plan, no way to get to the second floor. It only becomes a problem when someone tries to move in.
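To make this concrete, here is a hypothetical example (my own sketch, not real model output) of code that looks done but is optimistic about its inputs:

```python
def average_order_value(orders: list[dict]) -> float:
    """Return the mean order value. Polished, typed, documented, and it runs."""
    total = sum(order["value"] for order in orders)
    return total / len(orders)

# The happy path works, so the code presents as finished:
assert average_order_value([{"value": 10.0}, {"value": 20.0}]) == 15.0

# But real traffic ships empty carts and missing fields:
#   average_order_value([])                  -> ZeroDivisionError
#   average_order_value([{"amount": 10.0}])  -> KeyError

def average_order_value_checked(orders: list[dict]) -> float:
    """The version a wary reviewer would insist on before shipping."""
    if not orders:
        return 0.0
    return sum(order.get("value", 0.0) for order in orders) / len(orders)
```

The first function is the staircase-free house: flawless until someone tries to move in.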

The illusion of completeness is not a bug in the AI. It is a predictable consequence of the way these models work. The question is whether the person using the tool knows enough to see through it.

This is where the Dunning-Kruger effect becomes relevant. The less you know about software engineering, the more complete the AI's output will appear. A junior developer or non-technical manager sees formatted, compiling, apparently functional code and concludes the job is done. A senior developer sees the same code and immediately starts asking what's missing. Competence, in this context, is the ability to recognise incompleteness. Vibe coding doesn't change that — it just raises the stakes of not having it.

Experienced Software Engineers Gain The Most

For a senior engineer, vibe coding is a force multiplier. They already have a mental model of what a correct implementation looks like, which means they can use AI to generate the scaffolding and apply their judgment to the parts that actually require it.

They review generated code not as a user reviews a document but as an adversary reviews a contract, looking for what's missing rather than what's there. They know the failure modes. They know which shortcuts are acceptable in a prototype and which will become load-bearing walls in production. They use vibe coding to compress the boring parts of the job while remaining in full control of the interesting ones.

For these engineers, AI coding tools represent a genuine step change in productivity. A controlled experiment by Peng et al. found that developers using GitHub Copilot completed tasks 55.8% faster than those without it. The leverage is real — but it compounds on top of existing skill.

Novices Gain The Least

This is the uncomfortable part of the vibe coding conversation.

A junior developer using vibe coding tools does not learn to code; they learn to prompt. These are not the same skill. Programming is, at its core, the ability to decompose a problem, reason about state, anticipate failure, and make decisions about tradeoffs. You develop this through a particular kind of struggle: writing something that doesn't work, figuring out why, and fixing it. Vibe coding short-circuits this loop entirely.

More immediately dangerous, a junior developer cannot see through the illusion of completeness. They do not yet know what they don't know. A 2024 study by Prather et al. examining novice programmers using GitHub Copilot found that the benefits were sharply uneven: students with strong metacognitive skills performed better with AI assistance, while those without were actively harmed by it. They accepted incorrect code at face value, couldn't diagnose why it failed, and in some cases ended up worse off than if they had written it themselves. A separate GitClear analysis of 153 million lines of code found that code churn — lines reverted or rewritten within two weeks of being authored — was on track to double by 2024 compared to its pre-AI baseline. The code is being written faster. It is also being thrown away faster.

The real-world examples are already piling up. Tea App, a Flutter app built by a developer with six months of experience using AI-assisted development, made headlines when it was reportedly "hacked" — except nobody actually hacked it. The Firebase storage instance had been left completely open with default settings. No authorisation policies, no access controls. Seventy-two thousand images were exposed, including 13,000 government ID photos from user verification. Meanwhile, SaaStr founder Jason Lemkin famously trusted Replit's AI agent to build a production app. It started well. Then the agent began ignoring code freeze instructions and eventually deleted the entire SaaStr production database. Months of curated data, gone. And these aren't isolated incidents — a May 2025 analysis of 1,645 apps built with Lovable found that 170 of them had vulnerabilities allowing anyone to access personal user data.

This is not an argument against junior developers using these tools. It is an argument for being honest about what they are getting, and what they aren't. Vibe coding can help a junior developer move faster. It cannot substitute for the years of judgment that determine whether moving fast is the right call.

Last Mile Delivery

There is a rule of thumb in software, as in logistics: the last mile is the hardest.

The first 80% of a software project moves quickly. The architecture is in place, the happy path works, the demo is convincing. Then you spend the remaining 80% of your time on the last 20%: the edge cases, the error handling, the security review, the performance profiling, the accessibility audit, the tests you should have written earlier. This is unglamorous, painstaking work that does not lend itself to vibing.

Vibe coding compresses the first 80% dramatically. This is useful, but it also produces a subtle accounting error: it makes you feel like you're further along than you are. A prototype that took three hours to build looks, superficially, like it should take three more hours to finish. It won't. The last mile is still the last mile, and no amount of AI-generated scaffolding changes that.

The risk is that teams, particularly those under pressure to ship, mistake the prototype for the product. They deploy the tracer bullet. And then they spend the next six months patching holes that a proper build would never have had.

Conclusions

Vibe coding is a genuine shift in how software is built. But it is not a democratisation of software engineering — the knowledge required to ship something reliable has not decreased, and the gap between a prototype and a production system has not closed.

A computer science degree doesn't teach you to write code. It teaches you to think about problems. When AI writes the code, that thinking doesn't happen. Taleb's concept of antifragility holds that some things get stronger through stress and disorder. Learning to code the hard way is antifragile. Vibe coding, by removing that friction, risks producing developers who are fast in good conditions and brittle when things go wrong.

Use the tracer bullet. Fill in the blanks. Handle your edge cases. Just don't let the AI convince you it already did.

Technology, Startups, Innovation, Data, Network Effects Jeannie Daniel

How the Best AI Companies Use Data to Build Unbeatable Moats

Data doesn't scale linearly — it compounds. Here's how companies like OpenAI and Google use virtuous data cycles and network effects to build moats that are almost impossible to compete with.

15 March 2025

One data point is worth a dollar. Two are worth two dollars. But ten million? That can build a company. Data’s value scales non-linearly - when aggregated and leveraged effectively, it fuels exponential growth.

Companies harness this power through the Virtuous Data Cycle, where data collection, analysis, and application continuously enhance products, improve user experience, and attract more users, generating even more valuable data.

In the simplest terms, the Virtuous Data Cycle works as follows:

  1. Collect user data.

  2. Store & organise it efficiently.

  3. Analyse for patterns and insights.

  4. Apply insights to improve products and services.

  5. Enhance user experience, driving engagement and retention.

  6. Repeat, compounding value at every iteration.
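The compounding is easier to see in a toy simulation. This is a deliberately crude sketch of the loop above; the constants are invented for illustration, not fitted to any real platform:

```python
def virtuous_cycle(users: int, data: int, cycles: int) -> tuple[int, int]:
    """Each pass: users generate data, data improves the product,
    and the improved product attracts and retains more users."""
    for _ in range(cycles):
        data += users                           # steps 1-3: collect and analyse
        improvement = data / (data + 50_000)    # step 4: insights, diminishing returns
        users = int(users * (1 + improvement))  # steps 5-6: better experience, more users
    return users, data

# Growth accelerates as the loop repeats: each cycle adds proportionally
# more users than the one before, because the stock of data keeps compounding.
print(virtuous_cycle(1_000, 0, 6))
```

The point of the sketch is the shape, not the numbers: the growth rate itself rises with every iteration, which is what makes the cycle "virtuous".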

Unlocking Exponential Growth with the Network Effect

Network Effects emerge when a product or service becomes more valuable as more people use it. Cities illustrate this principle: as populations grow, infrastructure, businesses, and opportunities expand, making them even more attractive. This self-reinforcing loop follows Zipf’s Law, where the largest city dominates, and the second-largest is about half its size, the third a third, and so on.
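Under the rank-size rule just described, sizes fall off as one over rank. A quick sketch, with a made-up round number for the largest city:

```python
largest = 8_000_000  # hypothetical population of the biggest city

# Zipf's law: the k-th largest city has roughly 1/k the population of the largest.
sizes = [largest // rank for rank in range(1, 6)]
# -> [8000000, 4000000, 2666666, 2000000, 1600000]
```

Plotted on a log-log scale, a sequence like this falls on a straight line, which is the signature of a power law.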

The distribution of population sizes of 276 metropolitan areas in the USA in 2000 on a log-log scale, which clearly demonstrates a Power Law distribution.

The same applies to companies. As they grow, they attract more users, talent, and investment, strengthening their market position. The Virtuous Data Cycle ties in closely with the theory of Network Effects. In data-driven businesses, this relationship becomes even more powerful because data not only enhances the user experience but also fuels monetisation and competitive advantage.

How Network Effects Strengthen the Virtuous Data Cycle

  1. More users → More data: As a platform grows, it collects richer insights on user behaviour, refining its services and improving the experience.

  2. More data → Better algorithms & personalisation: Large datasets power smarter AI and recommendation engines - think Facebook’s News Feed, YouTube’s recommendations, or TikTok’s For You Page. Better personalisation boosts engagement, reinforcing the cycle.

  3. Better experience → Higher retention & growth: Improved experiences keep users engaged, drive word-of-mouth growth, and strengthen network effects, fuelling exponential expansion.

  4. More users → Stronger market position: A massive user base creates a competitive moat - attracting top talent, greater investment opportunities, increasing efficiency, and even influencing industry policies. New entrants struggle to compete without comparable data.

  5. More engagement & data → Higher revenue & infrastructure investment: Increased engagement unlocks monetisation (ads, subscriptions, commerce). Higher revenue funds R&D and infrastructure improvements, further enhancing the experience.

  6. The cycle repeats, compounding dominance: Each loop strengthens the platform’s edge, making it harder for competitors to catch up.

Not all companies master this cycle. Twitter had network effects but never fully leveraged its data to drive advertising revenue. OpenAI capitalised on first-mover advantage to amass user feedback, but its long-term profitability remains uncertain. Facebook and Google, however, perfected both the Virtuous Data Cycle and Network Effects - turning data into dominance.

Case Study: OpenAI – From Hallucination Station to the AI Powerhouse

When OpenAI released GPT-2 in 2019, it was impressive but far from revolutionary. OpenAI had built on the transformer architecture published by Google researchers, and web demos soon let users prompt the model with questions. The model was prone to hallucinations, and the novelty wore off as early adopters grew frustrated with its limited functionality. But OpenAI kept iterating.

OpenAI’s virtuous data cycle improved its family of GPT models through research breakthroughs and ever-increasing user interaction, using feedback (thumbs-up/down) and conversations to refine responses. Industry partnerships, most notably with Microsoft, fuelled growth and adoption.

By making its platform free to use and appealing directly to consumers, OpenAI's ChatGPT reached 100 million users within two months of its November 2022 launch, making it the fastest-growing consumer application in history. OpenAI then appealed to developers and businesses with its APIs; they embedded the models into diverse applications and created a broader ecosystem. The company continued moving into the B2B space by leveraging its partnership with Microsoft to integrate GPT-4 into Bing and Microsoft 365, reinforcing its position as the default B2B AI provider. Competitors like Google faced delays in launching alternatives, allowing OpenAI to capture market share before rivals could respond. Its massive user base and accumulated data make its models increasingly difficult to rival, although some competitors now seem to be making headway.

While they mastered the network effect, their road to profitability remains uncertain. In 2024, OpenAI reportedly spent $9 billion to make $4 billion. They spent an estimated $3-4 billion on training, another $2 billion on inference (running models to answer users’ questions), $1.5 billion on salaries and employee benefits, $500 million on data-related expenses and the remainder on various other operating expenses. Their future profitability hinges on the appetite of users and companies to fork out hundreds or even thousands of dollars for a tool that some consider only marginally better than open-source competitors.

Data and Networks as a Business Model

To grow a B2C platform exponentially, you have to eliminate bottlenecks:

  • Leverage word-of-mouth: users’ testimonials and organic network effects should be your primary marketing strategy. OpenAI never had to advertise heavily to acquire users - virality did the work.

  • Prioritise feedback from super users. A few engaged users will provide the most valuable insights, guiding product development.

  • Build clean, high-quality data pipelines from the get-go: early adopters will provide the most valuable insights into your product’s strengths and weaknesses, and set the direction for the next stage of evolution. A caveat: don’t optimise too early. Use off-the-shelf tools like Google Sheets while your user base is small enough. It’s not dumb if it works.

  • Reduce onboarding friction. If sign-up takes more than a minute, you risk losing users before they even experience your product.

  • Embed data privacy compliance from day one. Regulations like GDPR and CCPA can become major roadblocks. Retrofitting compliance later is costly and erodes trust.

Network Effects: B2C vs. B2B

In B2C, network effects are straightforward - users want to be where their friends are. FOMO fuels adoption.

B2B is different. The decision-maker isn’t always the end user, meaning you need to convince multiple stakeholders - often their boss’s boss - to invest in your platform. Unlike B2C, where shared adoption creates value, B2B companies sometimes benefit when competitors don’t use the same tools they do.

However, network effects still apply in B2B. Once a tool reaches critical mass, not using it becomes a competitive disadvantage. Employees switch jobs and introduce their favourite tools to new workplaces. Over time, widely adopted products, like SEMRush for SEO or Cloudflare for cybersecurity, become industry standards.

How B2B Can Mimic B2C Growth Strategies

Some B2B markets (e.g., enterprise SaaS, healthcare, government contracts) move slower due to long sales cycles, procurement processes, and compliance requirements. However, many B2B businesses have successfully scaled by adopting B2C-style viral tactics.

1. Freemium Model → ChatGPT

The base product is free, making it easy for individuals and businesses to adopt. However, OpenAI didn’t bake in privacy from the start. Conversations may be used to train future models (part of their Virtuous Data Cycle). To unlock enterprise-grade security and compliance, businesses must upgrade.

2. Pay-to-Play Model → Instagram

Instagram is free for businesses, but organic reach is restricted. Short of going viral with a clever reel or partnering with influencers, companies must pay transaction fees (2.9% for Instagram Checkout) and invest in ads to benefit fully from the platform’s network effects.

3. Viral Adoption + Enterprise Lock-in → Figma

Figma started as a free, collaborative design tool, making it easy for designers to work together. As its adoption grew, it became an industry standard (network effect). Eventually, businesses had no choice but to integrate Figma, and pay for enterprise features like security, admin controls, and private cloud hosting.

Key Takeaway

B2B companies no longer have to rely solely on long sales cycles and enterprise deals to scale. By leveraging freemium models, network effects, and viral adoption strategies, they can accelerate growth and become indispensable in their industries, just like successful B2C platforms.

But growth fueled by network effects is only as strong as the data foundation behind it. Without structured, high-quality data, companies risk losing insights, stalling product evolution, and missing key opportunities.

To fully unlock the potential of your Virtuous Data Cycle, ask yourself:

  • Is your data structured for success? Do you have well-organised databases that enable seamless analysis and decision-making?

  • Is your company truly data-first? Does data literacy extend across teams, or is it siloed within a few roles?

  • What’s the non-monetary value of a new user? Beyond revenue, what insights do you gain from each customer, and what do you lose when they churn?

  • What drives freemium-to-paid conversion? Are you tracking the key incentives and friction points that push users to upgrade?

  • How well do you track user behaviour? Are you consistently analysing engagement patterns and using those insights to refine your product?

  • Are you gathering direct user feedback? Users tolerate mediocre products, until a competitor better meets their needs. How often do you survey your users?

  • How sticky is your product? How difficult or easy would it be for a user to switch if a better alternative emerged?

B2B companies that master both network effects and data-driven strategy create products that are not just widely adopted but deeply embedded in their industries. The companies that fail to do so leave the door open for someone else to become the next industry standard.

If you enjoyed this post, please consider supporting my writing by buying me a coffee.

Internet, Technology, ChatGPT, Search Jeannie Daniel

Taking Stock of The AI Landscape - 2 Years since ChatGPT Launched

2 years ago, ChatGPT took the world by storm as the first truly mainstream conversational agent. This blog post explores how the AI landscape has changed since, from new competitors to litigation over copyright infringement.

4 November 2024

On November 30, 2022, OpenAI launched ChatGPT, a conversational interface to GPT-3.5, a Large Language Model (LLM) tuned specifically for instruction-following. It was unlike anything else on the market — a true conversational tool that felt remarkably natural. Within five days, ChatGPT gained one million users, making it one of the fastest-growing consumer apps in history. Three months later, it had surpassed 100 million monthly active users.

In this post, I take stock of how much the world has shifted since ChatGPT’s debut.

A competitive landscape

ChatGPT’s overnight success spawned an entirely new industry of AI model providers and competitors, including Perplexity.ai, Anthropic, Google, Meta, Mistral.ai and more. It’s an arms race to produce more tokens, more cheaply, while staying at the top of the leaderboard.

Nvidia, the provider of the GPUs needed to train LLMs, has seen its share price increase 900% since ChatGPT's launch in November 2022. In contrast to most loss-making AI-first companies, Nvidia has also seen its revenue nearly 5x over the same period. As the saying goes: in a gold rush, sell shovels.

Bigger is not necessarily better

In 2022, Forbes predicted that the first 10 trillion parameter model was imminent. However, the opposite trend is happening - companies are refining models to be as small as possible while retaining performance.

Smaller models are crucial for driving adoption amongst consumers and researchers, lowering the barrier to entry in terms of memory and compute requirements. They also unlock the potential to run these models on mobile devices.

Smaller models are also cheaper and faster to run inference on. Take GPT-4o mini, which OpenAI has said is roughly in the same tier as other small AI models, such as Llama 3 8B, Claude Haiku and Gemini 1.5 Flash. GPT-4o mini achieves an impressive 82% on the MMLU benchmark and currently ranks 3rd on the Chatbot Arena leaderboard. At 15 cents per million input tokens and 60 cents per million output tokens, it is more than 60% cheaper than GPT-3.5 Turbo.
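At those prices, the unit economics are easy to check. The token counts below are an arbitrary example, not a measured workload:

```python
# GPT-4o mini prices as quoted above, converted to dollars per token.
input_price = 0.15 / 1_000_000   # $0.15 per million input tokens
output_price = 0.60 / 1_000_000  # $0.60 per million output tokens

# A single request with 10,000 input tokens and 1,000 output tokens:
cost = 10_000 * input_price + 1_000 * output_price
# 0.0015 + 0.0006 = 0.0021 dollars, roughly a fifth of a cent per request
```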

Running out of training data

As the models have gotten bigger and bigger over the years, AI researchers have been looking for new and unexplored piles of data to continue feeding the beast. When we want to quantify the amount of training data, we talk about tokens. According to OpenAI, one token generally corresponds to around 4 characters of text or on average 3/4 of a word for common English text. Different models use different tokenisers so the numbers vary, but you can expect a novel with roughly 75,000 words to consist of 100,000 tokens.
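Those rules of thumb make quick estimates easy. A small sketch using the heuristics just quoted (the second word count is mine, for illustration):

```python
# OpenAI's heuristics: ~4 characters per token, ~0.75 words per token.
def tokens_from_words(words: int) -> int:
    """Rough token estimate for common English text."""
    return round(words / 0.75)

# The novel-sized example from the text: 75,000 words is about 100,000 tokens.
assert tokens_from_words(75_000) == 100_000

# A 1,500-word blog post comes out around 2,000 tokens.
assert tokens_from_words(1_500) == 2_000
```

Different tokenisers will give different counts, so treat this as a ballpark figure rather than an exact measure.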

The sheer amount of data required to train models like ChatGPT is staggering. For reference, GPT-3 (GPT-3.5’s predecessor) was trained on approximately 300 billion tokens. The majority of the internet has already been “mined”, and new content is increasingly being placed behind paywalls. According to Anthropic’s CEO, Dario Amodei, there’s a 10% chance that we run out of sufficient data to continue scaling models.

Consequently, researchers are now focused on optimising existing data and exploring synthetic data. This might explain the shift toward smaller, more efficient models. Abundant resources are the death of creativity, and the opposite holds true: constraints breed invention.

Lawsuits and Concerns

Not everyone is thrilled about AI companies mining the internet to train Large Language Models. A number of lawsuits have been filed in which major content owners are suing AI companies for copyright infringement and unlawful use of their data.

News Corp, which owns publications like The Wall Street Journal and the New York Post, is suing Perplexity.ai for reproducing its news content without authorisation, and for falsely attributing to its publications content they never actually wrote. It is seeking penalties of $150k per violation.

The New York Times filed a lawsuit against OpenAI and Microsoft in December 2023, accusing them of infringing on its copyrighted works in training their LLMs. The lawsuit remains unresolved at the time of writing. In October 2024, the New York Times also sent a cease-and-desist to Perplexity.ai, demanding that it stop using the paper’s content without authorisation. Perplexity.ai hit back, saying it does not scrape with the intent of building Large Language Models, but rather is “indexing web pages and surfacing factual content”, and furthermore, that “no one organization owns the copyright over facts.”

In January 2023, Getty Images initiated legal proceedings against Stability AI in the English High Court. The lawsuit alleges that Stability AI scraped millions of images from Getty’s websites without consent to train its AI model, Stable Diffusion. The trial is expected to take place in summer 2025.

These lawsuits underscore the mounting tensions over how content is used for training and the industry’s “ask for forgiveness, not permission” approach. In response, more content providers, especially news outlets, are placing material behind paywalls to protect it.

Our expectations as users

Users took to ChatGPT like ducks to water. Finally we had the virtual assistant that sci-fi has been touting for decades. One who could answer all our menial questions without growing bored or annoyed. One who could structure our essays and emails, give us feedback on our writing, and coach us on our interactions with other people. One who could help us plan our next trip, and suggest recipes for the few ingredients in our fridge…

However, as we become accustomed to this ease, our expectations grow. We expect an immediate response and we get frustrated when ChatGPT misinterprets our request. We don’t want to have to go and verify its claims - we’d like a list of sources please. No hallucinations. Don’t sound so preppy. Also, we’d like it to NOT train further models on our conversations. Oh, and please be free, thanks.

We don’t have AGI yet

And probably won’t, for a long time. Let’s just leave it at that.

Conclusions

Now, you might be wondering, was this blog post written by an LLM? I can confirm it was not. I don’t like using it for writing, as I, personally, find its writing rather bland and uninspiring. I do occasionally use it to plan an outline or get feedback on my writing. I see it as a productivity multiplier and a phenomenal research tool, especially since ChatGPT integrated search results and citations.

I am a huge fan of ChatGPT and pay for the subscription. I highly recommend everyone try it out. But keep in mind its limitations: it can be inaccurate, outdated, and prone to hallucinations, and it’s wise not to share confidential data. (Also, disclaimer: I own shares in Nvidia.)

If you enjoyed this post, please consider supporting my writing by buying me a coffee.

Fake News, Investment, Memes, Influencer Marketing Jeannie Daniel

The Ronaldo Effect

On 14 June 2021, major news outlets around the world attributed a 1.6% drop in Coca-Cola’s share price to Cristiano Ronaldo’s preference for water over Coke. The truth was both a little more simple and a lot more complicated.

June 21, 2021

An interesting phenomenon happened last week: Coca-Cola saw its share price drop from its Friday 11 June close of $56.16 to about $55.30 at Monday's open on 14 June (a 1.6% drop, or a $4BN loss in market value).

Just as the markets opened in New York on Monday (9:30am ET), Ronaldo was getting ready for his soon-to-be-memorable UEFA Euro 2020 press conference in Budapest. What happened next would capture the world’s imagination.

What happened, exactly?

Let us unpack the facts:

It is worth noting that Coca-Cola is one of the major sponsors of the European Championship, which explains the strategic brand placement.


The action itself was not newsworthy — “athlete chooses water over fizzy drink”. But because it was Cristiano Ronaldo, one of the most famous athletes in the world, and because it was Coca-Cola, the largest producer of sugary beverages, the event became news. With a single word — not “Coke”, but “agua” — Ronaldo has the influence to sway millions of consumers towards healthier habits, and that scares investors and analysts alike.

Many news outlets, like The Independent, The Telegraph, The Guardian, and ESPN, attributed the 1.6% drop over the weekend to Ronaldo’s preference for water over Coke. The ESPN article was careful not to imply causality, stating that “Cristiano Ronaldo’s removal of two Coca-Cola bottles … coincided with a $4 billion drop in the market value of the American drink giant.” The news quickly spread beyond conventional sources and onto social media.

Ronaldo promoting Coke, early 2000s

It’s just such a perfect story. It feels like poetic justice: conglomerates have long used professional athletes to promote their unhealthy products to impressionable fans around the world. In fact, sponsoring tournaments like the UEFA European Championship is reportedly part of Coke’s strategy to be perceived as healthier. Then a superstar like Cristiano Ronaldo, who used to promote Coke, tells his fans to drink water instead.

You can’t trademark water, can you?

But it simply is not accurate to attribute the initial 1.6% drop in Coca-Cola’s value to Ronaldo’s stunt. The timelines do not match up.

Why was Coca-Cola really down?

14 June 2021 was the ex-dividend date for Coca-Cola. This means that if you held Coca-Cola shares up until 11 June (the previous trading day), you qualified for a dividend payout of 42c per share. Share prices usually drop by the expected dividend on the ex-dividend date, because the shares are now slightly less valuable to buyers who are not entitled to the payout. Some investors also sell their shares after the ex-dividend date, which can push the price down further. This is the more likely cause of the initial 1.6% drop in Coca-Cola's share price, which occurred before Ronaldo snubbed the sugary beverage on live TV.
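A quick back-of-the-envelope check. The 42c dividend and the two prices are from this post; everything else is illustration, not a claim about what actually drove the tape:

```python
def ex_dividend_adjusted_price(prev_close: float, dividend: float) -> float:
    """Price we would expect at the open on the ex-dividend date,
    all else being equal: the previous close minus the dividend."""
    return prev_close - dividend

prev_close = 56.16   # KO close on Friday, 11 June 2021
dividend = 0.42      # quarterly dividend per share
monday_open = 55.30  # approximate open on Monday, 14 June 2021

expected = ex_dividend_adjusted_price(prev_close, dividend)
print(f"Expected ex-dividend open: ${expected:.2f}")
print(f"Share of the drop explained by the dividend alone: "
      f"{dividend / (prev_close - monday_open):.0%}")
```

On these numbers the dividend mechanically accounts for roughly half of the weekend drop; post-ex-dividend selling and ordinary market noise can plausibly cover the rest.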

It doesn’t matter, because memes

However, facts do not matter here. As I explained before, the story is so perfect, it sells itself. It went viral. Memes were everywhere. (See a compilation of the funniest memes here.)

As the story grew bigger, it became more believable. Reputable news sources continued to report on the phenomenon of Ronaldo's snub supposedly causing Coca-Cola's share price to drop, adding fuel to the flames.

Based on Google Search Trends for Coca-Cola and Cristiano Ronaldo, I would guess that the story reached peak viral status on 16 June 2021, before interest slowly started to wane.

Google Search Trends for Coca-Cola and Cristiano Ronaldo, sourced on 21 June, 2021.

Whether Cristiano Ronaldo really moved Coca-Cola's share price with his agua stunt did not matter. News sources never paused to verify the claim: everyone else was reporting it, so it must be true. It was reported so widely that it was accepted as fact, which demonstrates two sad realities: the truth is whatever the majority believes it is, and journalists are no longer as rigorous in their fact-checking.

As this article by Forbes described it, we live in a post-truth world: where objective facts are less influential in shaping public opinion than appeals to emotion and personal belief.

The lie becomes the truth

The Ronaldo Effect was so profound that the fake news became a self-fulfilling prophecy: the following week, Coca-Cola's share price continued to decline, closing at $53.77 on Friday, 18 June. That is a 4.3% decline in a single week, while the S&P 500 was down only 1.87% over the same period. In fact, Friday, 18 June saw the largest single-day trading volume of the past month.

The most plausible explanation is that the story of "Ronaldo snubbing Coke and wiping 1.6% off its value" made investors panic, big time. Fear in the stock market is like a snowball: panic selling is contagious. As the price drops, more investors start to doubt the future of Coca-Cola, given the impressive impact a single sports star could supposedly have on the share price of such a conglomerate. They, in turn, sell as well, and the snowball keeps growing.

Conclusion

In the end, it did not matter whether Ronaldo snubbing Coke during a Euro 2020 press conference in Budapest caused the value of Coca-Cola to drop. Enough people believed it did, and it became the truth.

Full disclaimer: I have shares in Coca-Cola.

NOTES: I describe the impact of a lie so big it became the truth as the Ronaldo Effect. However, the term has also been used to describe other phenomena, including Cristiano Ronaldo's personal football record, and his impact on Juventus' share price (it more than doubled after he joined the team).

Because this blog is based on my personal views and experiences, it does not constitute financial advice.

If you enjoyed this blog post, please consider supporting my writing by buying me a coffee.

Read More
Stock Market, Reddit, Finance, Trading, Economics Jeannie Daniel

A Pre-mortem on the Meme Stock Bubble

Fee-free trading platforms like Robinhood made it very easy for novice investors to start their trading journey. Investors began coordinating their efforts on Wall Street Bets, engineered short squeezes on beaten-down stocks like GameStop, and the rest is history. This article explores the Meme Stock Bubble and the factors that fuelled it.

March 9, 2021

Disclaimer: this article does not constitute financial advice.

What is a speculative bubble?

A speculative bubble is a spike in asset values within a particular industry, commodity, or asset class to unsubstantiated levels, fuelled by irrational speculative activity that is not supported by the fundamentals.

The tweets from this era will one day hang in the Museum of Speculative Bubbles.

With fundamental analysis, we try to determine a stock's real or "fair market" value using various macroeconomic factors, the company's revenue streams, balance sheet, debt obligations, and industry position. The goal is for an (intelligent) investor to determine whether a company is overvalued or undervalued at its current market price. The underlying belief of fundamental analysis is that the market will eventually correct itself and return to the "fair market" value, because sooner or later investors will realise a stock is mispriced and behave accordingly.
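One of the simplest fundamental-analysis tools is a discounted-cash-flow (DCF) model. The sketch below uses entirely made-up numbers and is only meant to show the mechanics, not to serve as a valuation recipe:

```python
def discounted_cash_flow(cash_flows, discount_rate):
    """Present value of a series of future yearly cash flows:
    each flow is discounted back by (1 + rate) ** years."""
    return sum(cf / (1 + discount_rate) ** year
               for year, cf in enumerate(cash_flows, start=1))

# Hypothetical company: $100M of free cash flow, growing 5% a year for 5 years,
# discounted at an (assumed) 8% required rate of return.
flows = [100 * 1.05 ** y for y in range(5)]
intrinsic_value = discounted_cash_flow(flows, discount_rate=0.08)
print(f"Present value of the next 5 years of cash flow: ${intrinsic_value:.1f}M")
```

If the resulting value per share comes out well above the market price, a fundamental analyst would call the stock undervalued; well below, overvalued.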

Things you don’t see during a bear market

With speculative bubbles, investors buy stocks because they see the price going up, not because they necessarily believe the underlying business is sound. The euphoria of a speculative bubble tends to make investors more irrational and less risk-averse. The belief that an asset class will always defy gravity drives investors to buy more of already-overvalued stocks, thinking there will always be someone willing to buy at a higher price.

Even the media, the gatekeepers of truth, fall for the euphoria. A sign of the times is a graphic depicting "Sneakers as an asset class", which I discovered while scouring Twitter. The tweet simply said, "headlines you don't see at bottoms". Here, "bottoms" refers to periods of negative sentiment in the stock market, where growth is flat or heading downwards. No one spoke about gravity-defying asset classes after the DotCom Bubble, that's for certain.

Meme Stock Mania

Speculative bubbles are notoriously hard to recognise while they are happening, but seem obvious after they burst. This one, however, has been glaringly obvious. It's been meme-worthy, to say the least. An early warning sign was the rapid rise of Tesla's share price, and the cult following that its CEO, Elon Musk, started to amass. Between January 2020 and January 2021, Tesla's share price rose 750%. At one point, its market cap exceeded the combined market cap of the world's top nine car manufacturers.

The impact of Musk’s tweets on the stock market revealed how truly bizarre the current investor landscape had become. On May 1, 2020, Elon Musk tweeted the following:

This resulted in Tesla’s share price closing down 10.3% on the day. He pulled a similar stunt in June 2020, with similar effect. Back in August 2018, Musk had tweeted about Tesla "going private, funding secured" at $420 a share. That was apparently just a joke (in meme culture, "420" refers to cannabis or the act of smoking it), but it cost him his role as chairperson.

I highlight Tesla because it seems, to me at least, to have been the spark of this meme stock bubble. Or at least the smoke telling you there's a fire somewhere. Meme stocks are called that because they exist basically for the LOLs: a meme that has gone viral.

How did we get here?

Low yields in money markets have forced investors to look for alternative, riskier instruments to protect their money from inflation. That, combined with the extra disposable income people had from sitting at home for the better part of a year, plus growing frustration over widespread job losses, mounting debt, and grim economic outlooks, saw more and more individuals become "retail investors". Fee-free trading platforms like Robinhood made it very easy for novices to start trading. Suddenly, the stock market was being pumped full of money from investors who lived by the most basic form of technical analysis:

Buy cause it’s going up, sell cause it’s going down.

Retail investors piled into growth stocks like Apple and Amazon, each of which has grown approximately 1,200% over the past ten years. Valid thinking, given that these two highly innovative companies have healthy revenue streams and considerable market share, although the sheer force of their collective buying pushed these stocks into overvalued territory. Apple's share price more than doubled in less than a year, from a low of $56.87 on March 21, 2020 to a peak of $145.09 on January 15, 2021. Which is fine. It's a free market, right?

Emboldened by zero-commission trading, less rigorous margin requirements and an app that gamifies trading, meme stock investors evolved their strategy:

Buy because we want to make it go up.

Now, usually small-time investors don't have much influence on market movements, because their small, individual trades are uncoordinated and largely cancel each other out. But what if millions of traders around the world started sharing notes and coordinating attacks? That is exactly what happened with meme stocks. Retail investors started buying up shares of not-so-fundamentally-sound companies like AMC Entertainment Holdings, BlackBerry, GameStop, and Nokia. What many of these companies have in common is anaemic financials, declining market share, and a lack of innovation in the new economy. Wall Street knows this, and had rightly shorted many of these meme stocks. r/WallStreetBets knew that Wall Street had shorted them, and decided, for no particular reason, to take on Wall Street.

Take GameStop, an American retailer that specialises in video games, consumer electronics, and gaming merchandise. At one point, it was one of the most heavily shorted stocks on the US market. r/WallStreetBets, with literally millions of subscribers, took notice, with a little help from Michael Burry (the investor portrayed in The Big Short).

On 1 December 2020, GameStop opened at $19.96. By 28 January, it had peaked at a high of $483.00, a more than 2,300% increase. This Godzilla-sized rally caused a short squeeze for those who were bearish on GameStop's future (Big Money hedge funds and investment firms). From Investopedia, because I can't explain it better myself:

A short squeeze occurs when a stock or other asset jumps sharply higher, forcing traders who had bet that its price would fall, to buy it in order to forestall even greater losses. Their scramble to buy only adds to the upward pressure on the stock’s price.
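To see why a squeeze feeds on itself, consider a hypothetical short seller caught in the GameStop rally. The prices are from this post; the position size is made up:

```python
def short_position_pnl(entry_price: float, current_price: float, shares: int) -> float:
    """Profit (or loss) on a short position: you sold borrowed shares at
    entry_price and must eventually buy them back at current_price."""
    return (entry_price - current_price) * shares

shares = 1_000
entry = 19.96    # GameStop's open on 1 December 2020
peak = 483.00    # intraday high on 28 January 2021

loss = short_position_pnl(entry, peak, shares)
print(f"P&L at the peak: ${loss:,.0f}")   # a paper loss of $463,040
```

Unlike a long position, the potential loss on a short is unbounded, which is why squeezed short sellers are forced to buy back shares, pushing the price up further still.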

Even Elon Musk got in on the fun, tweeting a link to the r/WallStreetBets thread on GameStop. This brought the GameStop saga to peak exposure, and resulted in a 100% increase in the share price during after-hours trading.

The tweet that made r/wallstreetbets go mainstream

The story does not end there. The stock might have peaked even higher, had Robinhood and other easy-trading platforms not restricted trading in several meme stocks, including GameStop, AMC Entertainment, and Nokia. A congressional hearing on the events began on 18 February.

No one really understands why WallStreetBets decided to take on some of the most powerful financial institutions in the world. Was it a sophisticated market-manipulation strategy, a class war on institutional investors, a long-term troll, or just a joyride to the moon for the LOLs? "It certainly started as a meme. That's how WallStreetBets operates," Jaime Rogozinski, who created the subreddit, told Wired.

Nevertheless, this reckless behaviour is indicative of a much larger problem. The stock market has, indeed, become violently dislocated from reality. We are certainly in the endgame now. One need only look at the charts to understand how nostalgically bizarre things have become. From its low of $6,771 to its high of $13,879 this past year, the Nasdaq 100 essentially doubled in value. The last time the Nasdaq did that was during the DotCom Bubble.

The Waiting Game

For the past year, believers in fundamental analysis kept looking at the price charts, saying, “It’s gonna come down any day now…”

Google Trends serves as a good proxy for what the general population is thinking. It has been used to successfully track virus outbreaks in different countries (because people search for symptoms), and it can also be used to gauge market sentiment. Right now, the world seems desperately concerned with finding "safe" investments to put money into, should a crash be coming. Search volume for the term "safe investment" is at its highest in two years.

Google search volumes over time for “safe investment”. Created on 03/03/2021 by Author.

Dala what you must, but this graph tells me that investor fear is at an all-time high. And with good reason: all major stocks, including Apple and Amazon, have seen massive declines over the last month. Tesla's epic bull run has given way to a 25% decline over the past month. Even Michael Burry is short Tesla. Coincidentally, he was long GameStop before the whole short-squeeze saga.

Every bubble in history has popped, spectacularly and, more importantly, eventually. We haven't yet seen the crazy single-day double-digit declines that ended many previous speculative bubbles. But after much hoo-ha, people seem to have quietly accepted that the stock market won't keep going up like it used to. Perhaps the madness has finally come to an end.

Closing remarks

For a long time, I have been monitoring the stock market and wondering whether or not we are in a bubble. Bitcoin's rise and then the GameStop mania were the final pieces of the puzzle. I was in quite a rush to finish this article because I was terrified there would be a massive crash, and then I would have had to think of a new title! For those interested, a pre-mortem is quite literally the opposite of a post-mortem.

If you enjoyed this blog post, please consider supporting my writing by buying me a coffee.

Read More
NLP, Machine Learning, Language Modeling, Data Science Jeannie Daniel

What In The Corpus is a Word Embedding?

Word embeddings model language by constructing dense vector representations of words that capture meaning and context, for use in downstream tasks such as question answering and sentiment analysis. This article explores the challenges of modelling language, as well as the evolution of word embeddings, from Word2Vec to BERT.

January 25, 2021

Embeddings visualised. Picture sourced from Bio-inspired Structure Identification in Language Embeddings

While computers are very good at crunching numbers and executing logical commands, they still struggle with the nuances of human language! Word embeddings aim to bridge that gap by constructing dense vector representations of words that capture meaning and context, for use in downstream tasks such as question answering and sentiment analysis. The study of word embedding techniques forms part of the field of computational linguistics. In this article, we explore the challenges of modelling language, as well as the evolution of word embeddings, from Word2Vec to BERT.

Computational Linguistics

Prior to the recent renewed interest in neural networks, computational linguistics relied heavily on linguistic theory, hand-crafted rules, and count-based techniques. Today, the most advanced computational linguistic models are developed by combining large annotated corpora and deep learning models, often in the absence of linguistic theory and hand-crafted features. In this article, I aim to explain the fundamentals and different techniques of word embeddings, whilst keeping the jargon to a minimum and the calculus in the negative.

“What’s a corpus/corpora?” you may ask. Very good question. In linguistics, a corpus (plural: corpora) is a large, structured collection of texts in a language. Typically, corpora are monolingual (of a uniform language) collections of news articles, novels, movie dialogues, and so on.

Rare and unseen words

The frequency distribution of words in large corpora follows Zipf's Law: given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. If a word like "the" has rank 1 (meaning it is the most frequent word) and a word like "of" has rank 2, then "of" occurs roughly half as often as "the". A word of rank 3 occurs roughly a third as often, and so forth.

One implication of Zipf's law is that a word with rank 1,000 would occur roughly once for every 1,000 occurrences of the rank-1 word like "the". And it gets worse as your vocabulary grows! Zipf's law means that some words are so rare they might occur only a handful of times in a training set of several thousand texts. Or worse, a rare word is absent from your training data but present in your test data, a so-called out-of-vocabulary (OOV) word. Researchers therefore need strategies to upsample the infrequent words (the long tail of the distribution) and to build models that are robust to unseen words. These strategies include upsampling, n-gram embeddings (like FastText), and character-level embeddings, such as ELMo.
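Here is a quick sketch of what Zipf's law predicts. Note this is the idealised distribution; real corpora only follow it approximately:

```python
def zipf_expected_counts(top_count: int, vocab_size: int) -> list[int]:
    """Expected count of the word at each rank, if counts followed
    Zipf's law exactly: count(rank r) = count(rank 1) / r."""
    return [round(top_count / rank) for rank in range(1, vocab_size + 1)]

# If the most frequent word appears 10,000 times in our corpus...
counts = zipf_expected_counts(top_count=10_000, vocab_size=1_000)
print(counts[:5])    # [10000, 5000, 3333, 2500, 2000]
print(counts[-1])    # the rank-1000 word is expected only 10 times
```

The long tail is immediately visible: the vast majority of vocabulary items appear only a handful of times, which is exactly the data-sparsity problem OOV-robust models try to address.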

Word2Vec, the OG

Everyone who has tried their hand at NLP is probably familiar with Word2Vec, introduced by Google researchers in 2013. Word2Vec comes in two distinct architectures: Continuous Bag-of-Words (CBOW) and Skip-gram. Both produce a word embedding space where similar words cluster together, but they differ slightly in architecture and training.

With CBOW, we train the model to predict a target word w given a context vector. Skip-gram is the inverse: we train the model to predict the context given the target word w. The context vector is just a bag-of-words representation of the words in the immediate surroundings of the target word w, as shown in the graphic above. Skip-gram is more computationally expensive than CBOW, so distant context words are down-sampled to give them less weight. To address the imbalance between rare and common words, the authors also aggressively sub-sampled the corpus, with the probability of discarding a word increasing with its frequency.
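To make the Skip-gram objective concrete, here is a minimal sketch (pure Python, no training) of how (target, context) pairs are generated from a sentence with a symmetric window:

```python
def skipgram_pairs(tokens: list[str], window: int = 2) -> list[tuple[str, str]]:
    """Generate (target, context) training pairs for Skip-gram."""
    pairs = []
    for i, target in enumerate(tokens):
        # every word within `window` positions of the target is a context word
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat sat on the mat".split()
print(skipgram_pairs(sentence, window=1)[:4])
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

The model then learns embeddings such that a target word's vector is predictive of its context words; CBOW simply swaps the roles of target and context.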

The researchers also demonstrated the remarkable compositionality of Word2Vec embeddings: one can perform vector arithmetic with them and find that "king" − "man" + "woman" ≈ "queen". Word2Vec opened the door to the world of word embeddings, and sparked a decade in which language processing needed bigger and badder machines, and relied less and less on linguistic knowledge.

FastText

The embedding techniques discussed so far represent each word in the vocabulary with a distinct vector, ignoring the internal structure of words. FastText extends the Skip-gram model by also taking subword information into account. The model is ideal for languages in which grammatical relations like subject, predicate, and object are reflected by inflections (words are morphed to express changes in meaning or grammatical relation), rather than by word order or added particles.

The reason is that FastText learns vectors for character n-grams (essentially subwords; we will get to that in a moment). Words are then represented as the sum of the vectors of their n-grams. The subwords are created as follows:

  • each word is broken up into a set of character n-grams, with special boundary symbols at the beginning and end of each word,

  • the original word is also retained in the set,

  • for example, for n-grams of size 3 and the word “there”, we have the following n-grams:

<th, the, her, ere, re> and the special feature <there>.

There is a clear distinction between the feature <the> and the word the. This simple approach enables sharing representations across the vocabulary, handles rare words better, and can even handle unseen words (a property the previous models lacked). It trains fast and requires no preprocessing of the words nor any prior knowledge of the language. The authors performed a qualitative analysis and showed that their technique outperforms models that do not take subword information into account.
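The n-gram construction above can be sketched in a few lines (the boundary symbols < and > follow the FastText paper):

```python
def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Character n-grams of a word, with boundary symbols,
    plus the whole word itself as a special feature."""
    padded = f"<{word}>"
    grams = {padded[i:i + n] for i in range(len(padded) - n + 1)}
    grams.add(padded)  # retain the full word as its own feature
    return grams

print(sorted(char_ngrams("there")))
# ['<th', '<there>', 'ere', 'her', 're>', 'the']
```

A word's FastText vector is then simply the sum of the vectors of these features, which is why an unseen word still gets a sensible representation: its n-grams have almost certainly been seen before.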

ELMo

ELMo is an NLP framework developed at the Allen Institute for AI in 2018. It constructs deep, contextualised word embeddings that can deal with unseen words, syntax and semantics, as well as polysemy (words taking on multiple meanings depending on context). ELMo extracts word vectors from the internal states of a pre-trained two-layer bidirectional LSTM language model. Instead of learning representations for word-level tokens, ELMo builds its representations from character-level inputs, which allows it to deal effectively with out-of-vocabulary words during testing and inference.

The inner workings of a biLSTM. Sourced from https://www.analyticsvidhya.com/

The architecture consists of two biLSTM layers stacked together, each with a forward and a backward pass. To construct character-based input representations, ELMo applies character-level convolutions to the input words. The forward pass encodes the context of the sentence up to and including a given word; the backward pass encodes the context after and including that same word. The forward and backward hidden representations are concatenated and fed into the second biLSTM layer. The final ELMo representation is a weighted sum of the raw word vectors and the concatenated forward-backward hidden representations of the biLSTM layers.
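The final combination step can be sketched as follows; this is a toy version of the ELMo formula (a softmax-weighted sum over layers, scaled by γ), and the vectors here are made up for illustration:

```python
import math

def elmo_combine(layer_vectors: list[list[float]],
                 layer_scores: list[float], gamma: float = 1.0) -> list[float]:
    """Weighted sum of per-layer representations of one token.
    layer_scores are softmax-normalised into mixing weights s_j."""
    exps = [math.exp(s) for s in layer_scores]
    weights = [e / sum(exps) for e in exps]
    dim = len(layer_vectors[0])
    return [gamma * sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
            for d in range(dim)]

# Toy token: its representation at 3 layers (raw word vector + 2 biLSTM layers)
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(elmo_combine(layers, layer_scores=[0.0, 0.0, 0.0]))
# equal scores mean equal weights of 1/3 per layer
```

In the real model, the scores and γ are learned per downstream task, letting each task choose its own mix of lower (more syntactic) and higher (more semantic) layers.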

What made ELMo so revolutionary at the time (yes, 2018 was a long time ago in NLP years) is that each word embedding encoded the context of the sentence, and that embeddings were functions of their characters. ELMo thus simultaneously addressed the challenges posed by polysemy and unseen words. Besides English, pre-trained ELMo embeddings are available in Portuguese, Japanese, German, and Basque. The pre-trained embeddings can be used as-is in downstream tasks, or further tuned on domain-specific data.

BERT

Google researchers introduced BERT in 2018, a few months after ELMo. At the time, it smashed records on 11 benchmark NLP tasks, including the GLUE suite (which consists of nine tasks), SQuAD, and SWAG. (Yes, the names are funky; NLP is full of really fun people!)

BERT stands for Bidirectional Encoder Representations from Transformers. Unsurprisingly, BERT makes use of the encoder part of the Transformer architecture, and is pre-trained once in a pseudo-supervised fashion (more on that later) on the unlabelled BooksCorpus (800M words) and the unlabelled English Wikipedia (2,500M words). The pre-trained BERT can then be fine-tuned by adding an additional output (classification) layer for use in various NLP tasks.

If you are unfamiliar with the Transformer (and attention mechanisms), check out this article I wrote on the topic. In their memorably titled paper "Attention Is All You Need", Google Brain researchers introduced the Transformer, a new type of encoder-decoder model that relies solely on attention to draw global dependencies between the input and output sequences. The model injects information about the relative and absolute positions of tokens using positional encodings. In BERT, a token's input representation is calculated as the sum of its token embedding, segment embedding, and positional encoding.

In essence, BERT consists of stacked Transformer encoder layers. The paper introduces two variants: BERT Base, with 12 encoder layers, and BERT Large, with 24. Like ELMo, BERT processes sequences bidirectionally, though it captures left and right context jointly rather than in two separate passes. Each encoder layer applies self-attention, passes its outputs through a feed-forward network, and hands them on to the next encoder.

Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/

BERT is pretrained in a pseudo-supervised fashion using two tasks:

  • Masked Language Modelling

  • Next Sentence Prediction

I say pseudo-supervised because neural networks inherently need supervision to learn; to train the model, we transform an unsupervised problem into a supervised one, which is easy to do with text, since text is simply a sequence of words. Remember that bidirectional means that the context of a word is a function of both the words preceding it and the words following it. Naively combining self-attention with bidirectional processing would make the language model all-seeing (every word could attend to itself), which makes learning anything useful difficult. Along comes Masked Language Modelling, which exploits the sequential nature of text and assumes that a word can be predicted from the words surrounding it (its context). For this training task, 15% of the tokens are selected for prediction: most are replaced with a [MASK] token, some with a random token, and some are left unchanged.
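Here is a minimal sketch of that masking scheme, using the 80/10/10 split from the BERT paper. Real implementations operate on subword tokens and a full vocabulary; this toy version draws random replacements from the sequence itself:

```python
import random

def mask_tokens(tokens: list[str], mask_rate: float = 0.15,
                seed: int = 0) -> tuple[list[str], list[int]]:
    """BERT-style MLM corruption: select mask_rate of positions; of those,
    80% become [MASK], 10% a random token, 10% stay unchanged.
    Returns the corrupted tokens and the selected positions (the labels)."""
    rng = random.Random(seed)
    n_select = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_select))
    corrupted = list(tokens)
    for pos in positions:
        roll = rng.random()
        if roll < 0.8:
            corrupted[pos] = "[MASK]"
        elif roll < 0.9:
            corrupted[pos] = rng.choice(tokens)  # substitute a random token
        # else: leave the token as-is; the model must still predict it
    return corrupted, positions

tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, positions = mask_tokens(tokens)
print(corrupted, positions)
```

The model only computes its loss at the selected positions, so it learns to reconstruct words from bidirectional context without ever "seeing" the answer.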

Sourced from researchgate.net

The MLM task helps the model learn the relationships between different words. The Next Sentence Prediction (NSP) task helps the model learn the relationships between different sentences. NSP is structured as a binary classification task: given Sentence A and Sentence B, does B follow A, or is it just a random sentence?

These two training tasks are enough to learn really complex language structure. In fact, a paper titled "What does BERT learn about the structure of language?" demonstrated how successive layers of BERT capture increasingly granular levels of language: the bottom layers capture phrase-level information, the middle layers syntactic information, and the top layers semantic information.

Since its inception, BERT has inspired many state-of-the-art NLP architectures, training approaches, and language models, including Google's Transformer-XL, OpenAI's GPT-3, XLNet, RoBERTa, and Multilingual BERT. Its universal approach to language understanding means it can be fine-tuned with minimal effort for a variety of NLP tasks, including question answering, sentiment analysis, sentence-pair classification, and named entity recognition.

Conclusion

While word embeddings are very useful and easy to compile from a corpus of texts, they are not magic unicorns. We highlighted the fact that many word embeddings struggle with word-sense ambiguity and out-of-vocabulary words. And although it is relatively easy to infer semantic relatedness between words based on proximity, it is much harder to derive specific relationship types from word embeddings alone: even though puppy and dog may lie close together, knowing that a puppy is a juvenile dog is much more challenging. Word embeddings have also been shown to reflect the ethnic and gender biases present in the texts they are trained on.

Word embeddings are truly remarkable in their ability to learn very complex language structure when trained on large amounts of data. To the untrained eye (or untrained 4IR manager), it may even seem magical, which is exactly why it is important to highlight these limitations and keep them in mind when we use word embeddings.

If you enjoyed this blog post, please consider supporting my writing by buying me a coffee.

Read More

Transformers: Age of Attention

In their paper "Attention Is All You Need", Google Brain researchers introduced the Transformer, an encoder-decoder model that relies solely on attention for sequence-to-sequence modelling. This article introduces seq2seq models and attention mechanisms, and unpacks how the Transformer's encoders and decoders work.

November 26, 2020

In their memorably titled paper "Attention Is All You Need", Google Brain researchers introduced the Transformer, a new type of encoder-decoder model that relies solely on attention for sequence-to-sequence modelling. Before the Transformer, attention was used to improve the performance of models like Recurrent Neural Networks (RNNs) on sequential data.

Now, this is a lot. You might be wondering, “What the hell is sequence-to-sequence modelling, Jeanne?” You may also suffer from an attention deficiency, so allow me to introduce you to…

Seq2seq models

Sequence-to-sequence models (or seq2seq, for shorthand) are a class of machine learning models that translate an input sequence into an output sequence. Typically, a seq2seq model consists of two distinct components: an encoder and a decoder. The encoder constructs a fixed-length latent vector (or context vector) from the input sequence; the decoder uses that latent vector to construct the output, or target, sequence. Both input and output sequences can be of variable length.

The applications of seq2seq models extend well beyond machine translation. The encoder-decoder architecture can be used for question answering, mapping speech to text and vice versa, text summarisation, image captioning, and learning contextualised word embeddings.

Unlike (context-independent) word embeddings, which have static representations, contextualized embeddings have dynamic representations that are sensitive to their context. Sourced from Researchgate

The novel RNN Encoder-Decoder model was introduced in 2014 by renowned researcher Kyunghyun Cho and his team, to perform statistical machine translation. It uses the final hidden representation of the encoder RNN as the context vector for the decoder RNN. This approach works fine for short sequences, but fails to accurately encode longer ones. In addition, RNNs suffer from vanishing gradients and are slow to train. Adding an attention mechanism to the RNN encoder-decoder architecture improves its ability to model long-term dependencies.

Attention? Attention!

Attention is a function of the encoder's hidden states that helps the decoder decide which parts of the input sequence matter most for generating the next output token. Attention allows the decoder to focus on different parts of the input sequence at every step of output generation, which means dependencies can be identified and modelled regardless of their distance within the sequences.

Attention applied to an input sequence to assist in machine translation

When attention is added to the RNN encoder-decoder, all of the encoder's hidden representations are used during decoding. At each time step, the attention mechanism computes a weight for each encoder hidden state, reflecting how important that part of the input is for generating the next output token. The decoder can thus "see" the entire input sequence and decide which elements to pay attention to when generating each output token.

There are two major types of attention: Bahdanau attention and Luong attention.

Bahdanau attention works by aligning the decoder with the relevant parts of the input sentence. The alignment scores are a function of the decoder's hidden state from the previous time step and the encoder outputs. The attention weights are obtained by applying a softmax to the alignment scores. The encoder outputs are then multiplied by their attention weights and summed to form the context vector.

The context vector of Luong Attention is calculated similarly to Bahdanau’s Attention. The key differences between the two are as follows:

  • with Luong attention, the context vector is only used after the RNN has produced the output for that time step,

  • the ways in which the alignment scores are calculated, and

  • the context vector is concatenated with the decoder's hidden state to produce a new output.

There are three ways to compute the alignment scores: dot-product (the dot product of the encoder and decoder hidden states), general (the same dot product, but with the encoder state first multiplied by a learned weight matrix), and concat (a learned function applied to the concatenation of the decoder and encoder hidden states). Subsequently, the context vector, together with the previous output, determines the new hidden state of the decoder.
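The three scoring variants can be sketched in a few lines of NumPy. The helper below is hypothetical (the name and shapes are my own, chosen for illustration); note that the weight matrix W has a different shape for the general and concat variants:

```python
import numpy as np

def luong_scores(dec_state, enc_outputs, method="dot", W=None, v=None):
    """Luong-style alignment scores for one decoder step.

    dec_state:   (d,)   current decoder hidden state
    enc_outputs: (T, d) encoder hidden states
    W has shape (d, d) for "general" and (k, 2d) for "concat".
    """
    if method == "dot":      # score(s, h_t) = s . h_t
        return enc_outputs @ dec_state
    if method == "general":  # score(s, h_t) = s . (W h_t)
        return (enc_outputs @ W.T) @ dec_state
    if method == "concat":   # score(s, h_t) = v . tanh(W [s; h_t])
        T = enc_outputs.shape[0]
        stacked = np.concatenate([np.tile(dec_state, (T, 1)), enc_outputs], axis=1)
        return np.tanh(stacked @ W.T) @ v
    raise ValueError(f"unknown method: {method}")
```

With W set to the identity, the general variant collapses to the plain dot product, which is a quick sanity check on the implementation.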

The Transformer

Google Brain researchers turned the tables on the NLP community in their 2017 paper "Attention Is All You Need", showing that sequence modelling can be done using attention alone. In this paper, they introduce the Transformer, a simple architecture that relies solely on attention mechanisms to draw global dependencies between input and output sequences.

The Transformer consists of two parts: an encoding component and a decoding component. Additionally, positional encodings are injected to give information about the absolute and relative positions of the tokens.
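The positional encodings used in the paper are fixed sinusoids of varying frequency. A minimal NumPy sketch (assuming an even d_model):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings (d_model assumed even).

    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model // 2)
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even indices: sine
    pe[:, 1::2] = np.cos(angles)               # odd indices: cosine
    return pe
```

The resulting matrix is simply added to the token embeddings, giving the otherwise order-agnostic attention layers information about token positions.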

The Encoder-Decoder Model

The encoding component is a stack of 6 encoders. Although architecturally identical, the encoders do not share weights. Each encoder can be deconstructed into a multi-head attention part and a fully connected feed-forward network.

Similarly, the decoding component is a stack of 6 identical decoders, whose architecture resembles the encoder's, except that its multi-head self-attention is masked to ensure that each output can only depend on the previously generated tokens. The decoder also has an extra multi-head attention component that attends over the output of the encoding component, giving it access to the input sequence during decoding.

Multi-head attention

Each multi-head attention component consists of several attention layers running in parallel. The Transformer uses scaled dot-product attention, which is very similar to Luong's dot-product attention, except that the scores are scaled by the square root of the key dimension. Multi-head attention computes scaled dot-product attention in n different, independently initialized representation subspaces and aggregates the results. Because the heads are independent, they can be computed in parallel across devices.
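A minimal NumPy sketch of scaled dot-product attention and a toy multi-head wrapper follows. The random per-head projections stand in for learned weight matrices, and the final output projection used in the real Transformer is omitted for brevity:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block attention to masked positions
    return softmax(scores) @ V

def multi_head_attention(X, heads, d_head, rng):
    """Toy multi-head self-attention over X of shape (T, d_model)."""
    outputs = []
    for _ in range(heads):
        # Per-head projections; random here in place of learned weights
        Wq = rng.standard_normal((X.shape[-1], d_head))
        Wk = rng.standard_normal((X.shape[-1], d_head))
        Wv = rng.standard_normal((X.shape[-1], d_head))
        outputs.append(scaled_dot_product_attention(X @ Wq, X @ Wk, X @ Wv))
    # Heads are concatenated; the real Transformer adds a final linear projection
    return np.concatenate(outputs, axis=-1)
```

When all query-key scores are equal, the softmax weights become uniform and the output reduces to the mean of the value vectors, which makes the behaviour easy to verify by hand.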

Why Transformers?

Much like Recurrent Neural Networks (RNNs), Transformers can process sequential information, but far more efficiently. The Transformer outperforms RNNs in both accuracy and computational efficiency. Its architecture contains no recurrence or convolutions, so training can be parallelized across multiple processing units. It has achieved state-of-the-art performance on several tasks and, even more importantly, has been found to generalize very well to other NLP tasks, even with limited data.

In conclusion

The Transformer has taken the NLP community by storm, earning a place among the ranks of Word2Vec and LSTMs. Today, some of the state-of-the-art language models, such as BERT and GPT-3, are based on the Transformer architecture. In this blog post, we traced the evolution of sequence-to-sequence modelling, from RNNs, to RNNs with attention, to relying solely on attention with the Transformer.

If you enjoyed this blog post, please consider supporting my writing by buying me a coffee.

Read More
Entrepreneurship, Startups, Amazon, Technology, Business Jeannie Daniel

Amazon’s Meteoric Rise

This article explores the key components of Amazon’s business model that led to them becoming a trillion dollar company, as well as some of the criticism that has been levelled against it.

September 12, 2020

Today, Amazon is one of the most powerful companies in the world. It has revolutionised online shopping by making every conceivable consumable available through its online marketplace and mastering a seamless, consistently reliable shopping experience for its customers. Amazon's latest financial report states that it made $75.5 billion in sales revenue in the first quarter of 2020. Every day, Amazon ships approximately 1.6 million packages worldwide. Like its namesake, the Amazon River, it became the retail store that dwarfed all others.

In this article, I explore the key components of Amazon’s business model that led to them becoming a trillion dollar company, as well as some of the criticism that has been levelled against it.

Humble beginnings

Amazon was started in Jeff Bezos' garage in 1994, in Seattle, Washington. Back in the 1990s, retailers only had to pay sales taxes on purchases made in the states where they operated. Bezos thought that having Amazon operate out of a heavily populated state like California or New York would significantly increase its tax liability, so he settled on sleepy Seattle.

The name “Amazon” was chosen for a number of reasons:

  • it started with an A, so it would always be first in any alphabetized list,

  • it sounded different and ‘exotic’ to Jeff,

  • and the Amazon River was so big that it dwarfed other rivers.

Bezos planned to build something that would one day become the biggest online retailer in the world, dwarfing and ultimately swallowing other retailers. It was going to be "The Everything Store" — a marketplace not confined by a physical structure, or by physical limits on how much it could stock or process. To achieve this goal, Bezos realized that Amazon would have to start small, and so it began by selling books online.

Jeff Bezos in his tiny office in 1999, with a spray-painted Amazon.com logo

The website was launched on 16 July 1995 and sold only a selection of books. Within one month of its launch, Amazon had sold books to people in all 50 states and in 45 different countries. Bezos thought the most promising products to sell online included, among others, CDs, computer hardware, computer software, videos, and books. The concept of online shopping with a reliable delivery window appealed to many people who lived in rural areas or were frustrated with stores that did not stock rare or unpopular items, like chunky textbooks.

It was a crazy time for the young startup. New employees were interviewed by Bezos himself, and were expected to work 60 hour weeks. Amazon’s customer base was growing so fast that the gap between the number of orders placed and the number of orders shipped was widening. This issue came to light right before the 1998 holiday season. This led to Operation Save Santa, which was a call for all hands on deck — employees from all divisions pitched in to help with the packaging of parcels, doing night shifts and bringing their family and friends to help out. That chaotic holiday season paved the way for the enormous and highly optimized supply chain system that Amazon is famous (or infamous) for today.

The importance of cashflow

While Silicon Valley was partying, Amazon was saving every penny and investing in stimulating future cash flow. The fundamental flaw in many failed DotCom startups was that they lacked well-thought-out business plans or paths to profitability. It was a race to the IPO, or to being acquired (the so-called "exit strategy" of many smooth-talking entrepreneurs), and so many DotCom startups spent up to 90% of their budget on marketing alone. Meanwhile, Bezos invested heavily in the company's infrastructure to support sustainable growth. In the annual letter of 2001, Jeff Bezos highlighted:

“When forced to choose between optimizing the appearance of our GAAP accounting and maximizing the present value of future cash flows, we’ll take the cash flows.”

History is defined by a series of critical points, and Amazon's path was no different. Amid growing concerns that nervous suppliers might demand faster payment for the products they sold, Amazon realized it needed a pile of cash on hand. Even though its sales were growing by 30%–40% every month, it was still posting massive losses every quarter. With Y2K no longer a concern, the Federal Reserve started raising interest rates, increasing the cost of borrowing and discouraging investment.

To ensure a strong cash position to pay suppliers, Amazon sold $672 million in convertible bonds to investors in Europe a mere month before the DotCom bust of March 10, 2000. This was the critical decision that ensured Amazon's survival through the bust, when investor funding dried up and internet adoption temporarily slowed. During this time, Amazon's stock price fell from $107 to just $7. Today, Amazon's share price is north of $3,000. Amazon's tremendous success is a testament to long-term thinking and a focus on providing excellent service.

Amazon has yet to pay a single dividend since its IPO in 1997. Instead, it has invested every bit of cash generated into infrastructure, improving the customer experience, and new revenue streams such as Amazon Prime, Amazon Web Services and, most recently, the Alexa voice computing platform. Amazon has shown that focusing on cash flow, rather than profits, is a sound strategy for creating long-term shareholder value.

Notorious frugality

Working at Amazon in the early 2000s was nothing like working at other tech startups of that time. Your work computer would be functional but not top-of-the-line. There were no free massages or free meals, like at Google. Only coffee and bananas. You paid for your own parking. The salary and stock options were modest compared to other tech giants. No business-class flying or billing expensive corporate dinners to the company. And whenever Amazon moved to new offices, Bezos had them furnished with cheap desks made from wooden doors.

Amazon's frugality stemmed from Jeff Bezos' own frugality. Even though he was worth $12 billion in 1997, making him one of the wealthiest people in the world at the time, he still drove a modest Honda Accord. He explained this philosophy to a reporter who questioned his frugality:

“It’s a symbol of spending money on things that matter to customers and not spending money on things that don’t.”

Amazon lives by the motto that frugality breeds resourcefulness, self-sufficiency, and invention. This allowed the company to pass savings on to the customer, and it ties into its Virtuous Cycle model. The Virtuous Cycle dictates how reducing costs allows the company to lower its prices, which in turn improves the customer experience. This leads to more traffic and sales, which allows Amazon both to increase its selection and to negotiate lower prices with suppliers. These savings can then be ploughed back into lowering prices, and so the virtuous cycle continues.

Amazon’s Virtuous Cycle Model

Customer obsession

Amazon is not just customer-centric; it is customer-obsessed. Its mission is to figure out what the customer wants and what is important to them. Meetings within the organization often have an empty chair to represent the customer's interests, and whenever a new product or service is proposed, one of the key questions asked is, "What will disappoint the customer most?"

Amazon wanted to take the inconvenience out of online shopping, offering the most efficient delivery and the largest selection of products (even rare or highly seasonal ones). Amazon's website has the largest selection in the world — an estimated 350 million products — and is available 24 hours a day, 365 days a year. The only competitor that comes close is Alibaba, with a selection of approximately 330 million products.

Amazon’s Leadership Principles

Bezos also insisted that customers have access to the best prices, even if it meant they would buy not from Amazon but from the third-party sellers advertising their goods on Amazon's marketplace. He said,

“If somebody else can sell it cheaper than us we should let them and figure out how they are able to do it.”

The willingness to sacrifice profits in return for customer trust did not always sit well with shareholders and board members, but the short-term losses were long-term gains. More and more third-party sellers flocked to Amazon's marketplace, which increased the selection of products available to customers and deepened customer loyalty. Today, third-party sellers account for more sales on Amazon than Amazon's own first-party retail business, and commission from third-party sales represents 19% of Amazon's revenue.

Amazon also aimed to provide a personalized shopping experience using collaborative filtering, a technique used by recommender systems. By leveraging the thousands of data points — clicks, views, purchases — each user generates on its website, Amazon can predict what you are going to buy next, sometimes even before you make the conscious decision to buy. It is so confident in its ability to quantify and predict consumer behaviour that it stocks products in fulfilment centres near you.

Criticism

On its path to success, Amazon made some poor decisions that have damaged its brand. In fact, an entire Wikipedia page is dedicated to criticisms of Amazon. Its ethics and policies have raised eyebrows in some of the highest offices. Recently, Jeff Bezos testified in a virtual antitrust hearing to answer questions about the anti-competitive tactics used by Amazon and other Big Tech companies, tactics that reduce genuine competition and result in public harm.

Amazon has found many morally questionable, but legal, ways to minimize its tax burden. For example, even though it made $11.2 billion in profits in 2018, it paid 0% in income tax for that year and even received a $129 million tax rebate from the federal government. This is due to a highly complex scheme of carrying forward losses from previous years, tax credits for research and development projects, and stock-based employee compensation. Amazon has been accused of effectively "building their company around tax avoidance."

The ugly side of customer obsession and frugality is Amazon's willingness to treat its blue-collar workers as cannon fodder in the war for customers' hearts and wallets. Although workers are generally paid above the national minimum wage, they are subjected to harsh and extremely physically demanding work to ensure that packages are delivered accurately and on time. The fulfilment centres are plagued by workplace injuries. Workers get only two short breaks during an eight-hour shift and have to ask for permission to use the bathroom. They often walk up to 14 miles (22.5 km) a day and risk being terminated if they call in sick. Amazon also has a history of shutting down efforts to unionize workers at its fulfilment centres, firing outspoken critics and even firing pregnant workers.

In conclusion

Amazon started as a small online retailer selling a wide range of books, and has grown into a seemingly unstoppable retail giant that still posts double-digit growth every year. It is one of the largest employers in the world, with 800,000 permanent employees, plus an additional 200,000 temporary employees during the holiday season. The word 'Amazon' has become synonymous with online shopping, in the same way 'Google' has become synonymous with online search. The jury is still out on whether Amazon is a villainous exploiter of cheap labour and poorly written laws, or an efficient empire that expertly walks the legal tightrope and provides much-needed low-wage jobs to unskilled workers worldwide. Regardless, its rise to Big Tech status, especially after its near-demise during the DotCom Bust, is worth studying.

If you enjoyed this blog post, please consider supporting my writing by buying me a coffee.

Read More
Stock Market, Technology, Bubbles, Startups, Internet Jeannie Daniel

The DotCom Bubble

Following the burst of the DotCom bubble, the surviving companies like Apple, Google, and Microsoft became apex predators in their respective fields. This article explores the factors leading up to and during the DotCom bubble, as well as examine the long-term impact on the tech ecosystem.

June 21, 2020

The iconic San Francisco motel-themed billboards of Yahoo. Sourced from VentureBeat.

I was only about 5 years old when the DotCom Bubble burst, and while it is recent enough to live in most people's memories rather than in the dusty history books, in the technology age 20 years is a millennium. Just look at that billboard; it is practically archaic!

The DotCom Bubble highlighted the pitfalls of greed, over-promising, and ignorance. It also provides an interesting case study of the intricate relationship between innovation and economic growth.

A DotCom company was so called because many of them consisted of little more than a website. They were online platforms that would facilitate everything from banking to streaming content to buying pet supplies. This was the dawn of the Information Age — an economy built on information technology. The Internet would become as revolutionary as railroads and electricity, bringing people closer together and providing the means of powering new services and markets.

So what preceded the DotCom Bubble?

A number of factors contributed, chief among them the explosive growth of Internet adoption through the 1990s.

A picture begins to form of something that does not grow at a linear pace. Today the Internet is ubiquitous, and we cannot imagine our lives without it.

Jack F. Welch, chairman of General Electric, was quoted in 1999 as saying that the Internet “was the single most important event in the U.S. economy since the Industrial Revolution.”

Back then, the Internet was very new, and people were struggling to grasp its potential. There was anxiety around the commoditization and regulation of the Internet, and there was fear of the Y2K bug — that computers would misread 2000 as 1900, causing critical computer systems to collapse. But the Internet was about to revolutionize the way we shop, socialize, learn, travel, and more.

Party like it's 1999

Although very real and opening up a plethora of new business opportunities, the Internet — combined with free-market economics, low interest rates, and heavy speculation — resulted in a Wild Wild West era for DotComs. It created an over-enthusiastic investor pool that seemingly overnight stopped caring about things like business plans and debt piles. It was also the Internet that enabled buying stocks directly online, which added plenty of less experienced, less sophisticated investors (willing to buy stocks that were overvalued) to the investor pool.

The number of venture capitalist firms also grew by 90% between 1995 and 2000. More money than ever before was made available for startup capital investments. During the same period, 439 DotCom companies went public, raising $34 billion in capital.

Rob Glaser, who founded Progressive Networks in 1994, said, “In 1995 and 1996, if you said you were doing an Internet toaster, I’m sure you could find a venture capitalist to fund it.”

Every tech startup (recognizable by the .com at the end of its name) was seemingly a unicorn — the next big thing — and everyone had FOMO about the IPO of said unicorn. Many DotCom companies were bandwagon jumpers, with few original ideas, thin business plans, and plenty of big talk. Some spent up to 90% of their budget on advertising just to get their brand "out there".

To add to their net operating losses, they overpaid average talent and hosted extravagant parties. They also offered their products and services for free or at a discount in the hope of creating loyal customers whom they could charge profitable rates in the future. The goal was to "get big fast": identify a niche market early and gain market share as quickly as possible to shut out all competitors.

Fall from grace

During the early years of the DotCom bubble, investors were willing to forgive DotCom companies for posting losses while they developed their IP and expanded their market share. But after a few loss-making years, investors started to get nervous. Many had become overnight paper millionaires from the skyrocketing IPOs, but as we all should know, share price equals neither fair value nor company performance. Surely the goose would eventually run out of golden eggs to lay?

Stock market bubbles, during their ascension, tend to be very sensitive to market shocks. The DotCom Bubble was no different: on March 13, 2000, news that Japan had once again entered a recession triggered a global sell-off that disproportionately affected overvalued technology stocks. This, combined with aggressively raised interest rates, the events of 9/11, and several accounting scandals, including those of Enron and WorldCom, sparked a two-year decline in the Nasdaq Composite, which was comprised overwhelmingly of technology stocks. Many DotCom companies struggled to secure further venture capital while burning through their cash piles. IPOs and further stock offerings were out of the question. Nowhere near profitability and receiving no cash influxes, they eventually went into liquidation. An estimated 52% of DotCom companies had gone bust by 2004.

The DotCom bust was a combination of increased scrutiny of DotCom companies' financials, investor fatigue, and the belief that the Internet was a fad. Of course, the Internet was not a fad, and would soon help bring forth the Fourth Industrial Revolution.

The aftermath

If bubbles popping were extinction-level events, then companies like Apple, Google, and Amazon were the crocodiles of the tech ecosystem. The Big Pop allowed them to become apex predators in their respective fields, for several reasons. Real estate became much cheaper, hardware became easier to obtain, the market was flooded with recently unemployed, talented software engineers, and the extinction of their competitors allowed them to rapidly gain market share. Today, they are some of the most valuable and most recognizable brands in the world. Their respective portfolios overlap somewhat, and they often compete for market share as well as talent. In later years, companies like Facebook and Netflix would join their ranks. Each of these tech giants demands extremely high performance from its employees and has a habit of acquiring any potential competition. Collectively, they are known as FAANG, and as of January 2020 they had a combined market capitalization of over $4.1 trillion.

Although nearly untouchable today, these companies were not immune to the fallout back then. In the face of diminishing confidence, Amazon's share price fell from $107 to just $7. Google waited out the DotCom bubble and only went public in 2004. At the height of the bubble, Apple's share price reached almost $5, only to fall below $1 in 2003.

For Apple, the decade following the DotCom Bubble was its most prosperous, as it led the innovation of consumer electronics. Apple launched the iPod in 2001 and introduced the iTunes Store in 2003, where users could purchase individual tracks for just $0.99. The iTunes Store hit five billion downloads by June 19, 2008. Apple also released Mac OS X, the primary operating system of its Mac computers, in 2001. The first iPhone, the integration of an Internet-enabled smartphone and the iPod, was introduced in 2007. In 2010, Apple introduced the iPad.

Steve Jobs introducing the first iPhone in 2007.

The innovation that followed the malaise of the early 2000s was led by these apex companies. They invested heavily in new startups and even built the infrastructure (cloud computing) that allowed smaller companies to iterate much faster with far less upfront infrastructure investment.

The DotCom bubble fostered an era of entrepreneurship that has not been seen in the US since before the Great Depression. It provided a petri dish to test out the validity and marketability of a wide range of Internet services. Many of the services were way ahead of their time — like online food delivery and online clothing stores. Unfortunately for these services, the consumer base, technology, and infrastructure simply were not ready.

Today, investors look at tech IPOs with increased scrutiny; the consensus is that one simply does not take a tech company public before it reaches profitability. WeWork, Uber, Lyft — all went public before showing profitability. They were whipped in the public square — figuratively, of course — with their share prices falling on the day of their respective IPOs.

Closing remarks

In hindsight, everyone has 20/20 vision. But during a bubble, everyone seems to hold unrealistic, almost fanatical views of what the future will look like. The first recorded speculative bubble, the Tulip Mania, dates back to 1636–1637. At the height of the mania, bulbs sold for approximately 10,000 guilders — equal to the value of a mansion on Amsterdam's Grand Canal. Investors believed there would always be a buyer willing to purchase a bulb at a higher price than their entry point. The perceived value of tulip bulbs became disjointed from their intrinsic value, and a correction was destined to follow.

Tulip Mania of 1637

While researching the DotCom Bubble, I noticed many similarities between today's market speculation and that of the DotCom era. Trading apps that allow investors to buy fractional shares with zero commission have introduced plenty of young, inexperienced investors to the market, and this has coincided with some of the strangest events in stock market memory. Is history repeating itself? Perhaps the frequency of bubbles coincides with the memory span of investors.

If you enjoyed this blog post, please consider supporting my writing by buying me a coffee.

Read More