What's the origin of the phrase "Big Data Doesn't Fit In Excel"?


Welcome to Yak Shaving School! As part of my MSc I'm reading a book about Data Analytics. So I've been chasing down quotes to find their origin.

One paper had this popular quote in it (emphasis added):

As with many rapidly emerging concepts, Big Data has been variously defined and operationalized, ranging from trite proclamations that Big Data consists of data-sets too large to fit in an Excel spreadsheet or be stored on a single machine (Strom, 2012)

Big Data, new epistemologies and paradigm shifts. SAGE Publications. Big Data & Society. Page: 205395171452848.

I keep seeing that damned Excel quote. But who originally said it? The "Big Data" paper above references "Strom". Well, here's what Strom has to say for themselves:

Big Data is everywhere. As Bit.ly‘s chief scientist Hilary Mason likes to say: “Big Data usually refers to a dataset that is too big to fit into your available memory, or too big to store on your own hard drive, or too big to fit into an Excel spreadsheet.”
Big Data Makes Things Better - Slashdot.org August 3rd 2012

Aha! It's a blogpost from Slashdot. And Strom is quoting someone else - Hilary Mason. I've seen Mason being quoted saying this before. Here's the earliest Tweet I could find attributing it to Mason - from April 2013:

But I couldn't find the original quote. I want to be able to cite who originally said it, and where & when they said it. Not a second-hand transcription.

Googling around, I found this definition of Big Data from July 2013:

“Big Data” is “it doesn’t fit in Excel”
Stéphane Hamel – one of what are now more than 30 definitions of Big Data!
Data Science – de toekomst van webanalisten? (Data Science – the future of web analysts?)

Interesting! That's Stéphane Hamel - not Hilary Mason. Searching for Hamel's name led me to this 2017 article:

The simplest definition of “Big Data” is “it doesn’t fit in Excel”
Stephane Hamel comment 8/2012 Big Data – What It Means For The Digital Analyst.
Definitions of Big Data

The "What it means for the digital analyst" page has since disappeared - but is available in the Wayback Machine. Here's the quote in full:

I have joked that the simplest definition of "Big Data" is "it doesn't fit in Excel" - and when you think of it, it's true for most people who wonder how to make the shift from a traditional approach to a Big Data one. Shifting away from Excel forces the analyst to change his approach, view the data differently, and explore new solutions.
And that's a whole lot of fun to do! 🙂
August 2nd, 2012

There's also a Slideshow from March 2013 in which Hamel uses the phrase:
Slide saying "It Doesn't fit in Excel".

A bit more digging and I found this document from July 2012:

“Simplest definition of #BigData ever: ‘it doesn’t fit in Excel’ :)” Stephane Hamel @SHamelCP 3 Jul 2012
How Big is Big Data. Columbia University.

The @SHamelCP Twitter account doesn't exist any more. And while some of its Tweets are in the Internet Archive, that one is missing. But there are contemporary Tweets which suggest that it was Tweeted at about that time:

lol RT @SHamelCP Simplest definition of #BigData ever: "it doesn't fit in Excel" :) #measure #analytics

Back in 2012, the Quote Tweet function didn't exist, hence the slightly weird syntax. Here's a link to a bunch of people quoting the Tweet in July 2012.

The reason @SHamelCP doesn't exist is that at some point it was renamed to @SHamel67. Which means the original Tweet exists! And here it is:

Screenshot of a Tweet.
Stéphane Hamel saying 'Simplest definition of #BigData ever: "it doesn't fit in Excel" :) #measure #analytics'

(Archive link from original)

I reckon that's the earliest directly citable Tweet of the phrase. But there is some evidence of it being used earlier. Here's a report from the BigDataWeek Community meetup in London:

The panel started off with Edd asking, So what is big data? The answers ranged from correct but slightly silly:
   lots of 0s and 1s
to
   too big to fit in x (where x is your usual tool - excel, SQL, memory etc) - Hilary
”Big data, ready or not” 25th April 2012

Here's the video - with the quote at ~15 minutes 30 seconds in:

And, slightly earlier:

“Big Data usually refers to a data set that is too big to fit into your available memory, or too big to store on your own hard drive, or too big to fit into an Excel spreadsheet,” says Mason
Hilary Mason Wants To Get You Started With Big Data 26th December 2011
(Although possibly originally published in September 2011)
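
Mason's definition is really three separate tests, and they can be written down directly. Here is a minimal sketch, assuming the caller already knows the dataset's size and row count and supplies their own memory and disk figures. The function name and its parameters are hypothetical, and the on-disk size is only a rough proxy for how much memory the data would need once loaded. The Excel figure is the worksheet row limit of current versions.

```python
# Rows per worksheet in current versions of Excel (.xlsx).
EXCEL_MAX_ROWS = 1_048_576


def exceeded_limits(size_bytes, row_count, available_memory_bytes, free_disk_bytes):
    """Return which of Mason's three limits a dataset exceeds.

    All four arguments are supplied by the caller; nothing is measured here.
    """
    reasons = []
    if size_bytes > available_memory_bytes:
        reasons.append("memory")  # too big to fit into available memory
    if size_bytes > free_disk_bytes:
        reasons.append("disk")    # too big to store on your own hard drive
    if row_count > EXCEL_MAX_ROWS:
        reasons.append("excel")   # too big to fit into an Excel spreadsheet
    return reasons


# A 2 GB file with 5 million rows, on a machine with 8 GB RAM and 500 GB free disk.
print(exceeded_limits(2 * 10**9, 5_000_000, 8 * 2**30, 500 * 2**30))
```

By this reading, a dataset is "big" as soon as the returned list is non-empty, which is why the Excel test is usually the first one to trip.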

Prior to that, things start getting a little fuzzy. In April 2011, Mike Driscoll wrote a blog post about a presentation he gave with Hilary Mason and Joe Adler:

  1. Choose The Right-Sized Tool
    Or, as I like to say, you don’t need a chainsaw to cut butter.
    If you’ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it
    ...
    When you’re data gets very large, so big it can’t fit reasonably on your laptop (in 2010, that’s north of a terabyte), then you’re in Hadoop, parallelized database , or overpriced Big Iron territory.
    the seven secrets of successful data scientists 19th April 2011

So the proto-phrase seems to have appeared between April 2011 and April 2012. By July 2012 it had become much more pithy. And from there became endlessly quotable.

Around then, and earlier, the idea was expressed much more fuzzily. A McKinsey report from May 2011 says:

In some cases, decisions will not necessarily be automated but augmented by analyzing huge, entire datasets using big data techniques and technologies rather than just smaller samples that individuals with spreadsheets can handle and understand.
Big data: The next frontier for innovation, competition, and productivity

And, even further back, here's what RedMonk's Stephen O'Grady had to say back in 2009:

Excel has been used on big data for years, it’s true. But not directly on big data. With a row limit of around 65,000, it certainly can’t be used as a direct window into data warehouses or marts
What’s After Excel? Big Data and the Future of Spreadsheets 19th November 2009
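
The "around 65,000" figure is the 65,536-row limit of the legacy .xls worksheet; the .xlsx format raised it to 1,048,576. A rough sketch of checking a CSV against either limit without opening it in Excel follows. The helper names are hypothetical, and the count includes the header row, because the header occupies a worksheet row too.

```python
import csv
import io

XLS_MAX_ROWS = 65_536       # 2**16, legacy .xls worksheets
XLSX_MAX_ROWS = 1_048_576   # 2**20, .xlsx worksheets


def count_rows(csv_file):
    """Count the rows of an open CSV file by streaming it, header included."""
    return sum(1 for _ in csv.reader(csv_file))


def rows_cut_off(total_rows, limit=XLSX_MAX_ROWS):
    """Return how many rows would not fit on a worksheet with the given limit."""
    return max(0, total_rows - limit)


# Tiny in-memory stand-in for a real file; use open(path, newline="") in practice.
sample = io.StringIO("id,value\n1,a\n2,b\n3,c\n")
total = count_rows(sample)
print(total, rows_cut_off(total), rows_cut_off(70_000, XLS_MAX_ROWS))
```

Streaming the file row by row keeps memory use flat, so the check works even on files far larger than either limit.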

Please don't think I'm picking on any of the people mentioned in this blog post - I've seen the quote attributed to a dozen other people, and to none. It is a catchy little slogan with huge memetic potential. I think it has now become a standard truism.

But this was a great reminder to me that it is always worth following the trail of a quote to see where it leads.

Update!

Thanks to Pete Skomoroch for alerting me to this earlier usage, from March 2009.

Pete recollects that people were using this phrase in 2007 — but I've yet to find evidence of it. If you have, please stick a note in the comments.



6 thoughts on “What's the origin of the phrase "Big Data Doesn't Fit In Excel"?”

Trackbacks and Pingbacks

  1. Big data is a term used since the early 1990s to describe “huge, overwhelming, and uncontrollable amounts of information”. The rapid rise of the internet has meant the volume of data has increased exponentially over the past two decades.

    There is no universal threshold determining where big data starts. However, it’s generally considered to kick in at the point where traditional tools struggle to meet the demands of processing a dataset.

    Of course, this is highly subjective. What is colossal to small and medium-sized businesses is minuscule to multinational corporations.

    In Excel’s case, the limit is 1,048,576 worksheet rows, although you probably wouldn’t want to get anywhere near that if you wanted a lag-free experience.

    Source: Terence Eden’s Blog

    Swathes of data scientists, data engineers, and even data analysts, scoff at the idea that Excel can have any role in the big data world. What many don’t realise though is there is something very powerful but lacking an ostentatious presence.

    It’s called Power Query.

    Power Query is a data transformation and data preparation engine, which first appeared in Excel 2010 as an add-in. Fast forward to Excel 2016 and it became integrated into the Data tab without the need to install anything additional.

    This article debunks the myth Excel can’t handle more than a million rows by demonstrating a step-by-step method of connecting to a very large dataset. Pivot tables and charts are also constructed to get a feel of the data.

    Scenario

    Just imagine you’re an office administrator who’s been sent a ginormous 1.14GB CSV file containing 10 million records of adult census data. You’ve been asked to inspect it and provide a summary.

    https://www.kaggle.com/brijeshbmehta/adult-datasets?select=adult10m

    Quick caveat: this data is fictional and should not be used to draw real-life conclusions.

    Although you’re an intermediate Excel user with plenty of experience stringing together reports, you have never had to deal with data on this scale.

    Usually, you’d open the CSV file directly in Excel and immediately crack on with the rest. The issue is when you try to do that with this, an error message pops up warning you it’s too large for the grid.

    You then scroll to the bottom to find the data has indeed been truncated, with 8,951,424 records missing (10,000,000 minus the 1,048,576 rows the grid can hold).

    Oops.

    At your workplace is a data scientist who’s the go-to guy for all things data. For alliteration’s sake, let’s call him Paul—Paul the programmer.

    There he is—sat in the corner of the office geeking it out on his ultrawide quad monitor setup. Due to the nature of your work, you rarely need to converse with him, but you know he’s a Python, SQL and Hadoop guru. In fact, he’s the only one there.

    With his wealth of knowledge, you think it would make sense to send over the CSV file for him to generate a few visualisations. However, he’s got enough on his plate. He could do without people like you pestering him with more tasks.

    Paul is an Excel-skeptic who’s been known to belittle it. However, he’s barely used it since school and believes the majority of data crunching and analysis should be carried out in Python.

    “This dataset is far too big for Excel, period,” he bullishly informs you. He sounds convincing — he even sounds cocksure. Little do you know, he’s not well-versed in the program, but given his status, you assume he knows what he’s talking about.

    Paul tells you he’ll do it, but only after completing a few other tasks on his checklist. That could mean waiting two or three days — maybe longer.

    There are two problems here:

    1. Before asking Paul, you didn’t bother Googling or posting on forums or social media to find out if there was a way of doing it in Excel or another program (e.g. Power BI or Tableau).

    2. Is Paul even being honest? Perhaps he knows more than he’s letting on but is reluctant to say because he doesn’t want to undermine his role. After all, he wants people to depend on him. If he starts empowering others by dishing out advice willy-nilly, it might lessen the chances of getting that pay rise from the boss.

    At this point, you think relying on him is the only option. If only you knew the alternative—then you wouldn’t think twice about taking matters into your own hands…

    Solution

    To circumvent the million-row limitation, it’s possible to load a dataset into Excel without dumping it onto a worksheet.

    Firstly, go to the Data tab and inside the Get & Transform Data section click on From Text/CSV.

    This opens File Explorer, allowing you to select the CSV file.

    A preview window then pops up so you can get an overview of the data.

    For this exercise, transform the data first (option 3) and click the Close & Load To… ribbon button for the Import Data window (option 2) to pop up. Choose Only Create Connection and check the Add this data to the Data Model box. Press OK to confirm the choices.

    The options are:

    Load

    Loads the data directly into the worksheet as a query. However, in this case, a warning will appear alerting you that the query returned more data than will fit on the worksheet. Therefore, only the first 1,048,576 rows will show.

    Load To…
