Data on Big Data

As I wrote “Time to Build Your Big Data Muscles” for Fast Company, I discovered more fascinating bits of data about big data than I could include in the article. Here are some of the eye-popping details.

If all these numbers make you wish for a reference guide, James Huggins answers How Much Data Is That? and BroadbandHub offers a terrific video guide that helps with the relative size of internet data.

In 2012, 2.5 quintillion bytes of data are created every day (a quintillion is 1 followed by 18 zeros), and 90% of the world’s data was created in the last two years alone. As a society, we’re producing and capturing more data each day than everyone before us had seen since the beginning of the earth.

This vast amount of digital data would fill a stack of DVDs reaching from the Earth to the moon and back. To put things in perspective, the entire works of William Shakespeare (in text form) represent about 5 MB of data. So, you could store about 1,000 copies of Shakespeare on a single DVD. The text in all the books in the Library of Congress would fit comfortably on a stack of DVDs the height of a single-story house.
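If you want to check the DVD arithmetic yourself, here is a minimal sketch. The 5 MB and 2.5 quintillion byte figures come from the text above; the 4.7 GB capacity of a single-layer DVD is my assumption, and the function name is just for illustration.

```python
# Rough check of the DVD comparisons, under stated assumptions.
DVD_BYTES = 4.7e9            # assumption: single-layer DVD, decimal gigabytes
SHAKESPEARE_BYTES = 5e6      # about 5 MB, from the text
DAILY_DATA_BYTES = 2.5e18    # 2.5 quintillion bytes a day, from the text


def items_per_dvd(item_bytes: float) -> float:
    """How many items of the given size fit on one DVD."""
    return DVD_BYTES / item_bytes


copies = items_per_dvd(SHAKESPEARE_BYTES)
dvds_per_day = DAILY_DATA_BYTES / DVD_BYTES

print(f"Copies of Shakespeare per DVD: {copies:.0f}")
print(f"DVDs needed for one day of data: {dvds_per_day:,.0f}")
```

Under that assumption the Shakespeare figure lands a little under 1,000 copies, which agrees with the "about 1,000" above.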

The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s, according to Martin Hilbert and Priscila López.
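A doubling time is easier to compare with other growth figures once it is turned into a yearly rate. The sketch below does that conversion for the 40-month figure; the function names are my own, and it assumes smooth exponential growth, which real storage capacity only approximates.

```python
def annual_growth_rate(doubling_months: float) -> float:
    """Yearly growth rate implied by doubling every `doubling_months` months."""
    return 2 ** (12 / doubling_months) - 1


def growth_factor(doubling_months: float, years: float) -> float:
    """Total multiplication of capacity after `years` years."""
    return 2 ** (12 * years / doubling_months)


rate = annual_growth_rate(40)
decade = growth_factor(40, 10)

print(f"Implied annual growth: {rate:.1%}")
print(f"Growth over ten years: {decade:.0f}x")
```

Doubling every 40 months works out to roughly 23% a year, or three doublings (eight times the capacity) per decade.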

Unstructured data accounts for 80% of the data in the world, and because we know much of that comes from social media, social media gets special attention.

How much data is generated through social media tools?

People send more than 144.8 billion email messages a day.
People and brands on Twitter send more than 340 million tweets a day.
People on Facebook share more than 684,000 pieces of content a minute.
People upload 72 hours (259,200 seconds) of new video to YouTube a minute.
Consumers spend $272,000 on Web shopping a minute.
Google receives over 2 million search queries a minute.
Apple receives around 47,000 app downloads a minute.
Brands receive more than 34,000 Facebook ‘likes’ a minute.
Tumblr blog owners publish 27,000 new posts a minute.
Instagram photographers share 3,600 new photos a minute.
Flickr photographers upload 3,125 new photos a minute.
People perform over 2,000 Foursquare check-ins a minute.
Individuals and organizations launch 571 new websites a minute.
WordPress bloggers publish close to 350 new blog posts a minute.
The Mobile Web receives 217 new participants a minute.
(The most up-to-date numbers are available from the sites themselves.)
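Because the list mixes per-day and per-minute figures, it can help to put a few of them on the same footing. This is a small sketch that scales some of the per-minute numbers above to a full day; it assumes the rates hold steady around the clock, which they surely don't, so treat the results as rough.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440

# Per-minute figures taken from the list above.
per_minute = {
    "Hours of video uploaded to YouTube": 72,
    "Google search queries": 2_000_000,
    "New Tumblr posts": 27_000,
    "New websites launched": 571,
}


def per_day(rate_per_minute: int) -> int:
    """Scale a per-minute rate to a 24-hour day."""
    return rate_per_minute * MINUTES_PER_DAY


daily = {name: per_day(rate) for name, rate in per_minute.items()}

for name, total in daily.items():
    print(f"{name}: {total:,} a day")
```

At these rates, YouTube alone takes in more than 100,000 hours of video a day.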

Data specific to the job market

Recent estimates from the US Bureau of Labor Statistics project a 22 percent increase in demand for professionals with management analysis skills between now and 2020. This is faster than the average for all occupations. Demand for the services of these workers will grow as organizations continue to seek ways to improve efficiency and control costs.

Employment in the Massachusetts big data sector alone is expected to more than double over the next six years, according to a recent report by the Mass Technology Leadership Council (MassTLC). MassTLC projects that the state’s total big data employment could grow by 50,000 jobs, for a total of 120,000 jobs by 2018, making the sector one of the state’s key economic drivers.

Despite high unemployment rates, a lack of skilled workers means many vacancies remain unfilled. Even at the height of the crisis, the OECD reports, more than 40% of employers in the US, Australia, and Japan said they couldn’t find people with the right skills.


Schools participating in IBM’s Academic Initiative, which focuses specifically on expanding and strengthening analytics curricula, include Fordham, Yale School of Management, DePaul, Northwestern, University of West Scotland, Indian Institute of Management Calcutta, Xi’an Jiao Tong University, University of Ulster, IAE Aix-en-Provence, EDC business school in France, and Ottawa University Telfer School of Management.

Tim O’Reilly recently stated in a Google+ conversation: “Companies that have massive amounts of data without massive amounts of clue are going to be displaced by startups that have less data but more clue.”

More big big numbers

2.5 Petabytes: Data flowing through Walmart’s transaction databases. (The Economist)

40 Terabytes: Data generated every second from nuclear physics experiments at the Large Hadron Collider at CERN. (The Economist)

10 Terabytes: Sensor data produced by a jet every 30 minutes of flight time.

1.25 Terabytes: The amount of data the human brain can hold. It performs at roughly 100 teraflops. (Ray Kurzweil as cited by IBM’s Tony Pearson)

1 Terabyte: Structured trading data collected by the NYSE each day the market is open.

631: Number of enterprise big-data companies expected to exist in 2017 (IBISWorld, visual)

340,000: Projected number of people working specifically on big data in 2018 (McKinsey)

40%: Five-year compound annual growth rate (CAGR) for the worldwide big data market. The growth of individual segments varies from 27.3% for servers and 34.2% for software to 61.4% for storage, which shows the strongest growth opportunity through 2015. Infrastructure technology for big data deployments is expected to grow slightly faster than the market as a whole, at 44% CAGR. (IDC)
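A compound annual growth rate sounds modest until you compound it. Here is a minimal sketch showing what the IDC rates quoted above imply over the five-year window; the function name is my own, and it assumes each rate holds for all five years.

```python
def compound_growth(cagr: float, years: int) -> float:
    """Total growth multiple after `years` years at a constant annual rate."""
    return (1 + cagr) ** years


# Rates quoted from IDC in the text above.
segments = {
    "overall market": 0.40,
    "servers": 0.273,
    "software": 0.342,
    "infrastructure": 0.44,
    "storage": 0.614,
}

multiples = {name: compound_growth(rate, 5) for name, rate in segments.items()}

for name, multiple in multiples.items():
    print(f"{name}: {multiple:.1f}x over five years")
```

A 40% CAGR means the market ends the five years at more than five times its starting size, and storage at more than ten times.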

6 Trillion: Big data cost. (IDC)

3.37 Billion: Worldwide email accounts, made up of 2.52 billion consumer email accounts and 850 million corporate email accounts. (Radicati Group/pdf)

$900 Billion/year: cost of lowered employee productivity and reduced innovation from information overload. Despite its heft, this is a fairly conservative number and reflects the loss of 25% of the knowledge worker’s day to the problem. The total could be as high as $1 trillion. (Basex)

3000%: increase in meter reading data captured from deploying smart meters for better energy management. (IBM/pdf)

15 (out of 17): Number of industry sectors in the U.S. that have more data stored, per company, than the U.S. Library of Congress (McKinsey)

Leslie Johnston at the Library of Congress wrote a full post, How many Libraries of Congress does it take?, listing the reports that compare amounts of data to the holdings of the LoC. She also wrote Defining the “Big” in Big Data and Data is the New Black. Follow her @lljohnston.

Information overload was first mentioned in 1962, in an article entitled “Operation Basic: The Retrieval of Wasted Knowledge” by Bertram M. Gross. The problem was predicted by Alvin Toffler in Future Shock (1970), and in 1989, Richard Saul Wurman warned of it in his book, Information Anxiety.

Big Tools

Examples of new software tools include MapReduce and Hadoop. While Hadoop is more widely talked about and used, a recent article reported that MapReduce was successful in sorting a petabyte file of 100-byte records on a system of 8000 computers in 33 minutes, compared to the six hours it took to accomplish the same task on a cluster of 4000 machines in 2008 (see “Sorting Petabytes with MapReduce – The Next Episode,” September 2011). From OECD.
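The two sorting runs are easier to compare as throughput. This sketch uses only the figures quoted above; it assumes a decimal petabyte (10^15 bytes) and treats "six hours" as exactly six hours, and the function name is just for illustration.

```python
PETABYTE = 10 ** 15  # bytes; assumption: decimal (SI) petabyte


def throughput(total_bytes: int, seconds: int, machines: int) -> tuple[float, float]:
    """Return (bytes per second for the cluster, bytes per second per machine)."""
    cluster = total_bytes / seconds
    return cluster, cluster / machines


cluster_2008, machine_2008 = throughput(PETABYTE, 6 * 3600, 4000)
cluster_2011, machine_2011 = throughput(PETABYTE, 33 * 60, 8000)

print(f"2008: {cluster_2008 / 1e9:.1f} GB/s cluster, {machine_2008 / 1e6:.1f} MB/s per machine")
print(f"2011: {cluster_2011 / 1e9:.1f} GB/s cluster, {machine_2011 / 1e6:.1f} MB/s per machine")
print(f"Cluster speedup: {cluster_2011 / cluster_2008:.1f}x")
print(f"Per-machine speedup: {machine_2011 / machine_2008:.1f}x")
```

Doubling the number of machines explains only part of the gain: the later run is about eleven times faster overall, so each machine was doing more than five times the work per second.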

“Interactive: Analyze your smile,” Forbes.com, March 3, 2011. A Web-based application to use your computer’s camera to track your emotions over time to identify what portions of an advertisement you found amusing.

Big Implications

While the tools keep getting faster, the data sets are growing larger. According to Chris Anderson, “Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.” (Anderson, 2008.)

Big Reports

Big data: Harnessing a game-changing asset (pdf). A report from the Economist Intelligence Unit, sponsored by SAS

Big Data, the Next Frontier for Innovation from McKinsey Global Institute, 2011

Measuring the Internet: The Data Challenge, OECD Digital Economy Papers, No. 194. 2012

Future Work Skills 2020: Computational Thinking from Apollo Research Institute. More in one of my previous posts, 10 Skills for the Future Workforce.

Making Sense of Big Data, Technology Forecast from PWC (pdf), 2010

Big Data Categories

big input -> small output = lots of data in, only a little data out
small input -> big output = little data in but produces a big output
big input -> big output = lots of data in, lots of data out
(Vineet Tyagi, Impetus)

Big Books

While working on the article I also asked my social network for suggestions on books that introduce people-centric practices to data-driven people. While none of the responses were precisely what I was looking for, I heard about many books I haven’t read and you might enjoy learning about too.

The first book recommended was Peopleware by Tom DeMarco and Tim Lister. I was told it is a bit dated, but after three separate mentions it seems to have made a lasting impression on many people.
What Moneyball did for mathphobics to grasp analytics, Ed Tufte did for numbers folk to grok people. John Cousineau (@jcousineau). He also recommended Slywotzky’s Profit Patterns.
Taleb’s Black Swan has good examples. Another is oldie but goodie Innumeracy. Jason Hull (@hull_j)
The Mathematician Reads the Newspaper by John Allen Paulos. Brian Sletten (@bslette)
Switch by Chip & Dan Heath. Different types of people and how to work together for change and execution. Gr8 Stories. (@Matt_Leopold)
Try Managing Humans. Jennifer Berk (@jcberk) [Note from Marcia: This wonderful website lead-in shows why data people need a more human approach. Well worth watching!]

Other recommendations sent privately:

Memoirs of Hadrian by Marguerite Yourcenar. It’s about what it means to be human.
Easy read, especially for contrarian thinkers: The Opposable Mind: Winning Through Integrative Thinking by Roger Martin.
A favorite in education but a little off point: Cathy Davidson, Now You See It: How Technology and Brain Science Will Transform Schools and Business for the 21st Century
One of the easier reads might be: Consider – The Power of Reflective Thinking In Your Organization by Daniel Patrick Forrester
Predictably Irrational: The Hidden Forces That Shape Our Decisions by Dan Ariely
Anything by self-described “numerical man” Malcolm Gladwell
Anything by Jakob Nielsen

On related topics, I mentioned:

One of my fave books to help anyone grok the people message: Humanize by @maddiegrant and @jamienotter
The Social Life of Information by @jseelybrown and Paul Duguid
John Kenneth Galbraith’s A Tenured Professor

If you have a suggestion, please send it to me directly or include it in a comment after this post.

Big videos

@jonhusband made a compelling case for the movie Pleasantville:

“Pleasantville shows people finding their ‘voice’, coming to life, frantic desire of town hierarchs to stop/control, etc. and all that w nary a hyperlink ;-) Best movie (simplistic) about the +ve side of linking, sharing & finding & using voice.”

Still more?

Messy data are a fact of life. So is the unstructured term data. Singular or plural? @theeconomist

If you’re curious how your workplace, your family, or your school can take steps now to increase their big data thinking: Learn more about computational thinking, a growing field that shows how science and math can apply to all aspects of modern work.

[Photo credit: soupautomat4, fdecomite]
