Five Sources of Big Data

Some time ago I described how to think when building solutions from Big Data in the post Six Graphs of Big Data. Today I am going to look in the opposite direction: where does Big Data come from? I see five distinct sources of data: Transactional, Crowdsourced, Social, Search and Machine. Details are below.

Transactional Data

This is the good old data, the most familiar to geeks and managers. It lives in plenty of RDBMSes, running or archived, on premises and in the cloud. The majority of transactional data belongs to corporations, because the data was created mainly by businesses. That was the golden era of Oracle and SQL Server (and some others). At some point RDBMS technology proved incapable of handling more transactional data, so we got Teradata (and others) to fix the problem. But there was no significant shift in the way we work with those data sources. Data warehouses and analytic cubes are trending, but they have been in use for years already. Financial systems and modules of enterprise architectures will continue to rely on transactional data solutions from Oracle or IBM.

Crowdsourced Data

This data source has emerged from an activity rather than from a type of technology. The phenomenon of Wikipedia confirmed that crowdsourcing really works. Much time has passed since the masses adopted Wikipedia, and we have got other fine data sources built by crowds, for example OpenStreetMap, Flickr, Picasa and Instagram.

Interesting things are happening with the rise of personal genetic testing (checking DNA against a million known markers via 23andMe). This leads to public crowdsourced databases. More examples are available, e.g. amateur astronomy. Volunteers do author useful data, and the size of crowdsourced data is increasing.

What differentiates it from transactional/enterprise data? The price. Crowdsourced data is usually free to use under one of the Creative Commons licenses. Often the motivation for creating such a data set is the digitization of our world, or making a free alternative to paid content. With the rise of nanofactories, we will see growth in 3D models of every physical product. Using crowdsourced models, we will print goods at home (or elsewhere).

Social Data

With the rise of Friendster, then MySpace, then Facebook and others (LinkedIn, Twitter etc.) we got a new type of data: Social. It should not be confused with Crowdsourced data, because its nature is completely different. Social data is a digitization of ourselves as persons and of our behavior. It complements Crowdsourced data very well. Eventually there will be a digital representation of everyone; so far, social profiles are good enough for meaningful use. Social data is dynamic, and it is possible to analyze it in real time, e.g. put Tweets or Facebook posts through the Google Prediction API to detect emotions. I’m sure everybody intuitively understands this type of data source.

Search Data

This is my favourite. It is not obvious to many of you, yet it is a really strong data source. Just recall how much you search on Amazon or eBay. How do you search on wikis (not to be confused with Wikipedia)? Quora gets plenty of search requests. StackOverflow is a good source of search data within Information Technology. There are intranet searches within Confluence and SharePoint. If those search logs are analyzed properly, their potential usefulness and business applications become clear. E.g. the Intention Graph and the Interest Graph are related to search data.
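As a rough illustration of what analyzing a search log could look like, here is a minimal Python sketch that counts the most frequent queries. The log format (one tab-separated `user_id` and query per line), the sample entries and the function names are all assumptions made for this example; real logs from Confluence, SharePoint or a web shop will look different.

```python
from collections import Counter


def normalize(query: str) -> str:
    """Lowercase a query and collapse repeated whitespace."""
    return " ".join(query.lower().split())


def top_queries(log_lines, n=3):
    """Return the n most frequent normalized queries as (query, count) pairs.

    Assumes a hypothetical log format: "<user_id>\t<query>" per line.
    Lines that do not match the format are skipped.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.rstrip("\n").split("\t", 1)
        if len(parts) != 2:
            continue  # malformed line, ignore it
        _user_id, query = parts
        query = normalize(query)
        if query:
            counts[query] += 1
    return counts.most_common(n)


# Made-up sample data, for illustration only.
sample_log = [
    "u1\tWearable Camera",
    "u2\twearable  camera",
    "u1\tfitness tracker",
    "u3\tfitness tracker",
    "u3\twearable camera",
    "malformed line without a tab",
]

if __name__ == "__main__":
    print(top_queries(sample_log, 2))
```

Counting normalized queries is only the first step. Grouping the same counts per user would be one simple way to start approximating an interest profile.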

There is a problem of “walled gardens” for search data. This problem is bigger than for social data, because public profiles are fully or partially available, while searches are kept behind the walls.

Machine Data

This is also my favourite. In the Internet of Things every physical thing will be connected. New things are designed to be connectable. Old things are getting connected via M2M. Consumers have adopted wearable technology. I’ve posted about it earlier; see Wearable Technology and Wearable Technology, Part II.

The cost of data gathering is decreasing. The cost of wireless data transfer is decreasing. The bandwidth of wireless transfer is increasing dramatically. Fraunhofer and KIT completed a 100 Gbps transmission. That is about fourteen times faster than the fastest 802.11ac. The moral: measure everything, just gather data until it becomes Big Data, then analyze it properly and operate proactively. Machine data is probably the most important data source for Big Data in the coming years. We will digitize the world and ourselves via devices. OpenStreetMap has got competitors: a fleet of eBees mapped the Matterhorn with millions of spatial points. Expect more from machines.


