
Big Data Graphs Revisited

Some time ago I outlined Six Graphs of Big Data as a pathway to the individual user experience. Then I did the same for Five Sources of Big Data. But what lies between them remained untold. Today I am going to give my vision of how different data sources allow us to build different data graphs. To make this post less dependent on those older ones, let's start from a real-life situation and business needs, then bind them to data streams and data graphs.

 

Context is King

The same data has different value in different contexts. When you are running late for a flight and you get a message that your flight is delayed, that message is valuable. Compare that to receiving the same message two days ahead, when you are not in a hurry at all. Such a message might be useless if you are not traveling at all, but the airline has your contacts and sends you messages about a flight you don't care about. There was only one dimension here: time to flight. That was a friendly description of context, to warm you up.

Some professional contexts are difficult to grasp for the unprepared. Let's take a situation from the office of some corporation. A department manager intensified his email communication with the CFO, started to use the phone more frequently (also calling the CFO and other department managers), went to the CFO's office multiple times, skipped a few lunches, and stayed at work till 10 PM for several days. Here we have multiple dimensions (five) which can be analyzed together to define the context. Most probably the department manager and the CFO were doing some budgeting: planning or analysis/reporting. Knowing that, it is possible to build and deliver individual prescriptive analytics to the department manager, focused on helping him handle the budget. Even if that department has other escalated issues, such as a release schedule, the severity of the budgeting is much higher right now, hence the context belongs to budgeting for the time being.

By having data streams for each dimension, we are able to build a run-time individual/personal context. The data streams for that department manager were essentially time series: events with attributes. Email is a dimension we are tracking; peers, timestamps, type of the letter, size of the letter, types and number of attachments are attributes. Phone is a dimension; names, times, durations, number of people, etc. are attributes. Location is a dimension; own office, CFO's office, lunch place, timestamps, durations, sequence are attributes. And so on. We have defined potentially useful data streams, and it is possible to build an exclusive context out of them, from their dynamics and patterns. That was a more complicated description of context.
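To make the "events with attributes" idea concrete, here is a minimal sketch of how such stream events could be represented; the field names and sample values are illustrative assumptions, not taken from any real product.

```python
# A minimal sketch of one event in a dimension's data stream.
# Field names and sample values are illustrative assumptions.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class StreamEvent:
    dimension: str        # "email", "phone", "location", ...
    timestamp: datetime   # when the event happened
    attributes: dict = field(default_factory=dict)  # dimension-specific payload

# Sample events for the department manager case:
events = [
    StreamEvent("email", datetime(2014, 3, 3, 9, 15),
                {"peer": "CFO", "type": "reply", "size_kb": 14, "attachments": 2}),
    StreamEvent("phone", datetime(2014, 3, 3, 11, 40),
                {"peer": "CFO", "duration_sec": 420}),
    StreamEvent("location", datetime(2014, 3, 3, 14, 5),
                {"place": "CFO office", "duration_min": 35}),
]
```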

 

Interpreting Context

Well, well, but how do we interpret those data streams, how do we interpret the context? What we have: multiple data streams. What we need: to identify the run-time context. So the pipeline is straightforward.

First, we have to log the data from each dimension of interest. It can be done via software or hardware sensors. Software sensors are usually plugins, but they can be more sophisticated, such as object recognition from surveillance cameras. Hardware sensors are GPS, Wi-Fi, turnstiles. There can be combinations, like a check-in somewhere. So keep in mind that a lot can be done with software sensors. For the department manager case, it's a plugin to Exchange Server or Outlook to listen to emails, a plugin to the ATS (the PBX) to listen to the phone calls, and so on.
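A software sensor for the email dimension could be as simple as the skeleton below. It is a hypothetical sketch: a real plugin would subscribe to Exchange/Outlook or the phone system's APIs, while here the fetching function is just a stub passed in by the caller.

```python
# Hypothetical software-sensor skeleton for the email dimension.
# A real plugin would subscribe to Exchange/Outlook events; here the
# fetching function is a stub supplied by the caller.
from datetime import datetime
from typing import Callable, Iterable

def email_sensor(fetch_new_messages: Callable[[], Iterable[dict]],
                 emit: Callable[[dict], None]) -> None:
    """Normalize raw mail records into stream events and pass them on."""
    for msg in fetch_new_messages():  # e.g. poll a mailbox (assumption)
        emit({
            "dimension": "email",
            "timestamp": msg.get("sent_at", datetime.utcnow()),
            "attributes": {
                "peer": msg.get("to"),
                "size_kb": msg.get("size_kb"),
                "attachments": msg.get("attachment_count", 0),
            },
        })

# Usage: email_sensor(my_fetch_stub, events_list.append)
```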

Second, it's time for low-level analysis of the data: statistics, then data science. Brute force to establish what is credible and what is not, then looking for emerging patterns. The bottleneck of data science is the human factor: somebody has to look at the patterns to reduce false positives and false negatives. This step is more about discovery and probing, preparing the foundation for the more intelligent next step. More or less everything is clear with this step. Businesses have already started to bring up their data science teams, but they still don't have enough data for the science:)
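As a taste of this brute-force first pass, here is a small sketch that flags days when the manager's email volume deviates strongly from his own baseline; the threshold and the sample data are assumptions.

```python
# Brute-force first pass: flag days when daily email volume deviates
# strongly from the person's own baseline (threshold is an assumption).
from statistics import mean, stdev

def unusual_days(daily_counts: dict, threshold: float = 2.0) -> list:
    """daily_counts: {"03-01": 2, ...}; returns days with z-score above threshold."""
    values = list(daily_counts.values())
    if len(values) < 2:
        return []
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [day for day, n in daily_counts.items() if (n - mu) / sigma > threshold]

print(unusual_days({"03-01": 2, "03-02": 3, "03-03": 2,
                    "03-04": 3, "03-05": 2, "03-06": 14}))  # ['03-06']
```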

Third, it's Data Intelligence. As Microsoft said some time ago, "Data intelligence is creating the path from data to information to knowledge." This should be described in more detail to avoid ambiguity. From Techopedia: "Data intelligence is the analysis of various forms of data in such a way that it can be used by companies to expand their services or investments. Data intelligence can also refer to companies' use of internal data to analyze their own operations or workforce to make better decisions in the future. Business performance, data mining, online analytics, and event processing are all types of data that companies gather and use for data intelligence purposes." Some data models need to be designed, calibrated, and used at this level. Those models should work almost in real time.
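A toy illustration of such a model: a weighted score that turns per-dimension signals into a "budgeting context" likelihood. The weights here are invented for illustration; in practice they would be calibrated offline on the patterns found at the previous step and applied at run time.

```python
# Toy "data intelligence" model: a weighted score turning per-dimension
# signals into a budgeting-context likelihood. Weights are invented here;
# in practice they would be calibrated offline and applied in real time.
WEIGHTS = {
    "email_to_cfo_zscore": 0.4,
    "calls_to_cfo_zscore": 0.25,
    "visits_to_cfo_office": 0.2,
    "skipped_lunches": 0.05,
    "late_evenings": 0.1,
}

def budgeting_score(features: dict) -> float:
    """Clip the weighted sum into [0, 1] as a rough likelihood."""
    raw = sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return max(0.0, min(1.0, raw))

print(budgeting_score({"email_to_cfo_zscore": 2.0, "visits_to_cfo_office": 1.0}))  # 1.0
```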

Fourth is Business Intelligence. Probably the first step familiar to the reader:) But we look further here: past data and real-time data meet together. Past data is individual to the business entity; real-time data is individual to the person. Of course, there can be something in between. Go find a comparison between statistics, data science, and business intelligence.

Fifth, finally, it is Analytics. Here we are within the individual context for the person. There should be a snapshot of 'AS-IS' and recommendations of 'TODO', and, if the individual wants, reasoning on 'WHY' and 'HOW'. The final destination is the individual context. I've described it in detail in the series of Advanced Analytics posts; here is the link to Part I.
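One possible shape for that final payload, just to make 'AS-IS', 'TODO', 'WHY' and 'HOW' tangible; all values below are invented.

```python
# One possible shape of the final analytics payload; all values invented.
context_report = {
    "as_is": {"context": "budgeting", "confidence": 0.87,
              "signals": ["email to CFO up 3x", "late evenings"]},
    "todo": ["review the draft budget before the next CFO meeting"],
    "why": "five dimensions jointly point at budget planning",
    "how": "the context score stayed above threshold for three days",
}
print(context_report["todo"])
```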

Data Streams

Data streams come from data sources, and the same source can produce multiple streams. Some ideas are below; the list is unordered. Remember that dedicated Data Intelligence must be put on top of the data from those streams.

Indoor positioning via Wi-Fi hotspots, contributing to the mobile/mobility/motion data stream: where the person spent most of the time (at the workplace, in meeting rooms, in the kitchen, in the smoking room), when the person changed location frequently, directions, durations, sequence, etc. (see the dwell-time sketch after this list).

Corporate communication via email, phone, chat, meeting rooms, peer-to-peer, source control, process tools, productivity tools. It all makes sense for analysis, e.g. because at the time of a release there should be no creation of new user stories; or consider the volume and frequency of check-ins to source control…

Biometric wearable gadgets like BodyMedia, to log the intensity of mental (or physical) work. If there is a low calorie burn during long bad meetings, that could be revealed. If there is not enough physical workload, then for the sake of better emotional productivity it could be suggested to take a walk.
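As promised above, here is a dwell-time sketch for the indoor positioning stream; the zone names and the five-minute sampling interval are assumptions.

```python
# Dwell time per zone from a time-ordered list of Wi-Fi zone observations.
# Zone names and the 5-minute sampling interval are assumptions.
from collections import Counter

SAMPLE_MINUTES = 5  # one observation every 5 minutes (assumed)

def dwell_minutes(observations: list) -> Counter:
    """observations: ["desk", "desk", "kitchen", ...] in time order."""
    return Counter({zone: n * SAMPLE_MINUTES
                    for zone, n in Counter(observations).items()})

print(dwell_minutes(["desk", "desk", "kitchen", "desk", "meeting room"]))
# Counter({'desk': 15, 'kitchen': 5, 'meeting room': 5})
```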

 

Data Graphs from Data Streams

OK, but how do we build something tangible from all those data streams? The relation between data graphs and data streams is many-to-many. Look: it is possible to build the Mobile Graph from very different data sources, such as face recognition from a camera, authentication at an access point, IP address, GPS, Wi-Fi, Bluetooth, check-ins, posts, etc. Hence, when designing the data streams for some graph, you should think in terms of one-to-many relations: one graph can use multiple data streams from corresponding data sources.
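A sketch of that one-to-many assembly: events from turnstile, camera, and Wi-Fi streams all feed one Mobile Graph. networkx is used for brevity, and the event shapes are illustrative assumptions.

```python
# Several streams, one Mobile Graph: each event links a person to a place,
# whatever the source. Event shapes are illustrative assumptions.
import networkx as nx

def build_mobile_graph(events: list) -> nx.MultiGraph:
    g = nx.MultiGraph()  # parallel edges keep every source's observation
    for e in events:
        g.add_edge(e["person"], e["place"],
                   source=e["source"], timestamp=e["timestamp"])
    return g

g = build_mobile_graph([
    {"person": "manager", "place": "lobby", "source": "turnstile", "timestamp": "09:01"},
    {"person": "manager", "place": "lobby", "source": "camera", "timestamp": "09:02"},
    {"person": "manager", "place": "CFO office", "source": "wifi", "timestamp": "11:40"},
])
print(g["manager"])  # places linked to the person, with per-stream metadata
```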

To bring more clarity into the relations between graphs and streams, here is another example: the Intention Graph. How could we build it? Somebody's intentions can be totally different in different contexts. Is it a weekday or the weekend? Is the person sitting in the office or driving a car? Who are the peers the person has communicated with a lot recently? What is the type of communication? What is the time of day? What are the person's interests? What were the previous intentions? As you see, data could be logged from machines, devices, comms, people, profiles, etc. As a result we will build the Intention Graph and will be able to predict or prescribe what to do next.
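Those questions translate naturally into a feature vector. The sketch below assembles one; every feature name and threshold is an assumption for illustration.

```python
# Assembling an intention feature vector from the questions above; every
# feature name and threshold is an assumption for illustration.
from datetime import datetime

def intention_features(now: datetime, speed_kmh: float,
                       recent_peers: list, interests: list) -> dict:
    return {
        "is_weekend": now.weekday() >= 5,
        "is_driving": speed_kmh > 20,  # crude motion heuristic
        "hour_of_day": now.hour,
        "top_peer": max(set(recent_peers), key=recent_peers.count)
                    if recent_peers else None,
        "interests": interests,  # e.g. taken from an Interest Graph
    }

print(intention_features(datetime(2014, 3, 8, 19, 30), 45.0,
                         ["wife", "wife", "colleague"], ["dining", "movies"]))
```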

 

Context from Data Graphs

Finally, having multiple data graphs, we can work on the individual context, the personal UX. Technically, it is hardly possible to deal with all those graphs at once: you cannot simply overlay two graphs, because each graph has its own modality (as one PhD taught me). Hence you must split the problem and work with a single modality. Select the graph that is most important for your needs and use it as the skeleton. Convert relations from the other graphs into attributes that you can apply to the primary graph. Build an intelligence model for the single-modality graph with plenty of attributes from the other graphs. Obtain the personal/individual UX at the end.
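A minimal sketch of that single-modality trick, assuming the Social Graph is chosen as the skeleton and the Interest Graph is the secondary modality being folded into node attributes:

```python
# Single-modality sketch: the Social Graph is the skeleton; relations from
# the Interest Graph become plain node attributes instead of extra edges.
import networkx as nx

social = nx.Graph()              # primary modality (skeleton)
social.add_edge("manager", "cfo")

interest = nx.Graph()            # secondary modality to be folded in
interest.add_edge("manager", "budgeting")
interest.add_edge("manager", "golf")
interest.add_edge("cfo", "budgeting")

for person in social.nodes:
    social.nodes[person]["interests"] = sorted(interest.neighbors(person))

print(social.nodes(data=True))
# [('manager', {'interests': ['budgeting', 'golf']}), ('cfo', {'interests': ['budgeting']})]
```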


Five Sources of Big Data

Some time ago I described how to think when you build solutions from Big Data in the post Six Graphs of Big Data. Today I am going to look in the opposite direction: where does Big Data come from? I see five distinctive sources of the data: Transactional, Crowdsourced, Social, Search, and Machine. All the details are below.

Transactional Data

This is good old data, most familiar and usual for geeks and managers. It's the plenty of RDBMSes, running or archived, on-premise and in the cloud. The majority of transactional data belongs to corporations, because the data was authored/created mainly by businesses. It was the golden era of Oracle and SQL Server (and some others). At some point the RDBMS technology appeared to be incapable of handling ever more transactional data, thus we got Teradata (and others) to fix the problem. But there was no significant shift in the way we work with those data sources. Data warehouses and analytic cubes are trending, but they have been in use for years already. Financial systems/modules of enterprise architectures will continue to rely on transactional data solutions from Oracle or IBM.

Crowdsourced Data

This data source emerged from an activity rather than from a type of technology. The phenomenon of Wikipedia confirmed that crowdsourcing really works. Much time has passed since Wikipedia's adoption by the masses… We got other fine data sources built by the crowds, for example OpenStreetMap, Flickr, Picasa, Instagram.

Interesting things happen with the rise of personal genetic testing (checking DNA for millions of known markers via 23andMe), which leads to public crowdsourced databases. More samples become available in other fields too, e.g. amateur astronomy. Volunteers do author useful data, and the size of the crowdsourced data keeps increasing.

What differentiates it from transactional/enterprise data? The price. Usually crowdsourced data is free to use, under one of the Creative Commons licenses. Often the motivation for creating such a data set is the digitization of our world or making a free alternative to paid content. With the rise of nanofactories we will see the growth of 3D models of every physical product. By using crowdsourced models, we will print the goods at home (or elsewhere).

Social Data

With the rise of Friendster–>MySpace–>Facebook and then others (LinkedIn, Twitter, etc.) we got a new type of data: Social. It should not be confused with Crowdsourced data, because its nature is completely different. Social data is the digitization of ourselves as persons and of our behavior, and it complements Crowdsourced data very well. Eventually there will be a digital representation of everyone… So far, social profiles are good enough for meaningful use. Social data is dynamic, and it is possible to analyze it in real time, e.g. putting tweets or Facebook posts through the Google Prediction API to grab emotions. I'm sure everybody intuitively understands this type of data source.
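As a toy stand-in for that emotion-grabbing idea (without a hosted prediction API), here is a lexicon-based polarity scorer; the word lists are illustrative only.

```python
# Toy stand-in for grabbing emotions from posts: a lexicon-based polarity
# score instead of a hosted prediction API. Word lists are illustrative.
POSITIVE = {"great", "love", "happy", "awesome"}
NEGATIVE = {"delayed", "hate", "angry", "terrible"}

def polarity(post: str) -> int:
    words = post.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(polarity("Flight delayed again, terrible day"))  # -2
print(polarity("Love this awesome conference"))        # 2
```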

Search Data

This is my favorite. It is not obvious to many of you, yet it is a really strong data source. Just recall how much you search on Amazon or eBay. How do you search on wikis (not to be confused with Wikipedia)? Quora gets plenty of search requests. StackOverflow is a good source of search data within information technology. There are intranet searches within Confluence and SharePoint. If those search logs are analyzed properly, the potential usefulness and business applications become clear; e.g. the Intention Graph and the Interest Graph are related to the search data.
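For instance, a first pass over an intranet search log could be as simple as counting recurring terms to hint at interests; the one-query-per-line log format is an assumption.

```python
# Mining an intranet search log for recurring interests; the
# one-query-per-line log format is an assumption.
from collections import Counter

def top_interests(queries: list, k: int = 3) -> list:
    terms = Counter()
    for q in queries:
        terms.update(q.lower().split())
    return [term for term, _ in terms.most_common(k)]

log = ["sharepoint permissions", "budget template", "budget approval workflow"]
print(top_interests(log))  # ['budget', 'sharepoint', 'permissions']
```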

There is a problem with “walled gardens” for search data… This problem is big, bigger than for social data, because public profiles are fully or partially available, while searches are kept behind the walls.

Machine Data

This is also my favorite. In the Internet of Things, every physical thing will be connected. New things are designed to be connectable; old things get connected via M2M. Consumers have adopted wearable technology. I've posted about it earlier; go to Wearable Technology and Wearable Technology, Part II.

The cost of data gathering is decreasing, the cost of wireless data transfer is decreasing, and the bandwidth of wireless transfer is increasing dramatically: Fraunhofer and KIT completed a 100 Gbps transmission, fourteen times faster than the most robust 802.11ac. The moral is: measure everything, just gather data until it becomes Big Data, then analyze it properly and operate proactively. Machine data is probably the most important data source for Big Data over the next years. We will digitize the world and ourselves via devices. OpenStreetMap got competitors: a fleet of eBees mapped the Matterhorn with a million spatial points. More is to be expected from machines.
