Big Data Explained - what it is and what it does

Content

– Characteristics of Big Data

– Big Data Definitions

– The DIKW Pyramid – From Data to Information to Knowledge to Wisdom

– Examples of Big Data Analytics

• Big Data and Climate Simulations

• Big Data and Transportation

• Big Data in the Financial Services Sector

• Big Data and Materials Science

One implication of ubiquitous sensing, as evidenced by the rapid growth of the Internet of Things (IoT), is the explosive increase in data being collected. According to predictions, the number of IoT-connected devices will grow dramatically to 75 billion in 2025 and a staggering 125 billion by 2030. At that point, there will be almost 15 things connected to the Internet for each human on Earth.

In our interconnected and mobile world, data sources are becoming more complex than those for traditional data because they are being driven by artificial intelligence (AI), mobile devices, social media and the IoT. For example, data now originates from IoT and other sensors and devices, financial transactions, electronic health records, electronic government records, video/audio, networks, log files, transactional applications, web and social media – much of it generated in real time and at a very large scale.

In short: data is everywhere. But it is not enough to have big data sets – you also need the proper tools to make sure they are useful and to get actionable knowledge out of them. Unless it is possible to clearly identify which data points, and the connections between them, are actually important, Big Data will provide nothing but noise.

Big Data is about harvesting raw data from multiple, disparate sources, storing it for use by analytics programs, and using it to derive value in entirely new ways. In other words, Big Data is not about the data – it is about the value that can be extracted from the data, i.e., the meaning contained in the data.

This means that no single technology can be called Big Data. Instead, it requires a tightly coordinated ecosystem of data acquisition, storage, and application technologies to make it work.

Characteristics of Big Data

Traditional data architectures are not able to deal with these data sets. Actually, the term Big Data seems to imply that other data is somehow small (it isn’t) or that the key issue in dealing with it is its massive size. However, the characteristics of Big Data that require new architectures go beyond just volume (the size of the digital universe, i.e., all digital data created in the world, is estimated to be around 40 zettabytes – 40 trillion gigabytes – in 2020):

Variety (i.e., data from multiple repositories, domains, or types). Structured data is that which can be organized neatly within the columns of a database. This type of data is relatively easy to enter, store, query, and analyze. Unstructured data is more difficult to sort and extract value from. Examples of unstructured data include emails, social media posts, and word-processing documents; audio, video and photo files; and web pages.

Velocity (i.e., rate of flow). Every second, Google receives almost 100,000 searches; about the same number of YouTube videos are watched; more than 1,000 photos are uploaded to Instagram; almost 10,000 tweets are posted; and more than 3 million emails are sent.

Variability (i.e., the change in other characteristics). Data's meaning is constantly changing. For example, language processing by computers is exceedingly difficult because words often have several meanings. Data scientists must account for this variability by creating sophisticated programs that understand context and meaning.

These characteristics – volume, variety, velocity, and variability – are known colloquially as the Four Vs of Big Data.
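The volume and velocity figures quoted above are easier to grasp with a little arithmetic. The following Python sketch converts zettabytes to gigabytes and scales the per-second rates to a full day. It assumes decimal (power-of-ten) storage units and treats the rounded figures from the text as exact; the function and variable names are illustrative only.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
ZETTABYTE = 10**21
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert zettabytes to gigabytes using power-of-ten units."""
    return zettabytes * (ZETTABYTE // GIGABYTE)


def per_day(rate_per_second: int) -> int:
    """Scale a per-second event rate to a full day."""
    return rate_per_second * SECONDS_PER_DAY


# Rounded per-second figures quoted in the text ("almost" / "more than").
events_per_second = {
    "Google searches": 100_000,
    "Instagram photo uploads": 1_000,
    "tweets": 10_000,
    "emails": 3_000_000,
}

print(f"40 ZB = {zettabytes_to_gigabytes(40):,} GB")
for name, rate in events_per_second.items():
    print(f"{name}: roughly {per_day(rate):,} per day")
```

At these rates, 100,000 searches per second add up to more than 8.6 billion searches per day, which illustrates why velocity alone can overwhelm a conventional data pipeline.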

Big Data practitioners are also proposing additional Vs, such as:

Veracity (i.e., the quality of data). If source data is not correct, analyses will be worthless.

Visualization (i.e., the meaning of the data). Data must be understandable to nontechnical stakeholders and decision makers. Visualization is the creation of complex graphs that tell the data scientist’s story, transforming the data into information, information into insight, insight into knowledge, and knowledge into advantage. A great example is the graphics that often accompany stories on climate change: they are pictures that are worth more than a billion data points.

Value (i.e., opportunities and savings). Ultimately, the entire point of Big Data is to improve decision-making by organizations.

Big Data Definitions

Several definitions of Big Data have been proposed, including ‘extremely large data sets’; ‘extensive datasets that require a scalable architecture for efficient storage, manipulation, and analysis’; and ‘the exponential increase and availability of data in our world’.

The term Big Data describes the massive amounts of data being collected in today’s networked, digitized, sensor-populated, and information-driven world and the tools used to analyze and extract information from these large and complex data sets.

The growth rates for the volume, speed, and complexity of Big Data overwhelm conventional data processing software. That’s why improved and entirely new analytical techniques and processes are being developed and continuously refined. This includes the areas of data capture, data storage, data analysis, search, sharing, transfer, visualization, querying, updating, information privacy, and data source.

In this context, the term Big Data Analytics describes the process of applying serious computing power – the latest in machine learning and artificial intelligence – to massive and highly complex sets of information.

One important concept in Big Data is metadata, which is often described as ‘data that describes other data’, for instance how and when data was collected, how it has been processed, or how it is linked to other data.
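To make the idea of metadata concrete, here is a minimal Python sketch of a measurement that carries its own description: when and by what it was collected, how it has been processed, and which other data sets it is linked to. The class names, field names and the sample values are hypothetical and chosen only for illustration; real metadata schemas are usually far richer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Metadata:
    """Data that describes other data."""
    collected_at: datetime                                      # when the data was collected
    collected_by: str                                           # how / by which device
    processing_steps: List[str] = field(default_factory=list)   # how it has been processed
    linked_datasets: List[str] = field(default_factory=list)    # links to other data


@dataclass
class Measurement:
    value: float
    unit: str
    metadata: Metadata


def record_processing(measurement: Measurement, step: str) -> None:
    """Append a processing step so the history of the value stays traceable."""
    measurement.metadata.processing_steps.append(step)


# Hypothetical example record.
reading = Measurement(
    value=21.5,
    unit="degC",
    metadata=Metadata(
        collected_at=datetime(2020, 1, 1, 12, 0, tzinfo=timezone.utc),
        collected_by="sensor-42",
        linked_datasets=["city-weather-2020"],
    ),
)
record_processing(reading, "calibrated")
print(reading.metadata.processing_steps)
```

Keeping the processing history next to the value is one simple way to let later analyses judge how far a data point can be trusted.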

The DIKW Pyramid – From Data to Information to Knowledge to Wisdom

The DIKW pyramid refers to a class of models for representing purported structural and/or functional relationships between data, information, knowledge, and wisdom. The DIKW model is used to describe methods for problem-solving or decision making. Although developed in the early days of computers, it still models many concepts used in data science and machine learning.

(Source: Mind Map created by G. Wagenmaker)

Data is usually just a collection of raw facts, often gathered from various sources and in multiple formats, that is of little use unless it is analyzed and organized. For example, images and videos can hold a lot of data that requires interpretation before information can be extracted from it.

Information is gained from data by consistently organizing, structuring and contextualizing raw data according to users’ requirements. This makes information more valuable than raw data. Essentially, information is found in answers to questions that begin with such words as who, what, where, when, and how many.

One key aspect of knowledge is the application of information to answer a question or solve a problem. Combined with past experience and insights, know-how, and skills, contextualized information is key to gaining knowledge. Knowledge is the most valuable distillation of data, and although knowledge gives you the means to solve a problem, it doesn’t necessarily show you the best way to do so.

The ability to pick the best way to reach the desired outcome comes from experience gained in earlier attempts to reach a successful solution. The DIKW model describes this ability as wisdom. People gain wisdom through experience and knowledge, some of which comes from developing an understanding of problem-solving methods and gathering intelligence from other people solving the same problems.
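The first three levels of the pyramid can be sketched in a few lines of Python. In this hypothetical example, raw sensor readings (data) are organized into per-sensor averages (information), which are then applied to a concrete question (knowledge). The sensor names, the readings and the 25 °C limit are assumptions made for illustration; wisdom, i.e., choosing the best response, is deliberately left to the human reader.

```python
from collections import defaultdict
from statistics import mean

# Data: raw, unorganized facts (hypothetical sensor id, hour of day, temperature in degC).
raw_readings = [
    ("sensor-a", 9, 18.0),
    ("sensor-b", 9, 24.5),
    ("sensor-a", 15, 22.0),
    ("sensor-b", 15, 31.5),
]


def to_information(readings):
    """Organize raw readings into an average temperature per sensor."""
    by_sensor = defaultdict(list)
    for sensor, _hour, temperature in readings:
        by_sensor[sensor].append(temperature)
    return {sensor: mean(values) for sensor, values in by_sensor.items()}


def to_knowledge(information, limit=25.0):
    """Apply the information to a question: which sensors run hotter than the limit?"""
    return sorted(sensor for sensor, average in information.items() if average > limit)


information = to_information(raw_readings)
print(information)
print(to_knowledge(information))
```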

Examples of Big Data Analytics

There are many areas and industries where Big Data Analytics already plays a role, and the examples are too numerous to be covered here. So we showcase just a few to give you an idea of the kind of impact Big Data already has.

Big Data and Climate Simulations

One of the most data-intensive scientific disciplines is planetary climate simulation. As scientists refine these models in order to describe the intricacies of the Earth’s climate system in as much detail and as precisely as possible, the amount and complexity of the associated data are growing exponentially.

Climate models, also known as earth system models, work by representing the physics, chemistry, and biology of the climate system as mathematical equations. These equations are solved on a three-dimensional grid with cells representing the atmosphere, land, and ocean.

Current climate models divide Earth into cells to project changes into the future; here’s what a cell representing the atmosphere might look like when greater detail is added through next-generation AI methods. (Image: Columbia University)

Most earth system models run on supercomputers, but they require even more computing power than scientists have available. This limits the size of the cells in their 3D grid (see image above). In current models, a cell typically measures 80-100 km on each side, with one value per cell representing a single variable like temperature, cloud cover, or rainfall.

To bring greater precision to climate modeling and encourage societies to prepare for the inevitable disruptions ahead, the U.S. National Science Foundation (NSF), for instance, has selected Columbia University to lead a climate modeling center called Learning the Earth with Artificial Intelligence and Physics (LEAP).
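A rough calculation shows why cell size is such a hard limit. The Python sketch below estimates how many cells are needed to cover Earth's surface once, for a given cell edge length. It assumes a total surface area of about 510 million square kilometers (a figure not taken from this article) and idealized square cells of equal area; real models use latitude-longitude or other grids and stack many vertical layers, so treat the numbers as order-of-magnitude estimates only.

```python
EARTH_SURFACE_KM2 = 510_000_000  # approximate total surface area; an assumption, not from the article


def cells_per_layer(cell_side_km: float) -> int:
    """Estimate the number of square cells needed to tile Earth's surface once."""
    return round(EARTH_SURFACE_KM2 / cell_side_km**2)


for side_km in (100, 50, 10, 1):
    print(f"{side_km:>3} km cells: about {cells_per_layer(side_km):,} per layer")
```

Halving the edge length quadruples the number of cells in every layer, and each cell holds a value for every variable at every time step, which is why finer grids quickly exhaust the available computing power.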

The volume of worldwide climate data is expanding rapidly, creating challenges for both physical archiving and sharing, as well as for ease of access and finding what’s needed, particularly if you are not a climate scientist. The figure below shows the projected increase in global climate data holdings for climate models, remotely sensed data, and in situ instrumental/proxy data.

(Source: doi:10.1126/science.1197869)

Climate models are based on well-documented physical processes to simulate the transfer of energy and materials through the climate system. These models use mathematical equations to characterize how energy and matter interact in different parts of the ocean, atmosphere, and land.

Building and running a climate model is a complex process of identifying and quantifying Earth system processes, representing them with mathematical equations, setting variables to represent initial conditions and subsequent changes in climate forcing, and repeatedly solving the equations using powerful supercomputers.

Framework of big data in climate change studies. (Source: doi:10.3390/bdcc3010012)

Big Data and Transportation

Big data and the IoT work in conjunction. Huge amounts of unstructured data extracted from the sensors embedded in IoT devices provide the basis for sophisticated DIKW-type problem-solving and decision-making processes in order to improve products and services across many industries. Examples:

Logistics. UPS’s vaccine tracking technology uses a GPS-enabled device to monitor COVID-19 vaccines in transit, by shipment. This device transmits data about factors that could delay or damage sensitive healthcare shipments like vaccines: location, temperature, motion and shock, light exposure (open box), atmospheric pressure and remaining battery life for the device. These details are transmitted in real time to the UPS Healthcare command center, a 24/7 monitoring center dedicated to safeguarding the timely delivery of vaccine shipments and other critical healthcare packages. These sensors also help ensure that vaccine shipments and other critical healthcare packages receive priority placement when loaded on planes, trailers and delivery trucks.

Traffic Management. Smart traffic systems that are deployed in Smart Cities require extensive sensor networks that create huge volumes of data on traffic flow and public transit systems. These systems gather data from thousands of traffic cameras, road detectors, traffic lights, parking meters, air quality and other sensors, mobility apps and connected cars.

This data can then be used to make traffic flow more efficient, reduce congestion and, longer-term, help city planners address bottlenecks. Citizens also benefit from open data through real-time access to traffic information so that they can better plan their journeys and avoid congestion. Real-time navigation alerts drivers to delays and helps them choose the fastest route. Smart parking apps point them directly to available spots, eliminating time spent fruitlessly circling city blocks. Emergency services benefit from systems that monitor traffic in real time so that accidents and disruptions can be handled immediately. For instance, by optimizing emergency call dispatching and synchronizing traffic lights for emergency vehicles, cities can cut emergency response times by 20–35 percent.

A key challenge of smart cities is the need to process extremely large amounts of complex and geographically distributed sources of data (citizens, traffic, vehicles, city infrastructures, IoT devices, etc.), combined with the additional need to deal with this information in real time.

These systems require new approaches to Big Data management. For instance, the European CLASS project developed a novel software architecture framework to design, deploy and execute distributed big data analytics with real-time constraints for smart cities, connected cars and future autonomous vehicles.

Aircraft Safety and Maintenance. Sensors are found across wings, in engines, in passenger and cargo compartments; practically every square centimeter of an airliner is brimming with sensors that monitor everything from engine performance to how often reading lights are activated. For instance, the latest Airbus A350 has 50,000 sensors on board collecting 2.5 terabytes of data every day it operates. Engine data is among the most complex, and the thousands of sensors in each modern aircraft engine feed data into AI-embedded maintenance and engineering systems that allow operators to act and solve problems immediately.

These systems are able to harvest data from aircraft operations automatically and then update maintenance programs. As a result, life-limited engine part maintenance deadlines can be updated based on actual operating conditions and life consumed by each engine in use. In addition, by monitoring every operational aspect of an aircraft (fleet), airlines have already saved millions in fuel costs, improved routes and safety, and learned to reallocate ground resources so when flights are delayed, backup plans can be automatically triggered.
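The aircraft figures quoted above can be broken down with simple arithmetic. The following Python sketch computes the average daily data volume per sensor and the average sustained data rate for an aircraft producing 2.5 terabytes per day from 50,000 sensors. It assumes decimal units and an even spread of data across sensors and across the day, which is a simplification: in practice engine sensors produce far more data than, say, reading-light sensors.

```python
TERABYTE = 10**12
MEGABYTE = 10**6
SECONDS_PER_DAY = 24 * 60 * 60


def per_sensor_megabytes(daily_terabytes: float, sensors: int) -> float:
    """Average daily data volume per sensor, in megabytes."""
    return daily_terabytes * TERABYTE / sensors / MEGABYTE


def average_rate_megabytes_per_second(daily_terabytes: float) -> float:
    """Average sustained data rate over a full day, in megabytes per second."""
    return daily_terabytes * TERABYTE / SECONDS_PER_DAY / MEGABYTE


print(f"{per_sensor_megabytes(2.5, 50_000):.1f} MB per sensor per day")
print(f"{average_rate_megabytes_per_second(2.5):.1f} MB/s on average")
```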

Big Data in the Financial Services Sector

Financial services have always been a data intensive industry, from the vast amount of credit card transactions to credit scoring and fraud detection. To give you an idea of the scope, global credit cards generated about 441 billion purchase transactions in 2019.

The main areas where financial services companies apply Big Data are:

Security and fraud detection: Big secondary data such as transaction records is monitored and analyzed to enhance banking security and to detect unusual behavior and patterns that indicate fraud, phishing, or money laundering, among others.

Risk management: Analysis of the in-house credit card data that banks have ready access to enables credit scoring and credit granting, which are among the most popular tools for risk management and investment evaluation.

Customer relationship management: Big Data techniques have been widely applied in banking for marketing and customer relationship management related purposes such as customer profiling, customer segmentation, and cross/up selling. These help institutions get a better understanding of their customers, predict customer behavior, accurately target potential customers and further improve customer satisfaction with a strategic service design.
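As an illustration of the fraud-detection use case listed above, the Python sketch below flags a transaction whose amount deviates strongly from a customer's past spending. It uses a plain z-score test, which is only one simple stand-in for the far more sophisticated models banks actually deploy; the function name, the threshold of three standard deviations and the sample amounts are assumptions made for this example.

```python
from statistics import mean, stdev


def is_unusual(history, amount, threshold=3.0):
    """Flag `amount` if it lies more than `threshold` standard deviations
    from the mean of the customer's past transaction amounts."""
    if len(history) < 2:
        return False  # not enough history to judge
    average = mean(history)
    spread = stdev(history)
    if spread == 0:
        return amount != average
    return abs(amount - average) / spread > threshold


# Hypothetical past card transactions for one customer.
past_amounts = [20.0, 25.0, 22.0, 19.0, 24.0]

print(is_unusual(past_amounts, 23.0))   # a typical purchase
print(is_unusual(past_amounts, 500.0))  # far outside the usual range
```

A real system would combine many more signals, such as location, merchant type and time of day, and would have to run on the transaction stream in real time.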

Big Data and Materials Science

Materials innovation is key to addressing the most pressing challenges, from global climate change to future energy sources. However, reliance on trial and error and the lack of systematic data have significantly hampered breakthrough discoveries in materials research.

Creating new materials is not as simple as dropping a few different elements into a test tube and shaking it up to see what happens. You need the elements that you combine to bond with each other at the atomic level to create something new and different rather than just a heterogeneous mixture of ingredients. With a nearly infinite number of possible combinations of the various squares on the periodic table, the challenge is knowing which combinations will yield such a material.

In an effort to overcome this, in 2011 the U.S. government launched the Materials Genome Initiative to develop novel scalable approaches to discover, manufacture, and deploy advanced materials twice as fast, at a fraction of the cost.

This initiative leverages large-scale collaborations between materials and computer scientists and harnesses the power of supercomputers and customized machine-learning algorithms based on state-of-the-art quantum mechanical theory to apply computational methods to screen and optimize material properties at an unparalleled scale and rate.

For example, high-throughput computational screening has been successfully used to predict phase diagrams of multicomponent crystals and alloys, the performance of lithium-based batteries, the nonlinear optical response in organic molecules, the current-voltage characteristics of photovoltaic materials, electrode transparency and conductivity for solar cells, and amalgamation enthalpy.

The number of possible materials is increasing exponentially, along with their intrinsic structural complexity, making even the application of efficient density functional theory infeasible. In nanotechnology, things are even more complicated: The enormous complexity of nanomaterials arises from the sheer vastness of potential combinatorial variations that can be developed by choosing different nanomaterial size (including agglomeration and aggregation), solubility and dispersibility, chemical form, chemical reactivity, surface chemistry, shape, and porosity.

Unexpected variability in shape can have detrimental effects on the behavior of nanoparticles and their functional properties. This represents a tremendous challenge because the selection of experimentally significant samples becomes increasingly difficult and requires knowledge of the relevant sizes, shapes, and structural complexity a priori.

Big Data combined with data mining and statistical methods can tackle this problem.

For instance, researchers at Osaka University employed machine learning to design new polymers for use in photovoltaic devices. After virtually screening over 200,000 candidate materials, they synthesized one of the most promising and found its properties were consistent with their predictions. To do that, they screened hundreds of thousands of donor:acceptor pairs based on an algorithm trained with data from previously published experimental studies. Trying all possible combinations of 382 donor molecules and 526 acceptor molecules resulted in 200,932 pairs that were virtually tested by predicting their energy conversion efficiency.
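The combinatorics behind this screening are easy to reproduce. The Python sketch below enumerates every donor:acceptor pair and keeps the highest-scoring ones. The molecule labels are placeholders, and `predict_efficiency` stands for whatever trained model is used; it is a hypothetical parameter here, not the researchers' actual algorithm.

```python
import heapq
from itertools import product


def candidate_pairs(donors, acceptors):
    """Yield every possible donor:acceptor combination."""
    return product(donors, acceptors)


def top_candidates(donors, acceptors, predict_efficiency, k=5):
    """Score every pair with the supplied predictor and return the k best
    as (score, donor, acceptor) tuples, highest score first."""
    scored = (
        (predict_efficiency(donor, acceptor), donor, acceptor)
        for donor, acceptor in candidate_pairs(donors, acceptors)
    )
    return heapq.nlargest(k, scored, key=lambda item: item[0])


# Placeholder labels standing in for the 382 donor and 526 acceptor molecules.
donors = [f"D{i}" for i in range(382)]
acceptors = [f"A{j}" for j in range(526)]

pair_count = sum(1 for _ in candidate_pairs(donors, acceptors))
print(f"{pair_count:,} pairs to screen")
```

Because the pairs are generated lazily and only the best k are kept, the full list of combinations never has to be held in memory, which matters once the candidate pools grow beyond a few hundred molecules each.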
