Big Data & Analytics
The Big Data Footprint, and its components seen here, allows our customers to get a simple, comprehensive view of the architectural requirements for a cognitive analytics and data-driven business. These components form the foundation that will help our customers in the journey to put data to work in every part of their business and, by doing so, to unlock new business possibilities.
Functional capabilities can be seen along the top of the diagram. Data sources, both internal and external, are shown at the left. Data flows from these sources, through the Ingestion & Integration component, into one or more analytical data repositories within the Analytical Data Lake Storage component. Data is then accessed through the Data Access component and passed to analytical tools which help generate insights from the data. These tools can take a number of forms, depending on the required task. Exploratory analytics for the purposes of discovering and exploring data will leverage the Discovery & Exploration component. More structured day-to-day reporting, dashboarding and analysis will leverage the tools in the Actionable Insight component.
There may also be analytical tools and models embedded into an enterprise’s operational systems for predictive and prescriptive analytics, and these will fall into the Enhanced Applications component.
There are other components that sit beneath these functional components and provide supporting capabilities. These include real-time streaming analytics, within the Analytics In-Motion component, a common data movement and analytics engine within the Analytics Operating System component, and a common management framework within the Information Management and Governance component. All of these components are in turn supported on a robust, common security framework, the Security component, and all can be deployed in cloud, on-premises or hybrid environments in the Platform component.
How does each area work?
Component Description: The Ingestion & Integration component focuses on the processes and environments that deal with the capture, qualification, processing, and movement of data in order to prepare it for storage in the Analytical Data Lake Storage component, which is subsequently shared with the Discovery & Exploration and Actionable Insight components via the Data Access component. The Ingestion & Integration component may process data in scheduled batch intervals or in near real-time/"just-in-time" intervals, depending on the nature of the data and the business purpose for its use.
Batch Ingestion: This capability can be leveraged to ingest and prepare structured or unstructured data in batch mode, either on-demand or at scheduled intervals.
The capability includes standard Extract, Transform and Load (ETL) and Extract, Load and Transform (ELT) paradigms, in addition to manual data preparation and movement.
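The ETL flow described above can be sketched as three small stages. This is an illustrative minimal sketch, not a real ETL tool; the field names (`customer_id`, `name`, `revenue`) and the in-memory "warehouse" are invented for the example.

```python
# Minimal batch ETL sketch (illustrative; field names are hypothetical).
# Extract rows from a source, transform them, and load them into a target store.

def extract(source_rows):
    """Extract: read raw records from a source system."""
    return list(source_rows)

def transform(rows):
    """Transform: standardize fields and drop records that fail basic checks."""
    cleaned = []
    for row in rows:
        if not row.get("customer_id"):
            continue  # reject records with no key
        cleaned.append({
            "customer_id": row["customer_id"],
            "name": row.get("name", "").strip().title(),
            "revenue": float(row.get("revenue", 0)),
        })
    return cleaned

def load(rows, target):
    """Load: append prepared records to the analytical repository."""
    target.extend(rows)
    return len(rows)

source = [
    {"customer_id": "C1", "name": "  acme corp ", "revenue": "1200.50"},
    {"customer_id": None, "name": "orphan"},          # fails the key check
    {"customer_id": "C2", "name": "globex", "revenue": "300"},
]
warehouse = []
loaded = load(transform(extract(source)), warehouse)
```

In an ELT variant, the `load` step would run before `transform`, with the transformation executed inside the target repository itself.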
Real-Time Ingestion: This capability can be leveraged to ingest and prepare structured or unstructured data arriving from source systems in real time or near real-time.
The capability deals with transactional data transmitted via a message hub or Enterprise Service Bus, as well as streaming data such as sensor data or video feeds.
Change Data Capture: This capability can be leveraged for replicating data in near real time to support data migrations, application consolidation, data synchronization, dynamic warehousing, master data management (MDM), business analytics and data quality processes.
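The idea behind change data capture can be illustrated with a toy snapshot-diff sketch. Real CDC products typically read database transaction logs rather than comparing snapshots; the diff approach below is only a simplified stand-in to show the insert/update/delete event flow into a replica.

```python
# Toy change-data-capture sketch (illustrative): compare successive snapshots
# of a source table and emit insert/update/delete events for a replica.

def capture_changes(previous, current):
    """Diff two {key: row} snapshots into a list of change events."""
    changes = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key in previous:
        if key not in current:
            changes.append(("delete", key, None))
    return changes

def apply_changes(replica, changes):
    """Replay captured change events against a replica table."""
    for op, key, row in changes:
        if op == "delete":
            replica.pop(key, None)
        else:
            replica[key] = row
    return replica

before = {1: {"status": "new"}, 2: {"status": "open"}}
after = {2: {"status": "closed"}, 3: {"status": "new"}}
events = capture_changes(before, after)
replica = apply_changes(dict(before), events)
```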
Document interpretation & classification: This capability streamlines the capture, recognition, and classification of business documents, to quickly and accurately extract important information from those documents for use by business users and in applications.
Data Quality Analysis: This capability allows data quality rules (stored and maintained in the Information Management & Governance component) to be applied to data during ingestion and transformation, and for quality measures to be stored as metadata associated with the data sets in the analytics environment.
Component Description: The Analytics In-Motion component includes capabilities related to real-time analytics on transactional or streaming data. Analytics in-motion refers specifically to the ability to act on business-relevant data immediately as it becomes available, no matter the size or velocity of the data.
Streaming analytics is the ability to perform analytics on data ingested in real-time. These analytics could be simple data filtering techniques and simple deviation models, or advanced analytics such as highly complex predictive algorithms. Data is typically analyzed over a window of time or over a particular number of incoming events.
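The window-based analysis described above can be sketched as a simple sliding-window deviation check: the model keeps the last N readings and flags any new value that deviates sharply from the window mean. The window size and threshold are illustrative choices, and real streaming engines would run this continuously over partitioned streams.

```python
# Sketch of windowed streaming analytics: keep a sliding window of the last
# N readings and flag values that deviate sharply from the window mean.
from collections import deque

def detect_anomalies(stream, window_size=5, threshold=2.0):
    """Yield values that deviate from the recent window by more than
    `threshold` times the window mean (a simple deviation model)."""
    window = deque(maxlen=window_size)
    anomalies = []
    for value in stream:
        if len(window) == window.maxlen:
            mean = sum(window) / len(window)
            if abs(value - mean) > threshold * max(mean, 1e-9):
                anomalies.append(value)
        window.append(value)
    return anomalies

readings = [10, 11, 9, 10, 10, 55, 10, 9]   # e.g. sensor values
alerts = detect_anomalies(readings)
```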
Complex event processing is the ability to correlate various (often non-related) events into a useful pattern to detect anomalies as compared to normally seen ranges of activities. Complex event processing is used in many industries and use cases including fraud in the financial sector, patient health scores in the healthcare sector, etc.
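A toy sketch of the fraud example: correlate otherwise unrelated events for the same account and raise an alert when a suspicious pattern appears. The pattern itself (a password reset followed by a large withdrawal within a short interval) and all field names are purely illustrative.

```python
# Toy complex-event-processing sketch: correlate events per account and flag
# a hypothetical fraud pattern (reset followed quickly by a large withdrawal).

def detect_fraud_pattern(events, max_gap=300):
    """events: list of (timestamp_seconds, account, kind, amount) tuples."""
    last_reset = {}
    alerts = []
    for ts, account, kind, amount in sorted(events):
        if kind == "password_reset":
            last_reset[account] = ts
        elif kind == "withdrawal" and amount > 1000:
            if account in last_reset and ts - last_reset[account] <= max_gap:
                alerts.append((account, ts))
    return alerts

stream = [
    (100, "acct-1", "password_reset", 0),
    (250, "acct-1", "withdrawal", 5000),   # 150 s after a reset -> alert
    (300, "acct-2", "withdrawal", 5000),   # no preceding reset -> no alert
]
fraud_alerts = detect_fraud_pattern(stream)
```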
Component Description: The Analytical Data Lake Storage component is a set of secure data repositories that allows data to be stored for subsequent consumption by analytics tools and users. These repositories form the heart of the analytics environment. The repositories within this component may vary from a single Hadoop repository or Enterprise Data Warehouse, to multiple repositories used for different purposes by different analytical tools. Note that operational and transactional data stores (such as OLTP, ECM, etc.) are not included in this component. Instead they form part of the Data Sources component.
Landing Zone is typically an initial location for data ingested from source systems. Data in the Landing Zone may be of varying types and formats. It may or may not be modeled or structured, and its quality may or may not be understood.
Data Archive is a repository where raw data is persisted for archive purposes. The Data Archive is typically optimized for high volume, high velocity write operations, rather than for read operations.
History is a repository that stores state-change histories, log data, etc. Such repositories are typically optimized for write operations, and are used for data that will not normally be accessed often or with real-time response requirements.
Deep Analytics is the application of sophisticated data mining and analysis techniques to yield insights from large, typically heterogeneous data sets. Deep Analytics repositories are typically optimized for low-cost storage of very large data volumes, without the need for real-time response.
Exploratory Analytics repositories are used to store shared, heterogeneous data for use by data scientists. Such repositories often bear much in common with Deep Analytics repositories, but are most often used to share the results of analyses for reuse by others.
Sandboxes are repositories used by individual data scientists or groups who need a temporary data repository to experiment and do quick analyses. Sandboxes are provisioned, populated, deleted and re-deployed more often than other data repositories.
Data Warehouses & Data Marts are analogous to the traditional Enterprise Data Warehouse and Data Marts, and are used to store data which will be read and analyzed frequently, with interactive real-time response requirements.
Predictive Analytics repositories are used to support operationalized predictive or prescriptive analytics use cases such as Next Best Action scenarios, where real-time response is a requirement to support interactive workloads.
Component Description: The Data Access component contains the various capabilities needed to interact with the Analytical Data Lake Storage component. These capabilities serve the access needs of Business Analysts and Application Developers who need access to data and to a lesser extent, Data Scientists and Data Engineers.
Self-Service is the ability for users, typically Business Analysts and Data Scientists, to ’shop for data’; to understand the data that is available, access that data, request the provisioning of new sandboxes and provision data into those sandboxes.
Data Virtualization describes any approach to data management that allows a user or application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted or where it is physically located.
Data Federation is often used in conjunction with data virtualization, but refers specifically to the ability to present a single queryable interface over multiple different datasets, giving business analysts and report writers a single, more usable view of the data.
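A minimal sketch of the federation idea: one query interface dispatched across several underlying sources, with results unified into one view. The "CRM" and "ERP" sources and their fields are invented; a real federation layer would push the predicate down to each source's native query engine rather than filter in memory.

```python
# Sketch of a single queryable interface over heterogeneous datasets
# (two hypothetical in-memory sources standing in for real systems).

crm_system = [{"id": 1, "name": "Acme", "source": "crm"}]
erp_system = [{"id": 2, "name": "Globex", "source": "erp"}]

def federated_query(predicate):
    """Run one predicate across all registered sources; return a unified view."""
    sources = [crm_system, erp_system]
    return [row for source in sources for row in source if predicate(row)]

all_customers = federated_query(lambda r: True)
globex_only = federated_query(lambda r: r["name"] == "Globex")
```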
APIs: With the ever-changing technology landscape there is a continued need to allow for multiple types of APIs, including Open Source and IBM-specific APIs, as access channels to the data lake repository. This capability represents that fluid list of technologies and solutions.
Component Description: The Discovery & Exploration component enables consumers of data to search the data in the analytics environment in order to find and evaluate data for their particular purposes. Data Scientists, Business Analysts, Data Engineers, and Application Developers need to quickly and easily search and access data, and to collaborate when exploring and analyzing the data.
Data science symbolizes a powerful approach to making discoveries. By bringing together the areas of applied math, statistics, computer science and visualization, data science can bring insight and structure to the digital era. Data scientists have brought this ability to the forefront, but it is the presence of collaboration and tools that will bring data science to the masses.
Search describes not only traditional search engines but also the use of natural language processing techniques to search, learn, and discover things about disparate data sets with both structured and unstructured content.
Component Description: The Actionable Insight component analyzes data from the Analytical Data Lake Storage component in a cohesive manner and derives insight that is meaningful and actionable for the business domain. A variety of techniques are used to derive these insights: visualization, intelligent queries on multidimensional data, statistical models, data mining, content analytics, artificial intelligence, machine learning, optimization and cognition.
Visualization and Storyboarding: Visualization helps users to analyze data by presenting the data in an intuitive manner. It makes complex data more accessible, understandable and usable. Storyboarding enables users to organize a series of visualizations in a sequence to effectively communicate an idea as a meaningful story.
Reporting, Analysis and Content Analytics refers to the use of business intelligence techniques and solutions to analyze both structured and unstructured data.
These solutions answer predefined business questions and utilize high-end visualizations to represent results in tabular views, graphs, charts, scorecards etc., and to assemble dashboards with key performance indicators (KPIs). Further, the content analytics capability augments the above mentioned capabilities and enables analysis of unstructured content such as text, images and video by utilizing analytical methods and techniques.
Decision Management applies prescriptive analytics to the decision-making process within organizations, using all available information to increase the precision, consistency and agility of decisions by taking known risks and constraints into consideration. Decision management makes use of business rules, statistical and optimization methods, and predictive analytics to automate the decision-making process.
Predictive Analytics and Modeling brings together advanced analytics capabilities spanning ad-hoc statistical analysis, predictive modeling, data mining, text analytics, entity analytics, optimization, real-time scoring, machine learning and more. IBM puts these capabilities into the hands of business users, data scientists, and developers.
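The real-time scoring capability mentioned above can be sketched with a tiny logistic model whose hand-set coefficients stand in for a trained predictive model. The feature names and weights are invented for illustration; in practice the model would be trained offline and deployed for scoring.

```python
# Sketch of operationalized real-time scoring: a tiny logistic model whose
# coefficients stand in for a trained model (features/weights are invented).
import math

WEIGHTS = {"recent_purchases": 0.8, "days_since_visit": -0.1}
BIAS = -1.0

def score(features):
    """Return a probability-like propensity score for one incoming record."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))   # logistic function

likely_buyer = score({"recent_purchases": 5, "days_since_visit": 2})
unlikely_buyer = score({"recent_purchases": 0, "days_since_visit": 60})
```

Embedding such a scoring function in an operational system is what moves it from the Actionable Insight component into the Enhanced Applications component.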
Cognitive Analytics simulates human thought processes in a computerized model. It involves self-learning abilities that use data mining, pattern recognition and natural language processing to mimic the way the human brain works. Actionable insight is derived as a result of the cognitive processing on the underlying domain data.
Insight as a Service is the collection of accessible data domains into composable services that can be utilized in advanced analytics applications or to push data back into data lake repositories.
Component Description: The Security component is critical to any analytics and data architecture. With a focus on data protection, there are specific capabilities to keep in mind: the ability to mask or hide data at a granular level for those who still need to interact with it, the ability to encrypt data at rest so it is protected from all users, the ability to know who accesses the data and why, and the ability to maintain an overall view of all these activities.
Data Masking and Redaction encompasses the ability to mask and hide data at a granular level (down to the individual attribute level). This is often used when generating test data, to maintain referential integrity while preventing users from seeing the original sensitive data. It is also often used in web-based applications to control who can see specific data attributes, based upon role and access credentials.
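A minimal sketch of attribute-level masking and role-based redaction. The roles, field names, and masking format are invented for illustration; real products apply such rules from centrally managed policies rather than hard-coded functions.

```python
# Sketch of attribute-level masking and redaction (roles/fields hypothetical):
# analysts see a partially masked value, unprivileged roles see a placeholder,
# and the original record is left untouched.

def mask_ssn(ssn):
    """Keep only the last four digits, e.g. for test-data generation."""
    return "***-**-" + ssn[-4:]

def redact_for_role(record, role):
    """Return a copy of the record with sensitive fields masked by role."""
    out = dict(record)
    if role == "admin":
        pass                      # full access
    elif role == "analyst":
        out["ssn"] = mask_ssn(out["ssn"])
    else:
        out["ssn"] = "[REDACTED]"
    return out

customer = {"name": "Ada", "ssn": "123-45-6789"}
analyst_view = redact_for_role(customer, "analyst")
guest_view = redact_for_role(customer, "guest")
```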
Data Encryption is often used as a way to protect data in both production and test environments, where the data is known to have sensitive attributes.
Data Protection is the ability to know where sensitive data is located, and to monitor, block, and report upon access and usage of that data.
Security Intelligence is the over-arching intelligence layer that includes a single automation and reporting platform to take in all the events, alerts, and other information from the other data security component capabilities. The Security Intelligence capability provides a single point of control and coordination for enterprise security activities.
Component Description: Information Management and Governance helps build confidence and trust in data by maintaining an accurate view of critical data, providing a standardized approach to discovering IT assets, and defining a common business language. The result is a higher degree of confidence, leading to better, faster decision-making and improved operational efficiency and competitive advantage.
Data Lifecycle Management provides a policy-based approach to managing the flow of data throughout its life cycle. Data Lifecycle includes records management, electronic discovery, compliance, storage optimization and data migration.
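The policy-based approach can be sketched as a mapping from data class to retention period, with an action decided per record. The data classes and retention periods below are illustrative only; real lifecycle policies also cover legal holds, archival tiers, and migration.

```python
# Sketch of policy-based data lifecycle management: each policy maps a data
# class to a retention period, and aged-out records are flagged to expire
# (classes and retention periods are invented for illustration).

RETENTION_DAYS = {"transaction": 365 * 7, "clickstream": 90, "temp": 7}
DEFAULT_RETENTION = 365

def lifecycle_action(record, today):
    """Decide what to do with a record based on its class and age in days."""
    age = today - record["created_day"]
    limit = RETENTION_DAYS.get(record["data_class"], DEFAULT_RETENTION)
    return "expire" if age > limit else "retain"

records = [
    {"id": 1, "data_class": "clickstream", "created_day": 0},
    {"id": 2, "data_class": "transaction", "created_day": 0},
]
actions = [lifecycle_action(r, today=100) for r in records]
```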
Master and Entity Data provides a 'single source of the truth' for critical business entities to users and applications. Master Data Management reconciles conflicting, redundant and inconsistent data from various systems.
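The reconciliation idea can be sketched as merging conflicting records into one "golden record". The survivorship rule used here (prefer the most recently updated non-empty value per attribute) is just one common illustrative choice; real MDM products support configurable matching and survivorship policies.

```python
# Sketch of master data reconciliation: merge conflicting records from
# multiple systems into one golden record, preferring the most recently
# updated non-empty value per attribute (survivorship rule is illustrative).

def golden_record(records):
    """records: list of dicts with an 'updated' timestamp plus attributes."""
    merged = {}
    freshness = {}
    for rec in records:
        ts = rec["updated"]
        for field, value in rec.items():
            if field == "updated" or value in (None, ""):
                continue  # skip the timestamp and empty values
            if field not in merged or ts > freshness[field]:
                merged[field] = value
                freshness[field] = ts
    return merged

crm = {"updated": 1, "name": "ACME Corp", "phone": "555-0100", "email": ""}
erp = {"updated": 2, "name": "Acme Corporation", "phone": None}
master = golden_record([crm, erp])
```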
Reference Data provides consistent values for common data elements and attributes, such as currency conversion rates. This capability supports defining and managing reference data as an enterprise standard.
Data Catalog provides comprehensive capabilities to help understand data and foster collaboration across the enterprise. The catalog provides a foundation for information integration and governance projects.
Data Models contain data warehouse design models, business terminology models and analysis templates to accelerate the development of business intelligence applications.
Data Quality Rules provide the basis for managing data quality. Rules are defined and maintained within this component, but are typically applied in the Ingestion and Integration component to measure quality and cleanse data.