By Dr. Muthukumaran B, Practice Head – Big Data, HTC Global Services
HTC Global Services is a leading global information technology (IT) and business process outsourcing (BPO) service and solution provider with more than 25 years of market presence and experience. Dr. Muthukumaran B has deep experience and expertise across various IT domains and heads the Research & Practice division at HTC Global Services.
Big Data architecture rests on the ability to design reliable, scalable, and fully automated data pipelines that manage data streams and derive intelligence from them. Big Data calls for the adoption of new-generation technologies, such as Hadoop's distributed processing capabilities and advanced analytics platforms, to meet the challenges of data streams and the complexities associated with unstructured data.
Different data platforms generate data differently, which creates data variety and data velocity. Data streams generated by these platforms deliver data elements very rapidly and are unbounded in length. They are relevant to a variety of application domains, such as sensor processing, network monitoring, and financial analysis. Data streams are seen as processes in which events occur continuously and independently of each other. Data elements come in various formats, such as JSON and XML, and the processing elements need to handle this variety. The elements need to be processed in real time; otherwise the opportunity to extract tangible business benefit from them is lost. The capability to manage data streams therefore becomes an essential application requirement.
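As an illustration of handling format variety, the following minimal sketch normalizes JSON and XML elements into one common record shape. The function name `parse_element`, the format detection by first character, and the sample sensor readings are assumptions made for this example; a production pipeline would normally rely on declared content types or schemas.

```python
import json
import xml.etree.ElementTree as ET


def parse_element(raw: str) -> dict:
    """Turn one raw stream element (JSON or XML text) into a flat dict."""
    text = raw.strip()
    if text.startswith("{"):
        # Assumption: anything that opens with a brace is a JSON object.
        return json.loads(text)
    if text.startswith("<"):
        # Assumption: anything that opens with an angle bracket is XML.
        # Each direct child of the root becomes one field of the record.
        root = ET.fromstring(text)
        return {child.tag: child.text for child in root}
    raise ValueError("unsupported element format")


# Hypothetical sensor readings arriving in two different formats.
json_reading = '{"sensor": "s1", "temp": "21.5"}'
xml_reading = "<reading><sensor>s2</sensor><temp>19.0</temp></reading>"

records = [parse_element(json_reading), parse_element(xml_reading)]
```

Downstream processing can then work on `records` without caring which format each element arrived in.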
The data stream model of computation imposes a limitation on the data: once a data element has streamed past, it cannot be revisited for further calculation. This limitation means that ad hoc queries issued after some data has already been discarded may be impossible to answer accurately. The continuous data stream model is most applicable to problems where timely query responses are important and there are large volumes of data that are being continually produced at a high rate over time. Because new streaming data is processed alongside older data for analytics, the computation time per data element must be low. Furthermore, since these systems are limited by available memory, producing exact answers is difficult and time consuming, so a trade-off has to be worked out. The usual approach is to accept approximate answers as viable solutions.
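One way to picture this trade-off is a single-pass summary that uses a fixed amount of memory and returns approximate counts. The sketch below uses the Misra-Gries frequent-items algorithm, which is chosen here purely for illustration and is not named in the article. The function name `frequent_items` and the sample stream are assumptions.

```python
def frequent_items(stream, k):
    """Misra-Gries summary: one pass, at most k - 1 counters in memory.

    Every item that occurs more than n / k times in a stream of n
    elements is guaranteed to survive in the result. The stored counts
    are underestimates, each off by at most n / k.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# Hypothetical stream in which "a" accounts for half of the elements.
events = list("aababcadae")
summary = frequent_items(events, k=3)
```

Each element is seen exactly once and then discarded, and memory use depends on `k` rather than on the length of the stream.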
Two standard techniques are adopted to balance business benefit against processing time: data reduction and sliding windows. From a business point of view, sliding windows and their variants are adopted for decision requirements. Imposing sliding windows on data streams is a natural method of approximation with several attractive properties: it is well defined and easily understood. Another technique for deriving approximate answers is to give up processing every data element as it arrives and instead use some form of sampling or batch processing to speed up query execution.
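Both ideas can be sketched in a few lines. The example below shows a count-based sliding window that maintains a running mean, and reservoir sampling (Algorithm R) as one possible sampling technique. The class and function names are assumptions for this example, and the article does not prescribe either specific algorithm.

```python
import random
from collections import deque


class SlidingWindowMean:
    """Mean over the most recent `size` elements (size must be >= 1)."""

    def __init__(self, size):
        self.window = deque(maxlen=size)
        self.total = 0.0

    def add(self, value):
        if len(self.window) == self.window.maxlen:
            # The oldest element is about to be evicted by append().
            self.total -= self.window[0]
        self.window.append(value)
        self.total += value
        return self.total / len(self.window)


def reservoir_sample(stream, k, rng=random):
    """Uniform random sample of k elements from a stream of unknown length."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k elements.
            sample.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


window = SlidingWindowMean(size=3)
means = [window.add(v) for v in [1, 2, 3, 4, 5]]
picked = reservoir_sample(range(1000), k=5, rng=random.Random(0))
```

The window answers "what is the average over the last three elements?" at every step, while the reservoir keeps a fixed-size sample that a slower query can run against.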
Data Streams versus Data
Conventional data is stored in databases, and data streams differ from such regular data stores. Several differences from the conventional stored relational model are worth outlining. Streaming data elements arrive online and in real time, with no knowledge of future events. The streaming data system has no control over the order in which data elements arrive for processing, either within a data stream or across data streams, which in turn opens up many new possibilities and visualizations. Data elements originating from data streams are usually processed and then discarded or archived, depending on individual cases and requirements. An element cannot be retrieved easily unless it is explicitly stored in memory, which is typically small relative to the size of the data streams.
Streaming data is usually captured through digital data pipes built around publish-subscribe frameworks for processing and generating insights. Ingesting data into the data platform is a challenge because data arrives in batches or in real time, in a spectrum of formats, from various verticals such as manufacturing, banking, and healthcare. Early data ingestion designs focused more on batch processing. Ingestion from structured data sources is typically handled with Apache Sqoop, which uses MapReduce as its workhorse to move data between Hadoop and an RDBMS. VoltDB is adopted for ingesting and interacting with unbounded streams of fast inbound data. The batch ingest method of file management has turned out to be a complex approach with trade-offs. The choice of specific technologies and techniques is strategic and depends on the properties of the data.
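The publish-subscribe pattern behind such data pipes can be illustrated with a small in-memory stand-in. This is a sketch only: the `Broker` class, the topic names, and the messages are hypothetical, and a real deployment would use a dedicated messaging system with persistence and delivery guarantees.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a publish-subscribe message broker."""

    def __init__(self):
        # Maps each topic name to the handlers subscribed to it.
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Deliver the message to every handler registered for the topic.
        for handler in self.subscribers[topic]:
            handler(message)


broker = Broker()
ingested = []
audited = []

# Two independent consumers of the same hypothetical "payments" topic.
broker.subscribe("payments", ingested.append)
broker.subscribe("payments", lambda msg: audited.append(msg["id"]))

broker.publish("payments", {"id": 1, "amount": 250})
broker.publish("sensors", {"id": 7})  # no subscribers, so it is dropped
```

Producers and consumers only share a topic name, which is what lets new ingestion consumers be added without changing the data sources.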
Data ingestion pipes are a critical component of any production environment. The data pipeline serves as the inlet that takes in raw data and produces insights as its outcome. Based on HTC’s experience with data ingestion across various data platforms, some organizations run more than eight types of ingestion pipelines, which creates challenges around design, data quality, metadata management, development, and operations. Querying data streams is quite different from querying in the conventional relational model.
Stream Processing Solutions
The proper interpretation of a stream varies from application to application. Stream processing solutions are designed to handle high volumes in real time with a scalable, highly available, and fault-tolerant architecture. They are designed to solve challenges such as processing massive amounts of streaming events in order to respond in real time to changing market conditions, which enables live data discovery and monitoring, continuous query processing, and automated alerts and reactions. Stream processing is adopted to process fast and continuously changing data, and it has become a mandate in every vertical. A variety of frameworks and products are available on the market.