Managing Massively Parallel Processing (MPP) Database Environments
MPP databases are typically used in major companies for large Enterprise Data Warehouses (EDWs). These are large, integrated systems for Business Intelligence and Analytics reporting. Major corporations need these systems to ingest and analyze larger datasets more quickly, increasing company competitiveness through the intelligence gained. Hadoop and Spark are also used to feed MPPs for company-wide reporting (the truly Big Data systems).
MPP databases can have millions or billions of records that users run reports directly against. We have seen over 40 data sources integrated in these databases. Sizing the initial MPP environment and adding capacity once the system is in use can be difficult without professional third-party assistance. Alexicon provides consulting services for big data environments. Our specialties are architecture, governance, enterprise data models, Extract, Transform and Load (ETL), computations, Performance Management (goals/targets) and Process-based structures.
We provide Performance-based Testing to determine the most cost-effective mix of hardware and software licenses for current and future needs. Workload Management is supported by Integrated Reporting and Work Plan Coordination across the entire system.
Complete System Approach
We take a complete system approach by providing integrated reporting and work management methods to ensure the entire system process is well coordinated. Using a Process-based Approach, we apply metrics and charts to understand how queries are processed through the complete system. We focus on System Throughput and performance inhibitors to improve overall performance. Below is a diagram that shows the database and major elements that influence performance:
With the MPP databases at the center of all the action, we focus on loading, administration and user queries to ensure peak periods are resourced correctly or work is moved to better-suited times.
MPP databases are used to provide reasonable query response times on Extreme Data that easily exceeds a billion records. Traditional databases cannot deliver reasonable response times at such record levels. Retailers have typically dealt with large data volumes and have relied on MPP solutions to help analyze these massive record sets. Retail data sets are so large in part because Universal Product Codes (UPCs) are tracked at the individual Stock Keeping Unit (SKU) level, which creates an explosion of Point-of-sale (POS) records. These records and many other internal and external data sources feed MPPs at transaction and/or summary levels so users can query this data with BI and other SQL-based tools. Building the right data model(s) for optimal front-end dashboard and reporting performance remains an art, just as it is with traditional databases. MPPs provide the advantage of holding more data and running much faster, which is why the adoption rate is increasing.
In 2010 Gartner reported that "nearly 70% of data warehouses experience performance constraint issues of various types. These typically affect data warehouses with varying levels of mixed workload, especially those with high query counts..."
Our experience is that large organizations with high query counts need planned and coordinated workload management to provide optimal performance. For global companies, workload management is critical considering different time zones, required loads and user queries.
Integrated Processes and Reporting
Out-of-the-box reports for databases, ETL and BI tools are like typical data mart/warehouse builds for business users. We need integration from multiple source applications for a complete system picture.
Workload Management Areas
- ETL Jobs
- Interactive Refreshes
- Scheduled Report Refreshes
- On-demand and Ad Hoc Reports
- Query Arrival Rate
- Query Backlog
- Query Wait Time
- Query CPU Time
- Memory Consumption
- Query I/O
- Query Throughput
- Group and User Metrics
- DBA Scheduled Backups
The areas above are focused on MPP Databases and are equally applicable to Hadoop and Spark. Without "Workload Analytics" and "Best Practices," bottlenecks often occur with MPPs and cause major performance issues, distractions to the business, and consulting or development costs to resolve. We believe it is better to have the performance visibility that "Workload Analytics" provides, with views by hour, day, week, month and year. MPPs have also become costly assets to manage, and they deserve the attention needed to optimize hourly performance for a lean structure that avoids unnecessary upgrades to hardware/memory or processor and memory licensing.
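As a rough illustration of hourly workload visibility, the sketch below rolls a query log up into a few of the metrics listed above: arrivals, completed queries (throughput), average wait time, CPU time and backlog per hour. It is a minimal sketch and not a description of any particular product. The QueryRecord fields, the hourly_workload function and the sample timestamps are all hypothetical, and a real MPP system would expose similar information through its own system tables under different names.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class QueryRecord:
    # Hypothetical log fields; real MPP system tables name these differently.
    submitted: datetime
    started: datetime
    finished: datetime
    cpu_seconds: float


def hour_of(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its hour."""
    return ts.replace(minute=0, second=0, microsecond=0)


def hourly_workload(records):
    """Summarize arrivals, completions, wait, CPU and backlog per hour.

    `records` must be a list, because it is scanned more than once.
    """
    stats = defaultdict(lambda: {"arrivals": 0, "completed": 0,
                                 "wait_seconds": 0.0, "cpu_seconds": 0.0})
    for r in records:
        # Arrival and wait time are attributed to the submission hour.
        arrived = stats[hour_of(r.submitted)]
        arrived["arrivals"] += 1
        arrived["wait_seconds"] += (r.started - r.submitted).total_seconds()
        # Throughput and CPU time are attributed to the completion hour.
        done = stats[hour_of(r.finished)]
        done["completed"] += 1
        done["cpu_seconds"] += r.cpu_seconds

    summary = {}
    for hour in sorted(stats):
        s = stats[hour]
        hour_end = hour + timedelta(hours=1)
        # Backlog: submitted before the hour ended, still unfinished at that point.
        backlog = sum(1 for r in records
                      if r.submitted < hour_end <= r.finished)
        summary[hour] = {
            "arrivals": s["arrivals"],
            "completed": s["completed"],
            "avg_wait_seconds": (s["wait_seconds"] / s["arrivals"]
                                 if s["arrivals"] else 0.0),
            "cpu_seconds": s["cpu_seconds"],
            "backlog_at_end": backlog,
        }
    return summary


# Made-up sample data: three queries around a 9:00 to 11:00 window.
base = datetime(2024, 1, 1, 9, 0)
sample = [
    QueryRecord(base + timedelta(minutes=5), base + timedelta(minutes=6),
                base + timedelta(minutes=20), 30.0),
    QueryRecord(base + timedelta(minutes=50), base + timedelta(minutes=55),
                base + timedelta(minutes=70), 45.0),
    QueryRecord(base + timedelta(minutes=65), base + timedelta(minutes=66),
                base + timedelta(minutes=80), 10.0),
]
report = hourly_workload(sample)
for hour, metrics in report.items():
    print(hour.isoformat(), metrics)
```

The same roll-up could be grouped by day, week or month by changing how timestamps are truncated. An hour in which backlog and average wait rise together is a candidate for moving scheduled work to a quieter period.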