Data Mgmt.


Popular software: R, Python, Scala, SQL Server, MySQL, Snowflake, HANA, Informatica, Trifacta, Tableau Prep Builder, Oracle, Teradata, PostgreSQL, Hadoop, Spark, Windows, Linux, Google Cloud Platform (GCP) - (Cloud Storage, BigQuery and BigQuery ML), AWS (S3 Data Lake, Glue, Redshift, Athena, EMR and SageMaker) and Azure (Power BI dataflows, Blob Storage, Data Lake Storage Gen1/Gen2, SQL and Data Warehouse).

Data Lakes, Data Warehouses, Data Marts and Data Science

Most Companies are skilled with data integration and analytics. These skills are basic to survival and to ensuring coordinated efforts. Being super-skilled takes experience. This is where we can help, with quicker timelines and results.

Data Warehouses (DWs) exist to provide analytic reporting across two or more data sources (mainly applications and files). Companies can have tens to hundreds of individual software applications running, along with many files (data sources). Those that need better analytic reporting and visibility can be included in the DW. The next level is integrating a data source with other data in the DW for integrated analytics. Enterprise Data Warehouses (EDWs) do this on a large scale and can source from, and store data in, the Data Lake (storage).

Why a Data Warehouse?

  1. View a data source in the Company’s standard Analytic tool(s). This is useful because Users can access the information in a Company-approved analytic tool.
  2. The next level is to integrate the application or file data with other DW data. This is where the value is created and information is gained across two or more applications or files.
  3. Application feeds (ERP, CRM, SCM and many smaller systems) and file feeds (business and technical data) get replaced with newer solutions over time. The data warehouse bridges the change, keeping past company records and likely enhancing future reporting for the new or replacement feed (DW enhancement).
  4. Once all data is sourced and staged, the enterprise analytic or "target data model" becomes the most important design, and there is no one go-to model. This model is also used for data science computations, and the needed structures are considered when designing the enterprise data model. Companies typically build their own data warehouse models with outside assistance when needed.
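As an illustration of what a star-schema "target data model" can look like, here is a minimal sketch using Python's built-in sqlite3 module; all table and column names are hypothetical, not a prescribed design:

```python
import sqlite3

# In-memory database standing in for the warehouse's analytic layer.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes (denormalized for analytics).
cur.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY,
        customer_name TEXT,
        region TEXT
    )""")
cur.execute("""
    CREATE TABLE dim_date (
        date_key INTEGER PRIMARY KEY,
        calendar_date TEXT,
        fiscal_quarter TEXT
    )""")

# The fact table holds the measures, keyed to the dimensions (the "star").
cur.execute("""
    CREATE TABLE fact_sales (
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        date_key INTEGER REFERENCES dim_date(date_key),
        quantity INTEGER,
        sales_amount REAL
    )""")
```

Reports and data science routines would then join the fact table to whichever dimensions a given analysis needs.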

Enterprise Data Lake

Data Lakes are a hot subject now for many businesses. The use of "hyperscaler" clouds like AWS, Azure and Google Cloud is accelerating the movement of data storage and processing to cloud systems. Data lakes allow storage of structured and unstructured data for machine learning, predictive analytics, data discovery, profiling and database access. Typical users are Data Scientists, Business Analysts and Data Engineers performing analysis, statistics, data modeling and direct reporting. Data Lakes also work with and support databases for processing structured analytics. These databases are superfast Massively Parallel Processing (MPP) databases built for very large datasets, with advanced speeds for large query workload processing and user analytic queries. Business and technical users typically access the database.

"A data lake is a system or repository of data stored in its natural/raw format, usually object blobs or files. A data lake is usually a single store of data including raw copies of source system data, sensor data, social data etc., and transformed data used for tasks such as reporting, visualization, advanced analytics and machine learning. A data lake can include structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video). A data lake can be established "on premises" (within an organization's data centers) or "in the cloud" (using cloud services from vendors such as Amazon, Microsoft, or Google). A data swamp is a deteriorated and unmanaged data lake that is either inaccessible to its intended users or is providing little value." - Wikipedia

Alexicon focuses more on Business Process, Analytics, Structuring and Data Modeling for Data Lakes and EDWs, balancing where raw and processed data will reside.
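A minimal sketch of the raw-versus-processed balance in a lake, using the local filesystem in place of cloud object storage; all paths, partition names and records here are hypothetical:

```python
import csv
import json
from datetime import date
from pathlib import Path

# Hypothetical lake root; in practice this would be an S3/Blob/GCS bucket.
lake = Path("/tmp/demo_lake")

# Raw zone: keep source data as-is, partitioned by source and load date.
raw = lake / "raw" / "crm_orders" / f"load_date={date(2024, 1, 15)}"
raw.mkdir(parents=True, exist_ok=True)
(raw / "orders.json").write_text(json.dumps(
    [{"order_id": 1, "region": "West", "amount": 120.0}]))

# Processed zone: cleansed, structured copies ready for the warehouse.
processed = lake / "processed" / "orders"
processed.mkdir(parents=True, exist_ok=True)
records = json.loads((raw / "orders.json").read_text())
with open(processed / "orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "region", "amount"])
    writer.writeheader()
    writer.writerows(records)
```

The raw copy preserves the source system's record exactly; the processed copy is what the EDW and analytic tools consume.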

Large Enterprise Databases

Superfast databases have also become the new norm for analytic systems sourcing from the EDW. These databases combine software and hardware platform components. They are MPP databases used for very large datasets, with advanced speeds for large query workload processing and for user and analytic queries. MPPs are used in most larger businesses for EDWs.

Types of database schema designs (data models):

  1. Third normal form (3NF) - Normalized/Keys
  2. Data Warehouse (Stars and Snowflake) - Denormalized/Keys
  3. Data Mart (Stars and Snowflake) - Denormalized/Keys
  4. Computed Application Data Model - Denormalized/Keys
  5. Computed Tables - Denormalized/Keys/No Keys
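The contrast between the normalized (3NF) design and the denormalized designs in the list above can be sketched as follows; the tables and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 3NF source: customers and orders kept in separate, normalized tables.
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customer VALUES (1, 'Acme', 'West'), (2, 'Globex', 'East')")
cur.execute("INSERT INTO orders VALUES (1, 1, 100.0), (2, 2, 50.0), (3, 1, 25.0)")

# Denormalized analytic table: the join is resolved once, at load time.
cur.execute("""
    CREATE TABLE sales_flat AS
    SELECT o.id AS order_id, c.name, c.region, o.amount
    FROM orders o JOIN customer c ON c.id = o.customer_id
""")

# Analytic queries now read one wide table instead of joining per query.
rows = cur.execute(
    "SELECT region, SUM(amount) FROM sales_flat GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('East', 50.0), ('West', 125.0)]
```

The 3NF form avoids duplication and suits operational writes; the flattened form trades storage for simpler, faster analytic reads.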

Massively parallel is the term for using a large number of computer processors (or separate computers) to simultaneously perform a set of coordinated computations in parallel. - Wikipedia

Data stored in the warehouse is uploaded from operational systems (such as marketing or sales). The data may pass through an operational data store and may require data cleansing for additional operations to ensure data quality before it is used in the DW for reporting. Extract, transform, load (ETL) and extract, load, transform (ELT) are the two main approaches used to build a data warehouse system. - Wikipedia
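The ETL-versus-ELT distinction can be sketched in a few lines; the feed and cleansing rules here are invented for illustration:

```python
# A toy raw feed as it might arrive from a source system.
raw_feed = [
    {"id": "1", "amount": " 100.50 ", "region": "west"},
    {"id": "2", "amount": "20.00", "region": "EAST"},
]

def cleanse(row):
    # Transform: type-cast and standardize values for the warehouse.
    return {"id": int(row["id"]),
            "amount": float(row["amount"].strip()),
            "region": row["region"].strip().title()}

# ETL: transform first, then load only the cleansed rows.
warehouse_etl = [cleanse(r) for r in raw_feed]

# ELT: load raw rows first (e.g. into a staging table), then transform
# inside the target platform afterwards.
staging = list(raw_feed)                        # load as-is
warehouse_elt = [cleanse(r) for r in staging]   # transform later, in place

# Both approaches end with the same cleansed warehouse data.
assert warehouse_etl == warehouse_elt
```

ELT has become common on MPP and cloud platforms precisely because the target database is fast enough to run the transforms itself.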

An integrated and process-based data warehouse considers quantities, dollars and time or duration, which are key for measuring process steps and task time.
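For example, deriving a duration measure from process-step timestamps (the step records below are hypothetical) might look like:

```python
from datetime import datetime

# Hypothetical process-step records with quantity and start/end timestamps.
steps = [
    {"step": "pick", "qty": 10,
     "start": "2024-01-15T08:00:00", "end": "2024-01-15T08:45:00"},
    {"step": "pack", "qty": 10,
     "start": "2024-01-15T08:45:00", "end": "2024-01-15T09:15:00"},
]

# Derive a duration measure (minutes) per step for the fact table.
for s in steps:
    start = datetime.fromisoformat(s["start"])
    end = datetime.fromisoformat(s["end"])
    s["duration_min"] = (end - start).total_seconds() / 60

print([(s["step"], s["duration_min"]) for s in steps])
# [('pick', 45.0), ('pack', 30.0)]
```

Storing the computed duration alongside quantity and dollar measures lets reports analyze process time without recomputing it per query.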

Golden Rule

As datasets have grown, Big Data systems and smaller databases alike need business logic, summarizations and calculations to be handled in the database layer. Many systems handle these important operations in the middle layer (Analytics Server) and/or in the frontend (Visualization layer). Moving to bigger data and faster systems requires different, newer data model builds. Trying to compensate with additional software and hardware purchases can deliver less-than-acceptable performance gains because the existing data models remain slow or "hard to analyze". Alexicon can provide help and advice to avoid performance problems while opening up analysis capabilities as data scales, by pushing Visualization- and Analytics-Server-level work down to the database layer.

Relational Database Management System (RDBMS) / Database

The Database is where all the sorting, filtering, aggregations and computations should be performed. This utilizes the database's computing power and passes requested detail or summarized results to the Analytics Server layer, which passes the resulting data to the Visualization layer. One example is a simple need to aggregate records for five regions across 20 billion base records. If the summarization is done in the database, it produces just five rows, with the region names and accompanying numeric values as columns. The database passes those five rows to the Analytics Server, which renders them in the web visualization layer or passes the data to a client analytics tool (desktop application).
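The five-region example can be sketched with SQLite standing in for the warehouse database; the table, regions and row counts are invented, scaled down from the 20 billion rows in the text:

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE fact_sales (region TEXT, amount REAL)")

# A small stand-in for the billions of base records described above.
regions = ["North", "South", "East", "West", "Central"]
random.seed(0)
cur.executemany("INSERT INTO fact_sales VALUES (?, ?)",
                [(random.choice(regions), 10.0) for _ in range(10_000)])

# The database does the heavy lifting: 10,000 rows in, five rows out.
summary = cur.execute(
    "SELECT region, SUM(amount) FROM fact_sales GROUP BY region"
).fetchall()
print(len(summary))  # 5 — one row per region goes to the Analytics Server
```

Only the five summary rows cross the wire; the detail records never leave the database.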

While this is an extreme case to produce five rows of data, these types of aggregations are critical for databases to handle (big or small). Typically, the Analytics Server is overworked because this basic rule is not followed. While 20 billion rows of data could stress an Analytics Server, many systems still run large aggregations in the Analytics layer and sometimes even in the Visualization layer (aggregating detail records to produce summary data). Sometimes Users do this to integrate data on the frontend (better done in the DW). With Big Data, the weakness of this model becomes painfully clear as data grows: Analytics Servers start failing as data size and velocity increase. This is why data manipulations and computations need to be done in the database where possible.

The "Data Model"

Data Warehouses depend on the right overall or enterprise data model to produce data structures for reports, dashboards and visualizations. As data Volume, Variety and Velocity (the 3V's) continue to increase, more advanced ways of ingesting, storing and providing data structures for analytic tools are available. Cloud solutions have added to the ease of scaling analytic systems, and with their added “Big Compute” abilities, clouds are even more attractive. Alexicon focuses on the overall enterprise data model, indicators, sub data models, summaries and detail records. The goal is to drill from top levels as quickly as possible to any level in the system with accuracy.

Experience Curve

Alexicon brings more than 20 years of experience helping customers with progressive enterprise data management solutions across a variety of industries and functional company areas.